# **Smartphone Sensor Data-Based Motion Classification**
# Step 2: Basic Modeling


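## 0. Mission

* Data preprocessing
    * Perform the necessary preprocessing: dummy encoding, data splitting, NaN checks and handling, scaling, and so on
* Build classification models with a variety of algorithms
    * Apply at least four algorithms
    * Compare performance
    * Create a separate DataFrame that stores each model's performance and use it for the comparison
* Optional: the following items are optional. Do them as time permits.
    * Select the top N features, then model and compare performance
        * Modeling does not always need every feature.
        * Select the top N features by feature importance, build a model on them, and compare its performance with the other models.
        * To choose N, add features one at a time, refitting and validating at each step, until you find a suitable cutoff.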

## 1. Environment setup

### (1) Import libraries

* Detailed requirements
    - The code already imports the basic libraries that are needed.
    - Add any other libraries you judge necessary.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Add any other libraries you judge necessary.




* Create a helper function

In [2]:
# Sort feature importances and optionally plot them
def plot_feature_importance(importance, names, result_only = False, topn = 'all'):
    feature_importance = np.array(importance)
    feature_name = np.array(names)

    data={'feature_name':feature_name,'feature_importance':feature_importance}
    fi_temp = pd.DataFrame(data)

    # Sort features by importance, highest first
    fi_temp.sort_values(by=['feature_importance'], ascending=False,inplace=True)
    fi_temp.reset_index(drop=True, inplace = True)

    # Keep every feature, or only the top `topn` rows
    if topn == 'all' :
        fi_df = fi_temp.copy()
    else :
        fi_df = fi_temp.iloc[:topn]

    # Draw a bar chart of the importances unless only the table is requested
    if not result_only :
        plt.figure(figsize=(10,20))
        sns.barplot(x='feature_importance', y='feature_name', data = fi_df)

        plt.xlabel('importance')
        plt.ylabel('feature name')
        plt.grid()

    return fi_df

### (2) Load the data

* Provided dataset
    * data01_train.csv : for training and validation
* Detailed requirements
    - Load the full dataset 'data01_train.csv' and store it as 'data'.
        - Drop the variable subject from data.
    - Check basic information about the DataFrame (.head(), .shape, etc.).

#### 1) Data loading

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
data = pd.read_csv("/content/drive/MyDrive/2023.04.12_미니프로젝트5차_3_5일차 실습자료/data01_train.csv")

In [126]:
fi = pd.read_csv("/content/drive/MyDrive/2023.04.12_미니프로젝트5차_3_5일차 실습자료/fi.csv")

In [5]:
data.drop('subject', axis=1, inplace=True)

In [7]:
data.head()

Unnamed: 0,tBodyAcc-mean()-X,tBodyAcc-mean()-Y,tBodyAcc-mean()-Z,tBodyAcc-std()-X,tBodyAcc-std()-Y,tBodyAcc-std()-Z,tBodyAcc-mad()-X,tBodyAcc-mad()-Y,tBodyAcc-mad()-Z,tBodyAcc-max()-X,...,fBodyBodyGyroJerkMag-skewness(),fBodyBodyGyroJerkMag-kurtosis(),"angle(tBodyAccMean,gravity)","angle(tBodyAccJerkMean),gravityMean)","angle(tBodyGyroMean,gravityMean)","angle(tBodyGyroJerkMean,gravityMean)","angle(X,gravityMean)","angle(Y,gravityMean)","angle(Z,gravityMean)",Activity
0,0.288508,-0.009196,-0.103362,-0.988986,-0.962797,-0.967422,-0.989,-0.962596,-0.96565,-0.929747,...,-0.487737,-0.816696,-0.042494,-0.044218,0.307873,0.07279,-0.60112,0.331298,0.165163,STANDING
1,0.265757,-0.016576,-0.098163,-0.989551,-0.994636,-0.987435,-0.990189,-0.99387,-0.987558,-0.937337,...,-0.23782,-0.693515,-0.062899,0.388459,-0.765014,0.771524,0.345205,-0.769186,-0.147944,LAYING
2,0.278709,-0.014511,-0.108717,-0.99772,-0.981088,-0.994008,-0.997934,-0.982187,-0.995017,-0.942584,...,-0.535287,-0.829311,0.000265,-0.525022,-0.891875,0.021528,-0.833564,0.202434,-0.032755,STANDING
3,0.289795,-0.035536,-0.150354,-0.231727,-0.006412,-0.338117,-0.273557,0.014245,-0.347916,0.008288,...,-0.004012,-0.408956,-0.255125,0.612804,0.747381,-0.072944,-0.695819,0.287154,0.111388,WALKING
4,0.394807,0.034098,0.091229,0.088489,-0.106636,-0.388502,-0.010469,-0.10968,-0.346372,0.584131,...,-0.157832,-0.563437,-0.044344,-0.845268,-0.97465,-0.887846,-0.705029,0.264952,0.137758,WALKING_DOWNSTAIRS


In [8]:
data.shape

(5881, 562)

#### 2) Basic information

## **2. Data Preprocessing**

* Perform the necessary preprocessing: dummy encoding, data splitting, NaN checks and handling, scaling, and so on.
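The NaN check listed above can be sketched as follows. This is a minimal illustration on a made-up frame (`toy`), not the project data; the column names and the median-fill treatment are assumptions chosen only to show the pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the sensor features
toy = pd.DataFrame({
    "f1": [0.1, np.nan, 0.3, 0.4],
    "f2": [1.0, 2.0, 3.0, 4.0],
    "Activity": ["WALKING", "LAYING", "WALKING", "SITTING"],
})

# Count missing values per column and keep only the columns that have any
nan_counts = toy.isna().sum()
cols_with_nan = nan_counts[nan_counts > 0]
print(cols_with_nan)

# One possible treatment (an assumption): fill numeric NaNs with the column median
numeric_cols = toy.select_dtypes(include="number").columns
toy_filled = toy.copy()
toy_filled[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())
print(toy_filled.isna().sum().sum())
```

If `cols_with_nan` comes back empty on the real data, no treatment is needed and this step can be recorded as checked.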


### (1) Data split 1: x, y

* Detailed requirements
    - Split the data into x and y.

In [9]:
x = data.drop('Activity', axis=1)
y = data.loc[:,'Activity']

In [10]:
x.shape, y.shape

((5881, 561), (5881,))

### (2) Scaling (if needed)


* Detailed requirements
    - Run this code so that algorithms that require scaled inputs can be used.
    - Use either min-max or standard scaling.
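As a rough sketch of the two options, the block below applies both scalers to made-up arrays (`X_tr`, `X_va` are assumptions, not the project data). In both cases the scaler is fitted on the training split only and then reused on the validation split.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
X_tr = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # made-up training features
X_va = rng.normal(loc=5.0, scale=2.0, size=(25, 3))   # made-up validation features

# Standard scaling: mean and std are learned from the training split only
std_scaler = StandardScaler().fit(X_tr)
X_tr_std = std_scaler.transform(X_tr)
X_va_std = std_scaler.transform(X_va)

# Min-max scaling: min and max are learned from the training split only
mm_scaler = MinMaxScaler().fit(X_tr)
X_tr_mm = mm_scaler.transform(X_tr)
X_va_mm = mm_scaler.transform(X_va)

print(X_tr_std.mean(axis=0).round(3), X_tr_std.std(axis=0).round(3))
print(X_tr_mm.min(axis=0), X_tr_mm.max(axis=0))
```

Validation values can fall slightly outside [0, 1] after min-max scaling, because the range comes from the training data.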

In [19]:
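from sklearn.preprocessing import StandardScaler

# Run this cell after the train/validation split in (3): x_train and x_val must already exist.
# The scaler is fitted on the training split only and then applied to both splits.
scaler = StandardScaler()
scaler.fit(x_train)
x_train_s = scaler.transform(x_train)
x_val_s = scaler.transform(x_val)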

### (3) Data split 2: train, validation

* Detailed requirements
    - train : val = 8 : 2 or 7 : 3
    - Use the random_state option so that results are reproducible and comparable across models.
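The effect of `random_state` can be checked with a small sketch on made-up arrays (`X_demo`, `y_demo` and the helper `split` are hypothetical). The `stratify` argument is an optional extra that keeps class proportions equal across the two splits; the requirements above do not ask for it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(100).reshape(50, 2)   # made-up features
y_demo = np.array([0, 1] * 25)           # made-up balanced labels

def split(seed):
    """Return an 8:2 stratified split for the given seed."""
    return train_test_split(X_demo, y_demo, test_size=0.2,
                            random_state=seed, stratify=y_demo)

a_train, a_val, a_ytrain, a_yval = split(seed=1)
b_train, b_val, b_ytrain, b_yval = split(seed=1)

print(a_train.shape, a_val.shape)
print(np.array_equal(a_val, b_val))  # same seed, same split
```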

In [17]:
from sklearn.model_selection import train_test_split
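# A fixed random_state makes the split reproducible, as the requirements ask (the seed value itself is arbitrary)
x_train ,x_val, y_train, y_val = train_test_split(x, y, test_size = 0.2, random_state = 1)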

In [18]:
x_train.shape, x_val.shape, y_train.shape, y_val.shape

((4704, 561), (1177, 561), (4704,), (1177,))

In [14]:
x_train

Unnamed: 0,tBodyAcc-mean()-X,tBodyAcc-mean()-Y,tBodyAcc-mean()-Z,tBodyAcc-std()-X,tBodyAcc-std()-Y,tBodyAcc-std()-Z,tBodyAcc-mad()-X,tBodyAcc-mad()-Y,tBodyAcc-mad()-Z,tBodyAcc-max()-X,...,fBodyBodyGyroJerkMag-meanFreq(),fBodyBodyGyroJerkMag-skewness(),fBodyBodyGyroJerkMag-kurtosis(),"angle(tBodyAccMean,gravity)","angle(tBodyAccJerkMean),gravityMean)","angle(tBodyGyroMean,gravityMean)","angle(tBodyGyroJerkMean,gravityMean)","angle(X,gravityMean)","angle(Y,gravityMean)","angle(Z,gravityMean)"
4774,0.376920,-0.032645,-0.064039,0.289539,0.053635,-0.471835,0.272506,0.016392,-0.492155,0.434334,...,0.419752,-0.567567,-0.846305,-0.582829,0.232542,-0.961077,-0.351361,-0.758606,0.260774,0.055919
5539,0.238701,0.006739,-0.039181,0.144792,0.103281,0.274856,0.052146,0.079981,0.177332,0.576332,...,0.098880,-0.304029,-0.731637,0.341869,0.703404,-0.352313,0.410013,-0.774703,0.169565,0.162278
4131,0.281186,-0.016442,-0.105301,-0.986643,-0.975752,-0.937280,-0.987881,-0.976540,-0.932300,-0.931658,...,-0.026305,-0.428275,-0.831533,-0.216300,0.001455,0.630236,0.616058,0.598543,-0.948139,-0.041162
4439,0.291252,-0.007155,-0.071670,-0.981788,-0.951306,-0.888608,-0.984762,-0.955647,-0.875912,-0.900791,...,0.008509,-0.474763,-0.812534,0.026308,0.162407,-0.160792,-0.230637,-0.757341,0.160185,0.180606
4988,0.277851,-0.016398,-0.113465,-0.973787,-0.981209,-0.991203,-0.974319,-0.988913,-0.994028,-0.919364,...,-0.462122,-0.332268,-0.795195,0.332095,-0.255708,0.262056,-0.563233,0.837550,-0.464479,-0.502288
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
71,0.339080,-0.016085,-0.106023,-0.252613,-0.028976,-0.302370,-0.322871,-0.123665,-0.365803,0.119574,...,0.051267,-0.339116,-0.677347,-0.689925,-0.674171,-0.948016,0.108591,-0.582349,0.321003,0.202207
3667,0.249714,-0.056886,-0.130803,-0.277175,0.125114,0.082175,-0.333570,0.008973,0.108521,-0.086423,...,0.134790,-0.285846,-0.693693,-0.045776,-0.754565,-0.765198,0.279643,-0.564108,0.244522,0.288078
4761,0.380831,0.028296,-0.051711,0.061823,0.039712,-0.169901,0.036758,-0.065920,-0.139151,0.373524,...,0.032993,-0.301935,-0.722696,-0.355907,0.831396,-0.872199,0.496593,-0.893133,0.163480,-0.018540
5549,0.252981,-0.048451,0.022727,-0.400391,0.044912,-0.110975,-0.475616,-0.014197,0.009488,-0.029802,...,0.101919,-0.612107,-0.892213,0.060718,0.310241,-0.804300,-0.082584,-0.767021,0.253142,0.062616


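## **3. Basic Modeling**



* Detailed requirements
    - Apply at least four algorithms.
    - For each algorithm, build a model on all features and a model on the selected top N features, then compare performance.
    - (Optional) For one or two of the algorithms, select the top N features by importance, build a model, and compare its performance with the other models.
        * To choose N, add features one at a time, refitting and validating at each step, until you find a suitable cutoff.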
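The "add one feature at a time" procedure could look roughly like the sketch below. It runs on synthetic data from `make_classification`, and the names `ranker`, `order`, `scores` and `best_n` are illustrative assumptions. Random forest is used both for ranking and for evaluation only to keep the example short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the sensor data
X_demo, y_demo = make_classification(n_samples=400, n_features=15,
                                     n_informative=5, n_redundant=0,
                                     random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=0)

# Rank features by importance, using the training split only
ranker = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
order = np.argsort(ranker.feature_importances_)[::-1]

# Grow the feature set one feature at a time and record validation accuracy
scores = []
for n in range(1, len(order) + 1):
    cols = order[:n]
    m = RandomForestClassifier(n_estimators=50, random_state=0)
    m.fit(X_tr[:, cols], y_tr)
    scores.append(accuracy_score(y_va, m.predict(X_va[:, cols])))

# Smallest N that reaches the best validation accuracy
best_n = int(np.argmax(scores)) + 1
print(best_n, scores[best_n - 1])
```

Plotting `scores` against N usually makes the cutoff easier to judge than the single best value.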

### (1) Algorithm 1:

In [96]:
# Imports
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Declare the model
model = DecisionTreeClassifier()

# Train
model.fit(x_train, y_train)

# Predict
dt_pred = model.predict(x_val)

# Evaluate
print(confusion_matrix(y_val, dt_pred))
print(classification_report(y_val, dt_pred))

# Collect the score
result = {}
result['Decision Tree'] = accuracy_score(y_val, dt_pred)

[[222   0   0   0   0   0]
 [  0 195  24   0   0   0]
 [  0  27 184   0   0   0]
 [  0   0   0 174   3   6]
 [  0   0   0   4 149  10]
 [  0   0   1  10   7 161]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       222
           1       0.88      0.89      0.88       219
           2       0.88      0.87      0.88       211
           3       0.93      0.95      0.94       183
           4       0.94      0.91      0.93       163
           5       0.91      0.90      0.90       179

    accuracy                           0.92      1177
   macro avg       0.92      0.92      0.92      1177
weighted avg       0.92      0.92      0.92      1177



In [97]:
# Imports
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Declare the model
model = KNeighborsClassifier()

# Train (KNN is distance-based, so the scaled features are used)
model.fit(x_train_s, y_train)

# Predict
kn_pred = model.predict(x_val_s)

# Evaluate
print(confusion_matrix(y_val, kn_pred))
print(classification_report(y_val, kn_pred))

# Collect the score
result['KNN'] = accuracy_score(y_val, kn_pred)

[[219   3   0   0   0   0]
 [  0 192  27   0   0   0]
 [  0  14 197   0   0   0]
 [  0   0   0 182   0   1]
 [  0   0   0   3 158   2]
 [  0   0   0   5   1 173]]
              precision    recall  f1-score   support

           0       1.00      0.99      0.99       222
           1       0.92      0.88      0.90       219
           2       0.88      0.93      0.91       211
           3       0.96      0.99      0.98       183
           4       0.99      0.97      0.98       163
           5       0.98      0.97      0.97       179

    accuracy                           0.95      1177
   macro avg       0.96      0.95      0.95      1177
weighted avg       0.95      0.95      0.95      1177



In [98]:
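# Imports
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Declare the model
model = LogisticRegression(max_iter=1000)

# Train (on the unscaled features; the convergence warning in the output suggests scaling or a higher max_iter)
model.fit(x_train, y_train)

# Predict
lr_pred = model.predict(x_val)

# Evaluate
print(confusion_matrix(y_val, lr_pred))
print(classification_report(y_val, lr_pred))

# Collect the score
result['Logistic Regression'] = accuracy_score(y_val, lr_pred)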

[[222   0   0   0   0   0]
 [  0 208  11   0   0   0]
 [  0   6 205   0   0   0]
 [  0   0   0 182   0   1]
 [  0   0   0   0 163   0]
 [  0   0   0   2   0 177]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       222
           1       0.97      0.95      0.96       219
           2       0.95      0.97      0.96       211
           3       0.99      0.99      0.99       183
           4       1.00      1.00      1.00       163
           5       0.99      0.99      0.99       179

    accuracy                           0.98      1177
   macro avg       0.98      0.98      0.98      1177
weighted avg       0.98      0.98      0.98      1177



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
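The warning above names two remedies: more iterations or scaled inputs. A minimal sketch of the scaling route, on synthetic data and with the hypothetical name `lr_pipe`, wraps the scaler and the classifier in one pipeline so that the scaler is always fitted on whatever data the model is trained on.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic multiclass data standing in for the sensor features
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=6, n_classes=3,
                                     random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=0)

# Scaling happens inside the pipeline, before the logistic regression
lr_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
lr_pipe.fit(X_tr, y_tr)

val_acc = lr_pipe.score(X_va, y_va)
n_iter = lr_pipe.named_steps["logisticregression"].n_iter_
print(val_acc, n_iter)
```

Comparing `n_iter` against `max_iter` shows whether the solver stopped on its own or hit the limit.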


In [99]:
# Imports
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Declare and train the model
rf_dynamic = RandomForestClassifier(random_state=0, max_depth=8, min_samples_leaf=8, min_samples_split=8, n_estimators=100)
rf_dynamic.fit(x_train, y_train)

# Predict and evaluate
rf_pred=rf_dynamic.predict(x_val)
print(confusion_matrix(y_val, rf_pred))
print(classification_report(y_val, rf_pred))

# Collect the score
result['RandomForestClassifier'] = accuracy_score(y_val,rf_pred)

[[222   0   0   0   0   0]
 [  0 204  15   0   0   0]
 [  0   9 202   0   0   0]
 [  0   0   0 178   2   3]
 [  0   0   0   3 159   1]
 [  0   0   0   3   5 171]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       222
           1       0.96      0.93      0.94       219
           2       0.93      0.96      0.94       211
           3       0.97      0.97      0.97       183
           4       0.96      0.98      0.97       163
           5       0.98      0.96      0.97       179

    accuracy                           0.97      1177
   macro avg       0.97      0.97      0.97      1177
weighted avg       0.97      0.97      0.97      1177



In [100]:
# Imports
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import StackingClassifier

from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Declare the base estimators (KNN gets its own scaler inside a pipeline)
estimators = [('dt', DecisionTreeClassifier()),
              ('knn', make_pipeline(MinMaxScaler(), KNeighborsClassifier())),
              ('lr', LogisticRegression(max_iter=1000)),
              ('rf', RandomForestClassifier())]

# Declare the stacking model with a random forest as the final estimator
model = StackingClassifier(estimators=estimators,
                           final_estimator= RandomForestClassifier())

# Train
model.fit(x_train, y_train)

# Predict
y_pred = model.predict(x_val)

# Evaluate
print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred))

# Collect the score
result['Stacking'] = accuracy_score(y_val, y_pred)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

[[222   0   0   0   0   0]
 [  0 211   8   0   0   0]
 [  0   5 206   0   0   0]
 [  0   0   0 182   0   1]
 [  0   0   0   0 163   0]
 [  0   0   0   0   0 179]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       222
           1       0.98      0.96      0.97       219
           2       0.96      0.98      0.97       211
           3       1.00      0.99      1.00       183
           4       1.00      1.00      1.00       163
           5       0.99      1.00      1.00       179

    accuracy                           0.99      1177
   macro avg       0.99      0.99      0.99      1177
weighted avg       0.99      0.99      0.99      1177



In [None]:
# Build a DataFrame of the collected scores
df = pd.DataFrame.from_dict(result, orient='index', columns=['score'])
df.sort_values(by='score', ascending=True, inplace=True)

# Compare performance
plt.figure(figsize=(8, 5))
plt.barh(y=df.index, width=df['score'])
plt.xlabel('Score')
plt.ylabel('Model')
plt.show()

### (2) Algorithm 2:

In [102]:
!pip install catboost

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [103]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

# Fit the encoder on the training labels only, then reuse the same mapping for validation
y_train = encoder.fit_transform(y_train)
y_val = encoder.transform(y_val)
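Fitting the encoder once and reusing it keeps the label-to-integer mapping identical across splits. The sketch below uses made-up label arrays (`y_tr_demo`, `y_va_demo`) to show what happens when a split is missing some classes and the encoder is refitted on it.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

y_tr_demo = np.array(["STANDING", "LAYING", "WALKING", "SITTING", "LAYING"])
y_va_demo = np.array(["WALKING", "STANDING"])  # does not contain every class

# Fit once on the training labels, then reuse the mapping
enc = LabelEncoder().fit(y_tr_demo)
y_tr_enc = enc.transform(y_tr_demo)
y_va_enc = enc.transform(y_va_demo)

# Refitting on the validation labels alone yields a different mapping
y_va_refit = LabelEncoder().fit_transform(y_va_demo)

print(enc.classes_)
print(y_va_enc, y_va_refit)
print(enc.inverse_transform(y_va_enc))  # back to the original names
```

`inverse_transform` is handy for printing confusion matrices or reports with activity names instead of integers.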

In [77]:
x_train

Unnamed: 0,tBodyAcc-mean()-X,tBodyAcc-mean()-Y,tBodyAcc-mean()-Z,tBodyAcc-std()-X,tBodyAcc-std()-Y,tBodyAcc-std()-Z,tBodyAcc-mad()-X,tBodyAcc-mad()-Y,tBodyAcc-mad()-Z,tBodyAcc-max()-X,...,fBodyBodyGyroJerkMag-meanFreq(),fBodyBodyGyroJerkMag-skewness(),fBodyBodyGyroJerkMag-kurtosis(),"angle(tBodyAccMean,gravity)","angle(tBodyAccJerkMean),gravityMean)","angle(tBodyGyroMean,gravityMean)","angle(tBodyGyroJerkMean,gravityMean)","angle(X,gravityMean)","angle(Y,gravityMean)","angle(Z,gravityMean)"
200,0.218624,-0.045571,-0.065077,-0.378867,-0.238143,-0.122126,-0.438799,-0.277910,-0.146274,-0.118095,...,-0.086319,-0.438741,-0.729491,0.303799,0.905409,-0.986938,0.121822,-0.836778,0.200953,0.066249
5409,0.282593,-0.014729,-0.113073,-0.988792,-0.967902,-0.979663,-0.990011,-0.972861,-0.982064,-0.932500,...,0.058531,0.041729,-0.351394,-0.030502,-0.239791,-0.710071,0.513002,0.594198,-0.576566,-0.427567
1550,0.280813,-0.015690,-0.104579,-0.995386,-0.981972,-0.977796,-0.996066,-0.982110,-0.977154,-0.936665,...,0.557163,-0.844663,-0.987544,-0.071045,-0.213212,-0.066830,0.677112,-0.655946,0.296805,0.150275
1122,0.266870,-0.045472,-0.130040,-0.189596,-0.010933,-0.176226,-0.213362,-0.099327,-0.236046,-0.048083,...,0.090192,-0.147144,-0.586538,0.057979,0.717316,-0.873864,0.713649,-0.840219,0.174623,-0.065511
2709,0.279542,-0.016839,-0.107701,-0.949328,-0.974293,-0.982174,-0.948889,-0.979773,-0.981425,-0.902992,...,-0.628034,-0.260082,-0.752148,-0.181435,0.839043,-0.270803,-0.797416,0.860630,-0.434572,-0.525496
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3092,0.272065,-0.018295,-0.109208,-0.968129,-0.991483,-0.989518,-0.969156,-0.990928,-0.988891,-0.916608,...,0.300028,-0.461185,-0.794380,0.187249,-0.251382,-0.195202,0.642971,0.734735,-0.215885,-0.784570
2903,0.303456,-0.021978,-0.170100,-0.457049,-0.240353,-0.082992,-0.491212,-0.251179,0.001873,-0.186569,...,-0.373179,-0.329502,-0.697240,-0.179463,0.095538,-0.247759,0.135125,-0.821572,0.220829,0.031049
854,0.275527,-0.013269,-0.119917,-0.989635,-0.951600,-0.960183,-0.991191,-0.949449,-0.964220,-0.933929,...,0.151049,-0.306094,-0.603643,0.002691,0.252106,0.562008,-0.390594,-0.848499,0.132405,0.122006
619,0.271697,-0.014982,-0.111099,-0.996959,-0.986717,-0.985266,-0.997440,-0.989302,-0.987102,-0.945599,...,-0.141914,-0.350359,-0.748195,0.182874,0.827303,-0.417557,0.667427,-0.574599,-0.157621,0.168344


In [81]:
import re
x_train_r = x_train.rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x))
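The rename above strips every character that is not a letter, digit, or underscore, presumably because some LightGBM versions reject feature names containing special characters. A small sketch on three of the column names shown earlier illustrates the result; the helper `clean_name` and the duplicate check are additions for illustration.

```python
import re
import pandas as pd

def clean_name(name):
    """Drop every character that is not a letter, digit, or underscore."""
    return re.sub(r"[^A-Za-z0-9_]+", "", name)

demo = pd.DataFrame(columns=["tBodyAcc-mean()-X",
                             "angle(X,gravityMean)",
                             "angle(tBodyAccJerkMean),gravityMean)"])
demo_r = demo.rename(columns=clean_name)
print(list(demo_r.columns))

# Stripping characters can make two different names collide, so check for duplicates
has_duplicates = bool(demo_r.columns.duplicated().any())
print(has_duplicates)
```

The same function should be applied to every frame passed to the model (training and validation) so that the feature names match.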

In [82]:
x_train_r

Unnamed: 0,tBodyAccmeanX,tBodyAccmeanY,tBodyAccmeanZ,tBodyAccstdX,tBodyAccstdY,tBodyAccstdZ,tBodyAccmadX,tBodyAccmadY,tBodyAccmadZ,tBodyAccmaxX,...,fBodyBodyGyroJerkMagmeanFreq,fBodyBodyGyroJerkMagskewness,fBodyBodyGyroJerkMagkurtosis,angletBodyAccMeangravity,angletBodyAccJerkMeangravityMean,angletBodyGyroMeangravityMean,angletBodyGyroJerkMeangravityMean,angleXgravityMean,angleYgravityMean,angleZgravityMean
200,0.218624,-0.045571,-0.065077,-0.378867,-0.238143,-0.122126,-0.438799,-0.277910,-0.146274,-0.118095,...,-0.086319,-0.438741,-0.729491,0.303799,0.905409,-0.986938,0.121822,-0.836778,0.200953,0.066249
5409,0.282593,-0.014729,-0.113073,-0.988792,-0.967902,-0.979663,-0.990011,-0.972861,-0.982064,-0.932500,...,0.058531,0.041729,-0.351394,-0.030502,-0.239791,-0.710071,0.513002,0.594198,-0.576566,-0.427567
1550,0.280813,-0.015690,-0.104579,-0.995386,-0.981972,-0.977796,-0.996066,-0.982110,-0.977154,-0.936665,...,0.557163,-0.844663,-0.987544,-0.071045,-0.213212,-0.066830,0.677112,-0.655946,0.296805,0.150275
1122,0.266870,-0.045472,-0.130040,-0.189596,-0.010933,-0.176226,-0.213362,-0.099327,-0.236046,-0.048083,...,0.090192,-0.147144,-0.586538,0.057979,0.717316,-0.873864,0.713649,-0.840219,0.174623,-0.065511
2709,0.279542,-0.016839,-0.107701,-0.949328,-0.974293,-0.982174,-0.948889,-0.979773,-0.981425,-0.902992,...,-0.628034,-0.260082,-0.752148,-0.181435,0.839043,-0.270803,-0.797416,0.860630,-0.434572,-0.525496
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3092,0.272065,-0.018295,-0.109208,-0.968129,-0.991483,-0.989518,-0.969156,-0.990928,-0.988891,-0.916608,...,0.300028,-0.461185,-0.794380,0.187249,-0.251382,-0.195202,0.642971,0.734735,-0.215885,-0.784570
2903,0.303456,-0.021978,-0.170100,-0.457049,-0.240353,-0.082992,-0.491212,-0.251179,0.001873,-0.186569,...,-0.373179,-0.329502,-0.697240,-0.179463,0.095538,-0.247759,0.135125,-0.821572,0.220829,0.031049
854,0.275527,-0.013269,-0.119917,-0.989635,-0.951600,-0.960183,-0.991191,-0.949449,-0.964220,-0.933929,...,0.151049,-0.306094,-0.603643,0.002691,0.252106,0.562008,-0.390594,-0.848499,0.132405,0.122006
619,0.271697,-0.014982,-0.111099,-0.996959,-0.986717,-0.985266,-0.997440,-0.989302,-0.987102,-0.945599,...,-0.141914,-0.350359,-0.748195,0.182874,0.827303,-0.417557,0.667427,-0.574599,-0.157621,0.168344


In [104]:
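# Imports
import re
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Declare the model
model = LGBMClassifier()

# Train
model.fit(x_train_r, y_train)

# Predict (clean the validation column names the same way as x_train_r so the feature names match)
x_val_r = x_val.rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x))
LGBM_pred = model.predict(x_val_r)

# Evaluate
print(confusion_matrix(y_val, LGBM_pred))
print(classification_report(y_val, LGBM_pred))

# Collect the score
result['LightGBM'] = accuracy_score(y_val, LGBM_pred)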

[[222   0   0   0   0   0]
 [  0 214   5   0   0   0]
 [  0   3 208   0   0   0]
 [  0   0   0 182   0   1]
 [  0   0   0   0 162   1]
 [  0   0   0   1   0 178]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       222
           1       0.99      0.98      0.98       219
           2       0.98      0.99      0.98       211
           3       0.99      0.99      0.99       183
           4       1.00      0.99      1.00       163
           5       0.99      0.99      0.99       179

    accuracy                           0.99      1177
   macro avg       0.99      0.99      0.99      1177
weighted avg       0.99      0.99      0.99      1177



### (3) Algorithm 3:

catboost

In [106]:
from catboost import CatBoostClassifier, Pool, cv
train_pool = Pool(x_train,y_train)
eval_pool = Pool(x_val , y_val )

model = CatBoostClassifier()
model.fit(train_pool, eval_set=eval_pool)
# Predict
catb_pred = model.predict(x_val)

# Evaluate
print(confusion_matrix(y_val, catb_pred))
print(classification_report(y_val, catb_pred))

# Collect the score
result['CatBoostClassifier'] = accuracy_score(y_val, catb_pred)

Learning rate set to 0.111372
0:	learn: 1.4332317	test: 1.4295410	best: 1.4295410 (0)	total: 1.41s	remaining: 23m 31s
1:	learn: 1.2202365	test: 1.2208866	best: 1.2208866 (1)	total: 2.18s	remaining: 18m 9s
2:	learn: 1.0604969	test: 1.0640774	best: 1.0640774 (2)	total: 2.75s	remaining: 15m 12s
3:	learn: 0.9368019	test: 0.9394632	best: 0.9394632 (3)	total: 3.31s	remaining: 13m 45s
4:	learn: 0.8346739	test: 0.8390969	best: 0.8390969 (4)	total: 3.87s	remaining: 12m 50s
5:	learn: 0.7520325	test: 0.7581689	best: 0.7581689 (5)	total: 4.43s	remaining: 12m 13s
6:	learn: 0.6826697	test: 0.6889811	best: 0.6889811 (6)	total: 5.01s	remaining: 11m 50s
7:	learn: 0.6212599	test: 0.6290619	best: 0.6290619 (7)	total: 5.56s	remaining: 11m 29s
8:	learn: 0.5689720	test: 0.5787820	best: 0.5787820 (8)	total: 6.14s	remaining: 11m 15s
9:	learn: 0.5287156	test: 0.5387822	best: 0.5387822 (9)	total: 6.69s	remaining: 11m 2s
10:	learn: 0.4900748	test: 0.5018150	best: 0.5018150 (10)	total: 7.3s	remaining: 10m 56s
11:

### (4) Algorithm 4:

xgboost

In [107]:
from xgboost.sklearn import XGBClassifier
model = XGBClassifier(n_estimators=400)
model.fit(x_train, y_train)
# Predict
XGBC_pred = model.predict(x_val)

# Evaluate
print(confusion_matrix(y_val, XGBC_pred))
print(classification_report(y_val, XGBC_pred))

# Collect the score
result['XGBClassifier'] = accuracy_score(y_val, XGBC_pred)

[[222   0   0   0   0   0]
 [  0 215   4   0   0   0]
 [  0   3 208   0   0   0]
 [  0   0   0 182   0   1]
 [  0   0   0   0 162   1]
 [  0   0   0   0   0 179]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       222
           1       0.99      0.98      0.98       219
           2       0.98      0.99      0.98       211
           3       1.00      0.99      1.00       183
           4       1.00      0.99      1.00       163
           5       0.99      1.00      0.99       179

    accuracy                           0.99      1177
   macro avg       0.99      0.99      0.99      1177
weighted avg       0.99      0.99      0.99      1177



In [114]:
!pip install pycaret

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pycaret
  Downloading pycaret-3.0.0-py3-none-any.whl (481 kB)
Collecting kaleido>=0.2.1
  Downloading kaleido-0.2.1-py2.py3-none-manylinux1_x86_64.whl (79.9 MB)
Collecting pmdarima!=1.8.1,<3.0.0,>=1.8.0
  Downloading pmdarima-2.0.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl (1.9 MB)
Collecting category-encoders>=2.4.0
  Downloading category_encoders-2.6.0-py2.py3-none-any.whl (81 kB)

In [116]:
from pycaret.classification import *

reg_test_1 = setup(data=data,
                   target='Activity',
                   train_size= 0.8,
                   fold=5)

best = compare_models(sort='Accuracy')

model = tune_model(best)

x=data.drop('Activity',axis=1,inplace=False)
y=data['Activity']

x_train,x_val,y_train,y_val=train_test_split(x,y,test_size=0.2,random_state=56456)

model.fit(x_train,y_train)

y_pred=model.predict(x_val)

print(accuracy_score(y_val, y_pred))

Unnamed: 0,Description,Value
0,Session id,7770
1,Target,Activity
2,Target type,Multiclass
3,Target mapping,"LAYING: 0, SITTING: 1, STANDING: 2, WALKING: 3, WALKING_DOWNSTAIRS: 4, WALKING_UPSTAIRS: 5"
4,Original data shape,"(5881, 562)"
5,Transformed data shape,"(5881, 562)"
6,Transformed train set shape,"(4704, 562)"
7,Transformed test set shape,"(1177, 562)"
8,Numeric features,561
9,Preprocess,True


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.9834,0.9992,0.9834,0.9835,0.9834,0.98,0.9801,8.412
ridge,Ridge Classifier,0.9794,0.0,0.9794,0.9794,0.9793,0.9752,0.9752,0.226
rf,Random Forest Classifier,0.9745,0.9993,0.9745,0.9746,0.9745,0.9693,0.9693,4.772
svm,SVM - Linear Kernel,0.9677,0.0,0.9677,0.97,0.9674,0.9611,0.9617,0.566
knn,K Neighbors Classifier,0.9585,0.9956,0.9585,0.9596,0.9583,0.9501,0.9504,0.562
dt,Decision Tree Classifier,0.9324,0.9594,0.9324,0.9327,0.9323,0.9186,0.9187,2.172
nb,Naive Bayes,0.7319,0.9595,0.7319,0.7928,0.7276,0.6785,0.6923,0.296
ada,Ada Boost Classifier,0.544,0.8811,0.544,0.3483,0.4077,0.4447,0.504,11.07
qda,Quadratic Discriminant Analysis,0.1346,0.5,0.1346,0.0181,0.0319,0.0,0.0,1.192


Processing:   0%|          | 0/69 [00:00<?, ?it/s]

KeyboardInterrupt: ignored

In [None]:
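from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# NOTE: `train` is not defined in this notebook; this cell assumes a DataFrame with 'id' and 'target' columns.
X_train = train.drop(['id', 'target'], axis = 1)
y_train = train['target']

feat_labels = X_train.columns

rf = RandomForestClassifier(n_estimators = 1000, random_state = 0, n_jobs = -1)

rf.fit(X_train, y_train)
importances = rf.feature_importances_

indices = np.argsort(rf.feature_importances_)[::-1] 
# np.argsort() returns the indices that sort the values in ascending order
# [::-1] reverses them so the most important feature comes first

for f in range(X_train.shape[1]):
    print('%2d) %-*s %f' % (f +1, 30, feat_labels[indices[f]], importances[indices[f]]))
    # Prints the rank, the feature name left-aligned in a 30-character field, and the importance

# Keep the features whose importance is at least the median importance
sfm = SelectFromModel(rf, threshold = 'median', prefit = True)
print('Number of features before selection : {}'.format(X_train.shape[1]))

# Apply the selector
n_features = sfm.transform(X_train).shape[1]

print("Number of features after selection : {}". format(n_features))

selected_vars = list(feat_labels[sfm.get_support()])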
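For reference, a self-contained version of the median-threshold selection, run on synthetic data rather than the undefined `train` frame, might look like this. All names ending in `_demo` are assumptions made for the sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 20 features, 5 of them informative
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=5, random_state=0)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=0)
rf_demo.fit(X_demo, y_demo)

# prefit=True reuses the already fitted forest;
# threshold='median' keeps features whose importance is at least the median
sfm_demo = SelectFromModel(rf_demo, threshold="median", prefit=True)
X_sel = sfm_demo.transform(X_demo)
mask = sfm_demo.get_support()

print(X_demo.shape[1], "->", X_sel.shape[1])
print(np.flatnonzero(mask))  # indices of the selected features
```

With a median threshold roughly half of the features survive, so this gives a quick baseline before trying the one-at-a-time search.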

In [None]:
## Stepwise selection (forward selection with backward elimination)
## NOTE: this cell expects `df` to be a DataFrame with a 'Survival_Time' column, not the score table built earlier in this notebook.
import statsmodels.api as sm

variables = df.columns[:-2].tolist() ## list of explanatory variables
 
y = df['Survival_Time'] ## response variable
selected_variables = [] ## variables selected so far
sl_enter = 0.05
sl_remove = 0.05
 
sv_per_step = [] ## variables selected at each step
adjusted_r_squared = [] ## adjusted R-squared at each step
steps = [] ## step numbers
step = 0
while len(variables) > 0:
    remainder = list(set(variables) - set(selected_variables))
    pval = pd.Series(index=remainder, dtype=float) ## p-value of each candidate variable
    ## Fit a linear model on the already selected variables
    ## plus one new candidate at a time.
    for col in remainder: 
        X = df[selected_variables+[col]]
        X = sm.add_constant(X)
        model = sm.OLS(y,X).fit()
        pval[col] = model.pvalues[col]
 
    min_pval = pval.min()
    if min_pval < sl_enter: ## add the candidate if its p-value is below the entry threshold
        selected_variables.append(pval.idxmin())
        ## Among the selected variables,
        ## decide whether any should be removed.
        while len(selected_variables) > 0:
            selected_X = df[selected_variables]
            selected_X = sm.add_constant(selected_X)
            selected_pval = sm.OLS(y,selected_X).fit().pvalues[1:] ## exclude the intercept's p-value
            max_pval = selected_pval.max()
            if max_pval >= sl_remove: ## drop the variable if its p-value is at or above the removal threshold
                remove_variable = selected_pval.idxmax()
                selected_variables.remove(remove_variable)
            else:
                break
        
        step += 1
        steps.append(step)
        adj_r_squared = sm.OLS(y,sm.add_constant(df[selected_variables])).fit().rsquared_adj
        adjusted_r_squared.append(adj_r_squared)
        sv_per_step.append(selected_variables.copy())
    else:
        break