### Quest
- dataset : RecurrenceOfSurgery
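- Target variable ('재발여부', recurrence), explanatory variables (1 categorical / 2 continuous)
- Drop rows with missing values
- Find the best model (using the default score)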

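#### Variable setup
1. Target variable : '재발여부' (recurrence)
2. Explanatory variables
    - Categorical : '고혈압여부' (hypertension)
    - Continuous : '신장' (height)
    - Continuous : '연령' (age)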

In [1]:
import pandas as pd

In [2]:
df_ROS = pd.read_csv('../../../datasets/RecurrenceOfSurgery.csv')
df_ROS[:2]

Unnamed: 0.1,Unnamed: 0,환자ID,Large Lymphocyte,Location of herniation,ODI,가족력,간질성폐질환,고혈압여부,과거수술횟수,당뇨여부,...,Modic change,PI,PT,Seg Angle(raw),Vaccum disc,골밀도,디스크단면적,디스크위치,척추이동척도,척추전방위증
0,0,1PT,22.8,3,51.0,0.0,0,0,0,0,...,3,51.6,36.6,14.4,0,-1.01,2048.5,4,Down,0
1,1,2PT,44.9,4,26.0,0.0,0,0,0,0,...,0,40.8,7.2,17.8,0,-1.14,1753.1,4,Up,0
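The `Unnamed: 0` column in the preview usually appears when a CSV was saved with its index and then read back without declaring it. The sketch below shows the effect on a made-up two-row CSV (the text in `csv_text` is hypothetical, not taken from RecurrenceOfSurgery.csv). Passing `index_col=0` is one common way to avoid the extra column.

```python
import io
import pandas as pd

# Hypothetical CSV text whose first column is an unnamed saved index.
csv_text = ",환자ID,연령\n0,1PT,66\n1,2PT,47\n"

# Default read: the saved index comes back as a column named 'Unnamed: 0'.
with_extra = pd.read_csv(io.StringIO(csv_text))

# Declaring the first column as the index keeps it out of the data columns.
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```

The notebook never uses this column as a feature, so this is only a tidiness option.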


In [3]:
df_ROS.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1894 entries, 0 to 1893
Data columns (total 52 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Unnamed: 0              1894 non-null   int64  
 1   환자ID                    1894 non-null   object 
 2   Large Lymphocyte        1894 non-null   float64
 3   Location of herniation  1894 non-null   int64  
 4   ODI                     462 non-null    float64
 5   가족력                     1843 non-null   float64
 6   간질성폐질환                  1894 non-null   int64  
 7   고혈압여부                   1894 non-null   int64  
 8   과거수술횟수                  1894 non-null   int64  
 9   당뇨여부                    1894 non-null   int64  
 10  말초동맥질환여부                1894 non-null   int64  
 11  빈혈여부                    1894 non-null   int64  
 12  성별                      1894 non-null   int64  
 13  스테로이드치료                 1894 non-null   int64  
 14  신부전여부                   1894 non-null   

#### Preprocessing

In [4]:
df_ROS_extract = df_ROS[['재발여부','고혈압여부','신장','연령']]
df_ROS_extract.isnull().sum()  # check for nulls: none found

재발여부     0
고혈압여부    0
신장       0
연령       0
dtype: int64
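The four selected columns have no nulls, so the "drop rows with missing values" step of the quest is a no-op here. If the columns did contain nulls, the step could look like the sketch below. The `toy` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_ROS; the values are made up.
toy = pd.DataFrame({
    '재발여부': [0, 1, 0, 0],
    '고혈압여부': [0, 0, 1, 0],
    '신장': [163.0, np.nan, 171.0, 158.0],
    '연령': [66, 47, np.nan, 39],
})

columns = ['재발여부', '고혈압여부', '신장', '연령']

# Drop every row with a missing value in any selected column, then
# renumber the index so later positional merges (pd.concat, axis=1) line up.
toy_clean = toy.dropna(subset=columns).reset_index(drop=True)

print(toy_clean.shape)  # (2, 4)
```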

In [5]:
df_ROS_extract[:2]

Unnamed: 0,재발여부,고혈압여부,신장,연령
0,0,0,163,66
1,0,0,171,47


#### Scaling & Encoding

##### - Encoding with OneHotEncoder

In [6]:
df_ROS_extract['고혈압여부'].value_counts()

0    1646
1     248
Name: 고혈압여부, dtype: int64

In [7]:
from sklearn.preprocessing import OneHotEncoder

In [8]:
oneHotEncoder = OneHotEncoder()  # instantiate

In [9]:
oneHotEncoder.fit(df_ROS_extract[['고혈압여부']])

In [10]:
columns_name = oneHotEncoder.categories_
columns_name

[array([0, 1], dtype=int64)]

In [11]:
encoded_data = oneHotEncoder.transform(df_ROS_extract[['고혈압여부']]).toarray()
encoded_data, encoded_data.shape

(array([[1., 0.],
        [1., 0.],
        [1., 0.],
        ...,
        [1., 0.],
        [1., 0.],
        [1., 0.]]),
 (1894, 2))

In [12]:
# convert to a DataFrame so it can be merged
df_encoded_data = pd.DataFrame(data=encoded_data, columns=oneHotEncoder.get_feature_names_out(['고혈압여부']))
df_encoded_data[:2]

Unnamed: 0,고혈압여부_0,고혈압여부_1
0,1.0,0.0
1,1.0,0.0


In [13]:
df_encoded_data.index, df_encoded_data.shape

(RangeIndex(start=0, stop=1894, step=1), (1894, 2))

In [14]:
# merge df_ROS_extract and df_encoded_data
df_ROS_extract = pd.concat([df_ROS_extract.reset_index(drop=True), df_encoded_data.reset_index(drop=True)], axis=1)
df_ROS_extract[:2]

Unnamed: 0,재발여부,고혈압여부,신장,연령,고혈압여부_0,고혈압여부_1
0,0,0,163,66,1.0,0.0
1,0,0,171,47,1.0,0.0


##### - Scaling

In [15]:
df_ROS_extract.columns

Index(['재발여부', '고혈압여부', '신장', '연령', '고혈압여부_0', '고혈압여부_1'], dtype='object')

In [16]:
target = df_ROS_extract['재발여부']

In [17]:
features = df_ROS_extract.drop(columns=['재발여부', '고혈압여부'])

In [18]:
features.columns

Index(['신장', '연령', '고혈압여부_0', '고혈압여부_1'], dtype='object')

##### MinMaxScaler (normalization)

In [19]:
from sklearn.preprocessing import MinMaxScaler

In [20]:
minMaxScaler = MinMaxScaler()  # instantiate
features = minMaxScaler.fit_transform(features)
features.shape

(1894, 4)
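With its default `feature_range=(0, 1)`, `MinMaxScaler` maps each column to `(x - min) / (max - min)`. The sketch below checks that formula against the scaler on a made-up height column (the numbers are illustrative, not from the dataset).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical heights in cm, one feature column.
heights = np.array([[150.0], [163.0], [171.0], [190.0]])

# Manual min-max scaling, computed per column.
col_min = heights.min(axis=0)
col_max = heights.max(axis=0)
manual = (heights - col_min) / (col_max - col_min)

# The same transform via scikit-learn.
scaled = MinMaxScaler().fit_transform(heights)

print(np.allclose(manual, scaled))  # True
```

Note that `fit_transform` returns a NumPy array, so `features` loses its column names after this step.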

#### Train/test split

In [21]:
from sklearn.model_selection import train_test_split

In [22]:
features_train, features_test, target_train, target_test = train_test_split(features, target, random_state=111)
features_train.shape, target_train.shape, features_test.shape, target_test.shape

((1420, 4), (1420,), (474, 4), (474,))
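The notebook fits the scaler on all rows before splitting, so the scaler's min and max include test rows. A common alternative, sketched here on synthetic data, is to split first and fit the scaler on the training part only. `stratify=` is also an assumption on my part: it keeps the class ratio similar in both splits, which may help given that only 58 of the 474 test rows below are positive.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data: 200 rows, 3 features, 12% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.array([1] * 24 + [0] * 176)

# Split first; stratify keeps the positive share similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=111
)

# Fit the scaler on the training rows only, then apply it to both parts.
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train.shape, X_test.shape, y_train.mean(), y_test.mean())
```

For a decision tree the practical difference is likely small, since min-max scaling preserves the ordering of values within each feature.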

#### Model training

In [25]:
from sklearn.tree import DecisionTreeClassifier

In [26]:
model = DecisionTreeClassifier()

In [27]:
from sklearn.model_selection import GridSearchCV

In [28]:
hyper_params = {'min_samples_leaf': [5, 7, 9],
                'max_depth': [9, 11],
                'min_samples_split': [5, 6, 7]}

#### Evaluation

In [30]:
grid_search = GridSearchCV(model, param_grid=hyper_params, cv=3, verbose=1)

In [31]:
grid_search.fit(features_train, target_train)

Fitting 3 folds for each of 18 candidates, totalling 54 fits
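The "18 candidates, 54 fits" in the log follows from the grid size: 3 × 2 × 3 parameter combinations, each fitted on 3 folds. A small check using `ParameterGrid`, which enumerates the same combinations that `GridSearchCV` evaluates:

```python
from sklearn.model_selection import ParameterGrid

hyper_params = {'min_samples_leaf': [5, 7, 9],
                'max_depth': [9, 11],
                'min_samples_split': [5, 6, 7]}

# Every combination of the listed values is one candidate.
n_candidates = len(ParameterGrid(hyper_params))

# Each candidate is fitted once per cross-validation fold (cv=3).
n_fits = n_candidates * 3

print(n_candidates, n_fits)  # 18 54
```

One observation on the grid itself: a node can only be split if both children keep at least `min_samples_leaf` samples, so with `min_samples_leaf` of 5 or more a node needs at least 10 samples to split. The `min_samples_split` values 5, 6 and 7 should therefore never be the binding constraint, and widening that range (or dropping the parameter) may make the search more informative.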


In [32]:
grid_search.best_estimator_

In [35]:
grid_search.best_score_, grid_search.best_params_

(0.8774706142972261,
 {'max_depth': 9, 'min_samples_leaf': 7, 'min_samples_split': 5})

In [36]:
best_model = grid_search.best_estimator_
best_model

In [37]:
target_test_predict = best_model.predict(features_test)
target_test_predict

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,

In [38]:
from sklearn.metrics import classification_report

In [39]:
print(classification_report(target_test, target_test_predict))

              precision    recall  f1-score   support

           0       0.88      0.99      0.93       416
           1       0.33      0.03      0.06        58

    accuracy                           0.87       474
   macro avg       0.61      0.51      0.50       474
weighted avg       0.81      0.87      0.83       474
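The report shows that the model almost never predicts class 1 (recall 0.03), and its 0.87 accuracy is about what always predicting class 0 would score on this test split: 416 / 474. The sketch below computes that baseline from the support column, then shows one possible way to make the search favor the minority class. It runs on synthetic data, and `class_weight='balanced'` and `scoring='f1'` are my assumptions rather than the notebook's setup, which follows the quest's "default score" rule.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Majority-class baseline, from the support column of the report above.
baseline_accuracy = 416 / (416 + 58)
print(round(baseline_accuracy, 4))  # 0.8776

# Synthetic imbalanced data (about 12% positives) standing in for the real features.
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, weights=[0.88, 0.12], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=111
)

# Weight classes inversely to their frequency and select by F1 of class 1
# instead of accuracy (both are illustrative choices).
grid = GridSearchCV(
    DecisionTreeClassifier(class_weight='balanced', random_state=0),
    param_grid={'max_depth': [3, 5], 'min_samples_leaf': [5, 9]},
    scoring='f1',
    cv=3,
)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(recall_score(y_test, grid.predict(X_test)))
```

Whether this helps on the real data is untested here. With only hypertension, height and age as features, the signal for recurrence may simply be weak.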

