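### Target and feature variables
- Subject: patients
- Goal: analyze how the features affect the target, '수술실패여부' (surgery failure)
- Target: '수술실패여부' (surgery failure)
- Features: '고혈압여부' (hypertension), '성별' (sex), '신부전여부' (renal failure), '연령' (age), '체중' (weight), '수술시간' (surgery duration)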

In [53]:
import pandas as pd

In [54]:
df_ROS = pd.read_csv('../datasets/RecurrenceOfSurgery.csv')
df_ROS_select = df_ROS[['수술실패여부', '고혈압여부', '성별', '신부전여부', '연령', '체중', '수술시간']]
df_ROS_select[:2]

Unnamed: 0,수술실패여부,고혈압여부,성별,신부전여부,연령,체중,수술시간
0,0,0,2,0,66,60.3,68.0
1,0,0,1,0,47,71.7,31.0


In [55]:
df_ROS.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1894 entries, 0 to 1893
Data columns (total 52 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Unnamed: 0              1894 non-null   int64  
 1   환자ID                    1894 non-null   object 
 2   Large Lymphocyte        1894 non-null   float64
 3   Location of herniation  1894 non-null   int64  
 4   ODI                     462 non-null    float64
 5   가족력                     1843 non-null   float64
 6   간질성폐질환                  1894 non-null   int64  
 7   고혈압여부                   1894 non-null   int64  
 8   과거수술횟수                  1894 non-null   int64  
 9   당뇨여부                    1894 non-null   int64  
 10  말초동맥질환여부                1894 non-null   int64  
 11  빈혈여부                    1894 non-null   int64  
 12  성별                      1894 non-null   int64  
 13  스테로이드치료                 1894 non-null   int64  
 14  신부전여부                   1894 non-null   

### Preprocessing

#### Filling null values with a trained model and apply()
- Column with null values to fill: '수술시간'

In [56]:
# Check for nulls -> '수술시간' contains null values
df_ROS_select.isnull().sum()

수술실패여부     0
고혈압여부      0
성별         0
신부전여부      0
연령         0
체중         0
수술시간      54
dtype: int64

In [57]:
# Drop rows with nulls so the model is trained on complete rows only
df_ROS_select_drop = df_ROS_select.dropna()
df_ROS_select_drop.isnull().sum()

수술실패여부    0
고혈압여부     0
성별        0
신부전여부     0
연령        0
체중        0
수술시간      0
dtype: int64

In [58]:
# Prepare training data from the rows without nulls
# 1. target: '수술시간', feature: '성별'
target = df_ROS_select_drop[['수술시간']]
feature = df_ROS_select_drop[['성별']]
target.shape, feature.shape

((1840, 1), (1840, 1))

In [59]:
# 2. Train a regression model on the null-free rows and their observed values
from sklearn.linear_model import LinearRegression
model = LinearRegression() # instantiate
model.fit(feature, target) # train the model

In [60]:
# Check the model's predictions (type: NumPy array)
model.predict(feature)

array([[62.40798859],
       [61.85601405],
       [61.85601405],
       ...,
       [61.85601405],
       [61.85601405],
       [62.40798859]])
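
The output above contains only two distinct values. That is what a linear regression with an intercept and a single two-valued feature produces: its fitted values are the target means of the two groups. A minimal sketch on toy data (the column names `sex` and `duration` and all values are hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical values): a binary feature coded 1/2 and a numeric target.
toy = pd.DataFrame({
    "sex": [1, 1, 1, 2, 2],
    "duration": [60.0, 62.0, 64.0, 50.0, 70.0],
})

toy_model = LinearRegression().fit(toy[["sex"]], toy["duration"])
toy_pred = toy_model.predict(toy[["sex"]])

# With an intercept and one binary feature, the fitted values are the group means.
group_means = toy.groupby("sex")["duration"].transform("mean").to_numpy()
print(np.unique(toy_pred.round(6)))
print(np.allclose(toy_pred, group_means))
```

In other words, imputing '수술시간' from '성별' alone amounts to filling each null with the mean duration of that patient's sex group.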

In [61]:
# apply(): fill each null '수술시간' with the model's prediction for that row
import numpy as np
def convert_notnull(row) :
    if np.isnan(row['수술시간'])  : # this row's '수술시간' is null
        feature = pd.DataFrame({'성별': [row['성별']]})  # this row's own feature value
        result = model.predict(feature)  # shape (1, 1)
        return float(result[0, 0])       # scalar, so the column stays float64
    else :
        return row['수술시간']  # not null: keep the original value

In [62]:
df_ROS_select = df_ROS_select.copy()  # independent copy, avoids SettingWithCopyWarning
df_ROS_select['수술시간'] = df_ROS_select.apply(convert_notnull, axis=1)
df_ROS_select['수술시간']

0       68.0
1       31.0
2       78.0
3       73.0
4       29.0
        ... 
1889    80.0
1890    20.0
1891    50.0
1892    25.0
1893    45.0
Name: 수술시간, Length: 1894, dtype: float64

In [63]:
# Check for nulls after apply()
df_ROS_select['수술시간'].isnull().sum()

0

In [64]:
df_ROS_select.isnull().sum()

수술실패여부    0
고혈압여부     0
성별        0
신부전여부     0
연령        0
체중        0
수술시간      0
dtype: int64
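
The same idea (fit on rows where '수술시간' is known, predict for rows where it is missing) can also be written without a row-wise `apply()`, using a boolean mask. This is a sketch on a toy frame with hypothetical values; `missing`, `imputer_model`, and `filled` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy frame (hypothetical values) with the same two column names as above.
toy = pd.DataFrame({
    "성별": [1, 2, 1, 2, 1, 2],
    "수술시간": [60.0, 50.0, np.nan, 54.0, 64.0, np.nan],
})

missing = toy["수술시간"].isna()  # boolean mask of the rows to fill

# Fit only on rows where the target is known.
imputer_model = LinearRegression()
imputer_model.fit(toy.loc[~missing, ["성별"]], toy.loc[~missing, "수술시간"])

# Predict once for all missing rows, each from its own '성별' value.
filled = toy.copy()
filled.loc[missing, "수술시간"] = imputer_model.predict(toy.loc[missing, ["성별"]])
print(filled)
```

A single `predict` call over the masked rows avoids calling the model once per row, which tends to matter on larger frames.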

In [65]:
# Re-select the columns from the raw frame (note: this discards the '수술시간' values filled above)
df_ROS_select = df_ROS[['수술실패여부', '고혈압여부', '성별', '신부전여부', '연령', '체중', '수술시간']]
df_ROS_select[:2]

Unnamed: 0,수술실패여부,고혈압여부,성별,신부전여부,연령,체중,수술시간
0,0,0,2,0,66,60.3,68.0
1,0,0,1,0,47,71.7,31.0


#### Scaling & Encoding & Concat

##### - OneHotEncoding

In [66]:
# Check the categorical columns: '고혈압여부', '성별', '신부전여부'
df_ROS_select['고혈압여부'].value_counts(),df_ROS_select['성별'].value_counts(),df_ROS_select['신부전여부'].value_counts()

(0    1646
 1     248
 Name: 고혈압여부, dtype: int64,
 1    1168
 2     726
 Name: 성별, dtype: int64,
 0    1846
 1      48
 Name: 신부전여부, dtype: int64)

In [67]:
from sklearn.preprocessing import OneHotEncoder

In [68]:
# One-hot encode the categorical features
oneHotEncoder = OneHotEncoder() # instantiate
oneHotEncoder.fit(df_ROS_select[['고혈압여부', '성별', '신부전여부']])

In [69]:
oneHotEncoder.categories_

[array([0, 1], dtype=int64),
 array([1, 2], dtype=int64),
 array([0, 1], dtype=int64)]

In [70]:
encoded_data = oneHotEncoder.transform(df_ROS_select[['고혈압여부', '성별', '신부전여부']]).toarray()
encoded_data.shape

(1894, 6)

In [71]:
df_encoded_data = pd.DataFrame(data=encoded_data, columns=oneHotEncoder.get_feature_names_out(['고혈압여부', '성별', '신부전여부']))
df_encoded_data[:2]

Unnamed: 0,고혈압여부_0,고혈압여부_1,성별_1,성별_2,신부전여부_0,신부전여부_1
0,1.0,0.0,0.0,1.0,1.0,0.0
1,1.0,0.0,1.0,0.0,1.0,0.0


##### - Merge (Concat)

In [72]:
df_ROS_select= pd.concat([df_ROS_select.reset_index(drop=True), df_encoded_data.reset_index(drop=True)], axis=1)
df_ROS_select[:2]

Unnamed: 0,수술실패여부,고혈압여부,성별,신부전여부,연령,체중,수술시간,고혈압여부_0,고혈압여부_1,성별_1,성별_2,신부전여부_0,신부전여부_1
0,0,0,2,0,66,60.3,68.0,1.0,0.0,0.0,1.0,1.0,0.0
1,0,0,1,0,47,71.7,31.0,1.0,0.0,1.0,0.0,1.0,0.0


In [73]:
df_ROS_select.shape

(1894, 13)

##### - Scaling

In [74]:
df_ROS_select.columns

Index(['수술실패여부', '고혈압여부', '성별', '신부전여부', '연령', '체중', '수술시간', '고혈압여부_0',
       '고혈압여부_1', '성별_1', '성별_2', '신부전여부_0', '신부전여부_1'],
      dtype='object')

In [75]:
target = df_ROS_select['수술실패여부']
features = df_ROS_select.drop(columns=['수술실패여부', '고혈압여부', '성별', '신부전여부'])

In [76]:
features.columns

Index(['연령', '체중', '수술시간', '고혈압여부_0', '고혈압여부_1', '성별_1', '성별_2', '신부전여부_0',
       '신부전여부_1'],
      dtype='object')

##### - MinMaxScaler

In [77]:
from sklearn.preprocessing import MinMaxScaler

In [78]:
minMaxScaler = MinMaxScaler() # instantiate
# features = minMaxScaler.fit_transform(features)
fit_scaler = minMaxScaler.fit(features)
features_scaler = fit_scaler.transform(features)
features_scaler.shape

(1894, 9)

#### Imbalanced Data Sampling
- Undersampling: Tomek links

In [79]:
from imblearn.under_sampling import TomekLinks

In [80]:
from sklearn.datasets import make_classification

In [81]:
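# Note: make_classification generates a synthetic dataset, so from here on
# `features` and `target` no longer hold the surgery data prepared above
features, target = make_classification(n_classes=2, class_sep=2,
                    weights=[0.4, 0.6], n_informative=3, n_redundant=1, flip_y=0,
                    n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)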

In [82]:
features.shape, target.shape

((1000, 20), (1000,))

In [83]:
from collections import Counter

In [84]:
Counter(target)

Counter({0: 400, 1: 600})

In [85]:
tomekLinks = TomekLinks() # instantiate
features_resample, target_resample = tomekLinks.fit_resample(features, target) # fit and resample

In [86]:
features_resample.shape, target_resample.shape

((995, 20), (995,))

In [87]:
Counter(target_resample)

Counter({0: 400, 1: 595})
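
A Tomek link is a pair of samples from different classes that are each other's nearest neighbour. In the counts above only the majority class (1) shrinks, which is consistent with removing the majority-class member of each link. The sketch below reimplements that idea with plain NumPy on toy data; it is an illustration, not imblearn's implementation, and `tomek_link_pairs` is a hypothetical helper.

```python
import numpy as np

def tomek_link_pairs(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbours with different labels."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    nearest = dist.argmin(axis=1)
    pairs = []
    for i, j in enumerate(nearest):
        if i < j and nearest[j] == i and y[i] != y[j]:
            pairs.append((i, int(j)))
    return pairs

# Toy 1-D data (hypothetical values): two classes with one overlapping border pair.
X_toy = np.array([[0.0], [1.0], [2.0], [2.4], [4.0], [5.0], [6.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1, 1])

links = tomek_link_pairs(X_toy, y_toy)
majority = np.bincount(y_toy).argmax()

# Drop only the majority-class member of each link.
drop = sorted({k for pair in links for k in pair if y_toy[k] == majority})
keep = np.setdiff1d(np.arange(len(y_toy)), drop)
X_res, y_res = X_toy[keep], y_toy[keep]
print(links, drop, np.bincount(y_res))
```

Because only border samples are removed, Tomek links clean the class boundary rather than balance the classes, which is why the counts above change so little.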

#### Train/test split

In [88]:
from sklearn.model_selection import train_test_split
# Note: this splits `features`/`target`, not the Tomek-resampled arrays
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size = 0.3, random_state = 10)
features_train.shape, features_test.shape, target_train.shape, target_test.shape

((700, 20), (300, 20), (700,), (300,))
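
When one class is rare (for example, '신부전여부' has 48 positives out of 1894 rows above), a plain random split can leave the two parts with different class ratios. Passing `stratify` to `train_test_split` keeps the ratio the same in both parts. A sketch on synthetic labels; `X_demo`, `y_demo`, and the 10% positive rate are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): 10% positives, similar in spirit to a rare outcome.
rng = np.random.default_rng(10)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.array([1] * 100 + [0] * 900)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=10, stratify=y_demo)

# The positive rate is preserved in both splits.
print(y_tr.mean(), y_te.mean())
```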

#### Model training

In [89]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
# The target is categorical, so a classifier is used

In [90]:
from sklearn.model_selection import GridSearchCV
hyper_params = {'min_samples_leaf' : range(2,5),
               'max_depth' : range(2,5),
               'min_samples_split' : range(2,5)}

#### Prediction

In [91]:
from sklearn.metrics import f1_score, make_scorer
scoring = make_scorer(f1_score)

In [92]:
grid_search = GridSearchCV(model, param_grid = hyper_params, cv=3, verbose=1, scoring=scoring)
# Note: this fits on all rows of `features`, including those in features_test
grid_search.fit(features, target)

Fitting 3 folds for each of 27 candidates, totalling 81 fits


In [93]:
grid_search.best_estimator_

In [94]:
grid_search.best_score_, grid_search.best_params_

# Score before preprocessing: 0.028571428571428574
# (the scorer here is F1, not accuracy)
# A low score: the model did not predict well

(0.9932913090807828,
 {'max_depth': 2, 'min_samples_leaf': 3, 'min_samples_split': 2})

In [95]:
best_model = grid_search.best_estimator_
best_model

In [96]:
target_test_predict = best_model.predict(features_test)
target_test_predict

array([1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1,
       1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1,
       0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0,
       1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0,
       1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0,
       1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0,
       1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1,
       0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1,
       1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0])
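
A common way to keep the test rows out of hyperparameter tuning is to fit the grid search on the training split only and score once on the held-out split. The sketch below assumes the same synthetic data and parameter grid as above; `X_all`, `search`, and `held_out_f1` are illustrative names, and `random_state=10` on the tree is an added assumption for repeatability.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = make_classification(n_classes=2, class_sep=2, weights=[0.4, 0.6],
                                   n_informative=3, n_redundant=1, flip_y=0,
                                   n_features=20, n_clusters_per_class=1,
                                   n_samples=1000, random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.3, random_state=10)

param_grid = {"min_samples_leaf": range(2, 5),
              "max_depth": range(2, 5),
              "min_samples_split": range(2, 5)}
search = GridSearchCV(DecisionTreeClassifier(random_state=10), param_grid=param_grid,
                      cv=3, scoring="f1")
search.fit(X_tr, y_tr)  # the test rows are never seen during tuning

# GridSearchCV refits the best estimator, so predict() uses the tuned tree.
held_out_f1 = f1_score(y_te, search.predict(X_te))
print(search.best_params_, round(held_out_f1, 3))
```

With this arrangement `best_score_` is a cross-validated estimate on the training rows, and `held_out_f1` is an independent check on rows the search never saw.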

#### Evaluation

In [97]:
from sklearn.metrics import classification_report

In [98]:
print(classification_report(target_test, target_test_predict))

# Values before preprocessing
#               precision    recall  f1-score   support
#            0       0.94      1.00      0.97       516
#            1       1.00      0.03      0.05        36

              precision    recall  f1-score   support

           0       0.99      0.99      0.99       131
           1       0.99      0.99      0.99       169

    accuracy                           0.99       300
   macro avg       0.99      0.99      0.99       300
weighted avg       0.99      0.99      0.99       300
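
The precision, recall, and f1-score columns in the report are all derived from the confusion matrix. A sketch on toy labels (hypothetical values) that recomputes the positive-class F1 by hand and compares it with `f1_score`:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Toy labels (hypothetical values).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1_manual = 2 * precision * recall / (precision + recall)

print(precision, recall, f1_manual)
print(np.isclose(f1_manual, f1_score(y_true, y_pred)))
```

This is why the "before preprocessing" row for class 1 can show precision 1.00 with an f1-score of only 0.05: a recall of 0.03 pulls the harmonic mean down.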



#### Deployment

In [99]:
# pickle is a very handy library for saving and loading objects
# A fitted object can be written to a file as a whole and loaded back as-is
import pickle

In [100]:
# Pickle the fitted MinMaxScaler
with open('../datasets/RecurrenceOfSurgery_scaling.pkl','wb') as scaling_file :
    pickle.dump(obj=fit_scaler, file=scaling_file)

In [101]:
# Pickle the fitted OneHotEncoder
with open('../datasets/RecurrenceOfSurgery_encoding.pkl','wb') as encoding_file :
    pickle.dump(obj=oneHotEncoder, file=encoding_file)

In [102]:
# Pickle the best model (left commented out)
# with open('../datasets/RecurrenceOfSurgery_best_model.pkl','wb') as best_model_file :
#     pickle.dump(obj=best_model, file=best_model_file)