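## Quest
- Division of work (preprocessing, model training).
- Use RecurrenceOfSurgery.csv
- Target variable: categorical; explanatory variables: at least 6
- Describe the service audience and goal, and the reasons for choosing each variable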

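### Variable setup
0. Service audience: patients
    * Which factors influence surgery failure?
1. target: 수술실패여부 (surgery failure)
2. features 
    * Categorical: 고혈압여부 (hypertension)
    * Categorical: 성별 (sex)
    * Categorical: 신부전여부 (renal failure)
    * Continuous: 연령 (age)
    * Continuous: 체중 (weight)
    * Continuous: 수술시간 (operation time)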

In [1]:
import pandas as pd

In [2]:
df_ROS = pd.read_csv('../../datasets/RecurrenceOfSurgery.csv')
df_ROS[:2]

Unnamed: 0.1,Unnamed: 0,환자ID,Large Lymphocyte,Location of herniation,ODI,가족력,간질성폐질환,고혈압여부,과거수술횟수,당뇨여부,...,Modic change,PI,PT,Seg Angle(raw),Vaccum disc,골밀도,디스크단면적,디스크위치,척추이동척도,척추전방위증
0,0,1PT,22.8,3,51.0,0.0,0,0,0,0,...,3,51.6,36.6,14.4,0,-1.01,2048.5,4,Down,0
1,1,2PT,44.9,4,26.0,0.0,0,0,0,0,...,0,40.8,7.2,17.8,0,-1.14,1753.1,4,Up,0


#### Preprocessing

In [3]:
df_ROS_select = df_ROS[['수술실패여부', '고혈압여부', '성별', '신부전여부', '연령', '체중', '수술시간']]
df_ROS_select[:2]

Unnamed: 0,수술실패여부,고혈압여부,성별,신부전여부,연령,체중,수술시간
0,0,0,2,0,66,60.3,68.0
1,0,0,1,0,47,71.7,31.0


In [4]:
# Check for null values -> 수술시간 has nulls
df_ROS_select.isnull().sum()

수술실패여부     0
고혈압여부      0
성별         0
신부전여부      0
연령         0
체중         0
수술시간      54
dtype: int64

In [5]:
# Drop rows with null values
df_ROS_select = df_ROS_select.dropna()
df_ROS_select.isnull().sum()

수술실패여부    0
고혈압여부     0
성별        0
신부전여부     0
연령        0
체중        0
수술시간      0
dtype: int64

In [6]:
target_column = df_ROS_select['수술실패여부']
features_null_column = df_ROS_select['수술시간']

In [7]:
# Unfinished draft of a null-filling helper (not used: the null rows were dropped in In [5])
# def fill_null(row):
#     if df_ROS_select['수술시간'].isnull().sum():
#         target = df_ROS_select['수술실패여부']
#         features = df_ROS_select['수술시간']
        
        

In [8]:
# Fill null values using apply()
# df_ROS_select['수술시간'] = df_ROS_select['수술시간'].apply(fill_null)
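The commented-out draft stops before it defines how a missing 수술시간 would be filled, so the intended rule is unknown. As one possible alternative to dropping the 54 rows, here is a minimal sketch that assumes median imputation. The toy frame `df_toy` and its values are hypothetical and are not taken from RecurrenceOfSurgery.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
df_toy = pd.DataFrame({
    '수술실패여부': [0, 0, 1, 1, 0],
    '수술시간': [68.0, np.nan, 31.0, np.nan, 45.0],
})

# Assumption: replace each missing 수술시간 with the median of the observed values.
median_time = df_toy['수술시간'].median()
df_filled = df_toy.copy()
df_filled['수술시간'] = df_filled['수술시간'].fillna(median_time)

# No rows are lost and no nulls remain.
print(len(df_filled), df_filled['수술시간'].isnull().sum())
```

A vectorized `fillna` avoids the row-by-row `apply()` call. Whether imputing is preferable to dropping depends on why the values are missing, which the notebook does not say.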

#### Scaling & Encoding

##### OneHotEncoding

In [9]:
# Check the categorical columns: '고혈압여부', '성별', '신부전여부'
df_ROS_select['고혈압여부'].value_counts(),df_ROS_select['성별'].value_counts(),df_ROS_select['신부전여부'].value_counts()

(0    1598
 1     242
 Name: 고혈압여부, dtype: int64,
 1    1139
 2     701
 Name: 성별, dtype: int64,
 0    1792
 1      48
 Name: 신부전여부, dtype: int64)
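The raw counts are easier to compare as proportions. The sketch below recomputes them from the counts printed above; the `counts` dictionary is a hand-copied stand-in for the real columns. On the actual frame, `value_counts(normalize=True)` should give the same ratios directly.

```python
import pandas as pd

# Counts copied by hand from the value_counts() output above.
counts = {
    '고혈압여부': pd.Series({0: 1598, 1: 242}),
    '성별': pd.Series({1: 1139, 2: 701}),
    '신부전여부': pd.Series({0: 1792, 1: 48}),
}

# Share of each category within its column, rounded to three decimals.
proportions = {name: (c / c.sum()).round(3) for name, c in counts.items()}

for name, p in proportions.items():
    print(name, p.to_dict())
```

The 신부전여부 column is heavily skewed toward category 0, which is worth keeping in mind when interpreting its effect in a model.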

In [10]:
from sklearn.preprocessing import OneHotEncoder

In [11]:
# One-hot encode the categorical explanatory variables
oneHotEncoder = OneHotEncoder() # instantiate
oneHotEncoder.fit(df_ROS_select[['고혈압여부', '성별', '신부전여부']])

In [12]:
oneHotEncoder.categories_

[array([0, 1], dtype=int64),
 array([1, 2], dtype=int64),
 array([0, 1], dtype=int64)]

In [13]:
encoded_data = oneHotEncoder.transform(df_ROS_select[['고혈압여부', '성별', '신부전여부']]).toarray()
encoded_data.shape

(1840, 6)

In [14]:
df_encoded_data = pd.DataFrame(data=encoded_data, columns=oneHotEncoder.get_feature_names_out(['고혈압여부', '성별', '신부전여부']))
df_encoded_data[:2]

Unnamed: 0,고혈압여부_0,고혈압여부_1,성별_1,성별_2,신부전여부_0,신부전여부_1
0,1.0,0.0,0.0,1.0,1.0,0.0
1,1.0,0.0,1.0,0.0,1.0,0.0


In [15]:
# Merge the encoded columns back in (reset both indexes so rows align after dropna)
df_ROS_select = pd.concat([df_ROS_select.reset_index(drop=True), df_encoded_data.reset_index(drop=True)], axis=1)
df_ROS_select[:2]

Unnamed: 0,수술실패여부,고혈압여부,성별,신부전여부,연령,체중,수술시간,고혈압여부_0,고혈압여부_1,성별_1,성별_2,신부전여부_0,신부전여부_1
0,0,0,2,0,66,60.3,68.0,1.0,0.0,0.0,1.0,1.0,0.0
1,0,0,1,0,47,71.7,31.0,1.0,0.0,1.0,0.0,1.0,0.0


In [16]:
df_ROS_select.shape

(1840, 13)
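The row count stays at 1840 only because both frames were given a fresh index before `pd.concat`. After `dropna()` the original index has gaps, and `concat(axis=1)` aligns on index labels rather than on position. The sketch below shows the difference on a small hypothetical pair of frames (`left` and `right` are illustrative names).

```python
import pandas as pd

# Hypothetical frame whose index has a gap, as happens after dropna().
left = pd.DataFrame({'수술시간': [68.0, 31.0, 45.0]}, index=[0, 2, 3])
# Encoder output wrapped in a DataFrame gets a default 0..n-1 index.
right = pd.DataFrame({'성별_1': [0.0, 1.0, 1.0]})

# Without resetting, rows are matched by label and unmatched labels become NaN.
misaligned = pd.concat([left, right], axis=1)
# With a reset index, rows are matched by position.
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned.shape, int(misaligned.isnull().sum().sum()))
print(aligned.shape, int(aligned.isnull().sum().sum()))
```

Checking `.shape` after the merge, as the notebook does, is a quick way to catch this kind of misalignment.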

##### Scaling

In [17]:
df_ROS_select.columns

Index(['수술실패여부', '고혈압여부', '성별', '신부전여부', '연령', '체중', '수술시간', '고혈압여부_0',
       '고혈압여부_1', '성별_1', '성별_2', '신부전여부_0', '신부전여부_1'],
      dtype='object')

In [18]:
target = df_ROS_select['수술실패여부']
features = df_ROS_select.drop(columns=['수술실패여부', '고혈압여부', '성별', '신부전여부'])

In [19]:
features.columns

Index(['연령', '체중', '수술시간', '고혈압여부_0', '고혈압여부_1', '성별_1', '성별_2', '신부전여부_0',
       '신부전여부_1'],
      dtype='object')

MinMaxScaler

In [20]:
from sklearn.preprocessing import MinMaxScaler

In [21]:
minMaxScaler = MinMaxScaler() # instantiate
features = minMaxScaler.fit_transform(features)
features.shape

(1840, 9)
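`MinMaxScaler` maps each column to the range [0, 1] via (x - min) / (max - min), and `fit_transform` returns a NumPy array, so `features` no longer carries column names at this point. The sketch below checks the formula by hand on hypothetical toy values and shows one way to wrap the result back into a DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy values for the three continuous columns.
df_toy = pd.DataFrame({
    '연령': [66, 47, 30],
    '체중': [60.3, 71.7, 55.0],
    '수술시간': [68.0, 31.0, 45.0],
})

scaler = MinMaxScaler()
scaled = scaler.fit_transform(df_toy)  # NumPy array, column names are lost

# Same computation by hand: (x - min) / (max - min), column by column.
manual = (df_toy - df_toy.min()) / (df_toy.max() - df_toy.min())

# Wrap the array back into a DataFrame to keep the column names.
df_scaled = pd.DataFrame(scaled, columns=df_toy.columns, index=df_toy.index)

print(np.allclose(scaled, manual.to_numpy()))
print(df_scaled.columns.tolist())
```

The one-hot columns already lie in {0, 1}, so scaling them leaves their values unchanged; only 연령, 체중, and 수술시간 are actually rescaled.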