# Titanic - Machine Learning from Disaster

## Predict survival on the Titanic and get familiar with ML basics

https://www.kaggle.com/c/titanic/data

![image.png](../Images/Titanic.png)

#### Exploratory Data Analysis (EDA)

- Examine the distribution of each feature, splitting the data into survivors and non-survivors.
- Form hypotheses about which information can help **predict the survival rate**, then check them against actual plots.
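
The first bullet can be illustrated with a grouped mean: because `survived` is coded 0/1, its mean within a group is that group's survival rate. The sketch below uses a small made-up frame named `toy` (not the competition data) purely to show the pattern; in this notebook the same idea would apply after joining `X_train` with `y_train`.

```python
import pandas as pd

# Made-up sample for illustration only; not taken from the Titanic files.
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "female", "male"],
    "pclass": [3, 1, 3, 1, 2, 3],
    "survived": [0, 1, 1, 0, 1, 0],
})

# The mean of a 0/1 target within each group is that group's survival rate.
rate_by_sex = toy.groupby("sex")["survived"].mean()
rate_by_class = toy.groupby("pclass")["survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```

A bar plot of either series (for example `rate_by_sex.plot.bar()`) is one simple way to check a hypothesis visually.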

#### [ Data Description ]

- Data on 891 passengers: survival, ticket class, sex, age, companions, children, fare, and other features

|Feature|Definition|Value|
|:------|:---------|:------------|
|pclass|Ticket class (1st, 2nd, 3rd)|1 = 1st, 2 = 2nd, 3 = 3rd|
|name|Passenger name|String|
|sex|Sex|male, female|
|age|Age|Numeric|
|sibsp|Number of siblings and spouses aboard|Numeric|
|parch|Number of parents and children aboard|Numeric|
|ticket|Ticket number|String|
|fare|Fare (ticket price)|Numeric|
|cabin|Cabin section assigned on the ship|A, B, C, D, E, F, G, or empty|
|embarked|Port of embarkation (single letter)|C = Cherbourg, Q = Queenstown, S = Southampton|
|survived|Survival|0 = No, 1 = Yes|

### Library & Data Import

In [1]:
import pandas as pd
import numpy as np

In [2]:
X_train = pd.read_csv('../Datasets/Titanic_X_train.csv')
X_test = pd.read_csv('../Datasets/Titanic_X_test.csv')
y_train = pd.read_csv('../Datasets/Titanic_y_train.csv')

### 1. Data Exploration

In [3]:
X_train

Unnamed: 0,ID,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,3,Sdy*****,male,,0,0,349222,7.8958,,S
1,2,3,Pel*****,male,25.0,0,0,STON/O 2. 3101291,7.9250,,S
2,3,3,Kar*****,male,22.0,0,0,350060,7.5208,,S
3,4,3,Saa*****,male,,0,0,2676,7.2250,,C
4,5,3,Cor*****,male,19.0,0,0,349231,7.8958,,S
...,...,...,...,...,...,...,...,...,...,...,...
780,781,1,Ear*****,female,23.0,0,1,11767,83.1583,C54,C
781,782,2,Har*****,female,6.0,0,1,248727,33.0000,,S
782,783,3,Lul*****,male,27.0,0,0,315098,8.6625,,S
783,784,3,Alb*****,male,26.0,0,0,2699,18.7875,,C


In [4]:
X_test

Unnamed: 0,ID,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,786,1,All*****,female,2.0,1,2,113781,151.5500,C22 C26,S
1,787,1,And*****,male,39.0,0,0,112050,0.0000,A36,S
2,788,1,Bau*****,male,,0,0,PC 17318,25.9250,,S
3,789,1,Bax*****,male,24.0,0,1,PC 17558,247.5208,B58 B60,C
4,790,1,Bea*****,male,36.0,0,0,13050,75.2417,C6,C
...,...,...,...,...,...,...,...,...,...,...,...
519,1305,3,Sun*****,male,44.0,0,0,STON/O 2. 3101269,7.9250,,S
520,1306,3,Tho*****,female,,1,0,376564,16.1000,,S
521,1307,3,Tou*****,male,7.0,1,1,2650,15.2458,,C
522,1308,3,Tur*****,female,63.0,0,0,4134,9.5875,,S


In [5]:
y_train

Unnamed: 0,ID,survived
0,1,0
1,2,0
2,3,0
3,4,0
4,5,0
...,...,...
780,781,1
781,782,1
782,783,1
783,784,1


In [6]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 785 entries, 0 to 784
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   ID        785 non-null    int64  
 1   pclass    785 non-null    int64  
 2   name      785 non-null    object 
 3   sex       785 non-null    object 
 4   age       628 non-null    float64
 5   sibsp     785 non-null    int64  
 6   parch     785 non-null    int64  
 7   ticket    785 non-null    object 
 8   fare      784 non-null    float64
 9   cabin     171 non-null    object 
 10  embarked  784 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 67.6+ KB


In [7]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 524 entries, 0 to 523
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   ID        524 non-null    int64  
 1   pclass    524 non-null    int64  
 2   name      524 non-null    object 
 3   sex       524 non-null    object 
 4   age       418 non-null    float64
 5   sibsp     524 non-null    int64  
 6   parch     524 non-null    int64  
 7   ticket    524 non-null    object 
 8   fare      524 non-null    float64
 9   cabin     124 non-null    object 
 10  embarked  523 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 45.2+ KB
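
The `info()` output reports non-null counts, so the number of missing values has to be derived by subtraction (for example, 785 - 171 = 614 for `cabin` in train). A minimal sketch of getting both the count and the ratio directly is shown below; the frame `toy` is made up for illustration and stands in for `X_train` or `X_test`.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few missing values; not the real data.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "fare": [7.25, 71.28, np.nan, 8.05],
    "cabin": [np.nan, "C85", np.nan, np.nan],
})

# isna() gives a boolean frame; sum() counts True values, mean() gives the ratio.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "ratio": toy.isna().mean(),
})
print(missing)
```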


### 2. Data Preprocessing

#### (1) Dropping Low-Correlation Variables

In [8]:
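# ID is a unique key for each passenger and is not needed by the model
# The ID column of X_test is needed for the submission, so keep a separate copy
ID = X_test['ID'].copy()

# name could be analyzed with text preprocessing, but it is excluded here for simplicity
# Drop the ID and name columns from the datasets

# age and ticket are also dropped because their correlation with survived is low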
X_train = X_train.drop(columns = ['ID', 'name', 'age', 'ticket'])
X_test = X_test.drop(columns = ['ID', 'name', 'age', 'ticket'])
y_train = y_train.drop(columns = ['ID'])
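
One way to check a claim such as "low correlation with survived" is `DataFrame.corrwith`. The snippet below is only a sketch on made-up values (`toy_X` and `toy_y` are hypothetical names, not the notebook's data). It measures linear (Pearson) correlation only, and a string column such as `ticket` would have to be encoded before it could be included.

```python
import pandas as pd

# Made-up numeric features and target for illustration only.
toy_X = pd.DataFrame({
    "age": [22, 38, 26, 35, 54, 2],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08],
})
toy_y = pd.Series([0, 1, 1, 1, 0, 0], name="survived")

# Pearson correlation of each column with the target.
corr = toy_X.corrwith(toy_y)

# Rank features by the absolute strength of the linear relationship.
ranked = corr.abs().sort_values(ascending=False)
print(corr)
print(ranked)
```

A coefficient near zero does not rule out a non-linear effect, so a grouped survival rate or a plot is a useful complement before dropping a column.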

#### (2) Missing Value

In [9]:
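# fare is the ticket price; only train has a missing value (one row), so drop that record

# Condition: fare is missing
cond_na = X_train['fare'].isna()

# Drop the rows (the same mask is applied to y_train to keep rows aligned)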
X_train = X_train[~cond_na]
y_train = y_train[~cond_na]

In [10]:
####### cabin column (614 missing in train, 400 in test)
# cabin is the cabin number; 78% of train records and 76% of test records are missing, so drop the column

# Drop the cabin column
X_train = X_train.drop('cabin', axis = 1)
X_test = X_test.drop('cabin', axis = 1)

In [11]:
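####### embarked column (1 missing in train, 1 in test)
# embarked is the port of embarkation; it is categorical, so fill with the most frequent category

# Most frequent category (computed on train only)
top = X_train['embarked'].value_counts().idxmax()

# Impute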
X_train['embarked'] = X_train['embarked'].fillna(top)
X_test['embarked'] = X_test['embarked'].fillna(top)
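
The same "learn the mode on train, apply it to both sets" idea can also be sketched with scikit-learn's `SimpleImputer`. This is an alternative to the cell above, not what the notebook uses, and the frames `train` and `test` are made-up stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up stand-ins for the embarked column of train and test.
train = pd.DataFrame({"embarked": ["S", "S", "C", np.nan]})
test = pd.DataFrame({"embarked": ["Q", np.nan]})

# strategy="most_frequent" learns the mode of each column during fit.
imputer = SimpleImputer(strategy="most_frequent")
train_filled = imputer.fit_transform(train)

# transform reuses the mode learned from train, so test gets "S" as well.
test_filled = imputer.transform(test)

print(train_filled.ravel(), test_filled.ravel())
```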

In [12]:
# Unify 'F' as 'female' in both train and test
X_train['sex'] = X_train['sex'].map({'male':'male', 'female':'female', 'F':'female'})
X_test['sex'] = X_test['sex'].map({'male':'male', 'female':'female', 'F':'female'})
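
One detail worth keeping in mind with `Series.map` and a dictionary: any value that is not a key in the dictionary becomes `NaN`. The sketch below contrasts this with `Series.replace`, which leaves unlisted values untouched. The series `sex` and the stray value `"M"` are made up for illustration.

```python
import pandas as pd

# Made-up values; "M" is a hypothetical label that the mapping does not cover.
sex = pd.Series(["male", "F", "female", "M"])

# map: values missing from the dictionary turn into NaN.
mapped = sex.map({"male": "male", "female": "female", "F": "female"})

# replace: only the listed values change; everything else is kept as-is.
replaced = sex.replace({"F": "female"})

print(mapped.tolist())
print(replaced.tolist())
```

Checking `mapped.isna().sum()` after a `map` call is a cheap way to catch labels that were not anticipated.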

In [13]:
####### pclass column
# Stored as numbers, but 1, 2, 3 encode 1st, 2nd, 3rd class, so the column is categorical
# Cast the dtype into a derived variable pclass_gp, then drop the original column
X_train['pclass_gp'] = X_train['pclass'].astype('object')
X_test['pclass_gp'] = X_test['pclass'].astype('object')

# Drop once done
X_train = X_train.drop('pclass', axis = 1)
X_test = X_test.drop('pclass', axis = 1)

####### sibsp, parch columns
# sibsp is the number of siblings or spouses aboard; parch is the number of parents or children aboard
# Sum the two into a derived variable fam: the number of family members aboard
X_train['fam'] = X_train['sibsp'] + X_train['parch']
X_test['fam'] = X_test['sibsp'] + X_test['parch']

# Drop once done
X_train = X_train.drop(['sibsp', 'parch'], axis = 1)
X_test = X_test.drop(['sibsp', 'parch'], axis = 1)

In [14]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 784 entries, 0 to 784
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   sex        784 non-null    object 
 1   fare       784 non-null    float64
 2   embarked   784 non-null    object 
 3   pclass_gp  784 non-null    object 
 4   fam        784 non-null    int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 36.8+ KB


In [15]:
X_train.shape

(784, 5)

In [16]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 524 entries, 0 to 523
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   sex        524 non-null    object 
 1   fare       524 non-null    float64
 2   embarked   524 non-null    object 
 3   pclass_gp  524 non-null    object 
 4   fam        524 non-null    int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 20.6+ KB


### 3. Data Modeling

#### (1) One-Hot Encoding

In [17]:
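from sklearn.preprocessing import OneHotEncoder

# Select the categorical (object dtype) columns
X_train_cat = X_train.select_dtypes('object').copy()
X_test_cat = X_test.select_dtypes('object').copy()

# Dense output; scikit-learn 1.2+ renames this parameter to sparse_output
ohe = OneHotEncoder(sparse=False)

# Learn the categories from train only, then apply the same encoding to both sets
ohe.fit(X_train_cat)

X_train_ohe = ohe.transform(X_train_cat)
X_test_ohe = ohe.transform(X_test_cat)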

#### (2) Scaling

In [18]:
from sklearn.preprocessing import MinMaxScaler

# Select the numeric columns (fare, fam)
X_train_num = X_train.select_dtypes(exclude='object').copy()
X_test_num = X_test.select_dtypes(exclude='object').copy()

scaler = MinMaxScaler()

# Learn min and max from train only, then scale both sets with the same statistics
scaler.fit(X_train_num)

X_train_sca = scaler.transform(X_train_num)
X_test_sca = scaler.transform(X_test_num)
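
With its default range of 0 to 1, `MinMaxScaler` applies `(x - min) / (max - min)` per column, using the minimum and maximum seen during `fit`. The NumPy sketch below reproduces that arithmetic on made-up numbers to show one consequence: test values outside the training range fall outside [0, 1].

```python
import numpy as np

# Made-up single-column data; the test set contains a value above the train max.
train = np.array([[0.0], [10.0], [50.0]])
test = np.array([[25.0], [100.0]])

# Statistics come from train only, mirroring scaler.fit(X_train_num).
col_min = train.min(axis=0)
col_max = train.max(axis=0)

train_scaled = (train - col_min) / (col_max - col_min)
test_scaled = (test - col_min) / (col_max - col_min)

print(train_scaled.ravel())  # values within [0, 1]
print(test_scaled.ravel())   # 100.0 maps to 2.0, outside [0, 1]
```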

#### (3) Data Concat & Split

In [19]:
X_TRAIN = np.concatenate([X_train_ohe, X_train_sca], axis=1)
X_TEST = np.concatenate([X_test_ohe, X_test_sca], axis=1)

y_TRAIN = y_train['survived']

print(type(X_TRAIN), type(X_TEST), type(y_TRAIN))
print(X_TRAIN.shape, X_TEST.shape, y_TRAIN.shape)

<class 'numpy.ndarray'> <class 'numpy.ndarray'> <class 'pandas.core.series.Series'>
(784, 10) (524, 10) (784,)
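
The encode, scale, and concatenate steps can also be bundled into a single object with scikit-learn's `ColumnTransformer`. This is an alternative sketch rather than the notebook's approach; the frame `toy` is made up, and only the column names mirror the ones used above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Made-up frame with two categorical and two numeric columns.
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "embarked": ["S", "C", "S", "Q"],
    "fare": [7.25, 71.28, 8.05, 30.0],
    "fam": [0, 1, 0, 4],
})

# Each entry is (name, transformer, columns); outputs are concatenated in order.
pre = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex", "embarked"]),
        ("num", MinMaxScaler(), ["fare", "fam"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

features = pre.fit_transform(toy)
print(features.shape)
```

Calling `pre.transform` on a test frame then applies the categories and min/max values learned from the training frame.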


In [20]:
from sklearn.model_selection import train_test_split

xtrain, xtest, ytrain, ytest = train_test_split(X_TRAIN, y_TRAIN, test_size = 0.25, stratify=y_TRAIN, random_state=1234)

print(xtrain.shape, xtest.shape, ytrain.shape, ytest.shape)

(588, 10) (196, 10) (588,) (196,)
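
The `stratify` argument keeps the class ratio of the target the same in both parts of the split. A small sketch on made-up arrays (`X` and `y` here are synthetic, not the Titanic features) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 rows, 40% of them in the positive class.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 12 + [1] * 8)

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1234
)

# Both parts keep the 40% positive rate of the full target.
print(y_tr.mean(), y_va.mean())
```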


### 4. Modeling

In [21]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

In [22]:
def make_models(xtrain, xtest, ytrain, ytest):
    # Fit each candidate model and print its scores (get_scores is defined in section 5)
    # model1: logistic regression
    model1 = LogisticRegression().fit(xtrain, ytrain)
    print('model1', get_scores(model1, xtrain, xtest, ytrain, ytest))

    # model2: decision tree with unrestricted depth
    model2 = DecisionTreeClassifier(random_state=0).fit(xtrain, ytrain)
    print('model2', get_scores(model2, xtrain, xtest, ytrain, ytest))

    # model2: decision trees with max_depth from 3 to 7
    for d in range(3, 8):
        model2 = DecisionTreeClassifier(max_depth=d, random_state=0).fit(xtrain, ytrain)
        print('model2', d, get_scores(model2, xtrain, xtest, ytrain, ytest))

    # model3: random forest with default settings
    model3 = RandomForestClassifier(random_state=0).fit(xtrain, ytrain)
    print('model3', get_scores(model3, xtrain, xtest, ytrain, ytest))

    # model3: random forests with 500 trees and max_depth from 3 to 7
    for d in range(3, 8):
        model3 = RandomForestClassifier(500, max_depth=d, random_state=0).fit(xtrain, ytrain)
        print('model3', d, get_scores(model3, xtrain, xtest, ytrain, ytest))

    # model4: XGBoost classifier
    model4 = XGBClassifier(eval_metric='logloss').fit(xtrain, ytrain)
    print('model4', get_scores(model4, xtrain, xtest, ytrain, ytest))

### 5. Model Evaluation

In [23]:
from sklearn.metrics import roc_auc_score

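def get_scores(model, xtrain, xtest, ytrain, ytest):
    # A: accuracy on the training split
    A = model.score(xtrain, ytrain)

    # Predicted probability of the positive class (survived = 1) on the hold-out split
    ypred = model.predict_proba(xtest)[:, 1]

    # B: ROC AUC on the hold-out split
    B = roc_auc_score(ytest, ypred)

    # Both values are formatted to 4 significant digits
    return f'{A:.4} {B:.4}'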
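
The two numbers returned by `get_scores` measure different things: accuracy compares hard 0/1 predictions with the labels, while ROC AUC scores how well the predicted probabilities rank positives above negatives. A tiny hand-checkable sketch on made-up values:

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
proba = [0.1, 0.6, 0.4, 0.9]

# Hard predictions at the usual 0.5 threshold: [0, 1, 0, 1].
hard = [int(p >= 0.5) for p in proba]

# Accuracy: 2 of 4 labels match.
acc = accuracy_score(y_true, hard)

# AUC: 3 of the 4 (negative, positive) pairs are ranked correctly.
auc = roc_auc_score(y_true, proba)

print(acc, auc)
```

This is why `roc_auc_score` is given `predict_proba(...)[:, 1]` rather than the output of `predict`.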

In [24]:
make_models(xtrain, xtest, ytrain, ytest)

model1 0.7891 0.8382
model2 0.9286 0.7267
model2 3 0.8248 0.8014
model2 4 0.8265 0.7928
model2 5 0.8282 0.7929
model2 6 0.8401 0.7735
model2 7 0.8503 0.7617
model3 0.9286 0.7963
model3 3 0.8078 0.8266
model3 4 0.8282 0.8338
model3 5 0.8299 0.8245
model3 6 0.8588 0.8273
model3 7 0.8776 0.8206
model4 0.9184 0.8051
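
The comparison above relies on a single hold-out split, so small differences between depths may partly reflect that particular split. One hedged alternative is k-fold cross-validation with `cross_val_score`. The sketch below runs on synthetic data from `make_classification` (not the Titanic features), and the names `cv_auc` and `best_depth` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data for illustration only.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Mean 5-fold ROC AUC for each candidate depth.
cv_auc = {}
for depth in range(3, 6):
    model = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=0)
    cv_auc[depth] = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

# Pick the depth with the highest mean AUC.
best_depth = max(cv_auc, key=cv_auc.get)
print(cv_auc, best_depth)
```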


In [25]:
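# Refit the random forest at max_depth=4, the depth with the best hold-out AUC among the forests above
# Note: n_estimators is left at its default here, while the loop above used 500 trees
final_model = RandomForestClassifier(max_depth=4, random_state=0).fit(xtrain, ytrain)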

print('final model', get_scores(final_model, xtrain, xtest, ytrain, ytest))

final model 0.8248 0.8337


### 6. Save Result

In [26]:
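# Predict hard 0/1 labels for the competition test set
y_pred = final_model.predict(X_TEST)

# Pair each prediction with the ID saved before the ID column was dropped
obj = {'ID' : ID,
       'survived' : y_pred}

result = pd.DataFrame(obj)
result.to_csv("./result.csv", index = False)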
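
Before submitting, it can help to read the file back and confirm its shape and contents. The following is a minimal sketch with made-up IDs and predictions, written to a temporary directory so it does not overwrite `./result.csv`; the checks themselves are suggestions, not competition requirements.

```python
import os
import tempfile

import pandas as pd

# Made-up IDs and predictions standing in for ID and y_pred.
ids = pd.Series([786, 787, 788], name="ID")
preds = [1, 0, 0]
result = pd.DataFrame({"ID": ids, "survived": preds})

# Write without the index, then read the file back.
path = os.path.join(tempfile.mkdtemp(), "result.csv")
result.to_csv(path, index=False)
check = pd.read_csv(path)

# Basic sanity checks on the written file.
print(list(check.columns))
print(len(check), check["ID"].is_unique, check["survived"].isin([0, 1]).all())
```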