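## Titanic Dataset Challenge

- The goal is to predict whether a passenger survived, based on attributes such as age, sex, passenger class, and port of embarkation.

- Download `train.csv` and `test.csv` from the [Titanic challenge](https://www.kaggle.com/c/titanic) on [Kaggle](https://www.kaggle.com).
- Save the two files in the `datasets` directory as `titanic_train.csv` and `titanic_test.csv`.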
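
The renaming step can be scripted. The following is only a sketch: `stage_titanic_files` is a hypothetical helper (not part of the original notebook), and the demo at the bottom uses throwaway files with made-up contents so that it runs without the real downloads.

```python
import shutil
import tempfile
from pathlib import Path


def stage_titanic_files(download_dir, datasets_dir="datasets"):
    """Copy train.csv and test.csv into datasets_dir with a titanic_ prefix."""
    download_dir = Path(download_dir)
    datasets_dir = Path(datasets_dir)
    datasets_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in ("train.csv", "test.csv"):
        src = download_dir / name
        if not src.exists():
            continue  # skip a file that has not been downloaded yet
        dst = datasets_dir / f"titanic_{name}"
        shutil.copy(src, dst)
        copied.append(dst)
    return copied


# Demo in a temporary directory (file contents are placeholders, not real data).
with tempfile.TemporaryDirectory() as tmp:
    downloads = Path(tmp) / "downloads"
    downloads.mkdir()
    (downloads / "train.csv").write_text("PassengerId,Survived\n1,0\n")
    (downloads / "test.csv").write_text("PassengerId\n892\n")
    copied = stage_titanic_files(downloads, Path(tmp) / "datasets")
    copied_names = [path.name for path in copied]
```

With the real files, the call would point at wherever the browser saved the Kaggle downloads.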

### 1. Loading the Data

In [1]:
import pandas as pd
train_data = pd.read_csv("datasets/titanic_train.csv")
test_data = pd.read_csv("datasets/titanic_test.csv")

### 2. Exploring the Data

#### Inspecting train_data

In [2]:
train_data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


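* **Survived**: the target. 0 means the passenger did not survive; 1 means the passenger survived.
* **Pclass**: passenger class (1st, 2nd, or 3rd).
* **Name**, **Sex**, **Age**: self-explanatory.
* **SibSp**: number of siblings and spouses aboard with the passenger.
* **Parch**: number of children and parents aboard with the passenger.
* **Ticket**: ticket ID.
* **Fare**: ticket price (in pounds).
* **Cabin**: cabin number.
* **Embarked**: port where the passenger boarded: C (Cherbourg), Q (Queenstown), S (Southampton).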


#### Checking for missing data

In [3]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


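- Some values of **Age**, **Cabin**, and **Embarked** are null.
- **Cabin** in particular is 77% null, so ignore it for now and work with the remaining attributes.
- **Age** has 177 nulls (about 20%), so we need to decide how to handle them. Replacing the nulls with the median age is one option.
- **Name** and **Ticket** are somewhat tricky to convert to numbers, so ignore them for now.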
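
The percentages above can be read off directly rather than computed by hand from `info()`. A minimal sketch, using a made-up five-row frame (`toy` is illustrative, not real Titanic rows); on the real data the same two calls would be applied to `train_data`:

```python
import numpy as np
import pandas as pd

# Made-up miniature of the training set (not real Titanic rows).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Embarked": ["S", "C", "S", None, "S"],
})

# Number of missing values per column.
null_count = toy.isnull().sum()

# Fraction of missing values per column (mean of a boolean mask).
null_fraction = toy.isnull().mean()
print(null_fraction)
```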

#### Summary statistics

In [4]:
train_data.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


* Only 38% **Survived**.
* The mean **Fare** is 32.20 pounds.
* The mean **Age** is just under 30.

#### Check that Survived (the machine learning target) contains only 0 and 1

In [5]:
train_data['Survived'].value_counts()

0    549
1    342
Name: Survived, dtype: int64

#### Check the categorical features

In [6]:
train_data['Pclass'].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

In [7]:
train_data['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [8]:
train_data['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

The **Embarked** feature is the port where the passenger boarded: C=Cherbourg, Q=Queenstown, S=Southampton.
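
Note that the **Embarked** counts add up to 889, not 891, because `value_counts()` skips missing values by default. A small sketch on a made-up series (the values are illustrative) shows how to include the missing entries and how to get proportions instead of counts:

```python
import numpy as np
import pandas as pd

# Made-up Embarked values with one missing entry.
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S"], name="Embarked")

counts = embarked.value_counts()                        # NaN is excluded
counts_with_nan = embarked.value_counts(dropna=False)   # NaN gets its own row
shares = embarked.value_counts(normalize=True)          # proportions of non-null values

print(counts_with_nan)
```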

### 3. Preprocessing Pipeline

* Separate the features from the labels

In [9]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

In [10]:
data = train_data.drop('Survived', axis=1)
label = train_data['Survived'].copy()

In [11]:
num_attribs = ['Age', 'SibSp', 'Parch', 'Fare']
cat_attribs = ['Pclass', 'Sex', 'Embarked']

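* A custom transformer for the pipeline

In [12]:
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

col_names = "SibSp", "Parch"
# Column indices of SibSp and Parch within num_attribs
SibSp_ix, Parch_ix = [num_attribs.index(c) for c in col_names]

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Nothing to learn; y is accepted for scikit-learn API compatibility
        return self

    def transform(self, X):
        # Append a RelativesOnboard column: SibSp + Parch + 1
        RelativesOnboard = X[:, SibSp_ix] + X[:, Parch_ix] + 1
        return np.c_[X, RelativesOnboard]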
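
To see what such a transformer produces, here is a self-contained sketch. `RelativesAdder` is a hypothetical variant that takes the column positions as constructor arguments instead of reading module-level indices; the input rows are made up.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class RelativesAdder(BaseEstimator, TransformerMixin):
    """Hypothetical variant: column positions are constructor arguments."""

    def __init__(self, sibsp_ix=1, parch_ix=2):
        self.sibsp_ix = sibsp_ix
        self.parch_ix = parch_ix

    def fit(self, X, y=None):
        return self  # stateless: nothing is learned from the data

    def transform(self, X):
        X = np.asarray(X)
        relatives = X[:, self.sibsp_ix] + X[:, self.parch_ix] + 1
        return np.c_[X, relatives]  # append the new column on the right


# Columns: Age, SibSp, Parch, Fare (made-up values)
X = np.array([[22.0, 1, 0, 7.25],
              [38.0, 1, 2, 71.28]])
X_added = RelativesAdder().fit_transform(X)
print(X_added)
```

Keeping the indices as constructor arguments means the transformer no longer depends on a global variable, at the cost of having to pass the positions explicitly.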

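* Building the categorical pipeline

In [13]:
# 1. Replace missing values with the most frequent value
# 2. One-hot encode

In [14]:
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    # sparse_output requires scikit-learn >= 1.2; older versions use sparse=False
    ('cat', OneHotEncoder(sparse_output=False))
])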

In [16]:
type(data[cat_attribs])

pandas.core.frame.DataFrame

In [None]:
tmp = cat_pipeline.fit_transform(data[cat_attribs])

In [None]:
tmp.shape
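
Since **Pclass**, **Sex**, and **Embarked** have three, two, and three distinct values, `tmp.shape` should come out as `(891, 8)`. The sketch below reproduces the two steps on a made-up four-row frame (`toy` and `toy_cat_pipeline` are illustrative names). It keeps the encoder's default sparse output and converts with `.toarray()`, so it does not depend on the `sparse`/`sparse_output` parameter name.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up rows; one Embarked value is missing.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "S"],
}, dtype=object)

toy_cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),  # NaN -> "S"
    ("cat", OneHotEncoder()),                              # sparse matrix by default
])

encoded = toy_cat_pipeline.fit_transform(toy).toarray()
categories = toy_cat_pipeline.named_steps["cat"].categories_

print(categories)  # one array of category values per input column
print(encoded)     # one 0/1 column per category
```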

* Building the numerical pipeline

In [None]:
# 1. Replace missing values with the median

In [None]:
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('attribs_adder', CombinedAttributesAdder())
])

In [None]:
tmp2 = num_pipeline.fit_transform(data[num_attribs])

In [None]:
tmp2.shape
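
With four numerical attributes plus the added column, `tmp2.shape` should be `(891, 5)`. One point worth illustrating is that the imputer learns its medians in `fit` and reuses them in `transform`, which is what later lets the test set be filled with training-set medians. A sketch on made-up values:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Columns: Age, SibSp, Parch, Fare (made-up values)
X = np.array([[22.0, 1, 0, 7.25],
              [np.nan, 0, 0, 8.05],
              [38.0, 1, 2, 71.28],
              [26.0, 0, 0, np.nan]])

imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X)   # learns one median per column, then fills

# A new row is filled with the medians learned above, not with its own statistics.
X_new = np.array([[np.nan, 2, 1, np.nan]])
X_new_filled = imputer.transform(X_new)

print(imputer.statistics_)
print(X_new_filled)
```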

* Categorical pipeline + numerical pipeline

In [None]:
from sklearn.compose import ColumnTransformer

In [None]:
full_pipeline = ColumnTransformer([
    ('cat', cat_pipeline, cat_attribs),
    ('num', num_pipeline, num_attribs)
])

In [None]:
data_prepared = full_pipeline.fit_transform(data)

In [None]:
data_prepared.shape
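
`ColumnTransformer` applies each transformer to its own columns and concatenates the results in the order the transformers are listed, so here the one-hot columns come first and the numerical columns last (13 columns in total would be expected: 8 + 5). A minimal sketch on a made-up frame; `toy_transformer` is an illustrative name, and `sparse_threshold=0` is set only to force a dense array:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows with one missing Age.
toy = pd.DataFrame({
    "Sex": pd.Series(["male", "female", "female"], dtype=object),
    "Age": [22.0, np.nan, 26.0],
    "Fare": [7.25, 71.28, 7.92],
})

toy_transformer = ColumnTransformer([
    ("cat", OneHotEncoder(), ["Sex"]),
    ("num", SimpleImputer(strategy="median"), ["Age", "Fare"]),
], sparse_threshold=0)  # always return a dense array

prepared = toy_transformer.fit_transform(toy)

# Output columns: Sex_female, Sex_male, Age, Fare
print(prepared)
```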

In [None]:
data.head(10)

In [None]:
# Reference
# columns = []
# cat_encoder = full_pipeline.named_transformers_["cat"]["cat"]

# for i in range(len(cat_encoder.categories_)):
#     columns.extend(cat_encoder.categories_[i])
# columns

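* Method 1: get the column names from the one-hot encoder

In [None]:
cat_encoder = full_pipeline.named_transformers_["cat"]["cat"]
# get_feature_names_out requires scikit-learn >= 1.0; older versions use get_feature_names
columns = list(cat_encoder.get_feature_names_out(cat_attribs))
columns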

In [None]:
data = pd.DataFrame(data_prepared, columns = columns+num_attribs+['RelativesOnboard'])
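
If the feature-name helper is unavailable in the installed scikit-learn version, the same names can be assembled from the encoder's `categories_` attribute. This is a sketch on a made-up frame, and the `attribute_category` naming pattern is an assumption chosen to mirror the helper's output:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

cat_attribs = ["Pclass", "Sex"]

# Made-up rows.
toy = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": pd.Series(["male", "female", "female"], dtype=object),
})

encoder = OneHotEncoder().fit(toy[cat_attribs])

# categories_ holds one sorted array of category values per input column.
columns = [
    f"{attrib}_{category}"
    for attrib, cats in zip(cat_attribs, encoder.categories_)
    for category in cats
]
print(columns)
```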

* Method 2: specify the column names by hand

In [None]:
columns = ['Pclass_1', 'Pclass_2', 'Pclass_3', 'female', 'male', 'Embarked_C', 'Embarked_Q', 'Embarked_S'] + num_attribs + ['RelativesOnboard']
pd.DataFrame(data_prepared, columns=columns)

In [None]:
data

### 4. Model Selection, Training, and Evaluation (Cross-Validation)

* Accuracy / precision / recall / F1 score / ROC
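
As a reminder of how these metrics relate, the sketch below derives accuracy, precision, recall, and F1 from the four cells of a confusion matrix and compares them with scikit-learn's functions. The labels and predictions are made up.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up true labels and predictions (1 = survived).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 0, 0]

# For binary labels the matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted survivors, how many survived
recall = tp / (tp + fn)      # of the actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```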

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

In [None]:
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2)
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall", linewidth=2)
    plt.legend(loc="center right", fontsize=16)
    plt.xlabel("Threshold", fontsize=16)
    plt.grid(True)
    # Thresholds span each model's own score range (probabilities lie in [0, 1]),
    # so only the y-axis is fixed here
    plt.ylim(0, 1)

In [None]:
def training(clf, X, y, cv, scoring, method):
    """Cross-validate clf, print classification metrics, and plot precision/recall curves.

    method is passed to cross_val_predict to obtain scores:
    "predict_proba" or "decision_function".
    """
    # Fit on the full training set (the cross-validation calls below refit clones per fold)
    clf.fit(X, y)
    print("cross_val_score :" , cross_val_score(clf, X, y, cv=cv, scoring=scoring))
    # Out-of-fold class predictions for the metrics below
    y_pred = cross_val_predict(clf, X, y, cv=cv)
    conf_mx = confusion_matrix(y, y_pred)
    print("confusion_matrix : ", conf_mx, sep="\n")
    print("accuracy : ", accuracy_score(y, y_pred))
    print("precision : ", precision_score(y, y_pred))
    print("recall : ", recall_score(y, y_pred))
    print("f1 score : ", f1_score(y, y_pred))
    # Out-of-fold scores for the threshold-based curves
    y_scores = cross_val_predict(clf, X, y, cv=cv, method=method)
    if method == "predict_proba":
        # predict_proba returns one column per class; column 1 is P(Survived = 1)
        precisions, recalls, thresholds = precision_recall_curve(y, y_scores[:, 1])
        plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
        print("roc_auc_score : " , roc_auc_score(y, y_scores[:, 1]))
        fpr, tpr, thresholds = roc_curve(y, y_scores[:, 1])
        plt.figure(figsize=(8, 6))
        plot_roc_curve(fpr, tpr)
        plt.show()
    else:
        # decision_function already returns one score per sample
        precisions, recalls, thresholds = precision_recall_curve(y, y_scores)
        plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
        print("roc_auc_score : " , roc_auc_score(y, y_scores))
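
The two branches exist because the scoring methods return differently shaped arrays. A short sketch on synthetic data (generated with `make_classification`, not the Titanic features) makes the difference visible:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_predict

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# decision_function: one signed score per sample, shape (n_samples,)
sgd_scores = cross_val_predict(SGDClassifier(random_state=42), X, y,
                               cv=3, method="decision_function")

# predict_proba: one probability per class, shape (n_samples, 2)
forest_proba = cross_val_predict(
    RandomForestClassifier(n_estimators=20, random_state=42), X, y,
    cv=3, method="predict_proba")

# The positive-class column is what the curve functions need.
forest_scores = forest_proba[:, 1]

print(sgd_scores.shape, forest_proba.shape, forest_scores.shape)
```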

In [None]:
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')  # dashed diagonal: a purely random classifier
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate (Recall)', fontsize=16)
    plt.grid(True)


In [None]:
sgd_clf = SGDClassifier(random_state=42)
training(sgd_clf, data, label, 3, "accuracy", "decision_function")

In [None]:
forest_clf = RandomForestClassifier(random_state=42)
training(forest_clf, data, label, 3, "accuracy", "predict_proba")

In [None]:
svm_clf = SVC(gamma='auto', random_state=42)
training(svm_clf, data, label, 3, "accuracy", "decision_function")

In [None]:
knn_clf = KNeighborsClassifier()
training(knn_clf, data, label, 3, "accuracy", "predict_proba")

* Sign up for Kaggle -> log in -> Submit Predictions -> upload the generated submission file

* Final performance evaluation

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
param_grid = [{'n_estimators':[50, 100, 200]}]

In [None]:
# Survived is a class label, so score with accuracy rather than a regression metric
grid_search = GridSearchCV(forest_clf, param_grid, scoring='accuracy', cv=5, return_train_score=True, n_jobs=-1)
grid_search.fit(data, label)

In [None]:
grid_search.best_estimator_

In [None]:
cvres = grid_search.cv_results_

# Mean cross-validated accuracy for each parameter combination
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(mean_score, params)
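
A note on the choice of scoring: for 0/1 labels and hard 0/1 predictions, every squared error is either 0 or 1, so the mean squared error equals the misclassification rate, i.e. 1 minus accuracy. Ranking candidates by `neg_mean_squared_error` or by `accuracy` therefore selects the same model; accuracy is simply easier to read. A small check on made-up labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Made-up labels and predictions; two of the eight predictions are wrong.
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1])

mse = mean_squared_error(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)

print(mse, 1 - accuracy)
```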

In [None]:
final_model = grid_search.best_estimator_

In [None]:
X_test = full_pipeline.transform(test_data)

In [None]:
final_predictions = final_model.predict(X_test)

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
tree_clf = DecisionTreeClassifier(random_state=42)
training(tree_clf, data, label, 10, "accuracy", "predict_proba")

In [None]:
param_grid2 = [{'max_depth':[3, 5, 7, 10]}]

In [None]:
# Score the classifier with accuracy rather than a regression metric
grid_search = GridSearchCV(tree_clf, param_grid2, scoring='accuracy', cv=5, return_train_score=True, n_jobs=-1)
grid_search.fit(data, label)

In [None]:
final_model = grid_search.best_estimator_
final_predictions = final_model.predict(X_test)

In [None]:
param_grid3 = [{'n_estimators' : [10, 20, 30, 40, 50, 100], 'max_depth':[10, 13, 15]}]

In [None]:
# Score the classifier with accuracy rather than a regression metric
grid_search = GridSearchCV(forest_clf, param_grid3, scoring='accuracy', cv=5, return_train_score=True, n_jobs=-1)
grid_search.fit(data, label)

In [None]:
final_model = grid_search.best_estimator_
final_predictions = final_model.predict(X_test)

In [None]:
submission = pd.read_csv("datasets/gender_submission.csv")

submission["Survived"] = final_predictions

ver = 3
model = "forest"
submission.to_csv("datasets/ver_{0}_{1}_submission.csv".format(ver, model), index=False)
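
The submission file has two columns, `PassengerId` and `Survived`, so it can also be built directly instead of overwriting `gender_submission.csv`. The sketch below uses made-up ids and predictions and writes to an in-memory buffer; in the notebook, the ids would presumably come from `test_data["PassengerId"]` and the predictions from `final_predictions`.

```python
import io

import numpy as np
import pandas as pd

# Illustrative ids and predictions (stand-ins for the real test set).
test_ids = pd.Series([892, 893, 894], name="PassengerId")
predictions = np.array([0, 1, 0])

submission = pd.DataFrame({
    "PassengerId": test_ids,
    "Survived": predictions.astype(int),  # integer 0/1 labels
})

# Write to a buffer here; pass a file path instead to create the CSV on disk.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```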