**EXERCISE**

Using the Titanic data:  
* find the best model and hyperparameters for predicting whether a passenger survives  
* context: predicting the likelihood that a person survives if the ship they are **about to** board sinks  
* models to try:  
  * logistic regression, decision tree classifier, KNeighborsClassifier
* pick the single best model based on its cross-validation results, then tune that model


In [7]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import confusion_matrix, recall_score, precision_score, classification_report, f1_score, mean_squared_error

import category_encoders as ce
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

In [2]:
df= pd.read_csv('titanic.csv')
df.head()

Unnamed: 0,sex,age,parch,fare,class,deck,embark_town,alive,alone
0,male,22.0,0,7.25,Third,,Southampton,no,False
1,female,38.0,0,71.2833,First,C,Cherbourg,yes,False
2,female,26.0,0,7.925,Third,,Southampton,yes,True
3,female,35.0,0,53.1,First,C,Southampton,yes,False
4,male,35.0,0,8.05,Third,,Southampton,no,True


In [3]:
x = df.loc[:,['sex', 'age', 'parch', 'fare', 'class','embark_town','alone']]
y = np.where(df['alive']=='yes',1,0)

In [6]:
x_train, x_test, y_train, y_test = train_test_split(x,y, stratify=y, test_size=.2, random_state=2020)

In [8]:
onehot = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('one hot', OneHotEncoder(drop='first'))
])

In [9]:
mapping = [{'col':'class',
           'mapping':{None:0, 'First':1, 'Second':2, 'Third':3}}]
ordinal = ce.OrdinalEncoder(mapping=mapping)

In [10]:
transformer = ColumnTransformer([
    ('One Hot', onehot, ['sex','embark_town','alone']),
    ('ordinal', ordinal, ['class']),
    ('impute', SimpleImputer(strategy='median'), ['age'])
], remainder='passthrough')

In [12]:
x_train_preprocessed= pd.DataFrame(transformer.fit_transform(x_train))
x_test_preprocessed= pd.DataFrame(transformer.transform(x_test))
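
The cells above fit the transformer once on the whole training set and then cross-validate on the already-transformed matrix, so the imputation statistics used in each fold also see that fold's validation rows. One way to avoid this is to put the preprocessing and the model into a single `Pipeline`, so that everything is refit inside every fold. The following is a minimal sketch only: the synthetic frame `toy`, the target `toy_y`, and the names `preprocess` and `pipe` are illustrative assumptions, not the Titanic data or the notebook's own objects.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption, for illustration only)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'sex': rng.choice(['male', 'female'], size=n),
    'age': rng.normal(30, 10, size=n),
    'fare': rng.gamma(2.0, 15.0, size=n),
})
# Knock out some ages so the imputer has work to do
toy.loc[rng.choice(n, size=20, replace=False), 'age'] = np.nan
toy_y = ((toy['sex'] == 'female') | (toy['fare'] > 60)).astype(int).to_numpy()

preprocess = ColumnTransformer([
    ('onehot', OneHotEncoder(drop='first'), ['sex']),
    ('impute', SimpleImputer(strategy='median'), ['age']),
], remainder='passthrough')

# Preprocessing and model in one estimator: both are refit on each training fold
pipe = Pipeline([
    ('preprocess', preprocess),
    ('model', DecisionTreeClassifier(max_depth=3, random_state=2020)),
])

scores = cross_val_score(pipe, toy, toy_y,
                         cv=StratifiedKFold(n_splits=5), scoring='precision')
print(scores.mean(), scores.std())
```

With this layout the raw `x_train` would be passed to `cross_val_score` directly, and the median for `age` is computed from the training part of each fold only.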

In [17]:
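# Column order follows the ColumnTransformer output: one-hot columns, class, age,
# then the passthrough columns (parch, fare).
# Note: newer scikit-learn releases replace get_feature_names() with get_feature_names_out().
features= list(transformer.transformers_[0][1][1].get_feature_names())+transformer.transformers_[1][1].get_feature_names()+['age','parch', 'fare']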

In [15]:
x_train_preprocessed.columns=features
x_test_preprocessed.columns=features

In [18]:
logreg= LogisticRegression(solver='liblinear', random_state=2020)
knn= KNeighborsClassifier()
tree= DecisionTreeClassifier(max_depth=3, criterion='entropy')

In [19]:
models = [logreg, knn, tree]
rata = []
standv = []
for i in models:
    skfold = StratifiedKFold(n_splits=5)
    model_cv = cross_val_score(i, x_train_preprocessed, y_train, cv=skfold, scoring='precision')  # precision, because false positives are the main concern
    rata.append(model_cv.mean())
    standv.append(model_cv.std())

In [20]:
pd.DataFrame({
    'model':['logreg','knn','tree'],
    'mean':rata,
    'std':standv
})

Unnamed: 0,model,mean,std
0,logreg,0.760383,0.064139
1,knn,0.620021,0.065225
2,tree,0.809648,0.048492


_Selected model: the decision tree, because it has the highest mean precision and the lowest standard deviation across folds._
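
Precision is the share of predicted positives that are truly positive, TP / (TP + FP), which is why it is the natural score when false positives are the main concern. The short sketch below computes it by hand from a confusion matrix and compares it with `precision_score`. The label vectors `y_true` and `y_hat` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

# Made-up labels (assumption): 1 = survived, 0 = did not survive
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 1, 1, 0, 0, 0, 1, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

manual_precision = tp / (tp + fp)
print(manual_precision)                 # 0.75
print(precision_score(y_true, y_hat))   # 0.75
```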

In [None]:
# Model Performance in Test Set

In [21]:
# Note: the cross-validated tree above used criterion='entropy'; this one uses the default 'gini'
tree = DecisionTreeClassifier(max_depth=3)
tree.fit(x_train_preprocessed, y_train)

DecisionTreeClassifier(max_depth=3)

In [22]:
y_pred = tree.predict(x_test_preprocessed)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.80      0.85      0.82       110
           1       0.73      0.67      0.70        69

    accuracy                           0.78       179
   macro avg       0.77      0.76      0.76       179
weighted avg       0.77      0.78      0.77       179



In [23]:
# Hyperparameter tuning

In [58]:
from sklearn.model_selection import RandomizedSearchCV

In [59]:
hyperparam_space= {
    'min_samples_leaf':[1,5,10,15,20,50], #benchmark 1
    'max_depth':[2,3,4,5,6,7] #benchmark 5
}

In [60]:
tree = DecisionTreeClassifier(max_depth=5, random_state=2020)

random_search = RandomizedSearchCV(  
    tree,
    param_distributions=hyperparam_space,
    n_iter=20,  # number of sampled candidates: random search draws 20 of the 36 possible combinations (6 x 6)
    cv=5,
    scoring='neg_mean_squared_error',  # metric; for 0/1 labels, MSE equals the misclassification rate (1 - accuracy)
    random_state=2020,
    n_jobs= -1
)
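
The size of the search space can be checked directly. This is a small sketch using `ParameterGrid` on the same two lists as `hyperparam_space`; the variable name `n_combinations` is introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

hyperparam_space = {
    'min_samples_leaf': [1, 5, 10, 15, 20, 50],
    'max_depth': [2, 3, 4, 5, 6, 7],
}

# ParameterGrid enumerates every combination: 6 values x 6 values
n_combinations = len(ParameterGrid(hyperparam_space))
print(n_combinations)  # 36
```

With `n_iter=20`, the random search therefore evaluates a little over half of the full grid.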

In [61]:
random_search.fit(x_train_preprocessed, y_train)

RandomizedSearchCV(cv=5,
                   estimator=DecisionTreeClassifier(max_depth=5,
                                                    random_state=2020),
                   n_iter=20, n_jobs=-1,
                   param_distributions={'max_depth': [2, 3, 4, 5, 6, 7],
                                        'min_samples_leaf': [1, 5, 10, 15, 20,
                                                             50]},
                   random_state=2020, scoring='neg_mean_squared_error')

In [62]:
print('best score', random_search.best_score_)
print('best param', random_search.best_params_)

best score -0.1797202797202797
best param {'min_samples_leaf': 10, 'max_depth': 4}
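
For 0/1 labels every squared error is either 0 or 1, so the mean squared error is simply the misclassification rate. A best score of about -0.180 therefore corresponds to a cross-validated accuracy of about 0.820. The sketch below checks the identity on made-up label vectors (`y_true` and `y_hat` are illustrative, not taken from the notebook).

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Made-up binary labels (assumption): 2 of the 10 predictions are wrong
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

mse = mean_squared_error(y_true, y_hat)
acc = accuracy_score(y_true, y_hat)

print(mse)        # 0.2
print(1 - acc)    # the same value, up to floating-point rounding
```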


In [63]:
# Before and After Tuning

In [65]:
tree = DecisionTreeClassifier(max_depth=5, random_state=2020)
tree.fit(x_train_preprocessed, y_train)
y_pred = tree.predict(x_test_preprocessed)
print('mse:', mean_squared_error(y_test, y_pred))

mse: 0.20670391061452514


In [67]:
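tree_final = random_search.best_estimator_
tree_final.fit(x_train_preprocessed, y_train)  # redundant but harmless: best_estimator_ is already refit on the full training set (refit=True by default)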
y_pred = tree_final.predict(x_test_preprocessed)
print('mse', mean_squared_error(y_test, y_pred))

mse 0.22346368715083798


_The test-set MSE before tuning (0.207) is lower, and therefore better, than after tuning (0.223), so tuning did not improve test-set performance here._
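
The models were compared on precision, but the search above optimizes MSE. One option is to tune on the same metric that was used for model selection. The following is a rough sketch of that idea on synthetic data from `make_classification`, not on the Titanic data. The names `X_tr`, `X_te`, `search` and `test_precision` are assumptions introduced for the example, and its results say nothing about the Titanic models.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption, for illustration only)
X, y = make_classification(n_samples=500, n_features=6, n_informative=4,
                           random_state=2020)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.2,
                                          random_state=2020)

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=2020),
    param_distributions={
        'min_samples_leaf': [1, 5, 10, 15, 20, 50],
        'max_depth': [2, 3, 4, 5, 6, 7],
    },
    n_iter=20,
    cv=StratifiedKFold(n_splits=5),
    scoring='precision',   # same metric as the model comparison
    random_state=2020,
)
search.fit(X_tr, y_tr)

# best_estimator_ is already refit on all of X_tr, so it can predict directly
test_precision = precision_score(y_te, search.best_estimator_.predict(X_te))
print(search.best_params_, search.best_score_, test_precision)
```

Comparing the benchmark and tuned trees on test-set precision would then keep the whole exercise on a single metric.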