Using the Titanic data:
* find the best model and hyperparameters for predicting whether a passenger survives.
* context: predict the chance that someone survives if the ship they are **about to** board sinks
* models to try:
    - logreg
    - dtc
    - knn classifier
* pick the single best model from the cross-validation results, then tune that model

Submit to Brigita.gems@gmail.com with the subject: algorithm chain

This covers everything from the very beginning through hyperparameter tuning.

In [1]:
import pandas as pd
import numpy as np


In [2]:
titanic=pd.read_csv('titanic.csv')
titanic.head()

Unnamed: 0,sex,age,parch,fare,class,deck,embark_town,alive,alone
0,male,22.0,0,7.25,Third,,Southampton,no,False
1,female,38.0,0,71.2833,First,C,Cherbourg,yes,False
2,female,26.0,0,7.925,Third,,Southampton,yes,True
3,female,35.0,0,53.1,First,C,Southampton,yes,False
4,male,35.0,0,8.05,Third,,Southampton,no,True


In [3]:
titanic=titanic.drop(columns='deck')

# Preprocessing

In [4]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.pipeline import Pipeline
import category_encoders as ce
from sklearn.compose import ColumnTransformer

In [5]:
titanic.isna().sum()

sex              0
age            177
parch            0
fare             0
class            0
embark_town      2
alive            0
alone            0
dtype: int64

    - age : simple imputer, median
    - missing values in embark_town : mode
    - sex, embark_town, alone : one-hot encoding
    - class : ordinal encoding

In [6]:
pipe=Pipeline([('mod',SimpleImputer(strategy='most_frequent')),
               ('OH', OneHotEncoder(drop='first'))
              ])

mapping=[{'col':'class','mapping':{None:0,'First':1,'Second':2,'Third':3}}]
ordinal=ce.OrdinalEncoder(mapping=mapping)

transformer=ColumnTransformer([
    ('imputer', SimpleImputer(strategy='median'),['age']),
    ('One Hot', pipe,['sex','embark_town', 'alone']),
    ('ordinal', ordinal,['class']), 
], remainder='passthrough')


# Data Splitting

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [8]:
x = titanic.drop(columns=['alive'])
y = np.where(titanic['alive']=='yes',1,0)

In [9]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2,stratify=y,random_state = 2020)

# Data Transform

In [10]:
# Fit the transformer on the training split only, then reuse it on the test split
x_train_preprocessed=pd.DataFrame(transformer.fit_transform(x_train))
x_test_preprocessed=pd.DataFrame(transformer.transform(x_test))
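
A common convention is to learn preprocessing statistics from the training split and only apply them to the test split. The sketch below illustrates this with a standalone `SimpleImputer`; the arrays `train_age` and `test_age` are hypothetical toy values, not rows from `titanic.csv`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy stand-ins for an 'age' column (hypothetical values, not from titanic.csv).
train_age = np.array([[20.0], [30.0], [np.nan], [40.0]])
test_age = np.array([[np.nan], [80.0]])

imputer = SimpleImputer(strategy='median')

# fit_transform learns the median of the observed training values (20, 30, 40)
# and fills the missing training entry with it.
train_filled = imputer.fit_transform(train_age)

# transform reuses the training median; nothing is relearned from the test split.
test_filled = imputer.transform(test_age)

print(imputer.statistics_)   # median learned from the training split
print(test_filled.ravel())   # missing test value filled with the training median
```

Calling `fit_transform` on the test split instead would recompute the median from the test rows, so the two splits would no longer share the same preprocessing.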

Context: predict the chance that someone survives if the ship they are **about to** board sinks

1 = survives,
0 = does not survive

FP = predicted to survive but actually does not --> the person still boards the ship, so the number of surviving (prospective) passengers goes down

FN = predicted not to survive but actually survives --> the person does not board the ship, so the number of (prospective) passengers who stay alive at least does not go down

So the goal is to minimize FP --> use Precision
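
To make the link between false positives and precision concrete, here is a minimal sketch that computes precision by hand from a confusion matrix and compares it with `precision_score`. The label arrays are hypothetical and only serve to illustrate the formula precision = TP / (TP + FP).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

# Hypothetical labels: 1 = survives, 0 = does not survive.
y_true_demo = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred_demo = np.array([1, 1, 0, 1, 0, 0, 0, 1])

# For binary labels 0 and 1, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# Precision only involves predicted positives, so every false positive lowers it.
precision_manual = tp / (tp + fp)
precision_sklearn = precision_score(y_true_demo, y_pred_demo)

print('TP:', tp, 'FP:', fp)
print('manual precision:', precision_manual)
print('sklearn precision:', precision_sklearn)
```

False negatives do not appear in the formula, which is why precision fits a setting where FP is the costlier error.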

# Benchmark Model

Models to try:
* logreg
* dtc
* knn classifier

In [54]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.metrics import confusion_matrix, recall_score, precision_score, classification_report, f1_score

In [28]:
logreg=LogisticRegression(solver='liblinear', random_state=2020)
tree=DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=2020)
knn=KNeighborsClassifier(n_neighbors=5, weights='distance')

In [31]:
models=[logreg,tree,knn]
score=[]
mean=[]
stdev=[]

for i in models:
    skfold=StratifiedKFold(n_splits=5)
    estimator=Pipeline([
        ('preprocess',transformer),
        ('model',i)
    ])
    model_cv=cross_val_score(estimator, x_train, y_train, cv=skfold, scoring='precision')    
    
    score.append(model_cv)
    mean.append(model_cv.mean())
    stdev.append(model_cv.std())

In [33]:
pd.DataFrame({
    'Model':['logreg','tree','knn'],
    'score':score,
    'mean':mean,
    'stdev':stdev
})

Unnamed: 0,Model,score,mean,stdev
0,logreg,"[0.6530612244897959, 0.7843137254901961, 0.804...",0.756809,0.061015
1,tree,"[0.7435897435897436, 0.8095238095238095, 0.864...",0.810111,0.041742
2,knn,"[0.5348837209302325, 0.7619047619047619, 0.583...",0.634834,0.076761


> Here `Tree` has the highest _mean_ and the lowest _stdev_, so it is both the best-performing and the most stable of the three models. The next step is to tune the `Tree` model.
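
The same selection can be done programmatically. This is a small sketch that re-enters the mean and stdev values from the table above into a new frame (the name `cv_summary` is an assumption for illustration) and ranks the models by mean precision, using the lower stdev as a tie-breaker.

```python
import pandas as pd

# Mean and standard deviation copied from the cross-validation table above.
cv_summary = pd.DataFrame({
    'Model': ['logreg', 'tree', 'knn'],
    'mean': [0.756809, 0.810111, 0.634834],
    'stdev': [0.061015, 0.041742, 0.076761],
})

# Higher mean precision is better; lower stdev means more stable folds.
ranked = cv_summary.sort_values(['mean', 'stdev'], ascending=[False, True])
best_name = ranked.iloc[0]['Model']

print(ranked)
print('selected model:', best_name)
```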

In [74]:
tree=DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=2020)
# Name the pipeline `estimator` so the fit and predict calls below use this tree pipeline
estimator=Pipeline([
        ('preprocess',transformer),
        ('model',tree)
])

estimator.fit(x_train, y_train)

Pipeline(steps=[('preprocess',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('imputer',
                                                  SimpleImputer(strategy='median'),
                                                  ['age']),
                                                 ('One Hot',
                                                  Pipeline(steps=[('mod',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('OH',
                                                                   OneHotEncoder(drop='first'))]),
                                                  ['sex', 'embark_town',
                                                   'alone']),
                                                 ('ordinal',
                                                  OrdinalEncoder(mapping=[{'col': 'class',
 

In [75]:
y_pred=estimator.predict(x_test)

In [76]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.76      0.94      0.84       110
           1       0.84      0.54      0.65        69

    accuracy                           0.78       179
   macro avg       0.80      0.74      0.75       179
weighted avg       0.79      0.78      0.77       179



## tuning

In [77]:
tree=DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=2020)
# Name the pipeline `estimator` because get_params() and GridSearchCV below refer to it
estimator=Pipeline([
        ('preprocess',transformer),
        ('model',tree)
])

In [78]:
estimator.get_params()

{'memory': None,
 'steps': [('preprocess',
   ColumnTransformer(remainder='passthrough',
                     transformers=[('imputer', SimpleImputer(strategy='median'),
                                    ['age']),
                                   ('One Hot',
                                    Pipeline(steps=[('mod',
                                                     SimpleImputer(strategy='most_frequent')),
                                                    ('OH',
                                                     OneHotEncoder(drop='first'))]),
                                    ['sex', 'embark_town', 'alone']),
                                   ('ordinal',
                                    OrdinalEncoder(mapping=[{'col': 'class',
                                                             'mapping': {None: 0,
                                                                         'First': 1,
                                                                         'Sec

In [100]:
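hyperparam_space={'model__criterion':['gini', 'entropy'],
                  'model__max_depth':[3,4,5,6,7],
                  # 'auto' was an alias of 'sqrt' for classifiers and is rejected by newer scikit-learn releases
                  'model__max_features':['sqrt', 'log2'],
                  'model__splitter':['best','random'],
                  'model__min_samples_leaf':[1,2,3,4,5],
                  # an integer min_samples_split must be at least 2
                  'model__min_samples_split':[2,3,4,5]
                 }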

In [101]:
skfold=StratifiedKFold(n_splits=5)
grid_search=GridSearchCV(
    estimator,
    param_grid=hyperparam_space,
    cv=skfold,
    scoring='precision',
    n_jobs=-1
)

In [102]:
grid_search.fit(x_train, y_train)

GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=None, shuffle=False),
             estimator=Pipeline(steps=[('preprocess',
                                        ColumnTransformer(remainder='passthrough',
                                                          transformers=[('imputer',
                                                                         SimpleImputer(strategy='median'),
                                                                         ['age']),
                                                                        ('One '
                                                                         'Hot',
                                                                         Pipeline(steps=[('mod',
                                                                                          SimpleImputer(strategy='most_frequent')),
                                                                                         ('OH',
                         

In [114]:
print('best score:',grid_search.best_score_)
print('best parameters:',grid_search.best_params_)

best score: 0.8930412436225528
best parameters: {'model__criterion': 'gini', 'model__max_depth': 3, 'model__max_features': 'log2', 'model__min_samples_leaf': 1, 'model__min_samples_split': 2, 'model__splitter': 'random'}
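
Beyond `best_score_` and `best_params_`, a fitted `GridSearchCV` keeps per-candidate results in `cv_results_`, which helps to check how close the runners-up are. The sketch below is self-contained and runs on synthetic data from `make_classification` with a deliberately tiny grid; `X_demo`, `y_demo`, `demo_grid`, and `demo_search` are illustrative names, not part of the Titanic pipeline.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the notebook itself searches over the Titanic pipeline.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=2020)

# A deliberately small grid so the example runs quickly.
demo_grid = {'max_depth': [3, 5], 'criterion': ['gini', 'entropy']}
demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=2020),
    param_grid=demo_grid,
    cv=StratifiedKFold(n_splits=5),
    scoring='precision',
)
demo_search.fit(X_demo, y_demo)

# One row per parameter combination, sorted so the best-ranked candidate comes first.
results = pd.DataFrame(demo_search.cv_results_)
top = results.sort_values('rank_test_score')[
    ['params', 'mean_test_score', 'std_test_score']
]
print(top.head())
```

Looking at `std_test_score` next to `mean_test_score` shows whether the winning combination is clearly ahead or within fold-to-fold noise of the others.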


## comparison before vs after tuning

In [130]:
from sklearn.metrics import precision_score

#### before tuning

In [127]:
tree=DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=2020)
estimator=Pipeline([
        ('preprocess',transformer),
        ('model',tree)
])
estimator.fit(x_train, y_train)

Pipeline(steps=[('preprocess',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('imputer',
                                                  SimpleImputer(strategy='median'),
                                                  ['age']),
                                                 ('One Hot',
                                                  Pipeline(steps=[('mod',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('OH',
                                                                   OneHotEncoder(drop='first'))]),
                                                  ['sex', 'embark_town',
                                                   'alone']),
                                                 ('ordinal',
                                                  OrdinalEncoder(mapping=[{'col': 'class',
 

In [128]:
y_predBef=estimator.predict(x_test)
print(classification_report(y_test,y_predBef))

              precision    recall  f1-score   support

           0       0.76      0.94      0.84       110
           1       0.84      0.54      0.65        69

    accuracy                           0.78       179
   macro avg       0.80      0.74      0.75       179
weighted avg       0.79      0.78      0.77       179



In [131]:
before=precision_score(y_test, y_predBef)
print('before tuning:',before)

before tuning: 0.8409090909090909


#### after tuning

In [132]:
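best_model=grid_search.best_estimator_
# GridSearchCV already refits best_estimator_ on the full training split (refit=True by default),
# so this second fit is redundant but harmless
best_model.fit(x_train, y_train)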

Pipeline(steps=[('preprocess',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('imputer',
                                                  SimpleImputer(strategy='median'),
                                                  ['age']),
                                                 ('One Hot',
                                                  Pipeline(steps=[('mod',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('OH',
                                                                   OneHotEncoder(drop='first'))]),
                                                  ['sex', 'embark_town',
                                                   'alone']),
                                                 ('ordinal',
                                                  OrdinalEncoder(mapping=[{'col': 'class',
 

In [133]:
y_predAf=best_model.predict(x_test)

In [134]:
print(classification_report(y_test, y_predAf))

              precision    recall  f1-score   support

           0       0.74      0.97      0.84       110
           1       0.91      0.46      0.62        69

    accuracy                           0.78       179
   macro avg       0.83      0.72      0.73       179
weighted avg       0.81      0.78      0.75       179



In [135]:
after=precision_score(y_test, y_predAf)
print('after tuning:',after)

after tuning: 0.9142857142857143


### Conclusion

In [136]:
pd.DataFrame({
    'tuning':['before','after'],
    'score':[before,after]
})

Unnamed: 0,tuning,score
0,before,0.840909
1,after,0.914286


> After tuning, test-set precision for class 1 rose from 0.8409 to 0.9143, a gain of about 7.34 percentage points. Recall for class 1 fell from 0.54 to 0.46 over the same comparison.
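
The size of the improvement can be expressed either as an absolute difference (percentage points) or relative to the starting score. The short sketch below works through both using the two precision values printed above; the variable names are chosen for illustration only.

```python
# Precision values printed by the notebook before and after tuning.
precision_before = 0.8409090909090909
precision_after = 0.9142857142857143

# Absolute gain: plain difference between the two scores.
absolute_gain = precision_after - precision_before

# Relative gain: the difference expressed as a fraction of the starting score.
relative_gain = absolute_gain / precision_before

print(f'absolute gain: {absolute_gain * 100:.2f} percentage points')
print(f'relative gain: {relative_gain * 100:.2f}%')
```

The two figures answer different questions, so it helps to state which one is being reported.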