# <font color=blue>Assignment</font>

In this assignment, you are going to measure the performance of the model you created with the Titanic dataset in the previous lesson. To complete this assignment, send a link to a Jupyter notebook containing solutions to the following tasks.

- Evaluate your model's performance with cross validation and using different metrics.
- Determine the model with the most appropriate parameters by hyperparameter tuning.

In [112]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split,KFold,cross_validate,GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report

import warnings
warnings.filterwarnings('ignore')

In [32]:
titanic=pd.read_csv('titanic.csv')
titanic.drop(columns='Unnamed: 0',inplace=True)

In [33]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 16 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        891 non-null    object 
 11  Embarked     891 non-null    object 
 12  is_male      891 non-null    int64  
 13  Embarked_ID  891 non-null    int64  
 14  Cabin_ID     891 non-null    int64  
 15  Ticket_ID    891 non-null    int64  
dtypes: float64(2), int64(9), object(5)
memory usage: 111.5+ KB


In [49]:
#target variable
Y=titanic.Survived

#independent variables
X=titanic.iloc[:,[0,2,5,6,7,9,12,13,14,15]]

In [50]:
X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.2,random_state=1111,stratify=Y)

The dataset is not severely imbalanced. I am going to apply cross-validation to get a more reliable estimate of the model's performance, and then tune the model to find the optimal `C` value.

In [51]:
print(titanic[titanic.Survived==0].shape[0])
print(titanic[titanic.Survived==1].shape[0])

549
342
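The two counts above can also be expressed as class shares, which makes the degree of imbalance easier to judge. A minimal sketch, assuming a stand-in Series rebuilt from the printed counts (the name `survived` is hypothetical; in the notebook `titanic.Survived` would be used instead):

```python
import pandas as pd

# Stand-in for titanic.Survived, rebuilt from the counts printed above
# (549 passengers with Survived == 0, 342 with Survived == 1).
survived = pd.Series([0] * 549 + [1] * 342, name="Survived")

# Share of each class; normalize=True returns proportions instead of counts.
class_share = survived.value_counts(normalize=True).sort_index()
print(class_share.round(3))
```

Roughly 62% of the passengers belong to class 0 and 38% to class 1, so accuracy alone is not badly misleading here, but precision and recall are still worth reporting.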


In [52]:
log_reg=LogisticRegression(multi_class='ovr')

* Cross-validation using the `cross_validate()` function

In [53]:
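# 'r2' is a regression metric and is of limited use for a classifier;
# accuracy, precision and recall are the relevant scores here.
cv=cross_validate(estimator=log_reg,X=X_train,y=y_train,cv=3,return_train_score=True,
                  scoring=['accuracy','precision','recall','r2'])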

In [56]:
print('Train Set Mean Accuracy  : {:.2f}  '.format(cv['train_accuracy'].mean()))
print('Train Set Mean R-square  : {:.2f}  '.format(cv['train_r2'].mean()))
print('Train Set Mean Precision : {:.2f} '.format(cv['train_precision'].mean()))
print('Train Set Mean Recall    : {:.2f}\n'.format(cv['train_recall'].mean()))
 
print('Test Set Mean Accuracy   : {:.2f}  '.format(cv['test_accuracy'].mean()))
print('Test Set Mean R-square   : {:.2f}  '.format(cv['test_r2'].mean()))
print('Test Set Mean Precision  : {:.2f}  '.format(cv['test_precision'].mean()))
print('Test Set Mean Recall     : {:.2f}\n'.format(cv['test_recall'].mean()))


Train Set Mean Accuracy  : 0.80  
Train Set Mean R-square  : 0.15  
Train Set Mean Precision : 0.78 
Train Set Mean Recall    : 0.66

Test Set Mean Accuracy   : 0.79  
Test Set Mean R-square   : 0.13  
Test Set Mean Precision  : 0.78  
Test Set Mean Recall     : 0.65
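R-square is designed for regression, so for a binary classifier it may be more informative to report F1 and ROC AUC next to accuracy. The sketch below shows one way to do that with `cross_validate()`. It runs on synthetic data from `make_classification` as a stand-in for the Titanic features, and the names `X_demo`, `y_demo`, `folds` and `cv_demo` are hypothetical; `max_iter=1000` is an assumption to help the solver converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the Titanic features (assumed ~62/38 class split).
X_demo, y_demo = make_classification(
    n_samples=700, n_features=10, weights=[0.62, 0.38], random_state=0
)

# Stratified folds keep the class ratio roughly equal in every fold.
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

cv_demo = cross_validate(
    estimator=LogisticRegression(max_iter=1000),
    X=X_demo,
    y=y_demo,
    cv=folds,
    scoring=["accuracy", "f1", "roc_auc"],
    return_train_score=True,
)

# Mean train and test score per metric across the three folds.
for name in ["accuracy", "f1", "roc_auc"]:
    train_mean = cv_demo["train_" + name].mean()
    test_mean = cv_demo["test_" + name].mean()
    print(f"{name:9s} train={train_mean:.2f} test={test_mean:.2f}")
```

A small gap between the train and test means, as in the results above, suggests the model is not overfitting.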



* Hyperparameter tuning

In [58]:
parameters={'C': [10**x for x in range(-5,5,1)]}

In [98]:
grid_cv=GridSearchCV(estimator=log_reg, 
                     param_grid=parameters,
                     cv=5,
                     return_train_score=True,
                    )

In [100]:
grid_cv.fit(X_train,y_train)

GridSearchCV(cv=5, estimator=LogisticRegression(multi_class='ovr'),
             param_grid={'C': [1e-05, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100,
                               1000, 10000]},
             return_train_score=True)

In [102]:
print("Best Parameters : ", grid_cv.best_params_)
print("Best Score      : ", grid_cv.best_score_)

Best Parameters :  {'C': 10}
Best Score      :  0.7977937555402344


In [104]:
results = grid_cv.cv_results_

df = pd.DataFrame(results)
df = df[['param_C', 'mean_test_score','mean_train_score']]
df = df.sort_values(by='mean_test_score', ascending = False)
df

|   | param_C | mean_test_score | mean_train_score |
|---|---------|-----------------|------------------|
| 6 | 10.0    | 0.797794        | 0.803376         |
| 7 | 100.0   | 0.796395        | 0.801266         |
| 9 | 10000.0 | 0.794997        | 0.794948         |
| 8 | 1000.0  | 0.794997        | 0.803725         |
| 5 | 1.0     | 0.790761        | 0.802671         |
| 4 | 0.1     | 0.779533        | 0.790034         |
| 3 | 0.01    | 0.737339        | 0.744374         |
| 2 | 0.001   | 0.709189        | 0.714181         |
| 1 | 0.0001  | 0.700758        | 0.70505          |
| 0 | 1e-05   | 0.695154        | 0.698378         |
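The top few values of `C` differ by less than 0.01 in mean test score, so it can help to look at the fold-to-fold spread before reading too much into the ranking. A minimal sketch, assuming synthetic data in place of the Titanic features (`X_demo`, `y_demo`, `search` and `summary` are hypothetical names); `std_test_score` and `rank_test_score` are columns that `GridSearchCV` stores in `cv_results_`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the Titanic features.
X_demo, y_demo = make_classification(n_samples=700, n_features=10, random_state=0)

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [10**x for x in range(-5, 5)]},
    cv=5,
)
search.fit(X_demo, y_demo)

# Keep the mean score together with its standard deviation across folds.
summary = pd.DataFrame(search.cv_results_)[
    ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]
]
summary = summary.sort_values("rank_test_score")
print(summary.to_string(index=False))
```

If the standard deviation is larger than the gap between two candidates, those candidates are effectively tied.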


* The hyperparameter `C` can be set to 10, the value with the highest mean test score in the grid search.

In [105]:
log_reg1=LogisticRegression(C=10,multi_class='ovr')

In [106]:
log_reg1.fit(X_train,y_train)

LogisticRegression(C=10, multi_class='ovr')
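Refitting by hand works, but `GridSearchCV` also refits the best configuration on the full training data by default (`refit=True`), so the tuned model is available as `best_estimator_`. A minimal sketch on synthetic data; `X_demo`, `y_demo`, `search` and `best_model` are hypothetical names, and the shortened grid is only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the Titanic features.
X_demo, y_demo = make_classification(n_samples=700, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1, 10]},
    cv=5,
)
search.fit(X_tr, y_tr)

# Already fitted on all of X_tr with the best C found by the search.
best_model = search.best_estimator_
print(search.best_params_)
print(round(best_model.score(X_te, y_te), 2))
```

This avoids copying the chosen value of `C` into a second model by hand.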

In [111]:
train_preds=log_reg1.predict(X_train)
test_preds=log_reg1.predict(X_test)

In [113]:
cr=classification_report(y_test,test_preds)

In [115]:
print(cr)

              precision    recall  f1-score   support

           0       0.76      0.89      0.82       110
           1       0.76      0.55      0.64        69

    accuracy                           0.76       179
   macro avg       0.76      0.72      0.73       179
weighted avg       0.76      0.76      0.75       179
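The report shows that recall for class 1 (0.55) is much lower than for class 0 (0.89), which means many survivors are predicted as non-survivors. A confusion matrix makes this visible as raw counts. The sketch below does not use the notebook's predictions: the four counts are an assumption inferred from the rounded report (support 110 and 69, recall 0.89 and 0.55), and `y_true_demo` and `y_pred_demo` are hypothetical arrays built from them. In the notebook, `confusion_matrix(y_test, test_preds)` would be called directly.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Counts inferred from the rounded classification report (assumption,
# not notebook output): true negatives, false positives,
# false negatives, true positives.
tn, fp, fn, tp = 98, 12, 31, 38

# Rebuild label arrays that reproduce exactly these counts.
y_true_demo = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred_demo = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Precision and recall for the positive class (Survived == 1).
print(round(precision_score(y_true_demo, y_pred_demo), 2))
print(round(recall_score(y_true_demo, y_pred_demo), 2))
```

Under these assumed counts, 31 of the 69 survivors in the test set are misclassified, which is where the low class 1 recall comes from.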

