# <font color=blue>Assignments for "Cross Validation"</font>

In this assignment, you are going to measure the performance of the model you created with the Titanic dataset in the previous lesson. To complete this assignment, send a link to a Jupyter notebook containing solutions to the following tasks.

- Evaluate your model's performance using cross validation and several different metrics.
- Determine the model with the most appropriate parameters by hyperparameter tuning.

In [16]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

In [17]:
titanic_df = pd.read_csv("titanic.csv")
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [22]:
# Fill missing Age values with the column median.
titanic_df.Age = titanic_df.Age.fillna(titanic_df.Age.median())
# Drop Cabin, which is missing for most passengers (only 204 of 891 values are non-null).
titanic_df = titanic_df.drop("Cabin", axis=1)
# One-hot encode Sex; drop_first=True keeps a single 'male' indicator column.
titanic_df = pd.concat([titanic_df, pd.get_dummies(titanic_df.Sex, drop_first=True)], axis=1)

In [23]:
# Create the feature matrix X and the target y for the model.
y = titanic_df.Survived
X = titanic_df[['Pclass', 'male', 'Age', 'SibSp','Parch', 'Fare']]

### Cross validation

In [20]:
from sklearn.model_selection import cross_validate

In [24]:
#Creating a logistic regression object
logreg = LogisticRegression(solver='lbfgs', multi_class="ovr", max_iter=110)

In [35]:
cv = cross_validate(estimator= logreg, X= X, y= y, cv=5, return_train_score=True,scoring = ['accuracy', 'precision', 'recall'])
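
To make `cv=5` more concrete, here is a minimal, standard-library sketch of how 891 rows can be divided into five folds. It is a simplification: the helper `kfold_indices` is hypothetical, and for a classifier with an integer `cv` scikit-learn uses stratified folds, which this plain version does not reproduce.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) for plain, unshuffled k-fold splitting.

    Hypothetical helper for illustration only; it ignores class balance.
    """
    # The first (n_samples % n_splits) folds receive one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(kfold_indices(891, 5))
test_sizes = [len(test_idx) for _, test_idx in folds]
print(test_sizes)
```

Each row lands in exactly one test fold, so every observation is used for validation once and for training four times.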

In [36]:
print('Train Set Mean Accuracy  : {:.2f}  '.format(cv['train_accuracy'].mean()))
print('Train Set Mean Recall  : {:.2f}  '.format(cv['train_recall'].mean()))
print('Train Set Mean Precision : {:.2f}\n'.format(cv['train_precision'].mean()))

print('Test Set Mean Accuracy   : {:.2f}  '.format(cv['test_accuracy'].mean()))
print('Test Set Mean Recall   : {:.2f}  '.format(cv['test_recall'].mean()))
print('Test Set Mean Precision  : {:.2f}  '.format(cv['test_precision'].mean()))

Train Set Mean Accuracy  : 0.80  
Train Set Mean Recall  : 0.71  
Train Set Mean Precision : 0.76

Test Set Mean Accuracy   : 0.78  
Test Set Mean Recall   : 0.69  
Test Set Mean Precision  : 0.73  
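
As a reminder of what the three metrics above measure, the sketch below computes them from raw counts for a small made-up set of labels. The function name `classification_metrics` and the toy data are assumptions for illustration and are not taken from the Titanic results.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Of the passengers predicted to survive, how many actually did?
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of the passengers who survived, how many did the model find?
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels, invented for this example.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

A recall lower than precision, as in the notebook output, means the model misses more actual survivors than it wrongly flags.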


### Hyperparameter Tuning

#### Grid Search

In [40]:
from sklearn.model_selection import GridSearchCV

In [39]:
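# Note: the estimator uses solver='lbfgs', which supports only the l2 penalty,
# so the "l1" candidates fail to fit and are scored as NaN (warnings are suppressed above).
parameter_dict = {"C": [10 ** x for x in range(-5, 5, 1)],
                  "penalty": ["l1", "l2"]}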

In [42]:
grid_cv = GridSearchCV(estimator = logreg, param_grid=parameter_dict, cv= 5)

In [43]:
grid_cv.fit(X,y)

GridSearchCV(cv=5,
             estimator=LogisticRegression(max_iter=110, multi_class='ovr'),
             param_grid={'C': [1e-05, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100,
                               1000, 10000],
                         'penalty': ['l1', 'l2']})

In [44]:
grid_cv.best_params_

{'C': 0.1, 'penalty': 'l2'}

In [45]:
grid_cv.best_score_

0.7867804908668634

**Comment:** The best mean cross-validated accuracy is 0.787, obtained with `C = 0.1` and the `l2` penalty.
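
To see how much work the grid search does, the following standard-library sketch enumerates the same grid and counts the cross-validation fits. It only mirrors the bookkeeping; it does not train any model, and the variable names are chosen for illustration.

```python
from itertools import product

# Same grid as parameter_dict in the notebook.
param_grid = {
    "C": [10 ** x for x in range(-5, 5)],
    "penalty": ["l1", "l2"],
}
n_folds = 5

# Every combination of one C value with one penalty is a candidate.
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
n_candidates = len(candidates)
n_cv_fits = n_candidates * n_folds

print(n_candidates, "candidates,", n_cv_fits, "cross-validation fits")
```

Grid search evaluates every candidate on every fold, so the cost grows multiplicatively with each parameter added to the grid.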

#### Random Search

In [48]:
from sklearn.model_selection import RandomizedSearchCV

In [49]:
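# Note: not every solver supports the l1 penalty (newton-cg, lbfgs and sag do not),
# so sampled combinations that pair them with "l1" fail to fit and are scored as NaN.
parameter_dict2 = {"C": [10 ** x for x in range(-5, 5, 1)],
                   "penalty": ["l1", "l2"],
                   "solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]}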

In [56]:
rs_cv = RandomizedSearchCV(estimator=logreg,
                           param_distributions = parameter_dict2,
                           cv = 5,
                           n_iter = 25,
                           random_state = 34)
rs_cv.fit(X,y)

RandomizedSearchCV(cv=5,
                   estimator=LogisticRegression(max_iter=110,
                                                multi_class='ovr'),
                   n_iter=25,
                   param_distributions={'C': [1e-05, 0.0001, 0.001, 0.01, 0.1,
                                              1, 10, 100, 1000, 10000],
                                        'penalty': ['l1', 'l2'],
                                        'solver': ['newton-cg', 'lbfgs',
                                                   'liblinear', 'sag',
                                                   'saga']},
                   random_state=34)

In [57]:
rs_cv.best_params_

{'solver': 'newton-cg', 'penalty': 'l2', 'C': 100}

In [58]:
rs_cv.best_score_

0.7833845960705543

**Comment:** The best mean cross-validated accuracy is 0.783, obtained with `C = 100`, the `l2` penalty, and the `newton-cg` solver. This is slightly below the grid search result of 0.787.
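
Because the random search draws 25 combinations from a space that mixes solvers and penalties, it helps to know how many combinations are valid at all. The sketch below counts them using a hand-written compatibility table, `SUPPORTED_PENALTIES`, restricted to the `l1` and `l2` penalties used here. The table is an assumption summarizing the scikit-learn documentation and should be checked against the installed version.

```python
from itertools import product

# Hand-written summary (an assumption, limited to l1/l2) of which penalties
# each LogisticRegression solver accepts.
SUPPORTED_PENALTIES = {
    "newton-cg": {"l2"},
    "lbfgs": {"l2"},
    "liblinear": {"l1", "l2"},
    "sag": {"l2"},
    "saga": {"l1", "l2"},
}

c_values = [10 ** x for x in range(-5, 5)]
penalties = ["l1", "l2"]
solvers = list(SUPPORTED_PENALTIES)

# Full search space, then the subset whose solver accepts the penalty.
all_combos = list(product(c_values, penalties, solvers))
valid_combos = [
    (c, penalty, solver)
    for c, penalty, solver in all_combos
    if penalty in SUPPORTED_PENALTIES[solver]
]

print(len(valid_combos), "of", len(all_combos), "combinations are valid")
```

One way to avoid wasting draws on invalid combinations is to pass a list of dictionaries as the search space, one per group of compatible solvers and penalties.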