In this exercise, we'll continue working with the Titanic dataset and building logistic regression models. Throughout this exercise, be sure you are training, evaluating, and comparing models on the train and validate datasets. The test dataset should only be used for your final model.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix


import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# ignore warnings
import warnings
warnings.filterwarnings("ignore")

from acquire import get_titanic_data
from prepare import prep_titanic_data

In [2]:
#loading the prepped titanic data and 
#splitting it into train, validate, and test 
train, validate, test = prep_titanic_data()
print("train: ", train.shape, ", validate: ", validate.shape, ", test: ", test.shape)

train:  (497, 14) , validate:  (214, 14) , test:  (178, 14)


### Base Model

In [3]:
#making a baseline model
train.survived.value_counts(normalize=True)

0    0.617706
1    0.382294
Name: survived, dtype: float64
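
The class proportions above imply a baseline: always predict the most frequent class and measure how often that is right. A minimal sketch of that calculation follows, using a small made-up `survived` series as a stand-in for `train.survived` (the values are illustrative, not taken from the dataset).

```python
import pandas as pd

# Hypothetical stand-in for train.survived; values are illustrative only.
survived = pd.Series([0, 0, 0, 1, 1, 0, 1, 0, 0, 1])

# The baseline prediction is the most frequent class.
baseline_class = survived.mode()[0]

# Baseline accuracy: share of rows where the constant prediction is correct.
baseline_accuracy = (survived == baseline_class).mean()

print(f"baseline class: {baseline_class}, accuracy: {baseline_accuracy:.2f}")
```

Applied to the real training data, the same two lines would give the 0.62 figure shown in the `value_counts` output.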

In [4]:
#taking a peek at the data
train.head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,embark_town,alone,Q,S
583,583,0,1,male,36.0,0,0,40.125,C,First,Cherbourg,1,0,0
337,337,1,1,female,41.0,0,0,134.5,C,First,Cherbourg,1,0,0
50,50,0,3,male,7.0,4,1,39.6875,S,Third,Southampton,0,0,1
218,218,1,1,female,32.0,0,0,76.2917,C,First,Cherbourg,1,0,0
31,31,1,1,female,29.916875,1,0,146.5208,C,First,Cherbourg,0,0,0


### Example Model

In [5]:
#creating an example model based on the curriculum
X_train = train[['pclass', 'age', 'fare', 'sibsp', 'parch']]
y_train = train[['survived']]

In [6]:
#calling the Logistic Regression function and saving it 
#under the variable called logit for shorthand
logit = LogisticRegression()

In [7]:
#fitting the train dataframe into a logistic regression model
logit.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [8]:
#printing the coefficients of each category 
#along with the intercept of the function
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[-0.98505432 -0.02975293  0.00233927 -0.17750706  0.32613578]]
Intercept: 
 [2.49738603]
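
Raw logistic regression coefficients are on the log-odds scale, which is hard to read directly. One common way to interpret them is to exponentiate each one into an odds ratio. The sketch below does this for the coefficients printed above; the feature order is assumed to match the column order used to build `X_train`.

```python
import numpy as np

# Feature order and coefficients copied from the example model's output above.
features = ['pclass', 'age', 'fare', 'sibsp', 'parch']
coefs = np.array([-0.98505432, -0.02975293, 0.00233927, -0.17750706, 0.32613578])

# exp(coefficient) is the multiplicative change in the odds of survival
# for a one-unit increase in that feature, holding the others fixed.
odds_ratios = np.exp(coefs)

for name, ratio in zip(features, odds_ratios):
    print(f"{name:>6}: {ratio:.3f}")
```

A ratio below 1 means the odds of survival fall as the feature increases (for example, a higher `pclass` number), and a ratio above 1 means they rise.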


In [9]:
# 'logit.predict' predicts class labels for the samples in the parentheses
y_pred = logit.predict(X_train)
# 'logit.predict_proba' predicts probability estimates for each class
y_pred_proba = logit.predict_proba(X_train)

In [10]:
# 'logit.score' returns the mean accuracy on the given data and labels (here, the training set).
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X_train, y_train)))

Accuracy of Logistic Regression classifier on training set: 0.71


In [11]:
#creates a confusion matrix to see how accurate the model is
print(confusion_matrix(y_train, y_pred))

[[262  45]
 [100  90]]
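
The figures in a classification report can be derived by hand from the confusion matrix. As a sketch, the snippet below recomputes accuracy, precision, and recall for the positive class (survived = 1) from the matrix printed above, assuming scikit-learn's layout of actual classes in rows and predicted classes in columns.

```python
import numpy as np

# Confusion matrix from the example model (rows: actual, columns: predicted).
cm = np.array([[262, 45],
               [100, 90]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of predicted survivors, how many survived
recall = tp / (tp + fn)      # of actual survivors, how many were found

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```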


In [12]:
#classification report to get all scores in an easy to read table
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.72      0.85      0.78       307
           1       0.67      0.47      0.55       190

    accuracy                           0.71       497
   macro avg       0.70      0.66      0.67       497
weighted avg       0.70      0.71      0.70       497



For all of the models you create, choose a threshold that optimizes for accuracy.
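
One way to choose such a threshold is to sweep a grid of cutoffs over the predicted probabilities and keep the one with the highest accuracy. The sketch below illustrates the idea on synthetic data from `make_classification`; the data, the grid spacing, and the variable names are assumptions for illustration, not part of the exercise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; swap in X_train / y_train from the notebook.
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

# Probability of the positive class for each row.
proba = model.predict_proba(X)[:, 1]

# Try a grid of cutoffs and keep the one with the highest accuracy.
thresholds = np.arange(0.05, 1.0, 0.05)
accuracies = [((proba >= t).astype(int) == y).mean() for t in thresholds]
best = int(np.argmax(accuracies))

print(f"best threshold: {thresholds[best]:.2f}, accuracy: {accuracies[best]:.3f}")
```

In practice the threshold would be picked on train and then checked on validate, so that the choice is not tuned to a single sample.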

Do your work for these exercises in either a notebook or a python script named model within your classification-exercises repository. Add, commit, and push your work.

### Exercise 1
Create another model that includes age in addition to fare and pclass. 

### Model 1

In [13]:
#changing the parameters for another model
X1_train = train[['age', 'fare', 'pclass']]
y1_train = train[['survived']]

In [14]:
#fitting the data into a logistic regression model
logit.fit(X1_train, y1_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [15]:
#printing the coefficients and intercepts of the model
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[-0.03051881  0.00266519 -0.97983178]]
Intercept: 
 [2.52970125]


In [16]:
# 'logit.predict' predicts class labels for the samples in the parentheses
y1_pred = logit.predict(X1_train)
# 'logit.predict_proba' predicts probability estimates for each class
y1_pred_proba = logit.predict_proba(X1_train)

In [17]:
# 'logit.score' returns the mean accuracy on the given test data and labels.
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X1_train, y1_train)))

Accuracy of Logistic Regression classifier on training set: 0.72


In [18]:
#creates a confusion matrix to see how accurate the model is
print(confusion_matrix(y1_train, y1_pred))

[[265  42]
 [ 99  91]]


In [19]:
#classification report to get all scores in an easy to read table
print(classification_report(y1_train, y1_pred))

              precision    recall  f1-score   support

           0       0.73      0.86      0.79       307
           1       0.68      0.48      0.56       190

    accuracy                           0.72       497
   macro avg       0.71      0.67      0.68       497
weighted avg       0.71      0.72      0.70       497



- Does this model perform better than your previous one?

It performed slightly better than the previous model (training accuracy of 0.72 versus 0.71), even with 'sibsp' and 'parch' removed. This suggests those two features contributed little to the model.

### Exercise 2
Include sex in your model as well. Note that you'll need to encode or create a dummy variable of this feature before including it in a model.

In [20]:
#creating a dummy variable for gender with the train dataset
titanic_dummies = pd.get_dummies(train.sex, drop_first=True)
#concatenating the dummy variable to the training dataset
train = pd.concat([train, titanic_dummies], axis=1)

In [21]:
#verifying the above functions worked
train

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,embark_town,alone,Q,S,male
583,583,0,1,male,36.000000,0,0,40.1250,C,First,Cherbourg,1,0,0,1
337,337,1,1,female,41.000000,0,0,134.5000,C,First,Cherbourg,1,0,0,0
50,50,0,3,male,7.000000,4,1,39.6875,S,Third,Southampton,0,0,1,1
218,218,1,1,female,32.000000,0,0,76.2917,C,First,Cherbourg,1,0,0,0
31,31,1,1,female,29.916875,1,0,146.5208,C,First,Cherbourg,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
313,313,0,3,male,28.000000,0,0,7.8958,S,Third,Southampton,1,0,1,1
636,636,0,3,male,32.000000,0,0,7.9250,S,Third,Southampton,1,0,1,1
222,222,0,3,male,51.000000,0,0,8.0500,S,Third,Southampton,1,0,1,1
485,485,0,3,female,29.916875,3,1,25.4667,S,Third,Southampton,0,0,1,0
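
Because the same encoding is needed on every split, it can help to wrap it in a small function. The sketch below uses a hypothetical helper, `add_male_dummy`, which is not part of the notebook's `prepare` module. It compares `sex` to `'male'` directly rather than calling `get_dummies(drop_first=True)`, so the `male` column is still created if a split happens to contain only one sex.

```python
import pandas as pd

def add_male_dummy(df):
    """Return a copy of df with a 0/1 `male` column derived from `sex`.

    Hypothetical helper, not part of the notebook's prepare module.
    """
    out = df.copy()
    out['male'] = (out['sex'] == 'male').astype(int)
    return out

# Tiny illustrative frames standing in for the train and validate splits.
train_demo = pd.DataFrame({'sex': ['male', 'female', 'male']})
validate_demo = pd.DataFrame({'sex': ['female', 'female']})

train_demo = add_male_dummy(train_demo)
validate_demo = add_male_dummy(validate_demo)
print(validate_demo)
```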


### Model 2

In [22]:
#Rinse and repeat for the next few models to test different variables
X2_train = train[['age', 'fare', 'pclass', 'male']]
y2_train = train[['survived']]

In [23]:
logit.fit(X2_train, y2_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [24]:
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[-2.66594879e-02  9.02716903e-04 -1.11402368e+00 -2.45878213e+00]]
Intercept: 
 [4.30664987]


In [25]:
y2_pred = logit.predict(X2_train)
y2_pred_proba = logit.predict_proba(X2_train)

In [26]:
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X2_train, y2_train)))

Accuracy of Logistic Regression classifier on training set: 0.80


In [27]:
print(confusion_matrix(y2_train, y2_pred))

[[263  44]
 [ 56 134]]


In [28]:
print(classification_report(y2_train, y2_pred))

              precision    recall  f1-score   support

           0       0.82      0.86      0.84       307
           1       0.75      0.71      0.73       190

    accuracy                           0.80       497
   macro avg       0.79      0.78      0.78       497
weighted avg       0.80      0.80      0.80       497



### Exercise 3
Try out other combinations of features and models.

### Model 3

In [29]:
X3_train = train[['age', 'fare', 'pclass', 'Q', 'S']]
y3_train = train[['survived']]

In [30]:
logit.fit(X3_train, y3_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [31]:
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[-0.03100806  0.00234461 -1.02568769  0.54782748 -0.15070561]]
Intercept: 
 [2.72357442]


In [32]:
y3_pred = logit.predict(X3_train)
y3_pred_proba = logit.predict_proba(X3_train)

In [33]:
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X3_train, y3_train)))

Accuracy of Logistic Regression classifier on training set: 0.71


In [34]:
print(confusion_matrix(y3_train, y3_pred))

[[266  41]
 [101  89]]


In [35]:
print(classification_report(y3_train, y3_pred))

              precision    recall  f1-score   support

           0       0.72      0.87      0.79       307
           1       0.68      0.47      0.56       190

    accuracy                           0.71       497
   macro avg       0.70      0.67      0.67       497
weighted avg       0.71      0.71      0.70       497



### Model 4

In [36]:
X4_train = train[['age', 'fare', 'pclass', 'alone']]
y4_train = train[['survived']]

In [37]:
logit.fit(X4_train, y4_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [38]:
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[-2.41114976e-02  7.86427878e-04 -9.59730818e-01 -7.85463207e-01]]
Intercept: 
 [2.81126777]


In [39]:
y4_pred = logit.predict(X4_train)
y4_pred_proba = logit.predict_proba(X4_train)

In [40]:
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X4_train, y4_train)))

Accuracy of Logistic Regression classifier on training set: 0.72


In [41]:
print(confusion_matrix(y4_train, y4_pred))

[[260  47]
 [ 94  96]]


In [42]:
print(classification_report(y4_train, y4_pred))

              precision    recall  f1-score   support

           0       0.73      0.85      0.79       307
           1       0.67      0.51      0.58       190

    accuracy                           0.72       497
   macro avg       0.70      0.68      0.68       497
weighted avg       0.71      0.72      0.71       497



### Model 5

In [43]:
X5_train = train[['age', 'fare', 'male', 'alone']]
y5_train = train[['survived']]

In [44]:
logit.fit(X5_train, y5_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [45]:
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[-2.10571953e-03  9.81503198e-03 -2.25962521e+00 -1.90950778e-01]]
Intercept: 
 [0.78850085]


In [46]:
y5_pred = logit.predict(X5_train)
y5_pred_proba = logit.predict_proba(X5_train)

In [47]:
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X5_train, y5_train)))

Accuracy of Logistic Regression classifier on training set: 0.78


In [48]:
print(confusion_matrix(y5_train, y5_pred))

[[261  46]
 [ 63 127]]


In [49]:
print(classification_report(y5_train, y5_pred))

              precision    recall  f1-score   support

           0       0.81      0.85      0.83       307
           1       0.73      0.67      0.70       190

    accuracy                           0.78       497
   macro avg       0.77      0.76      0.76       497
weighted avg       0.78      0.78      0.78       497



### Model 6

In [50]:
X6_train = train[['age', 'pclass', 'male', 'alone']]
y6_train = train[['survived']]

In [51]:
logit.fit(X6_train, y6_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [52]:
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[-0.02570129 -1.12720398 -2.41479961 -0.17176794]]
Intercept: 
 [4.4084933]


In [53]:
y6_pred = logit.predict(X6_train)
y6_pred_proba = logit.predict_proba(X6_train)

In [54]:
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X6_train, y6_train)))

Accuracy of Logistic Regression classifier on training set: 0.80


In [55]:
print(confusion_matrix(y6_train, y6_pred))

[[264  43]
 [ 58 132]]


In [56]:
print(classification_report(y6_train, y6_pred))

              precision    recall  f1-score   support

           0       0.82      0.86      0.84       307
           1       0.75      0.69      0.72       190

    accuracy                           0.80       497
   macro avg       0.79      0.78      0.78       497
weighted avg       0.79      0.80      0.80       497



Training accuracy of all six models from above:
- Model 1: .72
- Model 2: .80
- Model 3: .71
- Model 4: .72
- Model 5: .78
- Model 6: .80

### Exercise 4
Use your best 3 models to predict and evaluate on your validate sample.

The 3 best models are model 2, 5 and 6.

In [57]:
#creating gender dummy variables for the validate section because 
#a dummy male variable was not created before splitting 
#into train, validate and test sets
titanic_dummies = pd.get_dummies(validate.sex, drop_first=True)
validate = pd.concat([validate, titanic_dummies], axis=1)

In [58]:
#recreating the 3 best training variables from the train models
#under the validate datasets to retest with new data
X2_validate = validate[['age', 'fare', 'pclass', 'male']]
y2_validate = validate[['survived']]

X5_validate = validate[['age', 'fare', 'male', 'alone']]
y5_validate = validate[['survived']]

X6_validate = validate[['age', 'pclass', 'male', 'alone']]
y6_validate = validate[['survived']]

In [59]:
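# 'logit.predict' predicts class labels for the validate samples in the parentheses
# NOTE: 'logit' was last fit on Model 6's features (In [51]), so all three calls
# below use Model 6's coefficients; the "model 2" and "model 5" results are
# therefore not valid evaluations of those models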
y_pred2 = logit.predict(X2_validate)
y_pred5 = logit.predict(X5_validate)
y_pred6 = logit.predict(X6_validate)

In [60]:
# printing out the mean accuracy on the validate data and labels
print("model 2\n", logit.score(X2_validate, y2_validate))
print("model 5\n", logit.score(X5_validate, y5_validate))
print("model 6\n", logit.score(X6_validate, y6_validate))

model 2
 0.6074766355140186
model 5
 0.5981308411214953
model 6
 0.7850467289719626


In [63]:
#printing out a confusion matrix for all models
print("model 2\n", confusion_matrix(y2_validate, y_pred2))
print("model 5\n", confusion_matrix(y5_validate, y_pred5))
print("model 6\n", confusion_matrix(y6_validate, y_pred6))

model 2
 [[130   2]
 [ 82   0]]
model 5
 [[128   4]
 [ 82   0]]
model 6
 [[110  22]
 [ 24  58]]


In [64]:
#printing out a classification report for all models
print("model 2\n", classification_report(y2_validate, y_pred2))
print("model 5\n", classification_report(y5_validate, y_pred5))
print("model 6\n", classification_report(y6_validate, y_pred6))

model 2
               precision    recall  f1-score   support

           0       0.61      0.98      0.76       132
           1       0.00      0.00      0.00        82

    accuracy                           0.61       214
   macro avg       0.31      0.49      0.38       214
weighted avg       0.38      0.61      0.47       214

model 5
               precision    recall  f1-score   support

           0       0.61      0.97      0.75       132
           1       0.00      0.00      0.00        82

    accuracy                           0.60       214
   macro avg       0.30      0.48      0.37       214
weighted avg       0.38      0.60      0.46       214

model 6
               precision    recall  f1-score   support

           0       0.82      0.83      0.83       132
           1       0.72      0.71      0.72        82

    accuracy                           0.79       214
   macro avg       0.77      0.77      0.77       214
weighted avg       0.78      0.79      0.78       214
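
The validation cells above reuse the single `logit` object, which at that point holds only the most recent fit (Model 6). That is the likely reason "model 2" and "model 5" predict no survivors correctly: Model 6's coefficients are being applied to different columns. A possible alternative is to keep one fitted estimator per feature set. The sketch below shows the pattern on randomly generated stand-in data; the `make_split` helper, the synthetic target, and `max_iter=1000` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_split(n):
    """Build a random stand-in for one data split (hypothetical helper)."""
    df = pd.DataFrame({
        'age': rng.uniform(1, 70, n),
        'fare': rng.uniform(5, 150, n),
        'pclass': rng.integers(1, 4, n),
        'male': rng.integers(0, 2, n),
        'alone': rng.integers(0, 2, n),
    })
    # Make the target depend on the features so there is something to learn.
    df['survived'] = ((df['male'] == 0) | (df['pclass'] == 1)).astype(int)
    return df

train_demo, validate_demo = make_split(400), make_split(150)

feature_sets = {
    'model 2': ['age', 'fare', 'pclass', 'male'],
    'model 5': ['age', 'fare', 'male', 'alone'],
    'model 6': ['age', 'pclass', 'male', 'alone'],
}

# One estimator per feature set, so each keeps its own coefficients.
models = {}
for name, cols in feature_sets.items():
    models[name] = LogisticRegression(max_iter=1000).fit(
        train_demo[cols], train_demo['survived'])

# Score each model on the validate split with the columns it was trained on.
scores = {name: models[name].score(validate_demo[cols], validate_demo['survived'])
          for name, cols in feature_sets.items()}
for name, score in scores.items():
    print(name, round(score, 3))
```

With the real splits, `train_demo` and `validate_demo` would be replaced by `train` and `validate`, and the resulting scores would then reflect each model's own fit.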

### Exercise 5
Choose your best model from the validation performance, and evaluate it on the test dataset. How do the performance metrics compare to validate? to train?

In [65]:
#creating gender dummy variables for the test section because 
#a dummy male variable was not created before splitting 
#into train, validate and test sets
titanic_dummies = pd.get_dummies(test.sex, drop_first=True)
test = pd.concat([test, titanic_dummies], axis=1)

In [66]:
#recreating the best train and validate model variables
#under the test datasets to run a final test
X6_test = test[['age', 'pclass', 'male', 'alone']]
y6_test = test[['survived']]

In [68]:
# 'logit.predict' predicts class labels for the test samples in the parentheses
y_pred = logit.predict(X6_test)
# 'predict_proba' creates probability estimates
y_pred_proba = logit.predict_proba(X6_test)

# print the mean accuracy on the given test data and labels.
accuracy = logit.score(X6_test, y6_test)
print(accuracy)

#print the confusion matrix and classification report for final analysis
print(confusion_matrix(y6_test, y_pred))
print(classification_report(y6_test, y_pred))

0.8146067415730337
[[93 17]
 [16 52]]
              precision    recall  f1-score   support

           0       0.85      0.85      0.85       110
           1       0.75      0.76      0.76        68

    accuracy                           0.81       178
   macro avg       0.80      0.81      0.80       178
weighted avg       0.82      0.81      0.81       178



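The performance metrics stayed relatively the same across the train, validate, and test stages (accuracy of 0.80, 0.79, and 0.81 respectively), which suggests the model is not overfitting. Model 6 is a reasonable choice for predicting survival, since its accuracy is roughly 17 to 20 percentage points above the 0.62 baseline.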
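
The comparison against the baseline can be checked with a few lines of arithmetic. The sketch below uses the Model 6 figures reported in the outputs above; the training accuracy is recomputed from the training confusion matrix rather than from the rounded printout.

```python
# Baseline: share of the majority class (did not survive) in train.
baseline = 0.617706

# Model 6 accuracy on each split, taken from the outputs above.
accuracy = {
    'train': (264 + 132) / 497,      # from the training confusion matrix
    'validate': 0.7850467289719626,
    'test': 0.8146067415730337,
}

for split, acc in accuracy.items():
    gain = (acc - baseline) * 100
    print(f"{split:>8}: accuracy={acc:.3f}, {gain:.1f} points above baseline")
```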