# Logistic Regression Exercises

In this exercise, we'll continue working with the titanic dataset and building logistic regression models. Throughout this exercise, be sure you are training, evaluating, and comparing models on the train and validate datasets. The test dataset should only be used for your final model.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix


import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# ignore warnings
import warnings
warnings.filterwarnings("ignore")

from acquire import get_titanic_data
from prepare import prep_titanic_data

In [2]:
#loading the prepped titanic data and 
#splitting it into train, validate, and test 
train, validate, test = prep_titanic_data()
print("train: ", train.shape, ", validate: ", validate.shape, ", test: ", test.shape)

train:  (497, 10) , validate:  (214, 10) , test:  (178, 10)


In [3]:
train.head()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,alone,sex_male,embarked_Q,embarked_S
583,0,1,36.0,0,0,40.125,1,1,0,0
337,1,1,41.0,0,0,134.5,1,0,0,0
50,0,3,7.0,4,1,39.6875,0,1,0,1
218,1,1,32.0,0,0,76.2917,1,0,0,0
31,1,1,29.916875,1,0,146.5208,0,0,0,0


## Exercise 1

Start by defining your baseline model.

#### Base Model

In [4]:
#making a baseline model
train.survived.value_counts(normalize=True)

0    0.617706
1    0.382294
Name: survived, dtype: float64
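
The proportions above imply a baseline that always predicts the majority class (0, did not survive). A minimal sketch of that calculation, assuming class counts of 307 and 190, which are the counts consistent with those proportions for 497 training rows. `baseline_accuracy` is a hypothetical helper, not part of the original notebook.

```python
from collections import Counter

def baseline_accuracy(labels):
    """Return the majority class and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(labels)

# assumed class counts for the train split: 307 did not survive (0), 190 survived (1)
y = [0] * 307 + [1] * 190

baseline_class, baseline_acc = baseline_accuracy(y)
print(f"baseline class: {baseline_class}, baseline accuracy: {baseline_acc:.4f}")
```

Any model worth keeping should beat this number on train and on validate.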

In [5]:
#taking a peek at the data
train.head()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,alone,sex_male,embarked_Q,embarked_S
583,0,1,36.0,0,0,40.125,1,1,0,0
337,1,1,41.0,0,0,134.5,1,0,0,0
50,0,3,7.0,4,1,39.6875,0,1,0,1
218,1,1,32.0,0,0,76.2917,1,0,0,0
31,1,1,29.916875,1,0,146.5208,0,0,0,0


#### Baseline Model from Curriculum

In [6]:
#creating an example model based off the curriculum
X_train = train[['pclass', 'age', 'fare', 'sibsp', 'parch']]
y_train = train[['survived']]

In [7]:
#calling the Logistic Regression function and saving it 
#under the variable called logit for shorthand
logit = LogisticRegression()

In [8]:
#fitting the train dataframe into a logistic regression model
logit.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [9]:
#printing the coefficients of each category 
#along with the intercept of the function
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[-0.98505432 -0.02975293  0.00233927 -0.17750706  0.32613578]]
Intercept: 
 [2.49738603]


In [10]:
logit.intercept_

array([2.49738603])
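
To make the printed coefficients concrete, here is a small sketch of how a fitted logistic regression turns them into a probability: the intercept plus the weighted sum of the features gives the log-odds, and the sigmoid maps that to a value between 0 and 1. The numbers are the coefficients printed above, applied to passenger 583 from `train.head()`. The function name `predicted_probability` is an illustrative assumption.

```python
import math

def predicted_probability(intercept, coefficients, row):
    """Sigmoid of (intercept + sum of coefficient * feature value)."""
    log_odds = intercept + sum(coefficients[name] * row[name] for name in coefficients)
    return 1.0 / (1.0 + math.exp(-log_odds))

# coefficients in the same order as X_train's columns above
coefficients = {
    "pclass": -0.98505432,
    "age": -0.02975293,
    "fare": 0.00233927,
    "sibsp": -0.17750706,
    "parch": 0.32613578,
}
intercept = 2.49738603

# passenger 583 from train.head()
passenger_583 = {"pclass": 1, "age": 36.0, "fare": 40.125, "sibsp": 0, "parch": 0}

p_survived = predicted_probability(intercept, coefficients, passenger_583)
print(f"estimated probability of survival: {p_survived:.3f}")
```

This should agree with the second column of `logit.predict_proba` for the same row, up to rounding of the printed coefficients.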

In [11]:
# 'logit.predict' predicts class labels for samples in the parenthesis
y_pred = logit.predict(X_train)
# 'predict_proba' predicts probability estimates
y_pred_proba = logit.predict_proba(X_train)

In [12]:
# 'logit.score' returns the mean accuracy on the given test data and labels.
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X_train, y_train)))

Accuracy of Logistic Regression classifier on training set: 0.71


In [13]:
#creates a confusion matrix to see how accurate the model is
print(confusion_matrix(y_train, y_pred))

[[262  45]
 [100  90]]


In [14]:
#classification report to get all scores in an easy to read table
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.72      0.85      0.78       307
           1       0.67      0.47      0.55       190

    accuracy                           0.71       497
   macro avg       0.70      0.66      0.67       497
weighted avg       0.70      0.71      0.70       497



For all of the models you create, choose a threshold that optimizes for accuracy.
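
One way to approach the threshold instruction is to sweep candidate cutoffs over the positive-class probabilities from `predict_proba` and keep the one with the highest accuracy. For a binary model, `logit.predict` corresponds to a 0.5 cutoff. The sketch below uses made-up toy labels and probabilities, and the helper names are hypothetical.

```python
def accuracy_at_threshold(y_true, proba_positive, threshold):
    """Accuracy when predicting class 1 for probabilities >= threshold."""
    predictions = [1 if p >= threshold else 0 for p in proba_positive]
    correct = sum(int(pred == actual) for pred, actual in zip(predictions, y_true))
    return correct / len(y_true)

def best_threshold(y_true, proba_positive, candidates):
    """Return (threshold, accuracy) for the most accurate candidate threshold."""
    scored = [(accuracy_at_threshold(y_true, proba_positive, t), t) for t in candidates]
    best_acc, best_t = max(scored, key=lambda pair: pair[0])
    return best_t, best_acc

# toy data for illustration only; in the notebook these would be
# y_train.survived and logit.predict_proba(X_train)[:, 1]
toy_y = [0, 0, 1, 1, 0, 1]
toy_proba = [0.10, 0.45, 0.40, 0.80, 0.30, 0.65]

candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
threshold, acc = best_threshold(toy_y, toy_proba, candidates)
print(f"best threshold: {threshold:.2f}, accuracy: {acc:.3f}")
```

The threshold should be chosen on train and then checked on validate, never tuned on test.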

Do your work for these exercises in either a notebook or a python script named model within your classification-exercises repository. Add, commit, and push your work.

## Exercise 2
Create another model that includes age in addition to fare and pclass. 

### Model 1

In [15]:
#changing the parameters for another model
X1_train = train[['age', 'fare', 'pclass']]
y1_train = train[['survived']]

In [16]:
#fitting the data into a logistic regression model
logit.fit(X1_train, y1_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [17]:
#printing the coefficients and intercepts of the model
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[-0.03051881  0.00266519 -0.97983178]]
Intercept: 
 [2.52970125]


In [18]:
# 'logit.predict' predicts class labels for samples in the parenthesis
y1_pred = logit.predict(X1_train)
# 'predict_proba' predicts probability estimates
y1_pred_proba = logit.predict_proba(X1_train)

In [19]:
# 'logit.score' returns the mean accuracy on the given test data and labels.
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X1_train, y1_train)))

Accuracy of Logistic Regression classifier on training set: 0.72


In [20]:
#creates a confusion matrix to see how accurate the model is
print(confusion_matrix(y1_train, y1_pred))

[[265  42]
 [ 99  91]]


In [21]:
#classification report to get all scores in an easy to read table
print(classification_report(y1_train, y1_pred))

              precision    recall  f1-score   support

           0       0.73      0.86      0.79       307
           1       0.68      0.48      0.56       190

    accuracy                           0.72       497
   macro avg       0.71      0.67      0.68       497
weighted avg       0.71      0.72      0.70       497



- Does this model perform better than your previous one?

It performed slightly better than the previous model (0.72 vs. 0.71 training accuracy). Dropping 'sibsp' and 'parch' did not hurt, which suggests they contributed little to the model.

## Exercise 3
Include sex in your model as well. Note that you'll need to encode or create a dummy variable of this feature before including it in a model.

### Model 2

In [22]:
#Rinse and repeat for the next few models to test different variables
X2_train = train[['age', 'fare', 'pclass', 'sex_male']]
y2_train = train[['survived']]

In [23]:
logit.fit(X2_train, y2_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [24]:
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[-2.66594879e-02  9.02716903e-04 -1.11402368e+00 -2.45878213e+00]]
Intercept: 
 [4.30664987]
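
Each coefficient is the change in log-odds of survival for a one-unit increase in that feature, holding the others fixed, so exponentiating it gives an odds ratio. A short sketch using the Model 2 coefficients printed above. Because the features are unscaled and scikit-learn applies L2 regularization by default, treat the magnitudes as rough guides rather than exact effects.

```python
import math

# Model 2 coefficients, in the column order of X2_train
model_2_coefficients = {
    "age": -2.66594879e-02,
    "fare": 9.02716903e-04,
    "pclass": -1.11402368,
    "sex_male": -2.45878213,
}

# exp(coefficient) = multiplicative change in the odds of survival per unit increase
odds_ratios = {name: math.exp(coef) for name, coef in model_2_coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio {ratio:.3f}")
```

An odds ratio well below 1 for `sex_male` is consistent with the jump in accuracy once sex is added to the model.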


In [25]:
y2_pred = logit.predict(X2_train)
y2_pred_proba = logit.predict_proba(X2_train)

In [26]:
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X2_train, y2_train)))

Accuracy of Logistic Regression classifier on training set: 0.80


In [27]:
print(confusion_matrix(y2_train, y2_pred))

[[263  44]
 [ 56 134]]


In [28]:
print(classification_report(y2_train, y2_pred))

              precision    recall  f1-score   support

           0       0.82      0.86      0.84       307
           1       0.75      0.71      0.73       190

    accuracy                           0.80       497
   macro avg       0.79      0.78      0.78       497
weighted avg       0.80      0.80      0.80       497



## Exercise 4
Try out other combinations of features and models.

### Model 3

In [29]:
X3_train = train[['age', 'fare', 'pclass', 'embarked_Q', 'embarked_S']]
y3_train = train[['survived']]

In [30]:
logit.fit(X3_train, y3_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [31]:
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[-0.03100806  0.00234461 -1.02568769  0.54782748 -0.15070561]]
Intercept: 
 [2.72357442]


In [32]:
y3_pred = logit.predict(X3_train)
y3_pred_proba = logit.predict_proba(X3_train)

In [33]:
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X3_train, y3_train)))

Accuracy of Logistic Regression classifier on training set: 0.71


In [34]:
print(confusion_matrix(y3_train, y3_pred))

[[266  41]
 [101  89]]


In [35]:
print(classification_report(y3_train, y3_pred))

              precision    recall  f1-score   support

           0       0.72      0.87      0.79       307
           1       0.68      0.47      0.56       190

    accuracy                           0.71       497
   macro avg       0.70      0.67      0.67       497
weighted avg       0.71      0.71      0.70       497



### Model 4

In [36]:
X4_train = train[['age', 'fare', 'pclass', 'alone']]
y4_train = train[['survived']]

In [37]:
logit.fit(X4_train, y4_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [38]:
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[-2.41114976e-02  7.86427878e-04 -9.59730818e-01 -7.85463207e-01]]
Intercept: 
 [2.81126777]


In [39]:
y4_pred = logit.predict(X4_train)
y4_pred_proba = logit.predict_proba(X4_train)

In [40]:
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X4_train, y4_train)))

Accuracy of Logistic Regression classifier on training set: 0.72


In [41]:
print(confusion_matrix(y4_train, y4_pred))

[[260  47]
 [ 94  96]]


In [42]:
print(classification_report(y4_train, y4_pred))

              precision    recall  f1-score   support

           0       0.73      0.85      0.79       307
           1       0.67      0.51      0.58       190

    accuracy                           0.72       497
   macro avg       0.70      0.68      0.68       497
weighted avg       0.71      0.72      0.71       497



### Model 5

In [43]:
X5_train = train[['age', 'fare', 'sex_male', 'alone']]
y5_train = train[['survived']]

In [44]:
logit.fit(X5_train, y5_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [45]:
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[-2.10571953e-03  9.81503198e-03 -2.25962521e+00 -1.90950778e-01]]
Intercept: 
 [0.78850085]


In [46]:
y5_pred = logit.predict(X5_train)
y5_pred_proba = logit.predict_proba(X5_train)

In [47]:
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X5_train, y5_train)))

Accuracy of Logistic Regression classifier on training set: 0.78


In [48]:
print(confusion_matrix(y5_train, y5_pred))

[[261  46]
 [ 63 127]]


In [49]:
print(classification_report(y5_train, y5_pred))

              precision    recall  f1-score   support

           0       0.81      0.85      0.83       307
           1       0.73      0.67      0.70       190

    accuracy                           0.78       497
   macro avg       0.77      0.76      0.76       497
weighted avg       0.78      0.78      0.78       497



### Model 6

In [50]:
X6_train = train[['age', 'pclass', 'sex_male', 'alone']]
y6_train = train[['survived']]

In [51]:
logit.fit(X6_train, y6_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [52]:
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[-0.02570129 -1.12720398 -2.41479961 -0.17176794]]
Intercept: 
 [4.4084933]


In [53]:
y6_pred = logit.predict(X6_train)
y6_pred_proba = logit.predict_proba(X6_train)

In [54]:
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X6_train, y6_train)))

Accuracy of Logistic Regression classifier on training set: 0.80


In [55]:
print(confusion_matrix(y6_train, y6_pred))

[[264  43]
 [ 58 132]]


In [56]:
print(classification_report(y6_train, y6_pred))

              precision    recall  f1-score   support

           0       0.82      0.86      0.84       307
           1       0.75      0.69      0.72       190

    accuracy                           0.80       497
   macro avg       0.79      0.78      0.78       497
weighted avg       0.79      0.80      0.80       497



Displaying the accuracy of all six models from above
- Model 1: .72
- Model 2: .80
- Model 3: .71
- Model 4: .72
- Model 5: .78
- Model 6: .80

## Exercise 5
Use your best 3 models to predict and evaluate on your validate sample.

The 3 best models are model 2, 5 and 6.

In [57]:
#selecting the same feature sets used by the 3 best train models
#from the validate dataset to evaluate on unseen data
X2_validate = validate[['age', 'fare', 'pclass', 'sex_male']]
y2_validate = validate[['survived']]

X5_validate = validate[['age', 'fare', 'sex_male', 'alone']]
y5_validate = validate[['survived']]

X6_validate = validate[['age', 'pclass', 'sex_male', 'alone']]
y6_validate = validate[['survived']]

In [58]:
logit.fit(X2_validate, y2_validate)
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[-0.04587951  0.00467339 -1.12012323 -1.88320059]]
Intercept: 
 [4.39626524]


In [59]:
logit.fit(X5_validate, y5_validate)
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[-0.02817128  0.01992692 -1.85331673  0.38786936]]
Intercept: 
 [0.6207659]


In [60]:
logit.fit(X6_validate, y6_validate)
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[-0.04655122 -1.23266078 -1.95035015  0.11929282]]
Intercept: 
 [4.78000967]


In [61]:
y_pred2 = logit.predict(X2_validate)
y2_pred_proba = logit.predict_proba(X2_validate)
y_pred5 = logit.predict(X5_validate)
y5_pred_proba = logit.predict_proba(X5_validate)
y_pred6 = logit.predict(X6_validate)
y6_pred_proba = logit.predict_proba(X6_validate)

In [62]:
# printing out the mean accuracy on the validate data and labels
print("model 2\n", logit.score(X2_validate, y2_validate))
print("model 5\n", logit.score(X5_validate, y5_validate))
print("model 6\n", logit.score(X6_validate, y6_validate))

model 2
 0.6074766355140186
model 5
 0.5981308411214953
model 6
 0.7850467289719626


In [63]:
#printing out a confusion matrix for all models
print("model 2\n", confusion_matrix(y2_validate, y_pred2))
print("model 5\n", confusion_matrix(y5_validate, y_pred5))
print("model 6\n", confusion_matrix(y6_validate, y_pred6))

model 2
 [[130   2]
 [ 82   0]]
model 5
 [[128   4]
 [ 82   0]]
model 6
 [[115  17]
 [ 29  53]]


In [64]:
#printing out a classification report for all models
print("model 2\n", classification_report(y2_validate, y_pred2))
print("model 5\n", classification_report(y5_validate, y_pred5))
print("model 6\n", classification_report(y6_validate, y_pred6))

model 2
               precision    recall  f1-score   support

           0       0.61      0.98      0.76       132
           1       0.00      0.00      0.00        82

    accuracy                           0.61       214
   macro avg       0.31      0.49      0.38       214
weighted avg       0.38      0.61      0.47       214

model 5
               precision    recall  f1-score   support

           0       0.61      0.97      0.75       132
           1       0.00      0.00      0.00        82

    accuracy                           0.60       214
   macro avg       0.30      0.48      0.37       214
weighted avg       0.38      0.60      0.46       214

model 6
               precision    recall  f1-score   support

           0       0.80      0.87      0.83       132
           1       0.76      0.65      0.70        82

    accuracy                           0.79       214
   macro avg       0.78      0.76      0.77       214
weighted avg       0.78      0.79      0.78       214
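
The cells above call `logit.fit` on each validate subset and then reuse the last fit (the Model 6 features) to predict all three feature sets, which would explain why the Model 2 and Model 5 rows predict almost no survivors. A minimal sketch of an alternative, assuming the goal is to fit each model on train only and score it on validate: keep one estimator per feature set. The helper name `fit_and_validate` and the synthetic `demo` data are illustrative assumptions, not part of the original notebook, so the printed numbers are not Titanic results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def fit_and_validate(train_df, validate_df, feature_sets, target="survived"):
    """Fit one LogisticRegression per feature set on train; score on train and validate."""
    results = {}
    for name, features in feature_sets.items():
        model = LogisticRegression()  # a fresh estimator for every feature set
        model.fit(train_df[features], train_df[target])  # fit on train only
        results[name] = {
            "model": model,
            "train_accuracy": model.score(train_df[features], train_df[target]),
            "validate_accuracy": model.score(validate_df[features], validate_df[target]),
        }
    return results

# synthetic stand-in data with the same column names as the notebook
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "age": rng.uniform(1, 80, n),
    "pclass": rng.integers(1, 4, n),
    "sex_male": rng.integers(0, 2, n),
    "alone": rng.integers(0, 2, n),
})
demo["survived"] = ((demo["sex_male"] == 0) | (rng.random(n) < 0.2)).astype(int)
demo_train, demo_validate = demo.iloc[:200], demo.iloc[200:]

feature_sets = {
    "age_pclass": ["age", "pclass"],
    "age_pclass_sex_alone": ["age", "pclass", "sex_male", "alone"],
}
results = fit_and_validate(demo_train, demo_validate, feature_sets)
for name, scores in results.items():
    print(name, round(scores["train_accuracy"], 2), round(scores["validate_accuracy"], 2))
```

With the notebook's data, `train` and `validate` would replace the demo frames, and the same fitted estimator would later be scored once on `test`.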

## Exercise 6
Choose your best model from the validation performance, and evaluate it on the test dataset. How do the performance metrics compare to validate? To train?

In [65]:
#selecting the best model's features from the test dataset
#to run a final evaluation
X6_test = test[['age', 'pclass', 'sex_male', 'alone']]
y6_test = test[['survived']]

In [66]:
logit.fit(X6_test, y6_test)
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[-0.03826895 -0.83611527 -2.63666645  0.30714706]]
Intercept: 
 [3.92034512]


In [67]:
# 'logit.predict' predicts class labels for the test samples in the parenthesis
y_pred = logit.predict(X6_test)
# 'predict_proba' creates probability estimates
y_pred_proba = logit.predict_proba(X6_test)

# print the mean accuracy on the given test data and labels.
accuracy = logit.score(X6_test, y6_test)
print(accuracy)

#print the confusion matrix and classification report for final analysis
print(confusion_matrix(y6_test, y_pred))
print(classification_report(y6_test, y_pred))

0.8258426966292135
[[96 14]
 [17 51]]
              precision    recall  f1-score   support

           0       0.85      0.87      0.86       110
           1       0.78      0.75      0.77        68

    accuracy                           0.83       178
   macro avg       0.82      0.81      0.81       178
weighted avg       0.82      0.83      0.83       178



The performance metrics stayed relatively the same throughout the train, validate, and test stages. This would be a great model to predict survival since its accuracy is roughly 20 percentage points above the baseline.

## Review example

#### Train

In [252]:
X_train_ex = train.drop(columns=['age', 'fare', 'survived'])
y_train_ex = train[['survived']]
X_train_ex

Unnamed: 0,pclass,sibsp,parch,alone,sex_male,embarked_Q,embarked_S
583,1,0,0,1,1,0,0
337,1,0,0,1,0,0,0
50,3,4,1,0,1,0,1
218,1,0,0,1,0,0,0
31,1,1,0,0,0,0,0
...,...,...,...,...,...,...,...
313,3,0,0,1,1,0,1
636,3,0,0,1,1,0,1
222,3,0,0,1,1,0,1
485,3,3,1,0,0,0,1


In [69]:
#fitting the data into a logistic regression model
logit.fit(X_train_ex, y_train_ex)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [70]:
#printing the coefficients of each category 
#along with the intercept of the function
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[-0.89320109 -0.47113497 -0.14990043 -1.01531958 -2.41796105  0.37728737
  -0.0155336 ]]
Intercept: 
 [3.8948907]


In [71]:
# 'logit.predict' predicts class labels for samples in the parenthesis
y_pred_ex = logit.predict(X_train_ex)
# 'predict_proba' predicts probability estimates
y_pred_proba_ex = logit.predict_proba(X_train_ex)

In [72]:
# 'logit.score' returns the mean accuracy on the given test data and labels.
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X_train_ex, y_train_ex)))

Accuracy of Logistic Regression classifier on training set: 0.79


In [73]:
#creates a confusion matrix to see how accurate the model is
print(confusion_matrix(y_train_ex, y_pred_ex))

[[262  45]
 [ 57 133]]


In [74]:
#classification report to get all scores in an easy to read table
print(classification_report(y_train_ex, y_pred_ex))

              precision    recall  f1-score   support

           0       0.82      0.85      0.84       307
           1       0.75      0.70      0.72       190

    accuracy                           0.79       497
   macro avg       0.78      0.78      0.78       497
weighted avg       0.79      0.79      0.79       497



#### Validate

In [75]:
X_validate_ex = validate.drop(columns=['age', 'fare', 'survived'])
y_validate_ex = validate[['survived']]

In [76]:
#fitting the data into a logistic regression model
logit.fit(X_validate_ex, y_validate_ex)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [77]:
#printing the coefficients of each category 
#along with the intercept of the function
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[-0.87303369 -0.39371458 -0.17907146 -0.66087797 -1.98658939  0.09322755
  -0.35107385]]
Intercept: 
 [3.6238827]


In [78]:
# 'logit.predict' predicts class labels for samples in the parenthesis
y_pred_ex = logit.predict(X_validate_ex)
# 'predict_proba' predicts probability estimates
y_pred_proba_ex = logit.predict_proba(X_validate_ex)

In [79]:
# 'logit.score' returns the mean accuracy on the given data and labels.
print('Accuracy of Logistic Regression classifier on validate set: {:.2f}'
     .format(logit.score(X_validate_ex, y_validate_ex)))

Accuracy of Logistic Regression classifier on validate set: 0.80


In [80]:
#creates a confusion matrix to see how accurate the model is
print(confusion_matrix(y_validate_ex, y_pred_ex))

[[117  15]
 [ 27  55]]


In [81]:
#classification report to get all scores in an easy to read table
print(classification_report(y_validate_ex, y_pred_ex))

              precision    recall  f1-score   support

           0       0.81      0.89      0.85       132
           1       0.79      0.67      0.72        82

    accuracy                           0.80       214
   macro avg       0.80      0.78      0.79       214
weighted avg       0.80      0.80      0.80       214



#### Test: Use only one model for test

In [82]:
X_test_ex = test.drop(columns=['age', 'fare', 'survived'])
y_test_ex = test[['survived']]

In [83]:
#fitting the data into a logistic regression model
logit.fit(X_test_ex, y_test_ex)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [84]:
#printing the coefficients of each category 
#along with the intercept of the function
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[-0.57492301 -0.15517538 -0.13565495 -0.04940023 -2.57086937 -0.72527695
  -1.10042851]]
Intercept: 
 [3.3083142]


In [85]:
# 'logit.predict' predicts class labels for samples in the parenthesis
y_pred_ex = logit.predict(X_test_ex)
# 'predict_proba' predicts probability estimates
y_pred_proba_ex = logit.predict_proba(X_test_ex)

In [86]:
# 'logit.score' returns the mean accuracy on the given data and labels.
print('Accuracy of Logistic Regression classifier on test set: {:.2f}'
     .format(logit.score(X_test_ex, y_test_ex)))

Accuracy of Logistic Regression classifier on test set: 0.83


In [87]:
#creates a confusion matrix to see how accurate the model is
print(confusion_matrix(y_test_ex, y_pred_ex))

[[94 16]
 [15 53]]


In [88]:
#classification report to get all scores in an easy to read table
print(classification_report(y_test_ex, y_pred_ex))

              precision    recall  f1-score   support

           0       0.86      0.85      0.86       110
           1       0.77      0.78      0.77        68

    accuracy                           0.83       178
   macro avg       0.82      0.82      0.82       178
weighted avg       0.83      0.83      0.83       178



In [89]:
def logistic_regression(X_train, y_train):
    #importing libraries
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.metrics import confusion_matrix

    #defining logistic regression function
    logit = LogisticRegression()
    #fitting the data into the model
    logit.fit(X_train, y_train)
    
    #creating a list comprehension for the column names
    names = [column for column in X_train.columns]
    #adding intercept to the end of the list
    names.append('intercept')
    #creating a dataframe from the regression coefficient values and intercept
    coeff = pd.DataFrame(np.append(logit.coef_, logit.intercept_)).T
    #renaming the column names with the list of names
    coeff.columns = names
    
    # 'logit.predict' predicts class labels for samples in the parenthesis
    y_pred = logit.predict(X_train)
    # 'predict_proba' predicts probability estimates
    y_pred_proba = logit.predict_proba(X_train)
    
    #creates a confusion matrix (actual vs. predicted) to see how accurate the model is
    cm = pd.DataFrame(confusion_matrix(y_train, y_pred))
    
    #taking y_train as the source of the class labels
    label = y_train
    #renaming the target column to 'label' (rename returns a new DataFrame)
    label = label.rename(columns={label.columns[0]:'label'})
    #creating labels out of unique values for the classification report
    labels = sorted(label.label.unique())
    #creating a classification report and saving it as a DataFrame
    class_report = pd.DataFrame(classification_report(y_train, y_pred, target_names=labels, output_dict=True))
    
    return coeff, cm, class_report

In [90]:
logistic_regression(X_train, y_train)

(     pclass       age      fare     sibsp     parch  intercept
 0 -0.985054 -0.029753  0.002339 -0.177507  0.326136   2.497386,
      0   1
 0  262  45
 1  100  90,
                     0           1  accuracy   macro avg  weighted avg
 precision    0.723757    0.666667  0.708249    0.695212      0.701932
 recall       0.853420    0.473684  0.708249    0.663552      0.708249
 f1-score     0.783259    0.553846  0.708249    0.668552      0.695556
 support    307.000000  190.000000  0.708249  497.000000    497.000000)

# Decision Tree Exercises

In this exercise, we'll continue working with the titanic dataset and building decision tree models. Throughout this exercise, be sure you are training, evaluating, and comparing models on the train and validate datasets. The test dataset should only be used for your final model.

In [91]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz

## Exercise 1

Fit the decision tree classifier to your training sample and transform (i.e. make predictions on the training sample).

In [92]:
#Prepping data for a decision tree
X_train = train.drop(columns=['survived'])
y_train = train[['survived']]

In [93]:
#setting variable to decision tree classifier
clf = DecisionTreeClassifier(max_depth=3, random_state=123)

In [94]:
#fitting the data to the model
clf.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=3, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=123, splitter='best')

In [95]:
# 'clf.predict' predicts class labels for samples in the parenthesis
y_pred = clf.predict(X_train)
# 'predict_proba' predicts probability estimates
y_pred_proba = clf.predict_proba(X_train)

## Exercise 2

Evaluate your in-sample results using the model score, confusion matrix, and classification report.

In [96]:
#printing accuracy of the model
print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))

Accuracy of Decision Tree classifier on training set: 0.82


In [97]:
#creating a confusion matrix
confusion_matrix(y_train, y_pred)

array([[279,  28],
       [ 62, 128]])

In [98]:
#creating labels for the confusion matrix
labels = ['did not survive', 'survive']

#creating a confusion matrix and saving it as a dataframe
cm = pd.DataFrame(confusion_matrix(y_train, y_pred), index=labels, columns=labels)
cm

Unnamed: 0,did not survive,survive
did not survive,279,28
survive,62,128


In [99]:
#creating a classification report
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.82      0.91      0.86       307
           1       0.82      0.67      0.74       190

    accuracy                           0.82       497
   macro avg       0.82      0.79      0.80       497
weighted avg       0.82      0.82      0.81       497



## Exercise 3

Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [100]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

In [101]:
#identifying and saving the confusion matrix variables,
#treating 'did not survive' (row and column 0) as the positive class
TP = cm.iloc[0,0]
FN = cm.iloc[0,1]
FP = cm.iloc[1,0]
TN = cm.iloc[1,1]
TP, FN, FP, TN

(279, 28, 62, 128)

In [102]:
#creating labels for the classification report
target_names = ['did not survive', 'survive']

#creating the classification report
x = classification_report(y_train, y_pred, target_names=target_names, output_dict=True)

#saving the report as a dataframe
class_report = pd.DataFrame(x)
class_report

Unnamed: 0,did not survive,survive,accuracy,macro avg,weighted avg
precision,0.818182,0.820513,0.818913,0.819347,0.819073
recall,0.908795,0.673684,0.818913,0.791239,0.818913
f1-score,0.861111,0.739884,0.818913,0.800498,0.814767
support,307.0,190.0,0.818913,497.0,497.0


In [103]:
#True pos rate
TP_rate = round(TP / (TP + FN),3)
#False pos rate
FP_rate = round(FP / (FP + TN),3)
#True neg rate
TN_rate = round(TN / (TN + FP),3)
#False neg rate
FN_rate = round(FN / (FN + TP),3)

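#sklearn's scorers default to pos_label=1 ('survive'), so precision,
#recall, and f1 below describe the 'survive' class
accuracy = round(accuracy_score(y_true = y_train, y_pred = y_pred),3)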
precision = round(precision_score(y_true = y_train, y_pred = y_pred),3)
recall = round(recall_score(y_true = y_train, y_pred = y_pred),3)
f1score = round(f1_score(y_true = y_train, y_pred = y_pred),3)

In [104]:
print(f'True Pos Rate: {TP_rate}')
print(f'False Pos Rate: {FP_rate}')
print(f'True Neg Rate: {TN_rate}')
print(f'False Neg Rate: {FN_rate}')

print('\n')
print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1-score: {f1score}')

True Pos Rate: 0.909
False Pos Rate: 0.326
True Neg Rate: 0.674
False Neg Rate: 0.091


Accuracy: 0.819
Precision: 0.821
Recall: 0.674
F1-score: 0.74
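The rates above take row 0 ('did not survive') as the positive class, while `precision_score` and `recall_score` default to `pos_label=1`, which is why the true positive rate (0.909) and recall (0.674) differ. One way to keep them consistent is to unpack the sklearn confusion matrix with `ravel()` and treat label 1 as positive. This is a sketch on made-up labels, and `rates_from_labels` is a hypothetical helper name.

```python
from sklearn.metrics import confusion_matrix


def rates_from_labels(y_true, y_hat):
    """Return TPR, FPR, TNR and FNR, treating label 1 as the positive class."""
    # With labels=[0, 1], ravel() yields the counts in the order tn, fp, fn, tp.
    tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()
    return {
        'true_positive_rate': tp / (tp + fn),
        'false_positive_rate': fp / (fp + tn),
        'true_negative_rate': tn / (tn + fp),
        'false_negative_rate': fn / (fn + tp),
    }


# Made-up labels, not the titanic data.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_hat = [0, 0, 1, 1, 0, 0, 1, 0]

rates = rates_from_labels(y_true, y_hat)
for name, value in rates.items():
    print(f'{name}: {value:.3f}')
```

With this convention the true positive rate equals the recall reported by `recall_score` for the 'survive' class.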


## Exercise 4

Run through steps 2-4 using a different max_depth value.

In [105]:
clf2 = DecisionTreeClassifier(max_depth=9, random_state=123)

In [106]:
clf2.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=9, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=123, splitter='best')

In [107]:
y_pred2 = clf2.predict(X_train)
y_pred_proba2 = clf2.predict_proba(X_train)

In [108]:
print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf2.score(X_train, y_train)))

Accuracy of Decision Tree classifier on training set: 0.91


In [109]:
cm2 = pd.DataFrame(confusion_matrix(y_train, y_pred2), index=labels, columns=labels)
cm2

Unnamed: 0,did not survive,survive
did not survive,294,13
survive,30,160


In [110]:
TP = cm2.iloc[0,0]
FN = cm2.iloc[0,1]
FP = cm2.iloc[1,0]
TN = cm2.iloc[1,1]
TP, FP, FN, TN

(294, 30, 13, 160)

In [111]:
x2 = classification_report(y_train, y_pred2, target_names=target_names, output_dict=True)
class_report2 = pd.DataFrame(x2)
class_report2

Unnamed: 0,did not survive,survive,accuracy,macro avg,weighted avg
precision,0.907407,0.924855,0.913481,0.916131,0.914078
recall,0.957655,0.842105,0.913481,0.89988,0.913481
f1-score,0.931854,0.881543,0.913481,0.906698,0.91262
support,307.0,190.0,0.913481,497.0,497.0


In [112]:
TP_rate = round(TP / (TP + FN),3)
FP_rate = round(FP / (FP + TN),3)
TN_rate = round(TN / (TN + FP),3)
FN_rate = round(FN / (FN + TP),3)
accuracy = round(accuracy_score(y_true = y_train, y_pred = y_pred2),3)
precision = round(precision_score(y_true = y_train, y_pred = y_pred2),3)
recall = round(recall_score(y_true = y_train, y_pred = y_pred2),3)
f1score = round(f1_score(y_true = y_train, y_pred = y_pred2),3)

In [113]:
print(f'True Pos Rate: {TP_rate}')
print(f'False Pos Rate: {FP_rate}')
print(f'True Neg Rate: {TN_rate}')
print(f'False Neg Rate: {FN_rate}')

print('\n')
print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1-score: {f1score}')

True Pos Rate: 0.958
False Pos Rate: 0.158
True Neg Rate: 0.842
False Neg Rate: 0.042


Accuracy: 0.913
Precision: 0.925
Recall: 0.842
F1-score: 0.882


## Exercise 5

Which performs better on your in-sample data?

The decision tree with the higher max_depth value performed better on my in-sample data, but a deeper tree is more likely to overfit, so it may not perform as well on other datasets.
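Whether the deeper tree is overfitting can be checked by scoring each depth on a held-out split as well as on the training split. The sketch below uses synthetic data from `make_classification` as a stand-in; in this notebook the train and validate splits would be used instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the titanic data).
X, y = make_classification(n_samples=700, n_features=5, n_informative=3,
                           n_redundant=0, flip_y=0.1, random_state=123)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y)

depth_scores = []
for depth in [3, 5, 7, 9, 15]:
    # Fit on the training split only, then score both splits with the same model.
    tree = DecisionTreeClassifier(max_depth=depth, random_state=123)
    tree.fit(X_tr, y_tr)
    train_acc = tree.score(X_tr, y_tr)
    val_acc = tree.score(X_val, y_val)
    depth_scores.append((depth, train_acc, val_acc))
    print(f'max_depth={depth}: train={train_acc:.3f}, '
          f'validate={val_acc:.3f}, gap={train_acc - val_acc:.3f}')
```

A gap that widens as max_depth grows suggests the extra depth is memorizing the training data rather than learning something that generalizes.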

In [258]:
# creating a function to fit and evaluate decision trees more quickly
def decision_tree(X_train, y_train, depth_number):
    
    #setting max depth number for DecisionTreeClassifier
    clf = DecisionTreeClassifier(max_depth=depth_number, random_state=123)
    
    #fitting the model to the data
    clf.fit(X_train, y_train)
    #'clf.predict' predicts class labels for the samples in the parentheses
    y_pred = clf.predict(X_train)
    #'predict_proba' predicts probability estimates
    y_pred_proba = clf.predict_proba(X_train)
    #creating a confusion matrix and storing it in a DataFrame
    cm = pd.DataFrame(confusion_matrix(y_train, y_pred))
    #renaming the single target column to 'label' (rename returns a new DataFrame)
    label = y_train.rename(columns={y_train.columns[0]:'label'})
    #creating labels out of the unique values of the target
    labels = sorted(label.label.unique())
    #creating a classification report and saving it as a DataFrame
    class_report = pd.DataFrame(classification_report(y_train, y_pred, target_names=labels, output_dict=True))
    
    return cm, class_report

In [116]:
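def confusion_matrix_rates(cm):
    #note: on a DataFrame, cm[0][1] selects column 0 first and then row 1,
    #so FN and FP are read from transposed positions compared with the
    #cm.iloc[row, column] indexing used earlier in this notebook
    TP = cm[0][0]
    FN = cm[0][1]
    FP = cm[1][0]
    TN = cm[1][1]
    TPrate = round(TP / (TP + FN),3)
    FPrate = round(FP / (FP + TN),3)
    TNrate = round(TN / (TN + FP),3)
    FNrate = round(FN / (FN + TP),3)
    return TPrate, FPrate, FNrate, TNrate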
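The difference between `cm[0][1]` and `cm.iloc[0, 1]` on a DataFrame is easy to miss, so here is a small demonstration. The counts in `cm_demo` are example values laid out the way sklearn returns them (rows are actual classes, columns are predicted classes).

```python
import pandas as pd

# Example counts: rows = actual class, columns = predicted class.
cm_demo = pd.DataFrame([[297, 10], [3, 187]])

# Chained brackets select the COLUMN labeled 0 first, then the row labeled 1.
by_label = cm_demo[0][1]

# iloc indexes by position as (row, column).
by_position = cm_demo.iloc[0, 1]

print(f'cm_demo[0][1]      = {by_label}')
print(f'cm_demo.iloc[0, 1] = {by_position}')
```

Because the two forms read opposite corners of the matrix, rates computed with chained brackets swap the roles of the false positives and false negatives relative to rates computed with `iloc`.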

In [254]:
decision_tree(X_train, y_train, 3)

(     0   1
 0  289  18
 1  129  61,
                     0           1  accuracy   macro avg  weighted avg
 precision    0.691388    0.772152  0.704225    0.731770      0.722263
 recall       0.941368    0.321053  0.704225    0.631210      0.704225
 f1-score     0.797241    0.453532  0.704225    0.625386      0.665843
 support    307.000000  190.000000  0.704225  497.000000    497.000000)

In [255]:
decision_tree(X_train, y_train, 5)

(     0   1
 0  290  17
 1  114  76,
                     0           1  accuracy   macro avg  weighted avg
 precision    0.717822    0.817204  0.736419    0.767513      0.755815
 recall       0.944625    0.400000  0.736419    0.672313      0.736419
 f1-score     0.815752    0.537102  0.736419    0.676427      0.709226
 support    307.000000  190.000000  0.736419  497.000000    497.000000)

In [256]:
decision_tree(X_train, y_train, 7)

(     0    1
 0  287   20
 1   86  104,
                     0           1  accuracy   macro avg  weighted avg
 precision    0.769437    0.838710   0.78672    0.804073      0.795920
 recall       0.934853    0.547368   0.78672    0.741111      0.786720
 f1-score     0.844118    0.662420   0.78672    0.753269      0.774656
 support    307.000000  190.000000   0.78672  497.000000    497.000000)

In [257]:
decision_tree(X_train, y_train, 10)

(     0    1
 0  256   51
 1   25  165,
                     0           1  accuracy   macro avg  weighted avg
 precision    0.911032    0.763889  0.847082    0.837460      0.854780
 recall       0.833876    0.868421  0.847082    0.851149      0.847082
 f1-score     0.870748    0.812808  0.847082    0.841778      0.848598
 support    307.000000  190.000000  0.847082  497.000000    497.000000)

In [259]:
decision_tree(X_train, y_train, 15)

(     0    1
 0  298    9
 1   20  170,
                     0           1  accuracy   macro avg  weighted avg
 precision    0.937107    0.949721   0.94165    0.943414      0.941929
 recall       0.970684    0.894737   0.94165    0.932710      0.941650
 f1-score     0.953600    0.921409   0.94165    0.937505      0.941294
 support    307.000000  190.000000   0.94165  497.000000    497.000000)

In [120]:
# dot_data = export_graphviz(model, feature_names= X.columns, class_names= {0:'not survived', 1:'survived'}, rounded=True, filled=True, out_file=None)

# Random Forest Exercises

Continue working in your model file. Be sure to add, commit, and push your changes.

In [121]:
from sklearn.ensemble import RandomForestClassifier

In [187]:
X_train = train[['pclass', 'age', 'fare', 'sibsp', 'parch']]
y_train = train[['survived']]
X_validate = validate[['pclass', 'age', 'fare', 'sibsp', 'parch']]
y_validate = validate[['survived']]
X_test = test[['pclass', 'age', 'fare', 'sibsp', 'parch']]
y_test = test[['survived']]

## Exercise 1
Fit the Random Forest classifier to your training sample and transform (i.e. make predictions on the training sample) setting the random_state accordingly and setting min_samples_leaf = 1 and max_depth = 20.

In [227]:
rf = RandomForestClassifier(min_samples_leaf= 1, max_depth = 20, random_state = 123)

In [228]:
rf.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=20, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=123,
                       verbose=0, warm_start=False)

In [229]:
y_pred = rf.predict(X_train)
y_pred_proba = rf.predict_proba(X_train)
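`y_pred_proba` is computed but not used afterwards, so it may help to see what it contains. The sketch below, on synthetic stand-in data, shows that `predict_proba` returns one column per class in the order of `classes_`, and that taking the most probable class reproduces `predict` for a random forest.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption, not the titanic data).
X_demo, y_demo = make_classification(n_samples=300, n_features=5,
                                     n_informative=3, n_redundant=0,
                                     random_state=123)

forest = RandomForestClassifier(min_samples_leaf=1, max_depth=20,
                                random_state=123)
forest.fit(X_demo, y_demo)

# One row per sample, one column per class; each row sums to 1.
proba = forest.predict_proba(X_demo)
print(proba.shape, forest.classes_)

# Picking the class with the highest probability gives the predicted label.
labels_from_proba = forest.classes_[np.argmax(proba, axis=1)]
print((labels_from_proba == forest.predict(X_demo)).all())
```

The probabilities become useful when a decision threshold other than the default needs to be explored, for example by comparing `proba[:, 1]` against a chosen cutoff.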

## Exercise 2
Evaluate your results using the model score, confusion matrix, and classification report.

In [230]:
print('Accuracy of random forest classifier on training set: {:.2f}'
     .format(rf.score(X_train, y_train)))

Accuracy of random forest classifier on training set: 0.97


In [231]:
cmrf = pd.DataFrame(confusion_matrix(y_train, y_pred))
cmrf

Unnamed: 0,0,1
0,297,10
1,3,187


In [232]:
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.99      0.97      0.98       307
           1       0.95      0.98      0.97       190

    accuracy                           0.97       497
   macro avg       0.97      0.98      0.97       497
weighted avg       0.97      0.97      0.97       497



## Exercise 3
Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [233]:
TPrate, FPrate, FNrate, TNrate = confusion_matrix_rates(cmrf)

In [234]:
accuracy = round(accuracy_score(y_true = y_train, y_pred = y_pred),3)
precision = round(precision_score(y_true = y_train, y_pred = y_pred),3)
recall = round(recall_score(y_true = y_train, y_pred = y_pred),3)
f1score = round(f1_score(y_true = y_train, y_pred = y_pred),3)

In [235]:
print(f'True Pos Rate: {TPrate}')
print(f'False Pos Rate: {FPrate}')
print(f'True Neg Rate: {TNrate}')
print(f'False Neg Rate: {FNrate}')

print('\n')
print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1-score: {f1score}')

True Pos Rate: 0.99
False Pos Rate: 0.051
True Neg Rate: 0.949
False Neg Rate: 0.01


Accuracy: 0.974
Precision: 0.949
Recall: 0.984
F1-score: 0.966


## Exercise 4
Run through steps increasing your min_samples_leaf to 5 and decreasing your max_depth to 3.

In [164]:
rf2 = RandomForestClassifier(min_samples_leaf= 5, max_depth = 3, random_state = 123)

In [165]:
rf2.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=3, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=5, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=123,
                       verbose=0, warm_start=False)

In [166]:
y_pred = rf2.predict(X_train)
y_pred_proba = rf2.predict_proba(X_train)

In [167]:
print('Accuracy of random forest classifier on training set: {:.2f}'
     .format(rf2.score(X_train, y_train)))

Accuracy of random forest classifier on training set: 0.82


In [168]:
cmrf2 = pd.DataFrame(confusion_matrix(y_train, y_pred))
cmrf2

Unnamed: 0,0,1
0,286,21
1,70,120


In [169]:
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.80      0.93      0.86       307
           1       0.85      0.63      0.73       190

    accuracy                           0.82       497
   macro avg       0.83      0.78      0.79       497
weighted avg       0.82      0.82      0.81       497



In [170]:
TPrate2, FPrate2, FNrate2, TNrate2 = confusion_matrix_rates(cmrf2)

In [171]:
accuracy = round(accuracy_score(y_true = y_train, y_pred = y_pred),3)
precision = round(precision_score(y_true = y_train, y_pred = y_pred),3)
recall = round(recall_score(y_true = y_train, y_pred = y_pred),3)
f1score = round(f1_score(y_true = y_train, y_pred = y_pred),3)

In [172]:
print(f'True Pos Rate: {TPrate2}')
print(f'False Pos Rate: {FPrate2}')
print(f'True Neg Rate: {TNrate2}')
print(f'False Neg Rate: {FNrate2}')

print('\n')
print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1-score: {f1score}')

True Pos Rate: 0.803
False Pos Rate: 0.149
True Neg Rate: 0.851
False Neg Rate: 0.197


Accuracy: 0.817
Precision: 0.851
Recall: 0.632
F1-score: 0.725


## Exercise 5
What are the differences in the evaluation metrics? Which performs better on your in-sample data? Why?

The 'min_samples_leaf' is the minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least 'min_samples_leaf' training samples in each of the left and right branches. The first model has 'min_samples_leaf' set to 1 and the second model has it set to 5. In the second model, any split that would leave fewer than 5 samples in a leaf is rejected, so its trees fit the training data less closely than those of the model set at 1.

The 'max_depth' is the maximum depth of each tree. A tree keeps splitting until its leaves are pure, no split satisfies 'min_samples_leaf', or it reaches the max_depth. Since the max_depth of the first model is set at 20, its trees can have many more branches and fit the training data more closely than the second model, which is set at 3.

The first model performs better on the in-sample data, most likely because it is overfitting the training data. With a higher 'max_depth' and lower 'min_samples_leaf' it makes better predictions on the training set, but those gains may apply only to the training set and not carry over to data the model has not seen.
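The effect of the two hyperparameters on tree size can be inspected directly. This is a sketch on synthetic stand-in data that fits both settings and reports how deep and how leafy the individual trees in each forest turn out; the `settings` and `summary` names are only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption, not the titanic data).
X_demo, y_demo = make_classification(n_samples=500, n_features=5,
                                     n_informative=3, n_redundant=0,
                                     flip_y=0.1, random_state=123)

settings = {
    'deep': dict(min_samples_leaf=1, max_depth=20),
    'shallow': dict(min_samples_leaf=5, max_depth=3),
}

summary = {}
for name, params in settings.items():
    forest = RandomForestClassifier(random_state=123, **params)
    forest.fit(X_demo, y_demo)
    # estimators_ holds the fitted decision trees that make up the forest.
    depths = [tree.get_depth() for tree in forest.estimators_]
    leaves = [tree.get_n_leaves() for tree in forest.estimators_]
    summary[name] = {
        'max_tree_depth': max(depths),
        'mean_leaves': sum(leaves) / len(leaves),
        'train_accuracy': forest.score(X_demo, y_demo),
    }
    print(name, summary[name])
```

The shallow setting caps every tree at depth 3, so no tree can have more than 8 leaves, while the deep setting lets trees grow until their leaves are nearly pure.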

After making a few models, which one has the best performance (or closest metrics) on both train and validate?

### Model 1

In [176]:
random_forest(X_train, y_train, 5, 3)

(     0    1
 0  286   21
 1   70  120,
                     0           1  accuracy   macro avg  weighted avg
 precision    0.803371    0.851064  0.816901    0.827217      0.821604
 recall       0.931596    0.631579  0.816901    0.781588      0.816901
 f1-score     0.862745    0.725076  0.816901    0.793910      0.810115
 support    307.000000  190.000000  0.816901  497.000000    497.000000)

In [188]:
random_forest(X_validate, y_validate, 5, 3)

(     0   1
 0  124   8
 1   39  43,
                     0          1  accuracy   macro avg  weighted avg
 precision    0.760736   0.843137  0.780374    0.801937      0.792310
 recall       0.939394   0.524390  0.780374    0.731892      0.780374
 f1-score     0.840678   0.646617  0.780374    0.743647      0.766318
 support    132.000000  82.000000  0.780374  214.000000    214.000000)

### Model 2

In [177]:
random_forest(X_train, y_train, 5, 5)

(     0    1
 0  291   16
 1   59  131,
                     0           1  accuracy   macro avg  weighted avg
 precision    0.831429    0.891156  0.849095    0.861293      0.854262
 recall       0.947883    0.689474  0.849095    0.818678      0.849095
 f1-score     0.885845    0.777448  0.849095    0.831646      0.844405
 support    307.000000  190.000000  0.849095  497.000000    497.000000)

In [189]:
random_forest(X_validate, y_validate, 5, 5)

(     0   1
 0  123   9
 1   36  46,
                     0          1  accuracy   macro avg  weighted avg
 precision    0.773585   0.836364   0.78972    0.804974      0.797640
 recall       0.931818   0.560976   0.78972    0.746397      0.789720
 f1-score     0.845361   0.671533   0.78972    0.758447      0.778754
 support    132.000000  82.000000   0.78972  214.000000    214.000000)

### Model 3

In [178]:
random_forest(X_train, y_train, 5, 10)

(     0    1
 0  293   14
 1   48  142,
                     0           1  accuracy   macro avg  weighted avg
 precision    0.859238    0.910256  0.875252    0.884747      0.878742
 recall       0.954397    0.747368  0.875252    0.850883      0.875252
 f1-score     0.904321    0.820809  0.875252    0.862565      0.872395
 support    307.000000  190.000000  0.875252  497.000000    497.000000)

In [190]:
random_forest(X_validate, y_validate, 5, 10)

(     0   1
 0  122  10
 1   27  55,
                     0          1  accuracy   macro avg  weighted avg
 precision    0.818792   0.846154  0.827103    0.832473      0.829276
 recall       0.924242   0.670732  0.827103    0.797487      0.827103
 f1-score     0.868327   0.748299  0.827103    0.808313      0.822335
 support    132.000000  82.000000  0.827103  214.000000    214.000000)

### Model 4

In [182]:
random_forest(X_train, y_train, 5, 15)

(     0    1
 0  293   14
 1   45  145,
                     0           1  accuracy   macro avg  weighted avg
 precision    0.866864    0.911950  0.881288    0.889407      0.884100
 recall       0.954397    0.763158  0.881288    0.858778      0.881288
 f1-score     0.908527    0.830946  0.881288    0.869736      0.878868
 support    307.000000  190.000000  0.881288  497.000000    497.000000)

In [191]:
random_forest(X_validate, y_validate, 5, 15)

(     0   1
 0  122  10
 1   27  55,
                     0          1  accuracy   macro avg  weighted avg
 precision    0.818792   0.846154  0.827103    0.832473      0.829276
 recall       0.924242   0.670732  0.827103    0.797487      0.827103
 f1-score     0.868327   0.748299  0.827103    0.808313      0.822335
 support    132.000000  82.000000  0.827103  214.000000    214.000000)

### Model 5

In [179]:
random_forest(X_train, y_train, 3, 3)

(     0    1
 0  290   17
 1   66  124,
                     0           1  accuracy   macro avg  weighted avg
 precision    0.814607    0.879433  0.832998    0.847020      0.839389
 recall       0.944625    0.652632  0.832998    0.798628      0.832998
 f1-score     0.874811    0.749245  0.832998    0.812028      0.826808
 support    307.000000  190.000000  0.832998  497.000000    497.000000)

In [193]:
random_forest(X_validate, y_validate, 3, 3)

(     0   1
 0  126   6
 1   39  43,
                     0          1  accuracy   macro avg  weighted avg
 precision    0.763636   0.877551   0.78972    0.820594      0.807286
 recall       0.954545   0.524390   0.78972    0.739468      0.789720
 f1-score     0.848485   0.656489   0.78972    0.752487      0.774916
 support    132.000000  82.000000   0.78972  214.000000    214.000000)

### Model 6

In [184]:
random_forest(X_train, y_train, 10, 5)

(     0    1
 0  294   13
 1   70  120,
                     0           1  accuracy   macro avg  weighted avg
 precision    0.807692    0.902256  0.832998    0.854974      0.843843
 recall       0.957655    0.631579  0.832998    0.794617      0.832998
 f1-score     0.876304    0.743034  0.832998    0.809669      0.825356
 support    307.000000  190.000000  0.832998  497.000000    497.000000)

In [194]:
random_forest(X_validate, y_validate, 10, 5)

(     0   1
 0  121  11
 1   41  41,
                     0          1  accuracy   macro avg  weighted avg
 precision    0.746914   0.788462  0.757009    0.767688      0.762834
 recall       0.916667   0.500000  0.757009    0.708333      0.757009
 f1-score     0.823129   0.611940  0.757009    0.717535      0.742206
 support    132.000000  82.000000  0.757009  214.000000    214.000000)

### Model 7

In [185]:
random_forest(X_train, y_train, 10, 10)

(     0    1
 0  292   15
 1   62  128,
                     0           1  accuracy   macro avg  weighted avg
 precision    0.824859    0.895105   0.84507    0.859982      0.851713
 recall       0.951140    0.673684   0.84507    0.812412      0.845070
 f1-score     0.883510    0.768769   0.84507    0.826139      0.839645
 support    307.000000  190.000000   0.84507  497.000000    497.000000)

In [195]:
random_forest(X_validate, y_validate, 10, 10)

(     0   1
 0  121  11
 1   40  42,
                     0          1  accuracy   macro avg  weighted avg
 precision    0.751553   0.792453  0.761682    0.772003      0.767225
 recall       0.916667   0.512195  0.761682    0.714431      0.761682
 f1-score     0.825939   0.622222  0.761682    0.724080      0.747879
 support    132.000000  82.000000  0.761682  214.000000    214.000000)

- Model 1: .817-.780 = .037
- Model 2: .849-.790 = .059
- Model 3: .875-.827 = .048
- Model 4: .881-.827 = .054
- Model 5: .833-.790 = .043
- Model 6: .833-.757 = .076
- Model 7: .845-.762 = .083

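Model 1 had the closest train and validate accuracy, with a difference of 3.7%. Other models performed better on the training set but had a greater discrepancy on the validate set. Note that random_forest fits and scores on whatever data it is given, so the validate results above come from models refit on the validate set rather than from the models fit on train.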
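A comparison that fits once on train and then scores that same model on validate could be sketched as follows. `fit_and_compare` is a hypothetical helper, and the data is a synthetic stand-in for the titanic splits.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def fit_and_compare(X_fit, y_fit, X_val, y_val, min_sample, maximum_depth):
    """Fit on the training split only, then score the same model on both splits."""
    rf = RandomForestClassifier(min_samples_leaf=min_sample,
                                max_depth=maximum_depth, random_state=123)
    rf.fit(X_fit, y_fit)
    train_acc = rf.score(X_fit, y_fit)
    val_acc = rf.score(X_val, y_val)
    return {'train': train_acc, 'validate': val_acc, 'gap': train_acc - val_acc}


# Synthetic stand-in data (an assumption, not the titanic data).
X, y = make_classification(n_samples=700, n_features=5, n_informative=3,
                           n_redundant=0, flip_y=0.1, random_state=123)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y)

result = fit_and_compare(X_tr, y_tr, X_val, y_val, 5, 3)
print(result)
```

Scoring the train-fitted model on validate is what shows how well the chosen hyperparameters generalize; refitting on validate only measures in-sample fit on a smaller dataset.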

In [253]:
def random_forest(X_train, y_train, min_sample, maximum_depth):
    #note: the model is fit and scored on the same data that is passed in
    rf = RandomForestClassifier(min_samples_leaf=min_sample, max_depth=maximum_depth, random_state=123)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_train)
    #creating a confusion matrix and storing it in a DataFrame
    cm = pd.DataFrame(confusion_matrix(y_train, y_pred))
    #renaming the single target column to 'label' (rename returns a new DataFrame)
    label = y_train.rename(columns={y_train.columns[0]:'label'})
    #creating labels out of the unique values of the target
    labels = sorted(label.label.unique())
    #creating a classification report and saving it as a DataFrame
    class_report = pd.DataFrame(classification_report(y_train, y_pred, target_names=labels, output_dict=True))
    return cm, class_report

# KNN Exercises

Continue working in your model notebook or python script.

In [196]:
from sklearn.neighbors import KNeighborsClassifier

## Exercise 1
Fit a K-Nearest Neighbors classifier to your training sample and transform (i.e. make predictions on the training sample)

In [200]:
knn = KNeighborsClassifier()

In [201]:
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [202]:
y_pred = knn.predict(X_train)

## Exercise 2
Evaluate your results using the model score, confusion matrix, and classification report.

In [203]:
print('Accuracy of KNN classifier on training set: {:.2f}'
     .format(knn.score(X_train, y_train)))

Accuracy of KNN classifier on training set: 0.77


In [208]:
cmknn = pd.DataFrame(confusion_matrix(y_train, y_pred))
cmknn

Unnamed: 0,0,1
0,256,51
1,63,127


In [209]:
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.80      0.83      0.82       307
           1       0.71      0.67      0.69       190

    accuracy                           0.77       497
   macro avg       0.76      0.75      0.75       497
weighted avg       0.77      0.77      0.77       497



## Exercise 3
Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [210]:
TPrate, FPrate, FNrate, TNrate = confusion_matrix_rates(cmknn)

In [211]:
accuracy = round(accuracy_score(y_true = y_train, y_pred = y_pred),3)
precision = round(precision_score(y_true = y_train, y_pred = y_pred),3)
recall = round(recall_score(y_true = y_train, y_pred = y_pred),3)
f1score = round(f1_score(y_true = y_train, y_pred = y_pred),3)

In [212]:
print(f'True Pos Rate: {TPrate}')
print(f'False Pos Rate: {FPrate}')
print(f'True Neg Rate: {TNrate}')
print(f'False Neg Rate: {FNrate}')

print('\n')
print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1-score: {f1score}')

True Pos Rate: 0.803
False Pos Rate: 0.287
True Neg Rate: 0.713
False Neg Rate: 0.197


Accuracy: 0.771
Precision: 0.713
Recall: 0.668
F1-score: 0.69


## Exercise 4
Run through steps 2-4 setting k to 10

In [239]:
knn10 = KNeighborsClassifier(n_neighbors=10)

In [240]:
knn10.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                     weights='uniform')

In [241]:
y_pred = knn10.predict(X_train)

In [242]:
print('Accuracy of KNN classifier on training set: {:.2f}'
     .format(knn10.score(X_train, y_train)))

Accuracy of KNN classifier on training set: 0.75


In [243]:
cmknn10 = pd.DataFrame(confusion_matrix(y_train, y_pred))
cmknn10

Unnamed: 0,0,1
0,278,29
1,94,96


In [244]:
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.75      0.91      0.82       307
           1       0.77      0.51      0.61       190

    accuracy                           0.75       497
   macro avg       0.76      0.71      0.71       497
weighted avg       0.76      0.75      0.74       497



In [245]:
TPrate, FPrate, FNrate, TNrate = confusion_matrix_rates(cmknn10)

In [246]:
accuracy = round(accuracy_score(y_true = y_train, y_pred = y_pred),3)
precision = round(precision_score(y_true = y_train, y_pred = y_pred),3)
recall = round(recall_score(y_true = y_train, y_pred = y_pred),3)
f1score = round(f1_score(y_true = y_train, y_pred = y_pred),3)

In [247]:
print(f'True Pos Rate: {TPrate}')
print(f'False Pos Rate: {FPrate}')
print(f'True Neg Rate: {TNrate}')
print(f'False Neg Rate: {FNrate}')

print('\n')
print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1-score: {f1score}')

True Pos Rate: 0.747
False Pos Rate: 0.232
True Neg Rate: 0.768
False Neg Rate: 0.253


Accuracy: 0.753
Precision: 0.768
Recall: 0.505
F1-score: 0.61


## Exercise 5
Run through steps 2-4 setting k to 20

### Model 1

In [251]:
kneighbors(X_train, y_train, 3)

(     0    1
 0  269   38
 1   49  141,
                     0           1  accuracy   macro avg  weighted avg
 precision    0.845912    0.787709   0.82495    0.816811      0.823662
 recall       0.876221    0.742105   0.82495    0.809163      0.824950
 f1-score     0.860800    0.764228   0.82495    0.812514      0.823881
 support    307.000000  190.000000   0.82495  497.000000    497.000000)

### Model 2

In [249]:
kneighbors(X_train, y_train, 5)

(     0    1
 0  256   51
 1   63  127,
                     0           1  accuracy   macro avg  weighted avg
 precision    0.802508    0.713483  0.770624    0.757995      0.768474
 recall       0.833876    0.668421  0.770624    0.751149      0.770624
 f1-score     0.817891    0.690217  0.770624    0.754054      0.769082
 support    307.000000  190.000000  0.770624  497.000000    497.000000)

### Model 3

In [250]:
kneighbors(X_train, y_train, 10)

(     0   1
 0  278  29
 1   94  96,
                     0           1  accuracy   macro avg  weighted avg
 precision    0.747312    0.768000  0.752515    0.757656      0.755221
 recall       0.905537    0.505263  0.752515    0.705400      0.752515
 f1-score     0.818851    0.609524  0.752515    0.714188      0.738827
 support    307.000000  190.000000  0.752515  497.000000    497.000000)

## Exercise 6
What are the differences in the evaluation metrics? Which performs better on your in-sample data? Why?

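Model 1 (k=3) has a higher accuracy (0.825) than the other two models (0.771 for k=5 and 0.753 for k=10). Model 1 also has better macro-average precision and recall.

Overall, Model 1 has the best in-sample fit of the KNN models.

With fewer neighbors to consider, each neighbor holds more weight in the vote. Because every training point counts as one of its own nearest neighbors when predicting on the training set, a smaller k fits the in-sample data more closely, although that does not mean it will predict unseen data better.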
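The link between a small k and a high in-sample score can be seen by sweeping k and scoring on a held-out split too. This is a sketch on synthetic stand-in data; with k=1 every training point is its own nearest neighbor, so training accuracy is perfect as long as no two points share identical features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the titanic data).
X, y = make_classification(n_samples=700, n_features=5, n_informative=3,
                           n_redundant=0, flip_y=0.1, random_state=123)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y)

results = {}
for k in [1, 3, 5, 10, 20]:
    # Fit on the training split only, then score both splits.
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_tr, y_tr)
    results[k] = {
        'train': knn.score(X_tr, y_tr),
        'validate': knn.score(X_val, y_val),
    }
    print(f"k={k}: train={results[k]['train']:.3f}, "
          f"validate={results[k]['validate']:.3f}")
```

The k worth keeping is the one that holds up on the validate split, not the one with the highest training accuracy.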

In [219]:
def kneighbors(X_train, y_train, n_neighbor):
    knn = KNeighborsClassifier(n_neighbors=n_neighbor)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_train)
    #creating a confusion matrix and storing it in a DataFrame
    cm = pd.DataFrame(confusion_matrix(y_train, y_pred))
    #renaming the single target column to 'label' (rename returns a new DataFrame)
    label = y_train.rename(columns={y_train.columns[0]:'label'})
    #creating labels out of the unique values of the target
    labels = sorted(label.label.unique())
    #creating a classification report and saving it as a DataFrame
    class_report = pd.DataFrame(classification_report(y_train, y_pred, target_names=labels, output_dict=True))
    return cm, class_report

# Test

For both the iris and the titanic data,

## Exercise 1
Determine which model (with hyperparameters) performs the best (try reducing the number of features to the top 4 features in terms of information gained for each feature individually).

In [None]:
logistic_regression(X_train, y_train)

In [None]:
decision_tree(X_train, y_train, 10)

In [None]:
random_forest(X_train, y_train, 5, 3)

In [None]:
random_forest(X_train, y_train, 5, 15)

In [None]:
kneighbors(X_train, y_train, 3)

## Exercise 2
Create a new dataframe with top 4 features.

In [None]:
X_train_final = train[['parch', 'pclass', 'sex_male', 'alone']]
y_train_final = train[['survived']]
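The notebook does not show how the four columns above were chosen. One possible way to rank features by the information each one carries individually is mutual information, using `mutual_info_classif` from scikit-learn. The sketch below runs on synthetic stand-in data with made-up column names, so the ranking it prints says nothing about the titanic features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic stand-in data with made-up column names (an assumption).
X_arr, y_demo = make_classification(n_samples=500, n_features=8,
                                    n_informative=4, n_redundant=0,
                                    random_state=123)
X_demo = pd.DataFrame(X_arr, columns=[f'feature_{i}' for i in range(8)])

# Estimate the mutual information between each feature and the target.
mi = mutual_info_classif(X_demo, y_demo, random_state=123)

# Rank the features from most to least informative and keep the top 4.
ranking = pd.Series(mi, index=X_demo.columns).sort_values(ascending=False)
top4 = ranking.head(4).index.tolist()

print(ranking)
print(top4)
```

For columns that are binary or categorical, such as encoded dummy variables, the `discrete_features` argument of `mutual_info_classif` can be used to mark them as discrete.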

## Exercise 3
Use the top performing algorithm with the metaparameters used in that model. Create the object, fit, transform on in-sample data, and evaluate the results with the training data. Compare your evaluation metrics with those from the original model (with all the features).

## Exercise 4
Run your final model on your out-of-sample dataframe (test_df). Evaluate the results.
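No answer is recorded for this exercise. As a rough sketch of the workflow, assuming a random forest is the chosen model, the final evaluation fits on train, uses validate for model selection, and scores test exactly once. The data below is a synthetic stand-in and the hyperparameters are placeholders, not a result from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the titanic data).
X, y = make_classification(n_samples=700, n_features=5, n_informative=3,
                           n_redundant=0, flip_y=0.1, random_state=123)

# Split off the test set first, then split the rest into train and validate.
X_rest, X_te, y_rest, y_te = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.3, random_state=123, stratify=y_rest)

# Placeholder hyperparameters standing in for the selected model.
final_model = RandomForestClassifier(min_samples_leaf=5, max_depth=3,
                                     random_state=123)
final_model.fit(X_tr, y_tr)

train_acc = final_model.score(X_tr, y_tr)
val_acc = final_model.score(X_val, y_val)
test_acc = final_model.score(X_te, y_te)
print(f'train={train_acc:.3f}, validate={val_acc:.3f}, test={test_acc:.3f}')

# The test split is only touched here, after the model has been chosen.
print(classification_report(y_te, final_model.predict(X_te)))
```

Keeping the test split out of every earlier comparison is what makes the final score an honest estimate of out-of-sample performance.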