### Decision Tree Model Exercises

Using the titanic data, in your classification-exercises repository, create a notebook, model.ipynb, where you will do the following:

1. What is your baseline prediction? What is your baseline accuracy? Remember: your baseline prediction for a classification problem is the most prevalent class in the training dataset (the mode). When you make those predictions, what is your accuracy? This is your baseline accuracy.

In [63]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import acquire
import prepare
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import export_graphviz
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [65]:
titanic = acquire.get_titanic_data()
titanic.head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1
3,3,1,1,female,35.0,1,0,53.1,S,First,C,Southampton,0
4,4,0,3,male,35.0,0,0,8.05,S,Third,,Southampton,1


What is your baseline prediction?

In [66]:
# baseline prediction is that passengers did not survive (survived == 0).

In [67]:
# prepare titanic
titanic = titanic.drop_duplicates()
cols_to_drop = ['deck', 'embarked', 'class']
titanic = titanic.drop(columns=cols_to_drop)
titanic['embark_town'] = titanic.embark_town.fillna(value='Southampton')
# drop_first takes a single boolean, applied to every encoded column
dummy_titanic = pd.get_dummies(titanic[['sex', 'embark_town']], dummy_na=False, drop_first=True)
titanic = titanic.drop(columns = ['sex', 'embark_town'])
titanic = pd.concat([titanic, dummy_titanic], axis=1)
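
As a small illustration of what the dummy encoding above produces, here is a sketch on a toy frame. The frame `toy` and its four rows are made up for the example; only the column names and category labels come from the Titanic data.

```python
import pandas as pd

# Toy frame (hypothetical rows) with the same two categorical columns.
toy = pd.DataFrame({
    'sex': ['male', 'female', 'female', 'male'],
    'embark_town': ['Southampton', 'Cherbourg', 'Queenstown', 'Southampton'],
})

# drop_first=True drops the alphabetically first level of each column
# ('female' and 'Cherbourg'); those become the implied reference levels.
dummies = pd.get_dummies(toy, drop_first=True)
print(list(dummies.columns))
```

A row with zeros in both `embark_town_*` columns therefore means Cherbourg, and `sex_male == 0` means female.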

In [68]:
titanic.head(3)

Unnamed: 0,passenger_id,survived,pclass,age,sibsp,parch,fare,alone,sex_male,embark_town_Queenstown,embark_town_Southampton
0,0,0,3,22.0,1,0,7.25,0,1,0,1
1,1,1,1,38.0,1,0,71.2833,0,0,0,0
2,2,1,3,26.0,0,0,7.925,1,0,0,1


In [69]:
# there are some nulls in the age column
titanic.isna().sum()

passenger_id                 0
survived                     0
pclass                       0
age                        177
sibsp                        0
parch                        0
fare                         0
alone                        0
sex_male                     0
embark_town_Queenstown       0
embark_town_Southampton      0
dtype: int64

In [70]:
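# fill missing age values with the mean age of the entire dataset
# (note: this is computed before the split, so validate/test rows influence the imputed value)
titanic.age = titanic.age.fillna(value = titanic.age.mean())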

In [71]:
titanic.isna().sum()

passenger_id               0
survived                   0
pclass                     0
age                        0
sibsp                      0
parch                      0
fare                       0
alone                      0
sex_male                   0
embark_town_Queenstown     0
embark_town_Southampton    0
dtype: int64

In [143]:
#split titanic
train, test = train_test_split(titanic, test_size = .2, random_state=123, stratify=titanic.survived)
train, validate = train_test_split(train, test_size=.3, random_state=123, stratify=train.survived)
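
To see how the two-stage split divides the rows, the sketch below repeats the same two calls on a stand-in frame. The frame `demo`, its 891 rows, and the 549/342 class counts are assumptions for the example; the `test_size`, `random_state`, and `stratify` arguments mirror the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data (assumed size and class counts), not the real Titanic frame.
demo = pd.DataFrame({
    'survived': np.r_[np.zeros(549, dtype=int), np.ones(342, dtype=int)],
})
demo['feature'] = np.arange(len(demo))

# First split: hold out 20% as test. Second split: 30% of the remainder as validate.
train_val_demo, test_demo = train_test_split(
    demo, test_size=.2, random_state=123, stratify=demo.survived)
train_demo, validate_demo = train_test_split(
    train_val_demo, test_size=.3, random_state=123, stratify=train_val_demo.survived)

sizes = (len(train_demo), len(validate_demo), len(test_demo))
print(sizes)
```

The resulting proportions are roughly 56% train, 24% validate, and 20% test.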

In [144]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 498 entries, 583 to 744
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   passenger_id             498 non-null    int64  
 1   survived                 498 non-null    int64  
 2   pclass                   498 non-null    int64  
 3   age                      498 non-null    float64
 4   sibsp                    498 non-null    int64  
 5   parch                    498 non-null    int64  
 6   fare                     498 non-null    float64
 7   alone                    498 non-null    int64  
 8   sex_male                 498 non-null    uint8  
 9   embark_town_Queenstown   498 non-null    uint8  
 10  embark_town_Southampton  498 non-null    uint8  
dtypes: float64(2), int64(6), uint8(3)
memory usage: 36.5 KB


In [145]:
train.survived.value_counts()

0    307
1    191
Name: survived, dtype: int64

What is your baseline accuracy?

In [146]:
baseline_accuracy = (0 == train.survived).mean()

print(f'baseline accuracy: {baseline_accuracy:.2%}')

baseline accuracy: 61.65%
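
The cell above hard-codes class 0 as the baseline. A minimal sketch of deriving the baseline from the mode instead is shown here; the series `y` is rebuilt from the value counts printed earlier (307 zeros, 191 ones) so the example runs on its own.

```python
import pandas as pd

# Rebuild the training target from the value counts shown above.
y = pd.Series([0] * 307 + [1] * 191, name='survived')

# The baseline prediction is the most frequent class (the mode).
baseline_prediction = y.mode()[0]

# Baseline accuracy is the share of rows that match that prediction.
baseline_accuracy = (y == baseline_prediction).mean()

print(f'baseline prediction: {baseline_prediction}')
print(f'baseline accuracy: {baseline_accuracy:.2%}')
```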


2. Fit the decision tree classifier to your training sample and transform (i.e. make predictions on the training sample)

In [147]:
X_train = train.drop(columns=['survived'])
y_train = train.survived

X_validate = validate.drop(columns=['survived'])
y_validate = validate.survived

X_test = test.drop(columns=['survived'])
y_test = test.survived

In [148]:
# starting with a max_depth value of 5
titanic_clf = DecisionTreeClassifier(max_depth = 5, random_state = 123)

In [149]:
titanic_clf = titanic_clf.fit(X_train, y_train)

In [150]:
y_pred = titanic_clf.predict(X_train)
#y_pred  ==> array output

3. Evaluate your in-sample results using the model score, confusion matrix, and classification report.

In [151]:
print('Accuracy of Decision Tree classifier on training set: {:.2f}'.format(titanic_clf.score(X_train, y_train)))

Accuracy of Decision Tree classifier on training set: 0.86


In [152]:
c_m = confusion_matrix(y_train, y_pred)
c_m

array([[296,  11],
       [ 58, 133]])

In [154]:
labels = ['Did not survive', 'Survived']
print('Actual on the left, predicted on the top')
pd.DataFrame(c_m, index = labels, columns = labels)

Actual on the left, predicted on the top


Unnamed: 0,Did not survive,Survived
Did not survive,296,11
Survived,58,133


In [155]:
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.84      0.96      0.90       307
           1       0.92      0.70      0.79       191

    accuracy                           0.86       498
   macro avg       0.88      0.83      0.84       498
weighted avg       0.87      0.86      0.86       498



4. Compute: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [156]:
print('Accuracy of Decision Tree classifier on validate set: {:.2f}'.format(titanic_clf.score(X_validate, y_validate)))

Accuracy of Decision Tree classifier on validate set: 0.77


In [157]:
y_pred = titanic_clf.predict(X_validate)
print(classification_report(y_validate, y_pred))

              precision    recall  f1-score   support

           0       0.79      0.86      0.82       132
           1       0.74      0.62      0.68        82

    accuracy                           0.77       214
   macro avg       0.76      0.74      0.75       214
weighted avg       0.77      0.77      0.77       214
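
Exercise 4 also asks for the individual rates, which the classification report does not print directly. The sketch below derives them from a confusion matrix; it uses the training-set matrix shown in step 3 (`[[296, 11], [58, 133]]`), treats class 1 (survived) as the positive class, and the dictionary name `metrics` is just for illustration.

```python
import numpy as np

# Training-set confusion matrix from step 3: rows are actual, columns are predicted.
cm = np.array([[296, 11],
               [58, 133]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)

metrics = {
    'accuracy': (tp + tn) / cm.sum(),
    'true_positive_rate': recall,
    'false_positive_rate': fp / (fp + tn),
    'true_negative_rate': tn / (tn + fp),
    'false_negative_rate': fn / (fn + tp),
    'precision': precision,
    'recall': recall,
    'f1_score': 2 * precision * recall / (precision + recall),
    'support_survived': tp + fn,          # actual class 1 rows
    'support_did_not_survive': tn + fp,   # actual class 0 rows
}

for name, value in metrics.items():
    print(f'{name}: {value:.2f}')
```

The same calculation applies to the validate set once its confusion matrix is computed with `confusion_matrix(y_validate, y_pred)`.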



5. Run through steps 2-4 using a different max_depth value.

In [162]:
for i in range(5, 20):
    print('max_depth value: {}'.format(i))
    titanic_clf = DecisionTreeClassifier(max_depth = i, random_state = 123)
    titanic_clf = titanic_clf.fit(X_train, y_train)
    y_pred = titanic_clf.predict(X_train)
    print('Accuracy of Decision Tree classifier on training set: {:.2f}'.format(titanic_clf.score(X_train, y_train)))
    print(classification_report(y_train, y_pred))
    print('Accuracy of Decision Tree classifier on validate set: {:.2f}'.format(titanic_clf.score(X_validate, y_validate)))
    y_pred = titanic_clf.predict(X_validate)
    print(classification_report(y_validate, y_pred))
    print('-------------------------------------------------')



max_depth value: 5
Accuracy of Decision Tree classifier on training set: 0.86
              precision    recall  f1-score   support

           0       0.84      0.96      0.90       307
           1       0.92      0.70      0.79       191

    accuracy                           0.86       498
   macro avg       0.88      0.83      0.84       498
weighted avg       0.87      0.86      0.86       498

Accuracy of Decision Tree classifier on validate set: 0.77
              precision    recall  f1-score   support

           0       0.79      0.86      0.82       132
           1       0.74      0.62      0.68        82

    accuracy                           0.77       214
   macro avg       0.76      0.74      0.75       214
weighted avg       0.77      0.77      0.77       214

---------------------------------------
max_depth value: 6
Accuracy of Decision Tree classifier on training set: 0.88
              precision    recall  f1-score   support

           0       0.87      0.94   

6. Which model performs better on your in-sample data?

In [None]:
# In-sample accuracy keeps rising as max_depth increases,
# but that is probably due to overfitting: deeper trees memorize the training data.

7. Which model performs best on your out-of-sample data, the validate set?

In [None]:
# depth values 6 and 8 perform the best
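
One way to compare depths without reading through every classification report is to collect train and validate accuracy in a single table. This is a sketch on synthetic data from `make_classification` (an assumption, so that it runs without the Titanic files); the names `X_tr`, `X_val`, and `results` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the Titanic dataset.
X, y = make_classification(n_samples=700, n_features=10, random_state=123)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=.3, random_state=123, stratify=y)

rows = []
for depth in range(1, 11):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=123).fit(X_tr, y_tr)
    train_acc = clf.score(X_tr, y_tr)
    val_acc = clf.score(X_val, y_val)
    rows.append({
        'max_depth': depth,
        'train_accuracy': train_acc,
        'validate_accuracy': val_acc,
        'difference': train_acc - val_acc,  # a large gap suggests overfitting
    })

results = pd.DataFrame(rows)
print(results.round(2))
```

A depth with high validate accuracy and a small difference is usually a safer choice than the depth with the best training score.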

### Random Forest Model Exercises

Continue working in your `model` file with titanic data to do the following: 

1. Fit the Random Forest classifier to your training sample and transform (i.e. make predictions on the training sample) setting the random_state accordingly and setting min_samples_leaf = 1 and max_depth = 10.

In [196]:
X_train = train.drop(columns=['survived'])
y_train = train.survived

X_validate = validate.drop(columns=['survived'])
y_validate = validate.survived

X_test = test.drop(columns=['survived'])
y_test = test.survived

In [197]:
rf = RandomForestClassifier(max_depth = 10,
                           random_state = 123, 
                           min_samples_leaf = 1)

In [198]:
rf.fit(X_train, y_train)

RandomForestClassifier(max_depth=10, random_state=123)

In [199]:
print(rf.feature_importances_)

[0.16359522 0.08740554 0.15093152 0.04708349 0.02672832 0.18238905
 0.01822827 0.28960813 0.01169783 0.02233262]


In [200]:
X_train.columns

Index(['passenger_id', 'pclass', 'age', 'sibsp', 'parch', 'fare', 'alone',
       'sex_male', 'embark_town_Queenstown', 'embark_town_Southampton'],
      dtype='object')
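
The importances are easier to read when paired with the column names. The sketch below copies the two outputs above into literals so it runs on its own; in the notebook, `pd.Series(rf.feature_importances_, index=X_train.columns)` would give the same series.

```python
import pandas as pd

# Values copied from the rf.feature_importances_ output above.
importances = [0.16359522, 0.08740554, 0.15093152, 0.04708349, 0.02672832,
               0.18238905, 0.01822827, 0.28960813, 0.01169783, 0.02233262]

# Column order copied from X_train.columns above.
columns = ['passenger_id', 'pclass', 'age', 'sibsp', 'parch', 'fare', 'alone',
           'sex_male', 'embark_town_Queenstown', 'embark_town_Southampton']

ranked = pd.Series(importances, index=columns).sort_values(ascending=False)
print(ranked)
```

`sex_male` ranks first, followed by `fare`. `passenger_id` comes third, which suggests the forest is splitting on an identifier; dropping that column before modeling may be worth trying.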

In [255]:
y_pred = rf.predict(X_train)
y_pred

array([0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1,
       0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,

In [236]:
print('Accuracy of random forest classifier on training set: {:.2f}'.format(rf.score(X_train, y_train)))

Accuracy of random forest classifier on training set: 0.98


In [237]:
# 0 = did not survive, 1 = survived
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.97      1.00      0.98       307
           1       1.00      0.94      0.97       191

    accuracy                           0.98       498
   macro avg       0.98      0.97      0.98       498
weighted avg       0.98      0.98      0.98       498



2. Evaluate your results using the model score, confusion matrix, and classification report.

In [252]:
print('Accuracy of random forest classifier on training set: {:.2f}'.format(rf.score(X_train, y_train)))

Accuracy of random forest classifier on training set: 0.98


In [256]:
c_m = confusion_matrix(y_train, y_pred)
c_m

array([[307,   0],
       [ 11, 180]])

In [257]:
labels = ['Did not survive', 'Survived']
print('Actual on the left, predicted on the top')
pd.DataFrame(c_m, index = labels, columns = labels)

Actual on the left, predicted on the top


Unnamed: 0,Did not survive,Survived
Did not survive,307,0
Survived,11,180


In [258]:
y_pred = rf.predict(X_train)
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.97      1.00      0.98       307
           1       1.00      0.94      0.97       191

    accuracy                           0.98       498
   macro avg       0.98      0.97      0.98       498
weighted avg       0.98      0.98      0.98       498



3. Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [259]:
print('Accuracy of random forest classifier on validate set: {:.2f}'.format(rf.score(X_validate, y_validate)))

Accuracy of random forest classifier on validate set: 0.80


In [261]:
y_pred = rf.predict(X_validate)
print(confusion_matrix(y_validate, y_pred))

[[116  16]
 [ 26  56]]


In [263]:
labels = ['Did not survive', 'Survived']
print('Actual on the left, predicted on the top')
pd.DataFrame(confusion_matrix(y_validate, y_pred), index = labels, columns = labels)

Actual on the left, predicted on the top


Unnamed: 0,Did not survive,Survived
Did not survive,116,16
Survived,26,56


In [264]:
print(classification_report(y_validate, y_pred))

              precision    recall  f1-score   support

           0       0.82      0.88      0.85       132
           1       0.78      0.68      0.73        82

    accuracy                           0.80       214
   macro avg       0.80      0.78      0.79       214
weighted avg       0.80      0.80      0.80       214



4. Run through steps increasing your min_samples_leaf and decreasing your max_depth.

In [270]:
max_depth = 20
for i in range(2, max_depth):
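    # NOTE: this evaluates to 19 on every iteration, so max_depth never decreases;
    # use `depth = max_depth - i` to lower the depth as min_samples_leaf grows.
    depth = max_depth - 1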
    print('max_depth value: {}'.format(depth))
    print('min_samples_leaf value: {}'.format(i))
    rf = RandomForestClassifier(max_depth = depth,
                           random_state = 123, 
                           min_samples_leaf = i)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_train)
    print('Accuracy of Random Forest model on training set: {:.2f}'.format(rf.score(X_train, y_train)))
    print(classification_report(y_train, y_pred))
    print('Accuracy of Random Forest model on validate set: {:.2f}'.format(rf.score(X_validate, y_validate)))
    y_pred = rf.predict(X_validate)
    print(classification_report(y_validate, y_pred))
    print('-----------------------------------------------------')



max_depth value: 19
min_samples_leaf value: 2
Accuracy of Random Forest model on training set: 0.94
              precision    recall  f1-score   support

           0       0.92      0.98      0.95       307
           1       0.97      0.87      0.92       191

    accuracy                           0.94       498
   macro avg       0.95      0.93      0.93       498
weighted avg       0.94      0.94      0.94       498

Accuracy of Random Forest model on validate set: 0.82
              precision    recall  f1-score   support

           0       0.83      0.89      0.86       132
           1       0.81      0.71      0.75        82

    accuracy                           0.82       214
   macro avg       0.82      0.80      0.81       214
weighted avg       0.82      0.82      0.82       214

-----------------------------------------------------
max_depth value: 19
min_samples_leaf value: 3
Accuracy of Random Forest model on training set: 0.92
              precision    recall  f1-

Accuracy of Random Forest model on training set: 0.85
              precision    recall  f1-score   support

           0       0.83      0.95      0.89       307
           1       0.90      0.69      0.78       191

    accuracy                           0.85       498
   macro avg       0.86      0.82      0.83       498
weighted avg       0.86      0.85      0.85       498

Accuracy of Random Forest model on validate set: 0.79
              precision    recall  f1-score   support

           0       0.78      0.91      0.84       132
           1       0.80      0.60      0.69        82

    accuracy                           0.79       214
   macro avg       0.79      0.75      0.76       214
weighted avg       0.79      0.79      0.78       214

-----------------------------------------------------
max_depth value: 19
min_samples_leaf value: 13
Accuracy of Random Forest model on training set: 0.85
              precision    recall  f1-score   support

           0       0.83     

5. What are the differences in the evaluation metrics? Which performs better on your in-sample data? Why?

After making a few models, which one has the best performance (or closest metrics) on both train and validate?
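
A sketch of one way to answer this: pair an increasing `min_samples_leaf` with a decreasing `max_depth`, record both accuracies, and sort by the gap between them. It runs on synthetic data from `make_classification` (an assumption, not the Titanic data), and `n_estimators=50` is chosen only to keep the example quick.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the Titanic dataset.
X, y = make_classification(n_samples=700, n_features=10, random_state=123)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=.3, random_state=123, stratify=y)

rows = []
# min_samples_leaf goes 1..9 while max_depth goes 10..2.
for leaf, depth in zip(range(1, 10), range(10, 1, -1)):
    rf_demo = RandomForestClassifier(
        n_estimators=50, max_depth=depth, min_samples_leaf=leaf, random_state=123)
    rf_demo.fit(X_tr, y_tr)
    train_acc = rf_demo.score(X_tr, y_tr)
    val_acc = rf_demo.score(X_val, y_val)
    rows.append({
        'min_samples_leaf': leaf,
        'max_depth': depth,
        'train_accuracy': train_acc,
        'validate_accuracy': val_acc,
        'difference': train_acc - val_acc,
    })

results = pd.DataFrame(rows)
# Smallest train/validate gap first.
print(results.sort_values('difference').round(2))
```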

### KNN Model Exercises

Continue working in your model file with the titanic dataset. 

1. Fit a K-Nearest Neighbors classifier to your training sample and transform (i.e. make predictions on the training sample).

In [11]:
#split titanic
train, test = train_test_split(titanic, test_size = .2, random_state=123, stratify=titanic.survived)
train, validate = train_test_split(train, test_size=.3, random_state=123, stratify=train.survived)

In [12]:
X_train = train.drop(columns=['survived'])
y_train = train.survived

X_validate = validate.drop(columns=['survived'])
y_validate = validate.survived

X_test = test.drop(columns=['survived'])
y_test = test.survived

In [13]:
knn = KNeighborsClassifier(n_neighbors = 5, weights = 'uniform')

In [14]:
knn.fit(X_train, y_train)

KNeighborsClassifier()

In [15]:
y_pred = knn.predict(X_train)

In [16]:
y_pred_proba = knn.predict_proba(X_train)

2. Evaluate your results using the model score, confusion matrix, and classification report.

In [17]:
print('Accuracy of KNN classifier on training set: {:.2f}'
     .format(knn.score(X_train, y_train)))

Accuracy of KNN classifier on training set: 0.74


In [18]:
print(confusion_matrix(y_train, y_pred))

[[266  41]
 [ 87 104]]


In [19]:
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.75      0.87      0.81       307
           1       0.72      0.54      0.62       191

    accuracy                           0.74       498
   macro avg       0.74      0.71      0.71       498
weighted avg       0.74      0.74      0.73       498



In [21]:
TN, FP, FN, TP = confusion_matrix(y_train,y_pred).ravel()
ALL = TP + TN + FP + FN

TP, TN, FP, FN

(104, 266, 41, 87)

In [22]:
accuracy = (TP + TN)/ALL
print(f"Accuracy: {accuracy}")

true_positive_rate = TP/(TP+FN)
print(f"True Positive Rate: {true_positive_rate}")

false_positive_rate = FP/(FP+TN)
print(f"False Positive Rate: {false_positive_rate}")

true_negative_rate = TN/(TN+FP)
print(f"True Negative Rate: {true_negative_rate}")

false_negative_rate = FN/(FN+TP)
print(f"False Negative Rate: {false_negative_rate}")

precision = TP/(TP+FP)
print(f"Precision: {precision}")

recall = TP/(TP+FN)
print(f"Recall: {recall}")

f1_score = 2*(precision*recall)/(precision+recall)
print(f"F1 Score: {f1_score}")

support_pos = TP + FN
print(f"Support (1): {support_pos}")

support_neg = FP + TN
print(f"Support (0): {support_neg}")

Accuracy: 0.7429718875502008
True Positive Rate: 0.5445026178010471
False Positive Rate: 0.13355048859934854
True Negative Rate: 0.8664495114006515
False Negative Rate: 0.45549738219895286
Precision: 0.7172413793103448
Recall: 0.5445026178010471
F1 Score: 0.6190476190476191
Support (1): 191
Support (0): 307


3. Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [23]:
print('Accuracy of KNN classifier on validate set: {:.2f}'
     .format(knn.score(X_validate, y_validate)))

Accuracy of KNN classifier on validate set: 0.61


In [25]:
y_pred = knn.predict(X_validate)

In [26]:
print(confusion_matrix(y_validate, y_pred))

[[99 33]
 [50 32]]


In [27]:
print(classification_report(y_validate, y_pred))

              precision    recall  f1-score   support

           0       0.66      0.75      0.70       132
           1       0.49      0.39      0.44        82

    accuracy                           0.61       214
   macro avg       0.58      0.57      0.57       214
weighted avg       0.60      0.61      0.60       214



In [28]:
TN, FP, FN, TP = confusion_matrix(y_validate,y_pred).ravel()
ALL = TP + TN + FP + FN

TP, TN, FP, FN

(32, 99, 33, 50)

In [29]:
accuracy = (TP + TN)/ALL
print(f"Accuracy: {accuracy}")

true_positive_rate = TP/(TP+FN)
print(f"True Positive Rate: {true_positive_rate}")

false_positive_rate = FP/(FP+TN)
print(f"False Positive Rate: {false_positive_rate}")

true_negative_rate = TN/(TN+FP)
print(f"True Negative Rate: {true_negative_rate}")

false_negative_rate = FN/(FN+TP)
print(f"False Negative Rate: {false_negative_rate}")

precision = TP/(TP+FP)
print(f"Precision: {precision}")

recall = TP/(TP+FN)
print(f"Recall: {recall}")

f1_score = 2*(precision*recall)/(precision+recall)
print(f"F1 Score: {f1_score}")

support_pos = TP + FN
print(f"Support (1): {support_pos}")

support_neg = FP + TN
print(f"Support (0): {support_neg}")

Accuracy: 0.6121495327102804
True Positive Rate: 0.3902439024390244
False Positive Rate: 0.25
True Negative Rate: 0.75
False Negative Rate: 0.6097560975609756
Precision: 0.49230769230769234
Recall: 0.3902439024390244
F1 Score: 0.435374149659864
Support (1): 82
Support (0): 132


4. Run through steps 2-3 setting k to 10

In [30]:
knn = KNeighborsClassifier(n_neighbors = 10, weights = 'uniform')

In [31]:
knn.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=10)

In [32]:
y_pred = knn.predict(X_train)

In [33]:
y_pred_proba = knn.predict_proba(X_train)

In [34]:
print('Accuracy of KNN classifier on training set: {:.2f}'
     .format(knn.score(X_train, y_train)))

Accuracy of KNN classifier on training set: 0.70


In [35]:
print(confusion_matrix(y_train, y_pred))

[[284  23]
 [128  63]]


In [36]:
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.69      0.93      0.79       307
           1       0.73      0.33      0.45       191

    accuracy                           0.70       498
   macro avg       0.71      0.63      0.62       498
weighted avg       0.71      0.70      0.66       498



In [37]:
print('Accuracy of KNN classifier on validate set: {:.2f}'
     .format(knn.score(X_validate, y_validate)))

Accuracy of KNN classifier on validate set: 0.62


In [38]:
y_pred = knn.predict(X_validate)

In [39]:
print(confusion_matrix(y_validate, y_pred))

[[115  17]
 [ 65  17]]


In [40]:
print(classification_report(y_validate, y_pred))

              precision    recall  f1-score   support

           0       0.64      0.87      0.74       132
           1       0.50      0.21      0.29        82

    accuracy                           0.62       214
   macro avg       0.57      0.54      0.52       214
weighted avg       0.59      0.62      0.57       214



5. Run through steps 2-3 setting k to 20

In [41]:
knn = KNeighborsClassifier(n_neighbors = 20, weights = 'uniform')

In [42]:
knn.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=20)

In [43]:
y_pred = knn.predict(X_train)

In [44]:
y_pred_proba = knn.predict_proba(X_train)

In [45]:
print('Accuracy of KNN classifier on training set: {:.2f}'
     .format(knn.score(X_train, y_train)))

Accuracy of KNN classifier on training set: 0.67


In [46]:
print(confusion_matrix(y_train, y_pred))

[[288  19]
 [147  44]]


In [47]:
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.66      0.94      0.78       307
           1       0.70      0.23      0.35       191

    accuracy                           0.67       498
   macro avg       0.68      0.58      0.56       498
weighted avg       0.68      0.67      0.61       498



In [48]:
print('Accuracy of KNN classifier on validate set: {:.2f}'
     .format(knn.score(X_validate, y_validate)))

Accuracy of KNN classifier on validate set: 0.65


In [49]:
y_pred = knn.predict(X_validate)

In [50]:
print(confusion_matrix(y_validate, y_pred))

[[122  10]
 [ 65  17]]


In [51]:
print(classification_report(y_validate, y_pred))

              precision    recall  f1-score   support

           0       0.65      0.92      0.76       132
           1       0.63      0.21      0.31        82

    accuracy                           0.65       214
   macro avg       0.64      0.57      0.54       214
weighted avg       0.64      0.65      0.59       214



6. What are the differences in the evaluation metrics? Which performs better on your in-sample data? Why?

In [60]:
knn_no = range(1, 21)
for i in knn_no:
    print('Number of KNN: {}'.format(i))
    knn = KNeighborsClassifier(n_neighbors = i, weights = 'uniform')
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_train)
    #print(classification_report(y_train, y_pred))
    print('Accuracy of KNN classifier on training set: {:.2f}'
     .format(knn.score(X_train, y_train)))
    print('-----------------------------------------------------')

Number of KNN: 1
Accuracy of KNN classifier on training set: 1.00
-----------------------------------------------------
Number of KNN: 2
Accuracy of KNN classifier on training set: 0.78
-----------------------------------------------------
Number of KNN: 3
Accuracy of KNN classifier on training set: 0.80
-----------------------------------------------------
Number of KNN: 4
Accuracy of KNN classifier on training set: 0.76
-----------------------------------------------------
Number of KNN: 5
Accuracy of KNN classifier on training set: 0.74
-----------------------------------------------------
Number of KNN: 6
Accuracy of KNN classifier on training set: 0.72
-----------------------------------------------------
Number of KNN: 7
Accuracy of KNN classifier on training set: 0.72
-----------------------------------------------------
Number of KNN: 8
Accuracy of KNN classifier on training set: 0.70
-----------------------------------------------------
Number of KNN: 9
Accuracy of KNN classif

7. Which model performs best on our out-of-sample data from validate?

In [61]:
knn_no = range(1, 21)
for i in knn_no:
    print('Number of KNN: {}'.format(i))
    knn = KNeighborsClassifier(n_neighbors = i, weights = 'uniform')
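    # NOTE: fitting on the validate set makes the scores below in-sample for validate;
    # fit on X_train, y_train instead to measure out-of-sample performance.
    knn.fit(X_validate, y_validate)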
    y_pred = knn.predict(X_validate)
    #print(classification_report(y_validate, y_validate))
    print('Accuracy of KNN classifier on validate set: {:.2f}'
     .format(knn.score(X_validate, y_validate)))
    print('-----------------------------------------------------')

Number of KNN: 1
Accuracy of KNN classifier on validate set: 1.00
-----------------------------------------------------
Number of KNN: 2
Accuracy of KNN classifier on validate set: 0.79
-----------------------------------------------------
Number of KNN: 3
Accuracy of KNN classifier on validate set: 0.79
-----------------------------------------------------
Number of KNN: 4
Accuracy of KNN classifier on validate set: 0.71
-----------------------------------------------------
Number of KNN: 5
Accuracy of KNN classifier on validate set: 0.74
-----------------------------------------------------
Number of KNN: 6
Accuracy of KNN classifier on validate set: 0.70
-----------------------------------------------------
Number of KNN: 7
Accuracy of KNN classifier on validate set: 0.69
-----------------------------------------------------
Number of KNN: 8
Accuracy of KNN classifier on validate set: 0.68
-----------------------------------------------------
Number of KNN: 9
Accuracy of KNN classif
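
Because the loop above fits each model on the validate set and then scores it on the same rows, those accuracies are in-sample (hence 1.00 for k = 1). A sketch of the out-of-sample version follows: fit on train, score on validate. It uses synthetic data from `make_classification` as a stand-in (an assumption), and the names `X_tr`, `X_val`, and `results` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, not the Titanic dataset.
X, y = make_classification(n_samples=700, n_features=10, random_state=123)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=.3, random_state=123, stratify=y)

rows = []
for k in range(1, 21):
    # Fit on the training rows only.
    knn_demo = KNeighborsClassifier(n_neighbors=k, weights='uniform').fit(X_tr, y_tr)
    rows.append({
        'n_neighbors': k,
        'train_accuracy': knn_demo.score(X_tr, y_tr),
        # Validate rows were never seen during fitting, so this is out-of-sample.
        'validate_accuracy': knn_demo.score(X_val, y_val),
    })

results = pd.DataFrame(rows)
best_k = int(results.loc[results['validate_accuracy'].idxmax(), 'n_neighbors'])
print(results.round(2))
print(f'best k on validate: {best_k}')
```

With k = 1 the training accuracy is still perfect, since each point is its own nearest neighbor, but the validate column now shows how well each k generalizes.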

### Regression Model Exercises

In these exercises, we'll continue working with the titanic dataset and building logistic regression models. Throughout this exercise, be sure you are training, evaluating, and comparing models on the train and validate datasets. The test dataset should only be used for your final model.

For all of the models you create, choose a threshold that optimizes for accuracy.

Do your work for these exercises in either a notebook or a python script named model within your classification-exercises repository. Add, commit, and push your work.

1. Create a model that includes age in addition to fare and pclass. Does this model perform better than your baseline?

In [291]:
titanic = acquire.get_titanic_data()
titanic.head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1
3,3,1,1,female,35.0,1,0,53.1,S,First,C,Southampton,0
4,4,0,3,male,35.0,0,0,8.05,S,Third,,Southampton,1


In [292]:
titanic = titanic.drop_duplicates()
cols_to_drop = ['passenger_id', 'sex','sibsp','parch','embarked','class','deck','embark_town','alone']
titanic = titanic.drop(columns=cols_to_drop)

In [293]:
titanic.head()

Unnamed: 0,survived,pclass,age,fare
0,0,3,22.0,7.25
1,1,1,38.0,71.2833
2,1,3,26.0,7.925
3,1,1,35.0,53.1
4,0,3,35.0,8.05


In [294]:
# plugging missing age values with mean of entire dataset
titanic.age = titanic.age.fillna(value = titanic.age.mean(0))

In [295]:
# baseline model
baseline_accuracy = (0 == train.survived).mean()

print(f'baseline accuracy: {baseline_accuracy:.2%}')

baseline accuracy: 61.65%


In [296]:
#split titanic
train, test = train_test_split(titanic, test_size = .2, random_state=123, stratify=titanic.survived)
train, validate = train_test_split(train, test_size=.3, random_state=123, stratify=train.survived)

In [297]:
X_train = train.drop(columns=['survived'])
y_train = train.survived

X_validate = validate.drop(columns=['survived'])
y_validate = validate.survived

X_test = test.drop(columns=['survived'])
y_test = test.survived

In [298]:
logit = LogisticRegression(C = 1, random_state = 123)

In [299]:
logit.fit(X_train, y_train)

LogisticRegression(C=1, random_state=123)

In [300]:
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[-0.94961658 -0.03063394  0.00141006]]
Intercept: 
 [2.52863437]
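
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to interpret. The sketch below uses the coefficients printed above and assumes they follow the column order of `X_train` (`pclass`, `age`, `fare`).

```python
import numpy as np
import pandas as pd

# Coefficients copied from the output above; the labels assume X_train's column order.
coef = pd.Series([-0.94961658, -0.03063394, 0.00141006],
                 index=['pclass', 'age', 'fare'])

# exp(coefficient) is the multiplicative change in the odds of survival
# for a one-unit increase in that feature, holding the others fixed.
odds_ratios = np.exp(coef)
print(odds_ratios.round(3))
```

An odds ratio below 1 (as for `pclass` and `age`) means the odds of survival fall as the feature increases; a value just above 1 (as for `fare`) means they rise slightly per unit of fare.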


In [301]:
y_pred = logit.predict(X_train)

In [302]:
y_pred[0:10]

array([1, 0, 0, 0, 1, 0, 0, 0, 0, 0])

In [303]:
logit.predict_proba(X_train)[0:10]

array([[0.36987005, 0.63012995],
       [0.63806591, 0.36193409],
       [0.61743881, 0.38256119],
       [0.70383651, 0.29616349],
       [0.30458292, 0.69541708],
       [0.56359759, 0.43640241],
       [0.65720071, 0.34279929],
       [0.55318395, 0.44681605],
       [0.7719031 , 0.2280969 ],
       [0.75619804, 0.24380196]])

In [304]:
logit.classes_

array([0, 1])

In [305]:
y_pred_proba = logit.predict_proba(X_train)
y_pred_proba = pd.DataFrame(y_pred_proba, columns = [0, 1])
y_pred_proba.head()

Unnamed: 0,0,1
0,0.36987,0.63013
1,0.638066,0.361934
2,0.617439,0.382561
3,0.703837,0.296163
4,0.304583,0.695417


In [306]:
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'.format(logit.score(X_train, y_train)))

Accuracy of Logistic Regression classifier on training set: 0.70


In [307]:
print(confusion_matrix(y_train, y_pred))

[[266  41]
 [107  84]]


In [308]:
# classification report
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.71      0.87      0.78       307
           1       0.67      0.44      0.53       191

    accuracy                           0.70       498
   macro avg       0.69      0.65      0.66       498
weighted avg       0.70      0.70      0.69       498
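
The numbers in the classification report can be reproduced from the confusion matrix printed above. This is a short worked sketch using those four counts; it also computes the baseline accuracy of always predicting the majority class (0, did not survive) on this training split, for comparison with the model's accuracy.

```python
# Confusion matrix copied from the output above.
# Rows are actual classes (0, 1); columns are predicted classes (0, 1).
tn, fp = 266, 41
fn, tp = 107, 84

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of predicted survivors, how many survived
recall = tp / (tp + fn)      # of actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

# Baseline: always predict class 0, so every actual 0 is counted correct.
baseline_accuracy = (tn + fp) / total

print(f"accuracy:  {accuracy:.2f}")
print(f"precision: {precision:.2f}")
print(f"recall:    {recall:.2f}")
print(f"f1-score:  {f1:.2f}")
print(f"baseline:  {baseline_accuracy:.2f}")
```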



2. Include sex in your model as well. Note that you'll need to encode or create a dummy variable of this feature before including it in a model.

In [309]:
titanic = acquire.get_titanic_data()
titanic.head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1
3,3,1,1,female,35.0,1,0,53.1,S,First,C,Southampton,0
4,4,0,3,male,35.0,0,0,8.05,S,Third,,Southampton,1


In [310]:
titanic = titanic.drop_duplicates()
cols_to_drop = ['passenger_id','sibsp','parch','embarked','class','deck','embark_town','alone']
titanic = titanic.drop(columns = cols_to_drop)
dummy_titanic = pd.get_dummies(titanic.sex, dummy_na = False, drop_first = True)
titanic = pd.concat([titanic, dummy_titanic], axis = 1)
titanic = titanic.drop(columns = 'sex')
# plugging missing age values with mean of entire dataset
titanic.age = titanic.age.fillna(value = titanic.age.mean(0))

In [311]:
#split titanic
train, test = train_test_split(titanic, test_size = .2, random_state=123, stratify=titanic.survived)
train, validate = train_test_split(train, test_size=.3, random_state=123, stratify=train.survived)

In [312]:
X_train = train.drop(columns=['survived'])
y_train = train.survived

X_validate = validate.drop(columns=['survived'])
y_validate = validate.survived

X_test = test.drop(columns=['survived'])
y_test = test.survived

In [313]:
logit = LogisticRegression(C = 1, random_state = 123)

In [314]:
logit.fit(X_train, y_train)

LogisticRegression(C=1, random_state=123)

In [315]:
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[-1.21048734e+00 -2.97258240e-02 -2.02978353e-03 -2.71609100e+00]]
Intercept: 
 [4.84166149]


In [316]:
y_pred = logit.predict(X_train)

In [317]:
y_pred_proba = logit.predict_proba(X_train)
y_pred_proba = pd.DataFrame(y_pred_proba, columns = [0, 1])
y_pred_proba.head()

Unnamed: 0,0,1
0,0.558849,0.441151
1,0.859975,0.140025
2,0.857482,0.142518
3,0.292842,0.707158
4,0.074243,0.925757


In [318]:
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X_train, y_train)))

Accuracy of Logistic Regression classifier on training set: 0.81


In [319]:
print(confusion_matrix(y_train, y_pred))

[[266  41]
 [ 52 139]]


In [320]:
# classification report
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.84      0.87      0.85       307
           1       0.77      0.73      0.75       191

    accuracy                           0.81       498
   macro avg       0.80      0.80      0.80       498
weighted avg       0.81      0.81      0.81       498



3. Try out other combinations of features and models.

In [321]:
titanic = acquire.get_titanic_data()
titanic.head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1
3,3,1,1,female,35.0,1,0,53.1,S,First,C,Southampton,0
4,4,0,3,male,35.0,0,0,8.05,S,Third,,Southampton,1


In [322]:
# model with pclass, age, fare, encoded sex, and parch
titanic = titanic.drop_duplicates()
cols_to_drop = ['passenger_id','sibsp','embarked','class','deck','embark_town','alone']
titanic = titanic.drop(columns = cols_to_drop)
dummy_titanic = pd.get_dummies(titanic.sex, dummy_na = False, drop_first = True)
titanic = pd.concat([titanic, dummy_titanic], axis = 1)
titanic = titanic.drop(columns = 'sex')
# plugging missing age values with mean of entire dataset
titanic.age = titanic.age.fillna(value = titanic.age.mean(0))

In [323]:
#split titanic
train, test = train_test_split(titanic, test_size = .2, random_state=123, stratify=titanic.survived)
train, validate = train_test_split(train, test_size=.3, random_state=123, stratify=train.survived)

In [324]:
X_train = train.drop(columns=['survived'])
y_train = train.survived

X_validate = validate.drop(columns=['survived'])
y_validate = validate.survived

X_test = test.drop(columns=['survived'])
y_test = test.survived

In [325]:
logit = LogisticRegression(C = 1, random_state = 123)

In [326]:
logit.fit(X_train, y_train)

LogisticRegression(C=1, random_state=123)

In [327]:
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[-1.18892617e+00 -3.13159914e-02 -1.77144249e-01 -1.16024145e-03
  -2.77684763e+00]]
Intercept: 
 [4.92074498]


In [328]:
y_pred = logit.predict(X_train)

In [329]:
y_pred_proba = logit.predict_proba(X_train)
y_pred_proba = pd.DataFrame(y_pred_proba, columns = [0, 1])
y_pred_proba.head()

Unnamed: 0,0,1
0,0.554522,0.445478
1,0.889223,0.110777
2,0.865912,0.134088
3,0.316526,0.683474
4,0.064579,0.935421


In [330]:
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'.format(logit.score(X_train, y_train)))

Accuracy of Logistic Regression classifier on training set: 0.82


In [331]:
print(confusion_matrix(y_train, y_pred))

[[268  39]
 [ 53 138]]


In [332]:
# classification report
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.87      0.85       307
           1       0.78      0.72      0.75       191

    accuracy                           0.82       498
   macro avg       0.81      0.80      0.80       498
weighted avg       0.81      0.82      0.81       498



In [333]:
titanic = acquire.get_titanic_data()
titanic.head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1
3,3,1,1,female,35.0,1,0,53.1,S,First,C,Southampton,0
4,4,0,3,male,35.0,0,0,8.05,S,Third,,Southampton,1


In [334]:
# model with pclass, age, fare, encoded sex, and sibsp
titanic = titanic.drop_duplicates()
cols_to_drop = ['passenger_id','parch','embarked','class','deck','embark_town','alone']
titanic = titanic.drop(columns = cols_to_drop)
dummy_titanic = pd.get_dummies(titanic.sex, dummy_na = False, drop_first = True)
titanic = pd.concat([titanic, dummy_titanic], axis = 1)
titanic = titanic.drop(columns = 'sex')
# plugging missing age values with mean of entire dataset
titanic.age = titanic.age.fillna(value = titanic.age.mean(0))

In [335]:
#split titanic
train, test = train_test_split(titanic, test_size = .2, random_state=123, stratify=titanic.survived)
train, validate = train_test_split(train, test_size=.3, random_state=123, stratify=train.survived)

In [336]:
X_train = train.drop(columns=['survived'])
y_train = train.survived

X_validate = validate.drop(columns=['survived'])
y_validate = validate.survived

X_test = test.drop(columns=['survived'])
y_test = test.survived

In [337]:
logit = LogisticRegression(C = 1, random_state = 123)

In [338]:
logit.fit(X_train, y_train)

LogisticRegression(C=1, random_state=123)

In [339]:
y_pred = logit.predict(X_train)

In [340]:
y_pred_proba = logit.predict_proba(X_train)
y_pred_proba = pd.DataFrame(y_pred_proba, columns = [0, 1])
y_pred_proba.head()

Unnamed: 0,0,1
0,0.546985,0.453015
1,0.820058,0.179942
2,0.951423,0.048577
3,0.272681,0.727319
4,0.054659,0.945341


In [341]:
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'.format(logit.score(X_train, y_train)))

Accuracy of Logistic Regression classifier on training set: 0.80


In [342]:
print(confusion_matrix(y_train, y_pred))

[[259  48]
 [ 53 138]]


In [343]:
# classification report
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.84      0.84       307
           1       0.74      0.72      0.73       191

    accuracy                           0.80       498
   macro avg       0.79      0.78      0.78       498
weighted avg       0.80      0.80      0.80       498



In [344]:
titanic = acquire.get_titanic_data()
titanic.head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1
3,3,1,1,female,35.0,1,0,53.1,S,First,C,Southampton,0
4,4,0,3,male,35.0,0,0,8.05,S,Third,,Southampton,1


In [345]:
# model with pclass, age, fare, encoded sex, sibsp, and parch
titanic = titanic.drop_duplicates()
cols_to_drop = ['passenger_id','embarked','class','deck','embark_town','alone']
titanic = titanic.drop(columns = cols_to_drop)
dummy_titanic = pd.get_dummies(titanic.sex, dummy_na = False, drop_first = True)
titanic = pd.concat([titanic, dummy_titanic], axis = 1)
titanic = titanic.drop(columns = 'sex')
# plugging missing age values with mean of entire dataset
titanic.age = titanic.age.fillna(value = titanic.age.mean(0))

In [346]:
#split titanic
train, test = train_test_split(titanic, test_size = .2, random_state=123, stratify=titanic.survived)
train, validate = train_test_split(train, test_size=.3, random_state=123, stratify=train.survived)

In [347]:
X_train = train.drop(columns=['survived'])
y_train = train.survived

X_validate = validate.drop(columns=['survived'])
y_validate = validate.survived

X_test = test.drop(columns=['survived'])
y_test = test.survived

In [348]:
logit = LogisticRegression(C = 1, random_state = 123)

In [349]:
logit.fit(X_train, y_train)

LogisticRegression(C=1, random_state=123)

In [350]:
y_pred = logit.predict(X_train)

In [351]:
y_pred_proba = logit.predict_proba(X_train)
y_pred_proba = pd.DataFrame(y_pred_proba, columns = [0, 1])
y_pred_proba.head()

Unnamed: 0,0,1
0,0.546144,0.453856
1,0.831749,0.168251
2,0.950818,0.049182
3,0.278926,0.721074
4,0.053076,0.946924


In [352]:
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'.format(logit.score(X_train, y_train)))

Accuracy of Logistic Regression classifier on training set: 0.80


In [353]:
print(confusion_matrix(y_train, y_pred))

[[259  48]
 [ 53 138]]


In [354]:
# classification report
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.84      0.84       307
           1       0.74      0.72      0.73       191

    accuracy                           0.80       498
   macro avg       0.79      0.78      0.78       498
weighted avg       0.80      0.80      0.80       498



In [389]:
titanic = acquire.get_titanic_data()
titanic.head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1
3,3,1,1,female,35.0,1,0,53.1,S,First,C,Southampton,0
4,4,0,3,male,35.0,0,0,8.05,S,Third,,Southampton,1


In [390]:
# model with pclass, age, fare, encoded sex, sibsp, and is_minor
titanic = titanic.drop_duplicates()
cols_to_drop = ['passenger_id', 'parch','embarked','class','deck','embark_town','alone']
titanic = titanic.drop(columns = cols_to_drop)
dummy_titanic = pd.get_dummies(titanic.sex, dummy_na = False, drop_first = True)
titanic = pd.concat([titanic, dummy_titanic], axis = 1)
titanic = titanic.drop(columns = 'sex')
# flag passengers under 13; computed before imputation, so missing ages get is_minor = 0
titanic['is_minor'] = (titanic['age'] < 13).astype(int)
# plugging missing age values with mean of entire dataset
titanic.age = titanic.age.fillna(value = titanic.age.mean(0))


In [391]:
#split titanic
train, test = train_test_split(titanic, test_size = .2, random_state=123, stratify=titanic.survived)
train, validate = train_test_split(train, test_size=.3, random_state=123, stratify=train.survived)

In [392]:
X_train = train.drop(columns=['survived'])
y_train = train.survived

X_validate = validate.drop(columns=['survived'])
y_validate = validate.survived

X_test = test.drop(columns=['survived'])
y_test = test.survived

In [393]:
logit = LogisticRegression(C = 1, random_state = 123)

In [394]:
logit.fit(X_train, y_train)

LogisticRegression(C=1, random_state=123)

In [395]:
y_pred = logit.predict(X_train)

In [396]:
y_pred_proba = logit.predict_proba(X_train)
y_pred_proba = pd.DataFrame(y_pred_proba, columns = [0, 1])
y_pred_proba.head()

Unnamed: 0,0,1
0,0.549621,0.450379
1,0.74971,0.25029
2,0.932588,0.067412
3,0.238133,0.761867
4,0.057691,0.942309


In [397]:
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'.format(logit.score(X_train, y_train)))

Accuracy of Logistic Regression classifier on training set: 0.81


In [398]:
print(confusion_matrix(y_train, y_pred))

[[269  38]
 [ 56 135]]


In [399]:
# classification report
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.88      0.85       307
           1       0.78      0.71      0.74       191

    accuracy                           0.81       498
   macro avg       0.80      0.79      0.80       498
weighted avg       0.81      0.81      0.81       498



4. Use your best 3 models to predict and evaluate on your validate sample.
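
One way to approach this step is to refit each candidate on train and score it on validate in a single loop, so the results land in one table. The sketch below is not the notebook's solution: it uses synthetic data from `make_classification` as a stand-in for the prepared titanic frame, and the feature names (`f0` to `f4`), the `target` column, and the three feature sets are all hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared titanic frame (assumption).
X, y = make_classification(n_samples=600, n_features=5, n_informative=3,
                           random_state=123)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])
df["target"] = y

# Same two-step stratified split used throughout the notebook.
train, test = train_test_split(df, test_size=.2, random_state=123,
                               stratify=df.target)
train, validate = train_test_split(train, test_size=.3, random_state=123,
                                   stratify=train.target)

# Hypothetical candidate feature sets, one model per set.
feature_sets = {
    "model_a": ["f0", "f1", "f2"],
    "model_b": ["f0", "f1", "f2", "f3"],
    "model_c": ["f0", "f1", "f2", "f3", "f4"],
}

rows = []
for name, features in feature_sets.items():
    model = LogisticRegression(C=1, random_state=123)
    model.fit(train[features], train.target)  # fit on train only
    rows.append({
        "model": name,
        "train_accuracy": model.score(train[features], train.target),
        "validate_accuracy": model.score(validate[features], validate.target),
    })

results = pd.DataFrame(rows).set_index("model")
print(results.round(2))
```

A large drop from train accuracy to validate accuracy would suggest overfitting; the test split stays untouched until a single model has been chosen.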

5. Choose your best model based on validation performance, and evaluate it on the test dataset. How do the performance metrics compare to validate? To train?
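
For the final comparison, it can help to score the one chosen model on all three splits side by side and then print a classification report for test. The following is a minimal sketch on synthetic data from `make_classification`, not the notebook's answer; in the real notebook the `X_train`, `X_validate`, and `X_test` frames built earlier would take the place of the arrays here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); replace with the titanic splits.
X, y = make_classification(n_samples=600, n_features=4, random_state=123)

X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=.2, random_state=123, stratify=y)
X_train, X_validate, y_train, y_validate = train_test_split(
    X_rest, y_rest, test_size=.3, random_state=123, stratify=y_rest)

# The single model selected from the validation step.
model = LogisticRegression(C=1, random_state=123).fit(X_train, y_train)

splits = {
    "train": (X_train, y_train),
    "validate": (X_validate, y_validate),
    "test": (X_test, y_test),
}

# Accuracy on each split, for a side-by-side comparison.
scores = {name: model.score(Xs, ys) for name, (Xs, ys) in splits.items()}
for name, score in scores.items():
    print(f"{name:>8} accuracy: {score:.2f}")

# Per-class precision and recall on the held-out test split.
print(classification_report(y_test, model.predict(X_test)))
```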