### In this exercise, we'll continue working with the titanic dataset and building logistic regression models. Throughout this exercise, be sure you are training, evaluating, and comparing models on the train and validate datasets. The test dataset should only be used for your final model.

### For all of the models you create, choose a threshold that optimizes for accuracy. 

### Do your work for these exercises in either a notebook or a python script named model within your classification-exercises repository. Add, commit, and push your work.

**Takeaways**:
1. Build logistic regression models for the titanic dataset.
2. Several models need to be built.
3. Accuracy is the evaluation metric.
4. Target variable: survived (categorical)
5. The positive case is predicting that a passenger survived
    - TP: predicted survived, actually survived
    - FP: predicted survived, actually a victim
    - TN: predicted a victim, actually a victim
    - FN: predicted a victim, actually survived
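
The four outcomes listed above can be counted directly from actual and predicted labels. The sketch below is only an illustration: the labels are made up, and `confusion_counts` is a hypothetical helper rather than part of the exercise code.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN, treating `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # predicted survived, actually survived
        elif predicted == positive:
            fp += 1  # predicted survived, actually a victim
        elif actual == positive:
            fn += 1  # predicted a victim, actually survived
        else:
            tn += 1  # predicted a victim, actually a victim
    return tp, fp, tn, fn


# Made-up labels for illustration: 1 = survived, 0 = did not survive
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
print(tp, fp, tn, fn, accuracy)
```

Accuracy is the share of correct predictions, (TP + TN) divided by the total number of passengers.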

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

import acquire
import prepare

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.metrics import classification_report, confusion_matrix

### 1. Create another model that includes age in addition to fare and pclass. Does this model perform better than your previous one? 

In [None]:
# Acquire titanic data.

titanic = acquire.get_titanic_data()
titanic.head()

In [None]:
# Prepare titanic dataset

train, validate, test = prepare.prep_titanic(titanic)
train.head()

In [None]:
train.shape, validate.shape, test.shape

In [None]:
# Double-check whether there are any missing values

train.isnull().sum()

In [None]:
# Compute the baseline accuracy

train.survived.value_counts(normalize=True)
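
The baseline accuracy is the accuracy of always predicting the most common class. As a small worked example, the sketch below uses the class counts that appear in the decision tree section of this notebook (307 did not survive, 190 survived); the split used in this cell may differ, so treat the numbers as illustrative.

```python
from collections import Counter

# Class counts taken from the training split reported in the decision tree
# section: 307 passengers did not survive (0) and 190 survived (1).
y_train_demo = [0] * 307 + [1] * 190

counts = Counter(y_train_demo)
majority_class, majority_count = counts.most_common(1)[0]

# Always predicting the majority class is right for every majority-class row
baseline_accuracy = majority_count / len(y_train_demo)
print(majority_class, round(baseline_accuracy, 3))
```

Any model worth keeping should beat this number on the validate split.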

### BL_Model: X = ['fare', 'pclass'], y = 'survived'
1. fare: continuous
2. pclass: categorical

In [None]:
# fare and pclass are the X in model1.

X_train_model1 = train[['fare', 'pclass']]
y_train_model1 = train[['survived']]

X_train_model1.shape, y_train_model1.shape

In [None]:
# Create the logistic regression object

logit1 = LogisticRegression(C=1)

# Fit the model to the training data

logit1.fit(X_train_model1, y_train_model1)

# Print the coefficients and intercept of the model

print('Coefficient: \n', logit1.coef_)
print('Intercept: \n', logit1.intercept_)

In [None]:
# Estimate whether or not a passenger would survive, using the training data

y_pred_model1 = logit1.predict(X_train_model1)
y_pred_model1

# Estimate the probability of a passenger surviving, using the training data
y_pred_proba_model1 = logit1.predict_proba(X_train_model1)

**Evaluate model on train**

In [None]:
# Compute the accuracy

print(logit1.score(X_train_model1, y_train_model1))

# Create a confusion matrix

print(confusion_matrix(y_train_model1, y_pred_model1))

# Compute Precision, Recall, F1-score, and Support

print(classification_report(y_train_model1, y_pred_model1))
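
The instructions ask for a threshold that optimizes accuracy, while `predict` uses a fixed 0.5 cutoff. One way to search for a better cutoff is sketched below, assuming the positive-class probabilities come from something like `logit1.predict_proba(X_train_model1)[:, 1]`. The probabilities and labels here are made up, and `accuracy_at` and `best_threshold` are hypothetical helpers.

```python
def accuracy_at(y_true, proba, threshold):
    """Accuracy when predicting the positive class for proba >= threshold."""
    y_pred = [1 if p >= threshold else 0 for p in proba]
    correct = sum(1 for actual, pred in zip(y_true, y_pred) if actual == pred)
    return correct / len(y_true)


def best_threshold(y_true, proba, thresholds):
    """Return (threshold, accuracy) with the highest accuracy; ties keep the first."""
    scored = [(accuracy_at(y_true, proba, t), t) for t in thresholds]
    best_acc, best_t = max(scored, key=lambda pair: pair[0])
    return best_t, best_acc


# Made-up values standing in for the positive-class column of predict_proba
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
proba = [0.12, 0.27, 0.41, 0.46, 0.57, 0.63, 0.81, 0.92]
thresholds = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

t, acc = best_threshold(y_true, proba, thresholds)
print(t, acc, accuracy_at(y_true, proba, 0.5))
```

The threshold should be chosen on train (or validate) and then applied unchanged to later splits, so that the test split stays untouched until the final model.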

### Model 2: X = ['fare', 'pclass', 'age'], y = 'survived'

In [None]:
# fare, pclass, age are the X in model2.

X_train_model2 = train[['fare', 'pclass', 'age']]
y_train_model2 = train[['survived']]

X_train_model2.shape, y_train_model2.shape

**Create, Fit & Predict**

In [None]:
# Create the logistic regression object

logit2 = LogisticRegression(C=1)

# Fit the model to the training data

logit2.fit(X_train_model2, y_train_model2)

# Print the coefficients and intercept of the model

print('Coefficient: \n', logit2.coef_)
print('Intercept: \n', logit2.intercept_)

# Estimate whether or not a passenger would survive, using the training data

y_pred_model2 = logit2.predict(X_train_model2)
y_pred_model2

# Estimate the probability of a passenger surviving, using the training data
y_pred_proba_model2 = logit2.predict_proba(X_train_model2)

**Evaluate model on train**

In [None]:
# Compute the accuracy

print('Accuracy: {: .2f}'.format(logit2.score(X_train_model2, y_train_model2)))

# Create a confusion matrix

print('Confusion matrix: \n', confusion_matrix(y_train_model2, y_pred_model2))

# Compute Precision, Recall, F1-score, and Support

print(classification_report(y_train_model2, y_pred_model2))

### 2. Include sex in your model as well. Note that you'll need to encode or create a dummy variable of this feature before including it in a model.

In [None]:
train.head()

In [None]:
sex_dummy = pd.get_dummies(train.sex)
train = pd.concat([train, sex_dummy], axis=1)
train.head()
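
Calling `pd.get_dummies` separately on each split works when every split contains both categories, but the resulting columns depend on the values present. A possible alternative is sketched below on tiny made-up frames; `add_sex_male` is a hypothetical helper, and the `sex_male` column name is an assumption chosen to match what `drop_first=True` produces.

```python
import pandas as pd

# Tiny made-up frames standing in for the train and validate splits
train_demo = pd.DataFrame({"sex": ["male", "female", "male"],
                           "fare": [7.25, 71.28, 8.05]})
validate_demo = pd.DataFrame({"sex": ["male", "male"],
                              "fare": [13.0, 26.55]})

# Option 1: get_dummies with drop_first=True keeps a single indicator column
train_enc = pd.get_dummies(train_demo, columns=["sex"], drop_first=True, dtype=int)


# Option 2: an explicit comparison yields the same column and does not depend
# on which categories happen to appear in a given split
def add_sex_male(df):
    out = df.copy()
    out["sex_male"] = (out["sex"] == "male").astype(int)
    return out.drop(columns="sex")


validate_enc = add_sex_male(validate_demo)
print(train_enc)
print(validate_enc)
```

Either way, a single indicator column is enough for a two-category feature, since the second column carries no extra information.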

In [None]:
train.info()

### Model 3: X = ['fare', 'pclass', 'age', 'male'], y = 'survived'

In [None]:
# fare, pclass, age, and male are the X in model 3.

X_train_model3 = train[['fare', 'pclass', 'age', 'male']]
y_train_model3 = train[['survived']]

X_train_model3.shape, y_train_model3.shape

In [None]:
# Create the logistic regression object

logit3 = LogisticRegression(C=1)

# Fit the model to the training data

logit3.fit(X_train_model3, y_train_model3)

# Print the coefficients and intercept of the model

print('Coefficient: \n', logit3.coef_)
print('Intercept: \n', logit3.intercept_)

# Estimate whether or not a passenger would survive, using the training data

y_pred_model3 = logit3.predict(X_train_model3)
y_pred_model3

# Estimate the probability of a passenger surviving, using the training data
y_pred_proba_model3 = logit3.predict_proba(X_train_model3)

In [None]:
# Compute the accuracy

print('Accuracy: {: .2f}'.format(logit3.score(X_train_model3, y_train_model3)))

# Create a confusion matrix

print('Confusion matrix: ', confusion_matrix(y_train_model3, y_pred_model3))

# Compute Precision, Recall, F1-score, and Support

print(classification_report(y_train_model3, y_pred_model3))

**Notes**
1. Previous model only contains fare and pclass as the X. 
2. No missing values in the train dataset. 

### 3. Try out other combinations of features and models.
* Model 4: X = ['pclass', 'male'], y = 'survived'
* Create, fit and predict
* Accuracy, confusion matrix, and report

In [None]:
# pclass and male are the X in model 4.

X_train_model4 = train[['pclass', 'male']]
y_train_model4 = train[['survived']]

# Create the logistic regression object

logit4 = LogisticRegression(C=1)

# Fit the model to the training data

logit4.fit(X_train_model4, y_train_model4)

# Print the coefficients and intercept of the model

print('Coefficient: \n', logit4.coef_)
print('Intercept: \n', logit4.intercept_)

# Estimate whether or not a passenger would survive, using the training data

y_pred_model4 = logit4.predict(X_train_model4)
y_pred_model4

# Estimate the probability of a passenger surviving, using the training data
y_pred_proba_model4 = logit4.predict_proba(X_train_model4)

In [None]:
# Compute the accuracy

print('Accuracy: {: .2f}'.format(logit4.score(X_train_model4, y_train_model4)))

# Create a confusion matrix

print('Confusion matrix: \n', confusion_matrix(y_train_model4, y_pred_model4))

# Compute Precision, Recall, F1-score, and Support

print(classification_report(y_train_model4, y_pred_model4))

### 4. Use best 3 models to predict and evaluate your validate sample
* Best 3: model 2, 3, 4

In [None]:
# Load validate dataset

validate.head()

In [None]:
sex_dummy = pd.get_dummies(validate.sex)
validate = pd.concat([validate, sex_dummy], axis=1)
validate.head()

In [None]:
validate.shape

In [None]:
# Load validate dataset for Model 2

X_validate_model2 = validate[['fare', 'pclass', 'age']]
y_validate_model2 = validate[['survived']]

# Estimate whether or not a passenger would survive, using the validate data

y_pred_model2 = logit2.predict(X_validate_model2)
y_pred_model2

# Estimate the probability of a passenger surviving, using the validate data
y_pred_proba_model2 = logit2.predict_proba(X_validate_model2)

# Compute the accuracy

print('Accuracy: {: .2f}'.format(logit2.score(X_validate_model2, y_validate_model2)))

# Create a confusion matrix

print(confusion_matrix(y_validate_model2, y_pred_model2))

# Compute Precision, Recall, F1-score, and Support

print(classification_report(y_validate_model2, y_pred_model2))

In [None]:
# Load validate dataset for Model 3

X_validate_model3 = validate[['fare', 'pclass', 'age', 'male']]
y_validate_model3 = validate[['survived']]

# Estimate whether or not a passenger would survive, using the validate data

y_pred_model3 = logit3.predict(X_validate_model3)
y_pred_model3

# Estimate the probability of a passenger surviving, using the validate data
y_pred_proba_model3 = logit3.predict_proba(X_validate_model3)

# Compute the accuracy

print('Accuracy: {: .2f}'.format(logit3.score(X_validate_model3, y_validate_model3)))

# Create a confusion matrix

print(confusion_matrix(y_validate_model3, y_pred_model3))

# Compute Precision, Recall, F1-score, and Support

print(classification_report(y_validate_model3, y_pred_model3))

In [None]:
# Load validate dataset for Model 4

X_validate_model4 = validate[['pclass', 'male']]
y_validate_model4 = validate[['survived']]

# Estimate whether or not a passenger would survive, using the validate data

y_pred_model4 = logit4.predict(X_validate_model4)
y_pred_model4

# Estimate the probability of a passenger surviving, using the validate data
y_pred_proba_model4 = logit4.predict_proba(X_validate_model4)

# Compute the accuracy

print('Accuracy: {: .2f}'.format(logit4.score(X_validate_model4, y_validate_model4)))

# Create a confusion matrix

print(confusion_matrix(y_validate_model4, y_pred_model4))

# Compute Precision, Recall, F1-score, and Support

print(classification_report(y_validate_model4, y_pred_model4))

### 5. Choose your best model based on validation performance, and evaluate it on the test dataset. How do the performance metrics compare to validate? To train?
* Best model from the validation: Model 3

In [None]:
test.head()

In [None]:
test.shape

In [None]:
sex_dummy = pd.get_dummies(test.sex)
test = pd.concat([test, sex_dummy], axis=1)
test.head()

In [None]:
# Load test dataset for Model 3

X_test_model3 = test[['fare', 'pclass', 'age', 'male']]
y_test_model3 = test[['survived']]

# Estimate whether or not a passenger would survive, using the test data

y_pred_model3 = logit3.predict(X_test_model3)
y_pred_model3

# Estimate the probability of a passenger surviving, using the test data
y_pred_proba_model3 = logit3.predict_proba(X_test_model3)

# Compute the accuracy

print('Accuracy: {: .2f}'.format(logit3.score(X_test_model3, y_test_model3)))

# Create a confusion matrix

print(confusion_matrix(y_test_model3, y_pred_model3))

# Compute Precision, Recall, F1-score, and Support

print(classification_report(y_test_model3, y_pred_model3))

**Notes**: The accuracy on the test dataset (0.80) is slightly better than on validate (0.78) and train (0.79).

### Bonus 1. How do different strategies for handling the missing values in the age column affect model performance? 

**Notes**
1. In the current titanic dataset, missing values in the age column are handled with `SimpleImputer(strategy='most_frequent')`.
2. There are four strategies in `SimpleImputer`:
    - mean
    - median
    - most_frequent
    - constant
3. The best model for now is Model 3
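
Switching strategies comes down to changing one argument of scikit-learn's `SimpleImputer`. The sketch below uses made-up ages rather than the titanic data, and it assumes the imputer is fit on the training split only so that validate is filled with a value learned from train.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up ages standing in for the train and validate splits
train_demo = pd.DataFrame({"age": [22.0, 38.0, np.nan, 35.0, 28.0]})
validate_demo = pd.DataFrame({"age": [np.nan, 54.0]})

# Swap "median" for "mean", "most_frequent", or "constant" to compare strategies
imputer = SimpleImputer(strategy="median")

# Learn the fill value from the training split only, then reuse it
train_demo["age"] = imputer.fit_transform(train_demo[["age"]]).ravel()
validate_demo["age"] = imputer.transform(validate_demo[["age"]]).ravel()

print(imputer.statistics_)
print(validate_demo)
```

Refitting the same model after each imputation strategy then gives a like-for-like comparison of accuracy.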

**My Plan**

1. I will use a different strategy in `SimpleImputer` and then compare the performance of model 3.
2. Which strategy will I use? Mean or median.

In [None]:
raw_titanic = acquire.get_titanic_data()
raw_titanic.head()

In [None]:
# The age column has 177 null values

null_age = raw_titanic.age.isnull().sum()
null_age

In [None]:
# The percentage of null values in age column

null_age/raw_titanic.age.size

In [None]:
# Most frequent age

# raw_titanic.age.value_counts().head(1)
raw_titanic.age.mode()

In [None]:
raw_titanic.age.plot.hist()

In [None]:
raw_titanic.age.agg(['mean', 'median'])

In [None]:
# Which passengers are missing age values?

mask = raw_titanic.age.isnull()
raw_titanic[mask].alone.value_counts()

In [None]:
raw_titanic.head()

In [None]:
train, validate, test = prepare.prep_titanic_mean(raw_titanic)
train.head()

In [None]:
# Visualization of the age distribution after replacing the missing values with the mean.

train.age.plot.hist(alpha=0.5)
raw_titanic.age.plot.hist(alpha=0.5)

In [None]:
train.shape, validate.shape, test.shape

In [None]:
# Create dummy variables of sex in train dataset. 

sex_dummy = pd.get_dummies(train.sex)
train = pd.concat([train, sex_dummy], axis=1)
train.head()

In [None]:
# fare, pclass, age, and male are the X in model 3. 

X_train_model3 = train[['fare', 'pclass', 'age', 'male']]
y_train_model3 = train[['survived']]

X_train_model3.shape, y_train_model3.shape

In [None]:
# Create the logistic regression object

logit3 = LogisticRegression(C=1)

# Fit the model to the training data

logit3.fit(X_train_model3, y_train_model3)

# Print the coefficients and intercept of the model

print('Coefficient: \n', logit3.coef_)
print('Intercept: \n', logit3.intercept_)

# Estimate whether or not a passenger would survive, using the training data

y_pred_model3 = logit3.predict(X_train_model3)
y_pred_model3

# Estimate the probability of a passenger surviving, using the training data
y_pred_proba_model3 = logit3.predict_proba(X_train_model3)

In [None]:
# Compute the accuracy

print('Accuracy: {: .2f}'.format(logit3.score(X_train_model3, y_train_model3)))

# Create a confusion matrix

print('Confusion matrix: \n', confusion_matrix(y_train_model3, y_pred_model3))

# Compute Precision, Recall, F1-score, and Support

print(classification_report(y_train_model3, y_pred_model3))

**Notes**
1. The age column has 177 null values, about 20% of all rows.
2. The most frequent age is 24.
3. The mean age is 29.7.
4. The median age is 28.0.
5. Among passengers missing an age value, 133 were alone and 44 had companions.

**Choice**
1. I will use the mean as the alternative strategy and see how it affects performance.

**Results**
1. The accuracy increased slightly, from 0.79 to 0.80.
2. The coefficient of age is small (0.027), so age carries little weight in the model, which may explain why the accuracy changed so little.
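
One caveat when reading the age coefficient: a logistic regression coefficient is the change in log-odds per one unit of the feature, and age spans decades, so a small per-year value can still add up. A rough worked example, assuming the reported magnitude of 0.027 and a hypothetical 30-year age gap between two otherwise identical passengers:

```python
import math

coef_age = 0.027  # magnitude reported in the results above
age_gap = 30      # hypothetical difference in years between two passengers

# Shift in log-odds is the coefficient times the difference in the feature
log_odds_shift = coef_age * age_gap

# Exponentiating gives the multiplicative change in the odds
odds_ratio = math.exp(log_odds_shift)
print(round(log_odds_shift, 2), round(odds_ratio, 2))
```

Standardizing the features before fitting is one common way to make coefficient magnitudes comparable across features.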

### Bonus 2: How do different strategies for encoding sex affect model performance.

In [None]:
# Acquire titanic dataset.

titanic = acquire.get_titanic_data()
titanic.head()

In [None]:
# Prepare titanic dataset

train, validate, test = prepare.prep_titanic(titanic)
train.head()

In [None]:
# Create dummy variables for column sex

sex_dummy = pd.get_dummies(train.sex)
train = pd.concat([train, sex_dummy], axis=1)
train.head()

In [None]:
# X = ['fare', 'pclass', 'age', 'female']
# y = 'survived'

X_train_model3 = train[['fare', 'pclass', 'age', 'female']]
y_train_model3 = train[['survived']]

X_train_model3.shape, y_train_model3.shape

In [None]:
# Create the logistic regression object

logit3 = LogisticRegression(C=1)

# Fit the model to the training data

logit3.fit(X_train_model3, y_train_model3)

# Print the coefficients and intercept of the model

print('Coefficient: \n', logit3.coef_)
print('Intercept: \n', logit3.intercept_)

# Estimate whether or not a passenger would survive, using the training data

y_pred_model3 = logit3.predict(X_train_model3)
y_pred_model3

# Estimate the probability of a passenger surviving, using the training data
y_pred_proba_model3 = logit3.predict_proba(X_train_model3)

In [None]:
# Compute the accuracy

print('Accuracy: {: .2f}'.format(logit3.score(X_train_model3, y_train_model3)))

# Create a confusion matrix

print('Confusion matrix: \n', confusion_matrix(y_train_model3, y_pred_model3))

# Compute Precision, Recall, F1-score, and Support

print(classification_report(y_train_model3, y_pred_model3))

**Notes**
1. The sex column has no null values.
2. The sex column contains 577 males and 312 females.
3. In model 3, male is encoded as 1 and female as 0.

**Alternative strategy for encoding sex**
1. Male is 0 and female is 1.
2. My hypothesis is that model performance will not change.

**Results:**
Swapping the encoding does not change the performance of model 3.
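
This result is what the algebra suggests: since female = 1 - male, any model written in terms of `male` can be rewritten in terms of `female` by flipping the sign of the coefficient and shifting the intercept. The sketch below checks this with hypothetical coefficient values (they are not the fitted values from model 3).

```python
import math


def sigmoid(z):
    """Logistic function mapping log-odds to a probability."""
    return 1 / (1 + math.exp(-z))


# Hypothetical fitted values for a model that uses the `male` indicator
w_male, intercept_male = -2.5, 1.0

# Equivalent parameters for the `female` indicator:
# w*male + b = w*(1 - female) + b = (-w)*female + (w + b)
w_female = -w_male
intercept_female = intercept_male + w_male

p_from_male = [sigmoid(w_male * male + intercept_male) for male in (0, 1)]
p_from_female = [sigmoid(w_female * (1 - male) + intercept_female) for male in (0, 1)]
print(p_from_male, p_from_female)
```

Both parameterizations give the same predicted probabilities, so the same predictions and accuracy follow.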

### Bonus 3: `scikit-learn`'s `LogisticRegression` classifier is actually applying a regularization penalty to the coefficients by default. 

* This penalty causes the magnitude of the coefficients in the resulting model to be smaller than they otherwise would be. 
* This value can be modified with the `C` hyperparameter.
* Small values of `C` correspond to a larger penalty, and large values of `C` correspond to a smaller penalty.
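
The shrinking effect described above can be seen by fitting the same model with several values of `C` and comparing the size of the coefficient vector. The sketch below uses synthetic data from `make_classification` as a stand-in for the titanic features, so the numbers themselves are not meaningful for this exercise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the four titanic features
X, y = make_classification(n_samples=300, n_features=4, n_informative=4,
                           n_redundant=0, random_state=123)

coef_norms = {}
for c in [0.001, 0.01, 0.1, 1, 10, 100, 1000]:
    model = LogisticRegression(C=c, max_iter=1000).fit(X, y)
    # Euclidean length of the coefficient vector for this C
    coef_norms[c] = float(np.linalg.norm(model.coef_))

for c, norm in coef_norms.items():
    print(c, round(norm, 4))
```

Smaller `C` means a stronger penalty, so the coefficient vector is pulled closer to zero.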

### Try out the following values for `C` and note how the coefficients and the model's performance on both the dataset it was trained on and on the validate split are affected. 
* C = 0.01, 0.1, 1, 10, 100, 1000
* Use model 3:
 - X: fare, pclass, age, male
 - y: survived

In [None]:
# Load titanic dataset

titanic = acquire.get_titanic_data()
titanic.head()

In [None]:
# Prepare the titanic dataset

train, validate, test = prepare.prep_titanic(titanic)
train.head()

In [None]:
# Create dummy variable for column 'sex'

sex_dummy = pd.get_dummies(train.sex)
train = pd.concat([train, sex_dummy], axis=1)
train.head()

In [None]:
X_train = train[['fare', 'pclass', 'age', 'male']]
y_train = train[['survived']]

X_train.shape, y_train.shape

In [None]:
# Define a function that returns the coefficients for a given C value.

def logit_model_coefficient(c_value, X, y):
    logit = LogisticRegression(C=c_value)
    logit.fit(X, y)
    coefficient = logit.coef_
    return pd.DataFrame(coefficient)

In [None]:
# Calculate the coefficients for a list of C values

list_C = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
df = pd.DataFrame()
for i in list_C:
    df = pd.concat([df, logit_model_coefficient(i, X_train, y_train)])
df.index = list_C
df.columns = ['fare', 'pclass', 'age', 'male']
df

In [None]:
import math

x = [math.log10(i) for i in df.index]
y = df.male
plt.scatter(x, y)

In [None]:
def logit_model_accuracy(c_value, X, y):
    logit = LogisticRegression(C=c_value)
    logit.fit(X, y)
    accuracy = logit.score(X, y)
    return accuracy

In [None]:
logit_model_accuracy(1, X_train, y_train)

In [None]:
list_C = [0.001, 0.01, 0.1, 1, 10, 100, 1000]

accuracy_list = [logit_model_accuracy(i, X_train, y_train) for i in list_C]
df = pd.DataFrame(accuracy_list)
df.columns = ['Accuracy']
df.index = list_C
df

In [None]:
x = [math.log10(i) for i in df.index]
y = df.Accuracy
plt.scatter(x, y)
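
The cells above score each `C` on the training data only, while the exercise also asks about the validate split. A minimal sketch of scoring both splits in one loop follows; it uses synthetic data and a fresh `train_test_split` as assumptions, in place of the `train` and `validate` frames from `prepare.prep_titanic`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the titanic train and validate splits
X, y = make_classification(n_samples=400, n_features=4, n_informative=4,
                           n_redundant=0, random_state=123)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y)

results = []
for c in [0.001, 0.01, 0.1, 1, 10, 100, 1000]:
    # Fit on train only, then score the same fitted model on both splits
    model = LogisticRegression(C=c, max_iter=1000).fit(X_tr, y_tr)
    results.append({
        "C": c,
        "train_accuracy": model.score(X_tr, y_tr),
        "validate_accuracy": model.score(X_val, y_val),
    })

for row in results:
    print(row)
```

A large gap between the two columns for a given `C` would point to overfitting at that setting.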

### Decision Tree Exercises

### In this exercise, we'll continue working with the titanic dataset and building decision tree models. Throughout this exercise, be sure you are training, evaluating, and comparing models on the train and validate datasets. The test dataset should only be used for your final model.
* Continue working in your model file. Add, commit, and push your changes.


In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

import acquire
import prepare

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.metrics import classification_report, confusion_matrix

In [2]:
# Acquire titanic dataset

titanic = acquire.get_titanic_data()
titanic.head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1
3,3,1,1,female,35.0,1,0,53.1,S,First,C,Southampton,0
4,4,0,3,male,35.0,0,0,8.05,S,Third,,Southampton,1


In [3]:
# Prepare titanic dataset

train, validate, test = prepare.prep_titanic(titanic)
train.head()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,alone,sex_male,embarked_Q,embarked_S
583,0,1,36.0,0,0,40.125,1,1,0,0
337,1,1,41.0,0,0,134.5,1,0,0,0
50,0,3,7.0,4,1,39.6875,0,1,0,1
218,1,1,32.0,0,0,76.2917,1,0,0,0
31,1,1,24.0,1,0,146.5208,0,0,0,0


In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 497 entries, 583 to 553
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   survived    497 non-null    int64  
 1   pclass      497 non-null    int64  
 2   age         497 non-null    float64
 3   sibsp       497 non-null    int64  
 4   parch       497 non-null    int64  
 5   fare        497 non-null    float64
 6   alone       497 non-null    int64  
 7   sex_male    497 non-null    uint8  
 8   embarked_Q  497 non-null    uint8  
 9   embarked_S  497 non-null    uint8  
dtypes: float64(2), int64(5), uint8(3)
memory usage: 32.5 KB


In [4]:
train.shape, validate.shape, test.shape

((497, 10), (214, 10), (178, 10))

### 1. Fit the decision tree classifier to your training sample and transform (i.e. make predictions on the training sample)
### 2. Evaluate your in-sample results using the model score, confusion matrix, and classification report.

In [31]:
# X_train and y_train

X_train_BL = train[['fare', 'pclass']]
y_train_BL = train['survived']

X_train_1 = train[['fare', 'pclass', 'age']]
y_train_1 = train['survived']

X_train_2 = train[['fare', 'pclass', 'age', 'sex_male']]
y_train_2 = train['survived']

X_train_3 = train[['pclass', 'sex_male']]
y_train_3 = train['survived']

In [33]:
X_train_BL.shape, X_train_1.shape

((497, 2), (497, 3))

In [34]:
# Create the Decision Tree Object

clf = DecisionTreeClassifier(max_depth=3, random_state=123)

In [35]:
# Evaluate the models using model score

def tree_accuracy(X, y):
    clf.fit(X, y)
    accuracy = clf.score(X, y)
    return accuracy

print('Accuracy of Baseline Model:', tree_accuracy(X_train_BL, y_train_BL))
print('Accuracy of Model 1:', tree_accuracy(X_train_1, y_train_1))
print('Accuracy of Model 2:', tree_accuracy(X_train_2, y_train_2))
print('Accuracy of Model 3:', tree_accuracy(X_train_3, y_train_3))

Accuracy of Baseline Model: 0.6901408450704225
Accuracy of Model 1: 0.6901408450704225
Accuracy of Model 2: 0.8189134808853119
Accuracy of Model 3: 0.7907444668008048


In [36]:
# Evaluate models by confusion matrix

def tree_matrix(X, y):
    clf.fit(X, y)
    y_pred = clf.predict(X)
    matrix = confusion_matrix(y, y_pred)
    return matrix

print('Confusion matrix of Baseline Model:\n', tree_matrix(X_train_BL, y_train_BL))
print('Confusion matrix of Model 1:\n', tree_matrix(X_train_1, y_train_1))
print('Confusion matrix of Model 2:\n', tree_matrix(X_train_2, y_train_2))
print('Confusion matrix of Model 3:\n', tree_matrix(X_train_3, y_train_3))

Confusion matrix of Baseline Model:
 [[279  28]
 [126  64]]
Confusion matrix of Model 1:
 [[279  28]
 [126  64]]
Confusion matrix of Model 2:
 [[279  28]
 [ 62 128]]
Confusion matrix of Model 3:
 [[303   4]
 [100  90]]
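
The helpers above refit one shared `clf` on every call and score only the training data. A possible variation is sketched below: it builds a fresh tree per feature set and scores both train and validate. The data is synthetic, and the column indices are stand-ins for feature lists such as `['fare', 'pclass']`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the titanic train and validate splits
X, y = make_classification(n_samples=400, n_features=4, n_informative=4,
                           n_redundant=0, random_state=123)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y)

# Column indices standing in for the feature sets compared above
feature_sets = {
    "baseline": [0, 1],
    "model_1": [0, 1, 2],
    "model_2": [0, 1, 2, 3],
}

scores = {}
for name, cols in feature_sets.items():
    # A fresh tree per feature set, so no fitted state is shared between models
    tree = DecisionTreeClassifier(max_depth=3, random_state=123)
    tree.fit(X_tr[:, cols], y_tr)
    scores[name] = (tree.score(X_tr[:, cols], y_tr),
                    tree.score(X_val[:, cols], y_val))

for name, (train_acc, val_acc) in scores.items():
    print(f"{name}: train={train_acc:.3f}, validate={val_acc:.3f}")
```

Keeping each fitted tree separate also makes it possible to reuse the chosen one on the test split later without refitting.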


In [30]:
# Evaluate models by classification report

def tree_report(X, y):
    clf.fit(X, y)
    y_pred = clf.predict(X)
    report = classification_report(y, y_pred)
    return report

print('Classification report of Baseline Model:\n', tree_report(X_train_BL, y_train_BL))
print('Classification report of Model 1:\n', tree_report(X_train_1, y_train_1))
print('Classification report of Model 2:\n', tree_report(X_train_2, y_train_2))
print('Classification report of Model 3:\n', tree_report(X_train_3, y_train_3))

Classification report of Baseline Model:
               precision    recall  f1-score   support

           0       0.69      0.91      0.78       307
           1       0.70      0.34      0.45       190

    accuracy                           0.69       497
   macro avg       0.69      0.62      0.62       497
weighted avg       0.69      0.69      0.66       497

Classification report of Model 1:
               precision    recall  f1-score   support

           0       0.69      0.91      0.78       307
           1       0.70      0.34      0.45       190

    accuracy                           0.69       497
   macro avg       0.69      0.62      0.62       497
weighted avg       0.69      0.69      0.66       497

Classification report of Model 2:
               precision    recall  f1-score   support

           0       0.82      0.91      0.86       307
           1       0.82      0.67      0.74       190

    accuracy                           0.82       497
   macro avg    

### 3. Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [40]:
def matrix_rate(X, y):
    clf.fit(X, y)
    y_pred = clf.predict(X)
    matrix = confusion_matrix(y, y_pred)
    # Rows are actual classes, columns are predicted classes:
    # [[TN, FP],
    #  [FN, TP]]
    (TN, FP), (FN, TP) = matrix
    # Each rate is normalized by the number of actual cases in its row
    TN_rate = round(TN / (TN + FP), 4)
    TP_rate = round(TP / (TP + FN), 4)
    FP_rate = round(FP / (FP + TN), 4)
    FN_rate = round(FN / (FN + TP), 4)
    return matrix, TN_rate, TP_rate, FP_rate, FN_rate

matrix_rate(X_train_BL, y_train_BL)

(array([[279,  28],
        [126,  64]]),
 0.9088,
 0.3368,
 0.0912,
 0.6632)

In [None]:
def model_eval(X, y): 

**Notes**
1. The baseline accuracy is 0.618
2. My baseline model is X = ['fare', 'pclass'], y = 'survived'
3. Model 1: X = ['fare', 'pclass', 'age']
4. Model 2: X = ['fare', 'pclass', 'age', 'sex_male']
5. Model 3: X = ['pclass', 'sex_male']
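
For exercise 3, one way to print and label the requested metrics is sketched below. `model_eval_sketch` is a hypothetical helper and not the unfinished `model_eval` above; it takes actual and predicted labels, and the labels used here are made up.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)


def model_eval_sketch(y_true, y_pred):
    """Return labeled metrics, treating 1 (survived) as the positive class."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "true positive rate": tp / (tp + fn),
        "false positive rate": fp / (fp + tn),
        "true negative rate": tn / (tn + fp),
        "false negative rate": fn / (fn + tp),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1-score": f1_score(y_true, y_pred),
        "support (positive class)": int(tp + fn),
    }


# Made-up labels for illustration: 1 = survived, 0 = did not survive
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]

metrics = model_eval_sketch(y_true, y_pred)
for label, value in metrics.items():
    print(f"{label}: {value:.3f}")
```

With a fitted tree, `y_pred` would come from `clf.predict(X)` on the split being evaluated.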