### In this exercise, we'll continue working with the Titanic dataset and building logistic regression models. Throughout this exercise, be sure you are training, evaluating, and comparing models on the train and validate datasets. The test dataset should only be used for your final model.

### For all of the models you create, choose a threshold that optimizes for accuracy. 

### Do your work for these exercises in either a notebook or a python script named model within your classification-exercises repository. Add, commit, and push your work.

**Takeaways**:
1. Build logistic regression models for the Titanic dataset.
2. Several models need to be built.
3. Accuracy is the evaluation metric.
4. Target variable: `survived` (categorical)
5. The positive case is predicting that a passenger survived.
    - TP: predicted survived, actually survived
    - FP: predicted survived, actually a victim
    - TN: predicted a victim, actually a victim
    - FN: predicted a victim, actually survived
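
The four outcomes listed above map directly onto the cells of scikit-learn's confusion matrix. The following is a minimal sketch on made-up labels (the arrays `y_true` and `y_pred` are hypothetical and are not Titanic data), showing how the counts can be unpacked and turned into accuracy:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only: 1 = survived, 0 = did not survive
y_true = np.array([1, 0, 0, 1, 1, 0, 0, 1])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 1])

# For binary labels {0, 1}, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Accuracy is the share of correct predictions: (TP + TN) / all
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"TP={tp} FP={fp} TN={tn} FN={fn} accuracy={accuracy:.2f}")
```

Rows of the matrix are the actual classes and columns are the predicted classes, so the top-left cell is TN and the bottom-right cell is TP.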

In [21]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

import acquire
import prepare

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

### 1. Create another model that includes age in addition to fare and pclass. Does this model perform better than your previous one? 

In [10]:
# Acquire titanic data.

titanic = acquire.get_titanic_data()
titanic.head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1
3,3,1,1,female,35.0,1,0,53.1,S,First,C,Southampton,0
4,4,0,3,male,35.0,0,0,8.05,S,Third,,Southampton,1


In [11]:
# Prepare titanic dataset

train, validate, test = prepare.prep_titanic(titanic)
train.head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,embark_town,alone,embarked_Q,embarked_S
583,583,0,1,male,36.0,0,0,40.125,C,First,Cherbourg,1,0,0
337,337,1,1,female,41.0,0,0,134.5,C,First,Cherbourg,1,0,0
50,50,0,3,male,7.0,4,1,39.6875,S,Third,Southampton,0,0,1
218,218,1,1,female,32.0,0,0,76.2917,C,First,Cherbourg,1,0,0
31,31,1,1,female,24.0,1,0,146.5208,C,First,Cherbourg,0,0,0


In [16]:
train.shape

(497, 14)

In [14]:
# Double-check whether there are any missing values

train.isnull().sum()

passenger_id    0
survived        0
pclass          0
sex             0
age             0
sibsp           0
parch           0
fare            0
embarked        0
class           0
embark_town     0
alone           0
embarked_Q      0
embarked_S      0
dtype: int64

In [29]:
# Establish the baseline

train.survived.value_counts()

0    307
1    190
Name: survived, dtype: int64

In [39]:
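# Share of survivors in train: 190 / 497. Note that this is not the baseline accuracy.
# A baseline that always predicts the majority class (0, did not survive) scores 307 / 497, about 0.618.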

190 /(190+307)

0.3822937625754527
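
A baseline for accuracy is usually the share of the most frequent class, since a model that always predicts that class is correct that often. As a small sketch, the series below is rebuilt from the class counts reported by `value_counts()` (307 zeros and 190 ones); the variable names are illustrative:

```python
import pandas as pd

# Rebuild the target from the reported class counts: 307 zeros, 190 ones
y_train = pd.Series([0] * 307 + [1] * 190, name="survived")

# The baseline always predicts the most frequent class
baseline_class = y_train.mode()[0]

# Baseline accuracy is the share of rows that match that class
baseline_accuracy = (y_train == baseline_class).mean()

print(f"baseline class: {baseline_class}, baseline accuracy: {baseline_accuracy:.4f}")
```

With these counts the majority class is 0 (did not survive), so any model should beat roughly 0.62 accuracy to be considered useful.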

### Model 1: X = ['fare', 'pclass'], y = 'survived'

In [43]:
# fare and pclass are the X in model1.

X_train_model1 = train[['fare', 'pclass']]
y_train_model1 = train[['survived']]

X_train_model1.shape, y_train_model1.shape

((497, 2), (497, 1))

In [44]:
# Create the logistic regression object

logit1 = LogisticRegression(C=1)

# Fit the model to the training data

logit1.fit(X_train_model1, y_train_model1)

# Print the coefficients and intercept of the model

print('Coefficient: \n', logit1.coef_)
print('Intercept: \n', logit1.intercept_)

Coefficient: 
 [[ 0.00391168 -0.74992231]]
Intercept: 
 [1.06417548]
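
One common way to read logistic regression coefficients is to exponentiate them into odds ratios. The sketch below copies the Model 1 coefficients printed above and assumes they are in the same order as the feature columns, `['fare', 'pclass']`:

```python
import numpy as np

# Coefficients as printed above for Model 1, assumed order: ['fare', 'pclass']
features = ["fare", "pclass"]
coef = np.array([0.00391168, -0.74992231])

# exp(coefficient) is the multiplicative change in the odds of survival
# for a one-unit increase in that feature, holding the other feature fixed
odds_ratios = np.exp(coef)

for name, ratio in zip(features, odds_ratios):
    print(f"{name}: odds ratio = {ratio:.3f}")
```

A ratio below 1 (as for `pclass`) means the odds of survival shrink as the feature increases, and a ratio above 1 (as for `fare`) means they grow.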


In [28]:
# Estimate whether or not a passenger would survive, using the training data

y_pred_model1 = logit1.predict(X_train_model1)
y_pred_model1

# Estimate the probability of a passenger surviving, using the training data
y_pred_proba_model1 = logit1.predict_proba(X_train_model1)

0.24949698189134809

**Evaluate model on train**

In [33]:
# Compute the accuracy

logit1.score(X_train_model1, y_train_model1)

0.682092555331992

In [34]:
# Create a confusion matrix

confusion_matrix(y_train_model1, y_pred_model1)

array([[261,  46],
       [112,  78]])

In [36]:
# Compute Precision, Recall, F1-score, and Support

print(classification_report(y_train_model1, y_pred_model1))

              precision    recall  f1-score   support

           0       0.70      0.85      0.77       307
           1       0.63      0.41      0.50       190

    accuracy                           0.68       497
   macro avg       0.66      0.63      0.63       497
weighted avg       0.67      0.68      0.66       497
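
The figures in the report can be cross-checked by hand from the confusion matrix. This sketch hard-codes the Model 1 matrix printed above and recomputes accuracy plus precision, recall, and F1 for the positive class (survived):

```python
import numpy as np

# Confusion matrix reported above for Model 1 on train
# (rows = actual, columns = predicted, label order [0, 1])
cm = np.array([[261, 46],
               [112, 78]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()       # share of all predictions that are correct
precision = tp / (tp + fp)            # of predicted survivors, how many survived
recall = tp / (tp + fn)               # of actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

The rounded values agree with the accuracy row and the class `1` row of the report.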



### Model 2: X = ['fare', 'pclass', 'age'], y = 'survived'

In [41]:
# fare, pclass, age are the X in model2.

X_train_model2 = train[['fare', 'pclass', 'age']]
y_train_model2 = train[['survived']]

X_train_model2.shape, y_train_model2.shape

((497, 3), (497, 1))

In [42]:
# Create the logistic regression object

logit2 = LogisticRegression(C=1)

# Fit the model to the training data

logit2.fit(X_train_model2, y_train_model2)

# Print the coefficients and intercept of the model

print('Coefficient: \n', logit2.coef_)
print('Intercept: \n', logit2.intercept_)

Coefficient: 
 [[ 0.00276706 -0.97541889 -0.02864116]]
Intercept: 
 [2.42980594]


In [46]:
# Estimate whether or not a passenger would survive, using the training data

y_pred_model2 = logit2.predict(X_train_model2)
y_pred_model2

# Estimate the probability of a passenger surviving, using the training data
y_pred_proba_model2 = logit2.predict_proba(X_train_model2)
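
The exercise asks for a threshold that optimizes accuracy, and the predicted probabilities are the input for that. The sketch below is one possible approach, not necessarily the intended one: it scans a grid of thresholds and keeps the most accurate. It uses synthetic data from `make_classification` as a stand-in for the Titanic features, and the threshold grid is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for three features and a binary target (not Titanic data)
X, y = make_classification(n_samples=500, n_features=3, n_informative=2,
                           n_redundant=0, random_state=123)

model = LogisticRegression(C=1).fit(X, y)

# Column 1 of predict_proba holds the probability of the positive class
proba_positive = model.predict_proba(X)[:, 1]

# Try each candidate threshold and record the resulting accuracy
thresholds = np.linspace(0.05, 0.95, 19)
accuracies = [((proba_positive >= t).astype(int) == y).mean() for t in thresholds]

best_idx = int(np.argmax(accuracies))
best_threshold = thresholds[best_idx]
best_accuracy = accuracies[best_idx]
print(f"best threshold: {best_threshold:.2f}, accuracy: {best_accuracy:.3f}")
```

In the exercise itself, the threshold would be chosen on train and then checked on validate, so that the chosen value is not simply fitted to one split.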

**Evaluate model on train**

In [51]:
# Compute the accuracy

logit2.score(X_train_model2, y_train_model2)

0.716297786720322

In [49]:
# Create a confusion matrix

confusion_matrix(y_train_model2, y_pred_model2)

array([[268,  39],
       [102,  88]])

In [52]:
# Compute Precision, Recall, F1-score, and Support

print(classification_report(y_train_model2, y_pred_model2))

              precision    recall  f1-score   support

           0       0.72      0.87      0.79       307
           1       0.69      0.46      0.56       190

    accuracy                           0.72       497
   macro avg       0.71      0.67      0.67       497
weighted avg       0.71      0.72      0.70       497
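
Train accuracy alone does not show whether the larger model generalizes better; the comparison is more informative when both models are also scored on validate. The following sketch illustrates the pattern on synthetic data (the data, split sizes, and feature-set names are assumptions for illustration, not the Titanic splits from `prepare`):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with three informative features (not Titanic data)
X, y = make_classification(n_samples=600, n_features=3, n_informative=3,
                           n_redundant=0, random_state=123)
X_train, X_validate, y_train, y_validate = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y)

# Column indices used by each model: the second model adds one more feature
feature_sets = {"model1 (2 features)": [0, 1],
                "model2 (3 features)": [0, 1, 2]}

results = {}
for name, cols in feature_sets.items():
    model = LogisticRegression(C=1).fit(X_train[:, cols], y_train)
    results[name] = {
        "train": model.score(X_train[:, cols], y_train),
        "validate": model.score(X_validate[:, cols], y_validate),
    }

for name, scores in results.items():
    print(f"{name}: train={scores['train']:.3f} validate={scores['validate']:.3f}")
```

A model whose validate accuracy falls well below its train accuracy is likely overfitting, even if its train accuracy is the highest.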



### 2. Include sex in your model as well. Note that you'll need to encode or create a dummy variable of this feature before including it in a model.

**Notes**
1. Model 1 (the earlier model) contains only fare and pclass as X; Model 2 adds age.
2. There are no missing values in the train dataset.
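
Since `sex` is stored as text, it has to be turned into a numeric column before it can be passed to `LogisticRegression`. One option is `pd.get_dummies`; the sketch below uses a small hypothetical frame in place of the prepared train split:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the prepared train split
df = pd.DataFrame({"sex": ["male", "female", "female", "male"],
                   "fare": [7.25, 71.28, 7.93, 8.05]})

# drop_first=True keeps a single indicator column (sex_male);
# dtype=int yields 0/1 values instead of booleans
dummies = pd.get_dummies(df[["sex"]], drop_first=True, dtype=int)

# Attach the encoded column alongside the original columns
encoded = pd.concat([df, dummies], axis=1)
print(encoded)
```

Dropping the first level avoids carrying two perfectly redundant columns (`sex_female` and `sex_male`) into the model.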