# Modeling

## Logistic Regression 

In this exercise, we'll continue working with the Titanic dataset and building logistic regression models. Throughout this exercise, be sure you are training, evaluating, and comparing models on the train and validate datasets. The test dataset should only be used for your final model.

For all of the models you create, choose a threshold that optimizes for accuracy.
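One way to pick such a threshold is to score a grid of cutoffs against the predicted probabilities. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in for the Titanic splits, and the grid of 19 thresholds is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; swap in your Titanic train/validate splits).
X, y = make_classification(n_samples=500, n_features=4, random_state=123)
X_train, X_validate, y_train, y_validate = train_test_split(
    X, y, test_size=0.3, random_state=123
)

model = LogisticRegression().fit(X_train, y_train)

# Probability of the positive class for each training row.
train_proba = model.predict_proba(X_train)[:, 1]

# Try a grid of thresholds and keep the one with the highest train accuracy.
thresholds = np.linspace(0.05, 0.95, 19)
accuracies = [((train_proba >= t).astype(int) == y_train).mean() for t in thresholds]
best_threshold = thresholds[int(np.argmax(accuracies))]

# Check the chosen threshold on validate before relying on it.
validate_proba = model.predict_proba(X_validate)[:, 1]
validate_accuracy = ((validate_proba >= best_threshold).astype(int) == y_validate).mean()
print(f"best threshold: {best_threshold:.2f}, validate accuracy: {validate_accuracy:.2f}")
```

Calling `model.predict` directly is equivalent to a fixed threshold of 0.5, so the search only matters when another cutoff scores higher.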

Do your work for these exercises in either a notebook or a Python script named model within your classification-exercises repository. Add, commit, and push your work.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings("ignore")

from env import username, password, host
from acquire import get_titanic_data
from prepare import prep_titanic

In [2]:
def get_connection(db, username=username, host=host, password=password):
    # Build the SQLAlchemy connection URL for the given database name.
    return f'mysql+pymysql://{username}:{password}@{host}/{db}'

In [3]:
def new_titanic_data():
    # Query the full passengers table and cache the result locally as a CSV file.
    sql_query = 'SELECT * FROM passengers'
    df = pd.read_sql(sql_query, get_connection('titanic_db'))
    df.to_csv('titanic_df.csv')
    return df

In [4]:
df = get_titanic_data(cached=False)
df.head()

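|   | passenger_id | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | deck | embark_town | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third |  | Southampton | 0 |
| 1 | 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | C | Cherbourg | 0 |
| 2 | 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third |  | Southampton | 1 |
| 3 | 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | C | Southampton | 0 |
| 4 | 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third |  | Southampton | 1 |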


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   passenger_id  891 non-null    int64  
 1   survived      891 non-null    int64  
 2   pclass        891 non-null    int64  
 3   sex           891 non-null    object 
 4   age           714 non-null    float64
 5   sibsp         891 non-null    int64  
 6   parch         891 non-null    int64  
 7   fare          891 non-null    float64
 8   embarked      889 non-null    object 
 9   class         891 non-null    object 
 10  deck          203 non-null    object 
 11  embark_town   889 non-null    object 
 12  alone         891 non-null    int64  
dtypes: float64(2), int64(6), object(5)
memory usage: 97.5+ KB


### Problem 1

Start by defining your baseline model.

In [6]:
X1 = df[['pclass','fare']]
y1 = df[['survived']]

X1_train_validate, X1_test, y1_train_validate, y1_test = train_test_split(X1, y1, test_size = .20, random_state = 123)

X1_train, X1_validate, y1_train, y1_validate = train_test_split(X1_train_validate, y1_train_validate, test_size = .30, random_state = 123)

print("train: ", X1_train.shape, ", validate: ", X1_validate.shape, ", test: ", X1_test.shape)
print("train: ", y1_train.shape, ", validate: ", y1_validate.shape, ", test: ", y1_test.shape)

train:  (498, 2) , validate:  (214, 2) , test:  (179, 2)
train:  (498, 1) , validate:  (214, 1) , test:  (179, 1)


In [7]:
y1_train.survived.value_counts()

0    302
1    196
Name: survived, dtype: int64

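- The baseline predicts "not survived" (0) for every passenger because it is the majority class in train: 302 of 498 passengers, for a baseline accuracy of about 61%.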
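The baseline accuracy can be computed directly from the class counts. This is a minimal sketch that rebuilds a target series from the `value_counts()` output above (302 zeros, 196 ones) rather than loading the real data; the variable names are illustrative.

```python
import pandas as pd

# Rebuild the training target from the counts printed above:
# 302 passengers who did not survive (0) and 196 who did (1).
y_train = pd.Series([0] * 302 + [1] * 196, name="survived")

# The baseline always predicts the most common class.
baseline_class = y_train.mode()[0]

# Its accuracy is the share of rows that belong to that class.
baseline_accuracy = (y_train == baseline_class).mean()

print(f"baseline prediction: {baseline_class}, accuracy: {baseline_accuracy:.3f}")
```

Any model worth keeping should beat this number on both train and validate.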

### Problem 2

Create another model that includes age in addition to fare and pclass. Does this model perform better than your baseline?

In [8]:
X1_train.head()

|   | pclass | fare |
|---|---|---|
| 689 | 1 | 211.3375 |
| 84 | 2 | 10.5 |
| 738 | 3 | 7.8958 |
| 441 | 3 | 9.5 |
| 643 | 3 | 56.4958 |


In [9]:
X1_train.describe()

|   | pclass | fare |
|---|---|---|
| count | 498.0 | 498.0 |
| mean | 2.295181 | 33.911888 |
| std | 0.848254 | 54.158606 |
| min | 1.0 | 0.0 |
| 25% | 1.0 | 7.9031 |
| 50% | 3.0 | 14.45625 |
| 75% | 3.0 | 30.5 |
| max | 3.0 | 512.3292 |


In [10]:
y1_train.describe()

|   | survived |
|---|---|
| count | 498.0 |
| mean | 0.393574 |
| std | 0.489034 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.0 |
| 75% | 1.0 |
| max | 1.0 |


In [13]:
model1 = LogisticRegression(C = 1)
model1.fit(X1_train, y1_train)
print('Coefficient: ', model1.coef_)
print('Intercept: ', model1.intercept_)

Coefficient:  [[-0.74859387  0.0039592 ]]
Intercept:  [1.11884386]
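
To see what these numbers mean, the fitted coefficients can be plugged into the logistic function by hand. The sketch below copies the coefficient and intercept values printed above (feature order `pclass`, `fare`); the two passengers are hypothetical and chosen only for illustration.

```python
import numpy as np

# Coefficients and intercept copied from the fitted model output above
# (feature order: pclass, fare).
coef = np.array([-0.74859387, 0.0039592])
intercept = 1.11884386


def survival_probability(pclass, fare):
    """Apply the logistic function to the linear combination of the features."""
    z = intercept + coef @ np.array([pclass, fare])
    return 1 / (1 + np.exp(-z))


# Hypothetical passengers, used only to illustrate the calculation.
p_third = survival_probability(pclass=3, fare=7.25)
p_first = survival_probability(pclass=1, fare=71.28)

print(f"third class, low fare:   {p_third:.2f}")
print(f"first class, high fare:  {p_first:.2f}")
```

The negative `pclass` coefficient lowers the predicted probability as the class number rises, while the small positive `fare` coefficient raises it slightly with each unit of fare.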


In [14]:
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(model1.score(X1_train, y1_train)))

Accuracy of Logistic Regression classifier on training set: 0.69


In [18]:
y_pred1 = model1.predict(X1_validate)
print("Model 1")

print('Accuracy: {:.2f}'.format(model1.score(X1_validate, y1_validate)))

print(confusion_matrix(y1_validate, y_pred1))

print(classification_report(y1_validate, y_pred1))

Model 1
Accuracy: 0.63
[[114  19]
 [ 61  20]]
              precision    recall  f1-score   support

           0       0.65      0.86      0.74       133
           1       0.51      0.25      0.33        81

    accuracy                           0.63       214
   macro avg       0.58      0.55      0.54       214
weighted avg       0.60      0.63      0.59       214



### Problem 3

Include sex in your model as well. Note that you'll need to encode or create a dummy variable of this feature before including it in a model.
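
A possible way to do the encoding is `pd.get_dummies`. The sketch below uses a tiny hand-made DataFrame as a stand-in for the prepared train split (the rows are invented for illustration), and the `sex_male` column name comes from pandas' default naming for dummy columns.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny hand-made sample (an assumption; use your prepared Titanic train split).
train = pd.DataFrame({
    "pclass": [3, 1, 3, 1, 2, 3, 2, 1],
    "fare": [7.25, 71.28, 7.92, 53.10, 13.00, 8.05, 26.00, 30.00],
    "sex": ["male", "female", "female", "female", "male", "male", "female", "male"],
    "survived": [0, 1, 1, 1, 0, 0, 1, 0],
})

# drop_first=True keeps a single 0/1 column (sex_male) instead of two redundant ones.
dummies = pd.get_dummies(train[["sex"]], drop_first=True).astype(int)
train = pd.concat([train, dummies], axis=1)

X_train = train[["pclass", "fare", "sex_male"]]
y_train = train["survived"]

model = LogisticRegression().fit(X_train, y_train)
print(dict(zip(X_train.columns, model.coef_[0].round(3))))
```

Encode validate and test the same way so that all three splits end up with identical columns.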

### Problem 4

Try out other combinations of features and models.

### Problem 5

Use your best 3 models to predict and evaluate on your validate sample.

### Problem 6

Choose your best model based on validation performance, and evaluate it on the test dataset. How do the performance metrics compare to validate? To train?

## Decision Tree

### Problem 1

Fit the decision tree classifier to your training sample and transform (i.e. make predictions on the training sample).
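
A minimal sketch of the fit-and-predict step follows. It assumes synthetic data from `make_classification` in place of the Titanic splits, and `max_depth=3` is an arbitrary starting value, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; use your Titanic splits instead).
X, y = make_classification(n_samples=500, n_features=5, random_state=123)
X_train, X_validate, y_train, y_validate = train_test_split(
    X, y, test_size=0.3, random_state=123
)

# Create the classifier, fit it on train, then predict on the same sample.
tree = DecisionTreeClassifier(max_depth=3, random_state=123)
tree.fit(X_train, y_train)
y_pred_train = tree.predict(X_train)

print(f"in-sample accuracy: {tree.score(X_train, y_train):.2f}")
```

With scikit-learn classifiers, the "transform" step in the prompt corresponds to calling `predict`.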

### Problem 2

Evaluate your in-sample results using the model score, confusion matrix, and classification report.

### Problem 3

Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.
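
All of these rates can be derived from the four cells of a confusion matrix. As sample numbers, the sketch below reuses the validate confusion matrix printed in the Logistic Regression section (`[[114, 19], [61, 20]]`); for this exercise you would substitute the matrix from your own decision tree.

```python
import numpy as np

# Sample confusion matrix taken from the logistic regression output earlier
# in this document. Rows are actual classes, columns are predicted classes.
cm = np.array([[114, 19],
               [61, 20]])

# For a binary problem, ravel() yields the cells in this order.
tn, fp, fn, tp = (int(v) for v in cm.ravel())

accuracy = (tp + tn) / (tp + tn + fp + fn)
true_positive_rate = tp / (tp + fn)    # also called recall or sensitivity
false_positive_rate = fp / (fp + tn)
true_negative_rate = tn / (tn + fp)    # also called specificity
false_negative_rate = fn / (fn + tp)
precision = tp / (tp + fp)
recall = true_positive_rate
f1 = 2 * precision * recall / (precision + recall)
support_positive = tp + fn
support_negative = tn + fp

print(f"Accuracy:            {accuracy:.2f}")
print(f"True positive rate:  {true_positive_rate:.2f}")
print(f"False positive rate: {false_positive_rate:.2f}")
print(f"True negative rate:  {true_negative_rate:.2f}")
print(f"False negative rate: {false_negative_rate:.2f}")
print(f"Precision:           {precision:.2f}")
print(f"Recall:              {recall:.2f}")
print(f"F1 score:            {f1:.2f}")
print(f"Support (1 / 0):     {support_positive} / {support_negative}")
```

The precision, recall, F1, and support values for class 1 agree with the corresponding row of the classification report for that model.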

### Problem 4

Repeat Problems 1-3 using a different max_depth value.

### Problem 5

Which model performs better on your in-sample data?

### Problem 6

Which model performs best on your out-of-sample data, the validate set?
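
One way to compare trees is to record train and validate accuracy side by side for each `max_depth`. The sketch below assumes synthetic, slightly noisy data as a stand-in for the Titanic splits; the depths 2, 5, and 10 are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (an assumption).
X, y = make_classification(n_samples=500, n_features=5, flip_y=0.1, random_state=123)
X_train, X_validate, y_train, y_validate = train_test_split(
    X, y, test_size=0.3, random_state=123
)

depth_scores = {}
for depth in (2, 5, 10):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=123)
    tree.fit(X_train, y_train)
    depth_scores[depth] = {
        "train": tree.score(X_train, y_train),
        "validate": tree.score(X_validate, y_validate),
    }
    print(f"max_depth={depth}: {depth_scores[depth]}")
```

A large gap between the train and validate scores for a given depth is a sign of overfitting, so the best in-sample model is not necessarily the best out-of-sample one.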

## Random Forest

### Problem 1

Fit the Random Forest classifier to your training sample and transform (i.e. make predictions on the training sample) setting the random_state accordingly and setting min_samples_leaf = 1 and max_depth = 20.

### Problem 2

Evaluate your results using the model score, confusion matrix, and classification report.

### Problem 3

Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

### Problem 4

Repeat Problems 1-3, increasing min_samples_leaf to 5 and decreasing max_depth to 3.

### Problem 5

What are the differences in the evaluation metrics? Which performs better on your in-sample data? Why?
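
The two random forest configurations from this section can be compared in a single loop. This sketch assumes synthetic, slightly noisy data in place of the Titanic splits, and the labels `deep` and `shallow` are just illustrative names for the two settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with some label noise (an assumption).
X, y = make_classification(n_samples=500, n_features=6, flip_y=0.1, random_state=123)
X_train, X_validate, y_train, y_validate = train_test_split(
    X, y, test_size=0.3, random_state=123
)

# The two hyperparameter settings described in this section.
settings = {
    "deep": {"min_samples_leaf": 1, "max_depth": 20},
    "shallow": {"min_samples_leaf": 5, "max_depth": 3},
}

rf_scores = {}
for name, params in settings.items():
    forest = RandomForestClassifier(random_state=123, **params)
    forest.fit(X_train, y_train)
    rf_scores[name] = {
        "train": forest.score(X_train, y_train),
        "validate": forest.score(X_validate, y_validate),
    }
    print(name, rf_scores[name])
```

Trees that may grow to depth 20 with single-sample leaves can memorize much of the training data, which tends to push in-sample accuracy up without a matching gain on validate.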

## K-Nearest Neighbor

### Problem 1

Fit a K-Nearest Neighbors classifier to your training sample and transform (i.e. make predictions on the training sample).

### Problem 2

Evaluate your results using the model score, confusion matrix, and classification report.

### Problem 3

Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

### Problem 4

Repeat Problems 1-3 with k set to 10.

### Problem 5

Repeat Problems 1-3 with k set to 20.

### Problem 6

What are the differences in the evaluation metrics? Which performs better on your in-sample data? Why?

### Problem 7

Which model performs best on our out-of-sample data from validate?
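
The three values of k can be compared with one loop. The sketch below assumes synthetic data in place of the Titanic splits and uses k=5 (scikit-learn's default) as the first model. Scaling the features with `StandardScaler` is an addition not required by the prompts; it is included because KNN relies on distances, which are sensitive to feature scale.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; use your Titanic splits instead).
X, y = make_classification(n_samples=500, n_features=5, random_state=123)
X_train, X_validate, y_train, y_validate = train_test_split(
    X, y, test_size=0.3, random_state=123
)

knn_scores = {}
for k in (5, 10, 20):
    # The scaler is fit on train only, then reused when scoring validate.
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    knn.fit(X_train, y_train)
    knn_scores[k] = {
        "train": knn.score(X_train, y_train),
        "validate": knn.score(X_validate, y_validate),
    }
    print(f"k={k}: {knn_scores[k]}")
```

Smaller values of k follow the training data more closely, so comparing the train and validate columns shows which k generalizes best.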

## Test

1. Determine which model (with hyperparameters) performs the best (try reducing the number of features to the top 4 features in terms of information gained for each feature individually).

2. Create a new dataframe with top 4 features.

3. Use the top performing algorithm with the hyperparameters used in that model. Create the object, fit, transform on in-sample data, and evaluate the results with the training data. Compare your evaluation metrics with those from the original model (with all the features).

4. Run your final model on your out-of-sample dataframe (test_df). Evaluate the results.
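
Steps 1 and 2 could be sketched as follows. This reads "information gained for each feature individually" as the mutual information between each feature and the target, computed with `mutual_info_classif`; that reading is an assumption, and the data and the `feature_i` column names are synthetic placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic stand-in data with 8 features (an assumption).
X_arr, y_train = make_classification(
    n_samples=500, n_features=8, n_informative=4, n_redundant=0, random_state=123
)
X_train = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])

# Score every feature individually against the target.
mi = pd.Series(
    mutual_info_classif(X_train, y_train, random_state=123),
    index=X_train.columns,
)

# Keep the four highest-scoring features and build the reduced dataframe.
top4 = mi.sort_values(ascending=False).head(4).index.tolist()
X_train_top4 = X_train[top4]

print(mi.sort_values(ascending=False).round(3))
print("selected:", top4)
```

Select the same four columns from validate and test before refitting and evaluating the final model.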

## Feature Engineering

Titanic Data
- Create a feature named who, which should be either man, woman, or child. How does including this feature affect your model's performance?
- Create a feature named adult_male that is either a 1 or a 0. How does this affect your model's predictions?
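
The two Titanic features might be built as below. The exercise does not define "child", so the cutoff of 16 years is an assumption, as is treating passengers with a missing age as adults; the sample rows are invented.

```python
import numpy as np
import pandas as pd

# Tiny hand-made sample (an assumption; use your Titanic dataframe instead).
df = pd.DataFrame({
    "sex": ["male", "female", "male", "female", "male"],
    "age": [22.0, 38.0, 4.0, 14.0, np.nan],
})

CHILD_MAX_AGE = 16  # assumed cutoff; the exercise does not specify one

# A missing age compares as False here, so those passengers count as adults.
is_child = df["age"] < CHILD_MAX_AGE

df["who"] = np.where(
    is_child, "child", np.where(df["sex"] == "male", "man", "woman")
)
df["adult_male"] = (df["who"] == "man").astype(int)

print(df)
```

Because `who` is categorical, it needs to be dummy-encoded before it can go into a model, whereas `adult_male` is already numeric.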

Iris Data
- Create features named petal_area and sepal_area.
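
A simple way to build the iris features is to multiply length by width. Treating the area as a rectangle is an assumption, as are the column names `sepal_length`, `sepal_width`, `petal_length`, and `petal_width`; the two rows are sample values for illustration.

```python
import pandas as pd

# Two sample rows (an assumption; use your acquired iris dataframe instead).
iris = pd.DataFrame({
    "sepal_length": [5.1, 7.0],
    "sepal_width": [3.5, 3.2],
    "petal_length": [1.4, 4.7],
    "petal_width": [0.2, 1.4],
})

# Approximate each area as length times width.
iris["petal_area"] = iris["petal_length"] * iris["petal_width"]
iris["sepal_area"] = iris["sepal_length"] * iris["sepal_width"]

print(iris[["petal_area", "sepal_area"]])
```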