# Intro to Modeling

Plan ➔ Acquire ➔ Prepare ➔ Explore ➔ **Model** ➔ Deliver

Before modeling:

0. Split your data
1. Data preprocessing

The modeling "loop"

1. Create a model
    - algorithm + hyperparams
    - training data
1. Evaluate the model
1. Repeat

After a certain amount of time or number of repetitions has passed:

1. Compare models
1. Evaluate on test
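The steps above can be sketched end to end as follows. This is a minimal illustration on synthetic data, not the lesson's Titanic workflow: the random data, the split proportions, the candidate names, and the `X_tr` / `X_val` / `X_te` variables are all assumptions made for the example.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 200 rows, 3 features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.integers(0, 2, size=200)

# Before modeling: split into train / validate / test.
X_rest, X_te, y_rest, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_rest, y_rest, test_size=0.3, random_state=0)

# The modeling loop: create a model, fit on train, evaluate on validate, repeat.
candidates = {
    'constant_0': DummyClassifier(strategy='constant', constant=0),
    'constant_1': DummyClassifier(strategy='constant', constant=1),
    'most_frequent': DummyClassifier(strategy='most_frequent'),
}
validate_scores = {}
for name, candidate in candidates.items():
    candidate.fit(X_tr, y_tr)
    validate_scores[name] = candidate.score(X_val, y_val)

# After the loop: compare on validate, then score only the winner on test.
best_name = max(validate_scores, key=validate_scores.get)
test_score = candidates[best_name].score(X_te, y_te)
print(best_name, validate_scores[best_name], test_score)
```

The test split is touched exactly once, after the choice of model has been made on validate.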

In [2]:
# We'll use sklearn's DummyClassifier as a stand-in for other classification algorithms.
# It behaves the same way, and we use it the same way, as the "real" models.
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report
import acquire
import prepare
import pandas as pd

## Data Split

In [3]:
train, validate, test = prepare.prep_titanic_data(
    acquire.get_titanic_data(),
    column='age',
    method='median',
    dummies=['embarked', 'sex'],
)


In [4]:
train.shape, validate.shape, test.shape

((498, 12), (214, 12), (179, 12))

In [5]:
X_train, y_train = train.drop(columns='survived'), train.survived
X_validate, y_validate = validate.drop(columns='survived'), validate.survived
X_test, y_test = test.drop(columns='survived'), test.survived

In [6]:
X_train.shape, X_validate.shape, X_test.shape

((498, 11), (214, 11), (179, 11))
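The `prepare.prep_titanic_data` helper is not shown here, so how it splits the data is unknown. As a rough sketch, row counts of 498 / 214 / 179 out of 891 are what two calls to `train_test_split` would give with a 20% test split followed by a 30% validate split. Those proportions, the stratification on the target, the synthetic DataFrame, and the `*_demo` names are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the Titanic data: 891 rows with a binary target.
rng = np.random.default_rng(123)
df = pd.DataFrame({
    'age': rng.uniform(1, 80, size=891),
    'fare': rng.uniform(5, 100, size=891),
    'survived': rng.integers(0, 2, size=891),
})

# First carve off 20% for test, then 30% of the remainder for validate.
# Stratifying keeps the share of survivors similar in every split.
train_validate_demo, test_demo = train_test_split(
    df, test_size=0.2, random_state=123, stratify=df.survived
)
train_demo, validate_demo = train_test_split(
    train_validate_demo, test_size=0.3, random_state=123,
    stratify=train_validate_demo.survived,
)

print(train_demo.shape, validate_demo.shape, test_demo.shape)
```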

## Create our First Model

### Aside: Working with sklearn ML objects

1. Create the object
1. Fit the object on training data
1. Use the object (.score, .predict, .transform)

In [7]:
# 1. Create the object
model = DummyClassifier(strategy='constant', constant=1)
# 2. Fit the object
model.fit(X_train, y_train)

DummyClassifier(constant=1, strategy='constant')

Ways we use sklearn classification models:

- `.score` gives us accuracy
- `.predict` lets us make predictions given a set of independent variables
- `.predict_proba` gives us the probability that each observation falls into each label
- some specific model types have additional properties
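Of the methods listed above, `.predict_proba` is the least self-explanatory, so here is a small sketch of what it returns. The toy arrays and the `toy_model` name are made up for illustration; `strategy='prior'` is used because it makes the probabilities easy to check by hand.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy data: DummyClassifier ignores the feature values.
X_toy = np.zeros((5, 2))
y_toy = np.array([0, 0, 0, 1, 1])

toy_model = DummyClassifier(strategy='prior')
toy_model.fit(X_toy, y_toy)

# One row per observation, one column per label, in the order of .classes_
probabilities = toy_model.predict_proba(X_toy)
print(toy_model.classes_)  # label order for the columns
print(probabilities[0])    # class frequencies seen in training (3/5 and 2/5)

# .predict returns the label with the highest probability
predictions = toy_model.predict(X_toy)
print(predictions)
```

Each row of the probability array sums to 1, and the column order always follows `.classes_`.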

In [8]:
print('Training accuracy: %.4f' % model.score(X_train, y_train))

Training accuracy: 0.3835


In [9]:
# TODO: view the accuracy on the validate split

In [10]:
model.score(X_validate, y_validate)

0.38317757009345793

In [14]:
model.predict(X_validate)

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

**When using `.predict`, the data passed in must have the same columns (features) as the data used to fit; the number of rows can differ.**
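A quick way to see this rule in action is to fit on three features and then predict with two. The sketch below uses `DecisionTreeClassifier` rather than `DummyClassifier`, because the dummy model ignores the feature values and may not complain; the random data and variable names are assumptions for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_fit = rng.normal(size=(20, 3))     # 20 rows, 3 features
y_fit = rng.integers(0, 2, size=20)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_fit, y_fit)

# A different number of rows is fine.
ok_predictions = tree.predict(rng.normal(size=(5, 3)))

# A different number of columns is not: sklearn raises a ValueError.
try:
    tree.predict(rng.normal(size=(5, 2)))
    mismatch_error = None
except ValueError as error:
    mismatch_error = error

print(len(ok_predictions), type(mismatch_error).__name__)
```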

In [None]:
# TODO: create a new column on the train dataframe that contains the model's predictions

In [15]:
train.head()

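| | passenger_id | survived | pclass | age | sibsp | parch | fare | class | alone | embarked_Q | embarked_S | sex_male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 583 | 583 | 0 | 1 | 36.0 | 0 | 0 | 40.125 | First | 1 | 0 | 0 | 1 |
| 165 | 165 | 1 | 3 | 9.0 | 0 | 2 | 20.525 | Third | 0 | 0 | 1 | 1 |
| 50 | 50 | 0 | 3 | 7.0 | 4 | 1 | 39.6875 | Third | 0 | 0 | 1 | 1 |
| 259 | 259 | 1 | 2 | 50.0 | 0 | 1 | 26.0 | Second | 0 | 0 | 1 | 0 |
| 306 | 306 | 1 | 1 | 28.0 | 0 | 0 | 110.8833 | First | 1 | 0 | 0 | 0 |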


In [16]:
train['prediction'] = model.predict(X_train)

In [17]:
train.head()

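| | passenger_id | survived | pclass | age | sibsp | parch | fare | class | alone | embarked_Q | embarked_S | sex_male | prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 583 | 583 | 0 | 1 | 36.0 | 0 | 0 | 40.125 | First | 1 | 0 | 0 | 1 | 1 |
| 165 | 165 | 1 | 3 | 9.0 | 0 | 2 | 20.525 | Third | 0 | 0 | 1 | 1 | 1 |
| 50 | 50 | 0 | 3 | 7.0 | 4 | 1 | 39.6875 | Third | 0 | 0 | 1 | 1 | 1 |
| 259 | 259 | 1 | 2 | 50.0 | 0 | 1 | 26.0 | Second | 0 | 0 | 1 | 0 | 1 |
| 306 | 306 | 1 | 1 | 28.0 | 0 | 0 | 110.8833 | First | 1 | 0 | 0 | 0 | 1 |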


In [None]:
# use the column you just created and the actual values in the survived column
# to generate a classification report

In [18]:
print(classification_report(train.survived, train.prediction))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00       307
           1       0.38      1.00      0.55       191

    accuracy                           0.38       498
   macro avg       0.19      0.50      0.28       498
weighted avg       0.15      0.38      0.21       498



In [19]:
# zero_division=1 reports undefined (0 / 0) metrics as 1.0 and silences the warning
print(classification_report(train.survived, train.prediction, zero_division=1))

              precision    recall  f1-score   support

           0       1.00      0.00      0.00       307
           1       0.38      1.00      0.55       191

    accuracy                           0.38       498
   macro avg       0.69      0.50      0.28       498
weighted avg       0.76      0.38      0.21       498



In [20]:
pd.DataFrame(classification_report(train.survived, train.prediction, output_dict=True))

| | 0 | 1 | accuracy | macro avg | weighted avg |
|---|---|---|---|---|---|
| precision | 0.0 | 0.383534 | 0.383534 | 0.191767 | 0.147098 |
| recall | 0.0 | 1.0 | 0.383534 | 0.5 | 0.383534 |
| f1-score | 0.0 | 0.554427 | 0.383534 | 0.277213 | 0.212642 |
| support | 307.0 | 191.0 | 0.383534 | 498.0 | 498.0 |


In [21]:
pd.DataFrame(classification_report(train.survived, train.prediction, output_dict=True)).transpose()

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 307.0 |
| 1 | 0.383534 | 1.0 | 0.554427 | 191.0 |
| accuracy | 0.383534 | 0.383534 | 0.383534 | 0.383534 |
| macro avg | 0.191767 | 0.5 | 0.277213 | 498.0 |
| weighted avg | 0.147098 | 0.383534 | 0.212642 | 498.0 |
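To see where the numbers in the training report come from, the metrics can be recomputed by hand from the counts. The sketch below uses only the support values shown above (307 actual 0s, 191 actual 1s) and the fact that this model predicts 1 for every row; the variable names are illustrative.

```python
# Counts for the constant-1 model on train, treating 1 as the positive class.
true_positives = 191   # actual 1, predicted 1
false_positives = 307  # actual 0, predicted 1
false_negatives = 0    # actual 1, predicted 0
true_negatives = 0     # actual 0, predicted 0
total = true_positives + false_positives + false_negatives + true_negatives

accuracy = (true_positives + true_negatives) / total
precision_1 = true_positives / (true_positives + false_positives)
recall_1 = true_positives / (true_positives + false_negatives)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 is never predicted, so its precision is 0 / 0;
# classification_report shows that as 0.0 by default (with a warning).
precision_0 = 0.0
macro_precision = (precision_0 + precision_1) / 2
weighted_precision = (307 * precision_0 + 191 * precision_1) / total

print(round(accuracy, 4), round(precision_1, 4), round(recall_1, 4), round(f1_1, 4))
print(round(macro_precision, 4), round(weighted_precision, 4))
```

The macro average treats both classes equally, while the weighted average weights each class by its support, which is why the two differ.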


## More models

Now we'll make more models. One model is a unique combination of:

- algorithm
- hyperparameters
- training data

In [22]:
model1 = DummyClassifier(strategy='constant', constant=0)
# TODO: fit the model on the training data
# 2. Fit the object
model1.fit(X_train, y_train)
# TODO: see how this model performs on train and validate
model1.score(X_train,y_train), model1.score(X_validate,y_validate)

(0.6164658634538153, 0.616822429906542)

In [23]:
model2 = DummyClassifier(strategy='uniform', random_state=0)
# TODO: fit the model on the training data
model2.fit(X_train, y_train)
# TODO: see how this model performs on train and validate
model2.score(X_train,y_train), model2.score(X_validate,y_validate)

(0.4578313253012048, 0.5)

In [None]:
# Following the pattern above, create 2 more models that vary in either hyperparameters or training data
# fit the models and view their performance

In [24]:
model3 = DummyClassifier(strategy='stratified', random_state=123)

In [25]:
model3.fit(X_train, y_train)

DummyClassifier(random_state=123, strategy='stratified')

In [26]:
model3.score(X_train,y_train), model3.score(X_validate,y_validate)

(0.5622489959839357, 0.4953271028037383)

In [27]:
model4 = DummyClassifier(strategy='uniform', random_state = 13)
model4.fit(X_train, y_train)
model4.score(X_train,y_train), model4.score(X_validate,y_validate)

(0.4939759036144578, 0.5)

In [28]:
model5 = DummyClassifier(strategy='most_frequent', random_state = 13)
model5.fit(X_train, y_train)
model5.score(X_train,y_train), model5.score(X_validate,y_validate)

(0.6164658634538153, 0.616822429906542)

What are we looking for when evaluating model performance?

- Is the model overfit? I.e., does it perform drastically better on the training data than on the validate split?
    - A train score slightly above the validate score is normal; a large gap suggests the model is overfit.
- How good or bad is the model, i.e., how does it perform
    - compared to the other models?
    - compared to the baseline model?
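These checks can be written down as a small script. The sketch below uses the train and validate accuracies from the cells above, rounded to four places. The `max_gap` tolerance is a hypothetical value picked for illustration, and the baseline is taken to be "always predict the most frequent class". For dummy models that guess at random, a train/validate gap is mostly noise, so the flags only demonstrate the mechanics.

```python
# (train accuracy, validate accuracy) copied from the cells above.
scores = {
    'model1': (0.6165, 0.6168),
    'model2': (0.4578, 0.5000),
    'model3': (0.5622, 0.4953),
    'model4': (0.4940, 0.5000),
    'model5': (0.6165, 0.6168),
}
baseline_validate = 0.6168  # always predicting the most frequent class (0)
max_gap = 0.05              # hypothetical tolerance; pick one that suits the problem

report = {}
for name, (train_acc, validate_acc) in scores.items():
    gap = train_acc - validate_acc
    flags = []
    if gap > max_gap:
        flags.append('possible overfit')
    if validate_acc < baseline_validate:
        flags.append('below baseline')
    report[name] = flags
    print(f'{name}: gap={gap:+.4f} {", ".join(flags)}')
```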

## Compare and Finalize

In [None]:
# TODO: compare the performance of your models on the validate split

`model1` is our best model, with about 62% accuracy on validate (`model5` ties it, since both always predict the most frequent class, 0).
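One possible way to do the comparison is to collect the validate accuracies into a pandas Series and sort it. The values below are copied from the outputs earlier in the notebook; the `validate_accuracy`, `ranked`, and `best_models` names are illustrative.

```python
import pandas as pd

# Validate accuracy for each model, copied from the cells above.
validate_accuracy = pd.Series({
    'model1': 0.616822,
    'model2': 0.500000,
    'model3': 0.495327,
    'model4': 0.500000,
    'model5': 0.616822,
}, name='validate_accuracy')

# Highest accuracy first.
ranked = validate_accuracy.sort_values(ascending=False)

# Keep every model that shares the top score, so ties are visible.
best_models = sorted(ranked[ranked == ranked.max()].index)

print(ranked)
print(best_models)
```

Listing ties explicitly avoids silently picking one of several equally good models.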

In [None]:
# TODO: find the performance of your best model on the test split

In [30]:
model1.score(X_test, y_test)

0.6145251396648045