<center><img src=https://raw.githubusercontent.com/UncertainQubit/firstrepo/master/GoldilocksAI.png></center>

Much of machine learning is trying to *optimize* models, or find the perfect fit.
<center> <img src =https://raw.githubusercontent.com/UncertainQubit/firstrepo/master/OverfitAI.png></center>

When designing models, ML engineers look for the right fit: the lowest point on the red curve, which marks the lowest average error for their model. Optimizing ML models is important because, just like ill-fitting clothes or uncomfortable beds, a model that isn't properly optimized doesn't do its job very well. One of the important skills for an ML engineer is therefore being able to identify inaccuracies in ML models and tune them to improve performance and find the perfect fit.

## Improving our model...
We got pretty good accuracy on our model. But since when do we settle for "pretty good"? In AI, it's all about optimization. You'll learn about hyperparameter tuning as a way to decrease overfitting.

In [18]:
#Initial Setup Code
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

file_path = 'https://raw.githubusercontent.com/UncertainQubit/firstrepo/master/train.csv'

titanic_data = pd.read_csv(file_path)

# Encode Sex numerically: male -> 0, female -> 1
titanic_data['Sex'] = titanic_data['Sex'].map({'male': 0, 'female': 1})

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch']

titanic_data = titanic_data[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch']].dropna()

y = titanic_data.Survived

X = titanic_data[features]

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(max_leaf_nodes = 10000)
model.fit(train_X, train_y)

DecisionTreeClassifier(max_leaf_nodes=10000)

## Identifying overfitting
Overfitting is when a model is fit so closely to its training data that it fails to generalize, giving it high accuracy on training data but low accuracy on data it hasn't seen before. As you can see below, this model is overfit: training accuracy is about 94%, but validation accuracy is only about 77%.

In [19]:
# Make predictions on training and validation data and calculate accuracy
accuracy = model.score(train_X, train_y)
print('Training Accuracy: {0:%}'.format(accuracy))
val_accuracy = model.score(val_X, val_y)
print('Validation Accuracy: {0:%}'.format(val_accuracy))

Training Accuracy: 94.018692%
Validation Accuracy: 77.094972%
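
A quick way to put a number on overfitting is the gap between training and validation accuracy. The sketch below is only an illustration: `generalization_gap` is a hypothetical helper (not part of scikit-learn), and the two accuracies are copied from the output above as fractions.

```python
def generalization_gap(train_accuracy, val_accuracy):
    """Return training accuracy minus validation accuracy (both as fractions)."""
    return train_accuracy - val_accuracy

# Accuracies printed by the cell above, written as fractions.
train_acc = 0.94018692
val_acc = 0.77094972

gap = generalization_gap(train_acc, val_acc)
print('Gap: {0:.1f} percentage points'.format(gap * 100))
```

A gap near zero suggests the model generalizes about as well as it fits; a large positive gap, like this one, is a sign of overfitting.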


## How do we reduce overfitting?
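One method is hyperparameter tuning: changing the settings that control how a model is built, as opposed to the values it learns from data. At industrial scale, thousands of settings may be tuned automatically, but today we'll change just two in our model: the maximum number of leaf nodes (max_leaf_nodes) and the minimum number of samples needed to split a node (min_samples_split).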

In [20]:
# interact builds the slider UI; widgets provides the IntSlider controls
from ipywidgets import interact
import ipywidgets as widgets
def test(leafs, minsplit):
    model = DecisionTreeClassifier(max_leaf_nodes = leafs, min_samples_split = minsplit, random_state = 0)
    model.fit(train_X, train_y)
    accuracy = model.score(val_X, val_y)
    print('Accuracy: {0:%}'.format(accuracy))
style = {'description_width': 'initial'}
interact(
    test, leafs=widgets.IntSlider(min=2, max=1000, step=10, description='Number of leaves:', style=style, readout=True, continuous_update=True),
    minsplit=widgets.IntSlider(min=2, max=100, step=1, description='Minimum samples:', style=style, readout=True, continuous_update=True)
);

interactive(children=(IntSlider(value=2, description='Number of leaves:', max=1000, min=2, step=10, style=Slid…

## How can we make this easier?
We don't have to do everything manually. With a little bit of code, we can write a function that finds the maximum number of leaf nodes that gives the highest validation accuracy.

In [21]:
def score(leaf_size):
    model = DecisionTreeClassifier(max_leaf_nodes = leaf_size, min_samples_split = 2, random_state = 0)
    model.fit(train_X, train_y)
    return model.score(val_X, val_y)

In [22]:
candidate_max_leaf_nodes = list(range(2, 1000))
scores = {leaf_size: score(leaf_size) for leaf_size in candidate_max_leaf_nodes}
best_tree_size = max(scores, key=scores.get)
print('Best max_leaf_nodes = ' + str(best_tree_size))

Best max_leaf_nodes = 20
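
The search above varies only max_leaf_nodes, but the same idea extends to both hyperparameters at once. The following is a minimal sketch under stated assumptions: `grid_search` and `toy_score` are hypothetical names, and `toy_score` is a made-up stand-in for a function that would train a tree and return its validation accuracy.

```python
from itertools import product

def grid_search(score_fn, leaf_options, split_options):
    """Try every (leafs, minsplit) pair; return the best pair and its score."""
    best_pair, best_score = None, float('-inf')
    for leafs, minsplit in product(leaf_options, split_options):
        current = score_fn(leafs, minsplit)
        # Strict '>' keeps the first pair seen when several scores tie.
        if current > best_score:
            best_pair, best_score = (leafs, minsplit), current
    return best_pair, best_score

# Hypothetical stand-in for "train a tree, return validation accuracy".
# It simply peaks at leafs=20, minsplit=10 so the search has something to find.
def toy_score(leafs, minsplit):
    return 1.0 - abs(leafs - 20) / 100 - abs(minsplit - 10) / 100

best_pair, best_score = grid_search(toy_score, range(2, 50), range(2, 30))
print('Best (leafs, minsplit):', best_pair)
```

In this notebook, `score_fn` could be a function like `test` that returns the accuracy instead of printing it. The tie-breaking rule matters here too: `max(scores, key=scores.get)` also returns the first best key it encounters, which for an ascending range is the smallest tree among any tied candidates.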


## How much have we improved?
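Now that we've addressed some of the overfitting in our model, let's train on ALL the data, not just the training split, using our optimal number of maximum leaf nodes. Keep in mind that the validation rows are now part of the training data, so the score below is more optimistic than it would be on truly unseen data.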

In [23]:
final_model = DecisionTreeClassifier(max_leaf_nodes = best_tree_size, min_samples_split = 2, random_state = 0)
final_model.fit(X,y)

final_accuracy = final_model.score(val_X, val_y)
print('Validation Accuracy: {0:%}'.format(final_accuracy))

Validation Accuracy: 86.592179%


In [24]:
print('Improvement in Accuracy: {0:%}'.format(final_accuracy - val_accuracy))

Improvement in Accuracy: 9.497207%
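
Because final_model was fit on X and y, which include the validation rows, its validation score is measured partly on data it has already seen. The sketch below shows why that matters, using a deliberately extreme toy: `Memorizer` is a hypothetical classifier (not part of scikit-learn) that simply recalls the rows it was trained on, and the rows and labels are made up rather than taken from the Titanic data.

```python
from collections import Counter

class Memorizer:
    """Toy classifier: recalls labels of rows it has seen, else guesses the majority label."""

    def fit(self, rows, labels):
        self.seen = dict(zip(rows, labels))
        self.default = Counter(labels).most_common(1)[0][0]
        return self

    def score(self, rows, labels):
        hits = sum(self.seen.get(row, self.default) == label
                   for row, label in zip(rows, labels))
        return hits / len(labels)

# Made-up (Pclass, Sex) rows and labels, for illustration only.
train_rows = [(1, 1), (3, 0), (2, 0), (3, 1)]
train_labels = [1, 0, 0, 0]
val_rows = [(2, 1), (1, 0)]
val_labels = [1, 0]

# Fit on the training split only, then on training plus validation rows.
held_out = Memorizer().fit(train_rows, train_labels).score(val_rows, val_labels)
leaky = Memorizer().fit(train_rows + val_rows,
                        train_labels + val_labels).score(val_rows, val_labels)
print('Held-out validation accuracy: {0:%}'.format(held_out))
print('Accuracy after training on validation rows: {0:%}'.format(leaky))
```

A decision tree limited to a small number of leaves memorizes far less than this toy does, but the direction of the effect is the same. For a like-for-like comparison with the earlier validation accuracy, one option is to fit the tuned tree on train_X and train_y only and score it on val_X and val_y.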


## Wow!
That's a pretty significant improvement. You're well on your way to designing professional AI models.