% 30 Days of Kaggle - Day 10: [Over-Fitting and Under-Fitting](https://www.kaggle.com/dansbecker/underfitting-and-overfitting).

Now that I can create models I need to be able to evaluate their accuracy.

I calculated mean absolute error in the last notebook using sklearn.

$$MAE = \frac{1}{N} \sum_{i=1}^{N} | predicted_i - actual_i |$$
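
To make the formula concrete, here is a minimal sketch that computes MAE by hand and checks it against sklearn. The three prices are made-up numbers for illustration, not values from the housing data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up prices, for illustration only
actual = np.array([200_000, 350_000, 500_000])
predicted = np.array([210_000, 330_000, 560_000])

# MAE by hand: the average of the absolute differences
mae_by_hand = np.mean(np.abs(predicted - actual))

# The same value from sklearn (argument order is y_true, y_pred)
mae_sklearn = mean_absolute_error(actual, predicted)

print(mae_by_hand, mae_sklearn)
```

The absolute errors are 10,000, 20,000 and 60,000, so both calculations give 30,000.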

The lesson notes give a great explanation of under- and over-fitting:

![Under-And-Over-Fitting](../images/30-days-of-kaggle/under-and-overfitting.png 'Under-And-Over-Fitting')

They use the example of housing data. Decision tree depth is the variable to watch. A binary tree of depth n has at most 2^n leaf nodes. If n is too small the model may under-fit, because each leaf averages over many dissimilar houses. If n is too large we eventually end up with one house in each leaf, and the model over-fits. The sweet spot is somewhere in between, and we find it by measuring the error on validation data rather than on the training data.
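
The effect of depth can be seen on a small synthetic problem. This is only a sketch: the noisy sine curve and the list of depths are my own assumptions for illustration, not part of the lesson. It prints the depth limit, the number of leaves, and the training and validation MAE for each tree.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=400)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

results = []
for depth in [1, 3, 6, None]:  # None lets the tree grow without a depth limit
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(train_X, train_y)
    train_mae = mean_absolute_error(train_y, model.predict(train_X))
    val_mae = mean_absolute_error(val_y, model.predict(val_X))
    # Record (depth limit, leaf count, training MAE, validation MAE)
    results.append((depth, model.get_n_leaves(), train_mae, val_mae))
    print(depth, model.get_n_leaves(), round(train_mae, 3), round(val_mae, 3))
```

The leaf count never exceeds 2^depth, and the unrestricted tree drives the training MAE close to zero because nearly every training row gets its own leaf. That near-zero training error says nothing about new data, which is why the validation MAE is the number to compare.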

This utility function compares MAE for different values of max_leaf_nodes:

In [10]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes = max_leaf_nodes, random_state = 0)
    model.fit(train_X, train_y)
    predictions_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, predictions_val)
    return mae

These cells repeat the earlier calculations for the Melbourne housing data:

In [2]:
# Data Loading Code Runs At This Point
import pandas as pd

# Load data
melbourne_file_path = '../datasets/kaggle/melbourne-house-prices/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
# Filter rows with missing values
filtered_melbourne_data = melbourne_data.dropna(axis=0)
# Choose target and features
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
                      'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]


Split data into training and validation sets:

In [3]:
from sklearn.model_selection import train_test_split

# split data into training and validation data, for both features and target
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
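
As a quick check of what train_test_split does when no sizes are given, here is a small sketch on toy arrays. The 100-row arrays are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 100 rows, 2 features
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# With no test_size given, sklearn holds out 25% of the rows
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
print(len(train_X), len(val_X))

# The same random_state reproduces the same split
train_X2, val_X2, _, _ = train_test_split(X, y, random_state=0)
same_split = np.array_equal(val_X, val_X2)
print(same_split)
```

With 100 rows this gives 75 training rows and 25 validation rows, and fixing random_state makes the split repeatable between runs.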

Now let's calculate MAE with differing values of max_leaf_nodes:

In [5]:
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %4d \t\t Mean Absolute Error: %d" %(max_leaf_nodes, my_mae))

Max leaf nodes:    5 		 Mean Absolute Error: 347380
Max leaf nodes:   50 		 Mean Absolute Error: 258171
Max leaf nodes:  500 		 Mean Absolute Error: 243495
Max leaf nodes: 5000 		 Mean Absolute Error: 254983


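Exercises: do the same thing with the Iowa housing model. The Iowa data needs its own train/validation split before fitting; without it, train_X, val_X, train_y and val_y still hold the Melbourne split. The outputs shown below were recorded without that split, which is why they repeat the Melbourne numbers.

In [9]:
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

iowa_file_path = '../datasets/kaggle/iowa-house-prices/train.csv'
home_data = pd.read_csv(iowa_file_path)
y = home_data.SalePrice
feature_columns = ['Lot Area', 'Year Built', '1st Flr SF', '2nd Flr SF', 'Bedroom AbvGr', 'TotRms AbvGrd']
X = home_data[feature_columns]
# Split the Iowa data; otherwise train_X and val_X still refer to the Melbourne split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
iowa_model = DecisionTreeRegressor(random_state=1)
iowa_model.fit(train_X, train_y)
val_predictions = iowa_model.predict(val_X)
# mean_absolute_error expects (y_true, y_pred)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE: {:,.0f}".format(val_mae))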

Validation MAE: 262,494


In [13]:
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500, 600, 700, 800, 900, 1000]
for max_leaf_nodes in candidate_max_leaf_nodes:
    mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %4d \t\t Mean Absolute Error: %d" %(max_leaf_nodes, mae))

Max leaf nodes:    5 		 Mean Absolute Error: 347380
Max leaf nodes:   25 		 Mean Absolute Error: 271044
Max leaf nodes:   50 		 Mean Absolute Error: 258171
Max leaf nodes:  100 		 Mean Absolute Error: 248734
Max leaf nodes:  250 		 Mean Absolute Error: 247206
Max leaf nodes:  500 		 Mean Absolute Error: 243495
Max leaf nodes:  600 		 Mean Absolute Error: 243951
Max leaf nodes:  700 		 Mean Absolute Error: 242954
Max leaf nodes:  800 		 Mean Absolute Error: 244042
Max leaf nodes:  900 		 Mean Absolute Error: 246292
Max leaf nodes: 1000 		 Mean Absolute Error: 247345
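
Rather than reading the best value off the printed table, the minimum can be picked programmatically. This is a sketch under stated assumptions: it uses synthetic data from make_regression as a stand-in for the housing data, and the names scores and best_size are my own.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # Same helper as above: fit on training data, score on validation data
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))

# Synthetic stand-in data (an assumption for illustration)
X, y = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

candidates = [5, 25, 50, 100, 250]
# Map each candidate tree size to its validation MAE
scores = {n: get_mae(n, train_X, val_X, train_y, val_y) for n in candidates}
# The key with the smallest MAE is the best tree size
best_size = min(scores, key=scores.get)
print(best_size, round(scores[best_size], 1))
```

Keeping the scores in a dict also makes it easy to see how close the runners-up are before committing to one size.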


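The sweep bottoms out between 500 and 800 leaf nodes, with the lowest MAE at 700; the differences in that range are under 1%. The final model below uses 500 leaf nodes and is fit on all the data, since the validation set is no longer needed once the tree size is chosen.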

In [15]:
final_model = DecisionTreeRegressor(max_leaf_nodes=500)
final_model.fit(X, y)
print(final_model)

DecisionTreeRegressor(max_leaf_nodes=500)
