# Underfitting and Overfitting
Fine-tune your model for better performance.

A very deep tree with many leaves can lead to **overfitting**, where a model matches the training data almost perfectly but does poorly on validation and other new data.

On the flip side, if we make our tree very shallow, it doesn't divide up the houses into very distinct groups.

When a model fails to capture important distinctions and patterns in the data, it performs poorly even on the training data. This is called **underfitting**.

Since we care about accuracy on new data, which we estimate from our validation data, we want to find the sweet spot between underfitting and overfitting. Visually, we want the low point of the (red) validation curve in the figure below.

![Validation error curve with its low point between underfitting and overfitting](http://i.imgur.com/AXSEOfI.png)
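
To make the curve concrete, the sketch below fits trees of increasing size and reports MAE on both the training and validation splits. It is an illustration only: the synthetic data from `make_regression`, the noise level, the leaf counts, and the helper name `train_and_val_mae` are all assumptions, not the housing data or code used in this lesson.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the Melbourne housing data)
X_demo, y_demo = make_regression(n_samples=1000, n_features=5, noise=20.0, random_state=0)
Xtr, Xva, ytr, yva = train_test_split(X_demo, y_demo, random_state=0)

def train_and_val_mae(max_leaf_nodes):
    """Return (training MAE, validation MAE) for a tree with the given leaf limit."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(Xtr, ytr)
    return (mean_absolute_error(ytr, model.predict(Xtr)),
            mean_absolute_error(yva, model.predict(Xva)))

# Small trees tend to underfit; the largest tree here can memorize the training split
curve = {n: train_and_val_mae(n) for n in [2, 10, 50, 250, 750]}
for n, (train_mae, val_mae) in curve.items():
    print(f"leaves={n:4d}  train MAE={train_mae:8.2f}  validation MAE={val_mae:8.2f}")
```

In runs like this, training error usually keeps shrinking as leaves are added, while validation error tends to stop improving at some point. The gap between the two columns is the visible sign of overfitting.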

```python
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # Fit a tree limited to max_leaf_nodes leaves, then score it on the validation data
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae
```

```python
# Data Loading Code Runs At This Point
import pandas as pd
from sklearn.model_selection import train_test_split

# Load data
melbourne_file_path = '01_input/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
# Filter rows with missing values
filtered_melbourne_data = melbourne_data.dropna(axis=0)
# Choose target and features
# ('Lattitude' and 'Longtitude' are spelled this way in the dataset's column names)
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
                      'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]

# Split data into training and validation data, for both features and target
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
```

```python
# Compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print(f"Max leaf nodes: {max_leaf_nodes}  \t\t Mean Absolute Error:  {my_mae}")
```

```
Max leaf nodes: 5  		 Mean Absolute Error:  347380.33833344496
Max leaf nodes: 50  		 Mean Absolute Error:  258171.21202406782
Max leaf nodes: 500  		 Mean Absolute Error:  243495.96361790417
Max leaf nodes: 5000  		 Mean Absolute Error:  254983.64299548094
```

Of the options listed, 500 leaves gives the lowest validation MAE, so it is the best choice among them.
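
The same choice can be made in code rather than by eye. This is a minimal sketch: the dictionary name `scores` and the variable `best_tree_size` are hypothetical, and the values are the validation MAE figures from the output above, rounded to two decimals.

```python
# Validation MAE per max_leaf_nodes, copied from the printed output (rounded)
scores = {5: 347380.34, 50: 258171.21, 500: 243495.96, 5000: 254983.64}

# Pick the leaf count whose validation MAE is smallest
best_tree_size = min(scores, key=scores.get)
print(best_tree_size)  # prints 500
```

In a live session, the dictionary could instead be built directly from `get_mae`, for example with a dict comprehension over the candidate leaf counts.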

Here's the takeaway. Models can suffer from either:

* Overfitting: capturing spurious patterns that won't recur in the future, leading to less accurate predictions, or
* Underfitting: failing to capture relevant patterns, again leading to less accurate predictions.

We use validation data, which isn't used in model training, to measure a candidate model's accuracy. This lets us try many candidate models and keep the best one.
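
The same pattern works for any set of candidate models, not only different leaf counts. The sketch below is illustrative: the helper `pick_best`, the synthetic data, and the choice of `max_depth` values as candidates are assumptions, not part of this lesson's code.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def pick_best(candidates, train_X, val_X, train_y, val_y):
    """Fit every candidate on the training split and score it on the validation split.

    Returns the name of the lowest-MAE candidate and the full score table.
    """
    scores = {}
    for name, model in candidates.items():
        model.fit(train_X, train_y)
        scores[name] = mean_absolute_error(val_y, model.predict(val_X))
    return min(scores, key=scores.get), scores

# Synthetic stand-in data (an assumption; swap in your own X and y)
X_demo, y_demo = make_regression(n_samples=500, n_features=4, noise=15.0, random_state=1)
split = train_test_split(X_demo, y_demo, random_state=1)  # train_X, val_X, train_y, val_y

# Hypothetical candidates: trees that differ only in their depth limit
candidates = {
    f"tree_depth_{d}": DecisionTreeRegressor(max_depth=d, random_state=0)
    for d in (2, 4, 8, 16)
}
best_name, all_scores = pick_best(candidates, *split)
print(best_name, all_scores[best_name])
```

Because the validation rows are never passed to `fit`, each score reflects how a candidate handles data it has not seen, which is what makes the comparison fair.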