# Preview

Measuring model quality is the key to iteratively improving your models.

## What Is Model Validation?

You'll want to evaluate almost every model you ever build. In most (though not all) applications, the relevant measure of model quality is *predictive accuracy*. In other words, will the model's predictions be close to what actually happens?

There are many metrics for summarizing model quality, but we'll start with **Mean Absolute Error (also called MAE)**.

**Mean Absolute Error**

The prediction error for each house is:

`error = actual - predicted`

So, if a house cost \\$150,000 and you predicted it would cost \\$100,000, the error is \\$50,000.

With the **MAE** metric, we take the absolute value of each error, which converts each error to a positive number. We then take the average of those absolute errors. This is our measure of model quality. In plain English:

> On average, our predictions are off by about X.
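
As a minimal sketch of that calculation, the snippet below computes MAE by hand and compares it with scikit-learn's `mean_absolute_error`. The three prices are made-up values chosen for illustration; they do not come from the Melbourne data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical prices for three houses (illustrative values only)
actual = np.array([150_000, 200_000, 120_000])
predicted = np.array([100_000, 210_000, 120_000])

# Signed error for each house: actual - predicted
errors = actual - predicted

# MAE: average of the absolute errors
manual_mae = np.abs(errors).mean()

# scikit-learn's implementation should agree with the manual calculation
sklearn_mae = mean_absolute_error(actual, predicted)

print("Manual MAE: ", manual_mae)
print("sklearn MAE:", sklearn_mae)
```

Taking the absolute value matters because the first two errors have opposite signs; averaging the signed errors would let them partly cancel out.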

## Building a Model to Calculate Its MAE

In [6]:
import pandas as pd

melbourne_file_path = 'C:/Users/AndresCervantesNassa/Documents/GitHub/kaggle-courses/intro_to_machine_learning/data/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path) 
filtered_melbourne_data = melbourne_data.dropna(axis=0)

In [10]:
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]

y = filtered_melbourne_data.Price

In [11]:
# 1. Define model
from sklearn.tree import DecisionTreeRegressor

melbourne_model = DecisionTreeRegressor(random_state=1)

# 2. Fit model
melbourne_model.fit(X, y)

# 3. Predict
predicted_home_prices = melbourne_model.predict(X)

# 4. Evaluate model
from sklearn.metrics import mean_absolute_error

mean_absolute_error(y, predicted_home_prices)

434.71594577146544

## The Problem with "In-Sample" Scores

**We used a single "sample" of houses for both building the model and evaluating it**. Here's why this is bad.

Imagine that, in the large real estate market, door color is unrelated to home price.

However, in the sample of data you used to build the model, all homes with green doors were very expensive. The model's job is to find patterns that predict home prices, so it will see this pattern, and it will always predict high prices for homes with green doors.

Since this pattern was derived from the training data, the model will appear accurate on the training data.

But if this pattern doesn't hold when the model sees new data, the model would be very inaccurate when used in practice.
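
The effect can be reproduced with a small sketch on synthetic data. Everything here is an assumption made for illustration: the single feature is a shuffled ID number that, like door color in the example, has no real relationship with price, and the prices are random draws rather than real sales.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data: a feature that is unrelated to price
# (200 distinct ID numbers in random order) and random prices.
unrelated_feature = rng.permutation(200).reshape(-1, 1).astype(float)
price = rng.normal(loc=500_000, scale=100_000, size=200)

train_X, val_X, train_y, val_y = train_test_split(
    unrelated_feature, price, random_state=0
)

# An unconstrained tree can memorize every training row
model = DecisionTreeRegressor(random_state=0)
model.fit(train_X, train_y)

in_sample_mae = mean_absolute_error(train_y, model.predict(train_X))
validation_mae = mean_absolute_error(val_y, model.predict(val_X))

print("In-sample MAE: ", in_sample_mae)
print("Validation MAE:", validation_mae)
```

Because the tree memorizes the training rows, the in-sample error is essentially zero even though the feature carries no information, while the error on held-out rows is large.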

Since a model's practical value comes from making predictions on new data, **we measure performance on data that wasn't used to build the model. The most straightforward way to do this is to exclude some data from the model-building process**, and then use that data to test the model's accuracy on data it hasn't seen before. This held-out data is called **validation data**.

The scikit-learn library has a function, `train_test_split`, that breaks the data into two pieces. We'll use one piece as training data to fit the model and the other as validation data to calculate `mean_absolute_error`.
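
Before applying it to the housing data, here is a toy sketch of what `train_test_split` returns. The arrays `X_toy` and `y_toy` are made-up names and values used only for illustration. When neither `test_size` nor `train_size` is passed, scikit-learn holds out 25% of the rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows with one feature each (illustrative only).
# Each target equals its feature, so row alignment is easy to check.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.arange(100)

# random_state makes the shuffle reproducible from run to run
train_X_toy, val_X_toy, train_y_toy, val_y_toy = train_test_split(
    X_toy, y_toy, random_state=0
)

print("Training rows:  ", len(train_X_toy))
print("Validation rows:", len(val_X_toy))

# Features and targets are shuffled together, so they stay paired
print("Rows still aligned:", bool((train_X_toy[:, 0] == train_y_toy).all()))
```

The function returns the pieces in the order training features, validation features, training targets, validation targets, which is why the unpacking order in the cell below matters.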

In [12]:
from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)

# 1/4 Define
melbourne_model = DecisionTreeRegressor()

# 2/4 Fit
melbourne_model.fit(train_X, train_y)

# 3/4 Predict
val_predictions = melbourne_model.predict(val_X)

# 4/4 Evaluate
print(mean_absolute_error(val_y, val_predictions))

263612.540348612


The MAE increased from about \\$435 in-sample to more than \\$260,000 on the validation data! The error on new data is about a quarter of the average home value.
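
One way to put an MAE in context is to divide it by the average of the actual values. The helper below, `mae_as_fraction_of_mean`, is a hypothetical name introduced for this sketch, and the toy prices are made up; with the notebook's variables, the same call would take `val_y` and `val_predictions`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error


def mae_as_fraction_of_mean(y_true, y_pred):
    """Return the MAE divided by the mean of the true values."""
    return mean_absolute_error(y_true, y_pred) / np.mean(y_true)


# Toy values (illustrative only): every prediction is off by 50,
# and the average actual value is 200.
toy_actual = [100, 200, 300]
toy_predicted = [150, 150, 350]

ratio = mae_as_fraction_of_mean(toy_actual, toy_predicted)
print(f"MAE is {ratio:.0%} of the average actual value")
```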

# Exercises

## Data split into train and test

In [13]:
# Previous model
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Path of the file to read
iowa_file_path = 'C:/Users/AndresCervantesNassa/Documents/GitHub/kaggle-courses/intro_to_machine_learning/data/home_data_for_ml_course.csv'
home_data = pd.read_csv(iowa_file_path)

y = home_data.SalePrice

feature_columns = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[feature_columns]

# 1/4 define
iowa_model = DecisionTreeRegressor()
# 2/4 fit
iowa_model.fit(X, y)
# 3/4 predict
print("First in-sample predictions:", iowa_model.predict(X.head()))
print("Actual target values for those homes:", y.head().tolist())

First in-sample predictions: [208500. 181500. 223500. 140000. 250000.]
Actual target values for those homes: [208500, 181500, 223500, 140000, 250000]


## Step 1: Split Your Data

In [14]:
from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

## Step 2: Specify and Fit the Model

In [15]:
from sklearn.tree import DecisionTreeRegressor

# 1/4 define
iowa_model = DecisionTreeRegressor(random_state=1)
# 2/4 fit
iowa_model.fit(train_X, train_y)

DecisionTreeRegressor(random_state=1)

## Step 3: Make Predictions with Validation Data

In [16]:
# 3/4 predict
val_predictions = iowa_model.predict(val_X)

Inspect your predictions and actual values from validation data.

In [5]:
# print the top few validation predictions
print(val_predictions[0:4])
# print the top few actual prices from validation data
print(val_y[0:4])

[186500. 184000. 130000.  92000.]
258    231500
267    179500
288    122000
649     84500
Name: SalePrice, dtype: int64


What do you notice that is different from what you saw with in-sample predictions (which are printed after the top code cell in the Exercises section of this page)?

*The new predictions are not as accurate as the in-sample predictions.*

Do you remember why validation predictions differ from in-sample (or training) predictions? This is an important idea from the last lesson.

*The model was built using the in-sample (training) data, which is why its predictions for those homes match the actual prices perfectly. The validation data, on the other hand, was never seen by the model during fitting.*

## Step 4: Calculate the Mean Absolute Error in Validation Data

In [17]:
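# 4/4 validate
from sklearn.metrics import mean_absolute_error

# Arguments follow scikit-learn's (y_true, y_pred) order;
# MAE is symmetric, so the result is the same either way
val_mae = mean_absolute_error(val_y, val_predictions)
val_mae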

29652.931506849316

# End