# Preview

Measuring model quality is the key to iteratively improving your models.

## What is Model Validation?

In most (though not all) applications, the relevant measure of model quality is predictive accuracy. In other words, will the model's predictions be close to what actually happens?

**Mean Absolute Error**

The prediction error for each house is:

$$\text{error} = \text{actual} - \text{predicted}$$

So, if a house cost \$150,000 and you predicted it would cost \$100,000, the error is \$50,000.

With the **MAE** metric, we take the absolute value of each error, which converts each error to a non-negative number. We then take the average of those absolute errors. This is our measure of model quality. In plain English:

> On average, our predictions are off by about X.
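
As a quick check of the definition, the sketch below computes MAE by hand for three houses. Only the first pair of prices comes from the example above; the other two pairs are made-up values for illustration.

```python
# Hypothetical prices for illustration; only the first pair comes from the text.
actual = [150_000, 200_000, 320_000]
predicted = [100_000, 210_000, 290_000]

# Error for each house: actual - predicted (can be negative).
errors = [a - p for a, p in zip(actual, predicted)]

# Absolute errors drop the sign, so over- and under-predictions cannot cancel out.
absolute_errors = [abs(e) for e in errors]

# MAE is the plain average of the absolute errors.
mae = sum(absolute_errors) / len(absolute_errors)

print(errors)  # [50000, -10000, 30000]
print(mae)     # 30000.0
```

Here the plain-English reading is: on average, these three predictions are off by about 30,000.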

## Building a model to calculate its MAE

In [16]:
import pandas as pd

melbourne_file_path = 'C:/Users/AndresCervantesNassa/Documents/personal/kaggle/data/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path) 
# melbourne_data = melbourne_data.dropna(axis=0)

In [17]:
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]

y = melbourne_data.Price

In [18]:
# 1. Define model
from sklearn.tree import DecisionTreeRegressor

melbourne_model = DecisionTreeRegressor(random_state=1)

# 2. Fit model
melbourne_model.fit(X, y)

# 3. Predict on the same data the model was fit on (in-sample)
predicted_home_prices = melbourne_model.predict(X)

# 4. Evaluate model
from sklearn.metrics import mean_absolute_error

mean_absolute_error(y, predicted_home_prices)

1125.1804614629357

In [19]:
from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)

# Define model
melbourne_model = DecisionTreeRegressor()

# Fit model
melbourne_model.fit(train_X, train_y)

# get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

248994.675012273


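The MAE rose from about \$1,100 in-sample to about \$250,000 on validation data, roughly a quarter of the mean house price. The in-sample figure was misleadingly low because the tree was scored on the same rows it was trained on.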
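
The same gap can be reproduced without the Melbourne file. The sketch below is a minimal illustration on synthetic data from `make_regression`; the dataset, its size, and the `demo_` variable names are assumptions made for the example, so the numbers it prints will not match the ones above.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption for illustration, not the Melbourne data).
demo_X, demo_y = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)

demo_train_X, demo_val_X, demo_train_y, demo_val_y = train_test_split(
    demo_X, demo_y, random_state=0
)

demo_model = DecisionTreeRegressor(random_state=1)
demo_model.fit(demo_train_X, demo_train_y)

# Scored on rows the tree has already seen: an unrestricted tree can memorize them.
in_sample_mae = mean_absolute_error(demo_train_y, demo_model.predict(demo_train_X))

# Scored on held-out rows: a fairer estimate of the error on new data.
val_mae = mean_absolute_error(demo_val_y, demo_model.predict(demo_val_X))

print("In-sample MAE:", in_sample_mae)
print("Validation MAE:", val_mae)
```

The in-sample MAE should come out far smaller than the validation MAE, which is the pattern seen with the house prices.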

# Exercises

## Splitting the data into training and validation sets

In [1]:
# Previous model
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Path of the file to read
iowa_file_path = 'C:/Users/AndresCervantesNassa/Documents/personal/kaggle/data/home_data_for_ml_course.csv'
home_data = pd.read_csv(iowa_file_path)

y = home_data.SalePrice

feature_columns = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[feature_columns]

# Specify Model
iowa_model = DecisionTreeRegressor()
# Fit Model
iowa_model.fit(X, y)

print("First in-sample predictions:", iowa_model.predict(X.head()))
print("Actual target values for those homes:", y.head().tolist())

First in-sample predictions: [208500. 181500. 223500. 140000. 250000.]
Actual target values for those homes: [208500, 181500, 223500, 140000, 250000]


## Step 1: Split Your Data

In [2]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split into training and validation sets (random_state=1 makes the split reproducible)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
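
When neither `test_size` nor `train_size` is passed, as in the call above, `train_test_split` holds out 25% of the rows by default. The sketch below checks that on toy arrays; the `toy_` names and the 100-row data are hypothetical and stand in for the housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 rows with 2 feature columns each.
toy_X = np.arange(200).reshape(100, 2)
toy_y = np.arange(100)

# No test_size or train_size given, so the default 25% hold-out applies.
toy_train_X, toy_val_X, toy_train_y, toy_val_y = train_test_split(
    toy_X, toy_y, random_state=1
)
print(len(toy_train_X), len(toy_val_X))  # 75 25

# Same seed, same split: a second call selects identical validation rows.
_, toy_val_X_again, _, _ = train_test_split(toy_X, toy_y, random_state=1)
print(np.array_equal(toy_val_X, toy_val_X_again))  # True
```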

## Step 2: Specify and Fit the Model

In [3]:
from sklearn.tree import DecisionTreeRegressor

iowa_model = DecisionTreeRegressor(random_state=1)
iowa_model.fit(train_X, train_y)

DecisionTreeRegressor(random_state=1)

## Step 3: Make Predictions with Validation Data

In [4]:
val_predictions = iowa_model.predict(val_X)

Inspect your predictions and actual values from validation data.

In [5]:
# print the top few validation predictions
print(val_predictions[0:4])
# print the top few actual prices from validation data
print(val_y[0:4])

[186500. 184000. 130000.  92000.]
258    231500
267    179500
288    122000
649     84500
Name: SalePrice, dtype: int64
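
To make the comparison concrete, the sketch below recomputes the errors for just the four homes printed above. The values are copied from that output; the `shown_` variable names are introduced here for illustration, and the result describes only these four rows, not the full validation set.

```python
import numpy as np

# First four validation predictions and actual prices, copied from the output above.
shown_predictions = np.array([186500.0, 184000.0, 130000.0, 92000.0])
shown_actuals = np.array([231500, 179500, 122000, 84500])

# Signed error per home: actual - predicted.
shown_errors = shown_actuals - shown_predictions
print(shown_errors)

# Mean absolute error over these four homes only.
shown_mae = np.abs(shown_errors).mean()
print(shown_mae)  # 16250.0
```

The first home is under-predicted by 45,000 while the other three are over-predicted by 4,500 to 8,000, so a single large miss dominates the average for this small sample.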


## Step 4: Calculate the Mean Absolute Error in Validation Data