You've built a model. But how good is it?

In this lesson, you will learn to use model validation to measure the quality of your model. Measuring model quality is the key to iteratively improving your models.

# What is Model Validation

You'll want to evaluate almost every model you ever build. In most (though not all) applications, the relevant measure of model quality is predictive accuracy. In other words, will the model's predictions be close to what actually happens?

Many people make a huge mistake when measuring predictive accuracy. They make predictions with their *training data* and compare those predictions to the target values in the *training data*. You'll see the problem with this approach and how to solve it in a moment, but let's think about how we'd do this first.

You'd first need to summarize the model quality in an understandable way. If you compare predicted and actual home values for 10,000 houses, you'll likely find a mix of good and bad predictions. Looking through a list of 10,000 predicted and actual values would be pointless. We need to summarize this into a single metric.

There are many metrics for summarizing model quality, but we'll start with one called **Mean Absolute Error** (also called **MAE**). Let's break down this metric starting with the last word, error.

The prediction error for each house is:
```
error=actual−predicted
```
 
So, if a house cost \$150,000 and you predicted it would cost \$100,000 the error is \$50,000.

With the MAE metric, we take the absolute value of each error. This converts each error to a positive number. We then take the average of those absolute errors. This is our measure of model quality. In plain English, it can be read as:

> On average, our predictions are off by about X.
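
To make the arithmetic concrete, here is a minimal sketch that computes MAE by hand for five houses. The prices are made-up numbers chosen for illustration (the first house matches the \$150,000 / \$100,000 example above); they do not come from the Melbourne data.

```python
# Hypothetical actual and predicted prices for five houses (made-up numbers).
actual = [150_000, 320_000, 275_000, 410_000, 198_000]
predicted = [100_000, 335_000, 275_000, 380_000, 203_000]

# error = actual - predicted, one value per house (can be negative).
errors = [a - p for a, p in zip(actual, predicted)]

# Absolute values keep over- and under-predictions from cancelling out.
absolute_errors = [abs(e) for e in errors]

# MAE is the plain average of the absolute errors.
mae = sum(absolute_errors) / len(absolute_errors)

print(errors)  # per-house errors, some negative
print(mae)     # average size of the miss, in dollars
```

Averaging the raw errors instead would let the negative errors offset the positive ones and understate how far off the predictions are, which is why the absolute value is taken first.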

To calculate MAE, we first need a model. The cell below loads the data and builds one.

In [1]:
# Data loading code (hidden in the original notebook)
import pandas as pd

# Load data
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
# Drop rows with missing values
filtered_melbourne_data = melbourne_data.dropna(axis=0)
# Choose target and features
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
                        'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]

from sklearn.tree import DecisionTreeRegressor
# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(X, y)

DecisionTreeRegressor()

Once we have a model, here is how we calculate the mean absolute error:

In [2]:
from sklearn.metrics import mean_absolute_error

predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

434.71594577146544

# The Problem with "In-Sample" Scores

The measure we just computed can be called an "in-sample" score. We used a single "sample" of houses for both building the model and evaluating it. Here's why this is bad.

Imagine that, in the large real estate market, door color is unrelated to home price. 

However, in the sample of data you used to build the model, all homes with green doors were very expensive. The model's job is to find patterns that predict home prices, so it will see this pattern, and it will always predict high prices for homes with green doors.

Since this pattern was derived from the training data, the model will appear accurate in the training data.

But if this pattern doesn't hold when the model sees new data, the model would be very inaccurate when used in practice.

Since a model's practical value comes from making predictions on new data, we measure performance on data that wasn't used to build the model. The most straightforward way to do this is to exclude some data from the model-building process, and then use that data to test the model's accuracy on data it hasn't seen before. This held-out data is called **validation data**.
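
The green-door story can be sketched with a toy model in plain Python. Everything here is hypothetical: the prices are invented, and `fit_color_model` and `color_model_mae` are illustrative helpers, not scikit-learn functions. The "model" simply memorizes the average training price for each door color.

```python
# Made-up training sample: by coincidence, every green-door home is expensive.
training_homes = [
    ("green", 1_500_000), ("green", 1_600_000),
    ("red", 600_000), ("red", 700_000),
    ("blue", 650_000), ("blue", 750_000),
]
# Made-up new homes: here door color tells us nothing about price.
new_homes = [
    ("green", 640_000), ("green", 710_000),
    ("red", 1_450_000), ("blue", 680_000),
]

def fit_color_model(homes):
    """'Learn' the average training price for each door color."""
    prices_by_color = {}
    for color, price in homes:
        prices_by_color.setdefault(color, []).append(price)
    return {c: sum(p) / len(p) for c, p in prices_by_color.items()}

def color_model_mae(model, homes):
    """Average absolute gap between actual price and the model's prediction."""
    return sum(abs(price - model[color]) for color, price in homes) / len(homes)

model = fit_color_model(training_homes)
in_sample_mae = color_model_mae(model, training_homes)
validation_mae = color_model_mae(model, new_homes)
print(in_sample_mae, validation_mae)
```

On the homes it was built from, the toy model looks accurate. On the new homes, where the door-color pattern no longer holds, its error is more than ten times larger.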


# Coding It


The scikit-learn library has a function `train_test_split` to break up the data into two pieces. We'll use some of that data as training data to fit the model, and we'll use the other data as validation data to calculate `mean_absolute_error`.
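
As a rough picture of what a random split does, here is a pure-Python sketch. It is not scikit-learn's implementation: `simple_holdout_split`, its parameters, and the `houses` list are hypothetical names used only for illustration. The 25% holdout mirrors scikit-learn's documented default `test_size`, and the seed plays the role of `random_state`.

```python
import random

def simple_holdout_split(rows, validation_fraction=0.25, seed=0):
    """Shuffle row indices, then hold out a fraction of them for validation."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_val = int(round(len(rows) * validation_fraction))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return [rows[i] for i in train_idx], [rows[i] for i in val_idx]

# Eight placeholder rows standing in for a real dataset.
houses = [f"house_{i}" for i in range(8)]
train_rows, val_rows = simple_holdout_split(houses)
print(len(train_rows), len(val_rows))
```

Every row lands in exactly one of the two pieces, so the model never sees the validation rows during fitting.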

Here is the code:

In [3]:
from sklearn.model_selection import train_test_split

# Split data into training and validation data, for both features and target.
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(train_X, train_y)

# get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

265806.91478373145


# Wow!

Your mean absolute error for the in-sample data was about 435 dollars. Out-of-sample, it is more than 265,000 dollars.

This is the difference between a model that is almost exactly right, and one that is unusable for most practical purposes.  As a point of reference, the average home value in the validation data is 1.1 million dollars.  So the error in new data is about a quarter of the average home value. 
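
The "about a quarter" figure is a one-line calculation. The sketch below uses the validation MAE printed earlier and the approximate average home value quoted in the text; the variable names are illustrative.

```python
validation_mae = 265_806.91      # validation MAE from the run above (rounded)
average_home_value = 1_100_000   # approximate average quoted in the text

# Express the typical miss as a share of a typical home's value.
relative_error = validation_mae / average_home_value
print(f"{relative_error:.1%}")
```

The result is roughly 24%, in line with the "about a quarter" description.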

There are many ways to improve this model, such as experimenting to find better features or different model types. 

# Your Turn
Before we look at improving this model, try **[Model Validation](https://www.kaggle.com/kernels/fork/1259097)** for yourself.

---




*Have questions or comments? Visit the [course discussion forum](https://www.kaggle.com/learn/intro-to-machine-learning/discussion) to chat with other learners.*