# Model Validation <br>
Model validation means measuring the quality of our model. There are many metrics for summarizing model quality, but we'll start with one called **Mean Absolute Error** (MAE). <br>

 Mean Absolute Error (MAE) is a measure of the average size of the mistakes in a collection of predictions, without taking their direction into account. It is measured as the average absolute difference between the predicted values and the actual values and is used to assess the effectiveness of a regression model. <br>

<p align="center">
    <img width="500" height="200" src="mae_.png" alt="Mean Absolute Error (MAE)">
</p>
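
As a quick check of the definition, MAE can be computed by hand. The sketch below uses a hypothetical helper, `mean_absolute_error_by_hand`, and three made-up prices. It only illustrates the arithmetic and is not meant to replace scikit-learn's `mean_absolute_error`.

```python
def mean_absolute_error_by_hand(actual, predicted):
    """Average of |actual - predicted| over all pairs (hypothetical helper)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up prices, for illustration only.
example_actual = [300_000, 450_000, 500_000]
example_predicted = [310_000, 430_000, 500_000]

# Errors are 10,000 (too high), 20,000 (too low) and 0; the sign is ignored.
print(mean_absolute_error_by_hand(example_actual, example_predicted))  # 10000.0
```

Because of the absolute value, an over-prediction and an under-prediction of the same size count equally and do not cancel each other out.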

In [14]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

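# Load the data; a forward slash avoids the invalid "\m" escape and works on any OS
melb_data = pd.read_csv("./melb_data.csv")
# Drop rows that contain missing values
filtered_data = melb_data.dropna(axis=0)
# Prediction target
y = filtered_data.Price
# Feature columns ('Lattitude' and 'Longtitude' are spelled this way in the CSV)
melb_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_data[melb_features]

# Fit a decision tree; random_state makes the result reproducible
melbourne_model = DecisionTreeRegressor(random_state=5)
melbourne_model.fit(X, y)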

This is how we calculate the mean absolute error:

In [15]:
from sklearn.metrics import mean_absolute_error

predicted_price = melbourne_model.predict(X)
mean_absolute_error(y, predicted_price)

434.71594577146544

# Validation Data <br>
When we build a model, we want it to generalize well to new, unseen data. The score above was computed on the same data that was used to fit the model, so it can look far better than the model really is: the model may have picked up patterns that hold only in the training data and will not recur in new data. To estimate how well the model will perform on new data, we need to evaluate it on data that was not used during the model-building process. This held-out data is called validation data.
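
To see why a score computed on the training data can mislead, consider a toy predictor that simply memorizes its training set. The sketch below uses made-up numbers and a 1-nearest-neighbour rule as a stand-in for an unconstrained tree. The helper names are hypothetical and the example is an illustration, not the notebook's model.

```python
def nearest_neighbour_predict(train_x, train_y, query_x):
    """Predict each query by copying the target of the closest training point."""
    predictions = []
    for q in query_x:
        closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - q))
        predictions.append(train_y[closest])
    return predictions


def mae(actual, predicted):
    """Mean absolute error of two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up land sizes and prices (in thousands), for illustration only.
toy_train_x, toy_train_y = [100, 200, 300, 400], [210, 290, 420, 480]
toy_val_x, toy_val_y = [120, 260, 390], [230, 380, 470]

toy_train_mae = mae(toy_train_y, nearest_neighbour_predict(toy_train_x, toy_train_y, toy_train_x))
toy_val_mae = mae(toy_val_y, nearest_neighbour_predict(toy_train_x, toy_train_y, toy_val_x))

print(toy_train_mae)                # 0.0: each training point is its own nearest neighbour
print(toy_val_mae > toy_train_mae)  # True: unseen points are not reproduced exactly
```

A perfect score on the training data says only that the model can reproduce what it has already seen.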



In [16]:
from sklearn.model_selection import train_test_split

train_test_split splits the data into two separate subsets: one for training the model and the other for evaluating its performance.

train_X and train_y represent the features and target variable of the training data, respectively. The machine learning model will use this data to learn the relationship between the input features and the output variable.

val_X and val_y represent the features and target variable of the validation data, respectively. This data is used to evaluate the performance of the machine learning model on new data that it hasn't seen during training.

<p align="center">
    <img width="300" height="200" src="train-test-split.png" alt="Train/validation split">
</p>
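
The idea behind the split can be sketched in plain Python: shuffle the row indices with a fixed seed, then set aside a share of them for validation. The helper `simple_train_val_split` below is hypothetical and is not scikit-learn's implementation. Its default share of 0.25 mirrors what `train_test_split` uses when neither `test_size` nor `train_size` is given.

```python
import math
import random


def simple_train_val_split(rows, targets, val_fraction=0.25, seed=3):
    """Shuffle row indices, then carve off a validation share (hypothetical helper)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_val = math.ceil(len(rows) * val_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    # Index features and targets with the same indices so pairs stay aligned.
    train_rows = [rows[i] for i in train_idx]
    val_rows = [rows[i] for i in val_idx]
    train_targets = [targets[i] for i in train_idx]
    val_targets = [targets[i] for i in val_idx]
    return train_rows, val_rows, train_targets, val_targets


toy_rows = [[i] for i in range(8)]        # eight made-up one-feature rows
toy_targets = [i * 10 for i in range(8)]  # matching made-up targets
tr_rows, va_rows, tr_targets, va_targets = simple_train_val_split(toy_rows, toy_targets)
print(len(tr_rows), len(va_rows))  # 6 2
```

The important properties are that every row lands in exactly one subset and that each row keeps its own target.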

In [17]:
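# Hold out part of the data for validation; random_state makes the split reproducible
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=3)

# Fit on the training subset only
melbourne_model = DecisionTreeRegressor()
melbourne_model.fit(train_X, train_y)

# Score the model on data it has not seen
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))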

247256.56681730147

For comparison, the in-sample MAE computed earlier was about 435, while the validation MAE is about 247,000, more than 500 times larger. The near-perfect in-sample score reflected memorized training data, not real predictive accuracy.


# Underfitting and Overfitting
Underfitting occurs when your model is too simple to capture the patterns in your data. Overfitting occurs when your model is too complex and fits noise in the training data. <br>
<p align="center">
    <img src="over-underfit.png" alt="Underfitting and overfitting">
</p>

Variance and bias are two important concepts in statistics and machine learning that are related to the accuracy and generalization ability of a model.

Bias refers to the systematic error that a model makes in its predictions. It can be thought of as the difference between the expected prediction of the model and the true value of the target variable. Models with high bias tend to oversimplify the problem and underfit the data, leading to poor performance on both the training and test sets.

Variance, on the other hand, refers to the variability of the model's predictions for different training sets. It measures how sensitive the model is to small fluctuations in the training data. Models with high variance tend to overfit the training data and perform poorly on new, unseen data.

In summary, bias refers to the error due to a model's simplifying assumptions, while variance refers to the error due to the model's sensitivity to fluctuations in the training data. A good model should balance both bias and variance to achieve high accuracy and good generalization ability.
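
For squared-error loss, this trade-off has a standard closed form. Assume the data follow $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and let $\hat{f}$ be the fitted model (these symbols are introduced here for illustration). The expected error at a point $x$ then decomposes as:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
$$

The expectation is taken over training sets and noise. The decomposition is exact for squared error. For MAE, the metric used in this notebook, it should be read as intuition only.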

![image](mae_over.png)
![image](bias.png)

In [18]:
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree capped at max_leaf_nodes and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    return mean_absolute_error(val_y, preds_val)

In a decision tree, each internal node represents a decision based on the value of a particular feature, and each leaf node holds a prediction. The maximum number of leaf nodes caps how many leaves (final prediction nodes) the tree may have. <br> <br>

Left unconstrained, a decision tree keeps splitting and ends up with a large number of leaf nodes, which lets it capture fine detail in the training data. However, too many leaf nodes lead to overfitting, where the tree is too complex and captures noise or random fluctuations in the data rather than meaningful patterns. <br> <br>

Therefore, setting a maximum number of leaf nodes is a common way to limit overfitting and create a simpler decision tree. In scikit-learn, this limit is the max_leaf_nodes hyperparameter of DecisionTreeRegressor. The algorithm stops splitting once the maximum number of leaf nodes is reached, even if further splits were possible.
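
The effect of a leaf cap can be sketched with a toy one-feature regression tree. scikit-learn's documentation describes growing the tree best-first when `max_leaf_nodes` is set. The code below imitates that idea by always splitting the leaf whose best split removes the most squared error. All function names and data are hypothetical, and this is a simplified illustration, not scikit-learn's algorithm.

```python
def sum_squared_error(values):
    """Squared error of predicting the mean for every value."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(leaf):
    """Return (error_reduction, split_index) for (x, y) pairs sorted by x, or None."""
    targets = [y for _, y in leaf]
    total = sum_squared_error(targets)
    best = None
    for i in range(1, len(leaf)):
        if leaf[i - 1][0] == leaf[i][0]:
            continue  # cannot split between identical feature values
        gain = total - sum_squared_error(targets[:i]) - sum_squared_error(targets[i:])
        if best is None or gain > best[0]:
            best = (gain, i)
    return best


def grow_leaves(points, max_leaf_nodes):
    """Split best-first until max_leaf_nodes is reached or nothing can be split."""
    leaves = [sorted(points)]
    while len(leaves) < max_leaf_nodes:
        candidates = []
        for position, leaf in enumerate(leaves):
            split = best_split(leaf)
            if split is not None:
                candidates.append((split[0], position, split[1]))
        if not candidates:
            break  # no leaf has two distinct x values left to separate
        _, position, index = max(candidates)
        leaf = leaves.pop(position)
        leaves[position:position] = [leaf[:index], leaf[index:]]
    return leaves


def leaf_means(leaves):
    """The prediction of each leaf is the mean target of its points."""
    return [sum(y for _, y in leaf) / len(leaf) for leaf in leaves]


# Made-up (x, y) pairs forming two obvious clusters.
toy_points = [(1, 10), (2, 12), (3, 11), (10, 50), (11, 52), (12, 51)]

print(leaf_means(grow_leaves(toy_points, 2)))  # [11.0, 51.0]
print(len(grow_leaves(toy_points, 100)))       # 6
```

With a cap of 2 the tree finds the two clusters and stops. With a generous cap it keeps splitting until every point has its own leaf, which reproduces the training data exactly, including its noise.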

In [19]:
# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 25, 50, 250, 500, 2500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Max leaf nodes: 5  		 Mean Absolute Error:  341969
Max leaf nodes: 25  		 Mean Absolute Error:  264313
Max leaf nodes: 50  		 Mean Absolute Error:  247991
Max leaf nodes: 250  		 Mean Absolute Error:  236233
Max leaf nodes: 500  		 Mean Absolute Error:  236202
Max leaf nodes: 2500  		 Mean Absolute Error:  249578
Max leaf nodes: 5000  		 Mean Absolute Error:  251285
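
In the output above, validation MAE falls as the tree grows to a few hundred leaves (less underfitting) and rises again beyond that (overfitting). Picking the best size can be automated. The sketch below copies the printed values into a dictionary. In the notebook the same dictionary could be built by calling `get_mae` for each candidate size. The variable names are hypothetical.

```python
# Validation MAE per max_leaf_nodes, copied from the output above.
mae_by_leaf_count = {
    5: 341969,
    25: 264313,
    50: 247991,
    250: 236233,
    500: 236202,
    2500: 249578,
    5000: 251285,
}

# The best tree size is the key with the smallest validation MAE.
best_tree_size = min(mae_by_leaf_count, key=mae_by_leaf_count.get)
print(best_tree_size)  # 500
```

The gap between 250 and 500 leaves is tiny (236233 versus 236202), so either would be a reasonable choice. The smaller tree is simpler.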
