**Experimenting With Different Models**<br>
Now that you have a reliable way to measure model accuracy, you can experiment with alternative models and see which gives the best predictions. But what alternatives do you have for models?

You can see in scikit-learn's [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) that the decision tree model has many options (more than you'll want or need for a long time). The most important options determine the tree's depth. Recall from [the first lesson in this course](https://www.kaggle.com/dansbecker/how-models-work) that a tree's depth is a measure of how many splits it makes before coming to a prediction. This is a relatively shallow tree:

![Image](https://storage.googleapis.com/kaggle-media/learn/images/R3ywQsR.png)

In practice, it's not uncommon for a tree to have 10 splits between the top level (all houses) and a leaf. As the tree gets deeper, the dataset gets sliced up into leaves with fewer houses. If a tree has only 1 split, it divides the data into 2 groups. If each group is split again, we get 4 groups of houses. Splitting each of those again creates 8 groups. If we keep doubling the number of groups by adding more splits at each level, we'll have $2^{10}$ groups of houses by the time we get to the 10th level. That's 1024 leaves.
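
The doubling described above can be checked with a few lines of Python. This is a minimal sketch that assumes a perfectly balanced tree in which every group is split at every level; real trees are often unbalanced, so these counts are best read as an upper bound for a given depth.

```python
# Number of leaves in a perfectly balanced binary tree:
# each additional level of splits doubles the number of groups.
leaf_counts = {depth: 2 ** depth for depth in range(1, 11)}

for depth, leaves in leaf_counts.items():
    print(f"{depth:2d} levels of splits -> {leaves:4d} leaves")
```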

When we divide the houses amongst many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses).

This is a phenomenon called overfitting, where a model matches the training data almost perfectly, but does poorly in validation and other new data. On the flip side, if we make our tree very shallow, it doesn't divide up the houses into very distinct groups.

At an extreme, if a tree divides houses into only 2 or 4 groups, each group still has a wide variety of houses. The resulting predictions may be far off for most houses, even in the training data (and they will be bad in validation too, for the same reason). When a model fails to capture important distinctions and patterns in the data, so that it performs poorly even on the training data, that is called underfitting.

Since we care about accuracy on new data, which we estimate from our validation data, we want to find the sweet spot between underfitting and overfitting. Visually, we want the low point of the (red) validation curve in the figure below.

![Image](https://storage.googleapis.com/kaggle-media/learn/images/AXSEOfI.png)
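
To make the two curves concrete, here is a small sketch that computes training and validation MAE for trees of increasing size. The data is synthetic (a noisy sine wave standing in for house prices), and the candidate sizes are arbitrary, so treat both as assumptions made purely for illustration. One would typically expect the training error to keep falling as leaves are added, while the validation error stops improving at some point.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (an assumption for illustration):
# one feature, a smooth underlying signal, plus random noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=400)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Record (training MAE, validation MAE) for each tree size.
results = {}
for max_leaf_nodes in [2, 10, 50, 300]:
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_train, y_train)
    train_mae = mean_absolute_error(y_train, model.predict(X_train))
    val_mae = mean_absolute_error(y_val, model.predict(X_val))
    results[max_leaf_nodes] = (train_mae, val_mae)
    print(f"max_leaf_nodes={max_leaf_nodes:3d}  "
          f"train MAE={train_mae:.3f}  validation MAE={val_mae:.3f}")
```

With 300 training rows and up to 300 leaves, the largest tree can give every training row its own leaf, which is the overfitting extreme: it reproduces the training targets but still makes errors on the validation rows.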

**Example**<br>
There are a few alternatives for controlling the tree depth, and many allow for some routes through the tree to have greater depth than other routes. But the max_leaf_nodes argument provides a very sensible way to control overfitting vs underfitting. The more leaves we allow the model to make, the more we move from the underfitting area in the above graph to the overfitting area.
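
As a rough illustration of how max_leaf_nodes limits the number of leaves without forcing every route to the same depth, the sketch below fits a small tree and inspects it with the get_n_leaves() and get_depth() methods of DecisionTreeRegressor. The data is synthetic and deliberately lopsided; it is an assumption made for this example, not part of the housing dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): the target changes
# quickly for small x and is flat elsewhere, so splits tend to cluster
# on one side of the tree.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.where(X[:, 0] < 2, 10 * X[:, 0], 20.0) + rng.normal(scale=0.1, size=500)

model = DecisionTreeRegressor(max_leaf_nodes=8, random_state=0)
model.fit(X, y)

n_leaves = model.get_n_leaves()  # never exceeds max_leaf_nodes
depth = model.get_depth()        # length of the longest route to a leaf
print(f"leaves: {n_leaves}, longest route: {depth} splits")
```

A perfectly balanced tree with 8 leaves would have a depth of 3, so a larger reported depth would indicate that some routes through the tree are longer than others.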

Before comparing MAE scores for different values of max_leaf_nodes, we load the data and split it into training and validation sets:

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

dataset_path = "../resources/datasets/melb_data.csv"
df = pd.read_csv(dataset_path)

df = df.dropna(axis=0)

features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 'YearBuilt', 'Lattitude', 'Longtitude']
y = df.Price
X = df[features]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

We can use a for-loop to compare the accuracy of models built with different values for max_leaf_nodes.

In [2]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Fit one tree per candidate size and report its validation MAE
for i in [5, 50, 500]:
    model = DecisionTreeRegressor(max_leaf_nodes=i, random_state=0)
    model.fit(X=X_train, y=y_train)
    model_predict = model.predict(X=X_test)
    mae = mean_absolute_error(y_test, model_predict)

    print("Max leaf node: %d  \t\t Mean absolute error: %d" %(i, mae))

Max leaf node: 5  		 Mean absolute error: 347380
Max leaf node: 50  		 Mean absolute error: 258171
Max leaf node: 500  		 Mean absolute error: 243495


**Conclusion**<br>
Here's the takeaway: Models can suffer from either:

- **Overfitting**: capturing spurious patterns that won't recur in the future, leading to less accurate predictions, or
- **Underfitting**: failing to capture relevant patterns, again leading to less accurate predictions.

We use validation data, which isn't used in model training, to measure a candidate model's accuracy. This lets us try many candidate models and keep the best one.

In [3]:
def get_mae(max_leaf, X_train, X_test, y_train, y_test):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf, random_state=0)
    model.fit(X=X_train, y=y_train)
    model_predict_val = model.predict(X=X_test)
    return mean_absolute_error(y_test, model_predict_val)

In [4]:
# Search for the best value of max_leaf_nodes by tracking the lowest MAE seen so far
best_max_leaf = None
best_mae = float('inf')
for max_leaf in [2, 100, 250, 500, 690, 710, 850, 900, 1000]:
    mae = get_mae(max_leaf, X_train, X_test, y_train, y_test)
    
    if mae < best_mae:
        best_mae = mae
        best_max_leaf = max_leaf

print(best_max_leaf)

710


In [5]:
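# Second way: store every score in a dict, then pick the key with the lowest MAE
scores = {}

for max_leaf in [2, 100, 250, 500, 690, 710, 850, 900, 1000]:
    scores[max_leaf] = get_mae(max_leaf, X_train, X_test, y_train, y_test)

# Keep the scores dict intact; store the winning key under its own name
best_max_leaf = min(scores, key=scores.get)
print(best_max_leaf)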

710


In [6]:
# Third way: build the scores dict with a dict comprehension
candidate_max_leaf_nodes = [2, 100, 250, 500, 690, 710, 850, 900, 1000]
scores = {leaf_nodes: get_mae(leaf_nodes, X_train, X_test, y_train, y_test) for leaf_nodes in candidate_max_leaf_nodes}
best_max_leaf = min(scores, key=scores.get)

**Final Model**

In [7]:
final_model = DecisionTreeRegressor(max_leaf_nodes=best_max_leaf, random_state=1)
final_model.fit(X, y)

In [8]:
data = pd.DataFrame([
    {'Rooms': 2, 'Bathroom': 1, 'Landsize': 90.0, 'BuildingArea': 80.0, 'YearBuilt': 1950.0, 'Lattitude': -37.8101, 'Longtitude': 144.9965},
    {'Rooms': 3, 'Bathroom': 2, 'Landsize': 180.0, 'BuildingArea': 160.0, 'YearBuilt': 2005.0, 'Lattitude': -37.8045, 'Longtitude': 144.9982},
    {'Rooms': 4, 'Bathroom': 3, 'Landsize': 300.0, 'BuildingArea': 250.0, 'YearBuilt': 2010.0, 'Lattitude': -37.8012, 'Longtitude': 144.9973}
])

print(data.head())
final_model.predict(data.head())

   Rooms  Bathroom  Landsize  BuildingArea  YearBuilt  Lattitude  Longtitude
0      2         1      90.0          80.0     1950.0   -37.8101    144.9965
1      3         2     180.0         160.0     2005.0   -37.8045    144.9982
2      4         3     300.0         250.0     2010.0   -37.8012    144.9973


array([ 837200., 1463600., 1655000.])