# Preview

Decision trees leave you with a difficult decision. A deep **tree with lots of leaves will overfit** because each prediction is coming from historical data from only the few houses at its leaf. But a shallow **tree with few leaves will perform poorly** because it fails to capture as many distinctions in the raw data.
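The trade-off can be seen in a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the generated features and target stand in for real housing data, and `tree_mae` and `leaf_mae` are hypothetical names. It fits trees with three leaf budgets and reports training and validation error for each.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for housing data (an assumption, not the Melbourne dataset):
# a noisy nonlinear target over two numeric features.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(500, 2))
y_demo = np.sin(X_demo[:, 0]) * 10 + X_demo[:, 1] ** 2 + rng.normal(0, 5, size=500)

tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=0)


def tree_mae(max_leaf_nodes):
    """Fit a tree with the given leaf budget; return (train MAE, validation MAE)."""
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    tree.fit(tr_X, tr_y)
    return (
        mean_absolute_error(tr_y, tree.predict(tr_X)),
        mean_absolute_error(va_y, tree.predict(va_X)),
    )


# None means no limit on the number of leaves (a fully grown tree).
leaf_mae = {n: tree_mae(n) for n in (2, 20, None)}
for n, (train_mae, val_mae) in leaf_mae.items():
    print(f"max_leaf_nodes={n}: train MAE {train_mae:.2f}, validation MAE {val_mae:.2f}")
```

The unrestricted tree memorizes the training rows, so its training error is close to zero while its validation error is not. That gap is the overfitting described above. The two-leaf tree has a large error on both sets, which is the underfitting case.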

Even today's most sophisticated modeling techniques face this tension between underfitting and overfitting. But many models have clever ideas that can lead to better performance. We'll look at the random forest as an example.

The **random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree**. It generally **has much better predictive accuracy than a single decision tree** and it works well with default parameters. If you keep modeling, you can learn more models with even better performance, but many of those are sensitive to getting the right parameters.
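As a rough check of the averaging idea, the sketch below fits a small forest on synthetic data (an assumption, used only to keep the example self-contained) and compares the forest's prediction with a hand-computed mean over its fitted trees. It relies on scikit-learn's `estimators_` attribute, which holds the individual fitted trees. The names `toy_forest`, `per_tree_preds`, and `manual_average` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, used here only so the sketch runs on its own.
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

toy_forest = RandomForestRegressor(n_estimators=25, random_state=0)
toy_forest.fit(X_toy, y_toy)

# Predictions from each fitted tree, stacked to shape (n_trees, n_samples).
per_tree_preds = np.stack([tree.predict(X_toy) for tree in toy_forest.estimators_])

# Averaging over the tree axis should reproduce the forest's own prediction.
manual_average = per_tree_preds.mean(axis=0)
forest_preds = toy_forest.predict(X_toy)

print(np.allclose(manual_average, forest_preds))
```

Each tree is trained on a different bootstrap sample of the rows, so the individual trees disagree with one another. Averaging smooths out those disagreements.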

In [1]:
import pandas as pd

In [7]:
# extract data
filepath = 'C:/Users/Andre/Documents/GitHub/kaggle-courses/intro_to_machine_learning/data/melb_data.csv'
data = pd.read_csv(filepath)
data = data.dropna(axis=0)

# y and X
y = data.Price

# 'Lattitude' and 'Longtitude' match the (misspelled) column names in the dataset
features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 'YearBuilt', 'Lattitude', 'Longtitude']
X = data[features]

from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)

We build a random forest model much as we built a decision tree in scikit-learn, this time using the **RandomForestRegressor** class instead of DecisionTreeRegressor.

In [9]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))

191669.7536453626


## Conclusion

There is likely room for further improvement, but this is already a big improvement over the best decision tree error of 250,000. There are parameters that let you change the performance of the random forest, much as we changed the maximum depth of the single decision tree. But one of the best features of random forest models is that they generally work reasonably well even without this tuning.
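For readers who do want to experiment with those parameters, here is a minimal sketch under stated assumptions: the data is synthetic, the candidate settings are arbitrary picks rather than recommendations, and `forest_val_mae`, `candidates`, and `tuning_results` are hypothetical names. `n_estimators`, `max_leaf_nodes`, and `max_depth` are real `RandomForestRegressor` parameters.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data so the sketch is self-contained (not the Melbourne dataset).
X_syn, y_syn = make_regression(n_samples=400, n_features=8, noise=20.0, random_state=0)
syn_train_X, syn_val_X, syn_train_y, syn_val_y = train_test_split(
    X_syn, y_syn, random_state=0
)


def forest_val_mae(**params):
    """Fit a forest with the given keyword parameters; return its validation MAE."""
    model = RandomForestRegressor(random_state=1, **params)
    model.fit(syn_train_X, syn_train_y)
    return mean_absolute_error(syn_val_y, model.predict(syn_val_X))


# An empty dict means scikit-learn's defaults; the others are arbitrary examples.
candidates = [
    {},
    {"n_estimators": 200},
    {"max_leaf_nodes": 50},
    {"max_depth": 5},
]

tuning_results = [(params, forest_val_mae(**params)) for params in candidates]
for params, mae in tuning_results:
    print(f"{params or 'defaults'}: validation MAE {mae:,.1f}")

# Keep the setting with the lowest validation error.
best_params, best_mae = min(tuning_results, key=lambda item: item[1])
```

Which setting wins depends on the data, so treat the printed ranking as specific to this toy problem. Because the defaults are among the candidates, the selected setting can never score worse than they do on this validation set.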

# Exercises

## Recap

Code so far:

In [16]:
# imports
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


# Path of the file to read
iowa_file_path = 'C:/Users/Andre/Documents/GitHub/kaggle-courses/intro_to_machine_learning/data/home_data_for_ml_course/train.csv'

home_data = pd.read_csv(iowa_file_path)

# target y
y = home_data.SalePrice
# features X
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]

# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# 1/4 specify model
iowa_model = DecisionTreeRegressor(random_state=1)
# 2/4 fit model
iowa_model.fit(train_X, train_y)
# 3/4 predict
val_predictions = iowa_model.predict(val_X)
# 4/4 validate
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE when not specifying max_leaf_nodes: {:,.0f}".format(val_mae))

# Using best value for max_leaf_nodes
iowa_model = DecisionTreeRegressor(max_leaf_nodes=100, random_state=1)
iowa_model.fit(train_X, train_y)
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE for best value of max_leaf_nodes: {:,.0f}".format(val_mae))


Validation MAE when not specifying max_leaf_nodes: 29,653
Validation MAE for best value of max_leaf_nodes: 27,283


Data science isn't always this easy. But replacing the decision tree with a Random Forest is going to be an easy win.

In [17]:
from sklearn.ensemble import RandomForestRegressor

# 1. define (set random_state to 1)
rf_model = RandomForestRegressor(random_state=1)

# 2. fit
rf_model.fit(train_X, train_y)

# 3. predict
rf_pred = rf_model.predict(val_X)

# 4. validate
rf_val_mae = mean_absolute_error(val_y, rf_pred)

print("Validation MAE for Random Forest Model: {:,.0f}".format(rf_val_mae))

Validation MAE for Random Forest Model: 21,857


# Continue to [7_machine_learning_competitions](7_machine_learning_competitions.ipynb)