**Introduction** <br>
Decision trees leave you with a difficult decision. A deep tree with lots of leaves will overfit because each prediction comes from historical data for only the few houses at its leaf. But a shallow tree with few leaves will perform poorly because it fails to capture as many distinctions in the raw data.
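
To make this trade-off concrete, here is a minimal sketch that fits trees of different sizes and compares training error with validation error. It uses synthetic data from `make_regression` rather than the housing data, and the names `X_demo`, `y_demo`, and `depth_results` are only illustrative. The exact numbers will depend on the data, so treat it as a demonstration of the pattern rather than a benchmark.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (an assumption for illustration only)
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)
Xd_train, Xd_val, yd_train, yd_val = train_test_split(X_demo, y_demo, random_state=0)

depth_results = {}
# max_leaf_nodes=None places no limit on the number of leaves
for max_leaf_nodes in (2, 50, None):
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    tree.fit(Xd_train, yd_train)
    train_mae = mean_absolute_error(yd_train, tree.predict(Xd_train))
    val_mae = mean_absolute_error(yd_val, tree.predict(Xd_val))
    depth_results[max_leaf_nodes] = (train_mae, val_mae)
    print(f"max_leaf_nodes={max_leaf_nodes}: train MAE={train_mae:.1f}, validation MAE={val_mae:.1f}")
```

A very small tree tends to have high error on both sets (underfitting), while an unrestricted tree fits the training data almost perfectly yet does worse on the validation data (overfitting).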

Even today's most sophisticated modeling techniques face this tension between underfitting and overfitting. But many models use clever ideas that can lead to better performance. We'll look at the random forest as an example.

The random forest uses many trees, and it makes a prediction by averaging the predictions of its component trees. It generally has much better predictive accuracy than a single decision tree, and it works well with default parameters. If you keep modeling, you can learn other models with even better performance, but many of them are sensitive to parameter choices.
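
The averaging step can be checked directly. The sketch below, which assumes synthetic data and illustrative names (`X_toy`, `y_toy`, `forest`), collects the predictions of each fitted tree from the `estimators_` attribute of scikit-learn's `RandomForestRegressor` and compares their mean with the forest's own prediction.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small synthetic regression problem (an assumption for illustration only)
X_toy, y_toy = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X_toy, y_toy)

# Predictions of every component tree for the first five rows,
# stacked into an array of shape (n_trees, n_rows)
per_tree = np.stack([tree.predict(X_toy[:5]) for tree in forest.estimators_])

# Averaging over the trees should reproduce the forest's prediction
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X_toy[:5])
print(np.allclose(manual_average, forest_prediction))
```

Each tree is trained on a different bootstrap sample of the rows, so the individual trees disagree; averaging smooths out their individual errors.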

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

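# Load the Melbourne housing data
dataset_path = "../resources/datasets/melb_data.csv"
df = pd.read_csv(dataset_path)

# Drop every row that has a missing value
df = df.dropna(axis=0)

# 'Lattitude' and 'Longtitude' are spelled this way in the dataset's column names
features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 'YearBuilt', 'Lattitude', 'Longtitude']
y = df.Price
X = df[features]

# Hold out part of the data for evaluation (default split is 75% train, 25% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)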

We build a random forest model much as we built a decision tree in scikit-learn, this time using the RandomForestRegressor class instead of DecisionTreeRegressor.

In [2]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Fix random_state so the result is reproducible
rf_model = RandomForestRegressor(random_state=78)
rf_model.fit(X=X_train, y=y_train)

# Mean absolute error on the held-out split
rf_test_preds = rf_model.predict(X=X_test)
print(mean_absolute_error(y_test, rf_test_preds))

174768.55526268866
