# Random Forests

A random forest combines many decision trees into a single model.

Such a combination is called an **ensemble**.

The idea of *ensemble methods* is to combine several weak models into one strong model that usually performs better than its individual members.

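There are different methods to combine models. Random forests are built using an *ensemble method* called **bagging**, in which multiple decision trees are trained independently (and can therefore run in parallel), each on a random sample of the training data. The random forest then aggregates the results from each decision tree: the mean value for regression, and a vote across trees for classification.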
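
To make the bagging idea concrete, here is a minimal hand-rolled sketch. It assumes synthetic data from `make_regression` and hypothetical names (`X_demo`, `trees`, `bagged_prediction`) that are not part of this lesson's dataset. It is only an illustration of the principle, not how scikit-learn implements its forests.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption, used only to keep the sketch self-contained)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(Xd_train), size=len(Xd_train))
    trees.append(DecisionTreeRegressor(random_state=0).fit(Xd_train[idx], yd_train[idx]))

# Aggregate: average the per-tree predictions
# (a classifier would combine votes instead of taking a mean)
per_tree = np.stack([t.predict(Xd_test) for t in trees])  # shape (n_trees, n_test)
bagged_prediction = per_tree.mean(axis=0)
print(per_tree.shape, bagged_prediction.shape)
```

A real random forest adds one more source of randomness on top of this: each split only considers a random subset of the features.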

Random forests reduce the risk of overfitting, and their accuracy is usually higher than that of a single decision tree.
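
The overfitting claim can be checked with a small experiment. The sketch below assumes synthetic data from `make_friedman1` (not the housing data used in this lesson) and hypothetical variable names. It compares the training and test MAE of a single unpruned tree with those of a forest.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, noisy regression problem (an assumption for illustration)
Xf, yf = make_friedman1(n_samples=600, noise=1.0, random_state=0)
Xf_train, Xf_test, yf_train, yf_test = train_test_split(
    Xf, yf, test_size=1/3, random_state=42)

single_tree = DecisionTreeRegressor(random_state=42).fit(Xf_train, yf_train)
forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(Xf_train, yf_train)

# An unpruned tree memorizes the training set, so its training error says little
tree_train_mae = mean_absolute_error(yf_train, single_tree.predict(Xf_train))
tree_test_mae = mean_absolute_error(yf_test, single_tree.predict(Xf_test))
forest_test_mae = mean_absolute_error(yf_test, forest.predict(Xf_test))

print(f"single tree: train MAE {tree_train_mae:.3f}, test MAE {tree_test_mae:.3f}")
print(f"forest:      test MAE {forest_test_mae:.3f}")
```

The gap between the single tree's training and test error is the overfitting that averaging many trees is meant to reduce.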

In [None]:
import numpy as np
import pandas as pd


import sklearn
import sklearn.metrics  # used below as sklearn.metrics.mean_absolute_error
import sklearn.tree     # used below as sklearn.tree.export_graphviz
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor

# `import sklearn` does not automatically import every submodule;
# you may have to import them explicitly, as above.

import matplotlib.pyplot as plt

from jyquickhelper import add_notebook_menu
print("Table of Contents")
add_notebook_menu()

In [None]:
melbourne_data = pd.read_csv("Melbourne_housing_FULL.csv")
# Drop rows with missing values; drop=True avoids keeping the old index as a column
melbourne_data = melbourne_data.dropna(axis=0).reset_index(drop=True)

features = ['Rooms', 'Bathroom', 'Landsize', "Car", "YearBuilt"]

y = melbourne_data.Price
X = melbourne_data[features]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=42)

## Usage of RandomForestRegressor

In [None]:
forest_model = RandomForestRegressor(random_state=42)
forest_model.fit(X_train, y_train)
predictions = forest_model.predict(X_test)
mae = sklearn.metrics.mean_absolute_error(y_test, predictions)
print(f"MAE: {mae}")

## Usage of BaggingClassifier with DecisionTreeClassifier

In [None]:
tree = DecisionTreeClassifier()
bag = BaggingClassifier(tree, n_estimators=100, max_samples=0.8,
                        random_state=1)
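bag = bag.fit(X_train, y_train)  # fit on the training split only, so the test rows stay unseen
predictions = bag.predict(X_test)
mae = sklearn.metrics.mean_absolute_error(y_test, predictions)
print(f"MAE: {mae}")

It takes more time to create this model. Make sure to fit it on `X_train` rather than on the full `X`: fitting on all rows leaks the test set into training, which makes the MAE look far lower than it really is. Note also that a classifier treats every distinct price as a separate class; for a regression target like `Price`, `BaggingRegressor` is the more natural choice.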

## Visualize the RandomForest

To visualize the RandomForest, you will need to [install GraphViz](https://graphviz.gitlab.io/download/).

Simple instructions:
* On Linux (Ubuntu): `apt install graphviz`
* On macOS: `brew install graphviz`

In [None]:
model = RandomForestClassifier(n_estimators=10)

# Train
model.fit(X, y)
# Extract single tree
estimator = model.estimators_[5]

# Export as dot file
sklearn.tree.export_graphviz(estimator, out_file='tree.dot', 
                max_depth=5,
                feature_names = features,
                rounded = True, proportion = True, 
                precision = 2, filled = True, rotate=True)

# Convert to png using system command (requires Graphviz)
from subprocess import call
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])

# Display in jupyter notebook
from IPython.display import Image
Image(filename = 'tree.png')
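
If GraphViz is not available, a tree can also be inspected as plain text with `sklearn.tree.export_text`. The sketch below assumes synthetic data from `make_classification` and hypothetical feature names (`f0` to `f3`); with the lesson's data you would pass `features` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Hypothetical feature names and synthetic data, for illustration only
feature_names_demo = ["f0", "f1", "f2", "f3"]
X_viz, y_viz = make_classification(n_samples=200, n_features=4, n_informative=2,
                                   n_redundant=0, random_state=0)

forest_viz = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_viz, y_viz)

# Render one tree of the forest as indented text, limited to the first levels
tree_text = export_text(forest_viz.estimators_[5],
                        feature_names=feature_names_demo, max_depth=2)
print(tree_text)
```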

# More models

## Linear Regression

Linear Regression is one of the simplest models.

Its goal (objective function) is to minimize the sum of squared errors between the observed targets and the targets predicted by a linear approximation.
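
As a sanity check of that objective, the sketch below solves the same least-squares problem directly with `np.linalg.lstsq` and compares the result with `LinearRegression`. The data and the names (`X_lin`, `true_coef`, `solution`) are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: a known linear relationship plus a little noise
rng = np.random.default_rng(0)
X_lin = rng.normal(size=(100, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y_lin = X_lin @ true_coef + 4.0 + rng.normal(scale=0.1, size=100)

# Append a column of ones so the last weight plays the role of the intercept
A = np.column_stack([X_lin, np.ones(len(X_lin))])
solution, *_ = np.linalg.lstsq(A, y_lin, rcond=None)

model_lin = LinearRegression().fit(X_lin, y_lin)
print("lstsq coefficients:  ", solution[:3], "intercept:", solution[3])
print("sklearn coefficients:", model_lin.coef_, "intercept:", model_lin.intercept_)
```

Both approaches minimize the same squared error, so the coefficients should agree up to numerical precision.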

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
mae = sklearn.metrics.mean_absolute_error(y_test, predictions)
print(f"MAE: {mae}")

## Lasso

A model built on Linear Regression, but with a different objective function: it adds an L1 regularization term that penalizes the absolute values of the coefficients.
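
A practical effect of the L1 penalty is that some coefficients end up exactly at zero. The sketch below assumes synthetic data from `make_regression`, where only 3 of 10 features carry signal, and uses the same default `alpha=1.0` as `Lasso()`. In scikit-learn the Lasso objective is `(1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data (an assumption): 10 features, only 3 of them informative
X_sp, y_sp = make_regression(n_samples=200, n_features=10, n_informative=3,
                             noise=5.0, random_state=0)

ols = LinearRegression().fit(X_sp, y_sp)
lasso = Lasso(alpha=1.0).fit(X_sp, y_sp)

# Count coefficients that are exactly zero in each model
n_zero_ols = int(np.sum(ols.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print("zero coefficients, linear regression:", n_zero_ols)
print("zero coefficients, lasso:            ", n_zero_lasso)
```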

In [None]:
from sklearn.linear_model import Lasso

model = Lasso()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
mae = sklearn.metrics.mean_absolute_error(y_test, predictions)
print(f"MAE: {mae}")

## GradientBoosting Regression

Another ensemble method, alongside bagging, is **boosting**. 

Instead of running models in parallel, boosting builds them one after another: it fits a first model on the data set, and then fits each following model on what the previous ones did not predict correctly (for gradient boosting regression, the remaining errors).

It relies on the intuition that the best possible next model, when combined with previous models, minimizes the overall prediction error.
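
The sequential idea can be sketched by hand for squared error, where "what is left to predict" is simply the residual. The code below assumes synthetic data from `make_friedman1` and hypothetical names (`stages`, `train_mse`). It is a simplified illustration, not the actual implementation of `GradientBoostingRegressor`.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption, for illustration only)
Xb, yb = make_friedman1(n_samples=300, noise=1.0, random_state=0)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction: the mean of the targets
prediction = np.full(len(yb), yb.mean())
train_mse = [np.mean((yb - prediction) ** 2)]
stages = []
for _ in range(n_rounds):
    # Each new shallow tree is fitted to the errors of the current ensemble
    residuals = yb - prediction
    stage = DecisionTreeRegressor(max_depth=3, random_state=0).fit(Xb, residuals)
    # Add a damped version of its correction to the running prediction
    prediction += learning_rate * stage.predict(Xb)
    stages.append(stage)
    train_mse.append(np.mean((yb - prediction) ** 2))

print(f"training MSE before boosting: {train_mse[0]:.3f}")
print(f"training MSE after {n_rounds} rounds: {train_mse[-1]:.3f}")
```

The training error only shows that each stage corrects the previous ones; a held-out test set is still needed to judge how well the model generalizes.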

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
mae = sklearn.metrics.mean_absolute_error(y_test, predictions)
print(f"MAE: {mae}")

This was a short lesson, without any TODOs.

That's because I want you to start `project02` as soon as possible, and get as much of it done as you can!

See you there.