# Supervised Machine Learning: Random Forest

Random Forest is an ensemble machine learning algorithm that applies bagging (bootstrap aggregating) to decision trees, which serve as its base estimators. On top of bagging, each tree considers only a random subset of the features when choosing the best split at a node, which makes the trees less correlated with one another.

1. Random subsets are drawn from the original dataset by sampling with replacement (bootstrapping).
2. A decision tree model is fitted on each of the subsets.
3. At each node of each decision tree, only a random subset of the features is considered when deciding the best split.
4. The final prediction combines the predictions of all decision trees: a majority vote (or averaged class probabilities) for classification, and an average for regression.
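
The steps above can be sketched by hand with plain decision trees. This is a minimal illustration, not how scikit-learn implements `RandomForestClassifier` internally (which averages predicted class probabilities rather than counting hard votes). The tree count, the seeds, and the names `trees`, `all_preds`, and `votes` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

n_trees = 25  # illustrative value
trees = []
for i in range(n_trees):
    # Step 1: bootstrap sample (row indices drawn with replacement)
    idx = rng.integers(0, len(X), size=len(X))
    # Steps 2 and 3: fit one tree per sample; max_features="sqrt" makes the
    # tree consider a random subset of the features at every split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Step 4: majority vote across trees for every sample
all_preds = np.stack([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
votes = np.apply_along_axis(
    lambda col: np.bincount(col, minlength=3).argmax(), 0, all_preds
)
manual_accuracy = (votes == y).mean()  # accuracy on the training data
```

Here `manual_accuracy` is measured on the same data the trees were trained on, so it only confirms that the voting works; it is not an estimate of generalization.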

In [11]:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np


## 1. Load Data

In [12]:
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Categorical.from_codes(iris.target, iris.target_names)

In [13]:
X.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [14]:
y = pd.get_dummies(y)
y.head()

| | setosa | versicolor | virginica |
|---|---|---|---|
| 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 |
| 4 | 1 | 0 | 0 |


In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

## 2. Random Forest

1. **n_estimators**: The number of decision trees in the forest.
2. **criterion**: The function used to measure the quality of a split, such as `"gini"` (the default) or `"entropy"`.
3. **min_samples_split**: The minimum number of samples a node must contain before it can be split. Nodes with fewer samples become leaves.
4. **max_features**: The number of features considered when looking for the best split at each node. The Iris data has only four features, so `max_features=4` considers all of them at every split and effectively turns off the random feature selection.
5. **n_jobs**: The number of jobs to run in parallel for both fit and predict. Setting it to `-1` uses all available cores.
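
To see what these parameters do to a fitted model, the following sketch trains a small forest and inspects it through attributes that scikit-learn exposes after fitting. The specific values (20 trees, `max_features=2`, `random_state=0`) and the variable names are assumptions chosen for illustration, not the settings used in this notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=20,        # number of trees
    criterion="entropy",    # split-quality measure
    min_samples_split=5,    # nodes with fewer than 5 samples are not split
    max_features=2,         # features considered at each split
    n_jobs=-1,              # use all available cores
    random_state=0,         # fixed seed for repeatable results
)
forest.fit(X, y)

# n_estimators controls how many fitted trees the forest holds
n_trees = len(forest.estimators_)

# Every tree was limited to 2 candidate features per split
features_per_split = {tree.max_features_ for tree in forest.estimators_}

# Smallest internal (non-leaf) node over all trees; leaves have children_left == -1
smallest_split_node = min(
    tree.tree_.n_node_samples[tree.tree_.children_left != -1].min()
    for tree in forest.estimators_
)

# Importances are normalized across the four features
importance_total = forest.feature_importances_.sum()
```

In this sketch `smallest_split_node` should never drop below `min_samples_split`, which is a quick way to confirm that the parameter restricts which nodes may be split rather than how large leaves must be.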

In [18]:
clf = RandomForestClassifier(n_estimators=100, 
                             criterion='gini', 
                             min_samples_split=5, 
                             max_features=4, 
                             n_jobs=-1)
clf.fit(X_train, y_train)

RandomForestClassifier(max_features=4, min_samples_split=5, n_jobs=-1)

In [21]:
clf.score(X_test, y_test)

0.9736842105263158
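
Because `y` was one-hot encoded above, scikit-learn treats the task as multilabel, and `score` reports subset accuracy: a test sample counts as correct only if all three indicator columns match. As a hedged alternative, the sketch below fits the forest on the original 1-D class labels instead, so that `score` is plain classification accuracy. The names `X_raw`, `y_raw`, and `clf_1d`, as well as `random_state=0`, are assumptions for this example, and the resulting score may differ from the value shown above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load features and integer class labels (0, 1, 2) without one-hot encoding
X_raw, y_raw = load_iris(return_X_y=True)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_raw, y_raw, random_state=1
)

clf_1d = RandomForestClassifier(
    n_estimators=100,
    min_samples_split=5,
    n_jobs=-1,
    random_state=0,  # fixed seed so repeated runs give the same forest
)
clf_1d.fit(Xr_train, yr_train)

pred = clf_1d.predict(Xr_test)            # one predicted label per test sample
acc = accuracy_score(yr_test, pred)       # fraction of correct predictions
```

With 1-D labels, `clf_1d.score(Xr_test, yr_test)` returns the same number as `accuracy_score`, and each prediction is a single class rather than a row of indicators.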