**Understanding Random Forests Classifiers in Python**

Random forests is a supervised learning algorithm that can be used for both classification and regression. It is flexible and easy to use, and it often performs well with little tuning. A forest is made up of trees, and a forest with more trees is generally more robust, although the gains level off as more trees are added. Random forests builds decision trees on randomly selected data samples, gets a prediction from each tree, and selects the final answer by voting. It also provides a useful indicator of feature importance.

Random forests has a variety of applications, such as recommendation engines, image classification, and feature selection. It can be used to classify loyal loan applicants, identify fraudulent activity, and predict diseases. It is also the basis of the Boruta algorithm, which selects important features in a dataset.

**The Random Forests Algorithm**

Let’s understand the algorithm in layman’s terms. Suppose you want to go on a trip and you would like to travel to a place which you will enjoy.

So what do you do to find a place that you will like? You can search online, read reviews on travel blogs and portals, or you can also ask your friends.

Let’s suppose you have decided to ask your friends, and talked with them about their past travel experience to various places. You will get some recommendations from every friend. Now you have to make a list of those recommended places. Then, you ask them to vote (or select one best place for the trip) from the list of recommended places you made. The place with the highest number of votes will be your final choice for the trip.

The decision process above has two parts. In the first, you ask your friends about their individual travel experiences, and each friend gives one recommendation out of the places he or she has visited. This part is like using the decision tree algorithm: each friend makes a selection based only on the places he or she has seen so far.

The second part, after collecting all the recommendations, is the voting procedure for selecting the best place in the list of recommendations. This whole process of getting recommendations from friends and voting on them to find the best place is known as the random forests algorithm.

Technically, it is an ensemble method (based on the divide-and-conquer approach) of decision trees, each generated on a random sample of the dataset. This collection of decision tree classifiers is known as the forest. Each individual tree is grown by choosing its splits with an attribute selection measure such as information gain, gain ratio, or the Gini index. Each tree depends on an independent random sample. In a classification problem, each tree votes and the most popular class is chosen as the final result. In regression, the average of all the tree outputs is the final result. Random forests is often simpler to use and more accurate than a single decision tree, although no one algorithm is best on every dataset.
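
To make the two aggregation rules concrete, here is a minimal sketch with made-up numbers. The arrays `tree_votes` and `tree_outputs` are hypothetical tree predictions invented for illustration, and `majority_vote` is a helper written for this example, not a scikit-learn function.

```python
import numpy as np

# Hypothetical class predictions: 5 trees (rows) x 4 samples (columns).
tree_votes = np.array([
    [0, 1, 2, 1],
    [0, 1, 1, 1],
    [0, 2, 2, 1],
    [1, 1, 2, 0],
    [0, 1, 2, 1],
])

def majority_vote(votes):
    """Return the most frequent label in each column (one column per sample)."""
    return np.array([np.bincount(col).argmax() for col in votes.T])

# Classification: the most popular class per sample wins.
forest_classes = majority_vote(tree_votes)
print(forest_classes)

# Hypothetical regression outputs: 3 trees (rows) x 2 samples (columns).
tree_outputs = np.array([
    [2.0, 3.5],
    [2.4, 3.1],
    [1.6, 3.9],
])

# Regression: the forest prediction is the mean of the tree outputs.
forest_values = tree_outputs.mean(axis=0)
print(forest_values)
```

Note that scikit-learn's `RandomForestClassifier` averages the class probabilities predicted by the trees rather than counting hard votes, so its result can differ slightly from the plain vote shown here.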

**How does the algorithm work?**

It works in four steps:

1. Select random samples from a given dataset.
1. Construct a decision tree for each sample and get a prediction result from each decision tree.
1. Perform a vote for each predicted result.
1. Select the prediction result with the most votes as the final prediction.
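
The four steps above can be sketched by hand with plain decision trees. This is only an illustration under stated assumptions, not how scikit-learn implements its forest internally: the number of trees (`n_trees = 25`), the seeds, and the use of `max_features="sqrt"` are choices made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

rng = np.random.default_rng(0)
n_trees = 25  # arbitrary choice for this sketch

trees = []
for i in range(n_trees):
    # Step 1: draw a bootstrap sample (same size as the training set, with replacement).
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: fit one tree on that sample; max_features="sqrt" also randomizes
    # which features are considered at each split.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Step 2 (continued): collect a prediction from every tree, shape (n_trees, n_test).
all_preds = np.array([t.predict(X_test) for t in trees])

# Steps 3 and 4: count the votes per test sample and keep the most common class.
y_pred = np.array([np.bincount(col).argmax() for col in all_preds.T])

accuracy = (y_pred == y_test).mean()
print(f"Hand-rolled forest accuracy: {accuracy:.3f}")
```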

**Random Forests vs Decision Trees**

* Random forests is a set of multiple decision trees.
* Deep decision trees may suffer from overfitting, while random forests reduces overfitting by combining many trees built on random subsets of the data and features.
* A single decision tree is computationally faster to train and to predict with.
* Random forests is difficult to interpret, while a decision tree is easily interpretable and can be converted to rules.
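
The comparison above can be explored with a short experiment. The sketch below assumes 5-fold cross-validation on the iris dataset and fixed seeds, all of which are choices made for this example. On a dataset this small and easy the two models may score almost the same, so treat the numbers as illustrative only. The last part uses `export_text` to show how a single shallow tree can be printed as rules.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data, iris.target

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Mean accuracy over 5 cross-validation folds for each model.
tree_scores = cross_val_score(tree, X, y, cv=5)
forest_scores = cross_val_score(forest, X, y, cv=5)
print(f"Decision tree : {tree_scores.mean():.3f}")
print(f"Random forest : {forest_scores.mean():.3f}")

# A single shallow tree can be read as a set of if/else rules.
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
rules = export_text(small_tree, feature_names=iris.feature_names)
print(rules)
```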

**Building a Classifier using Scikit-learn**

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import sklearn
from sklearn import datasets

from sklearn.model_selection import train_test_split

# from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [None]:
# Load dataset
iris = datasets.load_iris()

In [None]:
type(iris)

In [None]:
iris.target_names

In [None]:
iris.feature_names

In [None]:
# iris.data

In [None]:
# iris.data[:, 0]

In [None]:
iris.target

In [None]:
# Creating a DataFrame of given iris dataset

iris_data = pd.DataFrame({
    'sepal length': iris.data[:, 0],
    'sepal width' : iris.data[:, 1],
    'petal length': iris.data[:, 2],
    'petal width': iris.data[:, 3],
    'species': iris.target
})

In [None]:
type(iris_data)

In [None]:
iris_data.head()

In [None]:
X = iris_data.iloc[:, :-1].values  # features 
y = iris_data.iloc[:, -1].values   # target 

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

In [None]:
model = RandomForestClassifier()                  # Create the model with default hyperparameters
model.fit(X_train, y_train)                       # Train the model on the training set

In [None]:
y_pred = model.predict(X_test)                    # Predict classes for the test set

In [None]:
accuracy_score(y_test, y_pred)
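
Accuracy is a single number, so it can hide which classes get confused with each other. The sketch below uses the `confusion_matrix` and `classification_report` functions imported at the top of this notebook. It is written to run on its own, so it reloads the data and repeats the same split. The name `clf` and the `random_state=0` on the model are assumptions added here for reproducibility.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.33, random_state=0
)

# random_state is fixed here only so the output is repeatable.
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Per-class precision, recall, and F1 score.
report = classification_report(y_test, y_pred, target_names=iris.target_names)
print(report)
```

The diagonal of the confusion matrix counts the correct predictions, so its sum divided by the total number of test samples equals the accuracy score.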

In [None]:
model1 = RandomForestClassifier(n_estimators=100)
model1.fit(X_train, y_train)

In [None]:
# model1.feature_importances_?

In [None]:
feature_imp = pd.Series(data=model1.feature_importances_, index=iris.feature_names)
feature_imp = feature_imp.sort_values(ascending=False)

In [None]:
feature_imp

In [None]:
# Creating a bar plot
sns.barplot(x=feature_imp, y=feature_imp.index)
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()
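
One way to use these scores is to retrain on only the highest-ranked features and check how much accuracy changes. The sketch below is an illustration under assumptions: keeping the top two features is an arbitrary cutoff, and the names `X_df`, `full_model`, and `small_model` are introduced only for this example.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = load_iris()
X_df = pd.DataFrame(iris.data, columns=iris.feature_names)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_df, iris.target, test_size=0.33, random_state=0
)

# Fit on all four features and rank them by importance.
full_model = RandomForestClassifier(n_estimators=100, random_state=0)
full_model.fit(Xd_train, yd_train)
feature_imp = pd.Series(
    full_model.feature_importances_, index=X_df.columns
).sort_values(ascending=False)

# Keep the two highest-ranked features (arbitrary cutoff for this sketch).
top_features = feature_imp.index[:2].tolist()

small_model = RandomForestClassifier(n_estimators=100, random_state=0)
small_model.fit(Xd_train[top_features], yd_train)

full_acc = accuracy_score(yd_test, full_model.predict(Xd_test))
small_acc = accuracy_score(yd_test, small_model.predict(Xd_test[top_features]))
print(f"All features : {full_acc:.3f}")
print(f"Top features : {small_acc:.3f}  ({top_features})")
```

If the two accuracies are close, the dropped features were contributing little on this split. A different split or dataset may give a different picture.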

**Happy Learning :)**