## Random Forests

A random forest combines multiple decision trees and merges their predictions to get a more accurate and stable prediction.<br>
Decision trees are prone to overfitting. Pruning can reduce it considerably, but the more nodes a tree grows in order to reach pure nodes, the higher the chance of overfitting.<br>
Not every feature is used in every decision tree that is built.<br>
We create many decision trees so that all features are included somewhere in the forest.<br>
The randomness in selecting the features and data points helps reduce overfitting.<br>
There are many ways to combine the trees' answers into a final answer. One of them is majority voting: the final answer is the answer given by the majority of the decision trees in the forest.<br>
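
As a minimal sketch of majority voting, the snippet below counts the answers of a few trees and returns the most common one. The function name `majority_vote` and the list of votes are made up for illustration; they are not part of any library.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    label, _ = counts.most_common(1)[0]
    return label


# Hypothetical predictions from five trees for a single data point
votes = ["setosa", "versicolor", "setosa", "setosa", "virginica"]
final_answer = majority_vote(votes)
print(final_answer)  # setosa
```

Note that scikit-learn's `RandomForestClassifier` combines its trees by averaging their predicted class probabilities rather than by counting hard votes, so this sketch shows the idea, not that library's exact rule.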

## Data bagging

Bagging (short for bootstrap aggregating) is a very powerful ensemble method.<br>
An ensemble method is a technique that combines the predictions from multiple machine learning algorithms together to make more accurate predictions than any individual model.<br>
Our goal is to reduce the variance of the decision trees. Combining the predictions of multiple trees generally improves accuracy and precision.<br>
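
One way to see the effect of bagging is to compare a single decision tree with a bagged ensemble of trees under cross-validation. The sketch below uses scikit-learn's `BaggingClassifier`; the choice of 50 trees, 5 folds, and the seeds are assumptions made only for this example. On a small, easy dataset such as iris the difference may be small.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A single tree versus 50 trees, each trained on its own bootstrap sample
single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

tree_scores = cross_val_score(single_tree, X, y, cv=5)
bag_scores = cross_val_score(bagged_trees, X, y, cv=5)

# The mean is the average accuracy; the standard deviation shows the spread across folds
print("single tree :", tree_scores.mean(), tree_scores.std())
print("bagged trees:", bag_scores.mean(), bag_scores.std())
```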

## Procedure:
We create smaller datasets in which data points may be repeated and only some randomly selected features are kept. Bagging is generally done on data points, not features.<br>

These smaller data-sets are obtained by choosing the data-points and the features in the following manner:<br>
1. Features are selected at random without repetition<br>
2. Data-points are selected at random with repetition (this is the actual bagging step)<br>
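
The two sampling rules above can be sketched with NumPy. The toy array, the seed, and the number of selected features (`n_selected_features`) are placeholders chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed used only to make the sketch reproducible

# Toy dataset: 8 data points, 4 features (the values are placeholders)
X = np.arange(32).reshape(8, 4)
n_samples, n_features = X.shape
n_selected_features = 2  # hypothetical choice for this sketch

# 1. Features: sampled at random WITHOUT repetition
feature_idx = rng.choice(n_features, size=n_selected_features, replace=False)

# 2. Data points: sampled at random WITH repetition (the bagging step)
row_idx = rng.choice(n_samples, size=n_samples, replace=True)

# Dataset for one tree: the sampled rows, restricted to the sampled features
X_subset = X[row_idx][:, feature_idx]
print(X_subset.shape)  # (8, 2)
```

Because the rows are drawn with repetition, some data points usually appear more than once in `X_subset` while others do not appear at all.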


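No data point should be left out, so we increase the number of trees: the more trees there are, the more likely it is that every data point appears in at least one tree's dataset.<br>
Selecting different features for the different datasets helps us learn the relative importance of each feature.<br>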
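
To illustrate why more trees help cover all data points, the sketch below simulates bootstrap samples and measures the fraction of data points that appear in at least one of them. A single bootstrap sample of size n leaves out roughly 1/e (about 37%) of the points on average when n is large. The helper `fraction_covered`, the dataset size, and the seed are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100  # hypothetical dataset size


def fraction_covered(n_trees):
    """Fraction of data points that appear in at least one bootstrap sample."""
    seen = np.zeros(n_samples, dtype=bool)
    for _ in range(n_trees):
        rows = rng.choice(n_samples, size=n_samples, replace=True)
        seen[rows] = True
    return seen.mean()


# Each call draws fresh samples, so the printed values vary with the seed
for n_trees in (1, 2, 5, 10):
    print(n_trees, fraction_covered(n_trees))
```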

## Feature selection

Feature selection means keeping only the useful features, that is, those that make a major contribution to the output.<br>

The advantages of feature selection are as follows:<br>
1. Reduces overfitting<br>
2. Improves accuracy of the model<br>
3. Reduces training time<br>

## Implementation using sklearn

We will be using the iris dataset to see how feature selection works.<br>
Let's import all the necessary libraries.

In [14]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets,tree
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
import pandas as pd

Load the iris dataset and initialize 'X' to features and 'Y' to labels.

In [15]:
iris = datasets.load_iris()
features = iris.feature_names
X = iris.data
Y = iris.target

Split the dataset into training and testing data, holding out 40% of the data points for testing.

In [16]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4, random_state = 14)

We use RandomForestClassifier to train the model. The number of decision trees (n_estimators) is set to 10000.

In [17]:
clf = RandomForestClassifier(n_estimators=10000, n_jobs=-1, random_state = 14)

In [18]:
# Train the classifier
clf.fit(X_train, Y_train)

RandomForestClassifier(n_estimators=10000, n_jobs=-1, random_state=14)

In [19]:
clf.score(X_test, Y_test)

0.95

## Implementing feature selection

In [20]:
feature_importances = pd.DataFrame(clf.feature_importances_, index = features, columns=['importance']).sort_values('importance', ascending=False)
feature_importances

| | importance |
|---|---|
| petal width (cm) | 0.459124 |
| petal length (cm) | 0.400622 |
| sepal length (cm) | 0.11811 |
| sepal width (cm) | 0.022145 |


This shows that <b>petal length</b> and <b>petal width</b> are much more important than the other two features, i.e. <b>sepal length</b> and <b>sepal width</b>.

In [21]:
# Build a selector that keeps only the important features:
# those whose importance value is greater than or equal to 0.15
sfm = SelectFromModel(clf, threshold = 0.15)

In [22]:
sfm.fit(X_train, Y_train)

SelectFromModel(estimator=RandomForestClassifier(n_estimators=10000, n_jobs=-1,
                                                 random_state=14),
                threshold=0.15)

In [23]:
# Create a data subset picking only important features out of all the features.
X_important_train = sfm.transform(X_train)
X_important_test = sfm.transform(X_test)
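
To check which features the selector actually kept, `SelectFromModel.get_support()` returns a boolean mask over the original columns. The self-contained sketch below repeats the steps above with only 100 trees, an assumption made to keep it fast, so the exact importances can differ from the 10000-tree run.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=14
)

# Fewer trees than in the walkthrough, only to keep this sketch fast
sfm = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=14), threshold=0.15
)
sfm.fit(X_train, Y_train)

# Boolean mask over the original feature columns: True means the feature is kept
mask = sfm.get_support()
selected = [name for name, keep in zip(iris.feature_names, mask) if keep]
print(selected)

# The transformed data has one column per selected feature
print(sfm.transform(X_train).shape)
```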

In [24]:
# New random forest classifier with only important features
clf_important = RandomForestClassifier(n_estimators=10000, n_jobs=-1, random_state = 14)

We fit this classifier on the subset that contains only the important features.

In [25]:
clf_important.fit(X_important_train, Y_train)

RandomForestClassifier(n_estimators=10000, n_jobs=-1, random_state=14)

We look at the score, which tells us how well the model performs on the test data.

In [26]:
clf_important.score(X_important_test, Y_test)

0.9666666666666667

On this train/test split, the model trained on only the selected features scores 0.9667, compared with 0.95 for the model trained on all four features.