# Random Forest



Random Forest is a popular supervised machine learning algorithm that can be used for both classification and regression problems. It is based on ensemble learning: combining multiple models (here, decision trees) to solve a complex problem and improve on the performance of any single model.

Random forests, also called random decision forests, are an ensemble learning method for classification, regression and other tasks. They operate by constructing a multitude of decision trees at training time and outputting the mode of the individual trees' predicted classes (classification) or the mean of their predictions (regression).
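As a minimal illustration of that aggregation step, the sketch below combines per-tree outputs with NumPy. The arrays `tree_votes` and `tree_values` are made-up values for this example, not output from a trained model. Note also that scikit-learn's `RandomForestClassifier` averages the trees' class probabilities rather than counting hard votes, so this is a simplified picture.

```python
import numpy as np

# Hypothetical predictions from 5 trees (rows) for 3 samples (columns).
tree_votes = np.array([
    [0, 1, 2],
    [0, 1, 1],
    [0, 2, 2],
    [1, 1, 2],
    [0, 1, 2],
])

# Classification: the forest outputs the most frequent class (the mode) per sample.
majority_class = np.array(
    [np.bincount(votes, minlength=3).argmax() for votes in tree_votes.T]
)
print(majority_class)  # [0 1 2]

# Hypothetical numeric predictions from 3 trees (rows) for 2 samples (columns).
tree_values = np.array([
    [2.0, 3.0],
    [4.0, 5.0],
    [3.0, 4.0],
])

# Regression: the forest outputs the mean of the trees' predictions per sample.
mean_prediction = tree_values.mean(axis=0)
print(mean_prediction)  # [3. 4.]
```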

![Random Forest](1.jpeg)

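**Adding more trees generally makes the predictions more stable and reduces the risk of overfitting compared with a single decision tree, although the gain in accuracy levels off beyond a certain number of trees.**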
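One way to see the effect of the number of trees is to measure cross-validated accuracy for a few forest sizes. The sketch below is only illustrative: the tree counts, the 5-fold setup, and the variable names are assumptions chosen for this example, and the exact scores depend on the scikit-learn version.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Mean 5-fold cross-validated accuracy for forests of increasing size.
scores_by_n_trees = {}
for n_trees in (1, 5, 25, 100):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    scores = cross_val_score(model, X, y, cv=5)
    scores_by_n_trees[n_trees] = scores.mean()
    print(f"{n_trees:>3} trees: mean CV accuracy = {scores.mean():.3f}")
```

On a small, easy dataset like iris the differences are usually modest; the plateau is the point to look for.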

## Why use Random Forest?

* **Versatility**: It can be used for both regression and classification tasks, and it is easy to view the relative importance it assigns to the input features.
* **Reduced overfitting**: Averaging over many trees reduces the risk of overfitting compared with a single decision tree.
* **Time**: Training is often fast, because the trees are independent of each other and can be built in parallel.
* **High accuracy**: It often predicts with high accuracy and runs efficiently even on large datasets.
* **Missing data**: Some implementations can maintain accuracy when a large proportion of the data is missing; others require missing values to be imputed first.
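The feature-importance point can be checked directly with scikit-learn's `feature_importances_` attribute. This is a small sketch on the iris dataset; the choice of 100 trees and the names `model` and `ranked` are assumptions for this example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(iris.data, iris.target)

# Pair each feature name with its importance and sort from most to least important.
ranked = sorted(
    zip(iris.feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:<20} {importance:.3f}")
```

The importances are normalized to sum to 1, so each value can be read as a share of the total.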

## How does the Random Forest algorithm work?

Random Forest works in two phases: the first builds the forest by combining N decision trees, and the second makes a prediction with each tree created in the first phase and aggregates the results.

Working process:

1. Choose the number N of decision trees that you want to build.
2. Select K random data points from the training set.
3. Build a decision tree on the selected data points (the subset).
4. Repeat steps 2 and 3 until N trees have been built.
5. For a new data point, collect the prediction of each decision tree and assign the point to the category that wins the majority vote.

![Random Forest](2.png)
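The working process can be sketched by hand with scikit-learn's `DecisionTreeClassifier`. This is a simplified illustration, not how `RandomForestClassifier` is implemented internally: it assumes K equals the training-set size, draws samples with replacement, and uses `max_features="sqrt"` to add per-split feature randomness. All variable names are made up for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

n_trees = 25        # N: the number of trees to build
n_samples = len(X)  # K: rows drawn per tree (assumed equal to the training-set size)

trees = []
for _ in range(n_trees):
    # Draw K random row indices with replacement (a bootstrap sample).
    idx = rng.integers(0, n_samples, size=n_samples)
    # Grow one tree on that sample, considering a random subset of features at each split.
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(0, 2**31 - 1))
    )
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Predict with every tree, then take the majority vote per data point.
new_points = X[:5]
all_votes = np.array([tree.predict(new_points) for tree in trees])  # shape (n_trees, 5)
forest_prediction = np.array(
    [np.bincount(votes.astype(int), minlength=3).argmax() for votes in all_votes.T]
)
print(forest_prediction)
```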

### Example:

**Problem Statement**: You are given some flowers and you are wondering which species of iris they belong to.

Let's try to predict the species of the flowers using Python machine learning (the Random Forest algorithm).

**Note**: Scikit-learn comes with a built-in dataset called the iris dataset.

### Loading dataset and importing libraries

In [2]:
import pandas as pd
import numpy as np
# Load the dataset
from sklearn.datasets import load_iris
# Import the Random Forest classifier
from sklearn.ensemble import RandomForestClassifier
# Fix the seed so that the same random numbers are drawn on every run
np.random.seed(0)

In [3]:
# Creating a data object
iris = load_iris()

In [5]:
# Creating a pandas dataframe with the feature variables
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [8]:
# Let's take a look at the target
# Add a new column to the dataframe with the species name
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
# Check the data
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


#### Splitting the data into train and test sets

In [12]:
# Flag each row as training data with probability 0.75 (roughly a 75/25 split)
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species | is_train |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | True |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | False |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | True |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | True |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | False |
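With a random mask like the one above, the exact size of each part depends on the random draw. As an alternative sketch, not the approach used in this walkthrough, scikit-learn's `train_test_split` produces a split of fixed size and, with `stratify`, keeps the class proportions the same in both parts. The variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 25% of the rows for testing, keeping the three species balanced.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print(X_train.shape, X_test.shape)
```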


#### Let's train and test our model

In [13]:
# Split the rows into training and test dataframes using the boolean flag
train, test = df[df['is_train']], df[~df['is_train']]
# Show the number of observations in the train and test dataframes
print('Number of observations in the training data:', len(train))
print('Number of observations in the test data:', len(test))

Number of observations in the training data: 113
Number of observations in the test data: 37


In [14]:
# Collect the names of the four feature columns
features = df.columns[:4]
features

Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)'],
      dtype='object')

In [15]:
# Convert each species name into an integer code that the classifier can work with
y = pd.factorize(train['species'])[0]
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2], dtype=int32)
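It helps to know how `pd.factorize` assigns those codes: they follow the order in which values first appear, not alphabetical order. The toy series below is made up for this example and shows why the codes line up with `iris.target_names` here only because the training rows happen to be ordered setosa, versicolor, virginica.

```python
import pandas as pd

# A made-up series in which 'virginica' appears before 'versicolor'.
species = pd.Series(["setosa", "virginica", "setosa", "versicolor"])

codes, uniques = pd.factorize(species)
print(codes)          # [0 1 0 2]
print(list(uniques))  # ['setosa', 'virginica', 'versicolor']
```

Keeping `uniques` around is a safe way to map predicted codes back to names, whatever the row order.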

#### Creating our model

In [17]:
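# n_estimators=10 matches the recorded output below; scikit-learn 0.22 and later default to 100 trees
clf = RandomForestClassifier(n_estimators=10, n_jobs=2, random_state=0)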

#### Training the classifier

In [19]:
clf.fit(train[features], y)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=2,
                       oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

#### Testing the model 

In [20]:
clf.predict(test[features])

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 1, 1, 1, 2,
       2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2], dtype=int32)

In [21]:
# Viewing the predicted probabilities of the first 10 observations
clf.predict_proba(test[features])[0:10]

array([[1. , 0. , 0. ],
       [1. , 0. , 0. ],
       [1. , 0. , 0. ],
       [0.9, 0.1, 0. ],
       [1. , 0. , 0. ],
       [1. , 0. , 0. ],
       [1. , 0. , 0. ],
       [0.9, 0.1, 0. ],
       [1. , 0. , 0. ],
       [1. , 0. , 0. ]])
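The values in steps of 0.1 come from averaging over 10 trees: each fully grown tree typically gives a probability of 0 or 1 per class, and the forest reports the mean. The sketch below checks this on a separately trained forest. The names `forest` and `samples` are assumptions for this example, and it fits on the full iris data rather than on the train split used above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# The last two setosa rows and the first three versicolor rows.
samples = X[48:53]

# Class probabilities from each individual tree, stacked along a new first axis.
per_tree = np.array([tree.predict_proba(samples) for tree in forest.estimators_])
print(per_tree.shape)  # (10, 5, 3)

# Averaging over the trees should reproduce the forest's own predict_proba.
manual_average = per_tree.mean(axis=0)
matches = np.allclose(manual_average, forest.predict_proba(samples))
print(matches)
```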

In [24]:
# Map each predicted class code back to its species name
preds = iris.target_names[clf.predict(test[features])]
# Show the predicted species for the first five observations
preds[0:5]

array(['setosa', 'setosa', 'setosa', 'setosa', 'setosa'], dtype='<U10')

In [25]:
# Show the actual species for the first five observations
test['species'].head()

1     setosa
4     setosa
9     setosa
18    setosa
21    setosa
Name: species, dtype: category
Categories (3, object): [setosa, versicolor, virginica]

#### Let's create a confusion matrix of actual versus predicted species

In [26]:
pd.crosstab(test['species'], preds, rownames=['Actual Species'], colnames=['Predicted Species'])

| Actual Species (rows) vs. Predicted Species (columns) | setosa | versicolor | virginica |
|---|---|---|---|
| setosa | 13 | 0 | 0 |
| versicolor | 0 | 7 | 1 |
| virginica | 0 | 2 | 14 |
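The confusion matrix above can be summarized as a single accuracy figure and as per-species recall. The sketch below copies the counts from the table by hand; the variable names are illustrative.

```python
import numpy as np

# Counts copied from the confusion matrix above (rows = actual, columns = predicted).
confusion = np.array([
    [13, 0, 0],   # setosa
    [0, 7, 1],    # versicolor
    [0, 2, 14],   # virginica
])

# Accuracy: correctly classified flowers (the diagonal) over all test flowers.
accuracy = np.trace(confusion) / confusion.sum()
print(f"{accuracy:.3f}")  # 0.919

# Recall per species: correct predictions over the actual count in each row.
recall = np.diag(confusion) / confusion.sum(axis=1)
print(recall)  # [1.    0.875 0.875]
```

So 34 of the 37 test flowers are classified correctly, and all three errors are confusions between versicolor and virginica.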
