# Random Forest
### A single Decision Tree is not powerful enough, but an entire forest is!
### Random forest is a method that operates by constructing multiple Decision Trees during training phase. The Decision of the majority of the trees is chosen by the random forest as the final decision.
* No overfitting
  * Use of multiple trees reduce the risk of overfitting (fit the data too close that we capture unimportant part)
  * Training time is less
* High accuracy
  * Runs efficiently on large  database
  * For large data, it produces highly accurate predictions
* Estimates missing data
  * Random Forest can maintain accuracy when a large proportion of data is missing

<img src="Image/Random_Forest.JPG" width="600" height="300">

### Application of Random Forest
* Remote Sensing
  * Used in ETM devices to acquire images of the earth's surface
  * Accuracy is higher and training tome is less
* Object Detection
  * Multi-class object detection is done using Random Forest algorithm
  * Provides better detection in complicated environments
* Kinect
  * Random Forest is used in a game console called Kinect
  * Tracks body movements and recreates it in the game

### Use Case - Iris 

### 1. Importing the libraries

In [2]:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
# Setting random seed
np.random.seed(0)

### 2. Creating  dataframe of iris data with the four feature variables

In [3]:
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [4]:
#Adding a new column for the species name (target variable)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   sepal length (cm)  150 non-null    float64 
 1   sepal width (cm)   150 non-null    float64 
 2   petal length (cm)  150 non-null    float64 
 3   petal width (cm)   150 non-null    float64 
 4   species            150 non-null    category
dtypes: category(1), float64(4)
memory usage: 5.1 KB


### 3. Separating the Target variable

In [35]:
# all features
X = df.iloc[:, :-1]
# translate species to number
#Y = pd.factorize(df.iloc[:, -1])[0]
Y = df.iloc[:, -1]

In [36]:
Y

0         setosa
1         setosa
2         setosa
3         setosa
4         setosa
         ...    
145    virginica
146    virginica
147    virginica
148    virginica
149    virginica
Name: species, Length: 150, dtype: category
Categories (3, object): ['setosa', 'versicolor', 'virginica']

### 4. Splitting the data into Train and Test set

In [71]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=200)
features = df.columns[:4]

### 5. Fitting Random Forest Model to training set

In [72]:
clf = RandomForestClassifier(n_jobs=2, random_state=0)
clf.fit(X, Y)

### 6. Predicting the test set result

In [73]:
y_pred = clf.predict(X_test)
y_pred

array(['versicolor', 'virginica', 'setosa', 'setosa', 'versicolor',
       'virginica', 'versicolor', 'versicolor', 'versicolor', 'virginica',
       'virginica', 'setosa', 'setosa', 'setosa', 'virginica', 'setosa',
       'versicolor', 'setosa', 'versicolor', 'virginica', 'setosa',
       'versicolor', 'virginica', 'setosa', 'setosa', 'setosa',
       'versicolor', 'virginica', 'virginica', 'versicolor'], dtype=object)

In [74]:
# predicted probability of the first 10 observations  (Vote!)
clf.predict_proba(X_test)[:10]

array([[0.03, 0.94, 0.03],
       [0.  , 0.  , 1.  ],
       [1.  , 0.  , 0.  ],
       [1.  , 0.  , 0.  ],
       [0.  , 1.  , 0.  ],
       [0.  , 0.  , 1.  ],
       [0.  , 1.  , 0.  ],
       [0.  , 0.98, 0.02],
       [0.  , 1.  , 0.  ],
       [0.  , 0.01, 0.99]])

### 7. Creating Confusion Matrix

In [75]:
pd.crosstab(y_test,y_pred, rownames=['Actual Species'], colnames=['Predicted Species'])

Predicted Species,setosa,versicolor,virginica
Actual Species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
setosa,11,0,0
versicolor,0,10,0
virginica,0,0,9


### 8. Evaluating the model

In [76]:
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test,y_pred)
score

1.0