# Random Forest
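Random forests often reach high accuracy with relatively short training time.
Typical use: multiclass object detection, where they hold up well in complicated environments.

Example: Kinect, the motion-sensing accessory for a game console, tracks body movement. A random forest classifier is trained on a set of labelled body parts and then identifies the player's body parts while they dance, so the game can score the moves.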

## Why Random Forest
1. Less prone to overfitting than a single decision tree, and training time is short
2. High accuracy, even on large datasets
3. Can estimate missing data

## What is Random Forest
A random forest operates by constructing multiple decision trees during the training phase. The class chosen by the majority of the trees becomes the forest's final decision.
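
The majority decision can be pictured as a simple vote count. The sketch below is illustrative only: the function name `majority_vote` and the five votes are made up for the example, not taken from any library.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the label that the largest number of trees predicted."""
    counts = Counter(tree_predictions)
    label, _ = counts.most_common(1)[0]
    return label

# Hypothetical votes from five trees for a single sample
votes = ["apple", "lemon", "apple", "apple", "grape"]
forest_decision = majority_vote(votes)
print(forest_decision)  # apple
```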

### Entropy
The measure of randomness or unpredictability in the dataset
### Information Gain
Measure of decrease in the entropy after the dataset is split
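
As a rough sketch of the two definitions above, assuming the usual Shannon entropy with base-2 logarithms: entropy is the negative sum of `p * log2(p)` over the class proportions `p`, and information gain is the parent's entropy minus the size-weighted entropy of the child groups. The function names and the four-fruit sample are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Illustrative sample: two classes in equal proportion, split perfectly
parent = ["apple", "apple", "lemon", "lemon"]
left, right = ["apple", "apple"], ["lemon", "lemon"]

print(entropy(parent))                          # 1.0
print(information_gain(parent, [left, right]))  # 1.0
```

A perfectly mixed two-class group has entropy 1 bit; a split that leaves both children pure removes all of it, so the gain equals the parent's entropy.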
### Leaf Node
Carries the classification or the decision
### Decision Node
Has two or more branches
### Root Node
Topmost decision node

Example: a bowl of fruit. We want to classify the different types of fruit based on their features. The mixed bowl has high entropy. Let's train with a dataset of (Color, Diameter, Label) rows.
Objective: frame the conditions that split the data in such a way that the information gain (IG) is the highest.

Gain is the decrease in entropy after splitting, so the best condition is the one that leaves the purest groups.

1. Consider candidate conditions (color == purple? | diameter == 3? | color == yellow? | color == red? | diameter == 1?)
2. We have lots of fruits with diameter 3, so split on diameter >= 3?
3. Stop splitting a branch once its entropy is 0 (all of its fruits share one label)
4. Split the remaining mixed group again by color to separate lemons from apples
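
The splitting steps above can be tried on a tiny table. This is a sketch under stated assumptions: the five fruit rows are invented for illustration (the notes do not list the actual data), and scikit-learn's `DecisionTreeClassifier` with `criterion="entropy"` stands in for the hand-made splits. It works on numeric thresholds, so a question such as `color == red` shows up as a threshold on a one-hot column.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical fruit bowl (values invented for illustration)
fruits = pd.DataFrame({
    "color": ["red", "red", "yellow", "purple", "purple"],
    "diameter": [3, 3, 3, 1, 1],
    "label": ["apple", "apple", "lemon", "grape", "grape"],
})

# One-hot encode color so each column answers a yes/no question like color == red
X = pd.get_dummies(fruits[["color", "diameter"]], columns=["color"], dtype=int)
y = fruits["label"]

# criterion="entropy" picks, at each node, the split with the highest information gain
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

# Print the learned conditions and check that every fruit ends in a pure leaf
print(export_text(tree, feature_names=list(X.columns)))
print(tree.predict(X))
```

The printed tree stops growing as soon as every leaf holds a single label, which is step 3 in the list.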

A random forest combines multiple decision trees, each of which splits the data on different conditions.
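
One way the trees end up with different conditions is sketched below. It is a hand-rolled approximation, not scikit-learn's implementation: each tree is fit on a bootstrap sample of the rows and considers a random subset of features at each split (`max_features="sqrt"`). The tree count of 10 and the seeds are arbitrary choices. Note that scikit-learn's `RandomForestClassifier` combines its trees by averaging predicted class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

n_trees = 10  # arbitrary choice for the illustration
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw as many rows as the dataset has, with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" makes every split look at a random subset of the features
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Every tree votes on every sample; the most common class per sample wins
votes = np.stack([t.predict(X) for t in trees])  # shape: (n_trees, n_samples)
forest_pred = np.array([np.bincount(col, minlength=3).argmax() for col in votes.T])

print("training accuracy of the hand-built forest:", (forest_pred == y).mean())
```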

In [2]:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
# Setting random seed
np.random.seed(0)

In [3]:
# Creating an object with the iris data loaded in it
iris = load_iris()
# Create a data frame initialized with iris data
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Show first 5 rows of the dataset
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [5]:
# Adding a new column for the species name
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [7]:
# Randomly mark roughly 75% of the rows as training data; the rest become test data
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species | is_train |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | True |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | False |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | True |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | True |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | True |


In [8]:
# Split the rows into training and test frames using the boolean mask
train, test = df[df['is_train']], df[~df['is_train']]
print("number of observations in the training data:", len(train))
print("number of observations in the test data:", len(test))

number of observations in the training data: 112
number of observations in the test data: 38
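
The random mask gives only approximately a 75/25 split. As an alternative sketch (not what the notebook does), scikit-learn's `train_test_split` holds out a fixed fraction and, with `stratify`, keeps the species proportions similar in both parts. The `random_state=0` value is an arbitrary choice.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Hold out 25% of the rows, stratified so each species is represented in both parts
train, test = train_test_split(
    df, test_size=0.25, stratify=df["species"], random_state=0
)
print("training rows:", len(train))
print("test rows:", len(test))
```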


In [9]:
features = df.columns[:4]
features

Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)'],
      dtype='object')

In [10]:
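# Encode the species names as integer codes. factorize numbers labels in order of first
# appearance, which matches iris.target_names here because the setosa rows come first.
y = pd.factorize(train['species'])[0]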
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2])

In [12]:
# Train a random forest on the four feature columns (n_jobs=2 uses two CPU cores)
clf = RandomForestClassifier(n_jobs=2, random_state=0)
clf.fit(train[features], y)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=2,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [13]:
# Predict the species code (0, 1 or 2) for every test row
clf.predict(test[features])

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
       2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [15]:
# Class probabilities for test rows 10 to 19 (columns: setosa, versicolor, virginica)
clf.predict_proba(test[features])[10:20]

array([[0. , 1. , 0. ],
       [0. , 1. , 0. ],
       [0. , 1. , 0. ],
       [0. , 1. , 0. ],
       [0. , 1. , 0. ],
       [0. , 1. , 0. ],
       [0. , 1. , 0. ],
       [0. , 0.9, 0.1],
       [0. , 1. , 0. ],
       [0. , 0.2, 0.8]])
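
The probabilities above come in steps of 0.1 because the forest in this run has 10 trees: scikit-learn averages the class probabilities reported by the individual trees, and a fully grown tree with pure leaves reports 0 or 1. The sketch below checks that averaging by hand on the full iris data; it refits its own model, so its numbers will not match the notebook's test rows.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Average the class probabilities reported by each individual tree
per_tree = np.stack([tree.predict_proba(X) for tree in clf.estimators_])
manual = per_tree.mean(axis=0)

# The hand-computed average should agree with the forest's own output
print(np.allclose(manual, clf.predict_proba(X)))
```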

In [18]:
# Map the predicted codes back to species names
preds = iris.target_names[clf.predict(test[features])]
preds[0:25]

array(['setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
       'setosa', 'setosa', 'setosa', 'setosa', 'versicolor', 'versicolor',
       'versicolor', 'versicolor', 'versicolor', 'versicolor',
       'versicolor', 'versicolor', 'versicolor', 'virginica', 'virginica',
       'virginica', 'virginica', 'virginica', 'virginica'], dtype='<U10')

In [20]:
test.species

1          setosa
5          setosa
6          setosa
13         setosa
14         setosa
15         setosa
24         setosa
27         setosa
34         setosa
43         setosa
59     versicolor
60     versicolor
65     versicolor
69     versicolor
71     versicolor
73     versicolor
75     versicolor
78     versicolor
90     versicolor
101     virginica
102     virginica
104     virginica
108     virginica
113     virginica
115     virginica
120     virginica
121     virginica
123     virginica
125     virginica
126     virginica
127     virginica
129     virginica
131     virginica
136     virginica
142     virginica
147     virginica
148     virginica
149     virginica
Name: species, dtype: category
Categories (3, object): [setosa, versicolor, virginica]

In [21]:
# Confusion matrix: rows are the actual species, columns the predicted species
pd.crosstab(test.species, preds, rownames=['Actual Species'], colnames=["Predicted Species"])

| Actual Species \ Predicted Species | setosa | versicolor | virginica |
|---|---|---|---|
| setosa | 10 | 0 | 0 |
| versicolor | 0 | 9 | 0 |
| virginica | 0 | 1 | 18 |
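
The crosstab above can be reduced to summary numbers. A minimal sketch, with the counts copied from that table: accuracy is the diagonal (correct predictions) divided by the total, and per-class recall is each diagonal entry divided by its row sum.

```python
import numpy as np

# Counts copied from the crosstab: rows = actual, columns = predicted,
# both in the order setosa, versicolor, virginica
cm = np.array([
    [10, 0, 0],
    [0, 9, 0],
    [0, 1, 18],
])

accuracy = np.trace(cm) / cm.sum()               # correct predictions / all predictions
recall_per_class = np.diag(cm) / cm.sum(axis=1)  # correct / actual count, per species

print(f"accuracy: {accuracy:.3f}")  # accuracy: 0.974
print("recall per class:", recall_per_class)
```

Only one of the 38 test flowers is misclassified: a virginica predicted as versicolor.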


In [24]:
# Classify one new flower: sepal length, sepal width, petal length, petal width (cm)
preds = iris.target_names[clf.predict([[5.0, 3.6, 1.4, 2.0]])]
preds

array(['virginica'], dtype='<U10')