# Random Forests
A random forest is an ensemble of decision trees. For classification, the final prediction is the class that receives the majority of the trees' votes; for regression, it is the average of the trees' predictions.
## Details
Each decision tree is built from a random subset of the samples and considers only a random subset of the features. The key points are:
- Selection of the feature subsets: The number of features in each subset is commonly chosen with the rules of thumb below. In the standard algorithm (and in scikit-learn) a fresh random subset is drawn at every split of a tree, not once per tree.
    - Classification: The square root of the total number of features
    - Regression: 1/3 of the total number of features
- Number of samples for each decision tree: Each tree is trained on a bootstrap sample, i.e. rows drawn at random with replacement from the training set. As a result, some rows are repeated and others are left out of a given tree.
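
The two sampling rules above can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation: `n_candidate_features` and `bootstrap_indices` are hypothetical helper names, the 7 feature columns match the penguins feature matrix used later in this document, and the 233 training rows are an assumption (333 complete rows minus the 100 test rows).

```python
import math
import random


def n_candidate_features(n_features, task):
    """Rule-of-thumb size of the random feature subset."""
    if task == "classification":
        return max(1, int(math.sqrt(n_features)))  # square root rule
    if task == "regression":
        return max(1, n_features // 3)  # one-third rule
    raise ValueError(f"unknown task: {task}")


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_features = 7          # columns in the penguins feature matrix
n_samples = 233         # assumed training-set size (see lead-in)

k = n_candidate_features(n_features, "classification")
feature_subset = sorted(rng.sample(range(n_features), k))
rows = bootstrap_indices(n_samples, rng)
unique_fraction = len(set(rows)) / n_samples

print(f"features per subset: {k}, chosen column indices: {feature_subset}")
print(f"{unique_fraction:.2f} of the rows appear in this bootstrap sample")
```

Because rows are drawn with replacement, a bootstrap sample contains on average about 63% of the distinct rows (1 - 1/e); the rest are left out of that tree.
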
### Splitting Method
- Gini Impurity: The probability that a randomly selected example would be misclassified if it were labeled at random according to the class distribution in the node. It ranges from 0 up to 1 - 1/k for k classes, so it is always below 1. A value of 0 means that all of the elements belong to a single class (label), which is the desired outcome.
- Information Gain: The reduction in entropy achieved by a split.
    - Entropy: A measure of randomness or uncertainty in the data. Low entropy is desired, since it means the node is less random and is dominated by a single class.
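
Both criteria can be computed directly from the class labels in a node. The sketch below uses the standard formulas (Gini = 1 - sum of p_i^2, entropy = -sum of p_i log2 p_i); the function names and the tiny label lists are made up for illustration.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Fraction of the node's examples that fall in each class."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini_impurity(labels):
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    return sum(-p * math.log2(p) for p in class_proportions(labels))


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children


pure = ["Adelie"] * 4
mixed = ["Adelie", "Adelie", "Gentoo", "Gentoo"]

print("pure node :", gini_impurity(pure), entropy(pure))
print("mixed node:", gini_impurity(mixed), entropy(mixed))
# Splitting the mixed node into two pure halves removes all uncertainty.
print("gain      :", information_gain(mixed, mixed[:2], mixed[2:]))
```

For a two-class node split evenly, the Gini impurity is 0.5 and the entropy is 1 bit, which are the maximum values for two classes.
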
### How is the final decision made?
It is achieved with an ensemble technique called bagging (bootstrap aggregating). For a given test sample, each decision tree generates a prediction, and the class with the most votes is selected as the final prediction.
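
The aggregation step can be sketched as follows. The helper names and the per-tree predictions are hypothetical; they only illustrate the idea of hard majority voting (classification) and averaging (regression).

```python
from collections import Counter
from statistics import mean


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def average_prediction(tree_predictions):
    """Regression counterpart: average the trees' numeric outputs."""
    return mean(tree_predictions)


# Hypothetical outputs of five trees for one test sample.
votes = ["Adelie", "Gentoo", "Adelie", "Adelie", "Chinstrap"]
print(majority_vote(votes))  # Adelie

# Hypothetical outputs of three regression trees.
print(average_prediction([3700, 3800, 3750]))
```

Note that scikit-learn's `RandomForestClassifier` uses a soft variant of this idea: it averages the class probabilities predicted by the trees and returns the class with the highest mean probability.
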
## Pros & Cons
- Pros:
    - Normalization not needed.
    - Reduced overfitting compared to a single decision tree
    - Good accuracy and generalization
    - Lower variance than a single decision tree
    - Performs well on large datasets
    - Suitable for Classification and Regression
- Cons:
    - Requires more training time
    - Interpretation is complex due to the high number of decision trees
    - Requires more memory
    - Computationally expensive

## Penguins Example

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

### Load data

In [2]:
penguins_data = pd.read_csv('files/penguins.csv')

In [3]:
penguins_data.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE


### Analyze Data

In [4]:
penguins_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


In [5]:
penguins_data.isnull().sum()

species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
dtype: int64

In [6]:
# Drop null values
penguins_data.dropna(inplace=True)

In [7]:
penguins_data.isnull().sum()

species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64

### Preprocess Data
Transform the categorical columns into numeric ones.

In [8]:
penguins_data.sex.unique()

array(['MALE', 'FEMALE'], dtype=object)

In [10]:
pd.get_dummies(penguins_data['sex']).head()

Unnamed: 0,FEMALE,MALE
0,False,True
1,True,False
2,True,False
4,True,False
5,False,True


In [11]:
# Since there are only two possible values, a single column is enough
sex_data = pd.get_dummies(penguins_data['sex'], drop_first=True)
sex_data.head()

Unnamed: 0,MALE
0,True
1,False
2,False
4,False
5,True


In [12]:
# Each penguin was observed on exactly one of three islands, so
# two columns are sufficient to represent the island
island_data = pd.get_dummies(penguins_data['island'], drop_first=True)
island_data.head()

Unnamed: 0,Dream,Torgersen
0,False,True
1,False,True
2,False,True
4,False,True
5,False,True


In [13]:
# Concatenate these new data with penguins_data
penguins_data = pd.concat([penguins_data, sex_data, island_data], axis=1)
penguins_data.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,MALE,Dream,Torgersen
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE,True,False,True
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE,False,False,True
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE,False,False,True
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE,False,False,True
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,MALE,True,False,True


In [14]:
# Drop the original categorical columns, which are now represented by the dummy columns
penguins_data.drop(['sex', 'island'], axis=1, inplace=True)
penguins_data.head()

Unnamed: 0,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,MALE,Dream,Torgersen
0,Adelie,39.1,18.7,181.0,3750.0,True,False,True
1,Adelie,39.5,17.4,186.0,3800.0,False,False,True
2,Adelie,40.3,18.0,195.0,3250.0,False,False,True
4,Adelie,36.7,19.3,193.0,3450.0,False,False,True
5,Adelie,39.3,20.6,190.0,3650.0,True,False,True


### Extract Data

In [19]:
x_data = penguins_data.drop('species', axis=1)
y_data = penguins_data['species']
x_data.head()

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,MALE,Dream,Torgersen
0,39.1,18.7,181.0,3750.0,True,False,True
1,39.5,17.4,186.0,3800.0,False,False,True
2,40.3,18.0,195.0,3250.0,False,False,True
4,36.7,19.3,193.0,3450.0,False,False,True
5,39.3,20.6,190.0,3650.0,True,False,True


#### Convert y_data from category to numeric

In [20]:
y_data.head()

0    Adelie
1    Adelie
2    Adelie
4    Adelie
5    Adelie
Name: species, dtype: object

In [21]:
y_data.unique()

array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object)

In [22]:
y_data = y_data.map({'Adelie':0, 'Chinstrap':1, 'Gentoo': 2})
y_data.head()

0    0
1    0
2    0
4    0
5    0
Name: species, dtype: int64

### Split Data

In [23]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.30, random_state=0)
# random_state=0 -> produces the same random train/test split on every run
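
The effect of a fixed seed can be illustrated with a small stand-in for the split. `seeded_split` is a hypothetical helper, not scikit-learn's implementation; it only mirrors the idea of shuffling row indices with a seeded generator and rounding the test size up. The 333 rows are an assumption: the 344 original rows minus the 11 rows that contain a missing value.

```python
import math
import random


def seeded_split(n_samples, test_size, seed):
    """Shuffle row indices with a seeded generator and cut off the test part."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * n_samples)  # test size is rounded up
    return indices[n_test:], indices[:n_test]


train_a, test_a = seeded_split(333, 0.30, seed=0)
train_b, test_b = seeded_split(333, 0.30, seed=0)
train_c, test_c = seeded_split(333, 0.30, seed=1)

print(len(train_a), len(test_a))  # 233 100
print(test_a == test_b)           # True: same seed, same split
print(test_a == test_c)           # a different seed gives a different split
```

The 100 test rows match the `support` total reported in the evaluation at the end of this example.
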

### Generate model and train

In [24]:
from sklearn.ensemble import RandomForestClassifier
rand_forest_model = RandomForestClassifier(n_estimators=5, criterion='entropy', random_state=0)
# n_estimators = number of decision trees
rand_forest_model.fit(x_train, y_train)

### Predict

In [25]:
y_pred = rand_forest_model.predict(x_test)

### Evaluate Results

In [27]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report, accuracy_score

In [28]:
confusion_matrix(y_test, y_pred)

array([[48,  0,  0],
       [ 1, 15,  0],
       [ 1,  0, 35]])

In [29]:
accuracy_score(y_test, y_pred)

0.98

In [30]:
# print() renders the report as a table instead of a raw string
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.96      1.00      0.98        48
           1       1.00      0.94      0.97        16
           2       1.00      0.97      0.99        36

    accuracy                           0.98       100
   macro avg       0.99      0.97      0.98       100
weighted avg       0.98      0.98      0.98       100
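
The accuracy and the per-class precision and recall can be checked by hand from the confusion matrix above, where rows are the true classes and columns are the predicted classes. This is a plain-Python cross-check with variable names chosen for illustration.

```python
# Confusion matrix copied from the output above (rows: true, columns: predicted).
confusion = [
    [48, 0, 0],
    [1, 15, 0],
    [1, 0, 35],
]
n_classes = len(confusion)

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(n_classes))
accuracy = correct / total
print(f"accuracy = {correct}/{total} = {accuracy:.2f}")

precisions, recalls = [], []
for k in range(n_classes):
    true_positives = confusion[k][k]
    predicted_as_k = sum(confusion[i][k] for i in range(n_classes))  # column sum
    actually_k = sum(confusion[k])                                   # row sum
    precisions.append(true_positives / predicted_as_k)
    recalls.append(true_positives / actually_k)
    print(f"class {k}: precision = {precisions[k]:.2f}, recall = {recalls[k]:.2f}")
```

For example, class 0 (Adelie) has precision 48/50 = 0.96 because two penguins of other species were predicted as Adelie, while its recall is 48/48 = 1.00.
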