# Random Forest Algorithm (Classification)

This algorithm is an extension of bagging. In addition to training each tree on a random subset of the data (as in the bagging algorithm), it considers only a random selection of features, rather than all of them, when growing the trees. Because the algorithm creates multiple trees, it is called a **Random Forest**.
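The two sources of randomness described above can be sketched with the standard library alone. This is only an illustration, not the scikit-learn implementation: the helper names `bootstrap_indices` and `random_feature_subset` are hypothetical, the square-root rule for the subset size is an assumption, and the features are drawn once here for brevity, whereas scikit-learn's `RandomForestClassifier` draws candidate features at every split.

```python
import math
import random


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (the bagging step)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def random_feature_subset(n_features, rng):
    """Pick about sqrt(n_features) distinct feature indices (assumed rule)."""
    k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(0)  # arbitrary fixed seed so the sketch is repeatable
n_rows, n_features = 10, 16  # 16 predictors, as in the letter data

rows = bootstrap_indices(n_rows, rng)
features = random_feature_subset(n_features, rng)

print("bootstrap rows:", rows)
print("candidate features:", features)
```

Because rows are drawn with replacement, some indices usually repeat and others are left out, while the feature indices are always distinct.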

Data Source: [Letter Recognition](https://archive.ics.uci.edu/ml/datasets/Letter+Recognition)

**Data Attributes**
1. `lettr`: capital letter (26 values from A to Z)
2. `x-box`: horizontal position of box (integer)
3. `y-box`: vertical position of box (integer)
4. `width`: width of box (integer)
5. `high`: height of box (integer)
6. `onpix`: total number of on pixels (integer)
7. `x-bar`: mean x of on pixels in box (integer)
8. `y-bar`: mean y of on pixels in box (integer)
9. `x2bar`: mean x variance (integer)
10. `y2bar`: mean y variance (integer)
11. `xybar`: mean x y correlation (integer)
12. `x2ybr`: mean of x * x * y (integer)
13. `xy2br`: mean of x * y * y (integer)
14. `x-ege`: mean edge count left to right (integer)
15. `xegvy`: correlation of x-ege with y (integer)
16. `y-ege`: mean edge count bottom to top (integer)
17. `yegvx`: correlation of y-ege with x (integer)

In [1]:
# Importing the necessary packages
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

In [2]:
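# Load and read the data
# Note: the raw file has no header row, so without header=None pandas uses the
# first record as column names; this is why 19,999 rows (not 20,000) are reported below.
letter = pd.read_csv("./letter_recog/letter-recognition.txt")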
letter

Unnamed: 0,T,2,8,3,5,1,8.1,13,0,6,6.1,10,8.2,0.1,8.3,0.2,8.4
0,I,5,12,3,7,2,10,5,5,4,13,3,9,2,8,4,10
1,D,4,11,6,8,6,10,6,2,6,10,3,7,3,7,3,9
2,N,7,11,6,6,3,5,9,4,6,4,4,10,6,10,2,8
3,G,2,1,3,1,1,8,6,6,6,6,5,9,1,7,5,10
4,S,4,11,5,8,3,8,8,6,9,5,6,6,0,8,9,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19994,D,2,2,3,3,2,7,7,7,6,6,6,4,2,8,3,7
19995,C,7,10,8,8,4,4,8,6,9,12,9,13,2,9,3,7
19996,T,6,9,6,7,5,6,11,3,7,11,9,5,2,12,2,4
19997,S,2,3,4,2,1,8,7,2,6,10,6,8,1,9,5,8


In [3]:
# Renaming the columns as per the data source
letter.columns = ["letter", "x.box", "y.box", "width", "high", "onpix", "x.bar", "y.bar", "x2.bar", "y2.bar",
                 "xybar", "x2ybr", "xy2br", "x.ege", "xegvy", "y.ege", "yegvx"]
letter

Unnamed: 0,letter,x.box,y.box,width,high,onpix,x.bar,y.bar,x2.bar,y2.bar,xybar,x2ybr,xy2br,x.ege,xegvy,y.ege,yegvx
0,I,5,12,3,7,2,10,5,5,4,13,3,9,2,8,4,10
1,D,4,11,6,8,6,10,6,2,6,10,3,7,3,7,3,9
2,N,7,11,6,6,3,5,9,4,6,4,4,10,6,10,2,8
3,G,2,1,3,1,1,8,6,6,6,6,5,9,1,7,5,10
4,S,4,11,5,8,3,8,8,6,9,5,6,6,0,8,9,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19994,D,2,2,3,3,2,7,7,7,6,6,6,4,2,8,3,7
19995,C,7,10,8,8,4,4,8,6,9,12,9,13,2,9,3,7
19996,T,6,9,6,7,5,6,11,3,7,11,9,5,2,12,2,4
19997,S,2,3,4,2,1,8,7,2,6,10,6,8,1,9,5,8
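The header of the first table above (`T,2,8,3,...`) suggests that pandas used the first record of the raw file as column names. A minimal sketch of one way to avoid this is to pass `header=None` together with `names`. The in-memory `sample` below is a stand-in for the real file and holds only the two records recoverable from the outputs above; the variable names are hypothetical.

```python
import io

import pandas as pd

# Stand-in for ./letter_recog/letter-recognition.txt: the record that ended up
# as the header above, followed by the first displayed row.
sample = io.StringIO(
    "T,2,8,3,5,1,8,13,0,6,6,10,8,0,8,0,8\n"
    "I,5,12,3,7,2,10,5,5,4,13,3,9,2,8,4,10\n"
)

columns = ["letter", "x.box", "y.box", "width", "high", "onpix", "x.bar", "y.bar",
           "x2.bar", "y2.bar", "xybar", "x2ybr", "xy2br", "x.ege", "xegvy",
           "y.ege", "yegvx"]

# header=None keeps the first record as data; names assigns the column labels.
letter_demo = pd.read_csv(sample, header=None, names=columns)
print(letter_demo.shape)
print(letter_demo["letter"].tolist())
```

With the real file, replacing `sample` by the file path should keep every record and make the separate renaming step unnecessary.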


In [4]:
# Export the data to .csv format
# letter.to_csv("./letter_recog/letter.csv", index = False)

In [5]:
# Load and read the .csv format of data
letter_rec = pd.read_csv("./letter_recog/letter.csv")
letter_rec.head()

Unnamed: 0,letter,x.box,y.box,width,high,onpix,x.bar,y.bar,x2.bar,y2.bar,xybar,x2ybr,xy2br,x.ege,xegvy,y.ege,yegvx
0,I,5,12,3,7,2,10,5,5,4,13,3,9,2,8,4,10
1,D,4,11,6,8,6,10,6,2,6,10,3,7,3,7,3,9
2,N,7,11,6,6,3,5,9,4,6,4,4,10,6,10,2,8
3,G,2,1,3,1,1,8,6,6,6,6,5,9,1,7,5,10
4,S,4,11,5,8,3,8,8,6,9,5,6,6,0,8,9,7


In [6]:
# Display the characteristics of dataset
print("Dimensions of dataset is: ", letter_rec.shape)
print("The variables present in dataset are: \n", letter_rec.columns)

Dimensions of dataset is:  (19999, 17)
The variables present in dataset are: 
 Index(['letter', 'x.box', 'y.box', 'width', 'high', 'onpix', 'x.bar', 'y.bar',
       'x2.bar', 'y2.bar', 'xybar', 'x2ybr', 'xy2br', 'x.ege', 'xegvy',
       'y.ege', 'yegvx'],
      dtype='object')


In [7]:
# Seed NumPy's global random number generator so the train-test split below is reproducible
np.random.seed(3000)

In [8]:
# Train-test split, then separation of the independent features from the dependent (target) feature
# Here letter is the target feature; all remaining columns are independent features
training, test = train_test_split(letter_rec, test_size = 0.3)

x_trg = training.drop("letter", axis = 1)
y_trg = training["letter"]

x_test = test.drop("letter", axis = 1)
y_test = test["letter"]
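The split above relies on the global NumPy seed for reproducibility. As a hedged alternative sketch, `train_test_split` also accepts `random_state` to pin the split directly and `stratify` to keep class proportions equal in both parts. The toy data and the names `x_tr`, `x_te`, `y_tr`, `y_te` are assumptions for illustration, not part of the original notebook.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-in for the letter data: 20 rows, two balanced classes.
X_toy = [[i] for i in range(20)]
y_toy = ["A"] * 10 + ["B"] * 10

x_tr, x_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.3,
    random_state=3000,  # pins this split without touching the global seed
    stratify=y_toy,     # keeps the A/B ratio the same in train and test
)

print(len(x_tr), len(x_te))
print(Counter(y_te))
```

Stratifying matters most when some classes are rare; the letter data has 26 classes of similar size, so the effect there is likely small.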

### Creating a Random Forest model

In [9]:
# Model Building - Random Forest
letter_forest = RandomForestClassifier()

# Fit the model
letter_forest.fit(x_trg, y_trg)
print("Accuracy of random forest model on training set is: ", letter_forest.score(x_trg, y_trg))

# Prediction
letter_forest_pred = letter_forest.predict(x_test)

# Computation of accuracy score
letter_forest_acc_score = accuracy_score(y_test, letter_forest_pred)
print("Accuracy of random forest on test set is: ", letter_forest_acc_score)

Accuracy of random forest model on training set is:  1.0
Accuracy of random forest on test set is:  0.9626666666666667


#### Creating a new Random Forest model with grid search

In [10]:
# Import the required package from sklearn
from sklearn.model_selection import GridSearchCV

In [11]:
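# Grid parameters
# Note: for classifiers "auto" behaves like "sqrt"; it was removed in scikit-learn 1.3, so drop it on newer versions
param_grid = {"max_features" : ["auto", "sqrt", "log2"], "criterion" : ["gini", "entropy"]}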

# Model building
letter_forest_grid = RandomForestClassifier()
letter_forest_CV = GridSearchCV(estimator = letter_forest_grid, param_grid = param_grid, cv = 5)

# Fit the model
letter_forest_result = letter_forest_CV.fit(x_trg, y_trg)
print("Best Parameters are: \n", letter_forest_CV.best_params_)

Best Parameters are: 
 {'criterion': 'gini', 'max_features': 'log2'}


#### Creating the model with the best parameters

In [12]:
# Model Building - new Random Forest
letter_forest_best = RandomForestClassifier(max_features = letter_forest_result.best_params_["max_features"],
                                           criterion = letter_forest_result.best_params_["criterion"])
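Rebuilding the forest from `best_params_` works, but `GridSearchCV` refits the best configuration on the full training data by default (`refit=True`), so the fitted model is also available as `best_estimator_`. The sketch below shows this on a small synthetic dataset; the data, the reduced grid, and the names `search` and `best_model` are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic problem standing in for the letter data.
X_syn, y_syn = make_classification(n_samples=200, n_features=16, n_informative=8,
                                   n_classes=3, random_state=0)

search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={"max_features": ["sqrt", "log2"], "criterion": ["gini", "entropy"]},
    cv=3,
)
search.fit(X_syn, y_syn)

print("best params:", search.best_params_)
print("mean CV accuracy of best params:", search.best_score_)

# With refit=True (the default) this estimator is already fitted on X_syn, y_syn.
best_model = search.best_estimator_
print("predictions:", best_model.predict(X_syn[:5]))
```

`best_score_` is the mean cross-validated accuracy of the winning combination, which is a fairer basis for comparing configurations than the training accuracy.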

#### Model Evaluation

In [13]:
# Fit the best model
letter_forest_best.fit(x_trg, y_trg)
print("Accuracy on training set with best parameters: ", letter_forest_best.score(x_trg, y_trg))

Accuracy on training set with best parameters:  1.0


In [14]:
# Prediction with the new model
letter_forest_pred_2 = letter_forest_best.predict(x_test)
print("Classification Report: \n", classification_report(y_test, letter_forest_pred_2))

Classification Report: 
               precision    recall  f1-score   support

           A       0.99      0.99      0.99       242
           B       0.88      0.97      0.93       235
           C       0.97      0.96      0.96       216
           D       0.93      0.95      0.94       225
           E       0.95      0.94      0.94       235
           F       0.94      0.94      0.94       235
           G       0.95      0.95      0.95       224
           H       0.95      0.90      0.93       220
           I       0.99      0.92      0.95       224
           J       0.95      0.98      0.97       225
           K       0.93      0.94      0.94       223
           L       1.00      0.98      0.99       223
           M       1.00      0.98      0.99       249
           N       0.96      0.98      0.97       242
           O       0.95      0.98      0.97       235
           P       0.98      0.95      0.96       256
           Q       0.94      0.98      0.96       223
  

In [15]:
# Determine the accuracy of new model
letter_forest_acc_score_2 = accuracy_score(y_test, letter_forest_pred_2)
print("Accuracy of new Random Forest model is: ", letter_forest_acc_score_2)

Accuracy of new Random Forest model is:  0.9626666666666667
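`confusion_matrix` is imported in the first cell but never used. A minimal sketch of what it adds over a single accuracy number, using made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical, not results from the letter data):

```python
from sklearn.metrics import confusion_matrix

labels = ["A", "B", "C"]
y_true_toy = ["A", "A", "B", "B", "C", "C"]
y_pred_toy = ["A", "B", "B", "B", "C", "A"]

# Rows are true classes, columns are predicted classes, both in `labels` order.
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=labels)
print(cm)
```

Applied to `y_test` and `letter_forest_pred_2`, the off-diagonal cells would show which letters the forest confuses with each other.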


#### Compare Random Forest model with Bagging Model

In [16]:
# Model building - Bagging
# The base estimator is left at its default (a decision tree), which works across scikit-learn versions
letter_bag = BaggingClassifier(n_estimators = 10, max_samples = 1.0, max_features = 1.0,
                              bootstrap = True)

# Fit the model
letter_bag.fit(x_trg, y_trg)
print("Accuracy of Bagging Model on Training set is: ", letter_bag.score(x_trg, y_trg))

# Prediction
letter_bag_pred = letter_bag.predict(x_test)

# Computation of accuracy of model
letter_bag_acc_score = accuracy_score(y_test, letter_bag_pred)
print("Accuracy of Bagging Model on Test set is: ", letter_bag_acc_score)

Accuracy of Bagging Model on Training set is:  0.99814272448032
Accuracy of Bagging Model on Test set is:  0.9223333333333333


#### Compare Random Forest with DecisionTree Model

In [17]:
# Model Building - Decision Tree
letter_tree = DecisionTreeClassifier(random_state = 0)

# Fit the model
letter_tree.fit(x_trg, y_trg)
print("Accuracy of Decision Tree model on Training set is: ", letter_tree.score(x_trg, y_trg))

# Prediction
letter_tree_pred = letter_tree.predict(x_test)

# Compute the accuracy of prediction
letter_tree_acc_score = accuracy_score(y_test, letter_tree_pred)
print("Accuracy of Decision Tree model on Test set is: ", letter_tree_acc_score)

Accuracy of Decision Tree model on Training set is:  1.0
Accuracy of Decision Tree model on Test set is:  0.8616666666666667


From the models built above, the base Random Forest model reaches an accuracy of 1.000 on the training set and 0.963 on the test set.

After the best-parameter search, the test accuracy of the new Random Forest is unchanged at 0.963.

We also checked another ensemble technique, Bagging (0.998 on training and 0.922 on test), and a simple Decision Tree (1.000 on training and 0.862 on test). Both show a larger gap between training and test accuracy than the Random Forest, which indicates that they overfit.
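The comparison can also be produced in one loop, which makes the train-test gap explicit for each model. This is a sketch on synthetic data with assumed settings (`n_estimators`, `random_state`, and the names `models` and `results` are illustrative), so the numbers it prints will differ from those reported for the letter data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the letter data.
X_syn, y_syn = make_classification(n_samples=300, n_features=16, n_informative=8,
                                   n_classes=3, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Bagging": BaggingClassifier(n_estimators=10, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(x_tr, y_tr)
    train_acc = model.score(x_tr, y_tr)
    test_acc = model.score(x_te, y_te)
    # A large positive gap means the model fits the training data much better
    # than unseen data.
    results[name] = (train_acc, test_acc, train_acc - test_acc)
    print(f"{name:<14} train={train_acc:.3f} test={test_acc:.3f} "
          f"gap={train_acc - test_acc:.3f}")
```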