## Random Forest Classifier & Gradient Boosting (Number of Ingredients by Cuisine Types)

This notebook requires:
* trainEngineered.csv

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report

In [None]:
finalDF = pd.read_csv('trainEngineered.csv')
finalDF.head()

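| Unnamed: 0 | greek | southern_us | filipino | indian | jamaican | spanish | italian | mexican | chinese | british | ... | cajun_creole | brazilian | french | japanese | irish | korean | moroccan | russian | general | cuisine |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 1 | 0 | 2 | 0 | 0 | 6 | 7 | 1 | 0 | ... | 1 | 0 | 3 | 0 | 0 | 0 | 1 | 0 | 8 | greek |
| 1 | 0 | 5 | 0 | 1 | 3 | 0 | 2 | 1 | 1 | 2 | ... | 2 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 14 | southern_us |
| 2 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | 3 | 1 | 0 | ... | 1 | 0 | 2 | 1 | 0 | 1 | 0 | 0 | 18 | filipino |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | indian |
| 4 | 1 | 3 | 0 | 14 | 2 | 2 | 3 | 5 | 5 | 1 | ... | 3 | 1 | 1 | 6 | 0 | 1 | 3 | 0 | 22 | indian |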


### Split into Train and Test

In [None]:
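# Hold out 20% of the rows as a test set.
# Note: only 'cuisine' is dropped, so 'Unnamed: 0' (the row index saved in the CSV) stays in X as a feature.
X_train, X_test, y_train, y_test = train_test_split(finalDF.drop(['cuisine'], axis = 1), 
                                                    finalDF['cuisine'], 
                                                    train_size = 0.8, 
                                                    random_state = 10)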

## Random Forest Classifier

Here, we train the model with three different numbers of trees: 250, 300, and 350.

In [None]:
for n_trees in [250, 300, 350]:
  rf_model = RandomForestClassifier(n_estimators=n_trees, 
                                    max_features=10, 
                                    n_jobs=-1,
                                    bootstrap=True, 
                                    oob_score=True, 
                                    criterion='gini',
                                    max_samples=20000)
  rf_model.fit(X_train, y_train)
  print(f'Number of trees: {n_trees}, Accuracy: {rf_model.score(X_test, y_test)}')

Number of trees: 250, Accuracy: 0.7423004399748586
Number of trees: 300, Accuracy: 0.7436832181018228
Number of trees: 350, Accuracy: 0.7435575109993715


The accuracy on the test set is highest at 300 trees (74.37%), although the three settings differ by less than 0.2 percentage points. Even though the accuracy stays about the same as with the one-hot encoded version, we have reduced the size of the dataset and sped up training.
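The forest above is built with `oob_score=True`, but the out-of-bag estimate is never printed. As a minimal sketch of how it could be read, the example below fits a smaller forest on synthetic data; `make_classification` and all variable names here are stand-ins chosen for illustration, not the cuisine dataset or the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cuisine data (assumption: not the real dataset).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_classes=4, random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=10)

rf = RandomForestClassifier(n_estimators=100, max_features=10, bootstrap=True,
                            oob_score=True, n_jobs=-1, random_state=10)
rf.fit(X_tr, y_tr)

# oob_score_ is the accuracy on training rows each tree did not see in its bootstrap sample.
oob_acc = rf.oob_score_
test_acc = rf.score(X_te, y_te)
print(f'OOB accuracy: {oob_acc:.3f}, test accuracy: {test_acc:.3f}')
```

If the two numbers are close, the out-of-bag score can serve as a rough check on generalization without touching the test set.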

Can we do better? Let's try another famous ensemble model called gradient boosting.

## Gradient Boosting

In [None]:
for n_iter in [200]:
  gb_model = GradientBoostingClassifier(n_estimators=n_iter,
                                        learning_rate=0.1, 
                                        max_depth=4)
  gb_model.fit(X_train, y_train)
  print(f'Number of trees: {n_iter}, Accuracy: {gb_model.score(X_test, y_test)}')

Number of trees: 200, Accuracy: 0.7536140791954745


The accuracy on the test set increases only slightly, from about 74.4% with random forest to about 75.4% with gradient boosting. Next, let's use cross-validation on a smaller dataset to see the range of accuracies each model reaches.
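The gradient boosting loop above evaluates a single value of `n_estimators`. One way to see how accuracy changes with the number of boosting iterations without refitting is `staged_predict`, which yields predictions after each stage. The sketch below uses synthetic data and a smaller model as assumptions for illustration; it is not the notebook's dataset or its settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cuisine data (assumption: not the real dataset).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           n_classes=3, random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=10)

gb = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1,
                                max_depth=4, random_state=10)
gb.fit(X_tr, y_tr)

# Test accuracy after 1, 2, ..., 50 boosting iterations from a single fit.
staged_acc = [accuracy_score(y_te, y_pred) for y_pred in gb.staged_predict(X_te)]
best_n = int(np.argmax(staged_acc)) + 1
print(f'Best number of iterations: {best_n}, accuracy: {max(staged_acc):.3f}')
```

Picking `best_n` on the test set would leak information, so in practice this curve is better computed on a separate validation split.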

## Cross Validation

We perform 5-fold cross-validation to check whether there is any improvement in accuracy. The validation set is smaller than the original train set: here we use 50% of the original dataset.
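To make the 5-fold procedure concrete: when `cross_validate` receives an integer `cv` and a classifier, scikit-learn splits the data with `StratifiedKFold`, so each fold keeps roughly the same class proportions. The sketch below shows the resulting fold sizes on a tiny made-up label array; `y_demo` and `X_demo` are hypothetical and only illustrate the mechanics.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy labels: 10 greek, 20 indian, 30 italian recipes.
y_demo = np.array(['greek'] * 10 + ['indian'] * 20 + ['italian'] * 30)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5)
fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X_demo, y_demo)):
    fold_sizes.append((len(train_idx), len(val_idx)))
    # Count how many samples of each class land in the held-out fold.
    classes, counts = np.unique(y_demo[val_idx], return_counts=True)
    print(f'Fold {fold}: train={len(train_idx)}, val={len(val_idx)}, '
          f'val classes={dict(zip(classes, counts))}')
```

Each of the five models is trained on four folds and scored on the remaining one, which is why `scores['test_score']` holds five accuracies.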

In [None]:
X_validation, X_test, y_validation, y_test = train_test_split(
    finalDF.drop(['cuisine'], axis = 1),
    finalDF['cuisine'],
    train_size = 0.5,
    random_state = 42)

In [None]:
rf_model = RandomForestClassifier(n_estimators=300, 
                                  max_features=10, 
                                  oob_score=True,
                                  n_jobs=-1,
                                  bootstrap=True,  
                                  criterion='gini')

In [None]:
scores = cross_validate(rf_model, X_validation, y_validation, cv=5,
                        scoring='accuracy',
                        return_estimator=True)

In [None]:
print(scores['test_score'])

[0.73906486 0.72272499 0.73095298 0.72114659 0.72416394]


Across the five folds, the random forest classifier's validation accuracy stays between 72% and 74%.

In [None]:
gb_model = GradientBoostingClassifier(n_estimators=200,
                                      learning_rate=0.1, 
                                      max_depth=3)

In [None]:
scores = cross_validate(gb_model, X_validation, y_validation, cv=5,
                        scoring='accuracy',
                        return_estimator=True)

In [None]:
print(scores['test_score'])

[0.74509804 0.73353444 0.74226804 0.73899925 0.74352527]


On average, gradient boosting performs slightly better than random forest (about 74.1% versus 72.8% mean accuracy). However, its accuracy also stays in a narrow range, between 73% and 75%.
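The comparison "on average" can be made explicit by summarizing the two fold-score arrays printed above. The snippet below copies those values by hand and reports the mean and sample standard deviation; the variable names `rf_scores` and `gb_scores` are introduced here for illustration only.

```python
import numpy as np

# Fold accuracies copied from the cross-validation outputs above.
rf_scores = np.array([0.73906486, 0.72272499, 0.73095298, 0.72114659, 0.72416394])
gb_scores = np.array([0.74509804, 0.73353444, 0.74226804, 0.73899925, 0.74352527])

for name, s in [('Random forest', rf_scores), ('Gradient boosting', gb_scores)]:
    # ddof=1 gives the sample standard deviation across the five folds.
    print(f'{name}: mean={s.mean():.4f}, std={s.std(ddof=1):.4f}')

mean_gap = gb_scores.mean() - rf_scores.mean()
print(f'Mean accuracy gap: {mean_gap:.4f}')
```

The mean gap is a little over one percentage point, which is comparable to the spread between folds, so the advantage of gradient boosting here is modest.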