# Random Forest Classifier 

This notebook explores the performance of a random forest classification model.

First, we import the libraries and load the data.

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [2]:
train_data = pd.read_csv("../data/ohe_train_recipes_v2.csv", index_col="id")
train_data_tfidf = pd.read_csv("../data/tfidf_train_recipes_v2.csv", index_col="id")

In [3]:
train_data.head(2)

id,1% buttermilk,1% chocolate milk,1% cottage cheese,1% milk,"2 1/2 to 3 lb. chicken, cut into serving pieces",2% cottage cheese,2% low fat cheddar chees,2% lowfat greek yogurt,2% milk mozzarella cheese,2% reduced-fat milk,...,yuzu,yuzu juice,za'atar,zest,zesty italian dressing,zinfandel,ziti,zucchini,zucchini blossoms,cuisine
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,spanish
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,mexican


In [14]:
train_data_tfidf = train_data_tfidf.merge(train_data['cuisine'], left_index=True, right_index=True)

Create train/validation splits of the data

In [4]:
# Hold out 30% of the rows for validation; random_state fixes the shuffle.
X_train, X_val, y_train, y_val = train_test_split(train_data.drop(columns=['cuisine']),
                                                  train_data['cuisine'],
                                                  test_size=0.3, random_state=22)

In [15]:
# Same test_size and random_state as the one-hot split above.
X_train_tf, X_val_tf, y_train_tf, y_val_tf = train_test_split(train_data_tfidf.drop(columns=['cuisine']),
                                                              train_data_tfidf['cuisine'],
                                                              test_size=0.3, random_state=22)
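
Both splits use the same `test_size` and `random_state`. When two inputs have the same number of rows, `train_test_split` shuffles them identically, so the one-hot and TF-IDF splits should contain the same recipes as long as the two frames list the same ids in the same order. The sketch below checks this on two small made-up frames (`ohe_toy` and `tfidf_toy` are hypothetical stand-ins, not the notebook's data).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Two hypothetical feature tables that share the same row index
# (toy stand-ins for the one-hot and TF-IDF frames).
index = pd.Index(range(10), name="id")
ohe_toy = pd.DataFrame({"salt": np.arange(10) % 2}, index=index)
tfidf_toy = pd.DataFrame({"salt": np.linspace(0.0, 1.0, 10)}, index=index)

# Same test_size and random_state on inputs of equal length give the same shuffle.
ohe_train, ohe_val = train_test_split(ohe_toy, test_size=0.3, random_state=22)
tfidf_train, tfidf_val = train_test_split(tfidf_toy, test_size=0.3, random_state=22)

# True when both encodings were split into exactly the same rows.
same_rows = (ohe_train.index.equals(tfidf_train.index)
             and ohe_val.index.equals(tfidf_val.index))
print(same_rows)
```

If the check fails on real data, the two files probably differ in row order or in which ids they contain, and scores on the two encodings would not be measured on the same validation recipes.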

## Decision Tree
For a baseline comparison, let's train a decision tree on the data.

In [5]:
tree_model = DecisionTreeClassifier()
tree_model.fit(X_train, y_train)
tree_model.score(X_train, y_train), tree_model.score(X_val, y_val)

(0.9997844904996228, 0.6057152434425542)

In [6]:
tree_model.tree_.max_depth

350

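With unlimited depth, a decision tree can grow until it nearly memorizes the training set (99.98% training accuracy versus 60.6% validation accuracy here). Limiting the depth lowers training accuracy but might improve validation accuracy.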

In [12]:
tree_model = DecisionTreeClassifier(max_depth=100)
tree_model.fit(X_train, y_train)
tree_model.score(X_train, y_train), tree_model.score(X_val, y_val)

(0.9337308286340289, 0.6008547724796782)
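
Rather than trying one depth at a time, the depth limit can be swept in a loop. The following is a minimal sketch on synthetic data from `make_classification` (an assumption made so the example runs on its own; the depths and dataset sizes are illustrative, not tuned for the recipe data).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the recipe dataset).
X, y = make_classification(n_samples=600, n_features=40, n_informative=10,
                           n_classes=3, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=22)

depth_scores = {}
for depth in [2, 5, 10, None]:  # None lets the tree grow until its leaves are pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_tr, y_tr)
    # Store (training accuracy, validation accuracy) for this depth.
    depth_scores[depth] = (model.score(X_tr, y_tr), model.score(X_va, y_va))

for depth, (train_acc, val_acc) in depth_scores.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, val={val_acc:.3f}")
```

Comparing the two columns shows where the gap between training and validation accuracy starts to open up, which is the range worth searching more finely.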

For comparison, let's try the TF-IDF-encoded data.

In [16]:
tree_model = DecisionTreeClassifier(max_depth=100)
tree_model.fit(X_train_tf, y_train_tf)
tree_model.score(X_train_tf, y_train_tf), tree_model.score(X_val_tf, y_val_tf)

(0.9247871843683776, 0.5743735858543535)

## Random Forest

Now let's train a random forest, which should reduce overfitting to the training data. 

In [17]:
forest_model = RandomForestClassifier(random_state=0)
forest_model.fit(X_train, y_train)
forest_model.score(X_train, y_train), forest_model.score(X_val, y_val)

(0.9997844904996228, 0.7127294058493254)
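
A random forest combines many trees, each fit on a bootstrap sample with a random subset of features considered at each split. In scikit-learn, the forest's class probabilities are the mean of the individual trees' probabilities. The sketch below verifies this on synthetic data (the dataset and `n_estimators=25` are illustrative assumptions).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the recipe dataset).
X, y = make_classification(n_samples=400, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=22)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_tr, y_tr)

# Each fitted tree is stored in estimators_; average their probabilities by hand.
tree_probas = np.stack([tree.predict_proba(X_va) for tree in forest.estimators_])
manual_average = tree_probas.mean(axis=0)

# Compare the hand-computed average with the forest's own output.
matches_forest = np.allclose(manual_average, forest.predict_proba(X_va))
print(len(forest.estimators_), matches_forest)
```

Because the individual trees make partly different errors, averaging them tends to give a smoother prediction than any single deep tree.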

As with the decision tree, let's limit the maximum depth to try to reduce overfitting, and then also vary the number of trees (`n_estimators`).

In [21]:
forest_model = RandomForestClassifier(max_depth=100)
forest_model.fit(X_train, y_train)
forest_model.score(X_train, y_train), forest_model.score(X_val, y_val)

(0.9778743579612801, 0.7049358920640242)

In [23]:
forest_model = RandomForestClassifier(max_depth=70, n_estimators=50)
forest_model.fit(X_train, y_train)
forest_model.score(X_train, y_train), forest_model.score(X_val, y_val)

(0.927696562623469, 0.6779518980977123)

In [26]:
forest_model = RandomForestClassifier(max_depth=250, n_estimators=25)
forest_model.fit(X_train, y_train)
forest_model.score(X_train, y_train), forest_model.score(X_val, y_val)

(0.9986351064976114, 0.7014162406771138)

## Grid Search CV

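We can now use cross-validation to tune the hyperparameters of the random forest model. The grid covers `max_depth` and `n_estimators`, giving 18 candidate combinations that are each scored with 3-fold cross-validation. Note that the search below is fit on the validation split (`X_val`, `y_val`), so the best estimator is trained on that 30% of the data only.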

In [27]:
from sklearn.model_selection import GridSearchCV

In [29]:
parameters = {'max_depth':[50,100,150,200,250,300],
              'n_estimators': [25,50,100]}
forest = RandomForestClassifier()
grid_search = GridSearchCV(forest, parameters, cv=3, verbose=2)

In [30]:
grid_search.fit(X_val, y_val)

Fitting 3 folds for each of 18 candidates, totalling 54 fits
[CV] END ......................max_depth=50, n_estimators=25; total time=   1.6s
[CV] END ......................max_depth=50, n_estimators=25; total time=   1.5s
[CV] END ......................max_depth=50, n_estimators=25; total time=   1.5s
[CV] END ......................max_depth=50, n_estimators=50; total time=   2.9s
[CV] END ......................max_depth=50, n_estimators=50; total time=   2.9s
[CV] END ......................max_depth=50, n_estimators=50; total time=   2.8s
[CV] END .....................max_depth=50, n_estimators=100; total time=   5.5s
[CV] END .....................max_depth=50, n_estimators=100; total time=   5.5s
[CV] END .....................max_depth=50, n_estimators=100; total time=   5.5s
[CV] END .....................max_depth=100, n_estimators=25; total time=   2.1s
[CV] END .....................max_depth=100, n_estimators=25; total time=   2.1s
[CV] END .....................max_depth=100, n_e

In [32]:
grid_search.best_params_, grid_search.best_score_

({'max_depth': 250, 'n_estimators': 100}, 0.6627004429146156)
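
Beyond `best_params_`, the full set of results is available in `cv_results_`. The sketch below shows a common pattern on synthetic data: run the search inside the training split, rank the candidates, and score the refit best estimator on the held-out split. The small grid and the dataset are assumptions chosen to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption; not the recipe dataset).
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=22)

param_grid = {"max_depth": [3, 10], "n_estimators": [10, 25]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_tr, y_tr)  # cross-validation happens inside the training split

# One row per candidate, best-ranked first.
results = pd.DataFrame(search.cv_results_)[
    ["param_max_depth", "param_n_estimators", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results)

# The best estimator is refit on all of X_tr; score it on data the search never saw.
holdout_accuracy = search.score(X_va, y_va)
print(search.best_params_, round(holdout_accuracy, 3))
```

Keeping the validation split out of the search leaves it available as an independent check on the tuned model.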

## Test Predictions
Generate predictions for the test set to evaluate model performance.

In [33]:
test_data = pd.read_csv("../data/ohe_test_recipes_v2.csv", index_col="id")


In [34]:
final_model = grid_search.best_estimator_
test_predictions = final_model.predict(test_data)

In [39]:
pd.Series(test_predictions, index=test_data.index, name='cuisine').to_csv("model_predictions/random_forest.csv")
# Kaggle score: 0.68574
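
Passing `test_data` straight to `predict` relies on the test file having the same feature columns as the data the model was fit on. If that were not guaranteed, for example because some ingredients appear only in one file, the test frame could be aligned to the training columns first. This is a sketch with made-up frames (`train_toy` and `test_toy` are hypothetical), not a step the notebook performs.

```python
import pandas as pd

# Hypothetical one-hot frames whose columns do not match.
train_toy = pd.DataFrame({"garlic": [1, 0], "lime": [0, 1], "ziti": [1, 0]})
test_toy = pd.DataFrame({"ziti": [0], "garlic": [1], "saffron": [1]})

# Reorder to the training columns, drop ingredients the model never saw,
# and fill ingredients missing from the test file with 0.
aligned_test = test_toy.reindex(columns=train_toy.columns, fill_value=0)
print(aligned_test)
```

After alignment, the frame has exactly the columns used in training, in the same order.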