# Random Forest Classifier

A decision tree splits the dataset at each node until a leaf node is created (e.g. when only one class remains in the node). Such trees are highly sensitive to the training data and might fail to generalize. A random forest is a collection of decision trees that are each trained on a different part of the dataset (randomly chosen rows, sampled with replacement, and randomly selected feature subsets) and then combined into a single model.
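The sampling and combining steps described above can be sketched without any library. This is only an illustration, not scikit-learn's implementation: the helper names (`bootstrap_rows`, `random_feature_subset`, `majority_vote`) and the sizes are assumptions made for the example, and the training of the individual trees is left out. Note also that scikit-learn draws the feature subset anew at every split and averages the trees' predicted probabilities rather than counting hard votes.

```python
import random
from collections import Counter


def bootstrap_rows(n_rows, rng):
    # Draw n_rows row indices with replacement: some rows repeat, others are left out
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def random_feature_subset(n_features, k, rng):
    # Pick k distinct feature indices (without replacement)
    return sorted(rng.sample(range(n_features), k))


def majority_vote(tree_predictions):
    # Combine the class predicted by each tree into one forest prediction
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(42)
row_sample = bootstrap_rows(n_rows=10, rng=rng)
feature_sample = random_feature_subset(n_features=16, k=4, rng=rng)

print(row_sample)      # 10 indices in 0..9, duplicates allowed
print(feature_sample)  # 4 distinct indices in 0..15
print(majority_vote(["A", "B", "A"]))  # A
```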

More information on the algorithm can be found [here](https://scikit-learn.org/stable/modules/ensemble.html#forest) or [here](https://www.datacamp.com/tutorial/random-forests-classifier-python).

In [2]:
from util import get_wpm_train_test, fit_predict_print_wp
from sklearn.ensemble import RandomForestClassifier

train_x, train_y, test_x, test_y, groups = get_wpm_train_test(include_groups=True)
model = RandomForestClassifier(random_state=42)

fit_predict_print_wp(model, train_x, train_y, test_x, test_y)

Accuracy: 52.75% (96/182)


## Hyperparameter tuning
Maybe our bad results are just because we didn't find the right hyperparameters. Let's try some different ones. Some of this code is based on [this site](https://www.kaggle.com/code/sociopath00/random-forest-using-gridsearchcv/notebook). Some extra parameters are based on [this site](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74).
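A grid search simply tries every combination of the listed values. As a rough sketch of what that means for the grid used in the next cell (the enumeration with `itertools.product` is only an illustration of the idea, not a description of how `GridSearchCV` works internally):

```python
from itertools import product
from math import prod

param_grid = {
    'n_estimators': [100, 200, 500, 1000, 1500],
    'max_features': ['sqrt', 'log2'],
    'max_depth': [4, 6, 8, 10, 50, 100, None],
    'criterion': ['gini', 'entropy'],
}

# Every candidate is one choice per hyperparameter
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_candidates = prod(len(values) for values in param_grid.values())
n_folds = 5

print(candidates[0])
print(n_candidates, n_candidates * n_folds)  # 140 700
```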

In [2]:
from sklearn.model_selection import GridSearchCV
from util import get_manually_labeled_features

param_grid = {
    'n_estimators': [100, 200, 500, 1000, 1500],
    'max_features': ['sqrt', 'log2'],
    'max_depth': [4, 6, 8, 10, 50, 100, None],
    'criterion': ['gini', 'entropy']
}

grid_model = GridSearchCV(model, param_grid, cv=5, error_score="raise", verbose=1)

# grid_model.fit(get_manually_labeled_features(train_x), train_y["Winner"])

# best_params = grid_model.best_params_
best_params = {'criterion': 'gini', 'max_depth': 6, 'max_features': 'sqrt', 'n_estimators': 1500}

best_params

{'criterion': 'gini',
 'max_depth': 6,
 'max_features': 'sqrt',
 'n_estimators': 1500}

Fitting 5 folds for each of 140 candidates, totalling 700 fits


Let's test those parameters:

In [3]:
model = RandomForestClassifier(random_state=42, **best_params)
fit_predict_print_wp(model, train_x, train_y, test_x, test_y)

Accuracy: 53.85% (98/182)


This results in only a slight increase in accuracy (+1.10 percentage points). This initially seemed worse than our RandomClassifier (54.14%), but after taking the average the random classifier only reaches 41%, so the random forest is in fact better.

n_estimators is at the upper bound of its grid, so let's also try some higher values for it:

In [4]:
param_grid = {
    'n_estimators': [1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000],
    'max_features': ['sqrt'],
    'max_depth': [6],
    'criterion': ['gini']
}

grid_model = GridSearchCV(model, param_grid, cv=5, error_score="raise", verbose=1)

# grid_model.fit(get_manually_labeled_features(train_x), train_y["Winner"])

# grid_model.best_params_

Fitting 5 folds for each of 8 candidates, totalling 40 fits

{'criterion': 'gini',
 'max_depth': 6,
 'max_features': 'sqrt',
 'n_estimators': 1500}

More n_estimators don't seem to make a difference. Let's also check whether that still holds in combination with max_depth.

In [5]:
param_grid = {
    'n_estimators': [1500, 2000],
    'max_features': ['sqrt'],
    'max_depth': [5, 6, 7],
    'criterion': ['gini']
}

grid_model = GridSearchCV(model, param_grid, cv=5, error_score="raise", verbose=1)

# grid_model.fit(get_manually_labeled_features(train_x), train_y["Winner"])

# grid_model.best_params_

Fitting 5 folds for each of 8 candidates, totalling 40 fits

{'criterion': 'gini',
 'max_depth': 6,
 'max_features': 'sqrt',
 'n_estimators': 1500}

This also doesn't seem to make a difference.

### Own scoring function
Up to now, I've used the per-row accuracy as the scoring function for the grid search. For our own scoring function, however, we want the prediction accuracy per test. So we need to supply multiple rows (the candidate headlines of a test) as input, predict the best one, and obtain one letter per test. Next, we count how many of those letters match the winner letter in train_y. This replaces checking for each row whether it is predicted as a winner (0 or 1), which would, for example, give a higher accuracy to a model that rates everything as 0.
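To make the difference between the two scores concrete, here is a small self-contained sketch. The rows, the probabilities, and the helper names are made up for illustration; the real logic lives in `predict_wp` and `evaluate_wp` from `util`, whose implementation is not shown here.

```python
# Each row: (test number, headline ID, is winner, hypothetical predicted win probability)
rows = [
    (1, "A", False, 0.30), (1, "B", True, 0.45),
    (2, "A", True, 0.20), (2, "B", False, 0.40), (2, "C", False, 0.10),
]


def row_accuracy(rows, threshold=0.5):
    # Fraction of rows whose thresholded 0/1 prediction matches the Winner column
    hits = sum((prob >= threshold) == winner for _, _, winner, prob in rows)
    return hits / len(rows)


def per_test_accuracy(rows):
    # For each test, predict the headline with the highest probability
    best = {}
    for test, headline, _, prob in rows:
        if test not in best or prob > best[test][1]:
            best[test] = (headline, prob)
    winners = {test: headline for test, headline, winner, _ in rows if winner}
    hits = sum(best[test][0] == winners[test] for test in winners)
    return hits / len(winners)


print(row_accuracy(rows))       # 0.6: every row is predicted "not a winner"
print(per_test_accuracy(rows))  # 0.5: only test 1 gets the right letter
```

The per-row score rewards a model that never predicts a winner, while the per-test score forces one choice per test.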

We now also need the test number in train_x and test_x, and for y we only want the finally selected headline per test rather than a winner flag for each row, so let's generate them again with this:

In [6]:
train_x, train_y, test_x, test_y = get_wpm_train_test(x_train_features_only=False)

# Note: There is also an argument (full_y_train=False) to immediately return the filtered version, but I need both now
train_y_filtered = train_y[train_y["Winner"] == True][["Test", "Headline ID"]]
train_y_filtered.head()

| | Test | Headline ID |
|---|---|---|
| 1 | 1 | B |
| 2 | 2 | A |
| 4 | 3 | A |
| 8 | 4 | B |
| 10 | 5 | B |


In [24]:
from sklearn.base import BaseEstimator
from util import predict_wp, evaluate_wp


class WinnerPredictor(RandomForestClassifier):
    # def __init__(self, model_class, args=None, proba=True):
    #     self.model_class = model_class  # Model should be the class, not yet an instance
    #     self.args = args if args else {}
    #     self.proba = proba
    #
    #     self.model = self.model_class(**self.args)

    def predict(self, test_x):
        # One predicted headline letter per test, not one label per row
        predictions = predict_wp(self, test_x, proba=True)
        assert len(predictions) == len(test_x.Test.unique())
        return predictions

    def fit(self, train_x, train_y, **kwargs):
        # print("Winner fit")
        if kwargs:
            print("Warning: kwargs not used " + str(kwargs))
        return super().fit(get_manually_labeled_features(train_x), train_y["Winner"])

    def score(self, test_x, test_y, **kwargs):
        if kwargs:
            print("Warning: kwargs not used " + str(kwargs))

        if len(test_y) != len(test_y.Test.unique()):
            test_y = test_y[test_y["Winner"] == True][["Test", "Headline ID"]]
        assert len(test_y) == len(test_y.Test.unique())
        assert len(test_x.Test.unique()) == len(test_y)
        predictions = self.predict(test_x)
        assert len(test_y) == len(predictions)
        return evaluate_wp(test_y, predictions)

Let's see whether this model gives the same accuracy as the plain RandomForestClassifier:

In [8]:
model1 = RandomForestClassifier(random_state=42)
acc1 = fit_predict_print_wp(model1, train_x, train_y, test_x, test_y, return_acc=True)

model2 = WinnerPredictor(random_state=42)
model2.fit(train_x, train_y)
acc2 = model2.score(test_x, test_y)

assert acc1 == acc2

print(f"Accuracy for both models is {acc1:.2f}")

Accuracy: 52.75% (96/182)
Winner fit
Accuracy for both models is 0.53


Our new model seems to work. When testing it with GridSearch, I realised there is also a problem with cross-validation: we don't want to leave out rows randomly, we need to leave them out according to the test number (so that candidate headlines with the same test number always end up on the same side of a split). There is some documentation about a [CV Splitter](https://scikit-learn.org/stable/glossary.html#term-CV-splitter) we could supply for this, but a generator might be easier (an example can be found [here](https://scikit-learn.org/stable/modules/cross_validation.html)). The generator needs to yield arrays of indices.

In [9]:
from sklearn import model_selection
import numpy as np

np.random.seed(42)


def get_index_from_test_id_list(df, test_id_list):
    # Positional indices of all rows whose test number is in test_id_list
    return np.flatnonzero(df["Test"].isin(test_id_list).to_numpy())


def candidate_headline_cv_split_i(df, k=5):
    # Yield k (train_idx, test_idx) pairs; all headlines of a test stay on the same side.
    # Positional indices are used, so the caller's DataFrame is not modified.
    tests_ids = df.Test.unique()
    random_states = np.random.randint(0, 9000, k)
    for random_state in random_states:
        train_ids, test_ids = model_selection.train_test_split(tests_ids, test_size=0.2, random_state=random_state)
        train_idx = get_index_from_test_id_list(df, train_ids)
        test_idx = get_index_from_test_id_list(df, test_ids)
        yield train_idx, test_idx

Let's test the generator:

In [10]:
train_x.reset_index(drop=True, inplace=True)
train_y.reset_index(drop=True, inplace=True)

index_generator = candidate_headline_cv_split_i(train_x, k=5)

first_index = next(index_generator)

# Get the rows on indices in first_index from train_x and train_y
train_x_first = train_x.iloc[first_index[0]]
train_y_first = train_y.iloc[first_index[0]]

In [11]:
train_x_first.head()

| | Test | Headline ID | Headline | Actief | Lang | Vragen | Interpunctie | Tweeledigheid | Emotie | Voorwaartse Verwijzing | Signaalwoorden | Lidwoorden | Adjectieven | Eigennamen | Betrekking | Voor+Achternaam | Cijfers | Quotes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A | Barack en Michelle Obama laten dansmoves zien ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1 | B | Barack en Michelle Obama gaan helemaal los tij... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 3 | A | Maandag drukste dag van het jaar op Brussels A... | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 5 | 3 | B | Maandag drukste dag van het jaar op Brussels A... | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 6 | 3 | C | Maandag drukste dag van het jaar op Brussels A... | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |


In [12]:
train_y_first.head()

| | Test | Headline ID | Winner |
|---|---|---|---|
| 0 | 1 | A | False |
| 1 | 1 | B | True |
| 4 | 3 | A | True |
| 5 | 3 | B | False |
| 6 | 3 | C | False |


In [18]:
param_grid = {
    'n_estimators': [1000],
    'max_features': ['sqrt'],
    'max_depth': [6, 7],
    'criterion': ['gini']
}

model = WinnerPredictor()

grid_model = GridSearchCV(model, param_grid, cv=candidate_headline_cv_split_i(train_x, 5), error_score="raise",
                          verbose=10)

grid_model.fit(train_x, train_y)

grid_model.best_params_

Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV 1/5; 1/2] START criterion=gini, max_depth=6, max_features=sqrt, n_estimators=1000
Winner fit


KeyboardInterrupt: 

This kept giving an error, but after some extra research on how to write a CV generator, I found GroupKFold, which does exactly what I'm trying to achieve. (Note: I initially got the same error there, but it has since been fixed.)
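The guarantee we need from the splitter is that all rows of one test end up on the same side of a split. A dependency-free sketch of that idea follows. It assigns test numbers to folds round-robin, which is a simplification and not the balancing strategy that scikit-learn's GroupKFold actually uses; the function name `group_k_fold` and the toy `groups` list are assumptions for the example.

```python
def group_k_fold(groups, n_splits):
    # Assign each distinct group (test number) to exactly one fold, round-robin
    unique_groups = sorted(set(groups))
    fold_of = {g: i % n_splits for i, g in enumerate(unique_groups)}
    for fold in range(n_splits):
        test_idx = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train_idx = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train_idx, test_idx


# Test number of each row: tests have 2 or 3 candidate headlines
groups = [1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5]

for train_idx, test_idx in group_k_fold(groups, n_splits=5):
    train_groups = {groups[i] for i in train_idx}
    test_groups = {groups[i] for i in test_idx}
    # No test number may appear on both sides of a split
    assert train_groups.isdisjoint(test_groups)
    print(sorted(test_groups), test_idx)
```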

In [19]:
from sklearn.model_selection import GroupKFold

model = WinnerPredictor()
groups = train_x["Test"]
grid_model = GridSearchCV(model, param_grid, cv=GroupKFold(n_splits=5), error_score="raise", verbose=10)
grid_model.fit(train_x, train_y, groups=groups)
grid_model.best_params_

Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV 1/5; 1/2] START criterion=gini, max_depth=6, max_features=sqrt, n_estimators=1000
Winner fit
[CV 1/5; 1/2] END criterion=gini, max_depth=6, max_features=sqrt, n_estimators=1000;, score=0.628 total time=   3.0s
[CV 2/5; 1/2] START criterion=gini, max_depth=6, max_features=sqrt, n_estimators=1000
Winner fit
[CV 2/5; 1/2] END criterion=gini, max_depth=6, max_features=sqrt, n_estimators=1000;, score=0.607 total time=   2.9s
[CV 3/5; 1/2] START criterion=gini, max_depth=6, max_features=sqrt, n_estimators=1000
Winner fit
[CV 3/5; 1/2] END criterion=gini, max_depth=6, max_features=sqrt, n_estimators=1000;, score=0.637 total time=   2.9s
[CV 4/5; 1/2] START criterion=gini, max_depth=6, max_features=sqrt, n_estimators=1000
Winner fit
[CV 4/5; 1/2] END criterion=gini, max_depth=6, max_features=sqrt, n_estimators=1000;, score=0.566 total time=   2.9s
[CV 5/5; 1/2] START criterion=gini, max_depth=6, max_features=sqrt, n_estimators=100

{'criterion': 'gini',
 'max_depth': 6,
 'max_features': 'sqrt',
 'n_estimators': 1000}

With these changes, it takes around 3 s per fold and 5 folds per parameter combination, so 15 s per parameter combination, which is e.g. 20 minutes for 400 fits. Results are expected around 15:50.
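As a quick check of that estimate, assuming roughly 3 seconds per fit (the figure seen in the log above) and the grid of 80 combinations searched in the next cell:

```python
from math import prod

seconds_per_fit = 3  # rough figure taken from the log above
n_folds = 5

# Number of values per hyperparameter: n_estimators, max_features, max_depth, criterion
grid_sizes = [4, 2, 5, 2]
n_candidates = prod(grid_sizes)
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)            # 80 400
print(n_fits * seconds_per_fit / 60)   # 20.0 (minutes)
```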

In [25]:
param_grid = {
    'n_estimators': [200, 500, 1000, 1500],
    'max_features': ['sqrt', 'log2'],
    'max_depth': [None, 4, 6, 8, 10],
    'criterion': ['gini', 'entropy']
}

model = WinnerPredictor()

groups = train_x["Test"]

grid_model = GridSearchCV(model, param_grid, cv=GroupKFold(n_splits=5), error_score="raise", verbose=1)

grid_model.fit(train_x, train_y, groups=groups)

grid_model.best_params_

Fitting 5 folds for each of 80 candidates, totalling 400 fits


{'criterion': 'gini',
 'max_depth': 4,
 'max_features': 'sqrt',
 'n_estimators': 1500}


In [26]:
best_params = {'criterion': 'gini',
               'max_depth': 4,
               'max_features': 'sqrt',
               'n_estimators': 1500}

model = RandomForestClassifier(random_state=42, **best_params)
fit_predict_print_wp(model, train_x, train_y, test_x, test_y)

Accuracy: 54.40% (98/182)


This has given a small increase. Some hyperparameters are still at the border of the given values, so let's see if other values perform better:
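Which values sit on the border can be checked mechanically. The helper below is a hypothetical sketch (the name `params_on_grid_edge` is not part of `util` or scikit-learn); it only looks at numeric values and ignores None and string options.

```python
def params_on_grid_edge(param_grid, best_params):
    # Report numeric hyperparameters whose best value is the smallest or largest one tried
    on_edge = {}
    for name, best in best_params.items():
        numeric = [v for v in param_grid[name] if isinstance(v, (int, float))]
        if len(numeric) < 2 or best not in numeric:
            continue
        if best == min(numeric):
            on_edge[name] = "lower"
        elif best == max(numeric):
            on_edge[name] = "upper"
    return on_edge


param_grid = {
    'n_estimators': [200, 500, 1000, 1500],
    'max_features': ['sqrt', 'log2'],
    'max_depth': [None, 4, 6, 8, 10],
    'criterion': ['gini', 'entropy'],
}
best_params = {'criterion': 'gini', 'max_depth': 4, 'max_features': 'sqrt', 'n_estimators': 1500}

print(params_on_grid_edge(param_grid, best_params))
# {'max_depth': 'lower', 'n_estimators': 'upper'}
```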

In [28]:
def find_best_params(params, train_x, train_y):
    model = WinnerPredictor()
    groups = train_x["Test"]
    grid_model = GridSearchCV(model, params, cv=GroupKFold(n_splits=5), error_score="raise", verbose=1)
    grid_model.fit(train_x, train_y, groups=groups)
    return grid_model.best_params_

In [29]:
param_grid = {
    'n_estimators': [1500, 2000],  # more estimators rarely hurt accuracy, but we want to remain computationally efficient
    'max_features': ['sqrt'],
    'max_depth': [2, 3, 4],
    'criterion': ['gini']
}

best_params = find_best_params(param_grid, train_x, train_y)
best_params

Fitting 5 folds for each of 6 candidates, totalling 30 fits


{'criterion': 'gini',
 'max_depth': 3,
 'max_features': 'sqrt',
 'n_estimators': 1500}


In [30]:
model = RandomForestClassifier(random_state=42, **best_params)
fit_predict_print_wp(model, train_x, train_y, test_x, test_y)

Accuracy: 55.49% (101/182)


More information about the different hyperparameters can be found [here](https://www.analyticsvidhya.com/blog/2020/03/beginners-guide-random-forest-hyperparameter-tuning/). Criterion is not listed on this site: it is the function used to measure the quality of a split (here gini or entropy), which in turn determines on what attribute the tree should split.
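For reference, the two criteria searched above, Gini impurity and entropy, both measure how mixed the class labels in a node are; a split is chosen so that this measure drops as much as possible. A minimal sketch of both formulas (the function names and the toy labels are illustrative, not scikit-learn's internals):

```python
from collections import Counter
from math import log2


def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    # Shannon entropy in bits: sum of p * log2(1 / p) over the classes present
    n = len(labels)
    proportions = [count / n for count in Counter(labels).values()]
    return sum(p * log2(1 / p) for p in proportions)


mixed = [True, True, False, False]   # perfectly mixed node
pure = [False, False, False, False]  # node with a single class

print(gini(mixed), entropy(mixed))  # 0.5 1.0
print(gini(pure), entropy(pure))    # 0.0 0.0
```

Both measures are zero for a pure node and maximal for an evenly mixed one, which is probably why the choice between them made no visible difference in the searches above.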