In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set()

# An experiment with Random Forests on Linear Data 

In this small experiment, I thought about 2 different (artificial) data sets:

1. Sum case: Each independent feature is normally distributed, and the target variable is just the sum of the independent features.

2. Abs case: Again, each independent feature is normally distributed, but the target variable is +1 if the sum of the independent features is positive, and -1 otherwise (i.e., a sign function).

Although in both cases the target variable is built from a linear combination of the features, it is clear that 1 is a regression problem, while 2 is a classification problem.

The idea here is to understand how Random Forest methods compare with more traditional linear models when the underlying data is linear.

## 0. Creating the data sets

## 0.1 Function that creates the data set

In [2]:
def create_dataset_sum(num_rows, num_cols):
    """
    Creates a data set with shape (num_rows, num_cols), with the target variable defined as the sum
    of the independent variables (features).
    """
    l = [np.random.normal(size = num_rows) for i in range(num_cols)]
    df = pd.DataFrame(l).T
    df.columns = [f'Column {i+1}' for i in range(num_cols)]
    df['target'] = df.sum(axis = 1)
    X = df.drop('target', axis = 1)
    y = df['target']
    return df,X,y

In [3]:
def create_dataset_abs(num_rows, num_cols):
    """
    Creates a data set with shape (num_rows, num_cols), with the target variable defined as
    the sign of the sum of the independent variables (features).
    """
    l = [np.random.normal(size = num_rows) for i in range(num_cols)]
    df = pd.DataFrame(l).T
    df.columns = [f'Column {i+1}' for i in range(num_cols)]
    df['target'] = df.sum(axis = 1)
    df['target'] = np.sign(df['target'])  # +1 or -1; same as dividing the sum by its absolute value
    X = df.drop('target', axis = 1)
    y = df['target']
    return df,X,y
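The two functions above draw from NumPy's global random state, so every run produces different data. As a minimal sketch of a reproducible variant (the helper name `make_features` and the `seed` parameter are my own additions, not part of the notebook), the same construction can be written with a seeded generator:

```python
import numpy as np

def make_features(num_rows, num_cols, seed=0):
    """Draw a (num_rows, num_cols) matrix of independent standard normal features."""
    rng = np.random.default_rng(seed)  # seeded, so repeated calls give identical data
    return rng.normal(size=(num_rows, num_cols))

X = make_features(1000, 10, seed=42)
y_sum = X.sum(axis=1)      # regression target (sum case)
y_sign = np.sign(y_sum)    # classification target (abs case), values in {-1, +1}

print(X.shape, y_sum.shape, np.unique(y_sign))
```

With a fixed seed, the scores reported further down could be reproduced exactly on another machine.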

## 0.2 Train and Test data for abs case

In [4]:
df_abs_train, X_abs_train, y_abs_train = create_dataset_abs(10000, 10)

In [5]:
df_abs_test, X_abs_test, y_abs_test = create_dataset_abs(2000, 10)

## 0.3 Train and Test data for sum case 

In [6]:
df_sum_train, X_sum_train, y_sum_train = create_dataset_sum(10000,10)

In [7]:
df_sum_test, X_sum_test, y_sum_test = create_dataset_sum(2000, 10)

# 1. Random Forests

Now, we start fitting our Random Forests (both a Classifier and a Regressor).

In [8]:
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

## 1.1 Using Classifier for abs case

In [9]:
clf_abs = RandomForestClassifier(random_state = 42)

Let's fit our classifier to the training data:

In [10]:
clf_abs.fit(X_abs_train, y_abs_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

And make predictions on the Test data.

In [11]:
y_abs_clf_pred = clf_abs.predict(X_abs_test)

Although skewed classes are unlikely in our case (the sum, before the sign function is applied, is normally distributed around zero), we still use the F1 score as the metric for our classification problem.

In [12]:
from sklearn.metrics import f1_score

In [13]:
f1_score(y_abs_test, y_abs_clf_pred)  # scikit-learn's argument order is (y_true, y_pred)

0.8466060929983966
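For reference, here is a small sketch of what the binary F1 score measures, computed from raw confusion counts. The function `f1_binary` and the toy labels are illustrative only. It also shows that the F1 of the positive class does not change when the two arrays are swapped, because the swap only exchanges false positives and false negatives:

```python
def f1_binary(y_true, y_pred, positive=1):
    """F1 for the positive class, computed from raw confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy labels, made up for illustration.
y_true = [1, 1, -1, -1, 1, -1]
y_pred = [1, -1, -1, 1, 1, 1]

forward = f1_binary(y_true, y_pred)
swapped = f1_binary(y_pred, y_true)
print(forward, swapped)
```

Here there are 2 true positives, 2 false positives and 1 false negative, so both calls give 4/7.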

Up to this point, we haven't tried to improve the model: we just fitted it on our training data and then predicted. We'll come back later and perform hyperparameter tuning to try to do better.

## 1.2 Using Regressor for abs case

Let's see how the Random Forest Regressor does on the abs case.

In [14]:
reg_abs = RandomForestRegressor(random_state = 42)

Let's fit the regressor to our training data.

In [15]:
reg_abs.fit(X_abs_train, y_abs_train)



RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=42, verbose=0,
                      warm_start=False)

And then predict on our test data.

In [16]:
y_abs_reg_pred = reg_abs.predict(X_abs_test)

The first obvious evaluation metric is the mean_squared_error.

In [17]:
from sklearn.metrics import mean_squared_error

In [18]:
mean_squared_error(y_abs_test, y_abs_reg_pred)  # scikit-learn's argument order is (y_true, y_pred)

0.43964000000000003

This may look acceptable, but since our target variable only takes the values -1 and +1, it is not very good.
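To put the 0.44 in context, a quick sketch with NumPy: for labels in {-1, +1}, always predicting 0 gives an MSE of exactly 1. The same sketch shows how a regressor's raw output can be thresholded at 0 so that it can be scored as a classifier. The noisy output `r` below is a hypothetical stand-in, not the output of `reg_abs`.

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.sign(rng.normal(size=(2000, 10)).sum(axis=1))  # labels in {-1, +1}

# Baseline: always predict 0, the midpoint of the two labels.
baseline_mse = np.mean((y - 0.0) ** 2)

# Hypothetical noisy regressor output, thresholded at 0 to obtain class labels.
r = 0.3 * y + rng.normal(scale=0.5, size=y.shape)
labels = np.where(r >= 0, 1.0, -1.0)
accuracy = np.mean(labels == y)

print(baseline_mse, accuracy)
```

Thresholding like this would let the regressor be scored with the same F1 metric as the classifier.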

## 1.3 Using Regressor for sum case

Now, with what looks like a more reasonable choice of model for our data, we fit a Regressor to the sum case.

In [19]:
reg_sum = RandomForestRegressor(random_state = 42)

Let's fit our model to the training data.

In [20]:
reg_sum.fit(X_sum_train, y_sum_train)



RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=42, verbose=0,
                      warm_start=False)

And again predict on the test data.

In [21]:
y_sum_reg_pred = reg_sum.predict(X_sum_test)

And for scoring: mean_squared_error

In [22]:
mean_squared_error(y_sum_test, y_sum_reg_pred)  # scikit-learn's argument order is (y_true, y_pred)

1.8159730838343686

Yep. Not good.
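To judge how bad an MSE of about 1.82 is, it helps to compare it with the variance of the target. The sum of ten independent standard normal features has variance 10. The sketch below estimates that variance by simulation and turns the reported MSE into a rough R² figure; the simulated sample is my own and is independent of the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=(200_000, 10)).sum(axis=1)  # same construction as the sum-case target

target_variance = y.var()        # close to 10: the sum of ten unit variances
mse_forest = 1.8159730838343686  # value reported above for the default forest
r2_estimate = 1 - mse_forest / target_variance

print(round(float(target_variance), 2), round(float(r2_estimate), 2))
```

So the default forest explains roughly 82% of the variance.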

# 2. Linear Models

For a sanity check, we apply our old friends: Linear and Logistic Regression.

## 2.1. Linear Regression

### 2.1.1 Linear Regression for sum case

In [23]:
from sklearn.linear_model import LinearRegression

In [24]:
linreg_sum = LinearRegression()

We fit our Linear Regression to the train data.

In [25]:
linreg_sum.fit(X_sum_train, y_sum_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

And again predict on the test data.

In [26]:
y_sum_linreg_pred = linreg_sum.predict(X_sum_test)

As for the mean_squared_error:

In [27]:
mean_squared_error(y_sum_test, y_sum_linreg_pred)  # scikit-learn's argument order is (y_true, y_pred)

5.086197438241408e-30

So, as we expected, the Linear Regression is able to completely learn the pattern in our data, since the pattern itself is linear. We can check this by looking at the Linear Regression coefficients: since our target variable is a simple sum of the independent variables, the coefficients should all be equal to one.

In [28]:
linreg_sum.coef_

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
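The same result can be reproduced without scikit-learn. This is a minimal sketch using ordinary least squares via `np.linalg.lstsq` on freshly simulated data (not the notebook's `X_sum_train`), with a column of ones appended so that an intercept is estimated as well:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X.sum(axis=1)  # target is an exact linear function of the features

# Append a column of ones so the intercept is estimated too.
A = np.column_stack([X, np.ones(len(X))])
solution, *_ = np.linalg.lstsq(A, y, rcond=None)
coef, intercept = solution[:-1], solution[-1]

print(np.round(coef, 6), round(float(intercept), 6))
```

Because the target contains no noise, the fit is exact up to floating-point error, which is why the MSE above is of the order of 1e-30 rather than exactly zero.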

## 2.2. Logistic Regression 

### 2.2.1 Logistic Regression for abs case

We fit 2 Logistic Regression models to our classification problem in the abs case, to see if there is a difference in encoding our target variable as -1 and +1 or 0 and 1.

In [29]:
from sklearn.linear_model import LogisticRegression

In [30]:
logreg_abs_1 = LogisticRegression()
logreg_abs_2 = LogisticRegression()

In [31]:
y_abs_train_1 = y_abs_train
y_abs_train_2 = y_abs_train.map(lambda x: 0 if x<0 else 1)

In [32]:
logreg_abs_1.fit(X_abs_train,y_abs_train_1)
logreg_abs_2.fit(X_abs_train, y_abs_train_2)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

And predict on the test data.

In [33]:
y_abs_logreg_pred_1 = logreg_abs_1.predict(X_abs_test)
y_abs_logreg_pred_2 = logreg_abs_2.predict(X_abs_test)

In [34]:
y_abs_test_1 = y_abs_test
y_abs_test_2 = y_abs_test.map(lambda x: 0 if x<0 else 1)  # same threshold as used for the training labels

In [35]:
f1_score(y_abs_test_1, y_abs_logreg_pred_1), f1_score(y_abs_test_2, y_abs_logreg_pred_2)

(0.9958592132505175, 0.9958592132505175)

In [36]:
logreg_abs_1.coef_, logreg_abs_2.coef_

(array([[5.92203408, 5.91303718, 5.81691846, 5.83146846, 5.871298  ,
         5.93370892, 5.92951503, 5.94598541, 5.90325703, 5.85843037]]),
 array([[5.92203408, 5.91303718, 5.81691846, 5.83146846, 5.871298  ,
         5.93370892, 5.92951503, 5.94598541, 5.90325703, 5.85843037]]))

Absolutely no difference.
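One plausible explanation for the identical coefficients is that the classifier first maps the labels to sorted class indices, so that -1/+1 and 0/1 collapse to the same internal encoding. I am assuming this is what scikit-learn does internally. The sketch below shows that mapping with plain NumPy on toy labels:

```python
import numpy as np

y_pm = np.array([-1.0, 1.0, 1.0, -1.0, 1.0])  # labels encoded as -1 / +1
y_01 = np.where(y_pm < 0, 0, 1)               # the same labels encoded as 0 / 1

# np.unique returns the sorted classes and, for each sample, its class index.
classes_pm, idx_pm = np.unique(y_pm, return_inverse=True)
classes_01, idx_01 = np.unique(y_01, return_inverse=True)

print(classes_pm, classes_01)
print(idx_pm, idx_01)
```

Both encodings yield the same index vector, so an optimiser working on the indices would see the same problem in either case.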

# 4. Hyperparameter tuning for Random Forests

We see that a naive approach with Random Forests will in principle underperform, in comparison with linear models, when the target variable has a linear dependence on the independent variables.

This goes to show how important it is for us to make sure that we actually understand the data we are given. Of course, we still need to be careful not to introduce serious bias into our modeling.
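A rough intuition for the gap, sketched with NumPy on simulated data: a decision tree splits on one feature at a time, while the true boundary (the sum of all ten features crossing zero) is oblique. A single axis-aligned split on one feature agrees with the sign of the sum only about 60% of the time, whereas one split along the true direction is perfect. This is a simplified illustration, not a model of a full forest.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 10))
y = np.sign(X.sum(axis=1))

# One axis-aligned split: predict the sign of a single feature.
single_split_acc = np.mean(np.sign(X[:, 0]) == y)

# One oblique split along the true direction, which a linear model can learn.
linear_acc = np.mean(np.sign(X @ np.ones(10)) == y)

print(round(float(single_split_acc), 3), round(float(linear_acc), 3))
```

A tree has to stack many such axis-aligned splits to approximate the diagonal boundary, which is why it needs far more capacity than a linear model here.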

With this in mind, we'll now work on the abs case (a classification problem), and try to take our Random Forest Classifier to the extreme, improving it as much as we can with hyperparameter tuning. 

In [37]:
from sklearn.model_selection import GridSearchCV

## 4.1 Attempt n. 1 

In [38]:
param_grid_1 = {'n_estimators':[10,100,1000],
             'min_samples_split':[2,10,30],
             'min_samples_leaf':[1,10,100]}

In [39]:
model_1 = RandomForestClassifier(random_state = 42)

In [40]:
clf = GridSearchCV(model_1, param_grid_1, scoring = 'f1', cv = 5, n_jobs = -1, verbose = 2)

Now, this will take a while to run.
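To see where the time goes, here is a small sketch that enumerates the grid with `itertools.product`. It uses the same parameter values and `cv = 5` as above; the count of trees ignores the final refit on the full training set.

```python
from itertools import product

param_grid = {
    'n_estimators': [10, 100, 1000],
    'min_samples_split': [2, 10, 30],
    'min_samples_leaf': [1, 10, 100],
}
cv = 5

# Every combination of the listed values is one candidate model.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_fits = len(candidates) * cv
total_trees = sum(c['n_estimators'] for c in candidates) * cv

print(len(candidates), n_fits, total_trees)
```

Most of the trees grown belong to the nine candidates with `n_estimators = 1000`, so the largest value in the grid dominates the run time.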

In [41]:
%%time
clf.fit(X_abs_train,y_abs_train)

Fitting 5 folds for each of 27 candidates, totalling 135 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 135 out of 135 | elapsed:  5.6min finished


CPU times: user 25 s, sys: 193 ms, total: 25.2 s
Wall time: 6min 2s


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=None,
                                              oob_score=False, random_state=42,
                                              verbose=0, warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid={'mi

In [42]:
clf.best_score_

0.9203824234983585

Clearly an improvement on the f1_score metric for our RandomForestClassifier! (Note that best_score_ is the mean cross-validated score on the training data, not the test score.)
For the best hyperparameters:

In [43]:
clf.best_params_

{'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 1000}

Wow, 1000 trees! Are we overfitting our data??

In [44]:
y_abs_pred_cv = clf.predict(X_abs_test)

In [45]:
f1_score(y_abs_test, y_abs_pred_cv)

0.9225206611570247

Apparently not! How much further can we go?!
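One way to think about why more trees need not mean more overfitting: a forest averages votes, and adding voters mainly reduces the variance of the average. The sketch below is an idealised simulation, assuming independent voters that are each right 60% of the time. Real trees are correlated, so the gain flattens out much earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
p_correct = 0.6  # assumed accuracy of each individual voter (hypothetical)
n_trials = 2000

accuracy = {}
for n_voters in (1, 11, 101, 1001):
    votes = rng.random((n_trials, n_voters)) < p_correct  # True where a voter is right
    majority_right = votes.sum(axis=1) > n_voters / 2
    accuracy[n_voters] = majority_right.mean()

print(accuracy)
```

In this idealised setting accuracy only goes up with the number of voters; the cost of extra trees is compute time rather than generalisation.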

## 4.2 Attempt n.2

In [46]:
param_grid_2 = {'n_estimators':[1000,2500,5000],
              'min_samples_split' : [2,10,30],
              'min_samples_leaf': [1, 10, 100]}

In [47]:
model_2 = RandomForestClassifier(random_state = 42)

In [48]:
clf_2 = GridSearchCV(model_2, param_grid_2, scoring = 'f1', cv = 5, n_jobs = -1, verbose = 2)

Now THIS WILL TAKE SOME TIME!

In [49]:
%%time
clf_2.fit(X_abs_train,y_abs_train)

Fitting 5 folds for each of 27 candidates, totalling 135 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed: 12.1min
[Parallel(n_jobs=-1)]: Done 135 out of 135 | elapsed: 42.3min finished


CPU times: user 2min 1s, sys: 561 ms, total: 2min 2s
Wall time: 44min 20s


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=None,
                                              oob_score=False, random_state=42,
                                              verbose=0, warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid={'mi

In [50]:
clf_2.best_score_

0.9213823270510575

We actually see some improvement! Again, are we overfitting?

In [51]:
clf_2.best_params_

{'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 5000}

We are surely overfitting, right?

In [52]:
y_abs_pred_cv_2 = clf_2.predict(X_abs_test)
f1_score(y_abs_test,y_abs_pred_cv_2)

0.9257913855734302

I guess not.. Again, how much further can we go?

## 4.3 Attempt n.3

Again, I'll feed more trees to the model.

In [53]:
param_grid_3 = {
    'n_estimators': [5000,7500],
    'min_samples_split': [10],
    'min_samples_leaf': [1]
}

In [54]:
model_3 = RandomForestClassifier(random_state = 42)

In [55]:
clf_3 = GridSearchCV(model_3, param_grid_3, scoring = 'f1', cv = 5, n_jobs = -1, verbose = 2)

Again, some SERIOUS time here.

In [56]:
%%time
clf_3.fit(X_abs_train, y_abs_train)

Fitting 5 folds for each of 2 candidates, totalling 10 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:  9.3min finished


CPU times: user 2min 57s, sys: 436 ms, total: 2min 57s
Wall time: 12min 14s


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=None,
                                              oob_score=False, random_state=42,
                                              verbose=0, warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid={'mi

In [57]:
clf_3.best_score_

0.9210454926754515

This time the score is marginally lower than in attempt 2 (0.9210 vs. 0.9214), so there is no real improvement.

In [58]:
clf_3.best_params_

{'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 7500}

Is it overfitting already?!

In [59]:
y_abs_pred_cv_3 = clf_3.predict(X_abs_test)
f1_score(y_abs_test,y_abs_pred_cv_3)

0.9253886010362694

Maybe!

## 4.4 Attempt n.4: the overfit dream?

Let's give a lot of trees to the model.

In [60]:
param_grid_4 = {
    'n_estimators': [7500,10000, 12500],
    'min_samples_split': [10],
    'min_samples_leaf': [1]
}

In [61]:
model_4 = RandomForestClassifier(random_state = 42)

In [62]:
clf_4 = GridSearchCV(model_4, param_grid_4, scoring = 'f1', cv = 5, n_jobs = -1, verbose = 2)

Again, this will take some time.

In [63]:
%%time
clf_4.fit(X_abs_train, y_abs_train)

Fitting 5 folds for each of 3 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed: 21.8min finished


CPU times: user 3min 56s, sys: 824 ms, total: 3min 57s
Wall time: 25min 43s


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=None,
                                              oob_score=False, random_state=42,
                                              verbose=0, warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid={'mi

In [65]:
clf_4.best_score_

0.9216673505906401

Almost negligible improvement at this point.

In [66]:
clf_4.best_params_

{'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 10000}

To check for overfit:

In [67]:
y_abs_pred_cv_4 = clf_4.predict(X_abs_test)
f1_score(y_abs_test,y_abs_pred_cv_4)

0.9248315189217211
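Collecting the numbers reported in the four attempts makes the diminishing returns easier to see. The sketch below only restates scores already shown above (best cross-validated F1 and test F1 for each attempt) and computes how much they vary:

```python
# Best n_estimators, cross-validated F1 and test F1 for each attempt, copied from above.
results = {
    1: {'n_estimators': 1000,  'cv_f1': 0.9203824234983585, 'test_f1': 0.9225206611570247},
    2: {'n_estimators': 5000,  'cv_f1': 0.9213823270510575, 'test_f1': 0.9257913855734302},
    3: {'n_estimators': 7500,  'cv_f1': 0.9210454926754515, 'test_f1': 0.9253886010362694},
    4: {'n_estimators': 10000, 'cv_f1': 0.9216673505906401, 'test_f1': 0.9248315189217211},
}

cv_scores = [r['cv_f1'] for r in results.values()]
test_scores = [r['test_f1'] for r in results.values()]
cv_spread = max(cv_scores) - min(cv_scores)
test_spread = max(test_scores) - min(test_scores)
best_test_attempt = max(results, key=lambda k: results[k]['test_f1'])

for k, r in results.items():
    print(k, r['n_estimators'], round(r['cv_f1'], 4), round(r['test_f1'], 4))
print(round(cv_spread, 4), round(test_spread, 4), best_test_attempt)
```

Going from 1000 to 10000 trees moves the cross-validated F1 by about 0.001 and the test F1 by about 0.003, and the best test score belongs to attempt 2. All of them remain well below the 0.996 obtained with Logistic Regression.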
