# Tree-based methods with Bank Data

In this notebook, we explore the following tree-based ensemble methods:

- Bagging
- Random Forest
- Boosting

To learn more about tree-based methods, check out the StatQuest videos below.

- [Classification And Regression Trees](https://www.youtube.com/playlist?list=PLblh5JKOoLUKAtDViTvRGFpphEc24M-QH)
- [Random Forests](https://www.youtube.com/playlist?list=PLblh5JKOoLUIE96dI3U7oxHaCAbZgfhHk)
- [Gradient Boost](https://www.youtube.com/playlist?list=PLblh5JKOoLUJjeXUvUE0maghNuY2_5fY6)

## Preamble

### Imports

In [1]:
from os import path
import joblib

import numpy as np
import pandas as pd

from sklearn.ensemble import (
    HistGradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import roc_auc_score

### Definitions

#### Constants

In [2]:
INPUT_PATH = '/kaggle/input/playground-series-s5e8'
MODELS_PATH = '/kaggle/input/ps5e8-bank-data-models/scikitlearn/default/1'
RNG_SEED = 42
VERBOSITY = 1
SCORING = 'roc_auc'

#### Utilities

In [3]:
def load_model(file_path):
    if path.exists(file_path):
        try:
            model = joblib.load(file_path)
            print(f'{file_path} loaded.')
            return model
        except Exception as e:
            print(f"error loading the object from {file_path}: {e}")
            return None
    else:
        print(f"{file_path} not found.")
        return None

def save_model(model, filename):
    joblib.dump(model, filename, compress=True)
    print(f'{filename} saved.')

## Bank data

In [4]:
train = pd.read_csv(path.join(INPUT_PATH, 'train.csv'), index_col='id')
train.shape

(750000, 17)

In [5]:
all_features = train.drop('y', axis='columns').columns.tolist()
non_bin_features = ['job', 'education', 'contact', 'month', 'poutcome', 'marital']
bin_features = ['default', 'housing', 'loan']
cat_features = bin_features + non_bin_features
num_features = [x for x in all_features if x not in cat_features]

cat_features, num_features

(['default',
  'housing',
  'loan',
  'job',
  'education',
  'contact',
  'month',
  'poutcome',
  'marital'],
 ['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous'])

### Train/test split

In [6]:
y = train.loc[:, 'y']
X = train.drop('y', axis='columns')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=RNG_SEED)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((502500, 16), (247500, 16), (502500,), (247500,))

## Column transformer

Here, we define a base column transformer for all pipelines.

In [7]:
ohe = OneHotEncoder(handle_unknown='ignore')
ohe_dense = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

column_transformer = ColumnTransformer(transformers=[
    ('cat', ohe, cat_features),
    ('num', StandardScaler(), num_features),
])
column_transformer
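
As a rough illustration of what the `'num'` branch does, the sketch below standardizes a column to zero mean and unit variance using the population standard deviation. The `standardize` helper and the `ages` values are hypothetical and written in plain Python; this is not scikit-learn's implementation. Because the transformation preserves the ordering of values, it should not change which splits a tree can find; here it serves as a shared preprocessing step.

```python
import statistics


def standardize(values):
    """Scale values to zero mean and unit variance (population std)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


ages = [25, 35, 45, 55]  # hypothetical values, not the competition data
scaled = standardize(ages)
print([round(z, 3) for z in scaled])
```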

## Bagging

[Bagging](https://en.wikipedia.org/wiki/Bootstrap_aggregating) is a common technique used in many tree-based classifiers such as [random forest](https://en.wikipedia.org/wiki/Random_forest).

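By default, [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) considers only a random subset of $m = \sqrt{p}$ predictors at each split, where $p$ is the number of predictors. Bagging instead considers all $p$ predictors at every split, so we raise `max_features`. Note that `X.shape[1]` counts the 16 raw columns, while the classifier receives the one-hot encoded matrix, which has more columns than that; passing `max_features=None` would make every encoded column a candidate at each split.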
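
The column transformer one-hot encodes the categorical features before the classifier sees them, so the value of $p$ at the classifier is not the raw column count. The sketch below shows the effect on a few made-up rows. The `count_encoded_columns` helper and the toy rows are hypothetical and only mimic how one-hot encoding changes the column count.

```python
def count_encoded_columns(rows, cat_columns, num_columns):
    """Count columns after one-hot encoding the categorical columns.

    Each categorical column expands into one column per distinct value;
    each numeric column stays a single column.
    """
    n_cat = sum(len({row[col] for row in rows}) for col in cat_columns)
    return n_cat + len(num_columns)


# Hypothetical toy rows, not the competition data.
toy_rows = [
    {"job": "admin", "housing": "yes", "age": 31},
    {"job": "technician", "housing": "no", "age": 45},
    {"job": "retired", "housing": "yes", "age": 67},
]

raw_p = len(toy_rows[0])
encoded_p = count_encoded_columns(toy_rows, ["job", "housing"], ["age"])
print(raw_p, encoded_p)
```

On a fitted pipeline, the actual count can be read from the classifier's `n_features_in_` attribute.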

We also use the [OOB error](https://en.wikipedia.org/wiki/Out-of-bag_error) in place of cross-validation to estimate the test error, which is why the classifier is created with `oob_score=True`.
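
The OOB estimate works because each tree is trained on a bootstrap sample, and the rows that sample misses act as a validation set for that tree. For a sample of size $n$ drawn with replacement, the chance that a given row is left out is $(1 - 1/n)^n$, which approaches $1/e \approx 0.368$. The simulation below is a minimal sketch of that fact; the `oob_fraction` helper is hypothetical and uses only the standard library.

```python
import random


def oob_fraction(n_samples, rng):
    """Draw one bootstrap sample and return the share of rows left out."""
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(drawn) / n_samples


rng = random.Random(42)
fractions = [oob_fraction(10_000, rng) for _ in range(20)]
mean_oob = sum(fractions) / len(fractions)
print(round(mean_oob, 3))
```

With 500 trees, each training row is therefore out-of-bag for a large number of trees, which is what makes the aggregated OOB prediction usable as a test-score estimate.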

In [8]:
bag_model_filename = 'bag_model.joblib'
bag_model = load_model(path.join(MODELS_PATH, bag_model_filename))

if bag_model is None:
    bag_model = Pipeline(steps=[
        ('column_transformer', column_transformer),
        ('classifier', RandomForestClassifier(
            max_features=X.shape[1],
            n_estimators=500,
            oob_score=True,
            n_jobs=-1,
            random_state=RNG_SEED,
            verbose=VERBOSITY,
        ))
    ])
    
    bag_model.fit(X_train, y_train)

bag_model

/kaggle/input/ps5e8-bank-data-models/scikitlearn/default/1/bag_model.joblib loaded.


### OOB and test score

The OOB score and the test score (both ROC AUC) are very similar, which suggests that the OOB estimate is a reasonable stand-in for held-out performance here.

In [9]:
clf = bag_model.named_steps['classifier']

oob_score = roc_auc_score(y_train, clf.oob_decision_function_[:, 1])
test_score = roc_auc_score(y_test, bag_model.predict_proba(X_test)[:, 1])
oob_score, test_score

[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    1.0s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    4.2s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:    9.6s
[Parallel(n_jobs=4)]: Done 500 out of 500 | elapsed:   11.1s finished


(0.9633260121518018, 0.9634408588254115)
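
Both numbers are ROC AUC values, which can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes that quantity directly on a tiny hypothetical example. The `roc_auc` helper is an illustration only: it is quadratic in the number of samples and is not how `roc_auc_score` is implemented.

```python
def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of
    samples, so this is for illustration only.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly
```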

In [10]:
save_model(bag_model, bag_model_filename)

bag_model.joblib saved.


## Random Forest

We use the same parameters as above, but keep the default `max_features` (`'sqrt'`), so each split considers only $m = \sqrt{p}$ randomly chosen predictors out of the $p$ available. The bagged classifier above, by contrast, sets `max_features` explicitly to the raw column count.
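
To make the square-root rule concrete, the sketch below computes the number of candidate predictors per split for a few values of $p$. It assumes the rule is the integer part of $\sqrt{p}$ with a minimum of 1, which is my understanding of how scikit-learn resolves `'sqrt'`. The `features_per_split` helper is hypothetical.

```python
import math


def features_per_split(n_features):
    """Number of candidate features per split under a square-root rule."""
    return max(1, int(math.sqrt(n_features)))


for p in (9, 16, 50):
    print(p, features_per_split(p))
```

Here $p$ is the number of columns the classifier actually receives, that is, the column count after the column transformer has been applied.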

In [11]:
rf_model_filename = 'rf_model.joblib'
rf_model = load_model(path.join(MODELS_PATH, rf_model_filename))

if rf_model is None:
    rf_model = Pipeline(steps=[
        ('column_transformer', column_transformer),
        ('classifier', RandomForestClassifier(
            n_estimators=500,
            oob_score=True,
            n_jobs=-1,
            random_state=RNG_SEED,
            verbose=VERBOSITY,
        ))
    ])
    
    rf_model.fit(X_train, y_train)

rf_model

/kaggle/input/ps5e8-bank-data-models/scikitlearn/default/1/rf_model.joblib loaded.


### OOB and test score

As above, we use the OOB score to estimate the test score.

In [12]:
clf = rf_model.named_steps['classifier']

oob_score = roc_auc_score(y_train, clf.oob_decision_function_[:, 1])
test_score = roc_auc_score(y_test, rf_model.predict_proba(X_test)[:, 1])
oob_score, test_score

[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    1.1s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    5.2s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:   12.2s
[Parallel(n_jobs=4)]: Done 500 out of 500 | elapsed:   13.7s finished


(0.9616230906879348, 0.961895653437245)

In [13]:
save_model(rf_model, rf_model_filename)

rf_model.joblib saved.


## Boosting

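We use [HistGradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html) because the scikit-learn documentation describes it as much faster than [GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) on large datasets (`n_samples >= 10 000`). The `scoring` parameter set below only affects early stopping, which is enabled automatically for training sets with more than 10,000 samples, so `max_iter` acts as an upper bound on the number of boosting iterations.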
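
Much of the speed-up comes from binning: each numeric feature is mapped to a small number of integer bins before training, so a tree only has to evaluate a few candidate thresholds per feature instead of every distinct value. The sketch below illustrates the idea with quantile-based bins. The helpers and the `durations` values are hypothetical, and this is a simplified illustration rather than scikit-learn's binning code.

```python
from bisect import bisect_right


def quantile_bin_edges(values, n_bins):
    """Pick up to n_bins - 1 thresholds at evenly spaced quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    edges = [ordered[(i * n) // n_bins] for i in range(1, n_bins)]
    return sorted(set(edges))


def to_bins(values, edges):
    """Map each value to the index of the bin it falls into."""
    return [bisect_right(edges, v) for v in values]


# Hypothetical values, not the competition data.
durations = [12, 95, 130, 180, 240, 310, 420, 600, 900, 1500]
edges = quantile_bin_edges(durations, n_bins=4)
binned = to_bins(durations, edges)
print(edges)
print(binned)
```

After binning, the ten distinct durations collapse to four bin indices, so only three thresholds remain as split candidates for this feature.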

In [14]:
gb_model_filename = 'gb_model.joblib'
gb_model = load_model(path.join(MODELS_PATH, gb_model_filename))

if gb_model is None:
    gb_params = {
        'max_iter': 3000,
        'scoring': SCORING,
        'random_state': RNG_SEED,
        # 'verbose': VERBOSITY,
    }
    
    gb_model = Pipeline(steps=[
        ('column_transformer', column_transformer),
        ('classifier', HistGradientBoostingClassifier(**gb_params))
    ])
    
    gb_model.fit(X, y)

gb_model

/kaggle/input/ps5e8-bank-data-models/scikitlearn/default/1/gb_model.joblib loaded.


### Cross-validation

`HistGradientBoostingClassifier` does not provide an OOB estimate, so we use cross-validation to estimate the test score instead. Setting `cv=5` requests 5-fold cross-validation; because the estimator is a classifier with a binary target, [cross_validate](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html) uses [StratifiedKFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) under the hood.
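
Stratification means every fold keeps roughly the same class proportions as the full dataset, which matters when the positive class is rare. The sketch below shows the idea by dealing the samples of each class into folds in turn. The `stratified_folds` helper and the toy labels are hypothetical; `StratifiedKFold` assigns samples differently, so this only illustrates the property it preserves.

```python
from collections import defaultdict


def stratified_folds(labels, n_splits):
    """Assign each sample index to a fold, round-robin within each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


labels = [0] * 15 + [1] * 5  # hypothetical imbalanced target
folds = stratified_folds(labels, n_splits=5)
for fold in folds:
    positives = sum(labels[i] for i in fold)
    print(len(fold), positives)
```

Each of the five folds ends up with four samples, one of them positive, matching the 25% positive rate of the toy labels.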

In [15]:
scores = cross_validate(
    gb_model, X, y, cv=5, 
    scoring=SCORING, 
    return_train_score=True,
    n_jobs=-1, 
    verbose=VERBOSITY,
)
scores

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  5.5min finished


{'fit_time': array([207.94045138, 148.5244751 , 118.22973394, 214.42088723,
        154.35692477]),
 'score_time': array([6.7715044 , 5.7890799 , 4.96554041, 7.07091594, 5.61889935]),
 'test_score': array([0.96587671, 0.96607459, 0.96498381, 0.96579898, 0.96667684]),
 'train_score': array([0.96893207, 0.96808877, 0.96739703, 0.96940478, 0.96873625])}

In [16]:
scores['test_score'].mean(), scores['train_score'].mean()

(0.9658821867365088, 0.9685117791770719)

In [17]:
save_model(gb_model, gb_model_filename)

gb_model.joblib saved.


## Test predictions

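Finally, we predict on the competition test set (`test.csv`, which is distinct from the `X_test` hold-out used above) with each of the three models and write one submission file per model.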

In [18]:
test = pd.read_csv(path.join(INPUT_PATH, 'test.csv'), index_col='id')
test.shape

(250000, 16)

In [19]:
models = [
    ('bag_submission.csv', bag_model),
    ('rf_submission.csv', rf_model),
    ('gb_submission.csv', gb_model),
]

for (filename, model) in models:
    y_hat = model.predict_proba(test)[:, 1]
    submission = pd.DataFrame({
        'id': test.index,
        'y': y_hat,
    })
    submission.to_csv(filename, index=False)
    print(f'{filename} saved.')

[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    1.0s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    4.4s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:   10.0s
[Parallel(n_jobs=4)]: Done 500 out of 500 | elapsed:   11.4s finished


bag_submission.csv saved.


[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    1.3s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    5.3s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:   11.9s
[Parallel(n_jobs=4)]: Done 500 out of 500 | elapsed:   13.4s finished


rf_submission.csv saved.
gb_submission.csv saved.
