## Ensembles for Customer Satisfaction Prediction

KATE expects your code to define variables with specific names that correspond to certain things we are interested in.

KATE will run your notebook from top to bottom and check the latest value of those variables, so make sure you don't overwrite them.

* Remember to uncomment the line assigning the variable to your answer and don't change the variable or function names.
* Use copies of the original or previous DataFrames to make sure you do not overwrite them by mistake.

You will find instructions below about how to define each variable.

Once you're happy with your code, upload your notebook to KATE to check your feedback.

Businesses can improve their services by tailoring them to individual customers. One important factor is knowing when customers are dissatisfied. Based on their records, one can use machine learning tools to make predictions about which customers are more at risk of being dissatisfied than others. Such predictions allow for individualized actions that may help retain customers and will improve quality.

In this assignment, we will build a prediction model for bank account owners' satisfaction. The record includes more than 300 features for each client, including variables related to their balance and to the banking operations they have performed. Many of these variables are sparse; some are numerical, some categorical.

Ensemble methods based on decision trees, such as random forests and boosting algorithms, have been very successful in modeling such heterogeneous tabular data. To learn how these models work, you will implement them step-by-step, and see how the performance of your predictions improves.

### Load the data

Load the data in `data/train_data.csv` with `pandas`. Inspect its content with `.head()`, `.shape` and other methods of your choice.

In [1]:
import pandas as pd
data = pd.read_csv('data/train_data.csv')
#data = pd.read_csv('train_data.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,var3,var15,imp_ent_var16_ult1,imp_op_var39_comer_ult1,imp_op_var39_comer_ult3,imp_op_var40_comer_ult1,imp_op_var40_comer_ult3,imp_op_var40_efect_ult1,imp_op_var40_efect_ult3,...,saldo_medio_var33_hace2,saldo_medio_var33_hace3,saldo_medio_var33_ult1,saldo_medio_var33_ult3,saldo_medio_var44_hace2,saldo_medio_var44_hace3,saldo_medio_var44_ult1,saldo_medio_var44_ult3,var38,TARGET
0,47739,2,29,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,46565.04,0
1,4212,2,38,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,90736.77,0
2,48967,2,23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117310.979016,0
3,11077,2,23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,172107.72,0
4,17475,2,26,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,67983.57,0


In [2]:
data.shape

(66020, 371)

#### Target variable

The last column, named `TARGET`, is the variable to be predicted. `TARGET=1` represents a dissatisfied customer. Inspect the target column with `.value_counts()`. 

What is the proportion of dissatisfied customers? Is the dataset balanced or imbalanced?

In [3]:
data['TARGET'].value_counts(normalize=True)

TARGET
0    0.960436
1    0.039564
Name: proportion, dtype: float64

### Note on dataset properties

As you can see, the dataset is highly imbalanced: there are only about 2.6k positive entries against 63.4k negative entries. This should be addressed in the models by setting the `class_weight` parameter where possible (there are different ways to do this; feel free to check the sklearn documentation).

If the model type does not support class weights, expect it to predict the majority class almost all the time. This can be mitigated by tweaking the model parameters.
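
As a rough illustration of what `class_weight='balanced'` does, the sketch below reproduces the weighting formula given in the sklearn documentation, `n_samples / (n_classes * np.bincount(y))`. The class counts are an assumption reconstructed from the shape and proportions printed above, so treat them as approximate.

```python
import numpy as np

# Assumed class counts, reconstructed from data.shape and value_counts() above
n_negative, n_positive = 63408, 2612
y = np.concatenate([np.zeros(n_negative, dtype=int), np.ones(n_positive, dtype=int)])

# 'balanced' heuristic: n_samples / (n_classes * count of each class)
counts = np.bincount(y)
class_weights = len(y) / (len(counts) * counts)

# The rare class gets a much larger weight, so both classes carry equal total weight
print({label: round(float(w), 3) for label, w in enumerate(class_weights)})
```

With these weights, each class contributes the same total weight to the training loss, which is what discourages a tree from always voting for the majority class.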

Separate the data into features `X` and target `y`. Split the data into training and validation sets, holding out 5000 samples for validation and using a stratified split to keep the same level of imbalance.

*Hint: you may use `train_test_split()` for stratified splits.*

In [4]:
from sklearn.model_selection import train_test_split

X = data.copy().drop(columns = 'TARGET')
y = data['TARGET']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 5000, stratify = y)

In [5]:
print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)

(61020, 370) (5000, 370) (61020,) (5000,)


### Basic modelling pipeline

Implement a basic modelling pipeline for a Decision Tree Classifier, fitting the training data and printing the training and validation accuracy.

In [6]:
# fit the training data
# printing the training and validation accuracy

from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV

pipeline_dt = Pipeline([('dtc',DecisionTreeClassifier())])

pipelines = [pipeline_dt]

for pipe in pipelines:
    #pipe.fit(X_train, y_train)
    parameters  = [{'dtc__max_depth': [3], "dtc__min_samples_split": [5]}] 
    scoring = make_scorer(accuracy_score)
    gridCV = GridSearchCV(pipe, parameters, scoring = scoring, cv= 10)
    gridCV.fit(X_train, y_train)
    print(gridCV.best_params_)
    best_model = gridCV.best_estimator_
    print(accuracy_score(y_train, best_model.predict(X_train)))
    print(accuracy_score(y_val, best_model.predict(X_val)))


{'dtc__max_depth': 3, 'dtc__min_samples_split': 5}
0.9604555883316945
0.9604


Note that the prediction score is quite high, even for this very simple model. Take a moment to think about why this high score is not that significant.
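
One way to see why accuracy is misleading here is to score a "model" that ignores the features entirely. The sketch below uses synthetic labels with roughly the 4% positive rate seen above (an assumption, not the real data) and always predicts the majority class.

```python
import numpy as np

# Synthetic labels with roughly the imbalance seen above (about 4% positives)
y_true = np.zeros(10_000, dtype=int)
y_true[:396] = 1

# A "model" that ignores the features and always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
# Fraction of dissatisfied customers (TARGET=1) that the model flags
recall_positive = y_pred[y_true == 1].mean()

print(f"accuracy={accuracy:.4f}, recall on TARGET=1: {recall_positive:.1f}")
```

The accuracy matches the share of the majority class even though no dissatisfied customer is identified, which is why a ranking metric such as ROC AUC is more informative for this task.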

#### ROC curve metric

Change your scoring metric to `roc_auc_score`, which calculates the area under the ROC curve of your **prediction probabilities**, instead of using the binary prediction decisions.

*Hint: Use the probabilities for `y = True` (not `y = False`).*

In [7]:
from sklearn.metrics import roc_auc_score

for pipe in pipelines:
    parameters  = [{'dtc__max_depth': [3], "dtc__min_samples_split": [5]}] 
    scoring = make_scorer(roc_auc_score)
    gridCV = GridSearchCV(pipe, parameters, scoring = scoring, cv= 10)
    gridCV.fit(X_train, y_train)
    best_model = gridCV.best_estimator_

    # Evaluate the best model on the training data using probabilities for y=True
    train_probs = best_model.predict_proba(X_train)[:, 1]
    train_predictions = (train_probs > 0.5).astype(int)
    train_accuracy = accuracy_score(y_train, train_predictions)
    print("Training Accuracy:", train_accuracy)
    
    # Evaluate the best model on the validation data using probabilities for y=True
    val_probs = best_model.predict_proba(X_val)[:, 1]
    val_predictions = (val_probs > 0.5).astype(int)
    val_accuracy = accuracy_score(y_val, val_predictions)
    print("Validation Accuracy:", val_accuracy)

Training Accuracy: 0.9604555883316945
Validation Accuracy: 0.9604
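
The ROC AUC described in this section is computed from probabilities rather than from thresholded decisions. The following is a minimal sketch of that idea on synthetic data from `make_classification` (an assumption standing in for the bank data): `scoring='roc_auc'` is used inside `GridSearchCV`, and `roc_auc_score` is applied to column 1 of `predict_proba`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced stand-in for the bank data (an assumption, not the real file)
X_demo, y_demo = make_classification(
    n_samples=4000, n_features=20, weights=[0.95, 0.05], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=1000, stratify=y_demo, random_state=0
)

# scoring='roc_auc' lets GridSearchCV score on probabilities;
# make_scorer(roc_auc_score) without extra arguments scores the hard 0/1 predictions
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [3, 5], "min_samples_split": [5]},
    scoring="roc_auc",
    cv=5,
)
search.fit(X_tr, y_tr)

# Column 1 of predict_proba holds the probability of the positive class
train_auc = roc_auc_score(y_tr, search.predict_proba(X_tr)[:, 1])
val_auc = roc_auc_score(y_va, search.predict_proba(X_va)[:, 1])
print(f"train ROC AUC={train_auc:.3f}, validation ROC AUC={val_auc:.3f}")
```

In the notebook, the same pattern would apply to `X_train`, `y_train`, `X_val` and `y_val`, with the `dtc__` prefix on the parameter names when a `Pipeline` is used.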


#### Baseline score for random predictions

Calculate the ROC AUC for random uniform prediction probabilities. 

Is the Decision Tree better? Based on the training and validation scores, what is the problem with the Decision Tree model?

*Hint: You can use `np.random.uniform`.*

In [8]:
import numpy as np

random_probs = np.random.uniform(size = 5000)
random_predictions = (random_probs > 0.5).astype(int)
random_predictions_accuracy = accuracy_score(y_val, random_predictions)
print("Random Predictions Accuracy:", random_predictions_accuracy)

Random Predictions Accuracy: 0.4936
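
For the ROC AUC baseline that the task asks for, the uniform random values can be passed to `roc_auc_score` directly, without thresholding. The sketch below assumes synthetic labels with 198 positives out of 5000, which is what a stratified validation split of this size would roughly contain.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic validation labels: 198 positives out of 5000 (assumed, not the real y_val)
y_demo = np.zeros(5000, dtype=int)
y_demo[:198] = 1

# Uniform random "probabilities" carry no information about the labels
random_probs = rng.uniform(size=len(y_demo))
random_auc = roc_auc_score(y_demo, random_probs)
print(f"ROC AUC of random scores: {random_auc:.3f}")
```

A value close to 0.5 is expected for uninformative scores, so a useful model should land clearly above that on the validation set.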


Create a function named `test_model(model, X_train, y_train, X_test, y_test)` that performs the basic prediction pipeline, receiving as argument the model and data, fitting the training data, and returning the training and test prediction scores. Check that it works with the Decision Tree model.

In [9]:
def test_model(model, X_train=X_train, y_train=y_train, X_test= X_val, y_test= y_val):
    # set up the model
    dtc = model
    # fit the training data
    dtc.fit(X_train, y_train)
    # return the training and test prediction scores
    train_score = dtc.score(X_train, y_train)
    test_score = dtc.score(X_test, y_test)
    print('Train Score',train_score)
    print('Test Score',test_score)
    return train_score, test_score


In [10]:
test_model(DecisionTreeClassifier())

Train Score 1.0
Test Score 0.9246


(1.0, 0.9246)
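
The helper above relies on `.score()`, which reports accuracy and is not defined on hand-written ensembles that only expose `.fit()` and `.predict_proba()`. A possible variant, sketched here under the hypothetical name `score_model_auc` and run on synthetic data, returns the ROC AUC from the positive-class probabilities instead.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def score_model_auc(model, X_train, y_train, X_test, y_test):
    """Fit `model` and return (train, test) ROC AUC from predict_proba."""
    model.fit(X_train, y_train)
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return train_auc, test_auc


# Synthetic stand-in data (an assumption; replace with X_train, y_train, X_val, y_val)
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=20, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

train_auc, test_auc = score_model_auc(
    DecisionTreeClassifier(random_state=0), X_tr, y_tr, X_te, y_te
)
print(f"train ROC AUC={train_auc:.3f}, test ROC AUC={test_auc:.3f}")
```

A large gap between the two numbers for an unconstrained tree is the usual sign of overfitting, which is what the next section tries to reduce.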

## Optimizing decision trees 

We can improve the prediction model by tuning the Decision Tree. Check the arguments available for the `DecisionTreeClassifier` class.

Which arguments do you think could improve the validation score? Optimize your model by changing the meta-parameters. Inspect the most important meta-parameter by calculating the training and validation score for different values.

In [11]:
from sklearn.metrics import roc_auc_score

for pipe in pipelines:
    parameters  = [{'dtc__criterion':['gini'],
                    'dtc__splitter':['best'],
                    'dtc__max_depth': [10],
                    'dtc__min_samples_split': [3],
                    'dtc__min_samples_leaf': [2],
                    'dtc__max_leaf_nodes': [90]
               }] 
    scoring = make_scorer(roc_auc_score)
    gridCV = GridSearchCV(pipe, parameters, scoring = scoring, cv= 5)
    gridCV.fit(X_train, y_train)
    #best_model = gridCV.best_estimator_
    print(gridCV.best_params_)


{'dtc__criterion': 'gini', 'dtc__max_depth': 10, 'dtc__max_leaf_nodes': 90, 'dtc__min_samples_leaf': 2, 'dtc__min_samples_split': 3, 'dtc__splitter': 'best'}
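
To inspect a single meta-parameter as suggested above, one option is to sweep it and record the training and validation scores side by side. The sketch below does this for `max_depth` on synthetic data (an assumption standing in for the bank data); the depth values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the bank dataset)
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=20, weights=[0.95, 0.05], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

results = []
for depth in [1, 2, 4, 8, 16]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_auc = roc_auc_score(y_tr, tree.predict_proba(X_tr)[:, 1])
    val_auc = roc_auc_score(y_va, tree.predict_proba(X_va)[:, 1])
    results.append((depth, train_auc, val_auc))
    print(f"max_depth={depth:2d}  train AUC={train_auc:.3f}  val AUC={val_auc:.3f}")
```

The depth at which the validation score stops improving while the training score keeps rising is a reasonable candidate for the grid search.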


To evaluate your models, we will score their predictions on a testing set. Load the test data from `data/test_data.csv`.

In [12]:
test_data = pd.read_csv('data/test_data.csv')
#test_data = pd.read_csv('test_data.csv')


In [13]:
test_data.head()

Unnamed: 0,var3,var15,imp_ent_var16_ult1,imp_op_var39_comer_ult1,imp_op_var39_comer_ult3,imp_op_var40_comer_ult1,imp_op_var40_comer_ult3,imp_op_var40_efect_ult1,imp_op_var40_efect_ult3,imp_op_var40_ult1,...,saldo_medio_var29_ult3,saldo_medio_var33_hace2,saldo_medio_var33_hace3,saldo_medio_var33_ult1,saldo_medio_var33_ult3,saldo_medio_var44_hace2,saldo_medio_var44_hace3,saldo_medio_var44_ult1,saldo_medio_var44_ult3,var38
0,2,48,0.0,203.46,322.68,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,89541.87
1,2,28,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117310.979016
2,2,23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,71340.87
3,2,46,0.0,0.0,12.69,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,144333.6
4,2,34,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117310.979016


In [14]:
X_train = X_train.drop(['Unnamed: 0'],axis=1)


In [15]:
X_val = X_val.drop(['Unnamed: 0'],axis=1)

Calculate the prediction probabilities for the test data with the best Decision Tree, saving them in a variable named `dtc_preds`. `dtc_preds` should be a NumPy array with a single dimension.

In [16]:
best_dtc = gridCV.best_estimator_
best_dtc.fit(X_train, y_train)
# probability of the positive class (TARGET=1) as a one-dimensional array
dtc_preds = best_dtc.predict_proba(test_data)[:, 1]

In [17]:
type(dtc_preds)

numpy.ndarray

### Bagging and Random Forests

While individual Decision Trees are prone to overfitting, ensembles of them can be powerful predictors. Random Forests are essentially Bagging ensembles of decision trees: they average the predictions of multiple decision tree base models, each trained on a different sample of the data.

You will create a Bagging model class, named `myBagging`, filling the class structure below.

The `.fit()` method should fit each base model with a bootstrap sample of the data (with replacement), with the sample size set by the meta-parameter `subsample` as a fraction of the data. That is, if `subsample=0.5`, each base model should get half the total number of samples.

The `.predict_proba()` method should estimate and average the prediction probabilities of the base models.

*Hint: You can use the `resample()` function for creating bootstrap samples.*
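
To make the bootstrap step concrete, the sketch below draws row indices with replacement using plain NumPy, which is swapped in here for `resample()` purely for illustration. The row count and `subsample` value are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 10_000
subsample = 0.5

# Bootstrap: draw row indices with replacement, so some rows repeat and others are left out
n_draws = int(n_rows * subsample)
bootstrap_idx = rng.integers(0, n_rows, size=n_draws)

unique_fraction = len(np.unique(bootstrap_idx)) / n_rows
print(f"{n_draws} draws cover {unique_fraction:.1%} of the original rows")
```

Because each base model sees a different subset of rows, their errors are less correlated, which is what makes averaging their predictions worthwhile.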

In [18]:
from sklearn.utils import resample

class myBagging:
    def __init__(self, base_models, subsample = 1.):
        self.n_models = len(base_models)
        self.base_models = base_models
        self.subsample = subsample
        
    def fit(self, X, y):
        '''Loop over base models, generate a bootstrap sample of the data with 'resample()',
           and fit them to the data.
           
           To access the variables inside the myBagging class, use the 'self.' prefix, 
           i.e. self.base_models, self.n_models and self.subsample
        '''
        n_samples = int(len(X) * self.subsample)
        self.fitted_models = []

        for base_model in self.base_models:
            X_bootstrap, y_bootstrap = resample(X, y, stratify= y, n_samples= n_samples)
            base_model.fit(X_bootstrap, y_bootstrap)
            self.fitted_models.append(base_model)

    def predict_proba(self, X):
        '''Return the ensemble predictions, given by the average prediction probability over base models.
           It should be an array with the length of the dataset.'''
        predictions = np.array([model.predict_proba(X) for model in self.fitted_models])
        return np.mean(predictions, axis=0)
    


Run and score a Random Forest, with 10 base Decision Trees, with maximum depth 10 and subsample 0.5. Use your `myBagging` class and `test_model()`.

In [19]:
# set up 10 base decision trees, each with maximum depth 10
base_models = [DecisionTreeClassifier(max_depth=10) for _ in range(10)]
# create the bagging ensemble
my_bagging = myBagging(base_models=base_models, subsample=0.5)
# fit model
my_bagging.fit(X_train, y_train)
# score on the validation set using the probability of the positive class
val_probs = my_bagging.predict_proba(X_val)[:, 1]
print('Validation ROC AUC:', roc_auc_score(y_val, val_probs))
# predicted class probabilities for the test data
predictions_proba = my_bagging.predict_proba(test_data)
# hard class predictions
predictions = np.argmax(predictions_proba, axis=1)

### Extra-Trees

Extremely Randomized Trees are decision trees in which, at each node split during training, only a fraction of the features is considered for the optimal split (e.g. for optimal Gini gain). This functionality is implemented in `sklearn` under the parameter `max_features`.

Run and score an Extra-Trees version of your Random Forest by changing the `max_features` parameter.

In [20]:
X_train.shape[1]

369

In [21]:
# from sklearn.ensemble import RandomForestClassifier

# for i in range(1,X_train.shape[1],60):
#     rfc = RandomForestClassifier(max_features=i)
#     rfc.fit(X_train, y_train)
#     print('Max Features:',i,'\tScore:',rfc.score(X_val, y_val))


### Sklearn comparison

For comparison, run and score the `sklearn` implementation, `RandomForestClassifier`.
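
A minimal sketch of such a comparison is shown below, on synthetic data standing in for the bank dataset (an assumption). The settings are chosen to mirror the hand-made ensemble described earlier (10 trees, depth 10, half of the rows per tree); `class_weight='balanced'` is an extra choice motivated by the class imbalance.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; use X_train, y_train, X_val, y_val in the notebook)
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=20, weights=[0.95, 0.05], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Settings mirror the hand-made ensemble: 10 trees, depth 10, half the rows per tree
rfc = RandomForestClassifier(
    n_estimators=10,
    max_depth=10,
    max_samples=0.5,
    class_weight="balanced",
    random_state=0,
)
rfc.fit(X_tr, y_tr)

val_auc = roc_auc_score(y_va, rfc.predict_proba(X_va)[:, 1])
print(f"RandomForestClassifier validation ROC AUC: {val_auc:.3f}")
```

Scoring both implementations with the same metric and the same split keeps the comparison fair.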

### Optimize your Random Forest

Optimize your Random Forest meta-parameters, both those of `myBagging` and those of the Decision Trees, and make your predictions for the test data, saving the predictions under `rf_preds`.

In [22]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

params = dict(
    max_depth=[7],
    #min_samples_split=[2, 5, 10],
    max_features=['sqrt'],
    #min_samples_leaf=[1, 2, 4],
    class_weight=['balanced'],
    #n_estimators=[100, 200, 300],
    bootstrap = [True]#,False]
)

rfc1 = RandomForestClassifier(n_jobs= -1)

gcv = GridSearchCV(estimator=rfc1, param_grid=params, n_jobs=-1, cv= 5)
gcv.fit(X_train, y_train)

print(f"Best model accuracy: {gcv.best_score_}")

print(f"Best hyperparameters: {gcv.best_params_}")

best_model = gcv.best_estimator_


Best model accuracy: 0.7313339888561128
Best hyperparameters: {'bootstrap': True, 'class_weight': 'balanced', 'max_depth': 7, 'max_features': 'auto'}


In [23]:

rf_preds = best_model.predict(test_data)


In [24]:
rf_preds

array([0, 1, 1, ..., 1, 0, 0], dtype=int64)

Note that including more decision trees tends to improve performance but increases the computational cost of training linearly. The `max_depth` and `max_features` arguments can cut the training time considerably by reducing the tree size and the number of features considered at each split.

## Gradient Boosting

We will now implement a more sophisticated ensemble, Gradient Boosting, in which the base models are trained sequentially. Each new base model predicts what previous base models missed. 

As gradient boosting requires a continuous gradient, it can only use regression models for the base learner. 

For this exercise, we will perform regression directly on the 0-1 class labels, and treat the raw outputs as probabilities. 

We will set up the base models to optimise the MSE loss function against the class labels, for which the negative gradient is simply the residual error (up to a constant factor).

When applied to probabilities, the MSE is known as the Brier score. 
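
The claim that the gradient reduces to the residual can be checked numerically. The sketch below assumes the loss is written as half the squared error, so that the constant factor is exactly one, and uses made-up labels and predictions. It also computes the Brier score for the same values.

```python
import numpy as np

y = np.array([0.0, 0.0, 1.0, 0.0, 1.0])      # 0-1 class labels (made up)
preds = np.array([0.1, 0.4, 0.3, 0.0, 0.9])  # current ensemble outputs (made up)


def half_squared_error(p):
    """Per-sample loss 0.5 * (y - p)^2."""
    return 0.5 * (y - p) ** 2


# Central finite-difference estimate of d(loss)/d(prediction) for each sample
eps = 1e-6
grad = (half_squared_error(preds + eps) - half_squared_error(preds - eps)) / (2 * eps)

residual = y - preds
brier = np.mean((preds - y) ** 2)

print("negative gradient:", (-grad).round(6))
print("residual:         ", residual)
print("Brier score:      ", brier)
```

Fitting each new base model to `residual` is therefore one step of gradient descent on this loss, taken in the space of predictions.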

Whilst performing this exercise, have a think about whether this is a robust approach. 

If not, what would you change either to your base-learners, meta-algorithm, or evaluation metrics to make this more robust?

You will have a chance to implement your suggestions tomorrow!

In the structure below, fill in the `.fit()` and `.predict_proba()` functions.

In [25]:
class myGradientBoosting:
    
    def __init__(self, base_models, learning_rate=0.5):
        self.n_models = len(base_models)
        self.models = base_models
        self.learning_rate = learning_rate
    
    def fit(self, x, y):
        ''' The `.fit()` function should loop over each base model 
         fitting it to the residual of the ensemble predictions so far, for the MSE loss:
         
         predictions = 0
         for each base model:
             residual = y - predictions   
             fit base model and make new predictions
             predictions = predictions + learning_rate * new_prediction 
        '''
        predictions = np.zeros(len(x))
        for i, model in enumerate(self.models):
            residual = y - predictions
            print(f"Fitting model {i + 1}/{self.n_models}")
            model.fit(x,residual)
            new_prediction = model.predict(x)
            predictions += self.learning_rate * new_prediction
            print(f"Model {i + 1} fitted. Predictions updated.")
                   
    def predict_proba(self, x):
        ''' Generate the ensemble prediction, by looping over each base model.
            Get their predictions and sum them, scaled by the learning rate.
        
            Trick: Regressor models return only one prediction (instead of two probabilities in the Classifiers).
                   To make your class compatible with test_model(), you can repeat the predictions, e.g.:
                   predictions.reshape(-1,1).repeat(2,axis=1)'''
        predictions = np.zeros(len(x))
        for i, model in enumerate(self.models):
            model_predictions = model.predict(x)
            predictions += self.learning_rate * model_predictions
            print(f"Model {i + 1} predictions added.")
        return predictions.reshape(-1,1).repeat(2,axis=1)
        


Run and score a Gradient Boosting model, with 20 base decision trees, with maximum depth 5, maximum feature 0.5 and learning rate 0.5. Use your `myGradientBoosting` class and `test_model()`. 

In [26]:
from sklearn.tree import DecisionTreeRegressor
#set up model

#X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

base_models = [DecisionTreeRegressor(max_depth=5, max_features=0.5, random_state=i) for i in range(20)]

model = myGradientBoosting(base_models, learning_rate=0.5)
                           
model.fit(X_train, y_train)

# Predict using the model
predictions_proba = model.predict_proba(X_val)

print("Predictions:")
print(predictions_proba)                          
                           

Fitting model 1/20
Model 1 fitted. Predictions updated.
Fitting model 2/20
Model 2 fitted. Predictions updated.
Fitting model 3/20
Model 3 fitted. Predictions updated.
Fitting model 4/20
Model 4 fitted. Predictions updated.
Fitting model 5/20
Model 5 fitted. Predictions updated.
Fitting model 6/20
Model 6 fitted. Predictions updated.
Fitting model 7/20
Model 7 fitted. Predictions updated.
Fitting model 8/20
Model 8 fitted. Predictions updated.
Fitting model 9/20
Model 9 fitted. Predictions updated.
Fitting model 10/20
Model 10 fitted. Predictions updated.
Fitting model 11/20
Model 11 fitted. Predictions updated.
Fitting model 12/20
Model 12 fitted. Predictions updated.
Fitting model 13/20
Model 13 fitted. Predictions updated.
Fitting model 14/20
Model 14 fitted. Predictions updated.
Fitting model 15/20
Model 15 fitted. Predictions updated.
Fitting model 16/20
Model 16 fitted. Predictions updated.
Fitting model 17/20
Model 17 fitted. Predictions updated.
Fitting model 18/20
Model 18 fit

For comparison, run and score the `sklearn` implementation, `GradientBoostingClassifier`.

In [27]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report

# Initialize the GradientBoostingClassifier
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

# Fit the model to the training data
gbc.fit(X_train, y_train)

# Predict on the validation data
y_pred = gbc.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, y_pred)
report = classification_report(y_val, y_pred)

print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(report)


Accuracy: 0.9606
Classification Report:
              precision    recall  f1-score   support

           0       0.96      1.00      0.98      4802
           1       0.67      0.01      0.02       198

    accuracy                           0.96      5000
   macro avg       0.81      0.50      0.50      5000
weighted avg       0.95      0.96      0.94      5000



Optimize your myGradientBoosting and decision tree meta-parameters, and make your predictions for the test data, saving the predictions under `gb_preds`.

In [28]:
# X_train = X_train.drop(['Unnamed: 0'],axis=1)
# X_val = X_val.drop(['Unnamed: 0'],axis=1)

In [29]:
# Set up the parameter grid to search over
param_grid = {
    #'n_estimators': [50, 100, 200],
    'learning_rate': [0.01],#, 0.1, 0.5],
    'max_depth': [7],#3, 5, 7],
    #'subsample': [0.7, 0.8, 0.9],
    'max_features': [0.5]#, 0.7], 1.0]
}

# Initialize the GradientBoostingClassifier
gbc = GradientBoostingClassifier(random_state=42)

# Set up GridSearchCV
grid_search = GridSearchCV(estimator=gbc, param_grid=param_grid, cv=3, scoring='accuracy', n_jobs=-1, verbose=2)

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Print the best parameters and best score
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Cross-Validation Score: {grid_search.best_score_}")

# Train the best model on the full training set
best_model = grid_search.best_estimator_
best_model.fit(X_train, y_train)

# Predict on the test data
gb_preds = best_model.predict(test_data)

# Evaluate the model
# accuracy = accuracy_score(y_test, y_pred)
# report = classification_report(y_test, y_pred)

# print(f"Test Accuracy: {accuracy}")
# print("Classification Report:")
# print(report)

Fitting 5 folds for each of 1 candidates, totalling 5 fits
Best Parameters: {'learning_rate': 0.01, 'max_depth': 7, 'max_features': 0.5}
Best Cross-Validation Score: 0.9604883644706653


Try to think about the difference between your implementation and the GradientBoostingClassifier.

Are there any fundamental differences? If so, why?

You could try looking at the distribution of your output probabilities for each model.
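
One possible way to start that comparison is sketched below on synthetic data (an assumption, not the bank dataset). It repeats the residual-fitting loop from the pseudocode in `myGradientBoosting.fit` with regression trees, fits a `GradientBoostingClassifier` with default loss, and prints the range of outputs from each. The raw regression outputs are not constrained to lie between 0 and 1, whereas the classifier's probabilities are.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the bank dataset)
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=20, weights=[0.95, 0.05], random_state=0
)

# Residual boosting on the raw 0-1 labels, following the pseudocode in myGradientBoosting.fit
learning_rate = 0.5
raw_scores = np.zeros(len(X_demo))
for seed in range(20):
    tree = DecisionTreeRegressor(max_depth=5, max_features=0.5, random_state=seed)
    tree.fit(X_demo, y_demo - raw_scores)
    raw_scores += learning_rate * tree.predict(X_demo)

# sklearn's classifier boosts in log-odds space and maps the result through a sigmoid
gbc = GradientBoostingClassifier(n_estimators=20, max_depth=5, random_state=0)
gbc_probs = gbc.fit(X_demo, y_demo).predict_proba(X_demo)[:, 1]

print(f"residual boosting outputs:  min={raw_scores.min():.3f}, max={raw_scores.max():.3f}")
print(f"GradientBoostingClassifier: min={gbc_probs.min():.3f}, max={gbc_probs.max():.3f}")
```

Plotting histograms of `raw_scores` and `gbc_probs` would give a fuller picture than the minimum and maximum alone.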