# Introduction to ensemble methods

In this practical, you will construct your own bagging ensemble to improve performance on a binary classification problem.


## The problem

Users of your service sometimes close their accounts, even though they have been with you for a while. It would be useful if you could predict which ones are likely to do this, before it happens, so you can take action to retain their custom.

You have access to 20 different features for each user, based on data such as how often they log in, how many purchases they make, etc. The aim is to see if you can predict whether a user will end up closing their account or not.

## Getting started

First, load the data from `data/account_data.csv` using `pandas` into a DataFrame called `data`.

Find out how many users left the service by examining the `closed_account` column. 

*Hint: use the `.value_counts()` method*
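
As a small illustration of the hint, the sketch below runs `.value_counts()` on a made-up Series. The values are hypothetical and are not taken from the account data. Passing `normalize=True` returns proportions instead of counts, which can make the class balance easier to read.

```python
import pandas as pd

# Hypothetical labels for illustration only: 1 = closed, 0 = still open
toy_labels = pd.Series([0, 1, 0, 0, 1, 0, 1, 0], name="closed_account")

# Absolute number of rows per class
counts = toy_labels.value_counts()

# Fraction of rows per class
proportions = toy_labels.value_counts(normalize=True)

print(counts)
print(proportions)
```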

In [None]:
import pandas as pd

# Your code here...

data = pd.read_csv('data/account_data.csv')

print(data['closed_account'].value_counts())

print('Around half the accounts closed.')



## Setting up for classification - I

Before proceeding, we will split the data into training and testing subsets.

Create two variables, `X` and `y`, from the `data` DataFrame.

`X` should be all the columns that you want to use in your prediction (i.e. all except for `closed_account`).

`y` should be the column that you are trying to predict (i.e. `closed_account`).

These are the conventional variable names used in machine learning to represent your predicting features (`X`) and the values you are trying to predict (`y`).

In [None]:
# Your code here...

X = data.drop('closed_account', axis=1)
y = data['closed_account']



## Setting up for classification - II

Models will have access to the training set in order to learn, but we will evaluate them on the unseen test set.

Use `train_test_split` from `sklearn.model_selection` to split the data up. 

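This function takes your `X` and `y` (our 20 features, and the closed/not closed label) and splits each into a training part and a testing part. It returns four objects: the train and test portions of `X`, followed by the train and test portions of `y`. When given pandas inputs, these are DataFrames and Series rather than plain lists.

By default, 25% of the rows are held out for testing, which gives a 3:1 train/test split.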

Call the new variables `X_train`, `X_test`, `y_train`, and `y_test`.

(Set `random_state` to `5`, so that results are the same every time!)
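
To make the default split size concrete, here is a minimal sketch on made-up data (the `X_toy` and `y_toy` names and their contents are hypothetical). It also passes `stratify`, an optional argument the exercise does not require, which keeps the class proportions roughly equal in both subsets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 users, 3 features, balanced labels
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f0", "f1", "f2"])
y_toy = pd.Series([0, 1] * 50, name="closed_account")

# Default test_size is 0.25, so 100 rows become 75 train and 25 test rows
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=5, stratify=y_toy
)

print(X_tr.shape, X_te.shape)

# With stratify, the share of class 1 stays close to 0.5 in both subsets
print(y_tr.mean(), y_te.mean())
```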

In [None]:
from sklearn.model_selection import train_test_split

# Your code here...

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)



## Single model performance - I

Before trying an ensemble, we need some baselines in order to determine whether the ensemble method actually improved anything.

We will use a simple Naive Bayes classifier here with default parameters.

Import the required class and instantiate it with the variable name `model`.
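
One extra reference point, which is not part of the original exercise, is a trivial baseline that ignores the features entirely. The sketch below assumes synthetic data from `make_classification` as a stand-in for the account data, and uses scikit-learn's `DummyClassifier` to show the accuracy you would get by always predicting the most common class.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the account data (assumption: 20 features, 2 classes)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=5)

# Always predicts the most common training class, ignoring the features
dummy = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
nb = GaussianNB().fit(X_tr, y_tr)

dummy_acc = dummy.score(X_te, y_te)
nb_acc = nb.score(X_te, y_te)

print(f"Dummy baseline accuracy: {dummy_acc:.2f}")
print(f"GaussianNB accuracy:     {nb_acc:.2f}")
```

Any real model should comfortably beat the dummy score before its results are worth interpreting.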


In [None]:
from sklearn.naive_bayes import GaussianNB

# Your code here...

model = GaussianNB()



## Single model performance - II

Use the `.fit()` method of the model to train it on the training subset of the data.

Then call the `.score()` method on the same training data to see how well the model performs on the data it learned from. For classifiers, `.score()` returns the accuracy.

Use `model.predict()` to generate predictions for `X_train` and compare these predictions to the ground truth (`y_train`) using a classification report.
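
The precision and recall figures in a classification report come from counting correct and incorrect predictions per class. The sketch below works through that arithmetic on a handful of made-up labels (the `y_true` and `y_hat` values are hypothetical) and checks the result against scikit-learn's own helpers.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions for illustration only
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 0, 1, 1, 1, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Of the users flagged as closing, what fraction really closed?
precision = tp / (tp + fp)

# Of the users who really closed, what fraction were flagged?
recall = tp / (tp + fn)

print(tn, fp, fn, tp)
print(precision, recall)

# The same figures via scikit-learn's helper functions
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat))
```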

In [None]:
from sklearn.metrics import classification_report

# Your code here...

model.fit(X_train, y_train)

print(f"Accuracy on observed data: {model.score(X_train, y_train):.2f}")

y_pred = model.predict(X_train)

print(classification_report(y_train, y_pred))



## Single model performance - III

Repeat the above scoring process but now use the unseen test data.

Do you expect accuracy to change? If so, in what direction?

What are your thoughts on how well the model performs on the two classes?

In [None]:
# Your code here...

y_pred = model.predict(X_test)

print(f"Accuracy on unseen data: {model.score(X_test, y_test):.2f}")

print("Accuracy is far lower on the unseen data, which is to be expected.")

print(classification_report(y_test, y_pred))

print("The model does quite poorly on the class we are most interested in - the users that will close their accounts.")



## Bagging ensemble - I

Now let's see what effect a simple bagging ensemble has.

`sklearn` has a useful utility class that will handle this for us: `sklearn.ensemble.BaggingClassifier`

This has several parameters which can be tuned, but the main ones are:

`estimator` - the model to use (this parameter was named `base_estimator` in scikit-learn versions before 1.2, and the old name was removed in 1.4)

`n_estimators` - how many models to use. Default is 10.

`random_state` - ensures the same results every time

Create a bagging classifier named `ensemble`, using `GaussianNB()` as the model to be bagged. Keep `n_estimators` at 10 and set `random_state` to 5.
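
To show what the utility class is doing for you, here is a hand-rolled sketch of bagging: draw bootstrap samples of the training rows, fit one model per sample, and combine their predictions. It assumes synthetic data from `make_classification` in place of the account data, and it combines models by simple majority vote. This is a simplification: `BaggingClassifier` averages predicted probabilities when the underlying model provides them, so the two approaches will not necessarily give identical results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the account data (assumption, not the real dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=5)

rng = np.random.default_rng(5)
n_models = 10
members = []

for _ in range(n_models):
    # Bootstrap sample: draw len(X_tr) row indices with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    members.append(GaussianNB().fit(X_tr[idx], y_tr[idx]))

# Stack each member's 0/1 predictions: shape (n_models, n_test_rows)
votes = np.stack([m.predict(X_te) for m in members])

# Majority vote across members (a tie is resolved in favour of class 1)
manual_pred = (votes.mean(axis=0) >= 0.5).astype(int)

manual_acc = (manual_pred == y_te).mean()
print(f"Hand-rolled bagging accuracy: {manual_acc:.2f}")
```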

In [None]:
from sklearn.ensemble import BaggingClassifier

# Your code here...

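# The model is passed positionally because its keyword name differs between
# releases: `estimator` in scikit-learn 1.2+, `base_estimator` before that.
ensemble = BaggingClassifier(
    GaussianNB(),
    n_estimators=10,
    random_state=5
)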



## Bagging ensemble - II

Once you have created this meta-model, it works in the same way as any other model in `sklearn`: it has methods such as `.fit()`, `.score()` and `.predict()`.

Repeat the training, scoring and classification report steps that you used for the single model.

Train only on the training data and test only on the test data.

What do you observe now?

In [None]:
# Your code here...

ensemble.fit(X=X_train, y=y_train)

print(f"Ensemble accuracy on unseen data: {ensemble.score(X_test, y_test):.2f}")

y_pred = ensemble.predict(X_test)

print(classification_report(y_test, y_pred))

print("Performance for both classes has improved quite a bit, though the most important class (1) is still quite hard to predict accurately.")
print("We are performing marginally better than random guessing, so you could flip a coin to decide and get similar results!")



## Conclusions

Using an ensemble of models improved performance on a binary classification task, though performance was still generally quite low.

This may be because the signal-to-noise ratio in the data is low (we have used synthetic data), meaning the model needs to make predictions with very little information. Even so, what could you investigate in order to improve performance on this task?
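
One such investigation can be sketched as a small sweep over the ensemble size. The example below assumes synthetic data from `make_classification` rather than the account data, and the candidate values for `n_estimators` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the account data (assumption, not the real dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=5)

scores = {}
for n in (1, 5, 10, 50):
    # The model is passed positionally so the call does not depend on the
    # keyword name, which changed from base_estimator to estimator in 1.2
    clf = BaggingClassifier(GaussianNB(), n_estimators=n, random_state=5)
    scores[n] = clf.fit(X_tr, y_tr).score(X_te, y_te)

for n, acc in scores.items():
    print(f"n_estimators={n:>2}: test accuracy {acc:.2f}")
```

A single train/test split gives noisy estimates, so small differences between rows of this output should not be over-interpreted. Cross-validation (for example with `cross_val_score`) would give a steadier comparison.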

In [None]:
# Your code below

print("The Naive Bayes model may not be the best suited to this task. You could try an SVM or Logistic Regression. It would be as simple as changing the model passed to BaggingClassifier.")
print("Perhaps more estimators are needed in the ensemble, or fewer. You could vary this and observe the effect on performance.")
print("There might not be enough training data to learn from. Try adjusting the ratio of train:test data from 75:25 to perhaps 90:10.")

