# Workshop: Analyzing bank marketing data with scikit-learn

Task: Your client has given you a dataset and has asked you to build a model to predict whether a given customer is likely to purchase a bank term deposit.

Build this model by working through the usual steps for a classification problem:

1. Train the model
2. Evaluate the model
3. Tune / improve the model
4. Use the model to predict the probability of future outcomes

## 1. Import libraries

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

pd.options.display.max_columns = 50

## 2. Load and explore the data

In [None]:
df = pd.read_csv('./data/bank-marketing-data/bank-additional-one-hot-encoded.csv')

Based on the dataset's [README](http://archive.ics.uci.edu/ml/datasets/Bank+Marketing), we know that the data relates to the direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable `y`).

### Data exploration

In [None]:
df.head()

In [None]:
df.describe()

## 3. Prepare / clean the data for modeling

### Convert pandas dataframe into 2 arrays for consumption

In [None]:
y = df['y'].tolist()

In [None]:
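# Drop the target column so that only the feature columns remain
del df['y']
X = df.to_numpy()  # DataFrame.as_matrix() was removed in pandas 1.0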

### Split data into train and test set

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
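
Because the classes are imbalanced, a purely random split can leave the train and test sets with slightly different shares of positive cases. The sketch below shows one option, the `stratify` argument of `train_test_split`, on a hypothetical stand-in dataset (`X_demo`, `y_demo` are made up for illustration and are not the bank data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the bank data: 1000 rows, 10% positives.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([1] * 100 + [0] * 900)

# stratify=y_demo keeps the share of positives the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

print("positive share in train:", y_tr.mean())
print("positive share in test: ", y_te.mean())
```

In the workshop code this would amount to adding `stratify=y` to the existing call; whether it changes the results noticeably depends on the data.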

## 4. Train the model!

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [None]:
model = LogisticRegression()
model.fit(X_train, y_train)

## 5. Evaluate the model

### Evaluation method 1: `.score(X, y)`

In [None]:
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

print("training set score: %f" % train_score)
print("test set score:     %f" % test_score)
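
For a classifier, `.score(X, y)` returns the mean accuracy. On an imbalanced dataset that number can look good even for a model that never predicts the positive class, so it helps to compare it against a trivial baseline. A minimal sketch, using scikit-learn's `DummyClassifier` on made-up labels (`X_demo` and `y_demo` are illustrative assumptions, not the bank data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels with a 9:1 imbalance; the features are irrelevant here.
y_demo = np.array([0] * 900 + [1] * 100)
X_demo = np.zeros((1000, 1))

# A "model" that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

baseline_accuracy = baseline.score(X_demo, y_demo)
print("baseline accuracy: %f" % baseline_accuracy)
```

If the logistic regression's test score is only slightly above such a baseline, accuracy alone is not telling us much, which is why the next evaluation methods look at the individual error types.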

### Evaluation method 2: `.confusion_matrix(expected, predicted)`

In [None]:
expected = y
predicted = model.predict(X)

confusion_matrix = metrics.confusion_matrix(expected, predicted)
print("CONFUSION MATRIX")
print(confusion_matrix)

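scikit-learn puts the actual classes in the rows and the predicted classes in the columns, with the labels sorted in ascending order. For our binary target (labels `0` and `1`), the confusion matrix is therefore in the following format:

```
[[true_negative , false_positive]
 [false_negative, true_positive]]
```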
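
To work with the individual counts rather than the printed matrix, the 2x2 array can be flattened. For binary labels `0` and `1`, `ravel()` yields the counts in the order tn, fp, fn, tp. A small sketch on made-up labels (`expected_demo` and `predicted_demo` are illustrative only):

```python
from sklearn import metrics

# Hypothetical actual and predicted labels.
expected_demo = [0, 0, 0, 0, 1, 1, 1, 0]
predicted_demo = [0, 0, 1, 0, 1, 0, 1, 0]

cm = metrics.confusion_matrix(expected_demo, predicted_demo)

# Rows are actual classes, columns are predicted classes.
tn, fp, fn, tp = (int(v) for v in cm.ravel())

print("true negatives: ", tn)
print("false positives:", fp)
print("false negatives:", fn)
print("true positives: ", tp)
```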

### Evaluation method 3: `.classification_report(expected, predicted)`

In [None]:
report = metrics.classification_report(expected, predicted)

print("CLASSIFICATION REPORT")
print(report)
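
The precision and recall columns of the report can also be computed one at a time, which makes it easier to see what they measure for the positive class: precision is tp / (tp + fp) and recall is tp / (tp + fn). A minimal sketch on made-up labels (not the bank data):

```python
from sklearn import metrics

# Hypothetical labels: 8 negatives and 4 positives.
expected_demo = [0] * 8 + [1] * 4
# The predictions catch only one of the four positives (tp=1, fn=3, fp=1, tn=7).
predicted_demo = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

accuracy = metrics.accuracy_score(expected_demo, predicted_demo)
precision = metrics.precision_score(expected_demo, predicted_demo)  # tp / (tp + fp)
recall = metrics.recall_score(expected_demo, predicted_demo)        # tp / (tp + fn)
f1 = metrics.f1_score(expected_demo, predicted_demo)

print("accuracy:  %f" % accuracy)
print("precision: %f" % precision)
print("recall:    %f" % recall)
print("f1:        %f" % f1)
```

Here accuracy is 8/12 while recall for the positive class is only 1/4, the same pattern to watch for in the report above.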

## 6. Tune / improve the model

### Automated parameter tuning with GridSearchCV

`sklearn` offers an API for systematically searching for the parameters that produce the best-scoring model. `GridSearchCV` takes an **estimator** and a **param_grid** dictionary listing the values to try for each parameter, fits every combination with cross-validation, and keeps the combination with the best mean score (accuracy by default for classifiers).

You can refer to the [LogisticRegression API docs](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) for a list of what can be included in the `param_grid` dictionary.

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
logistic_regression_model = LogisticRegression()

param_grid = {'C': [0.01, 0.1, 1, 10],
              'class_weight': [{
                  0: 1, 
                  1: 2
              },
              {
                  0: 1, 
                  1: 1.2
              },
              {
                  0: 1, 
                  1: 1.4
              }
              ]}

grid = GridSearchCV(estimator=logistic_regression_model, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best estimator:", grid.best_estimator_)
print("Best score:", grid.best_score_)
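
By default the search ranks candidates by accuracy. On imbalanced data it can be worth ranking them by a metric that focuses on the positive class instead, using the `scoring` parameter of `GridSearchCV`. The sketch below uses `scoring="f1"` on a synthetic dataset from `make_classification`; the dataset, the grid values and `max_iter=1000` are assumptions for illustration, not settings from this workshop.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data (roughly 90% class 0, 10% class 1).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0
)

grid_demo = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1, 10]},
    scoring="f1",  # rank candidates by F1 on the positive class, not accuracy
    cv=5,
)
grid_demo.fit(X_demo, y_demo)

print("Best params:", grid_demo.best_params_)
print("Best mean F1:", grid_demo.best_score_)
```

`scoring="recall"` or `scoring="precision"` are alternatives when one type of error matters more to the client than the other.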

After tuning with `GridSearchCV`, we can identify the parameter combination from our grid that gives the best cross-validated score. Namely:
- `C=0.01`
- `class_weight={0: 1, 1: 1.2}`
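
It is often useful to see how close the other candidates came, not only the winner. The fitted search object keeps every result in its `cv_results_` dictionary. A minimal sketch, again on synthetic data (the dataset and the reduced grid are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, weights=[0.9, 0.1], random_state=1
)

param_grid_demo = {
    "C": [0.01, 1],
    "class_weight": [{0: 1, 1: 1.2}, {0: 1, 1: 2}],
}
grid_demo = GridSearchCV(LogisticRegression(max_iter=1000), param_grid_demo, cv=5)
grid_demo.fit(X_demo, y_demo)

# One entry per parameter combination, sorted by rank (1 = best).
results = grid_demo.cv_results_
ranked = sorted(
    zip(results["rank_test_score"], results["mean_test_score"], results["params"]),
    key=lambda row: row[0],
)

for rank, mean_score, params in ranked:
    print(rank, round(float(mean_score), 4), params)
```

If several combinations score almost identically, the "best" parameters are partly a matter of chance, and the simpler or more interpretable setting may be just as good.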

### Evaluating our tuned model

Hint: use `.score(X, y)`, `.confusion_matrix(expected, predicted)`, `.classification_report(expected, predicted)`

In [None]:
train_score_of_grid_model = grid.score(X_train, y_train)
test_score_of_grid_model = grid.score(X_test, y_test)

print("training set score: %f" % train_score_of_grid_model)
print("test set score:     %f" % test_score_of_grid_model)

In [None]:
expected_for_grid = y
predicted_for_grid = grid.predict(X)

confusion_matrix = metrics.confusion_matrix(expected_for_grid, predicted_for_grid)
print("CONFUSION MATRIX")
print(confusion_matrix)

In [None]:
report_grid = metrics.classification_report(expected_for_grid, predicted_for_grid)

print("CLASSIFICATION REPORT")
print(report_grid)

At this point, you've probably tried everything and still can't reduce the false negatives any further. This is due to the **imbalanced** nature of the dataset (we have 10 times more cases of y=0 than of y=1). What else can we do?

Answers:

1) Weighting (We've tried this above)

2) Thresholding (see [example implementation](https://github.com/davified/learn-scikit-learn/blob/master/bank-data-model.ipynb) in LogisticRegressionWithThreshold class)

3) Sampling (resample so that your sample has roughly equal numbers of y=0 and y=1 cases)

[Read more](https://stackoverflow.com/questions/26221312/dealing-with-the-class-imbalance-in-binary-classification/26244744#26244744)
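
As a rough illustration of thresholding: `.predict()` effectively labels a case as positive when its predicted probability exceeds 0.5, and lowering that cut-off trades false negatives for false positives. The sketch below is not the linked `LogisticRegressionWithThreshold` class; it applies a threshold by hand to `predict_proba` output on synthetic data, and the helper `false_negatives` and the threshold 0.2 are hypothetical choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 90% class 0, 10% class 1).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Probability of the positive class (column 1, since clf.classes_ is [0, 1]).
proba_pos = clf.predict_proba(X_te)[:, 1]


def false_negatives(threshold):
    """Count false negatives when predicting 1 for proba >= threshold."""
    predicted = (proba_pos >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, predicted, labels=[0, 1]).ravel()
    return int(fn)


fn_default = false_negatives(0.5)
fn_lowered = false_negatives(0.2)

print("false negatives at threshold 0.5:", fn_default)
print("false negatives at threshold 0.2:", fn_lowered)
```

Lowering the threshold can never increase the number of false negatives, but it usually increases false positives, so the right value depends on what each kind of mistake costs the client.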

## 7. Using the model to predict outcomes based on fresh/unseen data

Load new data from './data/bank-marketing-data/bank-unseen-data.csv'

In [None]:
df_new = pd.read_csv('./data/bank-marketing-data/bank-unseen-data.csv')

In [None]:
df_new.head()

In [None]:
X_new = df_new.to_numpy()  # DataFrame.as_matrix() was removed in pandas 1.0

In [None]:
print(grid.predict_proba(X_new))
print(grid.predict(X_new))
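
`predict_proba` returns one column per class, in the order given by the model's `classes_` attribute, and each row sums to 1. To rank customers by how likely they are to subscribe, we only need the column for the positive class. A small sketch on toy data (`X_demo`, `y_demo` and `X_fresh` are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: larger feature values go with the positive class.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X_demo, y_demo)

# Hypothetical unseen customers.
X_fresh = np.array([[0.5], [4.5], [2.5]])
proba = clf.predict_proba(X_fresh)

# Find the column that belongs to class 1 instead of assuming its position.
positive_col = list(clf.classes_).index(1)
p_yes = proba[:, positive_col]

# Row indices of the customers, most likely subscriber first.
ranking = np.argsort(-p_yes)

print("P(y=1):", p_yes)
print("ranking:", ranking)
```

The same indexing should work on the `grid` object above, since a fitted `GridSearchCV` exposes `classes_` and `predict_proba` through its best estimator.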