# Workshop: Analyzing bank marketing data with scikit-learn

TODO: davidtan [2017-06-27]:
- update cell on GridSearchCV
- remove predict_with_threshold() if possible

Task: Your client has given you a dataset and has asked you to build a model to predict whether a given customer is likely to purchase a bank term deposit.

Build this model by going through the process of tackling classification problems:

1. Train the model
2. Evaluate the model
3. Tune / improve the model
4. Use the model to predict the probability of future outcomes

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

pd.options.display.max_columns = 50

## 2. Load and explore the data

In [2]:
df = pd.read_csv('./data/bank-marketing-data/bank-additional-one-hot-encoded.csv')

Based on the dataset's [README](http://archive.ics.uci.edu/ml/datasets/Bank+Marketing), the data comes from direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether a client will subscribe to a term deposit (variable `y`).

### Data exploration

In [None]:
df.head()

In [None]:
df.describe()
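
Before modeling, it can help to check how the target classes are distributed, since a heavily imbalanced target changes how accuracy should be read. The sketch below uses a small hypothetical DataFrame (`toy`) in place of the real file; it only assumes that the target column is named `y` and is encoded as 0/1.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset; only the
# target column `y` (0 = no subscription, 1 = subscription) matters here.
toy = pd.DataFrame({"y": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]})

# Share of each class in the target column.
class_share = toy["y"].value_counts(normalize=True)
print(class_share)
```

On the real data, the same call would be `df['y'].value_counts(normalize=True)`, run before `y` is removed from `df`.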

## 3. Prepare / clean the data for modeling

### Convert pandas dataframe into 2 arrays for consumption

In [9]:
y = df['y'].tolist()

In [10]:
del df['y']
X = df.to_numpy()  # DataFrame.as_matrix() was removed in pandas 1.0

### Split data into train and test set

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
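
When one class is much rarer than the other, a plain random split can leave the train and test sets with slightly different class proportions. A possible variation, shown here on hypothetical toy arrays (`X_toy`, `y_toy`) rather than the workshop data, is to pass `stratify` so that both parts keep the same class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 90 negatives, 10 positives.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 90 + [1] * 10)

# stratify=y_toy keeps the 90/10 class ratio in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

print("positives in train:", y_tr.sum(), "of", len(y_tr))
print("positives in test: ", y_te.sum(), "of", len(y_te))
```

This is only a sketch of the option. The split used in this notebook is not stratified, and the recorded outputs below come from that unstratified split.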

## 4. Train the model!

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [15]:
model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

## 5. Evaluate the model

### Evaluation method 1: `.score(X, y)`

In [16]:
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

print("training set score: %f" % train_score)
print("test set score:     %f" % test_score)

training set score: 0.908906
test set score:     0.912013
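
For a classifier, `.score` reports accuracy, so it is worth comparing it against the accuracy of always predicting the majority class. The sketch below computes that baseline from the class counts shown in the "support" column of the classification report further down. Those counts cover the full dataset rather than the test split alone, so the result is an approximate reference point, not an exact test-set baseline.

```python
# Class counts taken from the "support" column of the classification report
# in this notebook (full dataset, not only the test split).
n_negative = 36548
n_positive = 4640

# Accuracy of a model that always predicts the more frequent class.
baseline_accuracy = max(n_negative, n_positive) / (n_negative + n_positive)
print("majority-class baseline accuracy: %f" % baseline_accuracy)
```

The baseline comes out at roughly 0.887, so a score of about 0.91 is a smaller improvement than it first appears.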


### Evaluation method 2: `.confusion_matrix(expected, predicted)`

In [17]:
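# Note: X and y are the full dataset (train + test), so these counts
# include the rows the model was fitted on.
expected = y
predicted = model.predict(X)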

confusion_matrix = metrics.confusion_matrix(expected, predicted)
print("CONFUSION MATRIX")
print(confusion_matrix)

CONFUSION MATRIX
[[35613   935]
 [ 2785  1855]]


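For binary labels `0` and `1`, `sklearn` returns the confusion matrix with true classes as rows and predicted classes as columns. With class `1` as the positive class, the layout is:

```
[[true_negative , false_positive]
 [false_negative, true_positive]]
```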
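
A quick way to check which cell is which is to run `confusion_matrix` on a handful of hand-made labels and unpack the result with `.ravel()`. The labels below (`y_true`, `y_pred`) are made up for illustration only.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels: two true negatives-or-false-positives, three true positives-or-false-negatives.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)

# For binary 0/1 labels, ravel() yields the cells in row-major order:
# true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = cm.ravel()
print(cm)
print("tn=%d fp=%d fn=%d tp=%d" % (tn, fp, fn, tp))
```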

### Evaluation method 3: `.classification_report(expected, predicted)`

In [18]:
report = metrics.classification_report(expected, predicted)

print("CLASSIFICATION REPORT")
print(report)

CLASSIFICATION REPORT
             precision    recall  f1-score   support

          0       0.93      0.97      0.95     36548
          1       0.66      0.40      0.50      4640

avg / total       0.90      0.91      0.90     41188
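
The class `1` row of the report can be reproduced by hand from the confusion matrix printed earlier. The sketch below assumes `sklearn`'s layout (rows are true classes, columns are predicted classes) and copies the four counts from that output.

```python
# Counts copied from the confusion matrix output above:
# [[35613   935]
#  [ 2785  1855]]
tn, fp, fn, tp = 35613, 935, 2785, 1855

precision = tp / (tp + fp)  # of the predicted subscribers, how many were real
recall = tp / (tp + fn)     # of the real subscribers, how many were found
f1 = 2 * precision * recall / (precision + recall)

print("precision: %.2f" % precision)
print("recall:    %.2f" % recall)
print("f1-score:  %.2f" % f1)
```

Rounded to two decimals, these agree with the 0.66, 0.40 and 0.50 reported for class `1`, which shows that the model misses about 60% of actual subscribers despite its high accuracy.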



## 6. Tune / improve the model

### Automated parameter tuning with GridSearchCV

`sklearn` offers an API for systematically searching for the parameter values that produce the best-scoring model. `GridSearchCV` takes an estimator and a dictionary of candidate parameter values, fits every combination with cross-validation, and keeps the combination with the best mean score (accuracy by default for classifiers):

In [19]:
from sklearn.model_selection import GridSearchCV

In [20]:
logistic_regression_model = LogisticRegression()

param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'class_weight': [{0: 1, 1: 2}, {0: 1, 1: 1.2}, {0: 1, 1: 1.4}],
}

grid = GridSearchCV(estimator=logistic_regression_model, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best estimator:", grid.best_estimator_)
print("Best score:", grid.best_score_)

('Best estimator:', LogisticRegression(C=0.01, class_weight={0: 1, 1: 1.2}, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False))
('Best score:', 0.91003852254702011)


After tuning with `GridSearchCV`, we can identify the parameter combination with the best mean cross-validated accuracy among the candidates in `param_grid`. Namely:
- `C=0.01`
- `class_weight={0: 1, 1: 1.2}`
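
Because the search ranks candidates by accuracy unless told otherwise, it may favor models that mostly predict the majority class. One possible variation is to pass a different `scoring` value. The sketch below is not the notebook's own search: it runs on synthetic data from `make_classification`, and the names `X_demo`, `y_demo`, `demo_grid` and `search` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the bank data (about 10% positives).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Hypothetical grid; 'balanced' reweights classes inversely to their frequency.
demo_grid = {"C": [0.1, 1, 10], "class_weight": [None, "balanced"]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid=demo_grid,
    scoring="f1",  # rank candidates by F1 of the positive class, not accuracy
    cv=5,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best mean CV F1:", search.best_score_)
```

Whether F1, recall, or another metric is the right target depends on what a missed subscriber costs the client compared with an unnecessary call.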

### Evaluating our tuned model

Hint: use `.score(X, y)`, `.confusion_matrix(expected, predicted)`, `.classification_report(expected, predicted)`

In [21]:
train_score_of_grid_model = grid.score(X_train, y_train)
test_score_of_grid_model = grid.score(X_test, y_test)

print("training set score: %f" % train_score_of_grid_model)
print("test set score:     %f" % test_score_of_grid_model)

training set score: 0.910395
test set score:     0.913373


In [23]:
expected_for_grid = y
predicted_for_grid = grid.predict(X)

confusion_matrix = metrics.confusion_matrix(expected_for_grid, predicted_for_grid)
print("CONFUSION MATRIX")
print(confusion_matrix)

CONFUSION MATRIX
[[35409  1139]
 [ 2521  2119]]


In [24]:
report_grid = metrics.classification_report(expected_for_grid, predicted_for_grid)

print("CLASSIFICATION REPORT")
print(report_grid)

CLASSIFICATION REPORT
             precision    recall  f1-score   support

          0       0.93      0.97      0.95     36548
          1       0.65      0.46      0.54      4640

avg / total       0.90      0.91      0.90     41188



In [67]:
# def predict_with_threshold(predicted_probabilities, threshold):
#     prediction_with_threshold = []
#     for p in predicted_probabilities:
#         if p[0] >= threshold:
#             prediction_with_threshold.append(0)
#         else:
#             prediction_with_threshold.append(1)
#     return prediction_with_threshold

In [74]:
# predict_with_threshold(predicted_probabilities, 0.85)

[1, 0, 0, 0, 0]

## 7. Using the model to predict outcomes based on fresh/unseen data

Load new data from './data/bank-marketing-data/bank-unseen-data.csv'

In [25]:
df_new = pd.read_csv('./data/bank-marketing-data/bank-unseen-data.csv')

In [26]:
X_new = df_new.to_numpy()  # DataFrame.as_matrix() was removed in pandas 1.0

In [28]:
print(grid.predict_proba(X_new))
print(grid.predict(X_new))

[[ 0.384592    0.615408  ]
 [ 0.42140486  0.57859514]
 [ 0.57816862  0.42183138]
 [ 0.64419872  0.35580128]
 [ 0.07588198  0.92411802]
 [ 0.47072953  0.52927047]
 [ 0.58921587  0.41078413]
 [ 0.72256155  0.27743845]
 [ 0.51721953  0.48278047]
 [ 0.52611551  0.47388449]
 [ 0.48997225  0.51002775]
 [ 0.73937111  0.26062889]]
[1 1 0 0 1 1 0 0 0 0 1 0]
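
In the output above, each row of `predict_proba` holds the probability of class `0` followed by the probability of class `1`, and `predict` picks the more probable class, which for two classes amounts to a 0.5 cutoff on the second column. If the client wants fewer but more confident leads, the cutoff can be raised. The sketch below uses the first five probability rows printed above; `classify_with_cutoff` is a hypothetical helper, not part of `sklearn`.

```python
import numpy as np

# First five rows of the predict_proba output printed above
# (column 0 = P(class 0), column 1 = P(class 1)).
proba = np.array([
    [0.384592, 0.615408],
    [0.42140486, 0.57859514],
    [0.57816862, 0.42183138],
    [0.64419872, 0.35580128],
    [0.07588198, 0.92411802],
])


def classify_with_cutoff(proba, cutoff=0.5):
    """Hypothetical helper: label a row 1 when P(class 1) reaches the cutoff."""
    return (proba[:, 1] >= cutoff).astype(int)


default_labels = classify_with_cutoff(proba)        # same as predict for these rows
strict_labels = classify_with_cutoff(proba, 0.6)    # only more confident positives
print(default_labels)
print(strict_labels)
```

With the default cutoff the labels match the first five values returned by `predict`; at 0.6 the second customer is no longer flagged.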
