# Workshop: Analyzing bank marketing data with scikit-learn

TODO: davidtan [2017-06-27]:
- update cell on GridSearchCV
- remove predict_with_threshold() if possible

Task: Your client has given you a dataset and has asked you to build a model to predict whether a given customer is likely to purchase a bank term deposit.

Build this model by working through the usual steps of a classification problem:

1. Train the model
2. Evaluate the model
3. Tune / improve the model
4. Use the model to predict the probability of future outcomes

In [27]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

pd.options.display.max_columns = 50

## 2. Load and explore the data

In [29]:
df = pd.read_csv('./data/bank-marketing-data/bank-additional-one-hot-encoded.csv')

Based on the dataset's [README](http://archive.ics.uci.edu/ml/datasets/Bank+Marketing), we know that the data relates to the direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable `y`).

### Data exploration

## 3. Prepare / clean the data for modeling

### Convert pandas dataframe into 2 arrays for consumption

### Split data into train and test set

## 4. Train the model!

## 5. Evaluate the model

### Evaluation method 1: `.score(X, y)`

### Evaluation method 2: `confusion_matrix(expected, predicted)`

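For binary labels `0` and `1`, `sklearn.metrics.confusion_matrix` returns a matrix whose rows are the expected (actual) classes and whose columns are the predicted classes:

```
[[true_negative , false_positive]
 [false_negative, true_positive]]
```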
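A quick way to confirm which cell is which is to run `confusion_matrix` on a handful of hand-labelled values. The sketch below uses made-up labels (not the bank data) where `0` stands for "did not subscribe" and `1` for "subscribed".

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only: 0 = did not subscribe, 1 = subscribed
expected = [0, 0, 0, 1, 1, 0]
predicted = [0, 1, 0, 1, 0, 0]

# Rows are the expected classes, columns are the predicted classes
cm = confusion_matrix(expected, predicted)
print(cm)

# For a 2x2 matrix, ravel() flattens it row by row
tn, fp, fn, tp = cm.ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

Unpacking the four counts by name makes it harder to mix up false positives and false negatives when reading the matrix.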

### Evaluation method 3: `classification_report(expected, predicted)`

## 6. Tune / improve the model

### Automated parameter tuning with GridSearchCV

`sklearn` offers an API for systematically searching for the parameter values that give the best cross-validated score. With `GridSearchCV`, we pass in the estimator and a dictionary of the parameters we want to tune. Every combination in the dictionary is evaluated with cross-validation, and for a classifier the default score is accuracy:

In [53]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

In [54]:
logisticregression_tuned = LogisticRegression()

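# C is the inverse regularization strength; each class_weight option
# up-weights the positive class (1) relative to the negative class (0)
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'class_weight': [{0: 1, 1: 2}, {0: 1, 1: 1.2}, {0: 1, 1: 1.4}],
}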

grid = GridSearchCV(estimator=logisticregression_tuned, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best estimator:", grid.best_estimator_)
print("Best score:", grid.best_score_)

('Best estimator:', LogisticRegression(C=0.01, class_weight={0: 1, 1: 1.2}, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False))
('Best score:', 0.91003852254702011)


After tuning with `GridSearchCV`, we can identify the parameter combination with the best mean cross-validated accuracy (about 0.910) among those in the grid. Namely:
- `C=0.01`
- `class_weight={0: 1, 1: 1.2}`
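The winning combination is only part of the picture: it can help to see how close the other candidates came. The following is a minimal sketch that assumes a synthetic, imbalanced dataset from `make_classification` as a stand-in for the bank data. The names `X_demo`, `y_demo`, and `demo_grid` are illustrative and not part of the workshop notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the bank data (assumption): roughly 10% positives
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)

demo_grid = GridSearchCV(
    estimator=LogisticRegression(solver="liblinear"),
    param_grid={
        "C": [0.01, 0.1, 1, 10],
        "class_weight": [{0: 1, 1: 1.2}, {0: 1, 1: 2}],
    },
    cv=5,
)
demo_grid.fit(X_demo, y_demo)

# best_params_ holds only the tuned values, without the estimator's other defaults
print("Best parameters:", demo_grid.best_params_)

# cv_results_ has one entry per parameter combination
results = pd.DataFrame(demo_grid.cv_results_)
columns = ["param_C", "param_class_weight", "mean_test_score",
           "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score"))
```

If several combinations score within one standard deviation of each other, the "best" one may not be meaningfully better than its neighbours.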

### Evaluating our tuned model

Hint: use the model's `.score(X, y)` method, together with `confusion_matrix(expected, predicted)` and `classification_report(expected, predicted)` from `sklearn.metrics`.
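As a rough illustration of how the three evaluation calls fit together, here is a sketch on synthetic data. It assumes a `LogisticRegression` configured with the tuned values reported earlier (`C=0.01`, `class_weight={0: 1, 1: 1.2}`). The dataset and the names `X_tr`, `X_te`, `y_tr`, `y_te`, and `tuned` are placeholders, so the printed numbers will not match the bank data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the bank data (assumption)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Model configured with the tuned parameter values
tuned = LogisticRegression(C=0.01, class_weight={0: 1, 1: 1.2}, solver="liblinear")
tuned.fit(X_tr, y_tr)

# Method 1: mean accuracy on the held-out set
accuracy = tuned.score(X_te, y_te)
print("Accuracy:", accuracy)

# Methods 2 and 3 compare expected labels with predicted labels
predicted = tuned.predict(X_te)
print(confusion_matrix(y_te, predicted))
print(classification_report(y_te, predicted, zero_division=0))
```

On imbalanced data, accuracy alone can look good even when few positives are found, which is why the per-class precision and recall in the report are worth reading alongside it.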

In [67]:
# def predict_with_threshold(predicted_probabilities, threshold):
#     prediction_with_threshold = []
#     for p in predicted_probabilities:
#         if p[0] >= threshold:
#             prediction_with_threshold.append(0)
#         else:
#             prediction_with_threshold.append(1)
#     return prediction_with_threshold

In [74]:
# predict_with_threshold(predicted_probabilities, 0.85)

[1, 0, 0, 0, 0]
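The commented-out helper loops over the rows returned by `predict_proba` and predicts class `0` whenever the class-0 probability reaches the threshold. One possible loop-free equivalent uses `np.where`. The function name `predict_with_threshold_np` and the example probabilities below are made up for illustration.

```python
import numpy as np

def predict_with_threshold_np(predicted_probabilities, threshold):
    """Hypothetical vectorized version of predict_with_threshold()."""
    probabilities = np.asarray(predicted_probabilities)
    # Column 0 holds P(class 0): predict 0 when it reaches the threshold, else 1
    return np.where(probabilities[:, 0] >= threshold, 0, 1)

# Made-up rows in the [P(class 0), P(class 1)] layout that predict_proba returns
example_probabilities = [[0.80, 0.20], [0.90, 0.10], [0.95, 0.05]]
thresholded = predict_with_threshold_np(example_probabilities, 0.85)
print(thresholded)  # [1 0 0]
```

Raising the threshold on class 0 makes the model more willing to predict class 1, trading precision for recall on the positive class.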

## 7. Using the model to predict outcomes based on fresh/unseen data

Load new data from `./data/bank-marketing-data/bank-unseen-data.csv`.
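A minimal sketch of the prediction step follows. It assumes the unseen data has the same feature columns as the training data. To stay self-contained, it builds two tiny made-up frames inline instead of reading the CSV; the column names `age` and `campaign`, and all values, are illustrative assumptions.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up training frame standing in for the workshop's training data
train_df = pd.DataFrame({
    "age": [25, 40, 58, 33, 61, 47],
    "campaign": [1, 3, 2, 5, 1, 2],
    "y": [0, 0, 1, 0, 1, 1],
})
feature_columns = ["age", "campaign"]  # illustrative column choice

model = LogisticRegression(solver="liblinear")
model.fit(train_df[feature_columns], train_df["y"])

# In the workshop this frame would come from pd.read_csv() on the unseen-data
# file; it is built inline here so the sketch runs without the file
unseen_df = pd.DataFrame({"age": [30, 65], "campaign": [4, 1]})

# predict_proba returns one row per customer; columns follow model.classes_,
# so column 1 is the estimated probability of class 1 (purchase)
purchase_probability = model.predict_proba(unseen_df[feature_columns])[:, 1]
print(purchase_probability)
```

Selecting columns by name keeps the unseen data in the same feature order the model was trained on.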