# Workshop: Analyzing bank marketing data with scikit-learn

Task: Your client has given you a dataset and asked you to build a model to:
1. predict whether a given customer is likely to purchase a bank term deposit, and
2. analyze the factors that make a customer more (or less) likely to purchase a bank term deposit.

Build this model by following the usual workflow for a classification problem:
1. Load and explore data
2. Preprocess / clean data
3. Train the model
4. Evaluate the model
5. Use the model (for prediction and interpretation)

In [5]:
# Load libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

pd.options.display.max_columns = 50
%precision 3

u'%.3f'

## 1. Load and explore the data

In [7]:
df = pd.read_csv('./data/bank-marketing-data/bank-additional-one-hot-encoded.csv')

# Balanced data: uncomment the line below for the second part of the workshop
# df = pd.read_csv('./data/bank-marketing-data/bank-additional-balanced-dataset.csv', index_col=0)

According to the dataset's [README](http://archive.ics.uci.edu/ml/datasets/Bank+Marketing), the data comes from the direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable `y`).

### Data exploration

In [10]:
# see the top n rows by calling df.head(n)

# YOUR CODE HERE:
df.head(5)

Unnamed: 0,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,job_unemployed,job_unknown,marital_divorced,marital_married,marital_single,...,housing_yes,loan_no,loan_unknown,loan_yes,contact_cellular,contact_telephone,month_apr,month_aug,month_dec,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,day_of_week_fri,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,poutcome_failure,poutcome_nonexistent,poutcome_success,y
0,56,261,1,999,0,1.1,93.994,-36.4,4.857,5191.0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,...,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0
1,57,149,1,999,0,1.1,93.994,-36.4,4.857,5191.0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,...,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0
2,37,226,1,999,0,1.1,93.994,-36.4,4.857,5191.0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,...,1,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0
3,40,151,1,999,0,1.1,93.994,-36.4,4.857,5191.0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,...,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0
4,56,307,1,999,0,1.1,93.994,-36.4,4.857,5191.0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,...,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0


In [None]:
# see summary statistics by calling df.describe()

# YOUR CODE HERE:

## 3. Prepare / clean the data for modeling

### Convert the pandas dataframe into a feature matrix (X) and a target vector (y) for the model's consumption

In [18]:
X = df.iloc[:, df.columns != 'y'].values
y = df.iloc[:, df.columns == 'y'].values.ravel()

In [None]:
# try evaluating the following expressions to get a sense of what X and y actually are:
# X.shape, y.shape
# X[0], y[0]
# X[any_random_integer], y[any_random_integer]
# X, y

In [23]:
# YOUR CODE HERE:
y.shape

(41188,)
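
As a minimal sketch of what this conversion does, the toy frame below is split in the same spirit. The frame and its values are made up for illustration and are not the workshop dataset. `.values` yields a 2-D feature matrix, while the target ends up as a 1-D vector.

```python
import pandas as pd

# Toy stand-in for the workshop dataframe (values are illustrative only)
toy = pd.DataFrame({
    "age": [56, 57, 37, 40],
    "duration": [261, 149, 226, 151],
    "y": [0, 0, 1, 0],
})

# Every column except 'y' becomes a feature; 'y' becomes the target
X_toy = toy.loc[:, toy.columns != "y"].values
y_toy = toy["y"].values

print(X_toy.shape)  # (4, 2): 4 rows, 2 features
print(y_toy.shape)  # (4,): one label per row
```

The same check on the real `X` and `y` should show matching row counts, with one fewer column in `X` than in `df`.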

### Split data into train and test set

In [29]:
# Use sklearn's train_test_split method to split the data into train and test set

# YOUR CODE HERE:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
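
When no size is given, `train_test_split` holds out 25% of the rows for testing. Because this dataset is imbalanced, it may also be worth passing `stratify=y` so that both subsets keep the same share of positives. The sketch below uses synthetic data (hypothetical `X_demo` / `y_demo`, not the bank data) to show both behaviours.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: 100 rows, 10 positives (illustrative only)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 10 + [0] * 90)

# Default split: 75% train / 25% test
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
print(X_tr.shape, X_te.shape)  # (75, 2) (25, 2)

# stratify=y_demo keeps the share of positives equal in both subsets
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)
print(y_tr_s.mean(), y_te_s.mean())  # share of positives in train and test
```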

## 4. Train the model!

In [31]:
# import the LogisticRegression class from sklearn.linear_model

# YOUR CODE HERE:
from sklearn.linear_model import LogisticRegression

In [32]:
# train the model using the .fit(X_train, y_train) method

# YOUR CODE HERE:
model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

## 5. Evaluate the model

### Evaluation method 1: `.score(X, y)`

In [33]:
# Evaluate your model's performance using the .score() method

# YOUR CODE HERE:
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

print("training set score: %f" % train_score)
print("test set score:     %f" % test_score)



training set score: 0.908906
test set score:     0.912013
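
For a classifier, `.score(X, y)` returns the mean accuracy, i.e. the fraction of rows whose predicted label matches the true label. The sketch below checks this on synthetic data generated with `make_classification`; the names `X_demo`, `y_demo` and `clf` are placeholders, not part of the workshop code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X_demo, y_demo)

score = clf.score(X_demo, y_demo)                       # built-in score
manual = accuracy_score(y_demo, clf.predict(X_demo))    # same value via metrics
by_hand = (clf.predict(X_demo) == y_demo).mean()        # fraction of correct predictions

print(score, manual, by_hand)
```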


In [34]:
# Baseline: how are the classes distributed, and how accurate would a model be
# if it always predicted the majority class?
print(df['y'].value_counts())
print("Accuracy of a model that predicts 'no' (i.e. 0) all the time: %f" % (36548.0 / (36548 + 4640)))

0    36548
1     4640
Name: y, dtype: int64
Accuracy of a model that predicts 'no' (i.e. 0) all the time: 0.887346
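
The same baseline can be obtained with scikit-learn's `DummyClassifier`, which avoids hard-coding the class counts. This is a sketch rather than part of the original workshop: the labels below are rebuilt from the counts printed above, and the feature matrix is a placeholder because the dummy model ignores it.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels with the same class counts as the value_counts() output above
y_counts = np.array([0] * 36548 + [1] * 4640)
X_placeholder = np.zeros((len(y_counts), 1))  # features are ignored by the dummy model

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_counts)

baseline_accuracy = baseline.score(X_placeholder, y_counts)
print("baseline accuracy: %f" % baseline_accuracy)  # prints 0.887346
```

A model is only useful here if it clearly beats this figure, which is why accuracy alone is a weak metric on imbalanced data.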


### Evaluation method 2: `.confusion_matrix(expected, predicted)`

In [35]:
from sklearn import metrics

In [36]:
# Evaluate model using .confusion_matrix(y_true, y_predicted)

# YOUR CODE HERE:
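# Note: this evaluates on the full dataset (train and test rows together);
# pass X_test / y_test instead for a held-out estimate
expected = y
predicted = model.predict(X)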

confusion_matrix = metrics.confusion_matrix(expected, predicted)
print("CONFUSION MATRIX")
print(confusion_matrix)


CONFUSION MATRIX
[[35613   935]
 [ 2785  1855]]


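scikit-learn puts the true classes in rows and the predicted classes in columns, with labels in sorted order (0 = 'no', 1 = 'yes'). Treating 1 as the positive class, the layout is:

```
[[true_negative , false_positive]
 [false_negative, true_positive]]
```

In the output above, 1855 buyers were correctly identified and 2785 buyers were missed.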
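
The layout can be checked on a tiny hand-made example (the labels below are invented for illustration). For binary 0/1 labels, `confusion_matrix(...).ravel()` returns the four cells in the order true negatives, false positives, false negatives, true positives.

```python
from sklearn.metrics import confusion_matrix

# Tiny hand-made example (not the workshop data)
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0, 1]

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
tn, fp, fn, tp = cm.ravel()

print(cm)
print("tn=%d fp=%d fn=%d tp=%d" % (tn, fp, fn, tp))  # tn=2 fp=1 fn=1 tp=3
```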

### Evaluation method 3: `.classification_report(expected, predicted)`

In [37]:
# Evaluate model using .classification_report(y_true, y_predicted)

# YOUR CODE HERE:
report = metrics.classification_report(expected, predicted)

print("CLASSIFICATION REPORT")
print(report)

CLASSIFICATION REPORT
             precision    recall  f1-score   support

          0       0.93      0.97      0.95     36548
          1       0.66      0.40      0.50      4640

avg / total       0.90      0.91      0.90     41188
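
The figures for class 1 in the report can be reproduced by hand from the confusion matrix printed earlier. The sketch below copies that matrix in as literals and assumes scikit-learn's row = true, column = predicted layout.

```python
import numpy as np

# Confusion matrix printed earlier in this notebook
cm = np.array([[35613, 935],
               [2785, 1855]])
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / float(tp + fp)   # of all predicted 'yes', how many were right
recall_1 = tp / float(tp + fn)      # of all actual 'yes', how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / float(cm.sum())

print("precision=%.2f recall=%.2f f1=%.2f accuracy=%.2f"
      % (precision_1, recall_1, f1_1, accuracy))
# prints: precision=0.66 recall=0.40 f1=0.50 accuracy=0.91
```

A recall of 0.40 means the model finds only about 4 in 10 of the customers who actually buy, even though overall accuracy looks high.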



## 6. Using the model to predict outcomes based on fresh/unseen data

Load new data from './data/bank-marketing-data/bank-unseen-data.csv'

In [38]:
df_new = pd.read_csv('./data/bank-marketing-data/bank-unseen-data.csv')

In [39]:
# Explore data again with df.head(). Notice that there's no 'y' column at the end

# YOUR CODE HERE:
df_new.head()

Unnamed: 0,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,job_unemployed,job_unknown,marital_divorced,marital_married,marital_single,...,housing_unknown,housing_yes,loan_no,loan_unknown,loan_yes,contact_cellular,contact_telephone,month_apr,month_aug,month_dec,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,day_of_week_fri,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,poutcome_failure,poutcome_nonexistent,poutcome_success
0,31,456,2,999,0,-1.1,94.767,-50.8,1.044,4963.6,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,...,0,1,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0
1,61,508,4,999,0,-1.1,94.767,-50.8,1.044,4963.6,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,...,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0
2,31,260,1,999,0,-1.1,94.767,-50.8,1.041,4963.6,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,...,0,1,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0
3,58,193,1,999,0,-1.1,94.767,-50.8,1.041,4963.6,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,...,0,1,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0
4,41,597,2,3,2,-1.1,94.767,-50.8,1.041,4963.6,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,...,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1


In [40]:
# Convert our pandas dataframe to a matrix, so that the model can consume it
X_new = df_new.values

In [43]:
# Use your model to predict the y value (i.e. 0 or 1) of the new data (hint: `model.predict()`)
print(model.predict(X_new))


[1 1 0 0 1 0 0 0 1 0 0 0]


In [42]:
# Use your model to predict the probabilities of y being 0 or 1 (hint: `model.predict_proba()`)
print(model.predict_proba(X_new))

[[ 0.474  0.526]
 [ 0.47   0.53 ]
 [ 0.656  0.344]
 [ 0.726  0.274]
 [ 0.106  0.894]
 [ 0.501  0.499]
 [ 0.698  0.302]
 [ 0.743  0.257]
 [ 0.474  0.526]
 [ 0.545  0.455]
 [ 0.581  0.419]
 [ 0.774  0.226]]
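
Each row of `predict_proba` holds the probability of class 0 and of class 1, and `predict` picks whichever is larger, which amounts to a 0.5 cut-off. If the client cares more about precision or recall, a different cut-off can be applied to the second column. The sketch below uses synthetic data; the 0.7 threshold is a hypothetical value chosen for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo)  # columns follow clf.classes_, here [0, 1]
p_yes = proba[:, 1]                # probability of y = 1

# Default behaviour: choose the more probable class
default_pred = clf.classes_[np.argmax(proba, axis=1)]

# Hypothetical stricter cut-off: only predict 1 when the model is at least 70% sure
threshold = 0.7
strict_pred = (p_yes >= threshold).astype(int)

print(default_pred.sum(), strict_pred.sum())  # the stricter rule flags no more rows than the default
```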


# Bonus: interpreting our model

In [None]:
plt.figure(figsize=(16,9))

plt.plot(model.coef_.T, 'o', label="LogisticRegression model (C=1)")
# label each tick with its feature name ('y' is the target, so it is excluded)
plt.xticks(range(X.shape[1]), df.columns[df.columns != 'y'], rotation=90)
plt.title("Coefficients of the logistic regression model")
plt.ylabel("Coefficients")
plt.xlabel("X variables")
plt.legend()

# Note: if you get an error saying 'model' is not defined, replace 'model' in the plt.plot(...) line with the name of your model variable

In [None]:
# Coefficients are on the log-odds scale. Before we can read them as probabilities,
# we need a little math: log-odds -> odds -> probability
logodds = model.intercept_ + model.coef_[0] * 2
odds = np.exp(logodds)
probabilities = odds/(1 + odds)
probabilities
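
The conversion from log-odds to probability is the logistic (sigmoid) function, and `exp(coefficient)` is the factor by which the odds are multiplied for a one-unit increase in that feature. The sketch below checks both facts on made-up numbers; the log-odds values and the coefficient `0.5` are hypothetical, not taken from the fitted model.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps log-odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

logodds_demo = np.array([-2.0, 0.0, 0.5, 3.0])  # illustrative values

# Route used in this notebook: log-odds -> odds -> probability
odds_demo = np.exp(logodds_demo)
prob_from_odds = odds_demo / (1 + odds_demo)

# Direct route: apply the sigmoid to the log-odds
prob_direct = sigmoid(logodds_demo)
print(np.allclose(prob_from_odds, prob_direct))  # True

# Odds ratio for a hypothetical coefficient of 0.5:
# a one-unit increase multiplies the odds by exp(0.5)
coef_demo = 0.5
odds_ratio = np.exp(coef_demo)
print("odds ratio: %.3f" % odds_ratio)  # odds ratio: 1.649
```

Odds ratios are often easier to report to a client than raw coefficients, because they do not depend on the values of the other features.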

In [None]:
number_of_x_vars = len(df.columns) - 1

In [None]:
plt.figure(figsize=(16,9))

plt.bar(range(0, number_of_x_vars), probabilities)
plt.title("Probabilities of outcome where y=1 given a unit change in X")
plt.xlabel("X variables")
plt.ylabel("Probability")
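plt.axhline(y=0.5, alpha=0.5, label="probability = 0.5")
# label each tick with its feature name ('y' is the target, so it is excluded)
plt.xticks(range(X.shape[1]), df.columns[df.columns != 'y'], rotation=90)
plt.legend()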

#### How to interpret the chart 

The chart above can be read as follows: given a unit increase in X, the customer is predicted to be \__% more likely to purchase a bank term deposit (i.e. y=1).

For example, given a unit increase in the employment variation rate (`emp.var.rate`, the first positive blip in the chart), the customer is predicted to be 16% more likely to purchase a bank term deposit.

#### Based on this chart, we can observe the following: 
    
Attributes that have a positive effect on the outcome:
- contact_cellular
- month_aug
- month_oct
- day_of_week_fri

Attributes that have a negative effect on the outcome:
- emp.var.rate
- cons.price.idx
- cons.conf.idx
- euribor3m
- education_basic.4y
- contact_telephone
- month_may