# SF-DAT-21 | Lab 09

In [109]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model, model_selection  # cross_validation was removed in scikit-learn 0.20

pd.set_option('display.max_rows', 10)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 10)

%matplotlib inline
plt.style.use('ggplot')

In [110]:
df = pd.read_csv(os.path.join('..', 'datasets', 'bank-marketing.csv'))

In [111]:
df.shape

(4521, 17)

The dataset comes from the direct marketing campaigns (by phone) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable `y`).

Attribute Information:

- Input variables:
  - [Bank client data]
    - `age` (numeric)
    - `job`: type of job (categorical)
    - `marital`: marital status (categorical)
      - Note: `divorced` means divorced or widowed
    - `education` (categorical)
    - `default`: has credit in default? (categorical)
    - `balance`: bank account balance (\$)
    - `housing`: has housing loan? (categorical)
    - `loan`: has personal loan? (categorical)
  - [Data related with the last contact of the current campaign]
    - `contact`: contact communication type (categorical) 
    - `month`: last contact month of year (categorical)
    - `day`: last contact day of the month (numeric)
    - `duration`: last contact duration, in seconds (numeric)
      - Important note: this attribute strongly affects the output target (e.g., if `duration = 0` then `y = 'no'`). However, the duration is not known before a call is made, and once the call ends `y` is known. This input should therefore be included only for benchmarking and discarded if the goal is a realistic predictive model.
  - [Other attributes]
    - `campaign`: number of contacts performed during this campaign and for this client (numeric)
    - `pdays`: number of days since the client was last contacted in a previous campaign (numeric)
      - 999 means client was not previously contacted
    - `previous`: number of contacts performed before this campaign and for this client (numeric)
    - `poutcome`: outcome of the previous marketing campaign (categorical)

- Output variable (desired target):
  - `y`: has the client subscribed a term deposit? (binary)

## Our goal is to develop a model that best predicts the outcome `y`, the success of the marketing campaign

## Question 1: Remove the categorical variables with the largest number of distinct values

In [112]:
var = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']

df.job.value_counts()

management     969
blue-collar    946
technician     768
...
housemaid    112
student       84
unknown       38
Length: 12, dtype: int64

In [113]:
df.drop('job', axis=1, inplace=True)


In [114]:
var = ['marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']

In [115]:
# Print the frequency table of every remaining categorical column
for col in var:
    print(col)
    print(df[col].value_counts())
    print()

marital
married     2797
single      1196
divorced     528
dtype: int64

education
secondary    2306
tertiary     1350
primary       678
unknown       187
dtype: int64

default
no     4445
yes      76
dtype: int64

housing
yes    2559
no     1962
dtype: int64

loan
no     3830
yes     691
dtype: int64

contact
cellular     2896
unknown      1324
telephone     301
dtype: int64

month
may    1398
jul     706
aug     633
...
sep    52
mar    49
dec    20
Length: 12, dtype: int64

poutcome
unknown    3705
failure     490
other       197
success     129
dtype: int64
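
The per-column `value_counts()` loop above can be condensed with `nunique()`, which returns the number of distinct values directly. A minimal sketch on a toy frame (the `toy` data is invented for illustration and is not the lab dataset):

```python
import pandas as pd

# Toy stand-in for the bank data; values are invented for illustration only.
toy = pd.DataFrame({
    'marital': ['married', 'single', 'divorced', 'married'],
    'default': ['no', 'no', 'yes', 'no'],
    'month': ['may', 'jul', 'aug', 'sep'],
})

# Number of distinct values per categorical column, largest first.
distinct_counts = toy.nunique().sort_values(ascending=False)
print(distinct_counts)

# Column(s) tied for the largest number of distinct values.
widest = distinct_counts[distinct_counts == distinct_counts.max()].index.tolist()
print(widest)
```

In the lab outputs, `job` and `month` both report `Length: 12`, so a check for ties like the one above is worth running before deciding what to drop.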



Answer:

## Question 2: Recode all `yes`/`no` categorical variables with `0` as the most frequent value (then also append `"_no"` to the variable name), and `1` for the other (then leave the name unchanged)

In [116]:
var = ['default', 'housing', 'loan', 'marital', 'education', 'contact', 'month', 'poutcome']

# Recode each yes/no column in place: 'yes' -> 1, anything else -> 0
for col in ['default', 'housing', 'loan']:
    df[col] = (df[col] == 'yes').astype(int)

In [117]:
df.loan.value_counts()

0    3830
1     691
dtype: int64
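
One possible reading of Question 2 is that the majority value always becomes `0`, and that a column whose majority is `yes` is renamed with a `_no` suffix (so that `1` means "no"). The sketch below follows that reading; the helper name `recode_yes_no` and the toy data are hypothetical.

```python
import pandas as pd

def recode_yes_no(frame, column):
    """Hypothetical helper: code the most frequent value as 0, the other as 1."""
    most_frequent = frame[column].value_counts().idxmax()
    coded = (frame[column] != most_frequent).astype(int)
    # If 'yes' is the majority, 1 now means 'no', so the name gets a '_no' suffix.
    new_name = column + '_no' if most_frequent == 'yes' else column
    return frame.drop(columns=[column]).assign(**{new_name: coded})

# Toy data, invented for illustration.
toy = pd.DataFrame({
    'housing': ['yes', 'yes', 'no'],   # majority 'yes' -> becomes housing_no
    'loan': ['no', 'no', 'yes'],       # majority 'no'  -> stays loan
})
for col in ['housing', 'loan']:
    toy = recode_yes_no(toy, col)
print(toy)
```

Under this reading, `housing` (2559 `yes` vs. 1962 `no` in the counts above) would become `housing_no`, while `default` and `loan` would keep their names.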

## Question 3: Create dummy variables for the other categorical variables

In [118]:
dum_var = ['marital', 'education', 'contact', 'month', 'poutcome']

d_marital = pd.get_dummies(df['marital'], prefix='marital')
d_education = pd.get_dummies(df['education'], prefix='education')
d_contact = pd.get_dummies(df['contact'], prefix='contact')
d_month = pd.get_dummies(df['month'], prefix='month')
d_poutcome = pd.get_dummies(df['poutcome'], prefix='poutcome')

## Question 4: What should be your baseline for these dummy variables (namely, which dummy variables should you not include in your model)?

In [119]:
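# Skip the first dummy of each variable so that it serves as the baseline.
# (.iloc replaces .ix, which was removed in pandas 1.0.)
df = df.join(d_marital.iloc[:, 1:])
df = df.join(d_education.iloc[:, 1:])
df = df.join(d_contact.iloc[:, 1:])
df = df.join(d_month.iloc[:, 1:])
df = df.join(d_poutcome.iloc[:, 1:])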
df.head()

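| | age | marital | education | default | balance | ... | month_oct | month_sep | poutcome_other | poutcome_success | poutcome_unknown |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | married | primary | 0 | 1787 | ... | 1 | 0 | 0 | 0 | 1 |
| 1 | 33 | married | secondary | 0 | 4789 | ... | 0 | 0 | 0 | 0 | 0 |
| 2 | 35 | single | tertiary | 0 | 1350 | ... | 0 | 0 | 0 | 0 | 0 |
| 3 | 30 | married | tertiary | 0 | 1476 | ... | 0 | 0 | 0 | 0 | 1 |
| 4 | 59 | married | secondary | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 |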


In [120]:
df.drop(dum_var, axis=1, inplace=True)
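
Dropping the first dummy column by position, as done above, can also be expressed with the `drop_first` argument of `pd.get_dummies`. A minimal sketch on invented data:

```python
import pandas as pd

# Toy column, invented for illustration.
toy = pd.DataFrame({'marital': ['married', 'single', 'divorced', 'married']})

# drop_first=True omits the first level in sorted order ('divorced' here),
# which then acts as the baseline category.
dummies = pd.get_dummies(toy['marital'], prefix='marital', drop_first=True)
print(dummies.columns.tolist())
```

The remaining columns, `marital_married` and `marital_single`, match the ones that appear in the `df.info()` listing of this lab, so each coefficient is read relative to the omitted level.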

Answer:

## Question 5: Which input variable in the dataset seems to predict the outcome quite well, and why?

In [121]:
df['target'] = 0
df.loc[df.y == 'yes', 'target'] = 1
df.drop('y', axis=1, inplace=True)
df.target.value_counts()

0    4000
1     521
dtype: int64
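
One simple way to look for a strongly predictive input is to compare the mean of each numeric column across the two outcome classes. The sketch below uses a small invented frame, not the lab data, so the numbers carry no meaning beyond showing the mechanics.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'duration': [30, 45, 60, 400, 650, 90],
    'campaign': [1, 3, 2, 1, 2, 4],
    'target':   [0, 0, 0, 1, 1, 0],
})

# Mean of every input column within each outcome class.
mean_by_outcome = toy.groupby('target').mean()
print(mean_by_outcome)
```

A column whose class means differ widely relative to its spread is a candidate for a strong predictor. Whether it is legitimately available at prediction time is a separate question, as the note on `duration` in the attribute list points out.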

In [122]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4521 entries, 0 to 4520
Data columns (total 32 columns):
age                    4521 non-null int64
default                4521 non-null int64
balance                4521 non-null int64
housing                4521 non-null int64
loan                   4521 non-null int64
day                    4521 non-null int64
duration               4521 non-null int64
campaign               4521 non-null int64
pdays                  4521 non-null int64
previous               4521 non-null int64
marital_married        4521 non-null float64
marital_single         4521 non-null float64
education_secondary    4521 non-null float64
education_tertiary     4521 non-null float64
education_unknown      4521 non-null float64
contact_telephone      4521 non-null float64
contact_unknown        4521 non-null float64
month_aug              4521 non-null float64
month_dec              4521 non-null float64
month_feb              4521 non-null float64
month_jan    

In [133]:
df['Intercept'] = 1.0

Answer:

## Question 6: Split the dataset into a training set (60%) and a testing set (the rest)

In [134]:
var = ['age','marital_married','marital_single','default','balance','housing','loan','campaign','pdays','previous']

from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation
X = df[var].values
y = df.target.values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=2016)
X.shape
y.shape

(4521L,)
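
Because the target is imbalanced (4000 zeros vs. 521 ones), it may help to keep the class ratio equal in both parts of the split. The sketch below uses the `stratify` argument of `train_test_split` on synthetic arrays; the data are invented, and stratification is an optional variation, not something the lab asks for.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 20 positives out of 200 (invented for illustration).
X = np.arange(400).reshape(200, 2)
y = np.array([1] * 20 + [0] * 180)

# 60% training, 40% testing, with the share of positives preserved in each part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, random_state=2016, stratify=y)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```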

## Question 7: Run a logistic regression with `age`, `marital` (the dummies), `default`, `balance`, `housing`, `loan`, `campaign`, `pdays`, `previous`

In [135]:
from sklearn.linear_model import LogisticRegression 
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
confusion = pd.crosstab(y_test, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

In [136]:
confusion

| True \ Predicted | 0 | 1 | All |
|---|---|---|---|
| 0 | 1607 | 1 | 1608 |
| 1 | 199 | 2 | 201 |
| All | 1806 | 3 | 1809 |


In [137]:
lr.score(X_train, y_train)

0.88126843657817111

## Question 8: What is your training error?  What is your generalization error?  Does it make sense?

In [138]:
lr.score(X_test, y_test)

0.88944168048645655
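
The error rates follow directly from the numbers already shown: error is one minus accuracy, and the test accuracy can be re-derived from the confusion matrix. The arithmetic below uses only values printed above; the comparison with a majority-class baseline is an added check, not part of the original lab.

```python
# Counts copied from the confusion matrix above (rows: true, columns: predicted).
true_neg, false_pos = 1607, 1
false_neg, true_pos = 199, 2
n_test = true_neg + false_pos + false_neg + true_pos

test_accuracy = (true_neg + true_pos) / n_test
test_error = 1.0 - test_accuracy

train_accuracy = 0.88126843657817111  # lr.score(X_train, y_train) from above
train_error = 1.0 - train_accuracy

# Accuracy of always predicting the majority class (0) on the test set.
baseline_accuracy = (true_neg + false_pos) / n_test

print(round(test_error, 4), round(train_error, 4), round(baseline_accuracy, 4))
```

The model's test accuracy (about 0.8894) is only marginally above the majority-class baseline (about 0.8889), which is consistent with the confusion matrix showing just three positive predictions.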

Answer:

## Question 9: Interpret your coefficients (at least `marital_single`, `campaign`, and `default`). Does your interpretation make sense?
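
A logistic regression coefficient is easiest to read as an odds ratio: a one-unit increase in the input multiplies the odds of `y = 1` by `exp(coefficient)`. The sketch below uses hypothetical coefficient values, not the fitted ones; in the notebook, the fitted values could be read from `lr.coef_[0]` in the same order as `var`.

```python
import math

# Hypothetical coefficients, NOT the fitted values from this lab.
coefficients = {'marital_single': 0.30, 'campaign': -0.10, 'default': 0.05}

for name, beta in coefficients.items():
    # A one-unit increase multiplies the odds of y = 1 by exp(beta).
    odds_ratio = math.exp(beta)
    print('{}: coef={:.2f}, odds ratio={:.3f}'.format(name, beta, odds_ratio))
```

An odds ratio above 1 raises the odds, below 1 lowers them. For a dummy such as `marital_single`, the comparison is against the omitted baseline level.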

Answer:

## Question 10: What is your prediction for a 30-year-old single female homeowner with a \$1,000 bank balance, no loan, no previous contact, and no history of default?
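
To score a single profile, its values must be arranged in the same column order as `var`. The sketch below builds that vector and applies the logistic function by hand. Several encodings are assumptions: `housing = 1` for "homeowner" (the column actually records a housing loan), `campaign = 1`, and `pdays = 999` for "never contacted" as stated in the attribute list. Gender is not among the model inputs. The intercept and coefficients are hypothetical placeholders.

```python
import math

# Column order must match `var` from Question 6.
var = ['age', 'marital_married', 'marital_single', 'default', 'balance',
       'housing', 'loan', 'campaign', 'pdays', 'previous']

# Assumed encoding of the profile (housing, campaign and pdays are assumptions).
profile = {'age': 30, 'marital_married': 0, 'marital_single': 1, 'default': 0,
           'balance': 1000, 'housing': 1, 'loan': 0, 'campaign': 1,
           'pdays': 999, 'previous': 0}
x_new = [profile[name] for name in var]

# Hypothetical parameters for illustration; use the fitted model in practice,
# e.g. lr.predict_proba([x_new]).
hypothetical_intercept = -2.0
hypothetical_coefs = [0.0] * len(var)
hypothetical_coefs[var.index('marital_single')] = 0.3

z = hypothetical_intercept + sum(b * x for b, x in zip(hypothetical_coefs, x_new))
probability = 1.0 / (1.0 + math.exp(-z))
print(x_new, round(probability, 4))
```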

Answer:

## Question 11: Normalize your variables.  (You can reuse the function from the previous lab)
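
The function from the previous lab is not shown here, so the following is only a sketch of one common choice, z-score standardization, with a hypothetical helper named `standardize`. It computes the mean and standard deviation on the training set and applies them to both sets, so that no information from the test set leaks into the scaling.

```python
import numpy as np

def standardize(train, test):
    """Hypothetical helper: z-score columns using training-set statistics only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    return (train - mean) / std, (test - mean) / std

# Toy arrays, invented for illustration (columns: age, balance).
X_train = np.array([[30.0, 1000.0], [40.0, 3000.0], [50.0, 2000.0]])
X_test = np.array([[40.0, 2000.0]])

X_train_std, X_test_std = standardize(X_train, X_test)
print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
print(X_test_std)
```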

## Question 12: Let's do some regularization.  Use 10-fold cross validation to find the best tuning parameter `c`

(Hint: check the documentation here: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)

Note: `c` (the `C` argument of `LogisticRegression`) is the inverse of $\lambda$: the smaller $c$, the stronger the regularization. With an L1 penalty, smaller values select fewer variables.

(Hint 2: First try $c = 10^i$ with $i = -10, \dots, 10$)
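
The search suggested by the hints can be sketched as a loop over the candidate values, scoring each with 10-fold cross-validation. The data below are synthetic (from `make_classification`) and stand in for the standardized training set; `max_iter=1000` is an assumption made to give the solver room to converge.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the standardized training data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate values c = 10^i for i = -10, ..., 10.
candidate_cs = [10.0 ** i for i in range(-10, 11)]

mean_scores = []
for c in candidate_cs:
    model = LogisticRegression(C=c, max_iter=1000)
    scores = cross_val_score(model, X, y, cv=10)  # accuracy on each of 10 folds
    mean_scores.append(scores.mean())

best_c = candidate_cs[int(np.argmax(mean_scores))]
print(best_c, max(mean_scores))
```

Once a promising power of ten is found, a finer grid around it can be searched the same way. scikit-learn also provides `LogisticRegressionCV`, which wraps this kind of search.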

Answer:

## Question 13: Now use the best `c` you found above and repeat your analysis; look over your coefficients

## Question 14: If you want to drop 3 variables from your analysis, which variables will you choose?
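
When the inputs have been standardized, one rough heuristic is to rank variables by the absolute size of their coefficients and treat the smallest as drop candidates. The coefficient values below are hypothetical, not results from this lab.

```python
# Hypothetical coefficients from a model fit on standardized inputs.
coefs = {'age': 0.02, 'balance': 0.15, 'housing': -0.40, 'loan': -0.25,
         'campaign': -0.30, 'pdays': 0.01, 'previous': 0.20, 'default': 0.00}

# Sort variable names by absolute coefficient, smallest first.
ranked = sorted(coefs, key=lambda name: abs(coefs[name]))
drop_candidates = ranked[:3]
print(drop_candidates)
```

This comparison is only meaningful when all inputs share a common scale, which is one reason Question 11 comes before the regularization step.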

Answer: