# SF-DAT-21 | Lab 09

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# `cross_validation` was removed in scikit-learn 0.20; `model_selection` replaces it
from sklearn import linear_model, model_selection

pd.set_option('display.max_rows', 10)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 10)

%matplotlib inline
plt.style.use('ggplot')

In [2]:
df = pd.read_csv(os.path.join('..', 'datasets', 'bank-marketing.csv'))

In [52]:
df.shape

(4521, 16)

The dataset is related to the direct marketing campaigns (by phone) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable `y`).

Attribute Information:

- Input variables:
  - [Bank client data]
    - `age` (numeric)
    - `job`: type of job (categorical)
    - `marital`: marital status (categorical)
      - Note: `divorced` means divorced or widowed
    - `education` (categorical)
    - `default`: has credit in default? (categorical)
    - `balance`: bank account balance (\$)
    - `housing`: has housing loan? (categorical)
    - `loan`: has personal loan? (categorical)
  - [Data related with the last contact of the current campaign]
    - `contact`: contact communication type (categorical) 
    - `month`: last contact month of year (categorical)
    - `day_of_week`: last contact day of the week (categorical)
    - `duration`: last contact duration, in seconds (numeric)
      - Important note: this attribute strongly affects the output target (e.g., if `duration = 0` then `y = 'no'`). However, the duration is not known before a call is made, and once the call ends `y` is known anyway. This input should therefore be included only for benchmarking purposes and discarded if the intention is to have a realistic predictive model.
  - [Other attributes]
    - `campaign`: number of contacts performed during this campaign and for this client (numeric)
    - `pdays`: number of days that passed by after the client was last contacted from a previous campaign (numeric)
      - 999 means client was not previously contacted
    - `previous`: number of contacts performed before this campaign and for this client (numeric)
    - `poutcome`: outcome of the previous marketing campaign (categorical)

- Output variable (desired target):
  - `y`: has the client subscribed a term deposit? (binary)

## Our goal is to develop a model that best predicts the outcome `y`, the success of the marketing campaign

## Question 1: Remove the categorical variable with the largest number of distinct values

In [41]:
var = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']

df.job.value_counts()


management      969
blue-collar     946
technician      768
admin.          478
services        417
               ... 
entrepreneur    168
unemployed      128
housemaid       112
student          84
unknown          38
Name: job, dtype: int64

In [56]:
# errors='ignore' keeps this cell safe to re-run after `job` has already been dropped
df.drop('job', axis=1, inplace=True, errors='ignore')

In [57]:
df.default.value_counts()

no     4445
yes      76
Name: default, dtype: int64
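
Rather than calling `value_counts()` on one column at a time, the distinct-value counts of all categorical columns can be compared in one step. The following is a minimal sketch on a made-up `toy` frame (a stand-in for `df`; the rows are invented for illustration), not the lab's official solution:

```python
import pandas as pd

# Toy stand-in for the lab's DataFrame (values are made up for illustration).
toy = pd.DataFrame({
    'job': ['admin.', 'student', 'services', 'technician'],
    'marital': ['single', 'married', 'single', 'divorced'],
    'default': ['no', 'no', 'yes', 'no'],
})

# Number of distinct values per categorical column, largest first.
cardinality = toy.nunique().sort_values(ascending=False)

# Column with the most distinct values; this is the one to remove.
widest = cardinality.index[0]
toy = toy.drop(columns=widest)
```

On the real data the same two lines would be applied to `df[var]`, where `var` is the list of categorical column names defined above.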

Answer:

## Question 2: Recode all `yes`/`no` categorical variables with `0` as the most frequent value (then also append `"_no"` to the variable name), and `1` for the other (then leave the name unchanged)

In [58]:
var = ['marital', 'education', 'housing', 'contact', 'month', 'poutcome']
yes_no = ['default', 'housing', 'loan']

# Each new column is 1 where the client answered 'yes' and 0 otherwise
df['default_no'] = 0
df.loc[df.default == 'yes', 'default_no'] = 1

df['housing_no'] = 0
df.loc[df.housing == 'yes', 'housing_no'] = 1

df['loan_no'] = 0
df.loc[df.loan == 'yes', 'loan_no'] = 1
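
The recoding can also be written as a loop that looks up the most frequent answer instead of hard-coding it. The sketch below uses a made-up `toy` frame and assumes one reading of the naming rule, namely that the `"_no"` suffix marks columns where `1` stands for `no`; the lab text does not confirm that reading.

```python
import pandas as pd

# Toy stand-in for the yes/no columns (values are made up for illustration).
toy = pd.DataFrame({
    'default': ['no', 'no', 'yes', 'no'],
    'housing': ['yes', 'yes', 'no', 'yes'],
})

for col in ['default', 'housing']:
    most_frequent = toy[col].value_counts().idxmax()
    # 0 marks the most frequent answer, 1 marks the other one.
    coded = (toy[col] != most_frequent).astype(int)
    # Assumed naming rule: keep the name when 1 means "yes",
    # append "_no" when 1 means "no".
    new_name = col if most_frequent == 'no' else col + '_no'
    toy = toy.drop(columns=col)
    toy[new_name] = coded
```

Looking up `most_frequent` from the data avoids the copy-and-paste mistakes that creep in when each column is recoded by hand.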


## Question 3: Create dummy variables for the other categorical variables

In [None]:
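# The prefix matches the column name, so the dummies are named marital_single, marital_married, ...
marital_df = pd.get_dummies(df.marital, prefix='marital')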
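
Several categorical columns can be expanded in a single call by passing them through the `columns` argument of `pd.get_dummies`, which then uses each column name as the prefix. A minimal sketch on a made-up `toy` frame (the rows are invented for illustration):

```python
import pandas as pd

# Toy stand-in for the lab's DataFrame (values are made up for illustration).
toy = pd.DataFrame({
    'marital': ['single', 'married', 'divorced', 'married'],
    'contact': ['cellular', 'unknown', 'cellular', 'telephone'],
    'age': [30, 41, 52, 38],
})

# One indicator column per category; the original columns are replaced,
# and non-listed columns such as `age` are left untouched.
toy = pd.get_dummies(toy, columns=['marital', 'contact'], dtype=int)
```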

## Question 4: What should be your baseline for these dummy variables (namely, which dummy variables should you not include in your model)?
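
The dummies of one variable always sum to 1 in every row, so together with an intercept they are perfectly collinear; one of them has to be left out and becomes the baseline. Which level to omit is a modeling choice. The sketch below assumes the common convention of using the most frequent level, on a made-up `marital` series:

```python
import pandas as pd

# Toy stand-in for df.marital (values are made up for illustration).
marital = pd.Series(['married', 'single', 'married', 'divorced'], name='marital')
dummies = pd.get_dummies(marital, prefix='marital', dtype=int)

# Use the most frequent level as the baseline and leave its column out.
baseline = 'marital_' + marital.value_counts().idxmax()
dummies = dummies.drop(columns=baseline)
```

The coefficients of the remaining dummies are then read relative to the omitted level.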

Answer:

## Question 5: Which input variable in the dataset seems to predict the outcome quite well? Why?

Answer:

## Question 6: Split the dataset into a training set (60%) and a testing set (the rest)
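
One possible way to make the split is `train_test_split` from `sklearn.model_selection`. The sketch below runs on placeholder arrays; `random_state=0` and `stratify=y` are illustrative choices, not requirements of the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)   # placeholder features
y = np.array([0, 1] * 10)          # placeholder binary target

# test_size=0.4 leaves 60% of the rows for training;
# stratify=y keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y)
```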

## Question 7: Run a logistic regression with `age`, `marital` (the dummies), `default`, `balance`, `housing`, `loan`, `campaign`, `pdays`, `previous`

## Question 8: What is your training error?  What is your generalization error?  Does it make sense?
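
Here the error is taken to be the misclassification rate, i.e. one minus the accuracy returned by `score`. The sketch below fits a model on synthetic data (generated only for illustration) and computes both quantities; the generalization error is estimated on the held-out testing set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: only the first feature carries signal.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# score() returns accuracy, so the error is its complement.
training_error = 1 - model.score(X_train, y_train)
generalization_error = 1 - model.score(X_test, y_test)
```

When judging whether the numbers make sense, it can help to compare them with the error of always predicting the most frequent class.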

Answer:

## Question 9: Interpret your coefficients (at least `marital_single`, `campaign`, and `default`). Does your interpretation make sense?
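
A logistic regression coefficient is a change in log-odds, so exponentiating it gives an odds ratio, which is usually easier to interpret. The coefficient values below are hypothetical and chosen only to show the arithmetic; they are not results from this dataset.

```python
import numpy as np

# Hypothetical coefficient values, chosen only to illustrate the arithmetic.
coefficients = {'marital_single': 0.30, 'campaign': -0.10}

# exp(beta) is the multiplicative change in the odds of y = 1
# for a one-unit increase in the variable, holding the others fixed.
odds_ratios = {name: float(np.exp(beta)) for name, beta in coefficients.items()}
```

With these made-up numbers, being single would multiply the odds by about 1.35 relative to the baseline marital status, and each additional contact would multiply them by about 0.90.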

Answer:

## Question 10: What is your prediction for a 30-year-old single female, a homeowner with a \$1,000 balance in the bank, without a loan, who has never been contacted before, and who has never defaulted?
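
As a reminder of how a fitted logistic regression turns such a profile into a prediction, the predicted probability is the logistic function applied to the linear combination of the inputs. The symbols $\beta_0$, $\beta_j$ and $x_j$ below are generic notation, not values from the lab:

$$
\hat{p} = \frac{1}{1 + e^{-z}}, \qquad z = \beta_0 + \sum_j \beta_j x_j
$$

For this client, for example, $x_{age} = 30$ and $x_{balance} = 1000$, with the remaining $x_j$ set from the description using the same coding as in the fitted model. The attribute list contains no gender variable, so that part of the description does not enter the model.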

Answer:

## Question 11: Normalize your variables.  (You can reuse the function from the previous lab)
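
The function from the previous lab is not reproduced here. As a stand-in, the sketch below defines a hypothetical `normalize` helper that z-scores every column (subtract the mean, divide by the standard deviation); it is an assumption about what that function does.

```python
import pandas as pd

def normalize(frame):
    """Center each column at 0 and scale it to unit standard deviation."""
    return (frame - frame.mean()) / frame.std()

# Toy numeric columns (values are made up for illustration).
toy = pd.DataFrame({'age': [30, 40, 50], 'balance': [0.0, 1000.0, 2000.0]})
scaled = normalize(toy)
```

Putting the variables on a common scale matters for the next questions, because the regularization penalty treats all coefficients alike.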

## Question 12: Let's do some regularization.  Use 10-fold cross validation to find the best tuning parameter `c`

(Hint: check the documentation here: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)

Note: `c` (the `C` argument of `LogisticRegression`) is just the inverse of $\lambda$: the smaller $c$, the stronger the regularization. With an L1 penalty, smaller values keep fewer variables in the model.

(Hint 2: First try $c = 10^i$ with $i = -10, \dots, 10$)
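
The search suggested in the hints can be sketched as a loop over the candidate values, scoring each with `cross_val_score` and `cv=10`. The data here is synthetic and generated only for illustration; `max_iter=1000` is an illustrative setting to give the solver room to converge for large `C`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: two informative features, two noise features.
rng = np.random.RandomState(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] - X[:, 1] + rng.normal(size=300) > 0).astype(int)

candidates = [10.0 ** i for i in range(-10, 11)]
mean_accuracy = []
for c in candidates:
    model = LogisticRegression(C=c, max_iter=1000)
    # cv=10 runs 10-fold cross-validation and returns one accuracy per fold.
    mean_accuracy.append(cross_val_score(model, X, y, cv=10).mean())

# Candidate with the highest mean cross-validated accuracy.
best_c = candidates[int(np.argmax(mean_accuracy))]
```

scikit-learn also provides `LogisticRegressionCV`, which performs a similar search internally; the explicit loop is shown because it makes each step visible.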

Answer:

## Question 13: Now use the best `c` you found above and repeat your analysis; look over your coefficients

## Question 14: If you want to drop 3 variables from your analysis, which variables will you choose?
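
One possible heuristic, assuming the variables have been normalized so that coefficients are comparable, is to rank them by absolute coefficient and treat the smallest as candidates to drop. The sketch below uses synthetic data and placeholder feature names; `C=1.0` stands in for whatever value the cross-validation selected.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: x0 and x1 are informative, x2..x4 are pure noise.
rng = np.random.RandomState(0)
names = ['x0', 'x1', 'x2', 'x3', 'x4']   # placeholder feature names
X = rng.normal(size=(400, 5))
y = (2 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=400) > 0).astype(int)

model = LogisticRegression(C=1.0).fit(X, y)

# Rank features by absolute coefficient; the smallest are candidates to drop.
order = np.argsort(np.abs(model.coef_[0]))
drop_candidates = [names[i] for i in order[:3]]
```

This ranking is only a starting point: a small coefficient on an unscaled variable, or on one that is strongly correlated with another input, does not by itself show that the variable is uninformative.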

Answer: