# DS-SF-25 | Lab 09 | Introduction to Logistic Regression

In [89]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 10)

from sklearn import linear_model, model_selection

import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

In [90]:
df = pd.read_csv(os.path.join('..', 'datasets', 'bank-marketing.csv'))

In [91]:
df

| | age | job | marital | education | default | ... | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | unemployed | married | primary | no | ... | 1 | -1 | 0 | unknown | no |
| 1 | 33 | services | married | secondary | no | ... | 1 | 339 | 4 | failure | no |
| 2 | 35 | management | single | tertiary | no | ... | 1 | 330 | 1 | failure | no |
| 3 | 30 | management | married | tertiary | no | ... | 4 | -1 | 0 | unknown | no |
| 4 | 59 | blue-collar | married | secondary | no | ... | 1 | -1 | 0 | unknown | no |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4516 | 33 | services | married | secondary | no | ... | 5 | -1 | 0 | unknown | no |
| 4517 | 57 | self-employed | married | tertiary | yes | ... | 1 | -1 | 0 | unknown | no |
| 4518 | 57 | technician | married | secondary | no | ... | 11 | -1 | 0 | unknown | no |
| 4519 | 28 | blue-collar | married | secondary | no | ... | 4 | 211 | 3 | other | no |


> The dataset is related to the direct marketing campaigns (by phone) of a Portuguese banking institution.  The classification goal is to predict whether the client will subscribe to a term deposit (variable `y`).

Attributes Information:

- Input variables:
  - [Bank client data]
    - `age` (numeric)
    - `job`: type of job (categorical)
    - `marital`: marital status (categorical)
      - Note: `divorced` means divorced or widowed
    - `education` (categorical)
    - `default`: has credit in default? (categorical)
    - `balance`: bank account balance (\$)
    - `housing`: has housing loan? (categorical)
    - `loan`: has personal loan? (categorical)
  - [Data related with the last contact of the current campaign]
    - `contact`: contact communication type (categorical)
    - `month`: last contact month of year (categorical)
    - `day_of_week`: last contact day of the week (categorical)
    - `duration`: last contact duration, in seconds (numeric)
      - Important note: this attribute highly affects the output target (e.g., if `duration = 0` then `y = 'no'`). Yet, the duration is not known before a call is performed.  Also, after the end of the call y is obviously known.  Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
  - [Other attributes]
    - `campaign`: number of contacts performed during this campaign and for this client (numeric)
    - `pdays`: number of days that passed by after the client was last contacted from a previous campaign (numeric)
      - -1 means the client was not previously contacted (see the rows above, where `pdays = -1` goes with `previous = 0`)
    - `previous`: number of contacts performed before this campaign and for this client (numeric)
    - `poutcome`: outcome of the previous marketing campaign (categorical)

- Output variable (desired target):
  - `y`: has the client subscribed a term deposit? (binary)

> Our goal is to develop the model that best predicts the outcome `y`, the success of the marketing campaign.

> ## Question 1.  Remove the categorical variables with the largest number of distinct values

In [92]:
len(df.job.unique())

12

In [93]:
len(df.marital.unique())

3

In [94]:
len(df.education.unique())

4

In [95]:
len(df.housing.unique())

2

In [96]:
len(df.loan.unique())

2

In [97]:
len(df.contact.unique())

3

In [98]:
len(df.month.unique())

12

In [99]:
len(df.poutcome.unique())

4

Answer: `job` (12 distinct values).  Note that `month` also has 12 distinct values; only `job` is dropped here.

In [100]:
df.drop('job',axis=1, inplace = True)
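
The per-column checks above can also be run in one pass. A minimal sketch, assuming a small made-up frame (`toy_df` and its values are illustrative, not the bank data):

```python
import pandas as pd

# Toy frame standing in for the bank data (values are illustrative only)
toy_df = pd.DataFrame({
    'age': [30, 33, 35, 59],
    'job': ['unemployed', 'services', 'management', 'management'],
    'marital': ['married', 'married', 'single', 'married'],
})

# Keep the non-numeric columns, then count distinct values in each one
categorical = toy_df.select_dtypes(exclude=['number'])
distinct_counts = categorical.nunique().sort_values(ascending=False)
print(distinct_counts)
```

Sorting the counts makes ties at the top easy to spot.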

> ## Question 2.  Recode all `yes`/`no` categorical variables with `0` as the most frequent value (then also append `"_no"` to the variable name), and `1` for the other (then leave the name unchanged)

In [101]:
df.default.value_counts()

no     4445
yes      76
Name: default, dtype: int64

In [102]:
df['default'].min()

'no'

In [86]:
# .max() on strings is alphabetical ('yes'), so this codes 'yes' as 0 and 'no' as 1
df['default'] = (df['default'] != df['default'].max()).astype(int)

In [87]:
df.default.value_counts()

1    4445
0      76
Name: default, dtype: int64

In [106]:
df['housing'].max()

'yes'

In [68]:
df.housing.value_counts()

yes    2559
no     1962
Name: housing, dtype: int64

In [69]:
# 'yes' is both the alphabetical maximum and the most frequent value here, so it is coded 0
df['housing'] = (df['housing'] != df['housing'].max()).astype(int)

In [70]:
df.housing.value_counts()

0    2559
1    1962
Name: housing, dtype: int64

In [19]:
df.loan.value_counts()

no     3830
yes     691
Name: loan, dtype: int64

In [20]:
# As with default, 'yes' is coded 0 and 'no' is coded 1
df['loan'] = (df['loan'] != df['loan'].max()).astype(int)

In [21]:
df.loan.value_counts()

1    3830
0     691
Name: loan, dtype: int64

In [22]:
df.y.value_counts()

no     4000
yes     521
Name: y, dtype: int64

In [23]:
# 'yes' (subscribed) is coded 0 and 'no' is coded 1, so class 1 below means "did not subscribe"
df['y'] = (df['y'] != df['y'].max()).astype(int)

In [24]:
df.y.value_counts()

1    4000
0     521
Name: y, dtype: int64
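
Note that `.max()` on a column of strings returns the alphabetically last value (`'yes'`), not the most frequent one. One way to key the recoding on frequency instead is sketched below. `recode_yes_no` and `toy_df` are hypothetical names, and the sketch assumes the `_no` suffix is meant for the case where `1` ends up standing for `'no'`.

```python
import pandas as pd

def recode_yes_no(df, column):
    """Code the most frequent value as 0 and the other value as 1 (hypothetical helper)."""
    most_frequent = df[column].value_counts().idxmax()
    recoded = (df[column] != most_frequent).astype(int)
    # If 'yes' is the majority, 1 stands for 'no', so flag that in the column name
    new_name = column + '_no' if most_frequent == 'yes' else column
    return df.drop(column, axis=1).assign(**{new_name: recoded})

# Made-up data: 'no' dominates default, 'yes' dominates housing
toy_df = pd.DataFrame({
    'default': ['no', 'no', 'no', 'yes'],
    'housing': ['yes', 'yes', 'no', 'yes'],
})
for name in ['default', 'housing']:
    toy_df = recode_yes_no(toy_df, name)
print(toy_df)
```

With this coding, `default` keeps its name (1 means `'yes'`), while `housing` becomes `housing_no` (1 means `'no'`).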

> ## Question 3.  Create binary/dummy variables for the other categorical variables

In [25]:
marital_df = pd.get_dummies(df.marital, prefix = 'marital')
education_df = pd.get_dummies(df.education, prefix = 'education')
contact_df = pd.get_dummies(df.contact, prefix = 'contact')

df = df.join([marital_df, education_df, contact_df])

> ## Question 4.  What should be your baseline for these binary variables (namely, which binary variables should you not include in your model)?

In [26]:
marital_df

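| | marital_divorced | marital_married | marital_single |
|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 1.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 |
| 3 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 1.0 | 0.0 |
| ... | ... | ... | ... |
| 4516 | 0.0 | 1.0 | 0.0 |
| 4517 | 0.0 | 1.0 | 0.0 |
| 4518 | 0.0 | 1.0 | 0.0 |
| 4519 | 0.0 | 1.0 | 0.0 |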


In [27]:
education_df

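| | education_primary | education_secondary | education_tertiary | education_unknown |
|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 1.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... |
| 4516 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4517 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4518 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4519 | 0.0 | 1.0 | 0.0 | 0.0 |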


In [28]:
contact_df

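| | contact_cellular | contact_telephone | contact_unknown |
|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 |
| 1 | 1.0 | 0.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 |
| 4 | 0.0 | 0.0 | 1.0 |
| ... | ... | ... | ... |
| 4516 | 1.0 | 0.0 | 0.0 |
| 4517 | 0.0 | 0.0 | 1.0 |
| 4518 | 1.0 | 0.0 | 0.0 |
| 4519 | 1.0 | 0.0 | 0.0 |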


Answer: TODO

In [29]:
df.drop(['marital', 'education', 'contact'], axis = 1, inplace = True)
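
Including every dummy of a variable alongside an intercept makes the columns perfectly collinear, so one level per variable is usually left out as the baseline. The sketch below, on a made-up `toy_df`, shows two ways to do that with `pd.get_dummies`; dropping `marital_married` mirrors the variable list used in Question 7.

```python
import pandas as pd

# Made-up data for illustration only
toy_df = pd.DataFrame({'marital': ['married', 'single', 'divorced', 'married']})

# drop_first=True omits the first level in sorted order ('divorced' here)
dummies_first = pd.get_dummies(toy_df.marital, prefix='marital', drop_first=True)

# To choose the baseline yourself, build all dummies and drop one column
dummies_all = pd.get_dummies(toy_df.marital, prefix='marital')
dummies_married_baseline = dummies_all.drop('marital_married', axis=1)

print(list(dummies_first.columns))
print(list(dummies_married_baseline.columns))
```

Coefficients on the remaining dummies are then read relative to the omitted level.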

> ## Question 5.  Which input variable in the dataset seems to predict the outcome quite well?  Why?

In [30]:
df.corr().y.sort_values(ascending = False).head()

y                      1.000000
contact_unknown        0.139399
marital_married        0.064643
campaign               0.061147
education_secondary    0.028744
Name: y, dtype: float64

In [31]:
df.drop('duration', axis=1, inplace = True)

Answer: TODO
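
`head()` on a descending sort only surfaces the largest positive correlations, so a strongly negative one would not appear in it. A hedged sketch of ranking by absolute correlation instead, on made-up numbers (`toy_df` is illustrative only):

```python
import pandas as pd

# Made-up data: y is 0 exactly when duration is long
toy_df = pd.DataFrame({
    'y':        [1, 1, 1, 0, 0, 1],
    'duration': [60, 80, 100, 900, 700, 120],
    'campaign': [1, 2, 1, 3, 2, 4],
})

# Correlation of each input with y, excluding y's correlation with itself
correlations = toy_df.corr()['y'].drop('y')

# Order by magnitude while keeping the sign visible
order = correlations.abs().sort_values(ascending=False).index
ranked = correlations.reindex(order)
print(ranked)
```

Here `duration` ranks first even though its correlation with `y` is negative.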

> ## Question 6.  Split the dataset into a training set (60%) and a testing set (the rest)

In [32]:
train_df = df.sample(frac = .6, random_state = 0)
test_df = df.drop(train_df.index)

In [33]:
train_df.index

Int64Index([4274,   72, 1065, 3185,  130, 2033, 1098, 2217, 2319, 1504,
            ...
            3537, 3212,  301, 3845, 4152, 1817, 2642, 3302, 1993,  310],
           dtype='int64', length=2713)

> ## Question 7.  Run a logistic regression with `age`, `marital` (the dummies), `default`, `balance`, `housing`, `loan`, `campaign`, `pdays`, `previous`

In [34]:
names_X = ['age', 'marital_single', 'marital_divorced',
    'default', 'balance', 'housing',
    'loan', 'campaign', 'pdays', 'previous']

def X_y(df):
    X = df[names_X]
    y = df.y
    return X, y

train_X, train_y = X_y(train_df)
test_X, test_y = X_y(test_df)    

In [35]:
train_X

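| | age | marital_single | marital_divorced | default | balance | housing | loan | campaign | pdays | previous |
|---|---|---|---|---|---|---|---|---|---|---|
| 4274 | 35 | 0.0 | 0.0 | 1 | 378 | 0 | 1 | 3 | -1 | 0 |
| 72 | 31 | 1.0 | 0.0 | 1 | 6248 | 0 | 1 | 2 | -1 | 0 |
| 1065 | 30 | 0.0 | 0.0 | 1 | 0 | 0 | 1 | 1 | -1 | 0 |
| 3185 | 42 | 0.0 | 0.0 | 1 | 529 | 0 | 0 | 3 | -1 | 0 |
| 130 | 41 | 0.0 | 0.0 | 1 | 168 | 0 | 1 | 3 | -1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1817 | 27 | 0.0 | 0.0 | 1 | 2171 | 0 | 1 | 1 | 323 | 1 |
| 2642 | 46 | 0.0 | 0.0 | 1 | 143 | 0 | 1 | 1 | -1 | 0 |
| 3302 | 33 | 0.0 | 1.0 | 1 | 0 | 0 | 0 | 8 | -1 | 0 |
| 1993 | 40 | 0.0 | 0.0 | 1 | 419 | 0 | 1 | 1 | -1 | 0 |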


In [36]:
model = linear_model.LogisticRegression().fit(train_X, train_y)
print(model.intercept_)
print(model.coef_)

[ 1.60629052]
[[ -1.25396230e-03  -3.57195822e-01  -3.83402998e-01   1.23825445e+00
   -3.00702144e-05  -6.68711382e-01  -3.38257689e-01   7.38630504e-02
   -1.65731392e-03  -7.30882254e-02]]


> ## Question 8.  What is your training error?  What is your generalization error?  Does it make sense?

In [37]:
1 - model.score(train_X, train_y)

0.11758201253225209

In [38]:
1- model.score(test_X, test_y)

0.11504424778761058

In [39]:
# Parenthesize each error so the subtraction compares training error with test error
round((1 - model.score(train_X, train_y)) - (1 - model.score(test_X, test_y)), 6)

0.002538

Answer: TODO
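
A useful yardstick for these error rates is the error of always predicting the majority class. A minimal sketch, using the class counts of `y` printed in Question 2:

```python
# Class counts of y taken from the value_counts() output in Question 2
majority_count = 4000
minority_count = 521

# A model that always predicts the majority class is wrong on every minority row
baseline_error = minority_count / float(majority_count + minority_count)
print(round(baseline_error, 4))  # 0.1152
```

A fitted model is only adding value if its error falls below this figure.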

> ## Question 9.  Interpret your coefficients. (At least `marital_single`, `campaign`, and `default`).  Does your interpretation  make sense?

In [107]:
list(zip(names_X, np.exp(model.coef_[0]) - 1))

[('age', -0.0012531764133461021),
 ('marital_single', -0.30036451992773272),
 ('marital_divorced', -0.31846181486122649),
 ('default', 2.4495867940481451),
 ('balance', -3.0069762307127768e-05),
 ('housing', -0.48763160005750295),
 ('loan', -0.28698847103163361),
 ('campaign', 0.076659347336751393),
 ('pdays', -0.0016559413355317076),
 ('previous', -0.070481180724287884)]

Answer: TODO
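
The expression `np.exp(coef) - 1` turns a coefficient into the relative change in the odds of class 1 for a one-unit increase in that input, where class 1 is whichever value was coded as `1` in Question 2. A small standard-library sketch (`relative_change_in_odds` is a hypothetical helper name), applied to the `marital_single` coefficient printed in Question 7:

```python
import math

def relative_change_in_odds(coefficient):
    """Relative change in the odds of class 1 for a one-unit increase in the input."""
    return math.exp(coefficient) - 1

# Coefficient of marital_single copied from the fitted model's printed output
change = relative_change_in_odds(-0.357195822)
print(round(change, 4))  # -0.3004, i.e. the odds are about 30% lower
```

A coefficient of zero gives a change of zero, meaning the input leaves the odds untouched.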

> ## Question 10.  What is your prediction for a 30-year-old single female, a homeowner with a \$1,000 balance in the bank, without a loan, who has never been contacted before, and who has never defaulted?

In [111]:
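# Order follows names_X; under the Question 2 coding 'no' is 1 for default and loan,
# and pdays = -1 marks a never-contacted client (as in the data above)
X_predict = [30, 1, 0, 1, 1000, 0, 1, 0, -1, 0]
y_hat = model.intercept_ + np.dot(X_predict, model.coef_[0])  # log-odds of class 1
print(np.round(y_hat, 3))

[ 2.083]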


Answer: TODO
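
The linear combination computed above is a log-odds score; the logistic function converts it to a probability. A minimal sketch (`log_odds_to_probability` is a hypothetical helper; scikit-learn's `predict_proba` performs the same conversion for a fitted binary model):

```python
import math

def log_odds_to_probability(log_odds):
    """Logistic (sigmoid) function: map a log-odds score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds))

print(log_odds_to_probability(0.0))  # 0.5: a score of zero means even odds
```

Positive scores map to probabilities above 0.5 for class 1 and negative scores to probabilities below 0.5.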

> ## Question 11.  Normalize your variables (You can reuse the function we used from a previous lab)

In [112]:
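def normalize(x, lower, upper):
    # Min-max scale x relative to the given bounds (avoids shadowing the built-ins min and max)
    return (x - lower) / (upper - lower)

# Take the bounds from the training set and reuse them on the test set
lower, upper = train_df[names_X].min(), train_df[names_X].max()
train_df[names_X] = normalize(train_df[names_X], lower, upper)
test_df[names_X] = normalize(test_df[names_X], lower, upper)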

> ## Question 12.  Let's do some regularization.  Use 10-fold cross-validation to find the best tuning parameter `C`

(Hint: check the documentation here: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)

(Hint 2: First try C = 10 ^ i with i = -10 ... 10)

Answer: TODO
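
One possible way to approach the search, sketched on synthetic data rather than the bank data: score each candidate `C` with 10-fold cross-validation and keep the one with the highest mean accuracy. The data from `make_classification` and the names `candidate_Cs` and `mean_scores` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data standing in for the training set
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Candidate values C = 10^i for i = -10 ... 10, as suggested in the hint
candidate_Cs = [10.0 ** i for i in range(-10, 11)]

mean_scores = []
for C in candidate_Cs:
    model = LogisticRegression(C=C, max_iter=1000)
    # cv=10 requests 10-fold cross-validation; the default score is accuracy
    scores = cross_val_score(model, X, y, cv=10)
    mean_scores.append(scores.mean())

# Pick the C with the highest mean cross-validated accuracy
best_C = candidate_Cs[int(np.argmax(mean_scores))]
print(best_C, max(mean_scores))
```

In scikit-learn, smaller values of `C` mean stronger regularization, so the low end of the grid shrinks the coefficients the most.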

> ## Question 13.  Now use the best `C` you found above and repeat your analysis; look over your coefficients

In [44]:
# TODO

> ## Question 14.  If you want to drop 3 variables from your analysis, which variables will you choose?

In [45]:
# TODO

Answer: TODO