# DS-SF-30 | Assignment 11: Logistic Regression

In [3]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 10)

from sklearn import linear_model  # cross_validation was unused here and was removed in scikit-learn 0.20

import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

## Probability, Odds, and Odds Ratios

**Probability:** The number of ways that an event can occur divided by the total number of possible outcomes.

The probability of drawing a red card from a standard deck of cards is 26/52 (50 percent).
The probability of drawing a club from that deck is 13/52 (25 percent).

> ### Question 1.  What's the probability of getting heads in a fair coin flip?

In [4]:
p = 1/2
print(p)

0.5


Answer: 1/2 (50%)

**Odds:** The ratio of the number of ways an event can occur to the number of ways it cannot occur.

For example, using the same events as above, the odds for:
- Drawing a red card from a standard deck of cards is 1:1; and
- Drawing a club from that deck is 1:3.
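The card examples above can be reproduced with a small helper. This is only an illustrative sketch; `odds_as_ratio` is a hypothetical name, not part of the assignment.

```python
from math import gcd

def odds_as_ratio(favorable, total):
    """Format the odds for an event as 'for:against', reduced to lowest terms."""
    unfavorable = total - favorable
    g = gcd(favorable, unfavorable)
    return "{}:{}".format(favorable // g, unfavorable // g)

red_odds = odds_as_ratio(26, 52)   # 26 red cards, 26 non-red cards
club_odds = odds_as_ratio(13, 52)  # 13 clubs, 39 non-clubs
print(red_odds, club_odds)
```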

> ### Question 2.  What are the odds of getting heads in a fair coin flip?

In [5]:
p / (1-p)

1.0

> ### Question 3.  Suppose that 18 out of 20 patients in an experiment lost weight while using diet A, while 16 out of 20 lost weight using diet B.  What's the probability of weight loss with diet A?  What are the odds?

In [6]:
p_a = 18/20.0
print('Probability is {}'.format(p_a))
odds_a = p_a/(1-p_a)
print('Odds are {}'.format(odds_a))

Probability is 0.9
Odds are 9.000000000000002


> ### Question 4.  What's the probability of weight loss with diet B?  What are the odds?

In [7]:
p_b = 16/20.0
print('Probability is {}'.format(p_b))
odds_b = p_b/(1-p_b)
print('Odds are {}'.format(odds_b))

Probability is 0.8
Odds are 4.000000000000001


> ### Question 5.  What's the odds ratio of diet A to diet B?

In [8]:
odds_a / odds_b

2.25
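The outputs `9.000000000000002` and `4.000000000000001` above are floating-point rounding artifacts. As a sketch of one way to avoid them, the same calculation can be done with exact rational arithmetic from the standard library; the helper name `odds` is an assumption made for this example.

```python
from fractions import Fraction

def odds(successes, total):
    """Exact odds p / (1 - p), where p = successes / total."""
    p = Fraction(successes, total)
    return p / (1 - p)

odds_a = odds(18, 20)            # diet A: 18 of 20 lost weight
odds_b = odds(16, 20)            # diet B: 16 of 20 lost weight
odds_ratio = odds_a / odds_b     # odds ratio of diet A to diet B
print(odds_a, odds_b, odds_ratio, float(odds_ratio))
```

The exact odds ratio is 9/4, which agrees with the 2.25 computed above.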

## Bank Marketing

In [9]:
df = pd.read_csv(os.path.join('..', 'datasets', 'dataset-11-bank-marketing.csv'))

In [10]:
df

Unnamed: 0,age,job,marital,education,default,...,campaign,pdays,previous,poutcome,c
0,30,unemployed,married,primary,no,...,1,-1,0,unknown,no
1,33,services,married,secondary,no,...,1,339,4,failure,no
2,35,management,single,tertiary,no,...,1,330,1,failure,no
3,30,management,married,tertiary,no,...,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,...,1,-1,0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...
4516,33,services,married,secondary,no,...,5,-1,0,unknown,no
4517,57,self-employed,married,tertiary,yes,...,1,-1,0,unknown,no
4518,57,technician,married,secondary,no,...,11,-1,0,unknown,no
4519,28,blue-collar,married,secondary,no,...,4,211,3,other,no


In [11]:
df.columns

Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
       'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'c'],
      dtype='object')

> The dataset is related to the direct marketing campaigns (by phone) of a Portuguese banking institution.  The classification goal is to predict whether the client will subscribe to a term deposit (variable `y`, stored as column `c` in this copy of the dataset).

Attributes Information:

- Input variables:
  - [Bank client data]
    - `age` (numeric)
    - `job`: type of job (categorical)
    - `marital`: marital status (categorical)
      - Note: `divorced` means divorced or widowed
    - `education` (categorical)
    - `default`: has credit in default? (categorical)
    - `balance`: bank account balance (\$)
    - `housing`: has housing loan? (categorical)
    - `loan`: has personal loan? (categorical)
  - [Data related with the last contact of the current campaign]
    - `contact`: contact communication type (categorical)
    - `month`: last contact month of year (categorical)
    - `day_of_week`: last contact day of the week (categorical); in this copy of the dataset the column is `day`, with values from 1 to 31
    - `duration`: last contact duration, in seconds (numeric)
      - Important note: this attribute highly affects the output target (e.g., if `duration = 0` then `y = 'no'`). Yet, the duration is not known before a call is performed.  Also, after the end of the call y is obviously known.  Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
  - [Other attributes]
    - `campaign`: number of contacts performed during this campaign and for this client (numeric)
    - `pdays`: number of days that passed by after the client was last contacted from a previous campaign (numeric)
      - 999 means client was not previously contacted (in this copy of the dataset, such clients have `pdays = -1`, as the rows displayed above show)
    - `previous`: number of contacts performed before this campaign and for this client (numeric)
    - `poutcome`: outcome of the previous marketing campaign (categorical)

- Output variable (desired target):
  - `y`: has the client subscribed a term deposit? (binary)

> Our goal is to develop a model that best predicts the outcome `y`, the success of the marketing campaign.

> ## Question 6.  Remove the categorical variables with the largest number of distinct values

In [12]:
print(df.job.unique())
print(df.marital.unique())
print(df.education.unique())
print(df.default.unique())
print(df.housing.unique())
print(df.loan.unique())
print(df.contact.unique())
print(df.month.unique())
print(df.day.unique())
print(df.poutcome.unique())

['unemployed' 'services' 'management' 'blue-collar' 'self-employed'
 'technician' 'entrepreneur' 'admin.' 'student' 'housemaid' 'retired'
 'unknown']
['married' 'single' 'divorced']
['primary' 'secondary' 'tertiary' 'unknown']
['no' 'yes']
['no' 'yes']
['no' 'yes']
['cellular' 'unknown' 'telephone']
['oct' 'may' 'apr' 'jun' 'feb' 'aug' 'jan' 'jul' 'nov' 'sep' 'mar' 'dec']
[19 11 16  3  5 23 14  6 17 20 13 30 29 27  7 18 12 21 26 22  2  4 15  8 28
  9  1 10 31 25 24]
['unknown' 'failure' 'other' 'success']


In [13]:
df.drop('job', axis=1, inplace=True)
# df.drop('month', axis=1, inplace=True)
# df.drop('day', axis=1, inplace=True)

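Answer: `job` has a lot of distinct values (12), as does `month` (12); `day` is described as categorical in the documentation and has 31 distinct values. Only `job` is actually dropped here; `month` and `day` stay in the data frame but are not used in the model below.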
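Rather than printing `unique()` for each column by hand, the distinct-value counts can be ranked in one step. The sketch below uses a small made-up frame (`toy`) in place of the bank data, and `exclude='number'` is an assumed way of picking out the categorical columns.

```python
import pandas as pd

# Toy stand-in for the bank data (values are illustrative only)
toy = pd.DataFrame({
    'job': ['services', 'management', 'student', 'retired'],
    'marital': ['married', 'single', 'married', 'divorced'],
    'loan': ['no', 'yes', 'no', 'no'],
    'age': [30, 33, 35, 59],
})

# Keep the non-numeric columns and count distinct values in each
categorical = toy.select_dtypes(exclude='number')
distinct_counts = categorical.nunique().sort_values(ascending=False)
print(distinct_counts)
```

Applied to `df`, this would not flag `day`, since it is stored as an integer; that column would still need to be checked separately.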

> ## Question 7.  Recode all `yes`/`no` categorical variables with `0` as the most frequent value (then also append `"_no"` to the variable name), and `1` for the other (then leave the name unchanged)

First default, then housing, then loan.

In [14]:
df.default.value_counts()

no     4445
yes      76
Name: default, dtype: int64

In [15]:
df.default = df.default.apply(lambda value: 0 if value == 'no' else 1)

In [16]:
# check if it worked
df.default.value_counts()

0    4445
1      76
Name: default, dtype: int64

Now housing

In [17]:
df.housing.value_counts()

yes    2559
no     1962
Name: housing, dtype: int64

In [18]:
df['housing_no'] = df.housing.apply(lambda value: 0 if value == 'yes' else 1)

In [19]:
df.housing_no.value_counts()

0    2559
1    1962
Name: housing_no, dtype: int64

Now loan

In [20]:
df.loan.value_counts()

no     3830
yes     691
Name: loan, dtype: int64

In [21]:
df.loan = df.loan.apply(lambda value: 0 if value == 'no' else 1)

In [22]:
df.loan.value_counts()

0    3830
1     691
Name: loan, dtype: int64

Oh, also outcome (c)

In [23]:
df.c.value_counts()

no     4000
yes     521
Name: c, dtype: int64

In [24]:
df.c = df.c.apply(lambda value: 0 if value == 'no' else 1)

In [25]:
df.c.value_counts()

0    4000
1     521
Name: c, dtype: int64
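The four recodings above follow the same rule, so they could be wrapped in one helper. This is a sketch under the assumption that each column only contains `'yes'` and `'no'`; `recode_binary` is a hypothetical name, and unlike the cells above it drops the original column when it creates the `_no` version.

```python
import pandas as pd

def recode_binary(df, column):
    """Recode a yes/no column so that the most frequent value becomes 0.

    If 'yes' is the most frequent value, the result is stored as
    '<column>_no' (1 means 'no'); otherwise the name is kept (1 means 'yes').
    """
    most_frequent = df[column].value_counts().idxmax()
    encoded = (df[column] != most_frequent).astype(int)
    name = column + '_no' if most_frequent == 'yes' else column
    out = df.drop(columns=column)
    out[name] = encoded
    return out

# Toy data: 'yes' dominates housing, 'no' dominates loan
toy = pd.DataFrame({'housing': ['yes', 'yes', 'no'],
                    'loan': ['no', 'no', 'yes']})
toy = recode_binary(recode_binary(toy, 'housing'), 'loan')
print(toy)
```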

> ## Question 8.  Create binary/dummy variables for the other categorical variables

In [26]:
m_df = pd.get_dummies(df.marital, prefix = 'marital')
ed_df = pd.get_dummies(df.education, prefix = 'education')
c_df = pd.get_dummies(df.contact, prefix = 'contact')

df = df.join([m_df, ed_df, c_df])

In [27]:
df.head()

Unnamed: 0,age,marital,education,default,balance,...,education_tertiary,education_unknown,contact_cellular,contact_telephone,contact_unknown
0,30,married,primary,0,1787,...,0.0,0.0,1.0,0.0,0.0
1,33,married,secondary,0,4789,...,0.0,0.0,1.0,0.0,0.0
2,35,single,tertiary,0,1350,...,1.0,0.0,1.0,0.0,0.0
3,30,married,tertiary,0,1476,...,1.0,0.0,0.0,0.0,1.0
4,59,married,secondary,0,0,...,0.0,0.0,0.0,0.0,1.0


> ## Question 9.  What should be your baseline for these binary variables (namely, which binary variables should you not include in your model)?

Baseline values should be the most frequent ones, so which are those? 

In [28]:
m_df.sum()

marital_divorced     528.0
marital_married     2797.0
marital_single      1196.0
dtype: float64

In [29]:
ed_df.sum()

education_primary       678.0
education_secondary    2306.0
education_tertiary     1350.0
education_unknown       187.0
dtype: float64

In [30]:
c_df.sum()

contact_cellular     2896.0
contact_telephone     301.0
contact_unknown      1324.0
dtype: float64

Answer: Baseline (and therefore taken out) should be marital_married, education_secondary, and contact_cellular

In [31]:
df.drop(['marital', 'education', 'contact','marital_married',
         'education_secondary', 'contact_cellular'], axis = 1, inplace = True)
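The baseline choice can also be made programmatically instead of by reading the sums. A minimal sketch on made-up data, assuming the most frequent level is the one to drop:

```python
import pandas as pd

# Toy stand-in for the marital column
toy = pd.DataFrame({'marital': ['married', 'single', 'married',
                                'divorced', 'married']})

dummies = pd.get_dummies(toy['marital'], prefix='marital')
baseline = dummies.sum().idxmax()           # most frequent level
dummies = dummies.drop(columns=baseline)    # keep only the non-baseline levels
print(baseline, list(dummies.columns))
```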

In [32]:
df.head()

Unnamed: 0,age,default,balance,housing,loan,...,education_primary,education_tertiary,education_unknown,contact_telephone,contact_unknown
0,30,0,1787,no,0,...,1.0,0.0,0.0,0.0,0.0
1,33,0,4789,yes,1,...,0.0,0.0,0.0,0.0,0.0
2,35,0,1350,yes,0,...,0.0,1.0,0.0,0.0,0.0
3,30,0,1476,yes,1,...,0.0,1.0,0.0,0.0,1.0
4,59,0,0,yes,0,...,0.0,0.0,0.0,0.0,1.0


> ## Question 10.  What input variable in the dataset seems to predict the outcome quite well?  Why?

In [33]:
pd.set_option('display.max_rows', 20)
pd.set_option('display.max_columns', 20)

df.corr()

Unnamed: 0,age,default,balance,loan,day,duration,campaign,pdays,previous,c,housing_no,marital_divorced,marital_single,education_primary,education_tertiary,education_unknown,contact_telephone,contact_unknown
age,1.0,-0.017885,0.08382,-0.01125,-0.017853,-0.002367,-0.005148,-0.008894,-0.003511,0.045092,0.193888,0.148043,-0.410768,0.224938,-0.094042,0.081111,0.18306,-0.035125
default,-0.017885,1.0,-0.070886,0.063994,-0.013261,-0.011615,-0.012348,-0.026317,-0.026656,0.001303,-0.006881,0.048878,0.007391,-0.006734,-0.021407,-0.00124,-0.021119,0.014151
balance,0.08382,-0.070886,1.0,-0.071349,-0.008677,-0.01595,-0.009976,0.009437,0.026196,0.017905,0.050227,-0.036284,0.007525,-0.001551,0.076487,0.01923,0.034025,-0.018892
loan,-0.01125,0.063994,-0.071349,1.0,-0.004879,-0.004997,0.01712,-0.031086,-0.022115,-0.070517,-0.018451,0.017798,-0.048496,-0.016574,-0.043434,-0.066621,-0.004945,-0.005894
day,-0.017853,-0.013261,-0.008677,-0.004879,1.0,-0.024629,0.160706,-0.094352,-0.059114,-0.011244,0.031291,-0.007121,0.006769,-0.020851,0.007465,0.000787,0.053527,-0.048143
duration,-0.002367,-0.011615,-0.01595,-0.004997,-0.024629,1.0,-0.068382,0.01038,0.01808,0.401118,-0.01574,0.021373,0.02456,-0.00364,-0.017779,-0.010802,-0.02118,-0.005469
campaign,-0.005148,-0.012348,-0.009976,0.01712,0.160706,-0.068382,1.0,-0.093137,-0.067833,-0.061147,0.003574,-0.022157,-0.008093,0.009746,0.022631,-0.020508,0.026571,0.004882
pdays,-0.008894,-0.026317,0.009437,-0.031086,-0.094352,0.01038,-0.093137,1.0,0.577562,0.104087,-0.116893,-0.003411,0.020693,-0.019708,0.000377,0.004601,0.01966,-0.24627
previous,-0.003511,-0.026656,0.026196,-0.022115,-0.059114,0.01808,-0.067833,0.577562,1.0,0.116714,-0.038621,-0.022158,0.035558,-0.020439,0.026977,-0.004239,0.031797,-0.194142
c,0.045092,0.001303,0.017905,-0.070517,-0.011244,0.401118,-0.061147,0.104087,0.116714,1.0,0.104683,0.03484,0.045815,-0.02742,0.056649,-0.00887,0.025878,-0.139399


Answer: `duration` has a 0.4 correlation with `c`, which is by far the highest of any input. But the documentation says not to include it in a realistic model: the duration is only known once the call is over, at which point the outcome is known too. Sigh.

In [34]:
df.drop('duration', axis = 1, inplace = True)

> ## Question 11.  Split the dataset into a training set (60%) and a testing set (the rest)

In [35]:
train_df = df.sample(frac = .6)
test_df = df.drop(train_df.index)
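The split above draws a different sample on every run, so the coefficients and error rates below will change from run to run. One option is to fix the seed; the sketch below shows the idea on a toy frame, with `random_state=0` as an arbitrary choice.

```python
import pandas as pd

# Toy stand-in for the bank data
toy = pd.DataFrame({'x': range(10), 'c': [0, 1] * 5})

train = toy.sample(frac=0.6, random_state=0)  # fixed seed: same 60% every run
test = toy.drop(train.index)                  # the remaining 40%
print(len(train), len(test))
```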

> ## Question 12.  Run a logistic regression with `age`, `marital` (the dummies), `default`, `balance`, `housing`, `loan`, `campaign`, `pdays`, `previous`

In [36]:
columns_to_include = ['age', 'marital_single', 'marital_divorced',
    'default', 'balance', 'housing_no',
    'loan', 'campaign', 'pdays', 'previous']

def X_c(df):
    X = df[ columns_to_include ]
    c = df.c
    return X, c

train_X, train_c = X_c(train_df)
test_X, test_c = X_c(test_df)

In [37]:
train_X.dtypes

age                   int64
marital_single      float64
marital_divorced    float64
default               int64
balance               int64
housing_no            int64
loan                  int64
campaign              int64
pdays                 int64
previous              int64
dtype: object

In [38]:
model = linear_model.LogisticRegression().fit(train_X, train_c)

print(model.intercept_)
print(model.coef_)

[-2.98898955]
[[  1.39750710e-02   3.64805557e-01   3.26208230e-01   1.17281980e-01
   -1.77128369e-05   8.03796675e-01  -5.61061369e-01  -9.76732617e-02
    1.87575235e-03   1.17753371e-01]]


In [40]:
train_X.columns

Index(['age', 'marital_single', 'marital_divorced', 'default', 'balance',
       'housing_no', 'loan', 'campaign', 'pdays', 'previous'],
      dtype='object')

> ## Question 13.  What is your training error?  What is your generalization error?  Does it make sense?

In [39]:
print(1-model.score(train_X,train_c))
print(1-model.score(test_X,test_c))

0.116844821231
0.116150442478


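Answer: The testing error (11.6%) is a little less than the training error (11.7%), which looks odd at first, but the gap is small enough that it may just be an artifact of the random split. Both are also very close to the roughly 11.5% error of always predicting `no` (521 of the 4,521 clients subscribed), so the model seems to add little over that baseline.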
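To put the error rates in context, they can be compared with a majority-class baseline. The sketch below uses the counts printed by `df.c.value_counts()` earlier; note that these are counts for the full dataset, not for the test set alone, so the comparison is only approximate.

```python
# Counts of the outcome c over the full dataset, from df.c.value_counts()
n_no, n_yes = 4000, 521

# Error rate of a "model" that always predicts 'no'
baseline_error = n_yes / (n_no + n_yes)

# Test error printed by the notebook above
test_error = 0.116150442478

print(round(baseline_error, 4), round(test_error, 4))
```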

> ## Question 14.  Interpret your coefficients. (At least `marital_single`, `campaign`, and `default`).  Does your interpretation  make sense?

In [45]:
for a,b in zip(columns_to_include, np.exp(model.coef_[0]) - 1):
    print(a,b)

age 0.0140731787953
marital_single 0.440233936889
marital_divorced 0.385703884353
default 0.124436453615
balance -1.77126800169e-05
housing_no 1.23400664579
loan -0.42939687789
campaign -0.0930548109504
pdays 0.00187751267126
previous 0.124966628183


Answer: `np.exp(coef) - 1` gives the fractional change in the odds for a one-unit increase in each variable, holding the others fixed. Single people are more likely to subscribe than married people, the baseline (odds are 44% higher), perhaps because they only have to make decisions for themselves.

Each additional contact during this campaign makes people slightly less likely to subscribe (odds go down by 9% per contact), probably because people get better at saying 'no' once they've said it before.

People who have defaulted are slightly more likely to subscribe (odds go up by 12%), perhaps because people who default are less responsible with their money. With only 76 defaults in the full dataset, this estimate is probably not very reliable.
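The reading of the coefficients can be checked numerically: adding one unit to a variable adds its coefficient to the log-odds, which multiplies the odds by `exp(coefficient)`. A small sketch using the `campaign` coefficient printed above; the starting log-odds value is arbitrary and chosen only for illustration.

```python
import math

beta_campaign = -9.76732617e-02   # campaign coefficient from model.coef_ above

def odds(log_odds):
    """Convert log-odds to odds."""
    return math.exp(log_odds)

base_log_odds = -2.0              # arbitrary starting point (assumption)
ratio = odds(base_log_odds + beta_campaign) / odds(base_log_odds)
percent_change = (ratio - 1) * 100
print(round(percent_change, 1))
```

The ratio does not depend on the starting point, which is why a single percentage can be quoted per variable.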

> ## Question 15.  What is your prediction for a 30-year-old single female, a homeowner with a \$1,000 balance in the bank, without a loan, who has never been contacted before, and who has never defaulted?

In [46]:
predict_X = [ [30, 1, 0, 0, 1000, 1, 0, 3, 999, 0] ]

print(model.predict(predict_X))
print(model.predict_proba(predict_X))

[1]
[[ 0.45958616  0.54041384]]


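Answer: A 54% chance of subscribing. Note that the input uses `pdays = 999` and `campaign = 3`; in this dataset, clients who were never contacted have `pdays = -1`, and the `pdays` coefficient is positive, so 999 likely inflates this estimate.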
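The prediction can be reproduced by hand from the intercept and coefficients printed in Question 12, which also makes it easy to see how much the `pdays` value matters. This is a sketch, not a replacement for `model.predict_proba`; `subscribe_probability` is a hypothetical helper, and the second input simply swaps 999 for the `-1` value that the data uses for never-contacted clients.

```python
import math

# Values printed by model.intercept_ and model.coef_ above
intercept = -2.98898955
coefs = [1.39750710e-02, 3.64805557e-01, 3.26208230e-01, 1.17281980e-01,
         -1.77128369e-05, 8.03796675e-01, -5.61061369e-01, -9.76732617e-02,
         1.87575235e-03, 1.17753371e-01]

def subscribe_probability(row):
    """Logistic function applied to intercept + sum(coef * value)."""
    log_odds = intercept + sum(b * x for b, x in zip(coefs, row))
    return 1 / (1 + math.exp(-log_odds))

# Same column order as columns_to_include; only pdays differs
p_999 = subscribe_probability([30, 1, 0, 0, 1000, 1, 0, 3, 999, 0])
p_minus1 = subscribe_probability([30, 1, 0, 0, 1000, 1, 0, 3, -1, 0])
print(round(p_999, 3), round(p_minus1, 3))
```

With `pdays = 999` this matches the 54% above; with `pdays = -1` the estimate drops to roughly 15%.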