# Lunch and learn: Machine learning (logistic regression)

### Goals of this notebook:
Demonstrate how to:
    1. Define the problem
    2. Clean/prepare data for building a model
    3. Train the model
    4. Evaluate the model
    5. Use the model to predict outcomes based on fresh/unseen data


### TODOS
    1. Finish write up as a proper ML notebook
    2. Host program/function on Django and expose as an HTTP endpoint
    3. Prepare presentation


In [2]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

### 1. Define the problem

The most important part of any ML problem is: **what puzzle(s) do we / our clients want to solve?**

In this example, we have bank marketing data with a binary classification goal, so the questions I can come up with are as follows:
- Which indicators (x variables) have the biggest effect on whether a client will subscribe to a bank term deposit (y variable)?
- Given the variables the model requires for a particular client, can we predict whether that client will subscribe to a bank term deposit (y variable)?

### 1. Clean/prepare data for building a model

In [None]:
### 2. Train the model
### 3. Evaluate the model
### 4. Using the model to predict outcomes based on fresh/unseen data

In [None]:
### 2. Ingest data

In [3]:
df = pd.read_csv('./data/bank-marketing-data/bank-additional-full.csv', sep=';')

### 2.1 Understand the data
The data relates to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).

Data Sources: Sérgio Moro (ISCTE-IUL), Paulo Cortez (Univ. Minho) and Paulo Rita (ISCTE-IUL) @ 2014
   
  S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems (2014), doi:10.1016/j.dss.2014.03.001.
 
5. Number of Instances: 41188 for bank-additional-full.csv

6. Number of Attributes: 20 + output attribute.

7. Attribute information:

   ## Input variables:
   ### bank client data:
   
   1 - age (numeric)
   2 - job : type of job (categorical: "admin.","blue-collar","entrepreneur","housemaid","management","retired","self-employed","services","student","technician","unemployed","unknown")
   3 - marital : marital status (categorical: "divorced","married","single","unknown"; note: "divorced" means divorced or widowed)
   4 - education (categorical: "basic.4y","basic.6y","basic.9y","high.school","illiterate","professional.course","university.degree","unknown")
   5 - default: has credit in default? (categorical: "no","yes","unknown")
   6 - housing: has housing loan? (categorical: "no","yes","unknown")
   7 - loan: has personal loan? (categorical: "no","yes","unknown")
   ### related with the last contact of the current campaign:
   
   8 - contact: contact communication type (categorical: "cellular","telephone") 
   9 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
  10 - day_of_week: last contact day of the week (categorical: "mon","tue","wed","thu","fri")
  11 - duration: last contact duration, in seconds (numeric). Important note:  this attribute highly affects the output target (e.g., if duration=0 then y="no"). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
  
   ### other attributes:
  12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
  14 - previous: number of contacts performed before this campaign and for this client (numeric)
  15 - poutcome: outcome of the previous marketing campaign (categorical: "failure","nonexistent","success")
  
   ### social and economic context attributes
  16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
  17 - cons.price.idx: consumer price index - monthly indicator (numeric)     
  18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)     
  19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
  20 - nr.employed: number of employees - quarterly indicator (numeric)

  # Output variable (desired target):
  21 - y - has the client subscribed a term deposit? (binary: "yes","no")

8. Missing Attribute Values: Several categorical attributes have missing values, all coded with the "unknown" label. These can be kept as a class label of their own, or handled with deletion or imputation techniques.
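The three options mentioned for the "unknown" label can be sketched on a tiny, made-up frame. The frame `toy` and its values are assumptions for illustration only; the imputation shown (most frequent known value) is just one of several possible choices.

```python
import pandas as pd

# Tiny hypothetical frame standing in for the real dataset.
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "services", "admin."],
    "loan": ["no", "yes", "unknown", "no"],
})

# Count how many "unknown" labels each column contains.
unknown_counts = (toy == "unknown").sum()

# Option 1: keep "unknown" as its own category (nothing to do).

# Option 2: deletion, drop every row that contains "unknown" anywhere.
dropped = toy[(toy != "unknown").all(axis=1)]

# Option 3: imputation, replace "unknown" with the most frequent known value.
imputed = toy.copy()
for col in imputed.columns:
    known = imputed.loc[imputed[col] != "unknown", col]
    imputed[col] = imputed[col].replace("unknown", known.mode().iloc[0])

print(unknown_counts)
print(dropped)
print(imputed)
```

Deletion is the simplest option but throws away rows; the other two keep the full sample at the cost of an extra modelling assumption.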


In [21]:
df.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [20]:
df.describe()

Unnamed: 0,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
count,41188.0,41188.0,41188.0,41188.0,41188.0,41188.0,41188.0,41188.0,41188.0,41188.0
mean,40.02406,258.28501,2.567593,962.475454,0.172963,0.081886,93.575664,-40.5026,3.621291,5167.035911
std,10.42125,259.279249,2.770014,186.910907,0.494901,1.57096,0.57884,4.628198,1.734447,72.251528
min,17.0,0.0,1.0,0.0,0.0,-3.4,92.201,-50.8,0.634,4963.6
25%,32.0,102.0,1.0,999.0,0.0,-1.8,93.075,-42.7,1.344,5099.1
50%,38.0,180.0,2.0,999.0,0.0,1.1,93.749,-41.8,4.857,5191.0
75%,47.0,319.0,3.0,999.0,0.0,1.4,93.994,-36.4,4.961,5228.1
max,98.0,4918.0,56.0,999.0,7.0,1.4,94.767,-26.9,5.045,5228.1


In [None]:
### 3. Data cleaning/wrangling

In [4]:
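# Treating missing values

# Fastest way: drop every row where the value is coded as 'unknown'
# (the missing values are strings, not NaN, so dropna() would not remove them)
df = df[df['loan'] != 'unknown']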

In [5]:
df = df[df['default'] != 'unknown']

In [6]:
df = df[df['education'] != 'unknown']

In [7]:
df = df[df['job'] != 'unknown']

In [8]:
df = df[df['marital'] != 'unknown']
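The five filters above could also be written as a single boolean mask. This is a sketch on a made-up miniature frame (`toy` is hypothetical), not a replacement the notebook was actually run with.

```python
import pandas as pd

# Hypothetical miniature of the real frame; only the filtered columns are shown.
toy = pd.DataFrame({
    "loan": ["no", "unknown", "no"],
    "default": ["no", "no", "unknown"],
    "education": ["basic.4y", "high.school", "basic.6y"],
    "job": ["admin.", "services", "admin."],
    "marital": ["married", "single", "married"],
})

cols_with_unknown = ["loan", "default", "education", "job", "marital"]

# True for rows where none of the listed columns equals "unknown".
keep = ~toy[cols_with_unknown].eq("unknown").any(axis=1)
filtered = toy[keep]
print(filtered)
```

Keeping the column list in one place makes it easier to see (and change) which attributes trigger a row being dropped.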

In [9]:
### 4. EDA
df.describe()

Unnamed: 0,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
count,30488.0,30488.0,30488.0,30488.0,30488.0,30488.0,30488.0,30488.0,30488.0,30488.0
mean,39.030012,259.484092,2.521451,956.332295,0.194273,-0.07151,93.523311,-40.602263,3.459938,5160.813409
std,10.333529,261.714262,2.72015,201.373292,0.522788,1.610399,0.585374,4.789249,1.777231,75.158065
min,17.0,0.0,1.0,0.0,0.0,-3.4,92.201,-50.8,0.634,4963.6
25%,31.0,103.0,1.0,999.0,0.0,-1.8,93.075,-42.7,1.313,5099.1
50%,37.0,181.0,2.0,999.0,0.0,1.1,93.444,-41.8,4.856,5191.0
75%,45.0,321.0,3.0,999.0,0.0,1.4,93.994,-36.4,4.961,5228.1
max,95.0,4918.0,43.0,999.0,7.0,1.4,94.767,-26.9,5.045,5228.1


In [10]:
# Convert string data to numerical data so that scikit-learn can understand it
cols_to_transform = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week',
                     'poutcome', 'y']
df_with_dummies = pd.get_dummies(df, columns=cols_to_transform)
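To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a made-up three-row frame. The frame `toy` is an assumption. The sketch also shows an alternative for the target: instead of one-hot encoding `y` into `y_no`/`y_yes`, it can be mapped directly to a single 0/1 column.

```python
import pandas as pd

# Hypothetical three-row stand-in for the real data.
toy = pd.DataFrame({
    "age": [30, 45, 52],
    "marital": ["single", "married", "single"],
    "y": ["no", "yes", "no"],
})

# One-hot encode the categorical feature; dtype=int gives 0/1 rather than booleans.
features = pd.get_dummies(toy.drop(columns="y"), columns=["marital"], dtype=int)

# Map the binary target straight to a single 0/1 Series.
target = (toy["y"] == "yes").astype(int)

print(features)
print(target)
```

Each category becomes its own indicator column named `<column>_<value>`, which is where names such as `poutcome_success` come from.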

In [11]:
df_with_dummies.head()

Unnamed: 0,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,...,day_of_week_fri,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,poutcome_failure,poutcome_nonexistent,poutcome_success,y_no,y_yes
0,56,261,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0,1,0,0,0,0,1,0,1,0
2,37,226,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0,1,0,0,0,0,1,0,1,0
3,40,151,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0,1,0,0,0,0,1,0,1,0
4,56,307,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0,1,0,0,0,0,1,0,1,0
6,59,139,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0,1,0,0,0,0,1,0,1,0


In [12]:
# Preparing the data into 2 sets: X and y variables

df_y = df_with_dummies[['y_yes']]
df_y.head()

Unnamed: 0,y_yes
0,0
2,0
3,0
4,0
6,0


In [13]:
del df_with_dummies['y_yes']
del df_with_dummies['y_no']
df_with_dummies.head()

Unnamed: 0,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,...,month_oct,month_sep,day_of_week_fri,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,poutcome_failure,poutcome_nonexistent,poutcome_success
0,56,261,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0,0,0,1,0,0,0,0,1,0
2,37,226,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0,0,0,1,0,0,0,0,1,0
3,40,151,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0,0,0,1,0,0,0,0,1,0
4,56,307,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0,0,0,1,0,0,0,0,1,0
6,59,139,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0,0,0,1,0,0,0,0,1,0
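Note that the feature table above still contains `duration`. The attribute description says this input should only be kept for benchmark purposes and discarded for a realistic predictive model, because the call length is not known before the call is made. A minimal sketch of dropping it, using a made-up two-row frame (`toy_features` is hypothetical):

```python
import pandas as pd

# Hypothetical slice of the feature table.
toy_features = pd.DataFrame({
    "age": [56, 37],
    "duration": [261, 226],
    "campaign": [1, 1],
})

# Remove the column that would leak information about the outcome.
realistic_features = toy_features.drop(columns=["duration"])
print(realistic_features.columns.tolist())
```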


In [14]:
df_with_dummies.to_numpy(dtype=float)  # as_matrix() was removed in pandas 1.0

array([[  56.,  261.,    1., ...,    0.,    1.,    0.],
       [  37.,  226.,    1., ...,    0.,    1.,    0.],
       [  40.,  151.,    1., ...,    0.,    1.,    0.],
       ..., 
       [  56.,  189.,    2., ...,    0.,    1.,    0.],
       [  44.,  442.,    1., ...,    0.,    1.,    0.],
       [  74.,  239.,    3., ...,    1.,    0.,    0.]])

In [15]:
# Convert the pandas DataFrames into NumPy arrays
# (as_matrix() was removed in pandas 1.0; to_numpy() replaces it)
X = df_with_dummies.to_numpy(dtype=float)
y = df_y.to_numpy().ravel()  # flatten to 1-D, the shape scikit-learn expects for y

In [None]:
### 5. Train the model

In [16]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
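The split above uses the defaults, which hold out 25% of the rows. Because the "yes" class is a small minority, it may be worth passing `stratify` so that both splits keep the same class ratio. This is a sketch on synthetic data; the 12% positive rate and the array names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in: 1000 rows, 12% positives.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 120 + [0] * 880)

# stratify=y_toy keeps the share of positives the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

print("positive share in train:", y_tr.mean())
print("positive share in test: ", y_te.mean())
```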

In [18]:
logisticregression = LogisticRegression().fit(X_train, y_train)
print("training set score: %f" % logisticregression.score(X_train, y_train))
print("test set score: %f" % logisticregression.score(X_test, y_test))


training set score: 0.900245
test set score: 0.899370
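The first question in "Define the problem" asks which indicators have the biggest effect. One common way to approach it, sketched here on synthetic data, is to standardize the features first and then compare coefficient magnitudes; without scaling, coefficients of features on very different scales are not comparable. The feature names and the data-generating process below are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000

# Hypothetical features on very different scales.
feature_names = ["strong_signal", "weak_signal", "noise"]
X_toy = np.column_stack([
    rng.normal(0, 1000, n),  # large scale, strong effect
    rng.normal(0, 1, n),     # small scale, weak effect
    rng.normal(0, 50, n),    # no effect
])
logit = 2.0 * X_toy[:, 0] / 1000 + 0.5 * X_toy[:, 1]
y_toy = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Scale, then fit; make_pipeline names each step after its lowercased class name.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_toy, y_toy)

coefs = model.named_steps["logisticregression"].coef_[0]
ranking = sorted(zip(feature_names, coefs), key=lambda p: abs(p[1]), reverse=True)
for name, coef in ranking:
    print(f"{name:>15}: {coef:+.3f}")
```

Coefficient size is only a rough guide to importance, especially when features are correlated, so treat the ranking as a starting point rather than a conclusion.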


In [19]:
from sklearn import metrics

In [22]:
expected = y_test
predicted = logisticregression.predict(X_test)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))

             precision    recall  f1-score   support

          0       0.92      0.97      0.94      6662
          1       0.66      0.42      0.51       960

avg / total       0.89      0.90      0.89      7622



In [23]:
print(metrics.confusion_matrix(expected, predicted))

[[6455  207]
 [ 560  400]]
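The numbers in the classification report can be recomputed by hand from the confusion matrix printed above, which also makes it easy to compare accuracy against a majority-class baseline. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions.
cm = np.array([[6455, 207],
               [560, 400]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the predicted "yes", how many were right
recall = tp / (tp + fn)      # of the actual "yes", how many were found
f1 = 2 * precision * recall / (precision + recall)

# Accuracy obtained by always predicting the majority class "no".
baseline_accuracy = (tn + fp) / cm.sum()

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"majority-class baseline accuracy={baseline_accuracy:.4f}")
```

The baseline comes out at roughly 0.874, so the test accuracy of about 0.899 is only a modest improvement; the recall of 0.42 for class 1 is the more telling figure for this problem.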


In [40]:
# Using our trained model to predict whether y will be 1 or 0

# scikit-learn expects a 2-D array of shape (n_samples, n_features),
# so reshape the single row before predicting
sample_input = X_test[0].reshape(1, -1)

# .predict_proba() returns the probability of the sample for each class in the model,
# where classes are ordered as they are in self.classes_.
print(logisticregression.predict_proba(sample_input))


# .predict() returns the class label (i.e. whether the prediction is 0 or 1)
print(logisticregression.predict(sample_input))


[[ 0.90609139  0.09390861]]
[0]




In [41]:
# Using our trained model to predict whether y will be 1 or 0

# scikit-learn expects a 2-D array of shape (n_samples, n_features),
# so reshape the single row before predicting
sample_input_2 = X_test[2].reshape(1, -1)

# .predict_proba() returns the probability of the sample for each class in the model,
# where classes are ordered as they are in self.classes_.
print(logisticregression.predict_proba(sample_input_2))


# .predict() returns the class label (i.e. whether the prediction is 0 or 1)
print(logisticregression.predict(sample_input_2))


[[ 0.24718234  0.75281766]]
[1]
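`.predict()` labels a sample as 1 when its predicted probability for class 1 is above 0.5. Since the probabilities are available, a different cut-off can be applied by hand, which is one way to trade precision for recall on the minority class. The sketch below uses synthetic data, and the 0.3 threshold is an arbitrary value chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic imbalanced data standing in for the bank dataset.
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, weights=[0.88, 0.12], random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Column 1 holds the probability of clf.classes_[1], i.e. the positive class.
proba_yes = clf.predict_proba(X_toy)[:, 1]

default_pred = (proba_yes >= 0.5).astype(int)
lenient_pred = (proba_yes >= 0.3).astype(int)  # illustrative lower threshold

recall_default = recall_score(y_toy, default_pred)
recall_lenient = recall_score(y_toy, lenient_pred)
print("recall at 0.5:", recall_default)
print("recall at 0.3:", recall_lenient)
```

Lowering the threshold can only keep or increase recall, but it usually lowers precision, so the right value depends on the relative cost of a missed subscriber versus a wasted call.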




array([1], dtype=uint8)

In [42]:
from sklearn.linear_model import LogisticRegressionCV

In [44]:
logisticregressionCV = LogisticRegressionCV().fit(X_train, y_train)
print("training set score: %f" % logisticregressionCV.score(X_train, y_train))
print("test set score: %f" % logisticregressionCV.score(X_test, y_test))

training set score: 0.900245
test set score: 0.900026
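`LogisticRegressionCV` searches over a grid of `C` values using cross-validation and refits with the best one, which is why its scores can differ slightly from the plain `LogisticRegression` above. A minimal sketch on synthetic data follows; the grid size and fold count are illustrative choices, not the settings used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)

# Try 5 candidate C values with 5-fold cross-validation.
cv_model = LogisticRegressionCV(Cs=5, cv=5, max_iter=1000).fit(X_toy, y_toy)

# np.ravel handles C_ whether it is stored as an array or a scalar.
selected_C = float(np.ravel(cv_model.C_)[0])
print("candidate C values:", cv_model.Cs_)
print("selected C:", selected_C)
```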


In [46]:
logisticregression.C

1.0
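In scikit-learn, `C` is the inverse of the regularization strength: smaller values penalize large coefficients more heavily. The sketch below, on synthetic data, compares the size of the fitted coefficients for a small and a large `C`; both values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)

# Small C = strong regularization; large C = weak regularization.
strong_reg = LogisticRegression(C=0.01, max_iter=1000).fit(X_toy, y_toy)
weak_reg = LogisticRegression(C=100.0, max_iter=1000).fit(X_toy, y_toy)

norm_strong = np.linalg.norm(strong_reg.coef_)
norm_weak = np.linalg.norm(weak_reg.coef_)
print("coefficient norm with C=0.01:", norm_strong)
print("coefficient norm with C=100: ", norm_weak)
```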

In [49]:
logisticregression.solver

'liblinear'

In [48]:
logisticregressionCV.

'lbfgs'