
<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">


# Logistic Regression Lab
## Exercise with bank marketing data

_Authors: Sam Stack(DC)_

## Introduction
- Data from the UCI Machine Learning Repository: data, [data dictionary](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing)
- **Goal**: Predict whether a customer will purchase a bank product marketed over the phone
- `bank-additional.csv` is already in our repo, so there is no need to download the data from the UCI website

## Step 1: Read the data into Pandas

In [52]:
import pandas as pd
bank = pd.read_csv('../data/bank.csv')
bank.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,30,blue-collar,married,basic.9y,no,yes,no,cellular,may,fri,...,2,999,0,nonexistent,-1.8,92.893,-46.2,1.313,5099.1,0
1,39,services,single,high.school,no,no,no,telephone,may,fri,...,4,999,0,nonexistent,1.1,93.994,-36.4,4.855,5191.0,0
2,25,services,married,high.school,no,yes,no,telephone,jun,wed,...,1,999,0,nonexistent,1.4,94.465,-41.8,4.962,5228.1,0
3,38,services,married,basic.9y,no,unknown,unknown,telephone,jun,fri,...,3,999,0,nonexistent,1.4,94.465,-41.8,4.959,5228.1,0
4,47,admin.,married,university.degree,no,yes,no,cellular,nov,mon,...,1,999,0,nonexistent,-0.1,93.2,-42.0,4.191,5195.8,0


**Target `y` is encoded as:**
- No: 0
- Yes: 1
    


In [53]:
# Perform whatever steps you need to familiarize yourself with the data:
bank.y.mean()

print('marital values: ', bank.marital.unique())
print('campaign values:', bank.campaign.unique())
print('age range: ', bank.age.min(), ' to ', bank.age.max())

marital values:  ['married' 'single' 'divorced' 'unknown']
campaign values: [ 2  4  1  3  6  7 27  5 12 14 10  8 11 13  9 15 16 18 17 22 19 23 24 35 29]
age range:  18  to  88



## Step 2: Prepare at least three features
- Include both numeric and categorical features
- Choose features that you think might be related to the response (based on intuition or exploration)
- Think about how to handle missing values (encoded as "unknown")
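One way to approach the "unknown" placeholder is sketched below on a small made-up frame (the `toy` data and its values are hypothetical, not taken from `bank`). It shows two common options: treating "unknown" as the reference level of a one-hot encoding, or dropping the affected rows.

```python
import pandas as pd

# Toy frame that mimics a few columns of the bank data (values are made up).
toy = pd.DataFrame({
    "age": [30, 39, 25, 38, 47],
    "marital": ["married", "single", "married", "unknown", "divorced"],
    "housing": ["yes", "no", "yes", "unknown", "yes"],
})
cat_cols = ["marital", "housing"]

# How often does each column use the "unknown" placeholder?
unknown_share = (toy[cat_cols] == "unknown").mean()
print(unknown_share)

# Option A: one-hot encode, then drop the "unknown" level of each feature so it
# becomes the reference level and the dummy columns are not perfectly collinear.
dummies = pd.get_dummies(toy[cat_cols], dtype=int)
features = toy[["age"]].join(
    dummies.drop(columns=["marital_unknown", "housing_unknown"])
)
print(features)

# Option B: keep only the rows with no "unknown" in the chosen columns.
complete_rows = toy[(toy[cat_cols] != "unknown").all(axis=1)]
print(len(complete_rows), "complete rows")
```

Option A keeps every row, which matters when "unknown" is frequent; Option B is simpler but shrinks the data.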

In [54]:
# A:
categorical_cols = ['marital']
bank = bank.join(pd.get_dummies(bank['marital'], prefix='marital'))

feature_cols = ['age', 'marital_divorced', 'marital_married', 'marital_single'] #drop marital_unknown

bank.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y,marital_divorced,marital_married,marital_single,marital_unknown
0,30,blue-collar,married,basic.9y,no,yes,no,cellular,may,fri,...,-1.8,92.893,-46.2,1.313,5099.1,0,0,1,0,0
1,39,services,single,high.school,no,no,no,telephone,may,fri,...,1.1,93.994,-36.4,4.855,5191.0,0,0,0,1,0
2,25,services,married,high.school,no,yes,no,telephone,jun,wed,...,1.4,94.465,-41.8,4.962,5228.1,0,0,1,0,0
3,38,services,married,basic.9y,no,unknown,unknown,telephone,jun,fri,...,1.4,94.465,-41.8,4.959,5228.1,0,0,1,0,0
4,47,admin.,married,university.degree,no,yes,no,cellular,nov,mon,...,-0.1,93.2,-42.0,4.191,5195.8,0,0,1,0,0



## Step 3: Model building
- Use cross-validation to evaluate the logistic regression model with your chosen features.  
    You can use any combination of the following metrics:
    - [Accuracy / classification error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)
    - [Confusion Matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)
    - [ROC curves and area under a curve (AUC)](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score)
    - [Log loss](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html)
- Try to improve the metrics by selecting different sets of features
    - *Bonus*: Experiment with hyperparameters such as regularization.
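As a rough illustration of cross-validated evaluation with several metrics at once, the sketch below uses scikit-learn's `cross_validate` on synthetic, imbalanced data. The data set (`X_demo`, `y_demo`) is a stand-in generated with `make_classification`, not the bank data, and `max_iter=1000` is only there to avoid convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the bank features, with roughly 10% positives.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)

metric_names = ["accuracy", "roc_auc", "neg_log_loss"]
scores = cross_validate(
    LogisticRegression(max_iter=1000),
    X_demo, y_demo,
    cv=5,
    scoring=metric_names,
)

# cross_validate returns one array of per-fold scores per metric.
for name in metric_names:
    fold_scores = scores["test_" + name]
    print(name, round(fold_scores.mean(), 3))
```

Note that scikit-learn reports log loss as `neg_log_loss` (higher is better), so values closer to zero indicate a better fit.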

In [55]:
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split

from sklearn import metrics

**Build a Model**  

In [56]:
import numpy as np

X = bank[feature_cols]
y = bank.y

# Baseline: default (L2-regularized) logistic regression on a random split
X_train, X_test, y_train, y_test = train_test_split(X, y)
LR = LogisticRegression()
LR.fit(X_train, y_train)
y_pred = LR.predict(X_test)
LR.score(X_test, y_test)

# Reproducible split; a very large C makes the regularization negligible
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=46)
logit_simple = LogisticRegression(C=1e9).fit(X_train, y_train)

# Both lines compute accuracy on the test set
print(logit_simple.score(X_test, y_test))
print(np.mean(y_test == logit_simple.predict(X_test)))


0.892233009709
0.892233009709


**Get the Coefficient for each feature.**
- Be sure to make note of interesting findings.
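A minimal sketch of how fitted coefficients can be read off a `LogisticRegression` model follows. It uses made-up data (the columns `x1` and `x2` and the simulated `target` are hypothetical) so that the expected pattern is known in advance: `x1` drives the outcome and `x2` does not.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up data: the outcome depends positively on x1 and not at all on x2.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x1": rng.normal(size=500), "x2": rng.normal(size=500)})
p = 1 / (1 + np.exp(-(2.0 * demo["x1"] - 1.0)))
target = (rng.uniform(size=500) < p).astype(int)

model = LogisticRegression().fit(demo, target)

# For a binary problem coef_ has shape (1, n_features);
# pair each value with its column name.
coefs = pd.Series(model.coef_[0], index=demo.columns)

# exp(coefficient) is the multiplicative change in the odds per unit increase.
odds_ratios = np.exp(coefs)

print(pd.DataFrame({"coefficient": coefs, "odds_ratio": odds_ratios}))
print("intercept:", model.intercept_[0])
```

For dummy-encoded features, each coefficient is relative to the dropped reference level, so the sign tells you whether that category raises or lowers the odds compared with the reference.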



**Use the model to predict on `X_test` and evaluate it using the metric(s) of your choice.**

In [74]:
# A:
#bank.y.mean()
logit_pred_proba = logit_simple.predict_proba(X_test)[:,1]
metrics.confusion_matrix(y_true=y_test, y_pred=logit_pred_proba > .1)



array([[382, 537],
       [ 36,  75]], dtype=int64)
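
The counts in the matrix above can be turned into rates with a few lines of arithmetic. The sketch below hard-codes the four counts from that output and relies on scikit-learn's layout, in which rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: actual 0/1, columns: predicted 0/1).
cm = np.array([[382, 537],
               [36, 75]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
recall = tp / (tp + fn)        # share of actual buyers that were flagged
precision = tp / (tp + fp)     # share of flagged customers that actually bought
specificity = tn / (tn + fp)   # share of non-buyers correctly left alone

print(f"accuracy={accuracy:.3f} recall={recall:.3f} "
      f"precision={precision:.3f} specificity={specificity:.3f}")
```

At the 0.1 threshold the model finds about two thirds of the buyers (recall of roughly 0.68), but only about 12% of the customers it flags actually buy. Note also that 919 of the 1,030 test rows are class 0, and 919/1030 ≈ 0.892 is the accuracy printed earlier, which is what a model that predicts 0 for every row would score.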

### Model 2: Use a different combination of features.
- Evaluate the model and interpret your chosen metrics.

In [None]:
# A:



### Is your model not performing very well?

Is it not predicting any True Positives?

Let's try one more thing before we resort to grabbing more features: adjusting the probability threshold.

Use the `LogisticRegression.predict_proba()` method to get the probabilities.

Recall from the lesson that the first probability is for `class 0` and the second is for `class 1`.
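
To make the column layout of `predict_proba()` concrete, here is a small sketch on a made-up one-feature data set (`X_small` and `y_small` are hypothetical). It checks the column order against `classes_` and shows that `predict()` corresponds to a 0.5 cut-off on the second column.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up data set with a single feature.
X_small = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_small = np.array([0, 0, 0, 1, 0, 1])

clf = LogisticRegression().fit(X_small, y_small)

proba = clf.predict_proba(X_small)   # shape (n_samples, 2)
print(clf.classes_)                  # column order of proba
p_class1 = proba[:, 1]               # probability of class 1 for every row

# Each row sums to 1, and predict() matches thresholding p_class1 at 0.5.
print(proba.sum(axis=1))
print(np.array_equal(clf.predict(X_small), (p_class1 > 0.5).astype(int)))
```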

In [78]:
# A:


**Visualize the distribution**

In [None]:
# A:

**Calculate a new threshold and use it to convert predicted probabilities to output classes**
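
There is no single correct way to pick the threshold. The sketch below, on synthetic imbalanced data (not the bank data), compares two common heuristics: using the training base rate as the cut-off, and picking the ROC point that maximizes TPR minus FPR (Youden's J). Both thresholds are chosen on the training split so that the test split stays untouched for evaluation. All variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly 10% positives.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p_tr = clf.predict_proba(X_tr)[:, 1]
p_te = clf.predict_proba(X_te)[:, 1]

# Heuristic 1: the share of positives in the training data.
base_rate_threshold = y_tr.mean()

# Heuristic 2: the training ROC point that maximizes TPR - FPR (Youden's J).
fpr, tpr, thresholds = roc_curve(y_tr, p_tr)
youden_threshold = thresholds[np.argmax(tpr - fpr)]

candidates = [("default 0.5", 0.5),
              ("base rate", base_rate_threshold),
              ("Youden's J", youden_threshold)]
for name, t in candidates:
    y_hat = (p_te >= t).astype(int)   # probabilities -> output classes
    print(name, round(float(t), 3))
    print(confusion_matrix(y_te, y_hat))
```

Lowering the threshold trades false negatives for false positives, so the right choice depends on the relative cost of a wasted call versus a missed sale.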



In [None]:
# A:

**Evaluate the model metrics now**

In [None]:
# A:

## Step 4: Build a model using all of the features.

- Evaluate it using your preferred metrics.
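
One possible way to build a design matrix from every column is sketched below on a small made-up frame (the `toy` data is hypothetical and only borrows a few column names). It assumes that `Unnamed: 0` is a leftover row index with no predictive value and one-hot encodes every non-numeric column.

```python
import pandas as pd

# Made-up rows that reuse a few column names from the bank data.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "age": [30, 39, 25, 38],
    "job": ["blue-collar", "services", "services", "admin."],
    "contact": ["cellular", "telephone", "telephone", "cellular"],
    "y": [0, 0, 1, 0],
})

# Separate the target and drop the assumed leftover index column.
y_all = toy["y"]
X_raw = toy.drop(columns=["y", "Unnamed: 0"])

# One-hot encode every non-numeric column; drop_first removes one level
# per feature so the dummy columns are not perfectly collinear.
text_cols = [c for c in X_raw.columns
             if not pd.api.types.is_numeric_dtype(X_raw[c])]
X_all = pd.get_dummies(X_raw, columns=text_cols, drop_first=True, dtype=int)

print(X_all)
```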

In [None]:
# A:

## Bonus: Use Regularization to optimize your model.
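
As a hedged starting point, the loop below cross-validates a logistic regression for several values of `C` on synthetic data (again a stand-in, not the bank data). In scikit-learn, `C` is the inverse of the regularization strength, so smaller values mean stronger regularization. The grid of values and the AUC metric are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 10 features, only 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=3, random_state=0
)

results = {}
for C in [0.001, 0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(C=C, max_iter=1000)
    auc = cross_val_score(model, X_demo, y_demo, cv=5, scoring="roc_auc")
    results[C] = auc.mean()
    print(f"C={C:<7} mean AUC={results[C]:.3f}")

# The C with the highest mean cross-validated AUC.
best_C = max(results, key=results.get)
print("best C:", best_C)
```

scikit-learn also provides `LogisticRegressionCV`, which performs a similar search internally. Because the penalty acts on coefficient size, standardizing the features first is usually advisable when the columns are on very different scales.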

In [None]:
# try using a for loop to test various regularization strengths 'C'