
<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">


# Logistic Regression Lab
## Exercise with bank marketing data

_Authors: Sam Stack (DC)_

## Introduction
- Data from the UCI Machine Learning Repository: data, [data dictionary](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing)
- **Goal**: Predict whether a customer will purchase a bank product marketed over the phone
- The data (`bank-additional.csv` on the UCI site) is already in our repo as `../data/bank.csv`, so there is no need to download it from the UCI website

## Step 1: Read the data into Pandas

In [2]:
import pandas as pd
bank = pd.read_csv('../data/bank.csv')
bank.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,30,blue-collar,married,basic.9y,no,yes,no,cellular,may,fri,...,2,999,0,nonexistent,-1.8,92.893,-46.2,1.313,5099.1,0
1,39,services,single,high.school,no,no,no,telephone,may,fri,...,4,999,0,nonexistent,1.1,93.994,-36.4,4.855,5191.0,0
2,25,services,married,high.school,no,yes,no,telephone,jun,wed,...,1,999,0,nonexistent,1.4,94.465,-41.8,4.962,5228.1,0
3,38,services,married,basic.9y,no,unknown,unknown,telephone,jun,fri,...,3,999,0,nonexistent,1.4,94.465,-41.8,4.959,5228.1,0
4,47,admin.,married,university.degree,no,yes,no,cellular,nov,mon,...,1,999,0,nonexistent,-0.1,93.2,-42.0,4.191,5195.8,0


**Target `y` is encoded as follows:**
- No: 0
- Yes: 1
    


In [3]:
# Perform whatever steps you need to familiarize yourself with the data:
bank.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4119 entries, 0 to 4118
Data columns (total 21 columns):
age               4119 non-null int64
job               4119 non-null object
marital           4119 non-null object
education         4119 non-null object
default           4119 non-null object
housing           4119 non-null object
loan              4119 non-null object
contact           4119 non-null object
month             4119 non-null object
day_of_week       4119 non-null object
duration          4119 non-null int64
campaign          4119 non-null int64
pdays             4119 non-null int64
previous          4119 non-null int64
poutcome          4119 non-null object
emp.var.rate      4119 non-null float64
cons.price.idx    4119 non-null float64
cons.conf.idx     4119 non-null float64
euribor3m         4119 non-null float64
nr.employed       4119 non-null float64
y                 4119 non-null int64
dtypes: float64(5), int64(6), object(10)
memory usage: 675.9+ KB



## Step 2: Prepare at least three features
- Include both numeric and categorical features
- Choose features that you think might be related to the response (based on intuition or exploration)
- Think about how to handle missing values (encoded as "unknown")
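As a rough illustration of the last bullet, here are two common ways to deal with the "unknown" placeholder. The `toy` frame and its values are made up for this sketch (only the column names mirror the bank data), so treat it as a starting point rather than the intended answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that mimic a few columns of the bank data.
toy = pd.DataFrame({
    "age": [30, 39, 25, 38, 47],
    "housing": ["yes", "no", "yes", "unknown", "yes"],
    "loan": ["no", "no", "no", "unknown", "no"],
})

# Option 1: treat "unknown" as a real missing value, then count and drop it.
as_missing = toy.replace("unknown", np.nan)
missing_counts = as_missing.isna().sum()
complete_rows = as_missing.dropna()

# Option 2: keep "unknown" as its own category when dummy-encoding.
encoded = pd.get_dummies(toy, columns=["housing", "loan"], drop_first=True)

print(missing_counts)
print(encoded.columns.tolist())
```

Dropping rows is simplest when few values are unknown; keeping "unknown" as a category avoids losing rows and lets the model decide whether the placeholder carries signal.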

In [4]:
# A:
feats = ['day_of_week','nr.employed', 'cons.price.idx']


## Step 3: Model building
- Use cross-validation to evaluate the logistic regression model with your chosen features.  
    You can use any combination of the following metrics:
    - [Classification accuracy](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)
    - [Confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)
    - [ROC curves and area under the curve (AUC)](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score)
    - [Log loss](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html)
- Try to improve the metrics by selecting different sets of features
    - *Bonus*: Experiment with hyperparameters such as regularization.
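A minimal sketch of cross-validated evaluation with several of the metrics listed above. It runs on synthetic, imbalanced data from `make_classification` (an assumption made so the example is self-contained); in the lab you would pass your own feature matrix and target instead of `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with an imbalanced target (not the bank data).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=3,
    weights=[0.89, 0.11], random_state=99,
)

model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation; each call returns one score per fold.
accuracy = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
auc = cross_val_score(model, X_demo, y_demo, cv=5, scoring="roc_auc")
# scikit-learn reports log loss negated so that "higher is better"; flip the sign back.
log_loss = -cross_val_score(model, X_demo, y_demo, cv=5, scoring="neg_log_loss")

print("accuracy:", accuracy.mean())
print("AUC:     ", auc.mean())
print("log loss:", log_loss.mean())
```

On an imbalanced target, accuracy alone can look good even when the model never predicts the positive class, which is why it helps to look at AUC or log loss alongside it.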

In [25]:
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split

from sklearn import metrics

**Build a Model**  

In [26]:
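# convert the selected categorical feature to dummies
# (prefix 'weekday_' plus the default '_' separator yields names like 'weekday__mon')
bank_dummies = pd.get_dummies(bank.day_of_week, prefix='weekday_', drop_first=True)

# join the dummy columns onto the original data (note the new name: banks_dummies)
banks_dummies = pd.concat([bank, bank_dummies], axis=1)

bank_dummies.head()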

Unnamed: 0,weekday__mon,weekday__thu,weekday__tue,weekday__wed
0,0,0,0,0
1,0,0,0,0
2,0,0,0,1
3,0,0,0,0
4,1,0,0,0


In [27]:
# set the model (C is the inverse of regularization strength, so a very large C makes the penalty negligible)
lr = LogisticRegression(C=1e9)
# set X and y
feats_new = ['weekday__mon','weekday__thu','weekday__tue','weekday__wed', 'nr.employed', 'cons.price.idx']
Xe = banks_dummies[feats_new]
ye = banks_dummies.y

In [28]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(Xe, ye, random_state=99)

In [30]:
# fit model
lr.fit(X_train, y_train)

LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [31]:
# mean accuracy on the training data
lr.score(X_train, y_train)

0.8879896406604079

**Get the Coefficient for each feature.**
- Be sure to make note of interesting findings.



In [34]:
list(zip(feats_new, lr.coef_[0]))

[('weekday__mon', 0.0041632694901498485),
 ('weekday__thu', 0.011603435758902379),
 ('weekday__tue', -0.007717101822762361),
 ('weekday__wed', -0.0006281479968076708),
 ('nr.employed', -0.011512152504317785),
 ('cons.price.idx', 0.6102412075862448)]
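One way to read these numbers is to exponentiate them into odds ratios. The sketch below copies the coefficients from the output above and uses only the standard library; the `coefs` and `odds_ratios` names are illustrative.

```python
import math

# Coefficients copied from the fitted model's output above.
coefs = {
    "weekday__mon": 0.0041632694901498485,
    "weekday__thu": 0.011603435758902379,
    "weekday__tue": -0.007717101822762361,
    "weekday__wed": -0.0006281479968076708,
    "nr.employed": -0.011512152504317785,
    "cons.price.idx": 0.6102412075862448,
}

# exp(coefficient) is the multiplicative change in the odds of y = 1
# for a one-unit increase in that feature, holding the others fixed.
odds_ratios = {name: math.exp(value) for name, value in coefs.items()}

# Print from the largest odds ratio to the smallest.
for name, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>15}: {ratio:.3f}")
```

Keep in mind that the features are on different scales (`nr.employed` is in the thousands, the weekday dummies are 0 or 1), so the raw coefficients are not directly comparable in size.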

**Use the model to predict on `X_test` and evaluate it using the metric(s) of your choice.**

In [None]:
# A:
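One possible shape for this step, sketched on synthetic data rather than the bank data. The names `X_tr`, `X_te`, `y_tr`, `y_te`, and `clf` are placeholders for this example; in the lab you would reuse the `lr`, `X_test`, and `y_test` objects created earlier.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with an imbalanced target (not the bank data).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=3,
    weights=[0.89, 0.11], random_state=99,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=99, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Hard class predictions and class-1 probabilities on the held-out data.
y_pred = clf.predict(X_te)
y_prob = clf.predict_proba(X_te)[:, 1]

test_accuracy = accuracy_score(y_te, y_pred)
cm = confusion_matrix(y_te, y_pred)      # rows: actual class, columns: predicted class
test_auc = roc_auc_score(y_te, y_prob)   # AUC needs probabilities, not hard labels

print("accuracy:", test_accuracy)
print("AUC:", test_auc)
print(cm)
```

The bottom-right cell of the confusion matrix counts true positives; if it is zero, the model is never predicting class 1 at the default threshold.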

### Model 2: Use a different combination of features.
- Evaluate the model and interpret your chosen metrics.

In [None]:
# A:



### Is your model not performing very well?

Is it not predicting any True Positives?

Let's try one more thing before we resort to grabbing more features: adjusting the probability threshold.

Use the `LogisticRegression.predict_proba()` method to get the probabilities.

Recall from the lesson that the first probability is for `class 0` and the second is for `class 1`.

In [None]:
# A:
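A small sketch of what `predict_proba()` returns, using randomly generated data (an assumption, so the example runs on its own). The point is the layout of the output array, not the model itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Randomly generated stand-in data with a minority positive class.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 1).astype(int)

clf = LogisticRegression().fit(X_demo, y_demo)
proba = clf.predict_proba(X_demo)

# One row per sample, one column per class, in the order given by clf.classes_.
print(clf.classes_)
print(proba[:5])

# Column 1 holds the probability of class 1; each row sums to 1.
prob_class_1 = proba[:, 1]

# For a binary model, predict() corresponds to a 0.5 cutoff on this column.
default_pred = (prob_class_1 > 0.5).astype(int)
```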

**Visualize the distribution**

In [None]:
# A:

**Calculate a new threshold and use it to convert predicted probabilities to output classes**



In [None]:
# A:
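To make the idea concrete, here is a sketch with hand-picked probabilities and labels (both hypothetical) and two illustrative helpers, `classify` and `recall`. The 0.25 cutoff is arbitrary; in practice you would pick a threshold from the distribution of predicted probabilities.

```python
import numpy as np

# Hypothetical class-1 probabilities and true labels for eight customers.
prob_class_1 = np.array([0.05, 0.12, 0.30, 0.45, 0.08, 0.62, 0.20, 0.35])
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])

def classify(probabilities, threshold):
    """Label a sample 1 when its class-1 probability reaches the threshold."""
    return (probabilities >= threshold).astype(int)

def recall(actual, predicted):
    """Share of actual positives that were predicted as positive."""
    true_positives = np.sum((actual == 1) & (predicted == 1))
    return true_positives / np.sum(actual == 1)

pred_default = classify(prob_class_1, 0.5)    # the usual cutoff
pred_lowered = classify(prob_class_1, 0.25)   # a lower, illustrative cutoff

print("threshold 0.50:", pred_default, "recall:", recall(y_true, pred_default))
print("threshold 0.25:", pred_lowered, "recall:", recall(y_true, pred_lowered))
```

Lowering the threshold finds more of the true positives in this toy example, at the cost of one extra false positive; that trade-off is what the new threshold controls.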

**Evaluate the model metrics now**

In [None]:
# A:

## Step 4: Build a model using all of the features.

- Evaluate it using your preferred metrics.

In [None]:
# A:
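A hedged sketch of the encoding step when every column is used. The `toy` frame is randomly generated (only the column names echo the bank data), so the scores mean nothing; the pattern to take away is dummy-encoding all categorical columns in one call before fitting.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Randomly generated stand-in frame with numeric and categorical columns.
rng = np.random.default_rng(99)
n = 300
toy = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "contact": rng.choice(["cellular", "telephone"], size=n),
    "month": rng.choice(["may", "jun", "nov"], size=n),
    "y": rng.integers(0, 2, size=n),
})

# Separate the target, then dummy-encode every non-numeric column at once.
X_all = pd.get_dummies(toy.drop(columns="y"), drop_first=True, dtype=int)
y_all = toy["y"]

# 5-fold cross-validated accuracy on the fully encoded feature matrix.
scores = cross_val_score(LogisticRegression(max_iter=1000), X_all, y_all, cv=5)

print(X_all.columns.tolist())
print("mean accuracy:", scores.mean())
```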

## Bonus: Use Regularization to optimize your model.

In [None]:
# try using a for loop to test various values of 'C' (the inverse of regularization strength)
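One way such a loop might look, again on synthetic data so that it runs on its own. The grid of `C` values is arbitrary, and the `StandardScaler` step is an addition not taken from the lab: the penalty acts on coefficient size, so features on very different scales are penalized unevenly unless they are standardized first.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the bank data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=99
)

results = {}
for C in [0.001, 0.01, 0.1, 1, 10, 100]:
    # Smaller C means stronger regularization.
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="roc_auc")
    results[C] = scores.mean()

# Pick the C with the highest mean cross-validated AUC.
best_C = max(results, key=results.get)

for C, score in results.items():
    print(f"C={C:<7} mean AUC={score:.4f}")
print("best C:", best_C)
```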