# Data pipeline

Workflow adapted from Katie Malone's [Workflows in Python](https://civisanalytics.com/blog/data-science/2015/12/17/workflows-in-python-getting-data-ready-to-build-models/) series

1. [Load data](#Load-Data)
2. [Transform features and labels to conform to machine learning](#Transform-data-for-machine-learning)
  1. _Note: need to incorporate Abbie's outlier data cleaning_
3. [Make a train/test split (for cross validation score)](#Split-the-data-into-training/test-sets-by-hand)
4. [Pick a classifier & evaluate it](#Playing-around-with-different-classifiers)
5. [Evaluate several models](#Shortcut:-sklearn.model_selection.cross_val_score)


----
See also
- [Sources](#Sources)

----
# Load Data

In [1]:
import pandas as pd
import sklearn

In [2]:
data_dir      = '../data/raw/'
data_filename = 'blood_train.csv'
df_blood      = pd.read_csv(data_dir+data_filename)

df_blood.head(10)

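| | Unnamed: 0 | Months since Last Donation | Number of Donations | Total Volume Donated (c.c.) | Months since First Donation | Made Donation in March 2007 |
|---|---|---|---|---|---|---|
| 0 | 619 | 2 | 50 | 12500 | 98 | 1 |
| 1 | 664 | 0 | 13 | 3250 | 28 | 1 |
| 2 | 441 | 1 | 16 | 4000 | 35 | 1 |
| 3 | 160 | 2 | 20 | 5000 | 45 | 1 |
| 4 | 358 | 1 | 24 | 6000 | 77 | 0 |
| 5 | 335 | 4 | 4 | 1000 | 4 | 0 |
| 6 | 47 | 2 | 7 | 1750 | 14 | 1 |
| 7 | 164 | 1 | 12 | 3000 | 35 | 0 |
| 8 | 736 | 5 | 46 | 11500 | 98 | 1 |
| 9 | 436 | 0 | 3 | 750 | 4 | 0 |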


----
# Transform data for machine learning

In [3]:
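# NOTE: X keeps every column, including the "Made Donation in March 2007" label,
# so the target leaks into the features
X = df_blood.values  # DataFrame.as_matrix() was removed in pandas 1.0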
y = list(df_blood["Made Donation in March 2007"])
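
The cell above passes the whole frame into `X`, label column and `Unnamed: 0` ID column included. A minimal sketch of separating features from labels, using the first three rows of the table above as a stand-in frame. Treating `Unnamed: 0` as a row ID with no predictive signal is an assumption, and the names `X_features`, `y_labels`, `label_col` and `id_col` are illustrative only:

```python
import pandas as pd

# Stand-in frame: first three rows shown by df_blood.head(10)
df = pd.DataFrame({
    "Unnamed: 0": [619, 664, 441],
    "Months since Last Donation": [2, 0, 1],
    "Number of Donations": [50, 13, 16],
    "Total Volume Donated (c.c.)": [12500, 3250, 4000],
    "Months since First Donation": [98, 28, 35],
    "Made Donation in March 2007": [1, 1, 1],
})

label_col = "Made Donation in March 2007"
id_col = "Unnamed: 0"  # assumed to be a row ID, not a feature

# Features: every column except the row ID and the label
X_features = df.drop(columns=[id_col, label_col]).to_numpy()
# Labels: the target column only
y_labels = df[label_col].to_numpy()

print(X_features.shape)
print(y_labels.shape)
```

With the label kept out of the features, a classifier can no longer read the answer directly from its inputs, so scores would be expected to look worse than the ones printed in this notebook.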

----
# Split the data into training/test sets by hand

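__Note: only the validation partition (`X_valid`, `y_valid`) from this split is used later, for calibration. The training and testing partitions are overwritten by the automated split further down.__

Splits the first third of the data into the training set, the second third into the validation set, and the final third into the testing set.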

In [13]:
# Split data into 4 partitions
#  - training set
#  - validation set
#  - combined training & validation set
#  - testing set

nrows_total = len(df_blood)  # number of rows in the DataFrame
nrows_train = int(nrows_total/3)
nrows_valid = int(nrows_total*2/3)

X_train, y_train             = X[:nrows_train]           , y[:nrows_train]
X_valid, y_valid             = X[nrows_train:nrows_valid], y[nrows_train:nrows_valid]
X_test , y_test              = X[nrows_valid:]           , y[nrows_valid:]
X_train_valid, y_train_valid = X[:nrows_valid]           , y[:nrows_valid]

print("Total number of rows:\t", nrows_total)
print("Training rows:\t\t"     , 0          ,"-", nrows_train)
print("Validation rows:\t"     , nrows_train,"-", nrows_valid)
print("Testing rows:\t\t"      ,nrows_valid ,"-" , nrows_total)



Total number of rows:	 576
Training rows:		 0 - 192
Validation rows:	 192 - 384
Testing rows:		 384 - 576


### Automated way to split data: `sklearn.model_selection.train_test_split`

Split the total dataset into training and testing sets via random selection
 - `test_size` - proportion of the dataset to put into the __test__ set
 - `random_state` - seed for the pseudo-random number generator

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test= sklearn.model_selection.train_test_split(
    X, y, 
    test_size=0.5, 
    random_state=0) 

print("No. Rows in training set:\t", len(X_train))
print("No. Rows in testing set:\t" , len(X_test))

No. Rows in training set:	 288
No. Rows in testing set:	 288


----
# Playing around with different classifiers

With the data loaded, transformed and split, we can now pass it into different classifiers and see how they perform.

Basic workflow for each classifier:
 1. import classifier
 2. initialize classifier into `clf` variable
 3. fit data (`X_train`, `y_train`) into classifier
 4. predict output (i.e. probabilities) using `X_test` data
 5. evaluate prediction quality (via `sklearn.metrics.log_loss` function & `y_test` data)
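
The five steps above can be sketched end to end as follows. This is a minimal illustration on synthetic data from `make_classification`, not the blood donation set. The names `X_demo`, `y_demo`, `clf_demo`, `probs` and `loss` are placeholders introduced here:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier  # 1. import classifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical; 200 rows, 4 features, 2 classes)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0)

clf_demo = RandomForestClassifier(n_estimators=25, random_state=0)  # 2. initialize
clf_demo.fit(X_tr, y_tr)                                            # 3. fit
probs = clf_demo.predict_proba(X_te)                                # 4. predict probabilities
loss = log_loss(y_te, probs)                                        # 5. evaluate (lower is better)

print("Log-loss score:\t", loss)
```

`predict_proba` returns one column per class, so `probs` has shape `(n_test_rows, 2)` for a binary target.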

### Linear Classifier: 
- [`sklearn.linear_model.LinearRegression`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
  - Example: [Simple linear regression](http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py)

In [6]:
from sklearn.linear_model import LinearRegression

### Random Forest Classifier
`sklearn.ensemble.RandomForestClassifier`


In [7]:
import sklearn.metrics
from sklearn.ensemble import RandomForestClassifier

# Train uncalibrated random forest classifier
# on the training data
# and evaluate on test data
clf       = sklearn.ensemble.RandomForestClassifier(n_estimators=25)
clf.fit(X_train, y_train)

# Get probabilities
clf_probs = clf.predict_proba(X_test)

# Test/evaluate the model
score     = sklearn.metrics.log_loss(y_test, clf_probs)
print("Log-loss score:\t", score)

Log-loss score:	 0.0226022615147


### Calibrated Random Forest Classifier
- `sklearn.calibration.CalibratedClassifierCV`
- `sklearn.ensemble.RandomForestClassifier`

In [8]:
from sklearn.ensemble import RandomForestClassifier

# Train random forest classifier
#  - calibrate on validation data
#  - evaluate test data
clf       = sklearn.ensemble.RandomForestClassifier(n_estimators=25)
clf.fit(X_train, y_train)
clf_probs = clf.predict_proba(X_test)


from sklearn.calibration import CalibratedClassifierCV

# Pass the already-fitted RandomForestClassifier into CalibratedClassifierCV
sig_clf   = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit")
sig_clf.fit(X_valid, y_valid)

# Get prediction probabilities from model
sig_clf_probs = sig_clf.predict_proba(X_test)

# Test quality of predictions using `log_loss` function
sig_score     = sklearn.metrics.log_loss(y_test, sig_clf_probs)
print("Log-loss score:\t", sig_score)

Log-loss score:	 0.0208181404669


----
# Shortcut: `sklearn.model_selection.cross_val_score`

From Katie Malone's [Workflows in Python](https://civisanalytics.com/blog/data-science/2015/12/17/workflows-in-python-getting-data-ready-to-build-models/):
> The cheapest and easiest way to train on one portion of my dataset and test on another, and to get a measure of model quality at the same time, is to use [sklearn.cross_validation.cross_val_score()](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html). 
>
> `cross_val_score()` 
> - splits data into 3 equal portions
> - trains on 2 portions
> - tests on the third 
>
> This process repeats 3 times. That’s why 3 numbers get printed in the code block below.
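
A rough sketch of what the quote describes, written out by hand on synthetic data. Recent scikit-learn releases (0.22 and later) default to five folds rather than three, so `cv=3` is passed explicitly here. For a classifier, an integer `cv` uses stratified folds, which the manual loop mirrors with `StratifiedKFold`. The names `X_demo`, `y_demo`, `model`, `auto_scores` and `manual_scores` are illustrative:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (hypothetical; not the blood donation set)
X_demo, y_demo = make_classification(n_samples=150, n_features=4, random_state=0)
model = LogisticRegression()

# One call: cv=3 requests three folds explicitly
auto_scores = cross_val_score(model, X_demo, y_demo, cv=3)

# The same procedure by hand: train on two folds, test on the third, three times
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=3).split(X_demo, y_demo):
    fold_model = clone(model).fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

print(auto_scores)
print(np.array(manual_scores))
```

Both print three numbers, one per held-out fold. With no `scoring` argument the score is the estimator's default, which is accuracy for `LogisticRegression`.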

### Note: `cross_val_score` reports log loss as a negative number, under the scorer name `neg_log_loss`

See:
- [Sklearn | Quantifying the Quality of Predictions](http://scikit-learn.org/stable/modules/model_evaluation.html)
- [StackOverflow | Why is log_loss negative?](http://stackoverflow.com/questions/26282884/why-is-the-logloss-negative)
  - Basically, _higher score means better performance (less loss)_
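
To make the sign convention concrete, the sketch below computes the same quantity both ways on synthetic data: once with `sklearn.metrics.log_loss` and once with the `neg_log_loss` scorer obtained from `sklearn.metrics.get_scorer`. The data and the names `model`, `loss` and `score` are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import get_scorer, log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical; not the blood donation set)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)

# Metric: a loss, so lower is better and the value is never negative
loss = log_loss(y_te, model.predict_proba(X_te))

# Scorer: the same quantity with the sign flipped, so higher is better
score = get_scorer("neg_log_loss")(model, X_te, y_te)

print(loss, score)
```

The two values differ only in sign, which is why a score closer to zero from `cross_val_score(..., scoring="neg_log_loss")` indicates a better model.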


### Generate data
- X: Training vector
- y: Target vector

In [9]:
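# NOTE: X keeps every column, including the label column, so the target leaks into the features
X = df_blood.values  # DataFrame.as_matrix() was removed in pandas 1.0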
y = list(df_blood["Made Donation in March 2007"])

### LogisticRegression

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

clf   = sklearn.linear_model.LogisticRegression()
score = sklearn.model_selection.cross_val_score( 
    clf, 
    X, y,
    scoring="neg_log_loss")

print(score)

[-0.05211964 -0.03698339 -0.04242597]


### DecisionTreeClassifier

In [11]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

clf   = sklearn.tree.DecisionTreeClassifier()
score = sklearn.model_selection.cross_val_score( 
    clf, 
    X, y,
    scoring="neg_log_loss")
print(score)

[ -9.99200722e-16  -9.99200722e-16  -9.99200722e-16]


### RandomForestClassifier

In [12]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

clf   = sklearn.ensemble.RandomForestClassifier()
score = sklearn.model_selection.cross_val_score( 
    clf, 
    X, y,
    scoring="neg_log_loss")
print(score)

[-0.03331336 -0.04324528 -0.01836764]


----
# Sources

Examples:
- [Example: Probability Calibration](http://scikit-learn.org/stable/auto_examples/calibration/plot_calibration_multiclass.html#sphx-glr-auto-examples-calibration-plot-calibration-multiclass-py)
  - _helpful for seeing whole workflow in action from loading to plotting_
  
Documentation
- [sklearn.linear_model.LinearRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
- [sklearn.model_selection.train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
- [sklearn.metrics.log_loss](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html#sklearn.metrics.log_loss)

Discussion
- [CrossValidated | sklearn `predict_proba` output interpretation](http://stats.stackexchange.com/questions/179977/scikit-predict-proba-output-interpretation)
