# Pipelines
Data pipelines are a series of automated data transformations that make routine data maintenance tasks repeatable and help keep their results valid. Each stage of a pipeline feeds from the previous one: the output of a stage becomes the input of the next, and data flows through the pipeline from beginning to end, just as water flows through a pipe. Many organizations rely on data engineering teams to encode common tasks into pipelines.
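
The chaining idea can be sketched without any library. This is a minimal sketch, assuming three made-up text stages (`strip_whitespace`, `lowercase`, `tokenize`) and a hypothetical helper `run_pipeline`; none of these names come from scikit-learn.

```python
from functools import reduce

# Hypothetical stages: each one takes the previous stage's output as its input.
def strip_whitespace(text):
    return text.strip()

def lowercase(text):
    return text.lower()

def tokenize(text):
    return text.split()

def run_pipeline(stages, value):
    """Feed `value` through every stage in order, first to last."""
    return reduce(lambda current, stage: stage(current), stages, value)

stages = [strip_whitespace, lowercase, tokenize]
tokens = run_pipeline(stages, "  IBM Sees Holographic Calls  ")
print(tokens)
```

Reordering the list changes the result, which is why the order of stages is part of a pipeline's definition.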

Examples of data transformations:
- change in scale, units, or base
- text vectorization
- image vectorization
- sound file vectorization
- missing data imputation
- clipping
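
Three of the transformations listed above can be illustrated with NumPy alone. This is a sketch on made-up height measurements; the median imputation and the 100 to 250 cm clipping range are assumptions chosen for the example, not recommendations.

```python
import numpy as np

# Made-up measurements in inches, with one missing value and one outlier.
heights_in = np.array([60.0, 65.0, np.nan, 70.0, 500.0])

# Change of units: inches to centimetres.
heights_cm = heights_in * 2.54

# Missing data imputation: replace NaN with the median of the observed values
# (the median is less sensitive to the outlier than the mean).
observed_median = np.nanmedian(heights_cm)
imputed = np.where(np.isnan(heights_cm), observed_median, heights_cm)

# Clipping: force every value into an assumed plausible range.
clipped = np.clip(imputed, 100.0, 250.0)

print(clipped)
```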

In [3]:
from sklearn.pipeline import Pipeline
import pandas as pd
import json

data = pd.read_csv("data/stumbleupon.tsv", sep='\t')
data['title'] = data.boilerplate.map(lambda x: json.loads(x).get('title', ''))
data['body'] = data.boilerplate.map(lambda x: json.loads(x).get('body', ''))

# replace missing titles with empty strings and check the data
titles = data['title'].fillna('')
data.head()

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,...,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label,title,body
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,4042,"{""title"":""IBM Sees Holographic Calls Air Breat...",business,0.789131,2.055556,0.676471,0.205882,0.047059,0.023529,...,24,0,5424,170,8,0.152941,0.07913,0,IBM Sees Holographic Calls Air Breathing Batte...,A sign stands outside the International Busine...
1,http://www.popsci.com/technology/article/2012-...,8471,"{""title"":""The Fully Electronic Futuristic Star...",recreation,0.574147,3.677966,0.508021,0.28877,0.213904,0.144385,...,40,0,4973,187,9,0.181818,0.125448,1,The Fully Electronic Futuristic Starting Gun T...,And that can be carried on a plane without the...
2,http://www.menshealth.com/health/flu-fighting-...,1164,"{""title"":""Fruits that Fight the Flu fruits tha...",health,0.996526,2.382883,0.562016,0.321705,0.120155,0.042636,...,55,0,2240,258,11,0.166667,0.057613,1,Fruits that Fight the Flu fruits that fight th...,Apples The most popular source of antioxidants...
3,http://www.dumblittleman.com/2007/12/10-foolpr...,6684,"{""title"":""10 Foolproof Tips for Better Sleep ""...",health,0.801248,1.543103,0.4,0.1,0.016667,0.0,...,24,0,2737,120,5,0.041667,0.100858,1,10 Foolproof Tips for Better Sleep,There was a period in my life when I had a lot...
4,http://bleacherreport.com/articles/1205138-the...,9006,"{""title"":""The 50 Coolest Jerseys You Didn t Kn...",sports,0.719157,2.676471,0.5,0.222222,0.123457,0.04321,...,14,0,12032,162,10,0.098765,0.082569,0,The 50 Coolest Jerseys You Didn t Know Existed...,Jersey sales is a curious business Whether you...


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7395 entries, 0 to 7394
Data columns (total 29 columns):
url                               7395 non-null object
urlid                             7395 non-null int64
boilerplate                       7395 non-null object
alchemy_category                  7395 non-null object
alchemy_category_score            7395 non-null object
avglinksize                       7395 non-null float64
commonlinkratio_1                 7395 non-null float64
commonlinkratio_2                 7395 non-null float64
commonlinkratio_3                 7395 non-null float64
commonlinkratio_4                 7395 non-null float64
compression_ratio                 7395 non-null float64
embed_ratio                       7395 non-null float64
framebased                        7395 non-null int64
frameTagRatio                     7395 non-null float64
hasDomainLink                     7395 non-null int64
html_ratio                        7395 non-null float64
image_r

In [5]:
# set label as target
y = data.label
y[0:3]

0    0
1    1
2    1
Name: label, dtype: int64

In [7]:
# check target proportion
y.value_counts()/len(y)

1    0.51332
0    0.48668
Name: label, dtype: float64

In [11]:
# countvectorize our first title
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features = 1000, ngram_range=(1,2),
                            stop_words= 'english', binary =True)
vectorizer.fit(['IBM Sees Holographic calls Air Breathing'])


CountVectorizer(analyzer=u'word', binary=True, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=1000, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [12]:
vectorizer.vocabulary_

{u'air': 0,
 u'air breathing': 1,
 u'breathing': 2,
 u'calls': 3,
 u'calls air': 4,
 u'holographic': 5,
 u'holographic calls': 6,
 u'ibm': 7,
 u'ibm sees': 8,
 u'sees': 9,
 u'sees holographic': 10}

In [13]:
# countvectorize the same title with repeated stop words and n-grams up to length 3
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features=1000, ngram_range=(1, 3),
                             stop_words='english', binary=True)
vectorizer.fit(['IBM Sees the the the Holographic calls Air Breathing'])

# "the" is an English stop word, so "the the the" is removed before n-grams are built

CountVectorizer(analyzer=u'word', binary=True, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=1000, min_df=1,
        ngram_range=(1, 3), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [14]:
vectorizer.vocabulary_

{u'air': 0,
 u'air breathing': 1,
 u'breathing': 2,
 u'calls': 3,
 u'calls air': 4,
 u'calls air breathing': 5,
 u'holographic': 6,
 u'holographic calls': 7,
 u'holographic calls air': 8,
 u'ibm': 9,
 u'ibm sees': 10,
 u'ibm sees holographic': 11,
 u'sees': 12,
 u'sees holographic': 13,
 u'sees holographic calls': 14}

Example of how `CountVectorizer` works:
![Example](assets/CountVectorizer.jpg)

In [15]:
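# get n-grams in column order
vectorizer.get_feature_names()  # use get_feature_names_out() in scikit-learn 1.0 and later
# vectorizer.vocabulary_ holds the same terms, mapped to their column indices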

[u'air',
 u'air breathing',
 u'breathing',
 u'calls',
 u'calls air',
 u'calls air breathing',
 u'holographic',
 u'holographic calls',
 u'holographic calls air',
 u'ibm',
 u'ibm sees',
 u'ibm sees holographic',
 u'sees',
 u'sees holographic',
 u'sees holographic calls']

In [16]:
# transform a new title using the vocabulary learned above
vectorizer.transform(['IBM Sees Holographic Air']).todense()
# each 1 marks an n-gram from vectorizer.get_feature_names() that occurs in the title

matrix([[1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0]])
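
The vocabulary and the binary row above can be reproduced by hand. The following is a simplified pure-Python sketch of the idea, not scikit-learn's implementation: it assumes a one-word stand-in stop list (the real `'english'` list is much longer), ignores `max_features`, and uses hypothetical helper names (`ngrams`, `build_vocabulary`, `binary_transform`).

```python
import re

# Assumption: a tiny stand-in stop list, enough for this one title.
STOP_WORDS = {"the"}
# Same token pattern as in the CountVectorizer repr: words of 2+ characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def ngrams(text, ngram_range=(1, 3)):
    """Lowercase, tokenize, drop stop words, then build word n-grams."""
    tokens = [t for t in TOKEN_PATTERN.findall(text.lower()) if t not in STOP_WORDS]
    shortest, longest = ngram_range
    grams = []
    for n in range(shortest, longest + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def build_vocabulary(docs, ngram_range=(1, 3)):
    """Map each distinct n-gram to a column index, in sorted order."""
    terms = sorted({g for doc in docs for g in ngrams(doc, ngram_range)})
    return {term: index for index, term in enumerate(terms)}

def binary_transform(text, vocabulary, ngram_range=(1, 3)):
    """Return one row with a 1 for every known n-gram present in `text`."""
    row = [0] * len(vocabulary)
    for gram in ngrams(text, ngram_range):
        if gram in vocabulary:
            row[vocabulary[gram]] = 1
    return row

vocab = build_vocabulary(["IBM Sees the the the Holographic calls Air Breathing"])
row = binary_transform("IBM Sees Holographic Air", vocab)
print(row)
```

N-grams that were not seen during fitting, such as "holographic air", have no column and are silently dropped at transform time.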

In [17]:
# Use `fit` to learn the vocabulary of the titles
vectorizer.fit(titles)
# Use `transform` to generate the sample X word matrix - one column per feature (word or n-grams)
X = vectorizer.transform(titles)
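
When the same data is used for both steps, `fit` followed by `transform` can be collapsed into `fit_transform`. A small sketch on made-up titles (the `toy_titles` list is an assumption for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up titles standing in for the real `titles` column.
toy_titles = [
    "IBM sees holographic calls",
    "fruits that fight the flu",
    "ten tips for better sleep",
]

# Two steps: learn the vocabulary, then build the matrix.
vec_two_step = CountVectorizer(ngram_range=(1, 2), binary=True)
vec_two_step.fit(toy_titles)
X_two_step = vec_two_step.transform(toy_titles)

# One step: fit_transform does both on the same input.
vec_one_step = CountVectorizer(ngram_range=(1, 2), binary=True)
X_one_step = vec_one_step.fit_transform(toy_titles)

# Both routes give a sparse matrix with one row per title.
same = (X_two_step.toarray() == X_one_step.toarray()).all()
print(X_one_step.shape, same)
```

The separate `transform` call still matters for new data, which must be mapped onto the vocabulary learned earlier rather than refit.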

In [21]:
# build Logit and CV score
from sklearn.linear_model import LogisticRegression
# `sklearn.cross_validation` was removed in scikit-learn 0.20; use `model_selection`
from sklearn.model_selection import cross_val_score

model = LogisticRegression()
# cv=3 matches the three folds shown in the output below
scores = cross_val_score(model, X, y, cv=3)
print('CV scores: {}'.format(scores))
print('Average CVscore: {:0.3f} +/-{:0.3f}'.format(scores.mean(), scores.std()))

CV scores: [ 0.74736415  0.75456389  0.75446429]
Average CVscore: 0.752 +/-0.003
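
Note that `X` was built by fitting the vectorizer on every title, so each validation fold contributed to the vocabulary. One common way to avoid this is to cross-validate a pipeline, which refits the vectorizer inside each training fold. The sketch below uses made-up titles and labels (`toy_titles`, `toy_labels` are assumptions), so its scores say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up titles with made-up binary labels, six per class.
toy_titles = [
    "best chocolate cake recipe", "quarterly earnings report released",
    "easy pasta dinner recipe", "stock market closes lower",
    "healthy breakfast smoothie ideas", "company announces new merger",
    "simple homemade bread recipe", "central bank raises rates",
    "quick weeknight dinner ideas", "tech firm reports earnings",
    "classic apple pie recipe", "oil prices fall again",
]
toy_labels = [1, 0] * 6

# The vectorizer is now a pipeline step, so each fold learns its own vocabulary
# from the training portion only.
pipe = make_pipeline(CountVectorizer(binary=True), LogisticRegression())
scores = cross_val_score(pipe, toy_titles, toy_labels, cv=3)
print(scores)
```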


In [None]:
# Split the data into a training set
training_data = data[:6000]
X_train = training_data['title'].fillna('')
y_train = training_data['label']

# reserve future data, unavailable at training time
X_new = data[6000:]['title'].fillna('')

pipeline = Pipeline([
    ('vec', vectorizer),
    ('model', model)])

# Fit the full pipeline.
# This performs the steps laid out above: first the vectorizer is fit on the
# training titles, then its output is fed into the `fit` method of the model.
pipeline.fit(X_train, y_train)

# Apply the full pipeline to make predictions.
# The text is transformed automatically to match the features learned during fitting.
predictions = pipeline.predict(X_new)
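
A self-contained version of the same pattern, runnable without the dataset. The titles and labels are made up, and the step names `'vec'` and `'model'` simply mirror the cell above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up training data and unseen titles.
train_titles = [
    "best chocolate cake recipe", "quarterly earnings report released",
    "easy pasta dinner recipe", "stock market closes lower",
    "simple homemade bread recipe", "central bank raises rates",
]
train_labels = [1, 0, 1, 0, 1, 0]
new_titles = ["quick apple cake recipe", "bank reports quarterly earnings"]

pipe = Pipeline([
    ('vec', CountVectorizer(binary=True)),
    ('model', LogisticRegression()),
])

# fit: the vectorizer learns its vocabulary, then the model trains on its output.
pipe.fit(train_titles, train_labels)

# predict: new text goes through the already fitted vectorizer, then the model.
predictions = pipe.predict(new_titles)
probabilities = pipe.predict_proba(new_titles)

# Fitted steps stay accessible by name.
vocabulary_size = len(pipe.named_steps['vec'].vocabulary_)
print(predictions, vocabulary_size)
```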


### Merging Feature Sets in Pipelines

We may want to merge several feature sets automatically. For this, scikit-learn provides `FeatureUnion`, which applies several transformers to the same input and concatenates their outputs column-wise.
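
As an illustration, the sketch below merges word-level and character-level n-gram features of the same titles. The titles and the transformer names `'words'` and `'chars'` are assumptions made for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# Made-up titles.
toy_titles = [
    "IBM sees holographic calls",
    "fruits that fight the flu",
    "ten tips for better sleep",
]

union = FeatureUnion([
    ('words', CountVectorizer(analyzer='word', binary=True)),
    ('chars', CountVectorizer(analyzer='char', ngram_range=(2, 3), binary=True)),
])

# Each transformer sees the same input; the resulting columns are stacked side by side.
X_union = union.fit_transform(toy_titles)

fitted = dict(union.transformer_list)
n_word_features = len(fitted['words'].vocabulary_)
n_char_features = len(fitted['chars'].vocabulary_)
print(X_union.shape, n_word_features, n_char_features)
```

The merged matrix has as many columns as the two vocabularies combined, so a downstream model sees both feature sets at once.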

While scikit-learn pipelines help manage the transformation of raw data into model input, many steps may need to happen before that point. The pipelines that handle those earlier steps are often called ETL (Extract, Transform, Load) pipelines. In an ETL pipeline, data is extracted from some source (such as a database), transformed or manipulated, and then loaded into whatever system will analyze it.
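
The three ETL stages can be sketched with the standard library only. Everything here is hypothetical: the CSV text stands in for a real source, and an in-memory SQLite table stands in for the target system.

```python
import csv
import io
import sqlite3

# Made-up source data; in practice this would come from a file or database.
RAW_CSV = "title,label\n IBM Sees Holographic Calls ,0\nFruits that Fight the Flu,1\n"

def extract(source_text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(source_text)))

def transform(rows):
    """Transform: clean the title text and cast the label to an integer."""
    return [(row["title"].strip().lower(), int(row["label"])) for row in rows]

def load(records, connection):
    """Load: write the cleaned records into the target table."""
    connection.execute("CREATE TABLE IF NOT EXISTS titles (title TEXT, label INTEGER)")
    connection.executemany("INSERT INTO titles VALUES (?, ?)", records)
    connection.commit()

connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
loaded = connection.execute("SELECT title, label FROM titles ORDER BY label").fetchall()
print(loaded)
```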

Many data science teams rely on software tools to manage these ETL pipelines. If a transformation step fails, these tools alert you, or ensure that step can be re-run. If these transformation steps need to happen daily or weekly, these tools can manage that timeline.
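
The re-run behaviour can be pictured with a small retry wrapper. This is a generic sketch with hypothetical names (`run_with_retries`, `flaky_transform`); it does not show how Luigi or Airflow are implemented or configured.

```python
def run_with_retries(step, max_attempts=3):
    """Call `step` until it succeeds or `max_attempts` is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception as error:
            # A real tool would send an alert here; this sketch only prints.
            print("attempt {} failed: {}".format(attempt, error))
    raise RuntimeError("step failed after {} attempts".format(max_attempts))

# A made-up step that fails twice before succeeding.
calls = {"count": 0}

def flaky_transform():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ValueError("temporary failure")
    return "done"

result = run_with_retries(flaky_transform)
print(result, calls["count"])
```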

One of the most popular Python tools for this is [Luigi](https://github.com/spotify/luigi), developed by Spotify.
An alternative is [Airflow](https://github.com/airbnb/airflow), originally developed at Airbnb.

In [None]:
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# test `make_pipeline` vs `Pipeline`; are they different?
pipe1 = make_pipeline(StandardScaler(), LogisticRegression())

pipe2 = Pipeline(steps=[('standardscaler', StandardScaler()),
                        ('logistic_reg', LogisticRegression())])

In [None]:
pipe1

In [None]:
pipe2
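
One difference worth checking is how the steps are named: `make_pipeline` derives each name from the lowercased class name, while `Pipeline` uses the names you supply. A short sketch (the variable names `pipe_auto` and `pipe_named` are chosen for this example):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

pipe_auto = make_pipeline(StandardScaler(), LogisticRegression())
pipe_named = Pipeline(steps=[('standardscaler', StandardScaler()),
                             ('logistic_reg', LogisticRegression())])

# Step names are the first element of each (name, estimator) pair.
auto_names = [name for name, _ in pipe_auto.steps]
given_names = [name for name, _ in pipe_named.steps]
print(auto_names)
print(given_names)
```

The names matter when you address a step later, for example through `named_steps` or when setting parameters with the `stepname__parameter` syntax.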

### Check
Split the class into pairs and assign one function to each pair. Each pair reads about its function in the documentation and then explains it to the class.

1. Binarizer
1. KernelCenterer
1. MaxAbsScaler
1. MinMaxScaler
1. Normalizer
1. OneHotEncoder
1. PolynomialFeatures
1. RobustScaler
1. StandardScaler
1. Data Imputation
    1. Imputer (replaced by `SimpleImputer` in scikit-learn 0.20 and later)
1. Function Transformer
    1. FunctionTransformer
1. Label Manipulators
    1. LabelBinarizer
    1. LabelEncoder
    1. MultiLabelBinarizer
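
As a starting point for the exercise, here is a sketch of three of the listed classes applied to made-up inputs. The threshold of 1.5 and the sample values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, LabelEncoder, MinMaxScaler

# One made-up numeric feature with three samples.
X = np.array([[1.0], [2.0], [3.0]])

# Binarizer: values above the threshold become 1, the rest 0.
binarized = Binarizer(threshold=1.5).fit_transform(X)

# MinMaxScaler: rescale the feature so its minimum is 0 and its maximum is 1.
scaled = MinMaxScaler().fit_transform(X)

# LabelEncoder: map class labels to integers, in sorted label order.
encoded = LabelEncoder().fit_transform(['sports', 'health', 'sports'])

print(binarized.ravel(), scaled.ravel(), encoded)
```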

In [None]:
# implement custom transformers by extending BaseEstimator and TransformerMixin from sklearn
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureMultiplier(BaseEstimator, TransformerMixin):
    """Multiply every feature by a constant factor."""
    def __init__(self, factor):
        self.factor = factor

    def fit(self, X, y=None):
        # nothing to learn; return self so calls can be chained
        return self

    def transform(self, X, *_):
        return X * self.factor

fm = FeatureMultiplier(2)

test = np.diag((1, 2, 3, 4))
print(test)

fm.transform(test)

How does this compare with `FunctionTransformer` from the preprocessing module?
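
For comparison, a sketch of the same doubling done with `FunctionTransformer`, which wraps a plain function instead of requiring a class. The function name `double` is chosen for this example.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def double(X):
    """Stateless transformation: multiply every entry by 2."""
    return X * 2

ft = FunctionTransformer(double)

test = np.diag((1, 2, 3, 4))
doubled = ft.fit_transform(test)
print(doubled)
```

Here the factor is baked into the function, whereas a custom class exposes it as a constructor parameter that tools such as grid search can vary.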

Optional: Implement a custom transformer that selects a specific feature from a Pandas dataframe. It should be initialized with the column name or the column index and it should return the selected column when transforming a dataframe.
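
One possible shape for such a transformer is sketched below. The class name `ColumnSelector` is hypothetical, and the sketch assumes that integer arguments are meant as positions and anything else as a column name.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Select a single column from a DataFrame by name or by position."""
    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        # Nothing to learn; selection needs no fitted state.
        return self

    def transform(self, X):
        # Assumption: an int means a positional index, anything else a label.
        if isinstance(self.column, int):
            return X.iloc[:, self.column]
        return X[self.column]

# Made-up frame for a quick check.
df = pd.DataFrame({'title': ['a', 'b'], 'label': [0, 1]})
by_name = ColumnSelector('title').fit_transform(df)
by_index = ColumnSelector(0).fit_transform(df)
print(by_name.tolist())
```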

Revisit the salary prediction lab. How could you use `make_pipeline` and `make_union` to build a pipeline that performs the same steps all in one pass?

You will have to build something like this:

>- Data -> SelectCategoricalFeaturesTransformer -> OneHotEncoder -> FeatureUnion
>- Data -> SelectNumericalFeaturesTransformer -> Scaler -> FeatureUnion
>- FeatureUnion -> Model

Students:
- Review lab and identify the steps that were performed
- For each step, determine input and output
- Is the input the whole dataframe or only a subset of the features?
- Is the output new features or a prediction?
- Identify what kind of transformer is needed:
    - Is it a custom transformer?
    - Does scikit-learn provide a transformer like this out of the box?
- If features are treated differently, how do we recombine them ([Feature Union](http://scikit-learn.org/stable/modules/pipeline.html))?
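
The overall structure might look like the sketch below. It is not the lab's solution: the `toy` frame, its column names, the `SelectColumns` helper, and the choice of `LinearRegression` are all assumptions made so the example runs on its own.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import OneHotEncoder, StandardScaler

class SelectColumns(BaseEstimator, TransformerMixin):
    """Hypothetical helper: keep only the listed DataFrame columns."""
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]

# Made-up data standing in for the salary lab.
toy = pd.DataFrame({
    'title': ['analyst', 'engineer', 'analyst', 'engineer'],
    'years': [1.0, 3.0, 5.0, 7.0],
    'salary': [50.0, 80.0, 70.0, 100.0],
})

# Branch 1: categorical columns are one-hot encoded.
categorical = make_pipeline(SelectColumns(['title']),
                            OneHotEncoder(handle_unknown='ignore'))
# Branch 2: numerical columns are standardized.
numerical = make_pipeline(SelectColumns(['years']), StandardScaler())

# The union stacks both branches side by side, then the model is fit on the result.
model = make_pipeline(make_union(categorical, numerical), LinearRegression())
model.fit(toy, toy['salary'])

predictions = model.predict(toy)
feature_matrix = model.steps[0][1].transform(toy)
print(feature_matrix.shape, predictions)
```

Both branches receive the whole dataframe and select their own columns, which is what lets the entire preparation run in one `fit` call.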