# Pipelines

Pipelines let us define the flow of data from the initial dataset to a trained and scored model. The parameters of the pipeline can be tuned by hand or optimized with algorithms such as grid search.
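
As a rough illustration of tuning pipeline parameters with grid search, here is a minimal sketch using scikit-learn's `GridSearchCV`. It is not part of the original notebook: the synthetic data, the step names `scale` and `logreg`, and the candidate values for `C` are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); the notebook's dataset is not used here.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

toy_pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200)),
])

# A step's parameter is addressed as '<step name>__<parameter name>'.
param_grid = {'logreg__C': [0.1, 1.0, 10.0]}

# Refit the whole pipeline for every candidate value, using 3-fold cross-validation.
search = GridSearchCV(toy_pipeline, param_grid, cv=3)
search.fit(X, y)

best_C = search.best_params_['logreg__C']
print(best_C)
```

Because the search wraps the whole pipeline, the preprocessing steps are refit inside each cross-validation fold rather than once on all the data.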

In [1]:
import sys
sys.path.append('../scripts')

import numpy as np
import pandas as pd
import helpers_models as hm
import transforms
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

First, load the data.

In [2]:
data = hm.load_pivot_numbers()

In [3]:
first_week, last_week = data['tweets'].columns.min(), data['tweets'].columns.max()

In [4]:
data = hm.make_target(data, last_week)

Split the data into train and test.

In [5]:
data_train, data_test = train_test_split(data)
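
The call above relies on the defaults of `train_test_split`, which hold out 25% of the rows and shuffle with an unseeded random state. If a reproducible split that preserves the class ratio is wanted, something along these lines might work. The toy arrays and the seed `42` are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (assumption): 90 negatives and 10 positives.
y = np.array([False] * 90 + [True] * 10)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the class ratio roughly equal in both parts;
# random_state makes the split repeatable between runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print(len(X_train), len(X_test))
```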

We also want to balance the training set, and we do it outside of the pipeline. The reason is that a pipeline step only transforms the input data and not the target, so `ClassBalancer` cannot resample rows and labels together while the pipeline is being fitted.

In [6]:
data_train = hm.balance_data(data_train)
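
The implementation of `hm.balance_data` is not shown in this notebook. One common way to balance classes is to undersample the majority class, sketched below under that assumption. The function name `undersample_majority` and the flat `target` column are hypothetical; the real data uses multi-level columns.

```python
import pandas as pd

def undersample_majority(frame, target_col='target', random_state=0):
    """Downsample every class to the size of the smallest class (hypothetical helper)."""
    smallest = frame[target_col].value_counts().min()
    parts = [
        group.sample(n=smallest, random_state=random_state)
        for _, group in frame.groupby(target_col)
    ]
    # Shuffle so the classes are not stored in contiguous blocks.
    return pd.concat(parts).sample(frac=1, random_state=random_state)

# Toy frame (assumption): 2 positive rows and 8 negative rows.
toy = pd.DataFrame({
    'feature': range(10),
    'target': [True] * 2 + [False] * 8,
})

balanced = undersample_majority(toy)
print(len(balanced), int(balanced['target'].sum()))
```

Only the training set is balanced; the test set keeps its original class ratio so the scores reflect the real distribution.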

Now make a pipeline for transforming and predicting.

In [7]:
pipeline = Pipeline([
    ('normal', transforms.Normalizer()),
    ('decay', transforms.TimeDecayApplier(first_week=first_week, target_week=last_week)),
    ('logreg', LogisticRegressionCV(max_iter=200, n_jobs=-1, verbose=2))
])
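
The `transforms` module is local to this project and its code is not shown here. For readers unfamiliar with custom pipeline steps, the following sketch shows the general shape such a transformer can take. The class `ExponentialDecay`, its `rate` parameter, and the weighting scheme are hypothetical and are not claimed to match `TimeDecayApplier`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

class ExponentialDecay(BaseEstimator, TransformerMixin):
    """Hypothetical step: weights column j by rate ** (n_columns - 1 - j)."""

    def __init__(self, rate=0.5):
        self.rate = rate

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data.
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        n_columns = X.shape[1]
        # The last (most recent) column gets weight 1, older columns get less.
        weights = self.rate ** np.arange(n_columns - 1, -1, -1)
        return X * weights

# Toy data (assumption): four rows, three weekly columns.
X = np.array([[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [0.0, 1.0, 5.0], [4.0, 0.0, 1.0]])
y = np.array([0, 0, 1, 1])

decayed = ExponentialDecay(rate=0.5).transform(X)

sketch = Pipeline([
    ('decay', ExponentialDecay(rate=0.5)),
    ('logreg', LogisticRegression()),
])
sketch.fit(X, y)
```

Any object with `fit` and `transform` methods can serve as an intermediate step; inheriting from `BaseEstimator` and `TransformerMixin` adds `get_params`, `set_params`, and `fit_transform`.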

The next step is to train the pipeline.

In [8]:
%%time
pipeline.fit(data_train.drop('target', axis=1, level=0), data_train['target'])

[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:   13.5s finished


CPU times: user 6 s, sys: 2.34 s, total: 8.34 s
Wall time: 18.3 s


Pipeline(steps=[('normal', Normalizer()), ('decay', TimeDecayApplier(first_week=23, target_week=36)), ('logreg', LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=200,
           multi_class='ovr', n_jobs=-1, penalty='l2', random_state=None,
           refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=2))])
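
The `fit` call separates features from labels with `drop('target', axis=1, level=0)`, which removes every column whose top-level label is `target`. The small frame below is an assumed stand-in for the real data, with made-up values and column labels, to show what that call does on two-level columns.

```python
import pandas as pd

# Toy frame with two-level columns (assumption): group name, then week number.
columns = pd.MultiIndex.from_tuples([('tweets', 23), ('tweets', 24), ('target', 0)])
frame = pd.DataFrame([[1, 2, 1], [3, 4, 0]], columns=columns)

# Drop all columns under the top-level label 'target'.
features = frame.drop('target', axis=1, level=0)

remaining_groups = list(features.columns.get_level_values(0).unique())
print(remaining_groups)
```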

Finally, use the fitted pipeline to predict labels for the held-out test data and score the predictions with a classification report.

In [9]:
%%time
predicted = pipeline.predict(data_test.drop('target', axis=1, level=0))
report = classification_report(data_test['target'], predicted)
print(report)

             precision    recall  f1-score   support

      False       0.92      0.96      0.94   1825978
       True       0.53      0.36      0.43    239430

avg / total       0.87      0.89      0.88   2065408

CPU times: user 4.02 s, sys: 1.66 s, total: 5.67 s
Wall time: 5.53 s
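
The f1-score column in the report is the harmonic mean of precision and recall. As a quick sanity check, the values can be recomputed from the rounded precision and recall printed above. The helper name `f1_from` is introduced here for illustration only.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded precision and recall taken from the report above.
f1_true = f1_from(0.53, 0.36)
f1_false = f1_from(0.92, 0.96)

print(round(f1_true, 2), round(f1_false, 2))  # prints: 0.43 0.94
```

The low recall for the `True` class pulls its F1 down to 0.43, even though the support-weighted average looks much higher because the `False` class dominates the test set.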
