# Train Sklearn Models With a Simple Pipeline

In this notebook we train a logistic regression model and a [Huber classifier](https://en.wikipedia.org/wiki/Huber_loss#Variant_for_classification) from the scikit-learn library and compare their performance.
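For reference, the two loss functions being compared can be written as follows. This is a sketch using notation that is not in the original notebook: the label is $y \in \{-1, +1\}$ and $f$ is the value of the linear decision function. The first line is the logistic loss and the second is the modified Huber loss, as used for classification.

```latex
\ell_{\log}(y, f) = \log\bigl(1 + e^{-y f}\bigr)

\ell_{\mathrm{mh}}(y, f) =
\begin{cases}
\max(0,\, 1 - y f)^2 & \text{if } y f \ge -1, \\
-4\, y f & \text{otherwise.}
\end{cases}
```

The modified Huber loss grows only linearly for badly misclassified points, which makes it less sensitive to outliers than a purely quadratic penalty.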

In [1]:
import sys
import warnings
warnings.filterwarnings("ignore")

import numpy as np
from sklearn.linear_model import SGDClassifier
import matplotlib.pyplot as plt

sys.path.append('../..')

from batchflow.models import SklearnModel
from batchflow.opensets import MNIST
from batchflow import B, C, V, D, Pipeline
from examples.utils.utils import plot_images_predictions

Load MNIST dataset.

In [2]:
dataset = MNIST()

The pipeline's image preprocessing actions are:
1. Convert images from `PIL.Image` to `np.array`.
2. Reshape them into 2-dimensional arrays whose number of rows equals the batch size.
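The reshape step can be sketched with plain NumPy. The array below is a hypothetical stand-in for a batch of MNIST-sized images (28x28 is assumed); in the notebook the real batch comes from the pipeline.

```python
import numpy as np

# Hypothetical stand-in for a batch of 4 grayscale 28x28 images.
batch_size = 4
images = np.arange(batch_size * 28 * 28, dtype=np.float32).reshape(batch_size, 28, 28)

# Flatten every image into one row; -1 lets NumPy infer 28 * 28 = 784 columns.
flat = np.reshape(images, (batch_size, -1))

print(flat.shape)  # (4, 784)
```

Each row of `flat` is one image, which is the layout scikit-learn estimators expect for their feature matrix.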

`Sklearn` models with a `partial_fit` method support batch-wise training and can be integrated into a pipeline.
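To illustrate batch-wise training outside of any pipeline, here is a minimal sketch that calls `partial_fit` directly on an `SGDClassifier`. The synthetic two-class data, the batch size of 64, and the random seeds are assumptions made for this example only.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

# Synthetic, well-separated two-class data (illustrative assumption).
X = np.vstack([rng.normal(-2.0, 1.0, size=(200, 5)),
               rng.normal(2.0, 1.0, size=(200, 5))])
y = np.array([0] * 200 + [1] * 200)

# Shuffle so that every mini-batch contains both classes.
order = rng.permutation(len(y))
X, y = X[order], y[order]

clf = SGDClassifier(loss='modified_huber', random_state=0)
classes = np.array([0, 1])  # all classes must be known at the first partial_fit call

# Feed the data one mini-batch at a time.
for start in range(0, len(y), 64):
    clf.partial_fit(X[start:start + 64], y[start:start + 64], classes=classes)

accuracy = clf.score(X, y)
print(accuracy)
```

The `classes` argument plays the same role as the `classes=list(range(num_classes))` argument passed to `train_model` in this notebook: a single batch may not contain every label, so the full label set is given up front.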

Initialize both models by providing the `model_name` and `estimator` keys in the pipeline config. The `estimator` must be an [SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier.predict_proba) instance.

In [3]:
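# Note: scikit-learn 1.1 renamed loss='log' to loss='log_loss'; use the new name on recent versions.
logloss_config = {'model_name': 'logloss_model', 'estimator': SGDClassifier(loss='log')}
huber_config = {'model_name': 'huber_model', 'estimator': SGDClassifier(loss='modified_huber')}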

Training pipeline template. We reuse the same pipeline several times with different estimators.

In [4]:
num_classes = dataset.num_classes
train_template = (Pipeline(config=logloss_config)
                    .print(C('model_name'))
                    .init_model('dynamic', SklearnModel, C('model_name'),
                                config=dict(estimator=C('estimator')))
                    .to_array()
                    .add_namespace(np)
                    .reshape(B('images'), (B('size'), -1), save_to=B('images'))
                    .train_model(C('model_name'), B.images, B.labels,
                                 classes=list(range(num_classes)))
                 ) << dataset.train

The training pipelines are now ready to use.

In [5]:
logreg_train_pipeline = train_template

In [9]:
logreg_train_pipeline.models

{'logloss_model': <batchflow.models.sklearn.SklearnModel object at 0x7f38234b1ef0>}

In [7]:
logreg_train_pipeline.run(64, n_iters=10)

logloss_model
logloss_model
logloss_model
logloss_model
logloss_model
logloss_model
logloss_model
logloss_model
logloss_model
logloss_model


<batchflow.pipeline.Pipeline at 0x7f37b6a83128>

In [8]:
huber_train_pipeline = train_template << huber_config
huber_train_pipeline.run(64, n_iters=10)

huber_model


KeyError: "Model 'huber_model' does not exist"

Run the pipelines.

In [None]:
huber_train_pipeline.run(64, 10)
#logreg_train_pipeline.run(64, 10)

In [None]:
huber_train_pipeline.models

Repeat the same steps for the test pipelines.
Instead of initializing models, we import them from the trained pipelines.

In [None]:
import_huber_model = Pipeline().import_model('my_model', huber_train_pipeline)
import_logreg_model = Pipeline().import_model('my_model', logreg_train_pipeline)

Test pipeline template.

In [None]:
test_template = (dataset.test.p
                    .init_variable('metrics', default=None)
                    .init_variable('predictions')
                    .to_array()
                    .add_namespace(np)
                    .reshape(B('images'), (B('size'), -1), save_to=B('images'))
                    .predict_model('my_model', B('images'), save_to=V('predictions'))              
                    .gather_metrics('class', B.labels, V('predictions'), num_classes=num_classes,
                                    fmt='proba', axis=1, save_to=V('metrics', mode='a'))
                    .run_later(200, n_epochs=1, drop_last=False, shuffle=True, bar=True)
                    .reshape(B('images'), (B('size'), 28, 28), save_to=B('images'))
            )

In [None]:
huber_test_pipeline = import_huber_model + test_template
logreg_test_pipeline = import_logreg_model + test_template

Run test pipelines.

In [None]:
huber_test_pipeline.run()
logreg_test_pipeline.run()

In [None]:
logreg_metrics = logreg_test_pipeline.v('metrics')
huber_metrics = huber_test_pipeline.v('metrics')

In [None]:
logreg_metrics.evaluate('acc'), huber_metrics.evaluate('acc') 
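As a rough sketch of what accuracy means for predictions in `fmt='proba'` format with `axis=1`, the snippet below takes the most probable class per row and compares it with the true labels. It assumes predictions have shape `(batch, num_classes)`; the probabilities and labels are made up for illustration and are not results from this notebook.

```python
import numpy as np

# Hypothetical class probabilities for 4 samples and 3 classes.
proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.6, 0.3, 0.1]])
labels = np.array([0, 1, 2, 1])

# Pick the most probable class along the class axis (axis=1).
predicted = proba.argmax(axis=1)

# Accuracy is the fraction of samples where prediction and label agree.
accuracy = float((predicted == labels).mean())
print(accuracy)  # 0.75
```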

In [None]:
huber_batch = huber_test_pipeline.next_batch(10, shuffle=True)
logreg_batch = logreg_test_pipeline.next_batch(10, shuffle=True)

In [None]:
predictions = [huber_batch.pipeline.v('predictions'), logreg_batch.pipeline.v('predictions')]
plot_images_predictions(logreg_batch.images, logreg_batch.labels, predictions,
                        classes_names=None, figsize=(20,20), models_names=['huber', 'logreg'])