# Train Sklearn Models With Simple Pipeline

In this notebook we train [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression) and a [Huber classifier](https://en.wikipedia.org/wiki/Huber_loss#Variant_for_classification) from the sklearn library using the same pipeline and compare their performance.

In [1]:
import sys

import numpy as np
from sklearn.linear_model import SGDClassifier
import matplotlib.pyplot as plt

sys.path.append('../..')

from batchflow.models import SklearnModel
from batchflow.opensets import MNIST
from batchflow import B, C, V, D, Pipeline
from batchflow.utils import plot_images

# Note: matplotlib 3.6 renamed this style to 'seaborn-v0_8-poster'.
plt.style.use('seaborn-poster')
plt.style.use('ggplot')

# Create dataset

Load MNIST dataset.

In [2]:
dataset = MNIST()

# Initialize models

Provide a pipeline config for each of the models.    
Each config has to contain the following keys:
* `model_name` - the name of the model in the pipeline
* `estimator` - an [SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier.predict_proba) instance
* `metrics_name` - the name of the pipeline variable that contains the model's metrics

The model's predictions are saved to the batch component `predictions`, so the configs below do not need a separate key for them.

In [3]:
huber_config = {'model_name': 'huber_model', 
                'estimator': SGDClassifier(loss='modified_huber'),
                'metrics_name': 'huber_metrics'}

# Note: scikit-learn 1.1 renamed this loss to 'log_loss'; 'log' was removed in 1.3.
log_config = {'model_name': 'log_model', 
              'estimator': SGDClassifier(loss='log'),
              'metrics_name': 'log_metrics'}

Initialize both models with the estimators from the corresponding config keys, i.e. `config['estimator']`.

In [4]:
init_huber_model = Pipeline().init_model('dynamic', SklearnModel, 'huber_model', 
                                 config={'estimator' : C('estimator')})
init_log_model = Pipeline().init_model('dynamic', SklearnModel, 'log_model', 
                                 config={'estimator' : C('estimator')})

# Create train pipelines

Define the training pipeline template. We reuse the same template several times with different estimators.

The image preprocessing actions in the pipeline are:
1. Transform images from `PIL.Image` to `np.array`.   
2. Reshape images to 2-dimensional arrays where the number of rows equals the batch size.
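
The reshape step can be illustrated outside the pipeline with plain NumPy. This is only a sketch: the zero-filled array stands in for a batch of MNIST-sized images, and the names `images` and `flat_images` are made up for the example.

```python
import numpy as np

# A stand-in batch of 64 grayscale 28x28 images (assumed MNIST-like shape).
batch_size = 64
images = np.zeros((batch_size, 28, 28), dtype=np.float32)

# Flatten each image into one row; -1 lets NumPy infer 28 * 28 = 784 columns.
flat_images = images.reshape(batch_size, -1)

print(flat_images.shape)  # (64, 784)
```

Each row is now a feature vector for one image, which is the 2-dimensional input layout that sklearn estimators expect.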

`Sklearn` models that have a `partial_fit` method support batch-wise training and can be integrated into a pipeline.
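
To show what batch-wise training looks like without the pipeline, here is a minimal sketch that calls `partial_fit` directly. The synthetic data, the sizes, and the loop are assumptions made for illustration. The sketch is not how `SklearnModel` is implemented internally.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
n_classes, n_features, batch_size = 3, 20, 64
classes = np.arange(n_classes)

model = SGDClassifier(loss='modified_huber', random_state=0)

for _ in range(50):
    # Synthetic batch: random features with a class-dependent shift.
    X = rng.normal(size=(batch_size, n_features))
    y = rng.integers(0, n_classes, size=batch_size)
    X[np.arange(batch_size), y] += 3.0
    # The full list of classes must be known on the first call,
    # because a single batch may not contain every label.
    model.partial_fit(X, y, classes=classes)

# 'modified_huber' (like the logistic loss) supports predict_proba.
proba = model.predict_proba(X)
print(proba.shape)  # (64, 3)
```

This is why the training template passes `classes=range(dataset.num_classes)` to `train_model`: each batch updates the same estimator, and the estimator has to know all labels up front.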

In [5]:
train_template = (dataset.train.p
                    .to_array()
                    .add_namespace(np)
                    .reshape(B('images'), (B('size'), -1), save_to=B('images'))
                    .train_model(C('model_name'), B.images, B.labels, 
                                classes=range(dataset.num_classes))
                    .run_later(64, n_iters=5000, drop_last=True, shuffle=True, bar=True)
                 ) 

Tie the pipeline configs to the `train_template`.    
Now we have ready-to-use training pipelines for each model.

In [6]:
huber_train_pipeline = (init_huber_model + train_template) << huber_config
log_train_pipeline = (init_log_model + train_template) << log_config

# Train the models

Run the pipelines.

In [7]:
huber_train_pipeline.run()

100%|██████████| 5000/5000 [02:15<00:00, 36.83it/s]


<batchflow.pipeline.Pipeline at 0x7f9d519bb1d0>

In [None]:
log_train_pipeline.run()

 85%|████████▍ | 4236/5000 [01:52<00:20, 37.85it/s]

# Test the models

Import trained models from training pipelines.

In [None]:
import_huber_model = Pipeline().import_model('huber_model', huber_train_pipeline)
import_log_model = Pipeline().import_model('log_model', log_train_pipeline)

The test pipelines use the same image preprocessing steps.

In [None]:
test_template = (dataset.test.p
                    .to_array()
                    .add_namespace(np)
                    .reshape(B('images'), (B('size'), -1), save_to=B('images'))
                    .predict_model(C('model_name'), B('images'),
                                   save_to=B('predictions'), proba=True)              
                    .gather_metrics('class', B('labels'), B('predictions'), num_classes=dataset.num_classes,
                                     fmt='proba', axis=1, save_to=V(C('metrics_name')))
                    .run_later(200, n_epochs=1, drop_last=False, shuffle=True, bar=True)
                    .reshape(B('images'), (B('size'), 28, 28), save_to=B('images'))
                ) 

Tie the pipeline configs to the test template.

In [None]:
huber_test_pipeline = (import_huber_model + test_template) << huber_config
log_test_pipeline = (import_log_model + test_template) << log_config

Run test pipelines.

In [None]:
huber_test_pipeline.run()

In [None]:
log_test_pipeline.run()

Let's get the accumulated [metrics information](https://analysiscenter.github.io/batchflow/intro/models.html#model-metrics).

In [None]:
log_metrics = log_test_pipeline.v('log_metrics')
huber_metrics = huber_test_pipeline.v('huber_metrics')

Calculate accuracy for both classifiers.

In [None]:
acc = [log_metrics.evaluate('acc'), huber_metrics.evaluate('acc')]
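
As a rough illustration of what accuracy means for probability predictions whose class axis is `axis=1`, the sketch below computes it by hand with NumPy. The tiny `labels` and `proba` arrays are invented for the example, and the code does not reproduce batchflow's metrics implementation.

```python
import numpy as np

# Hypothetical true labels and predicted class probabilities (4 items, 3 classes).
labels = np.array([0, 2, 1, 2])
proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.5, 0.4, 0.1],
                  [0.2, 0.2, 0.6]])

# The predicted class is the most probable one along the class axis.
predicted = proba.argmax(axis=1)

# Accuracy is the share of items whose predicted class matches the label.
accuracy = float((predicted == labels).mean())
print(predicted, accuracy)  # [0 2 0 2] 0.75
```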

The Huber classifier slightly outperforms logistic regression.

In [None]:
print('log_reg accuracy - {0} \nhuber accuracy - {1}'.format(*acc)) 

Generate a batch of images, pass it through each trained classifier, and look at their predictions in more detail.   

In [None]:
huber_batch = huber_test_pipeline.next_batch(10, shuffle=42)
log_batch = log_test_pipeline.next_batch(10, shuffle=42)

Let's take a look at the predictions.

In [None]:
predictions = [huber_batch.predictions, log_batch.predictions]
plot_images(log_batch.images, log_batch.labels, predictions,
                        classes=None, figsize=(16, 12), models_names=['huber', 'logreg'])

# Conclusion

In this tutorial you have learnt how to train and validate a `logistic regression` and a `huber` classifier from the [sklearn](https://scikit-learn.org/stable/index.html) library using the [batchflow pipeline](https://analysiscenter.github.io/batchflow/intro/pipeline.html) functionality.

In your further research you can train an [SVM](https://en.wikipedia.org/wiki/Support-vector_machine) classifier or try to add more regularization to the models by passing arguments such as `alpha` and `penalty='l1'` to the `SGDClassifier` instances.
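
A possible starting point is sketched below. The config name `svm_config`, the names `'svm_model'` and `'svm_metrics'`, and the hyperparameter values are all hypothetical. In `SGDClassifier`, `loss='hinge'` gives a linear SVM, `alpha` sets the regularization strength, and `penalty` together with `l1_ratio` controls the mix of L1 and L2 penalties.

```python
from sklearn.linear_model import SGDClassifier

# Hypothetical config in the same format as the configs used in this notebook.
svm_config = {'model_name': 'svm_model',
              'estimator': SGDClassifier(loss='hinge',          # linear SVM
                                         penalty='elasticnet',  # mix of L1 and L2
                                         alpha=1e-3,            # regularization strength
                                         l1_ratio=0.15),        # share of L1 in the mix
              'metrics_name': 'svm_metrics'}
```

Note that an `SGDClassifier` with hinge loss does not provide `predict_proba`, so the test template above, which requests probabilities, would likely need to be adapted to work with class labels.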