# Train a ready-to-use PyTorch model with a simple pipeline

In [1]:
import os
import sys
import warnings
warnings.filterwarnings("ignore")

import numpy as np

# The following line is not required if BatchFlow is installed as a Python package.
sys.path.append("../..")
from batchflow import Pipeline, B, C, F, V
from batchflow.opensets import MNIST, CIFAR10
from batchflow.models.torch import VGG7

If you comment out the line below, training will take much longer and accuracy might decrease slightly.
So it is always a good idea to import [best_practice](https://analysiscenter.github.io/batchflow/intro/best_practice.html).

In [2]:
from batchflow import best_practice

`BATCH_SIZE` can be increased on modern GPUs with plenty of memory (4 GB or more).

In [3]:
BATCH_SIZE = 64

# Create a dataset

[MNIST](http://yann.lecun.com/exdb/mnist/) is a dataset of handwritten digits frequently used as a baseline for machine learning tasks.

Downloading the MNIST database might take a few minutes.

In [4]:
dataset = MNIST()

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Extracting /tmp/train-images-idx3-ubyte.gz
Extracting /tmp/t10k-labels-idx1-ubyte.gz
Extracting /tmp/train-labels-idx1-ubyte.gz
Extracting /tmp/t10k-images-idx3-ubyte.gz

There are also predefined CIFAR10 and CIFAR100 datasets.

# Define a pipeline config

A config lets you create flexible pipelines that take parameters.

For instance, if you put a model type into config, you can run a pipeline against different models.

See [a list of available models](https://analysiscenter.github.io/batchflow/intro/tf_models.html#ready-to-use-models) to choose the one that suits you best.

In [5]:
config = dict(model=VGG7)
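
To illustrate why putting the model type into a config makes a pipeline reusable, here is a minimal sketch of a deferred config lookup. `ConfigRef`, `SmallModel`, `LargeModel` and `build_model` are hypothetical names invented for this note. This is not how BatchFlow implements `C`; it only shows the general idea of storing a key and resolving it against a config at run time.

```python
# A toy deferred lookup: the pipeline stores a *name*, and the value is
# resolved only when a config is supplied. All names here are hypothetical.

class ConfigRef:
    """Placeholder that remembers a config key instead of a value."""

    def __init__(self, key):
        self.key = key

    def resolve(self, config):
        return config[self.key]


class SmallModel:
    name = "small"


class LargeModel:
    name = "large"


def build_model(model_ref, config):
    """Instantiate whichever model class the config points to."""
    model_class = model_ref.resolve(config)
    return model_class()


model_ref = ConfigRef("model")  # plays the role of C('model')
small = build_model(model_ref, dict(model=SmallModel))
large = build_model(model_ref, dict(model=LargeModel))
print(small.name, large.name)
```

The same `model_ref` is used in both calls; only the config changes, which is what allows one pipeline definition to be run against different models.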

# Create a template pipeline

A template pipeline is not linked to any dataset. It's just an abstract sequence of actions, so it cannot be executed, but it serves as a convenient building block.

In [6]:
train_template = (Pipeline(config=config)
                .init_variable('loss_history', init_on_each_run=list)
                .init_variable('current_loss')
                .init_model('dynamic', C('model'), 'conv_nn',
                            config={'inputs': dict(images={'shape': B('image_shape')},
                                                   labels={'classes': 10}),
                                    'initial_block/inputs': 'images'})
                .to_array(channels='first', dtype='float32')
                .train_model('conv_nn', B('images'), B('labels'),
                             fetches='loss', save_to=V('current_loss'), use_lock=True)
                .update_variable('loss_history', V('current_loss'), mode='a')
)

# Train the model

Apply a dataset to a template pipeline to create a runnable pipeline:

In [7]:
train_pipeline = (train_template << dataset.train)
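
The split between a template and a runnable pipeline can be pictured with a toy model. The sketch below uses two hypothetical classes, `ToyTemplate` and `ToyPipeline`, written only for this note; BatchFlow's actual `Pipeline` is far richer, so treat this as an illustration of the `template << dataset` idea rather than the library's implementation.

```python
# Toy illustration of "template << dataset". Hypothetical classes, not BatchFlow code.

class ToyTemplate:
    """An ordered list of actions with no data attached, so it cannot run."""

    def __init__(self, actions=()):
        self.actions = list(actions)

    def add(self, action):
        # Return a new template so the original stays reusable.
        return ToyTemplate(self.actions + [action])

    def __lshift__(self, dataset):
        # Binding a dataset produces something that can actually be executed.
        return ToyPipeline(self.actions, dataset)


class ToyPipeline:
    """A template bound to a concrete dataset."""

    def __init__(self, actions, dataset):
        self.actions = list(actions)
        self.dataset = dataset

    def run(self):
        results = []
        for item in self.dataset:
            for action in self.actions:
                item = action(item)
            results.append(item)
        return results


template = ToyTemplate().add(lambda x: x * 2).add(lambda x: x + 1)
train_results = (template << [1, 2, 3]).run()
test_results = (template << [10, 20]).run()
print(train_results, test_results)
```

One template is bound to two different datasets here, which is the reason to keep templates separate from data.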

Run the pipeline (this might take from a few minutes to a few hours, depending on your hardware):

In [8]:
train_pipeline.run(BATCH_SIZE, shuffle=True, n_epochs=1, drop_last=True, bar=True, prefetch=1)

100%|██████████| 937/937 [34:45<00:00,  3.13s/it]


<batchflow.pipeline.Pipeline at 0x7fc31ff96c50>
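
The 937 iterations shown in the progress bar follow from simple batch arithmetic: the MNIST training split has 60,000 images, and with `drop_last=True` the final incomplete batch is skipped. A small sketch of that calculation; `n_iterations` is a helper written for this note, not a BatchFlow function.

```python
import math


def n_iterations(n_items, batch_size, drop_last):
    """Number of batches in one epoch."""
    if drop_last:
        # The trailing incomplete batch is discarded.
        return n_items // batch_size
    # Otherwise the last, smaller batch is still processed.
    return math.ceil(n_items / batch_size)


BATCH_SIZE = 64
N_TRAIN = 60_000  # size of the MNIST training split

print(n_iterations(N_TRAIN, BATCH_SIZE, drop_last=True))   # 937
print(n_iterations(N_TRAIN, BATCH_SIZE, drop_last=False))  # 938
```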

Note that the progress bar often increments by 2 at a time: that's prefetch in action.

It does not help much here, though, because almost all the time is spent in model training, which runs under a thread lock, one batch after another, without any parallelism. Otherwise the model would not learn anything, since different batches would overwrite one another's weight updates.
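
The interaction between prefetching and the training lock can be sketched with the standard library alone. In the toy below, `prepare` stands in for batch loading and preprocessing, `train_step` stands in for a weight update, and two worker threads are assumed to approximate `prefetch=1` (the current batch plus one prepared in advance). All names are hypothetical and this is not BatchFlow's internal code.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Shared "model state" that every training step modifies.
weights = {"value": 0, "updates": 0}
train_lock = threading.Lock()


def prepare(batch_id):
    """Stand-in for loading and preprocessing; safe to run in parallel."""
    return [batch_id] * 4


def train_step(batch):
    """Stand-in for a weight update; the lock lets only one thread in at a time."""
    with train_lock:
        current = weights["value"]
        weights["value"] = current + sum(batch)
        weights["updates"] += 1


def process(batch_id):
    train_step(prepare(batch_id))


# Two workers: one batch can be prepared while another is being trained on.
with ThreadPoolExecutor(max_workers=2) as pool:
    list(pool.map(process, range(100)))

print(weights)
```

Because every read-modify-write of `weights` happens inside the lock, no update is lost, but the training steps themselves never overlap. This is why prefetching speeds up data preparation only.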

# Test the model

Testing is much faster than training, but without a GPU it still takes some patience.

In [9]:
test_pipeline = (dataset.test.p
                .import_model('conv_nn', train_pipeline)
                .init_variable('predictions')
                .init_variable('metrics', init_on_each_run=None)
                .to_array(channels='first', dtype='float32')
                .predict_model('conv_nn', B('images'), B('labels'),
                               fetches='predictions', save_to=V('predictions'))
                .gather_metrics('class', targets=B('labels'), predictions=V('predictions'),
                                fmt='logits', axis=1, save_to=V('metrics'), mode='w')
                .run(BATCH_SIZE, shuffle=True, n_epochs=1, drop_last=True, bar=True)
)

100%|██████████| 156/156 [03:45<00:00,  1.47s/it]
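
The `fmt='logits', axis=1` arguments above indicate that predictions arrive as raw per-class scores, with classes along axis 1. As a rough sketch of what that implies, the predicted label for each item is the index of its largest score. `logits_to_labels` is a hypothetical helper written for this note in plain Python; it is not the BatchFlow code path.

```python
def logits_to_labels(logits):
    """For each row of class scores, return the index of the largest score."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


# Three items, four classes each (toy values).
toy_logits = [
    [0.1, 2.5, -1.0, 0.3],
    [3.2, 0.0, 0.1, -0.5],
    [-2.0, -1.0, -0.5, -3.0],
]
toy_labels = logits_to_labels(toy_logits)
print(toy_labels)
```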


Let's get the accumulated [metrics information](https://analysiscenter.github.io/batchflow/intro/models.html#model-metrics):

In [10]:
metrics = test_pipeline.get_variable('metrics')

Now we can easily calculate any metrics we need:

In [11]:
metrics.evaluate('accuracy')

0.98968349358974361

In [12]:
metrics.evaluate(['false_positive_rate', 'false_negative_rate'], multiclass=None)

{'false_negative_rate': array([ 0.00282634,  0.00444139,  0.00615079,  0.00646368,  0.00563441,
         0.00835877,  0.01503064,  0.02847777,  0.00730057,  0.01746818]),
 'false_positive_rate': array([ 0.00123332,  0.00101426,  0.00123533,  0.00090061,  0.00176282,
         0.00088898,  0.0002193 ,  0.00054381,  0.00111026,  0.00258118])}

# Save the model
After training the model, you may need to save it. This is easy to do:

In [13]:
train_pipeline.save_model('conv_nn', path='path/to/save')

## What's next?

See [the image augmentation tutorial](./06_image_augmentation.ipynb) or return to the [table of contents](./00_description.ipynb).