# Train a ready-to-use model with a simple pipeline

In [1]:
# pylint: skip-file
import os
import sys
import numpy as np

# The following line is not required if Dataset is installed as a Python package.
sys.path.append("../..")
from dataset import Pipeline, B, C, F, V
from dataset.opensets import MNIST, CIFAR10
from dataset.models.tf import ResNet18

If you comment out the line below, accuracy will decrease slightly.

In [2]:
from dataset import best_practice

`BATCH_SIZE` can be increased on modern GPUs with plenty of memory (4 GB or more).

In [3]:
BATCH_SIZE = 64

# Create a dataset

[MNIST](http://yann.lecun.com/exdb/mnist/) is a dataset of handwritten digits frequently used as a baseline for machine learning tasks.

Downloading the MNIST database might take a few minutes.

In [4]:
dataset = MNIST()

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Extracting /tmp/train-images-idx3-ubyte.gz
Extracting /tmp/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Extracting /tmp/t10k-images-idx3-ubyte.gz
Extracting /tmp/t10k-labels-idx1-ubyte.gz


There are also predefined CIFAR10 and CIFAR100 datasets.

# Define a config

A config lets you create flexible pipelines that take parameters.

For instance, if you put a model type into the config, you can run the same pipeline against different models.

See [a list of available models](https://analysiscenter.github.io/dataset/intro/tf_models.html#ready-to-use-models) to choose the one that suits you best.

In [5]:
config = dict(model=ResNet18)
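To make the idea of deferring a choice to the config more concrete, here is a minimal pure-Python sketch. It does not use the Dataset library: `ConfigRef`, `build_model`, `SmallModel` and `LargeModel` are hypothetical names that only mimic how a named expression such as `C('model')` could be looked up once a concrete config is known.

```python
class ConfigRef:
    """Hypothetical stand-in for a named expression such as C('model')."""

    def __init__(self, key):
        self.key = key

    def resolve(self, config):
        # The value is looked up only when a concrete config is supplied.
        return config[self.key]


class SmallModel:  # hypothetical placeholder, not a real network
    name = "small"


class LargeModel:  # hypothetical placeholder, not a real network
    name = "large"


def build_model(model_ref, config):
    """Instantiate whichever model class the config points to."""
    model_class = model_ref.resolve(config)
    return model_class()


model_ref = ConfigRef("model")
small = build_model(model_ref, dict(model=SmallModel))
large = build_model(model_ref, dict(model=LargeModel))
print(small.name, large.name)
```

The code that builds the model never names a specific class, so swapping models only requires a different config.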

# Create a template pipeline

A template pipeline is not linked to any dataset. It's just an abstract sequence of actions, so it cannot be executed, but it serves as a convenient building block.

In [6]:
train_template = (Pipeline(config=config)
                .init_variable('loss_history', init_on_each_run=list)
                .init_variable('current_loss', init_on_each_run=0)
                .init_model('dynamic', C('model'), 'conv_nn',
                            config={'inputs': dict(images={'shape': B('image_shape')},
                                                   labels={'classes': 10, 'transform': 'ohe', 'name': 'targets'}),
                                    'input_block/inputs': 'images',
                                    'output': dict(ops=['accuracy'])})
                .train_model('conv_nn', fetches='loss',
                             feed_dict={'images': B('images'),
                                        'labels': B('labels')},
                             save_to=V('current_loss'))
                .update_variable('loss_history', V('current_loss'), mode='a')
)
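The `'transform': 'ohe'` entry in the labels config asks for one-hot encoding of the 10 digit classes. As a rough illustration of what that encoding means (this is not the library's implementation, and `one_hot` is a hypothetical helper), a pure-Python sketch might look like this:

```python
def one_hot(labels, n_classes=10):
    """Turn integer class labels into one-hot rows (hypothetical helper)."""
    encoded = []
    for label in labels:
        if not 0 <= label < n_classes:
            raise ValueError("label {} is outside 0..{}".format(label, n_classes - 1))
        row = [0] * n_classes
        row[label] = 1  # a single 1 marks the position of the class
        encoded.append(row)
    return encoded


batch_labels = [3, 0, 9]
encoded = one_hot(batch_labels)
for label, row in zip(batch_labels, encoded):
    print(label, row)
```

Each row has exactly one non-zero entry, which is the form a 10-way classification loss typically expects for its targets.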

# Train the model

Apply a dataset to a template pipeline to create a runnable pipeline:

In [7]:
train_pipeline = (train_template << dataset.train)
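The split between a template and a runnable pipeline can be pictured with a toy class. The sketch below is an assumption-laden miniature, not the Dataset implementation: `Template`, `add` and `run` are hypothetical, and only the `<<` operator mirrors the syntax used above.

```python
class Template:
    """Hypothetical miniature of a template pipeline: actions without data."""

    def __init__(self, actions=None, data=None):
        self.actions = list(actions or [])
        self.data = data

    def add(self, action):
        # Return a new template so the original stays reusable.
        return Template(self.actions + [action], self.data)

    def __lshift__(self, data):
        # `template << data` yields a runnable copy bound to the data.
        return Template(self.actions, data)

    def run(self):
        if self.data is None:
            raise RuntimeError("a template has no data; bind it with << first")
        result = list(self.data)
        for action in self.actions:
            result = [action(item) for item in result]
        return result


template = Template().add(lambda x: x * 2).add(lambda x: x + 1)
train_like = template << [1, 2, 3]
test_like = template << [10, 20]
print(train_like.run(), test_like.run())
```

The same template is bound to two different data sources, while the template itself stays unbound and refuses to run.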

Run the pipeline (this might take from a few minutes to a few hours, depending on your hardware):

In [11]:
train_pipeline.run(BATCH_SIZE, shuffle=True, n_epochs=1, drop_last=True, bar=True, prefetch=1)

100%|██████████| 937/937 [04:59<00:00,  2.87it/s]

<dataset.pipeline.Pipeline at 0x7f51064adb00>
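The total of 937 in the progress bar follows from the batch size and `drop_last=True`, which skips the final incomplete batch. A small sketch of the arithmetic, assuming the standard MNIST split of 60,000 training and 10,000 test images (`batches_per_epoch` is a hypothetical helper, not part of the library):

```python
import math


def batches_per_epoch(n_items, batch_size, drop_last):
    """Number of batches in one pass over n_items."""
    if drop_last:
        return n_items // batch_size
    return math.ceil(n_items / batch_size)


BATCH_SIZE = 64
N_TRAIN, N_TEST = 60000, 10000  # standard MNIST split sizes

train_batches = batches_per_epoch(N_TRAIN, BATCH_SIZE, drop_last=True)
test_batches = batches_per_epoch(N_TEST, BATCH_SIZE, drop_last=True)
# Images left over in the dropped, incomplete last training batch.
skipped_train = N_TRAIN - train_batches * BATCH_SIZE
print(train_batches, test_batches, skipped_train)
```

With a batch size of 64 this gives 937 training batches and 156 test batches, and 32 training images are skipped in each epoch.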

Note that the progress bar often increments by 2 at a time; that is prefetch in action.

Prefetch gains little here, though, because almost all the time is spent in model training, which runs under a thread lock, one batch after another, without any parallelism. Otherwise the model would not learn anything, since different batches would overwrite one another's weight updates.
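The interplay of prefetch and the training lock can be sketched with the standard library. This is only an illustration under stated assumptions, not how Dataset is implemented: `load_batch`, `train_step` and `process_batch` are hypothetical, the "weights" are a toy dictionary, and using `prefetch + 1` worker threads is a choice made for this sketch.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

train_lock = threading.Lock()
weights = {"steps": 0, "total": 0}  # toy stand-in for model weights


def load_batch(index):
    """Pretend to load and preprocess a batch; safe to run in parallel."""
    return [index * 10 + offset for offset in range(4)]


def train_step(batch):
    """Update the shared 'weights'; the lock admits one batch at a time."""
    with train_lock:
        weights["steps"] += 1
        weights["total"] += sum(batch)


def process_batch(index):
    batch = load_batch(index)  # may overlap with another batch's work
    train_step(batch)          # serialized by the lock


def run(n_batches, prefetch=1):
    # Assumption of this sketch: prefetch + 1 batches are in flight at once.
    with ThreadPoolExecutor(max_workers=prefetch + 1) as pool:
        list(pool.map(process_batch, range(n_batches)))


run(n_batches=5, prefetch=1)
print(weights)
```

Loading can overlap between threads, but every weight update waits for the lock, so when the update dominates the run time the extra threads save very little.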

# Test the model

Testing is much faster than training, but without a GPU it still takes some patience.

In [13]:
test_pipeline = (dataset.test.p
                .import_model('conv_nn', train_pipeline)
                .init_variable('accuracy', init_on_each_run=list)
                .predict_model('conv_nn', fetches='output_accuracy',
                               feed_dict={'images': B('images'), 'labels': B('labels')},
                               save_to=V('accuracy'), mode='a')
                .run(BATCH_SIZE, shuffle=True, n_epochs=1, drop_last=True, bar=True)
)

 99%|█████████▉| 155/156 [00:09<00:00, 15.55it/s]

In [14]:
accuracy = np.array(test_pipeline.get_variable('accuracy')).mean()
print('Accuracy {:6.2f}'.format(accuracy))

Accuracy   0.99
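Averaging per-batch accuracies matches the accuracy over all tested items only when every batch has the same size, which `drop_last=True` ensures here. A short sketch with made-up counts (the numbers below are illustrative only, not results from this notebook, and both helpers are hypothetical):

```python
from statistics import mean


def overall_accuracy(correct_counts, batch_sizes):
    """Accuracy over all items: total correct / total items."""
    return sum(correct_counts) / sum(batch_sizes)


def mean_batch_accuracy(correct_counts, batch_sizes):
    """Unweighted mean of per-batch accuracies."""
    return mean(c / n for c, n in zip(correct_counts, batch_sizes))


# Made-up counts for illustration only.
equal = ([64, 48, 32], [64, 64, 64])   # all batches are full
ragged = ([64, 64, 0], [64, 64, 16])   # a short last batch is kept

print(mean_batch_accuracy(*equal), overall_accuracy(*equal))
print(mean_batch_accuracy(*ragged), overall_accuracy(*ragged))
```

With equal batch sizes both numbers agree; with a short last batch the unweighted mean gives that batch too much influence.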


# Save the model
After training the model, you may want to save it. This takes a single call:

In [15]:
train_pipeline.save_model('conv_nn', path='path/to/save')

## What's next?

See [the image augmentation tutorial](./06_image_augmentation.ipynb) or return to the [table of contents](./00_description.ipynb).