-
Notifications
You must be signed in to change notification settings - Fork 346
/
pytorch-mnist-tutorial.txt
249 lines (180 loc) · 11.8 KB
/
pytorch-mnist-tutorial.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
.. _pytorch-mnist-tutorial:
PyTorch MNIST Tutorial
=============================
This tutorial describes how to port an existing PyTorch model to Determined. We will port a simple image classification model for the MNIST dataset. This tutorial is based on the official `PyTorch MNIST example <https://github.com/PyTorch/examples/blob/master/mnist/main.py>`_.
Prerequisites
-------------
- Access to a Determined cluster. If you have not yet installed Determined, refer to the :ref:`installation instructions <install-determined>`.
- The Determined CLI should be installed on your local machine. For installation instructions, see :ref:`here <install-cli>`. After installing the CLI, configure it to connect to your Determined cluster by setting the ``DET_MASTER`` environment variable to the hostname or IP address where Determined is running.
Overview
--------
To use a PyTorch model in Determined, you need to port the model to Determined's API. For most models, this porting process is straightforward, and once the model has been ported, all of the features of Determined will then be available: for example, you can do :ref:`distributed training <multi-gpu-training>` or :ref:`hyperparameter search <hyperparameter-tuning>` without changing your model code, and Determined will store and visualize your model metrics automatically.
When training a PyTorch model, Determined provides a built-in training loop that feeds batches of data into your forward pass, performs backpropagation, and computes training metrics. Determined also handles checkpointing, log management, and device initialization. To plug your model code into the Determined training loop, you define methods to perform the following tasks:
- initialization
- build the model
- define the optimizer
- define the forward pass
- load the training data set
- load the validation data set
The Determined training loop will then invoke these functions automatically. These methods should be organized into a **trial class**, which is a user-defined Python class that inherits from :class:`determined.pytorch.PyTorchTrial`. The following sections walk through how to write your first trial class and then how to run a training job with Determined.
The complete code for this tutorial can be downloaded here: `mnist_pytorch <https://github.com/determined-ai/determined/tree/master/examples/official/mnist_pytorch>`_. We suggest you download that code and use it follow along as you read through this tutorial.
Building a ``PyTorchTrial`` Class
---------------------------------
Here is what the skeleton of our trial class looks like:
.. code:: python
import torch.nn as nn
from determined.pytorch import DataLoader, PyTorchTrial
class MNISTTrial(PyTorchTrial):
def __init__(self, context: det.TrialContext):
# Initialize the trial class.
pass
def build_model(self):
# Build the model.
pass
def optimizer(self, model: nn.Module):
# Define the optimizer.
pass
def train_batch(self, batch: TorchData, model: nn.Module, epoch_idx: int, batch_idx: int):
# Define the training forward pass and calculate loss and other metrics
# for a batch of training data.
pass
def evaluate_batch(self, batch: TorchData, model: nn.Module):
# Define how to evaluate the model by calculating loss and other metrics
# for a batch of validation data.
pass
def build_training_data_loader(self):
# Create the training data loader.
# This should return a determined.pytorch.Dataset.
pass
def build_validation_data_loader(self):
# Create the validation data loader.
# This should return a determined.pytorch.Dataset.
pass
We now discuss how to implement each of these methods in more detail.
Initialization
""""""""""""""
As with any Python class, the ``__init__`` method is invoked to construct our trial class. Determined passes this method a single parameter, :class:`TrialContext <determined.TrialContext>`. The trial context contains information about the trial, such as the values of the hyperparameters to use for training. For the time being, we don't need to access any properties from the trial context object, but we assign it to an instance variable so that we can use it later:
.. code:: python
def __init__(self, context: PyTorchTrialContext):
# Store trial context for later use.
self.context = context
Building the Model
""""""""""""""""""
The ``build_model`` method returns a ``torch.Module`` or ``torch.Sequential`` object. The MNIST model code uses the Torch Sequential API and we can continue to use that API in our implementation of ``build_model``. We also access the hyperparameter for our model by the trial context with the function ``get_hparam()``.
.. code:: python
from determined.pytorch import reset_parameters
...
def build_model(self):
model = nn.Sequential(
nn.Conv2d(1, self.context.get_hparam("n_filters1"), 3, 1),
nn.ReLU(),
nn.Conv2d(
self.context.get_hparam("n_filters1"), self.context.get_hparam("n_filters2"), 3,
),
nn.ReLU(),
nn.MaxPool2d(2),
nn.Dropout2d(self.context.get_hparam("dropout1")),
Flatten(),
nn.Linear(144 * self.context.get_hparam("n_filters2"), 128),
nn.ReLU(),
nn.Dropout2d(self.context.get_hparam("dropout2")),
nn.Linear(128, 10),
nn.LogSoftmax(),
)
Defining the Optimizer
""""""""""""""""""""""
The ``optimizer`` method returns a ``torch.optim`` object. The MNIST model code uses the ``torch.optim.Adadelta`` and we can continue to use that in our implementation of ``optimizer``.
.. code:: python
def optimizer(self, model: nn.Module):
return torch.optim.Adadelta(model.parameters(), lr=self.context.get_hparam("learning_rate"))
Loading Data
""""""""""""""""
The last two methods we need to define are ``build_training_data_loader`` and ``build_validation_data_loader``. Determined uses these methods to load the training and validation datasets, respectively. This method should return a ``determined.pytorch.DataLoader``, which is very similar to ``torch.utils.data.DataLoader``.
.. code:: python
def build_training_data_loader(self):
if not self.data_downloaded:
self.download_directory = data.download_dataset(
download_directory=self.download_directory,
data_config=self.context.get_data_config(),
)
self.data_downloaded = True
train_data = data.get_dataset(self.download_directory, train=True)
return DataLoader(train_data, batch_size=self.context.get_per_slot_batch_size())
def build_validation_data_loader(self):
if not self.data_downloaded:
self.download_directory = data.download_dataset(
download_directory=self.download_directory,
data_config=self.context.get_data_config(),
)
self.data_downloaded = True
validation_data = data.get_dataset(self.download_directory, train=False)
return DataLoader(validation_data, batch_size=self.context.get_per_slot_batch_size())
Defining the Forward Pass
"""""""""""""""""""""""""""
The ``train_batch`` computes the forward pass and calculates the loss. Determined expects a dictionary with the calculated loss and other user defined metrics and will automatically average all the metrics. Since Determined handles the backprop call, we do not need to call ``loss.backwards()`` or ``optim.zero_grad()``.
.. code:: python
def train_batch(self, batch: TorchData, model: nn.Module, epoch_idx: int, batch_idx: int):
batch = cast(Tuple[torch.Tensor, torch.Tensor], batch)
data, labels = batch
output = model(data)
loss = torch.nn.functional.nll_loss(output, labels)
return {"loss": loss}
def evaluate_batch(self, batch: TorchData, model: nn.Module):
batch = cast(Tuple[torch.Tensor, torch.Tensor], batch)
data, labels = batch
output = model(data)
validation_loss = torch.nn.functional.nll_loss(output, labels).item()
pred = output.argmax(dim=1, keepdim=True)
accuracy = pred.eq(labels.view_as(pred)).sum().item() / len(data)
return {"validation_loss": validation_loss, "accuracy": accuracy}
Training the Model
------------------
Now that we have ported our model code to the trial API, we can use Determined to train a single instance of the model or to do a hyperparameter search. In Determined, a :ref:`trial <concept-trial>` is a training task that consists of a dataset, a deep learning model, and values for all of the model's hyperparameters. An :ref:`experiment <concept-experiment>` is a collection of one or more trials: an experiment can either train a single model (with a single trial), or can define a search over a user-defined hyperparameter space.
To create an experiment, we start by writing a configuration file which defines the kind of experiment we want to run. In this case, we want to train a single model for a fixed number of batches, using fixed values for the model's hyperparameters:
.. code:: yaml
description: mnist_pytorch_const
data:
url: https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz
hyperparameters:
learning_rate: 1.0
global_batch_size: 64
n_filters1: 32
n_filters2: 64
dropout1: 0.25
dropout2: 0.5
searcher:
name: single
metric: validation_loss
max_steps: 9 # 9 steps is ~ one epoch
smaller_is_better: true
entrypoint: model_def:MNistTrial
Rather than specifying the number of batches to train for directly, we instead specify the number of :ref:`steps <concept-step>`. By default, a step consists of 100 batches, so the config file above specifies that the model should be trained on 900 batches of data.
The ``entrypoint`` specifies the name of the trial class to use. This is useful if our model code contains more than one trial class. In this case, we use an entrypoint of ``model_def:MNistTrial`` because our trial class is named ``MNistTrial`` and it is defined in a Python file named ``model_def.py``.
For more information on experiment configuration, see the :ref:`experiment configuration reference <experiment-configuration>`.
Running an Experiment
---------------------
The Determined CLI can be used to create a new experiment, which will immediately start running on the cluster. To do this, we run:
.. code::
det experiment create const.yaml .
Here, the first argument (``const.yaml``) is the name of the experiment configuration file and the second argument (``.``) is the location of the directory that contains our model definition files. You may need to configure the CLI with the network address where the Determined master is running, via the ``-m`` flag or the ``DET_MASTER`` environment variable. For more information, see the :ref:`CLI reference page <cli>`.
Once the experiment is started, you will see a notification:
.. code:: text
Preparing files (../mnist_pytorch) to send to master... 2.5KB and 4 files
Created experiment xxx
Activated experiment xxx
Evaluating the Model
--------------------
Model evaluation is done automatically for you by Determined.
To access information on both training and validation performance,
simply go to the WebUI by entering the address of the Determined master
in your web browser.
Once you are on the Determined landing page, you can find your experiment
either via the experiment ID (xxx) or via its description.
Next Steps
----------
- :ref:`tf-cifar-tutorial`
- :ref:`experiment-configuration`
- :ref:`command-configuration`
- :ref:`topic-guides_yaml`
- :ref:`benefits-of-determined`
- :ref:`terminology-concepts`