# Introduction to Python and Natural Language Technologies

__Lecture 8, Sequence modeling__

__March 30, 2021__

__Judit Ács__

# Agenda

1. Overview of sequence modeling
    - Sequence elements
    - Types of models
    - Some applications
2. A bare-bones sequence classification example
    - Cover the details of modeling and training
    - Data preparation with Pandas
    - Training on a small task that runs quickly on a laptop CPU 
    - Use advanced Python features instead of `torchtext`

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from collections import defaultdict

Standard `torch` imports:

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

# Sequence elements

We deal with sequences in NLP:
- a token is a sequence of characters/morphemes
- a sentence is a sequence of tokens
- a paragraph is a sequence of sentences
- a dialogue is a sequence of utterances
- etc.

What are the elements of these sequences?

## Words

Pros:

- More or less well-defined in most languages
- Relatively short sequences (a sentence is rarely longer than 30 tokens)

Cons:
- Difficult tokenization in some languages
- Large vocabulary (100,000+ easily)
- Out-of-vocabulary words are always there regardless of the size of the vocabulary
- Many rare words
    - Hapax: a word that only appears once in the dataset.
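
Counting hapaxes is straightforward once the text is tokenized. The sketch below uses a toy corpus and naive whitespace tokenization; both are invented for illustration and are not the lecture's data:

```python
from collections import Counter

# Toy corpus, invented for illustration only
corpus = "the cat sat on the mat and the dog sat on the rug"
tokens = corpus.split()  # naive whitespace tokenization

freq = Counter(tokens)
# A hapax is a word type that occurs exactly once in the data
hapaxes = sorted(word for word, count in freq.items() if count == 1)

print(f"{len(freq) = }")
print(f"{hapaxes = }")
```

Even in this tiny example more than half of the word types are hapaxes.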

## Characters

Pros:
- Smaller vocabulary, although logographic writing systems (Chinese and Japanese) have thousands of characters
- Easy tokenization
- Well defined: Unicode symbols

Cons:
- Long sequences
- Too fine-grained, token level information is lost

## Subwords

- Multiple characters but smaller than words
- Modern language models use subword vocabularies
- We will cover these next week

# Sequence classification

Assign a single label to the full sequence:

<img src="img/tikz/abstract_sequence_classification.png" width="350" />

__Applications__

- Topic classification (AG News dataset from Lab 7)
- Sentiment analysis: is this sentence or paragraph a positive (1) or a negative (0) review?

<img src="img/tikz/example_sequence_classification.png" width="500" />

# Sequence tagging

Assign a label to each element of the sequence:

<img src="img/tikz/abstract_sequence_tagging.png">

__Applications__

- part-of-speech tagging
- named entity recognition (NER)

<img src="img/tikz/example_sequence_tagging.png" >

# Seq2seq

<img src="img/tikz/abstract_seq2seq.png" width=600px>

- Maps a source sequence to a target sequence
    - Arbitrary length
    
- Two steps:
    1. Encode: create a representation of the source
    2. Decode: generate the target sequence
        - autoregressive: generate tokens from left-to-right one-by-one (condition on the left context)
        
- Usually implemented as two separate neural networks, for example:
    - The encoder is a bidirectional LSTM
    - The decoder is a unidirectional LSTM
    
- Applications:
    - Neural machine translation
    - Morphological inflection
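
The autoregressive decoding loop can be sketched without any neural network. In the code below `toy_decoder_step` is a hypothetical stand-in for a trained decoder (it simply reverses the source), and `greedy_decode` and the symbol names are illustrative as well:

```python
BOS, EOS = "<BOS>", "<EOS>"

def toy_decoder_step(source, prefix):
    """Hypothetical stand-in for a trained decoder: returns the next target
    symbol given the source and the symbols generated so far.
    This toy version just emits the source in reverse order."""
    generated = len(prefix) - 1  # do not count <BOS>
    if generated < len(source):
        return source[len(source) - 1 - generated]
    return EOS

def greedy_decode(source, max_len=20):
    prefix = [BOS]
    # Generate left to right, conditioning only on the left context
    while len(prefix) < max_len:
        symbol = toy_decoder_step(source, prefix)
        if symbol == EOS:
            break
        prefix.append(symbol)
    return prefix[1:]  # drop <BOS>

print(greedy_decode(list("abc")))
```

A real decoder would return a probability distribution over the target vocabulary at each step, and the loop would pick the most probable symbol or sample from the distribution.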

# Attention

Seq2seq performs poorly in its naive form since the decoder has to generate the whole output based on a single hidden vector that represents the full input sequence.

**Attention** gives the decoder a peek into the input sequence at every decoding step ([image source](https://aihub.cloud.google.com/u/0/p/products%2F024b89fd-9bc8-4c24-b8a8-e347479f3270)):
    
<img src="img/dl/attention_mechanism.jpg" width=600px>
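
One common way to compute attention is to score each encoder state against the current decoder state with a dot product, normalize the scores with softmax and take the weighted sum of the encoder states. The sketch below assumes this dot-product scoring (other scoring functions exist); the vectors are made up for illustration:

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(query, encoder_states):
    # One score per source position
    scores = [sum(q * k for q, k in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context vector: attention-weighted sum of the encoder states
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy values
query = [0.0, 5.0]                                      # toy decoder state
weights, context = dot_product_attention(query, encoder_states)
print(f"{weights = }")
print(f"{context = }")
```

The weights sum to one, and the positions whose encoder state is most similar to the query receive most of the weight.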

# Other tasks

There are many other NLP tasks that are solved with some combination of the above models.

- Sentence pair classification: the same as sequence classification except we assign a label to a __pair__ of sentences
    - paraphrase identification
- Span level tasks
- Tree-based tasks
    - Universal Dependencies (Lecture 10)

# Sequence classification example

We will now train a sequence classification model on Hungarian morphology. This model can be trained on a laptop CPU.

## Backpropagation


Neural networks are usually trained with the [backpropagation](https://en.wikipedia.org/wiki/Backpropagation) algorithm.

The data flow in a neural network implicitly defines a **computation graph**. Passing an input through this graph to compute the output is called the **forward pass**.

The output is compared against the ground truth (the label), and the difference is quantified by a **loss function**, also called a **cost function**.

If the loss function is differentiable with respect to the parameters, we can compute its gradient w.r.t. all parameters using the chain rule, i.e. we can quantify 'how responsible' each parameter is for the loss. This is called **backpropagation**.

An optimizer then updates the parameters. In the simplest case the update is proportional to the negative gradient; this is called **gradient descent**.

More information: [Backpropagation chapter](https://www.deeplearningbook.org/contents/mlp.html#pf25) from [Deep Learning](https://www.deeplearningbook.org/) by Goodfellow, Bengio and Courville
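
The whole cycle (forward pass, loss, backward pass, parameter update) fits in a few lines for a model with a single parameter. This is a hand-written sketch on invented toy data, not how PyTorch implements it; the gradient is derived manually with the chain rule:

```python
# Toy data generated by y = 3 * x (invented for illustration)
xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]

w = 0.0               # the single trainable parameter
learning_rate = 0.05  # arbitrary choice that converges on this data

for step in range(100):
    # Forward pass: predictions and mean squared error
    preds = [w * x for x in xs]
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    # Backward pass (chain rule): dloss/dw = mean(2 * (pred - y) * x)
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    # Gradient descent: step against the gradient
    w -= learning_rate * grad

print(f"{w = :.4f}, {loss = :.6f}")
```

In PyTorch the backward pass is what `loss.backward()` computes, and the update is what `optimizer.step()` performs.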

We will now discuss the building blocks of the sequence classifier.

## `nn.Embedding`

`nn.Embedding` maps integers to continuous vectors. Its mandatory parameters are:
- `num_embeddings`: the size of the vocabulary
- `embedding_dim`: the size of the embedding

In [None]:
embedding = nn.Embedding(5, 3)
embedding(torch.LongTensor([1, 4]))
# embedding(torch.LongTensor([1, 5]))  # raises IndexError

Works with higher order tensors:

In [None]:
input_tensor = torch.LongTensor(
    [[1, 2], [0, 1]]
)
emb = embedding(input_tensor)
print(f"{input_tensor.size() = }")
print(f"{emb.size() = }")
emb

In [None]:
input_tensor = torch.LongTensor(
    [[[1, 2], [0, 1]]]
)
emb = embedding(input_tensor)
print(f"{input_tensor.size() = }")
print(f"{emb.size() = }")
emb

The embedding's parameters can be listed with:

In [None]:
for pname, param in embedding.named_parameters():
    print(pname, param.size())
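
Since the only parameter is the `weight` matrix, the layer can be read as a lookup table: id `i` selects row `i`. A minimal check (the seed and variable names are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding = nn.Embedding(5, 3)
ids = torch.LongTensor([1, 4])

# Calling the layer and indexing the weight matrix give the same rows
looked_up = embedding(ids)
indexed = embedding.weight[ids]
print(torch.equal(looked_up, indexed))
```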

## `nn.LSTM`

LSTM is the most popular recurrent cell. It takes a sequence as input and processes it one element at a time while updating its hidden state. Its parameters are (from [here](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html?highlight=lstm#torch.nn.LSTM)):
- `input_size`: The number of expected features in the input x
- `hidden_size`: The number of features in the hidden state h
- `num_layers`: Number of recurrent layers. E.g., setting num_layers=2 would mean stacking two LSTMs together to form a stacked LSTM, with the second LSTM taking in outputs of the first LSTM and computing the final results. Default: 1
- `bias`: If False, then the layer does not use bias weights b_ih and b_hh. Default: True
- `batch_first`: If True, then the input and output tensors are provided as (batch, seq, feature). Default: False
- `dropout`: If non-zero, introduces a Dropout layer on the outputs of each LSTM layer except the last layer, with dropout probability equal to dropout. Default: 0
- `bidirectional`: If True, becomes a bidirectional LSTM. Default: False


In [None]:
lstm = nn.LSTM(5, 12, num_layers=1, bidirectional=False, batch_first=True, dropout=0)
lstm

In [None]:
for pname, param in lstm.named_parameters():
    print(pname, param.size())

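Its input **must** be 3D when it is batched (recent PyTorch versions also accept an unbatched 2D input of shape `sequence_length X input_size`):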

In [None]:
input_tensor = torch.rand((3, 2))
# lstm(input_tensor)  # raises RuntimeError

In [None]:
# batch_size X sequence_length X input_size
# 3 X 2 X 5
input_tensor = torch.rand((3, 2, 5))
outputs, (h, c) = lstm(input_tensor)

print(f"{outputs.size() = }")
print(f"{h.size() = }")
print(f"{c.size() = }")

In practice we generally use bidirectional LSTMs, which are implemented as two unidirectional LSTMs.

`h` and `c` are now the final states of both unidirectional LSTMs:

In [None]:
lstm = nn.LSTM(5, 12, num_layers=1, bidirectional=True, batch_first=True, dropout=0)

input_tensor = torch.rand((3, 2, 5))
outputs, (h, c) = lstm(input_tensor)

print(f"{outputs.size() = }")
print(f"{h.size() = }")
print(f"{c.size() = }")

In [None]:
lstm = nn.LSTM(5, 12, num_layers=3, bidirectional=True, batch_first=True, dropout=0)

input_tensor = torch.rand((3, 2, 5))
outputs, (h, c) = lstm(input_tensor)

print(f"{outputs.size() = }")
print(f"{h.size() = }")
print(f"{c.size() = }")
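
The shapes printed above follow a fixed pattern, which the sketch below spells out. It also checks where the top layer's final states can be found in `outputs`; the variable names are illustrative:

```python
import torch
import torch.nn as nn

batch_size, seq_len, input_size, hidden_size, num_layers = 3, 2, 5, 12, 3
num_directions = 2  # bidirectional

lstm = nn.LSTM(input_size, hidden_size, num_layers=num_layers,
               bidirectional=True, batch_first=True)
outputs, (h, c) = lstm(torch.rand(batch_size, seq_len, input_size))

# outputs holds the top layer's hidden state at every position
expected_outputs = (batch_size, seq_len, num_directions * hidden_size)
# h and c hold the final state of every layer and direction
expected_h = (num_layers * num_directions, batch_size, hidden_size)
print(tuple(outputs.shape) == expected_outputs, tuple(h.shape) == expected_h)

# Top layer: the forward direction ends at the last position,
# the backward direction ends at the first position
forward_final = outputs[:, -1, :hidden_size]
backward_final = outputs[:, 0, hidden_size:]
print(torch.allclose(forward_final, h[-2]), torch.allclose(backward_final, h[-1]))
```

With more than one layer, the top layer's final states are `h[-2]` and `h[-1]`, not `h[0]` and `h[1]`.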

## Combining `nn.LSTM` with `nn.Embedding`

In [None]:
# model parameters
vocab_size = 10
embedding_size = 5
lstm_hidden_size = 6

embedding = nn.Embedding(vocab_size, embedding_size)
lstm = nn.LSTM(embedding_size, lstm_hidden_size, num_layers=1, bidirectional=True, batch_first=True, dropout=0)

In [None]:
# input parameters (note that these are independent of the model)
batch_size = 3
sequence_length = 4

input_ids = torch.randint(vocab_size, (batch_size, sequence_length))
input_ids

In [None]:
embedded = embedding(input_ids)
outputs, (h, c) = lstm(embedded)

print(f"{input_ids.size() = }")
print(f"{embedded.size() = }")
print(f"{outputs.size() = }")
print(f"{h.size() = }")
print(f"{c.size() = }")

## `nn.Linear`

`nn.Linear` implements an affine transformation, i.e. a matrix multiplication followed by adding a bias vector:

In [None]:
dense = nn.Linear(3, 5)
dense

In [None]:
for pname, param in dense.named_parameters():
    print(pname, param.size())

In [None]:
input_tensor = torch.rand((2, 3))
output = dense(input_tensor)
print(f"{output.size() = }")

Inner dimensions must match:

In [None]:
input_tensor = torch.rand((2, 4))
# output = dense(input_tensor)  # raises RuntimeError
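
The computation can be reproduced by hand from the layer's two parameters. A minimal check, with an illustrative seed and variable names:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dense = nn.Linear(3, 5)
x = torch.rand(2, 3)

# weight has shape (out_features, in_features), hence the transpose
manual = x @ dense.weight.T + dense.bias
print(torch.allclose(dense(x), manual, atol=1e-6))
```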

## The `LSTMClassifier` class

We can now define our own LSTM sequence classifier model.

All PyTorch modules must subclass `nn.Module` (or one of its subclasses) and call `super().__init__()` before any attribute assignment.

There are two methods we have to implement:
- `__init__` defines the submodules. These constitute the nodes of the _computation graph_.
- `forward` implements the forward pass of the module. This is how we map the input to the output. The way we pass the input through the module implicitly builds a directed graph of the submodules, called the _computation graph_.

The backward pass is handled automatically by PyTorch's autograd. A custom gradient can be defined by subclassing `torch.autograd.Function` and implementing its `backward` method.

In [None]:
class LSTMClassifier(nn.Module):
    def __init__(self, input_size, embedding_size, hidden_size, output_size):
        super().__init__()
        self.embedding = nn.Embedding(input_size, embedding_size)
        self.lstm = nn.LSTM(
            input_size=embedding_size,
            hidden_size=hidden_size,
            num_layers=1,
            bidirectional=True,
            batch_first=True,
        )
        self.dense = nn.Linear(hidden_size * 2, output_size)
        
    def forward(self, sequences):
        # sequences: batch_size X sequence_length
        embedded = self.embedding(sequences)
        
        # lstm_outputs: batch_size X sequence_length X 2*hidden_size
        # h: 2 X batch_size X hidden_size
        # c: 2 X batch_size X hidden_size
        lstm_outputs, (h, c) = self.lstm(embedded)
        
        # h: batch_size X 2*hidden_size
        h = torch.cat((h[0], h[1]), dim=-1)
        
        # output: batch_size X output_size
        output = self.dense(h)
        return output
        
        
toy_classifier = LSTMClassifier(3, 10, 5, 2)
toy_input = torch.LongTensor([
    [0, 1, 0, 2],
    [0, 1, 0, 2],
    [0, 0, 0, 2],
    [1, 1, 2, 0],
])
toy_classifier(toy_input)

### Under the hood

PyTorch registers every attribute assigned in `__init__` that is an instance of `nn.Module` as a submodule, so its parameters are listed among the parameters of the parent module:

In [None]:
for pname, param in toy_classifier.named_parameters():
    print(pname, param.size())

This is **not** true for lists and other complex data types. PyTorch does not attempt to traverse them:

In [None]:
class SimpleModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(12, 4), nn.Linear(15, 9)]
        
for pname, param in SimpleModule().named_parameters():
    print(pname, param.size())

The solution is `nn.ModuleList` or `nn.Sequential`:

In [None]:
class SimpleModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(12, 4), nn.Linear(15, 9)])
        
for param in SimpleModule().named_parameters():
    print(f"name: {param[0]}, size: {param[1].size()}")

`nn.Sequential` chains multiple modules:

In [None]:
class SimpleModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(12, 4),
            nn.ReLU(),
            nn.Linear(4, 9),  # input size must match the previous layer's output size
            nn.Tanh(),
        )
        
    def forward(self, input):
        out = self.layers(input)
        return out
    
        
for pname, param in SimpleModule().named_parameters():
    print(pname, param.size())

## Data preparation

We will now train a small classifier that predicts the case of Hungarian nouns. Hungarian has a rich case system that expresses what English expresses with prepositions. Some examples are:

| Case | Hungarian | English |
| ---- | ---- | ---- |
| Nominative | ház | house |
| Instrumental | házzal | with (a) house |
| Ablative  | háztól | (away) from (a) house |
| Elative  | házból | from inside (a) house |


We will train a character-level model that predicts the case based on the word form. This is an easy task since most of the time the grammatical case is obvious from the last 3 characters of the word.

Our model looks like this:

<img src="img/tikz/hungarian_case_lstm.png" width=600px>

In [None]:
train_df = pd.read_table("data/unimorph/hun_train.tsv")
dev_df = pd.read_table("data/unimorph/hun_dev.tsv")
test_df = pd.read_table("data/unimorph/hun_test.tsv")
print(len(train_df), len(dev_df), len(test_df))
train_df.head()

### Downsampling

We don't need much data to train the model, so let's downsample it and train on a small subset.

In [None]:
train_df = train_df.sample(1000, random_state=1).reset_index(drop=True)
dev_df = dev_df.sample(200, random_state=1).reset_index(drop=True)
test_df = test_df.sample(200, random_state=1).reset_index(drop=True)

### Building the vocabulary

We need to map each character to an integer id. For this we need to define a `char->int` mapping that is as big as the alphabet:

In [None]:
alphabet = set()
for token in train_df.infl:
    alphabet |= set(token)
len(alphabet)

and some extra symbols:

1. PAD: used as filler symbols for shorter sequences (see later)
2. BOS: beginning-of-sequence. Indicates the start of the sequence.
3. EOS: end-of-sequence. Indicates the end of the sequence.
4. UNK: unknown. Symbols that fall out of the vocabulary are replaced with this symbol.

In [None]:
alphabet.add('<PAD>')
alphabet.add('<BOS>')
alphabet.add('<EOS>')
alphabet.add('<UNK>')

In [None]:
# sort the symbols so that the ids are reproducible across runs (set order is not)
vocab = {symbol: i for i, symbol in enumerate(sorted(alphabet))}
len(vocab)

In [None]:
def encode_token(token):
    ids = []
    ids.append(vocab['<BOS>'])
    # dev and test might contain characters outside the alphabet
    ids.extend(vocab.get(c, vocab['<UNK>']) for c in token)
    ids.append(vocab['<EOS>'])
    return ids

print(f"{encode_token('alma') = }")
print(f"{vocab['<UNK>'] = }")
print(f"{encode_token('ALMA') = }")

Let's encode our dataset:

In [None]:
train_df['encoded'] = train_df.infl.apply(encode_token)
dev_df['encoded'] = dev_df.infl.apply(encode_token)
test_df['encoded'] = test_df.infl.apply(encode_token)
train_df.head()

### Padding

The input sequences differ in length at the moment:

In [None]:
sns.countplot(x=train_df.infl.str.len(), palette='tab10')

We append **PAD** symbols to the shorter sequences like this:

<img src="img/tikz/padding.png">

In [None]:
maxlen = train_df.encoded.apply(len).max()
print(maxlen)

def pad_sequence(sequence):
    if len(sequence) > maxlen:
        return sequence[:maxlen]
    return sequence + [vocab['<PAD>'] for _ in range(maxlen-len(sequence))]

print(pad_sequence([1, 2, 3]))

In [None]:
train_df['padded'] = train_df.encoded.apply(pad_sequence)
dev_df['padded'] = dev_df.encoded.apply(pad_sequence)
test_df['padded'] = test_df.encoded.apply(pad_sequence)

train_df['padded'].apply(len).value_counts()
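
Padding to a global `maxlen` is the simplest option. An alternative, sketched here as an assumption rather than what this notebook does, is to pad each batch only to the length of its own longest sequence. `PAD_ID` and `pad_batch` are hypothetical names:

```python
PAD_ID = 0  # hypothetical id reserved for the padding symbol

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one in the batch."""
    maxlen = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (maxlen - len(seq)) for seq in sequences]
    lengths = [len(seq) for seq in sequences]  # original lengths, needed later
    return padded, lengths

batch = [[5, 3, 8], [7], [2, 9]]
padded, lengths = pad_batch(batch)
print(padded)
print(lengths)
```

Per-batch padding wastes fewer positions when the batch contains only short sequences.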

We will need the original lengths of each sequence later:

In [None]:
train_df['seqlen'] = train_df.encoded.apply(len)
dev_df['seqlen'] = dev_df.encoded.apply(len)
test_df['seqlen'] = test_df.encoded.apply(len)

### Preparing labels

There are 18 labels:

In [None]:
label_to_id = {label: i for i, label in enumerate(train_df.case.unique())}
label_to_id

We add a `label` column to each dataframe:

In [None]:
train_df['label'] = train_df.case.apply(lambda c: label_to_id[c])
dev_df['label'] = dev_df.case.apply(lambda c: label_to_id[c])
test_df['label'] = test_df.case.apply(lambda c: label_to_id[c])

In [None]:
train_df.head()

#### Extract input and output tensors

In [None]:
X_train = torch.from_numpy(np.array(list(train_df.padded)))
y_train = torch.LongTensor(train_df.label.values)
seqlen_train = torch.LongTensor(train_df.seqlen.values)
print(f"{X_train.size() = },\n{y_train.size() = }\n{seqlen_train.size() = }\n")

X_dev = torch.from_numpy(np.array(list(dev_df.padded)))
y_dev = torch.LongTensor(dev_df.label.values)
seqlen_dev = torch.LongTensor(dev_df.seqlen.values)
print(f"{X_dev.size() = },\n{y_dev.size() = }\n{seqlen_dev.size() = }\n")

X_test = torch.from_numpy(np.array(list(test_df.padded)))
y_test = torch.LongTensor(test_df.label.values)
seqlen_test = torch.LongTensor(test_df.seqlen.values)
print(f"{X_test.size() = },\n{y_test.size() = }\n{seqlen_test.size() = }")

## `PackedSequence`

We need to modify `LSTMClassifier` to support padding. The position of the last real (non-padding) element now differs from sequence to sequence:

<img src="img/tikz/padding_last_highlight.png">

In [None]:
class LSTMClassifier(nn.Module):
    def __init__(self, input_size, embedding_size, hidden_size, output_size):
        super().__init__()
        self.embedding = nn.Embedding(input_size, embedding_size)
        self.lstm = nn.LSTM(
            input_size=embedding_size,
            hidden_size=hidden_size,
            num_layers=1,
            bidirectional=True,
            batch_first=True,
        )
        self.dense = nn.Linear(hidden_size * 2, output_size)
        
    # the input signature of forward changes
    def forward(self, sequences, sequence_lens):
        embedded = self.embedding(sequences)
        
        # THIS IS THE MODIFIED PART
        # returns a PackedSequence object
        packed = nn.utils.rnn.pack_padded_sequence(
            embedded,
            sequence_lens,
            enforce_sorted=False,
            batch_first=True)
        packed_outputs, (h, c) = self.lstm(packed)
        # extract LSTM outputs (not used here); batch_first must match the LSTM
        lstm_outputs, lens = nn.utils.rnn.pad_packed_sequence(packed_outputs, batch_first=True)
        
        h = torch.cat((h[0], h[1]), dim=-1)
        output = self.dense(h)
        return output
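
The effect of packing can be verified on a single sequence: with packing, the final hidden state is the same as if the sequence had never been padded. This is a minimal sketch with a unidirectional LSTM and made-up sizes:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=3, batch_first=True)

short = torch.rand(1, 2, 4)                               # one sequence of length 2
padded = torch.cat([short, torch.zeros(1, 3, 4)], dim=1)  # padded to length 5

with torch.no_grad():
    _, (h_short, _) = lstm(short)
    _, (h_padded, _) = lstm(padded)   # the LSTM also reads the padding
    packed = pack_padded_sequence(padded, torch.tensor([2]), batch_first=True)
    _, (h_packed, _) = lstm(packed)   # the LSTM stops after 2 steps

print(torch.allclose(h_short, h_packed, atol=1e-6))
print(torch.allclose(h_short, h_padded))
```

Without packing, the final state is computed after the padding positions, so it differs from the state at the true end of the sequence.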

## Instantiating the model

The input and the output size are determined by the alphabet and the number of labels, the rest are up to us:

In [None]:
input_size = len(vocab)
embedding_size = 30
hidden_size = 64
output_size = train_df.label.nunique()

model = LSTMClassifier(input_size, embedding_size, hidden_size, output_size)
model

## Batching

Most datasets do not fit into the GPU memory so we process them in smaller chunks called _batches_.

There are many solutions for batching, but it can also be implemented with a simple class.

Note that `__init__` takes an arbitrary number of positional arguments (`*tensors`) and one mandatory keyword-only argument (`batch_size`). This class is a simplified version of batching: it lacks many features such as shuffling or sorting by sequence length.

In [None]:
class BatchedIterator:
    def __init__(self, *tensors, batch_size):
        # all tensors must have the same first dimension
        assert len(set(len(tensor) for tensor in tensors)) == 1
        self.tensors = tensors
        self.batch_size = batch_size
    
    def iterate_once(self):
        num_data = len(self.tensors[0])
        for start in range(0, num_data, self.batch_size):
            end = start + self.batch_size
            yield tuple(tensor[start:end] for tensor in self.tensors)
            
            
print("Two tensors:")
for batch in BatchedIterator([1, 2, 3], [2, 1, 2], batch_size=2).iterate_once():
    print(batch)
    
print("\nOne tensor:")
for batch in BatchedIterator([2, 1, 2], batch_size=2).iterate_once():
    print(batch)

In [None]:
train_iter = BatchedIterator(X_train, seqlen_train, y_train, batch_size=501)
for X, seqlens, y in train_iter.iterate_once():
    print(f"{X.size() = }, {seqlens.size() = }, {y.size() = }")
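
Shuffling, one of the missing features, takes only a few extra lines. The variant below is a hypothetical sketch that works on plain Python lists; the class name and the `seed` argument are not part of the lecture code:

```python
import random

class ShuffledBatchedIterator:
    def __init__(self, *sequences, batch_size, seed=None):
        # all sequences must have the same length
        assert len(set(len(seq) for seq in sequences)) == 1
        self.sequences = sequences
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def iterate_once(self):
        order = list(range(len(self.sequences[0])))
        self.rng.shuffle(order)  # a new order on every pass
        for start in range(0, len(order), self.batch_size):
            idx = order[start:start + self.batch_size]
            # apply the same indices to every sequence to keep them aligned
            yield tuple([seq[i] for i in idx] for seq in self.sequences)

it = ShuffledBatchedIterator([1, 2, 3, 4, 5], list("abcde"), batch_size=2, seed=0)
batches = list(it.iterate_once())
for batch in batches:
    print(batch)
```

For tensors, indexing with the whole index list at once (`tensor[idx]`) would keep each batch a tensor.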

## Loss function and optimizer

The **loss function** or **cost function** quantifies the cost of the model output differing from the expected target values.

The optimizer adjusts the model's parameters using the gradients of the loss.

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())
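
For a single example, cross-entropy is the negative log-probability that softmax assigns to the correct class. A pure Python sketch of this formula (the function name and the logits are illustrative):

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

uniform = cross_entropy([0.0, 0.0, 0.0], target=1)
confident_right = cross_entropy([0.0, 10.0, 0.0], target=1)
confident_wrong = cross_entropy([0.0, 10.0, 0.0], target=2)
print(f"{uniform = :.4f}, {confident_right = :.4f}, {confident_wrong = :.4f}")
```

`nn.CrossEntropyLoss` averages this value over the batch by default. With 18 equally likely labels the formula gives ln 18 ≈ 2.89, which is roughly what an untrained classifier should score.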

## Sanity check

Train and dev accuracy should be really bad without training.

Do **NOT** touch the test data during development or hyperparameter tuning.

In [None]:
logits = model(X_train, seqlen_train)
y = logits.argmax(axis=1)
accuracy = torch.sum(torch.eq(y, y_train)) / y.size(0)
print(f"Train accuracy: {accuracy:.1%}")

In [None]:
logits = model(X_dev, seqlen_dev)
y = logits.argmax(axis=1)
accuracy = torch.sum(torch.eq(y, y_dev)) / y.size(0)
print(f"Dev accuracy: {accuracy:.1%}")

## Training

We collect training statistics at the end of each epoch in `metrics`.

In [None]:
num_epochs = 30
batch_size = 128

metrics = defaultdict(list)
train_iter = BatchedIterator(X_train, seqlen_train, y_train, batch_size=batch_size)

We train the model batch by batch and then evaluate it on the train and the dev data at the end of each epoch.
Since the dataset is small, we can evaluate on the whole data in one step without batching.

Note that the model should be set to **train** or **eval** mode accordingly. Stochastic steps such as dropout are disabled in **eval** mode.

In [None]:
for epoch in range(num_epochs):
    model.train()
    # Training loop
    for X_batch, seqlen_batch, y_batch in train_iter.iterate_once():
        y_out = model(X_batch, seqlen_batch)
        loss = criterion(y_out, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
    model.eval()  # or model.train(False)
    # Train and dev loss at the end of the epoch
    y_out = model(X_train, seqlen_train)
    train_loss = criterion(y_out, y_train).item()
    metrics['train_loss'].append(train_loss)
    labels = y_out.argmax(axis=1)
    train_accuracy = (torch.eq(y_train, labels).sum() / labels.size(0)).item()
    metrics['train_accuracy'].append(train_accuracy)
    
    y_out = model(X_dev, seqlen_dev)
    dev_loss = criterion(y_out, y_dev).item()
    metrics['dev_loss'].append(dev_loss)
    labels = y_out.argmax(axis=1)
    dev_accuracy = (torch.eq(y_dev, labels).sum() / labels.size(0)).item()
    metrics['dev_accuracy'].append(dev_accuracy)
    
    print(f"{epoch=} -- {train_loss=:.3f} - {train_accuracy=:.1%} - {dev_loss=:.3f} - {dev_accuracy=:.1%}")

## Evaluation

### Training curves

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(16, 4))

sns.lineplot(data=metrics['train_loss'], ax=ax[0], label='train loss')
sns.lineplot(data=metrics['dev_loss'], ax=ax[0], label='dev loss')

sns.lineplot(data=metrics['train_accuracy'], ax=ax[1], label='train acc')
sns.lineplot(data=metrics['dev_accuracy'], ax=ax[1], label='dev acc')

## Test accuracy

In [None]:
logits = model(X_test, seqlen_test)
test_prediction = logits.argmax(axis=1)
test_accuracy = torch.sum(torch.eq(test_prediction, y_test)) / y_test.size(0)
print(f"Test accuracy: {test_accuracy:.1%}")

## Incorrectly classified examples

In [None]:
# convert to a NumPy array so that the column holds plain integers
test_df['prediction'] = test_prediction.numpy()

### Recovering labels

In [None]:
id_to_label = {i: l for l, i in label_to_id.items()}
test_df['predicted_case'] = test_df['prediction'].apply(lambda id_: id_to_label[id_])

In [None]:
test_df[test_df.prediction != test_df.label][['infl', 'case', 'predicted_case']]

# Further topics

## Early stopping

Stop the training process if the development metrics no longer improve.
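
A common way to implement this is with a patience counter: training stops once the dev loss has not improved for a given number of consecutive epochs. The function name, the patience value and the loss values below are invented for illustration:

```python
def early_stopping_epoch(dev_losses, patience=3):
    """Return the index of the epoch at which training would stop,
    or None if the patience is never exhausted."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(dev_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

dev_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.9]  # made-up values
stop_at = early_stopping_epoch(dev_losses, patience=3)
print(f"{stop_at = }")
```

In a real training loop the same check would run at the end of each epoch, usually together with saving the parameters of the best model so far.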

## Dropout

Dropout disables a random subset of neurons during each training step. It's generally set to 10-20%. Dropout usually improves generalization.

It should be disabled in evaluation steps and during inference.
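
The difference between the two modes is easy to observe on a standalone `nn.Dropout` layer. A minimal sketch with an illustrative seed and input:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.2)
x = torch.ones(1000)

dropout.train()
train_out = dropout(x)  # about 20% zeros, the rest scaled by 1 / (1 - p)

dropout.eval()
eval_out = dropout(x)   # identity in eval mode

print(f"{(train_out == 0).float().mean().item() = :.3f}")
print(f"{torch.equal(eval_out, x) = }")
```

The rescaling in train mode keeps the expected value of each activation unchanged, so no correction is needed at inference time.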

## Overfitting

A model overfits when it fits the training set very closely but loses its ability to generalize to unseen data.

[Image source](https://www.quora.com/What-are-the-key-trade-offs-between-overfitting-and-underfitting)

<img src="img/dl/overfitting.png" />

## Using the GPU

Moving things manually to the GPU:
- model: move once
- criterion: move once
- data: move one batch at a time

This should be automatically handled by your code the following way:

In [None]:
use_cuda = torch.cuda.is_available()
print(use_cuda)

In [None]:
if use_cuda:
    model = model.cuda()
    criterion = criterion.cuda()
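
An equivalent pattern uses a `torch.device` object and `.to(device)`, so the same code runs with or without a GPU. In this sketch the linear layer and the random batch are stand-ins for the real model and data:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(4, 2).to(device)            # stand-in for the classifier, moved once
criterion = nn.CrossEntropyLoss().to(device)  # moved once

# Data is moved one batch at a time (hypothetical batch)
X_batch = torch.rand(8, 4).to(device)
y_batch = torch.randint(2, (8,)).to(device)
loss = criterion(model(X_batch), y_batch)
print(loss.item())
```

Note that `pack_padded_sequence` expects the sequence lengths on the CPU, so in the classifier above the `seqlen` tensors would stay where they are.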

## Saving and loading models

All `nn.Module`s have a `state_dict()` method that returns a dictionary of their parameters and buffers. It can be saved, fully or partially, with `torch.save` and read back with `torch.load`; `load_state_dict` then copies the values into a model.

[Official tutorial](https://pytorch.org/tutorials/beginner/saving_loading_models.html)
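
A minimal round trip looks like this. The linear layer is a stand-in for the classifier, and the in-memory buffer is used only to keep the sketch self-contained; a file path would normally take its place:

```python
import io
import torch
import torch.nn as nn

model = nn.Linear(3, 2)  # stand-in for a trained model

buffer = io.BytesIO()    # in-memory file, used here instead of a path
torch.save(model.state_dict(), buffer)

buffer.seek(0)
restored = nn.Linear(3, 2)  # must have the same architecture
restored.load_state_dict(torch.load(buffer))

same = all(
    torch.equal(a, b)
    for a, b in zip(model.state_dict().values(), restored.state_dict().values())
)
print(f"{same = }")
```

Only the parameter values are stored, so the model class has to be instantiated with the same sizes before loading.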

## Inference

**Inference** is when we use the model for prediction and do not train it.

The models should be set to `eval` mode.