<img src="https://parl.ai/docs/_static/img/parlai.png" width="700"/>

**Author**: Stephen Roller ([GitHub](https://github.com/stephenroller), [Twitter](https://twitter.com/stephenroller))


# Welcome to the ParlAI interactive tutorial

In this tutorial we will:

- Chat with a neural network model!
- Show how to use common commands in ParlAI, like inspecting data and model outputs.
- See where to find information about many options.
- Show how to fine-tune a pretrained model on a specific task.
- Add our own datasets to ParlAI.
- Add our own models to ParlAI.

We won't run any examples that use Amazon Mechanical Turk or connect to chat services, but you can check out our [docs](https://parl.ai/docs/) for more information on these areas.

**Note:** *Make sure you're running this session with a GPU attached.*

In [1]:
!nvidia-smi

Sun Feb  7 12:56:02 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   45C    P8     9W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Installing parlai

We need to install ParlAI. Since we're in Google Colab, we can assume PyTorch and similar dependencies are already installed.

In [2]:
!pip install -q parlai
!pip install -q subword_nmt # extra requirement we need for this tutorial


# Chatting with a model

Let's start by chatting interactively with a model file from our model zoo! We'll pick our "tutorial transformer generator" model, which is a generative transformer trained on pushshift.io Reddit. You can take a look at the [model zoo](https://parl.ai/docs/zoo.html) for a more complete list.

In [3]:
# Import the Interactive script
from parlai.scripts.interactive import Interactive

# call it with particular args
Interactive.main(
    # the model_file is a filename path pointing to a particular model dump.
    # Model files that begin with "zoo:" are special files distributed by the ParlAI team.
    # They'll be automatically downloaded when you ask to use them.
    model_file='zoo:tutorial_transformer_generator/model'
)



12:57:21 | building data: /usr/local/lib/python3.6/dist-packages/data/models/tutorial_transformer_generator/tutorial_transformer_generator_v1.tar.gz
12:57:21 | Downloading http://parl.ai/downloads/_models/tutorial_transformer_generator/tutorial_transformer_generator_v1.tar.gz to /usr/local/lib/python3.6/dist-packages/data/models/tutorial_transformer_generator/tutorial_transformer_generator_v1.tar.gz


Downloading tutorial_transformer_generator_v1.tar.gz: 100%|██████████| 1.12G/1.12G [00:15<00:00, 71.8MB/s]


12:57:55 | Overriding opt["model_file"] to /usr/local/lib/python3.6/dist-packages/data/models/tutorial_transformer_generator/model (previously: /checkpoint/roller/20190909/cleanreddit/585/model)
12:57:55 | Loading model with `--beam-block-full-context false`
12:57:55 | Using CUDA
12:57:55 | You set --fp16 true with --fp16-impl apex, but fp16 with apex is unavailable. To use apex fp16, please install APEX from https://github.com/NVIDIA/apex.
12:57:55 | loading dictionary from /usr/local/lib/python3.6/dist-packages/data/models/tutorial_transformer_generator/model.dict
12:57:55 | num words = 54944
12:57:56 | TransformerGenerator: full interactive mode on.
12:57:57 | DEPRECATED: XLM should only be used for backwards compatibility, as it involves a less-stable layernorm operation.
12:58:08 | Total parameters: 87,508,992 (87,508,992 trainable)
12:58:08 | Loading existing model params from /usr/local/lib/python3.6/dist-packages/data/models/tutorial_transfor

The same on the command line:
```bash
python -m parlai.scripts.interactive --model-file zoo:tutorial_transformer_generator/model
```

# Taking a look at some data

We can also look into a specific dataset. Let's look at the "empathetic dialogues" dataset, which aims to teach models how to respond with text expressing the appropriate emotion. We have over 80 existing datasets in ParlAI. You can see the full list in our [task list](https://parl.ai/docs/tasks.html).

In [4]:
# The display_data script is used to show the contents of a particular task.
# By default, we show the train set.
from parlai.scripts.display_data import DisplayData

DisplayData.main(task='empathetic_dialogues', num_examples=5)

13:02:12 | Opt:
13:02:12 |     allow_missing_init_opts: False
13:02:12 |     batchsize: 1
13:02:12 |     datapath: /usr/local/lib/python3.6/dist-packages/data
13:02:12 |     datatype: train:ordered
13:02:12 |     dict_class: None
13:02:12 |     display_ignore_fields: agent_reply
13:02:12 |     display_verbose: False
13:02:12 |     download_path: None
13:02:12 |     dynamic_batching: None
13:02:12 |     hide_labels: False
13:02:12 |     image_cropsize: 224
13:02:12 |     image_mode: raw
13:02:12 |     image_size: 256
13:02:12 |     init_model: None
13:02:12 |     init_opt: None
13:02:12 |     loglevel: info
13:02:12 |     max_display_len: 1000
13:02:12 |     model: None
13:02:12 |     model_file: None
13:02:12 |     multitask_weights: [1]
13:02:12 |     num_examples: 5
13:02:12 |     override: "{'task': 'empathetic_dialogues', 'num_examples': 5}"
13:02:12 |     parlai_home: /usr/local/lib/python3.6/dist-packages
13:02:12 |     remove_political_convos: False
13:02:12 |     starttime: Feb

Downloading empatheticdialogues.tar.gz: 100%|██████████| 28.0M/28.0M [00:00<00:00, 28.6MB/s]


- - - NEW EPISODE: empathetic_dialogues - - -
I remember going to see the fireworks with my best friend. It was the first time we ever spent time alone together. Although there was a lot of people, we felt like the only people in the world.
   Was this a friend you were in love with, or just a best friend?
This was a best friend. I miss her.
   Where has she gone?
We no longer talk.
   Oh was this something that happened because of an argument?
- - - NEW EPISODE: empathetic_dialogues - - -
Was this a friend you were in love with, or just a best friend?
   This was a best friend. I miss her.
Where has she gone?
   We no longer talk.
13:02:15 | loaded 39057 episodes with a total of 64636 examples


The unindented text (black in the notebook) is the _prompt_, while the indented text (blue in the notebook) is the _label_. That is, the label is what we will be training the model to mimic.
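
To make the prompt/label pairing concrete, here is the first episode from the output above written out as plain Python pairs. This is only an illustrative sketch of the data's shape, not ParlAI's internal message format; the `episode` variable name is an assumption made for this example.

```python
# The first episode printed above, as (prompt, label) pairs.
# The model sees the prompt and is trained to produce the label.
episode = [
    (
        "I remember going to see the fireworks with my best friend. "
        "It was the first time we ever spent time alone together. "
        "Although there was a lot of people, we felt like the only people in the world.",
        "Was this a friend you were in love with, or just a best friend?",
    ),
    ("This was a best friend. I miss her.", "Where has she gone?"),
    (
        "We no longer talk.",
        "Oh was this something that happened because of an argument?",
    ),
]

for turn, (prompt, label) in enumerate(episode, start=1):
    print(f"turn {turn}")
    print(f"  prompt: {prompt}")
    print(f"  label:  {label}")
```

Note that the second episode in the output replays the same conversation from the other speaker's side, so each utterance serves as a label in one episode and as a prompt in the other.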

We can also ask to see fewer examples, and get them from the validation set instead.

In [5]:
# we can instead ask to see fewer examples, and get them from the valid set.
DisplayData.main(task='empathetic_dialogues', num_examples=3, datatype='valid')

13:03:12 | Opt:
13:03:12 |     allow_missing_init_opts: False
13:03:12 |     batchsize: 1
13:03:12 |     datapath: /usr/local/lib/python3.6/dist-packages/data
13:03:12 |     datatype: valid
13:03:12 |     dict_class: None
13:03:12 |     display_ignore_fields: agent_reply
13:03:12 |     display_verbose: False
13:03:12 |     download_path: None
13:03:12 |     dynamic_batching: None
13:03:12 |     hide_labels: False
13:03:12 |     image_cropsize: 224
13:03:12 |     image_mode: raw
13:03:12 |     image_size: 256
13:03:12 |     init_model: None
13:03:12 |     init_opt: None
13:03:12 |     loglevel: info
13:03:12 |     max_display_len: 1000
13:03:12 |     model: None
13:03:12 |     model_file: None
13:03:12 |     multitask_weights: [1]
13:03:12 |     num_examples: 3
13:03:12 |     override: "{'task': 'empathetic_dialogues', 'num_examples': 3, 'datatype': 'valid'}"
13:03:12 |     parlai_home: /usr/local/lib/python3.6/dist-packages
13:03:12 |     remove_political_convos: False
13:03:12 |     s

On the command line:
```bash
python -m parlai.scripts.display_data --task empathetic_dialogues
```
or a bit shorter:
```bash
python -m parlai.scripts.display_data -t empathetic_dialogues
```

# Training a model

Well it's one thing looking at data, but what if we want to train our own model (from scratch)? Let's train a very simple seq2seq LSTM with attention, to respond to empathetic dialogues.

We could get some extra performance by initializing with pretrained word embeddings, but they take a long time to download, so we skip them here. We will also cap the training time to 2 minutes for this tutorial. The model won't perform very well, but that's okay.

In [6]:
# we'll save it in the "from_scratch_model" directory
!rm -rf from_scratch_model
!mkdir -p from_scratch_model

from parlai.scripts.train_model import TrainModel
TrainModel.main(
    # we MUST provide a filename
    model_file='from_scratch_model/model',
    # train on empathetic dialogues
    task='empathetic_dialogues',
    # limit training time to 2 minutes, and a batchsize of 16
    max_train_time=2 * 60,
    batchsize=16,
    
    # we specify the model type as seq2seq
    model='seq2seq',
    # some hyperparameter choices. We'll use attention. We could use pretrained
    # embeddings too, with embedding_type='fasttext', but they take a long
    # time to download.
    attention='dot',
    # tie the word embeddings of the encoder/decoder/softmax.
    lookuptable='all',
    # truncate text and labels at 64 tokens, for memory and time savings
    truncate=64,
)

13:03:31 | building dictionary first...
13:03:31 | Opt:
13:03:31 |     adafactor_eps: '(1e-30, 0.001)'
13:03:31 |     adam_eps: 1e-08
13:03:31 |     add_p1_after_newln: False
13:03:31 |     aggregate_micro: False
13:03:31 |     allow_missing_init_opts: False
13:03:31 |     attention: dot
13:03:31 |     attention_length: 48
13:03:31 |     attention_time: post
13:03:31 |     batchsize: 1
13:03:31 |     beam_block_full_context: True
13:03:31 |     beam_block_list_filename: None
13:03:31 |     beam_block_ngram: -1
13:03:31 |     beam_context_block_ngram: -1
13:03:31 |     beam_delay: 30
13:03:31 |     beam_length_penalty: 0.65
13:03:31 |     beam_min_length: 1
13:03:31 |     beam_size: 1
13:03:31 |     betas: '(0.9, 0.999)'
13:03:31 |     bidirectional: False
13:03:31 |     bpe_add_prefix_space: None
13:03:31 |     bpe_debug: False
13:03:31 |     bpe_merge: None
13:03:31 |     bpe_vocab: None
13:03:31 |     compute_tokenized_bleu: False
13:03:31 |     datapath: /usr/local/lib/python3.6/dis

Building dictionary: 100%|██████████| 64.6k/64.6k [00:02<00:00, 28.0kex/s]


13:03:34 | Saving dictionary to from_scratch_model/model.dict
13:03:34 | dictionary built with 22419 tokens in 0.0s
13:03:34 | No model with opt yet at: from_scratch_model/model(.opt)
13:03:34 | Using CUDA
13:03:34 | loading dictionary from from_scratch_model/model.dict
13:03:34 | num words = 22419
13:03:34 | Total parameters: 3,453,203 (3,453,203 trainable)
13:03:34 | Opt:
13:03:34 |     adafactor_eps: '(1e-30, 0.001)'
13:03:34 |     adam_eps: 1e-08
13:03:34 |     add_p1_after_newln: False
13:03:34 |     aggregate_micro: False
13:03:34 |     allow_missing_init_opts: False
13:03:34 |     attention: dot
13:03:34 |     attention_length: 48
13:03:34 |     attention_time: post
13:03:34 |     batchsize: 16
13:03:34 |     beam_block_full_context: True
13:03:34 |     beam_block_list_filename: None
13:03:34 |     beam_block_ngram: -1
13:03:34 |     beam_context_block_ngram: -1
13:03:34 |     beam_delay: 30
13:03:34 |     beam_length_penalty: 0.65
13:03:34 |     beam_min_length: 1
13:03:34 |   

({'accuracy': ExactMatchMetric(0),
  'bleu-4': BleuMetric(4.798e-05),
  'ctpb': GlobalAverageMetric(572.6),
  'ctps': GlobalTimerMetric(3748),
  'exps': GlobalTimerMetric(104.3),
  'exs': SumMetric(5738),
  'f1': F1Metric(0.1292),
  'gpu_mem': GlobalAverageMetric(0.0009806),
  'loss': AverageMetric(6.636),
  'lr': GlobalAverageMetric(1),
  'ltpb': GlobalAverageMetric(249.1),
  'ltps': GlobalTimerMetric(1631),
  'ppl': PPLMetric(761.8),
  'token_acc': AverageMetric(0.2078),
  'total_train_updates': GlobalFixedMetric(1398),
  'tpb': GlobalAverageMetric(821.7),
  'tps': GlobalTimerMetric(5379)},
 {'accuracy': ExactMatchMetric(0),
  'bleu-4': BleuMetric(1.242e-06),
  'ctpb': GlobalAverageMetric(604.5),
  'ctps': GlobalTimerMetric(3872),
  'exps': GlobalTimerMetric(102.4),
  'exs': SumMetric(5259),
  'f1': F1Metric(0.1272),
  'gpu_mem': GlobalAverageMetric(0.0009466),
  'loss': AverageMetric(6.643),
  'lr': GlobalAverageMetric(1),
  'ltpb': GlobalAverageMetric(252.6),
  'ltps': GlobalTimerM

Our perplexity and F1 (word overlap) scores are pretty bad, and our BLEU-4 score is nearly 0. That's okay: we would normally want to train for well over an hour. Feel free to change `max_train_time` above.
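
To get a feel for these numbers, the sketch below reproduces two of them by hand. Perplexity is the exponential of the average per-token loss, so the reported `loss` of 6.636 should land near the reported `ppl` of 761.8. The F1 function is a simplified word-overlap score that assumes lowercasing and whitespace tokenization; ParlAI's own metric applies additional text normalization, and `unigram_f1` is a hypothetical helper name used only here.

```python
import math
from collections import Counter

# Perplexity is exp(average per-token loss).
loss = 6.636  # 'loss' from the first metrics dict above
ppl = math.exp(loss)
print(f"ppl from loss: {ppl:.1f}")


def unigram_f1(prediction: str, label: str) -> float:
    """Simplified word-overlap F1 between a prediction and a label."""
    pred_tokens = prediction.lower().split()
    label_tokens = label.lower().split()
    # Multiset intersection counts each shared word at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(label_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(label_tokens)
    return 2 * precision * recall / (precision + recall)


print(unigram_f1("where did she go", "Where has she gone?"))
```

A lower perplexity means the model assigns higher probability to the true labels, while a higher F1 means the generated text shares more words with them.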

## Performance is pretty bad there. Can we improve it?

The easiest way to improve it is to *initialize* using a *pretrained model*, utilizing *transfer learning*. Let's use the one from the interactive session at the beginning of the tutorial!

In [8]:
!rm -rf from_pretrained
!mkdir -p from_pretrained

TrainModel.main(
    # similar to before
    task='empathetic_dialogues', 
    model='transformer/generator',
    model_file='from_pretrained/model',
    
    # initialize with a pretrained model
    init_model='zoo:tutorial_transformer_generator/model',
    
    # arguments we get from the pretrained model.
    # Unfortunately, these must be looked up separately for each model.
    n_heads=16, n_layers=8, n_positions=512, text_truncate=512,
    label_truncate=128, ffn_size=2048, embedding_size=512,
    activation='gelu', variant='xlm',
    dict_lower=True, dict_tokenizer='bpe',
    dict_file='zoo:tutorial_transformer_generator/model.dict',
    learn_positional_embeddings=True,
    
    # some training arguments, specific to this fine-tuning
    # use a small learning rate with ADAM optimizer
    lr=1e-5, optimizer='adam',
    warmup_updates=100,
    # early stopping on perplexity
    validation_metric='ppl',
    # train at most 10 minutes, and validate every 0.25 epochs
    max_train_time=600, validation_every_n_epochs=0.25,
    
    # these depend on your GPU. If you have a V100, this is good
    batchsize=12, fp16=True, fp16_impl='mem_efficient',
    
    # speeds up validation
    skip_generation=True,
    
    # helps us cram more examples into our gpu at a time
    dynamic_batching='full',
)

13:12:34 | building dictionary first...
13:12:34 | No model with opt yet at: from_pretrained/model(.opt)
13:12:34 | [33myour model is being loaded with opts that do not exist in the model you are initializing the weights with: allow_missing_init_opts: False,download_path: None,loglevel: info,dynamic_batching: full,datapath: /usr/local/lib/python3.6/dist-packages/data,tensorboard_logdir: None,train_experiencer_only: False,remove_political_convos: False,n_encoder_layers: -1,n_decoder_layers: -1,model_parallel: False,beam_block_full_context: True,beam_length_penalty: 0.65,topk: 10,topp: 0.9,beam_delay: 30,beam_block_list_filename: None,temperature: 1.0,compute_tokenized_bleu: False,interactive_mode: False,fp16_impl: mem_efficient,force_fp16_tokens: False,adafactor_eps: (1e-30, 0.001),history_reversed: False,history_add_global_end_token: None,special_tok_lst: None,bpe_vocab: None,bpe_merge: None,bpe_add_prefix_space: None,hf_skip_special_tokens: True,max_lr_steps: -1,invsqrt_lr_decay_gamm

	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, *, Number alpha) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:882.)
  exp_avg.mul_(beta1).add_(1 - beta1, grad)


13:12:47 | Overflow: setting loss scale to 32768.0
13:12:50 | Overflow: setting loss scale to 16384.0
13:12:51 | time:10s total_exs:3200 epochs:0.05
    clip  ctpb  ctps  exps  exs  fp16_loss_scalar  gnorm  gpu_mem  loss        lr  ltpb  ltps   ppl  token_acc  \
   .9000  2627  8088 328.4 3200             48606  3.767    .5154 2.844 3.001e-06  1716  5284 17.18      .3946   
    total_train_updates  tpb   tps  ups  
                     30 4342 13372 3.08

13:13:01 | time:20s total_exs:6296 epochs:0.10
    clip  ctpb  ctps  exps  exs  fp16_loss_scalar  gnorm  gpu_mem  loss      lr  ltpb  ltps   ppl  token_acc  \
       1  3100  9670 301.8 3096             16384  4.167    .5105 2.811 6.2e-06  1634  5096 16.63      .4004   
    total_train_updates  tpb   tps   ups  
                     62 4734 14766 3.119

13:13:12 | time:31s total_exs:9416 epochs:0.15
    clip  ctpb  ctps  exps  exs  fp16_loss_scalar  gnorm  gpu_mem  loss      lr  ltpb  ltps   ppl  token_acc  \
       1  3119  9844 307.



13:13:58 | time:77s total_exs:19576 epochs:0.30
    clip  ctpb  ctps  exps  exs  fp16_loss_scalar  gnorm  gpu_mem  loss    lr  ltpb  ltps   ppl  token_acc  \
       1  3111  9679 318.8 3280             16384  4.148    .4954 2.693 1e-05  1629  5067 14.78      .4122   
    total_train_updates  tpb   tps   ups  
                    197 4741 14746 3.111

13:14:08 | time:87s total_exs:22604 epochs:0.35
    clip  ctpb  ctps  exps  exs  fp16_loss_scalar  gnorm  gpu_mem  loss    lr  ltpb  ltps   ppl  token_acc  \
       1  3014  9501 298.2 3028             16384  4.057    .4841 2.716 1e-05  1653  5211 15.12      .4095   
    total_train_updates  tpb   tps   ups  
                    229 4668 14712 3.152

13:14:18 | time:98s total_exs:25640 epochs:0.40
    clip  ctpb  ctps  exps  exs  fp16_loss_scalar  gnorm  gpu_mem  loss    lr  ltpb  ltps   ppl  token_acc  \
       1  2921  9168 297.8 3036             16384  4.417    .5422 2.694 1e-05  1533  4811 14.79      .4144   
    total_train_updates  t



13:15:03 | time:142s total_exs:35556 epochs:0.55
    clip  ctpb  ctps  exps  exs  fp16_loss_scalar  gnorm  gpu_mem  loss    lr  ltpb  ltps   ppl  token_acc  \
       1  3093  9168 297.6 3012             16384  3.898    .5154 2.663 1e-05  1683  4989 14.34      .4167   
    total_train_updates  tpb   tps   ups  
                    367 4775 14157 2.965

13:15:14 | time:153s total_exs:38632 epochs:0.60
    clip  ctpb  ctps  exps  exs  fp16_loss_scalar  gnorm  gpu_mem  loss    lr  ltpb  ltps   ppl  token_acc  \
       1  2966  9232 299.2 3076             16384   4.32    .5050 2.664 1e-05  1586  4935 14.35      .4155   
    total_train_updates  tpb   tps   ups  
                    399 4552 14168 3.112

13:15:24 | time:163s total_exs:41700 epochs:0.65
    clip  ctpb  ctps  exps  exs  fp16_loss_scalar  gnorm  gpu_mem  loss    lr  ltpb  ltps   ppl  token_acc  \
       1  2613  8238 302.3 3068             16384  4.133    .5238 2.673 1e-05  1573  4958 14.49      .4148   
    total_train_updates



13:16:11 | time:210s total_exs:51648 epochs:0.80
    clip  ctpb  ctps  exps  exs  fp16_loss_scalar  gnorm  gpu_mem  loss    lr  ltpb  ltps   ppl  token_acc  \
       1  2722  8253 286.7 2932             16384   4.08    .5237 2.663 1e-05  1667  5053 14.34      .4140   
    total_train_updates  tpb   tps   ups  
                    541 4389 13307 3.032

13:16:21 | time:220s total_exs:54516 epochs:0.84
    clip  ctpb  ctps  exps  exs  fp16_loss_scalar  gnorm  gpu_mem  loss    lr  ltpb  ltps   ppl  token_acc  \
       1  2967  9340 282.2 2868             16384  4.283    .4954 2.656 1e-05  1509  4752 14.25      .4165   
    total_train_updates  tpb   tps   ups  
                    573 4476 14092 3.149

13:16:31 | time:230s total_exs:57472 epochs:0.89
    clip  ctpb  ctps  exps  exs  fp16_loss_scalar  gnorm  gpu_mem  loss    lr  ltpb  ltps   ppl  token_acc  \
       1  3057  9445 294.6 2956             16384  4.188    .5190 2.628 1e-05  1544  4771 13.85      .4230   
    total_train_updates



13:17:16 | time:275s total_exs:67624 epochs:1.05
    clip  ctpb  ctps  exps  exs  fp16_loss_scalar  gnorm  gpu_mem  loss    lr  ltpb  ltps   ppl  token_acc  \
       1  3077  9439 271.9 2748             16384  4.161    .5105 2.633 1e-05  1573  4825 13.91      .4191   
    total_train_updates  tpb   tps   ups  
                    709 4650 14264 3.067

13:17:26 | time:285s total_exs:70792 epochs:1.10
    clip  ctpb  ctps  exps  exs  fp16_loss_scalar  gnorm  gpu_mem  loss    lr  ltpb  ltps   ppl  token_acc  \
       1  3016  9199 311.7 3168             16384  4.327    .4954 2.608 1e-05  1682  5131 13.57      .4235   
    total_train_updates  tpb   tps  ups  
                    740 4698 14330 3.05

13:17:37 | time:296s total_exs:73588 epochs:1.14
    clip  ctpb  ctps  exps  exs  fp16_loss_scalar  gnorm  gpu_mem  loss    lr  ltpb  ltps   ppl  token_acc  \
       1  3097  9642   272 2796             16384  4.358    .4982 2.682 1e-05  1453  4522 14.62      .4153   
    total_train_updates  



13:18:24 | time:343s total_exs:84048 epochs:1.30
    clip  ctpb  ctps  exps  exs  fp16_loss_scalar  gnorm  gpu_mem  loss    lr  ltpb  ltps  ppl  token_acc  \
       1  3026  9261 296.2 3000             16384  4.029    .5232 2.617 1e-05  1609  4923 13.7      .4230   
    total_train_updates  tpb   tps   ups  
                    883 4634 14185 3.061

13:18:34 | time:353s total_exs:87300 epochs:1.35
    clip  ctpb  ctps  exps  exs  fp16_loss_scalar  gnorm  gpu_mem  loss    lr  ltpb  ltps   ppl  token_acc  \
       1  2879  9035 318.9 3252             16384   3.94    .5050 2.628 1e-05  1665  5226 13.84      .4198   
    total_train_updates  tpb   tps   ups  
                    915 4544 14261 3.139

13:18:44 | time:363s total_exs:90128 epochs:1.39
    clip  ctpb  ctps  exps  exs  fp16_loss_scalar  gnorm  gpu_mem  loss    lr  ltpb  ltps   ppl  token_acc  \
       1  2962  8800 280.1 2828             16384  4.096    .5105 2.604 1e-05  1558  4628 13.51      .4252   
    total_train_updates  



13:19:30 | time:409s total_exs:100352 epochs:1.55
    clip  ctpb  ctps  exps  exs  fp16_loss_scalar  gnorm  gpu_mem  loss    lr  ltpb  ltps   ppl  token_acc  \
       1  3141  9701 302.1 3032             16384  4.062    .5233 2.585 1e-05  1634  5047 13.27      .4299   
    total_train_updates  tpb   tps   ups  
                   1051 4774 14748 3.089

13:19:40 | time:419s total_exs:103096 epochs:1.60
    clip  ctpb  ctps  exps  exs  fp16_loss_scalar  gnorm  gpu_mem  loss    lr  ltpb  ltps   ppl  token_acc  \
       1  3035  9628   272 2744             16384  4.253    .5003 2.571 1e-05  1442  4574 13.07      .4282   
    total_train_updates  tpb   tps   ups  
                   1083 4477 14202 3.172

13:19:50 | time:430s total_exs:106132 epochs:1.64
    clip  ctpb  ctps  exps  exs  fp16_loss_scalar  gnorm  gpu_mem  loss    lr  ltpb  ltps   ppl  token_acc  \
       1  2538  7704 297.3 3036             16384  4.336    .5322 2.613 1e-05  1694  5143 13.65      .4243   
    total_train_upda



13:20:38 | time:477s total_exs:116680 epochs:1.81
    clip  ctpb  ctps  exps  exs  fp16_loss_scalar  gnorm  gpu_mem  loss    lr  ltpb  ltps   ppl  token_acc  \
       1  2916  9091 298.8 3068             16384  4.167    .5233 2.582 1e-05  1581  4927 13.23      .4249   
    total_train_updates  tpb   tps   ups  
                   1226 4497 14018 3.117

13:20:48 | time:487s total_exs:119888 epochs:1.85
    clip  ctpb  ctps  exps  exs  fp16_loss_scalar  gnorm  gpu_mem  loss    lr  ltpb  ltps   ppl  token_acc  \
       1  2945  9411 320.3 3208             16384  4.202    .4841 2.609 1e-05  1663  5314 13.58      .4223   
    total_train_updates  tpb   tps   ups  
                   1258 4608 14725 3.196

13:20:58 | time:497s total_exs:122860 epochs:1.90
    clip  ctpb  ctps  exps  exs  fp16_loss_scalar  gnorm  gpu_mem  loss    lr  ltpb  ltps   ppl  token_acc  \
       1  2776  8492 293.3 2972             16384  4.331    .5105 2.608 1e-05  1605  4909 13.57      .4201   
    total_train_upda



13:21:43 | time:542s total_exs:132696 epochs:2.05
    clip  ctpb  ctps  exps  exs  fp16_loss_scalar  gnorm  gpu_mem  loss    lr  ltpb  ltps   ppl  token_acc  \
       1  3125 10269 290.3 2916             16384  4.297    .4954 2.564 1e-05  1436  4719 12.99      .4314   
    total_train_updates  tpb   tps   ups  
                   1397 4562 14988 3.286

13:21:54 | time:553s total_exs:135736 epochs:2.10
    clip  ctpb  ctps  exps  exs  fp16_loss_scalar  gnorm  gpu_mem  loss    lr  ltpb  ltps   ppl  token_acc  \
       1  2895  8960   294 3040             16384  4.092    .5105 2.597 1e-05  1630  5045 13.42      .4218   
    total_train_updates  tpb   tps   ups  
                   1429 4525 14005 3.095

13:22:04 | time:563s total_exs:138700 epochs:2.15
    clip  ctpb  ctps  exps  exs  fp16_loss_scalar  gnorm  gpu_mem  loss    lr  ltpb  ltps   ppl  token_acc  \
       1  2994  9514 294.3 2964             16384  4.335    .5190  2.57 1e-05  1531  4864 13.06      .4261   
    total_train_upda



13:22:41 | max_train_time elapsed:600.173770904541s
13:22:41 | Overriding opt["init_model"] to zoo:tutorial_transformer_generator/model (previously: /usr/local/lib/python3.6/dist-packages/data/models/tutorial_transformer_generator/model)
13:22:41 | Overriding opt["optimizer"] to adam (previously: mem_eff_adam)
13:22:41 | [33myour model is being loaded with opts that do not exist in the model you are initializing the weights with: allow_missing_init_opts: False,loglevel: info,dynamic_batching: full,tensorboard_logdir: None,train_experiencer_only: False,remove_political_convos: False,n_encoder_layers: -1,n_decoder_layers: -1,model_parallel: False,beam_block_full_context: True,beam_length_penalty: 0.65,topk: 10,topp: 0.9,beam_delay: 30,beam_block_list_filename: None,temperature: 1.0,compute_tokenized_bleu: False,fp16_impl: mem_efficient,force_fp16_tokens: True,adafactor_eps: 1e-30,0.001,history_reversed: False,history_add_global_end_token: None,special_tok_lst: None,bpe

({'ctpb': GlobalAverageMetric(3413),
  'ctps': GlobalTimerMetric(3.555e+04),
  'exps': GlobalTimerMetric(878.8),
  'exs': SumMetric(5738),
  'gpu_mem': GlobalAverageMetric(0.07776),
  'loss': AverageMetric(2.45),
  'lr': GlobalAverageMetric(1e-05),
  'ltpb': GlobalAverageMetric(1351),
  'ltps': GlobalTimerMetric(1.407e+04),
  'ppl': PPLMetric(11.58),
  'token_acc': AverageMetric(0.4441),
  'total_train_updates': GlobalFixedMetric(1538),
  'tpb': GlobalAverageMetric(4764),
  'tps': GlobalTimerMetric(4.962e+04)},
 {'ctpb': GlobalAverageMetric(3590),
  'ctps': GlobalTimerMetric(3.634e+04),
  'exps': GlobalTimerMetric(831.8),
  'exs': SumMetric(5259),
  'gpu_mem': GlobalAverageMetric(0.07778),
  'loss': AverageMetric(2.471),
  'lr': GlobalAverageMetric(1e-05),
  'ltpb': GlobalAverageMetric(1334),
  'ltps': GlobalTimerMetric(1.35e+04),
  'ppl': PPLMetric(11.84),
  'token_acc': AverageMetric(0.4411),
  'total_train_updates': GlobalFixedMetric(1538),
  'tpb': GlobalAverageMetric(4923),
  'tps
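
The `lr` column in the training log above climbs from about 3e-06 at update 30 to 1e-05 by update 197, which is the effect of `warmup_updates=100`. Below is a minimal sketch of a linear warmup, assuming the learning rate scales proportionally with the update count until it reaches the base value. This is a simplification rather than ParlAI's exact scheduler code, and `warmup_lr` is a hypothetical helper name.

```python
def warmup_lr(step: int, base_lr: float = 1e-5, warmup_updates: int = 100) -> float:
    """Linearly ramp the learning rate up to base_lr over warmup_updates steps."""
    progress = min(1.0, step / warmup_updates)
    return base_lr * progress


# Update counts taken from the 'total_train_updates' column in the log above.
for step in (30, 62, 197):
    print(step, warmup_lr(step))
```

Starting with a small learning rate helps keep the pretrained weights from being disrupted by large, noisy updates in the first few batches of fine-tuning.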

## Wow, that's a lot of options! Where do I find more info?

As you might have noticed, ParlAI has a LOT of options. Your best bet is to read the [ParlAI docs](https://parl.ai/docs), which list the hyperparameters and command-line args for the models.

You can get some guidance in this notebook by using:

In [7]:
# note that if you want to see model-specific arguments, you must specify a model name
print(TrainModel.help(model='seq2seq'))

usage: TrainModel [-h] [--helpall] [-o INIT_OPT]
                  [--allow-missing-init-opts ALLOW_MISSING_INIT_OPTS]
                  [-t TASK] [-dt DATATYPE] [-bs BATCHSIZE]
                  [-dynb {None,batchsort,full}] [-dp DATAPATH] [-m MODEL]
                  [-mf MODEL_FILE] [-im INIT_MODEL] [-et EVALTASK]
                  [-eps NUM_EPOCHS] [-ttim MAX_TRAIN_TIME]
                  [-vtim VALIDATION_EVERY_N_SECS] [-stim SAVE_EVERY_N_SECS]
                  [-sval SAVE_AFTER_VALID] [-veps VALIDATION_EVERY_N_EPOCHS]
                  [-vp VALIDATION_PATIENCE] [-vmt VALIDATION_METRIC]
                  [-vmm {max,min}] [-mcs METRICS] [-micro AGGREGATE_MICRO]
                  [-tblog TENSORBOARD_LOG] [-tblogdir TENSORBOARD_LOGDIR]
                  [-hs HIDDENSIZE] [-esz EMBEDDINGSIZE] [-nl NUMLAYERS]
                  [-dr DROPOUT] [-bi BIDIRECTIONAL]
                  [-att {none,concat,general,dot,local}]
                  [-attl ATTENTION_LENGTH] [--attention-time {pre,post

You'll notice the options are given as command-line arguments. We control our options via `argparse`. The option names are relatively predictable: `--init-model` becomes `init_model`, `--num-epochs` becomes `num_epochs`, and so on.
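
The dash-to-underscore mapping is standard `argparse` behavior, which the toy parser below demonstrates. This is a standalone sketch, not ParlAI's own parser; the argument types and defaults here are assumptions chosen for illustration.

```python
import argparse

# A toy parser with two ParlAI-style flags.
parser = argparse.ArgumentParser()
parser.add_argument("--init-model", type=str, default=None)
parser.add_argument("--num-epochs", type=float, default=-1)

# Parse an explicit argument list instead of sys.argv so this runs anywhere.
args = parser.parse_args(
    ["--init-model", "zoo:tutorial_transformer_generator/model", "--num-epochs", "2"]
)

# argparse stores each value under the flag name with dashes turned into underscores.
opt = vars(args)
print(opt["init_model"])
print(opt["num_epochs"])
```

This is why a flag such as `--max-train-time` on the command line corresponds to the `max_train_time=...` keyword used in the Python calls in this tutorial.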

# Looking at model predictions

We have shown how we can chat with a model ourselves, interactively. We might also want to inspect how the model responds to a fixed set of inputs. Let's use the model we just trained!


In [9]:
from parlai.scripts.display_model import DisplayModel

DisplayModel.main(
    task='empathetic_dialogues',
    model_file='from_pretrained/model',
    num_examples=2,
)

13:24:15 | Using CUDA
13:24:15 | loading dictionary from from_pretrained/model.dict
13:24:15 | num words = 54944
13:24:16 | Total parameters: 87,508,992 (87,508,992 trainable)
13:24:16 | Loading existing model params from from_pretrained/model
13:24:25 | creating task(s): empathetic_dialogues
[EmpatheticDialoguesTeacher] Only use experiencer side? True, datatype: valid
13:24:26 | Opt:
13:24:26 |     activation: gelu
13:24:26 |     adafactor_eps: '[1e-30, 0.001]'
13:24:26 |     adam_eps: 1e-08
13:24:26 |     add_p1_after_newln: False
13:24:26 |     aggregate_micro: False
13:24:26 |     allow_missing_init_opts: False
13:24:26 |     attention_dropout: 0.0
13:24:26 |     batchsize: 12
13:24:26 |     beam_block_full_context: True
13:24:26 |     beam_block_list_filename: None
13:24:26 |     beam_block_ngram: -1
13:24:26 |     beam_context_block_ngram: -1
13:24:26 |     beam_delay: 30
13:24:26 |     beam_length_penalty: 0.65
13:24:26 |     beam_min_length: 1
13:24:26 |     beam_size: 1
13:24:

Whoa, wait a second! The model isn't giving any responses? That's because we set `--skip-generation true` to speed up training. We need to turn that back off.

In [10]:
from parlai.scripts.display_model import DisplayModel

DisplayModel.main(
    task='empathetic_dialogues',
    model_file='from_pretrained/model',
    num_examples=2,
    skip_generation=False,
)

13:24:46 | Overriding opt["skip_generation"] to False (previously: True)
13:24:46 | Using CUDA
13:24:46 | loading dictionary from from_pretrained/model.dict
13:24:46 | num words = 54944
13:24:48 | Total parameters: 87,508,992 (87,508,992 trainable)
13:24:48 | Loading existing model params from from_pretrained/model
13:24:51 | creating task(s): empathetic_dialogues
[EmpatheticDialoguesTeacher] Only use experiencer side? True, datatype: valid
13:24:51 | Opt:
13:24:51 |     activation: gelu
13:24:51 |     adafactor_eps: '[1e-30, 0.001]'
13:24:51 |     adam_eps: 1e-08
13:24:51 |     add_p1_after_newln: False
13:24:51 |     aggregate_micro: False
13:24:51 |     allow_missing_init_opts: False
13:24:51 |     attention_dropout: 0.0
13:24:51 |     batchsize: 12
13:24:51 |     beam_block_full_context: True
13:24:51 |     beam_block_list_filename: None
13:24:51 |     beam_block_ngram: -1
13:24:51 |     beam_context_block_ngram: -1
13:24:51 |     beam_delay: 30
13:24:51 |     beam_length_

On the command line:
```bash
python -m parlai.scripts.display_model --task empathetic_dialogues --model-file zoo:tutorial_transformer_generator/model
```

# Bringing your own datasets

What if you want to build your own dataset in ParlAI? Of course you can do that!

In [11]:
from parlai.core.teachers import register_teacher, DialogTeacher

@register_teacher("my_teacher")
class MyTeacher(DialogTeacher):
    def __init__(self, opt, shared=None):
        # opt is the command line arguments.
        
        # What is this shared thing?
        # We make many copies of a teacher, one-per-batchsize. Shared lets us store
        # data once and reuse it across those copies instead of reloading it.
        
        # We just need to set the "datafile".  This is boilerplate, but differs in many teachers.
        # The "datafile" is the filename where we will load the data from. In this case, we'll set it to
        # the fold name (train/valid/test) + ".txt"
        opt['datafile'] = opt['datatype'].split(':')[0] + ".txt"
        super().__init__(opt, shared)
    
    def setup_data(self, datafile):
        # datafile tells us where to load from.
        # We'll just use some hardcoded data, but show how you could read the datafile here:
        print(f" ~~ Loading from {datafile} ~~ ")
        
        # setup_data should yield tuples of ((text, label), new_episode)
        # That is ((str, str), bool)
        
        # first episode
        # notice how we have call, response, and then True? The True indicates this is a first message
        # in a conversation
        yield ('Hello', 'Hi'), True
        # Next we have the second turn. This time, the last element is False, indicating we're still going
        yield ('How are you', 'I am fine'), False
        yield ("Let's say goodbye", 'Goodbye!'), False
        
        # second episode. We need to have True again!
        yield ("Hey", "hi there"), True
        yield ("Deja vu?", "Deja vu!"), False
        yield ("Last chance", "This is it"), False
             
DisplayData.main(task="my_teacher")

13:25:07 | Opt:
13:25:07 |     allow_missing_init_opts: False
13:25:07 |     batchsize: 1
13:25:07 |     datapath: /usr/local/lib/python3.6/dist-packages/data
13:25:07 |     datatype: train:ordered
13:25:07 |     dict_class: None
13:25:07 |     display_ignore_fields: agent_reply
13:25:07 |     display_verbose: False
13:25:07 |     download_path: None
13:25:07 |     dynamic_batching: None
13:25:07 |     hide_labels: False
13:25:07 |     image_cropsize: 224
13:25:07 |     image_mode: raw
13:25:07 |     image_size: 256
13:25:07 |     init_model: None
13:25:07 |     init_opt: None
13:25:07 |     loglevel: info
13:25:07 |     max_display_len: 1000
13:25:07 |     model: None
13:25:07 |     model_file: None
13:25:07 |     multitask_weights: [1]
13:25:07 |     num_examples: 10
13:25:07 |     override: "{'task': 'my_teacher'}"
13:25:07 |     parlai_home: /usr/local/lib/python3.6/dist-packages
13:25:07 |     starttime: Feb07_13-25
13:25:07 |     task: my_teacher
13:25:07 | creating task(s): my_t

Notice how the data corresponds to the utterances we provided? In reality, we'd normally want to load up a data file, loop through it, and yield the tuples from processed data. But for this simple example, it works well.
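
As a rough sketch of what loading from a file might look like, the generator below reads a hypothetical plain-text format (one `text<TAB>label` pair per line, with a blank line between conversations) and yields the same `((text, label), new_episode)` tuples that `setup_data` produces. The file format and the function name `read_episodes` are assumptions made for illustration, not something ParlAI prescribes.

```python
def read_episodes(datafile):
    """Yield ((text, label), new_episode) tuples from a tab-separated file.

    Assumed (hypothetical) format: one "text<TAB>label" pair per line,
    with a blank line separating conversations.
    """
    new_episode = True
    with open(datafile, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line.strip():
                # A blank line ends the current conversation.
                new_episode = True
                continue
            text, label = line.split("\t", 1)
            yield (text, label), new_episode
            new_episode = False
```

Inside a teacher, the body of `setup_data` could then shrink to `yield from read_episodes(datafile)`.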

We can now use our teacher in the standard places! Let's see how the model we trained earlier behaves with it:

In [12]:
DisplayModel.main(task='my_teacher', model_file='from_pretrained/model', skip_generation=False)

13:25:14 | Overriding opt["task"] to my_teacher (previously: empathetic_dialogues)
13:25:14 | Overriding opt["skip_generation"] to False (previously: True)
13:25:14 | Using CUDA
13:25:14 | loading dictionary from from_pretrained/model.dict
13:25:14 | num words = 54944
13:25:16 | Total parameters: 87,508,992 (87,508,992 trainable)
13:25:16 | Loading existing model params from from_pretrained/model
13:25:19 | creating task(s): my_teacher
 ~~ Loading from valid.txt ~~ 
13:25:19 | Opt:
13:25:19 |     activation: gelu
13:25:19 |     adafactor_eps: '[1e-30, 0.001]'
13:25:19 |     adam_eps: 1e-08
13:25:19 |     add_p1_after_newln: False
13:25:19 |     aggregate_micro: False
13:25:19 |     allow_missing_init_opts: False
13:25:19 |     attention_dropout: 0.0
13:25:19 |     batchsize: 12
13:25:19 |     beam_block_full_context: True
13:25:19 |     beam_block_list_filename: None
13:25:19 |     beam_block_ngram: -1
13:25:19 |     beam_context_block_ngram: -1
13:25:19 |     beam_de

Note that the `register_teacher` decorator makes the commands aware of your teacher. If you leave it off, the commands won't be able to locate it. If you want to use your teacher on the command line, you'll need to put it in a very specific file, `parlai/tasks/my_teacher/agents.py`, and name the class `DefaultTeacher` instead of `MyTeacher`.
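
The registration mechanism can be pictured as a decorator that records the class in a lookup table under the given name. The sketch below is a simplified stand-in written for illustration (`_REGISTRY`, `register`, `lookup`, and `MyToyTeacher` are hypothetical names); it is not ParlAI's actual implementation.

```python
_REGISTRY = {}  # hypothetical name -> class lookup table


def register(name):
    """Return a decorator that stores the decorated class under `name`."""
    def decorator(cls):
        _REGISTRY[name] = cls
        return cls  # hand the class back unchanged
    return decorator


def lookup(name):
    """Find a registered class, failing loudly if it was never registered."""
    if name not in _REGISTRY:
        raise KeyError(f"Nothing registered under {name!r}")
    return _REGISTRY[name]


@register("my_teacher")
class MyToyTeacher:
    pass
```

With this in place, `lookup("my_teacher")` returns `MyToyTeacher`, while an unregistered name raises `KeyError`. That mirrors why a command cannot find a teacher whose decorator was left off.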

# Creating your own models

As a start, we'll implement a *very* simple agent. This agent will just respond with "Hello X, I'm Y", where X is the last word of the input and Y is the agent's name.

In [13]:
from parlai.core.agents import register_agent, Agent

@register_agent("hello")
class HelloAgent(Agent):
    @classmethod
    def add_cmdline_args(cls, parser):
        parser.add_argument('--name', type=str, default='Alice', help="The agent's name.")
        return parser
        
    def __init__(self, opt, shared=None):
        # similar to the teacher, we have the Opt and the shared memory objects!
        super().__init__(opt, shared)
        self.id = 'HelloAgent'
        self.name = opt['name']
    
    def observe(self, observation):
        # Gather the last word from the other user's input
        words = observation.get('text', '').split()
        if words:
            self.last_word = words[-1]
        else:
            self.last_word = "stranger!"
    
    def act(self):
        # Always return a message dict like this.
        return {
            'id': self.id,
            'text': f"Hello {self.last_word}, I'm {self.name}",
        }

Let's try seeing how this agent behaves:

In [14]:
DisplayModel.main(task='my_teacher', model='hello')

13:25:39 | creating task(s): my_teacher
 ~~ Loading from valid.txt ~~ 
13:25:39 | Opt:
13:25:39 |     allow_missing_init_opts: False
13:25:39 |     batchsize: 1
13:25:39 |     datapath: /usr/local/lib/python3.6/dist-packages/data
13:25:39 |     datatype: valid
13:25:39 |     dict_class: None
13:25:39 |     display_ignore_fields: 
13:25:39 |     download_path: None
13:25:39 |     dynamic_batching: None
13:25:39 |     hide_labels: False
13:25:39 |     image_cropsize: 224
13:25:39 |     image_mode: raw
13:25:39 |     image_size: 256
13:25:39 |     init_model: None
13:25:39 |     init_opt: None
13:25:39 |     loglevel: info
13:25:39 |     model: hello
13:25:39 |     model_file: None
13:25:39 |     multitask_weights: [1]
13:25:39 |     name: Alice
13:25:39 |     num_examples: 10
13:25:39 |     override: "{'task': 'my_teacher', 'model': 'hello'}"
13:25:39 |     parlai_home: /usr/local/lib/python3.6/dist-packages
13:25:39 |     starttime: Feb07_13-25
13:25:39 |     task: my_teacher
13:25:39 |

Notice how it reads the last word of each message from the user, and takes its name from the command-line argument? We can also interact with it easily enough.
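
To make that flow concrete, here is a rough, ParlAI-free sketch of the observe/act cycle. `ToyHelloAgent` is a hypothetical stand-in that mirrors `HelloAgent` above without inheriting from `Agent`, and the loop only approximates what ParlAI does on each turn.

```python
class ToyHelloAgent:
    """Stand-in for HelloAgent that needs no ParlAI install (illustrative only)."""

    def __init__(self, name="Alice"):
        self.id = "HelloAgent"
        self.name = name
        self.last_word = "stranger!"

    def observe(self, observation):
        # Remember the last word of the incoming message, if there is one.
        words = observation.get("text", "").split()
        self.last_word = words[-1] if words else "stranger!"

    def act(self):
        # Reply with a message dict built from what was observed.
        return {"id": self.id, "text": f"Hello {self.last_word}, I'm {self.name}"}


agent = ToyHelloAgent(name="Bob")
replies = []
for text in ["How are you", ""]:
    agent.observe({"text": text})  # 1. the agent sees the incoming message
    replies.append(agent.act())    # 2. the agent produces its reply

print([r["text"] for r in replies])
```

The first reply greets "you", the last word of the message; the second falls back to "stranger!" because the message was empty.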

In [15]:
Interactive.main(model='hello', name='Bob')

13:25:46 | Opt:
13:25:46 |     allow_missing_init_opts: False
13:25:46 |     batchsize: 1
13:25:46 |     datapath: /usr/local/lib/python3.6/dist-packages/data
13:25:46 |     datatype: train
13:25:46 |     dict_class: None
13:25:46 |     display_examples: False
13:25:46 |     display_ignore_fields: label_candidates,text_candidates
13:25:46 |     display_prettify: False
13:25:46 |     download_path: None
13:25:46 |     dynamic_batching: None
13:25:46 |     hide_labels: False
13:25:46 |     image_cropsize: 224
13:25:46 |     image_mode: raw
13:25:46 |     image_size: 256
13:25:46 |     init_model: None
13:25:46 |     init_opt: None
13:25:46 |     interactive_mode: True
13:25:46 |     interactive_task: True
13:25:46 |     local_human_candidates_file: None
13:25:46 |     log_keep_fields: all
13:25:46 |     loglevel: info
13:25:46 |     model: hello
13:25:46 |     model_file: None
13:25:46 |     multitask_weights: [1]
13:25:46 |     name: Bob
13:25:46 |     outfile: 
13:25:46 |     override:

Similar to the teacher, the `register_agent` decorator makes the agent available for use in commands. If you forget the decorator, you won't be able to refer to it. Similarly, if you wanted to use this model from the command line, you would need to save this code to a specific file: `parlai/agents/hello/hello.py`.

## Creating a neural network model

The base Agent class is very simple, but it also provides extremely little functionality. We have created solid abstractions for creating neural-network type models. [`TorchGeneratorAgent`](https://parl.ai/docs/torch_agent.html#module-parlai.core.torch_generator_agent) is one of our common abstractions, and it assumes a model which outputs one token at a time.
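
Generating output incrementally roughly means the loop sketched below: start from a start token, repeatedly ask the model for the next token given everything generated so far, and stop at an end token or a length limit. The `next_token` callable and the `__start__`/`__end__` strings are hypothetical placeholders; `TorchGeneratorAgent` works on batched tensors and supports other inference strategies, none of which appear here.

```python
def greedy_decode(next_token, start="__start__", end="__end__", max_len=20):
    """Generate tokens one at a time until `end` or `max_len` is reached.

    `next_token` is any callable mapping the tokens generated so far
    to the single most likely next token (a stand-in for a real model).
    """
    generated = [start]
    while len(generated) - 1 < max_len:
        token = next_token(generated)
        if token == end:
            break
        generated.append(token)
    return generated[1:]  # drop the start token


# Toy "model": replays a canned reply, then emits the end token.
canned = ["i", "am", "fine"]


def toy_model(so_far):
    position = len(so_far) - 1  # number of real tokens generated so far
    return canned[position] if position < len(canned) else "__end__"


print(greedy_decode(toy_model))  # ['i', 'am', 'fine']
```

A real model would replace `toy_model` with a forward pass through the decoder followed by an argmax over the vocabulary.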

The following is from our [ExampleSeq2Seq](https://github.com/facebookresearch/ParlAI/blob/master/parlai/agents/examples/seq2seq.py) agent. It's a simple RNN model, trained like a machine translation model. The model is too complex to go over in this document, but please feel free to [read our TorchGeneratorAgent tutorial](https://parl.ai/docs/tutorial_torch_generator_agent.html).

In [16]:
import torch.nn as nn
import torch.nn.functional as F
import parlai.core.torch_generator_agent as tga

class Encoder(nn.Module):
    """
    Example encoder, consisting of an embedding layer and a 1-layer LSTM with the
    specified hidden size.
    Pay particular attention to the ``forward`` output.
    """

    def __init__(self, embeddings, hidden_size):
        """
        Initialization.
        Arguments here can be used to provide hyperparameters.
        """
        # must call super on all nn.Modules.
        super().__init__()

        self.embeddings = embeddings
        self.lstm = nn.LSTM(
            input_size=hidden_size,
            hidden_size=hidden_size,
            num_layers=1,
            batch_first=True,
        )

    def forward(self, input_tokens):
        """
        Perform the forward pass for the encoder.
        Input *must* be input_tokens, which are the context tokens given
        as a matrix of lookup IDs.
        :param input_tokens:
            Input tokens as a bsz x seqlen LongTensor.
            Likely will contain padding.
        :return:
            You can return anything you like; it will be passed verbatim
            into the decoder for conditioning. However, it should be something
            you can easily manipulate in ``reorder_encoder_states``.
            This particular implementation returns the hidden and cell states from the
            LSTM.
        """
        embedded = self.embeddings(input_tokens)
        _output, hidden = self.lstm(embedded)
        return hidden


class Decoder(nn.Module):
    """
    Basic example decoder, consisting of an embedding layer and a 1-layer LSTM with the
    specified hidden size. Decoder allows for incremental decoding by ingesting the
    current incremental state on each forward pass.
    Pay particular note to the ``forward``.
    """

    def __init__(self, embeddings, hidden_size):
        """
        Initialization.
        Arguments here can be used to provide hyperparameters.
        """
        super().__init__()
        self.embeddings = embeddings
        self.lstm = nn.LSTM(
            input_size=hidden_size,
            hidden_size=hidden_size,
            num_layers=1,
            batch_first=True,
        )

    def forward(self, input, encoder_state, incr_state=None):
        """
        Run forward pass.
        :param input:
            The currently generated tokens from the decoder.
        :param encoder_state:
            The output from the encoder module.
        :param incr_state:
            The previous hidden state of the decoder.
        """
        embedded = self.embeddings(input)
        if incr_state is None:
            # this is our very first call. We want to seed the LSTM with the
            # hidden state of the encoder
            state = encoder_state
        else:
            # We've generated some tokens already, so we can reuse the existing
            # decoder state
            state = incr_state

        # get the new output and decoder incremental state
        output, incr_state = self.lstm(embedded, state)

        return output, incr_state

class ExampleModel(tga.TorchGeneratorModel):
    """
    ExampleModel implements the abstract methods of TorchGeneratorModel to define how to
    re-order encoder states and decoder incremental states.
    It also instantiates the embedding table, encoder, and decoder, and defines the
    final output layer.
    """

    def __init__(self, dictionary, hidden_size=1024):
        super().__init__(
            padding_idx=dictionary[dictionary.null_token],
            start_idx=dictionary[dictionary.start_token],
            end_idx=dictionary[dictionary.end_token],
            unknown_idx=dictionary[dictionary.unk_token],
        )
        self.embeddings = nn.Embedding(len(dictionary), hidden_size)
        self.encoder = Encoder(self.embeddings, hidden_size)
        self.decoder = Decoder(self.embeddings, hidden_size)

    def output(self, decoder_output):
        """
        Perform the final output -> logits transformation.
        """
        return F.linear(decoder_output, self.embeddings.weight)

    def reorder_encoder_states(self, encoder_states, indices):
        """
        Reorder the encoder states to select only the given batch indices.
        Since encoder_state can be arbitrary, you must implement this yourself.
        Typically you will just want to index select on the batch dimension.
        """
        h, c = encoder_states
        return h[:, indices, :], c[:, indices, :]

    def reorder_decoder_incremental_state(self, incr_state, indices):
        """
        Reorder the decoder states to select only the given batch indices.
        This method can be a stub which always returns None; this will result in the
        decoder doing a complete forward pass for every single token, making generation
        O(n^2). However, if any state can be cached, then this method should be
        implemented to reduce the generation complexity to O(n).
        """
        h, c = incr_state
        return h[:, indices, :], c[:, indices, :]

@register_agent("my_first_lstm")
class Seq2seqAgent(tga.TorchGeneratorAgent):
    """
    Example agent.
    Implements the interface for TorchGeneratorAgent. The minimum requirement is that it
    implements ``build_model``, but we will want to include additional command line
    parameters.
    """

    @classmethod
    def add_cmdline_args(cls, argparser):
        """
        Add CLI arguments.
        """
        # Make sure to add all of TorchGeneratorAgent's arguments
        super(Seq2seqAgent, cls).add_cmdline_args(argparser)

        # Add custom arguments only for this model.
        group = argparser.add_argument_group('Example TGA Agent')
        group.add_argument(
            '-hid', '--hidden-size', type=int, default=1024, help='Hidden size.'
        )

    def build_model(self):
        """
        Construct the model.
        """

        model = ExampleModel(self.dict, self.opt['hidden_size'])
        # Optionally initialize pre-trained embeddings by copying them from another
        # source: GloVe, fastText, etc.
        self._copy_embeddings(model.embeddings.weight, self.opt['embedding_type'])
        return model
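
As a quick sanity check on the model above, its size can be worked out by hand. The sketch below assumes PyTorch's usual LSTM parameterization (four gates, each with an input-to-hidden matrix, a hidden-to-hidden matrix, and two bias vectors) and the 30-token dictionary that gets built for the toy task; the helper names are made up for illustration.

```python
def lstm_params(input_size, hidden_size):
    """Parameters in a 1-layer, unidirectional LSTM (PyTorch-style layout)."""
    per_gate = (
        hidden_size * input_size     # input-to-hidden weights
        + hidden_size * hidden_size  # hidden-to-hidden weights
        + 2 * hidden_size            # two bias vectors
    )
    return 4 * per_gate  # input, forget, cell, and output gates


def example_model_params(vocab_size, hidden_size=1024):
    """Total parameters for an embedding table plus two same-sized LSTMs."""
    embeddings = vocab_size * hidden_size  # shared by encoder, decoder, output
    encoder = lstm_params(hidden_size, hidden_size)
    decoder = lstm_params(hidden_size, hidden_size)
    # The output layer reuses the embedding matrix, so it adds nothing.
    return embeddings + encoder + decoder


print(f"{example_model_params(vocab_size=30):,}")  # 16,824,320
```

This agrees with the `Total parameters: 16,824,320` line in the training log below.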

Of course, now we can train with our new model. Let's train it on our toy task that we created earlier.

In [17]:
# Of course, we can train the model! Let's train it on our silly toy task from above.
!rm -rf my_first_lstm
!mkdir -p my_first_lstm

TrainModel.main(
    model='my_first_lstm',
    model_file='my_first_lstm/model',
    task='my_teacher',
    batchsize=1,
    validation_every_n_secs=10,
    max_train_time=60,
)

13:27:19 | building dictionary first...
13:27:19 | Opt:
13:27:19 |     adafactor_eps: '(1e-30, 0.001)'
13:27:19 |     adam_eps: 1e-08
13:27:19 |     add_p1_after_newln: False
13:27:19 |     aggregate_micro: False
13:27:19 |     allow_missing_init_opts: False
13:27:19 |     batchsize: 1
13:27:19 |     beam_block_full_context: True
13:27:19 |     beam_block_list_filename: None
13:27:19 |     beam_block_ngram: -1
13:27:19 |     beam_context_block_ngram: -1
13:27:19 |     beam_delay: 30
13:27:19 |     beam_length_penalty: 0.65
13:27:19 |     beam_min_length: 1
13:27:19 |     beam_size: 1
13:27:19 |     betas: '(0.9, 0.999)'
13:27:19 |     bpe_add_prefix_space: None
13:27:19 |     bpe_debug: False
13:27:19 |     bpe_merge: None
13:27:19 |     bpe_vocab: None
13:27:19 |     compute_tokenized_bleu: False
13:27:19 |     datapath: /usr/local/lib/python3.6/dist-packages/data
13:27:19 |     datatype: train
13:27:19 |     delimiter: '\n'
13:27:19 |     dict_class: parlai.core.dict:DictionaryAgent


Building dictionary: 100%|██████████| 6.00/6.00 [00:00<00:00, 2.39kex/s]

 ~~ Loading from train.txt ~~ 
 ~~ Loading from train.txt ~~ 
13:27:20 | Saving dictionary to my_first_lstm/model.dict
13:27:20 | dictionary built with 30 tokens in 0.0s
13:27:20 | No model with opt yet at: my_first_lstm/model(.opt)
13:27:20 | Using CUDA
13:27:20 | loading dictionary from my_first_lstm/model.dict
13:27:20 | num words = 30
13:27:20 | Total parameters: 16,824,320 (16,824,320 trainable)
13:27:20 | Opt:
13:27:20 |     adafactor_eps: '(1e-30, 0.001)'
13:27:20 |     adam_eps: 1e-08
13:27:20 |     add_p1_after_newln: False
13:27:20 |     aggregate_micro: False
13:27:20 |     allow_missing_init_opts: False
13:27:20 |     batchsize: 1
13:27:20 |     beam_block_full_context: True
13:27:20 |     beam_block_list_filename: None
13:27:20 |     beam_block_ngram: -1
13:27:20 |     beam_context_block_ngram: -1
13:27:20 |     beam_delay: 30
13:27:20 |     beam_length_penalty: 0.65
13:27:20 |     beam_min_length: 1
13:27:20 |     beam_size: 1
13:27:20 |     betas: '(0.9, 0.999)'
13:27:20




13:27:20 |     hide_labels: False
13:27:20 |     history_add_global_end_token: None
13:27:20 |     history_reversed: False
13:27:20 |     history_size: -1
13:27:20 |     image_cropsize: 224
13:27:20 |     image_mode: raw
13:27:20 |     image_size: 256
13:27:20 |     inference: greedy
13:27:20 |     init_model: None
13:27:20 |     init_opt: None
13:27:20 |     interactive_mode: False
13:27:20 |     invsqrt_lr_decay_gamma: -1
13:27:20 |     label_truncate: None
13:27:20 |     learningrate: 1
13:27:20 |     load_from_checkpoint: True
13:27:20 |     log_every_n_secs: 10
13:27:20 |     loglevel: info
13:27:20 |     lr_scheduler: reduceonplateau
13:27:20 |     lr_scheduler_decay: 0.5
13:27:20 |     lr_scheduler_patience: 3
13:27:20 |     max_lr_steps: -1
13:27:20 |     max_train_time: 60.0
13:27:20 |     metrics: default
13:27:20 |     model: my_first_lstm
13:27:20 |     model_file: my_first_lstm/model
13:27:20 |     momentum: 0
13:27:20 |     multitask_weights: [1]
13:27:20 |     nesterov: 

({'accuracy': ExactMatchMetric(1),
  'bleu-4': BleuMetric(0.0003337),
  'ctpb': GlobalAverageMetric(8.167),
  'ctps': GlobalTimerMetric(838.2),
  'exps': GlobalTimerMetric(100.9),
  'exs': SumMetric(6),
  'f1': F1Metric(1),
  'gpu_mem': GlobalAverageMetric(0.00432),
  'loss': AverageMetric(7.749e-08),
  'lr': GlobalAverageMetric(1),
  'ltpb': GlobalAverageMetric(3.333),
  'ltps': GlobalTimerMetric(338),
  'ppl': PPLMetric(1),
  'token_acc': AverageMetric(1),
  'total_train_updates': GlobalFixedMetric(1258),
  'tpb': GlobalAverageMetric(11.5),
  'tps': GlobalTimerMetric(1181)},
 {'accuracy': ExactMatchMetric(1),
  'bleu-4': BleuMetric(0.0003337),
  'ctpb': GlobalAverageMetric(8.167),
  'ctps': GlobalTimerMetric(831.9),
  'exps': GlobalTimerMetric(101.1),
  'exs': SumMetric(6),
  'f1': F1Metric(1),
  'gpu_mem': GlobalAverageMetric(0.004319),
  'loss': AverageMetric(7.749e-08),
  'lr': GlobalAverageMetric(1),
  'ltpb': GlobalAverageMetric(3.333),
  'ltps': GlobalTimerMetric(338.3),
  'ppl

Let's see how it does. It should reproduce the data perfectly:

In [18]:
DisplayModel.main(model_file='my_first_lstm/model', task='my_teacher')

13:27:43 | Using CUDA
13:27:43 | loading dictionary from my_first_lstm/model.dict
13:27:43 | num words = 30
13:27:43 | Total parameters: 16,824,320 (16,824,320 trainable)
13:27:43 | Loading existing model params from my_first_lstm/model
13:27:43 | creating task(s): my_teacher
 ~~ Loading from valid.txt ~~ 
13:27:43 | Opt:
13:27:43 |     adafactor_eps: '[1e-30, 0.001]'
13:27:43 |     adam_eps: 1e-08
13:27:43 |     add_p1_after_newln: False
13:27:43 |     aggregate_micro: False
13:27:43 |     allow_missing_init_opts: False
13:27:43 |     batchsize: 1
13:27:43 |     beam_block_full_context: True
13:27:43 |     beam_block_list_filename: None
13:27:43 |     beam_block_ngram: -1
13:27:43 |     beam_context_block_ngram: -1
13:27:43 |     beam_delay: 30
13:27:43 |     beam_length_penalty: 0.65
13:27:43 |     beam_min_length: 1
13:27:43 |     beam_size: 1
13:27:43 |     betas: '[0.9, 0.999]'
13:27:43 |     bpe_add_prefix_space: None
13:27:43 |     bpe_debug: False
13:27:43 |     bpe_merge: None

Unsurprisingly, we got perfect accuracy. This is because the data set is only a handful of utterances, and we can perfectly memorize it in this LSTM. Nonetheless, a great success!
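
The reported `loss` and `ppl` values are linked: perplexity is conventionally the exponential of the average per-token loss, so a loss near zero gives a perplexity of essentially 1. A small check, using the loss value from the training output above:

```python
import math

loss = 7.749e-08  # average per-token loss reported after training
ppl = math.exp(loss)

print(f"{ppl:.4f}")  # 1.0000
```

A perplexity this close to 1 means the model puts nearly all of its probability mass on the correct next token, which is what memorizing six examples looks like.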

# What's next!

The sky's the limit! Be sure to check out our [GitHub](https://github.com/facebookresearch/ParlAI) and [follow ParlAI on Twitter](https://twitter.com/parlai_parley). We're eager to hear what you are using ParlAI for!

Here are some other great resources:
- [Our research page](https://parl.ai/projects/)
- [ParlAI Documentation](https://parl.ai/docs/index.html)
- [Tutorial: Writing a Ranker model](https://parl.ai/docs/tutorial_torch_ranker_agent.html)
- [Tutorial: Using Mechanical Turk](https://parl.ai/docs/tutorial_mturk.html)
- [Tutorial: Connecting to chat services](https://parl.ai/docs/tutorial_chat_service.html)