# A RETRO tutorial by Carson Lam 

This is my notebook for learning and teaching the implementation of <a href="https://arxiv.org/abs/2112.04426">RETRO</a>, DeepMind's Retrieval-Enhanced Transformer, in PyTorch, on a small but meaningful task.

1. I am using Python 3.8. If you don't have Python 3.8, there are many ways to install it, such as pyenv or Homebrew. I used Homebrew and followed the instructions shown at the end of the download. You have to restart your terminal for the changes to take effect.

2. Create a virtual environment for this project and activate it.

```
python3.8 -m venv env
source env/bin/activate
```

3. Install this project's dependencies from requirements.txt.

```
pip install --upgrade pip
pip install -r requirements.txt
```

4. Save any additional dependencies you have pip-installed inside your environment, along with their specific versions, back into requirements.txt for later use.

```
pip freeze > requirements.txt
```

5. Start Jupyter and open this notebook.

```
jupyter notebook
```
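Once the notebook is open, it can be worth confirming that the kernel is really running inside the virtual environment from step 2. The snippet below is a minimal sketch using only the standard library; the helper name `in_virtualenv` is made up for illustration and is not part of this project.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("python version:", ".".join(map(str, sys.version_info[:3])))
print("inside virtualenv:", in_virtualenv())
```

If the second line prints `False`, the notebook kernel is probably using a different interpreter than the one in `env/`.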

In [1]:
import torch
from retro_pytorch import RETRO, TrainingWrapper

%load_ext autoreload
%autoreload 2

print('torch.version', torch.__version__)
print('torch.cuda.is_available()', torch.cuda.is_available())
print('torch.cuda.device_count()', torch.cuda.device_count())

torch.version 1.11.0+cu102
torch.cuda.is_available() True
torch.cuda.device_count() 2


`chunk_size` is the size of the chunks that are indexed and retrieved. The model needs it to compute correct relative positions and to perform causal chunked cross attention.

`dec_cross_attn_layers` lists the decoder layers that use causal chunked cross attention.

Turn on `use_deepnet` to use post-normalization with DeepNet residual scaling and initialization, which is intended for scaling to 1000 layers.

In [2]:
retro = RETRO(
    chunk_size = 64,                         # the chunk size that is indexed and retrieved  
    max_seq_len = 2048,                      # max sequence length
    enc_dim = 4, #896                        # encoder model dim
    enc_depth = 2,                           # encoder depth
    dec_dim = 4, #768,                       # decoder model dim
    dec_depth = 12,                          # decoder depth
    dec_cross_attn_layers = (3, 6, 9, 12),   # decoder cross attention layers 
    heads = 8,                               # attention heads
    dim_head = 64,                           # dimension per head
    dec_attn_dropout = 0.25,                 # decoder attention dropout
    dec_ff_dropout = 0.25,                   # decoder feedforward dropout
    use_deepnet = True                       # turn on post-normalization with DeepNet residual scaling and initialization 
)


In [3]:
# plus one since it is split into input and labels for training
seq = torch.randint(0, 20000, (2, 2048 + 1))     
print(seq)
print(seq.shape)

tensor([[15002,  6277, 16790,  ...,  8993,  8445, 10696],
        [14636, 14967,  5681,  ...,  5709, 14426,  8226]])
torch.Size([2, 2049])
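The "plus one" in the cell above is the usual next-token trick: a sequence of 2049 tokens yields 2048 inputs and 2048 labels shifted by one position. The sketch below shows the idea on a plain Python list so it runs without PyTorch; the function name is hypothetical, and the assumption is that the model consumes the extra token this way, as the comment in the cell suggests. The same slicing applies to a tensor along its last dimension.

```python
def split_inputs_and_labels(tokens):
    """Split one token sequence into next-token inputs and labels."""
    inputs = tokens[:-1]  # everything except the last token
    labels = tokens[1:]   # the same sequence shifted left by one position
    return inputs, labels


tokens = list(range(2048 + 1))  # stand-in for one row of `seq`
inputs, labels = split_inputs_and_labels(tokens)
print(len(inputs), len(labels))
```

At every position `i`, `labels[i]` is the token that follows `inputs[i]`.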


In [4]:
# retrieved tokens 
# - (batch, num chunks, num retrieved neighbors, retrieved chunk with continuation)
retrieved = torch.randint(0, 20000, (2, 32, 2, 128)) 
print(retrieved[:1,:3,:1,:32])
print(retrieved.shape)

tensor([[[[13814,  2228,  8314,  9380,  2469, 14591, 16025,  4959, 15966,  5552,
           12659, 13505, 11930, 10936,  9576, 19747,  8154,  4276,  3369,  8374,
           13024, 15922, 13479,  3554,  3488,  9570,   848,  9146, 18109,  6342,
           12020, 19881]],

         [[ 4706, 11936,  7855,  9514, 14994, 15707,  1554, 15798, 12646, 13213,
           18616, 17526, 16600,  6834, 18114,  8581,  8004, 14320, 18069, 14028,
           14515, 11528, 18079, 18692,  4627, 17391, 14247, 12973, 18911,  6000,
            2962, 15096]],

         [[  549, 10380,  7294, 17518, 17016, 14893, 15643,  4232,   244,  9032,
           13008,  2705,  7246,  4125, 12728,  9813, 19083, 12078,  7347,  6122,
            9420,  9940,  6467, 17694,  1392, 17523, 15222, 12154, 12990, 11640,
            6459,  3266]]]])
torch.Size([2, 32, 2, 128])
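The shape `(2, 32, 2, 128)` follows from the model settings: 2048 tokens split into chunks of 64 gives 32 chunks, and each retrieved neighbor holds a chunk plus its continuation. The sketch below spells out that arithmetic. It assumes the continuation has the same length as the chunk, which matches the 128 used above; the function name is made up for illustration.

```python
def retrieved_shape(batch_size, max_seq_len, chunk_size, num_neighbors):
    """Return the expected shape of the retrieved-token tensor."""
    if max_seq_len % chunk_size != 0:
        raise ValueError("max_seq_len must be a multiple of chunk_size")
    num_chunks = max_seq_len // chunk_size  # 2048 // 64 = 32
    # Assumption: each neighbor is a chunk followed by an equally long continuation.
    retrieved_len = 2 * chunk_size          # 64 + 64 = 128
    return (batch_size, num_chunks, num_neighbors, retrieved_len)


print(retrieved_shape(batch_size=2, max_seq_len=2048, chunk_size=64, num_neighbors=2))
```

With the values from this notebook, the result is the same shape as `retrieved`.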


In [5]:
loss = retro(seq, retrieved, return_loss = True)
print(loss)

tensor(10.4575, grad_fn=<NllLoss2DBackward0>)
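A quick sanity check for a loss on random tokens is to compare it with the cross-entropy of a uniform guess, which is the natural log of the vocabulary size. The sketch below computes that baseline for the 20,000 token ids sampled above. This is only a rough reference: an untrained model is not exactly uniform, and its output vocabulary may be larger than 20,000 (an assumption, since the notebook does not state it), both of which would put the observed loss somewhat above this number.

```python
import math


def uniform_baseline_loss(vocab_size: int) -> float:
    """Cross-entropy (in nats) of predicting every token with equal probability."""
    return math.log(vocab_size)


print(round(uniform_baseline_loss(20000), 4))  # 9.9035
```

A loss far above or far below this range on random data would be a hint that something is wired up incorrectly.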


The aim of the TrainingWrapper is to process a folder of text documents into the necessary memmapped numpy arrays to begin training RETRO.
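Before handing a folder to the wrapper, it can help to preview which files a glob such as `'**/*.txt'` would pick up. The sketch below uses `pathlib` from the standard library. It assumes the wrapper interprets the pattern the same way `Path.glob` does, which is not confirmed here, and `preview_documents` is a hypothetical helper, not part of `retro_pytorch`.

```python
from pathlib import Path


def preview_documents(folder, pattern="**/*.txt"):
    """Return the sorted files under `folder` that match `pattern`."""
    root = Path(folder)
    if not root.is_dir():
        return []  # nothing to preview if the folder does not exist
    # "**" matches the folder itself and any nested subdirectories.
    return sorted(p for p in root.glob(pattern) if p.is_file())


for path in preview_documents("../text_folder"):
    print(path)
```

An empty listing usually means the path is relative to a different working directory than the one the notebook is running in.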

`bert_embed()` automatically uses CUDA if it is available, so it is best to put the `retro` instance that is passed to the wrapper on the same device.



In [6]:
if torch.cuda.is_available():
    retro = retro.cuda()

wrapper = TrainingWrapper(
    retro = retro,                                 # the RETRO instance created above
    knn = 2,                                       # knn (2 in paper was sufficient)
    chunk_size = 64,                               # chunk size (64 in paper)
    documents_path = '../text_folder',              # path to folder of text
    glob = '**/*.txt',                             # text glob
    chunks_memmap_path = './train.chunks.dat',     # path to chunks
    seqs_memmap_path = './train.seq.dat',          # path to sequence data
    doc_ids_memmap_path = './train.doc_ids.dat',   # path to document ids per chunk (used for filtering neighbors belonging to same document)
    max_chunks = 1_000_000,                        # maximum cap to chunks
    max_seqs = 100_000,                            # maximum seqs
    knn_extra_neighbors = 100,                     # num extra neighbors to fetch
    max_index_memory_usage = '100m',
    current_memory_available = '1G'
)

processing ../text_folder/doc1.txt


Using cache found in /home/carson/.cache/torch/hub/huggingface_pytorch-transformers_main


processing ../text_folder/doc2.txt


Using cache found in /home/carson/.cache/torch/hub/huggingface_pytorch-transformers_main
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
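This error means that a single operation received one tensor on `cuda:0` and another on the CPU. A generic first step when debugging it is to list which devices hold a module's parameters and buffers. The sketch below is a plain PyTorch check and does not pinpoint the cause inside `TrainingWrapper`; `devices_in_use` is a hypothetical helper written for this note.

```python
import torch
from torch import nn


def devices_in_use(module: nn.Module) -> set:
    """Collect every device that holds a parameter or buffer of `module`."""
    devices = {p.device for p in module.parameters()}
    devices |= {b.device for b in module.buffers()}
    return devices


layer = nn.Linear(4, 4)  # small stand-in module; a fresh module lives on the CPU
print(devices_in_use(layer))
```

Calling the helper on `retro`, and checking `.device` on the input tensors, shows whether the model and its inputs ended up on different devices.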