## RNN Language Model

Sources

2. [About Mixed Precision](https://pytorch.org/blog/what-every-user-should-know-about-mixed-precision-training-in-pytorch/)
3. [AMP Recipes](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html)
4. [Efficient training in a single GPU](https://huggingface.co/docs/transformers/perf_train_gpu_one)

We have the following datasets available for this task:

- Penn Treebank (originally created for POS tagging)
- WikiText

This notebook uses WikiText-103, loaded with the Hugging Face `datasets` library. Tokenization is handled further down by a pretrained tokenizer, which adds `<s>` and `</s>` as the beginning- and end-of-sentence tokens.

Now we can load our dataset:

In [1]:
import torch

from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-103-v1")

dataset


DatasetDict({
    test: Dataset({
        features: ['text'],
        num_rows: 4358
    })
    train: Dataset({
        features: ['text'],
        num_rows: 1801350
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3760
    })
})

In [4]:
dataset["train"][:15]

{'text': ['',
  ' = Valkyria Chronicles III = \n',
  '',
  ' Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " <unk> Raven " . \n',
  " The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game m

In [5]:
# Skip empty or very short lines (20 characters or fewer) and section headings starting with "="

def filter_line(ex):
    text = ex["text"]
    return len(text) > 20 and not text.lstrip().startswith("=")

dataset = dataset.filter(filter_line)
dataset = dataset.map(lambda ex: {"text": ex["text"].strip()})
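To make the filter's behaviour concrete, here is a small standalone check on a few illustrative strings written in WikiText style. The predicate is the same as `filter_line` above; the `samples` list is made up for this sketch and is not read from the dataset.

```python
def filter_line(ex):
    # Same predicate as in the cell above: keep lines longer than 20
    # characters that are not section headings such as " = Title = ".
    text = ex["text"]
    return len(text) > 20 and not text.lstrip().startswith("=")

# Illustrative strings in WikiText style (assumed for this sketch)
samples = [
    "",                                   # empty line -> dropped
    " = Valkyria Chronicles III = \n",    # section heading -> dropped
    " Short line . \n",                   # 20 characters or fewer -> dropped
    " The game began development in 2010 , carrying over a large portion of the work . \n",
]

# Mirror the notebook: filter first, then strip surrounding whitespace
kept = [s.strip() for s in samples if filter_line({"text": s})]
print(kept)
```

Only the last string survives, with its leading space and trailing newline stripped.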


## Tokenization



In [6]:
# Load tokenizer

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("finiteautomata/wikitext-tokenizer")

In [7]:
tokenizer.model_max_length = 256

tokenized_ds = dataset.map(
    lambda ex: tokenizer(ex["text"], padding=False, truncation=True),
    batched=True,
)



In [8]:
tokenized_ds["train"][0]

{'text': 'Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " <unk> Raven " .',
 'input_ids': [0,
  3483,
  79,
  206,
  1678,
  1257,
  27126,
  4214,
  1240,
  1288,
  692,
  3,
  21034,
  1122,
  2775,
  1288,
  692,
  899,
  872,
  743,
  822,
  757,
  817,
  768,
  813,
  816,
  758,
  24,
  1012,
  2164,
  1020,
  1257,
  27126,
  4214,
  1030,
  1009,
  3143,
  2

In [9]:
tokenized_ds = tokenized_ds.filter(
    lambda ex: len(ex["input_ids"]) > 7
)

tokenized_ds = tokenized_ds.remove_columns(["text"])



## Dataloaders

In [10]:
# Create dataloaders (first version: default settings, no worker processes)
from torch.utils.data import DataLoader
from transformers import DataCollatorForLanguageModeling

TRAIN_BATCH_SIZE = 64
EVAL_BATCH_SIZE = 32

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

train_dataloader = DataLoader(tokenized_ds["train"], batch_size=TRAIN_BATCH_SIZE, shuffle=False, collate_fn=collator)
dev_dataloader = DataLoader(tokenized_ds["validation"], batch_size=EVAL_BATCH_SIZE, shuffle=False, collate_fn=collator)
test_dataloader = DataLoader(tokenized_ds["test"], batch_size=EVAL_BATCH_SIZE, shuffle=False, collate_fn=collator)



### Check speed 


In [11]:
from tqdm.auto import tqdm

for batch in tqdm(zip(range(5_000), train_dataloader)):
    pass
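A progress bar only gives a rough impression of speed. A minimal sketch for measuring it explicitly is shown below; `time_batches` is a hypothetical helper introduced here, and the `range(...)` iterable is only a stand-in so the snippet runs on its own. In the notebook you would pass `train_dataloader` instead.

```python
import time

def time_batches(iterable, max_batches):
    """Iterate over at most `max_batches` items and return (count, seconds)."""
    start = time.perf_counter()
    count = 0
    for _, _batch in zip(range(max_batches), iterable):
        count += 1
    elapsed = time.perf_counter() - start
    return count, elapsed

# Stand-in iterable (assumption); replace with `train_dataloader` in the notebook
n, seconds = time_batches(range(10_000), max_batches=5_000)
print(f"{n} batches in {seconds:.4f}s")
```

Running the same helper on both versions of the loader gives two numbers that can be compared directly.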



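Now add `num_workers=4` and `pin_memory=True` to the training `DataLoader`, so that batches are prepared in parallel worker processes and placed in page-locked memory for faster transfer to the GPU: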

In [12]:

train_dataloader = DataLoader(tokenized_ds["train"], batch_size=TRAIN_BATCH_SIZE, shuffle=False, collate_fn=collator, num_workers=4, pin_memory=True)
dev_dataloader = DataLoader(tokenized_ds["validation"], batch_size=EVAL_BATCH_SIZE, shuffle=False, collate_fn=collator)
test_dataloader = DataLoader(tokenized_ds["test"], batch_size=EVAL_BATCH_SIZE, shuffle=False, collate_fn=collator)



In [13]:
from tqdm.auto import tqdm

for batch in tqdm(zip(range(5_000), train_dataloader)):
    pass

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Much better!
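The `huggingface/tokenizers` fork warning comes from the `DataLoader` workers forking a process in which the fast tokenizer has already used its own thread pool. As the warning itself suggests, it can be silenced by setting `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it runs at the very top of the notebook, before the tokenizer is first used:

```python
import os

# Must be set before `tokenizers`/`transformers` is first used in the process.
# "false" disables the tokenizer's internal parallelism, which avoids the
# fork warning when DataLoader worker processes are started.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Since the dataset is already tokenized at this point, disabling tokenizer parallelism should not slow down the data loading itself.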

In [14]:
# %load ../pytorch_lm/models/rnn.py
import torch.nn as nn

class ResidualRNN(nn.Module):
    """
    Recurrent layer + residual connection
    """
    def __init__(self, input_size):
        super().__init__()
        self.rnn = nn.GRU(input_size, input_size, batch_first=True)

    def forward(self, x):
        out, _ = self.rnn(x)

        return x + out

class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, pad_idx, dropout=0.20, num_layers=1):
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.rnn = nn.GRU(embedding_dim, embedding_dim, batch_first=True, num_layers=num_layers)
        self.fc = nn.Linear(embedding_dim, vocab_size)
        self.dropout = nn.Dropout(dropout)

        # Weight tying
        self.fc.weight = self.embedding.weight

    def forward(self, inp, hidden=None):
        # inp = [batch, seqlen]
        emb = self.embedding(inp)
        # emb = [batch, seqlen, embedding_dim]
        rnn_outputs, hidden = self.rnn(emb, hidden)
        # rnn_outputs = [batch, seqlen, embedding_dim]
        # hidden = [num_layers, batch, embedding_dim]

        out = self.fc(self.dropout(rnn_outputs))
        # out = [batch, seqlen, vocab_size]

        return out, hidden


Create the Language Model

In [37]:
stoi = tokenizer.get_vocab()

PAD_IDX = stoi["<pad>"]
UNK_IDX = stoi["<unk>"]
BOS_IDX = stoi["<s>"]
EOS_IDX = stoi["</s>"]


In [16]:
import torch.nn.functional as F


model = RNNLanguageModel(len(stoi), 512, pad_idx=PAD_IDX)

batch = next(iter(train_dataloader))


inputs = batch["input_ids"][:32, :-1]
targets = batch["labels"][:32, 1:]

out, hidden = model(inputs)
# Calculate the loss; the collator labels padding positions with -100,
# which F.cross_entropy ignores by default

F.cross_entropy(out.view(-1, out.size(-1)), targets.reshape(-1))


tensor(30.2208, grad_fn=<NllLossBackward0>)
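The slicing `[:, :-1]` for inputs and `[:, 1:]` for targets implements the usual next-token objective: at each position the model reads one token and is asked to predict the following one. The sketch below reproduces the idea on plain Python lists. `collate` is a simplified stand-in for `DataCollatorForLanguageModeling(mlm=False)`, and `PAD_ID = 1` is an assumption; the real value is `tokenizer.pad_token_id`.

```python
PAD_ID = 1           # assumed id of <pad>; the real value is tokenizer.pad_token_id
IGNORE_INDEX = -100  # label value that cross-entropy skips by default

def collate(sequences, pad_id=PAD_ID):
    """Right-pad to the longest sequence and build labels (pad -> -100)."""
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    labels = [[IGNORE_INDEX if t == pad_id else t for t in row] for row in input_ids]
    return input_ids, labels

# Two toy sequences: 0 stands for <s>, 2 for </s>
input_ids, labels = collate([[0, 11, 12, 13, 2], [0, 21, 2]])

# Shift by one: the model reads tokens 0..n-2 and predicts tokens 1..n-1
inputs = [row[:-1] for row in input_ids]
targets = [row[1:] for row in labels]

for x, y in zip(inputs, targets):
    print(x, "->", y)
```

In the second row the padded positions end up with target `-100`, so they contribute nothing to the loss.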

### Check speed with GPU (with and without torch.amp)

In [17]:
input_ids = batch["input_ids"][:, :-1]
labels = batch["labels"][:, 1:]

input_ids.shape, labels.shape

(torch.Size([64, 255]), torch.Size([64, 255]))

In [18]:
from tqdm.auto import tqdm

device = "cuda"
model = model.to(device)

for _, batch in tqdm(zip(range(1_000), train_dataloader)):
    input_ids = batch["input_ids"][:, :-1].to(device)
    labels = batch["labels"][:, 1:].to(device)

    out, hidden = model(input_ids)


    


In [19]:

for _, batch in tqdm(zip(range(1_000), train_dataloader)):
    with torch.autocast(device_type='cuda'):
        input_ids = batch["input_ids"][:, :-1].to(device)
        labels = batch["labels"][:, 1:].to(device)

        out, hidden = model(input_ids)

print(out.dtype)
    


torch.float16


🤯 More than a 3x speedup from wrapping the forward pass in `torch.autocast`.

In [20]:
# Count model parameters

sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

16.965936

About 17M parameters.
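The count can be cross-checked by hand. The sketch below assumes a vocabulary of 30,000 tokens; that value is inferred from the reported total rather than stated anywhere in the notebook. Because of weight tying, the output layer contributes only its bias on top of the embedding matrix.

```python
def gru_params(input_size, hidden_size):
    # Three gates, each with input->hidden and hidden->hidden weights
    # plus two bias vectors (PyTorch keeps bias_ih and bias_hh separately)
    return 3 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)

vocab_size = 30_000   # assumption: inferred from the reported total
embedding_dim = 512   # as passed to RNNLanguageModel in the cell above

embedding = vocab_size * embedding_dim  # shared with fc.weight (weight tying)
output_bias = vocab_size                # fc only adds its bias
gru = gru_params(embedding_dim, embedding_dim)

total = embedding + output_bias + gru
print(total / 1e6)
```

Under that assumption the total matches the 16.965936M reported above, with the embedding matrix accounting for roughly 90% of the parameters.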

In [21]:
del batch, inputs, targets, out, hidden, model

## Training 

In [25]:
# Load tensorboard writer
from torch.utils.tensorboard import SummaryWriter
from tqdm.auto import tqdm
import torch.optim as optim
from torch.cuda.amp import GradScaler

hidden_dim = 300

device = "cuda"

model = RNNLanguageModel(len(stoi), hidden_dim, pad_idx=PAD_IDX).to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
# Triangular learning rate schedule:
# rises during warmup, then decays back to base_lr
num_epochs = 6
warmup_steps = 1_000
total_steps = len(train_dataloader) * num_epochs 
lr_scheduler = optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-4, max_lr=1e-3, 
    step_size_up=warmup_steps, step_size_down= total_steps - warmup_steps,
    cycle_momentum=False)
scaler = GradScaler()


writer = SummaryWriter("runs/")

loss_fn = nn.CrossEntropyLoss().to(device)

step = 0

for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0.0
    for batch in tqdm(train_dataloader):
        step += 1
        optimizer.zero_grad()

        with torch.autocast(device_type='cuda'):
            inputs = batch["input_ids"][:, :-1].to(device)
            targets = batch["labels"][:, 1:].to(device)
            out, hidden = model(inputs)

            loss = loss_fn(out.view(-1, out.size(-1)), targets.reshape(-1))

        writer.add_scalar("train/loss", loss, global_step=step)
        # Log LR
        writer.add_scalar("train/learning rate", optimizer.param_groups[0]["lr"], global_step=step)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        lr_scheduler.step()


    model.eval()
    epoch_loss = 0.0
    for batch in tqdm(dev_dataloader):
        inputs = batch["input_ids"][:, :-1]
        targets = batch["labels"][:, 1:]

        inputs = inputs.to(device)
        targets = targets.to(device)

        with torch.no_grad():
            out, hidden = model(inputs)

            loss = F.cross_entropy(out.view(-1, out.size(-1)), targets.reshape(-1))

            epoch_loss += loss.item()

    epoch_loss /= len(dev_dataloader)

    writer.add_scalar("dev/loss", epoch_loss, global_step=step)
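The `CyclicLR` configuration in this cell amounts to a single triangle: a linear warmup from `base_lr` to `max_lr` over 1,000 steps, followed by a linear decay back to `base_lr` over the rest of training. A plain-Python sketch of that shape is given below. `triangular_lr` is a hypothetical helper, not the library's implementation, and the default `steps_down` assumes 12,997 training batches per epoch (`len(train_dataloader)` in this run) over 6 epochs.

```python
def triangular_lr(step, base_lr=1e-4, max_lr=1e-3,
                  steps_up=1_000, steps_down=12_997 * 6 - 1_000):
    """Learning rate at a given step of a one-cycle triangular schedule."""
    cycle_len = steps_up + steps_down
    pos = step % cycle_len
    if pos <= steps_up:
        frac = pos / steps_up                        # linear warmup
    else:
        frac = 1.0 - (pos - steps_up) / steps_down   # linear decay
    return base_lr + (max_lr - base_lr) * frac

for step in (0, 500, 1_000, 40_000, 77_981):
    print(step, f"{triangular_lr(step):.6f}")
```

Note that the `lr=1e-3` passed to `Adam` is overridden as soon as the scheduler is created, so training actually starts at `base_lr`.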





In [26]:
# Save model

torch.save(model.state_dict(), "rnn-lm.pt")

We can check perplexities for other models in [this blogpost](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/)

A more complex recurrent network (using a cache of hidden states) achieves a perplexity of 100, so this very basic model (trained without any hyperparameter tuning) seems fairly OK.
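Perplexity is the exponential of the mean per-token cross-entropy, so the `dev/loss` value logged during training can be converted directly. A minimal sketch follows; the loss values in the loop are hypothetical and are not results from this run.

```python
import math

def perplexity(mean_token_loss):
    """Perplexity = exp(mean per-token cross-entropy, measured in nats)."""
    return math.exp(mean_token_loss)

# Hypothetical loss values, for illustration only
for loss in (5.0, 4.6, 4.0):
    print(f"loss = {loss:.2f} -> perplexity = {perplexity(loss):.1f}")
```

Two caveats apply when comparing against published numbers. The loop above averages per-batch means, which only approximates the true per-token mean. This model also predicts subword tokens, so its perplexity is not directly comparable to word-level results.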

## Sampling

As we raise the temperature, we get more variety, at the cost of less coherent text.
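Temperature works by dividing the logits before the softmax: values below 1 sharpen the distribution towards the most likely token, and values above 1 flatten it. The sketch below shows the effect on three made-up logits, using only the standard library. `softmax_with_temperature` is a name introduced here for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three tokens
for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature almost all of the probability mass sits on the top token; at high temperature unlikely tokens are sampled more often, which is where the extra variety and the nonsense both come from.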

In [36]:
stoi

{'igua': 29846,
 'istically': 14739,
 'psilon': 22034,
 'icz': 18243,
 '▁campaign': 2721,
 '▁LeM': 26661,
 '▁Sr': 14679,
 '▁hem': 10930,
 '▁Online': 14289,
 '▁dispers': 14922,
 '▁inferior': 16311,
 '▁Global': 14848,
 '条': 915,
 '▁extraction': 21737,
 '▁cov': 26683,
 'lides': 15344,
 '▁fear': 7023,
 '▁shrines': 26740,
 '▁1500': 19406,
 '▁advert': 6121,
 '▁allies': 9669,
 '▁married': 3957,
 '▁clause': 16955,
 '▁wear': 5111,
 '▁POW': 24108,
 '▁Fest': 24827,
 '▁intersection': 6466,
 'body': 7968,
 '▁Tucker': 13300,
 'apy': 19213,
 '▁substance': 14446,
 'ph': 1601,
 '▁instrumental': 8873,
 'ocket': 11625,
 '▁Brandon': 14809,
 'iter': 6512,
 '▁1860s': 18163,
 'ephal': 26218,
 '▁consult': 9081,
 'rooms': 11170,
 '▁escaping': 20049,
 '▁Ain': 20465,
 'ographical': 8999,
 '▁subsidies': 25131,
 '▁smok': 27867,
 '▁fa': 17990,
 '▁replace': 6847,
 '▁dist': 1828,
 '▁Nit': 22904,
 '▁demise': 19679,
 '▁demonstrating': 22429,
 '▁Reform': 14589,
 '▁Volunteers': 28028,
 '▁combined': 5106,
 '▁than': 1500,


In [39]:
tokenizer('')

{'input_ids': [0, 2], 'token_type_ids': [0, 0], 'attention_mask': [1, 1]}

In [61]:
import torch.nn.functional as F

def sample_sentence(init_sentence='', temperature=1, max_len=100):
    # Tokenize the prompt and drop the trailing </s>
    seq = tokenizer(init_sentence)["input_ids"][:-1]
    # Feed the whole prompt first, so the hidden state sees every prompt token
    inp = torch.LongTensor([seq]).to(device)
    hidden = None
    with torch.no_grad():
        while len(seq) == 1 or seq[-1] != EOS_IDX:
            out, hidden = model(inp, hidden=hidden)

            # Sample the next token from the temperature-scaled distribution
            # at the last position
            probs = F.softmax(out[0, -1] / temperature, dim=0)
            next_tok_idx = torch.multinomial(probs, num_samples=1).item()
            seq.append(next_tok_idx)

            if len(seq) > max_len:
                break
            # After the prompt, feed only the newly sampled token
            inp = torch.LongTensor([[next_tok_idx]]).to(device)

    return tokenizer.decode(seq)

In [62]:
sample_sentence("Boys and girls from", temperature=1.1)

'<s> Boys and girls from 35 June 2011 on 28 April 2014. After hearing the consequences of the depression, processed feeding two lower designs, settled in debates across Asia was designated and CP ( 1015 ). In January 2016, between 40 September 15, 21 June highaku study showed that a performance and official storm surgeons declared ceremony that showed aniding competition with 2014 and considering a test 1 August 1892 ( nine visible in Hannings Neptune ) by 62 @,@ 750 flights over simultaneous AIDS.</s>'

In [63]:
import numpy as np
for temperature in np.arange(0.5, 1.5, 0.15):
    print("="*80, f"\nSampling with temperature = {temperature:.2f}")
    
    print(sample_sentence("The dogs and cats are", temperature=temperature))

Sampling with temperature = 0.50
<s> The dogs and cats are more common and other types of animals. In each other, they are often used as a b @-@ shaped to the hole in the light.</s>
Sampling with temperature = 0.65
<s> The dogs and cats are known of the species. The species is well known. It is defined by the species, but is not only a year @-@ long, but is only known as a species, and thus is a common species in captivity is a distinct species. The species is endemic, and features the fungus, and is classified as the species of the genus, though it is sometimes referred to as " a species of edible species " as it is a species of species. The species is known as
Sampling with temperature = 0.80
<s> The dogs and cats are not used to produce a color, with an informal example of eating or separate fox in debt.</s>
Sampling with temperature = 0.95
<s> The dogs and cats are implicated Hispanic origin, where it has four ranks. Each one can have forms, and a polained, with maximum per hasts r