# Language Models

In [1]:
import torch
from d2l import torch as d2l

## Learning Language Models

* Markov Models and n-grams
* Word Frequency
* Laplace Smoothing
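
The three topics above can be tied together in one small sketch: a bigram (first-order Markov) model estimated from word frequencies, with add-alpha (Laplace) smoothing so that unseen pairs keep a nonzero probability. The function name `laplace_bigram_prob`, the toy sentence, and the choice `alpha=1.0` are illustrative assumptions, and this is only one simple variant of Laplace smoothing.

```python
from collections import Counter

def laplace_bigram_prob(tokens, prev, word, alpha=1.0):
    """Estimate P(word | prev) from bigram counts with add-alpha smoothing."""
    vocab = set(tokens)
    # Count how often each token appears as the first element of a bigram
    context_counts = Counter(tokens[:-1])
    # Count how often each adjacent pair appears
    bigram_counts = Counter(zip(tokens[:-1], tokens[1:]))
    numerator = bigram_counts[(prev, word)] + alpha
    denominator = context_counts[prev] + alpha * len(vocab)
    return numerator / denominator

# Hypothetical toy corpus, used only for illustration
tokens = "the cat sat on the mat".split()
p_seen = laplace_bigram_prob(tokens, "the", "cat")    # pair occurs once
p_unseen = laplace_bigram_prob(tokens, "the", "sat")  # pair never occurs
print(p_seen, p_unseen)
```

With smoothing, the unseen pair still receives a small probability, and the estimates for a fixed context still sum to one over the vocabulary.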


### How do we measure language model quality?

One way is to check _how surprising the text is_. __A good language model is able to predict tokens with high accuracy.__

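Two closely related measures are the cross-entropy loss averaged over all tokens and perplexity, which is the exponential of that average.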

We will design language models using neural networks and use perplexity to evaluate how well a model predicts the next token given the preceding tokens in a text sequence.

To train language models, we can randomly sample minibatches of paired input and target sequences, where each target sequence is the input sequence shifted by one token. After training, we will use perplexity to measure the language model quality.

## Perplexity
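
Perplexity is the exponential of the average cross-entropy between the model's next-token predictions and the actual tokens. A model that predicts every token perfectly reaches a perplexity of 1, while a model that guesses uniformly has a perplexity equal to the vocabulary size. The following is a minimal sketch of that computation; the helper name `perplexity`, the tensor shapes, and the vocabulary size of 28 are assumptions made for illustration, not part of the d2l API.

```python
import torch
import torch.nn.functional as F

def perplexity(logits, targets):
    """Return exp(mean cross-entropy).

    logits: (batch, steps, vocab) unnormalized scores
    targets: (batch, steps) integer token indices
    """
    vocab_size = logits.shape[-1]
    # Flatten batch and time so every token contributes equally to the mean
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
    return torch.exp(loss)

vocab_size = 28  # hypothetical vocabulary size
targets = torch.randint(0, vocab_size, (2, 10))

# Uniform guessing: all logits equal, so every token gets probability 1/vocab_size
ppl_uniform = perplexity(torch.zeros(2, 10, vocab_size), targets)

# Near-perfect prediction: a very large logit on the correct token
confident_logits = F.one_hot(targets, vocab_size).float() * 100
ppl_confident = perplexity(confident_logits, targets)

print(ppl_uniform.item(), ppl_confident.item())
```

The two cases bracket the range one would expect to see in practice: a useful model should land well below the vocabulary size and approach 1 as its predictions improve.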

In [2]:
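@d2l.add_to_class(d2l.TimeMachine)  #@save
def __init__(self, batch_size, num_steps, num_train=10000, num_val=5000):
    super(d2l.TimeMachine, self).__init__()
    self.save_hyperparameters()
    corpus, self.vocab = self.build(self._download())
    # Slide a window of num_steps + 1 tokens over the corpus, one token at a time
    array = torch.tensor([corpus[i:i+num_steps+1]
                        for i in range(len(corpus)-num_steps)])
    # Inputs drop the last token of each window and targets drop the first,
    # so Y is X shifted by one position
    self.X, self.Y = array[:,:-1], array[:,1:]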

In [3]:
@d2l.add_to_class(d2l.TimeMachine)  #@save
def get_dataloader(self, train):
    # The first num_train windows are for training, the next num_val for validation
    idx = slice(0, self.num_train) if train else slice(
        self.num_train, self.num_train + self.num_val)
    return self.get_tensorloader([self.X, self.Y], train, idx)

In [4]:
data = d2l.TimeMachine(batch_size=2, num_steps=10)
for X, Y in data.train_dataloader():
    print('X:', X, '\nY:', Y)
    break

X: tensor([[ 7, 19, 16, 14,  0, 21,  9,  6,  0, 17],
        [21,  9,  2, 21,  0,  7, 16, 13, 13, 16]]) 
Y: tensor([[19, 16, 14,  0, 21,  9,  6,  0, 17, 19],
        [ 9,  2, 21,  0,  7, 16, 13, 13, 16, 24]])
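
In the output above, each row of `Y` is the matching row of `X` shifted left by one token, with one new token appended. The same windowing can be reproduced without `d2l` as a quick check. This is a sketch under stated assumptions: the `corpus` here is a hypothetical list of consecutive integers standing in for encoded text, and `num_steps = 3` is chosen only to keep the result small.

```python
import torch

corpus = list(range(8))  # hypothetical token indices standing in for encoded text
num_steps = 3

# Every window holds num_steps + 1 consecutive tokens
windows = torch.tensor([corpus[i:i + num_steps + 1]
                        for i in range(len(corpus) - num_steps)])

# Inputs are the first num_steps tokens, targets are the last num_steps tokens
X, Y = windows[:, :-1], windows[:, 1:]

print('X:', X, '\nY:', Y)
```

Because consecutive windows overlap in all but one token, neighbouring examples are highly correlated, which is one reason the data loader samples windows at random during training.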
