# Language Models and the Dataset

**Summary**
- Introduces the concept of language models.
- Covers Zipf's law of word frequencies.
- Discusses the difficulties of estimating language model probabilities from corpus frequencies, and some smoothing techniques.
- Describes two ways of generating sequence modeling datasets:
    - Random sampling
    - Sequential partitioning

**Changes**
- 30-03-2021 - Created

- What is a language model, and what is it trying to model? How can it be used to generate natural text?
- How can we build a simple language model using word occurrence and co-occurrence frequencies in a corpus? What difficulties arise when estimating the probabilities?
- What is Laplace smoothing? What problem does it solve?

$$
\begin{aligned}
    \hat{P}(x) & = \frac{n(x) + \epsilon_1/m}{n + \epsilon_1}, \\
    \hat{P}(x' \mid x) & = \frac{n(x, x') + \epsilon_2 \hat{P}(x')}{n(x) + \epsilon_2}, \\
    \hat{P}(x'' \mid x,x') & = \frac{n(x, x',x'') + \epsilon_3 \hat{P}(x'')}{n(x, x') + \epsilon_3}.
\end{aligned}
$$
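
The first two smoothed estimates above can be sketched in plain Python. Here `n` is read as the total number of tokens, `m` as the number of unique tokens, `n(x)` and `n(x, x')` as unigram and bigram counts, and the `eps` values as smoothing hyperparameters. The function names `laplace_unigram` and `laplace_bigram` are illustrative and not part of `d2l`; this is a minimal sketch under those assumptions, not a tuned implementation.

```python
from collections import Counter


def laplace_unigram(tokens, eps1=1.0):
    """Return a function x -> (n(x) + eps1/m) / (n + eps1)."""
    counts = Counter(tokens)
    n = len(tokens)  # total number of tokens
    m = len(counts)  # number of unique tokens

    def prob(x):
        return (counts[x] + eps1 / m) / (n + eps1)

    return prob


def laplace_bigram(tokens, eps1=1.0, eps2=1.0):
    """Return a function (x_next, x) -> smoothed estimate of P(x_next | x)."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens[:-1], tokens[1:]))
    p_uni = laplace_unigram(tokens, eps1)

    def prob(x_next, x):
        numerator = bigram_counts[(x, x_next)] + eps2 * p_uni(x_next)
        return numerator / (unigram_counts[x] + eps2)

    return prob


# Toy corpus (hypothetical, for illustration only)
tokens = "the cat sat on the mat".split()
p_uni = laplace_unigram(tokens)
p_bi = laplace_bigram(tokens)
print(p_uni("the"), p_bi("cat", "the"), p_bi("sat", "the"))
```

The pair ("the", "sat") never occurs in the toy corpus, yet its estimate is small but nonzero. Avoiding zero probabilities for unseen combinations is the problem smoothing addresses.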

- What does a first-order Markov model of language look like?
- What is Zipf's law?
- Do unigrams, bigrams, and trigrams follow Zipf's law?
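
For reference, the two ideas in the questions above can be written compactly. Under a first-order Markov assumption, each token depends only on the token immediately before it, so the joint probability of a sequence of length $T$ factorizes as

$$
P(x_1, \ldots, x_T) = P(x_1) \prod_{t=2}^{T} P(x_t \mid x_{t-1}).
$$

Zipf's law states that the frequency $n_i$ of the $i$-th most frequent word is roughly proportional to a power of its rank, with an exponent $\alpha$ and constant $c$ that depend on the corpus:

$$
n_i \propto \frac{1}{i^{\alpha}}
\quad\Longleftrightarrow\quad
\log n_i = -\alpha \log i + c.
$$

On a log-log plot of frequency against rank this appears as an approximately straight line, usually after the first few most frequent words.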

In [1]:
import random
import torch
from d2l import torch as d2l

tokens = d2l.tokenize(d2l.read_time_machine())
# Since each text line is not necessarily a sentence or a paragraph, we
# concatenate all text lines
corpus = [token for line in tokens for token in line]
vocab = d2l.Vocab(corpus)
vocab.token_freqs[:10]

[('the', 2261),
 ('i', 1267),
 ('and', 1245),
 ('of', 1155),
 ('a', 816),
 ('to', 695),
 ('was', 552),
 ('in', 541),
 ('that', 443),
 ('my', 440)]
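
One rough way to check whether unigrams, bigrams, and trigrams follow Zipf's law is to count n-grams and fit a straight line to log frequency against log rank. The sketch below uses only the standard library; `ngram_freqs` and `loglog_slope` are hypothetical helper names, and fitting over all ranks is a simplification, since the law typically holds only after the first few words.

```python
import math
from collections import Counter


def ngram_freqs(tokens, n):
    """Count n-grams (as tuples) and return their frequencies, high to low."""
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return sorted(Counter(ngrams).values(), reverse=True)


def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two frequencies, otherwise the variance is zero.
    """
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    return cov / var


# Toy corpus (hypothetical); on the real data, pass `corpus` from the cell above
toy = "the cat sat on the mat and the cat ran".split()
for n in (1, 2, 3):
    print(n, ngram_freqs(toy, n)[:3])
```

A slope close to a negative constant on the real corpus would be consistent with Zipf's law, with the magnitude of the slope playing the role of the exponent.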

### Random Sampling

In [4]:
def seq_data_iter_random(corpus, batch_size, num_steps):  #@save
    """Generate a minibatch of subsequences using random sampling."""
    # Start with a random offset (inclusive of `num_steps - 1`) to partition a
    # sequence
    corpus = corpus[random.randint(0, num_steps - 1):]
    # Subtract 1 since we need to account for labels
    num_subseqs = (len(corpus) - 1) // num_steps
    # The starting indices for subsequences of length `num_steps`
    initial_indices = list(range(0, num_subseqs * num_steps, num_steps))
    # In random sampling, the subsequences from two adjacent random
    # minibatches during iteration are not necessarily adjacent on the
    # original sequence
    random.shuffle(initial_indices)

    def data(pos):
        # Return a sequence of length `num_steps` starting from `pos`
        return corpus[pos:pos + num_steps]

    num_batches = num_subseqs // batch_size
    for i in range(0, batch_size * num_batches, batch_size):
        # Here, `initial_indices` contains randomized starting indices for
        # subsequences
        initial_indices_per_batch = initial_indices[i:i + batch_size]
        X = [data(j) for j in initial_indices_per_batch]
        Y = [data(j + 1) for j in initial_indices_per_batch]
        yield torch.tensor(X), torch.tensor(Y)

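Example of how random sampling works on a toy corpus of the integers 0 to 34. Each row of `Y` is the matching row of `X` shifted by one token, and subsequences in the same or adjacent minibatches need not be adjacent in the original sequence.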

In [42]:
my_seq = list(range(35))
for X, Y in seq_data_iter_random(my_seq, batch_size=4, num_steps=3):
    print('X: ', X, '\nY:', Y)

X:  tensor([[20, 21, 22],
        [ 5,  6,  7],
        [17, 18, 19],
        [23, 24, 25]]) 
Y: tensor([[21, 22, 23],
        [ 6,  7,  8],
        [18, 19, 20],
        [24, 25, 26]])
X:  tensor([[ 2,  3,  4],
        [29, 30, 31],
        [14, 15, 16],
        [26, 27, 28]]) 
Y: tensor([[ 3,  4,  5],
        [30, 31, 32],
        [15, 16, 17],
        [27, 28, 29]])
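
The bookkeeping behind this output can be sketched in plain Python without `torch`. The helper below, `random_sampling_plan` (a hypothetical name), reproduces the arithmetic of `seq_data_iter_random` for a fixed offset instead of a random one. The printed run above is consistent with an offset of 2, since every subsequence starts at a position of the form `2 + 3k`.

```python
def random_sampling_plan(corpus_len, batch_size, num_steps, offset):
    """Return (num_subseqs, num_batches, starts) for a fixed offset.

    `starts` are positions in the original corpus, before shuffling.
    """
    remaining = corpus_len - offset
    # Subtract 1 because the last token can only serve as a label
    num_subseqs = (remaining - 1) // num_steps
    starts = [offset + i * num_steps for i in range(num_subseqs)]
    num_batches = num_subseqs // batch_size
    return num_subseqs, num_batches, starts


num_subseqs, num_batches, starts = random_sampling_plan(
    corpus_len=35, batch_size=4, num_steps=3, offset=2)
print(num_subseqs, num_batches, starts)
```

With these settings there are 10 candidate subsequences but only 2 full minibatches of 4, so 2 subsequences are dropped after shuffling. In the run shown above, the dropped ones start at 8 and 11.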


### Sequential Partitioning

In [43]:
def seq_data_iter_sequential(corpus, batch_size, num_steps):  #@save
    """Generate a minibatch of subsequences using sequential partitioning."""
    # Start with a random offset to partition a sequence
    offset = random.randint(0, num_steps)
    num_tokens = ((len(corpus) - offset - 1) // batch_size) * batch_size
    Xs = torch.tensor(corpus[offset:offset + num_tokens])
    Ys = torch.tensor(corpus[offset + 1:offset + 1 + num_tokens])
    Xs, Ys = Xs.reshape(batch_size, -1), Ys.reshape(batch_size, -1)
    num_batches = Xs.shape[1] // num_steps
    for i in range(0, num_steps * num_batches, num_steps):
        X = Xs[:, i:i + num_steps]
        Y = Ys[:, i:i + num_steps]
        yield X, Y

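Example of how sequential partitioning works on the same toy corpus. Here, row `i` of each minibatch continues directly from row `i` of the previous minibatch in the original sequence.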

In [44]:
for X, Y in seq_data_iter_sequential(my_seq, batch_size=3, num_steps=5):
    print('X: ', X, '\nY:', Y)

X:  tensor([[ 3,  4,  5,  6,  7],
        [13, 14, 15, 16, 17],
        [23, 24, 25, 26, 27]]) 
Y: tensor([[ 4,  5,  6,  7,  8],
        [14, 15, 16, 17, 18],
        [24, 25, 26, 27, 28]])
X:  tensor([[ 8,  9, 10, 11, 12],
        [18, 19, 20, 21, 22],
        [28, 29, 30, 31, 32]]) 
Y: tensor([[ 9, 10, 11, 12, 13],
        [19, 20, 21, 22, 23],
        [29, 30, 31, 32, 33]])
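
The adjacency property can be checked with a plain-Python variant of the same partitioning logic. The function `sequential_batches` below is a hypothetical simplification: it takes a fixed `offset` instead of drawing one at random and returns nested lists instead of tensors. The printed run above is consistent with an offset of 3.

```python
def sequential_batches(corpus, batch_size, num_steps, offset):
    """Return a list of (X, Y) minibatches as nested lists."""
    # Keep a number of tokens divisible by `batch_size`, leaving one for labels
    num_tokens = ((len(corpus) - offset - 1) // batch_size) * batch_size
    row_len = num_tokens // batch_size
    # Each row is one contiguous stretch of the corpus
    xs = [corpus[offset + r * row_len:offset + (r + 1) * row_len]
          for r in range(batch_size)]
    ys = [corpus[offset + 1 + r * row_len:offset + 1 + (r + 1) * row_len]
          for r in range(batch_size)]
    batches = []
    for i in range(0, (row_len // num_steps) * num_steps, num_steps):
        X = [row[i:i + num_steps] for row in xs]
        Y = [row[i:i + num_steps] for row in ys]
        batches.append((X, Y))
    return batches


batches = sequential_batches(list(range(35)), batch_size=3, num_steps=5, offset=3)
(X1, _), (X2, _) = batches
# Row r of the second minibatch picks up where row r of the first stopped
for row1, row2 in zip(X1, X2):
    print(row1[-1], '->', row2[0])
```

Because each row continues across minibatches, a recurrent model's hidden state for a row can be carried over from one minibatch to the next. Random sampling does not allow this.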


In [45]:
class SeqDataLoader:  #@save
    """An iterator to load sequence data."""
    def __init__(self, batch_size, num_steps, use_random_iter, max_tokens):
        if use_random_iter:
            self.data_iter_fn = d2l.seq_data_iter_random
        else:
            self.data_iter_fn = d2l.seq_data_iter_sequential
        self.corpus, self.vocab = d2l.load_corpus_time_machine(max_tokens)
        self.batch_size, self.num_steps = batch_size, num_steps

    def __iter__(self):
        return self.data_iter_fn(self.corpus, self.batch_size, self.num_steps)

In [46]:
def load_data_time_machine(batch_size, num_steps,  #@save
                           use_random_iter=False, max_tokens=10000):
    """Return the iterator and the vocabulary of the time machine dataset."""
    data_iter = SeqDataLoader(batch_size, num_steps, use_random_iter,
                              max_tokens)
    return data_iter, data_iter.vocab