# N-Gram Model
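In the previous lesson we covered word embeddings and how they are obtained. Now we look at how word embeddings can be used to train a language model. We start with the principle behind the N-Gram model and the problem it is meant to solve.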

In a sentence, word order matters a great deal, so we can ask whether the next word can be predicted from the words that come before it. For example, given 'I lived in France for 10 years, I can speak _', we can predict that the missing word is French.


For a sentence T made up of the n words $w_1, w_2, \cdots, w_n$, the probability of the sentence is

$$
P(T) = P(w_1)P(w_2 | w_1)P(w_3 |w_2 w_1) \cdots P(w_n |w_{n-1} w_{n-2}\cdots w_2w_1)
$$

We can simplify this model further. A word does not need to be conditioned on every word that precedes it; we can assume that it depends only on the few words immediately before it. This is the Markov assumption.
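
As a worked illustration (the symbol $N$ for the window size is introduced here and does not appear in the text above), the Markov assumption replaces each full-history factor with one that looks back only $N-1$ words:

$$
P(w_i | w_{i-1} w_{i-2} \cdots w_1) \approx P(w_i | w_{i-1} w_{i-2} \cdots w_{i-N+1})
$$

For $N = 3$, which corresponds to predicting a word from the two words before it, the sentence probability becomes

$$
P(T) \approx P(w_1)P(w_2 | w_1) \prod_{i=3}^{n} P(w_i | w_{i-1} w_{i-2})
$$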


To estimate these conditional probabilities, the traditional method counts how often each word sequence occurs in the corpus and derives the conditional probability from those counts using Bayes' theorem. Here we can instead represent the words with word embeddings and use a neural network to compute the conditional probability. Maximizing this conditional probability during training both updates the word embeddings and lets the model predict a word from the computed probabilities.
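
The count-based estimate can be sketched as follows. This is a minimal illustration, assuming a toy corpus invented for the example and a hypothetical helper `cond_prob`; it estimates $P(w_3 | w_1 w_2)$ as the number of times the trigram occurs divided by the number of times its two-word context occurs.

```python
from collections import Counter

# Toy corpus, invented for illustration only
tokens = "the cat sat on the mat the cat sat on the sofa the cat ran".split()

# Count every run of three consecutive words
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))

# Count how often each two-word context is followed by some word
context_counts = Counter()
for (w1, w2, _w3), count in trigram_counts.items():
    context_counts[(w1, w2)] += count


def cond_prob(w1, w2, w3):
    """Estimate P(w3 | w1, w2) as count(w1 w2 w3) / count(w1 w2)."""
    if context_counts[(w1, w2)] == 0:
        return 0.0  # context never seen in the corpus
    return trigram_counts[(w1, w2, w3)] / context_counts[(w1, w2)]


print(cond_prob("the", "cat", "sat"))  # "the cat" is followed by "sat" in 2 of 3 cases
print(cond_prob("on", "the", "mat"))   # "on the" is followed by "mat" in 1 of 2 cases
```

Any trigram that never occurs in the corpus gets probability zero under this estimate, which is one reason to look at learned representations instead.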

Below we walk through the idea directly in code.


In [1]:
CONTEXT_SIZE = 2 # number of preceding words used to predict the next word
EMBEDDING_DIM = 10 # dimension of each word vector
# We use a Shakespeare sonnet as the corpus
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()

`CONTEXT_SIZE` is the number of preceding words used to predict the next word; here we use two. `EMBEDDING_DIM` is the dimension of the word embedding.

Next we build the training set by traversing the entire corpus and grouping every three consecutive words: the first two are the input, and the third is the word to predict.


In [2]:
trigram = [((test_sentence[i], test_sentence[i+1]), test_sentence[i+2])
           for i in range(len(test_sentence)-2)]

In [5]:
# Total amount of data
len(trigram)

113

In [6]:
# Look at the first training sample
trigram[0]

(('When', 'forty'), 'winters')
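
The same construction generalizes to any context size. The sketch below assumes a hypothetical helper named `build_ngrams` (not part of the original notebook) and uses only the first line of the sonnet as a stand-in corpus.

```python
def build_ngrams(tokens, context_size):
    """Return (context_tuple, target) pairs for every window in `tokens`."""
    return [
        (tuple(tokens[i:i + context_size]), tokens[i + context_size])
        for i in range(len(tokens) - context_size)
    ]


tokens = "When forty winters shall besiege thy brow,".split()
pairs = build_ngrams(tokens, context_size=2)

print(pairs[0])    # (('When', 'forty'), 'winters')
print(len(pairs))  # 5
```

With `context_size=2` this reproduces the trigram pairs built above; a larger value yields longer contexts and correspondingly fewer training pairs.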

In [8]:
# Assign an integer index to each distinct word; the word embedding is built on these indices
vocb = set(test_sentence) # use a set to remove duplicate words
word_to_idx = {word: i for i, word in enumerate(vocb)}
idx_to_word = {word_to_idx[word]: word for word in word_to_idx}

In [13]:
word_to_idx

{"'This": 94,
 'And': 71,
 'How': 18,
 'If': 49,
 'Proving': 78,
 'Shall': 48,
 'Then': 33,
 'This': 68,
 'Thy': 75,
 'To': 81,
 'Were': 61,
 'When': 14,
 'Where': 95,
 'Will': 27,
 'a': 21,
 'all': 53,
 'all-eating': 3,
 'an': 15,
 'and': 23,
 'answer': 80,
 'art': 70,
 'asked,': 69,
 'be': 29,
 'beauty': 16,
 "beauty's": 40,
 'being': 79,
 'besiege': 55,
 'blood': 11,
 'brow,': 1,
 'by': 59,
 'child': 8,
 'cold.': 32,
 'couldst': 26,
 'count,': 77,
 'days;': 43,
 'deep': 62,
 "deserv'd": 41,
 'dig': 64,
 "excuse,'": 86,
 'eyes,': 84,
 'fair': 56,
 "feel'st": 44,
 'field,': 9,
 'forty': 46,
 'gazed': 93,
 'held:': 12,
 'his': 89,
 'in': 45,
 'it': 34,
 'lies,': 57,
 'livery': 28,
 'lusty': 65,
 'made': 54,
 'make': 42,
 'mine': 13,
 'more': 83,
 'much': 30,
 'my': 50,
 'new': 92,
 'now,': 25,
 'of': 47,
 'old': 22,
 'old,': 19,
 'on': 74,
 'own': 20,
 'praise': 38,
 'praise.': 96,
 'proud': 5,
 'say,': 63,
 'see': 58,
 'shall': 87,
 'shame,': 90,
 'small': 31,
 'so': 67,
 'succession'

From the output above you can see that every distinct word is mapped to its own integer index.
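
One caveat: a Python `set` of strings has no guaranteed iteration order, so the exact indices can differ from run to run. A minimal sketch of a reproducible alternative, assuming it is acceptable to sort the vocabulary (the short `tokens` list is a stand-in for `test_sentence`, and the variable names are illustrative):

```python
tokens = "thy beauty lies where all thy beauty lies".split()

# Sorting the distinct words fixes the word -> index mapping across runs
vocab = sorted(set(tokens))
word_to_idx = {word: i for i, word in enumerate(vocab)}
idx_to_word = {i: word for word, i in word_to_idx.items()}

print(word_to_idx)
# {'all': 0, 'beauty': 1, 'lies': 2, 'thy': 3, 'where': 4}
```

A stable mapping matters mainly when a trained model is saved and reloaded, since the embedding rows are tied to these indices.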

Next we define the model. Its input is the two preceding words, and its output is a score for every word in the vocabulary, from which the probability of each candidate next word is obtained.


In [14]:
import torch
from torch import nn
import torch.nn.functional as F
from torch.autograd import Variable

In [16]:
# Define model
class n_gram(nn.Module):
    def __init__(self, vocab_size, context_size=CONTEXT_SIZE, n_dim=EMBEDDING_DIM):
        super(n_gram, self).__init__()
        
        self.embed = nn.Embedding(vocab_size, n_dim)
        self.classify = nn.Sequential(
            nn.Linear(context_size * n_dim, 128),
            nn.ReLU(True),
            nn.Linear(128, vocab_size)
        )
        
    def forward(self, x):
        voc_embed = self.embed(x) # look up the word embeddings
        voc_embed = voc_embed.view(1, -1) # concatenate the two word vectors into one row
        out = self.classify(voc_embed)
        return out
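
To make the reshaping step in `forward` concrete, the sketch below traces the tensor shapes for two context words. The vocabulary size of 100 and the word indices are arbitrary values chosen for illustration.

```python
import torch
from torch import nn

# Sizes chosen for illustration; 10 matches EMBEDDING_DIM above
embed = nn.Embedding(num_embeddings=100, embedding_dim=10)
x = torch.LongTensor([3, 41])  # indices of two context words (arbitrary)

vectors = embed(x)          # one 10-dimensional vector per context word
flat = vectors.view(1, -1)  # both vectors laid side by side in a single row

print(vectors.shape)  # torch.Size([2, 10])
print(flat.shape)     # torch.Size([1, 20])
```

The flattened width of 20 is `context_size * n_dim`, which is why the first `nn.Linear` layer takes that many input features.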

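The model's output scores define a conditional probability distribution over the vocabulary, so predicting the next word is equivalent to a classification problem, and cross entropy is a natural way to measure the error. `nn.CrossEntropyLoss` applies the softmax internally, which is why the model returns raw scores rather than probabilities.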
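
For a single sample, the cross-entropy loss is the negative log of the softmax probability assigned to the correct word. The following is a plain-Python sketch of that quantity, not the library's implementation; the function name `cross_entropy` and the scores are made up for the example.

```python
import math


def cross_entropy(logits, label):
    """Return -log(softmax(logits)[label]), computed stably via log-sum-exp."""
    m = max(logits)  # subtract the maximum to avoid overflow in exp
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


logits = [2.0, 0.5, -1.0]  # made-up scores for a 3-word vocabulary

print(cross_entropy(logits, 0))  # small loss: the correct word has the highest score
print(cross_entropy(logits, 2))  # larger loss: the correct word has the lowest score
```

Minimizing this loss pushes up the score of the observed next word relative to all other words in the vocabulary.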


In [49]:
net = n_gram(len(word_to_idx))

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=1e-2, weight_decay=1e-5)

In [71]:
for e in range(100):
    train_loss = 0
    for word, label in trigram: # iterate over every trigram in the training set
        word = Variable(torch.LongTensor([word_to_idx[i] for i in word])) # the two context words as input
        label = Variable(torch.LongTensor([word_to_idx[label]]))
        # forward pass
        out = net(word)
        loss = criterion(out, label)
        train_loss += loss.item()
        # backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    if (e + 1) % 20 == 0:
        print('epoch: {}, Loss: {:.6f}'.format(e + 1, train_loss / len(trigram)))

epoch: 20, Loss: 0.088273
epoch: 40, Loss: 0.065301
epoch: 60, Loss: 0.057113
epoch: 80, Loss: 0.052442
epoch: 100, Loss: 0.049236


Finally we can test the results


In [74]:
net = net.eval()

In [76]:
# test the results
word, label = trigram[19]
print('input: {}'.format(word))
print('label: {}'.format(label))
print()
word = Variable(torch.LongTensor([word_to_idx[i] for i in word]))
out = net(word)
pred_label_idx = out.max(1)[1].item()
predict_word = idx_to_word[pred_label_idx]
print('real word is {}, predicted word is {}'.format(label, predict_word))

input: ('so', 'gazed')
label: on

real word is on, predicted word is on


In [77]:
word, label = trigram[75]
print('input: {}'.format(word))
print('label: {}'.format(label))
print()
word = Variable(torch.LongTensor([word_to_idx[i] for i in word]))
out = net(word)
pred_label_idx = out.max(1)[1].item()
predict_word = idx_to_word[pred_label_idx]
print('real word is {}, predicted word is {}'.format(label, predict_word))

input: ("'This", 'fair')
label: child

real word is child, predicted word is child


As we can see, the network predicts these training examples correctly, but the corpus here is far too small, so the model overfits very easily.
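
One rough way to see the overfitting is to keep some pairs out of training and evaluate on them afterwards. The sketch below assumes a hypothetical helper `split_pairs` and uses dummy pairs as stand-in data; it only shows the split, not the training or evaluation.

```python
import random


def split_pairs(pairs, held_out_fraction=0.2, seed=0):
    """Shuffle a copy of `pairs` and split it into (train, held_out) lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_held_out = int(len(shuffled) * held_out_fraction)
    return shuffled[n_held_out:], shuffled[:n_held_out]


# Stand-in data: 113 dummy (context, target) pairs, the same count as `trigram`
pairs = [((f"w{i}", f"w{i + 1}"), f"w{i + 2}") for i in range(113)]
train, held_out = split_pairs(pairs)

print(len(train), len(held_out))  # 91 22
```

Because neighbouring trigrams share words, a split like this is only a coarse check, but a large gap between training and held-out accuracy would still point to overfitting.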

In the next lesson we will look at how RNNs are applied to natural language processing.
