# Part-of-Speech Prediction with LSTM
Earlier we covered word embeddings and n-gram models for word prediction, but we have not yet used an RNN. In this final lesson we combine those prerequisites and show how to use an LSTM for part-of-speech prediction.

## Model Introduction
A word can have different parts of speech. One clue is the suffix: a word ending in -ly, for example, is very likely an adverb. In addition, the same spelling can carry more than one part of speech. The word book, for example, can be either a noun or a verb, so deciding which one applies requires looking at the surrounding text.

To address this, we can use an LSTM model. First, a single word can itself be treated as a sequence. For example, apple consists of the five characters a, p, p, l, e, which form a sequence of length 5. We build an embedding for these characters and feed it to an LSTM. Just as when an LSTM is used for image classification, only the last output is used as the result. The character string of the whole word thus forms a memory feature that helps us predict the part of speech.

![](https://ws3.sinaimg.cn/large/006tKfTcgy1fmxi67w0f7j30ap05qq2u.jpg)

We then place the word together with the words before it in a sequence, build word embeddings for these words, and feed them to a second LSTM whose output is the word's part of speech. In other words, each word is classified using information from the words that precede it.

Below we walk through a simple example.


In [1]:
import torch
from torch import nn
from torch.autograd import Variable

We use the following simple training set


In [2]:
training_data = [("The dog ate the apple".split(),
                  ["DET", "NN", "V", "DET", "NN"]),
                 ("Everybody read that book".split(), 
                  ["NN", "V", "DET", "NN"])]

Next we need to encode the words and tags.


In [3]:
word_to_idx = {}
tag_to_idx = {}
for context, tag in training_data:
    for word in context:
        if word.lower() not in word_to_idx:
            word_to_idx[word.lower()] = len(word_to_idx)
    for label in tag:
        if label.lower() not in tag_to_idx:
            tag_to_idx[label.lower()] = len(tag_to_idx)

In [4]:
word_to_idx

{'apple': 3,
 'ate': 2,
 'book': 7,
 'dog': 1,
 'everybody': 4,
 'read': 5,
 'that': 6,
 'the': 0}

In [5]:
tag_to_idx

{'det': 0, 'nn': 1, 'v': 2}

Then we encode the letters


In [6]:
alphabet = 'abcdefghijklmnopqrstuvwxyz'
char_to_idx = {}
for i in range(len(alphabet)):
    char_to_idx[alphabet[i]] = i

In [7]:
char_to_idx

{'a': 0,
 'b': 1,
 'c': 2,
 'd': 3,
 'e': 4,
 'f': 5,
 'g': 6,
 'h': 7,
 'i': 8,
 'j': 9,
 'k': 10,
 'l': 11,
 'm': 12,
 'n': 13,
 'o': 14,
 'p': 15,
 'q': 16,
 'r': 17,
 's': 18,
 't': 19,
 'u': 20,
 'v': 21,
 'w': 22,
 'x': 23,
 'y': 24,
 'z': 25}

Then we can build the training data


In [8]:
def make_sequence(x, dic): # map each item of x (characters or words) to its index in dic
    idx = [dic[i.lower()] for i in x]
    idx = torch.LongTensor(idx)
    return idx

In [9]:
make_sequence('apple', char_to_idx)


  0
 15
 15
 11
  4
[torch.LongTensor of size 5]

In [10]:
training_data[1][0]

['Everybody', 'read', 'that', 'book']

In [11]:
make_sequence(training_data[1][0], word_to_idx)


 4
 5
 6
 7
[torch.LongTensor of size 4]

Build the character-level LSTM model


In [12]:
class char_lstm(nn.Module):
    def __init__(self, n_char, char_dim, char_hidden):
        super(char_lstm, self).__init__()
        
        self.char_embed = nn.Embedding(n_char, char_dim)
        self.lstm = nn.LSTM(char_dim, char_hidden)
        
    def forward(self, x):
        x = self.char_embed(x)
        out, _ = self.lstm(x)
        return out[-1] # (batch, hidden)
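
As a quick shape check, the following is a minimal sketch of the same character-level step written with bare layers instead of the class. The sizes (26 letters, embedding size 10, hidden size 50) are illustrative assumptions, and the character indices are computed with `ord` rather than `char_to_idx` so the snippet runs on its own.

```python
import torch
from torch import nn

# Indices of the characters in "apple", with a = 0, ..., z = 25.
chars = torch.tensor([ord(c) - ord('a') for c in "apple"])  # shape (5,)

embed = nn.Embedding(26, 10)  # illustrative sizes: 26 letters, 10-dim embedding
lstm = nn.LSTM(10, 50)        # illustrative hidden size of 50

x = embed(chars.unsqueeze(1))  # (seq=5, batch=1, 10)
out, _ = lstm(x)               # (seq=5, batch=1, 50)
word_feature = out[-1]         # last time step only: (batch=1, 50)
print(out.shape, word_feature.shape)
```

Whatever the length of the word, the last time step yields one fixed-size vector per word, which is what the word-level model consumes.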

Build the LSTM model for part-of-speech classification


In [13]:
class lstm_tagger(nn.Module):
    def __init__(self, n_word, n_char, char_dim, word_dim, 
                 char_hidden, word_hidden, n_tag):
        super(lstm_tagger, self).__init__()
        self.word_embed = nn.Embedding(n_word, word_dim)
        self.char_lstm = char_lstm(n_char, char_dim, char_hidden)
        self.word_lstm = nn.LSTM(word_dim + char_hidden, word_hidden)
        self.classify = nn.Linear(word_hidden, n_tag)
        
    def forward(self, x, word):
        char = []
        for w in word: # run the character LSTM on each word
            char_list = make_sequence(w, char_to_idx)
            char_list = char_list.unsqueeze(1) # (seq, batch); after embedding this is (seq, batch, feature), the layout nn.LSTM expects
            char_infor = self.char_lstm(Variable(char_list)) # (batch, char_hidden)
            char.append(char_infor)
        char = torch.stack(char, dim=0) # (seq, batch, feature)
        
        x = self.word_embed(x) # (batch, seq, word_dim)
        x = x.permute(1, 0, 2) # reorder to (seq, batch, word_dim)
        x = torch.cat((x, char), dim=2) # concatenate each word's embedding with its character LSTM output along the feature dimension
        x, _ = self.word_lstm(x)
        
        s, b, h = x.shape
        x = x.view(-1, h) # flatten to (seq * batch, hidden) for the linear classifier
        out = self.classify(x)
        return out
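
To make the tensor shapes in `forward` easier to follow, here is a minimal sketch that traces them with random stand-in tensors. The sizes match the ones used when the network is created below (word_dim 100, char_hidden 50, word_hidden 128, 3 tags); the random inputs and freshly created layers are assumptions for illustration only, not the trained model.

```python
import torch
from torch import nn

seq_len, word_dim, char_hidden, word_hidden, n_tag = 4, 100, 50, 128, 3

word_vecs = torch.randn(1, seq_len, word_dim)      # stand-in for the word_embed output
char_feats = torch.randn(seq_len, 1, char_hidden)  # stand-in for the stacked char LSTM outputs

x = word_vecs.permute(1, 0, 2)         # (4, 1, 100): sequence dimension first
x = torch.cat((x, char_feats), dim=2)  # (4, 1, 150): features joined per word
x, _ = nn.LSTM(word_dim + char_hidden, word_hidden)(x)  # (4, 1, 128)
scores = nn.Linear(word_hidden, n_tag)(x.view(-1, word_hidden))  # (4, 3)
print(scores.shape)
```

The result has one row per word and one column per tag, which is the layout the loss function and the final prediction step work with.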

In [14]:
net = lstm_tagger(len(word_to_idx), len(char_to_idx), 10, 100, 50, 128, len(tag_to_idx))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=1e-2)
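
The network outputs raw scores with no softmax because `nn.CrossEntropyLoss` applies log-softmax internally. The sketch below checks this on made-up scores and targets (both are assumptions for illustration) by comparing it with an explicit `log_softmax` followed by `nll_loss`.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
scores = torch.randn(4, 3)           # made-up raw scores: 4 words, 3 tags
target = torch.tensor([1, 2, 0, 1])  # made-up tag indices

loss_a = nn.CrossEntropyLoss()(scores, target)
loss_b = F.nll_loss(F.log_softmax(scores, dim=1), target)
print(torch.allclose(loss_a, loss_b))
```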

In [15]:
# start training
for e in range(300):
    train_loss = 0
    for word, tag in training_data:
        word_list = make_sequence(word, word_to_idx).unsqueeze(0) # add a leading batch dimension
        tag = make_sequence(tag, tag_to_idx)
        word_list = Variable(word_list)
        tag = Variable(tag)
        # forward pass
        out = net(word_list, word)
        loss = criterion(out, tag)
        train_loss += loss.item() # on PyTorch versions before 0.4, use loss.data[0]
        # backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    if (e + 1) % 50 == 0:
        print('Epoch: {}, Loss: {:.5f}'.format(e + 1, train_loss / len(training_data)))

Epoch: 50, Loss: 0.86690
Epoch: 100, Loss: 0.65471
Epoch: 150, Loss: 0.45582
Epoch: 200, Loss: 0.30351
Epoch: 250, Loss: 0.20446
Epoch: 300, Loss: 0.14376


Finally, we can look at the predicted results.


In [19]:
net = net.eval()

In [25]:
test_sent = 'Everybody ate the apple'
test = make_sequence(test_sent.split(), word_to_idx).unsqueeze(0)
out = net(Variable(test), test_sent.split())

In [27]:
print(out)

Variable containing:
-1.2148  1.9048 -0.6570
-0.9272 -0.4441  1.4009
 1.6425 -0.7751 -1.1553
-0.6121  1.6036 -1.1280
[torch.FloatTensor of size 4x3]



In [28]:
print(tag_to_idx)

{'det': 0, 'nn': 1, 'v': 2}


The output is shown above. Because the final linear layer is not followed by a softmax, the values are not probabilities, but the largest value in each row still indicates the predicted class. The first word 'Everybody' is tagged nn, the second word 'ate' is tagged v, the third word 'the' is tagged det, and the fourth word 'apple' is tagged nn, so all four predictions are correct.
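
Reading the tags off by eye can be replaced with a row-wise argmax. The sketch below copies the scores from the printed output above into a tensor so that it runs on its own. `idx_to_tag` is a hypothetical helper introduced here (the inverse of `tag_to_idx`), and the softmax line is optional, included only to show how the scores could be turned into probabilities.

```python
import torch

tag_to_idx = {'det': 0, 'nn': 1, 'v': 2}
idx_to_tag = {i: t for t, i in tag_to_idx.items()}  # hypothetical helper: index -> tag

# Scores copied from the printed output above, one row per word.
scores = torch.tensor([[-1.2148,  1.9048, -0.6570],
                       [-0.9272, -0.4441,  1.4009],
                       [ 1.6425, -0.7751, -1.1553],
                       [-0.6121,  1.6036, -1.1280]])

pred = scores.argmax(dim=1)                    # index of the largest score in each row
tags = [idx_to_tag[i] for i in pred.tolist()]  # map indices back to tag names
probs = torch.softmax(scores, dim=1)           # optional: each row now sums to 1
print(tags)  # ['nn', 'v', 'det', 'nn']
```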
