<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#原始语料" data-toc-modified-id="原始语料-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Raw corpus</a></span><ul class="toc-item"><li><span><a href="#文本预处理" data-toc-modified-id="文本预处理-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Text preprocessing</a></span></li><li><span><a href="#创建词汇表" data-toc-modified-id="创建词汇表-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Building the vocabulary</a></span></li><li><span><a href="#文本转换成数值" data-toc-modified-id="文本转换成数值-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Converting text to integers</a></span></li><li><span><a href="#标签转换为数值" data-toc-modified-id="标签转换为数值-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Converting labels to integers</a></span></li><li><span><a href="#删除空文本" data-toc-modified-id="删除空文本-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Removing empty reviews</a></span></li></ul></li><li><span><a href="#输入数据" data-toc-modified-id="输入数据-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Input data</a></span><ul class="toc-item"><li><span><a href="#输入向量等长处理" data-toc-modified-id="输入向量等长处理-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Padding and truncating to a fixed length</a></span></li><li><span><a href="#拆分数据集" data-toc-modified-id="拆分数据集-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Splitting the dataset</a></span></li><li><span><a href="#创建数据管道" data-toc-modified-id="创建数据管道-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Building the data pipeline</a></span></li></ul></li><li><span><a href="#创建分类模型" data-toc-modified-id="创建分类模型-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Building the classification model</a></span><ul class="toc-item"><li><span><a href="#定义模型" data-toc-modified-id="定义模型-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Defining the model</a></span></li><li><span><a href="#训练模型" data-toc-modified-id="训练模型-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Training the model</a></span></li><li><span><a href="#评估模型性能" data-toc-modified-id="评估模型性能-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Evaluating model performance</a></span></li></ul></li><li><span><a href="#利用模型进行预测" 
data-toc-modified-id="利用模型进行预测-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Making predictions with the model</a></span></li></ul></div>

In [2]:
import string
from collections import Counter
import numpy as np


# Raw corpus

In [3]:
with open('datasets/reviews.txt', 'r') as f:
    reviews = f.read()
with open('datasets/labels.txt', 'r') as f:
    labels = f.read()

print("Example of reviews:")
print(reviews[:1000])
print("=" * 80)
print("Labels:")
print(labels[:20])

Example of reviews:
bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   
story of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orches

## Text preprocessing

In [4]:
# Convert to lowercase
reviews = reviews.lower()

# Remove punctuation
all_text = ''.join([char for char in reviews if char not in string.punctuation])

# Split into a list with one review per line
reviews_split = all_text.split('\n')
print("No. of reviews:", len(reviews_split))


reviews_split[0], reviews_split[-1]

No. of reviews: 25001


('bromwell high is a cartoon comedy  it ran at the same time as some other programs about school life  such as  teachers   my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers   the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students  when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled          at           high  a classic line inspector i  m here to sack one of your teachers  student welcome to bromwell high  i expect that many adults of my age think that bromwell high is far fetched  what a pity that it isn  t   ',
 '')
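As a small self-contained illustration of the cleaning steps above, the sketch below applies the same lowercase, strip-punctuation, split-on-newline sequence to a made-up string. The names `sample` and `sample_split` are hypothetical and the text is not taken from the dataset. It also shows where the empty last element comes from: a trailing newline produces one extra empty entry after splitting.

```python
import string

# Hypothetical two-review corpus; the trailing "\n" stands in for a file
# that ends with a newline
sample = "Great movie, loved it!\nTerrible... just terrible.\n"

# Lowercase, then drop every punctuation character
cleaned = ''.join(ch for ch in sample.lower() if ch not in string.punctuation)

# One entry per line; the trailing newline yields a final empty string
sample_split = cleaned.split('\n')

print(sample_split)
# ['great movie loved it', 'terrible just terrible', '']
```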

## Building the vocabulary

In [5]:
# Flat list of every word in the corpus
all_text = ' '.join(reviews_split)
words = all_text.split()  
print(words[:20])

['bromwell', 'high', 'is', 'a', 'cartoon', 'comedy', 'it', 'ran', 'at', 'the', 'same', 'time', 'as', 'some', 'other', 'programs', 'about', 'school', 'life', 'such']


In [6]:
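# Build the vocabulary and the word-to-index dictionary

counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)  # sort by frequency, most frequent first
# Indices start at 1 so that 0 stays free for padding
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}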
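To make the mapping concrete, here is a minimal sketch on a toy word list. The `toy_*` names are hypothetical stand-ins for `words`, `counts`, `vocab` and `vocab_to_int`. The most frequent word receives index 1, and no word is ever mapped to 0.

```python
from collections import Counter

# Toy word list standing in for `words`
toy_words = "the movie was good the acting was good the end".split()

toy_counts = Counter(toy_words)
# Most frequent word first
toy_vocab = sorted(toy_counts, key=toy_counts.get, reverse=True)
# Enumerate from 1, leaving 0 unused
toy_vocab_to_int = {word: ii for ii, word in enumerate(toy_vocab, 1)}

print(toy_vocab_to_int['the'])            # 1
print(min(toy_vocab_to_int.values()))     # 1
print(len(toy_vocab_to_int))              # 6
```

Because 0 is never assigned, an embedding table that covers every index, including the padding value, needs `len(vocab_to_int) + 1` rows.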


## Converting text to integers

In [7]:
# Vectorize the text: each review becomes a list of word indices
reviews_ints = []
for review in reviews_split:
    reviews_ints.append([vocab_to_int[word] for word in review.split()])

## Converting labels to integers

In [8]:
# Encode labels as classes: 'positive' -> 1, anything else -> 0
labels_split = labels.split('\n')
encoded_labels = np.array(
    [1 if label == 'positive' else 0 for label in labels_split])

labels_split[-1]  # preprocessing is incomplete: the last entry is an empty string

''
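The empty final entry matters because the encoding rule maps anything that is not `'positive'` to 0. A minimal sketch with made-up label text (`toy_labels` is hypothetical) shows the stray entry being silently encoded as a negative label:

```python
import numpy as np

# Hypothetical label file contents, ending with a newline
toy_labels = "positive\nnegative\npositive\n"

toy_split = toy_labels.split('\n')
toy_encoded = np.array([1 if label == 'positive' else 0 for label in toy_split])

print(toy_split)     # ['positive', 'negative', 'positive', '']
print(toy_encoded)   # [1 0 1 0]
```

The fourth value does not correspond to any review, which is why the matching empty review and its label need to be dropped together.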

## Removing empty reviews

In [9]:
# Distribution of review lengths (in words)
review_lens = Counter([len(x) for x in reviews_ints])
print("Zero-length reviews: {}".format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))

Zero-length reviews: 1
Maximum review length: 2514


In [10]:
# Drop reviews of length 0, together with their labels

print("Number of reviews before removing outliers: ", len(reviews_ints))
non_zero_idx = [
    ii for ii, review in enumerate(reviews_ints) if len(review) != 0
]

reviews_ints = [reviews_ints[ii] for ii in non_zero_idx]
encoded_labels = np.array([encoded_labels[ii] for ii in non_zero_idx])

print("Number of reviews after removing outliers: ", len(reviews_ints))

Number of reviews before removing outliers:  25001
Number of reviews after removing outliers:  25000


# Input data

## Padding and truncating to a fixed length

In [11]:
# Bring every review to the same length:
# truncate reviews that are too long, left-pad short ones with 0

def pad_features(reviews_ints, seq_length):
    features = np.zeros((len(reviews_ints), seq_length), dtype=int)
    for i, row in enumerate(reviews_ints):
        features[i, -len(row):] = np.array(row)[:seq_length]
    return features


seq_length = 200
features = pad_features(reviews_ints, seq_length=seq_length)

assert len(features) == len(reviews_ints)
assert len(features[0]) == seq_length

print(features[:10, :10])

[[    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [22382    42 46418    15   706 17139  3389    47    77    35]
 [ 4505   505    15     3  3342   162  8312  1652     6  4819]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [   54    10    14   116    60   798   552    71   364     5]]
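A toy run makes the two cases easier to see. The sketch below repeats `pad_features` so that it runs on its own and feeds it one short and one long row (`toy` is a made-up input):

```python
import numpy as np

def pad_features(reviews_ints, seq_length):
    features = np.zeros((len(reviews_ints), seq_length), dtype=int)
    for i, row in enumerate(reviews_ints):
        # Short rows fill the right-hand end; long rows keep their first
        # seq_length entries
        features[i, -len(row):] = np.array(row)[:seq_length]
    return features

toy = [[5, 6], [1, 2, 3, 4, 5, 6, 7]]
padded = pad_features(toy, seq_length=4)

print(padded)
# [[0 0 5 6]
#  [1 2 3 4]]
```

An empty row would break this function, since `-len(row)` becomes `-0` and the slice then covers the whole row. That is one reason the zero-length review is removed beforehand.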


## Splitting the dataset

In [12]:
# Split the dataset, (train:val:test)--(0.8:0.1:0.1)

split_frac = 0.8
split_idx = int(len(features) * split_frac)

train_x, remaining_x = features[:split_idx], features[split_idx:]
train_y, remaining_y = encoded_labels[:split_idx], encoded_labels[split_idx:]

test_idx = int(len(remaining_x) * 0.5)
val_x, test_x = remaining_x[:test_idx], remaining_x[test_idx:]
val_y, test_y = remaining_y[:test_idx], remaining_y[test_idx:]

print("Feature Shapes:")
print("Train set:        \t{}".format(train_x.shape),
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set:       \t{}".format(test_x.shape))

Feature Shapes:
Train set:        	(20000, 200) 
Validation set: 	(2500, 200) 
Test set:       	(2500, 200)
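The split above slices the arrays in their stored order. As a hedged alternative sketch, the hypothetical helper `three_way_split` below does the same 0.8 / 0.1 / 0.1 arithmetic but can optionally shuffle indices first. Whether shuffling is needed depends on how the source files are ordered, which this notebook does not state.

```python
import numpy as np

def three_way_split(x, y, train_frac=0.8, seed=None):
    """Hypothetical helper: train/val/test split, remainder halved."""
    idx = np.arange(len(x))
    if seed is not None:
        # Optional shuffle; an assumption, not part of the original split
        idx = np.random.default_rng(seed).permutation(len(x))
    n_train = int(len(x) * train_frac)
    n_val = (len(x) - n_train) // 2
    train, val, test = np.split(idx, [n_train, n_train + n_val])
    return (x[train], y[train]), (x[val], y[val]), (x[test], y[test])

# Toy data: 20 samples with 2 features each
x = np.arange(40).reshape(20, 2)
y = np.arange(20) % 2
(tr_x, tr_y), (va_x, va_y), (te_x, te_y) = three_way_split(x, y, seed=0)

print(tr_x.shape, va_x.shape, te_x.shape)
# (16, 2) (2, 2) (2, 2)
```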


## Building the data pipeline

In [13]:
import torch
from torch.utils.data import TensorDataset, DataLoader

In [14]:
# Wrap the arrays in datasets and batch them with data loaders
train_data = TensorDataset(torch.from_numpy(train_x),
                           torch.from_numpy(train_y))
valid_data = TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))

batch_size = 50

train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)

In [15]:
dataiter = iter(train_loader)
sample_x, sample_y = next(dataiter)  # the iterator's .next() method is not available in recent PyTorch

print("Sample input size: ", sample_x.size())
print("Sample input: \n", sample_x)
print()
print("Sample label size: ", sample_y.size())
print("Sample label: \n", sample_y)

Sample input size:  torch.Size([50, 200])
Sample input: 
 tensor([[    1,   104,     4,  ...,  6934,    22, 35707],
        [    0,     0,     0,  ...,    71,  4592, 27094],
        [    0,     0,     0,  ...,     5,    29,   499],
        ...,
        [  101,   869,    49,  ...,    15,     3,   492],
        [   72,    72,    72,  ...,     5,   550,    22],
        [    1,  1006,   368,  ...,   325,   625,  1915]])

Sample label size:  torch.Size([50])
Sample label: 
 tensor([0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1,
        1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1,
        0, 1])


# Building the classification model
![](images/network_diagram.png)

In [16]:
# Select the device: use the GPU when one is available
train_on_gpu = torch.cuda.is_available()
if train_on_gpu:
    print("Training on GPU.")
else:
    print("No GPU available, training on CPU.")

Training on GPU.


## Defining the model

In [20]:
import torch.nn as nn


class SentimentRNN(nn.Module):
    def __init__(self,
                 vocab_size,
                 output_size,
                 embedding_dim,
                 hidden_dim,
                 n_layers,
                 drop_prob=0.5):
        super(SentimentRNN, self).__init__()
        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim

        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(
            embedding_dim,
            hidden_dim,
            n_layers,
            dropout=drop_prob,
            batch_first=True,  # input shape: batch, seq, feature
        )
        self.dropout = nn.Dropout(0.3)
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()

    def forward(self, x, hidden):
        batch_size = x.size(0)

        x = x.long()  # batch, seq
        embeds = self.embedding(x)  # batch, seq, feature
        lstm_out, hidden = self.lstm(embeds,
                                     hidden)  # lstm_out: batch, seq, hidden

        # .view() requires a tensor that is contiguous in memory;
        # .contiguous() returns such a tensor (copying only if needed)
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        # batch*seq, hidden

        out = self.dropout(lstm_out)  # batch*seq, hidden
        out = self.fc(out)  # batch*seq, 1

        sig_out = self.sig(out)  # batch*seq, 1

        # reshape to be batch_size first
        sig_out = sig_out.view(batch_size, -1)  # batch, seq
        sig_out = sig_out[:, -1]  # keep only the output at the last time step
        return sig_out, hidden

    def init_hidden(self, batch_size):
        weight = next(self.parameters()).data
        if train_on_gpu:
            hidden = (weight.new(self.n_layers, batch_size,
                                 self.hidden_dim).zero_().cuda(),
                      weight.new(self.n_layers, batch_size,
                                 self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size,
                                 self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size,
                                 self.hidden_dim).zero_())
        return hidden

In [21]:
# Model hyperparameters
vocab_size = len(vocab_to_int) + 1
output_size = 1
embedding_dim = 400
hidden_dim = 256
n_layers = 2

# Instantiate the model
net = SentimentRNN(
    vocab_size,
    output_size,
    embedding_dim,
    hidden_dim,
    n_layers,
)
print(net)

SentimentRNN(
  (embedding): Embedding(74073, 400)
  (lstm): LSTM(400, 256, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sig): Sigmoid()
)
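The reshape at the end of `forward` is easy to misread, so here is a minimal sketch of the same indexing using NumPy as a stand-in for tensors (`reshape` plays the role of `.view`). The numbers are made up. The per-step outputs arrive flattened as `(batch*seq, 1)`, are regrouped as `(batch, seq)`, and only the last time step of each sequence is kept.

```python
import numpy as np

batch_size, seq_len = 2, 3

# Stand-in for per-step sigmoid outputs, flattened to (batch*seq, 1):
# the first three rows belong to sequence 0, the next three to sequence 1
flat = np.array([[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]])

per_step = flat.reshape(batch_size, -1)   # shape (2, 3)
last = per_step[:, -1]                    # last time step of each sequence

print(per_step.shape)  # (2, 3)
print(last)            # [0.3 0.9]
```

Since the inputs are left-padded, the last time step always corresponds to a real word rather than padding.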


## Training the model


In [22]:
# Loss function and optimizer
lr = 0.001

criterion = nn.BCELoss()
optimzer = torch.optim.Adam(net.parameters(), lr=lr)
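For intuition about the loss values reported during training, the sketch below computes mean binary cross-entropy by hand with NumPy. The helper `bce` is hypothetical and is a simplified illustration, not PyTorch's implementation (for instance, it omits the clamping of the log terms that `nn.BCELoss` applies for numerical stability).

```python
import numpy as np

def bce(probs, targets):
    """Mean binary cross-entropy over predicted probabilities."""
    probs = np.asarray(probs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    per_sample = -(targets * np.log(probs) + (1 - targets) * np.log(1 - probs))
    return float(np.mean(per_sample))

confident = bce([0.9, 0.1], [1, 0])   # both predictions close to the target
coin_flip = bce([0.5, 0.5], [1, 0])   # uninformative predictions

print(round(confident, 4))  # 0.1054
print(round(coin_flip, 4))  # 0.6931
```

A loss near 0.693 (the natural log of 2) is what a model predicting 0.5 for every sample would score, which gives a useful baseline when reading training logs.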

In [27]:
epochs = 4

counter = 0
print_every = 400
clip = 5

if train_on_gpu:
    net.cuda()

net.train()

for e in range(epochs):
    h = net.init_hidden(batch_size)  # initialize hidden state and cell state

    for inputs, labels in train_loader:
        counter += 1
        if train_on_gpu:
            inputs, labels = inputs.cuda(), labels.cuda()

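        net.zero_grad()  # reset accumulated gradients

        h = tuple([each.data for each in h])  # (h0, c0), detached from the previous batch's graph

        output, h = net(inputs, h)  # forward pass

        loss = criterion(output.squeeze(), labels.float())  # compute the loss

        loss.backward()  # backpropagate to compute gradients

        nn.utils.clip_grad_norm_(net.parameters(), clip)  # clip gradients to prevent exploding gradients

        optimzer.step()  # update the parameters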

        if counter % print_every == 0:

            val_h = net.init_hidden(batch_size)
            val_losses = []

            net.eval()  # evaluation mode

            for inputs, labels in valid_loader:

                # Creating new variables for the hidden state, otherwise
                # we'd backprop through the entire training history
                val_h = tuple([each.data for each in val_h])

                if (train_on_gpu):
                    inputs, labels = inputs.cuda(), labels.cuda()

                output, val_h = net(inputs, val_h)

                val_loss = criterion(output.squeeze(), labels.float())

                val_losses.append(val_loss.item())

            print("Epoch: {}/{}...".format(e + 1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))
            

            net.train()  # switch back to training mode
            

Epoch: 1/4... Step: 400... Loss: 0.148913... Val Loss: 0.549996
Epoch: 2/4... Step: 800... Loss: 0.138612... Val Loss: 0.693075
Epoch: 3/4... Step: 1200... Loss: 0.018518... Val Loss: 0.752316
Epoch: 4/4... Step: 1600... Loss: 0.042832... Val Loss: 0.832552
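The training loop clips gradients by their global norm. As a rough NumPy illustration of that idea (the helper `clip_by_global_norm` is hypothetical and only approximates what `clip_grad_norm_` does), all gradients are rescaled by one common factor whenever their combined L2 norm exceeds the threshold:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm <= max_norm."""
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    # Never scale up: the factor is capped at 1
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# Two made-up gradient arrays with joint norm sqrt(9 + 16 + 144) = 13
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5)
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))

print(norm_before)            # 13.0
print(round(norm_after, 3))   # 5.0
```

The direction of the update is preserved because every gradient is multiplied by the same factor; only the step size shrinks.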


## Evaluating model performance

In [29]:

test_losses = []  # per-batch test losses
num_correct = 0  # number of correct predictions

# init hidden state
h = net.init_hidden(batch_size)

net.eval()  # evaluation mode

for inputs, labels in test_loader:

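    # Detach the hidden state from the previous batch's graph. Note that this
    # does not reset it: the values from the previous batch are carried over.
    h = tuple([each.data for each in h])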

    if (train_on_gpu):
        inputs, labels = inputs.cuda(), labels.cuda()

    # Forward pass
    output, h = net(inputs, h)

    # Compute the loss
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())

    # Convert probabilities to labels
    pred = torch.round(output.squeeze())  # rounds to the nearest integer

    # Count correct predictions
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(
        correct_tensor.numpy()) if not train_on_gpu else np.squeeze(
            correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)

# Mean loss on the test set
print("Test loss: {:.3f}".format(np.mean(test_losses)))

# Accuracy on the test set
test_acc = num_correct / len(test_loader.dataset)
print("Test accuracy: {:.3f}".format(test_acc))

Test loss: 0.868
Test accuracy: 0.792
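The accuracy figure comes from rounding each predicted probability to 0 or 1 and comparing it with the label. A minimal NumPy sketch of that computation on made-up values (`probs` and `toy_targets` are hypothetical):

```python
import numpy as np

probs = np.array([0.92, 0.13, 0.61, 0.40])   # made-up model outputs
toy_targets = np.array([1, 0, 0, 0])         # made-up ground truth

preds = np.round(probs)                      # threshold at 0.5
accuracy = float(np.mean(preds == toy_targets))

print(preds)      # [1. 0. 1. 0.]
print(accuracy)   # 0.75
```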


# Making predictions with the model

In [34]:
test_review_neg = 'The worst movie I have seen; acting was terrible and I want my money back. This movie had bad acting and the dialogue was slow.'

In [35]:
from string import punctuation

# Preprocess the input text
def tokenize_review(test_review):
    test_review = test_review.lower()  # lowercase

    test_text = ''.join([c for c in test_review
                         if c not in punctuation])  # remove punctuation

    test_words = test_text.split()  # split on whitespace

    # Map words to indices (raises KeyError for words outside the vocabulary)
    test_ints = []
    test_ints.append([vocab_to_int[word] for word in test_words])

    return test_ints
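`tokenize_review` looks every word up directly, so a review containing a word that never appeared in the training corpus raises a `KeyError`. One possible workaround, shown here as a sketch with a hypothetical helper `tokenize_review_safe` and a toy vocabulary, is to skip unknown words. Other choices, such as mapping them to a dedicated unknown-word index, would require retraining the model with that index.

```python
from string import punctuation

# Toy vocabulary standing in for `vocab_to_int`
toy_vocab_to_int = {'the': 1, 'movie': 2, 'was': 3, 'bad': 4}

def tokenize_review_safe(test_review, vocab_to_int):
    """Like tokenize_review, but silently drops out-of-vocabulary words."""
    text = ''.join(c for c in test_review.lower() if c not in punctuation)
    return [[vocab_to_int[w] for w in text.split() if w in vocab_to_int]]

tokens = tokenize_review_safe("The movie was unwatchably bad!", toy_vocab_to_int)
print(tokens)
# [[1, 2, 3, 4]]
```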

In [36]:
# Run a prediction

def predict(net, test_review, sequence_length=200):

    net.eval()  # put the model in evaluation mode

    # Convert the text to a padded array of word indices
    test_ints = tokenize_review(test_review)
    seq_length = sequence_length
    features = pad_features(test_ints, seq_length)

    # Convert to a tensor
    feature_tensor = torch.from_numpy(features)

    batch_size = feature_tensor.size(0)

    # Initialize the hidden state
    h = net.init_hidden(batch_size)

    if (train_on_gpu):
        feature_tensor = feature_tensor.cuda()

    # Get the model output
    output, h = net(feature_tensor, h)
    print('Prediction value, pre-rounding: {:.6f}'.format(output.item()))    

    # Convert the predicted probability to a class
    pred = torch.round(output.squeeze())

    if (pred.item() == 1):
        print("Positive review detected!")
    else:
        print("Negative review detected.")

In [37]:
predict(net, test_review_neg, seq_length)

Prediction value, pre-rounding: 0.000148
Negative review detected.
