# Basic PyTorch Text Classification

## Overview
This tutorial is extracted from https://github.com/bentrevett/pytorch-sentiment-analysis. Before proceeding, make sure to install the required libraries (nltk, torch, and torchtext).

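1. Learning goals:
(1) How to learn a new project:
        Follow a book? ① The APIs shown in books often differ a lot from the real ones. ② Books can skip key steps.
        Find a reasonably clear example and modify it (an open-source project). Reading it line by line is not very effective: 1. Get it running first; there may be problems (bugs in your own code, dependency issues). 2. Don't focus on details too early; the function names alone give a rough picture of the architecture, so get the logic straight first and look at details afterwards. 3. Check each step to find where the problems are.
(2) Once it runs, think about how to improve its accuracy.

2. Project structure
    (1) Preprocess the text: load the data, tokenize, and map tokens to IDs (for the embedding lookup), using torchtext
    (2) Define the network: network definition, instantiation, copying in the embeddings, defining the optimizer and loss function
    (3) The training loop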

In [1]:
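import nltk  # install with: pip install nltk. NLTK is a widely used Python NLP toolkit with ready-made algorithms and utilities
nltk.download('punkt')  # download the punkt data package. The original example used spaCy, which had installation problems on Google Cloud, so NLTK is used here instead
# punkt is the sentence-splitting data that word_tokenize depends on. The English word tokenizer also splits contractions: "isn't" becomes "is" and "n't".
# The choice of tokenizer matters little in either Chinese or English; the data matters more. No data cleaning is done here.
# Text scraped from the web contains many stray symbols, and cleaning usually improves classification accuracy, but a cleaner dataset is not always better (emoticons, for example, carry sentiment).
# Different cleaning methods also affect different models differently.
# When cleaning, first remove the obvious errors by hand, then pick an existing strong model (XLNet had the highest accuracy when these notes were written) and experiment on top of it,
# i.e. try out the data-cleaning variants with the best available model.
# A common Chinese tokenizer is jieba; its accuracy is limited, but it runs fast (a trade-off).
# On a CPU this took eight hours; on the lab server it finished in two minutes. Large models should be run on a GPU.
# pudb3 was used for friendlier debugging of Python programs in the terminal.
# PyTorch changes a lot between versions: code that runs well on one machine may not run well, or at all, on another.
# Get it running first, then study what each step does, then look at how to improve accuracy.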


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
from nltk.tokenize import word_tokenize
tokenizer = word_tokenize

In [3]:
import torch
from torchtext.legacy import data  # pip install torchtext; the legacy namespace exists in torchtext 0.9 through 0.11
from torchtext.legacy import datasets

SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

TEXT = data.Field(tokenize = tokenizer, include_lengths = True)  # how to process the text column: which tokenizer to use, and also record each example's length
LABEL = data.LabelField(dtype = torch.float)  # how to process the label column: convert it to torch.float

In [None]:
from torchtext.legacy import datasets

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
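# Load the built-in IMDB dataset. IMDB is a movie review site; the dataset does not contain every review, only highly polarized ones (reviews with a strong sentiment).
# The data comes already split into training and test sets.
# This call downloads the data and tokenizes it.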

In [5]:
import random

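random.seed(SEED)  # random.seed() returns None, so seed first and pass the resulting state explicitly
train_data, valid_data = train_data.split(random_state = random.getstate())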

ValueError: not enough values to unpack (expected 2, got 0)

In [None]:
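MAX_VOCAB_SIZE = 25000  # how many distinct words the vocabulary keeps

TEXT.build_vocab(train_data, 
                 max_size = MAX_VOCAB_SIZE, 
                 vectors = "glove.6B.300d", 
                 unk_init = torch.Tensor.normal_)
# vectors: downloads the pretrained glove.6B.300d word vectors (GloVe, "Global Vectors", from Stanford NLP; trained on a 6-billion-token corpus; 300 dimensions per word).
# max_size: not every word is kept. Words ranked below the 25,000 most frequent are dropped and map to the unknown token.
# unk_init: vocabulary words that have no pretrained vector are initialized randomly from a normal distribution.
LABEL.build_vocab(train_data)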

.vector_cache/glove.6B.zip:  59%|███████▌     | 506M/862M [36:23<06:32, 909kB/s]
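
To make the `max_size` cutoff concrete, here is a minimal pure-Python sketch of the idea behind `build_vocab`: count the tokens, keep the most frequent ones, and send everything else to an unknown token. The helper name `build_vocab_sketch`, the toy sentences, and the alphabetical tie-breaking rule are assumptions made for illustration; this is not torchtext's implementation.

```python
from collections import Counter

def build_vocab_sketch(tokenized_texts, max_size, specials=("<unk>", "<pad>")):
    """Toy vocabulary builder (hypothetical helper, not the torchtext API)."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Special tokens are added on top of max_size, so the final size is max_size + 2.
    itos = list(specials) + [tok for tok, _ in ranked[:max_size]]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

texts = [["a", "great", "film"], ["a", "dull", "film"], ["a", "great", "cast"]]
itos, stoi = build_vocab_sketch(texts, max_size=3)

def encode(tokens):
    # Tokens outside the kept vocabulary fall back to the <unk> index.
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

ids = encode(["a", "great", "dull", "film"])
print(itos)  # ['<unk>', '<pad>', 'a', 'film', 'great']
print(ids)   # [2, 4, 0, 3]
```

The same accounting explains why `len(TEXT.vocab)` ends up slightly larger than `MAX_VOCAB_SIZE`: the unknown and padding tokens are counted in addition to the 25,000 kept words.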

In [7]:
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    sort_within_batch = True,
    device = device)
# BucketIterator: builds the data loaders, batching examples of similar length together to keep padding to a minimum
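
The benefit of batching by length can be illustrated without torchtext. The sketch below counts how many padding tokens are needed when batches are formed in arrival order versus after sorting by length. `padding_needed` and the sequence lengths are made up for illustration, and sorting the whole dataset is a simplification of what `BucketIterator` does, since the real iterator still shuffles.

```python
def padding_needed(lengths, batch_size):
    """Count pad tokens when each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

lengths = [5, 120, 7, 118, 6, 121]  # made-up sequence lengths

unsorted_pad = padding_needed(lengths, batch_size=3)
bucketed_pad = padding_needed(sorted(lengths), batch_size=3)

print(unsorted_pad)  # 346
print(bucketed_pad)  # 7
```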

In [8]:
# define the network
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
                 bidirectional, dropout, pad_idx):
        
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
        
        self.rnn = nn.LSTM(embedding_dim, 
                           hidden_dim, 
                           num_layers=n_layers, 
                           bidirectional=bidirectional, 
                           dropout=dropout)
        
        self.fc = nn.Linear(hidden_dim * 2, output_dim)  # why hidden_dim * 2? This assumes bidirectional=True: the final forward and backward hidden states are concatenated
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text, text_lengths):  # text and text_lengths both come from the batch (include_lengths = True)
        
        embedded = self.embedding(text)
        
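        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.cpu())
        # Sequences of different lengths are zero-padded to a common length, but running the LSTM over the padding wastes computation.
        # pack_padded_sequence stores the batch in the packed layout shown in the figure below, so the padded positions are skipped.
        # Newer PyTorch versions require the lengths to be a CPU tensor, hence .cpu(); the original note reports that passing text_lengths directly only ran on PyTorch 1.6.0.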
        
        packed_output, (hidden, cell) = self.rnn(packed_embedded)
        
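        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)  # unpack back into a zero-padded tensor
        
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
        # Take the last two entries of hidden. This assumes a bidirectional LSTM: hidden[-2] is the final forward state
        # and hidden[-1] the final backward state of the top layer, and the two are concatenated.
                
        return self.fc(hidden)  # raw logits in (-inf, +inf); the sigmoid that maps them to probabilities in (0, 1) is applied later, inside BCEWithLogitsLoss and binary_accuracy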

![Packed Sequence](rnn_packed_seq.jpg)
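
The packed layout in the figure can be inspected directly on a tiny batch. The following is a minimal sketch, assuming a time-major batch of two sequences (lengths 3 and 2) whose values are made up and where `0.` marks padding. It shows that packing drops the padded position and that `pad_packed_sequence` restores the original padded tensor.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Shape [max_len, batch] = [3, 2]; the second sequence is padded with 0. at the last step.
padded = torch.tensor([[1., 4.],
                       [2., 5.],
                       [3., 0.]])
lengths = torch.tensor([3, 2])  # must be sorted in descending order by default

packed = pack_padded_sequence(padded, lengths)

# Elements are stored time step by time step, skipping the padding.
print(packed.data)         # tensor([1., 4., 2., 5., 3.])
print(packed.batch_sizes)  # tensor([2, 2, 1])

restored, restored_lengths = pad_packed_sequence(packed)
print(torch.equal(restored, padded))  # True
```

`batch_sizes` records how many sequences are still active at each time step, which is how the LSTM knows to process only one sequence at the final step.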

In [9]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 300
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2  # stack two LSTM layers
BIDIRECTIONAL = True
DROPOUT = 0.2
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]  # look up the integer index of the pad token

model = RNN(INPUT_DIM, 
            EMBEDDING_DIM, 
            HIDDEN_DIM, 
            OUTPUT_DIM, 
            N_LAYERS, 
            BIDIRECTIONAL, 
            DROPOUT, 
            PAD_IDX)  # instantiate the model
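
With two layers and two directions, it helps to check which slices of `hidden` the model's `forward` actually concatenates. The sketch below uses a small standalone `nn.LSTM` with made-up dimensions and random input (all assumptions for illustration, with no dropout and no packing) and confirms that `hidden[-2]` and `hidden[-1]` are the top layer's final forward and backward states.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch, emb_dim, hid_dim = 5, 2, 4, 3  # toy sizes, not the tutorial's
lstm = nn.LSTM(emb_dim, hid_dim, num_layers=2, bidirectional=True)

output, (hidden, cell) = lstm(torch.randn(seq_len, batch, emb_dim))

# hidden: [num_layers * num_directions, batch, hid_dim]
# output: [seq_len, batch, num_directions * hid_dim], top layer only
print(hidden.shape)  # torch.Size([4, 2, 3])

# This is the tensor that the fully connected layer receives (before dropout).
final = torch.cat((hidden[-2], hidden[-1]), dim=1)
print(final.shape)   # torch.Size([2, 6])

# Forward direction finishes at the last time step, backward direction at the first.
fwd_matches = torch.allclose(hidden[-2], output[-1, :, :hid_dim])
bwd_matches = torch.allclose(hidden[-1], output[0, :, hid_dim:])
print(fwd_matches, bwd_matches)  # True True
```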

In [10]:
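# Initialize the embedding layer by copying in the pretrained vectors instead of learning embeddings from scratch. The copied weights stay trainable unless they are explicitly frozen.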
pretrained_embeddings = TEXT.vocab.vectors
model.embedding.weight.data.copy_(pretrained_embeddings)


tensor([[-0.1117, -0.4966,  0.1631,  ..., -1.4447,  0.8402, -0.8668],
        [ 0.1032, -1.6268,  0.5729,  ...,  0.3180, -0.1626, -0.0417],
        [ 0.0466,  0.2132, -0.0074,  ...,  0.0091, -0.2099,  0.0539],
        ...,
        [ 0.4301,  0.1106, -0.1652,  ...,  0.6874,  0.2279,  0.1751],
        [ 0.4206,  0.5589,  0.0129,  ...,  0.0560,  0.5848, -0.2663],
        [ 0.0303,  0.4811,  0.0488,  ..., -0.4523, -0.0706,  0.1418]])
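
Copying the vectors with `copy_` only sets the starting values; the optimizer will still update them. If the goal is to keep the pretrained embeddings fixed, one option is `nn.Embedding.from_pretrained` with `freeze=True`. The sketch below is a minimal comparison under assumed toy sizes, where a random tensor stands in for `TEXT.vocab.vectors`; it is not what the cell above does.

```python
import torch
import torch.nn as nn

vectors = torch.randn(10, 4)  # stand-in for TEXT.vocab.vectors (assumed shape)
pad_idx = 1

# Approach used in this notebook: copy the values in, weights remain trainable.
trainable = nn.Embedding(10, 4, padding_idx=pad_idx)
trainable.weight.data.copy_(vectors)

# Alternative: build the layer from the vectors and freeze it.
frozen = nn.Embedding.from_pretrained(vectors, freeze=True, padding_idx=pad_idx)

print(trainable.weight.requires_grad)  # True
print(frozen.weight.requires_grad)     # False
```

Freezing reduces the number of trainable parameters, while leaving the embeddings trainable lets them adapt to the sentiment task; which works better is an empirical question.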

In [11]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]  # the unknown token also needs an integer index

model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)  # set the <unk> vector to all zeros
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)  # set the <pad> vector to all zeros

In [12]:
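# set up the optimizer (Adam) and the loss function
import torch.optim as optim

optimizer = optim.Adam(model.parameters())  # hand all of the model's parameters (its weights) to Adam
criterion = nn.BCEWithLogitsLoss()
# The loss function (BCEWithLogitsLoss; BCE stands for binary cross entropy) does two things:
# ① applies a sigmoid to turn the fc output (a logit) into a probability p
# ② computes the binary cross-entropy loss from p and the label

model = model.to(device)  # move the model to the GPU or CPU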
criterion = criterion.to(device)
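
The two steps performed by the loss can be written out by hand. The sketch below is plain Python with made-up logits and targets: `bce_two_step` applies the sigmoid and then the cross-entropy formula, while `bce_with_logits` uses a commonly used algebraically equivalent form, `max(x, 0) - x*y + log(1 + exp(-|x|))`, that avoids taking the log of zero. Both function names are hypothetical, and this is not claimed to be PyTorch's exact implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_two_step(logit, target):
    # Step 1: logit -> probability. Step 2: binary cross entropy on that probability.
    p = sigmoid(logit)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def bce_with_logits(logit, target):
    # Same quantity, rearranged so no log(0) can occur for large |logit|.
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

logits = [2.0, -1.0, 0.5]   # made-up model outputs
targets = [1.0, 0.0, 0.0]   # made-up labels

two_step = [bce_two_step(x, y) for x, y in zip(logits, targets)]
fused = [bce_with_logits(x, y) for x, y in zip(logits, targets)]
mean_loss = sum(fused) / len(fused)  # the default reduction averages over the batch

print(all(math.isclose(a, b) for a, b in zip(two_step, fused)))  # True
print(bce_with_logits(1000.0, 0.0))  # 1000.0, where the two-step version would hit log(0)
```

This is the practical reason the model returns raw logits and leaves the sigmoid to the loss function.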

In [13]:
def binary_accuracy(preds, y):  # compute binary classification accuracy
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc
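
As a worked example of the calculation, the sketch below mirrors `binary_accuracy` in plain Python on five made-up logits and labels. `binary_accuracy_sketch` is a hypothetical helper for illustration; it also shows that rounding the sigmoid output amounts to checking the sign of the logit.

```python
import math

def binary_accuracy_sketch(logits, labels):
    """Pure-Python mirror of binary_accuracy (hypothetical helper)."""
    probs = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    preds = [1.0 if p > 0.5 else 0.0 for p in probs]
    correct = [float(p == y) for p, y in zip(preds, labels)]
    return sum(correct) / len(correct)

logits = [2.3, -0.7, 0.1, -1.5, 4.0]  # made-up model outputs
labels = [1.0, 0.0, 0.0, 0.0, 1.0]    # made-up ground truth

acc = binary_accuracy_sketch(logits, labels)
print(acc)  # 0.8 (the third example is predicted positive but labeled negative)

# sigmoid(x) > 0.5 exactly when x > 0, so the threshold can be applied to the logit.
sign_preds = [1.0 if x > 0 else 0.0 for x in logits]
print(sign_preds)  # [1.0, 0.0, 1.0, 0.0, 1.0]
```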

In [14]:
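def train(model, iterator, optimizer, criterion):  # iterator: the data loader, here a BucketIterator
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()  # training mode (enables dropout). Its counterpart, model.eval(), disables dropout for evaluation; gradient tracking is turned off separately with torch.no_grad()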
    
    for batch in iterator:
        
        optimizer.zero_grad()
        
        text, text_lengths = batch.text
        
        predictions = model(text, text_lengths).squeeze(1)
        
        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [15]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            text, text_lengths = batch.text
            
            predictions = model(text, text_lengths).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [16]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
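
For reference, the same minutes-and-seconds split can be written with `divmod`. This is a sketch of an equivalent formulation for non-negative durations; `epoch_time_divmod` is a hypothetical name and the timestamps are made up.

```python
def epoch_time_divmod(start_time, end_time):
    """Split an elapsed duration in seconds into whole minutes and seconds."""
    elapsed_mins, elapsed_secs = divmod(int(end_time - start_time), 60)
    return elapsed_mins, elapsed_secs

result = epoch_time_divmod(100.0, 210.9)  # 110.9 seconds elapsed
print(result)  # (1, 50)
```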

In [17]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

100%|█████████▉| 399700/400000 [01:00<00:00, 8294.58it/s]

Epoch: 01 | Epoch Time: 1m 50s
	Train Loss: 0.558 | Train Acc: 70.57%
	 Val. Loss: 0.444 |  Val. Acc: 79.54%
Epoch: 02 | Epoch Time: 1m 50s
	Train Loss: 0.393 | Train Acc: 82.70%
	 Val. Loss: 0.383 |  Val. Acc: 83.21%
Epoch: 03 | Epoch Time: 1m 50s
	Train Loss: 0.287 | Train Acc: 88.10%
	 Val. Loss: 0.300 |  Val. Acc: 88.08%
Epoch: 04 | Epoch Time: 1m 50s
	Train Loss: 0.161 | Train Acc: 94.26%
	 Val. Loss: 0.314 |  Val. Acc: 87.84%
Epoch: 05 | Epoch Time: 1m 50s
	Train Loss: 0.122 | Train Acc: 95.53%
	 Val. Loss: 0.367 |  Val. Acc: 87.17%
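
The log above shows the training loss falling every epoch while the validation loss bottoms out and then rises, which suggests the model starts to overfit after the third epoch. The sketch below replays the checkpoint rule from the training loop on the validation losses copied from that log, with a variable standing in for the `torch.save` call, to show which epoch's weights end up in `model.pt`.

```python
# Val. Loss per epoch, copied from the training log above.
valid_losses = [0.444, 0.383, 0.300, 0.314, 0.367]

best_valid_loss = float("inf")
saved_epoch = None

for epoch, valid_loss in enumerate(valid_losses, start=1):
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        saved_epoch = epoch  # stands in for torch.save(model.state_dict(), 'model.pt')

print(saved_epoch, best_valid_loss)  # 3 0.3
```

Epochs 4 and 5 therefore never overwrite the checkpoint, and the test evaluation loads the epoch 3 weights.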


In [18]:
model.load_state_dict(torch.load('model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.321 | Test Acc: 87.01%
