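Sentiment analysis: feed a piece of text (a movie review) into a trained model, and the model outputs whether the text is pos or neg.
The dataset is IMDB, which can be imported from the torchtext package.

Outline: for this task the data is split into three parts: train, valid and test. A vocab is then built from the 25,000 most common words in train. The vocab maps each word to an integer.
The model starts with an embedding layer: given the integer for a word, it returns a word vector. For a review, every word in the text yields a word vector; averaging the vectors of all words gives one vector for the whole text, and feeding that average into a linear layer produces the final output, pos or neg, i.e. 1 or 0. This is the idea behind the word_avg model; the RNN and CNN models follow much the same pattern. The averaging is done with avg_pool2d, so pay attention to how the dimensions change.
Note: data is fed to the model one batch at a time, and each batch is made up of sentences of similar length. Both behaviours are configured through parameters.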
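
The word_avg idea described above can be sketched without any framework. This is a minimal illustration, not the notebook's model: the embedding table, the weights and the helper name `word_avg_logit` are made-up values and names chosen for the example.

```python
import math

def word_avg_logit(token_ids, embedding, weights, bias):
    """Average the embeddings of token_ids, then apply a single linear unit."""
    dim = len(weights)
    # Mean of the word vectors, computed one embedding dimension at a time.
    avg = [sum(embedding[t][d] for t in token_ids) / len(token_ids)
           for d in range(dim)]
    # Linear layer with one output: a raw score (logit), not yet a probability.
    return sum(w * a for w, a in zip(weights, avg)) + bias

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy 4-word vocabulary with 2-dimensional embeddings (hypothetical values).
embedding = [[0.0, 0.0], [1.0, -1.0], [3.0, 1.0], [-2.0, 0.0]]
weights, bias = [0.5, 2.0], 0.0

logit = word_avg_logit([1, 2], embedding, weights, bias)
prob_pos = sigmoid(logit)  # probability of the "pos" class
print(logit, prob_pos)
```

In the notebook the same three steps (lookup, mean, linear) are done by nn.Embedding, avg_pool2d and nn.Linear on whole batches at once.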

In [1]:
import torch
from  torchtext import data

SEED=1234
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic=True

In [2]:
import spacy
spacy.load('en_core_web_sm')

<spacy.lang.en.English at 0x7fb8836507b8>

In [3]:
TEXT=data.Field(tokenize='spacy',tokenizer_language='en_core_web_sm')# A Field defines how the raw data will be processed.
    
LABEL=data.LabelField(dtype=torch.float)

In [4]:
from torchtext import datasets# torchtext ships many datasets; here we load IMDB, which comes split into train and test.
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

In [5]:
print('number of training examples {}'.format(len(train_data)))
print('number of test examples {}'.format(len(test_data)))

number of training examples 25000
number of test examples 25000


In [6]:
print(vars(train_data.examples[0]))# the text is already tokenized; vars() returns a dict of the object's attributes and their values.

{'text': ['Like', 'his', 'elder', 'brothers', ',', 'Claude', 'Sautet', 'and', 'Jean', '-', 'Pierre', 'Melville', ',', 'Alain', 'Corneau', 'began', 'to', 'cut', 'his', 'teeth', 'in', 'French', 'cinema', 'with', 'a', 'series', 'of', 'fine', 'thrillers', ':', '"', 'la', 'Menace', '"', '(', '1977', ')', 'and', '"', 'Série', 'Noire', '"', '(', '1979', ')', 'among', 'others', '.', '"', 'Police', 'Python', '357', '"', 'is', 'a', 'good', 'example', 'of', 'how', 'Corneau', 'conceived', 'and', 'shot', 'his', 'works', 'at', 'this', 'time', 'of', 'his', 'career', '.', 'They', 'had', 'a', 'splendid', 'cinematography', ',', 'painstaking', 'screenplays', 'and', 'a', 'sophisticated', 'directing', 'elaborated', 'for', 'efficiency', "'s", 'sake.<br', '/><br', '/>The', 'police', 'superintendent', 'Ferrot', '(', 'Yves', 'Montand', ')', 'is', 'a', 'cop', 'with', 'unconventional', 'methods', 'who', 'usually', 'works', 'all', 'alone', '.', 'He', 'makes', 'the', 'acquaintance', 'of', 'a', 'young', 'woman', 'S

In [7]:
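import random
random.seed(SEED)# random.seed() returns None, so seed first and pass the resulting state explicitly
train_data,valid_data=train_data.split(random_state=random.getstate())# the default split ratio is 0.7, i.e. 70% train / 30% validation
print(len(train_data))
print(len(valid_data))

17500
7500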


In [8]:
print(len(train_data))
print(len(valid_data))
print(len(test_data))

17500
7500
25000
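
The 17500 / 7500 sizes follow directly from the default 0.7 split ratio applied to 25,000 training examples. A minimal sketch of a seeded random split, assuming a plain shuffle-then-cut strategy (torchtext's own implementation may differ in detail; `split_examples` is a hypothetical helper):

```python
import random

def split_examples(examples, split_ratio=0.7, seed=1234):
    """Shuffle with a private seeded generator, then cut at split_ratio."""
    rng = random.Random(seed)          # private generator: reproducible split
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * split_ratio))
    return shuffled[:cut], shuffled[cut:]

train_part, valid_part = split_examples(range(25000))
print(len(train_part), len(valid_part))
```

Using a dedicated `random.Random(seed)` keeps the split reproducible regardless of what else has consumed the global random state.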


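The next step is to create the vocabulary, which maps each word one-to-one to an integer.
We build the vocabulary from the 25k most common words; the max_size parameter does this.
All other words are represented by <unk>.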
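
As a rough illustration of what building such a vocabulary involves, here is a framework-free sketch. The helper names `build_vocab` and `numericalize` are hypothetical, and the tie-breaking rule is an assumption made only to keep the toy example deterministic.

```python
from collections import Counter

def build_vocab(tokenized_texts, max_size, specials=("<unk>", "<pad>")):
    """Keep the max_size most frequent tokens, after the special tokens."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Descending frequency; ties broken alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:max_size]
    itos = list(specials) + [tok for tok, _ in ranked]   # int -> string
    stoi = {tok: i for i, tok in enumerate(itos)}        # string -> int
    return itos, stoi

def numericalize(tokens, stoi, unk_token="<unk>"):
    """Map tokens to integers; anything outside the vocab becomes <unk>."""
    return [stoi.get(tok, stoi[unk_token]) for tok in tokens]

texts = [["the", "film", "is", "good"],
         ["the", "film", "is", "bad"],
         ["the", "plot"]]
itos, stoi = build_vocab(texts, max_size=3)
print(itos)
print(numericalize(["the", "plot", "is", "good"], stoi))
```

The two special tokens are why a vocabulary built with max_size=25000 ends up with 25002 entries.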

In [9]:
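# Build the vocabulary from the 25,000 most frequent tokens in the training data.
# vectors="glove.6B.100d" initializes the word vectors from pretrained GloVe embeddings, which speeds up training.
TEXT.build_vocab(train_data, max_size=25000, vectors="glove.6B.100d", unk_init=torch.Tensor.normal_)
LABEL.build_vocab(train_data)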

.vector_cache/glove.6B.zip: 862MB [01:51, 7.71MB/s]                               
 99%|█████████▉| 397966/400000 [00:30<00:00, 22595.76it/s]

In [10]:
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")
print(f"Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}")

Unique tokens in TEXT vocabulary: 25002
Unique tokens in LABEL vocabulary: 2


In [19]:
print(vars(TEXT.vocab)['vectors'].shape)# shape of the pretrained vector table: one row per vocab entry

torch.Size([25002, 100])


In [20]:
print(TEXT.vocab.freqs.most_common(20))

[('the', 203479), (',', 192700), ('.', 165658), ('a', 109189), ('and', 109099), ('of', 100605), ('to', 93433), ('is', 76108), ('in', 61269), ('I', 54280), ('it', 53950), ('that', 49195), ('"', 43989), ("'s", 43324), ('this', 42548), ('-', 36975), ('/><br', 35514), ('was', 35036), ('as', 30305), ('movie', 29951)]


In [21]:
print(TEXT.vocab.itos[:10])#int to string

['<unk>', '<pad>', 'the', ',', '.', 'a', 'and', 'of', 'to', 'is']


In [22]:
print(LABEL.vocab.stoi)#string to int

defaultdict(None, {'neg': 0, 'pos': 1})


In [23]:
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size=BATCH_SIZE,
    device=device)# build the data iterators

In [27]:
print(type(train_iterator))
print(next(iter(train_iterator)))# inspect the layout of one batch: .text is [seq_len, batch_size]

<class 'torchtext.data.iterator.BucketIterator'>

[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 1247x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]
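
The batch layout above (a padded [seq_len, batch_size] tensor, one column per review) can be reproduced in a few lines. This is a simplified sketch of length bucketing and padding, not BucketIterator's actual algorithm; `make_batches` is a hypothetical helper and pad index 1 matches the `<pad>` position in this notebook's vocab.

```python
def make_batches(sequences, batch_size, pad_idx=1):
    """Group sequences of similar length, pad each group, return [seq_len, batch] grids."""
    # Sorting by length puts similar-length sequences next to each other,
    # so each batch needs little padding.
    ordered = sorted(sequences, key=len)
    batches = []
    for start in range(0, len(ordered), batch_size):
        group = ordered[start:start + batch_size]
        max_len = max(len(seq) for seq in group)
        padded = [seq + [pad_idx] * (max_len - len(seq)) for seq in group]
        # Transpose so that rows are time steps and columns are examples.
        batches.append([list(row) for row in zip(*padded)])
    return batches

seqs = [[5, 6, 7, 8, 9], [2, 3], [4], [10, 11, 12, 13]]
batches = make_batches(seqs, batch_size=2)
for b in batches:
    print(len(b), "x", len(b[0]), b)
```

Each batch is padded only to the length of its own longest sequence, which is why the first dimension differs from batch to batch in the output above.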


In [101]:
for example in train_iterator:# each column of .text is one example
    print(example)
    print(example.text.shape)
    print(example.label.shape)


[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 1035x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]
torch.Size([1035, 64])
torch.Size([64])

[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 1246x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]
torch.Size([1246, 64])
torch.Size([64])

[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 1064x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]
torch.Size([1064, 64])
torch.Size([64])

[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 1557x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]
torch.Size([1557, 64])
torch.Size([64])

[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 939x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]
torch.Size([939, 64])
torch.Size([64])

...


[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 883x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]
torch.Size([883, 64])
torch.Size([64])

[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 1069x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]
torch.Size([1069, 64])
torch.Size([64])

[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 763x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]
torch.Size([763, 64])
torch.Size([64])

[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 1180x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]
torch.Size([1180, 64])
torch.Size([64])

[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 907x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]
torch.Size([907, 64])
torch.Size([64])

...


[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 1016x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]
torch.Size([1016, 64])
torch.Size([64])

[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 860x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]
torch.Size([860, 64])
torch.Size([64])

[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 1012x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]
torch.Size([1012, 64])
torch.Size([64])

[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 1109x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]
torch.Size([1109, 64])
torch.Size([64])

[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 825x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]
torch.Size([825, 64])
torch.Size([64])

...


[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 954x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]
torch.Size([954, 64])
torch.Size([64])

[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 970x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]
torch.Size([970, 64])
torch.Size([64])

[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 1285x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]
torch.Size([1285, 64])
torch.Size([64])

[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 1186x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]
torch.Size([1186, 64])
torch.Size([64])

[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 1249x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]
torch.Size([1249, 64])
torch.Size([64])

...


[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 1014x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]
torch.Size([1014, 64])
torch.Size([64])

[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 1083x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]
torch.Size([1083, 64])
torch.Size([64])

[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 1005x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]
torch.Size([1005, 64])
torch.Size([64])

[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 987x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]
torch.Size([987, 64])
torch.Size([64])

[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 1098x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]
torch.Size([1098, 64])
torch.Size([64])

...


[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 694x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]
torch.Size([694, 64])
torch.Size([64])

[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 876x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]
torch.Size([876, 64])
torch.Size([64])

[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 735x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]
torch.Size([735, 64])
torch.Size([64])

[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 817x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]
torch.Size([817, 64])
torch.Size([64])

[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 1012x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]
torch.Size([1012, 64])
torch.Size([64])

...

In [149]:
# Build the word_avg model.
# A review is converted to integers via the vocab (one integer per token); the embedding layer maps each integer to a vector of length embed_size.

import torch.nn as nn
import torch
import torch.nn.functional as F
class Wordavgmodel(nn.Module):
    def __init__(self,vocab_size,embed_size,pad_idx,outputdim):
        super(Wordavgmodel,self).__init__()
        # __init__ only declares the layers; the tensor operations (permute, pooling) belong in forward.
        self.embedding=nn.Embedding(vocab_size,embed_size,padding_idx=pad_idx)
        self.fc=nn.Linear(embed_size,outputdim)
    def forward(self,text):# text is one batch from the iterator, shape [seq_len, batch_size]
        embed=self.embedding(text)# [seq_len, batch_size, embed_size]
        embed=embed.permute(1,0,2)# [batch_size, seq_len, embed_size]
        # Average the word vectors of each review: a (seq_len, 1) pooling window collapses the
        # sequence dimension while keeping every embedding dimension separate.
        avg=F.avg_pool2d(embed,(embed.shape[1],1)).squeeze(1)# [batch_size, 1, embed_size] -> [batch_size, embed_size]
        return self.fc(avg)
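
One design point worth noting: pooling over the full padded length divides by seq_len, so padded positions (whose embeddings are zero) pull the mean towards zero for short reviews in a batch. The sketch below contrasts the plain mean with a mean over non-pad tokens only. It is an illustration under assumed toy values, and `masked_mean` is a hypothetical alternative, not what the model above does.

```python
def plain_mean(vectors):
    """Mean over every position, padded or not (what full-length pooling computes)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def masked_mean(vectors, token_ids, pad_idx=1):
    """Mean over real tokens only; pad positions are skipped."""
    kept = [v for v, t in zip(vectors, token_ids) if t != pad_idx]
    return plain_mean(kept)

token_ids = [7, 9, 1, 1]                       # two real tokens, two pads
vectors = [[2.0, 4.0], [4.0, 0.0],             # embeddings of the real tokens
           [0.0, 0.0], [0.0, 0.0]]             # pad embeddings are all zeros

print(plain_mean(vectors))
print(masked_mean(vectors, token_ids))
```

Because batches group reviews of similar length, the effect is usually small, which may be why the simpler unmasked average works acceptably here.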
    

In [40]:
INPUT_DIM=len(TEXT.vocab)
print(INPUT_DIM)
print(TEXT.vocab.stoi[TEXT.pad_token])

25002
1


In [150]:
INPUT_DIM=len(TEXT.vocab)#25002
EMBEDDING_DIM=100
OUTPUTDIM=1
PAD_IDX=TEXT.vocab.stoi[TEXT.pad_token]
model= Wordavgmodel(INPUT_DIM,EMBEDDING_DIM,PAD_IDX,OUTPUTDIM)

In [151]:
print(model) # The embedding table is [25002, 100]; a text batch shaped [1247, 64] is looked up index by index, giving [1247, 64, 100].

Wordavgmodel(
  (embedding): Embedding(25002, 100, padding_idx=1)
  (fc): Linear(in_features=100, out_features=1, bias=True)
)


In [152]:
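# nn.Embedding(l, h) creates an l x h matrix, where l is the number of words and h is the (user-chosen) embedding dimension.
# The module stores word embeddings and retrieves them by index: the input is a tensor of indices, the output is the corresponding embeddings.
word_to_index={'hello':0,'world':1}
# Wrap the index in a list: torch.LongTensor(0) allocates an empty tensor of size 0 rather than a tensor holding the value 0.
hello_index=torch.LongTensor([word_to_index['hello']])
embeds=nn.Embedding(2,5)
print(embeds)
print(embeds(hello_index).shape)

Embedding(2, 5)
torch.Size([1, 5])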


In [51]:
for p in model.parameters():
    print(p)
    print(p.numel())# numel() returns the number of elements in the tensor

Parameter containing:
tensor([[-0.2054, -1.6745, -0.6783,  ...,  0.6963,  1.8395, -1.0152],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.6114, -0.9283,  0.9795,  ..., -1.3320,  0.1223, -1.2474],
        ...,
        [-0.4985,  0.5306, -0.7786,  ...,  1.0312, -2.7757, -0.4136],
        [ 0.5917,  0.3962, -0.2695,  ..., -0.2319, -0.0390, -0.2140],
        [-0.8517, -0.8586, -1.2057,  ..., -0.7280,  1.1957,  1.2343]],
       requires_grad=True)
2500200
Parameter containing:
tensor([[-0.0092,  0.0752,  0.0202,  0.0121,  0.0342, -0.0687,  0.0566, -0.0823,
         -0.0179,  0.0657,  0.0708,  0.0449,  0.0075,  0.0200,  0.0464, -0.0942,
          0.0539, -0.0560, -0.0038,  0.0918,  0.0722, -0.0931, -0.0202, -0.0681,
         -0.0495,  0.0195,  0.0484, -0.0016, -0.0002,  0.0677,  0.0120, -0.0918,
         -0.0551, -0.0326, -0.0799,  0.0378, -0.0531,  0.0564,  0.0657,  0.0543,
          0.0436,  0.0842,  0.0900, -0.0788, -0.0638, -0.0654, -0.0550,  0.0881,
 

In [153]:
def count_model_parameters(model):
    return sum( p.numel() for p in model.parameters() if p.requires_grad)
print(f'the model has {count_model_parameters(model):,} model parameters.')

the model has 2,500,301 model parameters.


In [62]:
TEXT.vocab.vectors # shape (25002, 100); initialized from the pretrained vectors="glove.6B.100d"

tensor([[-0.1117, -0.4966,  0.1631,  ...,  1.2647, -0.2753, -0.1325],
        [-0.8555, -0.7208,  1.3755,  ...,  0.0825, -1.1314,  0.3997],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [-0.7244, -0.0186,  0.0996,  ...,  0.0045, -1.0037,  0.6646],
        [-1.1243,  1.2040, -0.6489,  ..., -0.7526,  0.5711,  1.0081],
        [ 0.2525,  0.4068, -0.1437,  ..., -0.5324,  0.4820,  0.1396]])

In [63]:
TEXT.vocab.vectors.shape

torch.Size([25002, 100])

In [154]:
# Initialize the model's embedding weights from the pretrained vectors
pretrained_embedding=TEXT.vocab.vectors #vectors="glove.6B.100d"
model.embedding.weight.data.copy_(pretrained_embedding)

tensor([[-0.1117, -0.4966,  0.1631,  ...,  1.2647, -0.2753, -0.1325],
        [-0.8555, -0.7208,  1.3755,  ...,  0.0825, -1.1314,  0.3997],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [-0.7244, -0.0186,  0.0996,  ...,  0.0045, -1.0037,  0.6646],
        [-1.1243,  1.2040, -0.6489,  ..., -0.7526,  0.5711,  1.0081],
        [ 0.2525,  0.4068, -0.1437,  ..., -0.5324,  0.4820,  0.1396]])

In [None]:
vars(TEXT.vocab)# show all attributes of the vocab and their values

In [155]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

In [156]:
#train a model
optimizer=torch.optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()# binary cross-entropy between the model output and the target; the sigmoid is applied inside the loss, so the model outputs raw logits

model=model.to(device)
criterion=criterion.to(device)

In [157]:
def binary_accuracy(y_preds,y):
    
    round_preds = torch.round(torch.sigmoid(y_preds))
    correct = (round_preds==y).float()
    acc = correct.sum()/len(correct)  
    return acc
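
To make the loss and the accuracy concrete, here is a framework-free sketch of both computations on plain lists. The function names are hypothetical; the loss uses the standard numerically stable rewriting of binary cross-entropy on logits, which is assumed here to correspond to what a "with logits" loss computes with mean reduction.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from raw scores."""
    # Stable form of -[y*log(s(x)) + (1-y)*log(1-s(x))] with s = sigmoid:
    #   max(x, 0) - x*y + log(1 + exp(-|x|))
    losses = [max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
              for x, y in zip(logits, targets)]
    return sum(losses) / len(losses)

def binary_accuracy_from_logits(logits, targets):
    """Fraction of correct predictions; sigmoid(x) > 0.5 exactly when x > 0."""
    preds = [1.0 if x > 0 else 0.0 for x in logits]
    return sum(p == y for p, y in zip(preds, targets)) / len(targets)

logits = [2.0, -1.0, 0.5, -3.0]
targets = [1.0, 0.0, 0.0, 0.0]
print(bce_with_logits(logits, targets))
print(binary_accuracy_from_logits(logits, targets))
print(bce_with_logits([0.0], [1.0]))   # a logit of 0 gives ln 2
```

A logit of 0 yields a loss of ln 2 ≈ 0.693, so a validation loss stuck near 0.69 indicates a model that is still predicting at chance.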
    

In [158]:
def train(model,iterator,optimizer,criterion):
    epoch_loss=0
    epoch_acc=0
    model.train()
    
    for batch in iterator:
        optimizer.zero_grad()
        y_pred = model(batch.text).squeeze(1)
        loss=criterion(y_pred,batch.label)
        acc=binary_accuracy(y_pred,batch.label)
        loss.backward()
        optimizer.step()
        
        
        epoch_loss+=loss.item()
        epoch_acc+=acc.item()
        
    return epoch_loss/len(iterator),epoch_acc/len(iterator)
        

In [159]:
def evaluate(model,iterator,criterion):
    epoch_loss=0
    epoch_acc=0
    model.eval()
    
    with torch.no_grad():
        for batch in iterator:
            y_pred=model(batch.text).squeeze(1)
            loss=criterion(y_pred,batch.label)
            acc=binary_accuracy(y_pred,batch.label)
            
            epoch_loss+=loss.item()
            epoch_acc+=acc.item()
    return epoch_loss/len(iterator),epoch_acc/len(iterator)

In [160]:
import time
def epoch_time(start_time,end_time):
    elapsed_time=int(end_time-start_time)
    elapsed_mins=int(elapsed_time/60)
    elapsed_secs=int(elapsed_time-(elapsed_mins*60))# seconds left over after the whole minutes
    return elapsed_mins,elapsed_secs

In [161]:
# Train the word_avg model
NUM_EPOCHES=20
best_valid_loss=float('inf')# positive infinity, so the first epoch always counts as an improvement
for epoch in range(NUM_EPOCHES):
    start_time=time.time()
    train_loss,train_acc=train(model,train_iterator,optimizer,criterion)
    val_loss,val_acc=evaluate(model,valid_iterator,criterion)
    end_time=time.time()
    
    epoch_min,epoch_secs=epoch_time(start_time,end_time)
    
    # Keep the checkpoint with the lowest validation loss
    if val_loss<best_valid_loss:
        best_valid_loss=val_loss
        torch.save(model.state_dict(),'/home/control/Desktop/text8/wordavg-model.pth')
    print(f'Epoch:{epoch+1:02} | Epoch time:{epoch_min}m {epoch_secs}s')
    print(f'\tTrain Loss:{train_loss:.3f}| Train acc:{train_acc*100:.2f}%')
    print(f'\tVal Loss:{val_loss:.3f} | Valid acc:{val_acc*100:.2f}%')

Epoch:01 | Epoch time:0m 5s
	Train Loss:0.685| Train acc:59.97%
	Val Loss:0.621 | Valid acc:71.73%
Epoch:02 | Epoch time:0m 4s
	Train Loss:0.641| Train acc:74.36%
	Val Loss:0.498 | Valid acc:76.66%
Epoch:03 | Epoch time:0m 4s
	Train Loss:0.565| Train acc:79.21%
	Val Loss:0.471 | Valid acc:79.20%
Epoch:04 | Epoch time:0m 4s
	Train Loss:0.493| Train acc:83.12%
	Val Loss:0.430 | Valid acc:82.64%
Epoch:05 | Epoch time:0m 5s
	Train Loss:0.431| Train acc:85.91%
	Val Loss:0.417 | Valid acc:84.67%
Epoch:06 | Epoch time:0m 5s
	Train Loss:0.383| Train acc:87.87%
	Val Loss:0.424 | Valid acc:86.09%
Epoch:07 | Epoch time:0m 4s
	Train Loss:0.346| Train acc:88.98%
	Val Loss:0.442 | Valid acc:86.57%
Epoch:08 | Epoch time:0m 4s
	Train Loss:0.318| Train acc:89.74%
	Val Loss:0.455 | Valid acc:87.33%
Epoch:09 | Epoch time:0m 4s
	Train Loss:0.292| Train acc:90.55%
	Val Lo

In [162]:
import spacy
#nlp=spacy.load('en')
nlp=spacy.load('en_core_web_sm')
def predicit_sentiment(sentence):
    model.eval()# switch to evaluation mode for inference
    tokenized=[tok.text for tok in nlp.tokenizer(sentence)]# tokenize the sentence
    indexed=[TEXT.vocab.stoi[i] for i in tokenized]# look up each token's index in the vocabulary
    tensor=torch.LongTensor(indexed).to(device)
    tensor=tensor.unsqueeze(1)# add a batch dimension of size 1 at position 1
    # forward(text) expects text shaped [seq_len, batch_size], so the extra dimension is required.
    with torch.no_grad():
        prediction=torch.sigmoid(model(tensor))
    return prediction.item()


In [163]:
predicit_sentiment('this film is terrible')

0.0

In [164]:
predicit_sentiment('this film is green')

7.216053537743827e-20

In [165]:
import torch

In [166]:
# Build an RNN model.
# An RNN is often used as an encoder: it encodes a sentence, and the last hidden state represents the whole sentence.
# Here the RNN is bidirectional, so there are two final hidden states, which are concatenated.
# The concatenated hidden state goes through a linear layer that predicts the sentiment of the sentence.
class RNNModel(nn.Module):
    def __init__(self,vocab_size,embedding_dim,hidden_dim,output_dim,n_layers,bidirectional,dropout,pad_idx):
        super(RNNModel,self).__init__()
        self.embed=nn.Embedding(vocab_size,embedding_dim,padding_idx=pad_idx)
        self.rnn=nn.LSTM(embedding_dim,hidden_dim,num_layers=n_layers,
                         bidirectional=bidirectional, dropout=dropout)
        self.fc=nn.Linear(hidden_dim*2,output_dim)# hidden_dim*2 assumes a bidirectional LSTM (forward + backward state)
        self.drop=nn.Dropout(dropout)
        
    def forward(self,text):# text is one batch taken from the iterator
        embed=self.drop(self.embed(text))  #[sent len, batch size, emb dim]
        output,(hidden, cell)=self.rnn(embed)# an LSTM returns both a hidden state and a cell state
        #output = [sent len, batch size, hid dim * num directions]
        #hidden = [num layers * num directions, batch size, hid dim]
        #cell = [num layers * num directions, batch size, hid dim]
        # With a bidirectional LSTM, concatenate the top layer's final forward state (hidden[-2]) and final backward state (hidden[-1]).
        hidden=self.drop(torch.cat((hidden[-2,:,:],hidden[-1,:,:]),dim=1))# [batch size, hid dim * 2]
        return self.fc(hidden)
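
Why hidden[-2] and hidden[-1]? The first dimension of the final hidden state stacks layers and directions, layer by layer, with the forward direction before the backward one. The sketch below spells out that index arithmetic on plain lists; `hidden_index` is a hypothetical helper written for illustration.

```python
def hidden_index(layer, direction, num_directions=2):
    """Row of the stacked final hidden state for (layer, direction).

    direction 0 = forward, 1 = backward; rows are ordered layer by layer.
    """
    return layer * num_directions + direction

n_layers, num_directions = 2, 2
rows = n_layers * num_directions           # 4 rows for a 2-layer BiLSTM
top = n_layers - 1

fwd_row = hidden_index(top, 0)             # top layer, forward
bwd_row = hidden_index(top, 1)             # top layer, backward
print(fwd_row - rows, bwd_row - rows)      # the same rows as negative indices

# Concatenating the two top-layer states doubles the feature size.
hid_dim = 3
forward_state = [0.1] * hid_dim            # toy values
backward_state = [0.2] * hid_dim
sentence_vector = forward_state + backward_state
print(len(sentence_vector))
```

This is also why the final linear layer takes hidden_dim*2 input features.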

In [167]:
INPUT_DIM=len(TEXT.vocab)
EMBEDDING_DIM=100
HIDDEN_DIM=256
OUTPUT_DIM=1
N_LAYERS=2
BIDIRECTIONAL = True
DROPOUT = 0.5
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

rnn_model = RNNModel(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, 
            N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX)

In [168]:
print(rnn_model)

RNNModel(
  (embed): Embedding(25002, 100, padding_idx=1)
  (rnn): LSTM(100, 256, num_layers=2, dropout=0.5, bidirectional=True)
  (fc): Linear(in_features=512, out_features=1, bias=True)
  (drop): Dropout(p=0.5, inplace=False)
)


In [169]:
print(f'the model parameters {count_model_parameters(rnn_model):,} trainable parameters.')

the model parameters 4,810,857 trainable parameters.
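
The parameter counts reported for both models can be checked by hand. The sketch below redoes the arithmetic with small helper functions (hypothetical names), assuming the usual LSTM layout of four gates, each with input weights, recurrent weights and two bias vectors.

```python
def embedding_params(vocab_size, emb_dim):
    return vocab_size * emb_dim

def linear_params(in_features, out_features):
    return in_features * out_features + out_features   # weights + bias

def lstm_params(input_dim, hidden_dim, n_layers, bidirectional):
    num_directions = 2 if bidirectional else 1
    total = 0
    for layer in range(n_layers):
        # Layers above the first receive the concatenated directions as input.
        in_dim = input_dim if layer == 0 else hidden_dim * num_directions
        # 4 gates x (input weights + recurrent weights + 2 bias vectors)
        per_direction = 4 * (in_dim * hidden_dim
                             + hidden_dim * hidden_dim
                             + 2 * hidden_dim)
        total += per_direction * num_directions
    return total

word_avg_total = embedding_params(25002, 100) + linear_params(100, 1)
lstm_total = (embedding_params(25002, 100)
              + lstm_params(100, 256, n_layers=2, bidirectional=True)
              + linear_params(512, 1))
print(f"{word_avg_total:,}")
print(f"{lstm_total:,}")
```

In both models the embedding table (2,500,200 weights) accounts for the majority of the parameters.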


In [170]:
rnn_model.embed.weight.data.copy_(pretrained_embedding)
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

rnn_model.embed.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
rnn_model.embed.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

print(rnn_model.embed.weight.data)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [-0.7244, -0.0186,  0.0996,  ...,  0.0045, -1.0037,  0.6646],
        [-1.1243,  1.2040, -0.6489,  ..., -0.7526,  0.5711,  1.0081],
        [ 0.2525,  0.4068, -0.1437,  ..., -0.5324,  0.4820,  0.1396]])


In [171]:
optimizer=torch.optim.Adam(rnn_model.parameters())
rnn_model=rnn_model.to(device)

In [172]:
NUM_EPOCHES=10
best_valid_loss=float('inf')
for epoch in range(NUM_EPOCHES):
    start_time=time.time()
    train_loss,train_acc=train(rnn_model,train_iterator,optimizer,criterion)
    val_loss,val_acc=evaluate(rnn_model,valid_iterator,criterion)
    end_time=time.time()
    
    epoch_min,epoch_secs=epoch_time(start_time,end_time)
    
    # Keep the checkpoint with the lowest validation loss
    if val_loss<best_valid_loss:
        best_valid_loss=val_loss
        torch.save(rnn_model.state_dict(),'/home/control/Desktop/text8/lstm-model.pth')
    print(f'Epoch:{epoch+1:02} | Epoch time:{epoch_min}m {epoch_secs}s')
    print(f'\tTrain Loss:{train_loss:.3f}| Train acc:{train_acc*100:.2f}%')
    print(f'\tVal. Loss:{val_loss:.3f} | Valid acc:{val_acc*100:.2f}%')

Epoch:01 | Epoch time:1m 17s
	Train Loss:0.670| Train acc:58.60%
	Val. Loss:0.691 | Valid acc:58.80%
Epoch:02 | Epoch time:1m 19s
	Train Loss:0.628| Train acc:65.46%
	Val. Loss:0.674 | Valid acc:53.75%
Epoch:03 | Epoch time:1m 19s
	Train Loss:0.538| Train acc:73.62%
	Val. Loss:0.994 | Valid acc:59.18%
Epoch:04 | Epoch time:1m 20s
	Train Loss:0.718| Train acc:52.25%
	Val. Loss:0.677 | Valid acc:58.74%
Epoch:05 | Epoch time:1m 17s
	Train Loss:0.627| Train acc:63.08%
	Val. Loss:0.682 | Valid acc:52.39%
Epoch:06 | Epoch time:1m 15s
	Train Loss:0.517| Train acc:74.93%
	Val. Loss:0.450 | Valid acc:79.67%
Epoch:07 | Epoch time:1m 22s
	Train Loss:0.305| Train acc:87.75%
	Val. Loss:0.396 | Valid acc:84.23%
Epoch:08 | Epoch time:1m 19s
	Train Loss:0.254| Train acc:90.11%
	Val. Loss:0.453 | Valid acc:76.18%
Epoch:09 | Epoch time:1m 19s
	Train Loss:0.220| Train ac

In [177]:
model.load_state_dict(torch.load('/home/control/Desktop/text8/wordavg-model.pth'))
test_loss,test_acc=evaluate(model,test_iterator,criterion)
print(f'test loss:{test_loss:.3f} | test acc :{test_acc*100:.3f}%')

test loss:0.431 | test acc :84.356%


In [179]:
rnn_model.load_state_dict(torch.load('/home/control/Desktop/text8/lstm-model.pth'))
test_loss,test_acc = evaluate(rnn_model,test_iterator,criterion)
print(f'test loss :{test_loss:.3f} | test acc:{test_acc*100:.3f}%')

test loss :0.379 | test acc:84.400%


In [None]:
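# Modules stored in an nn.ModuleList are registered with the parent module (so their parameters show up in .parameters()); modules kept in a plain Python list are not.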

In [190]:
class CNN(nn.Module):
    def __init__(self,vocab_size, embedding_dim,n_filters,filter_sizes,output_dim,dropout,pad_idx):
        super(CNN,self).__init__()
        self.embed=nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.list=nn.ModuleList([nn.Conv2d(in_channels=1,out_channels=n_filters,
                                          kernel_size=(fs,embedding_dim)) for fs in filter_sizes ])
        self.fc=nn.Linear(len(filter_sizes)*n_filters,output_dim)
        self.drop=nn.Dropout(dropout)
        
    def forward(self,text):# text arrives as [seq_len, batch_size]
        text=text.permute(1,0)#[batch_size, seq_len]
        embed=self.embed(text)#[batch_size, seq_len, embed_dim]
        embed=embed.unsqueeze(1)#[batch_size, 1, seq_len, embed_dim]
        
        convd = [F.relu(conv(embed)).squeeze(3) for conv in self.list]
        #conv_n = [batch size, n_filters, sent len - filter_sizes[n] + 1]
        
        poold=[F.max_pool1d(conv,conv.shape[2]).squeeze(2) for conv in convd]
        #pooled_n = [batch size, n_filters]
        
        cat = self.drop(torch.cat(poold,dim=1))
        #cat = [batch size, n_filters * len(filter_sizes)]
        return self.fc(cat)
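
The shape comments in forward follow from two facts: a filter spanning fs tokens fits in sent_len - fs + 1 positions, and max-over-time pooling keeps one number per filter. The toy sketch below shows both on a single sequence; the "filter" is just a window sum standing in for a learned kernel, and all names are hypothetical.

```python
def conv_positions(sent_len, filter_size):
    """Number of window positions for a stride-1 filter without padding."""
    return sent_len - filter_size + 1

def relu_max_over_time(scores, filter_size):
    """Slide a window, apply ReLU, then keep the largest activation."""
    windows = [sum(scores[i:i + filter_size])           # stand-in for a learned filter
               for i in range(conv_positions(len(scores), filter_size))]
    activated = [max(0.0, w) for w in windows]          # ReLU
    return max(activated), len(windows)

scores = [0.5, -1.0, 2.0, 1.0, -3.0]                    # one toy value per token
pooled, n_windows = relu_max_over_time(scores, filter_size=3)
print(pooled, n_windows)

# Feature size after concatenating the pooled outputs of every filter.
n_filters, filter_sizes = 100, [3, 4, 5]
print(n_filters * len(filter_sizes))
```

Because pooling removes the time dimension, reviews of any length produce the same 300 features, provided the review is at least as long as the largest filter.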
        
        

In [192]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
N_FILTERS = 100
FILTER_SIZES = [3,4,5]
OUTPUT_DIM = 1
DROPOUT = 0.5
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

cnn_model = CNN(INPUT_DIM, EMBEDDING_DIM, N_FILTERS, FILTER_SIZES, OUTPUT_DIM, DROPOUT, PAD_IDX)

In [193]:
print(cnn_model)

CNN(
  (embed): Embedding(25002, 100, padding_idx=1)
  (list): ModuleList(
    (0): Conv2d(1, 100, kernel_size=(3, 100), stride=(1, 1))
    (1): Conv2d(1, 100, kernel_size=(4, 100), stride=(1, 1))
    (2): Conv2d(1, 100, kernel_size=(5, 100), stride=(1, 1))
  )
  (fc): Linear(in_features=300, out_features=1, bias=True)
  (drop): Dropout(p=0.5, inplace=False)
)


In [195]:

cnn_model.embed.weight.data.copy_(pretrained_embedding)
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

cnn_model.embed.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
cnn_model.embed.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)
cnn_model = cnn_model.to(device)
print(cnn_model.embed.weight.data)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [-0.7244, -0.0186,  0.0996,  ...,  0.0045, -1.0037,  0.6646],
        [-1.1243,  1.2040, -0.6489,  ..., -0.7526,  0.5711,  1.0081],
        [ 0.2525,  0.4068, -0.1437,  ..., -0.5324,  0.4820,  0.1396]],
       device='cuda:0')


In [196]:
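# Bind the optimizer passed to train() to cnn_model's parameters. If train() receives the earlier `optimizer`
# built for rnn_model instead, the CNN weights are never updated and accuracy stays near 50%, as in the log recorded below.
optimizer=torch.optim.Adam(cnn_model.parameters())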
criterion = nn.BCEWithLogitsLoss()
criterion = criterion.to(device)

N_EPOCHS = 10

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(cnn_model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(cnn_model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(cnn_model.state_dict(), 'CNN-model.pth')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')


Epoch: 01 | Epoch Time: 0m 18s
	Train Loss: 0.732 | Train Acc: 50.13%
	 Val. Loss: 0.694 |  Val. Acc: 51.20%
Epoch: 02 | Epoch Time: 0m 17s
	Train Loss: 0.730 | Train Acc: 50.50%
	 Val. Loss: 0.694 |  Val. Acc: 51.20%
Epoch: 03 | Epoch Time: 0m 17s
	Train Loss: 0.729 | Train Acc: 50.56%
	 Val. Loss: 0.694 |  Val. Acc: 51.20%
Epoch: 04 | Epoch Time: 0m 18s
	Train Loss: 0.731 | Train Acc: 49.99%
	 Val. Loss: 0.694 |  Val. Acc: 51.20%
Epoch: 05 | Epoch Time: 0m 18s
	Train Loss: 0.733 | Train Acc: 49.87%
	 Val. Loss: 0.694 |  Val. Acc: 51.20%
Epoch: 06 | Epoch Time: 0m 17s
	Train Loss: 0.738 | Train Acc: 49.02%
	 Val. Loss: 0.694 |  Val. Acc: 51.20%
Epoch: 07 | Epoch Time: 0m 18s
	Train Loss: 0.729 | Train Acc: 50.67%
	 Val. Loss: 0.694 |  Val. Acc: 51.20%
Epoch: 08 | Epoch Time: 0m 18s
	Train Loss: 0.731 | Train Acc: 50.36%
	 Val. Loss: 0.694 |  Val. Acc: 51.20%
Epoch: 09 | Epoch Time: 0m 18s
	Train Loss: 0.736 | Train Acc: 49.65%
	 Val. Loss: 0.694 |  Val. Acc: 51.20%
Epoch: 10 | Epoch T

In [198]:
cnn_model.load_state_dict(torch.load('CNN-model.pth'))
test_loss, test_acc = evaluate(cnn_model, test_iterator, criterion)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.695 | Test Acc: 50.07%
