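Set the random seeds so that, for the same network architecture and hyperparameters, the same input produces the same output. See the official PyTorch documentation: https://pytorch.org/docs/stable/notes/randomness.html

First, control the sources of randomness.

Second, configure PyTorch to avoid nondeterministic algorithms for certain operations, so that repeated calls to those operations with the same input produce the same result.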

In [1]:
import torch
import random
import numpy as np

SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
# torch.use_deterministic_algorithms(True)
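
As a minimal illustration of what seeding provides, the sketch below uses only Python's standard `random` module: re-seeding with the same value replays the same sequence of draws. The helper name `draw` is hypothetical and not part of this notebook; the same idea applies to the NumPy and PyTorch generators seeded above.

```python
import random

SEED = 1234

def draw(seed, n=3):
    """Seed the global generator, then draw n pseudo-random floats."""
    random.seed(seed)
    return [random.random() for _ in range(n)]

first = draw(SEED)
second = draw(SEED)
other = draw(SEED + 1)

print(first == second)  # same seed, same sequence
print(first == other)   # a different seed gives a different sequence
```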

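Import the corresponding Tokenizer from the transformers library. Official documentation: https://huggingface.co/docs/transformers/main_classes/tokenizer

Tokenizers are responsible for preparing the model inputs. Their main functions are:
1. tokenization, i.e. splitting text into tokens;
2. mapping between tokens and the vocabulary;
3. handling special tokens, such as the mask token and the start-of-sequence token.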
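
The three functions listed above can be sketched with a toy tokenizer in plain Python. This is only an illustration: `toy_vocab` and the `toy_*` helpers are hypothetical, and the ids are invented rather than taken from bert-base-chinese.

```python
# Toy vocabulary; the ids are made up and are NOT those of bert-base-chinese.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "真": 4, "没": 5, "券": 6}

def toy_tokenize(text):
    """Step 1: split into tokens (one token per non-space character)."""
    return [ch for ch in text if not ch.isspace()]

def toy_convert_tokens_to_ids(tokens):
    """Step 2: map tokens to ids, falling back to [UNK] for unknown tokens."""
    return [toy_vocab.get(tok, toy_vocab["[UNK]"]) for tok in tokens]

def toy_add_special_tokens(ids):
    """Step 3: wrap the sequence as [CLS] ... [SEP]."""
    return [toy_vocab["[CLS]"]] + ids + [toy_vocab["[SEP]"]]

toy_tokens = toy_tokenize("真 没 券 买")
toy_ids = toy_add_special_tokens(toy_convert_tokens_to_ids(toy_tokens))
print(toy_tokens)  # ['真', '没', '券', '买']
print(toy_ids)     # [2, 4, 5, 6, 1, 3]
```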

In [2]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')

In [3]:
len(tokenizer.vocab)

21128

Convert a string to tokens

In [4]:
tokens = tokenizer.tokenize('真 没 搞 明白 京东 券是 怎么回事 买 没有 券买 完 有券 ')

print(tokens)

['真', '没', '搞', '明', '白', '京', '东', '券', '是', '怎', '么', '回', '事', '买', '没', '有', '券', '买', '完', '有', '券']


Convert tokens to indices

In [5]:
indexes = tokenizer.convert_tokens_to_ids(tokens)
print(indexes)

[4696, 3766, 3018, 3209, 4635, 776, 691, 1171, 3221, 2582, 720, 1726, 752, 743, 3766, 3300, 1171, 743, 2130, 3300, 1171]


Get the special tokens. In BERT, the first token of an input sequence is '[CLS]' and the sequence ends with '[SEP]'.

In [6]:
init_token_idx = tokenizer.cls_token_id
eos_token_idx = tokenizer.sep_token_id
pad_token_idx = tokenizer.pad_token_id
unk_token_idx = tokenizer.unk_token_id

print(init_token_idx, eos_token_idx, pad_token_idx, unk_token_idx)

101 102 0 100


Maximum input length, i.e. the maximum number of tokens

In [7]:
max_input_length = tokenizer.max_model_input_sizes['bert-base-chinese']

print(max_input_length)

512


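Tokenization function. It truncates to max_input_length - 2 tokens to leave room for the '[CLS]' and '[SEP]' tokens.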

In [8]:
def tokenize_and_cut(sentence):
    tokens = tokenizer.tokenize(sentence)
    tokens = tokens[: max_input_length - 2]
    return tokens
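
The `- 2` in the slice reserves room for the two special tokens added around every sequence. A minimal sketch of that arithmetic with plain lists follows; `build_input` is a hypothetical helper, and 101, 102 and 0 are the `[CLS]`, `[SEP]` and `[PAD]` ids printed earlier. For simplicity it pads to `max_len`, whereas the torchtext pipeline in this notebook pads each batch only to its longest sequence.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids printed by the tokenizer above

def build_input(token_ids, max_len):
    """Truncate to max_len - 2, add [CLS]/[SEP], then right-pad to max_len."""
    body = token_ids[: max_len - 2]
    seq = [CLS_ID] + body + [SEP_ID]
    return seq + [PAD_ID] * (max_len - len(seq))

short_input = build_input([4696, 3766, 3018], max_len=8)
long_input = build_input(list(range(1000, 1020)), max_len=8)
print(short_input)  # [101, 4696, 3766, 3018, 102, 0, 0, 0]
print(long_input)   # [101, 1000, 1001, 1002, 1003, 1004, 1005, 102]
```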

# Preprocess the data with torchtext

In [9]:
from torchtext.legacy import data

TEXT = data.Field(batch_first=True,
                use_vocab=False,
                tokenize=tokenize_and_cut,
                preprocessing=tokenizer.convert_tokens_to_ids,
                init_token = init_token_idx,
                eos_token=eos_token_idx,
                pad_token=pad_token_idx,
                unk_token=unk_token_idx)

LABEL = data.LabelField()   

In [10]:
fields = [('label', LABEL), ('comment_processed', TEXT)]

train_Dataset, val_Dataset, test_Dataset = data.TabularDataset.splits(
    path='/workspace/vscode/works/研一上学期任务/data',
    format='csv',
    train='train_data.csv',
    validation='valid_data.csv',
    test='test_data.csv',
    skip_header=True,
    fields=fields)

LABEL.build_vocab(train_Dataset)
print(LABEL.vocab.stoi)


defaultdict(None, {'2': 0, '0': 1, '1': 2})
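
Note that the index assigned to each label is not the label's numeric value: label '2' maps to index 0. The printed mapping is consistent with ordering labels by descending frequency in the training set, which is my understanding of how torchtext's `build_vocab` behaves; treat that as an assumption. A sketch of such an ordering with `collections.Counter`, using invented label counts:

```python
from collections import Counter

# Hypothetical label column; the counts are invented for illustration.
labels = ["2"] * 5 + ["0"] * 3 + ["1"] * 2

counts = Counter(labels)
# The most frequent label gets index 0, the next gets 1, and so on.
itos = [label for label, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
stoi = {label: idx for idx, label in enumerate(itos)}
print(stoi)  # {'2': 0, '0': 1, '1': 2}

# Map a predicted class index back to the original label string.
predicted_index = 0
print(itos[predicted_index])  # 2
```

When reading model predictions later, the class index has to be mapped back through this table (in the notebook, `LABEL.vocab.itos`) to recover the original label.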


In [11]:
print(vars(train_Dataset.examples[6]))

tokens = tokenizer.convert_ids_to_tokens(vars(train_Dataset.examples[6])['comment_processed'])
print(tokens)

{'label': '1', 'comment_processed': [4692, 6629, 3341, 679, 1008, 3173, 3322, 1690, 7027, 7481, 2523, 1914, 6763, 816]}
['看', '起', '来', '不', '像', '新', '机', '器', '里', '面', '很', '多', '软', '件']


In [12]:
print(f"Number of train_data: {len(train_Dataset)}")
print(f"Number of valid_data: {len(val_Dataset)}")
print(f"Number of test_data: {len(test_Dataset)}")

Number of train_data: 45595
Number of valid_data: 14249
Number of test_data: 11399


In [13]:
print(vars(train_Dataset.examples[6]))

{'label': '1', 'comment_processed': [4692, 6629, 3341, 679, 1008, 3173, 3322, 1690, 7027, 7481, 2523, 1914, 6763, 816]}


In [14]:
tokens = tokenizer.convert_ids_to_tokens(vars(train_Dataset.examples[6])['comment_processed'])

print(tokens)

['看', '起', '来', '不', '像', '新', '机', '器', '里', '面', '很', '多', '软', '件']


In [15]:
batch_size = 128
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits((train_Dataset, val_Dataset, test_Dataset),
                                                                           batch_size=batch_size,
                                                                           sort=False,
                                                                           device=device)

In [16]:
next(iter(train_iterator)).label

tensor([0, 2, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 2, 2, 2, 0, 0, 2, 0, 0, 0, 2, 2,
        2, 1, 0, 2, 1, 2, 2, 1, 1, 1, 2, 1, 0, 2, 0, 0, 1, 0, 2, 1, 1, 0, 1, 0,
        0, 0, 2, 1, 1, 0, 2, 2, 1, 2, 0, 0, 2, 0, 0, 1, 1, 2, 1, 1, 0, 0, 1, 1,
        1, 2, 1, 2, 1, 1, 2, 0, 1, 1, 1, 2, 0, 1, 0, 1, 0, 1, 2, 0, 0, 1, 1, 1,
        2, 1, 1, 1, 0, 0, 0, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 0, 2, 1, 2, 1, 1,
        2, 1, 0, 2, 2, 1, 1, 1], device='cuda:0')

In [17]:
next(iter(train_iterator)).comment_processed 

tensor([[ 101,  782, 4495,  ...,    0,    0,    0],
        [ 101, 1927, 3307,  ...,    0,    0,    0],
        [ 101, 2797, 3322,  ...,    0,    0,    0],
        ...,
        [ 101, 1041, 4510,  ...,    0,    0,    0],
        [ 101, 3219, 1921,  ...,    0,    0,    0],
        [ 101, 4212, 4685,  ...,    0,    0,    0]], device='cuda:0')

In [18]:
next(iter(train_iterator))


[torchtext.legacy.data.batch.Batch of size 128]
	[.label]:[torch.cuda.LongTensor of size 128 (GPU 0)]
	[.comment_processed]:[torch.cuda.LongTensor of size 128x233 (GPU 0)]

Load the pretrained model

In [19]:
from transformers import BertTokenizer, BertModel

bert = BertModel.from_pretrained('bert-base-chinese')

Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


# Define the model

In [20]:
import torch.nn as nn

class BERTGRUSentiment(nn.Module):
    def __init__(self, bert, hidden_dim, nums_output, n_layers, bidirectional, dropout):
        super(BERTGRUSentiment, self).__init__()
        self.bert = bert
        embedding_dim = bert.config.to_dict()['hidden_size']
        self.rnn = nn.GRU(embedding_dim,
                            hidden_dim,
                            num_layers=n_layers, 
                            bidirectional=bidirectional,
                            batch_first=True,
                            dropout = 0 if n_layers < 2 else dropout)
        self.dropout = nn.Dropout(dropout)
        self.output = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, nums_output)

    def forward(self, text):
        # text shape: (batch size, sentence len)
        # with torch.no_grad():
        embeded = self.bert(text)[0]
        
        # embeded shape: (batch size, sentence len, embedding dim)
        _, hidden = self.rnn(embeded)

        # hidden shape: (n_layers * num_directions, batch size, hidden dim)
        if self.rnn.bidirectional:
            hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1))
        else:
            hidden = self.dropout(hidden[-1,:,:])
        
        # output shape: (batch size, nums_output)
        output = self.output(hidden)

        return output
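
Why `hidden[-2]` and `hidden[-1]`: according to the PyTorch `nn.GRU` documentation, the final hidden state stacks layers and directions along its first dimension, so for a bidirectional GRU the last two slices are the forward and backward states of the top layer. A sketch of that index layout in plain Python (no tensors), with `hidden_index` as a hypothetical helper:

```python
def hidden_index(layer, direction, num_directions=2):
    """Position of (layer, direction) along dim 0 of the final hidden state.

    direction: 0 = forward, 1 = backward.
    """
    return layer * num_directions + direction

n_layers, num_directions = 2, 2
total = n_layers * num_directions  # size of dim 0 of the final hidden state

top = n_layers - 1
forward_top = hidden_index(top, 0)   # 2
backward_top = hidden_index(top, 1)  # 3

# Negative indices -2 and -1 address the same two slices.
print(forward_top == total - 2, backward_top == total - 1)  # True True
```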

## Set the hyperparameters

In [21]:
HIDDEN_DIM = 512
NUMS_OUTPUT = 3
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.25
lr = 0.0005

model = BERTGRUSentiment(bert, HIDDEN_DIM, NUMS_OUTPUT, N_LAYERS, BIDIRECTIONAL, DROPOUT)

In [22]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 110,933,763 trainable parameters


In [23]:
import torch.optim as optim


optimizer = optim.Adam(model.parameters(), lr=lr)
# optimizer = optim.Adagrad(model.parameters())
# criterion = nn.BCEWithLogitsLoss()
criterion = nn.CrossEntropyLoss()
model = model.to(device)
criterion = criterion.to(device)

In [24]:
def accuracy(pred, y):
    correct = (pred.argmax(dim=1) == y).float()
    return correct.sum() / len(correct)

In [25]:
def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0

    model.train()

    for batch in iterator:
        optimizer.zero_grad()
        preds = model(batch.comment_processed).squeeze(1)
        loss = criterion(preds, batch.label)
        acc = accuracy(preds, batch.label)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [26]:
def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0

    model.eval()

    with torch.no_grad():
        for batch in iterator:
            preds = model(batch.comment_processed).squeeze(1)
            loss = criterion(preds, batch.label)
            acc = accuracy(preds, batch.label)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
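
One detail of `train` and `evaluate`: dividing by `len(iterator)` averages the per-batch accuracies, which can differ slightly from the sample-level accuracy when the last batch is smaller than the rest. A small sketch of the difference, using invented batch results:

```python
# Hypothetical per-batch results: (number correct, batch size).
batches = [(4, 4), (0, 2)]  # the last batch is smaller

# What train()/evaluate() compute: the mean of per-batch accuracies.
mean_of_batches = sum(c / n for c, n in batches) / len(batches)

# Sample-weighted accuracy over the whole split.
overall = sum(c for c, _ in batches) / sum(n for _, n in batches)

print(mean_of_batches)    # 0.5
print(round(overall, 3))  # 0.667
```

With 128-sample batches and tens of thousands of examples, as in this notebook, the gap is small, but it is worth knowing when comparing reported numbers.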

In [27]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [28]:
NUM_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(NUM_EPOCHS):
    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)

    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), "bert-GRU-Reviews-Sentiment.pt")
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

RuntimeError: CUDA out of memory. Tried to allocate 734.00 MiB (GPU 0; 44.56 GiB total capacity; 39.56 GiB already allocated; 496.31 MiB free; 42.08 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

In [None]:
model.load_state_dict(torch.load('bert-GRU-Reviews-Sentiment.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.548 | Test Acc: 75.43%
