In [1]:
import numpy as np
import pandas as pd
import torch
import transformers as ppb # pytorch transformers
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

Load the dataset

In [2]:
df = pd.read_csv('./data/train.tsv', delimiter='\t', header=None)
df.head()

Unnamed: 0,0,1
0,"a stirring , funny and finally transporting re...",1.0
1,apparently reassembled from the cutting room f...,0.0
2,they presume their audience wo n't sit still f...,0.0
3,this is a visually stunning rumination on love...,1.0
4,jonathan parker 's bartleby should have been t...,1.0


In [3]:
df[1].value_counts()

1.0    561
0.0    519
Name: 1, dtype: int64

Load the pretrained DistilBERT model and tokenizer

In [4]:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

In [5]:
# load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


At this point, the variable **model** holds a pretrained DistilBERT model, a distilled version of BERT that is smaller and faster and needs considerably less memory.

Tokenization:

In [7]:
tokenized = df[0].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))

In [8]:
tokenized

0       [101, 1037, 18385, 1010, 6057, 1998, 2633, 182...
1       [101, 4593, 2128, 27241, 23931, 2013, 1996, 62...
2       [101, 2027, 3653, 23545, 2037, 4378, 24185, 10...
3       [101, 2023, 2003, 1037, 17453, 14726, 19379, 1...
4       [101, 5655, 6262, 1005, 1055, 12075, 2571, 376...
                              ...                        
1076    [101, 5363, 2000, 2147, 1999, 1996, 2168, 1281...
1077    [101, 2005, 2216, 2040, 6620, 3209, 2006, 1213...
1078    [101, 1037, 2143, 2007, 3824, 2576, 17011, 720...
1079    [101, 6373, 2153, 2743, 25607, 2015, 2049, 826...
1080    [101, 3477, 10631, 5363, 2000, 5333, 2070, 380...
Name: 0, Length: 1081, dtype: object

Padding: bring every token sequence to the same length by appending the ID 0 to the shorter sentences.

In [9]:
# Length of the longest tokenized sentence
max_len = max(len(i) for i in tokenized.values)

# Right-pad every sequence with 0 up to max_len
padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized.values])

In [10]:
padded.shape

(1081, 59)
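
The padding step can be tried on a few short sequences. This is a minimal sketch on made-up data: `toy_tokenized` and its token IDs are illustrative stand-ins for `tokenized.values`, and it assumes, as the notebook does, that 0 is the padding ID.

```python
import numpy as np

# Toy token-ID lists standing in for `tokenized.values` (values are illustrative)
toy_tokenized = [
    [101, 2023, 2003, 102],
    [101, 102],
    [101, 1037, 2143, 2007, 102],
]

# The longest sequence determines the padded width
max_len = max(len(ids) for ids in toy_tokenized)

# Right-pad each sequence with 0 until it reaches max_len
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in toy_tokenized])

print(padded.shape)
print(padded)
```

Every row now has the same length, so the list of lists becomes a regular 2-D array that can be turned into a single tensor.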

**Masking**: if we send `padded` to BERT directly, the padding would slightly confuse it. We need another variable that tells the model to ignore (mask) the padding we added when it processes its input. That is what `attention_mask` is:

In [11]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(1081, 59)

In [13]:
attention_mask

array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]])
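
The relation between the padded IDs and the mask is easier to see on a small array. The sketch below uses made-up token IDs and assumes that 0 appears only as padding, which is the same assumption the `np.where` call in the notebook relies on.

```python
import numpy as np

# Two toy padded sequences (token IDs are illustrative); 0 marks padding
padded = np.array([
    [101, 2023, 2003, 102, 0],
    [101, 102, 0, 0, 0],
])

# 1 where there is a real token, 0 where there is padding
attention_mask = np.where(padded != 0, 1, 0)

# Summing the mask along each row recovers the unpadded sequence lengths
lengths = attention_mask.sum(axis=1)

print(attention_mask)
print(lengths)
```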

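**DistilBERT processing**<br>
After this step, the DistilBERT output is assigned to `last_hidden_states`. Its first element is a tensor of shape (number of sentences, maximum number of tokens in a sequence, number of hidden units in DistilBERT). In this example that shape is (1081, 59, 768): there are 1081 sentences, the longest tokenized sequence has 59 tokens, and DistilBERT has 768 hidden units.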

In [14]:
input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask = attention_mask)

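For sentence classification we are only interested in the output BERT produces for the [CLS] token, so we keep only the slice of the "cube" that corresponds to [CLS] and discard the rest.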

In [15]:
# Slice the output for the first position for all the sequences, take all hidden unit outputs
features = last_hidden_states[0][:,0,:].numpy()

In [16]:
features

array([[-0.21593425, -0.14028914,  0.00831067, ..., -0.13694833,
         0.58670044,  0.20112702],
       [-0.17262712, -0.1447617 ,  0.00223441, ..., -0.17442559,
         0.21386437,  0.37197483],
       [-0.05063363,  0.07203963, -0.02959726, ..., -0.07148931,
         0.71852386,  0.26225471],
       ...,
       [-0.18480204, -0.14686532, -0.02905772, ..., -0.21470352,
         0.3324482 ,  0.2401797 ],
       [-0.25359544, -0.23936309,  0.07089166, ..., -0.16446884,
         0.5500152 ,  0.39706248],
       [-0.16152978, -0.1405405 , -0.11402909, ..., -0.30985343,
         0.36905068,  0.17303947]], dtype=float32)
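
The `[:, 0, :]` slice can be checked on a small random array. This is a sketch with toy sizes and random values standing in for the real hidden states; `hidden_states` and `cls_features` are illustrative names, not outputs of the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; in the notebook these are 1081 sentences, 59 tokens, 768 hidden units
batch, seq_len, hidden = 4, 6, 8
hidden_states = rng.normal(size=(batch, seq_len, hidden))

# Keep position 0 (where [CLS] sits) of every sequence, with all hidden units
cls_features = hidden_states[:, 0, :]

print(hidden_states.shape)
print(cls_features.shape)
```

The slice drops the token axis, leaving one fixed-size vector per sentence, which is the 2-D feature matrix a scikit-learn classifier expects.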

In [17]:
labels = df[1]
labels

0       1.0
1       0.0
2       0.0
3       1.0
4       1.0
       ... 
1076    0.0
1077    1.0
1078    1.0
1079    0.0
1080    NaN
Name: 1, Length: 1081, dtype: float64

**Train/Test Split**

In [22]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

**Train a logistic regression model on the training set**

In [23]:
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
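
The `ValueError` is most likely caused by the label column: `labels` ends with a `NaN` at index 1080, and `value_counts()` reports only 1080 labelled rows out of 1081. One possible fix is to drop the unlabelled rows from both `features` and `labels` before splitting. The sketch below shows the idea on toy data; `keep`, `clean_features` and `clean_labels` are hypothetical names, and the small arrays stand in for the notebook's real `features` and `labels`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the notebook's `features` and `labels`
features = np.arange(10, dtype=np.float32).reshape(5, 2)
labels = pd.Series([1.0, 0.0, 0.0, 1.0, np.nan])

# Boolean mask that is True for rows with a label
keep = labels.notna().to_numpy()

# Apply the same mask to both arrays so rows stay aligned
clean_features = features[keep]
clean_labels = labels[keep].astype(int)

print(clean_features.shape)
print(clean_labels.tolist())
```

In the notebook, the filtered arrays would then be passed to `train_test_split` in place of `features` and `labels`.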