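# Text Classification in Practice
Dataset: sentiment analysis on a movie-review dataset (a classification task)\
Word vectors: load pretrained word vectors or train your own\
Sequence model: train an RNN on the data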

In [1]:
import os
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import tensorflow as tf
import logging
import time
import pprint
from collections import Counter
from pathlib import Path
from tqdm import tqdm

Load the movie-review dataset. It can also be downloaded manually and placed in the expected location.

In [2]:
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.imdb.load_data()

In [3]:
X_train.shape, X_test.shape

((25000,), (25000,))

In [4]:
X_train[3453]

[1,
 45,
 24,
 179,
 4,
 3679,
 991,
 25,
 62,
 440,
 12,
 62,
 30,
 448,
 23,
 4,
 8247,
 12,
 9552,
 21,
 61155,
 2927,
 2538,
 9,
 131,
 6,
 12999,
 4702,
 18,
 107,
 185,
 156,
 43,
 10789,
 83,
 650,
 33,
 4,
 58,
 526,
 34,
 308,
 15706,
 5,
 398,
 34,
 20264,
 5877,
 4,
 20,
 186,
 8,
 30,
 6,
 2220,
 7,
 94,
 58,
 4,
 522,
 5581,
 54,
 298,
 108,
 71,
 262,
 18471,
 21,
 12,
 131,
 6041,
 6,
 3341,
 88,
 4,
 65,
 266,
 180,
 8,
 1326,
 7,
 5293,
 5,
 9398,
 15,
 15107,
 57,
 551,
 51,
 810,
 4,
 598,
 1360,
 2398,
 70,
 131,
 30,
 421,
 11,
 4,
 15048,
 31454,
 258,
 11,
 8389,
 4711,
 20060,
 2527,
 10,
 10,
 4,
 6379,
 114,
 1160,
 914,
 2999,
 6,
 2340,
 185,
 13097,
 37,
 1068,
 8,
 847,
 8,
 3754,
 8,
 413,
 6,
 9962,
 18,
 3485,
 18,
 1026,
 372,
 368,
 7,
 1708,
 21,
 1892,
 101,
 11478,
 29,
 996,
 3605,
 21,
 9,
 8911,
 8,
 16301,
 4108,
 466,
 27,
 19511,
 17040,
 29,
 892,
 6,
 3071,
 10921,
 4970,
 3036,
 773,
 6042,
 9203,
 37,
 86,
 1085,
 914,
 17,
 35,
 776,
 11

The data arrives already mapped to IDs. Most datasets are read in as words and have to be mapped to IDs manually.

In [5]:
X_train[2][:10]

[1, 14, 47, 8, 30, 31, 7, 4, 249, 108]
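For data that arrives as raw words, the manual mapping step might look like the sketch below. The vocabulary, the whitespace tokenizer, and the `encode` helper are all made up for illustration; they are not part of the Keras dataset API.

```python
# Hypothetical toy vocabulary; ids 0-2 are reserved for special tokens.
vocab = {'<pad>': 0, '<start>': 1, '<unk>': 2,
         'the': 3, 'movie': 4, 'was': 5, 'great': 6}

def encode(text, vocab):
    """Lowercase, split on whitespace, and map each token to its id."""
    tokens = text.lower().split()
    # Words missing from the vocabulary fall back to the <unk> id.
    return [vocab['<start>']] + [vocab.get(tok, vocab['<unk>']) for tok in tokens]

ids = encode("The movie was surprisingly great", vocab)
print(ids)  # [1, 3, 4, 5, 2, 6]
```

The word "surprisingly" is not in the toy vocabulary, so it is encoded as `2`.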

The word-to-ID mapping table. Every ID is shifted by 3 so that the first three slots are free for special tokens.

In [6]:
_word2idx = tf.keras.datasets.imdb.get_word_index()
word2idx = {k: v + 3 for k, v in _word2idx.items()}
word2idx['<pad>'] = 0
word2idx['<start>'] = 1
word2idx['<unk>'] = 2
idx2word = dict(zip(word2idx.values(), word2idx.keys()))
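To see what the offset of 3 does, here is a small sketch with a made-up three-word index standing in for the dictionary returned by `get_word_index()`. The `decode` helper is hypothetical.

```python
# Toy stand-in for the raw word index: words are ranked by frequency, starting at 1.
raw_index = {'the': 1, 'film': 2, 'good': 3}

# Shift every id by 3 so that 0, 1 and 2 are free for special tokens.
word2idx = {w: i + 3 for w, i in raw_index.items()}
word2idx.update({'<pad>': 0, '<start>': 1, '<unk>': 2})
idx2word = {i: w for w, i in word2idx.items()}

def decode(ids, idx2word):
    """Turn a list of ids back into text, skipping padding."""
    return ' '.join(idx2word.get(i, '<unk>') for i in ids if i != 0)

print(decode([1, 4, 5, 6, 0, 0], idx2word))  # <start> the film good
```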

In [7]:
list(word2idx.items())[:10]

[('fawn', 34704),
 ('tsukino', 52009),
 ('nunnery', 52010),
 ('sonja', 16819),
 ('vani', 63954),
 ('woods', 1411),
 ('spiders', 16118),
 ('hanging', 2348),
 ('woody', 2292),
 ('trawling', 52011)]

In [8]:
list(idx2word.items())[:10]

[(34704, 'fawn'),
 (52009, 'tsukino'),
 (52010, 'nunnery'),
 (16819, 'sonja'),
 (63954, 'vani'),
 (1411, 'woods'),
 (16118, 'spiders'),
 (2348, 'hanging'),
 (2292, 'woody'),
 (52011, 'trawling')]

Sort the samples by text length

In [9]:
def sort_by_len(x, y):
    x, y = np.asarray(x), np.asarray(y)
    idx = sorted(range(len(x)), key=lambda i: len(x[i]))
    return x[idx], y[idx]
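The same idea on plain Python lists, as a minimal sketch. `sort_by_len_py` is a hypothetical list-based variant, used here so the toy data does not need to be a NumPy object array.

```python
def sort_by_len_py(xs, ys):
    """Reorder samples and labels together, shortest sample first."""
    order = sorted(range(len(xs)), key=lambda i: len(xs[i]))
    return [xs[i] for i in order], [ys[i] for i in order]

xs = [[1, 5, 9, 2], [1, 7], [1, 3, 8]]
ys = [1, 0, 1]
sorted_xs, sorted_ys = sort_by_len_py(xs, ys)
print(sorted_xs)  # [[1, 7], [1, 3, 8], [1, 5, 9, 2]]
print(sorted_ys)  # [0, 1, 1]
```

Each label moves with its sample, so the pairing is preserved after sorting.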

Save the intermediate result to disk as plain text so it is easy to reuse later

In [10]:
x_train, y_train = sort_by_len(X_train, y_train)
x_test, y_test = sort_by_len(X_test, y_test)

def write_file(f_path, xs, ys):
    with open(f_path, 'w', encoding='utf-8') as f:
        for x, y in zip(xs, ys):
            # one sample per line: label, a tab, then the words; [1:] drops the leading <start> token
            f.write(str(y) + '\t' + ' '.join([idx2word[i] for i in x][1:]) + '\n')

Path("./data/text").mkdir(parents=True, exist_ok=True)   # make sure the output directory exists
write_file("./data/text/train.txt", x_train, y_train)
write_file("./data/text/test.txt", x_test, y_test)

# Build the vocabulary from word frequencies

In [11]:
counter = Counter()
with open('./data/text/train.txt', encoding='utf-8') as f:
    for line in f:
        line = line.rstrip()
        label, words = line.split('\t')
        words = words.split(' ')
        counter.update(words)

words = ['<pad>'] + [w for w, freq in counter.most_common() if freq >= 10]
print('Vocab Size:', len(words))

Path("./vocab").mkdir(exist_ok=True)

with open('./vocab/word.txt', 'w', encoding='utf-8') as f:
    for w in words:
        f.write(w + '\n')

Vocab Size: 20598
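The frequency cut-off is easier to follow on a tiny corpus. This sketch uses three made-up sentences and a threshold of 2 instead of 10.

```python
from collections import Counter

corpus = ["good movie good plot", "bad movie", "good acting"]

counter = Counter()
for line in corpus:
    counter.update(line.split(' '))

min_freq = 2  # hypothetical threshold for the toy corpus; the cell above uses 10
vocab = ['<pad>'] + [w for w, freq in counter.most_common() if freq >= min_freq]
word2idx = {w: i for i, w in enumerate(vocab)}

print(vocab)     # ['<pad>', 'good', 'movie']
print(word2idx)  # {'<pad>': 0, 'good': 1, 'movie': 2}
```

Words seen fewer than `min_freq` times ("plot", "bad", "acting") are dropped and will later be treated as unknown.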


Build the new word2idx mapping from the saved vocabulary

In [12]:
word2idx = dict()
with open("./vocab/word.txt", 'r', encoding="utf-8") as f:
    for i, word in enumerate(f):
        word = word.rstrip()
        word2idx[word] = i

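# Embedding layer
The embeddings can be trained along with the network or loaded from vectors that someone else has trained; usually a pretrained model is loaded directly.\
Commonly used pretrained vectors: https://nlp.stanford.edu/projects/glove/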

In [13]:
# Build one big lookup table: 20598 vocabulary words plus 1 extra row, each a 50-dim vector
embedding = np.zeros((len(word2idx)+1, 50))   # +1: the last row is shared by every word not in the vocabulary (unknown)

with open('./data/glove.6B.50d.txt', encoding='utf-8') as f:    # the downloaded pretrained vectors
    count = 0
    for i, line in enumerate(f):
        if i % 100000 == 0:
            print('- At line {}'.format(i))   # report progress
        line = line.rstrip()
        sp = line.split(' ')
        word, vec = sp[0], sp[1:]
        if word in word2idx:
            count += 1
            embedding[word2idx[word]] = np.asarray(vec, dtype='float32')  # store this word's pretrained vector

- At line 0
- At line 100000
- At line 200000
- At line 300000
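The loading loop relies on the GloVe text format, where each line holds a word followed by its vector components separated by spaces. A minimal sketch on two made-up lines, using plain lists instead of a NumPy array:

```python
import io

# Two made-up lines in GloVe text format: a word followed by its vector.
fake_glove = io.StringIO("good 0.1 0.2 0.3\nbad -0.1 -0.2 -0.3\n")

word2idx = {'<pad>': 0, 'good': 1, 'rare': 2}
dim = 3
# One extra row at the end is shared by all out-of-vocabulary words.
embedding = [[0.0] * dim for _ in range(len(word2idx) + 1)]

found = 0
for line in fake_glove:
    word, *vec = line.rstrip().split(' ')
    if word in word2idx:          # ignore pretrained words we do not use
        embedding[word2idx[word]] = [float(v) for v in vec]
        found += 1

print(found)         # 1
print(embedding[1])  # [0.1, 0.2, 0.3]
print(embedding[2])  # [0.0, 0.0, 0.0]
```

Vocabulary words with no pretrained vector ("rare" here) keep an all-zero row, which is also what happens to the words not found in the cell above.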


In [14]:
embedding[1]

array([ 4.18000013e-01,  2.49679998e-01, -4.12420005e-01,  1.21699996e-01,
        3.45270008e-01, -4.44569997e-02, -4.96879995e-01, -1.78619996e-01,
       -6.60229998e-04, -6.56599998e-01,  2.78430015e-01, -1.47670001e-01,
       -5.56770027e-01,  1.46579996e-01, -9.50950012e-03,  1.16579998e-02,
        1.02040000e-01, -1.27920002e-01, -8.44299972e-01, -1.21809997e-01,
       -1.68009996e-02, -3.32789987e-01, -1.55200005e-01, -2.31309995e-01,
       -1.91809997e-01, -1.88230002e+00, -7.67459989e-01,  9.90509987e-02,
       -4.21249986e-01, -1.95260003e-01,  4.00710011e+00, -1.85939997e-01,
       -5.22870004e-01, -3.16810012e-01,  5.92130003e-04,  7.44489999e-03,
        1.77780002e-01, -1.58969998e-01,  1.20409997e-02, -5.42230010e-02,
       -2.98709989e-01, -1.57490000e-01, -3.47579986e-01, -4.56370004e-02,
       -4.42510009e-01,  1.87849998e-01,  2.78489990e-03, -1.84110001e-01,
       -1.15139998e-01, -7.85809994e-01])

In [15]:
print("%d / %d words have found pre-trained values" % (count, len(word2idx)))
np.save("./vocab/word.npy", embedding)
print('Saved ./vocab/word.npy')

19676 / 20598 words have found pre-trained values
Saved ./vocab/word.npy


In [16]:
embedding.shape

(20599, 50)

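# Build the training data
Note that all input samples must have the same shape (text length, word-vector dimension, and so on).

# Data generator
tf.data.Dataset.from_tensor_slices(tensor): slices the tensor along its first dimension and returns a dataset of N samples. The drawback is that the whole dataset must be passed in at once before it is sliced into a dataset object, which uses a lot of memory.\
tf.data.Dataset.from_generator(generator, output_types, output_shapes): reads samples one at a time from a generator.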
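The memory argument can be illustrated without TensorFlow. The sketch below is a plain-Python analogy, not the `tf.data` implementation: `sample_stream` and `batched` are hypothetical helpers that yield samples one at a time and group them lazily, so only one batch is held in memory at once.

```python
from itertools import islice

def sample_stream(n):
    """Yield (features, label) pairs one at a time, like a file reader would."""
    for i in range(n):
        yield [i, i + 1], i % 2

def batched(stream, batch_size):
    """Group a stream into lists of at most batch_size samples."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

batches = list(batched(sample_stream(5), batch_size=2))
print([len(b) for b in batches])  # [2, 2, 1]
print(batches[0])                 # [([0, 1], 0), ([1, 2], 1)]
```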

In [17]:
def data_generator(f_path, params):
    with open(f_path, encoding='utf-8') as f:
        print("Reading", f_path)
        for line in f:
            line = line.rstrip()
            label, text = line.split('\t')
            text = text.split(' ')
            x = [params['word2idx'].get(w, len(params['word2idx'])) for w in text]  # look up each word's ID; unknown words get the extra last index
            if len(x) >= params['max_len']:  # truncate
                x = x[:params['max_len']]
            else:
                x += [0] * (params['max_len'] - len(x))        # pad with 0 (<pad>)
            y = int(label)
            yield x, y
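The truncate-or-pad step in isolation, as a small sketch. `pad_or_truncate` is a hypothetical helper that mirrors the branch inside `data_generator`.

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Return a list of exactly max_len ids."""
    if len(ids) >= max_len:
        return ids[:max_len]                          # keep the first max_len ids
    return ids + [pad_id] * (max_len - len(ids))      # fill the tail with padding

print(pad_or_truncate([5, 6, 7], 5))             # [5, 6, 7, 0, 0]
print(pad_or_truncate([5, 6, 7, 8, 9, 10], 5))   # [5, 6, 7, 8, 9]
```

Every sample leaves this step with the same length, which is what lets the dataset declare a fixed output shape of `[max_len]`.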

In [18]:
def dataset(is_training, params):
    _shapes = ([params['max_len']], ())
    _types = (tf.int32, tf.int32)
    
    if is_training:
        ds = tf.data.Dataset.from_generator(
            lambda: data_generator(params['train_path'], params),
            output_shapes=_shapes,
            output_types=_types
        )
        ds = ds.shuffle(params['num_samples'])
        ds = ds.batch(params['batch_size'])
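        ds = ds.prefetch(tf.data.experimental.AUTOTUNE)   # prefetch upcoming batches while the current one is processed; AUTOTUNE picks the buffer size dynamically, which speeds up the input pipeline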
    else:
        ds = tf.data.Dataset.from_generator(
            lambda: data_generator(params['test_path'], params),
            output_shapes=_shapes,
            output_types=_types
        )
        ds = ds.batch(params['batch_size'])
        ds = ds.prefetch(tf.data.experimental.AUTOTUNE) 
    
    return ds

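# Custom network model
Define which layers the model has\
then run the forward pass through them

BiLSTM (bidirectional long short-term memory network)\
Effectively two LSTMs, one reading the sequence in each direction; their outputs are concatenated, so the output feature size is 2 * rnn_units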

In [37]:
class Model(tf.keras.Model):
    def __init__(self, params):
        super().__init__()
        
        self.embedding = tf.Variable(np.load('./vocab/word.npy'),
                                     dtype=tf.float32,
                                     name='pretrained_embedding',
                                     trainable=False
                                    )
        
        self.drop1 = tf.keras.layers.Dropout(params['dropout_rate'])
        self.drop2 = tf.keras.layers.Dropout(params['dropout_rate'])
        self.drop3 = tf.keras.layers.Dropout(params['dropout_rate'])
        
        self.rnn1 = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(params['rnn_units'], return_sequences=True))
        self.rnn2 = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(params['rnn_units'], return_sequences=True))
        self.rnn3 = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(params['rnn_units'], return_sequences=False))
        
        self.drop_fc = tf.keras.layers.Dropout(params['dropout_rate'])
        self.fc = tf.keras.layers.Dense(2*params['rnn_units'], tf.nn.elu)
        
        self.out_linear = tf.keras.layers.Dense(2)
        
    def call(self, inputs, training=False):
        if inputs.dtype != tf.int32:
            inputs = tf.cast(inputs, tf.int32)
            
        batch_sz = tf.shape(inputs)[0]
        rnn_units = 2 * params['rnn_units']
        
        x = tf.nn.embedding_lookup(self.embedding, inputs)
        
        x = self.drop1(x, training=training)
        x = self.rnn1(x)
        
        x = self.drop2(x, training=training)
        x = self.rnn2(x)
        
        x = self.drop3(x, training=training)
        x = self.rnn3(x)
        
        x = self.drop_fc(x, training=training)
        x = self.fc(x)
        
        x = self.out_linear(x)
        
        return x

In [35]:
# Faster variant
class Model2(tf.keras.Model):
    def __init__(self, params):
        super().__init__()

        self.embedding = tf.Variable(np.load('./vocab/word.npy'),
                                     dtype=tf.float32,
                                     name='pretrained_embedding',
                                     trainable=False
                                    )

        self.drop1 = tf.keras.layers.Dropout(params['dropout_rate'])
        self.drop2 = tf.keras.layers.Dropout(params['dropout_rate'])
        self.drop3 = tf.keras.layers.Dropout(params['dropout_rate'])

        self.rnn1 = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(params['rnn_units'], return_sequences=True))
        self.rnn2 = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(params['rnn_units'], return_sequences=True))
        self.rnn3 = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(params['rnn_units'], return_sequences=True))

        self.drop_fc = tf.keras.layers.Dropout(params["dropout_rate"])
        self.fc = tf.keras.layers.Dense(2*params['rnn_units'], tf.nn.elu)

        self.out_linear = tf.keras.layers.Dense(2)


    def call(self, inputs, training=False):
        if inputs.dtype != tf.int32:
            inputs = tf.cast(inputs, tf.int32)
            
        batch_sz = tf.shape(inputs)[0]
        rnn_units = 2 * params['rnn_units']
        
        x = tf.nn.embedding_lookup(self.embedding, inputs)
        
        # split each sequence into 10*10 chunks of 10 tokens; this assumes max_len == 10*10*10 == 1000
        x = tf.reshape(x, (batch_sz*10*10, 10, 50))
        x = self.drop1(x, training=training)
        x = self.rnn1(x)
        x = tf.reduce_max(x, 1)
        
        x = tf.reshape(x, (batch_sz*10, 10, rnn_units))
        x = self.drop2(x, training=training)
        x = self.rnn2(x)
        x = tf.reduce_max(x, 1)
        
        x = tf.reshape(x, (batch_sz, 10, rnn_units))
        x = self.drop3(x, training=training)
        x = self.rnn3(x)
        x = tf.reduce_max(x, 1)
        
        x = self.drop_fc(x, training=training)
        x = self.fc(x)
        
        x = self.out_linear(x)
        
        return x
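The reshapes in `Model2.call` are easier to follow as shape arithmetic. The sketch below only computes the shapes fed to the three BiLSTM layers, and `model2_rnn_input_shapes` is a hypothetical helper. It assumes a sequence length of 10*10*10 = 1000, which is what the hard-coded reshapes imply; note that `params` elsewhere in this notebook sets `max_len` to 200, which would not fit these reshapes.

```python
def model2_rnn_input_shapes(batch, max_len, emb_dim, rnn_units, chunk=10):
    """Shapes of the inputs to rnn1, rnn2 and rnn3 in the hierarchical model."""
    if max_len != chunk ** 3:
        raise ValueError("the reshapes need max_len == chunk ** 3")
    out_dim = 2 * rnn_units  # a BiLSTM concatenates forward and backward outputs
    return [
        (batch * chunk * chunk, chunk, emb_dim),  # 100 chunks of 10 tokens per sample
        (batch * chunk, chunk, out_dim),          # 10 groups of 10 chunk summaries
        (batch, chunk, out_dim),                  # 10 group summaries per sample
    ]

print(model2_rnn_input_shapes(batch=32, max_len=1000, emb_dim=50, rnn_units=128))
# [(3200, 10, 50), (320, 10, 256), (32, 10, 256)]
```

Each LSTM therefore unrolls over only 10 time steps, with `tf.reduce_max` collapsing the time axis between levels. That is plausibly why this variant is labelled as faster.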

# Set the parameters

In [25]:
params = {
    'vocab_path': './vocab/word.txt',
    'train_path': './data/text/train.txt',
    'test_path': './data/text/test.txt',
    'num_samples': 25000,
    'num_labels': 2,
    'batch_size': 32,
    'max_len': 200,
    'rnn_units': 128,
    'dropout_rate': 0.2,
    'clip_norm': 10,
    'num_patience': 3,
    'lr': 3e-4
}

In [23]:
# Decide whether to stop early: True if accuracy fell on each of the last num_patience evaluations
def is_descending(history: list):
    history = history[-(params['num_patience']+1):]
    for i in range(1, len(history)):
        if history[i-1] <= history[i]:
            return False
    return True
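A quick check of the early-stopping rule on made-up accuracy histories. This sketch is a standalone variant that takes `patience` as an argument instead of reading the global `params`.

```python
def is_descending(history, patience):
    """True if the last `patience` + 1 values are strictly decreasing."""
    recent = history[-(patience + 1):]
    return all(prev > cur for prev, cur in zip(recent, recent[1:]))

# Accuracy fell three evaluations in a row: stop.
print(is_descending([0.80, 0.85, 0.84, 0.83, 0.82], patience=3))  # True
# One recovery in the window (0.84 -> 0.86): keep training.
print(is_descending([0.80, 0.85, 0.84, 0.86, 0.82], patience=3))  # False
```

A plateau (two equal values) also counts as "not descending", so training continues in that case.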

In [26]:
word2idx = {}
with open(params['vocab_path'], encoding='utf-8') as f:
    for i, line in enumerate(f):
        line = line.rstrip()
        word2idx[line] = i
params['word2idx'] = word2idx
params['vocab_size'] = len(word2idx) + 1

In [27]:
len(word2idx)

20598

In [40]:
model = Model(params)
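model.build(input_shape=(None, None))   # set the input shape; fit() can also infer it automatically

decay_lr = tf.optimizers.schedules.ExponentialDecay(params['lr'], 1000, 0.95)    # exponential learning-rate decay: the rate shrinks smoothly by a factor of 0.95 per 1000 steps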
optim = tf.optimizers.Adam(params['lr'])
global_step = 0

history_acc = []
best_acc = .0

t0 = time.time()
logger = logging.getLogger('tensorflow')
logger.setLevel(logging.INFO)
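With its default non-staircase setting, `ExponentialDecay` computes `initial_lr * decay_rate ** (step / decay_steps)`. The plain-Python sketch below reproduces that formula with the values used above, and `exponential_decay` is a hypothetical stand-in for the schedule object.

```python
def exponential_decay(initial_lr, step, decay_steps=1000, decay_rate=0.95):
    """Smooth exponential decay of the learning rate."""
    return initial_lr * decay_rate ** (step / decay_steps)

for step in (0, 50, 1000, 5000):
    print(step, '{:.6f}'.format(exponential_decay(3e-4, step)))
# 0 0.000300
# 50 0.000299
# 1000 0.000285
# 5000 0.000232
```

The values at steps 0, 50 and 5000 match the `LR` column in the training log.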

In [41]:
while True:
    # Train the model
    for texts, labels in dataset(is_training=True, params=params):
        with tf.GradientTape() as tape: # gradient tape: records the operations run inside this context so that .gradient() can compute gradients of any tensor computed here
            logits = model(texts, training=True)
            loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)
            loss = tf.reduce_mean(loss)
        
        optim.lr.assign(decay_lr(global_step))
        grads = tape.gradient(loss, model.trainable_variables)
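        grads, _ = tf.clip_by_global_norm(grads, params['clip_norm'])   # clip the gradients so a single update cannot be too large (guards against exploding gradients)
        optim.apply_gradients(zip(grads, model.trainable_variables))    # apply the gradients to update the weights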
        
        if global_step % 50 == 0:
            logger.info("Step {} | Loss: {:.4f} | Spent: {:.1f} secs | LR: {:6f}".format(
                global_step, loss.numpy().item(), time.time() - t0, optim.lr.numpy().item()
            ))
        
        global_step += 1
        
    # Evaluate on the held-out data (the test file serves as the validation set here)
    m = tf.keras.metrics.Accuracy()
    
    for texts, labels in dataset(is_training=False, params=params):
        logits = model(texts, training=False)
        y_pred = tf.argmax(logits, axis=1)
        m.update_state(y_true=labels, y_pred=y_pred)
    
    acc = m.result().numpy()
    logger.info('Evaluation: Testing Accuracy: {:.3f}'.format(acc))
    history_acc.append(acc)
    
    if acc > best_acc:
        best_acc = acc
    logger.info("Best Accuracy: {:.3f}".format(best_acc))
    
    if len(history_acc) > params['num_patience'] and is_descending(history_acc):
        logger.info("Testing Accuracy not improved over {} epochs, Early Stop".format(params['num_patience']))
        break

Reading ./data/text/train.txt
INFO:tensorflow:Step 0 | Loss: 0.6912 | Spent: 8.2 secs | LR: 0.000300
INFO:tensorflow:Step 50 | Loss: 0.6728 | Spent: 268.8 secs | LR: 0.000299
INFO:tensorflow:Step 100 | Loss: 0.6815 | Spent: 536.9 secs | LR: 0.000298
INFO:tensorflow:Step 150 | Loss: 0.7035 | Spent: 889.9 secs | LR: 0.000298
INFO:tensorflow:Step 200 | Loss: 0.5722 | Spent: 1323.2 secs | LR: 0.000297
INFO:tensorflow:Step 250 | Loss: 0.4185 | Spent: 1694.4 secs | LR: 0.000296
INFO:tensorflow:Step 300 | Loss: 0.5918 | Spent: 2101.0 secs | LR: 0.000295
INFO:tensorflow:Step 350 | Loss: 0.6386 | Spent: 2485.0 secs | LR: 0.000295
INFO:tensorflow:Step 400 | Loss: 0.5774 | Spent: 2849.7 secs | LR: 0.000294
INFO:tensorflow:Step 450 | Loss: 0.5334 | Spent: 3206.8 secs | LR: 0.000293
INFO:tensorflow:Step 500 | Loss: 0.5720 | Spent: 3567.1 secs | LR: 0.000292
INFO:tensorflow:Step 550 | Loss: 0.5842 | Spent: 3929.6 secs | LR: 0.000292
INFO:tensorflow:Step 600 | Loss: 0.6052 | Spent: 4309.4 secs | LR: 

INFO:tensorflow:Step 4700 | Loss: 0.4614 | Spent: 47491.7 secs | LR: 0.000236
INFO:tensorflow:Step 4750 | Loss: 0.4202 | Spent: 47902.7 secs | LR: 0.000235
INFO:tensorflow:Step 4800 | Loss: 0.3594 | Spent: 48316.8 secs | LR: 0.000235
INFO:tensorflow:Step 4850 | Loss: 0.3716 | Spent: 48730.1 secs | LR: 0.000234
INFO:tensorflow:Step 4900 | Loss: 0.4767 | Spent: 49148.3 secs | LR: 0.000233
INFO:tensorflow:Step 4950 | Loss: 0.4557 | Spent: 49566.3 secs | LR: 0.000233
INFO:tensorflow:Step 5000 | Loss: 0.6009 | Spent: 49976.2 secs | LR: 0.000232
INFO:tensorflow:Step 5050 | Loss: 0.3194 | Spent: 50392.9 secs | LR: 0.000232
INFO:tensorflow:Step 5100 | Loss: 0.2405 | Spent: 50807.5 secs | LR: 0.000231
INFO:tensorflow:Step 5150 | Loss: 0.3446 | Spent: 51222.8 secs | LR: 0.000230
INFO:tensorflow:Step 5200 | Loss: 0.2864 | Spent: 51637.1 secs | LR: 0.000230
INFO:tensorflow:Step 5250 | Loss: 0.3027 | Spent: 52053.1 secs | LR: 0.000229
INFO:tensorflow:Step 5300 | Loss: 0.3739 | Spent: 52468.3 secs |

AttributeError: 'str' object has no attribute 'fotmat'
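The training loop clips gradients with `tf.clip_by_global_norm`, which rescales all gradients together by `clip_norm / max(global_norm, clip_norm)`. A plain-Python sketch of that rule on made-up gradients, using nested lists in place of tensors:

```python
import math

def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients jointly so their combined L2 norm is at most clip_norm."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = clip_norm / max(global_norm, clip_norm)  # equals 1.0 when already small enough
    return [[g * scale for g in grad] for grad in grads], global_norm

grads = [[3.0, 4.0], [12.0]]   # global norm = sqrt(9 + 16 + 144) = 13
clipped, norm = clip_by_global_norm(grads, clip_norm=6.5)
print(norm)     # 13.0
print(clipped)  # [[1.5, 2.0], [6.0]]
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its size is limited.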