# Text Classification Using Attention and Positional Embeddings

Most natural language processing tasks are now dominated by the `Transformer` architecture. Transformers were introduced in the paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762), which uses a simple mechanism called `Neural Attention` as one of its building blocks. As the title suggests, the architecture does not require any recurrent layers.

In [None]:
import numpy as np  # array and matrix operations
import pandas as pd  # data analysis and manipulation tools
from sklearn.datasets import fetch_20newsgroups  # loader for the 20 Newsgroups text dataset
from sklearn.model_selection import train_test_split  # random split into training and validation sets

import tensorflow as tf
from tensorflow import keras  # high-level neural network API
from tensorflow.keras import layers as L  # predefined layer building blocks
from tensorflow.keras.preprocessing import sequence  # sequence padding utilities
from tensorflow.keras.preprocessing.text import Tokenizer  # converts texts into sequences of integer ids

We will use the 20 Newsgroups data in this notebook, which ships as a standard dataset in the `scikit-learn` package.

In [None]:
dataset = fetch_20newsgroups(subset='all')  # load the full 20 Newsgroups dataset (train and test portions)

X = pd.Series(dataset['data'])  # raw document texts
y = pd.Series(dataset['target'])  # integer class labels
# Hold out 10% of the data for validation, stratified by label; random_state=19 fixes the shuffle seed
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.1, stratify=y, random_state=19)
y_train = pd.get_dummies(y_train)  # one-hot encode the training labels
y_valid = pd.get_dummies(y_valid)  # one-hot encode the validation labels

The concept of `Neural Attention` is fairly simple: not all input information seen by a model is equally important to the task at hand. The idea shows up in other places as well, for example max pooling in CNNs, but the kind of attention we are looking for should be `context aware`.

The attention mechanism lets the model focus on the relevant parts of the input while producing each output. Self-attention lets the inputs interact with each other, i.e. it computes attention scores between one input and all the other inputs.
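
As a rough illustration of the paragraph above, here is a minimal NumPy sketch of scaled dot-product self-attention. It leaves out the learned query/key/value projections that a real layer would apply, and the names `softmax`, `self_attention`, and `tokens` are made up for this example rather than taken from the notebook.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Scaled dot-product self-attention without learned projections.

    x has shape (seq_len, dim); queries, keys, and values are all x itself.
    """
    dim = x.shape[-1]
    scores = x @ x.T / np.sqrt(dim)     # (seq_len, seq_len) pairwise similarities
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ x, weights         # context-aware vectors and the attention map

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))        # 4 toy token vectors of size 8
context, weights = self_attention(tokens)
print(context.shape, weights.shape)
```

Each row of `weights` says how much one token attends to every token in the sequence, so each output vector is a context-dependent mixture of the inputs.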

In the paper, the authors proposed another type of attention mechanism called multi-head attention: the output space of the self-attention layer is factored into a set of independent subspaces that are learned separately, where each subspace is called a "head".
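
The factoring into heads can be sketched in NumPy as follows. This follows the paper's convention of splitting the model dimension evenly across heads, which is an assumption of the sketch; the Keras `MultiHeadAttention` layer used later sizes each head through its `key_dim` argument instead. The weight matrices are random stand-ins for learned parameters, and all names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Multi-head self-attention over x of shape (seq_len, dim)."""
    seq_len, dim = x.shape
    head_dim = dim // num_heads  # assumes dim is divisible by num_heads

    def split_heads(t):
        # (seq_len, dim) -> (num_heads, seq_len, head_dim)
        return t.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                     # (heads, seq, head_dim)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, dim) # merge heads back
    return concat @ w_o                                     # final output projection

rng = np.random.default_rng(0)
seq_len, dim, num_heads = 5, 8, 2
x = rng.normal(size=(seq_len, dim))
w_q, w_k, w_v, w_o = (rng.normal(size=(dim, dim)) for _ in range(4))
out = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)
```

Because every head has its own slice of the projected queries, keys, and values, each one can learn a different attention pattern over the same sequence.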

A learnable dense projection follows the multi-head attention, which enables the layer to actually learn something, as opposed to being a purely stateless transformation.
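
A minimal NumPy sketch of such a projection sub-layer, assuming a two-layer feed-forward network with a ReLU in between, a residual connection, and layer normalization without learned scale and offset. The function names and the random weights are illustrative only.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each vector to zero mean and unit variance (no learned scale/offset)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward_block(x, w1, b1, w2, b2):
    """Dense(ReLU) -> Dense, added back to the input and then normalized."""
    hidden = np.maximum(0.0, x @ w1 + b1)  # expand to dense_dim with ReLU
    projected = hidden @ w2 + b2           # project back to embed_dim
    return layer_norm(x + projected)       # residual connection, then normalization

rng = np.random.default_rng(0)
embed_dim, dense_dim = 8, 4                # toy sizes
x = rng.normal(size=(5, embed_dim))
w1, b1 = rng.normal(size=(embed_dim, dense_dim)), np.zeros(dense_dim)
w2, b2 = rng.normal(size=(dense_dim, embed_dim)), np.zeros(embed_dim)
out = feed_forward_block(x, w1, b1, w2, b2)
print(out.shape)
```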



In [None]:
class TransformerBlock(L.Layer):  # Transformer encoder block: self-attention followed by a dense projection
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim  # size of the input and output vectors; must match the embedding size
        self.dense_dim = dense_dim  # width of the hidden dense layer
        self.num_heads = num_heads  # number of attention heads
        self.attention = L.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)  # multi-head attention layer
        self.dense_proj = keras.Sequential([L.Dense(dense_dim, activation='relu'), L.Dense(embed_dim)])  # dense projection back to embed_dim
        self.layernorm1 = L.LayerNormalization()  # normalization after the attention sub-layer
        self.layernorm2 = L.LayerNormalization()  # normalization after the projection sub-layer

    def call(self, inputs, mask=None):
        if mask is not None:
            mask = mask[:, tf.newaxis, :]  # (batch, seq) -> (batch, 1, seq) so it broadcasts over query positions
        attention_output = self.attention(inputs, inputs, attention_mask=mask)  # self-attention: query and value are both the inputs
        proj_input = self.layernorm1(inputs + attention_output)  # residual connection, then normalization
        proj_output = self.dense_proj(proj_input)  # dense projection
        return self.layernorm2(proj_input + proj_output)  # second residual connection and normalization

    def get_config(self):
        config = super().get_config()  # start from the base layer's config
        config.update({  # add the constructor arguments of this custom layer
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim
        })
        return config  # required so a saved model can re-create TransformerBlock with the same arguments

The idea behind positional encoding is fairly simple as well: to give the model access to token order information, we add the token's position in the sentence to each word embedding.

Thus, each input word embedding has two components: the usual token vector, which represents the token independent of any specific context, and a position vector, which represents the position of the token in the current sequence.
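
The two components can be illustrated with plain NumPy lookup tables. The sizes below are toy values rather than the ones used in this notebook, and the random tables stand in for embeddings that would normally be learned.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, seq_length, embed_dim = 50, 6, 4  # toy sizes for illustration

token_table = rng.normal(size=(vocab_size, embed_dim))     # one row per token id
position_table = rng.normal(size=(seq_length, embed_dim))  # one row per position

token_ids = np.array([[7, 3, 7, 0, 0, 0]])  # shape (batch=1, seq_length)
positions = np.arange(token_ids.shape[-1])  # 0, 1, ..., seq_length - 1

# Token lookup is (1, 6, 4); position lookup is (6, 4) and broadcasts over the batch.
embedded = token_table[token_ids] + position_table[positions]
print(embedded.shape)  # (1, 6, 4)
```

Token id 7 appears at positions 0 and 2, yet the two resulting vectors differ, because each one also carries its position vector.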

In [None]:
class PositionalEmbedding(L.Layer):
    def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = L.Embedding(input_dim, output_dim)  # maps each token id to a dense vector of size output_dim
        self.position_embeddings = L.Embedding(sequence_length, output_dim)  # maps each position index to a vector of the same size
        self.sequence_length = sequence_length  # maximum sequence length
        self.input_dim = input_dim  # vocabulary size
        self.output_dim = output_dim  # embedding size

    def call(self, inputs):
        length = tf.shape(inputs)[-1]  # number of tokens in each input sequence
        positions = tf.range(start=0, limit=length, delta=1)  # position indices 0 .. length - 1
        embedded_tokens = self.token_embeddings(inputs)  # token embeddings
        embedded_positions = self.position_embeddings(positions)  # position embeddings
        return embedded_tokens + embedded_positions  # sum of the two, broadcast over the batch

    def get_config(self):
        config = super().get_config()  # start from the base layer's config
        config.update({  # add the constructor arguments of this custom layer
            "output_dim": self.output_dim,
            "sequence_length": self.sequence_length,
            "input_dim": self.input_dim,
        })
        return config  # lets a saved model re-create PositionalEmbedding with the same arguments

Here we define some constants to parameterize the model.

In [None]:
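vocab_size = 10_000  # number of words kept in the vocabulary
embed_dim = 256  # size of each word embedding vector
num_heads = 2  # number of attention heads
dense_dim = 32  # width of the hidden dense layer in the Transformer block
seq_length = 256  # length every sequence is padded or truncated to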

The input texts are tokenized and padded to a uniform sequence length.
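
To make the two steps concrete, here is a simplified pure-Python sketch of what tokenizing and padding do. It is not the Keras implementation: it splits on whitespace only, and it assumes id 0 is reserved for padding and id 1 for out-of-vocabulary words. Padding and truncation happen at the front of each sequence, which matches the default behaviour of Keras `pad_sequences`. The helper names are hypothetical.

```python
from collections import Counter

def build_vocab(texts, num_words):
    """Map the most frequent words to ids; 0 is padding, 1 is out-of-vocabulary."""
    counts = Counter(word for text in texts for word in text.lower().split())
    vocab = {"<unk>": 1}
    for word, _ in counts.most_common(num_words - 2):
        vocab[word] = len(vocab) + 1
    return vocab

def texts_to_sequences(texts, vocab):
    """Replace each word by its id, falling back to the out-of-vocabulary id."""
    return [[vocab.get(w, vocab["<unk>"]) for w in t.lower().split()] for t in texts]

def pad_sequences(seqs, maxlen):
    """Keep the last `maxlen` ids of each sequence and left-pad with zeros."""
    padded = []
    for seq in seqs:
        seq = seq[-maxlen:]
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded

train_texts = ["the cat sat", "the dog sat on the mat"]
vocab = build_vocab(train_texts, num_words=10)
padded = pad_sequences(texts_to_sequences(train_texts + ["a bird"], vocab), maxlen=4)
print(padded)
```

The vocabulary is built from the training texts only, so the unseen words in `"a bird"` all map to the out-of-vocabulary id, just as validation texts do in the cell below.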

In [None]:
tokenizer = Tokenizer(num_words=vocab_size, oov_token='<unw>')  # keep the most frequent words; map all others to the OOV token
tokenizer.fit_on_texts(X_train)  # build the word index from the training texts only
X_train = tokenizer.texts_to_sequences(X_train)  # convert training texts to sequences of integer ids
X_train = sequence.pad_sequences(X_train, maxlen=seq_length)  # pad or truncate every sequence to seq_length
X_valid = tokenizer.texts_to_sequences(X_valid)  # convert validation texts to sequences of integer ids
X_valid = sequence.pad_sequences(X_valid, maxlen=seq_length)  # pad or truncate every sequence to seq_length

**Defining the model**

The model architecture is fairly simple:
* Input Layer
* Positional Embeddings
* Transformer Block
* Pooling
* Dropout
* Output Layer
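
To show how tensor shapes change through the last few stages of this stack, here is a small NumPy trace of pooling followed by the softmax output layer. The random `encoded` array stands in for the Transformer block output, dropout is omitted because it is inactive at inference time, and all sizes other than the 20 classes are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, embed_dim, num_classes = 2, 5, 8, 20

# Stand-in for the Transformer block output: one vector per token.
encoded = rng.normal(size=(batch, seq_len, embed_dim))

# Global max pooling: take the maximum over the sequence axis for every feature.
pooled = encoded.max(axis=1)  # (batch, embed_dim)

# Output layer: a dense projection followed by a softmax over the classes.
w = rng.normal(size=(embed_dim, num_classes))
b = np.zeros(num_classes)
logits = pooled @ w + b
logits = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

print(pooled.shape, probs.shape)
```

Pooling collapses the variable-length sequence axis, so the classifier sees one fixed-size vector per document.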

In [None]:
inputs = keras.Input(shape=(None, ), dtype="int64")  # variable-length sequences of integer token ids
x = PositionalEmbedding(seq_length, vocab_size, embed_dim)(inputs)  # token embeddings plus learned position embeddings
x = TransformerBlock(embed_dim, dense_dim, num_heads)(x)  # a single Transformer block
x = L.GlobalMaxPooling1D()(x)  # maximum over the sequence axis for each feature
x = L.Dropout(0.5)(x)  # randomly zero 50% of the units during training
outputs = L.Dense(20, activation='softmax')(x)  # output layer: softmax over the 20 classes

model = keras.Model(inputs, outputs)  # build the model from its input and output tensors
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])  # Adam optimizer, categorical cross-entropy loss

In [None]:
model.summary()  # print the layers, output shapes, and parameter counts

In [None]:
es = keras.callbacks.EarlyStopping(verbose=1, patience=5, restore_best_weights=True)  # stop after 5 epochs without improvement and restore the best weights
rlp = keras.callbacks.ReduceLROnPlateau(patience=3, verbose=1)  # lower the learning rate after 3 epochs without improvement

In [None]:
history = model.fit(X_train, y_train, validation_data=(X_valid, y_valid),  # train the model
    callbacks=[es, rlp], epochs=100  # at most 100 epochs, with early stopping and learning-rate reduction
)

In [None]:
import matplotlib.pyplot as plt  # plotting
import seaborn as sns  # statistical plotting library built on matplotlib
sns.set_style('darkgrid')  # grey background with grid lines

history = pd.DataFrame(history.history)  # per-epoch metrics as a DataFrame
fig, ax = plt.subplots(2, 1, figsize=(20, 12))  # a 20x12 figure with two stacked axes
fig.suptitle('Learning Curve', fontsize=24)  # overall figure title
history[['loss', 'val_loss']].plot(ax=ax[0])  # training and validation loss on the first axes
history[['accuracy', 'val_accuracy']].plot(ax=ax[1])  # training and validation accuracy on the second axes
ax[0].set_title('Loss', fontsize=18)  # title of the loss plot
ax[1].set_title('Accuracy', fontsize=18);  # title of the accuracy plot