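#### Defining the problem
- What data is available?
- What do we want to predict?
- Do we need to collect more data, or hire people to label the data by hand?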

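The data is news text, anonymized at the character level. It covers 14 candidate categories: finance, lottery, real estate, stocks, home, education, technology, society, fashion, politics, sports, horoscope, games, and entertainment.
The competition data consists of a training set of 200,000 samples, test set A with 50,000 samples, and test set B with 50,000 samples. To prevent contestants from labeling the test set by hand, the organizers anonymized the text at the character level.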

In [None]:
import pandas as pd

In [None]:
# If the file contains bytes that UTF-8 cannot decode, use the unicode_escape encoding instead
df = pd.read_csv('data/train_set.csv', encoding='utf-8', sep='\t', nrows=500)

In [None]:
# Inspect the data
print(df.columns)
print(df.head(3))

In [None]:
# Compute the length of each text
df['text_len'] = df['text'].apply(lambda x: len(x.split(' ')))
# print(df['text_len'].describe())

In [None]:
# Read the test set

df_test = pd.read_csv('data/test_a.csv', encoding='utf-8', sep='\t', nrows=500)

In [None]:
# Collect the length of each test text
df_test['text_len'] = df_test['text'].apply(lambda x: len(x.split(' ')))
# print(df_test['text_len'].describe())

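The raw data has two columns: the first is the label and the second is the anonymized text. A third column, text_len, is added by hand to record the length of each text. There are 200,000 texts in total; the longest has 57921 characters and the shortest has 2. Lengths vary widely, so the longer texts need to be truncated.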
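
Given how widely the lengths vary, one way to pick a truncation length is to read it off the length distribution rather than the maximum. The following is a minimal sketch, assuming whitespace-separated character IDs as in the `text` column; the sample texts and the helper names `length_percentile` and `truncate` are made up for illustration.

```python
import math


def length_percentile(texts, q):
    """Return the length at or below which roughly a fraction q of the texts fall."""
    lengths = sorted(len(t.split()) for t in texts)
    # Nearest-rank percentile: smallest length covering at least q * n texts
    rank = max(1, math.ceil(q * len(lengths)))
    return lengths[rank - 1]


def truncate(text, maxlen):
    """Keep only the first maxlen IDs of a whitespace-separated text."""
    return " ".join(text.split()[:maxlen])


# Made-up texts standing in for the anonymized news articles
sample_texts = ["1 2 3", "4 5", "6 7 8 9 10 11", "12", "13 14 15 16"]
cutoff = length_percentile(sample_texts, 0.8)
print(cutoff)
print(truncate(sample_texts[2], cutoff))
```

A cutoff chosen this way keeps most texts intact while bounding the cost of the few very long ones.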

In [None]:
# Inspect the news categories
print(df['label'].unique())
print(df.shape)

In [None]:
# There are 14 categories in total; plot how the samples are distributed across them
import matplotlib.pyplot as plt
df['label'].value_counts().plot(kind='bar')
ax = plt.gca()  # get the current axes
ax.spines['left'].set_color('red')
ax.spines['bottom'].set_color('red')
plt.title('News class count')
plt.xlabel('category')

In [None]:
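# Keep texts with at least 10 characters; copy so later column assignments do not write to a view
tmp = df.loc[df['text_len'] >= 10].copy()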

In [None]:
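# Try dropping very short texts: the shortest text in the test set has length 14, and only 25% of texts are shorter than 370
# Inspect the class distribution after removing texts shorter than 10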
print(tmp['text_len'].describe())
print(tmp.shape)
import matplotlib.pyplot as plt
tmp['label'].value_counts().plot(kind='bar')
ax = plt.gca()  # get the current axes
ax.spines['left'].set_color('red')
ax.spines['bottom'].set_color('red')
plt.title('News class count')
plt.xlabel('category')

The class distribution is severely imbalanced, and it stays essentially the same after removing texts shorter than 10.
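
The before/after comparison can also be made numerically instead of by reading two bar charts. The following is a sketch with made-up labels; `class_proportions` and `max_proportion_shift` are hypothetical helpers, not part of the notebook.

```python
from collections import Counter


def class_proportions(labels):
    """Map each label to its share of the samples."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in sorted(counts.items())}


def max_proportion_shift(before, after):
    """Largest absolute change in any class share between two distributions."""
    all_labels = set(before) | set(after)
    return max(abs(before.get(l, 0.0) - after.get(l, 0.0)) for l in all_labels)


# Made-up labels: the second list drops one sample of class 2
labels_before = [0, 0, 0, 0, 1, 1, 2, 2]
labels_after = [0, 0, 0, 0, 1, 1, 2]
p_before = class_proportions(labels_before)
p_after = class_proportions(labels_after)
print(p_before)
print(max_proportion_shift(p_before, p_after))
```

A small maximum shift supports the claim that filtering did not change the class balance.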

In [None]:
# Use the texts of length at least 10 as the training set
df = tmp

In [None]:
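# Count how many times each character occurs
# Joining every text into one string made the kernel restart (presumably
# out of memory), so the counts are accumulated one row at a time instead
from collections import Counter
word_count = Counter()
for text in df['text']:
    word_count.update(text.split())
word_count = word_count.most_common()  # (character, count) pairs, most frequent first
# Total number of distinct characters
print(len(word_count))
# ID of the most frequent character
print(word_count[0])
# ID of the least frequent character
print(word_count[-1])
# Use how often each character appears across texts to guess which ones are punctuation
# Using the guessed punctuation, analyze how many sentences each news item contains
# Analyze the most frequent characters in each news category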

In [None]:
# Largest character ID in the training set
df['max'] = df['text'].apply(lambda x: max([int(num) for num in x.split()]))
df['max'].max()

In [None]:
# Smallest character ID in the training set
df['min'] = df['text'].apply(lambda x: min(int(num) for num in x.split()))
df['min'].min()

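## Conclusions from the data analysis
1. News items contain many characters on average, so truncation may be needed
2. The classes are imbalanced, which can seriously hurt model accuracy
3. The largest ID in the training set is 7549; assume 10000 distinct IDs, i.e. max_features=10000
4. Fix the text length at a maximum of 300, i.e. maxlen=300
5. The smallest ID is 0, so before padding either add 1 to every character ID or specify a distinct padding value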
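
The last conclusion can be sketched in a few lines: shifting every ID by one frees 0 for padding. The helper `encode` and its choice to truncate and pad at the end are assumptions for illustration, not necessarily what `DataGenerator` in `utils` does.

```python
def encode(text, maxlen, pad_value=0):
    """Turn a whitespace-separated ID string into a fixed-length list of ints."""
    # Shift by 1 so the original ID 0 becomes 1 and 0 is reserved for padding
    ids = [int(tok) + 1 for tok in text.split()]
    ids = ids[:maxlen]                         # truncate long texts
    ids += [pad_value] * (maxlen - len(ids))   # pad short texts at the end
    return ids


print(encode("0 5 7549", maxlen=5))     # short text: shifted, then padded
print(encode("1 2 3 4 5 6", maxlen=4))  # long text: shifted, then truncated
```

After the shift the largest ID is 7550, so the embedding layer needs an input dimension of at least 7551.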

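##### Choosing how to evaluate the target
- Which metric should be used to evaluate the target?

This project has 14 classes with a severely imbalanced distribution, so f1-score is used as the evaluation metric.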
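
The notebook does not say how the F1 score is averaged across classes. The sketch below assumes macro averaging (the unweighted mean of per-class F1), which weighs rare classes as much as frequent ones. `macro_f1` is a hypothetical helper, not the `F1_score` metric imported from `utils`.

```python
def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples contributes 0 here
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / n_classes


# Made-up labels: the single sample of class 1 is misclassified as class 0
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 2]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)
print(macro_f1(y_true, y_pred, n_classes=3))
```

In this toy case accuracy is 5/6 while macro F1 is noticeably lower, because the one rare class is missed entirely. This is the behavior that makes F1 a better fit than accuracy for imbalanced data.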

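##### Preparing the validation procedure for evaluating models
- Define the training, validation, and test sets. The validation and test sets must be kept separate from the training set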
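
With classes this imbalanced, a plain random split can leave rare classes under-represented in the validation set. A stratified split is one option. The sketch below is a pure-Python illustration with a hypothetical helper `stratified_split` and made-up labels; scikit-learn's `train_test_split` offers the same idea through its `stratify` argument.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, val_fraction, seed=1):
    """Return (train_idx, val_idx), splitting each class by val_fraction."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Made-up imbalanced labels: 20 samples of class 0, 10 of class 1
example_labels = [0] * 20 + [1] * 10
train_idx, val_idx = stratified_split(example_labels, val_fraction=0.3)
val_counts = Counter(example_labels[i] for i in val_idx)
print(len(train_idx), len(val_idx))
print(val_counts)
```

Each class contributes the same fraction of its samples to the validation set, so the validation distribution mirrors the training distribution.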

In [None]:
# Split into training and validation sets
from sklearn.model_selection import train_test_split

x_train = df['text'].values.tolist()
y_train = df['label'].values.tolist()
x_test = df_test['text'].values.tolist()

x_train, x_val, y_train, y_val = train_test_split(x_train,
                                                  y_train,
                                                  test_size=0.3,
                                                  random_state=1)

In [None]:
import utils
from utils import DataGenerator
from utils import DataGeneratorHAN
from utils import F1_score

In [None]:
utils.assign_gpu()

In [None]:
# Maximum sequence length
maxlen = 400
n_classes = 14
# Upper bound on character IDs (number of features)
max_features = 8000
batch_size = 200
epochs = 100
embedding_dims = 128

train_generator = DataGenerator(x_train, y_train,
                                n_classes,
                                batch_size=batch_size,
                                maxlen=maxlen,
                               )
val_generator = DataGenerator(x_val, y_val,
                              n_classes,
                              batch_size=batch_size,
                              maxlen=maxlen,
                             )

##### Vectorizing the data (preprocessing)
- Convert the data into a form a neural network can accept

In [None]:
# Import the required packages
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Embedding, SimpleRNN
from utils import F1_score

# Define a simple RNN model
model = Sequential()
model.add(Embedding(max_features, 100))
model.add(SimpleRNN(32))
model.add(Dense(n_classes, activation='softmax'))
model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
              metrics=[F1_score()])

In [None]:
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from tensorflow import keras

callbacks_list = [
    keras.callbacks.EarlyStopping(
        monitor='val_f1_score',
        patience=2,
        mode='max',  # higher F1 is better, so state the direction explicitly
    ),
    keras.callbacks.ModelCheckpoint(
        filepath='SimpleRNN.h5',
        monitor='val_f1_score',
        save_best_only=True,
        mode='max',
    )]

In [None]:
# Fit the model
history = model.fit(train_generator,
                    epochs=epochs,
                    batch_size=batch_size,
                    validation_data=val_generator,
                    validation_freq=1,
                    callbacks=callbacks_list,
                   )

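##### Developing models
- fasttext serves as the baseline model; on the held-out validation set its f1-score is 0.8972
- The simple RNN reaches a validation f1-score of 0.744, far below fasttext
- Next, try a biLSTM model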

##### Tuning hyperparameters and regularization

In [None]:
# Use a biLSTM to extract features
# Import the required packages
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Embedding, Bidirectional, LSTM

# Define a simple bidirectional RNN model
class biLSTM():
    def __init__(self):
        pass
    
    def get_model(self):
        model = Sequential()
        model.add(Embedding(max_features, 100))
        # model.add(SimpleRNN(32))
        model.add(Bidirectional(LSTM(128)))
        model.add(Dense(n_classes, activation='softmax'))
        model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
                      metrics=[F1_score()])
        return model

In [None]:
callbacks_list = [
    keras.callbacks.EarlyStopping(
        monitor='val_f1_score',
        patience=2,
        mode='max'
    ),
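    keras.callbacks.ModelCheckpoint(
        filepath='biLSTM.h5',
        monitor='val_f1_score',
        save_best_only=True,
        mode='max',  # keep the checkpoint with the highest F1
    )]

# Build the biLSTM first; otherwise fit() would keep training the SimpleRNN model defined earlier
model = biLSTM().get_model()
history = model.fit(train_generator,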
                    epochs=epochs,
                    batch_size=batch_size,
                    validation_data=val_generator,
                    validation_freq=1,
                    callbacks=callbacks_list,
                   )

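The bidirectional LSTM reaches its best f1-score, 0.8367, in epoch 4. That is a large improvement over the simple RNN but still well below fasttext.
From the class distribution seen earlier, we know the classes are highly imbalanced, yet every class has been treated the same so far. Sample weights are therefore introduced here to address the imbalance.
The generator does not yet include sample weights; the next step is to add them and train the network again. To avoid leaking validation information into training, class weights should be computed from the training split only.
In this problem all classes should be equally important, so no class weights are specified.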
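
If class weights were to be used, they would be computed from the training split only, as noted above. The following sketch uses inverse-frequency weights, `n_samples / (n_classes * count)`. The helper name `balanced_class_weights` and the labels are made up, and the notebook does not show whether `DataGenerator` accepts such weights.

```python
from collections import Counter


def balanced_class_weights(train_labels):
    """Weight each class inversely to its frequency in the training labels."""
    counts = Counter(train_labels)
    n_samples = len(train_labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in sorted(counts.items())}


# Made-up training labels: class 0 is six times as frequent as class 2
y_train_example = [0] * 6 + [1] * 3 + [2] * 1
weights = balanced_class_weights(y_train_example)
print(weights)

# Expand to one weight per sample, the form a sample-weighted loss would consume
sample_weights = [weights[label] for label in y_train_example]
print(sum(sample_weights))
```

With this formula the weights sum to the number of samples, so the overall scale of the loss stays comparable to unweighted training.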

In [None]:
# Maximum sequence length
maxlen = 400
n_classes = 14
# Upper bound on character IDs (number of features)
max_features = 8000
batch_size = 1024
epochs = 100
embedding_dims = 128

In [None]:
train_generator = DataGenerator(x_train, y_train,
                                n_classes,
                                batch_size=batch_size,
                                maxlen=maxlen,
                               )
val_generator = DataGenerator(x_val, y_val,
                              n_classes,
                              batch_size=batch_size,
                              maxlen=maxlen,
                             )

In [None]:
callbacks_list = [
    keras.callbacks.EarlyStopping(
        monitor='val_f1_score',
        patience=2,
        mode='max'
    ),
    keras.callbacks.ModelCheckpoint(
        filepath='biLSTM_sample_weights.h5',
        monitor='val_f1_score',
        save_best_only=True,
        mode='max',  # keep the checkpoint with the highest F1
    )]

In [None]:
model = biLSTM().get_model()

In [None]:
history = model.fit(train_generator,
                    epochs=epochs,
                    batch_size=batch_size,
                    validation_data=val_generator,
                    validation_freq=1,
                    callbacks=callbacks_list,
                   )

Introducing sample weights did not improve generalization as much as hoped: the f1-score is only 0.8394. Next, a convolutional network is defined and trained.

In [None]:
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, GlobalMaxPooling1D, Dense
from tensorflow.keras.models import Sequential
class TextCNN():
    def __init__(self):
        pass
    
    def get_model(self):
        model = Sequential()
        model.add(Embedding(max_features, embedding_dims, input_length=maxlen))
        model.add(Conv1D(32, 7, activation='relu'))
        model.add(MaxPooling1D(5))
        model.add(Conv1D(32, 7, activation='relu'))
        model.add(GlobalMaxPooling1D())
        model.add(Dense(n_classes, activation='softmax'))
        return model

model = TextCNN().get_model()

In [None]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=[F1_score()])

In [None]:
history = model.fit(train_generator,
                    epochs=epochs,
                    batch_size=batch_size,
                    validation_data=val_generator,
                    validation_freq=1,
                    callbacks=callbacks_list,
                   )

Epoch 8/10  
136/136 [==============================] - ETA: 0s - loss: 0.2736 - f1_score_1: 0.8938WARNING:tensorflow:Early   stopping conditioned on metric `val_f1_score` which is not available. Available metrics are:   loss,f1_score_1,val_loss,val_f1_score_1  
WARNING:tensorflow:Can save best model only with val_f1_score available, skipping.  
136/136 [==============================] - 43s 313ms/step - loss: 0.2736 - f1_score_1: 0.8938 - val_loss: 0.4793 - val_f1_score_1: 0.8729  
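When a model is fitted a second time in the same notebook, the monitored metric name changes from the configured val_f1_score to val_f1_score_1, so the callbacks can no longer find it.  
TextCNN peaks in epoch 8 and then starts to overfit. Its f1-score of 0.8729 is a modest improvement over the biLSTM and comes very close to fasttext's 0.89.  
Next, try combining an RNN with a CNN.  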

In [None]:
from tensorflow.keras import Input, Model
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Embedding, Dense, SimpleRNN, Lambda, Concatenate, Conv1D, GlobalMaxPooling1D


class TextRCNN(object):
    def __init__(self):
        pass

    def get_model(self):
        input_text = Input((maxlen,))

        embedder = Embedding(max_features, embedding_dims, input_length=maxlen)
        embedding = embedder(input_text)

        x_left = SimpleRNN(128, return_sequences=True)(embedding)
        x_right = SimpleRNN(128, return_sequences=True, go_backwards=True)(embedding)
        x_right = Lambda(lambda x: K.reverse(x, axes=1))(x_right)
        x = Concatenate(axis=2)([x_left, embedding, x_right])

        x = Conv1D(64, kernel_size=1, activation='tanh')(x)
        x = GlobalMaxPooling1D()(x)

        output = Dense(n_classes, activation='softmax')(x)
        model = Model(inputs=input_text, outputs=output)
        return model
    
textRCNN = TextRCNN().get_model()

In [None]:
textRCNN.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=[F1_score()])

In [None]:
callbacks_list = [
    keras.callbacks.EarlyStopping(
        monitor='val_f1_score',
        patience=2,
        mode='max'
    ),
    keras.callbacks.ModelCheckpoint(
        filepath='RCNN.h5',
        monitor='val_f1_score',
        save_best_only=True,
        mode='max',  # keep the checkpoint with the highest F1
    )]

In [None]:
history = textRCNN.fit(train_generator,
                    epochs=epochs,
                    batch_size=batch_size,
                    validation_data=val_generator,
                    validation_freq=1,
                    callbacks=callbacks_list,
                   )

RCNN reaches a best f1-score of 0.90, surpassing fastText. It peaks in epoch 10 and starts to overfit in epoch 11. This model is used to submit the test results.

In [None]:
test_text = df_test.text.values.tolist()
test_generator = DataGenerator(test_text,
                                batch_size=100,
                                maxlen=maxlen,
                               )

In [None]:
result = textRCNN.predict(test_generator)

In [None]:
import numpy as np  # needed for argmax; not imported earlier in the notebook
result = np.argmax(result, axis=1)
result = pd.DataFrame({'label': result})

In [None]:
result.to_csv('rcnn.csv', index=False)

In [None]:
sample = pd.read_csv('./data/test_a_sample_submit.csv')

In [None]:
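# Recombine the training and validation splits so the final model sees all labeled data
all_train_generator = DataGenerator(x_train + x_val, y_train + y_val,
                                    n_classes,
                                    batch_size=batch_size,
                                    maxlen=maxlen,
                                   )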

In [None]:
# Train once from scratch on the entire training set
textRCNN_from_scratch = TextRCNN().get_model()
textRCNN_from_scratch.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=[F1_score()])
history = textRCNN_from_scratch.fit(all_train_generator,
                    epochs=10,
                    batch_size=100,
                   )
result_from_scratch = textRCNN_from_scratch.predict(test_generator)
result_from_scratch = np.argmax(result_from_scratch, axis=1)

In [None]:
result_from_scratch = pd.DataFrame({'label': result_from_scratch})
result_from_scratch.to_csv('rcnn_from_scratch.csv', index=False)

In [None]:
from tensorflow.keras import backend as K
#from tensorflow.python.keras import backend as K
from tensorflow.keras import initializers, regularizers, constraints
from tensorflow.keras.layers import Layer
#from keras.engine.topology import Layer

class Attention(Layer):
    def __init__(self, step_dim,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):
        """
        Keras Layer that implements an Attention mechanism for temporal data.
        Supports Masking.
        Follows the work of Raffel et al. [https://arxiv.org/abs/1512.08756]
        # Input shape
            3D tensor with shape: `(samples, steps, features)`.
        # Output shape
            2D tensor with shape: `(samples, features)`.
        :param kwargs:
        Just put it on top of an RNN Layer (GRU/LSTM/SimpleRNN) with return_sequences=True.
        The dimensions are inferred based on the output shape of the RNN.
        Example:
            # 1
            model.add(LSTM(64, return_sequences=True))
            model.add(Attention())
            # next add a Dense layer (for classification/regression) or whatever...
            # 2
            hidden = LSTM(64, return_sequences=True)(words)
            sentence = Attention()(hidden)
            # next add a Dense layer (for classification/regression) or whatever...
        """
        self.supports_masking = True
        self.init = initializers.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = 0

        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3

        self.W = self.add_weight(shape=(input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        self.features_dim = input_shape[-1]

        if self.bias:
            self.b = self.add_weight(shape=(input_shape[1],),
                                     initializer='zero',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
        else:
            self.b = None

        self.built = True

    def compute_mask(self, input, input_mask=None):
        # do not pass the mask to the next layers
        return None

    def call(self, x, mask=None):
        features_dim = self.features_dim
        step_dim = self.step_dim

        e = K.reshape(K.dot(K.reshape(x, (-1, features_dim)), K.reshape(self.W, (features_dim, 1))), (-1, step_dim))  # e = K.dot(x, self.W)
        if self.bias:
            e += self.b
        e = K.tanh(e)

        a = K.exp(e)
        # apply mask after the exp. will be re-normalized next
        if mask is not None:
            # cast the mask to floatX to avoid float64 upcasting in theano
            a *= K.cast(mask, K.floatx())
        # in some cases especially in the early stages of training the sum may be almost zero
        # and this results in NaN's. A workaround is to add a very small positive number ε to the sum.
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())
        a = K.expand_dims(a)

        c = K.sum(a * x, axis=1)
        return c

    def compute_output_shape(self, input_shape):
        return input_shape[0], self.features_dim

In [None]:
# Maximum sequence length
maxlen = 400
n_classes = 14
# Upper bound on character IDs (number of features)
max_features = 8000
maxlen_text = 16
maxlen_sentence = 25
batch_size = 200
epochs = 100
embedding_dims = 128

train_generator = DataGeneratorHAN(x_train, y_train,
                                   n_classes,
                                   batch_size=batch_size,
                                   maxlen_text=maxlen_text,
                                   maxlen_sentence=maxlen_sentence,
                                  )
val_generator = DataGeneratorHAN(x_val, y_val,
                                 n_classes,
                                 batch_size=batch_size,
                                 maxlen_text=maxlen_text,
                                 maxlen_sentence=maxlen_sentence,
                                )

In [None]:
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Embedding, Dense, Dropout
from tensorflow.keras.layers import Bidirectional, LSTM
from tensorflow.keras.layers import TimeDistributed

class HAN():
    def __init__(self):
        pass

    def get_model(self):
        input_words = Input(shape=(maxlen_sentence,))
        x_words = Embedding(max_features, embedding_dims,
                            input_length=maxlen_sentence)(input_words)
        x_words = Bidirectional(LSTM(128, return_sequences=True))(x_words)
        x_words = Attention(maxlen_sentence)(x_words)
        model_words = Model(input_words, x_words)
        
        # Sentence part
        input_sentences = Input(shape=(maxlen_text, maxlen_sentence))
        x_sentence = TimeDistributed(model_words)(input_sentences)
        x_sentence = Bidirectional(LSTM(128, return_sequences=True))(x_sentence)
        x_sentence = Attention(maxlen_text)(x_sentence)
        
        output = Dense(n_classes, activation='softmax')(x_sentence)
        model = Model(inputs=input_sentences, outputs=output)
        
        return model
        

In [None]:
han = HAN().get_model()

han.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=[F1_score()])

In [None]:
from tensorflow import keras

callbacks_list = [
    keras.callbacks.EarlyStopping(
        monitor='f1_score',
        patience=2,
        mode='max'
    ),
    keras.callbacks.ModelCheckpoint(
        filepath='han_weights.h5',
        monitor='f1_score',
        save_best_only=True,
        mode='max',  # keep the checkpoint with the highest F1
    )]


In [None]:
history = han.fit(train_generator,
                    epochs=epochs,
                    batch_size=batch_size,
                    validation_data=val_generator,
                    validation_freq=1,
                    callbacks=callbacks_list,
                   )
