# Improved LSTM baseline

This kernel is a somewhat improved version of [Keras - Bidirectional LSTM baseline](https://www.kaggle.com/CVxTz/keras-bidirectional-lstm-baseline-lb-0-051) along with some additional documentation of the steps. (NB: this notebook has been re-run on the new test set.)

In [1]:
import sys, os, re, csv, codecs, numpy as np, pandas as pd

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers
from keras.callbacks import EarlyStopping, ModelCheckpoint,TensorBoard
import jieba
import glob
from subprocess import check_output

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


We include the GloVe word vectors in our input files. To include these in your kernel, simple click 'input files' at the top of the notebook, and search 'glove' in the 'datasets' section.

In [11]:
path = '../input/'
comp = 'toxic/'
EMBEDDING_FILE = f'{path}wordvector/wiki.zh.vec'
TRAIN_DATA_FILE = f'{path}{comp}rasa_train.csv'
TEST_DATA_FILE = f'{path}{comp}rasa_test.csv'
tensor_path = "../logs/toxic/"
model_path = "../model/toxic/rasa_weights_base.best.hdf5"
res_file = "../result/toxic/rasa_baseline.csv"

print(check_output(["ls", path + comp]).decode("utf8"))

jieba_userdicts = glob.glob(path + "jieba/*.txt")
for jieba_userdict in jieba_userdicts:
    jieba.load_userdict(jieba_userdict)

baseline.csv
rasa_baseline.csv
rasa_test.csv
rasa_train.csv
rasa_train_sample_submission.csv
rasa_zh_root.json
sample_submission.csv
test.csv
train.csv



Set some basic config parameters:

In [12]:
embed_size = 300 # how big is each word vector
max_features = 20000 # how many unique words to use (i.e num rows in embedding vector)
maxlen = 100 # max number of words in a comment to use

Read in our data and replace missing values:

In [13]:
# 使用原始文件 分出验证集
train_all = pd.read_csv(TRAIN_DATA_FILE)
train_all = train_all.sample(frac=1).reset_index(drop=True)  
lenth_train = train_all.shape[0]
spint = int(0.8*lenth_train)
train = train_all.loc[0:spint,:]
test = train_all.loc[spint:,:]
test.to_csv(TEST_DATA_FILE, index=False)
# test = pd.read_csv(TEST_DATA_FILE)

for i1 in train.index:
    train.loc[i1, "comment_text"]=" ".join(jieba.cut(train.loc[i1, "comment_text"]))
for i1 in test.index:
    test.loc[i1, "comment_text"]=" ".join(jieba.cut(test.loc[i1, "comment_text"]))

list_sentences_train = train["comment_text"].fillna("_na_").values
# list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
list_classes = [i1 for i1 in train.columns]
try:
    list_classes.remove("comment_text")
    list_classes.remove("id")
except Exception as e:
    pass
y = train[list_classes].values
list_sentences_test = test["comment_text"].fillna("_na_").values

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


Standard keras preprocessing, to turn each comment into a list of word indexes of equal length (with truncation or padding as needed).

In [14]:
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
X_te = pad_sequences(list_tokenized_test, maxlen=maxlen)

Read the glove word vectors (space delimited strings) into a dictionary from word->vector.

In [15]:
def get_coefs(word, *arr): return word, np.asarray(arr, dtype='float32')
embeddings_index = dict(get_coefs(*o.strip().split()) for o in open(EMBEDDING_FILE))
for o in list(embeddings_index.keys()):
     if len(embeddings_index[o])!=embed_size:
         del embeddings_index[o]

Use these vectors to create our embedding matrix, with random initialization for words that aren't in GloVe. We'll use the same mean and stdev of embeddings the GloVe has when generating the random init.

In [16]:
all_embs = np.stack(embeddings_index.values())
emb_mean, emb_std = all_embs.mean(), all_embs.std()

In [17]:
word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index)+2)
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= max_features: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector
embedding_matrix[0]=np.zeros((embed_size))

Simple bidirectional LSTM with two fully connected layers. We add some dropout to the LSTM since even 2 epochs is enough to overfit.

In [18]:
inp = Input(shape=(maxlen,))
x = Embedding(nb_words, embed_size, weights=[embedding_matrix], trainable=False)(inp)
x = Bidirectional(LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(x)
x = GlobalMaxPool1D()(x)
x = Dense(50, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(30, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Now we're ready to fit out model! Use `validation_split` when not submitting.

In [19]:
batch_size=32
epochs=100
checkpoint = ModelCheckpoint(model_path, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
tensorb = TensorBoard(log_dir=tensor_path, histogram_freq=10, write_graph=True, write_images=True, embeddings_freq=0, embeddings_layer_names=None, embeddings_metadata=None)
early = EarlyStopping(monitor="val_loss", mode="min", patience=20)
callbacks_list = [checkpoint, early, tensorb] #early

Instructions for updating:
Use the retry module or similar alternatives.


In [20]:
model.fit(X_t, y, batch_size=batch_size, epochs=epochs, validation_split=0.1, callbacks=callbacks_list)

Train on 1332 samples, validate on 148 samples
Epoch 1/100

Epoch 00001: val_loss improved from inf to 0.20992, saving model to ../model/toxic/rasa_weights_base.best.hdf5
Epoch 2/100

Epoch 00002: val_loss improved from 0.20992 to 0.13889, saving model to ../model/toxic/rasa_weights_base.best.hdf5
Epoch 3/100

Epoch 00003: val_loss improved from 0.13889 to 0.13501, saving model to ../model/toxic/rasa_weights_base.best.hdf5
Epoch 4/100

Epoch 00004: val_loss improved from 0.13501 to 0.13278, saving model to ../model/toxic/rasa_weights_base.best.hdf5
Epoch 5/100

Epoch 00005: val_loss improved from 0.13278 to 0.13032, saving model to ../model/toxic/rasa_weights_base.best.hdf5
Epoch 6/100

Epoch 00006: val_loss improved from 0.13032 to 0.12753, saving model to ../model/toxic/rasa_weights_base.best.hdf5
Epoch 7/100

Epoch 00007: val_loss improved from 0.12753 to 0.12372, saving model to ../model/toxic/rasa_weights_base.best.hdf5
Epoch 8/100

Epoch 00008: val_loss improved from 0.12372 to 0

Epoch 35/100

Epoch 00035: val_loss did not improve
Epoch 36/100

Epoch 00036: val_loss did not improve
Epoch 37/100

Epoch 00037: val_loss did not improve
Epoch 38/100

Epoch 00038: val_loss improved from 0.07068 to 0.07014, saving model to ../model/toxic/rasa_weights_base.best.hdf5
Epoch 39/100

Epoch 00039: val_loss did not improve
Epoch 40/100

Epoch 00040: val_loss did not improve
Epoch 41/100

Epoch 00041: val_loss did not improve
Epoch 42/100

Epoch 00042: val_loss improved from 0.07014 to 0.07001, saving model to ../model/toxic/rasa_weights_base.best.hdf5
Epoch 43/100

Epoch 00043: val_loss improved from 0.07001 to 0.06842, saving model to ../model/toxic/rasa_weights_base.best.hdf5
Epoch 44/100

Epoch 00044: val_loss did not improve
Epoch 45/100

Epoch 00045: val_loss did not improve
Epoch 46/100

Epoch 00046: val_loss did not improve
Epoch 47/100

Epoch 00047: val_loss did not improve
Epoch 48/100

Epoch 00048: val_loss did not improve
Epoch 49/100

Epoch 00049: val_loss did n

<keras.callbacks.History at 0x7f55fc458e10>

And finally, get predictions for the test set and prepare a submission CSV:

In [21]:
# y_test = model.predict([X_te], batch_size=1024, verbose=1)
# sample_submission = pd.read_csv(f'{path}{comp}sample_submission.csv')
# sample_submission[list_classes] = y_test
# sample_submission.to_csv('submission.csv', index=False)

In [24]:
model.load_weights(model_path)

y_test = model.predict(X_te)
# predict(self, x, batch_size=32, verbose=0)
# predict_classes(self, x, batch_size=32, verbose=1)
# predict_proba(self, x, batch_size=32, verbose=1)
# evaluate(self, x, y, batch_size=32, verbose=1, sample_weight=None)

sample_submission = pd.read_csv(TEST_DATA_FILE)

sample_submission[list_classes] = y_test
sample_submission["max"]=sample_submission[list_classes].max(axis=1)

for indexs in sample_submission.index:  
    for  i2 in list_classes:  
        if(sample_submission.loc[indexs,i2] ==sample_submission.loc[indexs,"max"]):
            sample_submission.loc[indexs,"predict"]=i2
for i1 in list_classes:
    sample_submission.rename(columns={i1: "pred_" + i1}, inplace=True)
res_file = "../result/toxic/rasa_baseline.csv"
sample_submission.to_csv(res_file, index=False)

In [25]:
# 正确率评估
score = model.evaluate(X_t, y, batch_size=batch_size)
print(score)
test_pd = pd.read_csv(TEST_DATA_FILE)
res_pd = pd.read_csv(res_file)
total_pd=pd.concat([res_pd,test_pd], join='outer', axis=1)
total_right=0
total_num=0
for i1 in list_classes:
    tmp_obj=total_pd[total_pd[i1] == 1]
    sum_num=tmp_obj.shape[0]
    right_num=tmp_obj[tmp_obj["predict"] == i1].shape[0]
    total_right += right_num
    total_num += sum_num
    try:
        print("%s, sum_num: %d, right_num: %d, accuracy: %.3f" % (i1,sum_num,right_num,right_num/sum_num))
    except Exception as e:
        print("%s, sum_num: %d, right_num: %d, error: %s" % (i1,sum_num,right_num,str(e)))
        
print("total data, total_num: %d, total_right: %d, accuracy: %.3f" % (total_num,total_right,total_right/total_num))

[0.026620823446963284, 0.9922522580301439]
weather, sum_num: 23, right_num: 16, accuracy: 0.696
p2p, sum_num: 8, right_num: 2, accuracy: 0.250
navigation, sum_num: 18, right_num: 13, accuracy: 0.722
travel, sum_num: 35, right_num: 25, accuracy: 0.714
memorandum, sum_num: 5, right_num: 2, accuracy: 0.400
new_schedule, sum_num: 25, right_num: 23, accuracy: 0.920
communication, sum_num: 14, right_num: 13, accuracy: 0.929
choose, sum_num: 2, right_num: 0, accuracy: 0.000
others, sum_num: 21, right_num: 11, accuracy: 0.524
order_food, sum_num: 52, right_num: 28, accuracy: 0.538
news, sum_num: 16, right_num: 10, accuracy: 0.625
medical_consultation, sum_num: 5, right_num: 1, accuracy: 0.200
hospital_register, sum_num: 6, right_num: 6, accuracy: 1.000
stock, sum_num: 39, right_num: 33, accuracy: 0.846
express, sum_num: 5, right_num: 3, accuracy: 0.600
movie, sum_num: 1, right_num: 0, accuracy: 0.000
joke, sum_num: 5, right_num: 5, accuracy: 1.000
history_today, sum_num: 2, right_num: 0, accur