### Keras with GloVe Word Embeddings 

We will now try some more advanced techniques to see whether they improve our results. We will use GloVe pre-trained word embeddings and a neural network model built with Keras.

We will load a pickled version of the Common Crawl GloVe model (840B tokens, 2.2M vocab, cased, 300d vectors). Loading the pickled version is much faster than loading the entire 2.03 GB GloVe model.
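
The pickle itself is not built in this notebook. As a minimal sketch, such a file could be produced once from the plain-text GloVe release, assuming the usual format of one token per line followed by its vector components separated by spaces. The helper names `parse_glove_line` and `build_glove_dict` and the two-line sample are hypothetical.

```python
import io
import pickle
import numpy as np

def parse_glove_line(line, dim):
    # Split from the right so that a token containing spaces stays intact
    parts = line.rstrip('\n').rsplit(' ', dim)
    word = parts[0]
    vector = np.array([float(x) for x in parts[1:]], dtype='float32')
    return word, vector

def build_glove_dict(lines, dim):
    # Map each token to its embedding vector, skipping empty lines
    return dict(parse_glove_line(line, dim) for line in lines if line.strip())

# Tiny stand-in for the real text file (3 dimensions instead of 300)
sample = io.StringIO("the 0.1 0.2 0.3\nThe 0.4 0.5 0.6\n")
glove_demo = build_glove_dict(sample, dim=3)

# Round-trip through pickle, as one would do once for the full model
restored = pickle.loads(pickle.dumps(glove_demo))
print(sorted(restored), restored['The'])
```

Because the model is cased, `the` and `The` are separate keys with separate vectors.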

In [1]:
# import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn import preprocessing
from keras.preprocessing import text
from keras.preprocessing import sequence
from keras.models import Model
from keras.callbacks import ModelCheckpoint
from keras.optimizers import SGD
from keras.layers import Input
from keras.layers import Dense
from keras.layers import Embedding
from keras.layers import add
from keras.layers import Concatenate
from keras.layers import concatenate
from keras.layers import Bidirectional
from keras.layers import GlobalMaxPooling1D
from keras.layers import GlobalAveragePooling1D
from keras.layers import LSTM
import re
import joblib
import pickle
import torch
from time import time
import warnings
warnings.filterwarnings("ignore")
np.random.seed(123)

Using TensorFlow backend.


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
# import data downloaded from Kaggle
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/train.csv')

In [0]:
# selecting only relevant question features
question_columns = ['question_title', 'question_body','category']
features = df[question_columns]

# selecting only the question target columns we will be working with
target_columns = ['question_asker_intent_understanding', 'question_conversational', 'question_expect_short_answer'
                  , 'question_fact_seeking', 'question_has_commonly_accepted_answer', 'question_multi_intent'
                  , 'question_opinion_seeking', 'question_well_written']
target = df[target_columns]

In [5]:
# split data to avoid leakage
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=123)

print(f'Features train shape:{X_train.shape}, Target train shape:{y_train.shape}')
print(f'Features test shape:{X_test.shape}, Target test shape:{y_test.shape}')

Features train shape:(4863, 3), Target train shape:(4863, 8)
Features test shape:(1216, 3), Target test shape:(1216, 8)


In [6]:
# transform continuous values into binary categorical ones
for column in y_train:
    y_train[column] = np.where(y_train[column] >= .5, 1, 0)

# check it worked fine
y_train.head()

Unnamed: 0,question_asker_intent_understanding,question_conversational,question_expect_short_answer,question_fact_seeking,question_has_commonly_accepted_answer,question_multi_intent,question_opinion_seeking,question_well_written
4267,1,0,1,0,1,0,1,1
2641,1,0,1,1,0,0,1,1
838,1,0,1,1,1,0,1,1
553,1,0,0,1,1,0,0,1
5428,1,0,1,0,1,0,1,1


In [0]:
# do the same for test data
for column in y_test:
    y_test[column] = np.where(y_test[column] >= .5, 1, 0)
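
The column-by-column loops above can also be written as one vectorized comparison that returns a new DataFrame instead of assigning into an existing one. The sketch below uses a small hypothetical frame with made-up scores rather than the Kaggle data.

```python
import pandas as pd

# Hypothetical scores standing in for two of the target columns
scores = pd.DataFrame({'question_conversational': [0.0, 0.5, 0.33],
                       'question_well_written': [0.89, 0.44, 1.0]})

# Same rule as the loops: values >= 0.5 become 1, everything else 0
binary = (scores >= 0.5).astype(int)
print(binary)
```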

In [8]:
# load glove model from pickle
t = time()
with open('/content/drive/My Drive/Colab Notebooks/glove.840B.300d.pkl', 'rb') as f:
    glove = pickle.load(f)
print(time()-t)

44.46119260787964


In [9]:
# check we have a dictionary of word embeddings with close to 2.2M words
type(glove), len(glove)

(dict, 2196008)

Great, our GloVe dictionary is ready, and it took only about 44 seconds to load. Because this GloVe model is cased and has such a large vocabulary, we will use our text without removing stopwords or lowercasing. I'll write another helper function that cleans the text to this extent only.

In [0]:
# define function to clean text for glove
def clean_4glove(text):

    '''Takes a string of text and applies regex to clean it, replacing punctuation, numbers and
single characters with spaces and collapsing multiple spaces. Keeps the original case and the
stopwords. Returns cleaned text.'''
    
    # Replace every non-letter character (punctuation, digits) with a space
    text = re.sub('[^a-zA-Z]', ' ', text)
    
    # Remove sequences of digits (none remain after the step above; kept as a safeguard)
    text = re.sub(r'\d+', ' ', text)

    # Remove single characters surrounded by whitespace
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)

    # Collapse multiple spaces into one
    text = re.sub(r'\s+', ' ', text)
    
    return text
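
To see what the cleaning keeps and drops, here is a quick check on a made-up sentence. The function is repeated so the snippet runs on its own, and the input string is hypothetical.

```python
import re

def clean_4glove(text):
    # Same four substitutions as the helper above
    text = re.sub('[^a-zA-Z]', ' ', text)
    text = re.sub(r'\d+', ' ', text)
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text

cleaned = clean_4glove("I've been shooting 35mm film!")
print(repr(cleaned))  # 'I ve been shooting mm film '
```

Case is preserved, the apostrophe and digits become spaces, and a trailing space can remain. The leading `I` survives because the single-character rule only matches a letter with whitespace on both sides.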

In [0]:
# list text columns
columns = ['question_title','question_body']

# clean train text data
for col in columns:
    X_train[col] = X_train[col].apply(lambda x: clean_4glove(x))

# clean test text data
for col in columns:
    X_test[col] = X_test[col].apply(lambda x: clean_4glove(x))

In [12]:
X_train.head()

Unnamed: 0,question_title,question_body,category
4267,Create Valid XML from XSD Loaded at Runtime wi...,Possible Duplicate Programmatically Create XM...,STACKOVERFLOW
2641,When enumerating motivations is it correct to ...,Suppose am enumerating reasons not to fly Is i...,CULTURE
838,hard self collision make particles occupy space,By default Blender particles don have any spat...,TECHNOLOGY
553,Removing unwanted color from W image aperture,I ve been shooting ilford color process amp fi...,LIFE_ARTS
5428,Binary tree traversal without using recursion,Can anyone help to to create binary tree and d...,STACKOVERFLOW


In [0]:
# instantiate and fit keras tokenizer to text 
tokenizer = text.Tokenizer(num_words=10000, filters='',lower=False)
tokenizer.fit_on_texts(list(X_train['question_title'])+list(X_train['question_body']))
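
A point worth keeping in mind for the next cells: `num_words` limits which indices appear in the generated sequences, but `word_index` still lists every word seen during fitting. The plain-Python sketch below imitates that behaviour on three made-up titles. It does not call Keras, and `build_word_index` and `to_sequences` are hypothetical helpers.

```python
from collections import Counter

def build_word_index(texts):
    # Rank words by frequency; the most frequent word gets index 1 (0 is left for padding)
    counts = Counter(word for t in texts for word in t.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words=None):
    # Convert each text to a list of indices, dropping indices >= num_words
    sequences = []
    for t in texts:
        ids = [word_index[w] for w in t.split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences

texts = ["Binary tree traversal", "Binary tree", "Binary search"]
word_index = build_word_index(texts)
sequences = to_sequences(texts, word_index, num_words=3)
print(len(word_index), sequences)
```

The full index has four entries, yet only indices 1 and 2 ever show up in the sequences.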

In [0]:
# define an embedding matrix
glove_matrix = np.zeros((len(tokenizer.word_index) + 1, 300))

In [0]:
# get embeddings from GloVe and list out of vocabulary words
unknown_words = []
    
for word, i in tokenizer.word_index.items():
    try:
        glove_matrix[i] = glove[word]
    except KeyError:
        unknown_words.append(word)

In [16]:
# check how many words we don't have embedding for
len(unknown_words)

4910

In [17]:
# have a look at some of them
unknown_words[:20]

['RippleShaderProgram',
 'Appium',
 'jsonobject',
 'aStack',
 'ShaderProgramConstants',
 'Bodycopy',
 'appium',
 'myStaticIntStack',
 'pGLState',
 'Explosionfilters',
 'setaf',
 'setab',
 'AndroidRuntime',
 'instrumentsettingsid',
 'stackSize',
 'storedElements',
 'brmfc',
 'ComboBoxItem',
 'PSPlayer',
 'AppiumForWindows']

Looks like a good part of our out-of-vocabulary words are code identifiers such as AndroidRuntime or stackSize, where several words are joined without a space. This would be hard to fix unless we did it manually, which would be too time-consuming for our project.
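
Many of the unknown words listed above are camelCase identifiers. One heuristic that might recover part of them without manual work, and which is not used in this notebook, is to split an identifier at its case boundaries and average the vectors of the pieces that do have an embedding. The sketch below uses a toy two-dimensional dictionary, and `split_identifier` and `embed_identifier` are hypothetical helpers.

```python
import re
import numpy as np

def split_identifier(word):
    # 'stackSize' -> ['stack', 'Size'], 'PSPlayer' -> ['PS', 'Player']
    return re.findall(r'[A-Z]+(?![a-z])|[A-Z]?[a-z]+', word)

def embed_identifier(word, embeddings):
    # Average the vectors of the known pieces; None if no piece is known
    parts = [embeddings[p] for p in split_identifier(word) if p in embeddings]
    if not parts:
        return None
    return np.mean(parts, axis=0)

# Toy embeddings standing in for the GloVe dictionary
toy = {'stack': np.array([1.0, 0.0]), 'Size': np.array([0.0, 1.0])}

print(split_identifier('AndroidRuntime'))
print(embed_identifier('stackSize', toy))
print(embed_identifier('brmfc', toy))
```

Words such as `brmfc` have no recognisable pieces and would still keep their all-zero row.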

In [0]:
embedding_matrix = np.concatenate([glove_matrix], axis=-1)

In [19]:
len(embedding_matrix)

29521

In [20]:
embedding_matrix.shape

(29521, 300)

In [0]:
# tokenize train text features as separate inputs
Xt_train = tokenizer.texts_to_sequences(X_train['question_title'])
Xb_train = tokenizer.texts_to_sequences(X_train['question_body'])

In [0]:
# tokenize test text features as separate inputs
Xt_test = tokenizer.texts_to_sequences(X_test['question_title'])
Xb_test = tokenizer.texts_to_sequences(X_test['question_body'])

In [23]:
# check length and type
len(Xt_train), type(Xt_train)

(4863, list)

In [24]:
# check one sample
Xt_train[0]

[1108, 1392, 26, 7971, 29, 3771, 162, 3176, 1419]

In [0]:
# pad train data into appropriate shape
Xt_train = sequence.pad_sequences(Xt_train, maxlen=300)
Xb_train = sequence.pad_sequences(Xb_train, maxlen=300)

In [0]:
# pad test data into appropriate shape
Xt_test = sequence.pad_sequences(Xt_test, maxlen=300)
Xb_test = sequence.pad_sequences(Xb_test, maxlen=300)
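
With its default arguments, `pad_sequences` pads with zeros at the front and, for sequences longer than `maxlen`, keeps only the last `maxlen` tokens. The NumPy sketch below imitates that default behaviour with a hypothetical helper, `pad_pre`, on two made-up sequences.

```python
import numpy as np

def pad_pre(seqs, maxlen):
    # Zeros go in front; long sequences lose their first tokens
    out = np.zeros((len(seqs), maxlen), dtype='int32')
    for row, seq in enumerate(seqs):
        trimmed = seq[-maxlen:]
        if trimmed:
            out[row, -len(trimmed):] = trimmed
    return out

padded = pad_pre([[1, 2, 3], [1, 2, 3, 4, 5, 6, 7]], maxlen=5)
print(padded)
```

For this notebook it means short titles are mostly zeros, while question bodies longer than 300 tokens lose their opening words.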

In [0]:
# instantiate and fit label encoder
categories = preprocessing.LabelEncoder()
categories.fit(list(X_train['category'].values))

# encode train categorical feature
X_train['category'] = categories.transform(list(X_train['category'].values))

# encode test categorical feature
X_test['category'] = categories.transform(list(X_test['category'].values))
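
`LabelEncoder` numbers the classes in sorted order, so the model later receives the category as a single integer. One alternative, not used in this notebook, is a one-hot encoding that avoids implying an order between categories. The NumPy sketch below shows both encodings on a handful of category names taken from the sample rows above.

```python
import numpy as np

sample = np.array(['STACKOVERFLOW', 'CULTURE', 'TECHNOLOGY', 'LIFE_ARTS', 'STACKOVERFLOW'])

# Sorted unique classes and the integer code of each sample (what a label encoder produces)
classes, codes = np.unique(sample, return_inverse=True)

# One row per sample, one column per class
one_hot = np.eye(len(classes))[codes]

print(classes, codes)
print(one_hot)
```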

In [0]:
# make categorical feature as separate input 
X_train_cat = X_train['category']
X_test_cat = X_test['category']

In [29]:
# check all input shapes
Xt_train.shape, Xb_train.shape, X_train_cat.shape

((4863, 300), (4863, 300), (4863,))

Let's build and try out a somewhat simple neural network with Keras. We pass the categorical feature through a Dense layer with a sigmoid activation. The title and the question body each go through an embedding layer that encodes the tokens with the frozen GloVe vectors, and the two embedded sequences are concatenated along the time axis. The result feeds a Long Short-Term Memory (LSTM) layer, which is commonly used in natural language processing models, wrapped in a Bidirectional layer. We follow the LSTM with a Global Max Pooling and a Global Average Pooling, concatenate both, and add two hidden Dense layers: one with a relu activation, whose output is summed with its input, and one with a sigmoid activation. Finally, we sum the text and the categorical outputs and pass them through one last Dense layer with 8 units, one per target, and a sigmoid activation that suits our multi-label problem.

Let's construct it and see what it can do!

In [0]:
def build_model(embedding_matrix):
    
    title = Input(shape=(300,))
    question_body = Input(shape=(300,))
    category = Input(shape=(1,))
    
    category1 = Dense(30, activation='sigmoid')(category)
    
    title_embb = Embedding(*embedding_matrix.shape, weights=[embedding_matrix], trainable=False)(title)
    body_embb = Embedding(*embedding_matrix.shape, weights=[embedding_matrix], trainable=False)(question_body)
    concat = Concatenate(axis=1)
    embb_final = concat([title_embb, body_embb])
    
    x1 = Bidirectional(LSTM(60, return_sequences=True))(embb_final)
    hidden1 = concatenate([GlobalMaxPooling1D()(x1),GlobalAveragePooling1D()(x1)])
    hidden1 = add([hidden1, Dense(240, activation='relu')(hidden1)])
    hidden1 = Dense(30, activation='sigmoid')(hidden1)
    
    final = add([hidden1,category1])
    result = Dense(8, activation='sigmoid')(final)
    
    model = Model(inputs=[title,question_body,category], outputs=result)
    model._name = 'mymodel'
    model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer='adam')
    model.summary()
    return model
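
The layer widths in `build_model` are tied together: the `add` layers only work when both inputs have the same size. The NumPy sketch below follows the shapes with random numbers standing in for the bidirectional LSTM output, assuming the wrapper's default behaviour of concatenating the forward and backward outputs.

```python
import numpy as np

batch, steps, units = 2, 600, 60   # 600 = 300 title tokens + 300 body tokens

# Stand-in for the BiLSTM output: forward and backward states side by side
x1 = np.random.rand(batch, steps, 2 * units)

max_pool = x1.max(axis=1)          # global max pooling over time
avg_pool = x1.mean(axis=1)         # global average pooling over time
hidden = np.concatenate([max_pool, avg_pool], axis=-1)

print(hidden.shape)  # (2, 240)
```

The pooled vector has 2 × 2 × 60 = 240 features, which is why the relu layer summed with it must be `Dense(240)`. The same rule explains why the final `add` needs both branches to be 30 wide.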

In [0]:
# add checkpoint
filepath='weights-improvement-{epoch:02d}-{val_acc:.2f}.hdf5'
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]
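
The checkpoint file name is a format template filled in with the epoch number and the logged metrics. The sketch below shows the substitution with made-up metric values.

```python
filepath = 'weights-improvement-{epoch:02d}-{val_acc:.2f}.hdf5'

# Hypothetical metrics for one epoch
logs = {'val_acc': 0.8391, 'val_loss': 0.35}

name = filepath.format(epoch=3, **logs)
print(name)  # weights-improvement-03-0.84.hdf5
```

The placeholder has to match the metric key exactly. More recent Keras releases log this metric as `val_accuracy`, in which case both the template and `monitor` would need that name.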

In [0]:
model = build_model(embedding_matrix)
model.fit([Xt_train,Xb_train,X_train_cat],
            y_train,
            batch_size=10,
            epochs=8,
            validation_data=([Xt_test,Xb_test,X_test_cat], y_test),
            callbacks=callbacks_list,
            verbose=2)

In [0]:
model_save_name = 'model.pt'
path = F"/content/drive/My Drive/Colab Notebooks/{model_save_name}" 
torch.save(model, path)

In [31]:
model = torch.load('/content/drive/My Drive/Colab Notebooks/model.pt')












In [0]:
# evaluate the model
train_acc = model.evaluate([Xt_train,Xb_train,X_train_cat],
            y_train, verbose=0)
test_acc = model.evaluate([Xt_test,Xb_test,X_test_cat],
            y_test, verbose=0)

In [33]:
print("Train Accuracy:", train_acc[1])
print("Test Accuracy:", test_acc[1])

Train Accuracy: 0.8713242853592384
Test Accuracy: 0.839124177631579


In [0]:
predicted = model.predict([Xt_test,Xb_test,X_test_cat])
predicted[predicted>=0.5] = 1
predicted[predicted<0.5] = 0
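
The two in-place assignments above are safe in this order, because values already set to 1 are never below 0.5. An equivalent single step that leaves the probabilities untouched is sketched here on a made-up array.

```python
import numpy as np

# Hypothetical sigmoid outputs for two samples and three targets
probs = np.array([[0.91, 0.12, 0.50],
                  [0.49, 0.73, 0.05]])

# Same 0.5 cut-off, but the result goes into a new integer array
labels = (probs >= 0.5).astype(int)
print(labels)
```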

In [35]:
report = classification_report(y_test, predicted, target_names=y_test.columns)
print(report)

                                       precision    recall  f1-score   support

  question_asker_intent_understanding       0.99      1.00      0.99      1203
              question_conversational       0.27      0.17      0.21        63
         question_expect_short_answer       0.78      0.99      0.87       940
                question_fact_seeking       0.85      0.95      0.90       989
question_has_commonly_accepted_answer       0.90      0.93      0.92      1020
                question_multi_intent       0.65      0.08      0.14       277
             question_opinion_seeking       0.61      0.48      0.54       543
                question_well_written       0.93      1.00      0.96      1127

                            micro avg       0.87      0.88      0.87      6162
                            macro avg       0.75      0.70      0.69      6162
                         weighted avg       0.85      0.88      0.85      6162
                          samples avg       0.87  

This is a good overall result, and its accuracy is much higher than what we could achieve with the Random Forest model. 

Unfortunately, some targets still have low metrics. This may be because those question types are hard for the model to understand and classify correctly, or because they have so few positive samples compared to other targets: question_conversational, for example, has a support of only 63 in the test set, against 1203 for question_asker_intent_understanding.
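
To put the imbalance in numbers, the share of positive test samples per target can be computed from the support column of the report and the 1216 rows of the test set.

```python
n_test = 1216

# Support values copied from the classification report
support = {
    'question_asker_intent_understanding': 1203,
    'question_conversational': 63,
    'question_expect_short_answer': 940,
    'question_fact_seeking': 989,
    'question_has_commonly_accepted_answer': 1020,
    'question_multi_intent': 277,
    'question_opinion_seeking': 543,
    'question_well_written': 1127,
}

# Fraction of test rows that are positive for each target
rates = {target: positives / n_test for target, positives in support.items()}

for target, rate in rates.items():
    print(f'{target:<40}{rate:6.1%}')
```

About 99% of the test rows are positive for question_asker_intent_understanding, so its near-perfect scores could be reached by always predicting 1, whereas only about 5% are positive for question_conversational.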

We can experiment with a slightly more complex model, and try to run a few more epochs with a larger batch size to see if we can get any substantial improvements.

In [0]:
def build_model_2(embedding_matrix):
    title = Input(shape=(300,))
    question_body = Input(shape=(300,))
    category = Input(shape=(1,))
    
    category1 = Dense(32, activation='sigmoid')(category)
    
    title_embb = Embedding(*embedding_matrix.shape, weights=[embedding_matrix], trainable=False)(title)
    question_body_embb = Embedding(*embedding_matrix.shape, weights=[embedding_matrix], trainable=False)(question_body)
    concat = Concatenate(axis=1)
    embb_final = concat([title_embb,question_body_embb])
    
    x1 = Bidirectional(LSTM(200, return_sequences=True))(embb_final)
    x1 = Bidirectional(LSTM(200, return_sequences=True))(x1)
    hidden1 = concatenate([GlobalMaxPooling1D()(x1),GlobalAveragePooling1D()(x1)])
    hidden1 = add([hidden1, Dense(800, activation='relu')(hidden1)])
    hidden1 = Dense(32, activation='sigmoid')(hidden1)
    
    final = add([hidden1,category1])
    result = Dense(8, activation='sigmoid')(final)
    
    model = Model(inputs=[title,question_body,category], outputs=result)
    model._name = 'mymodel'
    model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer='adam')
    model.summary()
    return model

In [0]:
# checkpoint
filepath='weights-improvement-{epoch:02d}-{val_acc:.2f}.hdf5'
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]

In [38]:
model_2 = build_model_2(embedding_matrix)
model_2.fit([Xt_train,Xb_train,X_train_cat],
            y_train,
            batch_size=32,
            epochs=20,
            validation_data=([Xt_test,Xb_test,X_test_cat], y_test),
            callbacks=callbacks_list,
            verbose=2)

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 300)          0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 300)          0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 300, 300)     8856300     input_1[0][0]                    
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, 300, 300)     8856300     input_2[0][0]                    
____________________________________________________________________________________________

<keras.callbacks.History at 0x7f72acaadd68>

In [0]:
model_save_name = 'model_2.pt'
path = F"/content/drive/My Drive/Colab Notebooks/{model_save_name}" 
torch.save(model_2, path)

In [40]:
# evaluate the model
train_acc_2 = model_2.evaluate([Xt_train,Xb_train,X_train_cat],
            y_train, verbose=0)
test_acc_2 = model_2.evaluate([Xt_test,Xb_test,X_test_cat],
            y_test, verbose=0)

print("Train Accuracy:", train_acc_2[1])
print("Test Accuracy:", test_acc_2[1])

Train Accuracy: 0.8276526836144833
Test Accuracy: 0.826891447368421


In [0]:
predicted_2 = model_2.predict([Xt_test,Xb_test,X_test_cat])
predicted_2[predicted_2>=0.5] = 1
predicted_2[predicted_2<0.5] = 0

In [42]:
report_2 = classification_report(y_test, predicted_2, target_names=y_test.columns)
print(report_2)

                                       precision    recall  f1-score   support

  question_asker_intent_understanding       0.99      1.00      0.99      1203
              question_conversational       0.00      0.00      0.00        63
         question_expect_short_answer       0.77      1.00      0.87       940
                question_fact_seeking       0.81      1.00      0.90       989
question_has_commonly_accepted_answer       0.84      1.00      0.91      1020
                question_multi_intent       0.00      0.00      0.00       277
             question_opinion_seeking       0.00      0.00      0.00       543
                question_well_written       0.93      1.00      0.96      1127

                            micro avg       0.87      0.86      0.86      6162
                            macro avg       0.54      0.62      0.58      6162
                         weighted avg       0.75      0.86      0.80      6162
                          samples avg       0.87  