Let's import the required libraries and load the dataset into our application. The following script handles the imports:

Source: https://stackabuse.com/python-for-nlp-multi-label-text-classification-with-keras/

In [None]:
from numpy import array
from keras.preprocessing.text import one_hot, Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential, Model
# Layers are imported from keras.layers directly; the submodule paths
# (keras.layers.core, keras.layers.embeddings, keras.layers.merge) were
# removed in later Keras releases.
from keras.layers import Activation, Dropout, Dense, Flatten, LSTM
from keras.layers import GlobalMaxPooling1D, Embedding, Input, Concatenate
from sklearn.model_selection import train_test_split

import pandas as pd
import numpy as np
import re

import matplotlib.pyplot as plt

Let's now load the dataset into memory:

In [None]:
data_set = pd.read_csv("../Dataset/cleanedReviewsDateset100.csv")

The following script prints the shape of the dataset and displays its first rows:

In [None]:
print(data_set.shape)

data_set.head()

Let's now plot the comment count for each label. To do so, we first select all the label (output) columns.

In [None]:
bug_categories_labels = data_set[["invalidPositionOverTime", "implementationResponseIssue", "invalidContextOverTime", "interruptedEvent", 
                        "invalidEventOccurraceOverTime","actionNotPossible","actionWhenNotAllowed","informationOutOfOrder","lackOfRequiredInformation",
                        "invalidInfoAccess","objectOutOfBoundForAnyState","objectOutOfBoundForSpecificState","artificialStupidity","invalidValueChange",
                        "invalidGraphicalRespresentation","haveBugs"]]
bug_categories_labels.head()

We will now plot a bar chart that shows the total comment count for each label.

In [None]:
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 10
fig_size[1] = 8
plt.rcParams["figure.figsize"] = fig_size

bug_categories_labels.sum(axis=0).plot.bar()

Creating Multi-label Text Classification Models:
There are two ways to create multi-label classification models: using a single dense output layer, or using multiple dense output layers.

In the first approach, we use a single dense layer with 16 outputs, a sigmoid activation function, and a binary cross-entropy loss function. Each neuron in the output dense layer represents one of the 16 output labels. The sigmoid activation function returns a value between 0 and 1 for each neuron. If a neuron's output value is greater than 0.5, the comment is assumed to belong to the class represented by that neuron.
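
The thresholding rule described above can be sketched in a few lines of plain Python. This is only an illustration: the logit values and the helper name predict_labels are made up for the example, and only three of the label columns are used.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_labels(logits, label_names, threshold=0.5):
    """Return the names of the labels whose sigmoid output exceeds the threshold."""
    return [name for z, name in zip(logits, label_names) if sigmoid(z) > threshold]

# Hypothetical pre-activation values for three of the label columns.
label_names = ["interruptedEvent", "actionNotPossible", "haveBugs"]
logits = [-1.2, 0.3, 2.5]

predicted = predict_labels(logits, label_names)
print(predicted)  # ['actionNotPossible', 'haveBugs']
```

Because every neuron is thresholded independently, a comment can receive zero, one, or several labels at once, which is what distinguishes multi-label from multi-class classification.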

In the second approach, we create one dense output layer for each label, for a total of 16 dense layers in the output. Each layer has a single neuron with its own sigmoid activation function.

Multi-label Text Classification Model with Multiple Output Layers:

In [None]:
def preprocess_text(sen):
    # Remove punctuations and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sen)

    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)

    return sentence
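
To see what the three substitutions do in practice, here is a small usage sketch. The function is repeated so the snippet runs on its own, and the sample sentence is invented for the example.

```python
import re

def preprocess_text(sen):
    # Replace every character that is not a letter with a space
    sentence = re.sub('[^a-zA-Z]', ' ', sen)
    # Remove single letters that are surrounded by whitespace
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)
    # Collapse runs of whitespace into one space
    sentence = re.sub(r'\s+', ' ', sentence)
    return sentence

sample = "Game crashed 3 times at level 12!!"  # made-up review text
cleaned = preprocess_text(sample)
print(repr(cleaned))  # 'Game crashed times at level '
```

Note the trailing space: the punctuation at the end becomes whitespace that is collapsed but not removed. Calling .strip() on the result would drop it, although the tokenizer used later ignores it anyway.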

In the next step we will create our input and output set.

In [None]:
X = []
sentences = list(data_set["text"])
for sen in sentences:
    X.append(preprocess_text(sen))

y = data_set[["invalidPositionOverTime", "implementationResponseIssue", "invalidContextOverTime", "interruptedEvent", 
                        "invalidEventOccurraceOverTime","actionNotPossible","actionWhenNotAllowed","informationOutOfOrder","lackOfRequiredInformation",
                        "invalidInfoAccess","objectOutOfBoundForAnyState","objectOutOfBoundForSpecificState","artificialStupidity","invalidValueChange",
                        "invalidGraphicalRespresentation","haveBugs"]]

Here we do not need to perform any one-hot encoding because our output labels are already in the form of binary indicator vectors (one 0/1 column per label).

In the next step, we will divide our data into training and test sets:

In [None]:
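# Hold out 20% of the rows for testing; the remaining 80% form the training set.
# (An integer train_size is read as an absolute number of samples, not a percentage.)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)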
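
As a rough picture of what a shuffled hold-out split does, the following sketch reimplements the idea in plain Python. It is not scikit-learn's implementation and will not reproduce the same row assignment; the helper name holdout_split and the 100-item stand-in dataset are assumptions made for the example.

```python
import math
import random

def holdout_split(items, test_fraction=0.20, seed=42):
    """Shuffle a copy of items and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n_test = math.ceil(test_fraction * len(shuffled))  # test size is rounded up
    return shuffled[n_test:], shuffled[:n_test]

samples = list(range(100))  # stand-in for 100 preprocessed reviews
train, test = holdout_split(samples)
print(len(train), len(test))  # 80 20
```

Fixing the seed plays the same role as random_state above: rerunning the notebook yields the same train and test rows.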

In [None]:

y1_train = y_train[["invalidPositionOverTime"]].values
y1_test =  y_test[["invalidPositionOverTime"]].values

y2_train = y_train[["implementationResponseIssue"]].values
y2_test =  y_test[["implementationResponseIssue"]].values

y3_train = y_train[["invalidContextOverTime"]].values
y3_test =  y_test[["invalidContextOverTime"]].values

y4_train = y_train[["interruptedEvent"]].values
y4_test =  y_test[["interruptedEvent"]].values

y5_train = y_train[["invalidEventOccurraceOverTime"]].values
y5_test =  y_test[["invalidEventOccurraceOverTime"]].values

y6_train = y_train[["actionNotPossible"]].values
y6_test =  y_test[["actionNotPossible"]].values

y7_train = y_train[["actionWhenNotAllowed"]].values
y7_test =  y_test[["actionWhenNotAllowed"]].values

y8_train = y_train[["informationOutOfOrder"]].values
y8_test =  y_test[["informationOutOfOrder"]].values

y9_train = y_train[["lackOfRequiredInformation"]].values
y9_test =  y_test[["lackOfRequiredInformation"]].values

y10_train = y_train[["invalidInfoAccess"]].values
y10_test =  y_test[["invalidInfoAccess"]].values

y11_train = y_train[["objectOutOfBoundForAnyState"]].values
y11_test =  y_test[["objectOutOfBoundForAnyState"]].values

y12_train = y_train[["objectOutOfBoundForSpecificState"]].values
y12_test =  y_test[["objectOutOfBoundForSpecificState"]].values

y13_train = y_train[["artificialStupidity"]].values
y13_test =  y_test[["artificialStupidity"]].values

y14_train = y_train[["invalidValueChange"]].values
y14_test =  y_test[["invalidValueChange"]].values

y15_train = y_train[["invalidGraphicalRespresentation"]].values
y15_test =  y_test[["invalidGraphicalRespresentation"]].values

y16_train = y_train[["haveBugs"]].values
y16_test =  y_test[["haveBugs"]].values
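
The sixteen pairs of assignments above all follow one pattern: take a single label column and keep it as a two-dimensional array of shape (n_samples, 1). A minimal sketch of that pattern with NumPy, using a made-up label matrix and only three of the label names, could look like this. With a DataFrame, the equivalent per-column expression is the y_train[[name]].values call used above.

```python
import numpy as np

# Three of the label columns, with hypothetical values for four comments.
label_columns = ["invalidPositionOverTime", "implementationResponseIssue", "haveBugs"]
labels = np.array([
    [1, 0, 1],
    [0, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
])

# Indexing with a list ([j]) keeps the column axis, so each target has shape (n, 1).
per_label = [labels[:, [j]] for j in range(labels.shape[1])]

for name, column in zip(label_columns, per_label):
    print(name, column.shape)
```

Building the targets as a list keeps them in the same order as the output layers, which is the order Keras expects when a model has several outputs.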



We need to convert the text inputs into padded integer sequences, which the embedding layer will later map to vectors.

In [None]:
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X_train)

X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

vocab_size = len(tokenizer.word_index) + 1

maxlen = 200

X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)
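
The following plain-Python sketch mimics, in simplified form, what the tokenizer and padding steps do: words get integer ids by descending frequency starting at 1, words unseen during fitting are dropped, and sequences are zero-padded at the end. The helper names are hypothetical, and the real Tokenizer additionally filters punctuation.

```python
from collections import Counter

def build_word_index(texts):
    """Assign ids by descending frequency, starting at 1 (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word by its id; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

def pad_post(sequences, maxlen):
    """Zero-pad at the end; longer sequences keep only their last maxlen ids."""
    return [seq[-maxlen:] + [0] * (maxlen - len(seq[-maxlen:])) for seq in sequences]

train_texts = ["game crashed at level", "game froze"]
word_index = build_word_index(train_texts)

test_texts = ["game crashed again", "froze at level two game"]
padded = pad_post(texts_to_sequences(test_texts, word_index), maxlen=3)
print(padded)  # [[1, 2, 0], [3, 4, 1]]
```

The second sequence shows that padding='post' only controls where zeros are added. Truncation in pad_sequences defaults to the front, so reviews longer than maxlen lose their first words unless truncating='post' is passed.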

We will be using GloVe word embeddings to convert text inputs to their numeric counterparts.

In [None]:
from numpy import asarray
from numpy import zeros

embeddings_dictionary = dict()

# Each line of the GloVe file holds a word followed by its 100 vector components.
with open('../kaggle/glove.6B.100d.txt', encoding="utf8") as glove_file:
    for line in glove_file:
        records = line.split()
        word = records[0]
        vector_dimensions = asarray(records[1:], dtype='float32')
        embeddings_dictionary[word] = vector_dimensions

embedding_matrix = zeros((vocab_size, 100))
for word, index in tokenizer.word_index.items():
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector
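
To make the lookup logic concrete without downloading the GloVe file, here is a self-contained sketch that runs the same two steps on a tiny invented three-dimensional "GloVe" file. The helper names, the words, and the vectors are all made up for the example.

```python
import io

def load_glove(lines):
    """Parse GloVe-format lines ('word v1 v2 ...') into a dict of float lists."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if parts:
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with id i; other rows stay all-zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix

fake_glove = io.StringIO("game 0.1 0.2 0.3\nlevel -0.5 0.0 0.5\n")
vectors = load_glove(fake_glove)

word_index = {"game": 1, "glitchy": 2, "level": 3}
matrix = build_embedding_matrix(word_index, vectors, dim=3)
for row in matrix:
    print(row)
```

Row 0 stays zero because id 0 is reserved for padding, and row 2 stays zero because "glitchy" has no pretrained vector. This is also why vocab_size is computed as len(tokenizer.word_index) + 1.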

The following script creates the model. Our model will have one input layer, one embedding layer, one LSTM layer with 128 neurons, and 16 output layers with one neuron each, since we have 16 labels in the output.

In [None]:
input_1 = Input(shape=(maxlen,))
embedding_layer = Embedding(vocab_size, 100, weights=[embedding_matrix], trainable=False)(input_1)
LSTM_Layer1 = LSTM(128)(embedding_layer)

output1 = Dense(1, activation='sigmoid')(LSTM_Layer1)
output2 = Dense(1, activation='sigmoid')(LSTM_Layer1)
output3 = Dense(1, activation='sigmoid')(LSTM_Layer1)
output4 = Dense(1, activation='sigmoid')(LSTM_Layer1)
output5 = Dense(1, activation='sigmoid')(LSTM_Layer1)
output6 = Dense(1, activation='sigmoid')(LSTM_Layer1)
output7 = Dense(1, activation='sigmoid')(LSTM_Layer1)
output8 = Dense(1, activation='sigmoid')(LSTM_Layer1)
output9 = Dense(1, activation='sigmoid')(LSTM_Layer1)
output10 = Dense(1, activation='sigmoid')(LSTM_Layer1)
output11 = Dense(1, activation='sigmoid')(LSTM_Layer1)
output12 = Dense(1, activation='sigmoid')(LSTM_Layer1)
output13 = Dense(1, activation='sigmoid')(LSTM_Layer1)
output14 = Dense(1, activation='sigmoid')(LSTM_Layer1)
output15 = Dense(1, activation='sigmoid')(LSTM_Layer1)
output16 = Dense(1, activation='sigmoid')(LSTM_Layer1)

model = Model(inputs=input_1, outputs=[output1, output2, output3, output4, output5, output6,output7,output8,output9,output10,output11,output12,output13,output14,output15,output16])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
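
When a single loss name is passed to a model with several outputs, Keras applies it to each output and, with default loss weights, adds the results up into the total loss it reports. The sketch below works through that arithmetic by hand for two outputs and a batch of two comments. All numbers are invented, and the clipping constant eps is an assumption of this example.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over a batch for one output."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical targets and sigmoid outputs for two of the 16 labels.
targets = {"interruptedEvent": [1, 0], "haveBugs": [1, 1]}
predictions = {"interruptedEvent": [0.9, 0.2], "haveBugs": [0.6, 0.5]}

per_output = {name: binary_crossentropy(targets[name], predictions[name])
              for name in targets}
total_loss = sum(per_output.values())  # unweighted sum over outputs

for name, value in per_output.items():
    print(name, round(value, 4))
print("total", round(total_loss, 4))
```

With 16 outputs the total loss is a sum of 16 such terms, so its absolute value is not comparable to the loss of a single-output model.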

Let's print the model summary:

In [None]:
# summary() prints the table itself and returns None, so no print() is needed
model.summary()

The following script plots the architecture of our neural network and saves it to an image file:

In [None]:
from keras.utils import plot_model
plot_model(model, to_file='model_plot4a.png', show_shapes=True, show_layer_names=True)

Let's now train our model:

In [None]:
history = model.fit(
    x=X_train,
    y=[y1_train, y2_train, y3_train, y4_train, y5_train, y6_train, y7_train, y8_train,
       y9_train, y10_train, y11_train, y12_train, y13_train, y14_train, y15_train, y16_train],
    batch_size=8192, epochs=5, verbose=1, validation_split=0.2)

Let's now evaluate our model on the test set:

In [None]:
score = model.evaluate(x=X_test, y=[y1_test, y2_test, y3_test, y4_test, y5_test, y6_test, y7_test, y8_test, y9_test, y10_test, y11_test, y12_test, y13_test, y14_test, y15_test,y16_test], verbose=1)

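print("Total Test Loss:", score[0])

# With 16 outputs, score[1] is the loss of the first output, not an accuracy.
# Print every value next to its name instead (assuming model.metrics_names
# lines up with the returned list, as it does in Keras 2.x).
for name, value in zip(model.metrics_names, score):
    print(name, ":", value)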
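
For a model with several outputs, the list returned by evaluate is easier to read once it is sliced into its parts. The sketch below assumes the Keras 2.x layout (total loss first, then one loss per output, then one accuracy per output). That layout should be checked against model.metrics_names before relying on it. The helper name and the numbers are made up.

```python
def summarize_scores(score, n_outputs):
    """Split a flat evaluate() result into total loss, per-output losses, mean accuracy."""
    total_loss = score[0]
    losses = score[1:1 + n_outputs]
    accuracies = score[1 + n_outputs:1 + 2 * n_outputs]
    mean_accuracy = sum(accuracies) / len(accuracies)
    return total_loss, losses, mean_accuracy

# Invented result for a two-output model: [total, loss_1, loss_2, acc_1, acc_2]
example_score = [0.9, 0.5, 0.4, 0.75, 0.85]
total, losses, mean_acc = summarize_scores(example_score, n_outputs=2)
print(total, losses, round(mean_acc, 2))
```

For the model in this notebook, n_outputs would be 16. Averaging the per-label accuracies gives a single summary number, though on rare labels accuracy can look high even when the label is never predicted.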

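Finally, we will plot the loss and accuracy values for the training and validation sets to see if our model is overfitting. The history keys used below (dense_acc and dense_loss) belong to the first output layer only.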

In [None]:
import matplotlib.pyplot as plt

plt.plot(history.history['dense_acc'])
plt.plot(history.history['val_dense_acc'])

plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train','validation'], loc='upper left')
plt.show()

plt.plot(history.history['dense_loss'])
plt.plot(history.history['val_dense_loss'])

plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train','validation'], loc='upper left')
plt.show()
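
Besides reading the curves by eye, the same check can be done numerically: overfitting typically shows up as a validation loss that starts rising while the training loss keeps falling. The sketch below uses an invented history dictionary shaped like history.history. The values and the helper name best_epoch are assumptions made for the example.

```python
def best_epoch(val_losses):
    """Index (0-based) of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)

# Made-up curves: training loss keeps improving, validation loss turns around.
history_like = {
    "loss":     [0.70, 0.55, 0.45, 0.38, 0.33],
    "val_loss": [0.72, 0.60, 0.58, 0.61, 0.66],
}

epoch = best_epoch(history_like["val_loss"])
gaps = [v - t for t, v in zip(history_like["loss"], history_like["val_loss"])]

print("best epoch:", epoch)  # best epoch: 2
print("gap grows:", gaps[-1] > gaps[0])  # gap grows: True
```

A widening gap after the best epoch suggests stopping training earlier or adding regularization such as dropout.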