### Dialog Act Classification using LSTM

In [5]:
import pandas as pd
import glob

### Load the dataset.

##### The Switchboard Dialog Act Corpus (SwDA) is used for training.
##### The corpus can be downloaded from http://compprag.christopherpotts.net/swda.html.
##### Keep the downloaded dataset in a `data` folder in the same directory as this notebook.

In [6]:
# Collect every matching CSV file and stack them into a single DataFrame.
f = glob.glob("data/sw*/sw*.csv")
frames = [pd.read_csv(path) for path in f]
result = pd.concat(frames, ignore_index=True)

##### Only part of the dataset is used for training.

In [7]:
# Each row of the combined DataFrame is one utterance, not one conversation.
print("Number of utterances in the dataset:",len(result))

Number of utterances in the dataset: 51739


##### The dataset has many features; only `act_tag` and `text` are used for training.

In [8]:
reduced_df = result[['act_tag','text']]

In [9]:
reduced_df.head()

|   | act_tag | text |
|---|---------|------|
| 0 | o | Okay. / |
| 1 | qw | {D So, } |
| 2 | qy^d | [ [ I guess, + |
| 3 | + | What kind of experience [ do you, + do you ] h... |
| 4 | + | I think, ] + {F uh, } I wonder ] if that worke... |


##### Classify three dialog acts: Yes-No-Question ('qy'), Statement-non-opinion ('sd') and Statement-opinion ('sv').
##### Tag information can be found at http://compprag.christopherpotts.net/swda.html#tags

In [39]:
mapping_of_tags = {
    'qy':'Yes-No-Question',
    'sd':'Statement-non-opinion',
    'sv':'Statement-opinion'
}
frames = []
for e in ['qy', 'sd', 'sv']:
    frames.append(reduced_df.loc[reduced_df['act_tag'] == e])
reduced_df = pd.concat(frames)

##### Check the frequency of each tag

In [47]:
reduced_df['act_tag'].value_counts()

sd    17713
sv     5375
qy      887
Name: act_tag, dtype: int64

In [12]:
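# Get the unique tags. Sorting keeps the one-hot encoding identical across runs,
# whereas the iteration order of a set of strings can change between sessions.
unique_tags = sorted(reduced_df['act_tag'].unique())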

In [13]:
one_hot_encoding_dic = pd.get_dummies(list(unique_tags))

In [14]:
tags_encoding = []
for i in range(0, len(reduced_df)):
    tags_encoding.append(one_hot_encoding_dic[reduced_df['act_tag'].iloc[i]])
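
The lookup above works because each column of the `get_dummies` frame marks the position of one tag in the input list. A minimal pure-Python sketch of the same idea follows, assuming the tags are sorted so that each tag always gets the same position. The `one_hot` helper, `tag_order` and `tag_to_index` are illustrative names, not part of the notebook.

```python
# Hypothetical helper, not part of the notebook: one-hot encode tags without pandas.
tags = ['sd', 'qy', 'sv', 'sd']          # toy labels using the three tags kept above

# Sorting fixes the position of every tag, so the encoding is reproducible.
tag_order = sorted(set(tags))            # ['qy', 'sd', 'sv']
tag_to_index = {tag: i for i, tag in enumerate(tag_order)}

def one_hot(tag):
    """Return a list with a single 1 at the position assigned to `tag`."""
    vector = [0] * len(tag_order)
    vector[tag_to_index[tag]] = 1
    return vector

encoded = [one_hot(tag) for tag in tags]
print(encoded)  # [[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

Keeping `tag_order` around is also what makes it possible to turn a predicted index back into a tag later.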

### Word vectors

In [15]:
sentences = []
for i in range(0, len(reduced_df)):
    sentences.append(reduced_df['text'].iloc[i].split(" "))

In [16]:
wordvectors = {}
index = 1
for s in sentences:
    for w in s:
        if w not in wordvectors:
            wordvectors[w] = index
            index += 1

In [17]:
MAX_LENGTH = len(max(sentences, key=len))

In [18]:
sentence_embeddings = []
for s in sentences:
    sentence_emb = []
    for w in s:
        sentence_emb.append(wordvectors[w])
    sentence_embeddings.append(sentence_emb) 
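
Despite its name, `wordvectors` is a word-to-index vocabulary; the dense vectors are learned later by the Embedding layer. The toy example below sketches the same two steps on made-up utterances (the texts and the name `word_to_index` are assumptions for illustration only).

```python
# Toy utterances standing in for reduced_df['text'].
texts = ["do you like jazz", "i like jazz"]
sentences = [t.split(" ") for t in texts]

# Index 0 is left unused so it can serve as the padding value later.
word_to_index = {}
for sentence in sentences:
    for word in sentence:
        if word not in word_to_index:
            word_to_index[word] = len(word_to_index) + 1

# Replace every word by its integer index.
encoded = [[word_to_index[w] for w in s] for s in sentences]
print(encoded)  # [[1, 2, 3, 4], [5, 3, 4]]
```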

### Split the dataset into test and train

In [19]:
from sklearn.model_selection import train_test_split
import numpy as np
X_train, X_test, y_train, y_test = train_test_split(sentence_embeddings, np.array(tags_encoding))

##### Pad the sentences with zeros so that all sentences have equal length

In [23]:
from keras.preprocessing.sequence import pad_sequences
 
train_sentences_X = pad_sequences(X_train, maxlen=MAX_LENGTH, padding='post')
test_sentences_X = pad_sequences(X_test, maxlen=MAX_LENGTH, padding='post')
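
To make the effect of `padding='post'` concrete, here is a small pure-Python stand-in. `pad_post` is a hypothetical helper written for this illustration; it only covers the case used here, where no sequence is longer than `maxlen`, and does not reproduce the truncation behaviour of `pad_sequences`.

```python
def pad_post(sequences, maxlen, value=0):
    """Hypothetical stand-in for post-padding: append `value` until length is `maxlen`."""
    padded = []
    for seq in sequences:
        if len(seq) > maxlen:
            raise ValueError("sequence longer than maxlen; truncation is not handled here")
        padded.append(list(seq) + [value] * (maxlen - len(seq)))
    return padded

batch = [[1, 2, 3, 4], [5, 3, 4]]       # index-encoded toy sentences
print(pad_post(batch, maxlen=6))        # [[1, 2, 3, 4, 0, 0], [5, 3, 4, 0, 0, 0]]
```

Because word indices start at 1, the value 0 never collides with a real word.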

### Model

In [52]:
# architecture
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import Dropout, InputLayer, Bidirectional, TimeDistributed, Activation, Embedding
from keras.optimizers import Adam
model = Sequential()
model.add(InputLayer(input_shape=(MAX_LENGTH,)))
model.add(Embedding(len(wordvectors)+1, 128))
model.add(Bidirectional(LSTM(256, return_sequences=False)))
model.add(Dense(len(unique_tags)))
model.add(Activation('softmax'))
 
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(0.001),
              metrics=['accuracy'])
 
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         (None, 127)               0         
_________________________________________________________________
embedding_3 (Embedding)      (None, 127, 128)          2197504   
_________________________________________________________________
bidirectional_3 (Bidirection (None, 512)               788480    
_________________________________________________________________
dense_3 (Dense)              (None, 3)                 1539      
_________________________________________________________________
activation_3 (Activation)    (None, 3)                 0         
Total params: 2,987,523
Trainable params: 2,987,523
Non-trainable params: 0
_________________________________________________________________
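
The parameter counts in the summary can be checked by hand. The sketch below assumes a vocabulary size of 17168 (that is, `len(wordvectors) + 1`), which is inferred from the embedding row of the summary (2,197,504 / 128) rather than printed anywhere in the notebook.

```python
vocab_size = 17168      # assumed: len(wordvectors) + 1, inferred from 2,197,504 / 128
embedding_dim = 128
lstm_units = 256
num_tags = 3

embedding_params = vocab_size * embedding_dim
# One LSTM has 4 gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
bilstm_params = 2 * lstm_params                        # forward and backward directions
dense_params = 2 * lstm_units * num_tags + num_tags    # weights plus biases

total = embedding_params + bilstm_params + dense_params
print(embedding_params, bilstm_params, dense_params, total)
# 2197504 788480 1539 2987523
```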


#### As the dataset is highly imbalanced, class weights are supplied during training

In [64]:
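import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Convert the one-hot labels back to integer class indices.
y_integers = np.argmax(tags_encoding, axis=1)
# Keyword arguments are used because recent scikit-learn versions reject positional ones here.
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_integers), y=y_integers)
# Keras expects a {class_index: weight} dictionary.
d_class_weights = dict(enumerate(class_weights))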
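
The `'balanced'` setting weights each class by `n_samples / (n_classes * class_count)`. As a rough check, the same formula can be applied by hand to the tag counts printed earlier. This sketch is keyed by tag name for readability, whereas the notebook's `d_class_weights` is keyed by class index.

```python
# Tag counts taken from the value_counts() output above.
counts = {'sd': 17713, 'sv': 5375, 'qy': 887}

n_samples = sum(counts.values())
n_classes = len(counts)

# Rare classes receive proportionally larger weights.
balanced_weights = {tag: n_samples / (n_classes * c) for tag, c in counts.items()}
for tag, weight in balanced_weights.items():
    print(tag, round(weight, 2))
# sd 0.45
# sv 1.49
# qy 9.01
```

With these weights, a misclassified `qy` utterance contributes roughly twenty times as much to the loss as a misclassified `sd` utterance.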

In [74]:
model.fit(train_sentences_X, y_train, batch_size=100, epochs=5, class_weight = d_class_weights)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x21b0b7bc550>

#### Saving the model

In [75]:
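import os
from keras.models import load_model

# Create the output directory if needed, then save the model and reload it from disk.
os.makedirs('models', exist_ok=True)
model.save('models/text_classification_model.h5')
model = load_model('models/text_classification_model.h5')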

In [76]:
score = model.evaluate(test_sentences_X, y_test, batch_size=100)



In [77]:
print("Accuracy:", score[1]*100)

Accuracy: 75.59225850595328


The model is trained on 3 classes.
The tags are one-hot encoded.
Sentences are embedded with a Keras Embedding layer, using an embedding size of 128.
The model architecture is as follows:

1. Embedding layer (to generate word embeddings)
2. Bidirectional LSTM
3. Feed-forward layer with number of neurons = number of tags
4. Softmax activation to get probabilities
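
The last two steps of the architecture can be sketched in pure Python: softmax turns the three output scores into probabilities, and the largest probability selects the tag. The scores below are made up, and `tag_order` is an assumed label order; in the notebook the order follows the list passed to `get_dummies`.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

tag_order = ['qy', 'sd', 'sv']               # assumed label order for this sketch
mapping_of_tags = {
    'qy': 'Yes-No-Question',
    'sd': 'Statement-non-opinion',
    'sv': 'Statement-opinion',
}

logits = [0.2, 2.0, 0.5]                     # made-up scores from the dense layer
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
predicted = mapping_of_tags[tag_order[best]]
print(predicted)  # Statement-non-opinion
```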

Since the dataset is highly imbalanced, class weights are supplied during training.
An accuracy of about 75.6% is achieved on the test data.
The accuracy on the training set is high in comparison to the accuracy on the test data, which suggests overfitting.
To improve the test accuracy further, the model can be trained on more data, which should reduce the variance.
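
Because the classes are imbalanced, it may help to read the reported accuracy next to the share of the most frequent tag. The check below uses the tag counts printed earlier; it is an added sanity check rather than part of the original notebook, and the proportions in the actual test split may differ slightly.

```python
# Tag counts taken from the value_counts() output above.
counts = {'sd': 17713, 'sv': 5375, 'qy': 887}

# Accuracy of a trivial classifier that always predicts the most frequent tag.
majority_baseline = max(counts.values()) / sum(counts.values())
reported_accuracy = 75.59225850595328 / 100    # value printed by the notebook

print(round(majority_baseline * 100, 1))       # 73.9
print(round((reported_accuracy - majority_baseline) * 100, 1))  # 1.7
```

A margin this small suggests that per-class metrics would say more about how the model handles the rarer `sv` and `qy` tags than overall accuracy does.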