The paper proposes a solution to dialogue act (DA) classification that builds on work done for seq2seq problems, in particular Neural Machine Translation (NMT). As we will see, the two problems can be modelled in similar ways. They nevertheless differ on several points, which we describe below, and the proposed solutions are designed to overcome these differences.

Modelling :      



1.   For DA classification, a dialogue $i$ is modelled as follows:
\begin{equation*}
C_i = (u^{(i)}_1, u^{(i)}_2, ..., u^{(i)}_{|C_i|})
\end{equation*}
where $u^{(i)}_j$ denotes the $j$-th sentence (utterance) of the dialogue. Note that the two speakers do not necessarily alternate regularly (speaker 1 may say 5 sentences in a row while speaker 2 responds with a single sentence). Each conversation is associated with a sequence of tags, one per sentence:
\begin{equation*}
Y_i = (y^{(i)}_1, y^{(i)}_2, ..., y^{(i)}_{|C_i|})
\end{equation*} 
Finally, we introduce the collection of conversations and the corresponding tag sequences:
\begin{equation*}
\begin{aligned}
C &= (C_1, C_2, ..., C_{|C|})\\
Y &= (Y_1, Y_2, ..., Y_{|C|})
\end{aligned}
\end{equation*}

2.   We now turn to the modelling of the NMT problem. A source sequence is defined as follows:
\begin{equation*}
X^{l_1} = (x_1^{l_1}, x_2^{l_1}, ..., x_{|X^{l_1}|}^{l_1})
\end{equation*}
where $l_1$ is a language and each $x_i^{l_1}$ is a word of the sentence. The goal is to predict the corresponding sequence in another language $l_2$:
\begin{equation*}
X^{l_2} = (x_1^{l_2}, x_2^{l_2}, ..., x_{|X^{l_2}|}^{l_2})
\end{equation*}
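
To make the DA notation concrete, here is a minimal sketch of one conversation $C_i$ and its tag sequence $Y_i$ as plain Python lists. The conversation is invented for illustration (it is not taken from SwDA), and the tag strings are only placeholders in the style of dialogue act labels.

```python
# Toy conversation, invented for illustration (not taken from SwDA).
# Each entry is (caller, utterance, tag); the tags are placeholder labels.
conversation = [
    ("A", "do you have any pets", "qy"),
    ("B", "yes", "ny"),
    ("B", "we have two cats", "sd"),
    ("A", "uh-huh", "b"),
]

# C_i: the sequence of utterances; Y_i: one tag per utterance.
C_i = [text for _, text, _ in conversation]
Y_i = [tag for _, _, tag in conversation]
callers_i = [caller for caller, _, _ in conversation]

# One label per input utterance, so |Y_i| == |C_i|.
assert len(C_i) == len(Y_i) == len(callers_i)

# Each utterance is itself a sequence of words: a sequence of sequences.
C_i_tokens = [utterance.split() for utterance in C_i]
print(C_i_tokens[2])  # ['we', 'have', 'two', 'cats']
print(Y_i)            # ['qy', 'ny', 'sd', 'b']
```

Note that speaker B takes two turns in a row here, which is the irregular alternation mentioned above.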

Similarities :    


1.   First, the goal is essentially the same. In both cases we need to maximise a conditional likelihood:
\begin{equation*}
P(X^{l_2} \mid X^{l_1}) \mbox{ and } P(Y_i \mid C_i)
\end{equation*}  
2.   For the two tasks, there are strong dependencies between units composing both the input and output sequences.

Differences :


1.   The input dimension is much larger for DA classification than for NMT. In NMT the input is a sequence of words, while in the DA case it is a sequence of sequences of words. We will see in the code that with only 30 conversations the input already has dimension $1226 \times 25 \times 20 \times 300$.
2.   In the DA case the input and output sizes can differ greatly, unlike in NMT. Languages such as French and English produce sentences of similar length for the same meaning, whereas in the DA case one input sequence has size around $25 \times 20 \times 300$ and, with the 78 tags used here, the output space has size $25 \times 78$.
3.   It follows from the second point that, unlike in the NMT case, we face an alignment problem between each tag $y_i$ and the words $x_i$ that make up the corresponding utterance.
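
The sizes quoted in the list above can be checked with a few lines of arithmetic. This is only a sketch: it assumes that the four factors stand for the number of sequences, the utterances per sequence, the words per utterance, and the embedding dimension, which is how the preprocessing code further down uses them.

```python
# Assumed meaning of each factor (matching T=25, max_length=20 and the
# 300-dimensional embeddings used in the preprocessing code).
n_sequences = 1226   # figure quoted in the text for 30 conversations
T = 25               # utterances per sequence
max_length = 20      # words per utterance after padding/truncation
emb_dim = 300        # embedding dimension
n_tags = 78          # number of distinct tags

input_per_sequence = T * max_length * emb_dim
output_per_sequence = T * n_tags
total_input = n_sequences * input_per_sequence

print(input_per_sequence)   # 150000
print(output_per_sequence)  # 1950
print(total_input)          # 183900000
```

Under these assumptions, each input sequence holds roughly 77 times more values than its one-hot output.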


The main ideas are to build encoders that reduce the large input space, to build decoders with guided attention that align the inputs with the outputs, and finally to use beam search with the score below as the loss, so as not to be limited by the vocabulary size (this loss is not implemented here):

\begin{equation*}
s(\hat{y}^k, u_i) = \frac{\log P(\hat{y}^k \mid u_i)}{lp(\hat{y}^k)}
\end{equation*}

Where $lp(x) = \frac{(5 + |x|)^{\alpha}}{(5+1)^{\alpha}}$. 
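
As a rough illustration of the score above (which is not part of the notebook's models), the following sketch computes the length penalty and the normalised score for two hypothetical candidates. The function names, the value of `alpha` and the candidate probabilities are all assumptions made for the example.

```python
import math

def length_penalty(length, alpha):
    """lp(x) = (5 + |x|)^alpha / (5 + 1)^alpha, with |x| given as `length`."""
    return ((5 + length) ** alpha) / ((5 + 1) ** alpha)

def score(log_prob, length, alpha=0.6):
    """Log-probability divided by the length penalty (alpha=0.6 is an assumed value)."""
    return log_prob / length_penalty(length, alpha)

# A candidate of length 1 is never penalised: lp = 1 whatever alpha is.
print(length_penalty(1, alpha=0.6))  # 1.0

# Two hypothetical candidates with made-up probabilities.
cand_short = score(math.log(0.5), length=1)
cand_long = score(math.log(0.4), length=3)
print(cand_short, cand_long)
```

Because log-probabilities are negative, dividing by a penalty greater than 1 moves the score of a longer candidate closer to zero, which compensates for the fact that longer sequences accumulate lower probabilities.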

In this notebook, we focus on the SwDA data and propose an implementation of each encoder and decoder proposed in the paper, together with a utils package to run the benchmark from the paper.






#Data preprocessing

In [None]:
!pip install gluonnlp
!pip install mxnet

In [None]:
import os
import pickle as pkl
import pandas as pd
import numpy as np
import gluonnlp as nlp
from sklearn.model_selection import train_test_split
from nltk import word_tokenize
import nltk
nltk.download('punkt')

class SwDA:

  def __init__(self, path='swda/', number_conversations=30):

    self.C = [] #conversations
    self.Y = [] #list of list of tags
    self.Callers = [] #list of list of callers
    self.tags = set()

    count = 0
    for path_conv in os.listdir(path):
      p = os.path.join(path, path_conv)
      if os.path.isdir(p):
        for csv in os.listdir(p):
          #stop once number_conversations transcripts have been read
          if count < number_conversations and 'utt.csv' in csv:
            pp = os.path.join(p, csv)
            df = pd.read_csv(pp)[['act_tag', 'caller', 'text']]
            self.tags.update(pd.unique(df['act_tag']))
            self.Y.append(df['act_tag'].values)
            self.C.append(df['text'].values)
            self.Callers.append(df['caller'].values)
            count += 1 #count the conversations actually loaded, not the files seen
    
  def serialize_conversations(self, T=5**2):
    #serialize the raw data with a window of length T and clean the utterances 

    self.C_serialized = [] 
    self.Y_serialized = []
    self.Callers_serialized = []
    self.words = set()

    for i in range(len(self.C)):
      conv = []
      tags = []
      callers = []


      for t in range(T, len(self.C[i])):
        seq = self.C[i][t - T: t]
        seq_clean = []
        for utt in seq:
          words_utt = word_tokenize(utt)[:-1] #list of tokens without the final one (assumed to be the end-of-utterance marker)
          seq_clean.append(words_utt)
          for word in words_utt:
            self.words.add(word)
        conv.append(seq_clean)
        callers.append(self.Callers[i][t - T: t])
        tags.append(self.Y[i][t - T: t])

      self.C_serialized += conv
      self.Y_serialized += tags
      self.Callers_serialized += callers
      
    self.C_serialized = np.asarray(self.C_serialized, dtype=object)
    self.Y_serialized = np.asarray(self.Y_serialized, dtype=object)
    self.Callers_serialized = np.asarray(self.Callers_serialized, dtype=object)
  
  def embedding_pad(self, T=5**2, max_length=20):
    #pad or truncate to have max_length words in each sequence
    #embedding with the fasttext embeddings layer
    # tab : array (number_sequence, T, max_length, 300)

    #set fasttext
    counter = nlp.data.count_tokens([word for word in self.words])
    self.vocab = nlp.Vocab(counter)
    fasttext_simple = nlp.embedding.create('fasttext', source='wiki.simple')
    self.vocab.set_embedding(fasttext_simple)

    tab = np.zeros((self.C_serialized.shape[0], T, max_length, 300))
    pad = self.vocab.embedding['<pad>'].asnumpy()
    for i in range(self.C_serialized.shape[0]):
      for j in range(T):
        n = min(len(self.C_serialized[i][j]), max_length) #truncate long utterances
        for k in range(n):
          tab[i, j, k, :] = self.vocab.embedding[self.C_serialized[i][j][k]].asnumpy()
        tab[i, j, n:, :] = pad #pad short (or empty) utterances
    
    self.C_serialized = tab
  
  def one_hot_tags(self, T=5**2):
    #one hot encoding of tags
    
    tab = np.zeros((self.Y_serialized.shape[0], T, len(self.tags)))
    self.tags_dict = dict((tag, i) for (i, tag) in enumerate(self.tags))

    for i in range(self.Y_serialized.shape[0]):
      for j in range(T):
        tab[i, j, self.tags_dict[self.Y_serialized[i][j]]] = 1.0
    self.Y_serialized = tab

  def callers_process(self):
    #encode callers A and B as 1 and 0

    self.Callers_serialized = np.where(self.Callers_serialized == 'A', 1.0, 0.0)

  def preprocess(self, T=5**2, max_length=20):
    self.serialize_conversations(T)
    self.embedding_pad(T, max_length)
    self.one_hot_tags(T)
    self.callers_process()

    #shuffle and split

    self.callers_train, self.callers_test, self.conv_train, self.conv_test, self.y_train, self.y_test = train_test_split(self.Callers_serialized, 
                                                                                                                         self.C_serialized, self.Y_serialized, test_size=.5, shuffle=True)
  
  def get_raw_data(self):
    #return the raw data

    return self.Callers, self.C, self.Y
  
  def get_data(self):
    #return serialized utterances : callers for each utterance, utterances, tags

    return self.callers_train, self.callers_test, self.conv_train, self.conv_test, self.y_train, self.y_test
  
  def save(self, path):
    #the with-statement closes the file on exit
    with open(path, 'wb') as f:
      pkl.dump((self.callers_train, self.callers_test, self.conv_train, self.conv_test, self.y_train, self.y_test), f, protocol=4)
  
  def load(self, path):
    with open(path, 'rb') as f:
      self.callers_train, self.callers_test, self.conv_train, self.conv_test, self.y_train, self.y_test = pkl.load(f)
  

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
#preprocess data

dataset = SwDA()
dataset.preprocess()
dataset.save('data.pkl')

In [None]:
#or if it's already done...

dataset = SwDA()
dataset.load('data.pkl')
callers_train, callers_test, conv_train, conv_test, tags_train, tags_test = dataset.get_data()

In [None]:
len(dataset.tags)

78
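
The windowing done in `serialize_conversations` can be looked at in isolation. The sketch below uses a hypothetical helper, `sliding_windows`, that applies the same slicing (`items[t - T: t]` for `t` in `range(T, len(items))`) to a toy list.

```python
def sliding_windows(items, T):
    """Return the runs of T consecutive items produced by the slicing items[t - T: t]."""
    return [items[t - T:t] for t in range(T, len(items))]

utterances = ["u1", "u2", "u3", "u4", "u5", "u6"]
windows = sliding_windows(utterances, T=3)
print(windows)
# [['u1', 'u2', 'u3'], ['u2', 'u3', 'u4'], ['u3', 'u4', 'u5']]
print(len(windows))  # 3, i.e. len(utterances) - T
```

With this slicing, a conversation of $n$ utterances gives $n - T$ windows, so the window ending on the very last utterance is not produced, and conversations with at most $T$ utterances contribute nothing.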

#Encoders

In [None]:
import keras
import tensorflow as tf
from keras import Model
from keras.layers import Bidirectional, GRU, Lambda, Concatenate, Reshape
from tensorflow.keras.backend import squeeze, repeat_elements
from keras.regularizers import l2

#Vanilla RNN encoder

class VGRUe(Model):

  def __init__(self):
    super(VGRUe, self).__init__()
    
    self.mean_layer = Lambda( lambda x: tf.math.reduce_mean(x, axis=2)) #average the word embeddings of each utterance: (batch, T, 300)
    self.concatenate = Concatenate(axis=1)
    self.biGRU_1 = Bidirectional(GRU(128, return_sequences=True, dropout=0.2, kernel_regularizer=l2(1e-5)))
    self.biGRU_2 = Bidirectional(GRU(64, return_sequences=True, return_state=True, dropout=0.2, kernel_regularizer=l2(1e-5)))

  def call(self, x): # x is a list of two elements: Callers and seq of Utterances

    x = x[1]
    embedding_means = self.mean_layer(x)
    seq = self.biGRU_1(embedding_means)
    seq, state_fwd, state_bwd = self.biGRU_2(seq) #final states of the forward and backward GRUs

    return seq, self.concatenate([state_fwd, state_bwd])


#Hierarchical encoders

class HGRU(Model):

  def __init__(self):
    super(HGRU, self).__init__()

    self.word_seqs = Lambda( lambda x: tf.split(x, [1 for i in range(5**2)], axis=1))
    self.concatenate = Concatenate(axis=1)
    self.reshape = Reshape((1, 256))

    self.biGRU_words = Bidirectional(GRU(128, return_sequences=False, return_state=False, dropout=0.2, kernel_regularizer=l2(1e-5)))
    self.biGRU_utt = Bidirectional(GRU(64, return_sequences=True, return_state=True, dropout=0.2, kernel_regularizer=l2(1e-5)))
  
  def call(self, x):

    x = x[1]
    utts = self.word_seqs(x)
    words_dependencies = []
    for utt in utts:
      h = self.biGRU_words(squeeze(utt, axis=1))
      words_dependencies.append(self.reshape(h))
    words_dependencies = self.concatenate(words_dependencies)
    seq, state_fwd, state_bwd = self.biGRU_utt(words_dependencies) #final states of the forward and backward GRUs

    return seq, self.concatenate([state_fwd, state_bwd])


#Persona hierarchical encoders

class PersoHGRU(Model):

  def __init__(self):
    super(PersoHGRU, self).__init__()

    self.word_seqs = Lambda( lambda x: tf.split(x, [1 for i in range(5**2)], axis=1))
    self.speaker_differentes_right = Lambda( lambda x: tf.math.abs(tf.subtract(x, tf.roll(x, shift=1, axis=1))))
    self.speaker_differentes_left = Lambda( lambda x: tf.math.abs(tf.subtract(x, tf.roll(x, shift=24, axis=1))))
    self.concatenate = Concatenate(axis=1)
    self.reshape = Reshape((1, 256))

    self.biGRU_words = Bidirectional(GRU(128, return_sequences=False, return_state=False, dropout=0.2, kernel_regularizer=l2(1e-5)))
    self.GRU_persona_left = GRU(128, return_sequences=False, return_state=False, dropout=0.2, kernel_regularizer=l2(1e-5))
    self.GRU_persona_right = GRU(128, return_sequences=False, return_state=False, dropout=0.2, kernel_regularizer=l2(1e-5))
    self.biGRU_utt = Bidirectional(GRU(64, return_sequences=True, return_state=True, dropout=0.2, kernel_regularizer=l2(1e-5)))
  
  def call(self, x):

    c = x[0]
    c_right = repeat_elements(Reshape((25, 1))(self.speaker_differentes_right(c)), rep=256, axis=-1)
    c_left = repeat_elements(Reshape((25, 1))(self.speaker_differentes_left(c)), rep=256, axis=-1)
    u = x[1]
    utts = self.word_seqs(u)
    words_dependencies = []
    for utt in utts:
      h = self.biGRU_words(squeeze(utt, axis=1))
      words_dependencies.append(self.reshape(h))
    words_dependencies = self.concatenate(words_dependencies)
    persona_left = c_left * words_dependencies
    persona_right = c_right * words_dependencies
    p_left = self.word_seqs(persona_left)
    p_right = self.word_seqs(persona_right)
    persona = []
    for t in range(5**2):
      l = self.GRU_persona_left(p_left[t])
      r = self.GRU_persona_right(p_right[t])
      persona.append(Reshape((1, 256))(self.concatenate([l, r])))
    persona = self.concatenate(persona)
    seq, state_fwd, state_bwd = self.biGRU_utt(persona) #final states of the forward and backward GRUs

    return seq, self.concatenate([state_fwd, state_bwd])
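
The two `speaker_differentes_*` layers of `PersoHGRU` compute speaker-change masks with a circular shift. The following pure-Python sketch reproduces the same arithmetic on a toy window of 5 utterances; the helpers `roll` and `speaker_change` are hypothetical and only mimic `tf.roll` followed by an absolute difference.

```python
def roll(seq, shift):
    """Circular shift towards larger indices, same convention as tf.roll."""
    shift %= len(seq)
    return seq[-shift:] + seq[:-shift] if shift else list(seq)

def speaker_change(callers, shift):
    """1.0 where the caller differs from the caller found at the shifted position."""
    return [abs(a - b) for a, b in zip(callers, roll(callers, shift))]

# 1.0 stands for caller A and 0.0 for caller B, as in callers_process.
callers = [1.0, 1.0, 0.0, 0.0, 1.0]

# shift=1 compares each utterance with the previous one.
prev_change = speaker_change(callers, 1)
# shift=len-1 (24 for T=25) compares each utterance with the next one.
next_change = speaker_change(callers, len(callers) - 1)

print(prev_change)  # [0.0, 0.0, 1.0, 0.0, 1.0]
print(next_change)  # [0.0, 1.0, 0.0, 1.0, 0.0]
```

Since the shift is circular, the first utterance is compared with the last one of the window (and vice versa). In the encoder these masks multiply the utterance vectors, so only the utterances located at a speaker change are passed on with non-zero values.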

#Decoders

In [None]:
import keras
import tensorflow as tf
from keras import Model
from keras.layers import GRU, Dense, Concatenate, Reshape, Lambda
from keras.regularizers import l2
from keras.activations import softmax


#Vanilla decoder

class VGRUd(Model):

  def __init__(self, encoder):
    super(VGRUd, self).__init__()

    self.encoder = encoder #type Model
    #128 units so that the decoder state matches the encoder state (2 x 64) and the 128-dimensional input
    self.gru_decoder = GRU(128, return_sequences=True, return_state=True, dropout=0.2, kernel_regularizer=l2(1e-5))
    self.dense_softmax = Dense(78, activation='softmax', kernel_regularizer=l2(1e-5)) #78 tags
    self.concatenate = Concatenate(axis=1)

  def call(self, x):

    seq, H = self.encoder(x)
    input = Reshape((1, 128))(x[1][:, 0, 0, :128])
    state = Reshape((128, ))(H)
    outputs = []
    for t in range(5**2):
      input, state = self.gru_decoder(input, initial_state=state)
      outputs.append(input)
    outputs = self.concatenate(outputs)

    return self.dense_softmax(outputs)

#Decoders with attention

#Vanilla attention

class VGRUatt(Model):

  def __init__(self, encoder):
    super(VGRUatt, self).__init__()
  
    self.encoder = encoder 
    self.seqs = Lambda( lambda x: tf.split(x, [1 for i in range(5**2)], axis=1))
    self.gru_decoder = GRU(128, return_sequences=True, return_state=True, dropout=0.2, kernel_regularizer=l2(1e-5))
    self.dense_softmax = Dense(78, activation='softmax', kernel_regularizer=l2(1e-5)) #78 tags
    self.concatenate = Concatenate(axis=1)
    self.dense_attention = Dense(128, activation=None, kernel_regularizer=l2(1e-5)) 
  
  def call(self, x):

    seq, H = self.encoder(x)

    hs = self.seqs(seq)
    input = Reshape((1, 128))(x[1][:, 0, 0, :128])
    state = Reshape((128, ))(H)
    outputs = []
    for t in range(5**2):
      input, state = self.gru_decoder(input, initial_state=state)
      alphas = []
      for i in range(5**2):
        score = self.dense_attention(hs[i])
        alphas.append(tf.math.reduce_sum(tf.multiply(score, input), axis=2))
      alphas = self.concatenate(alphas)
      alphas = softmax(alphas, axis=-1)
      alphas = self.seqs(alphas)
      state = tf.multiply(Reshape((1, 1))(alphas[0]), hs[0])
      for i in range(1, 5**2):
        state += tf.multiply(Reshape((1, 1))(alphas[i]), hs[i])
      state = Reshape((128, ))(state)
      outputs.append(input)
    outputs = self.concatenate(outputs)

    return self.dense_softmax(outputs)

#Hard guided attention

class VGRUhga(Model):

  def __init__(self, encoder):
    super(VGRUhga, self).__init__()

    self.encoder = encoder 
    self.seqs = Lambda( lambda x: tf.split(x, [1 for i in range(5**2)], axis=1))
    self.gru_decoder = GRU(128, return_sequences=True, return_state=True, dropout=0.2, kernel_regularizer=l2(1e-5))
    self.dense_softmax = Dense(78, activation='softmax', kernel_regularizer=l2(1e-5)) #78 tags
    self.concatenate = Concatenate(axis=1)

  def call(self, x):

    seq, H = self.encoder(x)

    hs = self.seqs(seq)
    input = Reshape((1, 128))(x[1][:, 0, 0, :128])
    state = Reshape((128, ))(H)
    input, state = self.gru_decoder(input, initial_state=state)
    outputs = [input]
    for t in range(1, 5**2):
      input, state = self.gru_decoder(input, initial_state=state)
      state = Reshape((128, ))(hs[t])
      outputs.append(input)
    outputs = self.concatenate(outputs)

    return self.dense_softmax(outputs)
  
#Soft guided attention

class VGRUsga(Model):

  def __init__(self, encoder):
    super(VGRUsga, self).__init__()
  
    self.encoder = encoder 
    self.seqs = Lambda( lambda x: tf.split(x, [1 for i in range(5**2)], axis=1))
    self.gru_decoder = GRU(128, return_sequences=True, return_state=True, dropout=0.2, kernel_regularizer=l2(1e-5))
    self.dense_softmax = Dense(78, activation='softmax', kernel_regularizer=l2(1e-5)) #78 tags
    self.concatenate = Concatenate(axis=1)
    self.dense_attention = Dense(128, activation=None, kernel_regularizer=l2(1e-5)) 
  
  def call(self, x):

    seq, H = self.encoder(x)

    hs = self.seqs(seq)
    input = Reshape((1, 128))(x[1][:, 0, 0, :128])
    state = Reshape((128, ))(H)
    outputs = []
    for t in range(5**2):
      input, state = self.gru_decoder(input, initial_state=state)
      alphas = []
      for i in range(5**2):
        score = self.dense_attention(hs[i])
        a = tf.math.reduce_sum(tf.multiply(score, input), axis=2)
        if i == t:
          alphas.append(tf.add(1.0, a))
        else:
          alphas.append(a)
      alphas = self.concatenate(alphas)
      alphas = softmax(alphas, axis=-1)
      alphas = self.seqs(alphas)
      state = tf.multiply(Reshape((1, 1))(alphas[0]), hs[0])
      for i in range(1, 5**2):
        state += tf.multiply(Reshape((1, 1))(alphas[i]), hs[i])
      state = Reshape((128, ))(state)
      outputs.append(input)
    outputs = self.concatenate(outputs)

    return self.dense_softmax(outputs)
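
The only difference between `VGRUatt` and `VGRUsga` is the bonus of 1.0 added to the score of the encoder state aligned with the current decoding step. The sketch below isolates that step in pure Python with toy numbers; `soft_guided_context` is a hypothetical helper, and the scores and hidden states are made up.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def soft_guided_context(scores, hidden_states, t, bonus=1.0):
    """Add `bonus` to the score at position t, normalise, and mix the encoder states."""
    guided = [s + bonus if i == t else s for i, s in enumerate(scores)]
    alphas = softmax(guided)
    dim = len(hidden_states[0])
    context = [sum(a * h[d] for a, h in zip(alphas, hidden_states)) for d in range(dim)]
    return alphas, context

# Toy example: 3 encoder states of dimension 2 and equal raw scores.
scores = [0.0, 0.0, 0.0]
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

plain, _ = soft_guided_context(scores, hidden_states, t=1, bonus=0.0)
guided, context = soft_guided_context(scores, hidden_states, t=1)
print(plain)   # uniform weights
print(guided)  # more weight on position 1
```

With a bonus of 0 this reduces to the vanilla attention weights, whereas hard guided attention corresponds to putting all the weight on position `t`.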

#Train and Test models

In [None]:
#utils

from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import categorical_crossentropy
from tensorflow.keras.metrics import CategoricalAccuracy
from keras.callbacks import LearningRateScheduler

def scheduler(epoch, lr):
  #keep the learning rate for the first 20 epochs, then halve it at every epoch
  if epoch < 20:
    return lr
  else:
    return lr * 0.5

def compile_fit(model):
  #pass the loss function itself (not a call) and a list of metrics suited to one-hot targets
  model.compile(optimizer=Adam(learning_rate=0.01, clipnorm=5.), loss=categorical_crossentropy, metrics=[CategoricalAccuracy()])
  callbacks = LearningRateScheduler(scheduler)
  model.fit(x=[callers_train, conv_train], y=tags_train, epochs=500, validation_split=.2, callbacks=[callbacks], shuffle=True)

def test(model):
  #Keras models are evaluated with evaluate(), which returns the loss and the metric values
  return model.evaluate(x=[callers_test, conv_test], y=tags_test)

def benchmark():
  #try to reproduce the paper results

  models = {'VGRUe-VGRUd': VGRUd(VGRUe()),
            'HGRU-VGRUd': VGRUd(HGRU()),
            'PersoHGRU-VGRUd': VGRUd(PersoHGRU()),
            'VGRUe-VGRUatt': VGRUatt(VGRUe()),
            'HGRU-VGRUatt': VGRUatt(HGRU()),
            'PersoHGRU-VGRUatt': VGRUatt(PersoHGRU()),
            'VGRUe-VGRUhga': VGRUhga(VGRUe()),
            'HGRU-VGRUhga': VGRUhga(HGRU()),
            'PersoHGRU-VGRUhga': VGRUhga(PersoHGRU()),
            'VGRUe-VGRUsga': VGRUsga(VGRUe()),
            'HGRU-VGRUsga': VGRUsga(HGRU()),
            'PersoHGRU-VGRUsga': VGRUsga(PersoHGRU()),
              }

  for model in models:
    print('Training {} model ...'.format(model))
    compile_fit(models[model])
    print('Testing {} model ...'.format(model))
    test(models[model])
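
To see what the `scheduler` function does over time, the sketch below replays it outside of Keras. It assumes the usual behaviour of `LearningRateScheduler`, which calls the schedule at the start of every epoch with the current learning rate; the loop and the `history` list are only there for the illustration.

```python
def scheduler(epoch, lr):
    """Same rule as in the training utils: constant for 20 epochs, then halved each epoch."""
    return lr if epoch < 20 else lr * 0.5

lr = 0.01  # initial learning rate passed to Adam
history = []
for epoch in range(30):
    lr = scheduler(epoch, lr)  # called with the current learning rate
    history.append(lr)

print(history[19])  # 0.01
print(history[20])  # 0.005
print(history[29])  # 0.01 * 0.5**10
```

Under this assumption the learning rate is divided by about 1000 every 10 epochs after epoch 20, so most of the 500 epochs would run with a learning rate very close to zero.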
  