# Lab3.4a Training an emotion classifier using word embeddings

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

This notebook is a follow-up to the notebook Lab3.3a.ml.emotion-detection.ipynb and to Lab2.

The previous notebook *Lab3.3a.ml.emotion-detection* showed how you can create a bag-of-words vector representation from a text and how to feed it to a machine to learn to assign emotion labels to it. Words from the training set become features in the vector and the machine learns to associate these with the emotion classes. Unseen text is represented in the same way as a bag-of-words vector and through similarity with the training data, the machine makes a prediction on the emotion encoded in the unseen text.

In this notebook we are going to replace the bag-of-words by a word embedding representation. We discussed word embeddings in Lab2 as a powerful model to represent families of related words. Now imagine that we replace the token-based representations of words by a representation based on embeddings. Our hypothesis is that the training data will then represent families of related words, so that it can be matched with unseen data containing words belonging to similar families.

Using embeddings to replace word tokens in data is a powerful method to make training data more robust. Whereas the vectors for tokens are large and sparse (thousands of dimensions and many zero values), the embedding representations are small and dense (300 dimensions or less, and all dimensions have some positive or negative value).
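To make the contrast concrete, here is a minimal sketch. The five-word vocabulary and the 3-dimensional embedding values are invented for illustration only; they do not come from any real model.

```python
# Toy vocabulary and invented embedding values, for illustration only.
vocabulary = ["happy", "glad", "sad", "car", "tree"]
toy_embeddings = {
    "happy": [0.8, 0.1, -0.2],
    "glad": [0.7, 0.2, -0.1],
}

def bag_of_words(tokens, vocabulary):
    """One slot per vocabulary word: 1 if the word occurs, else 0."""
    return [1 if word in tokens else 0 for word in vocabulary]

sparse_happy = bag_of_words(["happy"], vocabulary)
sparse_glad = bag_of_words(["glad"], vocabulary)

# The sparse vectors share no non-zero slot, so they look unrelated.
overlap = sum(a * b for a, b in zip(sparse_happy, sparse_glad))

# The dense vectors have a value in every slot and point in a similar direction.
dense_dot = sum(a * b for a, b in zip(toy_embeddings["happy"], toy_embeddings["glad"]))

print(sparse_happy, sparse_glad, overlap)
print(round(dense_dot, 2))
```

In the sparse encoding, *happy* and *glad* have nothing in common; in the dense encoding their relatedness shows up as a positive dot product.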

We will test this hypothesis in this notebook by creating vectors based on *averaged* word vectors for the same MELD data and testing it in the same way as we did before on MELD data and our own set of utterances.

### Table of Contents

* [Section 1: Quick introduction to embeddings](#section1)
* [Section 2: Loading the emotion data](#section2)
* [Section 3: Preparing the training and test data](#section3)
* [Section 4: Training and applying the model](#section4)
* [Section 5: Generating the test report](#section5)
* [Section 6: Applying the classifier to your own text](#section6)


## 1 Quick introduction to embeddings  <a class="anchor" id ="section1"></a> 

A recent alternative way to create a 'semantic' representation of a word is by word embeddings: mapping words (or phrases) from the vocabulary to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with many dimensions per word to a continuous vector space with a much lower dimension. For this reason, they are called dense representations.

In linguistics, word embeddings were discussed in the research area of distributional semantics. The idea is to quantify and categorize semantic similarities between linguistic items based on their distributional properties in large samples of language data. The underlying notion is that "a word is characterized by the company it keeps" (Firth). Embeddings do not directly represent these context words but are the learned weights in the hidden layer of a neural network that tries to predict the context words. In that sense, we do not need thousands of dimensions to represent all possible context words but just the learned weights.
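As a rough illustration of "the company a word keeps", the sketch below counts the words that occur within a small window around a target word. The sentence, the function name `context_counts` and the window size are assumptions made for this example. Embedding models do not store such counts directly, but they are trained on exactly this kind of context information.

```python
from collections import Counter

def context_counts(tokens, target, window=2):
    """Count the words that occur within `window` positions of `target`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == target:
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

tokens = "the cat chased the mouse and the dog chased the cat".split()
cat_context = context_counts(tokens, "cat")
dog_context = context_counts(tokens, "dog")

# Words that appear in the context of both targets
shared = set(cat_context) & set(dog_context)
print(cat_context, dog_context, shared)
```

Words with overlapping contexts, such as *cat* and *dog* here, end up with similar vectors in a trained embedding model.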

### Reference:

For another explanation of how word embeddings can improve classical bag-of-words approaches, check out this page:

https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html


As in Lab2, we are going to use the *gensim* package to load and use prebuilt embedding models. Assuming it is already installed on your machine, we first import *gensim*:

In [1]:
import gensim

There are many sites that provide pretrained word2vec models that can be loaded through the *gensim* package. Check out the original data from Google for a word2vec model with 300 dimensions trained on Google News text: [Google code archive](https://code.google.com/archive/p/word2vec/). Here is another website with many ready-to-use models: http://vectors.nlpl.eu/repository/

Whatever you choose, make sure you can load the model using the 'gensim' package.

In this notebook, we will load pre-trained word embeddings, created in the [GloVe project](https://nlp.stanford.edu/projects/glove/) from Stanford University. We will use embeddings trained on Twitter data. We hope that the Twitter model is better adapted to the spoken utterances from the MELD project than other Google and GloVe models trained on written news and Wikipedia articles.

We will load the model with the Gensim package that we used before but you can also use the *gensim* api to load the model online, assuming you have a good network connection. If you want to download and load it from disk, follow the instructions for the next cell. If you do not want to store it on disk but load it online, skip to the next subsection.


### 1.1 Downloading models and loading from disk

You can download the twitter models to your disk from:

http://nlp.stanford.edu/data/glove.twitter.27B.zip

You can see that different models are provided with different dimensions:

Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors, 1.42 GB download): glove.twitter.27B.zip

On the Stanford website, also other models are provided: http://nlp.stanford.edu/data

We will use the model with 200 dimensions. If your computer cannot handle it, you can try one of the smaller models.

When you unpack the zip file, you see that the models are provided as text files. To load the data into *gensim*, we need to run the specific code given in the next cell. Don't worry too much about this code. Adapt the path to your local copy and run the cell to load it.

In [2]:
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

### Adapt the path to your local copy
# glove_file = datapath('/Users/piek/Desktop/ONDERWIJS/data/word-embeddings/classical-models/glove-twitter-models/glove.twitter.27B.200d.txt')
# tmp_file = get_tmpfile("test_word2vec.txt")

# _ = glove2word2vec(glove_file, tmp_file)
# word_embedding_model = KeyedVectors.load_word2vec_format(tmp_file)

# ### this model has 200 dimensions so we set the number of features to 200
# num_features = 200
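For readers curious about what the conversion does: a GloVe text file holds one word per line followed by its values, and the word2vec text format is the same with an extra header line giving the vocabulary size and the number of dimensions. The sketch below parses two invented lines to show this layout. The helper name `read_glove_lines` and the values are assumptions for illustration, not part of *gensim*.

```python
import io

# Two invented lines in the GloVe text layout: a word followed by its values.
glove_text = io.StringIO("cat 0.1 0.2 0.3\ndog 0.2 0.1 0.4\n")

def read_glove_lines(handle):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors

toy_vectors = read_glove_lines(glove_text)
dimensions = len(next(iter(toy_vectors.values())))

# The word2vec text format adds a "<vocab size> <dimensions>" header line.
word2vec_header = "%d %d" % (len(toy_vectors), dimensions)
print(word2vec_header)
```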

### 1.2 Loading models using the gensim API

Instead of downloading the models to disk, 'gensim' also provides a downloader API to load the model from the web when needed. In the next cell, we use this API to download a pretrained GloVe model. Note that it takes some time to download but it saves some disk space.

In [3]:
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api


# wordembeddings = "glove-twitter-200"
# ### this model has 200 dimensions so we set the number of features to 200
# num_features = 200

#wordembeddings = "glove-twitter-25"
### this model has 25 dimensions so we set the number of features to 25
#num_features = 25

#wordembeddings = "glove-twitter-50"
### this model has 50 dimensions so we set the number of features to 50
#num_features = 50

#wordembeddings = "glove-twitter-100"
### this model has 100 dimensions so we set the number of features to 100
#num_features = 100

wordembeddings = "glove-wiki-gigaword-300"
num_features = 300

word_embedding_model = api.load(wordembeddings)
print(num_features)

300


Loading the word embedding model and training the classifier may take a while. If your laptop cannot handle this, use a smaller word embedding model with fewer dimensions. Note that the performance of the classifier may degrade when fewer dimensions are used. Alternatively, you can reduce the amount of training data, as we will show below, but this will also likely affect the performance.

Depending on the embedding model you have selected, you need to set the number of features that are used to create vectors for the utterances equal to the number of dimensions. If the value of num_features is different from the dimensions of the model, you get an error creating the embedding representations for the utterances below.
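One way to guard against such a mismatch is to compare `num_features` with the `vector_size` attribute that gensim's `KeyedVectors` objects expose. The sketch below is a suggestion rather than part of the original lab: `check_num_features` is a hypothetical helper, and `ToyModel` is a stand-in that only mimics the `vector_size` attribute so that the example runs without downloading a model.

```python
class ToyModel:
    """Hypothetical stand-in that only mimics the `vector_size` attribute."""
    vector_size = 300

def check_num_features(model, num_features):
    """Raise a clear error when num_features does not match the model."""
    if model.vector_size != num_features:
        raise ValueError(
            "num_features is %d but the model has %d dimensions"
            % (num_features, model.vector_size)
        )
    return num_features

checked = check_num_features(ToyModel(), 300)
print(checked)
```

With the real model you could call `check_num_features(word_embedding_model, num_features)`, or simply set `num_features = word_embedding_model.vector_size`.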

Let's check if the model works.

In [4]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

word1='cat'
word2='dog'
word1_vector=np.array(word_embedding_model[word1]).reshape(1, -1)
word2_vector=np.array(word_embedding_model[word2]).reshape(1, -1)
print(cosine_similarity(word1_vector, word2_vector))

[[0.6816747]]
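The cosine similarity used here is the dot product of two vectors divided by the product of their lengths. A minimal pure-Python sketch of that formula, using invented 2-dimensional vectors, looks as follows:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])
opposite = cosine([1.0, 2.0], [-1.0, -2.0])
print(same_direction, orthogonal, opposite)
```

Vectors pointing in the same direction score close to 1, unrelated (orthogonal) vectors score 0, and opposite vectors score close to -1, regardless of their lengths.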


## 2. Loading the emotion data  <a class="anchor" id ="section2"></a> 

Just as with the previous notebook, we load the training and test data from the MELD data set. This code is the same as before.

In [5]:
import pandas as pd
import nltk

filepath = './data/MELD/train_sent_emo.csv'
dftrain = pd.read_csv(filepath)
### The data contains some strings with encoding problems. The next code replaces these characters in the utterances
# Try to fix encoding (regex=True is needed because the pattern uses '|' as alternation)
dftrain['Utterance'] = dftrain['Utterance'].str.replace("\x92|\x97|\x91|\x93|\x94|\x85", "'", regex=True)

filepath = './data/MELD/test_sent_emo.csv'
dftest = pd.read_csv(filepath)
dftest['Utterance'] = dftest['Utterance'].str.replace("\x92|\x97|\x91|\x93|\x94|\x85", "'", regex=True)
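To see what this replacement does on a single string, here is a small sketch with Python's `re` module. The helper name `fix_encoding` and the example strings are assumptions for illustration. The characters involved are leftover Windows-1252 punctuation codes (curly quotes, dash, ellipsis), and like the cell above the sketch maps all of them to a plain apostrophe.

```python
import re

# The same set of characters as in the pattern above, written as a character class.
pattern = re.compile("[\x85\x91-\x94\x97]")

def fix_encoding(utterance):
    """Replace leftover Windows-1252 punctuation codes by a plain apostrophe."""
    return pattern.sub("'", utterance)

cleaned = fix_encoding("I don\x92t know")
print(cleaned)
```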


In [6]:
def tokenize_data(text):
    ### the first loop gets the utterances
    text_tokens = []
    for utterance in text:
        text_tokens.append(nltk.tokenize.word_tokenize(utterance))
        
    return text_tokens

In [7]:
### We tokenize the training utterances
training_instances = tokenize_data(dftrain['Utterance'])

### We use the same function for the emotion labels that correspond with each utterance,
### so every label becomes a one-item list such as ['neutral']
training_labels = tokenize_data(dftrain['Emotion'])
    
### We tokenize the test utterances
test_instances = tokenize_data(dftest['Utterance'])

### Again, we use the same function for the corresponding emotion labels
test_labels = tokenize_data(dftest['Emotion'])


## 3. Preparing the training and test data  <a class="anchor" id ="section3"></a> 

The following imports are needed again to create a classifier:

In [8]:
import sklearn
import numpy as np
import nltk
from nltk.corpus import stopwords

In the previous notebook, we used CountVectorizer to obtain the full vocabulary of the data set and to generate a bag-of-words vector encoding for each utterance. In these vectors, each slot represents a word: a value '1' indicates that the word is present in the utterance and a '0' that it is absent. This results in large and sparse vector representations for each utterance. We have also seen that we can weight the relevance of a word using the 'TF.IDF' function. This still results in large and sparse vectors, but the weights are more subtle. The downside is sparseness, lack of generalisation and lack of robustness when dealing with new unseen data.

In the following, we are going to represent the utterances by an embedding representation. In fact, we take the word embedding of each token in the utterance and add these together, after which we take the average. All the embeddings have the same number of dimensions in the same order. So if two tokens have a high weight for one dimension, then their co-occurrence in an utterance will reinforce that weight. Note that by adding and taking the average, we normalize for the length of the utterance, and the order of the tokens is not relevant.
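The averaging step can be sketched in a few lines of plain Python. The 3-dimensional vectors below are invented for illustration, and `average_vector` is a hypothetical helper, not the function used later in this notebook.

```python
# Invented 3-dimensional vectors, for illustration only.
toy_vectors = {
    "so": [0.2, 0.0, 0.4],
    "very": [0.4, 0.2, 0.0],
    "happy": [0.6, 0.4, 0.2],
}

def average_vector(tokens, vectors):
    """Sum the vectors of the tokens per dimension and divide by the token count."""
    dimensions = len(next(iter(vectors.values())))
    total = [0.0] * dimensions
    for token in tokens:
        for i, value in enumerate(vectors[token]):
            total[i] += value
    return [value / len(tokens) for value in total]

first = average_vector(["so", "very", "happy"], toy_vectors)
second = average_vector(["happy", "very", "so"], toy_vectors)
print(first)
print(second)
```

Both calls give the same averaged vector (up to floating-point rounding), which shows that the word order is lost in this representation.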

Before we create the embedding representations for the utterances, we are going to filter words using the NLTK stopword list and their frequency, to make the text representations more compatible with our previous notebook. In the case of CountVectorizer, this was done for us through the parameter settings.

In this case, we need to make our own customized function. The next piece of code shows a loop over all utterances from which we extract a list of all tokens. Next, we count these tokens and extract a list of words that occur above our frequency threshold, which we should set in the same way as for CountVectorizer.

In [9]:
##### This code creates a list of words above the preset frequency threshold

from collections import Counter
frequency_threshold = 4
frequent_keywords = []

#### We first collect all tokens from all the utterances in the training data using the NLTK tokenizer
alltokens = []
for utterance in dftrain['Utterance']:
    tokenlist = nltk.tokenize.word_tokenize(utterance)
    for token in tokenlist:
        alltokens.append(token)
#### The Counter function will create a frequency count of all the items. The result is a dictionary       
kw_counter = Counter(alltokens)

#### We now loop over the dictionary with counts to get the word and the frequency value
for word, count in kw_counter.items():
    if count>frequency_threshold:
        frequent_keywords.append(word)
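Note that the comparison is strict: with a threshold of 4, a word needs at least 5 occurrences to be kept. The toy example below, with invented tokens, shows this boundary case.

```python
from collections import Counter

# Invented tokens: 'oh' occurs 5 times, 'wow' 4 times, 'hmm' once.
toy_tokens = ["oh"] * 5 + ["wow"] * 4 + ["hmm"] * 1
toy_threshold = 4

toy_counts = Counter(toy_tokens)
# Strictly greater than: a word with exactly 4 occurrences is dropped.
toy_frequent = [word for word, count in toy_counts.items() if count > toy_threshold]
print(toy_frequent)
```

For a large vocabulary, turning the resulting list into a `set` would make the later `word in frequent_keywords` checks considerably faster.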

Next, we want to look up every word of an utterance in our word embedding model to obtain the vector representation of that word. We then take the average over all the tokens of the utterance by summing the weights and dividing them by the number of words.

Before we look up a word in the embedding model, we check that it belongs to the frequent word list and that it is not a stopword. Otherwise, we ignore the word.


We also ignore words that are not in the embedding model. So it matters what type of text the model is trained on: news, Wikipedia, Twitter, spoken dialogues. To know what the coverage is of the model for our utterances in the MELD data set, we create a list of all unknown words and all known words.


We are going to define two customized functions using 'def' to create an embedding representation for each utterance.
These functions are based on: https://www.kaggle.com/varun08/sentiment-analysis-using-word2vec
and adapted for our purposes.

The first function, called 'featureVecMethod', takes the words of the utterance and the embedding model as well as some other values as parameters. The num_features parameter determines the size of the vector. If num_features is different from the actual number of dimensions of the word embedding model, this will generate an error.

In [10]:
unknown_words =[]
known_words = []
# Function to average all word vectors in a paragraph
def featureVecMethod(words, # Tokenized list of tokens from an utterance
                     frequent_keywords, # List of words above the frequency threshold
                     stop_words, # Stopwords that should be skipped
                     model, # The actual word embeddings model
                     modelword_index, # An index on the vocabulary of the model to speed up lookup
                     num_features # the number of dimensions of the embedding model
                    ):
    # Pre-initialise a numpy array for speed:
    # this creates an array of length num_features filled with zeros
    featureVec = np.zeros(num_features,dtype="float32")
    
    ## A counter for the number of tokens represented so that we can take the average
    nwords = 0
        
    for word in  words:
        #### we only use words that are frequent and not stopwords
        if word in frequent_keywords and not word in stop_words:         
            if word in modelword_index:
                ## The next function would directly take the values from the model
                #featureVec = np.add(featureVec,model[word])
                
                ### Instead of adding the raw embedding values, we first divide each vector
                ### by its Euclidean (L2) norm, so that every word vector has length 1 and
                ### contributes equally to the average. Don't worry about the specifics
                featureVec = np.add(featureVec,model[word]/np.linalg.norm(model[word]))

                #we keep track of the words detected, just for analysing the data
                known_words.append(word)
                nwords = nwords + 1
            else:
                word = word.lower()
                if word in modelword_index:
                    #featureVec = np.add(featureVec,model[word])
                    featureVec = np.add(featureVec,model[word]/np.linalg.norm(model[word]))
                    
                    #we keep track of the words detected, just for analysing the data
                    known_words.append(word)
                    nwords = nwords + 1
                else:
                    #we keep track of the unknown words to see how well our model fits the data
                    unknown_words.append(word)
                    
    # Dividing the result by number of words in each utterance to get average
    featureVec = np.divide(featureVec, nwords)
    return featureVec
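The division by `np.linalg.norm(model[word])` in the function above scales each word vector to unit length before it is added. A minimal pure-Python sketch of that operation, with an invented 2-dimensional vector and a hypothetical helper name, is:

```python
import math

def unit_length(vector):
    """Divide a vector by its Euclidean (L2) norm."""
    norm = math.sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector]

scaled = unit_length([3.0, 4.0])
scaled_norm = math.sqrt(sum(value * value for value in scaled))
print(scaled, scaled_norm)
```

After scaling, every value lies between -1 and 1, and words with large raw vectors no longer dominate the average.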

The next function just deals with all the data and creates the list of input vectors. This function calls the previous function and needs the same parameters.

In [11]:
# Function for calculating the average feature vector
def getAvgFeatureVecs(texts, ### List of texts, in our case tokenized utterances
                      keywords, 
                      stopwords, 
                      model, 
                      modelword_index, 
                      num_features
                     ):
    counter = 0
    #### we initialise a vector with zeros of the type float32
    textFeatureVecs = np.zeros((len(texts),num_features),dtype="float32")
    
    #### We iterate over all the texts
    for text in texts:
        # Printing a status message every 1000th text, to see what we are processing
        if counter%1000 == 0:
            print("Review %d of %d"%(counter,len(texts)))
        ### Each text is transformed into a vector representation based on the averaged token embedding using the previous function
        ### We add these vectors to the total list
        textFeatureVecs[counter] = featureVecMethod(text, keywords, stopwords, model, modelword_index,num_features)
        counter = counter+1
        
    #### Averaging over zero words yields NaN (or infinite) values. The next numpy function replaces NaN by 0 and infinities by very large finite numbers
    textFeatureVecs = np.nan_to_num(textFeatureVecs) 
    
    return textFeatureVecs
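When an utterance contains no frequent, non-stopword token that is known to the model, `nwords` stays 0 and the division produces NaN values together with a NumPy `RuntimeWarning`. The sketch below reproduces this case in isolation; the variable names are chosen for this example only.

```python
import numpy as np

empty_sum = np.zeros(3, dtype="float32")
nwords = 0

# Suppress the RuntimeWarning that NumPy would otherwise print for 0/0.
with np.errstate(divide="ignore", invalid="ignore"):
    averaged = np.divide(empty_sum, nwords)

# nan_to_num turns the NaN values into zeros
cleaned = np.nan_to_num(averaged)
print(averaged, cleaned)
```

Such utterances therefore end up as all-zero vectors, which carry no information for the classifier. An alternative design would be to test `nwords > 0` before dividing.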


Now back to our input data. We iterate over the Pandas frame in the same way as before but now we extract for each utterance the embedding representation, using the above two functions.

In [12]:
#Converting Index2Word which is a list to a set for better speed in the execution.
#Allows for quicker lookup if the words exist
# (in gensim 4.x this attribute is called index_to_key)
index2word_set = set(word_embedding_model.index2word)
# We take the stopwords from NLTK
stop_words = set(stopwords.words('english'))



In [13]:
# Calculating average feature vector for training set

### The full list of utterances is passed to our customized function, with the frequent_keywords list, the stopwords, the embedding model, 
### the index and the number of features that indicates the dimensions of the model
trainDataVecs = getAvgFeatureVecs(training_instances, frequent_keywords, stop_words, word_embedding_model, index2word_set, num_features)




Review 0 of 9989
Review 1000 of 9989
Review 2000 of 9989
Review 3000 of 9989
Review 4000 of 9989
Review 5000 of 9989
Review 6000 of 9989
Review 7000 of 9989
Review 8000 of 9989
Review 9000 of 9989


Let's inspect our training data a bit more. You will have a list with one vector per training utterance, so its length depends on how many training utterances you loaded:

In [14]:
len(trainDataVecs)

9989

We can inspect the first element in the list:

In [15]:
print('Vector length', len(trainDataVecs[0]))
print (trainDataVecs[0])

Vector length 300
[-2.42284834e-02  2.93585435e-02 -1.98888732e-03 -2.60138828e-02
  2.12003011e-04 -1.65226199e-02 -1.70849566e-03  2.49614175e-02
  8.99223983e-03 -3.20224464e-01  2.05466561e-02  9.02264100e-03
  1.72707322e-03  1.63111109e-02  4.60524857e-02  3.75630483e-02
 -3.11330091e-02 -3.31713678e-03  3.10880598e-04 -1.91135984e-02
  3.76915373e-02  3.55959423e-02  1.80483200e-02  1.18826814e-02
 -3.34082618e-02 -8.30051303e-03  5.46814129e-03 -1.42572133e-03
 -2.53683683e-02  2.36102082e-02  2.80681066e-03  7.12274387e-02
 -2.33282670e-02  6.64981361e-03 -1.54919937e-01  2.01730318e-02
  5.75820636e-03 -3.80901154e-03 -8.61165207e-03 -7.86029082e-03
 -1.62166171e-02 -6.09386247e-03 -1.48823317e-02  5.30707650e-02
 -1.69919822e-02  2.79374942e-02  4.06390727e-02  3.65676433e-02
 -1.76445320e-02  9.20884591e-03  6.82919938e-03 -1.11794947e-02
  3.85308824e-02  8.87811184e-03 -9.13435500e-03  2.98784617e-02
 -4.92461305e-03  5.87546453e-02  2.46372167e-02 -6.12752186e-03
  3.251

It is simply a list of numbers, each representing the averaged weight for one dimension over the tokens that made up the utterance. We can check the length, which should be '300', '100', '50' or '25', etc., depending on the number of dimensions of the embedding model that you used.

There are two major differences with the bag-of-tokens that we used in the previous notebook:

1. the vectors are short
2. there are no zeros

Instead of *large sparse* vectors, we now have *short dense* vectors representing each utterance. Whereas in the previous representation each slot in the vector corresponds with a token, each slot is now a weight from the hidden layer of a network that learned to predict other words in the context.

This is true for each utterance, each having a unique set of values for the same hidden layer weights. These weights now represent the meaning of the utterance for a machine, which can use a similarity function such as cosine similarity to measure the degree of equivalence across these representations. When we inspect any other utterance, we see it is represented in a similar way.

In [16]:
print(len(trainDataVecs[1000]))
print(trainDataVecs[1000])

300
[-3.34656052e-02 -1.54432449e-02  1.91893093e-02 -5.23057058e-02
  1.76805276e-02  4.07470670e-03 -1.71777830e-02  5.14362641e-02
  3.88247184e-02 -3.75979602e-01  7.19848052e-02 -1.79710761e-02
 -3.62468660e-02  6.15833774e-02 -1.57246254e-02  1.22439843e-02
 -5.39080687e-02  6.31871168e-03  1.29400091e-02 -1.66408233e-02
 -3.34875961e-03  6.07072338e-02  5.52398674e-02  2.99382154e-02
 -5.12655266e-02 -2.50904020e-02  4.18722862e-03 -4.78261448e-02
 -2.41893567e-02  1.42338928e-02 -2.30888464e-02  6.95327073e-02
 -5.94733842e-02  1.46406349e-02 -2.34109908e-01  5.14502414e-02
 -3.86457029e-03  4.01513055e-02 -3.06833498e-02 -5.70127089e-03
  4.84503731e-02 -4.19773385e-02 -1.20958053e-02  6.38127234e-03
 -5.32918237e-03  7.74203241e-03  7.32207969e-02  7.64759630e-02
  1.09588206e-02  2.62346324e-02  3.51432664e-03 -3.24195810e-02
  1.77698527e-02 -3.65351513e-02 -4.13852260e-02  5.94245419e-02
 -1.62900276e-02  4.39020880e-02  4.44586240e-02 -1.69757474e-02
  5.59020117e-02 -1.8

Since the vectors are compatible, we can compare them in the same way as we did before for the word2vec embeddings of *cat* and *dog*:

In [17]:
word1_vector=np.array(trainDataVecs[0]).reshape(1, -1)
word2_vector=np.array(trainDataVecs[1000]).reshape(1, -1)
print(cosine_similarity(word1_vector, word2_vector))

[[0.84406036]]


For training, we use the same labels as before:

In [18]:
print(training_labels[0], training_labels[1])

['neutral'] ['neutral']


So now we have a numeric representation of each text, based on the embeddings of the words. We feed this to a classifier in the same way as we did in the previous notebooks with the CountVectorizer output.

Before we can train the classifier, we still need to convert the labels to numeric values as we did before.

Before we do that, it may be good to check which words are not in the embedding model and therefore do not contribute to the representation of the utterance. In the above function, we kept track of the unknown words. Now we can inspect this list. We use the *Counter* function to get a frequency count of these words.

In [19]:
from collections import Counter

print('Training data vocabulary statistics:')
print('Frequency threshold', frequency_threshold)
unknown_words_count = Counter(unknown_words)
print('Proportion of unknown tokens', len(unknown_words)/(len(unknown_words)+len(known_words)))
print('Number of unknown words',len(unknown_words_count))
print('Number of unknown word tokens:', len(unknown_words))
print('Unknown words counts')
print(unknown_words_count)

Training data vocabulary statistics:
Frequency threshold 4
Proportion of unknown tokens 0.009389671361502348
Number of unknown words 25
Number of unknown word tokens: 630
Unknown words counts
Counter({"y'know": 320, 'pheebs': 79, 'and-and': 26, 'you-you': 20, 'no-no-no': 18, "i'm-i": 18, 'i-i-i': 18, 'hey-hey': 14, 'that-that': 12, 'we-we': 10, 'oh-oh': 10, 'what-what': 9, 'no-no-no-no': 8, '..': 7, 'yeah-yeah': 7, "'kay": 6, "it's-it": 6, 'buffay': 6, 'oh-ho': 6, 'i-i-i-i': 5, 'ok.': 5, "that's-that": 5, 'um-hmm': 5, 'the-the': 5, 'heldi': 5})


We can see that the proportion of unknown words is low, but note that we considered only words above the frequency threshold. Only a few words, making up a small proportion of the tokens, could not be represented.

If you happen to use the original Google Word2Vec model, this can be about 35% of all the tokens. Different models give different results, depending on how well the genre of their training text matches our data.

We also kept track of the *known* words, so let's check these as well:

In [20]:
known_words_count = Counter(known_words)
print('Number of known words',len(known_words_count))
print('Number of known words tokens:', len(known_words))
#print(known_words_count)

Number of known words 1111
Number of known words tokens: 66465


This confirms that a decent number of words and tokens is represented. For comparison, all tokens are represented in the case of the Bag-of-Words representation of tokens. The Bag-of-Words token representation is less semantic and has the risk of overfitting on the training data. The embedding representation is more semantic but less tuned to the training data.
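The reported proportion follows directly from the two token counts in the outputs above. The short sketch below redoes that arithmetic; `unknown_proportion` is a hypothetical helper name.

```python
# Counts taken from the training outputs above.
unknown_tokens = 630
known_tokens = 66465

def unknown_proportion(unknown, known):
    """Share of looked-up tokens that were missing from the embedding model."""
    return unknown / (unknown + known)

proportion = unknown_proportion(unknown_tokens, known_tokens)
print(proportion)
```

So fewer than 1 in 100 of the frequent, non-stopword tokens had no embedding.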

Next we represent the test data with the same methods to make it compatible.

In [21]:
testDataVecs = getAvgFeatureVecs(test_instances, frequent_keywords, stop_words, word_embedding_model, index2word_set, num_features) 

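### Note: unknown_words and known_words are global lists that already hold the training tokens,
### so the counts printed below are cumulative over the training and test data
print('Test data vocabulary statistics:')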
print('Frequency threshold', frequency_threshold)


unknown_words_count = Counter(unknown_words)
print('Proportion of unknown tokens', len(unknown_words)/(len(unknown_words)+len(known_words)))
print('Number of unknown words',len(unknown_words_count))
print('Number of unknown word tokens:', len(unknown_words))
print('Unknown words counts')
print(unknown_words_count)

known_words_count = Counter(known_words)
print('Number of known words',len(known_words_count))
print('Number of known words tokens:', len(known_words))
#print(known_words_count)

Review 0 of 2610
Review 1000 of 2610
Review 2000 of 2610
Test data vocabulary statistics:
Frequency threshold 4
Proportion of unknown tokens 0.009275300828530684
Number of unknown words 25
Number of unknown word tokens: 787
Unknown words counts
Counter({"y'know": 430, 'pheebs': 88, 'and-and': 28, 'i-i-i': 27, 'you-you': 26, 'no-no-no': 21, "i'm-i": 20, 'hey-hey': 16, 'we-we': 12, 'that-that': 12, 'what-what': 11, 'oh-oh': 10, 'no-no-no-no': 9, "'kay": 8, '..': 8, 'buffay': 8, 'yeah-yeah': 7, 'oh-ho': 7, 'the-the': 7, 'ok.': 6, "it's-it": 6, 'i-i-i-i': 5, "that's-that": 5, 'um-hmm': 5, 'heldi': 5})
Number of known words 1111
Number of known words tokens: 84062




Just as in the previous notebook, we need to turn the labels into numerical values:

In [22]:
from sklearn import preprocessing
# first we instantiate a label encoder
le = preprocessing.LabelEncoder()
# we feed this encoder with the complete list of labels from our data
le.fit(training_labels+test_labels)
print(list(le.classes_))
training_classes = le.transform(training_labels)
print('Train labels', list(training_classes[0:20]))

test_classes = le.transform(test_labels)
print('Test labels', list(test_classes[0:20]))

['anger', 'disgust', 'fear', 'joy', 'neutral', 'sadness', 'surprise']




Train labels [4, 4, 4, 4, 6, 4, 4, 4, 4, 4, 2, 4, 6, 4, 6, 5, 6, 2, 4, 4]
Test labels [6, 0, 4, 4, 3, 3, 3, 3, 3, 3, 3, 4, 4, 5, 6, 0, 0, 0, 3, 3]
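Conceptually, the label encoder sorts the distinct labels and uses the position of each label in that sorted list as its numeric class. A minimal pure-Python sketch of the same idea, with a short invented list of labels, is:

```python
emotions = ["neutral", "joy", "anger", "neutral", "surprise"]

# Sorted unique labels; the position in this list becomes the numeric class.
classes = sorted(set(emotions))
label_to_index = {label: index for index, label in enumerate(classes)}

encoded = [label_to_index[label] for label in emotions]
# Going back from numbers to labels is a simple list lookup
decoded = [classes[index] for index in encoded]
print(classes, encoded)
```

This is why 'anger' is class 0 and 'surprise' is class 6 in the output above: the seven MELD labels are ordered alphabetically.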


The next steps are the same as for the previous notebook, except that we pass the embedding representations of the training data.

It is nice to see that you can feed these machines anything represented as numeric vectors, no matter how they are derived or created. You could even type in these numbers yourself, although that would take a while.

## 4. Training and applying the model  <a class="anchor" id ="section4"></a> 

In [23]:
from sklearn import svm

# We choose a Linear model
svm_linear_clf = svm.LinearSVC(max_iter=2000)
### we train the classifier through the *fit* function by passing the training vectors and the training labels as parameters:
svm_linear_clf.fit(trainDataVecs, training_classes)

LinearSVC(max_iter=2000)

In [24]:
# Predicting the labels for the test set
y_pred_svm_linear = svm_linear_clf.predict(testDataVecs)
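To give an idea of what happens inside `predict`: with its default settings, `LinearSVC` learns one linear scoring function per class (one-vs-rest) and predicts the class with the highest score. The sketch below imitates that decision rule with invented weights and biases for three classes over two features; it is an illustration of the principle, not the scikit-learn implementation.

```python
# Invented weights and biases for three classes over two features.
toy_weights = [[1.0, -1.0], [-0.5, 0.5], [0.2, 0.2]]
toy_biases = [0.0, 0.1, -0.3]

def predict_class(features, weights, biases):
    """Return the index of the class with the highest linear score w.x + b."""
    scores = [
        sum(w * x for w, x in zip(class_weights, features)) + bias
        for class_weights, bias in zip(weights, biases)
    ]
    return scores.index(max(scores)), scores

predicted, scores = predict_class([0.9, 0.1], toy_weights, toy_biases)
print(predicted, scores)
```

In our case, the features are the averaged embedding values of an utterance and there are seven scoring functions, one per emotion.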

## 5. Generating the test report  <a class="anchor" id ="section5"></a> 

In [25]:
from sklearn.metrics import classification_report

In [26]:
#### this report gives the results for the LINEAR classifier
report = classification_report(test_classes,y_pred_svm_linear,digits = 7)
print(le.classes_)
print('Embeddings SVM LINEAR ----------------------------------------------------------------')
print('Word embedding model used', wordembeddings)
print('Word frequency threshold', frequency_threshold)
print(report)
print('Confusion matrix')
print(sklearn.metrics.confusion_matrix(test_classes,y_pred_svm_linear))

['anger' 'disgust' 'fear' 'joy' 'neutral' 'sadness' 'surprise']
Embeddings SVM LINEAR ----------------------------------------------------------------
Word embedding model used glove-wiki-gigaword-300
Word frequency threshold 4
              precision    recall  f1-score   support

           0  0.4642857 0.1130435 0.1818182       345
           1  0.0000000 0.0000000 0.0000000        68
           2  0.0000000 0.0000000 0.0000000        50
           3  0.4555256 0.4203980 0.4372574       402
           4  0.5991778 0.9283439 0.7282948      1256
           5  0.5882353 0.0480769 0.0888889       208
           6  0.5602094 0.3807829 0.4533898       281

    accuracy                      0.5712644      2610
   macro avg  0.3810620 0.2700922 0.2699499      2610
weighted avg  0.5270647 0.5712644 0.4977527      2610

Confusion matrix
[[  39    0    0   80  198    2   26]
 [   3    0    0    9   53    0    3]
 [   2    0    0   10   35    1    2]
 [   8    0    0  169  211    0   14]
 [   8



Remember the results from the previous notebook, where we trained Naive Bayes and SVM classifiers on bag-of-words representations of the utterances? Compare them with the results above and think about our hypothesis. We see that it is not confirmed by this experiment: the results are slightly lower than with the BoW approach.

Can you think of possible explanations? Spend a few minutes to think about this. We discuss this in class.
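When comparing results, it helps to know how the two averages in the report are computed. The sketch below recomputes the macro and weighted F1 from the per-class values in the report above.

```python
# Per-class F1 and support, copied from the report above (classes 0 to 6).
f1_scores = [0.1818182, 0.0, 0.0, 0.4372574, 0.7282948, 0.0888889, 0.4533898]
support = [345, 68, 50, 402, 1256, 208, 281]

# Macro average: every class counts equally, however rare it is.
macro_f1 = sum(f1_scores) / len(f1_scores)

# Weighted average: every class counts in proportion to its number of test items.
weighted_f1 = sum(f * n for f, n in zip(f1_scores, support)) / sum(support)

print(macro_f1, weighted_f1)
```

The macro average is much lower than the weighted average because *disgust* and *fear* are never predicted, while the large *neutral* class dominates the weighted score.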

## 6. Applying the model to new data  <a class="anchor" id ="section6"></a> 

We would like to apply the embedding-based model to our own data, but this works a bit differently: we cannot simply use the 'transform' function of the vectorizer, which represents the utterances as bag-of-words vectors over the training vocabulary.

What we need to do is to create an embedding representation using the same function we used above and assume that our classifier finds sufficient similarity in the embeddings of our data with the correct training data.

We use the same set of utterances.

In [27]:
# some utterances
some_chat = ['That is sweet of you',
             'You are so funny',
             'Are you a man or a woman?',
             'Chatbots make me sad and feel lonely.',
             'Your are stupid and boring.',
             'Two thumbs up',
             'I fell asleep halfway through this conversation',
             'Wow, I am really amazed.',
             'You are amazing.',
             'I feel so low being in isolation',
             'People dumping waste are horrible',
             'Its awful that you cannot stop smoking',
             'Dogs scare me',
             'I am afraid I will get sick at work',
             'I run away when I see a dog',
             'When do you start your job?'
            ]


len(some_chat)

16

Next, we define the list of labels that go with our chat.

In [28]:
some_chat_emotions = ['joy', 'joy', 'neutral', 'sadness', 'anger', 'joy', 'anger', 'surprise', 'joy', 'sadness', 'disgust', 'disgust', 'fear', 'fear', 'fear', 'neutral']

We use the LabelEncoder *le* to convert this list into a numpy array of integer labels:

In [29]:
print('labels',le.classes_)
some_chat_labels = le.transform(some_chat_emotions)
print(some_chat_labels)

labels ['anger' 'disgust' 'fear' 'joy' 'neutral' 'sadness' 'surprise']
[3 3 4 5 0 3 0 6 3 5 1 1 2 2 2 4]
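
The integers follow the alphabetical order of the class names. As a sketch of what `le.transform` does here, the same indices can be reproduced with plain numpy. The array `classes` is typed in by hand from the output above, and the names `labels` and `decoded` are illustrative and not part of the notebook.

```python
import numpy as np

# class names in the (alphabetical) order printed by le.classes_
classes = np.array(['anger', 'disgust', 'fear', 'joy',
                    'neutral', 'sadness', 'surprise'])

emotions = ['joy', 'joy', 'neutral', 'sadness', 'anger', 'joy', 'anger',
            'surprise', 'joy', 'sadness', 'disgust', 'disgust',
            'fear', 'fear', 'fear', 'neutral']

# position of each emotion name in the sorted class array
labels = np.searchsorted(classes, emotions)
print(labels)

# indexing the class array with the integers maps them back to names
decoded = classes[labels]
print(decoded[:3])
```

Indexing `classes` with a label is also how the predictions are turned back into emotion names further down, via `le.classes_[predicted_label]`.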


In [30]:
some_chat_tokens = tokenize_data(some_chat)
some_chat_embedding_vectors = getAvgFeatureVecs(some_chat_tokens, frequent_keywords,stop_words, word_embedding_model, index2word_set, num_features)

Review 0 of 16




In [31]:
# have classifier make a prediction

some_chat_pred = svm_linear_clf.predict(some_chat_embedding_vectors)
print('System predictions', some_chat_pred)
print('Gold labels', some_chat_labels)
for review, predicted_label in zip(some_chat, some_chat_pred):
    print('%s => %s' % (review, le.classes_[predicted_label]))




System predictions [3 3 6 5 0 4 4 6 3 5 4 0 4 4 4 4]
Gold labels [3 3 4 5 0 3 0 6 3 5 1 1 2 2 2 4]
That is sweet of you => joy
You are so funny => joy
Are you a man or a woman? => surprise
Chatbots make me sad and feel lonely. => sadness
Your are stupid and boring. => anger
Two thumbs up => neutral
I fell asleep halfway through this conversation => neutral
Wow, I am really amazed. => surprise
You are amazing. => joy
I feel so low being in isolation => sadness
People dumping waste are horrible => neutral
Its awful that you cannot stop smoking => anger
Dogs scare me => neutral
I am afraid I will get sick at work => neutral
I run away when I see a dog => neutral
When do you start your job? => neutral


In [32]:
report = classification_report(some_chat_labels,some_chat_pred,digits = 7)
print(le.classes_)
print('Embedding SVM LINEAR ON EXTERNAL DATA -----------------------------')
print('Word embedding model used', wordembeddings)
print('Word frequency threshold', frequency_threshold)
print(report)

['anger' 'disgust' 'fear' 'joy' 'neutral' 'sadness' 'surprise']
Embedding SVM LINEAR ON EXTERNAL DATA -----------------------------
Word embedding model used glove-wiki-gigaword-300
Word frequency threshold 4
              precision    recall  f1-score   support

           0  0.5000000 0.5000000 0.5000000         2
           1  0.0000000 0.0000000 0.0000000         2
           2  0.0000000 0.0000000 0.0000000         3
           3  1.0000000 0.7500000 0.8571429         4
           4  0.1428571 0.5000000 0.2222222         2
           5  1.0000000 1.0000000 1.0000000         2
           6  0.5000000 1.0000000 0.6666667         1

    accuracy                      0.5000000        16
   macro avg  0.4489796 0.5357143 0.4637188        16
weighted avg  0.4866071 0.5000000 0.4712302        16
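
The numbers in this report can be checked by hand from the system predictions and gold labels printed earlier. The sketch below recomputes the accuracy and the precision and recall for class 4 (neutral) with numpy; the arrays `gold` and `pred` are copied from the printed output, and the variable names are illustrative.

```python
import numpy as np

# copied from the 'Gold labels' and 'System predictions' output
gold = np.array([3, 3, 4, 5, 0, 3, 0, 6, 3, 5, 1, 1, 2, 2, 2, 4])
pred = np.array([3, 3, 6, 5, 0, 4, 4, 6, 3, 5, 4, 0, 4, 4, 4, 4])

# accuracy: fraction of utterances with the correct label
accuracy = np.mean(gold == pred)

# precision and recall for one class, here 4 = neutral
neutral = 4
true_positives = np.sum((pred == neutral) & (gold == neutral))
precision = true_positives / np.sum(pred == neutral)  # correct among predicted neutral
recall = true_positives / np.sum(gold == neutral)     # found among gold neutral

print('accuracy', accuracy)
print('precision neutral', round(float(precision), 7))
print('recall neutral', recall)
```

Only one of the seven 'neutral' predictions is correct, which gives the low precision of 0.1428571 for that class. Neutral is by far the largest class in the MELD test data (1256 of 2610 utterances), which suggests that the classifier falls back on it when it is unsure.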





Now compare these results with those of the BoW SVM from the previous notebook. On our own utterances, the word embedding model works much better!

A likely explanation is the robustness of the dense vector representations. The MELD test data comes from the same source as the MELD training data, and the BoW SVM may be overfitting on all kinds of source-specific features. This is much less the case for the embedding-based model, which is also more robust because it can handle input with words that do not occur in the training data but belong to the same family.
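
The following sketch illustrates the last point with made-up numbers. The three-dimensional vectors in `toy_embeddings` are hypothetical and were chosen so that 'afraid' and 'terrified' lie close together, as related words typically do in a real embedding model. A bag-of-words representation sees no overlap between a training word and an unseen related word, while their embeddings are still similar.

```python
import numpy as np

# Hypothetical toy embeddings, for illustration only.
toy_embeddings = {
    'afraid':    np.array([0.9, 0.1, 0.0]),
    'terrified': np.array([0.8, 0.2, 0.1]),
    'happy':     np.array([0.0, 0.1, 0.9]),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# suppose 'afraid' occurs in the training data and 'terrified' does not
seen_in_training = {'afraid'}
unseen_word = 'terrified'

# bag-of-words: an unseen word shares no feature with the training vocabulary
bow_overlap = len(seen_in_training & {unseen_word})

# embeddings: the unseen word is still close to its family member
sim_related = cosine(toy_embeddings['afraid'], toy_embeddings['terrified'])
sim_unrelated = cosine(toy_embeddings['afraid'], toy_embeddings['happy'])

print('BoW overlap:', bow_overlap)
print('afraid ~ terrified:', round(sim_related, 3))
print('afraid ~ happy:', round(sim_unrelated, 3))
```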

## 7. Saving the classifier to disk

Just as in the previous notebook, you can save the emotion classification model to disk and load it again at some other time. Note that you also need to load the same word embedding model (here glove-wiki-gigaword-300) and the same list of frequent keywords, so that any text input is represented with vectors that are compatible with the classifier.

In [33]:
import os
import pickle

# make sure the target directory exists before writing to it
os.makedirs('./models', exist_ok=True)

# save the classifier to disk
filename_classifier = './models/svm_linear_clf_embeddings.sav'
with open(filename_classifier, 'wb') as outfile:
    pickle.dump(svm_linear_clf, outfile)

# save the frequent keywords, which are needed to build the same vectors
filename_freq_keywords = './models/frequent_keywords.sav'
with open(filename_freq_keywords, 'wb') as outfile:
    pickle.dump(frequent_keywords, outfile)

# some time later...

# load the classifier and the frequent keywords from disk
with open(filename_classifier, 'rb') as infile:
    loaded_classifier = pickle.load(infile)
with open(filename_freq_keywords, 'rb') as infile:
    loaded_frequent_keywords = pickle.load(infile)
stop_words = set(stopwords.words('english'))

# word_embedding_model, index2word_set and num_features must come from
# the same embedding model that was used for training
some_chat_tokens = tokenize_data(some_chat)
some_chat_embedding_vectors = getAvgFeatureVecs(some_chat_tokens, loaded_frequent_keywords, stop_words, word_embedding_model, index2word_set, num_features)
some_chat_pred = loaded_classifier.predict(some_chat_embedding_vectors)

print('System predictions', some_chat_pred)
print('Gold labels', some_chat_labels)

for review, predicted_label in zip(some_chat, some_chat_pred):
    print('%s => %s' % (review, le.classes_[predicted_label]))




Review 0 of 16
System predictions [3 3 6 5 0 4 4 6 3 5 4 0 4 4 4 4]
Gold labels [3 3 4 5 0 3 0 6 3 5 1 1 2 2 2 4]
That is sweet of you => joy
You are so funny => joy
Are you a man or a woman? => surprise
Chatbots make me sad and feel lonely. => sadness
Your are stupid and boring. => anger
Two thumbs up => neutral
I fell asleep halfway through this conversation => neutral
Wow, I am really amazed. => surprise
You are amazing. => joy
I feel so low being in isolation => sadness
People dumping waste are horrible => neutral
Its awful that you cannot stop smoking => anger
Dogs scare me => neutral
I am afraid I will get sick at work => neutral
I run away when I see a dog => neutral
When do you start your job? => neutral
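
Because the classifier is only usable together with the keyword list and the same embedding model, one option is to store everything that belongs together in a single file. The sketch below is a variation on the cell above, not the notebook's method: the dictionary keys and the stand-in objects are hypothetical, and it writes to a temporary directory so that it runs on its own.

```python
import os
import pickle
import tempfile

# Stand-ins so that the sketch runs on its own; in the notebook these would be
# svm_linear_clf, frequent_keywords, wordembeddings and frequency_threshold.
bundle = {
    'classifier': 'placeholder for svm_linear_clf',
    'frequent_keywords': ['sad', 'funny', 'amazing'],
    'embedding_model_name': 'glove-wiki-gigaword-300',
    'frequency_threshold': 4,
}

with tempfile.TemporaryDirectory() as model_dir:
    bundle_path = os.path.join(model_dir, 'emotion_model_bundle.sav')

    # save everything in one file
    with open(bundle_path, 'wb') as outfile:
        pickle.dump(bundle, outfile)

    # some time later: load it again
    with open(bundle_path, 'rb') as infile:
        loaded_bundle = pickle.load(infile)

# the stored name tells us which embedding model to load before predicting
print(loaded_bundle['embedding_model_name'])
```

Storing the name of the embedding model next to the classifier makes it harder to combine the classifier with incompatible vectors by accident.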




# End of this notebook