# Lab3.6 Training an emotion classifier using word embeddings

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

This notebook is a follow-up to the notebook Lab3.5.ml.emotion-detection-bow.ipynb and to Lab2.

The previous notebook *Lab3.3.ml.emotion-detection-bow* showed how you can create a bag-of-words vector representation from a text and how to feed it to a machine to learn to assign emotion labels to it. Words from the training set become features in the vector and the machine learns to associate these with the emotion classes. Unseen text is represented in the same way as a bag-of-words vector and through similarity with the training data, the machine makes a prediction on the emotion encoded in the unseen text.

In this notebook we are going to replace the bag-of-words by a word embedding representation of the text. We discussed word embeddings in Lab2 as a powerful model to represent families of related words. Now imagine that we replace our token-based representation of a text by a representation based on embeddings.

Our hypothesis is that the training data will be boosted to represent families of related words, so that it can be compared to text containing words that were unseen in the training data but whose embeddings are similar to those of words that were seen. We expect this to improve recall, possibly at the price of precision. Our evaluation will show if this is the case.

Using embeddings to replace word tokens in data is a powerful method to make training data more robust. Whereas the vectors for tokens in a bag-of-words are large and sparse (thousands of dimensions and many zero values), the embedding representations are small and dense (300 dimensions or fewer, and all dimensions have some positive or negative value). The dense vectors lose details but form powerful generalisations. Can you imagine representing the meaning of a text using 300 numbers?

We will test this hypothesis in this notebook by creating vectors based on *averaged* word vectors for the same MELD data and testing it in the same way as we did before on MELD data and our own set of utterances.

There is one big challenge: how do we represent a whole text or document through word embeddings? We cannot simply replace words in a bag-of-words vector by embeddings. Think about this. What would it mean to compare one text to another? Work out a simple example to try this out.

Here we are going to use a very crude and simple solution: we average the embeddings of the words of a document to represent the document. Other options are to sum, concatenate or multiply the embeddings, each with its own problems for comparing documents.
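To make the averaging idea concrete, here is a minimal sketch in plain Python. The three-dimensional embeddings and the helper name `average_embedding` are invented for illustration and are not part of the notebook's own code; real models have 25 to 300 dimensions.

```python
# Toy 3-dimensional embeddings; the values are invented for illustration only.
toy_embeddings = {
    "i":     [0.1, 0.0, 0.2],
    "am":    [0.0, 0.1, 0.1],
    "so":    [0.2, 0.1, 0.0],
    "happy": [0.9, 0.8, 0.0],
}

def average_embedding(tokens, embeddings, dimensions):
    """Average the embeddings of the tokens that have one; skip the rest."""
    total = [0.0] * dimensions
    matched = 0
    for token in tokens:
        vector = embeddings.get(token)
        if vector is None:
            continue  # a token without an embedding contributes nothing
        total = [t + v for t, v in zip(total, vector)]
        matched += 1
    if matched == 0:
        return total  # all zeros when no token is known
    return [t / matched for t in total]

short_doc = average_embedding(["happy"], toy_embeddings, 3)
long_doc = average_embedding(["i", "am", "so", "happy"], toy_embeddings, 3)

# Both documents end up with a vector of the same length (3),
# regardless of how many tokens they contain.
print(len(short_doc), len(long_doc))
print(long_doc)
```

Because every document is reduced to a vector of the same size, any two documents can be compared directly, which is not the case when embeddings are concatenated.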

Doc2Vec was proposed as an alternative, but it requires building new embedding models from data.


### Table of Contents

* [Section 1: Quick introduction to embeddings](#section1)
* [Section 2: Loading the emotion data](#section2)
* [Section 3: Preparing the training and test data](#section3)
* [Section 4: Training and applying the model](#section4)
* [Section 5: Generating the test report](#section5)
* [Section 6: Applying the classifier to your own text](#section6)


## 1 Quick introduction to embeddings  <a class="anchor" id ="section1"></a> 

A recent alternative way to create a 'semantic' representation of a word is by word embeddings: mapping words (or phrases) from the vocabulary to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with many dimensions per word to a continuous vector space with a much lower dimension. For this reason, they are called dense representations.

In linguistics, word embeddings were discussed in the research area of distributional semantics. The idea is to quantify and categorize semantic similarities between linguistic items based on their distributional properties in large samples of language data. The underlying notion is that "a word is characterized by the company it keeps" (Firth 1957). 

Embeddings do not directly represent these context words but are the learned weights in the hidden layer of a neural network that tries to predict the context words. In that sense, we do not need thousands of dimensions to represent all possible context words but just the learned weights.

### Reference:

For another explanation of how word embeddings can improve classical bag-of-words approaches, check out this page:

https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html


As in Lab2, we are going to use the *gensim* package to load and use prebuilt embedding models. If it is not yet installed on your machine, the next cell installs it; after that we import *gensim*.

In [1]:
%pip install gensim

Note: you may need to restart the kernel to use updated packages.


In [2]:
import gensim

There are many sites that provide pretrained word2vec models that can be loaded through the *gensim* package. Check out the original data from Google for a word2vec model with 300 dimensions trained on Google News text: [Google code archive](https://code.google.com/archive/p/word2vec/). Here is another website with many ready-to-use models: http://vectors.nlpl.eu/repository/

Whatever you choose, make sure you can load the model using the 'gensim' package.

In this notebook, we will load pre-trained word embeddings created in the [GloVe project](https://nlp.stanford.edu/projects/glove/) at Stanford University. We will use embeddings trained on Twitter data. We hope that the Twitter model is better adapted to the spoken utterances from the MELD project than the Google and GloVe models trained on written news and Wikipedia articles.

We will load the model with the *gensim* package that we used before. You can either download the model yourself and load it from disk, or use the *gensim* downloader API to load it online, assuming you have a good network connection. To download and load it from disk, follow the instructions for the next cell. To load it online instead, skip to the subsequent subsection.


### 1.1 Downloading models and loading from disk

You can download the twitter models to your disk from:

http://nlp.stanford.edu/data/glove.twitter.27B.zip

You can see on the GloVe website that models are provided with different numbers of dimensions:

Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors, 1.42 GB download): glove.twitter.27B.zip

We will use the model with 200 dimensions. If your computer cannot handle it, you can try one of the smaller models.

When you unpack the zip file, you see that the models are provided as text files. To load the data into *gensim*, we run the code in the next cell, which uses the function *glove2word2vec* to convert the GloVe text format into the word2vec format. Don't worry too much about this code. Adapt the path to your local copy and run the cell to load it.

In [3]:
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

### Adapt the path to your local copy
wordembeddings="glove.twitter.27B.200d.txt"
glove_file = datapath('/Users/piek/Desktop/t-ONDERWIJS/data/word-embeddings/classical-models/glove-twitter-models/glove.twitter.27B.200d.txt')

### Here is the code to create a word2vec model from the Glove text data
tmp_file = get_tmpfile("test_word2vec.txt")
_ = glove2word2vec(glove_file, tmp_file)

word_embedding_model = KeyedVectors.load_word2vec_format(tmp_file)

### This model has 200 dimensions, so we set the number of features to 200; we need this value later.
num_features = 200

### 1.2 Loading models using the gensim API

Instead of downloading the models yourself, you can use the downloader API that *gensim* provides to fetch the model from the web when needed. In the next cell, we use this API to download a word embedding model trained on tweets. Note that the first download takes some time; gensim caches the downloaded model locally (by default in a `gensim-data` folder in your home directory), so later loads are faster.

In [1]:
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api


# wordembeddings = "glove-twitter-200"
# ### this model has 200 dimensions so we set the number of features to 200
# num_features = 200

wordembeddings = "glove-twitter-25"
### this model has 25 dimensions so we set the number of features to 25
num_features = 25

#wordembeddings = "glove-twitter-50"
### this model has 50 dimensions so we set the number of features to 50
#num_features = 50

#wordembeddings = "glove-twitter-100"
### this model has 100 dimensions so we set the number of features to 100
#num_features = 100

#wordembeddings = "glove-wiki-gigaword-300"
#num_features = 300

word_embedding_model = api.load(wordembeddings)
print(num_features)

25


Loading the word embedding model and training the classifier may take a while. If your laptop cannot handle this, use a smaller word embedding model with fewer dimensions. Note that the performance of the classifier may degrade when fewer dimensions are used. Alternatively, you can reduce the amount of training data, but this will also likely have an effect on the performance.

Depending on the embedding model you selected, you need to set the number of features that are used to create vectors for the utterances equal to the number of dimensions of the model. If the value of num_features is different from the dimensions of the model, you get an error when creating the embedding representations for the utterances below. If in doubt, you can read the number of dimensions from the model itself with `word_embedding_model.vector_size`.

Let's check if the model works.

In [2]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

word1='cat'
word2='dog'
word1_vector=np.array(word_embedding_model[word1]).reshape(1, -1)
word2_vector=np.array(word_embedding_model[word2]).reshape(1, -1)
print(cosine_similarity(word1_vector, word2_vector))

[[0.9590819]]


That is a good score. Let's try something that is different.

In [3]:
word1='cat'
word2='car'
word1_vector=np.array(word_embedding_model[word1]).reshape(1, -1)
word2_vector=np.array(word_embedding_model[word2]).reshape(1, -1)
print(cosine_similarity(word1_vector, word2_vector))

[[0.74210423]]


At least it scores much lower, although it is not clear what such a score really means. All is relative!
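To get a feeling for what the score expresses, it can help to compute cosine similarity by hand. The following is a minimal plain-Python sketch with invented two-dimensional vectors; the helper name `cosine` is hypothetical and stands in for the sklearn function used above.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Same direction, different length: the score is (close to) 1.
same_direction = cosine([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: the score is 0.
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])
# Opposite direction: the score is (close to) -1.
opposite = cosine([1.0, 2.0], [-1.0, -2.0])

print(same_direction, orthogonal, opposite)
```

The score only depends on the direction of the vectors, not on their length. In a trained embedding model most word pairs score well above 0, which is why individual scores are best interpreted relative to the scores of other pairs in the same model.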

## 2. Loading the emotion data  <a class="anchor" id ="section2"></a> 

Just as with the previous notebook, we load the training and test data from the MELD data set. This code is the same as before.

In [4]:
import pandas as pd
import nltk

filepath = './data/MELD/train_sent_emo.csv'
dftrain = pd.read_csv(filepath)
### The data contains some characters with encoding problems (Windows-1252 quotes, dashes and ellipses).
### The next code replaces these with an apostrophe. We pass regex=True explicitly because the pattern
### uses '|' for alternatives; recent pandas versions treat the pattern as a literal string by default.
dftrain['Utterance'] = dftrain['Utterance'].str.replace("\x92|\x97|\x91|\x93|\x94|\x85", "'", regex=True)

filepath = './data/MELD/test_sent_emo.csv'
dftest = pd.read_csv(filepath)
dftest['Utterance'] = dftest['Utterance'].str.replace("\x92|\x97|\x91|\x93|\x94|\x85", "'", regex=True)


We cannot rely on the CountVectorizer function because we do not want to create a BoW representation; instead, we want to get the embedding for each word in the utterance. We are therefore going to tokenize each utterance first. For this, we define a tokenize function using NLTK.

In [5]:
def tokenize_data(text):
    ### loop over the utterances and tokenize each one with NLTK
    text_tokens = []
    for utterance in text:
        text_tokens.append(nltk.tokenize.word_tokenize(utterance))
    return text_tokens

We apply the tokenization function and create the data for training and testing.

In [6]:
### Tokenize the training utterances
training_instances = tokenize_data(dftrain['Utterance'])

### We apply the same function to the emotion labels; each label becomes a one-token list such as ['neutral']
training_labels = tokenize_data(dftrain['Emotion'])

### Tokenize the test utterances
test_instances = tokenize_data(dftest['Utterance'])

### Same for the emotion labels of the test data
test_labels = tokenize_data(dftest['Emotion'])


In [8]:
print(test_instances[0:5])

[['Why', 'do', 'all', 'you', "'re", 'coffee', 'mugs', 'have', 'numbers', 'on', 'the', 'bottom', '?'], ['Oh', '.', 'That', "'s", 'so', 'Monica', 'can', 'keep', 'track', '.', 'That', 'way', 'if', 'one', 'on', 'them', 'is', 'missing', ',', 'she', 'can', 'be', 'like', ',', "'Where", "'s", 'number', '27', '?', '!', "'"], ["Y'know", 'what', '?'], ['Come', 'on', ',', 'Lydia', ',', 'you', 'can', 'do', 'it', '.'], ['Push', '!']]


Our text data is now a list of lists of tokens:

In [9]:
training_instances[0]

['also',
 'I',
 'was',
 'the',
 'point',
 'person',
 'on',
 'my',
 'company',
 "'s",
 'transition',
 'from',
 'the',
 'KL-5',
 'to',
 'GR-6',
 'system',
 '.']

## 3. Preparing the training and test data  <a class="anchor" id ="section3"></a> 

The following imports are needed again to create a classifier:

In [10]:
import sklearn
import numpy as np
import nltk
from nltk.corpus import stopwords

In the previous notebook, we used CountVectorizer to obtain the full vocabulary of the data set and to generate a bag-of-words vector for each utterance. In these vectors, each slot represents a word: a value of '1' or higher indicates that the word is present in the utterance and a '0' means absence. This results in large and sparse vector representations for each utterance. We have also seen that we can weight the relevance of a word using TF-IDF. This results in equally large and sparse vectors, but the weights are more subtle. The downside is sparseness, lack of generalisation and lack of robustness when dealing with new unseen data (words from the test set that are not in the training data are not represented).
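The lack of robustness to unseen words can be illustrated with a tiny bag-of-words sketch in plain Python. The utterances and the helper name `bow_vector` are invented for illustration; CountVectorizer does the equivalent work at scale.

```python
from collections import Counter

# Toy training data; the vocabulary is built from the training utterances only.
training_utterances = [["i", "am", "happy"], ["i", "am", "sad"]]
vocabulary = sorted({token for utterance in training_utterances for token in utterance})

def bow_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the tokens."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

seen = bow_vector(["i", "am", "happy", "happy"], vocabulary)
unseen = bow_vector(["i", "am", "glad"], vocabulary)

print(vocabulary)  # one slot per training word
print(seen)        # 'happy' is counted twice
print(unseen)      # 'glad' has no slot, so it leaves no trace in the vector
```

The word *glad* is simply lost, although it is close in meaning to *happy*. With embeddings, *glad* still contributes a vector that is similar to the one for *happy*, provided the embedding model knows the word.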

In the following, we are going to represent the utterances by an embedding representation. We take the word embedding of each token in the utterance, add these together, and then take the average. All the embeddings have the same number of dimensions in the same order, so if two tokens have a high weight for one dimension, their co-occurrence in an utterance reinforces that weight. Note that by taking the average rather than the sum, we normalize for the length of the utterance. Finally, note that the order of the tokens is not relevant, just as with a bag-of-words representation.
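Both properties, length normalisation and insensitivity to word order, can be checked with a small plain-Python sketch. The two-dimensional vectors and the helper name `mean_vector` are invented for illustration.

```python
def mean_vector(vectors):
    """Average a list of equally sized vectors, dimension by dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Invented 2-dimensional embeddings for three words.
dog = [1.0, 0.0]
bites = [0.0, 1.0]
man = [0.5, 0.5]

dog_bites_man = mean_vector([dog, bites, man])
man_bites_dog = mean_vector([man, bites, dog])
# The same utterance repeated twice is twice as long but has the same average.
repeated = mean_vector([dog, bites, man, dog, bites, man])

print(dog_bites_man, man_bites_dog, repeated)
```

All three averages are identical, which shows both the strength of the approach (utterances of different lengths are comparable) and its weakness (*dog bites man* and *man bites dog* cannot be told apart).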

Before we create the embedding representation for the utterances, we can filter words using the NLTK stopword list and their frequency, to make the text representations more comparable with those in our previous notebook. In the case of CountVectorizer, this was done for us through the parameter settings.

### 3.1 The vocabulary of the training data: creating a frequent word list

As sklearn offers no function like CountVectorizer for embeddings, we need to write our own. You may wonder why there is no such function. Well, sklearn is already an older package and word embeddings are a more recent technique.

So we need to write our own code, which you find in the cell below. The code first loops over all utterances to collect a list of all tokens in the training set.

Next, we apply *Counter* from the Python *collections* module to this list. The result behaves like a Python *dictionary*, a data structure with keys and values: the key is the word and the value is its frequency count. Using this dictionary, we derive the list of words that are more frequent than a threshold, which we should set in the same way as for CountVectorizer. In the code below, we set the threshold to 4, so a word must occur more than 4 times to be kept.

The end result is a list of words stored in the variable *frequent_keywords* that has all the words from the training data above the frequency threshold.

In [11]:
##### This code creates a list of words above the preset frequency threshold

from collections import Counter
frequency_threshold = 4
frequent_keywords = []

####  We first collect all tokens from all the utterances in the training data using the NLTK tokenizer
alltokens = []
for utterance in dftrain['Utterance']:
    tokenlist = nltk.tokenize.word_tokenize(utterance)
    for token in tokenlist:
        alltokens.append(token)
        
#### The Counter function will create a frequency count of all the items. The result is a dictionary       
kw_counter = Counter(alltokens)

#### We now loop over the dictionary with counts to get the word and the frequency value
for word, count in kw_counter.items():
    if count>frequency_threshold:
        frequent_keywords.append(word)
        
print('Nr of words above the frequency threshold', len(frequent_keywords))
print('Frequency threshold', frequency_threshold)

Nr of words above the frequency threshold 1349
Frequency threshold 4


We thus have 1349 distinct tokens that occur more than 4 times in the training data!

You can imagine how you could filter this list in other ways:

<ul>
    <li>remove stopwords
    <li>only keep words with certain parts-of-speech
    <li>only keep words with certain syntactic dependencies
</ul>

Think about these options and whether they would work for capturing the emotion expressed in an utterance or not. Also think about possible research questions. If we want to compare emotion classification using a BoW representation with an averaged embedding representation, we may want to use a similar filtering of the words across the experiments.
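As an illustration of the first option, the following plain-Python sketch combines a frequency threshold with stopword removal. The token list, the threshold and the tiny stopword set are invented for illustration; in the notebook the stopwords would come from NLTK.

```python
from collections import Counter

# Invented toy data: one flat list of tokens from all utterances.
tokens = ["i", "love", "the", "the", "the", "love", "i", "i", "love", "joy", "joy"]
frequency_threshold = 2
hypothetical_stop_words = {"the", "i"}  # stand-in for the NLTK stopword list

counts = Counter(tokens)

# Keep words that occur MORE often than the threshold (strictly greater).
frequent = {word for word, count in counts.items() if count > frequency_threshold}

# Remove the stopwords from the frequent words.
filtered = frequent - hypothetical_stop_words

print(sorted(frequent))
print(sorted(filtered))
```

Using sets rather than lists here is a design choice: membership tests on a set are fast, which matters when every token of every utterance is checked against the word list. Note that *joy*, the most emotional word in this toy example, is dropped by the frequency threshold, which is exactly the kind of effect to think about when choosing a filter.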

### 3.2 Getting embeddings for the utterances

Next, we want to look up every word of our utterance in our word embedding model to obtain the vector representation of that word. We then take the average over all the tokens by summing the weights and dividing them by the number of words.

Before we look up a word in the embedding model, we check that it belongs to the frequent word list and, optionally, that it is not a stopword. Otherwise, we ignore the word.

Obviously, words that are not in the vocabulary of the embedding model will not play a role either. So it matters what type of text the embedding model is trained on: news, Wikipedia, Twitter, spoken dialogues. To know what the coverage is of the embedding model for our utterances in the MELD data set, we keep track of all unknown words and all known words in a separate list.

We are going to define two customized functions using 'def' to create an embedding representation for each utterance. 
These functions are based on: https://www.kaggle.com/varun08/sentiment-analysis-using-word2vec
and adapted for our purposes.

The first function, called 'featureVecMethod', takes as input parameters: 1) the words of the utterance, 2) the frequent word list, 3) the stopwords, 4) the embedding model, 5) a fast index on the vocabulary of the embedding model and 6) the number of dimensions of the model (*num_embedding_dimensions*), which determines the size of the vector. We pass our variable *num_features* for this last parameter. If its value is different from the actual number of dimensions of the word embedding model, you will get an error.

The overall steps of the function are as follows:

<ol>
    <li>create a Numpy array *featureVec* with as many slots as the number of embedding dimensions and fill these with zeros. This is an initialisation of the text embedding that we will produce.
    <li>we iterate over all the words in an utterance (*list_of_words_in_the_utterance*), keep only the frequent words that are not stopwords, and check if they are in *modelword_index* (first as is, then lower cased):
        <ol>
            <li>if so, we take the embedding, scale it to unit length and add it to *featureVec*. After this normalisation every value lies between -1 and 1, so that embeddings with extreme values do not dominate the average.
            <li>if not, the word does not contribute and *featureVec* stays as it is
            <li>while iterating, we store the words with embedding values in the *embeddings_words* list and the others in the *no_embeddings_words* list, just for further analysis.
        </ol>
    <li>after the iteration, we divide the values by the number of matched words (*nwords*) in the utterance to normalise the score for the length of the utterance
</ol>
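The normalisation in step 2 can be looked at in isolation. The following plain-Python sketch divides a vector by its length (its L2 norm); the helper name `unit_length` and the example vectors are invented for illustration, and the guard against an all-zero vector is an addition of this sketch.

```python
import math

def unit_length(vector):
    """Scale a vector so that its length (L2 norm) becomes 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)  # an all-zero vector cannot be scaled
    return [x / norm for x in vector]

# Two vectors pointing in the same direction, one ten times larger.
small = unit_length([3.0, 4.0])
large = unit_length([30.0, 40.0])

print(small, large)
```

After scaling, both vectors are the same, so a word with a large embedding no longer outweighs the other words in the average; only the direction of each embedding counts.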



In [12]:
# Function to average all word vectors in an utterance
# The result is a single vector with num_embedding_dimensions for the whole utterance
def featureVecMethod(list_of_words_in_the_utterance, # Tokenized list of tokens from an utterance
                     frequent_keywords, # List of words above the frequency threshold
                     stop_words, # Stopwords that should be skipped
                     model, # The actual word embeddings model
                     modelword_index, # An index on the vocabulary of the model to speed up lookup
                     num_embedding_dimensions # the number of dimensions of the embedding model
                    ):
    # Pre-initialise a numpy array (np) of zeros, one slot per embedding dimension
    featureVec = np.zeros(num_embedding_dimensions,dtype="float32")
    
    ## A counter for the number of tokens represented so that we can take the average by dividing the sum by nwords
    nwords = 0
    
    ## Lists to get some statistics on how well the vocabulary of the model matches the vocabulary of the utterances
    embeddings_words = []
    no_embeddings_words = []
    
    for word in  list_of_words_in_the_utterance:
        #### we only use words that are frequent and not stopwords
        if word in frequent_keywords and word not in stop_words:         
            if word in modelword_index:
                ## The next function would directly take the values from the model
                #featureVec = np.add(featureVec,model[word])
                
                ### Instead of simply adding the raw embedding, we first divide it by its
                ### length (L2 norm). Every value then lies in [-1, 1], so that words with
                ### large embeddings do not dominate the average. Don't worry about the specifics.
                featureVec = np.add(featureVec,model[word]/np.linalg.norm(model[word]))

                #we keep track of the words detected, just for analysing the data
                embeddings_words.append(word)
                nwords = nwords + 1
            else:
                word = word.lower()
                if word in modelword_index:
                    #featureVec = np.add(featureVec,model[word])
                    featureVec = np.add(featureVec,model[word]/np.linalg.norm(model[word]))
                    
                    #we keep track of the words detected, just for analysing the data
                    embeddings_words.append(word)
                    nwords = nwords + 1
                else:
                    #we keep track of the unknown words to see how well our model fits the data
                    no_embeddings_words.append(word)
                    
    # Dividing the result by number of words in each utterance to get average
    if nwords>0:
        featureVec = np.divide(featureVec, nwords)
    
    #Our function returns 3 results: the averaged vectors, the list of words matched and the list of words not matched with the embedding model
    return featureVec, embeddings_words, no_embeddings_words

The above function *featureVecMethod* only works for the list of tokens of a single utterance. The next function deals with all the data and creates the list of input vectors. This function calls the previous function and needs the same input parameters.

In [13]:
# Function for calculating the average feature vector
def getAvgFeatureVecs(texts, ### List of texts, in our case tokenized utterances
                      keywords, 
                      stopwords, 
                      model, 
                      modelword_index, 
                      num_embedding_dimensions
                     ):
    counter = 0
    embeddings_words=[]
    no_embeddings_words=[]

    #### we initialise a numpy matrix with the number of rows defined by the lengths of the texts (all utterances) and the columns the number of dimensions. 
    #### All cells will have zeros of the type float32
    textFeatureVecs = np.zeros((len(texts),num_embedding_dimensions),dtype="float32")
    print('Shape of our matrix is:',textFeatureVecs.shape)
    
    #### We iterate over all the texts
    for text in texts:
        # Printing a status message every 1000th text, to see what we are processing
        if counter%1000 == 0:
            print("Review %d of %d"%(counter,len(texts)))
        ### Each text is transformed into a vector representation based on the averaged token embedding using the previous function
        ### We add these vectors to the total matrix
        textFeatureVecs[counter], emb_words, nemb_words = featureVecMethod(text, keywords, stopwords, model, modelword_index,num_embedding_dimensions)
        counter = counter+1
        ### We add the matched words and not matched words from this utterance to the total list
        embeddings_words.extend(emb_words)
        no_embeddings_words.extend(nemb_words)
        
    #### Due to the normalisation and averaging, there could be infinite values or NaN values. The next numpy function turns NaN into 0 and infinite values into very large finite numbers
    textFeatureVecs = np.nan_to_num(textFeatureVecs) 
    print('Shape of our matrix is:',textFeatureVecs.shape)

    
    ### Again this function returns three results: the embedding matrix, the matched words and not matched words
    return textFeatureVecs, embeddings_words, no_embeddings_words


Now back to our input data. We extracted all training and test utterances from the Pandas data frames and tokenized them. Next, we apply our two functions to turn these into averaged embedding representations.

In [14]:
# Converting index_to_key, which is a list, to a set for better speed in the execution.
# This allows for quicker lookup of whether a word exists in the model
index2word_set = set(word_embedding_model.index_to_key)
# Here we take the stopwords from NLTK
stop_words = set(stopwords.words('english'))
# If you did not use stopwords for the bag-of-words with CountVectorizer, do the same here by using an empty list.
# Comment out the next line if you do want to filter stopwords.
stop_words = []

In [15]:
#### We use two word lists to keep track of the words that are also in the embedding model and words that are not

known_words = []  # utterance words that are also in the embedding model
unknown_words =[] # utterance words that are not in the embedding model


# Calculating average feature vector for training set

### The full list of utterances is passed to our customized function, with the frequent_keywords list, the stopwords, the embedding model, 
### the index and the number of features that indicates the dimensions of the model
trainDataVecs, known_words, unknown_words = getAvgFeatureVecs(training_instances, frequent_keywords, stop_words, word_embedding_model, index2word_set, num_features)


Shape of our matrix is: (9989, 25)
Review 0 of 9989
Review 1000 of 9989
Review 2000 of 9989
Review 3000 of 9989
Review 4000 of 9989
Review 5000 of 9989
Review 6000 of 9989
Review 7000 of 9989
Review 8000 of 9989
Review 9000 of 9989
Shape of our matrix is: (9989, 25)


Let's inspect our training data a bit more. The list of vectors has as many elements as the number of training utterances that you loaded:

In [16]:
len(trainDataVecs)

9989

We can inspect the first element in the list:

In [17]:
print('Vector length', len(trainDataVecs[0]))
print (trainDataVecs[0])

Vector length 25
[ 1.8404325e-04 -7.5551472e-04 -1.4443292e-02 -3.0844459e-02
 -1.2221386e-02 -8.0029191e-03  1.8683831e-01 -2.6975105e-02
  5.7758363e-03  1.6964663e-02  3.4266456e-03  5.5171683e-02
 -8.6039120e-01  1.4299476e-02  2.7869504e-02 -5.0686691e-03
  5.4254908e-02 -1.3258809e-02 -6.5059699e-02 -4.5629501e-02
 -7.8818370e-03 -1.5575601e-02 -5.5708094e-03 -1.2840188e-02
 -4.9684074e-02]


It is simply a list of numbers, each representing the averaged weight of the tokens or words that made up the utterance. We can check the length, which should be '300', '200', '100', '50' or '25', etc., depending on the number of dimensions of the embedding model that you used.

There are three major differences with the bag-of-tokens that we used in the previous notebook:

1. the vectors are short
2. there are hardly any zeros (only an utterance without any matched word keeps the all-zero initialisation)
3. the numbers do not represent words but are averaged weights learned by a neural network

Instead of *large sparse* vectors, we now have *short dense* vectors representing each utterance. Whereas in the previous representation each slot in the vector corresponds to a token, now each slot is an averaged weight from the hidden layer that was trained to predict other words in the context.

This is true for each utterance, each having a unique set of values for the same hidden layer weights. These weights now represent the meaning of the utterance for a machine, which can use a similarity function such as cosine similarity to measure the degree of equivalence across these representations. When we inspect any other utterance, we see it is represented in a similar way.

In [18]:
print(len(trainDataVecs[1000]))
print(trainDataVecs[1000])

25
[ 0.01521032 -0.07460752 -0.03114454 -0.04025642 -0.02689467  0.13526958
  0.2637765   0.11633199 -0.09245587  0.04048015 -0.06832556 -0.00699742
 -0.785374    0.01586739  0.02801524 -0.09145445  0.00946591 -0.07876154
 -0.11224737 -0.1004042   0.00956066  0.03710267 -0.11011902  0.14089808
 -0.00798917]


Since the vectors are compatible, we can compare them in the same way as we did before for the word embeddings of *cat* and *dog*:

In [19]:
word1_vector=np.array(trainDataVecs[0]).reshape(1, -1)
word2_vector=np.array(trainDataVecs[1000]).reshape(1, -1)
print(word1_vector)
print(cosine_similarity(word1_vector, word2_vector))

[[ 1.8404325e-04 -7.5551472e-04 -1.4443292e-02 -3.0844459e-02
  -1.2221386e-02 -8.0029191e-03  1.8683831e-01 -2.6975105e-02
   5.7758363e-03  1.6964663e-02  3.4266456e-03  5.5171683e-02
  -8.6039120e-01  1.4299476e-02  2.7869504e-02 -5.0686691e-03
   5.4254908e-02 -1.3258809e-02 -6.5059699e-02 -4.5629501e-02
  -7.8818370e-03 -1.5575601e-02 -5.5708094e-03 -1.2840188e-02
  -4.9684074e-02]]
[[0.91521704]]


For training, we use the same labels as before:

In [20]:
print(training_labels[0], training_labels[1000])

['neutral'] ['joy']


So now we have a numeric representation of each text, based on the embeddings of the words. We feed this to a classifier in the same way as we did in the previous notebooks with the CountVectorizer output.

Before we can train the classifier, we want to convert the labels to numeric values as we did before, so that we can use the evaluation report functions. Note that this is not necessary to train and use the classifier.
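As a reminder of what such a conversion involves, here is a minimal plain-Python sketch. It assumes labels in the one-token-list form printed above (such as `['neutral']`); the variable names are invented, and the notebook's own conversion may use a different encoder.

```python
# Invented example labels in the same shape as training_labels above.
tokenized_labels = [["neutral"], ["joy"], ["neutral"], ["anger"]]

# Each label is a one-token list, so we take the first element.
flat_labels = [label[0] for label in tokenized_labels]

# Assign a number to each class, in alphabetical order.
classes = sorted(set(flat_labels))
label_to_id = {label: index for index, label in enumerate(classes)}

numeric_labels = [label_to_id[label] for label in flat_labels]

print(label_to_id)
print(numeric_labels)
```

Keeping the `label_to_id` mapping around makes it possible to translate numeric predictions back into emotion names when reading the evaluation report.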

### 3.3 How good is the coverage of the embedding model for the training data?

Before we do that, it may be good to check which words are not in the embedding model and therefore do not contribute to the representation of the utterance. In the above function, we kept track of the unknown words. Now we can inspect this list. We use the *Counter* function to get a frequency count of these words.

In [27]:
from collections import Counter

print('Training data vocabulary statistics:')
print('Frequency threshold', frequency_threshold)
print(unknown_words[0:10])
print(type(unknown_words))
unknown_words_count = Counter(unknown_words)
print('Proportion of unknown tokens', len(unknown_words)/(len(unknown_words)+len(known_words)))
print('Number of unknown word types',len(unknown_words_count))
print('Number of unknown word tokens:', len(unknown_words))
print('Unknown words counts')
print(unknown_words_count)

Training data vocabulary statistics:
Frequency threshold 4
['30', 'no-no-no-no', '...', "y'know", '...', "y'know", '...', "y'know", 'no-no-no', '...']
<class 'list'>
Proportion of unknown tokens 0.009145940920253458
Number of unknown word types 27
Number of unknown word tokens: 905
Unknown words counts
Counter({"y'know": 322, '...': 301, '..': 42, '....': 36, 'and-and': 26, 'no-no-no': 18, "i'm-i": 18, 'hey-hey': 14, 'that-that': 12, 'we-we': 10, '10': 9, 'no-no-no-no': 8, '2': 8, '8': 8, '30': 7, 'yeah-yeah': 7, "'kay": 6, '.....': 6, "it's-it": 6, 'oh-ho': 6, "that's-that": 5, '25': 5, 'um-hmm': 5, '40': 5, '7': 5, 'the-the': 5, 'heldi': 5})


We can see that the proportion of unknown words is low: only 27 word types, making up less than 1% of the tokens, could not be represented. But note that we considered only words above the frequency threshold.
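
The proportion is the number of unknown tokens divided by the number of all tokens that were looked up. The sketch below first recomputes the reported figure from the two token counts (the count of known tokens is the one printed by the next cell) and then repeats the bookkeeping on a toy example. The names `toy_vocabulary` and `toy_tokens` are invented stand-ins for the embedding vocabulary and the training tokens.

```python
from collections import Counter

# Recompute the reported proportion from the token counts in this notebook.
reported_proportion = 905 / (905 + 98046)
print(reported_proportion)

# Toy example: a tiny "embedding vocabulary" and a short token list (both invented).
toy_vocabulary = {'i', 'am', 'happy', 'you', 'know'}
toy_tokens = ['i', 'am', 'happy', "y'know", '...', 'you', 'know', "y'know"]

# Split the tokens into those with and those without an embedding.
toy_known = [token for token in toy_tokens if token in toy_vocabulary]
toy_unknown = [token for token in toy_tokens if token not in toy_vocabulary]

toy_proportion = len(toy_unknown) / (len(toy_unknown) + len(toy_known))
print(toy_proportion)
# Counter turns the token list into type frequencies.
print(Counter(toy_unknown).most_common())
```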

If you happen to use the original Google Word2Vec model, this can be about 35% of all the tokens. Different models give different results, depending on how well the genre of their training data matches our data.

We also kept track of the *known* words, so let's check these as well:

In [28]:
known_words_count = Counter(known_words)
print('Number of known word types',len(known_words_count))
print('Number of known word tokens:', len(known_words))
#print(known_words_count)

Number of known word types 1167
Number of known word tokens: 98046


This confirms that a decent number of word types and tokens is represented. For comparison, all tokens are represented in the case of the Bag-of-Words representation of tokens. The Bag-of-Words token representation is less semantic and has the risk of overfitting on the training data. The embedding representation is more semantic but less tuned to the training data.
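
The difference between the two representations can be made concrete with a small sketch. It assumes scikit-learn's `CountVectorizer` with default settings (which drops one-letter tokens such as "i") and uses random vectors as a stand-in for a real embedding model, so the numbers themselves carry no meaning. All names starting with `toy_` are invented for this illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

toy_texts = ['i am so happy', 'i am so sad', 'dogs scare me']

# Sparse bag-of-words: one dimension per vocabulary word, mostly zeros.
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(toy_texts)
bow_first = bow_matrix[0].toarray().ravel()
print('BoW vector:', bow_first, 'zeros:', int(np.sum(bow_first == 0)))

# Dense embedding: a few dimensions that (almost) all carry a value.
# Random vectors stand in for a real embedding model here.
rng = np.random.default_rng(0)
toy_embeddings = {word: rng.normal(size=4) for word in sorted(vectorizer.vocabulary_)}
dense_first = np.mean([toy_embeddings[w] for w in ['am', 'so', 'happy']], axis=0)
print('Averaged embedding:', dense_first, 'zeros:', int(np.sum(dense_first == 0)))
```

The bag-of-words vector grows with the vocabulary, while the averaged embedding keeps the same small size no matter how many different words occur in the data.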

## 3.4 Representing the test data

Next we represent the test data with the same methods to make it compatible.

In [29]:
print(len(test_instances))
print(test_instances[0])
print(test_instances[:5])

2610
['Why', 'do', 'all', 'you', "'re", 'coffee', 'mugs', 'have', 'numbers', 'on', 'the', 'bottom', '?']
[['Why', 'do', 'all', 'you', "'re", 'coffee', 'mugs', 'have', 'numbers', 'on', 'the', 'bottom', '?'], ['Oh', '.', 'That', "'s", 'so', 'Monica', 'can', 'keep', 'track', '.', 'That', 'way', 'if', 'one', 'on', 'them', 'is', 'missing', ',', 'she', 'can', 'be', 'like', ',', "'Where", "'s", 'number', '27', '?', '!', "'"], ["Y'know", 'what', '?'], ['Come', 'on', ',', 'Lydia', ',', 'you', 'can', 'do', 'it', '.'], ['Push', '!']]


In [30]:
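# getAvgFeatureVecs returns a tuple (vectors, known_words, unknown_words),
# so testDataVecs[0] holds the matrix with the averaged vectors.
testDataVecs = getAvgFeatureVecs(test_instances, frequent_keywords, stop_words, word_embedding_model, index2word_set, num_features) 

print('Test data vocabulary statistics:')
print('Frequency threshold', frequency_threshold)

# NOTE: known_words and unknown_words are not reassigned in this cell, so the
# statistics printed below still describe the training data (compare the output above).
unknown_words_count = Counter(unknown_words)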
print('Proportion of unknown tokens', len(unknown_words)/(len(unknown_words)+len(known_words)))
print('Number of unknown word types',len(unknown_words_count))
print('Number of unknown word tokens:', len(unknown_words))
print('Unknown words counts')
print(unknown_words_count)

known_words_count = Counter(known_words)
print('Number of known word types:',len(known_words_count))
print('Number of known word tokens:', len(known_words))
#print(known_words_count)

Shape of our matrix is: (2610, 25)
Review 0 of 2610
Review 1000 of 2610
Review 2000 of 2610
Test data vocabulary statistics:
Frequency threshold 4
Proportion of unknown tokens 0.009145940920253458
Number of unknown word types 27
Number of unknown word tokens: 905
Unknown words counts
Counter({"y'know": 322, '...': 301, '..': 42, '....': 36, 'and-and': 26, 'no-no-no': 18, "i'm-i": 18, 'hey-hey': 14, 'that-that': 12, 'we-we': 10, '10': 9, 'no-no-no-no': 8, '2': 8, '8': 8, '30': 7, 'yeah-yeah': 7, "'kay": 6, '.....': 6, "it's-it": 6, 'oh-ho': 6, "that's-that": 5, '25': 5, 'um-hmm': 5, '40': 5, '7': 5, 'the-the': 5, 'heldi': 5})
Number of known word types: 1167
Number of known word tokens: 98046


In [31]:

print(trainDataVecs[0])
print(trainDataVecs.shape)

print(testDataVecs[0])
print(testDataVecs[0].shape)

[ 1.8404325e-04 -7.5551472e-04 -1.4443292e-02 -3.0844459e-02
 -1.2221386e-02 -8.0029191e-03  1.8683831e-01 -2.6975105e-02
  5.7758363e-03  1.6964663e-02  3.4266456e-03  5.5171683e-02
 -8.6039120e-01  1.4299476e-02  2.7869504e-02 -5.0686691e-03
  5.4254908e-02 -1.3258809e-02 -6.5059699e-02 -4.5629501e-02
 -7.8818370e-03 -1.5575601e-02 -5.5708094e-03 -1.2840188e-02
 -4.9684074e-02]
(9989, 25)
[[ 0.01237815  0.03665548  0.01498249 ... -0.01772512  0.04217197
  -0.01802435]
 [ 0.0344808  -0.01148803  0.01395976 ... -0.03122862  0.05681234
  -0.01942558]
 [ 0.14661889 -0.00520714  0.00391321 ... -0.12780306  0.16679971
  -0.02236555]
 ...
 [ 0.01442704  0.01896031  0.00028988 ... -0.01993479  0.03729849
  -0.00068553]
 [-0.01111353  0.05922466  0.0477396  ... -0.02336988  0.05264325
  -0.02816456]
 [-0.02504919 -0.01390513  0.00310311 ... -0.00509816  0.0832886
  -0.04034706]]
(2610, 25)


The next steps are the same as for the previous notebook, except that we pass the embedding representations of the training data.

It is nice to see that you can feed these machines anything represented as numeric vectors, no matter how they are derived or created. You could even type in these numbers yourself, although that would take a while.
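
To illustrate that the classifier really only sees numbers, here is a minimal sketch in which the vectors are typed in by hand. The values and labels are invented and have nothing to do with the MELD data.

```python
from sklearn import svm

# Four hand-typed 2-dimensional "text representations" with invented labels.
hand_typed_vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
hand_typed_labels = ['joy', 'joy', 'sadness', 'sadness']

toy_clf = svm.LinearSVC(max_iter=2000)
toy_clf.fit(hand_typed_vectors, hand_typed_labels)

# Two new hand-typed vectors, each close to one of the two groups above.
toy_predictions = toy_clf.predict([[0.85, 0.15], [0.15, 0.85]])
print(toy_predictions)
```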

## 4. Training and applying the model  <a class="anchor" id ="section4"></a> 

In [33]:
from sklearn import svm

# We choose a linear model
svm_linear_clf = svm.LinearSVC(max_iter=2000)
# We train the classifier through the *fit* function, passing the training vectors and the training labels as parameters.
# Each label is a one-element list, so we flatten the labels to a 1-D array to avoid a DataConversionWarning.
svm_linear_clf.fit(trainDataVecs, np.ravel(training_labels))

LinearSVC(max_iter=2000)

In [34]:
# Predict the labels for the test set; testDataVecs[0] is the matrix of averaged vectors
y_pred_svm_linear = svm_linear_clf.predict(testDataVecs[0])

## 5. Generating the test report  <a class="anchor" id ="section5"></a> 

In [36]:
from sklearn.metrics import classification_report

In [38]:
#### this report gives the results for the LINEAR classifier
report = classification_report(test_labels,y_pred_svm_linear,digits = 7)
print(svm_linear_clf.classes_)
print('Embeddings SVM LINEAR ----------------------------------------------------------------')
print('Word embedding model used', wordembeddings)
print('Word frequency threshold', frequency_threshold)
print(report)

['anger' 'disgust' 'fear' 'joy' 'neutral' 'sadness' 'surprise']
Embeddings SVM LINEAR ----------------------------------------------------------------
Word embedding model used glove-twitter-25
Word frequency threshold 4
              precision    recall  f1-score   support

       anger  0.4705882 0.0463768 0.0844327       345
     disgust  0.0000000 0.0000000 0.0000000        68
        fear  0.0000000 0.0000000 0.0000000        50
         joy  0.4370629 0.3109453 0.3633721       402
     neutral  0.5449183 0.9562102 0.6942197      1256
     sadness  0.0000000 0.0000000 0.0000000       208
    surprise  0.5348837 0.1637011 0.2506812       281

    accuracy                      0.5318008      2610
   macro avg  0.2839219 0.2110333 0.1989580      2610
weighted avg  0.4493379 0.5318008 0.4281939      2610





Remember the results from the notebook where we trained Naive Bayes and SVM classifiers on the bag-of-words (TF-IDF) representations of the words? Take some time to compare the results and think about the differences.

```
BoW TFIDF Naive Bayes  --------------------------------------------------------------
Word freqeuncy threshold 5
              precision    recall  f1-score   support

       anger  0.6111111 0.0318841 0.0606061       345
     disgust  0.0000000 0.0000000 0.0000000        68
        fear  0.0000000 0.0000000 0.0000000        50
         joy  0.5291005 0.2487562 0.3384095       402
     neutral  0.5353535 0.9705414 0.6900651      1256
     sadness  1.0000000 0.0144231 0.0284360       208
    surprise  0.6829268 0.2989324 0.4158416       281

    accuracy                      0.5429119      2610
   macro avg  0.4797846 0.2235053 0.2190512      2610
weighted avg  0.5731181 0.5429119 0.4392481      2610

BoW TFIDF SVM LINEAR ----------------------------------------------------------------
Word freqeuncy threshold 5
              precision    recall  f1-score   support

       anger  0.3930636 0.1971014 0.2625483       345
     disgust  0.3333333 0.0588235 0.1000000        68
        fear  0.2000000 0.0200000 0.0363636        50
         joy  0.4796954 0.4701493 0.4748744       402
     neutral  0.6542553 0.8813694 0.7510176      1256
     sadness  0.3484848 0.1105769 0.1678832       208
    surprise  0.4888060 0.4661922 0.4772313       281

    accuracy                      0.5835249      2610
   macro avg  0.4139484 0.3148875 0.3242741      2610
weighted avg  0.5335997 0.5835249 0.5373167      2610
```

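The results are lower than those of the BoW approach. Just as with the NB results, *disgust* and *fear* have zero scores, but now *sadness* scores zero as well.
The macro-averaged recall of the embedding-based representation (0.21) is still close to that of the BoW Naive Bayes classifier (0.22), although it is well below that of the BoW SVM (0.31), and it clearly loses on precision (0.28 versus 0.48 and 0.41).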
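
To see where the averaged scores in the report come from, the sketch below recomputes the macro and weighted average recall of the embedding SVM from the per-class values in its report. The macro average treats every class equally, so the three classes with zero recall pull it down, whereas the weighted average is dominated by the large *neutral* class.

```python
import numpy as np

# Per-class recall and support, copied from the embedding SVM report in this notebook.
# Order: anger, disgust, fear, joy, neutral, sadness, surprise
recall = np.array([0.0463768, 0.0, 0.0, 0.3109453, 0.9562102, 0.0, 0.1637011])
support = np.array([345, 68, 50, 402, 1256, 208, 281])

# Macro average: plain mean over the classes.
macro_recall = recall.mean()
# Weighted average: mean weighted by the number of test instances per class.
weighted_recall = np.average(recall, weights=support)

print('macro recall   ', round(macro_recall, 4))
print('weighted recall', round(weighted_recall, 4))
```

Note that the weighted average recall is the same number as the accuracy, because both count the correctly classified instances over all test instances.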

Think about our hypothesis. We see that it is not confirmed by this experiment.

Can you think of possible explanations?

Just as before we can generate a confusion matrix to see if there is a pattern:

In [40]:
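from sklearn.metrics import confusion_matrix

print('Confusion matrix')
print(svm_linear_clf.classes_)
# rows are the gold labels, columns the predicted labels, both in the order of classes_
print(confusion_matrix(test_labels,y_pred_svm_linear))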

Confusion matrix
['anger' 'disgust' 'fear' 'joy' 'neutral' 'sadness' 'surprise']
[[  16    0    0   55  262    0   12]
 [   1    0    0    7   59    0    1]
 [   1    0    0    6   43    0    0]
 [   3    0    0  125  268    0    6]
 [   5    0    0   30 1201    0   20]
 [   2    0    0   12  193    0    1]
 [   6    0    0   51  178    0   46]]


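This does not show a different pattern than we have seen before: most errors are again utterances that are wrongly classified as *neutral*. In this run, *disgust*, *fear* and *sadness* are never predicted at all.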
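
The confusion matrix contains everything needed to recompute the per-class scores of the report. The sketch below copies the matrix from the output above (rows are gold labels, columns are predictions) and derives recall and precision from it. For a class that is never predicted, precision is undefined; as an assumption that mirrors the zeros in the report, we set it to 0 here.

```python
import numpy as np

classes = ['anger', 'disgust', 'fear', 'joy', 'neutral', 'sadness', 'surprise']
# Confusion matrix copied from the output above: rows = gold, columns = predicted.
cm = np.array([
    [16, 0, 0,  55,  262, 0, 12],
    [ 1, 0, 0,   7,   59, 0,  1],
    [ 1, 0, 0,   6,   43, 0,  0],
    [ 3, 0, 0, 125,  268, 0,  6],
    [ 5, 0, 0,  30, 1201, 0, 20],
    [ 2, 0, 0,  12,  193, 0,  1],
    [ 6, 0, 0,  51,  178, 0, 46],
])

correct = np.diag(cm)            # correctly classified instances per class
gold_totals = cm.sum(axis=1)     # how often each class occurs in the gold labels
pred_totals = cm.sum(axis=0)     # how often each class was predicted

recall = correct / gold_totals
# Avoid division by zero for classes that were never predicted (precision set to 0).
precision = np.divide(correct, pred_totals,
                      out=np.zeros(len(classes)), where=pred_totals > 0)

for name, p, r in zip(classes, precision, recall):
    print('%-9s precision %.4f recall %.4f' % (name, p, r))

# Share of all test utterances that received the label 'neutral'.
neutral_share = pred_totals[classes.index('neutral')] / cm.sum()
print('predicted as neutral: %.4f' % neutral_share)
```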

## 6. Applying the model to new data  <a class="anchor" id ="section6"></a> 

We would like to apply the embedding-based model to our own data, but this works a bit differently: we cannot simply use the 'transform' function to represent the utterances with the bag-of-words representation of the training vocabulary.

What we need to do is create an embedding representation using the same function we used above and assume that our classifier finds sufficient similarity between the embeddings of our data and the correct training data. Note that it matters less whether the words have been observed in the training data, as long as we get embeddings for these words from the embedding model. Embeddings always match something!
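
The following sketch illustrates why a word that never occurred in the training data can still be useful. The three vectors in `toy_embeddings` are invented, and `cosine` is a small hypothetical helper. The point is only that an unseen word ('puppy') ends up close to a related word that was seen in training ('dog') and far from an unrelated one.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors (hypothetical helper)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented 3-dimensional embeddings, for illustration only.
toy_embeddings = {
    'dog': np.array([0.9, 0.1, 0.0]),
    'puppy': np.array([0.8, 0.2, 0.1]),
    'invoice': np.array([0.0, 0.1, 0.9]),
}
# Assume only these words occurred in the training utterances.
training_vocabulary = {'dog', 'invoice'}

unseen_word = 'puppy'
assert unseen_word not in training_vocabulary

# The unseen word still has a vector, so we can compare it with the seen words.
sim_dog = cosine(toy_embeddings[unseen_word], toy_embeddings['dog'])
sim_invoice = cosine(toy_embeddings[unseen_word], toy_embeddings['invoice'])
print('puppy vs dog    ', round(sim_dog, 3))
print('puppy vs invoice', round(sim_invoice, 3))
```

With a bag-of-words representation, 'puppy' would simply be ignored, because it has no dimension in the training vocabulary.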

We use the same set of utterances.

In [41]:
# some utterances
some_chat = ['That is sweet of you',
             'You are so funny',
             'Are you a man or a woman?',
             'Chatbots make me sad and feel lonely.',
             'Your are stupid and boring.',
             'Two thumbs up',
             'I fell asleep halfway through this conversation',
             'Wow, I am really amazed.',
             'You are amazing.',
             'I feel so low being in isolation',
             'People dumping waste are horrible',
             'Its awful that you cannot stop smoking',
             'Dogs scare me',
             'I am afraid I will get sick at work',
             'I run away when I see a dog',
             'When do you start your job?'
            ]


len(some_chat)

16

Next, we define the list of labels that go with our chat.

In [47]:
some_chat_gold = ['joy', 'joy', 'neutral', 'sadness', 'anger', 'joy', 'anger', 'surprise', 'joy', 'sadness', 'disgust', 'disgust', 'fear', 'fear', 'fear', 'neutral']

In [48]:
some_chat_tokens = tokenize_data(some_chat)
some_chat_embedding_vectors = getAvgFeatureVecs(some_chat_tokens, frequent_keywords,stop_words, word_embedding_model, index2word_set, num_features)

Shape of our matrix is: (16, 25)
Review 0 of 16


In [50]:
# have the classifier make a prediction; index [0] selects the vectors from the tuple returned by getAvgFeatureVecs

some_chat_pred = svm_linear_clf.predict(some_chat_embedding_vectors[0])
for review, gold, predicted_label in zip(some_chat, some_chat_gold, some_chat_pred):   
    print('%s => %s, %s' % (review,gold,predicted_label))


That is sweet of you => joy, joy
You are so funny => joy, joy
Are you a man or a woman? => neutral, neutral
Chatbots make me sad and feel lonely. => sadness, neutral
Your are stupid and boring. => anger, neutral
Two thumbs up => joy, neutral
I fell asleep halfway through this conversation => anger, neutral
Wow, I am really amazed. => surprise, neutral
You are amazing. => joy, joy
I feel so low being in isolation => sadness, neutral
People dumping waste are horrible => disgust, neutral
Its awful that you cannot stop smoking => disgust, neutral
Dogs scare me => fear, anger
I am afraid I will get sick at work => fear, neutral
I run away when I see a dog => fear, neutral
When do you start your job? => neutral, neutral


In [52]:
report = classification_report(some_chat_gold,some_chat_pred,digits = 7)
print('Embedding SVM LINEAR ON EXTERNAL DATA -----------------------------')
print('Word embedding model used', wordembeddings)
print('Word frequency threshold', frequency_threshold)
print(report)

Embedding SVM LINEAR ON EXTERNAL DATA -----------------------------
Word embedding model used glove-twitter-25
Word frequency threshold 4
              precision    recall  f1-score   support

       anger  0.0000000 0.0000000 0.0000000         2
     disgust  0.0000000 0.0000000 0.0000000         2
        fear  0.0000000 0.0000000 0.0000000         3
         joy  1.0000000 0.7500000 0.8571429         4
     neutral  0.1666667 1.0000000 0.2857143         2
     sadness  0.0000000 0.0000000 0.0000000         2
    surprise  0.0000000 0.0000000 0.0000000         1

    accuracy                      0.3125000        16
   macro avg  0.1666667 0.2500000 0.1632653        16
weighted avg  0.2708333 0.3125000 0.2500000        16





Now compare the results again with those from the BoW SVM of the previous notebook. 

```
BOW SVM LINEAR ----------------------------------------------------------------
Word freqeuncy threshold 5
              precision    recall  f1-score   support

       anger  0.3333333 0.5000000 0.4000000         2
     disgust  0.0000000 0.0000000 0.0000000         2
        fear  0.0000000 0.0000000 0.0000000         3
         joy  1.0000000 0.7500000 0.8571429         4
     neutral  0.2500000 1.0000000 0.4000000         2
     sadness  1.0000000 0.5000000 0.6666667         2
    surprise  1.0000000 1.0000000 1.0000000         1

    accuracy                      0.5000000        16
   macro avg  0.5119048 0.5357143 0.4748299        16
weighted avg  0.5104167 0.5000000 0.4601190        16

```

Again, we see that the BoW SVM classifier outperforms the embedding SVM model (accuracy 0.50 versus 0.31 on our 16 utterances).

## 7. Saving the classifier to disk

Just as with the previous notebook, you can save the emotion classification model to disk and load the model some other time. Note that you need to load the same word embedding model as well (here *glove-twitter-25*) to represent any text input with vector representations that are compatible.
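
One way to reduce the risk of combining a classifier with the wrong embedding model is to store the name of the embedding model together with the classifier. The sketch below is only a suggestion and not what this notebook does. It trains a toy classifier on invented 2-dimensional vectors, and the dictionary keys and the name 'toy-embeddings-2d' are made up for the example. It writes to a temporary directory so that it does not touch your `./models` folder.

```python
import os
import pickle
import tempfile

from sklearn import svm

# A toy classifier trained on invented 2-dimensional vectors.
toy_clf = svm.LinearSVC(max_iter=2000).fit([[1.0, 0.0], [0.0, 1.0]], ['joy', 'sadness'])

# Store the classifier together with a note on which embeddings it expects.
bundle = {
    'classifier': toy_clf,
    'embedding_model_name': 'toy-embeddings-2d',  # hypothetical name
    'num_features': 2,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = os.path.join(tmp_dir, 'toy_bundle.sav')
    with open(bundle_path, 'wb') as outfile:
        pickle.dump(bundle, outfile)
    # some time later...
    with open(bundle_path, 'rb') as infile:
        restored = pickle.load(infile)

# Check that the restored classifier expects vectors of the recorded size.
expected_size = restored['classifier'].coef_.shape[1]
print(restored['embedding_model_name'], restored['num_features'], expected_size)
restored_prediction = restored['classifier'].predict([[0.9, 0.1]])
print(restored_prediction)
```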

In [53]:
import pickle

# save the classifier to disk
filename_classifier = './models/svm_linear_clf_embeddings.sav'
with open(filename_classifier, 'wb') as outfile:
    pickle.dump(svm_linear_clf, outfile)

# save the frequent keywords as well; we need them to represent new text
filename_freq_keywords = './models/frequent_keywords.sav'
with open(filename_freq_keywords, 'wb') as outfile:
    pickle.dump(frequent_keywords, outfile)

In [61]:
# some time later...

# load the classifier and the frequent keywords from disk
with open(filename_classifier, 'rb') as infile:
    loaded_classifier = pickle.load(infile)
with open(filename_freq_keywords, 'rb') as infile:
    loaded_frequent_keywords = pickle.load(infile)
stop_words = set(stopwords.words('english'))

some_chat_tokens = tokenize_data(some_chat)
# getAvgFeatureVecs returns the vectors together with the lists of known and unknown words
some_chat_embedding_vectors, known_words, unknown_words = getAvgFeatureVecs(some_chat_tokens, loaded_frequent_keywords,stop_words, word_embedding_model, index2word_set, num_features)  
some_chat_pred = loaded_classifier.predict(some_chat_embedding_vectors)

for review, gold, predicted_label in zip(some_chat, some_chat_gold, some_chat_pred):
    print('%s => %s, %s' % (review, gold, predicted_label))

Shape of our matrix is: (16, 25)
Review 0 of 16
Shape of our matrix is: (16, 25)
That is sweet of you => joy, joy
You are so funny => joy, surprise
Are you a man or a woman? => neutral, neutral
Chatbots make me sad and feel lonely. => sadness, neutral
Your are stupid and boring. => anger, neutral
Two thumbs up => joy, neutral
I fell asleep halfway through this conversation => anger, neutral
Wow, I am really amazed. => surprise, neutral
You are amazing. => joy, joy
I feel so low being in isolation => sadness, neutral
People dumping waste are horrible => disgust, neutral
Its awful that you cannot stop smoking => disgust, neutral
Dogs scare me => fear, neutral
I am afraid I will get sick at work => fear, neutral
I run away when I see a dog => fear, neutral
When do you start your job? => neutral, neutral


# End of this notebook