# Lab3.3 Training an emotion classifier using embeddings

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

### Table of Contents

* [Section 1: Quick introduction to embeddings](#section1)
* [Section 2: Loading the emotion data](#section2)
* [Section 3: Preparing the training and test data](#section3)
* [Section 4: Training and applying the model](#section4)
* [Section 5: Generating the test report](#section5)
* [Section 6: Applying the classifier to your own text](#section6)


## 1. Quick introduction to embeddings  <a class="anchor" id ="section1"></a> 

Extracting features manually can get us a long way. In addition to lemma and part-of-speech, people have used other information: features of the previous words (on the left) or the next words (on the right), whether the current word starts with a capital, whether it is an abbreviation, etc.

A more recent alternative for creating a 'semantic' representation of a word is to use word embeddings: a mapping from the words (or phrases) in the vocabulary to vectors of real numbers. Conceptually, this is a mathematical embedding from a space with one dimension per word to a continuous vector space with far fewer dimensions. Because almost every slot in such a vector holds a non-zero value, embeddings are called dense representations.
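To make the difference between sparse and dense vectors concrete, here is a minimal sketch with a toy vocabulary of three words. The dense vectors are made-up numbers for illustration, not values from a trained model.

```python
import numpy as np

# Toy vocabulary; real vocabularies contain many thousands of words
toy_vocabulary = ["cat", "dog", "car"]

# Sparse one-hot vectors: one slot per vocabulary word, with a single 1
identity = np.eye(len(toy_vocabulary))
one_hot = {word: identity[i] for i, word in enumerate(toy_vocabulary)}

# Dense vectors: fewer dimensions than words (made-up values)
dense = {
    "cat": np.array([0.9, 0.1]),
    "dog": np.array([0.8, 0.2]),
    "car": np.array([0.1, 0.9]),
}

# Different one-hot vectors never overlap, so their dot product is always 0
print(np.dot(one_hot["cat"], one_hot["dog"]))

# Dense vectors can express that 'cat' is closer to 'dog' than to 'car'
print(np.dot(dense["cat"], dense["dog"]), np.dot(dense["cat"], dense["car"]))
```

With one-hot vectors every pair of different words looks equally unrelated, whereas dense vectors allow graded similarity.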

In linguistics, word embeddings were discussed in the research area of distributional semantics. The idea is to quantify and categorize semantic similarities between linguistic items based on their distributional properties in large samples of language data. The underlying notion is that "a word is characterized by the company it keeps" (Firth). Embeddings, however, do not represent the contexts in a vector directly: they are the weights in the hidden layer of a neural network that is trained to predict the contexts of a word.

### Reference:

For a nice explanation of how word embeddings can improve classical bag-of-words approaches, check out this page:

https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html


In [1]:
import gensim

There are many sites that provide pretrained word2vec models that can be loaded through the *gensim* package. Check out the original data from Google for a word2vec model with 300 dimensions trained on Google News: [Google code archive](https://code.google.com/archive/p/word2vec/). 
Here is a website with many ready-to-use models: http://vectors.nlpl.eu/repository/

Whatever you choose, make sure you can load the model using the 'gensim' package.

In this notebook, we will load pre-trained word embeddings created by the [GloVe project](https://nlp.stanford.edu/projects/glove/) at Stanford University from Twitter data. We hope that the Twitter model is better adapted to the spoken utterances from the MELD project than models trained on written news and Wikipedia articles.

We will load the model with the Gensim package that we used before. You can download the Twitter models to your disk from:

http://nlp.stanford.edu/data/glove.twitter.27B.zip

You can see that different models are provided with different dimensions:

Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors, 1.42 GB download): glove.twitter.27B.zip

We will use the model with 200 dimensions. If your computer cannot handle it, you can try one of the smaller models.

When you unpack the zip file, you see that the models are provided as text files. To load the data into *gensim*, we need to run some specific code, given in the next cell. Don't worry too much about this code. Adapt the path to your local copy and run the cell to load it.

In [None]:
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

glove_file = datapath('/Users/piek/Desktop/ONDERWIJS/data/word-embeddings/classical-models/glove-twitter-models/glove.twitter.27B.200d.txt')
tmp_file = get_tmpfile("test_word2vec.txt")

_ = glove2word2vec(glove_file, tmp_file)
word_embedding_model = KeyedVectors.load_word2vec_format(tmp_file)

### this model has 200 dimensions so we set the number of features to 200
num_features = 200

Instead of downloading the models to disk, 'gensim' also provides a downloader API to load a model from the web when needed. In the next cell, we use this API to download a word embedding model. Note that the download takes some time, but it saves you from managing the files on disk yourself.

In [2]:
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api


wordembeddings = "glove-twitter-200"
### this model has 200 dimensions so we set the number of features to 200
num_features = 200

#wordembeddings = "glove-twitter-25"
### this model has 25 dimensions so we set the number of features to 25
#num_features = 25

#wordembeddings = "glove-twitter-50"
### this model has 50 dimensions so we set the number of features to 50
#num_features = 50

#wordembeddings = "glove-twitter-100"
### this model has 100 dimensions so we set the number of features to 100
#num_features = 100

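### NOTE: the next two lines override the Twitter settings above.
### The outputs shown in this notebook were produced with this 300-dimensional model.
wordembeddings = "glove-wiki-gigaword-300"
num_features = 300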

word_embedding_model = api.load(wordembeddings)
print(num_features)

300


Loading the word embedding model and training the classifier may take a while. If your laptop cannot handle it, use a smaller word embedding model with fewer dimensions. Note that the performance of the classifier may degrade when fewer dimensions are used. Alternatively, you can reduce the amount of training data, as we will show below, but this is also likely to affect the performance.

Depending on the embedding model you have selected, you need to set the number of features that are used to represent the utterances to the number of dimensions of that model. If the value of num_features differs from the dimensions of the model, you get an error when creating the embedding representations for the utterances below.
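One way to avoid such a mismatch is to read the number of dimensions from the model instead of typing it by hand. The sketch below assumes the loaded model is a gensim `KeyedVectors` object, which exposes a `vector_size` attribute. The `fake_model` object and the helper name `embedding_dimensions` are hypothetical stand-ins so that the example runs without downloading a model.

```python
from types import SimpleNamespace

def embedding_dimensions(model):
    """Return the number of dimensions of a KeyedVectors-like object."""
    return model.vector_size

# Stand-in for illustration only; in the notebook you would pass word_embedding_model
fake_model = SimpleNamespace(vector_size=300)

example_num_features = embedding_dimensions(fake_model)
print(example_num_features)
```

In the notebook itself, the same idea would read `num_features = word_embedding_model.vector_size`.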

Let's check if the model works.

In [3]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

word1='cat'
word2='dog'
word1_vector=np.array(word_embedding_model[word1]).reshape(1, -1)
word2_vector=np.array(word_embedding_model[word2]).reshape(1, -1)
print(cosine_similarity(word1_vector, word2_vector))

[[0.6816747]]
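The cosine similarity used above is the dot product of two vectors divided by the product of their lengths. As a minimal sketch with made-up three-dimensional vectors (not real embeddings), we can compute it by hand with numpy:

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two one-dimensional vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up vectors for illustration
a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 4.0, 4.0])    # same direction as a, twice as long
c = np.array([2.0, -1.0, 0.0])   # perpendicular to a

print(cosine(a, b))   # vectors pointing in the same direction
print(cosine(a, c))   # perpendicular vectors
```

The first pair gives the maximum score because only the direction matters, not the length. The second pair gives zero because the vectors have nothing in common.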


## 2. Loading the emotion data  <a class="anchor" id ="section2"></a> 

Just as with the previous notebook, we load the training and test data from the MELD data set:

In [4]:
import pandas as pd
import nltk

filepath = './data/MELD/train_sent_emo.csv'
dftrain = pd.read_csv(filepath)
### The data has some problematic characters caused by encoding problems. The next line replaces these in the utterances with an apostrophe.
### regex=True is needed because the pattern uses '|' to list alternatives (recent pandas versions treat the pattern literally by default)
dftrain['Utterance'] = dftrain['Utterance'].str.replace("\x92|\x97|\x91|\x93|\x94|\x85", "'", regex=True)

filepath = './data/MELD/test_sent_emo.csv'
dftest = pd.read_csv(filepath)
dftest['Utterance'] = dftest['Utterance'].str.replace("\x92|\x97|\x91|\x93|\x94|\x85", "'", regex=True)


In [5]:
### the first loop tokenizes the training utterances
training_instances = []
for index, utterance in enumerate(dftrain['Utterance']):
    ### we can limit the data to the first 2000 utterances
    ##if index==2000:
    ##    break
    training_instances.append(nltk.tokenize.word_tokenize(utterance))

### We use the same kind of loop for the emotion labels that correspond with the utterances
training_labels = []
for index, label in enumerate(dftrain['Emotion']):
    ### use the same limit here as in the loop above
    ##if index==2000:
    ##    break
    training_labels.append(label)
    
### the same two loops for the test data
test_instances = []
for utterance in dftest['Utterance']:
    test_instances.append(nltk.tokenize.word_tokenize(utterance))

test_labels = []
for label in dftest['Emotion']:
    test_labels.append(label)



## 3. Preparing the training and test data  <a class="anchor" id ="section3"></a> 

The following imports are needed again:

In [6]:
import sklearn
import numpy as np
import nltk
from nltk.corpus import stopwords

In the previous notebook, we used CountVectorizer to obtain the full vocabulary of the data set and generate vectors based on the one-hot encoding of each word. In these vectors, each slot represents a word; a value '1' indicates that the word was present in the utterance and a '0' means absence. This results in large and sparse vector representations for each utterance. We have also seen that we can weight the relevance of a word using the 'TF.IDF' function. This still results in large and sparse vectors, but the weights are more subtle. The downsides are sparseness, lack of generalisation and lack of robustness. 

In the following, we are going to represent the utterances by an embedding representation. We take the word embedding of each token in the utterance, add these together and then take the average. All the embeddings have the same number of dimensions in the same order. So if two tokens have a high weight for one dimension, then their co-occurrence in an utterance will reinforce that weight. Note that by adding and taking the average, we normalize for the length of the utterance, and the order of the tokens is not relevant.
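The effect of averaging can be sketched with a few made-up three-dimensional vectors. The names `toy_embeddings` and `average_embedding` are hypothetical and only serve this illustration. The functions used later in this notebook additionally scale each word vector to unit length before adding it.

```python
import numpy as np

# Made-up 3-dimensional "embeddings", for illustration only
toy_embeddings = {
    "good":  np.array([1.0, 0.0, 2.0]),
    "movie": np.array([0.0, 2.0, 2.0]),
    "very":  np.array([2.0, 1.0, 2.0]),
}

def average_embedding(tokens, embeddings):
    """Average the vectors of all tokens that have an embedding."""
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    return np.mean(vectors, axis=0)

v1 = average_embedding(["very", "good", "movie"], toy_embeddings)
v2 = average_embedding(["movie", "good", "very"], toy_embeddings)      # other word order
v3 = average_embedding(["very", "good", "movie"] * 2, toy_embeddings)  # twice as long

print(v1)
print(np.allclose(v1, v2), np.allclose(v1, v3))
```

All three utterances end up with the same vector, which shows that both word order and utterance length are lost in this representation.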

Before we create the embedding representations for the utterances, we are going to filter the words by the NLTK stopword list and by their frequency. In the case of CountVectorizer this was done for us. Here, we need to write our own customized code. The next piece of code shows a loop over all utterances from which we extract a list of all tokens. Next, we count these tokens and extract a list of words that occur more often than our threshold, which we should set in the same way as for CountVectorizer.

In [7]:
from collections import Counter
frequency_threshold = 4
frequent_keywords = []

####  We first collect all tokens from all the utterances in the training data using the NLTK tokenizer
alltokens = []
for utterance in dftrain['Utterance']:
    tokenlist = nltk.tokenize.word_tokenize(utterance)
    for token in tokenlist:
        alltokens.append(token)
#### The Counter function will create a frequency count of all the items. The result is a dictionary       
kw_counter = Counter(alltokens)

#### We now loop over the dictionary with counts to get the word and the frequency value
for word, count in kw_counter.items():
    if count>frequency_threshold:
        frequent_keywords.append(word)

#print(frequent_keywords)
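The same filtering can be tried out on a handful of toy tokens. This sketch stores the frequent words in a set rather than a list, which is an assumption of this example and not what the cell above does; a set makes the many `word in ...` checks that follow faster. The names prefixed with `toy_` are hypothetical.

```python
from collections import Counter

toy_tokens = ["i", "am", "happy", "i", "am", "sad", "i", "know"]
toy_threshold = 1   # hypothetical value for this toy example

toy_counts = Counter(toy_tokens)

# Keep only the words that occur more often than the threshold
toy_frequent = {word for word, count in toy_counts.items() if count > toy_threshold}

print(sorted(toy_frequent))
```

Only 'i' and 'am' occur more than once, so the other words would not contribute to the utterance representation.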

We are going to define two customized functions using 'def' to create an embedding representation for each utterance. 
These functions are taken from: https://www.kaggle.com/varun08/sentiment-analysis-using-word2vec

The first function, called 'featureVecMethod', takes the words of the utterance, the frequent keywords, the stopwords, the embedding model and the vocabulary of the model as parameters. The num_features parameter determines the size of the vector. 

In [8]:
unknown_words =[]
known_words = []
# Function to average the word vectors of all frequent, non-stopword tokens in an utterance
def featureVecMethod(words, frequent_keywords, stop_words, model, index2word_set, num_features):
    # Pre-initialising a numpy array of length num_features filled with zeros
    featureVec = np.zeros(num_features,dtype="float32")
    nwords = 0
        
    for word in words:
        #### we only use words that are frequent and not stopwords
        if word in frequent_keywords and word not in stop_words:
            #### words that are not in the model as they are get a second chance in lowercase
            if word not in index2word_set:
                word = word.lower()
            if word in index2word_set:
                #### we add the word vector after scaling it to unit length
                featureVec = np.add(featureVec,model[word]/np.linalg.norm(model[word]))
                #we keep track of the words detected, just for analysing the data
                known_words.append(word)
                nwords = nwords + 1
            else:
                #we keep track of the unknown words to see how well our model fits the data
                unknown_words.append(word)
                    
    # Dividing the result by the number of words used from the utterance to get the average
    # (if no word was used, this divides by zero and results in NaN values)
    featureVec = np.divide(featureVec, nwords)
    return featureVec

The next function processes all the data and creates the list of input vectors. It calls the previous function for each utterance.

In [9]:
# Function for calculating the average feature vector
def getAvgFeatureVecs(texts, keywords, stopwords, model, modelword_index, num_features):
    counter = 0
    #### we initialise a vector with zeros of the type float32
    textFeatureVecs = np.zeros((len(texts),num_features),dtype="float32")
    
    #### We iterate over all the texts
    for text in texts:
        # Printing a status message every 1000th text, to see what we are processing
        if counter%1000 == 0:
            print("Review %d of %d"%(counter,len(texts)))
        ### Each text is transformed into a vector representation based on the averaged token embedding using the previous function
        ### We add these vectors to the total list
        textFeatureVecs[counter] = featureVecMethod(text, keywords, stopwords, model, modelword_index,num_features)
        counter = counter+1
    return textFeatureVecs

Now back to our input data. We iterate over the Pandas frame in the same way as before but now we extract for each utterance the embedding representation, using the above two functions.

In [10]:
# Calculating average feature vector for training set

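#Converting the vocabulary of the model, which is a list, to a set for better speed in the execution.
#Allows for quicker lookup if the words exist
#gensim 4 and later call this list 'index_to_key'; older versions call it 'index2word'
if hasattr(word_embedding_model, 'index_to_key'):
    index2word_set = set(word_embedding_model.index_to_key)
else:
    index2word_set = set(word_embedding_model.index2word)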
# We take the stopwords from NLTK
stop_words = set(stopwords.words('english'))

### The full list of utterances is passed to our customized function, with the frequent_keywords list, the stopwords, the embedding model, 
### the index and the number of features that indicates the dimensions of the model
trainDataVecs = getAvgFeatureVecs(training_instances, frequent_keywords, stop_words, word_embedding_model, index2word_set, num_features)

#### Due to the averaging, there could be infinite values or NaN values. The next numpy function turns these values into "0" scores
trainDataVecs = np.nan_to_num(trainDataVecs) 



Review 0 of 9989
Review 1000 of 9989
Review 2000 of 9989
Review 3000 of 9989
Review 4000 of 9989
Review 5000 of 9989
Review 6000 of 9989
Review 7000 of 9989
Review 8000 of 9989
Review 9000 of 9989
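Why can NaN values occur? If every token of an utterance is filtered out or unknown, the sum of vectors is all zeros and the number of words is zero, so the average divides zero by zero. A minimal sketch with toy values (the `toy_` names are hypothetical):

```python
import numpy as np

toy_sum = np.zeros(4, dtype="float32")   # no token contributed a vector
toy_nwords = 0

# 0/0 yields NaN; errstate silences the RuntimeWarning numpy would otherwise print
with np.errstate(divide="ignore", invalid="ignore"):
    toy_average = np.divide(toy_sum, toy_nwords)

print(toy_average)
print(np.nan_to_num(toy_average))
```

After `np.nan_to_num`, such an utterance is represented by a vector of zeros, which means it carries no information for the classifier.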


Let's inspect our training data a bit more. Depending on the break set for loading the training data, you will have a list of vectors with the corresponding length:

In [11]:
len(trainDataVecs)

9989

We can inspect the first element in the list:

In [12]:
print('Vector length', len(trainDataVecs[0]))
print (trainDataVecs[0])

Vector length 300
[-2.42284834e-02  2.93585472e-02 -1.98888825e-03 -2.60138828e-02
  2.12003011e-04 -1.65226236e-02 -1.70849659e-03  2.49614175e-02
  8.99223983e-03 -3.20224464e-01  2.05466580e-02  9.02264100e-03
  1.72707322e-03  1.63111109e-02  4.60524857e-02  3.75630483e-02
 -3.11330091e-02 -3.31713678e-03  3.10880831e-04 -1.91136003e-02
  3.76915373e-02  3.55959460e-02  1.80483200e-02  1.18826814e-02
 -3.34082618e-02 -8.30051303e-03  5.46814129e-03 -1.42572133e-03
 -2.53683683e-02  2.36102045e-02  2.80681020e-03  7.12274387e-02
 -2.33282633e-02  6.64981361e-03 -1.54919937e-01  2.01730337e-02
  5.75820496e-03 -3.80901154e-03 -8.61165114e-03 -7.86029082e-03
 -1.62166171e-02 -6.09386247e-03 -1.48823317e-02  5.30707687e-02
 -1.69919785e-02  2.79374905e-02  4.06390727e-02  3.65676470e-02
 -1.76445320e-02  9.20884591e-03  6.82919938e-03 -1.11794956e-02
  3.85308862e-02  8.87811556e-03 -9.13435686e-03  2.98784617e-02
 -4.92461119e-03  5.87546527e-02  2.46372167e-02 -6.12752233e-03
  3.251

It is simply a list of numbers, each representing the averaged weight of the tokens or words that make up the utterance. We can check the length, which should be '300', '200', '100', '50' or '25', depending on the number of dimensions of the embedding model that you used.

There are two major differences with the bag-of-tokens that we used in the previous notebook:

1. the vectors are short
2. there are no zeros 

Instead of *large sparse* vectors, we now have *short dense* vectors representing each utterance. Whereas in the previous representation each slot in the vector corresponded to a token, each slot is now a weight from the hidden layer that was learned to predict other words in the context.

This is true for each utterance: each has its own set of values for the same hidden layer weights. These weights now represent the meaning of the utterance for a machine, which can use a similarity function such as cosine similarity to measure the degree of equivalence across these representations. When we inspect any other utterance, we see it is represented in a similar way.

In [13]:
print(len(trainDataVecs[1000]))
print(trainDataVecs[1000])

300
[-3.34656052e-02 -1.54432449e-02  1.91893093e-02 -5.23057058e-02
  1.76805276e-02  4.07470670e-03 -1.71777830e-02  5.14362641e-02
  3.88247184e-02 -3.75979602e-01  7.19848052e-02 -1.79710761e-02
 -3.62468660e-02  6.15833774e-02 -1.57246254e-02  1.22439843e-02
 -5.39080687e-02  6.31871168e-03  1.29400091e-02 -1.66408233e-02
 -3.34875961e-03  6.07072338e-02  5.52398674e-02  2.99382154e-02
 -5.12655266e-02 -2.50904020e-02  4.18722862e-03 -4.78261448e-02
 -2.41893567e-02  1.42338928e-02 -2.30888464e-02  6.95327073e-02
 -5.94733842e-02  1.46406349e-02 -2.34109908e-01  5.14502414e-02
 -3.86457029e-03  4.01513055e-02 -3.06833498e-02 -5.70127089e-03
  4.84503731e-02 -4.19773385e-02 -1.20958053e-02  6.38127234e-03
 -5.32918237e-03  7.74203241e-03  7.32207969e-02  7.64759630e-02
  1.09588206e-02  2.62346324e-02  3.51432664e-03 -3.24195810e-02
  1.77698527e-02 -3.65351513e-02 -4.13852260e-02  5.94245419e-02
 -1.62900276e-02  4.39020880e-02  4.44586240e-02 -1.69757474e-02
  5.59020117e-02 -1.8

Since the vectors are compatible, we can compare them in the same way as we did before for the word2vec embeddings of *cat* and *dog*:

In [14]:
word1_vector=np.array(trainDataVecs[0]).reshape(1, -1)
word2_vector=np.array(trainDataVecs[1000]).reshape(1, -1)
print(cosine_similarity(word1_vector, word2_vector))

[[0.8440603]]


For training, we use the same labels as before:

In [15]:
print(training_labels[0], training_labels[1])

neutral neutral


So now we have a numeric representation of each text, based on the embeddings of the words. We feed this to a classifier in the same way as we did in the previous notebooks with the CountVectorizer output.

Before we can train the classifier, we still need to convert the labels to numeric values as we did before.

Before we do that, it may be good to check which words are not in the embedding model and therefore do not contribute to the representation of the utterance. In the above function, we kept track of the unknown words. Now we can inspect this list. We use the *Counter* function to get a frequency count of these words.

In [16]:
from collections import Counter

print('Training data vocabulary statistics:')
print('Frequency threshold', frequency_threshold)
unknown_words_count = Counter(unknown_words)
print('Proportion of unknown tokens', len(unknown_words)/(len(unknown_words)+len(known_words)))
print('Number of unknown words',len(unknown_words_count))
print('Number of unknown word tokens:', len(unknown_words))
print('Unknown words counts')
print(unknown_words_count)

Training data vocabulary statistics:
Frequency threshold 4
Proportion of unknown tokens 0.009936389232350618
Number of unknown words 25
Number of unknown word tokens: 667
Unknown words counts
Counter({"y'know": 322, 'pheebs': 79, '..': 42, 'and-and': 26, 'you-you': 20, 'no-no-no': 18, "i'm-i": 18, 'i-i-i': 18, 'hey-hey': 14, 'that-that': 12, 'we-we': 10, 'oh-oh': 10, 'what-what': 9, 'no-no-no-no': 8, 'yeah-yeah': 7, "'kay": 6, "it's-it": 6, 'buffay': 6, 'oh-ho': 6, 'i-i-i-i': 5, 'ok.': 5, "that's-that": 5, 'um-hmm': 5, 'the-the': 5, 'heldi': 5})
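A long Counter printout is hard to read. The `most_common` method of `Counter` returns only the top items, sorted by count. The sketch below uses a few counts copied from the output above; the name `toy_unknown` is hypothetical.

```python
from collections import Counter

# A few counts copied from the output above
toy_unknown = Counter({"y'know": 322, "pheebs": 79, "..": 42, "and-and": 26})

# Show only the three most frequent unknown words
for word, count in toy_unknown.most_common(3):
    print(word, count)
```

In the notebook, `unknown_words_count.most_common(10)` would give the ten most frequent unknown words in the same way.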


We also kept track of the *known* words, so let's check these as well:

In [17]:
known_words_count = Counter(known_words)
print('Number of known words',len(known_words_count))
print('Number of known words tokens:', len(known_words))
#print(known_words_count)

Number of known words 1113
Number of known words tokens: 66460


Next we represent the test data with the same methods to make them compatible.

In [18]:
testDataVecs = getAvgFeatureVecs(test_instances, frequent_keywords, stop_words, word_embedding_model, index2word_set, num_features)

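#### Due to the averaging, there could be infinite values or NaN values. The next numpy function turns these values into "0" scores
testDataVecs = np.nan_to_num(testDataVecs) 

#### Note: unknown_words and known_words were not reset after processing the training data,
#### so the counts printed below cover the training and test data together
print('Test data vocabulary statistics:')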
print('Frequency threshold', frequency_threshold)


unknown_words_count = Counter(unknown_words)
print('Proportion of unknown tokens', len(unknown_words)/(len(unknown_words)+len(known_words)))
print('Number of unknown words',len(unknown_words_count))
print('Number of unknown word tokens:', len(unknown_words))
print('Unknown words counts')
print(unknown_words_count)

known_words_count = Counter(known_words)
print('Number of known words',len(known_words_count))
print('Number of known words tokens:', len(known_words))
#print(known_words_count)

Review 0 of 2610
Review 1000 of 2610
Review 2000 of 2610
Test data vocabulary statistics:
Frequency threshold 4
Proportion of unknown tokens 0.009801611611139908
Number of unknown words 25
Number of unknown word tokens: 832
Unknown words counts
Counter({"y'know": 433, 'pheebs': 88, '..': 50, 'and-and': 28, 'i-i-i': 27, 'you-you': 26, 'no-no-no': 21, "i'm-i": 20, 'hey-hey': 16, 'we-we': 12, 'that-that': 12, 'what-what': 11, 'oh-oh': 10, 'no-no-no-no': 9, "'kay": 8, 'buffay': 8, 'yeah-yeah': 7, 'oh-ho': 7, 'the-the': 7, 'ok.': 6, "it's-it": 6, 'i-i-i-i': 5, "that's-that": 5, 'um-hmm': 5, 'heldi': 5})
Number of known words 1113
Number of known words tokens: 84052




Just as in the previous notebook, we need to turn the labels into numerical values:

In [19]:
from sklearn import preprocessing
# first we instantiate a label encoder
le = preprocessing.LabelEncoder()
# we feed this encoder with the complete list of labels from our data
le.fit(training_labels+test_labels)
print(list(le.classes_))
training_classes = le.transform(training_labels)
print('Train labels', list(training_classes[0:20]))

test_classes = le.transform(test_labels)
print('Test labels', list(test_classes[0:20]))

['anger', 'disgust', 'fear', 'joy', 'neutral', 'sadness', 'surprise']
Train labels [4, 4, 4, 4, 6, 4, 4, 4, 4, 4, 2, 4, 6, 4, 6, 5, 6, 2, 4, 4]
Test labels [6, 0, 4, 4, 3, 3, 3, 3, 3, 3, 3, 4, 4, 5, 6, 0, 0, 0, 3, 3]
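Conceptually, the label encoder sorts the unique labels and replaces each label by its position in that sorted list. The plain-Python sketch below mimics this behaviour on a few toy labels. It is an illustration of the idea, not the scikit-learn implementation, and the `toy_` names are hypothetical.

```python
toy_labels = ["neutral", "joy", "anger", "neutral", "surprise"]

# The sorted unique labels play the role of le.classes_
toy_classes = sorted(set(toy_labels))
label_to_id = {label: index for index, label in enumerate(toy_classes)}

# Encoding: label -> integer; decoding: integer -> label
toy_encoded = [label_to_id[label] for label in toy_labels]
toy_decoded = [toy_classes[index] for index in toy_encoded]

print(toy_classes)
print(toy_encoded)
print(toy_decoded == toy_labels)
```

With the real encoder, `le.inverse_transform` performs the decoding step.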


The next steps are the same as for the previous notebook, except that we pass the embedding representations of the training data.

## 4. Training and applying the model  <a class="anchor" id ="section4"></a> 

In [20]:
from sklearn import svm

# We choose a Linear model
svm_linear_clf = svm.LinearSVC(max_iter=2000)
### we train the classifier through the *fit* function by passing the training vectors and the training labels as parameters:
svm_linear_clf.fit(trainDataVecs, training_classes)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=2000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

In [21]:
# Predicting the labels for the test set
y_pred_svm_linear = svm_linear_clf.predict(testDataVecs)

## 5. Generating the test report  <a class="anchor" id ="section5"></a> 

In [22]:
from sklearn.metrics import classification_report

In [23]:
#### this report gives the results for the LINEAR classifier
report = classification_report(test_classes,y_pred_svm_linear,digits = 7)
print(le.classes_)
print('Embeddings SVM LINEAR ----------------------------------------------------------------')
print('Word embedding model used', wordembeddings)
print('Word frequency threshold', frequency_threshold)
print(report)
print('Confusion matrix')
print(sklearn.metrics.confusion_matrix(test_classes,y_pred_svm_linear))

['anger' 'disgust' 'fear' 'joy' 'neutral' 'sadness' 'surprise']
Embeddings SVM LINEAR ----------------------------------------------------------------
Word embedding model used glove-wiki-gigaword-300
Word frequency threshold 4
              precision    recall  f1-score   support

           0  0.4761905 0.1159420 0.1864802       345
           1  0.0000000 0.0000000 0.0000000        68
           2  0.0000000 0.0000000 0.0000000        50
           3  0.4582210 0.4228856 0.4398448       402
           4  0.6007194 0.9307325 0.7301686      1256
           5  0.5882353 0.0480769 0.0888889       208
           6  0.5654450 0.3843416 0.4576271       281

    accuracy                      0.5735632      2610
   macro avg  0.3841159 0.2717112 0.2718585      2610
weighted avg  0.5303591 0.5735632 0.5001254      2610

Confusion matrix
[[  40    0    0   81  197    2   25]
 [   3    0    0    9   53    0    3]
 [   2    0    0   10   35    1    2]
 [   7    0    0  170  210    0   15]
 [   8



Remember the results from the notebook where we trained NaiveBayes and SVM classifiers with one-hot encodings of the words? Take some time to compare the results and think about the differences.
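When comparing reports, it helps to know how the two averages at the bottom are computed. The macro average gives every class the same weight, while the weighted average weighs each class by its support. The sketch below recomputes both for recall, using the per-class numbers copied from the report above.

```python
# Per-class recall and support, copied from the report above
recall  = [0.1159420, 0.0, 0.0, 0.4228856, 0.9307325, 0.0480769, 0.3843416]
support = [345, 68, 50, 402, 1256, 208, 281]

# Macro average: plain mean over the classes
macro_recall = sum(recall) / len(recall)

# Weighted average: each class counts in proportion to its number of test items
weighted_recall = sum(r * s for r, s in zip(recall, support)) / sum(support)

print(round(macro_recall, 4))
print(round(weighted_recall, 4))
```

The macro average is much lower because the classes *disgust* and *fear* are never predicted, whereas the weighted average is dominated by the large *neutral* class.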

## 6. Applying the model to new data  <a class="anchor" id ="section6"></a> 

We would like to apply the embedding-based model to our own data, but this works a bit differently, as we cannot simply use the 'transform' function to represent the utterances using the one-hot vector representation of the training vocabulary.

What we need to do is to create an embedding representation using the same function we used above and assume that our classifier finds sufficient similarity in the embeddings of our data with the correct training data.

We use the same set of utterances.

In [24]:
# some utterances
some_chat = ['That is sweet of you', 
               'You are so funny', 
               'Are you a man or a woman?', 
               'Chatbots make me sad and feel lonely.', 
               'Your are stupid and boring.', 
               'Two thumbs up', 
               'I fell asleep halfway through this conversation', 
               'Wow, I am really amazed.', 
               'You are amazing.']


len(some_chat)

9

Next, we define the list of labels that go with our chat.

In [25]:
some_chat_emotions = ['joy', 'joy', 'neutral', 'sadness', 'anger', 'joy', 'anger', 'surprise', 'joy']

We use the LabelEncoder *le* to convert this list into a numpy array of integers:

In [26]:
print('labels',le.classes_)
some_chat_labels = le.transform(some_chat_emotions)
print(some_chat_labels)

labels ['anger' 'disgust' 'fear' 'joy' 'neutral' 'sadness' 'surprise']
[3 3 4 5 0 3 0 6 3]


In [27]:
some_chat_tokens = []
for utterance in some_chat:
    some_chat_tokens.append(nltk.tokenize.word_tokenize(utterance))

some_chat_embedding_vectors = getAvgFeatureVecs(some_chat_tokens, frequent_keywords,stop_words, word_embedding_model, index2word_set, num_features)
#### Due to the averaging, there could be infinite values or NaN values. The next numpy function turns these values into "0" scores
some_chat_embedding_vectors = np.nan_to_num(some_chat_embedding_vectors)  

Review 0 of 9




In [28]:
# have classifier make a prediction

some_chat_pred = svm_linear_clf.predict(some_chat_embedding_vectors)
print('System predictions', some_chat_pred)
print('Gold labels', some_chat_labels)
for review, predicted_label in zip(some_chat, some_chat_pred):
    
    print('%s => %s' % (review, 
                        le.classes_[predicted_label]))




System predictions [3 3 6 5 0 4 4 6 3]
Gold labels [3 3 4 5 0 3 0 6 3]
That is sweet of you => joy
You are so funny => joy
Are you a man or a woman? => surprise
Chatbots make me sad and feel lonely. => sadness
Your are stupid and boring. => anger
Two thumbs up => neutral
I fell asleep halfway through this conversation => neutral
Wow, I am really amazed. => surprise
You are amazing. => joy
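With only nine utterances, we can check the predictions above by hand. A minimal sketch that counts how many predicted labels match the gold labels, using the two lists copied from the output:

```python
# Copied from the output above
predicted_ids = [3, 3, 6, 5, 0, 4, 4, 6, 3]
gold_ids      = [3, 3, 4, 5, 0, 3, 0, 6, 3]

# Count the positions where prediction and gold label agree
correct = sum(p == g for p, g in zip(predicted_ids, gold_ids))
accuracy = correct / len(gold_ids)

print(correct, len(gold_ids), round(accuracy, 4))
```

Six out of nine predictions are correct, which gives an accuracy of about 0.67.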


In [29]:
report = classification_report(some_chat_labels,some_chat_pred,digits = 7)
print(le.classes_)
print('Embedding SVM LINEAR ON EXTERNAL DATA -----------------------------')
print('Word embedding model used', wordembeddings)
print('Word frequency threshold', frequency_threshold)
print(report)

['anger' 'disgust' 'fear' 'joy' 'neutral' 'sadness' 'surprise']
Embedding SVM LINEAR ON EXTERNAL DATA -----------------------------
Word embedding model used glove-wiki-gigaword-300
Word frequency threshold 4
              precision    recall  f1-score   support

           0  1.0000000 0.5000000 0.6666667         2
           3  1.0000000 0.7500000 0.8571429         4
           4  0.0000000 0.0000000 0.0000000         1
           5  1.0000000 1.0000000 1.0000000         1
           6  0.5000000 1.0000000 0.6666667         1

    accuracy                      0.6666667         9
   macro avg  0.7000000 0.6500000 0.6380952         9
weighted avg  0.8333333 0.6666667 0.7142857         9



## 7. Saving the classifier to disk

Just as with the previous notebook, you can save the emotion classification model to disk and load the model some other time. Note that you also need to load the same word embedding model to represent any text input with vector representations that are compatible.

In [51]:
import pickle

# save the classifier to disk
filename_classifier = './models/svm_linear_clf_embeddings.sav'
with open(filename_classifier, 'wb') as outfile:
    pickle.dump(svm_linear_clf, outfile)
 
# some time later...
 
# load the classifier from disk
with open(filename_classifier, 'rb') as infile:
    loaded_classifier = pickle.load(infile)
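Because the classifier only works with embeddings from the same model, one option is to store the settings together with the classifier. The sketch below pickles a dictionary to a temporary folder. The dictionary keys, the file name and the placeholder classifier are assumptions made for this illustration; in the notebook the trained `svm_linear_clf` would take the place of the placeholder.

```python
import os
import pickle
import tempfile

# Stand-in values for illustration only
bundle = {
    "classifier": {"note": "placeholder for the trained classifier"},
    "wordembeddings": "glove-wiki-gigaword-300",
    "num_features": 300,
    "frequency_threshold": 4,
}

# Hypothetical file name in a temporary folder
path = os.path.join(tempfile.mkdtemp(), "emotion_bundle.pkl")

with open(path, "wb") as outfile:
    pickle.dump(bundle, outfile)

# some time later...
with open(path, "rb") as infile:
    restored = pickle.load(infile)

print(restored["wordembeddings"], restored["num_features"])
```

After loading, the stored name tells you which embedding model to load before you represent new text.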


# End of this notebook