# Lab3.3 Training an emotion classifier using embeddings

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

### Table of Contents

* [Section 1: Quick introduction to embeddings](#section1)
* [Section 2: Loading the emotion data](#section2)
* [Section 3: Preparing the training and test data](#section3)
* [Section 4: Training and applying the model](#section4)
* [Section 5: Generating the test report](#section5)
* [Section 6: Applying the classifier to your own text](#section6)


## 1. Quick introduction to embeddings  <a class="anchor" id ="section1"></a> 

Extracting features manually can get us a long way. In addition to lemma and part-of-speech, people have used other information: features of the previous words (on the left) or the next words (on the right), whether the current word starts with a capital, whether it is an abbreviation, etc.

A more recent way to create a 'semantic' representation of a word is by word embeddings: mapping words (or phrases) from the vocabulary to vectors of real numbers. Conceptually, it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with far fewer dimensions. For this reason, embeddings are called dense representations.

In linguistics, word embeddings were discussed in the research area of distributional semantics. The idea is to quantify and categorize semantic similarities between linguistic items based on their distributional properties in large samples of language data. The underlying notion is that "a word is characterized by the company it keeps" (Firth). Embeddings, however, do not represent the context in a vector directly: they are the weights in the hidden layer of a neural network that is trained to predict the contexts of a word.
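To make the distributional idea concrete, here is a small count-based sketch. It is not how word2vec is trained: it stores the observed contexts directly, which is the alternative that embeddings are contrasted with above. The toy corpus and the function name `context_counts` are made up for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus (made up for illustration)
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the dog",
]

def context_counts(sentences, window=1):
    """Count, for every word, the words occurring within `window` positions."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

counts = context_counts(corpus)
print(counts["cat"])
print(counts["dog"])
```

In this toy corpus, *cat* and *dog* share the contexts *the* and *drinks*, which is the kind of overlap that distributional models exploit.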

### Reference:

For a nice explanation of how word embeddings can improve on classical bag-of-words approaches, check out this page:

https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html


In this section, we will load pre-trained word embeddings called word2vec, created by Google. The embeddings have 300 dimensions.

First, download the file from [Kaggle](https://www.kaggle.com/pkugoodspeed/nlpword2vecembeddingspretrained) or from [Google code archive](https://code.google.com/archive/p/word2vec/). Then, create a folder and unpack the word2vec file in that folder.

We will load the embedding model with the Gensim package that we used before.

In [14]:
from sklearn.metrics.pairwise import cosine_similarity
import gensim
import numpy as np

In [2]:
##### Change the path to the location of your local copy of the GoogleNews embeddings
##### It may take a minute to load the model
path_to_model = '/Users/piek/Desktop/ONDERWIJS/data/word-embeddings/classical-models/GoogleNews-vectors-negative300.bin'
word_embedding_model = gensim.models.KeyedVectors.load_word2vec_format(path_to_model, binary=True)  


In case your computer cannot handle big models such as the Google News model, you can also download one of the smaller 'GloVe' datasets. These are provided as text files, and 'gensim' provides a function to convert and load them. Note that these models have fewer dimensions than the Google model, so you need to adapt the number of features in the code below.

In [12]:
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

glove_file = datapath('/Users/piek/Desktop/ONDERWIJS/data/word-embeddings/classical-models/glove.6B.50d.txt')
tmp_file = get_tmpfile("test_word2vec.txt")

_ = glove2word2vec(glove_file, tmp_file)
word_embedding_model = KeyedVectors.load_word2vec_format(tmp_file)


Instead of downloading the models to disk, 'gensim' also provides a downloader API to load the model from the web when needed. In the next cell, we use this API to download a word embedding model trained on tweets. As the name suggests, these are GloVe embeddings built from tweets. The vectors are based on 2B tweets, 27B tokens and an uncased vocabulary of 1.2M words. The original source can be found here: https://nlp.stanford.edu/projects/glove/. The 25 in the model name refers to the dimensionality of the vectors.

In [15]:
import gensim.downloader as api
# download the model (on first use) and return it as an object ready for use
word_embedding_model = api.load("glove-twitter-25")

Let's check if the model works.

In [16]:
word1='cat'
word2='dog'
word1_vector=np.array(word_embedding_model[word1]).reshape(1, -1)
word2_vector=np.array(word_embedding_model[word2]).reshape(1, -1)
print(cosine_similarity(word1_vector, word2_vector))

[[0.9590821]]
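To see what the similarity score measures, cosine similarity can also be computed by hand: the dot product of two vectors divided by the product of their lengths. The sketch below uses hypothetical 3-dimensional vectors, not vectors from a real model, and the helper name `cosine` is an assumption.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-dimensional vectors, not taken from any real model
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a, twice as long
c = np.array([0.0, 0.0, 3.0])   # orthogonal to a

print(cosine(a, b))
print(cosine(a, c))
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0.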


## 2. Loading the emotion data  <a class="anchor" id ="section2"></a> 

In [17]:
import pandas as pd
filepath = './data/MELD/train_sent_emo.csv'
df = pd.read_csv(filepath)

## 3. Preparing the training and test data  <a class="anchor" id ="section3"></a> 

The following imports are needed again:

In [18]:
import sklearn
import numpy
import nltk
from nltk.corpus import stopwords

In the previous notebook, we used CountVectorizer to obtain the full vocabulary of the data set and to generate a vector for each utterance based on the one-hot encoding of its words. In these vectors, each slot represents a word: a value of '1' indicates that the word is present in the utterance and a '0' that it is absent. This results in large and sparse vector representations for each utterance. We have also seen that we can weight the relevance of a word using the 'TF.IDF' function. This still results in large and sparse vectors, but the weights are more subtle. The downsides are sparseness, lack of generalisation and lack of robustness.
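As a reminder of what such a presence/absence vector looks like, here is a plain-Python sketch. It only mimics the encoding described above; it is not CountVectorizer itself, and the toy utterances and the function name `binary_bag_of_words` are made up for illustration.

```python
# Toy utterances (made up for illustration)
utterances = ["i am so happy", "i am sad", "so so sad"]

# The vocabulary is the sorted set of all tokens in the data
vocabulary = sorted({token for utt in utterances for token in utt.split()})

def binary_bag_of_words(utterance, vocabulary):
    """Return a vector with 1 where a vocabulary word occurs in the utterance, else 0."""
    tokens = set(utterance.split())
    return [1 if word in tokens else 0 for word in vocabulary]

vectors = [binary_bag_of_words(utt, vocabulary) for utt in utterances]
print(vocabulary)
for vec in vectors:
    print(vec)
```

With a realistic vocabulary of thousands of words, nearly every slot of such a vector is 0, which is what makes the representation sparse.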

In the following, we are going to represent each utterance by an embedding representation: we take the word embedding of each token in the utterance, add these together, and divide the sum by the number of tokens to get the average. All the embeddings have the same number of dimensions in the same order, so if two tokens both have a high weight for a dimension, their co-occurrence in an utterance reinforces that weight. Note that averaging normalizes for the length of the utterance, and that the order of the tokens plays no role.
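A minimal sketch of this averaging step, using a dictionary of hypothetical 3-dimensional vectors in place of a real embedding model. The function name `average_embedding` is an assumption, and unlike the notebook's own function this sketch returns a zero vector when no token is known.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; real models have 25 to 300 dimensions
toy_embeddings = {
    "good":  np.array([0.9, 0.1, 0.0]),
    "movie": np.array([0.1, 0.8, 0.2]),
    "great": np.array([1.0, 0.2, 0.1]),
}

def average_embedding(tokens, embeddings, num_features):
    """Average the vectors of the tokens that have an embedding; skip the rest."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # No known token: return zeros instead of dividing by zero
        return np.zeros(num_features, dtype="float32")
    return np.mean(vectors, axis=0)

v1 = average_embedding(["good", "movie"], toy_embeddings, 3)
v2 = average_embedding(["movie", "good"], toy_embeddings, 3)
print(v1)
print(np.allclose(v1, v2))
```

Swapping the token order gives the same vector, which illustrates that word order is lost in this representation.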

We are going to define two custom functions using 'def' to create an embedding representation for each utterance. These functions are taken from: https://www.kaggle.com/varun08/sentiment-analysis-using-word2vec

The first function, called 'featureVecMethod', takes the tokens of the utterance, a set of stop words, the embedding model and the set of words in the model's vocabulary as parameters. The num_features parameter determines the size of the vector and must match the number of dimensions of the embedding model.

In [151]:
unknown_words = []
known_words = []

# Function to average all word vectors in an utterance
def featureVecMethod(words, stop_words, model, index2word_set, num_features):
    # Pre-initialise a numpy array of length num_features, filled with zeros
    featureVec = np.zeros(num_features, dtype="float32")
    nwords = 0

    for word in words:
        if word not in stop_words:
            if word in index2word_set:
                nwords = nwords + 1
                featureVec = np.add(featureVec, model[word])
            else:
                # Retry with the lowercased form of the word
                word = word.lower()
                if word in index2word_set:
                    nwords = nwords + 1
                    featureVec = np.add(featureVec, model[word])
                    # we keep track of the words that were found after lowercasing
                    known_words.append(word)
                else:
                    # we keep track of the unknown words to see how well our model fits the data
                    unknown_words.append(word)
    # Divide the sum by the number of words to get the average.
    # If no word was found (nwords == 0), the result contains NaN values.
    featureVec = np.divide(featureVec, nwords)
    return featureVec

The next function processes all the data and creates the list of input vectors. It calls the previous function for each text.

In [152]:
# Function for calculating the average feature vector of every text
def getAvgFeatureVecs(texts, stop_words, model, index2word_set, num_features):
    counter = 0
    textFeatureVecs = np.zeros((len(texts), num_features), dtype="float32")
    for text in texts:
        # Print a status message every 200th text
        if counter % 200 == 0:
            print("Review %d of %d" % (counter, len(texts)))

        textFeatureVecs[counter] = featureVecMethod(text, stop_words, model, index2word_set, num_features)
        counter = counter + 1
    return textFeatureVecs

Now back to our input data. We iterate over the Pandas frame in the same way as before but now we extract for each utterance the embedding representation.

In [154]:
# Calculating the average feature vector for the training set

# Convert the model vocabulary (a list) to a set for faster membership tests.
# Note: in gensim 4.x this list is called index_to_key instead of index2word.
index2word_set = set(word_embedding_model.index2word)
stop_words = set(stopwords.words('english'))

# This is the number of dimensions of the embedding model used.
# The Google News model has 300 dimensions; the glove-twitter-25 model loaded above has 25.
num_features = 25
training_vectors = []
training_labels = []
for index, utterance in enumerate(df['Utterance']):
    ### Running this for all data requires a lot of memory and takes about an hour.
    ### For teaching purposes, it makes sense to limit the data;
    ### uncomment the next two lines to use only the first 2000 utterances
    ##if index==2000:
    ##    break
    training_vectors.append(nltk.tokenize.word_tokenize(utterance))
    training_labels.append(df['Emotion'].iloc[index])

trainDataVecs = getAvgFeatureVecs(training_vectors, stop_words, word_embedding_model, index2word_set, num_features)
#### If no token of an utterance is found in the model, the average is NaN. The next numpy function turns NaN values into 0
trainDataVecs = np.nan_to_num(trainDataVecs)



Review 0 of 9989
Review 200 of 9989
Review 400 of 9989
Review 600 of 9989
Review 800 of 9989
Review 1000 of 9989
Review 1200 of 9989
Review 1400 of 9989
Review 1600 of 9989
Review 1800 of 9989
Review 2000 of 9989
Review 2200 of 9989
Review 2400 of 9989
Review 2600 of 9989
Review 2800 of 9989
Review 3000 of 9989
Review 3200 of 9989
Review 3400 of 9989
Review 3600 of 9989
Review 3800 of 9989
Review 4000 of 9989
Review 4200 of 9989
Review 4400 of 9989
Review 4600 of 9989
Review 4800 of 9989
Review 5000 of 9989
Review 5200 of 9989
Review 5400 of 9989
Review 5600 of 9989
Review 5800 of 9989
Review 6000 of 9989
Review 6200 of 9989
Review 6400 of 9989
Review 6600 of 9989
Review 6800 of 9989
Review 7000 of 9989
Review 7200 of 9989
Review 7400 of 9989
Review 7600 of 9989
Review 7800 of 9989
Review 8000 of 9989
Review 8200 of 9989
Review 8400 of 9989
Review 8600 of 9989
Review 8800 of 9989
Review 9000 of 9989
Review 9200 of 9989
Review 9400 of 9989
Review 9600 of 9989
Review 9800 of 9989
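The `np.nan_to_num` call at the end of the cell above matters for utterances in which no token is found in the embedding model: the summed vector is then divided by zero. A minimal sketch of that case, with an illustrative vector size of 4 (the variable names are assumptions):

```python
import numpy as np

num_features = 4          # illustrative size only
feature_sum = np.zeros(num_features, dtype="float32")
nwords = 0                # no token of the utterance was found in the model

# 0/0 yields NaN (NumPy also emits a RuntimeWarning, silenced here)
with np.errstate(divide="ignore", invalid="ignore"):
    averaged = np.divide(feature_sum, nwords)

cleaned = np.nan_to_num(averaged)   # NaN -> 0.0
print(averaged)
print(cleaned)
```

Such utterances end up as all-zero vectors, so the classifier has no information about them.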




Training the classifier may take a while. If your laptop cannot handle it, reduce the amount of training data. Alternatively, you can use a smaller word2vec embeddings model. Here is a website with many ready-to-use models: http://vectors.nlpl.eu/repository/

You can either choose a model with a smaller vocabulary or one with fewer dimensions. Whatever you choose, make sure you can load the model using the 'gensim' package. If you choose a model with fewer than 300 dimensions (e.g. 100 or 200), you also need to adapt the value of *num_features* accordingly.

Let's inspect our training data a bit more. Depending on the break set for loading the training data, the list of vectors will have a corresponding length:

In [155]:
len(trainDataVecs)

9989

We can inspect the first element in the list:

In [156]:
print('Vector length', len(trainDataVecs[0]))
print (trainDataVecs[0])

Vector length 25
[ 0.18926129 -0.05296415 -0.50138575 -0.24667558  0.2502413   0.23562701
  0.9052485  -0.24662258  0.03203001  0.24278942 -0.12382173  0.302268
 -4.0777287   0.30402184  0.17057772  0.3256581   0.20401253 -0.01058714
 -0.22620572 -0.5342373   0.15801655 -0.00572142 -0.12884729  0.02147143
 -0.26485285]


It is simply a list of numbers, each representing the averaged weight of the tokens or words that made up the utterance. We can check the length, which should be '300', '100', '50' or '25', etc., depending on the number of dimensions of the embedding model that you used.

There are two major differences with the bag-of-tokens that we used in the previous notebook:

1. the vectors are short
2. there are hardly any zeros

Instead of *large sparse* vectors, we now have *short dense* vectors representing each utterance. Whereas in the previous representation each slot in the vector corresponded to a token, each slot is now a weight from the hidden layer of a network that learned to predict other words in the context.

This is true for each utterance, each having its own set of values for the same hidden layer weights. These weights now represent the meaning of the utterance for a machine, which can use a similarity function such as cosine similarity to measure the degree of equivalence across these representations. When we inspect any other utterance, we see that it is represented in a similar way.

In [157]:
print(len(trainDataVecs[1000]))
print(trainDataVecs[1000])

25
[-9.12399311e-03 -3.35800022e-01 -2.65363783e-01 -9.92293954e-02
 -1.23498216e-01  8.11171889e-01  1.51711392e+00  7.17526078e-01
 -5.45693994e-01  2.19683975e-01 -2.85960019e-01  3.46198073e-03
 -4.36201954e+00  2.35049993e-01  2.65846014e-01 -5.92328012e-01
  9.66439992e-02 -4.67678875e-01 -5.74586034e-01 -4.27910030e-01
  8.56918376e-03  9.68838036e-02 -7.06506014e-01  7.15071976e-01
 -2.31559929e-02]


Since the vectors are compatible, we can compare them in the same way as we did before for the word2vec embeddings of *cat* and *dog*:

In [158]:
word1_vector=np.array(trainDataVecs[0]).reshape(1, -1)
word2_vector=np.array(trainDataVecs[1000]).reshape(1, -1)
print(cosine_similarity(word1_vector, word2_vector))

[[0.9054803]]


For training, we use the same labels as before:

In [159]:
print(training_labels[0], training_labels[1])

neutral neutral


So now we have a numeric representation of each text, based on the embeddings of the words. We feed this to a classifier in the same way as we did in the previous notebooks with the CountVectorizer output.

Before we can train the classifier, we still need to convert the labels to numeric values as we did before.

Before we do that, it may be good to check which words are not in the embedding model and therefore do not contribute to the representation of the utterance. In the above function, we kept track of the unknown words. Now we can inspect this list. We use the *Counter* function to get a frequency count of these words.

In [160]:
from collections import Counter

unknown_words_count = Counter(unknown_words)
print('Proportion of unknown tokens', len(unknown_words)/(len(unknown_words)+len(known_words)))
print('Number of unknown words',len(unknown_words_count))
print('Number of unknown word tokens:', len(unknown_words))
print('Unknown words counts')
print(unknown_words_count)

Proportion of unknown tokens 0.22693210203618197
Number of unknown words 959
Number of unknown word tokens: 5996
Unknown words counts
Counter({'i\x92m': 580, 'don\x92t': 447, 'it\x92s': 388, '...': 361, 'that\x92s': 299, 'you\x92re': 268, "y'know": 192, 'y\x92know': 188, 'can\x92t': 159, 'we\x92re': 114, 'i\x92ll': 106, 'didn\x92t': 91, 'he\x92s': 89, 'i\x92ve': 86, 'let\x92s': 66, 'she\x92s': 66, 'there\x92s': 66, 'what\x92s': 59, 'they\x92re': 49, '..': 49, 'doesn\x92t': 44, 'i\x92d': 43, '....': 43, '\x91cause': 39, 'we\x92ll': 36, 'wouldn\x92t': 32, 'and-and': 30, 'you\x92ve': 29, 'you\x92ll': 28, 'no-no-no': 25, 'won\x92t': 25, 'haven\x92t': 24, '\x91em': 24, 'wasn\x92t': 23, 'couldn\x92t': 23, 'who\x92s': 23, 'i\x92m-i\x92m': 19, 'isn\x92t': 19, 'shouldn\x92t': 15, 'here\x92s': 15, 'you\x92d': 15, 'doin\x92': 14, 'hey-hey': 14, 'no-no-no-no': 13, 'we-we': 13, 'it\x92s-it\x92s': 13, 'that-that': 13, 'aren\x92t': 13, 'monica\x92s': 12, 'morning\x92s': 12, 'how\x92s': 11, 'would\x92
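Printing the full counter quickly becomes unwieldy. A `Counter` also offers `most_common(n)`, which returns only the `n` most frequent items. A small sketch on a toy list that stands in for `unknown_words`:

```python
from collections import Counter

# Toy list standing in for `unknown_words`
toy_unknown = ["...", "y'know", "...", "and-and", "...", "y'know"]

toy_counts = Counter(toy_unknown)
# most_common(n) returns the n most frequent items as (item, count) pairs
top_two = toy_counts.most_common(2)
print(top_two)
```

On the real data, `unknown_words_count.most_common(20)` would show the twenty most frequent unknown tokens.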

We also kept track of the *known* words, or more precisely of the words that were only found in the model after lowercasing, so let's check these as well:

In [161]:
known_words_count = Counter(known_words)
print('Number of known words',len(known_words_count))
print('Number of known word tokens:', len(known_words))
print('Known words counts')
print(known_words_count)

Number of known words 1398
Number of known word tokens: 20426
Known words counts
Counter({'i': 4141, 'oh': 1217, 'you': 637, 'yeah': 610, 'okay': 603, 'no': 558, 'what': 528, 'hey': 523, 'well': 507, 'so': 327, 'and': 288, 'ross': 251, 'god': 237, 'all': 236, 'joey': 216, 'it': 205, 'that': 191, 'hi': 183, 'but': 182, 'we': 169, 'chandler': 153, 'yes': 142, 'rachel': 137, 'look': 136, 'monica': 135, 'this': 134, 'phoebe': 134, 'why': 132, 'the': 126, 'how': 125, 'he': 117, 'uh': 103, 'do': 100, 'come': 100, 'ok': 98, 'now': 97, 'thank': 97, 'really': 95, 'i-i': 93, 'umm': 91, 'pheebs': 87, 'wow': 87, 'she': 77, 'my': 75, 'are': 74, 'just': 64, 'rach': 64, 'hello': 64, 'thanks': 58, 'ohh': 58, 'here': 55, 'can': 54, 'they': 54, 'good': 53, 'is': 53, 'listen': 51, 'did': 51, 'who': 47, 'there': 47, 'let': 46, 'great': 45, 'wait': 45, 'ooh': 44, 'if': 42, 'not': 42, 'because': 41, 'when': 41, 'right': 40, 'see': 39, 'um': 39, 'go': 39, 'maybe': 39, 'a': 39, 'whoa': 37, 'where': 36, 'do

Just as in the previous notebook, we need to turn the labels into numerical values:

In [207]:
from sklearn import preprocessing
# first we instantiate a label encoder
le = preprocessing.LabelEncoder()
# we feed this encoder the complete list of labels from our data
le.fit(training_labels)
print(list(le.classes_))
training_classes = le.transform(training_labels)
print(list(training_classes[0:20]))

['anger', 'disgust', 'fear', 'joy', 'neutral', 'sadness', 'surprise']
[4, 4, 4, 4, 6, 4, 4, 4, 4, 4, 2, 4, 6, 4, 6, 5, 6, 2, 4, 4]
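Conceptually, the encoder sorts the distinct labels and replaces each label by its position in that sorted list. The plain-Python sketch below mimics this behaviour on a toy list; it is an illustration, not scikit-learn's implementation, and the variable names are assumptions.

```python
labels = ["neutral", "joy", "neutral", "surprise", "anger"]

# Sorted unique labels; the position in this list is the numeric class
classes = sorted(set(labels))
label_to_index = {label: index for index, label in enumerate(classes)}

encoded = [label_to_index[label] for label in labels]
decoded = [classes[index] for index in encoded]
print(classes)
print(encoded)
print(decoded)
```

With the real encoder, `le.inverse_transform` performs the reverse mapping from numbers back to emotion names.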


The next steps are the same as for the previous notebook, except that we pass the embedding representations of the training data.

In [163]:
# Split data into training and test sets
from sklearn.model_selection import train_test_split

### we again use a split of 80% train and 20% test
docs_train, docs_test, y_train, y_test = train_test_split(
    trainDataVecs, # the averaged embedding vector of each utterance
    training_classes, # the category values for each utterance represented as numeric values
    test_size = 0.20 # we use 80% for training and 20% for testing
    ) 


In [164]:
print(y_test)
print(type(y_test))

[0 0 2 ... 3 1 3]
<class 'numpy.ndarray'>


In [165]:
print(docs_train[0:5])

[[-2.1636000e-01 -5.9840000e-01  4.1309002e-01 -7.6799989e-03
  -7.3045099e-01 -2.8516501e-01  8.5350150e-01  6.8652004e-01
  -9.8149502e-01 -2.5839502e-01 -1.4743000e-01 -2.9082000e-01
  -2.4461000e+00 -5.2369499e-01  3.7858999e-01 -2.2097851e-01
  -4.2551702e-01 -5.3156996e-01 -5.5166495e-01  2.1673501e-01
   3.6101121e-01 -3.5664991e-02 -4.6165699e-01  3.5926002e-01
   2.5298998e-01]
 [ 2.1999502e-01 -1.2061520e-01  1.8493812e-01 -2.6136104e-02
  -1.7570910e-01  7.2638109e-02  9.3399107e-01  3.5766679e-01
  -1.5275399e-01  2.5836399e-01  5.0035994e-02 -7.0398413e-02
  -4.0326300e+00 -3.6890402e-01  3.5035007e-02  1.5320063e-03
   2.3311678e-01 -3.4084767e-01 -4.6753740e-01 -3.7266392e-01
   1.3768899e-01  2.2526228e-01  1.7695587e-02  3.5733521e-01
  -2.0955701e-01]
 [-2.6078999e-01  5.9108001e-01  6.1622000e-01 -7.0367998e-01
  -8.5158998e-01 -2.3238000e-01  1.0481000e+00  6.6642001e-02
  -5.4907000e-01  7.0046997e-01 -8.7221003e-01 -1.3954000e-02
  -5.9671001e+00 -4.3105999e-01 -9

## 4. Training and applying the model  <a class="anchor" id ="section4"></a> 

In [166]:
from sklearn import svm
from sklearn.calibration import CalibratedClassifierCV

linear_model = svm.LinearSVC()
svm_linear_clf = CalibratedClassifierCV(linear_model , method='sigmoid', cv=10)
svm_linear_clf.fit(docs_train, y_train)



CalibratedClassifierCV(base_estimator=LinearSVC(C=1.0, class_weight=None,
                                                dual=True, fit_intercept=True,
                                                intercept_scaling=1,
                                                loss='squared_hinge',
                                                max_iter=1000,
                                                multi_class='ovr', penalty='l2',
                                                random_state=None, tol=0.0001,
                                                verbose=0),
                       cv=10, method='sigmoid')

In [173]:
# Predict the labels of the test set with the linear SVM
y_pred_svm_linear = svm_linear_clf.predict(docs_test)

If the data is complex, a non-linear SVM may be preferable. A non-linear SVM uses the kernel trick to separate the classes when the data is not linearly separable. We can initialise such a classifier in the same way as before.
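The default kernel of `svm.SVC` is the RBF kernel, which scores two points by how close they are: `exp(-gamma * ||x - y||^2)`. As a rough illustration of what such a kernel computes, here is a hand-written sketch. The function name `rbf_kernel_value` and the value of `gamma` are assumptions chosen for the example.

```python
import numpy as np

def rbf_kernel_value(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

gamma = 0.5   # illustrative value; scikit-learn's default gamma='scale' derives it from the data
same = rbf_kernel_value([1.0, 2.0], [1.0, 2.0], gamma)   # identical points
near = rbf_kernel_value([1.0, 2.0], [1.0, 3.0], gamma)   # distance 1
far = rbf_kernel_value([1.0, 2.0], [4.0, 6.0], gamma)    # distance 5
print(same, near, far)
```

Identical points score 1 and the score decays towards 0 with distance, so the classifier can draw curved boundaries around groups of nearby training examples.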

In [174]:
nonlinear_model = svm.SVC(probability=True)
svm_nonlinear_clf = CalibratedClassifierCV(nonlinear_model,  method='sigmoid', cv=10)
svm_nonlinear_clf.fit(docs_train, y_train)

CalibratedClassifierCV(base_estimator=SVC(C=1.0, break_ties=False,
                                          cache_size=200, class_weight=None,
                                          coef0=0.0,
                                          decision_function_shape='ovr',
                                          degree=3, gamma='scale', kernel='rbf',
                                          max_iter=-1, probability=True,
                                          random_state=None, shrinking=True,
                                          tol=0.001, verbose=False),
                       cv=10, method='sigmoid')

In [169]:
# Predict the labels of the test set with the non-linear SVM
y_pred_svm_nonlinear = svm_nonlinear_clf.predict(docs_test)

In [170]:
y_pred_svm_nonlinear_proba= svm_nonlinear_clf.predict_proba(docs_test)
print(y_pred_svm_nonlinear_proba)

[[0.05988991 0.01646456 0.02898296 ... 0.51481068 0.04868501 0.25063108]
 [0.12300721 0.02939583 0.02557653 ... 0.53107906 0.07739238 0.03355987]
 [0.16838638 0.01655129 0.02917025 ... 0.51575326 0.04934721 0.19957302]
 ...
 [0.11907268 0.04358776 0.01963189 ... 0.51676135 0.0490766  0.07569682]
 [0.12051742 0.04387402 0.01992854 ... 0.52771971 0.08041836 0.03196055]
 [0.03609401 0.01718546 0.03027797 ... 0.53499558 0.12118331 0.09195121]]


## 5. Generating the test report  <a class="anchor" id ="section5"></a> 

In [175]:
from sklearn.metrics import classification_report

report = classification_report(y_test,y_pred_svm_linear,digits = 7)
print(le.classes_)
print('SVM LINEAR ----------------------------------------------------------------')
print(report)

['anger' 'disgust' 'fear' 'joy' 'neutral' 'sadness' 'surprise']
SVM LINEAR ----------------------------------------------------------------
              precision    recall  f1-score   support

           0  0.4230769 0.0940171 0.1538462       234
           1  0.5000000 0.0384615 0.0714286        52
           2  0.0000000 0.0000000 0.0000000        53
           3  0.5130435 0.3361823 0.4061962       351
           4  0.5345477 0.9072495 0.6727273       938
           5  0.0000000 0.0000000 0.0000000       132
           6  0.3949580 0.1974790 0.2633053       238

    accuracy                      0.5205205      1998
   macro avg  0.3379466 0.2247699 0.2239291      1998
weighted avg  0.4506927 0.5205205 0.4384254      1998





In [176]:
report = classification_report(y_test,y_pred_svm_nonlinear,digits = 7)
print(le.classes_)
print('SVM NONLINEAR ----------------------------------------------------------------')
print(report)

['anger' 'disgust' 'fear' 'joy' 'neutral' 'sadness' 'surprise']
SVM NONLINEAR ----------------------------------------------------------------
              precision    recall  f1-score   support

           0  0.4117647 0.0299145 0.0557769       234
           1  0.0000000 0.0000000 0.0000000        52
           2  0.0000000 0.0000000 0.0000000        53
           3  0.4791667 0.3276353 0.3891709       351
           4  0.5366889 0.9434968 0.6841902       938
           5  0.0000000 0.0000000 0.0000000       132
           6  0.4891304 0.1890756 0.2727273       238

    accuracy                      0.5265265      1998
   macro avg  0.2738215 0.2128746 0.2002665      1998
weighted avg  0.4426265 0.5265265 0.4285937      1998
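To read these reports, it helps to recompute the summary rows by hand. The sketch below takes the per-class F1 scores and support counts from the linear SVM report and derives the macro average, the weighted average, and the accuracy of a hypothetical baseline that always predicts the largest class ('neutral'). The variable names are assumptions.

```python
# Per-class F1 and support copied from the linear SVM report above
f1 = [0.1538462, 0.0714286, 0.0, 0.4061962, 0.6727273, 0.0, 0.2633053]
support = [234, 52, 53, 351, 938, 132, 238]

total = sum(support)
macro_f1 = sum(f1) / len(f1)                                   # every class counts equally
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / total  # classes weighted by size
majority_baseline = max(support) / total                       # always predicting 'neutral'

print(round(macro_f1, 4), round(weighted_f1, 4), round(majority_baseline, 4))
```

The majority baseline already reaches an accuracy of about 0.47, so the reported accuracy of about 0.52 is only a modest gain. The low macro average shows that the small classes are hardly recognised.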



Remember the results from the notebook in which we trained Naive Bayes and SVM classifiers with one-hot encodings of the words? Take some time to compare the results and think about the differences.

We show some screenshots here:


* One-hot-encoding, frequency = 1, 2000 training documents

![one-hot-token-results-nb-svm-2000-f=1.png](attachment:one-hot-token-results-nb-svm-2000-f=1.png)

* One-hot-encoding, frequency = 5, 2000 training documents

![one-hot-token-results-nb-svm-2000-f=5.png](attachment:one-hot-token-results-nb-svm-2000-f=5.png)

## 6. Applying the model to new data  <a class="anchor" id ="section6"></a> 

We would like to apply the embedding-based model to our own data, but this works a bit differently: we cannot simply use the 'transform' function to represent the utterances with the one-hot vector representation of the training vocabulary.

What we need to do is create an embedding representation using the same functions we used above, and assume that our classifier finds sufficient similarity between the embeddings of our data and those of the relevant training data.

We use the same set of utterances as before.

In [96]:
# some utterances
some_chat = ['That is sweet of you', 
               'You are so funny', 
               'Are you a man or a woman?', 
               'Chatbots make me sad and feel lonely.', 
               'Your are stupid and boring.', 
               'Two thumbs up', 
               'I fell asleep halfway through this conversation', 
               'Wow, I am really amazed.', 
               'You are amazing.']


len(some_chat)

9

Next, we define the list of labels that go with our chat.

In [219]:
some_chat_emotions = ['joy', 'joy', 'neutral', 'sadness', 'anger', 'joy', 'anger', 'surprise', 'joy']

We use the LabelEncoder *le* to convert this list into a numpy array of numeric class indices:

In [220]:
print('labels',le.classes_)
some_chat_labels = le.transform(some_chat_emotions)
print(some_chat_labels)

labels ['anger' 'disgust' 'fear' 'joy' 'neutral' 'sadness' 'surprise']
[3 3 4 5 0 3 0 6 3]


In [127]:
some_chat_tokens = []
for utterance in some_chat:
    some_chat_tokens.append(nltk.tokenize.word_tokenize(utterance))

# same function and arguments as for the training data
some_chat_embedding_vectors = getAvgFeatureVecs(some_chat_tokens, stop_words, word_embedding_model, index2word_set, num_features)
#### If no token of an utterance is found in the model, the average is NaN. The next numpy function turns NaN values into 0
some_chat_embedding_vectors = np.nan_to_num(some_chat_embedding_vectors)

Review 0 of 9




In [135]:
# have the classifier make predictions for our own utterances
some_chat_pred = svm_linear_clf.predict(some_chat_embedding_vectors)
print('System predictions', some_chat_pred)
print('Gold labels', some_chat_labels)
for review, predicted_label in zip(some_chat, some_chat_pred):
    print('%s => %s' % (review, le.classes_[predicted_label]))


System predictions [3 3 4 4 4 4 4 4 3]
Gold labels [3 3 4 5 0 3 0 6 3]
That is sweet of you => joy
You are so funny => joy
Are you a man or a woman? => neutral
Chatbots make me sad and feel lonely. => neutral
Your are stupid and boring. => neutral
Two thumbs up => neutral
I fell asleep halfway through this conversation => neutral
Wow, I am really amazed. => neutral
You are amazing. => joy


In [136]:
some_chat_pred_probabilities = svm_linear_clf.predict_proba(some_chat_embedding_vectors)
print(some_chat_pred_probabilities)

[[0.05566912 0.03380374 0.0149815  0.30998191 0.25843324 0.04503507
  0.28209541]
 [0.14537836 0.04532041 0.02102572 0.34213409 0.13950034 0.04908878
  0.25755229]
 [0.114386   0.05677569 0.03045086 0.14124956 0.51209787 0.07597918
  0.06906083]
 [0.13183531 0.08699858 0.04358781 0.26563304 0.34602976 0.0825597
  0.0433558 ]
 [0.14198635 0.06169215 0.02710553 0.08070763 0.57224455 0.03777203
  0.07849176]
 [0.27403921 0.02139631 0.03342049 0.0716617  0.53701941 0.0357632
  0.02669968]
 [0.24249596 0.04499757 0.078444   0.05729908 0.46323222 0.08313668
  0.03039449]
 [0.0375989  0.03000299 0.02195037 0.21628631 0.60877046 0.05640656
  0.02898442]
 [0.06568752 0.0380387  0.00887373 0.58062801 0.20605358 0.02699314
  0.07372531]]
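Each row of this output holds one probability per class, in the order of `le.classes_`, and the predicted label is the class with the highest probability. The sketch below checks this for the first three rows of the output above. It also computes the margin between the two best classes as a rough, informal indication of how certain a prediction is; the variable names are assumptions.

```python
import numpy as np

classes = np.array(['anger', 'disgust', 'fear', 'joy', 'neutral', 'sadness', 'surprise'])

# First three rows of the probability output above
probabilities = np.array([
    [0.05566912, 0.03380374, 0.0149815, 0.30998191, 0.25843324, 0.04503507, 0.28209541],
    [0.14537836, 0.04532041, 0.02102572, 0.34213409, 0.13950034, 0.04908878, 0.25755229],
    [0.114386, 0.05677569, 0.03045086, 0.14124956, 0.51209787, 0.07597918, 0.06906083],
])

best = probabilities.argmax(axis=1)                 # index of the highest score per row
top_two = np.sort(probabilities, axis=1)[:, -2:]    # two highest scores per row, ascending
margin = top_two[:, 1] - top_two[:, 0]              # small margin = uncertain prediction

for label, m in zip(classes[best], margin):
    print(f"{label}: margin {m:.3f}")
```

For the first utterance, *joy* only narrowly beats *surprise*, while the *neutral* prediction for the third utterance is much more clear-cut.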


Using *Pandas* we can nicely visualise the results. 

In [137]:
# map the numeric predictions and gold labels back to emotion names
some_chat_pred_labels = []
for predicted_label in some_chat_pred:
    some_chat_pred_labels.append(le.classes_[predicted_label])

some_chat_gold_labels = []
for gold_label in some_chat_labels:
    some_chat_gold_labels.append(le.classes_[gold_label])

# one row per utterance, with the class probabilities as percentages
result_frame = pd.DataFrame(some_chat_pred_probabilities*100, columns=le.classes_)
result_frame['Chat']=some_chat
result_frame['Prediction']=some_chat_pred_labels
result_frame['Gold']=some_chat_gold_labels

result_frame

| | anger | disgust | fear | joy | neutral | sadness | surprise | Chat | Prediction | Gold |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.566912 | 3.380374 | 1.49815 | 30.998191 | 25.843324 | 4.503507 | 28.209541 | That is sweet of you | joy | joy |
| 1 | 14.537836 | 4.532041 | 2.102572 | 34.213409 | 13.950034 | 4.908878 | 25.755229 | You are so funny | joy | joy |
| 2 | 11.4386 | 5.677569 | 3.045086 | 14.124956 | 51.209787 | 7.597918 | 6.906083 | Are you a man or a woman? | neutral | neutral |
| 3 | 13.183531 | 8.699858 | 4.358781 | 26.563304 | 34.602976 | 8.25597 | 4.33558 | Chatbots make me sad and feel lonely. | neutral | sadness |
| 4 | 14.198635 | 6.169215 | 2.710553 | 8.070763 | 57.224455 | 3.777203 | 7.849176 | Your are stupid and boring. | neutral | anger |
| 5 | 27.403921 | 2.139631 | 3.342049 | 7.16617 | 53.701941 | 3.57632 | 2.669968 | Two thumbs up | neutral | joy |
| 6 | 24.249596 | 4.499757 | 7.8444 | 5.729908 | 46.323222 | 8.313668 | 3.039449 | I fell asleep halfway through this conversation | neutral | anger |
| 7 | 3.75989 | 3.000299 | 2.195037 | 21.628631 | 60.877046 | 5.640656 | 2.898442 | Wow, I am really amazed. | neutral | surprise |
| 8 | 6.568752 | 3.80387 | 0.887373 | 58.062801 | 20.605358 | 2.699314 | 7.372531 | You are amazing. | joy | joy |


In [138]:
report = classification_report(some_chat_labels,some_chat_pred,digits = 7)
print(le.classes_)
print('SVM LINEAR ----------------------------------------------------------------')
print(report)

['anger' 'disgust' 'fear' 'joy' 'neutral' 'sadness' 'surprise']
SVM LINEAR ----------------------------------------------------------------
              precision    recall  f1-score   support

           0  0.0000000 0.0000000 0.0000000         2
           3  1.0000000 0.7500000 0.8571429         4
           4  0.1666667 1.0000000 0.2857143         1
           5  0.0000000 0.0000000 0.0000000         1
           6  0.0000000 0.0000000 0.0000000         1

    accuracy                      0.4444444         9
   macro avg  0.2333333 0.3500000 0.2285714         9
weighted avg  0.4629630 0.4444444 0.4126984         9





## 7. Saving the classifier to disk

Just as with the previous notebook, you can save the emotion classification model to disk and load the model some other time. Note that you need to load the same word2vec model as well to represent any text input with vector representations that are compatible.

In [221]:
import pickle

# save the classifier to disk
filename_classifier = 'svm_nonlinear_clf_embeddings.sav'
with open(filename_classifier, 'wb') as outfile:
    pickle.dump(svm_nonlinear_clf, outfile)

# some time later...

# load the classifier from disk
with open(filename_classifier, 'rb') as infile:
    loaded_classifier = pickle.load(infile)
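Because the classifier only works with vectors from the embedding model it was trained on, one option is to store the name of that model and the number of features together with the classifier. The sketch below shows this with a placeholder object instead of the trained classifier. The bundle layout and the file name are hypothetical, not an established convention.

```python
import os
import pickle
import tempfile

# Placeholder standing in for the trained classifier (any picklable object works)
classifier_placeholder = {"kind": "placeholder for svm_nonlinear_clf"}

# Hypothetical bundle layout: the keys are an illustration, not a convention
bundle = {
    "classifier": classifier_placeholder,
    "embedding_model": "glove-twitter-25",
    "num_features": 25,
}

path = os.path.join(tempfile.mkdtemp(), "emotion_bundle.sav")
with open(path, "wb") as outfile:      # 'with' closes the file, also on errors
    pickle.dump(bundle, outfile)

with open(path, "rb") as infile:
    restored = pickle.load(infile)

print(restored["embedding_model"], restored["num_features"])
```

After loading, the stored model name tells you which embeddings to load before representing new text.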


# End of this notebook