In [1]:
import json
import time
from stop_words import get_stop_words
from gensim.models.doc2vec import TaggedDocument,Doc2Vec
from textblob import TextBlob
import pandas as pd
import numpy as np
from scipy import spatial,stats
import matplotlib.pyplot as plt
import csv

<h2>Text Data</h2>

Read the ISEAR emotion dataset into a pandas DataFrame. Source: http://emotion-research.net/toolbox/toolboxdatabase.2006-10-13.2581092615

In [178]:
df_isear = pd.read_csv('ISEAR1.csv',index_col=False)

In [179]:
df_isear.head()

Unnamed: 0,emotion,text
0,joy,On days when I feel close to my partner and ot...
1,fear,Every time I imagine that someone I love or I ...
2,anger,When I had been obviously unjustly treated and...
3,sadness,When I think about the short time that we live...
4,disgust,At a gathering I found myself involuntarily si...


<h2>Data Preparation and Doc2Vec Model Creation</h2>

Define functions to process text and create a doc2vec model.

The first function takes a collection of articles (strings), trains a doc2vec model on them, and saves the model to disk.

In [4]:
def create_doc2vec_model(articles,name,vector_size=100,epochs=10,lang='en'):
    # load the stop words for the model's language
    stop_words = get_stop_words(lang)
    # lower-case each article, split on whitespace, and drop stop words
    nostop = [[i for i in doc.lower().split() if i not in stop_words] for doc in articles]
    # wrap each token list in a TaggedDocument tagged with its index, e.g. TaggedDocument(['token1','token2'],[1])
    tagged = [TaggedDocument(doc,[i]) for i,doc in enumerate(nostop)]
    # instantiate the model: vector_size = dimensions per document vector, min_count=2 drops words seen fewer than twice, epochs = training passes over the corpus
    model = Doc2Vec(vector_size=vector_size, min_count=2, epochs=epochs)
    # build the vocabulary from all tagged docs
    model.build_vocab(tagged)
    # train on the tagged docs; total_examples is the number of docs
    model.train(tagged,total_examples=model.corpus_count,epochs=epochs)
    # save under a name-based filename, e.g. name='es' gives esmodel.model for Spanish docs
    model_name = name + 'model.model'
    model.save(model_name)
    # frees training-only state on the in-memory model; this runs after save(), so the saved file is unaffected
    model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)
    print('saved as: ' + model_name)
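
To make the preprocessing step concrete, here is a minimal pure-Python sketch of what happens to each article before training. The `STOP_WORDS` set and the `tokenize` helper are hypothetical stand-ins invented for this illustration; the notebook itself uses `get_stop_words(lang)` and gensim's `TaggedDocument`.

```python
# Hypothetical stop-word set for illustration only; the notebook calls
# get_stop_words('en') from the stop_words package instead.
STOP_WORDS = {"i", "the", "a", "to", "and", "when", "my"}

def tokenize(doc, stop_words=STOP_WORDS):
    """Lower-case, split on whitespace, and drop stop words."""
    return [tok for tok in doc.lower().split() if tok not in stop_words]

docs = ["When I lost my keys", "I went to the party and danced"]

# Pair each token list with its position in the corpus, mirroring
# TaggedDocument(words, [i]) in create_doc2vec_model.
tagged = [(tokens, [i]) for i, tokens in enumerate(tokenize(d) for d in docs)]
print(tagged)
```

Because the split is on whitespace only, punctuation stays attached to tokens, so `"keys."` and `"keys"` would be treated as different words.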

Infers a vector for each article (string) in a list, using the doc2vec model created previously.
Input: ['string1','string2','string3']

In [3]:
def text2vec(textlist,lang,d2v_model):
    stop_words = get_stop_words(lang)
    histnostop = [[i for i in doc.lower().split() if i not in stop_words] for doc in textlist]
    dlhist_tagged = [TaggedDocument(doc,[i]) for i,doc in enumerate(histnostop)]
    ## infer vectors from the current doc2vec model
    vecs = [d2v_model.infer_vector(doc.words) for doc in dlhist_tagged]
    return vecs

<h2>Creates and Loads Doc2Vec Model</h2>

Creates a doc2vec model from just the text of the dataframe. 

In [91]:
emot_text = df_isear['text']

In [92]:
create_doc2vec_model(emot_text,'emotions1',vector_size=100,epochs=10,lang='en')

saved as: emotions1model.model


In [93]:
d2v_model = Doc2Vec.load('emotions1model.model')

<h2>Text Item Vectors</h2>

Get the vector for each text item in the dataframe from the doc2vec model just created and add the vectors to the dataframe.

In [94]:
emot_vecs = [d2v_model.docvecs[x] for x in range(len(emot_text))]
len(emot_vecs)

7516

In [95]:
df_isear['emot_vecs'] = emot_vecs

<h2>Group Vectors by Emotion</h2>

Group the dataframe by emotion (anger, joy, sadness, etc.).

In [96]:
emotions_grp = list(df_isear.groupby(df_isear['emotion']))

<h2>Calculate Vector Mean for Each Emotion Group</h2>

Get vector means (centers) for each emotion group. This represents the central location in the vector space for each emotion. The closer a text vector representation is to a given emotion group mean, the more the given text is characterized by the emotion the mean represents. 

Example: the closer a given sentence is to the 'anger' vector center, the more that sentence is characterized by anger. I purposely use the word 'characterize' because the model can capture a wide variety of relations to the emotion: expressions, descriptions, evaluations, etc.

A characterization of an emotion covers all 3 of the following cases:

An expression of anger - "I hate that no good bastard and I want to bash his head in."
A description of anger - "When she told him she cheated, he banged his fist against the wall and gritted his teeth."
An evaluation of someone else's anger - "I was so disappointed when he cursed and stormed out of the room when given the news."

Another viewpoint sees the emotion 'score' as the amount of emotion 'content' in the sentence, without necessarily being an expression of that emotion. The third sentence, for instance, is an expression of disappointment, though it has high anger 'content.' A sentence with a high anger 'score,' therefore, doesn't mean the sentence itself is particularly angry.

We can then evaluate a sentence for its emotional 'content' by calculating the distance of the sentence vector from each of the emotion vector means.
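
The idea of scoring a sentence against emotion centers can be sketched in pure Python. The 2-D vectors below are invented toy values (real doc2vec vectors here have 100 dimensions), and `mean_vector` and `cosine_similarity` are hypothetical helpers standing in for the pandas and scipy calls used in the notebook.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1 - cosine distance)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy document vectors grouped by emotion label (illustrative values only).
groups = {
    "anger": [[1.0, 0.0], [0.8, 0.2]],
    "joy": [[0.0, 1.0], [0.2, 0.8]],
}

# One center (mean vector) per emotion group.
centers = {emotion: mean_vector(vecs) for emotion, vecs in groups.items()}

# Score a new sentence vector against every center.
sentence_vec = [0.9, 0.3]
scores = {emotion: cosine_similarity(sentence_vec, ctr)
          for emotion, ctr in centers.items()}
closest = max(scores, key=scores.get)
print(scores, closest)
```

In this toy setup the sentence vector points mostly along the 'anger' direction, so the anger center gets the higher score.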

In [97]:
emot_vec_mean = [{'emotion':emotions_grp[x][0],'vec':emotions_grp[x][1]['emot_vecs'].mean()} for x in range(len(emotions_grp))]

In [98]:
emot_vec_mean1 = [emot_vec_mean[x]['vec'] for x in range(len(emot_vec_mean))]

In [99]:
emot_vecs_ctr = [x['vec'] for x in emot_vec_mean]
emot_label_ctr = [x['emotion'] for x in emot_vec_mean]

<h2>Create .tsv files</h2>

Functions for creating .tsv files for use in the TensorFlow Embedding Projector.

In [100]:
## outputs and saves a tsv file for a list of vectors
def output_tab_vecs(vecs,filename):
    csv.register_dialect('tabDialect', delimiter='\t', quoting=csv.QUOTE_NONE)
    # newline='' lets the csv module control line endings
    with open(filename, 'w', newline='') as myFile:
        writer = csv.writer(myFile, dialect='tabDialect')
        writer.writerows(vecs)
    print('saved tab file as',filename)

In [101]:
## outputs and saves a tsv file for a list of meta tags (2 columns or more)
def output_tab_meta(meta,filename):
    csv.register_dialect('tabDialect', delimiter='\t', quoting=csv.QUOTE_NONE,escapechar='\\')
    # newline='' lets the csv module control line endings
    with open(filename, 'w', newline='') as myFile:
        writer = csv.writer(myFile, dialect='tabDialect')
        writer.writerows(meta)
    print('saved tab file as',filename)

In [102]:
## outputs and saves a tsv file for a list of single column meta tags
def output_single_meta(metalist,filename):
    with open (filename, 'w') as fo:
        for d in metalist:
            fo.write(str(d) + '\n')
    print('saved tab file as',filename)

Output .tsv files for the vector means of each emotion and their labels, as well as the vectors for each sentence and their labels (emotion, sentence text).

In [104]:
output_tab_vecs(emot_vecs_ctr,'emot_vecs_means.tsv')

saved tab file as emot_vecs_means.tsv


In [105]:
output_single_meta(emot_label_ctr,'emot_labels_means.tsv')

saved tab file as emot_labels_means.tsv


In [47]:
output_tab_vecs(emot_vecs,'emot_vecs_sent.tsv')

saved tab file as emot_vecs_sent.tsv


In [103]:
meta_list = list(zip(df_isear['emotion'],df_isear['text']))
output_tab_meta(meta_list,'emot_labels_sent.tsv')

saved tab file as emot_labels_sent.tsv


<h2>TensorFlow Embedding Projector</h2>

The links below use data previously generated in this notebook and hosted on GitHub. To use your own data, generate the required .tsv files with the functions above, then upload them to the Embedding Projector with the 'Load Data' button, choosing the appropriate file for each slot (vectors or metadata).
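
As a rough guide to the file layout the projector expects, the sketch below builds both files in memory. To my understanding, the vectors file has no header row, while a metadata file with more than one column has its first row read as column labels, which is what makes a 'color by' choice such as 'emotion' available. The `rows_to_tsv` helper and the sample rows are assumptions for illustration, not part of the notebook.

```python
import csv
import io

def rows_to_tsv(rows, header=None):
    """Serialize rows as tab-separated text, optionally with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    if header is not None:
        writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

# Illustrative data: two 3-D vectors and their (emotion, text) metadata.
vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
metadata = [("joy", "I passed the exam"), ("fear", "I heard a noise")]

vec_tsv = rows_to_tsv(vectors)                               # no header row
meta_tsv = rows_to_tsv(metadata, header=("emotion", "text"))  # header row first
print(vec_tsv)
print(meta_tsv)
```

The row order must match between the two files, since the projector pairs the n-th metadata row with the n-th vector.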

<h3>All Sentences</h3>

Select 'emotion' for 'color by' so points will be colored by the emotion label.

https://projector.tensorflow.org/?config=https://gist.githubusercontent.com/escottgoodwin/f3728570bf3c7e13a750dd93117053ac/raw/cdeb31000cf4235f297e7315faf35b9a8b398ece/emot_vec2_projector_config.json

<h3>Emotion Vector Means</h3>

Select 'emotion' for 'color by' so points will be colored by the emotion label.

https://projector.tensorflow.org/?config=https://gist.githubusercontent.com/escottgoodwin/1a8e0715f2c6029e835f4a6e315bdd92/raw/904d094e39c443bcae221a247e023c7a30620337/emotion_centers_projector_config.json

<h2>Emotional Content Predictions</h2>

Predicts the emotional content of a list of text articles (strings).

<b>Outputs:</b>

1 - score for each of the 7 emotions, computed as cosine similarity (1 - cosine distance) to each emotion vector mean, e.g. anger: .84, joy: .21, sadness: .65 (returned both as a dict keyed by emotion and as a plain list)

2 - dominant emotion, returned as an index into the emotion list (to compare to the original ISEAR dataset, which has 1 emotion per sentence), e.g. the index for anger

3 - ranks of each of the 7 emotions, ascending by score, so the most dominant emotion has rank 7, e.g. joy: 1, sadness: 5, anger: 7

4 - inferred vectors for each item in the list of text articles

In [168]:
def predict_emotions(text_list,lang,emot_vec_mean,model_name):
    d2v_model = Doc2Vec.load(model_name)
    text_vecs = text2vec(text_list,lang,d2v_model)
    emot_scan = [{x['emotion']:1 - spatial.distance.cosine(sample_vec,x['vec']) for x in emot_vec_mean} for sample_vec in text_vecs]
    emot_scan1 = [[1 - spatial.distance.cosine(sample_vec,x['vec']) for x in emot_vec_mean] for sample_vec in text_vecs]
    emot_pred = [np.argmax(x) for x in emot_scan1]
    ranks = [stats.rankdata(x) for x in emot_scan1]
    emotions = [x['emotion'] for x in  emot_vec_mean]
    emot_ranks = []
    for x in range(len(ranks)):
        emot_rank = list(zip(emotions,ranks[x]))
        emot_ranks.append(emot_rank)
    return emot_scan,emot_scan1,emot_pred,emot_ranks,text_vecs

In [172]:
sample_sent = ['You mean so much to me and I love you.','How could you do that you bastard. I cannot stand you!'] #sentences must be in a list
emot_scan,emot_scan1,emot_pred,ranks,text_vecs = predict_emotions(sample_sent,'en',emot_vec_mean,'emotions1model.model')

In [174]:
emot_scan

[{'anger': 0.9852735996246338,
  'disgust': 0.985318660736084,
  'fear': 0.9851565361022949,
  'guilt': 0.9853554964065552,
  'joy': 0.9850506782531738,
  'sadness': 0.9850685000419617,
  'shame': 0.9853231906890869},
 {'anger': 0.9795728921890259,
  'disgust': 0.9795154929161072,
  'fear': 0.9794973134994507,
  'guilt': 0.9795477986335754,
  'joy': 0.9796198010444641,
  'sadness': 0.9794784188270569,
  'shame': 0.9796500205993652}]

In [175]:
ranks


[[('anger', 4.0),
  ('disgust', 5.0),
  ('fear', 3.0),
  ('guilt', 7.0),
  ('joy', 1.0),
  ('sadness', 2.0),
  ('shame', 6.0)],
 [('anger', 5.0),
  ('disgust', 3.0),
  ('fear', 2.0),
  ('guilt', 4.0),
  ('joy', 6.0),
  ('sadness', 1.0),
  ('shame', 7.0)]]
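
The ranks shown above follow directly from the scores. The sketch below reproduces them for the first sample sentence without scipy; `rank_scores` is a hypothetical helper that assumes no tied scores (`stats.rankdata` averages ties by default, which this simple version does not).

```python
# Similarity scores for the first sample sentence, copied from the output above.
scores = {
    "anger": 0.9852735996246338,
    "disgust": 0.985318660736084,
    "fear": 0.9851565361022949,
    "guilt": 0.9853554964065552,
    "joy": 0.9850506782531738,
    "sadness": 0.9850685000419617,
    "shame": 0.9853231906890869,
}

def rank_scores(scores):
    """Rank from 1 (lowest score) to N (highest); assumes no tied scores."""
    ordered = sorted(scores, key=scores.get)
    return {emotion: position
            for position, emotion in enumerate(ordered, start=1)}

ranks = rank_scores(scores)
dominant = max(scores, key=scores.get)            # highest-scoring emotion
spread = max(scores.values()) - min(scores.values())
print(ranks)
print(dominant, spread)
```

Printing `spread` alongside the ranks is a useful sanity check: when all seven scores lie very close together, the rank order rests on very small differences.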

<h2>Emotions Related By Distance</h2>

One nice feature of modeling the emotional content of the text in the vector space this way is that it takes into account that certain emotions are related (or near in vector space) to one another. This is an improvement upon an LDA-style model that would simply give percentages of emotional content that add up to 100%, e.g. anger: 25%, joy: 5%.

Of course, one could transform the scores into percentages, but the accuracy of the characterization might be lost. 
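
A minimal sketch of that transformation, assuming simple sum-normalization (one of several possible choices), is shown here. The input scores are the second sample sentence's similarities from the prediction output earlier in this notebook, and `to_percentages` is a hypothetical helper.

```python
# Similarity scores for the second sample sentence, copied from the
# predict_emotions output earlier in the notebook.
scores = {
    "anger": 0.9795728921890259,
    "disgust": 0.9795154929161072,
    "fear": 0.9794973134994507,
    "guilt": 0.9795477986335754,
    "joy": 0.9796198010444641,
    "sadness": 0.9794784188270569,
    "shame": 0.9796500205993652,
}

def to_percentages(scores):
    """Rescale scores so they sum to 100 (assumes all scores are positive)."""
    total = sum(scores.values())
    return {emotion: 100.0 * value / total for emotion, value in scores.items()}

percentages = to_percentages(scores)
for emotion, pct in percentages.items():
    print(f"{emotion}: {pct:.3f}%")
```

With scores this close together, every emotion lands near one seventh of the total, which illustrates how forcing the values to sum to 100% can hide the geometric relationships the raw similarities carry.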

In this model, the vector means for disgust and anger are near one another, so one would expect that a text that 'scored' high for anger would also 'score' high for disgust.

How accurate the relations (or distances) depicted by the model are is another matter. Anger and disgust being so close makes a certain amount of sense. However, joy and sadness being so close together and shame and guilt being so far apart make less sense. Is this a fault of the modeling method or a peculiarity of the dataset?

<center><h4>Emotion Vector Centers</h4></center>

<img src="screenshotemotvec.png">

<h2>Emotion Cluster Separation</h2>

The following functions explore how well the various emotion vector means are separated by looking at the cosine distances between the vector means and the text vectors in each of the emotion groups.

We calculate the distances between the vector mean for anger and all the text vectors in the anger group. We do the same for the anger vector mean and all the text vectors in the six other groups (anger mean and disgust text vectors, anger mean and sadness text vectors, etc.).

In [118]:
def emotion_ctr_dist(emot,ctr):
    emot_ser = df_isear[df_isear['emotion']==emot]['emot_vecs']
    emot_dist = np.array([spatial.distance.cosine(ctr,x) for x in emot_ser])
    print(emot)
    print(stats.describe(emot_dist))
    plt.hist(emot_dist)
    plt.show()

In [119]:
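def ctr_dist_compare(emot_list,ctr_list):
    # note: ctr_list is currently unused; the centers and their labels are read from the global emot_vec_mean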
    for emotion in emot_list:
        print('main emotion',emotion)
        for ctr in emot_vec_mean:
            print(ctr['emotion'])
            emotion_ctr_dist(emotion,ctr['vec'])

In [None]:
ctr_dist_compare(emot_label_ctr,emot_vecs_ctr)
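
As a compact numeric complement to the histograms, one could compare the mean cosine distance from a single center to its own group against the mean distance to another group. The sketch below does this in pure Python with invented 2-D toy vectors; `cosine_distance`, `groups`, and `anger_center` are illustrative assumptions, not values from the ISEAR model.

```python
import math
from statistics import mean

def cosine_distance(a, b):
    """1 - cosine similarity, matching scipy.spatial.distance.cosine."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy text vectors per emotion group (illustrative values only).
groups = {
    "anger": [[1.0, 0.1], [0.9, 0.2]],
    "joy": [[0.1, 1.0], [0.2, 0.9]],
}
anger_center = [0.95, 0.15]  # mean of the two toy anger vectors

# Mean distance from the anger center to the members of each group.
mean_dist = {
    emotion: mean([cosine_distance(anger_center, vec) for vec in vecs])
    for emotion, vecs in groups.items()
}
print(mean_dist)
```

A well-separated cluster would show a clearly smaller mean distance to its own group than to the others; similar values across groups would suggest the centers do not discriminate well.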