# Synopsis

So far we have essentially only learned how to parse text and count the words in it (doesn't sound like much, huh? But that alone covers a large share of basic textual analysis). In this unit we will go a bit further and cover:

1. Preparing text for further analysis
2. Analyzing sentiment

We will also talk about how difficult advanced analysis of unstructured text is despite its appearance as an 'easy' task.

# Read libraries

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

from colorama import Back, Fore, Style
from pathlib import Path
from sys import path

path.append('../My_libraries')
path

In [None]:
my_fontsize = 15

In [None]:
import re

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
import pandas as pd

from collections import Counter
from random import random
from string import punctuation, whitespace

from My_libraries.my_stats import half_frame

# Text emotional valence

The notebook about the basics of text analysis illustrated some of the kinds of analyses that one can perform on text corpora.  The focus there was on simple calculations based on patterns of occurrence of word tokens.  Probably not very satisfying for English or Psychology majors...

In this notebook, we are going to skim the surface of a different type of analysis: whether the text has positive, neutral, or negative **valence**.  This is typically referred to as **sentiment analysis**. The idea is that while most words are neutral, some words convey valence in a polarized manner.  

*Sadness*, *anger*, *despair* convey a very different emotion from *happiness*, *laughter*, or *brightness*.

This realization has been formalized by asking many subjects to rate the valence of different words.  The aggregated ratings were then structured into lists where words are given a valence score.

Fortunately for us, such work is summarized in data files we can easily access.

In [None]:
afinn_folder = Path.cwd() / 'Data' / 'AFINN'

print( list( afinn_folder.glob('*') ) )
print()

valences_file_path = afinn_folder / 'AFINN-111.txt'
with open(valences_file_path, 'r', encoding = 'utf-8') as file_in:
    valence_data = file_in.readlines() 

print(len(valence_data))
print()

print(valence_data[:10])

That needs a little processing, right?

In [None]:
valence_dict  = {}
for line in valence_data:
    token, valence = line.split()
    print(token, valence)
    valence_dict[token.strip()] = float(valence.strip())
    print(valence_dict)
    
    break

Looking good.

Let's remove the `print` and `break` statements 

In [None]:
valence_dict  = {}
for line in valence_data:
    token, valence = line.split()
    valence_dict[token.strip()] = float(valence.strip())
    
print(len(valence_dict))    

Damn! Some lines do not have just two parts!

In [None]:
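valence_dict  = {}
for line in valence_data:
    parts = line.split()
    if len(parts) == 2:
        token, valence = parts
        # Only store entries that split cleanly into token and valence
        valence_dict[token] = float(valence)
    else:
        # Show the lines that do not split into exactly two parts
        print(parts)
    
print(len(valence_dict))    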

**Damn!!! Damn!!!**

Some of the tokens are not single words.

What do you say if we ignore them for now?  We can still read them by noting that a `\t` character separates the token from the valence, but we will pretend they are not there for the rest of the work. 

In [None]:
valence_dict  = {}
for line in valence_data:
    token, valence = line.strip().split('\t')
    valence_dict[token.strip()] = float(valence.strip())
    
print(len(valence_dict)) 
print()


Some words for sure are in this `dictionary`, right?

In [None]:
valence_dict['hate']

In [None]:
valence_dict['happy']

In [None]:
valence_dict['screwed up']
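
Splitting at the tab keeps multi-word tokens such as *screwed up* as dictionary keys. If we would rather set them aside explicitly, a minimal sketch could sort the tokens into two dictionaries. The sample lines below are toy stand-ins for `valence_data` with illustrative scores, and the names `single_word` and `multi_word` are made up for this example.

```python
# Toy stand-in for `valence_data`: tab-separated lines, scores are illustrative
sample_lines = ["happy\t3\n", "hate\t-3\n", "screwed up\t-3\n", "not good\t-2\n"]

single_word = {}
multi_word = {}
for line in sample_lines:
    # The tab separates the token from its valence, even if the token has spaces
    token, valence = line.rstrip("\n").split("\t")
    target = multi_word if " " in token else single_word
    target[token] = float(valence)

print(single_word)
print(multi_word)
```

Keeping the multi-word entries in their own dictionary means they are not lost, just out of the way of a word-by-word lookup.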

# A detour for creating our own library

I would like us to work with the play **Othello, the Moor of Venice**. We could go back to the *basics* notebook but that is silly. Why would we want to keep a large number of versions of the same code across many notebooks?

For one, improvements made in one copy do not transfer to all the other copies.

For another, opening the other notebook and copying and pasting the code is annoying and creates issues if we forget to load some needed library.

So, how to solve this?

**Well, we will create our own library!!!!**

Go to the Jupyter tab that shows the contents of the current working directory and create a new text file.

<img src = 'Images/create_library_step1.png'>

This will create a new file and open it.

You would then rename it to `my_nlp_library.py`.

<img src = 'Images/create_library_step2.png'>

You will notice that the file type has changed now to `Python`.

Inside, you can add the functions you want.  I wrote a bunch of them from our prior work.

<img src = 'Images/create_library_step3.png'>

Don't forget to save (under `File` on the top left corner).

Your new library file is now available in your folder.

<img src = 'Images/create_library_step4.png'>


**Let's import this new library!**

In [None]:
import my_nlp_library

In [None]:
help(my_nlp_library)

So professional looking ;-)

In [None]:
help(my_nlp_library.read_complete_works)
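
What `help` prints comes from the docstrings written inside the library file. As a minimal sketch of the whole create-and-import cycle without the screenshots, the code below writes a tiny module to a temporary folder and imports it. The module name `my_toy_library` and its function `count_words` are hypothetical and are not part of `my_nlp_library`.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Source code of a hypothetical one-function library, docstrings included
module_source = '''"""A toy library with one helper function."""


def count_words(text):
    """Return the number of whitespace-separated words in `text`."""
    return len(text.split())
'''

# Save the source as a .py file in a temporary folder
library_folder = Path(tempfile.mkdtemp())
(library_folder / "my_toy_library.py").write_text(module_source, encoding="utf-8")

# Make the folder importable, then import the new module by name
sys.path.append(str(library_folder))
importlib.invalidate_caches()
my_toy_library = importlib.import_module("my_toy_library")

print(my_toy_library.count_words("to be or not to be"))
print(my_toy_library.__doc__)
```

Calling `help(my_toy_library)` afterwards would display those same docstrings.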

# Yes! Othello again!

Ok, let's load everything that we need to work with this play.

In [None]:
shakespeare_path = Path.cwd() / 'Data' / 'Shakespeare.txt'

complete_works, plays = my_nlp_library.read_complete_works()
plays

In [None]:
title = 'THE TRAGEDY OF OTHELLO, MOOR OF VENICE'
start_line = plays[title][1]
end_line = plays[title][2]
othello_play = complete_works[start_line: end_line]

In [None]:
personae = my_nlp_library.get_characters(othello_play)

**WE ARE READY TO GO!!!**

Let's look at what is going on with Othello

In [None]:
othello_lines = my_nlp_library.get_character_lines('OTHELLO', othello_play)

print(len(othello_lines))
print()

print(othello_lines[:30])

In [None]:
othello_words = my_nlp_library.extract_words_from_lines('Othello', othello_lines)

print(len(othello_words))
print()

print(othello_words[:50])

## Calculating emotional valences

Now that we have Othello's words, we can compare them to the keys of the valence dictionary

In [None]:
valence = 0
count = 0
corpus = othello_words[:]
for word in corpus:
    if word in valence_dict.keys():
#         print(f"{word:>20} -- {valence_dict[word]}")
        count += 1
        valence += valence_dict[word]
        
print(f"\n\nThe valence of the provided corpus is {valence / count:.3f}") 

print(f"\nOut of {len(corpus)} words, {count} had a non-zero valence.")

You can see that fewer than 10% of words have a non-zero valence. 

There are other problems, however. *Very happy* should have higher valence than *happy*. And *not happy* should have a negative valence.  

Our bag-of-words approach does not account for these possibilities.
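
To make the problem concrete, here is a minimal sketch of one way such modifiers could be handled: look at the word just before each scored word and flip or scale the score. The negator list, the intensifier multipliers, the toy dictionary, and the function name `score_with_modifiers` are all assumptions chosen for illustration; none of them come from AFINN.

```python
def score_with_modifiers(words, valence_dict, negators=("not", "never"), intensifiers=None):
    """Mean valence of scored words, adjusted by the word right before each one."""
    if intensifiers is None:
        # Arbitrary multipliers, picked only for this example
        intensifiers = {"very": 1.5, "extremely": 2.0}

    total, count = 0.0, 0
    for i, word in enumerate(words):
        if word not in valence_dict:
            continue
        score = valence_dict[word]
        previous = words[i - 1] if i > 0 else None
        if previous in intensifiers:
            score *= intensifiers[previous]
        elif previous in negators:
            score = -score
        total += score
        count += 1

    # Return 0 when nothing was scored, to avoid dividing by zero
    return total / count if count else 0.0


toy_valences = {"happy": 3.0}
print(score_with_modifiers(["happy"], toy_valences))
print(score_with_modifiers(["very", "happy"], toy_valences))
print(score_with_modifiers(["not", "happy"], toy_valences))
```

Real text is messier (think of *not at all happy*), which is part of why this task is harder than it looks.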

Let's ignore that issue for now, and attempt to check whether the valence of Othello's speech changes in the course of the play.


## Time dependence

We will do a little trick for this.

We will consider blocks of `window` words (500 in the code below) and slide them in steps of 50 words.  This will smooth things a bit and give us some idea of whether there is a change or not.

In [None]:
times = []
valences = []
corpus = othello_words
step = 50
window = 500 

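for i in range(0, len(corpus)-window, step): 
    # Score the full block of `window` words starting at position i
    temp_corpus = corpus[i:i+window]
    count = 0
    valence_t = 0
    for word in temp_corpus:
        if word in valence_dict:
            count += 1
            valence_t += valence_dict[word]
            
    # Skip blocks without scored words to avoid dividing by zero
    if count > 0:
        # Place each point at the center of its block
        times.append(i + window // 2)
        valences.append(valence_t / count)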
        
fig = plt.figure( figsize = (6, 4) )
ax = fig.add_subplot( 111 )
half_frame(ax, 'Word count', 'Emotional valence', font_size = my_fontsize)
# Guide to the eye
ax.plot([0, 6500], [0, 0], 'k--', lw = 2)
ax.fill_between([0, 6500], -2, color = '0.7')

# Print window size for easy examination of choices
ax.text(50, -1.8, f"Window: {window} words")

ax.plot(times, valences, 'bo-', label = 'Othello')

ax.set_xlim(0, 6400)
ax.set_ylim(-2, 2.5)
ax.legend(loc = 'best', frameon = False, fontsize = my_fontsize)

plt.tight_layout()

**That is how we confirm it is a tragedy!** 

The question, however, is: Was it a tragedy for everyone?
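
Before repeating the calculation for another character, note that the sliding-window loop could be wrapped in a function, in the spirit of the library detour earlier. The sketch below assumes each block spans the full `window` and is plotted at its center. The name `windowed_valence` and the toy inputs are made up for illustration and are not part of `my_nlp_library`.

```python
def windowed_valence(words, valence_dict, window=500, step=50):
    """Return block centers and the mean valence of the scored words in each block."""
    times, valences = [], []
    for start in range(0, len(words) - window, step):
        block = words[start:start + window]
        scores = [valence_dict[word] for word in block if word in valence_dict]
        # Skip blocks without scored words to avoid dividing by zero
        if scores:
            times.append(start + window // 2)
            valences.append(sum(scores) / len(scores))
    return times, valences


# Toy inputs: a cheerful start followed by a gloomy end
toy_words = ["happy", "happy", "the", "the", "sad", "sad", "the", "the"]
toy_valences = {"happy": 3.0, "sad": -2.0}

toy_times, toy_means = windowed_valence(toy_words, toy_valences, window=4, step=2)
print(toy_times)
print(toy_means)
```

With such a function, comparing characters becomes a matter of calling it once per word list instead of copying the loop.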


In [None]:
iago_lines = my_nlp_library.get_character_lines('IAGO', othello_play)

print(len(iago_lines))
print()

iago_words = my_nlp_library.extract_words_from_lines('Iago', iago_lines)


In [None]:
times = []
valences = []
corpus = iago_words
step = 50
window = 500 

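for i in range(0, len(corpus)-window, step): 
    # Score the full block of `window` words starting at position i
    temp_corpus = corpus[i:i+window]
    count = 0
    valence_t = 0
    for word in temp_corpus:
        if word in valence_dict:
            count += 1
            valence_t += valence_dict[word]
         
    # Skip blocks without scored words to avoid dividing by zero
    if count > 0:
        # Place each point at the center of its block
        times.append(i + window // 2)
        valences.append(valence_t / count)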
        
fig = plt.figure( figsize = (6, 4) )
ax = fig.add_subplot( 111 )
half_frame(ax, 'Word count', 'Emotional valence', font_size = my_fontsize)
# Guide to the eye
ax.plot([0, 6500], [0, 0], 'k--', lw = 2)
ax.fill_between([0, 6500], -2, color = '0.7')

ax.text(50, -1.8, f"Window: {window} words")

ax.plot(times, valences, 'ro-', label = 'Iago')

ax.set_xlim(0, 6400)
ax.set_ylim(-2, 2.5)
ax.legend(loc = 'best', frameon = False, fontsize = my_fontsize)

plt.tight_layout()

Iago was clearly offended by Othello's happiness at the beginning of the play, don't you think? 

It might be time to actually recall [Othello's story](https://en.wikipedia.org/wiki/Othello)...

Not bad, eh?

# Exercises

If I still need to give you exercises, instead of you thinking about something that you would like to do, then I am not doing my job properly.