# Synopsis

Unstructured text is one of the most plentiful sources of data in many disciplines. However, because this data is unstructured (meaning that it isn't organized nicely into an Excel spreadsheet), even basic analysis can be a bit more involved than with other data. In this unit we will go over the basics of textual analysis and cover:

* Techniques for **parsing** large-scale text
* Basic **bag of words** analysis
* Examining **distributions** of word usage

# Read libraries and functions

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

my_fontsize = 15

In [None]:
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
import pandas as pd

from collections import Counter
from pathlib import Path
from random import random
from string import punctuation, whitespace


In [None]:
def half_frame(sub, xaxis_label, yaxis_label, font_size = 15, padding = -0.02):
    """Formats frame, axes, and ticks for matplotlib made graphic 
       with half frame.
       
    """

    # Format graph frame and tick marks
    sub.yaxis.set_ticks_position('left')
    sub.xaxis.set_ticks_position('bottom')
    sub.tick_params(axis = 'both', which = 'major', length = 7, width = 1.5, 
                    direction = 'out', pad = 10, labelsize = font_size)
    sub.tick_params(axis = 'both', which = 'minor', length = 5, width = 1.5, 
                    direction = 'out', labelsize = 10)
    for axis in ['bottom','left']:
        sub.spines[axis].set_linewidth(1.5)
        sub.spines[axis].set_position(("axes", padding))
    for axis in ['top','right']:
        sub.spines[axis].set_visible(False)

    # Format axes
    sub.set_xlabel(xaxis_label, fontsize = 1.6 * font_size)
    sub.set_ylabel(yaxis_label, fontsize = 1.6 * font_size)


# Text as data

Whether it's extracting numerical data from text, or dealing with text directly, the ability to manipulate text in the form of strings is essential for any number of data science projects. More importantly, analyzing text allows for quantitative analysis in a number of areas that would be prohibitive otherwise.

Over a decade ago, Google convinced librarians around the world to scan their books for the [Google Books Project](https://www.newyorker.com/business/currency/what-ever-happened-to-google-books). (At the time, people were still buying into their *Don't be evil* bullshit $-$ the [Wikipedia page](https://en.wikipedia.org/wiki/Google_Books) provides a less wide-eyed version.)

While Google's intentions were far from altruistic, the scanning and digitization ended up creating an exciting resource that has enabled all sorts of previously impossible analyses. For example, [Michel et al.](https://www.science.org/doi/10.1126/science.1199644) studied the changes in occurrences of specific words or groups of words in books over time:

<img src = "Images/electronic_book_use.png" width = 600>


Quantitative textual analysis makes longitudinal analysis, or simply the analysis of large-scale text, possible. However, one important point that I want to impress upon you is that **textual analysis is not nearly as simple as it seems**. Teaching computers how to perform tasks that are innately easy for you can be quite challenging.


# The Data

While Google's motivation was problematic, a competing effort $-$ [Project Gutenberg](http://www.gutenberg.org) $-$ was digitizing thousands of post-copyright books and making them freely available to all in numerous formats.

<table>
    <tr>
        <td>
            <img src = 'Images/qPSwl88-_400x400.png' width = 80>
        </td>
        <td>
            <img src = 'Images/gutenberg_twain.png' width = 500>
        </td>
    </tr>
</table>


Luckily for us, this includes the complete works of William Shakespeare, which we've pre-downloaded for you.

<img src = 'Images/gutenberg_shakespeare.png' width = 600>


For data processing in a computer, the best format is typically **plain text**.



First things first, **ALWAYS LOOK AT YOUR DATA**. 

<img src = 'Images/complete_works.png' width = 600>



Now is the time to open [Shakespeare.txt](../Data/NLP/Shakespeare.txt).

**So, what questions do you want to be able to answer in order to start working?**

> 1. How are different plays separated from one another?
>
> 2. How is dialogue formatted?
>
> 3. What extraneous information might we want to ignore?
>
> 4. How "well behaved" is our dataset? (i.e., is the formatting consistent across plays, or unique to each one?)

We will need the answers to all those questions in order to be able to write code to analyze the text. 

.



.

But, first, let's read the file...


In [None]:
with open(Path.cwd() / 'Data' / 'Shakespeare.txt', 
          'r', encoding= 'UTF-8') as file_in:
    complete_works = file_in.readlines()
    
print(len(complete_works))

In [None]:
complete_works[:40]

# What can we do with this text?

First of all, remember that the computer is not an English major.  

However, the computer is able to complete long repetitive actions without error or lack of concentration.  So, whatever we can think of that involves counting, assigning, or comparing, we can do.

Even these simple analyses can be quite revealing.  For example, do you know that instructor evaluations tend to use different words to describe the teaching of men and women? Do you know that letters of recommendation also tend to use different terms for describing the skills and potential of men and women? Or that different ethnic groups are described using different terms?

Second, the computer is not an English major, but you may very well be one.  For this reason you can instruct the computer on what to do that could be of interest to an English major.

But before we do any of that, it is useful to have a little refresher on `string` parsing. 



# `String` parsing refresher

In [None]:
with open(Path.cwd() / 'Data' / 'spoiler_alert.txt', 
          'r', encoding= 'UTF-8') as file_in:
    spoiler_alert = file_in.readlines()
    
print(len(spoiler_alert))

Let's iterate through the lines to see what the data looks like:

In [None]:
print(spoiler_alert)
print()

for line in spoiler_alert:
    print(line)

## Removing white spaces at start and end of lines

When we print line by line there are a lot of empty lines. This is due to an invisible character: `\n` aka the *new line character*.

Usually, it is helpful to get rid of white space that helps humans read but serves no purpose for computers. As you might recall, we can easily strip a `string` of those things.

In [None]:
for line in spoiler_alert:
    print(line.strip())

Three things to notice.

First, the extra empty lines have disappeared.

Second, actual empty lines did not, and neither did white spaces within a line.

Third, we have not modified our strings: `.strip()` returns a new `string` and leaves the original untouched.


In [None]:
spoiler_alert[3]

## Adding line numbers

Using `enumerate`, we can easily get the line number when we print.

This is particularly helpful when we are trying to **slice** the `list` of lines in order to get to what we want.

In [None]:
for i, line in enumerate(spoiler_alert):
    print(f"{i:>4} -- {line.strip()}")

## Searching for specific text in a line

Many times, you are interested in pulling certain lines from the entire text.  Maybe they contain a keyword of interest to you... or maybe they are spoken by a certain character...

In [None]:
for i, line in enumerate(spoiler_alert):
    if 'the' in line:
        print(f"{i:>4} -- {line.strip()}")

Why aren't lines 1 ("The potent poison quite o'ercrows my spirit.") or 3 ("But I do prophesy th' election lights") printed?

For one, **capitalization matters!**

For another, we are not **picking up contractions**.  (I'm assuming th' is a contraction of the).

Let's address capitalization first.

In [None]:
# Make everything lower case
#
for i, line in enumerate(spoiler_alert):
    if 'the' in line.lower():
        print(f"{i:>4} -- {line.strip()}")
        
print()

# Make everything upper case
#
for i, line in enumerate(spoiler_alert):
    if 'THE' in line.upper():
        print(f"{i:>4} -- {line.strip()}")

.


Is this the only way to find something in a line? Of course not! 


The `.find()` method even tells us the character position of the match.


In [None]:
for i, line in enumerate(spoiler_alert):
    # Search once and reuse the result; -1 means 'the' was not found
    position = line.lower().find('the')
    print(position)
    if position != -1:
        print(f"{i:>4} -- {line.strip()} -- {position}")
        

Not only does `.find()` tell us whether the text appears in the line, but also exactly where in the line it appears (indexing from 0). If it doesn't find our search query, it returns -1. String methods can also be combined in linear chains, as in `line.lower().find('the')` above, because each method returns a new `string`.
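
As a small illustration of chaining, here is a sketch using a made-up line (the variable `toy_line` is hypothetical and not taken from `spoiler_alert`). Each method returns a new `string`, so the next method in the chain acts on that result:

```python
# Hypothetical line, made up for this illustration
toy_line = "  The quick brown fox.\n"

# strip() removes the surrounding whitespace, lower() normalizes the case,
# and rstrip('.') removes the trailing period
cleaned = toy_line.strip().lower().rstrip('.')
print(cleaned)

# find() acts on the lower-cased copy; toy_line itself is left untouched
position = toy_line.lower().find('the')
print(position)
```

Here the match starts at position 2 because the two leading spaces are still part of the lower-cased copy.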

.


.


So, capitalization is taken care of. What about contractions?

Let's assume that **th'** is always a contraction of **the**.

In [None]:
for i, line in enumerate(spoiler_alert):
    if 'the' in line.lower() or "th'" in line.lower():
        print(f"{i:>4} -- {line.strip()}")

## Exercise

Finding characters in lines can be used for many purposes.  One of them is to select **slices** of a `string`.

As an example, assume that we want to break the text into sentences.

Sentences will be ended by periods, exclamation marks or question marks.  So go ahead and break `spoiler_alert` into sentences.



# `String` splitting refresher

Lines are a useful way of splitting the text. However, in many situations we want to look at individual words. Or rather, at what we tell the computer constitutes a word!


In [None]:
for i, line in enumerate(spoiler_alert):
    print(f"{i:>4} -- {line.strip().split()}")


Not bad, eh?

By default, `.strip()` and `.split()` act on `whitespace`, which we can `import` from the `string` module.

As you can see, it does an excellent job of breaking a line into words.
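
To see why the default matters, the sketch below compares `.split()` with `.split(' ')` on a made-up line (the variable `toy_line` is hypothetical). With no argument, any run of whitespace counts as a single separator; with an explicit `' '`, every single space is a separator and tabs and newlines are left in place.

```python
from string import whitespace

# Made-up line with leading spaces, a double space, a tab, and a newline
toy_line = "  To  be,\tor not\n"

# The characters that Python treats as whitespace
print(repr(whitespace))

# Default: any run of whitespace is one separator, and the ends are trimmed
default_split = toy_line.split()

# Explicit separator: every single space splits, producing empty strings
space_split = toy_line.split(' ')

print(default_split)
print(space_split)
```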

You will notice, however, that some words have some punctuation attached at the end.

We will take care of it easily enough and, in one fell swoop, remove the pesky punctuation.  To accomplish this, we will use another `import` from the `string` module $-$ `punctuation`.

It comprises the characters: 

> ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
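
A quick sketch of what stripping punctuation does, using a few made-up words (the list `toy_words` is hypothetical). Note two things: `.rstrip()` only cleans the end of a word, while `.strip()` cleans both ends; and because the apostrophe is part of `punctuation`, a contraction such as *th'* loses its apostrophe, whereas an apostrophe inside a word survives.

```python
from string import punctuation

# Made-up words: trailing mark, inner apostrophe, trailing apostrophe, quotes
toy_words = ["Horatio!", "o'ercrows", "th'", '"silence."']

# rstrip() removes punctuation from the right end only
trailing_only = [word.rstrip(punctuation) for word in toy_words]

# strip() removes punctuation from both ends
both_sides = [word.strip(punctuation) for word in toy_words]

print(trailing_only)
print(both_sides)
```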



In [None]:
all_the_words = []  # This is where we will store our list of words

for i, line in enumerate(spoiler_alert):
    line_words = line.strip().split()
    print(f"{i:>4} -- {line_words}")
    
    for word in line_words:
        print(f"\t{word.rstrip(punctuation)}")
        all_the_words.append(word.rstrip(punctuation))

        
print(len(all_the_words))
print(all_the_words)

.


Now imagine that we want to count how frequently each word appears (for those words that appear at least once, that is).  Then we must account for the fact that capitalization does not change the word.  That is,  *The* and *the* are the same.

But you know how to take care of this, right?



.




.




.




.




.




.

In [None]:
all_the_words = [] 

for i, line in enumerate(spoiler_alert):
    line_words = line.strip().split()
    print(f"{i:>4} -- {line_words}")
    
    for word in line_words:
#         print(f"\t{word.rstrip(punctuation)}")
        all_the_words.append(word.rstrip(punctuation).lower())

        
Counter(all_the_words)

Wow! Hamlet is so self-centered. It is all about **I** with him.

Alright, we know you are dying. Get over it.

# Back to the Complete Works

Now that you are refreshed, we can get back to the `complete_works`.

In [None]:
print(complete_works[:30])

.



.



We will be focusing on Othello, or, more precisely, *THE TRAGEDY OF OTHELLO, MOOR OF VENICE*.

So, first things first. Which lines pertain to this play?

Or, more granularly, how do you know when the play begins and when it ends?

The play starts with its title, obviously. So that is an easy one.

If you scroll a bit up from the title, you will see that the previous work ends with *THE END*.

So get going...


In [None]:
# find line_start and line_end
#
found_start = False
for i, line in enumerate(complete_works):
    # Code here
    
    
    
    
the_play_othello = complete_works[line_start:line_end]

print(f"\nThe play Othello has {len(the_play_othello)} lines.\n")


Before moving on, **always** make sure you really have what you _think_ you have. 

In [None]:
print(the_play_othello[:40])
print()

print(the_play_othello[-40:])

## Getting the character names


Great! We have the play!

You will also notice that after **Dramatis Personae**, there is a list of characters in the play.

Let's get their names and their description. 

What would be a good data structure for that data?


.


.



.



.



.

In [None]:
found_personae = False
personae = {}
for i, line in enumerate(the_play_othello):
    # Code here

print()
print(personae)


## Getting the lines from a specific character

Below, you can see the beginning of ACT I. 

You will notice a nice pattern to how lines are assigned to a character.


In [None]:
snippet = the_play_othello[40:53]

print(f"    --01234567890")
for i, line in enumerate(snippet):
    print(f"{i:>4}--{line.rstrip()}")


A character's speech begins with two spaces followed by the character's name in capital letters and a period.

All following lines that start with four spaces belong to the same character, until the speech of a new character starts.

```
"  CHARACTER. blahblahblah
     blahblahblah"
```

We can work with that, right?

Start by getting RODERIGO's lines from the `snippet`. 

In [None]:
roderigo_lines = []
character = 'RODERIGO'



found_lines = False
for i, line in enumerate(snippet):
    if line.split('.')[0] == '  ' + character:
        print(f"{i:>4}--{line.rstrip()}")
        found_lines = True
        roderigo_lines.append(line)
        continue
    
    if found_lines:
        if line[:4] == '    ':
            print(f"{i:>4}--{line.rstrip()}")
            roderigo_lines.append(line)
        else:
            found_lines = False
            continue


Go ahead and modify the code to read the lines from IAGO in the snippet.

# Getting and working with Othello's lines


Let's start by writing a function that reads and stores a character's lines given the lines in a play and the character's name.

In [None]:
def read_character_lines( character, play_lines ):
    """
    This function takes the name of a character and the lines from the play
    extracted from Project Gutenberg's Complete Works of William Shakespeare
    and returns a list with all the lines from that character in the play
    
    inputs:
        character -- str
        play_lines -- list of str
        
    returns:
        character_lines -- list of str
    """
    character_lines = []
    character = character.upper()
    start_string = '  ' + character
    continuation_string = '    '
    
    found_lines = False
    for i, line in enumerate(play_lines):
        # Code here
    
    return character_lines

In [None]:
othello_lines = read_character_lines('othello', the_play_othello[40:])

print(len(othello_lines))

In [None]:
iago_lines = read_character_lines('iago', the_play_othello[40:])
print(len(iago_lines))

In [None]:
desdemona_lines = read_character_lines('desdemona', the_play_othello[40:])
print(len(desdemona_lines))

In [None]:
brabantio_lines = read_character_lines('brabantio', the_play_othello[40:])
print(len(brabantio_lines))

## Bag of words approaches

Many NLP techniques make use of the so-called **bag of words** model. 

The idea is that you just get the words in some text and completely forget about their order and restrict yourselves to things that can be counted.
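
As a minimal illustration of what "forgetting about order" means, the sketch below builds the bag of words for two made-up sentences (the names `sentence_a` and `sentence_b` are hypothetical). The sentences mean very different things, yet their bags are identical:

```python
from collections import Counter

# Two made-up sentences with the same words in a different order
sentence_a = "the dog bit the man"
sentence_b = "the man bit the dog"

# A bag of words keeps only the words and how often each one occurs
bag_a = Counter(sentence_a.split())
bag_b = Counter(sentence_b.split())

print(bag_a)
print(bag_a == bag_b)
```

This is both the strength and the limitation of the approach: counting is easy, but anything that depends on word order is lost.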

In order to move forward with this approach, it is helpful to write a function that, given a list of lines, breaks them into a list of words.

That is your mission for the next few minutes, if you choose to take it.

In [None]:
def extract_words_from_lines( character, character_lines ):
    """
    This function takes the name of a character and a list 
    with all the lines from a character in the play and returns 
    a list of words
    
    inputs:
        character -- str
        character_lines -- list of str
        
    returns:
        character_words -- list of str
    """
    character_words = []
    character = character.upper()
    character_string = character + '.'
    
    for i, line in enumerate(character_lines):
        # Code here
    
    return character_words

OK, get the words spoken by Othello and use `Counter` to get some statistics.

In [None]:
othello_words = extract_words_from_lines('othello', othello_lines)

print(f"Othello's lines comprise {len(othello_words)} words.\n")
print(f"Othello's lines comprise {len(set(othello_words))} unique words.\n")

othello_counter = Counter(othello_words)

**That means that each unique word occurs an average of about 3.8 times.**

How good a description of common words is this average?

You can then get his most common words.

In [None]:
othello_counter.most_common(20)

**Interesting how the most common words appear about 50 times more frequently than one would expect**.

Some of those are clearly the usual suspects: *the*, *and*, *to*, *of*, *a*.

Less typical are *I* and *me*.

As we know, there is no **I** in **team**.

In [None]:
print(othello_counter['we'])
print(othello_counter['team'])

Since the most common words appear so much more often than the average of about 3.8, the least common words must appear fewer times than that, possibly only once.

**Moreover, there should be many of those single occurrence words**.

In [None]:
othello_counter.most_common()[-20:]


In [None]:
othello_counter.most_common()[-1020:-1000]

Yes, over a thousand words appear only once.
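
Slicing the sorted list is an indirect check. A more direct one is to count the words whose count is exactly 1. The sketch below does this on a tiny made-up `Counter` (the name `toy_counter` is hypothetical):

```python
from collections import Counter

# Tiny made-up text, used only for illustration
toy_counter = Counter("to be or not to be that is the question".split())

# Keep the words that occur exactly once
words_once = [word for word, count in toy_counter.items() if count == 1]

print(len(words_once), words_once)
```

Applying the same comprehension to `othello_counter` would give the number of single-occurrence words in Othello's speech directly.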

This seems like a very strange distribution, doesn't it?  There are a lot of words with a very low probability of occurring and then a few with a high probability of occurring.

It is almost as if there were a country where most of the population was only 1 foot tall but then there were a couple of 200-foot-tall giants walking around.


## Zipf's law

Because this distribution is so strange, we will look at it in a little more detail.


In [None]:
# Get frequency of words from Counter object
#
counts = list( othello_counter.values() )

print(counts[:30])

In [None]:
fig = plt.figure( figsize = (6, 4) )
ax = fig.add_subplot( 111 )
half_frame(ax, 'Number of occurrences', 'Frequency', font_size = my_fontsize)

ax.hist( counts, bins = np.arange(-0.5, 250.5, 10), 
         align = 'mid', rwidth = 0.9, label = 'Othello')

ax.legend(loc = 'best', frameon = False, fontsize = my_fontsize)

plt.tight_layout()

Not the most enlightening plot...

Setting a logarithmic scale on the y-axis will help.

In [None]:
fig = plt.figure( figsize = (6, 4) )
ax = fig.add_subplot( 111 )
half_frame(ax, 'Number of occurrences', 'Frequency', font_size = my_fontsize)

ax.hist( counts, bins = np.arange(-0.5, 250.5, 10), 
         align = 'mid', rwidth = 0.9, label = 'Othello' )

ax.semilogy()
ax.legend(loc = 'best', frameon = False, fontsize = my_fontsize)

plt.tight_layout()

Better, but not yet great. Let's make the x-axis have a logarithmic scale too...

In [None]:
fig = plt.figure( figsize = (6, 4) )
ax = fig.add_subplot( 111 )
half_frame(ax, 'Number of occurrences', 'Frequency', font_size = my_fontsize)

ax.hist( counts, bins = np.arange(0.5, 250.5, 10), 
         align = 'mid', rwidth = 0.9, label = 'Othello' )

ax.loglog()
ax.legend(loc = 'best', frameon = False, fontsize = my_fontsize)

plt.tight_layout()

This is not the best plot, but you can kind of see how the frequency decays as a straight line in this double **logarithmic plot**.

Such a pattern is indicative of a power-law decay:

> $P(k) \propto k^{-\alpha}$

In the context of human languages, such a pattern is taken as evidence for [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law).
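
To see why a power law shows up as a straight line on a double logarithmic plot, take logarithms of both sides: $\log P(k) = -\alpha \log k + c$, which is a line with slope $-\alpha$. The sketch below checks this on synthetic, noise-free data. The exponent `alpha_assumed` is an arbitrary illustrative value, not an estimate for Othello.

```python
import numpy as np

# Arbitrary exponent, chosen only for illustration
alpha_assumed = 2.0

# Exact power law P(k) = k^(-alpha) on k = 1..100, with no noise
k = np.arange(1, 101, dtype=float)
p = k ** (-alpha_assumed)

# Fit a straight line to log(p) versus log(k); the slope estimates -alpha
slope, intercept = np.polyfit(np.log(k), np.log(p), 1)

print(f"fitted slope = {slope:.3f}")
```

On real word counts the data are noisy, so a straight-line fit like this should be read as a rough guide rather than a careful estimate of $\alpha$.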

Because the support of the data is so large, the frequency plot gets to be quite noisy for large values of $k$.  

For this reason, it is typically best to plot the survival function instead of the frequency.

The survival function, as we compute it here, shows the number of values greater than or equal to $k$.
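
As a toy illustration with made-up counts (the array `toy_counts` is hypothetical), the survival function can be sketched with NumPy by counting, for each distinct value $k$, how many entries are at least $k$:

```python
import numpy as np

# Made-up word counts: three words occur once, two occur twice, one occurs 5 times
toy_counts = np.array([1, 1, 1, 2, 2, 5])

# Distinct values of k, in increasing order
toy_k_values = np.unique(toy_counts)

# For each k, the number of entries that are greater than or equal to k
toy_survival = [int((toy_counts >= k).sum()) for k in toy_k_values]

print(list(toy_k_values))
print(toy_survival)
```

Because the survival function accumulates counts, it is much smoother than the histogram for large $k$, where individual bins hold very few words.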

The next cell shows how this is calculated for Othello's word counts.


In [None]:
k_values = []
survival_function = []

n = len(counts)
for k in sorted(counts):
    # The first time a value k is seen, n is the number of counts >= k
    if k not in k_values:
        k_values.append(k)
        survival_function.append(n)
    n -= 1
        
fig = plt.figure( figsize = (6, 4) )
ax = fig.add_subplot( 111 )
half_frame(ax, 'Number of occurrences', 'Survival function', font_size = my_fontsize)
ax.loglog()

# Add a line as a guide to the eye
ax.plot([1, 200], [1400, 3.8], 'b-', lw = 6, alpha = 0.3)

ax.plot( k_values, survival_function, 'r-', lw = 2, label = 'Othello' )

ax.legend(loc = 'best', frameon = False, fontsize = my_fontsize)

plt.tight_layout()   


## Text entropy

Bag of words approaches allow for easy calculation of many things. Above, we considered three quantities: number of words, size of vocabulary (aka, number of unique words), and distribution of word frequencies.

While it seems hard to compare distributions, the fact is that typical corpora generated from human language obey Zipf's law.  Moreover, the exponent in Zipf's law is related to measures of concentration. The smaller the value of the exponent $\alpha$, the more concentrated the speech is on a few words. Thus, the distribution also gives us a single number with which to compare characters.

Another way to measure someone's speech is by using **entropy**. [Entropy](https://en.wikipedia.org/wiki/Entropy) is a concept introduced in the study of heat. It measures disorder and, in a sense, tells us how hard something is to predict.

We are all much more predictable than we like to think. That is the reason why we are so easy to make fun of.  For instance, I seem to love the word *essentially*. Or at least that is what my so-called loved ones claim.

Entropy has also been introduced in the context of [Information Theory](https://en.wikipedia.org/wiki/Entropy_%28information_theory%29), where it is related to how surprising new information is.  When we listen to speech, every new word uttered brings new information. Sometimes, the amount of new information is close to zero because we already know what is coming:

> baa baa black sheep
>
> have you any...


**So, how is entropy calculated?**

In the context of language, we can say that our speech is composed of words. Each word is an instance of a token (aka, unique word).  We can associate with each token a probability of selection. 

For example, let's say that our vocabulary has three tokens -- *hello*, *friends*, and *folks* -- and that each token has an associated probability of being selected. Then we can create a speech generator. Yes, this is AI!





In [None]:
vocabulary = {'hello': 0.5, 'friends': 0.35, 'folks': 0.15}

speech_size = 100
speech = []
for i in range(speech_size):
    x = random()
    p = 0
    for key in vocabulary:
        p += vocabulary[key]
        if x < p:
            speech.append(key)
            break
    
print(speech)

# Count how often each token was generated
speech_counter = Counter(speech)
print(speech_counter)

Entropy is expressed as a function of the probabilities of the tokens:


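> $H(X) = -\sum_i ~ p_i ~\log_2 p_i $

Here $i$ is the index for a token and $p_i$ is the probability of that token appearing in the text. We take the logarithm in base 2, so that entropy is measured in bits; the values quoted later in this unit are consistent with that choice.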

In our context, the best estimate that we have of the probability of any given token is the ratio of the number of occurrences to the number of words.

> $p_i = \frac{n_i}{N}$
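
As a worked example, the sketch below evaluates the formula directly on the three made-up probabilities used for the speech generator, assuming a base-2 logarithm. The dictionary name `probabilities` is hypothetical; it simply repeats the values from `vocabulary`.

```python
from math import log2

# Same made-up probabilities as the speech generator's vocabulary
probabilities = {'hello': 0.5, 'friends': 0.35, 'folks': 0.15}

# H = -sum(p_i * log2(p_i)), measured in bits (base-2 logarithm assumed)
entropy_bits = -sum(p * log2(p) for p in probabilities.values())

print(f"{entropy_bits:.2f}")
```

This gives about 1.44 bits. An estimate made from a finite sample of generated speech will fluctuate around this value, since the observed ratios $n_i / N$ only approximate the true probabilities.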

So, go ahead and write a function to calculate the entropy of a text given a `Counter` object.

In [None]:
def text_entropy( counter ):
    """
    This function takes a Counter object and returns the entropy of the 
    underlying text.
    
    inputs:
        counter -- Counter object
        
    returns:
        entropy -- float
    """
    entropy = 0
    
    # Code here

    return entropy


In [None]:
print(f"The entropy of the speech generator is "
      f"{text_entropy(speech_counter):.2f}\n")

print("The entropy of Othello's speech is "
      f"{text_entropy(othello_counter):.2f}\n")



You can see that the entropy scores for these two forms of speech are quite different.

Not surprisingly, Othello's speech has much higher entropy -- that is, it is much less predictable.

What about the speech of the other characters?


# Exercises

Go ahead and calculate the speech entropy of other major characters. 

So far we have calculated the properties of the entirety of the lines of a character.  However, it could be that the properties of the speech change in the course of the play.

How would you go about doing this?

Yes, exactly: calculate the properties of Othello's speech over sets of 1000 consecutive words and plot how those properties change across the play.
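
One possible way to form the windows is sketched below on a toy list. The helper `consecutive_windows` is a hypothetical name, and non-overlapping windows are an assumption; overlapping (sliding) windows would be an equally valid choice.

```python
def consecutive_windows(words, size):
    """Split a list into consecutive, non-overlapping windows of `size` items.

    The last window may be shorter if len(words) is not a multiple of size.
    """
    return [words[i:i + size] for i in range(0, len(words), size)]


# Toy data: eight made-up "words" split into windows of three
toy_words = list("abcdefgh")
windows = consecutive_windows(toy_words, 3)

print(windows)
```

With the real data, each window could then be turned into a `Counter` and its properties computed separately. A short final window is worth treating with care, since statistics from fewer words are not directly comparable.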

In the previous exercises, you calculated a bunch of numbers.  Sometimes, those numbers appear to be quite different (1.46 *vs.* 8.84). Other times, not so much.

How would you be able to estimate the uncertainty in those numbers?

Is it $8.84 \pm 0.01$ or $8.8 \pm 1.0$?