# Further Applications of Regular Expressions

Today we will focus on further applications of the tools we have covered so far and explore what they can do.

### Lesson Outline:
- Q&A regarding Thursday's content
- Examples (with BK's data)
- Practice!

In [None]:
# importing different packages
import re
import codecs
import numpy as np
from datascience import *
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from nltk import tokenize
from collections import Counter
import pprint
pp = pprint.PrettyPrinter()

%matplotlib inline

## Importing our text

In [None]:
with codecs.open('Grand Strategy of Phillip II Text.txt', 'r',encoding = 'utf-8', errors='ignore') as f:
    read_text = f.read()
print(read_text)

In [None]:
# Let's start off by getting a rough list of proper nouns with their counts - this will help us narrow down people, places and their relations
# lowercase '\w' matches word characters (letters, numbers, underscore); capital '\W' means NOT word characters
pp.pprint(Counter(re.findall(r'[A-Z]\w+',read_text)))
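
To see what this pattern picks up, here is a minimal sketch on a made-up sentence (the `sample` string is an assumption for illustration, not taken from the book). It also shows the difference between lowercase `\w` and capital `\W`.

```python
import re
from collections import Counter

# made-up sample sentence, only for illustration
sample = "Philip wrote to Parma in 1588. The Armada sailed; Philip waited."

# a capital letter followed by one or more word characters
capitalized = re.findall(r'[A-Z]\w+', sample)
counts = Counter(capitalized)
print(capitalized)
print(counts.most_common(1))

# \W is the opposite of \w: it matches spaces and punctuation
non_word = re.findall(r'\W', "to Parma.")
print(non_word)
```

Note that 'The' is matched too, because it starts a sentence. That is one reason the list above is only a rough list of proper nouns.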

In [None]:
# making a table out of the above dictionary
names_table = Table(['Name', 'Count'])
names = Counter(re.findall(r'[A-Z]\w+',read_text))
for name in names.keys():
    row = [name, names[name]]
    names_table.append(row)
names_table.sort('Count', descending=True)

In [None]:
# the number of unique names in the work
len(names.keys())

In [None]:
# how many times those unique names appear in total
sum(names_table.column('Count'))

In [None]:
names_table.sort('Count', descending=True).take(np.arange(20)).barh('Name')

In [None]:
# making a histogram of the counts distribution
# most of the names only appear very few times
names_table.hist('Count', bins=50)

In [None]:
# while we're at it, let's get a list of all the years that appear in the text along with their counts
pp.pprint(Counter(re.findall(r'\d{4}',read_text))) #why did I have 4 within my input?
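
One thing to keep in mind is that `\d{4}` matches any run of four digits, including the first four digits of a longer number. The sketch below uses a made-up string (an assumption, not from the book) and shows `\b` word boundaries as one possible refinement.

```python
import re

# made-up sample, only for illustration
sample = "Page 12345 mentions 1588 and 1598."

# any four digits in a row, even inside a longer number
loose = re.findall(r'\d{4}', sample)
# exactly four digits with a word boundary on each side
strict = re.findall(r'\b\d{4}\b', sample)

print(loose)
print(strict)
```

The boundary version drops the partial match taken from '12345' and keeps only the standalone four-digit numbers.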

In [None]:
# let's put these two together in a very simple way. First, let's search for statements that explain events
# in the most frequently mentioned years (the pattern below covers 1570-1589)
info_1588 = re.findall(r'[\S\s]{,45}15[7-8][0-9][\S\s]{,45}', read_text)
info_1588_parsed = []
for elem in info_1588:
    if 'Philip' in elem:
        info_1588_parsed.append(elem)
info_1588_parsed
#this query is not perfect. What are ways that you would improve it? Try them out!

In [None]:
#let's try the same thing again, but now with a frequently mentioned country
info_england = re.findall(r'[\S\s]{,45}England[\S\s]{,45}', read_text)
info_england_parsed = []
for elem in info_england:
    if 'Philip' in elem:
        info_england_parsed.append(elem)
info_england_parsed
#What are ways that you could improve this query?
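
One possible direction, sketched under assumptions: `re.findall` returns non-overlapping matches, so when two mentions fall within 45 characters of each other the second one is swallowed by the first window. The helper below (`context_windows` is a hypothetical name, not part of any library) uses `re.finditer` and slicing so that every mention gets its own window. The sample string is made up.

```python
import re

def context_windows(text, pattern, width=45, must_contain=None):
    """Return up to `width` characters on each side of every match of `pattern`."""
    snippets = []
    for match in re.finditer(pattern, text):
        # clamp the window to the bounds of the text
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        snippet = text[start:end]
        # optionally keep only windows that mention another term
        if must_contain is None or must_contain in snippet:
            snippets.append(snippet)
    return snippets

# made-up sample, only for illustration
sample = "Philip feared England. England, in turn, feared Philip and Spain."

all_windows = context_windows(sample, r'England', width=15)
with_philip = context_windows(sample, r'England', width=15, must_contain='Philip')
greedy = re.findall(r'[\S\s]{,15}England[\S\s]{,15}', sample)

print(len(all_windows), len(with_philip), len(greedy))
```

Here the slicing approach finds a window for both mentions of 'England', while the single greedy pattern returns only one match. This is just one way to tighten the query; there are others.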

In [None]:
date_words = [re.findall(r'[A-Z][a-z]+', elem) + re.findall(r'\d{4}', elem) for elem in info_england_parsed if re.search(r'[A-Z][a-z]+', elem)]
date_words

In [None]:
word_date_dict = {}
for x in range(1500,1600):
    for elem in date_words:
        if str(x) in elem:
            word_date_dict[str(x)] = elem
word_date_dict
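
Note that the loop above keeps only the last snippet found for each year, because each assignment overwrites the previous one. A minimal sketch of one alternative, using `collections.defaultdict` to collect every snippet per year. The `toy_date_words` list is a hypothetical stand-in for `date_words`, and unlike the loop above this sketch accepts any all-digit token rather than only years from 1500 to 1599.

```python
from collections import defaultdict

# hypothetical stand-in for date_words: each inner list mixes names and years
toy_date_words = [
    ['Philip', 'England', '1585'],
    ['Philip', 'Elizabeth', 'England', '1585'],
    ['Philip', 'England', '1588'],
]

words_by_year = defaultdict(list)
for words in toy_date_words:
    for token in words:
        if token.isdigit():
            # store the non-year words of this snippet under the year
            words_by_year[token].append([w for w in words if not w.isdigit()])

print(dict(words_by_year))
```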

## Character Position

We are now going to look at where select characters appear within the work.

In [None]:
# our select characters, who we will compare to Philip
characters = ['Elizabeth', 'Estado', 'God']

# getting the starting positions for each occurrence of each character
philip_positions = np.array([wrd.start() for wrd in re.finditer('Philip', read_text)])

# putting the positions of our select characters in a dictionary
# so that we can easily access them later
character_positions = {}
for character in characters:
    positions = np.array([wrd.start() for wrd in re.finditer(character, read_text)])
    character_positions[character] = positions  


# printing out the occurrences of each character's name
for char in characters:
    print(char + ': ' + str(len(character_positions[char])) + ' occurrences')
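
Keep in mind that a plain pattern such as 'God' also matches inside longer words. The sketch below, on a made-up sentence (an assumption for illustration), compares the plain pattern with a version wrapped in `\b` word boundaries.

```python
import re

# made-up sample, only for illustration
sample = "God willing, Godfrey said, the gods and God alone would decide."

# plain pattern: also matches the start of 'Godfrey'
loose = [m.start() for m in re.finditer('God', sample)]
# word boundaries: only the standalone word 'God'
exact = [m.start() for m in re.finditer(r'\bGod\b', sample)]

print(loose)
print(exact)
```

If the counts above look too high for a short name, word boundaries are one thing to try.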

We can now plot the positions of where each of these mentions comes up. We will start with Philip.

In [None]:
# set up the figure
width = 15
height = 2.5
fig = plt.figure(figsize=(width, height))
ax = fig.add_subplot(111)
ax.set_xlim(0,len(read_text))
ax.set_ylim(0,10)
plt.title('Philip')

# drawing the horizontal line
xmin = 0
xmax = len(read_text)
y = 5
plt.hlines(y, xmin, xmax)

# plotting each point
for each in philip_positions:
    plt.plot(each,y, "|", ms = 55, mew=.6)

# a e s t h e t i c
plt.axis('off')
plt.show()

The horizontal line represents the indices within the string that is our corpus, about 1.5 million characters long. Each vertical tick marks the point where a substring, in this case 'Philip', begins. 'Philip' appears throughout the book, and that is without counting mentions of 'King' or pronouns referring to him. Not really a surprising result. Let's look at where our other select characters show up.

In [None]:
# for loop with each character
for char in characters:
    # set up the figure
    width = 15
    height = 2.5
    fig = plt.figure(figsize=(width, height))
    ax = fig.add_subplot(111)
    ax.set_xlim(0,len(read_text))
    ax.set_ylim(0,10)
    plt.title(char)

    # drawing the horizontal line
    xmin = 0
    xmax = len(read_text)
    y = 5
    plt.hlines(y, xmin, xmax)
    
    # plotting each point
    for each in character_positions[char]:
        plt.plot(each,y, "|", ms = 55,mew=.6)

    # a e s t h e t i c
    plt.axis('off')
    plt.show()

There are some interesting patterns that you can see in the appearance of names. Estado occurs in a tight cluster near the end of the work. God is pretty spread out, but has a dense patch about a quarter of the way through. Elizabeth doesn't have many mentions until a third of the book has passed, then occurs quite a few times. Pick out some other characters that you want to see this for, insert their names into the list `characters`, then run the cells again to explore some others.

## How close are characters?

We are now going to try to judge how close a relationship two individuals have. One way to do this is by comparing the positions of their names within the string. For example, if we want to see how related God is to Philip, we can see how far away the closest 'Philip' is from each 'God' within our corpus. If we do that for all occurrences of God, we can see, on average, how close Philip is to God. We can then do that for every one of our select individuals.
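
To make the calculation concrete, here is a minimal sketch with hypothetical character offsets (`philip_toy` and `god_toy` are made-up numbers, not positions from the book). It uses plain Python lists rather than NumPy arrays.

```python
# hypothetical character offsets, chosen only to illustrate the calculation
philip_toy = [100, 500, 2000]
god_toy = [130, 450, 1200]

closest = []
for position in god_toy:
    # distance from this mention of 'God' to every mention of 'Philip'
    distances = [abs(p - position) for p in philip_toy]
    # keep only the nearest 'Philip'
    closest.append(min(distances))

average = sum(closest) / len(closest)
print(closest)
print(average)
```

The measure is not symmetric: it starts from each mention of the other character and looks for the nearest 'Philip', not the other way around.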

In [None]:
# performing this operation as a loop so that
# we can do this for all select characters
boxplot_table = Table(['Character', 'Distance'])
for character in character_positions.keys():
    
    # collecting numbers for a later boxplot
    to_add = Table()
    distance_collections = []
    
    # for each occurrence of the select character
    for position in character_positions[character]:
        # computing the absolute distance between that specific occurrence of character and all occurrences of philip
        distances = abs(philip_positions - position)
        # finding the smallest of those distances
        closest_distance = min(distances)
        # adding that distance to the list
        distance_collections.append(closest_distance)
        
    # finding the average of those distances
    avg_distance = np.mean(distance_collections)
    
    # collecting numbers for a later boxplot
    to_add = to_add.with_column('Distance', distance_collections).with_column('Character', character)
    boxplot_table.append(to_add)
    

    
    print('Philip is on average {} characters away from {}'.format(str(round(avg_distance,2)),character))
    print('Median: ' + str(np.median(distance_collections)) + ' characters\n')

Initially, we just looked at the average, but the median is much more telling. The data definitely seems to be skewed by some outliers. The following boxplot helps us look at the distributions.

In [None]:
height, width = 14, 19
fig = plt.figure(figsize=(width, height))
sns.boxplot(x=boxplot_table['Character'], y=boxplot_table['Distance'])
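
To see why the median can be more telling than the mean when a few distances are very large, here is a small sketch with made-up distances (the numbers are assumptions, not results from the book).

```python
from statistics import mean, median

# made-up distances: four close mentions and one far-away outlier
toy_distances = [20, 35, 40, 60, 5000]

toy_mean = mean(toy_distances)
toy_median = median(toy_distances)

print(toy_mean)
print(toy_median)
```

A single outlier pulls the mean above 1000, while the median stays at 40, close to the typical distance.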

Another way to measure relationships is to see how often the characters appear in sentences together.

In [None]:
# using the NLTK tokenizer to break the corpus up by sentences
broken_up_sentences = tokenize.sent_tokenize(read_text)

# initializing dictionary
common_sentences = {}

for character in characters:
    # adding one to a list for each time that philip and character appear in the same sentence
    both_appear = [1 for sentence in broken_up_sentences if 'Philip' in sentence and character in sentence]
    # summing that to get the number of sentences they appear in together
    num_sentences = sum(both_appear)
    # adding value to dictionary
    common_sentences[character] = num_sentences
    # making a string version of num_sentences for printing
    str_num = str(num_sentences)
    print('Philip and {}: {} sentences with both'.format(character, str_num))
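
The same counting logic can be checked by hand on a tiny made-up text. The sketch below swaps in a naive regex splitter for the NLTK tokenizer so that it runs with the standard library only. The splitter is an assumption and will break on abbreviations such as 'Dr.', and `toy_text` is invented for illustration.

```python
import re

# made-up text, only for illustration
toy_text = ("Philip trusted in God. Elizabeth defied Spain. "
            "God was invoked often. Philip wrote to Elizabeth about God.")

# naive splitter: break after '.', '!' or '?' followed by whitespace
toy_sentences = re.split(r'(?<=[.!?])\s+', toy_text)

toy_counts = {}
for character in ['Elizabeth', 'God']:
    # count sentences that mention both Philip and the character
    toy_counts[character] = sum(
        1 for sentence in toy_sentences
        if 'Philip' in sentence and character in sentence
    )

print(toy_sentences)
print(toy_counts)
```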

One thing you may want to consider is controlling for how often each name appears. For example, Elizabeth is in about the same number of sentences with Philip as Estado is. But Elizabeth appears 356 times in the work, and Estado only appears 225 times. This might lead us to think that Estado has a much stronger relationship with Philip. The cell below performs these calculations.

In [None]:
for character in characters:
    # total number of times the character's name appears
    total_appearances = len(character_positions[character])
    # sentences where philip and character appear
    appearances_with_philip = common_sentences[character]
    # dividing common appearances by the total number of appearances
    relative = round(appearances_with_philip / total_appearances, 4)
    print('{}: {}'.format(character, relative))

Now let's take a look at some of the sentences where God and Philip both appear.

In [None]:
for sentence in broken_up_sentences:
    if 'Philip' in sentence and 'God' in sentence:
        print(sentence + '\n\n\n')

Looking at the above result, we can see that the sentence tokenizer has a little bit of trouble with the text, but nonetheless, God and Philip are still pretty close in the text for these instances.

Once again, if you add some other characters to `characters`, then run through all of the cells again, you'll be able to perform these operations for them.

There are many different ways to do this; these are just some basic ones to get you thinking about what you can do with your text files.

<b>Bonus</b>: Read about extracting social networks from books
http://www1.cs.columbia.edu/~delson/pubs/ACL2010-ElsonDamesMcKeown.pdf