In [3]:
%%html
<style>
.h1_cell, .just_text {
    box-sizing: border-box;
    padding-top:5px;
    padding-bottom:5px;
    font-family: "Times New Roman", Georgia, Serif;
    font-size: 125%;
    line-height: 22px; /* 5px +12px + 5px */
    text-indent: 25px;
    background-color: #fbfbea;
    padding: 10px;
}

hr { 
    display: block;
    margin-top: 0.5em;
    margin-bottom: 0.5em;
    margin-left: auto;
    margin-right: auto;
    border-style: inset;
    border-width: 2px;
}
</style>

<h1>
<center>
Module 5 - Fun with document vectors
</center>
</h1>
<div class=h1_cell>
<p>
I want to continue to explore using the values in bag-of-words to build vectors. The general idea is that we will generate a vector for EAP, a vector for HPL and a vector for MWS. How do we get these vectors? Simple. We take a column from bag-of-words. Before going further, let's read in bag-of-words from week 4.
</div>

In [1]:
#This reads bag_of_words.txt from the local drive. If your copy lives somewhere else (e.g., a dropbox url), swap in the matching reading code
import re
import json
with open("bag_of_words.txt") as f:
    bag_of_words = json.load(f)
sorted_items = sorted(bag_of_words.items())  # need to sort to make sure vectors align
sorted_items[:10]

[(u'aaem', [1, 0, 0]),
 (u'ab', [1, 0, 0]),
 (u'aback', [2, 0, 0]),
 (u'abaft', [0, 0, 1]),
 (u'abandon', [7, 3, 1]),
 (u'abandoned', [11, 13, 5]),
 (u'abandoning', [2, 1, 0]),
 (u'abandonment', [2, 0, 3]),
 (u'abaout', [0, 24, 0]),
 (u'abased', [1, 0, 0])]

<h2>
Challenge 1
</h2>
<div class=h1_cell>
<p>
Let's write a better version of sentence_wrangler. What I noticed this week when going through new books is that I was letting some strange words through. For instance, my sentence_wrangler from last week lets numbers through. And it also lets byte codes through. I think a better design would be to switch from a blacklist (define the chars you don't want) to a whitelist (define the chars that are ok). Change the 3rd argument to the set of legal characters you allow.
<p>
If you want to be fancy, be my guest. Use the 3rd argument to pass in an re pattern that needs to match against each word. Much more elegant.
</div>

In [2]:
def sentence_wrangler(sentence, swords, legal_chars):
    # legal_chars is an re pattern that each word must match to be kept
    # word_punct_tokenizer is defined in a later cell; it must exist before this is called
    pat = re.compile(legal_chars)
    word_tokes = word_punct_tokenizer.tokenize(sentence)
    removed_words = []
    result = []
    for word in word_tokes:
        word = word.lower()
        if word in swords or pat.match(word) is None:
            removed_words.append(word)
        else:
            result.append(word)
    return (result, removed_words)

In [3]:
#Here is my whitelist as an re pattern (the extra credit version). The plain character whitelist is commented out below

legals = r'^[a-z]+$'

# legals = 'abcdefghijklmnopqrstuvwxyz'
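
<div class=h1_cell>
<p>
To get a feel for what the pattern whitelist accepts, here is a small sketch that checks a few made-up words against it. The helper name is_legal and the sample words are mine, just for illustration. The pattern is the same one assigned to legals.
</div>

```python
import re

legal_pattern = re.compile(r'^[a-z]+$')  # same pattern as legals

def is_legal(word):
    # hypothetical helper: True only if every character is a lowercase letter a-z
    return legal_pattern.match(word) is not None

samples = ['abandon', '1984', 'x92', "don't", 'xxxv']
checked = [(word, is_legal(word)) for word in samples]
print(checked)
```

<div class=h1_cell>
<p>
Numbers, byte-code leftovers and words with punctuation get rejected. Anything made only of letters gets through, even if it is not really a word.
</div>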

<div class=h1_cell>
<p>
Some other odds and ends. We will need cosine_similarity from the prior module, plus the stop words and the tokenizer.
<p>
</div>

In [4]:
def cosine_similarity(v1, v2):
    # cosine of the angle between v1 and v2: dot(v1, v2) / (|v1| * |v2|)
    # assumes the vectors have equal length and neither is all zeros
    dot_product = 0.0
    a_squared_sum = 0.0
    b_squared_sum = 0.0
    for a, b in zip(v1, v2):
        dot_product += a * b
        a_squared_sum += a ** 2.0
        b_squared_sum += b ** 2.0
    return dot_product / ((a_squared_sum ** .5) * (b_squared_sum ** .5))

In [5]:
from nltk.corpus import stopwords
swords = stopwords.words('english')

In [6]:
from nltk.tokenize import WordPunctTokenizer
word_punct_tokenizer = WordPunctTokenizer()      

<h2>
Ok, let's get to it
</h2>
<div class=h1_cell>
<p>
What I want to know is how "close" 2 books are to each other. I'll build a word-count vector for each book. And then take the cosine similarity. I'll give you a start.
</div>

In [7]:
#item in sorted_items: (word, [eap_val, hpl_val, mws_val])

eap_vector = [pair[1][0] for pair in sorted_items]
hpl_vector = [pair[1][1] for pair in sorted_items]
mws_vector = [pair[1][2] for pair in sorted_items]
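
<div class=h1_cell>
<p>
If the column idea is hard to picture, here is the same move on a toy bag with 3 words. The toy_ names and the counts are made up for illustration. The point is that after sorting, index i in every vector refers to the same word.
</div>

```python
# toy stand-in for bag_of_words: word -> [eap_count, hpl_count, mws_count]
toy_bag = {'tomb': [2, 5, 1], 'abyss': [1, 4, 0], 'love': [3, 0, 6]}

toy_sorted = sorted(toy_bag.items())  # alphabetical by word, so every vector uses the same order
toy_words = [pair[0] for pair in toy_sorted]
toy_eap = [pair[1][0] for pair in toy_sorted]  # column 0
toy_hpl = [pair[1][1] for pair in toy_sorted]  # column 1
toy_mws = [pair[1][2] for pair in toy_sorted]  # column 2

print(toy_words)
print(toy_eap, toy_hpl, toy_mws)
```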

In [8]:
eap_hpl = cosine_similarity(eap_vector, hpl_vector)
eap_hpl

0.7487332567707315

<h2>
Is that close?
</h2>
<div class=h1_cell>
<p>
The range of the cosine similarity for us is 0..1, because word counts are never negative and so the dot product can never be negative. Does that make .75 high? It is hard to answer this without having the values for other book combinations. I would say it is high enough to warrant a further look if I were searching for plagiarism. Let's check out some other combos.
</div>
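
<div class=h1_cell>
<p>
A quick sketch of the two ends of that range, using tiny made-up count vectors. The function toy_cosine is my own compact restatement of cosine similarity for this example, not a replacement for the version above.
</div>

```python
import math

def toy_cosine(v1, v2):
    # dot(v1, v2) / (|v1| * |v2|), assuming equal lengths and non-zero vectors
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# two "books" that share no words at all: bottom of the range
no_shared_words = toy_cosine([3, 0, 1, 0], [0, 2, 0, 5])

# same word proportions, but one "book" is 10 times longer: top of the range
same_proportions = toy_cosine([1, 2, 3], [10, 20, 30])

# some words shared, some not: lands in between
partial_overlap = toy_cosine([1, 1, 0], [1, 0, 1])

print(no_shared_words, same_proportions, partial_overlap)
```

<div class=h1_cell>
<p>
Notice that the length of the book does not matter, only the proportions of the words. That is handy, since our books are not all the same size.
</div>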

<h2>
Challenge 2
</h2>
<div class=h1_cell>
<p>
Go ahead and do the other 2 comparisons.
</div>

In [9]:
eap_mws = cosine_similarity(eap_vector, mws_vector)
eap_mws

0.7462905158479859

In [10]:
mws_hpl = cosine_similarity(mws_vector, hpl_vector)
mws_hpl

0.7371094246610522

<h2>
Kind of interesting
</h2>
<div class=h1_cell>
<p>
All 3 have roughly the same similarity score. I would expect that, given they are all gothic authors. Do you think we are catching the gothic/horror genre in our vectors through use of words?
</div>

<h2>
Challenge 3
</h2>
<div class=h1_cell>
<p>
Let's test our conjecture that we are capturing something about gothicness. Let's compare the 3 against Huckleberry Finn by Mark Twain. My gut feeling is that this should not be high on the gothic scale. Your goal is to build a huck_vector that you can compare against our existing vectors. Here is what you need to do:
<p>
<ol>
<li>Initialize a huck_dict that has the same keys as bag_of_words, with every value starting at 0. Each key's value will end up as the count of that word in the Huck Finn book.
<li>Find an online version of Huck Finn. Hint: Project Gutenberg is a great source.
<li>Figure out how to read the book in and to break the book into sentences.
<li>Pass each sentence through sentence_wrangler to get words.
<li>For each word, increase the count for huck_dict[word], but only if word is in bag_of_words. If the word is not in bag_of_words, add it to the list huck_left_out.
</ol>
<p>
Check your results against mine.
</div>
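
<div class=h1_cell>
<p>
If you want to see the shape of the counting loop before tackling a whole book, here is a sketch on two made-up sentences. Everything named toy_ is hypothetical, and toy_wrangler is a simplified stand-in for sentence_wrangler that uses only the re module, so it will not tokenize exactly the way nltk does.
</div>

```python
import re

toy_bag = {'raft': [0, 0, 1], 'river': [2, 3, 1], 'night': [5, 9, 4]}  # made-up counts
toy_swords = ['the', 'on', 'we', 'was', 'and', 'down']                  # made-up stop words
toy_text = "We floated down the river on the raft. The night was warm and the river was wide."

word_pattern = re.compile(r'^[a-z]+$')

def toy_wrangler(sentence, swords):
    # stand-in for sentence_wrangler: tokenize, lowercase, drop stop words and non a-z tokens
    tokens = re.findall(r"\w+|[^\w\s]+", sentence.lower())
    return [t for t in tokens if t not in swords and word_pattern.match(t)]

toy_dict = dict.fromkeys(toy_bag, 0)  # step 1: same keys as the bag, every count starts at 0
toy_left_out = []
for sentence in toy_text.split('.'):  # step 3: crude sentence split on periods
    for word in toy_wrangler(sentence, toy_swords):  # step 4
        if word in toy_bag:  # step 5: only count words the bag already knows
            toy_dict[word] += 1
        else:
            toy_left_out.append(word)

print(sorted(toy_dict.items()))
print(toy_left_out)
```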

In [12]:
import urllib

all_huck_words = dict.fromkeys(bag_of_words, 0)  # same keys as bag_of_words, every count starts at 0

huck_left_out = []
huckData = urllib.urlopen("http://www.gutenberg.org/files/76/76-0.txt").read()
huckFinn = huckData.split(".")  # crude sentence split on periods

for sentence in huckFinn:
    wordsInHuck = sentence_wrangler(sentence, swords, legals)[0]
    for word in wordsInHuck:
        if word in bag_of_words:  # dict lookup; avoids scanning the list that .keys() builds
            all_huck_words[word] += 1
        else:
            huck_left_out.append(word)

In [13]:
len(all_huck_words)  # we expect this to be 24944, the len of bag_of_words

24944

In [14]:
sorted(all_huck_words.items())[:10]

[(u'aaem', 0),
 (u'ab', 3),
 (u'aback', 0),
 (u'abaft', 0),
 (u'abandon', 0),
 (u'abandoned', 0),
 (u'abandoning', 0),
 (u'abandonment', 0),
 (u'abaout', 0),
 (u'abased', 0)]

In [15]:
len(set(huck_left_out))  #number of unique words left out

1912

In [16]:
sorted(list(set(huck_left_out)), reverse=False)[:10]  #first 10

['abner',
 'abolitionist',
 'abram',
 'abusing',
 'accessed',
 'ache',
 'actin',
 'actuly',
 'additions',
 'addled']

In [17]:
sorted(list(set(huck_left_out)), reverse=True)[:10]  #last 10

['zip',
 'yuther',
 'yow',
 'yourn',
 'yo',
 'yit',
 'yistiddy',
 'yisterday',
 'yirls',
 'yers']

<h2>
A note about these left out words
</h2>
<div class=h1_cell>
<p>
I am keeping bag_of_words static for simplicity. But in reality, it is a growing thing. We should really add these left out words into bag_of_words and set their counts to zero for eap, hpl and mws. As it is, we are kind of playing by gothic rules, only using the words we saw in gothic authors. What would happen if we expanded bag_of_words to include all the new words we see in each new book? Would that move us closer or farther away from similarity with the gothic authors?
</div>
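
<div class=h1_cell>
<p>
Here is one way to poke at that question on a toy scale. The vectors are made up and toy_cosine is my own compact restatement of cosine similarity. I am assuming the expansion works by appending one slot per left out word, with 0 for the gothic author and the actual count for the new book.
</div>

```python
import math

def toy_cosine(v1, v2):
    # dot(v1, v2) / (|v1| * |v2|), assuming equal lengths and non-zero vectors
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# made-up counts over a shared 2-word vocabulary, already aligned
gothic_vec = [4, 1]
book_vec = [1, 2]
before = toy_cosine(gothic_vec, book_vec)

# suppose the new book also used two left out words, 3 times and 2 times;
# the gothic author gets 0 for each of them
gothic_expanded = gothic_vec + [0, 0]
book_expanded = book_vec + [3, 2]
after = toy_cosine(gothic_expanded, book_expanded)

print(before, after)
```

<div class=h1_cell>
<p>
In this toy case the score drops from roughly .65 to roughly .34. The dot product stays the same, because the gothic counts for the new words are all 0, while the book vector gets longer. Try it on the real vectors to see how big the effect is.
</div>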

<h2>
Challenge 4
</h2>
<div class=h1_cell>
<p>
Build the huck_vector and compare it with the other 3. Remember to sort the items so the vectors align.
</div>

In [18]:
huck_sorted_items = sorted(all_huck_words.items())

In [19]:
huck_vector = [pair[1] for pair in huck_sorted_items]

In [20]:
huck_vector[:10]

[0, 3, 0, 0, 0, 0, 0, 0, 0, 0]

In [21]:
eap_huck = cosine_similarity(eap_vector, huck_vector)
eap_huck

0.5215618751637507

In [22]:
hpl_huck = cosine_similarity(hpl_vector, huck_vector)
hpl_huck

0.5751795949324066

In [23]:
mws_huck = cosine_similarity(mws_vector, huck_vector)
mws_huck

0.48095455803255693


<div class=h1_cell>
<p>
Huck Finn is definitely less similar. Closest to Lovecraft. Hmmmmm. Twain and Lovecraft were both American, but Huck Finn (1884) came out before Lovecraft was born (1890), so time period alone may not explain it.
</div>

<h2>
Challenge 4
</h2>
<div class=h1_cell>
<p>
Let's try one my literary colleague tells me is the antithesis of gothic: Oliver Twist by Charles Dickens.
<p>
Same routine as Huckleberry Finn. Find it, bag it, vectorize it, cosine it with other 4.
</div>

In [24]:
all_twist_words = dict.fromkeys(bag_of_words, 0)  # same keys as bag_of_words, every count starts at 0

twist_left_out = []
twistData = urllib.urlopen("http://www.gutenberg.org/cache/epub/730/pg730.txt").read()
oliverTwist = twistData.split(".")  # crude sentence split on periods

for sentence in oliverTwist:
    wordsInTwist = sentence_wrangler(sentence, swords, legals)[0]
    for word in wordsInTwist:
        if word in bag_of_words:  # dict lookup; avoids scanning the list that .keys() builds
            all_twist_words[word] += 1
        else:
            twist_left_out.append(word)

In [25]:
len(all_twist_words)

24944

In [26]:
twist_sorted_items = sorted(all_twist_words.items())
twist_sorted_items[:10]

[(u'aaem', 0),
 (u'ab', 0),
 (u'aback', 0),
 (u'abaft', 0),
 (u'abandon', 1),
 (u'abandoned', 1),
 (u'abandoning', 0),
 (u'abandonment', 0),
 (u'abaout', 0),
 (u'abased', 0)]

In [27]:
len(set(twist_left_out))

2009

In [28]:
sorted(list(set(twist_left_out)), reverse=False)[:10]

['abase',
 'ablutions',
 'abound',
 'absenting',
 'abstractedly',
 'abuts',
 'acause',
 'acceded',
 'acceding',
 'accessed']

In [29]:
sorted(list(set(twist_left_out)), reverse=True)[:10]

['zip',
 'younker',
 'yokel',
 'yerself',
 'xxxviii',
 'xxxvii',
 'xxxvi',
 'xxxv',
 'xxxix',
 'xxxiv']

<h2>
Dang
</h2>
<div class=h1_cell>
<p>
Looks like sentence_wrangler is letting through preface page numbers, e.g., xxxv. If we know we are dealing with books, I suppose we could write a special sentence_wrangler that knows about the weird things we will see in books and throw them out. I kind of like the idea of having a library of sentence wranglers that are tuned to specific styles of text. Then you can choose the one that makes the most sense.
</div>
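
<div class=h1_cell>
<p>
As a sketch of what one book-tuned check might look like, here is a filter for roman numerals. The function name and the minimum length of 2 are my own choices. Be warned that the pattern cannot tell a numeral from a real word spelled with the same letters, so a word like mix gets thrown out too.
</div>

```python
import re

# standard roman numeral shape (thousands, hundreds, tens, ones), lowercase only;
# the lookahead stops the pattern from matching an empty string
roman_pattern = re.compile(
    r'^(?=[ivxlcdm])m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$')

def looks_like_roman_numeral(word, min_length=2):
    # hypothetical helper: single letters are kept so the word "i" survives
    return len(word) >= min_length and roman_pattern.match(word) is not None

sample = ['xxxv', 'xxxix', 'yokel', 'zip', 'mix', 'i']
kept = [w for w in sample if not looks_like_roman_numeral(w)]
dropped = [w for w in sample if looks_like_roman_numeral(w)]
print(kept, dropped)
```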

<h2>
Same left outs?
</h2>
<div class=h1_cell>
<p>
I wonder how much overlap there is between the words left out of Huck Finn and the words left out of Oliver Twist.
</div>

In [30]:
intersection = set(twist_left_out).intersection(set(huck_left_out))
len(intersection)

316

<div class=h1_cell>
<p>
A pretty big overlap: 316 shared words, about 16% of each book's unique left out words.
</div>
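
<div class=h1_cell>
<p>
A raw count is hard to judge without the set sizes, so here is a sketch that reports the overlap as a fraction of each set. The function name overlap_stats and the toy word lists are made up for illustration. You could pass in twist_left_out and huck_left_out instead.
</div>

```python
def overlap_stats(words_a, words_b):
    # compare unique words only; duplicates in the lists are ignored
    set_a, set_b = set(words_a), set(words_b)
    shared = set_a & set_b
    return {
        'shared': len(shared),
        'share_of_a': len(shared) / float(len(set_a)),
        'share_of_b': len(shared) / float(len(set_b)),
        'jaccard': len(shared) / float(len(set_a | set_b)),  # shared / all unique words
    }

toy_stats = overlap_stats(['zip', 'yokel', 'acause', 'zip'],
                          ['zip', 'yourn', 'yit', 'acause', 'abner'])
print(toy_stats)
```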

<h2>
Ok, back to the problem
</h2>
<div class=h1_cell>
<p>
Build the twist_vector and compare with other 4.
</div>

In [118]:
twist_vector = [pair[1] for pair in twist_sorted_items]

In [119]:
eap_twist = cosine_similarity(eap_vector, twist_vector)
eap_twist

0.655333952147614

In [120]:
hpl_twist = cosine_similarity(hpl_vector, twist_vector)
hpl_twist

0.5937931706997981

In [121]:
mws_twist = cosine_similarity(mws_vector, twist_vector)
mws_twist

0.5688023249478481

In [122]:
huck_twist = cosine_similarity(huck_vector, twist_vector)
huck_twist

0.5708141316412415

<h2>
Poe is winner this time
</h2>
<div class=h1_cell>
<p>
Wonder if Poe and Dickens knew each other. They were writing at roughly the same time. Maybe we are picking up on the language and jargon of the time?
</div>

<h2>
Challenge 5
</h2>
<div class=h1_cell>
<p>
I'm putting down a challenge. Find a book that has a cosine similarity value of below .5 for all 3 gothic authors. I was able to get below .4! You will get a shout-out if you can beat me.
<p>
To make exploration easier, I packaged up the code to produce the 3 values into a single function. For each book I explored, I saved it in my dropbox account and then got the url to it. That is what I passed into my function. You could do something similar with Google docs. Or change the url to a file path.
</div>
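
<div class=h1_cell>
<p>
If you would rather work from files on your own drive, here is a sketch of a reader that takes either a url or a file path. The name read_book is hypothetical and so is the rule for telling the two apart (anything starting with http:// or https:// is treated as a url). The demo writes a tiny temp file so it runs without a network connection.
</div>

```python
import io
import os
import tempfile

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen  # Python 2

def read_book(source):
    # return the text of a book given either a url or a local file path
    if source.startswith(("http://", "https://")):
        return urlopen(source).read().decode("utf-8", "ignore")
    with io.open(source, encoding="utf-8", errors="ignore") as f:
        return f.read()

# demo on a local temp file
tmp = tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False)
tmp.write("First sentence. Second sentence.")
tmp.close()
sentences = read_book(tmp.name).split('.')
os.remove(tmp.name)
print(sentences)
```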

In [128]:
def check_book(url, bag, swords, legals, eap_vec, hpl_vec, mws_vec):
    # count the book's words over the bag's vocabulary, then compare with each gothic vector
    all_book_words = dict.fromkeys(bag, 0)  # same keys as bag, every count starts at 0

    book_data = urllib.urlopen(url).read()
    book = book_data.split(".")  # crude sentence split on periods

    for sentence in book:
        words_in_book = sentence_wrangler(sentence, swords, legals)[0]
        for word in words_in_book:
            if word in bag:  # dict lookup; words not in bag are ignored
                all_book_words[word] += 1

    sorted_book_items = sorted(all_book_words.items())  # sort so vectors align
    book_vec = [pair[1] for pair in sorted_book_items]
    eap_x = cosine_similarity(eap_vec, book_vec)
    hpl_x = cosine_similarity(hpl_vec, book_vec)
    mws_x = cosine_similarity(mws_vec, book_vec)

    return (eap_x, hpl_x, mws_x)

In [163]:
# test to make sure get same values as by hand above
check_book('http://www.gutenberg.org/cache/epub/730/pg730.txt', bag_of_words, swords, legals, eap_vector, hpl_vector, mws_vector) # twist


  


(0.655333952147614, 0.5937931706997981, 0.5688023249478481)

In [131]:
# The Iliad by Homer
check_book('http://www.gutenberg.org/cache/epub/6130/pg6130.txt', bag_of_words, swords, legals, eap_vector, hpl_vector, mws_vector)  # close


  


(0.4759461574820262, 0.4596634751074897, 0.5413930736578376)

In [142]:
# Romeo and Juliet by Shakespeare
check_book('http://www.gutenberg.org/cache/epub/1112/pg1112.txt', bag_of_words, swords, legals, eap_vector, hpl_vector, mws_vector) 


  


(0.4059720034775994, 0.39459773807280546, 0.47293987506093177)

In [146]:
# Alice's Adventures in Wonderland by Lewis Carroll
check_book('http://www.gutenberg.org/files/11/11-0.txt', bag_of_words, swords, legals, eap_vector, hpl_vector, mws_vector) 


  


(0.4779796659228037, 0.4454179192795574, 0.417139349603236)

In [162]:
# The Hitchhikers Guide to the Internet by Ed Krol
check_book('http://www.gutenberg.org/cache/epub/39/pg39.txt', bag_of_words, swords, legals, eap_vector, hpl_vector, mws_vector) 


  


(0.2664383860912784, 0.2519317033136522, 0.23930042366143933)

<h2>
Closing note
</h2>
<div class=h1_cell>
<p>
I'm still interested in using these word-frequency vectors to see what we can do. Next week we will take a look at another way to reason with words, i.e. word co-occurrence matrices.
</div>