# Final Project: Exploration

* Authors: Andrew Larimer and Dan Rasband

The objective of this notebook is to experiment with fictional texts to determine the associations between characters and, if possible, to summarize each character's traits. The plan is to first use word counts, part-of-speech tagging, and auto-generated syntax trees to build a graph of character associations, then to use the same outputs to find adjectives that describe each character.

Texts were obtained from http://www.glozman.com/textpages.html

In [1]:
# File handling
import io

# Data Cleaning
import re

# Utils
from importlib import reload

# NLTK Stuff
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.data import load as nltk_load
from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordTokenizer

[nltk_data] Downloading package punkt to /home/ubuntu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/ubuntu/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [2]:
# Harry Potter: Book 1
! ls -lah data/harry_potter_1_sorcerer_s_stone.txt

-rw-rw-r-- 1 ubuntu ubuntu 439K Nov 14 08:27 data/harry_potter_1_sorcerer_s_stone.txt


In [26]:
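with open('data/harry_potter_1_sorcerer_s_stone.txt', mode='r', encoding='utf-8') as text_file:
    text = text_file.read()
    # Drop every run of two or more capital letters that is followed by
    # whitespace (all-caps words such as headings). There is no word-boundary
    # check, so an all-caps word inside dialogue is removed as well.
    text = re.sub(r'(?:[A-Z]{2,}\s+)', '', text)
    # Skip the 40 characters that precede the first sentence of the story.
    text = text[40:]
    print(text[0:500])
    print('\n...\n')
    print(text[-500:])
    print('Length: {}'.format(len(text)))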

Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense. 

Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amoun

...

p, boy, we haven't got all day." He walked away. 

Harry hung back for a last word with Ron and Hermione. 

"See you over the summer, then." 

"Hope you have -- er -- a good holiday," said Hermione, looking uncertainly after Uncle Vernon, shocked that anyone could be so unpleasant. 

"Oh, I will," said Harry, and they were surprised at the grin that was spreading over his face. "They don't know we're not allowed to use magic at home. I'm going to have a lot of fun with Dudley this summer.

In [27]:
cleaned_text = text.lower()
cleaned_text = re.sub(r'\s+', ' ', cleaned_text)
print(cleaned_text[200000:200500])

ndred and thirteen, if you could call it emptying, taking out that grubby little package. had that been what the thieves were looking for? as harry and ron walked back to the castle for dinner, their pockets weighed down with rock cakes they'd been too polite to refuse, harry thought that none of the lessons he'd had so far had given him as much to think about as tea with hagrid. had hagrid collected that package just in time? where was it now? and did hagrid know something about snape that he d


In [28]:
tokens = word_tokenize(cleaned_text)
print(tokens[0:100])

['mr.', 'and', 'mrs.', 'dursley', ',', 'of', 'number', 'four', ',', 'privet', 'drive', ',', 'were', 'proud', 'to', 'say', 'that', 'they', 'were', 'perfectly', 'normal', ',', 'thank', 'you', 'very', 'much', '.', 'they', 'were', 'the', 'last', 'people', 'you', "'d", 'expect', 'to', 'be', 'involved', 'in', 'anything', 'strange', 'or', 'mysterious', ',', 'because', 'they', 'just', 'did', "n't", 'hold', 'with', 'such', 'nonsense', '.', 'mr.', 'dursley', 'was', 'the', 'director', 'of', 'a', 'firm', 'called', 'grunnings', ',', 'which', 'made', 'drills', '.', 'he', 'was', 'a', 'big', ',', 'beefy', 'man', 'with', 'hardly', 'any', 'neck', ',', 'although', 'he', 'did', 'have', 'a', 'very', 'large', 'mustache', '.', 'mrs.', 'dursley', 'was', 'thin', 'and', 'blonde', 'and', 'had', 'nearly', 'twice']
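
The word-count step of the plan can be sketched with `collections.Counter` over a token list shaped like `tokens` above. This is only an illustration: `sample_tokens` is a short hand-written stand-in for the real list, and `candidate_names` is a hypothetical, manually chosen set of character names, not something the notebook has derived.

```python
from collections import Counter

# Hand-written stand-in for the lowercased `tokens` list (assumption).
sample_tokens = [
    'mr.', 'dursley', 'was', 'the', 'director', '.',
    'mrs.', 'dursley', 'was', 'thin', '.',
    'harry', 'said', 'nothing', '.',
]

# Hypothetical set of names to look for; in practice this would need
# to come from the text itself (for example from proper-noun tags).
candidate_names = {'harry', 'dursley', 'hagrid'}

# Count only the tokens that are candidate names.
name_counts = Counter(tok for tok in sample_tokens if tok in candidate_names)
print(name_counts.most_common())
```

Because the tokens were lowercased, names can no longer be told apart from ordinary words by capitalization, which is why the sketch needs an explicit name list.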


In [29]:
def sentence_tokenize(text):
    """
    Return a sentence-tokenized copy of *text*,
    using NLTK's pre-trained English Punkt model
    (:class:`.PunktSentenceTokenizer`), loaded
    from a local copy of the pickle.

    :param text: text to split into sentences
    """
    tokenizer = nltk_load('../nltk_data/tokenizers/punkt/english.pickle')
    return tokenizer.tokenize(text)


sentences = sentence_tokenize(re.sub(r'\s+', ' ', text))
print('\n\n'.join(sentences[:3]))

Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much.

They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense.

Mr. Dursley was the director of a firm called Grunnings, which made drills.


In [30]:
tree_tokenizer = TreebankWordTokenizer()
nltk.pos_tag(tree_tokenizer.tokenize(sentences[0]))

[('Mr.', 'NNP'),
 ('and', 'CC'),
 ('Mrs.', 'NNP'),
 ('Dursley', 'NNP'),
 (',', ','),
 ('of', 'IN'),
 ('number', 'NN'),
 ('four', 'CD'),
 (',', ','),
 ('Privet', 'NNP'),
 ('Drive', 'NNP'),
 (',', ','),
 ('were', 'VBD'),
 ('proud', 'JJ'),
 ('to', 'TO'),
 ('say', 'VB'),
 ('that', 'IN'),
 ('they', 'PRP'),
 ('were', 'VBD'),
 ('perfectly', 'RB'),
 ('normal', 'JJ'),
 (',', ','),
 ('thank', 'NN'),
 ('you', 'PRP'),
 ('very', 'RB'),
 ('much', 'RB'),
 ('.', '.')]
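
One possible way to turn tags like these into the association graph described in the introduction is to count how often two proper nouns (`NNP`) appear in the same sentence. The sketch below is an assumption about how that could work, not the notebook's method: `tagged_sentences` is hand-written in the shape `nltk.pos_tag` returns, and `TITLES` and `cooccurrence_counts` are hypothetical names.

```python
from collections import Counter
from itertools import combinations

# Hand-written (word, tag) pairs shaped like nltk.pos_tag output (assumption).
tagged_sentences = [
    [('Harry', 'NNP'), ('and', 'CC'), ('Ron', 'NNP'), ('walked', 'VBD'), ('back', 'RB')],
    [('Harry', 'NNP'), ('hung', 'VBD'), ('back', 'RB'), ('with', 'IN'),
     ('Ron', 'NNP'), ('and', 'CC'), ('Hermione', 'NNP')],
]

# Titles are tagged NNP as well, so they are filtered out explicitly.
TITLES = {'Mr.', 'Mrs.'}

def cooccurrence_counts(tagged_sents):
    """Count unordered pairs of distinct proper nouns sharing a sentence."""
    edges = Counter()
    for sent in tagged_sents:
        names = sorted({word for word, tag in sent
                        if tag == 'NNP' and word not in TITLES})
        for pair in combinations(names, 2):
            edges[pair] += 1
    return edges

edges = cooccurrence_counts(tagged_sentences)
print(edges.most_common())
```

Place names such as `Privet` and `Drive` are also tagged `NNP` in the output above, so a real version would need some way to separate people from places, for example named-entity tags.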

In [31]:
from helpers import core_nlp

In [45]:
server = core_nlp.CoreNLPServer(
    path_to_jar='/mnt/bigdrive/ubuntu/stanford-corenlp-full-2018-10-05/stanford-corenlp-3.9.2.jar',
    path_to_models_jar='/mnt/bigdrive/ubuntu/stanford-english-corenlp-2018-10-05-models.jar',
    port=9000)
server.start()

In [53]:
parser = core_nlp.CoreNLPDependencyParser(tagtype='ner')

In [62]:
sentences[0]

'Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much.'

In [63]:
parse, = parser.raw_parse(sentences[0])

In [64]:
print(parse.to_conll(4))

Mr.	NNP	4	compound
and	CC	1	cc
Mrs.	NNP	1	conj
Dursley	NNP	14	nsubj
,	,	4	punct
of	IN	7	case
number	NN	4	nmod
four	CD	7	nummod
,	,	4	punct
Privet	NNP	11	compound
Drive	NNP	4	appos
,	,	4	punct
were	VBD	14	cop
proud	JJ	0	ROOT
to	TO	16	mark
say	VB	14	xcomp
that	IN	21	mark
they	PRP	21	nsubj
were	VBD	21	cop
perfectly	RB	21	advmod
normal	JJ	16	ccomp
,	,	14	punct
thank	VB	14	dep
you	PRP	23	dobj
very	RB	26	advmod
much	RB	23	advmod
.	.	14	punct



In [65]:
print(parse.tree())

(proud
  (Dursley (Mr. and Mrs.) , (number of four) , (Drive Privet) ,)
  were
  (say to (normal that they were perfectly))
  ,
  (thank you (much very))
  .)


In [66]:
for governor, dep, dependent in parse.triples():
    print(governor, dep, dependent)

('proud', 'JJ') nsubj ('Dursley', 'NNP')
('Dursley', 'NNP') compound ('Mr.', 'NNP')
('Mr.', 'NNP') cc ('and', 'CC')
('Mr.', 'NNP') conj ('Mrs.', 'NNP')
('Dursley', 'NNP') punct (',', ',')
('Dursley', 'NNP') nmod ('number', 'NN')
('number', 'NN') case ('of', 'IN')
('number', 'NN') nummod ('four', 'CD')
('Dursley', 'NNP') punct (',', ',')
('Dursley', 'NNP') appos ('Drive', 'NNP')
('Drive', 'NNP') compound ('Privet', 'NNP')
('Dursley', 'NNP') punct (',', ',')
('proud', 'JJ') cop ('were', 'VBD')
('proud', 'JJ') xcomp ('say', 'VB')
('say', 'VB') mark ('to', 'TO')
('say', 'VB') ccomp ('normal', 'JJ')
('normal', 'JJ') mark ('that', 'IN')
('normal', 'JJ') nsubj ('they', 'PRP')
('normal', 'JJ') cop ('were', 'VBD')
('normal', 'JJ') advmod ('perfectly', 'RB')
('proud', 'JJ') punct (',', ',')
('proud', 'JJ') dep ('thank', 'VB')
('thank', 'VB') dobj ('you', 'PRP')
('thank', 'VB') advmod ('much', 'RB')
('much', 'RB') advmod ('very', 'RB')
('proud', 'JJ') punct ('.', '.')
