# Discover Insights into Classic Texts

Novels and text contain insights into ideologies and places that are often originally unknown to the reader. By reading a written piece, you uncover the opinions of the author on their chosen topic and come to understand both the topic and how the author thinks.

In this project we will perform a natural language parsing analysis to gain deeper insight into one of two famous and often discussed novels in the public domain: Oscar Wilde’s The Picture of Dorian Gray or Homer’s The Iliad! One of the beauties of natural language parsing with regular expressions is the ability to gain insight into lengthy pieces of text without a formal read!

By the end of this project, we will find out the main topics of discussion in the novel of choosing and can begin to discern some of the author’s thoughts and beliefs!

## Step 1. Import and Preprocess Text Data

There are text files for the *The Picture of Dorian Gray*, named `dorian_gray.txt`, and *The Iliad*, named `the_iliad.txt`, sourced from Project Gutenberg. Let's import either of the text and convert it to lowercase.

In [1]:
import nltk
from nltk import pos_tag, RegexpParser
from tokenize_words import word_sentence_tokenize
from chunk_counters import np_chunk_counter, vp_chunk_counter

nltk.download('averaged_perceptron_tagger')

# import text
text = open('dorian_gray.rtf', encoding='utf-8').read().lower()

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/awesomeness_a/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


With the text imported, now we need to split the text into individual sentences and then individual words. This allows us to perform a sentence-by-sentence parsing analysis!

`word_sentence_tokenize()` will tokenize a text and then word tokenize each sentence, returning a list of word tokenized sentences.

In [2]:
# sentence and word tokenize text
word_tokenized_text = word_sentence_tokenize(text)

# store and print any word tokenized sentence
single_word_tokenized_sentence = word_tokenized_text[100]
print(single_word_tokenized_sentence)

['in', 'the', 'grass', ',', 'white', 'daisies', 'were', 'tremulous.\\', '\\', 'after', 'a', 'pause', ',', 'lord', 'henry', 'pulled', 'out', 'his', 'watch', '.']


## Step 2. Part-of-speech Tag Text

Next we will part-of-speech tag each sentence to allow for syntax parsing! We begin by creating a list named `pos_tagged_text` that will hold each part-of-speech tagged sentence from the novel.

Then, we loop through each word tokenized sentence in `word_tokenized_text` and part-of-speech tag each sentence using nltk‘s `pos_tag()` function. After that, we append the result to `pos_tagged_text`.

We also save any part-of-speech tagged sentence in `pos_tagged_text` to a variable named `single_pos_sentence`.

In [3]:
# create a list to hold part-of-speech tagged sentences
pos_tagged_text = list()

# create a for loop through each word tokenized sentence
for token in word_tokenized_text:
  # part-of-speech tag each sentence and append to list of pos-tagged sentences
  pos_tagged_text.append(pos_tag(token))

# store and print any part-of-speech tagged sentence
single_pos_sentence = pos_tagged_text[67]
print(single_pos_sentence)

[('it', 'PRP'), ('is', 'VBZ'), ('better', 'RBR'), ('not', 'RB'), ('to', 'TO'), ('be', 'VB'), ('different', 'JJ'), ('from', 'IN'), ("one's\\", 'JJ'), ('fellows', 'NNS'), ('.', '.')]


## Step 3. Chunk Sentences

Now that we have part-of-speech tagged our text, we can move on to *syntax parsing*!

We begin by defining a piece of chunk grammar `np_chunk_grammar` that will chunk a noun phrase. A noun phrase consists of an optional determiner `DT`, followed by any number of adjectives `JJ`, followed by a noun `NN`.

Then, we create a nltk RegexpParser object named `np_chunk_parser` using the noun phrase chunk grammar we defined as an argument.

We define a piece of chunk grammar named `vp_chunk_grammar` that will chunk a verb phrase of the following form: noun phrase, followed by a verb `VB`, followed by an optional adverb `RB`.

After that, we create a nltk RegexpParser object named `vp_chunk_parser` using the verb phrase chunk grammar we defined as an argument.

`np_chunked_text` and `vp_chunked_text` will hold the chunked sentences from the text.

We loop through each part-of-speech tagged sentence in `pos_tagged_text` and noun phrase chunk each sentence using RegexpParser‘s `.parse()` method, and append the result to `np_chunked_text`.

Within the same loop, we verb phrase chunk each part-of-speech tagged sentence using RegexpParser‘s `.parse()` method, and append the result to `vp_chunked_text`.

In [4]:
# define noun phrase chunk grammar
np_chunk_grammar = 'NP: {<DT>?<JJ>*<NN>}'

# create noun phrase RegexpParser object
np_chunk_parser = RegexpParser(np_chunk_grammar)

# define verb phrase chunk grammar
vp_chunk_grammar = 'VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}'

# create verb phrase RegexpParser object
vp_chunk_parser = RegexpParser(vp_chunk_grammar)

# create a list to hold noun phrase chunked sentences and a list to hold verb phrase chunked sentences
np_chunked_text = list()
vp_chunked_text = list()


# create a for loop through each pos-tagged sentence
for sentence in pos_tagged_text:
  # chunk each sentence and append to lists here
  np_chunked_text.append(np_chunk_parser.parse(sentence))
  vp_chunked_text.append(vp_chunk_parser.parse(sentence))

## Step 4. Analyze Chunks

Now that we have chunked the novel, we can analyze the chunk frequencies to gain insights!

We need to use a function `np_chunk_counter()` that returns the 30 most common NP-chunks from a list of chunked sentences. 

Let's call `np_chunk_counter()` with `np_chunked_text` as an argument.

Let's also use `vp_chunk_counter()` that returns the 30 most common VP-chunks from a list of chunked sentences.

In [5]:
# store and print the most common NP-chunks
most_common_np_chunks = np_chunk_counter(np_chunked_text)
print(most_common_np_chunks)

[((('i', 'NN'),), 907), ((('\\', 'JJ'), ('\\', 'NN')), 787), ((('\\', 'NN'),), 333), ((('lord', 'NN'),), 184), ((('henry', 'NN'),), 180), ((('life', 'NN'),), 156), ((('harry', 'NN'),), 137), ((('something', 'NN'),), 117), ((('dorian', 'JJ'), ('gray', 'NN')), 117), ((('the\\', 'NN'),), 94), ((('he\\', 'NN'),), 88), ((('nothing', 'NN'),), 86), ((('basil', 'NN'),), 80), ((('anything', 'NN'),), 65), ((('the', 'DT'), ('world', 'NN')), 62), ((('hallward', 'NN'),), 61), ((('everything', 'NN'),), 60), ((('i\\', 'NN'),), 56), ((('the', 'DT'), ('man', 'NN')), 54), ((('love', 'NN'),), 53), ((('art', 'NN'),), 52), ((('the', 'DT'), ('room', 'NN')), 50), ((('dorian', 'NN'),), 50), ((('face', 'NN'),), 46), ((('course', 'NN'),), 46), ((('it\\', 'NN'),), 46), ((('the', 'DT'), ('door', 'NN')), 46), ((('and\\', 'NN'),), 46), ((('that\\', 'NN'),), 45), ((('round', 'NN'),), 42)]


In [6]:
# store and print the most common VP-chunks
most_common_vp_chunks = vp_chunk_counter(vp_chunked_text)
print(most_common_vp_chunks)

[((('i', 'NN'), ('am', 'VBP')), 97), ((('i', 'NN'), ('was', 'VBD')), 37), ((('i', 'NN'), ('want', 'VBP')), 33), ((('i', 'NN'), ('know', 'VBP')), 32), ((('i', 'NN'), ('have', 'VBP')), 30), ((('i', 'NN'), ('had', 'VBD')), 28), ((('i', 'NN'), ('suppose', 'VBP')), 17), ((('i', 'NN'), ('think', 'VBP')), 14), ((('i', 'NN'), ('do', 'VBP'), ("n't", 'RB')), 13), ((('he\\', 'NN'), ('had', 'VBD')), 13), ((('henry', 'NN'), ('had', 'VBD')), 12), ((('i', 'NN'), ('am', 'VBP'), ('not', 'RB')), 12), ((('\\', 'NN'), ('\\', 'VBZ')), 12), ((('i', 'NN'), ('am', 'VBP'), ('so', 'RB')), 11), ((('it\\', 'NN'), ('was', 'VBD')), 11), ((('i', 'NN'), ('believe', 'VBP')), 10), ((('dorian', 'JJ'), ('gray', 'NN'), ('was', 'VBD')), 10), ((('i', 'NN'), ('met', 'VBD')), 9), ((('i', 'NN'), ('thought', 'VBD')), 9), ((('i', 'NN'), ('did', 'VBD'), ("n't", 'RB')), 8), ((('i', 'NN'), ('am', 'VBP'), ('quite', 'RB')), 8), ((('i', 'NN'), ('said', 'VBD')), 8), ((('life', 'NN'), ('has', 'VBZ')), 8), ((('i', 'NN'), ('see', 'VBP')),

Looking at `most_common_np_chunks`, we can identify characters of importance in the text such as `henry`, `harry`, `dorian gray`, and `basil`, based on their frequency. Additionally another noun phrase `the picture` appears to be very relevant.

Looking at `most_common_vp_chunks`, some interesting findings appear. The verb phrases `i want`, `i know` and `i have` occur frequently, indicating a theme of desire and need.