# Insights Into Texts

## Import and Preprocess Text Data

The text comes from <a href="http://www.gutenberg.org/" target="_blank" rel="noopener noreferrer">Project Gutenberg</a>.
Open the chosen novel's text file, read it, convert it to lowercase, and store the result in a variable named `text`:
```python
text = open("______.txt", encoding='utf-8').read().lower()
```

In [5]:
from nltk import pos_tag, RegexpParser
from tokenize_words import word_sentence_tokenize
from chunk_counters import np_chunk_counter, vp_chunk_counter

In [6]:
# import text of choice
text = open("alice_in_wonderland.txt", encoding="utf-8").read().lower()

- Call <code>word_sentence_tokenize()</code> with the text as its argument. The function sentence-tokenizes the text and then word-tokenizes each sentence, returning a list of word-tokenized sentences.

In [7]:
# sentence and word tokenize text
word_tokenized_text = word_sentence_tokenize(text)
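
The helper module `tokenize_words` is not shown here, so its internals are unknown. As a rough illustration of the two-step idea (split into sentences, then split each sentence into tokens), here is a minimal standard-library sketch. The function name `simple_word_sentence_tokenize`, its regular expressions, and the sample string are assumptions made for this example; the real helper may behave differently, especially around abbreviations and quotation marks.

```python
import re

def simple_word_sentence_tokenize(text):
    """Hypothetical stand-in: split text into sentences, then each sentence into tokens."""
    # naive sentence split: break on whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # a token is either a run of word characters or a single punctuation mark
    token_pattern = re.compile(r"\w+|[^\w\s]")
    return [token_pattern.findall(sentence) for sentence in sentences if sentence]

# short lowercase sample, written for this example
sample = "i shall never get to twenty at that rate! let us try geography."
tokenized = simple_word_sentence_tokenize(sample)
print(tokenized[0])
```

The result has the same shape as `word_tokenized_text`: a list of sentences, each of which is a list of string tokens.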

- Save any word-tokenized sentence from <code>word_tokenized_text</code> to a variable and print it to see what the tokenizer produced.

In [8]:
# store and print any word tokenized sentence
single_word_tokenized_sentence = word_tokenized_text[100]
print(single_word_tokenized_sentence)

['i', 'shall', 'never', 'get', 'to', 'twenty', 'at', 'that', 'rate', '!']


## Part-of-Speech Tagging

- Create a list to hold each part-of-speech tagged sentence from the novel. The tagged sentences are the input for syntax parsing.

In [9]:
# create a list to hold part-of-speech tagged sentences
pos_tagged_text = []

- Loop through each word-tokenized sentence in <code>word_tokenized_text</code> and POS tag it with NLTK's <code>pos_tag()</code> function, then append the result to the list created above.

In [10]:
# create a for loop through each word tokenized sentence
for word_tokenized_sentence in word_tokenized_text:
    # part-of-speech tag each sentence and append to list above
    pos_tagged_text.append(pos_tag(word_tokenized_sentence))

- Save any POS-tagged sentence to a variable and print it to see the result.

In [11]:
# store and print any POS tagged sentence
single_sentence = pos_tagged_text[105]
print(single_sentence)

[('“', 'VB'), ('how', 'WRB'), ('cheerfully', 'RB'), ('he', 'PRP'), ('seems', 'VBZ'), ('to', 'TO'), ('grin', 'VB'), (',', ','), ('how', 'WRB'), ('neatly', 'RB'), ('spread', 'VB'), ('his', 'PRP$'), ('claws', 'NN'), (',', ','), ('and', 'CC'), ('welcome', 'JJ'), ('little', 'JJ'), ('fishes', 'NNS'), ('in', 'IN'), ('with', 'IN'), ('gently', 'RB'), ('smiling', 'VBG'), ('jaws', 'NN'), ('!', '.'), ('”', 'JJ'), ('“', 'NN'), ('i', 'NN'), ('’', 'VBP'), ('m', 'JJ'), ('sure', 'JJ'), ('those', 'DT'), ('are', 'VBP'), ('not', 'RB'), ('the', 'DT'), ('right', 'NN'), ('words', 'NNS'), (',', ','), ('”', 'NNP'), ('said', 'VBD'), ('poor', 'JJ'), ('alice', 'NN'), (',', ','), ('and', 'CC'), ('her', 'PRP$'), ('eyes', 'NNS'), ('filled', 'VBN'), ('with', 'IN'), ('tears', 'NNS'), ('again', 'RB'), ('as', 'IN'), ('she', 'PRP'), ('went', 'VBD'), ('on', 'IN'), (',', ','), ('“', 'FW'), ('i', 'NN'), ('must', 'MD'), ('be', 'VB'), ('mabel', 'VBN'), ('after', 'IN'), ('all', 'DT'), (',', ','), ('and', 'CC'), ('i', 'VB'), ('
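
Each tagged sentence is a list of `(word, tag)` tuples, so the tagged text is a list of lists. One quick way to get a feel for that structure is to count how often each tag occurs. The sketch below does this with `collections.Counter` on a few fragments copied from the output above; the grouping into three short lists and the name `tagged_sample` are choices made for this example. With the full novel, the same expression could be run over `pos_tagged_text`.

```python
from collections import Counter

# (word, tag) pairs copied from the tagged output above, grouped into short fragments
tagged_sample = [
    [("how", "WRB"), ("cheerfully", "RB"), ("he", "PRP"),
     ("seems", "VBZ"), ("to", "TO"), ("grin", "VB")],
    [("how", "WRB"), ("neatly", "RB"), ("spread", "VB"),
     ("his", "PRP$"), ("claws", "NN")],
    [("said", "VBD"), ("poor", "JJ"), ("alice", "NN")],
]

# flatten the nested lists and count every tag
tag_counts = Counter(tag for sentence in tagged_sample for _, tag in sentence)

for tag, count in sorted(tag_counts.items()):
    print(tag, count)
```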

## Chunk Sentences

- Syntax parsing: define a chunk grammar for noun phrases that matches an optional determiner <code>DT</code>, followed by any number of adjectives <code>JJ</code>, followed by a single noun <code>NN</code>.

In [12]:
# define noun phrase chunk grammar
np_chunk_grammar = "NP: {<DT>?<JJ>*<NN>}"
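
One way to read this grammar is as a regular expression over the sequence of tags rather than over characters: `?` marks the determiner as optional, `*` allows zero or more adjectives, and `<NN>` requires exactly the tag `NN`. The sketch below mimics that reading with the standard-library `re` module. It is an illustration of the pattern's meaning, not NLTK's implementation. The function `find_np_spans` and the hand-tagged sentence are made up for this example.

```python
import re

def find_np_spans(tagged_sentence):
    """Return word lists matching DT? JJ* NN, read as a regex over the tag sequence."""
    # encode the tags as one string, e.g. "<DT><JJ><NN><VBD>"
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_sentence)
    pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")
    chunks = []
    for match in pattern.finditer(tag_string):
        start = tag_string[:match.start()].count("<")  # index of the first matched token
        length = match.group().count("<")              # number of tokens in the match
        chunks.append([word for word, _ in tagged_sentence[start:start + length]])
    return chunks

# hand-tagged illustrative sentence (not tagger output)
hand_tagged = [
    ("the", "DT"), ("white", "JJ"), ("rabbit", "NN"), ("took", "VBD"),
    ("a", "DT"), ("watch", "NN"), ("and", "CC"), ("gloves", "NNS"),
]
print(find_np_spans(hand_tagged))
```

Here "the white rabbit" and "a watch" are matched, while "gloves" is not, because its tag `NNS` is not the same as `NN`. A grammar meant to include plural nouns would need a broader tag pattern.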

- Create an NLTK <code>RegexpParser</code> object, passing the noun phrase chunk grammar defined above as its argument.

In [13]:
# create noun phrase chunk parser
np_chunk_parser = RegexpParser(np_chunk_grammar)
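
To see what the parser does with a tagged sentence, it can be applied to a short list of `(word, tag)` pairs. The sketch below uses a fragment copied from the tagged output printed earlier (the curly-quote token is left out). It assumes only that NLTK is installed; no tagger or tokenizer data is needed because the tags are supplied by hand. The names `tagged_fragment`, `chunked`, and `noun_phrases` are chosen for this example.

```python
from nltk import RegexpParser

np_chunk_parser = RegexpParser("NP: {<DT>?<JJ>*<NN>}")

# (word, tag) pairs copied from the tagged output printed earlier
tagged_fragment = [
    ("those", "DT"), ("are", "VBP"), ("not", "RB"), ("the", "DT"),
    ("right", "NN"), ("words", "NNS"), (",", ","), ("said", "VBD"),
    ("poor", "JJ"), ("alice", "NN"),
]

# parse() returns an nltk Tree; matched noun phrases become subtrees labeled "NP"
chunked = np_chunk_parser.parse(tagged_fragment)

# collect the words of every NP subtree
noun_phrases = [
    " ".join(word for word, _ in subtree.leaves())
    for subtree in chunked.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)
```

With this fragment the chunks are "the right" and "poor alice". The first shows how the tagger's choice of `NN` for "right" shapes the result: "words" is tagged `NNS` and is left outside the chunk.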