![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fcurriculum-notebooks&branch=master&subPath=EnglishLanguageArts/PartsOfSpeech/parts-of-speech.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# Parts of Speech

In this notebook we will compare how various parts of speech are used in two public domain books from [Project Gutenberg](https://gutenberg.org), *Frankenstein* and *Anne of Green Gables*, using the Python natural language processing library [spaCy](https://spacy.io/).

You can run all of the code cells in this notebook by clicking the `▶▶` button in the toolbar above or by choosing `Run all` from the `Cell` menu.

## Frankenstein

by Mary Wollstonecraft (Godwin) Shelley

### Parts of Speech Proportions

First we'll download the book and create a chart of the relative proportions of the following [parts of speech](https://universaldependencies.org/docs/u/pos):

* adjectives
* adverbs
* verbs
* nouns
* proper nouns
* pronouns
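
To make "relative proportion" concrete, here is a minimal sketch that computes each tag's share of the words in one short sentence. The sentence and its tags are made up and assigned by hand for illustration (in the notebook, spaCy assigns the tags), and `pos_proportions` is a hypothetical helper, not part of spaCy.

```python
from collections import Counter

# Hand-tagged words for illustration only; spaCy would normally supply these tags.
tagged_words = [
    ('She', 'PRON'), ('quickly', 'ADV'), ('opened', 'VERB'), ('the', 'DET'),
    ('old', 'ADJ'), ('door', 'NOUN'), ('in', 'ADP'), ('Geneva', 'PROPN'),
]
word_types = ['ADJ', 'ADV', 'VERB', 'NOUN', 'PROPN', 'PRON']

def pos_proportions(tagged, types):
    """Return each listed tag's share of all the words in `tagged`."""
    counts = Counter(tag for _, tag in tagged)  # how often each tag occurs
    total = len(tagged)                         # total number of words
    return {tag: counts[tag] / total for tag in types}

proportions = pos_proportions(tagged_words, word_types)
print(proportions)
```

Each of the six listed types appears once among the eight words, so each proportion is 0.125. The values do not sum to 1 because words with other tags (here `DET` and `ADP`) still count toward the total.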

In [None]:
import requests
r = requests.get('http://www.gutenberg.org/files/84/84-0.txt') # get the online book file
r.encoding = 'utf-8' # specify the type of text encoding in the file
text = r.text.split('***')[2] # get the part after the header
text = text.replace("’","'").replace("“",'"').replace("”",'"') # replace any 'smart quotes'

import pandas as pd
chapters = [] # create an empty list of chapter records that we will add to
for chapter in text.split('Chapter'):
    if len(chapter)>500: # so that we are getting actual book chapters
        chapter = chapter.replace('\r',' ').replace('\n',' ').replace('  ', ' ') # replace the 'new line' characters with spaces
        chapters.append({'Chapter Text':chapter, 'Length':len(chapter)})
frankenstein = pd.DataFrame(chapters) # build the data frame in one step (DataFrame.append was removed in pandas 2.0)

try:
    import en_core_web_sm
except ImportError: # the language model is not installed yet
    !python -m spacy download en_core_web_sm
    import en_core_web_sm
from IPython.display import clear_output
clear_output()
print('Running natural language processing, please wait...')
nlp = en_core_web_sm.load() # set up natural language processing
word_types = ['ADJ', 'ADV', 'VERB', 'NOUN', 'PROPN', 'PRON'] # https://universaldependencies.org/docs/u/pos
rows = [] # create an empty list of per-chapter results
for i in range(len(frankenstein)): # iterate through the chapters dataframe
    doc = nlp(frankenstein['Chapter Text'][i]) # run natural language processing on the chapter
    words = [token for token in doc if not token.is_punct and not token.is_space] # ignore punctuation and whitespace
    parts_of_speech_list = [token.pos_ for token in words if token.pos_ in word_types] # keep only the types we listed
    word_type_count = {} # create an empty dictionary
    for word_type in word_types:
        word_type_count[word_type] = parts_of_speech_list.count(word_type)/len(words) # proportion of the chapter's words
    rows.append(word_type_count) # add to the list
parts_of_speech_df = pd.DataFrame(rows, columns=word_types) # create the dataframe (DataFrame.append was removed in pandas 2.0)
parts_of_speech_df['Chapter'] = parts_of_speech_df.index+1 # create a Chapter column
parts_of_speech_df = parts_of_speech_df.set_index('Chapter') # set the dataframe index to be that new column
clear_output()
pd.options.plotting.backend='plotly' # use the Plotly backend rather than matplotlib
parts_of_speech_df.plot.line(title='Parts of Speech Proportions in Frankenstein by Chapter',
                             labels=dict(value='Proportion of Total Words'))# create and display the visualization

### Pronoun Usage

We can also look at the **cumulative** usage of various pronouns throughout the book.
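
A cumulative count is a running total: the value for each chapter is the number of mentions in that chapter plus all earlier ones. The sketch below illustrates the idea with three made-up chapters; `count_he` is a hypothetical helper that uses a crude regular-expression word splitter rather than spaCy's tokenizer, so its counts on real text may differ slightly from the notebook's.

```python
import re
from itertools import accumulate

# Made-up chapters for illustration only.
chapters = [
    "I saw him, and he saw me.",
    "She said he would write to me. He did.",
    "I waited, but he never wrote.",
]

def count_he(chapter_text):
    """Count the word 'he' in any capitalization."""
    words = re.findall(r"[A-Za-z]+", chapter_text)  # crude word splitter, not spaCy
    return sum(1 for word in words if word.lower() == 'he')

per_chapter = [count_he(chapter) for chapter in chapters]  # mentions within each chapter
cumulative = list(accumulate(per_chapter))                 # running total up to each chapter
print(per_chapter)  # [1, 2, 1]
print(cumulative)   # [1, 3, 4]
```

The pandas `cumsum()` method computes the same kind of running total for every column of a data frame at once.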

In [None]:
print('Running natural language processing, please wait...')
he = [] # create an empty list that we will add to
she = []
me = []
I = []
for row_number in range(len(frankenstein)): # loop through the dataframe
    he.append(0) # start with zero mentions in this list for this time through the loop
    she.append(0)
    me.append(0)
    I.append(0)
    for token in nlp(frankenstein['Chapter Text'][row_number]): # loop through the words in the chapter
        if token.lower_ == 'he':
            he[row_number] += 1 # increment this chapter's count if the word is "he"
        if token.lower_ == 'she':
            she[row_number] += 1
        if token.lower_ == 'me':
            me[row_number] += 1
        if token.text == 'I':
            I[row_number] += 1
frankenstein['he'] = he # add the list as a column in the dataframe
frankenstein['she'] = she
frankenstein['me'] = me
frankenstein['I'] = I
clear_output()
frankenstein.drop(columns=['Chapter Text','Length']).cumsum().plot(
    title='Cumulative Pronouns in Frankenstein',labels=dict(index='Chapter')) # create and display the visualization

The next code cell creates a visualization of pronoun usage **by chapter**.

In [None]:
frankenstein.drop(columns=['Chapter Text','Length']).plot(
    title='Pronouns in Frankenstein by Chapter',labels=dict(index='Chapter'))

## Anne of Green Gables

by Lucy Maud Montgomery

### Parts of Speech

In [None]:
url = 'http://www.gutenberg.org/files/45/45-0.txt'
r = requests.get(url)
r.encoding = 'utf-8'
text = r.text.split('***')[2]
text = text.replace("’","'").replace("“",'"').replace("”",'"')

print('Running natural language processing, please wait...')
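chapters = [] # create an empty list of chapter records
for chapter in text.split('CHAPTER'):
    if len(chapter)>500:
        chapter = chapter.replace('\r',' ').replace('\n',' ').replace('  ', ' ')
        chapters.append({'Chapter Text':chapter, 'Length':len(chapter)})
anne = pd.DataFrame(chapters) # build the data frame in one step (DataFrame.append was removed in pandas 2.0)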
word_types = ['ADJ', 'ADV', 'VERB', 'NOUN', 'PRON', 'PROPN'] # https://universaldependencies.org/docs/u/pos
rows = [] # create an empty list of per-chapter results
for i in range(len(anne)):
    doc = nlp(anne['Chapter Text'][i])
    words = [token for token in doc if not token.is_punct and not token.is_space] # ignore punctuation and whitespace
    pos_list = [token.pos_ for token in words if token.pos_ in word_types]
    word_type_count = {}
    for word_type in word_types:
        word_type_count[word_type] = pos_list.count(word_type)/len(words) # proportion of the chapter's words
    rows.append(word_type_count)
pos_df = pd.DataFrame(rows, columns=word_types) # DataFrame.append was removed in pandas 2.0
pos_df['Chapter'] = pos_df.index+1
pos_df = pos_df.set_index('Chapter')
clear_output()
pos_df.plot.line(title='Parts of Speech Proportions in Anne of Green Gables by Chapter',labels=dict(value='Proportion of Total Words'))

### Pronoun Usage

**Cumulative** pronoun usage.

In [None]:
he = []
she = []
me = []
I = []
print('Running natural language processing, please wait...')
for row_number in range(len(anne)):
    he.append(0)
    she.append(0)
    me.append(0)
    I.append(0)
    for token in nlp(anne['Chapter Text'][row_number]):
        if token.lower_ == 'he':
            he[row_number] += 1
        if token.lower_ == 'she':
            she[row_number] += 1
        if token.lower_ == 'me':
            me[row_number] += 1
        if token.text == 'I':
            I[row_number] += 1
anne['he'] = he
anne['she'] = she
anne['me'] = me
anne['I'] = I
clear_output()
anne.drop(columns=['Chapter Text','Length']).cumsum().plot(
    title='Cumulative Pronouns in Anne of Green Gables',labels=dict(index='Chapter'))

Pronoun usage **by chapter**.

In [None]:
anne.drop(columns=['Chapter Text','Length']).plot(
    title='Pronouns in Anne of Green Gables by Chapter',labels=dict(index='Chapter'))

## Conclusion

We hope this has been an interesting introduction to natural language processing and text analysis, and that it inspires you to try some NLP of your own by modifying the Python code.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)