![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fcurriculum-notebooks&branch=master&subPath=EnglishLanguageArts/GreatGatsbyDataScience/great-gatsby-data-science.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# The Great Gatsby and Data Science

![Great Gatsby Cover Art](https://www.gutenberg.org/cache/epub/64317/pg64317.cover.medium.jpg)

Now that [The Great Gatsby](https://en.wikipedia.org/wiki/The_Great_Gatsby) is in the [public domain](https://en.wikipedia.org/wiki/Public_domain), there are no restrictions on how it can be used. We'll get a digital copy of the novel from [Project Gutenberg](https://www.gutenberg.org/ebooks/64317), and then do some data science analysis of the text.

To start, click on the code cell below then click the `▶Run` button.

In [None]:
book_url = 'https://www.gutenberg.org/cache/epub/64317/pg64317.txt'
import requests # import the requests code library
r = requests.get(book_url) # get the text from the url
r.encoding = 'utf-8' # specify the type of text encoding in the file
text = r.text.split('***')[2] # keep the part between the header and footer marker lines
text = text.replace("’","'").replace("“",'"').replace("”",'"') # replace any 'smart quotes'
print(text) # print the text of the book
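
To see why piece number `2` is the one we want, here is a small sketch of the same split on a made-up miniature file. The `sample` string and its marker wording are assumptions for illustration only, not the actual Project Gutenberg text.

```python
# A made-up miniature of a Project Gutenberg file (not the real text)
sample = (
    "Header and license notes\n"
    "*** START OF THE EBOOK ***\n"
    "Once again to Zelda\n"
    "*** END OF THE EBOOK ***\n"
    "Footer and license notes"
)

parts = sample.split('***')  # cut the text at every '***'
body = parts[2].strip()      # piece 2 sits between the START and END lines

print(len(parts))  # how many pieces the split produced
print(body)        # the text between the two marker lines
```

Pieces `0` and `4` are the header and footer, while pieces `1` and `3` are the words inside the marker lines themselves.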

## Dividing into Chapters

Now that we have the text of the book, let's split it into chapters. We'll use the Python [library](https://en.wikipedia.org/wiki/Library_(computing)) called [pandas](https://pandas.pydata.org) to create a [dataframe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) that includes the text of each chapter.

To determine where a new chapter starts in the text of this book, we will look for three line breaks in a row (which leave two blank lines), indicated by the [newline characters](https://en.wikipedia.org/wiki/Newline) `\r\n\r\n\r\n`.

`▶Run` the code cell below.

In [None]:
import pandas as pd
ch = [] # create an empty list that we will append to
for chapter in text.split('\r\n\r\n\r\n'):
    chapter = chapter.replace('\r','').replace('\n',' ').replace('\t','') # remove the newline characters
    chapter = chapter.replace('  ',' ').replace('  ',' ').strip() # remove extra spaces
    chapter = chapter.replace('--','') # remove extra dashes
    ch.append(chapter) # append the chapter text to the list
book = pd.DataFrame(ch, columns=['Text']) # create a dataframe from the list
book = book.drop(book.index[0:3]).reset_index(drop=True) # drop the first three rows because they are not chapters
book
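
The splitting and cleaning steps above can be tried on a tiny made-up string before trusting them on the whole novel. This is only a sketch: `sample` is invented text, and it skips the tab and dash clean-up for brevity.

```python
# Two invented "chapters" separated by three line breaks in a row
sample = "I\r\n\r\nFirst chapter text.\r\n\r\n\r\nII\r\n\r\nSecond chapter\r\ntext."

chapters = []
for chapter in sample.split('\r\n\r\n\r\n'):
    chapter = chapter.replace('\r', '').replace('\n', ' ')  # turn line breaks into spaces
    chapter = chapter.replace('  ', ' ').strip()            # collapse double spaces
    chapters.append(chapter)

print(chapters)
```

Notice that the pair of line breaks after each Roman numeral is not long enough to trigger a split, so the numeral stays attached to its chapter.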

We can see that the 0th "chapter" is the poem at the start of the novel, but we can leave that in. However, let's remove the Roman numeral from the start of each chapter.

In [None]:
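for x in range(1, len(book)): # skip row 0, the poem, which has no Roman numeral
    chapter_text = book.loc[x, 'Text']
    text_start = chapter_text.find(' ')+1 # the position just after the first space
    book.loc[x, 'Text'] = chapter_text[text_start:] # keep everything after the numeral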
book
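
Here is a minimal sketch of the same idea on a single string, assuming (as in our chapters) that the Roman numeral is the first "word" and is followed by a space. The `chapter` text is made up.

```python
chapter = "IV On Sunday morning the world returned."  # made-up chapter opening

text_start = chapter.find(' ') + 1  # index of the character after the first space
trimmed = chapter[text_start:]      # slice from there to the end

print(trimmed)

# If there is no space at all, find() returns -1, so text_start is 0
# and the whole string is kept unchanged.
no_space = "IV"
unchanged = no_space[no_space.find(' ') + 1:]
print(unchanged)
```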

### Calculating Chapter Length

`▶Run` the following cell to create a `Length` column in the dataframe. The length will be the number of characters (letters, numbers, spaces, and punctuation) in the chapter.

In [None]:
book['Length'] = book['Text'].apply(len)
book
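
To check what `len` is counting, we can try it on one short made-up sentence and compare it with a rough word count. This is just an illustration and does not use the dataframe.

```python
sample = "So we beat on."  # a short made-up sample

character_count = len(sample)     # every character, including spaces and the period
word_count = len(sample.split())  # pieces separated by spaces, for comparison

print(character_count, word_count)
```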

## Visualizing Chapter Lengths

Let's create a bar graph and a pie chart of the chapter lengths using the [Plotly Express](https://plotly.com/python/plotly-express/) code library.

In [None]:
import plotly.express as px
px.bar(book, x=book.index, y='Length', title='Chapter Lengths in <i>The Great Gatsby</i>').update_xaxes(title='Chapter').show()
px.pie(book, values='Length', names=book.index, title='Chapter Lengths in <i>The Great Gatsby</i>').update_traces(textinfo='percent+label').show()

## Natural Language Processing

We'll use the [spaCy](https://spacy.io) natural language processing library to process the text and create new columns in our dataframe containing the identified sentences and words. This will allow us to summarize the text and perform some other analysis.

It may take a little while after you click `▶Run`.

In [None]:
# uncomment the following two installation lines if you encounter errors
#!pip install spacy --user
#!python -m spacy download en_core_web_sm
import en_core_web_sm
nlp = en_core_web_sm.load() # set up natural language processing
def processLanguage(chapter):
    processed = nlp(chapter)
    sentences = list(processed.sents)
    words = [] # create an empty list
    for token in processed:
        if token.is_alpha: # if the token is a word
            words.append(token)
    return sentences, words

book['Sentences'], book['NLP'] = zip(*book['Text'].apply(processLanguage))
book

### Summarizing the Text


#### Stop Words

To summarize each chapter, we'll start by removing any [stop words](https://en.wikipedia.org/wiki/Stop_word). Stop words are words such as `a`, `it`, `the`, or `and` that don't really add any meaning.

In [None]:
def removeStopWords(chapter):
    chapter = [word for word in chapter if not word.is_stop] # keep only the words that are not stop words
    return chapter
book['SignificantWords'] = book['NLP'].apply(removeStopWords)
book
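
The same filtering pattern can be sketched without spaCy by using a small set of stop words. The `stop_words` set below is a hypothetical, hand-picked list that is far shorter than the one spaCy uses, so the results will not match spaCy's exactly.

```python
# Hypothetical, hand-picked stop words (spaCy's own list is much longer)
stop_words = {'a', 'it', 'the', 'and', 'in', 'my', 'of'}

words = ['in', 'my', 'younger', 'and', 'more', 'vulnerable', 'years']

# keep only the words that are not in the stop word set
significant = [word for word in words if word not in stop_words]

print(significant)
```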

#### Most Common Words

Let's find the significant words in each chapter by their relative frequencies. This will allow us to weight each sentence based on how many highly significant words it contains.

In [None]:
from collections import Counter
def wordFrequencyCounter(chapter):
    words = [word.text for word in chapter]
    word_frequencies = Counter(words).most_common()
    max_frequency = word_frequencies[0][1]
    for w in range(len(word_frequencies)):
        word_frequencies[w] = (word_frequencies[w][0], word_frequencies[w][1]/max_frequency) # normalize the word counts to values between 0 and 1
    return word_frequencies
book['SignificantWordFrequencies'] = book['SignificantWords'].apply(wordFrequencyCounter)
book
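
As a small sketch of the normalization step, here is the same calculation on a short made-up list of words. Each count is divided by the largest count, so the most common word always scores 1.

```python
from collections import Counter

words = ['green', 'light', 'green', 'dock', 'light', 'green']  # made-up sample

counts = Counter(words).most_common()  # (word, count) pairs, most common first
max_count = counts[0][1]               # the count of the most common word

# divide every count by the largest one to get values between 0 and 1
relative = [(word, count / max_count) for word, count in counts]

for word, frequency in relative:
    print(word, round(frequency, 2))
```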

#### Creating the Summaries

First we will weight the sentences by the frequencies of the significant words, then we'll record the five percent of the sentences that have the highest scores.

In [None]:
from heapq import nlargest

def summarize(sentences, word_frequencies):
    number_to_output = int(len(sentences)*0.05) # 5% of the sentences
    weights = dict(word_frequencies) # look up each significant word's relative frequency
    sentence_scores = {}
    for sentence in sentences:
        sentence_score = 0
        for word in sentence:
            sentence_score += weights.get(word.text, 0) # other words add nothing
        sentence_scores[sentence] = sentence_score
    summary = nlargest(number_to_output, sentence_scores, key=sentence_scores.get)
    return summary

# summarize each chapter using that chapter's own word frequencies
book['Summary'] = [summarize(s, f) for s, f in zip(book['Sentences'], book['SignificantWordFrequencies'])]
book
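
To see the scoring idea without spaCy, here is a rough sketch that uses plain strings. The `weights` values and the sentences are made up, and words are separated with a simple `split()` rather than real language processing.

```python
from heapq import nlargest

# Hypothetical relative frequencies for a few significant words
weights = {'green': 1.0, 'light': 0.5, 'dock': 0.25}

sentences = [
    'he stretched out his arms',
    'a single green light',
    'the green light at the end of the dock',
    'it was a warm night',
]

scores = {}
for sentence in sentences:
    # add up the weights of the words in the sentence; unknown words count as 0
    scores[sentence] = sum(weights.get(word, 0) for word in sentence.split())

top = nlargest(1, scores, key=scores.get)  # the single highest-scoring sentence
print(top)
```

With only four sentences, `int(4*0.05)` would be 0, so this sketch asks for one sentence directly. Because the scores are simple sums, longer sentences have more chances to collect weight.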

### Parts of Speech

Next we'll identify the [parts of speech](https://spacy.io/api/annotation#pos-tagging) in the text to see how many verbs, adjectives, nouns, and proper nouns there are. We'll also calculate word counts and proportions.

In [None]:
book['Word Count'] = book['NLP'].apply(len)
book['Verbs'] = book['NLP'].apply(lambda x: [token for token in x if token.pos_ == 'VERB'])
book['Verb Count'] = book['Verbs'].apply(len)
book['Verb %'] = book['Verb Count']/book['Word Count']*100
book['Adjectives'] = book['NLP'].apply(lambda x: [token for token in x if token.pos_ == 'ADJ'])
book['Adjective Count'] = book['Adjectives'].apply(len)
book['Adjective %'] = book['Adjective Count']/book['Word Count']*100
book['Nouns'] = book['NLP'].apply(lambda x: [token for token in x if token.pos_ == 'NOUN'])
book['Noun Count'] = book['Nouns'].apply(len)
book['Noun %'] = book['Noun Count']/book['Word Count']*100
book['Proper Nouns'] = book['NLP'].apply(lambda x: [token for token in x if token.pos_ == 'PROPN'])
book['Proper Noun Count'] = book['Proper Nouns'].apply(len)
book['Proper Noun %'] = book['Proper Noun Count']/book['Word Count']*100
book
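
The count and percentage calculations can be sketched on a handful of words. In this example the part-of-speech tags are assigned by hand for illustration, not produced by spaCy, although they use the same tag names.

```python
from collections import Counter

# Made-up words with hand-assigned part-of-speech tags
tagged = [
    ('Gatsby', 'PROPN'), ('believed', 'VERB'), ('in', 'ADP'),
    ('the', 'DET'), ('green', 'ADJ'), ('light', 'NOUN'),
]

word_count = len(tagged)
tag_counts = Counter(pos for word, pos in tagged)  # how many words carry each tag

verb_percent = tag_counts['VERB'] / word_count * 100  # same formula as the 'Verb %' column
print(round(verb_percent, 1))
```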

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)