![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fcurriculum-notebooks&branch=master&subPath=EnglishLanguageArts/GreatGatsbyDataScience/great-gatsby-data-science.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# The Great Gatsby and Data Science

![Great Gatsby Cover Art](https://www.gutenberg.org/cache/epub/64317/pg64317.cover.medium.jpg)

Now that [The Great Gatsby](https://en.wikipedia.org/wiki/The_Great_Gatsby) is in the [public domain](https://en.wikipedia.org/wiki/Public_domain), there are no restrictions on how it can be used. We'll get a digital copy of the novel from [Project Gutenberg](https://www.gutenberg.org/ebooks/64317), and then do some data science analysis of the text.

To start, click on the code cell below, then click the `▶Run` button.

In [None]:
book_url = 'https://www.gutenberg.org/cache/epub/64317/pg64317.txt'
import requests # import the requests code library
r = requests.get(book_url) # get the text from the url
r.encoding = 'utf-8' # specify the type of text encoding in the file
text = r.text.split('***')[2] # get the part after the header
text = text.replace("’","'").replace("“",'"').replace("”",'"') # replace any 'smart quotes'
print(text) # print the text of the book
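To see why the code above keeps element `[2]` after splitting on `***`, here is a minimal sketch using a made-up miniature file. It assumes the file has exactly one `*** START ... ***` line and one `*** END ... ***` line, which is the layout the cell above relies on. The `sample` text is invented for illustration.

```python
# A made-up miniature of a Project Gutenberg text file (illustration only)
sample = (
    "Header and licence notes\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Once again to Zelda\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Footer and licence text\n"
)

parts = sample.split('***')  # cut the text at every '***'
# parts[0] is the header, parts[1] the START marker text,
# parts[2] the book itself, parts[3] the END marker text, parts[4] the footer
body = parts[2].strip()  # remove the leftover line breaks around the book text
print(len(parts))
print(body)
```

If a file had extra `***` sequences before the book text, the index would need to change, so it is worth printing the first few characters of the result to check.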

## Dividing into Chapters

Now that we have the text of the book, let's split it into chapters. We'll use the Python [library](https://en.wikipedia.org/wiki/Library_(computing)) called [pandas](https://pandas.pydata.org) to create a [dataframe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) that includes the text of each chapter.

To determine where a new chapter starts in the text of this book, we will look for three consecutive [newline characters](https://en.wikipedia.org/wiki/Newline) sequences, `\r\n\r\n\r\n`, which appear in the text as a gap of blank lines.

`▶Run` the code cell below.

In [None]:
import pandas as pd
ch = [] # create an empty list that we will append to
for chapter in text.split('\r\n\r\n\r\n'):
    chapter = chapter.replace('\r','').replace('\n',' ').replace('\t','') # remove the newline characters
    chapter = chapter.replace('  ',' ').replace('  ',' ').strip() # remove extra spaces
    chapter = chapter.replace('--','') # remove extra dashes
    ch.append(chapter) # append the chapter text to the list
book = pd.DataFrame(ch, columns=['Text']) # create a dataframe from the list
book = book.drop(book.index[0:3]).reset_index(drop=True) # drop the first three rows because they are not chapters
book

We can see that the 0th "chapter" is the poem at the start of the novel, but we can leave that in. However, let's remove the Roman numeral from the start of each chapter.

In [None]:
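for x in range(1, len(book)):
    text_start = book.loc[x, 'Text'].find(' ')+1 # find the position just after the first space
    book.loc[x, 'Text'] = book.loc[x, 'Text'][text_start:] # keep only the text after the Roman numeral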
book
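The two steps above (splitting on the blank-line gap, then dropping the leading Roman numeral) can be tried on a tiny made-up text. This is only a sketch: the `sample` string is invented, and it collapses whitespace with `split()` and `join()` rather than the chain of `replace()` calls used in the notebook.

```python
# Made-up text with '\r\n\r\n\r\n' between two "chapters" (illustration only)
sample = (
    "I\r\n\r\nIn my younger years\r\nI listened."
    "\r\n\r\n\r\n"
    "II\r\n\r\nAbout half way\r\nthere."
)

chapters = []
for section in sample.split('\r\n\r\n\r\n'):
    section = ' '.join(section.split())  # collapse line breaks and extra spaces into single spaces
    numeral, _, rest = section.partition(' ')  # split at the first space
    chapters.append(rest)  # keep everything after the Roman numeral
print(chapters)
```

This only works when every section really does begin with a numeral followed by a space, which is why the notebook skips the 0th row (the poem).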

### Calculating Chapter Length

`▶Run` the following cell to create a `Length` column in the dataframe. The length will be the number of characters (letters, numbers, spaces, and punctuation) in the chapter.

In [None]:
book['Length'] = book['Text'].apply(len)
book

## Visualizing Chapter Lengths

Let's create a bar graph and a pie chart of the chapter lengths using the [Plotly Express](https://plotly.com/python/plotly-express/) code library.

In [None]:
import plotly.express as px
px.bar(book, x=book.index, y='Length', title='Chapter Lengths in <i>The Great Gatsby</i>').update_xaxes(title='Chapter').show()
px.pie(book, values='Length', names=book.index, title='Chapter Lengths in <i>The Great Gatsby</i>').update_traces(textinfo='percent+label').show()

## Natural Language Processing

We'll use the [spaCy](https://spacy.io) natural language processing library to process the text and create new columns in our dataframe containing the identified sentences and words. This will allow us to summarize the text and perform some other analysis.

It may take a little while after you click `▶Run`.

In [None]:
try:
    import en_core_web_sm
    nlp = en_core_web_sm.load() # set up natural language processing
except ImportError: # the language model is not installed yet, so install it
    !pip install spacy --user
    !python -m spacy download en_core_web_sm
    import en_core_web_sm
    nlp = en_core_web_sm.load()

def processLanguage(chapter):
    processed = nlp(chapter)
    sentences = list(processed.sents)
    words = [] # create an empty list
    for token in processed:
        if token.is_alpha: # if the token is a word
            words.append(token)
    return sentences, words

book['Sentences'], book['NLP'] = zip(*book['Text'].apply(processLanguage))
book

### Summarizing the Text


#### Stop Words

To summarize each chapter, we'll start by removing any [stop words](https://en.wikipedia.org/wiki/Stop_word). Stop words are very common words such as `a`, `it`, `the`, or `and` that carry little meaning on their own.

In [None]:
def removeStopWords(chapter):
    chapter = [word for word in chapter if not word.is_stop] # keep only the words that are not stop words
    return chapter
book['SignificantWords'] = book['NLP'].apply(removeStopWords)
book['SignificantWords'][0] # print the significant words in the zeroth chapter
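The same filtering idea can be seen without spaCy. The sketch below uses a tiny hand-picked stop-word set and a hypothetical helper `remove_stop_words`; both are assumptions made for illustration, and spaCy's built-in list is much longer.

```python
# A tiny hand-picked stop-word set (hypothetical; not spaCy's list)
STOP_WORDS = {'a', 'it', 'the', 'and', 'was', 'of'}

def remove_stop_words(words):
    """Keep only the words that are not in STOP_WORDS, ignoring capitalization."""
    return [word for word in words if word.lower() not in STOP_WORDS]

words = 'The green light was a symbol of hope and longing'.split()
significant = remove_stop_words(words)
print(significant)
```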

#### Most Common Words

If we find the frequencies of the significant words in each chapter, we will be able to weight each sentence based on how many of those words it contains.

In [None]:
from collections import Counter
def wordFrequencyCounter(chapter):
    words = [word.text for word in chapter]
    word_frequencies = Counter(words).most_common()
    if chapter:
        max_frequency = word_frequencies[0][1]
        for w in range(len(word_frequencies)):
            word_frequencies[w] = (word_frequencies[w][0], word_frequencies[w][1]/max_frequency) # normalize the word counts to values between 0 and 1
    return word_frequencies
book['SignificantWordFrequencies'] = book['SignificantWords'].apply(wordFrequencyCounter)
book['SignificantWordFrequencies'][1] # print the word frequencies for chapter 1
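Here is the normalization step on its own, as a small sketch with a made-up word list. Dividing every count by the largest count means the most common word gets a weight of `1` and every other word gets a smaller fraction.

```python
from collections import Counter

# Made-up list of significant words (illustration only)
words = ['light', 'green', 'light', 'dock', 'light', 'green']

counts = Counter(words).most_common()  # (word, count) pairs, highest count first
max_count = counts[0][1]  # the count of the most common word
normalized = [(word, count / max_count) for word, count in counts]
print(normalized)
```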

#### Creating the Summaries

First we will weight the sentences by the frequencies of the significant words, then we'll record the five percent of the sentences that have the highest scores. Again, this might take a minute.

In [None]:
from heapq import nlargest
summaries = []
for i in range(len(book)):
    sentences = book['Sentences'][i]
    number_to_output = int(len(sentences)*0.05) # 5% of the sentences
    word_frequencies = book['SignificantWordFrequencies'][i]
    sentence_scores = {}
    for sentence in sentences:
        sentence_score = 0
        for word in sentence:
            for word_frequency in word_frequencies:
                if word.pos_ != 'PROPN': # if it is not a proper noun
                    if word.text == word_frequency[0]:
                        sentence_score += word_frequency[1]
        sentence_scores[sentence] = sentence_score
    summary = nlargest(number_to_output, sentence_scores, key=sentence_scores.get)
    summaries.append(summary)
book['Summary'] = summaries
book['Summary'][1] # print the summary for chapter 1

Now we can see a basic chapter summary using the syntax `book['Summary'][x]` where `x` is the chapter number.

In [None]:
book['Summary'][2]
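To make the scoring idea concrete, here is a simplified sketch with made-up sentences and word weights. It splits sentences on spaces instead of using spaCy tokens and does not skip proper nouns, so it illustrates the approach rather than reproducing the cell above. The names `weights` and `score` are hypothetical.

```python
from heapq import nlargest

# Made-up normalized word weights and sentences (illustration only)
weights = {'light': 1.0, 'green': 0.5, 'dock': 0.25}
sentences = [
    'He stretched out his arms',
    'A green light shone',
    'The green light on the dock',
    'It was late',
]

def score(sentence):
    """Add up the weights of the words in a sentence; unknown words count as 0."""
    return sum(weights.get(word.lower(), 0) for word in sentence.split())

top = nlargest(2, sentences, key=score)  # the two highest-scoring sentences
print(top)
```

Note that `int(len(sentences)*0.05)` rounds down, so any section with fewer than 20 sentences ends up with an empty summary.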

### Parts of Speech

Next we'll identify the [parts of speech](https://spacy.io/api/annotation#pos-tagging) in the text to see how many verbs, adjectives, nouns, and proper nouns there are. We'll also calculate word counts and proportions so we can make some graphs.

In [None]:
book['WordCount'] = book['NLP'].apply(len)
parts_of_speech = {'Verbs':'VERB', 'Adjectives':'ADJ', 'Nouns':'NOUN', 'ProperNouns':'PROPN'}
for part in parts_of_speech:
    book[part] = book['NLP'].apply(lambda x: [token for token in x if token.pos_ == parts_of_speech[part]])
    book[part+'Count'] = book[part].apply(len)
    book[part+'%'] = book[part+'Count']/book['WordCount']*100

px.line(book, x=book.index, y=['Verbs%', 'Adjectives%', 'Nouns%', 'ProperNouns%'], title='Parts of Speech in <i>The Great Gatsby</i>').update_xaxes(title='Chapter').update_yaxes(title='Percentage')

#### Most Common Proper Nouns

The spaCy library does a fairly good job of identifying names, but you may see some false positives (words that aren't actually character names). We can look at the character names and how often they occur.

In [None]:
def nameCounter(row):
    return Counter([token.lemma_.strip() for token in row])
book['Names'] = book['ProperNouns'].apply(nameCounter)
name_counter = Counter()
for row in book['Names']:
    name_counter.update(row)
names_df = pd.DataFrame.from_dict(name_counter, orient='index', columns=['Count'])
names_df.sort_values('Count', ascending=False, inplace=True)
names_df.head(25)

To remove nouns that may be categorized incorrectly, or are uncommon proper nouns in the story, let's try only keeping rows with more than `20` occurrences. We can also remove `New`, `York`, `West`, and `East`.

Then we'll create a bar chart of the name frequencies.

In [None]:
names_df = names_df[names_df['Count'] > 20]
names_df = names_df.drop(['New','York','West','East'])
px.bar(names_df, title='Name Frequencies in <i>The Great Gatsby</i>').update_yaxes(title='Frequency').update_xaxes(title='Name').update(layout_showlegend=False)
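The counting and filtering above can be sketched with plain Python counters. The per-chapter counts here are made up, and the threshold of `3` is chosen only to suit this small example.

```python
from collections import Counter

# Made-up per-chapter name counts (illustration only)
chapter_counts = [
    Counter({'Gatsby': 3, 'Daisy': 1}),
    Counter({'Gatsby': 2, 'Tom': 4}),
]

total = Counter()
for counts in chapter_counts:
    total.update(counts)  # update() adds to the existing counts rather than replacing them

frequent = {name: n for name, n in total.items() if n > 3}  # keep only the frequent names
print(total.most_common())
print(frequent)
```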

#### Name Frequencies over Time

Since we have the text divided into chapters, let's visualize the mentions of the top `5` character names throughout the book.

In [None]:
def nameCounter(name):
    name_counts = []
    for row in book['ProperNouns']:
        n = 0
        for token in row:
            if token.text == name:
                n += 1
        name_counts.append(n)
    return name_counts

main_characters_list = names_df.head(5).index.to_list()
main_characters = pd.DataFrame()
for character in main_characters_list:
    main_characters[character] = nameCounter(character)
    main_characters[character+'Total'] = main_characters[character].cumsum()
main_characters['Chapter'] = main_characters.index+1
px.line(main_characters, y=main_characters_list, title='Main Character Frequencies in <i>The Great Gatsby</i>').update_yaxes(title='Frequency').update_xaxes(title='Chapter').show()
px.line(main_characters, y=[character+'Total' for character in main_characters_list], title='Cumulative Main Character Frequencies in <i>The Great Gatsby</i>').update_yaxes(title='Cumulative Frequency').update_xaxes(title='Chapter').show()

We could, of course, follow a similar process with other parts of speech such as verbs or adjectives. Or we could track the uses of particular words that have special symbolism, such as `green` or `yellow`.

In [None]:
colours = ['green','yellow','white','gold']
for colour in colours:
    book[colour] = book['NLP'].apply(lambda x: [token for token in x if token.text==colour])
    book[colour+'Count'] = book[colour].apply(len)
    book[colour+'Total'] = book[colour+'Count'].cumsum()
plotList = [colour+'Count' for colour in colours]
px.line(book, y=plotList, title='Colours in <i>The Great Gatsby</i>', color_discrete_sequence=colours, template='plotly_dark').update_xaxes(title='Chapter').update_yaxes(title='Mentions').show()
cumulativePlotList = [colour+'Total' for colour in colours]
px.line(book, y=cumulativePlotList, title='Cumulative Colours in <i>The Great Gatsby</i>', color_discrete_sequence=colours, template='plotly_dark').update_xaxes(title='Chapter').update_yaxes(title='Cumulative Mentions').show()
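The per-chapter counts and running totals can be sketched without pandas. The word lists below are made up, and `accumulate` from the standard library plays the role that `cumsum()` plays in the cell above.

```python
from itertools import accumulate

# Made-up chapters as lists of words (illustration only)
chapters = [
    ['a', 'green', 'light'],
    ['yellow', 'car', 'and', 'yellow', 'dress'],
    ['green', 'leaves', 'yellow', 'sun'],
]
colours = ['green', 'yellow']

counts = {}
totals = {}
for colour in colours:
    counts[colour] = [chapter.count(colour) for chapter in chapters]  # mentions in each chapter
    totals[colour] = list(accumulate(counts[colour]))  # running total across chapters
print(counts)
print(totals)
```

Like the `token.text==colour` comparison in the notebook, this match is case-sensitive, so a capitalized `Green` at the start of a sentence would not be counted.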

## Sentiment Analysis

[Sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) is the process of categorizing text based on its tone (negative, neutral, or positive). For this we will use the VADER sentiment analyzer from the [vaderSentiment](https://github.com/cjhutto/vaderSentiment) project, as packaged in the NLTK library, then visualize the positive and negative sentiment by chapter.

Running the cell below will take a minute or two.

In [None]:
# uncomment the following line if you encounter an error
#!pip install nltk
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
book['Sentiment'] = book['Text'].apply(lambda x: analyzer.polarity_scores(x))
book['PositiveSentiment'] = book['Sentiment'].str['pos']
book['NegativeSentiment'] = -book['Sentiment'].str['neg']
px.bar(book, y=['PositiveSentiment','NegativeSentiment'], title='Sentiment Analysis by Chapter in <i>The Great Gatsby</i>').update_yaxes(title='Sentiment').update_xaxes(title='Chapter').show()

### Sentiment Analysis by Colours

Now that we can analyse the sentiment by chapter, let's compare sentiments associated with different colours. We'll look through all of the sentences that contain certain colour words and add up their polarity scores.

In [None]:
colours = ['green','yellow','white','gold']
colours_sentiments = {}
import numpy as np
for i in range(len(book)):
    for sentence in book['Sentences'].iloc[i]:
        for token in sentence:
            if token.text in colours:
                if token.text not in colours_sentiments:
                    colours_sentiments[token.text] = []
                colours_sentiments[token.text].append(analyzer.polarity_scores(str(sentence))['compound']) # append the compound sentiment score

for colour in colours:
    try:
        colours_sentiments[colour] = np.mean(colours_sentiments[colour]) # average the sentiments
    except KeyError: # if there are no mentions of that colour
        pass
px.bar(x=colours_sentiments.keys(), y=colours_sentiments.values(), title='Sentiment Analysis by Colour in <i>The Great Gatsby</i>').update_yaxes(title='Average Sentiment').update_xaxes(title='Colour').show()
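The grouping and averaging step can be sketched with the standard library alone. The scores below are made up rather than produced by the analyzer, and `defaultdict` is used here as one way to avoid checking whether a colour has been seen before.

```python
from collections import defaultdict
from statistics import mean

# Made-up (colour, compound score) pairs, one per matching sentence (illustration only)
scored = [('green', 0.5), ('yellow', -0.25), ('green', -0.25), ('yellow', -0.75)]

by_colour = defaultdict(list)  # a missing colour starts with an empty list
for colour, compound in scored:
    by_colour[colour].append(compound)

averages = {colour: mean(scores) for colour, scores in by_colour.items()}
print(averages)
```

Colours that never appear are simply absent from `averages`, so no `try` and `except` is needed in this version.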

To delve even deeper and look at colours in individual sentences, we'll create a scatterplot where you can hover over the individual points to see the sentences.

In [None]:
colours = ['green','yellow','white','gold']
colours_sentiment_list = []
import numpy as np
for i in range(len(book)):
    for sentence in book['Sentences'].iloc[i]:
        for token in sentence:
            if token.text in colours:
                sentiment = analyzer.polarity_scores(str(sentence))['compound']
                colours_sentiment_list.append([token.text, sentiment, str(sentence)])
colours_sentiment_df = pd.DataFrame(colours_sentiment_list, columns=['Colour','Sentiment','Sentence'])
fig = px.scatter(colours_sentiment_df, x='Colour', y='Sentiment', hover_data=['Sentence'], title='Sentiment Analysis by Colour in <i>The Great Gatsby</i>')
fig.update_yaxes(title='Sentiment').update_xaxes(title='Colour').add_hline(y=0)
fig.show()
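When reading the scatterplot, it can help to turn a compound score into a label. The sketch below uses thresholds of `0.05` and `-0.05`, which the vaderSentiment documentation suggests as typical values; the helper name `label_from_compound` is hypothetical, and other cut-offs may suit your text better.

```python
def label_from_compound(compound, threshold=0.05):
    """Turn a compound score (between -1 and 1) into a simple label."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# Made-up compound scores (illustration only)
labels = [label_from_compound(score) for score in [0.6, -0.3, 0.01]]
print(labels)
```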


## Conclusion

Hopefully this was an interesting introduction to text statistics such as chapter length, word frequency and word type frequency, and sentiment analysis. You can, of course, use this approach to analyse the text of any other online document.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)