![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Finteresting-problems&branch=main&subPath=notebooks/analysing-text-statistics.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# Analysing Text Statistics

Let's try out some statistical analysis of text, including [natural language processing](https://en.wikipedia.org/wiki/Natural_language_processing), using a [public domain](https://en.wikipedia.org/wiki/Public_domain) book from [Project Gutenberg](http://www.gutenberg.org).

The example we'll use is the [most downloaded](http://www.gutenberg.org/browse/scores/top) book, [Pride and Prejudice by Jane Austen](http://www.gutenberg.org/ebooks/1342). Running this first code cell will import and display the contents of the book.

Feel free to change the link in the following code cell if you'd like to explore another book, but make sure you are using the `Plain Text UTF-8` link.

In [None]:
gutenberg_text_link = 'https://www.gutenberg.org/cache/epub/42671/pg42671.txt'

import requests
r = requests.get(gutenberg_text_link) # get the online book file
r.encoding = 'utf-8' # specify the type of text encoding in the file
text = r.text.split('***')[2] # get the part after the header
text = text.replace("’","'").replace("“",'"').replace("”",'"') # replace any 'smart quotes'
book_title = r.text[r.text.index('Title:')+7:r.text.index('Author:')-4] # find the book title
print(text)
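
To see why `split('***')[2]` picks out the body of the book, here is a minimal sketch on a made-up miniature file. It assumes the file follows the usual Project Gutenberg layout, where the book sits between a `*** START OF ... ***` line and a `*** END OF ... ***` line. The sample text and title are invented for illustration.

```python
# A made-up miniature of a Project Gutenberg plain-text file (not the real book)
sample = (
    "Title: A Tiny Book\r\n\r\nAuthor: Nobody\r\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK A TINY BOOK ***"
    "\r\nCHAPTER I\r\nIt was a dark night.\r\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK A TINY BOOK ***"
    "\r\nLicense text goes here."
)

parts = sample.split('***') # header, start marker, body, end marker, footer
body = parts[2] # the part between the two marker lines
print(len(parts))
print(body.strip())

# the same slicing idea used above to find the title
title = sample[sample.index('Title:')+7:sample.index('Author:')-4]
print(title)
```

If a file has extra `***` sequences in its header, the body will not be at index `2`, so it is worth printing the first few hundred characters to check.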

## Making a DataFrame of Chapters

Now that we have the text of the book, let's split it into chapters. We'll use the Python [library](https://en.wikipedia.org/wiki/Library_(computing)) called [pandas](https://pandas.pydata.org) to create a [dataframe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) that includes the text and length of each chapter (number of characters, including spaces and punctuation).

In [None]:
import pandas as pd
ch = [] # create an empty list that we will append to
for chapter in text.split('CHAPTER'):
    if len(chapter)>500: # so that we are getting actual book chapters
        chapter = chapter.replace('\r','').replace('\n',' ') # replace the 'new line' characters with spaces
        ch.append(chapter) # append the chapter text to the list
book = pd.DataFrame(ch, columns=['Chapter Text']) # create a dataframe from the list
book['Length'] = book['Chapter Text'].apply(len) # add a column with the length of each chapter
book
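
As a rough illustration of the splitting step, the sketch below applies the same idea to a tiny invented string using plain Python lists. The names `sample_text`, `min_length`, and `sample_chapters` are hypothetical, and the threshold is lowered from `500` so that it suits the short sample.

```python
# invented sample text with Windows-style line endings, like the downloaded file
sample_text = "Front matter\r\nCHAPTER I\r\nFirst chapter text.\r\nCHAPTER II\r\nSecond, longer chapter text."
min_length = 15 # hypothetical threshold for this tiny sample; the cell above uses 500

sample_chapters = []
for piece in sample_text.split('CHAPTER'):
    if len(piece) > min_length: # skip fragments that are too short to be chapters
        piece = piece.replace('\r', '').replace('\n', ' ') # replace line breaks with spaces
        sample_chapters.append(piece)

for number, chapter in enumerate(sample_chapters):
    print(number, len(chapter), repr(chapter))
```

Note that the word `CHAPTER` itself is removed by `split`, so each piece starts with the chapter number.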

Chapter 0 is the introduction and transcriber's note, so we'll eliminate that, and create a column called `Chapter`.

In [None]:
book = book.drop(0).reset_index().rename(columns={'index':'Chapter'})
book

## Visualizing Chapter Lengths

From that dataframe we can create a bar graph of chapter lengths.

In [None]:
import plotly.express as px
px.bar(book, x='Chapter', y='Length', title='Characters per Chapter in '+book_title).update_xaxes(title='Chapter')

## Counting Words by Type

We'll use the [spaCy](https://spacy.io) natural language processing library to identify the [parts of speech](https://spacy.io/api/annotation#pos-tagging) in the text. For this example we'll just look at adjectives, verbs, nouns, and proper nouns, but you can add to the list on the first line in the code cell.

This will take a while to run, and will result in a dataframe containing the total word count and the number of each of those parts of speech in each chapter. The word count here is the number of spaCy tokens, which includes punctuation marks.

In [None]:
# uncomment the following two installation lines if you get errors
#!pip install spacy --user
#!python -m spacy download en_core_web_sm

import en_core_web_sm
nlp = en_core_web_sm.load() # set up natural language processing

def getPartsOfSpeech(chapter, word_types):
    parts_of_speech_list = [] # create an empty list
    word_count = 0
    for token in nlp(chapter): # loop through each token in the chapter
        word_count += 1 # increment the word counter
        part_of_speech = token.pos_
        if part_of_speech in word_types: # if it is in the list of types we made
            parts_of_speech_list.append(part_of_speech)
    word_type_count = {} # create an empty dictionary
    word_type_count['Word Count'] = word_count
    for word_type in word_types:
        word_type_count[word_type] = parts_of_speech_list.count(word_type) # add to the dictionary
    return word_type_count

word_types = ['ADJ', 'VERB', 'NOUN', 'PROPN'] # https://spacy.io/api/annotation#pos-tagging
#getPartsOfSpeech(book['Chapter Text'][0], word_types) # uncomment to test on the first chapter
book[['Word Count'] + word_types] = pd.DataFrame(book['Chapter Text'].apply(getPartsOfSpeech, args=[word_types]).tolist())
book
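
The counting logic in `getPartsOfSpeech` can be tried without loading spaCy. This sketch uses hand-tagged `(word, tag)` pairs as a stand-in for spaCy tokens. The tags are assigned by hand for illustration and are not spaCy output. It also uses a `Counter` to tally the tags in a single pass, which is an alternative to calling `list.count` once per type.

```python
from collections import Counter

# hand-tagged (word, tag) pairs standing in for spaCy tokens (an assumption, not real output)
tagged_tokens = [('Elizabeth', 'PROPN'), ('walked', 'VERB'), ('through', 'ADP'),
                 ('the', 'DET'), ('muddy', 'ADJ'), ('fields', 'NOUN'), ('.', 'PUNCT')]
word_types = ['ADJ', 'VERB', 'NOUN', 'PROPN']

tag_counts = Counter(tag for _, tag in tagged_tokens) # tally every tag in one pass
summary = {'Word Count': len(tagged_tokens)} # every token counts, including punctuation
for word_type in word_types:
    summary[word_type] = tag_counts[word_type] # a Counter returns 0 for missing tags
print(summary)
```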

Let's calculate the proportional occurrences of those parts of speech.

In [None]:
book['Adjectives %'] = book['ADJ']/book['Word Count']*100
book['Verbs %'] = book['VERB']/book['Word Count']*100
book['Nouns %'] = book['NOUN']/book['Word Count']*100
book['Proper Nouns %'] = book['PROPN']/book['Word Count']*100
book

We can also plot the proportional occurrences.

In [None]:
px.line(book, x='Chapter', y=['Adjectives %', 'Verbs %', 'Nouns %', 'Proper Nouns %'], title='Proportion of Word Types in '+book_title).update_xaxes(title='Chapter')

## Most Common Verbs

To get an idea of the most common words in the text we can look at a part of speech, verbs for example, and count which are the most frequent.

This will also take some time to run.

In [None]:
from collections import Counter
def getWords(text, word_type):
    words = []
    for token in nlp(text):
        if token.pos_ == word_type:
            words.append(token.lemma_.strip().lower())
    return Counter(words)

#verbs = getWords(book['Chapter Text'].sum(), 'VERB')
#print('There are',len(verbs),'unique verbs in the book.')
book['Verbs'] = book['Chapter Text'].apply(getWords, args=['VERB'])

verbs = Counter()
for counts in book['Verbs']:
    verbs.update(counts)
print('There are',len(verbs),'unique verbs in the book.')

In [None]:
book

In [None]:
sum(book.iloc[0]['Verbs'].values()) # count the number of verbs in the first chapter

Let's look at the `10` most common verbs.

In [None]:
verbs.most_common(10)
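
Here is a small sketch of how per-chapter `Counter` objects combine into a book-wide total and how `most_common` ranks the result. The verb counts are invented for illustration.

```python
from collections import Counter

# invented per-chapter verb counts
chapter_counts = [Counter({'be': 3, 'have': 2, 'say': 1}),
                  Counter({'be': 1, 'say': 4, 'walk': 1})]

total = Counter()
for counts in chapter_counts:
    total.update(counts) # update adds the counts together instead of replacing them

print(len(total)) # number of unique verbs
print(total.most_common(2)) # the two most frequent verbs with their counts
```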

We can also choose a verb and plot its frequency by chapter as a percent of the total number of words.

In [None]:
word = 'have'

book['%'] = book['Verbs'].apply(lambda counts: counts[word])/book['Word Count']*100 # calculate the percent of the words in each chapter
px.bar(book, x='Chapter', y='%', title='Percent Frequency of the Word "'+word+'" by Chapter in '+book_title).update_xaxes(title='Chapter').update_yaxes(title='Percent Frequency')
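
The percentage described above is the number of times the chosen verb appears in a chapter divided by that chapter's word count, times 100. A minimal sketch with invented numbers, where `chapter_verbs` and `chapter_word_counts` are hypothetical stand-ins for the dataframe columns:

```python
from collections import Counter

# invented values standing in for two chapters
chapter_verbs = [Counter({'have': 5, 'be': 10}), Counter({'be': 4})]
chapter_word_counts = [200, 100]
word = 'have'

# a Counter returns 0 when the word does not occur in a chapter
percents = [verbs[word] * 100 / total
            for verbs, total in zip(chapter_verbs, chapter_word_counts)]
print(percents)
```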

## Most Common Names

We can also look at character names and how often they occur in each chapter. The spaCy library does a fairly good job of identifying names, but you may see some false positives (words that aren't actually character names).

In [None]:
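chapter_name_counts = [] # create an empty list that we will append to
for chapter_text in book['Chapter Text']: # loop through each chapter
    names_list = []
    for token in nlp(chapter_text):
        #if token.pos_ == 'PROPN':
        if token.ent_type_ == 'PERSON': # seems to be more reliable than part_of_speech == proper_noun
            names_list.append(token.text)
    chapter_name_counts.append(Counter(names_list)) # count how often each name occurs in the chapter
names_df = pd.DataFrame(chapter_name_counts) # one row per chapter, one column per name
names_df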

### List of Character Names

We can check out the list of words identified as names.

In [None]:
for name in names_df.columns:
    print(name)

### Cleaning Data

If you'd like to remove names that may be categorized incorrectly or are uncommon, you can drop columns with fewer than five occurrences.

In [None]:
for column in names_df.columns:
    if names_df[column].sum() < 5: # if there are fewer than five occurrences
        names_df.drop(columns=column, inplace=True)
names_df
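
The same threshold idea can be sketched with a plain dictionary of totals. The names, counts, and the `min_mentions` variable are invented for illustration; a false positive such as `Well` would fall below the cutoff.

```python
# made-up totals for each word that was tagged as a person's name
name_totals = {'Elizabeth': 12, 'Darcy': 9, 'Well': 2, 'Longbourn': 4, 'Jane': 7}
min_mentions = 5 # hypothetical cutoff matching the "fewer than five" rule above

# keep only the names mentioned at least min_mentions times
kept = {name: total for name, total in name_totals.items() if total >= min_mentions}
print(kept)
```

A low cutoff removes rare mistakes but will not catch a false positive that occurs often, so it is still worth reading through the list.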

### Visualization of Name Frequencies

Let's make a bar graph of the top `20` most frequently mentioned characters.

In [None]:
d = names_df.sum().sort_values(ascending=False).head(20)
px.bar(d, title='Character Name Frequencies in '+book_title).update_yaxes(title='Frequency').update_xaxes(title='Name').update(layout_showlegend=False)

### Cumulative Name Frequencies over Time

Since we have the text divided into chapters, let's visualize the cumulative mentions of the top `3` character names throughout the book.

In [None]:
how_many_names = 3
main_character_names = names_df.sum().sort_values(ascending=False).head(how_many_names).index # list the top three
main_characters = names_df[main_character_names].fillna(value=0) # create a new dataframe for top three
main_characters.index = main_characters.index+1
title = 'Cumulative Character Mentions over Time in '+book_title
px.line(main_characters.cumsum(), title=title).update_yaxes(title='Total Mentions').update_xaxes(title='Chapter')
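
A cumulative sum is a running total: each value is the sum of all the values up to and including that chapter. This small sketch uses `itertools.accumulate` on invented per-chapter counts to show what `cumsum` does to a single column.

```python
from itertools import accumulate

mentions_per_chapter = [3, 0, 5, 2] # invented mentions of one character in four chapters
cumulative = list(accumulate(mentions_per_chapter)) # running total after each chapter
print(cumulative)
```

Because mentions are never negative, a cumulative line never goes down. A flat stretch means the character was not mentioned in those chapters.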

## Sentiment Analysis

[Sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) is the process of categorizing text based on its tone (negative, neutral, or positive).

For this we will use the [vaderSentiment](https://github.com/cjhutto/vaderSentiment) library, then visualize the positive and negative sentiment by chapter.

In [None]:
#!pip install vaderSentiment --user
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
sentiment_list = [] # create an empty list that we will append to
for chapter_text in book['Chapter Text']:
    senti = analyzer.polarity_scores(chapter_text) # analyze the sentiment of chapter
    sentiment_list.append(senti) # add to list
sentiment_df = pd.DataFrame(sentiment_list) # create a dataframe from the list
sentiment_df['neg'] = -sentiment_df['neg'] # change the sign of the negative sentiment
sentiment_df.index = sentiment_df.index+1
px.bar(sentiment_df, y=['pos', 'neg'], title=book_title+' Sentiment Analysis by Chapter').update(layout_showlegend=False)
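
To give a feel for lexicon-based scoring, here is a deliberately simplified sketch. It is not VADER's algorithm: the word lists and the `toy_sentiment` function are invented for this illustration, and VADER's real lexicon is much larger and also considers things like negation and emphasis. The sketch only shows the basic idea of reporting positive, negative, and neutral words as proportions of the text.

```python
# toy word lists invented for this illustration
positive_words = {'happy', 'delighted', 'love', 'good'}
negative_words = {'sad', 'angry', 'hate', 'bad'}

def toy_sentiment(text):
    # split on spaces, then strip punctuation and lowercase each word
    words = [word.strip('.,;:!?"\'').lower() for word in text.split()]
    words = [word for word in words if word]
    total = len(words)
    if total == 0: # avoid dividing by zero for empty text
        return {'pos': 0.0, 'neg': 0.0, 'neu': 0.0}
    pos = sum(word in positive_words for word in words)
    neg = sum(word in negative_words for word in words)
    return {'pos': pos / total, 'neg': neg / total, 'neu': (total - pos - neg) / total}

scores = toy_sentiment("She was delighted with the good news, but he was angry.")
print(scores)
```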

## Readability

One last library to introduce is [textstat](https://github.com/shivam5992/textstat), which checks the readability, complexity, and grade level of text. It includes a [number of functions](https://github.com/shivam5992/textstat#list-of-functions), but we'll only use a few of them. This will take a while to run.

In [None]:
#!pip install --user textstat
import textstat
textstats_list = [] # create an empty list that we will append to
for chapter_text in book['Chapter Text']:
    textstats_data = {'Flesch-Kincaid Grade':textstat.flesch_kincaid_grade(chapter_text),
                        'Gunning Fog Index':textstat.gunning_fog(chapter_text),
                        'Linsear Write Formula':textstat.linsear_write_formula(chapter_text),
                        'Readability':textstat.text_standard(chapter_text, float_output=True)}
    textstats_list.append(textstats_data)
textstats = pd.DataFrame(textstats_list) # create a dataframe from the list
textstats.index = textstats.index+1
textstats
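
For a sense of what goes into a grade-level score, the Flesch-Kincaid grade is `0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59`. The sketch below applies that formula with a crude syllable counter. The functions `rough_syllables` and `rough_flesch_kincaid_grade` are hypothetical helpers written for this example. Since textstat counts sentences and syllables more carefully, its results will likely differ from this approximation.

```python
import re

def rough_syllables(word):
    # crude heuristic: count groups of consecutive vowels, with a minimum of one
    return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

def rough_flesch_kincaid_grade(text):
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()] # split on end punctuation
    words = re.findall(r"[A-Za-z']+", text) # runs of letters and apostrophes
    syllables = sum(rough_syllables(word) for word in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59

sample = "The cat sat on the mat. It was happy."
grade = rough_flesch_kincaid_grade(sample)
print(round(grade, 2))
```

Short sentences made of short words give a low score (it can even be negative), while long sentences with many multi-syllable words push the grade up.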

Now that we have a dataframe of readability information, we can plot and describe the readability statistics.

In [None]:
px.line(textstats, y='Readability', title=book_title+' Readability by Chapter').update_xaxes(title='Chapter')

In [None]:
textstats.describe()

## Conclusion

Hopefully this was an interesting introduction to text statistics such as chapter length, word frequency and word type frequency, sentiment analysis, and reading levels. You can, of course, use this approach to analyse the text from any other online document in a similar way.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)