![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Finteresting-problems&branch=main&subPath=notebooks/analysing-text-statistics.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# Analysing Text Statistics

Let's try out some statistical analysis of text, including [natural language processing](https://en.wikipedia.org/wiki/Natural_language_processing), using a [public domain](https://en.wikipedia.org/wiki/Public_domain) book from [Project Gutenberg](http://www.gutenberg.org).

The example we'll use is the [most downloaded](http://www.gutenberg.org/browse/scores/top) book, [Pride and Prejudice by Jane Austen](http://www.gutenberg.org/ebooks/1342). Running this first code cell will import and display the contents of the book.

Feel free to change the link in the following code cell if you'd like to explore another book, but make sure you are using the `Plain Text UTF-8` link.

In [None]:
gutenberg_text_link = 'https://www.gutenberg.org/cache/epub/42671/pg42671.txt'

import requests
r = requests.get(gutenberg_text_link) # get the online book file
r.encoding = 'utf-8' # specify the type of text encoding in the file
text = r.text.split('***')[2] # get the part after the header
text = text.replace("’","'").replace("“",'"').replace("”",'"') # replace any 'smart quotes'
book_title = r.text[r.text.index('Title:')+7:r.text.index('Author:')-4] # find the book title
print(text)
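
To see what the `split('***')[2]` step and the title slicing are doing, here is a minimal sketch on a made-up miniature file. The sample text, title, and author are invented for illustration, and the sketch assumes the file has exactly one `*** START ... ***` line and one `*** END ... ***` line, which is what the cell above relies on.

```python
# A made-up miniature of a Project Gutenberg plain-text file (not real book text)
sample = (
    "Title: Example Book\r\n\r\nAuthor: A. Writer\r\n\r\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE BOOK ***\r\n"
    "CHAPTER I.\r\nIt was a quiet morning.\r\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE BOOK ***\r\n"
    "License text follows here."
)

parts = sample.split('***')
# parts[0] is the header, parts[1] the START marker text, parts[2] the book body,
# parts[3] the END marker text, and parts[4] the license
body = parts[2]

# same slicing as the cell above: skip 'Title: ' and stop before the blank line
title = sample[sample.index('Title:')+7:sample.index('Author:')-4]

print(title)
print(body.strip())
```

If a file contains additional `***` sequences, index `2` may no longer point at the body, so it is worth printing the result to check.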

## Making a DataFrame of Chapters

Now that we have the text of the book, let's split it into chapters. We'll use the Python [library](https://en.wikipedia.org/wiki/Library_(computing)) called [pandas](https://pandas.pydata.org) to create a [dataframe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) that includes the text and length of each chapter (number of characters, including spaces and punctuation).

In [None]:
import pandas as pd
ch = [] # create an empty list that we will append to
for chapter in text.split('CHAPTER'):
    if len(chapter)>500: # so that we are getting actual book chapters
        chapter = chapter.replace('\r','').replace('\n',' ') # replace the 'new line' characters with spaces
        ch.append(chapter) # append the chapter text to the list
book = pd.DataFrame(ch, columns=['Chapter Text']) # create a dataframe from the list
book
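
As a small illustration of the splitting step, the sketch below applies the same idea to an invented string. The sample text and the `min_length` name are assumptions for this example only; the `500` threshold is the one used in the cell above.

```python
# Invented sample: a short front-matter note followed by two long "chapters"
sample_text = "Transcriber's note. CHAPTER I. " + "a" * 600 + " CHAPTER II. " + "b" * 600

min_length = 500  # pieces shorter than this are treated as headings or stray fragments
pieces = sample_text.split('CHAPTER')
chapters = [piece for piece in pieces if len(piece) > min_length]

print(len(pieces), 'pieces,', len(chapters), 'kept')
```

In this sample the front matter is short, so it is filtered out. In a real book the front matter is often longer than `500` characters, in which case it is kept as the first entry.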

Chapter 0 is the introduction and transcriber's note, so we'll eliminate that, and create a column called `Chapter`.

In [None]:
book = book.drop(0).reset_index().rename(columns={'index':'Chapter'})
book

Notice that the `Chapter Text` begins with the chapter number as a Roman numeral. Let's remove it by splitting the text at the first `.` and stripping off extra spaces.

In [None]:
book['Chapter Text'] = book['Chapter Text'].apply(lambda x: x.split('.', 1)[1].strip())
book
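
The second argument to `split` matters here: it limits the split to the first `.` only, so later periods in the chapter are left alone. A minimal sketch on an invented sentence:

```python
# Invented chapter text that starts with a Roman numeral, like the rows above
chapter_text = " XII. The morning was fine. Mr. Smith walked out."

# split at the first '.' only, keep everything after it, then trim spaces
cleaned = chapter_text.split('.', 1)[1].strip()

print(cleaned)
```

Without the `1`, the text would be cut at every period and `[1]` would return only the first sentence.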

## Visualizing Chapter Lengths

From that dataframe we can calculate chapter length (number of characters) and create a bar graph.

In [None]:
import plotly.express as px
book['Length'] = book['Chapter Text'].apply(len) # add a column with the length of each chapter
px.bar(book, x='Chapter', y='Length', title='Characters per Chapter in '+book_title).update_xaxes(title='Chapter')

## Natural Language Processing

We'll use the [spaCy](https://spacy.io) natural language processing library to identify the [parts of speech](https://spacy.io/api/annotation#pos-tagging) in the text. We'll create a new column in our dataframe containing the processed text.

This will take a while to run.

In [None]:
# uncomment the following two installation lines if you get errors
#!pip install spacy --user
#!python -m spacy download en_core_web_sm

import en_core_web_sm
nlp = en_core_web_sm.load() # set up natural language processing

def processLanguage(chapter):
    parts_of_speech_list = [] # create an empty list
    for token in nlp(chapter):
        if token.is_alpha: # if the token is a word
            parts_of_speech_list.append(token)
    return parts_of_speech_list

book['NLP'] = book['Chapter Text'].apply(processLanguage)
book

Now that we've applied some natural language processing to identify the words in each chapter, let's see how many verbs, adjectives, nouns, and proper nouns there are. We'll also calculate word counts and proportions.

In [None]:
book['Word Count'] = book['NLP'].apply(len)
book['Verbs'] = book['NLP'].apply(lambda x: [token for token in x if token.pos_ == 'VERB'])
book['Verb Count'] = book['Verbs'].apply(len)
book['Verb %'] = book['Verb Count']/book['Word Count']*100
book['Adjectives'] = book['NLP'].apply(lambda x: [token for token in x if token.pos_ == 'ADJ'])
book['Adjective Count'] = book['Adjectives'].apply(len)
book['Adjective %'] = book['Adjective Count']/book['Word Count']*100
book['Nouns'] = book['NLP'].apply(lambda x: [token for token in x if token.pos_ == 'NOUN'])
book['Noun Count'] = book['Nouns'].apply(len)
book['Noun %'] = book['Noun Count']/book['Word Count']*100
book['Proper Nouns'] = book['NLP'].apply(lambda x: [token for token in x if token.pos_ == 'PROPN'])
book['Proper Noun Count'] = book['Proper Nouns'].apply(len)
book['Proper Noun %'] = book['Proper Noun Count']/book['Word Count']*100
book
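
The percentage columns are simply the count of each tag divided by the word count. The sketch below shows that arithmetic on a hand-tagged sentence; the words and tags are assigned by hand for illustration and are not spaCy output.

```python
from collections import Counter

# Hand-tagged example (an assumption for illustration, not produced by spaCy)
tagged = [
    ('Anna', 'PROPN'), ('wrote', 'VERB'), ('a', 'DET'), ('long', 'ADJ'),
    ('letter', 'NOUN'), ('and', 'CCONJ'), ('sent', 'VERB'), ('it', 'PRON'),
]

word_count = len(tagged)
tag_counts = Counter(tag for word, tag in tagged)

# percentage of words carrying each tag of interest
percentages = {tag: tag_counts[tag] / word_count * 100
               for tag in ['VERB', 'ADJ', 'NOUN', 'PROPN']}

print(percentages)
```

Because the four percentages cover only some of the tags, they will generally not add up to `100`.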

We can now plot the proportional occurrences of those types of words.

In [None]:
px.line(book, x='Chapter', y=['Verb %','Adjective %','Noun %','Proper Noun %'], title='Proportion of Word Types in '+book_title).update_xaxes(title='Chapter')

## Most Common Verbs

To get an idea of the most common words in the text we can look at a part of speech, verbs for example, and count which are the most frequent. We'll display the `10` most common verbs.

In [None]:
from collections import Counter
def countLemma(row):
    return Counter([token.lemma_.strip().lower() for token in row])
book['Verb Lemmas'] = book['Verbs'].apply(countLemma)
verb_counter = Counter()
for row in book['Verb Lemmas']:
    verb_counter.update(row)
verb_counter.most_common(10)
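
The loop above works because `Counter.update` adds counts rather than replacing them. A minimal sketch with two invented per-chapter counters:

```python
from collections import Counter

# Invented lemma counts for two chapters
chapter_counts = [
    Counter({'say': 3, 'go': 1}),
    Counter({'say': 2, 'see': 4}),
]

total = Counter()
for counts in chapter_counts:
    total.update(counts)  # adds each chapter's counts to the running total

print(total.most_common(2))
```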

## Most Common Names

The spaCy library does a fairly good job of identifying names, but you may see some false positives (words that aren't actually character names). We can look at the character names and how often they occur.

In [None]:
def nameCounter(row):
    return Counter([token.lemma_.strip() for token in row])
book['Names'] = book['Proper Nouns'].apply(nameCounter)
name_counter = Counter()
for row in book['Names']:
    name_counter.update(row)
names_df = pd.DataFrame.from_dict(name_counter, orient='index', columns=['Count'])
names_df.sort_values('Count', ascending=False, inplace=True)
names_df

### Cleaning Data

If you'd like to remove proper nouns that may be categorized incorrectly or are uncommon, we can drop rows with fewer than `15` occurrences.

In [None]:
names_df = names_df[names_df['Count'] > 14]
names_df

### Visualization of Name Frequencies

Let's make a bar graph of the top `20` most frequently mentioned characters.

In [None]:
px.bar(names_df.head(20), title='Name Frequencies in '+book_title).update_yaxes(title='Frequency').update_xaxes(title='Name').update(layout_showlegend=False)

### Cumulative Name Frequencies over Time

Since we have the text divided into chapters, let's visualize the mentions of the top `5` character names throughout the book.

In [None]:
def countNameByChapter(name): # a new name, so the earlier nameCounter function is not overwritten
    name_counts = [] # one count per chapter
    for row in book['Proper Nouns']:
        n = 0
        for token in row:
            if token.lemma_.strip() == name: # compare lemmas, matching how names_df was built
                n += 1
        name_counts.append(n)
    return name_counts

main_characters_list = names_df.head(5).index.to_list()
main_characters = pd.DataFrame()
for character in main_characters_list:
    main_characters[character] = countNameByChapter(character)
    main_characters[character+' Total'] = main_characters[character].cumsum()
main_characters['Chapter'] = main_characters.index+1
px.line(main_characters, x='Chapter', y=main_characters_list, title='Main Character Frequencies in '+book_title).update_yaxes(title='Frequency')

In [None]:
y = [character+' Total' for character in main_characters_list]
px.line(main_characters, x='Chapter', y=y, title='Cumulative Main Character Frequencies in '+book_title).update_yaxes(title='Frequency')
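
The ` Total` columns come from a cumulative sum: each value is the number of mentions up to and including that chapter. The same idea in plain Python, using invented per-chapter counts:

```python
from itertools import accumulate

# Invented mentions of one character in four consecutive chapters
per_chapter = [3, 0, 5, 2]

# running total, the plain-Python counterpart of the pandas cumsum() used above
running_total = list(accumulate(per_chapter))

print(running_total)
```

A flat stretch in the cumulative line therefore marks chapters where the character is not mentioned.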

## Sentiment Analysis

[Sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) is categorizing text based on its tone (negative, neutral, or positive).

For this we will use the [vaderSentiment](https://github.com/cjhutto/vaderSentiment) library, then visualize the positive and negative sentiment by chapter.

This will take a minute or two.

In [None]:
#!pip install vaderSentiment --user
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
book['Sentiment'] = book['Chapter Text'].apply(lambda x: analyzer.polarity_scores(x))
book['Positive Sentiment'] = book['Sentiment'].str['pos']
book['Negative Sentiment'] = -book['Sentiment'].str['neg']
px.bar(book, x='Chapter',  y=['Positive Sentiment', 'Negative Sentiment'], title=book_title+' Sentiment Analysis by Chapter')
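
For each chapter, `polarity_scores` returns a dictionary with the keys `neg`, `neu`, `pos`, and `compound`. The sketch below shows how the two plotted columns are derived from such dictionaries. The numbers are made up for illustration and are not real vaderSentiment output.

```python
# Made-up score dictionaries in the shape returned by polarity_scores
scores = [
    {'neg': 0.10, 'neu': 0.70, 'pos': 0.20, 'compound': 0.95},
    {'neg': 0.25, 'neu': 0.65, 'pos': 0.10, 'compound': -0.80},
]

positive = [s['pos'] for s in scores]
negative = [-s['neg'] for s in scores]  # negated so the bars extend below zero

print(positive)
print(negative)
```

Negating the `neg` values is purely a plotting choice: it lets positive and negative sentiment share one chart without overlapping.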

## Readability

One last library to introduce is [textstat](https://github.com/shivam5992/textstat), which checks the readability, complexity, and grade level of text. It includes a [number of functions](https://github.com/shivam5992/textstat#list-of-functions), but we'll only use a few of them.

In [None]:
#!pip install --user textstat
import textstat
book['Flesch-Kincaid Grade'] = book['Chapter Text'].apply(lambda x:textstat.flesch_kincaid_grade(x))
book['Gunning Fog Index'] = book['Chapter Text'].apply(lambda x:textstat.gunning_fog(x))
book['Linsear Write Formula'] = book['Chapter Text'].apply(lambda x:textstat.linsear_write_formula(x))
book['Readability'] = book['Chapter Text'].apply(lambda x:textstat.text_standard(x, float_output=True))
px.line(book, x='Chapter', y=['Flesch-Kincaid Grade','Gunning Fog Index','Linsear Write Formula','Readability'], title='Readability Scores for '+book_title).update_xaxes(title='Chapter')
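
To give a sense of what two of these scores measure, here is a sketch of the standard published Flesch-Kincaid grade and Gunning fog formulas, taking word, sentence, and syllable counts as inputs. The function names and the counts are assumptions for this example; textstat uses its own rules for counting syllables and complex words, so its results may differ.

```python
def flesch_kincaid_grade(words, sentences, syllables):
    # longer sentences and more syllables per word raise the grade level
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def gunning_fog(words, sentences, complex_words):
    # complex_words: words of three or more syllables
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))

# Made-up counts for a short passage
fk = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
fog = gunning_fog(words=100, sentences=5, complex_words=10)

print(round(fk, 2), round(fog, 2))
```

Both formulas estimate a school grade level, which is why the lines in the graph above tend to move together.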

## Conclusion

Hopefully this was an interesting introduction to text statistics such as chapter length, word frequency and word type frequency, sentiment analysis, and reading levels. You can, of course, use this approach to analyse the text of any other online document in a similar way.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)