![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://colab.research.google.com/github/callysto/jupyterlite/blob/main/content/curriculum-notebooks/EnglishLanguageArts/ReadabilityStatistics/readability-statistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# Readability Statistics

This notebook analyses the top ten ebooks from [Project Gutenberg](http://www.gutenberg.org/browse/scores/top) for length and readability. To measure readability we use a Python library called [textstat](https://github.com/shivam5992/textstat), which estimates the school grade level needed to read a text.
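
To give a feel for what a grade-level estimate involves, here is a minimal sketch of one well-known readability formula, the Flesch-Kincaid grade level. It is not textstat's implementation: the helper names `count_syllables` and `flesch_kincaid_grade` are made up for this illustration, and the syllable counter is a rough approximation that simply counts groups of vowels.

```python
import re

def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels."""
    groups = re.findall(r'[aeiouy]+', word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(text):
    """Estimate a grade level using the Flesch-Kincaid formula."""
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59

# Two made-up sample texts: short simple sentences versus one dense sentence
simple = "The cat sat on the mat. The dog ran to the park."
dense = ("Notwithstanding considerable institutional opposition, "
         "the committee unanimously ratified the controversial proposal.")

print(round(flesch_kincaid_grade(simple), 1))
print(round(flesch_kincaid_grade(dense), 1))
```

Longer sentences and longer words both push the estimated grade level up, which is why the dense sample scores higher than the simple one.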

To start, `►Run` the following code cell to import the libraries and define a function for getting ebooks from Project Gutenberg.

In [None]:
%pip install plotly textstat nbformat requests bs4 setuptools pandas

import textstat
import requests
from bs4 import BeautifulSoup
import pandas as pd
import plotly.express as px

def get_readability(gutenberg_number):
    gutenberg_url = 'https://www.gutenberg.org/ebooks/'+str(gutenberg_number)+'.txt.utf-8'
    r = requests.get(gutenberg_url)
    r.encoding = 'utf-8' # specify the type of text encoding in the file
    try:
        raw_text = r.text.split('***')[2] # get the part after the header (up to the next ***)
    except IndexError:
        raw_text = r.text # failsafe if there isn't a header with ***
    text = raw_text.replace("’","'").replace("“",'"').replace("”",'"') # replace any 'smart quotes'
    text = text.replace('\r', ' ').replace('\n', ' ') # replace "new line" characters with spaces
    sentences = textstat.sentence_count(text)
    readability = textstat.text_standard(text, float_output=True)
    return text, sentences, readability

print('Code libraries imported and get_readability function defined.')
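
The header handling in `get_readability` can be tried without downloading anything. The sketch below uses a made-up, miniature stand-in for a Project Gutenberg text file (the marker lines are an assumption about the typical layout) to show what `split('***')` produces and why index `2` holds the book itself.

```python
# A made-up, miniature stand-in for a Project Gutenberg text file
sample = (
    "The Project Gutenberg eBook of An Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE ***\n"
    "It was a dark and stormy night.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE ***\n"
    "License text follows here."
)

parts = sample.split('***')
# parts[0] is the header, parts[1] the START marker text,
# parts[2] the book itself, parts[3] the END marker text,
# and parts[4] whatever follows the END marker
body = parts[2].strip()
print(len(parts))
print(body)

# Without any *** markers there is only one part,
# so asking for index 2 would raise an IndexError
plain = "No markers here."
print(len(plain.split('***')))
```

This is why the function falls back to the whole text when the split does not yield enough parts.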

### Testing

Next, `►Run` the following code cell to download and measure the readability of War and Peace by Leo Tolstoy from http://www.gutenberg.org/ebooks/2600.

You can also run this with any other ebook number from Project Gutenberg.

In [None]:
gutenberg_number = 2600

text, sentences, readability = get_readability(gutenberg_number)
print('The readability of "War and Peace" is', readability)
print('It contains',sentences,'sentences and',
      len(text),'characters (letters, numbers, punctuation, etc.)')

## Visualizing eBooks

### Readability Versus Length

`►Run` the next code cell to download and visualize statistics about the ten ebooks most frequently downloaded yesterday.

In [None]:
r = requests.get('http://www.gutenberg.org/browse/scores/top') # get the page of most downloaded books
soup = BeautifulSoup(r.text, 'html.parser') # read the html
# find the links to the top ten ebooks
links = soup.find_all('a')
ebooks = []
for link in links:
    if len(ebooks) < 10:
        href = link.attrs.get('href', '')
        # look for '/ebooks/' followed by a number
        if href.startswith('/ebooks/') and href[8:].isnumeric():
            # take the ebook number from the href rather than from the link text
            ebooks.append((href[8:], link.text))
# collect one row of statistics per book
rows = []
for gutenberg_number, li in ebooks:
    try:
        title = li.split(' by')[0] # get the book title
        author = li.split(' by')[1].split(' (')[0] # get the book author
    except IndexError: # no ' by' in the link text
        title = li.split(' (')[0]
        author = 'Anonymous'
    print(gutenberg_number, title)
    text, sentences, readability = get_readability(gutenberg_number)
    rows.append({'Title':title,'Author':author,
                 'Number of Sentences':sentences,'Number of Characters':len(text),
                 'Readability':readability,'Text':text})
# create a dataframe (DataFrame.append was removed in pandas 2.0)
df = pd.DataFrame(rows)
print('Hover over the points to see more information.')
fig = px.scatter(df,x='Number of Characters',y='Readability',
                 hover_data=['Author'],text='Title',title='eBook Readability versus Length')
fig.update_traces(textposition='bottom right', textfont_size=10)
fig.show()
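
The title and author parsing in the cell above can also be tried on its own. The sketch below uses a hypothetical helper, `split_title_author`, and made-up link texts. It differs slightly from the cell: it splits on the last `' by '`, so a title that itself contains " by " stays in one piece.

```python
def split_title_author(link_text):
    """Split 'Title by Author (number)' into (title, author)."""
    label = link_text.rsplit(' (', 1)[0]  # drop the trailing '(number)'
    if ' by ' in label:
        title, author = label.rsplit(' by ', 1)  # split on the last ' by '
    else:
        title, author = label, 'Anonymous'  # no author in the link text
    return title, author

# Made-up link texts in the 'Title by Author (number)' style
print(split_title_author('Pride and Prejudice by Jane Austen (1234)'))
print(split_title_author('Beowulf: An Anglo-Saxon Epic Poem (5678)'))
```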

### Characters and Sentences

We can also look at the relationship between the number of *sentences* and the number of *characters*.
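
One way to read that relationship is as an average sentence length: the number of characters divided by the number of sentences. The sketch below uses made-up counts for three imaginary books (the name `sample_counts` and all the numbers are purely illustrative).

```python
# Made-up (sentences, characters) counts for three imaginary books
sample_counts = {
    'Book A': (1000, 120000),
    'Book B': (2500, 250000),
    'Book C': (400, 80000),
}

# average sentence length in characters for each book
averages = {title: characters / sentences
            for title, (sentences, characters) in sample_counts.items()}

for title, average in averages.items():
    print(title, round(average))
```

A book with few sentences for its length has a high average, meaning its sentences tend to be long.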

In [None]:
px.scatter(df,x='Number of Characters',y='Number of Sentences',
           hover_data=['Title'],title='Number of Sentences versus Length')

## Your Own List of eBooks

If you'd like to create similar visualizations using your own list of ebooks from the Project Gutenberg site, you can modify and use the following code cell.

In [None]:
books = {
    'Little Women': 514,
    'Tom Sawyer': 74,
    'Treasure Island': 120,
}

textposition = 'bottom center'
textfont_size = 10
rows = [] # collect one row of statistics per book
for title, gutenberg_number in books.items():
    print(gutenberg_number, title)
    text, sentences, readability = get_readability(gutenberg_number)
    rows.append({'Title':title,'Number of Sentences':sentences,
                 'Number of Characters':len(text),
                 'Readability':readability,'Text':text})
df2 = pd.DataFrame(rows) # DataFrame.append was removed in pandas 2.0
fig1 = px.scatter(df2,x='Number of Characters',y='Readability',hover_data=['Title'],text='Title',title='Readability versus Length')
fig1.update_traces(textposition=textposition,textfont_size=textfont_size)
fig1.show()
fig2 = px.scatter(df2,x='Number of Characters',y='Number of Sentences',hover_data=['Title'],text='Title',title='Number of Sentences versus Length')
fig2.update_traces(textposition=textposition,textfont_size=textfont_size)
fig2.show()

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)