![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fcurriculum-notebooks&branch=master&subPath=EnglishLanguageArts/ReadabilityStatistics/readability-statistics.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# Readability Statistics

This notebook analyses the top ten ebooks from [Project Gutenberg](http://www.gutenberg.org/browse/scores/top) for length and readability. We use a Python code library called [textstat](https://github.com/shivam5992/textstat) to estimate the grade-level readability of each text.

To start, `►Run` the following code cell to import the libraries and define a function for getting ebooks from Project Gutenberg.

In [None]:
!pip install textstat --user
import textstat
import requests
from bs4 import BeautifulSoup
import pandas as pd
import plotly.express as px

def get_readability(gutenberg_number):
    gutenberg_url = 'https://www.gutenberg.org/ebooks/'+str(gutenberg_number)+'.txt.utf-8'
    r = requests.get(gutenberg_url)
    r.encoding = 'utf-8' # specify the type of text encoding in the file
    try:
        raw_text = r.text.split('***')[2] # get the part after the header
    except IndexError: # fewer than three pieces means no *** header was found
        raw_text = r.text # failsafe if there isn't a header with ***
    text = raw_text.replace("’","'").replace("“",'"').replace("”",'"') # replace any 'smart quotes'
    text = text.replace('\r', ' ').replace('\n', ' ') # replace "new line" characters with spaces
    sentences = textstat.sentence_count(text)
    readability = textstat.text_standard(text, float_output=True)
    return text, sentences, readability

print('Code libraries imported and get_readability function defined.')
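
To see why `get_readability` keeps the third piece after splitting on `***`, here is a minimal sketch using a made-up miniature file. The sample text and the helper name `extract_body` are assumptions for illustration; real ebooks may use different marker wording.

```python
# A made-up miniature of a Project Gutenberg text file (not a real ebook).
sample = (
    "Title: A Tiny Book\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK A TINY BOOK ***\n"
    "It was a short story. It ended quickly.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK A TINY BOOK ***\n"
    "License text goes here.\n"
)

def extract_body(raw):
    """Return the text between the start and end markers, or everything if there are no markers."""
    parts = raw.split('***')
    # parts[0] is the header, parts[1] is the wording of the start marker, parts[2] is the body
    return parts[2] if len(parts) > 2 else raw

body = extract_body(sample).strip()
print(body)
```

Because the end marker also contains `***`, the license text after it lands in later pieces and is left out of the body.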

### Testing

Next, `►Run` the following code cell to download and measure the readability of War and Peace by Leo Tolstoy from http://www.gutenberg.org/ebooks/2600.

You can also run this with any other ebook number from Project Gutenberg.

In [None]:
gutenberg_number = 2600

text, sentences, readability = get_readability(gutenberg_number)
print('The readability of "War and Peace" is', readability)
print('It contains', sentences, 'sentences and', len(text), 'characters (letters, numbers, punctuation, etc.)')
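
The readability value is an estimated school grade level. As a rough illustration of how such an estimate can be computed, the sketch below applies one well-known formula, the Flesch-Kincaid grade level. It is not textstat's implementation: the sentence and syllable counters here are crude assumptions, and textstat may combine several formulas, so its numbers will differ.

```python
import re

def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels (at least 1)."""
    return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = max(1, len(re.findall(r'[.!?]+', text)))  # naive: runs of end punctuation
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59

simple = "The cat sat on the mat. The dog ran to the park."
harder = "Philosophical investigations frequently necessitate extraordinarily complicated terminology."
print(round(flesch_kincaid_grade(simple), 1), round(flesch_kincaid_grade(harder), 1))
```

Longer sentences and longer words both push the score up, which is why the second sample scores higher than the first.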

## Visualizing eBooks

### Readability Versus Number of Sentences
`►Run` the next code cell to download and visualize statistics about the ten most frequently downloaded ebooks.

In [None]:
r = requests.get('http://www.gutenberg.org/browse/scores/top') # get the page of most downloaded books
soup = BeautifulSoup(r.text, 'html.parser') # read the html
top_ten = soup.find_all('li')[:10] # find the first 10 list items
rows = [] # collect one dictionary per book
for li in top_ten: # loop through the list items
    title = li.text.split(' by')[0] # get the book title
    author = li.text.split(' by')[1].split(' (')[0] # get the book author
    gutenberg_number = li.find_all('a')[0].get('href').split('/')[-1] # get the ebook number
    print(li.text, 'ebook number', gutenberg_number)
    text, sentences, readability = get_readability(gutenberg_number)
    rows.append({'Title':title,'Author':author,'Number of Sentences':sentences,'Number of Characters':len(text),'Readability':readability,'Text':text})
df = pd.DataFrame(rows) # create a dataframe from the list (DataFrame.append was removed in pandas 2.0)
print('Hover over the points to see more information.')
fig = px.scatter(df,x='Number of Sentences',y='Readability',hover_data=['Author'],text='Title')
fig.update_traces(textposition='bottom right', textfont_size=10)
fig.show()
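
The loop above splits each list item on `' by'` to separate the title from the author, which raises an `IndexError` for an entry with no author and can cut a title that itself contains the word "by". A slightly more forgiving version might look like the sketch below. The helper name `parse_list_item` and the sample strings are made up for illustration, and the assumed shape `Title by Author (count)` may not match every entry on the page.

```python
def parse_list_item(item_text):
    """Split 'Title by Author (count)' into (title, author); author is None if ' by ' is absent."""
    # Drop a trailing '(...)' group, such as a download count, if present.
    if item_text.endswith(')') and ' (' in item_text:
        item_text = item_text.rsplit(' (', 1)[0]
    # Split on the LAST ' by ' so titles that contain the word 'by' stay intact.
    title, separator, author = item_text.rpartition(' by ')
    if not separator:
        return item_text, None
    return title, author

# Made-up examples in the assumed shape of the list items
print(parse_list_item('Pride and Prejudice by Jane Austen (12345)'))
print(parse_list_item('Beowulf: An Anglo-Saxon Epic Poem (678)'))
```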

### Characters and Sentences

We can also look at the relationship between the number of characters and the number of sentences.

In [None]:
px.scatter(df, x='Number of Sentences', y='Number of Characters', hover_data=['Title'])
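
One way to read this relationship is as an average sentence length: characters divided by sentences. The sketch below computes that ratio using hypothetical counts, not values measured from real ebooks. The names `sample_books` and `characters_per_sentence` are assumptions for illustration.

```python
# Hypothetical counts for illustration only (not measured from real ebooks)
sample_books = [
    {'Title': 'Book A', 'Number of Sentences': 2000, 'Number of Characters': 180000},
    {'Title': 'Book B', 'Number of Sentences': 5000, 'Number of Characters': 600000},
]

def characters_per_sentence(book):
    """Average sentence length in characters; 0.0 if the book has no sentences."""
    sentences = book['Number of Sentences']
    return book['Number of Characters'] / sentences if sentences else 0.0

for book in sample_books:
    print(book['Title'], round(characters_per_sentence(book), 1))
```

Books that sit above the general trend in the scatter plot have longer sentences on average than books that sit below it.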

## Your Own List of eBooks

If you'd like to create similar visualizations using your own list of ebooks from the Project Gutenberg site, you can modify and use the following code cell.

In [None]:
books = {
    'Little Women by Louisa May Alcott':514,
    'The Adventures of Tom Sawyer by Mark Twain':74,
    'Treasure Island by Robert Louis Stevenson':120,
}

rows = [] # collect one dictionary per book
for title, gutenberg_number in books.items(): # loop through the titles and ebook numbers
    print(title, 'ebook number', gutenberg_number)
    text, sentences, readability = get_readability(gutenberg_number)
    rows.append({'Title':title,'Number of Sentences':sentences,'Number of Characters':len(text),'Readability':readability,'Text':text})
df2 = pd.DataFrame(rows) # create a dataframe from the list (DataFrame.append was removed in pandas 2.0)
fig1 = px.scatter(df2,x='Number of Sentences',y='Readability',hover_data=['Title'],text='Title',title='Readability versus Number of Sentences')
fig1.update_traces(textposition='bottom center', textfont_size=10)
fig1.show()
fig2 = px.scatter(df2,x='Number of Sentences',y='Number of Characters',hover_data=['Title'],text='Title',title='Number of Characters versus Number of Sentences')
fig2.update_traces(textposition='bottom center', textfont_size=10)
fig2.show()

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)