![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://colab.research.google.com/github/callysto/jupyterlite/blob/main/content/hackathon/Books/books-intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# Books Introduction

## Alice's Adventures in Wonderland

<table>
<tr>
<td> <img src="https://upload.wikimedia.org/wikipedia/commons/e/e2/John_Tenniel-_Alice%27s_mad_tea_party%2C_colour.jpg" alt="Alice's Tea Party" style="width: 432px;"/> </td>
<td> <img src="https://upload.wikimedia.org/wikipedia/en/8/8f/Alice_in_Wonderland_pg41_-_Alice_meets_the_White_Rabbit_-_by_Margaret_Winifred_Tarrant_1916.jpg" alt="Alice Meeting White Rabbit" style="width: 240px;"/> </td>
</tr>
<tr>
<td style="font-size:8px;"><a href="https://commons.wikimedia.org/wiki/File:John_Tenniel-_Alice%27s_mad_tea_party,_colour.jpg">John_Tenniel-_Alice%27s_mad_tea_party,_colour.jpg</a></td>
<td style="font-size:8px;"><a href="https://en.wikipedia.org/wiki/File:Alice_in_Wonderland_pg41_-_Alice_meets_the_White_Rabbit_-_by_Margaret_Winifred_Tarrant_1916.jpg">Alice_meets_the_White_Rabbit_-_by_Margaret_Winifred_Tarrant_1916.jpg</a></td>
</tr>
</table>
    
[Alice's Adventures in Wonderland](https://en.wikipedia.org/wiki/Alice's_Adventures_in_Wonderland) is a popular novel published in 1865 by English author Charles Lutwidge Dodgson, better known by his pen name Lewis Carroll.

On a regular day you might read the book and speculate about what will happen next. In this hackathon, however, you will discover some interesting things about the book while learning some new coding/hacking skills.

## Getting ready

This section sets up several things behind the scenes that the rest of this notebook requires. Most of the code cells in this section are ready to run, so you won't have to modify them. You don't need to understand everything the code cells in this section do to complete the challenges. However, feel free to ask the mentors about anything that makes you curious.

### Import/Install libraries

Run the cell below to download, install, and import the required Python libraries. It may take a couple of minutes to finish running.

In [None]:
%pip install plotly spacy nbformat requests pandas

import requests
import pandas as pd
import plotly.express as px
from collections import Counter
import spacy
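try:
    nlp = spacy.load('en_core_web_sm') # load the English language model
except OSError: # the model is not installed yet
    from spacy.cli import download
    download('en_core_web_sm') # download and install the model
    nlp = spacy.load('en_core_web_sm')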
print('Setup Complete')

### Download the book from Project Gutenberg

**[Project Gutenberg](https://www.gutenberg.org/)** is a digital library with more than 60,000 free eBooks. You can see the most popular books downloaded from the Gutenberg website [here](http://www.gutenberg.org/ebooks/search/?sort_order=downloads).

In [None]:
# link to the book text file
gutenberg_text_link = 'https://www.gutenberg.org/files/11/11-0.txt'

r = requests.get(gutenberg_text_link) # get the online book file
r.encoding = 'utf-8' # specify the type of text encoding in the file
book = r.text.split('***')[2] # get the part after the header
book = book.replace("’","'").replace("“",'"').replace("”",'"') # replace any 'smart quotes'
book_title = r.text[r.text.index('Title:')+7:r.text.index('Author:')-4] # find the book title
print(book)
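
To see why the cell above keeps piece number `2` after splitting on `'***'`, here is a minimal sketch on a tiny made-up text. The string `sample_file` is an assumption that imitates the layout of a Project Gutenberg file (a header, a `*** START ... ***` line, the book, and a `*** END ... ***` line); it is not the real book.

```python
# A tiny made-up stand-in for a Project Gutenberg text file (not the real book).
sample_file = (
    "Title: A Sample Book\r\n\r\n"
    "Author: Nobody\r\n\r\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK A SAMPLE BOOK ***\r\n"
    "CHAPTER I.\r\nOnce upon a time...\r\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK A SAMPLE BOOK ***\r\n"
    "License text goes here."
)

# Splitting on '***' cuts the file at every marker.
pieces = sample_file.split('***')
for number, piece in enumerate(pieces):
    print(number, repr(piece[:30]))  # show the start of each piece

sample_book = pieces[2]  # the text between the START line and the END line

# The title sits between 'Title: ' (7 characters) and the blank line before 'Author:'.
sample_title = sample_file[sample_file.index('Title:')+7:sample_file.index('Author:')-4]
print(sample_title)
```

Pieces `0` and `1` are the header and the words inside the START marker, so the book itself is piece `2`. If a file used a different marker layout, the index would need to change.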

Let's see how many characters (letters, numbers, spaces, etc.) are in the book.

In [None]:
len(book)

Next, let's split up the book by chapters.

In [None]:
chapter_list = [] # create a list to hold the chapter texts
#chapters = pd.DataFrame() # create an empty data frame
for chapter in book.split('CHAPTER'):
    if len(chapter)>500: # so that we are getting actual book chapters
        chapter_text = chapter.replace('\r',' ').replace('\n',' ') # delete the 'new line' characters
        chapter_list.append(chapter_text) # add the chapter to the list
chapters = pd.DataFrame(chapter_list, columns=['Chapter Text']) # create a data frame from the list
chapters['Chapter'] = chapters.index+1 # add a column with the chapter number
chapters = chapters[['Chapter', 'Chapter Text']] # reorder the columns
chapters['Chapter Length'] = chapters['Chapter Text'].apply(len) # add a column with the length of each chapter
chapters
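
The length check in the cell above is there because the word `CHAPTER` also appears in the table of contents. As a rough sketch of the idea, using a made-up `sample_book` string and a smaller, hypothetical threshold `min_length` (the notebook uses 500), short pieces are skipped and only the real chapters are kept:

```python
# Made-up text: a short table of contents followed by two "chapters".
sample_book = (
    "Contents\n CHAPTER I. Down the Hole\n CHAPTER II. The Pool\n\n"
    "CHAPTER I.\nDown the Hole\n" + "Alice was tired. " * 5 + "\n"
    "CHAPTER II.\nThe Pool\n" + "Alice was curious. " * 5
)

min_length = 50  # hypothetical threshold; smaller than 500 because the sample is tiny

sample_chapters = []
for piece in sample_book.split('CHAPTER'):
    if len(piece) > min_length:  # skip the short table-of-contents pieces
        sample_chapters.append(piece.replace('\r', ' ').replace('\n', ' '))

for number, text in enumerate(sample_chapters, start=1):
    print(number, len(text), text[:25])
```

A threshold like this is a simple heuristic: it would also drop a genuine chapter that happened to be very short.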

We now have a dataframe of the chapters, so we can do things like visualize their lengths (numbers of characters).

In [None]:
px.bar(chapters, x='Chapter', y='Chapter Length', title='Chapter Lengths in '+book_title)

We can see that the chapter length varies a bit; perhaps the author wrote shorter chapters for the rising action and falling action parts of the story.

### Adjectives, verbs, nouns, and proper nouns per chapter

Now let's use spaCy to look at the characteristics of each word in the book, and add counts by part of speech to our dataframe. The characteristics we are interested in are described below:

- **text**: actual word
- **[part of speech](https://spacy.io/api/annotation#pos-tagging)**: ADJ, VERB, NOUN, or PROPN
- **lemma**: headword, the dictionary form of the word (for example, *run* for *runs*)
- **chapter**: chapter number

|Part of Speech|Definition|Example|
|-|-|-|
|Noun|A person, place, thing, or idea|The white **rabbit** runs quickly.|
|Pronoun|Replaces a noun|**He** runs quickly.|
|Verb|Action|The white rabbit **runs** quickly.|
|Adjective|Describes a noun|The **white** rabbit runs quickly.|
|Adverb|Describes a verb|The white rabbit runs **quickly**.|

Running the next cell may take a couple of minutes.

In [None]:
parts_of_speech = {'Word Count':[], 'Word List':[], 'ADJ':[], 'VERB':[], 'NOUN':[], 'PROPN':[]} # create an empty dictionary to store word counts and lists

for row in chapters.iterrows(): # iterate through the dataframe
    i = row[0]
    chapter_text = row[1]['Chapter Text']
    for k in parts_of_speech.keys():
        parts_of_speech[k].append(0)
    word_list = nlp(chapter_text) # use spacy to parse the text
    parts_of_speech['Word List'][i] = word_list
    parts_of_speech['Word Count'][i] = len(word_list) # get the total number of tokens (words, plus punctuation and extra spaces)
    for word in word_list:
        if word.pos_ in parts_of_speech.keys():
            parts_of_speech[word.pos_][i] += 1

for k in parts_of_speech.keys():  # add the word counts to the dataframe
    chapters[k] = parts_of_speech[k]
chapters
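
The counting step in the cell above can be sketched without spaCy. In this minimal example the `(word, part of speech)` pairs in `tagged_words` are typed in by hand as an assumption; in the notebook, spaCy supplies them through `word.text` and `word.pos_`.

```python
from collections import Counter

# Hand-tagged (word, part-of-speech) pairs; these tags are typed in by hand,
# not produced by spaCy.
tagged_words = [
    ('The', 'DET'), ('white', 'ADJ'), ('rabbit', 'NOUN'),
    ('runs', 'VERB'), ('quickly', 'ADV'), ('.', 'PUNCT'),
    ('Alice', 'PROPN'), ('follows', 'VERB'), ('.', 'PUNCT'),
]

wanted = ['ADJ', 'VERB', 'NOUN', 'PROPN']  # the parts of speech we are counting

# Count only the tags we care about, and count every token for the total.
counts = Counter(pos for word, pos in tagged_words if pos in wanted)
token_count = len(tagged_words)

print(token_count, dict(counts))
```

Note that the total includes the punctuation marks, which is also how `len(word_list)` behaves on a spaCy document.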

Our dataframe now includes counts of those [parts of speech](https://spacy.io/api/annotation#pos-tagging) by chapter. We can also calculate what percent of the words in each chapter are each of those types.

In [None]:
chapters['Adjectives %'] = chapters['ADJ']/chapters['Word Count']*100
chapters['Verbs %'] = chapters['VERB']/chapters['Word Count']*100
chapters['Nouns %'] = chapters['NOUN']/chapters['Word Count']*100
chapters['Proper Nouns %'] = chapters['PROPN']/chapters['Word Count']*100
chapters
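
As a worked example of the percentage calculation above, here is the same arithmetic for one imaginary chapter. The numbers in `sample_chapter` are made up for illustration and are not taken from the book.

```python
# Made-up counts for one imaginary chapter (not taken from the book).
sample_chapter = {'Word Count': 2000, 'ADJ': 100, 'VERB': 300, 'NOUN': 250, 'PROPN': 50}

percentages = {}
for pos in ['ADJ', 'VERB', 'NOUN', 'PROPN']:
    # count divided by the total, times 100
    percentages[pos] = sample_chapter[pos] / sample_chapter['Word Count'] * 100

for pos, percent in percentages.items():
    print(pos, round(percent, 1))
```

The four percentages do not add up to 100, because the total also includes kinds of words that are not being counted here, such as pronouns, adverbs, and punctuation.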

Now let's try a visualization of our data.

In [None]:
px.bar(chapters, x='Chapter', y=['Adjectives %', 'Verbs %', 'Nouns %', 'Proper Nouns %'], title='Parts of Speech in '+book_title)

Or if we prefer a line graph.

In [None]:
px.line(chapters, x='Chapter', y=['Adjectives %', 'Verbs %', 'Nouns %', 'Proper Nouns %'], title='Parts of Speech in '+book_title)

Or we can visualize the chapter lengths by word count.

In [None]:
px.bar(chapters, x='Chapter', y='Word Count', title='Chapter Lengths in '+book_title)

## Most common verbs

To get an idea of the most common words in the text, we can look at a part of speech, verbs for example, and count their occurrences.

In [None]:
word_type = 'VERB'

list_of_words = []
for row in chapters.iterrows():
    word_list = row[1]['Word List']
    for word in word_list:
        if word.pos_ == word_type:
            list_of_words.append(word.text)
words_df = pd.DataFrame.from_dict(Counter(list_of_words), orient='index').reset_index().rename(columns={'index':'Word', 0:'Count'})
words_df

Let's see what the ten most common verbs are.

In [None]:
words_df.sort_values('Count', ascending=False).head(10)
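
If you only need the top few words and not a dataframe, the `Counter` imported earlier can also rank them directly with its `most_common` method. This is a small sketch on a made-up list, `sample_verbs`, not on the verbs from the book.

```python
from collections import Counter

# A made-up list of verbs standing in for list_of_words.
sample_verbs = ['said', 'went', 'said', 'thought', 'said', 'went', 'know']

verb_counts = Counter(sample_verbs)      # count how often each verb appears
top_two = verb_counts.most_common(2)     # the two most frequent, highest first

print(top_two)  # → [('said', 3), ('went', 2)]
```

The dataframe approach in the notebook is handy when you want to plot the counts afterwards, while `most_common` is a quick way to peek at the ranking.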

We could then visualize the data, or compare verbs to adjectives, see which are the most common pronouns, or investigate if word usage changes from chapter to chapter.

Check out the [next notebook](books-challenge.ipynb) to continue your own analysis of this or another book.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)