![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

### Prep work

Run the cell below to install libraries:

In [None]:
!pip install --user spacy
#!python -m spacy download en_core_web_sm

Run the next cells to load libraries and pre-defined functions:

In [None]:
#!wget https://raw.githubusercontent.com/callysto/hackathon/master/Group1_Book/helper_code/book1.py -P helper_code -nc

In [None]:
import re
import spacy
import urllib.request
import numpy as np
import pandas as pd

nlp = spacy.load('en_core_web_sm')

def get_book_df(chapters):
    # keep only verbs, nouns, adjectives, and proper nouns
    keep_pos = {"VERB", "NOUN", "ADJ", "PROPN"}
    rows = []
    for i, chapter in enumerate(chapters):
        for token in nlp(chapter):
            if token.pos_ in keep_pos:
                rows.append({"text": token.text,
                             "part-of-speech": token.pos_,
                             "lemma": token.lemma_.strip().lower(),
                             "chapter": i + 1})
    # DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
    return pd.DataFrame(rows, columns=["text", "part-of-speech", "lemma", "chapter"])

def get_speechparts_by_chapter(book_df):
    result = book_df.groupby(["chapter", "part-of-speech"]).size().reset_index(name="count").\
                          pivot(index="chapter", columns='part-of-speech',values="count").reset_index().\
                          rename_axis(None,axis="columns").set_index("chapter")
    return result 

def get_counts(book_df, value):
    result = book_df.groupby(value).size().reset_index(name='count').set_index(value).sort_values(['count'], ascending=False)
    
    return result

def get_counts_by_chapters(book_df):
    result = book_df.groupby(["chapter", "lemma"]).size().reset_index(name="count").\
                                     pivot(index="chapter", columns='lemma',values="count").reset_index().\
                                    rename_axis(None,axis="columns").set_index("chapter")
    return result

In [None]:
# load libraries and helper code
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

#from helper_code.book1 import *

# Group goal

 
Go through the "Alice's Adventures in Wonderland" analysis below, work on challenges, and try modifying the code.

**Extra challenge**:

Explore the "Adventures of Tom Sawyer" book to show interesting results and visualizations.


### Download the book from the Project Gutenberg website

This book was downloaded from the Project Gutenberg website.

**Project Gutenberg** is a library of over 60,000 free eBooks.

[This link](http://www.gutenberg.org/ebooks/search/?sort_order=downloads) shows the most popular books. 


In this notebook we are going to look at the book "Alice's Adventures in Wonderland".  
"The Adventures of Tom Sawyer" is downloaded as well for the extra challenge.

In [None]:
#file names for both books
alice_filename = "alice.txt"
tom_filename = "tom.txt"

In [None]:
#copying files from cloud object storage
alice_url="https://swift-yeg.cloud.cybera.ca:8080/v1/AUTH_d22d1e3f28be45209ba8f660295c84cf/hackaton/alice.txt"
urllib.request.urlretrieve(alice_url, alice_filename)


tom_url="https://swift-yeg.cloud.cybera.ca:8080/v1/AUTH_d22d1e3f28be45209ba8f660295c84cf/hackaton/tom.txt"
urllib.request.urlretrieve(tom_url, tom_filename)

In [None]:
#reading the book into variable 'book'
with open(alice_filename, 'r') as text_file:
    book = text_file.read()

In [None]:
#print the entire book on the screen
print(book)

In [None]:
# how many characters are in the book?
len(book)

In [None]:
# split the book by chapter
chapters = re.split(r"CHAPTER\s+[IVXLCDM]+.", book)

# strip off any whitespace at the very beginning and very end of each chapter.
chapters = [chapter.strip() for chapter in chapters]

# replace newlines with spaces
chapters = [re.sub("\n", " ", c) for c in chapters]

#select only chapters that have more than 1000 characters (to exclude table of contents, title, etc.)
chapters = [c for c in chapters if len(c)>1000]
 
# number of chapters
print(len(chapters), " chapters")
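
To see what the splitting and filtering steps do, here is a minimal sketch on a made-up sample text. The `sample` string and the `min_length` threshold are illustrative assumptions and are not taken from the book file; the notebook itself keeps chapters longer than 1000 characters.

```python
import re

# Made-up text with a short front-matter part and two "chapters"
sample = (
    "Title page\n\n"
    "CHAPTER I.\nDown the hole\nAlice fell.\n\n"
    "CHAPTER II.\nThe pool\nAlice swam."
)

# Split wherever a chapter heading with a Roman numeral appears
parts = re.split(r"CHAPTER\s+[IVXLCDM]+.", sample)

# Trim surrounding whitespace, then turn newlines into spaces
parts = [re.sub("\n", " ", p.strip()) for p in parts]

# Hypothetical threshold: drop very short parts such as the title page
min_length = 15
sample_chapters = [p for p in parts if len(p) > min_length]

print(sample_chapters)
```

The title page is dropped because it is shorter than the threshold, which is the same idea the notebook uses to exclude the table of contents.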

### Create a dataframe selecting only nouns, proper nouns, verbs, and adjectives per chapter

- **text**: actual word
- **part-of-speech**:  ADJ, PROPN, VERB, or NOUN
- **lemma**: headword (the base, dictionary form of the word, in lowercase)
- **chapter**: chapter number
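
As an illustration of this layout, the sketch below builds a tiny dataframe by hand. The rows are invented for the example and are not real spaCy output; only the column names match what `get_book_df` returns.

```python
import pandas as pd

# Hand-written rows (an assumption for illustration, not produced by spaCy)
toy_df = pd.DataFrame(
    [
        {"text": "Alice", "part-of-speech": "PROPN", "lemma": "alice", "chapter": 1},
        {"text": "falling", "part-of-speech": "VERB", "lemma": "fall", "chapter": 1},
        {"text": "rabbits", "part-of-speech": "NOUN", "lemma": "rabbit", "chapter": 1},
        {"text": "curious", "part-of-speech": "ADJ", "lemma": "curious", "chapter": 2},
    ]
)

# One row per word, one column per attribute
print(toy_df)
print(toy_df.shape)
```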

In [None]:
#This cell can take 3-5 minutes to run

#create a dataframe from the book
book_df = get_book_df(chapters)

In [None]:
# show first 5 rows of the dataframe
book_df.head()

In [None]:
## how many rows (individual words) and columns do we have?
book_df.shape

In [None]:
#excluding lemma equal to '’s' and '’'
book_df = book_df[(book_df["lemma"]!='’s') & (book_df["lemma"]!='’')]
book_df.shape
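
If more unwanted lemmas turn up, chaining `!=` comparisons gets long. One possible alternative, sketched here on a small made-up dataframe, is to list the unwanted values and filter with `isin`. The `unwanted` list and `toy_df` are assumptions for the example.

```python
import pandas as pd

# Made-up data containing two stray apostrophe "lemmas"
toy_df = pd.DataFrame(
    {"lemma": ["alice", "’s", "rabbit", "’"], "chapter": [1, 1, 1, 2]}
)

# Values we want to exclude (extend this list as needed)
unwanted = ["’s", "’"]

# ~ negates the mask, so we keep rows whose lemma is NOT in the list
cleaned = toy_df[~toy_df["lemma"].isin(unwanted)]

print(cleaned)
```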

### Number of adjectives, nouns, proper nouns, and verbs 

In [None]:
#we group by "part-of-speech" column and count the number of rows
counts_by_part_of_speech = book_df.groupby("part-of-speech").size()

counts_by_part_of_speech
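
To make the group-and-count step concrete, here is a minimal sketch on a made-up column of part-of-speech tags. The data is invented for illustration; the `groupby(...).size()` call is the same one used above.

```python
import pandas as pd

# Six made-up words, described only by their part of speech
toy_df = pd.DataFrame(
    {"part-of-speech": ["NOUN", "VERB", "NOUN", "ADJ", "PROPN", "NOUN"]}
)

# size() counts how many rows fall into each group
toy_counts = toy_df.groupby("part-of-speech").size()

print(toy_counts)
```

The result is a pandas Series indexed by the group label, which is why it can be plotted directly.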

In [None]:
#create a pie chart; a 5 by 5 figure size keeps the pie circular
counts_by_part_of_speech.plot(kind="pie",figsize=(5,5))

#set x and y axis labels to blanks
plt.ylabel("")
plt.xlabel("")

### Challenge: 
 - Try grouping by a different column: if you change `groupby("part-of-speech")` to `groupby("chapter")`, what does it give you?
 - Experiment with different kinds of plots: try  changing `plot(kind="pie")` to `plot()` or `plot(kind="bar")` or `plot(kind="barh")`. Which of these better represents the data?  

### Number of adjectives, nouns, proper nouns, and verbs per chapter

In [None]:
#we call a function to get the total count of each part of speech per chapter
speech_parts_by_chapter = get_speechparts_by_chapter(book_df)

#print data on the screen
speech_parts_by_chapter
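
The helper builds this table with `groupby` and `pivot`. As a sketch of an alternative (not the helper's own implementation), `pd.crosstab` produces a similar chapter-by-part-of-speech table in one call. The toy data below is made up; note that `crosstab` fills missing combinations with 0, whereas the pivot-based helper leaves them as NaN.

```python
import pandas as pd

# Made-up words from two chapters
toy_df = pd.DataFrame(
    {
        "chapter": [1, 1, 1, 2, 2],
        "part-of-speech": ["NOUN", "VERB", "NOUN", "NOUN", "ADJ"],
    }
)

# Rows are chapters, columns are parts of speech, cells are counts
toy_table = pd.crosstab(toy_df["chapter"], toy_df["part-of-speech"])

print(toy_table)
```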

In [None]:
#different kind of plot - area
speech_parts_by_chapter.plot(kind="area",figsize=(18,8))

plt.xlabel("Chapter", size = 16)
plt.ylabel("Counts", size = 16)

### Challenge
 - Experiment with plots: try changing `plot(kind="area")` to `plot()` or `plot(kind="bar")` or `plot(kind="bar",stacked=True)`. 

 - What kind of plot can better visually demonstrate which chapter has the largest number of verbs?

An alternative way to find the chapter with the largest number of verbs is **sorting**:

In [None]:
#sort_values() function - sorts by a column or set of columns
speech_parts_by_chapter.sort_values("VERB",ascending=False)
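
If only the winning chapter is needed rather than the full sorted table, `idxmax` is another option. The sketch below uses made-up counts; `toy_counts` is a stand-in for `speech_parts_by_chapter`.

```python
import pandas as pd

# Made-up counts for three chapters (index is the chapter number)
toy_counts = pd.DataFrame(
    {"NOUN": [120, 95, 140], "VERB": [80, 110, 90]},
    index=pd.Index([1, 2, 3], name="chapter"),
)

# idxmax returns the index label (here: chapter) of the largest value
chapter_with_most_verbs = toy_counts["VERB"].idxmax()

print("Chapter with the most verbs:", chapter_with_most_verbs)
```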

### Challenges
 - Find the chapter that has the most **NOUN**s
 - Find the chapter that has the **fewest** adjectives
 - Try plotting the results
 - Try two new kinds of plots: histogram and boxplot. Can you figure out how to interpret them?
     - Use  `plot(kind="box")`
     - Use  `plot(kind="hist",alpha=0.4)` (try changing alpha)
     
More information on [histograms](https://www.mathsisfun.com/data/histograms.html) and [boxplots](https://www.mathsisfun.com/definitions/box-and-whisker-plot.html)

### Top 10 most common words

In [None]:
#call function to count the number of rows  for every "lemma" 

word_counts = get_counts(book_df, "lemma")

#print top 10 most frequent words on the screen
word_counts.head(10)
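
For comparison, pandas also offers `value_counts`, which counts and sorts in a single step. This is a sketch of an alternative to the `get_counts` helper, shown on a made-up list of lemmas.

```python
import pandas as pd

# Made-up lemmas standing in for the "lemma" column
toy_df = pd.DataFrame({"lemma": ["say", "alice", "say", "go", "alice", "say"]})

# value_counts sorts from most to least frequent by default
top_two = toy_df["lemma"].value_counts().head(2)

print(top_two)
```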

### Challenges
 - Try using "text" column instead of "lemma" - why do you get different results?
 - Plot the results using your choice of plot

###  The top 10 most common adjectives 

In [None]:
## subset only to adjectives
adjectives = book_df[book_df["part-of-speech"]=="ADJ"]

adjectives.head()

In [None]:
#call function to count the number of adjectives
adjective_counts = get_counts(adjectives, "lemma")

adjective_counts.head()

In [None]:
#visualize the top 10 adjectives:
adjective_counts.head(10).plot(kind="bar",figsize=(18, 8))

#set x and y axis labels
plt.ylabel("Counts", size = 16)
plt.xlabel("Lemma", size = 16)

### Challenges
 - Find the top 10 most common nouns and verbs
 - Plot the results

### For the top 15 most common  proper nouns, how does the number vary from chapter to chapter?

In [None]:
## subset only to proper nouns
propnouns = book_df[book_df["part-of-speech"]=="PROPN"]

propnouns.head()

In [None]:
#how many most frequent proper nouns do we want to analyse
num_words = 15

#call function to count the number of proper nouns 
top_propnouns = get_counts(propnouns, "lemma")

#get the top proper nouns 
top_propnouns = top_propnouns.head(num_words).index

#transform them into list
top_propnouns = list(top_propnouns)

#print on the screen
top_propnouns

In [None]:
## subset only to the top proper nouns
character_by_chapter = book_df[book_df["lemma"].isin(top_propnouns)]

character_by_chapter.head()

In [None]:
#what is the distribution of top proper nouns per chapter?
# call function to form resulting dataframe 
counts_by_chapter = get_counts_by_chapters(character_by_chapter)

#display on the screen
counts_by_chapter.head()
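
A word that never appears in a chapter shows up as NaN in the table above. One way to get zeros instead, sketched here on made-up data and not part of the helper code, is `groupby` followed by `unstack(fill_value=0)`.

```python
import pandas as pd

# Made-up lemmas from two chapters
toy_df = pd.DataFrame(
    {
        "chapter": [1, 1, 2, 2, 2],
        "lemma": ["alice", "rabbit", "alice", "alice", "queen"],
    }
)

# Count each (chapter, lemma) pair, then spread lemmas across columns;
# fill_value=0 replaces missing combinations with zero
toy_table = toy_df.groupby(["chapter", "lemma"]).size().unstack(fill_value=0)

print(toy_table)
```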

In [None]:
#what are the main characters in every chapter?
#we use colormap "tab20" to extend the default number of colors
counts_by_chapter.plot(kind="bar",figsize=(18,8),stacked = True, cmap="tab20")

#set x and y axis labels
plt.ylabel("Counts", size = 16)
plt.xlabel("Chapter", size = 16)

### Challenges
 - Try experimenting with the number of proper nouns (change `num_words`)
 - Try doing the same thing with adjectives, nouns, and/or verbs. Can you guess what's going on in each chapter based on these plots?

### Extra
Now let's try doing the same thing using **percentages** instead of raw counts.

In [None]:
#make a copy of the dataframe to work with percentages
counts_percent = counts_by_chapter.copy()

#create an additional column: the sum of words per chapter (axis=1 means sum across each row)
counts_percent["sum"] = counts_percent.sum(axis = 1)

#divide every column by the sum
counts_percent = counts_percent.iloc[:,0:num_words].divide(counts_percent["sum"],axis=0)

#multiply every column by 100
counts_percent = counts_percent.iloc[:,0:num_words].multiply(100,axis=0)

#we choose area plot this time
counts_percent.plot(kind="area",figsize=(18,8),cmap="tab20")

#set x and y axis labels
plt.ylabel("Percent", size = 16)
plt.xlabel("Chapter", size = 16)
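
The row-wise percentage calculation can be checked on a small made-up table. This is a compact sketch of the same idea (divide each row by its total, then multiply by 100); the numbers are invented so the result is easy to verify by hand.

```python
import pandas as pd

# Made-up counts of three names in two chapters
toy_counts = pd.DataFrame(
    {"alice": [30, 10], "queen": [10, 30], "rabbit": [10, 10]},
    index=pd.Index([1, 2], name="chapter"),
)

# Total per chapter (axis=1 sums across the columns of each row)
row_totals = toy_counts.sum(axis=1)

# axis=0 aligns the totals with the row index, so each row is divided by its own total
toy_percent = toy_counts.div(row_totals, axis=0) * 100

print(toy_percent)
```

Each row of the result adds up to 100, so chapters of different lengths can be compared on the same scale.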

![alt text](https://github.com/callysto/callysto-sample-notebooks/blob/master/notebooks/images/Callysto_Notebook-Banners_Bottom_06.06.18.jpg?raw=true)