![alt text](https://github.com/callysto/callysto-sample-notebooks/blob/master/notebooks/images/Callysto_Notebook-Banner_Top_06.06.18.jpg?raw=true)

# Book - Alice's Adventures in Wonderland

**Submitted by: A, B, C, D**

<table><tr>
<td> <img src="data/image2.jpg" alt="Drawing" style="width: 700px;"/> </td>
<td> <img src="data/image1.jpeg" alt="Drawing" style="width: 240px;"/> </td>
</tr></table>
    
[Alice's Adventures in Wonderland](https://en.wikipedia.org/wiki/Alice's_Adventures_in_Wonderland) is one of the most popular novels among adults and children alike. It was published in 1865 by the English author Charles Lutwidge Dodgson, better known by his pen name Lewis Carroll.

On a regular day, you would read the book and wonder what happens next. In this hackathon, let us instead dig out some insights about the book that you would not otherwise think of, while learning some new coding/hacking skills along the way.

## Getting ready

This section sets up everything needed behind the scenes to follow this notebook smoothly. Most of the code cells in this section are *ready-to-run*, so you won't have to modify them. You also do not need to understand everything these cells do in order to complete the challenges. However, feel free to ask the mentors about anything that makes you curious.

### 1. Install/Import libraries

Run the cell below to download and install the required Python libraries. It may take a couple of minutes to finish.

In [None]:
! pip install -U spaCy
! python -m spacy download en

Run the next cells to load the libraries and pre-defined functions that will help us complete the challenges later.

In [None]:
!wget https://raw.githubusercontent.com/callysto/hackathon/master/Group1_Book/helper_code/book1.py -P helper_code -nc

In [None]:
# load libraries and helper code
import pandas as pd

import cufflinks as cf
cf.go_offline()

colors20 = ['#e6194b', '#3cb44b', '#ffe119', '#4363d8', '#f58231', '#911eb4', '#46f0f0', 
          '#f032e6', '#bcf60c', '#fabebe', '#008080', '#e6beff', '#9a6324', '#fffac8', 
          '#800000', '#aaffc3', '#808000', '#ffd8b1', '#000075', '#808080', '#ffffff', '#000000']


# to enable plotting in colab
def enable_plotly_in_cell():
    import IPython
    from plotly.offline import init_notebook_mode
    display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
  '''))
    init_notebook_mode(connected=False)
    
get_ipython().events.register('pre_run_cell', enable_plotly_in_cell) 

from helper_code.book1 import *

### 2. Download the book from the Project Gutenberg website

**[Project Gutenberg](https://www.gutenberg.org/)** is a digital library with more than 60,000 free eBooks. You can see the most downloaded books on the Project Gutenberg website [here](http://www.gutenberg.org/ebooks/search/?sort_order=downloads). Can you spot *Alice's Adventures in Wonderland* in that list?

As you already know, this hackathon looks at *Alice's Adventures in Wonderland*. A copy of the book is kept in cloud storage. Let us download it and load it into this notebook. The cells below also reveal some interesting statistics about the book.

In [None]:
# urlretrieve() lives in urllib.request; imported here so this cell does not rely on the helper code
import urllib.request

# file name for the book
alice_filename = "alice.txt"

# copying book from cloud object storage
alice_url="https://swift-yeg.cloud.cybera.ca:8080/v1/AUTH_d22d1e3f28be45209ba8f660295c84cf/hackaton/alice.txt"
urllib.request.urlretrieve(alice_url, alice_filename)

In [None]:
# reading the book into variable 'book'
with open(alice_filename, 'r') as text_file:
    book = text_file.read()

In [None]:
# print the entire book on the screen
print(book)

In [None]:
# how many characters are there in the book?
len(book)
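
The same idea extends to other simple statistics. Here is a minimal sketch on a made-up two-line string rather than the real book. The names `sample`, `character_total`, `word_total`, and `line_total` are illustrative only.

```python
# Made-up sample text standing in for the book
sample = "Alice was beginning to get very tired\nof sitting by her sister"

# len() on a string counts characters, including spaces and newlines
character_total = len(sample)

# split() with no arguments breaks the text on any run of whitespace
word_total = len(sample.split())

# splitlines() breaks the text at line endings
line_total = len(sample.splitlines())

print(character_total, "characters,", word_total, "words,", line_total, "lines")
```

Replacing `sample` with `book` would give the corresponding numbers for the whole text.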

In [None]:
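# split the book by chapter
import re  # imported here so this cell does not rely on the helper code

# raw string, so that \s reaches the regular expression engine unchanged
chapters = re.split(r"CHAPTER\s+[IVXLCDM]+.", book)

# strip off any whitespace at the very beginning and very end of each chapter.
chapters = [chapter.strip() for chapter in chapters]

# replace newlines with spaces
chapters = [re.sub("\n", " ", c) for c in chapters]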

# select only chapters that have more than 1000 characters (to exclude table of contents, title, etc.)
chapters = [c for c in chapters if len(c)>1000]
 
# number of chapters
print(len(chapters), " chapters")
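
To see what each step of the chapter split does, here is a small sketch on a made-up text instead of the real book. The sample string and the shortened length threshold (20 instead of 1000) are assumptions chosen so the example stays tiny.

```python
import re

# Made-up text standing in for the book (not the real Gutenberg file)
sample = (
    "Title page\n\n"
    "CHAPTER I. Down the Rabbit-Hole\nAlice was beginning to get very tired.\n\n"
    "CHAPTER II. The Pool of Tears\nCuriouser and curiouser!\n"
)

# Split wherever "CHAPTER", whitespace, a Roman numeral, and one more character appear
parts = re.split(r"CHAPTER\s+[IVXLCDM]+.", sample)

# Trim surrounding whitespace and turn newlines into spaces
parts = [re.sub("\n", " ", p.strip()) for p in parts]

# Keep only the longer pieces; this drops the short "Title page" fragment
toy_chapters = [p for p in parts if len(p) > 20]

print(len(toy_chapters), "chapters")
for chapter in toy_chapters:
    print(chapter)
```

The piece before the first heading is short, so the length filter removes it. That is the same trick the notebook uses to drop the title and table of contents.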

### 3. Create a dataframe by selecting only nouns, proper nouns, verbs, and adjectives per chapter

We just printed the entire book on the screen, but it is in an unstructured format. The content is easier to analyze in a table.

Run the following cells to create a dataframe that describes each word in the book. Its columns are:

- **text**: actual word
- **part-of-speech**:  ADJ, PROPN, VERB, or NOUN
- **lemma**: base (dictionary) form of the word
- **chapter**: chapter number

In [None]:
# running this cell will take 3-5 mins!!!

#create a dataframe from the book
book_df = get_book_df(chapters)

In [None]:
# show first 5 rows of the dataframe
book_df.head()

In [None]:
# excluding lemma equal to '’s' and '’'
book_df = book_df[(book_df["lemma"]!='’s') & (book_df["lemma"]!='’')]

# how many rows (individual words) and columns do we have?
book_df.shape

Now everything is set up for text crunching. Your group can go through the *Alice's Adventures in Wonderland* analysis below and work on challenges. 

**While working on the challenges, feel free to add new code/markdown cells as needed.**

## Part A: Total number of adjectives, nouns, proper nouns, and verbs in the book

Let us count the number of adjectives, nouns, proper nouns, and verbs in the book. These categories are known as *part-of-speech tags*. Would it be practical to count them by hand?

In [None]:
# group by "part-of-speech" column and count the number of rows
counts_by_part_of_speech = book_df.groupby("part-of-speech").size()

# create additional column - count
counts_by_part_of_speech = counts_by_part_of_speech.reset_index(name="count")

counts_by_part_of_speech 
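
The group-and-count pattern is easier to follow on a tiny table. The sketch below uses a hand-made dataframe with the same column names as `book_df`. The rows are invented, and the `percent` column is an extra illustration rather than part of the notebook.

```python
import pandas as pd

# Tiny hand-made table with the same column names as book_df (invented rows)
toy_df = pd.DataFrame({
    "text": ["Alice", "fell", "tired", "Rabbit", "ran", "little", "hole"],
    "part-of-speech": ["PROPN", "VERB", "ADJ", "PROPN", "VERB", "ADJ", "NOUN"],
    "chapter": [1, 1, 1, 1, 1, 2, 2],
})

# size() counts the rows in each group; reset_index() turns the result into a table
toy_counts = toy_df.groupby("part-of-speech").size().reset_index(name="count")

# Express each count as a percentage of all rows
toy_counts["percent"] = 100 * toy_counts["count"] / toy_counts["count"].sum()

print(toy_counts)
```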

In [None]:
# create a pie chart
counts_by_part_of_speech.iplot(kind="pie",values="count",labels="part-of-speech")

### Challenge: 
 - If you change `groupby("part-of-speech")` to `groupby("chapter")`, what does it give you? Can you create a pie chart showing the percentage of all part-of-speech tags in each chapter?


## Part B: Number of adjectives, nouns, proper nouns, and verbs per chapter

Let us count each of the part-of-speech tags individually in each of the chapters.

In [None]:
# call function to get total number of all parts of speech per chapter - it is loaded from the helper code at the top of the notebook
speech_parts_by_chapter = get_speechparts_by_chapter(book_df)

speech_parts_by_chapter
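
The helper's source is not shown in this notebook, so the following is only a guess at how a chapter-by-tag table of this shape could be built. `speechparts_by_chapter_sketch` is a hypothetical stand-in, not the actual `get_speechparts_by_chapter`, and the rows are invented.

```python
import pandas as pd

# Invented rows with the same column names as book_df
toy_df = pd.DataFrame({
    "part-of-speech": ["NOUN", "VERB", "NOUN", "ADJ", "VERB", "VERB"],
    "chapter": [1, 1, 1, 2, 2, 2],
})


def speechparts_by_chapter_sketch(df):
    """Hypothetical stand-in: one row per chapter, one column per tag."""
    # crosstab() counts how often each (chapter, tag) pair occurs
    return pd.crosstab(df["chapter"], df["part-of-speech"])


toy_table = speechparts_by_chapter_sketch(toy_df)
print(toy_table)
```

Combinations that never occur, such as adjectives in chapter 1 of this toy table, show up as 0.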

In [None]:
# new kind of plot - area
speech_parts_by_chapter.iplot(kind="area", fill=True, xTitle="Chapter", yTitle="Count")

### Challenges:
 - Experiment with plots: Modify `iplot(kind="area",fill=True)` to `iplot()`, `iplot(kind="bar")` or  `iplot(kind = "bar",barmode="stack")`. 

 - Which type of plot can better visualize the chapter with the largest number of verbs?

An alternative way to find the chapter with the maximum number of verbs is **sorting**:

In [None]:
# sort_values() function - sorts by a column or set of columns
speech_parts_by_chapter.sort_values("VERB",ascending=False)
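
As a small illustration of reading the answer off a sorted table, here is a sketch with invented counts. It also shows `idxmax()`, a pandas method that returns the index label of the largest value directly.

```python
import pandas as pd

# Invented counts, indexed by chapter number
toy_table = pd.DataFrame(
    {"NOUN": [120, 95, 140], "VERB": [200, 230, 180]},
    index=pd.Index([1, 2, 3], name="chapter"),
)

# After a descending sort, the first row belongs to the chapter with the most verbs
by_verbs = toy_table.sort_values("VERB", ascending=False)
top_chapter_sorted = by_verbs.index[0]

# idxmax() gives the same index label without sorting
top_chapter_idxmax = toy_table["VERB"].idxmax()

print(top_chapter_sorted, top_chapter_idxmax)
```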

### Challenges:

 - Find the chapter that has the most **NOUN**s
 - Find the chapter that has the **fewest** adjectives
 - Plot a grouped bar chart to visualize nouns and adjectives for each chapter
 - Try two new kinds of plots - [boxplots](https://www.mathsisfun.com/definitions/box-and-whisker-plot.html) and [histograms](https://www.mathsisfun.com/data/histograms.html). Can you figure out how to interpret them?
     - use `iplot(kind="box")`
     - use `iplot(kind="histogram",subplots=True)`

## Part C: Top 10 most common words

Let us find the 10 most used words in the book. Would that even be feasible without a computer?

In [None]:
# call function to count the number of rows for every "lemma" - it is loaded from the helper code at the top of the notebook
word_counts = get_counts(book_df, "lemma")

# print top 10 most frequent words on the screen
word_counts.head(10)
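
The helper's source is not shown here, so this is only one plausible way to count how often each value appears in a column. `counts_sketch` is a hypothetical stand-in, not the actual `get_counts`, and the rows are invented.

```python
import pandas as pd

# Invented rows with the same column names as book_df
toy_df = pd.DataFrame({
    "text": ["said", "says", "say", "Alice", "Alice", "cats"],
    "lemma": ["say", "say", "say", "Alice", "Alice", "cat"],
})


def counts_sketch(df, column):
    """Hypothetical stand-in: frequency of each value, most frequent first."""
    # value_counts() sorts from most to least frequent by default
    return df[column].value_counts().rename("count").to_frame()


lemma_counts = counts_sketch(toy_df, "lemma")
text_counts = counts_sketch(toy_df, "text")

print(lemma_counts.head(10))
print(text_counts.head(10))
```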

### Challenges:
 - Use "text" column instead of "lemma". Do you get different results? Why?
 - Visualize the results using the plot of your choice.

## Part D: Top 10 most common adjectives

Let us extract the 10 most used adjectives in the book.

In [None]:
# subset only to adjectives
adjectives = book_df[book_df["part-of-speech"]=="ADJ"]

adjectives.head()

In [None]:
# call function to count the number of adjectives
adjective_counts = get_counts(adjectives, "lemma")

adjective_counts.head()

In [None]:
# visualize the top 10 adjectives
adjective_counts.head(10).iplot(kind="bar",xTitle="Lemma",yTitle="Count")

### Challenges
 - Similar to words and adjectives, can you find the top 10 most common nouns and verbs?
 - Plot the results using the chart type of your choice.

## Part E: For the top 15 most common proper nouns, how does the number vary from chapter to chapter?

Now that we know how to find the most frequent words in the book, let us analyze how the top 15 proper nouns vary from chapter to chapter.

In [None]:
# subset with only proper nouns
propnouns = book_df[book_df["part-of-speech"]=="PROPN"]

propnouns.head()

In [None]:
# how many of the most frequent proper nouns do we want to analyse
num_words = 15

# call function to count the number of proper nouns 
top_propnouns = get_counts(propnouns, "lemma")

# get the row names(index) for top proper nouns 
top_propnouns = top_propnouns.head(num_words).index

# transform them into list
top_propnouns = list(top_propnouns)

# print on the screen
top_propnouns

In [None]:
# subset with only the top proper nouns
character_by_chapter = book_df[book_df["lemma"].isin(top_propnouns)]

character_by_chapter.head()

In [None]:
# what is the distribution of top proper nouns per chapter?
# call function to form resulting dataframe - it is loaded from the helper code at the top of the notebook
counts_by_chapter = get_counts_by_chapters(character_by_chapter)

# display on the screen
counts_by_chapter.head()
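
To make the shape of this table concrete, here is a sketch on invented rows. `counts_by_chapters_sketch` is a hypothetical stand-in for `get_counts_by_chapters`, whose actual implementation lives in the helper code and may differ.

```python
import pandas as pd

# Invented rows with the same column names as book_df
toy_df = pd.DataFrame({
    "lemma": ["Alice", "Rabbit", "Alice", "Queen", "Alice", "Queen", "Dinah"],
    "chapter": [1, 1, 2, 2, 2, 2, 1],
})

# Keep only the rows whose lemma is in the chosen list, as in the cell with isin() above
top_names = ["Alice", "Queen"]
subset = toy_df[toy_df["lemma"].isin(top_names)]


def counts_by_chapters_sketch(df):
    """Hypothetical stand-in: one row per chapter, one column per lemma."""
    return pd.crosstab(df["chapter"], df["lemma"])


toy_counts_by_chapter = counts_by_chapters_sketch(subset)
print(toy_counts_by_chapter)
```

Each column of such a table becomes one colored segment in a stacked bar chart.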

In [None]:
# who are the main characters in every chapter?
# using colors20 to extend the default number of colors
counts_by_chapter.iplot(kind="bar",barmode = "stack", xTitle="Chapter",yTitle="Counts",colors=colors20)

### Challenges:
 - Change the number of proper nouns (change `num_words`) to any other positive number and visualize how the bar chart changes.
 - Repeat the exercise (i.e. Part E) for adjectives, nouns, and verbs. Can you guess the storyline of one of the chapters based on these plots?

## Part F: Explore "The Adventures of Tom Sawyer" (optional)


"The Adventures of Tom Sawyer" is also available from Project Gutenberg, and a copy is stored in the cloud storage. You can repeat the hackathon challenges with this book and create visualizations. **However, note that this section is not mandatory and will not be part of the final evaluation.**

Run the following code cell to download the book from the cloud storage into this notebook.

In [None]:
# file name for the book
tom_filename = "tom.txt"

# copying book from cloud object storage
tom_url="https://swift-yeg.cloud.cybera.ca:8080/v1/AUTH_d22d1e3f28be45209ba8f660295c84cf/hackaton/tom.txt"
urllib.request.urlretrieve(tom_url, tom_filename)
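
To reuse the earlier steps on a second book, the reading and chapter-splitting code can be wrapped in one function. This is a rough sketch: `load_chapters` is a hypothetical helper that is not part of the provided helper code. It assumes the file is UTF-8 encoded and that its chapter headings follow the same `CHAPTER` plus Roman numeral pattern, which is worth checking by printing part of the file first.

```python
import re


def load_chapters(path, min_length=1000):
    """Read a plain-text book and return its chapters as a list of strings.

    Hypothetical helper; assumes UTF-8 text and "CHAPTER <Roman numeral>" headings.
    """
    with open(path, "r", encoding="utf-8") as text_file:
        text = text_file.read()

    # Same steps as in the "Getting ready" section
    pieces = re.split(r"CHAPTER\s+[IVXLCDM]+.", text)
    pieces = [re.sub("\n", " ", piece.strip()) for piece in pieces]

    # Short pieces are most likely the title or the table of contents
    return [piece for piece in pieces if len(piece) > min_length]


# Possible usage once the download cell above has run:
# tom_chapters = load_chapters("tom.txt")
# tom_df = get_book_df(tom_chapters)
```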

## Summary

This notebook analyzes **Alice's Adventures in Wonderland** with the help of Python code cells. The book is obtained from Project Gutenberg, and part-of-speech tags are counted for the whole book as well as for each chapter. The most common words are also identified, and related challenges are posed along the way.

By taking part in this hackathon and completing these challenges, you learned how to analyze a large dataset that would be impractical to process by hand, how to create visualizations and, most importantly, how to apply [*computational thinking*](https://en.wikipedia.org/wiki/Computational_thinking), which can be used to solve many kinds of problems.

![alt text](https://github.com/callysto/callysto-sample-notebooks/blob/master/notebooks/images/Callysto_Notebook-Banners_Bottom_06.06.18.jpg?raw=true)