DATA 8<br>
1.3 Plotting the Classics<br><br>

Instructions:
Your assignment is to read the narrative and run all of the existing code step by step, examining the output for two texts (this matches the work in Data 8, chapter 1).  At the end you will practice on another e-book from Project Gutenberg.  Homework instructions will guide you in adding the code required to examine the new book.  Save a copy of the file to YOUR GitHub site for submission.
<br>

In this example, we will explore statistics for two classic novels: The Adventures of Huckleberry Finn by Mark Twain, and Little Women by Louisa May Alcott. The text of any book can be read by a computer at great speed. Books published before 1923 are currently in the public domain, meaning that everyone has the right to copy or use the text in any way. Project Gutenberg is a website that publishes public domain books online. Using Python, we can load the text of these books directly from the web.<br><br>

First we need to set up our environment and import a few packages and related modules:<br>
a. datascience, the package used by the Data 8 text, which provides useful functions such as Table<br>
b. Pandas for tabular data manipulation and analysis<br>
c. NumPy for working with arrays<br>
d. matplotlib for plotting<br>
e. warnings to provide warning control<br>
f. urllib (urlopen) to fetch urls<br>
g. re for regular expression operations<br><br>

STEP 1: Place your cursor (click) in the code cells and click on the triangle to the left of the code to execute.  Some code blocks generate many messages!  You can clear these by clicking on the x where the messages are displayed.


In [None]:
# STEP 1:  Install datascience first, because it does not come preinstalled in our programming environment.
# Optional reading on installing packages from Jupyter:
# https://jakevdp.github.io/blog/2017/12/05/installing-python-packages-from-jupyter/
# Running pip through sys.executable installs the package into the current Jupyter kernel
# (an alternative to a plain "!pip install datascience").
import sys
!{sys.executable} -m pip install datascience
# After this runs, you can clear the messages by clicking the x (the person icon changes to an x when the cursor hovers over it).

In [None]:
#STEP 1:  continued
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

from urllib.request import urlopen 
import re
def read_url(url):
    # Download the page, decode the bytes as text, and collapse each run of whitespace into one space
    return re.sub(r'\s+', ' ', urlopen(url).read().decode())
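
To see what the whitespace substitution inside read_url does without downloading anything, here is a minimal sketch. The sample string is made up for illustration and is not taken from either book.

```python
import re

# Hypothetical sample standing in for raw downloaded text (an assumption, not real book text).
raw = "CHAPTER I.\r\n\r\nYou  don't know\tabout me\n"

# Same substitution as in read_url: every run of whitespace
# (spaces, tabs, line breaks) becomes a single space.
normalized = re.sub(r'\s+', ' ', raw)
print(repr(normalized))
```

Note that the trailing line break also becomes a single space, so the cleaned text can end with a space.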

STEP 2:  Running the code below will allow us to access https://www.inferentialthinking.com to read two books fast!  We are inputting Huck Finn and Little Women.  Remember to read the comments included with the code!  They start with "#".

In [None]:
# STEP 2:  Read two books, fast!
# No output yet, this stores the text in the string variables
# huck_finn_text, huck_finn_chapters, little_women_text and little_women_chapters

huck_finn_url = 'https://www.inferentialthinking.com/data/huck_finn.txt'
huck_finn_text = read_url(huck_finn_url)
huck_finn_chapters = huck_finn_text.split('CHAPTER ')[44:]

little_women_url = 'https://www.inferentialthinking.com/data/little_women.txt'
little_women_text = read_url(little_women_url)
little_women_chapters = little_women_text.split('CHAPTER ')[1:]
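
The split step can be tried on a tiny, made-up "book" to see why a slice follows it. This is only a sketch: the sample text below is hypothetical. The slice [1:] drops the piece that comes before the first 'CHAPTER '. The larger offset used for Huck Finn presumably skips front matter in which 'CHAPTER ' also appears, though that is an assumption about the file rather than something shown here.

```python
# Hypothetical miniature "book"; the real texts are far longer.
sample_text = "Front matter. CHAPTER I. First chapter text. CHAPTER II. Second chapter text."

pieces = sample_text.split('CHAPTER ')
# pieces[0] is everything before the first 'CHAPTER ', so [1:] drops it.
sample_chapters = pieces[1:]
print(sample_chapters)

# split is case-sensitive: 'Chapter ' does not match 'CHAPTER ',
# so nothing is split and the result is a single piece.
print(len(sample_text.split('Chapter ')))
```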

In [None]:
# STEP 2:  print the text in the variable huck_finn_chapters
huck_finn_chapters

In [None]:
# STEP 2:  Create a table to display huck_finn_chapters in a more desirable format
Table().with_column('Chapters', huck_finn_chapters)

STEP 3:  Time to explore!  Think about what we have already done with a few lines of code!  Run the code blocks below (read the comments) and learn more about the text in Huck Finn!

In [None]:
# STEP 3:  this creates an array of counts for the number of times the name "Tom" appears in each of the chapters.
np.char.count(huck_finn_chapters, 'Tom')

In [None]:
# STEP 3:  this creates an array of counts for the number of times the name "Jim" appears in each of the chapters.
np.char.count(huck_finn_chapters, 'Jim')
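
np.char.count applies the string count operation to each chapter in turn. A plain-Python sketch of the same idea, using three made-up "chapters" (hypothetical text, not from the novel), looks like this:

```python
# Hypothetical stand-ins for chapters; not text from the novel.
sample_chapters = [
    "Tom said hello. Tom left.",
    "Jim stayed on the raft.",
    "Tom and Jim talked.",
]

# One count per chapter, in chapter order.
tom_counts = [chapter.count('Tom') for chapter in sample_chapters]
jim_counts = [chapter.count('Jim') for chapter in sample_chapters]
print(tom_counts)
print(jim_counts)
```

The NumPy version returns an array instead of a list, which is convenient for building table columns.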

In [None]:
# STEP 3:  Let's display this information in a more user friendly manner.  
counts = Table().with_columns([
    'Tom', np.char.count(huck_finn_chapters, 'Tom'),
    'Jim', np.char.count(huck_finn_chapters, 'Jim'),
    'Huck', np.char.count(huck_finn_chapters, 'Huck'),
])
counts

STEP 4:  We can do better than a simple table.  How about a plot that shows how the counts or name mentions accumulate over the course of the book?
Click and run the code blocks associated with STEP 4 and marvel at how incredibly cool Python is!

In [None]:
# STEP 4:  Remember, we have already counted how many times the names Jim, Tom, and Huck appear in each chapter.
# This information is stored in the table "counts".

# In the code, we will plot the cumulative counts:
# how many times in Chapter 1, how many times in Chapters 1 and 2, and so on.

cum_counts = counts.cumsum().with_column('Chapter', np.arange(1, 44, 1))
cum_counts.plot(column_for_xticks=3)
plots.title('Cumulative Number of Times Name Appears');
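
A cumulative count is a running total: each entry is the sum of all per-chapter counts up to and including that chapter. The sketch below uses itertools.accumulate from the standard library as a stand-in for the table's cumsum, with made-up counts.

```python
from itertools import accumulate

# Hypothetical per-chapter counts for one name (not real data from the book).
per_chapter = [2, 0, 1, 3]

# Running total: chapter 1, chapters 1-2, chapters 1-3, chapters 1-4.
running_total = list(accumulate(per_chapter))
print(running_total)
```

A chapter with no mentions leaves the running total flat, which appears as a level stretch in the plot.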

STEP 5:  You know what is next!  We have Little Women ready to go - we have already read the text and we will run similar code on this classic. In order to condense the instructions, we are labeling all of the code for Little Women as "STEP 5".

In [None]:
# STEP 5: The chapters of Little Women

Table().with_column('Chapters', little_women_chapters)

In [None]:
# STEP 5: Counts of names in the chapters of Little Women

people = ['Amy', 'Beth', 'Jo', 'Laurie', 'Meg']
people_counts = {pp: np.char.count(little_women_chapters, pp) for pp in people}

counts = Table().with_columns([
        'Amy', people_counts['Amy'],
        'Beth', people_counts['Beth'],
        'Jo', people_counts['Jo'],
        'Laurie', people_counts['Laurie'],
        'Meg', people_counts['Meg']
    ])
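
One caveat when reading these counts: counting a name this way counts substrings, so a short name such as 'Jo' is also found inside longer words such as 'John'. The sketch below, on a made-up sentence, compares the substring count with a whole-word count based on the re module. The whole-word version is an illustrative alternative, not the method used in this notebook.

```python
import re

# Hypothetical sample text written for this illustration.
sample = "Jo laughed. John Brooke smiled at Jo. Josephine is her full name."

# Substring count: matches Jo, John, Jo, and Josephine.
substring_count = sample.count('Jo')

# Whole-word count: \b marks a word boundary, so only the standalone "Jo" matches.
whole_word_count = len(re.findall(r'\bJo\b', sample))

print(substring_count, whole_word_count)
```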

In [None]:
# STEP 5: Plot the cumulative counts

cum_counts = counts.cumsum().with_column('Chapter', np.arange(1, 48, 1))
cum_counts.plot(column_for_xticks=5)
plots.title('Cumulative Number of Times Name Appears');

STEP 6:  Now we are going to count the number of characters in each chapter of both books to gain insight into the "length" of the chapters.  We will also count the number of periods in each chapter, which roughly approximates the number of sentences.  Our last plot compares chapter length against the number of periods for Huck Finn and Little Women.

In [None]:
# STEP 6:  In each chapter, count the number of all characters;
# call this the "length" of the chapter.
# Also count the number of periods.

chars_periods_hf = Table().with_columns([
        'HF Chapter Length', [len(s) for s in huck_finn_chapters],
        'Number of Periods', np.char.count(huck_finn_chapters, '.')
    ])
chars_periods_lw = Table().with_columns([
        'LW Chapter Length', [len(s) for s in little_women_chapters],
        'Number of Periods', np.char.count(little_women_chapters, '.')
    ])
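
The two measurements in this step can be checked by hand on a few short strings. This sketch uses made-up "chapters" and plain Python in place of the table. The last string shows why periods only approximate sentences: an abbreviation adds a period without adding a sentence.

```python
# Hypothetical stand-ins for chapters; not text from either novel.
sample_chapters = [
    "It was late. We slept.",
    "Morning came.",
    "Mr. Brooke arrived.",
]

# "Length" is the number of characters, including spaces and punctuation.
lengths = [len(s) for s in sample_chapters]

# Number of periods in each string.
periods = [s.count('.') for s in sample_chapters]

print(lengths)
print(periods)
```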

In [None]:
# STEP 6:  The counts for Huckleberry Finn

chars_periods_hf

In [None]:
# STEP 6:  The counts for Little Women

chars_periods_lw

In [None]:
# STEP 6:  Final plot - compare chapter length against the number of periods for Huck Finn (dark blue) and Little Women (gold)
plots.figure(figsize=(10,10))
plots.scatter(chars_periods_hf[1], chars_periods_hf[0], color='darkblue')
plots.scatter(chars_periods_lw[1], chars_periods_lw[0], color='gold')
plots.xlabel('Number of periods in chapter')
plots.ylabel('Number of characters in chapter');

<h1>HOMEWORK:  </h1>
Follow the instructions given in the comments in each code block.  Add the required code, then test and debug it.  Save to your GitHub site!

In [5]:
# HW STEP 1:  Using the code above as an example, input The Count of Monte Cristo from Project Gutenberg.
# The URL is https://www.gutenberg.org/files/1184/1184-0.txt
# No output yet, you are storing the text in the appropriate variables
#monte_cristo_url
#monte_cristo_text
#monte_cristo_chapters
#HINT:  in the text of the earlier e-books, chapter headings appeared in upper
#case ('CHAPTER ').  In The Count of Monte Cristo, they appear as 'Chapter '.
#Make sure you change the last line of code accordingly!
#ADD CODE BELOW THIS LINE


In [4]:
# HW STEP 2:  print the text in the variable monte_cristo_chapters
# ADD CODE BELOW THIS LINE

In [3]:
# HW STEP 3:  Create a table to display monte_cristo_chapters in a more desirable format
# ADD CODE BELOW THIS LINE

In [2]:
# HW STEP 4: Counts of names in the chapters of The Count of Monte Cristo
# Use the names Mercedes, Edmond, Marseilles, Fernand, Monte Cristo
# Count the number of lines: the hidden (omitted) rows plus the displayed rows, plus 1
# The last line should display counts
# ADD CODE BELOW THIS LINE


In [1]:
# HW STEP 5: Plot the cumulative counts (use the count for the number of chapters)
# ADD CODE BELOW THIS LINE
