# This is a Jupyter Notebook

A Jupyter Notebook is a data-science coding environment that combines three types of content:

  - Narrative: The text describing your analysis
  - Code: The program that does the analysis
  - Results: The output of the program
    
The Jupyter environment was created by Fernando Perez at the University of California, Berkeley.
These ideas are now used in many other tools (e.g., Google Colab).

## Example for Day 1

Note: In this lecture there is a lot of code. You are not expected to know any of this yet, so don't worry. This is just a preview of the things you will see in the next few weeks.

We can use the tools of data science to study text. For example, here we will do some basic analysis of "Adventures of Huckleberry Finn" (by Mark Twain) and "Little Women" (by Louisa May Alcott).

Often the first step in data science is getting the data. The following block suppresses some annoying warnings and defines a function, `read_url`, for downloading text from the web.

In [None]:
# Suppresses annoying warnings
import warnings
warnings.simplefilter('ignore', FutureWarning)

# Defines a function to download text from the web based on a given URL
def read_url(url): 
    from urllib.request import urlopen 
    import re
    return re.sub('\\s+', ' ', urlopen(url).read().decode())
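
The `re.sub('\\s+', ' ', ...)` call inside `read_url` replaces every run of whitespace (spaces, tabs, newlines) with a single space. Here is a minimal sketch of that one step on a made-up string, so nothing needs to be downloaded. The sample text and the names `raw` and `clean` are assumptions for illustration only.

```python
import re

# Made-up sample text containing blank lines, repeated spaces, and a tab
raw = "CHAPTER I.\n\nYou don't know   about me\twithout you have read a book"

# Replace every run of whitespace characters with one space.
# The raw string r'\s+' is the same pattern as '\\s+' in read_url.
clean = re.sub(r'\s+', ' ', raw)
print(clean)
```

After this step the whole text sits on one line, which makes it easier to split and count later.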

Now we download the content of the two novels from the [Inferential Thinking](https://inferentialthinking.com/chapters/intro.html) textbook website.

In [None]:
# This block gets the text of Huckleberry Finn and splits it into its chapters
huck_finn_url = 'https://www.inferentialthinking.com/data/huck_finn.txt'
huck_finn_text = read_url(huck_finn_url)
huck_finn_chapters = huck_finn_text.split('CHAPTER ')[44:]

In [None]:
# This block gets the text of Little Women and splits it into its chapters
little_women_url = 'https://www.inferentialthinking.com/data/little_women.txt'
little_women_text = read_url(little_women_url)
little_women_chapters = little_women_text.split('CHAPTER ')[1:]

The string method `split()` (`str.split` in Python) has converted each novel's text into a list of strings, one long string of
characters for each chapter. So the data type of `huck_finn_chapters` is `list`.

In [None]:
type(huck_finn_chapters)

In [None]:
# For example, we can look at the first chapter, which is at index 0
huck_finn_chapters[0]

In [None]:
# Here is the second chapter.
huck_finn_chapters[1]
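
To see how `split` produces chapter lists like the ones above, here is a small sketch on a made-up miniature "novel". The text and the names `tiny_text`, `pieces`, and `tiny_chapters` are hypothetical. The slice `[1:]` skips the front matter before the first chapter. The `[44:]` used for Huckleberry Finn plays the same role, presumably because the word `CHAPTER` also appears earlier in that file (for example, in a table of contents).

```python
# A made-up miniature "novel": front matter followed by three chapters
tiny_text = ("Title page and notes CHAPTER I. First text. "
             "CHAPTER II. Second text. CHAPTER III. Third text.")

# Split the text wherever the separator 'CHAPTER ' occurs
pieces = tiny_text.split('CHAPTER ')
print(len(pieces))            # front matter plus three chapters

# Drop the front matter at index 0, keeping only the chapters
tiny_chapters = pieces[1:]
print(tiny_chapters[0])
```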

## Working with Tables

A lot of data science is about transforming data, often producing tables that we can more easily analyze. In this class you will use the Berkeley `datascience` module (a Python library) to manipulate and visualize data. In Python, `import *` means "import everything".

In [None]:
from datascience import *

In [None]:
# Create a 1-column table of the Huck Finn chapters
t = Table().with_column('Chapters', huck_finn_chapters)
t

We will explore data by extracting summaries. For example, we might ask: **how often did the names of the main characters of the novel appear in each chapter?** We can use snippets of code to answer questions like this.

In [None]:
# We will use the NumPy module for numerical computing in Python
import numpy as np

In [None]:
# NumPy lets us easily count the occurrences of 'Tom' in each chapter
np.char.count(huck_finn_chapters, 'Tom')

In [None]:
np.char.count(huck_finn_chapters, 'Jim')

In [None]:
np.char.count(huck_finn_chapters, 'Huck')
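
The same call can be tried on a few made-up "chapters" to see what comes back: one count per string. This is only a sketch, and `toy_chapters` is a hypothetical stand-in for the real chapter list. Note that the count is a plain substring match, so a word like "Tomorrow" also counts as a mention of "Tom".

```python
import numpy as np

# Three made-up "chapters"
toy_chapters = ['Tom met Jim. Tom ran.', 'Jim slept.', 'Nobody came.']

# One count per chapter
tom_counts = np.char.count(toy_chapters, 'Tom')
print(tom_counts)

# Substring matching: 'Tomorrow' contains 'Tom', so it is counted too
overlap = np.char.count(['Tomorrow Tom leaves'], 'Tom')
print(overlap)
```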

We can collect the results of our analysis in a new table, `counts`:

In [None]:
counts = Table().with_columns([
    'Tom', np.char.count(huck_finn_chapters, 'Tom'),
    'Jim', np.char.count(huck_finn_chapters, 'Jim'),
    'Huck', np.char.count(huck_finn_chapters, 'Huck'),
])
counts

## We Will Learn to Visualize Data

Let's make a plot of the cumulative counts: How many times in Chapter 1, how many times in Chapters 1 and 2, and so on, was each of the main characters mentioned by name?
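
As a small sketch of what "cumulative" means here, `np.cumsum` turns per-chapter counts into running totals. The numbers below are made up, and the sketch assumes that the table method `cumsum()` used in this notebook does the same thing to each column.

```python
import numpy as np

# Made-up per-chapter counts for one character
per_chapter = np.array([6, 24, 5, 0, 1])

# Each entry is the total of all counts up to and including that chapter
running_total = np.cumsum(per_chapter)
print(running_total)
```

A chapter with no mentions leaves the running total unchanged, which shows up as a flat stretch in a cumulative plot.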

In [None]:
# matplotlib is a comprehensive library for creating visualizations in Python
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

In [None]:
cum_counts = counts.cumsum().with_column('Chapter', np.arange(1, 44, 1))
cum_counts.plot(column_for_xticks="Chapter")
plt.title('Cumulative Number of Times Name Appears');

What can we tell from this visualization? What questions does this raise?

We can explore the characters in Little Women using the same kind of analysis.

In [None]:
# Counts of names in the chapters of Little Women
names = ['Amy', 'Beth', 'Jo', 'Laurie', 'Meg']
mentions = {name: np.char.count(little_women_chapters, name) for name in names}  # makes a dictionary
counts = Table().with_columns([
        'Amy', mentions['Amy'],
        'Beth', mentions['Beth'],
        'Jo', mentions['Jo'],
        'Laurie', mentions['Laurie'],
        'Meg', mentions['Meg']
    ])
counts
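
The `mentions` dictionary above is built with a dictionary comprehension: one key per name, each mapped to that name's per-chapter counts. Here is a minimal sketch of the same pattern on made-up text. The names `toy_chapters`, `toy_names`, and `toy_mentions` are hypothetical, and the plain string method `count` stands in for `np.char.count`.

```python
# Two made-up "chapters" and three names to look for
toy_chapters = ['Jo and Meg talked. Jo laughed.', 'Amy drew while Jo wrote.']
toy_names = ['Amy', 'Jo', 'Meg']

# For every name, build a list with one count per chapter
toy_mentions = {
    name: [chapter.count(name) for chapter in toy_chapters]
    for name in toy_names
}
print(toy_mentions['Jo'])
```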

In [None]:
Table.static_plots()
cum_counts = counts.cumsum().with_column('Chapter', np.arange(1, 48, 1))
cum_counts.plot(column_for_xticks='Chapter')
plt.title('Cumulative Number of Times Name Appears');

What can we learn from the Little Women visualization?

Let's consider another angle on these novels, using the built-in `len()` function.

In [None]:
len('January 8th!')

The built-in `len` function returns the length of its argument. For a string, that is the number of characters, including letters, digits, spaces, and punctuation.

In [None]:
len('abc 123')
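
The next cells apply `len` to every chapter at once using a list comprehension. Here is a small sketch of that pattern on two made-up strings. `toy_chapters` is a hypothetical stand-in for the real chapter lists, and the string method `count` stands in for `np.char.count`.

```python
# Two made-up "chapters"
toy_chapters = ['Call me Huck.', 'It was dark. We ran.']

# Number of characters in each chapter
lengths = [len(s) for s in toy_chapters]

# Number of periods in each chapter
periods = [s.count('.') for s in toy_chapters]

print(lengths, periods)
```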

In [None]:
# In each chapter, count the number of characters (letters, digits, etc.) and the number of periods
# A period typically ends a sentence

# Counts for Huck Finn:
length_hf = Table().with_columns([
    'length', [len(s) for s in huck_finn_chapters], 
    'periods', np.char.count(huck_finn_chapters, '.')
])
length_hf

In [None]:
# Counts for Little Women:
length_lw = Table().with_columns([
    'length', [len(s) for s in little_women_chapters], 
    'periods', np.char.count(little_women_chapters, '.')
])
length_lw

What do you notice when you compare the two tables?

In [None]:
# Build a visualization (scatterplot)
plt.figure(figsize=(10,10))
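# x-axis: periods per chapter, y-axis: characters per chapter
plt.scatter(length_hf.column('periods'), length_hf.column('length'), color='darkblue')
plt.scatter(length_lw.column('periods'), length_lw.column('length'), color='gold')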
plt.xlabel('Number of periods in chapter')
plt.ylabel('Number of characters in chapter');