**Note:** In this lecture there is a lot of code. You are not expected to know any of this yet. This is just a preview of the things you will see in the next few weeks. 


## This is a Jupyter Notebook

A Jupyter Notebook is a data-science environment that combines:

1. **Narrative:** The text describing your analysis
2. **Code:** The program that does the analysis
3. **Results:** The output of the program

The Jupyter environment was created by faculty here at Berkeley (Fernando Pérez). These ideas now appear in many other technologies (e.g., Google Colab).


## Our first example: analyzing the text of popular books

We can use the tools of data science to study text. For example, here we will do some basic analysis of *["Adventures of Huckleberry Finn"](https://en.wikipedia.org/wiki/Adventures_of_Huckleberry_Finn)* (by Mark Twain) and *["Little Women"](https://en.wikipedia.org/wiki/Little_Women)* (by Louisa May Alcott).

Often the first step in data science is getting the data. The following is a tiny program to download text from the web. More specifically, what you see below is a **function**. We will talk more about functions later on!

In [None]:
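# A tiny program to download text from the web.
import re
from urllib.request import urlopen

def read_url(url):
    # Download the page, decode the bytes to text, and collapse
    # every run of whitespace (including newlines) into one space.
    return re.sub(r'\s+', ' ', urlopen(url).read().decode())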
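
To see what the whitespace substitution inside `read_url` does without downloading anything, here is a minimal sketch on a made-up string. The variable names `raw` and `cleaned` and the sample text are assumptions for illustration only.

```python
import re

# Hypothetical sample standing in for downloaded text (not from either book).
raw = "Line one\n\nline   two\tend"

# Same substitution as in read_url: each run of whitespace becomes one space.
cleaned = re.sub(r'\s+', ' ', raw)
print(cleaned)
```

Collapsing newlines and tabs this way turns the whole book into one long line of text, which makes it easier to split on a marker such as `'CHAPTER '`.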

Here we download the books, which are actually hosted on the Data 8 textbook website.

In [None]:
huck_finn_url = 'https://www.inferentialthinking.com/data/huck_finn.txt'
huck_finn_text = read_url(huck_finn_url)
huck_finn_chapters = huck_finn_text.split('CHAPTER ')[44:]

In [None]:
little_women_url = 'https://www.inferentialthinking.com/data/little_women.txt'
little_women_text = read_url(little_women_url)
little_women_chapters = little_women_text.split('CHAPTER ')[1:]
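
The `split` plus slice pattern used above can be tried on a miniature example. This is only a sketch: `tiny_book` is a made-up string, not text from either novel.

```python
# Hypothetical miniature "book": a title, then chapters marked by 'CHAPTER '.
tiny_book = "A TINY BOOK CHAPTER I. It begins. CHAPTER II. It ends."

pieces = tiny_book.split('CHAPTER ')

# pieces[0] is whatever came before the first marker (here, the title),
# so slicing from index 1 keeps only the chapters.
chapters = pieces[1:]
print(len(pieces))
print(chapters)
```

The larger offset of 44 used for Huckleberry Finn presumably skips earlier pieces in the same way, for instance front matter that repeats the marker. That is an inference from the code, not something stated here.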

Let's look at the chapter text of Huckleberry Finn, starting with the first chapter:

In [None]:
huck_finn_chapters

## Tables

- A lot of data science is about transforming data, often in order to produce **tables**, a widely used data structure that makes data easier to analyze.
- In this class you will use the `datascience` library (created specifically for this course!!) to manipulate data.

In [None]:
import datascience
datascience.__version__

In [None]:
from datascience import *

In [None]:
Table().with_column('Chapters', huck_finn_chapters)

## We will learn to summarize data

We will explore data by extracting summaries. For example, we might ask how often each character appears in each chapter. We can use snippets of code to answer these questions.

In [None]:
import numpy as np

In [None]:
np.char.count(huck_finn_chapters, 'Tom')

In [None]:
np.char.count(huck_finn_chapters, 'Jim')
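
To make the counting step concrete, here is a small sketch of `np.char.count` on three made-up "chapters" (the strings are assumptions, not text from the novel). It returns one count per string and counts substrings rather than whole words, so `'Tom'` also matches inside `'Tomorrow'`.

```python
import numpy as np

# Hypothetical mini-chapters, not text from the novel.
chapters = ['Tom met Jim. Tom left.', 'Jim waited until Tomorrow.', 'Nobody came.']

# One count per chapter; matching is by substring, not by whole word.
tom_counts = np.char.count(chapters, 'Tom')
print(tom_counts)
```

Because of the substring matching, counts like these are best read as approximate mention counts.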

We can convert the results of our analysis into more tables.

In [None]:
counts = Table().with_columns([
    'Tom', np.char.count(huck_finn_chapters, 'Tom'),
    'Jim', np.char.count(huck_finn_chapters, 'Jim'),
    'Huck', np.char.count(huck_finn_chapters, 'Huck'),
])
counts

## We will learn to visualize data

- How many times is each character mentioned in Chapter 1, how many times in Chapters 1 and 2, and so on?
- As we saw above, we could answer this with a table, but there are a lot of chapters!! Let's try something else.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
cum_counts_tom = np.cumsum(counts.column("Tom"))
cum_counts_jim = np.cumsum(counts.column("Jim"))
cum_counts_huck = np.cumsum(counts.column("Huck"))
cumulative_table = Table().with_columns(
    'Chapter', np.arange(1, 44, 1),
    'Tom', cum_counts_tom,
    'Jim', cum_counts_jim,
    'Huck', cum_counts_huck
)
cumulative_table.plot(column_for_xticks='Chapter')
plt.title('Cumulative Number of Times Name Appears')
plt.show()

Here, we have plotted what we call *cumulative counts*.
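
As a minimal sketch of what "cumulative" means here, `np.cumsum` turns per-chapter counts into running totals. The numbers below are made up for illustration.

```python
import numpy as np

# Hypothetical mention counts for four chapters.
per_chapter = np.array([3, 0, 2, 5])

# Entry i is the total number of mentions in chapters 1 through i + 1.
running_total = np.cumsum(per_chapter)
print(running_total)
```

A chapter with no mentions leaves the running total flat, which shows up as a horizontal stretch in the plot.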

What can we tell from this visualization?  What questions does this raise about the roles of Jim, Tom and Huck in the book?

In [None]:
# The chapters of Little Women
Table().with_column('Chapters', little_women_chapters)

We can explore the characters in Little Women using the same kind of analysis.

In [None]:
# Counts of names in the chapters of Little Women
names = ['Amy', 'Beth', 'Jo', 'Laurie', 'Meg']
mentions = {name: np.char.count(little_women_chapters, name) for name in names}
counts = Table().with_columns([
        'Amy', mentions['Amy'],
        'Beth', mentions['Beth'],
        'Jo', mentions['Jo'],
        'Laurie', mentions['Laurie'],
        'Meg', mentions['Meg']
    ])

In [None]:
# Plot the cumulative counts
cum_counts_amy = np.cumsum(counts.column("Amy"))
cum_counts_beth = np.cumsum(counts.column("Beth"))
cum_counts_jo = np.cumsum(counts.column("Jo"))
cum_counts_laurie = np.cumsum(counts.column("Laurie"))
cum_counts_meg = np.cumsum(counts.column("Meg"))

cumulative_table = Table().with_columns(
    'Chapter', np.arange(1, 48, 1),
    'Amy', cum_counts_amy,
    'Beth', cum_counts_beth,
    'Jo', cum_counts_jo,
    'Laurie', cum_counts_laurie,
    'Meg', cum_counts_meg
)

cumulative_table.plot(column_for_xticks='Chapter')
plt.title('Cumulative Number of Times Names Appear in Little Women')
plt.show()

We can use interactive tools as well!

In [None]:
# Plot the cumulative counts
Table.interactive_plots()
cumulative_table.plot(column_for_xticks=0)

## Visualizing multiple variables

- How long are the chapters in a book?
- How many sentences are in a chapter? Counting the periods gives us a rough estimate.
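
A quick sketch of why the period count is only an estimate of the sentence count, using a made-up sentence pair (the text and variable names are assumptions for illustration):

```python
# Hypothetical text with two sentences and one abbreviation.
sample = "Mr. March was away. The girls stayed home."

length = len(sample)               # number of characters, spaces included
period_count = sample.count('.')   # counts every '.', including the one in "Mr."
print(length, period_count)
```

Abbreviations push the period count above the number of sentences, while sentences ending in `?` or `!` are missed, so the period count is a rough proxy.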

You don't need to worry about understanding the code below for today!!


In [None]:
len(read_url(huck_finn_url))

In [None]:
# In each chapter, count the number of all characters;
# call this the "length" of the chapter.
# Also count the number of periods.

length_hf = Table().with_columns([
        'Length', [len(s) for s in huck_finn_chapters],
        'Periods', np.char.count(huck_finn_chapters, '.')
    ])
length_lw = Table().with_columns([
        'Length', [len(s) for s in little_women_chapters],
        'Periods', np.char.count(little_women_chapters, '.')
    ])

In [None]:
# The counts for Huckleberry Finn
length_hf

In [None]:
# The counts for Little Women
length_lw

We now have a table for each book giving us:

- the length of each chapter
- the number of periods in each chapter

We might consider examining how these two variables are related. Below is what is called a **scatter plot** (we will talk about this plot in more depth later on)!

In [None]:
Table.static_plots()
plt.figure(figsize=(10,10))
plt.scatter(length_hf.column('Periods'), length_hf.column('Length'), color='darkblue')
plt.scatter(length_lw.column('Periods'), length_lw.column('Length'), color='gold')
plt.xlabel('Number of periods in chapter')
plt.ylabel('Number of characters in chapter');
plt.title('Relationship between numbers of characters and periods in a chapter');

This sub-example illustrates the relationship between different facets of our course: namely, exploration and prediction.