<header style="width: 38.5%;">
    <p><strong>City College of San Francisco</strong></p>
    <hr>
    <p>MATH 108 - Foundations of Data Science</p>
</header>

# Lecture 01: Introduction

Associated Textbook Sections: [1.0, 1.1, 1.2, 1.3](https://inferentialthinking.com/chapters/01/what-is-data-science.html)

---

## Overview

* [What is Data Science?](#What-is-Data-Science?)
* [Jupyter Notebooks](#Jupyter-Notebooks)
* [Reading Books](#Reading-Books)
* [Adventures of Huckleberry Finn](#Adventures-of-Huckleberry-Finn)
* [Little Women](#Little-Women)
* [Exploring Sentence Structure](#Exploring-Sentence-Structure)
* [Ask a Question](#Ask-a-Question)

---

## What is Data Science?

---

### Exploration, Inference, and Prediction

Learning about the world from data using computation:

<img src="./what_is_ds.png" width="80%" alt="Data Science categorized as Exploration, Inference, and Prediction.">

---

### Job Outlook

According to [the US Bureau of Labor Statistics](https://www.bls.gov/ooh/math/data-scientists.htm):
* Projected 35% increase in the number of data science positions from 2021 to 2031
* The median annual wage for data scientists was $103,500 (May 2022)

---

## Jupyter Notebooks

A [Jupyter Notebook](https://jupyter.org/) is an application that combines:

* Narrative: The text describing your analysis
* Code: The program that does the analysis
* Results: The output of the program

The Jupyter environment was created by faculty ([Fernando Perez](https://bids.berkeley.edu/people/fernando-perez)) at UC Berkeley.

---

## Reading Books

---

### Python "Reads" Two Books

* [Project Gutenberg](https://www.gutenberg.org/) contains public domain book data.
* Load the text from the Adventures of Huckleberry Finn and Little Women.

In [None]:
with open("./huck_finn.txt") as huck_finn, \
open("./little_women.txt") as little_women:
    huck_finn_chapters = huck_finn.read().split('CHAPTER ')[44:]
    little_women_chapters = little_women.read().split('CHAPTER ')[1:]

Notice that this code did not produce any output.
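
To see why slices such as `[44:]` and `[1:]` are needed, it may help to try `split` on a tiny stand-in for a book. The sketch below does not use the actual Project Gutenberg text; `sample_text` and its contents are made up for illustration. Every occurrence of `'CHAPTER '` is a split point, including the ones in a table of contents, so the pieces before the first real chapter have to be skipped.

```python
# Hypothetical miniature "book": front matter, a table of contents, and two chapters.
sample_text = (
    "Title Page\n"
    "CONTENTS\n"
    "CHAPTER I. CHAPTER II.\n"
    "CHAPTER I. The story begins.\n"
    "CHAPTER II. The story continues.\n"
)

# split() cuts the text at every occurrence of the separator.
pieces = sample_text.split('CHAPTER ')

# pieces[0] is the front matter and pieces[1:3] are table-of-contents entries,
# so the real chapters start at index 3.
sample_chapters = pieces[3:]

print(len(pieces))
print(sample_chapters[0])
```

The right starting index depends on how many times the separator appears before the first chapter, which is presumably why the two books use different slices.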

---

## Adventures of Huckleberry Finn

---

### Preview the Book

View the first 100 characters of Chapter 1.

In [None]:
huck_finn_chapters[0][:100]

This code did produce output!

---

### Import `datascience`

* Data Science often requires transforming data for analysis.
* We will use UC Berkeley's [datascience](https://datascience.readthedocs.io/) library to do this.

---

Import all of the content of the `datascience` library.

In [None]:
from datascience import *

---

Notice that `Table` has a special meaning.

In [None]:
Table

---

### Create a Table

Organize all the text from each chapter of the book within a `Table`.

In [None]:
Table().with_column('Chapters', huck_finn_chapters)

---

### Word Frequency

What can the frequency of words in a story tell us about the story?

* Counting words in a book might seem overwhelming.
* Computers can help with tedious tasks like this.

---

### Import NumPy

* [NumPy](https://numpy.org/) is another widely used library that you will work with in this course.
* This library helps with doing the same calculation on many things at once.

---

Import the `numpy` library and give it the alias `np`.

In [None]:
import numpy as np
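
As a rough illustration of "the same calculation on many things", the sketch below applies one expression to every value in an array. The word counts and the reading speed are invented for the example.

```python
import numpy as np

# Hypothetical word counts for four chapters.
words_per_chapter = np.array([1200, 950, 1430, 1100])

# One expression applies the calculation to every element; no loop is needed.
# Assumes a reading speed of 200 words per minute (an arbitrary figure).
minutes_to_read = words_per_chapter / 200

print(minutes_to_read)
```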

---

### Counting Names

Count the number of times "Tom" and "Jim" appear in the text.

In [None]:
np.char.count(huck_finn_chapters, 'Tom')

In [None]:
np.char.count(huck_finn_chapters, 'Jim')
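
One thing to keep in mind: `np.char.count` counts substrings, not whole words, and it is case-sensitive. The small sketch below uses made-up sentences in place of chapters to show both effects.

```python
import numpy as np

# Made-up sentences standing in for chapters.
sample_chapters = np.array([
    'Tom and Tommy went out.',
    'tom stayed home.',
    'Jim saw Tom.',
])

# Counts non-overlapping occurrences of the substring in each element.
tom_counts = np.char.count(sample_chapters, 'Tom')
print(tom_counts.tolist())  # [2, 0, 1]
```

"Tommy" contributes to the first count and the lowercase "tom" is ignored, so the counts in this lecture are best read as approximations.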

---

Create a table called `counts_hf` that organizes the number of times the names Tom, Jim, and Huck appear in each chapter.

In [None]:
counts_hf = Table().with_columns([
    'Tom', np.char.count(huck_finn_chapters, 'Tom'),
    'Jim', np.char.count(huck_finn_chapters, 'Jim'),
    'Huck', np.char.count(huck_finn_chapters, 'Huck'),
])
counts_hf

---

### Import and Configure Matplotlib

* [Matplotlib](https://matplotlib.org/) is a standard Python visualization library.
* In this course, we will use it indirectly to create visualizations.

---

Import and configure `matplotlib`.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

---

### Visualize a Trend

Visualize how the cumulative count for each person changes over the chapters.

In [None]:
original_labels_hf =  list(counts_hf.labels)
cumulative_values_hf = ([np.cumsum(counts_hf.column(index)) 
                         for index in np.arange(counts_hf.num_columns)])
cumulative_counts_hf = (counts_hf
                        .with_columns(zip(original_labels_hf, cumulative_values_hf))
                        .with_column('Chapter', np.arange(1, 44, 1)))
cumulative_counts_hf.plot('Chapter')
plt.title('Cumulative Number of Times Name Appears')
plt.show()
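
The cumulative values in the plot come from `np.cumsum`, which keeps a running total. A minimal sketch with invented per-chapter counts:

```python
import numpy as np

# Hypothetical counts of one name in five consecutive chapters.
per_chapter = np.array([2, 0, 3, 1, 4])

# Each entry is the sum of all counts up to and including that chapter.
running_total = np.cumsum(per_chapter)
print(running_total.tolist())  # [2, 2, 5, 6, 10]
```

Because counts are never negative, a cumulative curve never decreases. A steep stretch means the name appears often in those chapters, and a flat stretch means it does not appear at all.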

---

## Little Women

---

### Create a Table

Organize the text from each chapter from Little Women within a `Table`.

In [None]:
Table().with_column('Chapters', little_women_chapters)

---

### Counting Names

Count how often the names of some characters in Little Women appear within each chapter.

In [None]:
people_lw = ['Amy', 'Beth', 'Jo', 'Laurie', 'Meg']
counts_lw = {pp: np.char.count(little_women_chapters, pp) for pp in people_lw}

counts_lw = Table().with_columns([
        'Amy', counts_lw['Amy'],
        'Beth',counts_lw['Beth'],
        'Jo', counts_lw['Jo'],
        'Laurie', counts_lw['Laurie'],
        'Meg', counts_lw['Meg']
    ])
counts_lw

---

### Visualize a Trend

Visualize how the cumulative count for each person changes over the chapters.

In [None]:
original_labels_lw =  list(counts_lw.labels)
cumulative_values_lw = ([np.cumsum(counts_lw.column(index)) 
                      for index in np.arange(counts_lw.num_columns)])
cumulative_counts_lw = (counts_lw
                        .with_columns(zip(original_labels_lw, cumulative_values_lw))
                        .with_column('Chapter', np.arange(1, 48, 1)))
cumulative_counts_lw.plot('Chapter')
plt.title('Cumulative Number of Times Name Appears')
plt.show()

---

## Exploring Sentence Structure

---

### Chapter Length and Sentence Length

* Is there a relationship between the length of a chapter and the number of sentences in that chapter?
* To explore this:
    * Count all the characters in each chapter.
    * Count the approximate number of sentences in each chapter.

---

### Create Some Tools

Data Science usually requires us to create tools.

In [None]:
def count_characters(book_by_chapters):
    '''Return the count of the number of characters 
    per chapter of the provided book.'''
    return [len(s) for s in book_by_chapters]

In [None]:
def count_sentences(book_by_chapters):
    '''Return the count of the number of periods, question marks, 
    and exclamation points per chapter of the provided book.'''
    punctuation = ['.', '?', '!']
    num_sentences = np.zeros_like(book_by_chapters, dtype=int)
    for char in punctuation:
        num_sentences += np.char.count(book_by_chapters, char)
    return num_sentences
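
To get a feel for why the sentence count is only approximate, here is a pure-Python sketch of the same idea applied to a single string. The helper `approx_sentences` and the sample text are hypothetical and are not part of the lecture code.

```python
def approx_sentences(chapter):
    '''Approximate the number of sentences in one chapter by
    counting periods, question marks, and exclamation points.'''
    return sum(chapter.count(mark) for mark in '.?!')

# Made-up text; "Mr." adds a period that does not end a sentence.
sample = 'Mr. Finn ran. Did he stop? No!'

print(len(sample))               # number of characters, spaces included
print(approx_sentences(sample))  # 4, although there are only 3 sentences
```

Abbreviations, ellipses, and decimal numbers all inflate the count, which is acceptable here because the goal is a rough comparison between chapters.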

---

### Apply the Tools

Determine the length of every chapter and the number of sentences per chapter of both books.

In [None]:
chapter_lengths_hf = count_characters(huck_finn_chapters)
number_of_sentences_hf = count_sentences(huck_finn_chapters)
chars_sentences_hf = Table().with_columns(
    'Number of Sentences', number_of_sentences_hf,
    'Chapter Length', chapter_lengths_hf,
    'Book', ['Huck Finn'] * len(huck_finn_chapters))

chapter_lengths_lw = count_characters(little_women_chapters)
number_of_sentences_lw = count_sentences(little_women_chapters)
chars_sentences_lw = Table().with_columns(
    'Number of Sentences', number_of_sentences_lw,
    'Chapter Length', chapter_lengths_lw,
    'Book', ['Little Women'] * len(little_women_chapters))

---

### Explore the Results

Show the results for Huck Finn.

In [None]:
chars_sentences_hf

---

Show the results for Little Women.

In [None]:
chars_sentences_lw

---

Combine the two tables into one table.

In [None]:
chars_sentences = chars_sentences_hf.append(chars_sentences_lw)
chars_sentences

---

### Visualizing the Relationship

Visualize the relationship between the length of a chapter and the number of sentences in the same chapter.

In [None]:
chars_sentences.scatter('Number of Sentences', group='Book')
plt.title('Chapter Length vs. Number of Sentences')
plt.show()

---

### Quantifying the Relationship

Compute the ratio of the average chapter length to the average number of sentences per chapter for each book.

In [None]:
ave_sentence_length_hf = (np.average(chapter_lengths_hf) 
                          / np.average(number_of_sentences_hf))
print(f'Huck Finn has an average approximate sentence \
length of {ave_sentence_length_hf} characters.')

In [None]:
ave_sentence_length_lw = (np.average(chapter_lengths_lw) 
                          / np.average(number_of_sentences_lw))
print(f'Little Women has an average approximate sentence \
length of {ave_sentence_length_lw} characters.')
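
A note on what this ratio measures: dividing one average by the other gives the same value as dividing the total number of characters by the total number of sentences, which is generally not the same as averaging each chapter's own ratio. The sketch below uses invented numbers and the standard library's `statistics.mean` in place of `np.average`.

```python
from statistics import mean

# Hypothetical chapter lengths (characters) and sentence counts.
chapter_lengths = [1000, 3000]
sentence_counts = [10, 60]

# Ratio of averages: equivalent to total characters / total sentences.
ratio_of_averages = mean(chapter_lengths) / mean(sentence_counts)
overall_ratio = sum(chapter_lengths) / sum(sentence_counts)

# Average of per-chapter ratios: a different quantity in general.
average_of_ratios = mean(c / s for c, s in zip(chapter_lengths, sentence_counts))

print(round(ratio_of_averages, 2), round(overall_ratio, 2), average_of_ratios)
```

In this toy example the first two values agree (about 57.14) while the average of the per-chapter ratios is 75.0, so it matters which quantity is reported.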

---

## Ask a Question

A big part of this subject is to make observations and ask questions.

* The two averages appear to be similar!
* Is the difference (~ 3 characters) in their averages meaningful?
* Do all books have the same pattern in their sentence structure?
* ...

---

<footer>
    <p>Adapted from <a href="https://data.berkeley.edu/education/courses/data-8">UC Berkeley DATA 8</a> course materials.</p>
    <p>This content is offered under a <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">CC Attribution Non-Commercial Share Alike</a> license.</p>
</footer>