<header style="width: 38.5%;">
    <p><strong>City College of San Francisco</strong></p>
    <hr>
    <p>MATH 108 - Foundations of Data Science</p>
</header>

# Lecture 01: Introduction

Associated Textbook Sections: [1.0, 1.1, 1.2, 1.3](https://inferentialthinking.com/chapters/01/what-is-data-science.html)

---

## Overview

* [What is Data Science?](#What-is-Data-Science?)
* [Analyzing Storylines](#Analyzing-Storylines)
* [Exploring a Relationship](#Exploring-a-Relationship)

---

## What is Data Science?

Learning about the world from data using computation:

<img src="./img/what_is_ds.png" width = 80% alt="Data Science categorized as Exploration, Inference, and Prediction.">

### Job Outlook

The number of data scientist positions is expected to increase by 36% from 2021 to 2031, according to [the US Bureau of Labor Statistics Quick Facts page for the data scientist job role](https://www.bls.gov/ooh/math/data-scientists.htm).

---

## Analyzing Storylines

### Importing Tools

The following code contains things I'll use to run this demonstration. *You don't have to understand how this works for this lecture.*


In [None]:
# The typical setup for the notebooks in this course
from datascience import *
import numpy as np
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
%matplotlib inline

### Python "Reads" Two Books

Load the public domain texts of *Adventures of Huckleberry Finn* and *Little Women*, sourced from [Project Gutenberg](https://www.gutenberg.org/).

In [None]:
with open("./data/huck_finn.txt") as huck_finn, \
     open("./data/little_women.txt") as little_women:
    
    huck_finn_text = huck_finn.read()
    huck_finn_chapters = huck_finn_text.split('CHAPTER ')[44:]

    little_women_text = little_women.read()
    little_women_chapters = little_women_text.split('CHAPTER ')[1:]
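
As a rough illustration of what the `split('CHAPTER ')[1:]` pattern does, here is a minimal sketch on a made-up string (`toy_text` is hypothetical and is not taken from either book). Splitting on the marker leaves whatever comes before the first marker at position 0, so the slice drops it. The larger offset used for Huckleberry Finn presumably skips extra leading pieces in that file in the same way; that is an assumption, since the file itself is not shown here.

```python
# Hypothetical miniature "book"; not real text from either novel.
toy_text = "Title page. CHAPTER I. First part. CHAPTER II. Second part."

pieces = toy_text.split('CHAPTER ')
front_matter = pieces[0]   # everything before the first marker
toy_chapters = pieces[1:]  # one string per chapter

print(front_matter)
print(toy_chapters)
```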

### Adventures of Huckleberry Finn

View the first 100 characters of Chapter 1.

In [None]:
huck_finn_chapters[0][:100]

Organize all the chapter text in a table.

In [None]:
Table().with_column('Chapters', huck_finn_chapters)

Count the number of times "Tom" and "Jim" appear in each chapter.

In [None]:
np.char.count(huck_finn_chapters, 'Tom')

In [None]:
np.char.count(huck_finn_chapters, 'Jim')
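
To make the per-chapter result easier to picture, here is a minimal plain-Python sketch of the same idea: one count per chapter string. The `toy_chapters` list is hypothetical, and the list comprehension with `str.count` stands in for the element-wise behavior of `np.char.count`; it is not the code used in the lecture.

```python
# Hypothetical chapter snippets, for illustration only.
toy_chapters = [
    "Tom and Jim went out. Tom came back.",
    "Jim stayed on the raft.",
    "Nobody was home.",
]

# One count per chapter, mirroring the element-wise result of np.char.count.
tom_counts = [chapter.count('Tom') for chapter in toy_chapters]
jim_counts = [chapter.count('Jim') for chapter in toy_chapters]

print(tom_counts)  # [2, 0, 0]
print(jim_counts)  # [1, 1, 0]
```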

Create a table called `counts` that organizes the number of times the names Jim, Tom, and Huck appear in each chapter.

In [None]:
counts = Table().with_columns([
    'Tom', np.char.count(huck_finn_chapters, 'Tom'),
    'Jim', np.char.count(huck_finn_chapters, 'Jim'),
    'Huck', np.char.count(huck_finn_chapters, 'Huck'),
])
counts

Visualize the counts.

In [None]:
original_labels =  list(counts.labels)
cumulative_values = [np.cumsum(counts.column(index)) for index in np.arange(counts.num_columns)]
cumulative_counts = (counts.with_columns(zip(original_labels, cumulative_values))
                           .with_column('Chapter', np.arange(1, 44, 1)))
cumulative_counts.plot(column_for_xticks=3)
plots.title('Cumulative Number of Times Name Appears');
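
The plot shows running totals rather than raw per-chapter counts. As a small sketch of what a cumulative sum produces, the example below uses `itertools.accumulate` from the standard library in place of `np.cumsum`; the `per_chapter` numbers are made up for illustration.

```python
from itertools import accumulate

# Hypothetical per-chapter counts for one name.
per_chapter = [2, 0, 3, 1]

# Running total: each entry is the sum of all counts up to that chapter.
running_total = list(accumulate(per_chapter))

print(running_total)  # [2, 2, 5, 6]
```

A flat stretch in the running total corresponds to chapters where the name does not appear at all.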

### Little Women

Organize the chapter text of Little Women in a table.

In [None]:
Table().with_column('Chapters', little_women_chapters)

Count the names in the chapters of Little Women.

In [None]:
people = ['Amy', 'Beth', 'Jo', 'Laurie', 'Meg']
people_counts = {pp: np.char.count(little_women_chapters, pp) for pp in people}

counts = Table().with_columns([
        'Amy', people_counts['Amy'],
        'Beth', people_counts['Beth'],
        'Jo', people_counts['Jo'],
        'Laurie', people_counts['Laurie'],
        'Meg', people_counts['Meg']
    ])
counts
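
One caveat worth keeping in mind: these counts match substrings, so a short name such as `'Jo'` is also counted inside longer words that contain it. The sketch below compares substring counting with a whole-word count using a regular expression. The `sentence` is hypothetical, and the whole-word variant is an alternative shown only for comparison; the lecture code does not use it.

```python
import re

# Hypothetical sentence, for illustration only.
sentence = "Jo met John, and Jo was full of joy."

# Substring matching also finds the "Jo" inside "John".
substring_hits = sentence.count('Jo')

# Word boundaries (\b) restrict the match to the standalone name.
whole_word_hits = len(re.findall(r'\bJo\b', sentence))

print(substring_hits)   # 3
print(whole_word_hits)  # 2
```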

Plot the cumulative counts.

In [None]:
original_labels =  list(counts.labels)
cumulative_values = [np.cumsum(counts.column(index)) for index in np.arange(counts.num_columns)]
cumulative_counts = (counts.with_columns(zip(original_labels, cumulative_values))
                           .with_column('Chapter', np.arange(1, 48, 1)))
cumulative_counts.plot(column_for_xticks=5)
plots.title('Cumulative Number of Times Name Appears');

## Exploring a Relationship

### Sentence Structure

In each chapter, count the total number of characters and call this the "length" of the chapter. Also, count the number of periods in each chapter.

In [None]:
chapter_lengths_hf = [len(s) for s in huck_finn_chapters]
chapter_lengths_lw = [len(s) for s in little_women_chapters]

number_of_periods_hf = np.char.count(huck_finn_chapters, '.')
number_of_periods_lw = np.char.count(little_women_chapters, '.')

chars_periods_hf = Table().with_columns(
    'Number of Periods', number_of_periods_hf,
    'Chapter Length', chapter_lengths_hf,
    'Book', ['Huck Finn'] * len(huck_finn_chapters))
    
chars_periods_lw = Table().with_columns(
    'Number of Periods', number_of_periods_lw,
    'Chapter Length', chapter_lengths_lw,
    'Book', ['Little Women'] * len(little_women_chapters))

Notice that there is no output here. 🧐

Show the counts for Huckleberry Finn.

In [None]:
chars_periods_hf

Show the counts for Little Women.

In [None]:
chars_periods_lw

Combine the two tables into one table.

In [None]:
chars_periods = chars_periods_hf.append(chars_periods_lw)
chars_periods

---

### Visualizing the Relationship

Visualize the relationship between the length of a chapter and the number of periods in a chapter.

In [None]:
chars_periods.scatter('Number of Periods', group='Book')

### Recognizing a Pattern

Compute the ratio of the average chapter length to the average number of periods per chapter for each book.

In [None]:
ave_num_chars_per_sentence_hf = np.average(chapter_lengths_hf) / np.average(number_of_periods_hf)
ave_num_chars_per_sentence_hf

In [None]:
ave_num_chars_per_sentence_lw = np.average(chapter_lengths_lw) / np.average(number_of_periods_lw)
ave_num_chars_per_sentence_lw
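
Because both averages are taken over the same set of chapters, the ratio of averages equals total characters divided by total periods, which is why it can be read as an approximate number of characters per sentence. A minimal sketch with made-up numbers (`lengths` and `periods` are hypothetical):

```python
from statistics import mean

# Hypothetical chapter lengths (in characters) and period counts.
lengths = [1200, 900, 1500]
periods = [12, 8, 16]

# Ratio of per-chapter averages, as in the cells above.
ratio_of_averages = mean(lengths) / mean(periods)

# The same quantity computed from book-wide totals.
ratio_of_totals = sum(lengths) / sum(periods)

print(ratio_of_averages)
print(ratio_of_totals)
```

Note that periods only approximate sentence endings, since abbreviations such as "Mr." also contain a period.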

Sentence lengths, measured as characters per period, seem to be consistent across these two books!

---

<footer>
    <p>Adapted from <a href="https://data.berkeley.edu/education/courses/data-8">UC Berkeley DATA 8</a> course materials.</p>
    <p>This content is offered under a <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">CC Attribution Non-Commercial Share Alike</a> license.</p>
</footer>