<img src="./ccsf.png" alt="CCSF Logo" width=200px style="margin:0px -5px">

# Lecture 01: Introduction

Associated Textbook Sections: [1.0, 1.1, 1.2, 1.3](https://inferentialthinking.com/chapters/01/what-is-data-science.html)

---

## Overview

* [What is Data Science?](#What-is-Data-Science?)
* [Insights from Text](#Insights-from-Text)
* [Processing Adventures of Huckleberry Finn](#Processing-Adventures-of-Huckleberry-Finn)
* [Visualizing the Story](#Visualizing-the-Story)
* [Analyzing Little Women](#Analyzing-Little-Women)
* [A Relationship](#A-Relationship)
* [What's Next?](#What's-Next?)

---

## What is Data Science?

---

### Exploration, Inference, and Prediction

Learning about the world from data using computation:

<img src="./what_is_ds.png" width = 80% alt="Data Science categorized as Exploration, Inference, and Prediction.">

---

### Job Outlook

According to [the US Bureau of Labor Statistics](https://www.bls.gov/ooh/math/data-scientists.htm):
* Data scientists use analytical tools and techniques to extract meaningful insights from data.
* 2023 Median Pay: \$108,020 per year
* Typical Entry-Level Education: Bachelor's degree
* Number of Jobs, 2022: 168,900
* Job Outlook, 2022-32: 35% Increase

---

### Domain Knowledge and Collaboration

<a href="https://commons.wikimedia.org/wiki/File:Data_types_-_en.svg"><img src="./data.png" width = 40% alt="Types of data"></a>

* Data is information that can be used to inform decisions.
* Data exists in many formats: numbers, videos, images, words, etc.
* Interpretation of data depends on the subject domain (topic).
* In our complex world, [cross-functional team collaboration](https://www.linkedin.com/pulse/cross-functional-team-collaboration-data-scientists-okonkwo--zysvf/) is becoming more and more common.
    * In other words, don't expect to work alone as a data scientist! 

---

### Jupyter Notebooks

* Data scientists use a variety of technologies to perform their tasks.
* A [Jupyter Notebook](https://jupyter.org/) is an application that combines:  
    * Narrative: The text describing your analysis
    * Code: The program that does the analysis
    * Results: The output of the program
* The Jupyter project was co-founded by [Fernando Perez](https://bids.berkeley.edu/people/fernando-perez) of UC Berkeley.
* The Jupyter notebooks contain computer code written in a language called [Python](https://www.python.org/).
* You will learn the basics of working with these notebooks and Python over the next few lessons and assignments.

**Note: This notebook contains Python code you are not responsible for understanding in this course.**

---

## Insights from Text

---

### Python "Reads" Two Books

<img src="./huck_finn_book.jpeg" width=200px><img src="./little_women_book.jpg" width=135px>

* [Project Gutenberg](https://www.gutenberg.org/) contains public domain book data.
* The processing of natural languages ([NLP](https://en.wikipedia.org/wiki/Natural_language_processing)) is a major focus of research and application right now.
* The following code loads the text from the Adventures of Huckleberry Finn and Little Women into this environment.

In [None]:
with open("./huck_finn.txt") as huck_finn, \
open("./little_women.txt") as little_women:
    huck_finn_chapters = huck_finn.read().split('CHAPTER ')[44:]
    little_women_chapters = little_women.read().split('CHAPTER ')[1:]
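
To see what the `split` and slice steps do, here is a minimal sketch on a made-up string. The `sample_text` below is hypothetical and is not taken from either book. Splitting on `'CHAPTER '` returns the text before the first marker as the first piece, so a slice is used to skip it.

```python
# Hypothetical miniature "book", invented only for illustration.
sample_text = "Title page. CHAPTER I. First part. CHAPTER II. Second part."

# split() cuts the string at every occurrence of the separator.
pieces = sample_text.split('CHAPTER ')

# The first piece is whatever came before the first marker, so skip it.
sample_chapters = pieces[1:]

print(pieces[0])         # Title page. 
print(sample_chapters)   # ['I. First part. ', 'II. Second part.']
```

A larger starting index such as `[44:]` presumably skips more front matter, for example a table of contents that also contains the word `CHAPTER`. That reading is an assumption about the file, not something the notebook states.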

---

### Preview the Data

The following code shows the first 100 characters of Chapter 1 from both books.

In [None]:
huck_finn_chapters[0][:100]

In [None]:
little_women_chapters[0][:100]

The data might not look like you expect it to!
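
One likely reason the preview looks odd is that line breaks are stored as special characters. The sketch below uses an invented string (not the real chapter text) to show how slicing with `[:10]` keeps the first 10 characters and how a line break appears as `\n` when the raw value is displayed.

```python
# Invented text containing two line breaks, for illustration only.
sample = "I.\n\nIt was a quiet morning"

# Keep only the first 10 characters; line breaks count as characters too.
preview = sample[:10]

# repr() shows the raw value, with each line break written as \n.
print(repr(preview))   # 'I.\n\nIt was'
print(len(preview))    # 10
```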

---

### datascience

* Data Science tasks regularly involve transforming data for analysis.
* We will use UC Berkeley's [`datascience`](https://datascience.readthedocs.io/) library (a collection of code) to do most of our coding work.
* The `datascience` library was designed to help students transition from having no coding experience to using more professional libraries such as [`pandas`](https://pandas.pydata.org/).

---

### Import `datascience`

* The following code imports all of the content of the `datascience` library.
* Notice that the word `Table` (case-sensitive) has a special meaning.

---

In [None]:
from datascience import *

In [None]:
Table

---

## Processing Adventures of Huckleberry Finn

---

### Create a Table

The following code organizes all the text from each chapter of the book within a `Table`.

In [None]:
Table().with_column('Chapters', huck_finn_chapters)

---

### Word Frequency

What can the frequency of words in a story tell us about the story?

* Counting words in a book might seem overwhelming.
* Computers can help with tedious tasks like this.

---

### NumPy

* [`numpy`](https://numpy.org/) is another widely used library that you will work with in this course.
* This library will help with performing repetitive calculations.

---

### Import `numpy`

The following code imports the `numpy` library and gives it the alias `np`.

In [None]:
import numpy as np

---

### Counting Names with NumPy

The following code counts the number of times "Tom" and "Jim" appear in the text.

In [None]:
np.char.count(huck_finn_chapters, 'Tom')

In [None]:
np.char.count(huck_finn_chapters, 'Jim')
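
The same per-chapter counting can be sketched in plain Python with the built-in `str.count`, applied to one chapter at a time. The chapter texts below are invented for illustration. This is a simplified stand-in for `np.char.count`, not a replacement for it.

```python
# Hypothetical chapter texts, invented for illustration.
sample_chapters = [
    "Tom met Jim. Tom laughed.",
    "Jim waited alone.",
    "Nobody was home.",
]

# Count a name in each chapter, one chapter at a time.
tom_counts = [chapter.count('Tom') for chapter in sample_chapters]
jim_counts = [chapter.count('Jim') for chapter in sample_chapters]

print(tom_counts)  # [2, 0, 0]
print(jim_counts)  # [1, 1, 0]

# Caveat: this counts matching text, not whole words.
print("Tomorrow".count('Tom'))  # 1
```

Because the count matches text rather than whole words, a short name can be over-counted when it appears inside a longer word.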

---

### Creating Another Table

The following code creates a table called `counts_hf` that organizes the frequency of the names Tom, Jim, and Huck within each chapter.

In [None]:
counts_hf = Table().with_columns([
    'Tom', np.char.count(huck_finn_chapters, 'Tom'),
    'Jim', np.char.count(huck_finn_chapters, 'Jim'),
    'Huck', np.char.count(huck_finn_chapters, 'Huck'),
])
counts_hf

That is a lot of numbers!

---

## Visualizing the Story

---

### Matplotlib

* Visualizations help summarize data and allow us to see patterns.
* [Matplotlib](https://matplotlib.org/) is a standard Python visualization library.
* In this course, we will use it *indirectly* to create visualizations. 

---

### Import and Configure `matplotlib`

The following code imports and configures `matplotlib`.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

---

### Visualize a Trend

* Run the following code to visualize how the cumulative counts for each person change over the chapters.
* When you create a graph, observe patterns and ask questions.
    * Do you notice any interesting patterns in the lines?
    * Why might those patterns exist?

In [None]:
original_labels_hf =  list(counts_hf.labels)
cumulative_values_hf = ([np.cumsum(counts_hf.column(index)) 
                         for index in np.arange(counts_hf.num_columns)])
cumulative_counts_hf = (counts_hf
                        .with_columns(zip(original_labels_hf, cumulative_values_hf))
                        .with_column('Chapter', np.arange(1, 44, 1)))
cumulative_counts_hf.plot('Chapter')
plt.title('Cumulative Number of Times Name Appears')
plt.show()
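
A cumulative count is a running total: each entry adds the current chapter's count to everything before it. The sketch below shows the idea with `itertools.accumulate` from the Python standard library on made-up counts. The notebook itself uses `np.cumsum` for this step.

```python
from itertools import accumulate

# Hypothetical per-chapter counts for one name.
per_chapter = [2, 0, 3, 1]

# Each entry is the total of all counts up to and including that chapter.
running_total = list(accumulate(per_chapter))

print(running_total)  # [2, 2, 5, 6]
```

A flat stretch in a cumulative line therefore means the name did not appear in those chapters, and a steep stretch means it appeared often.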

---

## Analyzing Little Women

---

### Create a Table

The following code organizes the text from each chapter of Little Women within a `Table`.

In [None]:
Table().with_column('Chapters', little_women_chapters)

---

### Counting Names

The following code counts how often the names of some characters in Little Women appear within each chapter.

In [None]:
people_lw = ['Amy', 'Beth', 'Jo', 'Laurie', 'Meg']
counts_lw = {pp: np.char.count(little_women_chapters, pp) for pp in people_lw}

counts_lw = Table().with_columns([
        'Amy', counts_lw['Amy'],
        'Beth',counts_lw['Beth'],
        'Jo', counts_lw['Jo'],
        'Laurie', counts_lw['Laurie'],
        'Meg', counts_lw['Meg']
    ])
counts_lw

---

### Visualize a Trend

* Use the following code to visualize how the cumulative counts for each person change over the chapters.
* What do you notice in this visualization?

In [None]:
original_labels_lw =  list(counts_lw.labels)
cumulative_values_lw = ([np.cumsum(counts_lw.column(index)) 
                      for index in np.arange(counts_lw.num_columns)])
cumulative_counts_lw = (counts_lw
                        .with_columns(zip(original_labels_lw, cumulative_values_lw))
                        .with_column('Chapter', np.arange(1, 48, 1)))
cumulative_counts_lw.plot('Chapter')
plt.title('Cumulative Number of Times Name Appears')
plt.show()

---

## A Relationship

---

### Chapter Length vs. Number of Sentences

* Is there a relationship between the length of a chapter and the number of sentences in that chapter?
* To explore this, determine:
    * Chapter length: A count of all the symbolic characters in the chapter.
    * Number of sentences: An approximate count of the sentences in each chapter.

---

### Create Some Tools

* Data Science usually requires us to create tools.
* The following code creates some tools for us to use:
    * `count_characters`: Counts the number of symbolic characters per chapter.
    * `count_sentences`: Counts the number of periods, question marks, and exclamation points per chapter to provide an estimate of the number of sentences in each chapter.

In [None]:
def count_characters(book_by_chapters):
    '''Return the count of the number of characters 
    per chapter of the provided book.'''
    return [len(s) for s in book_by_chapters]

In [None]:
def count_sentences(book_by_chapters):
    '''Return the count of the number of periods, question marks, 
    and exclamation points per chapter of the provided book.'''
    punctuation = ['.', '?', '!']
    num_sentences = np.zeros_like(book_by_chapters, dtype=int)
    for char in punctuation:
        num_sentences += np.char.count(book_by_chapters, char)
    return num_sentences
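
Counting punctuation only approximates the number of sentences. The sketch below applies the same idea to a single string in plain Python. The helper name `estimate_sentences` and both example strings are hypothetical and are not part of the notebook.

```python
def estimate_sentences(text):
    """Estimate the number of sentences by counting '.', '?', and '!'."""
    return sum(text.count(mark) for mark in ['.', '?', '!'])

# Invented examples.
simple = "It rained. Did it stop? No!"
tricky = "Mr. Finn waited... Then he left."

print(estimate_sentences(simple))  # 3
print(estimate_sentences(tricky))  # 5, although a reader would see 2 sentences
```

Abbreviations and ellipses inflate the estimate, which is why the result is described as approximate.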

---

### Apply the Tools

Use the following code to determine the length of every chapter and the number of sentences per chapter of both books.

In [None]:
chapter_lengths_hf = count_characters(huck_finn_chapters)
number_of_sentences_hf = count_sentences(huck_finn_chapters)
chars_sentences_hf = Table().with_columns(
    'Number of Sentences', number_of_sentences_hf,
    'Chapter Length', chapter_lengths_hf,
    'Book', ['Huck Finn'] * len(huck_finn_chapters))

chapter_lengths_lw = count_characters(little_women_chapters)
number_of_sentences_lw = count_sentences(little_women_chapters)
chars_sentences_lw = Table().with_columns(
    'Number of Sentences', number_of_sentences_lw,
    'Chapter Length', chapter_lengths_lw,
    'Book', ['Little Women'] * len(little_women_chapters))

---

### Explore the Results

* Show the results for Huck Finn and Little Women.
* Combine the two tables into one table.

In [None]:
chars_sentences_hf

In [None]:
chars_sentences_lw

In [None]:
chars_sentences = chars_sentences_hf.append(chars_sentences_lw)
chars_sentences

---

### Visualizing the Relationship

* Use the following code to visualize the relationship between the length of a chapter and the number of sentences in the same chapter.
* What do you notice about the pattern of the dots?

In [None]:
chars_sentences.scatter('Number of Sentences', group='Book')
plt.title('Chapter Length vs. Number of Sentences')
plt.show()

---

### Quantifying the Relationship

The following code computes the ratio of the average chapter length to the average number of sentences per chapter for each book.

In [None]:
ave_sentence_length_hf = (np.average(chapter_lengths_hf) 
                          / np.average(number_of_sentences_hf))
print(f'Huck Finn has an average approximate sentence \
length of {ave_sentence_length_hf} characters.')

In [None]:
ave_sentence_length_lw = (np.average(chapter_lengths_lw) 
                          / np.average(number_of_sentences_lw))
print(f'Little Women has an average approximate sentence \
length of {ave_sentence_length_lw} characters.')
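
Because both averages divide by the same number of chapters, this ratio equals the total number of characters divided by the total number of sentences. It is generally not the same as averaging each chapter's own ratio. A minimal sketch with made-up numbers (all values below are hypothetical):

```python
# Hypothetical chapter lengths (in characters) and sentence counts.
chapter_lengths = [1200, 800, 1000]
sentence_counts = [12, 8, 5]

n = len(chapter_lengths)

# Ratio of the two averages, as computed in the notebook.
ratio_of_averages = (sum(chapter_lengths) / n) / (sum(sentence_counts) / n)

# Total characters divided by total sentences: the same quantity.
ratio_of_totals = sum(chapter_lengths) / sum(sentence_counts)

# Average of each chapter's own characters-per-sentence ratio: a different quantity.
per_chapter_ratios = [c / s for c, s in zip(chapter_lengths, sentence_counts)]
average_of_ratios = sum(per_chapter_ratios) / n

print(ratio_of_averages, ratio_of_totals, average_of_ratios)
```

Here the first two values agree, while the third differs because it gives every chapter equal weight regardless of how many sentences it contains.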

---

## What's Next?

---

### Keep Asking Questions

* You have a lot to learn. We'll guide you through the technical details.
* For now, be curious and ask bigger questions:
    * Is the difference (~ 3 characters) in their averages meaningful?
    * Do all books have a similar pattern in their sentence structure?
    * etc.

---

### A Life Cycle

A typical cycle for this type of work is:


```mermaid
flowchart LR
    A[Ask a Question] --> B[Get Data]
    B --> C[Analyze the Data]
    C --> A
```
    
The end of each project should prompt another project!

---

## Attribution

This content is licensed under the <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)</a> and derived from the <a href="https://www.data8.org/">Data 8: The Foundations of Data Science</a> offered by the University of California, Berkeley.

<img src="./by-nc-sa.png" width=100px>