In [1]:
from dsc80_utils import *

# Lecture 1 – Introduction, Data Science Lifecycle

## DSC 80, Winter 2025

<center><h2>Welcome to DSC 80! 🎉</h2></center>

### Agenda

- Who are we?
- What does a data scientist do?
- What is this course about, and how will it run?
- The data science lifecycle.
- Example: What's in a name?

### Instructor: Duncan Watson-Parris (call me Duncan)

#### Prof. Duncan Watson-Parris

<img src='imgs/watson-parris.jpg' width=15%>

- Assistant Professor, SIO/HDSI, UCSD
- Website: http://climate-analytics-lab.github.io/

Harnessing machine learning to improve climate projections

Bio: Ph.D. Manchester, UK (2011), B.S. Cardiff (2007).

- PhD from University of Manchester 2011 in Theoretical Physics
- Worked as a software consultant 2011-2015
    - Worked with a range of clients from small research groups to large multinational companies and government departments
- Postdoc + Senior Postdoc at University of Oxford 2015-2023
- Moved over from the UK with wife and two kids (11 & 13) in 2023 
- I love to surf, hike, and am learning the drums 🥁

### Course staff

In addition to the instructor, we have several staff members who are here to help you in discussion, office hours, and on Ed:

- **1 graduate student TA**: Mizuho Fukada.
- **7 undergraduate tutors**: TBD!

Learn more about them at [dsc80.com/staff](https://dsc80.com/staff).

## What is data science? 🤔

### What is data science?

<br>

<center><img src='imgs/what-is-data-science.png' width=60%></center>

Everyone seems to have their own definition of what data science is!

### The DSC 10 approach

In DSC 10, we told you that data science is about **drawing useful conclusions from data using computation**. In DSC 10, you:

- Used Python to **explore** and **visualize** data.

- Used **simulation** to make **inferences** about a population, given just a sample.

- Made **predictions** about the future given data from the past.

Let's look at a few more definitions of data science.

### What is data science?

<center><img src="imgs/image_0.png"></center>

In 2010, Drew Conway published his famous [Data Science Venn Diagram](http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram).

### What is data science?

There isn't agreement on which "Venn Diagram" is correct!

<center><img src="imgs/image_1.png" width=30%></center>

- **Why not?** The field is new and rapidly developing.
- Make sure you're solid on the fundamentals, then find a niche that you enjoy.
- Read Taylor, [Battle of the Data Science Venn Diagrams](https://deeplearning.lipingyang.org/wp-content/uploads/2017/10/Battle-of-the-Data-Science-Venn-Diagrams.pdf).

### What does a _data scientist_ do?

The chart below is taken from the [2016 Data Science Salary Survey](https://www.oreilly.com/radar/2016-data-science-salary-survey-results/), administered by O'Reilly. They asked respondents what they spend their time doing on a daily basis. What do you notice? <br>

<center><img src='imgs/survey.png' width=40%></center>

The chart below is taken from the followup [2021 Data/AI Salary Survey](https://www.oreilly.com/radar/2021-data-ai-salary-survey/), also administered by O'Reilly. They asked respondents:

> What technologies will have the biggest effect on compensation in the coming year?

<center><img src='imgs/2021-most-relevant-skill.png' width=45%></center>

### What does a _data scientist_ do?

Our take: in DSC 80, and in the DSC major more broadly, we are training you to **ask and answer questions using data**.

As you take more courses, we're training you to answer questions whose answers are **ambiguous** – this uncertainty is what makes data science challenging!

Let's look at some examples of data science in practice.

### Do people care about climate change?

From [How Americans Think About Climate Change, in Six Maps](https://www.nytimes.com/interactive/2017/03/21/climate/how-americans-think-about-climate-change-in-six-maps.html?amp=&smid=fb-nytimes).

<center><img src='imgs/nyt-climate-legend.png' width=35%></center>
<center><img src='imgs/nyt-climate-harm.png' width=50%></center>

### Do people care about climate change?

<center><img src='imgs/nyt-climate-legend.png' width=35%></center>
<center><img src='imgs/nyt-climate-personal.png' width=50%></center>

An excerpt from the article:

> Global warming is precisely the kind of threat humans are awful at dealing with: a problem with enormous consequences over the long term, but little that is sharply visible on a personal level in the short term. Humans are hard-wired for quick fight-or-flight reactions in the face of an imminent threat, but not highly motivated to act against slow-moving and somewhat abstract problems, even if the challenges that they pose are ultimately dire.

### Data science involves _people_ 🧍

The decisions that we make as data scientists have the potential to impact the livelihoods of other people.

- Flu case forecasting.
- Admissions and hiring.
- Hyper-personalized ad recommendations.

### What is this course really about, then?

- Good data analysis is not:
    - A simple application of a statistics formula.
    - A simple application of computer programs.

- There are many tools out there for data science, but they are merely tools. **They don’t do any of the important thinking – that's where you come in!**


## Course content

### Course goals

**DSC 80 teaches you to *think* like a data scientist.**

In this course, you will...

* **Get a taste of the "life of a data scientist."**
* Practice translating potentially vague questions into quantitative questions about measurable observations.
* Learn to reason about "black-box" processes (e.g. complicated models).
* Understand computational and statistical implications of working with data.
* Learn to use real data tools (and rely on documentation).

### Course outcomes

After this course, you will...

* Be prepared for internships and data science "take home" interviews!
* Be ready to create your own portfolio of personal projects.
* Have the background and maturity to succeed in upper-division courses.

### Topics

- Week 1: From `babypandas` to `pandas`.
- Week 2: DataFrames.
- Week 3: Working with messy data, hypothesis and permutation testing.
- Week 4: Missing values.
- Week 5: HTML and web scraping.
- Week 6: **Midterm Exam** and regular expressions.
- Week 7: Text data, modeling.
- Week 8: Feature engineering and generalization.
- Week 9: Modeling in `sklearn`.
- Week 10: Classifier evaluation, fairness, conclusion.
- Week 11: **Final Exam**

## Course logistics

### Course website

The course website is your one-stop-shop for all things related to the course.

<br>

<center><h3><a href="https://dsc80.com">dsc80.com</a></h3></center>

<br>

Make sure to **read the [syllabus](https://dsc80.com/syllabus)**!

### Getting set up

- **Ed**: Q&A forum. You must be active here, since this is where all announcements will be made.
- **Gradescope**: Where you will submit all assignments for autograding, and where all of your grades will live.
- **Canvas**: No ❌.

In addition, you must fill out our **[Welcome Survey](https://forms.gle/9JdiAnu75D7T7MAu7)**.


### Accessing course content on GitHub

You will access all course content by pulling the course GitHub repository:

<br>

<center><p><b><a href="https://github.com/dsc-courses/dsc80-2025-wi">github.com/dsc-courses/dsc80-2025-wi</a></b></p></center>

<br>

We will post HTML versions of lecture notebooks on the course website, but otherwise you must `git pull` from this repository to access all course materials (including blank copies of assignments).

### Environment setup

- You're required to set up a Python environment on your own computer.
- To do so, follow the instructions on the [Tech Support](https://dsc80.com/tech_support) page of the course website.
- Once you set up your environment, you will `git pull` the course repo every time a new assignment comes out.
- **Note**: You will submit your work to Gradescope directly, without using Git.
- We will post a demo video with Lab 1.

### Lectures

- Lectures are held in-person on **Tuesdays and Thursdays 11:00am-12:20pm in SOLIS 104**. Lectures are podcasted.

- Assignment deadlines are fairly flexible (as I'll explain soon). To help yourself stay on track with the material, you can opt into lecture attendance. If you do, lecture attendance is worth 5% of your overall grade (instead of 0%) and the midterm and final are each worth 2.5% less.

- To get credit for a class, attend and participate in the activities for both lectures. Lowest two classes dropped.

### Assignments

In this course, you will learn by doing!

- **Labs (20%)**: 9 total, lowest score dropped. Due **Wednesdays at 11:59PM**.
- **Projects (25%)**: 4 total, no drops. Due on **Fridays at 11:59PM**.

In DSC 80, assignments will usually consist of both a Jupyter Notebook and a `.py` file. You will write your code in the `.py` file; the Jupyter Notebook will contain problem descriptions and test cases. Lab 1 will explain the workflow.

### Late Policy

- **No late submissions accepted, but...**
- The [Extension Request Form](www.google.com) grants you a 1-day extension on an assignment submission (instead of slip days).
- We will essentially approve all requests!
- The goal is to support you if you start falling behind, so if you fill out the form a lot, we'll schedule a meeting with you to help come up with a plan for success.
- **No extensions on project checkpoints or Final Project.**

### Redemption for Labs and Projects

- All labs and projects 1-3 have hidden autograder tests.
    - We won't show you what you missed until the deadline has passed.
- But you can resubmit after the original deadline to redeem up to 80% of points lost.

### Discussions

- No discussions this quarter; come to office hours (OH) instead!
- We will post worksheets with suggested past exam questions to try out each week.

### Exams

- This class has one Midterm Exam and one Final Exam. Exams are cumulative, though the Final Exam will emphasize material after the Midterm Exam.
- **Midterm Exam**: Tuesday, Feb 11, 11:00AM-12:20PM in SOLIS 104 (during lecture).
- **Final Exam**: Thursday, March 20, 11:30AM-2:30PM. Location is TBD.
- Both exams will be administered in-person. If you have conflicts with either of the exams, please let us know on the [Exam Accommodations Form](https://forms.gle/rSUYPsHdmxTN9qYv5).

- Your final exam score can **redeem** your midterm score (see the [Syllabus](https://dsc80.com/syllabus) for details).



| Monday | Tuesday                                           | Wednesday | Thursday | Friday                                        |
| ------ | ------------------------------------------------- | --------- | -------- | --------------------------------------------- |
|        | Lecture                                           |           | Lecture  |                                               |
|        | |  <span style='color:red'><b>Lab due</b></span>         |          | <span style='color:red'><b>Project due</b></span>  |

 
🏃‍♂️💨💨💨

### Resources

- Your main resource will be lecture notebooks.
- Most lectures also have supplemental readings that come from our course textbook, [Learning Data Science](https://learningds.org/intro.html). These are not required, but are highly recommended.

<center>
<img alt="Front cover of textbook" src="imgs/book-cover.png" width="300">
</center>

### Support 🫂

It is no secret that this course requires **a lot** of work – becoming fluent in working with data is hard!

- You will learn how to solve problems **independently** – documentation and the internet will be your friends.
- Learning how to effectively check your work and debug is extremely useful.
- Learning to stick with a problem (*tenacity*) is a very valuable skill, but don't be afraid to ask for help.

Once you've tried to solve problems on your own, we're glad to help.

- We have several **office hours** in person each week. See the [Calendar 📆](https://dsc80.com/calendar/) for details.
- **Ed** is your friend too. Make your conceptual questions public, and make your debugging questions private.

### Generative Artificial Intelligence

- We know that tools like ChatGPT and GitHub Copilot can write code for you.
- Feel free to use such tools **with caution**. Refer to the [Generative AI](https://dsc80.com/syllabus/#use-of-generative-artificial-intelligence) section of the syllabus for details.
- We trust that you're here to learn and do the work for yourself.
- You won't be able to use ChatGPT on the exams, so make sure you **understand** how your code actually works.

<center><img src="imgs/sets.png" width=75%></center>

<center>You'll have to work a lot, but we'll make the time spent worth it.</center>

## The data science lifecycle 🚴

### The scientific method

You learned about the scientific method in elementary school. 

<center><img src="imgs/image_3.png" width=500></center>

However, it hides a lot of complexity.
- Where did the hypothesis come from?
- What data are you modeling? Is the data sufficient?
- Under which conditions are the conclusions valid?

### The data science lifecycle

<center><img src="imgs/ds-lifecycle.svg" width="60%"></center>

**All steps lead to more questions!** We'll refer back to the data science lifecycle repeatedly throughout the quarter.

## Example: What's in a name?

### Lilith, Lilibet … Lucifer? How Baby Names Went to 'L'

[This New York Times](https://archive.is/NpORG) article claims that baby names beginning with "L" have become more popular over time.

Let's see if these claims are true, based on the data!

### The data

What we're seeing below is a `pandas` DataFrame. The DataFrame contains one row for every combination of `'Name'`, `'Sex'`, and `'Year'`.

In [2]:
baby = pd.read_csv('data/baby.csv')
baby

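|  | Name | Sex | Count | Year |
| --- | --- | --- | --- | --- |
| 0 | Liam | M | 20456 | 2022 |
| 1 | Noah | M | 18621 | 2022 |
| 2 | Olivia | F | 16573 | 2022 |
| ... | ... | ... | ... | ... |
| 2085155 | Wright | M | 5 | 1880 |
| 2085156 | York | M | 5 | 1880 |
| 2085157 | Zachariah | M | 5 | 1880 |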


Recall that in DSC 10, you accessed columns in a DataFrame using the `.get` method.

In [3]:
baby.get('Count').sum()

np.int64(365296191)

Everything you learned in `babypandas` translates to `pandas`. However, the more common way of accessing a column in `pandas` involves dictionary syntax:

In [4]:
baby['Count'].sum()

np.int64(365296191)
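To make the comparison concrete, here is a minimal sketch on a small stand-in DataFrame (`toy` is a hypothetical name, and the `'Births'` column is deliberately nonexistent). It suggests that, in `pandas`, the two styles agree when the column exists and differ only in how they handle a missing column.

```python
import pandas as pd

# Hypothetical miniature stand-in for the baby names data.
toy = pd.DataFrame({
    'Name': ['Liam', 'Olivia', 'Noah'],
    'Sex': ['M', 'F', 'M'],
    'Count': [20456, 16573, 18621],
    'Year': [2022, 2022, 2022],
})

# Both styles return the same Series, so the sums match.
total_get = toy.get('Count').sum()
total_brackets = toy['Count'].sum()
print(total_get, total_brackets)

# They differ when the column doesn't exist:
# in pandas, .get returns None, while brackets raise a KeyError.
missing = toy.get('Births')
try:
    toy['Births']
    raised = False
except KeyError:
    raised = True
print(missing, raised)
```

The louder failure of bracket syntax is often helpful when debugging, since a misspelled column name is caught immediately.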

### How many unique names were there per year?

In [5]:
baby.groupby('Year').size()

Year
1880     2000
1881     1934
1882     2127
        ...  
2020    31517
2021    31685
2022    31915
Length: 143, dtype: int64

A shortcut to the above is as follows:

In [6]:
baby['Year'].value_counts()

Year
2008    35094
2007    34966
2009    34724
        ...  
1883     2084
1880     2000
1881     1934
Name: count, Length: 143, dtype: int64

Why **doesn't** the above Series actually contain the number of unique names per year?

In [7]:
baby[(baby['Year'] == 1880)]

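|  | Name | Sex | Count | Year |
| --- | --- | --- | --- | --- |
| 2083158 | John | M | 9655 | 1880 |
| 2083159 | William | M | 9532 | 1880 |
| 2083160 | Mary | F | 7065 | 1880 |
| ... | ... | ... | ... | ... |
| 2085155 | Wright | M | 5 | 1880 |
| 2085156 | York | M | 5 | 1880 |
| 2085157 | Zachariah | M | 5 | 1880 |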


In [8]:
baby[(baby['Year'] == 1880)].value_counts('Name')

Name
Grace      2
Emma       2
Clair      2
          ..
Evaline    1
Evalena    1
Zula       1
Name: count, Length: 1889, dtype: int64
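The output above hints at the answer: a name given to babies of both sexes in the same year (such as Grace in 1880) occupies two rows, so counting rows per year overcounts unique names. In 1880 there are 2000 rows but only 1889 distinct names. One way to count distinct names is `nunique`, sketched below on a hypothetical toy DataFrame whose counts are illustrative rather than real.

```python
import pandas as pd

# Hypothetical toy data: 'Grace' appears once per sex in 1880.
# The counts are made up for illustration.
toy = pd.DataFrame({
    'Name': ['Grace', 'Grace', 'John', 'Mary', 'Mary'],
    'Sex': ['F', 'M', 'M', 'F', 'F'],
    'Count': [10, 5, 30, 20, 15],
    'Year': [1880, 1880, 1880, 1880, 1881],
})

# Counts rows per year: 'Grace' is counted twice in 1880.
rows_per_year = toy.groupby('Year').size()

# Counts distinct names per year: 'Grace' is counted once in 1880.
unique_names_per_year = toy.groupby('Year')['Name'].nunique()

print(rows_per_year)
print(unique_names_per_year)
```

On the full data, the same idea would be `baby.groupby('Year')['Name'].nunique()`.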

### How many babies were recorded per year?

In [9]:
baby.groupby('Year')['Count'].sum()

Year
1880     201484
1881     192690
1882     221533
         ...   
2020    3333981
2021    3379713
2022    3361896
Name: Count, Length: 143, dtype: int64

In [10]:
baby.groupby('Year')['Count'].sum().plot()

### "'L' has to be like the consonant of the decade."

In [11]:
(baby
 .assign(first_letter=baby['Name'].str[0])
 .query('first_letter == "L"')
 .groupby('Year')
 ['Count']
 .sum()
 .plot(title='Number of Babies Born with an "L" Name Per Year')
)
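The chained expression above does several things at once. As a rough sketch, the same steps can be run one at a time on a hypothetical toy DataFrame (the names `toy`, `with_letter`, `l_names`, and `l_per_year` and all counts are illustrative assumptions), leaving out the final plot.

```python
import pandas as pd

# Hypothetical toy data; the counts are made up for illustration.
toy = pd.DataFrame({
    'Name': ['Liam', 'Luna', 'Noah', 'Linda', 'Mary'],
    'Sex': ['M', 'F', 'M', 'F', 'F'],
    'Count': [30, 20, 25, 40, 50],
    'Year': [2022, 2022, 2022, 1950, 1950],
})

# Step 1: add a column holding each name's first letter.
with_letter = toy.assign(first_letter=toy['Name'].str[0])

# Step 2: keep only the rows whose name starts with "L".
l_names = with_letter.query('first_letter == "L"')

# Step 3: total the counts within each year.
l_per_year = l_names.groupby('Year')['Count'].sum()
print(l_per_year)
```

Calling `.plot()` on the resulting Series, as in the cell above, draws one point per year.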

### What about individual names?

In [12]:
(baby
 .query('Name == "Siri"')
 .groupby('Year')
 ['Count']
 .sum()
 .plot(title='Number of Babies Born Named "Siri" Per Year')
)

In [13]:
def name_graph(name):
    return (baby
     .query(f'Name == "{name}"')
     .groupby('Year')
     ['Count']
     .sum()
     .plot(title=f'Number of Babies Born Named "{name}" Per Year')
    )

In [14]:
name_graph('Duncan')

### What about other names?

In [15]:
name_graph(...)

### This week...

- Lab 1 will be released tomorrow.
- Start [setting up your environment](https://dsc80.com/tech_support), which you'll need to do before working on Lab 1.
- Also fill out the [Welcome Survey](https://forms.gle/9JdiAnu75D7T7MAu7) and read the [Syllabus](https://dsc80.com/syllabus)!