In [None]:
# Imports
import babypandas as bpd
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')

from IPython.display import YouTubeVideo

# Lecture 1 – Introduction

## DSC 10, Summer 2022

### Welcome to DSC 10! 👋
- A guided tour of data science.
- A course developed by UC Berkeley in 2015 and adapted by UC San Diego in 2017.
- Learn just enough programming and statistics to do data science.
    - Statistics without too much math, mostly simulation.
    - Lays the foundation for all other courses in the DSC major.

### Agenda

- Who are we?
- What is data science?
- How will this course run?
- Literature demo.

### About the instructor 👩‍🏫

#### Samuel Lau (call me Sam)

- PhD in Cog Sci (Human-Computer Interaction) at UCSD
  - Research: tools for teaching data science like Pandas Tutor
- BS and MS in Electrical Engineering and Computer Science at UC Berkeley
  - Worked on the OG data science courses at UC Berkeley 
- Writing a textbook called _Learning Data Science_

### Course staff

In addition, we have several other course staff members who are here to support you in discussion, office hours, and Campuswire.

- 1 graduate TA: Shivani Bhakta.
- 3 undergraduate tutors: Eric Chen, Oren Ciolli, and Tiffany Yu.

Learn more about them at [dsc10.com/staff](https://dsc10.com/staff).

### About you

Do you have any programming experience?

<div align=center>
<img src="https://i.imgur.com/Bv3DBGa.gif" width=500>
</div>

A. Yes, I'm a pro!

B. I have some experience.

C. I know a few basic concepts.

D. No experience whatsoever!

### To answer, go to **[menti.com](https://menti.com)** and enter the code **9731 1635**.

## What is "data science"? 🤔

<center><img src='images/what-is-ds.png' width=1250>Everyone seems to have their own definition of data science.</center>

### What is "data science"?

Data science is about **drawing useful conclusions from data using computation**.

- **Exploration**.
    - Identifying patterns in information.
    - Uses visualizations.
- **Prediction**.
    - Making informed guesses.
    - Uses machine learning and optimization.
- **Inference**.
    - Quantifying whether those predictions are reliable.
    - Uses randomization.
    
In this class, we'll focus on the first and third, with a touch of the second.

### Data science is more relevant than ever

<center><img src='images/nyt-covid-line.png'></center>

<center><img src='images/covid-candles.png' width=800>
    <a href="https://twitter.com/zornsllama/status/1473575508784955394/">source</a>
</center>

<span style='color:blue'><b>blue line</b></span>: daily COVID cases in the USA<br>
<span style='color:red'><b>red line</b></span>: bad reviews of Yankee Candles on Amazon saying "they don't have any scent"

### Social implications

A ["Face Depixelizer"](https://github.com/tg-bomze/Face-Depixelizer) released in 2020 takes pixelated images and generates images that are perceptually realistic and downscale correctly.

<center>
<img src='images/depixel.png' width=300>
</center>

    
What happened here? Why do you think this happened?

<center>
<img src='images/depixel2.png' width=600>
</center>

## Course logistics

### Course website

The course website is your one-stop-shop for all things related to the course.

<br>

<center><h3><a href="https://dsc10.com">dsc10.com</a></h3></center>

<br>

This is where lectures, homeworks, labs, discussions, and all other content will be posted. Check it often, and **read the [syllabus](https://dsc10.com/syllabus)**!

### Getting set up

- **Campuswire**: Q&A forum. Must be active here, since this is where all announcements will be made. Should have gotten an email invitation; if not, there's a link on the course website.
- **Gradescope**: Where you will submit all assignments for autograding, and where all of your grades will live. Should have been automatically added; contact us if not.
- **DataHub**: Where you will access and run all code in this class. Access at [datahub.ucsd.edu](https://datahub.ucsd.edu).
- We will **not** be using Canvas for anything!

You must also fill out our [Beginning of Quarter Survey](https://forms.gle/Vc3GiLXfuN4iV9Hf9).

### Lecture

- Lectures will be in-person and podcasted for viewing afterwards.
    - Attendance will never be required (but is strongly encouraged!)
    - Recordings will be linked from the [course homepage](https://dsc10.com).
- Slides/code from lecture will be linked on the course website.
- <span style="color:red;"><b>Important: We will **not** be using the assigned discussion or lab times!</b></span>
    - i.e. you can ignore the Tues and Thurs 11am times on WebReg.
- I bet you're thinking...

### Labs
- Lab assignments are a required part of the course and help you develop fluency in Python and working with data.
- While working on the lab, you'll be able to run **autograder tests** which tell you if your answers are correct.
    - For labs, if you pass all autograder tests, you will get 100\%!
- You must submit labs individually, but you can discuss ideas with others (no sharing code).
- Labs are due on **Saturdays at 11:59pm** to Gradescope. The first lab (due this Saturday) will have submission instructions.

### Homeworks and projects

- Weekly homework assignments build off of skills you develop in labs.
- Key difference between homeworks and labs: passing autograder tests does not guarantee a perfect score!
    - In homeworks, we have "hidden tests" that are only run after you submit the assignment.
    - The tests that are available to you within the assignment itself only verify that your answer is reasonable/on the right track.
- Again, you must work on homeworks yourself, but you can discuss ideas with other students (no sharing code).
- Homeworks are due on **Tuesdays at 11:59pm** to Gradescope.
- **Midterm Project and Final Project**: Deep dive into a data set! Longer than homeworks. Can work with a partner using ["pair programming."](https://dsc10.com/pair-programming/)

### Exams

We will have two exams this quarter.
- Midterm Exam: Friday, July 29, during the scheduled lecture time (11am).
- Final Exam: Saturday, Sept 3 from 11:30am-2:30pm.
- Both exams will be in-person, on-paper exams.

### Readings and resources

- We will draw readings from two sources. Readings for each lecture will be posted on the course homepage.
    - [DSC 10 Course Notes](https://notes.dsc10.com)
    - [Computational and Inferential Thinking (CIT)](https://inferentialthinking.com)
- The [Resources](https://dsc10.com/resources) tab of the course website contains links to helpful resources that you'll want to use throughout the course (e.g. DSC 10 reference sheet, programming tutorials, supplemental videos).
- The [Debugging](https://dsc10.com/debugging) tab of the course website has answers to many common technical issues.

### A typical week in DSC 10

| Sunday | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| Nothing! 😎 | Lecture | | Lecture | | Lecture | |
| | | | | | | |
| | | **Homework due** | | | | **Lab due** |

See the [syllabus](https://dsc10.com/syllabus) for more details.

### First assignments
- Lab 1, due Saturday, July 2 at 11:59pm.
    - Do the lab before you do the homework!
- Homework 1, due Tuesday, July 5 at 11:59pm.
    - Will be released by Wednesday.
- <span style='color:red'><b>Important: Start early and submit often</b>.</span>

### Getting help

This is a tough, fast-paced course. But we're here to help you – here's how:

- **Office Hours (OH)**
    - Remote and in-person throughout the week.
    - See the schedule and Zoom link on the [Calendar 📆](https://dsc10.com/calendar).
- **Campuswire**
    - Post here with any logistical or conceptual questions (please don't ask these questions via email).
    - If posting your code, make your post private to instructors + TAs; otherwise, make your post public (you can post anonymously, if you'd like).
    - No DMs!
- <span style="color:red;"><b>Important: Use these to your advantage!</b></span>

### Advice from previous students

At the end of fall quarter, we asked DSC 10 students to give advice to future students in the course. Here are some responses:

> "Go to office hours! Get a partner for the project even if you don't want to. If you don't understand a topic try the following: go to office hours, ask on campuswire, check the textbook, look at lecture notes. Start on the assignments early try and finish 2 days early to check your work."

> "I would give the advice to attend the discussions and office hours whenever possible, as a lot of the times I found myself learning new things even when I didn't come with a question ready."

> "Do NOT fall behind in lectures. It becomes very difficult to catch up on the concepts. Go to your discussion section! Hearing a concept explained once can be difficult to understand so discussion section was extremely helpful."

> "GO TO OFFICE HOURS!  It's very important to let your voice out. Talk to the professor after lectures, attend office hours, post your questions and ask a question."

### Collaboration

#### Asking questions is highly encouraged!
- Discuss all questions with each other (except exams).
- Submit lab assignments individually, but you can work with others (no sharing code).
- Submit homeworks individually, but you can discuss problem-solving strategies with others (no sharing code).
- Submit projects individually or in pairs using pair programming.

#### The limits of collaboration
- Don't share solutions with each other or look at someone’s code.
- Partners should work together and be physically in the same place (or same Zoom call).
- Academic integrity violations usually result in failing the course. 

### We're here for you!

Regardless of your background, you can succeed in this course. **No prior programming or statistics experience will be assumed!**

This summer might be the smallest DSC 10 offering for the foreseeable future, so take advantage!

### Campus resources

Counseling and Psychological Services (CAPS) is a campus unit that offers “short term counseling for academic, career, and personal issues and also offers psychiatry services for circumstances when medication can help with counseling.”
If you or anyone you know is ever in need of mental health care, you should contact CAPS.

<center><h3><a href="https://caps.ucsd.edu/">caps.ucsd.edu</a></h3></center>

## Demo

### _Little Women_ (1868)

- _Little Women_, by Louisa May Alcott, is a novel that follows the lives of four sisters – Meg, Jo, Beth, and Amy.
- Using tools from this class, we'll learn about the plot of the book, without reading it.
- Do not worry about any of this code – we'll cover the necessary pieces in the weeks to come. Sit back and relax!

In [None]:
# Read in 'lw.txt' to a variable called "little_women_text"
little_women_text = open('data/lw.txt').read()

In [None]:
# See the first three thousand characters
little_women_text[:3000]

In [None]:
# Print the first three thousand characters
print(little_women_text[:3000])

In [None]:
# Create a variable "chapters" by splitting the text on 'CHAPTER '
chapters = little_women_text.split('CHAPTER ') 

# Create a DataFrame with one column -- the chapters
bpd.DataFrame().assign(chapters=chapters)
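
To see what splitting on `'CHAPTER '` does, here is a minimal sketch on a short made-up string (the text is invented for illustration and is not taken from `lw.txt`). One thing worth noticing is that `split` also returns whatever comes before the first separator, so the first piece may be front matter rather than a chapter.

```python
# Toy text invented for illustration; not the contents of lw.txt.
toy_text = 'Title page. CHAPTER ONE Jo sat. CHAPTER TWO Beth sang.'

# str.split cuts the string at every occurrence of the separator
# and leaves the separator itself out of the pieces.
toy_chapters = toy_text.split('CHAPTER ')

print(len(toy_chapters))  # number of pieces, including the text before the first separator
print(toy_chapters)
```

Here the toy text contains two chapters but the split produces three pieces, because the title page comes first.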

In [None]:
# Counts of names in the chapters of Little Women

counts = bpd.DataFrame().assign(
    Amy=np.char.count(chapters, 'Amy'),
    Beth=np.char.count(chapters, 'Beth'),
    Jo=np.char.count(chapters, 'Jo'),
    Meg=np.char.count(chapters, 'Meg'),
    Laurie=np.char.count(chapters, 'Laurie'),
)
counts
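
As a rough illustration of how `np.char.count` behaves, the sketch below runs it on three made-up "chapters" (invented for this example). It counts occurrences of a substring in each element, so it does not check for whole words: a name like "John" would also add to the count for `'Jo'`.

```python
import numpy as np

# Toy "chapters" invented for illustration.
toy_chapters = ['Jo met Meg. Jo laughed.', 'Beth played piano.', 'John wrote to Jo.']

# np.char.count returns one count per element: the number of
# non-overlapping occurrences of the substring in that element.
jo_counts = np.char.count(toy_chapters, 'Jo')

print(jo_counts)  # the last element counts both 'John' and 'Jo'
```

The counts in the demo are therefore best read as approximate mention counts rather than exact ones.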

In [None]:
# Cumulative number of times each name appears

cumulative_counts = bpd.DataFrame().assign(
    Amy=np.cumsum(counts.get('Amy')),
    Beth=np.cumsum(counts.get('Beth')),
    Jo=np.cumsum(counts.get('Jo')),
    Meg=np.cumsum(counts.get('Meg')),
    Laurie=np.cumsum(counts.get('Laurie')),
    Chapter=np.arange(1, 49, 1)
)

cumulative_counts.plot(x='Chapter')

plt.title('Cumulative Number of Times Each Name Appears', y=1.08);
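
For intuition about the cumulative counts, here is a small sketch with made-up per-chapter counts for a single name. Each entry of `np.cumsum`'s result is the total so far, which is why a flat stretch in the plot suggests a character is barely mentioned in those chapters, while a steep stretch suggests frequent mentions.

```python
import numpy as np

# Made-up per-chapter counts for one name (illustrative only).
toy_counts = np.array([3, 0, 5, 2])

# Each entry is the sum of all counts up to and including that position.
running_total = np.cumsum(toy_counts)

print(running_total)  # running totals: 3, 3, 8, 10
```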

In [None]:
# Interactive version of the previous plot

cumulative_counts_df = (
    cumulative_counts.drop(columns=['Chapter']).to_df().melt()
    .rename(columns={'variable': 'name', 'value': 'count'})
)
cumulative_counts_df = cumulative_counts_df.assign(chapter=list(range(1, 49)) * 5)
px.line(cumulative_counts_df, x='chapter', y='count', color='name',
        width=900, height=600, title='Cumulative Number of Times Each Name Appears')
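
The interactive plot needs the data in "long" form, with one row per name and chapter. The sketch below shows that reshaping step on a tiny made-up table, using plain `pandas` (which is what `.to_df()` produces in the cell above); the numbers and the name `wide` are invented for illustration.

```python
import pandas as pd

# Made-up cumulative counts for two names over three chapters.
wide = pd.DataFrame({'Amy': [1, 4, 6], 'Jo': [2, 2, 9]})

# melt stacks the columns into rows: one row per (column name, value) pair,
# using the default column names 'variable' and 'value', renamed here.
long = wide.melt().rename(columns={'variable': 'name', 'value': 'count'})

# melt does not keep the row position, so the chapter numbers are
# re-attached by repeating 1..3 once for each name.
long = long.assign(chapter=list(range(1, 4)) * 2)

print(long)
```

With the data in this shape, a plotting call can draw one line per value in the `name` column.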

### Discussion Question

In Chapter 32, Jo moves to New York alone. Her relationship with which sister suffers the most from this faraway move?

A. Amy

B. Beth

C. Meg

### To answer, go to **[menti.com](https://menti.com)** and enter the code **9731 1635**.

### Discussion Question

Laurie is a man who marries one of the sisters at the end. Which one?


A. Amy

B. Beth

C. Jo

D. Meg

### To answer, go to **[menti.com](https://menti.com)** and enter the code **9731 1635**.

### Next time

On Wednesday, we'll look at the question of when we can assign a cause to an effect. 