In [None]:
# Imports
import babypandas as bpd
import numpy as np

import plotly.express as px
import matplotlib.pyplot as plt
plt.style.use('ggplot')

# Lecture 1 – Introduction

## DSC 10, Fall 2022

### Welcome to DSC 10! 👋
- A guided tour of data science.
    - Developed by UC Berkeley in 2015.
    - Adapted by UC San Diego in 2017.
- Learn just enough programming and statistics to do data science.
    - Statistics without too much math, mostly simulation.
    - Lays the foundation for all other courses in the DSC major.

### Agenda

- Who are we?
- What is data science?
- How will this course run?
- Literature demo.

## About the instructors 👨‍🏫👨‍🏫👩‍🏫

Come to the **Meet the Professors** event today from 2-3PM! Meet outside the Hopkins Drive entrance of the San Diego Supercomputer Center. There will be cookies! 🍪

### Suraj Rampure (call me Suraj, pronounced “soo-rudge”)

- Originally from Windsor, ON, Canada 🇨🇦.
- BS (’20) and MS (’21) in Electrical Engineering and Computer Sciences from UC Berkeley 🐻.
- Second year teaching Data Science at UCSD.
    - 3rd time teaching DSC 10.
    - Also running the senior capstone program this year. 
    - Previously taught DSC 40A, DSC 80, and DSC 90.
- Outside the classroom: watching basketball, traveling, learning to cook, watching TikTok, FaceTiming my dog 🐶, etc.

<center>
    <table><tr>
        <td> <img src="images/turkey.jpeg">  </td>
        <td> <img src="images/nova.jpeg"> </td>
        <td> <img src="images/junior.jpeg"> </td>
    </tr></table>
</center>

### Dr. Puoya Tabaghi (call me Puoya)

- BS in Electrical and Computer Engineering (ECE) at Amirkabir University of Technology, Tehran.
- MS in ECE at Colorado State University, CO.
- PhD in ECE at University of Illinois at Urbana-Champaign, IL.
- Postdoc, Halıcıoğlu Data Science Institute (HDSI) at UCSD.
    - Collaborate with people in ECE and HDSI.
    - First time teaching at UCSD!
- Outside interests: Movies and more 🙂:
    - Nuts in May (1976).

<center>
    <table><tr>
        <td> <img src="images/puoya-baby.jpg" width=95%>  </td>
    </tr></table>
</center>

### Dr. Janine Tiefenbruck (call me Janine)
- BS in Math and Computer Science at Loyola MD, PhD in Math (combinatorics) at UCSD 🔱.
- Teaching at UCSD: Math ➡️ CSE ➡️ DSC.
    - 8th time teaching DSC 10!
    - Also teach DSC 40A often.
- Outside interests: crafting, board games, hiking, baking 🎂.

<center>
    <table><tr>
        <td> <img src="images/cupcakes.jpg" width=95%>  </td>
        <td> <img src="images/janine-camping.jpg" width=95%> </td>
    </tr></table>
</center>

### Course staff

In addition, we have several other course staff members who are here to support you in discussion, office hours, and online.

- **1 graduate TA**: Dasha Veraksa.
- **20 undergraduate tutors**: Gabriel Cha, Eric Chen, John Driscoll, Daphne Fabella, Charisse Hao, Dylan Lee, Daniel Li, Anthony Li, Anna Liu, Anastasiya Markova, Yash Potdar, Harshita Saha, Selim Shaalan, Yutian (Skylar) Shi, Tony Ta, Vineet Tallavajhala, Andrew Tan, Jiaxin Ye, Tiffany Yu, and Diego Zavalza.

Learn more about them at [dsc10.com/staff](https://dsc10.com/staff).

## What is "data science"? 🤔

<center><img src='images/what-is-ds.png' width=1250>Everyone seems to have their own definition of data science.</center>

### What is "data science"?

Data science is about **drawing useful conclusions from data using computation**. Throughout the quarter, we'll touch on several aspects of data science:

- First 4 weeks: use Python to **explore** data.
    - Lots of visualization 📈📊 and "data manipulation", using industry-standard tools.

- Next 4 weeks: use data to **infer** properties of a population, given just a sample.
    - Rely heavily on simulation, rather than formulas.

- Last 2 weeks: use data from the past to **predict** what may happen in the future.
    - A taste of machine learning 🤖.

### Data science is more relevant than ever

We've spent the last 2 and a half years looking at graphs like this:

<center><img src='images/covid-sep21.png' width=75%></center>

### It can be fun, too!

<center><img src='images/wordle-moving-average.png' width=75%></center>

Moving average of the average number of guesses taken for each Wordle word, based on patterns shared on Twitter. ([source](https://observablehq.com/@rlesser/wordle-twitter-exploration))
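
For readers curious what a "moving average" is, here is a minimal sketch. The numbers in `daily_avg_guesses` are made up for illustration, and the window size of 3 is an assumption; the chart above does not state the window it uses.

```python
import numpy as np

# Hypothetical data: average number of guesses on five consecutive days.
daily_avg_guesses = np.array([4.0, 3.0, 5.0, 4.0, 6.0])

# Assumed window size; each output value averages this many consecutive days.
window = 3

# Slide the window across the data and take the mean of each slice.
moving_avg = np.array([
    daily_avg_guesses[i:i + window].mean()
    for i in range(len(daily_avg_guesses) - window + 1)
])

print(moving_avg)  # [4. 4. 5.]
```

Averaging over a window smooths out day-to-day jumps, which makes longer-term trends easier to see.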

## Course logistics

### Course website

The course website is your one-stop-shop for all things related to the course.

<center><h3><a href="https://dsc10.com">dsc10.com</a></h3></center>

This is where lectures, homeworks, labs, discussions, and all other content will be posted. Check it often, and **read the [syllabus](https://dsc10.com/syllabus)**!

### Getting set up

- **EdStem**: Q&A forum. All announcements will be made here. You should have gotten an email invitation; if not, there's a link on the syllabus.
- **Gradescope**: Where you will submit all assignments for autograding, and where all of your grades will live. You should have been automatically added; contact us if not.
- **DataHub**: Where you will access and run all code in this class. Access at [datahub.ucsd.edu](https://datahub.ucsd.edu). More on Wednesday.
- We will **not** be using Canvas for anything!

You must also fill out our [Beginning of Quarter Survey](https://forms.gle/PcQ2dMZmunReUKyN9).

### Lecture

- Lectures will be in-person and recorded for viewing afterwards.
    - You can attend any of the 4 sections, as long as there is space for the students officially enrolled in that section.
    - Attendance will never be required (but is strongly encouraged!).
    - Recordings can be found at [podcast.ucsd.edu](https://podcast.ucsd.edu).
- Slides/code from lecture will be linked on the course website, both in a "runnable" form (💻) and as an HTML file (✏️), which you can save as a PDF and annotate on your tablet.
- We will try to make lectures engaging. **Bring your laptop or tablet**, if you have one.

### Concept Check ✅ – Answer at [cc.dsc10.com](http://cc.dsc10.com) 

**How many of the following food items qualify as a sandwich?**

<center><img src='images/foods.png' width=30%></center>

A. 0 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
B. 1 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
C. 2 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
D. 3 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
E. 4 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

_(We are always going to use the same link for Concept Checks, so you should bookmark it.)_



### Discussion

- Discussion sections are designed to give you practice with the **conceptual ideas** in the course.
    - All assignments in this class will be done on the computer using code, but the exams are on-paper and in-person.
    - For each discussion, we've prepared a problem set, **made up of old exam problems** (see [practice.dsc10.com](https://practice.dsc10.com)). 
    - Problem sets are posted online, so bring a computer or tablet to access them. But like exams, you will answer the problems **on paper**.
    - Discussion problem sets aren't submitted anywhere.
- Attendance is not required, but **extra credit** is provided for attending.
    - If you do attend, you must attend the discussion section you're enrolled in (but students in Lecture D00 can attend any one).
    - Like lectures, discussions will be podcasted.

### Labs

- <span style="color:red"><b>🚨 Important: We will not be using the Wednesday "Lab" time that appears on WebReg!</b></span>
- Instead, "labs" refer to **lab assignments**. Labs are a required part of the course and help you develop fluency in Python and working with data.
- While working on labs, you'll be able to run **autograder tests** which tell you if your answers are correct.
    - For labs, if you pass all autograder tests, you will get 100\%!
- You must submit labs individually, but you can discuss ideas with others (no sharing code).
- Labs are due on **Saturdays at 11:59PM** to Gradescope. The first lab (due on Saturday, October 1st) will have submission instructions.

### Homeworks and projects

- Weekly homework assignments build off of skills you develop in labs.
- A key difference between homeworks and labs is that **passing autograder tests does not guarantee a perfect score!**
    - In homeworks, we have "hidden tests" that are only run after you submit the assignment.
    - The tests that are available to you within the assignment itself only verify that your answer is reasonable/on the right track.
- Again, you must work on homeworks yourself, but you can discuss ideas with other students (no sharing code).
- Homeworks are due on **Tuesdays at 11:59PM** to Gradescope.
- In the **Midterm Project** and **Final Project**, you will do a deep dive into a dataset! Projects are longer than homeworks, so we give you more time to work on them.
    - This quarter's projects: Spotify Charts 🎵 and Marvel vs. DC 🦸‍♀️.
    - You can work on projects with partners, using ["pair programming."](https://dsc10.com/pair-programming/)

### Exams

We will have two exams this quarter.
- **Midterm Exam**: Friday, October 28th, during your scheduled lecture time.
- **Final Exam**: Saturday, December 3rd, 11:30AM-2:30PM.
- Both exams will be conducted in person and on paper. Let us know if you have a conflict in the [Beginning of Quarter Survey](https://forms.gle/PcQ2dMZmunReUKyN9).

### Readings and resources

- We will draw readings from two sources. Readings for each lecture will be posted on the course homepage.
    - [Computational and Inferential Thinking (CIT)](https://inferentialthinking.com), the textbook created for Berkeley's version of this course.
    - [`babypandas` notes](https://notes.dsc10.com), written specifically for the first part of DSC 10.
- The [Resources](https://dsc10.com/resources) tab of the course website contains links to helpful resources that you'll want to use throughout the course (e.g. DSC 10 Reference Sheet, programming tutorials, supplemental videos).
    - You should bookmark the [DSC 10 Reference Sheet](https://drive.google.com/file/d/1mQApk9Ovdi-QVqMgnNcq5dZcWucUKoG-/view)!
- The [Debugging](https://dsc10.com/debugging) tab of the course website has answers to many common technical issues.

### A typical week in DSC 10

| Sunday | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| Nothing! 😎 | Lecture | | Lecture | | Lecture | |
| | Discussion | | | | | |
| | | **Homework due** | | | | **Lab due** |

See the [Syllabus](https://dsc10.com/syllabus) for more details.

### First assignments
- Lab 1 is due **Saturday, October 1st at 11:59PM**.
    - Will be released tomorrow.
- Homework 1 is due **Tuesday, October 4th at 11:59PM**.
    - Will be released by this Tuesday.
- Generally, do the lab before you do the homework!
- <span style='color:red'><b>🚨 Important: Start early and submit often</b>.</span>

### Getting help
This is a tough, fast-paced course, but we're here to help you – here's how:

- **Office Hours (OH)**.
    - Remote and in-person throughout the week.
    - See the schedule and Zoom link on the [Calendar 📆](https://dsc10.com/calendar).
- **EdStem**.
    - Post here with any logistical or conceptual questions (please don't email).
    - No code or solutions in public posts. Such posts should be private to instructors + staff.
    - Otherwise, post publicly (anonymously, if you'd like).
- <span style="color:red;"><b>🚨 Important: Use these to your advantage!</b></span>

### Opportunity: Python bootcamp

Diversity in Data Science, a student organization, is running a one-week Python bootcamp specifically **for students in DSC 10 with no prior programming experience**. 

It starts **on Monday**. Register by Sunday using the QR code below.

<center><img src='images/bootcamp-qr.png' width=25%></center>

### Advice from previous students

At the end of each quarter, we ask DSC 10 students to give advice to future students in the course. Here are some responses:

> "Go to office hours! Get a partner for the project even if you don't want to. If you don't understand a topic try the following: go to office hours, ask on [EdStem], check the [readings], look at lecture notes. Start on the assignments early try and finish 2 days early to check your work."

> "I would give the advice to attend the discussions and office hours whenever possible, as a lot of the times I found myself learning new things even when I didn't come with a question ready."

> "Do NOT fall behind in lectures. It becomes very difficult to catch up on the concepts. Go to your discussion section! Hearing a concept explained once can be difficult to understand so discussion section was extremely helpful."

> "GO TO OFFICE HOURS!  It's very important to let your voice out. Talk to the professor after lectures, attend office hours, post your questions and ask a question."

### Collaboration

#### Asking questions is highly encouraged!
- Discuss all questions with each other (except exams).
- Submit lab assignments individually, but you can work with others (no sharing code).
- Submit homeworks individually, but you can discuss problem-solving strategies with others (no sharing code).
- Submit projects individually or in pairs using pair programming.

#### The limits of collaboration:
- Don't share solutions with each other or look at someone’s code.
- Project partners should work together and be physically in the same place (or same Zoom call). Don't split up the project.
- Academic integrity violations usually result in failing the course. 

### We're here for you!

Regardless of your background, you can succeed in this course. **No prior programming or statistics experience will be assumed!**

Watch on YouTube: [We’re All Data Scientists | Rebecca Nugent | TEDxCMU](https://www.youtube.com/watch?v=YMnqPTLoj7o).

### Campus resources

Counseling and Psychological Services (CAPS) is a campus unit that offers “short term counseling for academic, career, and personal issues and also offers psychiatry services for circumstances when medication can help with counseling.”
If you or anyone you know is ever in need of mental health care, you should contact CAPS.

<center><h3><a href="https://caps.ucsd.edu/">caps.ucsd.edu</a></h3></center>

## Demo

### _Little Women_ (1868)

- _Little Women_, by Louisa May Alcott, is a novel that follows the lives of four sisters – Meg, Jo, Beth, and Amy.
- Using tools from this class, we'll learn (a bit) about the plot of the book, without reading it.
- Do not worry about any of this code – we'll cover the necessary pieces in the weeks to come. Sit back and relax!

In [None]:
# Read in 'lw.txt' to a variable called "little_women_text"
little_women_text = open('data/lw.txt').read()

In [None]:
# See the first three thousand characters
little_women_text[:3000]

In [None]:
# Print the first three thousand characters
print(little_women_text[:3000])

In [None]:
# Create a variable "chapters" by splitting the text on 'CHAPTER '
chapters = little_women_text.split('CHAPTER ') 

# Create a DataFrame with one column -- the chapters
bpd.DataFrame().assign(chapters=chapters)
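
To see what `split` does on a smaller scale, here is a minimal sketch using a made-up string (`toy_text` is hypothetical and not taken from the book). The delimiter itself is removed, and the first piece is whatever text comes before the first `'CHAPTER '`.

```python
# Toy text standing in for the novel (hypothetical, not the real book).
toy_text = 'PREFACE CHAPTER ONE Jo ran. CHAPTER TWO Amy drew.'

# Split wherever 'CHAPTER ' appears; the delimiter is dropped from the pieces.
toy_chapters = toy_text.split('CHAPTER ')

print(len(toy_chapters))  # 3
print(toy_chapters)       # ['PREFACE ', 'ONE Jo ran. ', 'TWO Amy drew.']
```

Two delimiters produce three pieces, so the number of pieces is always one more than the number of times the delimiter appears.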

In [None]:
# Counts of names in the chapters of Little Women

counts = bpd.DataFrame().assign(
    Amy=np.char.count(chapters, 'Amy'),
    Beth=np.char.count(chapters, 'Beth'),
    Jo=np.char.count(chapters, 'Jo'),
    Meg=np.char.count(chapters, 'Meg'),
    Laurie=np.char.count(chapters, 'Laurie'),
)
counts
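
As a rough illustration of what `np.char.count` computes, the sketch below counts names in three made-up "chapters" (the strings are hypothetical). It counts substrings rather than whole words, so a name like `'Jo'` is also counted inside longer words such as `'John'`.

```python
import numpy as np

# Hypothetical mini "chapters", not text from the book.
toy_chapters = np.array(['Jo met Amy. Jo laughed.', 'Amy drew.', 'Nobody spoke.'])

# Count how many times each name appears in each string.
jo_counts = np.char.count(toy_chapters, 'Jo')
amy_counts = np.char.count(toy_chapters, 'Amy')

print(jo_counts)   # [2 0 0]
print(amy_counts)  # [1 1 0]

# Substring matching: 'John' contains 'Jo', so this string yields 2 matches.
substring_hits = np.char.count(np.array(['John wrote to Jo.']), 'Jo')
print(substring_hits)  # [2]
```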

In [None]:
# Cumulative number of times each name appears

cumulative_counts = bpd.DataFrame().assign(
    Amy=np.cumsum(counts.get('Amy')),
    Beth=np.cumsum(counts.get('Beth')),
    Jo=np.cumsum(counts.get('Jo')),
    Meg=np.cumsum(counts.get('Meg')),
    Laurie=np.cumsum(counts.get('Laurie')),
    Chapter=np.arange(1, 49, 1)
)

cumulative_counts.plot(x='Chapter', figsize=(10, 5))

plt.title('Cumulative Number of Times Each Name Appears', y=1.08);
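
The cumulative counts come from `np.cumsum`, which keeps a running total. A minimal sketch with made-up per-chapter counts (the values in `per_chapter` are hypothetical):

```python
import numpy as np

# Hypothetical number of times a name appears in each of four chapters.
per_chapter = np.array([3, 0, 2, 5])

# Each entry is the sum of all counts up to and including that chapter.
running_total = np.cumsum(per_chapter)

print(running_total)  # [ 3  3  5 10]
```

A chapter with zero mentions leaves the running total unchanged, which shows up as a flat stretch in a cumulative plot; a steep stretch means the name appears often.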

In [None]:
# Interactive version of the previous plot

cumulative_counts_df = cumulative_counts.drop(columns=['Chapter']).to_df().melt().rename(columns={'variable': 'name', 'value': 'count'})
cumulative_counts_df = cumulative_counts_df.assign(chapter = list(range(1, 49)) * 5)
px.line(cumulative_counts_df, x='chapter', y='count', color='name', width=900, height=600, title='Cumulative Number of Times Each Name Appears')
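
The interactive plot needs the data in "long" form, with one row per (name, chapter) pair, which is what `melt` produces. The sketch below uses plain `pandas` on a tiny made-up table (`wide` and its values are hypothetical) to show the reshaping step.

```python
import pandas as pd

# Hypothetical "wide" table: one column per name, one row per chapter.
wide = pd.DataFrame({'Amy': [1, 3], 'Jo': [2, 6]})

# melt stacks the columns on top of each other, one after another.
long = wide.melt().rename(columns={'variable': 'name', 'value': 'count'})

# The chapter numbers repeat once for each name.
long = long.assign(chapter=[1, 2] * 2)

print(long)
```

Because `melt` stacks one full column at a time, the chapter numbers `1, 2` have to be repeated once per name to line up with the rows.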

### Concept Check ✅ – Answer at [cc.dsc10.com](http://cc.dsc10.com) 

In Chapter 32, Jo moves to New York alone. Her relationship with which sister suffers the most from this faraway move?

A. Amy

B. Beth

C. Meg

### Concept Check ✅ – Answer at [cc.dsc10.com](http://cc.dsc10.com) 

Laurie is a man who marries one of the sisters at the end. Which one?


A. Amy

B. Beth

C. Jo

D. Meg

### Next time

On Monday, we'll learn all about cause and effect, before moving on to Python on Wednesday.

### Meet the Professors 

Come say hi and eat cookies 🍪 with us! Meet us 2-3PM outside the Hopkins Drive entrance of the San Diego Supercomputer Center.