In [None]:
# Imports
import babypandas as bpd
import numpy as np

import plotly.express as px
import matplotlib.pyplot as plt
plt.style.use('ggplot')

# Lecture 1 – Introduction

## DSC 10, Fall 2023

### Welcome to DSC 10! 👋
- DSC 10 is a guided tour of data science.
    - It was developed by UC Berkeley in 2015 and adapted by UCSD in 2017.
- You'll learn just enough programming and statistics to do data science.
    - We'll cover statistics without too much math – instead, we'll use simulation.
    - This class lays the foundation for all other courses in the DSC major.

### Agenda

- Course staff.
- What is data science?
- How will this course run?
- Fun demo.

## Course staff

### Instructor: Rod Albuyeh (call me Rod)
- PhD in Political Science from USC.
- FinTech and HealthTech data scientist for 8 years, on corporate sabbatical 🙂.
- Teaching at HDSI, GPS (forthcoming), and USD's applied AI master's program.
    - 1st time teaching DSC 10! 
    - Also taught DSC 102, DSC 40A.
- Outside interests: training mixed martial arts, tinkering with ML things, international politics, and retro gaming.

<center>
    <table><tr>
        <td> <img src="images/victory_mma.jpg" width=400>  </td>
        <td> <img src="images/robocar.jpg" width=300> </td>
    </tr></table>
</center>

### Instructor: Janine Tiefenbruck (call me Janine) 
- BS in Math and Computer Science at Loyola Maryland, PhD in Math (combinatorics) at UCSD 🔱.
- Teaching at UCSD: Math ➡️ CSE ➡️ DSC.
    - 10th time teaching DSC 10!
    - Also teach DSC 40A often.
- Outside interests: crafting, board games, hiking, baking 🎂.

<center>
    <table><tr>
        <td> <img src="images/camping.jpg" width=400>  </td>
        <td> <img src="images/desserts.jpg" width=220> </td>
        <td> <img src="images/kids.jpg" width=300> </td>
    </tr></table>
</center>

### Instructor: Suraj Rampure (call me Suraj, pronounced "sooh-rudge")

- Originally from Windsor, ON, Canada 🇨🇦.
- BS and MS in Electrical Engineering and Computer Sciences from UC Berkeley 🐻.
- Third year teaching in the Halıcıoğlu Data Science Institute at UCSD.
    - 5th time teaching DSC 10.
    - Also running the [senior capstone program](https://dsc-capstone.org) for the second time.
    - Previously taught DSC 40A, 80, 90, and 95.
- Outside interests: [travelling](https://my.flightradar24.com/surajrampure), hiking, watching basketball, FaceTiming my dog 🐶, etc.

<center><img src="images/suraj-summer.png" width=60%></center>

### Course staff

In addition, we have several other course staff members who are here to support you in discussion, office hours, and online.

- **1 graduate TA**: Arya Rahnama.
- **29 undergraduate tutors**: Oren Ciolli, Nate Del Rosario, Jack Determan, Sophia Fang, Charlie Gillet, Ashley Ho, Henry Ho, Vanessa Hu, Leena Kang, Norah Kerendian, Anthony Li, Weiyue Li, Jasmine Lo, Arjun Malleswaran, Mert Ozer, Aaron Rasin, Chandiner Rishi, Gina Roberg, Harshi Saha, Keenan Serrao, Abel Seyoum, Suhani Sharma, Yutian Shi, Ester Tsai, Bill Wang, Ylesia Wu, Jason Xu, Diego Zavalza, Ciro Zhang.
- **1 stuffed panda mascot**: Baby Panda. 🐼

Learn more about them at [dsc10.com/staff](https://dsc10.com/staff), and come say hi at the Meet the Professors mixer from 11:30AM-12:30PM in the patio of the [HDSI building](https://map.concept3d.com/?id=1005#!m/246301).

## What is "data science"? 🤔

<center><img src='images/data-science.png' width=1250>Everyone seems to have their own definition of data science.</center>

### What is "data science"?

Data science is about **drawing useful conclusions from data using computation**. Throughout the quarter, we'll touch on several aspects of data science:

- First 4 weeks: use Python to **explore** data.
    - Lots of visualization 📈📊 and "data manipulation", using industry-standard tools.

- Next 4 weeks: use data to **infer** about a population, given just a sample.
    - Rely heavily on simulation, rather than formulas.

- Last 2 weeks: use data from the past to **predict** what may happen in the future.
    - A taste of machine learning 🤖.
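
To make "simulation, rather than formulas" concrete, here is a minimal sketch. The scenario (flipping a fair coin 10 times and asking how often we see at least 8 heads) and all variable names are assumptions made up for illustration; they are not taken from a course assignment. Instead of working out the probability with a formula, we repeat the experiment many times on the computer and look at the proportion of repetitions in which the event happened.

```python
import numpy as np

# Fixed seed (an arbitrary choice) so that the run is repeatable.
rng = np.random.default_rng(seed=10)

# Simulate 10,000 repetitions of flipping a fair coin 10 times.
# Each row is one repetition; 0 means tails and 1 means heads.
flips = rng.integers(0, 2, size=(10_000, 10))
heads_per_repetition = flips.sum(axis=1)

# Proportion of repetitions with at least 8 heads.
estimate = np.mean(heads_per_repetition >= 8)
print(estimate)
```

The estimate should land close to the exact answer, 56/1024 (about 0.055), and we got there without a probability formula.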

### Data science is relevant 🤧

We spent years looking at graphs like this:

<center><img src='images/covid.png' width=65%></center>

### It can be fun, too!

The site [The Pudding](https://pudding.cool) is home to several interactive data-rich articles.

<center><img src='images/vocab.png' width=75%>(<a href="https://pudding.cool/projects/vocabulary">source</a>)</center>

<center><img src='images/map.png' width=65%>(<a href="https://pudding.cool/2023/03/same-name/">source</a>)</center>

## Course logistics

### Course website

The course website is your one-stop shop for all things related to the course.

<center><h3><a href="https://dsc10.com">dsc10.com</a></h3></center>

This is where lectures, homeworks, labs, discussions, and all other content will be posted. Check it often, and **read the [syllabus](https://dsc10.com/syllabus)**!

### Getting set up

- **Ed**: Q&A forum. All announcements will be made here. You should have gotten an email invitation; if not, [join here](https://edstem.org/us/join/2j8EjH).
- **Gradescope**: Where you will submit all assignments, and where all of your grades will live. You should have been automatically added; contact us on Ed if not.
- **DataHub**: Where you will access and run all code in this class. Access at [datahub.ucsd.edu](https://datahub.ucsd.edu). More next time.
- We will **not** be using Canvas for anything!

### First tasks

1. Fill out the required [Welcome Survey](https://forms.gle/LBYWU9WMD2SRDz458) as soon as possible. 
2. Take the [pretest](https://practice.dsc10.com/pretest/), which is recommended to help you gauge your preparedness, brush up on prerequisite knowledge, and learn test-taking skills. We'll release the solutions on Monday.

### Lecture

- Lectures will be in-person and recorded for viewing afterwards.
    - You can attend any lecture section, as long as there is space for the students officially enrolled in that section.
    - Recordings can be found at [podcast.ucsd.edu](https://podcast.ucsd.edu).
- Slides/code from lecture will be linked on the course website, both in a "runnable" code format and as an HTML file (✏️), which you can save as a PDF and annotate on your tablet.
- We will try to make lectures engaging. **Bring your laptop or tablet**, if you have one.

### Concept Check ✅ – Answer at [cc.dsc10.com](http://cc.dsc10.com) 

**Is it acceptable to recline your seat on an airplane?**

<center><img src='images/reclining.jpeg' width=35%></center>

A. Yes, you paid for the seat! 

B. Only if the person in front of you reclined their seat first.

C. Only if you ask the person behind you and they're fine with it.

D. No, it's rude.

_(We are always going to use the same link for Concept Checks, so you should bookmark it.)_

### Discussion

- Discussion sections are designed to give you practice with the **conceptual ideas** in the course and to prepare you for exams.
    - All assignments in this class will be done on the computer using code, but exams and quizzes are on paper and in person.
    - In discussion, you will work through **past exam problems** (see [practice.dsc10.com](https://practice.dsc10.com)). 
    - Problem sets are posted online, so bring a computer or tablet to access them. But like exams, you will answer the problems **on paper**.
    - Discussion problem sets aren't submitted anywhere.
- Discussions are not podcasted; you need to be an active participant in discussion to benefit from it.
- There will be **four quizzes** throughout the quarter, administered in discussion section, which will help prepare you for exams and encourage you to review material regularly.

### Discussion schedule

There are three discussion sections:

- Section A: Wednesday 3-3:50PM in Pepper Canyon Hall 109
- Section B: Wednesday 4-4:50PM in Pepper Canyon Hall 109
- Section C: Wednesday 5-5:50PM in Mandeville B-210

Students in Sections A, B, and C must attend the discussion section that corresponds to the lecture section they are enrolled in. 
- Students in Section D will be assigned to one of the other sections.
- On the [Welcome Survey](https://forms.gle/LBYWU9WMD2SRDz458), let us know if you have a conflict and we will do our best to reassign you to a different discussion section.
- Students in Section D or with conflicts will hear back from us by Monday with their confirmed discussion section.

<div class="alert alert-block alert-danger">
<b>In the Schedule of Classes, this course is listed as having both a discussion section (DI) and a lab section (LA), but we will only have one weekly meeting outside of lecture, which we'll refer to as a discussion section, scheduled at the time listed above. You should ignore what you see as DI or LA on WebReg and just use the schedule above instead.</b>
</div>

### Labs

- Labs refer to **lab assignments**, which are a required part of the course and help you develop fluency in Python and working with data.
- While working on labs, you'll be able to run **autograder tests** which tell you if your answers are correct.
    - For labs, if you pass all autograder tests, you will get 100\%!
- You must submit labs individually, but you can discuss ideas with others (no sharing code).
- All assignments, including labs, will be due at **11:59PM** on the due date and submitted to Gradescope.
- The first lab (due Thursday, October 5) will have submission instructions.

### Homeworks and projects

- Weekly homework assignments build off of skills you develop in labs.
- A key difference between homeworks and labs is that **passing autograder tests does not guarantee a perfect score!**
    - In homeworks, we have "hidden tests" that are only run after you submit the assignment.
    - The tests that are available to you within the assignment itself only verify that your answer is reasonable/on the right track.
- Again, you must work on homeworks yourself, but you can discuss ideas with other students (no sharing code).
- In the **Midterm Project** and **Final Project**, you will do a deep dive into a dataset! Projects are longer than homeworks, so we give you more time to work on them.
    - You can work on projects with partners, following these [project partner guidelines](https://dsc10.com/project-partners). Both of you should actively contribute to **all parts** of the project.

### Exams

We will have two exams this quarter.
- **Midterm Exam**: Monday, October 30, during your registered lecture slot.
- **Final Exam**: Saturday, December 9, 7-10PM.

Both exams will be conducted **in person and on paper**. Let us know if you have a conflict on the [Welcome Survey](https://forms.gle/LBYWU9WMD2SRDz458).

### Readings and resources

- We will draw readings from two sources. Readings for each lecture will be posted on the course homepage.
    - [Computational and Inferential Thinking (CIT)](https://inferentialthinking.com), the textbook created for Berkeley's version of this course.
    - [`babypandas` notes](https://notes.dsc10.com), written specifically for the first part of DSC 10.
- The [Resources](https://dsc10.com/resources) tab of the course website contains links to helpful resources that you'll want to use throughout the course (e.g. DSC 10 Reference Sheet, programming tutorials, supplemental videos).
- The [Debugging](https://dsc10.com/debugging) tab of the course website has answers to many common technical issues.

### Weekly schedule for the first half of the quarter

| Sunday | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| | Lecture | | Lecture | | Lecture | |
| | | | Discussion | | | |
| | |  | | **Lab due** | | **HW due** |

The deadlines in the second half are significantly different. Always refer to the [course website](https://dsc10.com/) for the schedule.

### First assignment
- Lab 0 is due **Thursday, October 5 at 11:59PM**.
    - It will be released by tomorrow. Check Ed for an announcement.
- <span style='color:red'><b>🚨 Important: Start early and submit often</b>.</span>

### Getting help
This is a tough, fast-paced course, but we're here to help you – here's how:

- **Office Hours (OH)**.
    - Not held in an office – rather, held in a large open study space (HDSI 155).
    - Come with questions, or just to work!
    - See the schedule and instructions on the [📆 Calendar](https://dsc10.com/calendar).
- **Ed**.
    - Post here with any logistical or conceptual questions; **please don't email**.
    - No code or solutions in public posts. Such posts should be private to course staff.
    - Otherwise, post publicly (anonymously, if you'd like).
- <span style="color:red;"><b>🚨 Important: Use these to your advantage!</b></span>

### Advice from previous students

At the end of each quarter, we ask DSC 10 students to give advice to future students in the course. Here are some responses from last year's students:

> Start the assignments early, every time that I started an assignment the day or even night of, I always struggled and the added pressure of not getting it in on time didn't help me one bit. The times that I started a day or two in advance, even if it was just completing a couple problems in advance, I felt way more relaxed and in turn I learned and retained a lot more.

> Pay attention in lectures and to begin both labs and homework early because they will pile up. The lectures are very helpful references to use if you’re stuck during labs and homework’s and office hours are incredibly useful so go!!!

> Use TA's and office hours as much as possible, also the reference sheet was crucial. 

> Office hours are really helpful, all the tutors knew what they were doing and could were able to help me work through any of the problems I got stuck on 

### Academic Integrity policies

#### Collaboration
- Discuss all questions with each other (except, of course, on quizzes and exams).
- Projects are submitted in pairs or individually. Both partners should contribute to all parts of the project, not split it up.
- Labs and homeworks are submitted individually.
- No other person should complete your work for you or write any of the code you submit in this course, with the exception of the work you do with a project partner.
- Don't give someone else your code or look at someone else's code.



#### Generative Artificial Intelligence (GenAI)
- The syllabus includes a [discussion of these tools](https://dsc10.com/syllabus/#use-of-generative-artificial-intelligence) and how you may use them in this class. Please read this carefully, ask questions about it, and proceed with care!

### We're here for you!

Regardless of your background, you can succeed in this course. **No prior programming or statistics experience will be assumed!**

Watch on YouTube: [We’re All Data Scientists | Rebecca Nugent | TEDxCMU](https://www.youtube.com/watch?v=YMnqPTLoj7o).

### Campus resources

Counseling and Psychological Services (CAPS) is a campus unit that offers “short term counseling for academic, career, and personal issues and also offers psychiatry services for circumstances when medication can help with counseling.”
If you or anyone you know is ever in need of mental health care, you should contact CAPS.

<center><h3><a href="https://caps.ucsd.edu/">caps.ucsd.edu</a></h3></center>

## Demo

### _Little Women_ (1868)

- _Little Women_, by Louisa May Alcott, is a novel that follows the lives of four sisters – Meg, Jo, Beth, and Amy.
    - A movie based on the novel was released in 2019, starring Emma Watson (Meg) and Timothée Chalamet (Laurie).
- Using tools from this class, we'll learn (a bit) about the plot of the book, without reading it.
- Do not worry about any of this code – we'll cover the necessary pieces in the weeks to come. Sit back and relax!

In [None]:
# Read in 'lw.txt' to a variable called little_women_text.
with open('data/lw.txt') as f:
    little_women_text = f.read()

In [None]:
# See the first three thousand characters.
little_women_text[:3000]

In [None]:
# Print the first three thousand characters.
print(little_women_text[:3000])

In [None]:
# Create a variable "chapters" by splitting the text on 'CHAPTER '.
chapters = little_women_text.split('CHAPTER ') 

# Create a DataFrame with one column - the text of each chapter.
bpd.DataFrame().assign(chapters=chapters)
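
To see what splitting does on something small, here is a sketch using a short made-up string (`toy_text` is a stand-in for illustration, not the contents of `lw.txt`). The `.split` method cuts the string at every occurrence of the separator and returns the pieces as a list; the separator itself is thrown away.

```python
# A short stand-in string, invented for illustration.
toy_text = 'Preface CHAPTER ONE Playing Pilgrims CHAPTER TWO A Merry Christmas'

# Cut the string everywhere 'CHAPTER ' appears.
pieces = toy_text.split('CHAPTER ')

print(len(pieces))
print(pieces)
```

Note that the first piece is whatever came before the first `'CHAPTER '`, so a text containing the separator twice produces three pieces.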

In [None]:
# Number of occurrences of each name in each chapter.

counts = bpd.DataFrame().assign(
    Amy=np.char.count(chapters, 'Amy'),
    Beth=np.char.count(chapters, 'Beth'),
    Jo=np.char.count(chapters, 'Jo'),
    Meg=np.char.count(chapters, 'Meg'),
    Laurie=np.char.count(chapters, 'Laurie'),
)
counts
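
As a small illustration of what `np.char.count` does, here is a sketch on a made-up array of three short "chapters" (`toy_chapters` is hypothetical, not taken from the novel). For each string in the array, it counts how many times the given text appears.

```python
import numpy as np

# Three made-up "chapters", invented for illustration.
toy_chapters = np.array(['Jo met John.', 'Meg and Jo and Jo.', 'Beth sang.'])

# One count per element of the array.
jo_counts = np.char.count(toy_chapters, 'Jo')
print(jo_counts)
```

The counts are 2, 2, and 0. The first is 2 rather than 1 because the match is on raw text, not whole words: the `'Jo'` inside `'John'` is counted too. This is worth keeping in mind when reading counts for short names.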

In [None]:
# Cumulative number of times each name appears.

cumulative_counts = bpd.DataFrame().assign(
    Amy=np.cumsum(counts.get('Amy')),
    Beth=np.cumsum(counts.get('Beth')),
    Jo=np.cumsum(counts.get('Jo')),
    Meg=np.cumsum(counts.get('Meg')),
    Laurie=np.cumsum(counts.get('Laurie')),
    Chapter=np.arange(1, 49, 1)
)

cumulative_counts
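
Here is a sketch of what `np.cumsum` computes, using a made-up array of per-chapter counts (`per_chapter` is hypothetical). Each entry of the result is the sum of all the values up to and including that position, which is a running total.

```python
import numpy as np

# Made-up counts for four chapters, invented for illustration.
per_chapter = np.array([3, 0, 5, 2])

# Running total: 3, then 3 + 0, then 3 + 0 + 5, then 3 + 0 + 5 + 2.
running_total = np.cumsum(per_chapter)
print(running_total)
```

The running totals are 3, 3, 8, and 10. The last entry always equals the sum of the whole array.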

In [None]:
# Putting it all together, we get a helpful visualization.
cumulative_counts_df = cumulative_counts.drop(columns=['Chapter']).to_df().melt().rename(columns={'variable': 'name', 'value': 'Count'})
cumulative_counts_df = cumulative_counts_df.assign(Chapter=list(range(1, 49)) * 5)
px.line(cumulative_counts_df, x='Chapter', y='Count', color='name', width=900, height=600, title='Cumulative Number of Times Each Name Appears', template='ggplot2')

- In Chapter 32, Jo moves to New York alone. Her relationship with which sister suffers the most from this faraway move?

- Laurie is a man who marries one of the sisters at the end. Which one?


### Next time

On Monday, we'll start programming in Python 🐍. Remember to bring a laptop or tablet if you have one.