In [None]:
# Set up packages for lecture. Don't worry about understanding this code,
# but make sure to run it if you're following along.
import numpy as np
import babypandas as bpd
import pandas as pd

# Lecture 28 – Review, Conclusion

## DSC 10, Fall 2023

### Announcements

- The Final Exam is **tomorrow 12/9 from 7-10PM**. See [**this post on Ed**](https://edstem.org/us/courses/48101/discussion/3988059) for more details, including your assigned room and seat (ignore the room that WebReg says!).
- If at least 85% of the class fills out both [**SETs**](https://academicaffairs.ucsd.edu/Modules/Evals/) and the internal [**End-of-Quarter Survey**](https://docs.google.com/forms/d/e/1FAIpQLSeaQYHSzfjHIVnn-XtIxEBjEacddwEVC2bomgkTV_vVM--wCA/viewform) **by tomorrow at 8AM**, then the entire class will have **1% of extra credit added to their overall grade**.
    - As of last night, we were only at ~47% for the End-of-Quarter Survey. You can do it!
- The solutions to the [**Spring 2023 Final Exam**](https://practice.dsc10.com/sp23-final) have been posted; Wednesday's podcasts, which explain most of the exam, are linked at the top.

### Agenda

- More review of old exam problems.
- Working on personal projects.
- Demo: Gapminder 🌎.
- Some parting thoughts.

## More review

### Selected problems

We're going to work on as many of the following problems as we can in class. There's a PDF template linked on the course website that you can write on; we'll post annotated slides after class. 

- [Winter 2023 Final Exam, Problem 16](https://practice.dsc10.com/wi23-final/index.html#problem-16) (CLT and hypothesis testing).
- [Fall 2022 Final Exam, Problem 6](https://practice.dsc10.com/fa22-final/index.html#problem-6) (regression).
- [Spring 2022 Final Exam, Problem 16](https://practice.dsc10.com/sp22-final/index.html#problem-16) (probability).

## Personal projects

### Using Jupyter Notebooks after DSC 10

- You may be interested in working on data science projects of your own.
- In [this video](https://www.youtube.com/watch?v=Hq8VaNirDRQ), we show you how to make blank notebooks and upload datasets of your own to DataHub.
- After this quarter, depending on the classes you're enrolled in, you may not have access to DataHub. Eventually, you'll want to install Jupyter Notebooks on your computer.
    - [Anaconda](https://docs.anaconda.com/anaconda/install/index.html) is a great way to do that, as it also installs many commonly used packages.
    - You may want to download your work from DataHub so you can refer to it after the course ends (though you can look at it on Gradescope too).
    - Remember, all `babypandas` code is regular `pandas` code, too!
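
As a quick illustration of that last point, here is a minimal sketch of DSC 10-style code running on plain `pandas`. The `pets` DataFrame and its values are made up for this example.

```python
import pandas as pd

# Hypothetical data, made up for illustration.
pets = pd.DataFrame({
    'name': ['Ziggy', 'Luna', 'Milo'],
    'species': ['dog', 'cat', 'dog'],
    'weight': [40.0, 9.5, 22.0],
}).set_index('name')

# DSC 10 style: .get to access a column, a Boolean query to filter rows,
# and .groupby followed by an aggregation method.
dogs = pets[pets.get('species') == 'dog']
mean_weights = pets.groupby('species').mean()

# Regular pandas also allows bracket notation for columns,
# which is outside the babypandas subset.
same_column = pets['weight']
```

Everything here other than the last line uses only methods from this course, so the same habits should carry over once you switch the import.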

### Finding data

These sites let you search for datasets (in CSV format) from a variety of domains. All are generally reputable sources, though some may require you to sign up for an account.

Note that all of these links are also available at [rampure.org/find-datasets](https://rampure.org/find-datasets).

- [Data is Plural](https://www.data-is-plural.com/archive/).
- [FiveThirtyEight](https://data.fivethirtyeight.com/).
- [CORGIS](https://corgis-edu.github.io/corgis/csv/).
- [Kaggle Datasets](https://www.kaggle.com/datasets).
- [Google’s dataset search](http://toolbox.google.com/datasetsearch).
- [DataHub.io](https://datahub.io/collections).
- [Data.world](https://data.world/).
- [R datasets](https://vincentarelbundock.github.io/Rdatasets/articles/data.html).
- Wikipedia. (Use [this site](https://wikitable2csv.ggor.de/) to extract and download tables as CSVs.)
- [Awesome Public Datasets GitHub repo](https://github.com/awesomedata/awesome-public-datasets).
- [Links to even more sources](https://rockcontent.com/blog/data-sources/).

### Domain-specific sources of data

- Sports: [Basketball Reference](https://www.basketball-reference.com/), [Baseball Reference](https://www.baseball-reference.com/), etc.
- US Government Sources: [census.gov](https://www.census.gov/data/tables.html), [data.gov](https://www.data.gov/), [data.ca.gov](https://data.ca.gov/), [data.sfgov.org](https://data.sfgov.org/browse?), [FBI’s Crime Data Explorer](https://crime-data-explorer.fr.cloud.gov/), [Centers for Disease Control and Prevention](https://data.cdc.gov/browse?category=NCHS).
- Global Development: [data.worldbank.org](https://data.worldbank.org/), [databank.worldbank.org](https://databank.worldbank.org/home.aspx), [WHO](https://apps.who.int/gho/data/node.home).
- Transportation: [New York Taxi trips](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page), [Bureau of Transportation Statistics](https://www.transtats.bts.gov/DataIndex.asp), [SFO Air Traffic Statistics](https://www.flysfo.com/media/facts-statistics/air-traffic-statistics).
- Music: [Spotify Charts](https://spotifycharts.com/regional).
- COVID: [Johns Hopkins](https://github.com/CSSEGISandData/COVID-19).
- Any Google Forms survey you’ve administered! (Go to the results spreadsheet, then go to “File > Download > Comma-separated values”.)

Tip: if a site only allows you to download a file as an Excel file, not a CSV file, you can download it, open it in a spreadsheet viewer (Excel, Numbers, Google Sheets), and export it to a CSV.
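
Once you have a CSV, loading it outside of DataHub might look like the sketch below. The CSV text here is a made-up stand-in for a downloaded file; in practice you would pass the file's path (something like `'my_data.csv'`, a hypothetical name) to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Made-up stand-in for the contents of a downloaded CSV file.
csv_text = """city,visitors
City A,120
City B,45
"""

# pd.read_csv accepts a file path or a file-like object such as StringIO.
cities = pd.read_csv(io.StringIO(csv_text))

# Peek at the first few rows, as we did with bpd.read_csv in this course.
first_rows = cities.head()
```

`pandas` also provides `pd.read_excel` for reading Excel files directly, though it needs an extra package installed, so converting to CSV first is often the simpler route.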

## Demo: Gapminder 🌎

### `plotly`

- All of the visualizations (scatter plots, histograms, etc.) in this course were created using a library called `matplotlib`.
    - This library was called under the hood every time we wrote `df.plot`.
- `plotly` is a different visualization library that allows us to create **interactive** visualizations.
- You may learn about it in a future course, but we'll briefly show you some cool visualizations you can make with it.
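
To see the `matplotlib` connection for yourself, here is a small sketch using regular `pandas` and a made-up DataFrame. It assumes `matplotlib` is installed; the `matplotlib.use('Agg')` line is only there so the code also runs outside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 4, 8]})

# df.plot hands the drawing off to matplotlib and returns a matplotlib Axes.
ax = df.plot(kind='scatter', x='x', y='y')

# Because ax is a matplotlib object, matplotlib methods work on it.
ax.set_title('A scatter plot drawn by matplotlib')
```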

In [None]:
import plotly.express as px

### Gapminder dataset

> Gapminder Foundation is a non-profit venture registered in Stockholm, Sweden, that promotes sustainable global development and achievement of the United Nations Millennium Development Goals by increased use and understanding of statistics and other information about social, economic and environmental development at local, national and global levels. - [Gapminder Wikipedia](https://en.wikipedia.org/wiki/Gapminder_Foundation)

In [None]:
gapminder = px.data.gapminder()
gapminder

The dataset contains information for each country for several different years.

In [None]:
gapminder.get('year').unique()

Let's start by just looking at 2007 data (the most recent year in the dataset).

In [None]:
gapminder_2007 = gapminder[gapminder.get('year') == 2007]
gapminder_2007
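
If you'd rather not hard-code the year, one option is to compute the most recent year first and then query with it. The sketch below uses a tiny made-up DataFrame (`toy`) with the same column names as the Gapminder data, so it runs without `plotly`.

```python
import pandas as pd

# Tiny made-up stand-in with the same column names as the Gapminder data.
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'year': [2002, 2007, 2002, 2007],
    'lifeExp': [70.0, 71.5, 60.0, 62.5],
})

# Find the most recent year, then keep only the rows from that year.
latest_year = toy.get('year').max()
toy_latest = toy[toy.get('year') == latest_year]
```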

### Scatter plot

We can plot life expectancy vs. GDP per capita. If you hover over a point, you will see the name of the country.

In [None]:
px.scatter(gapminder_2007, x='gdpPercap', y='lifeExp', hover_name='country')

In future courses, you'll learn about transformations. Here, we'll apply a log transformation to the x-axis to make the plot look a little more linear.

In [None]:
px.scatter(gapminder_2007, x='gdpPercap', y='lifeExp', log_x=True, hover_name='country')
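
To get a feel for what a log scale does, here is a rough sketch with made-up GDP values. Each value is 10 times the previous one, so the raw values are increasingly spread out, but their base-10 logarithms are evenly spaced.

```python
import numpy as np

# Made-up GDP per capita values, each 10 times the previous one.
gdp = np.array([100.0, 1_000.0, 10_000.0, 100_000.0])

# Gaps between consecutive raw values grow quickly...
raw_gaps = np.diff(gdp)

# ...while gaps between consecutive log values stay (approximately) constant.
log_gdp = np.log10(gdp)
log_gaps = np.diff(log_gdp)
```

This is why countries with very different GDPs no longer bunch up on the left side of the plot once `log_x=True` is set.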

### Animated scatter plot

We can take things one step further.

In [None]:
px.scatter(gapminder,
           x = 'gdpPercap',
           y = 'lifeExp', 
           hover_name = 'country',
           color = 'continent',
           size = 'pop',
           size_max = 60,
           log_x = True,
           range_y = [30, 90],
           animation_frame = 'year',
           title = 'Life Expectancy, GDP Per Capita, and Population over Time'
          )

Watch [this video](https://www.youtube.com/watch?v=jbkSRLYSojo) if you want to see an even-more-animated version of this plot.

### Animated histogram

In [None]:
px.histogram(gapminder,
            x = 'lifeExp',
            animation_frame = 'year',
            range_x = [20, 90],
            range_y = [0, 50],
            title = 'Distribution of Life Expectancy over Time')

### Choropleth

In [None]:
px.choropleth(gapminder,
              locations = 'iso_alpha',
              color = 'lifeExp',
              hover_name = 'country',
              hover_data = {'iso_alpha': False},
              title = 'Life Expectancy Per Country',
              color_continuous_scale = px.colors.sequential.tempo
)

## Parting thoughts

### From Lecture 1: What is "data science"?

Data science is about **drawing useful conclusions from data using computation**. Throughout the quarter, we touched on several aspects of data science:

- In the first 4 weeks, we used Python to **explore** data.
    - Lots of visualization 📈📊 and "data manipulation", using industry-standard tools.

- In the next 4 weeks, we used data to **infer** properties of a population, given just a sample.
    - Relied heavily on simulation, rather than formulas.

- In the last 2 weeks, we used data from the past to **predict** what may happen in the future.
    - A taste of machine learning 🤖.

- In future courses – including DSC 20 and 40A, which you may be taking next quarter – you'll revisit all three of these aspects of data science.
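
As a reminder of what "simulation, rather than formulas" looked like, here is a minimal bootstrap sketch for a 95% confidence interval for a population mean. The sample values are made up, and it uses NumPy's `default_rng` with a fixed seed (an assumption, for reproducibility) rather than the `.sample` method we used on DataFrames in class.

```python
import numpy as np

rng = np.random.default_rng(10)  # fixed seed so results are reproducible

# Made-up sample of 10 observations.
sample = np.array([62, 70, 55, 81, 67, 73, 59, 77, 64, 69])

# Resample with replacement many times, recording the mean of each resample.
boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(5000)
])

# The middle 95% of the bootstrapped means gives the confidence interval.
left, right = np.percentile(boot_means, [2.5, 97.5])
```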

### Note on grades

<center><img src='data/transcript.png' width=60%><i>Suraj's freshman year transcript.</i></center>

Don't let your grades define you; they don't tell the full story.

### Thank you!

This course would not have been possible without...
- **1 graduate TA**: Arya Rahnama.
- **29 undergraduate tutors**: Oren Ciolli, Nate Del Rosario, Jack Determan, Sophia Fang, Charlie Gillet, Ashley Ho, Henry Ho, Vanessa Hu, Leena Kang, Norah Kerendian, Anthony Li, Weiyue Li, Jasmine Lo, Arjun Malleswaran, Mert Ozer, Aaron Rasin, Chandiner Rishi, Gina Roberg, Harshi Saha, Keenan Serrao, Abel Seyoum, Suhani Sharma, Yutian Shi, Ester Tsai, Bill Wang, Ylesia Wu, Jason Xu, Diego Zavalza, Ciro Zhang.
- Learn [more about tutoring](https://datascience.ucsd.edu/current-students/dsc-tutors/) – it's fun, and you can be a tutor as early as your 3rd quarter at UCSD!
- Keep in touch! [dsc10.com/staff](https://dsc10.com/staff)
    - After grades are released, we'll make a post on Ed where you can ask course staff for advice on courses and UCSD more generally.

## Good luck on your finals...

### ...and see you tomorrow at 7PM 😊.