In [1]:
# Set up packages for lecture. Don't worry about understanding this code, but
# make sure to run it if you're following along.
import numpy as np
import babypandas as bpd
import pandas as pd
from matplotlib_inline.backend_inline import set_matplotlib_formats
import matplotlib.pyplot as plt
from scipy import stats
%reload_ext pandas_tutor
%set_pandas_tutor_options {'projectorMode': True}
set_matplotlib_formats("svg")
plt.style.use('fivethirtyeight')

np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)

# Setup to start where we left off last time
keep_cols = ['business_name', 'inspection_date', 'inspection_score', 'risk_category', 'Neighborhoods', 'Zip Codes']
restaurants_full = bpd.read_csv('data/restaurants.csv').get(keep_cols)
bakeries = restaurants_full[restaurants_full.get('business_name').str.lower().str.contains('bake')]
bakeries = bakeries[bakeries.get('inspection_score') >= 0] # Keeping only the rows where we know the inspection score

# Lecture 28 – Conclusion, How to Get Data Science Experience

## DSC 10, Summer 2022

### Announcements

- The Final Exam is **tomorrow from 11:30am to 2:30pm**.
- End of quarter feedback
    - We very much value your feedback on the course and the instruction! 🙏
    - [End-of-quarter survey](https://forms.gle/czvzKHTZ4V7h1u1t8) for course-specific feedback, and [TA evaluations](https://academicaffairs.ucsd.edu/Modules/Evals) for course staff
    - If at least 85% of the class fills out the [end-of-quarter survey](https://forms.gle/czvzKHTZ4V7h1u1t8), then everyone will receive 1 point of extra credit on their final exam.

### Agenda

- Overview of course.
- Personal projects. 
- How do I get more real-world data science experience?

## From Lecture 1: what is data science?

Drawing useful conclusions from data using computation.

- **Exploration**
    - Identifying patterns in information
    - Uses visualizations
- **Prediction**
    - Making informed guesses
    - Uses machine learning and optimization
- **Inference**
    - Quantifying whether those predictions are reliable
    - Uses randomization
    
Throughout the quarter, you were exposed to all three of these facets. In future courses, you'll continue to learn more.

### Thank you!

This course would not have been possible without...
- Our TA: Shivani Bhakta
- Our tutors: Eric Chen, Oren Ciolli, and Tiffany Yu

- Find out [more about tutoring](https://datascience.ucsd.edu/academics/undergraduate/dsc-tutors/) if you're interested - it's a lot of fun!
- Keep in touch! [dsc10.com/staff](https://dsc10.com/staff)

## Personal projects

### Using Jupyter Notebooks after DSC 10

- You may be interested in working on data science projects of your own.
- In [this video](https://www.youtube.com/watch?v=Hq8VaNirDRQ), we show you how to make blank notebooks and upload datasets of your own to DataHub.
- Depending on the classes you're in, you may not have access to DataHub. Eventually, you'll want to install Jupyter Notebooks on your computer.
    - [Anaconda](https://docs.anaconda.com/anaconda/install/index.html) is a great way to do that, as it also installs many commonly used packages.
    - You may want to download your work from DataHub so you can refer to it after the course ends.
    - Remember, all `babypandas` code is regular `pandas` code, too!
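
To illustrate the last point, here is a minimal sketch that uses only `import pandas as pd`, with no `babypandas` import at all. The `inspections` DataFrame and its values are made up for this example; the methods (`.get`, `.assign`, `.sort_values`, `.groupby`) are the same ones used throughout the quarter.

```python
import pandas as pd

# Made-up inspection data, used only for illustration.
inspections = pd.DataFrame({
    'business_name': ['Sunrise Bakery', 'Taco Stand', 'Bake House', 'Noodle Bar'],
    'inspection_score': [92, 85, 78, 96],
    'risk_category': ['Low Risk', 'Moderate Risk', 'High Risk', 'Low Risk'],
})

# A boolean-mask query with .get, written exactly as it would be in babypandas.
bakeries = inspections[inspections.get('business_name').str.lower().str.contains('bake')]

# .assign and .sort_values behave the same way in regular pandas.
with_flag = inspections.assign(passed=inspections.get('inspection_score') >= 80)
ranked = with_flag.sort_values(by='inspection_score', ascending=False)

# Keep only the columns we need before grouping, so .mean() only sees numbers.
mean_by_risk = (
    inspections
    .get(['risk_category', 'inspection_score'])
    .groupby('risk_category')
    .mean()
)

print(bakeries)
print(ranked)
print(mean_by_risk)
```

Regular `pandas` also offers many features that `babypandas` leaves out, so code found online may look different from what you wrote in this course.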

### Finding data

These sites allow you to search for datasets (in CSV format) from a variety of different domains. Some may require you to sign up for an account; these are generally reputable sources.

Note that all of these links are also available at https://rampure.org/find-datasets.

- [Data is Plural](https://www.data-is-plural.com/archive/)
- [FiveThirtyEight](https://data.fivethirtyeight.com/).
- [CORGIS](https://corgis-edu.github.io/corgis/csv/).
- [Kaggle Datasets](https://www.kaggle.com/datasets).
- [Google’s dataset search](http://toolbox.google.com/datasetsearch).
- [DataHub.io](https://datahub.io/collections).
- [Data.world](https://data.world/).
- [R datasets](https://vincentarelbundock.github.io/Rdatasets/articles/data.html).
- Wikipedia (use [this site](https://wikitable2csv.ggor.de/) to extract and download tables as CSVs).
- [Awesome Public Datasets GitHub repo](https://github.com/awesomedata/awesome-public-datasets).
- [Links to even more sources](https://rockcontent.com/blog/data-sources/)

### Domain-specific sources of data

- Sports: [Basketball Reference](https://www.basketball-reference.com/), [Baseball Reference](https://www.baseball-reference.com/), etc.
- US Government Sources: [census.gov](https://www.census.gov/data/tables.html), [data.gov](https://www.data.gov/), [data.ca.gov](https://data.ca.gov/), [data.sfgov.org](https://data.sfgov.org/browse?), [FBI’s Crime Data Explorer](https://crime-data-explorer.fr.cloud.gov/), [Centers for Disease Control and Prevention](https://data.cdc.gov/browse?category=NCHS).
- Global Development: [data.worldbank.org](https://data.worldbank.org/), [databank.worldbank.org](https://databank.worldbank.org/home.aspx), [WHO](https://apps.who.int/gho/data/node.home).
- Transportation: [New York Taxi trips](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page), [Bureau of Transportation Statistics](https://www.transtats.bts.gov/DataIndex.asp), [SFO Air Traffic Statistics](https://www.flysfo.com/media/facts-statistics/air-traffic-statistics).
- Music: [Spotify Charts](https://spotifycharts.com/regional).
- COVID: [Johns Hopkins](https://github.com/CSSEGISandData/COVID-19).
- Any Google Forms survey you’ve administered! (Go to the results spreadsheet, then go to “File > Download > Comma-separated values”.)

Tip: if a site only allows you to download a file as an Excel file, not a CSV file, you can download it, open it in a spreadsheet viewer (Excel, Numbers, Google Sheets), and export it to a CSV.
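
Once a dataset is in CSV form, a first look at it might go something like the sketch below. The survey data here is invented, and `io.StringIO` stands in for a downloaded file so that the example runs on its own; in practice you would pass a file path (say, a hypothetical `'my_survey.csv'`) to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Stand-in for a downloaded CSV file; the contents are made up.
csv_text = io.StringIO(
    "name,year,hours_slept\n"
    "Ana,2,7.5\n"
    "Ben,3,\n"
    "Cam,1,6.0\n"
)
survey = pd.read_csv(csv_text)

# How big is the dataset, and what columns does it have?
num_rows, num_cols = survey.shape
column_names = list(survey.columns)

# How many values are missing in each column?
missing_per_column = survey.isna().sum()

# One simple (but not always appropriate) option: drop incomplete rows.
complete = survey.dropna()

print(num_rows, num_cols)
print(column_names)
print(missing_per_column)
print(complete)
```

As an alternative to exporting from a spreadsheet viewer, `pandas` can also read Excel files directly with `pd.read_excel`, though that typically requires an extra package such as `openpyxl` to be installed.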

## Parting thoughts

## How do I get more real-world data science experience?

- The most accessible way is to get involved with a research group on campus.
- How? Sam will walk through a few steps you can take.

- Be friendly, but not too enthusiastic, and know what you're interested in.
- Don't be hyper-focused on getting a research job.
- Show that you're really interested in the topic.

Hi [X],

I'm [X], a 3rd-year undergrad at UCSD studying [blank].

I saw your paper on [X] and am interested in hearing about the idea and the process of doing your research.

- I see that your research is in [X], and I would like to hear about your current research ideas.

Would you be open to having a 30-min chat with me about your work?



At the end of the conversation:

- I'm looking for real-world data science experience during the upcoming school year. Would you be open to having me come and help?
- Do you know anyone else I could talk to in order to learn more?

## Good luck on the final! 🎉