In [None]:
# Imports
import babypandas as bpd
import numpy as np

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# Setup to start where we left off in Lecture 26
keep_cols = ['business_name', 'inspection_date', 'inspection_score', 'risk_category', 'Neighborhoods', 'Zip Codes']
restaurants_full = bpd.read_csv('data/restaurants.csv').get(keep_cols)
bakeries = restaurants_full[restaurants_full.get('business_name').str.lower().str.contains('bake')]
bakeries = bakeries[bakeries.get('inspection_score') >= 0] # Keeping only the rows where we know the inspection score

# Lecture 27 – Review, Conclusion

## DSC 10, Winter 2022

### Announcements

- The **Final Exam** is **tomorrow from 3-6PM** ‼️.
    - See [this post](https://campuswire.com/c/G6950E967/feed/1403) on Campuswire for more details.
- If at least 85% of the class fills out **both** [CAPEs](https://cape.ucsd.edu) and the [additional survey](https://forms.gle/pt9HZ4RJzBrz1Vz98), then everyone will receive an extra 0.5% added to their overall course grade.
    - Currently at 69% for CAPEs and 57% for the additional survey – make sure to fill out both!
    - The deadline for both is tomorrow at **8AM**.
- There are office hours all day today.
    - Post questions on Campuswire too, but please search before asking a question about a past exam – chances are, it was already answered!

### Agenda

- Finish the review from Wednesday.
- Personal projects. 
- Demo of cool plotting.
- Parting thoughts.

## Bakeries 🧁

In [None]:
bakeries

In this case we happen to have the inspection scores for all members of the **population** – i.e. **all bakeries in San Francisco** – but in reality we won't. So let's instead consider a random **sample** of the population.

In [None]:
np.random.seed(23) # Ignore this

sample_of_bakeries = bakeries.sample(200, replace=False)
sample_of_bakeries

We're interested in inferring what the population **median** is given our single sample. There is no CLT for sample medians, so instead we'll have to resort to bootstrapping to estimate the distribution of the sample median.

Recall, bootstrapping is the act of **sampling from the original sample, with replacement**. This is also called **resampling**.

In [None]:
# The median of our original sample – this is just one number
sample_of_bakeries.get('inspection_score').median()

In [None]:
# The median of a single bootstrap resample – this is just one number
sample_of_bakeries.sample(200, replace=True).get('inspection_score').median()

Let's resample repeatedly.

In [None]:
np.random.seed(23) # Ignore this

boot_medians = np.array([])

for i in np.arange(5000):
    boot_median = sample_of_bakeries.sample(sample_of_bakeries.shape[0], replace=True).get('inspection_score').median()
    boot_medians = np.append(boot_medians, boot_median)

In [None]:
boot_medians

In [None]:
bpd.DataFrame().assign(boot_medians=boot_medians).plot(kind='hist', density=True, ec='w', bins=10, figsize=(10, 5));

Note that this distribution is not at all normal.

To compute a 95% confidence interval, we take the middle 95% of the bootstrapped medians.

In [None]:
left = np.percentile(boot_medians, 2.5)
right = np.percentile(boot_medians, 97.5)

[left, right]
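
The resample-then-take-percentiles recipe above can be wrapped in a small reusable function. What follows is a minimal sketch in plain NumPy, not the course's own helper: `bootstrap_ci` and its parameters are hypothetical names, and `fake_scores` is synthetic data standing in for the bakery sample.

```python
import numpy as np

def bootstrap_ci(values, statistic=np.median, repetitions=5000, level=95, seed=None):
    """Percentile bootstrap confidence interval for `statistic` (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    n = len(values)
    boot_stats = np.empty(repetitions)
    for i in range(repetitions):
        # Resample n values from the original sample, with replacement
        resample = rng.choice(values, size=n, replace=True)
        boot_stats[i] = statistic(resample)
    # Keep the middle `level` percent of the bootstrapped statistics
    tail = (100 - level) / 2
    return np.percentile(boot_stats, tail), np.percentile(boot_stats, 100 - tail)

# Synthetic inspection scores (an assumption, not the real data)
fake_scores = np.random.default_rng(0).integers(70, 101, size=200)
fake_left, fake_right = bootstrap_ci(fake_scores, seed=23)
print([fake_left, fake_right])
```

Passing a different `statistic` (say, `np.mean`) reuses the same procedure for another parameter.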

### Discussion Question

Which of the following interpretations of this confidence interval are valid?

1. 95% of SF bakeries have an inspection score between 85 and 88.  
2. 95% of the resamples have a median inspection score between 85 and 88.  
3. There is a 95% chance that our sample has a median inspection score between 85 and 88.  
4. There is a 95% chance that the median inspection score of all SF bakeries is between 85 and 88.  
5. If we had taken 100 samples from the same population, about 95 of these samples would have a median inspection score between 85 and 88.  
6. If we had taken 100 samples from the same population, about 95 of the confidence intervals created would contain the median inspection score of all SF bakeries.  

<details>
<summary>Click for the answer <b>after</b> you've entered your guess above. <b>Don't scroll any further.</b></summary>
    
The correct answers are Option 2 and Option 6.

</details>
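
To see what Option 6 is describing, we can simulate it. The sketch below is an illustration under stated assumptions, not part of the original analysis: the `population` is synthetic (random integer scores), and each of the 100 samples gets its own bootstrapped 95% confidence interval. We then count how many of those intervals contain the true population median.

```python
import numpy as np

rng = np.random.default_rng(10)

# Synthetic "population" of scores (an assumption, standing in for all SF bakeries)
population = rng.integers(60, 101, size=5000)
true_median = np.median(population)

num_samples = 100
contains_truth = 0

for _ in range(num_samples):
    # Draw a fresh sample of 200 from the population, without replacement
    sample = rng.choice(population, size=200, replace=False)
    # Bootstrap 1000 resamples at once: each row is one resample of the sample
    resamples = rng.choice(sample, size=(1000, 200), replace=True)
    sim_boot_medians = np.median(resamples, axis=1)
    # Middle 95% of the bootstrapped medians
    lo, hi = np.percentile(sim_boot_medians, [2.5, 97.5])
    if lo <= true_median <= hi:
        contains_truth += 1

print(contains_truth, 'of', num_samples, 'intervals contain the population median')
```

The count should land near 95, though it will vary from run to run. Note that the randomness is in the intervals, not in the population median, which is fixed.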

## Physicians 🩺

### The setup

You work as a family physician. You collect data and find that out of 6354 patients, 3115 were children and 3239 were adults.

You want to test the following hypotheses:

- **Null Hypothesis:** Family physicians see an equal number of children and adults.
- **Alternative Hypothesis:** Family physicians see more adults than they see children.

### Discussion Question

Which test statistic(s) could be used for this hypothesis test? Which values of the test statistic point towards the alternative?

A. proportion of children seen   
B. number of children seen  
C. number of children minus number of adults seen  
D. absolute value of number of children minus number of adults seen

**There may be multiple correct answers.**

<details>
<summary>Click for the answer <b>after</b> you've entered your guess above. <b>Don't scroll any further.</b></summary>
    
All of these but the last one would work for this alternative. Small values of these statistics would favor the alternative.
    
If the alternative were instead "Family physicians see a different number of children and adults", the last option would work while the first three wouldn't.

</details>

Let's use option B, the number of children seen, as a test statistic. Small values of this statistic favor the alternative hypothesis.
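
As a quick check on the four candidate statistics, here is each one computed on the observed data. This is just arithmetic on the counts from the setup; the variable names are ours.

```python
num_children = 3115
num_adults = 3239
total = num_children + num_adults

stat_a = num_children / total             # A. proportion of children seen
stat_b = num_children                     # B. number of children seen
stat_c = num_children - num_adults        # C. number of children minus number of adults
stat_d = abs(num_children - num_adults)   # D. absolute value of that difference

print(stat_a, stat_b, stat_c, stat_d)
```

Under the null, we'd expect A to be near 0.5, B near 3177, and C near 0. Statistic D drops the sign of the difference, so on its own it can't tell us whether adults or children are the larger group.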

How do we generate a single value of the test statistic?

In [None]:
np.random.multinomial(6354, [0.5, 0.5])[0]

As usual, let's simulate the test statistic many, many times.

In [None]:
test_stats = np.array([])

for i in np.arange(10000):
    stat = np.random.multinomial(6354, [0.5, 0.5])[0]
    test_stats = np.append(test_stats, stat)

In [None]:
test_stats

In [None]:
bpd.DataFrame().assign(test_stats=test_stats) \
               .plot(kind='hist', density=True, ec='w', figsize=(10, 5), bins=20);
plt.axvline(3115, color='red', label='observed statistic')
plt.legend();

Recall that you collected data and found that out of 6354 patients, 3115 were children and 3239 were adults.

### Discussion Question

What goes in blank (a)?

```py
p_value = np.count_nonzero(test_stats __(a)__ 3115) / 10000
```

A. `>=`

B. `>`

C. `<=`

D. `<`

### To answer, go to [menti.com](https://menti.com) and enter the code 3437 4946.

<details>
<summary>Click for the answer <b>after</b> you've entered your guess above. <b>Don't scroll any further.</b></summary>
    <=

</details>

In [None]:
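# Calculate the p-value: the proportion of simulated statistics at or below the observed 3115
np.count_nonzero(test_stats <= 3115) / 10000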

### Discussion Question

What do we do, assuming that we're using a 5% p-value cutoff?

A. reject the null  

B. fail to reject the null 

C. not sure

### To answer, go to [menti.com](https://menti.com) and enter the code 3437 4946.

<details>
<summary>Click for the answer <b>after</b> you've entered your guess above. <b>Don't scroll any further.</b></summary>
    Fail to reject the null, since the p-value is above 0.05.

</details>
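
Here is a self-contained sanity check of that conclusion. It's a sketch rather than the lecture's own code. It swaps in `rng.binomial`, which, with two equally likely categories, plays the same role as the first entry of a multinomial draw, and it compares the simulated p-value to a normal approximation based on the CLT.

```python
import math
import numpy as np

n, observed = 6354, 3115

# Simulated p-value: chance of seeing 3115 or fewer children under the null
rng = np.random.default_rng(42)
simulated = rng.binomial(n, 0.5, size=10000)
sim_p_value = np.count_nonzero(simulated <= observed) / 10000

# Normal approximation to the same probability
mean = n * 0.5
sd = math.sqrt(n * 0.5 * 0.5)
z = (observed - mean) / sd
approx_p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(sim_p_value, approx_p_value)
```

Both numbers should come out at roughly 0.06, just above the 0.05 cutoff.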

Note that while we used `np.random.multinomial` to simulate the test statistic, we could have used `np.random.choice`, too:

In [None]:
choices = np.random.choice(['adult', 'child'], p=[0.5, 0.5], size=6354, replace=True)
choices

In [None]:
np.count_nonzero(choices == 'child')

### Discussion Question

Is this an example of bootstrapping?

A. Yes, because we are sampling with replacement.

B. No, this is not bootstrapping.

### To answer, go to [menti.com](https://menti.com) and enter the code 3437 4946.

<details>
<summary>Click for the answer <b>after</b> you've entered your guess above. <b>Don't scroll any further.</b></summary>
    
No, this is not bootstrapping. Bootstrapping is when we resample from a single sample; here we're simulating data under the assumptions of a model.

</details>
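
To make the distinction concrete, the sketch below puts the two side by side. The array `observed_patients` is our own reconstruction of the collected data from the counts in the setup; everything else mirrors the `np.random.choice` call above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulating under a model: no observed data needed, only the null's 50/50 assumption
model_draw = rng.choice(['adult', 'child'], p=[0.5, 0.5], size=6354)
model_children = np.count_nonzero(model_draw == 'child')

# Bootstrapping: resample, with replacement, from the data we actually collected
observed_patients = np.array(['child'] * 3115 + ['adult'] * 3239)
resample = rng.choice(observed_patients, size=len(observed_patients), replace=True)
resample_children = np.count_nonzero(resample == 'child')

print(model_children, resample_children)
```

Repeated many times, the first count would center around 3177 (what the null predicts), while the second would center around 3115 (what we observed).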

## Personal projects

### Using Jupyter Notebooks after DSC 10

- You may be interested in working on data science projects of your own.
- In [this video](https://www.youtube.com/watch?v=Hq8VaNirDRQ), we show you how to make blank notebooks and upload datasets of your own to DataHub.
- Depending on the classes you're in, you may not have access to DataHub. Eventually, you'll want to install Jupyter Notebooks on your computer.
    - [Anaconda](https://docs.anaconda.com/anaconda/install/index.html) is a great way to do that, as it also installs many commonly used packages.
    - You may want to download your work from DataHub so you can refer to it after the course ends.
    - Remember, all `babypandas` code is regular `pandas` code, too!
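
As a small illustration of that last point, here is the style of code we wrote all quarter, run with regular `pandas` instead. The table is made up for the example (it is not the real `restaurants.csv`).

```python
import pandas as pd

# Tiny made-up table (hypothetical rows)
df = pd.DataFrame({
    'business_name': ['Sunrise Bakery', 'Taco Stop', 'Bake House'],
    'inspection_score': [92, 88, 75],
})

# The same babypandas-style operations, run with regular pandas
is_bakery = df.get('business_name').str.lower().str.contains('bake')
bakeries_pd = df[is_bakery].sort_values(by='inspection_score', ascending=False)
median_score = bakeries_pd.get('inspection_score').median()

print(bakeries_pd)
print(median_score)
```

The reverse isn't true: `pandas` has many more methods than `babypandas` does, so not all `pandas` code will run in `babypandas`.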

### Finding data

These sites allow you to search for datasets (in CSV format) from a variety of different domains. Some may require you to sign up for an account, but all of them are generally reputable sources.

Note that all of these links are also available at https://rampure.org/find-datasets.

- [Kaggle Datasets](https://www.kaggle.com/datasets).
- [Google’s dataset search](http://toolbox.google.com/datasetsearch).
- [FiveThirtyEight](https://data.fivethirtyeight.com/).
- [DataHub.io](https://datahub.io/collections).
- [Data.world](https://data.world/).
- [CORGIS](https://corgis-edu.github.io/corgis/csv/).
- [R datasets](https://vincentarelbundock.github.io/Rdatasets/articles/data.html).
- Wikipedia (use [this site](https://wikitable2csv.ggor.de/) to extract and download tables as CSVs).
- [Awesome Public Datasets GitHub repo](https://github.com/awesomedata/awesome-public-datasets).
- [Links to even more sources](https://rockcontent.com/blog/data-sources/).

### Domain-specific sources of data

- Sports: [Basketball Reference](https://www.basketball-reference.com/), [Baseball Reference](https://www.baseball-reference.com/), etc.
- US Government Sources: [census.gov](https://www.census.gov/data/tables.html), [data.gov](https://www.data.gov/), [data.ca.gov](https://data.ca.gov/), [data.sfgov.org](https://data.sfgov.org/browse?), [FBI’s Crime Data Explorer](https://crime-data-explorer.fr.cloud.gov/), [Centers for Disease Control and Prevention](https://data.cdc.gov/browse?category=NCHS).
- Global Development: [data.worldbank.org](https://data.worldbank.org/), [databank.worldbank.org](https://databank.worldbank.org/home.aspx), [WHO](https://apps.who.int/gho/data/node.home).
- Transportation: [New York Taxi trips](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page), [Bureau of Transportation Statistics](https://www.transtats.bts.gov/DataIndex.asp), [SFO Air Traffic Statistics](https://www.flysfo.com/media/facts-statistics/air-traffic-statistics).
- Music: [Spotify Charts](https://spotifycharts.com/regional).
- COVID: [Johns Hopkins](https://github.com/CSSEGISandData/COVID-19).
- Any Google Forms survey you’ve administered! (Go to the results spreadsheet, then go to “File > Download > Comma-separated values”.)

Tip: if a site only allows you to download a file as an Excel file, not a CSV file, you can download it, open it in a spreadsheet viewer (Excel, Numbers, Google Sheets), and export it to a CSV.

## Demo: Gapminder 🌎

### plotly

- All of the visualizations (scatter plots, histograms, etc.) in this course were created using a library called `matplotlib`.
    - This library was called under the hood every time we wrote `df.plot`.
- `plotly` is a visualization library that allows us to create **interactive** visualizations.
- You may learn about it in a future course, but we'll briefly show you some cool visualizations you can make with it.

In [None]:
import plotly.express as px

### Gapminder dataset

> Gapminder Foundation is a non-profit venture registered in Stockholm, Sweden, that promotes sustainable global development and achievement of the United Nations Millennium Development Goals by increased use and understanding of statistics and other information about social, economic and environmental development at local, national and global levels. - [Gapminder Wikipedia](https://en.wikipedia.org/wiki/Gapminder_Foundation)

In [None]:
gapminder = px.data.gapminder()
gapminder

The dataset contains information for each country for several different years.

In [None]:
np.unique(gapminder.get('year'))

Let's start by just looking at 2007 data (the most recent year in the dataset).

In [None]:
gapminder_2007 = gapminder[gapminder.get('year') == 2007]
gapminder_2007

### Scatter plot

We can plot life expectancy vs. GDP per capita. If you hover over a point, it'll tell you the name of the country.

In [None]:
px.scatter(gapminder_2007, x='gdpPercap', y='lifeExp', hover_name='country')

In future courses, you'll learn about transformations. Here, we'll apply a log transformation to the x-axis to make the plot look a little more linear.

In [None]:
px.scatter(gapminder_2007, x='gdpPercap', y='lifeExp', log_x=True, hover_name='country')
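
To get a feel for why a log scale helps, here is a tiny sketch with made-up numbers (not the Gapminder data). We assume life expectancy rises by a fixed amount each time GDP per capita is multiplied by 10, then compare the correlation coefficient before and after taking `np.log10` of the x-values.

```python
import numpy as np

# Hypothetical GDP-per-capita values, each 10 times the previous one
gdp = np.array([100, 1_000, 10_000, 100_000])
# Made-up life expectancies that rise by 10 years per tenfold increase in GDP
life_exp = np.array([50, 60, 70, 80])

# After the transformation, the x-values are evenly spaced
log_gdp = np.log10(gdp)

r_raw = np.corrcoef(gdp, life_exp)[0, 1]
r_log = np.corrcoef(log_gdp, life_exp)[0, 1]

print(log_gdp)
print(r_raw, r_log)
```

In this toy example, the points fall on a straight line once x is on a log scale, so `r_log` is essentially 1 while `r_raw` is smaller. Setting `log_x=True` draws the x-axis on a log scale for us, without changing the underlying data.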

### Animated scatter plot

We can take things one step further.

In [None]:
px.scatter(gapminder,
           x = 'gdpPercap',
           y = 'lifeExp', 
           hover_name = 'country',
           color = 'continent',
           size = 'pop',
           size_max = 60,
           log_x = True,
           range_y = [30, 90],
           animation_frame = 'year',
           title = 'Life Expectancy, GDP Per Capita, and Population over Time'
          )

Watch [this video](https://www.youtube.com/watch?v=jbkSRLYSojo) if you want to see an even-more-animated version of this plot.

### Animated histogram

In [None]:
px.histogram(gapminder,
            x = 'lifeExp',
            animation_frame = 'year',
            range_x = [20, 90],
            range_y = [0, 50],
            title = 'Distribution of Life Expectancy over Time')

### Choropleth

In [None]:
px.choropleth(gapminder,
              locations = 'iso_alpha',
              color = 'lifeExp',
              hover_name = 'country',
              hover_data = {'iso_alpha': False},
              title = 'Life Expectancy Per Country',
              color_continuous_scale = px.colors.sequential.tempo
)

## Parting thoughts

### From Lecture 1: what is data science?

Drawing useful conclusions from data using computation.

- **Exploration**
    - Identifying patterns in information
    - Uses visualizations
- **Prediction**
    - Making informed guesses
    - Uses machine learning and optimization
- **Inference**
    - Quantifying whether those predictions are reliable
    - Uses randomization
    
Throughout the quarter, you were exposed to all three of these facets. In future courses, you'll continue to learn more.

### Note on grades

<center><img src='data/transcript.png' width=700> Suraj's freshman year transcript. </center>

Don't let your grades define you; they don't tell the full story.

### Thank you!

This course would not have been possible without...
- Our 2 TAs: Murali and Yifei.
- Our 14 tutors: John Driscoll, Judy Feng, Jessica Guzman, Yujian (Ken) He, Aven Huang, Dylan Lee, Teresa Lee, Karthikeya Manchala, Yash Potdar, Shiv Sakthivel, Ethan Shapiro, Du Xiang, Guantong Zhang, and Xuzhe (Alina) Zhi.
- Keep in touch! [dsc10.com/staff](https://dsc10.com/staff)

## Good luck on the final! 🎉