# DSCI 100 - Introduction to Data Science


## Lecture 12 - Bootstrapping & wrap-up

<img src="img/wrap_up.gif" width = 500 />

# Housekeeping

- Today is the last lecture
- Course evaluation is live (SEI Surveys)!
    - Canvas > Course Evaluation
- No tutorial assignment this week -- use the time to wrap up + polish your project!
- Final project and teamwork document
- Final exam (covers all the material)

https://students.ubc.ca/enrolment/exams/exam-schedule

## Review

**Inference:** Using a sample to draw a conclusion about the wider population

What do the following terms mean:

- population (and population parameter)
- sample
- estimation (and estimate)
- sampling distribution

## Review - Population

Population distribution of price per night for all Airbnb listings in Vancouver.

![image.png](attachment:image.png)

## Review - Sample

Distribution of price per night for a sample of 40 Airbnb listings

![image.png](attachment:image.png)

## Review - Sampling distribution

Sampling distribution of the sample means for 20,000 samples of size 40.

![image-5.png](attachment:image-5.png)
![image-3.png](attachment:image-3.png)

1. What happens to the peak and the range of the histogram if I make my sample size smaller / larger?
2. Can we create this figure in a real data analysis problem? Why/Why not?
3. What problem does that cause and what could be some possible resolutions?

We generated the figure by drawing many samples of size 40, computing the mean price per night for each sample, and plotting a histogram of those means. This is a visualization of the sampling distribution.

1. If sample size gets larger, the spread of the distribution shrinks (and vice versa). The peak is centered at the true population parameter value.
2. No. In a real data analysis problem we have only one sample to work with (we can't draw many samples from the population).
3. This means we cannot visualize the spread, and so have no way of understanding how reliable our point estimate is using just one sample.
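
The answers above can be checked with a small simulation. This is a minimal Python sketch (standard library only), assuming a made-up, right-skewed population of nightly prices rather than the Airbnb data; the helper name `sample_means` and the sample sizes 20 and 160 are illustrative choices, not from the course material.

```python
import random
import statistics

random.seed(100)

# Synthetic, right-skewed stand-in for the population of nightly prices
# (made-up values, NOT the Airbnb data shown in the slides).
population = [random.lognormvariate(5, 0.5) for _ in range(10_000)]


def sample_means(population, sample_size, n_samples):
    """Draw n_samples samples (without replacement) and return each mean."""
    return [
        statistics.fmean(random.sample(population, sample_size))
        for _ in range(n_samples)
    ]


means_small = sample_means(population, sample_size=20, n_samples=2_000)
means_large = sample_means(population, sample_size=160, n_samples=2_000)

print("population mean:    ", round(statistics.fmean(population), 2))
print("centre, n = 20:     ", round(statistics.fmean(means_small), 2))
print("centre, n = 160:    ", round(statistics.fmean(means_large), 2))
print("spread (sd), n = 20: ", round(statistics.stdev(means_small), 2))
print("spread (sd), n = 160:", round(statistics.stdev(means_large), 2))
```

Both sets of sample means should centre near the population mean, while the larger sample size should give a visibly smaller spread. Note that this only works because the simulation has access to the whole population.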

In real data analysis settings, we usually have just one sample from our population and do not have access to the population itself, so we cannot construct the sampling distribution as we did above. And as we saw, a sample estimate can differ noticeably from the population parameter, so reporting the point estimate from a single sample alone may not be enough. We also need to report some notion of uncertainty in the value of the point estimate.

## Bootstrapping

We only have one sample... but if it's big enough, the sample looks like the population!

![image.png](attachment:image.png)

Let's pretend our *sample* **is** our population. Then we can take many samples from our original sample (called *bootstrap samples*) to give us an approximation of the sampling distribution (the *bootstrap sampling distribution*).

Note that by taking many samples from our single, observed sample, we do not obtain the true sampling distribution, but rather an approximation that we call the bootstrap distribution.

## Generating a single bootstrap sample

1. Randomly draw an observation from the original sample (which was drawn from the population)

2. Record the observation's value

3. Return the observation to the original sample

4. Repeat steps 1-3 until the bootstrap sample has the **same number of observations as the original sample**
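
The four steps above amount to drawing with replacement until the new sample is as large as the original. A minimal Python sketch, assuming a made-up sample of ten nightly prices (the values and the helper name `bootstrap_sample` are hypothetical):

```python
import random

random.seed(12)

# Hypothetical original sample of nightly prices (made-up values).
original_sample = [95, 120, 64, 210, 150, 88, 175, 132, 99, 250]


def bootstrap_sample(sample):
    """Draw len(sample) observations from sample, with replacement."""
    # random.choices samples WITH replacement, so each draw is
    # effectively "returned" before the next one is taken.
    return random.choices(sample, k=len(sample))


boot = bootstrap_sample(original_sample)
print("original: ", sorted(original_sample))
print("bootstrap:", sorted(boot))
```

Because every observation is returned after it is drawn, a bootstrap sample will typically contain some values more than once and omit others entirely.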

## Bootstrap Sampling Distribution
<img src="https://ubc-dsci.github.io/introduction-to-datascience/_main_files/figure-html/11-bootstrapping7-1.png" width=1300/>

What would happen if we sampled *without* replacement? What does that mean?
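
One way to explore this question is to try both kinds of resampling on a tiny example. The following is a minimal Python sketch with made-up prices; it is only meant to illustrate the difference, not to reproduce the figure.

```python
import random
import statistics

random.seed(7)

# Hypothetical original sample of nightly prices (made-up values).
original_sample = [95, 120, 64, 210, 150, 88, 175, 132, 99, 250]
n = len(original_sample)

# Without replacement: every observation is used exactly once, so the
# "resample" is just the original sample in a different order.
without_replacement = random.sample(original_sample, k=n)

# With replacement: values can repeat or be left out, so the mean can differ.
with_replacement = random.choices(original_sample, k=n)

print("original mean:           ", statistics.mean(original_sample))
print("without replacement mean:", statistics.mean(without_replacement))
print("with replacement mean:   ", statistics.mean(with_replacement))
```

Sampling without replacement at the full sample size always returns the same set of values, so the estimate never varies and the resulting "distribution" would be a single spike.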

## Sampling vs bootstrap distribution

The bootstrap distribution of the mean will be centered around the mean of the initial sample rather than the true population mean (which is unknown).

![image.png](attachment:image.png)

There are two essential points that we can take away from Fig. 10.14. First, the shape and spread of the true sampling distribution and the bootstrap distribution are similar; the bootstrap distribution lets us get a sense of the point estimate’s variability. The second important point is that the means of these two distributions are slightly different.
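
Both points can be illustrated with a simulation in which, unlike in practice, the population is available. This is a minimal Python sketch assuming a synthetic population of prices; all values are made up.

```python
import random
import statistics

random.seed(2024)

# Synthetic population (made-up values); in practice we would not have this.
population = [random.lognormvariate(5, 0.5) for _ in range(10_000)]
one_sample = random.sample(population, 40)

# Sampling distribution: means of many fresh samples from the population.
sampling_means = [
    statistics.fmean(random.sample(population, 40)) for _ in range(2_000)
]

# Bootstrap distribution: means of many resamples (with replacement)
# taken from the single observed sample.
bootstrap_means = [
    statistics.fmean(random.choices(one_sample, k=40)) for _ in range(2_000)
]

print("population mean:        ", round(statistics.fmean(population), 2))
print("sampling dist. centre:  ", round(statistics.fmean(sampling_means), 2))
print("observed sample mean:   ", round(statistics.fmean(one_sample), 2))
print("bootstrap dist. centre: ", round(statistics.fmean(bootstrap_means), 2))
print("sampling dist. spread:  ", round(statistics.stdev(sampling_means), 2))
print("bootstrap dist. spread: ", round(statistics.stdev(bootstrap_means), 2))
```

The bootstrap centre should track the observed sample mean rather than the population mean, while the two spreads should be of similar size.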

## Using the bootstrap to calculate a plausible range of the mean

1. Take a bootstrap sample

2. Calculate the bootstrap mean (or any other point estimate such as the median, proportion, slope, etc.) from that bootstrap sample

3. Repeat steps (1) and (2) many times to create a **bootstrap sampling distribution** - the distribution of bootstrap point estimates

4. Calculate the plausible range of values around our observed point estimate (we will call this a confidence interval and learn more about it in the worksheet)
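
The four steps above can be sketched in a few lines. This is a minimal Python version using the simple percentile method, assuming a synthetic sample of 40 prices; the helper names `bootstrap_means` and `percentile_interval_95` are hypothetical, and the worksheet's own approach may differ.

```python
import random
import statistics

random.seed(1)

# Hypothetical sample of 40 nightly prices (synthetic, made-up values).
sample = [round(random.lognormvariate(5, 0.5), 2) for _ in range(40)]


def bootstrap_means(sample, n_reps):
    """Steps 1-3: resample with replacement and record each resample's mean."""
    return [
        statistics.fmean(random.choices(sample, k=len(sample)))
        for _ in range(n_reps)
    ]


def percentile_interval_95(values):
    """Step 4: bounds of the middle 95% of the values."""
    # n=40 yields cut points every 2.5%, so the first and last cut points
    # are the 2.5th and 97.5th percentiles.
    cuts = statistics.quantiles(values, n=40)
    return cuts[0], cuts[-1]


boot_means = bootstrap_means(sample, n_reps=5_000)
lower, upper = percentile_interval_95(boot_means)

print("sample mean:", round(statistics.fmean(sample), 2))
print("approximate 95% interval:", (round(lower, 2), round(upper, 2)))
```

Swapping `statistics.fmean` for another summary, such as `statistics.median`, gives the bootstrap distribution of that point estimate instead.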

## Using the bootstrap to calculate a plausible range of the mean

![image.png](attachment:image.png)

- We can use our bootstrap distribution to calculate the plausible range of values for the population parameter:
- The (approximate) *95% confidence interval* we report is the range from 119.28 to 203.63.
    - 95% of the means from our bootstrap resampling fell in this interval.
- We can report **both** our sample point estimate and the plausible range where we expect our true population quantity to fall.

A confidence interval is a range of plausible values for the population parameter. We will find the range of values covering the middle 95% of the bootstrap distribution, giving us a 95% confidence interval. You may be wondering, what does “95% confidence” mean? If we took 100 random samples and calculated a 95% confidence interval from each one, then about 95 of the 100 intervals would capture the population parameter’s value. Note there’s nothing special about 95%. We could have used other levels, such as 90% or 99%. There is a trade-off between our level of confidence and precision: a higher confidence level corresponds to a wider interval, and a lower confidence level corresponds to a narrower one.
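
The "95% of intervals capture the parameter" interpretation can be checked by simulation when the population is known. A minimal Python sketch, assuming a synthetic population and percentile bootstrap intervals; the helper `bootstrap_ci_95` and the counts (200 samples, 500 resamples each) are illustrative assumptions.

```python
import random
import statistics

random.seed(42)

# Synthetic population (made-up values) so the true mean is known.
population = [random.lognormvariate(5, 0.5) for _ in range(10_000)]
true_mean = statistics.fmean(population)


def bootstrap_ci_95(sample, n_reps=500):
    """Percentile bootstrap interval for the mean of one sample."""
    means = [
        statistics.fmean(random.choices(sample, k=len(sample)))
        for _ in range(n_reps)
    ]
    cuts = statistics.quantiles(means, n=40)  # cut points every 2.5%
    return cuts[0], cuts[-1]


n_intervals = 200
hits = 0
for _ in range(n_intervals):
    sample = random.sample(population, 40)  # a fresh sample each time
    lower, upper = bootstrap_ci_95(sample)
    if lower <= true_mean <= upper:
        hits += 1

coverage = hits / n_intervals
print(f"{coverage:.0%} of the intervals captured the true mean")
```

The printed share should be in the neighbourhood of 95%. With small, skewed samples the percentile method can fall somewhat short of the nominal level, which is one reason the interval is described as approximate.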

Here the sample mean price-per-night of 40 Airbnb listings was \$153.48, and we are 95% “confident” that the true population mean price-per-night for all Airbnb listings in Vancouver is between \$121.6 and \$191.5. Notice that our interval does indeed contain the true population mean value, \$154.51!

## Data Science wrap-up

At the start of the semester, we began with this gif:


<img src="https://media.giphy.com/media/up25s7QBalEmQ/giphy.gif">


And we laid out these goals and this path:



## High-level goals of this course:

1. Map research/statistical questions to the appropriate type of data analysis

2. Use modern reproducible tools (Jupyter notebooks, R, tidyverse & tidymodels) to load, wrangle, and explore data; and to solve classification, regression, clustering, and inference problems

3. Correctly interpret and communicate results from all of the above analyses

## Problems we focused on:

1. Predict a class/category for a new observation/measurement (predictive)

2. Predict a value for a new observation/measurement (predictive)

3. Find previously unknown/unlabelled subgroups in your data (exploratory)

4. Estimate a population quantity from a sample (inferential)

## Another way to think of what we did in this course:

![](https://python.datasciencebook.ca/_images/chapter_overview.png)


## Where to from here

- you learned a lot in this course!


- many of you are asking for more Data Science (yeah!)


- so here's a list of some UBC courses of interest you might want to take:
    - [STAT 201 - Statistical Inference for Data Science](https://ubc-stat.github.io/stat-201)
    - [STAT 301 - Statistical Modelling for Data Science](https://ubc-stat.github.io/stat-301)
    - [STAT 406 - Methods for Statistical Learning](https://github.com/msalibian/STAT406)
    - [CPSC 330 - Applied Machine Learning](https://courses.students.ubc.ca/cs/courseschedule?tname=subj-course&course=330&campuscd=UBC&dept=CPSC&pname=subjarea)
    - [CPSC 368 - Databases in Data Science](https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-course&dept=CPSC&course=368) 
    - [DSCI 310 - Reproducible and trustworthy workflows for data science](https://ubc-dsci.github.io/dsci-310-student/)
    - [DSCI 320 - Visualization for Data Science](https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-course&dept=DSCI&course=320)

## Minor in Data Science

- UBC also has a new [Minor in Data Science](https://datascience.ubc.ca/minor). 


- Anybody can apply -- no matter your home faculty, no matter your major!


- Outside of classes, I can recommend reading
    - [An Introduction to Statistical Learning](https://www.statlearning.com/)
    - [Johns Hopkins Coursera Data Science courses](https://www.coursera.org/specializations/jhu-data-science)

## Thank you, and it's been a blast!

<img align="left" src="https://media.giphy.com/media/12xvz9NssSkaS4/giphy.gif" width="500">

## Course Evaluation


Let's take some time right now to all go do the course evaluation! 

We really appreciate your feedback here. It helps us understand what worked / what didn't / how to improve! 


## One last round of Worksheet work!
