## DSCI 100 - Introduction to Data Science

### Lecture 11 - Introduction to inference & sampling through simulation

## What is statistical inference?

- Statistical inference is the process of using a small sample to draw conclusions about the wider population the sample came from

- Examples of types of inference: estimation and testing

## Things we can do with inference

## 1. Make a statement such as this:

Based on the results of the latest poll, we estimate that 47.2% of Americans think that firearms should have strong regulations or restrictions when thinking about gun ownership rights and gun laws.

source: http://polling.reuters.com/#!response/PV20/type/smallest/dates/20180505-20181002/collapsed/true

### This is estimation!

## 2. Answer a marketing question such as this:

What proportion of undergraduate students have an iPhone?

<img align="left" src="https://media.wired.com/photos/5b22c5c4b878a15e9ce80d92/master/w_582,c_limit/iphonex-TA.jpg" width="500"/>

source: https://media.wired.com/photos/5b22c5c4b878a15e9ce80d92/master/w_582,c_limit/iphonex-TA.jpg

### This can be answered with estimation!

## 3. or a health question such as this:

Are first babies born later than non-first born babies?

<img align="left" src="https://images.mentalfloss.com/sites/default/files/styles/mf_image_16x9/public/baby_0.jpg" width="500"/>

source: https://images.mentalfloss.com/sites/default/files/styles/mf_image_16x9/public/baby_0.jpg

### This can be answered with a hypothesis test!

## 4. or an A/B testing question such as this:

Which of the 2 website designs will lead to more customer engagement (measured by click-through-rate, for example)?

<img align="left" src="https://images.ctfassets.net/zw48pl1isxmc/4QYN7VubAAgEAGs0EuWguw/165749ef2fa01c1c004b6a167fd27835/ab-testing.png" width="600"/>

source: https://images.ctfassets.net/zw48pl1isxmc/4QYN7VubAAgEAGs0EuWguw/165749ef2fa01c1c004b6a167fd27835/ab-testing.png

### This can be answered with a hypothesis test!

## Estimation

What is estimation? And how do we do it?


### Marketing example revisited

**Question:** What proportion of undergraduate students have an iPhone?

How could we answer this question? Discuss with your neighbour.

<img align="left" src="img/sampling.001.jpeg" width="700"/>

What if we randomly selected a subset of students and asked each of them whether they have an iPhone? We could then calculate a sample proportion and use it as an **estimate** of the true population proportion (the parameter). Could this work?
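In symbols, the idea might be written as follows. The notation is an assumption made for this note, since the lecture does not define any: $x$ is the number of sampled students who have the phone, $n$ is the sample size, and $p$ is the unknown population proportion.

$$\hat{p} = \frac{x}{n}$$

Here $\hat{p}$ is the sample proportion, which we would use as our estimate of $p$.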

<img align="left" src="img/sampling.002.jpeg" width="700"/>

Let's experiment and see how well sample estimates reflect the true population parameter we are interested in measuring!

# Virtual sampling simulation

- Let's create a virtual box of timbits (our population)
- Let's each use R to:
    - collect a random sample of 40 timbits
    - calculate the proportion of chocolate timbits in that sample
    - add our proportion to this shared [Google sheet](https://docs.google.com/spreadsheets/d/12nCqhf4RZoUtZenTjYvgdNjEbfggrX1VrYE9-v9WYK0/edit?usp=sharing) to build a distribution of this sample statistic

<img align="left" src="https://cdn.insidetimmies.com/wp-content/uploads/2014/05/tibits.jpg" width="300"/>




source: https://insidetimmies.com/2014/05/20/tim-hortons-has-sold-400000-km-of-timbits-since-its-introduction-in-1976/

### As always, load the libraries we'll be using:

```r
# load libraries for wrangling and plotting
library(dplyr)
library(ggplot2)
library(infer) # install.packages("infer")
```





### 1. Create a virtual box of timbits (population)

Let's "create" a population of 10000 timbits where the proportion of chocolate timbits is 0.63 and the proportion of old fashioned timbits is 0.37.

**IMPORTANT - set your seed to 1234 so that we all create the same population!**

```r
# create virtual box
set.seed(1234)
virtual_box <- tibble(timbit_id = seq(1, 10000, by = 1),
                      color = factor(rbinom(10000, 1, 0.63),
                                     labels = c("old fashioned", "chocolate")))
head(virtual_box)
```

| timbit_id (`<dbl>`) | color (`<fct>`) |
|---|---|
| 1 | chocolate |
| 2 | chocolate |
| 3 | chocolate |
| 4 | chocolate |
| 5 | old fashioned |
| 6 | old fashioned |


- Here we use `rbinom` to "create" the population of 10000 timbits where the proportion of chocolate is 0.63 and the proportion of old fashioned is 0.37.

- We also use `seq` to create a column called `timbit_id` that holds the values from 1 to 10000.

- We use `tibble` to keep these two columns together as a data frame (a tibble is a special type of data frame that you will learn more about in the Data Wrangling course).
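To see how these three pieces fit together, here is a minimal sketch on a box of only 10 timbits. The names `draws`, `flavours`, and `mini_box`, and the seed value, are made up for this illustration and are not part of the lecture code. It shows that `rbinom(n, 1, p)` returns 0s and 1s, and that `factor` maps 0 to "old fashioned" and 1 to "chocolate".

```r
library(dplyr)

set.seed(2024)  # arbitrary seed, used only for this illustration

# ten draws, each equal to 1 with probability 0.63 and 0 otherwise
draws <- rbinom(10, 1, 0.63)

# map 0 -> "old fashioned" and 1 -> "chocolate"
# (levels are spelled out here; the lecture code relies on factor()
#  sorting the values 0 and 1 by default, which gives the same mapping)
flavours <- factor(draws,
                   levels = c(0, 1),
                   labels = c("old fashioned", "chocolate"))

# keep the id, the raw draw, and the label side by side
mini_box <- tibble(timbit_id = seq(1, 10, by = 1),
                   draw = draws,
                   color = flavours)
mini_box
```

Every row with `draw` equal to 1 should be labelled chocolate, which is why the proportion of chocolate timbits in the full box ends up close to 0.63.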

Sanity check that the virtual box contains ~ 63% chocolate timbits:

```r
virtual_box %>% 
    group_by(color) %>% 
    summarize(n = n(),
              proportion = n() / 10000)
```

| color (`<fct>`) | n (`<int>`) | proportion (`<dbl>`) |
|---|---|---|
| old fashioned | 3705 | 0.3705 |
| chocolate | 6295 | 0.6295 |


### 2. Drawing a single sample of size 40

Let's simulate taking one random sample from our virtual timbits box. We will use the `rep_sample_n` function from the `infer` package:

```r
# draw a single sample from the virtual box
set.seed(NULL)
samples_1 <- rep_sample_n(virtual_box, size = 40)
head(samples_1)
```

| replicate (`<int>`) | timbit_id (`<dbl>`) | color (`<fct>`) |
|---|---|---|
| 1 | 6026 | chocolate |
| 1 | 3360 | chocolate |
| 1 | 1398 | chocolate |
| 1 | 2824 | chocolate |
| 1 | 9108 | old fashioned |
| 1 | 8298 | chocolate |


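The `timbit_id` values are scattered across the range 1 to 10000, which tells us that R did what we asked: it randomly selected timbits from our virtual box. (`head` shows only the first 6 of the 40 timbits in the sample.)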
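If you want to check more than the first few rows, a quick sketch like the one below counts the rows and the distinct ids in a sample. The name `check_sample` is hypothetical; the box is rebuilt here only so that the snippet runs on its own.

```r
library(dplyr)
library(infer)

# rebuild the same virtual box as in the lecture
set.seed(1234)
virtual_box <- tibble(timbit_id = seq(1, 10000, by = 1),
                      color = factor(rbinom(10000, 1, 0.63),
                                     labels = c("old fashioned", "chocolate")))

# draw one sample of 40 timbits
check_sample <- rep_sample_n(virtual_box, size = 40)

nrow(check_sample)                  # number of timbits drawn
n_distinct(check_sample$timbit_id)  # number of different timbits drawn
```

Both numbers should be 40, because `rep_sample_n` samples without replacement unless you set `replace = TRUE`, so no timbit is picked twice.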

### What is the proportion of chocolate in our single sample?

```r
choc_sample <- summarize(samples_1,
                         n = sum(color == "chocolate"),
                         prop = sum(color == "chocolate") / 40)
choc_sample
```

| replicate (`<int>`) | n (`<int>`) | prop (`<dbl>`) |
|---|---|---|
| 1 | 22 | 0.55 |


- `summarize` collapses the rows of a data frame into summary values, here the count and the proportion of chocolate timbits (more about this in Data Wrangling)
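As a small illustration of what `summarize` is doing, here is a sketch on a made-up sample of 4 timbits (`toy_sample` and `toy_summary` are hypothetical names). It also shows an alternative to dividing by the sample size by hand: `color == "chocolate"` is a vector of `TRUE`/`FALSE` values, so its `mean` is the proportion of `TRUE`s.

```r
library(dplyr)

# a made-up sample of 4 timbits, 3 of them chocolate
toy_sample <- tibble(color = c("chocolate", "old fashioned",
                               "chocolate", "chocolate"))

toy_summary <- summarize(toy_sample,
                         n = sum(color == "chocolate"),      # count of chocolate timbits
                         prop = mean(color == "chocolate"))  # proportion of chocolate timbits
toy_summary
```

The four rows collapse into a single row with `n` equal to 3 and `prop` equal to 0.75.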

## Add our calculated proportion to the shared Google sheet

- [Google sheet](https://docs.google.com/spreadsheets/d/152tE3_dA6yz3fhbihFXgH0HtMmtLinb2WMGNCJrbAcg/edit?usp=sharing)

## Now it's your turn! Go!

1. Collect a random sample of 40 timbits and calculate the count and proportion of chocolate timbits (code given below).

2. Add your proportion to this shared [Google sheet](https://docs.google.com/spreadsheets/d/152tE3_dA6yz3fhbihFXgH0HtMmtLinb2WMGNCJrbAcg/edit?usp=sharing) to build a distribution of this sample statistic.

```r
set.seed(1234) # so that we all have the same population
virtual_box <- tibble(timbit_id = seq(from = 1, to = 10000, by = 1),
                      color = factor(rbinom(10000, 1, 0.63), 
                                     levels = c(1, 0),
                                     labels = c("chocolate", "old fashioned")))
set.seed(NULL) # so that we each collect a different sample
choc_sample <- rep_sample_n(virtual_box, size = 40) %>% 
    summarize(n = sum(color == "chocolate"),
              prop = sum(color == "chocolate") / 40)
choc_sample
```

| replicate (`<int>`) | n (`<int>`) | prop (`<dbl>`) |
|---|---|---|
| 1 | 28 | 0.7 |
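If you would like to preview what the shared sheet might look like, the sketch below lets R play the role of the whole class. It assumes a hypothetical class of 100 students (the `reps = 100` argument) and an arbitrary seed. The names `class_samples`, `class_props`, and `class_plot` are made up for this illustration.

```r
library(dplyr)
library(ggplot2)
library(infer)

# rebuild the same virtual box as above
set.seed(1234)
virtual_box <- tibble(timbit_id = seq(from = 1, to = 10000, by = 1),
                      color = factor(rbinom(10000, 1, 0.63),
                                     levels = c(1, 0),
                                     labels = c("chocolate", "old fashioned")))

set.seed(100)  # arbitrary seed so this illustration is reproducible

# 100 samples of size 40, one per (hypothetical) student
class_samples <- rep_sample_n(virtual_box, size = 40, reps = 100)

# one proportion of chocolate timbits per sample
class_props <- class_samples %>% 
    group_by(replicate) %>% 
    summarize(prop = sum(color == "chocolate") / 40)

# histogram of the 100 sample proportions
class_plot <- ggplot(class_props, aes(x = prop)) +
    geom_histogram(binwidth = 0.05) +
    xlab("Sample proportion of chocolate timbits") +
    ylab("Number of samples")
class_plot
```

Each bar counts how many of the simulated samples gave a proportion in that range, which is the same picture we are building by hand in the Google sheet.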


## Discussion time

How well do our samples represent the population parameter we are interested in (proportion of chocolate timbits)?

## Back to our marketing example

Is randomly selecting a subset of the students (taking a single sample) and then asking them if they have an iPhone a good way to estimate the true proportion of all undergraduates who have iPhones (the population parameter we are interested in)?

<img align="left" src="img/sampling.002.jpeg" width="700"/>

# Wrap-up 

What did we learn so far today? Let's make a list here!

- 

- 

- 

## Questions that we will try to answer next

- Usually we only have one sample. So what can we do?

# Acknowledgements
- [Data Science in a box](https://github.com/rstudio-education/datascience-box) by Mine Cetinkaya-Rundel
- [Inference in 3 hours](https://github.com/AllenDowney/CompStats) by Allen Downey
- [Modern Dive: An Introduction to Statistical and Data Sciences via R](https://moderndive.com/index.html) by Chester Ismay and Albert Y. Kim