# DSCI 100: Introduction to Data Science

## Tutorial 11 - Introduction to Statistical Inference

First, load the necessary libraries.

In [None]:
### Run this cell before continuing
library(tidyverse)
library(repr)
library(digest)
library(infer)
library(gridExtra)

Just like in the tutorial, we're going to create a simulated dataset of data science final grades for a large population of students: 10,000 grades drawn from a normal distribution with a mean of 70 and a standard deviation of 8.

In [None]:
# run this cell to simulate a finite population
set.seed(12341)
students_pop <- tibble(grade = (rnorm(mean = 70, sd = 8, n = 10000)))
head(students_pop)

Draw 200 random samples from our population of students. Each sample should have 50 observations. Name the data frame `samples_50`. 

In [None]:
set.seed(12341)

### BEGIN SOLUTION
samples_50 <- rep_sample_n(students_pop, size = 50, reps = 200)
### END SOLUTION

head(samples_50)

Group by the sample replicate number, and then for each sample, calculate the mean. Name the data frame `sample_estimates_50`. The data frame should have the column names `replicate` and `sample_mean`.

In [None]:
### BEGIN SOLUTION
sample_estimates_50 <- samples_50 %>% 
    group_by(replicate) %>% 
    summarise(sample_mean = mean(grade))
### END SOLUTION

head(sample_estimates_50)

Visualize the distribution of the sample estimates (`sample_estimates_50`) you just calculated by plotting a histogram, passing `binwidth = 0.5` as an argument to `geom_histogram`. Name the plot `sampling_distribution_50`.

In [None]:
### BEGIN SOLUTION
sampling_distribution_50 <-  ggplot(sample_estimates_50, aes(x = sample_mean)) +
    geom_histogram(binwidth = 0.5) +
    xlab("Sample Mean Grades") +
    ggtitle("Sampling distribution of the sample means (n = 50)")
### END SOLUTION

sampling_distribution_50

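Next, repeat this process, but with a sample size of 500. Name the data frame of samples `samples_500`, the data frame of sample means `sample_estimates_500`, and the histogram `sampling_distribution_500`.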

In [None]:
set.seed(12341)

### BEGIN SOLUTION
samples_500 <- rep_sample_n(students_pop, size = 500, reps = 200)

sample_estimates_500 <- samples_500 %>% 
    group_by(replicate) %>% 
    summarise(sample_mean = mean(grade))

sampling_distribution_500 <-  ggplot(sample_estimates_500, aes(x = sample_mean)) +
    geom_histogram(binwidth = 0.5) +
    xlab("Sample Mean Grades") +
    ggtitle("Sampling distribution of the sample means (n = 500)")
### END SOLUTION

sampling_distribution_500
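
The two histograms are easier to compare when they are drawn next to each other on the same x-axis. The cell below is a minimal sketch of one way to do that with `grid.arrange` from `gridExtra`. It rebuilds the population so that it runs on its own; the helper `plot_sampling_dist` and the axis limits of 65 to 75 are assumptions made for illustration, not part of the tutorial.

```r
library(tidyverse)
library(infer)
library(gridExtra)

set.seed(12341)
# Recreate the simulated population so this cell runs on its own
students_pop <- tibble(grade = rnorm(n = 10000, mean = 70, sd = 8))

# Hypothetical helper (not part of the tutorial): draw 200 samples of size n
# and return a histogram of their sample means on a fixed x-axis range
plot_sampling_dist <- function(n, x_limits = c(65, 75)) {
    estimates <- rep_sample_n(students_pop, size = n, reps = 200) %>%
        group_by(replicate) %>%
        summarise(sample_mean = mean(grade))

    ggplot(estimates, aes(x = sample_mean)) +
        geom_histogram(binwidth = 0.5) +
        coord_cartesian(xlim = x_limits) +  # zooms without dropping data
        xlab("Sample Mean Grades") +
        ggtitle(paste("n =", n))
}

plot_50 <- plot_sampling_dist(50)
plot_500 <- plot_sampling_dist(500)

# Draw the two histograms side by side
grid.arrange(plot_50, plot_500, ncol = 2)
```

`coord_cartesian` is used rather than `xlim` because it only zooms the view; `xlim` would drop any bars that fall outside the limits.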

Based on these two plots, how does sample size affect the sampling distribution and standard error?
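
To back up a visual answer with numbers, the standard error can be estimated as the standard deviation of the sample means and compared with the usual approximation, the population standard deviation divided by the square root of the sample size. The cell below is a rough sketch of that check. The helper `estimate_se` and the table `se_comparison` are names invented here, and the approximation ignores the finite population correction, which is small when the samples are much smaller than the population of 10,000.

```r
library(tidyverse)
library(infer)

set.seed(12341)
# Recreate the simulated population so this cell runs on its own
students_pop <- tibble(grade = rnorm(n = 10000, mean = 70, sd = 8))

# Hypothetical helper (not part of the tutorial): draw `reps` samples of
# size `n` and return the standard deviation of their sample means
estimate_se <- function(pop, n, reps = 200) {
    rep_sample_n(pop, size = n, reps = reps) %>%
        group_by(replicate) %>%
        summarise(sample_mean = mean(grade)) %>%
        pull(sample_mean) %>%
        sd()
}

pop_sd <- sd(students_pop$grade)

# Empirical standard error next to the sd / sqrt(n) approximation
se_comparison <- tibble(
    sample_size = c(50, 500),
    empirical_se = map_dbl(sample_size, ~ estimate_se(students_pop, .x)),
    approx_se = pop_sd / sqrt(sample_size)
)
se_comparison
```

If the approximation holds, the empirical standard error for samples of 500 should be noticeably smaller than for samples of 50, which matches a narrower histogram.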