# Distribution of the sample mean

Here we sample repeatedly from a population and investigate how the distribution of the sample mean depends on the sample size. The "population" we sample from is the height of Swedish men, measured when they were 18-20 years old. The data come from Inskrivningsarkivregistret and were made public by Riksarkivet (the Swedish National Archives). Although the data do not represent the complete population of Swedish men, we pretend they do for this exercise.

In [None]:
## load the ggplot2 package for pretty plotting, and load data
library(ggplot2)
options(repr.plot.width=14, repr.plot.height=8)
## load data from Riksarkivet, height and weight of military conscripts
data <- readRDS("insark_h_w.rds")

Let's look at the population data.

In [None]:
dim(data)
mean(data$h)
ggplot(data=data,aes(x=h))+geom_histogram(binwidth=1)+xlab("height (cm)")+ylab("count")+theme_bw(base_size=18)

We see that the "population" mean is 178.7 cm. Let's try to estimate it from a random sample of 10 men drawn from this population.

In [None]:
nsamp <- 10
samp <- sample(data$h,nsamp,replace=FALSE)
mean(samp)

Now we can repeat this sampling a number of times, and for different sample sizes.

In [None]:
## number of repeated samples drawn for each sample size
niter <- 5000

## sample means for samples of size 10
nsamp <- 10
smean10 <- numeric(niter)
for (i in seq_len(niter)){
    smean10[i] <- mean(sample(data$h,nsamp,replace=FALSE))
}
## sample means for samples of size 100
nsamp <- 100
smean100 <- numeric(niter)
for (i in seq_len(niter)){
    smean100[i] <- mean(sample(data$h,nsamp,replace=FALSE))
}
## sample means for samples of size 1000
nsamp <- 1000
smean1000 <- numeric(niter)
for (i in seq_len(niter)){
    smean1000[i] <- mean(sample(data$h,nsamp,replace=FALSE))
}
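
The three loops differ only in the sample size, so the same simulation can also be written more compactly. The following is a minimal sketch of one way to do that with `replicate()` and `sapply()`. To keep it runnable without the data file, `pop` is a synthetic stand-in for `data$h` (normal with mean 178.7 cm and an assumed standard deviation of 6.5 cm), and `sample_means()` is a hypothetical helper name, not part of the original notebook.

```r
## synthetic stand-in for data$h (assumption: normal, sd 6.5 cm)
set.seed(1)
pop <- rnorm(50000, mean = 178.7, sd = 6.5)

## hypothetical helper: draw niter samples of size nsamp, return their means
sample_means <- function(x, nsamp, niter = 5000) {
    replicate(niter, mean(sample(x, nsamp, replace = FALSE)))
}

sizes <- c(10, 100, 1000)
## matrix with one row per iteration and one column per sample size
smeans <- sapply(sizes, function(n) sample_means(pop, n))
colnames(smeans) <- paste0("N=", sizes)

dim(smeans)
colMeans(smeans)
```

With the real data, `pop` would simply be replaced by `data$h`; each column of `smeans` then corresponds to one of the vectors `smean10`, `smean100` and `smean1000`.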

Now we will plot these three different sets of sample means.

In [None]:
## make a data frame for plotting
df <- data.frame(x=smean10,N="N=10")
df <- rbind(df,data.frame(x=smean100,N="N=100"))
df <- rbind(df,data.frame(x=smean1000,N="N=1000"))

## one panel per sample size, each with its own x-axis range
ggplot(data=df,aes(x=x)) + facet_wrap(.~as.factor(N),scales="free_x")+
    geom_histogram(position='identity',alpha=0.4,bins=30)+
    xlab("sample mean height (cm)")+theme_bw(base_size=18)

Note the different scales on the x-axis. With larger samples, the variability of the sample mean is reduced (its spread shrinks roughly in proportion to 1/√N) and its distribution becomes closer to a normal distribution.
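
The reduction in variability can be checked numerically by comparing the standard deviation of the simulated sample means with the theoretical standard error σ/√N. The sketch below is an illustration under stated assumptions, not a result from the conscript data: `pop` is a synthetic stand-in population (normal, mean 178.7 cm, assumed standard deviation 6.5 cm), and the finite population correction is ignored because the samples are small relative to the population.

```r
## synthetic stand-in for data$h (assumption: normal, sd 6.5 cm)
set.seed(1)
pop <- rnorm(50000, mean = 178.7, sd = 6.5)
sigma <- sd(pop)

sizes <- c(10, 100, 1000)
niter <- 2000

## empirical standard deviation of the sample means for each sample size
emp_se <- sapply(sizes, function(n) {
    sd(replicate(niter, mean(sample(pop, n, replace = FALSE))))
})

## theoretical standard error, ignoring the finite population correction
theo_se <- sigma / sqrt(sizes)

data.frame(N = sizes, empirical = emp_se, theoretical = theo_se)
```

The two columns should agree closely, with the empirical value falling by a factor of about √10 each time the sample size is multiplied by 10.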