### 3.1 Approximation results and confidence intervals

#### Law of Large Numbers

>In probability theory, the **law of large numbers (LLN)** is a theorem that describes the result of performing the same experiment a large number of times. According to the law, the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed.

> *Source: [Wikipedia](https://en.wikipedia.org/wiki/Law_of_large_numbers)*

Imagine we are running an experiment to check the ratio of people with and without **Hypertension**. We assume that the ratio in this dataset is the true ratio in the population. We then draw samples of the Hypertension values and check whether the difference between the sample ratio and the population ratio approaches zero as the sample size grows.

In [None]:
n <- 500

# convert the hypertension factor column to numeric 0/1 values
hypertension  <- as.numeric(levels(Stroke_Data$hypertension))[Stroke_Data$hypertension]

avg_value <- mean(hypertension)

# draw three random samples of size n each
x1 = sample(hypertension, size = n, replace = F)
x2 = sample(hypertension, size = n, replace = F)
x3 = sample(hypertension, size = n, replace = F)

#creating three vectors that will hold information about ratio for each size of sample (from 1 to n)
xbar1 = rep(0,length(x1))
xbar2 = rep(0,length(x2))
xbar3 = rep(0,length(x3))

for (i in 1:length(x1)) {
    xbar1[i] = mean(x1[1:i])
    xbar2[i] = mean(x2[1:i])
    xbar3[i] = mean(x3[1:i])
}

plot(1:n, xbar1-avg_value, type="l", col="red", lwd=1, ylim=c(-0.1,0.2),
     xlab="Number of subjects sampled",
     ylab="Distance to the mean")
lines(1:n, xbar2-avg_value, col="blue", lwd=1)
lines(1:n, xbar3-avg_value, col="orange", lwd=1)
lines(1:n, rep(0,n), lwd=3)

We can see that the more subjects we sample, the closer the sample mean tends to be to the population mean.
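The running means from the loop above can also be computed in one vectorised step. The sketch below is a minimal illustration on simulated 0/1 data; the ratio `p_true = 0.1` and the number of draws are assumptions chosen for the example, not values taken from `Stroke_Data`.

```r
set.seed(1)

p_true  <- 0.1    # assumed "population" ratio (hypothetical)
n_draws <- 10000  # assumed number of sampled subjects (hypothetical)

# Simulated 0/1 outcomes standing in for the hypertension column
x <- rbinom(n_draws, size = 1, prob = p_true)

# Running mean after 1, 2, ..., n_draws observations
running_mean <- cumsum(x) / seq_along(x)

# Absolute distance to the assumed ratio at a few sample sizes
checkpoints <- c(10, 100, 1000, 10000)
data.frame(n = checkpoints,
           distance = abs(running_mean[checkpoints] - p_true))
```

`cumsum(x) / seq_along(x)` gives the same values as calling `mean(x[1:i])` for every `i`, without an explicit loop.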

The **law of large numbers** establishes that as *n* increases the sample averages get close to the target, while the
**central limit theorem** tells us how close the results of the experiment are to the true target, and with what
probability.
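Stated a little more formally, for independent, identically distributed observations $X_1, \dots, X_n$ with mean $\mu$ and finite variance $\sigma^2$ (the standard textbook assumptions, not something verified on this dataset), the two results read:

\begin{align}
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{P} \mu
\qquad \text{and} \qquad
\frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} \mathcal{N}(0, 1)
\end{align}

The first statement is the (weak) law of large numbers; the second is the central limit theorem.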

#### Central Limit Theorem

> In probability theory, the **central limit theorem (CLT)** establishes that, in some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a "bell curve") even if the original variables themselves are not normally distributed. The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions.

> *Source: [Wikipedia](https://en.wikipedia.org/wiki/Central_limit_theorem)*

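Let's estimate the mean **BMI** of people who had a **stroke**. By the CLT the sample mean is approximately normally distributed, so we can build a 95% confidence interval around it using normal quantiles.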

In [None]:
bmi <- Stroke_Data$bmi[Stroke_Data$stroke == 1]

n <- length(bmi)
mu_hat <- mean(bmi) # Sample mean estimates the expected value (mu) 
s <- sd(bmi) # Sample standard deviation
error <- s/sqrt(n) # The standard error of mu_hat
z <- qnorm(0.05/2, lower.tail = F) # Compute the critical value z

lower_ci <- mu_hat - z*error
upper_ci <- mu_hat + z*error

interval_estimate <- c("estimate" = mu_hat, "lower 95%" = lower_ci, "upper 95%" = upper_ci)
round(interval_estimate, digits = 1)
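The cell above implements the usual normal-approximation interval. Written out, with $\hat{\mu}$ the sample mean, $s$ the sample standard deviation, $n$ the sample size and $\alpha = 0.05$:

\begin{align}
\hat{\mu} \pm z_{1-\alpha/2} \cdot \frac{s}{\sqrt{n}}, \qquad z_{1-\alpha/2} = z_{0.975} \approx 1.96
\end{align}

Here $z_{0.975}$ is the value returned by `qnorm(0.05/2, lower.tail = F)`.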

In [None]:
hist(bmi)
abline(v = mean(bmi), col = "royalblue", lwd = 2)
abline(v = lower_ci, col = "red", lwd = 2)
abline(v = upper_ci, col = "red", lwd = 2)

In [None]:
N <- 100
means <- numeric(N)

for(i in 1:N) {
  bmi_sample <- sample(Stroke_Data$bmi[Stroke_Data$stroke == 1], size = 50, replace = F)
  means[i] <- mean(bmi_sample)
}

# Compute basic summary statistics
summary(means)

# Visualize the distribution of the means with a histogram
hist(means)

In [None]:
# Draw a single sample of n = 50 BMI values (the same size as the samples simulated above)
n <- 50
bmi_sample <- sample(Stroke_Data$bmi[Stroke_Data$stroke == 1], size = n, replace = F)

# Compute the sample mean
mu_hat <- mean(bmi_sample)

# Compute the sample standard deviation
s <- sd(bmi_sample)

# Compute the standard error of the mean
error <- s/ sqrt(n)

# (1) Histogram of the previously simulated sample means and
# (2) The normal approximation of that distribution, based on a single sample
hist(means, freq = FALSE)
curve(dnorm(x, mean = mu_hat, sd = error), add = T)
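As a self-contained check of the same idea, the sketch below repeats the experiment on simulated data. The exponential population (rate 1, so mean 1 and standard deviation 1), the sample size of 50 and the 2000 repetitions are assumptions made for illustration; they are not derived from `Stroke_Data`.

```r
set.seed(123)

n_sample <- 50    # size of each sample (assumed)
n_reps   <- 2000  # number of repeated samples (assumed)

# Means of repeated samples from a clearly non-normal (exponential) population
sim_means <- replicate(n_reps, mean(rexp(n_sample, rate = 1)))

# The sample mean has expectation 1 and standard deviation 1 / sqrt(n_sample);
# the CLT says its distribution is approximately normal
predicted_sd <- 1 / sqrt(n_sample)
c(mean_of_means = mean(sim_means),
  sd_of_means   = sd(sim_means),
  predicted_sd  = predicted_sd)

# Histogram of the simulated means with the predicted normal curve on top
hist(sim_means, freq = FALSE)
curve(dnorm(x, mean = 1, sd = predicted_sd), add = TRUE)
```

Even though individual exponential draws are strongly skewed, the histogram of their means should look close to the bell curve.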

### 3.2 Introduction to t-test and confidence intervals

Let's assume that we want to check whether the difference in age between people who had a stroke and people who didn't is significant. We can run a **t-test** to find out.

>The **t-test** (also called **Student’s T-Test**) compares two averages (means) and tells you if they are different from each other. The t-test also tells you how significant the differences are; In other words it lets you know if those differences could have happened by chance.

>*Source: [Statistics How To](http://www.statisticshowto.com/probability-and-statistics/t-test/)*

First, let's summarise the distribution of **Age** by **Stroke** outcome.

In [None]:
Stroke_Data %>%
    mutate(group = case_when(
        (stroke == 1) ~ "Stroke",
        (stroke == 0) ~ "No Stroke")) %>%
    group_by(group) %>%
    summarise(count = length(age),
              mean = round(mean(age)),
              variance = round(var(age),2))

The average age in the stroke group is much higher (which should not come as a big surprise). Comparing the sample means of the two groups, where group 0 is "No Stroke" and group 1 is "Stroke", gives:
\begin{align} 
\mu^0- \mu^1 = 41 - 68 = -27 
\end{align}

Our hypotheses for the **t-test** look like this:

* $H_0$: there is no difference in mean between two groups. $\mu^0 = \mu^1$

* $H_1$: there is a difference in mean between two groups. $\mu^0 \neq \mu^1$

Now we just have to run the `t.test` function, which is built into R, to find out whether the difference in means is significant or could be due to random chance. A paired t-test is not appropriate here because the two groups contain different people (and have different sample sizes), so we use the unpaired test, `paired = FALSE` (the default). The parameter `var.equal` should also be set to `FALSE` (Welch's t-test), since the variances of the two groups are not equal.
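With `var.equal = FALSE`, `t.test` performs Welch's t-test. As a reminder of what is computed, let $\bar{x}_k$, $s_k^2$ and $n_k$ denote the sample mean, sample variance and size of group $k \in \{0, 1\}$ (notation introduced here for illustration):

\begin{align}
t = \frac{\bar{x}_0 - \bar{x}_1}{\sqrt{\dfrac{s_0^2}{n_0} + \dfrac{s_1^2}{n_1}}}, \qquad
\nu \approx \frac{\left(\dfrac{s_0^2}{n_0} + \dfrac{s_1^2}{n_1}\right)^2}{\dfrac{(s_0^2/n_0)^2}{n_0 - 1} + \dfrac{(s_1^2/n_1)^2}{n_1 - 1}}
\end{align}

where $\nu$ is the Welch-Satterthwaite approximation to the degrees of freedom of the reference t distribution.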

In [None]:
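# `paired` is left at its default (FALSE): the groups are independent, and
# recent R versions reject `paired` in the formula interface
t.test(age~stroke, 
       data=Stroke_Data,
       var.equal = FALSE,
       conf.level = 0.95)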

The 95% confidence interval for the difference in means is [-28; -26]. It does not include 0, so we **reject** the null hypothesis at the 5% significance level: the difference between the group means is statistically significant. In other words, **we are 95% confident that people who had a stroke are on average 26 to 28 years older than people who did not**.
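To see where such an interval comes from, the sketch below recomputes the Welch statistic, its degrees of freedom and the confidence interval by hand and compares them with `t.test`. The ages are simulated with assumed group sizes, means and standard deviations, and `welch_by_hand` is a hypothetical helper written only for this example.

```r
set.seed(42)

# Hypothetical ages, simulated for illustration only (not the Stroke_Data values)
age_no_stroke <- rnorm(200, mean = 41, sd = 22)
age_stroke    <- rnorm(40,  mean = 68, sd = 12)

# Welch statistic, degrees of freedom and confidence interval computed by hand
welch_by_hand <- function(x, y, conf.level = 0.95) {
  vx <- var(x) / length(x)   # squared standard error of mean(x)
  vy <- var(y) / length(y)   # squared standard error of mean(y)
  se <- sqrt(vx + vy)        # standard error of the difference in means
  df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  d  <- mean(x) - mean(y)
  t_crit <- qt(1 - (1 - conf.level) / 2, df)
  c(t = d / se, df = df, lower = d - t_crit * se, upper = d + t_crit * se)
}

manual  <- welch_by_hand(age_no_stroke, age_stroke)
builtin <- t.test(age_no_stroke, age_stroke, var.equal = FALSE)

# The two rows should agree
rbind(manual  = manual,
      builtin = c(builtin$statistic, builtin$parameter, builtin$conf.int))
```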

This may seem unnecessary: we had two samples, calculated the difference of their averages, and it was far from 0, so why did we still have to run a t-test to check for significance? The answer is that the width of the confidence interval depends on the sample size and on the variability within the groups (the more observations we have, the narrower the interval and the more confident we can be). Let's run another example with fewer observations.

In [None]:
example <- Stroke_Data %>%
    select(stroke, age) 

# overwrite the stroke labels with artificial values, purely for illustration
example  <- example[1:50,]
example$stroke[1:25]  <- 0
example$stroke[26:50] <- 1

example %>% group_by(stroke) %>%
    summarise(count = length(age),
              mean_age = round(mean(age)),
              variance = round(var(age),2))

In [None]:
# `paired` is left at its default (FALSE); recent R versions reject it in the formula interface
t.test(age~stroke, 
       data=example,
       var.equal = FALSE,
       conf.level = 0.95)

In this case the difference was 9, which can also be considered "away from zero", but the confidence interval [-4; 21] is much wider. It includes 0, so we cannot reject the null hypothesis: based on these samples we cannot conclude that Age is related to Stroke outcome.
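The effect of sample size on the width of the interval can be sketched with simulated data. The group means, the common standard deviation of 20 and the sample sizes below are assumptions chosen for illustration, and `ci_width` is a hypothetical helper, not part of the analysis above.

```r
set.seed(2024)

# Width of the 95% Welch confidence interval for two simulated groups of size n.
# Group means and standard deviations are hypothetical values for illustration.
ci_width <- function(n) {
  group0 <- rnorm(n, mean = 50, sd = 20)
  group1 <- rnorm(n, mean = 59, sd = 20)
  ci <- t.test(group0, group1, var.equal = FALSE)$conf.int
  diff(as.numeric(ci))  # upper bound minus lower bound
}

sizes  <- c(25, 100, 400, 1600)
widths <- sapply(sizes, ci_width)
data.frame(n_per_group = sizes, ci_width = round(widths, 1))
```

Because the standard error scales with $1/\sqrt{n}$, quadrupling the number of observations per group should roughly halve the width of the interval.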