# Lecture 7 - Two-sample t-tests
This notebook contains the conceptual examples we will work through in lecture.  It does not focus on the actual "doing" of the analysis in a practical application.  Lab 5 will focus on that and show you examples of how to use the appropriate code to conduct your analyses for HW 5.

In [None]:
# LIBRARIES
library(tidyverse)
library(magrittr) ## for pipe operators
library(scales) ## for scaling functions for ggplot2

#### plot size options for Jupyter Notebooks ONLY
options(repr.plot.width  = 8,
        repr.plot.height = 6)
#### do not use these options for RStudio

bold.14.text <- element_text(face = "bold", size = 14)

In [None]:

## DATA
cah <- read_csv("201710-CAH_PulseOfTheNation_Raw.csv")
## variable names currently full questions - need to rename
new_names <- c("income", "gender", "age", "age_cat", "polaffil", "trump", "educ", "race", "whtnat", "whtnat_rep",
              "love_us", "love_us_dem", "helppoor", "helppoor_rep", "racist", "racist_dem", "friendtrump", "civilwar",
              "hunting", "kale", "therock", "trumpvader")
colnames(cah) <- new_names
cah_oct <- cah %>% drop_na(income) %>% filter(!gender %in% c("DK/REF", "Other"))
glimpse(cah_oct)

## Two-Sample t-test - Conceptual Example
We're going to again look at income in the CAH sample. This time we are going to compare the average income within two samples - these two samples are defined by the gender variable - the male sample vs. the female sample.  Note that these are two subgroups that come from the same sample survey, but they are treated as two samples.  This is how the majority of analysts conduct t-tests.

How would we run a hypothesis test for this?

### Step 1 - Formulate Hypothesis

$H_0 : \mu_{female} = \mu_{male}$

$H_A : \mu_{female} \neq \mu_{male}$

**Note:** Given our $H_A$ we're running a two-tailed test.

In [None]:
cah_oct %>%
  ggplot( aes(x=income/1000, fill=gender)) +
    geom_density(alpha=0.6) +
    scale_fill_manual(values=c("red", "#404080")) +
    labs(fill="Gender",
         y = "Density",
         x = "Income in $1000",
         title = "Distribution of Income by Gender") +
    theme(text = bold.14.text)

In [None]:
meantab <- cah_oct %>% 
                group_by(gender)  %>% 
                summarize(freq = n(),
                          mean = mean(income),
                          stddev = sd(income),
                          var = stddev^2)
meantab

### Step 2 - Prepare and Check Conditions

Set alpha ->>> $\alpha = 0.05$

Random and independent sample ->>> Yes

Sample is <10% of the population? ->>> Yes

Sampling distribution is normally distributed? ->>> Yes, given Central Limit Theorem

**Are the variances of each sample equal? ->>>**

The variances of the two groups/samples ($s_x^2$) need to be relatively equal to each other.  If the variances are equal, their ratio is one.

We can check this assumption via hypothesis test:

$H_0: \sigma_1^2 = \sigma_2^2$

$H_A: \sigma_1^2 \neq \sigma_2^2$

### $F = \frac{s_1^2}{s_2^2}$

where $s_1^2$ and $s_2^2$ are the two sample variances.

We compare the calculated F ratio to an F-distribution.  

The F distribution is defined by two degrees of freedom (df): the numerator df and the denominator df.  

Numerator df = n of first (numerator) group - 1<br>
Denominator df = n of second (denominator) group - 1

In [None]:
# observed F - the ratio of the variances of the samples
# variance of income in female sample / variance of income in male sample
meantab
meantab$var[1] / meantab$var[2]

In [None]:
# critical F - upper critical value for a two-tailed test at alpha = 0.05 (0.025 in each tail)
qf(0.025, df1 = meantab$freq[1]-1, df2 = meantab$freq[2]-1, lower.tail = FALSE)

In [None]:
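# p-value - two-tailed, so double the upper-tail area (the observed F is above 1)
f_obs <- meantab$var[1] / meantab$var[2]
2*pf(f_obs, df1 = meantab$freq[1]-1, df2 = meantab$freq[2]-1, lower.tail = FALSE)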

In [None]:
# checking the results with var.test()
var.test(cah_oct$income ~ cah_oct$gender)
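
The hand calculation above doubles the upper tail, which only works when the observed F is above 1. As a minimal sketch, the two-tailed logic can be written so it handles either case. The helper name `f_ratio_test` and the simulated data are assumptions for illustration; they are not part of the lecture code or the CAH survey.

```r
# Hypothetical helper (not part of the lecture code): two-sided F test for
# equality of two variances, written so it works whether F is above or below 1.
f_ratio_test <- function(x, y) {
  f_obs <- var(x) / var(y)            # ratio of sample variances
  df1 <- length(x) - 1                # numerator degrees of freedom
  df2 <- length(y) - 1                # denominator degrees of freedom
  lower <- pf(f_obs, df1, df2)        # P(F <= f_obs)
  p_two <- 2 * min(lower, 1 - lower)  # double the smaller tail
  list(f = f_obs, df1 = df1, df2 = df2, p_value = p_two)
}

# Simulated data, for illustration only
set.seed(7)
x <- rnorm(40, mean = 50, sd = 10)
y <- rnorm(55, mean = 50, sd = 12)

res <- f_ratio_test(x, y)
res$p_value
var.test(x, y)$p.value   # should agree with the hand calculation
```

Doubling the smaller of the two tail areas is one common convention for a two-sided F test; it is used here so the result can be compared directly with `var.test()`.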

### Step 3: Calculate t-statistic and p-value

We know that our variances are not significantly different, so we can treat them as equal. Let's now see if the mean of income among female respondents is significantly different than the mean of male respondents. 

We're doing a two-tailed test.

The formula is:
![](tformula.PNG)

In [None]:
## calculate the denominator - the se_diff
se_diff = sqrt(meantab$var[1]/meantab$freq[1] + 
               meantab$var[2]/meantab$freq[2])

# calculate observed t-value
t_obs = (meantab$mean[1] - meantab$mean[2]) / se_diff
t_obs

In [None]:
# critical t-value, lower.tail because observed t is negative
qt(0.025, df = sum(meantab$freq) - 2, lower.tail = TRUE)

In [None]:
# p-value - two-tailed, so double the tail area beyond |t_obs|
2*pt(-abs(t_obs), df = sum(meantab$freq) - 2, lower.tail = TRUE)

In [None]:
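# check with t.test(); var.equal = TRUE runs the equal-variance test (df = n1 + n2 - 2)
# the default (var.equal = FALSE) runs Welch's unequal-variance test instead
# the pooled standard error can make this t-value differ slightly from the hand calculation
t.test(cah_oct$income ~ cah_oct$gender, var.equal = TRUE)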
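
Since Step 3 relies on treating the variances as equal, it may help to see how the equal-variance (pooled) and unequal-variance (Welch) versions of the test differ. The sketch below works both out by hand and compares them with `t.test()`. The data are simulated and all variable names are assumptions for illustration, not the CAH survey.

```r
# Simulated data, for illustration only (not the CAH survey)
set.seed(11)
g1 <- rnorm(30, mean = 100, sd = 15)
g2 <- rnorm(45, mean = 108, sd = 15)
n1 <- length(g1); n2 <- length(g2)
v1 <- var(g1);    v2 <- var(g2)
mean_diff <- mean(g1) - mean(g2)

# Equal-variance (pooled) version: df = n1 + n2 - 2
sp2       <- ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t_pooled  <- mean_diff / sqrt(sp2 * (1 / n1 + 1 / n2))
df_pooled <- n1 + n2 - 2

# Unequal-variance (Welch) version: Satterthwaite df
se_welch <- sqrt(v1 / n1 + v2 / n2)
t_welch  <- mean_diff / se_welch
df_welch <- se_welch^4 / ((v1 / n1)^2 / (n1 - 1) + (v2 / n2)^2 / (n2 - 1))

# Compare with t.test(); the default is var.equal = FALSE (Welch)
pooled_fit <- t.test(g1, g2, var.equal = TRUE)
welch_fit  <- t.test(g1, g2)
c(t_pooled, unname(pooled_fit$statistic))
c(t_welch,  unname(welch_fit$statistic))
c(df_welch, unname(welch_fit$parameter))
```

When the two sample variances and group sizes are similar, the two versions give nearly the same t-value and p-value, so the choice matters most when the groups are unbalanced.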

#### Conclusion:
1. The observed t-value does not exceed the critical t-value in absolute value.
2. The p-value is greater than alpha = 0.05.
3. There is no significant difference between the mean incomes of males and females.
4. This sample does not provide evidence that males and females earn different amounts of money.

## Two-Sample t-test - Proportions
Now we'll look at an example of testing proportions.  We'll look at the proportion of individuals who reported eating kale, again by gender.

How would we run a hypothesis test for this?

### Step 1 - Formulate Hypothesis

$H_0 : p_{female} = p_{male}$

$H_A : p_{female} \neq p_{male}$

**Note:** Given our $H_A$ we're running a two-tailed test.

In [None]:
cah2 <- cah %>% filter(kale != "DK/REF" & !gender %in% c("DK/REF", "Other")) %>% 
                mutate(prefkale = ifelse(kale == "Yes", 1, 0))
summary(cah2$prefkale)

In [None]:
proptab <- cah2 %>% 
            group_by(gender)  %>% 
            summarize(freq = n(),
                      prop = mean(prefkale),
                      sd = sqrt(prop*(1-prop)),
                      var = (sd^2), # variance of observations - sd squared
                      se = sd / sqrt(freq)) %>% # standard error - sqrt of variance of proportion 
            select(-sd)
proptab

In [None]:
proptab %>% ggplot(aes(x = gender, y = prop, fill = gender)) +
                geom_bar(stat = "identity", position = position_dodge()) +
                geom_errorbar(aes(ymin = prop - 1.96*se, ymax = prop + 1.96*se), 
                                  width = 0.3, position = position_dodge(0.9), size = 1) +
                labs(title = "Proportion of Respondents Who Eat Kale by Gender",
                     subtitle = "With 95% Confidence Interval",
                     x = "Gender",
                     y = "Proportion") +
                theme(legend.position = "none", text = bold.14.text) +
                scale_fill_manual(values=c("#52C87d", "#26d5b8")) 

### Step 2 - Prepare and Check Conditions

Set alpha ->>> $\alpha = 0.05$

Random and independent sample ->>> Yes

Sample is <10% of the population? ->>> Yes

Sampling distribution is normally distributed? ->>> Yes, given Central Limit Theorem

**Are the variances of each sample equal? ->>>**

In [None]:
# observed F
proptab$var[1] / proptab$var[2]

In [None]:
# critical F - lower critical value for a two-tailed test at alpha = 0.05 (the observed F is below 1)
qf(0.025, df1 = proptab$freq[1]-1, df2 = proptab$freq[2]-1, lower.tail = TRUE)

In [None]:
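# p-value - two-tailed, so double the lower-tail area (the observed F is below 1)
f_obs_kale <- proptab$var[1] / proptab$var[2]
2*pf(f_obs_kale, df1 = proptab$freq[1]-1, df2 = proptab$freq[2]-1, lower.tail = TRUE)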

In [None]:
# compare to var.test
var.test(cah2$prefkale ~ cah2$gender)

### Step 3: Calculate t-statistic and p-value

In [None]:
# Check with t.test()
t.test(cah2$prefkale ~ cah2$gender)
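
The sketch below illustrates two points about running a t-test on a 0/1 variable: the mean is the proportion, and `var()` divides by n - 1, so it is slightly larger than $p(1-p)$. It also shows how the critical t-value and the p-value lead to the same decision. The data are simulated and the variable names are assumptions for illustration, not the CAH survey.

```r
# Simulated 0/1 responses, for illustration only (not the CAH survey)
set.seed(21)
grp_a <- rbinom(120, size = 1, prob = 0.30)
grp_b <- rbinom(150, size = 1, prob = 0.35)

# For a 0/1 variable the mean is the proportion
p_a <- mean(grp_a)
n_a <- length(grp_a)

# var() divides by n - 1, so it equals p(1 - p) scaled by n / (n - 1)
var_formula <- p_a * (1 - p_a)
var_sample  <- var(grp_a)
c(var_formula * n_a / (n_a - 1), var_sample)

# Decision rule from the t-test output
fit    <- t.test(grp_a, grp_b)
t_obs  <- unname(fit$statistic)
t_crit <- qt(0.975, df = unname(fit$parameter))  # two-tailed, alpha = 0.05
reject <- abs(t_obs) > t_crit
c(t_obs = t_obs, t_crit = t_crit)
reject
fit$p.value < 0.05   # same decision as comparing |t| to the critical value
```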

#### Conclusions:
1. The observed t-value does not exceed the critical t-value in absolute value.
2. The p-value is greater than alpha = 0.05.
3. There is no significant difference between the proportions of men and women who eat kale.

### Two 0/1 variables? Chi-square test.
Since we have two variables with two levels each, we could instead have run a chi-square test.

In [None]:
kale <- table(cah2$prefkale, cah2$gender)
chisq.test(kale)
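
As a hedged sketch of how the chi-square test relates to a test of two proportions: for a 2x2 table, the chi-square statistic without continuity correction equals the square of the pooled two-proportion z statistic. Note that `chisq.test()` applies Yates' continuity correction to 2x2 tables by default, so its default output will be slightly more conservative. The counts below are hypothetical, not taken from the CAH survey.

```r
# Hypothetical counts, for illustration only (not the CAH survey)
yes <- c(48, 60)     # number answering "Yes" in each group
n   <- c(150, 160)   # group sizes
tab <- rbind(yes = yes, no = n - yes)

# Pooled two-proportion z statistic
p_hat  <- yes / n
p_pool <- sum(yes) / sum(n)
z <- (p_hat[1] - p_hat[2]) / sqrt(p_pool * (1 - p_pool) * (1 / n[1] + 1 / n[2]))

# chisq.test() applies Yates' continuity correction to 2x2 tables by default;
# turning it off gives a statistic equal to z squared
chi_uncorrected <- chisq.test(tab, correct = FALSE)
z^2
unname(chi_uncorrected$statistic)
```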

## Two-Sample test - Non-parametric
t-tests are relatively robust to violations of parametric assumptions (such as deviation from a normal distribution), especially at large n. But under some conditions it's appropriate to use a non-parametric test.

When it might be appropriate:
- DV is ordinal and cannot reasonably be treated as numeric
- DV is numeric but highly skewed

For non-paired data we use the Mann-Whitney U Test.

### Step 1 - Formulate Hypothesis
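Instead of comparing means we're comparing medians. Strictly speaking, the Mann-Whitney U test compares the two distributions as a whole; reading it as a test of medians assumes the two groups' distributions have a similar shape.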

$H_0$: There is no difference in medians between the two groups.

$H_A$: There is a difference in medians between the two groups.

**Note:** Given our $H_A$ we're running a two-tailed test.

In [None]:
cah3 <- cah %>% drop_na(whtnat_rep) %>% filter(kale != "DK/REF") 
summary(cah3$whtnat_rep)
table(cah3$kale)

In [None]:
medtab <- cah3 %>% 
            group_by(kale)  %>% 
            summarize(freq = n(),
                      median = median(whtnat_rep),
                      mean = mean(whtnat_rep)) 
medtab

In [None]:
cah3 %>%
  ggplot(aes(x=whtnat_rep, fill=kale)) +
    geom_histogram(bins = 15, alpha = 0.9) +
    scale_fill_manual(values=c("#52C87d", "#26d5b8"))  +
    labs(fill="Eats Kale",
         y = "Frequency",
         x = "% Republicans Agree with White Nationalists",
         title = "Guesses of % of Republicans that agree with White Nationalists",
         subtitle = "By whether they eat kale") +
    theme(text = bold.14.text) 

In [None]:
cah3  %>%  ggplot(aes(sample = whtnat_rep)) +
  geom_qq_line(color = "purple", size = 1) +
  geom_qq(color = "#1d9a86") +
  labs(title = "QQ Plot of Guesses")+
    theme(axis.text.x = bold.14.text, 
                      text = bold.14.text)

### Step 2 - Prepare and Check Conditions

Set alpha ->>> $\alpha = 0.05$

Random and independent sample ->>> Yes

Sample is <10% of the population? ->>> Yes

Sampling distribution is normally distributed? ->>> Doesn't matter - we're running a non-parametric test

Are the variances of each sample equal? ->>> Doesn't matter for this test


### Step 3 - Calculate Mann-Whitney U

In [None]:
# the wilcox.test with paired = FALSE conducts Mann-Whitney
wilcox.test(cah3$whtnat_rep ~ cah3$kale, paired = FALSE)
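
To make the test statistic less of a black box, here is a minimal sketch of how U can be computed from ranks: rank all observations together, sum the ranks of the first group, and subtract the smallest possible rank sum. R reports this value as `W`. The data are simulated without ties and the variable names are assumptions for illustration, not the CAH survey.

```r
# Simulated skewed data, for illustration only (not the CAH survey)
set.seed(5)
x <- rexp(20, rate = 1 / 30)
y <- rexp(25, rate = 1 / 40)

# Rank all observations together, then sum the ranks of the first group
r   <- rank(c(x, y))
n1  <- length(x)
u_x <- sum(r[seq_len(n1)]) - n1 * (n1 + 1) / 2

u_x
sum(outer(x, y, ">"))   # same number: pairs where the x value is larger
unname(wilcox.test(x, y, paired = FALSE)$statistic)  # reported as W
```

Reading U as a count of pairs shows why the test depends only on the ordering of the values, which is what makes it suitable for skewed or ordinal outcomes.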

#### Conclusion:
1. The p-value is lower than alpha = 0.05.
2. There is a significant difference between medians.
3. There is evidence that a person's guess of the proportion of Republicans who support White Nationalists differs depending on whether or not they eat kale.

## New Version of Effect Size

### R-squared
Finally we'll turn to R-squared to see what proportion of the variance in our DV is explained by the IV (the groups).  

For t-values, $r^2$ is calculated using this formula:

# $$  r^2 = \frac{t^2}{t^2 + df}  $$

where $t$ is your t-value (statistic) and $df$ is your degrees of freedom.

$r^2$ ranges from 0 to 1 where 0 means there is no variation explained by the IV and 1 means all of the variation is explained by the IV.
- $r^2 \approx$ 0.1, little to no effect
- $r^2 \approx$ 0.3, weak effect
- $r^2 \approx$ 0.5, moderate effect
- $r^2 \approx$ 0.6 to 1, strong effect
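
A minimal sketch of the formula above, with one worked value and a check on simulated data. For the equal-variance t-test, $r^2$ computed from $t$ and $df$ matches the squared correlation between the DV and a 0/1 group indicator. The helper name `r2_from_t` and the simulated data are assumptions for illustration.

```r
# Hypothetical helper (not part of the lecture code)
r2_from_t <- function(t, df) t^2 / (t^2 + df)

# Worked arithmetic: t = 2, df = 96 gives 4 / (4 + 96) = 0.04
r2_from_t(2, 96)

# Simulated data, for illustration only
set.seed(3)
y     <- c(rnorm(40, mean = 10), rnorm(50, mean = 11))
group <- rep(c(0, 1), times = c(40, 50))

fit    <- t.test(y ~ group, var.equal = TRUE)
r2_t   <- r2_from_t(unname(fit$statistic), unname(fit$parameter))
r2_cor <- cor(y, group)^2   # squared correlation between DV and group indicator
c(r2_t, r2_cor)
```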

In [None]:
k <- t.test(cah2$prefkale ~ cah2$gender)
str(k)

In [None]:
#rsquared of proportion of kale by gender
rsq <- k$statistic^2 / (k$statistic^2 + k$parameter)
names(rsq) <- "r-squared"
rsq
percent(rsq, accuracy = .01)

In [None]:
# rsquared of mean of guesses of white nationalist support by kale
w <- t.test(cah3$whtnat_rep ~ cah3$kale)
rsq2 <- w$statistic^2 / (w$statistic^2 + w$parameter)
names(rsq2) <- "r-squared"
rsq2
percent(rsq2, accuracy = .01)

## Two-Sample t-test - Paired
Finally, we'll look at a paired t-test.  We need to use this if our observations are not independent, either because the individuals are paired (spouses), or because the responses are paired within individuals (pre- and post-test, for example).

How would we run a hypothesis test for this?

### Step 1 - Formulate Hypothesis

$H_0 : \mu_{pretest} = \mu_{posttest}$

$H_A : \mu_{pretest} \neq \mu_{posttest}$

**Note:** Given our $H_A$ we're running a two-tailed test.

In [None]:
#load the .rds file
anes <- readRDS("anes2.rds")
glimpse(anes)

In [None]:
anes_long <- anes %>% 
                select(ft_pre_rep, ft_post_rep) %>% 
                mutate("pre-election" = as.numeric(ft_pre_rep), "post-election" = as.numeric(ft_post_rep))  %>% 
                select(-ft_pre_rep, -ft_post_rep) %>% 
                gather(key = "time", value = "ft") %>% 
                mutate(time = factor(time)) %>% 
                mutate(time = fct_relevel(time, "pre-election"))
pre_mean <- mean(anes$ft_pre_rep)
post_mean <- mean(anes$ft_post_rep)
anes_long %>%
  ggplot(aes(x=ft, fill=time)) +
    geom_density(alpha=0.6) +
    scale_fill_manual(values=c("#26d5b8", "#ff5733")) +
    labs(fill="",
         y = "Density",
         x = "Feeling Thermometer Rating",
         title = "Distribution of FT Rating toward 2016 Republican Candidate",
         subtitle = "Pre-election vs. Post-election") +
    theme(text = bold.14.text, legend.position = "top") + 
    geom_vline(xintercept = post_mean, color = "#ff5733", size = 2) +
    geom_vline(xintercept = pre_mean, color = "#26d5b8", size = 2) +
    annotate(geom="text", x=67, y=.0175, 
             label=paste0("Pre-election Mean = ", round(pre_mean, digits = 1)),
             color = "#26d5b8", size = 6, fontface = 2)+
    annotate(geom="text", x=67, y=.015, 
             label=paste0("Post-election Mean = ", round(post_mean, digits = 1)),
             color = "#ff5733", size = 6, fontface = 2)

### Step 2 - Prepare and Check Conditions

Set alpha ->>> $\alpha = 0.05$

Random and independent sample ->>> Random, yes. Independent, no: the observations are paired, so we use a paired test.

Sample is <10% of the population? ->>> Yes

Sampling distribution is normally distributed? ->>> We can rely on the Central Limit Theorem

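**Are the variances of each time period equal? ->>> Let's check**

Note that the paired t-test is computed on the within-person differences, so it does not itself require equal variances. `var.test()` also assumes independent samples, so treat the check below as a rough description only.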

In [None]:
# checking the results with var.test()
var.test(anes$ft_pre_rep, anes$ft_post_rep)

### Step 3: Calculate t-statistic and p-value

In [None]:
# use the option paired = TRUE for paired data
t.test(anes$ft_pre_rep, anes$ft_post_rep, paired = TRUE)
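
A paired t-test is equivalent to a one-sample t-test on the within-person differences, which is why the independence of the two columns is not required. The sketch below works this out by hand and compares it with `t.test()`. The data are simulated and the variable names are assumptions for illustration, not the ANES data.

```r
# Simulated pre/post ratings, for illustration only (not the ANES data)
set.seed(9)
pre  <- rnorm(60, mean = 40, sd = 20)
post <- pre + rnorm(60, mean = 3, sd = 10)   # each post value tied to its pre value

d <- pre - post                         # within-person differences
n <- length(d)
t_hand <- mean(d) / (sd(d) / sqrt(n))   # one-sample t on the differences
df     <- n - 1
p_hand <- 2 * pt(-abs(t_hand), df)      # two-tailed p-value

paired_fit <- t.test(pre, post, paired = TRUE)
c(t_hand, unname(paired_fit$statistic))
c(p_hand, paired_fit$p.value)
```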

In [None]:
# effect size - rsquared
ft <- t.test(anes$ft_pre_rep, anes$ft_post_rep, paired = TRUE)
rsq3 <- ft$statistic^2 / (ft$statistic^2 + ft$parameter)
names(rsq3) <- "r-squared"
rsq3
percent(rsq3, accuracy = .01)

#### Conclusion:
1. The p-value is lower than alpha = 0.05.
2. There is a significant difference between means.
3. There is a statistically significant difference in support for Trump pre-election vs. support for Trump post-election.
4. The time period of the survey explains about 8% of the variance in responses.