# More Two Sample t-tests

Includes:

Two-sample t-test of proportions

Non-parametric test

Effect size

Power analysis

In [32]:
# load libraries
library(tidyverse)
library(DescTools)
library(plotrix)
library(effsize)
library(pwr)
library(readxl)

## Two-sample t-test of proportions
Two-sample t-tests of proportions can largely be conducted the same way as two-sample t-tests of means, using the same functions, because the mean of a 0/1 variable is the proportion of 1s. For this we'll look back at the `small_gss` data and the `abany` variable we looked at in the one-sample t-test examples. This is a yes/no question regarding support for abortion. Instead of comparing the mean of the entire sample to an external value (null hypothesis mean), we'll test the difference in means between males and females.

In [5]:
#read in data
small_gss <- read_xls("small_gss.xls")

## some data cleaning
df <- small_gss %>% ## save recoded version as df
    filter (abany != "Not applicable")  %>% #filter out rows where abany was missing (because the question wasn't asked)
    mutate (abany = ifelse(abany == "Yes", 1, 0)) ## using ifelse to convert chr yes/no to numerical 1/0 
                                                ## the format of the function is ifelse(test, valiftrue, valiffalse)
df %>%count(abany)
df  %>% count(sex)

abany,n
0,1733
1,1601


sex,n
Female,1850
Male,1484


Before we get started, we can run a `var.test` to see if the variances of the variable are equal between the groups (male vs. female). Remember this is not a requirement: if we want to, we can always run the version that assumes unequal variances.

In [9]:
## running var.test
var.test(abany ~ sex, data = df)


	F test to compare two variances

data:  abany by sex
F = 0.99741, num df = 1849, denom df = 1483, p-value = 0.9566
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.9052806 1.0983416
sample estimates:
ratio of variances 
         0.9974089 


The ratio of the variances is very close to 1 and the p-value is very large (0.96), so we fail to reject the null and can treat our variances as equal. So we can use the t-test version for equal variances, if we want to.

In [10]:
## Conduct a t-test on the proportion (mean of 0/1 variable) of abany by sex
## general form - t.test(dv ~ iv, data)

t.test(abany ~ sex, data = df, var.equal = TRUE)


	Two Sample t-test

data:  abany by sex
t = -0.93292, df = 3332, p-value = 0.3509
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.05038734  0.01789663
sample estimates:
mean in group Female   mean in group Male 
           0.4729730            0.4892183 


We get our familiar two-sample t-test results. We see that the difference between males and females in the proportion supporting abortion is not statistically significant. Our p-value is larger than an alpha of 0.05 and our 95% CI of the difference contains 0.
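As a cross-check, base R's `prop.test` runs a test of two proportions directly from counts. The sketch below rebuilds the "Yes" counts from the group sizes and group means reported above (875 of 1850 females, 726 of 1484 males). These counts are back-calculated from the output rather than read from the data file, and `correct = FALSE` turns off the continuity correction so the result is comparable to the t-test.

```r
# counts of "Yes" answers, back-calculated from the reported group means
# (0.4729730 * 1850 and 0.4892183 * 1484); an assumption, not read from the data
yes <- c(Female = 875, Male = 726)
n   <- c(Female = 1850, Male = 1484)

# two-sample test of equal proportions, without continuity correction
res <- prop.test(x = yes, n = n, correct = FALSE)
res$estimate   # sample proportion in each group
res$p.value    # should land close to the t-test p-value
```

With samples this large the two approaches should lead to the same conclusion, which is why we can get away with using `t.test` on a 0/1 variable.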

## Non-parametric Tests
t-tests are relatively robust to parametric violations (deviation from a normal distribution), especially at large n.  But in some conditions it's appropriate to use a non-parametric test.

When it might be appropriate:
1. DV is ordinal, when the t-test assumes interval or ratio
2. DV is interval or ratio but highly skewed

**The non-parametric test will test difference in MEDIANS instead of difference in MEANS.**

However, there are drawbacks:
- Less power
- Less accurate
- Less specific: Don’t always know why we rejected the null

#### The Non-parametric t-tests:

1. Mann-Whitney U test (R calls this the Wilcoxon rank sum test)
    - Used with independent samples (paired=F)
    - Gives W (in R) or U (original formula) statistic instead of t


2. Wilcoxon Signed Rank Test
    - Used for paired tests (paired=T)
    - Gives V statistic (in R) instead of t
    
For our example we will compare level of education (years) by gender, from the same `small_gss` dataset.

In [13]:
table(small_gss$educ)
table(small_gss$sex)


   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
   6    5    7   13    7    7   51   26   83  110  155  213 1481  425  672  264 
  16   17   18   19   20 
 915  205  268  108  182 


Female   Male 
  2887   2328 

In [14]:
# let's take a peek at the medians, since it is what we will be comparing
median(as.numeric(small_gss$educ[small_gss$sex =="Male"]), na.rm=TRUE)
median(as.numeric(small_gss$educ[small_gss$sex =="Female"]), na.rm=TRUE)

So our median years of education is 13 for Males and 14 for Females in the GSS sample.  Let's test the difference in these medians to see if it is significant.  For this we use the Mann-Whitney test, because they are not paired values.

In [15]:
wilcox.test(as.numeric(small_gss$educ) ~ small_gss$sex, paired = FALSE) ## paired = FALSE for Mann-Whitney


	Wilcoxon rank sum test with continuity correction

data:  as.numeric(small_gss$educ) by small_gss$sex
W = 3395110, p-value = 0.3483
alternative hypothesis: true location shift is not equal to 0


So we get a W value (our test statistic) as well as a p-value. We compare the p-value to an alpha of 0.05 and determine that we fail to reject the null; therefore there is NO significant difference in the median level of education between males and females.
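To see where a W statistic like the one above comes from, here is a minimal sketch on made-up numbers (hypothetical values, not GSS data, chosen with no ties). It ranks all observations together, sums the ranks of the first group, and subtracts the smallest value that rank sum could take.

```r
# small made-up samples with no ties (hypothetical values, not GSS data)
x <- c(1.1, 2.3, 3.8, 7.2)
y <- c(0.5, 4.4, 5.9, 9.1, 10.3)

# rank all observations together, then sum the ranks belonging to x
r <- rank(c(x, y))
rank_sum_x <- sum(r[seq_along(x)])

# R's W is the rank sum of the first group minus its smallest possible value
w_by_hand <- rank_sum_x - length(x) * (length(x) + 1) / 2

# compare with what wilcox.test reports
w_from_r <- unname(wilcox.test(x, y, paired = FALSE)$statistic)
c(by_hand = w_by_hand, from_r = w_from_r)
```

Because the test works on ranks only, a single extreme value in either group moves W no more than any other value that keeps the same rank, which is what makes it attractive for highly skewed DVs.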

To test the paired version, we can pretend that these observations come from a sample of equivalent males/females paired on another characteristic (maybe job type?).  To do this we need to have equal numbers of male and female observations (paired).

In [26]:
### pull out equal numbers of male and female observations 

males <- small_gss  %>% filter(sex == "Male")  %>% .[1:2000, ]  
females <- small_gss  %>% filter(sex == "Female")  %>% .[1:2000, ]  

In [27]:
wilcox.test(as.numeric(males$educ), as.numeric(females$educ), paired = TRUE) ## paired = TRUE


	Wilcoxon signed rank test with continuity correction

data:  as.numeric(males$educ) and as.numeric(females$educ)
V = 693242, p-value = 0.6334
alternative hypothesis: true location shift is not equal to 0


Our paired sample also has no significant difference between median education level by gender.

## Effect Size
We'll look at three different effect sizes for two-sample t-tests:
- Cohen's d
- Hedges' g
- r-squared

Cohen's d we already looked at in the one-sample t-test format.  Here we're using the same process, but with two-sample data (instead of a single sample mean vs. a set null hypothesis mean).

Hedges' g is used instead of Cohen's d when we have small n (less than 20), because Cohen's d tends to overestimate the effect in small samples.

R-squared is a new type of effect size we're going to talk about. It reflects the proportion of the variance in the data explained by our IV(s).  So in the case of a t-test, it's the proportion of the overall variance in our DV that is explained by the difference in the group means.  This will be something we will look at a lot when we get to regression.

For these examples I'll go back to the anes data.


In [28]:
#load the .rds files
anes <- readRDS("anes.rds")
anes2 <- readRDS("anes2.rds")
head(anes)
head(anes2)

ft_sci,ft_bigbusn,ft_rich,ft_congress,race,educ
70,70,50,70,white,less than BA
70,40,50,50,white,coll grad or higher
70,70,50,50,white,less than BA
50,40,50,50,white,less than BA
85,50,70,50,white,less than BA
50,70,50,0,white,coll grad or higher


ft_pre_dem,ft_pre_rep,ft_post_dem,ft_post_rep,partyid
0,85,15,85,rep
0,85,50,60,rep
85,0,85,50,dem
0,85,0,100,dem
85,0,70,15,dem
50,60,0,70,dem


### Cohen's d
Let's look at the Cohen's d for our t-test of mean rating of scientists by education level, from the last lecture

In [29]:
# review the results, first
t.test(ft_sci ~ educ, data = anes)


	Welch Two Sample t-test

data:  ft_sci by educ
t = 9.1345, df = 3316, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 4.627323 7.156726
sample estimates:
mean in group coll grad or higher        mean in group less than BA 
                         80.18772                          74.29569 


Now we'll use the `cohen.d` function to calculate our effect size.  This is a function from the package `effsize` which is different from the one from the package `lsr` we used in the one-sample lab.  This one has more options for our different two-sample t-tests.

Select options:
- na.rm=T ---- ignores missing values
- pooled=T ---- pools variance
- paired=T ---- paired sample test; =F if independent sample test
- hedges=T ---- add if you want Hedges' g instead of Cohen's d

In [39]:
# run cohen's d - this time using "effsize" package instead of "lsr" due to additional options
cohen.d(as.numeric(anes$ft_sci) ~ anes$educ)


Cohen's d

d estimate: 0.3057668 (small)
95 percent confidence interval:
    lower     upper 
0.2381440 0.3733896 

The output tells us the magnitude of our difference after our estimate, in this case small.

In general, magnitude of the estimate of d:
- d ≥ 0.2, small effect
- d ≥ 0.5, medium effect
- d ≥ 0.8, large effect
- d ≥ 1.2, very large effect
- d ≥ 2.0, huge effect

### Cohen's d - Paired Data
Let's look at the Cohen's d for our pre- vs. post-election candidate ratings from ANES.

In [40]:
## paired t-test filtered group of only republicans - ratings of Trump
reps <- anes2  %>% filter(partyid == "rep")
t.test(reps$ft_pre_rep, reps$ft_post_rep, paired = TRUE) ## t.test has no pooled argument, so it is dropped here


	Paired t-test

data:  reps$ft_pre_rep and reps$ft_post_rep
t = -9.7663, df = 1019, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -6.570947 -4.372190
sample estimates:
mean of the differences 
              -5.471569 


In [42]:
## and cohen's d
cohen.d(as.numeric(reps$ft_pre_rep), as.numeric(reps$ft_post_rep), pooled=F, paired=T)


Glass's Delta

Delta estimate: -0.2012716 (small)
95 percent confidence interval:
     lower      upper 
-0.2420955 -0.1604478 

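The output is labeled Glass's Delta because we set `pooled=F`, which in `effsize` standardizes the difference by the standard deviation of a single group instead of a pooled standard deviation. We interpret it the same way as Cohen's d. Our difference of about 5.5 points on the 0-100 feeling thermometer (unstandardized effect size) corresponds to a small standardized effect.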

### Hedges' g
Hedges' g is the alternative to Cohen's d when we have a small sample size (less than 20). To look at this I'll create a small "fake" dataset of 20 observations of animal weights: 10 dogs and 10 cats.

In [47]:
tinydf <- data.frame (
        animal = c(rep("dog", 10), rep("cat", 10)),
        weight = c(22, 44, 35, 11, 80, 56, 67, 45, 88, 44, 36, 20, 15, 16, 20, 22, 12, 14, 16, 17)
    )
head(tinydf)

animal,weight
dog,22
dog,44
dog,35
dog,11
dog,80
dog,56


In [53]:
# run t-test first
t.test(weight ~ animal, data = tinydf)


	Welch Two Sample t-test

data:  weight by animal
t = -3.8195, df = 10.392, p-value = 0.00315
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -48.04359 -12.75641
sample estimates:
mean in group cat mean in group dog 
             18.8              49.2 


We have a significant difference in mean weights.  But how big is the difference substantively?  The unstandardized effect size is 49.2 - 18.8 = 30.4 lbs.  What is the standardized effect size?

In [50]:
# and run hedge's g
cohen.d(tinydf$weight ~ tinydf$animal, hedges = TRUE)



Hedges's g

g estimate: -1.635977 (large)
95 percent confidence interval:
     lower      upper 
-2.6755224 -0.5964321 

The standardized effect size is also large - 1.64 (negative only because cats are the first group listed and have a smaller mean weight).
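To make the small-sample correction concrete, here is a sketch that computes Cohen's d by hand for the same made-up weights and then shrinks it with the usual approximate correction factor, $J = 1 - \frac{3}{4(n_1 + n_2) - 9}$. I'm assuming this is the correction `effsize` applies; it does reproduce the g estimate shown above.

```r
# same made-up weights as tinydf above
dog_w <- c(22, 44, 35, 11, 80, 56, 67, 45, 88, 44)
cat_w <- c(36, 20, 15, 16, 20, 22, 12, 14, 16, 17)

n1 <- length(cat_w)
n2 <- length(dog_w)

# pooled standard deviation of the two groups
sd_pooled <- sqrt(((n1 - 1) * var(cat_w) + (n2 - 1) * var(dog_w)) / (n1 + n2 - 2))

# Cohen's d: cats minus dogs, because "cat" is the first group alphabetically
d <- (mean(cat_w) - mean(dog_w)) / sd_pooled

# small-sample correction factor (approximation), always a bit below 1
J <- 1 - 3 / (4 * (n1 + n2) - 9)
g <- d * J

c(d = d, g = g)
```

The correction pulls the estimate toward 0, and it matters less and less as n grows, which is why we only bother with Hedges' g for small samples.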

### R-squared
Finally we'll turn to R-squared to see what proportion of the variance in our DV is explained by the IV (the groups).  

For t-values, $r^2$ is calculated using this formula:

$$  r^2 = \frac{t^2}{t^2 + df}  $$

where $t$ is your t-value (test statistic) and $df$ is your degrees of freedom.

$r^2$ ranges from 0 to 1 where 0 means there is no variation explained by the IV and 1 means all of the variation is explained by the IV.
- $r^2 \approx$ 0.1, little to no effect
- $r^2 \approx$ 0.3, weak effect
- $r^2 \approx$ 0.5, moderate effect
- $r^2 \approx$ 0.6 to 1, strong effect

This time let's look at the mean rating of congress (`ft_congress`) by `race`.

In [61]:
## first run t-test and save output
con_t <- t.test(ft_congress ~ race, data = anes, var.equal = TRUE)
con_t


	Two Sample t-test

data:  ft_congress by race
t = 7.7307, df = 3521, p-value = 1.386e-14
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 4.817960 8.092189
sample estimates:
mean in group not_white     mean in group white 
               47.35863                40.90355 


We have our values saved, so we can use `con_t$statistic` to obtain our t, and `con_t$parameter` to obtain degrees of freedom

In [62]:
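## calculating r-squared for t-test (by hand)
## use new names so we don't overwrite base R's t() or the df data frame from earlier
t_val <- unname(con_t$statistic)
df_val <- unname(con_t$parameter)
r_sq <- t_val^2 / (t_val^2 + df_val)
r_sq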

Our $r^2$ is about 0.017, which means that the IV (race) explains very little of the variation in feelings about congress.

Let's check out the cohen's d to compare:

In [68]:
cohen.d(as.numeric(ft_congress) ~ race, data = anes)


Cohen's d

d estimate: 0.292337 (small)
95 percent confidence interval:
    lower     upper 
0.2178817 0.3667922 

We can get R to calculate this $r^2$ in another way - a function that we won't talk about for a couple more weeks....

In [66]:
summary(lm(ft_congress ~ race, data = anes))$r.squared
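If you want to convince yourself that the two routes agree, here is a small self-contained sketch. It uses made-up dog/cat weights (not the anes data) and compares the by-hand $r^2$ from an equal-variance t-test with the R-squared that `lm` reports. The names `toy`, `r_sq_t` and `r_sq_lm` are just for illustration.

```r
# made-up data: 10 dogs and 10 cats (hypothetical weights)
toy <- data.frame(
  animal = rep(c("dog", "cat"), each = 10),
  weight = c(22, 44, 35, 11, 80, 56, 67, 45, 88, 44,
             36, 20, 15, 16, 20, 22, 12, 14, 16, 17)
)

# r-squared from the t statistic; var.equal = TRUE is needed for the match
tt <- t.test(weight ~ animal, data = toy, var.equal = TRUE)
t_val <- unname(tt$statistic)
df_val <- unname(tt$parameter)
r_sq_t <- t_val^2 / (t_val^2 + df_val)

# r-squared from a regression of the DV on the grouping variable
r_sq_lm <- summary(lm(weight ~ animal, data = toy))$r.squared

c(from_t = r_sq_t, from_lm = r_sq_lm)
```

The match is exact only for the equal-variance version of the t-test; with the Welch version the degrees of freedom change and the formula gives a slightly different number.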

## Power
Finally, we need to talk about power.  We'll still be using our old friend `pwr`, just with a couple of small adjustments to the arguments.

First, let's calculate the power of our previous t-test, ratings of congress by race.

We know we had a cohen's d of 0.292337, which we'll need for our calculation.

In [70]:
# calculate power of two-sample t-test - ft_congress by race - leave power = NULL to obtain power
n = length(anes$ft_congress) # need to know n
pwr.t.test(d = 0.29, n = n, sig.level = 0.05, power = NULL, type = "two.sample", alt = "two.sided")


     Two-sample t test power calculation 

              n = 3523
              d = 0.29
      sig.level = 0.05
          power = 1
    alternative = two.sided

NOTE: n is number in *each* group


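Our power was practically 1, which means our probability of Type II error was practically 0. Note that `pwr.t.test` treats `n` as the number in *each* group, while we passed it the total sample size, so the calculation overstates our group sizes; with a sample this large the power is essentially 1 either way. (For unequal group sizes, `pwr.t2n.test` from the same package takes `n1` and `n2` separately.) Remember when I mentioned that with a large sample size most differences are statistically significant?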

What n did we need to see the effect we saw if we only needed power = 0.8?

In [71]:
pwr.t.test(d = 0.29, n = NULL, sig.level = 0.05, power = 0.8, type = "two.sample", alt = "two.sided")


     Two-sample t test power calculation 

              n = 187.6206
              d = 0.29
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group


We only needed a sample size of 188 in *each* group (376 total) to discern a statistical difference with an effect size of d = 0.29.
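As a sanity check that doesn't depend on the `pwr` package, base R's `power.t.test` can solve for the same sample size. This is a sketch under the assumption that we set `sd = 1`, so that `delta` is on the same standardized scale as Cohen's d.

```r
# solve for n per group: leave n out, supply the effect size and desired power
# sd = 1 means delta is in standard-deviation units, i.e. comparable to Cohen's d
res <- power.t.test(delta = 0.29, sd = 1, sig.level = 0.05, power = 0.8,
                    type = "two.sample", alternative = "two.sided")

# round up, since we can't recruit a fraction of a person
n_per_group <- ceiling(res$n)
c(per_group = n_per_group, total = 2 * n_per_group)
```

Like `pwr.t.test`, this function reports n per group, so the total sample needed is twice that number.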