07_regression_dominance.Rmd

---
title: "Analysis of Spanish and English Vowel Space Areas"
author: "Annie Helms (annie_helms@berkeley.edu)"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  html_document:
    toc: true
    toc_float: true
editor_options: 
  chunk_output_type: inline
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, dev = "cairo_pdf")
#options(scipen = 999) # avoid scientific notation
```

Library packages

```{r message = FALSE, warning = FALSE}
library(lmerTest) #mixed effects linear regression
library(dplyr)
library(kableExtra) # formats dataframes in html output
library(emmeans) # Tukey tests
library(ggplot2)
library(r2glmm)
library(ggsignif)
```

Set working directory and import data set of areas by speaker in the corpus.

```{r message = FALSE, echo = FALSE}
setwd("/Users/atarv/Box Sync/PhD/github/vsa_langdom/")
```

# Spanish only

```{r}
spa = read.csv("data/spa_areas_dominance.csv", 
                  stringsAsFactors = TRUE)
kbl(head(spa), table.attr = "style = \"color: black;\"") %>% # piped to allow for kable formatting
  kable_styling(bootstrap_options = c("striped", # alternates row color in table
                                      "hover", # rows are highlighted when hovered over
                                      "condensed", # row height is condensed
                                      "responsive")) %>% # horizontal scrolling across table
  scroll_box(width = "100%")
```

Since we are lacking full dominance information from p126, drop from dataframe.

```{r}
spa=subset(spa, Participant!="p126")
```

Now check df dimensions.

```{r}
dim(spa)
```

Scale avg duration, area, language dominance.

```{r}
spa$Area_scale = scale(spa$Area)
spa$Dominance_scale = scale(spa$Dominance)
spa$avg_dur_scale = scale(spa$avg_dur)
spa$is_stress = as.factor(spa$is_stress)
```

Create subsets of data for each cutoff

```{r}
cut_10 = spa %>%
  filter(Cutoff==0.1)
cut_15 = spa %>%
  filter(Cutoff==0.15)
cut_20 = spa %>%
  filter(Cutoff==0.2)
cut_25 = spa %>%
  filter(Cutoff==0.25)
cut_30 = spa %>%
  filter(Cutoff==0.3)
```


## 10 density cutoff

Set up mixed effects linear regression, with DV of area, IV of stress, and random intercept of speaker. Run for each cutoff level.

```{r}
# submit to mixed effects linear regression model
model_10 = lmer(Area_scale ~ Dom * is_stress + avg_dur_scale + (1|Participant), 
                data = cut_10, REML = FALSE)
summary(model_10)
```
## 15 density cutoff

According to the model output above, neither avg_dur, stress, or language dominance is a significant predictor of vowel space area.

```{r}
# submit to mixed effects linear regression model
model_15 = lmer(Area_scale ~ Dom * is_stress + avg_dur_scale + (1|Participant), 
                data = cut_15, REML = FALSE)
summary(model_15)
```
According to the model output above, neither avg_dur, stress, or language dominance is a significant predictor of vowel space area.

## 20 density cutoff

```{r}
# submit to mixed effects linear regression model
model_20 = lmer(Area_scale ~ Dom * is_stress + avg_dur_scale + (1|Participant), 
                data = cut_20, REML = FALSE)
summary(model_20)
```

According to the model output above, neither avg_dur, stress, or language dominance is a significant predictor of vowel space area.

## 25 density cutoff

```{r}
# submit to mixed effects linear regression model
model_25 = lmer(Area_scale ~ Dom * is_stress + avg_dur_scale + (1|Participant), 
                data = cut_25, REML = FALSE)
summary(model_25)
```

According to the model output above, neither avg_dur, stress, or language dominance is a significant predictor of vowel space area.

## 30 density cutoff

```{r}
# submit to mixed effects linear regression model
model_30 = lmer(Area_scale ~ Dom * is_stress + avg_dur_scale + (1|Participant), 
                data = cut_30, REML = FALSE)
summary(model_30)
```

According to the model output above, neither avg_dur, stress, or language dominance is a significant predictor of vowel space area.


# Now just CBAS speakers

```{r}
cbas = read.csv("data/cbas_areas_dominance.csv", 
                  stringsAsFactors = TRUE)
kbl(head(cbas), table.attr = "style = \"color: black;\"") %>% # piped to allow for kable formatting
  kable_styling(bootstrap_options = c("striped", # alternates row color in table
                                      "hover", # rows are highlighted when hovered over
                                      "condensed", # row height is condensed
                                      "responsive")) %>% # horizontal scrolling across table
  scroll_box(width = "100%")
```

Remove p126 from analysis

```{r}
cbas = subset(cbas, Participant != "p126")
```

Get df dimension

```{r}
dim(cbas)
```
Scale avg duration, area, language dominance.

```{r}
cbas$Area_scale = scale(cbas$Area)
cbas$Dominance_scale = scale(cbas$Dominance)
cbas$avg_dur_scale = scale(cbas$avg_dur)
cbas$is_stress = as.factor(cbas$is_stress)
```

Create subsets of data for each cutoff

```{r}
cut_10 = cbas %>%
  filter(Cutoff==0.1)
cut_15 = cbas %>%
  filter(Cutoff==0.15)
cut_20 = cbas %>%
  filter(Cutoff==0.2)
cut_25 = cbas %>%
  filter(Cutoff==0.25)
cut_30 = cbas %>%
  filter(Cutoff==0.3)
```

Plot dominance scores

```{r}
dom_scores = ggplot(cut_25,
                    aes(x = Dom, y = Dominance, fill = Dom)) +
  geom_boxplot() + 
  geom_point() +
  coord_flip() +
  labs(x = "Binned Groups",
       y = "Dominance (BLP)",
       title = "") +
  theme_classic()+
  theme(legend.position = "none")
dom_scores
```

```{r echo = FALSE, eval = FALSE}
cairo_pdf("new_sounds_figs/dom_plot.pdf", family = "Charis SIL", width = 6, height = 4)
dom_scores
dev.off()
```

## 10 density cutoff

Set up mixed effects linear regression, with DV of area, IV of stress, and random intercept of speaker. Run for each cutoff level.

```{r}
# submit to mixed effects linear regression model
model_10 = lmer(Area_scale ~ Dom * is_stress * Language + avg_dur_scale * Language + (1|Participant), 
                data = cut_10, REML = FALSE)
summary(model_10)
```

Effect sizes:

```{r}
r2beta(model_10, method ="nsj")
```

According to the model output above, neither avg_dur, stress, or language dominance is a significant predictor of vowel space area.

Post-hoc:

```{r}
emmeans(model_10, list(pairwise ~ is_stress * Language), adjust="tukey")

ems = emmeans(model_10, c("is_stress","Language"), infer = c(T, T))

# Cohens D effect size
emmeans::eff_size(ems, sigma = sigma(model_10), edf = df.residual(model_10))
```

Stress is only a predictor of English vowel space area (i.e. not Spanish) and stressed English is significantly larger than stressed Spanish. Increasing English dominance does not affect stress patterns in Spanish or English.

## 15 density cutoff

```{r}
# submit to mixed effects linear regression model
model_15 = lmer(Area_scale ~ Dom * is_stress * Language + avg_dur_scale * Language + (1|Participant), 
                data = cut_15, REML = FALSE)
summary(model_15)
```
According to the model output above, neither avg_dur (approaching significance) or language dominance is a significant predictor of vowel space area.

Effect sizes:

```{r}
r2beta(model_15, method ="nsj")
```

Post-hoc:

```{r}
emmeans(model_15, list(pairwise ~ is_stress * Language), adjust="tukey")

ems = emmeans(model_15, c("is_stress","Language"), infer = c(T, T))

# Cohens D effect size
emmeans::eff_size(ems, sigma = sigma(model_15), edf = df.residual(model_15))
```

Stressed English is bigger than stressed Spanish, language dominance has no effect.

## 20 density cutoff

```{r}
# submit to mixed effects linear regression model
model_20 = lmer(Area_scale ~ Dom * is_stress * Language + avg_dur_scale * Language + (1|Participant), 
                data = cut_20, REML = FALSE)
summary(model_20)
```

Effect sizes:

```{r}
r2beta(model_20, method ="nsj")
```

According to the model output above, neither avg_dur, stress, or language dominance is a significant predictor of vowel space area.

Post-hoc:

```{r}
emmeans(model_20, list(pairwise ~ is_stress * Language), adjust="tukey")

ems = emmeans(model_20, c("is_stress","Language"), infer = c(T, T))

# Cohens D effect size
emmeans::eff_size(ems, sigma = sigma(model_20), edf = df.residual(model_20))
```
Stressed English bigger than stressed Spanish, language dominance has no effect.

```{r}
emmeans(model_20, list(pairwise ~ avg_dur_scale * Language), adjust="tukey")

ems = emmeans(model_20, c("avg_dur_scale","Language"), infer = c(T, T))

# Cohens D effect size
emmeans::eff_size(ems, sigma = sigma(model_20), edf = df.residual(model_20))
```
Utterances in English are slightly longer.

## 25 density cutoff

```{r}
# submit to mixed effects linear regression model
model_25 = lmer(Area_scale ~ Dom * is_stress * Language + avg_dur_scale * Language + (1|Participant), 
                data = cut_25, REML = FALSE)
summary(model_25)
```
Effect sizes:

```{r}
r2beta(model_25, method ="nsj")
```


Post-hoc:

```{r}
emmeans(model_25, list(pairwise ~ is_stress * Language), adjust="tukey")

ems = emmeans(model_25, c("is_stress","Language"), infer = c(T, T))

# Cohens D effect size
emmeans::eff_size(ems, sigma = sigma(model_25), edf = df.residual(model_25))
```

English has stress increase, Spanish doesn't. Language dominance doesn't matter.

## 30 density cutoff

```{r}
# submit to mixed effects linear regression model
model_30 = lmer(Area_scale ~ Dom * is_stress * Language + avg_dur_scale * Language + (1|Participant), 
                data = cut_30, REML = FALSE)
summary(model_30)
```

```{r}
r2beta(model_30, method ="nsj")
```

According to the model output above, neither avg_dur, stress, or language dominance is a significant predictor of vowel space area.


Post-hoc:

```{r}
emmeans(model_30, list(pairwise ~ is_stress * Language), adjust="tukey")

ems = emmeans(model_30, c("is_stress","Language"), infer = c(T, T))

# Cohens D effect size
emmeans::eff_size(ems, sigma = sigma(model_30), edf = df.residual(model_30))
```

English has stress, Spanish doesn't, language dominance doesn't matter.

Plot size across cutoff, language dominance, and language

```{r}
cbas_eng = cut_25 %>%
  filter(Language=="English")
cbas_eng_stress = ggplot(data = cbas_eng, aes(x = is_stress, y = Area_scale, fill = is_stress)) + 
  geom_boxplot() +
  geom_signif(comparisons = list(c(0, 1)), # which pair of levels
              annotations = "**", # what significance level
              color = "black") + # otherwise defaults to your fill color
  theme_classic() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5)) +
  labs(x = "Stress",
       y = "Scaled Area",
       title = "English vowel space area across lexical stress") +
  scale_x_discrete(labels = c("Unstressed", "Stressed"))
  
cbas_eng_stress
```
```{r echo = FALSE, eval = FALSE}
cairo_pdf("new_sounds_figs/english_stress.pdf", family = "Charis SIL", width = 6, height = 4)
cbas_eng_stress
dev.off()
```

```{r}
cbas_spa = cut_25 %>%
  filter(Language=="Spanish")
cbas_spa_stress = ggplot(data = cbas_spa, aes(x = Dom, y = Area_scale, fill = is_stress)) + 
  geom_boxplot() +
  theme_classic() +
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5)) +
  labs(x = "Dominance Groups",
       y = "Scaled Area",
       title = "Spanish vowel space area across dominance and lexical stress") +
  scale_x_discrete(labels = c("Balanced", "L2")) +
  guides(fill = guide_legend(title = "Lexical Stress")) +
  scale_fill_discrete(labels = c("No stress", "Stress"))
cbas_spa_stress
```

```{r echo = FALSE, eval = FALSE}
cairo_pdf("new_sounds_figs/spanish_stress.pdf", family = "Charis SIL", width = 6, height = 4)
cbas_spa_stress
dev.off()
```

```{r}
cbas_lang = ggplot(data = cut_25, aes(x = Dom, y = Area_scale, fill = Language)) + 
  geom_boxplot() +
  theme_classic() +
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5)) +
  labs(x = "Dominance Groups",
       y = "Scaled Area",
       title = "Vowel space area across language and dominance") +
  scale_x_discrete(labels = c("Balanced", "L2"))
cbas_lang
```
```{r echo = FALSE, eval = FALSE}
cairo_pdf("new_sounds_figs/lang_area.pdf", family = "Charis SIL", width = 6, height = 4)
cbas_lang
dev.off()
```