# Exercise 9: Mixed effects

This homework assignment is designed to give you practice fitting and interpreting mixed effects models. 

We will be using the **LexicalData.csv** and **Items.csv** files from the *Homework/lexDat* folder in the class GitHub repository again. 

This data is a subset of the [English Lexicon Project database](https://elexicon.wustl.edu/). It provides the reaction times (in milliseconds) of many subjects as they are presented with letter strings and asked to decide, as quickly and as accurately as possible, whether each letter string is a word or not. The **Items.csv** file provides characteristics of the words used, namely frequency (how common is this word?) and length (how many letters?). Unlike in the previous homework, there isn't any missing data in the **LexicalData.csv** file. 

*Data courtesy of Balota, D.A., Yap, M.J., Cortese, M.J., Hutchison, K.A., Kessler, B., Loftis, B., Neely, J.H., Nelson, D.L., Simpson, G.B., & Treiman, R. (2007). The English Lexicon Project. Behavior Research Methods, 39, 445-459.*

---
## 1. Loading and formatting the data (1 point)

Load in data from the **LexicalData.csv** and **Items.csv** files. As in the previous homeworks, remove the commas from the reaction times and convert them from strings to numbers. Use `left_join` to add word characteristics `Length` and `Log_Freq_HAL` from **Items** to **LexicalData**. 

*Note: the `Freq_HAL` variable in **Items.csv** has a similar formatting issue, using string values with commas. We're not going to worry about fixing this since we're only using `Log_Freq_HAL`, which is the natural log transformation of `Freq_HAL`, in this homework.*

In [3]:
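# WRITE YOUR CODE HERE
library(dplyr)

# Read both files; the file name matches the assignment text (Items.csv)
items <- read.csv("Items.csv")
lexicaldata <- read.csv("LexicalData.csv")

# Strip the thousands separators from D_RT and convert it to numeric
lexicaldata <- lexicaldata %>% mutate(D_RT = as.numeric(gsub(",", "", D_RT)))

# Keep only the word characteristics needed, then join them on the word
col <- items %>% select(Word, Length, Log_Freq_HAL)
lexdataplus <- lexicaldata %>% left_join(col, by = c("D_Word" = "Word"))
head(lexdataplus)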

"package 'dplyr' was built under R version 4.2.2"

Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union




| | Sub_ID | Trial | Type | D_RT | D_Word | Outlier | D_Zscore | Length | Log_Freq_HAL |
|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` | `<int>` | `<dbl>` |
| 1 | 157 | 1 | 1 | 710 | browse | False | -0.437 | 6 | 8.856 |
| 2 | 67 | 1 | 1 | 1094 | refrigerant | False | 0.825 | 11 | 4.644 |
| 3 | 120 | 1 | 1 | 587 | gaining | False | -0.645 | 7 | 8.304 |
| 4 | 21 | 1 | 1 | 984 | cheerless | False | 0.025 | 9 | 2.639 |
| 5 | 236 | 1 | 1 | 577 | pattered | False | -0.763 | 8 | 1.386 |
| 6 | 236 | 2 | 1 | 715 | conjures | False | -0.364 | 8 | 5.268 |
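
The two formatting steps in this section (stripping commas before numeric conversion, then a left join on the word) can be seen in isolation on a toy example. This is only a sketch: the data frames `trials` and `words` and their values are made up for illustration and are not taken from the homework files.

```r
library(dplyr)

# Toy stand-ins for the two files (hypothetical values, not from the data set)
trials <- data.frame(
  D_Word = c("browse", "gaining", "browse"),
  D_RT   = c("710", "1,094", "587"),
  stringsAsFactors = FALSE
)
words <- data.frame(
  Word   = c("browse", "gaining"),
  Length = c(6L, 7L),
  stringsAsFactors = FALSE
)

# "1,094" is a string; drop the comma before converting to a number
trials <- trials %>% mutate(D_RT = as.numeric(gsub(",", "", D_RT)))

# left_join keeps every trial row and copies Length onto each matching word
joined <- trials %>% left_join(words, by = c("D_Word" = "Word"))
print(joined)
```

Because the join is a left join, a word that appears in several trials simply receives the same `Length` on every one of its rows.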


---
## 2. Model fitting (4 points)

First, fit a linear model with `Log_Freq_HAL` and `Length` as predictors, and `D_RT` as the output. Include an interaction term. Use `summary()` to look at the model output. 

In [4]:
# WRITE YOUR CODE HERE
lm <- lm(D_RT ~ Log_Freq_HAL + Length, data=lexdataplus)
summary(lm)


Call:
lm(formula = D_RT ~ Log_Freq_HAL + Length, data = lexdataplus)

Residuals:
     Min       1Q   Median       3Q      Max 
-1047.98  -208.56   -86.46    92.10  3133.65 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  765.7746     7.8141   98.00   <2e-16 ***
Log_Freq_HAL -29.2445     0.6617  -44.20   <2e-16 ***
Length        28.8180     0.6290   45.82   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 359.6 on 62607 degrees of freedom
Multiple R-squared:  0.09246,	Adjusted R-squared:  0.09243 
F-statistic:  3189 on 2 and 62607 DF,  p-value: < 2.2e-16
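
The prompt asks for an interaction term, while the formula in the fit above contains main effects only. As a minimal sketch on simulated data (the data frame `toy` and its generating coefficients are invented for illustration), `*` in an R formula expands to both main effects plus their interaction:

```r
set.seed(1)

# Simulated stand-in data; the generating coefficients are arbitrary
n <- 200
toy <- data.frame(
  Log_Freq_HAL = runif(n, 0, 12),
  Length       = sample(3:12, n, replace = TRUE)
)
toy$D_RT <- 700 - 20 * toy$Log_Freq_HAL + 25 * toy$Length +
  1.5 * toy$Log_Freq_HAL * toy$Length + rnorm(n, sd = 50)

# a * b is shorthand for a + b + a:b
fit_int <- lm(D_RT ~ Log_Freq_HAL * Length, data = toy)
print(names(coef(fit_int)))
```

Applied to the homework data, the same formula would add a fourth coefficient, so the estimates and t-values would differ from the ones reported above.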


Now, install `lme4` using `install.packages()` and then load the library. 

In [5]:
# WRITE YOUR CODE HERE
# install.packages("lme4")
library(lme4)


Now fit a mixed effects model that includes the same predictors as the linear model above, as well as random intercepts for `Sub_ID` (i.e., cases where subject ID shifts the RT mean). Use `summary()` to look at the model output. 

In [6]:
# WRITE YOUR CODE HERE
# Same fixed effects as the linear model, plus a random intercept per subject
lmer <- lmer(D_RT ~ Log_Freq_HAL + Length + (1 | Sub_ID), data = lexdataplus)
summary(lmer)


Linear mixed model fit by REML ['lmerMod']
Formula: D_RT ~ Log_Freq_HAL + Length + (1 | Sub_ID)
   Data: lexdataplus

REML criterion at convergence: 888465.9

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.4813 -0.5551 -0.1609  0.3132 10.7066 

Random effects:
 Groups   Name        Variance Std.Dev.
 Sub_ID   (Intercept) 46352    215.3   
 Residual             83286    288.6   
Number of obs: 62610, groups:  Sub_ID, 299

Fixed effects:
             Estimate Std. Error t value
(Intercept)   769.077     13.950   55.13
Log_Freq_HAL  -30.157      0.533  -56.58
Length         29.226      0.506   57.76

Correlation of Fixed Effects:
            (Intr) L_F_HA
Log_Frq_HAL -0.352       
Length      -0.379  0.365
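
One way to read the "Random effects" block above: the `Sub_ID` variance describes how much subjects' own intercepts vary around the fixed intercept, and the residual variance is the trial-to-trial variation left over. A common summary is the intraclass correlation, the share of that total variance attributable to subjects. The sketch below only redoes the arithmetic on the printed values; with a fitted model object, the same numbers could be pulled with `as.data.frame(VarCorr(...))`.

```r
# Variance components copied from the "Random effects" table above
var_subject  <- 46352  # Sub_ID (Intercept)
var_residual <- 83286  # Residual

# The Std.Dev. column of summary() is the square root of each variance
sd_subject  <- sqrt(var_subject)
sd_residual <- sqrt(var_residual)

# Intraclass correlation: share of variance due to differences between subjects
icc <- var_subject / (var_subject + var_residual)
print(round(c(sd_subject = sd_subject, sd_residual = sd_residual, icc = icc), 3))
```

The ratio comes out at roughly 0.36, so about a third of the variance not explained by the fixed effects sits between subjects rather than within them.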

---
## 3. Model assessment (4 points)

Compare the three t-values for the fixed effects and the mixed effects models. How do they differ, and why? 

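> The intercept's t-value is much larger in the linear model (98.00) than in the mixed effects model (55.13). The intercept estimate itself barely changes (765.8 vs. 769.1), but its standard error nearly doubles (7.81 vs. 13.95): once each subject has a random intercept, the fixed intercept is the mean of 299 subject-level intercepts, so its uncertainty reflects between-subject variability rather than the 62,610 individual trials.  
>  
> Meanwhile, the slopes' t-values are smaller in magnitude in the linear model (-44.20 and 45.82) than in the mixed effects model (-56.58 and 57.76). The likely reason is that the random intercepts absorb between-subject variance, which lowers the residual standard deviation (359.6 vs. 288.6) and therefore shrinks the slopes' standard errors (0.66 vs. 0.53 and 0.63 vs. 0.51).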
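
Each t-value in the two summaries is the estimate divided by its standard error, so a change in t can be traced to one or the other. The sketch below redoes that arithmetic on the printed values (nothing is refit) and compares how the standard errors changed between the two models.

```r
# Estimates and standard errors copied from the two summaries above
lm_fit <- data.frame(
  term     = c("(Intercept)", "Log_Freq_HAL", "Length"),
  estimate = c(765.7746, -29.2445, 28.8180),
  se       = c(7.8141, 0.6617, 0.6290)
)
lmer_fit <- data.frame(
  term     = lm_fit$term,
  estimate = c(769.077, -30.157, 29.226),
  se       = c(13.950, 0.533, 0.506)
)

# t value = estimate / standard error
lm_fit$t   <- lm_fit$estimate / lm_fit$se
lmer_fit$t <- lmer_fit$estimate / lmer_fit$se

# Ratio of standard errors (mixed / linear), next to the ratio of residual SDs
se_ratio <- lmer_fit$se / lm_fit$se
resid_sd_ratio <- 288.6 / 359.6
print(round(cbind(lm_t = lm_fit$t, lmer_t = lmer_fit$t, se_ratio = se_ratio), 2))
print(round(resid_sd_ratio, 2))
```

The estimates are nearly identical across the two models, so the differences in t come almost entirely from the standard errors: the intercept's grows, while the two slopes' shrink by about the same factor as the residual standard deviation.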

Use the Akaike Information Criterion (AIC) to compare these two models. Which one is better? 

In [7]:
# WRITE YOUR CODE HERE
AIC(lm, lmer)


| | df | AIC |
|---|---|---|
| | `<dbl>` | `<dbl>` |
| lm | 4 | 914591.2 |
| lmer | 5 | 888475.9 |


> The mixed effects model is better: its AIC is lower (888475.9 vs. 914591.2).
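
The AIC values can be reconstructed from the printed output, which also shows where the `df` column comes from. The sketch below is arithmetic on the reported numbers only. The parameter counts are the three fixed-effect coefficients plus one residual variance for `lm`, and the same plus the `Sub_ID` variance for `lmer`.

```r
# Values copied from the outputs above
reml_criterion <- 888465.9  # "REML criterion at convergence"
k_lmer <- 5                 # df reported by AIC() for the mixed model
aic_lm <- 914591.2          # AIC reported for the linear model

# AIC = -2 * log-likelihood + 2 * (number of parameters);
# for this fit the REML criterion plays the role of -2 * log-likelihood
aic_lmer <- reml_criterion + 2 * k_lmer
delta <- aic_lm - aic_lmer
print(c(aic_lmer = aic_lmer, delta = delta))
```

One caveat: the mixed model was fit by REML while `lm` uses ordinary least squares, so the two likelihoods are not strictly on the same footing. A commonly recommended alternative is to refit the mixed model with `REML = FALSE` before comparing. With a gap this large, the conclusion would probably not change.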

---
## 4. Reflection (1 point)

What other random effects could be controlled for in this data set? 

> You could also treat the words themselves (`D_Word`) as a random effect, rather than only modeling word length as a fixed effect. I am not sure whether this would count as a random effect, though, since words and their lengths are correlated.  
>  
> I have a question, hope you see this. How do I interpret the random effects part of the lmer summary? Are there estimates? What are the residuals supposed to be?
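
As a hedged sketch of what a random effect for words could look like, the example below fits crossed random intercepts for subjects and words on simulated data. Everything about `toy` (sizes, generating values) is invented for illustration, so the output says nothing about the homework data. It also shows two accessors that relate to the question above: `VarCorr()` returns the estimated variance components, and `ranef()` returns the predicted shift for each individual group.

```r
library(lme4)
set.seed(42)

# Simulated stand-in data: 20 subjects each respond to the same 30 words
n_subj <- 20
n_word <- 30
toy <- expand.grid(Sub_ID = factor(1:n_subj), D_Word = factor(1:n_word))
word_length <- sample(3:12, n_word, replace = TRUE)
toy$Length <- word_length[as.integer(toy$D_Word)]

# Arbitrary generating values: subject and word offsets plus trial noise
subj_shift <- rnorm(n_subj, sd = 100)
word_shift <- rnorm(n_word, sd = 50)
toy$D_RT <- 600 + 30 * toy$Length +
  subj_shift[as.integer(toy$Sub_ID)] +
  word_shift[as.integer(toy$D_Word)] +
  rnorm(nrow(toy), sd = 80)

# Crossed random intercepts: one per subject and one per word
fit <- lmer(D_RT ~ Length + (1 | Sub_ID) + (1 | D_Word), data = toy)
print(VarCorr(fit))              # variance components for each grouping factor
print(head(ranef(fit)$D_Word))   # predicted per-word shifts in the intercept
```

A word-level covariate such as `Length` can sit alongside a random intercept for the word: the fixed slope captures the trend across word lengths, and each word's random intercept captures how far that word departs from the trend.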

**DUE:** 5pm EST, March 15, 2023

**IMPORTANT** Did you collaborate with anyone on this assignment? If so, list their names here. 
> *Someone's Name*