# Homework 6: Mixed effects

This homework assignment is designed to give you practice fitting and interpreting mixed effects models. 

We will be using the **LexicalData.csv** and **Items.csv** files from the *Homework/lexDat* folder in the class GitHub repository again. 

This data is a subset of the [English Lexicon Project database](https://elexicon.wustl.edu/). It provides the reaction times (in milliseconds) of many subjects as they are presented with letter strings and asked to decide, as quickly and as accurately as possible, whether the letter string is a word or not. The **Items.csv** provides characteristics of the words used, namely frequency (how common is this word?) and length (how many letters?). Unlike in the previous homework, there isn't any missing data in the **LexicalData.csv** file. 

*Data courtesy of Balota, D.A., Yap, M.J., Cortese, M.J., Hutchison, K.A., Kessler, B., Loftis, B., Neely, J.H., Nelson, D.L., Simpson, G.B., & Treiman, R. (2007). The English Lexicon Project. Behavior Research Methods, 39, 445-459.*

---
## 1. Loading and formatting the data (1 point)

Load the data from the **LexicalData.csv** and **Items.csv** files. As in the previous homework assignments, remove the commas from the reaction times and convert them from strings to numbers. Use `left_join` to add the word characteristics `Length` and `Log_Freq_HAL` from **Items** to **LexicalData**. 

*Note: the `Freq_HAL` variable in **Items.csv** has a similar formatting issue, using string values with commas. We're not going to worry about fixing this since we're only using `Log_Freq_HAL`, which is the natural log transformation of `Freq_HAL`, in this homework.*

In [9]:
# install.packages("lme4") # Uncomment if not installed.
library(lme4)
library(tidyverse) # attaches dplyr (left_join, select) and ggplot2

LD <- read.csv("~/Desktop/DataSciencePsychNeuro/Homeworks/lexDat/LexicalData_withIncorrect.csv")
Items <- read.csv("~/Desktop/DataSciencePsychNeuro/Homeworks/lexDat/Items.csv")

# Strip the thousands separators from the reaction times, then convert the strings to numbers.
LD$D_RT <- as.numeric(gsub(",", "", LD$D_RT))
is.numeric(LD$D_RT)

# Add word length and log frequency from Items, matching D_Word to Word.
LD <- left_join(LD, Items %>% select(Word, Length, Log_Freq_HAL), by = c("D_Word" = "Word"))

head(LD)

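| | X | Sub_ID | Trial | Type | D_RT | D_Word | Outlier | D_Zscore | Correct | Length | Log_Freq_HAL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` | `<int>` | `<int>` | `<dbl>` |
| 1 | 1 | 157 | 1 | 1 | 710 | browse | False | -0.437 | 1 | 6 | 8.856 |
| 2 | 2 | 67 | 1 | 1 | 1094 | refrigerant | False | 0.825 | 1 | 11 | 4.644 |
| 3 | 3 | 120 | 1 | 1 | 587 | gaining | False | -0.645 | 1 | 7 | 8.304 |
| 4 | 4 | 21 | 1 | 1 | 984 | cheerless | False | 0.025 | 1 | 9 | 2.639 |
| 5 | 5 | 236 | 1 | 1 | 577 | pattered | False | -0.763 | 1 | 8 | 1.386 |
| 6 | 6 | 236 | 2 | 1 | 715 | conjures | False | -0.364 | 1 | 8 | 5.268 |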
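
Because `left_join` keeps every row of the left table, any `D_Word` without a match in **Items** ends up with `NA` in the joined columns. A minimal sketch of how that could be checked, using two small made-up data frames (`toy_LD`, `toy_Items`, and the nonword `"blorp"` are hypothetical stand-ins, not taken from the real files):

```r
library(dplyr)

# Hypothetical toy data standing in for LexicalData.csv and Items.csv.
toy_LD <- data.frame(
  Sub_ID = c(1, 1, 2),
  D_Word = c("browse", "gaining", "blorp"),
  D_RT   = c("710", "1,094", "587"),
  stringsAsFactors = FALSE
)
toy_Items <- data.frame(
  Word         = c("browse", "gaining"),
  Length       = c(6L, 7L),
  Log_Freq_HAL = c(8.856, 8.304),
  stringsAsFactors = FALSE
)

# Strip commas before converting; as.numeric("1,094") on its own would give NA.
toy_LD$D_RT <- as.numeric(gsub(",", "", toy_LD$D_RT))

joined <- left_join(toy_LD, toy_Items, by = c("D_Word" = "Word"))

# A left join onto unique keys keeps the row count of the left table.
nrow(joined) == nrow(toy_LD)

# Rows whose word had no match in the items table.
n_unmatched <- sum(is.na(joined$Log_Freq_HAL))
n_unmatched
```

Rows with `NA` in a predictor are dropped by `lm()` and `lmer()` under the default `na.action`, so a count like `n_unmatched` is worth a look before fitting.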


---
## 2. Model fitting (4 points)

First, fit a linear model with `Log_Freq_HAL` and `Length` as predictors, and `D_RT` as the output. Include an interaction term. Use `summary()` to look at the model output. 

In [10]:
mod1 <- lm(D_RT ~ Log_Freq_HAL + Length + Log_Freq_HAL * Length, data = LD)
summary(mod1)




Call:
lm(formula = D_RT ~ Log_Freq_HAL + Length + Log_Freq_HAL * Length, 
    data = LD)

Residuals:
    Min      1Q  Median      3Q     Max 
-1128.3  -217.6   -94.0    94.2  3317.2 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         650.3764    14.3247  45.403  < 2e-16 ***
Log_Freq_HAL        -10.0802     1.9643  -5.132 2.88e-07 ***
Length               45.5806     1.5992  28.503  < 2e-16 ***
Log_Freq_HAL:Length  -2.6346     0.2345 -11.236  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 384.2 on 70585 degrees of freedom
  (4280 observations deleted due to missingness)
Multiple R-squared:  0.08867,	Adjusted R-squared:  0.08863 
F-statistic:  2289 on 3 and 70585 DF,  p-value: < 2.2e-16
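
In R's formula syntax, `a * b` already expands to `a + b + a:b`, so writing the main effects out next to the `*` term is redundant but harmless. A short sketch that checks this by comparing the term labels R derives from each spelling (the names `explicit` and `compact` are just illustrative):

```r
# Term labels R derives from each way of writing the formula.
explicit <- attr(terms(D_RT ~ Log_Freq_HAL + Length + Log_Freq_HAL * Length), "term.labels")
compact  <- attr(terms(D_RT ~ Log_Freq_HAL * Length), "term.labels")

explicit
compact

# TRUE when both spellings describe the same set of model terms.
identical(explicit, compact)
```

Either spelling therefore yields the same three slope coefficients shown in the summary.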


Now, install `lme4` using `install.packages()` and then load the library. 

In [11]:
#install.packages("lme4")
library(lme4)


Now fit a mixed effects model that includes the same predictors as the linear model above, as well as random intercepts for `Sub_ID` (i.e., cases where subject ID shifts the RT mean). Use `summary()` to look at the model output. 

In [12]:
mem <- lmer(D_RT ~ Log_Freq_HAL + Length + Log_Freq_HAL * Length + (1 | Sub_ID), data = LD)
summary(mem)


Linear mixed model fit by REML ['lmerMod']
Formula: D_RT ~ Log_Freq_HAL + Length + Log_Freq_HAL * Length + (1 | Sub_ID)
   Data: LD

REML criterion at convergence: 1012299

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.2579 -0.5401 -0.1555  0.2986 11.0922 

Random effects:
 Groups   Name        Variance Std.Dev.
 Sub_ID   (Intercept) 51061    226.0   
 Residual             97009    311.5   
Number of obs: 70589, groups:  Sub_ID, 299

Fixed effects:
                    Estimate Std. Error t value
(Intercept)         652.6262    17.4987  37.296
Log_Freq_HAL        -11.0297     1.5960  -6.911
Length               45.7392     1.2990  35.211
Log_Freq_HAL:Length  -2.5742     0.1905 -13.514

Correlation of Fixed Effects:
            (Intr) Lg_F_HAL Length
Log_Frq_HAL -0.621                
Length      -0.635  0.913         
Lg_Fr_HAL:L  0.561 -0.943   -0.919
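
One way to read the "Random effects" table is through the intraclass correlation: the share of the variance left over after the fixed effects that is due to stable differences between subjects. A minimal sketch, with the two variances copied by hand from the output above (with the fitted object in hand, they could instead be pulled from `as.data.frame(VarCorr(mem))`):

```r
# Variance components copied from the "Random effects" table above.
var_subject  <- 51061  # Sub_ID (Intercept)
var_residual <- 97009  # Residual

# Share of the remaining variance attributable to between-subject differences.
icc <- var_subject / (var_subject + var_residual)
round(icc, 3)
```

By this rough measure, about a third of the variance not explained by frequency and length comes from some subjects simply being faster or slower overall.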

---
## 3. Model assessment (4 points)

Compare the three t-values for the fixed effects and the mixed effects models. How do they differ, and why? 

Comparing absolute values, all three slope t-values are larger in the mixed effects model. For frequency, |t| rose from 5.13 in the linear model to 6.91; for length, from 28.50 to 35.21; and for the interaction, from 11.24 to 13.51. The coefficient estimates barely changed, so the increase comes from smaller standard errors (for example, from 1.96 to 1.60 for frequency). The random intercepts for `Sub_ID` absorb between-subject differences in mean RT that the linear model leaves in its residual error: the residual standard deviation drops from 384.2 to 311.5. With less unexplained variance, the slopes are estimated more precisely, which makes the linear model the more conservative of the two for these effects.
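
The same comparison can be made on the standard errors directly. The sketch below recomputes the t-values as estimate divided by standard error and sets the shrinkage in standard errors next to the shrinkage in residual standard deviation; every number is copied by hand from the two `summary()` outputs above, and the data frame name `coefs` is only illustrative.

```r
# Estimates and standard errors copied from the two summary() outputs above.
coefs <- data.frame(
  term    = c("Log_Freq_HAL", "Length", "Log_Freq_HAL:Length"),
  est_lm  = c(-10.0802, 45.5806, -2.6346),
  se_lm   = c(1.9643, 1.5992, 0.2345),
  est_mem = c(-11.0297, 45.7392, -2.5742),
  se_mem  = c(1.5960, 1.2990, 0.1905)
)

# t = estimate / standard error, for each model.
coefs$t_lm  <- coefs$est_lm / coefs$se_lm
coefs$t_mem <- coefs$est_mem / coefs$se_mem

# How much each standard error shrank, next to the shrinkage in residual SD.
coefs$se_ratio <- coefs$se_mem / coefs$se_lm
resid_sd_ratio <- 311.5 / 384.2

coefs[, c("term", "t_lm", "t_mem", "se_ratio")]
resid_sd_ratio
```

All three standard-error ratios come out near 0.81, close to the ratio of the residual standard deviations, which fits the idea that the larger t-values reflect reduced residual noise rather than different slope estimates.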

Use the Akaike Information Criterion (AIC) to compare these two models. Which one is better? 

In [14]:

AIC(mod1, mem)



| | df | AIC |
|---|---|---|
| | `<dbl>` | `<dbl>` |
| mod1 | 5 | 1040519 |
| mem | 6 | 1012311 |


The mixed effects model is the better model because it has the lower AIC value (1,012,311 vs. 1,040,519). Even with the penalty for its one extra parameter, it accounts for more of the variance in reaction times than the linear model.  
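
One caveat: `lmer()` fits by REML by default (as the summary header shows), while the AIC for `lm()` is based on a maximum likelihood fit, so the two values are not strictly on the same scale. A common remedy is to refit the mixed model with `REML = FALSE` before comparing. A minimal sketch, using the `sleepstudy` data that ships with `lme4` rather than the homework data (the names `lin_fit`, `mem_ml`, and `aic_table` are made up for illustration):

```r
library(lme4)

# sleepstudy ships with lme4: reaction times across days for a set of subjects.
lin_fit <- lm(Reaction ~ Days, data = sleepstudy)

# REML = FALSE requests maximum likelihood, matching how lm()'s AIC is defined.
mem_ml <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy, REML = FALSE)

# One row per model, with its parameter count (df) and AIC.
aic_table <- AIC(lin_fit, mem_ml)
aic_table
```

With a gap as large as the one in this homework (about 28,000 AIC units), the conclusion is unlikely to change, but refitting with maximum likelihood makes the comparison cleaner.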

---
## 4. Reflection (1 point)

What other random effects could be controlled for in this data set? 

Trial number
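
In `lme4` syntax, a further random intercept is added as another `(1 | group)` term. The sketch below shows what that might look like for trial number, fitted to simulated data rather than the homework data; the design, the effect sizes, and the name `mem2` are all assumptions made up for illustration. The same pattern would apply to any other grouping variable in the data set, such as `D_Word`.

```r
library(lme4)

set.seed(1)

# Hypothetical simulated design: every subject completes every trial position once.
sim <- expand.grid(Sub_ID = factor(1:20), Trial = factor(1:15))
sim$Length <- sample(3:12, nrow(sim), replace = TRUE)

# Made-up effects: subject offsets, trial offsets, a length slope, and noise.
sub_shift   <- rnorm(20, sd = 80)
trial_shift <- rnorm(15, sd = 30)
sim$D_RT <- 600 + 40 * sim$Length +
  sub_shift[as.integer(sim$Sub_ID)] +
  trial_shift[as.integer(sim$Trial)] +
  rnorm(nrow(sim), sd = 100)

# Each (1 | group) term adds a random intercept for that grouping factor.
mem2 <- lmer(D_RT ~ Length + (1 | Sub_ID) + (1 | Trial), data = sim)

ngrps(mem2)   # number of levels per grouping factor
VarCorr(mem2) # estimated standard deviation for each random intercept
```

Because subjects and trial positions are crossed here rather than nested, the two random intercepts are simply listed side by side in the formula.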

**DUE:** 5pm EST, March 25, 2022

**IMPORTANT** Did you collaborate with anyone on this assignment? If so, list their names here. 
Ketura 