# Syntactical Bootstrapping as noun-learning mechanism 

## Background

How do babies learn words? This question remains one of the most mesmerizing myteries in developmental psychology. Young children must be able to establish the correspondences between words and their references. Previous studies have shown that syntactical structures can help young children to constrain verb meanings. In a seimnal paper, Naigles (1990) found that two-year-olds can use the transitive structure to narrow down the meanings of a novel verb. For example, if they hear "The bunny is gorping the duck" while seeing two agents engaged in an causal action and a non-causal action simultaneously. When prompted "Find gorping", the toddlers would look reliably longer at the scenes in which the two agents only performed the causal action than the scenes in which they only performed the non-causal action. This phenomenon is named as "Syntactical Bootstrapping". Since then, many researchers have built uponon this idea and extended the earliest age group showing the effect to as young as 15-month-olds (Jin & Fisher, 2014; Yuan, Fisher & Snedeker, 2012).

However, previous works on syntactical bootstrapping have focused exclusively on verb learning. This is partly due to the considertations that verbs are as more challening to learn (Gentner, 1982; Gentner & Boroditsky, 2001; Gleitman et al., 2005). Therefore, it is necessary to rely on additional information, such as the syntactical bootstrapping mechanism, to scaffold learning. Yet there is no reason to presuppose that verbs are the only category of words benefiting from syntactical cues. It is true that noun learning are often considered as requiring relatively less syntactcal support (Arunachalam & Waxman, 2010). Yet, syntactical categories such as nouns and verbs are only meaningful to those who have learned syntax (Christansen & Monaghan, 2006). Young children start to show sensitivity to syntactical structure way before they become competent grammatical speakers. Thus, it is very likely that noun learning can be aided by the syntactical structures as well.


### Current experiment:

Current experiment examines the possibility of syntactical bootstrapping as noun-learning mechanism. Specifically, we are interested in whether infants can map the subject of the sentence to the causal agent of the scene. We use a habituation paradigm to test 12-month-olds and 20 month-olds. Infants are first habituated to a direct-launching event with two different blocks while they repeatedly hear "Neem pushes that". The goal is to see if they can map novel noun "neem" to the causal agent in the scene. There are two testing blocks in the testing phase, a label-only block and a novel-sentence block. In the label only block, the infants hear the label "Neem" while seeing either the agent block passing through the screen or the patient block passing through the screen. In the novel-sentence block, the infants hear a novel sentence "That pushes neem" while seeing either the direct launching event from the habituation phase, or the direct launching event with the agent's and the patient's roles switched. Below is a schematic figure illustrating the experiment procedure: 


<img src="fig.png">






Our predictions are following: If the infants have learned the mapping between the subject of the sentence and the agent of the scene, they should look longer at the trials in which the mapping is inconsistent. In the label-only block, they should look longer at the trial in which "neem" is paried with the patient. In the novel-sentence block, they should look longer at the trial in which "that pushes neem" is paired with the original direct launching event. Alternatively, infants may only acquire a superficial association between the label "neem" and the causal agent in the scene. They may associate "neem" with the causal agent because of an order effect, i.e., "neem" is the first word in the sentence, and the agent block is the first to move in the scene. Then, they should only look longer at the incosistent pairing in the label-only block, not in the novel-sentence block. Another possibility is that syntactical information is necessary for young children to acquire word meaning. If this is the case, then infant should only look longer at the inconsistent pairing in the novel-sentence block, not in the label-only block. 


## Variables

- **ID**: participant ID, each represents a unique participant 
- **Sex**: categorical, Male or Female. as reported by the parents
- **AgeGroup**: categorical, Young or Old. Young are participants in 12-month-old group. Old are participants in 20-month-old group. As reported by the parents. 
- **ageByDays**: numerical, calculated by the difference between their Date of Brith and Date of Test. As reported by the parents. 
- **CDI.Score**: numerical. Calculated by the CDI short form (Fenson et al., 2000). 

- **trialType** : categorical, test / pretest / posttest. Each participants have all three kinds of trials. 
    - **Pretest and Post-test** trials play the roles of attention checker, to detect if the infants' decreased looking time in habituation phase are due to general fatigue. 
    - **Test** trials are trials in which the looking time is compared. 
    
- **blockType** : categorical, nontest/label/sentence. 
    - ** nontest ** are pretest trials and posttest trials. 
    - **label** are trials in which the infants only hear the novel word "neem" and see either the agent block or the patient block passing through the screen. 
    - **sentence** are trials in which the infants hear a novel sentence "that pushes neem" and see either the old direct lauching event or the new direct lauching event with agent and patient switched. 

- ** consistency ** : categorical, nontest / conssitent / inconsistent. 
    - **nontest** are pretest trials and posttest trials. 
    - **consistent**: are test trials in which the pairing of the label "neem" and the block on the screen is consistent with the pairing infants habituated to. 
    - **inconsistent**: are test trials in which the pairing of the label "neem" and the block on the screen is inconsistent with the pairing infants habituated to. 

    


## Hypotheses

I am are primarily interested in the difference between looking time in consistent trials versus nonconsistent trials and whether the difference can be predicted by the CDI score. There are two models of interests: Variables that may influence the difference in the looking time are:


- LookingTimeDuration ~ Age + Sex + Block + Consistency + Consistency * Block + (1 | Participant_ID)
	- LookingTimeDuration: continuous, duration of looking toward the screen.
	- Age: categorical variable, 12-month-olds group / 20-month-olds. 
        - Previous studies have shown that around 18-month of age, young children would experience "vocabulary spurt", in which their rate of words acuqisition dramatically increase (Goldfield & Reznick, 1990). One factor that can lead to this acceleration is the developing syntactical capacity (McMurray, 2007). Here, I predict that children before vocabulary spurt (12-month-olds, Young group) would show less difference between consistent and inconsistent trials than children after vocabulary spurt (20-month-olds, Old group).
	- Sex: categorical variable, F/M. 
        - female's early language development advantage is a well-documented phenomenon (Maccoby & Jacklin,1974; Fenson et al., 1991) Here the hypothesis is that compared to male, female will show larger looking time difference in consistent and inconsistent trials. 
	- Consistency: categorical variable, consistency/inconsistency
        - According to our hypothesis, infants shoud look longer at the trials in which the mapping between subject and agent is inconsistent. 
	- Block: categorical variable, label-only/sentence block. 
        - We don't have an explicit hypothesis for this factor.
	- Consistency * Block: these two may interact 
	- add participant_ID as random effect 


- LookingTimeDifference ~ Sex + CDI_Score + Sex*CDI_Score + (1 | Participant_ID)
	- LookingTimeDifference: looking time in inconsistency trials - looking time in consistency trials 
	- Sex: categorical variable, F/M
	- CDI_Score: continuous variable, scored by CDI form 
        - CDI scores is a reliable indicator of the children's language skill (Fenson et al., 2000). Here, I hypothesize that higher CDI score will predict larger difference in their looking time difference.
	- Sex*CDI_Score: these two may interact, girls might do better than boys
	


## Data Organization 

Due to the COVID-19 outbreak, the data collection has been interrupted. The current sample size is insufficient to be the basis for any inferences. Therefore, in the pre-processing section, I will use the existing results to generate some simulated data, and run a power-analysis to find out an ideal "sample" size. 

In [10]:
library(tidyverse)
library(lme4)

young_raw <- read.csv("/Users/caoanjie/Desktop/Spring2020/DS/_DSPN_S20/FINALPROJECT/CALA_12MO_RAW.csv")
old_raw <- read.csv("/Users/caoanjie/Desktop/Spring2020/DS/_DSPN_S20/FINALPROJECT/CALA_20MO_RAW.csv")
young_raw %>% 
  mutate(AgeGroup = "Young") %>% 
  mutate(ID = paste("young", as.character(ID))) %>% 
  mutate(ageByDays = as.Date(as.character(Test.Date), format="%m/%d/%Y") - as.Date(as.character(DOB), format="%m/%d/%Y")) %>% 
  select(ID, Sex, AgeGroup, ageByDays, CDI.SCORE,pretest, blue.neem, black.neem, black.blue.that, blue.black.that, Posttest) %>% 
  rename(label_inconsistent = blue.neem, label_consistent = black.neem, sentence_inconsistent = black.blue.that, sentence_consistent = blue.black.that, posttest = Posttest) %>% 
  gather(trialType, LookingTime, pretest:posttest) %>% 
  mutate(blockType = ifelse(grepl("label", trialType), "label", ifelse(grepl("sentence", trialType),  "sentence", "nontest")),
         consistency = ifelse(grepl("inconsistent", trialType), "inconsistent", ifelse(grepl("consistent", trialType), "consistent","nontest")),
         trialType = ifelse(trialType %in% c("pretest", "posttest"), trialType, "test")) %>% 
  na.omit -> young_raw

old_raw %>% 
  mutate(AgeGroup = "Old") %>% 
  mutate(ID = paste("old", as.character(ID))) %>% 
  mutate(ageByDays = as.Date(as.character(Test.Date), format="%m/%d/%Y") - as.Date(as.character(DOB), format="%m/%d/%Y")) %>% 
  select(ID, Sex, AgeGroup, ageByDays, CDI.SCORE,pretest, blue.neem, black.neem, black.blue.that, blue.black.that, Posttest) %>% 
  rename(label_inconsistent = blue.neem, label_consistent = black.neem, sentence_inconsistent = black.blue.that, sentence_consistent = blue.black.that,posttest = Posttest ) %>% 
  gather(trialType, LookingTime, pretest:posttest) %>% 
 mutate(blockType = ifelse(grepl("label", trialType), "label", ifelse(grepl("sentence", trialType),  "sentence", "nontest")),
         consistency = ifelse(grepl("inconsistent", trialType), "inconsistent", ifelse(grepl("consistent", trialType), "consistent","nontest")),
         trialType = ifelse(trialType %in% c("pretest", "posttest"), trialType, "test")) %>%  
  na.omit -> old_raw

data <- bind_rows(young_raw, old_raw) 
head(data)

Loading required package: Matrix


Attaching package: ‘Matrix’


The following object is masked from ‘package:tidyr’:

    expand




ID,Sex,AgeGroup,ageByDays,CDI.SCORE,trialType,LookingTime,blockType,consistency
young 1,F,Young,353 days,5,pretest,28.0,nontest,nontest
young 2,M,Young,362 days,1,pretest,29.5,nontest,nontest
young 3,F,Young,356 days,4,pretest,7.7,nontest,nontest
young 5,F,Young,378 days,2,pretest,29.7,nontest,nontest
young 6,M,Young,376 days,4,pretest,14.1,nontest,nontest
young 7,F,Young,370 days,1,pretest,7.6,nontest,nontest


Here, we first try to fit the model on the existing dataset: 

In [11]:
lt_current <- lmer(LookingTime ~ AgeGroup+Sex+blockType+consistency+blockType*consistency+(1|ID), data=data)
summary(lt_current)$coefficients

fixed-effect model matrix is rank deficient so dropping 4 columns / coefficients



Unnamed: 0,Estimate,Std. Error,t value
(Intercept),21.4030307,3.79677,5.6371688
AgeGroupYoung,-5.1847101,3.088566,-1.6786787
SexM,0.2995981,3.187523,0.0939909
blockTypenontest,2.1077354,2.606469,0.8086556
blockTypesentence,-0.8235294,2.992638,-0.2751851
consistencyinconsistent,-1.4823529,2.992638,-0.4953332
blockTypesentence:consistencyinconsistent,1.9941176,4.232229,0.4711743


Use the estimate for the betas to generate some data: 

In [15]:
# use the parameters generated by current data 
betas  = c(23.69, -7.29,-1.32,-0.82,-1.48,1.99)
# I don't really know where can I find these SD? The regression model 
sds = c(0.2,0.2,0.2,0.2,0.2,0.2)


#This is the function for generating simulation data 
simulate_data <- function(num_subjects, num_trials,trial_SD, model_Beta, model_SD){
  
  intercept = model_Beta[1]
  Age_Beta = model_Beta[2]
  Sex_Beta = model_Beta[3]
  blockType_Beta = model_Beta[4]
  consistencyType_Beta = model_Beta[5]
  interaction_Beta = model_Beta[6]
  
  intercept_SD = model_SD[1]
  Age_SD = model_SD[2]
  Sex_SD = model_SD[3]
  blockType_SD = model_SD[4]
  consistencyType_SD = model_SD[5]
  interaction_SD = model_SD[6]
  
  subjects_intercepts <- rnorm(num_subjects, intercept, intercept_SD)
  subjects_Age <- rnorm(num_subjects, Age_Beta, Age_SD)
  subjects_Sex <- rnorm(num_subjects, Sex_Beta, Sex_SD)
  subjects_blockType <- rnorm(num_subjects, blockType_Beta, blockType_SD)
  subjects_consistencyType <- rnorm(num_subjects, consistencyType_Beta, consistencyType_SD)
  subjects_interaction <- rnorm(num_subjects, interaction_Beta, interaction_SD)
  
  generated_data <- data.frame(gen_subject = 1:num_subjects, 
                               gen_intercept = subjects_intercepts,
                               gen_Age_Beta = subjects_Age,
                               gen_Sex_Beta = subjects_Sex,
                               gen_blockType_Beta = subjects_blockType,
                               gen_consistencyType_Beta = subjects_consistencyType,
                               gen_subjects_interaction = subjects_interaction)
  generated_data %>% 
    nest(-gen_subject, .key = parameters) %>% 
  mutate(young_female_label_inconsistent = map(parameters, ~rnorm(num_trials, intercept+Age_Beta+consistencyType_Beta+interaction_Beta,trial_SD)),
         young_female_label_consistent = map(parameters, ~rnorm(num_trials, intercept+Age_Beta+interaction_Beta,trial_SD)),
         young_female_sentence_inconsistent = map(parameters, ~rnorm(num_trials, intercept+Age_Beta+blockType_Beta+consistencyType_Beta+interaction_Beta,trial_SD)),
         young_female_sentence_consistent = map(parameters, ~rnorm(num_trials, intercept+Age_Beta+blockType_Beta+interaction_Beta,trial_SD)),
         young_male_label_inconsistent =  map(parameters, ~rnorm(num_trials, intercept+Age_Beta+Sex_Beta+consistencyType_Beta+interaction_Beta,trial_SD)),
         young_male_label_consistent = map(parameters, ~rnorm(num_trials, intercept+Age_Beta+Sex_Beta+interaction_Beta,trial_SD)),
         young_male_sentence_inconsistent = map(parameters, ~rnorm(num_trials,
                                                                 intercept+Age_Beta+Sex_Beta+blockType_Beta+consistencyType_Beta+interaction_Beta,trial_SD)),
         young_male_sentence_consistent = map(parameters, ~rnorm(num_trials, intercept+Age_Beta+Sex_Beta+blockType_Beta+interaction_Beta,trial_SD)),
         old_female_label_inconsistent = map(parameters, ~rnorm(num_trials, intercept+consistencyType_Beta+interaction_Beta,trial_SD)),
         old_female_label_consistent = map(parameters, ~rnorm(num_trials, intercept+interaction_Beta,trial_SD)),
         old_female_sentence_inconsistent = map(parameters, ~rnorm(num_trials, intercept+blockType_Beta+consistencyType_Beta+interaction_Beta,trial_SD)),
         old_female_sentence_consistent = map(parameters, ~rnorm(num_trials, intercept+blockType_Beta+interaction_Beta,trial_SD)),
         old_male_label_inconsistent = map(parameters, ~rnorm(num_trials, intercept+Sex_Beta+consistencyType_Beta+interaction_Beta,trial_SD)),
         old_male_label_consistent = map(parameters, ~rnorm(num_trials, intercept+Sex_Beta+interaction_Beta,trial_SD)),
         old_male_sentence_consistent = map(parameters, ~rnorm(num_trials,
                                                                 intercept+Sex_Beta+blockType_Beta+consistencyType_Beta+interaction_Beta,trial_SD)),
         old_male_sentence_inconsistent = map(parameters, ~rnorm(num_trials, intercept+Sex_Beta+blockType_Beta+interaction_Beta,trial_SD))) %>% 
  unnest(young_female_label_consistent,
                young_female_label_inconsistent,
                young_female_sentence_consistent,
                young_female_sentence_inconsistent,
                young_male_label_consistent,
                young_male_label_inconsistent,
                young_male_sentence_consistent,
                young_male_sentence_inconsistent,
                old_female_label_consistent,
                old_female_label_inconsistent,
                old_female_sentence_consistent,
                old_female_sentence_inconsistent,
                old_male_label_consistent,
                old_male_label_inconsistent,
                old_male_sentence_consistent,
                old_male_sentence_inconsistent,
                .drop = FALSE)  %>% 
  unnest(parameters) %>% 
  gather(condition, lookingtime, young_female_label_consistent:old_male_sentence_inconsistent)-> data_generated

  data_generated %>% 
    mutate(
      age = ifelse(grepl("young", condition), "young", "old"),
      sex = ifelse(grepl("female",condition), "F", "M"),
      block = ifelse(grepl("label",condition),"label","sentence"), 
      consistency = ifelse(grepl("inconsistent",condition), "inconsistent","consistent")
    ) -> data_generated
  
return(data_generated)  
}

#An example of generated data:
sim_data <- simulate_data(20, 1, 1.5, betas, sds)
head(sim_data)

gen_subject,gen_intercept,gen_Age_Beta,gen_Sex_Beta,gen_blockType_Beta,gen_consistencyType_Beta,gen_subjects_interaction,condition,lookingtime,age,sex,block,consistency
1,23.58994,-7.216059,-1.164587,-1.1817028,-1.593992,1.944381,young_female_label_consistent,18.02501,young,F,label,consistent
2,23.56108,-7.347003,-1.322623,-0.6685386,-1.296416,2.05923,young_female_label_consistent,15.79644,young,F,label,consistent
3,23.7173,-7.327963,-1.407355,-1.0642852,-1.175406,2.104336,young_female_label_consistent,19.44505,young,F,label,consistent
4,23.49911,-7.066767,-1.12174,-1.0647896,-1.333798,2.195083,young_female_label_consistent,21.08994,young,F,label,consistent
5,23.86986,-6.964248,-1.578506,-0.9202863,-1.214435,2.21572,young_female_label_consistent,20.55923,young,F,label,consistent
6,23.39917,-6.902682,-1.248584,-0.7637151,-1.530406,1.867548,young_female_label_consistent,17.46301,young,F,label,consistent


Run multiple simulations for power analysis

In [None]:
run_analysis <- function(data) {
    # fit null and alternative model
    m0 <- lmer(lookingtime ~ 1 + (1 | gen_subject), data=data, REML=FALSE, control=lmerControl(calc.derivs = FALSE))
    m1 <- lmer(lookingtime ~ age + sex + consistency + block + consistency * block + (1 | gen_subject), data=data, REML=FALSE, control=lmerControl(calc.derivs = FALSE))
    #You can replace with change in AIC or a p-value for a likelihood ratio statistic
    m_bic <- BIC(m0, m1)$BIC
    statistic <- diff(m_bic)
    return(statistic)
}

repeat_analysis <- function(num_simulations, num_subjects, num_trials,trial_sd, model_Beta, model_SD) {
    bic_diffs <- c() # empty vector to store delta BICs from each simulation
    # loop for repeating the simulation
    for (i in 1:num_simulations) {
        data <- simulate_data(num_subjects, num_trials,trial_sd, model_Beta, model_SD)
        bic_diff <- run_analysis(data)
        bic_diffs <- c(bic_diffs,bic_diff) # add the current p.value to the vector
    }
    
    # calculate how many of the simulations had significant results
    power <- mean(bic_diffs <= -20) #strong evidence, see Raftery & Kass 1995
    return(list(power = power, bic_diffs = bic_diffs))
}


dat <- expand.grid(num_subjects = c(16,24,32,40), num_trials = c(1,2,3,4))
dat$id <- 1:nrow(dat)

results <- dat  %>% 
    nest(-id, .key = 'parameters')  %>% 
    mutate(power = map(parameters, ~ repeat_analysis(500, .$num_subjects, .$num_trials, 0.2, betas,sds)$power))  %>% 
    unnest(parameters, power)

Generate plots

In [None]:
options(repr.plot.width=6, repr.plot.height=4)
results %>% 
ggplot(aes(x=num_subjects, y=power, color=as.factor(num_trials))) +
    geom_point() +
    geom_line() +
    geom_hline(yintercept = 0.8) +
    geom_hline(yintercept = 0.95) +
    scale_color_discrete('Number of trials per subject') +
    scale_x_continuous('Number of subjects') +
    scale_y_continuous('Statistical power (for Delta BIC <= -20)') +
    theme_classic()

## Analysis 

## Conclusion 