# Why (Bayesian) Model?

In this week's lecture, we defined computational modelling and the process of building a computational model. Hopefully, you all realised that there was no actual coding required to implement the skills and frameworks we covered in class. So in this week's tutorial, I want to get you acquainted with how I write code in R, which is more `tidyverse` than `base` (what HRMI relies on).

The code (and solutions) given in this notebook will use functions from the tidyverse, including `mutate()`, `filter()`, `pull()`, `gather()` and the pipe operator `%>%`. If you're not familiar with the tidyverse or would like to refresh your memory, please take a look at [R for psychological science](https://psyr.djnavarro.net/index.html) by Danielle Navarro. The following sections discuss core elements of the tidyverse:
* https://psyr.djnavarro.net/prelude-to-data.html
* https://psyr.djnavarro.net/describing-data.html
* https://psyr.djnavarro.net/manipulating-data.html
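
If you just want a quick reminder of what those verbs do, here is a minimal sketch on a made-up table. The data frame `toy` and its columns are invented for illustration and have nothing to do with the Wordbank data we use later.

```r
library(dplyr)
library(tidyr)

# Toy data (made up for illustration): three kids and two hypothetical scores
toy = tibble(
    kid    = c("a", "b", "c"),
    age    = c(18, 24, 30),
    score1 = c(1, 0, 1),
    score2 = c(0, 0, 1)
)

# filter() keeps rows, mutate() adds columns, and %>% chains the steps together
older = toy %>%
    filter(age > 20) %>%
    mutate(total = score1 + score2)

# pull() extracts a single column as a plain vector
totals = older %>% pull(total)

# gather() reshapes several wide columns into two long ones:
# one naming the original column (measure) and one holding its value
long = toy %>% gather(measure, value, score1:score2)

totals
long
```
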


In [None]:
options(jupyter.plot_scale=1) # I like big plots

library(tidyverse)


In this tutorial, we are going to illustrate a few of the concepts from lecture working through an example from word learning. In language acquisition, researchers have spent a long time trying to predict what makes some words learned earlier than other words [(see the Wordbank book)](https://wordbank-book.stanford.edu/). In these analyses, researchers ask parents to complete a survey of some 600ish words that their child may or may not understand or produce. Researchers then take a predictor variable, like word frequency or word concreteness, and use some kind of computational model to predict the age of acquisition. 

For many years, researchers used logistic regression as an off-the-shelf linking hypothesis. I don't expect you to know all the details of logistic regression. I'm going to try and give you enough understanding to follow along. I have designed the exercises in this notebook to be achievable without some of the advanced knowledge I'm going to reference. If you can, you should work with a group through this notebook. 

In logistic regression, we want to predict a binary variable $y$ (Yes they can produce a word / No they can't) as a function of some predictor variable $x$ (age).
$$ y = f(x) .$$
We call it logistic regression because we use the logistic function:
$$ f(x) = \frac{1}{1 + e^{-(\beta_1 x + \beta_0)}},$$
where $\beta_1$ is a scaling parameter and $\beta_0$ is a base rate, i.e., how likely kids are to know this word independent of age.
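
To get a feel for the two parameters, here is a small sketch of the logistic function above in base R. The parameter values are hypothetical and picked only so the curve rises somewhere in the toddler age range; they are not estimates from any dataset.

```r
# The logistic function from the equation above
logistic = function(x, beta1, beta0) {
    1 / (1 + exp(-(beta1 * x + beta0)))
}

# Hypothetical parameter values, chosen only for illustration
beta1 = 0.3   # larger values make the curve steeper
beta0 = -6    # shifts the curve left or right along the age axis

ages = 12:30
p = logistic(ages, beta1, beta0)

# Base R ships the same function as plogis(), applied to beta1 * x + beta0
all.equal(p, plogis(beta1 * ages + beta0))

# The curve crosses 0.5 where beta1 * x + beta0 = 0, i.e. at x = -beta0 / beta1
midpoint = -beta0 / beta1
logistic(midpoint, beta1, beta0)
```

Try changing `beta1` and `beta0` and re-running to see how the curve moves.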

Let's show you logistic regression in action. Let's use [Wordbank](https://wordbank-book.stanford.edu/), a large collection of child age of acquisition data, to predict when some Australian children learn to say "kangaroo."


In [None]:
# I've already downloaded the Australian data from Wordbank so let's load it in
wb = read.csv('wordbank_instrument_data.csv')

# And today I only care about kangaroo, which is item_30
d = wb %>% filter(item_id == 'item_30')

head(d)


There are a lot of columns in this data.frame, including when I downloaded the dataset (`downloaded`), word properties like `category`, and child identifiers (`child_id`). Today, we are going to focus on `age` and on `value`, which records whether or not the parent reported their child could say "kangaroo". `value` can take three possible values: blank, 'produces' or NA. NAs happen when a parent accidentally skips a question on the survey. As a first step, let's filter out the missing data. Then let's create a binary variable for whether a child produces "kangaroo".



In [None]:
d = d %>%
    filter(!is.na(value)) %>% # remove the missing data; sometimes parents skip a page \_o_/
    mutate(produces = ifelse(value == '', 0, 1)) # let's recode the data into 1=produces and 0=can't produce


Now I don't know about y'all but when I get data, the first thing I want to do is plot it. `ggplot` is an excellent R plotting tool that's built on a visualization framework called the grammar of graphics (hence the gg). We build plots layer by layer, giving us full control over the graph. I'll annotate the code the first time I do something so you can follow along.


In [None]:
d %>%
    ggplot(aes(x=age, y=produces)) + # I'm going to make a plot with age on the x-axis and produces on the y-axis
    stat_summary(fun=mean, geom='line') + # Let's take the mean of produces at each age and connect them with a line
    stat_summary(fun=mean, geom='point') + # Now let's put a point on the means at each age
    stat_summary(fun.data=mean_cl_boot, geom='linerange') + # Let's add some bootstrapped confidence intervals
    theme_bw(base_size=20) + # I like clean themes \_o_/ and let's make the font size big (20)
    xlab('Age (months)') + # Add an x-axis label
    ylab('Proportion kids producing') + # Now a y-axis label
    ggtitle('kangaroo') + # Let's give it a title
    coord_cartesian(ylim=c(0, 1)) + # produces is a binary variable, so let's set the y-axis to span [0,1]
    theme(plot.title = element_text(hjust = 0.5)) # Let's center the title


Now that's pretty :)

**Back to logistic regression.** So now we can use logistic regression to predict this acquisition curve as a function of age. We have a few steps to this:

1. Aggregate the data by age.
2. Fit a logistic regression model using `glm`
3. Grab the model predictions
4. Plot

Let's do them all in the next chunk.


In [None]:
# Step 1
d.agg = d %>%
    group_by(age, uni_lemma) %>%  # For each age
    summarise(successes = sum(produces),  # Let's get the number of kids who can successfully say kangaroo
           failures = n() - successes) %>% # and the number of kids who fail to say kangaroo
    ungroup() # and we ungroup to keep good habits


# Step 2
# Logistic regression is 1 line in R
    # The first argument says predict (success, failure) using age
    # The family argument says use the logistic function as a linking hypothesis
g.Logit  <- glm(cbind(successes, failures) ~ age, family=binomial(link="logit"),  data=d.agg)


# Step 3
# Let's see the model response predictions on the aggregated dataset
d.agg = d.agg %>%
    mutate(Logit=predict(g.Logit, newdata=d.agg, type="response"))

# Step 4
# Make a pretty plot
d %>%
    ggplot(aes(x=age, y=produces)) +
    # Instead of a line for the data, let's plot the predictions as a line
        # So we want to use a different data for this layer, 
        # We'll specify the axes
        # And for good measure, let's color this line blue 
    geom_line(data=d.agg, aes(x=age, y=Logit), color='blue') + 
    # Now we go back to using the original dataset d
    stat_summary(fun=mean, geom='point') +
    stat_summary(fun.data=mean_cl_boot, geom='linerange') +
    theme_bw(base_size=20) +
    xlab('Age (months)') +
    ylab('Proportion kids producing') +
    ggtitle('kangaroo') +
    coord_cartesian(ylim=c(0, 1)) +
    theme(plot.title = element_text(hjust = 0.5))


You can't argue with how pretty that plot is :)

But is a logistic function a good linking hypothesis? Put simply, this linking hypothesis says:
<center>
    <br>
    Word learning is a function of a child's age.
</center>

### Exercise 1: What's in a linking hypothesis?

Let's try and critique this linking hypothesis using what we've learned in class. I'll outline some questions that you might consider. You certainly don't need to answer all of them. I want you to form your own opinion on whether this is a *reasonable* linking hypothesis and be prepared to argue for your opinion. Use the next cell to jot down your thoughts, make a pro/con list or whatever helps you organize your thoughts.

- Framework: What are the first principles behind this hypothesis?
- Theory: What kinds of theories might this linking hypothesis (dis-)agree with?
- Specification: Are there any concerns about this specification?
- Implementation: Did I do anything unreasonable in the implementation?
- Hypothesis: Does the linking hypothesis allow us to make useful hypotheses about word learning?
- Data: Are the data coherent with the linking hypothesis?



Your thoughts here!



# Don't read further until you've completed Exercise 1

Well, I have some opinions here and I've even [spilt ink over it](https://dx.doi.org/10.1162/opmi_a_00006).

As I see it, it is a truth universally acknowledged that children need to observe linguistic input before they can learn the meaning of a word. But how much data do children use to learn words? Some studies have said that children can learn from a single learning instance (Carey & Bartlett, 1978), whereas other studies suggest children need tens, hundreds or even thousands of instances before they can learn a word (Smith & Yu, 2008).

But also shouldn't the quality of a learning instance count? If I just sit there and say "ball" over and over and over again, is that three instances or just one? Hanging out with my niece, it certainly feels like it doesn't count at all! Does it count if she overheard me say "cornbread" while she's playing with a toy car? So maybe we don't care about the raw frequency of exposure at all. Instead, we want to know the rate of "effective" learning instances.

In my younger and more impressionable years, I came across an analysis by Hidaka (2013) that uses a two-parameter linking hypothesis to account for these critiques. Using a gamma linking hypothesis, we model word learning as how long it takes for a learner to observe $k$ effective learning instances when effective learning instances come at a constant rate $\lambda$ per month.
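
If the waiting-time story feels abstract, here is a minimal simulation sketch. It assumes (my reading, not something spelled out above) that "a constant rate" means effective instances arrive as a Poisson process, so the gaps between them are exponential with rate $\lambda$. The values of `k` and `lambda` are hypothetical and chosen only for illustration.

```r
set.seed(1)

# Hypothetical values, chosen only for illustration
k = 5          # number of effective learning instances needed
lambda = 0.25  # effective learning instances per month

# Simulate 10,000 learners: gaps between effective instances are exponential
# with rate lambda, and a word is learned once k of them have accumulated
n_learners = 10000
gaps = matrix(rexp(n_learners * k, rate = lambda), nrow = n_learners)
age_learned = rowSums(gaps)

# Proportion of simulated learners who know the word by each age
ages = 12:36
simulated = sapply(ages, function(a) mean(age_learned <= a))

# The gamma CDF gives the same curve without simulating anything
analytic = pgamma(ages, shape = k, rate = lambda)

max(abs(simulated - analytic)) # small, and it shrinks as n_learners grows
```

So `pgamma(age, k, lambda)` can be read as the proportion of kids we expect to know the word by a given age, which is exactly the kind of curve we plotted for kangaroo.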

Let's fit one of these gamma models on our kangaroo data.


In [None]:
# Here's the gamma model, let's not worry about it
gammaModel = function(params){
    k = params[1]
    lambda = params[2]
    -1*sum(dbinom(d.agg$successes, d.agg$successes + d.agg$failures, prob=pgamma(d.agg$age, k, lambda), log=TRUE))
}

# Here's the "regression" function (again let's not worry about it)
o = optim(c(10, 1/2), gammaModel, method = "L-BFGS-B", lower = c(0, 0))

# Let's grab our predictions
d.agg = d.agg %>%
    mutate(Gamma=pgamma(age, o$par[1], o$par[2]))

# Finally the plot thickens!
d %>%
    ggplot(aes(x=age, y=produces)) +
    # Now that we have two models, we have to do a bit of data wrangling
    # Right now, we have one column for each of our model predictions
    # We want one column with all of the model predictions: prediction
    # And another column telling us which model each prediction comes from
    # This can be done using `gather` in the tidyverse
    geom_line(data=d.agg %>% gather(model, prediction, Logit:Gamma), 
              aes(x=age, y=prediction, color=model, linetype=model), size=1) +
    stat_summary(fun=mean, geom='point') +
    stat_summary(fun.data=mean_cl_boot, geom='linerange') +
    theme_bw(base_size=20) +
    xlab('Age (months)') +
    ylab('Proportion kids producing') +
    ggtitle('kangaroo') +
    coord_cartesian(ylim=c(0, 1)) +
    theme(plot.title = element_text(hjust = 0.5))


### B-E-A-UTIFUL

But this isn't the only horse in the race! Others have suggested linking hypotheses for word learning. For example, a famous linking hypothesis was proposed by Bob McMurray (2007) to account for the *vOcAbUlArY eXpLoSiOn*!

On average, children start to acquire their first words between 12 and 18 months. While the first few words take a long time to learn, the next 50 words are acquired rapidly. This observation is lovingly referred to as the vocabulary explosion. For a long time, the vocabulary explosion puzzled researchers, who proposed complicated models of how children's ability to process language must rapidly change.

In 2007, Bob McMurray proposed a simple model to account for the explosion. He started with the question: <b>what would happen if a child accumulated information about each word at the same rate, in parallel, and some words were more difficult than others?</b> In panel A of the figure below, we see that he places a normal (Gaussian) distribution over how difficult it is to learn words. Few words are really easy to learn (requiring few timesteps) and few words are really difficult to learn (requiring many timesteps). In panel B, we see the results of this simulation for acquiring 10,000 words. In panel C, Bob repeats the simulation; however, this time he defines difficulty in terms of a word's frequency. More frequent words are easier to acquire and there are only a few frequent words in language. Bob concludes that <b>"Acceleration is guaranteed in any system in which (i) words are acquired in parallel... and (ii) the difficulty of learning words is distributed such that there are few words that can be acquired quickly and a greater number that will take longer."</b>

![McMurray](images/mcmurray.png)
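
To make the idea concrete, here is a rough sketch of a simulation in the spirit of the one described above. It is not Bob's code: the mean and standard deviation of the difficulty distribution and the one-unit-per-timestep learning rate are hypothetical choices I made for illustration.

```r
set.seed(1)

# Hypothetical settings, chosen only for illustration
n_words = 10000
# Each word's difficulty = number of timesteps of exposure needed to learn it
difficulty = pmax(1, rnorm(n_words, mean = 4000, sd = 1400))

# Every word gains one unit of information per timestep, in parallel,
# so a word is known once the current time reaches its difficulty
timesteps = seq(0, 8000, by = 100)
words_known = sapply(timesteps, function(now) sum(difficulty <= now))

# Words gained in each block of 100 timesteps: small at first, then growing
gains = diff(words_known)
head(gains, 20)
```

Plotting `words_known` against `timesteps` should show the slow start followed by a rapid climb, even though nothing about the learner changes over time.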


### Exercise Two: Core vs Auxiliary Assumptions

In this description, there are several components of the model specification. Which of the assumptions are core and which are auxiliary? Try it yourself first and then compare with a group.

1. Difficulty is Gaussian distributed.
2. Difficulty is distributed such that there are a few easy words and more hard words.
3. Information about words is acquired at the same rate.
4. Information about all words is acquired simultaneously.
5. Difficulty is distributed according to frequency.
6. Information about 10,000 words is acquired simultaneously.



You can jot down your answers for safe keeping here.


# Don't read further until you've completed Exercise 2

Bob's model is equivalent to a probit regression (which you definitely don't need to know). It's really elegant (and it's only one line of R code) so let's add it to our plot!
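
If you're curious where the equivalence comes from, here is a sketch under simplifying assumptions (Gaussian difficulty and one unit of information per timestep; the symbols $D$, $\mu$ and $\sigma$ are my notation, not Bob's). Let $D$ be the time needed to learn a word, with mean $\mu$ and standard deviation $\sigma$. The probability that the word is known by time $t$ is then

$$ P(D \le t) = \Phi\left(\frac{t - \mu}{\sigma}\right) = \Phi(\beta_1 t + \beta_0), \qquad \beta_1 = \frac{1}{\sigma}, \qquad \beta_0 = -\frac{\mu}{\sigma}, $$

where $\Phi$ is the standard normal cumulative distribution function. This has the same shape as the logistic regression from earlier, with $\Phi$ standing in for the logistic function, and that swap is what `link="probit"` asks `glm` to do.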


In [None]:
# Fit the model
g.Probit <- glm(cbind(successes, failures) ~ age, family=binomial(link="probit"), data=d.agg)

# Make some predictions
d.agg = d.agg %>%
    mutate(Probit=predict(g.Probit, newdata=d.agg, type="response"))

# PLOT ALL THE THINGS!!!
d %>%
    ggplot(aes(x=age, y=produces)) +
    geom_line(data=d.agg %>% gather(model, prediction, Logit:Probit), 
              aes(x=age, y=prediction, color=model, linetype=model), size=1) +
    stat_summary(fun=mean, geom='point') +
    stat_summary(fun.data=mean_cl_boot, geom='linerange') +
    theme_bw(base_size=20) +
    xlab('Age (months)') +
    ylab('Proportion kids producing') +
    ggtitle('kangaroo') +
    coord_cartesian(ylim=c(0, 1)) +
    theme(plot.title = element_text(hjust = 0.5))


Ok so this is pretty, but like boring. All the models predict the same thing :/

From a high-level evaluation standpoint, we can make arguments for why one might be a better explanation/description, more coherent, valid or tractable than the others. But at least two of these linking hypotheses seem *reasonable*$^\dagger$. Actually, this is a nice demo for why we should be using multiple methods of evaluation!

### Exercise Three: Weak or Strong Generalization

In the next two chunks, I've provided code for doing held-out prediction. All you have to do is provide the filter condition to define the training data. I've flagged it with SCARY ALL CAPS COMMENTS. Use the first cell to illustrate weak generalization (where the held-out data is within the training range) and the second cell to illustrate strong generalization (where the held-out data is out of the training range). Then summarise your findings.

$^\dagger$ At least to me and I am the veritable and just arbitrator of reason (at least for this tutorial).
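
Eyeballing the plots is enough for this exercise, but if you'd like a number to put in your summary, one option is the held-out log-likelihood. The sketch below is a suggestion rather than part of the exercise: `heldout_loglik` is a hypothetical helper, and the counts and predictions are made up so you can check it by hand.

```r
# Held-out log-likelihood: how probable are the observed counts under a model's predictions?
# `prediction` is a vector of predicted probabilities, one per row of the data
heldout_loglik = function(successes, failures, prediction) {
    sum(dbinom(successes, successes + failures, prob = prediction, log = TRUE))
}

# Toy numbers, made up for illustration: two ages, ten kids each
successes = c(2, 7)
failures = c(8, 3)

# Two hypothetical sets of predictions for those ages
close_model = c(0.2, 0.7)
far_model = c(0.5, 0.5)

heldout_loglik(successes, failures, close_model)
heldout_loglik(successes, failures, far_model) # lower (more negative) means a worse fit
```

In the cells below, you could apply the same function to the rows of `pred` that were not used for training, once per model column.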

In [None]:
# Weak Generalization

train = d %>%
    # ADD YOUR FILTER CONDITION HERE 
    # filter() %>%
    group_by(age, uni_lemma) %>%
    summarise(successes = sum(produces),
           failures = n() - successes) %>%
    ungroup()

test = d %>%
    group_by(age, uni_lemma) %>%
    summarise(successes = sum(produces),
           failures = n() - successes) %>%
    ungroup()


g.Logit  <- glm(cbind(successes, failures) ~ age, family=binomial(link="logit"),  data=train)
g.Probit <- glm(cbind(successes, failures) ~ age, family=binomial(link="probit"), data=train)

gammaModel = function(params){
    k = params[1]
    lambda = params[2]
    -1*sum(dbinom(train$successes, train$successes + train$failures, prob=pgamma(train$age, k, lambda), log=TRUE))
}

o = optim(c(21, 0.12), gammaModel, method = "L-BFGS-B", lower = c(0, 0), upper = c(Inf, Inf))

pred = test %>%
    mutate(Logit=predict(g.Logit, newdata=test, type="response"),
           Probit=predict(g.Probit, newdata=test, type="response"),
           Gamma=pgamma(age, o$par[1], o$par[2]))

d %>%
    # ADD YOUR FILTER CONDITION IN THE ifelse BELOW
    mutate(splt = ifelse(#FILTER CONDITION HERE#, 'Train', 'Test')) %>% 
    ggplot(aes(x=age, y=produces)) +
    geom_line(data=pred %>% gather(model, prediction, Logit:Gamma), 
              aes(x=age, y=prediction, color=model, linetype=model), size=1) +
    stat_summary(fun=mean, geom='point', aes(shape=splt), size=2) +
    stat_summary(fun.data=mean_cl_boot, geom='linerange') +
    theme_bw(base_size=20) +
    xlab('Age (months)') +
    ylab('Proportion kids producing') +
    ggtitle('kangaroo') +
    coord_cartesian(ylim=c(0, 1)) +
    theme(plot.title = element_text(hjust = 0.5))




In [None]:
# Strong Generalization

train = d %>%
    # ADD YOUR FILTER CONDITION HERE 
    # filter() %>%
    group_by(age, uni_lemma) %>%
    summarise(successes = sum(produces),
           failures = n() - successes) %>%
    ungroup()

test = d %>%
    group_by(age, uni_lemma) %>%
    summarise(successes = sum(produces),
           failures = n() - successes) %>%
    ungroup()


g.Logit  <- glm(cbind(successes, failures) ~ age, family=binomial(link="logit"),  data=train)
g.Probit <- glm(cbind(successes, failures) ~ age, family=binomial(link="probit"), data=train)

gammaModel = function(params){
    k = params[1]
    lambda = params[2]
    -1*sum(dbinom(train$successes, train$successes + train$failures, prob=pgamma(train$age, k, lambda), log=TRUE))
}

o = optim(c(21, 0.12), gammaModel, method = "L-BFGS-B", lower = c(0, 0), upper = c(Inf, Inf))

pred = test %>%
    mutate(Logit=predict(g.Logit, newdata=test, type="response"),
           Probit=predict(g.Probit, newdata=test, type="response"),
           Gamma=pgamma(age, o$par[1], o$par[2]))

d %>%
    # ADD YOUR FILTER CONDITION IN THE ifelse BELOW
    mutate(splt = ifelse(#FILTER CONDITION HERE#, 'Train', 'Test')) %>% 
    ggplot(aes(x=age, y=produces)) +
    geom_line(data=pred %>% gather(model, prediction, Logit:Gamma), 
              aes(x=age, y=prediction, color=model, linetype=model), size=1) +
    stat_summary(fun=mean, geom='point', aes(shape=splt), size=2) +
    stat_summary(fun.data=mean_cl_boot, geom='linerange') +
    theme_bw(base_size=20) +
    xlab('Age (months)') +
    ylab('Proportion kids producing') +
    ggtitle('kangaroo') +
    coord_cartesian(ylim=c(0, 1)) +
    theme(plot.title = element_text(hjust = 0.5))



Summarise your results here



Ok Okay OK I know you're all tired and done with this. Hopefully, you have a better idea of how to use some of the tools that we talked about in class. I promise we'll get back to real coding exercises next time. 

One parting exercise, just to make sure you understand Marr's levels.

### Exercise Four: Marr's Levels

In the following images, it is your job to assign each sub-image to Marr's levels. Try it on your own and discuss with a group. Put your answers and any comments into the following cell.

Image one:
![knitting](images/marrQ.png)

Image two:
![pdp](images/marrQ2.png)



Record your answers here


## References

Carey, S., & Bartlett, E. (1978). Acquiring a single new word. *Papers and Reports on Child Language Development, 15,* 17–29.

Frank, M. C., Braginsky, M., Yurovsky, D., & Marchman, V. A. (2021). *Variability and Consistency in Early Language Learning: The Wordbank Project.* Cambridge, MA: MIT Press.

Hidaka, S. (2013). A computational model associating learning process, word attributes, and age of acquisition. *PLOS ONE, 8*(11), e76242.

McMurray, B. (2007). Defusing the childhood vocabulary explosion. *Science, 317*(5838), 631.

Mollica, F., & Piantadosi, S. T. (2017). How data drive early word learning: A cross-linguistic waiting time analysis. *Open Mind, 1*(2), 67–77.

Smith, L., & Yu, C. (2008). Infants rapidly learn word-referent mappings via cross-situational statistics. *Cognition, 106*(3), 1558–1568.