# CBS Week 10 Assessment: Information Theory
## Semester 2 2024


This notebook is due on October 7th. Please make sure that your notebook validates before you submit it. If your notebook doesn't validate, the automated grader may run into issues.


In [None]:
suppressPackageStartupMessages({
    library(tidyverse)
    library(testthat)
})

options(repr.plot.width=16, repr.plot.height=8)


Friendly reminder: because we are using information theory, we adopt the convention that $\log_2(0)=0$ rather than $\log_2(0)=-\infty$, so that zero-probability outcomes contribute nothing to our sums. So we'll make a special `safe_log` function to use instead of the base `log` function.

In [None]:
# Return 0 for non-positive inputs (the log2(0) = 0 convention); otherwise log2(x)
safe_log_single = function(x){ifelse(x<=0, 0, log2(x))}
safe_log = Vectorize(safe_log_single)
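
To see why the convention matters, here is a minimal sketch. The helper `demo_safe_log` is a hypothetical stand-in defined only for this illustration, so the example runs on its own.

```r
# Hypothetical helper for illustration only; it applies the log2(0) = 0 convention.
demo_safe_log <- function(x) ifelse(x <= 0, 0, log2(x))

p <- c(0.5, 0.5, 0)   # a distribution with one zero-probability outcome

p * log2(p)           # last term is 0 * -Inf, which R evaluates to NaN
p * demo_safe_log(p)  # last term is 0 * 0 = 0, so any sum over the terms stays finite
```

With the base `log2`, a single zero-probability outcome is enough to turn a whole sum into `NaN`.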

<div class="alert alert-info" role="alert">

<h3>Exercise 1</h3>

Write a function for entropy. The function should take as input a probability vector `p` and output a numeric value for the entropy.

(2 Points)
</div>

In [None]:

entropy = function(p){
    NA # YOUR CODE HERE 
}


In [None]:
## Let's test this!

expect_equal(entropy(c(1, 1)/2), 1, tol=0.00001)
expect_equal(entropy(c(1, 1, 1, 1)/4), 2, tol=0.00001)
expect_equal(entropy(c(1, 1/4, 1/4, 1/4, 1/4)/2), 2, tol=0.00001)
expect_equal(entropy(c(1, 1, 1, 1, 1, 1, 1, 1)/8), 3, tol=0.00001)


## Let's analyse a corpus

Majoram really wants to test the claim that reading time correlates more strongly with surprisal than with frequency. So they generated a corpus of nonsense syllables and nonsense words. Then they had ten participants read the training corpus and a special test set, for which they measured reading time in milliseconds. From the corpus, they generated `log_freq` and `surprisal` measures as follows.

In [None]:
words = c('ba ka ti', 'ta bu ti', 'ti ba ka', 'to bu ti', 'ba re do', 're bu ka', 'ti go bu', 'ka re ba')

syl = c('ba', 'ka', 'ta', 'bu', 'to', 'bu', 'go')

set.seed(8675309)
train = paste(sample(c(words, syl, syl, syl, words, syl, syl, syl), 1000, replace=TRUE), collapse=' ')
test = paste(sample(c(words, syl, syl, syl, words, syl, syl, syl), 200, replace=TRUE), collapse=' ')

corpus = data.frame(word = strsplit(train, split=' ')[[1]]) %>% 
    mutate(last_word=lag(word, 1),
           bigram = paste(last_word, word))


key = corpus %>%
    mutate(N=n()) %>%
    filter(!is.na(last_word)) %>%
    group_by(bigram, last_word, word, N) %>%
    summarise(n = n()) %>%
    group_by(last_word) %>%
    mutate(surprisal = -1 * safe_log(n / sum(n))) %>%
    group_by(word) %>%
    mutate(log_freq=safe_log(sum(n)/N)) %>%
    ungroup() %>%
    select(-N, -n)
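
Read off the pipeline above, the two measures can be written as follows. The notation is introduced here for illustration only: $n(a, b)$ is the number of times the bigram "a b" occurs in the training corpus, and $N$ is the total number of tokens.

$$
\text{surprisal}(w_i) = -\log_2 \frac{n(w_{i-1}, w_i)}{\sum_{w'} n(w_{i-1}, w')},
\qquad
\text{log freq}(w_i) = \log_2 \frac{\sum_{w'} n(w', w_i)}{N}
$$

So surprisal depends on the previous word, while log frequency depends only on how often the word itself occurs.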


<div style='display:none'>

    parts = data.frame(participant=1:10,
               p_int = rnorm(1:20, 0, 20),
               p_surp = rnorm(1:20, 0, 1),
               p_freq = rnorm(1:20, 0, 1))

    d = NULL
    for(i in 1:20){
        d = bind_rows(d, 
                      data.frame(word = strsplit(train, split=' ')[[1]]) %>%
                      mutate(last_word=lag(word, 1),
                             bigram = paste(last_word, word)) %>%
                      filter(!is.na(last_word)) %>%
                      left_join(key) %>%
                      mutate(RT = round(rnorm(1:n(), 
                                              350 + parts$p_int[i] + 
                                              (1.75+parts$p_surp[i])*surprisal + 
                                              (-0.8+parts$p_freq[i])*log_freq, 16)),
                            participant=parts$participant[i]) %>%
        select(participant, bigram, word, RT))
    }

    write.csv(d, 'reading_times.csv', row.names=FALSE)
</div>

And here is the data they collected.

In [None]:

data = read.csv('reading_times.csv') %>%
    left_join(key)

head(data)

<div class="alert alert-info" role="alert">

<h3>Exercise 2</h3>

Alright, let's plot this data! We want to see reaction time (y-axis) as a function of the predictors (x-axis), surprisal and log frequency. To do so, you should `gather` the predictors and use a `facet_wrap` as in your last assignment. Make sure you label both axes appropriately and that the plot renders in a legible font of appropriate size.

To be clear, for full marks, your plot must:
    
-  Have RT on the y-axis and predictor on the x-axis
-  Have correct y-axis label: Reaction Time (ms)
-  Have correct x-axis label: Predictor
-  Have both predictors in the same row of the graph
-  Have legible axes and axis labels
    
(2 Points)
</div>

In [None]:

# YOUR CODE HERE


<div class="alert alert-info" role="alert">

<h3>Exercise 3</h3>

What is the correlation between reaction time and surprisal?

Store your answer as `corr_surprisal`.
    
For the sake of the auto-grader, make sure that the `typeof` call in the test cell returns "double", which is computer speak for a decimal (floating-point) number.

(1 Point)

</div>
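
As a quick illustration of what `typeof` reports, here is a small sketch using made-up values that are unrelated to the reading-time data:

```r
typeof(0.25)                    # "double": a single decimal number
typeof(1L)                      # "integer"
typeof(data.frame(r = 0.25))    # "list": a data frame, even one holding a single number
typeof(data.frame(r = 0.25)$r)  # "double": the column pulled out of the data frame
```

If your answer comes back as "list", it is probably still wrapped in a data frame.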

In [None]:

corr_surprisal = NA # YOUR CODE HERE


In [None]:

typeof(corr_surprisal)

### BEGIN HIDDEN TESTS
expect_equal(corr_surprisal, 0.0799232145125666, tol=0.0001)
### END HIDDEN TESTS


<div class="alert alert-info" role="alert">

<h3>Exercise 4</h3>

What is the correlation between reaction time and log frequency?

Store your answer as `corr_freq`.
    
For the sake of the auto-grader, make sure that the `typeof` call in the test cell returns "double", which is computer speak for a decimal (floating-point) number.

(1 Point)
</div>

In [None]:

corr_freq = NA # YOUR CODE HERE


In [None]:

typeof(corr_freq)

### BEGIN HIDDEN TESTS
expect_equal(corr_freq, -0.0495188609843859, tol=0.0000001)
### END HIDDEN TESTS


<div class="alert alert-info" role="alert">

<h3>Exercise 5</h3>

My friend, Sumak, thinks my demonstration is wrong because I logged the frequencies. Give me some evidence to show them that logging the frequencies won't change the correlation.

For your answer, code is necessary and sufficient, but if you're anxious, you can provide two sentences explaining your answer.

(1 Point)
</div>

In [None]:
# YOUR CODE HERE

 ANSWER HERE

## Hyssop's Communicative Efficiency Analysis of Fictional Spices

One of my fictional students, Hyssop, wants to do a communicative efficiency analysis of the words for spices from their favorite fantasy series. Hyssop has some concerns about implementing the KL divergence to measure the difference between two probability distributions. They're concerned that the standard KL functions don't give informative error messages when assumptions are violated.

Here are the KL Divergence Assumptions they're concerned about:

1) Both distributions have to have the same support over X <br>
2) $\sum_x P(x) = 1$ <br>
3) $\sum_x Q(x) = 1$ <br>
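
For reference, this is the standard definition of the KL divergence, written with base-2 logarithms on the assumption that we stay in bits as in the rest of this notebook:

$$
D_{KL}(P \,\|\, Q) = \sum_{x \in X} P(x) \log_2 \frac{P(x)}{Q(x)}
$$

The sum pairs each $P(x)$ with the $Q(x)$ for the same outcome $x$, which is why both distributions must be defined over the same $X$.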

Now that we're experts at function syntax, let's try adding some built-in tests to our function using the `stopifnot` function.

The arguments of the `stopifnot` function are the conditions that must be true in order to keep running the function. Each condition is labeled with the error message to show if it fails.

`stopifnot(ERROR = CONDITION, ...)`

You can play with the following example:

In [None]:
x <- 4
y <- 1

stopifnot('x has negative value'= x > 0,
         'y is greater than x'= y <= x )
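
The example above passes silently because both conditions hold. As a sketch of the failing case, the block below changes `y` so that the second condition is violated, and uses `tryCatch` to capture the error message instead of halting. The variable `msg` is introduced only for this illustration, and named messages in `stopifnot` assume R 4.0.0 or later.

```r
x <- 4
y <- 7   # now y > x, so the second condition fails

# Capture the error message rather than stopping the notebook
msg <- tryCatch({
    stopifnot('x has negative value' = x > 0,
              'y is greater than x' = y <= x)
    'no error'
}, error = function(e) conditionMessage(e))

print(msg)   # the label attached to the failed condition
```

The error message is the label of the first condition that fails, which is what the `expect_error` tests below match against.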


<div class="alert alert-info" role="alert">

<h3>Exercise 6</h3>

Let's write them a function for KL divergence that also checks these assumptions. 

(3 Points)
</div>

In [None]:

kl_divergence = function(p, q){
    
    stopifnot('Both distributions should have the same support' = # YOUR CODE HERE
               'P(X) = 1' = # YOUR CODE HERE
               'Q(X) = 1' = # YOUR CODE HERE )
    
    # YOUR CODE HERE
}


In [None]:
# Let's test that both distributions have the same support

# This should produce an error
expect_error(kl_divergence(c(1, 1, 1)/3, c(1, 0)), 'support' )

# This shouldn't produce an error
expect_gt(kl_divergence(c(1, 1, 1)/3, c(0.5, 0.25, 0.25)), 0)

# Let's test if P(X) sums to one.

expect_error(kl_divergence(c(1, 2, 1)/3, c(1, 1, 1)/3), 'P\\(X\\)')

expect_gt(kl_divergence(c(0.64, 0.16, 0.2), c(1, 1, 1)/3), 0)

# Let's test if Q(X) sums to one.

expect_error(kl_divergence(c(1, 1, 1)/3, c(4, 1, 1)/3), 'Q\\(X\\)')

expect_gt(kl_divergence(c(0.32, 0.32, 0.36), c(1, 1, 1)/3), 0)

# Hidden Tests

### BEGIN HIDDEN TESTS
expect_equal(kl_divergence( c(1, 1, 1)/3 , c(0.1, 0.8, 0.1) ), 0.736965594166206, tol=0.00001)
expect_equal(kl_divergence( c(0.8, 0.1, 0.1) , c(0.1, 0.8, 0.1) ), 2.1, tol=0.00001)
expect_equal(kl_divergence( c(0.4, 0.1, 0.4, 0.1), c(0.25, 0.25, 0.25, 0.25) ), 0.278071905112638, tol=0.00001)
### END HIDDEN TESTS
