# Computational Modelling Tutorial - Simulating Infant Statistical Learning

## Introduction

This IPython notebook contains exercise 1 of the Computational Modelling tutorial at ISOLDE 2018 in Potsdam. 

- What does the model do?

This model tests the hypothesis that infants use transitional probabilities to learn where word boundaries lie within unsegmented, continuous speech. The model will be written in Python, but the same ideas could also be implemented in a different programming language.

- What do we need?


1. Reading in the data from the corpus: opening the file, reading it line by line, and counting the syllables and syllable pairs.
2. Computing the probabilities for the bigrams (syllable pairs) and unigrams (single syllables).
3. Testing the model by comparing the average probability of words to the average probability of non-words.

- But first:

Let's learn some Python!



## Mini-tutorial: Python basics

Important note for using Python: 

Blocks of code that belong together must be indented the same way, so be careful not to mix spaces and tabs for indentation, as the two can look identical on screen. If your code does not run, checking the indentation is a good first step.

### Variable-types

#### Numbers

Variables containing numbers are made by assigning a value to a variable name, for instance:

In [None]:
number_of_sandwiches = 4

price_per_sandwich = 3.50

price = number_of_sandwiches * price_per_sandwich

#### Text

Variables containing text are called strings, and they are made by assigning the text (within quotation marks) to a variable name, for instance:

In [2]:
type_of_sandwich = "tomato and cheese"

#### Lists
When you want to store multiple bits of information in one variable, you can use a list. Lists are made by using square brackets [] and separating the items you store in the list with commas.


In [None]:
empty_order = []

order = [4,"tomato_and_cheese", 3.50]

#### Dictionaries
Another way of storing information is using dictionaries. In a dictionary, you can store combinations of a 'key' and a 'value'. Keys are usually strings, but a value can be of any variable type.


In [None]:
#Making an empty dictionary is done by using curly brackets {}:

employee_numbers = {}

#You can add a key-value pair to the dictionary: 
employee_numbers['Hank'] = 6758
employee_numbers['Sarah'] = 5664

#And ask for the value linked to a key: 

Sarahs_employee_number = employee_numbers['Sarah']
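
Two more dictionary features are worth knowing. The sketch below uses made-up data (the name `shift_counts` and its contents are invented for illustration): a pair of values written in round brackets, called a tuple, can serve as a key, and the keyword `in` checks whether a key is present without causing an error.

```python
# Hypothetical example data, invented for this illustration
shift_counts = {}

# A tuple (values in round brackets) can also be used as a key
shift_counts[('Hank', 'Monday')] = 2
shift_counts[('Sarah', 'Monday')] = 1

# 'in' checks whether a key is present, without raising an error
has_hank_monday = ('Hank', 'Monday') in shift_counts
has_joe_monday = ('Joe', 'Monday') in shift_counts

print(has_hank_monday)
print(has_joe_monday)
```

Asking for `shift_counts[('Joe', 'Monday')]` directly would raise a `KeyError`, which is why checking with `in` first can be useful.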

### Control-flow

#### If-else statements
If-else statements can be used when you want something to happen only in certain cases, and something else (or nothing) in all other cases. For instance:


In [None]:
credit = 40.65

if credit < price:
    print('not enough credit, try paying in a different way')
else:
    credit -= price
    print('thank you for your purchase!')



#### For-loops
With a for-loop you can iterate over a sequence (for example a list or a string) and perform a certain computation for each item within the sequence. For example:

In [None]:
names = ['Hank', 'Sarah', 'Joe']

for name in names:
    print("Hello "+ name + " !")
    
#Or, another example:
prices = [3.5,5.6,8.0,2.4]
total_price = 0

for i in range(0,len(prices)):
    total_price += prices[i]
    
print(total_price)
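
The building blocks above (lists, dictionaries, if-else and for-loops) can be combined. The following is a small sketch on a made-up list of letters, with all variable names invented for illustration. It first counts how often each item occurs, and then collects every item together with its right-hand neighbour.

```python
# Made-up example sequence
letters = ['a', 'b', 'a', 'b', 'c']

# Count how often each item occurs
letter_counts = {}
for letter in letters:
    if letter in letter_counts:
        # the key already exists: increase its count by one
        letter_counts[letter] += 1
    else:
        # first time we see this key: start counting at one
        letter_counts[letter] = 1

# Visit each item together with its right-hand neighbour.
# The range stops one position early, because the last item has no neighbour.
neighbour_pairs = []
for i in range(0, len(letters) - 1):
    # append() adds one item to the end of a list
    neighbour_pairs.append((letters[i], letters[i + 1]))

print(letter_counts)
print(neighbour_pairs)
```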

## Writing your own code

### Step 1: Reading in the data

We need a function that can read in the data from the corpus file. 

First, we need to create a list to store the results. 

Then, we open the file for reading, signalled by 'r', and read it in line by line. For each line in the corpus we remove the end-of-line symbol and add the remaining syllable to the list of results. 

Finally, we return the list with syllables so that it can be used by the rest of the program.

In [None]:
def read_corpus(filename):
    """ read corpus from '\n'-delimited text file
    returns a list of syllables """
    
    #Write your Python-code here
    #...
    
    # open the file for reading and loop over the lines
    for line in open(filename, 'r'):
        #...
    
    return result

In [None]:
#This is how we want to use this little routine:
corpusfile = 'saffran_corpus.cor'
corpus = read_corpus(corpusfile)
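
If you get stuck, here is one possible sketch of the reading step, not the only way to solve it. The function name `read_syllables_sketch` and the tiny corpus are made up for illustration. The sketch assumes one syllable per line, and skipping empty lines is an extra assumption that the exercise does not ask for. To keep it runnable on its own, it first writes a temporary file instead of using the real corpus.

```python
import os
import tempfile

def read_syllables_sketch(filename):
    """Return the lines of a text file as a list, without end-of-line symbols."""
    result = []
    # 'with' closes the file automatically when the block ends
    with open(filename, 'r') as corpus_handle:
        for line in corpus_handle:
            syllable = line.strip()  # remove the end-of-line symbol
            if syllable:             # assumption: skip empty lines
                result.append(syllable)
    return result

# Write a tiny made-up corpus to a temporary file so the sketch can be run
with tempfile.NamedTemporaryFile('w', suffix='.cor', delete=False) as tmp:
    tmp.write('tu\npi\nro\ngo\nla\nbu\n')
    toy_corpus_path = tmp.name

toy_corpus = read_syllables_sketch(toy_corpus_path)
os.remove(toy_corpus_path)  # clean up the temporary file

print(toy_corpus)
```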

### Step 2: Processing the data

Next, we need to process the corpus and compute the probabilities for the syllables and syllable-pairs. We split this functionality into three parts.

The counts of the syllables or syllable-pairs can be stored in dictionaries. Then, we go through the list of syllables to add the syllables and syllable-pairs to the dictionaries, or update their respective counts if they are already in the dictionary. 

In [None]:
def process_corpus(list_of_syllables):
    """ extract count of uni- & bigram occurrences from sequence
    of syllables in list_of_syllables """
    
    #Write your code here
    #...
    

    # return the dictionaries with the unigram and bigram counts
    return unigram_dict, bigram_dict
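
As a hint, the counting step could look roughly like the sketch below. The function name and the toy sequence are made up for illustration, and storing each syllable pair as a tuple key is one possible choice rather than a requirement. The dictionary method `get(key, 0)` returns the stored count, or 0 if the key is not in the dictionary yet.

```python
def count_unigrams_and_bigrams_sketch(list_of_syllables):
    """Count single syllables and adjacent syllable pairs in a list."""
    unigram_counts = {}
    bigram_counts = {}
    for i in range(len(list_of_syllables)):
        syllable = list_of_syllables[i]
        # add one to the current count (or to 0 if the syllable is new)
        unigram_counts[syllable] = unigram_counts.get(syllable, 0) + 1
        # the last syllable has no successor, so it starts no pair
        if i + 1 < len(list_of_syllables):
            pair = (syllable, list_of_syllables[i + 1])
            bigram_counts[pair] = bigram_counts.get(pair, 0) + 1
    return unigram_counts, bigram_counts

# Made-up toy sequence, not the real corpus
toy_unigrams, toy_bigrams = count_unigrams_and_bigrams_sketch(
    ['tu', 'pi', 'ro', 'tu', 'pi', 'go'])

print(toy_unigrams)
print(toy_bigrams)
```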

The probability of a bigram (syll_1, syll_2) can be computed by dividing the count of both syllables together by the count of the first syllable alone. In other words, you compute how often the first syllable is followed by the second, as opposed to any other syllable.
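
Written as a formula, with made-up numbers for the worked example (they are not taken from the corpus):

$$
P(\text{syll}_2 \mid \text{syll}_1) \approx \frac{\text{count}(\text{syll}_1, \text{syll}_2)}{\text{count}(\text{syll}_1)}
$$

For instance, if a syllable occurred 40 times and was followed by one particular syllable in 10 of those cases, the estimate would be

$$
\frac{10}{40} = 0.25
$$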

In [None]:
def estimated_bigram_probability(bigram, unigram_dict, bigram_dict):
    """ estimate the probability of bigram (= (syll_1,syll_2)) by:
    (count (syll_1,syll_2)) / (count syll_1)
    """

    #Write your code here
    #...
    
    #return the estimated bigram probability 
    return prob
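
One way this estimate could be sketched is shown below, using small hand-written count dictionaries so the example runs on its own. The names and counts are made up for illustration, bigrams are assumed to be stored as tuple keys, and returning 0.0 for a syllable that was never seen is an assumption of this sketch. It also assumes Python 3, where `/` gives a decimal result even for whole numbers.

```python
# Made-up counts, as if taken from a tiny toy corpus
toy_unigram_counts = {'tu': 2, 'pi': 2, 'ro': 1, 'go': 1}
toy_bigram_counts = {('tu', 'pi'): 2, ('pi', 'ro'): 1,
                     ('ro', 'tu'): 1, ('pi', 'go'): 1}

def bigram_probability_sketch(bigram, unigram_counts, bigram_counts):
    """Estimate count(syll_1, syll_2) / count(syll_1) for one bigram."""
    first_syllable = bigram[0]
    first_count = unigram_counts.get(first_syllable, 0)
    if first_count == 0:
        # assumption: a first syllable we never saw gets probability 0
        return 0.0
    # a pair we never saw has count 0, so its probability is 0 as well
    return bigram_counts.get(bigram, 0) / first_count

prob_tu_pi = bigram_probability_sketch(('tu', 'pi'), toy_unigram_counts, toy_bigram_counts)
prob_pi_ro = bigram_probability_sketch(('pi', 'ro'), toy_unigram_counts, toy_bigram_counts)
prob_ro_pi = bigram_probability_sketch(('ro', 'pi'), toy_unigram_counts, toy_bigram_counts)

print(prob_tu_pi, prob_pi_ro, prob_ro_pi)
```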

The probability of a sequence of syllables can be computed by going through the sequence and multiplying all the estimated bigram probabilities of the sequence.

In [None]:
def estimated_sequence_probability(list_of_syllables, unigram_dict, bigram_dict):
    """ estimate probability of sequence of syllables,
    represented as a list """
    
    #Write your code here
    #...

    # return the estimated probability of the entire sequence
    return prob
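
The multiplication could be sketched as follows. To keep the example runnable on its own, it uses made-up count dictionaries and computes each bigram probability inline instead of calling the function from the previous exercise. All names are hypothetical. As an assumption of this sketch, a sequence with fewer than two syllables gets probability 1.0, because there is nothing to multiply.

```python
# Made-up counts, as if taken from a tiny toy corpus
toy_unigram_counts = {'tu': 2, 'pi': 2, 'ro': 1, 'go': 1}
toy_bigram_counts = {('tu', 'pi'): 2, ('pi', 'ro'): 1,
                     ('ro', 'tu'): 1, ('pi', 'go'): 1}

def sequence_probability_sketch(list_of_syllables, unigram_counts, bigram_counts):
    """Multiply the estimated probabilities of all adjacent syllable pairs."""
    prob = 1.0
    # stop one position early: the last syllable starts no pair
    for i in range(len(list_of_syllables) - 1):
        first_syllable = list_of_syllables[i]
        pair = (first_syllable, list_of_syllables[i + 1])
        first_count = unigram_counts.get(first_syllable, 0)
        if first_count == 0:
            # assumption: an unseen syllable makes the whole sequence impossible
            return 0.0
        prob *= bigram_counts.get(pair, 0) / first_count
    return prob

prob_seen = sequence_probability_sketch(['tu', 'pi', 'ro'], toy_unigram_counts, toy_bigram_counts)
prob_unseen = sequence_probability_sketch(['tu', 'ro', 'pi'], toy_unigram_counts, toy_bigram_counts)

print(prob_seen, prob_unseen)
```

Because the probabilities are multiplied, a single pair that never occurred in the corpus makes the probability of the whole sequence zero.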

### Step 3: Testing the model

To test the model we need the experimental test-phase stimuli from the Saffran study. These are given below. 

We want to compare the average probability for words to the average probability for non-words. 

In [None]:
def test_model(unigram_dict, bigram_dict):
    """ test the model on saffran's words and non-words
    """
    
    # the words and non-words from saffran
    words = [['tu','pi','ro'],
             ['go','la','bu'],
             ['bi','da','ku'],
             ['pa','do','ti']]
    non_words = [['da','pi','ku'],
                 ['ti','la','do']]

    # calculate the sum of the probabilities of the words
    #Write your code here
    #...

    # divide by the number of words to get the average
    #Write your code here
    #...

    # idem for the non-words
    #Write your code here
    #...

    print('Average probability for words:', average_word)
    print('Average probability for non-words:', average_non_word)
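
Computing an average follows the same pattern as the `total_price` example from the mini-tutorial: add everything up, then divide by the number of items. The sketch below uses made-up probabilities in place of the values your model will produce, and all names are hypothetical.

```python
def average_sketch(values):
    """Return the sum of the values divided by the number of values."""
    total = 0.0
    for value in values:
        total += value
    return total / len(values)

# Made-up probabilities, standing in for the output of the model
made_up_word_probabilities = [0.5, 0.25, 1.0, 0.25]
made_up_non_word_probabilities = [0.0, 0.125]

average_made_up_words = average_sketch(made_up_word_probabilities)
average_made_up_non_words = average_sketch(made_up_non_word_probabilities)

print('Average of the first list:', average_made_up_words)
print('Average of the second list:', average_made_up_non_words)
```

Note that `average_sketch` would fail with a division by zero on an empty list, so it assumes there is at least one value.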



Now you can complete a whole run through the model. Which commands do you need to type in below, and in what order?

In [None]:
#Let's see whether everything works. Don't worry if you get an error message; that is normal even for experienced programmers.