# Computational Modelling Tutorial - Simulating Infant Statistical Learning

## Introduction

This IPython notebook contains exercise 1 of the Computational Modelling tutorial at ISOLDE 2018 in Potsdam. 

- What does the model do?

This model tests the hypothesis that infants use transitional probabilities to learn where word boundaries lie within unsegmented, continuous speech. The model will be written in Python, but the same ideas could also be implemented in a different programming language.

- What do we need?


1. Reading in the data from the corpus: opening the file, reading it line by line, and counting the syllables and syllable pairs.
2. Computing the probabilities for the bigrams (syllable pairs) and unigrams (single syllables).
3. Testing the model by comparing the average probability of words to the average probability of non-words.

- But first:

Let's learn some Python!



## Mini-tutorial: Python basics

Important note for using Python: 

Blocks of code that belong together must be indented the same way, so be careful not to mix spaces and tabs for indentation: the two can look identical on screen. If your code does not run, checking the indentation is a good first step.

### Variable-types

#### Numbers

Variables containing numbers are made by assigning a value to a variable name, for instance:

In [1]:
number_of_sandwiches = 4

price_per_sandwich = 3.50

price = number_of_sandwiches * price_per_sandwich

price 

14.0

#### Text

Variables containing text are called strings, and they are made by assigning the text (within quotation marks) to a variable name, for instance:

In [2]:
type_of_sandwich = "tomato and cheese"
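
Two string operations return in Step 1, where each line of the corpus file is cleaned before it is stored: `strip()` and `len()`. A minimal sketch, using a made-up line of text (`raw_line`, `syllable` and `sentence` are names chosen for this illustration only):

```python
# A line read from a text file usually ends with a newline character ("\n").
raw_line = "go\n"

# strip() returns a copy without leading and trailing whitespace.
syllable = raw_line.strip()

# Strings can be glued together with + and measured with len().
sentence = "I would like a " + "tomato and cheese" + " sandwich"

print(syllable)
print(len(syllable))
print(sentence)
```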

#### Lists
When you want to store multiple bits of information in one variable, you could use a list. Lists are made by using square brackets [] and separating everything you store in the list with a comma.


In [None]:
empty_order = []

order = [4,"tomato_and_cheese", 3.50]
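
The list operations used later in this notebook are indexing, `append()` and `len()`. A short sketch of how they behave (the variable names are made up for this illustration):

```python
order = [4, "tomato_and_cheese", 3.50]

# Indexing starts at 0; negative indices count from the end of the list.
first_item = order[0]
last_item = order[-1]

# append() adds a new item to the end of an existing list.
syllables = []
syllables.append("go")
syllables.append("la")

# len() gives the number of items in the list.
number_of_syllables = len(syllables)

print(first_item, last_item, syllables, number_of_syllables)
```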

#### Dictionaries
Another way of storing information is using dictionaries. In a dictionary, you can store combinations of a 'key' and a value. Keys are often strings, but other unchangeable values such as numbers or tuples also work. The value can be of any type.


In [None]:
#Making an empty dictionary is done by using curly brackets {}:

employee_numbers = {}

#You can add a key-value pair to the dictionary: 
employee_numbers['Hank'] = 6758
employee_numbers['Sarah'] = 5664

#And ask for the value linked to a key: 

Sarahs_employee_number = employee_numbers['Sarah']
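
Dictionaries are well suited to counting, which is how they are used in Step 2. The sketch below shows the counting pattern on a made-up list of ingredients. It also shows that a tuple can serve as a key and that `get()` can supply a default for missing keys. All names here are illustrative assumptions.

```python
ingredient_counts = {}

for ingredient in ["tomato", "cheese", "tomato"]:
    if ingredient in ingredient_counts:
        # seen before: increase the stored count
        ingredient_counts[ingredient] += 1
    else:
        # first occurrence: start counting at 1
        ingredient_counts[ingredient] = 1

# A tuple (a fixed pair of values) can be used as a key as well.
pair_counts = {("tomato", "cheese"): 1}

# get() returns the second argument when the key is not in the dictionary.
ham_count = ingredient_counts.get("ham", 0)

print(ingredient_counts, pair_counts, ham_count)
```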

### Control-flow

#### If-else - statements
If-else statements can be used when you want something to happen only in certain cases, and not in others. For instance:


In [4]:
credit = 40.65

if credit < price:
    print('not enough credit, try paying in a different way')
else:
    credit -= price
    print('thank you for your purchase!')

thank you for your purchase!




#### For-loops
With a for-loop you can iterate over a sequence (for example a list or a string) and perform a computation for each item in the sequence. For example:

In [5]:
names = ['Hank', 'Sarah', 'Joe']

for name in names:
    print("Hello "+ name + " !")
    
#Or, another example:
prices = [3.5,5.6,8.0,2.4]
total_price = 0

for i in range(0,len(prices)):
    total_price += prices[i]
    
print(total_price)

Hello Hank !
Hello Sarah !
Hello Joe !
19.5
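
Looping over indices becomes useful when you need an item together with its neighbour, as we will for syllable pairs. A minimal sketch on a made-up list (the names `syllables` and `pairs` are chosen for this illustration):

```python
syllables = ['go', 'la', 'bu', 'tu']
pairs = []

# Stop one position before the end, so that i + 1 is always a valid index.
for i in range(len(syllables) - 1):
    pairs.append((syllables[i], syllables[i + 1]))

print(pairs)
```

A list of four syllables contains three neighbouring pairs, which is why the loop runs over `len(syllables) - 1` positions.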


## Writing your own code

### Step 1: Reading in the data

We need a function that can read in the data from the corpus file. 

First, we need to create a list to store the results. 

Then, we open the file for reading, signalled by 'r', and read it in line by line. For each line in the corpus, we remove the end-of-line symbol and add the remaining syllable to the list of results. 

Finally, we return the list with syllables so that it can be used by the rest of the program.

In [6]:
def read_corpus(filename):
    """ read corpus from newline-delimited text file
    returns a list of syllables """
    
    #Write your Python-code here
    
    result = []
    
    # open the file for reading; 'with' closes it again when the block ends
    with open(filename, 'r') as corpus_file:
        # loop over the lines and strip the end-of-line symbol from each
        for line in corpus_file:
            syll = line.strip()
            result.append(syll)
    
    return result
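
If you want to check the reading logic without the real corpus file, you can try it on a small temporary file first. The sketch below is only an illustration. `read_syllables` is a hypothetical stand-in that mirrors the logic of `read_corpus`, and the three-syllable file content is made up.

```python
import os
import tempfile


def read_syllables(filename):
    """Hypothetical stand-in mirroring read_corpus: one syllable per line."""
    syllables = []
    with open(filename, 'r') as corpus_file:
        for line in corpus_file:
            syllables.append(line.strip())
    return syllables


# Write a tiny made-up corpus to a temporary file.
with tempfile.NamedTemporaryFile('w', suffix='.cor', delete=False) as tmp:
    tmp.write("go\nla\nbu\n")
    tmp_name = tmp.name

toy_corpus = read_syllables(tmp_name)

# Clean up the temporary file again.
os.remove(tmp_name)

print(toy_corpus)
```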

In [8]:
#This is how we want to use this little routine:
corpusfile = 'saffran_corpus.cor'
corpus = read_corpus(corpusfile)
corpus

['go',
 'la',
 'bu',
 'tu',
 'pi',
 'ro',
 'go',
 'la',
 'bu',
 'bi',
 'da',
 'ku',
 'pa',
 'do',
 'ti',
 'bi',
 'da',
 'ku',
 'pa',
 'do',
 'ti',
 'tu',
 'pi',
 'ro',
 'bi',
 'da',
 'ku',
 'pa',
 'do',
 'ti',
 'tu',
 'pi',
 'ro',
 'bi',
 'da',
 'ku',
 'tu',
 'pi',
 'ro',
 'pa',
 'do',
 'ti',
 'bi',
 'da',
 'ku',
 'go',
 'la',
 'bu',
 'pa',
 'do',
 'ti',
 'bi',
 'da',
 'ku',
 'tu',
 'pi',
 'ro',
 'go',
 'la',
 'bu',
 'tu',
 'pi',
 'ro',
 'go',
 'la',
 'bu',
 'tu',
 'pi',
 'ro',
 'bi',
 'da',
 'ku',
 'pa',
 'do',
 'ti',
 'tu',
 'pi',
 'ro',
 'bi',
 'da',
 'ku',
 'go',
 'la',
 'bu',
 'bi',
 'da',
 'ku',
 'go',
 'la',
 'bu',
 'bi',
 'da',
 'ku',
 'tu',
 'pi',
 'ro',
 'bi',
 'da',
 'ku',
 'go',
 'la',
 'bu',
 'pa',
 'do',
 'ti',
 'bi',
 'da',
 'ku',
 'go',
 'la',
 'bu',
 'tu',
 'pi',
 'ro',
 'pa',
 'do',
 'ti',
 'go',
 'la',
 'bu',
 'tu',
 'pi',
 'ro',
 'pa',
 'do',
 'ti',
 'tu',
 'pi',
 'ro',
 'bi',
 'da',
 'ku',
 'go',
 'la',
 'bu',
 'tu',
 'pi',
 'ro',
 'pa',
 'do',
 'ti',
 'tu',
 'pi',

### Step 2: Processing the data

Next, we need to process the corpus and compute the probabilities for the syllables and syllable-pairs. We split this functionality into three parts.

The counts of the syllables or syllable-pairs can be stored in dictionaries. Then, we go through the list of syllables to add the syllables and syllable-pairs to the dictionaries, or update their respective counts if they are already in the dictionary. 

In [10]:
def process_corpus(list_of_syllables):
    """ extract count of uni- & bigram occurrences from sequence
    of syllables in list_of_syllables """
    
    unigram_dict = {}
    bigram_dict = {}
    
    for syllable_index in range(len(list_of_syllables)-1):
        unigram = list_of_syllables[syllable_index]
        if unigram in unigram_dict:
            unigram_dict[unigram] += 1
        else:
            unigram_dict[unigram] = 1
            
        bigram = (list_of_syllables[syllable_index], list_of_syllables[syllable_index +1])
        if bigram in bigram_dict:
            bigram_dict[bigram] += 1
        else: 
            bigram_dict[bigram] = 1
    
    unigram = list_of_syllables[-1]
    if unigram in unigram_dict:
        unigram_dict[unigram] += 1
    else:
        unigram_dict[unigram] = 1

    # return the dictionaries with the unigram and bigram counts
    return unigram_dict, bigram_dict

In [11]:
unigram_dict, bigram_dict = process_corpus(corpus)
unigram_dict

{'bi': 44,
 'bu': 45,
 'da': 44,
 'do': 40,
 'go': 45,
 'ku': 44,
 'la': 45,
 'pa': 40,
 'pi': 51,
 'ro': 51,
 'ti': 40,
 'tu': 51}
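
The standard library offers a shortcut for this kind of counting. The sketch below is an alternative, not the tutorial's own solution: it uses `collections.Counter` and `zip` to pair each syllable with its successor. The function name `count_ngrams` and the toy sequence are made up for this illustration.

```python
from collections import Counter


def count_ngrams(list_of_syllables):
    """Count unigrams and bigrams (hypothetical alternative to process_corpus)."""
    unigram_counts = Counter(list_of_syllables)
    # zip pairs every syllable with the one that follows it
    bigram_counts = Counter(zip(list_of_syllables, list_of_syllables[1:]))
    return unigram_counts, bigram_counts


toy_sequence = ['go', 'la', 'bu', 'go', 'la', 'ti']
toy_unigrams, toy_bigrams = count_ngrams(toy_sequence)

print(toy_unigrams)
print(toy_bigrams)
```

A `Counter` returns 0 for keys it has never seen, so no `if ... in ...` check is needed when looking up an unseen bigram.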

The probability of a bigram (syll_1, syll_2) can be computed by dividing the count of both syllables together by the count of the first syllable alone. So, you compute how often the first syllable is followed by the second, as opposed to by any other syllable.

In [12]:
def estimated_bigram_probability(bigram, unigram_dict, bigram_dict):
    """ estimate the probability of bigram (= (syll_1,syll_2)) by:
    (count (syll_1,syll_2)) / (count syll_1)
    """
    count = 0.
    
    if bigram in bigram_dict:
        count = bigram_dict[bigram]
    
    prob = count / unigram_dict[bigram[0]]
    
    #return the estimated bigram probability 
    return prob
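
To see the arithmetic on a small scale, here is a worked example with made-up counts. They correspond to the toy sequence go la bu go la ti, not to the Saffran corpus. The helper `transitional_probability` is a hypothetical name that follows the same formula as above. The example assumes Python 3, where `/` always performs floating-point division.

```python
# Made-up counts for the toy sequence: go la bu go la ti
toy_unigrams = {'go': 2, 'la': 2, 'bu': 1, 'ti': 1}
toy_bigrams = {('go', 'la'): 2, ('la', 'bu'): 1, ('bu', 'go'): 1, ('la', 'ti'): 1}


def transitional_probability(first, second):
    """count(first, second) / count(first), with 0 for unseen pairs."""
    pair_count = toy_bigrams.get((first, second), 0)
    return pair_count / toy_unigrams[first]


tp_go_la = transitional_probability('go', 'la')  # 2 / 2: 'go' is always followed by 'la'
tp_la_bu = transitional_probability('la', 'bu')  # 1 / 2: 'la' is followed by 'bu' half the time
tp_bu_la = transitional_probability('bu', 'la')  # 0 / 1: this pair never occurs

print(tp_go_la, tp_la_bu, tp_bu_la)
```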

The probability of a sequence of syllables can be computed by going through the sequence and multiplying all the estimated bigram probabilities of the sequence.

In [13]:
def estimated_sequence_probability(list_of_syllables, unigram_dict, bigram_dict):
    """ estimate probability of sequence of syllables,
    represented as a list """
    
    # set probability to 1 initially
    prob = 1.

    # loop over sequence indices
    for syll_idx in range(len(list_of_syllables) - 1):
        # form bigram from subsequent syllables
        bigram = (list_of_syllables[syll_idx], list_of_syllables[syll_idx + 1])
        
        # multiply previous probability with probability of this bigram
        prob = prob * estimated_bigram_probability(bigram, unigram_dict, bigram_dict)

    # return the estimated probability of the entire sequence
    return prob
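
A small worked example may help to see how the product behaves. It uses made-up counts for the toy sequence go la bu go la ti. The function `sequence_probability` is a hypothetical, self-contained version of the same idea, not a replacement for the function above.

```python
# Made-up counts for the toy sequence: go la bu go la ti
toy_unigrams = {'go': 2, 'la': 2, 'bu': 1, 'ti': 1}
toy_bigrams = {('go', 'la'): 2, ('la', 'bu'): 1, ('bu', 'go'): 1, ('la', 'ti'): 1}


def sequence_probability(syllables, unigram_counts, bigram_counts):
    """Multiply the estimated bigram probabilities along the sequence."""
    prob = 1.0
    for first, second in zip(syllables, syllables[1:]):
        prob *= bigram_counts.get((first, second), 0) / unigram_counts[first]
    return prob


# (go, la) = 2/2 and (la, bu) = 1/2, so the product is 0.5
p_seen = sequence_probability(['go', 'la', 'bu'], toy_unigrams, toy_bigrams)

# (go, bu) never occurs, so the whole product is 0
p_unseen = sequence_probability(['go', 'bu', 'la'], toy_unigrams, toy_bigrams)

print(p_seen, p_unseen)
```

Note that a single unseen bigram makes the probability of the whole sequence zero, however likely the other transitions are.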


### Step 3: Testing the model

To test the model we need the experimental test-phase stimuli from the Saffran-study. These are given below. 

We want to compare the average probability for words to the average probability for non-words. 

In [17]:
def test_model(unigram_dict, bigram_dict):
    """ test the model on saffran's words and non-words
    """
    
    # the words and non-words from saffran
    words = [['tu','pi','ro'],
             ['go','la','bu'],
             ['bi','da','ku'],
             ['pa','do','ti']]
    non_words = [['da','pi','ku'],
                 ['ti','la','do']]

    # calculate the sum of the probabilities of the words
    sum_words = 0
    for word in words:
        sum_words += estimated_sequence_probability(word, unigram_dict, bigram_dict)

    # divide by the number of words to get the average
    average_word = sum_words / len(words)

    # idem for the non-words
    sum_non_words = 0
    for non_word in non_words:
        sum_non_words += estimated_sequence_probability(non_word, unigram_dict, bigram_dict)
    # divide by the number of non-words to get the average
    average_non_word = sum_non_words / len(non_words)

    print('Average probability for words:', average_word)
    print('Average probability for non-words:', average_non_word)



Now you can complete a whole run through the model. Which commands do you need to type in below, and in what order?

In [16]:
#Let's see whether everything works. Don't worry if you get an error message, that is normal even for experienced programmers.

test_model(unigram_dict, bigram_dict)

Average probability for words: 1.0
Average probability for non-words: 0.0
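
The averaging for words and for non-words in `test_model` follows the same pattern twice. One possible way to avoid the repetition is a small helper. The sketch below is an assumption about how this could look: `average_probability` is a hypothetical name, and the scores are made-up stand-in values used only to show the call, not model output.

```python
def average_probability(sequences, probability_of):
    """Average of probability_of(sequence) over all sequences (hypothetical helper)."""
    total = 0.0
    for sequence in sequences:
        total += probability_of(sequence)
    return total / len(sequences)


# Stand-in scoring function with made-up values, only for demonstration.
made_up_scores = {('tu', 'pi', 'ro'): 1.0,
                  ('go', 'la', 'bu'): 0.5,
                  ('da', 'pi', 'ku'): 0.0}


def made_up_probability(sequence):
    # lists cannot be dictionary keys, so convert the sequence to a tuple
    return made_up_scores[tuple(sequence)]


average_word = average_probability([['tu', 'pi', 'ro'], ['go', 'la', 'bu']],
                                   made_up_probability)
average_non_word = average_probability([['da', 'pi', 'ku']], made_up_probability)

print(average_word, average_non_word)
```

In the notebook, the second argument could be a small function that calls `estimated_sequence_probability` with `unigram_dict` and `bigram_dict`.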


In [18]:

corpusfile_italian = "Pelucci3B.cor"
corpus_italian = read_corpus(corpusfile_italian)

# Now, as before, we get the unigrams and bigrams. Use the print() command to inspect them.
unigram_dict_italian, bigram_dict_italian = process_corpus(corpus_italian)
print(unigram_dict_italian)
print(bigram_dict_italian)

# Now we just need to adapt the test_model routine. We change the name so that we do not overwrite the previous function.
def test_model_italian(unigram_dict, bigram_dict):
    """ test the model on the words and non-words from Pelucci et al.
    """
    
    # the words and non-words from Pelucci et al., experiment 3
    words = [['me', 'lo'],
             ['fu', 'ga']]
    non_words = [['bi', 'ci'],
                 ['ca', 'sa']]

    # calculate the sum of the probabilities of the words
    sum_words = 0
    for word in words:
        sum_words += estimated_sequence_probability(word, unigram_dict, bigram_dict)

    # divide by the number of words to get the average
    average_word = sum_words / len(words)

    # idem for the non-words
    sum_non_words = 0
    for non_word in non_words:
        sum_non_words += estimated_sequence_probability(non_word, unigram_dict, bigram_dict)
    # divide by the number of non-words to get the average
    average_non_word = sum_non_words / len(non_words)

    print('Average probability for words:', average_word)
    print('Average probability for non-words:', average_non_word)


# Now test this:

test_model_italian(unigram_dict_italian, bigram_dict_italian)


{'non': 1, 'e': 6, 'da': 6, 'me': 17, 'scen': 1, 'de': 3, 're': 4, 'dal': 3, 'lo': 6, 'in': 5, 'u': 3, 'na': 6, 'fu': 18, 'ti': 3, 'le': 4, 'ga': 6, 'a': 4, 'pi': 3, 'Tor': 1, 'no': 5, 'ca': 6, 'sa': 6, 'la': 16, 'ta': 12, 'con': 1, 'bi': 6, 'ci': 8, 'di': 7, 'ma': 2, 'tu': 1, 'il': 7, 'ver': 4, 'se': 2, 'ro': 3, 'por': 2, 'te': 2, 'pres': 1, 'so': 7, 'mes': 1, 'vi': 4, 'zio': 1, 'lu': 1, 'i': 3, 'gi': 3, 'do': 2, 'col': 2, 'ten': 2, 'ri': 3, 'o': 7, 'del': 6, 'an': 2, 'co': 5, 'ros': 1, 'si': 3, 'va': 2, 'nu': 2, 'lin': 1, 'ge': 1, 'che': 3, 'to': 5, 'scor': 1, 'sog': 1, 'li': 2, 'ber': 1, 'rat': 1, 'quan': 1, 'tem': 1, 'pes': 1, 'mi': 2, 'sot': 1, 'al': 1, 'om': 1, 'bro': 1, 'su': 1, 'ra': 1, 'sem': 1, 'bra': 1, 'ce': 1, 'stel': 1, 'fer': 1, 'sul': 1, 'zi': 1}
{('non', 'e'): 1, ('e', 'da'): 1, ('da', 'me'): 1, ('me', 'scen'): 1, ('scen', 'de'): 1, ('de', 're'): 1, ('re', 'dal'): 1, ('dal', 'me'): 1, ('me', 'lo'): 6, ('lo', 'in'): 1, ('in', 'u'): 1, ('u', 'na'): 3, ('na', 'fu'): 2, ('