## Python refresher workshop 1

During this workshop (spanning four weeks), we will build a few different syllabification systems, i.e. models which can split words into sequences of syllables:

```
aɪ s l ə n d ə  ->  aɪ s . l ə n . d ə
```

In this first meeting, we will focus on data processing and evaluation. We'll also implement a  baseline syllabification algorithm. A baseline model is essentially a simple model that acts as a reference in a machine learning project. Our baseline will simply chunk the input into segments of two characters (with an optional single character segment at the end):

```
aɪ s l ə n d ə  ->  aɪ s . l ə . n d . ə
t͡ʃ aɪ n ə t aʊ n z  ->  t͡ʃ aɪ . n ə . t aʊ . n z
```

This is obviously not a great result. For example, `n d` is a pretty weird syllable.  However, this will give us a baseline performance that we can compare to when we start developing more complex syllabification systems. Any reasonable syllabifier needs to be able to beat this simple baseline.

The idea is to work on the exercises in pairs or small teams (though you can also work individually if you prefer). For people with no Python experience, it's a good idea to team up with someone who is a bit more experienced.

The following Python lectures on Canvas can be useful:

* [String handling](https://canvas.ubc.ca/courses/65386/files/9375671?module_item_id=2255551)
* [Loops](https://canvas.ubc.ca/courses/65386/files/9375677?module_item_id=2255559)
* [Reading from files](https://canvas.ubc.ca/courses/65386/files/9375679?module_item_id=2255563)

These materials should take you pretty far, but googling can also be useful. You are also very welcome to ask each other, Miikka and Roger for help. 

There are hints in the notebook which you can view by clicking on the hint. It's probably a good idea to try first and look at the hint if you need additional help.

If you feel that you have no idea how to get started, let Miikka and Roger know. We can discuss the exercises together.  

### 0. Preparation

If you're using [Google Colab](https://colab.research.google.com/?utm_source=scs-index), first upload this notebook (under the "Upload" tab). Once you're in the Colab, navigate to the "Files" panel to the left, create a new folder called "data" (by right-clicking and selecting "New folder"), and then upload the train, dev, and test files into the newly-created data folder (by right-clicking on the "data" folder and hitting "Upload"). 

### 1. Reading the data

We are given data files in the following format:

```
'd          d                    d
Bulls       b ʊ l z              b ʊ l z
Chinatowns  t͡ʃ aɪ n ə t aʊ n z   t͡ʃ aɪ . n ə . t aʊ n z
I'll        aɪ l                 aɪ l
Icelander   aɪ s l ə n d ə       aɪ s . l ə n . d ə
```

This is a [TSV](https://www.loc.gov/preservation/digital/formats/fdd/fdd000533.shtml) file which contains three tabulator-separated (```'\t'```) columns:

1. Orthographic word
2. IPA transcription
3. Syllabified IPA transcription

All IPA symbols are separated by single spaces and syllable boundaries are marked by a `.`
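As a rough illustration of this format, the sketch below splits one line from the sample above into its three columns. Splitting the third column on `" . "` is just one possible shortcut, not the index-tracking approach the hints below describe, and the variable names are only for illustration.

```python
# One line from the sample above, with real tab characters between columns.
line = "Icelander\taɪ s l ə n d ə\taɪ s . l ə n . d ə\n"

# Strip only the trailing newline, then split the three columns on tabs.
orth, ipa, ipa_syll = line.rstrip("\n").split("\t")

# IPA symbols are separated by single spaces.
symbols = ipa.split(" ")
# Splitting the third column on " . " gives one chunk per syllable.
syllables = [chunk.split(" ") for chunk in ipa_syll.split(" . ")]

print(orth)       # Icelander
print(symbols)    # ['aɪ', 's', 'l', 'ə', 'n', 'd', 'ə']
print(syllables)  # [['aɪ', 's'], ['l', 'ə', 'n'], ['d', 'ə']]
```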

We recommend first writing a function `read_line()`, which takes a line (i.e. string consisting of three tab-separated fields) as input, e.g.:

```
"Chinatowns  t͡ʃ aɪ n ə t aʊ n z   t͡ʃ aɪ . n ə . t aʊ n z"
```

and converts it into a Python dictionary having the following format:

```
{"orth": "Chinatowns",
 "ipa": ["t͡ʃ", "aɪ", "n", "ə", "t", "aʊ", "n", "z"],
 "syll": [(["t͡ʃ", "aɪ"], 0, 2), (["n", "ə"], 2, 4), (["t", "aʊ", "n", "z"], 4, 8)]}
```

Apart from `syll`, the fields are pretty self-explanatory. In the `syll` field, we've got a list of tuples representing each syllable, its start index and end index (which is 1 + the index of its final character). E.g. the syllable `"n", "ə"`, in the example above, starts at index 2 and ends at index 4:

```
IPA:   "t͡ʃ", "aɪ", "n", "ə", "t", "aʊ", "n", "z"
index:  0     1     2    3    4    5     6    7
```

After implementing the function `read_line()`, you can implement a function `read_data()`, which reads a file into a list of dictionaries. 

Use `read_data()` to read the training, development and test data and store the results in the variables `train`, `dev` and `test`. 

<details>
    <summary style="font-weight: bold;">Click here to see the first hint</summary>
    To create the "syll" list, you need to loop through the syllabified IPA transcription and track the index of each IPA symbol (There is a function that allows you to get index while looping). Note that because of the presence of the syllable boundary ".", the raw index of an IPA symbol needs to be modified to get the correct index.
</details>

<details>
    <summary style="font-weight: bold;">Click here to see the second hint</summary>
    You'll need to initialize two variables before looping through the transcription. One variable is used to track the start index of each syllable, while the other variable is used to track how many "." you have encountered. The latter variable can then be used to derive the correct start and end indices.
</details>

<details>
    <summary style="font-weight: bold;">Click here to see the third hint</summary>
    Here is the skeleton for one possible way to solve the problem:
    <code>
    for index, symbol in enumerate(IPAs):
        case 1: if symbol is "." (i.e., you reach the end of a syllable)
            add the current syllable to syll list
            update the two variables that track start index and # of "." seen
        case 2: if we reach the end of IPAs (last syllable)
            add the last syllable to syll list
        case 3: "else" condition
            add the symbol to the current syllable
    </code>
</details>

In [None]:
# your code here
def read_line(line):
    orth, ipa, ipa_syl = line.strip().split(sep='\t')
    ipa = ipa.split(sep=' ')
    
    ipa_syl = ipa_syl.split(sep=' ')
    syll = []
    offset = 0  # offset to correct indices due to '.'
    start = 0  # start index
    ipas = []  # string representation of syllable
    
    for i, symbol in enumerate(ipa_syl):
        if symbol == '.':
            syll.append((ipas, start, i + offset))
            start = i + offset
            offset -= 1  # offset = offset - 1
            ipas = []
        elif i == len(ipa_syl) - 1:  # reach the end
            ipas.append(symbol)
            syll.append((ipas, start, i + offset + 1))
        else:  # see a normal IPA
            ipas.append(symbol)
    
    return {'orth': orth, 'ipa': ipa, 'syll': syll}


def read_data(path):
    words = []
    with open(path, mode='r', encoding='utf-8') as f:  # IPA symbols need UTF-8
        for line in f:
            word_dict = read_line(line)
            words.append(word_dict)
    return words

train = read_data('./data/train.tsv')
test = read_data('./data/test.tsv')
dev = read_data('./data/dev.tsv')

### 2. Baseline

Today we will implement a very trivial baseline syllabifier function `baseline()`. It contains no phonological insight. Instead, it simply chops the input word into "syllables" of length 2. E.g. given the input:

```
t͡ʃ aɪ n ə t aʊ n z
```

the baseline function would syllabify:

```
t͡ʃ aɪ . n ə . t aʊ . n z
```

**Note:** If the input contains an odd number of IPA symbols, then the final symbol should constitute a singleton syllable. E.g., `aɪ s l ə n d ə -> aɪ s . l ə . n d . ə`. 

Given the input list:

```
["t͡ʃ", "aɪ", "n", "ə", "t", "aʊ", "n", "z"]
```

`baseline()` should return:

```
[(["t͡ʃ", "aɪ"], 0, 2), (["n", "ə"], 2, 4), (["t", "aʊ"], 4, 6), (["n", "z"], 6, 8)]
```
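To make the expected output format concrete, here is a minimal sketch of an alternative way to get the same result by slicing the list in steps of two. It is not the approach the hints below have in mind, and the function name `baseline_by_slicing` is made up for this illustration.

```python
def baseline_by_slicing(ipa_list):
    """Chunk ipa_list into (symbols, start, end) tuples of at most two symbols."""
    return [
        # min() keeps the end index inside the word for a final singleton.
        (ipa_list[start:start + 2], start, min(start + 2, len(ipa_list)))
        for start in range(0, len(ipa_list), 2)
    ]


print(baseline_by_slicing(["aɪ", "s", "l", "ə", "n", "d", "ə"]))
# [(['aɪ', 's'], 0, 2), (['l', 'ə'], 2, 4), (['n', 'd'], 4, 6), (['ə'], 6, 7)]
```

Comparing your own `baseline()` against a second implementation like this one is a cheap way to catch off-by-one errors in the indices.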

<details>
    <summary style="font-weight: bold;">Click here to see the first hint</summary>
    You can use the index value to see if you need to insert a syllable boundary ".".
</details>

<details>
    <summary style="font-weight: bold;">Click here to see the second hint</summary>
    More specifically, whether to insert a boundary is determined by whether the current index is odd or even. You can use the remainder of dividing the index by 2 to check this. The modulo operator <code>%</code> will be handy here. Note that you do NOT need to insert a syllable boundary "." after the last syllable.
</details>

In [None]:
# your code here
def get_syll_indices(syllabified_ipa_list):
    syll = []
    offset = 0  # offset to correct indices due to '.'
    start = 0  # start index
    ipas = []  # string representation of syllable
    
    for i, symbol in enumerate(syllabified_ipa_list):
        if symbol == '.':
            syll.append((ipas, start, i + offset))
            start = i + offset
            offset -= 1  # offset = offset - 1
            ipas = []
        elif i == len(syllabified_ipa_list) - 1:  # reach the end
            ipas.append(symbol)
            syll.append((ipas, start, i + offset + 1))
        else:  # see a normal IPA
            ipas.append(symbol)
    return syll

def baseline(ipa_list):
    syllabified = []
    # [t͡ʃ, aɪ, n, ə, t, aʊ, n, z] -> [t͡ʃ, aɪ, ., n, ə, ., t, aʊ, ., n, z]
    for i, symbol in enumerate(ipa_list):
        syllabified.append(symbol)
        # add '.' after every odd index, except after the last symbol
        if i % 2 == 1 and i != len(ipa_list) - 1:
            syllabified.append('.')

    # [t͡ʃ, aɪ, ., n, ə, ., t, aʊ, ., n, z] -> pass this through part of the code we write in Q1
    return get_syll_indices(syllabified)

### 3. Evaluation

We will evaluate the performance of the baseline system using [F1-score](https://deepai.org/machine-learning-glossary-and-terms/f-score).

![](https://upload.wikimedia.org/wikipedia/commons/2/26/Precisionrecall.svg)

E.g. given gold standard syllabified strings:

```
[(["t͡ʃ", "aɪ"], 0, 2), (["n", "ə"], 2, 4), (["t", "aʊ", "n", "z"], 4, 8)]
[(["aɪ", "s"], 0, 2), (["l", "ə", "n"], 2, 5), (["d", "ə"], 5, 7), (["s"], 7, 8)]
```

and a baseline system output:

```
[(["t͡ʃ", "aɪ"], 0, 2), (["n", "ə"], 2, 4), (["t", "aʊ"], 4, 6), (["n", "z"], 6, 8)]
[(["aɪ", "s"], 0, 2), (["l", "ə"], 2, 4), (["n", "d"], 4, 6), (["ə", "s"], 6, 8)]
```

we have 3 true positives: 

```
["t͡ʃ", "aɪ"], ["n", "ə"], ["aɪ", "s"] 
```
and 5 false positives:

```
["t", "aʊ"], ["n", "z"], ["l", "ə"], ["n", "d"], ["ə", "s"]
``` 

and 4 false negatives: 

```
["t", "aʊ", "n", "z"], ["l", "ə", "n"], ["d", "ə"], ["s"]
``` 

This results in precision: 

$$P = \frac{\text{true pos}}{(\text{true pos} + \text{false pos})} = 3/8$$ 

and recall: 

$$R = \frac{\text{true pos}}{(\text{true pos} + \text{false neg})} = 3/7$$ 

giving F1-score: 

$$F_1 = 2 * P * R / (P + R) = 0.4 $$
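The arithmetic above can be checked with a few lines of Python. This is only a sketch using the counts from the worked example; the variable names are illustrative.

```python
tp, fp, fn = 3, 5, 4  # counts from the worked example above

precision = tp / (tp + fp)  # 3/8
recall = tp / (tp + fn)     # 3/7
f1 = 2 * precision * recall / (precision + recall)

# Equivalent closed form in terms of the raw counts.
f1_from_counts = 2 * tp / (2 * tp + fp + fn)

print(round(f1, 3), round(f1_from_counts, 3))  # 0.4 0.4
```

The closed form shows that F1 depends only on the three summed counts, which is why they are accumulated over the whole dataset first.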

You should implement a function `evaluate()` which takes two lists as input: (1) a list of gold standard syllabifications, and (2) a list of system output syllabifications, in the same order and in the tuple format shown above. The function then computes the F1-score and returns it. 

**Note:** You should sum up the true positive, false positive and false negative scores over the entire dataset before computing F1-score.

<details>
    <summary style="font-weight: bold;">Click here to see the first hint</summary>
    Note that you need to compare the corresponding gold standard and system output. That is, you shouldn't collect the syllables from all gold standards into a giant gold set, collect the syllables from all system outputs into a giant system set, and calculate precision and recall using these two sets.
</details>

<details>
    <summary style="font-weight: bold;">Click here to see the second hint</summary>
    The set operations like intersection and difference will be useful here to calculate true positive, false positive, and false negative.
</details>

In [None]:
# your code here
def evaluate(gold, system):
    tp = 0  # true positive
    fp = 0  # false positive
    fn = 0  # false negative

    for g, s in zip(gold, system):
        g = [(tuple(syll[0]), syll[1], syll[2]) for syll in g]
        s = [(tuple(syll[0]), syll[1], syll[2]) for syll in s]

        tp += len(set(g) & set(s))
        fp += len(set(s) - set(g))
        fn += len(set(g) - set(s))

    if tp == 0:  # no correct syllables at all: avoid dividing by zero
        return 0.0

    pre = tp / (tp + fp)  # precision
    rec = tp / (tp + fn)  # recall
    return 2 * pre * rec / (pre + rec)

## Python refresher workshop 2

This week, we'll start by implementing a slightly less trivial syllabification algorithm than `baseline()`. The function `cv_syll()` assigns syllable boundaries after every vowel. This will allow it to syllabify many English words correctly, e.g. *Phoenicia* `f ɪ n ɪ ʃ ə` as `f ɪ . n ɪ . ʃ ə`. 

There are, however, a few problems with this approach. For example, we run the risk of creating syllables without nuclei at the end of words ending in a consonant, e.g. `n z` in `t͡ʃ aɪ . n ə . t aʊ . n z`. Therefore, as a special case, the function will assign all symbols following the last vowel in the word to the coda of the last syllable. Given this modification, the function will return the correct syllabification `t͡ʃ aɪ . n ə . t aʊ n z`.  

### 1. Evaluating the baseline

Start by running the `baseline()` syllabification algorithm on the development set and use `evaluate()` to figure out the F1-score of the baseline syllabification method. The F1-score should be around 26%. 

<details>
    <summary style="font-weight: bold;">Click here to see the first hint</summary>
    Note that you need to convert your dev list to the correct format before passing it to evaluate(). For instance, change
    
    [
        {'orth': 'abrade',
         'ipa': ['ə', 'b', 'ɹ', 'eɪ', 'd'],
         'syll': [(['ə'], 0, 1), (['b', 'ɹ', 'eɪ', 'd'], 1, 5)]},
        {'orth': 'abraded',
         'ipa': ['ə', 'b', 'ɹ', 'eɪ', 'd', 'ɪ', 'd'],
         'syll': [(['ə'], 0, 1), (['b', 'ɹ', 'eɪ'], 1, 4), (['d', 'ɪ', 'd'], 4, 7)]},
        {'orth': 'abrasion',
         'ipa': ['ə', 'b', 'ɹ', 'eɪ', 'ʒ', 'n̩'],
         'syll': [(['ə'], 0, 1), (['b', 'ɹ', 'eɪ'], 1, 4), (['ʒ', 'n̩'], 4, 6)]}
    ]
    
to
    
    [
        [(['ə'], 0, 1), (['b', 'ɹ', 'eɪ', 'd'], 1, 5)],
        [(['ə'], 0, 1), (['b', 'ɹ', 'eɪ'], 1, 4), (['d', 'ɪ', 'd'], 4, 7)],
        [(['ə'], 0, 1), (['b', 'ɹ', 'eɪ'], 1, 4), (['ʒ', 'n̩'], 4, 6)]
    ]
    
</details>

In [None]:
# your code here
dev_syll = [d['syll'] for d in dev]

# run the baseline on the IPA symbols already stored in `dev`
baseline_syll = [baseline(d['ipa']) for d in dev]

evaluate(dev_syll, baseline_syll)

### 2. Consonant and vowel sets

You should then form two sets `CONS` and `VOWEL`, which contain all consonants and vowels in the training set (if you can't upload the training set to Colab, it's also fine to use the development set). You need to:

1. Form a set `IPAS` of all IPA symbols that occur in the train/development set and print the set.
1. Then, manually create two sets `CONS` and `VOWEL` based on the inventory in `IPAS`. 

**Note!** There are a few syllabic consonants in the data set. These are marked by a small diacritic (`n̩ ŋ̩ m̩ l̩`). You need to think about whether to include these in your consonant or vowel set (or possibly both). Examine the development data, to figure out what makes sense.
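One way to examine the data is to count how often a syllabic consonant sits in a syllable that has no other vowel. The sketch below is only an illustration: the helper `syllabic_roles()`, the provisional `plain_vowels` argument, and the toy data (one entry copied from the hint in the previous section) are all assumptions made for this example. In the notebook you would pass `dev` and your own provisional vowel set instead.

```python
from collections import Counter

SYLLABIC = {"n̩", "ŋ̩", "m̩", "l̩"}


def syllabic_roles(data, plain_vowels):
    """Count syllabic consonants by whether their syllable has another vowel."""
    counts = Counter()
    for entry in data:
        for symbols, _, _ in entry["syll"]:
            has_vowel = any(s in plain_vowels for s in symbols)
            for s in symbols:
                if s in SYLLABIC:
                    role = "with vowel" if has_vowel else "sole nucleus"
                    counts[(s, role)] += 1
    return counts


# Toy data: a single entry in the format produced by read_line().
toy_data = [
    {"orth": "abrasion",
     "ipa": ["ə", "b", "ɹ", "eɪ", "ʒ", "n̩"],
     "syll": [(["ə"], 0, 1), (["b", "ɹ", "eɪ"], 1, 4), (["ʒ", "n̩"], 4, 6)]},
]

counts = syllabic_roles(toy_data, plain_vowels={"ə", "eɪ"})
for (symbol, role), n in counts.items():
    print(symbol, role, n)
```

If most occurrences turn out to be the sole nucleus of their syllable, that is an argument for treating these symbols like vowels.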

In [None]:
# your code here
IPAS = set()

# collect every IPA symbol from the transcription column of both files
for path in ('data/train.tsv', 'data/dev.tsv'):
    with open(path, encoding='utf-8') as f:
        for line in f:
            orth, trans, syll = line.strip().split(sep='\t')
            IPAS.update(trans.split(sep=' '))

print(sorted(IPAS))

CONS = {'f', 'ʃ', 'j', 'l', 'ɹ', 'h', 'b', 'v', 'w', 'd',
        'ð', 'k', 'x', 's', 'ʒ', 'θ', 't͡ʃ', 't', 'n', 'p',
        'd͡ʒ', 'ŋ', 'm', 'ɡ', 'z'}
VOWEL = {'ʌ', 'm̩', 'n̩', 'ɒ̃ː', 'l̩', 'ɪə', 'ŋ̩', 'ɒ', 'ə', 'aʊ',
         'ɪ', 'iː', 'æ̃ː', 'ɑː', 'æ', 'uː', 'ɛ', 'ɜː', 'ʊ', 'əʊ',
         'ɔː', 'aɪ', 'ɔɪ', 'eɪ', 'ʊə', 'ɑ̃ː', 'ɛə'}

### 3. `cv_syll()`

Now we can start creating the `cv_syll()` function. The function takes a list of IPA symbols as input. For example,

```
["t͡ʃ", "aɪ", "n", "ə", "t", "aʊ", "n", "z"]
```

It returns the syllabified string in the same format as `baseline()`:

```
[(["t͡ʃ", "aɪ"], 0, 2), (["n", "ə"], 2, 4), (["t", "aʊ", "n", "z"], 4, 8)]
```

where each element is a 3-tuple containing a string representation of a syllable like `["t͡ʃ", "aɪ"]` and its start and end index in the IPA string.

You should implement `cv_syll()` by looping through the input string. While looping, we need to keep track of two variables: 

* `index` the current index in the IPA string
* `start` the start index of the syllable which we are currently creating.

The index `start` will be initialized to 0 at the start of the process. Whenever you are done with a syllable, you will need to update the value of `start`.

Below you see an example of how `start` and `index` change when processing `["t͡ʃ", "aɪ", "n", "ə"]`:

```
   index = 0, start = 0 # Start of the process 
t͡ʃ index = 0, start = 0 
aɪ index = 1, start = 2 # Create the syllable (["t͡ʃ", "aɪ"], 0, 2) and update start
n  index = 2, start = 2
ə  index = 3, start = 4 # Create the syllable (["n", "ə"], 2, 4) and update start
                        # End of input. We return [(["t͡ʃ", "aɪ"], 0, 2), 
                        #                          (["n", "ə"], 2, 4)].
```

Whenever you encounter a vowel (i.e. a symbol which is in your `VOWEL` set), you should create a new syllable boundary. The only exception is when this is the last vowel in the word. We recommend that you first implement a simple version of `cv_syll()`, which creates a boundary after every vowel. When you are sure that the simple version works, you can improve it by handling the word-final syllable correctly.    

When you're done with `cv_syll()`, please run it on the development set and evaluate it against the gold standard annotations. You should get an F1-score of around 59%.

**Note!** Make sure that you return all the syllables under all conditions. You should make sure that the final syllable is returned both when the final vowel is followed by some consonants and when it is the last character of the word.

<details>
    <summary style="font-weight: bold;">Click here to see the first hint</summary>
    You can get <code>index</code> for free by using the <code>enumerate()</code> function:
    
    for index, symbol in enumerate(ipas):
        ...
    
</details>

<details>
    <summary style="font-weight: bold;">Click here to see the second hint</summary>
    You also need to deal with cases where there is just one consonant. For example,
    
    ['d']
    
    or
    
    ['z']    
</details>

In [None]:
# your code here

# sylls = []    <-- list to store all syllables
# current_syll = []
# start = 0

#        ɑː t ɪ f ɪ ʃ ə l ɪ
# index: 0  1 2 3 4 5 6 7 8

# ɑː
# start: 0
# index: 0
# current_syll: [ɑː]
# sylls: [([ɑː], 0, 1)] = [([ɑː], start, index+1)]
# current_syll = []
# start = index + 1 = 1

# t:
# start: 1
# index: 1
# current_syll: [t]

# ɪ
# start: 1
# index: 2
# current_syll: [t ɪ]
# sylls: [([ɑː], 0, 1), ([t ɪ], 1, 3)] # [([t ɪ], start, index+1)]
# current_syll = []
# start = index + 1 = 3

# ...

# ɪ
# start: 7
# index: 8
# current_syll: [l ɪ]
# add to sylls
# current_syll = []
# start = index + 1 = 9



def cv_syll(ipas):
    sylls = []
    current_syll = []

    start = 0
    for index, symbol in enumerate(ipas):
        current_syll.append(symbol)
        
        if symbol in VOWEL:
            # the end of the current syllable, append it!
            sylls.append((current_syll, start, index + 1))

            start = index + 1
            current_syll = []

        if index == len(ipas) - 1:  # end of ipa list

            # handle cases like [z] or [d]
            if sylls == []:
                sylls.append(([], start, index + 1))

            # ə ɹ ɛ s t
            # sylls = [[ə], [ɹ, ɛ]]
            # current_syll = [[s, t]]
            # take out the last syllable [ɹ, ɛ], combine it with current_syll

            sylls[-1] = (sylls[-1][0] + current_syll,
                         sylls[-1][1],  # original start
                         index + 1)

    return sylls


dev_syll = [d['syll'] for d in dev]

# run cv_syll() on the IPA symbols already stored in `dev`
cv_sylls = [cv_syll(d['ipa']) for d in dev]

evaluate(dev_syll, cv_sylls)

In [None]:
assert cv_syll(['ɑː', 't', 'ɪ', 'f', 'ɪ', 'ʃ', 'ə', 'l', 'ɪ']) == [(['ɑː'], 0, 1), (['t', 'ɪ'], 1, 3), (['f', 'ɪ'], 3, 5), (['ʃ', 'ə'], 5, 7), (['l', 'ɪ'], 7, 9)]
assert cv_syll(['t͡ʃ', 'aɪ', 'n', 'ə', 't', 'aʊ', 'n', 'z']) == [(['t͡ʃ', 'aɪ'], 0, 2), (['n', 'ə'], 2, 4), (['t', 'aʊ', 'n', 'z'], 4, 8)]
assert cv_syll(['z']) == [(['z'], 0, 1)]
print('Success!')

### 4. Improving the syllabifier

The `cv_syll()` function already handles many English syllable boundaries correctly. However, there is still room for improvement. For example, the function isn't very good at handling consonant clusters in words like `æ d v ə n t`, which is incorrectly syllabified as `æ . d v ə n t` instead of the correct `æ d . v ə n t`. 

Please investigate the performance of `cv_syll()` on the development data. Compare its output against the gold standard syllabifications, find errors, and try to figure out why the syllabification fails. Then write an improved version `clever_syll()` which fixes some of these errors. 

We suggest solving one problem at a time and iteratively improving `clever_syll`. Throughout the engineering process, it's a good idea to make sure that your modifications result in improved performance by using the `evaluate()` function.
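To find errors worth fixing, it helps to print a handful of words where the system output differs from the gold standard. The sketch below is one possible helper, not part of the workshop materials: `to_dotted()`, `find_errors()`, the toy data (including the orthographic form "advent") and the toy syllabifier are all assumptions made for illustration. In the notebook you would pass `dev` and `cv_syll` instead.

```python
def to_dotted(sylls):
    """Render [(symbols, start, end), ...] as e.g. 'æ d . v ə n t'."""
    return " . ".join(" ".join(symbols) for symbols, _, _ in sylls)


def find_errors(data, syllabifier, limit=5):
    """Return up to `limit` (orth, gold, predicted) triples that disagree."""
    errors = []
    for entry in data:
        predicted = syllabifier(entry["ipa"])
        if predicted != entry["syll"]:
            errors.append((entry["orth"], to_dotted(entry["syll"]), to_dotted(predicted)))
            if len(errors) >= limit:
                break
    return errors


# Toy stand-ins so the sketch runs on its own.
toy_data = [
    {"orth": "advent",
     "ipa": ["æ", "d", "v", "ə", "n", "t"],
     "syll": [(["æ", "d"], 0, 2), (["v", "ə", "n", "t"], 2, 6)]},
    {"orth": "Bulls",
     "ipa": ["b", "ʊ", "l", "z"],
     "syll": [(["b", "ʊ", "l", "z"], 0, 4)]},
]


def one_syllable(ipa):
    """Toy syllabifier: treats the whole word as a single syllable."""
    return [(list(ipa), 0, len(ipa))]


for orth, gold, predicted in find_errors(toy_data, one_syllable):
    print(f"{orth}: gold {gold} | predicted {predicted}")
# advent: gold æ d . v ə n t | predicted æ d v ə n t
```

Reading errors in the dotted form makes recurring patterns, such as particular consonant clusters, easier to spot than in the raw tuple format.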

You should be able to get an F1-score above 60%, possibly closer to 70%.

In [None]:
# your code here

## Python refresher workshop 3

For this week, we'll be tracking some statistics about the segmental environments where syllable boundaries (or lack thereof) occur. These numbers will be useful for the tasks we're going to perform next week.

### 1. Extracting unigram segmental environments

The first type of statistics we'll extract is the segments immediately before and after a syllable boundary or a "non-boundary". For instance, given the syllabified word `ə . d æ p . t ə`, there are two (non-edge) syllable boundaries in the environments of `ə_d` and `p_t`, and there are three non-boundaries in the environments of `d_æ`, `æ_p`, and `t_ə`. Your task here is to build two counter dictionaries to track the frequency of the different segmental environments associated with the boundaries and non-boundaries from the words in the training set. Your dictionary counters should have the following structure:

```
uni_bndry_cntr = {
    ('ə', '_', 'd'): 507,  # the number represents the # of times this environment occurs with a syllable boundary
    ('p', '_', 't'): 340,
    ...
}

uni_non_bndry_cntr = {
    ('d', '_', 'æ'): 213,  # the number represents the # of times this environment occurs without a syllable boundary
    ('æ', '_', 'p'): 380,
    ('t', '_', 'ə'): 4160,
    ...
}
```
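As a rough sketch of how the environments for a single word can be enumerated, the example below works from the `syll` tuples produced by `read_line()` rather than from the `|`-marked string the second hint suggests. The function name `unigram_environments()` and the local counter names are assumptions made for illustration, and the counts come from one toy word only, not from the training set.

```python
from collections import Counter


def unigram_environments(sylls):
    """Yield (environment, is_boundary) for every pair of adjacent symbols."""
    ipa = [symbol for symbols, _, _ in sylls for symbol in symbols]
    # Word-internal boundaries sit at the end index of every syllable but the last.
    boundary_positions = {end for _, _, end in sylls[:-1]}
    for i in range(1, len(ipa)):
        yield (ipa[i - 1], "_", ipa[i]), i in boundary_positions


# ə . d æ p . t ə
word = [(["ə"], 0, 1), (["d", "æ", "p"], 1, 4), (["t", "ə"], 4, 6)]

bndry, non_bndry = Counter(), Counter()
for env, is_boundary in unigram_environments(word):
    if is_boundary:
        bndry[env] += 1
    else:
        non_bndry[env] += 1

print(sorted(bndry))      # [('p', '_', 't'), ('ə', '_', 'd')]
print(sorted(non_bndry))  # [('d', '_', 'æ'), ('t', '_', 'ə'), ('æ', '_', 'p')]
```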

<details>
    <summary style="font-weight: bold;">Click here to see the first hint</summary>
    Using a pre-packaged <code>Counter</code> object can save you some time. Observe the following example:
    
    from collections import Counter
    
    
    example_list = ['a', 'a', 'b', 'b', 'b', 'c']
    example_counter = Counter()
    
    for l in example_list:
        example_counter[l] += 1
    
Your <code>example_counter</code> will then look like this:
    
    {'a': 2, 'b': 3, 'c': 1}
    
</details>

<details>
    <summary style="font-weight: bold;">Click here to see the second hint</summary>
    You'll notice that you need to find a way to identify "non-boundaries". One way to do this is to explicitly insert them into your transcription first, so when you loop through the transcription, you know whether you have run into a boundary or a non-boundary. For example, given <code>ə . d æ p . t ə</code>, you can first turn it into <code>ə . d | æ | p . t | ə</code>, so when you loop through it, you know you have encountered a syllable boundary if it's a <code>.</code> and a non-boundary if it's a <code>|</code>.
    
</details>

In [None]:
# your code here

In [None]:
assert uni_bndry_cntr[('ə', '_', 'd')] == 507
assert uni_non_bndry_cntr[('ə', '_', 'd')] == 891

print('Success!')

### 2. Extracting bigram segmental environments

This task is similar to what you just did, but we expand our window to include the previous 2 and the following 2 segments of a syllable boundary/non-boundary. You'll notice that you need to pad each word at the beginning and end before you extract environments. For instance, with the word `ə . d æ p . t ə`, you need to first pad the word to `# ə . d æ p . t ə #`. Then for boundaries, you'll have `#ə_dæ` and `æp_tə`, and for non-boundaries, you'll have `əd_æp`, `dæ_pt`, and `pt_ə#`. Again, create two counters to track the number of different environments associated with both boundary types from the words in the training set:

```
bi_bndry_cntr = {
    ('#', 'ə', '_', 'd', 'æ'): 6,
    ('æ', 'p', '_', 't', 'ə'): 8,
    ...
}

bi_non_bndry_cntr = {
    ('ə', 'd', '_', 'æ', 'p'): 6,
    ('d', 'æ', '_', 'p', 't'): 8,
    ('p', 't', '_', 'ə', '#'): 11,
    ...
}
```
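The padding step can be sketched for a single word as follows. This is only an illustration under the assumption that words are available in the `syll` tuple format from `read_line()`; the function name `bigram_environments()` is made up for this example.

```python
def bigram_environments(sylls, pad="#"):
    """Yield (environment, is_boundary) with two symbols of context per side."""
    ipa = [pad] + [symbol for symbols, _, _ in sylls for symbol in symbols] + [pad]
    # Indices are shifted by one because of the leading pad symbol.
    boundary_positions = {end + 1 for _, _, end in sylls[:-1]}
    # i is the index of the symbol right after the gap; the gaps next to the
    # pad symbols themselves are skipped, as in the example above.
    for i in range(2, len(ipa) - 1):
        env = (ipa[i - 2], ipa[i - 1], "_", ipa[i], ipa[i + 1])
        yield env, i in boundary_positions


# ə . d æ p . t ə
word = [(["ə"], 0, 1), (["d", "æ", "p"], 1, 4), (["t", "ə"], 4, 6)]

for env, is_boundary in bigram_environments(word):
    print("".join(env), "boundary" if is_boundary else "non-boundary")
# #ə_dæ boundary
# əd_æp non-boundary
# dæ_pt non-boundary
# æp_tə boundary
# pt_ə# non-boundary
```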

In [None]:
# your code here

In [None]:
assert bi_bndry_cntr[('æ', 'p', '_', 't', 'ə')] == 8
assert bi_non_bndry_cntr[('ə', 'd', '_', 'æ', 'p')] == 6

print('Success!')

### 3. Extracting unigram and bigram CV environments

This task involves abstraction over the first and second task. That is, instead of having an environment for each unique segmental combination, your environment simply tells you whether the preceding and following segments are a consonant `C` or a vowel `V`. For instance, given the syllabified word `ə . d æ p . t ə`, there are two syllable boundaries in the unigram environments of `V_C` (from `ə_d`) and `C_C` (from `p_t`), and there are three non-boundaries in the unigram environments of `C_V` (from `d_æ` and `t_ə`) and `V_C` (from `æ_p`). You should perform the same transformation for different bigram environments by, for example, turning `pt_ə#` into `CC_V#`.

Create four dictionary counters---`uni_bndry_cv_cntr`, `uni_non_bndry_cv_cntr`, `bi_bndry_cv_cntr`, and `bi_non_bndry_cv_cntr`---that store the frequency information of different environments. For example, your `bi_non_bndry_cv_cntr` should look like:

```
bi_non_bndry_cv_cntr = {
    ('#', 'V', '_', 'C', '#'): 113,
    ('#', 'C', '_', 'V', 'C'): 47872,
    ('C', 'V', '_', 'C', '#'): 39376,
    ('#', 'C', '_', 'C', 'V'): 13721,
    ...
}
```
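The abstraction step amounts to mapping every segment in an environment tuple to `C` or `V` while leaving `_` and `#` untouched. A minimal sketch, assuming a hypothetical helper `to_cv()` and a toy vowel set (in the notebook you would use your own `VOWEL` set):

```python
def to_cv(env, vowels):
    """Map each segment in an environment tuple to 'V' or 'C', keeping '_' and '#'."""
    return tuple(
        s if s in ("_", "#") else ("V" if s in vowels else "C")
        for s in env
    )


toy_vowels = {"ə", "æ"}  # toy stand-in for the VOWEL set

print(to_cv(("p", "t", "_", "ə", "#"), toy_vowels))  # ('C', 'C', '_', 'V', '#')
print(to_cv(("ə", "_", "d"), toy_vowels))            # ('V', '_', 'C')
```

Because the mapping works on tuples of any length, the same helper covers both the unigram and the bigram environments.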


In [None]:
# your code here

In [None]:
assert uni_bndry_cv_cntr[('C', '_', 'C')] == 35728
assert uni_non_bndry_cv_cntr[('C', '_', 'C')] == 57891    
assert bi_non_bndry_cv_cntr[('V', 'C', '_', 'V', 'C')] == 61265
assert bi_bndry_cv_cntr[('V', 'C', '_', 'C', 'V')] == 27337

print('Success!')

### 4. Turning frequencies into probabilities

Your last task for this week is to turn counts/frequencies in all of your dictionary counters into probabilities. Consider the following toy example:

```
example_cntr = {
    'a': 2,
    'b': 5,
    'c': 8,
    'd': 5
}
```

We can then turn the count of each key-value pair into probability by dividing the count by the sum of all values. So, for example, `a` now has a probability of $2 / (2 + 5 + 8 + 5) = 0.1$. The final dictionary of probabilities is then:

```
example_prob = {
    'a': 0.1,
    'b': 0.25,
    'c': 0.4,
    'd': 0.25
}
```
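The normalization can be sketched as a small function that works on any dictionary of counts, including a `Counter`. The function name `to_probabilities()` is an assumption made for this example.

```python
def to_probabilities(counter):
    """Divide every count by the sum of all counts."""
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}


example_cntr = {"a": 2, "b": 5, "c": 8, "d": 5}
example_prob = to_probabilities(example_cntr)

print(example_prob)  # {'a': 0.1, 'b': 0.25, 'c': 0.4, 'd': 0.25}
```

The resulting values sum to 1 (up to floating-point rounding), which is a quick sanity check for each of your probability dictionaries.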

In [None]:
# your code here
# you can call your probability dicts uni_bndry_prob, uni_non_bndry_prob, etc.

In [None]:
assert 0.003 < uni_bndry_prob[('ə', '_', 'd')] < 0.004
assert 0.002 < uni_non_bndry_prob[('ə', '_', 'd')] < 0.003

print('Success!')