---

When using this notebook in **Playground mode**:
- You can **run all cells** and see the results.
- **Any changes you make will not be saved** after you close the notebook.

If you’d like to make changes and save your work, please create a copy of this notebook in your own Google Drive:
1. Go to **File > Save a copy in Drive**.
2. This will create an editable version in your Drive, where you can modify and save your changes.

If you ever want to reset the notebook to its original state, use this [link](https://colab.research.google.com/drive/1qdCe_t7aMTWtKmp5Ff3xWmmNpBcIlIVo#scrollTo=WXT8ZUvEKS8K&forceEdit=true&sandboxMode=true) to reopen it in Playground mode.



---

# Challenge: Building Your First Computational Model That Processes Transitional Probabilities Between Syllables

In this challenge, you will **build a simple computational model** that analyses transitional probabilities between syllables in a continuous stream of speech. This model will help simulate how infants might use a statistical learning mechanism to segment words from fluent speech, replicating the findings of Saffran et al. ([1996](https://doi.org/10.1126/science.274.5294.1926)).

Saffran et al.'s study was groundbreaking and had a significant impact on the fields of statistical learning and cognitive science (e.g., Frost et al., [2019](https://doi.org/10.1037/bul0000210)). It provided compelling evidence that infants could use statistical cues to segment words from continuous speech. This finding challenged existing theories that emphasised innate linguistic knowledge, highlighting instead the role of experience-based learning mechanisms in language acquisition.

Following this study, there was a surge of research investigating how statistical learning contributes to a wide range of cognitive functions, such as vision, audition, speech perception, reading, and motor planning, to name but a few. The study catalysed discussions on whether statistical learning is a domain-general mechanism and how it interacts with other cognitive processes. Overall, Saffran et al.'s work laid the foundation for a rich and expanding field that seeks to understand how humans (and even non-human species) extract regularities from their environment to facilitate learning and cognition.

<br>

---

## Key Concepts

- **Statistical Learning**: The ability to perceive and learn patterns in the environment that are either spatial or temporal in nature (Frost et al., [2019](https://doi.org/10.1037/bul0000210)).

- **Syllable Transitional Probability (TP)**: The probability of one syllable following another in a speech stream.

- **Computational Model**: A set of mathematical and computational tools used to simulate a real-world system or process. In psychological science, computational models help formalise theories about mental processes by implementing them in code (Guest & Martin, [2021](https://doi.org/10.1177/1745691620970585)). Building a computational model involves:
  1. **Theory Building**: Developing a conceptual understanding of the phenomenon.
  2. **Specification**: Formalising the theory with mathematical equations or algorithms.
  3. **Implementation**: Writing code to execute the model.
  4. **Testing and Analysis**: Running the model and comparing predictions with real-world data.

<br>

---

## Background

### The Experiment by Saffran et al. ([1996](https://doi.org/10.1126/science.274.5294.1926))

Saffran and colleagues proposed that infants track **transitional probabilities (TPs)** between syllables to identify word boundaries in continuous speech.

#### Experiment Overview

- **Stimuli**: Four three-syllable nonsense words:
  - "tupiro"
  - "golabu"
  - "bidaku"
  - "padoti"

- **Procedure**:
  - **Familiarisation**: Infants were exposed to a continuous 2-minute stream of the nonsense words. Following a technique developed by Jusczyk and Aslin ([1995](https://doi.org/10.1006/cogp.1995.1010)), the stream contained 180 randomly selected words (540 syllables) without pauses between them, ensuring that the same word never occurred twice in a row.
  
  - **Testing**: Twenty-four infants were tested on their ability to discriminate between words and “part-words”. The latter were sequences formed from the final syllable of one word and the first two syllables of another: "rogola," "bubida," "kupado," and "titupi". The infants’ attention was directed to a speaker using a flashing red light. Words and part-words from the familiarisation sequence were repeatedly broadcast from the speaker, and the infants’ gaze duration at the speaker was recorded.

   <img src="https://drive.google.com/uc?export=view&id=17cY_RBfF8AbM2eafigxyLs5gVAHNk0ie" width="450px" border="2px">
   
   *AI-generated example image*
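
The part-word construction rule described above can be sketched in a few lines of Python. This is only an illustration, assuming that every syllable is two letters long and that each part-word pairs a word with the next word in the list (wrapping around at the end). The helper name `make_part_words` is hypothetical and not part of the original study materials.

```python
def make_part_words(word_list, syllable_length=2):
    """Build part-words: final syllable of one word + first two syllables of the next."""
    part_words = []
    for i, word in enumerate(word_list):
        # Pair each word with the following one, wrapping around at the end of the list
        next_word = word_list[(i + 1) % len(word_list)]
        final_syllable = word[-syllable_length:]
        first_two_syllables = next_word[:2 * syllable_length]
        part_words.append(final_syllable + first_two_syllables)
    return part_words

words = ["tupiro", "golabu", "bidaku", "padoti"]
part_words = make_part_words(words)
print(part_words)  # ['rogola', 'bubida', 'kupado', 'titupi']
```

Each part-word therefore spans a word boundary, which is what makes it statistically different from a word even though all of its syllables were heard during familiarisation.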


#### Findings

- Infants looked at the speaker significantly longer when they heard part-words compared with words, indicating they could distinguish between words and part-words based on TPs.
- This behaviour suggests that infants possess a statistical learning mechanism that is sensitive to syllable TPs.

<br>

---

#### A Quick Look at Saffran et al.'s Results

**Required Python Knowledge:**

- **Basic Python Syntax:**
  - Understanding code comments, variable assignments, lists, and function calls.

- **Importing Libraries:**
  - Familiarity with importing libraries and using aliases.

- **Matplotlib Basics:**
  - Creating bar charts with Matplotlib.
  - Adjusting figure size, adding labels/titles, and configuring error bars.

In [None]:
import matplotlib.pyplot as plt

# Data for the experiment
labels = ['Words', 'Part-words']
mean_times = [7.97, 8.85]  # Mean fixation times (in seconds)
se = [0.41 * 1.96, 0.45 * 1.96] # Error-bar half-widths: standard error x 1.96 (approximate 95% confidence interval)

# Plotting the bar chart
plt.figure(figsize=(4, 3)) # Set figure size
plt.bar(labels, mean_times, yerr=se, capsize=5) # Create bar chart with error bars
plt.ylabel('Mean Fixation Time (s)') # Y-axis label
plt.title("Saffran et al.'s (1996) results")  # Title of the plot
plt.show()


## Computational Modelling Steps

In this challenge, you will simulate the statistical learning mechanism proposed by Saffran et al.

### Overview of Steps:

1. **Data Generation**: Create 24 random continuous streams of the nonsense words (the same number as the infants tested in Saffran et al.'s study), following the stimuli construction rules from the original study.
*Note*: In the original study, all infants heard the same continuous stream. Here, we generate a different random stream for each simulated infant, which is a nice way to rule out the possibility that performance is affected by the specific order of syllables in the stream.

2. **Model Specification**: Define a computational model that can calculate transitional probabilities between syllables in a continuous stream.

3. **Model Implementation**: Run 24 computational models, one for each of the continuous streams.

4. **Analysis**:
   - Demonstrate that TPs within words are higher than TPs between words within each continuous stream.
   - Expose each model to the test items (words and part-words) and compute a score that could serve as a proxy for "fixation time".
   - Compare the models' "fixation" times to those of infants from Saffran et al.'s study.

<br>

---

### Section 1: Data Generation

**Objective**: Generate 24 random continuous streams of nonsense words, following the stimuli construction rules from Saffran et al.'s study.

**Required Python Knowledge**:

- Randomisation and Control of Random Seeds
- String and List Manipulation
- Conditional Statements and Loops
- Function Definition and Parameter Use
- List Comprehensions
- Console Output for Debugging (e.g., printing intermediate results to verify that resulting objects meet expected specifications)

In [None]:
# Create a list named "words" which contains the nonsense words
# Pro Tips:
# - The variable should be a list
# - Each element should be a string

# INSERT CODE BELOW


<details>
  <summary>Reveal a solution</summary>
  
```python
# Note: This is one possible solution, but other valid approaches may exist!

words = ["tupiro", "golabu", "bidaku", "padoti"] # define the list

print("Word list:", words) # print the list to console to check the result
```
</details>

In [None]:
# Great! Now, using the "words" list, we need to generate our experimental stream
# comprising 180 words sampled at random, with no word appearing twice in a row.
# Let's break down this problem a bit first to make it digestible, shall we?

# Sample the first word from the list and name it "first_word"
# This will be the starting word in our sequence
# Pro Tips:
# - Use a sampling method from the "random" library
# - Set a seed for reproducibility
# - The resulting word should be a string

# INSERT CODE BELOW


<details>
  <summary>Reveal a solution</summary>
  
```python
# Note: This is one possible solution, but other valid approaches may exist!

import random  # Import the random module for random sampling functions

random.seed(42)  # Set the seed to ensure reproducibility

first_word = random.choice(words)  # Sample the first word
print("First word:", first_word)  # Print the first word to check the result

second_word = random.choice(words)  # Sample an initial second word

# Resample until the second word is different from the first
while second_word == first_word:
  second_word = random.choice(words)

print("Second word:", second_word)  # Print the second word to check it is different from the first
```
</details>

In [None]:
# Nice! Now we can go a step further and repeat the above process until we have a final
# experimental stream of 180 words, with no word occurring twice in a row.

# From the "words" list, sample 180 words with replacement (also ensuring no word occurs twice in a row),
# and name the resulting object "experimental_stream"
# Pro Tips:
# - Initialise an empty list called "experimental_stream" to store the sampled words at each loop iteration
# - Set up a for loop that repeats 180 times
# - Use a sampling method with replacement that avoids consecutive word duplicates
# - Set a seed for reproducibility
# - The resulting sample should be a list called "experimental_stream" with 180 string elements

# INSERT CODE BELOW


<details>
  <summary>Reveal a solution</summary>
  
```python
# Note: This is one possible solution, but other valid approaches may exist!

experimental_stream = []  # Initialise an empty list to store the sampled words

random.seed(42)  # Set the seed to ensure reproducibility

# Sample 180 words, ensuring no two consecutive words are the same
for _ in range(180):
    next_word = random.choice(words)  # Sample a word from the list
    # Ensure the new word is not the same as the last word in the list
    while experimental_stream and next_word == experimental_stream[-1]:
        next_word = random.choice(words)  # Resample if the word is the same as the last one
    experimental_stream.append(next_word)  # Add the word to the stream

print("Experimental stream:", experimental_stream)  # Print the entire list to the console
print("Number of words in experimental stream:", len(experimental_stream))  # Also verifying that it contains 180 elements
```
</details>

In [None]:
# Amazing! You have now created a list of words that can be presented
# continuously without pauses, forming an experimental stream that follows Saffran et al.'s method.

# Now, the final step is to generate 24 different experimental streams
# (for our 24 models simulating the 24 infants).

# Pro Tips:
# - Use the code in the previous section to define a general function called "generate_stream", that can create an individual stream with 180 words
# - Use a list comprehension to create 24 unique streams, each generated independently
# - Remember to set a seed for reproducibility
# - Each stream will be an element in a list called "all_streams"
# - Thus, the resulting "all_streams" object should be a list of lists

# INSERT CODE BELOW


<details>
  <summary>Reveal a solution</summary>
  
```python
# Note: This is one possible solution, but other valid approaches may exist!

# Define a function to generate a single experimental stream of 180 words
def generate_stream(word_list, length):
  """
  Generates a list of words with a specified number of words, ensuring no consecutive duplicates.
  
  Parameters:
  - word_list: List of words to sample from
  - length: Number of words in the generated stream
  
  Returns:
  - A list containing the generated stream of words
  """
  stream = []  # Initialise an empty list to store the sampled words
  for _ in range(length):
    next_word = random.choice(word_list)  # Sample a word from the list
    # Ensure the new word is not the same as the last word in the list
    while stream and next_word == stream[-1]:
      next_word = random.choice(word_list)  # Resample if the word is the same as the last one
    stream.append(next_word)  # Add the word to the stream
  return stream

random.seed(42)  # Set the seed to ensure reproducibility

# Use a list comprehension to generate 24 different experimental streams
all_streams = [generate_stream(words, 180) for _ in range(24)]

# Print the number of streams and the length of each to verify correctness
print("Number of experimental streams:", len(all_streams))
print("Number of words in each experimental stream:", [len(stream) for stream in all_streams])

# Print one of the streams to see its elements
print("8th experimental stream:", all_streams[7])
```
</details>
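
One way to confirm that a generated stream respects the no-immediate-repetition rule is to compare each word with its successor. The check below is a minimal sketch: `has_immediate_repeat` is a hypothetical helper name, and the two short lists are made-up examples rather than real experimental streams.

```python
def has_immediate_repeat(stream):
    """Return True if any word in the stream is immediately followed by itself."""
    return any(current == following for current, following in zip(stream, stream[1:]))

valid_stream = ["tupiro", "golabu", "tupiro", "bidaku"]
invalid_stream = ["tupiro", "tupiro", "golabu"]

print(has_immediate_repeat(valid_stream))    # False
print(has_immediate_repeat(invalid_stream))  # True
```

Applied to the streams generated above, `any(has_immediate_repeat(s) for s in all_streams)` should evaluate to `False`.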

In [None]:
# Final Section Note
# In the original study by Saffran et al., each word appeared exactly 45 times in the continuous stream.
# As a homework task, try to implement this constraint: generate a stream where words are selected at random,
# no word is repeated consecutively, and each word occurs exactly 45 times.

# In our case, we can reasonably omit this constraint, because across our 24 different random streams,
# each word should be repeated 45 times on average.

# Here's some code to verify this:

from collections import Counter
import numpy as np

# Compute word counts per stream
stream_word_counts = [Counter(stream) for stream in all_streams]

# For each word, gather its count across all 24 streams
word_frequencies = {word: [] for word in words}
for stream_count in stream_word_counts:
    for word in words:
        word_frequencies[word].append(stream_count[word])

# Print average and standard deviation per word
print("Average and standard deviation of word frequencies across the 24 streams:\n")
for word in words:
    counts = word_frequencies[word]
    mean = np.mean(counts)
    std = np.std(counts)
    print(f"{word}: M = {mean:.2f}, SD = {std:.2f}")
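
For the homework constraint mentioned above (each word exactly 45 times, no immediate repetitions), one possible approach is sketched below. It is not the procedure used in the original study and it does not sample uniformly from all valid orderings: it draws each next word with a weight proportional to how many repetitions that word has left, and starts over if it reaches a dead end. The function name `generate_balanced_stream` and the use of a separate `random.Random` instance are assumptions made for illustration.

```python
import random
from collections import Counter

def generate_balanced_stream(word_list, repetitions, rng):
    """Build a stream in which every word occurs exactly `repetitions` times
    and no word is immediately repeated. Restarts whenever it hits a dead end."""
    total_length = repetitions * len(word_list)
    while True:
        remaining = {word: repetitions for word in word_list}
        stream = []
        for _ in range(total_length):
            # Candidates: words with repetitions left that differ from the previous word
            candidates = [w for w in word_list
                          if remaining[w] > 0 and (not stream or w != stream[-1])]
            if not candidates:
                break  # Dead end: only the previous word is left, so start again
            weights = [remaining[w] for w in candidates]
            next_word = rng.choices(candidates, weights=weights, k=1)[0]
            stream.append(next_word)
            remaining[next_word] -= 1
        else:
            return stream  # The loop finished without a dead end

rng = random.Random(42)  # Dedicated generator, so the global seed is left untouched
balanced_stream = generate_balanced_stream(
    ["tupiro", "golabu", "bidaku", "padoti"], 45, rng
)
print(len(balanced_stream), Counter(balanced_stream))
```

Weighting by the remaining counts keeps the four words roughly in step with one another, which makes dead ends less likely than with unweighted sampling.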

### Section 2: Model Specification

**Objective**: Define a computational model that calculates transitional probabilities between two co-occurring events (e.g., a pair of adjacent syllables).

**Required Knowledge**:

- Basic Concepts in Probability Theory (e.g., conditional probability and joint probability)
- Basic Concepts in Statistical Analysis (e.g., frequency counts, statistical relationships between events)
- Basic Understanding of Mathematical Notation and Equations

**Define the Core Function that Calculates a Transitional Probability**:

The transitional probability $P(B \mid A)$ represents the probability of event $B$ occurring immediately after event $A$. It is calculated using the following formula:

$$
P(B \mid A) = \frac{C(A, B)}{C(A)}
$$

Where:
- $C(A, B)$ is the count of times event $B$ follows event $A$.
- $C(A)$ is the total count of event $A$ occurring.
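
To make the formula concrete, here is a small worked example on a toy syllable sequence. The sequence is made up for illustration (the words "tupiro", "golabu", "tupiro", "bidaku", "tupiro" in a row), and the names `toy_syllables` and `toy_tp` are hypothetical.

```python
from collections import Counter

# Toy sequence (an assumption for illustration): tupiro golabu tupiro bidaku tupiro
toy_syllables = ["tu", "pi", "ro", "go", "la", "bu",
                 "tu", "pi", "ro", "bi", "da", "ku",
                 "tu", "pi", "ro"]

count_A = Counter(toy_syllables)                            # C(A): how often each syllable occurs
count_AB = Counter(zip(toy_syllables, toy_syllables[1:]))   # C(A, B): how often B follows A

def toy_tp(a, b):
    """P(B | A) = C(A, B) / C(A) for the toy sequence."""
    return count_AB[(a, b)] / count_A[a]

print(toy_tp("tu", "pi"))            # 1.0  ("tu" is always followed by "pi")
print(round(toy_tp("ro", "go"), 2))  # 0.33 ("ro" occurs 3 times, followed by "go" once)
```

Note that $C(A)$ here counts every occurrence of $A$, including a stream-final one that has no successor. In the full design, with four words and no immediate repetitions, each word-final syllable can be followed by three different word-initial syllables, so TPs across word boundaries should hover around 1/3 while TPs within words stay at 1.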

<br>

---

### Section 3: Model Implementation

**Objective**: Implement the transitional probability model to calculate the probability of one syllable following another within each continuous stream.

**Required Python Knowledge**:

- Function Definition and Parameter Use
- Regular Expressions (`re` library) for Pattern Matching
- `Counter` from the `collections` module for Frequency Counting
- Using `zip` for Overlapping Pairs in Lists
- Using `product` from the `itertools` library for Generating All Ordered Pairs (Cartesian Product)
- List and Dictionary Manipulation
- Loops and List Comprehensions
- Console Output for Debugging (e.g., printing intermediate results to check frequency counts)

**Tasks**:

1. **Based on the model specification, construct a Python function that calculates the transitional probability between two adjacent syllables.**

2. **Run the model for each of the 24 experimental streams.**

<br>

---

In [None]:
# Well done for getting this far! Now, let’s build a function to compute transitional probabilities
# between two adjacent events, like two adjacent syllables.
#
# Pro tips:
# - Define a function named "transitional_probability" that takes in two arguments:
#   (1) count_AB: the number of times event B follows event A
#   (2) count_A: the total number of occurrences of event A
# - Within the function, ensure the denominator is not zero to avoid division errors.
# - Return the calculated probability value as a float.
#
# INSERT CODE BELOW


<details>
  <summary>Reveal a solution</summary>
  
```python
# Note: This is one possible solution, but other valid approaches may exist!

def transitional_probability(count_AB, count_A):
    """
    Calculates the transitional probability of event B following event A.
    
    Parameters:
    - count_AB (int): The count of occurrences where event B follows event A.
    - count_A (int): The total count of occurrences of event A.
    
    Returns:
    - float: The transitional probability P(B | A).
    """
    if count_A == 0:
        return 0.0  # Avoid division by zero if event A never occurs (a float, as documented)
    return count_AB / count_A
```
</details>

In [None]:
# Great, we have a basic transitional probability function.
# Consider it the core of our transitional probability model.

# The basic function will need to work with two types of frequencies:
# - frequencies AB (frequency of syllable B following syllable A)
# - and frequencies A (frequency of syllable A)

# Therefore, it will be easier to calculate all these frequencies in advance,
# so they are ready for the model to access when needed.

# For now, let's focus on the denominator of the transitional probability formula, as it’s easier to digest.
# We'll take one of our 24 streams (e.g., stream 1) and compute the frequency
# of each possible syllable in the stream.

# We already know that the possible syllables are:
syllables = ["tu", "pi", "ro", "go", "la", "bu", "bi", "da", "ku", "pa", "do", "ti"]

# To check whether a syllable occurs in a word, we need to split the word
# into its individual syllables, based on our list of possible syllables.

# For example, let’s start by trying to split the word "bidaku" into its component syllables.
word = "bidaku"

# Pro tips:
# - Step 1: Create a regular expression pattern to match any syllable in the "syllables" list.
#   * Join the syllables with the '|' symbol to form a pattern that matches any one of these syllables.
# - Step 2: Compile the regex pattern for efficient matching.
# - Step 3: Use `re.findall()` with the compiled pattern to find all syllables in the word.
#   * This returns a list of syllables from the word in sequence, based on the matches.
# - Step 4: Verify that all parts of the word are valid syllables.
#   * Reconstruct the word by joining the extracted syllables together.
#   * If this reconstructed word differs from the original, raise a `ValueError` to indicate that
#     unmatched parts exist in the word.
# - Make sure to print the syllables found to confirm that the word is split correctly!

# INSERT CODE BELOW


<details>
  <summary>Reveal a solution</summary>
  
```python
# Note: This is one possible solution, but other valid approaches may exist!

import re  # Import the regular expressions module

# Step 1: Create a regex pattern from the syllables list and compile it for efficient matching
# Join all syllables with '|' to match any one of them in the word
syllable_pattern = '|'.join(syllables)
syllable_regex = re.compile(syllable_pattern)

# Step 2: Use the regex to find all matching syllables in sequence within the word
split_syllables = syllable_regex.findall(word)
print(f"Syllables in '{word}':", split_syllables)

# Step 3: Verify that all parts of the word are valid syllables by reconstructing the word
# If the reconstructed word differs from the original, raise an error
reconstructed_word = ''.join(split_syllables)
if reconstructed_word != word:
  raise ValueError(f"The word '{word}' contains unmatched parts that do not correspond to valid syllables.")
```
</details>
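
The simple `'|'.join(syllables)` pattern works here because every syllable has exactly two letters and none contains regex metacharacters. For other syllable inventories this may not hold, so a slightly more defensive variant is sketched below. It is an optional extension rather than something the exercise requires, and `build_syllable_regex` is a hypothetical helper name.

```python
import re

def build_syllable_regex(syllables_list):
    """Compile a pattern that matches any syllable in the list.

    Longer syllables are tried first, so a short syllable cannot pre-empt a longer
    one that starts with the same letters, and re.escape guards against syllables
    that contain regex metacharacters.
    """
    ordered = sorted(syllables_list, key=len, reverse=True)
    return re.compile('|'.join(re.escape(s) for s in ordered))

# Made-up inventory in which one syllable is a prefix of another
demo_regex = build_syllable_regex(["ba", "baa"])
print(demo_regex.findall("baaba"))  # ['baa', 'ba']

# The notebook's own syllables behave exactly as before
syllables = ["tu", "pi", "ro", "go", "la", "bu", "bi", "da", "ku", "pa", "do", "ti"]
print(build_syllable_regex(syllables).findall("bidaku"))  # ['bi', 'da', 'ku']
```

Without the length ordering, the pattern `ba|baa` would split "baaba" into "ba" and "ba", and the reconstruction check would then fail.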

In [None]:
# Now that you’ve figured out how to identify syllables in a word,
# it’s easy to find all syllables in the experimental stream.
# Simply treat the experimental stream as one very loooooong word.

# Pro Tips:
# - Take the first experimental stream, all_streams[0], and join all words into a single large string with no spaces, called "collapsed_stream".
# - Apply the same code as above to find all syllables in this continuous string.
# - The result should be a list of syllables called "stream_syllables".

# - Magic Tip: Once, Richard Stallman appeared to me in a dream.
#   Amidst lines of code and the hum of Linux coming to life, he leaned over and whispered:
#   “When code repeats and grows a tad wild, wrap it in a function, and keep it styled!”
# - Follow Stallman's advice and create a function called "extract_syllables"
#   that can extract syllables from any given text using a provided syllables list.

# INSERT CODE BELOW


<details>
  <summary>Reveal a solution</summary>
  
```python
# Note: This is one possible solution, but other valid approaches may exist!

import re  # Import the regular expressions module

# Define a function to extract syllables from any given text using a provided syllables list
def extract_syllables(text, syllables_list):
    """
    Extracts syllables from the provided text based on a given list of syllables.
    
    Parameters:
    - text (str): A string from which syllables are to be extracted.
    - syllables_list (list): A list of valid syllables to match.
    
    Returns:
    - list: A list of extracted syllables.
    
    Raises:
    - ValueError: If there are unmatched parts in the text.
    """
    # Step 1: Create and compile the regex pattern from the syllables list
    syllable_pattern = '|'.join(syllables_list)
    syllable_regex = re.compile(syllable_pattern)
    
    # Step 2: Use the regex to find all matching syllables in sequence
    split_syllables = syllable_regex.findall(text)
    
    # Step 3: Verify that all parts of the text are valid syllables
    reconstructed_text = ''.join(split_syllables)
    if reconstructed_text != text:
        raise ValueError(f"The text '{text}' contains unmatched parts that do not correspond to valid syllables.")
    
    return split_syllables

collapsed_stream = ''.join(all_streams[0])  # Join the words in the first stream

# Apply the function to the continuous stream
stream_syllables = extract_syllables(collapsed_stream, syllables)
print(f"Syllables in the first experimental stream: {stream_syllables}")
print(f"Number of syllables in the stream: {len(stream_syllables)}")  # Verify that the syllables total 540 as per Saffran et al.'s design
```
</details>

In [None]:
# Now we have the object "stream_syllables", which contains all syllables in an experimental stream.
# Next, we can count how many times each unique syllable appears in the stream.
# This will give us all the frequencies we can use for the denominator of the transitional probability formula!

# Pro tip:
# - Use the "Counter" class from the "collections" module to compute the frequency of each unique syllable in "stream_syllables"
# - The result should be a dictionary-like Counter object in which each syllable is a key, and its frequency is the corresponding value.

# INSERT CODE BELOW


<details>
  <summary>Reveal a solution</summary>
  
```python
# Note: This is one possible solution, but other valid approaches may exist!

from collections import Counter  # Import Counter for easy frequency counting

# Count the frequency of each unique syllable in "stream_syllables"
syllable_frequencies = Counter(stream_syllables)

# Output the syllable frequencies
print("Syllable frequencies in the stream:")
for syllable, frequency in syllable_frequencies.items():
  print(f"{syllable}: {frequency}")

# Optionally, verify the total count matches the length of stream_syllables
print(f"Total number of syllables counted: {sum(syllable_frequencies.values())}")
```
</details>

In [None]:
# Great, you have extracted the frequencies needed to compute the denominator of the transitional probability.
# In the two code chunks above, we defined the function extract_syllables() that extracts the syllables from a stream,
# and then we counted the frequencies of the unique syllables in a separate step.
# However, we can do everything together by creating a function called "get_denominator_frequencies" that takes
# an experimental stream and a list of possible syllables, then finds the syllables in the stream and counts their frequencies.

# Pro tip:
# - You can extend the extract_syllables function by adding a step to count syllable frequencies with Counter,
#   creating a new function, "get_denominator_frequencies", to handle both tasks in one go.

# INSERT CODE BELOW


<details>
  <summary>Reveal a solution</summary>
  
```python
# Note: This is one possible solution, but other valid approaches may exist!

import re
from collections import Counter

# Define a function that collapses the stream, extracts syllables, and counts their frequencies
def get_denominator_frequencies(stream_words, syllables_list):
  """
  Collapses the stream of words into a continuous string, extracts syllables based on a given list,
  and counts their frequencies.

  Parameters:
  - stream_words (list): A list of words forming the experimental stream.
  - syllables_list (list): A list of valid syllables to match.

  Returns:
  - dict: A dictionary where each key is a syllable, and the value is its frequency.

  Raises:
  - ValueError: If there are unmatched parts in the collapsed stream.
  """
  # Step 1: Collapse the stream into a single continuous string
  collapsed_text = ''.join(stream_words)

  # Step 2: Create and compile the regex pattern from the syllables list
  syllable_pattern = '|'.join(syllables_list)
  syllable_regex = re.compile(syllable_pattern)

  # Step 3: Use the regex to find all matching syllables in sequence
  split_syllables = syllable_regex.findall(collapsed_text)

  # Step 4: Verify that all parts of the text are valid syllables
  reconstructed_text = ''.join(split_syllables)
  if reconstructed_text != collapsed_text:
    raise ValueError(f"The text '{collapsed_text}' contains unmatched parts that do not correspond to valid syllables.")

  # Step 5: Count the frequency of each syllable
  syllable_frequencies = Counter(split_syllables)

  return syllable_frequencies

# Apply the new function to the first stream
stream_syllable_frequencies = get_denominator_frequencies(all_streams[0], syllables)

# Output the syllable frequencies
print("Syllable frequencies in the stream:")
for syllable, frequency in stream_syllable_frequencies.items():
  print(f"{syllable}: {frequency}")
```
</details>

In [None]:
# Denominator is done! Give yourself a pat on the back.

# Now let's reflect on the numerator for a moment: "given a syllable pair AB, the numerator is the count of times syllable B follows syllable A."
# What does that mean? Well, it simply means we need to count the frequency of AB occurring! Simple as that.

# I know what you're thinking now
# "Well, what if instead of a list of possible syllables, I compute a list of possible syllable pairs (like AB) and pass it to the
# get_denominator_frequencies() function?"
# Maybe :D Let's try it

# First, we need to find a way to compute all possible syllable pairs.

# Pro tips:
# - Think of each syllable pair as a two-syllable "unit" (like "AB") where A and B can each be any syllable from the syllables list.
# - Use the "syllables" object and compute all possible two-syllable combinations. With 12 syllables, each one can be followed by any of the 12
#   (including itself), resulting in 12 x 12 = 144 combinations.
# - The result should be a list of 144 strings, named "syllable_pairs".

# INSERT CODE BELOW


<details>
  <summary>Reveal a solution</summary>
  
```python
# Note: This is one possible solution, but other valid approaches may exist!

from itertools import product

# Generate all possible pairs of syllables
syllable_pairs = [''.join(pair) for pair in product(syllables, repeat=2)]

# Print the generated pairs (optional)
print("Possible syllable pairs:", syllable_pairs)
```
</details>

In [None]:
# Now run the function using syllable_pairs as the reference list
# WARNING: The function will throw an error :D Head to the next colab cell to find out why!
get_denominator_frequencies(["tupiro"], syllable_pairs)

In [None]:
# We get a ValueError. Strange, the function at some point stops because it cannot find a valid syllable pair matching part of "tupiro".

# When operating with syllables, for the word "tupiro" we need to find "tu", "pi", and "ro".
# Instead, when operating with syllable pairs, these are the syllable pairs we need to find: "tupi", "piro".

# The way our get_denominator_frequencies() function works now is that it tries to divide "tupiro" into
# "tupi" and "ro" (so it fails because "ro" is not a valid syllable pair!).
# Namely, it doesn't handle partially overlapping syllable pairs.

# We need to create a modified version of the function (we are going to call it "get_numerator_frequencies") that considers
# consecutive syllable pairs, where the final syllable of one pair always overlaps with the first syllable of the next adjacent one.

# Let's try to find the consecutive overlapping pairs of syllables in "tupiro" (i.e., "tupi" and "piro")
word = "tupiro"

# Pro tip:
# - Consider each consecutive pair of syllables as a bigram (e.g., ("tu", "pi") and ("pi", "ro")).
# - Use the zip function to create overlapping pairs from the syllables list.
# - Store these in a list of strings called "syllable_pairs".

# INSERT CODE BELOW


<details>
  <summary>Reveal a solution</summary>
  
```python
# Note: This is one possible solution, but other valid approaches may exist!

# Define the list of syllables for the word "tupiro"
tupiro_syllables = extract_syllables(word, syllables)

# Create overlapping syllable pairs using zip
syllable_pairs = [''.join(pair) for pair in zip(tupiro_syllables, tupiro_syllables[1:])]

# Print the overlapping syllable pairs
print("Overlapping syllable pairs in 'tupiro':", syllable_pairs)
```
</details>
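
The difference between the two approaches can be seen side by side in the self-contained sketch below. It uses `re.findall` directly rather than the notebook's functions, and the variable names (`all_pairs`, `non_overlapping`, `overlapping`) are chosen for illustration so that they do not overwrite `syllable_pairs`.

```python
import re
from itertools import product

syllables = ["tu", "pi", "ro", "go", "la", "bu", "bi", "da", "ku", "pa", "do", "ti"]
all_pairs = [''.join(pair) for pair in product(syllables, repeat=2)]

# Regex matching consumes the text as it goes, so matches never overlap:
# after "tupi" is matched, only "ro" is left, and "ro" is not a valid pair.
pair_regex = re.compile('|'.join(all_pairs))
non_overlapping = pair_regex.findall("tupiro")
print(non_overlapping)  # ['tupi']

# Splitting into single syllables first and then zipping the list with itself
# (shifted by one position) yields every adjacent, overlapping pair.
syllable_regex = re.compile('|'.join(syllables))
word_syllables = syllable_regex.findall("tupiro")
overlapping = [a + b for a, b in zip(word_syllables, word_syllables[1:])]
print(overlapping)  # ['tupi', 'piro']
```

Because the regex result reconstructs to "tupi" rather than "tupiro", the reconstruction check raises the `ValueError` seen earlier.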

In [None]:
# Well done, you are becoming quite proficient with list comprehensions. They are very important in Python for creating compact, readable code.

# Now that you've solved how to compute the overlapping syllable pairs, create the get_numerator_frequencies() function.
# This function needs to extract consecutive overlapping syllable pairs from the experimental stream and count their frequencies.

# Pro tip:
# - To create overlapping pairs from a list of syllables, use `zip` with a list comprehension as you did above.
#   This allows you to join each syllable with the next, ensuring that the last syllable of one pair overlaps
#   with the first syllable of the next pair.
# - Use `Counter` as you did in the "get_denominator_frequencies" function. It is an efficient way to count
#   syllable pair occurrences once you’ve generated the list of pairs.

# INSERT CODE BELOW



<details>
  <summary>Reveal a solution</summary>
  
```python
# Note: This is one possible solution, but other valid approaches may exist!

from collections import Counter

def get_numerator_frequencies(stream_words, syllables_list):
  """
  Extracts consecutive overlapping syllable pairs from the stream of words and counts their frequencies.
  
  Parameters:
  - stream_words (list): A list of words forming the experimental stream.
  - syllables_list (list): A list of valid syllables to match.
  
  Returns:
  - dict: A dictionary where each key is a syllable pair (e.g., "tupi"), and the value is its frequency.
  
  Raises:
  - ValueError: If there are unmatched parts in the syllable extraction process.
  """
  # Step 1: Collapse the stream into a single continuous string
  collapsed_text = ''.join(stream_words)
  
  # Step 2: Extract syllables using the existing extract_syllables function
  stream_syllables = extract_syllables(collapsed_text, syllables_list)
  
  # Step 3: Create overlapping syllable pairs (bigrams)
  syllable_pairs = [''.join(pair) for pair in zip(stream_syllables, stream_syllables[1:])]
  
  # Step 4: Count the frequency of each syllable pair
  syllable_pair_frequencies = Counter(syllable_pairs)
  
  return syllable_pair_frequencies

# Apply the new function to the first stream
stream_syllable_pair_frequencies = get_numerator_frequencies(all_streams[0], syllables)

# Output the syllable pair frequencies
print("Syllable pair frequencies in the stream:")
for pair, frequency in stream_syllable_pair_frequencies.items():
  print(f"{pair}: {frequency}")
```
</details>
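
The solution above stores each pair as a joined string such as "tupi", which is why later cells call `extract_syllables` again to recover the first syllable. One alternative, sketched here as an assumption rather than as the notebook's method, is to keep each pair as a tuple so that the first syllable is simply `pair[0]`. The toy stream is hand-typed and the name `toy_pair_counts` is hypothetical.

```python
from collections import Counter

# Hand-typed toy stream of syllables (illustrative only, far shorter than the real streams)
toy_stream = ["tu", "pi", "ro", "go", "la", "bu", "tu", "pi", "ro"]

# Count consecutive pairs, keeping them as (first, second) tuples
toy_pair_counts = Counter(zip(toy_stream, toy_stream[1:]))

for (first, second), count in toy_pair_counts.items():
    print(f"{first} -> {second}: {count}")
```

Either representation works with `Counter`. The string form matches the rest of this notebook, so switching to tuples would mean adapting the later cells as well.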

In [None]:
# Almost at the end of this section, well done! We now have everything we need to
# compute the transitional probability of a syllable pair.

# For example, let's try to compute the transitional probability of "tupi" (the first syllable pair of "tupiro") in experimental stream 1.
# First, save the frequencies.
syllable_frequencies = get_denominator_frequencies(all_streams[0], syllables)
syllable_pair_frequencies = get_numerator_frequencies(all_streams[0], syllables)

# Now compute the syllable transitional probability
tupi_first_syllable = extract_syllables("tupi", syllables)[0]
tupi_tp = transitional_probability(syllable_pair_frequencies["tupi"], syllable_frequencies[tupi_first_syllable])

print("TP of 'tupi' in experimental stream 1:", tupi_tp)

In [None]:
# The final step is yours to tackle—no tips this time. Well, perhaps just one: the "enumerate" function might
# come in handy here to loop through each experimental stream.
# Your task is to compute the transitional probability (TP) for all consecutive syllable pairs in each experimental stream.
# You now have all the tools you need to complete this!

# INSERT CODE BELOW



<details>
  <summary>Reveal a solution</summary>
  
```python
# Note: This is one possible solution, but other valid approaches may exist!

# Initialise a list to store the TP results (one dictionary per stream)
all_tps = []

# Loop through each experimental stream
for stream_index, stream in enumerate(all_streams):
    # Get the frequencies of individual syllables and syllable pairs for the current stream
    syllable_frequencies = get_denominator_frequencies(stream, syllables)
    syllable_pair_frequencies = get_numerator_frequencies(stream, syllables)

    # Calculate the TP for each syllable pair in the current stream
    stream_tp = {}
    for pair in syllable_pair_frequencies:
        first_syllable = extract_syllables(pair, syllables)[0]
        tp = transitional_probability(syllable_pair_frequencies[pair], syllable_frequencies[first_syllable])
        stream_tp[pair] = tp

    # Append the TP results for this stream to the list
    all_tps.append(stream_tp)

# Output the TPs for each stream
for stream_index, stream_tp in enumerate(all_tps):
    print(f"Transitional probabilities for experimental stream {stream_index + 1}:")
    for pair, tp in stream_tp.items():
        print(f"  TP of '{pair}': {tp}")
```
</details>
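
As a worked check of the arithmetic, the sketch below computes TPs on a tiny hand-made stream, using TP(x → y) = frequency of the pair xy divided by the frequency of x. It does not call the notebook's helper functions: the stream is already split into syllables by hand, and all names are hypothetical.

```python
from collections import Counter

# Toy stream: tupiro golabu tupiro tupiro golabu, split into syllables by hand.
# (An assumption for illustration; it is only meant to make the arithmetic visible.)
toy_stream = ["tu", "pi", "ro", "go", "la", "bu",
              "tu", "pi", "ro", "tu", "pi", "ro",
              "go", "la", "bu"]

syllable_counts = Counter(toy_stream)                   # denominators
pair_counts = Counter(zip(toy_stream, toy_stream[1:]))  # numerators

# TP(x -> y) = frequency of the pair xy / frequency of x
toy_tps = {
    first + second: count / syllable_counts[first]
    for (first, second), count in pair_counts.items()
}

for pair, tp in toy_tps.items():
    print(f"TP of '{pair}': {tp:.2f}")
```

Here "tupi" and "piro" get a TP of 1, whereas "rogo" and "rotu" share the transitions out of "ro" (2/3 and 1/3). Note that the final syllable of a stream is counted in the denominator but starts no pair, so "butu" comes out at 0.5 in this toy case. In a long stream this edge effect is negligible.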

### Section 4: Analysis

**Objectives**:
- Demonstrate that TPs within words are higher than TPs between words within each continuous stream.
- Expose each model to the test items (words and part-words) and compute a score that could serve as a proxy for "fixation time".
- Compare the models' fixation times to those of infants from Saffran et al.'s study.

**Required Python Knowledge**:

- List and Dictionary Comprehension (to group and restructure results)
- Aggregation with defaultdict (to merge values across streams)
- Numerical Operations with numpy (mean, standard deviation, square root)
- Confidence Intervals using Standard Error and Normal Approximation
- Conditional Logic (e.g., handling division by zero)
- Min–Max Normalisation (rescaling values to match a target range)
- Bar Plotting with matplotlib.pyplot, including:
    - Error bars (yerr and capsize)
    - Custom figure size, labels, and legends


<br>

---

In [None]:
# Our first task in this section is to show that transitional probabilities (TPs) within words are higher than TPs between words.

# Syllables within a word always occur together. For example, in the word "tupiro", "pi" always follows "tu".
# In contrast, syllables at word boundaries do not consistently co-occur.
# For instance, "ro" (from "tupiro") might be followed by "go" (from "golabu"),
# but could just as easily be followed by "bi" (from "bidaku") or "pa" (from "padoti").
# In other words, it’s harder to predict what comes next across word boundaries.

# To demonstrate this, we’ll calculate:
# 1. The average TP for syllable pairs *within* words
# 2. The average TP for syllable pairs *between* words (i.e., at word boundaries)

# We already have the TPs for all syllable pairs in each stream (stored in the all_tps object).
# Let's start by finding out what the average TP is for each syllable pair across the 24 streams.

# Pro tip:
# - Use a defaultdict to gather all TP values for each syllable pair across the 24 streams.
# - Once collected, use numpy’s mean function to compute the average TP for each pair.
# - Store the result in a dictionary named avg_tps, which maps each syllable pair to its mean TP value.

# INSERT CODE BELOW



<details>
  <summary>Reveal a solution</summary>
  
```python
# Note: This is one possible solution, but other valid approaches may exist!

import numpy as np
from collections import defaultdict

# Initialise a defaultdict
all_tps_across_streams = defaultdict(list)

# Collect TPs for each syllable pair, from all 24 dictionaries
for stream_tp in all_tps:
    for pair, tp in stream_tp.items():
        all_tps_across_streams[pair].append(tp)

# For each syllable pair, compute the average TP across streams
avg_tps = {pair: np.mean(tps) for pair, tps in all_tps_across_streams.items()}
avg_tps
```
</details>
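
The aggregation pattern can be tried on a small scale first. In this sketch, two hand-made dictionaries stand in for two streams. The TP values are invented for illustration and the `toy_` names are hypothetical.

```python
from collections import defaultdict
import numpy as np

# Two hand-made TP dictionaries standing in for two streams (values are invented)
toy_all_tps = [
    {"tupi": 1.0, "rogo": 0.25},
    {"tupi": 1.0, "rogo": 0.5, "robi": 0.5},
]

collected = defaultdict(list)  # a missing key starts out as an empty list
for stream_tp in toy_all_tps:
    for pair, tp in stream_tp.items():
        collected[pair].append(tp)

# Average each pair over the streams in which it actually occurred
toy_avg_tps = {pair: float(np.mean(tps)) for pair, tps in collected.items()}
print(toy_avg_tps)
```

With this approach, a pair that is absent from a stream contributes nothing to its average rather than counting as 0. In the sketch, "robi" is averaged over one stream only.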

In [None]:
# Brilliant — we now have the average transitional probabilities (TPs) across streams for each syllable pair.
# The next step is to identify which syllable pairs occur *within* words and which occur *between* words.
# Once we've grouped them accordingly, we can compute the average TP for each group and compare.

# Pro tip:
# - Use the `extract_syllables()` function to break each word into its syllables.
# - Then use `zip` to generate overlapping syllable pairs within each word.
# - Add all within-word syllable pairs to a set, then compare with the keys in avg_tps.
# - Use list comprehensions or dictionary comprehensions to separate the within-word and between-word TPs.
# - Use numpy to compute the mean and standard deviation (SD) of each group.

# INSERT CODE BELOW



<details>
  <summary>Reveal a solution</summary>
  
```python
# Note: This is one possible solution, but other valid approaches may exist!

words = ["tupiro", "golabu", "bidaku", "padoti"]
syllables = ["tu", "pi", "ro", "go", "la", "bu", "bi", "da", "ku", "pa", "do", "ti"]

# Generate all within-word syllable pairs
within_word_pairs = set()
for word in words:
    syll_list = extract_syllables(word, syllables)
    within_pairs = [''.join(pair) for pair in zip(syll_list, syll_list[1:])]
    within_word_pairs.update(within_pairs)

# Now use the above to split the TPs into two groups
within_word_tps = {pair: tp for pair, tp in avg_tps.items() if pair in within_word_pairs}
between_word_tps = {pair: tp for pair, tp in avg_tps.items() if pair not in within_word_pairs}

# Compute average and standard deviation for each group
import numpy as np

within_values = list(within_word_tps.values())
between_values = list(between_word_tps.values())

within_mean = np.mean(within_values)
within_sd = np.std(within_values)
between_mean = np.mean(between_values)
between_sd = np.std(between_values)

# Print results in a clean format
print("Average Transitional Probability (TP)")
print("-------------------------------------")
print(f"Within-word  : Mean = {within_mean:.3f}, SD = {within_sd:.3f}")
print(f"Between-word : Mean = {between_mean:.3f}, SD = {between_sd:.3f}")
```
</details>
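
A note on the SD reported above: `np.std` uses the population formula by default (`ddof=0`), whereas the sample formula (`ddof=1`) is the one used for the standard error later in this section. The sketch below shows the difference on three invented numbers, which are not results from the streams.

```python
import numpy as np

toy_values = [0.2, 0.4, 0.6]  # invented numbers, for illustration only

population_sd = np.std(toy_values)          # ddof=0: divides by n
sample_sd = np.std(toy_values, ddof=1)      # ddof=1: divides by n - 1

print(f"Population SD: {population_sd:.3f}")
print(f"Sample SD    : {sample_sd:.3f}")
```

The sample SD is always the larger of the two, and the gap shrinks as the number of values grows.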

In [None]:
# Well done! The first task in this section is complete. We've shown that transitional probabilities (TPs) between words are lower than TPs within words.

# One way to think about this is that, once we’ve heard a word, it’s difficult to predict what the next syllable will be
# (i.e., the first syllable of the upcoming word). However, once we hear the first syllable of a new word,
# the rest of the syllables in that word become easier to predict — at least for a statistical model.

# Saffran et al.’s study showed that infants were also sensitive to these probability differences and used such knowledge at test.
# Let’s now examine whether our transitional probability model shows similar results when exposed to the test items.
# If our models behave like the infants in the original study, they should assign higher
# transitional probabilities to the words than to the part-words.

# Our next task is to expose each of our 24 transitional probability models to the test items used in Saffran et al.’s study.
# Go back to the Introduction if you need a reminder of what the "part-words" are.

# First, define the words and part-words
words = ["tupiro", "golabu", "bidaku", "padoti"]
partwords = ["rogola", "bubida", "kupado", "titupi"]

# We'll calculate the average syllable pair TP for each test item (i.e., each word or part-word),
# and then compute the mean TP for the four words and four part-words in each model.

# Pro tip:
# - Use the `extract_syllables()` function to break down each test item into its syllables.
# - Use `zip` to construct syllable pairs from these syllables.
# - For each syllable pair, retrieve its TP from the model's TP dictionary (`stream_tp`).
# - If a pair is missing from the dictionary, use 0 as a default (this is a reasonable fallback for unseen transitions).
# - Repeat this for all 24 models, and store the results in separate lists for words and part-words.

# INSERT CODE BELOW



<details>
  <summary>Reveal a solution</summary>
  
```python
# Note: This is one possible solution, but other valid approaches may exist!

import numpy as np
import matplotlib.pyplot as plt

# Define function to compute average syllable pair TP for an item, given a stream tp dictionary
def calculate_avg_tp(word, stream_tp):
    syllables_in_word = extract_syllables(word, syllables)
    pairs = [''.join(pair) for pair in zip(syllables_in_word, syllables_in_word[1:])]
    avg_tp = np.mean([stream_tp.get(pair, 0) for pair in pairs])  # Use 0 if pair is missing
    return avg_tp

# Store average TPs across models
word_tps = []
partword_tps = []

# Compute mean TPs across all experimental streams
for stream_tp in all_tps:
    word_avg_tps = [calculate_avg_tp(word, stream_tp) for word in words]
    partword_avg_tps = [calculate_avg_tp(partword, stream_tp) for partword in partwords]

    word_tps.append(np.mean(word_avg_tps))
    partword_tps.append(np.mean(partword_avg_tps))

# Print summary statistics
print("Model response summary:")
print("------------------------")
print(f"Words     : Mean TP = {np.mean(word_tps):.3f}, SD = {np.std(word_tps):.3f}")
print(f"Part-words: Mean TP = {np.mean(partword_tps):.3f}, SD = {np.std(partword_tps):.3f}")
```
</details>
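
The scoring rule can be checked by hand on a small example. The sketch below uses an invented TP table and passes the syllable pairs in directly, so it does not depend on `extract_syllables`. The function name `toy_item_score` is hypothetical.

```python
# Invented TP table for one model (values chosen for illustration only)
toy_stream_tp = {"tupi": 1.0, "piro": 1.0, "rogo": 0.3, "gola": 1.0}

def toy_item_score(pairs, tp_table):
    """Mean TP over an item's syllable pairs; unseen pairs count as 0."""
    return sum(tp_table.get(pair, 0) for pair in pairs) / len(pairs)

word_score = toy_item_score(["tupi", "piro"], toy_stream_tp)      # pairs of "tupiro"
partword_score = toy_item_score(["rogo", "gola"], toy_stream_tp)  # pairs of "rogola"
unseen_score = toy_item_score(["dapa", "paku"], toy_stream_tp)    # pairs absent from the table

print(word_score, partword_score, unseen_score)
```

A part-word spans a word boundary, so one of its two pairs carries a low TP and pulls the item's average down. That is the contrast the model is expected to show.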

In [None]:
# Nice! Our models seem to perform like the infants in the original study — they assign higher transitional probabilities (TPs)
# to the words than to the part-words. This mirrors the infants’ longer fixation times for part-words, which suggests that our
# simple TP model might indeed capture an underlying mechanism that supports statistical learning in early language acquisition.

# This way of presenting the model performance is often used in computational modelling studies (e.g., French et al., 2011, with the TRACX model)
# to examine the idea that TPs may explain observed human behaviour.

# Now, let’s take it a step further.

# Can we directly compare our model’s output to the fixation times measured in infants?
# Well, not yet — because our model’s scores (i.e., raw transitional probabilities) are not on the same scale as seconds of looking time.

# One way to make them comparable is to rescale the model’s TP scores so that they fall into a similar numerical range as the
# fixation times reported by Saffran et al. This allows us to visually assess whether the relative difference between conditions (words vs. part-words)
# in the model mirrors the relative difference seen in the behavioural data.

# But how should we do that?

# Step 1: First, recall that lower TPs correspond to greater prediction uncertainty / lower familiarity. In the context of infant studies, greater uncertainty
# might lead to longer looking times (a novelty preference effect). To make our model reflect this intuition, we invert the TP scores:
#     high TP → low fixation
#     low TP → high fixation

# Use a small epsilon close to 0 to avoid divisions by zero
epsilon = 1e-8
fixation_word_tps = [1/(tp + epsilon) for tp in word_tps]
fixation_partword_tps = [1/(tp + epsilon) for tp in partword_tps]

# Step 2: Rescale the individual inverted scores to the empirical fixation time range
# Saffran et al. (1996) fixation time data
saffran_means = [7.97, 8.85]  # Mean fixation times (in seconds)
saffran_ci = [0.41 * 1.96, 0.45 * 1.96]     # 95% CI half-widths (standard error x 1.96)

# Define the min and max of the empirical range
min_fixation = min(saffran_means) - min(saffran_ci)
max_fixation = max(saffran_means) + max(saffran_ci)

# Get the min and max of the model’s inverted values
all_fixation_model_scores = fixation_word_tps + fixation_partword_tps
min_model = min(all_fixation_model_scores)
max_model = max(all_fixation_model_scores)

# Apply min–max scaling to the individual values
scaled_fixation_word_tps = [min_fixation + (x - min_model) * (max_fixation - min_fixation) / (max_model - min_model) for x in fixation_word_tps]
scaled_fixation_partword_tps = [min_fixation + (x - min_model) * (max_fixation - min_fixation) / (max_model - min_model) for x in fixation_partword_tps]

# Step 3: Compute mean and 95% confidence intervals of the rescaled values
def compute_mean_and_ci(data, confidence=0.95):
    """
    Returns the mean and half-width of the 95% confidence interval for a list of values.
    Note: the z value (1.96) is hard-coded, so the `confidence` argument is currently unused.
    """
    mean = np.mean(data)
    std_err = np.std(data, ddof=1) / np.sqrt(len(data))
    margin = std_err * 1.96  # For 95% confidence assuming normality
    return mean, margin

scaled_mean_word, scaled_ci_word = compute_mean_and_ci(scaled_fixation_word_tps)
scaled_mean_partword, scaled_ci_partword = compute_mean_and_ci(scaled_fixation_partword_tps)

scaled_model_means = [scaled_mean_word, scaled_mean_partword]
scaled_model_ci = [scaled_ci_word, scaled_ci_partword]

# Plotting the results
labels = ['Words', 'Part-words']
width = 0.35
x = np.arange(len(labels))

fig, ax = plt.subplots(figsize=(8, 5))
bars1 = ax.bar(x - width/2, saffran_means, width, yerr=saffran_ci, capsize=5, label="Saffran et al. (1996)", alpha=0.7)
bars2 = ax.bar(x + width/2, scaled_model_means, width, yerr=scaled_model_ci, capsize=5, label="Scaled Model TP Scores", alpha=0.7)

# Add labels and legend
ax.set_ylabel('Mean Fixation Time (s) / Scaled Model Scores')
ax.set_title("Comparison of Saffran et al.'s Results with Scaled Model TP Scores")
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()

plt.show()
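
The rescaling and confidence interval steps can also be written as a small reusable function. This is a minimal sketch under stated assumptions: the scores and the target range are invented, `min_max_rescale` is a hypothetical name, and the guard against identical values is an addition that the cell above does not include.

```python
import numpy as np

def min_max_rescale(values, new_min, new_max):
    """Linearly map values so that min(values) -> new_min and max(values) -> new_max."""
    values = np.asarray(values, dtype=float)
    old_min, old_max = values.min(), values.max()
    if old_max == old_min:
        # Without this check the formula below would divide by zero
        raise ValueError("All values are identical, so they cannot be rescaled.")
    return new_min + (values - old_min) * (new_max - new_min) / (old_max - old_min)

toy_scores = [1.0, 2.0, 4.0]                         # invented inverted-TP scores
toy_scaled = min_max_rescale(toy_scores, 7.0, 10.0)  # invented target range
print(toy_scaled)  # the smallest score maps to 7.0 and the largest to 10.0

# Mean and 95% CI half-width under a normal approximation (sample SD, ddof=1)
toy_mean = toy_scaled.mean()
toy_margin = 1.96 * toy_scaled.std(ddof=1) / np.sqrt(len(toy_scaled))
print(f"Mean = {toy_mean:.2f}, 95% CI half-width = {toy_margin:.2f}")
```

Min-max scaling preserves the ordering and relative spacing of the scores, but the absolute values depend entirely on the chosen target range. Only the pattern across conditions is comparable with the infant data, not the numbers themselves.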

🎉🎉🎉

Congratulations on completing the first part of the course!
You’ve successfully built a computational model that simulates how transitional probabilities might support word segmentation in infancy, and you’ve compared your model’s performance to real experimental data from a classic study in cognitive science. This is an impressive achievement — you’re now thinking like a computational modeller!

⸻

🧠 Final Point for Reflection

Let’s take a closer look at the visual comparison we generated between the model and the infants’ performance. Why are the confidence intervals for the model so narrow? Why might infants' scores be more variable than a model's?

⸻

This is a perfect point to pause and celebrate what you’ve achieved. 🥳

In the next part of the course, you’ll have the opportunity to run simulations on conversational corpus data, and use a computational model implementing a chunking-based learning mechanism.

See you soon!


### 📝 Bonus Track

*Great work! Now enjoy your musical reward!* 🎶

<audio controls>
  <source src="https://drive.google.com/uc?export=download&id=17kEucr1p_VaI5jLGPidrDtIldxw95rCn" type="audio/mpeg">
  Your browser does not support the audio element.
</audio>

 *by ChatGPT + SunoAI*