---

When using this notebook in **Playground mode**:
- You can **run all cells** and see the results.
- **Any changes you make will not be saved** after you close the notebook.

If you’d like to make changes and save your work, please create a copy of this notebook in your own Google Drive:
1. Go to **File > Save a copy in Drive**.
2. This will create an editable version in your Drive, where you can modify and save your changes.

If you ever want to reset the notebook to its original state, use this [link](https://colab.research.google.com/drive/1HPSSzWEaAmf9P-5UyoTcI1QbKgf2xvdJ#scrollTo=EmVl9xi2bRKC&forceEdit=true&sandboxMode=true) to reopen it in Playground mode.



---

# Challenge: Using a Chunking-Based Model to Simulate Early Vocabulary Growth  

In this challenge, you will **use a computational model of “chunking”** – the idea that events which frequently co-occur in the environment gradually become represented as unified units (chunks), leading to more fluent processing over time. We implement this idea by simulating the theoretical proposal that children do not learn each word from scratch, but acquire vocabulary by re-using an ever-growing inventory of sub-lexical and lexical chunks they have already mastered (Jones & Rowland, [2017](https://doi.org/10.1016/j.cogpsych.2017.07.002)).  

For example, having already learned the sub-lexical sequence /mˈɐ/ in /mˈɐmi/ (mummy) may facilitate the later learning of words such as /mˈɐŋki/ (monkey), /mˈɐd/ (mud), /mˈɐndˌeɪ/ (Monday), or /mˈɐtʃ/ (much), due to familiarity with at least part of their phonological forms.  

<br>

> **The Importance of Chunking in Cognitive Science**  
> * Infants, children, and adults spontaneously group items that frequently co-occur in the input (e.g., Jones et al., [2020](https://doi.org/10.1016/j.cognition.2020.104200); Miller, [1956](https://doi.org/10.1037/h0043158); Slone & Johnson, [2018](https://doi.org/10.1016/j.cognition.2018.05.016)).  
> * Computational models of chunking have shown that a small set of psychologically plausible learning mechanisms can account for core findings in domains such as expert memory, problem solving, and verbal learning (e.g., Gobet et al., [2015](https://doi.org/10.3389/fpsyg.2015.01785)) – and, crucially for us, in **child language acquisition** (e.g., Cabiddu et al., [2023](https://doi.org/10.1111/lang.12559); Jessop et al., [2025](https://dx.doi.org/10.1037/rev0000564); Jones et al., [2021](https://doi.org/10.1016/j.jml.2021.104232)).

<br>

## What you’ll do in this notebook
1. **Familiarise yourself** with a simple chunking learner that processes child-directed utterances one at a time, storing new chunks formed from adjacent units that have just been encoded.  
2. **Compare the model’s learning to real child learning.** Given a parent–child pair, you will evaluate how well the vocabulary of a model trained on the parent’s speech reflects the child’s productive vocabulary.  
3. **Manipulate the input** to examine how the *quantity* and *diversity* of parental speech affect vocabulary growth. This will involve a pure simulation experiment designed to provide proof-of-principle evidence and test hypotheses about the role of different input characteristics in early word learning!

<br>

---

---


## Background  

Understanding how infants and toddlers acquire their vocabularies is a central puzzle in language acquisition. **Chunking-based learning** offers a parsimonious answer: children store and re-use recurring fragments of phonological material (chunks), gradually assembling them into the larger units we recognise as words.  

### Why chunking matters for vocabulary growth  
* Chunking provides a **mechanistic link** between perception and learning.  
* An expanding inventory of chunks acts as **linguistic scaffolding**: newly heard words often contain familiar sub-chunks, allowing the child to encode – and learn – those words more efficiently.    

### Quantity vs diversity of the input  
Researchers typically distinguish two key properties of children’s language experience:

-  **Quantity** ➜ Total word **tokens** (all child-directed input words, repetitions included) ➜ “How much speech does the child hear?”
-  **Quality / Diversity** ➜ Number of unique word **types** ➜ “How many *different* words does the child hear?”

In naturalistic data these measures are **highly correlated** – talkative caregivers also tend to use a wider range of words – making it difficult to tease their effects apart.
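
As a minimal sketch of the two measures, the snippet below counts tokens and types in three made-up orthographic utterances. The utterances and variable names are illustrative assumptions, not corpus data.

```python
# Toy child-directed utterances (invented for illustration, not from the corpus)
toy_utterances = [
    "where is the ball",
    "where is the teddy",
    "the ball is red",
]

# Tokens: every word occurrence, repetitions included
tokens = [word for utt in toy_utterances for word in utt.split()]

# Types: the set of distinct words
types = set(tokens)

print(f"Quantity (tokens): {len(tokens)}")
print(f"Diversity (types): {len(types)}")
```

Adding a fourth utterance that repeats existing words would raise the token count but leave the type count unchanged, which is the kind of contrast the simulations exploit.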

### Insights from Jones & Rowland (2017)  
Jones and Rowland's study resolved this confound with simulations that **independently manipulated** quantity and lexical diversity. Four headline findings emerged:

1. **Early advantage for quantity:** when the model’s lexicon is tiny, simply hearing more tokens speeds initial learning.  
2. **Later advantage for diversity:** once a basic chunk repertoire is in place, exposure to *diverse* vocabulary yields steeper, more sustained vocabulary growth.  
3. **Broader cognitive pay-offs:** models trained on diverse input outperform quantity-matched peers on non-word repetition, sentence recall, and the ability to learn completely novel words.  
4. **Converging child data:** in English-speaking two- to three-year-olds, caregiver lexical diversity – not quantity – was the stronger predictor of later productive vocabulary.

### Computational model used in the simulations  

**CLASSIC** – *Chunking Lexical And Sub-lexical Sequences In Children*.  

* **Input format:** word-delimited phonemic transcriptions of child-directed utterances.  
* **Learning mechanism:** whenever two adjacent chunks are encoded, the model stores their concatenation as a new chunk, gradually building a hierarchy from phonemes → sub-lexical sequences → words → multi-word units.  
* **Processing constraint:** on each utterance the model can access, on average, 4–5 chunks, with a recency bias favouring utterance-final material – mimicking children’s limited online processing capacity.  
* **Outcome measures:** number and size of learned chunks, rate of learning novel words.

### Child-directed speech corpus used  

**Manchester corpus** (Theakston et al., [2001](https://talkbank.org/childes/access/Eng-UK/Manchester.html)):  

* **Participants:** 12 English-learning children (initial age 1;10–2;0) and their primary caregivers.  
* **Sampling schedule:** 34 one-hour recordings per dyad, roughly every three weeks over a twelve-month period.  
* **Scale:** ≈ 410 000 caregiver word tokens, ≈ 12 000 word types, enabling fine-grained manipulation of quantity and diversity in the simulations.

### What this means for our challenge  

In this second part of the course you will implement a simplified chunking learner inspired by CLASSIC, feed it child-directed utterances (we'll use just one caregiver-child dyad for this course), and explore:

* How the model’s vocabulary growth compares with the actual child vocabulary drawn from the corpus.  
* How growth curves shift under **input manipulated in quantity or diversity**.  
* Whether the same trade-off found by Jones & Rowland – early benefits of sheer exposure versus later benefits of lexical richness – emerges in your runs.  
* Practical implications: could boosting the *variety* of words in a child’s daily input accelerate vocabulary learning more than simply talking more?

By the end you should have an intuition for **how linguistic input structure shapes learning**, and hands-on experience manipulating corpora to isolate these effects!

---

---


#### A Quick Look at Jones & Rowland's Findings

The image below shows the cumulative number of unique words learned by the **CLASSIC** computational model after being trained on three types of maternal input from the *Manchester Corpus*:

- **Mother**: the original unaltered input  
- **Diversity**: input modified to include a wider range of unique words  
- **Quantity**: input with the same set of words as *Mother*, but repeated more often  

As shown in the graph, the model trained on the *Quantity* input has an early advantage. However, the model exposed to the *Diversity* input catches up and ultimately acquires a larger vocabulary over time.

The authors suggest that highly repetitive input is helpful in the early stages of word learning, as it helps build a foundational set of sublexical chunks. Once these basic building blocks are in place, more diverse input allows the learner to rapidly expand their vocabulary.

<img src="https://drive.google.com/uc?export=view&id=1u0kXuNeLekO6p3NBjzQgxmEVjIEZR-mX" width="450px" border="2px">

*Figure from Jones & Rowland (2017)*

<br>

---

---


## Section 1: A simplified version of CLASSIC

In this challenge, we provide you with a basic chunking model inspired by CLASSIC. This part of the course focuses on experimenting with the model’s input rather than on building its architecture.

### Overview of Mini-CLASSIC:

**Unconstrained Chunking Mechanism**:  
The model uses the same unconstrained chunking mechanism as the original CLASSIC. It takes an utterance transcribed into phonemes—such as (d uː) (j uː) (w ɒ n ə) (m æ ʃ) (s ɐ m) (p ə t ɑː t əʊ z) / (do) (you) (wanna) (mash) (some) (potatoes)—and processes it from left to right by encoding it using available chunks from its lexicon.  

If this is the first utterance the model sees, its lexicon only contains the basic phonemes of the language. It then forms new biphone chunks from adjacent phonemes:  
d uː – uː j – j uː – uː w – w ɒ – ɒ n – n ə – ə m – m æ – æ ʃ – ʃ s – s ɐ – ɐ m – m p – p ə – ə t – t ɑː – ɑː t – t əʊ – əʊ z.

If you look closely, the model has already reached a word-level representation for the words *do* and *you*. At this point, we say the model has learned a vocabulary of two distinct words!

These newly learned chunks can then be reused to encode future utterances using longer chunks instead of individual phonemes. The use of longer chunks during encoding is a proxy for processing efficiency—the fewer the chunks used to encode a word or utterance, the easier it is to process and learn.

For example, if the next utterance is (j uː) (g ɒ t) (s ɐ m) (p ə t ɑː t əʊ z) / (you) (got) (some) (potatoes), the model may encode it using the previously learned biphone chunks:  
(juː) (g) (ɒ) (t) (sɐ) (mp) (ət) (ɑːt) (əʊz).

After encoding, the model forms new chunks from adjacent elements:  
j uː g – g ɒ – ɒ t – t sɐ – sɐ mp – mp ət – ət ɑːt – ɑːt əʊz.

These chunks are now longer: some include 3 or 4 phonemes! Notice the chunk *sɐmp*, which crosses a word boundary and contains the word *sɐm* (some). Because a stored chunk now contains the complete word, we count *some* as learned, and the model’s vocabulary grows to three words.
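
The two-utterance walk-through above can be reproduced with a short sketch. It is a simplified illustration under stated assumptions, not the notebook's implementation: chunks are stored as tuples of phonemes in a plain `set`, the helper names `encode` and `learn` are hypothetical, and learning rate and recency bias are ignored (every adjacent pair is chunked).

```python
def encode(units, lexicon):
    """Greedy left-to-right encoding: always take the longest known chunk."""
    encoded = []
    i = 0
    while i < len(units):
        # Try the longest candidate first; single units are always known
        for j in range(len(units), i, -1):
            candidate = tuple(units[i:j])
            if len(candidate) == 1 or candidate in lexicon:
                encoded.append(candidate)
                i = j
                break
    return encoded


def learn(encoded, lexicon):
    """Store the concatenation of every pair of adjacent encoded chunks."""
    for left, right in zip(encoded, encoded[1:]):
        lexicon.add(left + right)


lexicon = set()  # hypothetical chunk store (tuples of phonemes)

utt1 = "d uː j uː w ɒ n ə m æ ʃ s ɐ m p ə t ɑː t əʊ z".split()
utt2 = "j uː g ɒ t s ɐ m p ə t ɑː t əʊ z".split()

learn(encode(utt1, lexicon), lexicon)  # first utterance: biphone chunks
encoded2 = encode(utt2, lexicon)       # second utterance reuses them
learn(encoded2, lexicon)               # longer chunks are now stored

# Show the encoding of the second utterance, one bracket per chunk
print(" ".join("(" + "".join(chunk) + ")" for chunk in encoded2))
```

The printed encoding should match the bracketed sequence shown above, and after the second call to `learn` the store contains the four-phoneme chunk `('s', 'ɐ', 'm', 'p')`.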

**Learning Rate**:  
The model includes a learning rate parameter ranging from 0 to 1. It controls the likelihood of forming a new chunk from two adjacent elements. A learning rate of 1 means all possible adjacent sequences are chunked, while a rate of 0 means none are.

**Recency Bias**:  
We’ve also added a recency bias parameter, which makes the model more likely to learn chunks from the end of an utterance. This bias is implemented in Jones and Rowland’s simulations, under the assumption that children often pay more attention to the ends of utterances. A recency bias of 0 means no bias, while higher values increasingly favour the final chunks. At the extreme, the model may only learn the last chunk of each utterance.
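
To see how the two parameters interact, the sketch below computes the probability of storing each candidate chunk by position. It assumes the rule used in `run_chunking_model` further down, where the learning rate is multiplied by `((position) / n_chunks) ** final_bias_strength`; the function name `chunk_learning_probabilities` is hypothetical and only serves this illustration.

```python
def chunk_learning_probabilities(n_chunks, learning_rate=1.0, final_bias_strength=0.0):
    """Probability of storing each candidate chunk, by position in the utterance."""
    return [
        learning_rate * ((j + 1) / n_chunks) ** final_bias_strength
        for j in range(n_chunks)
    ]


# Compare no bias, a linear bias, and a strong bias for four candidate chunks
for strength in (0, 1, 3):
    probs = chunk_learning_probabilities(4, learning_rate=1.0, final_bias_strength=strength)
    print(strength, [round(p, 3) for p in probs])
```

With a strength of 0 every position has the same probability (the learning rate itself); as the strength grows, early positions shrink towards 0 while the final chunk keeps the full learning rate.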

### Play with Mini-CLASSIC:

In the following code cell, you’ll find a function that defines the entire Mini-CLASSIC model. Run the code cell to load the model. Next, try running the model on three consecutive utterances to see what chunks and words it learns. You can also experiment with the learning rate and recency bias parameters to observe how they affect the model’s behaviour.

<br>

In [None]:
# Import some libraries
import os
import numpy as np
import re
import time
import pandas as pd

def run_chunking_model(input_utterances, learning_rate=1.0, final_bias_strength=0, set_seed=None):
    """
    Given a list of input utterances,
    this function runs an incremental chunking learning model inspired by CLASSIC (Jones & Rowland, 2017).

    The model:
    - Parses each utterance into basic units (e.g., phonemes).
    - Encodes incrementally by matching known chunks in the current lexicon.
    - Learns new chunks by combining adjacent encoded elements and adds them to the lexicon.
    - Tracks, at each step, which new chunks and input words were learned.
    - Applies a learning probability (learning_rate) and optionally biases learning toward utterance-final chunks.

    Parameters:
    ----------
    input_utterances : list of str
        Input utterances formatted as strings like "(h e l l o) (d a d d y)".

    learning_rate : float, optional (default=1.0)
        Probability of learning a new chunk when encountered. 1.0 means always learn; 0.0 means never.

    final_bias_strength : float, optional (default=0)
        Strength of positional bias toward learning later chunks in the utterance.
        - 0 disables position-based bias (uniform probability).
        - Values >1.0 increase preference for utterance-final chunks.
        - Values <1.0 reduce the bias effect.
        - 1.0 applies a linear bias (proportional to position).


    set_seed : int or None, optional (default=None)
        Random seed for reproducibility.

    Returns:
    -------
    df : pandas.DataFrame
        A DataFrame with the following columns:
        - 'cds_utterance_id': utterance ID number
        - 'utterance_phonemes': the original input utterance
        - 'encoded_input': encoded version of the utterance using previously learned chunks
        - 'new_chunks': list of new chunks learned at this utterance
        - 'new_words': list of new words learned after processing the current utterance
    """
    # ── parameter validation ───────────────────────────────
    if not (0.0 <= learning_rate <= 1.0):
        raise ValueError("learning_rate must be between 0 and 1 (inclusive).")
    if final_bias_strength < 0:
        raise ValueError("final_bias_strength must be zero or positive.")

    # set seed
    if set_seed is not None:
        np.random.seed(set_seed)

    # progress-bar setup
    total_utts = len(input_utterances)
    checkpoints = {
        int(total_utts * 0.20): "20 %",
        int(total_utts * 0.40): "40 %",
        int(total_utts * 0.60): "60 %",
        int(total_utts * 0.80): "80 %",
    }

    utterances_orig = input_utterances

    utterances = [
        " ".join(re.findall(r"\((.*?)\)", utt))
        for utt in utterances_orig
    ]

    # Check for empty utterances and raise an error if any are found
    empty_utterance_count = sum(utterance.strip() == "" for utterance in utterances)
    if empty_utterance_count > 0:
        raise ValueError(f"Input contains {empty_utterance_count} empty utterance(s). Please remove or fix before running the model.")

    # create list of unsegmented sentences for final printing
    utterances_base = [re.sub("_", "", utterance) for utterance in utterances]

    # create a list of unique units (phonemic or syllabic) to be converted into unique characters
    # (i.e., characters whose Unicode code point is the integer i [see built-in function chr()])
    unique_chars = list(set(list(np.concatenate([utterances[i].split() for i in range(len(utterances))]))))
    unique_chars += "|" # add separator character for final segmentation printing

    # generate unique characters to convert each different phonemic/syllabic unit
    chars = [c for c in map(chr, range(0, len(unique_chars))) if len(c) == 1]

    # create look-up dictionary to prepare input for algorithm
    chars_in = dict(zip(unique_chars, chars))

    # create look-up dictionary to convert algorithm output back to original phonemic/syllabic transcription
    chars_out = dict(zip(chars, unique_chars))

    # convert input utterances into strings of unique character units
    utt_conversion = [utterance.split() for utterance in utterances]
    utterances = ["".join([chars_in[unit] for unit in utterance]) for utterance in utt_conversion]

    # define boundary separator
    separator = chars_in["|"]

    # initialise variables

    lexicon_nopause = [chars_in[unique_char] for unique_char in unique_chars if unique_char != "|"]
    lexicon_nopause = dict(zip(lexicon_nopause, lexicon_nopause))

    final_segmentation = []
    lexicon_by_utterance = []

    regex_separator_1 = "".join(["^[", separator, "]"])
    regex_separator_2 = "".join(["[", separator, "]{2,}"])

    start_time = time.time()
    for i in range(len(utterances)): # for each utterance
        utt = utterances[i]
        final_seg = []

        while len(utt) > 0:
            # all adjacent combinations ordered by decreasing length
            combinations = reversed([utt[0: z] for z in range(0 + 1, len(utt) + 1)])
            chunk_found = ""

            while chunk_found == "":
                combination = next(combinations) # for each combination
                # note, a chunk will always be found as a phoneme/syllable
                # this happens because the lexicon has the phonemes/syllables basic units initialised

                try:
                    # look for a combination in the lexicon
                    chunk_found = lexicon_nopause[combination]
                except KeyError:
                    # otherwise pass
                    pass

            # store chunk found
            final_seg += [chunk_found]

            # delete part of utterance that was matched by a chunk
            utt = utt[len(chunk_found):]

        chunks = []
        # create new chunks from recognised segments
        if len(final_seg) > 1:
            for j in range(len(final_seg) - 1):
                chunks += ["".join([final_seg[j], final_seg[j + 1]])]
        else:
            chunks += [final_seg[0]]

        new_chunks = []

        for j, chunk in enumerate(chunks): # add chunks to the lexicon
            if chunk not in lexicon_nopause:
                position_bias = ((j + 1) / len(chunks)) ** final_bias_strength # positional bias
                if np.random.rand() < (learning_rate * position_bias): # learning rate
                    lexicon_nopause[chunk] = chunk
                    new_chunks.append(chunk)

        # store only newly learned chunks for this utterance
        lexicon_by_utterance.append(new_chunks)

        # join all the recognised segments into a single string
        final_seg_joined = re.sub(regex_separator_1,
                                "",
                                re.sub(regex_separator_2,
                                        separator,
                                        separator.join(final_seg)))

        # add utterance segmentation to list of segmented utterances
        final_segmentation += [final_seg_joined]

        # ───── print progress every 20 % ─────
        if i in checkpoints:
            elapsed = time.time() - start_time
            print(f"[{checkpoints[i]}]   processed {i+1:,} / {total_utts:,} utterances "
                  f"({elapsed:.1f} s elapsed)")

    # raw loop finished – print an early 100 % notice
    core_done_time = time.time() - start_time
    print(
        f"[100 %] raw segmentation complete in {core_done_time:.1f} s – "
        "assembling results (converting chunks → phonemes, "
        "building parent lexicon, computing new-words …)"
    )

    # format final segmentation
    final_segmentation = ["".join([chars_out[unit] for unit in u]) for u in final_segmentation]
    final_segmentation = [" ".join(f"({seg})" for seg in s.split("|")) for s in final_segmentation]

    # convert each lexicon entry from character format back to the original phonemic/syllabic units
    lexicon_by_utterance_out = []

    for lexicon in lexicon_by_utterance:
        converted_lexicon = []
        for chunk in lexicon:
            # convert each character in the chunk to the original unit
            units = [chars_out[c] for c in chunk]
            converted_lexicon.append(" ".join(units))
        lexicon_by_utterance_out.append(converted_lexicon)

    # Create the DataFrame with proper decoded chunks
    df = pd.DataFrame({
        "cds_utterance_id": range(1, len(utterances_orig) + 1),
        "utterance_phonemes": utterances_orig,
        "encoded_input": final_segmentation,
        "new_chunks": lexicon_by_utterance_out
    })

    # Build cumulative parent lexicon for each utterance
    pattern = re.compile(r"\((.*?)\)")
    parent_lexicon_progressive = []
    words_seen_so_far = set()

    for utt in utterances_orig:
        words_seen_so_far.update(pattern.findall(utt))
        parent_lexicon_progressive.append(words_seen_so_far.copy())

    # Add a column with the learned words, i.e. parent words whose unit sequence
    # appears inside any newly learned chunk. Use the cumulative parent lexicon.
    chunks_col = df["new_chunks"].tolist()
    words_col = parent_lexicon_progressive

    learned_words_so_far = set()
    new_learned_words_all = []

    for chunks, allowed_words in zip(chunks_col, words_col):
        # pad with spaces so that matches respect unit boundaries
        # (e.g. the unit "ə" must not match inside the unit "əʊ")
        padded_chunks = [f" {chunk} " for chunk in chunks]
        new_learned = []
        for word in allowed_words:
            if word in learned_words_so_far:
                continue
            normalised_word = " ".join(word.split())
            if any(f" {normalised_word} " in chunk for chunk in padded_chunks):
                new_learned.append(word)
                learned_words_so_far.add(word)
        new_learned_words_all.append(sorted(new_learned))

    df["new_words"] = new_learned_words_all

    total_time = time.time() - start_time
    print(f"[100 %] finished processing {total_utts:,} utterances "
          f"in {total_time:.1f} s ({total_time/60:.1f} min)")


    return df

In [None]:
# Now, play with Mini-CLASSIC
# Inspect the chunks learned, the words learned, and modify its parameters.

# Notice how the output of the model is in a table format. How neat is that!
# The table is specifically a Pandas DataFrame, which is a powerful and widely-used data structure in Python for
# organising and analysing tabular data. It allows you to easily view, filter, group, and summarise the model's output.

# Pro Tip:
# to fully visualise the table, once you've run the cell, click on the blue table icon that appears to the left of the column names in the output.
# This will open an interactive preview of the DataFrame.

utterances_example = [ # For the sake of the example, let's use orthographic utterances, as they are easier to read than phonemic transcriptions.
    "(w h e r e s) (d a d d y)",
    "(w h e r e s) (d a d d y)",
    "(w h e r e s) (d a d d y)"
]

model_df_example = run_chunking_model(utterances_example, learning_rate=1, final_bias_strength=0, set_seed=1234)
model_df_example

<br>

---

## Section 2: Mother and Child Data

**Objective**

Get hands-on with one mother–child dyad from the Manchester Corpus.

**Required Python Knowledge**

* Fetching remote files (`requests`, `pd.read_excel`)
* Regex
* Core Pandas operations: `apply`, `explode`, `groupby`, `qcut`, basic DataFrame maths
* List/set comprehensions
* Simple plotting with Matplotlib (`plot`, axes labels, ticks, legend, `tight_layout`)



In [None]:
# 👶 Brilliant! Let's take a first look at the input to our Mini-CLASSIC model!

# In their 2017 study, Jones and Rowland used real conversations between mothers and children from
# the Manchester Corpus. To keep things simple (and fast), we’ll work with just one of these conversations (target child = Becky).

# 🧾 Your task:
# We'll begin by downloading a small extract from the Manchester Corpus. This will help you understand
# the structure of the data we’ll be working with later.

# 🔗 Dataset URL:
url = "https://raw.githubusercontent.com/francescocabiddu/MEDALWorkshopChildComputational/refs/heads/main/Eng-UK_conversation-flow_219_Manchester_15410_Becky_Example.txt"

# 📁 File format:
# The file contains lines with three elements:
# - a speaker role (either Child_Directed_Speech or Target_Child),
# - an orthographic transcription like (give) (dolly) (some),
# - and a phonemic transcription like (ɡ ɪ v) (d ɒ l i) (s ɐ m)

# 👇 Pro tips:
# - Use the `requests` library to download the text file from the URL.
# - Load the text into a variable.
# - Split the content into individual lines — each line is one utterance.
# - Then, print the lines.

# ✏️ INSERT YOUR CODE BELOW


<details>
  <summary>Reveal a solution</summary>
  
```python
# Note: This is one possible solution, but other valid approaches may exist!

import requests
import pandas as pd
import re

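# Send a GET request to the URL to retrieve the content of the text file
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed

# Extract the content of the response as plain text
utterances_text = response.text

# Split the text into a list of lines (one utterance per line)
utterances_lines = utterances_text.strip().split("\n")

# Display the utterances (in a notebook, the last expression is shown automatically)
utterances_lines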
```
</details>

In [None]:
# 🧒 So cool! Now that you’ve explored how conversation data is structured,
# let’s work with two separate files: one for the child’s speech and one for the mother's speech.

# 🧾 Your task:
# We’ll now load two Excel files:
# 1️⃣ One contains phonemic utterances produced by the child (Becky).
# 2️⃣ The other contains phonemic utterances from the mother (Child-Directed Speech).

# 🧠 Why only phonemes?
# For our simulations, we only need the phonemic representations of the utterances — so we’ll keep things simple and focus on those.

# 📁 File format:
# Unlike the previous `.txt` file, these are `.xlsx` Excel files. This is a good chance to practise importing from a different format!

# 👇 Pro tips:
# - Use the `pd.read_excel()` function to load the Excel files directly from the provided URLs.
# - Store the child data in a DataFrame called `child_df`
# - Store the child-directed speech in a DataFrame called `cds_df`
# - Preview the first few rows of each dataset using `.head()`

# ✏️ INSERT YOUR CODE BELOW

becky_url = "https://github.com/francescocabiddu/MEDALWorkshopChildComputational/raw/refs/heads/main/becky_speech.xlsx"
becky_cds_url = "https://github.com/francescocabiddu/MEDALWorkshopChildComputational/raw/refs/heads/main/becky_cds.xlsx"



<details>
  <summary>Reveal a solution</summary>
  
```python
# Read Excel files directly from URL
child_df = pd.read_excel(becky_url)
cds_df = pd.read_excel(becky_cds_url)

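# Preview the first few rows of each dataset
# (display() is available in notebooks; a bare expression would only show the last one)
display(child_df.head())
display(cds_df.head())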
```
</details>

In [None]:
# 📊 Now let’s compute some basic summary statistics to better understand our conversation samples.

# We’ll compare the speech of the child and the mother across a few key measures.

# 🧾 Your task:
# Create a small table that includes the following statistics for both the child and the mother:

# 1️⃣ Total number of utterances
# 2️⃣ Mean length of utterance (in number of words)
# 3️⃣ Total number of word tokens (all words produced, including repetitions)
# 4️⃣ Total number of word types (unique words only)

# 👇 Pro tips:
# - Each row in `child_df` and `cds_df` contains a phonemic utterance in the format: (w ʌ n ə) (t ɛ d i)
# - Use regex to extract the bracketed word units.
# - Count the number of utterances using `len()`
# - Count word tokens by flattening the extracted words and taking the total
# - Count word types using a `set()`

# 🛠️ Output:
# Create a small summary DataFrame or dictionary to neatly display the stats for child and mother.

# ✏️ INSERT YOUR CODE BELOW


<details>
  <summary>Reveal a solution</summary>
  
```python
# Note: This is one possible solution, but other valid approaches may exist!

import re
import pandas as pd

# Function to extract words from a phonemic utterance like: (w ʌ n ə) (t ɛ d i)
def extract_words(utterance_str):
    return re.findall(r"\((.*?)\)", utterance_str)

# Helper function to compute summary stats for a dataframe
def compute_summary(df):
    # Check for missing utterances and raise an error if any are found
    if df['utterance_phonemes'].isna().any():
        raise ValueError("Missing utterances found in 'utterance_phonemes' column!")

    # Apply extraction directly to the column
    word_lists = df['utterance_phonemes'].apply(extract_words)

    total_utterances = len(word_lists)
    mean_length = word_lists.apply(len).mean()
    all_tokens = [word for words in word_lists for word in words]  # flatten list of lists
    num_tokens = len(all_tokens)
    num_types = len(set(all_tokens))

    return {
        "total_utterances": total_utterances,
        "mean_utterance_length": round(mean_length, 2),
        "word_tokens": num_tokens,
        "word_types": num_types
    }

# Compute stats
child_stats = compute_summary(child_df)
mother_stats = compute_summary(cds_df)

# Combine into DataFrame for display
summary_df = pd.DataFrame([child_stats, mother_stats], index=["Child", "Mother"])
summary_df
```
</details>

In [None]:
# 🧠 Excellent! As we saw, the mother speaks more often than the child, uses longer utterances,
# and draws from a wider vocabulary — exactly what one might expect in child-directed speech.

# 📈 In Jones & Rowland’s simulations, the key focus was on vocabulary growth — specifically,
# how *new word types* are learned over time.

# 🧾 Your task:
# Let’s now write a function that tracks *new word types* as they appear in the dataset.
# For each utterance, the function should identify which words are being used for the *first time* in the conversation.

# 🛠️ What counts as a "word"?
# For our purposes, each bracketed unit in the phonemic transcription (e.g., (b æ n ə n ə))
# is treated as one "word".

# 👇 Step-by-step instructions:
# - Define a function `add_new_words_column()` that:
#     • Takes a DataFrame with a column of utterance phonemes
#     • Extracts the bracketed word units
#     • Tracks which words have already been seen
#     • Adds a new column showing which words are *new* in each row
# - Apply the function to both `child_df` and `cds_df`
# - Preview the result to check that it’s working

# ✏️ INSERT YOUR CODE BELOW


<details>
  <summary>Reveal a solution</summary>
  
```python
# Note: This is one possible solution, but other valid approaches may exist!

import pandas as pd
import re


def add_new_words_column(df, phoneme_col='utterance_phonemes', new_col='new_words'):
    if df[phoneme_col].isna().any():
        raise ValueError(f"Missing values in '{phoneme_col}'")

    def extract_words(s):
        return re.findall(r"\((.*?)\)", s)

    seen_global = set()
    new_lists   = []

    for utter in df[phoneme_col]:
        current     = extract_words(utter)

        current_unique = list(dict.fromkeys(current))

        new_in_row = [w for w in current_unique if w not in seen_global]
        seen_global.update(current_unique)
        new_lists.append(new_in_row)

    df[new_col] = new_lists
    return df

# Apply to both child and mother speech
child_df = add_new_words_column(child_df)
cds_df = add_new_words_column(cds_df)

# Preview the result
child_df[['utterance_phonemes', 'new_words']].iloc[0:10]
```
</details>

In [None]:
# Great, let's double check that the new column only contains unique words not produced in previous utterances
def check_global_uniqueness(df, col_name):
    """
    Checks if all entries in a column of lists are globally unique (i.e., no duplicates across rows).

    Parameters:
    - df: pandas DataFrame
    - col_name: name of the column containing lists of items (e.g., new_words or new_chunks)

    Returns:
    - None, but prints a summary and whether duplicates were found
    """
    # Flatten all lists into a single list
    all_items = [item for items in df[col_name] for item in items]

    # Compute total and unique counts
    total_count = len(all_items)
    unique_count = len(set(all_items))

    # Print result
    if total_count == unique_count:
        print(f"✅ No duplicates in column '{col_name}' — all entries are globally unique.")
    else:
        print(f"❌ Duplicates found in column '{col_name}'.")
        print(f"Total entries: {total_count}")
        print(f"Unique entries: {unique_count}")
        print(f"Duplicates: {total_count - unique_count}")


print("Check on child_df...")
check_global_uniqueness(child_df, "new_words")
print("Check on cds_df...")
check_global_uniqueness(cds_df, "new_words")

In [None]:
# 🗂️ Almost there! To mirror Jones & Rowland’s vocabulary-growth plot, we now need a sense of *time*.

# 📈 In their simulations, the “quantity” effect emerged very early.
# To capture such early differences, we’ll divide each conversation into a *large* number of
# equally sized stages — say, 100.  That way, we can inspect vocabulary growth from the
# very first few utterances right through to the end.

# 🧾 Pro tips:
# 1️⃣ Add an "utterance_id" column to each DataFrame that simply numbers the utterances in order (starting at 1).
# 2️⃣ Use `pd.qcut()` to split those IDs into 100 equal-sized *stages*.
#    • Label the stages 1-to-100 (tip: `labels=False` gives 0-based labels; add 1 afterward).
# 3️⃣ Store the stage labels in a new column called "stage".
# 4️⃣ Preview the updated child DataFrame to check your work.

# ✏️ INSERT YOUR CODE BELOW



<details>
  <summary>Reveal a solution</summary>
  
```python
# Note: This is one possible solution, but other valid approaches may exist!

child_df["utterance_id"] = range(1, len(child_df) + 1)
child_df["stage"] = pd.qcut(child_df["utterance_id"], q=100, labels=False) + 1

cds_df["utterance_id"] = range(1, len(cds_df) + 1)
cds_df["stage"] = pd.qcut(cds_df["utterance_id"], q=100, labels=False) + 1

child_df
```
</details>
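
If you want to convince yourself that `pd.qcut()` really produces equal-sized stages, the sketch below checks it on a toy table. The corpus size of 1,000 rows is an invented stand-in, chosen so that it divides evenly by 100.

```python
import pandas as pd

# Toy stand-in for a corpus of 1,000 utterances (invented size)
toy_df = pd.DataFrame({"utterance_id": range(1, 1001)})

# labels=False returns 0-based bin numbers, so add 1 to get stages 1..100
toy_df["stage"] = pd.qcut(toy_df["utterance_id"], q=100, labels=False) + 1

# Count how many utterances fall into each stage
stage_sizes = toy_df["stage"].value_counts().sort_index()
print(stage_sizes.min(), stage_sizes.max())          # smallest and largest stage size
print(toy_df["stage"].min(), toy_df["stage"].max())  # first and last stage label
```

When the number of utterances is not a multiple of 100, expect the stage sizes to differ slightly from one another rather than being identical.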

In [None]:
# 📊 Time to watch production vocabulary grow!

# We’ve marked which word types are *new* in each utterance and binned the conversation
# into 100 stages.  Now we want to count how the *cumulative* number of unique word
# types increases across those stages, just as Jones & Rowland did.

# 🧾 Your task:
# 1️⃣ Write a function `compute_cumulative_new_words(df, stage_col='stage', words_col='new_words')`
#    that returns a list of length 100, where each element is the total number of
#    *unique* word types encountered up to (and including) that stage.
#    • Hint: `df.explode(words_col)` will expand the list column into one word per row.
#    • Drop rows where the exploded value is NaN (utterances with zero new words).
#    • Iterate through stages 1…100, keep a running `set()` of words you have “seen”.
# 2️⃣ Use the function to compute two lists:
#       • `stage_counts_cds`   – mother’s cumulative words produced
#       • `stage_counts_child` – child’s cumulative words produced

# 👇 Skeleton to get you started:
#
# def compute_cumulative_new_words(df, stage_col='stage', words_col='new_words'):
#     # YOUR CODE HERE
#     return cumulative_counts
#
# stage_counts_cds   = compute_cumulative_new_words(cds_df)
# stage_counts_child = compute_cumulative_new_words(child_df)

# ✏️ INSERT YOUR CODE BELOW


<details>
  <summary>Reveal a solution</summary>
  
```python
# Note: This is one possible solution, but other valid approaches may exist!

def compute_cumulative_new_words(df, stage_col='stage', words_col='new_words'):
    # Explode and drop rows with no new words
    df_exploded = df.explode(words_col).dropna(subset=[words_col])
    
    cumulative_counts = []
    seen_words = set()
    
    for stage in sorted(df[stage_col].unique()):
        current_words = set(df_exploded[df_exploded[stage_col] == stage][words_col])
        seen_words.update(current_words)
        cumulative_counts.append(len(seen_words))
    
    return cumulative_counts


import pandas as pd
import matplotlib.pyplot as plt

# Compute cumulative counts
stage_counts_cds = compute_cumulative_new_words(cds_df)
stage_counts_child = compute_cumulative_new_words(child_df)

# Preview child counts for the first 20 stages
stage_counts_child[0:20]
```
</details>
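
Because every word appears in `new_words` only once (in the row where it first occurs), the cumulative curve can also be obtained without a running set: count the new words per stage, then take a cumulative sum. The sketch below shows this alternative on invented toy data; it assumes the column is globally unique, as verified by `check_global_uniqueness()` earlier, and that every stage contains at least one row.

```python
import pandas as pd

# Toy data (invented): three stages, each row lists the words first seen there
toy_df = pd.DataFrame({
    "stage": [1, 1, 2, 3, 3],
    "new_words": [["a", "b"], [], ["c"], [], ["d", "e"]],
})

# Number of new words per row, summed within each stage, then accumulated
per_stage = toy_df["new_words"].apply(len).groupby(toy_df["stage"]).sum()
cumulative_counts = per_stage.cumsum().tolist()
print(cumulative_counts)
```

If the column did contain repeated words, this shortcut would over-count, whereas the set-based loop in the solution above would not.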

In [None]:
# Let's run some basic sanity checks on the cumulative counts
def _quick_check(df, counts):
    assert len(counts) == df['stage'].nunique() == 100, "Should have 100 stage counts"
    assert all(x <= y for x, y in zip(counts, counts[1:])), "Counts must be non-decreasing"
    final_unique = len(set().union(*df['new_words']))
    assert counts[-1] == final_unique, "Final count should equal total unique word types"

_quick_check(cds_df, stage_counts_cds)
_quick_check(child_df, stage_counts_child)
print("✅  Cumulative counts look consistent!")

In [None]:
# 📈 With `stage_counts_cds` (mother) and `stage_counts_child` (child) ready,
# it’s time to plot the cumulative vocabulary growth for both speakers.

# 🧾 Your task:
# 1️⃣ Use Matplotlib to plot two lines:
#     • Mother (CDS)
#     • Child
# 2️⃣ The x-axis should be the stage number (1 → 100).
# 3️⃣ Label the axes clearly:
#     • x-label: 'Stage'
#     • y-label: e.g., 'Cumulative Count of Unique Words Produced'
# 4️⃣ Add a legend identifying the two curves.
# 5️⃣ For readability, set the x-tick labels every 10 stages (0, 10, 20, …, 100).
# 6️⃣ Include a grid and call `plt.tight_layout()` before displaying with `plt.show()`.

# ✏️ INSERT YOUR CODE BELOW


<details>
  <summary>Reveal a solution</summary>
  
```python
# Note: This is one possible solution, but other valid approaches may exist!

import pandas as pd
import matplotlib.pyplot as plt

# Plot
plt.plot(range(1, len(stage_counts_cds) + 1), stage_counts_cds, marker='o', label='CDS')
plt.plot(range(1, len(stage_counts_child) + 1), stage_counts_child, marker='s', label='Child')

plt.xlabel('Stage')
plt.ylabel('Cumulative Count of Unique Words Produced')
plt.xticks(range(0, len(stage_counts_cds) + 1, 10))
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
```
</details>

<br>

---

### Section 3: Run Mini-CLASSIC

**Objective**
Train the Mini-CLASSIC learner on Becky’s maternal input, inspect what it learns, and tune its hyper-parameters.

**Required Python Knowledge**

* Calling functions with keyword arguments and reproducible seeds
* Pandas column wrangling (`assign`, `range`, `qcut`)
* Basic Matplotlib line plotting (multiple series, legends, axis labels, grids)
* (optional) RMSE calculation with NumPy


In [None]:
# 🤖 Time to let Mini-CLASSIC learn!

# So far, we’ve plotted *produced* word types for mother and child.
# Next, we’ll see how our computational model learns from the mother’s input.

# ⏳ Heads-up: Depending on connection speed, running the model may take a couple of minutes.

# 🧾 Your task:
# 1️⃣ Call `run_chunking_model()` on the mother’s phonemic utterances:
#        • input          → `cds_df["utterance_phonemes"]`
#        • learning_rate  → 1
#        • final_bias_strength → 0
#        • set_seed       → 1234  (makes results reproducible)
# 2️⃣ Store the result in a DataFrame named `model_df`.
# 3️⃣ Display `model_df` to inspect what the function returns

# ✏️ INSERT YOUR CODE BELOW


<details>
  <summary>Reveal a solution</summary>
  
```python
model_df = run_chunking_model(cds_df["utterance_phonemes"], learning_rate=1, final_bias_strength=0, set_seed=1234)
model_df
```
</details>

In [None]:
# Great, now let's add the stage column to model_df
# and plot the model's acquired vocabulary alongside the mother's and the child's

# Add Stage
model_df["utterance_id"] = range(1, len(model_df) + 1)
model_df["stage"] = pd.qcut(model_df["utterance_id"], q=100, labels=False) + 1

# Compute cumulative counts for the model
stage_counts_model = compute_cumulative_new_words(model_df)

# Plot
plt.plot(range(1, len(stage_counts_cds) + 1), stage_counts_cds, marker='o', label='CDS')
plt.plot(range(1, len(stage_counts_child) + 1), stage_counts_child, marker='s', label='Child')
plt.plot(range(1, len(stage_counts_model) + 1), stage_counts_model, marker='^', label='Model')

plt.xlabel('Stage')
plt.ylabel('Cumulative Count of Unique Words Produced')
plt.xticks(range(0, len(stage_counts_cds) + 1, 10))
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

In [None]:
# 🛠️ Model tuning time!

# With `learning_rate = 1` and `final_bias_strength = 0`, the model is acquiring more
# word types than the child.  Before we explore quantity and diversity manipulations, we need
# a *baseline* model that captures the child’s vocabulary growth as closely as possible.

# 🧾 Your task:
# 1️⃣ Go back two cells, and experiment with several values for `learning_rate` (e.g., 0.2, 0.5, 0.6, 0.7, 0.8)
#    and `final_bias_strength` (e.g., 0.5, 1.0, 1.5, 2).
# 2️⃣ For each parameter pair, replot the model curve together with the mother and child curves.
# 3️⃣ Visually identify the parameter combination that brings the model curve closest to the child curve.
# 4️⃣ When you’re satisfied, keep the *best-fitting* parameter set for later analyses
#     and store the corresponding model output in `model_df`.

# 👉 Hints:
# - Use a small grid search (a handful of combos is fine for now).
# - You might judge “fit” by eye, or compute the Root Mean Squared Error (RMSE) between the model and child curves.

# ✏️ CONTINUE WITH NEXT CELL WHEN SATISFIED WITH MODEL FIT TO CHILD DATA

<details>
  <summary>Reveal a solution</summary>
  
```python
# This is a reasonably good fit to child data
model_df = run_chunking_model(cds_df["utterance_phonemes"], learning_rate=0.6, final_bias_strength=1, set_seed=1234)
model_df
```
</details>
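
If you would rather score the fit numerically than judge it by eye, the sketch below shows one way to compute RMSE between two curves and pick the best-scoring parameter pair. The curves are invented numbers rather than real model output, and `rmse` and `candidate_curves` are hypothetical names introduced for illustration.

```python
import numpy as np

def rmse(curve_a, curve_b):
    """Root mean squared error between two equal-length curves."""
    a = np.asarray(curve_a, dtype=float)
    b = np.asarray(curve_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("Curves must have the same length")
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Toy curves (invented numbers, not real model output)
child_curve = [10, 20, 30, 40]
candidate_curves = {
    # keys are (learning_rate, final_bias_strength)
    (0.6, 1.0): [12, 19, 31, 42],
    (1.0, 0.0): [20, 35, 50, 65],
}

# Lower RMSE means the candidate curve lies closer to the child curve
scores = {params: rmse(curve, child_curve) for params, curve in candidate_curves.items()}
best_params = min(scores, key=scores.get)
print(best_params, round(scores[best_params], 3))
```

In the notebook, each candidate curve would come from `compute_cumulative_new_words()` applied to the output of `run_chunking_model()` for that parameter pair (after adding the `stage` column), and `child_curve` would be `stage_counts_child`.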

<br>

---

### Section 4: Quantity and Diversity Input

**Objective**
Import two experimentally manipulated versions of the mother’s speech—a *quantity* boost and a *diversity* boost—and create a concise quantitative profile for each.

**Required Python knowledge**

* Reading remote Excel files with **`pd.read_excel()`** and adding basic identifier columns (`range`, `.assign`)
* Binning rows with **`pd.qcut()`** and using **`groupby()`** to aggregate stage-wise counts
* Extracting words via regular expressions, plus list handling and set logic
* Writing small helper functions
* Joining, renaming and generally tidying DataFrames


In [None]:
# 🌱 New input conditions: QUANTITY vs DIVERSITY

# We’ve tuned a baseline Mini-CLASSIC model on the original maternal input.
# Now we’ll import two *manipulated* versions of that input:
#
# 1️⃣ Quantity boost  – same vocabulary, but many utterances repeated → more *tokens*.
# 2️⃣ Diversity boost – every other utterance swapped for one of the same length
#                      from another mother but containing *new* words → more *types*.
#
# 🔗 Data sources:
quantity_url  = "https://github.com/francescocabiddu/MEDALWorkshopChildComputational/raw/refs/heads/main/becky_cds_quantity.xlsx"
diversity_url = "https://github.com/francescocabiddu/MEDALWorkshopChildComputational/raw/refs/heads/main/becky_cds_diversity.xlsx"

# 🧾 Your task:
# 1️⃣ Read each Excel file into its own DataFrame:
#       • call them `cds_quantity_df` and `cds_diversity_df`
# 2️⃣ Preview each DataFrame to make sure everything looks right.
#
# 📝 No “Pro Tips” this time — just reuse the code you wrote earlier!

# ✏️ INSERT YOUR CODE BELOW



<details>
  <summary>Reveal a solution</summary>
  
```python
# Read Excel files directly from URL
cds_quantity_df = pd.read_excel(quantity_url)
cds_diversity_df = pd.read_excel(diversity_url)

# Display one of the DFs
cds_quantity_df
```
</details>

In [None]:
# 📊 Summarise the three maternal input conditions

# We now have three DataFrames:
#   1️⃣ cds_df              – original maternal speech (baseline)
#   2️⃣ cds_quantity_df     – quantity-boost version
#   3️⃣ cds_diversity_df    – diversity-boost version
#
# 🧾 Your task:
# Compute (for each DataFrame) the following four statistics:
#   • total_utterances
#   • mean_utterance_length          (average words per utterance)
#   • word_tokens                    (all words, including repetitions)
#   • word_types                     (unique words only)
#
# ➡ Feel free to reuse whatever approach you used earlier for the
#    standard input — whether that was with helper functions or
#    inline code.  The goal is simply to produce the same numbers
#    for all three conditions.
#
# 1️⃣ Calculate the four stats for each DataFrame.
# 2️⃣ Combine the results into a single summary table (e.g., a
#    small pandas DataFrame) with rows labelled:
#         "Mother (standard)", "Mother – Quantity", "Mother – Diversity"
# 3️⃣ Display the table so you can compare how the manipulations
#    have changed tokens and types.
#
# ✏️ INSERT YOUR CODE BELOW



<details>
  <summary>Reveal a solution</summary>
  
```python
# Note: This is one possible solution, but other valid approaches may exist!

import re
import pandas as pd

# Function to extract words from a phonemic utterance like: (w ʌ n ə) (t ɛ d i)
def extract_words(utterance_str):
    return re.findall(r"\((.*?)\)", utterance_str)

# Helper function to compute summary stats for a dataframe
def compute_summary(df):
    # Check for missing utterances and raise an error if any are found
    if df['utterance_phonemes'].isna().any():
        raise ValueError("Missing utterances found in 'utterance_phonemes' column!")

    # Apply extraction directly to the column
    word_lists = df['utterance_phonemes'].apply(extract_words)

    total_utterances = len(word_lists)
    mean_length = word_lists.apply(len).mean()
    all_tokens = sum(word_lists, [])  # flatten list
    num_tokens = len(all_tokens)
    num_types = len(set(all_tokens))

    return {
        "total_utterances": total_utterances,
        "mean_utterance_length": round(mean_length, 2),
        "word_tokens": num_tokens,
        "word_types": num_types
    }

# Compute stats
cds_stats = compute_summary(cds_df)
cds_quantity_stats = compute_summary(cds_quantity_df)
cds_diversity_stats = compute_summary(cds_diversity_df)

# Combine into DataFrame for display
summary_df = pd.DataFrame([cds_stats,
                           cds_quantity_stats,
                           cds_diversity_stats],
                          index=["Mother (standard)", "Mother - Quantity",
                                 "Mother - Diversity"])
summary_df
```
</details>
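
To build intuition for what the two manipulations should do to these numbers, here is a small sketch on invented toy utterances: repeating utterances raises the token count but leaves the type count unchanged, whereas swapping utterances for ones with new words raises the type count at a constant token count. The helper name `count_tokens_and_types` is hypothetical.

```python
import re

def count_tokens_and_types(utterances):
    """Return (word tokens, word types) for a list of phonemic utterances."""
    words = [w for u in utterances for w in re.findall(r"\((.*?)\)", u)]
    return len(words), len(set(words))

# Toy inputs (invented for illustration)
standard = ["(m ɐ m i)", "(t ɛ d i)", "(m ɐ m i)", "(t ɛ d i)"]
# Quantity: the same utterances, with some repeated
quantity = standard + standard[:2]
# Diversity: every other utterance swapped for one containing a new word
diversity = ["(m ɐ m i)", "(b ɔː l)", "(m ɐ m i)", "(k æ t)"]

print(count_tokens_and_types(standard))
print(count_tokens_and_types(quantity))
print(count_tokens_and_types(diversity))
```

Your summary table for the real corpora should show the same qualitative pattern, although the exact sizes of the changes depend on how the manipulated files were built.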

In [None]:
# 📊 Stage-by-stage vocabulary profile

# So far, you’ve compared the three maternal input conditions (standard, quantity-boost,
# diversity-boost) with *global* statistics.  Now we want to zoom in and see how
# word tokens and word types accumulate at each of our 100 stages.
#
# 🧾 Your task:
# 1️⃣ **Add a “new_words” column and a “stage” column** to both `cds_quantity_df` and
#    `cds_diversity_df` (use the same methods you used earlier for the baseline `cds_df`).
#
# 2️⃣ Write a helper function called `summarise_new_words_and_tokens` that,
#    for a given DataFrame, returns a summary table with *one row per stage* and two columns:
#        • word_types        – number of *new* word types produced in that stage
#        • word_tokens       – total word tokens in that stage (incl. repetitions)
#
# 3️⃣ Apply the helper to all three DataFrames to obtain three per-stage tables.
#
# 4️⃣ Merge the results into a single table with columns:
#        stage,
#        cds_types,      cds_quantity_types,      cds_diversity_types,
#        cds_tokens,     cds_quantity_tokens,     cds_diversity_tokens
#
# 5️⃣ Display the table for the **first 10 stages** only (use `.head(10)`).
#
# 🪄 Pro Tips:
#   👉 Using .groupby():
#   When you want to compute something separately for each stage (like word counts),
#   you can use `.groupby('stage')` to group the DataFrame by the stage number.

#   Example:
#   for stage, group in df.groupby('stage'):
#       # 'stage' is the current stage number
#       # 'group' is a smaller DataFrame with only the rows for that stage

# - To count word types: for each stage, combine all the lists in the 'new_words' column.
# - To count tokens: extract all words from the 'utterance_phonemes' column and count how many there are.
# - You can rename columns using `.rename(columns={...})`.
# - To merge the tables: use `.join()` to combine them by their shared 'stage' index.
# - To reorder columns: make a list with the order you want and use `df = df[that_list]`.
#
#
# ✏️ INSERT YOUR CODE BELOW



<details>
  <summary>Reveal a solution</summary>
  
```python
# Note: This is one possible solution, but other valid approaches may exist!

import pandas as pd
import re

# ---------- helpers ----------
def extract_words(utterance_str):
    """Return list of bracketed words from a phonemic utterance."""
    return re.findall(r"\((.*?)\)", utterance_str)

def summarise_new_words_and_tokens(df, stage_col='stage',
                                   new_words_col='new_words',
                                   phoneme_col='utterance_phonemes'):
    """
    Return a per-stage DataFrame with counts of unique new word types and total tokens.
    """
    unique_types = {}
    token_counts = {}
    
    for stage, grp in df.groupby(stage_col):
        # types introduced *in this stage*
        stage_new_types = set(sum(grp[new_words_col], []))
        unique_types[stage] = len(stage_new_types)
        
        # all tokens (including repetitions)
        token_list = sum(grp[phoneme_col].apply(extract_words), [])
        token_counts[stage] = len(token_list)
    
    summary = pd.DataFrame({
        'word_types': pd.Series(unique_types),
        'word_tokens': pd.Series(token_counts)
    }).sort_index()
    summary.index.name = 'stage'
    return summary

# ---------- add new_words column ----------
cds_quantity_df = add_new_words_column(cds_quantity_df)
cds_diversity_df = add_new_words_column(cds_diversity_df)

# ---------- add stage column ----------
cds_quantity_df["utterance_id"] = range(1, len(cds_quantity_df) + 1)
cds_quantity_df["stage"] = pd.qcut(cds_quantity_df["utterance_id"], q=100, labels=False) + 1
cds_diversity_df["utterance_id"] = range(1, len(cds_diversity_df) + 1)
cds_diversity_df["stage"] = pd.qcut(cds_diversity_df["utterance_id"], q=100, labels=False) + 1

# ---------- per-stage summaries ----------
std_stage  = summarise_new_words_and_tokens(cds_df)
qty_stage  = summarise_new_words_and_tokens(cds_quantity_df)
div_stage  = summarise_new_words_and_tokens(cds_diversity_df)

# rename columns
std_stage  = std_stage.rename(columns={'word_types':'cds_types',
                                       'word_tokens':'cds_tokens'})
qty_stage  = qty_stage.rename(columns={'word_types':'cds_quantity_types',
                                       'word_tokens':'cds_quantity_tokens'})
div_stage  = div_stage.rename(columns={'word_types':'cds_diversity_types',
                                       'word_tokens':'cds_diversity_tokens'})

# ---------- merge all three ----------
summary_by_stage = (
    std_stage
    .join(qty_stage, how='outer')
    .join(div_stage, how='outer')
    .reset_index()
)

# Re-order columns
summary_by_stage = summary_by_stage[[
    "stage",
    "cds_types", "cds_quantity_types", "cds_diversity_types",
    "cds_tokens", "cds_quantity_tokens", "cds_diversity_tokens"
]]

# Show the first 10 stages
summary_by_stage.head(10)
```
</details>
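
A quick way to check a per-stage table is to confirm that the new types summed over all stages equal the total number of types in the input. The sketch below does this on invented toy data with two stages; it uses nested comprehensions rather than `sum(..., [])` to flatten the lists, which is simply a stylistic alternative.

```python
import re
import pandas as pd

# Toy data (invented): four utterances split into two stages
toy_df = pd.DataFrame({
    "stage": [1, 1, 2, 2],
    "utterance_phonemes": ["(a) (b)", "(a)", "(b) (c)", "(c) (d)"],
    "new_words": [["a", "b"], [], ["c"], ["d"]],
})

rows = []
for stage, group in toy_df.groupby("stage"):
    # Word types introduced in this stage
    new_types = {w for words in group["new_words"] for w in words}
    # All word tokens in this stage, including repetitions
    tokens = [w for u in group["utterance_phonemes"] for w in re.findall(r"\((.*?)\)", u)]
    rows.append({"stage": stage, "word_types": len(new_types), "word_tokens": len(tokens)})

per_stage = pd.DataFrame(rows).set_index("stage")
print(per_stage)

# Sanity check: new types summed over stages should equal the overall number of types
all_words = [w for u in toy_df["utterance_phonemes"] for w in re.findall(r"\((.*?)\)", u)]
assert per_stage["word_types"].sum() == len(set(all_words))
```

The same check can be run on each of the three real per-stage tables before merging them.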

<br>

---

### Section 5: Testing the Effect of Quantity and Diversity

**Objective**
Run Mini-CLASSIC on the quantity-boost and diversity-boost corpora, and plot their cumulative vocabulary curves alongside the baseline model.

**Required Python knowledge**

* Re-using previously defined functions and fixed model parameters
* Calling custom functions on new datasets and storing their outputs
* Basic list slicing (`counts[:20]`) and range manipulation
* Multi-series plotting with Matplotlib: markers, line styles, legends and axis formatting
* Light DataFrame housekeeping (adding stage labels, passing data into helper functions)


In [None]:
# 🚀 Putting Mini-CLASSIC to the ultimate test!

# You now know exactly how the three maternal inputs differ (standard, quantity-boost, diversity-boost).
# Let’s see how those differences shape what the Mini-CLASSIC model learns.

# 🧾 Your task:
# 1️⃣ Run `run_chunking_model()` on the **quantity** input (`cds_quantity_df["utterance_phonemes"]`)
#    and on the **diversity** input (`cds_diversity_df["utterance_phonemes"]`) using **the same
#    parameters** you identified for the best-fitting baseline model (learning_rate, final_bias_strength, seed).
#
# 2️⃣ For each model run:
#    • Compute the cumulative vocabulary curve (unique word types learned)
#      – use the same helper code you wrote for `model_df` earlier.
#
# 3️⃣ Plot three curves on the same graph:
#    • Model trained on real CDS  (baseline model curve)
#    • Model trained on quantity-boost input
#    • Model trained on diversity-boost input
#
#    Label axes clearly (Stage vs Cumulative Unique Words), add a legend,
#    and keep x-ticks every 10 stages.
#
#
# 🪄 Pro Tips:
# • You already have code to generate `stage_counts_*` from a model DataFrame – reuse it!
# • To vary marker shapes: marker='o', marker='s', marker='^', marker='D'.
# • Keep the colour palette simple – Matplotlib will choose distinct colours by default.
# • If the curves overlap too much, try using dashed or dotted line styles
#   (`linestyle='--'`, `'-.'`, `':'`) in addition to different markers.
#
# ✏️ INSERT YOUR CODE BELOW


<details>
  <summary>Reveal a solution</summary>
  
```python
# Note: This is one possible solution, but other valid approaches may exist!

# Run models
print("Running Model - Diversity...")
model_diversity_df = run_chunking_model(cds_diversity_df["utterance_phonemes"], learning_rate=0.6, final_bias_strength=1, set_seed=1234)
print("Running Model - Quantity...")
model_quantity_df = run_chunking_model(cds_quantity_df["utterance_phonemes"], learning_rate=0.6, final_bias_strength=1, set_seed=1234)


# Add Stage
model_diversity_df["utterance_id"] = range(1, len(model_diversity_df) + 1)
model_diversity_df["stage"] = pd.qcut(model_diversity_df["utterance_id"], q=100, labels=False) + 1
model_quantity_df["utterance_id"] = range(1, len(model_quantity_df) + 1)
model_quantity_df["stage"] = pd.qcut(model_quantity_df["utterance_id"], q=100, labels=False) + 1

# Compute cumulative counts for the model
stage_counts_model_diversity = compute_cumulative_new_words(model_diversity_df)
stage_counts_model_quantity = compute_cumulative_new_words(model_quantity_df)

# Plot
plt.plot(range(1, len(stage_counts_model) + 1), stage_counts_model, marker='s', label='Model')
plt.plot(range(1, len(stage_counts_model_diversity) + 1), stage_counts_model_diversity, marker='^', label='Model - Diversity')
plt.plot(range(1, len(stage_counts_model_quantity) + 1), stage_counts_model_quantity, marker='o', label='Model - Quantity')

plt.xlabel('Stage')
plt.ylabel('Cumulative Count of Unique Words Learned')
plt.xticks(range(0, len(stage_counts_model) + 1, 10))
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
```
</details>

In [None]:
# 🔍 Zooming in: early learning differences

# The full cumulative plot showed that diversity drives stronger vocabulary growth overall,
# while quantity provides a more modest boost – exactly in line with the findings of
# Jones & Rowland (2017).  Well done on replicating that effect! 🎉
#
# There’s a subtler pattern, however:
# 📌 In the *very first* stages, extra quantity may give the model a short-term edge
#     over diversity – but it’s hard to see on the full-scale graph.
#
# To make this early effect clearer, we’ll zoom in on just the first 10 stages.

# 🧾 Your task:
# 1️⃣ Re-use the three cumulative curves you already computed:
#       • `stage_counts_model`            – baseline (real CDS)
#       • `stage_counts_model_quantity`   – quantity boost
#       • `stage_counts_model_diversity`  – diversity boost
#
# 2️⃣ Slice each list to keep only the first 10 stages, e.g.  `counts[:10]`.
#
# 3️⃣ Plot the three curves again on a single graph:
#
# 🪄 Pro tips:
# • Adjust the x-range to `range(1, 11)`.
# • Copy/paste your earlier plotting code and tweak the list slices and labels.
#
# ✏️ INSERT YOUR CODE BELOW


<details>
  <summary>Reveal a solution</summary>
  
```python
# Note: This is one possible solution, but other valid approaches may exist!

# Keep only the first 10 stages of each cumulative curve
stage_counts_model_10 = stage_counts_model[:10]
stage_counts_model_diversity_10 = stage_counts_model_diversity[:10]
stage_counts_model_quantity_10 = stage_counts_model_quantity[:10]

# Plot
plt.plot(range(1, len(stage_counts_model_10) + 1), stage_counts_model_10, marker='s', label='Model')
plt.plot(range(1, len(stage_counts_model_diversity_10) + 1), stage_counts_model_diversity_10, marker='^', label='Model - Diversity')
plt.plot(range(1, len(stage_counts_model_quantity_10) + 1), stage_counts_model_quantity_10, marker='o', label='Model - Quantity')

plt.xlabel('Stage')
plt.ylabel('Cumulative Count of Unique Words Learned')
plt.xticks(range(0, len(stage_counts_model_10) + 1, 1))
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
```
</details>
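
Besides inspecting the zoomed-in plot, you can locate the point at which one curve first moves ahead of another programmatically. The sketch below is a minimal illustration on invented numbers, not real model output; the helper name `first_stage_ahead` is hypothetical.

```python
# Toy cumulative curves (invented numbers, not real model output)
quantity_curve = [5, 9, 12, 14, 16, 18]
diversity_curve = [3, 7, 12, 16, 20, 25]

def first_stage_ahead(curve_a, curve_b):
    """Return the first 1-based stage at which curve_a exceeds curve_b, or None."""
    for stage, (a, b) in enumerate(zip(curve_a, curve_b), start=1):
        if a > b:
            return stage
    return None

# Stage at which the diversity curve first exceeds the quantity curve
crossover = first_stage_ahead(diversity_curve, quantity_curve)
print(crossover)
```

In the notebook, you would pass `stage_counts_model_diversity` and `stage_counts_model_quantity` instead of the toy lists. Note that this only reports the first stage at which one curve leads; it does not guarantee that the lead is kept afterwards, so it is worth checking against the plot.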

---
---

<br>

### 🚀 What We’ve Accomplished

You worked with Becky’s mother–child corpus, tagging every word at its first appearance, slicing the data into equal “learning” stages and plotting cumulative vocabulary growth for mother, child and three Mini-CLASSIC models. After tuning a baseline model on real child-directed speech, you trained it on two altered inputs—one that simply repeated existing utterances (quantity) and one that injected new ones that expand the vocabulary used (diversity). The curves reproduced Jones and Rowland’s key result: extra quantity gives a brief initial boost, but greater lexical diversity soon overtakes it and drives superior long-term vocabulary growth.

---

### 📚 Why These Findings Matter

1. **Quantity jump-starts learning, but diversity wins the long game**
   *Jones and Rowland (2017) reported precisely this with the full CLASSIC model; your Mini-CLASSIC run hits the same crossover.*

   * **Earliest stages (tiny lexicon):** more *tokens* mean more opportunities to reinforce the basic bi- and triphones from which words are built. Children (and chunking models) thrive on repetition when everything is still novel.
   * **Soon afterwards:** once a decent sub-lexical scaffold exists, hearing *different* words is far more informative than hearing the same few again. Diversity fuels the combinatorial explosion that lets a modest chunk inventory cover thousands of forms.

2. **A mechanistic link to processing-speed cascades in children**
   Weisleder & Fernald ([2013](https://doi.org/10.1177/0956797613488145)) showed that richer caregiver talk predicts faster word recognition later on.

   * **Why:** Diverse input → broader chunk inventory → novel words can be encoded in *fewer* chunks → faster processing → more cognitive resources to map sound to meaning.
   * Quantity alone deepens representations of the *same* lexemes; it cannot create the adaptable chunks that accelerate later learning.

3. **Methodological pay-off**
   Manipulating real corpora while holding other statistics constant is notoriously difficult; simulation lets us pull single causal levers cleanly. Mini-CLASSIC provides a sandbox for bold “what-if” tests, delivering proof-of-principle evidence that can guide empirical work in language acquisition!

---

### 🔮 Points for Reflection

* How would you **go about** repeating the analysis across *all* parent–child dyads in the Manchester corpus?
* We fixed one learning-rate / bias combination. Are the results **independent** of that choice? How could you extend the analysis to explore a parameter grid (see Jones & Rowland for their approach)?
* In this workshop, we’ve looked at the words learned by each model, but we haven’t yet examined the chunks themselves. Exploring them could help us better understand the quantity and diversity effects. For instance, the early advantage from quantity should arise because the model hears more repeated utterances, which give it more opportunities to assemble word representations from sublexical chunks. If that’s the case, we should also observe an early advantage in the learning of sublexical chunks. Think about how you could extract sublexical chunks from the Mini-CLASSIC model’s output (a sublexical chunk is any chunk that forms part of a word).

---

### 📝 Bonus Track

*Well done for completing this MEDAL workshop! Great work – now enjoy your musical reward!* 🎶

<audio controls>
  <source src="https://drive.google.com/uc?export=download&id=1uAyaNS1NfIqAC5Wc9xtesmUT3UctrpQU" type="audio/mpeg">
  Your browser does not support the audio element.
</audio>

 *by ChatGPT + SunoAI*

