**Author:** Girwan Dhakal  <br>
**Affiliation:** The University of Alabama, Department of Computer Science <br>
**Date:** 2025-09-23

<hr>

This Python notebook converts CHILDES data into the CSV format required by the CRF model accompanying Nikolaus et al. (2022).

In [1]:
!pip install childespy

Collecting childespy
  Downloading childespy-1.0.1-py3-none-any.whl.metadata (1.5 kB)
Downloading childespy-1.0.1-py3-none-any.whl (17 kB)
Installing collected packages: childespy
Successfully installed childespy-1.0.1


Import required packages

In [2]:
import childespy as cpy
import pandas as pd
import json
import os

  there is no package called ‘childesr’
(as ‘lib’ is unspecified)
Installing childesr...
	‘/tmp/Rtmp4R6Ssz/downloaded_packages’



To process a different CHILDES corpus, pass its name as the `corpus_name` argument of the function below.

In [3]:
def prepare_childes_csv(corpus_name):
    """
    Retrieves and processes speech data from the CHILDES corpus, then saves it as a CSV.

    Args:
        corpus_name (str): The name of the CHILDES corpus (e.g., 'Bliss').

    Returns:
        None. A CSV file named '<corpus_name>_data.csv' is saved to disk.
    """
    output_filename = corpus_name + "_data.csv"
    headers = ['utterance_id', 'transcript_file', 'child_id', 'age_months', 'tokens', 'pos', 'speaker_code']

    # Get utterance data
    utterances_df = cpy.get_utterances(corpus=[corpus_name])
    if utterances_df.empty:
        print(f"No utterances found for corpus '{corpus_name}'")
        return
    utterances_df.rename(columns={'id': 'utterance_id'}, inplace=True)

    # Get transcript metadata
    transcripts_df = cpy.get_transcripts(corpus=corpus_name)
    if transcripts_df.empty:
        print(f"No transcripts found for corpus '{corpus_name}'. Cannot determine transcript file names.")
        return

    # Merge utterances with transcript metadata
    merged_utterances_df = pd.merge(utterances_df, transcripts_df, on='transcript_id')

    # Get tokens and POS tags
    tokens_df = cpy.get_tokens(corpus=corpus_name, token="%")
    if tokens_df.empty:
        print(f"No tokens found for corpus '{corpus_name}'. This might indicate an issue or empty corpus.")
        return

    # Group tokens and POS by utterance
    aggregated_tokens = tokens_df.groupby('utterance_id').agg({
        'gloss': list,
        'part_of_speech': list
    }).reset_index()

    # Merge tokens into the main DataFrame
    final_merged_df = pd.merge(merged_utterances_df, aggregated_tokens, on='utterance_id', how='inner')

    # Rename key columns for consistency
    rename = {
        'part_of_speech_y': 'pos',
        'transcript_id': 'transcript_file',
        'target_child_age_x': 'age_months',
        'gloss_y': 'tokens',
        'target_child_id_x': 'child_id'
    }
    final_merged_df.rename(columns=rename, inplace=True)

    # Keep only needed columns
    final_merged_df = final_merged_df[headers]
    final_df = final_merged_df.copy()

    # Convert lists to JSON strings for CSV compatibility
    final_df['tokens'] = final_df['tokens'].apply(json.dumps)
    final_df['pos'] = final_df['pos'].apply(json.dumps)

    # Save to CSV
    print(f"Saving data to {output_filename}...")
    final_df.to_csv(output_filename, index=False)
    print(f"Data successfully saved to {output_filename}")
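The rename mapping in the function relies on the default suffixes that pandas adds during a merge: when both frames share a non-key column name, the left copy gets `_x` and the right copy gets `_y`. The sketch below reproduces the grouping and merge steps on toy frames to show where names such as `gloss_y` and `part_of_speech_y` come from. The values are invented for illustration and are not CHILDES data.

```python
import pandas as pd

# Toy stand-ins for the utterance and token tables (invented values).
utterances = pd.DataFrame({
    "utterance_id": [1, 2],
    "gloss": ["the dog runs", "hi"],
    "part_of_speech": ["det n v", "co"],
})
tokens = pd.DataFrame({
    "utterance_id": [1, 1, 1, 2],
    "gloss": ["the", "dog", "runs", "hi"],
    "part_of_speech": ["det", "n", "v", "co"],
})

# Collect each utterance's tokens and POS tags into lists.
aggregated = tokens.groupby("utterance_id").agg(
    {"gloss": list, "part_of_speech": list}
).reset_index()

# Overlapping non-key columns receive the default suffixes
# _x (left frame) and _y (right frame).
merged = pd.merge(utterances, aggregated, on="utterance_id", how="inner")
print(merged.columns.tolist())
```

The same mechanism explains `target_child_age_x` and `target_child_id_x`: those suffixes would arise in the earlier merge if the utterance and transcript tables both carry those columns.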



Run the function on the selected corpora.

In [5]:
# Modify this list to match your project's requirements
corpora = [
    "Bliss",
    "EisenbergGuo",
    "ENNI",
    "Hargrove",
    "Rescorla",
    "UCSD",
    "Conti1",
    "Conti2",
    "Conti3",
    "EllisWeismer"
]


for corpus in corpora:
    prepare_childes_csv(corpus)








Saving data to Bliss_data.csv...
Data successfully saved to Bliss_data.csv
Saving data to EisenbergGuo_data.csv...
Data successfully saved to EisenbergGuo_data.csv
Saving data to ENNI_data.csv...
Data successfully saved to ENNI_data.csv
Saving data to Hargrove_data.csv...
Data successfully saved to Hargrove_data.csv
Saving data to Rescorla_data.csv...
Data successfully saved to Rescorla_data.csv
Saving data to UCSD_data.csv...
Data successfully saved to UCSD_data.csv
Saving data to Conti1_data.csv...
Data successfully saved to Conti1_data.csv
Saving data to Conti2_data.csv...
Data successfully saved to Conti2_data.csv
Saving data to Conti3_data.csv...
Data successfully saved to Conti3_data.csv
Saving data to EllisWeismer_data.csv...
Data successfully saved to EllisWeismer_data.csv
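
Because `tokens` and `pos` are written as JSON strings, they need to be decoded when a saved file is read back. The sketch below shows one way to do this with the `converters` argument of `pd.read_csv`. It uses a hypothetical two-row sample written to an in-memory buffer, with the same column layout as the saved files, so it runs without any CHILDES download.

```python
import io
import json

import pandas as pd

# Hypothetical sample in the same column layout as the saved files.
sample = pd.DataFrame({
    "utterance_id": [1, 2],
    "transcript_file": [10, 10],
    "child_id": [5, 5],
    "age_months": [30.0, 30.0],
    "tokens": [json.dumps(["the", "dog"]), json.dumps(["hi"])],
    "pos": [json.dumps(["det", "n"]), json.dumps(["co"])],
    "speaker_code": ["CHI", "MOT"],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# Decode the JSON-encoded columns back into Python lists while reading.
loaded = pd.read_csv(
    buffer,
    converters={"tokens": json.loads, "pos": json.loads},
)
print(loaded.loc[0, "tokens"])
```

To read one of the files produced above, pass its path (for example `"Bliss_data.csv"`) in place of `buffer`.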
