# Prepare BabyLM Evaluation Pipeline Data

This notebook converts the BabyLM evaluation data to phonemes. First, download the [evaluation data](https://github.com/codebyzeb/evaluation-pipeline-2024?tab=readme-ov-file) used in the [BabyLM evaluation pipeline](https://github.com/codebyzeb/evaluation-pipeline-2024?tab=readme-ov-file), then run this notebook. After conversion, the phonemized data was copied into the [forked version](https://github.com/codebyzeb/evaluation-pipeline-2024?tab=readme-ov-file) of the pipeline used in the TransformerSegmentation project.
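
As a rough illustration of what the conversion does to a single JSONL record, the sketch below replaces selected text fields in place and leaves everything else untouched. The record, the `convert_record` helper, and `fake_phonemize` are all made up for this example; `fake_phonemize` is only a stand-in that upper-cases text so the snippet runs without espeak, and it is not the project's phonemizer.

```python
import json


def fake_phonemize(utterances):
    """Hypothetical stand-in for a phonemizer: upper-cases each utterance."""
    return [u.upper() for u in utterances]


def convert_record(record, text_keys, phonemize=fake_phonemize):
    """Return a copy of `record` with the listed text fields converted."""
    converted = dict(record)
    # Only touch the keys that this record actually contains
    present = [k for k in text_keys if k in converted]
    outputs = phonemize([converted[k] for k in present])
    for k, out in zip(present, outputs):
        converted[k] = out
    return converted


# A made-up record in the spirit of a minimal-pair file (not real evaluation data)
line = '{"sentence_good": "the cat sleeps", "sentence_bad": "the cat sleep", "id": 1}'
record = convert_record(json.loads(line), ["sentence_good", "sentence_bad"])
print(json.dumps(record, ensure_ascii=False))
```

Non-text fields such as `id` pass through unchanged, which is why the converted files can stand in for the originals in the evaluation pipeline.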

In [7]:
import json
import os
import sys

# Path to the espeak-ng shared library (a MacPorts location on macOS; adjust for your system)
os.environ['PHONEMIZER_ESPEAK_LIBRARY'] = '/opt/local/lib/libespeak-ng.dylib'
sys.path.append('../../')
from src.phonemize import phonemize_utterances

INPUT_DIR = "evaluation_data"
OUTPUT_DIR = "evaluation_data_phonemized"
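
Before running the conversion, it may help to confirm that the downloaded data sits where the notebook expects it. The check below is a minimal sketch, assuming the three subfolders that the conversion loop walks live directly under `evaluation_data`; `missing_folders` is a hypothetical helper, not part of the project.

```python
from pathlib import Path


def missing_folders(input_dir, folders):
    """Return the names in `folders` that are not directories under `input_dir`."""
    root = Path(input_dir)
    return [name for name in folders if not (root / name).is_dir()]


expected = ["blimp_filtered", "glue_filtered", "supplement_filtered"]
missing = missing_folders("evaluation_data", expected)
if missing:
    print(f"Missing under evaluation_data: {missing}")
```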

In [11]:
keys = ['sentence_good', 'sentence_bad', 'sentence', 'question', 'passage', 'premise', 'hypothesis', 'sentence1', 'sentence2', 'paragraph', 'answer', 'question1', 'question2', 'text', 'span1_text', 'span2_text']
folders = ['blimp_filtered', 'glue_filtered', 'supplement_filtered']
#folders = ['supplement_filtered']

for folder in folders:

    print(f"\n----------\n----------\nPhonemizing {folder}\n----------\n----------\n")

    files = []
    for root, _, filenames in os.walk(f'{INPUT_DIR}/{folder}'):
        for filename in filenames:
            if filename.endswith('.jsonl'):
                files.append(os.path.join(root, filename))

    for file in files:
        print(f"----------------\nPhonemizing {file}")

        # Read with an explicit encoding; each line of a .jsonl file is one JSON object
        with open(file, 'r', encoding='utf-8') as f:
            data = [json.loads(line) for line in f]

        # Collect every key that appears in at least one line of this file
        data_keys = {k for line in data for k in line}

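        for key in keys:
            if key in data_keys:
                # Only collect lines that contain this key, so mixed-schema files do not raise KeyError
                sentences = [line[key] for line in data if key in line]
                phonemized = phonemize_utterances(sentences, keep_word_boundaries=False)
                if len(phonemized) != len(sentences):
                    # Outputs cannot be aligned with inputs, so leave this key unphonemized
                    failed = len(sentences) - len(phonemized)
                    print(f"Failed to phonemize {failed} sentences ({failed / len(sentences) * 100:.2f}%) out of {len(sentences)} total sentences")
                    continue
                # Write results back in the same order the sentences were collected
                i = 0
                for line in data:
                    if key in line:
                        line[key] = phonemized[i]
                        i += 1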

        # Save the phonemized data
        filename = os.path.basename(file)  # portable across path separators
        os.makedirs(f'{OUTPUT_DIR}/{folder}', exist_ok=True)
        with open(f'{OUTPUT_DIR}/{folder}/{filename}', 'w', encoding='utf-8') as f:
            for line in data:
                f.write(json.dumps(line, ensure_ascii=False) + '\n')
        
    print(f"Done phonemizing {folder}")


----------
----------
Phonemizing blimp_filtered
----------
----------

----------------
Phonemizing evaluation_data/blimp_filtered/ellipsis_n_bar_2.jsonl
Phonemizing using language "EnglishNA"...
Using espeak backend with language code "en-us"...
Phonemizing using language "EnglishNA"...
Using espeak backend with language code "en-us"...
----------------
Phonemizing evaluation_data/blimp_filtered/principle_A_case_2.jsonl
Phonemizing using language "EnglishNA"...
Using espeak backend with language code "en-us"...
Phonemizing using language "EnglishNA"...
Using espeak backend with language code "en-us"...
----------------
Phonemizing evaluation_data/blimp_filtered/existential_there_quantifiers_1.jsonl
Phonemizing using language "EnglishNA"...
Using espeak backend with language code "en-us"...
Phonemizing using language "EnglishNA"...
Using espeak backend with language code "en-us"...
----------------
Phonemizing evaluation_data/blimp_filtered/causative.jsonl
Phonemizing using language 