# Parsing Script for CMU Spanish Dictionary

This script contains functions to:
- Split the Spanish CMU dictionary dataset into chunks of adequate size for upload to the [Latin American Spanish](https://www.bcbl.eu/databases/espal/index.php) database
- Obtain all unique phonemes present in the Spanish dictionary

In [1]:
import os
import re
import pandas as pd
import numpy as np

In [2]:
dictfile = "cmudict_es.txt"

### Word Chunks

Splits the dictionary into text files of at most 5,000 words each, one word per line, with pronunciations omitted. These files are used as input to the Latin American Spanish database.

In [3]:
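words = []

# Collect each word and its pronunciation, skipping blank lines
with open(dictfile, encoding='utf-8') as file:
    for line in file:
        parts = line.split()
        if not parts:
            continue  # a blank line would otherwise raise IndexError on parts[0]
        word = parts[0]
        # Keep phonemes space-separated so their boundaries are not lost
        pronunciation = ' '.join(parts[1:])
        words.append((word, pronunciation))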

output_dir = 'wordonly'
os.makedirs(output_dir, exist_ok=True)

# Write at most 5,000 words per file; ceiling division gives the number of files
max_lines_per_file = 5000
num_files = (len(words) + max_lines_per_file - 1) // max_lines_per_file

for i in range(num_files):
    start_index = i * max_lines_per_file
    end_index = min(start_index + max_lines_per_file, len(words))
    
    output_file = os.path.join(output_dir, f"cmudict_es_wordonly_part{i+1}.txt")
    
    with open(output_file, 'w', encoding='utf-8') as file:
        for word, _ in words[start_index:end_index]:
            file.write(word + "\n")
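The chunking step can also be written as a small reusable helper. The following is a minimal sketch, not the notebook's own code: the function name `write_word_chunks`, its parameters, and the `palabra…` toy entries are all hypothetical and chosen only for illustration.

```python
from pathlib import Path
import tempfile


def write_word_chunks(words, output_dir, chunk_size=5000,
                      prefix="cmudict_es_wordonly_part"):
    """Write `words` to numbered text files of at most `chunk_size` words each.

    Hypothetical helper: returns the list of paths written, in order.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    # Stepping through the list by chunk_size avoids computing the file count up front
    for part, start in enumerate(range(0, len(words), chunk_size), start=1):
        path = output_dir / f"{prefix}{part}.txt"
        chunk = words[start:start + chunk_size]
        path.write_text("\n".join(chunk) + "\n", encoding="utf-8")
        paths.append(path)
    return paths


# Toy demonstration with made-up entries: 12 words in chunks of 5
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = write_word_chunks([f"palabra{i}" for i in range(12)], tmp, chunk_size=5)
    demo_names = [p.name for p in demo_paths]
    demo_sizes = [len(p.read_text(encoding="utf-8").splitlines()) for p in demo_paths]
```

With 12 words and a chunk size of 5, the last file holds the 2 leftover words, which mirrors how the final part of the real dictionary will usually be shorter than 5,000 lines.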

### Data Parsing

Transforms the dataset into a pandas DataFrame with one row per dictionary entry, indexed by word. Column `p1` holds the first phoneme of the pronunciation, `p2` the second, and so on up to the length of the longest pronunciation. Words with fewer phonemes than there are columns have missing values (shown as NaN) in the remaining cells.

In [4]:
data = []
unique_phonemes = set()
max_phonemes = 0

with open(dictfile, encoding='utf-8') as file:
    for line in file:
        # Split on whitespace: the first token is the word, the rest are its phonemes
        parts = line.split()
        if not parts:
            continue  # skip blank lines, which would otherwise raise IndexError
        word = parts[0]
        phonemes = parts[1:]
        
        # Add phonemes to the set of unique phonemes
        unique_phonemes.update(phonemes)
        
        # Track the maximum number of phonemes for column count
        max_phonemes = max(max_phonemes, len(phonemes))
        data.append([word] + phonemes)

# Generate column names for phonemes (p1, p2, ..., pn)
columns = [f'p{i}' for i in range(1, max_phonemes + 1)]

# Rows shorter than max_phonemes are padded with missing values by pandas
df = pd.DataFrame(
    [phonemes for _, *phonemes in data],
    index=[word for word, *_ in data],
    columns=columns,
)

In [5]:
df.head()

word,p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,...,p14,p15,p16,p17,p18,p19,p20,p21,p22,p23
a,a,,,,,,,,,,...,,,,,,,,,,
aaron,a,a,r,o,n,,,,,,...,,,,,,,,,,
ab,a,B,,,,,,,,,...,,,,,,,,,,
abajo,a,B,a,x,o,,,,,,...,,,,,,,,,,
abandona,a,B,a,n,d,o,n,a,,,...,,,,,,,,,,
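To make the padding behaviour concrete, here is a small self-contained sketch using three pronunciations taken from the preview above. The variable names (`toy_rows`, `toy_df`, `phoneme_counts`) are illustrative and not part of the notebook.

```python
import pandas as pd

# Three entries copied from the preview; each has a different number of phonemes
toy_rows = {
    "a": ["a"],
    "aaron": ["a", "a", "r", "o", "n"],
    "ab": ["a", "B"],
}

# One column per phoneme position, up to the longest pronunciation
width = max(len(phonemes) for phonemes in toy_rows.values())
toy_columns = [f"p{i}" for i in range(1, width + 1)]

# pandas pads the shorter rows with missing values
toy_df = pd.DataFrame(list(toy_rows.values()), index=list(toy_rows), columns=toy_columns)

# Counting the non-missing cells per row recovers each word's phoneme count
phoneme_counts = toy_df.notna().sum(axis=1)

# dropna() on a row yields only the phonemes that are actually present
ab_phonemes = set(toy_df.loc["ab"].dropna())
```

Checking cells with `isna()` / `notna()` rather than comparing against a specific placeholder keeps the logic independent of how pandas represents the padding internally.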


The phoneme symbols used in this Spanish CMU dictionary do not match those used in the English CMU dictionary (see `transcriptions/english_transcriptions.csv`). We therefore need to determine how each Spanish phoneme symbol maps onto the English representation, and whether Spanish has phonemes that do not exist in English.

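To do this, we collect a set of words from the Spanish dictionary that together contain every phoneme. The loop below walks the dictionary in order and keeps a word only if it contains a phoneme not yet seen, so the result covers all phonemes but is not necessarily the smallest such set. We can then examine these words, compare their sounds with those outlined in the English transcription file, and map each phoneme accordingly.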

In [6]:
covered_phonemes = set()
selected_words = set()

for word, row in df.iterrows():
    word_phonemes = set(row.dropna())  # Get phonemes for the current word
    uncovered_phonemes = word_phonemes - covered_phonemes  # Only consider uncovered phonemes
    
    if uncovered_phonemes:  # Only add word if it introduces new phonemes
        selected_words.add(word)
        covered_phonemes.update(uncovered_phonemes)
    
    # Stop once all phonemes are covered
    if covered_phonemes == unique_phonemes:
        break

In [7]:
selected_words

{'a',
 'aaron',
 'ab',
 'abajo',
 'abandona',
 'abandonada',
 'abandonadas',
 'abandonarla',
 'abandone',
 'abanico',
 'abaratamiento',
 'abarrotando',
 'abastecer',
 'abdallah',
 'abdul',
 'abkhazia',
 'abogada',
 'abruptos',
 'absuelto',
 'acaudalados',
 'acecho',
 'aceite',
 'acompaña',
 'adolf',
 'agrupábanse',
 'hielo'}
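Because the selection is first-come, first-served, the words it picks depend on the order in which entries are visited. The sketch below restates the same idea on plain Python lists, using five pronunciations from the preview table; the function name `select_covering_words` and the reordered list are hypothetical and serve only to illustrate the order dependence.

```python
def select_covering_words(entries):
    """Return words, in visiting order, that each introduce at least one new phoneme.

    `entries` is a sequence of (word, phonemes) pairs. This is a first-come
    selection, not a search for the smallest covering set.
    """
    entries = list(entries)
    target = {p for _, phonemes in entries for p in phonemes}
    covered = set()
    selected = []
    for word, phonemes in entries:
        new_phonemes = set(phonemes) - covered
        if new_phonemes:
            selected.append(word)
            covered |= new_phonemes
        if covered == target:
            break  # every phoneme has been seen at least once
    return selected


# Pronunciations copied from the preview, in dictionary order
dictionary_order = [
    ("a", ["a"]),
    ("aaron", ["a", "a", "r", "o", "n"]),
    ("ab", ["a", "B"]),
    ("abajo", ["a", "B", "a", "x", "o"]),
    ("abandona", ["a", "B", "a", "n", "d", "o", "n", "a"]),
]

# The same entries in a different (hypothetical) order
reordered = [dictionary_order[i] for i in (4, 0, 2, 3, 1)]

in_order = select_covering_words(dictionary_order)
after_reordering = select_covering_words(reordered)
```

In dictionary order every one of the five words contributes a new phoneme, whereas visiting `abandona` first makes `a` and `ab` redundant and three words suffice. Sorting entries by the number of distinct phonemes before selecting is one way to shrink the review list, at the cost of picking longer words.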