Creating the PCU-alignment-lexicon takes three steps: (1) extracting all words that the lexicon must contain, (2) running a G2P (grapheme-to-phoneme) tool to obtain a phonetic transcription of each word, and (3) aligning each phonetic transcription with its word. This notebook provides code for the first and last steps. For the second step, code from a GitHub repository can be used.

In [1]:
import pandas as pd
import os
import helper_scripts.graph_phon_alignment as gpa
import csv

## Step 1: Extracting all words

In [None]:
target_directory = '/vol/tensusers5/wharmsen/spelling-data/aseda/2-targetdata/'

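In [None]:
def extract_all_target_words(target_directory):
    """
    Extracts all target words from all .csv files found in the given directory.
    Each file is expected to contain a 'target' column.
    """
    all_target_words = set()

    for filename in os.listdir(target_directory):
        # Skip anything that is not a .csv file (e.g. subdirectories)
        if not filename.endswith('.csv'):
            continue
        # os.path.join also works when the directory lacks a trailing slash
        location = os.path.join(target_directory, filename)

        dataframe = pd.read_csv(location)
        # Drop empty cells so that NaN values do not end up in the word list
        all_target_words.update(dataframe['target'].dropna().astype(str))

    return list(all_target_words)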


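def write_words_wordlist(words):
    """Writes the words to 'all_target_words_list', one word per line."""
    # Sort alphabetically, ignoring case; on ties, the capitalised form comes first.
    # v[:1] (instead of v[0]) avoids an IndexError on empty strings.
    words = sorted(words, key=lambda v: (v.upper(), v[:1].islower()))
    with open('all_target_words_list', "w", encoding='utf-8') as f:
        for word in words:
            f.write(word + "\n")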

In [None]:
target_words = extract_all_target_words(target_directory)
write_words_wordlist(target_words)

## Step 2: Using a G2P tool

Any G2P tool can be used for this step. When using the CLST webservice, the Lexiconator can be used: a Python binding for the webservice, available [here](https://github.com/cristiantg/lexiconator). If Step 1 above has been completed, only the uber_script of the Lexiconator is needed. Alternatively, the pre-processing step of the Lexiconator can be used instead of the Step 1 code.
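Step 3 reads the G2P output as tab-separated lines, with the word in the first field and its transcription in the second. Before aligning, it may be worth checking that every word from Step 1 actually received a transcription. The following is a minimal sketch of such a check. The function names `parse_g2p_lines` and `find_missing_words` are hypothetical and are not part of the Lexiconator or of this notebook's helper scripts.

```python
def parse_g2p_lines(lines):
    """Parse 'word<TAB>transcription' lines into a dict.

    Returns the dict together with a list of lines that do not
    contain both a word and a transcription.
    """
    lexicon = {}
    malformed = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore empty lines
        fields = line.split("\t")
        if len(fields) < 2 or not fields[0] or not fields[1]:
            malformed.append(line)
            continue
        lexicon[fields[0]] = fields[1]
    return lexicon, malformed


def find_missing_words(words, lexicon):
    """Return the words (sorted, without duplicates) that have no transcription."""
    return sorted(w for w in set(words) if w not in lexicon)
```

A possible use is to open the word list from Step 1 and the G2P output file, pass the lines of the latter to `parse_g2p_lines`, and inspect both the malformed lines and the result of `find_missing_words` before running Step 3.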

## Step 3: Aligning the phonemes and graphemes

In [2]:
g2p_output_path = '/vol/tensusers5/wharmsen/spelling-data/aseda/resources/lexicons/all-target-words-list/results-final/lexicon.txt'
lexicon_output_directory = "/vol/tensusers5/wharmsen/spelling-data/aseda/resources"

In [3]:
def extract_grapheme_phoneme_from_g2p_output(g2p_output_path):
    """
    Reads the G2P output (one 'word<TAB>transcription' entry per line) and
    aligns each word with its phonetic transcription.
    """
    # Read as UTF-8, matching the encoding used to write the word list in Step 1
    with open(g2p_output_path, encoding='utf-8') as file:
        text = file.read()

    # Split into lines and drop empty ones
    lines = [x for x in text.split('\n') if x != '']

    # Save lexicon in grapheme and phoneme lists
    graphemes = []
    graphemes_aligned = []
    phonemes = []
    phonemes_aligned = []

    for line in lines:
        fields = line.split("\t")

        # First field: the word; second field: its phonetic transcription
        grapheme = fields[0]
        phoneme = fields[1]
        grapheme_align, phoneme_align = gpa.align_word_and_phon_trans(grapheme, phoneme)

        graphemes.append(grapheme)
        phonemes.append(phoneme)
        graphemes_aligned.append(grapheme_align)
        phonemes_aligned.append(phoneme_align)

    return graphemes, phonemes, graphemes_aligned, phonemes_aligned

def create_alignment_lexicon(graphemes, phonemes, graph_align_list, phon_align_list):
    matrix = []
    for idx in range(len(phonemes)):
        row = [graphemes[idx],phonemes[idx],graph_align_list[idx], phon_align_list[idx]]
        matrix.append(row)

    lexicon_df = pd.DataFrame(matrix, columns = ["graphemes", "phonemes", "graphemes_align", "phonemes_align"])
    lexicon_df = lexicon_df.set_index("graphemes")
    lexicon_df = lexicon_df.dropna(subset=['phonemes'])

    return lexicon_df

In [4]:
graphemes, phonemes, graphemes_aligned, phonemes_aligned = extract_grapheme_phoneme_from_g2p_output(g2p_output_path)

# Create lexicon with grapheme-phoneme alignment
lexiconDF = create_alignment_lexicon(graphemes, phonemes, graphemes_aligned, phonemes_aligned)
lexicon_output_file = lexicon_output_directory + "/graph_phon_alignment_lexicon.csv"
lexiconDF.to_csv(lexicon_output_file, quoting = csv.QUOTE_NONNUMERIC, quotechar='"')
print("Alignment Lexicon created: ", lexicon_output_file)

Alignment Lexicon created:  /vol/tensusers5/wharmsen/spelling-data/aseda/resources/graph_phon_alignment_lexicon.csv
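
Once the lexicon has been written, it can be read back for lookups. The sketch below uses only the standard-library `csv` module and assumes the column names written above (`graphemes`, `phonemes`, `graphemes_align`, `phonemes_align`). The function name `load_alignment_lexicon` is hypothetical. All values are returned as strings exactly as they are stored in the file, so any further parsing of the aligned columns is left to the caller.

```python
import csv


def load_alignment_lexicon(path):
    """Read the alignment lexicon CSV into a dict keyed by word."""
    lexicon = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # The 'graphemes' column is the index written by to_csv
            word = row.pop("graphemes")
            lexicon[word] = row
    return lexicon
```

Calling `load_alignment_lexicon(lexicon_output_file)` would then give a dictionary in which each word maps to its `phonemes`, `graphemes_align` and `phonemes_align` entries.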
