In [1]:
#Prints **all** console output, not just last item in cell 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

**Notebook author:** emeinhardt@ucsd.edu

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Overview" data-toc-modified-id="Overview-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Overview</a></span><ul class="toc-item"><li><span><a href="#Motivation" data-toc-modified-id="Motivation-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Motivation</a></span></li><li><span><a href="#Usage" data-toc-modified-id="Usage-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Usage</a></span><ul class="toc-item"><li><span><a href="#Papermill---command-line" data-toc-modified-id="Papermill---command-line-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Papermill - command line</a></span></li><li><span><a href="#Old-School" data-toc-modified-id="Old-School-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Old School</a></span></li></ul></li></ul></li><li><span><a href="#Imports" data-toc-modified-id="Imports-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Parameters" data-toc-modified-id="Parameters-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Parameters</a></span></li><li><span><a href="#Import-Files-to-Project" data-toc-modified-id="Import-Files-to-Project-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Import Files to Project</a></span><ul class="toc-item"><li><span><a href="#Projection-mapping" data-toc-modified-id="Projection-mapping-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Projection mapping</a></span></li><li><span><a href="#Case-A:-gating-data" data-toc-modified-id="Case-A:-gating-data-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Case A: gating data</a></span></li><li><span><a href="#Case-B:-transcription-lexicon" data-toc-modified-id="Case-B:-transcription-lexicon-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Case B: transcription lexicon</a></span></li></ul></li><li><span><a href="#Apply-Projection" data-toc-modified-id="Apply-Projection-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Apply 
Projection</a></span><ul class="toc-item"><li><span><a href="#Case-A:-gating-data" data-toc-modified-id="Case-A:-gating-data-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Case A: gating data</a></span></li><li><span><a href="#Case-B:-transcription-lexicon" data-toc-modified-id="Case-B:-transcription-lexicon-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Case B: transcription lexicon</a></span><ul class="toc-item"><li><span><a href="#Pre-conditions" data-toc-modified-id="Pre-conditions-5.2.1"><span class="toc-item-num">5.2.1&nbsp;&nbsp;</span>Pre-conditions</a></span></li><li><span><a href="#Processing" data-toc-modified-id="Processing-5.2.2"><span class="toc-item-num">5.2.2&nbsp;&nbsp;</span>Processing</a></span></li><li><span><a href="#Post-conditions" data-toc-modified-id="Post-conditions-5.2.3"><span class="toc-item-num">5.2.3&nbsp;&nbsp;</span>Post-conditions</a></span></li></ul></li></ul></li><li><span><a href="#Export" data-toc-modified-id="Export-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Export</a></span><ul class="toc-item"><li><span><a href="#Case-A:-gating-data" data-toc-modified-id="Case-A:-gating-data-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Case A: gating data</a></span></li><li><span><a href="#Case-B:-transcription-lexicon" data-toc-modified-id="Case-B:-transcription-lexicon-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>Case B: transcription lexicon</a></span></li></ul></li></ul></div>

# Overview

## Motivation

Let 
 - $g$ be a gating data file
 - $l$ be a transcribed lexicon relation file

The notebook `Gating Data - Transcription Lexicon Alignment Maker.ipynb` will import both, calculate the segmental inventories $\Sigma_g$, $\Sigma_l$ used in each, and 
 - calculate the portion $\overline{g}$ of $g$'s inventory $\Sigma_g$ unique to $g$ relative to $l$
 - calculate the portion $\overline{l}$ of $l$'s inventory $\Sigma_l$ unique to $l$ relative to $g$

After running that notebook, *you* (or someone else) defined dictionaries in its $\S$5-6 (eventually exported) that map 
 - $\overline{g} \rightarrow \Sigma_g \cap \Sigma_l$
 - $\overline{l} \rightarrow \Sigma_g \cap \Sigma_l$

*This* notebook takes three arguments:
 - $p$: the path to *one* of those exported projection mappings.
 - $g$ or $l$: the path to the file (**g**ating data or transcribed **l**exicon) that the projection mapping should be applied to.
 - $o$: a filename or filepath for the projected version of that input file, to be written by this notebook.
 
It then applies the projection mapping to the input file and produces the output file.
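
The inventories and mappings described in this section can be sketched with plain Python sets. The inventories below are toy values rather than the contents of either real file, and every name is hypothetical; the real inventories come from the Alignment Maker notebook.

```python
# Syllabic nasals written with an explicit combining mark (U+0329).
SYLLABIC_M = "m\u0329"
SYLLABIC_N = "n\u0329"

# Toy inventories (assumptions for illustration only).
sigma_g = {"m", "n", "t", "ɾ"}                    # segments used in the gating data
sigma_l = {"m", "n", "t", SYLLABIC_M, SYLLABIC_N}  # segments used in the lexicon

shared = sigma_g & sigma_l   # Σ_g ∩ Σ_l
g_bar = sigma_g - sigma_l    # portion of Σ_g unique to g
l_bar = sigma_l - sigma_g    # portion of Σ_l unique to l

# A hand-written projection for the lexicon: every key lies in l_bar
# and every value lies in the shared inventory.
projection_l = {SYLLABIC_M: "m", SYLLABIC_N: "n"}
assert set(projection_l) <= l_bar
assert set(projection_l.values()) <= shared
```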

## Usage

### Papermill - command line

This notebook is intended to be used with the [`papermill`](https://papermill.readthedocs.io/en/latest/) package.

**Example:**

Let 
 - `p` = `./LTR_Buckeye_aligned_w_GD_AmE_destressed/alignment_of_LTR_Buckeye_w_AmE-diphones-IPA-annotated-columns.json`
 - `l` = `./LTR_Buckeye/LTR_Buckeye.tsv`
then
```
papermill "Align transcriptions.ipynb" "GD_AmE-diphones - LTR_Buckeye alignment application to AmE-diphones.ipynb" -p p "./LTR_Buckeye_aligned_w_GD_AmE_destressed/alignment_of_LTR_Buckeye_w_AmE-diphones-IPA-annotated-columns.json" -p l "./LTR_Buckeye/LTR_Buckeye.tsv" -p o "./LTR_Buckeye_aligned_w_GD_AmE_destressed/LTR_Buckeye_aligned_w_GD_AmE-diphones.tsv"
```
will 
 - create a new notebook `GD_AmE-diphones - LTR_Buckeye alignment application to AmE-diphones.ipynb` that records the data processing (a successful run requires no action or intervention from you), and
 - output `./LTR_Buckeye_aligned_w_GD_AmE_destressed/LTR_Buckeye_aligned_w_GD_AmE-diphones.tsv`, a projected version of `./LTR_Buckeye/LTR_Buckeye.tsv` based on the projection defined in `./LTR_Buckeye_aligned_w_GD_AmE_destressed/alignment_of_LTR_Buckeye_w_AmE-diphones-IPA-annotated-columns.json`.
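
A run like the one in this example can also be assembled programmatically, which helps when many file pairs need projecting. The sketch below uses only the standard library; `build_papermill_command` is a hypothetical helper, and the only `papermill` feature it assumes is the `-p NAME VALUE` option already used in this section.

```python
import shlex

def build_papermill_command(template_nb, output_nb, parameters):
    """Return the argv list for one papermill run (hypothetical helper)."""
    argv = ["papermill", template_nb, output_nb]
    for name, value in parameters.items():
        argv += ["-p", name, value]
    return argv

argv = build_papermill_command(
    "Align transcriptions.ipynb",
    "GD_AmE-diphones - LTR_Buckeye alignment application to AmE-diphones.ipynb",
    {
        "p": "./LTR_Buckeye_aligned_w_GD_AmE_destressed/alignment_of_LTR_Buckeye_w_AmE-diphones-IPA-annotated-columns.json",
        "l": "./LTR_Buckeye/LTR_Buckeye.tsv",
        "o": "./LTR_Buckeye_aligned_w_GD_AmE_destressed/LTR_Buckeye_aligned_w_GD_AmE-diphones.tsv",
    },
)

# Shell-quoted form, suitable for pasting into a terminal.
print(shlex.join(argv))
# subprocess.run(argv, check=True) would execute it, provided papermill is installed.
```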

### Old School

If you don't have `papermill` or don't want to use this notebook as intended, edit the filenames/paths in the cell below whose top comment is `# PARAMETERS CELL`.

# Imports

In [35]:
from string_utils import *
from boilerplate import *

In [3]:
from os import getcwd, chdir, listdir, path, mkdir, makedirs

import json
import csv

In [4]:
getcwd()

'/mnt/cube/home/AD/emeinhar/wr'

In [5]:
%ls

'1 initial directory setup.txt'
'2a alignment_paths_and_cmds.sh'
'Align transcriptions.ipynb'
 boilerplate.py
'Gating Data - Transcription Lexicon Alignment Maker.ipynb'
 [0m[01;34mGD_AmE[0m/
 [01;34mGD_AmE_destressed_aligned_w_LTR_Buckeye[0m/
 [01;34mGD_AmE_destressed_aligned_w_LTR_CMU_destressed[0m/
 [01;34mGD_AmE_destressed_aligned_w_LTR_newdic_destressed[0m/
'GD_AmE-diphones - LTR_Buckeye.tsv alignment definition.ipynb'
'GD_AmE-diphones - LTR_CMU_destressed.tsv alignment definition.ipynb'
'GD_AmE-diphones - LTR_newdic_destressed.tsv alignment definition.ipynb'
 [01;34mLTR_Buckeye[0m/
 [01;34mLTR_Buckeye_aligned_w_GD_AmE_destressed[0m/
 [01;34mLTR_CMU_destressed[0m/
 [01;34mLTR_CMU_destressed_aligned_w_GD_AmE_destressed[0m/
 [01;34mLTR_CMU_stressed[0m/
 [01;34mLTR_newdic_destressed[0m/
 [01;34mLTR_newdic_destressed_aligned_w_GD_AmE_destressed[0m/
 [01;34mold[0m/
'Processing Driver Notebook.ipynb'
 [01;34m__pycache__[0m/
 string_utils.

# Parameters

In [6]:
# PARAMETERS CELL
#
# This is the Parameters Cell that Papermill looks at and modifies
# 
# go to View->Cell Toolbar->Tags to see what's going on

p = ""
# p = "./GD_AmE_destressed_aligned_w_LTR_Buckeye/alignment_of_AmE-diphones-IPA-annotated-columns_w_LTR_Buckeye.json"
# p = "./LTR_Buckeye_aligned_w_GD_AmE_destressed/alignment_of_LTR_Buckeye_w_AmE-diphones-IPA-annotated-columns.json"

g = ""
# g = "./GD_AmE/AmE-diphones-IPA-annotated-columns.csv"

l = ""
# l = "./LTR_Buckeye/LTR_Buckeye.tsv"

o = ""
# o = "./GD_AmE_destressed_aligned_w_LTR_Buckeye/AmE-diphones_aligned_w_LTR_Buckeye.tsv"
# o = "./LTR_Buckeye_aligned_w_GD_AmE_destressed/LTR_Buckeye_aligned_w_GD_AmE-diphones.tsv"

In [7]:
projection_dp = path.dirname(p)
projection_dp
projection_fn = path.basename(p)
projection_fn

'./LTR_Buckeye_aligned_w_GD_AmE_destressed'

'alignment_of_LTR_Buckeye_w_AmE-diphones-IPA-annotated-columns.json'

In [8]:
working_on_gating_data = g != ""
working_on_gating_data

False

In [9]:
if working_on_gating_data:
    data_dp = path.dirname(g)
    data_fn = path.basename(g)
else:
    data_dp = path.dirname(l)
    data_fn = path.basename(l)
    
data_dp
data_fn

'./LTR_Buckeye'

'LTR_Buckeye.tsv'

In [10]:
if not working_on_gating_data:
    assert l != "", "Either l or g must not be the empty string."

In [11]:
assert o != "", "Must supply a filename or filepath for the projected version of the input, got o={0} instead".format(o)
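
The parameter checks in this section can be gathered into a single function. The sketch below is deliberately stricter than the cells above: it also requires `p` and rejects the case where both `g` and `l` are supplied, whereas the notebook lets `g` take priority. `resolve_inputs` is a hypothetical name, not part of `boilerplate` or `string_utils`.

```python
def resolve_inputs(p, g, l, o):
    """Check the four notebook parameters and return (kind, data_path).

    kind is "gating" when g is supplied and "lexicon" when l is supplied.
    """
    if p == "":
        raise ValueError("p (path to the projection mapping) must be supplied")
    if o == "":
        raise ValueError("o (output filename or filepath) must be supplied")
    if (g == "") == (l == ""):
        # Either both are empty or both are set.
        raise ValueError("supply exactly one of g or l")
    return ("gating", g) if g != "" else ("lexicon", l)

kind, data_path = resolve_inputs(
    p="./LTR_Buckeye_aligned_w_GD_AmE_destressed/alignment_of_LTR_Buckeye_w_AmE-diphones-IPA-annotated-columns.json",
    g="",
    l="./LTR_Buckeye/LTR_Buckeye.tsv",
    o="./LTR_Buckeye_aligned_w_GD_AmE_destressed/LTR_Buckeye_aligned_w_GD_AmE-diphones.tsv",
)
```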

In [12]:
output_dirpath = path.dirname(o) or "."  # a bare filename means the current directory
output_fn = path.basename(o)
makedirs(output_dirpath, exist_ok=True)

# Import Files to Project

## Projection mapping

In [13]:
with open(p, encoding='utf-8') as data_file:
    projection_mapping = json.loads(data_file.read())
len(projection_mapping)
projection_mapping

2

{'m̩': 'm', 'n̩': 'n'}

In [14]:
if projection_mapping == dict():
    print("Projection mapping is empty: no segments will be changed in creating {0}.".format(o))
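
Because the projections are later applied one after another, a mapping whose target is itself a source (for example `a -> b` together with `b -> c`) would give order-dependent results. The sketch below loads a mapping from JSON text and rejects that case; `load_projection` is a hypothetical helper, and the check is an added safeguard rather than something this notebook performs.

```python
import json

def load_projection(json_text):
    """Parse a projection mapping and reject chained projections."""
    mapping = json.loads(json_text)
    if not isinstance(mapping, dict):
        raise TypeError("projection mapping must be a JSON object")
    # A target that is also a source makes the result depend on application order.
    chained = sorted(src for src, dst in mapping.items() if dst in mapping)
    if chained:
        raise ValueError("targets are also sources for: {0}".format(chained))
    return mapping

# The mapping shown in the output above, written with explicit escapes.
projection = load_projection('{"m\\u0329": "m", "n\\u0329": "n"}')
```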

## Case A: gating data

In [15]:
def getDiphoneGatingTrials(filename, print_fields = True):
    '''
    Opens filename in the current working directory and returns the trials as a 
    list of dictionaries, plus the fieldnames in the order present in the file.
    '''
    diphone_fields = []
    diphoneTrials = []
    diphoneDataInFilename = filename
    with open(diphoneDataInFilename, newline='', encoding='utf-8') as csvfile:
        my_reader = csv.DictReader(csvfile, delimiter='\t')
        diphone_fields = my_reader.fieldnames
        if print_fields:
            print("fieldnames: {0}".format(diphone_fields))
        for row in my_reader:
            #print(row)
            diphoneTrials.append(row)
    return {'trials': diphoneTrials, 'fields':diphone_fields}

def writeProcessedDataToCSV(theTrials, theFieldnames, filename):
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, delimiter='\t',fieldnames=theFieldnames)
        writer.writeheader()
        writer.writerows(theTrials)

In [16]:
def getDestressedDiphone(row):
    return row['diphoneInSeg']

def getStressedDiphone(row):
    return row['diphoneInWStress']

In [17]:
sound_fields = ['Prec_context', 'CorrAns1', 'CorrAns2', 'Resp1', 'Resp2',
                'diphoneInSeg', 'diphoneInWStress', 'diphoneOutSeg',
                'prefixSeg', 'prefixWStress',
                'suffixSeg', 'suffixWStress',
                'stimulusSeg', 'stimulusWProsody']
diphone_fields = ['CorrAns1', 'CorrAns2', 'Resp1', 'Resp2',
                  'diphoneInSeg', 'diphoneInWStress', 'diphoneOutSeg']
#                 'stimulusSeg', 'stimulusWProsody']

def getSoundFields(row):
    return project_dict(row, sound_fields)

def getDiphoneFields(row, include_full_stimulus_column = False):
    if not include_full_stimulus_column:
        return project_dict(row, diphone_fields)
    return project_dict(row, diphone_fields + ['stimulusSeg', 'stimulusWProsody'])

core_sound_fields = ['Prec_context', 'CorrAns1', 'CorrAns2', 'Resp1', 'Resp2']

def getSounds(row):
    return set(project_dict(row, core_sound_fields).values())

In [18]:
def getStimSeg1(row, which_stress):
    seg = row['CorrAns1']
    if which_stress == 'destressed':
        return seg
    elif which_stress == 'stressed':
        s = row['seg1_stress']
        if s == '2' or s == 2:
            return seg
        else:
            return seg + str(s)
    else:
        assert which_stress in ['destressed', 'stressed'], '{0} is an invalid choice about stress representations'.format(which_stress)

def getStimSeg2(row, which_stress):
    seg = row['CorrAns2']
    if which_stress == 'destressed':
        return seg
    elif which_stress == 'stressed':
        s = row['seg2_stress']
        if s == '2' or s == 2:
            return seg
        else:
            return seg + str(s)
    else:
        assert which_stress in ['destressed', 'stressed'], '{0} is an invalid choice about stress representations'.format(which_stress)
        
def removeConsStress(stringRep):
    return ''.join([c for c in stringRep if c != "2"])

def removeStress(stringRep):
    return ''.join([c for c in stringRep if c != "0" and c != "1" and c != "2"])

def replaceSyllableBoundaries(stringRep):
    return stringRep.replace('-','.')

def justSegments(stringRep):
    return replaceSyllableBoundaries(removeStress(stringRep))

def getDiphonesInAsStr(row, which_stress):
    if which_stress == 'destressed':
        return row['diphoneInSeg']
    elif which_stress == 'stressed': 
        #we remove consonant stress annotations because there are none in IPhOD (and probably none in Hammond's newdic, either)
        assert removeStress(row['diphoneInWStress']) == row['diphoneInSeg'], '{0} and {1} have segmental mismatch'.format(row['diphoneInSeg'], row['diphoneInWStress'])
        return removeConsStress(row['diphoneInWStress'])
    else:
        assert which_stress in ['destressed', 'stressed'], '{0} is an invalid choice about stress representations'.format(which_stress)
        
def getDiphonesOutAsStr(row):
    return row['diphoneOutSeg']

In [19]:
def mergeXintoY(sound_x,sound_y,the_dict, exact_match = True):
    '''
    Replace every instance of sound_x with sound_y
    in all sound fields of the_dict. Modifies the_dict
    in place and returns it.
    
    If exact_match is True, a field's value must be exactly
    and entirely equal to sound_x; otherwise, this function
    substitutes every occurrence (substring) of sound_x in the
    sound fields of the_dict.
    '''
    for key in the_dict.keys():
        if exact_match:
            if sound_x == the_dict[key] and key in sound_fields:
#                 if key != 'Prec_context':
#                     print("{2}:{0}⟶{1}.".format(the_dict[key], sound_y, key))
                the_dict.update({key: sound_y})
        else: #use carefully...
            if sound_x in the_dict[key] and key in sound_fields:
                old_str = the_dict[key]
                new_str = old_str.replace(sound_x, sound_y)
#                 if key != 'Prec_context':
#                     print("{2}:{0}⟶{1}.".format(old_str, new_str, key))
                the_dict.update({key: new_str})
    return the_dict
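
The difference between exact and substring matching is easiest to see on a small row. The sketch below is a non-mutating variant written for illustration only: `merge_sound`, the three-field `SOUND_FIELDS` subset, the `Subject` column, and the row values are all hypothetical.

```python
SYLLABIC_N = "n\u0329"  # n with a combining syllabicity mark

# A small subset of the sound fields, for illustration.
SOUND_FIELDS = {"CorrAns1", "Resp1", "stimulusSeg"}

def merge_sound(x, y, row, exact_match=True):
    """Return a copy of row with sound x replaced by y in the sound fields."""
    new_row = dict(row)
    for key, value in row.items():
        if key not in SOUND_FIELDS:
            continue  # never touch non-sound columns
        if exact_match:
            if value == x:
                new_row[key] = y
        elif x in value:
            new_row[key] = value.replace(x, y)
    return new_row

row = {
    "Subject": SYLLABIC_N,                 # not a sound field: left alone
    "CorrAns1": SYLLABIC_N,
    "Resp1": "n",
    "stimulusSeg": "b.ʌ.t." + SYLLABIC_N,
}

exact = merge_sound(SYLLABIC_N, "n", row, exact_match=True)
loose = merge_sound(SYLLABIC_N, "n", row, exact_match=False)
```

With exact matching only `CorrAns1` changes; with substring matching `stimulusSeg` changes as well. Substring matching needs care in the other direction too: a plain `n` is a substring of the syllabic symbol, so replacing `n` would also alter every syllabic `n`.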

In [20]:
if working_on_gating_data:
    # diphoneDataInFilename = "diphones-IPA-annotated-columns.csv"
    diphoneDataInFP = g

    file_data = getDiphoneGatingTrials(diphoneDataInFP)
    rows_in = file_data['trials']
    the_fields = file_data['fields']
    rows_in[:5]

## Case B: transcription lexicon

In [21]:
lexicon_rows_in = []
if not working_on_gating_data:
    transcribed_lexicon_fp = l
    with open(transcribed_lexicon_fp, newline='', encoding='utf-8') as csvfile:
        my_reader = csv.DictReader(csvfile, delimiter='\t', quoting=csv.QUOTE_NONE, quotechar='@')
        for row in my_reader:
            #print(row)
            lexicon_rows_in.append(row)
    lexicon_rows_in[:5]

[OrderedDict([('Orthographic_Wordform', 'i'), ('Transcription', 'aɪ')]),
 OrderedDict([('Orthographic_Wordform', 'uh'), ('Transcription', 'ʌ')]),
 OrderedDict([('Orthographic_Wordform', 'grew'), ('Transcription', 'g.ɹ.u')]),
 OrderedDict([('Orthographic_Wordform', 'up'), ('Transcription', 'ʌ.p')]),
 OrderedDict([('Orthographic_Wordform', 'in'), ('Transcription', 'ɪ.n')])]

# Apply Projection

In [22]:
from funcy import compose

## Case A: gating data

In [23]:
if working_on_gating_data:
    if len(projection_mapping) == 0:
        print('No segments to project.')
    else:
        for seg_to_project in projection_mapping:
            print('Seg to project: {0}'.format(seg_to_project))
            print('\t# rows w/ seg: {0}'.format(len([r for r in rows_in 
                                                     if seg_to_project in getSoundFields(r).values()])))

In [24]:
if working_on_gating_data:
    if len(projection_mapping) == 0:
        print('No segments to project.')
    else:
        row_alteration_functions = []
        for seg_to_project in projection_mapping:
            project_to = projection_mapping[seg_to_project]
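            # Bind the current pair as default arguments; a plain closure would only ever see the last pair.
            row_alteration_functions.append(lambda row, x=seg_to_project, y=project_to: mergeXintoY(x, y, row, exact_match = True))
        row_alteration_function = compose(*row_alteration_functions)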
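
Building one function per mapping entry inside a loop needs care, because Python closures look up loop variables when they are *called*, not when they are created. The sketch below shows the pitfall and one common remedy on plain strings; the local `compose` is a stand-in written for this example with the same right-to-left order as `funcy.compose`.

```python
from functools import reduce

def compose(*funcs):
    """Right-to-left function composition; identity when no functions are given."""
    return reduce(lambda f, g: lambda x: f(g(x)), funcs, lambda x: x)

mapping = {"a": "A", "b": "B"}

# Late binding: each lambda reads src and dst at call time, when both
# already hold the last pair of the loop.
late = [lambda s: s.replace(src, dst) for src, dst in mapping.items()]

# Early binding: default arguments capture the current pair for each lambda.
bound = [lambda s, src=src, dst=dst: s.replace(src, dst)
         for src, dst in mapping.items()]

late_result = compose(*late)("ab")    # only the last pair is ever applied
bound_result = compose(*bound)("ab")  # every pair is applied
```

A factory function that takes the pair as parameters and returns the inner function achieves the same early binding.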

In [25]:
if working_on_gating_data:
    if len(projection_mapping) == 0:
        print('No segments to project.')
        projected_rows = rows_in
    else:
        projected_rows = list(map(row_alteration_function,
                                  rows_in))
        len(rows_in)
        len(projected_rows)

In [26]:
if working_on_gating_data:
    if len(projection_mapping) == 0:
        print('No segments to project.')
    else:
        for seg_to_project in projection_mapping:
            print('Seg to project: {0}'.format(seg_to_project))
            badRows = [r for r in projected_rows if seg_to_project in getSoundFields(r).values()]
            print('\t# rows w/ seg: {0}'.format(len(badRows)))
            assert len(badRows) == 0, '{0} still found after it should have been removed'.format(seg_to_project)

In [27]:
if working_on_gating_data and projection_mapping == dict():
    assert rows_in == projected_rows

## Case B: transcription lexicon

### Pre-conditions

In [28]:
if len(projection_mapping) == 0:
    print('No segments to remap.')
else:
    total_num_rows = len(lexicon_rows_in)
    rows_with = {}
    for segment_to_remap in projection_mapping:
        rows_with[segment_to_remap] = list(filter(lambda row: segment_to_remap in ds2t(row['Transcription']),
                                           lexicon_rows_in))
        print('{0}/{1} total rows have segment {2}'.format(len(rows_with[segment_to_remap]), total_num_rows, segment_to_remap))
        

785/216062 total rows have segment m̩
989/216062 total rows have segment n̩


### Processing

In [29]:
# def subInDS(dottedString, to_replace, replacement):
#     '''
#     Replace each instance of symbol 'to_replace' 
#     with 'replacement' symbol in 'dottedString'.
#     '''
#     old_symbol_tuple = dottedStringToTuple( dottedString )

#     replacer = lambda symb: symb if symb != to_replace else replacement
#     new_symbol_tuple = tuple( map(replacer, old_symbol_tuple) )

#     dottedSymbols = tupleToDottedString( new_symbol_tuple ) 
#     return dottedSymbols

In [36]:
def replaceXwithYinRow(x, y, row):
    new_row = modify_dict(row, 'Transcription', subInDS( row['Transcription'], x, y))
    return new_row

def makeReplacer(x,y):
    def replaceXwithYin(row):
        new_row = modify_dict(row, 'Transcription', subInDS( row['Transcription'], x, y))
        return new_row
    return replaceXwithYin
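
Judging from the commented-out definition above, `subInDS` replaces whole symbols of a dot-separated transcription rather than substrings. The sketch below is an assumption about that behavior, not the code in `string_utils`; `sub_in_dotted_string` is a hypothetical name.

```python
SYLLABIC_N = "n\u0329"  # n with a combining syllabicity mark

def sub_in_dotted_string(dotted, to_replace, replacement):
    """Replace each whole symbol equal to to_replace in a dot-separated string."""
    symbols = dotted.split(".")
    return ".".join(replacement if s == to_replace else s for s in symbols)

projected = sub_in_dotted_string("b.ʌ.t." + SYLLABIC_N, SYLLABIC_N, "n")

# Whole-symbol matching leaves multi-character symbols intact:
# "a" is not a symbol of "aɪ.n", so nothing changes here,
# whereas "aɪ.n".replace("a", "X") would corrupt the diphthong.
untouched = sub_in_dotted_string("aɪ.n", "a", "X")
```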

In [37]:
lexicon_projection_functions = []
if not working_on_gating_data:
    for segment_to_remap in projection_mapping:
        lexicon_projection_functions.append(makeReplacer(segment_to_remap, projection_mapping[segment_to_remap]))
    lexicon_projection_function = compose(*tuple(lexicon_projection_functions))
#         projected_lexicon_rows = list(map(lambda row: replaceXwithYinRow(x, y, row),
#                                           lexicon_rows_in))
    projected_lexicon_rows = list(map(lexicon_projection_function,
                                      lexicon_rows_in))
    len(lexicon_rows_in)
    len(projected_lexicon_rows)
    
    if len(projection_mapping) == 0:
        assert lexicon_rows_in == projected_lexicon_rows

216062

216062

### Post-conditions

In [39]:
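if working_on_gating_data:
    # projected_lexicon_rows is only defined when a lexicon was processed.
    print('Working on gating data: no lexicon to check.')
elif len(projection_mapping) == 0:
    print('No segments to remap.')
else: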
    total_num_rows = len(projected_lexicon_rows)
    rows_with = {}
    for segment_to_remap in projection_mapping:
        rows_with[segment_to_remap] = list(filter(lambda row: segment_to_remap in ds2t(row['Transcription']),
                                           projected_lexicon_rows))
        print('{0}/{1} total rows have segment {2}'.format(len(rows_with[segment_to_remap]), total_num_rows, segment_to_remap))
        if len(rows_with[segment_to_remap]) > 0:
            print('{0} still contained in processed lexicon!'.format(segment_to_remap))
        if len(rows_with[segment_to_remap]) == 0:
            print('{0} removed from processed lexicon.'.format(segment_to_remap))
        

0/216062 total rows have segment m̩
m̩ removed from processed lexicon.
0/216062 total rows have segment n̩
n̩ removed from processed lexicon.


# Export

## Case A: gating data

In [40]:
o

'./LTR_Buckeye_aligned_w_GD_AmE_destressed/LTR_Buckeye_aligned_w_GD_AmE-diphones.tsv'

In [41]:
listdir(output_dirpath)

['alignment_of_LTR_Buckeye_w_AmE-diphones-IPA-annotated-columns.json']

In [44]:
if working_on_gating_data:
    print('Output dir before writing to file:')
    print(listdir(output_dirpath))
    print(' ')
    
    writeProcessedDataToCSV(projected_rows, the_fields, o)
    
    print('Output dir *after* writing to file:')
    print(listdir(output_dirpath))
    print(' ')
    print('Wrote o={0} to file.'.format(o))

## Case B: transcription lexicon

In [45]:
o

'./LTR_Buckeye_aligned_w_GD_AmE_destressed/LTR_Buckeye_aligned_w_GD_AmE-diphones.tsv'

In [46]:
listdir(output_dirpath)

['alignment_of_LTR_Buckeye_w_AmE-diphones-IPA-annotated-columns.json']

In [47]:
if not working_on_gating_data:
    print('Output dir before writing to file:')
    print(listdir(output_dirpath))
    print(' ')
    
    with open(o, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=['Orthographic_Wordform', 'Transcription'], delimiter='\t', quoting=csv.QUOTE_NONE, quotechar='@')

        writer.writeheader()
        writer.writerows(projected_lexicon_rows)
    
    print('Output dir *after* writing to file:')
    print(listdir(output_dirpath))
    print(' ')

Output dir before writing to file:
['alignment_of_LTR_Buckeye_w_AmE-diphones-IPA-annotated-columns.json']
 
Output dir *after* writing to file:
['alignment_of_LTR_Buckeye_w_AmE-diphones-IPA-annotated-columns.json', 'LTR_Buckeye_aligned_w_GD_AmE-diphones.tsv']
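
The lexicon export can be checked without touching the filesystem by round-tripping rows through an in-memory buffer. The sketch below reuses the writer and reader settings from this notebook (tab delimiter, `csv.QUOTE_NONE`) on two rows taken from the lexicon preview earlier; the `io.StringIO` buffer stands in for the output file and is an assumption made for the example.

```python
import csv
import io

fieldnames = ["Orthographic_Wordform", "Transcription"]
rows = [
    {"Orthographic_Wordform": "grew", "Transcription": "g.ɹ.u"},
    {"Orthographic_Wordform": "up", "Transcription": "ʌ.p"},
]

# newline="" disables newline translation, as recommended for the csv module.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=fieldnames, delimiter="\t",
                        quoting=csv.QUOTE_NONE, quotechar="@")
writer.writeheader()
writer.writerows(rows)

# Read the same text back with matching settings.
buffer.seek(0)
reader = csv.DictReader(buffer, delimiter="\t",
                        quoting=csv.QUOTE_NONE, quotechar="@")
round_tripped = [dict(r) for r in reader]

assert round_tripped == rows
```

With `csv.QUOTE_NONE` and no `escapechar`, the writer raises `csv.Error` if a field contains the delimiter, so a tab inside a wordform or transcription would surface at export time rather than silently corrupting the file.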
 
