In [1]:
#Prints **all** console output, not just last item in cell 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#About-this-notebook" data-toc-modified-id="About-this-notebook-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>About this notebook</a></span></li><li><span><a href="#Background" data-toc-modified-id="Background-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Background</a></span></li><li><span><a href="#Requirements" data-toc-modified-id="Requirements-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Requirements</a></span></li><li><span><a href="#Import-data" data-toc-modified-id="Import-data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Import data</a></span><ul class="toc-item"><li><span><a href="#Diphone-gating-data" data-toc-modified-id="Diphone-gating-data-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Diphone gating data</a></span></li><li><span><a href="#Hammond's-newdic" data-toc-modified-id="Hammond's-newdic-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Hammond's newdic</a></span></li><li><span><a href="#IPhOD" data-toc-modified-id="IPhOD-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>IPhOD</a></span></li></ul></li><li><span><a href="#Inventory-exploration" data-toc-modified-id="Inventory-exploration-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Inventory exploration</a></span><ul class="toc-item"><li><span><a href="#Gating-data" data-toc-modified-id="Gating-data-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Gating data</a></span></li><li><span><a href="#Hammond's-newdict" data-toc-modified-id="Hammond's-newdict-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Hammond's newdict</a></span><ul class="toc-item"><li><span><a href="#Comparison-of-inventories-between-Hammond's-newdic-and-the-diphone-gating-data" data-toc-modified-id="Comparison-of-inventories-between-Hammond's-newdic-and-the-diphone-gating-data-5.2.1"><span class="toc-item-num">5.2.1&nbsp;&nbsp;</span>Comparison of inventories between Hammond's newdic and the diphone gating 
data</a></span></li></ul></li><li><span><a href="#Lexicon---IPhOD" data-toc-modified-id="Lexicon---IPhOD-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Lexicon - IPhOD</a></span><ul class="toc-item"><li><span><a href="#Comparison-of-IPhOD-inventory-and-diphone-gating-data" data-toc-modified-id="Comparison-of-IPhOD-inventory-and-diphone-gating-data-5.3.1"><span class="toc-item-num">5.3.1&nbsp;&nbsp;</span>Comparison of IPhOD inventory and diphone gating data</a></span></li></ul></li></ul></li><li><span><a href="#Processing-pipeline-sketch" data-toc-modified-id="Processing-pipeline-sketch-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Processing pipeline sketch</a></span><ul class="toc-item"><li><span><a href="#Global-todo-tally:" data-toc-modified-id="Global-todo-tally:-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Global todo tally:</a></span><ul class="toc-item"><li><span><a href="#New-capabilities" data-toc-modified-id="New-capabilities-6.1.1"><span class="toc-item-num">6.1.1&nbsp;&nbsp;</span>New capabilities</a></span></li><li><span><a href="#Open-questions" data-toc-modified-id="Open-questions-6.1.2"><span class="toc-item-num">6.1.2&nbsp;&nbsp;</span>Open questions</a></span></li></ul></li></ul></li></ul></div>

# About this notebook

**Notebook author:** emeinhardt@ucsd.edu

**Notebook goal:** Examine the segment inventories of the diphone gating data and of the lexicon representations, citing the relevant portions of each paper, to determine what needs to be done to align the gating data with representations of the lexicon for use in a word recognition model.

# Background

**Motivation:** The goal is to construct a word recognition model that can map from a segmental transcription of the incrementally produced prefix of a speaker's intended wordform to a listener's beliefs about what the speaker's actual intended wordform is. In this model task, the speaker chooses a wordform according to its prior probability.

**Data Sources:**
This requires a lexicon of transcribed wordforms and the ability to assign a prior probability to each wordform, as well as a model of listening noise. The diphone gating data of Warner et al. (2014) provides the noise model for (Southwestern) American English. 

Its inventory is based on a dictionary of transcriptions of unclear origin provided by Mike Hammond ('newdic.txt'). As far as I can tell, its transcriptions come from the same source as the NetTalk Corpus and what I think is the Hoosier Mental Lexicon (HML); Hammond's dictionary also shares word frequency estimates with what I think is the HML. That source appears to be some digitally transcribed English dictionary from the 60s or 70s. There are about 20k entries in Hammond's newdic.

The Irvine Phonotactic Online Dictionary ('IPhOD') contains those transcriptions from the CMU pronouncing dictionary (many of whose transcriptions were also taken from some dictionary and whose other transcription sources are unvetted and undocumented) that could be aligned with orthographic word frequency estimates from the SUBTLEX_US database. There are 40-50k entries in IPhOD.

**Concrete Problem:** The inventory of segment symbols used in the model of the lexicon needs to be aligned with the inventory of the noise model. This notebook investigates what the inventories of each dataset are.

**Notational convention:** I will often use '.'s to indicate boundaries between segments (not e.g. syllable structure), because the alternative that ought to permit normal notation (using unicode tie bars to represent diphthongs and affricates) looks terrible on my machine (and everyone else's too, by default) and the trouble it takes to get them to render properly isn't worth it.

In [2]:
#Use e.g. 
#  http://westonruter.github.io/ipa-chart/keyboard/
#  https://linguistlist.org/unicode/ipa.html
#  http://www.internationalphoneticalphabet.org/ipa-charts/ipa-symbols-with-unicode-decimal-and-hex-codes/
# to construct a two-segment pair to put a tie bar over.
print('A diphthong with tie bar: ɔi͡')
print('An affricate with tie bar: tʃ ͡')
' tʃ ͡'

A diphthong with tie bar: ɔi͡
An affricate with tie bar: tʃ ͡


' tʃ ͡'
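One likely contributor to the poor rendering above: U+0361 (COMBINING DOUBLE INVERTED BREVE) attaches to the character before it and extends over the character after it, so it belongs between the two letters rather than after the pair. A minimal sketch follows; `tie` is a hypothetical helper name, not something defined elsewhere in this notebook, and the result still depends on the font in use.

```python
import unicodedata

TIE_BAR = "\u0361"  # COMBINING DOUBLE INVERTED BREVE

def tie(first, second):
    """Join two single-character segments with a tie bar placed between them."""
    return first + TIE_BAR + second

affricate = tie("t", "\u0283")   # t + tie bar + ʃ
diphthong = tie("\u0254", "i")   # ɔ + tie bar + i
print(affricate, diphthong)
print(unicodedata.name(TIE_BAR))
```

Each result is three code points long, which is one more reason to prefer explicit '.' separators when segment boundaries matter.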

# Requirements

At a high level, I am running Python 3.6.5, Jupyter Notebook 5.5.0, and otherwise defaults associated with Anaconda 5.2. More specifically, this notebook assumes the current working directory contains
 - a copy of the dataset and annotations available from https://linguistics.arizona.edu/dpl/english_diphones (NOTE: Natasha Warner has just - as of early 2018 - changed the website, and links to the data are currently broken)
 - a copy of Hammond's mysterious 'newdic' transcribed lexicon of English http://dingo.sbs.arizona.edu/~hammond/lsasummer11/newdic
 - a copy of the data associated with IPhOD (available from http://www.iphod.com/)

In [3]:
%pwd

'/home/AD/emeinhar/c2-jn'

In [4]:
%ls diphones-raw-data.csv

diphones-raw-data.csv


In [5]:
%ls IPhOD*

IPhOD2_Words.txt  IPhODv2.0_REALS.zip


In [6]:
%ls Hammond*

"Hammond's mysterious newdic.txt"


# Import data

In [7]:
import csv

## Diphone gating data

**What do the data describe?** Each row in the diphones data file corresponds to a trial in a gating task. Experimenters recorded productions of all (or nearly all) phonotactically licit diphones in (south)western American English, each embedded in a nonsense context of at most a few speech sounds on either side. For example, one full nonsense-word stimulus could be [ɑ.p.tʃ.ə], in which the vowels at either end are 'nonsense' context and [p.tʃ] is the diphone. For each such segment sequence, one recording was chosen. Six gating task stimuli were then created from it, each covering an increasing portion of the original recording (up to one of six 'gates'):
 - one from the start of the recording to 1/3 of the way through the production of the first segment of the diphone
 - one from the start of the recording to 2/3 of the way through the production of the first segment of the diphone
 - one from the start of the recording to 3/3 of the way through the production of the first segment of the diphone
 - one from the start of the recording to 1/3 of the way through the production of the second segment of the diphone
 - one from the start of the recording to 2/3 of the way through the production of the second segment of the diphone
 - one from the start of the recording to 3/3 of the way through the production of the second segment of the diphone

In a given trial, a participant listened to one of the six possible gatings of some segment sequence containing a diphone and then reported what they thought the full diphone in that recording was.

Each row indicates various properties of the stimulus and the participant's response.

For more details, see Warner et al. (2014), the earlier paper on the Dutch diphone gating experiment (cited in Warner et al., 2014), and the documentation accompanying the dataset.
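The six gates described above can be sketched as arithmetic on segment boundaries. This is only an illustration: `gate_endpoints` is a hypothetical name, the times are made up, and the sketch assumes the second segment starts exactly where the first one ends.

```python
def gate_endpoints(seg1_start, seg1_end, seg2_end):
    """Return the six gate end times: thirds of segment 1, then thirds of segment 2.

    Assumes segment 2 begins where segment 1 ends.
    """
    ends = []
    for start, end in ((seg1_start, seg1_end), (seg1_end, seg2_end)):
        duration = end - start
        for k in (1, 2, 3):
            ends.append(start + duration * k / 3)
    return ends

# Made-up boundaries in milliseconds: segment 1 spans 0-30, segment 2 spans 30-90.
print(gate_endpoints(0, 30, 90))
```

Every gated stimulus starts at the beginning of the recording, so only the end times differ between gates.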

In [8]:
diphoneDataInFilename = "diphones-raw-data.csv"

In [9]:
def getDiphoneGatingTrials(filename, print_fields = True):
    '''
    Opens filename in the current working directory and returns a dictionary
    with two keys: 'trials' (the trials, as a list of dictionaries) and
    'fields' (the fieldnames in the order present in the file).
    Despite the .csv extension, the file is read as tab-delimited.
    '''
    with open(filename, newline='') as csvfile:
        my_reader = csv.DictReader(csvfile, delimiter='\t')
        diphone_fields = my_reader.fieldnames
        if print_fields:
            print("fieldnames: {0}".format(diphone_fields))
        diphoneTrials = list(my_reader)
    return {'trials': diphoneTrials, 'fields':diphone_fields}


In [10]:
diphoneTrials = getDiphoneGatingTrials(diphoneDataInFilename)['trials']
diphoneTrials[0]

fieldnames: ['Subject', 'Diph_num', 'Diph_name', 'Sylltype', 'SoundFile', 'Prec_context', 'gate', 'four_gate', 'seg1_stress', 'seg2_stress', 'CorrAns1', 'CorrAns2', 'Resp1', 'Resp2', 'Seg1Accur', 'Seg2Accur', 'Prec_context_binary', 'wrong_preccontext', 'replacedSeg1Data', 'replacedSeg2Data']


OrderedDict([('Subject', '1'),
             ('Diph_num', '1'),
             ('Diph_name', 'CC'),
             ('Sylltype', 'CC'),
             ('SoundFile', '0001_CC_CC-g1-beeped.wav'),
             ('Prec_context', 'ah'),
             ('gate', '1'),
             ('four_gate', '0'),
             ('seg1_stress', '2'),
             ('seg2_stress', '2'),
             ('CorrAns1', 'ch'),
             ('CorrAns2', 'ch'),
             ('Resp1', 'ch'),
             ('Resp2', 't'),
             ('Seg1Accur', '1'),
             ('Seg2Accur', '0'),
             ('Prec_context_binary', '1'),
             ('wrong_preccontext', ' '),
             ('replacedSeg1Data', ' '),
             ('replacedSeg2Data', ' ')])
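
Since the noise model is meant to come from these trials, one natural first summary is a tally of how often each intended segment drew each response at each gate. The sketch below uses made-up rows; `toy_trials` and `tally_seg1_confusions` are hypothetical names, and real rows would come from `getDiphoneGatingTrials`. Note that every field value, including `gate`, is a string.

```python
from collections import Counter, defaultdict

# Hypothetical rows containing only the fields this sketch needs.
toy_trials = [
    {'gate': '1', 'CorrAns1': 'ch', 'Resp1': 'ch'},
    {'gate': '1', 'CorrAns1': 'ch', 'Resp1': 't'},
    {'gate': '1', 'CorrAns1': 'ch', 'Resp1': 't'},
    {'gate': '6', 'CorrAns1': 'ch', 'Resp1': 'ch'},
]

def tally_seg1_confusions(trials):
    """Map each gate to a Counter over (correct, response) pairs for segment 1."""
    by_gate = defaultdict(Counter)
    for trial in trials:
        by_gate[trial['gate']][(trial['CorrAns1'], trial['Resp1'])] += 1
    return by_gate

confusions = tally_seg1_confusions(toy_trials)
print(confusions['1'])
```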

## Hammond's newdic

**About the data**: Recall that the inventory of the gating experiment is based on a dictionary of transcriptions of unclear origin provided by Mike Hammond ('newdic.txt'). As far as I can tell, its transcriptions come from the same source as the NetTalk Corpus and what I think is the Hoosier Mental Lexicon (HML); Hammond's dictionary also shares word frequency estimates with what I think is the HML. That source appears to be some digitally transcribed English dictionary from the 60s or 70s. There are about 20k entries in Hammond's newdic. I don't currently understand exactly how to interpret the columns I've labeled 'stressInfoA' and 'stressInfoB', but that's also not currently important.

In [11]:
hammond_fn = "Hammond's mysterious newdic.txt"

In [12]:
%ls Hammond*

"Hammond's mysterious newdic.txt"


In [13]:
newdic_raw = []
fieldnames = ['Transcription', 'stressInfoA', 'stressInfoB', 'Orthography', 'Frequency', 'PoSs']
with open(hammond_fn) as csvfile:
    my_reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=fieldnames)
    for row in my_reader:
        #print(row)
        newdic_raw.append(row)

newdic_raw[0]
len(newdic_raw[0].keys())

OrderedDict([('Transcription', 'x'),
             ('stressInfoA', '_'),
             ('stressInfoB', 'S1'),
             ('Orthography', 'a'),
             ('Frequency', '23178'),
             ('PoSs', '(N IA VB PP)')])

6

## IPhOD

**About the data**: The Irvine Phonotactic Online Dictionary ('IPhOD') contains those transcriptions from the CMU pronouncing dictionary (many of whose transcriptions were also taken from some dictionary and whose other transcription sources are unvetted and undocumented) that could be aligned with orthographic word frequency estimates from the SUBTLEX_US database. (There are also a variety of columns indicating things like phonotactic probability and neighborhood density that are useful for psycholinguistic experiments.)

There are 40-50k entries in IPhOD. Note that each entry corresponds to a pronunciation, and the same orthographic word may have multiple transcriptions (rows) associated with it. Homophones that are also homographs cannot be told apart from pronunciation variants of the same 'word' using the dataset alone.

See the IPhOD website (link near the top) for more explanation of the cryptic column abbreviations.
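
To see which orthographic words carry more than one transcription, the rows can be grouped by their `Word` field. This is a minimal sketch on toy rows: the first two mirror the first two IPhOD rows (both for 'a'), the third is made up, and `transcriptions_by_word` is a hypothetical name.

```python
from collections import defaultdict

# Toy rows shaped like IPhOD rows, reduced to two fields.
toy_rows = [
    {'Word': 'a', 'UnTrn': 'AH'},
    {'Word': 'a', 'UnTrn': 'EY'},
    {'Word': 'cat', 'UnTrn': 'K.AE.T'},  # made-up row
]

def transcriptions_by_word(rows):
    """Group the unstressed transcriptions by orthographic word."""
    by_word = defaultdict(list)
    for row in rows:
        by_word[row['Word']].append(row['UnTrn'])
    return dict(by_word)

grouped = transcriptions_by_word(toy_rows)
multi = {word: trns for word, trns in grouped.items() if len(trns) > 1}
print(multi)
```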

In [14]:
%ls IPhOD*

IPhOD2_Words.txt  IPhODv2.0_REALS.zip


In [15]:
#%cd IPhODv2.0_REALS

In [16]:
IPhOD_raw_filename = 'IPhOD2_Words.txt'

lexicon_raw = []
with open(IPhOD_raw_filename) as csvfile:
    my_reader = csv.DictReader(csvfile, delimiter='\t')
    for row in my_reader:
        #print(row)
        lexicon_raw.append(row)

lexicon_raw[0]
len(lexicon_raw[0].keys())

OrderedDict([('Indx', '1'),
             ('Word', 'a'),
             ('UnTrn', 'AH'),
             ('StTrn', 'AH0'),
             ('NSyll', '1'),
             ('NPhon', '1'),
             ('unsDENS', '26'),
             ('unsFDEN', '150377.45'),
             ('unsLDEN', '91.03'),
             ('unsCDEN', '136251'),
             ('strDENS', '21'),
             ('strFDEN', '145035.32'),
             ('strLDEN', '77.56'),
             ('strCDEN', '114167'),
             ('unsBPAV', '0'),
             ('unsFBPAV', '0'),
             ('unsLBPAV', '0'),
             ('unsCBPAV', '0'),
             ('strBPAV', '0'),
             ('strFBPAV', '0'),
             ('strLBPAV', '0'),
             ('strCBPAV', '0'),
             ('unsTPAV', '0'),
             ('unsFTPAV', '0'),
             ('unsLTPAV', '0'),
             ('unsCTPAV', '0'),
             ('strTPAV', '0'),
             ('strFTPAV', '0'),
             ('strLTPAV', '0'),
             ('strCTPAV', '0'),
             ('unsPOSPAV', '0.044

49

In [17]:
lexicon_raw[1]

OrderedDict([('Indx', '2'),
             ('Word', 'a'),
             ('UnTrn', 'EY'),
             ('StTrn', 'EY1'),
             ('NSyll', '1'),
             ('NPhon', '1'),
             ('unsDENS', '44'),
             ('unsFDEN', '89743.84'),
             ('unsLDEN', '116.4'),
             ('unsCDEN', '145321'),
             ('strDENS', '44'),
             ('strFDEN', '89743.84'),
             ('strLDEN', '116.4'),
             ('strCDEN', '145321'),
             ('unsBPAV', '0'),
             ('unsFBPAV', '0'),
             ('unsLBPAV', '0'),
             ('unsCBPAV', '0'),
             ('strBPAV', '0'),
             ('strFBPAV', '0'),
             ('strLBPAV', '0'),
             ('strCBPAV', '0'),
             ('unsTPAV', '0'),
             ('unsFTPAV', '0'),
             ('unsLTPAV', '0'),
             ('unsCTPAV', '0'),
             ('strTPAV', '0'),
             ('strFTPAV', '0'),
             ('strLTPAV', '0'),
             ('strCTPAV', '0'),
             ('unsPOSPAV', '0.00333

(The empty fieldname in both lexicon entries/rows above is a result of IPhOD having trailing tabs on every line - it can be ignored.)
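
The effect of a trailing tab is easy to reproduce on a toy input, and the resulting empty key is easy to drop. The two-line "file" below is made up; only the trailing tabs are modeled on what the note above says about IPhOD.

```python
import csv
import io

# Two made-up lines, each ending in a tab.
toy_file = io.StringIO("Indx\tWord\t\n1\ta\t\n")

rows = list(csv.DictReader(toy_file, delimiter='\t'))
# Drop the spurious empty fieldname created by the trailing tab.
cleaned = [{key: value for key, value in row.items() if key != ''} for row in rows]
print(rows[0])
print(cleaned[0])
```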

# Inventory exploration

## Gating data

In [18]:
sound_fields = ['Prec_context', 'CorrAns1', 'CorrAns2', 'Resp1', 'Resp2']

In [19]:
def project_dict(the_dict, keys_to_keep):
    new_dict = {key:the_dict[key] for key in the_dict.keys() if key in keys_to_keep}
    return new_dict
project_dict({'Name':'Joe','ID':123,'Job':'clerk'},['Job','ID'])

{'ID': 123, 'Job': 'clerk'}

In [20]:
soundsInDiphoneMatrix = set()
def getSounds(row):
    return project_dict(row, sound_fields).values()
soundsByRow = [getSounds(row) for row in diphoneTrials]
soundsByRow[:10]

[dict_values(['ah', 'ch', 'ch', 'ch', 't']),
 dict_values(['ah', 'ch', 'ch', 'uh', 't']),
 dict_values(['ah', 'ch', 'ch', 't', 't']),
 dict_values(['ah', 'ch', 'ch', 't', 't']),
 dict_values(['ah', 'ch', 'ch', 'k', 't']),
 dict_values(['ah', 'ch', 'ch', 't', 'uu']),
 dict_values(['ah', 'ch', 'ch', 't', 'uh']),
 dict_values(['ah', 'ch', 'ch', 't', 'k']),
 dict_values(['ah', 'ch', 'ch', 't', 'ch']),
 dict_values(['ah', 'ch', 'ch', 'ch', 'h'])]

In [21]:
for eachSoundSet in soundsByRow:
    soundsInDiphoneMatrix.update(eachSoundSet)
print(soundsInDiphoneMatrix)
print(len(soundsInDiphoneMatrix))

{'y', 'l', 'th', 'er', ' ', 'ah', 'r', 'ng', 'd', 'oy', 'X', 'z', 'p', 'uh', 'oh', 'eh', 'm', 'f', 'sh', 'ahb', 'F', 'ee', 'x', 'w', 'oo', 'j', 't', 'v', 'ay', 'n', 'b', 'dh', 'zh', 'ae', 's', 'aye', 'L', 'ih', 'ch', 'uu', 'ow', 'g', 'h', 'k'}
44


Based on the links below, the text of Warner et al. (2014) and earlier papers on the Dutch diphone gating data, plus the raw data set documentation, I've identified the relation between the inventory of the (English) diphone gating data experiment and IPA. Of note:
 - the speaker who produced the audio recordings for the experiment speaks a dialect of English that has the caught-cot merger.
 - Warner et al. transcribe the low back vowel resulting from the merger as 'a', which in IPA denotes the low *front* vowel that begins two of the diphthongs of American English; in IPA, and when handling the merger alongside other data, it should be rendered with the glyph 'ɑ' instead.
 - the only consonant to have a distinguished syllabic version is [l] 

In [22]:
#Relevant:
# https://www.wikiwand.com/en/IPA_Extensions
# http://www.phon.ucl.ac.uk/home/wells/ipa-unicode.htm
# http://westonruter.github.io/ipa-chart/keyboard/
# https://docs.python.org/3/howto/unicode.html
# http://stackoverflow.com/questions/33583485/ipa-to-arpabet-python
# http://unicode-table.com/en/#control-character

#Demonstration of unicode and Python playing nice
print("\u0278") 
print("ɸ")
print("ɸ" == "\u0278")

ɸ
ɸ
True


In [23]:
# arpabet = ['AO','AA','IY','UW','EH','IH','UH','AH','AX','AE','EY','AY','OW','AW','OY','ER','P','B','T','D','K','G','CH','JH','F','V','TH','DH','S','Z','SH','ZH','HH','M','EM','N','EN','NG','ENG','L','EL','R','DX','NX','Y','W','Q']

# Commented out lines indicate symbols in the diphone gating data (and their corresponding IPA symbols) that are not in the inventory used by Warner et al.  
diphoneUnicodeIPArelation = set([ \
# ('', 'ɔ'), #Speaker for Warner et al is from Arizona -- ɔ is not in her inventory
# ('ah', 'a'), #contrary to the raw dataset explanation, this is the low *back* vowel resulting from the cot-caught merger, not the low *front* vowel (IPA 'a') that some diphthongs start with...
('ah', 'ɑ'), 
('ee', 'i'),
('oo', 'u'),
('eh', 'ɛ'),
('ih', 'ɪ'),
('uu', 'ʊ'),
('uh', 'ʌ'),
('x', 'ə'),
('ae', 'æ'),
('ay', 'eɪ'),
('aye', 'aɪ'),
('oh', 'oʊ'),
('ow', 'aʊ'),
('oy', 'ɔɪ'),
('er', 'ɚ'),     # two symbols mapping to the same sound - unstressed r-colored schwa
('X', 'ɚ'),      # two symbols mapping to the same sound - stressed r-colored schwa
('p', 'p'),
('b', 'b'),
('t', 't'),
('d', 'd'),
('k', 'k'),
('g', 'g'),
('ch', 'tʃ'),
('j', 'dʒ'),
('f', 'f'),
('v', 'v'),
('th', 'θ'),
('dh', 'ð'),
('s', 's'),
('z', 'z'),
('sh', 'ʃ'),
('zh', 'ʒ'),
('h', 'h'),
('m', 'm'),
('n', 'n'),
('ng', 'ŋ'),
('l', 'l'),
('L', 'l̩'),
('r', 'r'),
('F', 'ɾ'),
('y', 'j'),
('w', 'w'),
])
print(diphoneUnicodeIPArelation) 
print(len(diphoneUnicodeIPArelation))
def diphoneToUnicodeIPA(diphoneSymbol):
    mapping = dict(diphoneUnicodeIPArelation)
    #print(mapping)
    return mapping[diphoneSymbol]
print(diphoneToUnicodeIPA('th'))
def invertMapping(mydict): 
    return dict([[val, key] for key,val in mydict.items()])
def unicodeIPAToDiphone(unicodeIPAsymbol):
    return invertMapping( dict(diphoneUnicodeIPArelation) )[unicodeIPAsymbol]
print(unicodeIPAToDiphone('θ'))

{('x', 'ə'), ('d', 'd'), ('z', 'z'), ('g', 'g'), ('X', 'ɚ'), ('eh', 'ɛ'), ('p', 'p'), ('t', 't'), ('ay', 'eɪ'), ('oo', 'u'), ('uu', 'ʊ'), ('h', 'h'), ('y', 'j'), ('f', 'f'), ('uh', 'ʌ'), ('ow', 'aʊ'), ('ng', 'ŋ'), ('ae', 'æ'), ('dh', 'ð'), ('ch', 'tʃ'), ('k', 'k'), ('L', 'l̩'), ('w', 'w'), ('m', 'm'), ('th', 'θ'), ('s', 's'), ('l', 'l'), ('r', 'r'), ('ee', 'i'), ('er', 'ɚ'), ('sh', 'ʃ'), ('oy', 'ɔɪ'), ('ih', 'ɪ'), ('j', 'dʒ'), ('ah', 'ɑ'), ('v', 'v'), ('n', 'n'), ('aye', 'aɪ'), ('oh', 'oʊ'), ('zh', 'ʒ'), ('F', 'ɾ'), ('b', 'b')}
42
θ
th


If the assert statement below succeeds, it means every transcribed sound observed in the diphone data at hand is, in fact, documented in Table 1 (pp. 2-3) of http://www.u.arizona.edu/~nwarner/Raw_data_explanation.pdf and therefore in the relation defined above.

In [24]:
assert(all([diphoneSound in list(map(lambda pair: pair[0],diphoneUnicodeIPArelation)) for diphoneSound in soundsInDiphoneMatrix]))

AssertionError: 

Hmm...what transcribed sounds (as I'm capturing them) aren't in the mapping to IPA I've constructed?

In [25]:
[diphoneSound for diphoneSound in soundsInDiphoneMatrix if diphoneSound in list(map(lambda pair: pair[0],diphoneUnicodeIPArelation))]

['y',
 'l',
 'th',
 'er',
 'ah',
 'r',
 'ng',
 'd',
 'oy',
 'X',
 'z',
 'p',
 'uh',
 'oh',
 'eh',
 'm',
 'f',
 'sh',
 'F',
 'ee',
 'x',
 'w',
 'oo',
 'j',
 't',
 'v',
 'ay',
 'n',
 'b',
 'dh',
 'zh',
 'ae',
 's',
 'aye',
 'L',
 'ih',
 'ch',
 'uu',
 'ow',
 'g',
 'h',
 'k']

In [26]:
[diphoneSound for diphoneSound in soundsInDiphoneMatrix if diphoneSound not in list(map(lambda pair: pair[0],diphoneUnicodeIPArelation))]

[' ', 'ahb']
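
The same check can be phrased as a set difference, which reports the offending symbols directly instead of failing with a bare `AssertionError`. The sets below are toy stand-ins; in this notebook they would be `soundsInDiphoneMatrix` and the first elements of the pairs in `diphoneUnicodeIPArelation`.

```python
# Toy stand-ins for the observed symbols and the documented symbols.
observed = {'ah', 'b', 'ch', ' ', 'ahb'}
documented = {'ah', 'b', 'ch', 't'}

undocumented = observed - documented   # seen in the data, absent from the mapping
unused = documented - observed         # in the mapping, never seen in the data
print(sorted(undocumented))
print(sorted(unused))
```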

In [27]:
diphoneToUnicodeIPA('ah')
diphoneToUnicodeIPA('b')

'ɑ'

'b'

'ahb' is a preceding context (viz. 'ɑb'); ' ' is also a preceding context in some cases. I'll extend the mapping/functions I wrote to handle these two cases:

In [28]:
# arpabet = ['AO','AA','IY','UW','EH','IH','UH','AH','AX','AE','EY','AY','OW','AW','OY','ER','P','B','T','D','K','G','CH','JH','F','V','TH','DH','S','Z','SH','ZH','HH','M','EM','N','EN','NG','ENG','L','EL','R','DX','NX','Y','W','Q']

# Commented out lines indicate symbols in the diphone gating data (and their corresponding IPA symbols) that are not in the inventory used by Warner et al.  
diphoneUnicodeIPArelation = set([ \
# ('AO', 'ɔ'), #Speaker for Warner et al is from Arizona -- ɔ is not in her inventory
(' ', ' '),
# ('ahb', 'ab'),
('ahb', 'ɑb'),
# ('ah', 'a'), #contrary to the raw dataset explanation, this is the low *back* vowel resulting from the cot-caught merger, not the low *front* vowel (IPA 'a') that some diphthongs start with...
('ah', 'ɑ'),
('ee', 'i'),
('oo', 'u'),
('eh', 'ɛ'),
('ih', 'ɪ'),
('uu', 'ʊ'),
('uh', 'ʌ'),
('x', 'ə'),
('ae', 'æ'),
('ay', 'eɪ'),
('aye', 'aɪ'),
('oh', 'oʊ'),
('ow', 'aʊ'),
('oy', 'ɔɪ'),
('er', 'ɚ'),     # two symbols mapping to the same sound - unstressed r-colored schwa
('X', 'ɚ'),      # two symbols mapping to the same sound - stressed r-colored schwa
('p', 'p'),
('b', 'b'),
('t', 't'),
('d', 'd'),
('k', 'k'),
('g', 'g'),
('ch', 'tʃ'),
('j', 'dʒ'),
('f', 'f'),
('v', 'v'),
('th', 'θ'),
('dh', 'ð'),
('s', 's'),
('z', 'z'),
('sh', 'ʃ'),
('zh', 'ʒ'),
('h', 'h'),
('m', 'm'),
# ('EM', 'm̩'),   #
('n', 'n'),
# ('EN', 'n̩'),   #
('ng', 'ŋ'),
# ('ENG', 'ŋ̩'),  #
('l', 'l'),
('L', 'l̩'),
('r', 'r'),
('F', 'ɾ'),
# ('NX', 'ɾ̃'),   #
('y', 'j'),
('w', 'w'),
# ('Q', 'ʔ')     #
])
print(diphoneUnicodeIPArelation) 
print(len(diphoneUnicodeIPArelation))
def diphoneToUnicodeIPA(diphoneSymbol):
    mapping = dict(diphoneUnicodeIPArelation)
    #print(mapping)
    return mapping[diphoneSymbol]
print(diphoneToUnicodeIPA('th'))
def invertMapping(mydict): 
    return dict([[val, key] for key,val in mydict.items()])
def unicodeIPAToDiphone(unicodeIPAsymbol):
    return invertMapping( dict(diphoneUnicodeIPArelation) )[unicodeIPAsymbol]
print(unicodeIPAToDiphone('θ'))

{('x', 'ə'), ('d', 'd'), ('z', 'z'), ('g', 'g'), ('X', 'ɚ'), ('eh', 'ɛ'), ('p', 'p'), ('t', 't'), ('ay', 'eɪ'), ('ahb', 'ɑb'), ('oo', 'u'), ('uu', 'ʊ'), ('h', 'h'), ('y', 'j'), ('f', 'f'), ('uh', 'ʌ'), ('ow', 'aʊ'), ('ng', 'ŋ'), ('ae', 'æ'), ('dh', 'ð'), ('ch', 'tʃ'), ('k', 'k'), ('L', 'l̩'), ('w', 'w'), ('m', 'm'), ('th', 'θ'), ('s', 's'), ('l', 'l'), ('r', 'r'), ('ee', 'i'), ('er', 'ɚ'), ('sh', 'ʃ'), ('oy', 'ɔɪ'), ('ih', 'ɪ'), ('j', 'dʒ'), ('ah', 'ɑ'), (' ', ' '), ('v', 'v'), ('n', 'n'), ('aye', 'aɪ'), ('oh', 'oʊ'), ('zh', 'ʒ'), ('F', 'ɾ'), ('b', 'b')}
44
θ
th


Let's try the assertion again:

In [29]:
assert(all([diphoneSound in list(map(lambda pair: pair[0],diphoneUnicodeIPArelation)) for diphoneSound in soundsInDiphoneMatrix]))

Huzzah. Now we can convert the gating data symbols to Unicode IPA symbols for comparison with lexicon representations.

In [30]:
soundsInDiphoneMatrix_IPA = set(map(diphoneToUnicodeIPA, soundsInDiphoneMatrix))
print(soundsInDiphoneMatrix_IPA)
print(len(soundsInDiphoneMatrix_IPA))

{'ʌ', 'ɔɪ', 'ɛ', 'l', 'aʊ', 'g', ' ', 'aɪ', 'ɑ', 'r', 'tʃ', 'ɚ', 'ŋ', 'd', 'ɪ', 'z', 'p', 'ʃ', 'ʊ', 'l̩', 'm', 'f', 'ð', 'eɪ', 'ʒ', 'i', 'ə', 'w', 'j', 't', 'v', 'ɾ', 'oʊ', 'n', 'b', 'dʒ', 's', 'ɑb', 'u', 'θ', 'æ', 'h', 'k'}
43
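
The 44 gating symbols yield only 43 IPA strings because 'er' and 'X' both map to 'ɚ'. For the same reason, `invertMapping` keeps only one of the two when inverting, and which one survives depends on the iteration order of the underlying set. A sketch of an inverse that keeps every preimage (`invert_relation` is a hypothetical name):

```python
from collections import defaultdict

def invert_relation(pairs):
    """Map each right-hand symbol to the set of left-hand symbols paired with it."""
    inverse = defaultdict(set)
    for left, right in pairs:
        inverse[right].add(left)
    return dict(inverse)

# Three pairs taken from the relation above: er -> ɚ, X -> ɚ, th -> θ.
toy_relation = {('er', '\u025a'), ('X', '\u025a'), ('th', '\u03b8')}
inverse = invert_relation(toy_relation)
print(inverse['\u025a'])
```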


## Hammond's newdict

In [31]:
newdic_raw[0]

OrderedDict([('Transcription', 'x'),
             ('stressInfoA', '_'),
             ('stressInfoB', 'S1'),
             ('Orthography', 'a'),
             ('Frequency', '23178'),
             ('PoSs', '(N IA VB PP)')])

In [32]:
def getTrn(entry):
    return entry['Transcription']

def stringToTuple(s):
    return tuple(list(s))

def getSounds(entry):
    trn = getTrn(entry)
    trnTple = stringToTuple(trn)
    return trnTple

soundsByWord = list(map(getSounds, newdic_raw))
len(soundsByWord)

19528

In [33]:
soundsetsByWord = list(map(set, soundsByWord))
len(soundsetsByWord)

19528

In [34]:
from functools import reduce
soundsInNewDic = reduce(set.union, soundsetsByWord)
len(soundsInNewDic)
print(soundsInNewDic)

45

{'y', 'J', 'W', 'Y', 'l', '^', 'g', 'k', 'r', '|', 'd', 'X', 'z', 'p', 'T', 'E', 'm', 'f', 'R', 'x', 'i', '@', 'w', 'G', 'C', 't', 'v', 'U', 'n', 'S', 'b', 'a', 'O', 's', 'L', 'M', 'D', 'u', 'N', 'Z', 'o', 'e', 'I', 'h', 'c'}


In [35]:
orthWordList = [entry['Orthography'] for entry in newdic_raw]
print(len(orthWordList))
orthWords = set(orthWordList)
print(len(orthWords))

19528
19528


In [36]:
from collections import Counter
orthWordCounter = Counter(orthWordList)  # count over the list; counting over the set would always give 1
uniqueOrthWords = [orthWord for orthWord in orthWords if orthWordCounter[orthWord] == 1]
duplicatedOrthWords = [orthWord for orthWord in orthWords if orthWordCounter[orthWord] > 1]
len(uniqueOrthWords)
len(duplicatedOrthWords)

19528

0

Every orthographic word is associated with exactly one phonological description.

In [37]:
orthToEntries = {orthword:[entry for entry in newdic_raw if entry['Orthography'] == orthword] for orthword in orthWords}

In [38]:
assert(all(map(lambda orthword: len(orthToEntries[orthword]) == 1, orthToEntries )))

In [39]:
orthToEntry = {orthword:orthToEntries[orthword][0] for orthword in orthWords}

In [40]:
orthToTrn = {orthword:orthToEntry[orthword]['Transcription'] for orthword in orthWords}
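
The comprehension in cell 37 rescans the whole entry list once per orthographic word, which is roughly 19528 × 19528 comparisons. A single-pass grouping gives the same structure. The sketch below uses toy entries: the first matches the first newdic entry shown earlier, the second is made up, and `group_by_orthography` is a hypothetical name.

```python
from collections import defaultdict

# Toy entries containing only the fields used here.
toy_entries = [
    {'Orthography': 'a', 'Transcription': 'x'},
    {'Orthography': 'cat', 'Transcription': 'k@t'},  # made-up entry
]

def group_by_orthography(entries):
    """Group entries by orthographic form in one pass over the list."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry['Orthography']].append(entry)
    return dict(grouped)

orth_to_entries = group_by_orthography(toy_entries)
# Same sanity check as above: exactly one entry per orthographic word.
assert all(len(entries) == 1 for entries in orth_to_entries.values())
orth_to_trn = {orth: entries[0]['Transcription']
               for orth, entries in orth_to_entries.items()}
print(orth_to_trn)
```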

Below is the relation between symbols in Hammond's newdic and the diphone gating dataset, based on comparing IPA transcriptions in IPhOD with those in Hammond's newdic. Of note:
 - the inventory of Hammond's newdic distinguishes stressed and unstressed versions of some sounds with distinct symbols (/ə/, /ɚ/, /ɪ/)
 - the inventory of Hammond's newdic distinguishes syllabic and nonsyllabic versions of some consonants with distinct symbols (/n/, /m/, /l/)
 - Hammond's newdic preserves the cot-caught distinction

In [41]:
hammond_diphone_inventory_relation = [
 ('h','h'),
 ('S','sh'),
 ('p','p'),
 ('x','x'), #schwa
 ('m','m'),
 ('y','y'),
 ('v','v'),
 ('^','uh'),
 ('o','oh'),
 ('u','oo'),
 ('I','ih'),
 ('G','ng'),
 ('N','n'), #syllabic n isn't represented in the diphone inventory
 ('|','ih'), #as near as I can tell, this is unstressed/reduced 'ɪ'
 ('Z','zh'),
 ('L','L'), #syllabic l
 ('M', 'm'), #syllabic m isn't represented in the diphone inventory
 ('i','ee'),
 ('r','r'),
 ('g','g'),
 ('O','oy'),
 ('T','th'),
 ('n','n'),
 ('J','j'),
 ('d','d'),
 ('k','k'),
 ('W','ow'),
 ('f','f'),
 ('D','dh'),
 ('U','uu'),
 ('z','z'),
 ('Y','aye'),
 ('b','b'),
 ('X','er'), #r-colored schwa - unstressed
 ('a','ah'), #low back vowel; the gating data symbol is 'ah' (there is no 'a' in that inventory)
 ('s','s'),
 ('e','ay'),
 ('C','ch'),
 ('t','t'),
 ('R','er'), #r-colored schwa - stressed
 ('E','eh'),
 ('w','w'),
 ('l','l'),
 ('@','ae'),
 ('c','ah') #ɔ as in kɔt ('caught') isn't actually in the diphone gating data segment inventory
]
def hammondToDiphone(hammond_symb):
    relevant_relations = [rel for rel in hammond_diphone_inventory_relation if rel[0] == hammond_symb]
    diph_symbs = set([rel[1] for rel in relevant_relations])
    return diph_symbs
def diphoneToHammond(diph_symb):
    relevant_relations = [rel for rel in hammond_diphone_inventory_relation if rel[1] == diph_symb]
    hammond_symbs = set([rel[0] for rel in relevant_relations])
    return hammond_symbs

Below is the relation between symbols in Hammond's newdic and the IPA, based on comparing IPA transcriptions in IPhOD with those in Hammond's newdic. Of note:
 - the inventory of Hammond's newdic distinguishes stressed and unstressed versions of some sounds with distinct symbols (/ə/, /ɚ/, /ɪ/)
 - the inventory of Hammond's newdic distinguishes syllabic and nonsyllabic versions of some consonants with distinct symbols (/n/, /m/, /l/)
 - Hammond's newdic preserves the cot-caught distinction

In [42]:
hammond_IPA_relation = [
 ('h', 'h'),
 ('S', 'ʃ'),
 ('p', 'p'),
 ('x', 'ə'),
 ('m', 'm'),
 ('y', 'j'),
 ('v', 'v'),
 ('^', 'ʌ'),
 ('o', 'oʊ'),
 ('u', 'u'),
 ('I', 'ɪ'),
 ('G', 'ŋ'),
 ('N', 'ṇ'), #syllabic n is NOT in diphone gating inventory
 ('|', 'ɪ'), #as near as I can tell, this is unstressed/reduced 'ɪ'
 ('Z', 'ʒ'),
 ('L', 'l̩'),
 ('M', 'ṃ'), #syllabic m is NOT in diphone gating inventory
 ('i', 'i'),
 ('r', 'r'),
 ('g', 'g'),
 ('O', 'ɔɪ'),
 ('T', 'θ'),
 ('n', 'n'),
 ('J', 'dʒ'),
 ('d', 'd'),
 ('k', 'k'),
 ('W', 'aʊ'),
 ('f', 'f'),
 ('D', 'ð'),
 ('U', 'ʊ'),
 ('z', 'z'),
 ('Y', 'aɪ'),
 ('b', 'b'),
 ('X', 'ɚ'), #r-colored schwa - stressed
 ('a', 'ɑ'),
 ('s', 's'),
 ('e', 'eɪ'),
 ('C', 'tʃ'),
 ('t', 't'),
 ('R', 'ɚ'), #r-colored schwa - UNstressed
 ('E', 'ɛ'),
 ('w', 'w'),
 ('l', 'l'),
 ('@', 'æ'),
 ('c', 'ɔ')] #ɔ is NOT in diphone gating data inventory
def hammondToUnicodeIPA(hammondSymbol):
    mapping = dict(hammond_IPA_relation)
    #print(mapping)
    return mapping[hammondSymbol]
print(hammondToUnicodeIPA('T'))
def invertMapping(mydict): 
    return dict([[val, key] for key,val in mydict.items()])
def unicodeIPAToHammond(unicodeIPAsymbol):
    return invertMapping( dict(hammond_IPA_relation) )[unicodeIPAsymbol]
print(unicodeIPAToHammond('θ'))

θ
T
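
Because Hammond's newdic uses one character per segment, a whole transcription can be translated symbol by symbol and joined with the '.' convention from the Background section. The sketch below copies a handful of pairs from `hammond_IPA_relation`; `hammond_trn_to_ipa` is a hypothetical name, and the input strings are made up rather than looked up in newdic.

```python
# A few pairs copied from hammond_IPA_relation: k, @ -> æ, t, C -> tʃ, x -> ə.
toy_hammond_to_ipa = {'k': 'k', '@': '\u00e6', 't': 't', 'C': 't\u0283', 'x': '\u0259'}

def hammond_trn_to_ipa(trn, mapping, sep='.'):
    """Translate a Hammond transcription one character (one segment) at a time."""
    return sep.join(mapping[symbol] for symbol in trn)

print(hammond_trn_to_ipa('k@t', toy_hammond_to_ipa))  # made-up input string
print(hammond_trn_to_ipa('Cx', toy_hammond_to_ipa))   # made-up input string
```

The separator matters for the second example: 'tʃ' is two characters but a single segment.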


In [43]:
soundsInHammondsNewDic_IPA = set(map(hammondToUnicodeIPA, soundsInNewDic))
print(len(soundsInHammondsNewDic_IPA))
print(soundsInHammondsNewDic_IPA)

43
{'ʌ', 'ɔɪ', 'ɛ', 'l', 'aʊ', 'ɔ', 'g', 'aɪ', 'ɑ', 'r', 'tʃ', 'ɚ', 'd', 'ŋ', 'ɪ', 'z', 'p', 'ʃ', 'ʊ', 'l̩', 'm', 'f', 'ð', 'eɪ', 'ʒ', 'i', 'ə', 'w', 'j', 'ṃ', 't', 'v', 'ṇ', 'oʊ', 'n', 'b', 'dʒ', 's', 'u', 'θ', 'æ', 'h', 'k'}


### Comparison of inventories between Hammond's newdic and the diphone gating data

In [44]:
soundsMissingFromDiphoneData = soundsInHammondsNewDic_IPA - soundsInDiphoneMatrix_IPA
soundsMissingFromDiphoneData

{'ɔ', 'ṃ', 'ṇ'}

In [45]:
soundsMissingFromHammond = soundsInDiphoneMatrix_IPA - soundsInHammondsNewDic_IPA
soundsMissingFromHammond

{' ', 'ɑb', 'ɾ'}

Use of Hammond's newdic for transcriptions/the lexicon alongside the diphone gating data for channel/noise distributions minimally requires:
 1. merging syllabic m into m in Hammond's newdic.
 2. merging syllabic n into n in Hammond's newdic.
 3. merging 'ɔ' into 'ɑ' in Hammond's newdic.
 4. merging 'ɾ' into t (or d) in the diphone gating data.

Since the other two syllabic consonants in Hammond's newdic (syllabic m and syllabic n) cannot be represented in the gating inventory, I'm going to merge syllabic l into l in both Hammond's newdic and the gating data for parsimony, yielding the following set of alignment steps:
 1. merging syllabic m into m in Hammond's newdic.
 2. merging syllabic n into n in Hammond's newdic.
 3. merging 'ɔ' into 'ɑ' in Hammond's newdic.
 4. merging 'ɾ' into t (or d) in the diphone gating data.
 5. merging syllabic l into l in Hammond's newdic.
 6. merging syllabic l into l in the diphone gating data.
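
The six alignment steps above amount to two small substitution tables, one per data source. A minimal sketch, assuming transcriptions are already lists of IPA segments and that taps are merged into 't' (the list above leaves the choice between t and d open); the names `newdic_merges`, `gating_merges`, `applyMerges`, and `mergeInventory` are hypothetical.

```python
# Merges applied to Hammond's newdic transcriptions (steps 1-3 and 5).
newdic_merges = {'ṃ': 'm', 'ṇ': 'n', 'ɔ': 'ɑ', 'l̩': 'l'}
# Merges applied to the diphone gating data (steps 4 and 6).
# ASSUMPTION: taps go to 't'; 'd' would be an equally simple substitution.
gating_merges = {'ɾ': 't', 'l̩': 'l'}

def applyMerges(segments, merges):
    """Rewrite each segment of one transcription; unlisted segments pass through."""
    return [merges.get(seg, seg) for seg in segments]

def mergeInventory(inventory, merges):
    """Apply the same merges to a whole inventory (a set of segments)."""
    return {merges.get(seg, seg) for seg in inventory}

example = applyMerges(['k', 'ɔ', 't'], newdic_merges)
print(example)
```

After both tables are applied, the two merged inventories should differ only by the ' ' and 'ɑb' hacks in the gating data, which is an easy set-difference check to run.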

TWO NOTES ON PROSODY:
 1. The inventory of newdic incorporates stress information that is suppressed when rendered only in terms of IPA segments.
 2. The other prosodic annotations in Hammond's newdic and in other versions of the same data source (Hoosier mental lexicon, NETTALK) contain two pieces of information about each transcription:
     - one is the sequence of stresses associated with each vowel (\_ = unstressed, ' = moderate stress, \` = high stress).
     - I don't understand what the second one is yet.

## Lexicon - IPhOD

In [46]:
#All we want are arpabet symbols.
desired_fieldnames = [
#                       'Word',  #orthographic representation
                      'UnTrn', #unstressed CMU pronouncing dictionary transcription
#                       'StTrn', #stressed CMU pronouncing dictionary transcription
]

In [47]:
IPhOD_slim = [project_dict(row, desired_fieldnames) for row in lexicon_raw]
len(IPhOD_slim)
IPhOD_slim[0]
IPhOD_slim[2304]

54030

{'UnTrn': 'AH'}

{'UnTrn': 'AH.P.R.AA.K.S.AH.M.AH.T'}

In [48]:
IPhOD_slim[0]['UnTrn'].split('.')
IPhOD_slim[2304]['UnTrn'].split('.')

['AH']

['AH', 'P', 'R', 'AA', 'K', 'S', 'AH', 'M', 'AH', 'T']

In [49]:
def getArpabetSymbols(row):
    return row['UnTrn'].split('.')

arpabetSymbolsByWord = list(map(lambda r: set(getArpabetSymbols(r)), IPhOD_slim))
arpabetSymbolsByWord[:10]

[{'AH'},
 {'EY'},
 {'AH', 'EY', 'IH', 'L', 'P', 'R', 'T'},
 {'AA', 'D', 'K', 'R', 'V'},
 {'AH', 'EH', 'N', 'R'},
 {'AE', 'B'},
 {'B', 'EY', 'IY'},
 {'AE', 'AH', 'B', 'K'},
 {'AE', 'AH', 'B', 'K', 'S'},
 {'AE', 'AH', 'B', 'IY', 'L', 'N', 'OW'}]

In [50]:
arpabetSymbolsInIPhOD = set()
for eachSymbolSet in arpabetSymbolsByWord:
    arpabetSymbolsInIPhOD.update(eachSymbolSet)
print(arpabetSymbolsInIPhOD)
print(len(arpabetSymbolsInIPhOD))

{'CH', 'EY', 'W', 'Y', 'V', 'IY', 'JH', 'IH', 'SH', 'AY', 'T', 'HH', 'P', 'R', 'F', 'K', 'B', 'G', 'OY', 'AH', 'ER', 'AO', 'UW', 'S', 'DH', 'AE', 'AA', 'UH', 'L', 'D', 'M', 'AW', 'N', 'OW', 'Z', 'NG', 'ZH', 'EH', 'TH'}
39


In [51]:
arpabet = set(['AO','AA','IY','UW','EH','IH','UH','AH','AX','AE','EY','AY','OW','AW','OY','ER','P','B','T','D','K','G','CH','JH','F','V','TH','DH','S','Z','SH','ZH','HH','M','EM','N','EN','NG','ENG','L','EL','R','DX','NX','Y','W','Q'])
print(arpabet)
len(arpabet)

{'CH', 'EM', 'EY', 'W', 'Y', 'NX', 'V', 'IY', 'JH', 'IH', 'SH', 'AY', 'T', 'HH', 'P', 'R', 'F', 'ENG', 'AX', 'B', 'K', 'G', 'OY', 'AH', 'ER', 'AO', 'UW', 'S', 'DH', 'AE', 'AA', 'UH', 'EN', 'M', 'D', 'L', 'AW', 'EL', 'OW', 'Z', 'N', 'NG', 'ZH', 'EH', 'TH', 'DX', 'Q'}


47

In [52]:
arpabet = set(arpabet)
missingFromIphod = arpabet - arpabetSymbolsInIPhOD
missingFromIphod

{'AX', 'DX', 'EL', 'EM', 'EN', 'ENG', 'NX', 'Q'}

In [53]:
missingFromArpabet = arpabetSymbolsInIPhOD - arpabet
missingFromArpabet

set()

Ok. So IPhOD doesn't use all of the sounds in the arpabet, but it also doesn't use any extra sounds that are NOT in the arpabet. 

Should I be worried about the sounds NOT in IPhOD?

 - AX is a schwa - this is potentially a problem.
 - DX is a tap - this is unfortunate but not a showstopper; all taps are transcribed as alveolar stops (T or D) in the CMU transcription.
 - EL is syllabic dark l.
 - EM is syllabic m.
 - EN is syllabic n.
 - ENG is syllabic engma.
 - NX is a nasalized tap.
 - Q is a glottal stop.

In [54]:
print("\u0278")
print("ɸ")
print("ɸ" == "\u0278")

ɸ
ɸ
True


In [55]:
arpabetUnicodeIPArelation = set([ \
('AO', 'ɔ'),
('AA', 'ɑ'),
('IY', 'i'),
('UW', 'u'),
('EH', 'ɛ'),
('IH', 'ɪ'),
('UH', 'ʊ'),
('AH', 'ʌ'),
('AX', 'ə'),
('AE', 'æ'),
('EY', 'eɪ'),
('AY', 'aɪ'),
('OW', 'oʊ'),
('AW', 'aʊ'),
('OY', 'ɔɪ'),
('ER', 'ɚ'),
('P', 'p'),
('B', 'b'),
('T', 't'),
('D', 'd'),
('K', 'k'),
('G', 'g'),
('CH', 'tʃ'),
('JH', 'dʒ'),
('F', 'f'),
('V', 'v'),
('TH', 'θ'),
('DH', 'ð'),
('S', 's'),
('Z', 'z'),
('SH', 'ʃ'),
('ZH', 'ʒ'),
('HH', 'h'),
('M', 'm'),
('EM', 'm̩'),
('N', 'n'),
('EN', 'n̩'),
('NG', 'ŋ'),
('ENG', 'ŋ̩'),
('L', 'l'),
('EL', 'l̩'),
('R', 'r'),
('DX', 'ɾ'),
('NX', 'ɾ̃'),
('Y', 'j'),
('W', 'w'),
('Q', 'ʔ')
])
arpabetUnicodeIPArelation

{('AA', 'ɑ'),
 ('AE', 'æ'),
 ('AH', 'ʌ'),
 ('AO', 'ɔ'),
 ('AW', 'aʊ'),
 ('AX', 'ə'),
 ('AY', 'aɪ'),
 ('B', 'b'),
 ('CH', 'tʃ'),
 ('D', 'd'),
 ('DH', 'ð'),
 ('DX', 'ɾ'),
 ('EH', 'ɛ'),
 ('EL', 'l̩'),
 ('EM', 'm̩'),
 ('EN', 'n̩'),
 ('ENG', 'ŋ̩'),
 ('ER', 'ɚ'),
 ('EY', 'eɪ'),
 ('F', 'f'),
 ('G', 'g'),
 ('HH', 'h'),
 ('IH', 'ɪ'),
 ('IY', 'i'),
 ('JH', 'dʒ'),
 ('K', 'k'),
 ('L', 'l'),
 ('M', 'm'),
 ('N', 'n'),
 ('NG', 'ŋ'),
 ('NX', 'ɾ̃'),
 ('OW', 'oʊ'),
 ('OY', 'ɔɪ'),
 ('P', 'p'),
 ('Q', 'ʔ'),
 ('R', 'r'),
 ('S', 's'),
 ('SH', 'ʃ'),
 ('T', 't'),
 ('TH', 'θ'),
 ('UH', 'ʊ'),
 ('UW', 'u'),
 ('V', 'v'),
 ('W', 'w'),
 ('Y', 'j'),
 ('Z', 'z'),
 ('ZH', 'ʒ')}

In [56]:
def arpabetToUnicodeIPA(arpabetSymbol):
    mapping = dict(arpabetUnicodeIPArelation)
    #print(mapping)
    return mapping[arpabetSymbol]
arpabetToUnicodeIPA('TH')

def invertMapping(mydict): 
    return dict([[val, key] for key,val in mydict.items()])
def unicodeIPAToArpabet(unicodeIPAsymbol):
    return invertMapping( dict(arpabetUnicodeIPArelation) )[unicodeIPAsymbol]
unicodeIPAToArpabet('θ')

'θ'

'TH'

In [57]:
arpabetSymbolsInIPhOD_IPA = set(map(arpabetToUnicodeIPA, arpabetSymbolsInIPhOD))
len(arpabetSymbolsInIPhOD_IPA)

39

### Comparison of IPhOD inventory and diphone gating data

In [58]:
iphod_inventory = arpabetSymbolsInIPhOD_IPA
gating_inventory = soundsInDiphoneMatrix_IPA

In [59]:
in_iphod_NOT_in_gating = iphod_inventory - gating_inventory
in_iphod_NOT_in_gating

{'ɔ'}

'ɔ' should indeed be missing from the gating data, as noted previously.

In [60]:
in_gating_NOT_in_iphod = gating_inventory - iphod_inventory
in_gating_NOT_in_iphod

{' ', 'l̩', 'ɑb', 'ə', 'ɾ'}

 - ' ' and 'ɑb' are hacks I introduced (and documented) above; they are not concerning.
 - syllabic l, schwa, and taps are indeed documented as missing from CMU pronouncing dictionary representations.

Use of IPhOD for transcriptions/the lexicon alongside the diphone gating data for channel/noise distributions requires:
 1. merging 'ɔ' into 'ɑ' in IPhOD.
 2. mapping at least unstressed 'ʌ' to 'ə' in IPhOD.
 3. merging syllabic l into l in the diphone gating data.
 4. merging 'ɾ' into t (or d) in the diphone gating data.
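
Steps 1 and 2 can be applied directly to ARPABET transcriptions before conversion to IPA. A minimal sketch, assuming the stressed transcription field (`StTrn`) uses the same '.'-separated layout as `UnTrn`, with a CMU-style stress digit (0, 1, or 2) appended to each vowel; that layout, the stress pattern of the example string, and the function names are assumptions rather than something verified against IPhOD here.

```python
def alignIPhODToken(token):
    """Map one stressed ARPABET token (e.g. 'AH0') to an aligned, stress-free symbol."""
    symbol = token.rstrip('012')      # strip a trailing stress digit, if any
    stress = token[len(symbol):]      # '' for consonants
    if symbol == 'AH' and stress == '0':
        return 'AX'                   # step 2: unstressed ʌ -> ə
    if symbol == 'AO':
        return 'AA'                   # step 1: ɔ -> ɑ
    return symbol

def alignIPhODTranscription(stTrn):
    """Align a whole '.'-separated stressed transcription."""
    return [alignIPhODToken(tok) for tok in stTrn.split('.')]

# Hypothetical stressed transcription; the stress digits are illustrative.
aligned = alignIPhODTranscription('AH0.P.R.AA1.K.S.AH0.M.AH0.T')
print(aligned)
```

The resulting symbols can then be passed through `arpabetToUnicodeIPA`, since 'AX' is already present in `arpabetUnicodeIPArelation`.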

# Processing pipeline sketch

1. **IPA**: To make them easier to inspect, work with, and analyze, all transcriptions in IPhOD, Hammond's newdic, and the gating data should be transformed to IPA symbols.

2. **Alignment**: For either lexicon (IPhOD vs. Hammond's newdic) to be useful with the coarticulation/noise data from Warner et al. (2014), the lexicon and the gating data need to use a common inventory of segments.
  - Use of IPhOD for transcriptions/the lexicon alongside the diphone gating data for channel/noise distributions requires:
    1. merging 'ɔ' into 'ɑ' in IPhOD.
    2. mapping at least unstressed 'ʌ' to 'ə' in IPhOD.
    3. merging syllabic l into l in the diphone gating data.
    4. merging 'ɾ' into t (or d) in the diphone gating data.
  - Use of Hammond's newdic for transcriptions/the lexicon alongside the diphone gating data for channel/noise distributions requires:
    1. merging syllabic m into m in Hammond's newdic.
    2. merging syllabic n into n in Hammond's newdic.
    3. merging 'ɔ' into 'ɑ' in Hammond's newdic.
    4. merging 'ɾ' into t (or d) in the diphone gating data.
    5. merging syllabic l into l in Hammond's newdic.
    6. merging syllabic l into l in the diphone gating data. 

3. **Transformation into model inputs**: The word recognition model needs 
 - a prior distribution over phonological wordforms, defined from (the aligned version of) the chosen lexicon.
 - a coarticulation/noise model, defined from (the aligned version of) the gating data, plus smoothing.

These two model inputs are designed to be used with minimal modification by an implemented word recognition model, documented in other notebooks.
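
As a rough illustration of the two model inputs, the sketch below builds a prior by normalizing wordform frequencies and a channel distribution by smoothing confusion counts. It makes several assumptions: add-alpha smoothing stands in for the unspecified "smoothing", the channel is uniphone rather than diphone, and all names and toy numbers are hypothetical rather than taken from IPhOD or the gating data.

```python
def makePrior(wordformFrequencies):
    """Normalize raw wordform frequencies into a probability distribution."""
    total = sum(wordformFrequencies.values())
    return {w: f / total for w, f in wordformFrequencies.items()}

def makeChannel(confusionCounts, inventory, alpha=1.0):
    """Add-alpha smoothed p(response | stimulus) over a shared inventory."""
    channel = {}
    for stimulus in inventory:
        counts = confusionCounts.get(stimulus, {})
        # Only responses inside the shared inventory contribute to the total.
        denom = sum(counts.get(r, 0) for r in inventory) + alpha * len(inventory)
        channel[stimulus] = {r: (counts.get(r, 0) + alpha) / denom
                             for r in inventory}
    return channel

# Hypothetical toy inputs.
toyFrequencies = {'kæt': 3.0, 'kæp': 1.0}
toyConfusions = {'t': {'t': 8, 'p': 2}, 'p': {'p': 9, 't': 1}}
toyInventory = ['p', 't']

prior = makePrior(toyFrequencies)
channel = makeChannel(toyConfusions, toyInventory, alpha=1.0)
print(prior)
print(channel['t'])
```

Smoothing guarantees that every response has nonzero probability for every stimulus, so no wordform in the lexicon is ruled out by a confusion that simply never occurred in the trials.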

For each of the three stages of processing, there are two notebooks: an 'a' notebook ('1a', '2a', '3a') for processing the gating data, and a 'b' notebook ('1b', '2b', '3b') for processing the lexicon.

## Global todo tally:

### New capabilities
 
 1. **Support generating random lexicons with less structure than a given real one** - Notebook 3b
  - E.g. it would be nice to have a prior distribution over wordforms that uses the same inventory as the aligned version of IPhOD or of Hammond's newdic and, e.g., the same uniphone distribution but no phonotactics, or similar phonotactics but a different level of 'clumpiness' among (e.g. morphologically) related forms than in the actual lexicon.
 
 2. **Support prosodically-annotated segmental representations** - Notebook 3a (+ 1b, 2b, 3b + Model Notebooks)
  1. First goal: Modify Notebook 3a to define/export channel distributions that include vowel stress information instead of just segmental identities.
     - May help identify and explain anomalies in the confusion data / repercussions of alignment choices.
     - *Low priority*, unless it becomes clear (as seems to have been the case in the past in some cases) that ignoring vowel stress is the cause of strange results in the context of pursuing other questions.
  2. Second goal: Be able to use channel distributions that incorporate stress information in the word recognition model.
     - May help increase model performance and the naturalness of its behavior.
     - Requires non-trivial modification of scripts processing the lexicon:
        1. Reprocessing Hammond's newdic in Processing Notebook 1b so that stress values are interleaved (as in IPhOD and the gating data); because Hammond's newdic builds stress information into its symbol inventory, this may involve changing the mapping into IPA+stress (for vowels/maybe syllabic consonants, anyway).
           - This "maybe" should involve checking that the stress level suggested by the symbol and the stress annotation given are always compatible, and resolving discrepancies as needed.
        2. Producing lexicon distributions in Processing Notebook 3b that include stress information.
        3. Modifying the word recognition model to deal with how the channel output distribution’s sample space is strictly coarser grained than that of the input. (Gating trial stimuli have stress annotations, but gating trial responses do *not*.)
     - *Lowest priority* currently. Requires a lot of work; unless, e.g., the channel distributions with vs. without stress turn out to differ dramatically, it's not clear what would make the work worth it.
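
For the first capability above (random lexicons with the same uniphone distribution but no phonotactics), one simple approach is to draw each segment independently from the real lexicon's unigram distribution. The sketch below is one possible reading of that idea, not a settled design: the function name, the choice to reuse the empirical word-length distribution, and the toy lexicon are all assumptions.

```python
import random
from collections import Counter

def sampleUnstructuredLexicon(lexicon, size, seed=0):
    """Sample up to `size` distinct wordforms with no phonotactic structure.

    Segments are drawn independently from the uniphone (unigram) distribution
    of `lexicon`; word lengths are drawn from its empirical length distribution.
    """
    rng = random.Random(seed)
    counts = Counter(seg for word in lexicon for seg in word)
    symbols = sorted(counts)
    weights = [counts[s] for s in symbols]
    lengths = [len(word) for word in lexicon]
    sampled = set()
    attempts = 0
    # Duplicates are rejected, so bound the loop in case the inventory is tiny.
    while len(sampled) < size and attempts < 1000 * size:
        attempts += 1
        n = rng.choice(lengths)
        sampled.add(tuple(rng.choices(symbols, weights=weights, k=n)))
    return sorted(sampled)

# Hypothetical toy lexicon of segment tuples.
toyLexicon = [('k', 'æ', 't'), ('d', 'ɑ', 'g'), ('s', 't', 'ɑ', 'p'), ('æ', 't')]
randomLexicon = sampleUnstructuredLexicon(toyLexicon, size=5)
print(randomLexicon)
```

Because duplicate wordforms are rejected, the uniphone distribution of the sampled lexicon matches the original only approximately; the mismatch shrinks as the lexicon grows relative to the number of collisions.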

### Open questions

1. **Handling the lack of schwas in IPhOD** - Notebooks 2a, 2b
 - All vowels in IPhOD are annotated with one of a few levels of stress. There are no transcriptions containing 'ə'. Deleting trial data with stimuli or responses containing 'ə' is not a good idea. To avoid distorting the channel distribution, it seems reasonable to instead change the lexicon and convert unstressed ('stress level 0') 'ʌ's (42% of all unstressed vowel tokens in IPhOD) to 'ə's; it might not be unreasonable to merge unstressed 'i' (12.5% of all unstressed vowel tokens in IPhOD) or 'ɪ' (24.5% of all unstressed vowel tokens in IPhOD) into 'ə' as well. 
 
   - At stake here is the accuracy and usefulness of the word recognition model -- the question is whether 12.5+24.5 = 37% more of all unstressed vowel tokens in IPhOD should be moved from two segments into a third ('ə'), making schwa almost twice as common as it is with the current merger of unstressed 'ʌ' into 'ə'. Given the empirical ubiquity of vowel reduction in English and the practical benefit of capturing some aspects of richer (and more complicated) prosodic annotation with very little work, this might be a simple and justifiable way to avoid outliers or anomalously extreme predictions (most likely, finding words harder to recognize than they actually are).
   
   - *Easy to implement.* Ask Marc whether this is a no-brainer / reasonable before comparing word recognition performance (e.g. an estimate of average surprisal during word recognition over the whole lexicon) more rigorously/exhaustively.

2. **Good things to know about the gating data** - Notebook 2a
   1. **Features and Errors**: Import a feature matrix from Phonological Corpus Tools (or my version of Riggle's feature chart); use it to get a finer grained description about errors in terms of features. This may be useful for identifying, understanding, and deciding what to do with outliers when e.g. considering the problem immediately below.
       - This would be tremendously useful for investigating all kinds of questions about the gating data, the performance of the word recognition model, making interpretable graphs (for trivially organizing symbols by feature/natural class), and relating the confusability data to hypothetical phonological constraints (perceptibility constraints "projected from the 'P-map'").
       - *High priority.*
   2. **What is the impact of inventory alignments (and ignoring vowel stress) on the channel distribution?**: When segment types get remarkably easier or harder to identify after aligning the gating data with a lexicon in some way, we may have caused a net-undesirable change with respect to the goal of modeling word recognition (e.g. maybe certain words now are judged substantially harder to accurately recognize than they actually are), or at least one that we should be able to identify and investigate. It would be nice to have some relatively general measures to compare channel distributions before vs. after at least some alignments - e.g. I could, for each segment, take the KL divergence between a segment's uniphone channel distribution when unaligned vs. when aligned for IPhOD or when aligned for IPhOD vs. aligned for Hammond's newdic; ranking segments by that difference and looking at the distribution of differences could point out which segments are most affected by the alignment changes and prompt further investigation. (Recall that the difference between the unaligned trials and the newdic-aligned trials is that taps in the trial data have been rewritten as [t], and that the difference between the newdic-aligned trials and the IPhOD-aligned trials is that syllabic l in the trial data has been rewritten as [l]; note also that no outputs of Notebook 3a take stress or prosody into account, currently.) 
   
      - For example: taps are currently merged into [t]s, but -- if I recall correctly -- taps are relatively easily confused with both [t]s and [d]s, whereas [t]s and [d]s are not easily confused with each other. Merging taps into [t] in the trial data therefore undesirably makes [t]s seem more confusable with [d]s than they actually are. Maybe instead of merging all taps into [t]s, I should exclude all trials with taps in the stimuli and convert taps in the responses of the remaining trials evenly but randomly into [t]s and [d]s.
      - *Moderate priority.* Could plausibly become higher priority if I encounter issues or anomalies involving taps, low back vowels, l, or unstressed/reduced vowels, or if I have more information (as a result of e.g. integrating features).

3. **Choice of lexicon & exact problem formulation** - Notebooks 2a, 2b
 - Q: Is there any reason to prefer Hammond's newdic over IPhOD or vice versa for any relevant scientific question or practical issue?
    - A: IPhOD has ≈2-3x as many phonological wordforms as Hammond's newdic. Consequences:
    - *Higher expected surprisal*, *more informative top-down expectations*. The prior with IPhOD (relative to the prior over Hammond's newdic/the portion of it alignable with SUBTLEX word frequencies) probably contains a larger tail of infrequent words that will increase expected surprisal. Acoustic noise should have less effect on average using the IPhOD prior; in general, there may also plausibly be *some* relatively infrequent words where top-down expectations and a larger, denser lexicon make them *more likely* to be confused relative to a prior based on Hammond's newdic.
    - IPhOD has multiple phonological variants for the same orthographic word, unlike Hammond's newdic.
      1. *Neighborhood density and confusability.* Between simply having more words and specifically allowing multiple phonological variants per orthographic word, I expect the average phonological neighborhood to be denser than in Hammond's newdic, increasing the average confusability of each wordform.
      2. *Phonological wordform vs. orthographic wordform vs. psycholinguistic lemma.* Without any kind of model of the relative frequency of different variants of an orthographic word (as one might get from the data of a transcribed corpus of natural speech - e.g. the Buckeye corpus), I split the IPhOD-provided SUBTLEX_US frequency of an orthographic wordform equally among its phonological variants. If the modeled problem is taken to be about the speaker choosing an intended phonological wordform (rather than an orthographic one or a 'lemma' - plausibly operationalizable here as a part-of-speech and a subset of all the phonological variants associated with an orthographic word), then the prior probability of the speaker's specific intended phonological variant will be low (perhaps seemingly artificially so) and so will the listener's expected degree of belief in the speaker's actual intended variant. This could be addressed by slightly altering the exact formulation of the modeled problem.
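
The equal split of an orthographic word's frequency among its phonological variants can be sketched as follows. The row layout `(word, transcription, frequency)`, the function name, and the toy numbers are hypothetical; the sketch also assumes that homophones (different orthographic words sharing a transcription) should pool their probability mass, which is a modeling choice the discussion above leaves open.

```python
from collections import defaultdict

def splitFrequencies(rows):
    """Build a prior over phonological wordforms.

    rows: iterable of (orthographic word, transcription, word frequency).
    Each word's frequency is divided equally among its distinct variants.
    """
    variants = defaultdict(set)
    frequency = {}
    for word, transcription, freq in rows:
        variants[word].add(transcription)
        frequency[word] = freq
    mass = defaultdict(float)
    for word, transcriptions in variants.items():
        share = frequency[word] / len(transcriptions)
        for transcription in transcriptions:
            mass[transcription] += share  # homophones pool their mass
    total = sum(mass.values())
    return {tr: m / total for tr, m in mass.items()}

# Hypothetical toy rows: one word with two variants, one with a single variant.
toyRows = [
    ('either', 'IY.DH.ER', 10.0),
    ('either', 'AY.DH.ER', 10.0),
    ('cat', 'K.AE.T', 30.0),
]
variantPrior = splitFrequencies(toyRows)
print(variantPrior)
```

Summing `variantPrior` over all variants of one orthographic word recovers that word's share of the total, which is one way to score the lemma-like formulation instead of the variant-level one.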