**Notebook author:** emeinhardt@ucsd.edu

In [1]:
# Display the value of every expression in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Overview-and-requirements" data-toc-modified-id="Overview-and-requirements-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Overview and requirements</a></span><ul class="toc-item"><li><span><a href="#Context" data-toc-modified-id="Context-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Context</a></span></li><li><span><a href="#About-the-transcriptions" data-toc-modified-id="About-the-transcriptions-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>About the transcriptions</a></span></li><li><span><a href="#This-notebook" data-toc-modified-id="This-notebook-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>This notebook</a></span></li></ul></li><li><span><a href="#Import-data" data-toc-modified-id="Import-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Import data</a></span><ul class="toc-item"><li><span><a href="#Hammond's-newdic" data-toc-modified-id="Hammond's-newdic-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Hammond's newdic</a></span></li></ul></li><li><span><a href="#Identify-the-inventory-of-symbols" data-toc-modified-id="Identify-the-inventory-of-symbols-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Identify the inventory of symbols</a></span></li><li><span><a href="#Convert-phonological-representations-to-IPA" data-toc-modified-id="Convert-phonological-representations-to-IPA-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Convert phonological representations to IPA</a></span><ul class="toc-item"><li><span><a href="#Group-symbols-to-be-mapped" data-toc-modified-id="Group-symbols-to-be-mapped-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Group symbols to be mapped</a></span><ul class="toc-item"><li><span><a href="#Fricatives-and-Affricates" data-toc-modified-id="Fricatives-and-Affricates-4.1.1"><span class="toc-item-num">4.1.1&nbsp;&nbsp;</span>Fricatives and Affricates</a></span></li><li><span><a href="#Oral-stops" 
data-toc-modified-id="Oral-stops-4.1.2"><span class="toc-item-num">4.1.2&nbsp;&nbsp;</span>Oral stops</a></span></li><li><span><a href="#Nasal-stops" data-toc-modified-id="Nasal-stops-4.1.3"><span class="toc-item-num">4.1.3&nbsp;&nbsp;</span>Nasal stops</a></span></li><li><span><a href="#Liquids-and-Glides" data-toc-modified-id="Liquids-and-Glides-4.1.4"><span class="toc-item-num">4.1.4&nbsp;&nbsp;</span>Liquids and Glides</a></span></li><li><span><a href="#Low-vowels" data-toc-modified-id="Low-vowels-4.1.5"><span class="toc-item-num">4.1.5&nbsp;&nbsp;</span>Low vowels</a></span></li><li><span><a href="#Mid-vowels" data-toc-modified-id="Mid-vowels-4.1.6"><span class="toc-item-num">4.1.6&nbsp;&nbsp;</span>Mid vowels</a></span></li><li><span><a href="#High-vowels" data-toc-modified-id="High-vowels-4.1.7"><span class="toc-item-num">4.1.7&nbsp;&nbsp;</span>High vowels</a></span></li><li><span><a href="#Diphthongs" data-toc-modified-id="Diphthongs-4.1.8"><span class="toc-item-num">4.1.8&nbsp;&nbsp;</span>Diphthongs</a></span></li><li><span><a href="#Altogether" data-toc-modified-id="Altogether-4.1.9"><span class="toc-item-num">4.1.9&nbsp;&nbsp;</span>Altogether</a></span></li></ul></li><li><span><a href="#Define-newdic-inventory-->-IPA-mapping" data-toc-modified-id="Define-newdic-inventory-->-IPA-mapping-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Define newdic inventory -&gt; IPA mapping</a></span></li><li><span><a href="#Transform-transcriptions-to-IPA..." data-toc-modified-id="Transform-transcriptions-to-IPA...-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Transform transcriptions to IPA...</a></span></li><li><span><a href="#Transform-entries-to-have-IPA-transcriptions..." 
data-toc-modified-id="Transform-entries-to-have-IPA-transcriptions...-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Transform entries to have IPA transcriptions...</a></span></li><li><span><a href="#Write-to-file-/-read-back-in" data-toc-modified-id="Write-to-file-/-read-back-in-4.5"><span class="toc-item-num">4.5&nbsp;&nbsp;</span>Write to file / read back in</a></span></li></ul></li></ul></div>

# Overview and requirements

## Context

Prof. Mike Hammond has a transcribed and annotated dictionary of American English located at 
 - http://dingo.sbs.arizona.edu/~hammond/lsasummer11/newdic

There's no documentation I'm aware of as to where any of this information came from. It was used to help create the inventory of American English diphones for Warner et al. 2014 ("Tracking the perception of the sounds of English").

Searching online for some of the transcriptions suggests that they are the same as those in the 'NETtalk corpus' of Sejnowski & Rosenberg 1987 ("Parallel networks that learn to pronounce English text"). That dataset is available from UCI's Machine Learning Repository:
 - https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Nettalk+Corpus)
 
Both `newdic.txt` and `nettalk.data` are included in this repository. The two files are not identical and contain slightly different annotations about each entry; newdic appears to contain about 500 fewer entries.

## About the transcriptions

The text of Sejnowski & Rosenberg indicates that the transcriptions came from an unspecified version of *Merriam-Webster's Pocket Dictionary*. (In other words, we have no systematic information or documentation about what the inventory is or what the assumptions of the transcribers were.)

Each line of both dictionaries associates an orthographic wordform with a transcription, information about stress and syllabic structure, and other information. In `nettalk.data`, the other information is an integer indicating whether the word is foreign or irregular. In `newdic.txt`, it consists of three columns: a column of the form `S1`, `S2`, ... which, in the entries listed below, matches the number of syllables in the stress pattern (for example, `S3` for *abacus*, whose primary stress is on the first syllable); what might be frequency counts (from an unknown dataset); and a list of part-of-speech tags that the orthographic wordform can have (based on an unknown dataset or annotation procedure).

There are about 20k entries in each file and each associates an orthographic wordform with exactly one transcription.
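The one-transcription-per-wordform claim can be checked once the rows are loaded. A minimal sketch, using three rows copied from the listing below and a hypothetical helper `duplicated_wordforms` that is not part of the original notebook:

```python
from collections import Counter

def duplicated_wordforms(rows):
    """Return the orthographic forms that occur in more than one row."""
    counts = Counter(row['Orthography'] for row in rows)
    return sorted(form for form, n in counts.items() if n > 1)

# Three rows copied from newdic.txt; in the notebook, pass newdic_raw instead.
sample_rows = [
    {'Orthography': 'aback', 'Transcription': 'xb@k'},
    {'Orthography': 'abacus', 'Transcription': '@bxkxs'},
    {'Orthography': 'abaft', 'Transcription': 'xb@ft'},
]

# An empty list means no wordform is repeated.
print(duplicated_wordforms(sample_rows))
```

Counting on the `Orthography` field alone is enough here, because a repeated wordform is exactly what would allow more than one transcription.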

## This notebook

This notebook converts the transcriptions in `newdic.txt` to Unicode IPA symbols and documents my choices / my process of identifying correspondences.

I am using Python 3. (Note that Python *3* makes Unicode much more straightforward and painless to deal with than Python 2.) There are no other salient dependencies.

# Import data

In [2]:
%pwd

'/mnt/cube/home/AD/emeinhar/newdic-nettalk'

In [3]:
import csv

## Hammond's newdic

In [4]:
%ls newdic*

newdic.txt


In [5]:
newdic_fn = "newdic.txt"

In [6]:
%cat -n newdic.txt | head -60

     1	x	_	S1	a	23178	(N IA VB PP)
     2	ardvark	'`	S2	aardvark	0	(N)
     3	xb@k	_'	S2	aback	2	(AV)
     4	@bxkxs	'__	S3	abacus	0	(N)
     5	xb@ft	_'	S2	abaft	0	(AV PP)
     6	@bxloni	`_'_	S4	abalone	0	(N)
     7	xb@nd|n	_'_	S3	abandon	17	(VT N)
     8	xbes	_'	S2	abase	0	(VT)
     9	xb@S	_'	S2	abash	0	(VT)
    10	xbet	_'	S2	abate	0	(VT VI)
    11	@bxti	'_`	S3	abatis	0	(N)
    12	@bxtwar	'_`	S3	abattoir	0	(N)
    13	@bxsi	'__	S3	abbacy	0	(N)
    14	@be	_'	S2	abbe	3	(N)
    15	@bxs	'_	S2	abbess	0	(N)
    16	@bi	'_	S2	abbey	7	(N)
    17	@bxt	'_	S2	abbot	2	(N)
    18	xbriviet	_'_`	S4	abbreviate	0	(VT)
    19	xbrivieS|n	_`_'_	S5	abbreviation	1	(N)
    20	@bdIket	'_`	S3	abdicate	0	(VT VI)
    21	@bdIkeS|n	`_'_	S4	abdication	0	(N)
    22	@bdxmxn	'__	S3	abdomen	6	(N)
    23	@bdamxnL	_'__	S4	abdominal	4	(AJ)
    24	@bd^kt	_'	S2	abduct	0	(VT)
    25	xbim	_'	S2	abeam	0	(AV)
    26	ebisidErixn	`_`'__	S6	abecedarian	0	(N AJ)
    27	xbEd	_'	S2	ab

In [7]:
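newdic_raw = []
# Column names are assigned here; newdic.txt itself has no header row.
fieldnames = ['Transcription', 'stressInfoA', 'stressInfoB', 'Orthography', 'Frequency', 'PoSs']
# newline='' is what the csv module documentation recommends for files passed to a reader.
with open(newdic_fn, newline='') as csvfile:
    my_reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=fieldnames)
    for row in my_reader:
        newdic_raw.append(row)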

newdic_raw[0]
len(newdic_raw[0].keys())

OrderedDict([('Transcription', 'x'),
             ('stressInfoA', '_'),
             ('stressInfoB', 'S1'),
             ('Orthography', 'a'),
             ('Frequency', '23178'),
             ('PoSs', '(N IA VB PP)')])

6
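The two stress columns can also be read programmatically. The sketch below assumes that `'` marks primary stress, a backtick marks secondary stress, and `_` marks an unstressed syllable; that interpretation is inferred from the listing and is not documented anywhere I know of. `describe_stress` is a hypothetical helper, not part of the original notebook.

```python
def describe_stress(pattern, s_column):
    """Summarize a stress pattern such as "_`_'_" and an S column such as 'S5'."""
    return {
        # one character per syllable (assumed)
        'n_syllables': len(pattern),
        # 1-based position of the primary-stressed syllable, or None if there is none
        'primary': pattern.index("'") + 1 if "'" in pattern else None,
        # the integer after the leading 'S'
        's_number': int(s_column[1:]),
    }

# Values copied from the listing above (abbreviation, abacus, a).
for pattern, s_column in [("_`_'_", 'S5'), ("'__", 'S3'), ('_', 'S1')]:
    print(pattern, describe_stress(pattern, s_column))
```

In these three rows `s_number` equals `n_syllables`, while `primary` differs from both for *abbreviation* and *abacus*.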

# Identify the inventory of symbols

In [8]:
from functools import reduce

In [9]:
def union(Ss):
    """Return the union of an iterable of sets (the empty set if there are none)."""
    return reduce(set.union, Ss, set())

In [10]:
transcriptions = [row['Transcription'] for row in newdic_raw]
transcriptions[:10]

['x',
 'ardvark',
 'xb@k',
 '@bxkxs',
 'xb@ft',
 '@bxloni',
 'xb@nd|n',
 'xbes',
 'xb@S',
 'xbet']

In [11]:
getSymbols = lambda transcription: set(transcription)
symbols = union(map(getSymbols, transcriptions))
len(symbols)
symbols

45

{'@',
 'C',
 'D',
 'E',
 'G',
 'I',
 'J',
 'L',
 'M',
 'N',
 'O',
 'R',
 'S',
 'T',
 'U',
 'W',
 'X',
 'Y',
 'Z',
 '^',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 '|'}
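Beyond the set of symbols, it can help to know how often each one occurs, since rare symbols deserve a closer look before they are mapped. A small sketch with `collections.Counter`, run here on the ten transcriptions printed above; `symbol_counts` is a hypothetical name, and in the notebook it would be called on `transcriptions`:

```python
from collections import Counter

def symbol_counts(transcriptions):
    """Count how many times each symbol occurs across all transcriptions."""
    counts = Counter()
    for t in transcriptions:
        counts.update(t)  # updating with a string counts it character by character
    return counts

first_ten = ['x', 'ardvark', 'xb@k', '@bxkxs', 'xb@ft',
             '@bxloni', 'xb@nd|n', 'xbes', 'xb@S', 'xbet']
counts = symbol_counts(first_ten)

# Most frequent symbols first.
print(counts.most_common(3))
```

This works character by character only because every newdic symbol is a single character.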

# Convert phonological representations to IPA

In [12]:
newdic_raw[2]

OrderedDict([('Transcription', 'xb@k'),
             ('stressInfoA', "_'"),
             ('stressInfoB', 'S2'),
             ('Orthography', 'aback'),
             ('Frequency', '2'),
             ('PoSs', '(AV)')])

We want to convert the transcription field into a string of IPA symbols, with each segment separated by a '.'
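As a sketch of the target format, assuming a dictionary from newdic symbols to IPA strings like the one built in the cells that follow: `to_ipa` and `partial_mapping` are hypothetical names for illustration, and this is not necessarily the function the notebook defines later.

```python
def to_ipa(transcription, mapping, sep='.'):
    """Map each newdic symbol to its IPA string and join the segments with `sep`."""
    # Every newdic symbol is one character, so iterating over the string yields segments.
    # An unmapped symbol raises KeyError rather than passing through silently.
    return sep.join(mapping[symbol] for symbol in transcription)

# Partial mapping, for illustration only.
partial_mapping = {'x': 'ə', 'b': 'b', '@': 'æ', 'k': 'k'}

print(to_ipa('xb@k', partial_mapping))  # ə.b.æ.k
```

The separator matters because some IPA targets, such as the affricates and diphthongs, are more than one character long, so segment boundaries could not be recovered from an unseparated string.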

## Group symbols to be mapped

This subsection incrementally shows/documents the mapping into IPA symbols with examples.

In [13]:
def wordsWithSymbol(s):
    return [row for row in newdic_raw if s in row['Transcription']]

In [14]:
# preliminary sorting of consonants by manner
fricativesAffricates = ('f','v','T','D','s','z','S','Z','C','J','h') #fricatives and affricates organized by place
oralStops = ('p','b','t','d','k','g') #oral stops by place
nasalStops = ('m','n','G','M','N') #non-syllabic nasal stops by place + syllabic nasals by place
liquidsGlides = ('w','y','r','l','L') #liquids and glides in no particular order
consonants = set(fricativesAffricates) | set(oralStops) | set(nasalStops) | set(liquidsGlides)

#preliminary sorting of monophthongs by height + diphthongs
lowVowels = ('@','c','a')
midVowels = ('x','^','X','R','E')
highVowels = ('i','I','|','U','u')
diphthongs = ('W','o','e','O','Y')
vowels = set(lowVowels) | set(midVowels) | set(highVowels) | set(diphthongs)

len(symbols)
len(consonants | vowels)
symbols - (consonants | vowels)
(consonants | vowels) - symbols

45

45

set()

set()
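The checks above confirm that the groups jointly cover the inventory. They do not directly show that no symbol was placed in two groups. A sketch of that extra check, with the group contents copied from the cell above and a hypothetical helper `overlapping_groups`:

```python
from itertools import combinations

groups = {
    'fricativesAffricates': ('f', 'v', 'T', 'D', 's', 'z', 'S', 'Z', 'C', 'J', 'h'),
    'oralStops': ('p', 'b', 't', 'd', 'k', 'g'),
    'nasalStops': ('m', 'n', 'G', 'M', 'N'),
    'liquidsGlides': ('w', 'y', 'r', 'l', 'L'),
    'lowVowels': ('@', 'c', 'a'),
    'midVowels': ('x', '^', 'X', 'R', 'E'),
    'highVowels': ('i', 'I', '|', 'U', 'u'),
    'diphthongs': ('W', 'o', 'e', 'O', 'Y'),
}

def overlapping_groups(groups):
    """Return (name_a, name_b, shared symbols) for every pair of groups that overlap."""
    overlaps = []
    for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
        shared = set(a) & set(b)
        if shared:
            overlaps.append((name_a, name_b, shared))
    return overlaps

# An empty list means the groups are pairwise disjoint.
print(overlapping_groups(groups))
```

An equivalent shortcut is to compare the summed group sizes with the size of their union; the pairwise version has the advantage of naming the offending groups.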

In [15]:
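from random import choice, choices, sample

def nExampleWordsWith(s, n=10):
    # Sample without replacement so the same entry is never shown twice;
    # a symbol with fewer than n words returns all of them, in random order.
    words = wordsWithSymbol(s)
    return sample(words, k=min(n, len(words)))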

In [16]:
exampleMap = {s:nExampleWordsWith(s) for s in symbols}

In [17]:
mapping = dict()

In [18]:
def addMapping(newdicSymbol, IPAsymbol):
    """Record that newdicSymbol is written as IPAsymbol."""
    mapping[newdicSymbol] = IPAsymbol

### Fricatives and Affricates

In [19]:
fricativesAffricates

('f', 'v', 'T', 'D', 's', 'z', 'S', 'Z', 'C', 'J', 'h')

In [20]:
addMapping('f','f')
addMapping('v','v')
addMapping('s','s')
addMapping('z','z')
addMapping('h','h')

In [21]:
exampleMap['T'][:2] # θ

[OrderedDict([('Transcription', 'txlEpxTi'),
              ('stressInfoA', "_'__"),
              ('stressInfoB', 'S4'),
              ('Orthography', 'telepathy'),
              ('Frequency', '3'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'TrIft'),
              ('stressInfoA', "'"),
              ('stressInfoB', 'S1'),
              ('Orthography', 'thrift'),
              ('Frequency', '5'),
              ('PoSs', '(N)')])]

In [22]:
addMapping('T', 'θ')

In [23]:
exampleMap['D'][:2] # ð

[OrderedDict([('Transcription', 'DEr'),
              ('stressInfoA', "'"),
              ('stressInfoB', 'S1'),
              ('Orthography', 'their'),
              ('Frequency', '2668'),
              ('PoSs', '(AJ)')]),
 OrderedDict([('Transcription', 'loDli'),
              ('stressInfoA', "'_"),
              ('stressInfoB', 'S2'),
              ('Orthography', 'loathly'),
              ('Frequency', '0'),
              ('PoSs', '(AJ AV)')])]

In [24]:
addMapping('D', 'ð')

In [25]:
exampleMap['S'][:2] # ʃ

[OrderedDict([('Transcription', 'xJudIkeS|n'),
              ('stressInfoA', "_`_'_"),
              ('stressInfoB', 'S5'),
              ('Orthography', 'adjudication'),
              ('Frequency', '4'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'wcSX'),
              ('stressInfoA', "'_"),
              ('stressInfoB', 'S2'),
              ('Orthography', 'washer'),
              ('Frequency', '2'),
              ('PoSs', '(N)')])]

In [26]:
addMapping('S', 'ʃ')

In [27]:
exampleMap['Z'][:2] # ʒ

[OrderedDict([('Transcription', 'ImRZ|n'),
              ('stressInfoA', "_'_"),
              ('stressInfoB', 'S3'),
              ('Orthography', 'immersion'),
              ('Frequency', '2'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'nEglxZe'),
              ('stressInfoA', "`_'"),
              ('stressInfoB', 'S3'),
              ('Orthography', 'neglige'),
              ('Frequency', '0'),
              ('PoSs', '(N)')])]

In [28]:
addMapping('Z', 'ʒ')

In [29]:
exampleMap['C'][:2] # tʃ

[OrderedDict([('Transcription', 'kWntXsIgnxCUr'),
              ('stressInfoA', "`_'_`"),
              ('stressInfoB', 'S5'),
              ('Orthography', 'countersignature'),
              ('Frequency', '0'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', '@nCovi'),
              ('stressInfoA', "'__"),
              ('stressInfoB', 'S3'),
              ('Orthography', 'anchovy'),
              ('Frequency', '1'),
              ('PoSs', '(N)')])]

In [30]:
addMapping('C', 'tʃ')

In [31]:
exampleMap['J'][:2] # dʒ

[OrderedDict([('Transcription', 'J^GkCX'),
              ('stressInfoA', "'_"),
              ('stressInfoB', 'S2'),
              ('Orthography', 'juncture'),
              ('Frequency', '4'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'taksIkxlaJIk'),
              ('stressInfoA', "`__'_"),
              ('stressInfoB', 'S5'),
              ('Orthography', 'toxicologic'),
              ('Frequency', '0'),
              ('PoSs', '(AJ)')])]

In [32]:
addMapping('J', 'dʒ')

### Oral stops

In [33]:
oralStops

('p', 'b', 't', 'd', 'k', 'g')

In [34]:
addMapping('p','p')
addMapping('b','b')
addMapping('t','t')
addMapping('d','d')
addMapping('k','k')
addMapping('g','g')

### Nasal stops

In [35]:
nasalStops

('m', 'n', 'G', 'M', 'N')

In [36]:
addMapping('m','m')
addMapping('n','n')

In [37]:
exampleMap['G'][:2] # ŋ

[OrderedDict([('Transcription', 'bundcglIG'),
              ('stressInfoA', "'`_"),
              ('stressInfoB', 'S3'),
              ('Orthography', 'boondoggling'),
              ('Frequency', '0'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'fIlIG'),
              ('stressInfoA', "'_"),
              ('stressInfoB', 'S2'),
              ('Orthography', 'filling'),
              ('Frequency', '37'),
              ('PoSs', '(N)')])]

In [38]:
addMapping('G', 'ŋ')

In [39]:
exampleMap['M'][:2] # ṃ

[OrderedDict([('Transcription', 'sedIzM'),
              ('stressInfoA', "'`_"),
              ('stressInfoB', 'S3'),
              ('Orthography', 'sadism'),
              ('Frequency', '3'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'pEsxmIzM'),
              ('stressInfoA', "'_`_"),
              ('stressInfoB', 'S4'),
              ('Orthography', 'pessimism'),
              ('Frequency', '6'),
              ('PoSs', '(N)')])]

In [40]:
addMapping('M','ṃ')

In [41]:
exampleMap['N'][:2] # ṇ

[OrderedDict([('Transcription', 'risNsi'),
              ('stressInfoA', "'__"),
              ('stressInfoB', 'S3'),
              ('Orthography', 'recency'),
              ('Frequency', '0'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'mRsNEri'),
              ('stressInfoA', "'_`_"),
              ('stressInfoB', 'S4'),
              ('Orthography', 'mercenary'),
              ('Frequency', '1'),
              ('PoSs', '(N AJ)')])]

In [42]:
addMapping('N','ṇ')
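The two syllabic nasals above are written with a dot below, whereas the syllabic lateral in the next subsection uses the IPA syllabicity mark, a vertical line below. The standard-library `unicodedata` module shows which code points a symbol is made of. The sketch below assumes the nasal symbols in the cells above are the dot-below forms; they are written as escapes so that nothing depends on how the page renders. `code_points` is a hypothetical helper.

```python
import unicodedata

def code_points(symbol):
    """List (code point, Unicode name) for each character after canonical decomposition."""
    decomposed = unicodedata.normalize('NFD', symbol)
    return [(f'U+{ord(ch):04X}', unicodedata.name(ch)) for ch in decomposed]

print(code_points('\u1e43'))   # m with dot below
print(code_points('m\u0329'))  # m followed by the IPA syllabicity mark
```

Normalizing to NFD first makes precomposed and combining spellings of the same symbol compare equal, which is worth doing before testing whether two transcriptions match.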

### Liquids and Glides

In [43]:
liquidsGlides

('w', 'y', 'r', 'l', 'L')

In [44]:
addMapping('w','w')
addMapping('y','j')
addMapping('r','ɹ')
addMapping('l','l')

In [45]:
exampleMap['L'][:2] # l̩

[OrderedDict([('Transcription', 'YdEntIkL'),
              ('stressInfoA', "_'__"),
              ('stressInfoB', 'S4'),
              ('Orthography', 'identical'),
              ('Frequency', '31'),
              ('PoSs', '(AJ)')]),
 OrderedDict([('Transcription', 'vokL'),
              ('stressInfoA', "'_"),
              ('stressInfoB', 'S2'),
              ('Orthography', 'vocal'),
              ('Frequency', '14'),
              ('PoSs', '(AJ N)')])]

In [46]:
addMapping('L', 'l̩')

### Low vowels

In [47]:
lowVowels

('@', 'c', 'a')

In [48]:
exampleMap['@'][:2] # æ

[OrderedDict([('Transcription', 'hEks@mxtX'),
              ('stressInfoA', "_'__"),
              ('stressInfoB', 'S4'),
              ('Orthography', 'hexameter'),
              ('Frequency', '3'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'm@l|t'),
              ('stressInfoA', "'_"),
              ('stressInfoB', 'S2'),
              ('Orthography', 'mallet'),
              ('Frequency', '0'),
              ('PoSs', '(N)')])]

In [49]:
addMapping('@', 'æ')

In [50]:
exampleMap['c'][:2] # ɔ

[OrderedDict([('Transcription', 'skrcl'),
              ('stressInfoA', "'"),
              ('stressInfoB', 'S1'),
              ('Orthography', 'scrawl'),
              ('Frequency', '0'),
              ('PoSs', '(VT VI)')]),
 OrderedDict([('Transcription', 'EkscrsYz'),
              ('stressInfoA', "'``"),
              ('stressInfoB', 'S3'),
              ('Orthography', 'exorcise'),
              ('Frequency', '1'),
              ('PoSs', '(VT)')])]

In [51]:
[ex for ex in wordsWithSymbol('c') if ex['Orthography'] == 'caught' or ex['Orthography'] == 'bought']

[OrderedDict([('Transcription', 'bct'),
              ('stressInfoA', "'"),
              ('stressInfoB', 'S1'),
              ('Orthography', 'bought'),
              ('Frequency', '57'),
              ('PoSs', '(AJ)')]),
 OrderedDict([('Transcription', 'kct'),
              ('stressInfoA', "'"),
              ('stressInfoB', 'S1'),
              ('Orthography', 'caught'),
              ('Frequency', '98'),
              ('PoSs', '(null)')])]

In [52]:
addMapping('c', 'ɔ')

In [53]:
exampleMap['a'][:2] # ɑ

[OrderedDict([('Transcription', 'pcliwag'),
              ('stressInfoA', "'_`"),
              ('stressInfoB', 'S3'),
              ('Orthography', 'pollywog'),
              ('Frequency', '0'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'kxlYdoskapIk'),
              ('stressInfoA', "_`_'_"),
              ('stressInfoB', 'S5'),
              ('Orthography', 'kaleidoscopic'),
              ('Frequency', '0'),
              ('PoSs', '(AJ)')])]

In [54]:
[ex for ex in wordsWithSymbol('a') if ex['Orthography'] == 'cot' or ex['Orthography'] == 'hot']

[OrderedDict([('Transcription', 'kat'),
              ('stressInfoA', "'"),
              ('stressInfoB', 'S1'),
              ('Orthography', 'cot'),
              ('Frequency', '0'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'hat'),
              ('stressInfoA', "'"),
              ('stressInfoB', 'S1'),
              ('Orthography', 'hot'),
              ('Frequency', '137'),
              ('PoSs', '(AJ AV)')])]

In [55]:
addMapping('a', 'ɑ')

### Mid vowels

In [56]:
midVowels

('x', '^', 'X', 'R', 'E')

In [57]:
exampleMap['x'][:5] # ə

[OrderedDict([('Transcription', 'lxpEl'),
              ('stressInfoA', "_'"),
              ('stressInfoB', 'S2'),
              ('Orthography', 'lapel'),
              ('Frequency', '1'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'k@ndxlabrx'),
              ('stressInfoA', "`_'_"),
              ('stressInfoB', 'S4'),
              ('Orthography', 'candelabra'),
              ('Frequency', '0'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'flcrxd'),
              ('stressInfoA', "'_"),
              ('stressInfoB', 'S2'),
              ('Orthography', 'florid'),
              ('Frequency', '2'),
              ('PoSs', '(AJ)')]),
 OrderedDict([('Transcription', 'mEtxlRJ|st'),
              ('stressInfoA', "'_`_"),
              ('stressInfoB', 'S4'),
              ('Orthography', 'metallurgist'),
              ('Frequency', '0'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'brIgxntin'),
              ('stressInfoA', "'_

In [58]:
addMapping('x', 'ə')

In [59]:
exampleMap['^'][:5] # ʌ

[OrderedDict([('Transcription', 's^frIJ|st'),
              ('stressInfoA', "'__"),
              ('stressInfoB', 'S3'),
              ('Orthography', 'suffragist'),
              ('Frequency', '0'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'Ink^lkeS|n'),
              ('stressInfoA', "``'_"),
              ('stressInfoB', 'S4'),
              ('Orthography', 'inculcation'),
              ('Frequency', '2'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'b^gxbu'),
              ('stressInfoA', "'_`"),
              ('stressInfoB', 'S3'),
              ('Orthography', 'bugaboo'),
              ('Frequency', '0'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'kr^mpL'),
              ('stressInfoA', "'_"),
              ('stressInfoB', 'S2'),
              ('Orthography', 'crumple'),
              ('Frequency', '0'),
              ('PoSs', '(VT VI N)')]),
 OrderedDict([('Transcription', 'fr^ktxfY'),
              ('stressInfoA

In [60]:
addMapping('^', 'ʌ')

In [61]:
exampleMap['X'][:2] # ɚ

[OrderedDict([('Transcription', 'hYdramxtX'),
              ('stressInfoA', "_'__"),
              ('stressInfoB', 'S4'),
              ('Orthography', 'hydrometer'),
              ('Frequency', '0'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'wIdoX'),
              ('stressInfoA', "'__"),
              ('stressInfoB', 'S3'),
              ('Orthography', 'widower'),
              ('Frequency', '1'),
              ('PoSs', '(N)')])]

In [62]:
addMapping('X', 'ɚ')

In [63]:
exampleMap['R'][:2] # ɚ

[OrderedDict([('Transcription', 'mesenyRz'),
              ('stressInfoA', "``'"),
              ('stressInfoB', 'S3'),
              ('Orthography', 'messeigneurs'),
              ('Frequency', '0'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'xbzRvxtori'),
              ('stressInfoA', "_'_`_"),
              ('stressInfoB', 'S5'),
              ('Orthography', 'observatory'),
              ('Frequency', '3'),
              ('PoSs', '(N)')])]

In [64]:
addMapping('R', 'ɚ')

In [65]:
exampleMap['E'][:2] # ɛ

[OrderedDict([('Transcription', 'prEtsL'),
              ('stressInfoA', "'_"),
              ('stressInfoB', 'S2'),
              ('Orthography', 'pretzel'),
              ('Frequency', '0'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'bElt'),
              ('stressInfoA', "'"),
              ('stressInfoB', 'S1'),
              ('Orthography', 'belt'),
              ('Frequency', '31'),
              ('PoSs', '(AJ VT VI N)')])]

In [66]:
addMapping('E', 'ɛ')

### High vowels

In [67]:
highVowels

('i', 'I', '|', 'U', 'u')

In [68]:
exampleMap['i'][:2] # i

[OrderedDict([('Transcription', 'kamxdi'),
              ('stressInfoA', "'__"),
              ('stressInfoB', 'S3'),
              ('Orthography', 'comedy'),
              ('Frequency', '39'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'wiv'),
              ('stressInfoA', "'"),
              ('stressInfoB', 'S1'),
              ('Orthography', 'weave'),
              ('Frequency', '4'),
              ('PoSs', '(VT N VI)')])]

In [69]:
addMapping('i', 'i')

In [70]:
exampleMap['I'][:2] # ɪ

[OrderedDict([('Transcription', 'hRbIvxrxsli'),
              ('stressInfoA', "`'___"),
              ('stressInfoB', 'S5'),
              ('Orthography', 'herbivorously'),
              ('Frequency', '0'),
              ('PoSs', '(AV)')]),
 OrderedDict([('Transcription', 'IveZ|n'),
              ('stressInfoA', "_'_"),
              ('stressInfoB', 'S3'),
              ('Orthography', 'evasion'),
              ('Frequency', '1'),
              ('PoSs', '(N)')])]

In [71]:
addMapping('I', 'ɪ')

In [72]:
exampleMap['|'][:10] # ɪ

[OrderedDict([('Transcription', '@pt|tud'),
              ('stressInfoA', "'_`"),
              ('stressInfoB', 'S3'),
              ('Orthography', 'aptitude'),
              ('Frequency', '3'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'ElxkyuS|n'),
              ('stressInfoA', "`_'_"),
              ('stressInfoB', 'S4'),
              ('Orthography', 'elocution'),
              ('Frequency', '0'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'k^lXeS|n'),
              ('stressInfoA', "`_'_"),
              ('stressInfoB', 'S4'),
              ('Orthography', 'coloration'),
              ('Frequency', '2'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'mEdxkeS|n'),
              ('stressInfoA', "`_'_"),
              ('stressInfoB', 'S4'),
              ('Orthography', 'medication'),
              ('Frequency', '2'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'rot|grxvyUr'),
              ('stressI

In [73]:
addMapping('|', 'ɪ')

In [74]:
exampleMap['U'][:5] # ʊ

[OrderedDict([('Transcription', 'spIrICUxs'),
              ('stressInfoA', "'___"),
              ('stressInfoB', 'S4'),
              ('Orthography', 'spirituous'),
              ('Frequency', '0'),
              ('PoSs', '(AJ)')]),
 OrderedDict([('Transcription', 'myUnIfxsNt'),
              ('stressInfoA', "_'__"),
              ('stressInfoB', 'S4'),
              ('Orthography', 'munificent'),
              ('Frequency', '0'),
              ('PoSs', '(AJ)')]),
 OrderedDict([('Transcription', '^nbUzM'),
              ('stressInfoA', "`'_"),
              ('stressInfoB', 'S3'),
              ('Orthography', 'unbosom'),
              ('Frequency', '0'),
              ('PoSs', '(VT VI)')]),
 OrderedDict([('Transcription', 'wUdmxn'),
              ('stressInfoA', "'_"),
              ('stressInfoB', 'S2'),
              ('Orthography', 'woodman'),
              ('Frequency', '0'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'pyUrLxnt'),
              ('stressInfoA

In [75]:
addMapping('U', 'ʊ')

In [76]:
exampleMap['u'][:5] # u

[OrderedDict([('Transcription', 'protrud'),
              ('stressInfoA', "_'"),
              ('stressInfoB', 'S2'),
              ('Orthography', 'protrude'),
              ('Frequency', '1'),
              ('PoSs', '(VT VI)')]),
 OrderedDict([('Transcription', 'krud'),
              ('stressInfoA', "'"),
              ('stressInfoB', 'S1'),
              ('Orthography', 'crude'),
              ('Frequency', '15'),
              ('PoSs', '(AJ N)')]),
 OrderedDict([('Transcription', 'Sut'),
              ('stressInfoA', "'"),
              ('stressInfoB', 'S1'),
              ('Orthography', 'chute'),
              ('Frequency', '2'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'spul'),
              ('stressInfoA', "'"),
              ('stressInfoB', 'S1'),
              ('Orthography', 'spool'),
              ('Frequency', '0'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'ruZ'),
              ('stressInfoA', "'"),
              ('stressInfo

In [77]:
addMapping('u', 'u')

### Diphthongs

In [78]:
diphthongs

('W', 'o', 'e', 'O', 'Y')

In [79]:
exampleMap['W'][:5] # aʊ

[OrderedDict([('Transcription', 'trWns'),
              ('stressInfoA', "'"),
              ('stressInfoB', 'S1'),
              ('Orthography', 'trounce'),
              ('Frequency', '0'),
              ('PoSs', '(VT)')]),
 OrderedDict([('Transcription', 'Wtdorz'),
              ('stressInfoA', "_'"),
              ('stressInfoB', 'S2'),
              ('Orthography', 'outdoors'),
              ('Frequency', '6'),
              ('PoSs', '(AV N)')]),
 OrderedDict([('Transcription', 'flWndX'),
              ('stressInfoA', "'_"),
              ('stressInfoB', 'S2'),
              ('Orthography', 'flounder'),
              ('Frequency', '1'),
              ('PoSs', '(N VI)')]),
 OrderedDict([('Transcription', 'prxnWns'),
              ('stressInfoA', "_'"),
              ('stressInfoB', 'S2'),
              ('Orthography', 'pronounce'),
              ('Frequency', '2'),
              ('PoSs', '(VT VI)')]),
 OrderedDict([('Transcription', 'h@ndWt'),
              ('stressInfoA', "'`"),
  

In [80]:
addMapping('W', 'aʊ')

In [81]:
exampleMap['o'][:10] # oʊ

[OrderedDict([('Transcription', 'xlEktrod'),
              ('stressInfoA', "_'`"),
              ('stressInfoB', 'S3'),
              ('Orthography', 'electrode'),
              ('Frequency', '5'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'kop'),
              ('stressInfoA', "'"),
              ('stressInfoB', 'S1'),
              ('Orthography', 'cope'),
              ('Frequency', '21'),
              ('PoSs', '(N VI VT)')]),
 OrderedDict([('Transcription', 'sIloEt'),
              ('stressInfoA', "`_'"),
              ('stressInfoB', 'S3'),
              ('Orthography', 'silhouette'),
              ('Frequency', '4'),
              ('PoSs', '(N VT)')]),
 OrderedDict([('Transcription', 'gord'),
              ('stressInfoA', "'"),
              ('stressInfoB', 'S1'),
              ('Orthography', 'gourd'),
              ('Frequency', '3'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'dIf@mxtori'),
              ('stressInfoA', "_'_`_"),
  

In [82]:
addMapping('o', 'oʊ')

In [83]:
exampleMap['e'][:5] # eɪ

[OrderedDict([('Transcription', '^nCen'),
              ('stressInfoA', "`'"),
              ('stressInfoB', 'S2'),
              ('Orthography', 'unchain'),
              ('Frequency', '0'),
              ('PoSs', '(VT)')]),
 OrderedDict([('Transcription', 'ret'),
              ('stressInfoA', "'"),
              ('stressInfoB', 'S1'),
              ('Orthography', 'rate'),
              ('Frequency', '222'),
              ('PoSs', '(N VT VI)')]),
 OrderedDict([('Transcription', 'CesX'),
              ('stressInfoA', "'_"),
              ('stressInfoB', 'S2'),
              ('Orthography', 'chaser'),
              ('Frequency', '0'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'medxmwxzEl'),
              ('stressInfoA', "`__'"),
              ('stressInfoB', 'S4'),
              ('Orthography', 'mesdemoiselles'),
              ('Frequency', '0'),
              ('PoSs', '(null)')]),
 OrderedDict([('Transcription', 'Ink^lket'),
              ('stressInfoA', "_'`"),

In [84]:
addMapping('e', 'eɪ')

In [85]:
exampleMap['O'][:5] # ɔɪ

[OrderedDict([('Transcription', 'EnvO'),
              ('stressInfoA', "'`"),
              ('stressInfoB', 'S2'),
              ('Orthography', 'envoy'),
              ('Frequency', '0'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'sIkwOx'),
              ('stressInfoA', "_'_"),
              ('stressInfoB', 'S3'),
              ('Orthography', 'sequoia'),
              ('Frequency', '2'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'OstX'),
              ('stressInfoA', "'_"),
              ('stressInfoB', 'S2'),
              ('Orthography', 'oyster'),
              ('Frequency', '6'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'zoOd'),
              ('stressInfoA', "'`"),
              ('stressInfoB', 'S2'),
              ('Orthography', 'zooid'),
              ('Frequency', '0'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'p@rxtYfOd'),
              ('stressInfoA', "`_'`"),
              ('stres

In [86]:
addMapping('O', 'ɔɪ')

In [87]:
exampleMap['Y'][:5] # aɪ

[OrderedDict([('Transcription', 'rEk|nsYl'),
              ('stressInfoA', "'_`"),
              ('stressInfoB', 'S3'),
              ('Orthography', 'reconcile'),
              ('Frequency', '4'),
              ('PoSs', '(VT)')]),
 OrderedDict([('Transcription', 'sprIGtYm'),
              ('stressInfoA', "'`"),
              ('stressInfoB', 'S2'),
              ('Orthography', 'springtime'),
              ('Frequency', '4'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'JRmxsYdL'),
              ('stressInfoA', "`_'_"),
              ('stressInfoB', 'S4'),
              ('Orthography', 'germicidal'),
              ('Frequency', '0'),
              ('PoSs', '(AJ)')]),
 OrderedDict([('Transcription', 'dYri'),
              ('stressInfoA', "'_"),
              ('stressInfoB', 'S2'),
              ('Orthography', 'diary'),
              ('Frequency', '4'),
              ('PoSs', '(N)')]),
 OrderedDict([('Transcription', 'bYagrxfi'),
              ('stressInfoA', "_'__"

In [88]:
addMapping('Y', 'aɪ')

### Altogether

In [89]:
mapping

{'f': 'f',
 'v': 'v',
 's': 's',
 'z': 'z',
 'h': 'h',
 'T': 'θ',
 'D': 'ð',
 'S': 'ʃ',
 'Z': 'ʒ',
 'C': 'tʃ',
 'J': 'dʒ',
 'p': 'p',
 'b': 'b',
 't': 't',
 'd': 'd',
 'k': 'k',
 'g': 'g',
 'm': 'm',
 'n': 'n',
 'G': 'ŋ',
 'M': 'ṃ',
 'N': 'ṇ',
 'w': 'w',
 'y': 'j',
 'r': 'ɹ',
 'l': 'l',
 'L': 'l̩',
 '@': 'æ',
 'c': 'ɔ',
 'a': 'ɑ',
 'x': 'ə',
 '^': 'ʌ',
 'X': 'ɚ',
 'R': 'ɚ',
 'E': 'ɛ',
 'i': 'i',
 'I': 'ɪ',
 '|': 'ɪ',
 'U': 'ʊ',
 'u': 'u',
 'W': 'aʊ',
 'o': 'oʊ',
 'e': 'eɪ',
 'O': 'ɔɪ',
 'Y': 'aɪ'}

## Define newdic inventory -> IPA mapping

In [90]:
newdic_IPA_relation = [
 ('h', 'h'),
 ('S', 'ʃ'),
 ('p', 'p'),
 ('x', 'ə'),
 ('m', 'm'),
 ('y', 'j'),
 ('v', 'v'),
 ('^', 'ʌ'),
 ('o', 'oʊ'),
 ('u', 'u'),
 ('I', 'ɪ'),
 ('G', 'ŋ'),
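 ('N', 'ṇ'), #NB: dot below; the IPA syllabicity diacritic is the vertical line below (U+0329)
 ('|', 'ɪ'), #as near as I can tell, this is unstressed/reduced 'ɪ'
 ('Z', 'ʒ'),
 ('L', 'l̩'),
 ('M', 'ṃ'), #NB: dot below, as with 'N'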
 ('i', 'i'),
 ('r', 'ɹ'),
 ('g', 'g'),
 ('O', 'ɔɪ'),
 ('T', 'θ'),
 ('n', 'n'),
 ('J', 'dʒ'),
 ('d', 'd'),
 ('k', 'k'),
 ('W', 'aʊ'),
 ('f', 'f'),
 ('D', 'ð'),
 ('U', 'ʊ'),
 ('z', 'z'),
 ('Y', 'aɪ'),
 ('b', 'b'),
 ('X', 'ɚ'), #r-colored schwa - UNstressed (e.g. 'CesX' "chaser", stress pattern '_)
 ('a', 'ɑ'),
 ('s', 's'),
 ('e', 'eɪ'),
 ('C', 'tʃ'),
 ('t', 't'),
 ('R', 'ɚ'), #r-colored schwa - stressed (e.g. 'JRmxsYdL' "germicidal", stress pattern `_'_)
 ('E', 'ɛ'),
 ('w', 'w'),
 ('l', 'l'),
 ('@', 'æ'),
 ('c', 'ɔ')]

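# Build the lookup table once rather than on every call.
newdic_IPA_mapping = dict(newdic_IPA_relation)

def newdicToUnicodeIPA(newdic_symbol):
    '''Map a single newdic symbol to its Unicode IPA equivalent.'''
    return newdic_IPA_mapping[newdic_symbol]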

print(newdicToUnicodeIPA('T'))

# def invertMapping(mydict): 
#     return dict([[val, key] for key,val in mydict.items()])

# def unicodeIPAToNewdic(unicodeIPAsymbol):
#     return invertMapping( dict(newdic_IPA_relation) )[unicodeIPAsymbol]

# print(unicodeIPAToNewdic('θ'))

θ


In [91]:
for s in mapping:
    ipa = newdicToUnicodeIPA(s)
    assert mapping[s] == ipa, "mapping[{0}] ≠ newdicToUnicodeIPA({0}): '{1}' vs. '{2}'".format(s, mapping[s], ipa)
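
The relation above is many-to-one: `X` and `R` both map to `ɚ`, and `I` and `|` both map to `ɪ`. That is presumably why a simple dictionary inversion (as in the commented-out `invertMapping`) would silently drop one source symbol per collision. A minimal sketch for listing such collisions, using a hand-picked excerpt of the relation and a hypothetical helper name `find_collisions`:

```python
from collections import defaultdict

# Excerpt of the newdic -> IPA pairs defined above (not the full inventory).
relation_excerpt = [
    ('x', 'ə'), ('X', 'ɚ'), ('R', 'ɚ'),
    ('I', 'ɪ'), ('|', 'ɪ'), ('T', 'θ'),
]

def find_collisions(pairs):
    """Group source symbols by target; keep targets with more than one source."""
    sources_by_target = defaultdict(list)
    for source, target in pairs:
        sources_by_target[target].append(source)
    return {t: s for t, s in sources_by_target.items() if len(s) > 1}

collisions = find_collisions(relation_excerpt)
print(collisions)  # {'ɚ': ['X', 'R'], 'ɪ': ['I', '|']}
```

Running the same helper over the full `newdic_IPA_relation` would show every IPA symbol that cannot be mapped back to a unique newdic symbol.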

## Transform transcriptions to IPA...

In [92]:
t = newdic_raw[2]['Transcription']
t
tuple(t)
tuple(map(newdicToUnicodeIPA, tuple(t)))
'.'.join(tuple(map(newdicToUnicodeIPA, tuple(t))))

'xb@k'

('x', 'b', '@', 'k')

('ə', 'b', 'æ', 'k')

'ə.b.æ.k'

In [93]:
def dottedStringToTuple(ds):
    return tuple(ds.split('.'))
def tupleToDottedString(t):
    return '.'.join(t)

In [94]:
tupleToDottedString( tuple(map(newdicToUnicodeIPA, 
                               tuple(t))) )

'ə.b.æ.k'

In [95]:
def newdicTranscriptionToIPA(newdic_trn):
    nt_tup = tuple(newdic_trn)
    ipa_tup = tuple(map(newdicToUnicodeIPA, nt_tup))
    return tupleToDottedString(ipa_tup)
newdicTranscriptionToIPA(t)

'ə.b.æ.k'
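
The notebook does not say why segments are joined with `.`, but one plausible reading is that several IPA targets (`eɪ`, `tʃ`, `aʊ`, ...) span more than one character, so an undelimited string would no longer have one character per segment. A small sketch under that assumption, with a three-symbol excerpt of the mapping:

```python
# Hypothetical three-symbol excerpt of the newdic -> IPA mapping.
excerpt = {'r': 'ɹ', 'e': 'eɪ', 't': 't'}

# newdic 'ret' is the entry for "rate".
segments = tuple(excerpt[s] for s in 'ret')
dotted = '.'.join(segments)
undelimited = ''.join(segments)

print(dotted)            # ɹ.eɪ.t
print(len(undelimited))  # 4 characters, although there are only 3 segments

# The dotted form can be split back into exactly the original segments.
assert tuple(dotted.split('.')) == segments
```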

## Transform entries to have IPA transcriptions...

In [96]:
def edit_dict(the_dict, the_key, the_new_value):
    '''
    Composable (because it returns a value) but stateful (i.e. in-place) dictionary update.
    '''
    the_dict.update({the_key: the_new_value})
    return the_dict

def modify_dict(the_dict, the_key, the_new_value):
    '''
    Composable, non-mutating (if naively implemented) dictionary update.
    Returns a shallow copy as a plain dict, even when the_dict is an OrderedDict.
    '''
    new_dict = dict(the_dict)
    new_dict[the_key] = the_new_value
    return new_dict
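
To make the difference between the two helpers concrete, here is a self-contained sketch. The function bodies are condensed restatements of the definitions above, and the two-field `entry` is a made-up stand-in for a real newdic record:

```python
def edit_dict(the_dict, the_key, the_new_value):
    # In-place update; returns the same object it was given.
    the_dict.update({the_key: the_new_value})
    return the_dict

def modify_dict(the_dict, the_key, the_new_value):
    # Shallow copy first, so the caller's dictionary is left untouched.
    new_dict = dict(the_dict)
    new_dict[the_key] = the_new_value
    return new_dict

entry = {'Transcription': 'xb@k', 'Orthography': 'aback'}

copied = modify_dict(entry, 'Transcription', 'ə.b.æ.k')
assert copied is not entry
assert entry['Transcription'] == 'xb@k'      # original unchanged

same = edit_dict(entry, 'Transcription', 'ə.b.æ.k')
assert same is entry
assert entry['Transcription'] == 'ə.b.æ.k'   # original overwritten
```

Because the copy is shallow, nested mutable values would still be shared; that does not matter here, since every field is a string.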

In [97]:
def IPAify_newdic_entry(entry):
    new_entry = modify_dict(entry, 'Transcription', newdicTranscriptionToIPA(entry['Transcription']))
    return new_entry

In [98]:
newdic_raw[2]

OrderedDict([('Transcription', 'xb@k'),
             ('stressInfoA', "_'"),
             ('stressInfoB', 'S2'),
             ('Orthography', 'aback'),
             ('Frequency', '2'),
             ('PoSs', '(AV)')])

In [99]:
IPAify_newdic_entry(newdic_raw[2])
newdic_raw[2]

{'Transcription': 'ə.b.æ.k',
 'stressInfoA': "_'",
 'stressInfoB': 'S2',
 'Orthography': 'aback',
 'Frequency': '2',
 'PoSs': '(AV)'}

OrderedDict([('Transcription', 'xb@k'),
             ('stressInfoA', "_'"),
             ('stressInfoB', 'S2'),
             ('Orthography', 'aback'),
             ('Frequency', '2'),
             ('PoSs', '(AV)')])

In [100]:
newdic_IPA = list(map(IPAify_newdic_entry, newdic_raw))
newdic_IPA[2]

{'Transcription': 'ə.b.æ.k',
 'stressInfoA': "_'",
 'stressInfoB': 'S2',
 'Orthography': 'aback',
 'Frequency': '2',
 'PoSs': '(AV)'}
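
Mapping every entry this way raises a `KeyError` on the first symbol that has no IPA equivalent. A hedged sketch of a pre-flight check that lists such symbols up front; `unmapped_symbols` is a hypothetical helper, and the mapping and transcriptions below are small stand-ins for the real data:

```python
def unmapped_symbols(transcriptions, symbol_mapping):
    """Return, sorted, the symbols used in transcriptions that lack a mapping."""
    seen = {symbol for trn in transcriptions for symbol in trn}
    return sorted(seen - set(symbol_mapping))

# Deliberately incomplete excerpt of the mapping, for illustration only.
partial_mapping = {'x': 'ə', 'b': 'b', '@': 'æ', 'k': 'k'}

missing = unmapped_symbols(['xb@k', 'xb@ft'], partial_mapping)
print(missing)  # ['f', 't']
```

An empty result for the full inventory would mean that the conversion can run over all of `newdic_raw` without a lookup failure.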

## Write to file / read back in

In [101]:
%pwd

'/mnt/cube/home/AD/emeinhar/newdic-nettalk'

In [102]:
%ls *newdic*

'Converting Hammond'\''s newdic (NETtalk) to Unicode IPA symbols.ipynb'
 newdic.txt


In [103]:
import csv

In [104]:
newdic_IPA[0]

{'Transcription': 'ə',
 'stressInfoA': '_',
 'stressInfoB': 'S1',
 'Orthography': 'a',
 'Frequency': '23178',
 'PoSs': '(N IA VB PP)'}

In [105]:
theFieldnames = ['Transcription', 'stressInfoA', 'stressInfoB', 'Orthography', 'Frequency', 'PoSs']
len(theFieldnames)
theFieldnames

6

['Transcription',
 'stressInfoA',
 'stressInfoB',
 'Orthography',
 'Frequency',
 'PoSs']

In [106]:
newdic_IPA_stem = 'newdic_IPA'

In [107]:
with open(newdic_IPA_stem + '.tsv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, delimiter='\t', fieldnames=theFieldnames)
    writer.writeheader()
    writer.writerows(newdic_IPA)
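
A round-trip check is one way to confirm that the IPA characters survive the trip through the file. The sketch below writes two made-up rows to a temporary file (the file name `roundtrip_demo.tsv` and the reduced field list are assumptions for illustration) and reads them back with `csv.DictReader`, using the same delimiter and encoding as the cell above:

```python
import csv
import os
import tempfile

fieldnames = ['Transcription', 'Orthography', 'Frequency']
rows = [
    {'Transcription': 'ə.b.æ.k', 'Orthography': 'aback', 'Frequency': '2'},
    {'Transcription': 'ə.b.eɪ.t', 'Orthography': 'abate', 'Frequency': '0'},
]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'roundtrip_demo.tsv')  # hypothetical file name
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, delimiter='\t', fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    with open(path, 'r', newline='', encoding='utf-8') as f:
        # dict(row) normalizes the row type across Python versions.
        read_back = [dict(row) for row in csv.DictReader(f, delimiter='\t')]

assert read_back == rows
```

Every value comes back as a string, so a field such as `Frequency` would need an explicit `int(...)` conversion if it is to be used numerically.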

In [108]:
%pwd

'/mnt/cube/home/AD/emeinhar/newdic-nettalk'

In [109]:
%ls newdic*

newdic_IPA.tsv  newdic.txt


In [110]:
%cat -n newdic_IPA.tsv | head -60

     1	Transcription	stressInfoA	stressInfoB	Orthography	Frequency	PoSs
     2	ə	_	S1	a	23178	(N IA VB PP)
     3	ɑ.ɹ.d.v.ɑ.ɹ.k	'`	S2	aardvark	0	(N)
     4	ə.b.æ.k	_'	S2	aback	2	(AV)
     5	æ.b.ə.k.ə.s	'__	S3	abacus	0	(N)
     6	ə.b.æ.f.t	_'	S2	abaft	0	(AV PP)
     7	æ.b.ə.l.oʊ.n.i	`_'_	S4	abalone	0	(N)
     8	ə.b.æ.n.d.ɪ.n	_'_	S3	abandon	17	(VT N)
     9	ə.b.eɪ.s	_'	S2	abase	0	(VT)
    10	ə.b.æ.ʃ	_'	S2	abash	0	(VT)
    11	ə.b.eɪ.t	_'	S2	abate	0	(VT VI)
    12	æ.b.ə.t.i	'_`	S3	abatis	0	(N)
    13	æ.b.ə.t.w.ɑ.ɹ	'_`	S3	abattoir	0	(N)
    14	æ.b.ə.s.i	'__	S3	abbacy	0	(N)
    15	æ.b.eɪ	_'	S2	abbe	3	(N)
    16	æ.b.ə.s	'_	S2	abbess	0	(N)
    17	æ.b.i	'_	S2	abbey	7	(N)
    18	æ.b.ə.t	'_	S2	abbot	2	(N)
    19	ə.b.ɹ.i.v.i.eɪ.t	_'_`	S4	abbreviate	0	(VT)
    20	ə.b.ɹ.i.v.i.eɪ.ʃ.ɪ.n	_`_'_	S5	abbreviation	1	(N)
    21	æ.b.d.ɪ.k.eɪ.t	'_`	S3	abdicate	0	(VT VI)
    22	æ.b.d.ɪ.k.eɪ.ʃ.ɪ.n	`_'_	S4	abdication	0	(N)
    23	æ.b.d.ə.m.ə.n	'__	S3	abdomen	6	(N)
   