One use case for regular expressions in computational linguistics is querying a lexicon. To explore this use case, we will use the lexicon of word forms found in the [CMU Pronouncing Dictionary](https://web.archive.org/web/20210216211828/http://www.speech.cs.cmu.edu/cgi-bin/cmudict).

In [2]:
!wget https://raw.githubusercontent.com/Alexir/CMUdict/master/cmudict-0.7b

--2024-01-31 12:13:47--  https://raw.githubusercontent.com/Alexir/CMUdict/master/cmudict-0.7b
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3865710 (3.7M) [text/plain]
Saving to: ‘cmudict-0.7b.1’


2024-01-31 12:13:48 (14.0 MB/s) - ‘cmudict-0.7b.1’ saved [3865710/3865710]



In [3]:
!head -n 200 cmudict-0.7b

;;; # CMUdict  --  Major Version: 0.07
;;; 
;;; # $HeadURL: https://svn.code.sf.net/p/cmusphinx/code/branches/cmudict/cmudict-0.7b $
;;; # $Date:: 2015-07-15 12:34:30 -0400 (Wed, 15 Jul 2015)      $:
;;; # $Id:: cmudict-0.7b 13083 2015-07-15 16:34:30Z air         $:
;;; # $Rev:: 13083                                              $: 
;;; # $Author:: air                                             $:
;;;
;;; #
;;; # Copyright (C) 1993-2015 Carnegie Mellon University. All rights reserved.
;;; #
;;; # Redistribution and use in source and binary forms, with or without
;;; # modification, are permitted provided that the following conditions
;;; # are met:
;;; #
;;; # 1. Redistributions of source code must retain the above copyright
;;; #    notice, this list of conditions and the following disclaimer.
;;; #    The contents of this file are deemed to be source code.
;;; #
;;; # 2. Redistributions in binary form must reproduce the above copyright
;;; #    notice, this list of conditions and th

In [4]:
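# The file is Latin-1 encoded; lines starting with ";;;" are comments.
with open('cmudict-0.7b', encoding="ISO-8859-1") as f:
    lines = [line.split() for line in f
             if line.strip() and not line.startswith(";;;")]

# Map each lowercased word form to its list of ARPABET symbols.
cmudict = {fields[0].lower(): fields[1:] for fields in lines}

cmudict["abstraction"]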

['AE0', 'B', 'S', 'T', 'R', 'AE1', 'K', 'SH', 'AH0', 'N']

This dictionary transcribes words in what's known as the [ARPABET](https://en.wikipedia.org/wiki/ARPABET) and marks stress with a digit appended to each vowel symbol: 0 for no stress, 1 for primary stress, and 2 for secondary stress. With the stress digits removed, the ARPABET symbols map to more recognizable IPA representations in the following way.

In [5]:
arpabet_to_phoneme = {'AA': 'ɑ', 
                      'AE': 'æ', 
                      'AH': 'ə', 
                      'AO': 'ɔ', 
                      'AW': 'aʊ', 
                      'AY': 'aɪ', 
                      'B': 'b', 
                      'CH': 'tʃ', 
                      'D': 'd', 
                      'DH': 'ð', 
                      'EH': 'ɛ',
                      'ER': 'ɝ', 
                      'EY': 'eɪ', 
                      'F': 'f', 
                      'G': 'g', 
                      'HH': 'h', 
                      'IH': 'ɪ', 
                      'IY': 'i', 
                      'JH': 'dʒ', 
                      'K': 'k', 
                      'L': 'l', 
                      'M': 'm', 
                      'N': 'n',
                      'NG': 'ŋ', 
                      'OW': 'oʊ', 
                      'OY': 'ɔɪ', 
                      'P': 'p', 
                      'R': 'ɹ', 
                      'S': 's', 
                      'SH': 'ʃ', 
                      'T': 't', 
                      'TH': 'θ', 
                      'UH': 'ʊ', 
                      'UW': 'u', 
                      'V': 'v',
                      'W': 'w', 
                      'Y': 'j', 
                      'Z': 'z', 
                      'ZH': 'ʒ'}
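
The stress digit can be separated from a symbol with a small regular expression. The following is a minimal sketch, not part of the original notebook; the helper name `split_stress` is hypothetical.

```python
import re
from typing import Optional

# Hypothetical helper: split an ARPABET symbol such as 'AE1' into its
# base symbol and an optional stress digit (0, 1, or 2).
ARPABET_SYMBOL = re.compile(r"([A-Z]+)([012])?")

def split_stress(symbol: str) -> tuple[str, Optional[int]]:
    match = ARPABET_SYMBOL.fullmatch(symbol)
    if match is None:
        raise ValueError(f"not an ARPABET symbol: {symbol!r}")
    base, stress = match.groups()
    return base, (int(stress) if stress is not None else None)

print(split_stress("AE1"))  # ('AE', 1)
print(split_stress("SH"))   # ('SH', None)
```

Consonants carry no digit, so the second element is `None` for them.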

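We'll use this mapping to construct the IPA representation from the ARPABET representation, stripping the stress digit from each symbol first. We keep only entries whose word form begins with a letter, contains more than one distinct character, and has more than one phoneme.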

In [6]:
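import re

# Strip the stress digit from each ARPABET symbol and map it to IPA.
# Note: '[A-z]' would also match '[', '\', ']', '^', '_' and '`',
# so the letter ranges are spelled out explicitly.
entries: dict[str, list[str]] = {
    w: [arpabet_to_phoneme[''.join(re.findall(r'[A-Za-z]', phoneme))]
        for phoneme in transcription]
    for w, transcription in cmudict.items()
    if len(set(w)) > 1              # more than one distinct character
    and len(transcription) > 1      # more than one phoneme
    and re.match(r'[A-Za-z]', w)    # word form starts with a letter
}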

For instance, the IPA representation for the word *abstraction* can be accessed in the following way.

In [7]:
entries["abstraction"]

['æ', 'b', 's', 't', 'ɹ', 'æ', 'k', 'ʃ', 'ə', 'n']
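
With transcriptions stored as lists of phonemes, one way to query the lexicon is to join each list into a space-separated string and search it with a regular expression. The sketch below is illustrative only: `toy_entries` is a hand-written stand-in for `entries`, and `search_lexicon` is a hypothetical helper.

```python
import re

# Toy stand-in for `entries`; these transcriptions are illustrative
# assumptions, not read from the CMUdict file.
toy_entries = {
    "cat": ["k", "æ", "t"],
    "ship": ["ʃ", "ɪ", "p"],
    "shack": ["ʃ", "æ", "k"],
    "dish": ["d", "ɪ", "ʃ"],
}

def search_lexicon(pattern: str, lexicon: dict[str, list[str]]) -> list[str]:
    """Return the words whose space-joined transcription matches `pattern`."""
    regex = re.compile(pattern)
    return [w for w, ipa in lexicon.items() if regex.search(" ".join(ipa))]

# Words whose transcription begins with ʃ
print(search_lexicon(r"^ʃ", toy_entries))      # ['ship', 'shack']
# Words ending in æ followed by exactly one more phoneme
print(search_lexicon(r"æ \S+$", toy_entries))  # ['cat', 'shack']
```

Joining with spaces keeps phoneme boundaries visible to the pattern, which matters because some phonemes, such as the diphthongs, are written with more than one character.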

In [9]:
#| code-fold: true
#| code-summary: Dump IPA representation to file

with open("cmudict-ipa", "w", encoding="utf-8") as f:  # IPA symbols need a Unicode encoding
    for w, ipa in entries.items():
        ipa = " ".join(ipa)
        f.write(f"{w},{ipa}\n")
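
To reuse the dump later, the file can be read back into a dictionary. This is a minimal sketch, assuming the `word,ipa` line format written above and a UTF-8 encoded file. The helper `load_ipa_lexicon` is hypothetical, and the demo writes a small temporary file rather than touching `cmudict-ipa`.

```python
import os
import tempfile

def load_ipa_lexicon(path: str) -> dict[str, list[str]]:
    """Read `word,ipa` lines back into a word -> phoneme-list mapping."""
    lexicon = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Split on the last comma: the transcription never contains one.
            word, ipa = line.rstrip("\n").rsplit(",", 1)
            lexicon[word] = ipa.split(" ")
    return lexicon

# Demo on a temporary file with two illustrative entries.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo-ipa")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("cat,k æ t\nship,ʃ ɪ p\n")
    demo = load_ipa_lexicon(demo_path)

print(demo["ship"])  # ['ʃ', 'ɪ', 'p']
```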