# Some applications

Here are a few of the experimental applications of phonetic similarity vectors described in the paper: vector arithmetic, analogies, and sound symbolism tinting.

In [7]:
from annoy import AnnoyIndex
import numpy as np

I use spaCy for word probabilities. A "reverse lookup" based on phonetic similarity yields *all* of the homophones for a given transcription, and the first one returned is sometimes an obscure word that almost no one uses. To keep the output readable, I sort the words that share a pronunciation by their unigram probability and use the most probable one.
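
As a minimal sketch of that selection step, assume a hypothetical `probs` dictionary of log probabilities standing in for spaCy's vocabulary. The numbers are made up for illustration, and `most_probable` is a name introduced here, not part of the notebook:

```python
# Hypothetical log probabilities; in the notebook these come from spaCy.
probs = {"rode": -11.2, "road": -8.9, "rowed": -13.5}

def most_probable(homophones, probs, default=-20.0):
    """Return the homophone with the highest (log) probability."""
    # Words missing from the table fall back to a low default score.
    return max(homophones, key=lambda w: probs.get(w, default))

print(most_probable(["rowed", "rode", "road"], probs))
```

Using `max` with a key avoids sorting the whole group when only the top word is needed.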

In [8]:
import spacy

In [4]:
nlp = spacy.load('en_core_web_md')

In this cell, I build the Annoy index from pre-calculated vectors. The `words` list contains all of the words, and the `lookup` dictionary maps each word to its index. Together they let me go from an item's index in the Annoy index to its word, and from a word to its index.

In [10]:
t = AnnoyIndex(50, metric="euclidean")
words = list()
lookup = dict()
for i, line in enumerate(open("cmudict-0.7b-simvecs", encoding="latin1")):
    word, vals_raw = line.split("  ")
    word = word.lower().strip("(012)")
    vals = np.array([float(x) for x in vals_raw.split(" ")])
    t.add_item(i, vals)
    words.append(word.lower())
    lookup[word.lower()] = i
t.build(100)

True
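
Judging by the parsing code above, each line of the vector file holds a word, two spaces, and then space-separated floats. The sketch below parses one such line on its own. The sample line is made up, `parse_line` is a hypothetical helper, and the regular expression is a swapped-in alternative to `strip("(012)")`: `str.strip` removes any of the listed characters from either end of the string, so a marker such as `(3)`, if the file contains one, would only be partly removed.

```python
import re

def parse_line(line):
    """Split one 'WORD  v1 v2 ...' line into (word, [floats])."""
    word, vals_raw = line.rstrip("\n").split("  ", 1)
    # Drop a trailing alternate-pronunciation marker such as "(1)".
    word = re.sub(r"\(\d+\)$", "", word.lower())
    vals = [float(x) for x in vals_raw.split(" ")]
    return word, vals

# Made-up sample line, not taken from the real file.
word, vals = parse_line("READ(1)  0.5 -1.25 3.0\n")
print(word, vals)
```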

The `nnslookup` function takes an AnnoyIndex, a spaCy instance, a list of words, and a vector, and returns the indices of the most similar words by sound. It collates homophones (items with identical vectors) into groups and keeps only the most common word from each group.

In [17]:
from collections import Counter
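def nnslookup(t, nlp, words, vec, n=10):
    """Return indices of the nearest words to `vec`, one per homophone group."""
    res = t.get_nns_by_vector(vec, n)
    # Collect runs of consecutive neighbors that share an identical vector.
    batches = []
    current_batch = []
    last_vec = None
    for item in res:
        if last_vec is None or t.get_item_vector(item) == last_vec:
            current_batch.append(item)
            last_vec = t.get_item_vector(item)
        else:
            # Note: the item that ends a run is not added to any batch, so the
            # function can return fewer groups than the neighbors contain.
            batches.append(current_batch[:])
            current_batch = []
            last_vec = None
    if len(current_batch) > 0:
        batches.append(current_batch[:])
    # From each batch, keep the word with the highest unigram probability.
    output = []
    for batch in batches:
        output.append(sorted(batch, key=lambda x: nlp.vocab[words[x]].prob, reverse=True)[0])
    return output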
[words[i] for i in nnslookup(t, nlp, words, t.get_item_vector(lookup["roads"]))]

['roads', 'loads']
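
The grouping step can also be written with `itertools.groupby`. This is a sketch of a variant, not the function above: it keeps every group of neighbors, so it would not necessarily reproduce the outputs shown in this notebook. The `vectors` and `probs` dictionaries are hypothetical stand-ins for the Annoy index and the spaCy probabilities.

```python
from itertools import groupby

def collapse_homophones(items, vectors, probs):
    """Group consecutive items with identical vectors; keep the most probable."""
    output = []
    for _, group in groupby(items, key=lambda i: tuple(vectors[i])):
        output.append(max(group, key=lambda i: probs[i]))
    return output

# Toy data: items 0 and 1 share a vector (homophones); item 2 differs.
vectors = {0: [1.0, 2.0], 1: [1.0, 2.0], 2: [3.0, 4.0]}
probs = {0: -12.0, 1: -9.0, 2: -10.0}
print(collapse_homophones([0, 1, 2], vectors, probs))  # [1, 2]
```

Like the original, this only merges homophones that appear next to each other in the neighbor list, which is the expected case when identical vectors sit at the same distance from the query.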

## Some arithmetic experiments

### Averaging the sound of a sentence

Find the word closest to the average of a sentence's phonetic similarity vectors:

In [74]:
sentences = [
    "I am sitting in a room different from the one you are in now",
    "Double double toil and trouble fire burn and cauldron bubble",
    "Peter Piper picked a peck of pickled peppers",
    "Four score and seven years ago our fathers brought forth on this continent a new nation"
]
for s in sentences:
    vecs = np.array([t.get_item_vector(lookup[w.lower()]) for w in s.split()])
    mean = vecs.mean(axis=0)
    print(s, "\n\t→", ', '.join([words[i] for i in nnslookup(t, nlp, words, mean, 10)]))

I am sitting in a room different from the one you are in now 
	→ anacin, ison, anna, enlightening, enigma
Double double toil and trouble fire burn and cauldron bubble 
	→ untenable, dependable, unbuildable, detachable, tradable
Peter Piper picked a peck of pickled peppers 
	→ pipette, epileptic, pipetec, decapitated, epic
Four score and seven years ago our fathers brought forth on this continent a new nation 
	→ stelljes, uncoordinated, straightedge, coordinated, oriordan


### Gradual progress

The `progress` function takes a list of vectors as a "source" (`src_vecs`) and linearly interpolates toward the vectors in a "destination" (`op_vecs`) over `n` steps, yielding a gradual phonetic transition between the two lists of vectors.

In [29]:
def progress(src_vecs, op_vecs, n=10):
    for i in range(n+1):
        delta = i * (1.0/n)
        val = (src_vecs * (1.0-delta)) + (op_vecs * delta)
        yield val
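
To make the blending concrete, here is the same weighted sum on a pair of made-up two-dimensional vectors, written with plain lists so it runs without the index. The helper name `lerp` is introduced here for illustration.

```python
def lerp(src, dst, delta):
    """Blend two equal-length vectors; delta=0 gives src, delta=1 gives dst."""
    return [s * (1.0 - delta) + d * delta for s, d in zip(src, dst)]

src, dst = [0.0, 10.0], [4.0, 2.0]
n = 4
# n + 1 steps, so both endpoints are included, as in progress().
steps = [lerp(src, dst, i / n) for i in range(n + 1)]
for step in steps:
    print(step)
```

The first step is exactly the source and the last is exactly the destination, which is why the transitions below start with the original sentence.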

In the cell below, we gradually move from "I am sitting in a room different from the one you are in now" to a sequence of vectors consisting entirely of the word "humming":

In [32]:
s = "I am sitting in a room different from the one you are in now"
vecs = np.array([t.get_item_vector(lookup[w.lower()]) for w in s.split()])
print(s)
op_vecs = np.array([t.get_item_vector(lookup["humming"])]*len(vecs))
for res in progress(vecs, op_vecs, n=25):
    print(" ".join([words[nnslookup(t, nlp, words, i)[0]] for i in res]))

I am sitting in a room different from the one you are in now
i am sitting in a room different from the one you are in now
i am sitting in a room different from the one you are in now
i am sitting in a room different from the one you are in now
i am sitting in a room different from the one hue are in now
i am sitting in a room different schrum the one hue are in now
i am sitting in a room different schrum the one hue are in now
i am sitting in ah room different schrum the one hue are in now
ah am sitting in ah room different schrum the one hue are in now
ah am sitting in ah room different schrum the one hue are in now
ah mam sitting in ah room different schrum the one hue are in now
ah mam sitting inning ah rooming remittance schrum the one hue are inning romanow
imai mam chickening inning imai rooming different schrum the one hue emminger inning romanow
imai mam chickening inning imai rooming cinnaminson schrum the hyun ya emminger inning romanow
imai mam committing inning imai rooming

I do the same thing in this next cell, moving from the slogan of my department (NYU ITP) to the word "error":

In [35]:
s = "Perhaps the best way to describe us is as a Center for the Recently Possible"
vecs = np.array([t.get_item_vector(lookup[w.lower()]) for w in s.split()])
print(s)
op_vecs = np.array([t.get_item_vector(lookup["error"])]*len(vecs))
for res in progress(vecs, op_vecs, n=25):
    print(" ".join([words[nnslookup(t, nlp, words, i)[0]] for i in res]))

Perhaps the best way to describe us is as a Center for the Recently Possible
perhaps the best way to describe us is as a center for the recently possible
perhaps the best way to describe us is as a center for the recently possible
perhaps the best way to describe us is as a center for the recently possible
perhaps the best way to describe us is as a center for the recently possible
perhaps the best whey to describe us is as a center for the recently possible
perhaps the best whey to describe us is as ah center schreur the recently possible
perhaps the best whey to describe us is as ah center schreur the recently possible
perhaps the best whey to describe us is as ah center schreur the recently possible
perhaps the best whey to describe us is as ah center schreur the recently possible
perhaps the best erway to describe usair is as ah center schreur the erisa possible
perhaps thy bestseller erway to ameridata usair is as ah center schreur thy erisa irrepressible
inheritor thy bestseller 

In the cell below, I look for the words latent in the alphabet by finding words that fall "between" the pronunciations of adjacent letters:

In [33]:
import textwrap
alpha = "abcdefghijklmnopqrstuvwxyz"
last = ""
output = []
for a, b in zip(alpha[:-1], alpha[1:]):
    if a != last:
        output.append(a)
    last = a
    for res in progress(np.array([t.get_item_vector(lookup[a])]), np.array([t.get_item_vector(lookup[b])]), n=30):
        res = [words[t.get_nns_by_vector(i, n=1)[0]] for i in res][0]
        if res != last:
            output.append(res)
        last = res
    if b != last:
        output.append(b)
    last = b
print(textwrap.fill(", ".join(output), 65))

a, eh, aah, b, beatie, beachy, c, teachey, deedee, g, d, e, eh,
f, sffed, divi, vecci, g, dacey, tizzy, edgy, jaycee, h, aah, ai,
i, ai, aah, ajay, j, che, 'kay, k, 'kay, cail, estai, ehle, l,
ehle, mtel, m, airmen, en, n, en, aw, au, o, au, aw, ooh, p,
kyowa, cue, q, cue, ru, ahr, r, ahr, 's, s, 's, essie, itchy,
attie, t, ja, yoy, hew, ewe, u, ewe, hew, view, venue, v, buddie,
dubay, w, extendable, aix, x, aix, equitex, ex-wife, why, iwai,
why, wai, y, wai, why, zewe, xie, z


### Sound analogies

These analogies fill in the blank in "word A sounds like word B in the same way that word C sounds like word D," where D is the blank. The `analogy()` function implements this with simple vector arithmetic: it subtracts the vector for A from the vector for B, adds the result to the vector for C, and looks up the words nearest to that point.

In [39]:
def analogy(t, w1, w2, w3):
    vec = (np.array(t.get_item_vector(w2)) - np.array(t.get_item_vector(w1))) + np.array(t.get_item_vector(w3))
    #return t.get_nns_by_vector(vec, 10)
    return nnslookup(t, nlp, words, vec, 10)
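
The arithmetic can be checked by hand on a toy vocabulary. In this sketch the two-dimensional vectors are made up so that the offset from "light" to "slide" matches the offset from "lack" to "slag", and a brute-force search stands in for the Annoy lookup. `toy` and `toy_analogy` are hypothetical names.

```python
import math

# Made-up 2-D "sound vectors"; the real ones have 50 dimensions.
toy = {
    "light": [1.0, 0.0],
    "slide": [2.0, 1.0],
    "lack": [1.0, 3.0],
    "slag": [2.0, 4.0],
    "lock": [0.0, 3.0],
}

def toy_analogy(vectors, a, b, c):
    """Return the word nearest to vec(b) - vec(a) + vec(c)."""
    target = [y - x + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    return min(vectors, key=lambda w: math.dist(vectors[w], target))

print(toy_analogy(toy, "light", "slide", "lack"))
```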

In [44]:
good_groups = [
    ["decide", "decision", "explode"], # explosion
    ["final", "finalize", "modern"],
    ["glory", "glorify", "liquid"],
    ["bite", "bitten", "shake"], # shaken
    ["leaf", "leaves", "calf"], # calves
    ["foot", "feet", "tooth"], # teeth
    ["automaton", "automata", "criterion"], # criteria
    ["four", "fourteen", "nine"], # nineteen
    ["light", "slide", "lack"], # slag
    ["whisky", "whimsy", "frisky"], # flimsy
    ["could", "stood", "calling"], # stalling
]
for w1, w2, w3 in good_groups:
    # uncomment for latex table formatting
    #print("%s & %s & %s & %s \\\\" % (w1, w2, w3, words[analogy(t, lookup[w1], lookup[w2], lookup[w3])[0]]))
    print("%s : %s :: %s : %s" % (w1, w2, w3, words[analogy(t, lookup[w1], lookup[w2], lookup[w3])[0]]))

decide : decision :: explode : explosion
final : finalize :: modern : modernize
glory : glorify :: liquid : liquefied
bite : bitten :: shake : shaken
leaf : leaves :: calf : calves
foot : feet :: tooth : keach
automaton : automata :: criterion : criteria
four : fourteen :: nine : nineteen
light : slide :: lack : slag
whisky : whimsy :: frisky : flimsy
could : stood :: calling : stalling


### Adding and subtracting sound

The following functions add the vectors of two words together (optionally scaling the second by `mult`), or subtract one vector from another, and look up the words phonetically closest to the result.

In [58]:
def vecsum(t, w1, w2, mult=1.0):
    vec = (np.array(t.get_item_vector(w1)) + np.array(t.get_item_vector(w2))*mult)
    return nnslookup(t, nlp, words, vec)
def vecsub(t, w1, w2):
    vec = (np.array(t.get_item_vector(w1)) - np.array(t.get_item_vector(w2)))
    return nnslookup(t, nlp, words, vec)
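
The `mult` argument controls how far the second word pulls the result. A small sketch with made-up two-dimensional vectors and a brute-force nearest-neighbor search (both assumptions for illustration, as is the name `toy_sum`) shows the nearest word changing as `mult` grows:

```python
import math

# Made-up 2-D vectors for illustration only.
toy = {
    "miss": [1.0, 0.0],
    "sieve": [0.0, 2.0],
    "missive": [1.0, 2.0],
    "mist": [1.0, 0.5],
}

def toy_sum(vectors, w1, w2, mult=1.0):
    """Nearest word to vec(w1) + mult * vec(w2), by Euclidean distance."""
    target = [x + y * mult for x, y in zip(vectors[w1], vectors[w2])]
    return min(vectors, key=lambda w: math.dist(vectors[w], target))

for mult in (0.0, 0.25, 1.0):
    print(mult, toy_sum(toy, "miss", "sieve", mult))
```

With `mult=0.0` the first word comes back unchanged; larger values move the result further toward the second word's sound.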

Addition (e.g., "fizz" + "theology" = "physiology")

In [47]:
[words[i] for i in vecsum(t, lookup["ate"], lookup["teen"])]

['eighteen', 'ate', 'it']

In [48]:
[words[i] for i in vecsum(t, lookup["miss"], lookup["sieve"])]

['missive', 'miss', 'missus', 'fis']

In [49]:
[words[i] for i in vecsum(t, lookup["sub"], lookup["marine"])]

['submarine', 'summarize', 'marine', 'submarines']

In [50]:
[words[i] for i in vecsum(t, lookup["fizz"], lookup["theology"])]

['physiology', 'fizzle', 'sieve', 'sivy', 'feese']

In [59]:
[words[i] for i in vecsum(t, lookup["snack"], lookup["king"])]

['snacking', 'qing', 'snaking', 'smacking', 'flanking']

Subtraction (e.g., "curiously" - "lee" = "curious")

In [52]:
[words[i] for i in vecsub(t, lookup["submarine"], lookup["sub"])]

['amerindian', "burundi's", 'uninspired', 'collaborating', 'emerine']

In [53]:
[words[i] for i in vecsub(t, lookup["wordsworth"], lookup["word"])]

['dislodged', 'disregards', 'elsworth', 'discharged', 'aylesworth']

In [54]:
[words[i] for i in vecsub(t, lookup["lavender"], lookup["under"])]

['javelin', 'raveling', 'televise', 'ravel', 'travelodge']

In [55]:
[words[i] for i in vecsub(t, lookup["curiously"], lookup["lee"])]

['curious', 'kurian', 'curiosities', 'curatorial', 'murias']

In [56]:
[words[i] for i in vecsub(t, lookup["ingredients"], lookup["reed"])]

['insignia', 'industrielle', 'argentinians', 'archaeologists', 'intracranial']

In [57]:
[words[i] for i in vecsub(t, lookup["disassociate"], lookup["diss"])]

['authoritarianism',
 'associating',
 'professorial',
 'associate',
 'disassociated']

### Sound symbolism tinting

Research in psychology and psycholinguistics suggests that people tend to hear "kiki" as a "sharp" word and "bouba" as a "round" word. Let's re-compose a text while trying to make it "sharp" (by adding a scaled copy of the vector for `kiki` to each word) and "round" (by adding a scaled copy of the vector for `babu` to each word). I use `babu` instead of `bouba` because `bouba` isn't in CMUdict.

Here's Frost's famous "The Road Not Taken":

In [64]:
text = """\
two roads diverged in a yellow wood
and sorry i could not travel both
and be one traveler long i stood
and looked down one as far as i could
to where it bent in the undergrowth
 
then took the other as just as fair
and having perhaps the better claim
because it was grassy and wanted wear
though as for that the passing there
had worn them really about the same
 
and both that morning equally lay
in leaves no step had trodden black
oh i kept the first for another day
yet knowing how way leads on to way
i doubted if i should ever come back
 
i shall be telling this with a sigh
somewhere ages and ages hence
two roads diverged in a wood and i
i took the one less traveled by
and that has made all the difference"""

Rewritten as a "sharp" poem:

In [66]:
for line in text.split("\n"):
    print(' '.join([words[vecsum(t, lookup[word], lookup["kiki"], 0.8)[0]] for word in line.split()]).capitalize())

Kooky roads diverged in eh yellow woodke
And sarti i gokey pecan keevil booth
And pee one traveler long i stookey
And loci down one as far as i gokey
Tuckey waikiki eke beak in thi undergrowth

Then kupek thi other as cheeky as fichera
And having perhaps thi becky keim
Picky eke was keesee and waikiki wacky
Though as for peak thi peeking geeky
Hakki worn them keeley kabuki thi safekeeping

And booth peak morning kiki lay
In teves know techie hakki teagarden blackie
Oh i khaki thi thirsty for another ghee
Fekete keown how way tiegs on tuckey way
I tiki if i shooed keever come bacchi

I kishi pee leckey keith withey eh psyche
Squeaky keizai and keizai hence
Kooky roads diverged in eh woodke and i
I kupek thi one leckey keevil be
And peak has pigue all thi defrance


Now as a "round" poem:

In [68]:
for line in text.split("\n"):
    print(' '.join([words[vecsum(t, lookup[word], lookup["babu"], 0.8)[0]] for word in line.split()]).capitalize())

Chubu roads barboursville in a yellow would
And bari i koba knob travel both
And bobby one bosler daum i stobaugh
And jukebox bowed one as fahd as i koba
To bowell it bondt in the bogard

Babu bocook the bother as babu as fair
And having perhaps the bobbette claim
Boggess it zabawa barresi and wambaugh bowell
Though as for bogacz the babu babu
Hob worn them bodley abboud the same

And both bogacz booming equally lay
In bob's know bobbette hob bodden blob
Oh i bobcat the first for bother doi
Bobbette knowing baja way bob's on to way
I bowed if i should ever kebab bob

I shall bobby babu boggess with a seibu
Babu mugabe's and mugabe's hence
Chubu roads barboursville in a would and i
I bocook the one babu traveled ba
And bogacz has mugabe all the ballance


And hey, just for fun, let's add the vector for the word "road" to every word in the poem:

In [72]:
for line in text.split("\n"):
    print(' '.join([words[vecsum(t, lookup[word], lookup["road"], 0.95)[0]] for word in line.split()]).capitalize())

To roads road in a yellow would
And sorrows ah could not devilwood boge
And be rowen traveler long ah road
And looked lowdown rowen a.s far a.s ah could
To haywood it bent in the undergrowth

Then rook the other a.s road a.s fair
And having perhaps the eurodollar crowed
Road it was grode and road haywood
Zoh a.s for rhodus the road narrowed
Hid worn them reload o'dowd the same

And boge rhodus morning equally lay
In roads know road hid roden brode
Oh ah roadcap the first for another doi
Road rowing doha way roads on to way
Ah road if ah should ever come roadcap

Ah shall be rodale road with a sigh
Road loges and loges hence
To roads road in a would and ah
Ah rook the rowen road traveled lodi
And rhodus has made all the durrance
