# Similar sounding words

This is a list of similar sounding words that I have collected from various sources on the web and added to as I find new pairs.

Unlike most homophone, homograph, and homonym resources, this list does not target ESL or educational use. Instead, it is designed for finding common errors in speech-recognition text. Specifically, I use it with [Caster](https://caster.readthedocs.io/en/latest/) for voice programming.

I currently have five different sources. I've downloaded their contents as text files or, in one case, as HTML, and parsed them appropriately. I have also linked to the original location of each file, both inside the files and in the headings between the Jupyter cells below.

Unfortunately I wasn't thinking about reproducibility when I started this project, so most of the text files have had a bit of light preprocessing in a text editor. Given that I don't expect these source lists to change in the future, I don't think it will be a problem.

In [19]:
from bs4 import BeautifulSoup # pip install beautifulsoup4
from disjoint_set import DisjointSet # pip install disjoint-set
import re
from pprint import pformat

# [7esl.html](https://7esl.com/homophones/)

In [2]:
contents = open("7esl.html", encoding="utf8").read()
parser = BeautifulSoup(contents, 'html.parser')

In [3]:
similar_7esl = []
for element in parser.find_all("p"):
    candidate = element.find("strong")
    if candidate:
        partitions = candidate.text.lower().split(" —– ")
        if len(partitions) > 1:
            words = []
            for p in partitions:
                words.extend(s.strip().replace('’', "''") for s in p.split("/"))
            
            similar_7esl.append(words)
similar_7esl

[...,
 ['waste', 'waist'],
 ['way', 'weigh'],
 ['weak', 'week'],
 ['weather', 'whether'],
 ['where', 'wear'],
 ['which', 'witch'],
 ["who''s", 'whose'],
 ['won', 'one'],
 ['would', 'wood'],
 ["you''re", 'your']]
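The entries `who''s` and `you''re` in the output above carry two straight quotes, because the `replace` call substitutes a two-character string for the curly apostrophe. A minimal sketch of a one-character substitution follows, assuming that a plain ASCII apostrophe is the intended result. `normalize_apostrophes` is a hypothetical helper and is not part of the original notebook.

```python
# Hypothetical helper (not in the original notebook): map the typographic
# apostrophe U+2019 to a single ASCII apostrophe.
def normalize_apostrophes(word: str) -> str:
    return word.replace("\u2019", "'")


# "who\u2019s" is "who" + curly apostrophe + "s".
print(normalize_apostrophes("who\u2019s"))
```

With this substitution the 7esl spellings would match the `who's` and `you're` spellings used by the other sources, rather than appearing as separate entries.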

# [ku.txt](https://web.ku.edu/~edit/wordsall.html)

In [4]:
contents = open("ku.txt").read().lower().splitlines()[1:]
similar_ku = [s.split(';') for s in contents]
similar_ku

[...,
 ['wane', 'wain'],
 ['want', 'wont'],
 ['wean', 'ween'],
 ['wear', 'ware'],
 ['weather', 'wether', 'whether'],
 ['whither', 'wither'],
 ['worst', 'wurst'],
 ['yew', 'ewe', 'you'],
 ['yoke', 'yolk']]

# [singularis.txt](http://www.singularis.ltd.uk/bifroest/misc/homophones-list.html)

In [5]:
contents = open("singularis.txt").read().lower().splitlines()[1:]
similar_singularis = [s.split(', ') for s in contents]
similar_singularis

[...,
 ['whirled', 'world'],
 ['whit', 'wit'],
 ['white', 'wight'],
 ["who's", 'whose'],
 ['woe', 'whoa'],
 ['wood', 'would'],
 ['yaw', 'yore', 'your', "you're"],
 ['yoke', 'yolk'],
 ["you'll", 'yule']]

# [teachingtreasures.txt](https://www.teachingtreasures.com.au/teaching-tools/Basic-worksheets/worksheets-english/upper/homophones-list.htm)

In [6]:
contents = open("teachingtreasures.txt").read().lower().splitlines()[1:]
similar_teachingtreasures = [s.split(' ') for s in contents if s]
similar_teachingtreasures

[...,
 ['wrap', 'rap'],
 ['wrapped', 'rapped'],
 ['wreak', 'reek'],
 ['wrest', 'rest'],
 ['wretch', 'retch'],
 ['wring', 'ring'],
 ['write', 'right'],
 ['wrote', 'rote'],
 ['wrung', 'rung'],
 ['wry', 'rye']]

# [thoughtco.txt](https://www.thoughtco.com/homonyms-homophones-and-homographs-a-b-1692660)

In [7]:
contents = open("thoughtco.txt").read().lower().splitlines()[1:]
similar_thoughtco = [s.split(' ') for s in contents if s]
similar_thoughtco

[...,
 ['war', 'wore'],
 ['warn', 'worn'],
 ['way', 'weigh'],
 ['we', 'wee'],
 ['weak', 'week'],
 ['wear', 'where'],
 ['weather', 'whether'],
 ['which', 'witch'],
 ['wood', 'would'],
 ['your', "you're"]]

# My personal list of words not found above

These were identified through trial and error (actually, just error) during dictation. Pull Requests welcome. These words tend to be commonly confused, but are not generally recognized as homophones.

In [8]:
contents = open("dusty.txt").read().lower().splitlines()[1:]
similar_dusty = [s.split(' ') for s in contents if s]
similar_dusty

[]
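The text-file cells above all follow the same pattern: lowercase the contents, drop the first line, and split each remaining line on a separator. That pattern can be sketched as one function, assuming the first line of each file is a header such as the source link. `parse_groups` and the in-memory sample are hypothetical and stand in for the real files.

```python
import io


def parse_groups(lines, sep=" ", skip_header=True):
    """Split each non-empty line into a group of lowercase words."""
    rows = [line.strip().lower() for line in lines]
    if skip_header:
        rows = rows[1:]  # assumption: the first line is a header, not data
    return [row.split(sep) for row in rows if row]


# Hypothetical in-memory sample standing in for one of the text files.
sample = io.StringIO("# source link header\nWay Weigh\n\nweak week\n")
groups = parse_groups(sample.read().splitlines())
print(groups)  # → [['way', 'weigh'], ['weak', 'week']]
```

The `sep` argument would cover the `';'` and `', '` separators used by the ku and singularis files.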

# Join it all together
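We want a single list containing every group of words from the sources above. This list of lists will surely contain duplicates (in fact, mostly duplicates).

I have done a visual sanity check of all the outputs above, but I'll do another below: it prints any group with fewer than two distinct words, and any word containing characters other than lowercase letters, apostrophes, and hyphens.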

In [9]:
similar_words = []
similar_words.extend(similar_7esl)
similar_words.extend(similar_ku)
similar_words.extend(similar_singularis)
similar_words.extend(similar_teachingtreasures)
similar_words.extend(similar_thoughtco)
similar_words.extend(similar_dusty)

In [10]:
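# Words may only contain lowercase letters, apostrophes, and hyphens.
regex = re.compile("^[a-z'-]+$")
for similar in similar_words:
    # Flag groups that do not contain at least two distinct words.
    if len(set(similar)) < 2:
        print(similar)
    # Flag words that contain unexpected characters.
    for word in similar:
        if not regex.match(word):
            print(word)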

# Dedup

Removing duplicates is not trivial, since the different sets of words may include multiple variations (for example, one set has *your* and *you're* and another includes *yore*). It would be easy enough to just do a double loop, but disjoint sets are my favourite data structure, and I've never actually had an opportunity to use them in production code before. If you're unfamiliar with the union-find algorithm, read up on it; it's pretty cool.

In [11]:
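word_set = DisjointSet()
for word_list in similar_words:
    # Union every word with the first word of its group; groups that share
    # a word end up in the same set.
    for word in word_list[1:]:
        word_set.union(word_list[0], word)

# Sort the words within each set, then sort the sets themselves.
wordsets = sorted(sorted(s) for s in word_set.itersets())
wordsets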

[...,
 ['whither', 'wither'],
 ["who''s", "who's", 'whose'],
 ['whoa', 'woe'],
 ['wood', 'would'],
 ['worst', 'wurst'],
 ['yaw', 'yore', "you''re", "you're", 'your'],
 ['yoke', 'yolk'],
 ["you'll", 'yule']]

In [12]:
len(wordsets)

662
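For readers unfamiliar with union-find, the merging that `DisjointSet` performs above can be sketched in plain Python. This is an illustrative stand-in and not the `disjoint-set` package's implementation; `merge_groups` and the sample groups are hypothetical.

```python
def merge_groups(groups):
    """Merge word groups that share at least one word (union-find sketch)."""
    parent = {}

    def find(word):
        # Register unseen words as their own root.
        parent.setdefault(word, word)
        root = word
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every word on the path straight at the root.
        while parent[word] != root:
            parent[word], word = root, parent[word]
        return root

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for group in groups:
        find(group[0])  # make sure single-word groups are registered too
        for word in group[1:]:
            union(group[0], word)

    merged = {}
    for word in parent:
        merged.setdefault(find(word), []).append(word)
    return sorted(sorted(members) for members in merged.values())


# Two groups overlap on "your", so they collapse into a single set.
sample = [["your", "you're"], ["yore", "your"], ["wood", "would"]]
print(merge_groups(sample))
# → [['wood', 'would'], ['yore', "you're", 'your']]
```

The point of the structure is that overlapping groups merge transitively without comparing every group against every other group.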

# Redupe

The final output is a dictionary of words mapping to all the words similar to that word, not including that word.

In [13]:
index = {}
for similar in wordsets:
    for word in similar:
        # Map each word to the rest of its group, excluding the word itself.
        local = similar.copy()
        local.remove(word)
        index[word] = local

index


{...,
 'pries': ['prize'],
 'prize': ['pries'],
 'primer': ['primmer'],
 'primmer': ['primer'],
 'prince': ['prints'],
 'prints': ['prince'],
 'principal': ['principle'],
 'principle': ['principal'],
 ...}

In [14]:
len(index)

1440

In [20]:
with open("../similar_sounding_words.py", "w") as file:
    file.write("index = " + pformat(index))
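A minimal sketch of how the generated mapping might be consulted, for example when deciding which alternatives to offer for a possibly misrecognized word. The `index` here is a small hand-copied sample of the entries shown in the output above, and `alternatives` is a hypothetical helper; how Caster itself consumes the file is not covered here.

```python
# Hand-copied sample of the generated mapping. In real use this would come
# from the written file, e.g. `from similar_sounding_words import index`.
index = {
    "prince": ["prints"],
    "prints": ["prince"],
    "principal": ["principle"],
    "principle": ["principal"],
}


def alternatives(word):
    """Return the similar-sounding words for `word`, or an empty list."""
    # Keys are lowercase because every source is lowercased while parsing.
    return index.get(word.lower(), [])


print(alternatives("Prince"))  # → ['prints']
print(alternatives("banana"))  # → []
```

Because every word in a group maps to the rest of its group, the lookup works the same from either side of a pair.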