# Spell Corrector

A spell corrector is a widely used application, found almost everywhere on the internet. Spell correction is an important part of lexical processing, and in many applications it forms an initial preprocessing layer. For example, if we are building a chatbot to book flights and receive the request 'Book a flight from Mumbai to Bangalor', we want to handle the spelling error gracefully and still return relevant results.

Here we will be working on Norvig's spell corrector.

In [1]:
# import required libraries
import re
from collections import Counter

In [2]:
# function to tokenize the words
def words(document):
    return re.findall(r'\w+', document.lower())
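
To see what this tokenizer produces, here is a small illustrative check on the chatbot request from the introduction. The function is repeated so the snippet runs on its own; the names `sample` and `tokens` are introduced only for this example.

```python
import re

def words(document):
    # \w+ matches runs of letters, digits and underscores;
    # spaces and punctuation act as separators
    return re.findall(r'\w+', document.lower())

sample = "Book a flight from Mumbai to Bangalor"
tokens = words(sample)
print(tokens)

# Apostrophes are not matched by \w, so contractions are split in two
print(words("don't stop"))
```

Because apostrophes split tokens, fragments such as `don` and `t` end up in the frequency table as separate entries.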

In [3]:
# create a frequency table of all the words of the document
with open('big.txt') as f:
    all_words = Counter(words(f.read()))

In [4]:
type(all_words)

collections.Counter

In [5]:
# compare the corpus size in characters with the number of distinct words
with open('big.txt') as f:
    print(len(f.read()))
print(len(all_words))

6488665
32198


In [6]:
# check the frequency of a random word, say 'bought'
all_words['bought']

55

In [7]:
# look at top 10 frequent words in the document
all_words.most_common(10)

[('the', 79809),
 ('of', 40024),
 ('and', 38312),
 ('to', 28765),
 ('in', 22023),
 ('a', 21124),
 ('that', 12512),
 ('he', 12401),
 ('was', 11410),
 ('it', 10681)]

In [8]:
# Generate every string that is one edit (delete, insert, replace or transpose) away from `word`
def edits_one(word):
    alphabets    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])                   for i in range(len(word) + 1)]
    deletes    = [left + right[1:]                       for left, right in splits if right]
    inserts    = [left + c + right                       for left, right in splits for c in alphabets]
    replaces   = [left + c + right[1:]                   for left, right in splits if right for c in alphabets]
    transposes = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right)>1]
    return set(deletes + inserts + replaces + transposes)
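
For a word of length n, the four comprehensions above yield n deletions, 26(n+1) insertions, 26n replacements and n-1 transpositions, that is 54n+25 strings before duplicates are removed. The sketch below makes that breakdown visible; `edit_breakdown` is a hypothetical helper written for this illustration and is not part of the corrector itself.

```python
def edit_breakdown(word):
    """Return the four edit lists separately (hypothetical inspection helper)."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return {
        'deletes':    [left + right[1:] for left, right in splits if right],
        'inserts':    [left + c + right for left, right in splits for c in letters],
        'replaces':   [left + c + right[1:] for left, right in splits if right for c in letters],
        'transposes': [left + right[1] + right[0] + right[2:]
                       for left, right in splits if len(right) > 1],
    }

parts = edit_breakdown('monney')
for name, edits in parts.items():
    print(name, len(edits))

total = sum(len(edits) for edits in parts.values())   # raw count, duplicates included
distinct = set().union(*parts.values())               # what a set-based edits_one keeps
print(total, len(distinct))
```

The raw total is larger than the distinct count because several edits collide: replacing a letter with itself reproduces the original word, and deleting either `n` of `monney` gives the same string.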

In [9]:
# Lazily generate every string that is two edits away from `word` (returns a generator, may contain duplicates)
def edits_two(word):
    return (e2 for e1 in edits_one(word) for e2 in edits_one(e1))

In [10]:
# Return the subset of `words` that appears in `all_words`
def known(words):
    return set(word for word in words if word in all_words)

In [11]:
# Generate possible spelling corrections for word
def possible_corrections(word):
    return (known([word]) or known(edits_one(word)) or known(edits_two(word)) or [word])
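
The chain of `or` operators encodes a priority order: an empty set is falsy, so Python moves on to the next operand only when the previous one produced no known words. The sketch below shows that behaviour with hand-written candidate sets; the set contents are illustrative assumptions, not output computed from `big.txt`.

```python
# Hand-written stand-ins for the three known(...) calls (illustrative values)
exact_match = set()                              # the misspelling itself is not a known word
one_edit_away = {'money', 'monkey'}              # known words one edit away
two_edits_away = {'honey', 'money', 'monkey'}    # known words two edits away

# `or` returns the first truthy operand, so two_edits_away is never consulted here
candidates = exact_match or one_edit_away or two_edits_away or ['monney']
print(candidates)

# When every set is empty, the fallback list holding the original word is returned
no_candidates = set() or set() or set() or ['qzxv']
print(no_candidates)
```

A consequence of this design is that a known word one edit away always wins over a more frequent word two edits away.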

In [12]:
# Probability of `word`: Number of appearances of 'word' / total number of tokens
def prob(word, N=sum(all_words.values())): 
    return all_words[word] / N
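
One detail worth knowing about this definition: a default argument is evaluated once, when the `def` statement runs, so `N` does not follow later changes to the counter. The sketch below illustrates this with a toy counter; `toy_counts` and its values are hypothetical.

```python
from collections import Counter

toy_counts = Counter({'the': 6, 'money': 3, 'monkey': 1})   # hypothetical counts

def prob(word, N=sum(toy_counts.values())):
    # N was fixed at 10 when this function was defined
    return toy_counts[word] / N

p_before = prob('money')      # 3 / 10
toy_counts['money'] += 10     # the counter changes after the definition...
p_after = prob('money')       # ...but N is still 10, so this is 13 / 10
p_unseen = prob('unseen')     # a Counter returns 0 for missing keys, no KeyError

print(p_before, p_after, p_unseen)
```

If the frequency table is rebuilt, the function should be redefined as well, or `N` passed explicitly.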

In [13]:
# Let us test with sample text

print(len(set(edits_one("monney"))))
print(edits_one("monney"))

336
{'monjney', 'mxonney', 'monndy', 'monncy', 'monnely', 'mponney', 'aonney', 'monmney', 'mohney', 'mdonney', 'mobnney', 'mkonney', 'motnney', 'monhey', 'monbey', 'hmonney', 'lonney', 'monxey', 'monneys', 'mosney', 'mnney', 'mwonney', 'moanney', 'monuney', 'monqey', 'mdnney', 'umonney', 'monneny', 'mtonney', 'monnhey', 'mojnney', 'monneuy', 'monnye', 'vmonney', 'donney', 'mondey', 'tmonney', 'monnsy', 'monvey', 'monpey', 'mobney', 'monyney', 'molnney', 'mooney', 'monnen', 'monnewy', 'amonney', 'uonney', 'monnep', 'monwey', 'moenney', 'monrney', 'moncey', 'myonney', 'monnezy', 'monnvy', 'mtnney', 'movnney', 'maonney', 'monnegy', 'monnwy', 'imonney', 'monrey', 'monnery', 'mhnney', 'monzney', 'cmonney', 'monnsey', 'monneyw', 'bmonney', 'monner', 'monnky', 'monnbey', 'monnpey', 'monnaey', 'monneyy', 'moneey', 'monneyd', 'monneey', 'xmonney', 'mouney', 'monneyx', 'monpney', 'momnney', 'monniey', 'monnecy', 'monaey', 'oonney', 'moonney', 'menney', 'monnex', 'mopney', 'monneiy', 'monwney', '

In [14]:
print(len(set(edits_two("monney"))))
print(edits_two("monney"))

51013
<generator object edits_two.<locals>.<genexpr> at 0x10fd54ac0>
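
Printing a generator shows only the object, not its elements, and a generator can be consumed just once. A minimal sketch of how to peek at the first few items with `itertools.islice`; `pairs` is a hypothetical stand-in with the same nested-loop shape as `edits_two`.

```python
from itertools import islice

def pairs():
    # Same structure as edits_two: a generator expression over two nested loops
    return (a + b for a in 'ab' for b in 'xyz')

gen = pairs()
first_three = list(islice(gen, 3))   # take three items without building the full list
remaining = list(gen)                # continues where islice stopped
exhausted = list(gen)                # nothing left: the generator is used up

print(first_three, remaining, exhausted)
```

This is why the cell above calls `edits_two("monney")` twice: the generator counted by `len(set(...))` is spent, and a fresh one is created for the `print`.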


In [15]:
# Let's look at possible corrections of a word
print(possible_corrections("monney"))

{'monkey', 'money'}


In [16]:
# Let's look at probability of a word
print(prob("money"))
print(prob("monkey"))

0.0002922233626303688
5.378344097491451e-06


In [17]:
# Build the spell corrector: pick the most probable correction for `word` among its `possible_corrections` and report it
def spell_check(word):
    correct_word = max(possible_corrections(word), key=prob)
    if correct_word != word:
        return "Did you mean " + correct_word + "?"
    else:
        return "Correct spelling."

In [18]:
# test spell check
print(spell_check("monney"))

Did you mean money?
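
The selection step relies on `max` with a `key` function: each candidate is scored and the highest-scoring one is returned. The sketch below isolates that step with a toy counter; the counts are hypothetical and chosen only to mirror the ordering seen above.

```python
from collections import Counter

toy_counts = Counter({'money': 50, 'monkey': 2})   # hypothetical counts

candidates = {'monkey', 'money'}
best = max(candidates, key=lambda w: toy_counts[w])   # highest count wins
print(best)

# When no known candidate exists, the fallback is a list holding the word itself,
# so max returns the input unchanged even though its count is 0
fallback = ['qzxv']
best_fallback = max(fallback, key=lambda w: toy_counts[w])
print(best_fallback)
```

The second case means that a function shaped like `spell_check` reports "Correct spelling." for an unknown word with no known neighbours within two edits, which may be worth handling separately.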


In [19]:
# test spell check
print(spell_check("aquire"))

Did you mean acquire?
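
Returning to the flight-booking request from the introduction, the same ideas can be applied token by token to a whole sentence. This is a minimal sketch under stated assumptions: `TOY_CORPUS` stands in for `big.txt`, `correct_word` and `correct_sentence` are hypothetical helpers, and the two-edit step is omitted for brevity.

```python
import re
from collections import Counter

# Toy stand-in for big.txt (assumption: real counts would come from the corpus)
TOY_CORPUS = "book a flight from mumbai to bangalore book a flight to bangalore"
toy_counts = Counter(re.findall(r'\w+', TOY_CORPUS.lower()))

def edits_one(word):
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    inserts = [l + c + r for l, r in splits for c in letters]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    return set(deletes + inserts + replaces + transposes)

def known(candidates):
    return {w for w in candidates if w in toy_counts}

def correct_word(word):
    # Prefer the word itself, then known one-edit neighbours, else leave it unchanged
    candidates = known([word]) or known(edits_one(word)) or [word]
    return max(candidates, key=lambda w: toy_counts[w])

def correct_sentence(sentence):
    # Lowercases and drops punctuation, just like the tokenizer used above
    return ' '.join(correct_word(w) for w in re.findall(r'\w+', sentence.lower()))

corrected = correct_sentence('Book a flight from Mumbai to Bangalor')
print(corrected)
```

With a vocabulary this small only `bangalor` changes; on a real corpus, place names missing from the frequency table could be "corrected" into unrelated words, so a domain-specific word list would likely be needed.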
