Dataset:

https://github.com/amankharwal/Website-data/blob/master/book.txt

### Import Libraries and Dataset

In [1]:
import pandas as pd
import numpy as np
import textdistance
import re
from collections import Counter

# Read the book, lowercase it, and split it into word tokens
with open('book.txt', 'r') as f:
    file_name_data = f.read()
file_name_data = file_name_data.lower()
words = re.findall(r'\w+', file_name_data)
# This is our vocabulary
V = set(words)
print(f"The first ten words in the text are: \n{words[0:10]}")
print(f"There are {len(V)} unique words in the vocabulary.")

The first ten words in the text are: 
['the', 'project', 'gutenberg', 'ebook', 'of', 'moby', 'dick', 'or', 'the', 'whale']
There are 17647 unique words in the vocabulary.
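
To see what the tokenization step does, here is a minimal sketch on a made-up sentence (the `sample` string is an assumption for illustration, not text from book.txt). It applies the same lowercase-then-`\w+` approach used above:

```python
import re

# Hypothetical sample text, not taken from book.txt
sample = "Call me Ishmael. Don't panic!"

# Same steps as the cell above: lowercase, then keep runs of word characters
tokens = re.findall(r'\w+', sample.lower())
print(tokens)  # ['call', 'me', 'ishmael', 'don', 't', 'panic']
```

Note that punctuation is dropped and the apostrophe splits "don't" into two tokens. Because `\w` also matches digits and underscores, numbers in the text count as words in the vocabulary too.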


The code above builds a list of every word in the text. Next we count how often each word occurs, which is straightforward with the Counter class from Python's collections module:

In [2]:
# Count how many times each word occurs in the text
word_freq_dict = Counter(words)
print(word_freq_dict.most_common()[0:10])

[('the', 14703), ('of', 6742), ('and', 6517), ('a', 4799), ('to', 4707), ('in', 4238), ('that', 3081), ('it', 2534), ('his', 2530), ('i', 2120)]


### Relative Frequency of Words

Next we estimate each word's probability of occurrence. This is its relative frequency, i.e. its count divided by the total number of words in the text:


In [3]:
# Relative frequency: each word's count divided by the total word count
probs = {}
total = sum(word_freq_dict.values())
for k in word_freq_dict.keys():
    probs[k] = word_freq_dict[k] / total
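
As a quick sanity check of the idea, the sketch below computes relative frequencies for a tiny made-up word list (`toy_words` and the other `toy_` names are assumptions for illustration only). The probabilities should always sum to 1:

```python
from collections import Counter

# Hypothetical toy corpus standing in for the book's word list
toy_words = ['the', 'whale', 'the', 'sea']

toy_counts = Counter(toy_words)
toy_total = sum(toy_counts.values())

# Each word's count divided by the total number of words
toy_probs = {word: count / toy_total for word, count in toy_counts.items()}

print(toy_probs)                 # {'the': 0.5, 'whale': 0.25, 'sea': 0.25}
print(sum(toy_probs.values()))   # 1.0
```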

### Finding Similar Words

In [4]:
def my_autocorrect(input_word):
    input_word = input_word.lower()
    # A word already in the vocabulary is treated as correctly spelled
    if input_word in V:
        return "Your word seems to be correct"
    else:
        # Jaccard similarity on character bigrams (qval=2) against every vocabulary word
        similarities = [1 - textdistance.Jaccard(qval=2).distance(v, input_word) for v in word_freq_dict.keys()]
        df = pd.DataFrame.from_dict(probs, orient='index').reset_index()
        df = df.rename(columns={'index': 'Word', 0: 'Prob'})
        df['Similarity'] = similarities
        # Rank by similarity, break ties by word probability, and keep the top five
        output = df.sort_values(['Similarity', 'Prob'], ascending=False).head()
        return output

Now let’s use our autocorrect function to find the words most similar to a misspelled input:

In [5]:
my_autocorrect('nevertekess')

| | Word | Prob | Similarity |
|---|---|---|---|
| 2571 | nevertheless | 0.000225 | 0.5 |
| 1105 | never | 0.000925 | 0.4 |
| 14746 | overtakes | 4e-06 | 0.384615 |
| 11797 | bitterness | 4e-06 | 0.357143 |
| 14418 | tenderness | 4e-06 | 0.357143 |
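
The Similarity values can be checked by hand. The following is a pure-Python sketch of Jaccard similarity on character bigrams, which is what `qval=2` requests. It assumes multiset semantics (repeated bigrams are counted), and the helper names `bigrams` and `bigram_jaccard` are illustrative only; they are not part of textdistance, and this is not the library's own implementation.

```python
from collections import Counter

def bigrams(word):
    """Return the multiset of overlapping two-character slices of a word."""
    return Counter(word[i:i + 2] for i in range(len(word) - 1))

def bigram_jaccard(a, b):
    """Multiset Jaccard similarity: shared bigrams divided by all bigrams."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    shared = sum((grams_a & grams_b).values())    # intersection: minimum counts
    combined = sum((grams_a | grams_b).values())  # union: maximum counts
    return shared / combined if combined else 1.0

# 'nevertekess' has 10 bigrams and 'nevertheless' has 11; they share 7
print(bigram_jaccard('nevertheless', 'nevertekess'))  # 7 / 14 = 0.5
# 'never' has 4 bigrams, all of which appear in 'nevertekess'
print(bigram_jaccard('never', 'nevertekess'))         # 4 / 10 = 0.4
```

Both results agree with the first two rows of the output. Short words such as "never" can score well because all of their bigrams are shared, which is one reason the function also sorts by word probability.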


Source:

https://thecleverprogrammer.com/2020/10/04/autocorrect-with-python/