Have we ever thought about how the autocorrect feature on a smartphone keyboard works?

* Almost every smartphone brand, irrespective of price, provides an autocorrect feature in its keyboard today. So let’s understand how the autocorrect feature works.

![image.png](attachment:image.png)

In the context of machine learning, autocorrect is based on natural language processing. As the name suggests, it is programmed to correct spelling mistakes and other errors while typing. So how does it work?

Let’s say we type a word on our keyboard. If that word exists in the vocabulary of our smartphone, it will assume that we have written the right word. It doesn’t matter whether we write a name, a noun, or any other word.

If the word exists in the history of the smartphone, it is treated as a correct word. What if the word doesn’t exist? In that case, autocorrect is programmed to find the most similar words in the history of our smartphone.
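
The lookup-then-suggest logic described above can be sketched in a few lines. This is only an illustration: the tiny `history` set and the `suggest` helper are hypothetical, and the standard library's `difflib.get_close_matches` stands in for whatever similarity measure a real keyboard uses.

```python
from difflib import get_close_matches

# Hypothetical word history; a real keyboard would have a far larger vocabulary
history = {"hello", "help", "world", "python"}

def suggest(word, vocabulary, n=3):
    """Return [] if the word is known, else up to n similar known words."""
    word = word.lower()
    if word in vocabulary:
        return []  # the word is treated as correct, nothing to suggest
    # difflib ranks candidates by its own similarity ratio, best match first
    return get_close_matches(word, sorted(vocabulary), n=n)

print(suggest("hello", history))  # known word
print(suggest("helo", history))   # unknown word, similar words are suggested
```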

### Build an Autocorrect with Python

Just as our smartphone uses its history to check whether a typed word is correct, we also need a collection of words to give our autocorrect its functionality. We will use the text of a book.

For this task, we need some libraries. Most of them are standard tools for a machine learning practitioner, so we probably have all of them installed already except one: `textdistance`, which can be installed with `pip`: `pip install textdistance`.

In [2]:
# !pip install textdistance

Collecting textdistance
  Downloading textdistance-4.2.1-py3-none-any.whl (28 kB)
Installing collected packages: textdistance
Successfully installed textdistance-4.2.1


In [3]:
import pandas as pd
import numpy as np
import textdistance
import re
from collections import Counter

In [6]:
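words = []

# Read the book, lowercase it, and split it into word tokens
with open('book.txt', 'r', encoding="utf-8") as f:
    file_name_data = f.read()
    file_name_data = file_name_data.lower()
    # Raw string so that \w is not treated as a string escape sequence
    words = re.findall(r'\w+', file_name_data)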


In [11]:
# This is our vocabulary
V = set(words)

print(f"The first ten words in the text are: \n\n{words[0:10]}")
print()
print(f"There are {len(V)} unique words in the vocabulary.")

The first ten words in the text are: 

['the', 'project', 'gutenberg', 'ebook', 'of', 'moby', 'dick', 'or', 'the', 'whale']

There are 17647 unique words in the vocabulary.
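
To see what the `\w+` pattern actually captures, here is a small sketch on a made-up sentence (the sample string is an assumption, not taken from the book). Note that `\w` matches letters, digits, and underscores, so apostrophes and punctuation split the text into separate tokens.

```python
import re

# Made-up sample text, used only to illustrate the tokenization
sample = "Call me Ishmael. Don't panic, it's 1851!"

# Same steps as above: lowercase first, then extract runs of word characters
tokens = re.findall(r"\w+", sample.lower())
print(tokens)
```

Contractions such as "don't" are split into two tokens here, so fragments like `t` and `s` can end up in the vocabulary. Whether that matters depends on the use case.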


In the code above, we made a list of words. Now we need to count how often each word occurs, which can be easily done with the `Counter` class from Python’s `collections` module:

In [12]:
# Counter maps each word to the number of times it occurs in the text
word_freq_dict = Counter(words)

print(word_freq_dict.most_common()[0:10])

[('the', 14703), ('of', 6742), ('and', 6517), ('a', 4799), ('to', 4707), ('in', 4238), ('that', 3081), ('it', 2534), ('his', 2530), ('i', 2120)]


### Relative Frequency of Words

Now we want the probability of occurrence of each word, which equals its relative frequency (its count divided by the total number of words):

In [15]:
probs = {}     
Total = sum(word_freq_dict.values())    

for k in word_freq_dict.keys():
    probs[k] = word_freq_dict[k]/Total
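
As a quick sanity check of the relative-frequency idea, the following sketch applies the same formula to a toy word list (the list and the `toy_` names are made up for illustration). It also shows an equivalent dictionary comprehension and checks that the probabilities sum to 1.

```python
from collections import Counter
import math

# Made-up word list standing in for the tokenized book
toy_words = ["the", "whale", "the", "sea", "the", "whale"]
toy_counts = Counter(toy_words)
toy_total = sum(toy_counts.values())  # 6 tokens in total

# P(word) = count(word) / total number of tokens
toy_probs = {word: count / toy_total for word, count in toy_counts.items()}

print(toy_probs["the"])                             # 3 of the 6 tokens
print(math.isclose(sum(toy_probs.values()), 1.0))   # probabilities sum to 1
```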

### Finding Similar Words

Now we will rank candidate words by their Jaccard similarity to the input, computed on the 2-grams (pairs of adjacent characters, `qval=2`) of the words. Then we will return the 5 most similar words, ordered by similarity and then by probability:

In [16]:
def my_autocorrect(input_word):
    input_word = input_word.lower()
    if input_word in V:
        return 'Your word seems to be correct'
    else:
        # Jaccard similarity on character 2-grams between the input and every vocabulary word
        similarities = [1-(textdistance.Jaccard(qval=2).distance(v,input_word)) for v in word_freq_dict.keys()]
        # probs was built by iterating word_freq_dict, so its rows line up with similarities
        df = pd.DataFrame.from_dict(probs, orient='index').reset_index()
        df = df.rename(columns={'index':'Word', 0:'Prob'})
        df['Similarity'] = similarities
        # Most similar first; ties are broken by the more frequent word
        output = df.sort_values(['Similarity', 'Prob'], ascending=False).head()
        return output

Now, let’s find the similar words by using our autocorrect function:

In [17]:
my_autocorrect('neverteless')

| | Word | Prob | Similarity |
|---|---|---|---|
| 2571 | nevertheless | 0.000225 | 0.75 |
| 13657 | boneless | 1.3e-05 | 0.416667 |
| 12684 | elevates | 4e-06 | 0.416667 |
| 1105 | never | 0.000925 | 0.4 |
| 7136 | level | 0.000108 | 0.4 |
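
The similarity scores in this output can be reproduced by hand. The sketch below is a plain-Python reimplementation of Jaccard similarity on character 2-grams, written for illustration and assuming the 2-grams are compared as multisets. The helper names `bigrams` and `jaccard_bigram_similarity` are hypothetical, and this is not the `textdistance` implementation itself.

```python
from collections import Counter

def bigrams(word):
    """Count the overlapping character 2-grams of a word."""
    return Counter(word[i:i + 2] for i in range(len(word) - 1))

def jaccard_bigram_similarity(a, b):
    """Size of the 2-gram intersection divided by the size of the union."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    intersection = sum((grams_a & grams_b).values())
    union = sum((grams_a | grams_b).values())
    # Neither word has any 2-grams: treated as identical here (an arbitrary choice)
    return intersection / union if union else 1.0

print(jaccard_bigram_similarity("neverteless", "nevertheless"))
print(jaccard_bigram_similarity("neverteless", "boneless"))
print(jaccard_bigram_similarity("neverteless", "never"))
```

For example, `neverteless` has 10 bigrams and `nevertheless` has 11. They share 9, so the similarity is 9 / (10 + 11 - 9) = 0.75, which matches the top row of the output.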


Just as we took words from a book, a smartphone keyboard ships with some words already in its vocabulary and records new words as the user starts typing.