We consider the problem of determining the most probable correction of a word not found in a dictionary. Thus, the problem is to find the correction c, among all possible candidate corrections, which maximizes the probability that c is the desired correction, given the original word w: argmax c∈ candidates P(c|w). By Bayes' theorem, this is equivalent to:
argmax c∈ candidates P(c) P(w|c) / P(w)
Since P(w) is the same for each possible candidate c, we can eliminate it, which gives: argmax c ∈ candidates P(c) P(w|c)
The elements of this equation are: Language model: P(c) The probability that c appears as a word in an English text.
Error model: P(w|c) The probability that w was typed in a text when the author meant to write c. For example, P(teh|the) is relatively high, but P(theeexyz|the) would be very low.

### Importation

In [1]:
import re
from collections import Counter
import numpy as np
import pandas as pd


## 1 - Data Preprocessing 

###  process_data(corpus_file)

1- Implement the process_data(corpus_file) function which allows you to :

        -read the given corpus in the form of a text file, 
        -convert the text to lowercase and segment the text and returns the list of words.

In [2]:
def process_data(file_name):
    """
    Input: 
        A file_name which is found in your current directory. You just have to read it in. 
    Output: 
        words: a list containing all the words in the corpus (text file you read) in lower case. 
    """
    words = [] 

    ### START CODE HERE ### 
    
    with open(file_name, 'r', encoding='utf-8') as file:
        text = file.read()
    
    text = text.lower()
    
    words = re.findall(r'\b\w+\b', text)
    
    ### END CODE HERE ###
    
    return words

In [3]:
#process_data("big.txt")

### get_vocabulary(corpus_file)

2- Implement the function get_vocabulary(corpus_file) which :
        
        returns the vocabulary constructed from a corpus passed as an argument to the function.

In [4]:
def get_vocabulary(corpus_file):
    """
    Input:
        corpus_file: path to the corpus file
    Output:
        vocabulary: a set containing all unique words in the corpus
    """
    words = process_data(corpus_file)
    
    vocabulary = set(words)
    
    return vocabulary

## 2- Build the language Model

We can estimate the probability of a word, by counting the number of times that word appears in a large corpus of text and dividing over the corpus size. Write a function that builds the language model by calculating the probability of each word based on the big.txt file provided with this lab and stores the result in an appropriate data structure. Don't forget to preprocess with the process_data function.

### get_count(word_l)

In [5]:
def get_count(word_l):
    '''
    Input:
        word_l: a set of words representing the corpus. 
    Output:
        word_count_dict: The wordcount dictionary where key is the word and value is its frequency.
    '''
    
    word_count_dict = {} 
    for word in word_l:
        if word in word_count_dict:
            word_count_dict[word]+=1
        else:
            word_count_dict[word]=1
            
    return word_count_dict

### get_probs(word_count_dict)

In [6]:
def get_probs(word_count_dict):
    '''
    Input:
        word_count_dict: The wordcount dictionary where key is the word and value is its frequency.
    Output:
        probs: A dictionary where keys are the words and the values are the probability that a word will occur. 
    '''
    probs = {}  
    
    total_count=sum(word_count_dict.values())
    for word , count in word_count_dict.items():
        probs[word]= count/ total_count
    
    return probs

### build_language_model(corpus_file)

In [7]:
def build_language_model(corpus_file):
    """
    Build a language model by calculating the probability of each word based on the corpus file.
    
    Args:
    - corpus_file: Path to the corpus file
    
    Returns:
    - language_model: Dictionary containing word probabilities
    """
    vocabulary=get_vocabulary(corpus_file)
    
    word_count_dict = get_count(vocabulary)
    
    language_model = get_probs(word_count_dict)
    
    return language_model

corpus_file = "big.txt"
language_model = build_language_model(corpus_file)


In [8]:
#language_model

## 3- Edits

Write the following functions:

    - edits1(s): which returns the set of all strings (whether they are words or not) that can be obtained with a single modification (insertion, substitution or deletion) to be performed on the string s.
    - edits2(s): which returns the set of all strings (whether they are words or not) which can be obtained with 2 modifications (insertion, substitution or deletion) to be carried out on the string s.
    - knownWord(words) which filters the words from the words list which are not in the dictionary (it therefore only keeps valid words). We can use the function get_vocabulary(corpus_file)

but before implementing the edits functions , let's implement each edit itself ( delete , switch , insert , replace)

In [9]:
def delete_letter(word, verbose=False):
    '''
    Input:
        word: the string/word for which you will generate all possible words 
                in the vocabulary which have 1 missing character
    Output:
        delete_l: a list of all possible strings obtained by deleting 1 character from word
    '''
    
    delete_l = []
    split_l = []
    
    split_l=[(word[:i],word[i:]) for i in range(len(word)+1)]
    delete_l=[L+R[1:] for L,R in split_l if R]
    

    if verbose: print(f"input word {word}, \nsplit_l = {split_l}, \ndelete_l = {delete_l}")

    return  delete_l

In [10]:
def switch_letter(word, verbose=False):
    '''
    Input:
        word: input string
     Output:
        switches: a list of all possible strings with one adjacent charater switched
    ''' 
    
    switch_l = []
    split_l = []
    
    split_l = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    for L, R in split_l:
        if len(R) >= 2:
            switched_word = L + R[1] + R[0] + R[2:]
            switch_l.append(switched_word)
    
    if verbose:
        print(f"Input word = {word} \nsplit_l = {split_l} \nswitch_l = {switch_l}") 
    
    return switch_l


In [11]:
def replace_letter(word, verbose=False):
    '''
    Input:
        word: the input string/word 
    Output:
        replaces: a list of all possible strings where we replaced one letter from the original word. 
    ''' 
    
    letters = 'abcdefghijklmnopqrstuvwxyz'
    
    replace_l = []
    split_l = []
    
    split_l=[(word[:i],word[i:]) for i in range(len(word)+1)]
    replace_set= set()
    for L, R in split_l:
        for letter in letters:
            if R:
                replace_set.add(L+letter+R[1:])

                
    replace_set.discard(word)    
    replace_l = sorted(list(replace_set))
    
    if verbose: print(f"Input word = {word} \nsplit_l = {split_l} \nreplace_l {replace_l}")   
    
    return replace_l

In [12]:
def insert_letter(word, verbose=False):
    '''
    Input:
        word: the input string/word 
    Output:
        inserts: a set of all possible strings with one new letter inserted at every offset
    ''' 
    letters = 'abcdefghijklmnopqrstuvwxyz'
    insert_l = []
    split_l = []
    
    split_l=[(word[:i],word[i:]) for i in range(len(word)+1)]
    for L, R in split_l:
        for letter in letters:
            
            inserted_word= L + letter +R 
            insert_l.append(inserted_word)
    
    if verbose: print(f"Input word {word} \nsplit_l = {split_l} \ninsert_l = {insert_l}")
    
    return insert_l

### edit_one_letter(word)

In [13]:
def edit_one_letter(word, allow_switches = True):
    """
    Input:
        word: the string/word for which we will generate all possible wordsthat are one edit away.
    Output:
        edit_one_set: a set of words with one possible edit. Please return a set. and not a list.
    """
    
    edit_one_set = set()
    
    edits_one = delete_letter(word) + replace_letter(word) + insert_letter(word)
    if allow_switches:
        edits_one += switch_letter(word)
    edit_one_set.update(edits_one)
    
    return set(edit_one_set)

### edit_two_letters(word)

In [14]:
def edit_two_letters(word, allow_switches=True):
    """
    Generate all possible edits with two modifications for the given word.
    """
    edit_two_set = set()
    
    # Get all edits with one modification
    edits_one = delete_letter(word) + replace_letter(word) + insert_letter(word)
    if allow_switches:
        edits_one += switch_letter(word)
    
    # For each modified word from edits_one, get all edits with one modification again
    for edit in edits_one:
        edits_two = delete_letter(edit) + replace_letter(edit) + insert_letter(edit)
        if allow_switches:
            edits_two += switch_letter(edit)
        edit_two_set.update(edits_two)
    
    return edit_two_set



### knownWord(words, corpus_file)

In [15]:
def knownWord(words, corpus_file):
    """
    Filter words from the list 'words' that are present in the vocabulary obtained from the corpus file.

    Args:
    - words: A list of words to filter.
    - corpus_file: The path to the corpus file used to obtain the vocabulary.

    Returns:
    - A list containing only the words that are present in the vocabulary.
    """
    # Get the vocabulary from the corpus file
    vocabulary = get_vocabulary(corpus_file)
    
    # Filter words to keep only those present in the vocabulary
    known_words = [word for word in words if word in vocabulary]
    
    return known_words


## 4- Candidates

We assume that we do not have data to construct the error model, so we will adopt the following assumptions: all known words with an edit distance of 1 are infinitely more likely than known words with an edit distance d edition of 2, and infinitely less probable than a known word with an edition distance of 0. Thus, to select the most probable candidates we consider their probabilities according to the previously constructed language model and their priorities according to their edit distance from the original word. With this simplification, we do not need to multiply by a factor P(w|c), because each candidate for the chosen priority will have the same probability.
Write the candidate(word) function which returns the first non-empty list of candidates in order of priority:

    The original word, if known; Otherwise
    The list of known words at an edit distance of one, if any; Otherwise
    The list of known words at an edit distance of two, if any; Otherwise
    The original word, even if it is not known.

In [16]:
vocabulary=get_vocabulary(corpus_file)

In [17]:
def candidates(word):
    """
    Generate candidate words for the given word based on the described prioritization.

    Args:
    - word: The original word for which candidates are generated.

    Returns:
    - A list of candidate words prioritized according to the described criteria.
    """
    if word in vocabulary:
        return [word]
    
    candidates_one_edit = edit_one_letter(word)
    known_candidates_one_edit = knownWord(candidates_one_edit, corpus_file)
    known_candidates_one_edit = [candidate for candidate in known_candidates_one_edit if candidate in language_model]
    if known_candidates_one_edit:
        return sorted(known_candidates_one_edit, key=lambda x: language_model.get(x, 0), reverse=True)
    
    candidates_two_edits = edit_two_letters(word)
    known_candidates_two_edits = knownWord(candidates_two_edits, corpus_file)
    known_candidates_two_edits = [candidate for candidate in known_candidates_two_edits if candidate in language_model]
    if known_candidates_two_edits:
        return sorted(known_candidates_two_edits, key=lambda x: language_model.get(x, 0), reverse=True)
    
    return [word]


### Test the candidates

In [18]:
test_word = "hetre"

suggested_candidates = candidates(test_word)

if test_word in vocabulary:
    print("Word '{}' is in the vocabulary.".format(test_word))
else:
    if suggested_candidates:
        print("Suggested candidates for '{}':".format(test_word))
        for candidate in suggested_candidates:
            print(candidate)
    else:
        print("No candidates found for '{}'. It is not in the vocabulary and no candidates with one or two edits.".format(test_word))


Suggested candidates for 'hetre':
etre
here
metre


Using the previous functions, write the function correction(word, k) which returns the k most probable corrections of the word word.

### correction(word, k)

In [19]:
def correction(word, k):
    """
    Return the k most probable corrections of the given word.

    Args:
    - word: The original word for which corrections are sought.
    - k: The number of corrections to return.

    Returns:
    - A list of the k most probable corrections of the given word.
    """
    candidates_list = candidates(word)
    
    sorted_candidates = sorted(candidates_list, key=lambda x: language_model.get(x, 0), reverse=True)
    
    return sorted_candidates[:k]


### Test the correction function 

In [20]:
word = "hert"
k = 3
suggested_corrections = correction(word, k)
print("Corrections for '{}':".format(word))
for correction_word in suggested_corrections:
    print(correction_word)


Corrections for 'hert':
wert
her
herb


In [22]:
string = "sdhe askk mee abot my age adn I told  heer that I amm twenty four"

correct_string = []

for word in string.split():

    if word in vocabulary:
        correct_string.append(word)
        
    else:
        suggestions = correction(word, k=2)
        
        if suggestions:  
            best_correction = suggestions[0]  
        else:
            best_correction = word 
            
        correct_string.append(best_correction)
        
' '.join(correct_string)


'she asks met abort my age ann a told heir that a amy twenty four'