# Spelling Corrector

We consider the problem of determining the most probable correction for a word not found in the dictionary. Thus, the problem is to find the correction `c`, among all possible candidate corrections, that maximizes the probability that `c` is the intended correction, given the original word `w`:

$$ \text{argmax}_{c \in \text{candidates}} P(c|w) $$

By Bayes' theorem, this is equivalent to:

$$ \text{argmax}_{c \in \text{candidates}} \frac{P(c) P(w|c)}{P(w)} $$

Since \( P(w) \) is the same for each possible candidate `c`, we can eliminate it, which gives:

$$ \text{argmax}_{c \in \text{candidates}} P(c) P(w|c) $$

The two factors in this expression are:

- **Language Model: \( P(c) \)** - The probability that `c` appears as a word in an English text.
- **Error Model: \( P(w|c) \)** - The probability that `w` was typed when the author intended to write `c`.
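
As a toy illustration of the argmax above, the sketch below scores two candidates for the typed word `thew`. The probabilities are made-up numbers chosen for illustration, not estimates from any corpus, and `best_correction` is a hypothetical helper that is not part of the notebook.

```python
def best_correction(candidates, language_model, error_model):
    """Return the candidate c that maximizes P(c) * P(w|c)."""
    return max(candidates, key=lambda c: language_model[c] * error_model[c])

# Hypothetical values for the typed word w = "thew"
language_model = {"the": 0.05, "thaw": 0.00001}  # P(c)
error_model = {"the": 0.0001, "thaw": 0.01}      # P(w|c)

best = best_correction(["the", "thaw"], language_model, error_model)
print(best)  # the
```

Even though `thaw` is the more plausible typo under these invented numbers, the much higher prior of `the` dominates the product.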


To solve this problem, we start by implementing the function `process_data(corpus_file)`, which reads the corpus from a text file, splits the text into tokens, keeps only the alphanumeric tokens, and returns them as a list of lowercase words.

In [1]:
# Importing the nltk library for corpus preprocessing
import nltk
from nltk.tokenize import word_tokenize

# Downloading the pre-trained 'punkt' tokenizer model for word segmentation
# (recent NLTK releases may require the 'punkt_tab' resource instead)
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
def process_data(corpus_file):
    """
    Process the input corpus file, segmenting it into words, converting to lowercase,
    and filtering out non-alphanumeric tokens.

    Parameters
    ----------
    corpus_file : str
        Path to the text file containing the corpus.

    Returns
    -------
    list
        A list of cleaned, lowercase words from the corpus.
    """
    # List to store the processed words
    words = []

    # Reading the corpus from the file
    with open(corpus_file, 'r', encoding='utf-8') as file:
        text = file.read()

    # Tokenizing the text using NLTK's word tokenizer
    tokens = word_tokenize(text)

    # Processing each token
    for token in tokens:
        # Filtering out non-alphanumeric tokens
        if token.isalnum():
            # Adding the token in lowercase to the list
            words.append(token.lower())

    return words
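
To make the effect of the `isalnum()` filter concrete, here is a minimal sketch that applies the same filtering step to a hand-written token list. The tokens are hypothetical and are not actual `word_tokenize` output, and `keep_alnum_lowercase` is an illustrative helper name.

```python
def keep_alnum_lowercase(tokens):
    """Keep only purely alphanumeric tokens and convert them to lowercase."""
    return [t.lower() for t in tokens if t.isalnum()]

# Hypothetical tokens, written by hand for illustration
tokens = ["The", "well-known", "Dr.", "Watson", ",", "1887", "n't"]
cleaned = keep_alnum_lowercase(tokens)
print(cleaned)  # ['the', 'watson', '1887']
```

Note that the filter drops hyphenated words, abbreviations with a period, and contraction fragments, while purely numeric tokens are kept.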

Now we will implement the function `get_vocabulary(corpus_file)` that returns the vocabulary built from a corpus passed as an argument to the function.

In [3]:
def get_vocabulary(corpus_file):
    """
    Extracts the vocabulary from a given text corpus.

    Parameters
    ----------
    corpus_file : str
        The path to the text file containing the corpus.

    Returns
    -------
    set
        A set of unique words in lowercase, containing only alphanumeric tokens.
    """
    # Reuse process_data so that tokenization and filtering stay consistent,
    # then deduplicate the words with a set
    return set(process_data(corpus_file))

We can estimate the probability of a word by counting how many times it appears in a large text corpus and dividing by the total number of words in that corpus. The function below builds the language model by computing this probability for every word in the given text file and storing the results in a dictionary. The corpus used here is the Project Gutenberg eBook of *The Adventures of Sherlock Holmes* by Arthur Conan Doyle.

The file is available from Project Gutenberg: [The Adventures of Sherlock Holmes](https://www.gutenberg.org/ebooks/1661)

In [4]:
def build_language_model(corpus_file):
    """
    Constructs a language model by calculating the probability of each word in the provided corpus.

    The probability of a word is estimated by counting its occurrences in the text and dividing
    by the total number of words in the corpus.

    Parameters
    ----------
    corpus_file : str
        The path to the text file containing the corpus.

    Returns
    -------
    dict
        A dictionary where the keys are words and the values are their respective probabilities.
    """

    # Pre-process the corpus
    words = process_data(corpus_file)

    # Count word frequencies
    word_counts = {}
    total_words = len(words)
    for word in words:
        # Increment the count, starting from 0 for a word not seen before
        word_counts[word] = word_counts.get(word, 0) + 1

    # Calculate probabilities
    language_model = {}
    for word, count in word_counts.items():
        language_model[word] = count / total_words

    return language_model
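
The same relative-frequency estimate can be sketched on an in-memory word list with `collections.Counter`. The function name `build_language_model_from_words` and the five-word list are assumptions made for illustration; this is an alternative formulation, not the notebook's implementation.

```python
from collections import Counter

def build_language_model_from_words(words):
    """Map each word to its relative frequency in the given word list."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Tiny hypothetical corpus
words = ["the", "cat", "saw", "the", "dog"]
model = build_language_model_from_words(words)
print(model["the"])  # 0.4
```

Because every count is divided by the same total, the probabilities in the model sum to 1.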

### Function Definitions

Next, we define three helper functions that generate and filter candidate strings:

1. **edits1(s)**:
   - Returns the set of all strings (whether they are words or not) that can be obtained by making a single edit (deletion, insertion, substitution, or transposition of two adjacent characters) to the string `s`.

2. **edits2(s)**:
   - Returns the set of all strings (whether they are words or not) that can be obtained by making two such edits to the string `s`.

3. **knownWord(words, corpus_file)**:
   - Keeps only the strings in `words` that appear in the dictionary and discards the rest. The set of valid words is obtained with the `get_vocabulary(corpus_file)` function.

In [5]:
def edits1(s):
    alphabet = 'abcdefghijklmnopqrstuvwxyz'

    # Split the string into all possible pairs of prefixes and suffixes
    splits = [(s[:i], s[i:]) for i in range(len(s) + 1)]

    # Generate words after deleting each letter
    deletes = [a + b[1:] for a, b in splits if b]

    # Generate words after inserting each letter of the alphabet at every position
    inserts = [a + c + b for a, b in splits for c in alphabet]

    # Generate words after replacing each letter with every letter of the alphabet
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]

    # Generate words after transposing two letters in the word
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]

    # Return a set of unique words
    return set(deletes + inserts + replaces + transposes)
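
To get a feel for how many strings a single edit produces, the sketch below counts the strings generated per edit type, duplicates included, using the same prefix/suffix splits as `edits1`. The helper `edit_counts` is hypothetical. For a string of length n ≥ 1 over a 26-letter alphabet this gives n deletions, n − 1 transpositions, 26n replacements, and 26(n + 1) insertions, that is 54n + 25 strings before deduplication.

```python
import string

def edit_counts(s):
    """Count the strings generated per edit type, duplicates included."""
    alphabet = string.ascii_lowercase
    splits = [(s[:i], s[i:]) for i in range(len(s) + 1)]
    return {
        "deletes": sum(1 for _, b in splits if b),
        "transposes": sum(1 for _, b in splits if len(b) > 1),
        "replaces": sum(len(alphabet) for _, b in splits if b),
        "inserts": len(splits) * len(alphabet),
    }

counts = edit_counts("cat")
total = sum(counts.values())
print(counts)
print(total)  # 187, which equals 54 * 3 + 25
```

The set returned by `edits1` is smaller than this total because different edits can produce the same string.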

In [6]:
def edits2(word):
    # Return the set of all words that are two edits away from the given word
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

In [7]:
def knownWord(words, corpus_file):
    # Get the set of valid words from the dictionary
    vocabulary = get_vocabulary(corpus_file)
    return set(w for w in words if w in vocabulary)
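
Because `knownWord` rebuilds the vocabulary from the corpus file on every call, one possible variation is to build the vocabulary once and pass it in as a set. The sketch below assumes a hypothetical helper `known_words` and a toy vocabulary; it is a design alternative, not the function used in the rest of this notebook.

```python
def known_words(words, vocabulary):
    """Return the subset of `words` that appears in a prebuilt vocabulary set."""
    return {w for w in words if w in vocabulary}

# Toy vocabulary, standing in for the result of get_vocabulary(corpus_file)
vocabulary = {"hello", "help", "held"}
result = known_words(["hello", "helo", "help"], vocabulary)
print(sorted(result))  # ['hello', 'help']
```

Membership tests on a set take constant time on average, so the cost of filtering grows with the number of candidate strings rather than with the size of the vocabulary.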

Since we have no data from which to build an error model, we adopt a simplifying assumption: every known word at an edit distance of 1 is infinitely more probable than any known word at an edit distance of 2, and infinitely less probable than a known word at an edit distance of 0. To select the most probable candidates, we therefore first prioritize them by their edit distance from the original word, and then rank the candidates within the chosen priority level using the language model built earlier. With this simplification, we do not need to multiply by a factor \( P(w|c) \), since all candidates at the chosen priority level are treated as having the same error probability.

We will create the function `candidates(word, corpus_file)` that returns the first non-empty list of candidates in order of priority:
- The original word, if it is known; otherwise,
- The list of known words at an edit distance of one, if there are any; otherwise,
- The list of known words at an edit distance of two, if there are any; otherwise,
- The original word, even if it is not known.
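
The priority rule above amounts to picking the first non-empty group from an ordered sequence of candidate groups. Here is a minimal sketch on hand-written groups; `first_non_empty` and the example sets are hypothetical and only illustrate the selection logic.

```python
def first_non_empty(*candidate_groups):
    """Return the first group that contains at least one candidate."""
    for group in candidate_groups:
        if group:
            return group
    return set()

# Hypothetical groups for a misspelled word
exact_match = set()                # the word itself is not known
distance_one = {"hello"}           # known words one edit away
distance_two = {"help", "held"}    # known words two edits away
fallback = {"heloo"}               # the original word

chosen = first_non_empty(exact_match, distance_one, distance_two, fallback)
print(chosen)  # {'hello'}
```

Candidates at an edit distance of 2 are ignored as soon as a known word at distance 1 exists, however frequent they are in the corpus.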

In [8]:
def candidates(word, corpus_file):
    """
    Generates a list of probable candidates for the given word based on edit distance.

    The function checks if the original word is known, and if not, it looks for words
    with an edit distance of 1 or 2 that are in the vocabulary obtained from the corpus.

    Parameters
    ----------
    word : str
        The word for which to find candidates.
    corpus_file : str
        The path to the text file used as the corpus to build the vocabulary.

    Returns
    -------
    list
        The first non-empty list among, in order of priority:
        - The original word, if it is known.
        - Known words at an edit distance of 1.
        - Known words at an edit distance of 2.
        - The original word, even if it is not known.
    """
    vocabulary = get_vocabulary(corpus_file)

    # Priority 1: the original word, if it is known
    if word in vocabulary:
        return [word]

    # Priority 2: known words at an edit distance of 1
    valid_e1 = knownWord(edits1(word), corpus_file)
    if valid_e1:
        return list(valid_e1)

    # Priority 3: known words at an edit distance of 2
    valid_e2 = knownWord(edits2(word), corpus_file)
    if valid_e2:
        return list(valid_e2)

    # Fallback: the original word, even though it is not known
    return [word]

Using the previous functions, we will write the function `correction(word, k, corpus_file)` which returns the `k` most probable corrections for the word `word`.

In [9]:
def correction(word, k, corpus_file):
    """
    Returns the k most probable corrections for the given word.

    Parameters
    ----------
    word : str
        The word to correct.
    k : int
        The number of top corrections to return.
    corpus_file : str
        The path to the corpus file used for building the language model.

    Returns
    -------
    list of str
        A list of the k most probable corrections for the word.
    """
    # Get candidates for the given word
    candidates_list = candidates(word, corpus_file)

    # Calculate scores for each candidate using the language model
    language_model = build_language_model(corpus_file)
    candidate_scores = {candidate: language_model.get(candidate, 0) for candidate in candidates_list}

    # Sort candidates by descending score
    sorted_candidates = sorted(candidate_scores, key=candidate_scores.get, reverse=True)

    # Return the top k candidates
    return sorted_candidates[:k]
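
The ranking step can also be expressed with `heapq.nlargest`, which avoids sorting the full candidate list when only the top `k` entries are needed. This is a sketch under assumed inputs: `top_k` is a hypothetical helper and the probabilities are invented for illustration.

```python
import heapq

def top_k(candidates, language_model, k):
    """Return the k candidates with the highest language-model probability."""
    return heapq.nlargest(k, candidates, key=lambda c: language_model.get(c, 0.0))

# Invented probabilities; "helq" is absent from the model and scores 0
language_model = {"held": 0.004, "help": 0.003, "hero": 0.001}
best_two = top_k({"hero", "help", "held", "helq"}, language_model, 2)
print(best_two)  # ['held', 'help']
```

As in `correction`, candidates missing from the language model receive a score of 0 and therefore rank last.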

In [10]:
# Example Usage
corrections = correction('heloo', 5, 'corpus.txt')
print(corrections)

['held', 'help', 'below', 'hero', 'helen']
