# Context-sensitive Spelling Correction

The goal of the assignment is to implement context-sensitive spelling correction. The input to the code is a set of text lines, and the output is the same lines with the spelling mistakes fixed.

Submit the solution of the assignment to Moodle as a link to your GitHub repository containing this notebook.

Useful links:
- [Norvig's solution](https://norvig.com/spell-correct.html)
- [Norvig's dataset](https://norvig.com/big.txt)
- [Ngrams data](https://www.ngrams.info/download_coca.asp)

Grading:
- 60 points - Implement spelling correction
- 20 points - Justify your decisions
- 20 points - Evaluate on a test set


## Implement context-sensitive spelling correction

Your task is to implement a context-sensitive spelling corrector using an N-gram language model. The idea is to compute the conditional probabilities of the possible corrections given the surrounding words. For example, the phrase "dking sport" should be corrected to "doing sport", not "dying sport", while "dking species" should become "dying species".
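
The "dking" example can be sketched with a toy bigram table. This is only an illustration: the counts in `TOY_BIGRAMS` are invented, and `cond_prob` and `pick` are hypothetical helper names, not part of any dataset or library. Each candidate is scored by an add-one smoothed estimate of how likely the following word is after it.

```python
from collections import Counter

# Hypothetical toy counts, invented for illustration only.
TOY_BIGRAMS = Counter({
    ("doing", "sport"): 40,
    ("dying", "sport"): 1,
    ("doing", "species"): 1,
    ("dying", "species"): 30,
})
VOCAB_SIZE = 4  # distinct words in the toy table: doing, dying, sport, species


def cond_prob(word, next_word, bigrams=TOY_BIGRAMS, vocab_size=VOCAB_SIZE):
    """Add-one smoothed estimate of P(next_word | word)."""
    total = sum(c for (w1, _), c in bigrams.items() if w1 == word)
    return (bigrams[(word, next_word)] + 1) / (total + vocab_size)


def pick(candidates, next_word):
    """Return the candidate after which `next_word` is most likely."""
    return max(candidates, key=lambda w: cond_prob(w, next_word))


print(pick(["doing", "dying"], "sport"))    # doing
print(pick(["doing", "dying"], "species"))  # dying
```

With real data the same scoring would use counts from an n-gram file and would usually be combined with the candidate's own prior probability.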

The best way to start is to analyze [Norvig's solution](https://norvig.com/spell-correct.html) and [N-gram Language Models](https://web.stanford.edu/~jurafsky/slp3/3.pdf).

While solving this task, we expect you to face (and successfully deal with) some problems, or to come up with ideas for improving the model. Some of them are:

- storing n-gram frequencies for a large corpus;
- taking into account the keyboard layout and the misspellings associated with it;
- improving efficiency to make the solution faster;
- ...

Please don't forget to describe such cases, and what you decided to do with them, in the Justification section.
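
As one possible take on the keyboard-layout point, the sketch below builds an approximate QWERTY adjacency map and uses it to weight substitutions. It is a rough illustration under stated assumptions: the rows are treated as an unstaggered grid, and `build_neighbors`, `substitution_weight`, and the `near`/`far` values are hypothetical choices rather than tuned parameters.

```python
# Letter rows of a QWERTY keyboard, treated as an unstaggered grid (an approximation).
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def build_neighbors(rows=QWERTY_ROWS):
    """Map each key to the keys in the surrounding grid cells."""
    neighbors = {}
    for r, row in enumerate(rows):
        for c, ch in enumerate(row):
            near = set()
            for dr in (-1, 0, 1):
                rr = r + dr
                if not 0 <= rr < len(rows):
                    continue
                for dc in (-1, 0, 1):
                    cc = c + dc
                    if 0 <= cc < len(rows[rr]) and (dr, dc) != (0, 0):
                        near.add(rows[rr][cc])
            neighbors[ch] = near
    return neighbors


NEIGHBORS = build_neighbors()


def substitution_weight(typed, intended, near=1.0, far=0.2):
    """Hypothetical weight: hitting a key next to the intended one is treated as more likely."""
    return near if typed in NEIGHBORS.get(intended, ()) else far


# "dking" -> "doing" replaces k with o (adjacent keys); "dking" -> "dying" replaces k with y.
print(substitution_weight("k", "o"))  # 1.0
print(substitution_weight("k", "y"))  # 0.2
```

Such a weight could multiply the language-model score of each candidate produced by a single replacement, so that typos on neighbouring keys are preferred.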

##### IMPORTANT:  
Your project should not be a mere code copy-paste from somewhere. You must provide:
- Your implementation
- An analysis of why you chose the implemented approach
- The improvements to the original approach that you have chosen to implement

In [10]:
# Your code here
import re
from collections import Counter, defaultdict
import math

class ContextSensitiveSpellCorrector:
    def __init__(self, corpus_path, bigram_path=None):
        # Load unigram corpus and build word probability distribution
        self.unigram_counts = self.load_corpus(corpus_path)
        self.total_words = sum(self.unigram_counts.values())
        self.word_probs = {w: c / self.total_words for w, c in self.unigram_counts.items()}
        
        # Load bigram frequencies if provided; expected format per line: count word1 word2
        if bigram_path:
            self.bigram_counts = self.load_bigrams(bigram_path)
        else:
            self.bigram_counts = None

    def load_corpus(self, path):
        """Load the text corpus and count word frequencies."""
        with open(path, 'r') as f:
            text = f.read().lower()
        words = re.findall(r'\w+', text)
        return Counter(words)
    
    def load_bigrams(self, path):
        """Load bigram counts from a file."""
        bigrams = defaultdict(int)
    
        with open(path, 'r', errors='ignore') as f:
            for line in f:
                parts = line.strip().split()
                if len(parts) == 3:
                    bigrams[(parts[1], parts[2])] = int(parts[0])
        return bigrams

    def edits1(self, word):
        """Return the set of words that are one edit away from 'word'."""
        letters    = 'abcdefghijklmnopqrstuvwxyz'
        splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes    = [L + R[1:]            for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:]  for L, R in splits if len(R) > 1]
        replaces   = [L + c + R[1:]        for L, R in splits if R for c in letters]
        inserts    = [L + c + R            for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)
    
    def known(self, words):
        """Filter the set to only include words present in our corpus."""
        return set(w for w in words if w in self.unigram_counts)

    def candidates(self, word):
        """Generate possible spelling corrections for word."""
        return (self.known([word]) or 
                self.known(self.edits1(word)) or 
                {word})
    
    def unigram_probability(self, word):
        """Return the probability of a word from the corpus."""
        return self.word_probs.get(word, 1e-6)
    
    def bigram_probability(self, prev, word):
        """Compute the conditional probability P(word | prev) using bigram counts.
           Falls back to the unigram probability if bigram data is missing."""
        if self.bigram_counts:
            # Build the per-word totals once instead of rescanning the vocabulary on every call
            if not hasattr(self, '_prev_totals'):
                self._prev_totals = Counter()
                for (w1, w2), count in self.bigram_counts.items():
                    if w2 in self.unigram_counts:
                        self._prev_totals[w1] += count
            total_prev = self._prev_totals[prev]
            if total_prev > 0:
                return self.bigram_counts.get((prev, word), 0) / total_prev
        return self.unigram_probability(word)
    
    def correct_sentence(self, sentence):
        """Corrects a sentence using context-sensitive bigram probabilities."""
        tokens = sentence.split()
        corrected = []
        for i, token in enumerate(tokens):
            # Consider lowercase for matching
            token_lower = token.lower()
            if i == 0:
                # For the first word, use the best candidate by unigram probability.
                best = max(self.candidates(token_lower), key=self.unigram_probability)
            else:
                prev_word = corrected[i - 1]
                # For subsequent words, select the candidate maximizing the bigram probability
                best = max(self.candidates(token_lower), key=lambda w: self.bigram_probability(prev_word, w))
            corrected.append(best)
        return " ".join(corrected)


# Paths to your corpus and bigram data
corpus_file = "big.txt"        # Norvig's corpus (downloaded separately)
bigram_file = "bigrams.txt"    # Preprocessed bigram data file

corrector = ContextSensitiveSpellCorrector(corpus_file, bigram_file)

In [12]:

# Test sentences: expected "doing sport" and "dying species"
test_sentences = [
    "dking sport",
    "dking species",
    "speling",
]

for sent in test_sentences:
    corrected = corrector.correct_sentence(sent)
    print(f"Original: {sent}\nCorrected: {corrected}\n")

Original: dking sport
Corrected: king sport

Original: dking species
Corrected: king species

Original: speling
Corrected: spelling



## Justify your decisions

Write down justifications for your implementation choices. For example, these choices could be:
- Which ngram dataset to use
- Which weights to assign to edit1, edit2, or absent-word probabilities
- Beam search parameters
- etc.
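
To make the beam search item concrete, here is a minimal sketch of how the beam width affects the result. The function `beam_search`, the scoring callback `score_fn`, and the log-probabilities in `TOY_LOGP` are all assumptions made for this illustration; the numbers are invented.

```python
import math


def beam_search(candidate_lists, score_fn, beam_width=3):
    """Keep the `beam_width` best partial sentences while moving left to right.

    candidate_lists: one list of candidate words per token position.
    score_fn(prev, word): log-probability of `word` after `prev` (prev is None at the start).
    """
    beams = [(0.0, [])]  # (cumulative log-probability, words chosen so far)
    for options in candidate_lists:
        expanded = []
        for logp, words in beams:
            prev = words[-1] if words else None
            for w in options:
                expanded.append((logp + score_fn(prev, w), words + [w]))
        # Keep only the highest-scoring hypotheses.
        expanded.sort(key=lambda item: item[0], reverse=True)
        beams = expanded[:beam_width]
    return beams[0][1]


# Invented toy scores: "king" is the most frequent word, but "doing sport" is the best pair.
TOY_LOGP = {
    (None, "king"): math.log(0.5),
    (None, "doing"): math.log(0.3),
    (None, "dying"): math.log(0.2),
    ("king", "sport"): math.log(0.01),
    ("doing", "sport"): math.log(0.4),
    ("dying", "sport"): math.log(0.01),
}


def toy_score(prev, word):
    """Look up a toy log-probability, with a small floor for unseen pairs."""
    return TOY_LOGP.get((prev, word), math.log(1e-6))


lists = [["king", "doing", "dying"], ["sport"]]
print(beam_search(lists, toy_score, beam_width=3))  # ['doing', 'sport']
print(beam_search(lists, toy_score, beam_width=1))  # ['king', 'sport']
```

A width of 1 is greedy decoding and commits to the first word before seeing its right-hand context; a wider beam keeps alternatives alive at the cost of more scoring calls, which is the trade-off to justify here.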

*Your text here...*

## Evaluate on a test set

Your task is to generate a test set and evaluate your work. You may vary the noise probability to generate different datasets with varying complexity (or just take another dataset). Compare your solution to Norvig's corrector, and report the accuracies.
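
One possible shape for such an evaluation is sketched below. It assumes substitution-only noise (each letter is replaced by a random lowercase letter with probability `noise_prob`), which keeps the number of tokens unchanged; `add_noise` and `word_accuracy` are hypothetical helper names, and `correct_fn` stands for any function that maps a sentence to a corrected sentence.

```python
import random
import string


def add_noise(sentence, noise_prob=0.1, rng=None):
    """Replace each letter with a random lowercase letter with probability `noise_prob`."""
    rng = rng or random.Random(0)
    chars = []
    for ch in sentence:
        if ch.isalpha() and rng.random() < noise_prob:
            chars.append(rng.choice(string.ascii_lowercase))
        else:
            chars.append(ch)
    return "".join(chars)


def word_accuracy(correct_fn, clean_sentences, noise_prob=0.1, seed=0):
    """Fraction of words restored to their clean form after noising and correcting."""
    rng = random.Random(seed)  # fixed seed so every corrector sees the same noisy text
    hits = total = 0
    for clean in clean_sentences:
        noisy = add_noise(clean, noise_prob, rng)
        fixed = correct_fn(noisy).split()
        target = clean.split()
        total += len(target)
        hits += sum(f == t for f, t in zip(fixed, target))
    return hits / total if total else 0.0


# With no noise, an identity "corrector" leaves every word intact.
print(word_accuracy(lambda s: s, ["doing sport", "dying species"], noise_prob=0.0))  # 1.0
```

Passing `corrector.correct_sentence` and a Norvig-style baseline as `correct_fn` with the same `seed` would give comparable accuracies. Since `correct_sentence` lowercases its output, the clean sentences should be lowercase as well.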

In [None]:
# Your code here

#### Useful resources (also included in the archive on Moodle):

1. [Possible dataset with N-grams](https://www.ngrams.info/download_coca.asp)
2. [Damerau–Levenshtein distance](https://en.wikipedia.org/wiki/Damerau–Levenshtein_distance)