# Context-sensitive Spelling Correction

The goal of the assignment is to implement context-sensitive spelling correction. The input of the code will be a set of text lines and the output will be the same lines with spelling mistakes fixed.

Submit the solution of the assignment to Moodle as a link to your GitHub repository containing this notebook.

Useful links:
- [Norvig's solution](https://norvig.com/spell-correct.html)
- [Norvig's dataset](https://norvig.com/big.txt)
- [Ngrams data](https://www.ngrams.info/download_coca.asp)

Grading:
- 60 points - Implement spelling correction
- 20 points - Justify your decisions
- 20 points - Evaluate on a test set


## Implement context-sensitive spelling correction

Your task is to implement context-sensitive spelling corrector using N-gram language model. The idea is to compute conditional probabilities of possible correction options. For example, the phrase "dking sport" should be fixed as "doing sport" not "dying sport", while "dking species" -- as "dying species".

The best way to start is to analyze [Norvig's solution](https://norvig.com/spell-correct.html) and [N-gram Language Models](https://web.stanford.edu/~jurafsky/slp3/3.pdf).

When solving this task, we expect you'll face (and successfully deal with) some problems or make up the ideas of the model improvement. Some of them are: 

- solving a problem of n-grams frequencies storing for a large corpus;
- taking into account keyboard layout and associated misspellings;
- efficiency improvement to make the solution faster;
- ...

Please don't forget to describe such cases, and what you decided to do with them, in the Justification section.

##### IMPORTANT:  
Your project should not be a mere code copy-paste from somewhere. You must provide:
- Your implementation
- Analysis of why the implemented approach is suggested
- Improvements of the original approach that you have chosen to implement

In [118]:
import re
import numpy as np
import json
import random
random.seed(26)

In [2]:
def process_corpus(corpus_filename):
    with open(corpus_filename) as f:
        corpus = f.read()
        # find words
        lowercased_corpus = corpus.lower()
        all_words = re.findall(r'\w+', lowercased_corpus)
        for w in all_words:
            if w == 'I':
                print('Found')
                break
        unique_words = set(all_words)
    return all_words, unique_words

def get_words_frequencies(all_words):
    word_freq_dict = {}
    for word in all_words:
        if word in word_freq_dict:
            word_freq_dict[word] += 1
        else:
            word_freq_dict[word] = 1
    return word_freq_dict

def get_word_prob(word, all_words, word_freq_dict):
    # check that the word exista in the vocabulary
    if word in word_freq_dict:
        return word_freq_dict[word] / len(all_words)
    
    return 0

def add_char(word):
    words_with_char_added = []
    possible_chars = 'qwertyuiopasdfghjklzxcvbnm'
    for i in range(len(word)):
        for char in possible_chars:
            words_with_char_added.append(word[:i] + char + word[i:])
        words_with_char_added.append(word + char)
    return words_with_char_added

def delete_char(word):
    words_with_char_deleted = []
    for i in range(len(word)):
        words_with_char_deleted.append(word[:i] + word[i+1:])
    return words_with_char_deleted

def replace_char(word):
    words_with_char_replaced = []
    possible_chars = 'qwertyuiopasdfghjklzxcvbnm'
    for i in range(len(word)):
        for char in possible_chars:
            new_word = word[:i] + char + word[i+1:]
            words_with_char_replaced.append(new_word)
    return words_with_char_replaced

def swap_chars(word):
    words_with_chars_swapped = []
    for i in range(len(word)-1):
        new_word = word[:i] + word[i+1] + word[i] + word[i+2:]
        words_with_chars_swapped.append(new_word)
    return words_with_chars_swapped

def filter_existent_words(words, vocabulary):
    return [word for word in words if word in vocabulary]

def check_existence(word, vocabulary):
    if word in vocabulary:
        return True
    return False

def generate_candidates_edit_1(word):
    candidates = []
    words_with_char_added = add_char(word)
    #existent_words_with_char_added = filter_existent_words(words_with_char_added, vocabulary)
    words_with_char_deleted = delete_char(word)
    #existent_words_with_char_deleted = filter_existent_words(words_with_char_deleted, vocabulary)
    words_with_char_replaced = replace_char(word)
    #existent_words_with_char_replaced = filter_existent_words(words_with_char_replaced, vocabulary)
    words_with_chars_swapped = swap_chars(word)

    candidates.extend(words_with_char_added)
    candidates.extend(words_with_char_deleted)
    candidates.extend(words_with_char_replaced)
    candidates.extend(words_with_chars_swapped)
    unique_candidate_words = set(candidates)
    if ('corrected' in unique_candidate_words):
        print('Corrected found')
    return unique_candidate_words


In [3]:
all_words, unique_words = process_corpus('big.txt')
print(f"Num of words in corpus: {len(all_words)}")
print(f"Num of unique words in corpus: {len(unique_words)}")

Num of words in corpus: 1115585
Num of unique words in corpus: 32198


In [4]:
word_freq = get_words_frequencies(all_words)
word_freq['cat']

10

In [5]:
def correct_word_simple(word, vocabulary):
    word = word.lower()
    if check_existence(word, vocabulary):
        return word
    unique_candidates_edit_1 = generate_candidates_edit_1(word)
    candidates_edit_2 = []
    for candidate in unique_candidates_edit_1:
        new_cadidates_edit_2 = generate_candidates_edit_1(candidate)
        candidates_edit_2.extend(new_cadidates_edit_2)
    if 'corrected' in candidates_edit_2:
        print('Corrected found')
    unique_candidates_edit_2 = set(candidates_edit_2)

    all_candidates = []
    unique_candidates_edit_1_existent = filter_existent_words(unique_candidates_edit_1, vocabulary)
    unique_candidates_edit_2_existent = filter_existent_words(unique_candidates_edit_2, vocabulary)
    for candidate in unique_candidates_edit_1_existent:
        all_candidates.append((candidate, 1))
    for candidate in unique_candidates_edit_2_existent:
        all_candidates.append((candidate, 2))
    unique_candidates = set(all_candidates)

    # sort unique_candidates by the distance and the probability of the word
    sorted_candidates = sorted(unique_candidates, key=lambda x: (x[1], -get_word_prob(x[0], all_words, word_freq)))
    if len(sorted_candidates) > 0:
        best_candidate = sorted_candidates[0]
    else:
        best_candidate = (word, 0)

    return best_candidate[0]

def correct_text(given_text):
    found_words = re.finditer(r'\b\w+\b', given_text)
    cur_idx = 0
    corrected_text = []
    for cur_word_with_boundaries in found_words:
        word = cur_word_with_boundaries.group()
        start_idx, end_idx = cur_word_with_boundaries.span()
        corrected_word = correct_word_simple(word, unique_words)
        # to save the spaces and punctuation
        corrected_text.append(given_text[cur_idx:start_idx])
        # if the word's characters are all UPPER
        if word.isupper():
            corrected_word = corrected_word.upper()
        # if the first letter is in upper case
        elif word.istitle():
            corrected_word = corrected_word.capitalize()
    
        corrected_text.append(corrected_word)
        cur_idx = end_idx
    corrected_text.append(given_text[cur_idx:])
    corrected_text_result = ''.join(corrected_text)
        
    return corrected_text_result


In [6]:
corrected_text = correct_text("I am a cat7")
print(corrected_text)

I am a cat


In [7]:
correct_word_simple('speling', unique_words)

'spelling'

In [8]:
text = 'It is speling correction task.'
corrected_text = correct_text(text)
print(corrected_text)

It is spelling correction task.


In [9]:
text_example = 'dking sport'
correct_text(text_example)

'king sport'

### Trying bigram

In [128]:
def process_1_word_freq(filename):
    with open(filename) as f:
        word_freq = {}
        for line in f:
            word, freq = line.strip().split()
            word = word.lower()
            word_freq[word] = int(freq)
    return word_freq

def process_2_word_freq(filename):
    with open(filename) as f:
        word_freq = {}
        for line in f:
            word1, word2, freq = line.strip().split()
            word1 = word1.lower()
            word2 = word2.lower()
            bigram = word1 + ' ' + word2
            word_freq[bigram] = int(freq)
    return word_freq

# without Laplase smoothing (I commented it)
# def calculate_bigram_prob(prev_word, cur_word, bigram_freq, single_word_freq):
#     lowered_prev_word = prev_word.lower()
#     lowered_cur_word = cur_word.lower()
#     bigram = lowered_prev_word + ' ' + lowered_cur_word
#     total_single_word_freq = sum(single_word_freq.values())
#     if bigram in bigram_freq:
#         if lowered_prev_word in single_word_freq:
#             return bigram_freq[bigram] / single_word_freq[lowered_prev_word]
#         else:
#             return bigram_freq[bigram] / total_single_word_freq
#     else:
#         if lowered_cur_word in single_word_freq:
#             return single_word_freq[lowered_cur_word] / total_single_word_freq
#         else:
#             return 0
        
# adding Laplase smoothing
def calculate_bigram_prob(prev_word, cur_word, bigram_freq, single_word_freq):
    lowered_prev_word = prev_word.lower()
    lowered_cur_word = cur_word.lower()
    bigram = lowered_prev_word + ' ' + lowered_cur_word
    bigram_count = bigram_freq.get(bigram, 0)
    prev_word_count = single_word_freq.get(lowered_prev_word, 0)
    smoothed_prob = (bigram_count + 1)/ (prev_word_count + len(unique_words))
    return smoothed_prob


        

def calculate_word_sequence_prob(words, bigram_freq, single_word_freq, prev_token = '<S>', edit_distance=1):
    result = 0
    for i in range(len(words)):
        if i==0:
            prob = calculate_bigram_prob(prev_token, words[i], bigram_freq, single_word_freq)
        else:
            prob = calculate_bigram_prob(words[i-1], words[i], bigram_freq, single_word_freq)
        if prob == 0:
            prob = 1e-10
        result+= np.log(prob)
        
        # penalize for number of corrections
        result = result - 0.05*edit_distance
        # print(result)
    return result


In [95]:
single_word_freq = process_1_word_freq('count_1w.txt')
bigram_freq = process_2_word_freq('count_2w.txt')

In [96]:
prob = calculate_bigram_prob('the', 'cat', bigram_freq, single_word_freq)
print(prob)

7.259822215839525e-05


In [97]:
prob = calculate_bigram_prob('t', 'cot', bigram_freq, single_word_freq)
print(prob)

2.5741336593573144e-09


In [98]:
word_seq_prob = calculate_word_sequence_prob(['the', 'cat'], bigram_freq, single_word_freq)
print(word_seq_prob)

-0.6398877749638754


In [99]:
word_seq_prob = calculate_word_sequence_prob(['t', 'cot'], bigram_freq, single_word_freq)
print(word_seq_prob)

-12.593434927385083


In [None]:
# def correct_word_bigram(given_word, given_text, given_word_idx):
#     print(given_text)
#     given_word = given_word.lower()
#     if check_existence(given_word, unique_words):
#         print('Given word is in the vocabulary')
#         return given_word
#     all_candidates_with_edit_dist = []
#     unique_candidates_edit_1 = generate_candidates_edit_1(given_word)
#     print('unique_candidates_edit_1', unique_candidates_edit_1)
    
#     candidates_edit_2 = []
#     for candidate in unique_candidates_edit_1:
#         new_cadidates_edit_2 = generate_candidates_edit_1(candidate)
#         candidates_edit_2.extend(new_cadidates_edit_2)
#     unique_candidates_edit_2 = set(candidates_edit_2)
#     print('unique_candidates_edit_2', len(unique_candidates_edit_2))

#     for candidate in unique_candidates_edit_1:
#         all_candidates_with_edit_dist.append((candidate, 1))
#     for candidate in unique_candidates_edit_2:
#         all_candidates_with_edit_dist.append((candidate, 2))

#     all_unique_candidates_with_edit_dist = set(all_candidates_with_edit_dist)
#     all_unique_candidates_with_edit_dist_existent = filter_existent_words(all_unique_candidates_with_edit_dist, unique_words)
#     print('all_unique_candidates_with_edit_dist_existent', all_unique_candidates_with_edit_dist_existent)
    
#     # find best candidate
#     new_probabilities = []
#     for (candidate, edit_dist) in all_unique_candidates_with_edit_dist_existent:
#         new_word_sequence = list(given_text.copy())
#         new_word_sequence[given_word_idx] = candidate
#         print('new_word_sequence', new_word_sequence)
#         prob = calculate_word_sequence_prob(new_word_sequence, bigram_freq, single_word_freq, edit_distance=edit_dist)
#         new_probabilities.append(prob)
#     if len(all_unique_candidates_with_edit_dist_existent) > 0:
#         best_candidate = list(all_unique_candidates_with_edit_dist_existent)[new_probabilities.index(max(new_probabilities))]
#     else:
#         best_candidate = (given_word, 0)
#     return best_candidate[0][0]

# def correct_text_bigram(given_text):
#     found_words = re.finditer(r'\b\w+\b', given_text)
#     cur_idx = 0
#     corrected_text = []
#     cur_word_idx = 0
#     for cur_word_with_boundaries in found_words:
#         cur_word_idx += 1
#         word = cur_word_with_boundaries.group()
#         start_idx, end_idx = cur_word_with_boundaries.span()
#         corrected_word = correct_word_bigram(word, given_text, cur_idx)
#         # to save the spaces and punctuation
#         corrected_text.append(given_text[cur_idx:start_idx])
#         # if the word's characters are all UPPER
#         if word.isupper():
#             corrected_word = corrected_word.upper()
#         # if the first letter is in upper case
#         elif word.istitle():
#             corrected_word = corrected_word.capitalize()
    
#         corrected_text.append(corrected_word)
#         cur_idx = end_idx
#     corrected_text.append(given_text[cur_idx:])
#     corrected_text_result = ''.join(corrected_text)
        
#     return corrected_text_result


In [100]:
import re

def correct_word_bigram(given_word, given_text_tokens, given_word_idx):
    print("Processing word:", given_word)
    given_word_lower = given_word.lower()
    
    # If the word is already correct, return it
    if check_existence(given_word_lower, unique_words):
        print('Given word is in the vocabulary')
        return given_word

    # Generate candidate corrections
    all_candidates_with_edit_dist = []
    unique_candidates_edit_1 = generate_candidates_edit_1(given_word_lower)
    unique_candidates_edit_1_existent = filter_existent_words(unique_candidates_edit_1, unique_words)

    for candidate in unique_candidates_edit_1_existent:
        all_candidates_with_edit_dist.append((candidate, 1))
    
    candidates_edit_2 = []
    for candidate in unique_candidates_edit_1:
        new_candidates_edit_2 = generate_candidates_edit_1(candidate)
        candidates_edit_2.extend(new_candidates_edit_2)
    
    unique_candidates_edit_2_existent = filter_existent_words(set(candidates_edit_2), unique_words)
    for candidate in unique_candidates_edit_2_existent:
        all_candidates_with_edit_dist.append((candidate, 2))
    
    all_unique_candidates_with_edit_dist = set(all_candidates_with_edit_dist)

    if not all_unique_candidates_with_edit_dist:
        return given_word 
    # Find the best correction based on probability
    new_probabilities = []
    for (candidate, edit_dist) in all_unique_candidates_with_edit_dist:
        new_word_sequence = given_text_tokens.copy()
        new_word_sequence[given_word_idx] = candidate
        prob = calculate_word_sequence_prob(new_word_sequence, bigram_freq, single_word_freq, edit_distance=edit_dist)
        new_probabilities.append(prob)
    best_candidate = list(all_unique_candidates_with_edit_dist)[new_probabilities.index(max(new_probabilities))][0]


    # Preserve capitalization
    if given_word.isupper():
        return best_candidate.upper()
    elif given_word.istitle():
        return best_candidate.capitalize()
    else:
        return best_candidate

def correct_text_bigram(given_text):
    found_words = list(re.finditer(r'\b\w+\b', given_text))
    corrected_text = []
    cur_idx = 0

    for idx, match in enumerate(found_words):
        word = match.group()
        start, end = match.span()

        # Append text before the word (punctuation, spaces, etc.)
        corrected_text.append(given_text[cur_idx:start])

        corrected_word = correct_word_bigram(word, [m.group() for m in found_words], idx)
        
        # Append corrected word
        corrected_text.append(corrected_word)

        # Update index to the end of the current word
        cur_idx = end

    # Append any remaining text (punctuation, spaces after the last word)
    corrected_text.append(given_text[cur_idx:])

    return "".join(corrected_text)

In [101]:
correct_text('I am a cat7. Hello!')

'I am a cat. Hello!'

In [102]:
correct_text('dking sport')

'king sport'

In [103]:
correct_text('dking species')

'king species'

In [104]:
word='dking'
candidates = generate_candidates_edit_1(word)
print(f"Candidates for '{word}': {candidates}")

Candidates for 'dking': {'dkinyg', 'dving', 'doing', 'uking', 'dkaing', 'dkwing', 'dkidg', 'sking', 'daking', 'dkiny', 'dtking', 'dkinvg', 'dkifg', 'dkinz', 'dkitng', 'dkong', 'dkingg', 'hking', 'dkixng', 'djking', 'xking', 'dkivng', 'dkilng', 'dning', 'dkiung', 'dkinx', 'dling', 'dying', 'dkinc', 'dkiyng', 'dkeing', 'dkixg', 'dkijng', 'dkicg', 'bdking', 'dkinl', 'dkxing', 'dfing', 'dkibng', 'dkieg', 'dkinw', 'gdking', 'dkiqng', 'bking', 'dkipg', 'dkiang', 'dkcng', 'kking', 'mdking', 'dcking', 'dikng', 'diing', 'dkinig', 'fking', 'dkpng', 'dqking', 'wking', 'zking', 'dkina', 'dkind', 'dcing', 'dvking', 'dkinpg', 'edking', 'ldking', 'yking', 'dkinsg', 'dkigng', 'dkine', 'dkning', 'dfking', 'dgking', 'dkinjg', 'dkang', 'dkisng', 'dkinag', 'dkiag', 'dkyng', 'dwing', 'dging', 'dkfng', 'fdking', 'dkindg', 'rdking', 'dkinb', 'cdking', 'dkizg', 'dknng', 'idking', 'dkink', 'dkding', 'dkilg', 'xdking', 'dkting', 'dxking', 'dkqing', 'dkino', 'dkping', 'duking', 'dsing', 'dbing', 'dkinkg', 'dkrng

In [105]:
print("Bigram count:", bigram_freq.get("dying sport", 0))
print("Bigram count:", bigram_freq.get("dying species", 0))
print("Single word count:", single_word_freq.get("dying", 0))

Bigram count: 0
Bigram count: 0
Single word count: 9123557


### Trying trigrams
Dataset with trigrams
https://calmcode.io/datasets/english_3grams

## Justify your decisions

Write down justificaitons for your implementation choices. For example, these choices could be:
- Which ngram dataset to use
- Which weights to assign for edit1, edit2 or absent words probabilities
- Beam search parameters
- etc.

*Your text here...*

### Difficulties

multiplication of probabilities fastly becomes 0 => use sum of logarithms
for unseen words the probabilities are similar as for 'dying species' and 'dyong sport'

#### Difficulties
**Capturing context**

- moving from unigrams to bigrams
- moving from **bigrams** to **trigrams**

With bigrams for phrases of 2 words the context is not captured. In the given example, for the word `dking` we just see the start token `<S>` and do not see the next word: `sport` or `species`. Therefore, I decided to use trigrams.

- no trigram 
### Ideas
- backoff
- keyboard layout
- dataset larger
- forward and backward
- несколькр слов подряд некорректных - заменять на скорректированное
- использовать стеммы?
- добавить swap
- стемминг


Вместо big можно вот этот попробовать https://www.kaggle.com/datasets/ironicninja/coca-dataset?select=COCA_tokens.csv


## Evaluate on a test set

Your task is to generate a test set and evaluate your work. You may vary the noise probability to generate different datasets with varying compexity (or just take another dataset). Compare your solution to the Norvig's corrector, and report the accuracies.

### Comparing my unigram model with Norwigs

1. I noticed the difference in the training corpus. 

In [106]:
len(unique_words)

32198

In [107]:
len(all_words)

1115585

In [24]:
word_freq['the']

79809

In [25]:
get_word_prob('quintessential', all_words, word_freq)

0

In [26]:
get_word_prob('the', all_words, word_freq)

0.07154004401278254

In [27]:
# Your code here
# Norvig tests
def unit_tests():
    assert correct_text('speling') == 'spelling'              # insert
    assert correct_text('korrectud') == 'corrected'           # replace 2
    assert correct_text('bycycle') == 'bicycle'               # replace
    assert correct_text('inconvient') == 'inconvenient'       # insert 2
    assert correct_text('arrainged') == 'arranged'            # delete
    assert correct_text('peotry') =='poetry'                  # transpose
    assert correct_text('peotryy') =='poetry'                 # transpose + delete
    assert correct_text('word') == 'word'                     # known
    assert correct_text('quintessential') == 'quintessential' # unknown
    # assert process_corpus('This is a TEST.') == ['this', 'is', 'a', 'test']
    # assert len(unique_words) == 32192
    # assert len(all_words) == 1115504
    # assert all_words.most_common(10) == [
    #  ('the', 79808),
    #  ('of', 40024),
    #  ('and', 38311),
    #  ('to', 28765),
    #  ('in', 22020),
    #  ('a', 21124),
    #  ('that', 12512),
    #  ('he', 12401),
    #  ('was', 11410),
    #  ('it', 10681)]
    # assert all_words['the'] == 79808
    assert get_word_prob('quintessential', all_words, word_freq) == 0
    assert 0.07 < get_word_prob('the', all_words, word_freq) < 0.08
    return 'unit_tests pass'

unit_tests()

Corrected found
Corrected found
Corrected found


'unit_tests pass'

In [89]:
def spelltest(tests, verbose=True):
    "Run correction(wrong) on all (right, wrong) pairs; report results."
    import time
    start = time.time()
    good, unknown = 0, 0
    n = len(tests)
    for right, wrong in tests:
        w = correct_word_simple(wrong, unique_words)
        good += (w == right)
        if w != right:
            unknown += (right not in unique_words)
            if verbose:
                print('correction({}) => {} ({}); expected {} ({})'
                      .format(wrong, w, word_freq.get(w, 0), right, word_freq.get(right, 0)))
    dt = time.time() - start
    print('{:.0%} of {} correct ({:.0%} unknown) at {:.0f} words per second '
          .format(good / n, n, unknown / n, n / dt))
    
def Testset(lines):
    "Parse 'right: wrong1 wrong2' lines into [('right', 'wrong1'), ('right', 'wrong2')] pairs."
    return [(right, wrong)
            for (right, wrongs) in (line.split(':') for line in lines)
            for wrong in wrongs.split()]

print(unit_tests())
spelltest(Testset(open('spell-testset1.txt')))

Corrected found
Corrected found
Corrected found
unit_tests pass
correction(contende) => contend (3); expected contented (13)
correction(contended) => contended (9); expected contented (13)
correction(proplen) => people (891); expected problem (71)
correction(guic) => guns (111); expected juice (5)
correction(juce) => june (44); expected juice (5)
correction(jucie) => julie (71); expected juice (5)
correction(juise) => guise (8); expected juice (5)
correction(juse) => just (767); expected juice (5)
correction(localy) => local (181); expected locally (10)
correction(compair) => company (190); expected compare (29)
correction(transportibility) => transportibility (0); expected transportability (0)
correction(miniscule) => miniscule (0); expected minuscule (0)
correction(poartry) => party (298); expected poetry (10)
correction(stanerdizing) => stanerdizing (0); expected standardizing (0)
correction(futher) => father (533); expected further (138)
correction(biscutes) => disputes (27); expec

In [129]:
def spelltest(tests, verbose=True):
    "Run correction(wrong) on all (right, wrong) pairs; report results."
    import time
    start = time.time()
    good, unknown = 0, 0
    n = len(tests)
    for right, wrong in tests:
        w = correct_text_bigram(wrong)
        good += (w == right)
        if w != right:
            unknown += (right not in unique_words)
            if verbose:
                print('correction({}) => {} ({}); expected {} ({})'
                      .format(wrong, w, word_freq.get(w, 0), right, word_freq.get(right, 0)))
    dt = time.time() - start
    print('{:.0%} of {} correct ({:.0%} unknown) at {:.0f} words per second '
          .format(good / n, n, unknown / n, n / dt))
    
def Testset(lines):
    "Parse 'right: wrong1 wrong2' lines into [('right', 'wrong1'), ('right', 'wrong2')] pairs."
    return [(right, wrong)
            for (right, wrongs) in (line.split(':') for line in lines)
            for wrong in wrongs.split()]

print(unit_tests())
spelltest(Testset(open('spell-testset1.txt')))

Corrected found
Corrected found
Corrected found
unit_tests pass
Processing word: contenpted
Processing word: contende
correction(contende) => content (29); expected contented (13)
Processing word: contended
Given word is in the vocabulary
correction(contended) => contended (9); expected contented (13)
Processing word: contentid
correction(contentid) => content (29); expected contented (13)
Processing word: begining
Processing word: problam
correction(problam) => program (43); expected problem (71)
Processing word: proble
correction(proble) => people (891); expected problem (71)
Processing word: promblem
Processing word: proplen
correction(proplen) => people (891); expected problem (71)
Processing word: dirven
correction(dirven) => given (364); expected driven (66)
Processing word: exstacy
correction(exstacy) => eustace (1); expected ecstasy (8)
Processing word: ecstacy
Processing word: guic
correction(guic) => music (56); expected juice (5)
Processing word: juce
correction(juce) => suc

In [130]:
def add_errors(correct_text, error_rate = 0.2):
    found_words = list(re.finditer(r'\b\w+\b', correct_text))
    corrupted_text = []
    cur_idx = 0

    num_of_errors = int(len(found_words) * error_rate)
    error_indices = random.sample(range(len(found_words)), num_of_errors)

    for idx, match in enumerate(found_words):
        word = match.group()
        start, end = match.span()

        # Append text before the word (punctuation, spaces, etc.)
        corrupted_text.append(correct_text[cur_idx:start])

        # Introduce errors only for selected words
        if idx in error_indices and len(word) > 1:
            error_type = random.choice(["add", "delete", "replace", "swap"])
            if error_type == "add":
                corrupted_word = random.choice(add_char(word))
            elif error_type == "delete":
                corrupted_word = random.choice(delete_char(word)) if len(word) > 2 else word
            elif error_type == "replace":
                corrupted_word = random.choice(replace_char(word))
            elif error_type == "swap":
                corrupted_word = random.choice(swap_chars(word))
        else:
            corrupted_word = word 

        # Append corrupted word
        corrupted_text.append(corrupted_word)

        # Update index to the end of the current word
        cur_idx = end

    # Append any remaining text (punctuation, spaces after the last word)
    corrupted_text.append(correct_text[cur_idx:])

    return "".join(corrupted_text), num_of_errors


# text fragment from "The Lord of The Rings"
test_text = "This tale grew in the telling, until it became a history of the Great War of the Ring and included many glimpses of the yet more ancient history that preceded it. It was begun soon after The Hobbit was written and before its publication in 1937; but I did not go on with this sequel, for I wished first to complete and set in order the myth- ology and legends of the Elder Days, which had then been taking shape for some years. I desired to do this for my own satisfaction, and I had little hope that other people would be interested in this work, especially since it was primarily linguistic in inspiration and was begun in order to provide the necessary background of ‘history’ for Elvish tongues."
test_text_with_errors, added_error_num = add_errors(test_text)


def calculate_word_accuracy(original, corrupted, corrected):
    initial_words = re.findall(r'\b\w+\b', original)
    words_with_errors = re.findall(r'\b\w+\b', corrupted)
    corrected_words = re.findall(r'\b\w+\b', corrected)
    correctly_corrected_words_count = 0
    possible_corrected_words = 0

    for initial_word, word_with_error, corrected_word in zip(initial_words, words_with_errors, corrected_words):
        if initial_word in unique_words:
            possible_corrected_words+=1
            if initial_word == corrected_word and initial_word != word_with_error:
                correctly_corrected_words_count+=1
    
    if possible_corrected_words == 0:
        return 1
    else:
        return correctly_corrected_words_count / possible_corrected_words

corrected_unigram = correct_text(test_text_with_errors)
corrected_bigram = correct_text_bigram(test_text_with_errors)
print("Unigram accuracy:", calculate_word_accuracy(test_text, test_text_with_errors, corrected_unigram))
print("Bigram accuracy:", calculate_word_accuracy(test_text, test_text_with_errors, corrected_bigram))
    

Processing word: This
Given word is in the vocabulary
Processing word: tale
Given word is in the vocabulary
Processing word: grew
Given word is in the vocabulary
Processing word: in
Given word is in the vocabulary
Processing word: tjhe
Processing word: telling
Given word is in the vocabulary
Processing word: until
Given word is in the vocabulary
Processing word: it
Given word is in the vocabulary
Processing word: ebcame
Processing word: a
Given word is in the vocabulary
Processing word: history
Given word is in the vocabulary
Processing word: of
Given word is in the vocabulary
Processing word: the
Given word is in the vocabulary
Processing word: Great
Given word is in the vocabulary
Processing word: War
Given word is in the vocabulary
Processing word: of
Given word is in the vocabulary
Processing word: the
Given word is in the vocabulary
Processing word: Ring
Given word is in the vocabulary
Processing word: aznd
Processing word: included
Given word is in the vocabulary
Processing word:

In [131]:
test_sentence_1 = "The autumn season brought colorful leaves as artists prepared for an annual exhibition showcasing contemporary artwork."
# fix the random seed
test_sentence_1_with_errors, added_error_num = add_errors(test_sentence_1)
print(test_sentence_1_with_errors)

corrected_unigram = correct_text(test_sentence_1_with_errors)
corrected_bigram = correct_text_bigram(test_sentence_1_with_errors)

print("Unigram accuracy:", calculate_word_accuracy(test_sentence_1, test_sentence_1_with_errors, corrected_unigram))
print("Bigram accuracy:", calculate_word_accuracy(test_sentence_1, test_sentence_1_with_errors, corrected_bigram))

The autumn season brought colorful leaves sa artists prepared for aw annual exhibition showcasing contemporary artwerk.
Processing word: The
Given word is in the vocabulary
Processing word: autumn
Given word is in the vocabulary
Processing word: season
Given word is in the vocabulary
Processing word: brought
Given word is in the vocabulary
Processing word: colorful
Processing word: leaves
Given word is in the vocabulary
Processing word: sa
Given word is in the vocabulary
Processing word: artists
Given word is in the vocabulary
Processing word: prepared
Given word is in the vocabulary
Processing word: for
Given word is in the vocabulary
Processing word: aw
Processing word: annual
Given word is in the vocabulary
Processing word: exhibition
Given word is in the vocabulary
Processing word: showcasing
Processing word: contemporary
Given word is in the vocabulary
Processing word: artwerk
Unigram accuracy: 0.0
Bigram accuracy: 0.08333333333333333


In [132]:
test_paragraph = """As the sun set, an art exhibition opened in the city's cultural center, showcasing contemporary artwork from renowned 
and emerging visual artists. The gallery was filled with vibrant paintings, abstract sculptures, and multimedia installations that explored 
themes of identity and transformation. Art enthusiasts and collectors engaged in thoughtful discussions about the impact of modern art on society. 
Meanwhile, the event organizers prepared for an evening panel featuring well-known creative professionals discussing the future of digital media 
in the artistic landscape."""

test_paragraph_with_errors, added_error_num = add_errors(test_paragraph)
print(test_paragraph_with_errors)

corrected_paragraph_unigram = correct_text(test_paragraph_with_errors)
corrected_paragraph_bigram = correct_text_bigram(test_paragraph_with_errors)

unigram_accuracy = calculate_word_accuracy(test_paragraph, test_paragraph_with_errors, corrected_paragraph_unigram)
bigram_accuracy = calculate_word_accuracy(test_paragraph, test_paragraph_with_errors, corrected_paragraph_bigram)

print(unigram_accuracy)
print(bigram_accuracy)

As the sun set, an art exhibition opend in the city's cultural center, showcasing contemporary artwork from renowned 
and emerging visual artists. Te gallery was filled with vibrant paintings, abstract sculptures, and mltimedia installations thbat efplored 
themes of identit and transformation. Art enthusiasts and ctllectors engaged in thoughtful discsusions about th impact of modern art on society. 
Meanwhile, the eveny organizers repared ofr an evening panel featuring welbl-known creative professionals discusfsing the future of digital mvdia 
in the artcstic landscape.
Processing word: As
Given word is in the vocabulary
Processing word: the
Given word is in the vocabulary
Processing word: sun
Given word is in the vocabulary
Processing word: set
Given word is in the vocabulary
Processing word: an
Given word is in the vocabulary
Processing word: art
Given word is in the vocabulary
Processing word: exhibition
Given word is in the vocabulary
Processing word: opend
Processing word: in
Giv

## Experiment results
1. Bigram model with laplase smoothing performed better than without it.
Words test file (without context)  74%  | 41% | 49%
Accuracy on the Lord of The Rings  0.106 |0.124| 0.142
Accuracy on test sentence 0.25 | 0.25 |  
Accuracy on the test paragraph  0.13| 0.13| 0.159

2. To increase the context utilization => use beam search.

#### Useful resources (also included in the archive in moodle):

1. [Possible dataset with N-grams](https://www.ngrams.info/download_coca.asp)
2. [Damerau–Levenshtein distance](https://en.wikipedia.org/wiki/Damerau–Levenshtein_distance#:~:text=Informally%2C%20the%20Damerau–Levenshtein%20distance,one%20word%20into%20the%20other.)