# spellchk: default program

In [1]:
from default import *

## Documentation

Read `answer/default.py` starting with the `spellchk` function and see how it solves the task of spell correction using a pre-trained language model that can predict a replacement token for a masked token in the input.

In your submission, write some beautiful documentation of your program here.

In [2]:
from io import StringIO
with StringIO("4\tit will put your maind into non-stop learning.") as f:
    for (locations, spellchk_sent) in spellchk(f):
        print("{locs}\t{sent}".format(
            locs=",".join([str(i) for i in locations]),
            sent=" ".join(spellchk_sent)
        ))

4	it will put your mind into non-stop learning.


# Analysis
## 1. Exploration and Understanding

+ To better understand the intermediate output of the code, I modified the original functions in `default.py` to print test output.
+ Since I had never used the `fill-mask` pipeline from `transformers` before, I consulted the official documentation: https://huggingface.co/docs/transformers/task_summary#masked-language-modeling.

+ The `get_typo_locations` function parses the input test cases and produces the list of typo locations and the tokenized sentence.

In [23]:
fill_mask = pipeline('fill-mask', model='distilbert-base-uncased')
mask = fill_mask.tokenizer.mask_token

exp1 = "5,14	Just before Myra left -- Sue was saying good-by to Cathy , and she didm't realize I was near '' ."

def get_typo_locations(sentence):
    tsv_f = [sentence.split('\t')]
    for line in tsv_f:
        # line[0] contains the comma separated indices of typo words
        # (no trailing comma here, otherwise `first` becomes a 1-tuple)
        first = [int(i) for i in line[0].split(',')]
        # line[1] contains the space separated tokens of the sentence
        second = line[1].split()
    return [first, second]

res1 = get_typo_locations(exp1)
print(res1)

[[5, 14], ['Just', 'before', 'Myra', 'left', '--', 'Sue', 'was', 'saying', 'good-by', 'to', 'Cathy', ',', 'and', 'she', "didm't", 'realize', 'I', 'was', 'near', "''", '.']]


+ The core of the `spellchk` function is `fill_mask`, created with `transformers.pipeline`, which takes a masked sentence and generates suggestions based solely on the context of the input sentence.
+ The returned suggestions are ranked by descending score, so the default solution directly returns the top-scored word recommended by the API.
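
The masking step itself can be tried without loading the model. This is only a minimal sketch: `mask_sentence` is a hypothetical helper name, and `"[MASK]"` is assumed as the mask token (in the notebook it is read from `fill_mask.tokenizer.mask_token`).

```python
def mask_sentence(tokens, index, mask_token="[MASK]"):
    """Join the tokens into one string, replacing tokens[index] with the mask."""
    return " ".join(mask_token if j == index else tok
                    for j, tok in enumerate(tokens))

# Sentence from the first example above; the typo "maind" is at index 4.
tokens = "it will put your maind into non-stop learning.".split()
masked = mask_sentence(tokens, 4)
print(masked)
```

The resulting string is what would be passed to `fill_mask`, which then proposes words for the masked position.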

In [32]:
def select_correction(typo, predict):
    # return the most likely prediction for the mask token
    return predict[0]['token_str']

res1 = [[5, 14], ['Just', 'before', 'Myra', 'left', '--', 'Sue', 'was', 'saying', 'good-by', 'to', 'Cathy', ',', 'and', 'she', "didm't", 'realize', 'I', 'was', 'near', "''", '.']]

locations, sent = res1

for i in locations:
    predict = fill_mask(
                    " ".join([ sent[j] if j != i else mask for j in range(len(sent)) ]), # replace typo with a mask
                    top_k=5
                )
    print(predict)

[{'score': 0.10249210149049759, 'token': 18305, 'token_str': 'cathy', 'sequence': "just before myra left - - cathy was saying good - by to cathy, and she didm't realize i was near''."}, {'score': 0.046036895364522934, 'token': 23020, 'token_str': 'myra', 'sequence': "just before myra left - - myra was saying good - by to cathy, and she didm't realize i was near''."}, {'score': 0.041393160820007324, 'token': 2016, 'token_str': 'she', 'sequence': "just before myra left - - she was saying good - by to cathy, and she didm't realize i was near''."}, {'score': 0.03479192033410072, 'token': 9056, 'token_str': 'liz', 'sequence': "just before myra left - - liz was saying good - by to cathy, and she didm't realize i was near''."}, {'score': 0.031346630305051804, 'token': 12954, 'token_str': 'mum', 'sequence': "just before myra left - - mum was saying good - by to cathy, and she didm't realize i was near''."}]
[{'score': 0.8567774891853333, 'token': 2106, 'token_str': 'did', 'sequence': "just bef

## 2. Improvement Solution 1 : Edit Distance
+ From the exploration above, I found that the default solution ignores one important piece of information: the misspelled word itself. Although some characters are wrong, most are in the right places. Using this information to pick the most similar word from the top 20 suggestions should improve correction accuracy.
+ Another hint in the default solution: the `select_correction` function takes a parameter named `typo` that is never used.

Based on this, we decided to select the suggested word with the smallest edit distance to the typo. This increased the score from 0.23 to 0.52.
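
To make the selection rule concrete without the model or the `Levenshtein` package, here is a small sketch with a pure-Python edit distance. The helper names `levenshtein` and `closest_candidate` and the candidate list are assumptions made for illustration, not part of `default.py`.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def closest_candidate(typo, candidates):
    """Return the candidate with the smallest edit distance to the typo.
    On ties, min() keeps the earlier (higher-ranked) candidate."""
    return min(candidates, key=lambda word: levenshtein(typo, word))

# Hypothetical suggestions, ordered as if by descending model score.
candidates = ["head", "mind", "money", "hands"]
best = closest_candidate("maind", candidates)
print(best)
```

Here "mind" wins because it is one deletion away from "maind", even though it is not ranked first.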

In [46]:
from Levenshtein import distance

res1 = [[5, 14], ['Just', 'before', 'Myra', 'left', '--', 'Sue', 'was', 'saying', 'good-by', 'to', 'Cathy', ',', 'and', 'she', "didm't", 'realize', 'I', 'was', 'near', "''", '.']]

locations, sent = res1

predict = fill_mask(
            " ".join([ sent[j] if j != 5 else mask for j in range(len(sent)) ]), # replace typo with a mask
            top_k=20
        )

def select_correction(typo, predict):
    recommended_words = [p['token_str'] for p in predict]
    levenshtein_distances = [distance(typo, word) for word in recommended_words]
    index = levenshtein_distances.index(min(levenshtein_distances))
    print(predict[index]['token_str'])

print("typo:", res1[1][5])
select_correction('Sue', predict)

predict = fill_mask(
            " ".join([ sent[j] if j != 14 else mask for j in range(len(sent)) ]), # replace typo with a mask
            top_k=20
        )

print("typo:", res1[1][14])
select_correction('didm\'t', predict)

typo: Sue
she
typo: didm't
did


Considering both edit distance and context, the dev.out score reaches 0.52, much better than the default score of 0.23, but still not satisfying. So we inspected the output to understand some of the wrongly corrected cases.
+ We noticed that `top_k` limits the number of recommended words: if it is too small, the correct word may not be returned by the API at all. So we increased `top_k` to see what happened.
After changing `top_k` from 20 to 1000, the score rose from 0.52 to 0.68.
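
The effect of `top_k` can be illustrated with a toy ranking. This is a sketch under assumptions: the ranked list is made up, and `reachable` is a hypothetical helper.

```python
def reachable(ranked_candidates, gold, top_k):
    """True if the gold word appears among the first top_k candidates."""
    return gold in ranked_candidates[:top_k]

# Hypothetical ranking by descending score; the correct word is at rank 6.
ranked = ["cathy", "myra", "she", "liz", "mum", "sue", "anna"]
results = {k: reachable(ranked, "sue", k) for k in (5, 7)}
print(results)
```

With `top_k=5` the correct word never reaches `select_correction`, so no selection rule could recover it; with `top_k=7` it can.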

## 3. Improvement Solution 2: Filter the 0 Distance out & Optimize the Selection of Distance

While refining the model further, we discovered that some suggested words are identical to the typo, i.e., their edit distance is 0. Selecting such a word changes nothing, so it leads to many useless replacements. We therefore used `filter` to exclude suggestions with a distance of 0.

After filtering, we sorted the remaining predictions by closeness. The edit distance is compared first; when distances are equal, the prediction with the higher score wins. This strategy improved the model's overall performance. With these methods in place, we further increased `top_k` to 3000, reaching an overall score of 0.70.

In [1]:
from Levenshtein import distance

def select_correction(typo, predict):
    # calculate the edit distance between typo and token_str
    predict = [{**p, 'ldis': distance(typo, p['token_str'])} for p in predict]
    # keep only the predictions whose distance is not 0
    filter_predict = list(filter(filter_p, predict))
    # sort the predictions so that the smallest distance comes first
    # if the distances are equal, prefer the prediction with the higher score
    sort_predict = sorted(filter_predict, key=sorting_k)
    return sort_predict[0]['token_str']

def filter_p(p):
    # True for predictions that differ from the typo
    return p['ldis'] != 0

def sorting_k(p):
    # ascending distance, then descending score
    res = (p['ldis'], -p['score'])
    return res
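
A worked example of the filter and sort order, as a sketch: the predictions and scores below are made up for the typo "maind", with the edit distance (`ldis`) already attached, and `rank_candidates` is a hypothetical name.

```python
# Hypothetical predictions for the typo "maind".
predictions = [
    {"token_str": "head",  "score": 0.50,  "ldis": 4},
    {"token_str": "mind",  "score": 0.40,  "ldis": 1},
    {"token_str": "main",  "score": 0.02,  "ldis": 1},
    {"token_str": "maind", "score": 0.001, "ldis": 0},  # identical to the typo
]

def rank_candidates(predictions):
    """Drop distance-0 candidates, then sort by (distance asc, score desc)."""
    kept = [p for p in predictions if p["ldis"] != 0]
    return sorted(kept, key=lambda p: (p["ldis"], -p["score"]))

ranked_words = [p["token_str"] for p in rank_candidates(predictions)]
print(ranked_words)
```

"mind" and "main" tie on distance, so the score breaks the tie; "head" comes last despite having the highest score. If every candidate were filtered out the list would be empty, so a fallback (for example keeping the original word) may be worth considering.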

## 4. Improvement Solution 3: Capital Letter at the Start of the Sentence

By observing some of the generated outputs, we found that some wrong predictions are caused by upper/lower case mismatches. When the typo is the first word of a sentence, its first letter should be upper case, but our previous solution did not handle this. We therefore added a few lines to the `spellchk` function to handle it.

In [None]:
def spellchk(fh):
    for (locations, sent) in get_typo_locations(fh):
        # some code ...
        # When the word being corrected is at index 0, i.e. it is the first
        # word of the sentence, capitalize its first character.
        correct_word = select_correction(sent[i], predict)
        if i == 0:
            correct_word = correct_word.capitalize()
        spellchk_sent[i] = correct_word
        # some code ...
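
The capitalization rule can be checked in isolation. This is a sketch under assumptions: `apply_corrections`, the sentence, and the corrections are invented for illustration and stand in for the elided parts of `spellchk`.

```python
def apply_corrections(tokens, corrections):
    """Replace tokens at the given indices; capitalize a correction at index 0."""
    fixed = list(tokens)
    for i, word in corrections.items():
        fixed[i] = word.capitalize() if i == 0 else word
    return fixed

tokens = ["Thiz", "is", "a", "smal", "test", "."]
corrections = {0: "this", 3: "small"}  # hypothetical lowercase model output
fixed_sentence = " ".join(apply_corrections(tokens, corrections))
print(fixed_sentence)
```

Note that `str.capitalize()` also lowercases the remaining characters, which is harmless here because the uncased model returns lowercase tokens.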

## Group work

* ywa422: Proposed and implemented Method 1.
* ningyik: Proposed and implemented Method 2.
* asa489: Proposed and implemented Method 3.