# Spelling correction

You receive the following files, available via SurfDrive:
- sentences_with_typos.txt: a file with two tab-separated columns, the first containing a numerical ID and the second containing a sentence in English
- SUBTLEXus.txt: a file with several tab-separated columns, containing data from the SUBTLEXus lexicon, a list of words derived from a part of the SUBTLEXus corpus containing movie subtitles with their frequency counts (and other pieces of information)
- LM_trainingCorpus.json: a file containing a pre-processed corpus in the form of a list of lists of strings, to be used to train a word-level statistical language model

You should carry out the following tasks:

- task1: comment the code that estimates the LM by adding docstrings. 4 points available; you lose 1 point for every missing docstring. There are 6 docstrings to write, so writing 2 gives you the same points as writing none: the rationale behind the grading approach is that writing only 2 is not a sufficient demonstration that you understand the code.


- task2a: After reading in the files as pandas dataframes (mind how you import default strings, some of them may turn into NAs!), find which token contains a typo in the given sentences, where having a typo means that the word does not feature in SUBTLEXus - remember to tokenise the input sentences! There is one and only one word containing a typo in each sentence, as defined in this way: the typo can result from the insertion/deletion/substitution of one or two characters. You should submit a .json file containing a simple dictionary mapping the sentence ID (as an integer) to the mistyped word (as a string). 5 points available in total: for every incorrect word retrieved, you lose 0.5 point. If you retrieve 10 wrong mistyped words, you get no points even if the total number of mistyped words is higher than 10. If you submit a wrongly formatted file, you will get no points.


- task2b: read the target sentences manually. Some of them contain another mistyped word than the one you found in task2a, but were not detected because the typo resulted in a word which appears in SUBTLEXus. Discuss how you could automatically spot those mistyped words too using CL methods and resources: what information would you need? How would you use it? Reply in no more than 150 words. 3 points available, awarded based on how sensible the answer is.


- task3: find the words from SUBTLEXus with the smallest edit distance from each mistyped target (do not lowercase anything). You should return the 3 words at the smallest edit distance, sorted by edit distance. However, if more words are at the same edit distance as the third closest, you should include all the words at that edit distance. Therefore, supposing that the string 'abcdef' has two neighbors at edit distance 1, and four neighbors at edit distance 2, the third closest neighbor would be at edit distance 2, but there would be three other words at the same distance and you should thus return six neighbors for the target string. 5 points available: you lose 0.5 points for every wrong list of nearest neighbors you retrieve for a mistyped word (wrong means any of wrong words, wrong order, wrong edit distances, wrong data type). Submit a .json file containing a dictionary mapping sentence IDs (as integers) to lists of tuples, where each tuple contains first the word (as a string) and then the edit distance (as an integer), with tuples sorted by edit distance in ascending order (smallest edit distance first). The dictionary should have the following form (if you submit a wrongly formatted file, you will get no points):
    {id1: [(w1, 2), (w2, 2), (w3, 2)],
     id2: [(w6, 1), (w7, 1), (w8, 2), (w9, 2), (w10, 2)],
     ...}
     
     
- task4: use the list of candidate replacements you found in task3 to find the best one according to candidate frequency (derived from SUBTLEXus) - if two or more candidates have the exact same frequency in SUBTLEX, choose the one with the best edit distance. If two or more candidates at the same frequency also have the same edit distance, pick the one which comes first in alphabetical order. You should return a .json file containing a simple dictionary mapping the sentence ID (as an integer) to a tuple containing the best candidate replacement (as a string) and its SUBTLEXus frequency value (as an integer). 5 points available, you lose 1 point for each wrong best candidate:frequency pair you retrieve (if the candidate is right but the frequency doesn't match, you lose half a point).


- task5: use the list of candidate replacements you found in task3 to find the best one according to its perplexity under a statistical language model of order 3 implemented using a Markov Chain with add-k smoothing (k=0.01) and estimated using the given corpus. If two or more candidates have the exact same perplexity in the input sentence, choose the one with the best edit distance. If two or more candidates at the same perplexity also have the same edit distance, follow alphabetical order. You should return a .json file containing a simple dictionary mapping the sentence ID (as an integer) to a tuple containing the best candidate replacement (as a string) and its perplexity under the specified language model (as a float). Not all candidate replacements might appear in the LM training corpus: don't map such candidate replacements to the UNK string though, or the perplexity estimate would not reflect the specific candidate; you can directly exclude candidate replacements which don't appear in the training corpus from the perplexity computation. 5 points available, you lose 1 point for each wrong best candidate:perplexity pair you retrieve (if the candidate is right but the perplexity doesn't match, you lose half a point).


- task6: Compare the candidate replacements found when considering frequency and when considering perplexity. Which are better? Do they match what you consider to be the right replacement? What extra resources/information would you use to pick better candidate replacements? 3 points available, awarded based on how sensible the answer is.


Name files as follows:
task[n]_NameSurname_solution.json

## Task1

In the cell below you find code to run a statistical language model. Add the docstrings (a blueprint is already provided) to complete the first task.

In [1]:
import operator
import numpy as np
from collections import defaultdict

class LM(object):
    
    def __init__(self, corpus, ngram_size=2, bos='+', eos='#', k=1):
        
        """
        :param corpus: the training corpus, a list of sentences where each sentence is a list of string tokens
        :param ngram_size: the order of the n-grams used by the model (2 for bigrams, 3 for trigrams, and so on)
        :param k: the smoothing parameter used for add-k smoothing
        :param bos: the beginning-of-sentence symbol used to pad the start of each sentence
        :param eos: the end-of-sentence symbol appended to each sentence
        
        This class creates an LM object with attributes k, ngram_size, bos, eos, corpus, vocab (the set of distinct tokens, including eos) and vocab_size (the length of the vocabulary set)
        """
        
        self.k = k
        self.ngram_size = ngram_size
        self.bos = bos
        self.eos = eos
        self.corpus = corpus
        self.vocab = self.get_vocab()
        self.vocab.add(self.eos)
        self.vocab_size = len(self.vocab)
        
    def get_vocab(self):
        
        """
        :return: a set containing the distinct tokens found in the corpus
        
        This method collects the distinct tokens from the corpus
        """
        
        vocab = set()
        for sentence in self.corpus:
            for element in sentence:
                vocab.add(element)
        
        return vocab
                    
    def update_counts(self):
        
        """
        :return: nothing; the counts are stored in the attribute self.counts
        
        This method builds a nested dictionary that maps each history (a tuple of ngram_size - 1 tokens) to the counts of the words that follow it in the corpus
        """
        
        r = self.ngram_size - 1
        
        self.counts = defaultdict(dict)
        
        for sentence in self.corpus:
            s = [self.bos]*r + sentence + [self.eos]
            
            for idx in range(self.ngram_size-1, len(s)):
                ngram = self.get_ngram(s, idx)
                
                try:
                    self.counts[ngram[0]][ngram[1]] += 1
                except KeyError:
                    self.counts[ngram[0]][ngram[1]] = 1
                        
    def get_ngram(self, s, i):
        
        """
        :param s: the padded sentence (a list of tokens) from which to extract the n-gram
        :param i: the index of the target word in the sentence
        :return: a tuple (history, target), where history is a tuple of the n-1 words preceding the target word
        
        This method takes the word at index i as the target and returns it together with the n-1 words preceding it, which together form an n-gram
        """
        
        ngram = s[i-(self.ngram_size-1):i+1]
        history = tuple(ngram[:-1])
        target = ngram[-1]
        return (history, target)
    
    def get_ngram_probability(self, history, target):
        
        """
        :param history: a tuple with the n-1 words preceding the target
        :param target: the word whose probability we want to estimate
        :return: the add-k smoothed probability of the target given the history
        
        This method returns the smoothed conditional probability of the target word given the previously observed history
        """
        
        try:
            ngram_tot = np.sum(list(self.counts[history].values())) + (self.vocab_size*self.k)
            try:
                transition_count = self.counts[history][target] + self.k
            except KeyError:
                transition_count = self.k
        except KeyError:
            transition_count = self.k
            ngram_tot = self.vocab_size*self.k
        
        return transition_count/ngram_tot 
    
    def perplexity(self, sentence):
        
        """
        :param sentence: the sentence (a list of string tokens) whose perplexity we want to compute
        :return: the perplexity of the sentence under the model (a float)
        
        This method computes the perplexity of a sentence, i.e. 2 raised to the negative average log2 probability of its n-grams.
        The better the language model predicts the sentence, the lower the perplexity
        """
        
        r = self.ngram_size - 1
        s = [self.bos]*r + sentence + [self.eos]
        

        probs = []
        for idx in range(self.ngram_size-1, len(s)):
            ngram = self.get_ngram(s, idx)
            probs.append(self.get_ngram_probability(ngram[0], ngram[1]))
                    
        entropy = np.log2(probs)
        avg_entropy = -1 * (sum(entropy) / len(entropy))
        return pow(2.0, avg_entropy)
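
The add-k estimate used by `get_ngram_probability` can be written as P(w | h) = (c(h, w) + k) / (c(h) + k * V), where c counts occurrences in the corpus and V is the vocabulary size. The following is a minimal standalone sketch of that formula for a bigram model; the toy corpus and the function name `addk_probability` are assumptions made for illustration and are not part of the assignment code.

```python
from collections import defaultdict


def addk_probability(counts, vocab_size, history, target, k=0.01):
    """Add-k estimate of P(target | history) from nested count dictionaries."""
    seen = counts.get(history, {})
    return (seen.get(target, 0) + k) / (sum(seen.values()) + k * vocab_size)


# Hypothetical toy corpus; '+' pads the start and '#' marks the end of a sentence.
toy_corpus = [["a", "b"], ["a", "c"]]
vocab = {tok for sent in toy_corpus for tok in sent} | {"#"}

# Count how often each word follows each one-word history.
counts = defaultdict(dict)
for sent in toy_corpus:
    padded = ["+"] + sent + ["#"]
    for prev, nxt in zip(padded, padded[1:]):
        counts[(prev,)][nxt] = counts[(prev,)].get(nxt, 0) + 1

p_seen = addk_probability(counts, len(vocab), ("a",), "b", k=1)
p_unseen = addk_probability(counts, len(vocab), ("a",), "a", k=1)
# The smoothed probabilities over the whole vocabulary still sum to 1.
total = sum(addk_probability(counts, len(vocab), ("a",), w, k=1) for w in vocab)
print(p_seen, p_unseen, total)
```

With k=1 the unseen transition receives a noticeable share of the probability mass; a small value such as k=0.01 keeps the estimate closer to the raw relative frequencies.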

In [2]:
import json
import nltk
import spacy
import pandas as pd
from collections import defaultdict
import string
#pip install regex
import regex as re
import numpy as np

## Task2

In [3]:
# load the data (the with-statement closes the file once it has been read)
with open(r'LM_trainingCorpus.json', 'r') as infile:
    data_trainingCorpus = json.load(infile)
#print(data_trainingCorpus[:10])

In [4]:
# join the tokens of each training sentence back into a single string
training_sentences = [' '.join(element) for element in data_trainingCorpus]

#training_sentences[0]

In [5]:
sentencesraw = pd.read_csv(r"sentence_with_typos.txt", sep="\t")
sentences = np.asarray(sentencesraw["sentence"])
sentences_tokenized = []

for sentence in sentences:
    sentence_tokenized = nltk.wordpunct_tokenize(sentence)
    sentences_tokenized.append(sentence_tokenized)

#print(sentences_tokenized)
#print()

In [6]:
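# keep_default_na=False stops pandas from turning words such as "null" or "nan" into missing values;
# na_values=[""] still treats genuinely empty cells as missing
movie_subtitlesraw = pd.read_csv(r'SUBTLEXus.txt', sep="\t", low_memory=False, keep_default_na=False, na_values=[""])
movie_subtitle = list(np.asarray(movie_subtitlesraw["Word"]))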
#movie_subtitlesraw
#print(movie_subtitle)
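
The task description warns that some strings may turn into NAs when the files are imported. The sketch below shows the effect on a hypothetical three-row lexicon (the rows are made up for illustration): with the default settings, `pd.read_csv` parses the words "null" and "nan" as missing values, while `keep_default_na=False` keeps them as strings.

```python
import io

import pandas as pd

# Hypothetical lexicon in the same tab-separated layout as SUBTLEXus.
toy_lexicon = "Word\tFREQcount\nnull\t5\nnan\t3\ncat\t7\n"

# Default behaviour: some ordinary English words are read as missing values.
default_df = pd.read_csv(io.StringIO(toy_lexicon), sep="\t")
# With keep_default_na=False every word stays a string.
strict_df = pd.read_csv(io.StringIO(toy_lexicon), sep="\t", keep_default_na=False)

n_missing_default = int(default_df["Word"].isna().sum())
words_strict = list(strict_df["Word"])
print(n_missing_default, words_strict)
```

A word that is silently read as NA would be missing from the lexicon, so a correctly spelled token could be flagged as a typo.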


## Task 2a

In [12]:
# find mistyped words in the input sentences
known_words = set(movie_subtitle)  # set membership is much faster than scanning a list
where_are_mistakes1 = dict()  # 0-based sentence index -> position of the mistyped token
mistakes = dict()             # 1-based sentence ID -> mistyped token
for i, sentence in enumerate(sentences_tokenized):
    for position, element in enumerate(sentence):
        # existing words and punctuation symbols are not mistakes
        if element in known_words or element in string.punctuation:
            continue
        # in any other case it is a mistake, so we add it to the dictionaries
        mistakes[i + 1] = element
        where_are_mistakes1[i] = position
        break  # each sentence contains exactly one mistyped word
print(mistakes)

{1: 'knoq', 2: 'reserachers', 3: 'grozn', 4: 'quolg', 5: 'waies', 6: 'wintr', 7: 'munors', 8: 'surgicaly', 9: 'aquire', 10: 'acomodate', 11: 'dats', 12: 'collegue', 13: 'layed', 14: 'cate', 15: 'giambic'}


In [10]:
# Data to be written -> mistakes
with open("task2a_KatarzynaKleczek_solution.json", "w") as outfile:
	json.dump(mistakes, outfile)
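
One caveat about the required format: JSON object keys are always strings, so the integer sentence IDs are written as strings by `json.dump`. The sketch below, which uses two entries from the output above as toy data, shows the round trip and one way to restore integer keys after loading.

```python
import json

toy_mistakes = {1: "knoq", 2: "reserachers"}

# Integer keys are converted to strings when the dictionary is serialised.
encoded = json.dumps(toy_mistakes)
decoded_raw = json.loads(encoded)

# Convert the keys back to integers after loading.
decoded_int = {int(key): value for key, value in decoded_raw.items()}
print(encoded)
print(decoded_int)
```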

## Task2b 

The additional mistyped words are cases such as "fiends" instead of "friends" or "red" instead of "read". After the pronoun "I" we expect, for example, a verb, not a word describing a colour. To capture this information with CL methods we can use PoS tagging (e.g. with a Hidden Markov Model) together with a grammar in CNF, so that the algorithm learns which sequences of parts of speech are allowed in a sentence and which are not. Mistakes such as "our fiends had a baby" are harder to correct, because the PoS sequence is still valid: they need context. In my opinion, estimating the probability of the sentence (for example with an n-gram language model) would be the most successful approach here, as the probability of "our friends had a baby" would be higher than that of "our fiends had a baby".
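
The sentence-probability idea can be sketched as follows. This is only an illustration under strong assumptions: the toy corpus, the hand-written confusion sets and the `margin` threshold are all hypothetical, and a real system would need a large corpus and an automatic way to generate in-vocabulary alternatives (for example, words within a small edit distance).

```python
import math
from collections import Counter

# Hypothetical toy corpus and confusion sets, for illustration only.
toy_corpus = [
    ["our", "friends", "had", "a", "baby"],
    ["our", "friends", "had", "a", "party"],
    ["the", "fiends", "attacked", "the", "village"],
]
confusion_sets = {"fiends": ["friends"], "friends": ["fiends"]}

# Bigram counts with '+' as start padding and '#' as end-of-sentence symbol.
history_counts = Counter(tok for sent in toy_corpus for tok in ["+"] + sent)
bigram_counts = Counter(
    pair for sent in toy_corpus for pair in zip(["+"] + sent, sent + ["#"])
)
vocab_size = len({tok for sent in toy_corpus for tok in sent} | {"#"})


def log_prob(sentence, k=0.01):
    """Add-k smoothed bigram log2-probability of a tokenised sentence."""
    padded = ["+"] + sentence + ["#"]
    return sum(
        math.log2((bigram_counts[(prev, nxt)] + k)
                  / (history_counts[prev] + k * vocab_size))
        for prev, nxt in zip(padded, padded[1:])
    )


def suspicious_tokens(sentence, margin=1.0):
    """Flag tokens whose alternative raises the log2-probability by more than margin."""
    flagged = []
    base = log_prob(sentence)
    for i, token in enumerate(sentence):
        for alt in confusion_sets.get(token, []):
            variant = sentence[:i] + [alt] + sentence[i + 1:]
            if log_prob(variant) - base > margin:
                flagged.append((i, token, alt))
    return flagged


flagged = suspicious_tokens(["our", "fiends", "had", "a", "baby"])
not_flagged = suspicious_tokens(["our", "friends", "had", "a", "baby"])
print(flagged, not_flagged)
```

The token is flagged only when an in-vocabulary alternative makes the whole sentence clearly more probable, which is the extra context that a lexicon lookup alone cannot provide.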

In [11]:
print(sentences)

['did you knoq that our fiends had a baby?'
 'I red that reserachers managed to deviate the orbit of a comet with a satellite.'
 'I could not help but grozn in frustration when my computer trashed right before I finished my project.'
 'I mate a quolg from old clothes for my newborn nephew.'
 'he waies his wand to get the attention of the waiter.'
 'the roofs of the old house needed to he repaired before the wintr.'
 'munors are not allowed to purchase cigarettes or alcohol.'
 'the tumor was remove surgicaly.'
 'they aquire the company in order to expand their business.'
 'the hotel was able to acomodate all of our needs.'
 'I marked the dats on my calendar so I would not forger.'
 'I asked my collegue for there opinion on the matter.'
 'due to the economic situation, main employees were layed off from their jobs.'
 'I was refered to the specialist by my primary cate physician.'
 'many poets wrote in giambic pentameter.']


## Task 3


In [12]:
# retrieve the nearest neighbors based on edit distance   


In [13]:
output = {}
for sentence_id, misspel in mistakes.items():
    neighbors = []
    for word in movie_subtitle:
        # strip any non-letter characters around the word
        word = re.findall(r'^[^a-zA-Z]*([a-zA-Z]*)[^a-zA-Z]*$', str(word))
        if word and word[0] != misspel:
            d = nltk.edit_distance(word[0], misspel)
            neighbors.append((word[0], d))
    # sort by edit distance (the sort is stable, so ties keep the SUBTLEXus order)
    neighbors.sort(key=lambda x: x[1])
    # keep the 3 closest words plus all the others at the same edit distance as the third
    to_add = neighbors[:3]
    third_ele_dist = neighbors[2][1]
    for neighbor in neighbors[3:]:
        if neighbor[1] == third_ele_dist:
            to_add.append(neighbor)
    # the dictionary is keyed by sentence ID, not by the mistyped word
    output[sentence_id] = to_add

print(output)

{1: [('know', 1), ('knot', 1), ('knob', 1)], 2: [('researchers', 2), ('researcher', 3), ('researches', 3), ('reachers', 3)], 3: [('grown', 1), ('groin', 1), ('groan', 1)], 4: [('quote', 2), ('quota', 2), ('quo', 2), ('quilt', 2), ('quell', 2), ('qualm', 2), ('quila', 2), ('quoad', 2), ('quoit', 2), ('quos', 2)], 5: [('waves', 1), ('wakes', 1), ('waits', 1), ('wages', 1), ('wails', 1), ('wares', 1), ('waxes', 1), ('wades', 1), ('waives', 1), ('waifs', 1), ('wanes', 1)], 6: [('winter', 1), ('wintry', 1), ('with', 2), ('want', 2), ('into', 2), ('went', 2), ('wants', 2), ('win', 2), ('wind', 2), ('wine', 2), ('winner', 2), ('wins', 2), ('wings', 2), ('wing', 2), ('minor', 2), ('hint', 2), ('diner', 2), ('winds', 2), ('ninth', 2), ('wit', 2), ('mint', 2), ('witty', 2), ('wits', 2), ('wiser', 2), ('pint', 2), ('wink', 2), ('finer', 2), ('wider', 2), ('windy', 2), ('intro', 2), ('hints', 2), ('wiener', 2), ('mints', 2), ('wilt', 2), ('wines', 2), ('miner', 2), ('liner', 2), ('pints', 2), ('li

In [14]:
# Data to be written -> output
with open("task3_KatarzynaKleczek_solution.json", "w") as outfile:
	json.dump(output, outfile)
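
The selection rule from the task description (the three closest words plus every word tied with the third) can be checked in isolation. The sketch below reproduces the 'abcdef' example from the task with placeholder neighbors; the function name `closest_with_ties` and the words `w1` to `w7` are hypothetical.

```python
def closest_with_ties(neighbors, n=3):
    """Keep the n closest (word, distance) pairs plus any others tied with the n-th."""
    ranked = sorted(neighbors, key=lambda pair: pair[1])
    if len(ranked) <= n:
        return ranked
    cutoff = ranked[n - 1][1]
    return [pair for pair in ranked if pair[1] <= cutoff]


# Two neighbors at distance 1, four at distance 2 and one at distance 3.
toy_neighbors = [
    ("w1", 1), ("w2", 1),
    ("w3", 2), ("w4", 2), ("w5", 2), ("w6", 2),
    ("w7", 3),
]
selected = closest_with_ties(toy_neighbors)
print(len(selected), selected)
```

The third closest neighbor sits at distance 2, so all four words at distance 2 are kept and the word at distance 3 is dropped, giving six neighbors in total.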

## Task4: frequency
use the list of candidate replacements you found in task3 to find the best one according to candidate frequency (derived from SUBTLEXus) - if two or more candidates have the exact same frequency in SUBTLEX, choose the one with the best edit distance. If two or more candidates at the same frequency also have the same edit distance, pick the one which comes first in alphabetical order. You should return a .json file containing a simple dictionary mapping the sentence ID (as an integer) to a tuple containing the best candidate replacement (as a string) and its SUBTLEXus frequency value (as an integer). 5 points available, you lose 1 point for each wrong best candidate:frequency pair you retrieve (if the candidate is right but the frequency doesn't match, you lose half a point).
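
The three-level rule above (highest frequency, then smallest edit distance, then alphabetical order) can be expressed as a single sort key. This is a minimal sketch on made-up candidates; the function name `best_by_frequency` and the toy frequencies are assumptions for illustration.

```python
def best_by_frequency(candidates):
    """Pick the best (word, frequency, edit_distance) triple.

    Higher frequency wins; ties go to the smaller edit distance,
    then to the word that comes first alphabetically.
    """
    return min(candidates, key=lambda c: (-c[1], c[2], c[0]))


# Hypothetical candidates: three share the top frequency, two of them also share the distance.
toy_candidates = [
    ("beta", 50, 1),
    ("alpha", 50, 1),
    ("gamma", 50, 2),
    ("delta", 10, 1),
]
best = best_by_frequency(toy_candidates)
print(best)
```

Negating the frequency lets one `min` call handle a descending criterion alongside two ascending ones.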

In [14]:
# pick the best candidate according to SUBTLEXus frequency counts
best_replacements = {}
for sentence_id, candidates in output.items():
    scored = []
    for word, ed in candidates:
        # look up the frequency count of the candidate in SUBTLEXus
        row_index = movie_subtitlesraw[movie_subtitlesraw['Word'] == word].index[0]
        freq = int(movie_subtitlesraw["FREQcount"][row_index])
        scored.append((word, freq, ed))
    # highest frequency first; ties are broken by edit distance, then alphabetical order
    best = min(scored, key=lambda x: (-x[1], x[2], x[0]))
    best_replacements[sentence_id] = (best[0], best[1])

print(best_replacements)



{1: ('know', 291780), 2: ('researchers', 57), 3: ('grown', 1275), 4: ('quote', 488), 5: ('waves', 674), 6: ('with', 257465), 7: ('minor', 654), 8: ('strictly', 548), 9: ('Squire', 157), 10: ('accommodate', 109), 11: ('days', 15592), 12: ('college', 4344), 13: ('played', 2870), 14: ('care', 24748), 15: ('gambit', 19)}


In [16]:
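def np_encoder(obj):
    # convert numpy scalars (e.g. np.int64) to native Python types so that json can serialise them
    if isinstance(obj, np.generic):
        return obj.item()
    # anything else is genuinely unsupported, so fail loudly instead of writing null
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")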


In [17]:
# Data to be written -> best_replacements; the default np_encoder is needed because json cannot serialise numpy integers.
# JSON has no tuple type, so all tuples are written as lists.

with open("task4_KatarzynaKleczek_solution.json", "w") as outfile:
	json.dump(best_replacements, outfile, default=np_encoder)


## Task5: perplexity

use the list of candidate replacements you found in task3 to find the best one according to its perplexity under a statistical language model of order 3 implemented using a Markov Chain with add-k smoothing (k=0.01) and estimated using the given corpus. If two or more candidates have the exact same perplexity in the input sentence, choose the one with the best edit distance. If two or more candidates at the same perplexity also have the same edit distance, follow alphabetical order. You should return a .json file containing a simple dictionary mapping the sentence ID (as an integer) to a tuple containing the best candidate replacement (as a string) and its perplexity under the specified language model (as a float). Not all candidate replacements might appear in the LM training corpus: don't map such candidate replacements to the UNK string though, or the perplexity estimate would not reflect the specific candidate; you can directly exclude candidate replacements which don't appear in the training corpus from the perplexity computation. 5 points available, you lose 1 point for each wrong best candidate:perplexity pair you retrieve (if the candidate is right but the perplexity doesn't match, you lose half a point).
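
The exclusion step described above (dropping candidates that never occur in the training corpus) could be sketched as below. The helper name `candidate_sentences`, the toy vocabulary and the toy candidates are assumptions for illustration; in the notebook the vocabulary would presumably be the `vocab` attribute of the trained `LM` object.

```python
def candidate_sentences(tokens, typo_index, candidates, vocab):
    """Yield (word, distance, new_tokens) for every in-vocabulary candidate.

    A new token list is built for each candidate, so `tokens` is never modified.
    """
    for word, dist in candidates:
        if word in vocab:
            yield word, dist, tokens[:typo_index] + [word] + tokens[typo_index + 1:]


# Hypothetical vocabulary and candidates for the token 'wintr'.
toy_vocab = {"before", "the", "winter", "wind"}
toy_tokens = ["before", "the", "wintr"]
toy_candidates = [("winter", 1), ("wintry", 1), ("wind", 2)]

results = list(candidate_sentences(toy_tokens, 2, toy_candidates, toy_vocab))
print(results)
```

Each surviving token list can then be passed to the perplexity method; a sentence for which no candidate survives needs separate handling, since there is nothing left to rank.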

In [16]:
# pick the best candidate according to perplexity under the given language model
# best candidate replacements from task3 -> output.values()  from those using 3gram model with smoothing k=0.01, estimate perplexity
new_model = LM(data_trainingCorpus, ngram_size=3, k=0.01)
new_model.update_counts()

In [17]:
perpl_list = []
for idx, candidates in enumerate(output.values()):
    mistake_index = where_are_mistakes1[idx]
    replacements = []
    for candidate, ed in candidates:
        # work on a copy so that the original tokenised sentence is left untouched
        sentence = list(sentences_tokenized[idx])
        sentence[mistake_index] = candidate
        perp = new_model.perplexity(sentence)
        # a tuple with a replacement, its perplexity and its edit distance
        replacements.append((candidate, perp, ed))
    perpl_list.append(replacements)

print(perpl_list)


[[('know', 1002.7288999720108, 1), ('knot', 5639.354069762032, 1), ('knob', 8575.27839748091, 1)], [('researchers', 14475.415924638553, 2), ('researcher', 14476.248088113443, 3), ('researches', 14475.415924638553, 3), ('reachers', 14475.415924638553, 3)], [('grown', 1861.3599787842493, 1), ('groin', 1858.1347148669051, 1), ('groan', 1858.8593594720237, 1)], [('quote', 16820.32265338231, 2), ('quota', 28937.539844403862, 2), ('quo', 28933.1915634287, 2), ('quilt', 19307.938029302248, 2), ('quell', 28933.1915634287, 2), ('qualm', 28946.21295587551, 2), ('quila', 28933.1915634287, 2), ('quoad', 28933.1915634287, 2), ('quoit', 28933.1915634287, 2), ('quos', 28933.1915634287, 2)], [('waves', 1220.1171486647056, 1), ('wakes', 1832.2622055583065, 1), ('waits', 1831.7334704549087, 1), ('wages', 1827.260259071301, 1), ('wails', 1827.3976276540632, 1), ('wares', 1827.260259071301, 1), ('waxes', 1827.5350065638286, 1), ('wades', 1827.260259071301, 1), ('waives', 1827.260259071301, 1), ('waifs', 1

In [20]:
final_dictionary = {}
for sentence_id, candidates in enumerate(perpl_list, start=1):
    # lowest perplexity first; ties are broken by edit distance, then alphabetical order
    best = min(candidates, key=lambda x: (x[1], x[2], x[0]))
    # store the word and its perplexity
    final_dictionary[sentence_id] = (best[0], best[1])

In [21]:

for element in final_dictionary.items():
    print(element)

print(final_dictionary)

(1, ('know', 1002.7288999720108))
(2, ('researchers', 14475.415924638553))
(3, ('groin', 1858.1347148669051))
(4, ('quote', 16820.32265338231))
(5, ('waves', 1220.1171486647056))
(6, ('wind', 1857.4043179891894))
(7, ('jurors', 9336.433652424052))
(8, ('survival', 18918.657040618258))
(9, ('Squire', 2190.000743626073))
(10, ('accommodate', 1897.8469004552046))
(11, ('date', 2274.1923660774823))
(12, ('colleague', 888.999722318408))
(13, ('Fayed', 6625.174599990876))
(14, ('care', 11882.810558938134))
(15, ('iambic', 31291.168009478923))
{1: ('know', 1002.7288999720108), 2: ('researchers', 14475.415924638553), 3: ('groin', 1858.1347148669051), 4: ('quote', 16820.32265338231), 5: ('waves', 1220.1171486647056), 6: ('wind', 1857.4043179891894), 7: ('jurors', 9336.433652424052), 8: ('survival', 18918.657040618258), 9: ('Squire', 2190.000743626073), 10: ('accommodate', 1897.8469004552046), 11: ('date', 2274.1923660774823), 12: ('colleague', 888.999722318408), 13: ('Fayed', 6625.174599990876)

In [22]:
with open("task5_KatarzynaKleczek_solution.json", "w") as outfile:
	json.dump(final_dictionary, outfile, default=np_encoder)

## Task 6

Compare the candidate replacements found when considering frequency and when considering perplexity. Which are better? Do they match what you consider to be the right replacement? What extra resources/information would you use to pick better candidate replacements? 3 points available, awarded based on how sensible the answer is.

Judging by the results, the frequency-based replacements look slightly more accurate (but the margin is only 3:2). However, this is not always the case, and sometimes neither method is correct: for example, in the sentence "the roofs of the old house needed to he repaired before the wintr." the replacement should in my opinion be "winter", but neither algorithm chooses it. I think that, depending on the corpus and on the smoothing or n-gram size used, perplexity can do much better than frequency; some fine-tuning of these parameters would be needed to find out whether that is indeed the case for this corpus. Another point is that when several words had the same perplexity, we fell back on alphabetical order. This criterion is usable, but it can change the output completely. Another option would be to choose randomly, but we have no guarantee that this would be any better than alphabetical order, and with the current rule we can at least trace back why a given word was chosen as the replacement. As for extra information, we could use, for example, part-of-speech tags: although here they do not seem to make a large difference between frequency and perplexity, they would add some context and information to the model and help with choosing the correct word.

In [23]:
prep = final_dictionary
freq = best_replacements

# the dictionaries are keyed by 1-based sentence ID, while `sentences` is 0-indexed
for i in prep:
    if prep[i][0] == freq[i][0]:
        #print(mistakes[i], ": the same replacement found")
        continue
    else:
        print()
        print(sentences[i - 1])
        print(mistakes[i], prep[i], freq[i])
        


I could not help but grozn in frustration when my computer trashed right before I finished my project.
grozn ('groin', 1858.1347148669051) ('grown', 1275)

the roofs of the old house needed to he repaired before the wintr.
wintr ('wind', 1857.4043179891894) ('with', 257465)

munors are not allowed to purchase cigarettes or alcohol.
munors ('jurors', 9336.433652424052) ('minor', 654)

the tumor was remove surgicaly.
surgicaly ('survival', 18918.657040618258) ('strictly', 548)

I marked the dats on my calendar so I would not forger.
dats ('date', 2274.1923660774823) ('days', 15592)

I asked my collegue for there opinion on the matter.
collegue ('colleague', 888.999722318408) ('college', 4344)

due to the economic situation, main employees were layed off from their jobs.
layed ('Fayed', 6625.174599990876) ('played', 2870)

many poets wrote in giambic pentameter.
giambic ('iambic', 31291.168009478923) ('gambit', 19)
