# Week 1 - Second Year Project

---

Welcome to the weekly assignment of week 1. The assignments are split up per lecture. You can upload your solutions on LearnIt.

# Lecture 1. What are words?


## 1. Regular Expressions
For this section, it might be handy to use the website https://regex101.com/ to test your solutions.

- a) Write a regular expression (regex or pattern) that matches any of the following words: `cat`, `sat`, `mat`.
<br>
(Bonus: What is a possible long solution? Can you find a shorter solution? *hint*: match characters instead of words)
- b) Write a regular expression that matches numbers, e.g. 12, 1,000, 39.95
- c) Expand the previous solution to match Danish price indications, e.g., `1,000 kr` or `39.95 DKK` or `19.95`.

a) 
- Long solution <br>
  `\b(?:cat|sat|mat)\b`
- Short solution <br>
  `[csm]at`


b) `\b\d{1,3}(?:,\d{3})*(?:\.\d+)?\b`

c) `\b\d{1,3}(?:,\d{3})*(?:\.\d+)?\s*(kr|DKK)?\b`
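The patterns above can also be checked locally instead of on regex101. The following is a minimal sketch, assuming `re.fullmatch` is an acceptable way to test a single candidate string; the helper name `matches` and the extra test strings `aaa` and `1000` are illustrative additions, not part of the assignment.

```python
import re

# Patterns for a), b) and c); [csm]at is a character-class variant for a).
short_a = re.compile(r"[csm]at")
number = re.compile(r"\b\d{1,3}(?:,\d{3})*(?:\.\d+)?\b")
price = re.compile(r"\b\d{1,3}(?:,\d{3})*(?:\.\d+)?\s*(kr|DKK)?\b")

def matches(pattern, examples):
    """Map each example string to True if the whole string matches the pattern."""
    return {s: pattern.fullmatch(s) is not None for s in examples}

a_results = matches(short_a, ["cat", "sat", "mat", "aaa"])
b_results = matches(number, ["12", "1,000", "39.95", "1000"])
c_results = matches(price, ["1,000 kr", "39.95 DKK", "19.95"])

print(a_results)
print(b_results)
print(c_results)
```

Note that `1000` (written without a thousands separator) is not matched as a whole by the number pattern, because `\d{1,3}` allows at most three leading digits. Whether that is acceptable depends on which number formats you expect.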

## 2. Tokenization

(Adapted notebook from S. Riedel, UCL & Facebook: https://github.com/uclnlp/stat-nlp-book).

In Python, a simple way to tokenize a text is via the `split` method, which divides a text wherever a particular substring is found. In the code below this substring is a single space character, which seems like a reasonable starting point for an English tokenization approach. Note that other whitespace, such as the newline, is not treated as a separator.

In [1]:
text = "Mr. Bob Dobolina is thinkin' of a master plan." + \
       "\nWhy doesn't he quit?"
text.split(" ")

['Mr.',
 'Bob',
 'Dobolina',
 'is',
 "thinkin'",
 'of',
 'a',
 'master',
 'plan.\nWhy',
 "doesn't",
 'he',
 'quit?']

To make more fine-grained decisions, we will focus on using regular expressions for tokenization in this assignment. This can be done by either:
1. Defining the character sequence patterns at which to split.
2. Specifying patterns that define what constitutes a token.

In the code below we use a simple pattern `\s` that matches **any whitespace** to define where to split.

In [2]:
import re
gap = re.compile(r'\s')  # raw string, so that \s reaches the regex engine unchanged
gap.split(text)

['Mr.',
 'Bob',
 'Dobolina',
 'is',
 "thinkin'",
 'of',
 'a',
 'master',
 'plan.',
 'Why',
 "doesn't",
 'he',
 'quit?']

One **shortcoming** of this tokenization is its treatment of punctuation because it considers `plan.` as a token whereas ideally we would prefer `plan` and `.` to be distinct tokens. It might be easier to address this problem if we define what a token is, instead of what constitutes a gap. Below we have defined tokens as sequences of alphanumeric characters and punctuation.

In [3]:
token = re.compile(r'\w+|[.?:]')
token.findall(text)

['Mr',
 '.',
 'Bob',
 'Dobolina',
 'is',
 'thinkin',
 'of',
 'a',
 'master',
 'plan',
 '.',
 'Why',
 'doesn',
 't',
 'he',
 'quit',
 '?']

This still isn't perfect as `Mr.` is split into two tokens, but it should be a single token. Moreover, we have actually lost an apostrophe. Both are fixed below, although we now fail to break up the contraction `doesn't`.

In [4]:
token = re.compile(r"Mr\.|[\w']+|[.?]")  # \. matches a literal period; a bare . would match any character
tokens = token.findall(text)
tokens

['Mr.',
 'Bob',
 'Dobolina',
 'is',
 "thinkin'",
 'of',
 'a',
 'master',
 'plan',
 '.',
 'Why',
 "doesn't",
 'he',
 'quit',
 '?']
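As a side-by-side comparison of the two approaches (splitting at gaps versus matching tokens), here is a small illustrative sketch. The sample string and the generic token pattern `\w+|[^\w\s]` are assumptions made for this example and are not part of the original notebook.

```python
import re

sample = "Hello,  world!"  # note the two spaces

# Approach 1: define the gaps. A single \s yields an empty string between
# the two spaces, while \s+ treats a run of whitespace as one gap.
split_single = re.split(r"\s", sample)
split_run = re.split(r"\s+", sample)

# Approach 2: define the tokens. Here a token is a run of word characters
# or any single character that is neither a word character nor whitespace.
found = re.findall(r"\w+|[^\w\s]", sample)

print(split_single)
print(split_run)
print(found)
```

Defining tokens makes it easy to separate punctuation from words, whereas defining gaps leaves punctuation attached to the neighbouring word.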

In the code below, we have an input text and apply the tokenizer (described previously) on the text:

In [6]:
import re
text = """'Curiouser and curiouser!' cried Alice (she was so much surprised, that for the moment she quite
forgot how to speak good English); 'now I'm opening out like the largest telescope that ever was! Good-bye,
feet!' (for when she looked down at her feet, they seemed to be almost out of sight, they were getting so far
off). 'Oh, my poor little feet, I wonder who will put on your shoes and stockings for you now, dears? I'm sure I
shan't be able! I shall be a great deal too far off to trouble myself about you: you must manage the best
way you can; —but I must be kind to them,' thought Alice, 'or perhaps they won't walk the way I want to go!
Let me see: I'll give them a new pair of boots every Christmas...'
"""

token = re.compile(r"Mr\.|[\w']+|[.?]")
tokens = token.findall(text)
print(tokens) # print(tokens[:10])
print(len(tokens))

["'Curiouser", 'and', 'curiouser', "'", 'cried', 'Alice', 'she', 'was', 'so', 'much', 'surprised', 'that', 'for', 'the', 'moment', 'she', 'quite', 'forgot', 'how', 'to', 'speak', 'good', 'English', "'now", "I'm", 'opening', 'out', 'like', 'the', 'largest', 'telescope', 'that', 'ever', 'was', 'Good', 'bye', 'feet', "'", 'for', 'when', 'she', 'looked', 'down', 'at', 'her', 'feet', 'they', 'seemed', 'to', 'be', 'almost', 'out', 'of', 'sight', 'they', 'were', 'getting', 'so', 'far', 'off', '.', "'Oh", 'my', 'poor', 'little', 'feet', 'I', 'wonder', 'who', 'will', 'put', 'on', 'your', 'shoes', 'and', 'stockings', 'for', 'you', 'now', 'dears', '?', "I'm", 'sure', 'I', "shan't", 'be', 'able', 'I', 'shall', 'be', 'a', 'great', 'deal', 'too', 'far', 'off', 'to', 'trouble', 'myself', 'about', 'you', 'you', 'must', 'manage', 'the', 'best', 'way', 'you', 'can', 'but', 'I', 'must', 'be', 'kind', 'to', 'them', "'", 'thought', 'Alice', "'or", 'perhaps', 'they', "won't", 'walk', 'the', 'way', 'I', 'wan

Questions:

* a) The tokenizer clearly makes a few mistakes. Where?

* b) Write a tokenizer to tokenize the text correctly by your own definition.

* c) Should one separate `'m`, `'ll`, `n't`, possessives, and other forms of contractions from the word? Implement a tokenizer that separates these, and attaches the `'` to the latter part of the contraction.

* d) Should an ellipsis (...) be considered as three `.`s or one `...`? Design a regular expression for both solutions.


a) 
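- Quotation marks (`'`) are not separate tokens; they stay attached to the neighbouring word, e.g. `'Curiouser` and `'now`.
- Several punctuation marks are dropped entirely: `!`, `,`, `:`, `;`, `(`, `)`, `-`, and `—`. As a result, `Good-bye` becomes `Good` and `bye`.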

b) 
```
\b\w+(?:'\w+)?\b|[.,!?;:()—'\"…]
```

- `\b\w+(?:'\w+)?\b` matches whole words, including those with an internal apostrophe followed by more word characters (contractions and possessive endings), so that they are kept as a single token. The `\b` on both ends ensures we are matching whole words.
- `[.,!?;:()—'\"…]` matches individual punctuation marks and special characters, ensuring they are treated as separate tokens.


In [17]:
tokenizer_regex = r"\b\w+(?:'\w+)?\b|[.,!?;:()—'\"…]"

tokens = re.findall(tokenizer_regex, text)

print(tokens)
print(len(tokens))

["'", 'Curiouser', 'and', 'curiouser', '!', "'", 'cried', 'Alice', '(', 'she', 'was', 'so', 'much', 'surprised', ',', 'that', 'for', 'the', 'moment', 'she', 'quite', 'forgot', 'how', 'to', 'speak', 'good', 'English', ')', ';', "'", 'now', "I'm", 'opening', 'out', 'like', 'the', 'largest', 'telescope', 'that', 'ever', 'was', '!', 'Good', 'bye', ',', 'feet', '!', "'", '(', 'for', 'when', 'she', 'looked', 'down', 'at', 'her', 'feet', ',', 'they', 'seemed', 'to', 'be', 'almost', 'out', 'of', 'sight', ',', 'they', 'were', 'getting', 'so', 'far', 'off', ')', '.', "'", 'Oh', ',', 'my', 'poor', 'little', 'feet', ',', 'I', 'wonder', 'who', 'will', 'put', 'on', 'your', 'shoes', 'and', 'stockings', 'for', 'you', 'now', ',', 'dears', '?', "I'm", 'sure', 'I', "shan't", 'be', 'able', '!', 'I', 'shall', 'be', 'a', 'great', 'deal', 'too', 'far', 'off', 'to', 'trouble', 'myself', 'about', 'you', ':', 'you', 'must', 'manage', 'the', 'best', 'way', 'you', 'can', ';', '—', 'but', 'I', 'must', 'be', 'kind'

c) Separating contractions and possessive forms into their base words and suffixes can be beneficial for certain natural language processing tasks, as it helps in understanding the syntactic and semantic structures of sentences. For example, transforming "I'm" into "I" and "'m" or "Alice's" into "Alice" and "'s" can make it easier to analyze the grammatical roles of words in sentences.

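- `\b\w+\b(?='[smtldve]\b)` matches a word that is directly followed by an apostrophe and one of the letters `s`, `m`, `t`, `l`, `d`, `v`, `e` (as in `'s`, `'m`, `'t`, `'d`), without including the apostrophe part in the match. It uses a positive lookahead to check for the suffix without consuming it. Since the later alternative `\b\w+\b` matches the same words, this branch mainly documents the intent.
- `'n'\b|'re\b|'ve\b|'ll\b|'d\b|'s\b|'m\b|'t\b` matches the contraction parts themselves, including the apostrophe. Note that a negation such as `shan't` is therefore split into `shan` and `'t`, not into `sha` and `n't`.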
- `\b\w+\b` matches standalone words.
- `[.,!?;:()—'\"…]` matches individual punctuation marks and special characters.

In [20]:
# Tokenizer regex that separates contractions and possessive forms
tokenizer_regex = r"\b\w+\b(?='[smtldve]\b)|'n'\b|'re\b|'ve\b|'ll\b|'d\b|'s\b|'m\b|'t\b|\b\w+\b|[.,!?;:()—'\"…]"

tokens = re.findall(tokenizer_regex, text)

print(tokens)
print(len(tokens))

["'", 'Curiouser', 'and', 'curiouser', '!', "'", 'cried', 'Alice', '(', 'she', 'was', 'so', 'much', 'surprised', ',', 'that', 'for', 'the', 'moment', 'she', 'quite', 'forgot', 'how', 'to', 'speak', 'good', 'English', ')', ';', "'", 'now', 'I', "'m", 'opening', 'out', 'like', 'the', 'largest', 'telescope', 'that', 'ever', 'was', '!', 'Good', 'bye', ',', 'feet', '!', "'", '(', 'for', 'when', 'she', 'looked', 'down', 'at', 'her', 'feet', ',', 'they', 'seemed', 'to', 'be', 'almost', 'out', 'of', 'sight', ',', 'they', 'were', 'getting', 'so', 'far', 'off', ')', '.', "'", 'Oh', ',', 'my', 'poor', 'little', 'feet', ',', 'I', 'wonder', 'who', 'will', 'put', 'on', 'your', 'shoes', 'and', 'stockings', 'for', 'you', 'now', ',', 'dears', '?', 'I', "'m", 'sure', 'I', 'shan', "'t", 'be', 'able', '!', 'I', 'shall', 'be', 'a', 'great', 'deal', 'too', 'far', 'off', 'to', 'trouble', 'myself', 'about', 'you', ':', 'you', 'must', 'manage', 'the', 'best', 'way', 'you', 'can', ';', '—', 'but', 'I', 'must', 
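In the output above, `shan't` is split as `shan` + `'t`, whereas the question lists `n't` as the unit to separate. One possible way to get `n't` as its own token is sketched below. The pattern, the variable names, and the sample sentence are assumptions for illustration; it follows the common convention of splitting `shan't` into `sha` + `n't`, and it is not meant as the reference solution.

```python
import re

contraction_tokenizer = re.compile(
    r"\w+(?=n't)"              # word stem directly before n't (e.g. 'sha' in shan't)
    r"|n't"                    # the negation itself
    r"|'(?:m|ll|re|ve|d|s)\b"  # other contraction suffixes and possessive 's
    r"|\w+"                    # ordinary words
    r"|[^\w\s]"                # any other non-space character as its own token
)

sample = "I'm sure I shan't; they won't, I'll see Alice's hat."
contraction_tokens = contraction_tokenizer.findall(sample)
print(contraction_tokens)
```

The order of the alternatives matters: the stem-before-`n't` branch has to come before the general `\w+` branch, otherwise the whole of `shan` would be consumed first.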

d) It depends on the use case. In some contexts, it might be preferable to consider an ellipsis as a single token (...), representing a pause or omission. In other cases, one might want to treat each period in an ellipsis as a separate token (.), especially if you're analyzing punctuation usage or performing syntactic parsing where the precise number of punctuation marks is relevant.

Treating Ellipsis as One Token
- `\b\w+\b|[\w']+` matches words, including those with apostrophes.
- `\.{3}` matches three consecutive periods as a single token. It has to come before the punctuation class in the alternation; otherwise each period would be matched on its own. A character class such as `[...]` does not work here, because it matches a single `.`.
- `[.,!?;:()—'\"…]` matches the remaining individual punctuation marks and special characters, including the single-character ellipsis `…`.

Treating Each Period in an Ellipsis Separately
- The regex is the same as before but without the `\.{3}` alternative. The periods of an ellipsis are then matched one at a time by the punctuation class, like any other period.

In [21]:
# Treating Ellipsis as One Token
tokenizer_regex_ellipsis_single = r"\b\w+\b|[\w']+|\.{3}|[.,!?;:()—'\"…]"

# Treating Each Period in an Ellipsis Separately
tokenizer_regex_ellipsis_separate = r"\b\w+\b|[\w']+|[.,!?;:()—'\"…]"
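To see the two treatments on a concrete string, here is a small sketch with simplified patterns. The sample string and variable names are made up for this example. It also shows why a character class of dots cannot stand in for a literal three-dot pattern.

```python
import re

sample = "Wait... what?"

# One token per ellipsis: the three-dot alternative is tried before the
# single-mark class.
single = re.findall(r"\w+|\.{3}|[.?!]", sample)

# Three tokens per ellipsis: only the single-mark class is available.
separate = re.findall(r"\w+|[.?!]", sample)

# A character class made of dots matches one period at a time.
char_class = re.findall(r"[...]", sample)

print(single)
print(separate)
print(char_class)
```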

## 3. Twitter Tokenization
As you might imagine, tokenizing tweets differs from standard tokenization. There are 'rules' on what specific elements of a tweet might be (mentions, hashtags, links), and how they are tokenized. The goal of this exercise is not to create a bullet-proof Twitter tokenizer but to understand tokenization in a different domain.

In the next exercises, we will focus on the following tweet:

In [22]:
tweet = "@robv New vids coming tomorrow #excited_as_a_child, can't w8!!"

In [23]:
token = re.compile('[\w]+')
tokens = token.findall(tweet)
print(tokens)

['robv', 'New', 'vids', 'coming', 'tomorrow', 'excited_as_a_child', 'can', 't', 'w8']


Questions:
- a) What is the correct tokenization of the tweet above according to you?
- b) Try your tokenizer from the previous exercise (Question 2). Which cases are going wrong? Rewrite your tokenizer such that it handles the above tweet correctly.
- c) How will your tokenizer handle emojis?
- d) Think of at least one example where your tokenizer (from b) will behave incorrectly.

a) 
- User handles (`@robv`) should be kept as single tokens, including the `@`.
- Hashtags (`#excited_as_a_child`) should be kept as single tokens, including the `#`. <br>
The correct tokenization of the above tweet would be:

```
['@robv', 'New', 'vids', 'coming', 'tomorrow', '#excited_as_a_child', ',', 'can', "'t", 'w8', '!', '!']
```

b) The previous regex does not handle user handles and hashtags correctly: it drops the `@` and `#` characters.

In [25]:
tokens = re.findall(tokenizer_regex, tweet)
print(tokens)

['robv', 'New', 'vids', 'coming', 'tomorrow', 'excited_as_a_child', ',', 'can', "'t", 'w8', '!', '!']


Fix: 
- `@\w+` to match usernames. This pattern captures the @ symbol followed by one or more word characters.
- `#\w+` to match hashtags. This pattern captures the # symbol followed by one or more word characters.
- The rest of the pattern is similar to the previous version but has been slightly adjusted to ensure it can still match contractions, punctuation, and words correctly.

In [27]:
# Fix
tokenizer_regex_tweet = r"@\w+|#\w+|\b\w+\b(?='[smtldve]n?\b)|'n'\b|'re\b|'ve\b|'ll\b|'d\b|'s\b|'m\b|'t\b|\b\w+\b|[.,!?;:()—'\"…]"
tokens = re.findall(tokenizer_regex_tweet, tweet)
print(tokens)

['@robv', 'New', 'vids', 'coming', 'tomorrow', '#excited_as_a_child', ',', 'can', "'t", 'w8', '!', '!']


c) Emojis are not handled: they are silently dropped from the output. Emojis are Unicode characters that are matched neither by `\w` (letters, digits, and the underscore) nor by the punctuation marks listed explicitly in the character class. To handle emojis in text tokenization, I would need to include a pattern that matches Unicode emoji characters.


d) example: "Excited for the launch @midnight!!! 🚀 #new_beginnings... Wait, isn't it @midnight's show? 😕"

In [28]:
tweet_2 = "Excited for the launch @midnight!!! 🚀 #new_beginnings... Wait, isn't it @midnight's show? 😕"
tokens_2 = re.findall(tokenizer_regex_tweet, tweet_2)
print(tokens_2)

['Excited', 'for', 'the', 'launch', '@midnight', '!', '!', '!', '#new_beginnings', '.', '.', '.', 'Wait', ',', 'isn', "'t", 'it', '@midnight', "'s", 'show', '?']
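The output above contains no emoji tokens. One way to keep them is to append an emoji character class to the tweet regex, as in the sketch below. The two code point ranges are an assumption that covers many common emojis but not all of them (flags, for instance, are not included), and emojis built from several code points would be split into pieces. The names `base`, `emoji`, and `tokenizer_regex_emoji` and the sample tweet are introduced only for this example.

```python
import re

# The tweet regex from above, copied so that this block runs on its own.
base = r"@\w+|#\w+|\b\w+\b(?='[smtldve]n?\b)|'n'\b|'re\b|'ve\b|'ll\b|'d\b|'s\b|'m\b|'t\b|\b\w+\b|[.,!?;:()—'\"…]"

# Assumption: two code point ranges that contain many common emojis.
emoji = "[\U0001F300-\U0001FAFF\u2600-\u27BF]"

tokenizer_regex_emoji = base + "|" + emoji

sample = "Excited for the launch @midnight!!! 🚀 Wait, is it today? 😕"
tokens_emoji = re.findall(tokenizer_regex_emoji, sample)
print(tokens_emoji)
```

A simpler alternative is a catch-all branch such as `[^\w\s]`, which turns every remaining non-space character into its own token, at the cost of no longer controlling which symbols are accepted.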


## 4. Segmentation


Sentence segmentation is not a trivial task either.

First, make sure you understand the following sentence segmentation code:

In [29]:
import re

def sentence_segment(match_regex, tokens):
    """
    Splits a sequence of tokens into sentences, splitting wherever the given matching regular expression
    matches.

    Parameters
    ----------
    tokens      the input sequence as list of strings (each item is a ``word'')
    match_regex the regular expression that defines at which token to split.

    Returns
    -------
    a list of token lists, where each inner list represents a sentence.

    >>> tokens = ['the','man','eats','.','She', 'sleeps', '.']
    >>> sentence_segment(re.compile('\.'), tokens)
    [['the', 'man', 'eats', '.'], ['She', 'sleeps', '.']]
    """
    sentences = [[]]
    for tok in tokens:
        sentences[-1].append(tok)
        if match_regex.match(tok):
            sentences.append([])
            
    if sentences[-1] == []:
        del sentences[-1]
    return sentences


In the following code, there is a variable `text` containing a small text and a regular expression-based segmenter:

In [30]:
text = """
Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch is the longest official one-word placename in U.K. Isn't that weird? I mean, someone took the effort to really make this name as complicated as possible, huh?! Of course, U.S.A. also has its own record in the longest name, albeit a bit shorter... This record belongs to the place called Chargoggagoggmanchauggagoggchaubunagungamaugg. There's so many wonderful little details one can find out while browsing http://www.wikipedia.org during their Ph.D. or an M.Sc.
"""

token = re.compile(r"Mr\.|[\w']+|[.?]+")

tokens = token.findall(text)
sentences = sentence_segment(re.compile(r'\.'), tokens)

for sentence in sentences:
    print(sentence)

['Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch', 'is', 'the', 'longest', 'official', 'one', 'word', 'placename', 'in', 'U', '.']
['K', '.']
["Isn't", 'that', 'weird', '?', 'I', 'mean', 'someone', 'took', 'the', 'effort', 'to', 'really', 'make', 'this', 'name', 'as', 'complicated', 'as', 'possible', 'huh', '?', 'Of', 'course', 'U', '.']
['S', '.']
['A', '.']
['also', 'has', 'its', 'own', 'record', 'in', 'the', 'longest', 'name', 'albeit', 'a', 'bit', 'shorter', '...']
['This', 'record', 'belongs', 'to', 'the', 'place', 'called', 'Chargoggagoggmanchauggagoggchaubunagungamaugg', '.']
["There's", 'so', 'many', 'wonderful', 'little', 'details', 'one', 'can', 'find', 'out', 'while', 'browsing', 'http', 'www', '.']
['wikipedia', '.']
['org', 'during', 'their', 'Ph', '.']
['D', '.']
['or', 'an', 'M', '.']
['Sc', '.']


Questions:
- a) Improve the segmenter so that it segments the text in the way you think is correct.
- b) How could you deal with all URLs effectively?

a) and b)
- The call above passes `\.` as the splitting pattern, so `sentence_segment` only treats tokens that start with a period as sentence delimiters. That is insufficient for this text, which also contains question marks and exclamation points, and it wrongly splits inside abbreviations and the URL.
- First, adjust the tokenizer regex to better capture words, contractions, and a broader range of punctuation marks, and to match URLs as single tokens (`https?://\S+`) so that they are not split at their periods.

In [36]:
# Updated tokenizer regex to include URLs, and capture sentence-ending punctuation as separate tokens
token_regex = re.compile(r'https?://\S+|\b(?:[A-Z]+\.)+\b|\b[A-Za-z]+\.[A-Za-z]+\.\b|\w+\'\w+|\w+|[.!?]+')

tokens = token_regex.findall(text)

def improved_sentence_segment(tokens):
    """
    Splits a sequence of tokens into sentences, splitting at sentence-ending punctuation.

    Parameters
    ----------
    tokens: List of strings, where each string is a token.

    Returns
    -------
    List of lists, where each inner list represents a sentence.
    """
    sentences = [[]]
    for tok in tokens:
        sentences[-1].append(tok)
        if re.match('[.!?]', tok):  # Adjusted to match any sentence-ending punctuation
            sentences.append([])

    if not sentences[-1]:
        del sentences[-1]
    return sentences

# Segment into sentences
sentences = improved_sentence_segment(tokens)

# Print each sentence
for sentence in sentences:
    print(sentence)

['Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch', 'is', 'the', 'longest', 'official', 'one', 'word', 'placename', 'in', 'U.', 'K', '.']
["Isn't", 'that', 'weird', '?']
['I', 'mean', 'someone', 'took', 'the', 'effort', 'to', 'really', 'make', 'this', 'name', 'as', 'complicated', 'as', 'possible', 'huh', '?!']
['Of', 'course', 'U.S.', 'A', '.']
['also', 'has', 'its', 'own', 'record', 'in', 'the', 'longest', 'name', 'albeit', 'a', 'bit', 'shorter', '...']
['This', 'record', 'belongs', 'to', 'the', 'place', 'called', 'Chargoggagoggmanchauggagoggchaubunagungamaugg', '.']
["There's", 'so', 'many', 'wonderful', 'little', 'details', 'one', 'can', 'find', 'out', 'while', 'browsing', 'http://www.wikipedia.org', 'during', 'their', 'Ph', '.']
['D', '.']
['or', 'an', 'M.', 'Sc', '.']
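In the output above, the URL is kept intact, but abbreviations such as `U.K.`, `Ph.D.` and `M.Sc.` are still broken up and trigger sentence splits. A possible refinement is sketched below. It rests on two assumptions that are heuristics rather than rules: an abbreviation is a sequence of at least two groups of one or two letters followed by a period, and an abbreviation ends a sentence only if the next token starts with an uppercase letter. The names `ABBREV`, `END`, `TOKEN` and `segment_with_abbreviations`, as well as the shortened sample text, are hypothetical.

```python
import re

ABBREV = re.compile(r"(?:[A-Za-z]{1,2}\.){2,}")  # e.g. U.K., U.S.A., Ph.D., M.Sc.
END = re.compile(r"[.!?]+")                      # sentence-ending punctuation
TOKEN = re.compile(
    r"https?://\S+"            # URLs as single tokens
    r"|(?:[A-Za-z]{1,2}\.){2,}"  # abbreviations, including their periods
    r"|\w+(?:'\w+)?"           # words and contractions
    r"|[.!?]+"                 # runs of sentence-ending punctuation
)

def segment_with_abbreviations(tokens):
    """Split tokens into sentences, treating abbreviations as ambiguous."""
    sentences = [[]]
    for i, tok in enumerate(tokens):
        sentences[-1].append(tok)
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        ends_on_punct = END.fullmatch(tok) is not None
        # Heuristic: an abbreviation closes a sentence only at the end of
        # the text or when the next token is capitalised.
        ends_on_abbrev = ABBREV.fullmatch(tok) is not None and (
            nxt is None or nxt[0].isupper()
        )
        if ends_on_punct or ends_on_abbrev:
            sentences.append([])
    if not sentences[-1]:
        del sentences[-1]
    return sentences

sample = ("It is in the U.K. Isn't that weird? The U.S.A. also has one... "
          "See http://www.wikipedia.org during a Ph.D. or an M.Sc.")
sample_sentences = segment_with_abbreviations(TOKEN.findall(sample))
for sentence in sample_sentences:
    print(sentence)
```

The capitalisation heuristic fails when an abbreviation is followed by a proper noun inside a sentence, and `\S+` will include trailing punctuation if a URL is directly followed by a period or comma.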


# Bonus: Tokenization competition

Tokenization of social media can be more challenging. We provide a small development set for you to experiment with, which you can find in `week1/tok.dev.txt`. The file is a tab-separated file, where the first column contains the input text, and the second column the gold tokenization (as decided by an annotator).

In [None]:
data = [line.strip().split('\t') for line in open('tok.dev.txt', encoding="utf-8")]
print(data[0])

There is also a test file with the same format, but the gold annotation is missing. This can be found in `week1/tok.test.txt`. You are supposed to develop your tokenizer based on the development data, and then apply your tokenizer on the test data. You can hand in the predictions on the test data on LearnIt in the same slot as the rest of the assignment. We will use F1 score for evaluation.

Make sure that the file you hand in:
- Uses exactly the same format as the dev data (`input\<tab\>output`), where the input and output contain the same characters (except for placement of whitespaces). 
- Has your ITU username as the name of the file: i.e. `robv.txt`.

We have provided an evaluation script for your convenience; it returns the F1 score, recall, and precision. It also prints out all sentences where your model made an error (indicating the error in red if supported by your terminal), and checks whether your output is in the right format. It can be found in `week1/tok_eval.py`.

# Lecture 2: Language (correction)

## 5. Spelling correction

Below is an implementation of the Levenshtein distance. It uses some efficiency tricks, and it is not important that you understand every line of this implementation.

In [37]:
def levenshteinDistance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1

    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2+1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]

print(levenshteinDistance('this', 'that'))

2


We also provide you with an English word list from [Aspell](http://aspell.net/) in `aspell-en-dict.txt`. It can be used as follows:

In [38]:
# Load wordlist (one word per line)
en_dict = set([word.strip() for word in open('aspell-en-dict.txt', encoding="utf-8").readlines()])
    
# Example usage
typo = 'brower'
correction = 'browser'
print(typo, correction)
print(typo in en_dict)
print(correction in en_dict)
print(levenshteinDistance(typo, correction))

brower browser
False
True
1


* a) Implement a (naive) spelling correction system that finds the word in the word list with the smallest minimum edit distance for a word that contains a misspelling. 
* b) There could be multiple words with the smallest minimum edit distance for some typos, what are supplementary methods to re-rank these? (mention at least 2)

a) 
- Function `naiveSpellingCorrection()` iterates through each word in the provided dictionary, calculating the Levenshtein distance to the misspelled word. It keeps track of the word with the smallest distance seen so far.
- This implementation is "naive" because it does not incorporate any optimizations like early stopping for words that are too long or too short compared to the misspelled word, nor does it use more sophisticated methods like trie structures for more efficient lookups.

In [39]:
def naiveSpellingCorrection(misspelled_word, word_list):
    closest_word = None
    min_distance = float('inf')
    
    for word in word_list:
        distance = levenshteinDistance(misspelled_word, word)
        if distance < min_distance:
            min_distance = distance
            closest_word = word
            
    return closest_word, min_distance

typo = 'brower'
correction, distance = naiveSpellingCorrection(typo, en_dict)
print(f"Misspelled word: {typo}")
print(f"Correction: {correction} (Edit distance: {distance})")

Misspelled word: brower
Correction: brewer (Edit distance: 1)

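The function above keeps only the first word it encounters at the smallest distance. Because the word list is stored as a `set`, the iteration order is not guaranteed, so a tie (here `brewer` and `browser` are both one edit away from `brower`) can be resolved differently from run to run. The sketch below collects all tied candidates instead. The function names and the four-word toy vocabulary are assumptions for illustration; the distance function is a plain dynamic-programming version included only to keep the block self-contained.

```python
def levenshtein(a, b):
    """Edit distance with unit costs for insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]

def closest_words(misspelled_word, word_list):
    """Return all words at the minimum edit distance, sorted, plus that distance."""
    best, best_distance = [], float("inf")
    for word in word_list:
        distance = levenshtein(misspelled_word, word)
        if distance < best_distance:
            best, best_distance = [word], distance
        elif distance == best_distance:
            best.append(word)
    return sorted(best), best_distance

toy_vocabulary = {"browser", "brewer", "brown", "bowler"}  # made-up word list
candidates, distance = closest_words("brower", toy_vocabulary)
print(candidates, distance)
```

Returning the full candidate list makes the result reproducible and gives a later re-ranking step something to work with.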

b)
- When multiple candidate corrections for a typo have the same smallest minimum edit distance, supplementary methods can be employed to re-rank these candidates and select the most appropriate correction. Here are two methods for re-ranking:

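1. Word Frequency
- Description: More common words are more likely to be the intended correction than rare ones.
- Implementation: Look up each candidate in a frequency list built from a large corpus and prefer the candidate with the highest count.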
   
2. Contextual Fit (using N-gram)
- Description: The context in which a word appears can significantly influence its likelihood of being the correct correction. Using n-gram models can help assess the fit of each candidate correction within the context of the surrounding words.
- Implementation: For each candidate correction, calculate the probability of the word given its context (the surrounding words) using an n-gram model. The word that results in the highest contextual probability is chosen as the best fit.

Other Methods:

3. Phonetic Similarity: For words with the same edit distance, the one that sounds more similar to the typo could be preferred. Algorithms like Soundex or the Metaphone family (Double Metaphone, Metaphone 3) can be used to compare the phonetic similarity between the typo and candidate corrections.
   
4. Semantic Similarity: Especially useful in context-aware spelling correction, leveraging word embeddings (e.g., Word2Vec, GloVe) can help identify the candidate word that is semantically closest to the expected word in the given context.
   
5. Part-of-Speech Matching: If the part-of-speech (POS) for the misspelled word can be inferred from the context, candidates can be re-ranked based on their POS match. This requires POS tagging of the context and the candidate words.
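The first two methods can be sketched as simple sort keys over the tied candidates. All counts below are invented for illustration; in practice they would come from a corpus. The function names `rerank_by_frequency` and `rerank_by_context` are hypothetical as well.

```python
def rerank_by_frequency(candidates, unigram_counts):
    """Most frequent candidate first; ties broken alphabetically."""
    return sorted(candidates, key=lambda w: (-unigram_counts.get(w, 0), w))

def rerank_by_context(candidates, previous_word, bigram_counts):
    """Prefer the candidate seen most often after the previous word."""
    return sorted(
        candidates,
        key=lambda w: (-bigram_counts.get((previous_word, w), 0), w),
    )

# Made-up counts, for illustration only.
unigram_counts = {"browser": 120, "brewer": 3}
bigram_counts = {("web", "browser"): 10, ("beer", "brewer"): 4}

tied = ["brewer", "browser"]
by_frequency = rerank_by_frequency(tied, unigram_counts)
after_web = rerank_by_context(tied, "web", bigram_counts)
after_beer = rerank_by_context(tied, "beer", bigram_counts)
print(by_frequency, after_web, after_beer)
```

Raw counts are used here for simplicity; a real n-gram model would work with smoothed probabilities so that unseen word pairs do not all collapse to zero.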

## 6. Evaluation and Analysis of spelling correction

We also provide you with a list of 100 typos and their corrections from the [GitHub Typo Corpus](https://aclanthology.org/2020.lrec-1.835/) in `typos.txt`. It can be used as follows:

In [40]:
# Load github typo corpus misspellings
typos = []
corrections = []
for line in open('typos.txt', encoding="utf-8"):
    tok = line.strip().split('\t')
    typos.append(tok[0])
    corrections.append(tok[1])
    
# Example usage
print(typos[0], corrections[0])

browers browser


* a) Evaluate the spelling correction system you implemented in the previous assignment with accuracy. How many of the words did it correct right?
* b) Now evaluate the errors, can you identify some common causes (i.e. trends) in the mistakes of your model?

a)

In [41]:
# Load typos and corrections
typos_path = './typos.txt'
typos = []
corrections = []

with open(typos_path, 'r', encoding="utf-8") as file:
    for line in file:
        tok = line.strip().split('\t')
        typos.append(tok[0])
        corrections.append(tok[1])

# Evaluate the spelling correction system
corrected_count = 0
for typo, expected in zip(typos, corrections):
    correction, _ = naiveSpellingCorrection(typo, en_dict)
    if correction == expected:
        corrected_count += 1

accuracy = corrected_count / len(typos)
print(f"Accuracy: {accuracy*100:.2f}%")
print(f"Corrected {corrected_count} out of {len(typos)} typos.")


Accuracy: 48.00%
Corrected 48 out of 100 typos.


In [42]:
# print typos and their corrections
for typo, expected in zip(typos, corrections):
    correction, _ = naiveSpellingCorrection(typo, en_dict)
    print(f"Typo: {typo}, Expected: {expected}, Correction: {correction}")

Typo: browers, Expected: browser, Correction: rowers
Typo: asigned, Expected: assigned, Correction: aligned
Typo: hanlde, Expected: handle, Correction: hands
Typo: poining, Expected: pointing, Correction: poising
Typo: pittsburg, Expected: pittsburgh, Correction: Pittsburgh
Typo: inequlity, Expected: inequality, Correction: inequality
Typo: exeption, Expected: exception, Correction: exception
Typo: soem, Expected: some, Correction: stem
Typo: meagniful, Expected: meaningful, Correction: manful
Typo: securiy, Expected: security, Correction: security
Typo: meassure, Expected: measure, Correction: reassure
Typo: dessa, Expected: essa, Correction: Odessa
Typo: buiild, Expected: build, Correction: build
Typo: aproach, Expected: approach, Correction: approach
Typo: oldeer, Expected: older, Correction: older
Typo: aroung, Expected: around, Correction: around
Typo: repsond, Expected: respond, Correction: resend
Typo: explicliy, Expected: explicitly, Correction: explicit
Typo: tranlate, Expecte
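The listing above shows at least one case (`pittsburg` corrected to `Pittsburgh`) that is counted as wrong only because of letter case. A small sketch of an accuracy function with an optional case-insensitive comparison is given below. The function name `accuracy` and its `ignore_case` parameter are assumptions; the four prediction and gold pairs are taken from the first lines of the listing.

```python
def accuracy(predictions, gold, ignore_case=False):
    """Fraction of predictions equal to the gold corrections."""
    if ignore_case:
        predictions = [p.lower() for p in predictions]
        gold = [g.lower() for g in gold]
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Four (prediction, gold) pairs copied from the listing above.
predictions = ["rowers", "Pittsburgh", "inequality", "exception"]
gold = ["browser", "pittsburgh", "inequality", "exception"]

strict = accuracy(predictions, gold)
relaxed = accuracy(predictions, gold, ignore_case=True)
print(strict, relaxed)
```

Reporting both numbers separates genuine mis-corrections from mismatches that are purely a matter of capitalisation.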

b) 

1. Phonetic Similarity Ignored: The model does not consider phonetic similarity between words. For example, it corrected "browers" to "rowers" instead of "browser". A phonetically aware correction method might have performed better for such cases.

2. Lack of Contextual Understanding: The system does not take into account the context in which a word is used. This can lead to incorrect corrections where the correct word is dependent on the surrounding text. For example, "hanlde" was corrected to "hands" instead of "handle", which a context-aware system might have correctly identified.

3. Frequency Ignorance: The correction system does not consider the frequency of word usage. More common words are more likely to be the intended correction than less common ones. For instance, "soem" was corrected to "stem" instead of the more common word "some". Incorporating word frequency could improve accuracy by prioritizing more likely corrections.

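4. Handling of Proper Nouns and Specialized Terms: The system might struggle with proper nouns or specialized terms, especially when such words are not present or are underrepresented in the dictionary used for corrections. Letter case matters as well: "pittsburg" was corrected to "Pittsburgh", which counts as an error only because the expected form in the corpus is the lowercase "pittsburgh".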

5. Overfitting to Dictionary Words: Since the correction is based on the closest match within a dictionary, the system might "overfit" to incorrect words that are closer in edit distance but not the right correction in context. For example, "poining" was corrected to "poising" instead of "pointing".

6. Semantic Similarity: The model does not consider the semantic similarity or the meaning of words, which can lead to choosing words that are lexically close but semantically distant from the intended correction.

To address these issues, enhancements like incorporating a phonetic algorithm (e.g., Soundex or Metaphone), using context-aware models (e.g., n-gram models or neural language models like BERT for contextual prediction), and including word frequency data could significantly improve the performance and accuracy of the spelling correction system.