# Week 1 - NLP and Deep Learning

---

Welcome to the weekly assignment of week 1. Note that the assignments are split up per lecture. 
Please upload your solutions on LearnIt as a notebook file to receive feedback.


# Lecture 1. What are words?


## 1. Regular Expressions
For this section, it might be handy to use the website https://regex101.com/ to test your solutions.

- a) Write a regular expression (regex or pattern) that matches any of the following words: `cat`, `sat`, `mat`.
<br>
(Bonus: What is a possible long solution? Can you find a shorter solution? *hint*: match characters instead of words)
- b) Write a regular expression that matches numbers, e.g. `12`, `1,000`, `39.95`.
- c) Expand the previous solution to match Danish price indications, e.g., `1,000 kr` or `39.95 DKK` or `19.95`.

In [159]:
import re

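# a) Alternation over the three words.
text1 = "the cat is sat on the mat"
token1 = re.compile(r'cat|sat|mat')
print(token1.findall(text1))

# b) Digits, optionally followed by groups of a separator (',' or '.') and more digits.
#    Unlike \d+[,.]?\d+, this also matches single-digit numbers such as 3.
text2 = "1.000 bøfer blev 3 dage for gamle, og blev sat ned til 39,95 dkk, 40 menesker blev glade"
token2 = re.compile(r'\d+(?:[,.]\d+)*')
print(token2.findall(text2))

# c) A number followed by an optional space and a currency marker (kr or DKK).
#    Bare prices without a currency marker (e.g. 19.95) are not matched by this pattern.
text3 = "1.000 bøfer 12/4/4 blev 3 dage for gamle, og blev sat ned fra 100 kr til 39,95 DKK, 40 menesker blev glade"
token3 = re.compile(r'\d+(?:[,.]\d+)*\s?(?:kr|DKK)')
print(token3.findall(text3))



['cat', 'sat', 'mat']
['1.000', '3', '39,95', '40']
['100 kr', '39,95 DKK']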
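
For the bonus in (a), one possible shorter pattern uses a character class for the first letter instead of listing whole words. The sketch below is only an illustration; the word boundaries `\b` are an added assumption (not required by the exercise) so that longer words such as `category` are not matched.

```python
import re

# Character class for the first letter, followed by the shared suffix "at".
short_pattern = re.compile(r'\b[csm]at\b')

matches = short_pattern.findall("the cat is sat on the mat")
print(matches)

# Thanks to \b, the "cat" inside "category" is not matched.
partial = short_pattern.findall("category cat")
print(partial)
```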


## 2. Tokenization

**Note that we use the old-fashioned meaning of tokenization here, which is the task of word segmentation (not to be confused with subword tokenization, which will be introduced in a later lecture).**

(Adapted notebook from S. Riedel, UCL & Facebook: https://github.com/uclnlp/stat-nlp-book).

In Python, a simple way to tokenize a text is via the `split` method that divides a text wherever a particular substring is found. In the code below this pattern is simply the whitespace character, and this seems like a reasonable starting point for an English tokenization approach.

In [160]:
text = """Mr. Bob Dobolina is thinkin' of a master plan.
Why doesn't he quit?"""
text.split(" ")

['Mr.',
 'Bob',
 'Dobolina',
 'is',
 "thinkin'",
 'of',
 'a',
 'master',
 'plan.\nWhy',
 "doesn't",
 'he',
 'quit?']

To make more fine-grained decisions, we will focus on using regular expressions for tokenization in this assignment. This can be done by either:
1. Defining the character sequence patterns at which to split.
2. Specifying patterns that define what constitutes a token.

In the code below we use a simple pattern `\s` that matches **any whitespace** to define where to split.

In [161]:
import re
gap = re.compile(r'\s')
gap.split(text)

['Mr.',
 'Bob',
 'Dobolina',
 'is',
 "thinkin'",
 'of',
 'a',
 'master',
 'plan.',
 'Why',
 "doesn't",
 'he',
 'quit?']

One **shortcoming** of this tokenization is its treatment of punctuation because it considers `plan.` as a token whereas ideally we would prefer `plan` and `.` to be distinct tokens. It might be easier to address this problem if we define what a token is, instead of what constitutes a gap. Below we have defined tokens as sequences of alphanumeric characters and punctuation.

In [162]:
token = re.compile(r'\w+|[.?:]')
token.findall(text)

['Mr',
 '.',
 'Bob',
 'Dobolina',
 'is',
 'thinkin',
 'of',
 'a',
 'master',
 'plan',
 '.',
 'Why',
 'doesn',
 't',
 'he',
 'quit',
 '?']

This still isn't perfect as `Mr.` is split into two tokens, but it should be a single token. Moreover, we have actually lost an apostrophe. Both are fixed below, although we now fail to break up the contraction `doesn't`.

In [163]:
token = re.compile(r'Mr\.|[\w\']+|[.?]')  # escape the dot so that only a literal "Mr." matches
tokens = token.findall(text)
tokens

['Mr.',
 'Bob',
 'Dobolina',
 'is',
 "thinkin'",
 'of',
 'a',
 'master',
 'plan',
 '.',
 'Why',
 "doesn't",
 'he',
 'quit',
 '?']

In the code below, we have an input text and apply the tokenizer (described previously) on the text:

In [164]:
import re
text = """'Curiouser and curiouser!' cried Alice (she was so much surprised, that for the moment she quite
forgot how to speak good English); 'now I'm opening out like the largest telescope that ever was! Good-bye,
feet!' (for when she looked down at her feet, they seemed to be almost out of sight, they were getting so far
off). 'Oh, my poor little feet, I wonder who will put on your shoes and stockings for you now, dears? I'm sure I
shan't be able! I shall be a great deal too far off to trouble myself about you: you must manage the best
way you can; —but I must be kind to them,' thought Alice, 'or perhaps they won't walk the way I want to go!
Let me see: I'll give them a new pair of boots every Christmas... . ...... .. .'
"""

token = re.compile(r'Mr\.|[\w\']+|[.?]')  # escape the dot so that only a literal "Mr." matches
tokens = token.findall(text)
print(tokens[:10])
print(len(tokens))

["'Curiouser", 'and', 'curiouser', "'", 'cried', 'Alice', 'she', 'was', 'so', 'much']
157


Questions:

* a) The tokenizer clearly makes a few mistakes with punctuation. Where?

* b) Write a tokenizer that separates every punctuation character into its own token. Note that there is no `\p` for punctuation in Python's `re` (so either make a list yourself, or specify that it is any character which is not alphanumeric or whitespace).

* c) Should one separate `'m`, `'ll`, `n't`, possessives, and other forms of contractions from the word? Implement a tokenizer that separates these, and attaches the `'` to the latter part of the contraction (so that I'm -> I 'm and can't -> can 't).

* d) Should an ellipsis (...) be considered as three `.`s or one `...`? Design a regular expression that keeps the `.`s together in one token.


Answers
* a) 
1: The `'` before the first word should be a separate token. 2: We lose the `!` after `curiouser`, and the `(` is also lost.


* b)



In [165]:
token = re.compile(r'\w+|[.?,!()]')
tokens = token.findall(text)
print(tokens[:10])
print(len(tokens))

['Curiouser', 'and', 'curiouser', '!', 'cried', 'Alice', '(', 'she', 'was', 'so']
176
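
The pattern above lists the punctuation marks by hand, so characters such as `;`, `:` or `-` are still dropped. As the question hints, an alternative is to treat any character that is neither alphanumeric nor whitespace as punctuation. A minimal sketch on a short sample string (chosen here for illustration, not part of the assignment data):

```python
import re

# \w+      : a run of alphanumeric characters (a word)
# [^\w\s]  : any single character that is neither alphanumeric nor whitespace
punct_token = re.compile(r'\w+|[^\w\s]')

punct_sample = "Good-bye, feet! (so far off); 'Oh'"
punct_tokens = punct_token.findall(punct_sample)
print(punct_tokens)
```

With this variant the hyphen in `Good-bye` also becomes its own token, which may or may not be desirable depending on the tokenization guidelines.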


* c)

In [166]:
token = re.compile(r'\w+|\'\w*|[.?,!()\']')
tokens = token.findall(text)
print(tokens[-20:])
print(len(tokens))

['new', 'pair', 'of', 'boots', 'every', 'Christmas', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', "'"]
180


* d)
They should be kept together.

In [167]:
token = re.compile(r'\w+|\'\w*|[?,!()\']|\.+')
tokens = token.findall(text)
print(tokens[-20:])
print(len(tokens))

['Let', 'me', 'see', 'I', "'ll", 'give', 'them', 'a', 'new', 'pair', 'of', 'boots', 'every', 'Christmas', '...', '.', '......', '..', '.', "'"]
172


## 3. Twitter Tokenization
As you might imagine, tokenizing tweets differs from standard tokenization. There are 'rules' on what specific elements of a tweet might be (mentions, hashtags, links), and how they are tokenized. The goal of this exercise is not to create a bullet-proof Twitter tokenizer but to understand tokenization in a different domain.

In the next exercises, we will focus on the following tweet:

In [168]:
tweet = "@robv New vids coming tomorrow #excited_as_a_child, can't w8!!"

In [169]:
token = re.compile(r'[\w]+')
tokens = token.findall(tweet)
print(tokens)

['robv', 'New', 'vids', 'coming', 'tomorrow', 'excited_as_a_child', 'can', 't', 'w8']


Questions:
- a) What is the correct tokenization of the tweet above according to you?
- b) Try your tokenizer from the previous exercise (Question 2). Which cases are going wrong? Rewrite your tokenizer such that it handles the above tweet correctly.
- c) How will your tokenizer handle emojis?
- d) Think of at least one example where your tokenizer (from b) will behave incorrectly.

Answers
* a) 
1: "@robv", "New", "vids", "coming", "tomorrow", "#excited_as_a_child", ",", "can", "'t", "w8", "!!"

* b)



In [170]:
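# Original tokenizer from exercise 2
token = re.compile(r'\w+|\'\w*|[?,!()\']|\.+')
tokens = token.findall(tweet)
print(tokens)
print(len(tokens))

# New tokenizer: an optional leading @ or # is kept attached to the word.
# (Inside a character class, | is a literal pipe, so [@#] is used rather than [@|\#].)
token = re.compile(r'[@#]?\w+|\'\w*|[?,!()\']|\.+')
tokens = token.findall(tweet)
print(tokens)
print(len(tokens))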

['robv', 'New', 'vids', 'coming', 'tomorrow', 'excited_as_a_child', ',', 'can', "'t", 'w8', '!', '!']
12
['@robv', 'New', 'vids', 'coming', 'tomorrow', '#excited_as_a_child', ',', 'can', "'t", 'w8', '!', '!']
12


* c) I assume it will skip emojis, since they are not included in the list of special characters.
Let's find out:

In [171]:
emojitxt="@robv New vids coming tomorrow #excited_as_a_child, can't w8!! 🍆 💦"
tokens = token.findall(emojitxt)
print(tokens)
print(len(tokens))

['@robv', 'New', 'vids', 'coming', 'tomorrow', '#excited_as_a_child', ',', 'can', "'t", 'w8', '!', '!']
12


* d) As illustrated, it does not handle emojis correctly, and abbreviations such as `Mr.` and `Ms.` will be split.
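
One possible way to stop dropping emojis is to add a catch-all alternative at the end of the pattern that matches any single character that is neither alphanumeric nor whitespace. The sketch below is an illustration under that assumption, not a complete Twitter tokenizer; it treats every code point as its own token, so emojis that are composed of several code points (for example with skin-tone modifiers) would be split.

```python
import re

# Same idea as the tweet tokenizer above, plus a fallback [^\w\s]
# that catches any remaining symbol, including emojis.
emoji_aware = re.compile(r"[@#]?\w+|'\w*|\.+|[^\w\s]")

emoji_tweet = "@robv New vids coming tomorrow #excited_as_a_child, can't w8!! 🍆 💦"
emoji_tokens = emoji_aware.findall(emoji_tweet)
print(emoji_tokens)
print(len(emoji_tokens))
```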

## 4. Segmentation


Sentence segmentation is not a trivial task either.

First, make sure you understand the following sentence segmentation code:

In [172]:
import re

def sentence_segment(match_regex, tokens):
    """
    Splits a sequence of tokens into sentences, splitting wherever the given matching regular expression
    matches.

    Parameters
    ----------
    tokens      the input sequence as list of strings (each item is a ``word'')
    match_regex the regular expression that defines at which token to split.

    Returns
    -------
    a list of lists of strings, where each string is a word, and each inner list
    represents a sentence.
    """
    sentences = [[]]
    for tok in tokens:
        sentences[-1].append(tok)
        if match_regex.match(tok):
            sentences.append([])
            
    if sentences[-1] == []:
        del sentences[-1]
    return sentences


In the following code, there is a variable `text` containing a small text and a regular expression-based segmenter:

In [173]:
text = """
Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch is the longest official one-word placename in U.K. Isn't that weird? I mean, someone took the effort to really make this name as complicated as possible, huh?! Of course, U.S.A. also has its own record in the longest name, albeit a bit shorter... This record belongs to the place called Chargoggagoggmanchauggagoggchaubunagungamaugg. There's so many wonderful little details one can find out while browsing http://www.wikipedia.org during their Ph.D. or an M.Sc.
"""

token = re.compile(r'Mr\.|[\w\']+|[.?]+')  # escape the dot so that only a literal "Mr." matches

tokens = token.findall(text)
sentences = sentence_segment(re.compile(r'\.'), tokens)

for sentence in sentences:
    print(sentence)

['Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch', 'is', 'the', 'longest', 'official', 'one', 'word', 'placename', 'in', 'U', '.']
['K', '.']
["Isn't", 'that', 'weird', '?', 'I', 'mean', 'someone', 'took', 'the', 'effort', 'to', 'really', 'make', 'this', 'name', 'as', 'complicated', 'as', 'possible', 'huh', '?', 'Of', 'course', 'U', '.']
['S', '.']
['A', '.']
['also', 'has', 'its', 'own', 'record', 'in', 'the', 'longest', 'name', 'albeit', 'a', 'bit', 'shorter', '...']
['This', 'record', 'belongs', 'to', 'the', 'place', 'called', 'Chargoggagoggmanchauggagoggchaubunagungamaugg', '.']
["There's", 'so', 'many', 'wonderful', 'little', 'details', 'one', 'can', 'find', 'out', 'while', 'browsing', 'http', 'www', '.']
['wikipedia', '.']
['org', 'during', 'their', 'Ph', '.']
['D', '.']
['or', 'an', 'M', '.']
['Sc', '.']


Questions:
- a) Improve the segmenter so that it handles a list of abbreviations (U.S./U.K./Ph.D/M.Sc) correctly.
- b) Improve the segmenter so that it handles URLs correctly. You can assume that URLs start with "www."

Answers:

* a)

In [174]:
import re

def sentence_segment(match_regex, tokens):
    """
    Splits a sequence of tokens into sentences, splitting wherever the given matching regular expression
    matches.

    Parameters
    ----------
    tokens      the input sequence as list of strings (each item is a ``word'')
    match_regex the regular expression that defines at which token to split.

    Returns
    -------
    a list of lists of strings, where each string is a word, and each inner list
    represents a sentence.
    """
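    # Tokens that may be followed by a '.' without ending a sentence.
    abbreviations={'U','K','S','A', 'Ph', 'M', 'Dr', 'Mr', 'Mrs', 'Inc', 'Ltd', 'Co'}
    sentences = [[]]
    for i, tok in enumerate(tokens):
        sentences[-1].append(tok)
        # Guard i > 0 so that tokens[i-1] never wraps around to the last token.
        prev_tok = tokens[i-1] if i > 0 else ""
        if match_regex.match(tok) and prev_tok not in abbreviations:
            sentences.append([])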
            
    if sentences[-1] == []:
        del sentences[-1]
    return sentences


In [175]:
text = """
Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch is the longest official one-word placename in U.K. Isn't that weird? I mean, someone took the effort to really make this name as complicated as possible, huh?! Of course, U.S.A. also has its own record in the longest name, albeit a bit shorter... This record belongs to the place called Chargoggagoggmanchauggagoggchaubunagungamaugg. There's so many wonderful little details one can find out while browsing http://www.wikipedia.org during their Ph.D. or an M.Sc.
"""

token = re.compile(r'Mr\.|[\w\']+|[.?]+')  # escape the dot so that only a literal "Mr." matches

tokens = token.findall(text)
sentences = sentence_segment(re.compile(r'[\.?!]'), tokens)

for sentence in sentences:
    print(sentence)

['Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch', 'is', 'the', 'longest', 'official', 'one', 'word', 'placename', 'in', 'U', '.', 'K', '.', "Isn't", 'that', 'weird', '?']
['I', 'mean', 'someone', 'took', 'the', 'effort', 'to', 'really', 'make', 'this', 'name', 'as', 'complicated', 'as', 'possible', 'huh', '?']
['Of', 'course', 'U', '.', 'S', '.', 'A', '.', 'also', 'has', 'its', 'own', 'record', 'in', 'the', 'longest', 'name', 'albeit', 'a', 'bit', 'shorter', '...']
['This', 'record', 'belongs', 'to', 'the', 'place', 'called', 'Chargoggagoggmanchauggagoggchaubunagungamaugg', '.']
["There's", 'so', 'many', 'wonderful', 'little', 'details', 'one', 'can', 'find', 'out', 'while', 'browsing', 'http', 'www', '.']
['wikipedia', '.']
['org', 'during', 'their', 'Ph', '.', 'D', '.']
['or', 'an', 'M', '.', 'Sc', '.']


* b)

In [176]:
import re

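def sentence_segment_link_state(match_regex, tokens):
    """Like sentence_segment, but never splits while inside a URL (tracked with a flag)."""
    linkindicator=re.compile(r'www|WWW|http|HTTP')   # tokens that start a URL
    # Top-level domains that end a URL. match() only anchors at the start of a token,
    # so a token such as 'company' would also count as the end of a URL.
    linkoffdicator=re.compile(r'com|org|dk|gov|io')
    abbreviations={'U','K','S','A', 'Ph', 'M', 'Dr','D', 'Mr', 'Mrs', 'Inc', 'Ltd', 'Co'}
    link=False
    sentences = [[]]
    for i, tok in enumerate(tokens):
        # Guard i > 0 so that tokens[i-1] never wraps around to the last token.
        prev_tok = tokens[i-1] if i > 0 else ""
        sentences[-1].append(tok)
        if link or linkindicator.match(tok):
            link = True
            if linkoffdicator.match(tok):
                link = False
        elif match_regex.match(tok) and prev_tok not in abbreviations:
            sentences.append([])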
            
    if sentences[-1] == []:
        del sentences[-1]
    return sentences


In [177]:

def sentence_segment_linkcheck(match_regex, tokens):
    linkindicator=re.compile(r'www|WWW|http|HTTP')
    linkoffdicator=re.compile(r'com|org|dk|gov|io')
    abbreviations={'U','K','S','A', 'Ph', 'M', 'Dr','D', 'Mr', 'Mrs', 'Inc', 'Ltd', 'Co'}
    sentences = [[]]

    for i, tok in enumerate(tokens):
        prev_token = tokens[i-1] if i > 0 else ""
        next_token = tokens[i+1] if i < len(tokens)-1 else ""
        sentences[-1].append(tok)
        is_boundary = match_regex.match(tok)
        after_abbreviation = prev_token in abbreviations
        inside_link = linkindicator.match(prev_token) or linkoffdicator.match(next_token)
        if is_boundary and not after_abbreviation and not inside_link:
            sentences.append([])
    if sentences[-1] == []:
        del sentences[-1]
    return sentences


In [178]:
sentences1 = sentence_segment_link_state(re.compile(r'[\.?!]'), tokens)
sentences2 = sentence_segment_linkcheck(re.compile(r'[\.?!]'), tokens)

for sentence in sentences1:
    print(sentence)
print()

for sentence in sentences2:
    print(sentence)

['Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch', 'is', 'the', 'longest', 'official', 'one', 'word', 'placename', 'in', 'U', '.', 'K', '.', "Isn't", 'that', 'weird', '?']
['I', 'mean', 'someone', 'took', 'the', 'effort', 'to', 'really', 'make', 'this', 'name', 'as', 'complicated', 'as', 'possible', 'huh', '?']
['Of', 'course', 'U', '.', 'S', '.', 'A', '.', 'also', 'has', 'its', 'own', 'record', 'in', 'the', 'longest', 'name', 'albeit', 'a', 'bit', 'shorter', '...']
['This', 'record', 'belongs', 'to', 'the', 'place', 'called', 'Chargoggagoggmanchauggagoggchaubunagungamaugg', '.']
["There's", 'so', 'many', 'wonderful', 'little', 'details', 'one', 'can', 'find', 'out', 'while', 'browsing', 'http', 'www', '.', 'wikipedia', '.', 'org', 'during', 'their', 'Ph', '.', 'D', '.', 'or', 'an', 'M', '.', 'Sc', '.']

['Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch', 'is', 'the', 'longest', 'official', 'one', 'word', 'placename', 'in', 'U', '.', 'K', '.', "Isn't", 'that', 'w
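
An alternative to handling URLs inside the segmenter is to keep each URL together as a single token at tokenization time, so that its internal dots never reach the segmenter. The sketch below illustrates this idea; the pattern and the example address `www.example.com` are assumptions made for illustration and will not cover every kind of URL.

```python
import re

# First alternative: a URL starting with http://, https:// or www.
#   \S*\w takes all non-whitespace characters but forces the URL to end in an
#   alphanumeric character, so a sentence-final '.' is not swallowed.
# Remaining alternatives: words (with apostrophes) and runs of sentence punctuation.
url_aware = re.compile(r"(?:https?://|www\.)\S*\w|[\w']+|[.?!]+")

url_sample = "Visit www.example.com. It is nice."
url_tokens = url_aware.findall(url_sample)
print(url_tokens)

wiki_tokens = url_aware.findall("browsing http://www.wikipedia.org during their studies")
print(wiki_tokens)
```

The design choice here is to solve the problem one step earlier: once the URL is a single token, the segmenter does not need link-specific logic.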

## 5. Tokenization competition

Tokenization of social media can be more challenging. We provide a small development set for you to experiment with, which you can find in `week1/tok.dev.txt`. The file is a tab-separated file, where the first column contains the input text, and the second column the gold tokenization (as decided by an annotator).

In [179]:
data = [line.strip().split('\t') for line in open('tok.dev.txt', encoding="utf-8")]
print(data[0])

["Found the scarriest mystery door in my school. I'M SO CURIOUS D:", "Found the scarriest mystery door in my school . I 'M SO CURIOUS D:"]


In [180]:
import re

old_school_emojis = [
    r':\)', r':D', r'D:', r':P', r':\(', r';\)', r':\|', r':\^D', r':O', r':/', r'<3',
    r':3', r':\*', r':-D', r':-\)', r':-\(', r':-P', r';-\)', r';-\(', r':-/', 
    r':<', r':>', r'B\)', r';P', r'=\)', r'X\)', r'XD', r':c', r'=\('
]

tester = "Found the scarriest mystery door in my school. I'M SO CURIOUS D:"

# Create a regex pattern that includes old-school emojis
emoji_pattern = '|'.join(old_school_emojis)

# Define the regex pattern for words and other tokens
token_pattern = r'(?:[@#])?\w+|\'\w*|[?,!()\']|\.+|[^a-zA-Z\s]'

# Final pattern: Combine emoji matching first, followed by other tokens
final_pattern = f'({emoji_pattern}|{token_pattern})'

# Apply the regex pattern to the string
tokens = re.findall(final_pattern, tester)
print(tokens)
print(len(tokens))


['Found', 'the', 'scarriest', 'mystery', 'door', 'in', 'my', 'school', '.', 'I', "'M", 'SO', 'CURIOUS', 'D:']
14


There is also a test file with the same format, but the gold annotation is missing. This can be found in `week1/tok.test.txt`. You are supposed to develop your tokenizer based on the development data, and then apply your tokenizer on the test data. You can hand in the predictions on the test data on LearnIt in the same slot as the rest of the assignment. We will use F1 score for evaluation.

Make sure that the file you hand in:
- Contains only the output of your system, so it should have the same number of lines as the test data file, but contains different placements of whitespace characters (the characters without whitespaces should be exactly the same though). 
- Has your ITU username as the name of the file: i.e. `robv.txt`.
- You can test whether the format matches by using the evaluation script provided (described below) on the development data.
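
The format requirements above can be checked before handing in. The following is a minimal sketch, assuming a tokenizer is available as a function that maps a string to a list of tokens; the names `same_characters`, `write_predictions`, and the `tokenize` argument are hypothetical helpers introduced here, not part of the provided evaluation script.

```python
def same_characters(original, tokenized):
    """Return True if both strings are identical once all whitespace is removed."""
    return "".join(original.split()) == "".join(tokenized.split())


def write_predictions(lines, tokenize, path):
    """Tokenize each input line and write one space-separated line per input line."""
    with open(path, "w", encoding="utf-8") as out:
        for line in lines:
            prediction = " ".join(tokenize(line))
            # Only the placement of whitespace may differ from the input.
            if not same_characters(line, prediction):
                raise ValueError(f"characters changed for line: {line!r}")
            out.write(prediction + "\n")


# Check on the first development example shown above.
raw = "Found the scarriest mystery door in my school. I'M SO CURIOUS D:"
gold = "Found the scarriest mystery door in my school . I 'M SO CURIOUS D:"
format_ok = same_characters(raw, gold)
print(format_ok)
```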

We have provided an evaluation script for your convenience, it returns F1 score, recall, and precision. It also prints out all sentences where your model made an error (indicating the error in red if supported by your terminal), and checks whether your output is in the right format. It can be found in `week1/tok_eval.py`.

# Lecture 2: Language (correction)

## 6. Spelling correction

Below is an implementation of the Levenshtein distance. It uses some efficiency tricks, and it is not important that you understand every line of this implementation.

In [181]:
# Baseline implementation; the next cell adds a swap (transposition) operation for better results

def levenshteinDistance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1

    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2+1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]

print(levenshteinDistance('this', 'that'))

2


In [182]:
def levenshteinDistance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1

    distances = [[i + j if i * j == 0 else 0 for j in range(len(s2) + 1)] for i in range(len(s1) + 1)]
    
    for i1 in range(1, len(s1) + 1):
        for i2 in range(1, len(s2) + 1):
            if s1[i1 - 1] == s2[i2 - 1]:
                distances[i1][i2] = distances[i1 - 1][i2 - 1]
            else:
                distances[i1][i2] = 1 + min(
                    distances[i1 - 1][i2],     # Deletion
                    distances[i1][i2 - 1],     # Insertion
                    distances[i1 - 1][i2 - 1]  # Substitution
                )
                
                # Check for transposition (swapping adjacent letters)
                if i1 > 1 and i2 > 1 and s1[i1 - 1] == s2[i2 - 2] and s1[i1 - 2] == s2[i2 - 1]:
                    distances[i1][i2] = min(distances[i1][i2], distances[i1 - 2][i2 - 2] + 1)
    
    return distances[-1][-1]

print(levenshteinDistance('this', 'thsi'))  # Example usage


1


We also provide you with an English word list from [Aspell](http://aspell.net/) in `aspell-en-dict.txt`. It can be used as follows:

In [183]:
# Load wordlist (one word per line)
en_dict = set([word.strip() for word in open('aspell-en-dict.txt', encoding="utf-8").readlines()])
    
# Example usage
typo = 'brower'
correction = 'browser'
print(typo, correction)
print(typo in en_dict)
print(correction in en_dict)
print(levenshteinDistance(typo, correction))

brower browser
False
True
1


* a) Implement a (naive) spelling correction system that finds the word in the word list with the smallest minimum edit distance for a word that contains a misspelling. 
* b) There could be multiple words with the smallest minimum edit distance for some typos, what are supplementary methods to re-rank these? (mention at least 2)

In [184]:

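def spellcheck(testword):
    """Return all [distance, word] pairs from en_dict with the smallest edit distance to testword."""
    min_edit=10    # initial best distance; candidates further away than this are ignored
    min_edits=[]
    for word in en_dict:
        # Speed-up: only compare words whose length differs by at most one character.
        if (len(word) <= len(testword)+1) and (len(word) >= len(testword)-1):
            dist=levenshteinDistance(testword, word)
            if dist == min_edit:
                # Tie with the current best distance: keep this candidate as well.
                min_edits.append([dist,word])
            elif dist < min_edit:
                # Strictly better: discard the previous candidates.
                min_edits=[[dist,word]]
                min_edit=dist
    return min_edits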
    
spellcheck("hanlde")


[[1, 'handle']]
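
For question (b), candidates often tie on edit distance. For `conise`, both `concise` and `ionise` are one edit away. One supplementary signal is word frequency; others include keyboard proximity of the substituted characters or the surrounding context scored by a language model. The sketch below re-ranks tied candidates by frequency. The function name `rerank_by_frequency` and the counts in `word_counts` are made up for illustration; real counts would have to be estimated from a corpus.

```python
def rerank_by_frequency(candidates, word_counts):
    """Sort [distance, word] pairs by distance, breaking ties by higher frequency."""
    return sorted(candidates, key=lambda pair: (pair[0], -word_counts.get(pair[1], 0)))


# Two candidates for the typo 'conise', both at edit distance 1.
tied = [[1, 'ionise'], [1, 'concise']]
# Hypothetical corpus counts, for illustration only.
word_counts = {'concise': 300, 'ionise': 5}

ranked = rerank_by_frequency(tied, word_counts)
print(ranked[0][1])
```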

## 7. Evaluation and Analysis of spelling correction

We also provide you with a list of 100 typos and their corrections from the [GitHub Typo Corpus](https://aclanthology.org/2020.lrec-1.835/) in `typos.txt`. It can be used as follows:

In [185]:
# Load github typo corpus misspellings
typos = []
corrections = []
for line in open('typos.txt', encoding="utf-8"):
    tok = line.strip().split('\t')
    typos.append(tok[0])
    corrections.append(tok[1])
    
# Example usage
print(typos[0], corrections[0])

browers browser


* a) Evaluate the spelling correction system you implemented in the previous assignment with accuracy. How many of the words did it correct properly?
* b) Now analyze the errors: can you identify some common causes (i.e. trends) in the mistakes of your model?

In [186]:
mycorrections=[]
for i, typo in enumerate(typos):
    corrected=spellcheck(typo)[0][1]
    mycorrections.append(corrected)
    print(i," / 100 :",typo, "->",corrected)

0  / 100 : browers -> brokers
1  / 100 : asigned -> aligned
2  / 100 : hanlde -> handle
3  / 100 : poining -> pointing
4  / 100 : pittsburg -> Pittsburgh
5  / 100 : inequlity -> inequity
6  / 100 : exeption -> exception
7  / 100 : soem -> some
8  / 100 : meagniful -> meaningful
9  / 100 : securiy -> security
10  / 100 : meassure -> reassure
11  / 100 : dessa -> Odessa
12  / 100 : buiild -> build
13  / 100 : aproach -> approach
14  / 100 : oldeer -> older
15  / 100 : aroung -> around
16  / 100 : repsond -> respond
17  / 100 : explicliy -> explicit
18  / 100 : tranlate -> translate
19  / 100 : khói -> Thai
20  / 100 : embedeed -> embedded
21  / 100 : wriitten -> written
22  / 100 : defailt -> default
23  / 100 : kenrel -> kennel
24  / 100 : adventage -> advantage
25  / 100 : validater -> validate
26  / 100 : initilise -> initialise
27  / 100 : conise -> ionise
28  / 100 : stephan -> Stephan
29  / 100 : persitant -> persistent
30  / 100 : evalution -> evaluation
31  / 100 : rysnc -> sync


In [196]:
mycorrections = [s.lower() for s in mycorrections]

In [197]:
from sklearn.metrics import accuracy_score, recall_score, f1_score

# Gold corrections and the system's predictions
y_true = corrections # Ground truth
y_pred = mycorrections  # Predicted labels

# Convert to binary: 1 if the prediction equals the gold correction, 0 otherwise
y_binary_true = [1] * len(y_true)  # Always 1 (ground truth is always "correct")
y_binary_pred = [1 if a == b else 0 for a, b in zip(y_true, y_pred)]

# Compute metrics. Because y_binary_true contains only 1s, recall equals accuracy
# and precision is trivially 1, so accuracy is the only informative number here.
accuracy = accuracy_score(y_binary_true, y_binary_pred)
recall = recall_score(y_binary_true, y_binary_pred, average="binary")
f1 = f1_score(y_binary_true, y_binary_pred, average="binary")

print(f'Accuracy: {accuracy:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1 Score: {f1:.4f}')

Accuracy: 0.6100
Recall: 0.6100
F1 Score: 0.7578
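
Since every typo has exactly one gold correction, accuracy is simply the fraction of predictions that match the gold word exactly, and it can be computed without any library. A minimal sketch, using the first four typos of this exercise as a small sample (the helper name `exact_match_accuracy` is introduced here for illustration):

```python
def exact_match_accuracy(gold, predicted):
    """Fraction of positions where the prediction equals the gold correction."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Gold corrections and system outputs for the first four typos.
gold_sample = ['browser', 'assigned', 'handle', 'pointing']
pred_sample = ['brokers', 'aligned', 'handle', 'pointing']

sample_accuracy = exact_match_accuracy(gold_sample, pred_sample)
print(sample_accuracy)
```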


In [198]:
for i in range(100):
    print(typos[i],y_true[i],y_pred[i])

browers browser brokers
asigned assigned aligned
hanlde handle handle
poining pointing pointing
pittsburg pittsburgh pittsburgh
inequlity inequality inequity
exeption exception exception
soem some some
meagniful meaningful meaningful
securiy security security
meassure measure reassure
dessa essa odessa
buiild build build
aproach approach approach
oldeer older older
aroung around around
repsond respond respond
explicliy explicitly explicit
tranlate translate translate
khói khỏi thai
embedeed embedded embedded
wriitten written written
defailt default default
kenrel kernel kennel
adventage advantage advantage
validater validator validate
initilise initialize initialise
conise concise ionise
stephan stephen stephan
persitant persistent persistent
evalution evaluation evaluation
rysnc rsync sync
githuhb github github
locgical logical logical
rhe the rhea
grenlets greenlets runlets
fultiple multiple multiple
broswers browsers browsers
succssful successful successful
legen legend legmen
unknw

## 8. Levenshtein distance in practice
Work out the distance matrix of the edit distance for the two words: `dansk` and `dane`. This is not a programming assignment, but rather a pen-and-paper assignment. You can fill out a table in any editor you like, or draw the matrix on a piece of paper and upload a picture or scan. An example of such a matrix was shown in class, but is also included in the Speech and Language Processing book (figure 2.18).
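
To check a hand-drawn matrix, the full table can also be computed programmatically. The sketch below fills the standard dynamic-programming table with insertions and deletions costing 1. The substitution cost is left as a parameter (`sub_cost`, an assumption introduced here) because some presentations charge 2 for a substitution instead of 1; which convention the course expects should be checked against the lecture slides.

```python
def edit_distance_matrix(source, target, sub_cost=1):
    """Return the full edit-distance table between source (rows) and target (columns)."""
    rows, cols = len(source) + 1, len(target) + 1
    matrix = [[0] * cols for _ in range(rows)]
    # First column: delete all characters of the source prefix.
    for i in range(rows):
        matrix[i][0] = i
    # First row: insert all characters of the target prefix.
    for j in range(cols):
        matrix[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else sub_cost
            matrix[i][j] = min(
                matrix[i - 1][j] + 1,        # deletion
                matrix[i][j - 1] + 1,        # insertion
                matrix[i - 1][j - 1] + cost  # match or substitution
            )
    return matrix


dansk_dane = edit_distance_matrix("dansk", "dane")
# Print the table with '#' marking the empty prefix.
print("   " + "  ".join("#dane"))
for label, row in zip("#dansk", dansk_dane):
    print(label + "  " + "  ".join(str(value) for value in row))
```

The bottom-right cell holds the distance between the two full words.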