In [11]:
# LOAD THE FILES FOR THIS NOTEBOOK
!wget -q --show-progress --no-check-certificate 'https://docs.google.com/uc?export=download&id=1efQdDM1FTLgcRoq7Zkizw_d7Vd7e52xK' -O 'Class 7.zip'
from zipfile import ZipFile
with ZipFile('Class 7.zip', 'r') as zipObj:
  zipObj.extractall()



<b>LING 193 - Class 12<br>
Automatic text correction in Python</b><br>
Andrew McInnerney<br>
October 10, 2022

# 1 Typo correction

The goal for today is to take our `minedit()` function and use it to suggest corrections for possible typos. Let's first load the `minedit()` function into this Colab notebook. Either run the code block below, or replace it with your own code and run that.

In [12]:
def firstrow(m):
    return [i for i in range(m+1)]

def substitution_penalty(letter1, letter2):
    if letter1 == letter2:
        return 0
    else:
        return 2

def nextrow(priorrow, word1, letter):
    row = [priorrow[0] + 1]
    priorcell = row[0]
    for i in range(len(word1)):
        insertion = priorrow[i+1] + 1
        deletion = priorcell + 1
        substitution = priorrow[i] + substitution_penalty(word1[i], letter)
        priorcell = min(insertion, deletion, substitution)
        row.append(priorcell)
    return row

def minedit(word1, word2):
  m = len(word1) # input word
  n = len(word2) # correct word

  prior_row = firstrow(m)
  table = []
  table.append(prior_row)

  for index in range(n):
    next_row = nextrow(prior_row, word1, word2[index])
    table.append(next_row)
    prior_row = next_row
  
  return table
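
To see what the table looks like, here is a minimal sketch that prints it for `'SONG'` against `'SONNG'`. The helper name `minedit_table` is hypothetical: it is a compact restatement of `minedit()` above (same costs: 1 for insertion or deletion, 2 for substitution), included only so the example runs on its own.

```python
def minedit_table(word1, word2):
    """Compact restatement of minedit(): insert/delete cost 1, substitute cost 2."""
    row = list(range(len(word1) + 1))
    table = [row]
    for letter in word2:
        new_row = [row[0] + 1]
        for i, other in enumerate(word1):
            cost = 0 if other == letter else 2
            # insertion, deletion, substitution (same three options as nextrow())
            new_row.append(min(row[i + 1] + 1, new_row[i] + 1, row[i] + cost))
        table.append(new_row)
        row = new_row
    return table

table = minedit_table("SONG", "SONNG")

# One row per letter of word2, with the letters of word1 across the top
print("    " + "  ".join("#SONG"))
for label, row in zip("#SONNG", table):
    print(label + "  " + " ".join(f"{cell:2d}" for cell in row))

# The edit distance is the bottom-right cell
distance = table[-1][-1]
print("Distance:", distance)
```

The bottom-right cell here is 1, because the two words differ only by one extra `N`.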

Let's now read in our Scrabble dictionary:

In [None]:
with open("Collins Scrabble Words (2019).txt") as file:
    dictionary = file.read().splitlines()

Now, write code to build a dictionary containing the words in `dictionary` as keys, and their distance to the word `'SONNG'` as values. 
1. Create an empty dictionary, e.g. `distances = {}`
2. Write a *for*-loop that iterates over each word in `dictionary`, and
  - calculates the minedit table for each word in comparison to `'SONNG'`,
  - extracts the bottom-right cell from the table (i.e., the last position in the last row)
  - creates a dictionary entry for that word with its distance to `'SONNG'`.

In [13]:
distances = dict()

for word in dictionary:
  # Generate the table
  table = minedit(word, "SONNG")
  
  # Find the distance 
  value = table[-1][-1]
  # print(f"Distance from {word} to SONNG is: {value}")
  
  # Write the distance to the dict()
  distances[word] = value

# print("Size of distances dict() is:", len(distances))


Use Python's `min()` function to extract the *closest* word to `'SONNG'`. The result should be `'SONG'`.

In [None]:
closest = min(distances, key = distances.get)
print(closest)

SONG
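
A note on how `min()` behaves here: with `key = distances.get`, it compares the dictionary's keys by their values and returns the key with the smallest value. If several words are tied, it returns whichever one it meets first. The sketch below uses a small hand-made `distances` dictionary (a toy stand-in for the full one) to show both cases.

```python
# Toy stand-in for the full distances dict (distances to 'SONNG')
distances = {"LONG": 3, "SONGS": 2, "SONG": 1, "SON": 2}

# min() iterates over the keys and ranks them by distances.get(key)
closest = min(distances, key=distances.get)
print(closest)

# Drop the unique winner to create a tie between 'SONGS' and 'SON'
without_song = {word: d for word, d in distances.items() if word != "SONG"}
first_of_tie = min(without_song, key=without_song.get)
print(first_of_tie)  # the first tied key in the dictionary's order

# Collect every word that shares the smallest distance
best = min(without_song.values())
tied = [word for word, d in without_song.items() if d == best]
print(tied)
```

This matters for typo correction: when several candidates are equally close, the "winner" depends only on the order of the wordlist.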


Finally, use the code you just wrote to make a function that can do this for any word. The function should take a typo (a string) and a wordlist (a list) as inputs, and it should
1. Create an empty dictionary
2. Iterate through a *for*-loop on each word in the wordlist that
  - calculates the minedit table for each word in comparison to the input typo
  - extracts the bottom-right cell from the table
  - creates a dictionary entry for that word with its distance to the input typo
3.  Use Python's `min()` function to extract the closest word to the typo, and return that word.

In [14]:
def correct(typo, wordlist):
  distances = dict()

  for word in wordlist:
    # Generate the table
    table = minedit(word, typo)
    
    # Find the distance 
    value = table[-1][-1]
    
    # Write the distance to the dict()
    distances[word] = value

  return min(distances, key = distances.get)


When you are done, try running the code block below, to see how the function "corrects" a few typos for color words. The output should be
```python
ORANGE
PURPLE
MAGENTA
```


In [None]:
# print(correct('ORNGE', dictionary))
# print(correct('PURPPLE', dictionary))
# print(correct('MAGRNTA', dictionary))
print(correct("LOGICIAN", dictionary))

LOGICIAN


Next, we will split into groups to quality-test this typo-correction system. If you finish working on the `correct()` function with some time to spare, it will be useful to test the function out on some more words, to start getting familiar with what it does well, and what it struggles with.

# 2 Identifying flaws
In groups, work together to identify problems with the `correct()` function. 
1. Come up with a list of one- to two-dozen clear typos.
  - E.g. you might have 'JUMPTS' for 'JUMPS'
2. For each typo, note what you would expect the right correction to be
  - E.g. `correct('JUMPTS', dictionary)` should give `'JUMPS'`
3. Also note what kind of error is involved
  - E.g. 'JUMPTS' is *insertion* of 'T'
4. See if you can identify any regularities in the corrections `correct()` gets *right*, and what it gets *wrong*
5. Discuss possible reasons why you might be seeing the patterns you observe

After a few minutes, be ready to share some ideas!
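
One way to keep track of your test cases is a small harness that runs each typo through the corrector and compares the result with what you expected. The sketch below is only an illustration: `edit_distance`, `evaluate`, and `toy_wordlist` are hypothetical names, the wordlist is a tiny stand-in for the real dictionary, and `correct` is restated compactly so the block runs on its own.

```python
def edit_distance(word1, word2):
    """Same costs as minedit(): insert/delete 1, substitute 2."""
    row = list(range(len(word1) + 1))
    for letter in word2:
        new_row = [row[0] + 1]
        for i, other in enumerate(word1):
            cost = 0 if other == letter else 2
            new_row.append(min(row[i + 1] + 1, new_row[i] + 1, row[i] + cost))
        row = new_row
    return row[-1]

def correct(typo, wordlist):
    """Return the word in wordlist closest to typo."""
    return min(wordlist, key=lambda word: edit_distance(word, typo))

def evaluate(cases, wordlist):
    """Run each (typo, expected, error_type) case and record whether it passed."""
    results = []
    for typo, expected, error_type in cases:
        got = correct(typo, wordlist)
        results.append((typo, expected, got, error_type, got == expected))
    return results

# Tiny stand-in wordlist and a few hand-written test cases
toy_wordlist = ["JUMP", "JUMPS", "JUMPED", "COMPUTER", "TAUGHT", "TAUNT"]
cases = [
    ("JUMPTS", "JUMPS", "insertion"),
    ("COMPUTERR", "COMPUTER", "insertion"),
    ("TAUGT", "TAUGHT", "deletion"),
    ("JUNPS", "JUMPS", "substitution"),
]

results = evaluate(cases, toy_wordlist)
for typo, expected, got, error_type, ok in results:
    status = "ok" if ok else "MISS"
    print(f"{status:4} {typo:10} expected {expected:9} got {got:9} ({error_type})")
```

Grouping the misses by error type is one way to start spotting regularities. Results on the full dictionary may differ from this toy wordlist, since many more candidates compete.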

In [None]:
print(correct('JUMPTS', dictionary))
# print(correct('JUMPS', dictionary))
# print(correct('JUMPED', dictionary))
# print(correct('COMPUTERR', dictionary))
# print(correct('TAUGT', dictionary))

JUMP


(Also, for faster results, you might be interested in trying out the code block below. But notice there are also other ways this changes things! Explore to find out.)

In [15]:
with open("wordfreq.txt") as file: # You can also try the subtlex_words.txt file instead of wordfreq.txt
    words = file.read().splitlines()
frequencies = {}

# entry = words[0]
# print(f"Entry: {entry}")

# word = entry.split()[0]
# print(f"Word: {word}")

# word_upper = entry.split()[0].upper()
# print(f"Word Upper: {word_upper}")

# freq = int(entry.split()[1])
# print(f"Freq: {freq}")

for entry in words:
    # Each entry is one line: a word followed by its frequency count
    fields = entry.split()
    word = fields[0].upper()
    # Convert the second field (the count) to an int
    freq = int(fields[1])

    if word not in frequencies:
        frequencies[word] = freq
dictionary = {word:freq for word,freq in frequencies.items() if freq >= 20} # lower this number to get more words in your dictionary

# print(f"Size of dict(): {len(dictionary)}")
# print(f"Dict(): {dictionary}")


# 3 Finding applications
In your same groups, work together to create a useful way of applying the `correct()` function to correct actual text:

- How could the `correct()` function be used in the context of a larger program to identify and either (i) automatically correct, or (ii) suggest corrections for, typos in a given paragraph?

- E.g., here's a paragraph that could be cleaned up:
```
It is widelly taugt that parts of speech are definned in terms of sinple
defimitions. For example, "a noun is a person place or thinge." But in
reality, simple defintions like that are not very usefful. For exampe, 
is it not true that everthing is a "thing"? Then in wghat sense can we
deffine nouns as "thngs"? We need moree rigorouss teckneques to define
these notioms.
```
- Take on the goal of writing a function that can take in a whole paragraph like this, and use `correct()` in ways that you determine to be useful.

- You might start by discussing the question "hypothetically", like, "if we were Python wizards, what kinds of things could we imagine doing?"

- Then talk about things you could do with the tools you already have, and try to implement it!

- At the end of class, each group will share what they came up with.

- You may also turn in your work on this for credit as part of the skills day on Wednesday, if you wish.

In [None]:
sample = """It is widelly taugt that parts of speech are definned in terms of sinple
defimitions. For example, "a noun is a person place or thinge." But in
reality, simple defintions like that are not very usefful. For exampe, 
is it not true that everthing is a "thing"? Then in wghat sense can we
deffine nouns as "thngs"? We need moree rigorouss teckneques to define
these notioms."""

Note: You might find it useful to have a function that trims punctuation marks from the beginnings and ends of words.<br>
It would also be useful for this function to convert words to upper case, to match the entries in `dictionary`.<br>
E.g.,
```python
def citformat(word):
  punctuation = ["!","?",",",".","-","~",":",";", "'", '"']
  while word and word[-1] in punctuation:  # stop if nothing is left
    word = word[:-1]
  while word and word[0] in punctuation:
    word = word[1:]
  return word.upper()
```

It might also be useful to think about using the *frequency* information included in the updated `dictionary` object above. For example, if you wanted to pick the most frequent word out of a list, you could use code like this:
```python
minidict = ["APPLE","ORANGE","LEMON","MANGO"]
print(max(minidict, key=dictionary.get))
```
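
Building on that idea, frequency can serve as a tie-breaker when several words are equally close to the typo. The sketch below is one possible approach, not the only one: `correct_by_frequency` and `edit_distance` are hypothetical names, and the counts in `toy_freqs` are invented for illustration (they are not taken from `wordfreq.txt`).

```python
def edit_distance(word1, word2):
    """Same costs as minedit(): insert/delete 1, substitute 2."""
    row = list(range(len(word1) + 1))
    for letter in word2:
        new_row = [row[0] + 1]
        for i, other in enumerate(word1):
            cost = 0 if other == letter else 2
            new_row.append(min(row[i + 1] + 1, new_row[i] + 1, row[i] + cost))
        row = new_row
    return row[-1]

def correct_by_frequency(typo, freqs):
    """Pick the closest word; among equally close words, pick the most frequent."""
    distances = {word: edit_distance(word, typo) for word in freqs}
    best = min(distances.values())
    tied = [word for word, d in distances.items() if d == best]
    return max(tied, key=freqs.get)

# Invented counts, for illustration only
toy_freqs = {"SUNG": 100, "SANG": 300, "SONG": 900, "SING": 700}

# All four words are one insertion away from 'SNG'
plain = min(toy_freqs, key=lambda word: edit_distance(word, "SNG"))
by_freq = correct_by_frequency("SNG", toy_freqs)
print(plain)    # first tied word in dictionary order
print(by_freq)  # most frequent tied word
```

Whether the most frequent candidate is really the best guess is a design choice worth discussing. A rare word can still be the intended one.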

In [19]:
def citformat(word):
  punctuation = ["!","?",",",".","-","~",":",";", "'", '"']
  while word and word[-1] in punctuation:  # stop if nothing is left
    word = word[:-1]
  while word and word[0] in punctuation:
    word = word[1:]
  return word.upper()

sample = "This confoluted snetence definitelt contans numerrous mistaks in it."
print(f"Original: {sample}")

words_list = sample.split()
fixed_list = []
for word in words_list:
  word = citformat(word)
  fixed_list.append(correct(word, dictionary).lower())
# print(f"Fixed: {fixed_list}")

output = " ".join(fixed_list).capitalize()
print(f"The fixed string: {output}")




Original: This confoluted snetence definitelt contans numerrous mistaks in it.
The fixed string: This confused sentence definitely contains numerous mistakes in it
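
Notice that the fixed string has lost its final period, and that every word, including correctly spelled ones, was passed through `correct()`. One possible refinement, sketched below under stated assumptions, is to rewrite only runs of letters that are not already in the wordlist, so that spacing and punctuation pass through unchanged. The names `correct_text`, `edit_distance`, and `toy_wordlist` are hypothetical, and the wordlist is a tiny stand-in for the real dictionary.

```python
import re

def edit_distance(word1, word2):
    """Same costs as minedit(): insert/delete 1, substitute 2."""
    row = list(range(len(word1) + 1))
    for letter in word2:
        new_row = [row[0] + 1]
        for i, other in enumerate(word1):
            cost = 0 if other == letter else 2
            new_row.append(min(row[i + 1] + 1, new_row[i] + 1, row[i] + cost))
        row = new_row
    return row[-1]

def correct(typo, wordlist):
    """Return the word in wordlist closest to typo."""
    return min(wordlist, key=lambda word: edit_distance(word, typo))

def correct_text(text, wordlist):
    """Correct unknown words in text, leaving punctuation and known words alone."""
    known = set(wordlist)

    def fix(match):
        token = match.group(0)
        if token.upper() in known:
            return token  # already a known word: keep it exactly as written
        suggestion = correct(token.upper(), wordlist).lower()
        # Restore an initial capital if the original token had one
        return suggestion.capitalize() if token[0].isupper() else suggestion

    # Only runs of letters are rewritten; everything else passes through
    return re.sub(r"[A-Za-z]+", fix, text)

toy_wordlist = ["THIS", "SENTENCE", "CONTAINS", "MISTAKES"]
fixed = correct_text("This snetence contans mistaks.", toy_wordlist)
print(fixed)
```

A limitation of this sketch is that the pattern `[A-Za-z]+` splits words at apostrophes and hyphens, so forms like "don't" would need separate handling.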
