**Instructions** 


* 1 hr round
    - 10 mins: Discussion of possible approaches.
    - 45-50 mins: Coding time.
* You are free to use the internet for the entire duration (ideally, use it only to look up documentation and APIs).


**Problem**

1. Develop a model to correct the spelling mistakes in the provided text. You cannot use existing spell-checking libraries.
2. Write code to produce an evaluation metric that measures the performance of the model.

**1. Downloading dataset**



In [1]:
import zipfile

def get_data():
    # `!gdown` is IPython shell syntax, so this cell only runs inside a notebook
    !gdown https://drive.google.com/uc?id=1OB5AQ6i0CQ69jbp0iQa9EJ5nu4FkbeU2
    with zipfile.ZipFile('data.zip', 'r') as zip_ref:
        zip_ref.extractall('.')

get_data()

Downloading...
From: https://drive.google.com/uc?id=1OB5AQ6i0CQ69jbp0iQa9EJ5nu4FkbeU2
To: /content/data.zip
100% 217k/217k [00:00<00:00, 6.56MB/s]


In [4]:
import pandas as pd
data = pd.read_csv("data/data.tsv", sep="\t")
data.head()

Unnamed: 0,original_text,human_corrected_text
0,"I have just recieved the letter , which lets m...","I have just received the letter , which lets m..."
1,"Surprisily , there were no discounds .","Surprisingly , there were no discounts ."
2,"She just remembered it ought to be a secret , ...","She just remembered it ought to be a secret , ..."
3,I swim really well and I am a proffesional bas...,I swim really well and I am a professional bas...
4,I am writting to you about the show .,I am writing to you about the show .


In [5]:
data.shape

(4275, 2)

**2. Spell corrector model** (20-25 mins)

In [6]:
import numpy as np

In [7]:
subset = data[:5]
subset

Unnamed: 0,original_text,human_corrected_text
0,"I have just recieved the letter , which lets m...","I have just received the letter , which lets m..."
1,"Surprisily , there were no discounds .","Surprisingly , there were no discounts ."
2,"She just remembered it ought to be a secret , ...","She just remembered it ought to be a secret , ..."
3,I swim really well and I am a proffesional bas...,I swim really well and I am a professional bas...
4,I am writting to you about the show .,I am writing to you about the show .


In [18]:
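def get_emb(word):
    """Return a 26-dim letter-count vector (a-z) for `word`."""
    word = word.lower()
    emb = np.zeros(26)
    for char in word:
        idx = ord(char) - ord('a')
        # Skip characters outside a-z (isalpha() also accepts letters such as 'é')
        if 0 <= idx < 26:
            emb[idx] += 1
    return emb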
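
One property of this letter-count embedding is worth keeping in mind: it ignores letter order, so anagrams map to the same vector. The check below is a minimal sketch that re-declares `get_emb` so it runs on its own; the word pair `listen`/`silent` is an illustrative assumption, not taken from the dataset.

```python
import numpy as np

def get_emb(word):
    # Same idea as the cell above: one slot per letter a-z, counting occurrences.
    emb = np.zeros(26)
    for char in word.lower():
        emb[ord(char) - ord('a')] += 1
    return emb

# Anagrams share a vector, so this embedding cannot tell them apart.
print(np.array_equal(get_emb("listen"), get_emb("silent")))      # True

# A transposition typo has exactly the same letter counts as the intended word.
print(np.array_equal(get_emb("recieved"), get_emb("received")))  # True
```

This explains why transposition errors are matched so cleanly, and also why two different valid words with the same letters would be indistinguishable.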

In [19]:
english_words = dict()

for sentence in subset.human_corrected_text:
    for word in sentence.split():
        if word.isalpha():
            word = word.lower()
            # convert this word to a 26-dim letter-count embedding
            embedding = get_emb(word)

            # no normalisation here: cosine_sim divides by the vector norms

            english_words[word] = embedding


In [None]:
english_words

In [21]:
all_words = english_words.keys()
all_words

dict_keys(['i', 'have', 'just', 'received', 'the', 'letter', 'which', 'lets', 'me', 'know', 'that', 'won', 'first', 'prize', 'surprisingly', 'there', 'were', 'no', 'discounts', 'she', 'remembered', 'it', 'ought', 'to', 'be', 'a', 'secret', 'and', 'became', 'really', 'embarrassed', 'swim', 'well', 'am', 'professional', 'basketball', 'player', 'writing', 'you', 'about', 'show'])

In [15]:
from numpy.linalg import norm
def cosine_sim(emb_1, emb_2):
    """Cosine similarity between two 26-dim letter-count vectors."""
    dot = 0
    for i in range(26):
        dot += emb_1[i] * emb_2[i]
    dot /= norm(emb_1)
    dot /= norm(emb_2)
    return dot

In [17]:
word_you = [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0.]
word_copy = word_you.copy()
print(cosine_sim(word_you, word_copy))

1.0000000000000002


In [30]:
machine_corrected_sentences = []

for sentence in subset.original_text:
    corrected_sentence = ""
    for word in sentence.split():
        # note: non-alphabetic tokens (punctuation) are dropped from the output
        if word.isalpha():
            word = word.lower()

            # check if word is in vocab
            if word in all_words:
                # word is correct
                corrected_sentence += word + " "
            else:
                # there is some spelling mistake: pick the vocab word
                # whose letter-count embedding is most similar
                word_emb = get_emb(word)
                best_match_word = ""
                best_match_score = 0
                for key in english_words:
                    # here key is the correct word
                    score = cosine_sim(word_emb, english_words[key])
                    if score > best_match_score:
                        best_match_score = score
                        best_match_word = key
                corrected_sentence += best_match_word + " "
                print(word, ', ', best_match_word)
    machine_corrected_sentences.append(corrected_sentence)


recieved ,  received
surprisily ,  surprisingly
discounds ,  discounts
embarassed ,  embarrassed
proffesional ,  professional
writting ,  writing


In [None]:
# IISc
# omg
# train vs trial

In [31]:
machine_corrected_sentences

['i have just received the letter which lets me know that i have won the first prize ',
 'surprisingly there were no discounts ',
 'she just remembered it ought to be a secret and she became really embarrassed ',
 'i swim really well and i am a professional basketball player ',
 'i am writing to you about the show ']

In [27]:
subset.original_text[0]

'I have just recieved the letter , which lets me know that I have won the first prize .'

In [None]:
def correct_the_spellings(original_sentence):
    """
    inputs:
        original_sentence: sentence with misspelt words, e.g. 'I have just recieved the letter'
    output:
        machine_corrected_sentence: sentence with spellings corrected by the proposed model, e.g. 'I have just received the letter'
    """

    # TODO

    pass
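
One way the stub could be filled in is to wrap the exploratory loop from section 2 in a function. The sketch below is an assumption about how that might look, not the intended solution: the extra `vocab` parameter and the `build_vocab` helper are hypothetical additions (the stub takes only the sentence), and, unlike the loop above, it keeps punctuation tokens and a leading capital so the output stays aligned with the input.

```python
import numpy as np

def get_emb(word):
    """Letter-count vector over a-z; characters outside a-z are ignored."""
    emb = np.zeros(26)
    for char in word.lower():
        idx = ord(char) - ord('a')
        if 0 <= idx < 26:
            emb[idx] += 1
    return emb

def cosine_sim(a, b):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

def build_vocab(sentences):
    """Hypothetical helper: map each lowercased alphabetic word to its embedding."""
    return {w.lower(): get_emb(w) for s in sentences for w in s.split() if w.isalpha()}

def correct_the_spellings(original_sentence, vocab):
    corrected = []
    for token in original_sentence.split():
        word = token.lower()
        # Keep punctuation and known words unchanged; also bail out on an empty vocab.
        if not token.isalpha() or word in vocab or not vocab:
            corrected.append(token)
            continue
        emb = get_emb(word)
        best = max(vocab, key=lambda k: cosine_sim(emb, vocab[k]))
        # Preserve a leading capital from the original token.
        corrected.append(best.capitalize() if token[0].isupper() else best)
    return " ".join(corrected)

vocab = build_vocab(["I have just received the letter"])
print(correct_the_spellings("I have just recieved the letter", vocab))
# I have just received the letter
```

In the notebook, `english_words` could be passed as `vocab`; the later cells that call `correct_the_spellings(original_sentence)` with a single argument would need that second argument added.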

In [None]:
correct_the_spellings('I have just recieved the letter')

In [None]:
# Get corrections for first 5 sentences

machine_corrected_text = []

for original_sentence in data['original_text'].values[0:5]:
  machine_corrected_text.append(correct_the_spellings(original_sentence))

print(machine_corrected_text)

**3. Model performance** (5-10 mins)

In [None]:
def evaluate_corrections(original_text, human_corrected_text, machine_corrected_text):
    """
    inputs:
        original_text: list of sentences with misspelt words,
                       e.g. ['I have just recieved the letter', 'how aer you']
        human_corrected_text: list of sentences from original_text with correct spellings for each word,
                       e.g. ['I have just received the letter', 'how are you']
        machine_corrected_text: list of sentences from original_text with spellings corrected by the proposed model,
                       e.g. ['I have just recieved the leter', 'how are you']
    output:
        evaluation_score, which could be used to compare different correction models
    """

    # TODO

    pass
    

In [None]:
# Get evaluation score over first 5 sentences

evaluate_corrections(data['original_text'].values[0:5], data['human_corrected_text'].values[0:5], machine_corrected_text)
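
One candidate metric is word-level precision, recall, and F1 over corrections. The sketch below is an assumption about what `evaluate_corrections` could compute, not a prescribed answer. It assumes the three versions of a sentence have the same number of whitespace-separated tokens, compares tokens case-insensitively, and skips (and counts) sentences where that alignment does not hold. The returned dictionary keys are illustrative names.

```python
def evaluate_corrections(original_text, human_corrected_text, machine_corrected_text):
    true_fixes = errors = changes = skipped = 0
    for orig, human, machine in zip(original_text, human_corrected_text, machine_corrected_text):
        o, h, m = (s.lower().split() for s in (orig, human, machine))
        if not (len(o) == len(h) == len(m)):
            skipped += 1  # token-alignment assumption violated
            continue
        for ow, hw, mw in zip(o, h, m):
            if ow != hw:      # the human annotator changed this token: a real error
                errors += 1
            if mw != ow:      # the model changed this token
                changes += 1
                if mw == hw:  # ...and the change matches the human correction
                    true_fixes += 1
    precision = true_fixes / changes if changes else 0.0
    recall = true_fixes / errors if errors else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "skipped": skipped}

# The example sentences from the docstring above.
score = evaluate_corrections(
    ['I have just recieved the letter', 'how aer you'],
    ['I have just received the letter', 'how are you'],
    ['I have just recieved the leter', 'how are you'],
)
print(score)
# {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'skipped': 0}
```

Precision penalises the model for breaking correct words (`letter` to `leter`), while recall penalises it for leaving errors untouched (`recieved`). Note that the exploratory loop in section 2 drops punctuation tokens, so its output would be skipped by the alignment check unless punctuation is preserved.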

**4. Model improvements** (If time permits)
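
One possible direction, offered as a sketch rather than the intended answer: because the letter-count embedding ignores letter order, candidates could instead be ranked by Levenshtein edit distance, which is order-aware. The function names `edit_distance` and `closest_word` below are hypothetical, and the candidate list in the usage lines is illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance using a rolling one-row DP table."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]

def closest_word(word, vocab):
    """Return the vocabulary word with the smallest edit distance to `word`."""
    return min(vocab, key=lambda w: edit_distance(word, w))

print(edit_distance("writting", "writing"))    # 1
print(edit_distance("recieved", "received"))   # 2
print(closest_word("discounds", ["discounts", "discussed"]))  # discounts
```

Plain Levenshtein distance counts a transposition such as `ie` to `ei` as two substitutions, so combining it with the letter-count score, or using a transposition-aware variant, may be worth exploring if time permits.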