# Assignment-1

### Pooja Bandal 
##### 2022-02-06



### Installing essential libraries 

I install the `nltk` and `pytrec-eval-terrier` libraries again in this notebook because it runs in a different environment.

In [1]:
!pip install nltk
!pip install "pytrec-eval-terrier"

Collecting pytrec-eval-terrier
  Downloading pytrec_eval_terrier-0.5.2-cp37-cp37m-manylinux2010_x86_64.whl (287 kB)
     |████████████████████████████████| 287 kB 3.9 MB/s
Installing collected packages: pytrec-eval-terrier
Successfully installed pytrec-eval-terrier-0.5.2


### Importing modules

Importing all the modules used in the project.

In [2]:
import concurrent.futures
import json
import nltk
import os
import pytrec_eval
import time
import tqdm

from utils.common_utils import return_top_1_5_10_words, return_success_at_k, find_closes_match


#### Downloading the wordnet corpus

In [3]:
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Unzipping corpora/omw-1.4.zip.


True

#### Create lists to store the correct and incorrect spellings of words from the missp.dat file

In [5]:
correct_spellings = []
incorrect_spellings = []
# Read every line of the corpus; the with-block closes the file afterwards
with open('Data/missp.dat', 'r') as file_:
    Lines = file_.readlines()


#### Strip the newline character and pair each misspelling with its correct spelling

In [6]:
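crr_spell = ''
incrr_spell = ''
for line in Lines:
    if '$' in line:
        # A line containing '$' holds the correct spelling for the misspellings that follow
        crr_spell = line.replace('$', '').replace('\n', '').lower()
    else:
        # Any other line is a misspelling of the most recent correct word
        incrr_spell = line.lower().replace('\n', '')
        correct_spellings.append(crr_spell)
        incorrect_spellings.append(incrr_spell)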

In [8]:
print(f'total number of words in Birkbeck corpus are: {len(incorrect_spellings)}')

total number of words in Birkbeck corpus are: 36133
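
The parsing step above can be tried on a small in-memory sample. This is a minimal sketch, assuming the file format implied by the loop: a line starting with `$` gives a correct spelling, and each following line is one misspelling of it. The function name `parse_misspellings` and the sample text are hypothetical and only illustrate the idea; the sketch uses `startswith('$')` rather than the `'$' in line` test used in the notebook.

```python
from io import StringIO


def parse_misspellings(lines):
    """Return (incorrect, correct) pairs from lines in the '$'-header format."""
    pairs = []
    current_correct = None
    for raw in lines:
        word = raw.strip().lower()
        if not word:
            continue  # skip blank lines
        if word.startswith('$'):
            # Header line: remember the correct spelling
            current_correct = word[1:]
        elif current_correct is not None:
            # Misspelling of the most recent correct word
            pairs.append((word, current_correct))
    return pairs


# Made-up sample in the assumed format, not taken from the real file
sample = StringIO("$Albert\nAb\n$America\nAmeraca\nAmercia\n")
pairs = parse_misspellings(sample)
print(pairs)
```

Returning pairs keeps each misspelling next to its correct word, which avoids the two parallel lists drifting out of step.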


In [9]:
from nltk.corpus import wordnet as wn
count = 0
for _ in wn.words():
  count += 1
print(f'total number of words in WordNet corpus are: {count}')

total number of words in WordNet corpus are: 147306


We have **36,133** misspelled words in the **Birkbeck** corpus and **147,306** words in the **WordNet** corpus.

### Parallelization

1. Parallelizing tasks across the different cores of a CPU:

  
a) Running without parallelization

In [None]:
incorrect_words = ['caugt', 'nit', 'siit', 'garl']
start = time.time()
results = []
for word in tqdm.tqdm(incorrect_words, desc='Without parallelization'):
    results.append(find_closes_match(word))
print('#' * 60)
print(f'Without parallelization the time taken is {round(time.time()-start, 4)} second(s)')
print('#' * 60)
del results
print('\n')

Without parallelization: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [03:31<00:00, 52.84s/it]


############################################################
Without parallelization the time taken is 211.3826 second(s)
############################################################




  b) Running with parallelization

In [None]:
start = time.time()
# executor.map returns a lazy iterator; list() collects the actual results
with concurrent.futures.ProcessPoolExecutor() as executor:
    results = list(executor.map(find_closes_match, incorrect_words))
print('#' * 60)
print(f'With parallelization the time taken is {round(time.time() - start, 4)} second(s)')
print('#' * 60)
del results
print('\n')

############################################################
With parallelization the time taken is 103.3952 second(s)
############################################################




We can clearly see that the run time improves when we parallelize jobs across different cores: 211.38 seconds without parallelization versus 103.40 seconds with it, for the same four words.
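
As a quick check on the two timings reported above, the speedup can be computed directly. This is only a worked calculation on the printed run times; the variable names are illustrative.

```python
# Run times printed by the two cells above, in seconds
serial_seconds = 211.3826
parallel_seconds = 103.3952

# Speedup is the serial time divided by the parallel time
speedup = serial_seconds / parallel_seconds
print(f"speedup: {speedup:.2f}x")
```

With only four words in the test list, at most four worker processes can be busy at once, so this measurement says little about how the speedup scales to the full corpus.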

2. Parallelizing across different systems

Fortunately, I had multiple machines available for this analysis, because the code takes a very long time to run on one machine. I parallelized the work by caching the function results on each machine and then combining the cache folders from all the systems on a single system to finish the evaluation.
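
The caching approach described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the code used for the assignment: the helper names `cached_call` and `merge_caches`, the one-JSON-file-per-word layout, and the toy function are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def cached_call(func, word, cache_dir):
    """Return func(word), reading from or writing to a per-word JSON file."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{word}.json"
    if cache_file.exists():
        # Result was computed earlier (possibly on another machine)
        return json.loads(cache_file.read_text())
    result = func(word)
    cache_file.write_text(json.dumps(result))
    return result


def merge_caches(source_dirs, target_dir):
    """Copy cache files from every source folder into one target folder."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    for source in source_dirs:
        for cache_file in Path(source).glob("*.json"):
            destination = target_dir / cache_file.name
            if not destination.exists():
                destination.write_text(cache_file.read_text())


# Demo: two folders stand in for two machines; the toy function reverses a word
with tempfile.TemporaryDirectory() as tmp:
    machine_a, machine_b, merged = (Path(tmp) / name for name in ("a", "b", "merged"))
    toy = lambda w: w[::-1]
    cached_call(toy, "caugt", machine_a)
    cached_call(toy, "nit", machine_b)
    merge_caches([machine_a, machine_b], merged)
    merged_names = sorted(p.name for p in merged.glob("*.json"))
print(merged_names)
```

Using the word itself as the file name is only safe when words contain no path separators or other characters that are invalid in file names; a hash of the word would be a more robust key.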

### Analysis and result generation

Running the functions and generating results.

In [None]:
results = []
argument_list = [(icrr, crr) for icrr, crr in zip(incorrect_spellings, correct_spellings)]

with concurrent.futures.ProcessPoolExecutor() as executor:
    for result in executor.map(return_top_1_5_10_words, argument_list):
        results.append(result)

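Looking at a few of the results to describe their structure: each result is a dictionary holding the misspelling, the correct word, and the candidate lists under the keys `1`, `5`, and `10`.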

In [None]:
results[50:55]

[{'incorrect': 'ab',
  'correct': 'albert',
  1: ['ab'],
  5: ['ab',
   'fab',
   'a',
   'aba',
   'abb',
   'abc',
   'abm',
   'abo',
   'abs',
   'alb',
   'arb',
   'b',
   'cab',
   'dab',
   'gab',
   'jab',
   'lab',
   'tab',
   'aby',
   'nab'],
  10: ['ab',
   'fab',
   'a',
   'aba',
   'abb',
   'abc',
   'abm',
   'abo',
   'abs',
   'alb',
   'arb',
   'b',
   'cab',
   'dab',
   'gab',
   'jab',
   'lab',
   'tab',
   'aby',
   'nab']},
 {'incorrect': 'ameraca',
  'correct': 'america',
  1: ['america'],
  5: ['america',
   'american',
   'arauca',
   'arca',
   'asmera',
   'camera',
   'maraca',
   'amerce'],
  10: ['america',
   'american',
   'arauca',
   'arca',
   'asmera',
   'camera',
   'maraca',
   'amerce',
   'amber',
   'aaa',
   'abaca',
   'aceraceae',
   'aec',
   'amberjack',
   'ameba',
   'ameer',
   'americana',
   'ametria',
   'amora',
   'ara',
   'araceae',
   'arava',
   'arc',
   'areca',
   'armeria',
   'armoracia',
   'camera_care',
   'camer

In [None]:
del argument_list

### Evaluation

We have a custom function, `return_success_at_k`, that computes success at k from the list of dictionaries generated in the step above. We used it to cross-check the results generated with `pytrec_eval`, and the two sets of results match.
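
The implementation of `return_success_at_k` lives in `utils` and is not shown in this notebook. As an illustration only, a success-at-k computation over dictionaries with this structure might look like the sketch below, assuming a query counts as a success at k when the correct word appears in the candidate list stored under key `k`. The name `success_at_k_sketch` and the toy data are hypothetical.

```python
def success_at_k_sketch(results, ks=(1, 5, 10)):
    """Fraction of queries whose correct word appears in the list stored under key k."""
    hits = {k: 0 for k in ks}
    for result in results:
        for k in ks:
            if result["correct"] in result[k]:
                hits[k] += 1
    # Assumes results is non-empty; an empty list would divide by zero
    return {k: hits[k] / len(results) for k in ks}


# Toy data shaped like the result dictionaries above (lists shortened)
toy_results = [
    {"incorrect": "ameraca", "correct": "america",
     1: ["america"], 5: ["america", "american"], 10: ["america", "american", "amber"]},
    {"incorrect": "ab", "correct": "albert",
     1: ["ab"], 5: ["ab", "fab"], 10: ["ab", "fab", "a"]},
]
scores = success_at_k_sketch(toy_results)
print(scores)
```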

In [None]:
success_at_k = return_success_at_k(results)

query = {}
results_eval = {}
for result in results:
    query[result["incorrect"]] = {result["correct"]: 1}
    results_eval[result["incorrect"]] = {}
    for word in result[1]:
        # Use a float score, consistent with the 1/5 and 1/10 scores below
        results_eval[result["incorrect"]][word] = 1.0

    for word in result[5]:
        if word not in results_eval[result["incorrect"]].keys():
            results_eval[result["incorrect"]][word] = 1/5

    for word in result[10]:
        if word not in results_eval[result["incorrect"]].keys():
            results_eval[result["incorrect"]][word] = 1/10

evaluator = pytrec_eval.RelevanceEvaluator(query, {'success'})

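# results_eval is a dict and cannot be sliced, so select the sample queries by key
sample_run = {r["incorrect"]: results_eval[r["incorrect"]] for r in results[50:55]}
print(json.dumps(evaluator.evaluate(sample_run), indent=1))
# Named eval_results to avoid shadowing the built-in eval()
eval_results = evaluator.evaluate(results_eval)

for measure in sorted(next(iter(eval_results.values())).keys()):
    print(measure, 'average:',
          pytrec_eval.compute_aggregated_measure(
              measure, [query_measures[measure] for query_measures in eval_results.values()])
          )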

{
 "ab": {
  "success_1": 0.0,
  "success_5": 0.0,
  "success_10": 0.0
 },
 "ameraca": {
  "success_1": 1.0,
  "success_5": 1.0,
  "success_10": 1.0
 },
 "amercia": {
  "success_1": 0.0,
  "success_5": 1.0,
  "success_10": 1.0
 },
 "ameracan": {
  "success_1": 1.0,
  "success_5": 1.0,
  "success_10": 1.0
 },
 "apirl": {
  "success_1": 0.0,
  "success_5": 1.0,
  "success_10": 1.0
 }
}
success_1 average: 0.26672162034856334
success_10 average: 0.47850918511540275
success_5 average: 0.41371290626471974


In [None]:
# END