### Install and import transformers

In [1]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyYA

In [1]:
from transformers import AutoConfig, AutoTokenizer, TFAutoModel
from transformers import pipeline

PARSBERT = "HooshvareLab/bert-base-parsbert-uncased"
# config = AutoConfig.from_pretrained(PARSBERT)
tokenizer = AutoTokenizer.from_pretrained(PARSBERT)
model = TFAutoModel.from_pretrained(PARSBERT)


Some layers from the model checkpoint at HooshvareLab/bert-base-parsbert-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at HooshvareLab/bert-base-parsbert-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [19]:
import random 
import numpy as np
import nltk
import pandas as pd
import codecs
import tqdm

In [2]:
!pip install hazm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
# Only the Normalizer class is needed from hazm
from hazm import Normalizer
normalizer = Normalizer()

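### normalize_input

*   Input: inp: str
*   Output: normalized: str

Converts the *inp* string to a normalized string: whitespace is collapsed, half-spaces are replaced with regular spaces, and punctuation marks are separated from the words.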



In [3]:
text = "پس از سال‌‌ها تلاش رازی موفق به کشف الکل شد. این دانشمند ایرانی باعث افتخار در تاریخ کور است."
import string

def normalize_input(inp: str) -> str:
  # Collapse runs of whitespace before handing the text to hazm
  inp_splitted = inp.strip().split()
  inp_with_halfspace = normalizer.normalize(" ".join(inp_splitted))
  # Replace the half-space (ZWNJ, U+200C) with a regular space
  inp_without_halfspace = inp_with_halfspace.replace("\u200c", " ")
  # Pad punctuation with spaces so each mark becomes its own word.
  # Note: string.punctuation only covers ASCII punctuation.
  for ch in string.punctuation:
    inp_without_halfspace = inp_without_halfspace.replace(ch, " " + ch + " ")
  words_list = inp_without_halfspace.split()
  normalized = " ".join(words_list)
  return normalized

normalize_input(text)

'پس از سال ها تلاش رازی موفق به کشف الکل شد . این دانشمند ایرانی باعث افتخار در تاریخ کور است .'

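As a rough illustration of the two string-level steps in `normalize_input` (replacing the half-space and padding punctuation), here is a minimal sketch that uses only the standard library and skips the hazm normalizer. The helper name `pad_punctuation` is hypothetical. Because `string.punctuation` contains ASCII punctuation only, Persian marks such as `،` and `؟` would stay attached to their words under this approach.

```python
import string

ZWNJ = "\u200c"  # zero-width non-joiner, the Persian half-space

def pad_punctuation(text: str) -> str:
    """Replace half-spaces with spaces and put spaces around ASCII punctuation."""
    text = text.replace(ZWNJ, " ")
    for ch in string.punctuation:
        text = text.replace(ch, f" {ch} ")
    # split() with no argument collapses any run of whitespace
    return " ".join(text.split())

print(pad_punctuation("سال\u200cها تلاش شد."))
```
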
### Create the ParsBert Model for masking

In [4]:
model = pipeline('fill-mask', model=PARSBERT)

Some weights of the model checkpoint at HooshvareLab/bert-base-parsbert-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### tester

*   Input: sent: str, ind: int
*   Output: response: dict, masked_string: str

Takes a sentence and the index of the word to be masked. Returns a dict in the output format required by the question, together with the sentence in which that word is replaced by [MASK].



In [34]:
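def tester(sent, ind):
    words = sent.strip().split()
    # Scan left to right so that a repeated word gets the offset of its own
    # occurrence, not the first match in the sentence
    pos = 0
    for word in words[:ind]:
        pos = sent.find(word, pos) + len(word)
    prev_len = sent.find(words[ind], pos)
    last_index = prev_len + len(words[ind])
    response = dict()
    response["raw"] = words[ind]
    response["correct"] = None
    response["span"] = [prev_len, last_index]
    return response, ' '.join(words[:ind] + ["[MASK]"] + words[ind+1:])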

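One detail worth illustrating when computing a span: `str.find` returns the first match, so a repeated word (or a word contained in an earlier word) would get the offsets of the earlier occurrence. A minimal sketch of locating the `ind`-th word by scanning left to right, using the hypothetical helper name `word_span`:

```python
def word_span(sent: str, ind: int) -> list:
    """Return the [start, end] character offsets of the ind-th whitespace-separated word."""
    words = sent.split()
    pos = 0
    for word in words[:ind]:
        # advance past each earlier word so later searches start after it
        pos = sent.find(word, pos) + len(word)
    start = sent.find(words[ind], pos)
    return [start, start + len(words[ind])]

sent = "a b a"
print(sent.find("a"))      # 0, the first occurrence only
print(word_span(sent, 2))  # [4, 5], the second "a"
```
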
### attach

*   Input: tokenized: list
*   Output: string: str

Takes a list of WordPiece tokens (*tokenized*) and joins them back into a string. A token that starts with ## continues the previous word, so the ## is removed and the token is attached without a space.


In [6]:
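def attach(tokenized):
  # Renamed from "string" so the standard-library module of that name is not shadowed
  text = ""
  for token in tokenized:
    if token[:2] == "##":
      # Continuation piece: glue it to the previous word
      text += token[2:]
    else:
      text += " " + token
  # Drop the leading space added before the first token
  return text.strip()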

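To make the `##` convention concrete: WordPiece marks a token that continues the previous word with a leading `##`. The sketch below shows the same joining idea on a made-up token list (the tokens are illustrative, not real ParsBERT output), and the helper name `join_wordpieces` is hypothetical.

```python
def join_wordpieces(tokens):
    """Glue '##' continuation pieces onto the preceding token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)

print(join_wordpieces(["un", "##believ", "##able", "[MASK]", "story"]))
# unbelievable [MASK] story
```
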
### is_true

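*   Inputs: resp: dict, input_str: str, max_threshold: int
*   Outputs: is_valid: bool, num_of_occurence: int, preds_str: list

Checks whether the original word (*resp["raw"]*) is among the top *max_threshold* predictions of the fill-mask model for the masked string. Returns whether it was found, its rank among the predictions (-1 if it was not found), and the list of predicted tokens.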


In [7]:
def is_true(resp, input_str, max_threshold = 1500):
  # Tokenize, then join the WordPiece tokens back into a string
  tokenized = tokenizer.tokenize(input_str)
  attached = attach(tokenized)
  # model is the fill-mask pipeline; ask for the top max_threshold candidates
  preds = model(attached, top_k = max_threshold)
  preds_str = [pred["token_str"] for pred in preds]
  # Rank of the original word among the predictions, or -1 if it is absent
  num_of_occurence = -1
  if resp["raw"] in preds_str:
    num_of_occurence = preds_str.index(resp["raw"])
  return resp["raw"] in preds_str, num_of_occurence, preds_str

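The rank lookup in `is_true` can be understood without loading the model. A fill-mask pipeline returns a list of dicts that include a `token_str` field, ordered from most to least likely. The sketch below runs the same lookup on a hand-written list; the predictions and scores are invented for illustration, and `rank_of` is a hypothetical name.

```python
def rank_of(word, preds):
    """Return (found, rank) where rank is the 0-based position of word, or -1."""
    tokens = [pred["token_str"] for pred in preds]
    if word in tokens:
        return True, tokens.index(word)
    return False, -1

# Invented predictions in the shape a fill-mask pipeline returns
fake_preds = [
    {"token_str": "کشف", "score": 0.61},
    {"token_str": "ساخت", "score": 0.20},
    {"token_str": "تولید", "score": 0.05},
]
print(rank_of("ساخت", fake_preds))  # (True, 1)
print(rank_of("کسف", fake_preds))   # (False, -1)
```
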

### Clone fastText from GitHub, then install it

In [10]:
!git clone https://github.com/facebookresearch/fastText.git
%cd fastText
!pip install .
%cd ..

Cloning into 'fastText'...
remote: Enumerating objects: 3930, done.
remote: Counting objects: 100% (76/76), done.
remote: Compressing objects: 100% (29/29), done.
remote: Total 3930 (delta 29), reused 70 (delta 29), pack-reused 3854
Receiving objects: 100% (3930/3930), 8.33 MiB | 22.93 MiB/s, done.
Resolving deltas: 100% (2446/2446), done.
/content/fastText
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Processing /content/fastText
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
Collecting pybind11>=2.2
  Using cached pybind11-2.9.2-py2.py3-none-any.whl (213 kB)
Building w

In [8]:
import fasttext
import fasttext.util

### Use the snippet below if you don't have the cc.fa.300.bin model

In [None]:
fasttext.util.download_model('fa', if_exists='ignore')

Downloading https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.fa.300.bin.gz



'cc.fa.300.bin'

### Uncomment the lines below if you want to access the model from Google Drive

It's for us, not for TAs

In [12]:
# from google.colab import drive
# drive.mount('/content/drive')

Mounted at /content/drive


In [13]:
# !cp "/content/drive/MyDrive/Arshad/NLP/HW3-TransformersDataset/cc.fa.300.bin" "/content" 

### To load the fasttext model, run the snippet below.

In [9]:
ft = fasttext.load_model('cc.fa.300.bin')

### To copy the model to the drive, uncomment the snippet below
It's for us, not for TAs

In [None]:
# !cp "/content/cc.fa.300.bin" "/content/drive/MyDrive/Arshad/NLP/HW3-TransformersDataset"

### cos_sim

*   Inputs: word1: str, word2: str
*   Output: cosine similarity: float

First computes the embeddings of the two given words (with *fasttext*), then calculates their cosine similarity using *scipy*.



In [10]:
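# Import the function directly: "import scipy" alone does not
# guarantee that the scipy.spatial submodule is loaded
from scipy.spatial.distance import cosine

def cos_sim(word1, word2):
  emb1 = ft[word1]
  emb2 = ft[word2]
  # scipy returns the cosine distance, so subtract it from 1 to get the similarity
  return 1 - cosine(emb1, emb2)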

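For reference, `scipy.spatial.distance.cosine` returns the cosine distance, which is why `cos_sim` subtracts it from 1. The similarity itself is the dot product of the two vectors divided by the product of their norms. A dependency-free sketch on toy vectors (the name `cosine_similarity` is ours, not part of fastText or SciPy):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```
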
In [16]:
!pip install gensim -U

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gensim
  Downloading gensim-4.2.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 3.6.0
    Uninstalling gensim-3.6.0:
      Successfully uninstalled gensim-3.6.0
Successfully installed gensim-4.2.0


### word_sim_sent

*   Inputs: sentence: str, word: str
*   Output: Mean similarity between word and sentence




In [23]:
def word_sim_sent(sent, word):
  words = sent.split()
  sum_sim = 0
  for w in words:
    sum_sim += cos_sim(w, word)
  return sum_sim / len(words)

### most_sim_to_sent

*   Inputs: sentence: str, a list of words: list
*   Output: A list of (similarity, word) tuples, sorted from most to least similar to the sentence



In [24]:
def most_sim_to_sent(sent, words):
  # Pair each candidate with its mean similarity to the sentence
  sims = list()
  for word in words:
    sim = word_sim_sent(sent, word)
    sims.append((sim, word))
  # Highest similarity first
  sims.sort(reverse=True, key=lambda x: x[0])
  return sims

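A small worked example of this ranking, assuming a toy embedding table in place of fastText. The vectors, the words, and the names `TOY_VECTORS`, `mean_similarity` and `rank_candidates` are made up for illustration.

```python
import math

# Toy 2-d "embeddings"; the cc.fa.300 fastText vectors have 300 dimensions.
TOY_VECTORS = {
    "cat": (1.0, 0.1),
    "dog": (0.9, 0.2),
    "pet": (0.8, 0.3),
    "car": (0.1, 1.0),
}

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def mean_similarity(sent, word):
    """Average similarity between word and every word of the sentence."""
    words = sent.split()
    total = sum(cosine_similarity(TOY_VECTORS[w], TOY_VECTORS[word]) for w in words)
    return total / len(words)

def rank_candidates(sent, candidates):
    """Sort candidates by mean similarity to the sentence, best first."""
    scored = [(mean_similarity(sent, c), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

ranked = rank_candidates("cat dog", ["car", "pet"])
print([word for _, word in ranked])  # ['pet', 'car']
```
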
### calc_sim_neighs_for_other_word

*   Inputs: this_neighs: list of tuples, other_word: str
*   Output: final_prob_neighs: list of tuples

Calculates the similarity between each neighbor and the other word, then adds it to the neighbor's previous probability to get its final probability of being the correct word.



In [25]:
def calc_sim_neighs_for_other_word(this_neighs, other_word):
  final_prob_neighs = list()
  for prob, neigh in this_neighs:
    final_prob = prob + cos_sim(neigh, other_word)
    final_prob_neighs.append((final_prob, neigh))
  return final_prob_neighs

### check_sim_two_neighbors

*   Inputs: prev_neighbors: list of tuples, next_neighbors: list of tuples
*   Output: most_sim_neigh: str, most_prob: float

Checks the top 3 entries of each input list for a neighbor that appears in both, and returns it as the output.

If more than one neighbor appears in both lists, returns the most probable one (by adding its probabilities from the two lists).



In [31]:
def check_sim_two_neighbors(prev_neighbors, next_neighbors):
  most_sim_neigh = None
  most_prob = 0
  # Check for missing input before slicing; slicing None would raise a TypeError
  if prev_neighbors is None or next_neighbors is None:
    return most_sim_neigh, most_prob
  # Only the first 3 entries of each list are compared
  prev_neighbors = prev_neighbors[:3]
  next_neighbors = next_neighbors[:3]
  for prev_prob, prev_neigh in prev_neighbors:
    for next_prob, next_neigh in next_neighbors:
      if prev_neigh == next_neigh:
        sum_prob = prev_prob + next_prob
        if sum_prob > most_prob:
          most_prob = sum_prob
          most_sim_neigh = prev_neigh
        break
  return most_sim_neigh, most_prob


### fasttext_verification_2

*   Inputs: sent: str, word: str, prev_word: str, next_word: str
*   Output: chosen_neighbor: str, max_prob: float

This verification uses the two adjacent words (*prev_word* and *next_word*) to find the best replacement for the given *word*.

First we use fasttext's get_nearest_neighbors method with k=1000000 to get the words most similar to *prev_word* (*prev_neighbors*), then compute the edit distance between each neighbor and the given *word*. We keep the neighbors with the minimum nonzero edit distance; if a neighbor with edit distance 0 is found, the scan stops and the candidates collected so far are kept. The candidates are then sorted by their similarity to the given sentence (*sent*).

The same steps are repeated for *next_word*.

Finally, the best answer is chosen from the side (previous or next) with the smaller minimum edit distance; when the two distances are equal, candidates from both sides are considered.

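The candidate-selection step described above can be sketched without fastText or gensim. The example below assumes a short hand-written neighbor list in fastText's `(score, word)` format and a plain dynamic-programming Levenshtein distance; `levenshtein` and `closest_by_edit_distance` are hypothetical helpers, not the functions used in the notebook.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def closest_by_edit_distance(neighbors, word):
    """Keep the neighbors with the smallest nonzero edit distance to word.

    Stops as soon as an exact match (distance 0) is seen and keeps
    whatever has been collected up to that point.
    """
    best, chosen = None, []
    for score, neighbor in neighbors:
        dist = levenshtein(neighbor, word)
        if dist == 0:
            break
        if best is None or dist < best:
            best, chosen = dist, [(score, neighbor)]
        elif dist == best:
            chosen.append((score, neighbor))
    return best, chosen

# Invented neighbor list, ordered by similarity score
neighbors = [(0.9, "cast"), (0.8, "coat"), (0.7, "dog")]
print(closest_by_edit_distance(neighbors, "cat"))
# (1, [(0.9, 'cast'), (0.8, 'coat')])
```
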
In [42]:
import gensim
def fasttext_verification_2(sent, word, prev_word, next_word):
  # previous neighbors
  prev_neighbors = ft.get_nearest_neighbors(prev_word,k=1000000)
  prev_min_edit_dist = 100
  prev_chosen_neighbors = list()
  for prob, neighbor in prev_neighbors:
    neighbor = neighbor.replace(prev_word, "")
    edit_dist = gensim.similarities.fastss.editdist(neighbor, word)
    if edit_dist < prev_min_edit_dist:
      if edit_dist == 0:
        break
      prev_min_edit_dist = edit_dist
      prev_chosen_neighbors = [(prob, neighbor)]
      # print("-"*50)
      # print("new min edit distance", prev_min_edit_dist)
      # print("prev_chosen_neighbors", prev_chosen_neighbors)
      # print("-"*50)
    elif edit_dist == prev_min_edit_dist:
      prev_chosen_neighbors.append((prob, neighbor))
      # print("-"*50)
      # print("update prev_neighbors", prev_chosen_neighbors)
      # print("-"*50)
  # print("prev_chosen_neighbors", prev_chosen_neighbors)
  tmp_prev_chosen_neighs = calc_sim_neighs_for_other_word(prev_chosen_neighbors, next_word)
  # print("tmp_prev_chosen_neighs", tmp_prev_chosen_neighs)
  final_prev_chosen_neighs = most_sim_to_sent(sent, words = [x[1] for x in tmp_prev_chosen_neighs])
  # print("final_prev_chosen_neighs", final_prev_chosen_neighs)

  # next neighbors
  next_neighbors = ft.get_nearest_neighbors(next_word,k=1000000)
  next_min_edit_dist = 100
  next_chosen_neighbors = list()
  for prob, neighbor in next_neighbors:
    neighbor = neighbor.replace(next_word, "")
    edit_dist = gensim.similarities.fastss.editdist(neighbor, word)
    if edit_dist < next_min_edit_dist:
      if edit_dist == 0:
        break
      next_min_edit_dist = edit_dist
      next_chosen_neighbors = [(prob, neighbor)]
      # print("+"*50)
      # print("new min edit distance", next_min_edit_dist)
      # print("next_chosen_neighbors", next_chosen_neighbors)
      # print("+"*50)
    elif edit_dist == next_min_edit_dist:
      next_chosen_neighbors.append ((prob, neighbor))
      # print("+"*50)
      # print("update next_neighbors", next_chosen_neighbors)
      # print("+"*50)

  
  tmp_next_chosen_neighs = []
  final_next_chosen_neighs = []
  if next_min_edit_dist <= prev_min_edit_dist:
    # print("next_chosen_neighbors", next_chosen_neighbors)
    tmp_next_chosen_neighs = calc_sim_neighs_for_other_word(next_chosen_neighbors, prev_word)
    # print("tmp_next_chosen_neighs", tmp_next_chosen_neighs)
    final_next_chosen_neighs = most_sim_to_sent(sent, words = [x[1] for x in tmp_next_chosen_neighs])
    # print("final_next_chosen_neighs", final_next_chosen_neighs)

  if next_min_edit_dist == prev_min_edit_dist:
    chosen_neighbor, max_prob = check_sim_two_neighbors(tmp_prev_chosen_neighs, tmp_next_chosen_neighs)
    if chosen_neighbor is not None:
      return chosen_neighbor, max_prob


  final_neighs = None
  if next_min_edit_dist == prev_min_edit_dist:
    final_neighs = final_prev_chosen_neighs + final_next_chosen_neighs
  elif next_min_edit_dist > prev_min_edit_dist:
    final_neighs = final_prev_chosen_neighs
  else:
    final_neighs = final_next_chosen_neighs

  max_prob = 0
  chosen_neighbor = None
  for prob,neigh in final_neighs:
    if max_prob < prob:
      max_prob = prob
      chosen_neighbor = neigh
  
  return chosen_neighbor, max_prob

### fasttext_verification

*   Inputs: sent: str, word: str, near_word:str 
*   Output: chosen_neighbor: str, max_prob: float

It works like fasttext_verification_2, but with only one adjacent word (either the previous or the next one).

In [41]:
def fasttext_verification(sent, word, near_word):
  prev_neighbors = ft.get_nearest_neighbors(near_word,k=1000000)
  min_edit_dist = 100
  chosen_neighbors = list()
  for prob, neighbor in prev_neighbors:
    neighbor = neighbor.replace(near_word, "")
    edit_dist = gensim.similarities.fastss.editdist(neighbor, word)
    if edit_dist < min_edit_dist:
      if edit_dist == 0:
        break
      min_edit_dist = edit_dist
      chosen_neighbors = [(prob, neighbor)]
      # print("-"*50)
      # print("new min edit distance", min_edit_dist)
      # print("chosen_neighbors", chosen_neighbors)
      # print("-"*50)
    elif edit_dist == min_edit_dist:
      chosen_neighbors.append((prob, neighbor))
      # print("-"*50)
      # print("update chosen_neighbors", chosen_neighbors)
      # print("-"*50)
  # print("chosen_neighbors", chosen_neighbors)
  final_chosen_neighs = most_sim_to_sent(sent, words = [x[1] for x in chosen_neighbors])
  # print("final_chosen_neighs", final_chosen_neighs)
  max_prob = 0
  chosen_neighbor = None
  for prob,neigh in final_chosen_neighs:
    if max_prob < prob:
      max_prob = prob
      chosen_neighbor = neigh
  
  return chosen_neighbor, max_prob

### final_tester

*   Input: text: str
*   Output: corrected_answers: list of dicts

Checks each word of the sentence; if a word is judged incorrect, recommends the best replacement for it. Returns a list of responses, each containing the recommended correct word.


In [37]:
def final_tester(text):
  corrected_answers = list()
  text = normalize_input(text)
  text_arr = text.split()
  len_text_arr = len(text_arr)
  for ind in range(len_text_arr):
    print(text)
    resp, input_str = tester(text, ind)
    # Skip the suffix "ها", which normalize_input splits off as its own word
    if resp["raw"] == "ها":
      continue
    is_valid, num_of_occurence, predictions = is_true(resp, input_str, 1000)
    if is_valid:
      print("the word ", resp["raw"], " is true! and occured in number ", num_of_occurence)
    else:
      print("the word ", resp["raw"], "is false!")
      recom_word = None
      # fasttext_verification takes (sent, word, near_word), so the sentence is passed first
      if ind == 0:
        recom_word, _ = fasttext_verification(text, resp["raw"], text_arr[ind+1])
      elif ind == len_text_arr - 1:
        recom_word, _ = fasttext_verification(text, resp["raw"], text_arr[ind-1])
      else:
        recom_word, _ = fasttext_verification_2(text, resp["raw"], text_arr[ind-1], text_arr[ind+1])
      resp["correct"] = recom_word
      corrected_answers.append(resp)
      print(" we recommed you to use the word", recom_word)
      # Apply the correction so later words are checked against the fixed sentence
      if recom_word is not None:
        text_arr[ind] = recom_word
        text = ' '.join(text_arr)
    print("-"*50)
  return corrected_answers

In [38]:
text = "بسیاری از مباحث علوم غیرطبیعی با استفاده از فیزیک دنیای مادی ابل توجیح نیست و برای یادگیری باید به فلسفه های خاصی رجو کرد."
final_tester(text)

بسیاری از مباحث علوم غیرطبیعی با استفاده از فیزیک دنیای مادی ابل توجیح نیست و برای یادگیری باید به فلسفه های خاصی رجو کرد .
the word  بسیاری  is true! and occured in number  0
--------------------------------------------------
بسیاری از مباحث علوم غیرطبیعی با استفاده از فیزیک دنیای مادی ابل توجیح نیست و برای یادگیری باید به فلسفه های خاصی رجو کرد .
the word  از  is true! and occured in number  0
--------------------------------------------------
بسیاری از مباحث علوم غیرطبیعی با استفاده از فیزیک دنیای مادی ابل توجیح نیست و برای یادگیری باید به فلسفه های خاصی رجو کرد .
the word  مباحث  is true! and occured in number  56
--------------------------------------------------
بسیاری از مباحث علوم غیرطبیعی با استفاده از فیزیک دنیای مادی ابل توجیح نیست و برای یادگیری باید به فلسفه های خاصی رجو کرد .
the word  علوم  is true! and occured in number  0
--------------------------------------------------
بسیاری از مباحث علوم غیرطبیعی با استفاده از فیزیک دنیای مادی ابل توجیح نیست و برای یادگیری باید به

[{'correct': 'قابل', 'raw': 'ابل', 'span': [61, 64]},
 {'correct': 'رو', 'raw': 'رجو', 'span': [115, 118]}]

In [39]:
text = "دیوار حال مستحکم نیست"
final_tester(text)

دیوار حال مستحکم نیست
the word  دیوار  is true! and occured in number  377
--------------------------------------------------
دیوار حال مستحکم نیست
the word  حال is false!
 we recommed you to use the word حائل
--------------------------------------------------
دیوار حائل مستحکم نیست
the word  مستحکم  is true! and occured in number  314
--------------------------------------------------
دیوار حائل مستحکم نیست
the word  نیست  is true! and occured in number  62
--------------------------------------------------


[{'correct': 'حائل', 'raw': 'حال', 'span': [6, 9]}]

In [40]:
text = "پس از سال‌‌ها تلاش رازی موفق به کسف الکل شد. این دانشمند تیرانی باعث افتخار در تاریخ کور است."
final_tester(text)

پس از سال ها تلاش رازی موفق به کسف الکل شد . این دانشمند تیرانی باعث افتخار در تاریخ کور است .
the word  پس  is true! and occured in number  0
--------------------------------------------------
پس از سال ها تلاش رازی موفق به کسف الکل شد . این دانشمند تیرانی باعث افتخار در تاریخ کور است .
the word  از  is true! and occured in number  0
--------------------------------------------------
پس از سال ها تلاش رازی موفق به کسف الکل شد . این دانشمند تیرانی باعث افتخار در تاریخ کور است .
the word  سال  is true! and occured in number  1
--------------------------------------------------
پس از سال ها تلاش رازی موفق به کسف الکل شد . این دانشمند تیرانی باعث افتخار در تاریخ کور است .
پس از سال ها تلاش رازی موفق به کسف الکل شد . این دانشمند تیرانی باعث افتخار در تاریخ کور است .
the word  تلاش  is true! and occured in number  376
--------------------------------------------------
پس از سال ها تلاش رازی موفق به کسف الکل شد . این دانشمند تیرانی باعث افتخار در تاریخ کور است .
the word  رازی  is true! and 

[{'correct': 'کشف', 'raw': 'کسف', 'span': [31, 34]},
 {'correct': 'ایرانی', 'raw': 'تیرانی', 'span': [57, 63]},
 {'correct': 'کشور', 'raw': 'کور', 'span': [85, 88]}]