# Import Required Libraries
In this project, we use the **transformers** library (from **huggingface.co**) to load pre-trained models based on **BERT**. We use BERT and RoBERTa models for English, and BERT and ALBERT models for Persian.

In [109]:
from transformers import pipeline, BertTokenizer, BertForMaskedLM, AlbertTokenizer, AlbertForMaskedLM, RobertaTokenizer, RobertaModel
from transformers.pipelines.fill_mask import FillMaskPipeline
import torch

from spacy.tokens.token import Token
from spacy.tokens.doc import Doc
import editdistance
import pandas as pd
import string

import stanza
import spacy
import spacy_stanza

# Models

The following is a brief description of these Transformer models and how they differ from, and resemble, the BERT base model:

1. ALBERT: BERT base has about 110 million parameters, which makes it computationally intensive, so a lighter version with fewer parameters was needed. The ALBERT base model has about 12 million parameters, with a hidden size of 768 and an embedding size of 128. The lighter model reduces memory use and training time. To achieve the smaller parameter count, two techniques are used: **Cross-layer parameter sharing** and **Factorized embedding layer parameterization**.

2. RoBERTa stands for “Robustly Optimized BERT pre-training Approach”. In many ways this is a better version of the BERT model. The key points of difference are as follows:

    - **Dynamic Masking**: BERT uses static masking, i.e. the same part of the sentence is masked in every epoch. In contrast, RoBERTa uses dynamic masking, where different parts of the sentence are masked in different epochs. This makes the model more robust.

    - **Remove NSP Task**: It was observed that the next sentence prediction (NSP) task is not very useful for pre-training BERT. Therefore, RoBERTa is pre-trained with only the masked language modeling (MLM) task.

    - **More Data Points**: BERT is pre-trained on the Toronto BookCorpus and English Wikipedia datasets, about 16 GB of data in total. In addition to these two datasets, RoBERTa was also trained on others such as CC-News (Common Crawl News) and OpenWebText. The total size of these datasets is around 160 GB.

    - **Large Batch size**: To improve on the speed and performance of the model, RoBERTa used a batch size of 8,000 with 300,000 steps. In comparison, BERT uses a batch size of 256 with 1 million steps.
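
The saving from ALBERT's factorized embedding parameterization (item 1 above) can be checked with simple arithmetic. The sketch below assumes a vocabulary of 30,000 tokens together with the hidden size of 768 and embedding size of 128 mentioned above; the vocabulary size is an assumption chosen only for illustration.

```python
# Parameter count of the token-embedding table, with and without factorization.
# vocab_size is an assumed value for illustration.
vocab_size = 30_000   # V
hidden_size = 768     # H
embedding_size = 128  # E

# BERT-style: a single V x H embedding matrix.
unfactorized = vocab_size * hidden_size

# ALBERT-style: a V x E embedding matrix followed by an E x H projection.
factorized = vocab_size * embedding_size + embedding_size * hidden_size

print(f"unfactorized: {unfactorized:,}")  # 23,040,000
print(f"factorized:   {factorized:,}")    # 3,938,304
```

Under these assumptions the embedding table alone shrinks from roughly 23 million to roughly 4 million parameters; cross-layer parameter sharing accounts for the rest of the reduction.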


In [110]:
len(vocab)

Torch Device: cpu
fa bert Model Loaded ...


# Vocabulary

We use the transformer model's vocabulary to identify possible typos (misspellings). If a word is not in the vocabulary, it **probably** contains a misspelling. In the next step, the typo is corrected using the pre-trained model's predictions together with lexical (edit) distance.
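
The vocabulary check can be sketched as a simple set lookup. The example below uses a tiny hand-written vocabulary (`toy_vocab`) and a hypothetical helper (`find_typo_candidates`); in the notebook, `vocab` would come from the loaded model rather than from a literal set.

```python
import string

# Toy vocabulary standing in for the model vocabulary (hypothetical contents).
toy_vocab = {"the", "cat", "sat", "on", "mat"}


def find_typo_candidates(text, vocab):
    """Return the words in `text` that are missing from `vocab`."""
    candidates = []
    for raw_word in text.split():
        # Ignore surrounding punctuation and case before the lookup.
        word = raw_word.strip(string.punctuation).lower()
        if word and word not in vocab:
            candidates.append(word)
    return candidates


print(find_typo_candidates("The cat satt on the mat.", toy_vocab))  # ['satt']
```

A word flagged this way is only a candidate: rare or domain-specific words can be missing from the vocabulary without being misspelled.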

# Stanza

spaCy's tokenization is non-destructive, so it always represents the original input text and never adds or deletes anything. This is kind of a core principle of the Doc object: you should always be able to reconstruct and reproduce the original input text.
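
To illustrate what non-destructive tokenization means, here is a minimal pure-Python sketch. It is not spaCy's implementation; the function name and the dictionary keys are made up for this example. Each token records its text, its trailing whitespace, and its character offset, which is enough to rebuild the input exactly.

```python
import re


def tokenize_non_destructive(text):
    """Split `text` into tokens that together preserve every character."""
    tokens = []
    # Each match is either a word plus its trailing whitespace,
    # or a run of leading whitespace at the start of the text.
    for match in re.finditer(r"\S+\s*|\s+", text):
        chunk = match.group()
        word = chunk.rstrip()
        tokens.append({
            "text": word,
            "whitespace": chunk[len(word):],
            "idx": match.start(),  # character offset of the token
        })
    return tokens


sample = "Hello,  world !"
tokens = tokenize_non_destructive(sample)
reconstructed = "".join(t["text"] + t["whitespace"] for t in tokens)
print(reconstructed == sample)  # True
```

The character offsets play the same role as `current_token.idx` in the correction code, where they are used to cut the token out of the text and put a mask in its place.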

## Setup

We use the **Stanza** library for Persian and spaCy's `en_core_web_lg` model for English.

In [101]:
if language == 'fa':
    stanza.install_corenlp()
    stanza.download('fa')
    nlp = spacy_stanza.load_pipeline("fa")

elif language == 'en':
    spacy.prefer_gpu()
    nlp = spacy.load("en_core_web_lg")

else:
    raise ValueError(f"Stanza: {language} not supported.")

For Persian text, the half-space (zero-width non-joiner) plays a key role. Unfortunately, pre-trained Persian models do not support the half-space, so their predicted words never contain it.
With the `half_space_case` function, if the predicted word and the original word in the input differ only by half-spaces, we keep the original word unchanged.
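
The notebook's `half_space_case` implementation is not shown here, so the following is only a sketch of how such a check could work. It assumes the half-space is the zero-width non-joiner (U+200C) and that a regular space may appear in its place; the function name `half_space_case_sketch` is hypothetical.

```python
ZWNJ = "\u200c"  # zero-width non-joiner, used as the Persian half-space


def half_space_case_sketch(predicted, original):
    """Return True if the two words differ only by half-spaces or spaces."""
    def normalize(word):
        return word.replace(ZWNJ, "").replace(" ", "")

    return predicted != original and normalize(predicted) == normalize(original)


original = "می" + ZWNJ + "خواهم"   # written with a half-space
predicted = "میخواهم"              # model output without the half-space
print(half_space_case_sketch(predicted, original))  # True
```

When the check returns True, the original spelling is kept, since the model's version is only missing a character it cannot produce.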

# Spell Correction

## Correct Lexico Typo

## Correct Contextual Typo



In [103]:
def contextual_typo_correction(
        text,
        alpha=10,
        max_edit_distance=2,
        top_k=10,
        verbose=False,
):
    doc = nlp(text)
    for index in range(len(doc)):

        current_token: Token = doc[index]

        # Only log per-token progress when verbose output is requested
        if verbose:
            print("*" * 50)
            print(f"Token: {current_token.text}")

        start_char_index = current_token.idx
        end_char_index = start_char_index + len(current_token)

        masked_text = doc.text[:start_char_index] + MASK + doc.text[end_char_index:]

        predicts = unmasker(masked_text, top_k=top_k)
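        ### Select Token From Predicts
        predicts = pd.DataFrame(predicts)
        # Empty default so the except handler below can always report a length
        filtered_predicts = predicts.iloc[0:0]

        try: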
            if current_token.text in string.punctuation:
                filtered_predicts = predicts.loc[predicts['token_str'].apply(lambda tk: tk in string.punctuation), :].copy()
                selected_predict = filtered_predicts['token_str'].iloc[0]

            elif any(c.isdigit() for c in current_token.text):
                selected_predict = current_token.text

            else:
                predicts.loc[:, 'token_str'] = predicts['token_str'].apply(lambda tk: tk.replace(" ", ""))
                predicts.loc[:, 'edit_distance'] = predicts['token_str'].apply(lambda tk: editdistance.eval(current_token.text, tk))

                # Keep only candidates within max_edit_distance of the current token
                filtered_predicts = predicts.loc[predicts['edit_distance'] <= max_edit_distance, :].copy()

                # Apply total score function
                # e: edit distance + 1
                # l: token length
                filtered_predicts.loc[:, 'e_to_l'] = (filtered_predicts.loc[:, 'edit_distance'] + 1) / len(current_token.text)

                filtered_predicts.loc[:, 'total_score'] = filtered_predicts.loc[:, 'score'] / filtered_predicts.loc[:, 'e_to_l'] ** alpha

                filtered_predicts = filtered_predicts.sort_values('total_score', ascending=False)
                selected_predict_row = filtered_predicts.iloc[0, :]

                selected_predict = selected_predict_row['token_str']

        except Exception as e:
            print(f"Error: {e} From {current_token.text} Filtered Predictions Length: {len(filtered_predicts)}")
            selected_predict = current_token.text

        if selected_predict != current_token.text:
            if not half_space_case(selected_predict, current_token.text):
                text = masked_text.replace(MASK, selected_predict, 1)
                doc = nlp(text)

            else:
                vocab.add(current_token.text)
                selected_predict = current_token.text

        if verbose:
            if current_token.text != selected_predict:
                print("Filtered Predicts: \n")
                # reindex tolerates a missing 'total_score' column (punctuation branch)
                print(filtered_predicts.reindex(columns=['token_str', 'score', 'total_score']))

                print(f"{current_token.text} -> {selected_predict} : contextual")

                typo_correction_details = {
                    "raw": current_token.text,
                    "corrected": selected_predict,
                    "span": f"[{start_char_index}, {end_char_index}]",
                    # Clamp at 0 so tokens near the start do not produce a negative slice index
                    "around": text[max(0, start_char_index - 10): end_char_index + 10],
                    "type": "contextual"
                }

                print(typo_correction_details)

    return text
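
The ranking step above divides the model score by `((edit_distance + 1) / token_length) ** alpha`, so candidates that are close to the original token are strongly favoured. The sketch below reproduces that formula on made-up candidates and scores. To stay self-contained it uses a small pure-Python Levenshtein function in place of the `editdistance` package; the helper names are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def total_score(token, candidate, model_score, alpha=10):
    """Model score divided by ((edit distance + 1) / token length) ** alpha."""
    e_to_l = (levenshtein(token, candidate) + 1) / len(token)
    return model_score / e_to_l ** alpha


token = "spellng"
# Made-up model probabilities for three candidate replacements.
candidates = {"spelling": 0.10, "selling": 0.30, "smelling": 0.40}

ranked = sorted(candidates,
                key=lambda c: total_score(token, c, candidates[c]),
                reverse=True)
print(ranked[0])  # spelling
```

Here "spelling" wins despite having the lowest model score, because its edit distance of 1 (against 2 for the others) is amplified by the exponent `alpha`. A smaller `alpha` shifts the balance back toward the model's own ranking.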

# Correction Pipeline Class

In [104]:
class SpellCorrector:

    def __init__(
            self,
            alpha=5,
            max_edit_distance=2,
            verbose=False,
            top_k=50
    ):
        self.alpha = alpha
        self.max_edit_distance = max_edit_distance
        self.verbose = verbose
        self.top_k = top_k

    def _lexico_typo_correction(self, text):
        return lexico_typo_correction(text, self.alpha, self.max_edit_distance, self.top_k, self.verbose)

    def _contextual_typo_correction(self, text):
        return contextual_typo_correction(text, self.alpha, self.max_edit_distance, self.top_k, self.verbose)

    def correction_pipeline(self, text):
        # print("Lexico Correction ...") if self.verbose else print()
        corrected_text = self._lexico_typo_correction(text)

        # print("Contextual Correction ...") if self.verbose else print()
        corrected_text = self._contextual_typo_correction(corrected_text)

        return corrected_text

    def __call__(self, text, *args, **kwargs):
        return self.correction_pipeline(text)


# Sample Texts

In this section, the `SpellCorrector` pipeline is run on the given sample texts, and its output is compared with the correct version of each text.

In [None]:
"الکل" in vocab