# Collecting an evaluation dataset for spell checking (and correction) tools

First, let's define our task: detecting incorrectly spelled words. The following character-level operations within a word are considered misspellings:

*   **deletion** of a character (cat -> ct)
*   **swapping** of two adjacent characters (cat -> cta)
*   **addition** of an extra character (cat -> cart)
*   **replacement** of a character with a random one (cat -> cvt)
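
As a minimal sketch, the four operations listed above can be written as plain string functions. The function names and the explicit position argument `i` are assumptions made for illustration; they are not part of this notebook's pipeline, which picks positions at random.

```python
def delete_char(word, i):
    """Deletion: drop the character at position i."""
    return word[:i] + word[i + 1:]


def swap_adjacent(word, i):
    """Swapping: exchange the characters at positions i and i + 1."""
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def add_char(word, i, ch):
    """Addition: insert the extra character ch before position i."""
    return word[:i] + ch + word[i:]


def replace_char(word, i, ch):
    """Replacement: overwrite the character at position i with ch."""
    return word[:i] + ch + word[i + 1:]


print(delete_char("cat", 1))        # ct
print(swap_adjacent("cat", 1))      # cta
print(add_char("cat", 2, "r"))      # cart
print(replace_char("cat", 1, "v"))  # cvt
```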

Thus, we need to collect an evaluation dataset that is as representative as possible for this task.

The next step is to decide at which level we want to detect and collect misspelled words: token, sentence, or paragraph. Ideally, we want to correct misspellings at the paragraph level, but spell checking tools (especially language models) limit the number of input tokens, so to keep evaluation and dataset collection simple, **we will focus on the sentence level**.

## Evaluation dataset sources

The evaluation dataset will consist of:
1. [SpellGram](https://huggingface.co/datasets/vishnun/SpellGram) - this dataset contains only character replacement misspellings. I'll take only a part of it because it has too many samples for our evaluation purposes.

2. Model-generated or rule-based samples - these cover the remaining categories: deletion, swapping, and addition.

All four categories of misspellings will be balanced in order to fairly evaluate the model and/or rule-based method.

### Install and load all necessary libraries

In [7]:
!pip install datasets pandas transformers torch numpy



In [230]:
import pandas as pd
import numpy as np

from tqdm import tqdm
import random
import torch
from datasets import load_dataset
from transformers import GPT2LMHeadModel, GPT2Tokenizer

import warnings
warnings.filterwarnings('ignore', category=pd.errors.SettingWithCopyWarning)

### Part 1: Load SpellGram dataset
I will use pandas to store the dataset, mainly because I will merge it with a custom dataset later.

In [94]:
SpellGram_dataset = load_dataset("vishnun/SpellGram")
SpellGram_data = SpellGram_dataset['train'].to_pandas() # dataset has only "train" split
SpellGram_data.head()

Unnamed: 0,source,target
0,rate the silent upeaker four out oe 6,rate the silent speaker four out of 6
1,please find me tqe gork tqe bfrning sorld,please find me the work the burning world
2,three friendl afe relaxing uround the tsble,three friends are relaxing around the table
3,what dm they want,what do they want
4,man in tan aat working with stones,man in tan hat working with stones


In [95]:
len(SpellGram_data)

40000

### Let's see some statistics of the loaded dataset.

In [96]:
source_nulls = SpellGram_data['source'].isna().sum()
print(f"Number of rows with None/NaN in 'source': {source_nulls}")

Number of rows with None/NaN in 'source': 0


In [97]:
target_nulls = SpellGram_data['target'].isna().sum()
print(f"Number of rows with None/NaN in 'target': {target_nulls}")

Number of rows with None/NaN in 'target': 1


In [98]:
SpellGram_data = SpellGram_data.dropna()
SpellGram_data.reset_index(drop=True, inplace=True)

In [99]:
def text_statistics(df, column_name):
    df[f'{column_name}_length_char'] = df[column_name].apply(len)
    df[f'{column_name}_length_word'] = df[column_name].apply(lambda x: len(x.split()))

    # Characters statistics
    stats_char = {
        'mean': df[f'{column_name}_length_char'].mean(),
        'max': df[f'{column_name}_length_char'].max(),
        'min': df[f'{column_name}_length_char'].min(),
        'std': df[f'{column_name}_length_char'].std()
    }
    # Words statistics
    stats_word = {
        'mean': df[f'{column_name}_length_word'].mean(),
        'max': df[f'{column_name}_length_word'].max(),
        'min': df[f'{column_name}_length_word'].min(),
        'std': df[f'{column_name}_length_word'].std()
    }
    return {
        'stats_char': stats_char,
        'stats_word': stats_word
    }

In [100]:
source_stats = text_statistics(SpellGram_data, 'source')
target_stats = text_statistics(SpellGram_data, 'target')

print("*"*10, "Source Column Statistics:", "*"*10)
print(f"Text length (characters):\n{source_stats['stats_char']}")
print(f"Text length (words):\n{source_stats['stats_word']}\n")

print("*"*10, "Target Column Statistics:", "*"*10)
print(f"Text length (characters):\n{target_stats['stats_char']}")
print(f"Text length (words):\n{target_stats['stats_word']}\n")

********** Source Column Statistics: **********
Text length (characters):
{'mean': 55.11177779444486, 'max': 210, 'min': 2, 'std': 20.075655223022174}
Text length (words):
{'mean': 9.7811195279882, 'max': 40, 'min': 1, 'std': 3.161276201164558}

********** Target Column Statistics: **********
Text length (characters):
{'mean': 55.117477936948426, 'max': 210, 'min': 2, 'std': 20.073317393567915}
Text length (words):
{'mean': 9.781094527363184, 'max': 40, 'min': 1, 'std': 3.1612546984707244}



As we can see, the character statistics for "source" and "target" are the same or nearly the same, which suggests that the dataset indeed contains no additions or deletions. Let's see why the mean character lengths differ slightly.

In [101]:
SpellGram_data['source_length_char'] = SpellGram_data['source'].apply(len)
SpellGram_data['target_length_char'] = SpellGram_data['target'].apply(len)

diff_length_rows = SpellGram_data[SpellGram_data['source_length_char'] != SpellGram_data['target_length_char']]

print('Number of rows with length mismatch:', len(diff_length_rows[['source', 'target', 'source_length_char', 'target_length_char']]))
diff_length_rows[['source', 'target', 'source_length_char', 'target_length_char']].head()

Number of rows with length mismatch: 228


Unnamed: 0,source,target,source_length_char,target_length_char
5,treacle said the dormouse without considering ...,treacle said the dormouse without considering...,62,63
426,she s grown a goop deal xas her first remark,she s grown a good deal was her first remark,44,45
606,if i d meant that i d have naid it naid humpty...,if i d meant that i d have said it said humpt...,53,54
767,scow your ticket chiwd the gufrd went on looki...,show your ticket child the guard went on look...,65,66
814,can t zou tae queen said in a pitying tone,can t you the queen said in a pitying tone,42,43


This difference might occur due to extra spaces in the target column. Let's try to fix that.

In [102]:
SpellGram_data['target'] = SpellGram_data['target'].str.replace(r'\s+', ' ', regex=True).str.strip()
SpellGram_data['source'] = SpellGram_data['source'].str.replace(r'\s+', ' ', regex=True).str.strip()

SpellGram_data['source_length_char'] = SpellGram_data['source'].apply(len)
SpellGram_data['target_length_char'] = SpellGram_data['target'].apply(len)

diff_length_rows = SpellGram_data[SpellGram_data['source_length_char'] != SpellGram_data['target_length_char']]

print('Number of rows with length mismatch:', len(diff_length_rows[['source', 'target', 'source_length_char', 'target_length_char']]))
diff_length_rows[['source', 'target', 'source_length_char', 'target_length_char']].head()

Number of rows with length mismatch: 1


Unnamed: 0,source,target,source_length_char,target_length_char
2055,however i know my name now she said that s som...,however i know my name now she said that s som...,57,55


Now we have one row with length mismatch, let's explore it.

In [103]:
diff_length_rows['source'].iloc[0]

'however i know my name now she said that s some comfort g'

In [104]:
diff_length_rows['target'].iloc[0]

'however i know my name now she said that s some comfort'

This row is most probably an outlier. Let's delete it from the dataset.

In [105]:
SpellGram_data = SpellGram_data.drop(diff_length_rows.index)
SpellGram_data.reset_index(drop=True, inplace=True)
len(SpellGram_data)

39998
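
Since every remaining pair now has equal source and target lengths, the number of replaced characters in a pair can be counted position by position. The sketch below is an illustration only: `count_replaced_chars` is a hypothetical helper, and the three pairs are copied from the `head()` output shown earlier.

```python
def count_replaced_chars(source, target):
    """Count the positions at which two equal-length strings differ."""
    if len(source) != len(target):
        raise ValueError("source and target must have the same length")
    return sum(s != t for s, t in zip(source, target))


# Sample pairs taken from the first rows of the dataset preview
pairs = [
    ("what dm they want", "what do they want"),
    ("man in tan aat working with stones", "man in tan hat working with stones"),
    ("rate the silent upeaker four out oe 6", "rate the silent speaker four out of 6"),
]

for source, target in pairs:
    print(count_replaced_chars(source, target), "|", source)
```

A count of zero would flag a pair that carries no misspelling at all.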

Now, let's take a slice of our dataset and keep only "source" and "target" columns. Then, save our dataset to a .csv file.

In [106]:
SpellGram_data_500 = SpellGram_data[['source', 'target']][:500]

In [108]:
SpellGram_data_500.to_csv("SpellGram_data_500.csv")

### Part 2: Generate dataset using GPT2 and rule-based methods

Now we need to generate 1500 (500 for each misspelling category) target sentences.

- First, we will try to use GPT2 because it's not that large and it should handle one random sentence generation just fine.

- Otherwise, we will take 1500 random sentences from some dataset.

For a minimal prompt for the model, we'll take a random word from SpellGram dataset ("target" column) because it's more reliable than GPT2's tokenizer vocabulary which might have irrelevant characters.

In [61]:
SpellGram_vocab = []

for sent in SpellGram_data['target'].to_list():
  SpellGram_vocab.extend(sent.split())

In [56]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [58]:
device

device(type='cuda')

In [57]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.to(device)
model.eval()



GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

To generate sentences, I will use character and word statistics of SpellGram dataset so our evaluation dataset can be consistent:

**Characters**:

- mean: 55.117477936948426
- max: 210
- min: 2
- std: 20.073317393567915

**Words**:
- mean: 9.781094527363184
- max: 40
- min: 1
- std: 3.1612546984707244

In [88]:
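def generate_sentence(max_length=210):
    random_word = random.choice(SpellGram_vocab)
    input_ids = tokenizer.encode(random_word, return_tensors='pt').to(device)

    # max_length is measured in tokens (prompt included), so the lower bound
    # is the prompt's token count plus one, not the word's character count
    min_length = input_ids.shape[1] + 1
    total_length = random.randint(min_length, max(min_length, max_length))
    output = model.generate(input_ids, max_length=total_length, num_return_sequences=1)
    sentence = tokenizer.decode(output[0], skip_special_tokens=True)

    return sentence.strip()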

Let's do a test run to see whether our model can generate something decent.

In [92]:
n_sentences = 50
sentences = []

while len(sentences) < n_sentences:
    sentence = generate_sentence(max_length=40)
    word_count = len(sentence.split())

    # Check the word count criteria
    if word_count <= 40 and word_count >= 1:
        sentences.append(sentence.replace('\n', '').split('.')[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

In [93]:
for s in sentences[:10]:
  print(s)

located in the same directory as the file
months
border
are, a former U
was, in fact, a very good one
the, "I'm not sure what you're talking about," and
adult, and the most important thing is that you're not going to
arthropod, which is a type of fish that is
has
damaged


We can see that most outputs are fragments or unrelated continuations rather than complete sentences, so prompting the model with a single random word was not a good idea.

Let's get 1500 sentences from the SpellGram_data dataset but only those target sentences which are not present in our saved dataset.

In [268]:
# Rows 0-499 were already saved above, so take the next 1500 target sentences
SpellGram_remaining_samples = SpellGram_data['target'].iloc[500:]
SpellGram_data_1500 = SpellGram_remaining_samples[:1500].reset_index(drop=True)
SpellGram_data_1500.head()

Unnamed: 0,target
0,is it for further automatic processing
1,the result was an oldsmobilelike bulge
2,i actually felt lighthearted
3,dialogue implies a deadly rivalry between him ...
4,the book has been wrong in the past
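
Selecting by row position keeps the two slices disjoint as rows, but it would not catch a sentence whose text appears in both slices if the dataset contains duplicates. A minimal sketch of such a check follows; `overlapping_sentences` is a hypothetical helper, and the two short lists are toy data with a deliberately repeated sentence, not results from the dataset.

```python
def overlapping_sentences(first_part, second_part):
    """Return the sentences whose exact text appears in both collections."""
    return sorted(set(first_part) & set(second_part))


# Toy data (assumption): the repeated sentence is planted for illustration
saved_targets = ["what do they want", "man in tan hat working with stones"]
new_targets = ["i actually felt lighthearted", "what do they want"]

print(overlapping_sentences(saved_targets, new_targets))
```

In the notebook, the same function could be called on the first 500 target sentences and on `SpellGram_data_1500`; an empty list would mean there is no textual overlap.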


Then, let's introduce misspellings into these sentences to generate the "source" column.

In [229]:
def introduce_spelling_errors(sentence, error_type, error_rate=0.5):
    sentence_words = sentence.split()

    assert error_type in ['swap', 'delete', 'add'], "Error type must be 'swap', 'delete', or 'add'."

    # Pick how many words to change: between 1 and int(error_rate * n_words) - 1,
    # because the upper bound of the range is exclusive (1 for short sentences)
    max_words = int(error_rate * len(sentence_words))
    if max_words > 1:
        num_words_to_change = random.randrange(1, max_words)
    else:
        num_words_to_change = 1
    words_to_change_indexes = random.sample(range(len(sentence_words)), num_words_to_change)

    for index in words_to_change_indexes:
        word = sentence_words[index]
        chars = list(word)

        if error_type == 'swap' and len(chars) > 1:  # Ensure there are at least 2 characters to swap
            swap_index = random.randint(0, len(chars) - 2)
            chars[swap_index], chars[swap_index + 1] = chars[swap_index + 1], chars[swap_index]

        elif error_type == 'delete' and len(chars) > 0:  # Ensure there's at least one character to delete
            delete_index = random.randint(0, len(chars) - 1)
            chars.pop(delete_index)

        elif error_type == 'add':
            add_index = random.randint(0, len(chars))  # Random index to add a character (can add at the end)
            chars.insert(add_index, random.choice('abcdefghijklmnopqrstuvwxyz'))

        sentence_words[index] = ''.join(chars)

    return ' '.join(sentence_words)

In [217]:
input_sentence = "this is a sample sentence for generating spelling mistakes"
incorrect_sentence = introduce_spelling_errors(input_sentence,
                                               error_type='add',
                                               error_rate=0.5)

print("Original sentence:", input_sentence)
print("Incorrectly spelled sentence:", incorrect_sentence)

Original sentence: this is a sample sentence for generating spelling mistakes
Incorrectly spelled sentence: this is a sample sentenkce for generating spellintg mistakzes


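Now let's generate 1500 misspelled source samples with error_rate equal to 0.5. Because the upper bound of the sampled range is exclusive, between 1 and int(n_words*0.5) - 1 words are changed per sentence, and exactly 1 word when the sentence has fewer than 6 words.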
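
To make the effect of error_rate concrete, the sketch below mirrors the sampling step of `introduce_spelling_errors` and returns the smallest and largest number of words that can be changed for a given sentence length. `word_change_bounds` is a hypothetical helper written for this illustration only.

```python
def word_change_bounds(n_words, error_rate=0.5):
    """Return (min, max) words that the sampling step can select for a sentence."""
    limit = int(error_rate * n_words)
    if limit > 1:
        # range(1, limit) excludes its upper bound, so the maximum is limit - 1
        return 1, limit - 1
    # Short sentences always get exactly one changed word
    return 1, 1


for n_words in (1, 4, 6, 9, 14):
    print(n_words, word_change_bounds(n_words))
```

For the nine-word sample sentence used above, this gives between 1 and 3 changed words.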

In [239]:
source_sentences = []

Swap:

In [240]:
for sent in tqdm(SpellGram_data_1500[:500]):
    incorrect_sentence = introduce_spelling_errors(sentence=sent,
                                                   error_type='swap',
                                                   error_rate=0.5)
    source_sentences.append(incorrect_sentence)

100%|██████████| 500/500 [00:00<00:00, 82360.76it/s]


Delete:

In [241]:
for sent in tqdm(SpellGram_data_1500[500:1000]):
    incorrect_sentence = introduce_spelling_errors(sentence=sent,
                                                   error_type='delete',
                                                   error_rate=0.5)
    source_sentences.append(incorrect_sentence)

100%|██████████| 500/500 [00:00<00:00, 42522.19it/s]


Add:

In [242]:
for sent in tqdm(SpellGram_data_1500[1000:]):
    incorrect_sentence = introduce_spelling_errors(sentence=sent,
                                                   error_type='add',
                                                   error_rate=0.5)
    source_sentences.append(incorrect_sentence)

100%|██████████| 500/500 [00:00<00:00, 46944.51it/s]


In [243]:
len(source_sentences)

1500

In [263]:
source_sentences[:5]

['si it for further autmoatic processing',
 'the result was na oldsmobilelike buleg',
 'i actually felt lighthearted',
 'dialogue implies a deadly rivalry between him nad hnazo hasashi ebtter known as scorpion',
 'the book has been wrong in the pats']

In [270]:
SpellGram_data_1500_final = pd.DataFrame({'source': source_sentences, 'target': SpellGram_data_1500})
SpellGram_data_1500_final.head()

Unnamed: 0,source,target
0,si it for further autmoatic processing,is it for further automatic processing
1,the result was na oldsmobilelike buleg,the result was an oldsmobilelike bulge
2,i actually felt lighthearted,i actually felt lighthearted
3,dialogue implies a deadly rivalry between him ...,dialogue implies a deadly rivalry between him ...
4,the book has been wrong in the pats,the book has been wrong in the past
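
Row 2 of the preview has identical source and target, so no misspelling was actually introduced. This can happen, for example, when the word picked for a swap has a single character or when the two swapped characters are identical. A minimal sketch for locating such rows is shown below; `unchanged_pairs` is a hypothetical helper, and the lists are copied from the first three preview rows.

```python
def unchanged_pairs(sources, targets):
    """Return the positions where the 'misspelled' source equals its target."""
    return [i for i, (s, t) in enumerate(zip(sources, targets)) if s == t]


sources = [
    "si it for further autmoatic processing",
    "the result was na oldsmobilelike buleg",
    "i actually felt lighthearted",
]
targets = [
    "is it for further automatic processing",
    "the result was an oldsmobilelike bulge",
    "i actually felt lighthearted",
]

print(unchanged_pairs(sources, targets))  # [2]
```

Rows found this way could be regenerated or dropped before evaluation, depending on whether error-free sentences are wanted in the dataset.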


In [271]:
SpellGram_data_1500_final.to_csv('SpellGram_data_1500.csv')

### Collect all samples together.

In [278]:
SpellGram_dataset_2k = pd.concat([SpellGram_data_500, SpellGram_data_1500_final], ignore_index=True)

SpellGram_dataset_2k.head()

Unnamed: 0,source,target
0,rate the silent upeaker four out oe 6,rate the silent speaker four out of 6
1,please find me tqe gork tqe bfrning sorld,please find me the work the burning world
2,three friendl afe relaxing uround the tsble,three friends are relaxing around the table
3,what dm they want,what do they want
4,man in tan aat working with stones,man in tan hat working with stones


In [279]:
SpellGram_dataset_2k.to_csv("SpellGram_dataset_2k.csv")