## Dataset reading
Each line of the dataset lists a correctly spelled word followed by a varying number of its misspellings, so the rows do not all have the same length. The *read_csv* function from the *pandas* library expects the same number of columns in every row, so it cannot be used directly here. The dataset is therefore processed line by line.

In [31]:
def read_dataset(path_to_dataset: str) -> dict:
    """
    Read the misspellings dataset into a dictionary.

    Each line is expected to look like ``word: variant1 variant2 ...``.

    Parameters
    ----------
    path_to_dataset : str
        Path to the dataset text file.

    Returns
    -------
    dataset_dict : dict
        Dictionary mapping each correct word to the list of its misspellings.

    """
    dataset_dict = {}
    # read the dataset file line by line (utf-8 encoding is an assumption)
    with open(path_to_dataset, 'r', encoding='utf-8') as dataset_file:
        for word_line in dataset_file:
            # split() also discards the trailing '\n'
            line_words = word_line.split()
            if not line_words:  # skip blank lines
                continue
            correct_word = line_words[0].rstrip(':')  # removing the trailing ':'
            misspellings = line_words[1:]
            dataset_dict[correct_word] = misspellings

    return dataset_dict

In [32]:
path_to_misspells_dataset = "data/wikipedia_misspells.txt"
read_dataset(path_to_misspells_dataset)

{'Apennines': ['Apenines', 'Appenines'],
 'Athenian': ['Athenean'],
 'Athenians': ['Atheneans'],
 'Bernoulli': ['Bernouilli'],
 'Blitzkrieg': ['Blitzkreig'],
 'Brazilian': ['Brasillian'],
 'Britain': ['Britian'],
 'British': ['Brittish'],
 'Caesar': ['Ceasar'],
 'Cambridge': ['Cambrige'],
 'Caracas': ['carcas'],
 'Caribbean': ['Carribean'],
 'Carthaginian': ['Carthagian'],
 'Catalina': ['Cataline'],
 'Catiline': ['Cataline'],
 'Celsius': ['Celcius'],
 'Champagne': ['Champange'],
 'Connecticut': ['Conneticut'],
 'Cypriot': ['Cyprian'],
 'Ellis': ['eles'],
 'English': ['Enlish'],
 'European': ['Europian', 'Eurpean', 'Eurpoean'],
 'Europeans': ['Europians'],
 'February': ['febuary'],
 'Flemish': ['Flemmish'],
 'Franciscan': ['Fransiscan'],
 'Franciscans': ['Fransiscans'],
 'Gael': ['gae'],
 'Galatians': ['Galations'],
 'Gandhi': ['Ghandi'],
 'Gauguin': ['gogin'],
 'Guatemala': ['Guatamala'],
 'Guatemalan': ['Guatamalan'],
 'Guinness': ['Guiness'],
 'Israelis': ['Israelies'],
 'Ithaca': ['
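
The output above is keyed by the correct spelling, with the misspellings as values. When looking up a misspelled word, the reverse mapping may be more convenient. The following is a minimal sketch of such an inversion; the helper name `invert_dataset` is hypothetical and not part of the notebook, and the sample entries are copied from the output above. Note that one misspelling can belong to more than one correct word ('Cataline' appears under both 'Catalina' and 'Catiline'), so the values are kept as lists.

```python
from collections import defaultdict


def invert_dataset(dataset_dict: dict) -> dict:
    """Map each misspelling to the list of correct words it is listed under."""
    misspelling_to_correct = defaultdict(list)
    for correct_word, misspellings in dataset_dict.items():
        for misspelling in misspellings:
            misspelling_to_correct[misspelling].append(correct_word)
    return dict(misspelling_to_correct)


# A few entries copied from the output above (not the full dataset).
sample = {
    'Apennines': ['Apenines', 'Appenines'],
    'Catalina': ['Cataline'],
    'Catiline': ['Cataline'],
}

inverted = invert_dataset(sample)
print(inverted['Cataline'])
```

Keeping lists as values avoids silently overwriting one correct word with another when the same misspelling occurs twice.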

1. pyspellchecker library

>pip install pyspellchecker

In [1]:
from spellchecker import SpellChecker

spell = SpellChecker()

# find those words that may be misspelled
misspelled = spell.unknown(['let', 'us', 'wlak','on','the','groun'])

for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))

    # Get a list of `likely` options
    print(spell.candidates(word))

walk
{'weak', 'wak', 'walk', 'flak'}
group
{'groin', 'groan', 'grout', 'ground', 'grown', 'groon', 'group', 'aroun'}
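
One way to use the dataset dictionary from the previous section is to measure how often a corrector recovers the right word. The sketch below is an assumption about how such a check could look, not the notebook's own evaluation: `evaluate_corrector` is a hypothetical helper, and a toy lookup table stands in for a real spell checker so that the example runs without extra dependencies.

```python
from typing import Callable, Optional


def evaluate_corrector(dataset_dict: dict,
                       correct: Callable[[str], Optional[str]]) -> float:
    """Return the fraction of misspellings that `correct` maps to the right word."""
    total = 0
    hits = 0
    for correct_word, misspellings in dataset_dict.items():
        for misspelling in misspellings:
            total += 1
            suggestion = correct(misspelling)
            # Compare case-insensitively: a corrector may lowercase its
            # output, and it may return None when it has no suggestion.
            if suggestion is not None and suggestion.lower() == correct_word.lower():
                hits += 1
    return hits / total if total else 0.0


# Toy stand-in for a real corrector (hypothetical data).
toy_fixes = {'Britian': 'britain', 'Ceasar': 'caesar'}

# A few entries in the same shape as the dataset dictionary.
sample = {'Britain': ['Britian'], 'Caesar': ['Ceasar'], 'Gandhi': ['Ghandi']}

accuracy = evaluate_corrector(sample, toy_fixes.get)
print(f"accuracy = {accuracy:.2f}")
```

With pyspellchecker, `spell.correction` could be passed as the `correct` argument in place of `toy_fixes.get`. Whether it returns `None` or the input word when it finds no correction may depend on the installed version, which is why the helper handles both cases.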


2. Pre-trained spell correction model 'Spello'

>pip install spello

In [None]:
from spello.model import SpellCorrectionModel