# Imports

In [None]:
! git clone https://github.com/ai-forever/sage.git

In [None]:
cd sage

In [None]:
! pip install .
! pip install -r requirements.txt

In [1]:
# Load functionality

import os
import torch
from sage.spelling_correction import T5ModelForSpellingCorruption, RuM2M100ModelForSpellingCorrection, AvailableCorrectors

# Quick tour [English]

In [None]:
# Load corrector

corrector = T5ModelForSpellingCorruption.from_pretrained(AvailableCorrectors.ent5_large.value)

In [4]:
# Place model on your favorite device

corrector.model.to(torch.device("cuda:0"));

## Generate correct texts

The corrector's API has two methods for generating corrected text.

The first is `correct()`: use it when you have a single sample.
You can also pass an optional `prefix` argument
and any `**generation_params` of your choice.

Its counterpart is `batch_correct()`.
As the name suggests, it is most useful when you have a batch of texts to correct.
You may also provide a `batch_size` parameter.

In [6]:
# Imagine you have a bunch of texts with broken spelling.

samples = [
    "So I think we would not be live if our ancestors did not develop siences and tecnologies.",
    "There are very successful politicians that have never tried somthing new.",
    "second , birds navigate by landmarks like river , coastlines , and moutains.",
    "Because of this , I prefer studying concepts and ideas more thad learnig facts."
]

In [None]:
# The model was trained with the "grammar: " prefix.
# Don't forget to pass `prefix` when calling the corresponding methods.

result = corrector.correct(samples[0], prefix="grammar: ")

In [9]:
print(result[0])

So I think we would not be alive if our ancestors did not develop sciences and technologies.


In [None]:
batch_result = corrector.batch_correct(samples, batch_size=1, prefix="grammar: ")

In [12]:
print(*batch_result, sep="\n")

['So I think we would not be alive if our ancestors did not develop sciences and technologies.']
['There are very successful politicians that have never tried something new.']
['second, birds navigate by landmarks like rivers, coastlines, and mountains.']
['Because of this, I prefer studying concepts and ideas more than learning facts.']


In [None]:
# Try a bigger `batch_size`

batch_result = corrector.batch_correct(samples, batch_size=4, prefix="grammar: ")

In [16]:
print(*batch_result[0], sep="\n")

So I think we would not be alive if our ancestors did not develop sciences and technologies.
There are very successful politicians that have never tried something new.
second, birds navigate by landmarks like rivers, coastlines, and mountains.
Because of this, I prefer studying concepts and ideas more than learning facts.
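
The printed results above suggest that `batch_correct` returns one inner list per batch: with `batch_size=1` each of the four inner lists holds one string, while with `batch_size=4` the first inner list holds all four. Below is a minimal sketch, assuming that nested shape, of a hypothetical helper `flatten_batches` (not part of SAGE) that turns either result into a flat list of strings. The sample lists are stand-ins, so no model is needed to run it.

```python
from itertools import chain
from typing import List


def flatten_batches(batch_result: List[List[str]]) -> List[str]:
    """Concatenate per-batch lists of corrections into one flat list.

    Assumption: `batch_result` is a list of lists of strings, as the
    printed outputs of `batch_correct` in this notebook suggest.
    """
    return list(chain.from_iterable(batch_result))


# Stand-ins for the two result shapes shown above (no model required).
result_bs1 = [["text a"], ["text b"], ["text c"], ["text d"]]  # batch_size=1
result_bs4 = [["text a", "text b", "text c", "text d"]]        # batch_size=4

# Both shapes flatten to the same list of four strings.
flat = flatten_batches(result_bs4)
print(flat)
```

A flat list like this is convenient when the corrections need to be lined up one-to-one with the original `samples`, regardless of the `batch_size` used.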


In [None]:
# Experiment with different `**generation_params`

batch_result = corrector.batch_correct(
    samples, batch_size=1, prefix="grammar: ", num_return_sequences=2, do_sample=True, top_k=50, top_p=0.95)

In [25]:
for elem in batch_result:
    print(*elem, sep="\n")
    print()

So I think we would not be alive if our ancestors did not develop sciences and technologies.
So I think we would not be alive if our ancestors did not develop sciences and technologies.

There are very successful politicians that have never tried something new.
There are very successful politicians that have never tried something new.

second, birds navigate by landmarks like rivers, coastlines, and mountains.
second, birds navigate by landmarks like rivers, coastlines, and mountains.

Because of this, I prefer studying concepts and ideas more than learning facts.
Because of this, I prefer studying concepts and ideas more than learning facts.



## Validation on JFLEG

You can run the evaluation on any dataset that is available
either on the HF hub or locally.

A local dataset must be properly formatted:
two text files, `sources.txt` and `corrections.txt`, in one folder. If you prefer
a single file, use `data.csv` with two columns, `source` and `correction`.
Alternatively, pass the exact name of a dataset on the HF hub.
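
The two local layouts described above can be prepared with the standard library alone. This is a minimal sketch: the toy sentences, the folder names, and the one-sample-per-line layout of the text files are assumptions made for illustration, not something the SAGE documentation confirms here.

```python
import csv
import tempfile
from pathlib import Path

# Toy parallel data: each source pairs with the correction at the same index.
sources = ["I has a dog.", "She go to school."]
corrections = ["I have a dog.", "She goes to school."]

# Hypothetical output folders under a temporary directory.
root = Path(tempfile.mkdtemp())
txt_dir = root / "my_dataset_txt"
csv_dir = root / "my_dataset_csv"
txt_dir.mkdir()
csv_dir.mkdir()

# Layout 1: two aligned text files (assumed: one sample per line).
(txt_dir / "sources.txt").write_text("\n".join(sources) + "\n", encoding="utf-8")
(txt_dir / "corrections.txt").write_text("\n".join(corrections) + "\n", encoding="utf-8")

# Layout 2: a single CSV with `source` and `correction` columns.
with open(csv_dir / "data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["source", "correction"])
    writer.writerows(zip(sources, corrections))

print(sorted(p.name for p in txt_dir.iterdir()))
print(sorted(p.name for p in csv_dir.iterdir()))
```

Either folder path could then be passed to `evaluate()` in place of the JFLEG example path used below.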

In [None]:
# Make sure to stay inside the sage directory, or change the path to the validation data

metrics = corrector.evaluate(
    os.path.join(os.getcwd(), "data", "example_data", "jfleg"), batch_size=32, prefix="grammar: ")

In [32]:
print(metrics)

{'Precision': 83.43, 'Recall': 84.25, 'F1': 83.84}
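
As a quick sanity check on the numbers above, the reported F1 is consistent with the usual harmonic mean of precision and recall. The sketch below assumes that standard definition. How SAGE counts matches when computing precision and recall is not covered here, and `f1_score` is a hypothetical helper, not a SAGE function.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the JFLEG output above.
reported = {"Precision": 83.43, "Recall": 84.25, "F1": 83.84}

recomputed = round(f1_score(reported["Precision"], reported["Recall"]), 2)
print(recomputed)
```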


# Quick tour [Russian]

In [2]:
# For Russian we have a wider range of available models.
# P.S. The ent5_large model is for English, of course.

print(*["{}: {}".format(item.name, item.value) for item in AvailableCorrectors], sep="\n")

m2m100_1B: ai-forever/RuM2M100-1.2B
m2m100_418M: ai-forever/RuM2M100-418M
fred_large: ai-forever/FRED-T5-large-spell
ent5_large: ai-forever/T5-large-spell


In [2]:
# Load corrector

# NOTE: loading all three models may exceed the RAM available in the free version of Colab.
# If that is the case, comment out one or two models, along with their corresponding outputs and samples.

m2m_1b_corrector = RuM2M100ModelForSpellingCorrection.from_pretrained(AvailableCorrectors.m2m100_1B.value)
m2m_418m_corrector = RuM2M100ModelForSpellingCorrection.from_pretrained(AvailableCorrectors.m2m100_418M.value)
fred_corrector = T5ModelForSpellingCorruption.from_pretrained(AvailableCorrectors.fred_large.value)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [41]:
# Make up some misspelled sentences

samples = [
    "прийдя в МГТУ я был удивлен никого необноружив там…",
    "Нащщот Чавеса разве что не соглашусь.",
    "Мошный лазер - в нерабочем состоянии - 350 кредиток.",
    "Ощушаю себя с ними монголойдом, я никогда так много не молчала как молчю тут, и не потому, что языковый баръер или еще что-то, просто коментариев нет"
]

In [None]:
result_1b = m2m_1b_corrector.correct(samples[0])
result_418m = m2m_418m_corrector.correct(samples[0])
result_fred = fred_corrector.correct(samples[0], prefix="Исправь: ")

In [46]:
print("m2m1b")
print(result_1b[0])
print()

print("m2m418m")
print(result_418m[0])
print()

print("fred")
print(result_fred[0])

m2m1b
прийдя в МГТУ я был удивлен никого не обнаружив там...

m2m418m
Прийдя в МГТУ, я был удивлен, никого не обнаружив там...

fred
прийдя в МГТУ я был удивлен никого не обнаружив там.. «при


In [None]:
result_1b = m2m_1b_corrector.batch_correct(samples, batch_size=1)
result_418m = m2m_418m_corrector.batch_correct(samples, batch_size=1)
result_fred = fred_corrector.batch_correct(samples, prefix="Исправь: ", batch_size=1)

In [50]:
print("m2m1b")
print(*result_1b, sep="\n")
print()

print("m2m418m")
print(*result_418m, sep="\n")
print()

print("fred")
print(*result_fred, sep="\n")

m2m1b
['прийдя в МГТУ я был удивлен никого не обнаружив там...']
['Насчет Чавеса разве что не соглашусь.']
['Мощный лазер - в нерабочем состоянии - 350 кредиток.']
['Ощущаю себя с ними монголойдом, я никогда так много не молчала как молчу тут, и не потому, что языковый барьер или еще что-то, просто комментариев нет']

m2m418m
['Прийдя в МГТУ, я был удивлен, никого не обнаружив там...']
['Нащ от Чавеса. Разве что не соглашусь...']
['Мощный лазер - в нерабочем состоянии - 350 кредиток.']
['Ощушаю себя с ними монголойдом. Я никогда так много не молчала, как молчаю тут. И не потому, что языковый баръер или еще что-то, просто комментариев нет.']

fred
['прийдя в МГТУ я был удивлен никого не обнаружив там.. «при']
['На счет Чавеса разве что не соглашусь. На счет']
['Мощный лазер - в нерабочем состоянии - 350 кредиток']
['Ощущаю себя с ними монголойдом, я никогда так много не молчала как молчу тут, и не потому, что языковый барьер или еще что-то, просто коментариев нет, просто ком']


In [None]:
result_1b = m2m_1b_corrector.batch_correct(
    samples, batch_size=1, num_return_sequences=2, do_sample=True, top_k=50, top_p=0.95)
result_418m = m2m_418m_corrector.batch_correct(
    samples, batch_size=1, num_return_sequences=2, do_sample=True, top_k=50, top_p=0.95)
result_fred = fred_corrector.batch_correct(
    samples, batch_size=1, prefix="Исправь: ", num_return_sequences=2, do_sample=True, top_k=50, top_p=0.95)

In [61]:
sep = "\n------------------------------------------------------------------------------------------------------------\n"
result_1b = [elem[0] + "\n" + elem[1] for elem in result_1b] 
result_418m = [elem[0] + "\n" + elem[1] for elem in result_418m] 
result_fred = [elem[0] + "\n" + elem[1] for elem in result_fred] 

In [63]:
print("m2m1b")
print(*result_1b, sep=sep)
print()

print("m2m418m")
print(*result_418m, sep=sep)
print()

print("fred")
print(*result_fred, sep=sep)
print()

m2m1b
прийдя в МГТУ я был удивлен никого не обнаружив там...
прийдя в МГТУ я был удивлен никого не обнаружив там...
------------------------------------------------------------------------------------------------------------
Насчет Чавеса разве что не соглашусь.
Насчет Чавеса разве что не соглашусь.
------------------------------------------------------------------------------------------------------------
Мощный лазер - в нерабочем состоянии - 350 кредиток.
Мощный лазер - в нерабочем состоянии - 350 кредиток.
------------------------------------------------------------------------------------------------------------
Ощущаю себя с ними монголойдом, я никогда так много не молчала как молчу тут, и не потому, что языковый барьер или еще что-то, просто комментариев нет
Ощущаю себя с ними монголойдом, я никогда так много не молчала как молчу тут, и не потому, что языковый барьер или еще что-то, просто комментариев нет

m2m418m
Прийдя в МГТУ, я был удивлен, никого не обнаружив там...
Прийдя 

## Validation

In [2]:
# Load available datasets

from sage.utils import DatasetsAvailable

In [3]:
# Datasets available on the HF hub

print(*["{}: {}".format(item.name, item.value) for item in DatasetsAvailable], sep="\n")

MultidomainGold: Multidomain gold dataset. For more see `ai-forever/spellcheck_benchmark`.
RUSpellRU: Social media texts and blogs. For more see `ai-forever/spellcheck_benchmark`.
MedSpellchecker: Medical anamnesis. For more see `ai-forever/spellcheck_benchmark`.
GitHubTypoCorpusRu: Github commits. For more see `ai-forever/spellcheck_benchmark`.


### M2M100-1.2B

In [64]:
# Place model on device

m2m_1b_corrector.model.to(torch.device("cuda:0"));

In [None]:
metrics = m2m_1b_corrector.evaluate("RUSpellRU", batch_size=32)

In [69]:
print("m2m1b RUSpellRU:")
print(metrics)

m2m1b RUSpellRU:
{'Precision': 59.44, 'Recall': 43.32, 'F1': 50.12}


### M2M100-418M

In [4]:
# Place model on device

m2m_418m_corrector.model.to(torch.device("cuda:0"));

In [None]:
metrics = m2m_418m_corrector.evaluate("MultidomainGold", batch_size=16)

In [6]:
print("m2m418m MultidomainGold:")
print(metrics)

m2m418m MultidomainGold:
{'Precision': 32.82, 'Recall': 57.69, 'F1': 41.84}


### FRED-T5-large

In [3]:
# Place model on device

fred_corrector.model.to(torch.device("cuda:0"));

In [None]:
metrics = fred_corrector.evaluate("GitHubTypoCorpusRu", prefix="Исправь: ", batch_size=1)

In [6]:
print("fred GitHubTypoCorpusRu:")
print(metrics)

fred GitHubTypoCorpusRu:
{'Precision': 52.73, 'Recall': 41.75, 'F1': 46.6}
