# Imports

In [None]:
! git clone https://github.com/ai-forever/sage.git

In [None]:
cd sage

In [None]:
# change to pip install -e ".[errant]" in case of zsh

! pip install .
! pip install -e .[errant]

In [1]:
# Load functionality

import os
import torch
from sage.spelling_correction import T5ModelForSpellingCorruption, RuM2M100ModelForSpellingCorrection, AvailableCorrectors

# Quick tour [English]

In [None]:
# Load corrector

corrector = T5ModelForSpellingCorruption.from_pretrained(AvailableCorrectors.ent5_large.value)

In [7]:
# Place model on your favorite device

corrector.model.to(torch.device("cuda:0"));

## Generate correct texts

Simply, corrector's API have two methods that allow
you to generate correct text.

First, `correct()` method: use it when you have single sample.
You can also provide additional `prefix` argument if needed,
and `**generation_params` of your choice.

Apparent counterpart is `batch_correct` method.
As a name suggests, most useful when you've batch of texts to correct.
You may also provide `batch_size` parameter.

In [8]:
# Imagine you have bunch of texts with broken spelling.

samples = [
    "So I think we would not be live if our ancestors did not develop siences and tecnologies.",
    "There are very successful politicians that have never tried somthing new.",
    "second , birds navigate by landmarks like river , coastlines , and moutains.",
    "Because of this , I prefer studying concepts and ideas more thad learnig facts."
]

In [None]:
# Model's been trained with "grammar: " prefix.
# Don't forget to past `prefix` when calling corresponding methods.

result = corrector.correct(samples[0], prefix="grammar: ")

In [10]:
print(result[0])

So I think we would not be alive if our ancestors did not develop sciences and technologies.


In [None]:
batch_result = corrector.batch_correct(samples, batch_size=1, prefix="grammar: ")

In [12]:
print(*batch_result, sep="\n")

['So I think we would not be alive if our ancestors did not develop sciences and technologies.']
['There are very successful politicians that have never tried something new.']
['second, birds navigate by landmarks like rivers, coastlines, and mountains.']
['Because of this, I prefer studying concepts and ideas more than learning facts.']


In [None]:
# Try with bigger `batch_size`

batch_result = corrector.batch_correct(samples, batch_size=4, prefix="grammar: ")

In [14]:
print(*batch_result[0], sep="\n")

So I think we would not be alive if our ancestors did not develop sciences and technologies.
There are very successful politicians that have never tried something new.
second, birds navigate by landmarks like rivers, coastlines, and mountains.
Because of this, I prefer studying concepts and ideas more than learning facts.


In [None]:
# Experiment with different `**generation_params`

batch_result = corrector.batch_correct(
    samples, batch_size=1, prefix="grammar: ", num_return_sequences=2, do_sample=True, top_k=50, top_p=0.95)

In [16]:
for elem in batch_result:
    print(*elem, sep="\n")
    print()

So I think we would not be alive if our ancestors did not develop sciences and technologies.
So I think we would not be alive if our ancestors did not develop sciences and technologies.

There are very successful politicians that have never tried something new.
There are very successful politicians that have never tried something new.

second, birds navigate by landmarks like river, coastlines, and mountains.
second, birds navigate by landmarks like river, coastlines, and mountains.

Because of this, I prefer studying concepts and ideas more than learning facts.
Because of this, I prefer studying concepts and ideas more than learning facts.



## Validation on JFLEG

You can call the evaluation on any dataset that is available
either on HF hub or localy.

Remember, it should be properly formatted.
Two text files: `sources.txt` and `corrections.txt` in one folder. If you prefer
single file, you may want to use `data.csv` with two columns `source` and `correction`.
Or just write down the correct name of dataset on HF hub.

In [None]:
# Make sure to stay inside sage directory or change path to validation data

metrics = corrector.evaluate(
    os.path.join(os.getcwd(), "data", "example_data", "jfleg"), batch_size=32, prefix="grammar: ", metrics=["ruspelleval"])

In [18]:
print(metrics)

{'Precision': 83.39, 'Recall': 84.25, 'F1': 83.82}


# Quick tour [Russian]

In [19]:
# For Russian we have wider range of available models.
# P.S. ent5_large model corresponds to the English language, of course)

print(*["{}: {}".format(item.name, item.value) for item in AvailableCorrectors], sep="\n")

sage_fredt5_large: ai-forever/sage-fredt5-large
sage_fredt5_distilled_95m: ai-forever/sage-fredt5-distilled-95m
sage_m2m100_1B: ai-forever/sage-m2m100-1.2B
sage_mt5_large: ai-forever/sage-mt5-large
m2m100_1B: ai-forever/RuM2M100-1.2B
m2m100_418M: ai-forever/RuM2M100-418M
fred_large: ai-forever/FRED-T5-large-spell
ent5_large: ai-forever/T5-large-spell


In [None]:
# Load corrector

# NOTE: all three models may exceed the amount of RAM available in free version of Colab.
# If the case, comment out one or two models and make sure to comment corresponding outputs and samples.

sage_m2m100_corrector = RuM2M100ModelForSpellingCorrection.from_pretrained(AvailableCorrectors.sage_m2m100_1B.value)
sage_fredt5_large_corrector = T5ModelForSpellingCorruption.from_pretrained(AvailableCorrectors.sage_mt5_large.value)
sage_fredt5_95m_corrector = T5ModelForSpellingCorruption.from_pretrained(AvailableCorrectors.sage_fredt5_distilled_95m.value)

In [24]:
# Make up some spoiled sentences

samples = [
    "прийдя в МГТУ я был удивлен никого необноружив там…",
    "Нащщот Чавеса разве что не соглашусь.",
    "Мошный лазер - в нерабочем состоянии - 350 кредиток.",
    "Ощушаю себя с ними монголойдом, я никогда так много не молчала как молчю тут, и не потому, что языковый баръер или еще что-то, просто коментариев нет"
]

In [None]:
result_1b = sage_m2m100_corrector.correct(samples[0])
result_large = sage_fredt5_large_corrector.correct(samples[0])
result_95m = sage_fredt5_95m_corrector.correct(samples[0])

In [27]:
print("sage_m2m100")
print(result_1b[0])
print()

print("sage_fredt5_large")
print(result_large[0])
print()

print("sage_fredt5_95m")
print(result_95m[0])

sage_m2m100
придя в МГТУ я был удивлен никого не обнаружив там

sage_fredt5_large
Придя в МГТУ, я был удивлен, никого не обнаружив там...

sage_fredt5_95m
Придя в МГТУ, я был удивлён, никого не обнаружив там.


In [None]:
result_1b = sage_m2m100_corrector.batch_correct(samples, batch_size=1)
result_large = sage_fredt5_large_corrector.batch_correct(samples, batch_size=1)
result_95m = sage_fredt5_95m_corrector.batch_correct(samples, batch_size=1)

In [29]:
print("sage_m2m100")
print(*result_1b, sep="\n")
print()

print("sage_fredt5_large")
print(*result_large, sep="\n")
print()

print("sage_fredt5_95m")
print(*result_95m, sep="\n")

sage_m2m100
['придя в МГТУ я был удивлен никого не обнаружив там']
['Насчёт Чавеса разве что не соглашусь']
['Мощный лазер в нерабочем состоянии 350 кредиток']
['Ощущаю себя с ними монголойдом я никогда так много не молчала как молчу тут и не потому что языковый барьер или еще что-то просто комментариев нет']

sage_fredt5_large
['Придя в МГТУ, я был удивлен, никого не обнаружив там...']
['Наш щит Чавеса разве что не соглашусь.']
['Мощный лазер в нерабочем состоянии - 350 кредиток.']
['Ощущаю себя с ними монголоидом, я никогда так много не молчала, как молчу тут, и не потому, что языковый барьер или еще что-то, просто комментариев нет.']

sage_fredt5_95m
['Придя в МГТУ, я был удивлён, никого не обнаружив там.']
['Насчёт Чавеса разве что не соглашусь.']
['Мощный лазер в нерабочем состоянии - 350 кредиток.']
['Ощущаю себя с ними монголоидом: я никогда так много не молчала, как молчу тут, и не потому, что языковый барьер или ещё что-то, просто комментариев нет.']


In [None]:
result_1b = sage_m2m100_corrector.batch_correct(
    samples, batch_size=1, num_return_sequences=2, do_sample=True, top_k=50, top_p=0.95)
result_large = sage_fredt5_large_corrector.batch_correct(
    samples, batch_size=1, num_return_sequences=2, do_sample=True, top_k=50, top_p=0.95)
result_95m = sage_fredt5_95m_corrector.batch_correct(
    samples, batch_size=1, num_return_sequences=2, do_sample=True, top_k=50, top_p=0.95)

In [34]:
sep = "\n------------------------------------------------------------------------------------------------------------\n"
result_1b = [elem[0] + "\n" + elem[1] for elem in result_1b]
result_large = [elem[0] + "\n" + elem[1] for elem in result_large]
result_95m = [elem[0] + "\n" + elem[1] for elem in result_95m]

In [35]:
print("sage_m2m100")
print(*result_1b, sep=sep)
print()

print("sage_fredt5_large")
print(*result_large, sep=sep)
print()

print("sage_fredt5_95m")
print(*result_95m, sep=sep)
print()

sage_m2m100
придя в МГТУ я был удивлен никого не обнаружив там
придя в МГТУ я был удивлен никого не обнаружив там
------------------------------------------------------------------------------------------------------------
Насчёт Чавеса разве что не соглашусь
Насчёт Чавеса разве что не соглашусь
------------------------------------------------------------------------------------------------------------
Мощный лазер в нерабочем состоянии 350 кредиток
Мощный лазер в нерабочем состоянии 350 кредиток
------------------------------------------------------------------------------------------------------------
Ощущаю себя с ними монголойдом я никогда так много не молчала как молчу тут и не потому что языковый барьер или еще что-то просто комментариев нет
Ощущаю себя с ними монголойдом я никогда так много не молчала как молчу тут и не потому что языковый барьер или еще что-то просто комментариев нет

sage_fredt5_large
Придя в МГТУ, я был удивлен, никого не обнаружив там...
Придя в МГТУ я был у

## Validation

In [2]:
# Load available datasets

from sage.utils import DatasetsAvailable

In [3]:
# Available datasets at HF hub

print(*["{}: {}".format(item.name, item.value) for item in DatasetsAvailable], sep="\n")

MultidomainGold: Multidomain gold dataset. For more see `ai-forever/spellcheck_punctuation_benchmark`.
RUSpellRU: Social media texts and blogs. For more see `ai-forever/spellcheck_punctuation_benchmark`.
MedSpellchecker: Medical anamnesis. For more see `ai-forever/spellcheck_punctuation_benchmark`.
GitHubTypoCorpusRu: Github commits. For more see `ai-forever/spellcheck_punctuation_benchmark`.
MultidomainGold_orth: Multidomain gold dataset orthography only. For more see `ai-forever/spellcheck_benchmark`.
RUSpellRU_orth: Social media texts and blogs orthography only. For more see `ai-forever/spellcheck_benchmark`.
MedSpellchecker_orth: Medical anamnesis orthography only. For more see `ai-forever/spellcheck_benchmark`.
GitHubTypoCorpusRu_orth: Github commits orthography only. For more see `ai-forever/spellcheck_benchmark`.


### sage-fredt5-distilled-95m

In [5]:
# Place model on device

sage_fredt5_95m_corrector.model.to(torch.device("cuda:0"));

In [None]:
metrics = sage_fredt5_95m_corrector.evaluate("RUSpellRU", batch_size=32, metrics=["errant", "ruspelleval"])

In [7]:
print("sage_fredt5_95m_corrector RUSpellRU:")
print(metrics)

sage_fredt5_95m_corrector RUSpellRU:
{'CASE_Precision': 94.41, 'CASE_Recall': 92.55, 'CASE_F1': 93.47, 'SPELL_Precision': 77.52, 'SPELL_Recall': 64.09, 'SPELL_F1': 70.17, 'PUNCT_Precision': 86.77, 'PUNCT_Recall': 80.59, 'PUNCT_F1': 83.56, 'YO_Precision': 46.21, 'YO_Recall': 73.83, 'YO_F1': 56.84, 'Precision': 83.48, 'Recall': 74.75, 'F1': 78.87}
