>In this notebook, we try three different spelling correctors on a large text in order to choose the best one. The ideal corrector 1. corrects all spelling errors, 2. runs on a list of sentences, and 3. has the best runtime (if possible).

In [1]:
text = "Malaria is a infectious disease caused by parasites that are transmitted to peoples throu female Anopheles mosquitoes. Each year over half a billion\
 people will become infected with malaria, with rughly 80% of them living in Sub-Saharan africa. Nearly half a million people die of malaria every year, most of them young children\
 under the age of five. Unlike many other infectious diseases, the death toll for malaria is rising. Wile there have been many programs designed to improove acces to malaria treatment, the best way to reducing the impact of malaria in Sub-Saharan Africa is to focus on reducing the number of people who contract the diseese in the first place rather than waiting to treat\
 the disease after the person has already infected. There are multiple drogs available to treet malaria, and many of them work well and save lives, but malaria eradication programs that \
 focus too much on them and not enough on prevention haven't seen long-term success in Sub-Saharan Africa. A major program to combat malaria was WHO's global malaria eradication \
 programme. Started in 1955, it had a goal of eliminating malaria in africa within the next ten years. Based upon previously succesful programs in Brezil and the United State, \
 the program focused mainly on vector controle. This included widely distributing chloroquine and spraying large amounts of DDT. More than one billions dollar was spent trying to \
 abolish Malaria. However, the program suffered from many problems and in 1969, WHO was forced to admit that the program had not succeeded in eradicating Malaria. The number of people\
 in Sub-Saharan Africa who contracted malaria as well as the number of malaria deaths had actually increased over 10% during the time the program was active. One of the major reasons for the failure of the project was that it set uniform strategies and policies. By failing to consider variations between governments, geography, and \
infrastructure, the program was not nearly as succesfull as it could have been. Sub-Saharan Africa has neither the money nor the infrastructure too support such an elaborate program, \
and it couldn't be run the way it was ment to. Most African countries don't have the resources to send all their people to doctors and get shots, nor can they afford to clear wetlands\
 or other malaria prone areas. the continent's spending per person for eradicating malaria was just a quarter of what Brezil spent. Sub-Saharan Africa simply can't rely on a plan that \
 requires more money, infrastructure, and expertise than they have to spare. \
 Additionally, the widespread use of chloroquine has created drug resistant parasites which are now plaguing Sub-Saharan Africa. Because cloroquine was used widely but inconsistently, \
 mosquitoes developed resistance, and chloroquine is now nearly completely ineffective in Sub-Saharan Africa, with over 95% of mosquitoes resistant to it. As a result, newer, more \
 expensive drugs need to be used to prevent and treat malaria, which further drives up the coast of malaria treatment for a region that can ill afford it. Instead of developing plans to treat malaria after the infection has incurred, programs should focus on preventing infection from occurring in the first place. Not only is this plan \
cheaper and more effective, reducing the number of people who contract malaria also reduces loss of work/school days wich can further bring down the productivity of the region. One of the cheapest and most effective ways of preventing malaria is to implement insecticide-treated bed nets (ITNs). These nets provide a protective barrier around the person or \
people using them. While untreated bed nets are still helpful, those treated with insecticides are much more useful because they stop mosquitoes from biting people through the nets, \
and they help reduce mosquito populations in a community, thus helping people who don't even own bed nets. Bed nets are also very effective because most mosquito bites occur while \
the person is sleeping, so bed nets would be able to drastically reduce the number of transmissions during the night. In fact, transmission of malaria can be reduced by as much as 90% \
in areas where the use of ITNs is widespread. Because money is so scarce in Sub-Saharan Africa, the low cost is a great benefit and a major reason why the program is so successful. \
Bed nets cost roughly 2 USD to make, last several years, and can protect too adults. Studies have shown that, for every 100-1000 more nets are being used, one less child dies of malaria. \
With an estimated 300 million people in Africa not being protected by mosquito nets, there's the potential to save three millions lives by spending just a few dollars per person. \
Reducing the number of people who contract malaria would also reduce poverty levels in Africa significantly, thus improving other aspects of society like education levels and the economy. \
Vector control is more effective than treatment strategies because it mean fewer people are getting sick. When fewer people get sick, the working population is stronger as a whole \
 because people are not put out of work from malaria, nor are they caring for sick relatives. Malaria-afflicted families can typically only harvest 40% of the crops that healthy families \
 can harvest. Additionally, a family with members who have malaria spends roughly a quarter of its income treatment, not including the loss of work they also must deal with due to the \
 illness. It's estimated that malaria costs Africa 12 billions USD in lost income every year. A strong working population creates a stronger economy, wich Sub-Saharan Africa is in \
 desperate need of? The question is how to do that. How to apply it to reality."

In [2]:
batches = text.split('. ')
len(batches)

41
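
>Note that `split('. ')` drops the period from every sentence but the last and does not split on `?` or `!`. The following is a minimal sketch of a splitter that keeps the terminator; `split_sentences` is a hypothetical helper, not part of this notebook, and it would wrongly split after abbreviations that end in a period.

```python
import re

def split_sentences(text):
    """Split on whitespace that follows '.', '?' or '!', keeping the terminator."""
    parts = re.split(r'(?<=[.?!])\s+', text.strip())
    # Drop empty strings that can appear when the text is empty
    return [part for part in parts if part]

sample = "Malaria is rising. Why is that? Nets help."
print(split_sentences(sample))
```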

# Neuspell

In [3]:
import neuspell
from neuspell import BertChecker

data folder is set to `/home/fatenghali/ml4ed/neuspell/neuspell/../data` script


In [4]:
# Initialize spelling check model
print('\nLoading spelling checker...\n')
checker = BertChecker()
checker.from_pretrained("./neuspell-subwordbert-probwordnoise/")


Loading spelling checker...

loading vocab from path:/home/fatenghali/ml4ed/neuspell/neuspell/../data/checkpoints/subwordbert-probwordnoise/vocab.pkl
initializing model


Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Number of parameters in the model: 185211810
Loading model params from checkpoint dir: /home/fatenghali/ml4ed/neuspell/neuspell/../data/checkpoints/subwordbert-probwordnoise


In [5]:
import time
start = time.time()
corrected_batches = checker.correct_strings(batches)
print(time.time() - start)

3.0736958980560303
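
>The same start/stop timing pattern is repeated for every corrector in this notebook. A small helper could wrap it, as sketched below. `timed` is a hypothetical name, and `time.perf_counter` is swapped in for `time.time` because it is intended for measuring short intervals.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; in the notebook this would be e.g. timed(checker.correct_strings, batches)
result, seconds = timed(sorted, [3, 1, 2])
print(result, f"{seconds:.4f} s")
```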


In [6]:
"""
Neuspell inserts spaces around punctuation and other special characters, so we remove
them manually here for a more accurate comparison with the original text.
"""
for i in range(len(corrected_batches)):
    sentence = corrected_batches[i]
    sentence = sentence.replace(' , ', ', ')\
                        .replace(' - ', '-')\
                        .replace(" ' ", "'")\
                        .replace(' ! ', '! ')\
                        .replace(' ? ', '? ')\
                        .replace(' % ', '% ')\
                        .replace(' @ ', '@ ')\
                        .replace(' ( ', ' (')\
                        .replace(' ) ', ') ')\
                        .replace(' .', '')
    corrected_batches[i] = sentence
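
>Because the same chain of `replace` calls is needed again when the results are written to a file, it could be factored into one function. The sketch below uses regular expressions and is an approximation rather than an exact equivalent of the chain: `detokenize` is a hypothetical helper, and it only removes a ` .` at the very end of the sentence.

```python
import re

def detokenize(sentence):
    """Remove the spaces a tokenizer inserted around punctuation."""
    # No space before closing punctuation: "Africa , with" -> "Africa, with"
    s = re.sub(r"\s+([,!?%)])", r"\1", sentence)
    # No space after an opening parenthesis: "( ITNs" -> "(ITNs"
    s = re.sub(r"\(\s+", "(", s)
    # No spaces around hyphens, apostrophes and '@': "Sub - Saharan" -> "Sub-Saharan"
    s = re.sub(r"\s+([-'@])\s+", r"\1", s)
    # Drop a trailing " ." because the sentences are stored without their final period
    return re.sub(r"\s+\.$", "", s)

sample = "Sub - Saharan Africa , with 80 % of them ( ITNs ) haven ' t ."
print(detokenize(sample))
```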

In [7]:
import math
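def uncommonWords(text, correct):
    """Return the words of `text` that do not appear anywhere in `correct`."""
    count = {}
    # Count each word of the original text
    for word in text.split():
        count[word] = count.get(word, 0) + 1

    # Subtract infinity for every word of the corrected text, so shared words can never stay positive
    for word in correct.split():
        count[word] = count.get(word, 0) - math.inf

    # Keep only the words that occur in the original text but not in the corrected one
    return [word for word in count if count[word] > 0]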
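
>To illustrate what this comparison returns, here is a minimal set-based sketch with a usage example. `uncommon_words` is a hypothetical variant written for this illustration. It is meant to give the same result as the function above: the words of the original sentence that are missing from the corrected one, in order of first appearance.

```python
def uncommon_words(original, corrected):
    """Return words of `original` that are absent from `corrected`, without duplicates."""
    corrected_words = set(corrected.split())
    seen = set()
    result = []
    for word in original.split():
        if word not in corrected_words and word not in seen:
            seen.add(word)
            result.append(word)
    return result

original = "transmitted to peoples throu female Anopheles mosquitoes"
corrected = "transmitted to peoples through female Anopheles mosquitoes"
print(uncommon_words(original, corrected))  # ['throu']
```

>Words are compared as whitespace-separated tokens, so a word with punctuation attached (for example `malaria,`) only matches if the corrected sentence contains the same token.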

In [33]:
corrected_batches

['Malaria is a infectious disease caused by parasites that are transmitted to peoples through female Anopheles mosquitoes',
 'Each year over half a billion people will become infected with malaria, with roughly 80% of them living in Sub-Saharan Africa',
 'Nearly half a million people die of malaria every year, most of them young children under the age of five',
 'Unlike many other infectious diseases, the death toll for malaria is rising',
 'While there have been many programs designed to improve access to malaria treatment, the best way to reducing the impact of malaria in Sub-Saharan Africa is to focus on reducing the number of people who contract the disease in the first place rather than waiting to treat the disease after the person has already infected',
 "There are multiple drugs available to treat malaria, and many of them work well and save lives, but malaria eradication programs that focus too much on them and not enough on prevention haven't seen long-term success in Sub-Saha

In [34]:
highlighted_text = ''
for (orig, corr) in zip(batches, corrected_batches):
    mistakes = set(uncommonWords(orig, corr))
    # Wrap whole words only, so a flagged word is not also highlighted inside longer words
    words = ['<span style="color: red">' + w + '</span>' if w in mistakes else w
             for w in orig.split()]
    highlighted_text += ' '.join(words) + '. '

print(highlighted_text)

Malaria is a infectious disease caused by parasites that are transmitted to peoples <span style="color: red">throu</span> female Anopheles mosquitoes. Each year over half a billion people will become infected with malaria, with <span style="color: red">rughly</span> 80% of them living in Sub-Saharan <span style="color: red">africa</span>. Nearly half a million people die of malaria every year, most of them young children under the age of five. Unlike many other infectious diseases, the death toll for malaria is rising. <span style="color: red">Wile</span> there have been many programs designed to <span style="color: red">improove</span> <span style="color: red">acces</span> to malaria treatment, the best way to reducing the impact of malaria in Sub-Saharan Africa is to focus on reducing the number of people who contract the <span style="color: red">diseese</span> in the first place rather than waiting to treat the disease after the person has already infected. There are multiple <span 

>**Neuspell does not correct punctuation.**

In [8]:
# Append the corrected sentences to a file. Mode 'a' appends, so re-running this cell adds the text again.
with open('neuspell_batches.txt', 'a') as file1:
    for sentence in corrected_batches:
        sentence = sentence.replace(' , ', ', ')\
                            .replace(' - ', '-')\
                            .replace(" ' ", "'")\
                            .replace(' ! ', '! ')\
                            .replace(' ? ', '? ')\
                            .replace(' % ', '% ')\
                            .replace(' @ ', '@ ')\
                            .replace(' ( ', ' (')\
                            .replace(' ) ', ') ')\
                            .replace(' .', '')
        # Restore the '. ' separator that split('. ') removed
        file1.write(sentence + '. ')

# Hugging Face pretrained models

**1**

In [9]:
from transformers import pipeline

fix_spelling = pipeline("text2text-generation",model="oliverguhr/spelling-correction-english-base")

In [10]:
import time
start = time.time()
correct = fix_spelling(batches, max_length=2048)
print(time.time() - start)

43.29895043373108


In [11]:
correct[0]

{'generated_text': 'Malaria is an infectious disease caused by parasites that are transmitted to peoples through female Anopheles mosquitoes.'}

In [12]:
# Append the corrected sentences to a file, separated by a space
with open('hf_1.txt', 'a') as file2:
    for sent in correct:
        file2.write(sent['generated_text'] + ' ')

>This model is good at correcting spelling and punctuation mistakes, but it takes too long to run (about 43 seconds for the 41 sentences).

**2**

In [13]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Bhuvana/t5-base-spellchecker")

model = AutoModelForSeq2SeqLM.from_pretrained("Bhuvana/t5-base-spellchecker")


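def correct(inputs):
    # Note: this function reuses the name `correct`, which held the pipeline output in cell In [10].
    input_ids = tokenizer.encode(inputs, return_tensors='pt')
    sample_output = model.generate(
        input_ids,
        do_sample=True,   # sampling makes the output non-deterministic between runs
        max_length=50,    # caps the generated sequence at 50 tokens, so longer sentences are cut off
        top_p=0.99,
        num_return_sequences=1
    )
    # Decode the single returned sequence back to text
    res = tokenizer.decode(sample_output[0], skip_special_tokens=True)
    return res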


In [14]:
import time
start = time.time()
corrected = []
for sentence in batches:
    corrected.append(correct(sentence))
print(time.time() - start)

47.51789426803589


In [18]:
# Append the corrected sentences to a file, separated by a space
with open('hf_2.txt', 'a') as file3:
    for sent in corrected:
        file3.write(sent + ' ')

>This model also takes too long to run (about 48 seconds for the 41 sentences), and it loses punctuation.

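**==> We stick with Neuspell, which corrected the 41 sentences in about 3 seconds, compared with about 43 and 48 seconds for the two Hugging Face models.**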
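
>The runtime gap can be made explicit with a little arithmetic on the timings printed above. This is a rough sketch only: each timing comes from a single run, and the second Hugging Face model was called once per sentence while the other two received the whole list at once. The dictionary labels are chosen for this illustration.

```python
# Wall-clock seconds copied from the timing cells in this notebook
runtimes = {
    "Neuspell (BertChecker)": 3.0736958980560303,
    "oliverguhr/spelling-correction-english-base": 43.29895043373108,
    "Bhuvana/t5-base-spellchecker": 47.51789426803589,
}

fastest = min(runtimes, key=runtimes.get)
# How many times slower each corrector is than the fastest one
slowdowns = {name: seconds / runtimes[fastest] for name, seconds in runtimes.items()}

for name, seconds in sorted(runtimes.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.1f} s ({slowdowns[name]:.1f}x the fastest)")
```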