<a href="https://colab.research.google.com/github/fberanizo/spelling-correction/blob/master/Correcao_T5_NoTuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spelling error detection and correction with the T5 model

**Name: Fabio Beranizo Lopes**<br>
**Name: Luiz Pita Almeida**

We use the pre-trained T5 model and the ParaCrawl English-Portuguese dataset.<br>
Sentences are truncated to 100 characters to make the tests faster.

Evaluation metric: F0.5-score<br>
https://www.cl.cam.ac.uk/research/nl/bea2019st/#eval

The correction method applied was suggested by the instructors:<br>
> given a sentence and a word in that sentence that may or may not need
> correcting, we mask the word, run BERT or T5, and predict the top-10
> alternative words using masked language modeling. If the original word is
> among the top predictions, no correction is suggested. Otherwise, edit
> distance is used to find the closest word, which is suggested to the user.

Steps:

1. Generate tuples: `(original, corrected)`
2. Apply the T5 model to predict the top-10 words.<br>
   If the original word is in the model's top 10, it is classified as correct.<br>
   Otherwise, the word is classified as incorrect.

**Note: the notebooks contain code excerpts from classmates.**<br>
**Thank you Diedre, Gabriela, Leard, Lucas and Israel.**
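
The decision rule quoted above can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation: `levenshtein` and `choose_correction` are hypothetical helper names, the candidate list stands in for the model's top-10 predictions, and plain Levenshtein distance is used here where the notebook relies on pyxDamerauLevenshtein.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def choose_correction(word: str, candidates: List[str]) -> str:
    """Keeps the word if it was predicted, else returns the closest candidate."""
    if not candidates or word in candidates:
        return word
    return min(candidates, key=lambda candidate: levenshtein(word, candidate))


# Hypothetical candidates standing in for the model's top predictions.
candidates = ["sei", "sou", "vi"]
print(choose_correction("cei", candidates))
print(choose_correction("sei", candidates))
```

A word found among the candidates is returned unchanged, which corresponds to the "classified as correct" branch in the steps above.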


In [1]:
import torch

print(f"Current GPU: {torch.cuda.get_device_name(0)}")

# don't even start if it's not a P100 GPU
# if torch.cuda.get_device_name(0) != "Tesla P100-PCIE-16GB":
#     import os
#     os.kill(os.getpid(), 9)

Current GPU: Tesla P100-PCIE-16GB


In [2]:
#@title General settings
experiment_name = "no-tuning"  #@param {type:"string"}
model_name = "t5-small"  #@param ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"] {type:"string"}
batch_size = 10  #@param {type:"integer"}
accumulate_grad_batches = 1  #@param {type:"integer"}
sequence_length = 100  #@param {type:"integer"}
learning_rate = 5e-3  #@param {type:"number"}
decode_mode = "topk"  #@param ["greedy", "nucleus", "topk", "beam"] {type:"string"}
k = 10  #@param {type:"integer"}

## Install dependencies

- PyTorch Lightning
- Hugging Face Transformers
- ERRANT (ERRor ANnotation Toolkit)
- pyxDamerauLevenshtein

In [3]:
!git clone --quiet https://github.com/fberanizo/Adversarial-Misspellings.git

try:
    import pytorch_lightning
    import transformers
except ImportError as e:
    # can't import modules, then install
    !pip install --quiet pytorch-lightning
    !pip install --quiet transformers
    !pip install --quiet errant==2.0.0
    !pip install pyxDamerauLevenshtein
    !python -m spacy download en
    # kill kernel (necessary for tqdm)
    import os
    os.kill(os.getpid(), 9)

fatal: destination path 'Adversarial-Misspellings' already exists and is not an empty directory.


In [4]:
# Import all packages at once to avoid duplicate imports throughout the notebook.
import datetime
import errant
import gzip
import json
import logging
import numpy as np
import nvidia_smi
import os
import pandas as pd
import joblib
import psutil
import pytorch_lightning as pl
import random
import spacy
import sys
import tarfile
import tempfile
import torch
import torch.nn.functional as F

from argparse import Namespace
from collections import deque
from google.colab import drive
from itertools import cycle

from pyxdameraulevenshtein import damerau_levenshtein_distance, \
    normalized_damerau_levenshtein_distance
from pyxdameraulevenshtein import damerau_levenshtein_distance_ndarray, \
    normalized_damerau_levenshtein_distance_ndarray

from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger
from pytorch_lightning import Trainer


from tqdm import tqdm
from transformers import T5ForConditionalGeneration
from transformers import T5Tokenizer
from torch.optim import Adam
from torch.utils.data import DataLoader
from torch.utils.data import Dataset

from typing import Dict
from typing import List
from typing import Tuple

# Leard decoding solution
import html
import unicodedata

nlp = spacy.load("en")
annotator = errant.load("en", nlp)

import nltk
nltk.download("stopwords")

sys.path.insert(0, "/content/Adversarial-Misspellings")
import attacks

nvidia_smi.nvmlInit()
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)

def hardware_stats():
    """
    Returns a dict containing some hardware related stats
    """
    res = nvidia_smi.nvmlDeviceGetUtilizationRates(handle)
    return {"cpu": f"{str(psutil.cpu_percent())}%",
            "mem": f"{str(psutil.virtual_memory().percent)}%",
            "gpu": f"{str(res.gpu)}%",
            "gpu_mem": f"{str(res.memory)}%"}

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Set random seeds

Important: fix the seeds so that results can be replicated.

In [5]:
import random

seed = 0
random.seed(seed)
torch.random.manual_seed(seed)
torch.cuda.manual_seed(seed)

## Mount Google Drive

We save the checkpoints (model weights) to Google Drive so that training can resume from where it stopped.

In [6]:
# drive.mount("/content/drive")
# base_path = "/content/drive/My Drive/PF-Correcao/t5-no-tuning"
base_path = "/content/t5-no-tuning"
os.environ["BASE_PATH"] = base_path

## ERRANT Scorer

Evaluation command that compares a "hypothesis" M2 file against a "reference" M2 file.<br>

### **Example**
**Original**: This are gramamtical sentence .<br>
**Corrected**: This is a grammatical sentence .<br>
**Output M2**:<br>
S This are gramamtical sentence .<br>
A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0<br>
A 2 2|||M:DET|||a|||REQUIRED|||-NONE-|||0<br>
A 2 3|||R:SPELL|||grammatical|||REQUIRED|||-NONE-|||0<br>
A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||1<br>

In M2 format, a line preceded by S denotes an original sentence while a line preceded by A indicates an edit annotation. Each edit line consists of the start and end token offset of the edit, the error type, and the tokenized correction string. The next two fields are included for historical reasons (see the CoNLL-2014 shared task) while the last field is the annotator id.
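
To make the field layout concrete, the sketch below splits one edit line into named fields. `parse_m2_edit` is a hypothetical helper written for this illustration and is not part of ERRANT; it assumes the six `|||`-separated fields shown in the example above.

```python
from typing import Dict, Union


def parse_m2_edit(line: str) -> Dict[str, Union[int, str]]:
    """Splits one M2 annotation line (starting with 'A') into named fields."""
    if not line.startswith("A "):
        raise ValueError("not an M2 edit line")
    fields = line[2:].split("|||")
    start, end = fields[0].split()
    return {
        "start": int(start),           # start token offset
        "end": int(end),               # end token offset
        "error_type": fields[1],       # e.g. R:SPELL
        "correction": fields[2],       # tokenized correction string
        "annotator": int(fields[5]),   # annotator id (last field)
    }


edit = parse_m2_edit("A 2 3|||R:SPELL|||grammatical|||REQUIRED|||-NONE-|||0")
print(edit)
```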

In [7]:
%%writefile orig.txt
Eu não cei pra ondi vou .
Podi até num dá em nada .
Minha vida segui o sol .
No horizonti dessa istrada .

Overwriting orig.txt


In [8]:
%%writefile ref.txt
Eu não sei pra onde vou .
Pode até não dar em nada .
Minha vida segue o sol .
No horizonte dessa estrada .

Overwriting ref.txt


In [9]:
%%writefile hyp.txt
Eu não sei pra ondi vou .
Podi até não dar em nada .
Minha vida segui u sol .
Num horizonte dessa estrada .

Overwriting hyp.txt


In [10]:
!errant_parallel -orig orig.txt -cor ref.txt -out ref.m2 > /dev/null
!errant_parallel -orig orig.txt -cor hyp.txt -out hyp.m2 > /dev/null

In [11]:
import pandas as pd
!errant_compare -hyp hyp.m2 -ref ref.m2
# x = !errant_compare -hyp hyp.m2 -ref ref.m2
# df = pd.DataFrame(data=x[2:4])[0].str.split('\t', expand=True)
# new_header = df.iloc[0] #grab the first row for the header
# df = df[1:] #take the data less the header row
# df.columns = new_header
# df
# df["F0.5"][1]
# d = df.apply(pd.to_numeric).to_dict('r')



TP	FP	FN	Prec	Rec	F0.5
5	2	3	0.7143	0.625	0.6944
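
The F0.5 value in the table can be reproduced from the TP, FP and FN counts. The function below is a small sketch of the standard F-beta formula; `f_beta` is a hypothetical name, and returning 0.0 when there are no counts is an assumption that may differ from what ERRANT does in that case.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-beta score from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0  # assumption: empty counts score zero
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


print(round(f_beta(tp=5, fp=2, fn=3), 4))  # 0.6944
```

With beta = 0.5, precision weighs more than recall, so a corrector is penalized more for proposing a wrong edit than for missing one.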



## Generator

From:
https://github.com/huggingface/transformers/issues/3985

In [12]:
# tokenizer = T5Tokenizer.from_pretrained(model_name)
# model = T5ForConditionalGeneration.from_pretrained(model_name)
# # Input text
# original = 'This are gramamtical sentence .'
# correct = 'This is a grammatical sentence .'

# text = 'This <extra_id_0> sentence. </s>'

# encoded = tokenizer.encode_plus(text, add_special_tokens=True, return_tensors='pt')
# input_ids = encoded['input_ids']

# # Generating 20 sequences with maximum length set to 5
# outputs = model.generate(input_ids=input_ids, 
#                           num_beams=200, num_return_sequences=20,
#                           max_length=5)

# _0_index = text.index('<extra_id_0>')
# _result_prefix = text[:_0_index]
# _result_suffix = text[_0_index+12:]  # 12 is the length of <extra_id_0>

# def _filter(output, end_token='<extra_id_1>'):
#     # The first token is <pad> (index 0, the decoder start token) and the second token is <extra_id_0> (index 32099)
#     _txt = tokenizer.decode(output[2:], skip_special_tokens=False, clean_up_tokenization_spaces=False)
#     if end_token in _txt:
#         _end_token_index = _txt.index(end_token)
#         return _result_prefix + _txt[:_end_token_index] + _result_suffix
#     else:
#         return _result_prefix + _txt + _result_suffix

# results = list(map(_filter, outputs))
# results
# del tokenizer
# del model
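
The sentinel handling in the cell above can be isolated as two pure-string helpers. This is a sketch under assumptions: `mask_word` and `extract_span` are hypothetical names, and the decoded string passed to `extract_span` is a made-up example of what the decoder might return, not an actual model output.

```python
from typing import List

SENTINEL_START = "<extra_id_0>"
SENTINEL_END = "<extra_id_1>"


def mask_word(words: List[str], idx: int) -> str:
    """Replaces the idx-th word with the first T5 sentinel token."""
    return " ".join(words[:idx] + [SENTINEL_START] + words[idx + 1:])


def extract_span(decoded: str) -> str:
    """Returns the text generated for the first sentinel.

    Assumes `decoded` is the decoder output as text, optionally starting
    with the first sentinel and ending at the second one.
    """
    if decoded.startswith(SENTINEL_START):
        decoded = decoded[len(SENTINEL_START):]
    return decoded.split(SENTINEL_END)[0].strip()


words = "Eu não cei pra ondi vou .".split()
print(mask_word(words, 2))
print(extract_span("<extra_id_0> sei <extra_id_1>"))
```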

## Dataset class
Data management, plus a small test.
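
The truncation used by `load_text_pairs` in the next cell cuts each sentence to `seq_len` characters and then drops the last word, which may have been split. A minimal sketch of that expression, with a hypothetical function name and a made-up input sentence:

```python
def truncate_at_word(text: str, max_chars: int) -> str:
    """Cuts text to max_chars characters, then drops the last (possibly partial) word."""
    return text[:max_chars].rsplit(" ", 1)[0]


print(truncate_at_word("Uma vaga que não faz sentido", 22))
```

Note that a sentence shorter than the limit also loses its final word, and a string without spaces is returned unchanged.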

In [13]:
hparams = {"model_name": model_name, "seq_len": sequence_length, "batch_size": batch_size}
class ParaCrawl(Dataset):
    """
    Loads data from preprocessed file and manages them.
    """
    VALID_MODES = ["train", "validation", "test"]
    TOKENIZER = T5Tokenizer.from_pretrained(hparams["model_name"],
                                            cache_dir=base_path)
    def __init__(self, mode: str, seq_len: int):
        """
        mode: One of train, validation or test 
        seq_len: limit to returned encoded tokens
        """
        super().__init__()
        assert mode in ParaCrawl.VALID_MODES

        self.mode = mode
        self.seq_len = seq_len

        file_name = os.path.join(base_path, f"{mode}.pkl")
        if not os.path.isfile(file_name):
            print("Pre-processed files not found, preparing data.")
            self.prepare_data()
        
        with open(file_name, "rb") as preprocessed_file:
            self.data = joblib.load(preprocessed_file)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i: int):
        """
        Unpacks line from data.

        returns: input (corrupted), target (corrected)
        """
        input, target = self.data[i]

        return input, target

    def get_dataloader(self, batch_size: int, shuffle: bool):
        return DataLoader(self, batch_size=batch_size, shuffle=shuffle, 
                          num_workers=4)

    @staticmethod
    def load_text_pairs(path):
        """
        Load pairs from original files, selects pt, then corrupts the samples.
        """
        text_pairs = []
        for line in tqdm(gzip.open(path, mode="rt")):
            text_pt = line.strip().split("\t")[1]
            text_pt = text_pt[:hparams["seq_len"]].rsplit(" ", 1)[0]
            try:
                attack_list = deque(attacks.all_one_attack(text_pt, include_ends=True))
                text_corrupt = list(map(lambda a: a[1], random.sample(attack_list, k=10)))
                text_pairs.extend(list(zip(text_corrupt, cycle([text_pt]))))
            except ValueError:
                pass

        return text_pairs

    @staticmethod
    def prepare_data(train_size=9997000, val_size=3000):
        """
        Performs everything needed to get the data ready.
        Addition of Eos token and encoding is performed in runtime.
        """
        if not os.path.isfile(os.path.join(base_path, "paracrawl_enpt_train.tsv.gz")):
            !wget -nc https://storage.googleapis.com/neuralresearcher_data/unicamp/ia376e_2020s1/paracrawl_enpt_train.tsv.gz -P "$BASE_PATH"
            !wget -nc https://storage.googleapis.com/neuralresearcher_data/unicamp/ia376e_2020s1/paracrawl_enpt_test.tsv.gz -P "$BASE_PATH"

        data = {}
        test_data = ParaCrawl.load_text_pairs(os.path.join(base_path, "paracrawl_enpt_test.tsv.gz"))
        train_val_data = ParaCrawl.load_text_pairs(os.path.join(base_path, "paracrawl_enpt_train.tsv.gz"))

        random.shuffle(train_val_data)

        train_data = train_val_data[:train_size]
        val_data = train_val_data[train_size:train_size + val_size]

        for mode, data in zip(ParaCrawl.VALID_MODES, [train_data, val_data, test_data]):
            file_name = os.path.join(base_path, f"{mode}.pkl")
            with open(file_name, "wb") as pkl_file:
                joblib.dump(data, pkl_file)
            print(f"Pre-processed data saved as {file_name}.")


datasets = {m: ParaCrawl(mode=m, seq_len=hparams["seq_len"]) for m in ParaCrawl.VALID_MODES}

# Testing datasets
for mode, dataset in datasets.items():
    print(f"\n{mode} dataset length: {len(dataset)}\n")
    print("Random sample")
    input, target = random.choice(dataset)
    print("input\n", input, end="\n\n")
    print("target\n", target, end="\n\n")

Pre-processed files not found, preparing data.
File ‘/content/t5-no-tuning/paracrawl_enpt_train.tsv.gz’ already there; not retrieving.

File ‘/content/t5-no-tuning/paracrawl_enpt_test.tsv.gz’ already there; not retrieving.



20000it [00:36, 552.64it/s]
1000000it [30:16, 550.47it/s]


Pre-processed data saved as /content/t5-no-tuning/train.pkl.
Pre-processed data saved as /content/t5-no-tuning/validation.pkl.
Pre-processed data saved as /content/t5-no-tuning/test.pkl.

train dataset length: 9997000

Random sample
input
 Será justas uma sociedade que não aceita as diferenças?, será o amor mais forte que as

target
 Será justa uma sociedade que não aceita as diferenças?, será o amor mais forte que as


validation dataset length: 2670

Random sample
input
 História da arte (1) Epistemologia e rmétodos

target
 História da arte (1) Epistemologia e métodos


test dataset length: 200000

Random sample
input
 Uma vaga que npão faz

target
 Uma vaga que não faz



## Dataloaders

Check that the dataloaders work correctly.

In [14]:
shuffle = {"train": True, "validation": False, "test": False}
debug_dataloaders = {mode: datasets[mode].get_dataloader(batch_size=hparams["batch_size"], 
                                                         shuffle=shuffle[mode])
                     for mode in ParaCrawl.VALID_MODES}

# Testing dataloaders
for mode, dataloader in debug_dataloaders.items():
    print("{} number of batches: {}".format(mode, len(dataloader)))
    batch = next(iter(dataloader))

train number of batches: 999700
validation number of batches: 267
test number of batches: 20000
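
The batch counts above follow from the dataset lengths and the batch size of 10, since a `DataLoader` keeps the last partial batch by default. A quick check, with `number_of_batches` as a hypothetical helper name:

```python
import math


def number_of_batches(num_samples: int, batch_size: int) -> int:
    """Batches produced by a DataLoader that keeps the last partial batch."""
    return math.ceil(num_samples / batch_size)


for mode, size in {"train": 9997000, "validation": 2670, "test": 200000}.items():
    print(mode, number_of_batches(size, batch_size=10))
```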


## Lightning Module

The main PyTorch Lightning class is defined here.


In [15]:
class T5Corrector(pl.LightningModule):
    def __init__(self, hparams):
        super().__init__()

        self.hparams = hparams
        self.t5 = T5ForConditionalGeneration.from_pretrained(hparams.model_name,
                                                             cache_dir=hparams.base_path)
        self.tokenizer = ParaCrawl.TOKENIZER
        self.start_token = ParaCrawl.TOKENIZER.convert_tokens_to_ids('<extra_id_0>')
        self.end_token = ParaCrawl.TOKENIZER.convert_tokens_to_ids('<extra_id_1>')

    def select_correction(self, word, hypotheses):
        """
        Selects the most probable correction for a given word and given hypotheses.
        """
        if word in hypotheses:
            # print('Only copy:', word)
            return word
        else:
            distances = damerau_levenshtein_distance_ndarray(word, np.array(hypotheses))
            # print(distances.shape)
            # print('distances', distances)
            idx_min_distance = np.argmin(distances)
            # print('idx_min_distance', idx_min_distance)
            # print('idx_min_distance distance', distances[idx_min_distance])
            # print('idx_min_distance', hypotheses[idx_min_distance])
            if distances[idx_min_distance] < 10:
                # print('Replaced:', word, 'by', hypotheses[idx_min_distance])
                return hypotheses[idx_min_distance]
        # print('No suggestion, word copied:', word)
        return word

    def generate(self, original, end_token="<extra_id_1>"):
        """
        Generates a correction hypothesis for a given sentence.
        """
        words = original.split()
        corrected_words = []
        # the n masked sentences did not fit in memory as a single batch,
        # so each word is masked and processed separately

        for idx, word in enumerate(words):
            # masks the idx-th word, keeping the rest of the sentence as plain text
            left = " ".join(words[:idx])
            right = " ".join(words[idx + 1:])
            input = f"{left} <extra_id_0> {right} {self.tokenizer.eos_token}".strip()
            input_ids = self.tokenizer.encode(input,
                                              max_length=self.hparams.seq_len,
                                              pad_to_max_length=True,
                                              add_special_tokens=True,
                                              return_tensors="pt").to("cuda")

            # samples k hypotheses for the masked span
            outputs = self.t5.generate(input_ids=input_ids,
                                       top_k=self.hparams.k,
                                       do_sample=True,
                                       num_return_sequences=self.hparams.k)
            # skips the decoder start token and <extra_id_0>, then keeps only
            # the text before end_token (or before EOS if no sentinel follows)
            hypotheses = []
            for output_ids in outputs:
                text = self.tokenizer.decode(output_ids[2:],
                                             skip_special_tokens=False,
                                             clean_up_tokenization_spaces=False)
                text = text.split(end_token)[0].split(self.tokenizer.eos_token)[0]
                hypotheses.append(text.strip())

            corrected_words.append(self.select_correction(word, hypotheses))

        return " ".join(corrected_words)

    def forward(self, x):
        inputs, targets = x

        if self.training:
            input_ids = []
            attention_mask = []
            lm_labels = []
            # for each sample in batch
            for input, target in zip(inputs, targets):
                input = f"{input} {self.tokenizer.eos_token}"
                d = self.tokenizer.encode_plus(
                    input,
                    max_length=self.hparams.seq_len,
                    pad_to_max_length=True,
                    add_special_tokens=True,
                    return_tensors="pt")
                input_ids.append(d["input_ids"])
                attention_mask.append(d["attention_mask"])
                lm_labels.append(self.tokenizer.encode(
                    target,
                    max_length=self.hparams.seq_len,
                    pad_to_max_length=True,
                    add_special_tokens=True,
                    return_tensors="pt"))

            input_ids = torch.stack(input_ids).squeeze(1).to("cuda")
            attention_mask = torch.stack(attention_mask).squeeze(1).to("cuda")
            lm_labels = torch.stack(lm_labels).squeeze(1).to("cuda")

            outputs = self.t5(input_ids=input_ids, 
                              attention_mask=attention_mask,
                              lm_labels=lm_labels)
            loss, predicted_scores = outputs[:2]
            return loss, predicted_scores, inputs, targets
        else:
            predicts = []
            # for each sample in batch
            for input, target in zip(inputs, targets):
                predicts.append(self.generate(input))
                # print("original", orig)
                # print("corrected", corr)
                # print("outputs", outputs[-1])
            return predicts, inputs, targets

    def training_step(self, batch, batch_idx):
        loss, predicted_scores, inputs, targets = self(batch)

        return {"loss": loss, "log": {"loss": loss}, "progress_bar": hardware_stats()}

    def validation_step(self, batch, batch_idx):
        predicts, inputs, targets = self(batch)

        with open("orig.txt", "w") as f:
            for input in inputs:
                input = input.replace("\n", "")
                f.write(f"{input}\n")

        with open("ref.txt", "w") as f:
            for target in targets:
                target = target.replace("\n", "")
                f.write(f"{target}\n")

        with open("hyp.txt", "w") as f:
            for pred in predicts:
                pred = pred.replace("\n", "")
                f.write(f"{pred}\n")

        !errant_parallel -orig orig.txt -cor ref.txt -out ref.m2 > /dev/null
        !errant_parallel -orig orig.txt -cor hyp.txt -out hyp.m2 > /dev/null
        x = !errant_compare -hyp hyp.m2 -ref ref.m2
        df = pd.DataFrame(data=x[2:4])[0].str.split("\t", expand=True)
        new_header = df.iloc[0] #grab the first row for the header
        df = df[1:].apply(pd.to_numeric) #take the data less the header row
        df.columns = new_header

        true_positive = df["TP"][1]
        false_positive = df["FP"][1]
        false_negative = df["FN"][1]
        precision = df["Prec"][1]
        recall = df["Rec"][1]
        f_score = df["F0.5"][1]

        progress_bar = hardware_stats()
        progress_bar.update({"precision": precision, "recall": recall, "f_score": f_score})
        # print("progress_bar", progress_bar)

        return {"true_positive": true_positive, "false_positive": false_positive, "false_negative": false_negative,
                "precision": precision, "recall": recall, "f_score": f_score,
                "predicts": predicts, "inputs": inputs, "targets": targets, "progress_bar": progress_bar}

    def test_step(self, batch, batch_idx):
        predicts, inputs, targets = self(batch)

        with open("orig.txt", "w") as f:
            for input in inputs:
                input = input.replace("\n", "")
                f.write(f"{input}\n")

        with open("ref.txt", "w") as f:
            for target in targets:
                target = target.replace("\n", "")
                f.write(f"{target}\n")

        with open("hyp.txt", "w") as f:
            for pred in predicts:
                pred = pred.replace("\n", "")
                f.write(f"{pred}\n")

        !errant_parallel -orig orig.txt -cor ref.txt -out ref.m2 > /dev/null
        !errant_parallel -orig orig.txt -cor hyp.txt -out hyp.m2 > /dev/null
        x = !errant_compare -hyp hyp.m2 -ref ref.m2
        df = pd.DataFrame(data=x[2:4])[0].str.split("\t", expand=True)
        new_header = df.iloc[0] #grab the first row for the header
        df = df[1:].apply(pd.to_numeric) #take the data less the header row
        df.columns = new_header

        true_positive = df["TP"][1]
        false_positive = df["FP"][1]
        false_negative = df["FN"][1]
        precision = df["Prec"][1]
        recall = df["Rec"][1]
        f_score = df["F0.5"][1]

        progress_bar = hardware_stats()
        progress_bar.update({"precision": precision, "recall": recall, "f_score": f_score})
        # print("progress_bar", progress_bar)

        return {"true_positive": true_positive, "false_positive": false_positive, "false_negative": false_negative,
                "precision": precision, "recall": recall, "f_score": f_score,
                "predicts": predicts, "inputs": inputs, "targets": targets, "progress_bar": progress_bar}

    def training_epoch_end(self, outputs):
        avg_loss = torch.stack([x["loss"] for x in outputs]).mean()

        return {"log": {"train_loss": avg_loss}} 

    def validation_epoch_end(self, outputs):
        avg_precision = sum([x["precision"] for x in outputs]) / len(outputs)
        avg_recall = sum([x["recall"] for x in outputs]) / len(outputs)
        avg_f_score = sum([x["f_score"] for x in outputs]) / len(outputs)

        tensorboard_logs = {"avg_precision": avg_precision,
                            "avg_recall": avg_recall,
                            "avg_f_score": avg_f_score}

        origs = sum([list(x["inputs"]) for x in outputs], [])
        trues = sum([list(x["targets"]) for x in outputs], [])
        preds = sum([list(x["predicts"]) for x in outputs], [])

        n = random.choice(range(len(trues)))
        print(f"\nInput: {origs[n]}\nTarget: {trues[n]}\nPrediction: {preds[n]}\n")

        return {"avg_precision": avg_precision, "avg_recall": avg_recall, "avg_f_score": avg_f_score,
                "log": tensorboard_logs, "progress_bar": tensorboard_logs}

    def test_epoch_end(self, outputs):
        avg_precision = sum([x["precision"] for x in outputs]) / len(outputs)
        avg_recall = sum([x["recall"] for x in outputs]) / len(outputs)
        avg_f_score = sum([x["f_score"] for x in outputs]) / len(outputs)

        tensorboard_logs = {"avg_precision": avg_precision,
                            "avg_recall": avg_recall,
                            "avg_f_score": avg_f_score}

        origs = sum([list(x["inputs"]) for x in outputs], [])
        trues = sum([list(x["targets"]) for x in outputs], [])
        preds = sum([list(x["predicts"]) for x in outputs], [])

        n = random.choice(range(len(trues)))
        print(f"\nInput: {origs[n]}\nTarget: {trues[n]}\nPrediction: {preds[n]}\n")
        
        return {"avg_precision": avg_precision, "avg_recall": avg_recall, "avg_f_score": avg_f_score,
                "log": tensorboard_logs, "progress_bar": tensorboard_logs}

    def configure_optimizers(self):
        return Adam(self.parameters(), lr=self.hparams.lr)    

    def train_dataloader(self):
        if self.hparams.overfit_pct > 0:
            logging.info("Disabling train shuffle due to overfit_pct.")
            shuffle = False
        else:
            shuffle = True
        dataset = ParaCrawl("train", seq_len=self.hparams.seq_len)
        return dataset.get_dataloader(batch_size=self.hparams.batch_size, shuffle=shuffle)

    def val_dataloader(self):
        dataset = ParaCrawl("validation", seq_len=self.hparams.seq_len)
        return dataset.get_dataloader(batch_size=self.hparams.batch_size, shuffle=False)

    def test_dataloader(self):
        dataset = ParaCrawl("test", seq_len=self.hparams.seq_len)
        return dataset.get_dataloader(batch_size=self.hparams.batch_size, shuffle=False)
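
The metric parsing that appears in both `validation_step` and `test_step` could be factored into one helper. The sketch below assumes the two lines passed in are the header row and the value row of the `errant_compare` output, as selected by `x[2:4]` in the class above; `parse_errant_scores` is a hypothetical name, and pandas is not needed for this step.

```python
from typing import Dict, List


def parse_errant_scores(lines: List[str]) -> Dict[str, float]:
    """Maps a tab-separated header line and value line to a dict of floats."""
    header, values = lines[0], lines[1]
    return {name: float(value)
            for name, value in zip(header.split("\t"), values.split("\t"))}


scores = parse_errant_scores(["TP\tFP\tFN\tPrec\tRec\tF0.5",
                              "5\t2\t3\t0.7143\t0.625\t0.6944"])
print(scores["F0.5"])
```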

## Setup

In [16]:
hparams = {"name": experiment_name, "base_path": base_path,
           "model_name": model_name, "seq_len": sequence_length,
           "decode_mode": decode_mode, "k": k,
           "lr": learning_rate, "batch_size": batch_size, "batch_accum": accumulate_grad_batches,
           "max_epochs": 3,
           "overfit_pct": 0, "debug": 0}


for key, parameter in hparams.items():
    print("{}: {}".format(key, parameter))

name: no-tuning
base_path: /content/t5-no-tuning
model_name: t5-small
seq_len: 100
decode_mode: topk
k: 10
lr: 0.005
batch_size: 10
batch_accum: 1
max_epochs: 3
overfit_pct: 0
debug: 0


In [17]:
# Instantiate model
model = T5Corrector(Namespace(**hparams))

# Folder/path management, for logs and checkpoints
tensorboard_path = os.path.join(base_path, "logs")
experiment_name = hparams["name"]
model_folder = os.path.join(tensorboard_path, experiment_name)
os.makedirs(model_folder, exist_ok=True)
ckpt_path = os.path.join(model_folder, "-{epoch}")

# Callback initialization
checkpoint_callback = ModelCheckpoint(prefix=experiment_name, 
                                      filepath=ckpt_path, 
                                      mode="max")
logger = TensorBoardLogger(tensorboard_path, experiment_name)

# PL Trainer initialization
trainer = Trainer(gpus=1,
                  checkpoint_callback=checkpoint_callback, 
                  early_stop_callback=False,
                  logger=logger,
                  accumulate_grad_batches=hparams["batch_accum"],
                  max_epochs=hparams["max_epochs"], 
                  fast_dev_run=bool(hparams["debug"]), 
                  overfit_pct=hparams["overfit_pct"],
                  progress_bar_refresh_rate=1)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]


## Tensorboard

In [18]:
%load_ext tensorboard
# %tensorboard --logdir "/content/drive/My Drive/PF-Correcao/t5-no-tuning"
%tensorboard --logdir "/content/t5-no-tuning"

Reusing TensorBoard on port 6006 (pid 1514), started 0:52:43 ago. (Use '!kill 1514' to kill it.)

## Test

In [None]:
trainer.test(model)
