# Directory Preparation Notebook

In this notebook we will prepare the directory for use.

## Data Preparation

Download the FR-EN translation model.

In [None]:
!wget https://pretrained-nmt-models.s3-us-west-2.amazonaws.com/uncorpus-fren-subword-transformer-model_step_200000.pt

In [None]:
!pip install --upgrade pip
!pip install ctranslate2
!pip install sentencepiece
!pip install nltk

Convert the OpenNMT-py model to CTranslate2, an inference engine specialized for Transformer models.

In [None]:
!ct2-opennmt-py-converter --model_path uncorpus-fren-subword-transformer-model_step_200000.pt --output_dir ende_ctranslate2
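
After the conversion, a quick sanity check can confirm that the output directory was actually written. The sketch below assumes that a converted CTranslate2 model directory contains a `model.bin` file; the helper name `looks_like_ct2_model` is hypothetical and is not part of CTranslate2.

```python
from pathlib import Path


def looks_like_ct2_model(model_dir):
    """Return True if model_dir contains a model.bin file.

    Assumption: the CTranslate2 converter writes the weights to `model.bin`.
    """
    return (Path(model_dir) / "model.bin").is_file()


# Prints False if the conversion cell has not been run yet.
print(looks_like_ct2_model("ende_ctranslate2"))
```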

Download the SentencePiece tokenization model that we will use.

In [None]:
!wget https://un-corpus.s3-us-west-2.amazonaws.com/un-subword-model.tar.gz
!mkdir -p sentence-piece-model
!tar xf un-subword-model.tar.gz -C sentence-piece-model
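
The extraction step can also be done in Python, which keeps the cell safe to re-run when the target directory already exists. This is a minimal sketch using only the standard library; `extract_archive` is a hypothetical helper name, and the archive is assumed to be already downloaded and trusted.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a tar archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    # The default read mode detects gzip compression automatically.
    with tarfile.open(archive_path) as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names


# Example call (requires the archive downloaded above):
# extract_archive("un-subword-model.tar.gz", "sentence-piece-model")
```

The returned names make it easy to confirm that the `source.model` and `target.model` files used in the test cell are present.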

## Translation Methods

In [None]:
import nltk
nltk.download('punkt')

In [None]:
import sentencepiece as spm
import ctranslate2
from nltk import sent_tokenize


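def tokenize(text, sp_source_model):
    """Encode a string (or list of strings) into SentencePiece subword tokens."""
    sp = spm.SentencePieceProcessor(model_file=sp_source_model)
    tokens = sp.encode(text, out_type=str)
    return tokens


def detokenize(text, sp_target_model):
    """Decode subword tokens (or a list of token lists) back into plain text."""
    sp = spm.SentencePieceProcessor(model_file=sp_target_model)
    translation = sp.decode(text)
    return translation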


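def translate(source, ct_model, sp_source_model, sp_target_model, device="cpu"):
    """Translate a text sentence by sentence and return the joined translation."""
    translator = ctranslate2.Translator(ct_model, device=device)
    # Split into sentences, then encode each sentence into subword tokens.
    source_sentences = sent_tokenize(source)
    source_tokenized = tokenize(source_sentences, sp_source_model)
    results = translator.translate_batch(source_tokenized, replace_unknowns=True)
    # Keep the best hypothesis per sentence. Assumes CTranslate2 >= 2.0, where
    # results expose `.hypotheses`; older releases used `result[0]["tokens"]`.
    translations = [result.hypotheses[0] for result in results]
    translations_detokenized = detokenize(translations, sp_target_model)
    translation = " ".join(translations_detokenized)

    return translation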
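
To make the data flow in `translate` easier to follow, the sketch below runs the same four steps (sentence splitting, subword encoding, batch translation, decoding) with toy stand-ins in place of NLTK, SentencePiece, and CTranslate2. Every function name here is hypothetical and exists only for illustration; the "translation" step merely uppercases tokens.

```python
import re


def toy_split(text):
    # Stand-in for sent_tokenize: split after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def toy_encode(sentences):
    # Stand-in for sp.encode(..., out_type=str): one token list per sentence.
    return [sentence.split() for sentence in sentences]


def toy_translate_batch(batch):
    # Stand-in for translator.translate_batch: one output list per input list.
    return [[token.upper() for token in tokens] for tokens in batch]


def toy_decode(batch):
    # Stand-in for sp.decode: join each token list back into a string.
    return [" ".join(tokens) for tokens in batch]


def pipeline(text):
    sentences = toy_split(text)                  # list of str
    tokenized = toy_encode(sentences)            # list of list of str
    translated = toy_translate_batch(tokenized)  # list of list of str
    detokenized = toy_decode(translated)         # list of str
    return " ".join(detokenized)                 # single str


print(pipeline("bonjour le monde. merci beaucoup."))
```

The point of the sketch is the shapes: the text becomes a list of sentences, then a batch of token lists, and is only joined back into one string at the end.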

### Testing the Translation

In [None]:
src = "Une grande partie de ces accidents se produisent à ces points noirs que certains États membres identifient et répertorient déjà."
src = src.lower()
model = "ende_ctranslate2"
sp_source_model = "sentence-piece-model/source.model"
sp_target_model = "sentence-piece-model/target.model"

translate(src, model, sp_source_model, sp_target_model)