# Imports

In [1]:
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

# Load models and tokenizers

In [2]:
tokenizers = {
    'eng': AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-uk"),
    'ukr': AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-uk-en")
}

In [3]:
models = {
    "eng": TFAutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-uk"),
    "ukr": TFAutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-uk-en")
}

2024-02-24 17:50:39.422557: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
All model checkpoint layers were used when initializing TFMarianMTModel.

All the layers of TFMarianMTModel were initialized from the model checkpoint at Helsinki-NLP/opus-mt-en-uk.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMarianMTModel for predictions without further training.
All model checkpoint layers were used when initializing TFMarianMTModel.

All the layers of TFMarianMTModel were initialized from the model checkpoint at Helsinki-NLP/opus-mt-uk-en.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMarianMTModel for predictions without further training.


# Define translation routine

In [4]:
def translate(input_text, tokenizer, model):

    # 1. The tokenizer encodes the source text as a vector of token IDs
    input_ids = tokenizer.encode(input_text, return_tensors="tf")

    # 2. The model translates and returns a vector of token IDs for the phrase in the target language
    outputs = model.generate(input_ids)

    # 3. The tokenizer decodes that vector into a phrase in the target language
    translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return translated_text
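
The routine above sends the whole input through the model in a single call. As a minimal sketch, assuming that translating line by line is acceptable for the input, the hypothetical helper `translate_lines` below (not part of transformers) applies any translation callable to each non-empty line and keeps the original line breaks. The `str.upper` call is only a stand-in for a real translation function.

```python
from typing import Callable


def translate_lines(text: str, translate_fn: Callable[[str], str]) -> str:
    """Apply translate_fn to each non-empty line and keep the line layout."""
    translated = []
    for line in text.splitlines():
        stripped = line.strip()
        # Blank lines are kept as empty strings so paragraph breaks survive.
        translated.append(translate_fn(stripped) if stripped else "")
    return "\n".join(translated)


# Stand-in callable (assumption): str.upper replaces a real translation call.
demo = translate_lines("first line\n\nsecond line", str.upper)
print(demo)
```

With the notebook's objects this could be called as `translate_lines(text_in_eng_, lambda s: translate(s, tokenizer, model))`, at the cost of one model call per line.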

# English -> Ukrainian

In [5]:
translate_from = 'eng'

tokenizer = tokenizers[translate_from]
model = models[translate_from]

In [6]:
text_in_eng_ = """
Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. 
Using pretrained models can reduce your compute costs, carbon footprint, and save you the time and resources required to train a model from scratch. 
These models support common tasks in different modalities, such as:

Natural Language Processing: 
text classification, named entity recognition, question answering, language modeling, summarization, translation, multiple choice, and text generation.
Computer Vision: image classification, object detection, and segmentation.
Audio: automatic speech recognition and audio classification.
Multimodal: table question answering, optical character recognition, information extraction from scanned documents, video classification, and visual question answering.
"""

In [7]:
text_in_ukr = translate(input_text=text_in_eng_, tokenizer=tokenizer, model=model)

In [8]:
print(text_in_ukr)

Трансформатори надають вам змогу з легкістю звантажувати і тренувати стандартні моделі. За допомогою попередньо підготовлених моделей ви можете зменшити ваші обчислювальні витрати, сліди вуглецю, а також зберегти час і ресурси, необхідні для того, щоб тренувати модель з нуля. Ці моделі підтримують спільні завдання у різних модулях, зокрема: орієнтація мови: класифікація тексту, розпізнавання сутностей, відповідь на питання, коментар до мови, резюме, переклад, вибір з декількох варіантів і створення тексту. Комп' ютерне бачення: класифікація зображень, визначення об' єктів і сегментація. Звук: автоматичне розпізнавання і класифікація звуку. Групове: відповідь на питання таблиці, оптичне розпізнавання символів, отримання інформації з сканованих документів, класифікації відео та візуальних питань.


# Ukrainian -> English

In [9]:
translate_from = 'ukr'

tokenizer = tokenizers[translate_from]
model = models[translate_from]

In [10]:
text_in_eng = translate(input_text=text_in_ukr, tokenizer=tokenizer, model=model)

In [11]:
print(text_in_eng)

Transformers allow you to easily download and train standard models. With pre-trained models you can reduce your computational costs, carbon footprints, as well as save the time and resources needed to train the model from scratch. These models support common tasks in various modules, including: Language orientation: Text rating, Entity Authentication, Answer, Language Comment, Summary, Multiple Choices and Text Creation. Computer vision: Image classification, Object Definition, and Segment. Sound: Automatic audio recognition and audio classification. Group: Answer to Questions, Optical Character Authentication, get information from scanned documents, ratings, and visuals.
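
The round trip above can be checked roughly by comparing the source text with its back-translation. The sketch below assumes that a word-level overlap ratio from the standard library's `difflib` is a good enough proxy for how much survived. `round_trip_similarity` is a hypothetical helper, not part of transformers, and it does not replace a proper machine-translation metric.

```python
from difflib import SequenceMatcher


def round_trip_similarity(original: str, back_translated: str) -> float:
    """Return a 0..1 ratio of word-level overlap between two texts."""
    # Lowercase and split on whitespace so casing and line breaks are ignored.
    a = original.lower().split()
    b = back_translated.lower().split()
    return SequenceMatcher(None, a, b).ratio()


# Toy strings (assumption) in place of the notebook's actual texts.
score = round_trip_similarity(
    "train the model from scratch",
    "train a model from scratch",
)
print(score)
```

In the notebook, the call would be `round_trip_similarity(text_in_eng_, text_in_eng)`.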


# Pipeline

A simpler way to get the result is to use a pipeline.

In [35]:
from transformers import pipeline

In [41]:
translator = pipeline(task='translation_en_to_fr')
translator('Please, translate this text to French!')

No model was supplied, defaulted to google-t5/t5-base and revision 686f1db (https://huggingface.co/google-t5/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


[{'translation_text': 'Veuillez traduire ce texte en français!'}]

You can also specify the name of the model and tokenizer to use; in fact, this is recommended.
That said, specifying only the model name is enough: the tokenizer is then selected automatically.

In [42]:
translator = pipeline(task='translation', model="Helsinki-NLP/opus-mt-en-uk")
translator('Please, translate this text to Ukrainian')

All model checkpoint layers were used when initializing TFMarianMTModel.

All the layers of TFMarianMTModel were initialized from the model checkpoint at Helsinki-NLP/opus-mt-en-uk.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMarianMTModel for predictions without further training.


[{'translation_text': 'Перекладіть цей текст українською'}]
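
The cell above passes only the model name and lets the pipeline pick the tokenizer. As a sketch of the explicit alternative, the hypothetical helper `build_translator` below loads the tokenizer with `AutoTokenizer.from_pretrained` and hands it to `pipeline` through its `tokenizer` argument. The helper name and its default argument are assumptions for illustration, and calling it downloads the checkpoint if it is not cached.

```python
def build_translator(model_name: str = "Helsinki-NLP/opus-mt-en-uk"):
    """Build a translation pipeline with an explicitly loaded tokenizer."""
    # Imported lazily so that defining this helper does not load the library.
    from transformers import AutoTokenizer, pipeline

    # Load the tokenizer that matches the checkpoint instead of relying on
    # the pipeline's automatic selection.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return pipeline(task="translation", model=model_name, tokenizer=tokenizer)


# Usage (requires transformers and network access or a local cache):
# translator = build_translator()
# translator('Please, translate this text to Ukrainian')
```

Passing the tokenizer explicitly mainly helps when it has to differ from the checkpoint's default, for example a locally modified copy.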