In [1]:
try:
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
except ImportError:
    # Install only when the import fails, then retry the import.
    !pip install transformers
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Import the transformers library if it is already installed on the system; otherwise install it with pip and then
# import it.
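
The `!pip` line only works inside IPython/Jupyter. As a minimal sketch of the same import-or-install idea in plain Python, the helper below checks whether a package is importable and installs it only if it is missing. The name `ensure_package` is hypothetical and not part of transformers; the example call uses the standard-library module `json` so that nothing is actually installed.

```python
import importlib.util
import subprocess
import sys


def ensure_package(name: str) -> bool:
    """Return True if `name` was already importable, False if it had to be installed."""
    if importlib.util.find_spec(name) is not None:
        return True
    # Install with the same interpreter that runs this script or kernel.
    subprocess.check_call([sys.executable, "-m", "pip", "install", name])
    # Make the freshly installed package visible to the import system.
    importlib.invalidate_caches()
    return False


# A standard-library module is always present, so no installation is triggered here.
already_present = ensure_package("json")
print(already_present)
```

Calling `ensure_package("transformers")` before the import would play the same role as the try/except cell above.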

In [2]:
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-it")
# This step loads the pre-trained tokenizer for Helsinki-NLP/opus-mt-en-it via AutoTokenizer. A tokenizer splits a sentence
# (string) into tokens and maps each token to a numerical id, because the model operates on numbers rather than on raw
# strings. We could train a tokenizer of our own on a custom vocabulary (set of words), but that is time consuming and
# offers little benefit when a pre-trained tokenizer like this one covers the general use case.

model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-it")
# This step loads the Helsinki-NLP/opus-mt-en-it model via AutoModelForSeq2SeqLM. It is a pre-trained English-to-Italian
# translation model that was trained on tokens from the tokenizer above, so the two are used together. A custom model can
# be trained as well, if required.

In [3]:
text = 'How is this so fast?'
# The sample text to be translated.

tokenized_text = tokenizer.prepare_seq2seq_batch([text], return_tensors='pt')
# This step returns two tensors, input_ids and attention_mask. input_ids holds the vocabulary index of each token in the
# sentence, while attention_mask marks which positions are real tokens (1) and which are padding to be ignored (0).
# return_tensors='pt' requests PyTorch tensors, which can be fed to the model directly. Note that prepare_seq2seq_batch
# is deprecated (see the warning in the output); calling the tokenizer directly, tokenizer([text], return_tensors='pt'),
# is the replacement.

translation = model.generate(**tokenized_text)
# generate returns the generated sequences of token ids, one sequence per input sentence.

translated_text = tokenizer.batch_decode(translation, skip_special_tokens=False)[0]
# batch_decode takes the sequences of token ids generated by model.generate() and converts each one into a string by
# calling decode. The [0] at the end selects the first (and here only) string in the resulting list. Because
# skip_special_tokens=False, special tokens such as <pad> are kept in the output; set it to True to strip them.

print(translated_text)

`prepare_seq2seq_batch` is deprecated and will be removed in version 5 of HuggingFace Transformers. Use the regular
`__call__` method to prepare your inputs and the tokenizer under the `as_target_tokenizer` context manager to prepare
your targets.

Here is a short example:

model_inputs = tokenizer(src_texts, ...)
with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_texts, ...)
model_inputs["labels"] = labels["input_ids"]

See the documentation of your specific tokenizer for more details on the specific arguments to the tokenizer of choice.
For a more complete example, see the implementation of `prepare_seq2seq_batch`.



<pad> Perche' e' cosi' veloce?
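
To make the input_ids and attention_mask described in cell 3 concrete, here is a toy sketch in plain Python. The vocabulary, the ids and the function name `encode_batch` are made up for illustration; the real tokenizer uses a learned subword vocabulary and returns tensors, not lists. The sketch assumes an end-of-sequence id is appended to each sentence and that shorter sentences are padded to the length of the longest one.

```python
# Hypothetical toy vocabulary; a real tokenizer uses a learned subword vocabulary.
VOCAB = {"<pad>": 0, "</s>": 1, "<unk>": 2, "how": 3, "is": 4,
         "this": 5, "so": 6, "fast": 7, "?": 8}


def encode_batch(sentences):
    """Map each sentence to token ids, pad to equal length, and build the masks."""
    rows = []
    for sentence in sentences:
        # Crude word-level split; the question mark becomes its own token.
        words = sentence.lower().replace("?", " ?").split()
        ids = [VOCAB.get(word, VOCAB["<unk>"]) for word in words]
        rows.append(ids + [VOCAB["</s>"]])  # append the end-of-sequence id
    longest = max(len(row) for row in rows)
    input_ids = [row + [VOCAB["<pad>"]] * (longest - len(row)) for row in rows]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(row) + [0] * (longest - len(row)) for row in rows]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = encode_batch(["How is this so fast?", "So fast?"])
print(batch["input_ids"])       # [[3, 4, 5, 6, 7, 8, 1], [6, 7, 8, 1, 0, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0, 0]]
```

With a single input sentence, as in cell 3, there is nothing to pad, so the attention mask is all ones.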


In [4]:
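# Same pipeline as in cell 3, applied to a sentence typed in by the user.
sentence = input("Please input the sentence you want to translate:\n")
# Calling the tokenizer directly replaces the deprecated prepare_seq2seq_batch (see the warning above).
tokenized_text = tokenizer([sentence], return_tensors='pt')
translation = model.generate(**tokenized_text)
translated_text = tokenizer.batch_decode(translation, skip_special_tokens=False)[0]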
print("The translated sentence in Italian is:\n", translated_text)

Please input the sentence you want to translate:
Amazing buddy!!!
The translated sentence in Italian is:
 <pad> Amico incredibile!!!
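
The leading <pad> in both translations comes from decoding with skip_special_tokens=False. The toy sketch below imitates that behaviour in plain Python; the id table and the functions `decode` and `batch_decode` are hypothetical stand-ins, not the transformers implementation, and the ids are invented for illustration.

```python
# Hypothetical id-to-token table; the ids are made up for illustration.
ID_TO_TOKEN = {0: "<pad>", 1: "Amico", 2: "incredibile!!!"}
SPECIAL_TOKENS = {"<pad>"}


def decode(ids, skip_special_tokens=False):
    """Turn one sequence of token ids back into a string."""
    tokens = [ID_TO_TOKEN[i] for i in ids]
    if skip_special_tokens:
        tokens = [token for token in tokens if token not in SPECIAL_TOKENS]
    return " ".join(tokens)


def batch_decode(sequences, skip_special_tokens=False):
    """Decode every sequence in the batch by calling decode on each one."""
    return [decode(ids, skip_special_tokens) for ids in sequences]


generated = [[0, 1, 2]]  # one generated sequence that starts with the <pad> id
with_specials = batch_decode(generated)[0]
without_specials = batch_decode(generated, skip_special_tokens=True)[0]
print(with_specials)     # <pad> Amico incredibile!!!
print(without_specials)  # Amico incredibile!!!
```

In the notebook cells, passing skip_special_tokens=True to tokenizer.batch_decode should likewise drop the <pad> prefix from the printed translation.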
