
<div style="color:#ffffff;
          font-size:50px;
          font-style:italic;
          text-align:left;
          font-family: 'Lucida Bright';
          background:#4686C8;">
  	&nbsp; Machine Translation using Hugging Face + pretrained model + pipeline
</div>
<br>   
<div style="
          font-size:20px;
          text-align:left;
          font-family: 'Palatino';
          ">
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Project: Machine Translation<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Author: George Barrinuevo<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Date: 06/27/2025<br>
</div>

<br><div style="color:#ffffff; 
          font-size:30px; 
          font-style:italic; 
          text-align:left;
          font-family: 'Lucida Bright';
          background:#4686C8;">
  	      &nbsp; Project Notes
</div>
<div style="
          font-size:16px; 
          text-align:left;
          font-family: 'Cambria';">
    
Here are my thoughts on this project.
- <b>PROS</b>:
    - A pretrained model is used, which saves the time of coding a new model. It also avoids training the model, which consumes a lot of CPU/GPU/TPU processing time.
    - The pretrained model can be loaded in TensorFlow/Keras format, which the author prefers over PyTorch.
    - HuggingFace has many datasets for language translation that can be used in this notebook.
    - A pipeline method is used, which simplifies the coding process.
- <b>CONS</b>:
    - The model is used as-is, so there is no customization. This notebook does not apply Transfer Learning methods to modify the pretrained model.
    - Other dataset sources cannot be used directly, since they would require additional custom preprocessing code to put the input text in the format the tokenizer expects.
- <b>INFO</b>:
    - This model uses the Transformer architecture, a type of sequence model.
</div>
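<div style="
          font-size:16px; 
          text-align:left;
          font-family: 'Cambria';">
The pipeline approach mentioned in the notes can be sketched roughly as follows. This is a minimal sketch and not the code this notebook runs; it assumes the transformers library and a supported backend are installed. The helper names build_translator and translate_all are hypothetical and exist only for illustration.
</div>

```python
def build_translator(model_name='Helsinki-NLP/opus-mt-en-fr'):
    """Create a Hugging Face translation pipeline for the given model."""
    # Imported lazily so translate_all() can be used without transformers installed.
    from transformers import pipeline
    return pipeline('translation', model=model_name)


def translate_all(texts, translator):
    """Translate a list of strings and return the translated strings.

    `translator` is any callable that maps a list of strings to a list of
    dicts with a 'translation_text' key, which is the shape returned by the
    translation pipeline.
    """
    results = translator(texts)
    return [result['translation_text'] for result in results]


# Usage (downloads the model on first run):
# translator = build_translator()
# print(translate_all(['what is your name?'], translator))
```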

<br><div style="color:#ffffff; 
          font-size:30px; 
          font-style:italic; 
          text-align:left;
          font-family: 'Lucida Bright';
          background:#4686C8;">
  	      &nbsp; Install Python Libraries and Load the Libraries
</div>
<div style="
          font-size:16px; 
          text-align:left;
          font-family: 'Cambria';">
After installing these libraries for the first time, restart the kernel and re-run this notebook.
</div>

In [1]:
!pip install sentencepiece
!pip install sacremoses

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, TFAutoModelForSeq2SeqLM
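
<div style="
          font-size:16px; 
          text-align:left;
          font-family: 'Cambria';">
One possible way to check whether the libraries are already installed, so that the install and kernel restart can be skipped, is sketched below. The helper name missing_packages is hypothetical, and the list of required packages is an assumption based on the install and import lines above.
</div>

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed list of top-level packages this notebook relies on.
required = ['sentencepiece', 'sacremoses', 'transformers']
missing = missing_packages(required)
if missing:
    print('Install these, then restart the kernel:', ', '.join(missing))
else:
    print('All required packages are available.')
```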



<br><div style="color:#ffffff; 
          font-size:30px; 
          font-style:italic; 
          text-align:left;
          font-family: 'Lucida Bright';
          background:#4686C8;">
  	      &nbsp; HuggingFace Access Token
</div>
<div style="
          font-size:16px; 
          text-align:left;
          font-family: 'Cambria';">
    
To use the HuggingFace pretrained models, this notebook must have a HuggingFace Access Token configured.
To create the Access Token, follow these instructions:
- https://huggingface.co/docs/hub/en/security-tokens

There are 3 ways to configure this notebook with the Access Token. Select only one of them.
- <b>User Prompt</b>: The user can manually enter their Access Token via the user prompt. To use this option, set use_selection to 'user_prompt'.
- <b>Google Colab</b>: The user can configure Google Colab to store the Access Token in its secrets section. To use this option, set use_selection to 'google_colab'.
    - You can find more info here: https://pyimagesearch.com/2025/04/04/configure-your-hugging-face-access-token-in-colab-environment/
- <b>Stored Token</b>: The Access Token can be stored in a local file. To use this option, set use_selection to 'stored_token'.
    - Here are the basic steps to store the Access Token in a local file:<br>
              vi ~/.cache/huggingface/stored_token<br>
                  # Enter this info into the file. Any name can be used for the 'test-01' part. Substitute hf_* with your actual HuggingFace Access Token.<br>
                  [test-01]<br>
                  hf_token = hf_*<br>
</div>

In [2]:
use_selection = 'stored_token'    # Values: 'user_prompt', 'google_colab', 'stored_token'

if use_selection == 'user_prompt':
    from huggingface_hub import notebook_login
    notebook_login()
elif use_selection == 'google_colab':
    from google.colab import userdata
elif use_selection == 'stored_token':
    pass
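
<div style="
          font-size:16px; 
          text-align:left;
          font-family: 'Cambria';">
For the Stored Token option, the file format described above is INI-style, so it can be read with Python's configparser. The following is a minimal sketch, not part of the original notebook. The helper name read_stored_token is hypothetical, and the section name 'test-01' and key 'hf_token' are taken from the example file above.
</div>

```python
import configparser
from pathlib import Path


def read_stored_token(path, section='test-01', key='hf_token'):
    """Read an Access Token from an INI-style file such as the one described above."""
    parser = configparser.ConfigParser()
    # parser.read() returns the list of files it parsed; an empty list means
    # the file was missing or unreadable.
    if not parser.read(str(Path(path).expanduser())):
        raise FileNotFoundError(f'Token file not found: {path}')
    return parser[section][key].strip()


# Usage (assumes the file exists and huggingface_hub is installed):
# from huggingface_hub import login
# login(token=read_stored_token('~/.cache/huggingface/stored_token'))
```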

<br>
<div style="color:#ffffff; 
          font-size:30px; 
          font-style:italic; 
          text-align:left;
          font-family: 'Lucida Bright';
          background:#4686C8;">
  	      &nbsp; Load the HuggingFace Pretrained Model
</div>
<br>
<div style="
          font-size:16px; 
          text-align:left;
          font-family: 'Cambria';">
    
Select the model and languages to use. A model name containing 'en-fr' denotes translation from English to French. Make this selection by setting the 'model_name' variable below. You can use other HuggingFace models with different languages.<br>
Some specific details:<br>
- tokenizer: This preprocesses the input text so that it is in a format usable by the model. It splits the text into tokens (words or subword pieces) and represents them as integer ID values from the model's vocabulary.
- model: This is the HuggingFace pretrained model that performs the actual translation. Since it is already pretrained, training the model is not needed, which saves a lot of time. The <b>TF</b>AutoModelForSeq2SeqLM class is used, which loads the pretrained model in TensorFlow format so that TensorFlow/Keras methods can be used. If using PyTorch, use the AutoModelForSeq2SeqLM class instead. Normally, the model name and the dataset name can be specified independently, but here a single name covers both the model and the dataset.
- A list of language translation models can be found here: https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models<br>
  On the left side of that web page, look for directories like 'en-fr', which is English to French. The model name would then be 'Helsinki-NLP/opus-mt-en-fr'. Just substitute 'en-fr' with another language pair.

</div>
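
<div style="
          font-size:16px; 
          text-align:left;
          font-family: 'Cambria';">
The naming rule described above can be sketched as a small helper. The function name opus_mt_model_name is hypothetical. It only builds the string; it does not check that a model for that language pair has actually been published.
</div>

```python
def opus_mt_model_name(source_lang, target_lang):
    """Build a 'Helsinki-NLP/opus-mt-<src>-<tgt>' model name from two language codes."""
    src = source_lang.strip().lower()
    tgt = target_lang.strip().lower()
    if not src or not tgt:
        raise ValueError('Both language codes are required.')
    return f'Helsinki-NLP/opus-mt-{src}-{tgt}'


print(opus_mt_model_name('en', 'fr'))  # Helsinki-NLP/opus-mt-en-fr
```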

In [3]:

model_name1 = 'Helsinki-NLP/opus-mt-en-nl'    # English to Dutch (Netherlands)
model_name2 = 'Helsinki-NLP/opus-mt-en-fr'    # English to French
model_name3 = 'Helsinki-NLP/opus-mt-en-es'    # English to Spanish
model_name4 = 'Helsinki-NLP/opus-mt-en-tl'    # English to Tagalog (Philippines)
model_name = model_name2

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_name, from_pt = False)

2025-06-27 15:36:25.000608: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-06-27 15:36:25.002374: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-06-27 15:36:25.007707: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-06-27 15:36:25.023615: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1751063785.051182  380872 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered

<br>
<div style="color:#ffffff; 
          font-size:30px; 
          font-style:italic; 
          text-align:left;
          font-family: 'Lucida Bright';
          background:#4686C8;">
  	      &nbsp; Preprocess the Input Text
</div>
<br>
<div style="
          font-size:16px; 
          text-align:left;
          font-family: 'Cambria';">
    
A list of input sentences is created for translation. The tokenizer object converts the input text into a format the model can use.

</div>
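
<div style="
          font-size:16px; 
          text-align:left;
          font-family: 'Cambria';">
To illustrate what "a format the model can use" means, here is a toy sketch of batch encoding: text is split into tokens, tokens are mapped to integer IDs, and shorter sentences are padded to the longest one. This is a deliberate simplification. The real tokenizer uses SentencePiece subword pieces and adds special tokens, and the vocabulary and the encode_batch helper below are made up for illustration.
</div>

```python
def encode_batch(texts, vocab, pad_id=0, unk_id=1):
    """Toy encoder: split on whitespace, map tokens to IDs, pad to the longest sentence."""
    encoded = [[vocab.get(token, unk_id) for token in text.lower().split()]
               for text in texts]
    longest = max((len(ids) for ids in encoded), default=0)
    # Pad every sequence to the same length so the batch is rectangular.
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in encoded]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in encoded]
    return {'input_ids': input_ids, 'attention_mask': attention_mask}


# Made-up vocabulary for illustration only.
vocab = {'<pad>': 0, '<unk>': 1, 'what': 2, 'is': 3, 'your': 4, 'name?': 5, 'hello': 6}
batch = encode_batch(['what is your name?', 'hello'], vocab)
print(batch['input_ids'])       # [[2, 3, 4, 5], [6, 0, 0, 0]]
print(batch['attention_mask'])  # [[1, 1, 1, 1], [1, 0, 0, 0]]
```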

In [4]:
texts = []
texts.append("Hello my friends! How are you doing today?")
texts.append('what is your name?')
texts.append('what did you have for dinner yesterday?')
texts.append('I plan to travel to Australia next summer.')
texts.append('My name is Steve. I work at Apple Computers as a marketing manager in the engineering department.')

# Calling the tokenizer directly replaces the deprecated prepare_seq2seq_batch();
# padding and truncation are set to match that method's defaults.
tokenized_text = tokenizer(texts, return_tensors='tf', padding=True, truncation=True)

TensorFlow and JAX classes are deprecated and will be removed in Transformers v5. We recommend migrating to PyTorch classes or pinning your version of Transformers.


<br>
<div style="color:#ffffff; 
          font-size:30px; 
          font-style:italic; 
          text-align:left;
          font-family: 'Lucida Bright';
          background:#4686C8;">
  	      &nbsp; Perform the Language Translation
</div>
<br>
<div style="
          font-size:16px; 
          text-align:left;
          font-family: 'Cambria';">
    
The generate() method performs the actual language translation, and batch_decode() converts the generated token IDs back into text. The translated text is then printed out.

</div>
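
<div style="
          font-size:16px; 
          text-align:left;
          font-family: 'Cambria';">
Conceptually, generation produces the output one token at a time until an end-of-sequence token appears. The sketch below shows the simplest variant, greedy decoding, with a toy scoring function standing in for the model. It is an illustration only: the actual generate() call may use a different strategy such as beam search, depending on the model's generation settings, and greedy_decode and toy_scores are hypothetical names.
</div>

```python
def greedy_decode(next_token_scores, eos_token, max_new_tokens=20):
    """Repeatedly pick the highest-scoring next token until EOS or the length limit."""
    output = []
    for _ in range(max_new_tokens):
        scores = next_token_scores(output)   # dict: token -> score
        best = max(scores, key=scores.get)
        if best == eos_token:
            break
        output.append(best)
    return output


def toy_scores(prefix):
    """Hypothetical stand-in for a model: always favors the next token of a fixed script."""
    script = ['Quel', 'est', 'votre', 'nom', '?', '</s>']
    target = script[min(len(prefix), len(script) - 1)]
    return {token: (1.0 if token == target else 0.0) for token in script}


print(' '.join(greedy_decode(toy_scores, eos_token='</s>')))  # Quel est votre nom ?
```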

In [5]:
translation = model.generate(**tokenized_text)

In [6]:
translated_texts = tokenizer.batch_decode(translation, skip_special_tokens=True)

for text, translated in zip(texts, translated_texts):
    print(f'text:        {text}')
    print(f'translation: {translated}')
    print()

text:        Hello my friends! How are you doing today?
translation: Bonjour mes amis, comment allez-vous aujourd'hui ?

text:        what is your name?
translation: Quel est votre nom ?

text:        what did you have for dinner yesterday?
translation: Qu'avez-vous mangé hier ?

text:        I plan to travel to Australia next summer.
translation: Je compte voyager en Australie l'été prochain.

text:        My name is Steve. I work at Apple Computers as a marketing manager in the engineering department.
translation: Je m'appelle Steve. Je travaille chez Apple Computers comme directeur marketing dans le département d'ingénierie.

