# (a) Setup (5 points)
* Environment Preparation: Use Google Colab as the execution environment for
this task. Set up the necessary libraries, including the Hugging Face
Transformers library. Make sure that the required dependencies (e.g.,
transformers, torch) are installed in the Colab notebook.

*  Model Selection: Choose a pre-trained English-to-Spanish translation model of
your choice from the Hugging Face model hub, such as
Helsinki-NLP/opus-mt-en-es. Explain why this model is a good choice for the
task.

In [21]:
# Install the evaluation dependencies before importing them
!pip install -U datasets evaluate rouge_score

import torch
from transformers import MarianTokenizer, MarianMTModel  # tokenizer and model classes for Marian translation models
from evaluate import load  # metric loader from the 'evaluate' library
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3


In [5]:
# Translation
# Name of the pre-trained English-to-Spanish checkpoint on the Hugging Face Hub
translator = "Helsinki-NLP/opus-mt-en-es"


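This model suits the task because the "en-es" suffix in its name specifies the translation direction: English (en) to Spanish (es). It is a MarianMT model published by the Helsinki-NLP group and trained on OPUS parallel data, so it can translate English to Spanish without any further training.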

# (b) Implementation (10 points)
* Model Loading: Load the chosen pre-trained model and tokenizer for
English-to-Spanish translation from the Hugging Face library.

* Translation Function: Implement a Python function
translate_to_spanish(sentence) that accepts an English sentence as input and
outputs its Spanish translation. Ensure the function integrates the tokenizer and
model inference pipeline.

* Batch Processing: Extend the function to handle a list of sentences and return
their respective translations. Use padding or truncation to ensure compatibility
with the model’s input size.

* Example Usage: Demonstrate the function with a few English sentences and
display the Spanish translations. Include at least three examples in the notebook.

In [None]:
# Step 1: Preprocessing with a tokenizer
# Convert the text inputs into numbers that the model can make sense of.
# To do this we use a tokenizer, which is responsible for:
#   - Splitting the input into words, subwords, or symbols (like punctuation)
#     that are called tokens
#   - Mapping each token to an integer
#   - Adding additional inputs that may be useful to the model
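
As a rough illustration of the three tokenizer steps listed above, here is a toy whitespace tokenizer. It is only a sketch: the vocabulary, the `toy_encode` helper, and the ids are made up for this example, whereas the real MarianTokenizer works with learned SentencePiece subword units.

```python
import re

# Hypothetical toy vocabulary; real models ship a learned subword vocabulary.
TOY_VOCAB = {"<unk>": 0, "</s>": 1, "i": 2, "am": 3, "feeling": 4, "hungry": 5, ".": 6}

def toy_encode(sentence):
    """Split a sentence into tokens, map them to ids, and append an end marker."""
    # Step 1: split into lowercase words and punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", sentence.lower())
    # Step 2: map each token to an integer id (unknown tokens map to <unk>).
    ids = [TOY_VOCAB.get(token, TOY_VOCAB["<unk>"]) for token in tokens]
    # Step 3: add the extra input the model expects, here an end-of-sentence id.
    ids.append(TOY_VOCAB["</s>"])
    return tokens, ids

tokens, ids = toy_encode("I am feeling hungry.")
print(tokens)  # ['i', 'am', 'feeling', 'hungry', '.']
print(ids)     # [2, 3, 4, 5, 6, 1]
```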

In [6]:
# Load the tokenizer and model
tokenizer = MarianTokenizer.from_pretrained(translator)
model = MarianMTModel.from_pretrained(translator)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Translated Text (Spanish): Vamos a aprender IA usando Python


In [4]:
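def translate_to_spanish1(sentence):
    """Translate a single English sentence to Spanish."""
    # Tokenize the sentence into a tensor of token ids, truncating overlong input
    input_ids = tokenizer.encode(sentence, return_tensors="pt", truncation=True)
    # Generate the Spanish token ids, then decode them back into text
    translated_ids = model.generate(input_ids)
    translated_text = tokenizer.decode(translated_ids[0], skip_special_tokens=True)
    return translated_text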

In [7]:
def translate_to_spanish2(list_of_sentences, translations):
    """Translate a list of English sentences to Spanish as one padded batch.

    Pass translations="yes" to return the list of translations; any other
    value prints each sentence pair instead and returns None.
    """
    # Tokenize the whole list at once: padding aligns the lengths within the
    # batch and truncation caps each input at the model's maximum input size.
    inputs = tokenizer(list_of_sentences, return_tensors="pt", padding=True, truncation=True)
    translated_ids = model.generate(**inputs)
    translated_texts = tokenizer.batch_decode(translated_ids, skip_special_tokens=True)
    if translations.casefold() == "yes":
        return translated_texts
    for english, spanish in zip(list_of_sentences, translated_texts):
        print("English Sentence:", english, "Spanish Translation:", spanish, '\n')
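
To make the padding and truncation step concrete, the sketch below pads a batch of token-id lists to a common length and builds the matching attention mask. The `pad_batch` helper, the `pad_id` of 0, and the `max_length` of 6 are assumptions for illustration only; Hugging Face tokenizers do this themselves when called on a list of sentences with `padding=True, truncation=True`.

```python
def pad_batch(batch_ids, pad_id=0, max_length=6):
    """Truncate each sequence to max_length, then pad all to the longest one."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask

input_ids, attention_mask = pad_batch([[5, 8, 2], [7, 3, 9, 4, 6, 1, 2, 2], [4]])
print(input_ids)       # [[5, 8, 2, 0, 0, 0], [7, 3, 9, 4, 6, 1], [4, 0, 0, 0, 0, 0]]
print(attention_mask)  # [[1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1], [1, 0, 0, 0, 0, 0]]
```

The attention mask is what lets the model ignore the padded positions, so sentences of different lengths can share one batch.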


In [6]:
# demonstrate
sentence1 = "I am feeling hungry"
sentence2 = "Should we do our homework today"
sentence3 = "We've got to go swimming soon"


In [None]:
translate_to_spanish1(sentence1)

'Tengo hambre.'

In [None]:
translate_to_spanish1(sentence2)

'¿Deberíamos hacer nuestros deberes hoy?'

This isn't the translation I expected: I would use "tarea" for homework, although "deberes" is also a common word for homework in Spain.

In [None]:
translate_to_spanish1(sentence3)

'Tenemos que ir a nadar pronto.'

# (c) Evaluation (10 points)
* Dataset Creation: Create a test dataset of at least 10 diverse English sentences.
The dataset should include simple, complex, and idiomatic expressions to test
the model's robustness.

* Model Testing: Use the implemented function to translate the test dataset and
display the results alongside the original English sentences.

* Analysis: Evaluate the quality of translations. Focus on:

* Accuracy of simple translations.

* Handling of complex sentence structures.

* Translation of idiomatic expressions.

* Metrics: Compute BLEU scores (Bilingual Evaluation Understudy score is a
metric for evaluating the quality of text generated by machine translation models.
It measures the similarity between the machine-generated translation and a set
of human reference translations) and other relevant metrics to quantitatively
assess the translations. Discuss any observed strengths or weaknesses.

In [8]:
test_dataset = [
    "Let's run around the block twice.",
    "Should we dance at Lunar New Year Festival?",
    "How much is the bag of candy?",
    "Can I pick you up after I finish working?",
    "Let's not beat around the bush.",
    "You are the best singer ever.",
    "My girlfriend cheated on me.",
    "I will get revenge on her.",
    "I destroyed her living room and slashed her tires.",
    "A long time ago, many people used to walk 5 miles to school because transportation was limited.",
]

In [9]:
translate_to_spanish2(test_dataset, "yes")

['Vamos a correr alrededor de la manzana dos veces.',
 '¿Deberíamos bailar en el Festival de Año Nuevo Lunar?',
 '¿Cuánto cuesta la bolsa de caramelos?',
 '¿Puedo recogerte después de terminar de trabajar?',
 'No nos demos vueltas por los arbustos.',
 'Eres el mejor cantante de la historia.',
 'Mi novia me engañó.',
 'Me vengaré de ella.',
 'Destruí su sala de estar y le corté los neumáticos.',
 'Hace mucho tiempo, muchas personas solían caminar 5 millas a la escuela porque el transporte era limitado.']

Analysis: The accuracy is very good. I speak Spanish, and when I looked up the words I did not recognize, they turned out to be regional variants rather than errors. One example is "homework" in part (b), which was translated as "deberes". I would say "tarea", and to me "deberes" sounds closer to "duties" or "chores", but it is a common word for homework in Spain.

The model also handled complex sentence structures, such as the past tense and longer sentences like the last one.

The idiomatic sentence ("Let's not beat around the bush.") was translated literally as "No nos demos vueltas por los arbustos." instead of with the equivalent Spanish idiom "No andemos con rodeos.", so that translation was less successful.


In [10]:
# Call the function and store the returned list
translations = translate_to_spanish2(test_dataset, "yes")

# Now you can work with the 'translations' list
translations

['Vamos a correr alrededor de la manzana dos veces.',
 '¿Deberíamos bailar en el Festival de Año Nuevo Lunar?',
 '¿Cuánto cuesta la bolsa de caramelos?',
 '¿Puedo recogerte después de terminar de trabajar?',
 'No nos demos vueltas por los arbustos.',
 'Eres el mejor cantante de la historia.',
 'Mi novia me engañó.',
 'Me vengaré de ella.',
 'Destruí su sala de estar y le corté los neumáticos.',
 'Hace mucho tiempo, muchas personas solían caminar 5 millas a la escuela porque el transporte era limitado.']

In [16]:
references

['Vamos a dar la vuelta a la manzana dos veces.',
 '¿Deberíamos bailar en el festival del Año Nuevo Lunar?',
 '¿Cuánto cuesta la bolsa de caramelos?',
 '¿Puedo recogerte después de terminar de trabajar?',
 'No andemos con rodeos.Eres el mejor cantante de todos los tiempos.',
 'Mi novia me engañó.',
 'Voy a vengarme de ella.',
 'Destruí su sala de estar y le corté las llantas.Hace mucho tiempo, muchas personas solían caminar 5 millas a la escuela porque el transporte era limitado.']

In [44]:
test_dataset

["Let's run around the block twice.",
 'Should we dance at Lunar New Year Festival?',
 'How much is the bag of candy?',
 'Can I pick you up after I finish working?',
 "Let's not beat around the bush.",
 'You are the best singer ever.',
 'My girlfriend cheated on me.',
 'I will get revenge on her.',
 'I destroyed her living room and slashed her tires.',
 'A long time ago, many people used to walk 5 miles to school because transportation was limited.']

In [17]:
predictions = translations
# References are the expected Spanish translations, one per test sentence
references = ['Vamos a dar la vuelta a la manzana dos veces.',
              '¿Deberíamos bailar en el festival del Año Nuevo Lunar?',
              '¿Cuánto cuesta la bolsa de caramelos?',
              '¿Puedo recogerte después de terminar de trabajar?',
              'No andemos con rodeos.',
              'Eres el mejor cantante de todos los tiempos.',
              'Mi novia me engañó.',
              'Voy a vengarme de ella.',
              'Destruí su sala de estar y le corté las llantas.',
              'Hace mucho tiempo, muchas personas solían caminar 5 millas a la escuela porque el transporte era limitado.']

# Calculate the BLEU score for each sentence and average them
smoothing = SmoothingFunction().method4
bleu_scores = []
for reference, prediction in zip(references, predictions):
    # sentence_bleu expects a list of tokenized references and one tokenized hypothesis
    bleu_scores.append(sentence_bleu([reference.split()], prediction.split(), smoothing_function=smoothing))

average_bleu_score = sum(bleu_scores) / len(bleu_scores)
print(f"Average BLEU score: {average_bleu_score}")

Average BLEU score: 0.6205441877982293
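
To show what BLEU is counting, here is a simplified hand-rolled version applied to the idiom sentence. It is a sketch under stated assumptions, not NLTK's implementation: the helper names are hypothetical, it only goes up to bigrams (NLTK's `sentence_bleu` defaults to 4-grams), and it applies no smoothing.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as
    often as it appears in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return overlap / max(sum(cand_counts.values()), 1)

def simple_bleu(reference, candidate, max_n=2):
    """Geometric mean of the 1..max_n precisions times a brevity penalty."""
    precisions = [modified_precision(reference, candidate, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, one empty n-gram level zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Penalize candidates that are shorter than the reference.
    if len(candidate) > len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(log_mean)

reference = "No andemos con rodeos.".split()
candidate = "No nos demos vueltas por los arbustos.".split()
print(modified_precision(reference, candidate, 1))  # 1 of 7 candidate words matches
print(simple_bleu(reference, candidate))            # 0.0: no bigram overlap
```

The zero score for a sentence with no matching bigrams is the reason a smoothing function is passed to `sentence_bleu` in the cell above.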


In [22]:
rouge = load("rouge")  # Example for ROUGE metric
results = rouge.compute(predictions=predictions, references=references)
print(results)

{'rouge1': 0.7633743449532924, 'rouge2': 0.7018837818528221, 'rougeL': 0.7633249791144529, 'rougeLsum': 0.7670995670995672}


In [23]:
# using the METEOR score

meteor = load("meteor")
meteor_results = meteor.compute(predictions=predictions, references=references)
print(f"METEOR Score: {meteor_results}")


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


METEOR Score: {'meteor': 0.7668583227294498}


Discuss any observed strengths or weaknesses:

An average BLEU score of 0.6205 indicates strong n-gram overlap between the generated translations and the reference translations. However, BLEU only measures surface word matching, not whether the meaning or figurative sense is correct. It is also sensitive to word order, so it penalizes a valid translation that orders or words things differently from the reference.

The ROUGE-1 score of 0.7634 is the F-measure of unigram overlap between the generated and reference translations, which shows a decent but not very strong overlap. Like BLEU, ROUGE cannot recognize paraphrases that express the same meaning in different words, and its order-sensitive variants (ROUGE-2 and ROUGE-L) also drop when a sentence is reordered, even though reordering often changes little about the point a sentence makes.

Overall, I think this model translates well word for word, but it sometimes fails to capture the meaning of the whole sentence, as with the idiom.
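
To see how word order affects these overlap metrics, the sketch below computes a simplified ROUGE-N F-score on two Spanish sentences that use the same words in a different order. The `rouge_n_f1` helper and the example sentences are assumptions for illustration; the `evaluate` implementation tokenizes differently and aggregates over the whole test set.

```python
from collections import Counter

def rouge_n_f1(reference, candidate, n):
    """F1 of n-gram overlap between two whitespace-tokenized sentences."""
    def grams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref, cand = grams(reference), grams(candidate)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "hoy vamos a nadar"
reordered = "vamos a nadar hoy"
print(rouge_n_f1(reference, reordered, 1))  # 1.0: same words, order ignored
print(rouge_n_f1(reference, reordered, 2))  # about 0.67: one of three bigrams lost
```

Unigram overlap ignores the reordering entirely, while the bigram score drops even though both sentences mean the same thing.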


# (d) Analysis (5 points)
* Advantages: Discuss the benefits of using pre-trained models for translation
tasks. For instance, address how pre-trained models reduce computational
requirements and enable rapid prototyping.

* Limitations: Identify limitations of the pre-trained model. Consider factors like:

* Handling domain-specific or rare vocabulary.
* Limitations in idiomatic or culturally nuanced translations.

* Suggestions for Improvement: Provide suggestions for improving the translation
pipeline. For example, consider techniques like fine-tuning the model on a
domain-specific dataset or augmenting the data.

Using a pre-trained model cuts down computation time, because much of the hard work of learning linguistic nuances has already been done during pre-training. This also makes it possible to prototype a translation pipeline quickly.
A downside is that more generic models may not recognize niche vocabulary for a specialized situation you ask them to translate. Another limitation is that a pre-trained model may not stick to one dialect or understand idiomatic meanings.

If I wanted to improve the translations, I would fine-tune the model on more common English phrases that have no direct translation in Spanish, each paired with the closest Spanish saying. If I needed translations of a sports game, I would fine-tune the model on sports-focused conversations.
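
As a first step toward that kind of fine-tuning, the sketch below prepares a tiny idiom dataset and splits it into training and validation sets. Everything here is illustrative: only the first pair comes from the test set above, the other pairs and the `split_pairs` helper are assumptions, and a real run would need far more pairs plus a training loop (for example with the Transformers `Seq2SeqTrainer`).

```python
import random

# Illustrative idiom pairs; only the first pair comes from the test set above.
idiom_pairs = [
    {"en": "Let's not beat around the bush.", "es": "No andemos con rodeos."},
    {"en": "It's raining cats and dogs.", "es": "Está lloviendo a cántaros."},
    {"en": "It's a piece of cake.", "es": "Es pan comido."},
    {"en": "He kicked the bucket.", "es": "Estiró la pata."},
    {"en": "Speak of the devil.", "es": "Hablando del rey de Roma."},
]

def split_pairs(pairs, validation_fraction=0.2, seed=0):
    """Shuffle the pairs reproducibly and split them into train and validation lists."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]

train_pairs, validation_pairs = split_pairs(idiom_pairs)
print(len(train_pairs), len(validation_pairs))  # 4 1
```

Holding out a validation set makes it possible to check whether fine-tuning actually improves idiom translations rather than just memorizing the training pairs.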