# Urdu to English Translation and Sentiment Analysis Assignment

## Objective
This assignment focuses on building a translation system that converts Urdu text into English using a pre-trained MarianMT model. Additionally, sentiment analysis will be performed on the translated text. The goal is to:

- Understand how machine translation works using deep learning models built from an encoder and a decoder.
- Analyze the sentiment of the translated text.
- Experiment with different translation parameters such as `max_length` and `temperature`.
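
To make the `temperature` parameter concrete, here is a minimal sketch of how a sampling temperature reshapes a probability distribution over candidate tokens. The function name `softmax_with_temperature` and the toy scores are illustrative assumptions, not part of the assignment code or of the MarianMT API.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing them by a temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate tokens
toy_logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 1.2):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring token, while higher temperatures flatten the distribution so that sampling becomes more varied.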

## Task Breakdown

### 1. Load the Pre-trained MarianMT Model
MarianMT is a family of neural machine translation models whose pre-trained checkpoints are published by Helsinki-NLP. We will use the `opus-mt-ur-en` model for Urdu to English translation.
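
The checkpoint names used in this notebook follow the pattern `Helsinki-NLP/opus-mt-{source}-{target}`. The helper below is a small sketch of that convention; `marian_model_name` is a hypothetical name introduced here, and a checkpoint is not guaranteed to exist for every language pair.

```python
def marian_model_name(src_lang, tgt_lang, org="Helsinki-NLP"):
    """Build a Hub model ID from source and target language codes."""
    return f"{org}/opus-mt-{src_lang}-{tgt_lang}"

print(marian_model_name("ur", "en"))  # Urdu to English
print(marian_model_name("en", "ur"))  # English to Urdu
```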

In [1]:
# Python 3.11 is used because it is the version installed on this system
# !python3.11 -m pip install transformers datasets torch
# !python3.11 -m pip install sentencepiece

from transformers import MarianMTModel, MarianTokenizer
from transformers import pipeline

### 2. Initialize Model and Tokenizer

In [2]:
# Load pre-trained MarianMT model and tokenizer for Urdu to English translation
model_name = "Helsinki-NLP/opus-mt-ur-en"
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



### 3. Load Sentiment Analysis Pipeline
Hugging Face provides a pre-trained sentiment analysis pipeline, which we will use to analyze the sentiment of the translated text.

In [3]:
sentiment_analyzer = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda:0
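
The warning above recommends naming the model and revision explicitly. A minimal sketch of doing so follows, reusing the model ID and revision printed in the warning; the constant names and the `load_sentiment_analyzer` helper are assumptions made for illustration.

```python
# Model ID and revision copied from the pipeline warning above
SENTIMENT_MODEL = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
SENTIMENT_REVISION = "714eb0f"

def load_sentiment_analyzer(model=SENTIMENT_MODEL, revision=SENTIMENT_REVISION):
    """Create a sentiment pipeline pinned to a specific model and revision."""
    # Imported here so the constants can be reused without loading transformers
    from transformers import pipeline
    return pipeline("sentiment-analysis", model=model, revision=revision)

# Usage (downloads the model on first call):
# sentiment_analyzer = load_sentiment_analyzer()
```

Pinning the model keeps results reproducible if the library's default checkpoint changes in a later release.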


### 4. Define Translation Function

In [25]:
def translate_urdu_to_english(urdu_text, max_length=512, temperature=1.0):
    """
    Translates Urdu text to English using MarianMT model.

    Parameters:
    - urdu_text (str): The Urdu text to be translated.
    - max_length (int): Maximum length of the translated output.
    - temperature (float): Sampling temperature for generation.

    Returns:
    - english_text (str): Translated English text.
    """
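    # Tokenize the Urdu input, truncating anything longer than max_length tokens
    input_ids = tokenizer.encode(urdu_text, return_tensors="pt", max_length=max_length, truncation=True)
    # Generate with beam search (5 beams). Note: temperature only takes effect when
    # sampling is enabled (do_sample=True); plain beam search ignores it.
    translated_tokens = model.generate(input_ids, max_length=max_length, num_beams=5, temperature=temperature)
    # Decode the best sequence back to text, dropping special tokens
    english_text = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)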
    return english_text
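
For experiments where `temperature` should visibly change the output, generation has to sample rather than run plain beam search. The variant below is a sketch under that assumption; `translate_with_sampling` is a hypothetical helper that takes the model and tokenizer as arguments, and it is not the function used to produce the results in this notebook.

```python
def translate_with_sampling(text, model, tokenizer, max_length=100, temperature=0.7):
    """Translate text by sampling, so that temperature influences the result."""
    # Tokenize the input and truncate it to max_length tokens
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
    # do_sample=True turns on sampling; num_beams=1 disables beam search
    output_ids = model.generate(
        **inputs,
        max_length=max_length,
        do_sample=True,
        num_beams=1,
        temperature=temperature,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Usage with the objects loaded earlier:
# print(translate_with_sampling("یہ ایک مثال ہے", model, tokenizer, temperature=0.5))
```

Because sampling is random, repeated calls can return different translations; calling `torch.manual_seed` beforehand makes a run repeatable.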

### 5. Define Sentiment Analysis Function

In [5]:
def perform_sentiment_analysis(text):
    """
    Performs sentiment analysis on the translated English text.

    Parameters:
    - text (str): The English text for sentiment analysis.

    Returns:
    - sentiment (dict): Result with a 'label' (POSITIVE or NEGATIVE for the default model) and a 'score' (confidence).
    """
    sentiment = sentiment_analyzer(text)[0]
    return sentiment
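
The default model only returns POSITIVE or NEGATIVE labels. If a neutral category is wanted, one option is to treat low-confidence predictions as neutral. The sketch below assumes a hypothetical helper `label_with_neutral` and an arbitrary threshold of 0.75; neither comes from the pipeline itself.

```python
def label_with_neutral(result, threshold=0.75):
    """Return 'NEUTRAL' when the classifier's confidence is below the threshold."""
    if result["score"] < threshold:
        return "NEUTRAL"
    return result["label"]

# A result dict in the same shape the pipeline returns
example = {"label": "POSITIVE", "score": 0.9959}
print(label_with_neutral(example))
```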

### 7. Perform Translation and Sentiment Analysis

In [None]:
urdu_text = "یہ ایک مثال ہے"
max_length = 99  # Experiment with different max lengths
temperature = 0.5  # Experiment with different temperatures

# Translate Urdu to English
english_translation = translate_urdu_to_english(urdu_text, max_length=max_length, temperature=temperature)

# Perform sentiment analysis
sentiment_result = perform_sentiment_analysis(english_translation)

print(f"Original Urdu Text: {urdu_text}")
print(f"Translated English Text: {english_translation}")
print(f"Sentiment Analysis of Translated Text: {sentiment_result['label']} (Confidence: {sentiment_result['score']})")



Original Urdu Text: یہ ایک مثال ہے
Translated English Text: This is an example.
Sentiment Analysis of Translated Text: POSITIVE (Confidence: 0.995923638343811)
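
To try several `max_length` and `temperature` values without editing the cell by hand each time, the combinations can be looped over. This is a sketch only: `sweep_parameters` is a hypothetical helper, and it accepts the translation function as an argument so that it works with any function taking `max_length` and `temperature` keywords.

```python
from itertools import product

def sweep_parameters(translate_fn, text, max_lengths, temperatures):
    """Translate text once for every (max_length, temperature) combination."""
    results = []
    for max_length, temperature in product(max_lengths, temperatures):
        translation = translate_fn(text, max_length=max_length, temperature=temperature)
        results.append({
            "max_length": max_length,
            "temperature": temperature,
            "translation": translation,
        })
    return results

# Usage with the function defined earlier:
# for row in sweep_parameters(translate_urdu_to_english, urdu_text, [20, 99], [0.5, 1.0, 1.2]):
#     print(row)
```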


### 8. Experiment with More Complex Sentences

In [27]:
urdu_text_complex = "پاکستان ایک خوبصورت ملک ہے اور اس کے لوگ بہت مہمان نواز ہیں۔"
english_translation_complex = translate_urdu_to_english(urdu_text_complex, max_length=100, temperature=1.2)
sentiment_result_complex = perform_sentiment_analysis(english_translation_complex)

print(f"Original Complex Urdu Text: {urdu_text_complex}")
print(f"Translated Complex English Text: {english_translation_complex}")
print(f"Sentiment Analysis of Complex Translated Text: {sentiment_result_complex['label']} (Confidence: {sentiment_result_complex['score']})")



Original Complex Urdu Text: پاکستان ایک خوبصورت ملک ہے اور اس کے لوگ بہت مہمان نواز ہیں۔
Translated Complex English Text: Pakistan is a beautiful country and its people are very hospitable.
Sentiment Analysis of Complex Translated Text: POSITIVE (Confidence: 0.9996907711029053)


### 9. Repeat the Task for English to Urdu Translation

In [33]:
# Now load the model for translating English to Urdu
model_name = "Helsinki-NLP/opus-mt-en-ur"
new_model = MarianMTModel.from_pretrained(model_name)
new_tokenizer = MarianTokenizer.from_pretrained(model_name)

# Define the relevant functions as before, then test translation and sentiment analysis
def translate_english_to_urdu(english_text, max_length=100, temperature=0.9):

    # Tokenize input text with length constraints
    inputs = new_tokenizer(english_text, return_tensors="pt", padding=True, truncation=True, max_length=max_length)

    # Generate translated tokens. Note: temperature only takes effect when
    # sampling is enabled (do_sample=True); otherwise it is ignored.
    translated_tokens = new_model.generate(**inputs, max_length=max_length, temperature=temperature)

    # Decode the translated tokens to get the Urdu text
    urdu_text = new_tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
    return urdu_text

# Load pre-trained sentiment analysis model
sentiment_analyzer = pipeline("sentiment-analysis")

def perform_sentiment_analysis(text):

    sentiment = sentiment_analyzer(text)[0]
    return sentiment

english_text = "You are very bad"
max_length = 50  # Experiment with different max lengths
temperature = 0.7  # Experiment with different temperatures

# Translate English to Urdu
urdu_translation = translate_english_to_urdu(english_text, max_length=max_length, temperature=temperature)

# Perform sentiment analysis on the Urdu translation.
# Note: the default model is trained on English text, so its scores on Urdu are unreliable.
sentiment = perform_sentiment_analysis(urdu_translation)

print(f"Original English Text: {english_text}")
print(f"Translated Urdu Text: {urdu_translation}")
print(f"Sentiment Analysis of Translated Text: {sentiment['label']} (Confidence: {sentiment['score']})")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


Original English Text: You are very bad
Translated Urdu Text: تم بہت برا ہو
Sentiment Analysis of Translated Text: NEGATIVE (Confidence: 0.7732239961624146)


In [36]:
# Experiment with at least 3 different English texts and observe the results
english_text1 = "Are You in Class?"
english_text2 = "Are you done your work?"
english_text3 = "Not i'm not in class and not done the assignment"

max_length = 70
temperature = 0.7



urdu_translation1 = translate_english_to_urdu(english_text1, max_length=max_length, temperature=temperature)
urdu_translation2 = translate_english_to_urdu(english_text2, max_length=max_length, temperature=temperature)
urdu_translation3 = translate_english_to_urdu(english_text3, max_length=max_length, temperature=temperature)



# Sentiment is computed on the original English texts, since the default model is English-only
sentiment_result1 = perform_sentiment_analysis(english_text1)
sentiment_result2 = perform_sentiment_analysis(english_text2)
sentiment_result3 = perform_sentiment_analysis(english_text3)

print(f"Original English Text 1: {english_text1}")
print(f"Translated Urdu Text 1: {urdu_translation1}")
print(f"Sentiment Analysis of Translated Text 1: {sentiment_result1['label']} (Confidence: {sentiment_result1['score']})")

print(f"Original English Text 2: {english_text2}")
print(f"Translated Urdu Text 2: {urdu_translation2}")
print(f"Sentiment Analysis of Translated Text 2: {sentiment_result2['label']} (Confidence: {sentiment_result2['score']})")

print(f"Original English Text 3: {english_text3}")
print(f"Translated Urdu Text 3: {urdu_translation3}")
print(f"Sentiment Analysis of Translated Text 3: {sentiment_result3['label']} (Confidence: {sentiment_result3['score']})")

Original English Text 1: Are You in Class?
Translated Urdu Text 1: کیا آپ کلاس میں ہیں ؟
Sentiment Analysis of Translated Text 1: POSITIVE (Confidence: 0.991032063961029)
Original English Text 2: Are you done your work?
Translated Urdu Text 2: کیا آپ نے اپنا کام کِیا ہے ؟
Sentiment Analysis of Translated Text 2: NEGATIVE (Confidence: 0.9566459059715271)
Original English Text 3: Not i'm not in class and not done the assignment
Translated Urdu Text 3: میں کلاس میں نہیں ہوں اور اس کام کو نہیں کیا
Sentiment Analysis of Translated Text 3: NEGATIVE (Confidence: 0.9996975660324097)
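
The three texts above are translated one call at a time. The `padding=True` argument in the tokenizer call only matters when several sentences are encoded together, so a batched variant is a natural next step. The sketch below assumes a hypothetical helper `translate_batch` that receives the model and tokenizer as arguments.

```python
def translate_batch(texts, model, tokenizer, max_length=70):
    """Translate a list of sentences in a single generate call."""
    # Pad shorter sentences so the whole batch forms one rectangular tensor
    inputs = tokenizer(
        list(texts),
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_length,
    )
    output_ids = model.generate(**inputs, max_length=max_length)
    # batch_decode returns one string per input sentence
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)

# Usage with the objects loaded earlier:
# texts = [english_text1, english_text2, english_text3]
# for src, tgt in zip(texts, translate_batch(texts, new_model, new_tokenizer)):
#     print(src, "->", tgt)
```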
