This notebook evaluates `tamil_mfd.dic` through the following steps:
1. Select one word from each MFD category (10 categories × 1 word = 10 words total) from the Tamil MFD and obtain the corresponding English MFD terms.
2. Back-translate the Tamil terms to English using Sarvam AI and compare the results with the original English terms.
3. Have human annotators translate the English terms to Tamil, compare their translations with the machine-translated Tamil terms, and calculate inter-annotator agreement (IAA).

In [15]:
!git clone https://github.com/anagha1999/anlp-project/

fatal: destination path 'anlp-project' already exists and is not an empty directory.


In [16]:
import os

# Navigate to the repository directory
repo_dir = 'anlp-project'
if os.path.exists(repo_dir):
    os.chdir(repo_dir)
    print(f"Changed directory to: {os.getcwd()}")
    # Pull the latest changes from the remote repository
    !git pull
    # Navigate back to the original directory if needed
    os.chdir('..')
    print(f"Changed directory back to: {os.getcwd()}")
else:
    print(f"Repository directory '{repo_dir}' not found. Please ensure it is cloned.")

Changed directory to: /content/anlp-project
remote: Enumerating objects: 22, done.
remote: Counting objects: 100% (22/22), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 16 (delta 8), reused 16 (delta 8), pack-reused 0 (from 0)
Unpacking objects: 100% (16/16), 7.54 MiB | 7.15 MiB/s, done.
From https://github.com/anagha1999/anlp-project
   1458d71..c2463a1  main       -> origin/main
Updating 1458d71..c2463a1
Fast-forward
 kannada/2.Clean_Preprocess_Kannada_Dataset.ipynb   | 1000 +--
 kannada/3.Moral_Foundations_Kannada.ipynb          |  214 +-
 kannada/5.Eval_Moral_Extraction.ipynb              |    0
 kannada/kannada-dataset/3-siri-kannada.pdf         |  Bin 0 -> 7581798 bytes
 .../1-janapada-kathegalu.txt                       |    0
 .../2-niti-kathegalu.txt                           |    0
 .../kannada-pre-processed/1-janapada-kathegalu.csv | 6605 ++++++++++++++++++++
 .../kannada-pre-processed/2

# Task
Evaluate the back-translation accuracy of Sarvam AI from Tamil to English using 10 aligned word pairs (one from each of the 10 MFD categories) drawn from `anlp-project/mfd2.0.dic` and `anlp-project/tamil/tamil_mfd.dic`.

## Load and Preprocess English MFD

### Subtask:
Load the 'anlp-project/mfd2.0.dic' file, filter out comment and category lines (starting with '%' or a digit), and extract the first token as the English word from each remaining line.


In [6]:
# this step is not needed when repo is cloned and working on it directly
!git clone https://github.com/anagha1999/anlp-project/

Cloning into 'anlp-project'...
remote: Enumerating objects: 450, done.
remote: Counting objects: 100% (52/52), done.
remote: Compressing objects: 100% (39/39), done.
^Cceiving objects:   3% (16/450), 1.83 MiB | 1.80 MiB/s  


In [None]:
import os

file_path = 'anlp-project/mfd2.0.dic'
english_words = []

if os.path.exists(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip() # Remove leading/trailing whitespace
            if line and not (line.startswith('%') or line[0].isdigit()):
                # Extract the first token (word)
                first_token = line.split()[0]
                english_words.append(first_token)
    print(f"Successfully loaded {len(english_words)} English words.")
    print("First 10 extracted words:", english_words[:10])
else:
    print(f"Error: File not found at {file_path}")

Error: File not found at anlp-project/mfd2.0.dic


## Load and Preprocess Tamil MFD

In [2]:
import os

tamil_file_path = 'anlp-project/tamil/tamil_mfd.dic'
tamil_words = []

if os.path.exists(tamil_file_path):
    with open(tamil_file_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip() # Remove leading/trailing whitespace
            if line and not (line.startswith('%') or line[0].isdigit()):
                # Extract the first token (word)
                first_token = line.split()[0]
                tamil_words.append(first_token)
    print(f"Successfully loaded {len(tamil_words)} Tamil words.")
    print("First 10 extracted Tamil words:", tamil_words[:10])
else:
    print(f"Error: File not found at {tamil_file_path}")

Error: File not found at anlp-project/tamil/tamil_mfd.dic


In [3]:
cleaned_tamil_words = [word for word in tamil_words if word != '#']

print(f"Successfully cleaned {len(tamil_words) - len(cleaned_tamil_words)} non-word entries.")
print(f"Total Tamil words after cleaning: {len(cleaned_tamil_words)}")
print("First 10 cleaned Tamil words:", cleaned_tamil_words[:10])

Successfully cleaned 0 non-word entries.
Total Tamil words after cleaning: 0
First 10 cleaned Tamil words: []


In [4]:
import os

# Load English MFD with categories
english_data = []
with open('anlp-project/mfd2.0.dic', 'r', encoding='utf-8') as f:
    for line in f:
        line_orig = line.strip()
        if line_orig and not (line_orig.startswith('%') or line_orig[0].isdigit()):
            # Category is the last token
            parts = line_orig.split()
            if len(parts) >= 2:
                try:
                    category = int(parts[-1])
                    word = ' '.join(parts[:-1])
                    english_data.append((word, category))
                except ValueError:
                    continue

# Load Tamil MFD with categories
tamil_data = []
with open('anlp-project/tamil/tamil_mfd.dic', 'r', encoding='utf-8') as f:
    for line in f:
        line_orig = line.strip()
        if line_orig and not (line_orig.startswith('%') or (line_orig and line_orig[0].isdigit())):
            parts = line_orig.split()
            if len(parts) >= 2:
                try:
                    category = int(parts[-1])
                    word = ' '.join(parts[:-1])
                    if word != '#':
                        tamil_data.append((word, category))
                except ValueError:
                    continue

print(f"English entries: {len(english_data)}")
print(f"Tamil entries: {len(tamil_data)}")

# Category mapping
category_names = {
    1: 'care.virtue',
    2: 'care.vice',
    3: 'fairness.virtue',
    4: 'fairness.vice',
    5: 'loyalty.virtue',
    6: 'loyalty.vice',
    7: 'authority.virtue',
    8: 'authority.vice',
    9: 'sanctity.virtue',
    10: 'sanctity.vice'
}

# Select 1 word from each category
sampled_pairs = []
print("\nSelected words by category:")
print("=" * 60)

for cat_num in sorted(category_names.keys()):
    cat_name = category_names[cat_num]
    # Get first 1 English word from this category
    eng_words = [w for w, c in english_data if c == cat_num][:1]
    # Get first 1 Tamil word from this category
    tam_words = [w for w, c in tamil_data if c == cat_num][:1]

    print(f"\n{cat_name}:")
    for i in range(min(len(eng_words), len(tam_words))):
        eng = eng_words[i]
        tam = tam_words[i]
        sampled_pairs.append((eng, tam))
        print(f"  {eng} -> {tam}")

print(f"\n{'='*60}")
print(f"Total pairs selected: {len(sampled_pairs)}")
print(f"Coverage: {len(category_names)} categories \u00d7 1 word = {len(sampled_pairs)} words")

FileNotFoundError: [Errno 2] No such file or directory: 'anlp-project/mfd2.0.dic'
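
The English and Tamil loaders above apply the same filter, so the logic can be factored into one helper and checked on in-memory lines before touching the repository files. The following is a minimal sketch rather than the notebook's own code: the function name `parse_dic_lines` and the sample lines are assumptions made for illustration, and the file layout (header lines starting with `%` or a digit, entries ending in an integer category) is inferred from the filters used above.

```python
def parse_dic_lines(lines):
    """Return (word, category) pairs from MFD-style .dic lines."""
    entries = []
    for raw in lines:
        line = raw.strip()
        # Skip blanks, '%' delimiters, and category-definition lines (start with a digit)
        if not line or line.startswith('%') or line[0].isdigit():
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            category = int(parts[-1])  # the category id is the last token
        except ValueError:
            continue
        word = ' '.join(parts[:-1])  # keeps multi-word terms such as "team player" intact
        if word != '#':  # '#' marks a non-word entry
            entries.append((word, category))
    return entries


# Hypothetical sample lines, for illustration only
sample_lines = [
    "%",
    "1\tcare.virtue",
    "5\tloyalty.virtue",
    "%",
    "compassion\t1",
    "team player\t5",
    "# 3",
    "",
]
parsed = parse_dic_lines(sample_lines)
print(parsed)
```

Because the helper takes any iterable of lines, it can also be called with an open file handle once the `.dic` paths resolve.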

In [None]:
import os
import pickle

# Save as .dic file
dic_file_path = os.path.join('tamil_samples.dic')
with open(dic_file_path, 'w', encoding='utf-8') as f:
    for eng_word, tam_word in sampled_pairs:
        f.write(f"{tam_word}\t{eng_word}\n")
print(f"Saved sampled pairs to {dic_file_path}")

# Save as .pkl file
pkl_file_path = os.path.join('tamil_samples.pkl')
with open(pkl_file_path, 'wb') as f:
    pickle.dump(sampled_pairs, f)
print(f"Saved sampled pairs to {pkl_file_path}")

Saved sampled pairs to tamil_samples.dic
Saved sampled pairs to tamil_samples.pkl


In [None]:
import pickle
import os

pkl_file_path = 'tamil_samples.pkl'

if os.path.exists(pkl_file_path):
    with open(pkl_file_path, 'rb') as f:
        sampled_pairs = pickle.load(f)
    print(f"Successfully loaded {len(sampled_pairs)} sampled pairs from {pkl_file_path}")
    print("First 5 loaded pairs:")
    for i, (eng, tam) in enumerate(sampled_pairs[:5]):
        print(f"{i+1}. English: {eng}, Tamil: {tam}")
else:
    print(f"Error: Pickle file not found at {pkl_file_path}")


Successfully loaded 10 sampled pairs from tamil_samples.pkl
First 5 loaded pairs:
1. English: compassion, Tamil: கருணை
2. English: harm, Tamil: தீங்கு
3. English: equality, Tamil: சமத்துவம்
4. English: cheat, Tamil: ஏமாற்று
5. English: team player, Tamil: குழு விளையாட்டு வீரர்


In [None]:
pip install transformers



In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Model details
model_name = "sarvamai/sarvam-translate"
tgt_lang = "English"  # English in Latin script
source_lang = "Tamil"  # Tamil in Tamil script

# Check for GPU availability
device = 0 if torch.cuda.is_available() else -1
print(f"Using device: {'cuda' if device == 0 else 'cpu'}")

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)


import os
import requests

def sarvam_ai_tamil_to_english(tamil_word):
    """
    Uses Sarvam's official API for concise translations
    """
    url = "https://api.sarvam.ai/translate"

    payload = {
        "input": tamil_word,
        "source_language_code": "ta-IN",  # Tamil
        "target_language_code": "en-IN",  # English
        "speaker_gender": "Male",
        "mode": "formal",
        "model": "mayura:v1"
    }

    headers = {
        "Content-Type": "application/json",
        # Read the key from the environment instead of hard-coding it in the notebook.
        # SARVAM_API_KEY is an assumed variable name; set it before running this cell.
        "API-Subscription-Key": os.environ["SARVAM_API_KEY"]
    }

    response = requests.post(url, json=payload, headers=headers, timeout=30)
    response.raise_for_status()  # Surface HTTP errors instead of silently returning ""
    result = response.json()

    return result.get("translated_text", "").strip()


print("Sarvam AI translation model and function loaded.")


Using device: cuda


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Sarvam AI translation model and function loaded.


In [None]:
back_translated_tamil = []
for english_word, tamil_word in sampled_pairs:
    back_translated_english = sarvam_ai_tamil_to_english(tamil_word)
    back_translated_tamil.append({
        "original_english": english_word,
        "tamil_word": tamil_word,
        "translated_english": back_translated_english
    })
print(back_translated_tamil)


[{'original_english': 'compassion', 'tamil_word': 'கருணை', 'translated_english': 'Karunai'}, {'original_english': 'harm', 'tamil_word': 'தீங்கு', 'translated_english': 'Harmful'}, {'original_english': 'equality', 'tamil_word': 'சமத்துவம்', 'translated_english': 'Equality'}, {'original_english': 'cheat', 'tamil_word': 'ஏமாற்று', 'translated_english': 'Deceit'}, {'original_english': 'team player', 'tamil_word': 'குழு விளையாட்டு வீரர்', 'translated_english': 'Team player'}, {'original_english': 'traitor', 'tamil_word': 'துரோகி', 'translated_english': 'Traitor'}, {'original_english': 'respect', 'tamil_word': 'மரியாதை', 'translated_english': 'Respect'}, {'original_english': 'disrespect', 'tamil_word': 'அவமரியாதை', 'translated_english': 'Disrespect'}, {'original_english': 'sanctity', 'tamil_word': 'புனிதத்தன்மை', 'translated_english': 'Purity'}, {'original_english': 'impurity', 'tamil_word': 'மாசு', 'translated_english': 'Masu'}]


## Back-Translation

In [None]:
print(f"{'Original English':<20} | {'Tamil Word':<20} | {'Back-Translated English':<30}")
print(f"{'':-<20}-+-{'':-<20}-+-{'':-<30}")
for item in back_translated_tamil:
    print(f"{item['original_english']:<20} | {item['tamil_word']:<20} | {item['translated_english']:<30}")

Original English     | Tamil Word           | Back-Translated English       
---------------------+----------------------+-------------------------------
compassion           | கருணை                | Karunai                       
harm                 | தீங்கு               | Harmful                       
equality             | சமத்துவம்            | Equality                      
cheat                | ஏமாற்று              | Deceit                        
team player          | குழு விளையாட்டு வீரர் | Team player                   
traitor              | துரோகி               | Traitor                       
respect              | மரியாதை              | Respect                       
disrespect           | அவமரியாதை            | Disrespect                    
sanctity             | புனிதத்தன்மை         | Purity                        
impurity             | மாசு                 | Masu                          




## Back-Translation Results

| Original English | Tamil Word | Back-Translated English | Semantic Match |
|:-----------------|:-----------|:------------------------|:---------------|
| compassion | கருணை | Karunai | No |
| harm | தீங்கு | Harmful | Yes |
| equality | சமத்துவம் | Equality | Yes |
| cheat | ஏமாற்று | Deceit | Yes |
| team player | குழு விளையாட்டு வீரர் | Team player | Yes |
| traitor | துரோகி | Traitor | Yes |
| respect | மரியாதை | Respect | Yes |
| disrespect | அவமரியாதை | Disrespect | Yes |
| sanctity | புனிதத்தன்மை | Purity | Yes |
| impurity | மாசு | Masu | No |
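
The table above can be reduced to two summary figures: a strict case-insensitive exact-match rate and the rate of manually judged semantic matches. The sketch below is illustrative only. The rows are copied from the table, while the function names `exact_match_rate` and `semantic_match_rate` are assumptions introduced here, not part of the notebook.

```python
# Rows copied from the results table: (original English, back-translated English, semantic match)
results = [
    ("compassion", "Karunai", False),
    ("harm", "Harmful", True),
    ("equality", "Equality", True),
    ("cheat", "Deceit", True),
    ("team player", "Team player", True),
    ("traitor", "Traitor", True),
    ("respect", "Respect", True),
    ("disrespect", "Disrespect", True),
    ("sanctity", "Purity", True),
    ("impurity", "Masu", False),
]


def exact_match_rate(rows):
    """Share of rows whose back-translation equals the original, ignoring case."""
    hits = sum(orig.casefold() == back.strip().casefold() for orig, back, _ in rows)
    return hits / len(rows)


def semantic_match_rate(rows):
    """Share of rows manually judged to be a semantic match."""
    return sum(match for _, _, match in rows) / len(rows)


exact = exact_match_rate(results)
semantic = semantic_match_rate(results)
print(f"Exact match:    {exact:.0%}")
print(f"Semantic match: {semantic:.0%}")
```

The gap between the two rates comes from near-synonyms such as Harmful, Deceit, and Purity, which fail a string comparison but were judged semantically equivalent.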


## Machine-Human Translation Comparison

| English      | Sarvam AI        | Human 1            | Human 2            | Agreement Type | MT Performance | Category | Notes |
| ------------ | ---------------- | ------------------ | ------------------ | -------------- | -------------- | -------- | ----- |
| compassion   | கருணை           | இறக்கம்                  | கருணை     | Partial agree              | High              | care.virtue | Machine matched Human 2 exactly; Human 1 used a near-synonym emphasizing mercy rather than empathy  |
| harm         | தீங்கு          | துன்புறுதல்                 | தீங்கு            | Partial agree              | High              | care.vice | Machine and Human 2 agree on abstract harm; Human 1 shifted to experiential suffering rather than the abstract term |
| equality     | சமத்துவம்        | சமத்துவம்                  | சமத்துவம்                | All agree              | Perfect              | fairness.virtue | - |
| cheat        | ஏமாற்று         | ஏமாற்றுதல்                  | ஏமாற்றம்                | All agree              | Perfect              | fairness.vice | - |
| team player  | குழு விளையாட்டு வீரர் | குழுவின் விளையாட்டு வீரர்          | குழு விளையாட்டு வீரர்                  | All agree              | Perfect              | loyalty.virtue | - |
| traitor      | துரோகி          | துரோகி                 | நேர்மை அற்றவன்                  | Partial agree              | High              | loyalty.vice | Deceit vs. lack of integrity in Human 2's translation |
| respect      | மரியாதை         | மரியாதை                  | மரியாதை                 | All agree              | Perfect              | authority.virtue | - |
| disrespect   | அவமரியாதை       | அவமரியாதை                  | அவமரியாதை                  | All agree              | Perfect              | authority.vice | - |
| sanctity     | புனிதத்தன்மை     | புனித தன்மை                  | புனிதத்தன்மை                  | All agree              | Perfect              | sanctity.virtue | - |
| impurity     | மாசு            | அசுத்தம்                  | அசுத்தம்                  | Partial agree              | Low             | sanctity.vice | The machine translation was correct but did not use the most common word. |
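
The labels in the table above can be tallied to give a simple agreement figure. The sketch below computes raw percent agreement only, which is not chance-corrected (unlike, for example, Cohen's kappa), and the notebook does not state which IAA measure was intended. The rows are copied from the table, with every partial-agreement row entered under the single label "Partial agree"; the variable names are assumptions made for illustration.

```python
from collections import Counter

# (English term, agreement type, MT performance), copied from the comparison table
rows = [
    ("compassion", "Partial agree", "High"),
    ("harm", "Partial agree", "High"),
    ("equality", "All agree", "Perfect"),
    ("cheat", "All agree", "Perfect"),
    ("team player", "All agree", "Perfect"),
    ("traitor", "Partial agree", "High"),
    ("respect", "All agree", "Perfect"),
    ("disrespect", "All agree", "Perfect"),
    ("sanctity", "All agree", "Perfect"),
    ("impurity", "Partial agree", "Low"),
]

# Count how often each label occurs across the 10 terms
agreement_counts = Counter(agreement for _, agreement, _ in rows)
performance_counts = Counter(performance for _, _, performance in rows)

# Share of terms on which the machine and both annotators all agree
full_agreement_rate = agreement_counts["All agree"] / len(rows)

print(dict(agreement_counts))
print(dict(performance_counts))
print(f"Full agreement: {full_agreement_rate:.0%}")
```

With only 10 terms, these proportions are descriptive rather than statistically robust.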

# Summary & Analysis

We conducted machine translation of selected Moral Foundations Dictionary (MFD) terms using Sarvam AI, translating from English to Tamil. To assess translation quality, we employed a back-translation strategy using Sarvam AI itself, allowing us to verify semantic fidelity by comparing the back-translated English terms with the original inputs.

In addition to machine-based verification, we incorporated human evaluation by engaging two independent human annotators fluent in both English and Tamil. Each annotator produced translations for the same set of moral terms. This enabled us to assess inter-annotator agreement as well as machine–human alignment, providing a more robust evaluation of translation quality beyond automated metrics.

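Overall, Sarvam AI aligned closely with the human translations, particularly for commonly used moral concepts: MT performance was rated Perfect for 6 of the 10 terms, High for 3, and Low for 1 (impurity). Disagreements arose primarily from lexical variation, synonym choice, or differences in abstraction rather than outright mistranslation. In back-translation, 8 of the 10 terms were judged semantic matches; the two failures (கருணை → Karunai, மாசு → Masu) were transliterations rather than translations. This suggests that Sarvam AI is generally reliable for translating moral vocabulary, though human judgment remains valuable for nuanced or less frequent terms.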