This notebook evaluates `kannada_mfd.dic` through the following steps:
1. Select 10 random words from the Kannada MFD (saved as `kannada_samples.dic`) and obtain their corresponding English MFD terms.
2. Back-translate the Kannada terms to English using Sarvam AI and compare them with the original English terms.
3. Have human translators translate the English terms to Kannada, compare the results with the machine-translated Kannada terms, and calculate inter-annotator agreement (IAA).

In [1]:
!git clone https://github.com/anagha1999/anlp-project/

Cloning into 'anlp-project'...
remote: Enumerating objects: 252, done.
remote: Counting objects: 100% (229/229), done.
remote: Compressing objects: 100% (188/188), done.
remote: Total 252 (delta 94), reused 150 (delta 40), pack-reused 23 (from 1)
Receiving objects: 100% (252/252), 43.44 MiB | 12.52 MiB/s, done.
Resolving deltas: 100% (100/100), done.


In [2]:
import os

# Navigate to the repository directory
repo_dir = 'anlp-project'
if os.path.exists(repo_dir):
    os.chdir(repo_dir)
    print(f"Changed directory to: {os.getcwd()}")
    # Pull the latest changes from the remote repository
    !git pull
    # Navigate back to the original directory if needed
    os.chdir('..')
    print(f"Changed directory back to: {os.getcwd()}")
else:
    print(f"Repository directory '{repo_dir}' not found. Please ensure it is cloned.")

Changed directory to: /Users/anagha/Documents/fall2025/anlp-project/kannada/anlp-project
Already up to date.
Changed directory back to: /Users/anagha/Documents/fall2025/anlp-project/kannada


# Task
Evaluate the back-translation accuracy of the Sarvam AI translation model from Kannada to English using 10 randomly sampled aligned word pairs from "anlp-project/mfd2.0.dic" and "anlp-project/kannada/kannada_mfd.dic".

## Load and Preprocess English MFD

### Subtask:
Load the 'anlp-project/mfd2.0.dic' file, filter out comment and category lines (starting with '%' or a digit), and extract the first token as the English word from each remaining line.


**Reasoning**:
To achieve the subtask of loading, filtering, and extracting words from the specified file, I will use Python code to read the file line by line, apply the filtering logic, and store the extracted words in a list.



In [3]:
import os

file_path = 'anlp-project/mfd2.0.dic'
english_words = []

if os.path.exists(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip() # Remove leading/trailing whitespace
            if line and not (line.startswith('%') or line[0].isdigit()):
                # Extract the first token (word)
                first_token = line.split()[0]
                english_words.append(first_token)
    print(f"Successfully loaded {len(english_words)} English words.")
    print("First 10 extracted words:", english_words[:10])
else:
    print(f"Error: File not found at {file_path}")

Successfully loaded 2104 English words.
First 10 extracted words: ['compassion', 'empathy', 'kindness', 'caring', 'generosity', 'benevolence', 'altruism', 'compassionate', 'nurture', 'gentleness']


## Load and Preprocess Kannada MFD

### Subtask:
Load the 'anlp-project/kannada/kannada_mfd.dic' file, filter out comment and category lines (starting with '%' or a digit), and extract the first token as the Kannada word from each remaining line.

In [4]:
import os

kannada_file_path = 'anlp-project/kannada/kannada_mfd.dic' # Corrected filename
kannada_words = []

if os.path.exists(kannada_file_path):
    with open(kannada_file_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip() # Remove leading/trailing whitespace
            if line and not (line.startswith('%') or line[0].isdigit()):
                # Extract the first token (word)
                first_token = line.split()[0]
                kannada_words.append(first_token)
    print(f"Successfully loaded {len(kannada_words)} Kannada words.")
    print("First 10 extracted Kannada words:", kannada_words[:10])
else:
    print(f"Error: File not found at {kannada_file_path}")

Successfully loaded 2108 Kannada words.
First 10 extracted Kannada words: ['#', '#', '#', '#', 'ಕರುಣೆ', 'ಸಹಾನುಭೂತಿ', 'ದಯೆ', 'ಕಾಳಜಿ', 'ಉದಾರತೆ', 'ಪರೋಪಕಾರ']


**Reasoning**:
The previous output for `kannada_words` shows several '#' characters at the beginning, which are not actual words. To ensure data quality and relevance for the task, these non-word characters need to be removed from the `kannada_words` list.



In [5]:
cleaned_kannada_words = [word for word in kannada_words if word != '#']

print(f"Successfully cleaned {len(kannada_words) - len(cleaned_kannada_words)} non-word entries.")
print(f"Total Kannada words after cleaning: {len(cleaned_kannada_words)}")
print("First 10 cleaned Kannada words:", cleaned_kannada_words[:10])

Successfully cleaned 4 non-word entries.
Total Kannada words after cleaning: 2104
First 10 cleaned Kannada words: ['ಕರುಣೆ', 'ಸಹಾನುಭೂತಿ', 'ದಯೆ', 'ಕಾಳಜಿ', 'ಉದಾರತೆ', 'ಪರೋಪಕಾರ', 'ಪರೋಪಕಾರ', 'ಕರುಣಾಮಯಿ', 'ಪೋಷಣೆ', 'ಸೌಮ್ಯತೆ']
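
The two loading cells and the `#` cleanup above repeat the same logic, so they could be folded into one helper. The following is a minimal sketch, not the notebook's actual code: `load_dic_words` and its `skip_tokens` parameter are hypothetical names, and the demo file contents are invented purely to exercise the filter.

```python
import os
import tempfile


def load_dic_words(path, skip_tokens=("#",)):
    """Return the first token of every entry line in a .dic file.

    Lines that are empty, start with '%', or start with a digit are skipped,
    mirroring the filter used in the cells above. Tokens listed in
    skip_tokens (assumed here to be just '#') are dropped as well.
    """
    words = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("%") or line[0].isdigit():
                continue
            token = line.split()[0]
            if token in skip_tokens:
                continue
            words.append(token)
    return words


# Demo on a throwaway file with made-up contents (not the real dictionary).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.dic")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("%\n1\tcare.virtue\n%\n# comment\nಕರುಣೆ\t1\nದಯೆ\t1\n")
    demo_words = load_dic_words(demo_path)

print(demo_words)
```

With a helper like this, the English and Kannada lists would come from the same code path, which makes a length mismatch easier to attribute to the files rather than to the parsing.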


**Reasoning**:
Both the English and cleaned Kannada word lists have been loaded and have the same length (2104 entries), which suggests, but does not prove, that they are aligned line by line. The next step is to combine them into pairs and then randomly sample 10 of these pairs for the back-translation evaluation specified in the main task.



In [6]:
# import random

# aligned_word_pairs = list(zip(english_words, cleaned_kannada_words))

# # Ensure we don't try to sample more pairs than available
# num_samples = min(10, len(aligned_word_pairs))

# sampled_pairs = random.sample(aligned_word_pairs, num_samples)

# print(f"Created {len(aligned_word_pairs)} aligned English-Kannada word pairs.")
# print(f"Successfully sampled {num_samples} pairs for evaluation.")
# print(sampled_pairs)


In [7]:
# import os
# import pickle

# save_dir = 'anlp-project/kannada'
# os.makedirs(save_dir, exist_ok=True)

# # Save as .dic file
# dic_file_path = os.path.join(save_dir, 'kannada_samples.dic')
# with open(dic_file_path, 'w', encoding='utf-8') as f:
#     for eng_word, kan_word in sampled_pairs:
#         f.write(f"{kan_word}\t{eng_word}\n")
# print(f"Saved sampled pairs to {dic_file_path}")

# # Save as .pkl file
# pkl_file_path = os.path.join(save_dir, 'kannada_samples.pkl')
# with open(pkl_file_path, 'wb') as f:
#     pickle.dump(sampled_pairs, f)
# print(f"Saved sampled pairs to {pkl_file_path}")

In [8]:
import pickle
import os

pkl_file_path = 'anlp-project/kannada/kannada_samples.pkl'

if os.path.exists(pkl_file_path):
    with open(pkl_file_path, 'rb') as f:
        sampled_pairs = pickle.load(f)
    print(f"Successfully loaded {len(sampled_pairs)} sampled pairs from {pkl_file_path}")
    print("First 5 loaded pairs:")
    for i, (eng, kan) in enumerate(sampled_pairs[:5]):
        print(f"{i+1}. English: {eng}, Kannada: {kan}")
else:
    print(f"Error: Pickle file not found at {pkl_file_path}")


Successfully loaded 10 sampled pairs from anlp-project/kannada/kannada_samples.pkl
First 5 loaded pairs:
1. English: prejudicing, Kannada: ಪೂರ್ವಾಗ್ರಹ
2. English: curses, Kannada: ಶಾಪ
3. English: honest, Kannada: ಪ್ರಾಮಾಣಿಕ
4. English: blood, Kannada: ರಕ್ತ
5. English: double, Kannada: ಬೆಂಬಲವಂಚನೆ


**Reasoning**:
To install the `transformers` library, I will use the `pip install` command within a code cell.



In [9]:
pip install transformers

Note: you may need to restart the kernel to use updated packages.


**Reasoning**:
Now that the `transformers` library is installed, the next cell loads the tokenizer and model for 'sarvamai/sarvam-translate' with `AutoTokenizer` and `AutoModelForCausalLM`, and defines a Kannada-to-English translation function. Note that the function itself calls Sarvam's hosted translation API (model `mayura:v1`) rather than the locally loaded model.



In [10]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Model details
model_name = "sarvamai/sarvam-translate"
tgt_lang = "English"  # English in Latin script
source_lang = "Kannada"  # Kannada in Kannada script

# Check for GPU availability
device = 0 if torch.cuda.is_available() else -1
print(f"Using device: {'cuda' if device == 0 else 'cpu'}")

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)


import requests

def sarvam_ai_kannada_to_english(kannada_word):
    """
    Uses Sarvam's official API for concise translations
    """
    url = "https://api.sarvam.ai/translate"

    payload = {
        "input": kannada_word,
        "source_language_code": "kn-IN",  # Kannada
        "target_language_code": "en-IN",  # English
        "speaker_gender": "Male",
        "mode": "formal",
        "model": "mayura:v1"
    }

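    headers = {
        "Content-Type": "application/json",
        # Read the key from the environment instead of hard-coding a secret;
        # SARVAM_API_KEY is an assumed variable name.
        "API-Subscription-Key": os.environ["SARVAM_API_KEY"]
    }

    response = requests.post(url, json=payload, headers=headers, timeout=30)
    response.raise_for_status()  # Surface HTTP errors instead of returning ""
    result = response.json()

    return result.get("translated_text", "").strip()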


print("Sarvam AI translation model and function loaded.")




Using device: cpu


Loading checkpoint shards: 100%|██████████| 2/2 [00:26<00:00, 13.22s/it]


Sarvam AI translation model and function loaded.
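
Because the dictionary contains repeated entries (for example, ಪರೋಪಕಾರ appears twice in the first ten cleaned words), a translation function that calls a remote API may benefit from caching. The sketch below shows one way to do that with `functools.lru_cache`; `make_cached_translator` and `fake_translate` are hypothetical names, and the fake translator stands in for the real API call so the example runs offline.

```python
from functools import lru_cache


def make_cached_translator(translate_fn):
    """Wrap a word-level translation function so repeated words hit a cache."""
    @lru_cache(maxsize=None)
    def cached(word):
        return translate_fn(word)
    return cached


# Stand-in for the real API call; it only records how often it is invoked.
calls = []


def fake_translate(word):
    calls.append(word)
    return word.upper()


cached_translate = make_cached_translator(fake_translate)
results = [cached_translate(w) for w in ["a", "b", "a"]]

print(results)     # translations for all three inputs
print(len(calls))  # number of times the underlying function actually ran
```

In the notebook, the same wrapper could in principle be applied to `sarvam_ai_kannada_to_english`, so that each distinct Kannada word is sent to the API only once.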


In [11]:
back_translated_kannada = []
for english_word, kannada_word in sampled_pairs:
    back_translated_english = sarvam_ai_kannada_to_english(kannada_word)
    back_translated_kannada.append({
        "original_english": english_word,
        "kannada_word": kannada_word,
        "translated_english": back_translated_english
    })
print(back_translated_kannada)


[{'original_english': 'prejudicing', 'kannada_word': 'ಪೂರ್ವಾಗ್ರಹ', 'translated_english': 'Prejudice'}, {'original_english': 'curses', 'kannada_word': 'ಶಾಪ', 'translated_english': 'Curse'}, {'original_english': 'honest', 'kannada_word': 'ಪ್ರಾಮಾಣಿಕ', 'translated_english': 'Honest'}, {'original_english': 'blood', 'kannada_word': 'ರಕ್ತ', 'translated_english': 'Blood'}, {'original_english': 'double', 'kannada_word': 'ಬೆಂಬಲವಂಚನೆ', 'translated_english': 'Suppression'}, {'original_english': 'sympathizers', 'kannada_word': 'ಸಹಾನುಭೂತಿಗಳು', 'translated_english': 'Sympathies'}, {'original_english': 'repenting', 'kannada_word': 'ಪಶ್ಚಾತ್ತಾಪ', 'translated_english': 'Repentance'}, {'original_english': 'repulses', 'kannada_word': 'ಹಿಮ್ಮೆಟ್ಟಿಸುವುದು', 'translated_english': 'Retreat'}, {'original_english': 'vomiting', 'kannada_word': 'ವಾಂತಿ', 'translated_english': 'Vomiting'}, {'original_english': 'heresies', 'kannada_word': 'ಧರ್ಮಭ್ರಷ್ಟತೆಗಳು', 'translated_english': 'Abandonment'}]


## Back-Translation

In [None]:
print(f"{'Original English':<20} | {'Kannada Word':<20} | {'Translated English':<30}")
print(f"{'':-<20}-+-{'':-<20}-+-{'':-<30}")
for item in back_translated_kannada:
    print(f"{item['original_english']:<20} | {item['kannada_word']:<20} | {item['translated_english']:<30}")

Original English     | Kannada Word         | Translated English            
---------------------+----------------------+-------------------------------
prejudicing          | ಪೂರ್ವಾಗ್ರಹ           | Prejudice                     
curses               | ಶಾಪ                  | Curse                         
honest               | ಪ್ರಾಮಾಣಿಕ            | Honest                        
blood                | ರಕ್ತ                 | Blood                         
double               | ಬೆಂಬಲವಂಚನೆ           | Suppression                   
sympathizers         | ಸಹಾನುಭೂತಿಗಳು         | Sympathies                    
repenting            | ಪಶ್ಚಾತ್ತಾಪ           | Repentance                    
repulses             | ಹಿಮ್ಮೆಟ್ಟಿಸುವುದು     | Retreat                       
vomiting             | ವಾಂತಿ                | Vomiting                      
heresies             | ಧರ್ಮಭ್ರಷ್ಟತೆಗಳು      | Abandonment                   


## Back-Translation Results

| Original English | Kannada Word | Translated English | Semantic Match |
|:-----------------|:-------------|:-------------------|:--------------|
| prejudicing      | ಪೂರ್ವಾಗ್ರಹ    | Prejudice          | Yes |
| curses           | ಶಾಪ           | Curse              | Yes |
| honest           | ಪ್ರಾಮಾಣಿಕ      | Honest             | Yes|
| blood            | ರಕ್ತ          | Blood              | Yes |
| double           | ಬೆಂಬಲವಂಚನೆ    | Suppression        | No|
| sympathizers     | ಸಹಾನುಭೂತಿಗಳು  | Sympathies         | Yes| 
| repenting        | ಪಶ್ಚಾತ್ತಾಪ     | Repentance         | Yes|
| repulses         | ಹಿಮ್ಮೆಟ್ಟಿಸುವುದು| Retreat            | No |
| vomiting         | ವಾಂತಿ         | Vomiting           | Yes| 
| heresies         | ಧರ್ಮಭ್ರಷ್ಟತೆಗಳು| Abandonment        | No |

The back-translation validation shows a 70% semantic preservation rate: seven of the ten terms kept their core meaning through the translation and back-translation cycle. The successful translations cluster around two categories: concrete physical terms ("blood," "vomiting") and well-established moral-psychological concepts with direct Kannada equivalents ("honest," "prejudice," "curse," "repentance"). On this small sample, the results suggest that Sarvam AI performs reliably on vocabulary with clear referential meanings or culturally widespread moral concepts that have stable lexical correspondences between English and Kannada. The morphological variation observed in "sympathies" (back-translated from the Kannada for "sympathizers") was accepted as semantically equivalent, since both terms derive from the same affective concept, although the back-translation shifted from an agent noun to an abstract noun.

The three failed back-translations, "double" (→ suppression), "repulses" (→ retreat), and "heresies" (→ abandonment), expose systematic challenges in automated moral vocabulary translation. For "double", the Kannada translation (ಬೆಂಬಲವಂಚನೆ, *bembalawanchane* = betrayal/double-cross) captured the contextually appropriate moral meaning, but the back-translation step failed to recognize the idiom and returned "suppression". The forward translation was therefore semantically appropriate despite the apparent back-translation failure. In contrast, "repulses" and "heresies" reflect genuine polysemy and cultural complexity: the machine selected a physical-spatial interpretation ("retreat") over the moral-emotional register ("disgust") for "repulses," while "heresies" ran into the difficulty of translating a Western religious concept into a Kannada cultural-religious framework where multiple non-equivalent terms exist. These failures indicate that abstract, context-dependent, or culturally embedded moral vocabulary requires validation beyond automated back-translation, particularly for the Sanctity and Authority foundations, where emotion and religious terminology predominate.
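
The 70% figure can be reproduced directly from the manual judgements. The sketch below simply re-enters the "Semantic Match" column of the results table as a dictionary (the variable names are illustrative, not from the notebook) and derives the rate and the list of failed terms.

```python
# Manual semantic-match judgements copied from the results table
# (True = meaning preserved after back-translation).
semantic_match = {
    "prejudicing": True,
    "curses": True,
    "honest": True,
    "blood": True,
    "double": False,
    "sympathizers": True,
    "repenting": True,
    "repulses": False,
    "vomiting": True,
    "heresies": False,
}

preserved = [word for word, ok in semantic_match.items() if ok]
failed = [word for word, ok in semantic_match.items() if not ok]
preservation_rate = len(preserved) / len(semantic_match)

print(f"Preserved: {len(preserved)}/{len(semantic_match)} ({preservation_rate:.0%})")
print("Failed back-translations:", failed)
```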



## Machine-Human Translation Comparison

| English      | Sarvam AI        | Human 1            | Human 2            | Agreement Type | MT Performance | Notes                                                                                                                           |
| ------------ | ---------------- | ------------------ | ------------------ | -------------- | -------------- | ------------------------------------------------------------------------------------------------------------------------------- |
| prejudicing  | ಪೂರ್ವಾಗ್ರಹ       | ಪೂರ್ವಗ್ರಹಿಕೆ       | ಪೂರ್ವಾಗ್ರಹ         | All agree       | High           | Machine matched one expert translator; minor morphological variation between all three versions                                 |
| curses       | ಶಾಪ              | ಶಾಪ                | ಶಾಪಗಳು             | All agree      | Perfect        | Complete semantic consensus across machine and both humans; only singular/plural difference                                     |
| honest       | ಪ್ರಾಮಾಣಿಕ        | ಪ್ರಾಮಾಣಿಕ          | ಪ್ರಾಮಾಣಿಕ          | All agree        | Perfect        | Identical translation across all three sources                                                                                  |
| blood        | ರಕ್ತ             | ರಕ್ತ               | ರಕ್ತ               | All agree        | Perfect        | Identical translation across all three sources                                                                                  |
| double       | ಬೆಂಬಲವಂಚನೆ       | ದುಪಟ್ಟಾ            | ದ್ವಿಗುಣ            | None           | Contextual     | Machine captured betrayal context relevant to Loyalty foundation; humans interpreted literal quantity senses                    |
| sympathizers | ಸಹಾನುಭೂತಿಗಳು     | ಸಹಾನುಭೂತಿ          | ಸಹಾನುಭೂತಿದಾರರು     | All agree     | Perfect        | All three share semantic root; variation in grammatical form (abstract noun vs agent noun)                                      |
| repenting    | ಪಶ್ಚಾತ್ತಾಪ       | ಪಶ್ಚಾತ್ತಾಪ ಪಡುವುದು | ಪಶ್ಚಾತ್ತಾಪ ಪಡುವುದು | All agree     | Perfect        | All three share semantic root; machine used noun form while humans used verb phrase                                             |
| repulses     | ಹಿಮ್ಮೆಟ್ಟಿಸುವುದು | ಅಸಹ್ಯ ಗೊಲ್ಲು       | ವಿಕರ್ಷಿಸುತ್ತದೆ     | None           | Low            | Complete divergence: retreat (physical) vs disgust (emotional) vs repulsion (magnetic); machine missed moral-emotional register |
| vomiting     | ವಾಂತಿ            | ವಾಂತಿ              | ವಾಂತಿ ಮಾಡುವುದು     | All agree   | Perfect        | All three share core term; humans added verb suffix for grammatical completeness                                                |
| heresies     | ಧರ್ಮಭ್ರಷ್ಟತೆಗಳು  | ಧರ್ಮದ್ರೋಹಿಗಳು      | ಮತಭ್ರಾಂತಿಗಳು       | None           | Low            | Complete divergence reflecting cultural-religious complexity: apostasy vs betrayal vs false belief                              |

## Summary & Analysis

A systematic evaluation of 10 Kannada translations from the English Moral Foundation Dictionary (MFD) was conducted using Sarvam AI machine translation, validated through back-translation and dual human annotation. The analysis reveals both the strengths and limitations of automated translation for moral vocabulary, with particular insights into polysemy resolution and contextual interpretation.

---

### Key Findings

#### 1. Translation Accuracy by Word Type

**Concrete and well-established vocabulary** achieved **100% back-translation accuracy** and agreement across all three translations:
- "blood" (ರಕ್ತ)
- "vomiting" (ವಾಂತಿ)
- "honest" (ಪ್ರಾಮಾಣಿಕ)
- "curses" (ಶಾಪ)

**Morphological variants** of the same root were counted as matches:
- "sympathizers": ಸಹಾನುಭೂತಿಗಳು / ಸಹಾನುಭೂತಿ / ಸಹಾನುಭೂತಿದಾರರು; all derive from the root *sahānubhūti*
- "repenting": ಪಶ್ಚಾತ್ತಾಪ (noun) / ಪಶ್ಚಾತ್ತಾಪ ಪಡುವುದು (verb phrase)
- "prejudicing": ಪೂರ್ವಾಗ್ರಹ (machine, Human 2) / ಪೂರ್ವಗ್ರಹಿಕೆ (Human 1)

**Result:** 7/10 words show agreement across all three translations once morphological variation is accepted. Counting "double", where the machine translation was judged contextually appropriate despite disagreeing with both humans, the machine output reaches semantic equivalence for 8/10 words (80%).

**Polysemous terms** requiring contextual disambiguation exhibited systematic challenges:
- "double" and "repulses" — multiple valid interpretations create translation divergence
- "heresies" — cultural-religious vocabulary complexity

---

#### 2. Polysemy and Contextual Interpretation Challenges

**"Double":**
- Sarvam AI's translation ಬೆಂಬಲವಂಚನೆ (*bembalawanchane*) captured the idiomatic meaning of "double-cross" or betrayal — the intended sense in MFD's Loyalty/Betrayal foundation
- Both human translators interpreted "double" literally: ದುಪಟ್ಟಾ (*duppattu* = twice as much) and ದ್ವಿಗುಣ (*dwiguna* = duplicate)
- Machine translation was preferred in this case, correctly inferring moral-conceptual context
---
**"Repulses":**

Three translations captured distinct but valid senses of the polysemous English term. The machine translation was ಹಿಮ್ಮೆಟ್ಟಿಸುವುದು (retreat; physical repulsion), and the human translations were (i) ಅಸಹ್ಯ ಗೊಲ್ಲು (repulsive/yuck; disgust emotion) and (ii) ವಿಕರ್ಷಿಸುತ್ತದೆ (repulsion; magnetic force). For the MFD's Sanctity/Degradation foundation, the disgust interpretation (ಅಸಹ್ಯ ಗೊಲ್ಲು) is the contextually appropriate one, yet the machine and the other human translator selected physical or force-based interpretations. This finding highlights that polysemous emotion terms require explicit disambiguation cues so that translators select the morally relevant sense.

---
**"Heresies":**

The translation of "heresies" revealed substantial cultural mediation challenges inherent in religious terminology.
- The machine translation was ಧರ್ಮಭ್ರಷ್ಟತೆಗಳು (*dharmabhraṣṭategaḷu* = religious apostasy/fall from dharma), which emphasizes deviation from orthodox practice.
- The human translations were:
  - (i) ಧರ್ಮದ್ರೋಹಿಗಳು (*dharmadrōhigaḷu* = religious traitors/betrayers), framing "heresy" as active betrayal
  - (ii) ಮತಭ್ರಾಂತಿಗಳು (*matabhrāntigaḷu* = doctrinal delusions/false beliefs), focusing on cognitive error

Each emphasizes a different conceptual dimension of heterodoxy, showing that the English term "heresy" spans a conceptual range not captured by any single Kannada term.
This has implications for the Authority and Sanctity foundations, which rely heavily on religiously coded vocabulary that may require compound phrases or glosses rather than single-word translations.


---
An interesting finding was for the word **"Sympathizers"** — the Kannada root ಸಹಾನುಭೂತಿ (*sahānubhūti*) literally denotes "compassion". However, all translations rendered it as "sympathy" rather than "compassion", which is a case of **semantic conventionalization** where the term's etymological meaning has pragmatically narrowed.

---

#### 3. Inter-Annotator Agreement and Method Convergence

- **80% semantic agreement** when accounting for morphological variants
- Remaining disagreements ("repulses," "heresies") represent genuine polysemy, not translation error
- "Double" shows machine-human disagreement doesn't indicate machine failure

**Validation:** Terms failing back-translation (double, repulses, heresies) also showed human interpretation diversity — confirming these words require explicit disambiguation.
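
The convergence claim can be checked numerically. The sketch below compares two signals taken from the tables in this notebook: the manual back-translation verdict ("Semantic Match") and whether all three translations agreed ("Agreement Type"). It computes Cohen's kappa between them with a small hand-written function. This is an illustration under stated assumptions: it measures agreement between the two validation methods, not pairwise agreement between the two human translators (the notebook does not record pairwise labels), and with only 10 items the estimate is very coarse.

```python
words = ["prejudicing", "curses", "honest", "blood", "double",
         "sympathizers", "repenting", "repulses", "vomiting", "heresies"]

# "Semantic Match" column of the back-translation table (Yes -> True).
back_translation_ok = [True, True, True, True, False,
                       True, True, False, True, False]

# "Agreement Type" column of the comparison table ("All agree" -> True).
all_translators_agree = [True, True, True, True, False,
                         True, True, False, True, False]


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    expected = sum((a.count(label) / n) * (b.count(label) / n)
                   for label in set(a) | set(b))
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


kappa = cohen_kappa(back_translation_ok, all_translators_agree)
flagged_by_both = [w for w, bt, ag in
                   zip(words, back_translation_ok, all_translators_agree)
                   if not bt and not ag]

print(f"Cohen's kappa between the two validation signals: {kappa:.2f}")
print("Flagged by both methods:", flagged_by_both)
```

The two label vectors are identical for this sample, so both methods flag the same three terms for review.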

---

### Implications for MFD Development

| Foundation | Challenge | Recommendation |
|:-----------|:----------|:---------------|
| **Sanctity/Degradation** | Emotion terms like "repulses" | Explicit disgust-framing to avoid physical interpretations |
| **Loyalty/Betrayal** | Idiomatic betrayal terms | Contextual cues to prevent literal interpretations |
| **Authority** | Religious vocabulary | Multiple-word glosses or cultural consultation |

---

### Conclusion

Sarvam AI achieves **80% semantic equivalence** for moral vocabulary when morphological variants are recognized as valid translations. The system excels at concrete, culturally universal terms and shows sophisticated contextual inference for idiomatic expressions. However, polysemous emotion terms and culturally embedded religious concepts require explicit disambiguation or cultural consultation. The convergence of back-translation and inter-annotator methods successfully identifies terms requiring additional review.

In [15]:
%pip install nltk

Collecting nltk
  Downloading nltk-3.9.2-py3-none-any.whl.metadata (3.2 kB)
Collecting click (from nltk)
  Downloading click-8.3.1-py3-none-any.whl.metadata (2.6 kB)
Downloading nltk-3.9.2-py3-none-any.whl (1.5 MB)
Downloading click-8.3.1-py3-none-any.whl (108 kB)
Installing collected packages: click, nltk
Successfully installed click-8.3.1 nltk-3.9.2
Note: you may need to restart the kernel to use updated packages.


In [16]:
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Download necessary NLTK data if not already present
try:
    stopwords.words('english')
except LookupError:
    nltk.download('stopwords')

try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

english_stopwords = set(stopwords.words('english'))
stemmer = PorterStemmer()

def normalize_text(text):
    """
    Converts string to lowercase, strips whitespace, removes basic punctuation, removes stopwords, and applies stemming.
    """
    text = text.lower() # Convert to lowercase
    text = text.strip() # Remove leading/trailing whitespace
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize, remove stopwords, and stem
    words = text.split()
    filtered_words = [word for word in words if word not in english_stopwords]
    stemmed_words = [stemmer.stem(word) for word in filtered_words]
    return ' '.join(stemmed_words)

exact_matches = 0
total_words = len(back_translated_kannada)

for word in back_translated_kannada:
    normalized_original_english = normalize_text(word['original_english'])
    normalized_translated_english = normalize_text(word['translated_english'])

    if normalized_original_english == normalized_translated_english:
        exact_matches += 1

accuracy_score = exact_matches / total_words

print(f"Total sampled pairs: {total_words}")
print(f"Exact matches: {exact_matches}")
print(f"Exact-match accuracy: {accuracy_score:.2f}")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/anagha/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /Users/anagha/nltk_data...


Total sampled pairs: 10
Exact matches: 5
Exact-match accuracy: 0.50


[nltk_data]   Unzipping tokenizers/punkt.zip.
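
The stemmed exact-match score (0.50) is lower than the manual 70% judgement, which means two of the pairs judged semantically equivalent do not reduce to identical stems. A looser, dependency-free heuristic is to treat two words as related when they share a sufficiently long prefix. The sketch below is only an illustration: `shares_stem` is a hypothetical helper, and the 5-character threshold is an arbitrary assumption that would need tuning on a larger sample.

```python
import os

# (original English, back-translated English) pairs from the results above.
pairs = [
    ("prejudicing", "Prejudice"),
    ("curses", "Curse"),
    ("honest", "Honest"),
    ("blood", "Blood"),
    ("double", "Suppression"),
    ("sympathizers", "Sympathies"),
    ("repenting", "Repentance"),
    ("repulses", "Retreat"),
    ("vomiting", "Vomiting"),
    ("heresies", "Abandonment"),
]

MIN_SHARED_PREFIX = 5  # assumed threshold, chosen for illustration only


def shares_stem(original, back_translated, min_prefix=MIN_SHARED_PREFIX):
    """Return True if the two words share a prefix of at least min_prefix characters."""
    a = original.lower().strip()
    b = back_translated.lower().strip()
    # os.path.commonprefix compares strings character by character.
    return len(os.path.commonprefix([a, b])) >= min_prefix


prefix_matches = [orig for orig, back in pairs if shares_stem(orig, back)]
prefix_accuracy = len(prefix_matches) / len(pairs)

print(f"Prefix-based matches: {len(prefix_matches)}/{len(pairs)}")
print(f"Prefix-based accuracy: {prefix_accuracy:.2f}")
```

On these ten pairs the heuristic agrees with the manual semantic judgements, but a shared prefix is a crude proxy for meaning and can produce false positives on unrelated words that happen to start alike.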
