<a href="https://colab.research.google.com/github/anagha1999/anlp-project/blob/main/tamil/1.Generate_Tamil_MFD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Translate MFD to Tamil

This notebook:
1. Loads the English MFD 2.0 and its pre-translated Tamil counterpart
2. Generates MFD word embeddings and master moral vectors in Tamil
3. Writes the Tamil dictionary to a reusable `.dic` file

Outputs:
1. `tamil_mfd.dic`: Reusable Moral Foundations Dictionary in Tamil, in human-readable format.
2. `tamil_mfd.pkl`: Moral Foundations Dictionary in Tamil, in machine-readable pickle format. This file is loaded here as input; the translation itself was run beforehand.
3. `tamil_mfd_embeddings.pkl`: Word embeddings of the Tamil dictionary, grouped by moral foundation.
4. `tamil_master_moral_vectors.pkl`: One master vector per moral foundation in Tamil.

In [1]:
# Download English MFD 2.0 dictionary
!wget https://raw.githubusercontent.com/medianeuroscience/emfd/refs/heads/master/dictionaries/mfd2.0.dic

--2025-12-16 06:17:00--  https://raw.githubusercontent.com/medianeuroscience/emfd/refs/heads/master/dictionaries/mfd2.0.dic
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24574 (24K) [text/plain]
Saving to: ‘mfd2.0.dic’


2025-12-16 06:17:00 (2.12 MB/s) - ‘mfd2.0.dic’ saved [24574/24574]



## Load English MFD 2.0

In [2]:
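MFD2 = 'mfd2.0.dic'
nummap = dict()  # numeric code -> foundation name, e.g. '1' -> 'care.virtue'
mfd2 = dict()    # foundation name -> list of English words
wordmode = True

with open(MFD2, 'r', encoding='utf-8') as f:
    for line in f.readlines():
        ent = line.strip().split()
        # Each line starting with '%' toggles between the category and word sections
        if line[0] == '%':
            wordmode = not wordmode
        elif len(ent) > 0:
            if wordmode:
                # Word line: the last token is the numeric code of the foundation
                moral = nummap[ent[-1]]
                if (moral not in mfd2.keys()):
                    mfd2[moral] = []
                # Join with a space so that a multi-word entry is not fused into one token
                mfd2[moral].append(' '.join([e for e in ent if e not in nummap.keys()]))
            else:
                # Category line: "<code> <foundation name>"
                nummap[ent[0]] = ent[1]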

print("English MFD 2.0 loaded successfully.")
print(f"Foundations: {list(mfd2.keys())}")
for foundation, words in mfd2.items():
    print(f"  {foundation}: {len(words)} words")

English MFD 2.0 loaded successfully.
Foundations: ['care.virtue', 'care.vice', 'fairness.virtue', 'fairness.vice', 'loyalty.virtue', 'loyalty.vice', 'authority.virtue', 'authority.vice', 'sanctity.virtue', 'sanctity.vice']
  care.virtue: 182 words
  care.vice: 288 words
  fairness.virtue: 115 words
  fairness.vice: 236 words
  loyalty.virtue: 142 words
  loyalty.vice: 49 words
  authority.virtue: 301 words
  authority.vice: 130 words
  sanctity.virtue: 272 words
  sanctity.vice: 388 words
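
The loader above depends on the layout of a `.dic` file: a first `%` line opens the category section (numeric code, then foundation name), a second `%` line closes it, and every later line holds an entry followed by its code. A minimal sketch of that toggle logic on an in-memory sample follows. The helper name `parse_dic_lines` and the sample entries are illustrative assumptions, not content from the MFD release.

```python
from collections import defaultdict


def parse_dic_lines(lines):
    """Parse .dic-style lines into (code -> name, name -> entries).

    Hypothetical helper for illustration; it mirrors the toggle-on-'%' logic.
    """
    code_to_name = {}
    words_by_foundation = defaultdict(list)
    in_word_section = True  # flips at every line that starts with '%'

    for line in lines:
        tokens = line.strip().split()
        if line.startswith('%'):
            in_word_section = not in_word_section
        elif tokens:
            if in_word_section:
                # Last token is the numeric code; the rest is the entry itself.
                *entry, code = tokens
                words_by_foundation[code_to_name[code]].append(' '.join(entry))
            else:
                code_to_name[tokens[0]] = tokens[1]
    return code_to_name, dict(words_by_foundation)


# Made-up sample in the same layout (not real dictionary content).
sample = [
    '%\n',
    '1\tcare.virtue\n',
    '2\tcare.vice\n',
    '%\n',
    'kindness\t1\n',
    'cruelty\t2\n',
    'tender care\t1\n',
]
codes, words = parse_dic_lines(sample)
print(codes)
print(words)
```

Splitting off only the last token as the code keeps a multi-word entry such as the made-up `tender care` intact.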


## [Tamil] Embeddings Model

In [3]:
from sentence_transformers import SentenceTransformer
EMBEDDINGS_MODEL_NAME='l3cube-pune/indic-sentence-similarity-sbert'
model = SentenceTransformer(EMBEDDINGS_MODEL_NAME)
print(f"✓ Model loaded: {EMBEDDINGS_MODEL_NAME}")



✓ Model loaded: l3cube-pune/indic-sentence-similarity-sbert


## Load Tamil MFD (Pre-translated)

Load the Tamil MFD from the pickle file `tamil_mfd.pkl`. The translation was produced beforehand in a separate run on a GPU, so this notebook only reads the result.

In [4]:
import pickle

# Load the pre-translated Tamil MFD
with open('tamil_mfd.pkl', 'rb') as f:
    translated_dictionary_tamil = pickle.load(f)

print("Tamil MFD loaded successfully.")
print(f"Loaded {len(translated_dictionary_tamil)} foundations:")
for foundation, words in translated_dictionary_tamil.items():
    print(f"  {foundation}: {len(words)} words")

Tamil MFD loaded successfully.
Loaded 10 foundations:
  care.virtue: 182 words
  care.vice: 288 words
  fairness.virtue: 115 words
  fairness.vice: 236 words
  loyalty.virtue: 142 words
  loyalty.vice: 49 words
  authority.virtue: 301 words
  authority.vice: 130 words
  sanctity.virtue: 272 words
  sanctity.vice: 388 words
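
The counts printed for the Tamil dictionary match the English counts foundation by foundation. A small check like the one sketched here could make that comparison explicit instead of relying on reading two printouts. The function name `compare_word_counts` and the toy dictionaries are assumptions made for illustration.

```python
def compare_word_counts(source, translated):
    """Return human-readable mismatches between two foundation -> words dicts."""
    problems = []
    for foundation in sorted(set(source) | set(translated)):
        if foundation not in translated:
            problems.append(f"{foundation}: missing from translation")
        elif foundation not in source:
            problems.append(f"{foundation}: not present in source")
        elif len(source[foundation]) != len(translated[foundation]):
            problems.append(
                f"{foundation}: {len(source[foundation])} source entries vs "
                f"{len(translated[foundation])} translated"
            )
    return problems


# Toy data with placeholder entries (not real dictionary content).
english = {'care.virtue': ['kindness', 'compassion'], 'care.vice': ['harm']}
tamil_ok = {'care.virtue': ['entry_1', 'entry_2'], 'care.vice': ['entry_3']}
tamil_short = {'care.virtue': ['entry_1']}

print(compare_word_counts(english, tamil_ok))
print(compare_word_counts(english, tamil_short))
```

In this notebook the call would be `compare_word_counts(mfd2, translated_dictionary_tamil)`; an empty list means both dictionaries have the same foundations and the same number of entries per foundation. It does not check translation quality.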


## Generate Tamil MFD Word Embeddings

In [5]:
# Generate word embeddings for each foundation
word_embeddings_tamil = {}
for foundation, words in translated_dictionary_tamil.items():
    word_embeddings_tamil[foundation] = model.encode(words)
    print(f"  {foundation}: encoded {len(words)} words → shape {word_embeddings_tamil[foundation].shape}")

print("\n✓ Word embeddings generated for all foundations.")

  care.virtue: encoded 182 words → shape (182, 768)
  care.vice: encoded 288 words → shape (288, 768)
  fairness.virtue: encoded 115 words → shape (115, 768)
  fairness.vice: encoded 236 words → shape (236, 768)
  loyalty.virtue: encoded 142 words → shape (142, 768)
  loyalty.vice: encoded 49 words → shape (49, 768)
  authority.virtue: encoded 301 words → shape (301, 768)
  authority.vice: encoded 130 words → shape (130, 768)
  sanctity.virtue: encoded 272 words → shape (272, 768)
  sanctity.vice: encoded 388 words → shape (388, 768)

✓ Word embeddings generated for all foundations.


In [6]:
# Save Tamil MFD embeddings
with open('tamil_mfd_embeddings.pkl', 'wb') as f:
    pickle.dump(word_embeddings_tamil, f)

print("✓ Saved Tamil MFD embeddings to 'tamil_mfd_embeddings.pkl'")

✓ Saved Tamil MFD embeddings to 'tamil_mfd_embeddings.pkl'


## [Tamil] Master Moral Vectors

In [7]:
import numpy as np

master_moral_vectors_tamil = {}
for foundation, embeddings in word_embeddings_tamil.items():
    master_moral_vectors_tamil[foundation] = np.mean(embeddings, axis=0)

print("Master Moral Vectors:")
for foundation, vector in master_moral_vectors_tamil.items():
    print(f"{foundation}: {vector[:5]}...") # Print first 5 elements for brevity

Master Moral Vectors:
care.virtue: [-0.01820922  0.01010052 -0.01616125 -0.00277066 -0.00108238]...
care.vice: [-0.01160204  0.00202883 -0.00275396  0.00212376 -0.00384181]...
fairness.virtue: [-0.01655583  0.00483554 -0.0107112  -0.00602367  0.00376076]...
fairness.vice: [ 0.00097255  0.00164303  0.00770801 -0.00599334 -0.00416056]...
loyalty.virtue: [-0.01360243  0.00682688 -0.00033667 -0.00468799  0.00723172]...
loyalty.vice: [-0.00057068 -0.0048489  -0.00019611 -0.00094322 -0.00780207]...
authority.virtue: [-0.01247829  0.01421615 -0.00244958 -0.00134895  0.00685352]...
authority.vice: [-0.00397229  0.00532196  0.00906202  0.00129553 -0.00250419]...
sanctity.virtue: [-0.01390889  0.01289998 -0.00074686 -0.00101416 -0.00827241]...
sanctity.vice: [-0.00581072  0.00650105  0.01670368 -0.00230699 -0.00139982]...
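
Each master moral vector is the element-wise mean (centroid) of the word embeddings of one foundation. One common way to compare a new embedding with such centroids is cosine similarity. The sketch below uses three-dimensional toy vectors; the function name `cosine_scores` and the scoring step itself are assumptions, since this notebook only builds and saves the vectors.

```python
import numpy as np


def cosine_scores(vector, master_vectors):
    """Cosine similarity between one vector and each foundation centroid."""
    v = np.asarray(vector, dtype=float)
    scores = {}
    for name, centroid in master_vectors.items():
        c = np.asarray(centroid, dtype=float)
        denom = np.linalg.norm(v) * np.linalg.norm(c)
        # Guard against zero-length vectors instead of dividing by zero.
        scores[name] = float(v @ c / denom) if denom > 0 else 0.0
    return scores


# Toy embeddings: two "words" per foundation, three dimensions each.
toy_embeddings = {
    'care.virtue': np.array([[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]),
    'care.vice': np.array([[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]]),
}
# Same construction as in the cell above: mean over the word axis.
toy_master = {name: np.mean(emb, axis=0) for name, emb in toy_embeddings.items()}

query = np.array([0.9, 0.1, 0.0])
scores = cosine_scores(query, toy_master)
best = max(scores, key=scores.get)
print(best, scores)
```

Cosine similarity ignores vector length, which matters here because the mean of several embeddings is generally shorter than the individual embeddings.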


In [8]:
# Save Tamil master moral vectors
with open('tamil_master_moral_vectors.pkl', 'wb') as f:
    pickle.dump(master_moral_vectors_tamil, f)

print("✓ Saved Tamil master moral vectors to 'tamil_master_moral_vectors.pkl'")

✓ Saved Tamil master moral vectors to 'tamil_master_moral_vectors.pkl'


# Create Reusable Tamil MFD .dic Artifact
Write the Tamil Moral Foundations Dictionary to `tamil_mfd.dic`. The file has two sections separated by `%` lines: a category section that maps each numerical code to its moral foundation name, and a word section that maps each translated Tamil word from `translated_dictionary_tamil` to the numerical code of its foundation. The last cell confirms that the file was created and reports its line count.

## Create Foundation to Number Mapping

Generate a reverse mapping from moral foundation names (e.g., 'care.virtue') to their numerical codes (e.g., '1') using the existing `nummap` dictionary. This mapping is essential for structuring the new `.dic` file.

In [9]:
foundation_to_num = {}
for num_code, foundation_name in nummap.items():
    foundation_to_num[foundation_name] = num_code

print("Foundation to Number Mapping:")
print(foundation_to_num)

Foundation to Number Mapping:
{'care.virtue': '1', 'care.vice': '2', 'fairness.virtue': '3', 'fairness.vice': '4', 'loyalty.virtue': '5', 'loyalty.vice': '6', 'authority.virtue': '7', 'authority.vice': '8', 'sanctity.virtue': '9', 'sanctity.vice': '10'}
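
Inverting a dictionary is only lossless when its values are unique; if two codes shared a foundation name, the later one would silently overwrite the earlier one. A minimal sketch of an inversion that checks this is shown here. The helper name `invert_mapping` is an assumption for illustration, and the sample mapping reuses two of the codes printed above.

```python
def invert_mapping(mapping):
    """Swap keys and values, refusing to drop entries on duplicate values."""
    inverted = {}
    for key, value in mapping.items():
        if value in inverted:
            raise ValueError(
                f"value {value!r} appears for both {inverted[value]!r} and {key!r}"
            )
        inverted[value] = key
    return inverted


sample_nummap = {'1': 'care.virtue', '2': 'care.vice'}
print(invert_mapping(sample_nummap))
```

For the ten distinct foundation names printed above, this produces the same result as the loop in the cell.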


## Generate tamil_mfd.dic File

With the reverse mapping in place, write `tamil_mfd.dic`: first the category section (numerical code and foundation name), then the word section (each translated Tamil word and the numerical code of its foundation).

In [10]:
output_filename = 'tamil_mfd.dic'

with open(output_filename, 'w', encoding='utf-8') as f:
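    # Write header
    # NOTE: these '#' comment lines are not part of the layout the loader for
    # mfd2.0.dic expects; a reader must skip lines starting with '#' before parsing.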
    f.write('# Moral Foundations Dictionary (MFD 2.0) - Tamil Translated\n')
    f.write('# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n')
    f.write('# % MORAL FOUNDATION CODING SCHEME %\n')
    f.write('# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n')
    f.write('%\n')
    
    # Write the numerical code to foundation name mapping
    for num_code, foundation_name in nummap.items():
        f.write(f'{num_code}\t{foundation_name}\n')

    f.write('%\n')
    
    # Write the translated Tamil words mapped to their numerical codes
    for foundation_name, tamil_words in translated_dictionary_tamil.items():
        num_code = foundation_to_num.get(foundation_name)
        if num_code:
            for word in tamil_words:
                f.write(f'{word}\t{num_code}\n')

print(f"Successfully created '{output_filename}' with Tamil MFD data.")

# Count lines for verification
with open(output_filename, 'r', encoding='utf-8') as f:
    line_count = len(f.readlines())
print(f"  Total lines in .dic file: {line_count}")

Successfully created 'tamil_mfd.dic' with Tamil MFD data.
  Total lines in .dic file: 2128
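
The line count can be cross-checked against what the writer emits: 4 `#` header lines, 2 `%` delimiter lines, 10 category lines, and one line per word. With the word counts printed earlier this gives 2119 lines, while the cell reports 2128. One possible explanation, which the notebook does not confirm, is that a few translated entries contain line breaks. The sketch below computes the expected count and lists such entries; both helper names and the toy dictionary are assumptions for illustration.

```python
def expected_line_count(word_counts, n_categories, n_header_lines=4, n_delimiters=2):
    """Lines written if every entry occupies exactly one line."""
    return n_header_lines + n_delimiters + n_categories + sum(word_counts)


def entries_with_line_breaks(dictionary):
    """List (foundation, entry) pairs whose entry spans more than one line."""
    return [
        (foundation, word)
        for foundation, words in dictionary.items()
        for word in words
        if '\n' in word or '\r' in word
    ]


# Word counts per foundation as printed when the Tamil MFD was loaded.
printed_counts = [182, 288, 115, 236, 142, 49, 301, 130, 272, 388]
expected = expected_line_count(printed_counts, n_categories=10)
print(expected)

# Toy dictionary with one deliberately broken placeholder entry.
toy = {'care.virtue': ['one'], 'care.vice': ['two\nlines']}
print(entries_with_line_breaks(toy))
```

Running `entries_with_line_breaks(translated_dictionary_tamil)` in the notebook would show whether such entries exist; if they do, stripping or replacing the line breaks before writing would keep one entry per line.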
