<a href="https://colab.research.google.com/github/anagha1999/anlp-project/blob/main/kannada/1.Generate_Kannada_MFD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Translate MFD to Kannada

This notebook:
1. Translates the English MFD to Kannada
2. Generates MFD Word Embeddings and Master Moral Vectors in Kannada

Outputs:
1. `kannada_mfd.dic`: Reusable Moral Foundations Dictionary (MFD) in Kannada, in human-readable format.
2. `kannada_mfd.pkl`: The Kannada MFD in machine-readable pickle format.
3. `kannada_mfd_embeddings.pkl`: The Kannada MFD with the word embeddings for each moral foundation.
4. `kannada_master_moral_vectors.pkl`: One master vector per moral foundation in Kannada.

In [1]:
"""Run this only if working on Colab"""
!git clone https://github.com/anagha1999/anlp-project/

Cloning into 'anlp-project'...
remote: Enumerating objects: 270, done.
remote: Counting objects: 100% (247/247), done.
remote: Compressing objects: 100% (204/204), done.
remote: Total 270 (delta 106), reused 157 (delta 42), pack-reused 23 (from 1)
Receiving objects: 100% (270/270), 43.47 MiB | 16.45 MiB/s, done.
Resolving deltas: 100% (112/112), done.


In [2]:
!wget https://raw.githubusercontent.com/medianeuroscience/emfd/refs/heads/master/dictionaries/mfd2.0.dic

--2025-12-16 14:36:41--  https://raw.githubusercontent.com/medianeuroscience/emfd/refs/heads/master/dictionaries/mfd2.0.dic
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24574 (24K) [text/plain]
Saving to: ‘mfd2.0.dic’


2025-12-16 14:36:41 (187 MB/s) - ‘mfd2.0.dic’ saved [24574/24574]



In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

def translate_dict_to_kannada(input_dict: dict) -> dict:
    """
    Translates the string values in a dictionary of lists to Kannada.

    Args:
        input_dict: A dictionary where keys are strings and values are lists of strings.
                    Example: {'moral': ['word1', 'word2'], 'moral2': ['word3']}

    Returns:
        A new dictionary with the same keys but with translated string values.
    """

    # Define the model name and target language
    model_name = "sarvamai/sarvam-translate"
    tgt_lang = "Kannada"

    # Load the tokenizer and model
    # The .to('cuda:0') part moves the model to the GPU for faster inference
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to('cuda:0')

    translated_dict = {}
    # Iterate over each key-value pair in the input dictionary
    for key, words in input_dict.items():
        translated_words = []
        # Iterate over each word in the list
        for word in words:
            # Create the prompt for the model using a chat template
            messages = [
                {"role": "system", "content": f"Translate the text below to {tgt_lang}."},
                {"role": "user", "content": word}
            ]

            # Apply the chat template to structure the conversation
            text = tokenizer.apply_chat_template(
                messages,
                tokenize=False,
                add_generation_prompt=True
            )

            # Tokenize the input and move it to the model's device (GPU)
            model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

            # Generate the translation
            generated_ids = model.generate(
                **model_inputs,
                max_new_tokens=128,  # Increased token limit for potentially longer words
                do_sample=True,
                temperature=0.01,
                num_return_sequences=1
            )

            # Decode the generated output to get the translated text
            output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
            output_text = tokenizer.decode(output_ids, skip_special_tokens=True)
            translated_words.append(output_text.strip())

        # Assign the list of translated words to the corresponding key
        translated_dict[key] = translated_words

    return translated_dict

# --- Load and parse mfd2.0.dic into 'mfd2' --- #
def load_english_mfd(filepath='mfd2.0.dic'):
    mfd = {}
    id_to_category = {}
    mode = 0  # 0: start, 1: categories, 2: words

    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue

            if line == '%':
                mode += 1
                continue

            if mode == 1:
                # Category Mapping: ID Name
                parts = line.split()
                if len(parts) >= 2:
                    cat_id = parts[0]
                    cat_name = parts[1]
                    id_to_category[cat_id] = cat_name
                    mfd[cat_name] = []

            elif mode == 2:
                # Word Mapping: Word ID
                parts = line.split('\t')  # Entries are normally tab-separated: word<TAB>ID
                if len(parts) < 2:
                    # Fallback for space-separated lines: split off only the trailing ID
                    # so that multi-word entries are kept whole
                    parts = line.rsplit(maxsplit=1)

                word = parts[0].strip()
                # Ensure we only pick valid category IDs
                cat_ids = [x.strip() for x in parts[1:] if x.strip() and x.strip() in id_to_category]

                for cat_id in cat_ids:
                    foundation = id_to_category[cat_id]
                    mfd[foundation].append(word)
    return mfd

mfd2 = load_english_mfd()
print("English MFD (mfd2) loaded successfully with the following foundations:")
for k, v in mfd2.items():
    print(f"  {k}: {len(v)} words")

# --- Translate the English MFD ---
# Requires a CUDA-capable GPU and the `transformers` package.
translated_dictionary_kannada = translate_dict_to_kannada(mfd2)

# Print the translated dictionary
print(translated_dictionary_kannada)


English MFD (mfd2) loaded successfully with the following foundations:
  care.virtue: 182 words
  care.vice: 288 words
  fairness.virtue: 115 words
  fairness.vice: 236 words
  loyalty.virtue: 142 words
  loyalty.vice: 49 words
  authority.virtue: 301 words
  authority.vice: 130 words
  sanctity.virtue: 272 words
  sanctity.vice: 388 words


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## [Kannada] Embeddings

In [None]:
from sentence_transformers import SentenceTransformer
EMBEDDINGS_MODEL_NAME='l3cube-pune/indic-sentence-similarity-sbert'
model = SentenceTransformer(EMBEDDINGS_MODEL_NAME)

In [None]:
import pickle

# Note: load_kannada_mfd() is defined in a later cell; run that cell first.
translated_dictionary_kannada = load_kannada_mfd()
print("Kannada MFD loaded successfully.")

Parsing dictionary...
Loaded 10 foundations: ['care.virtue', 'care.vice', 'fairness.virtue', 'fairness.vice', 'loyalty.virtue', 'loyalty.vice', 'authority.virtue', 'authority.vice', 'sanctity.virtue', 'sanctity.vice']
  care.virtue: 182 words
  care.vice: 288 words
  fairness.virtue: 115 words
  fairness.vice: 236 words
  loyalty.virtue: 143 words
  loyalty.vice: 49 words
  authority.virtue: 301 words
  authority.vice: 130 words
  sanctity.virtue: 272 words
  sanctity.vice: 388 words
Kannada MFD loaded successfully.


In [4]:
import requests

def load_kannada_mfd():
  # 1. Load kannada MFD
  url = 'https://raw.githubusercontent.com/anagha1999/anlp-project/refs/heads/main/kannada/kannada_mfd.dic'
  response = requests.get(url)
  response.raise_for_status()  # Fail early if the download did not succeed
  content = response.text.splitlines()

  # 2. Parse dictionary (LIWC format)
  translated_dictionary_kannada = {}
  id_to_category = {}
  mode = 0 # 0: start, 1: categories, 2: words

  print("Parsing dictionary...")
  for line in content:
      line = line.strip()
      if not line or line.startswith('#'):
          continue

      if line == '%':
          mode += 1
          continue

      if mode == 1:
          # Category Mapping: ID Name
          parts = line.split()
          if len(parts) >= 2:
              cat_id = parts[0]
              cat_name = parts[1]
              id_to_category[cat_id] = cat_name
              translated_dictionary_kannada[cat_name] = []

      elif mode == 2:
          # Word Mapping: Word ID
          # Handle potential spaces in words or tab separation
          if '\t' in line:
              parts = line.split('\t')
              word = parts[0].strip()
              cat_ids = [x.strip() for x in parts[1:] if x.strip()]
          else:
              # Fallback for space separation
              parts = line.split()
              # Assuming the last element is the ID
              if len(parts) >= 2 and parts[-1] in id_to_category:
                  word = " ".join(parts[:-1])
                  cat_ids = [parts[-1]]
              else:
                  continue

          for cat_id in cat_ids:
              if cat_id in id_to_category:
                  foundation = id_to_category[cat_id]
                  translated_dictionary_kannada[foundation].append(word)

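  print(f"Loaded {len(translated_dictionary_kannada)} foundations: {list(translated_dictionary_kannada.keys())}")
  for k, v in translated_dictionary_kannada.items():
      print(f"  {k}: {len(v)} words")

  # Return the parsed dictionary so callers can assign the result
  return translated_dictionary_kannada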

In [None]:
with open('anlp-project/kannada/kannada_mfd.pkl', 'wb') as f:
    pickle.dump(translated_dictionary_kannada, f)
print("Kannada MFD created as a pickle file.")

In [None]:
word_embeddings_kannada = {}
for foundation, words in translated_dictionary_kannada.items():
  word_embeddings_kannada[foundation] = model.encode(words)

# Save the per-foundation embeddings (opened for writing, not reading)
with open('anlp-project/kannada/kannada_mfd_embeddings.pkl', 'wb') as f:
    pickle.dump(word_embeddings_kannada, f)

NameError: name 'translated_dictionary_kannada' is not defined

## [Kannada] Master Moral Vectors

In [None]:
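import pickle

import numpy as np

master_moral_vectors_kannada = {}
for foundation, embeddings in word_embeddings_kannada.items():
    # Average all word embeddings of a foundation into a single vector
    master_moral_vectors_kannada[foundation] = np.mean(embeddings, axis=0)

# Save the master vectors (the fourth output listed at the top of this notebook)
with open('anlp-project/kannada/kannada_master_moral_vectors.pkl', 'wb') as f:
    pickle.dump(master_moral_vectors_kannada, f)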

print("Master Moral Vectors:")
for foundation, vector in master_moral_vectors_kannada.items():
    print(f"{foundation}: {vector[:5]}...") # Print first 5 elements for brevity
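
The averaging step above can be illustrated on toy data. The sketch below is a minimal example, assuming tiny 3-dimensional arrays in place of real sentence-transformer embeddings. The names `toy_embeddings`, `cosine`, and `closest_foundation` are hypothetical and do not appear in this notebook, and the cosine-similarity lookup is only one possible downstream use of the master vectors, not something this notebook does.

```python
import numpy as np

# Toy stand-ins for the per-foundation embedding matrices (one row per word).
toy_embeddings = {
    "care.virtue": np.array([[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]),
    "care.vice": np.array([[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]]),
}

# Master vector = column-wise mean over all word embeddings of a foundation.
toy_master = {name: emb.mean(axis=0) for name, emb in toy_embeddings.items()}


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def closest_foundation(vector, master_vectors):
    """Return the foundation whose master vector is most similar to `vector`."""
    return max(master_vectors, key=lambda name: cosine(vector, master_vectors[name]))


query = np.array([0.9, 0.1, 0.0])  # hypothetical embedding of some text
print(closest_foundation(query, toy_master))
```

With real data, `toy_embeddings` would be replaced by `word_embeddings_kannada`, and `query` by the embedding of the text being scored.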

## Create Reusable Kannada MFD .dic Artifact
This section writes the Kannada moral foundations dictionary file (`kannada_mfd.dic`). The file has a category section that maps each numerical code to its moral foundation name, followed by a word section that maps each translated Kannada word in `translated_dictionary_kannada` to the numerical code of its foundation. The final cell confirms that the file was created.

The first step reads the code-to-name mapping (`nummap`) from the English `mfd2.0.dic`.


**Reasoning**:
Writing the word section requires the opposite lookup, from moral foundation names (e.g., 'care.virtue') to numerical codes (e.g., '1'). Swapping the keys and values of `nummap` produces this `foundation_to_num` dictionary.



In [None]:
nummap = {}  # numerical_code -> moral_foundation_name
file_path = '/content/anlp-project/mfd2.0.dic'

with open(file_path, 'r', encoding='utf-8') as f:
    for line in f:
        line = line.strip()

        # Skip empty lines, comments, and section markers
        if not line or line.startswith('#') or line.startswith('%'):
            continue

        parts = line.split()

        # Category section format: numerical_code foundation_name
        if len(parts) >= 2 and parts[0].isdigit():
            num_code = parts[0]
            foundation_name = parts[1]
            nummap[num_code] = foundation_name

print("nummap:", nummap)


nummap: {'1': 'care.virtue', '2': 'care.vice', '3': 'fairness.virtue', '4': 'fairness.vice', '5': 'loyalty.virtue', '6': 'loyalty.vice', '7': 'authority.virtue', '8': 'authority.vice', '9': 'sanctity.virtue', '10': 'sanctity.vice'}


In [None]:
foundation_to_num = {}
for num_code, foundation_name in nummap.items():
    foundation_to_num[foundation_name] = num_code

print("Foundation to Number Mapping:")
print(foundation_to_num)

Foundation to Number Mapping:
{'care.virtue': '1', 'care.vice': '2', 'fairness.virtue': '3', 'fairness.vice': '4', 'loyalty.virtue': '5', 'loyalty.vice': '6', 'authority.virtue': '7', 'authority.vice': '8', 'sanctity.virtue': '9', 'sanctity.vice': '10'}


**Reasoning**:
With the reverse mapping from moral foundation names to numerical codes in place, the next cell creates `kannada_mfd.dic`. It writes the category mapping first, then each translated Kannada word with its numerical code.



In [None]:
output_filename = '/content/anlp-project/kannada/kannada_mfd.dic'

with open(output_filename, 'w', encoding='utf-8') as f:
    # LIWC-style delimiter: a bare '%' line, which is what the loaders above check for
    f.write('%\n')
    # Write the numerical code to foundation name mapping
    for num_code, foundation_name in nummap.items():
        f.write(f'{num_code}\t{foundation_name}\n')

    f.write('%\n')
    # Write the translated Kannada words mapped to their numerical codes
    for foundation_name, kannada_words in translated_dictionary_kannada.items():
        num_code = foundation_to_num.get(foundation_name)
        if num_code:
            for word in kannada_words:
                f.write(f'{word}\t{num_code}\n')

print(f"Successfully created '{output_filename}' with Kannada MFD data.")

Successfully created '/content/anlp-project/kannada/kannada_mfd.dic' with Kannada MFD data.
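
One way to sanity-check a `.dic` file is to write it and parse it back. The sketch below does this in memory on a toy dictionary, assuming the LIWC convention of bare `%` delimiter lines, which is what the loaders in this notebook look for. The helpers `write_dic` and `read_dic` and the three Kannada words are illustrative placeholders, not part of the project code or the actual dictionary.

```python
import io


def write_dic(categories, words_by_category, handle):
    """Write {code: name} and {name: [words]} in LIWC-style .dic layout."""
    name_to_code = {name: code for code, name in categories.items()}
    handle.write('%\n')
    for code, name in categories.items():
        handle.write(f'{code}\t{name}\n')
    handle.write('%\n')
    for name, words in words_by_category.items():
        for word in words:
            handle.write(f'{word}\t{name_to_code[name]}\n')


def read_dic(handle):
    """Parse a LIWC-style .dic stream back into {name: [words]}."""
    id_to_name, parsed, mode = {}, {}, 0
    for raw in handle:
        line = raw.strip()
        if not line:
            continue
        if line == '%':  # each bare '%' line starts the next section
            mode += 1
            continue
        if mode == 1:  # category section: code, then name
            code, name = line.split()[:2]
            id_to_name[code] = name
            parsed[name] = []
        elif mode == 2:  # word section: word, then one or more codes
            word, *codes = line.split('\t')
            for code in codes:
                if code.strip() in id_to_name:
                    parsed[id_to_name[code.strip()]].append(word.strip())
    return parsed


# Placeholder data for illustration only.
toy_categories = {'1': 'care.virtue', '2': 'care.vice'}
toy_words = {'care.virtue': ['ದಯೆ', 'ಕಾಳಜಿ'], 'care.vice': ['ಹಾನಿ']}

buffer = io.StringIO()
write_dic(toy_categories, toy_words, buffer)
buffer.seek(0)
round_trip = read_dic(buffer)
print(round_trip == toy_words)
```

If the round trip yields an empty dictionary, the section delimiters written to the file probably do not match what the parser expects.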
