# Bidirectional Corpus Transformation
## Objective
The objective of this notebook is to transform a standard, unidirectional parallel corpus into a **bidirectional corpus** suitable for multitask fine-tuning. It takes a structured `.jsonl` file containing aligned Odia-German sentence pairs and creates a new, larger dataset where each original pair is represented as two distinct training instances: one for the `Odia → German` direction and one for the `German → Odia` direction.

## Methodology
The script programmatically restructures the dataset for a multitask learning setup by:

* **Reading Structured Data:** Loading the parallel corpus from the source `.jsonl` file, which includes rich metadata for each sentence pair.

* **Instance Duplication and Prefixing:** For each parallel pair, it generates two new JSON objects. Each new object contains a new `input_text` field, which is created by prepending a task-specific prefix (e.g., `"translate Odia to German:"`) to the appropriate source sentence.

* **Shuffling:** The resulting list, twice the size of the original corpus, is shuffled so that the model does not see the two translation directions in a predictable order during training.

* **Saving the Final Corpus:** The script saves the shuffled, bidirectional data to a new `.jsonl` file, which will serve as the master corpus for splitting into training, validation, and test sets.
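
The duplication and prefixing step can be sketched on a single record. This is a minimal illustration, not the notebook's actual function: the helper name `make_bidirectional_pair` is hypothetical, the sentences are placeholders, and unlike the notebook this sketch does not assign new `id` values or rename the original one to `original_id`.

```python
# Prefixes match the notebook's configuration; the helper name is hypothetical.
PREFIX_ORI_TO_DEU = "translate Odia to German: "
PREFIX_DEU_TO_ORI = "translate German to Odia: "


def make_bidirectional_pair(record,
                            odia_field="sentence_ory_Orya",
                            german_field="sentence_deu_Latn"):
    """Return the two training instances derived from one parallel record."""
    # Everything except the two sentence fields is carried over as metadata.
    metadata = {k: v for k, v in record.items()
                if k not in (odia_field, german_field)}
    odia, german = record[odia_field], record[german_field]
    return [
        {**metadata, "input_text": PREFIX_ORI_TO_DEU + odia, "target_text": german},
        {**metadata, "input_text": PREFIX_DEU_TO_ORI + german, "target_text": odia},
    ]


# Placeholder sentences stand in for real corpus text.
example = {
    "id": 1,
    "domain": "news",
    "sentence_ory_Orya": "<Odia sentence>",
    "sentence_deu_Latn": "<German sentence>",
}
for instance in make_bidirectional_pair(example):
    print(instance["input_text"], "=>", instance["target_text"])
```

Each record yields exactly two instances that share the same metadata and differ only in `input_text` and `target_text`.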

## Workflow
1. Mounts Google Drive to access the source `.jsonl` file and define a persistent save location.

2. Configures the input and output file paths and defines the task prefixes.

3. Reads the content of the source `authentic_corpus_final.jsonl` file.

4. Iterates through each record, creating the two corresponding bidirectional instances while preserving all original metadata.

5. Shuffles the newly created list of 7,352 instances.

6. Saves the final dataset to the `bidirectional_corpus_final.jsonl` file.
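
The notebook shuffles with the global `random` state, so the order differs between runs. If a repeatable order is wanted (for example, so that later train/validation/test splits can be regenerated), one option is a seeded generator. The sketch below is an assumption layered on top of the workflow, not part of it; the helper name `shuffled_copy` and the seed value are illustrative.

```python
import random


def shuffled_copy(instances, seed=42):
    """Return a shuffled copy of `instances`; the same seed gives the same order."""
    rng = random.Random(seed)   # local generator, leaves the global state untouched
    result = list(instances)    # copy so the caller's list is not modified
    rng.shuffle(result)
    return result


# Toy instances standing in for the bidirectional records.
instances = [{"id": i} for i in range(1, 9)]
first = shuffled_copy(instances)
second = shuffled_copy(instances)
print(first == second)  # both calls use the same seed
```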

## Input & Output
* **Input:** A single `.jsonl` file (`authentic_corpus_final.jsonl`) containing structured, parallel Odia-German data.
* **Output:** A single, larger `.jsonl` file (`bidirectional_corpus_final.jsonl`) containing two entries for every valid input pair, formatted for bidirectional model training.
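
A quick sanity check on the output is to count how many instances carry each prefix; the two counts should be equal. The following is a minimal sketch under stated assumptions: `count_directions` is a hypothetical helper, and the demo writes a two-line temporary file rather than reading the real corpus.

```python
import json
import os
import tempfile
from collections import Counter

# Prefixes as defined in the notebook's configuration; the labels are illustrative.
PREFIXES = {
    "translate Odia to German: ": "ori_to_deu",
    "translate German to Odia: ": "deu_to_ori",
}


def count_directions(path):
    """Count instances per translation direction in a bidirectional .jsonl file."""
    counts = Counter()
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # ignore blank lines
            text = json.loads(line)["input_text"]
            label = next(
                (name for prefix, name in PREFIXES.items() if text.startswith(prefix)),
                "unknown",
            )
            counts[label] += 1
    return counts


# Demo on a tiny temporary file with placeholder sentences.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    rows = [
        {"input_text": "translate Odia to German: <Odia sentence>",
         "target_text": "<German sentence>"},
        {"input_text": "translate German to Odia: <German sentence>",
         "target_text": "<Odia sentence>"},
    ]
    with open(demo_path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    counts = count_directions(demo_path)

print(dict(counts))
```

Pointing `count_directions` at the real output file instead of the demo file would show whether both directions are equally represented.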

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
import json
import random
import os
from tqdm import tqdm

In [None]:
# --- CONFIGURATION ---
INPUT_FILE = "/content/drive/MyDrive/Thesis/data/transformed/authentic_corpus_final.jsonl"
OUTPUT_FILE = "/content/drive/MyDrive/Thesis/data/transformed/bidirectional_corpus_final.jsonl"

# Define the exact prefixes the model will learn
PREFIX_ORI_TO_DEU = "translate Odia to German: "
PREFIX_DEU_TO_ORI = "translate German to Odia: "

# Field names from your JSONL file
SOURCE_SENTENCE_FIELD = "sentence_ory_Orya"
TARGET_SENTENCE_FIELD = "sentence_deu_Latn"

In [None]:
def create_rich_bidirectional_dataset():
  """
  Creates a bidirectional JSON Lines (.jsonl) dataset from a structured parallel corpus.

  This function reads a JSON Lines file containing parallel Odia and German sentences, generates
  bidirectional training instances (Odia-to-German and German-to-Odia), and preserves metadata
  (e.g., URL, domain, topic, publication date). Each input record is transformed into two instances:
  one with the Odia sentence as input and German as target, and another with German as input and
  Odia as target. A prefix is added to input texts to indicate translation direction. The resulting
  dataset is shuffled and saved to a new JSON Lines file.

  Note:
    - Assumes global variables `INPUT_FILE`, `OUTPUT_FILE`, `SOURCE_SENTENCE_FIELD`,
      `TARGET_SENTENCE_FIELD`, `PREFIX_ORI_TO_DEU`, and `PREFIX_DEU_TO_ORI` are defined.
    - Skips records with missing or non-string sentences.
    - Uses UTF-8 encoding for reading and writing files to handle multilingual text.
    - Requires the `json`, `os`, `random`, and `tqdm` libraries.
  """
  if not os.path.exists(INPUT_FILE):
    print(f"⛔️ ERROR: Input file '{INPUT_FILE}' not found.")
    return

  print(f"Reading original structured corpus from '{INPUT_FILE}'...")
  with open(INPUT_FILE, 'r', encoding='utf-8') as f:
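    # Ignore blank lines (e.g., a trailing newline) so json.loads does not fail on them
    original_data = [json.loads(line) for line in f if line.strip()]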

  bidirectional_data = []
  new_id_counter = 1
  print("Creating bidirectional training instances with metadata...")

  for record in tqdm(original_data, desc="Processing Records"):
    # --- Extract all data from the original record ---
    original_id = record.get('id')
    url = record.get('URL')
    domain = record.get('domain')
    topic = record.get('topic')
    pub_date = record.get('publication_date')
    odia_sentence = record.get(SOURCE_SENTENCE_FIELD)
    german_sentence = record.get(TARGET_SENTENCE_FIELD)

    # Skip if either sentence is missing or is not a string
    if not isinstance(odia_sentence, str) or not isinstance(german_sentence, str):
      continue

    # --- Create Instance 1: Odia -> German ---
    ori_to_deu_instance = {
        "id": new_id_counter,
        "original_id": original_id,
        "URL": url,
        "domain": domain,
        "topic": topic,
        "publication_date": pub_date,
        "input_text": PREFIX_ORI_TO_DEU + odia_sentence,
        "target_text": german_sentence
    }
    bidirectional_data.append(ori_to_deu_instance)
    new_id_counter += 1

    # --- Create Instance 2: German -> Odia ---
    deu_to_ori_instance = {
        "id": new_id_counter,
        "original_id": original_id,
        "URL": url,
        "domain": domain,
        "topic": topic,
        "publication_date": pub_date,
        "input_text": PREFIX_DEU_TO_ORI + german_sentence,
        "target_text": odia_sentence
    }
    bidirectional_data.append(deu_to_ori_instance)
    new_id_counter += 1

  print("Shuffling the new dataset...")
  random.shuffle(bidirectional_data)

  print(f"Saving {len(bidirectional_data)} rich instances to '{OUTPUT_FILE}'...")
  with open(OUTPUT_FILE, 'w', encoding='utf-8') as f_out:
    for instance in bidirectional_data:
      f_out.write(json.dumps(instance, ensure_ascii=False) + '\n')

  print("\n✅ Rich bidirectional dataset created successfully!")

In [None]:
if __name__ == "__main__":
  create_rich_bidirectional_dataset()

Reading original structured corpus from '/content/drive/MyDrive/Thesis/data/transformed/authentic_corpus_final.jsonl'...
Creating bidirectional training instances with metadata...


Processing Records: 100%|██████████| 3676/3676 [00:00<00:00, 280832.42it/s]

Shuffling the new dataset...
Saving 7352 rich instances to '/content/drive/MyDrive/Thesis/data/transformed/bidirectional_corpus_final.jsonl'...

✅ Rich bidirectional dataset created successfully!
