# Parallel Corpus to Structured JSONL Conversion
## Objective
This notebook converts two line-aligned plain-text files, one containing the Odia corpus and one containing the German corpus, into a single structured JSON Lines (`.jsonl`) file. Each sentence pair gets a consistent structure and placeholder metadata fields, which prepares the data for subsequent manual annotation and the creation of a bidirectional training corpus.

## Methodology
The script automates the creation of a structured parallel corpus by:

* **Loading Parallel Data:** Reading the Odia and German text files into memory.

* **Alignment Verification:** Performing a crucial validation step to ensure both input files have the exact same number of lines, preventing data misalignment errors downstream.

* **Structured Transformation:** Iterating through the parallel lines and creating one JSON object per pair; pairs in which either line is blank are skipped. Each object receives a sequential ID (starting at 1) and includes fields for the Odia sentence (`sentence_ory_Orya`), the German sentence (`sentence_deu_Latn`), and placeholders for metadata (`URL`, `domain`, `topic`, `publication_date`).

* **JSONL Output:** Writing each JSON object as a new line to the output file, adhering to the JSON Lines standard.
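
The steps above can be sketched on in-memory lists, without touching Google Drive. This is a minimal illustration rather than the notebook's actual function: the name `build_jsonl_lines`, the `placeholders` dictionary, and the three sample sentences are assumptions made for the example.

```python
import json

def build_jsonl_lines(odia_lines, german_lines, placeholders):
    """Pair aligned lines and serialise each non-blank pair as one JSON string.

    Hypothetical helper for illustration; it mirrors the methodology described
    above but is not the function used later in this notebook.
    """
    # Alignment check: both sides must have the same number of lines.
    if len(odia_lines) != len(german_lines):
        raise ValueError(
            f"Line counts differ: {len(odia_lines)} Odia vs. {len(german_lines)} German"
        )

    json_lines = []
    entry_id = 1  # sequential ID, incremented only for pairs that are kept
    for odia, german in zip(odia_lines, german_lines):
        odia, german = odia.strip(), german.strip()
        if not (odia and german):
            continue  # skip the pair if either side is blank
        record = {
            "id": entry_id,
            **placeholders,
            "sentence_ory_Orya": odia,
            "sentence_deu_Latn": german,
        }
        # ensure_ascii=False keeps Odia script and German umlauts readable.
        json_lines.append(json.dumps(record, ensure_ascii=False))
        entry_id += 1
    return json_lines

# Placeholder metadata, matching the field names listed above.
placeholders = {
    "URL": "MANUAL_INSERT_URL",
    "domain": "MANUAL_INSERT_DOMAIN",
    "topic": "MANUAL_INSERT_TOPIC",
    "publication_date": "YYYY",
}

# Made-up sample data: the middle pair is blank and should be dropped.
sample = build_jsonl_lines(["ନମସ୍କାର", "", "ଧନ୍ୟବାଦ"], ["Hallo", "", "Danke"], placeholders)
for line in sample:
    print(line)
```

Raising `ValueError` on a line-count mismatch is a design choice for this sketch; the notebook's own function prints an error message and returns instead.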

## Workflow
The notebook executes the following sequential steps:

1. Mounts Google Drive to access the source corpora and define a persistent save location.

2. Configures the input and output file paths and defines the placeholder metadata.

3. Reads the content of the source Odia and German `.txt` files.

4. Validates that the line counts of both files match.

5. Generates the structured `.jsonl` file by merging the parallel data, skipping pairs with a blank line, and adding the defined structure.

6. Saves the final `.jsonl` corpus to the specified Google Drive directory.

## Input & Output
* **Input:** Two `.txt` files, each containing one sentence (or text segment) per line, where line `N` of the Odia file is the direct translation of line `N` of the German file.
* **Output:** A single `.jsonl` file (here, `authentic_corpus.jsonl`), where each line is a complete JSON object ready for manual metadata annotation. Because pairs with a blank line are skipped, the output can contain fewer lines than the input files.
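
To make the output format concrete, the following sketch reads a `.jsonl` file back and checks that every record carries the expected fields. The helper name `load_jsonl`, the `EXPECTED_KEYS` set, and the single demo record written to a temporary file are assumptions for illustration; they are not part of the notebook.

```python
import json
import os
import tempfile

# Field names taken from the methodology section above.
EXPECTED_KEYS = {
    "id", "URL", "domain", "topic", "publication_date",
    "sentence_ory_Orya", "sentence_deu_Latn",
}

def load_jsonl(path):
    """Read a JSON Lines file and verify each record has the expected keys."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate stray blank lines
            record = json.loads(line)
            missing = EXPECTED_KEYS - record.keys()
            if missing:
                raise ValueError(f"Line {line_no} is missing keys: {sorted(missing)}")
            records.append(record)
    return records

# Demo record with made-up content, written to a throwaway file.
example = {
    "id": 1,
    "URL": "MANUAL_INSERT_URL",
    "domain": "MANUAL_INSERT_DOMAIN",
    "topic": "MANUAL_INSERT_TOPIC",
    "publication_date": "YYYY",
    "sentence_ory_Orya": "ନମସ୍କାର",
    "sentence_deu_Latn": "Hallo",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
    loaded = load_jsonl(demo_path)

print(f"Loaded {len(loaded)} record(s); first German sentence: {loaded[0]['sentence_deu_Latn']}")
```

The same function could be pointed at the real output path once the corpus has been generated.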

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import json
import os
from tqdm import tqdm

In [None]:
# --- 1. CONFIGURATION ---

# --- Input Files ---
# Make sure these files have the exact same number of lines!
ODIA_CORPUS_FILE = '/content/drive/MyDrive/Thesis/data/raw/authentic_odia_corpus_v1.txt'
GERMAN_CORPUS_FILE = '/content/drive/MyDrive/Thesis/data/raw/authentic_german_corpus_v1.txt'

# --- Output File ---
OUTPUT_JSONL_FILE = '/content/drive/MyDrive/Thesis/data/transformed/authentic_corpus.jsonl'

In [None]:
# --- Placeholder values for manual insertion ---
PLACEHOLDER_URL = "MANUAL_INSERT_URL"
PLACEHOLDER_DOMAIN = "MANUAL_INSERT_DOMAIN" # e.g., "Dharitri" or "Sambad"
PLACEHOLDER_TOPIC = "MANUAL_INSERT_TOPIC"   # e.g., "national-news"
PLACEHOLDER_DATE = "YYYY"

In [None]:
def create_structured_corpus():
  """
  Creates a structured JSON Lines (.jsonl) file from parallel Odia and German text files.

  This function reads two input files (specified by `ODIA_CORPUS_FILE` and `GERMAN_CORPUS_FILE`),
  verifies they have the same number of lines, filters out empty lines, and generates a JSON Lines
  file containing paired Odia and German sentences. Each valid entry is a JSON object with a unique
  ID, placeholder metadata (URL, domain, topic, publication date), and the corresponding sentences.
  The output is written to `OUTPUT_JSONL_FILE`. Progress is displayed using a `tqdm` progress bar.

  Note:
    - Assumes global variables `ODIA_CORPUS_FILE`, `GERMAN_CORPUS_FILE`, `OUTPUT_JSONL_FILE`,
      `PLACEHOLDER_URL`, `PLACEHOLDER_DOMAIN`, `PLACEHOLDER_TOPIC`, and `PLACEHOLDER_DATE` are defined.
    - Skips blank lines in either file to ensure only valid sentence pairs are included.
    - Uses UTF-8 encoding for reading and writing files to handle multilingual text.
    - Requires the standard-library `json` and `os` modules and the third-party `tqdm` package.
  """
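  # Report exactly which input file(s) are missing instead of a generic message.
  missing_files = [p for p in (ODIA_CORPUS_FILE, GERMAN_CORPUS_FILE) if not os.path.exists(p)]
  if missing_files:
    print(f"⛔️ ERROR: Input file(s) not found: {missing_files}")
    return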

  print("Reading source files...")
  with open(ODIA_CORPUS_FILE, 'r', encoding='utf-8') as f_ori:
    odia_lines = [line.strip() for line in f_ori]
  with open(GERMAN_CORPUS_FILE, 'r', encoding='utf-8') as f_deu:
    german_lines = [line.strip() for line in f_deu]

  if len(odia_lines) != len(german_lines):
    print(f"⛔️ FATAL ERROR: The line counts of the source files do not match "
          f"({len(odia_lines)} Odia vs. {len(german_lines)} German).")
    return

  total_lines = len(odia_lines)
  print(f"✅ Files are aligned. Found {total_lines} total lines (including blanks) to process.")

  # --- Step 4: Create the JSONL file, SKIPPING blank lines ---
  print(f"Generating structured data in '{OUTPUT_JSONL_FILE}'...")

  # We will use a manual counter for the ID to keep it sequential for valid entries
  entry_id = 1

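  # Create the output directory if it does not exist yet, so open() cannot fail on a missing folder.
  output_dir = os.path.dirname(OUTPUT_JSONL_FILE)
  if output_dir:
    os.makedirs(output_dir, exist_ok=True)

  with open(OUTPUT_JSONL_FILE, 'w', encoding='utf-8') as f_out: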
    for odia_line, german_line in tqdm(zip(odia_lines, german_lines), total=total_lines, desc="Processing Lines"):
      # Only proceed if BOTH lines contain text after stripping whitespace.
      if odia_line and german_line:
        # Create the Python dictionary for this valid instance
        data_instance = {
            'id': entry_id,
            'URL': PLACEHOLDER_URL,
            'domain': PLACEHOLDER_DOMAIN,
            'topic': PLACEHOLDER_TOPIC,
            'publication_date': PLACEHOLDER_DATE,
            'sentence_ory_Orya': odia_line,
            'sentence_deu_Latn': german_line
        }

        # Write the JSON object to the file
        f_out.write(json.dumps(data_instance, ensure_ascii=False) + '\n')

        # Increment the ID only for valid entries
        entry_id += 1

  print(f"\n✅ Success! Your structured corpus has been created with {entry_id - 1} valid entries.")
  print("Blank lines have been ignored.")

if __name__ == "__main__":
  create_structured_corpus()

Reading source files...
✅ Files are aligned. Found 7351 total lines (including blanks) to process.
Generating structured data in '/content/drive/MyDrive/Thesis/data/transformed/authentic_corpus.jsonl'...


Processing Lines: 100%|██████████| 7351/7351 [00:00<00:00, 114426.48it/s]


✅ Success! Your structured corpus has been created with 3676 valid entries.
Blank lines have been ignored.
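
In the run above, 3676 of 7351 lines were kept. The function drops a pair whenever either side is blank, so its summary cannot distinguish paired blank separator lines from one-sided gaps, which might point to a misalignment. The sketch below counts the two cases separately; the function name `summarise_blank_pairs` and the demo lists are hypothetical and not part of the notebook.

```python
def summarise_blank_pairs(odia_lines, german_lines):
    """Count how many aligned pairs are complete, fully blank, or blank on one side only."""
    counts = {
        "both_text": 0,
        "both_blank": 0,
        "odia_only_blank": 0,
        "german_only_blank": 0,
    }
    for odia, german in zip(odia_lines, german_lines):
        has_odia = bool(odia.strip())
        has_german = bool(german.strip())
        if has_odia and has_german:
            counts["both_text"] += 1
        elif not has_odia and not has_german:
            counts["both_blank"] += 1
        elif not has_odia:
            counts["odia_only_blank"] += 1
        else:
            counts["german_only_blank"] += 1
    return counts

# Made-up demo data: one complete pair, one blank pair, and one gap on each side.
demo_counts = summarise_blank_pairs(
    ["a", "", " ", "d"],
    ["x", "", "y", ""],
)
print(demo_counts)
```

If either of the one-sided counts is non-zero for the real corpus, the affected lines are probably worth inspecting by hand before annotation.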



