# 🚀 Bilingual JSONL Dataset Processing Pipeline

This notebook processes English-Hindi bilingual datasets in JSONL format.

## 📋 Pipeline Steps:
1. **Basic Cleaning** - Remove newlines, normalize whitespace
2. **LLM Deep Cleaning** - AI-powered verification and cleaning
3. **Phase 1 Chunking** - Split into 3-sentence chunks
4. **Phase 2 Chunking** - LLM-assisted alignment for mismatches
5. **Merge Results** - Combine into final dataset

## 🔧 Setup Instructions:
1. Upload your `pib_bilingual.jsonl` file (raw data from Rewat)
2. Set your OpenRouter API key in Step 0
3. Run the cells in order (or click "Run All" once the file is uploaded and the key is set)

---

## Step 0: Configuration & Setup

Set your LLM API credentials in the configuration cell below.
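
If you would rather not paste the key into the notebook, one option is to read it from an environment variable. The sketch below is only an illustration: the variable name `OPENROUTER_API_KEY` and the helper `load_api_key` are assumptions, not part of this notebook.

```python
import os

# Assumption: the key was exported under this name before the notebook started.
# Any other variable name works as long as it matches what you exported.
def load_api_key(env_var="OPENROUTER_API_KEY",
                 fallback="your-openrouter-api-key-here"):
    """Return the key from the environment, or the placeholder if it is unset."""
    return os.environ.get(env_var, fallback)

api_key = load_api_key()
print("Key loaded from environment:", api_key != "your-openrouter-api-key-here")
```

The returned value could then be assigned to `LLM_API_KEY` in place of the hard-coded string.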

In [None]:
# ============================================================================
# CONFIGURATION - UPDATE THESE VALUES
# ============================================================================

# LLM Configuration (for Steps 2 and 4)
LLM_API_KEY = "your-openrouter-api-key-here"  # Get from https://openrouter.ai/
LLM_BASE_URL = "https://openrouter.ai/api/v1"
LLM_MODEL = "meta-llama/llama-3.1-8b-instruct:free"  # Free tier model

# File names (don't change unless you renamed your files)
INPUT_FILE = "pib_bilingual.jsonl"  # Your uploaded raw file
FINAL_OUTPUT = "pib_final_chunked_dataset.jsonl"  # Final result

print("✅ Configuration loaded!")
print(f"   Model: {LLM_MODEL}")
print(f"   Input: {INPUT_FILE}")
print(f"   Final Output: {FINAL_OUTPUT}")

## Step 1: Install Dependencies

In [None]:
# Install required packages
!pip install -q openai

import json
import re
import time
from openai import OpenAI

print("✅ All dependencies installed!")

## Step 2: Upload Your JSONL File

Click the folder icon on the left sidebar and upload your `pib_bilingual.jsonl` file.
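
The cleaning code later in this notebook reads an `english` and a `hindi` field from each line, so every line of the file should be a JSON object with those two keys. A minimal sketch of a per-line format check follows; the helper `check_record` and the sample lines are made up for illustration.

```python
import json

def check_record(line):
    """Return (ok, reason) for one line of a JSONL file."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(record, dict):
        return False, "not a JSON object"
    for key in ("english", "hindi"):
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            return False, f"missing or empty '{key}'"
    return True, "ok"

# Made-up sample lines: one valid pair, one missing a field, one not JSON.
sample_lines = [
    '{"english": "The Prime Minister addressed the nation.", '
    '"hindi": "प्रधानमंत्री ने राष्ट्र को संबोधित किया।"}',
    '{"english": "Only one side"}',
    'not json at all',
]
for line in sample_lines:
    print(check_record(line))
```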

In [None]:
import os

# Check if input file exists
if os.path.exists(INPUT_FILE):
    file_size = os.path.getsize(INPUT_FILE)
    with open(INPUT_FILE, 'r', encoding='utf-8') as f:
        num_lines = sum(1 for _ in f)
    print(f"✅ Input file found!")
    print(f"   File: {INPUT_FILE}")
    print(f"   Size: {file_size:,} bytes")
    print(f"   Entries: {num_lines}")
else:
    print(f"❌ ERROR: {INPUT_FILE} not found!")
    print(f"   Please upload your file using the folder icon on the left.")

---
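# 🧹 PHASE 1: Basic Cleaning

Replaces escaped and real newlines with spaces, collapses runs of whitespace into a single space, and trims each field. Stray special characters are left for the LLM pass in Phase 2.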
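
As a worked example of this normalization, the sketch below applies the same three operations to a made-up string. The function name `normalize_whitespace` is illustrative and simply mirrors the cleaning cell that follows.

```python
import re

def normalize_whitespace(text):
    """Replace escaped newlines, collapse whitespace runs, and trim."""
    if not text:
        return ""
    text = text.replace("\\n", " ")    # literal backslash followed by "n"
    text = re.sub(r"\s+", " ", text)   # real newlines, tabs, repeated spaces
    return text.strip()

# Made-up input mixing an escaped newline, real newlines, a tab, and padding.
raw = "  Press  Release\\nNew Delhi\n\n14 March\t2024  "
print(repr(normalize_whitespace(raw)))
# 'Press Release New Delhi 14 March 2024'
```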

In [None]:
def clean_text(text):
    """Clean text by removing special characters and excessive whitespace."""
    if not text:
        return ""
    
    # Replace newlines with spaces
    text = text.replace('\\n', ' ')
    
    # Replace multiple spaces with single space
    text = re.sub(r'\s+', ' ', text)
    
    # Remove leading/trailing whitespace
    text = text.strip()
    
    return text


def step1_basic_cleaning(input_file, output_file):
    """Step 1: Basic text cleaning."""
    print("=" * 70)
    print("STEP 1: Basic Cleaning")
    print("=" * 70)
    print(f"Input:  {input_file}")
    print(f"Output: {output_file}\n")
    
    processed = 0
    
    with open(input_file, 'r', encoding='utf-8') as infile, \
         open(output_file, 'w', encoding='utf-8') as outfile:
        
        for line_num, line in enumerate(infile, 1):
            try:
                data = json.loads(line)
                
                cleaned_entry = {
                    'english': clean_text(data.get('english', '')),
                    'hindi': clean_text(data.get('hindi', ''))
                }
                
                outfile.write(json.dumps(cleaned_entry, ensure_ascii=False) + '\n')
                processed += 1
                
                if line_num % 50 == 0:
                    print(f"  Processed: {line_num} entries...")
                    
            except Exception as e:
                print(f"⚠ Warning: Skipping line {line_num}: {e}")
    
    print(f"\n✅ Step 1 Complete!")
    print(f"   Processed: {processed} entries")
    print(f"   Output: {output_file}\n")
    return processed


# RUN STEP 1
cleaned_file = "pib_bilingual_cleaned.jsonl"
step1_count = step1_basic_cleaning(INPUT_FILE, cleaned_file)

---
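# 🤖 PHASE 2: LLM Deep Cleaning (Optional)

Sends each pair to the LLM, which strips stray characters and formatting artifacts and reports whether the two sides are aligned. Only the cleaned text is written out; the `is_aligned` flag is not saved.

⚠️ **Note:** This step makes one API call per entry, so it can be slow. If you skip it, set `final_clean_file = cleaned_file` before running Phase 3, and still run the `client = OpenAI(...)` initialization, which Phase 4 needs.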
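
LLM replies do not always contain bare JSON; they may wrap it in a code fence or add a sentence before it. The cell below handles the fenced case by string splitting. As an alternative sketch (not what this notebook does), the standard library's `json.JSONDecoder.raw_decode` can pull the first JSON value out of a reply. The helper name `extract_json` and the sample reply are hypothetical.

```python
import json

def extract_json(reply):
    """Parse the first JSON object or array found in an LLM reply, else None."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(reply):
        if char in "{[":
            try:
                value, _end = decoder.raw_decode(reply, start)
                return value
            except json.JSONDecodeError:
                continue  # not valid JSON from here; try the next candidate
    return None

# Build a made-up reply that wraps the JSON in a code fence.
fence = "`" * 3
reply = (
    f"Here is the result:\n{fence}json\n"
    '{"english": "Hello.", "hindi": "नमस्ते।", "is_aligned": true}'
    f"\n{fence}"
)
print(extract_json(reply))
```

Returning `None` rather than raising lets the caller fall back to the original pair, which matches the fallback behaviour of the cleaning function.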

In [None]:
# Initialize LLM client
client = OpenAI(
    api_key=LLM_API_KEY,
    base_url=LLM_BASE_URL
)

SYSTEM_PROMPT = """You are a bilingual data quality expert specializing in English-Hindi translation pairs.

Your task is to clean and verify translation pairs. Follow these rules:

1. CLEANING:
   - Remove special characters like backslashes (\\), forward slashes (/) that don't belong
   - Remove escape sequences or formatting artifacts
   - Preserve meaningful punctuation and numbers

2. VERIFICATION:
   - Check if English and Hindi are actual translations
   - Ensure semantic alignment
   - Do NOT retranslate or paraphrase

3. OUTPUT: Return ONLY a JSON object:
   {"english": "cleaned text", "hindi": "cleaned text", "is_aligned": true/false, "issues_found": "description"}
"""


def llm_clean_pair(english, hindi, entry_num):
    """Send pair to LLM for cleaning."""
    user_prompt = f"""Clean this English-Hindi translation pair:

ENGLISH: {english}

HINDI: {hindi}

Return cleaned version as JSON."""
    
    try:
        response = client.chat.completions.create(
            model=LLM_MODEL,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_prompt}
            ],
            temperature=0.1,
            max_tokens=4000
        )
        
        llm_output = response.choices[0].message.content.strip()
        
        # Remove markdown code blocks if present
        if llm_output.startswith('```'):
            llm_output = llm_output.split('```')[1]
            if llm_output.startswith('json'):
                llm_output = llm_output[4:]
            llm_output = llm_output.strip()
        
        result = json.loads(llm_output)
        return {
            'english': result.get('english', english),
            'hindi': result.get('hindi', hindi),
            'is_aligned': result.get('is_aligned', True),
            'verified': True
        }
        
    except Exception as e:
        print(f"  ⚠ Entry {entry_num}: LLM error ({e}), using original")
        return {'english': english, 'hindi': hindi, 'is_aligned': True, 'verified': False}


def step2_llm_cleaning(input_file, output_file):
    """Step 2: LLM-powered deep cleaning."""
    print("=" * 70)
    print("STEP 2: LLM Deep Cleaning")
    print("=" * 70)
    print(f"Model: {LLM_MODEL}")
    print(f"Input:  {input_file}")
    print(f"Output: {output_file}\n")
    
    processed = 0
    verified = 0
    
    with open(input_file, 'r', encoding='utf-8') as infile, \
         open(output_file, 'w', encoding='utf-8') as outfile:
        
        for line_num, line in enumerate(infile, 1):
            try:
                data = json.loads(line)
                print(f"Processing entry {line_num}...", end=" ")
                
                result = llm_clean_pair(data['english'], data['hindi'], line_num)
                
                if result['verified']:
                    verified += 1
                    print("✓")
                else:
                    print("⚠")
                
                output_entry = {
                    'english': result['english'],
                    'hindi': result['hindi']
                }
                
                outfile.write(json.dumps(output_entry, ensure_ascii=False) + '\n')
                processed += 1
                
                # Rate limiting
                time.sleep(0.5)
                
            except Exception as e:
                print(f"  ✗ Error: {e}")
    
    print(f"\n✅ Step 2 Complete!")
    print(f"   Processed: {processed}")
    print(f"   LLM Verified: {verified}")
    print(f"   Output: {output_file}\n")
    return processed


# RUN STEP 2
final_clean_file = "pib_bilingual_final.jsonl"
step2_count = step2_llm_cleaning(cleaned_file, final_clean_file)

---
# ✂️ PHASE 3: Sentence Chunking - Phase 1

Splits each entry into sentences. Entries whose English and Hindi sentence counts match are grouped into chunks of up to 3 sentences; the rest are set aside for Phase 2 chunking.
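
To make the chunking concrete, here is a small sketch with placeholder sentences. Seven sentences per side produce chunks of 3, 3, and 1, and because both sides have the same count, the chunks pair up one to one. Note that equal counts are only a heuristic for alignment. The name `chunk_sentences` is illustrative.

```python
def chunk_sentences(sentences, chunk_size=3):
    """Join consecutive sentences into chunks of at most chunk_size."""
    return [
        " ".join(sentences[i:i + chunk_size])
        for i in range(0, len(sentences), chunk_size)
    ]

# Placeholder sentences standing in for a split entry (7 on each side).
english = ["E1.", "E2.", "E3.", "E4.", "E5.", "E6.", "E7."]
hindi = ["H1।", "H2।", "H3।", "H4।", "H5।", "H6।", "H7।"]

pairs = list(zip(chunk_sentences(english), chunk_sentences(hindi)))
for eng_chunk, hin_chunk in pairs:
    print(eng_chunk, "|", hin_chunk)
```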

In [None]:
def split_english_sentences(text):
    """Split English text into sentences."""
    # Handle abbreviations
    text = text.replace('Dr.', 'Dr<DOT>')
    text = text.replace('Mr.', 'Mr<DOT>')
    text = text.replace('Mrs.', 'Mrs<DOT>')
    text = text.replace('U.S.', 'U<DOT>S<DOT>')
    text = text.replace('etc.', 'etc<DOT>')
    
    # Split on sentence endings
    sentences = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)
    
    # Restore abbreviations
    sentences = [s.replace('<DOT>', '.').strip() for s in sentences if s.strip()]
    return sentences


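def split_hindi_sentences(text):
    """Split Hindi text into sentences, keeping the terminator (danda or .!?) on each."""
    # The lookbehind splits after the terminator, so the danda is not dropped from the output
    sentences = re.split(r'(?<=[।.!?])\s+', text)
    return [s.strip() for s in sentences if s.strip()]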


def chunk_into_groups(sentences, chunk_size=3):
    """Chunk sentences into groups."""
    chunks = []
    for i in range(0, len(sentences), chunk_size):
        chunk = sentences[i:i+chunk_size]
        chunks.append(' '.join(chunk))
    return chunks


def step3_phase1_chunking(input_file, matched_file, mismatched_file):
    """Step 3: Phase 1 chunking."""
    print("=" * 70)
    print("STEP 3: Phase 1 Chunking")
    print("=" * 70)
    print(f"Input: {input_file}\n")
    
    with open(input_file, 'r', encoding='utf-8') as f:
        entries = [json.loads(line) for line in f]
    
    matched_chunks = []
    mismatched = []
    
    for i, entry in enumerate(entries, 1):
        eng_sentences = split_english_sentences(entry['english'])
        hin_sentences = split_hindi_sentences(entry['hindi'])
        
        eng_count = len(eng_sentences)
        hin_count = len(hin_sentences)
        
        print(f"Entry {i}: EN={eng_count}, HI={hin_count}", end=" ")
        
        if eng_count == hin_count:
            print("✓ MATCH")
            
            eng_chunks = chunk_into_groups(eng_sentences, 3)
            hin_chunks = chunk_into_groups(hin_sentences, 3)
            
            for eng_chunk, hin_chunk in zip(eng_chunks, hin_chunks):
                matched_chunks.append({
                    'english': eng_chunk,
                    'hindi': hin_chunk
                })
        else:
            print("⚠ MISMATCH")
            mismatched.append({
                'entry_num': i,
                'english': entry['english'],
                'hindi': entry['hindi'],
                'eng_sentences': eng_count,
                'hin_sentences': hin_count
            })
    
    # Save matched chunks
    with open(matched_file, 'w', encoding='utf-8') as f:
        for chunk in matched_chunks:
            f.write(json.dumps(chunk, ensure_ascii=False) + '\n')
    
    # Save mismatched for Phase 2
    with open(mismatched_file, 'w', encoding='utf-8') as f:
        for entry in mismatched:
            f.write(json.dumps(entry, ensure_ascii=False) + '\n')
    
    print(f"\n✅ Step 3 Complete!")
    print(f"   Matched chunks: {len(matched_chunks)}")
    print(f"   Mismatched entries: {len(mismatched)}")
    print(f"   Matched output: {matched_file}")
    print(f"   Mismatched output: {mismatched_file}\n")
    
    return len(matched_chunks), len(mismatched)


# RUN STEP 3
matched_file = "pib_chunked_matched.jsonl"
mismatched_file = "pib_mismatched_for_llm.jsonl"
matched_count, mismatched_count = step3_phase1_chunking(final_clean_file, matched_file, mismatched_file)

---
# 🤖 PHASE 4: Sentence Chunking - Phase 2 (LLM-Assisted)

Uses the LLM to align entries whose English and Hindi sentence counts differ.

⚠️ **Note:** This step only runs if Phase 1 chunking (Step 3) left mismatched entries.
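
The LLM is asked for a JSON array of `{"english": ..., "hindi": ...}` objects, but nothing guarantees it returns that shape. A hedged sketch of a shape check that could run before chunks are saved is shown below; `keep_valid_chunks` is a hypothetical helper, not part of this notebook.

```python
def keep_valid_chunks(parsed):
    """Keep only items shaped like {"english": str, "hindi": str} with text in both."""
    if not isinstance(parsed, list):
        return []
    kept = []
    for item in parsed:
        if not isinstance(item, dict):
            continue
        english, hindi = item.get("english"), item.get("hindi")
        if (isinstance(english, str) and isinstance(hindi, str)
                and english.strip() and hindi.strip()):
            kept.append({"english": english.strip(), "hindi": hindi.strip()})
    return kept

# Made-up parsed reply: one good chunk, one missing a side, one wrong type.
parsed_reply = [
    {"english": "First chunk.", "hindi": "पहला खंड।"},
    {"english": "No Hindi side."},
    "just a string",
]
print(keep_valid_chunks(parsed_reply))
```

Without a check like this, a reply that is a single JSON object rather than an array would be passed to `list.extend`, which would add the object's keys as chunks.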

In [None]:
def step4_phase2_chunking(input_file, output_file):
    """Step 4: Phase 2 LLM-assisted chunking."""
    print("=" * 70)
    print("STEP 4: Phase 2 LLM Chunking")
    print("=" * 70)
    
    # Check if there are mismatched entries
    import os
    if not os.path.exists(input_file) or os.path.getsize(input_file) == 0:
        print("\n✅ No mismatched entries! Skipping Phase 2.\n")
        return 0
    
    print(f"Input: {input_file}\n")
    
    system_prompt = """You are an expert at chunking English-Hindi bilingual text.

Given English and Hindi texts with different sentence counts, create aligned 3-sentence chunks.

Rules:
1. Each chunk should have 1-3 sentences
2. English and Hindi chunks must be semantically aligned
3. Return JSON array: [{"english": "chunk", "hindi": "chunk"}, ...]
"""
    
    with open(input_file, 'r', encoding='utf-8') as f:
        entries = [json.loads(line) for line in f]
    
    all_chunks = []
    
    for i, entry in enumerate(entries, 1):
        try:
            print(f"Processing entry {i}/{len(entries)}...", end=" ")
            
            user_prompt = f"""English ({entry['eng_sentences']} sentences):
{entry['english']}

Hindi ({entry['hin_sentences']} sentences):
{entry['hindi']}

Create aligned 3-sentence chunks as JSON array."""
            
            response = client.chat.completions.create(
                model=LLM_MODEL,
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_prompt}
                ],
                temperature=0.2,
                max_tokens=4000
            )
            
            llm_output = response.choices[0].message.content.strip()
            
            # Remove code blocks
            if llm_output.startswith('```'):
                llm_output = llm_output.split('```')[1]
                if llm_output.startswith('json'):
                    llm_output = llm_output[4:]
                llm_output = llm_output.strip()
            
            chunks = json.loads(llm_output)
            all_chunks.extend(chunks)
            print(f"✓ ({len(chunks)} chunks)")
            
            time.sleep(0.5)
            
        except Exception as e:
            print(f"✗ Error: {e}")
    
    # Save LLM-aligned chunks
    with open(output_file, 'w', encoding='utf-8') as f:
        for chunk in all_chunks:
            f.write(json.dumps(chunk, ensure_ascii=False) + '\n')
    
    print(f"\n✅ Step 4 Complete!")
    print(f"   LLM-aligned chunks: {len(all_chunks)}")
    print(f"   Output: {output_file}\n")
    
    return len(all_chunks)


# RUN STEP 4
llm_aligned_file = "pib_chunked_llm_aligned.jsonl"
llm_chunks = step4_phase2_chunking(mismatched_file, llm_aligned_file)

---
# 🎯 PHASE 5: Merge All Results

Combines matched chunks and LLM-aligned chunks into final dataset.
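
The merge is a plain concatenation of the two chunk files. The sketch below shows the same idea on temporary files, with two additions that are assumptions rather than the notebook's behaviour: blank lines are skipped, and each line is parsed once so a malformed line raises an error instead of reaching the final dataset. The function name `merge_jsonl` and the file names are illustrative.

```python
import json
import os
import tempfile

def merge_jsonl(sources, destination):
    """Concatenate JSONL files into destination and return the line count."""
    total = 0
    with open(destination, "w", encoding="utf-8") as out:
        for path in sources:
            if not os.path.exists(path):
                continue  # a missing source is skipped, as in the merge cell
            with open(path, "r", encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if not line:
                        continue
                    json.loads(line)  # raises ValueError on a malformed line
                    out.write(line + "\n")
                    total += 1
    return total

with tempfile.TemporaryDirectory() as tmp:
    matched = os.path.join(tmp, "matched.jsonl")
    aligned = os.path.join(tmp, "aligned.jsonl")
    merged = os.path.join(tmp, "final.jsonl")
    with open(matched, "w", encoding="utf-8") as f:
        f.write('{"english": "A.", "hindi": "क।"}\n{"english": "B.", "hindi": "ख।"}\n')
    with open(aligned, "w", encoding="utf-8") as f:
        f.write('{"english": "C.", "hindi": "ग।"}\n\n')
    count = merge_jsonl([matched, aligned, os.path.join(tmp, "missing.jsonl")], merged)
    print(f"Merged {count} chunks")
```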

In [None]:
def step5_merge_results(matched_file, llm_file, output_file):
    """Step 5: Merge all chunks."""
    print("=" * 70)
    print("STEP 5: Merging Results")
    print("=" * 70)
    
    total_chunks = 0
    
    with open(output_file, 'w', encoding='utf-8') as outfile:
        # Add matched chunks
        if os.path.exists(matched_file):
            with open(matched_file, 'r', encoding='utf-8') as f:
                for line in f:
                    outfile.write(line)
                    total_chunks += 1
            print(f"✓ Added matched chunks from {matched_file}")
        
        # Add LLM-aligned chunks
        if os.path.exists(llm_file) and os.path.getsize(llm_file) > 0:
            with open(llm_file, 'r', encoding='utf-8') as f:
                for line in f:
                    outfile.write(line)
                    total_chunks += 1
            print(f"✓ Added LLM-aligned chunks from {llm_file}")
    
    print(f"\n✅ Step 5 Complete!")
    print(f"   Total chunks in final dataset: {total_chunks}")
    print(f"   Final output: {output_file}\n")
    
    return total_chunks


# RUN STEP 5
final_chunks = step5_merge_results(matched_file, llm_aligned_file, FINAL_OUTPUT)

---
# ✅ FINAL SUMMARY

Review the complete processing pipeline results.

In [None]:
print("=" * 70)
print("🎉 PIPELINE COMPLETE!")
print("=" * 70)
print("\nProcessing Summary:")
print(f"  Step 1 - Basic Cleaning:     {step1_count} entries")
print(f"  Step 2 - LLM Deep Cleaning:  {step2_count} entries")
print(f"  Step 3 - Phase 1 Chunking:   {matched_count} chunks")
print(f"  Step 4 - Phase 2 LLM:        {llm_chunks} chunks")
print(f"  Step 5 - Final Merge:        {final_chunks} chunks")
print(f"\n📁 Output Files:")
print(f"  ✅ {FINAL_OUTPUT} ({final_chunks} chunks)")
print(f"\n💡 Next Steps:")
print(f"  1. Download {FINAL_OUTPUT} using the file browser")
print(f"  2. Use this dataset for training your bilingual models")
print(f"  3. Analyze quality and alignment\n")
print("=" * 70)

# Show sample from final dataset
print("\n📊 Sample from Final Dataset (first 3 entries):\n")
with open(FINAL_OUTPUT, 'r', encoding='utf-8') as f:
    for i, line in enumerate(f, 1):
        if i > 3:
            break
        data = json.loads(line)
        print(f"Entry {i}:")
        print(f"  English: {data['english'][:100]}...")
        print(f"  Hindi:   {data['hindi'][:100]}...\n")

---
# 📥 Download Your Final Dataset

Click the folder icon on the left, find `pib_final_chunked_dataset.jsonl`, and download it!