<a href="https://colab.research.google.com/github/aditya-llm/Ncert-pdf-translation-using-en-indic2/blob/main/translate.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:

%%capture
!git clone https://github.com/AI4Bharat/IndicTrans2.git

In [None]:
%%capture
%cd /content/IndicTrans2/huggingface_interface

In [None]:

%%capture
!python3 -m pip install nltk sacremoses pandas regex mock transformers==4.53.2 mosestokenizer
!python3 -c "import nltk; nltk.download('punkt')"
!python3 -m pip install bitsandbytes scipy accelerate datasets
!python3 -m pip install sentencepiece

!git clone https://github.com/VarunGumma/IndicTransToolkit.git
%cd IndicTransToolkit
!python3 -m pip install --editable ./
%cd ..


In [None]:
# Restart the runtime before running the cells below so the newly installed packages can be imported.

In [None]:
# --- Step 2: Initialize the IndicTrans2 Translation Model ---

import nltk
import torch
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig, AutoTokenizer
from IndicTransToolkit.processor import IndicProcessor

In [None]:
print("Setting up the translation model...")

BATCH_SIZE = 4
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
quantization = None # "4-bit" or "8-bit" for lower memory, but None is faster if VRAM allows

def initialize_model_and_tokenizer(ckpt_dir, quantization):
    if quantization == "4-bit":
        qconfig = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16)
    elif quantization == "8-bit":
        qconfig = BitsAndBytesConfig(load_in_8bit=True)
    else:
        qconfig = None

    tokenizer = AutoTokenizer.from_pretrained(ckpt_dir, trust_remote_code=True)
    model = AutoModelForSeq2SeqLM.from_pretrained(
        ckpt_dir, trust_remote_code=True, low_cpu_mem_usage=True, quantization_config=qconfig
    )

    if qconfig is None:
        model = model.to(DEVICE)
        if DEVICE == "cuda":
            model.half()

    model.eval()
    return tokenizer, model

def batch_translate(input_sentences, src_lang, tgt_lang, model, tokenizer, ip):
    translations = []
    for i in range(0, len(input_sentences), BATCH_SIZE):
        batch = input_sentences[i:i + BATCH_SIZE]
        if not any(s.strip() for s in batch): # Skip empty batches
            translations.extend([""] * len(batch))
            continue

        processed_batch = ip.preprocess_batch(batch, src_lang=src_lang, tgt_lang=tgt_lang)
        inputs = tokenizer(
            processed_batch, truncation=True, padding="longest", return_tensors="pt", return_attention_mask=True
        ).to(DEVICE)

        with torch.no_grad():
            generated_tokens = model.generate(
                **inputs, use_cache=True, min_length=0, max_length=256, num_beams=5, num_return_sequences=1
            )

        decoded_tokens = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True, clean_up_tokenization_spaces=True)
        translations.extend(ip.postprocess_batch(decoded_tokens, lang=tgt_lang))

        del inputs
        torch.cuda.empty_cache()

    return translations

# Initialize the English-to-Indic model
en_indic_ckpt_dir = "ai4bharat/indictrans2-en-indic-1B"
en_indic_tokenizer, en_indic_model = initialize_model_and_tokenizer(en_indic_ckpt_dir, quantization)
ip = IndicProcessor(inference=True)

print("✅ Translation model ready.")

Setting up the translation model...


✅ Translation model ready.
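
`batch_translate` pads every batch with `padding="longest"`, so a batch that mixes one long sentence with several short ones spends work on padding tokens. A minimal sketch of one way to reduce that, assuming it is acceptable to translate sentences out of order and restore the order afterwards, is shown below. The names `length_sorted_batches`, `translate_in_length_order`, and `translate_fn` are hypothetical and not part of IndicTrans2 or this notebook; whether this gives a measurable speed-up here has not been tested.

```python
def length_sorted_batches(sentences, batch_size):
    """Yield lists of original indices so that similar-length sentences share a batch."""
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]))
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]


def translate_in_length_order(sentences, batch_size, translate_fn):
    """Translate in length-sorted batches, then restore the input order.

    translate_fn is a hypothetical callable: list[str] -> list[str].
    """
    results = [""] * len(sentences)
    for idx_batch in length_sorted_batches(sentences, batch_size):
        outputs = translate_fn([sentences[i] for i in idx_batch])
        for i, out in zip(idx_batch, outputs):
            results[i] = out
    return results
```

In this notebook, `translate_fn` could wrap the existing function, for example `lambda b: batch_translate(b, src_lang, tgt_lang, en_indic_model, en_indic_tokenizer, ip)`.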


In [None]:
en_sents = [
    "When I was young, I used to go to the park every day.",
    "He has many old books, which he inherited from his ancestors.",
]

src_lang, tgt_lang = "eng_Latn", "hin_Deva"
hi_translations = batch_translate(en_sents, src_lang, tgt_lang, en_indic_model, en_indic_tokenizer, ip)

print(f"\n{src_lang} - {tgt_lang}")
for input_sentence, translation in zip(en_sents, hi_translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")


eng_Latn - hin_Deva
eng_Latn: When I was young, I used to go to the park every day.
hin_Deva: जब मैं छोटा था, मैं हर दिन पार्क जाता था।
eng_Latn: He has many old books, which he inherited from his ancestors.
hin_Deva: उनके पास कई पुरानी किताबें हैं, जो उन्हें अपने पूर्वजों से विरासत में मिली हैं।
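
The two sentences above are short, but `batch_translate` tokenizes with `truncation=True` and generates with `max_length=256`, so a long paragraph passed as a single string may be cut off. One possible workaround is to split long text at sentence boundaries before translating. The sketch below uses a simple regular expression as a stand-in for a proper sentence tokenizer such as NLTK's punkt, and a character budget `max_chars=400` that is an arbitrary assumption; the real limit is measured in tokens, not characters. `split_into_chunks` is a hypothetical helper, not part of the notebook.

```python
import re


def split_into_chunks(text, max_chars=400):
    """Split text at sentence boundaries and pack sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept whole rather than cut mid-sentence.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be translated separately and the results joined with a space to rebuild the paragraph.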


In [None]:
# --- Step 1: Install System and Python Dependencies ---
print("Installing system dependencies for OCR and PDF processing...")
!sudo apt-get update
!sudo apt-get install -y tesseract-ocr poppler-utils
!pip install --upgrade pip

print("\nInstalling Python packages...")

# PDF, Image, and OCR libraries
!pip install pdf2image pytesseract pillow

print("\n✅ All installations complete.")

Installing system dependencies for OCR and PDF processing...

In [None]:
# --- Step 3: Mount Google Drive and Set Up Directories ---
import os
from google.colab import drive

drive.mount('/content/drive')

# Create a main project folder in your Google Drive
BASE_DIR = "/content/drive/MyDrive/Colab Notebooks/task1"
ENGLISH_PDF_DIR = os.path.join(BASE_DIR, "english_pdfs")
HINDI_PDF_DIR = os.path.join(BASE_DIR, "hindi_pdfs_layout_preserved")

os.makedirs(ENGLISH_PDF_DIR, exist_ok=True)
os.makedirs(HINDI_PDF_DIR, exist_ok=True)

print(f"Project directory set up at: {BASE_DIR}")

Mounted at /content/drive
Project directory set up at: /content/drive/MyDrive/Colab Notebooks/task1
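
The helper functions that follow rely on system binaries as well as Python packages: `pytesseract` calls the `tesseract` executable, and `pdf2image` calls `pdftoppm` from `poppler-utils`. A small check like the one below, offered as a sketch, can confirm that both are on `PATH` before a long run starts. `find_missing_binaries` is a hypothetical helper name.

```python
import shutil


def find_missing_binaries(names=("tesseract", "pdftoppm")):
    """Return the names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = find_missing_binaries()
if missing:
    print(f"Missing system binaries: {', '.join(missing)}")
else:
    print("tesseract and pdftoppm are available.")
```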


In [None]:

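# Save the font to the exact path that get_hindi_font() checks, regardless of the current working directory
!wget -q https://github.com/googlefonts/noto-fonts/raw/main/hinted/ttf/NotoSansDevanagari/NotoSansDevanagari-Regular.ttf -O /content/NotoSansDevanagari-Regular.ttf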

In [None]:
# --- Step 4 (REVISED): Core Helper Functions with Robustness ---
import requests
from pdf2image import convert_from_path
import pytesseract
from PIL import Image, ImageDraw, ImageFont
import pandas as pd
from tqdm import tqdm
import time # Import time for adding delays

# --- FIXED: Added retry logic to the download function ---
def download_book(base_url, chapter_codes, save_dir):
    print(f"Downloading {len(chapter_codes)} PDF files to {save_dir}...")
    for code in chapter_codes:
        file_url = f"{base_url}{code}.pdf"
        file_name = f"{code}.pdf"
        file_path = os.path.join(save_dir, file_name)
        if not os.path.exists(file_path):
            retries = 3
            for attempt in range(retries):
                try:
                    print(f" -> Attempting to download {file_name} (try {attempt+1}/{retries})...")
                    response = requests.get(file_url, stream=True, timeout=30)
                    response.raise_for_status()  # Will raise an HTTPError for bad responses (4xx or 5xx)
                    with open(file_path, "wb") as f:
                        for chunk in response.iter_content(chunk_size=8192):
                            f.write(chunk)
                    print(f" -> Downloaded {file_name} successfully.")
                    break  # Exit the retry loop on success
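                except requests.exceptions.RequestException as e:
                    print(f" -> Download attempt failed for {file_name}: {e}")
                    # Remove any partial file so a later run does not mistake it for a finished download
                    if os.path.exists(file_path):
                        os.remove(file_path)
                    if attempt + 1 == retries:
                        print(f" -> FAILED to download {file_name} after {retries} attempts.")
                    else:
                        time.sleep(2)  # Wait for 2 seconds before retrying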
        else:
            print(f" -> {file_name} already exists. Skipping.")

# --- FIXED: Added robust error checking for the font download ---
def get_hindi_font(size=20):
    font_path = "/content/NotoSansDevanagari-Regular.ttf"
    if not os.path.exists(font_path):
        print("Downloading Hindi font...")
        font_url = "https://github.com/google/fonts/raw/main/ofl/notosansdevanagari/NotoSansDevanagari-Regular.ttf"
        try:
            response = requests.get(font_url, timeout=15)
            response.raise_for_status()
            with open(font_path, "wb") as f:
                f.write(response.content)
            print("Font downloaded successfully.")
        except requests.exceptions.RequestException as e:
            # This is a fatal error. The program cannot continue without the font.
            print(f"FATAL ERROR: Could not download the required Hindi font. Error: {e}")
            raise IOError("Failed to download Hindi font, cannot proceed with PDF generation.") from e

    # Now, we are sure the file exists (or the program has stopped).
    # We add a final check to ensure the file is not empty.
    if os.path.getsize(font_path) < 1000: # A valid font file will be much larger
        raise IOError("Downloaded font file is corrupt or empty. Please check the URL or your connection.")

    return ImageFont.truetype(font_path, size)

# The main processing function for a single PDF
def translate_pdf_with_layout(pdf_path, output_path):
    print(f"\nProcessing {os.path.basename(pdf_path)}...")

    try:
        pages_as_images = convert_from_path(pdf_path, dpi=300)
    except Exception as e:
        print(f"  ERROR: Could not convert PDF to images. Skipping. Error: {e}")
        return

    final_pages = []

    for i, page_image in enumerate(tqdm(pages_as_images, desc="  Translating pages")):
        ocr_data = pytesseract.image_to_data(page_image, lang='eng', output_type=pytesseract.Output.DATAFRAME)
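        ocr_data = ocr_data.dropna(subset=['text'])
        # Numeric-looking tokens may be parsed as numbers; force strings so .str methods work
        ocr_data = ocr_data.assign(text=ocr_data['text'].astype(str))
        # Keep confident, non-blank words only
        ocr_data = ocr_data[(ocr_data['conf'] > 40) & (ocr_data['text'].str.strip() != "")]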

        if ocr_data.empty:
            final_pages.append(page_image)
            continue

        # Group words by OCR block. Grouping on a scalar key gives scalar block numbers
        # on every pandas version, which avoids the tuple-key FutureWarning.
        blocks = list(ocr_data.groupby('block_num'))
        english_texts = [block_df.text.str.cat(sep=' ') for _, block_df in blocks]

        src_lang, tgt_lang = "eng_Latn", "hin_Deva"
        hindi_translations = batch_translate(english_texts, src_lang, tgt_lang, en_indic_model, en_indic_tokenizer, ip)

        draw = ImageDraw.Draw(page_image)
        block_translations = {i: text for i, text in enumerate(hindi_translations)}

        for idx, (block_num, block_df) in enumerate(blocks):
            x_min, y_min = block_df['left'].min(), block_df['top'].min()
            x_max, y_max = (block_df['left'] + block_df['width']).max(), (block_df['top'] + block_df['height']).max()

            draw.rectangle([x_min, y_min, x_max, y_max], fill='white', outline='white')

            hindi_text = block_translations.get(idx, "")
            if hindi_text.strip():
                font_size = int(block_df['height'].mean() * 0.8)
                font = get_hindi_font(size=max(font_size, 10))
                draw.text((x_min, y_min), hindi_text, font=font, fill='black')

        final_pages.append(page_image)

    if final_pages:
        final_pages[0].save(
            output_path, save_all=True, append_images=final_pages[1:], resolution=300.0
        )
        print(f"  ✅ Successfully created translated PDF: {os.path.basename(output_path)}")
    else:
        print(f"  WARNING: No pages were processed for {os.path.basename(pdf_path)}")
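
`translate_pdf_with_layout` draws each translated block with a single `draw.text` call at the block's top-left corner, so a translation wider than the original block can run past its right edge. A minimal sketch of greedy word wrapping is given below. `wrap_text` is a hypothetical helper; it takes a `measure` callable so it can be used with `font.getlength` from a Pillow `FreeTypeFont` or with any other width function.

```python
def wrap_text(text, max_width, measure):
    """Greedily wrap text so each line's measured width stays within max_width.

    measure is a callable str -> width, e.g. font.getlength for a Pillow FreeTypeFont.
    A single word wider than max_width is placed on its own line.
    """
    lines, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if current and measure(candidate) > max_width:
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


# Possible use inside translate_pdf_with_layout (an assumption, not the notebook's code):
# lines = wrap_text(hindi_text, x_max - x_min, font.getlength)
# draw.multiline_text((x_min, y_min), "\n".join(lines), font=font, fill='black')
```

Wrapped text can still overflow the bottom of the block; reducing the font size until the lines fit would be a further refinement.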

In [None]:
# --- Step 5: Main Execution Script ---

# --- Configuration ---
# You can easily add more books here to process them in a batch.
BOOKS_TO_PROCESS = {
    "prose_hepr1": {
        "base_url": "https://ncert.nic.in/textbook/pdf/",
        "chapters": [f"hepr10{i}" for i in range(1, 6)]
    },
    # Example: Add another book like History
    # "history_hess1": {
    #     "base_url": "https://ncert.nic.in/textbook/pdf/",
    #     "chapters": [f"hess10{i}" for i in range(1, 4)]
    # }
}

# --- Start Processing ---
for book_name, book_details in BOOKS_TO_PROCESS.items():
    print(f"\n{'='*25}\nProcessing Book: {book_name}\n{'='*25}")

    # 1. Download all chapters for the book (now with retry logic)
    download_book(book_details["base_url"], book_details["chapters"], ENGLISH_PDF_DIR)

    # 2. Process each chapter PDF
    for chapter_code in book_details["chapters"]:
        english_pdf_path = os.path.join(ENGLISH_PDF_DIR, f"{chapter_code}.pdf")
        hindi_pdf_path = os.path.join(HINDI_PDF_DIR, f"{chapter_code}_hindi_layout.pdf")

        if not os.path.exists(english_pdf_path):
            print(f"SKIPPING {chapter_code}: Source file was not downloaded successfully.")
            continue

        # Run the full layout-aware translation process for the PDF
        translate_pdf_with_layout(english_pdf_path, hindi_pdf_path)

        # Clear CUDA memory after processing a large file
        torch.cuda.empty_cache()

print("\n\n🎉 Finished processing all listed books. Check the log above for any skipped or failed chapters.")
print(f"Find your translated PDFs with preserved layout in your Google Drive at: {HINDI_PDF_DIR}")


Processing Book: prose_hepr1
Downloading 5 PDF files to /content/drive/MyDrive/Colab Notebooks/task1/english_pdfs...
 -> hepr101.pdf already exists. Skipping.
 -> hepr102.pdf already exists. Skipping.
 -> hepr103.pdf already exists. Skipping.
 -> hepr104.pdf already exists. Skipping.
 -> hepr105.pdf already exists. Skipping.

Processing hepr101.pdf...


  Translating pages: 100%|██████████| 48/48 [17:48<00:00, 22.26s/it]


  ✅ Successfully created translated PDF: hepr101_hindi_layout.pdf

Processing hepr102.pdf...


  Translating pages: 100%|██████████| 52/52 [19:52<00:00, 22.93s/it]


  ✅ Successfully created translated PDF: hepr102_hindi_layout.pdf

Processing hepr103.pdf...
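
The log above shows each chapter taking roughly 18 to 20 minutes (17:48 for `hepr101.pdf`, 19:52 for `hepr102.pdf`), and the main loop translates every chapter again on each run even when its output already exists. A sketch of a filter that skips finished chapters after an interrupted session is shown below. `pending_chapters` is a hypothetical helper; the default suffix matches the `_hindi_layout.pdf` naming used in Step 5.

```python
import os


def pending_chapters(chapter_codes, output_dir, suffix="_hindi_layout.pdf"):
    """Return the chapter codes whose translated PDF is not yet in output_dir."""
    return [
        code for code in chapter_codes
        if not os.path.exists(os.path.join(output_dir, f"{code}{suffix}"))
    ]
```

In Step 5, the inner loop could then iterate over `pending_chapters(book_details["chapters"], HINDI_PDF_DIR)` instead of the full chapter list.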
