# IndicTrans2 HF Inference

This notebook shows how to use our IndicTrans2 models, which were originally trained with fairseq, for inference with HuggingFace Transformers.


## Setup

Please run the cells below to install the necessary dependencies.


In [1]:
%%capture
!git clone https://github.com/AI4Bharat/IndicTrans2.git

In [2]:
%%capture
%cd /content/IndicTrans2/huggingface_interface

In [3]:
%%capture
# Quote the version specifier so the shell does not treat ">" as an output redirect
!python3 -m pip install nltk sacremoses pandas regex mock "transformers>=4.33.2" mosestokenizer
!python3 -c "import nltk; nltk.download('punkt')"
!python3 -m pip install bitsandbytes scipy accelerate datasets
!python3 -m pip install sentencepiece

!git clone https://github.com/VarunGumma/IndicTransTokenizer
%cd IndicTransTokenizer
!python3 -m pip install --editable ./
%cd ..

**IMPORTANT: Restart your runtime first, and then run the cells below.**
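
After the restart, it may help to confirm that the packages installed above can be found before running the inference cells. The following is a minimal sketch that uses only the standard library; `missing_packages` is a hypothetical helper, not part of IndicTrans2, and the list of module names is an assumption based on the imports used later in this notebook.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names from `names` that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names assumed from the imports used in the inference cells
required = ["torch", "transformers", "tqdm", "IndicTransTokenizer"]
print("Missing packages:", missing_packages(required) or "none")
```

If the list is not empty, re-running the setup cells and restarting the runtime again is the likely fix.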

## Inference


In [6]:
from tqdm import tqdm

In [1]:
import torch
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig
from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer

BATCH_SIZE = 4
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
quantization = None

In [7]:
def initialize_model_and_tokenizer(ckpt_dir, direction, quantization):
    """Load the tokenizer and model, optionally quantized ("4-bit" or "8-bit")."""
    if quantization == "4-bit":
        qconfig = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
    elif quantization == "8-bit":
        # Double quantization and compute dtype are 4-bit options (bnb_4bit_*);
        # BitsAndBytesConfig has no bnb_8bit_* counterparts, so none are passed.
        qconfig = BitsAndBytesConfig(load_in_8bit=True)
    else:
        qconfig = None

    tokenizer = IndicTransTokenizer(direction=direction)
    model = AutoModelForSeq2SeqLM.from_pretrained(
        ckpt_dir,
        trust_remote_code=True,
        low_cpu_mem_usage=True,
        quantization_config=qconfig,
    )

    if qconfig is None:
        model = model.to(DEVICE)
        if DEVICE == "cuda":
            model.half()

    model.eval()

    return tokenizer, model


def batch_translate(input_sentences, src_lang, tgt_lang, model, tokenizer, ip):
    """Translate input_sentences in batches of BATCH_SIZE and return the translations in order."""
    translations = []
    for i in tqdm(range(0, len(input_sentences), BATCH_SIZE)):
        batch = input_sentences[i : i + BATCH_SIZE]

        # Preprocess the batch and extract entity mappings
        batch = ip.preprocess_batch(batch, src_lang=src_lang, tgt_lang=tgt_lang)

        # Tokenize the batch and generate input encodings
        inputs = tokenizer(
            batch,
            src=True,
            truncation=True,
            padding="longest",
            return_tensors="pt",
            return_attention_mask=True,
        ).to(DEVICE)

        # Generate translations using the model
        with torch.no_grad():
            generated_tokens = model.generate(
                **inputs,
                use_cache=True,
                min_length=0,
                max_length=256,
                num_beams=5,
                num_return_sequences=1,
            )

        # Decode the generated tokens into text
        generated_tokens = tokenizer.batch_decode(generated_tokens.detach().cpu().tolist(), src=False)

        # Postprocess the translations, including entity replacement
        translations += ip.postprocess_batch(generated_tokens, lang=tgt_lang)

        del inputs
        torch.cuda.empty_cache()

    return translations

### Indic to English Example

In [3]:
indic_en_ckpt_dir = "ai4bharat/indictrans2-indic-en-1B"  # ai4bharat/indictrans2-indic-en-dist-200M
indic_en_tokenizer, indic_en_model = initialize_model_and_tokenizer(indic_en_ckpt_dir, "indic-en", quantization)

ip = IndicProcessor(inference=True)

guj_sents = [
    """Sample No. : 0
    Inputs: નીચેના વાક્યને અલગ શબ્દોનો ઉપયોગ કરીને લખો: "દિલ્હીના સુપરસ્ટાર શિખર ધવન આઈપીએલના ઈતિહાસમાં સતત બે સદી ફટકારનાર પ્રથમ બેટ્સમેન બન્યો છે."
    Targets: "દિલ્હીના સુપરસ્ટાર શિખર ધવને સતત બે સદી ફટકારીને આઈપીએલના ઈતિહાસમાં આમ કરનાર પ્રથમ બેટ્સમેન બન્યો છે."
    Generated: "દિલ્હીના સુપરસ્ટાર શિખર ધવન આઈપીએલના ઈતિહાસમાં દિલ્હીના સુપરસ્ટાર શિખર ધવન આઈપીએલના ઈતિહાસમાં """,
]
src_lang, tgt_lang = "guj_Gujr", "eng_Latn"
en_translations = batch_translate(guj_sents, src_lang, tgt_lang, indic_en_model, indic_en_tokenizer, ip)


print(f"\n{src_lang} - {tgt_lang}")
for input_sentence, translation in zip(guj_sents, en_translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")

# flush the models to free the GPU memory
del indic_en_tokenizer, indic_en_model


guj_Gujr - eng_Latn
guj_Gujr: Sample No. : 0
    Inputs: નીચેના વાક્યને અલગ શબ્દોનો ઉપયોગ કરીને લખો: "દિલ્હીના સુપરસ્ટાર શિખર ધવન આઈપીએલના ઈતિહાસમાં સતત બે સદી ફટકારનાર પ્રથમ બેટ્સમેન બન્યો છે."
    Targets: "દિલ્હીના સુપરસ્ટાર શિખર ધવને સતત બે સદી ફટકારીને આઈપીએલના ઈતિહાસમાં આમ કરનાર પ્રથમ બેટ્સમેન બન્યો છે."
    Generated: "દિલ્હીના સુપરસ્ટાર શિખર ધવન આઈપીએલના ઈતિહાસમાં દિલ્હીના સુપરસ્ટાર શિખર ધવન આઈપીએલના ઈતિહાસમાં 
eng_Latn: Sample No.: 0 Inputs: Type the following sentence using different words: "Delhi superstar Shikhar Dhawan has become the first batsman in IPL history to score two consecutive centuries." Targets: "Delhi superstar Shikhar Dhawan has become the first batsman in IPL history to score two consecutive centuries." Generated: "Delhi superstar Shikhar Dhawan in IPL history Delhi superstar Shikhar Dhawan in IPL history."
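
The `del` statement in the cell above removes the Python references to the model, but PyTorch's caching allocator may keep the freed GPU memory reserved for reuse. A minimal sketch of a fuller cleanup step follows, assuming it is called after the `del`; `release_memory` is a hypothetical helper, not part of the notebook or of IndicTrans2.

```python
import gc


def release_memory():
    """Collect garbage, then release cached CUDA blocks if a GPU is in use.

    Returns True if the CUDA cache was emptied, False otherwise.
    """
    gc.collect()
    try:
        import torch
    except ImportError:
        return False
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        return True
    return False


print("CUDA cache emptied:", release_memory())
```

On a CPU-only runtime the function only runs garbage collection.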


In [4]:
input_file_path = '/content/mT5-gujarati-summarization-test-results-ep29.txt'
with open(input_file_path, 'r', encoding='utf-8') as file:
    gujarati_samples = file.read()

gujarati_samples = gujarati_samples.split('\n\n')
len(gujarati_samples)

202
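
Splitting on `'\n\n'` keeps any empty strings produced by a trailing newline or by runs of several blank lines, and those would be passed to the model as samples. The sketch below shows a more defensive split, assuming samples are separated by one or more blank lines; `split_samples` and the sample text are illustrative only.

```python
import re


def split_samples(text):
    """Split `text` on runs of blank lines and drop empty chunks."""
    chunks = re.split(r"\n\s*\n", text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


raw = "Sample No. : 0\nInputs: a\n\nSample No. : 1\nInputs: b\n\n\n"
samples = split_samples(raw)
print(len(samples))  # 2
```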

In [9]:
indic_en_ckpt_dir = "ai4bharat/indictrans2-indic-en-1B"  # ai4bharat/indictrans2-indic-en-dist-200M
indic_en_tokenizer, indic_en_model = initialize_model_and_tokenizer(indic_en_ckpt_dir, "indic-en", quantization)

In [10]:
src_lang, tgt_lang = "guj_Gujr", "eng_Latn"
en_translations = batch_translate(gujarati_samples, src_lang, tgt_lang, indic_en_model, indic_en_tokenizer, ip)

100%|██████████| 51/51 [02:23<00:00,  2.82s/it]
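
The progress bar reports 51 iterations because `batch_translate` steps through the 202 samples in slices of `BATCH_SIZE = 4`, and the final slice holds the 2 leftover samples. The sketch below isolates that slicing logic without loading a model; `iter_batches` is a hypothetical helper that mirrors the loop in `batch_translate`.

```python
import math


def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


samples = list(range(202))  # stand-in for the 202 text samples
batch_sizes = [len(batch) for batch in iter_batches(samples, 4)]
print(len(batch_sizes), batch_sizes[-1])  # 51 2
assert len(batch_sizes) == math.ceil(len(samples) / 4)
```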


In [13]:
output_file_path = "/content/mT5-gujarati-summarization-test-results-ep29-en-IndicTrans2-1B.txt"
with open(output_file_path, 'w', encoding='utf-8') as file:
    file.write('\n\n'.join(en_translations))
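
Because the output joins the translations with blank lines, it can be read back with the same split that was used for the input. The following sketch is a round-trip check, assuming no translation itself contains a blank line; the file name and sample strings are placeholders, and `write_samples` and `read_samples` are hypothetical helpers.

```python
import os
import tempfile


def write_samples(path, samples):
    """Write samples to `path`, separated by blank lines."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n\n".join(samples))


def read_samples(path):
    """Read samples back using the same blank-line separator."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read().split("\n\n")


translations = ["first translation", "second translation"]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "translations.txt")
    write_samples(path, translations)
    restored = read_samples(path)

print(restored == translations)  # True
```

A matching count between the source samples and the restored translations suggests the two files stay aligned sample by sample.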