This section explains the key concepts and components used in this machine translation project:

### **1. Libraries and Tools**

**a. `transformers` Library:**
   - Developed by Hugging Face, this library provides pre-trained models and tools for natural language processing (NLP) tasks.
   - Models include BERT, GPT, T5, and many others, which are pre-trained on large corpora and can be fine-tuned for specific tasks.
   - For sequence-to-sequence tasks like translation, models like T5 and MarianMT are commonly used.

**b. `datasets` Library:**
   - Provides tools to easily load and preprocess datasets for machine learning tasks.
   - It supports a wide range of datasets and formats, making it convenient for handling large amounts of data.

**c. `tensorflow` Library:**
   - An open-source machine learning framework developed by Google.
   - Used for building and training deep learning models. TensorFlow supports various neural network architectures, including sequence-to-sequence models.

**d. `sacrebleu` Library:**
   - Used for evaluating the quality of machine translations.
   - Computes BLEU (Bilingual Evaluation Understudy) scores, which measure the n-gram overlap between a system translation and one or more reference translations.
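
To make the idea behind BLEU more concrete, here is a minimal sketch of clipped n-gram precision, the quantity BLEU is built from. It is not how `sacrebleu` is implemented: full BLEU also combines several n-gram orders and applies a brevity penalty, and the helper names `ngrams` and `clipped_precision` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as often as it appears in
    the reference ("clipping"), so repeating a correct word does not help.
    """
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


# "the" appears 4 times in the candidate but only 2 times in the reference,
# so only 2 of the 4 unigrams are credited.
print(clipped_precision("the the the the", "the cat is on the mat", 1))
```

For actual evaluation, the `sacrebleu` package installed in this notebook is the better choice, since it also standardizes tokenization.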

**e. `accelerate` Library:**
   - Simplifies running PyTorch training and inference across multiple GPUs or TPUs.
   - Handles device placement and distributed-training setup so the same script can run on different hardware.

### **2. Sequence-to-Sequence (Seq2Seq) Models**

**a. Seq2Seq Architecture:**
   - A model architecture for tasks where the input and output sequences may differ in length, such as machine translation.
   - Consists of an **encoder** and a **decoder**:
     - **Encoder**: Processes the input sequence (e.g., an English sentence) and converts it into a sequence of hidden states that serve as context.
     - **Decoder**: Attends to that context and generates the output sequence (e.g., the Hindi translation) one token at a time.

**b. Pre-trained Models:**
   - Models like `Helsinki-NLP/opus-mt-en-hi` are pre-trained on large parallel corpora for English-Hindi translation.
   - These models have already learned to map English sentences to Hindi sentences from extensive data and can be fine-tuned on specific datasets.

### **3. Tokenization**

**a. Tokenizer:**
   - Splits text into tokens (words or subword units) and maps each token to an integer ID that the model can process.
   - The tokenizer used here is designed to work with the specific model checkpoint, handling subword units and special tokens.
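
As a rough illustration of subword tokenization, the sketch below splits words by greedy longest match against a tiny vocabulary. This is a toy: the vocabulary `VOCAB` and the functions `tokenize` and `encode` are hypothetical, and the real checkpoint uses its own learned SentencePiece vocabulary with a different algorithm.

```python
# Toy vocabulary (hypothetical). Real checkpoints ship their own learned vocabulary.
VOCAB = {"<pad>": 0, "</s>": 1, "<unk>": 2, "learn": 3, "ing": 4, "code": 5, "I": 6, "am": 7}


def tokenize(text):
    """Split on whitespace, then greedily match the longest known subword."""
    pieces = []
    for word in text.split():
        start = 0
        while start < len(word):
            # Try the longest remaining substring first, then shrink it.
            for end in range(len(word), start, -1):
                if word[start:end] in VOCAB:
                    pieces.append(word[start:end])
                    start = end
                    break
            else:
                # No known subword starts here: emit <unk> for the rest of the word.
                pieces.append("<unk>")
                break
    return pieces


def encode(text):
    """Map text to token IDs and append the end-of-sequence token."""
    return [VOCAB[piece] for piece in tokenize(text)] + [VOCAB["</s>"]]


print(tokenize("I am learning"))  # "learning" is split into "learn" + "ing"
print(encode("I am learning"))
```

Splitting rare words into known pieces is what lets a fixed-size vocabulary cover text it has never seen as whole words.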

**b. Preprocessing:**
   - **Input Tokenization**: Converts English text into tokens that the model can process.
   - **Target Tokenization**: Converts Hindi text into tokens that are used as labels for training.

**c. Padding and Truncation:**
   - **Padding**: Appending a special padding token to shorter sequences so that all sequences in a batch have the same length.
   - **Truncation**: Cutting off tokens that exceed the maximum length allowed by the model.
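
The two operations can be sketched on plain lists of token IDs. This is a simplified illustration, assuming a padding ID of `0` and a hypothetical helper `pad_and_truncate`; the real tokenizer also takes care to keep special tokens such as the end-of-sequence marker when it truncates.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    # Truncation: drop any tokens beyond max_length.
    truncated = [seq[:max_length] for seq in sequences]
    # Padding: extend shorter sequences so the batch is rectangular.
    longest = max(len(seq) for seq in truncated)
    return [seq + [pad_id] * (longest - len(seq)) for seq in truncated]


batch = pad_and_truncate([[5, 6, 7, 8, 9], [3, 4]], max_length=4)
print(batch)  # first sequence is cut to 4 tokens, second is padded to 4
```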

### **4. Data Collation**

**a. Data Collator:**
   - Combines multiple examples into a batch, handling padding and other necessary transformations.
   - `DataCollatorForSeq2Seq` is specifically designed for sequence-to-sequence tasks, ensuring that input and output sequences are correctly batched.
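
A simplified picture of what a seq2seq collator does is sketched below. The function `collate` is hypothetical and only mimics the core behaviour: inputs are padded with the padding ID, an attention mask marks real tokens, and labels are padded with `-100` (the default `label_pad_token_id` of `DataCollatorForSeq2Seq`) so that padded positions are ignored by the loss. The padding ID of `0` is an assumption, and the real collator does more, such as preparing decoder inputs when a model is passed.

```python
def collate(examples, pad_id=0, label_pad_id=-100):
    """Pad a list of tokenized examples into one rectangular batch."""
    max_input = max(len(ex["input_ids"]) for ex in examples)
    max_label = max(len(ex["labels"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        n_input, n_label = len(ex["input_ids"]), len(ex["labels"])
        batch["input_ids"].append(ex["input_ids"] + [pad_id] * (max_input - n_input))
        # 1 marks a real token, 0 marks padding the model should not attend to.
        batch["attention_mask"].append([1] * n_input + [0] * (max_input - n_input))
        # Label padding uses a separate value so the loss can skip it.
        batch["labels"].append(ex["labels"] + [label_pad_id] * (max_label - n_label))
    return batch


examples = [
    {"input_ids": [11, 12, 13], "labels": [21, 22]},
    {"input_ids": [14], "labels": [23, 24, 25]},
]
print(collate(examples))
```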

### **5. Training Configuration**

**a. Optimizer:**
   - **AdamW**: A variant of the Adam optimizer that applies weight decay directly to the weights (decoupled from the gradient update) for regularization.
   - **Learning Rate**: Scales the size of each weight update during training.
   - **Weight Decay**: Helps prevent overfitting by shrinking weights toward zero at every step, which discourages large weights.
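
The interaction of these three settings can be seen in a single update step. The sketch below follows the general AdamW formulation for one scalar weight; it is not the exact code of the optimizer class used in this notebook, and the function name `adamw_step` and its default hyperparameters are assumptions for illustration.

```python
import math


def adamw_step(w, grad, m, v, t, lr=2e-5, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """Apply one AdamW update to a single scalar weight; return (w, m, v)."""
    # Running averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction compensates for m and v starting at zero (t counts from 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    adam_update = m_hat / (math.sqrt(v_hat) + eps)
    # Decoupled weight decay: the weight itself is shrunk, independent of the gradient.
    w = w - lr * (adam_update + weight_decay * w)
    return w, m, v


w, m, v = adamw_step(w=1.0, grad=0.5, m=0.0, v=0.0, t=1, lr=0.1)
print(w)  # slightly below 0.9: a 0.1 Adam step plus a 0.001 decay step
```

With `weight_decay=0.0` the same step lands at about 0.9, which isolates the contribution of the decay term.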

**b. Batch Size:**
   - Number of samples processed together in one forward/backward pass.
   - Affects memory usage and training speed.

**c. Epochs:**
   - Number of times the entire dataset is passed through the model during training.
   - More epochs can improve performance but may also lead to overfitting.
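
Batch size and epochs together determine how many optimizer steps a run performs. The helper below is a hypothetical sketch of that arithmetic; the example numbers are made up, and whether the final partial batch is kept depends on how the data pipeline is configured.

```python
import math


def training_steps(num_examples, batch_size, epochs, drop_remainder=False):
    """Total number of optimizer steps for a training run."""
    if drop_remainder:
        # The last incomplete batch of each epoch is discarded.
        steps_per_epoch = num_examples // batch_size
    else:
        # The last incomplete batch still counts as one step.
        steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


print(training_steps(1000, batch_size=16, epochs=1))
print(training_steps(1000, batch_size=16, epochs=1, drop_remainder=True))
```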

### **6. Inference**

**a. Model Generation:**
   - **`generate` Method**: Used to produce sequences from the model. In translation tasks, it generates the translated text from the input text.
   - **Max Length**: Sets the maximum length for generated sequences to prevent excessively long outputs.

**b. Decoding:**
   - Converts generated token IDs back into human-readable text.
   - Handles special tokens and ensures that the output is understandable.
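
Generation and decoding can be sketched together with a toy stand-in for the model. Everything here is hypothetical: `toy_next_token` replaces the real decoder, the IDs and the transliterated vocabulary are invented, and the loop shows greedy decoding only, whereas the real `generate` method also supports other strategies such as beam search.

```python
PAD_ID = 0
EOS_ID = 1
# Invented vocabulary for illustration only.
ID_TO_TOKEN = {0: "<pad>", 1: "</s>", 2: "namaste", 3: "duniya"}


def toy_next_token(generated):
    """Stand-in for the decoder: returns the next token ID given the IDs so far."""
    script = [2, 3, EOS_ID]
    return script[len(generated)] if len(generated) < len(script) else EOS_ID


def greedy_generate(next_token_fn, max_length):
    """Append one token at a time until end-of-sequence or max_length is reached."""
    generated = []
    while len(generated) < max_length:
        token = next_token_fn(generated)
        generated.append(token)
        if token == EOS_ID:
            break
    return generated


def decode(ids, skip_special_tokens=True):
    """Turn token IDs back into text, optionally dropping special tokens."""
    special = {PAD_ID, EOS_ID}
    tokens = [ID_TO_TOKEN[i] for i in ids if not (skip_special_tokens and i in special)]
    return " ".join(tokens)


ids = greedy_generate(toy_next_token, max_length=128)
print(ids)
print(decode(ids))
```

Lowering `max_length` to 2 stops the loop before the end-of-sequence token is produced, which is the cutoff behaviour that the maximum length setting provides.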

### **Summary of Workflow**

1. **Setup Environment**: Install necessary libraries and tools.
2. **Load and Preprocess Data**: Obtain translation pairs and tokenize them.
3. **Configure Model**: Load a pre-trained Seq2Seq model and prepare it for fine-tuning.
4. **Train Model**: Use the prepared dataset and defined parameters to train the model.
5. **Perform Inference**: Test the model with new input and decode the results.

Each component plays a crucial role in building a functional and effective translation model, and understanding these concepts helps in making adjustments and improvements to the workflow.

!nvidia-smi

!pip install datasets transformers[sentencepiece] sacrebleu -q

!pip install --upgrade accelerate
!pip uninstall -y transformers accelerate
!pip install transformers accelerate

import os
import sys
import transformers
import tensorflow as tf
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq
from transformers import AdamWeightDecay

model_checkpoint = "Helsinki-NLP/opus-mt-en-hi"

raw_datasets = load_dataset("cfilt/iitb-english-hindi")

raw_datasets

raw_datasets['train'][1]

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

max_input_length = 128
max_target_length = 128

source_lang = "en"
target_lang = "hi"

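def preprocess_function(examples):
    # Each entry in examples["translation"] is a dict keyed by language code.
    inputs = [ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]

    # Tokenize the English source sentences.
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Tokenize the Hindi targets in target mode; text_target replaces the
    # deprecated tokenizer.as_target_tokenizer() context manager.
    labels = tokenizer(text_target=targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs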

tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

batch_size = 16
learning_rate = 2e-5
weight_decay = 0.01
num_train_epochs = 1

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")

generation_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf", pad_to_multiple_of=128)

tf.__version__

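# NOTE: this trains on the "test" split rather than "train", presumably to keep
# the run short; pass tokenized_datasets["train"] for a full fine-tune.
train_dataset = model.prepare_tf_dataset(
    tokenized_datasets["test"],
    batch_size=batch_size,
    shuffle=True,
    collate_fn=data_collator,
)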


train_dataset

validation_dataset = model.prepare_tf_dataset(
    tokenized_datasets["validation"],
    batch_size=batch_size,
    shuffle=False,
    collate_fn=generation_data_collator,
)

generation_dataset = model.prepare_tf_dataset(
    tokenized_datasets["validation"],
    batch_size=8,
    shuffle=False,
    collate_fn=generation_data_collator,
)

optimizer = AdamWeightDecay(learning_rate=learning_rate, weight_decay_rate=weight_decay)
model.compile(optimizer=optimizer)

model.fit(train_dataset, validation_data=validation_dataset, epochs=num_train_epochs)

## Inference

model.save_pretrained("tf_model/")

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = TFAutoModelForSeq2SeqLM.from_pretrained("tf_model/")

input_text  = "I am learning Coding. How are you"

tokenized = tokenizer([input_text], return_tensors='np')
out = model.generate(**tokenized, max_length=128)
print(out)

# Decoding needs no target-tokenizer context; skip_special_tokens drops
# padding and end-of-sequence markers from the output text.
print(tokenizer.decode(out[0], skip_special_tokens=True))

