### Imports 

In [1]:
# Data handling and numerics
import pandas as pd
import numpy as np

# Third-party library imports
from datasets import Dataset, DatasetDict, load_dataset
from evaluate import load as load_metric  # Renamed for clarity when loading metrics
from matplotlib import pyplot as plt

# Transformers and related libraries
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    pipeline,
    BitsAndBytesConfig
)
from peft import (
    get_peft_model,
    LoraConfig,
    TaskType,
    prepare_model_for_kbit_training
)

### **Dataset Sampling**

In [2]:
# Slice of the CSV to load; use e.g. "train[:20%]" to load only the first 20% of the rows
percent_data_select = "train[:100%]"
# Load the selected slice of the dataset (100% here)
dataset = load_dataset(
    "csv", data_files={"train": "../Datasets/WikiMatrix/Processed/clean_en-hi.csv"},
    split=percent_data_select
)

# Split into train and test sets (e.g., 80% train, 20% test)
train_test_split = dataset.train_test_split(test_size=0.2, seed=42)

# Further split the test set into validation and test (e.g., 50-50 split of the 20%)
validation_test_split = train_test_split["test"].train_test_split(test_size=0.5, seed=42)

# Combine splits into a DatasetDict
raw_dataset = {
    "train": train_test_split["train"],
    "validation": validation_test_split["train"],
    "test": validation_test_split["test"]
}

dataset = DatasetDict(raw_dataset)

# Inspect the resulting dataset
print(dataset)



   - `percent_data_select = "train[:100%]"`: Determines the proportion of data to load. A slice such as `"train[:20%]"` loads the first 20% of the rows; it is a contiguous slice, not a random sample.
   - This is useful for:
     - Reducing computational load during development or testing.
     - Experimenting with smaller portions of the data.
     
**Dataset Splitting**
   - `train_test_split`: Divides the dataset into **training** and **test** sets. With `test_size=0.2` this is an 80-20 split, and the `seed` parameter fixes the shuffling so the split is reproducible.
   - `validation_test_split`: Splits the held-out 20% in half into **validation** and **test** sets, so the final dataset structure is:
     - Training Set: 80%
     - Validation Set: 10%
     - Test Set: 10%
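
The proportions above can be checked with a little arithmetic. This is a minimal sketch in plain Python, not the `datasets` implementation: `split_sizes` is a hypothetical helper, the dataset size of 10,000 rows is made up, and the library's exact rounding may differ by a row for sizes that do not divide evenly.

```python
def split_sizes(n_rows, test_size=0.2, holdout_test_size=0.5):
    """Approximate row counts for the two-stage split described above."""
    held_out = round(n_rows * test_size)          # rows removed from training by the first split
    train = n_rows - held_out
    test = round(held_out * holdout_test_size)    # second split: half of the held-out rows
    validation = held_out - test
    return {"train": train, "validation": validation, "test": test}


sizes = split_sizes(10_000)  # 10_000 is a hypothetical dataset size
print(sizes)  # {'train': 8000, 'validation': 1000, 'test': 1000}
```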

**Combining into a DatasetDict**
   - The `DatasetDict` organizes these splits (`train`, `validation`, `test`) into a cohesive structure. This is a common practice when working with Hugging Face's `datasets` library, as it standardizes access to each split.

**Inspection**
   - `print(dataset)`: Displays the dataset structure, providing details about the splits, number of samples, and features.

In [3]:
# Empty VRAM cache
import torch
import gc
gc.collect()
torch.cuda.empty_cache()

### **Configuring 4-bit Quantization for Efficient Model Loading**

* This code configures and loads a pre-trained sequence-to-sequence model with 4-bit quantization, a technique that stores the weights in 4 bits to cut memory usage, usually with only a small loss in accuracy. Each setting is explained after the code cell.


In [None]:
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Enable 4-bit quantization
    bnb_4bit_use_double_quant=True,  # Also quantize the quantization constants to save extra memory
    bnb_4bit_quant_type="nf4",  # Use 4-bit NormalFloat quantization
    bnb_4bit_compute_dtype=torch.float16  # Use FP16 for computation
)

model = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/m2m100_418M",
    quantization_config=quantization_config
)

tokenizer = AutoTokenizer.from_pretrained('facebook/m2m100_418M', src_lang="en", tgt_lang="hi")



<img src="4&16&32 bits.png" alt="drawing" width="400"/>


**1. Quantization Configuration**
The `BitsAndBytesConfig` object defines the settings for 4-bit quantization:
- **`load_in_4bit=True`**:
  - Enables loading the model weights in 4-bit precision, reducing memory usage compared to standard 16-bit (FP16) or 32-bit (FP32) formats.
  
- **`bnb_4bit_use_double_quant=True`**:
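  - Activates double quantization, which quantizes the quantization constants themselves in a second pass. This saves additional memory (roughly 0.4 bits per parameter according to the QLoRA paper) rather than improving accuracy.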

- **`bnb_4bit_quant_type="nf4"`**:
  - Specifies the quantization type as **NormalFloat4 (NF4)**, a 4-bit data type whose levels are spaced for normally distributed weights. For such weights it typically preserves precision better than uniform 4-bit integer or FP4 quantization.

- **`bnb_4bit_compute_dtype=torch.float16`**:
  - Sets the computation to use FP16 (16-bit floating-point), balancing performance and accuracy.
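
To make the effect of double quantization concrete, the sketch below estimates how many extra bits per weight are spent on quantization constants. The block sizes (64 weights per block, 256 constants per second-level block) are the figures described in the QLoRA paper and are treated here as assumptions about the library's defaults; `constant_overhead_bits` is a hypothetical helper, not a bitsandbytes function.

```python
def constant_overhead_bits(block_size=64, double_quant=False, second_block_size=256):
    """Extra bits stored per weight for the quantization constants."""
    if not double_quant:
        # One FP32 constant for every block of weights
        return 32 / block_size
    # 8-bit constants per block, plus one FP32 constant per group of constants
    return 8 / block_size + 32 / (block_size * second_block_size)


plain = constant_overhead_bits()
double = constant_overhead_bits(double_quant=True)
print(f"without double quantization: {plain:.3f} bits per weight")
print(f"with double quantization:    {double:.3f} bits per weight")
print(f"saving:                      {plain - double:.3f} bits per weight")
```

Under these assumptions the saving is about 0.37 bits per weight, which is small per parameter but adds up over hundreds of millions of weights.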

---

**2. Model Loading**
The model is loaded using `AutoModelForSeq2SeqLM.from_pretrained`, where:
- **`facebook/m2m100_418M`**:
  - The name of the pre-trained multilingual model designed for translation tasks.
- **`quantization_config=quantization_config`**:
  - Applies the 4-bit quantization settings during model loading, enabling reduced VRAM usage while maintaining effective performance.

---

**3. Tokenizer Setup**
The tokenizer is initialized for the same model:
- **`src_lang="en"`** and **`tgt_lang="hi"`**:
  - Specifies English as the source language and Hindi as the target language for translation.

---

**Why Use 4-bit Quantization?**
- **Memory Efficiency**: Reduces the size of the model, making it possible to run on GPUs with limited VRAM.
- **Inference Speed**: Not guaranteed to improve. The 4-bit weights are dequantized to the compute dtype for each matrix multiplication, so the main benefit is memory; measure latency on your own hardware.
- **Retained Accuracy**: Advanced quantization techniques (e.g., NF4) minimize accuracy loss during model optimization.

This configuration is particularly useful when fitting large models onto resource-constrained hardware.
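
A back-of-the-envelope estimate shows the scale of the memory saving. This is a rough sketch: the parameter count is an assumption read off the model name (418M), `weight_memory_gb` is a hypothetical helper, and the figures cover weights only. In practice the footprint is larger, since quantization typically applies to the linear layers while embeddings, quantization constants, and activations still take space.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory needed to store n_params weights at the given precision, in GB."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 418_000_000  # assumption: taken from the model name "m2m100_418M"

estimates = {
    name: weight_memory_gb(N_PARAMS, bits)
    for name, bits in [("fp32", 32), ("fp16", 16), ("nf4", 4)]
}
for name, gb in estimates.items():
    print(f"{name}: {gb:.3f} GB")
```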

### **Preprocessing Data for Translation Tasks**

This code defines a preprocessing function and applies it to a dataset for preparing data suitable for a machine translation task using a sequence-to-sequence (Seq2Seq) model.

---

In [23]:
def preprocess_function(examples, src_lang, tgt_lang):
    # Prefix each source sentence with a plain-text task instruction
    inputs = [f"translate {src_lang} to {tgt_lang}: " + ex for ex in examples[src_lang]]
    targets = examples[tgt_lang]
    # text_target tokenizes the targets and stores their IDs under "labels";
    # it replaces the deprecated as_target_tokenizer() context manager
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    # No padding here: DataCollatorForSeq2Seq pads each batch later and
    # fills label padding with -100 so the loss ignores it
    return model_inputs

tokenized_datasets_english_to_hindi = dataset.map(lambda x: preprocess_function(x, "English", "Hindi"), batched=True)


**Preprocessing Function**
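The `preprocess_function` processes raw examples from the dataset into the format required by the tokenizer and model. It prefixes each source sentence with a task instruction, tokenizes sources and targets with truncation at 128 tokens, and stores the target token IDs under `labels`. Because `map` is called with `batched=True`, the function receives a dictionary of column lists rather than a single row.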
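
The shape of the data flowing through a batched preprocessing function can be illustrated without a real tokenizer. In this sketch `toy_tokenize` is a hypothetical whitespace tokenizer standing in for the model's tokenizer, and the two sentence pairs are made-up examples; only the structure (a dictionary of lists in, a dictionary of lists out) mirrors the real function.

```python
def toy_tokenize(text, max_length):
    """Stand-in for a real tokenizer: split on whitespace and truncate."""
    return text.split()[:max_length]


def preprocess_batch(examples, src_col, tgt_col, max_length=8):
    # With batched=True, `examples` maps each column name to a list of values
    inputs = [f"translate {src_col} to {tgt_col}: {text}" for text in examples[src_col]]
    return {
        "input_ids": [toy_tokenize(text, max_length) for text in inputs],
        "labels": [toy_tokenize(text, max_length) for text in examples[tgt_col]],
    }


batch = {
    "English": ["Good morning", "Thank you very much"],
    "Hindi": ["सुप्रभात", "बहुत बहुत धन्यवाद"],
}
out = preprocess_batch(batch, "English", "Hindi")
print(out["input_ids"][0])  # ['translate', 'English', 'to', 'Hindi:', 'Good', 'morning']
print(out["labels"][1])     # ['बहुत', 'बहुत', 'धन्यवाद']
```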

### **Applying LoRA (Low-Rank Adaptation) to a Seq2Seq Model**

This code configures and applies **Low-Rank Adaptation (LoRA)** to a sequence-to-sequence (Seq2Seq) model. LoRA is a parameter-efficient fine-tuning technique that reduces the number of trainable parameters while maintaining model performance.

In [25]:
# Define LoRA configuration
lora_config = LoraConfig(
    r=16,  # Rank of the low-rank matrices
    lora_alpha=32,  # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Target specific layers
    lora_dropout=0.1,
    bias="none",
    task_type="SEQ_2_SEQ_LM"  # Task type for sequence-to-sequence models
)

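# Prepare the quantized model for training (imported above; recommended by PEFT for k-bit models)
model = prepare_model_for_kbit_training(model)

# Apply LoRA to the model
model = get_peft_model(model, lora_config)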

**LoRA Configuration**
The `LoraConfig` object defines the key parameters for applying LoRA:

- **`r=16`**:
  - Specifies the rank of the low-rank matrices used for adaptation. A lower rank reduces the number of trainable parameters.

- **`lora_alpha=32`**:
  - A scaling factor: the low-rank update is multiplied by `lora_alpha / r` (here 32 / 16 = 2) before it is added to the frozen layer's output.

- **`target_modules=["q_proj", "v_proj"]`**:
  - Specifies which layers of the model should be adapted. Here, the attention layers' query (`q_proj`) and value (`v_proj`) projections are targeted.

- **`lora_dropout=0.1`**:
  - Applies 10% dropout to the inputs of the LoRA layers during training to reduce overfitting.

- **`bias="none"`**:
  - Indicates that no bias terms will be trained or adapted in the LoRA modules.

- **`task_type="SEQ_2_SEQ_LM"`**:
  - Specifies the task type as sequence-to-sequence language modeling, ensuring compatibility with the model architecture.
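
The configuration above translates into a concrete number of trainable parameters. The following is a rough estimate under stated assumptions: a hidden size of 1024 and 12 encoder plus 12 decoder layers for `facebook/m2m100_418M` (check the model's config to confirm), with each decoder layer holding both a self-attention and a cross-attention block. `lora_params` is a hypothetical helper; compare the result with what `model.print_trainable_parameters()` reports.

```python
def lora_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumptions about the base model's architecture
D_MODEL = 1024
ENCODER_LAYERS = 12
DECODER_LAYERS = 12

R, ALPHA = 16, 32  # values from the LoraConfig above

# One attention block per encoder layer; self- and cross-attention per decoder layer
attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS
adapted_modules = attention_blocks * 2  # q_proj and v_proj in each block
trainable = adapted_modules * lora_params(D_MODEL, D_MODEL, R)

print(f"adapted modules:      {adapted_modules}")   # 72
print(f"trainable parameters: {trainable:,}")       # 2,359,296
print(f"update scaling:       {ALPHA / R}")         # 2.0
```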

---
**Applying LoRA to the Model**
```python
model = get_peft_model(model, lora_config)
```
- **`get_peft_model`**:
  - Modifies the existing model by injecting LoRA layers into the specified `target_modules`.
  - This makes the model efficient for fine-tuning by freezing the original model weights and learning only the low-rank adaptation parameters.
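
How a LoRA layer combines frozen and trainable weights can be shown with tiny matrices. This is an illustrative sketch in plain Python, not the PEFT implementation: `matvec` and `lora_forward` are hypothetical helpers, and the numbers are made up. LoRA conventionally initializes `B` to zeros, so the adapted layer starts out identical to the base layer.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    base = matvec(W, x)               # frozen pre-trained weights
    update = matvec(B, matvec(A, x))  # trainable low-rank path: B (A x)
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 2.0], [3.0, 4.0]]  # frozen 2x2 weight matrix
A = [[0.5, -0.5]]             # r x d_in, with r = 1
x = [2.0, 1.0]

B_zero = [[0.0], [0.0]]       # d_out x r, zero-initialized
untrained = lora_forward(W, A, B_zero, x, alpha=2, r=1)
print(untrained)  # [4.0, 10.0], identical to the base layer's output

B_trained = [[1.0], [-1.0]]   # hypothetical values after some training
trained = lora_forward(W, A, B_trained, x, alpha=2, r=1)
print(trained)    # [5.0, 9.0]
```

Only `A` and `B` would receive gradients; `W` never changes, which is why the base model stays intact.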

---


### **Why Use LoRA?**
1. **Parameter Efficiency**:
   - Instead of fine-tuning all model parameters, only a small set of low-rank matrices is trained, drastically reducing memory and computational costs.

2. **Scalability**:
   - LoRA enables fine-tuning of large models on hardware with limited resources.

3. **Task-Specific Adaptation**:
   - LoRA allows for task-specific tuning without altering the base model, making it ideal for transfer learning.

---

This setup is well-suited for training Seq2Seq models like translation, summarization, or text generation tasks on resource-constrained hardware while maintaining competitive performance.

In [None]:
model.print_trainable_parameters()

### **Data Collator for Sequence-to-Sequence Tasks**
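- The code below creates a `DataCollatorForSeq2Seq` object, which prepares batches for sequence-to-sequence (Seq2Seq) models during training or evaluation. It pads inputs and labels to the longest sequence in each batch, and label padding uses `-100` so the loss ignores those positions.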

In [27]:
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
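
The label-padding behaviour can be sketched in a few lines. This is a simplified illustration, not the collator's actual code: `pad_labels` is a hypothetical helper and the token IDs are made up. The value `-100` is the index that PyTorch's cross-entropy loss ignores by default, which is why it is used for padded label positions.

```python
def pad_labels(batch_labels, label_pad_token_id=-100):
    """Right-pad every label sequence to the longest one in the batch."""
    longest = max(len(labels) for labels in batch_labels)
    return [
        labels + [label_pad_token_id] * (longest - len(labels))
        for labels in batch_labels
    ]


batch = [[5, 17, 2], [8, 2]]  # hypothetical token IDs
padded = pad_labels(batch)
print(padded)  # [[5, 17, 2], [8, 2, -100]]
```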

### **Training a Seq2Seq Model with `Seq2SeqTrainer`**

This code sets up and trains a sequence-to-sequence (Seq2Seq) model using Hugging Face's `Seq2SeqTrainer`, a convenient API designed for tasks like translation, summarization, and other text-to-text tasks.


In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    do_train=True,
    do_eval=True,
    evaluation_strategy="epoch",  # Evaluate after each epoch
    save_strategy="epoch",  # Save after each epoch (match evaluation strategy)
    num_train_epochs=10,
    learning_rate=2e-5,
    warmup_steps=500,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.1,
    gradient_accumulation_steps=4,
    fp16=True,
    logging_steps=10,
    lr_scheduler_type="linear",  # Linear decay after warmup
    metric_for_best_model="eval_loss",
    predict_with_generate=True,
    report_to="none",  # The string "none" disables all integrations; use "wandb" if integrated
)


trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets_english_to_hindi["train"],
    eval_dataset=tokenized_datasets_english_to_hindi["validation"],
)

trainer.train()

**1. Training Arguments**
The `Seq2SeqTrainingArguments` defines hyperparameters and configurations for training. Here’s a breakdown of its key components:

- **General Training Configuration**:
  - **`output_dir="./results"`**: Specifies the directory to save model checkpoints and logs.
  - **`do_train=True`**: Enables training.
  - **`do_eval=True`**: Enables evaluation.

- **Evaluation and Checkpointing**:
  - **`evaluation_strategy="epoch"`**: Evaluates the model at the end of every epoch.
  - **`save_strategy="epoch"`**: Saves model checkpoints at the end of every epoch.
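  - **`metric_for_best_model="eval_loss"`**: Names the metric used to rank checkpoints. The best checkpoint is only reloaded at the end of training if `load_best_model_at_end=True` is also set, which this configuration does not do.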

- **Hyperparameters**:
  - **`num_train_epochs=10`**: Trains the model for 10 epochs.
  - **`learning_rate=2e-5`**: Sets the initial learning rate for the optimizer.
  - **`warmup_steps=500`**: Gradually increases the learning rate for the first 500 steps to stabilize training.
  - **`weight_decay=0.1`**: Applies weight decay to regularize the model.

- **Batching and Gradient Updates**:
  - **`per_device_train_batch_size=8`**: Sets the training batch size per GPU.
  - **`per_device_eval_batch_size=8`**: Sets the evaluation batch size per GPU.
  - **`gradient_accumulation_steps=4`**: Accumulates gradients over 4 steps before updating model weights, giving an effective batch size of 8 × 4 = 32 per device.

- **Optimization**:
  - **`fp16=True`**: Enables mixed-precision training for faster computation and reduced memory usage.
  - **`lr_scheduler_type="linear"`**: Uses a linear learning rate decay after the warmup phase.

- **Logging**:
  - **`logging_steps=10`**: Logs training metrics every 10 steps.

- **Reporting**:
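  - **`report_to`**: Selects the experiment-tracking integrations. Pass the string `"none"` to disable them all; in Transformers v4 the Python value `None` falls back to every installed integration. Use `"wandb"` to log to Weights & Biases.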

- **Prediction**:
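  - **`predict_with_generate=True`**: Makes evaluation use `generate()` to produce full sequences (e.g., translations) for metric computation. It only has an effect when a `compute_metrics` function is passed to the trainer; without one, evaluation reports the loss only.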
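
The interaction between batch size, gradient accumulation, warmup, and linear decay is easier to see with numbers. This sketch assumes a single GPU and a hypothetical training set of 80,000 examples; `linear_schedule_lr` is a hypothetical helper that mirrors the shape of a linear warmup-then-decay schedule, not code taken from the library.

```python
import math


def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


N_TRAIN = 80_000         # hypothetical number of training examples
effective_batch = 8 * 4  # per-device batch size x accumulation steps, one GPU
steps_per_epoch = math.ceil(N_TRAIN / effective_batch)
total_steps = steps_per_epoch * 10  # num_train_epochs=10

print(f"optimizer steps per epoch: {steps_per_epoch}")  # 2500
print(f"total optimizer steps:     {total_steps}")      # 25000
for step in (0, 250, 500, 12_750, total_steps):
    lr = linear_schedule_lr(step, 2e-5, 500, total_steps)
    print(f"step {step:>6}: lr = {lr:.2e}")
```

With these numbers the 500 warmup steps cover 2% of training. On a much smaller dataset the same 500 steps could take up a large share of the run, so it is worth checking the ratio.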

---

**2. Training the Model**
The `Seq2SeqTrainer` is initialized with the following components:
- **`model=model`**: The Seq2Seq model with LoRA applied.
- **`args=training_args`**: The training configuration defined above.
- **`data_collator=data_collator`**: Handles padding and formatting of inputs for training batches.
- **`train_dataset=tokenized_datasets_english_to_hindi["train"]`**: Specifies the training dataset.
- **`eval_dataset=tokenized_datasets_english_to_hindi["validation"]`**: Specifies the validation dataset.

---

**3. Training Execution**
```python
trainer.train()
```
- This method begins training based on the defined arguments and datasets.
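- Outputs:
  - Training loss every 10 steps and evaluation loss after each epoch.
  - Checkpoints saved in `output_dir` after each epoch.
  - The model held in memory when `train()` returns is the one from the final step. Reloading the best checkpoint according to `metric_for_best_model` would additionally require `load_best_model_at_end=True`.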

---

**Why Use `Seq2SeqTrainer`?**
- **Simplifies Workflow**: Abstracts away much of the boilerplate for training Seq2Seq models.
- **Flexible Configuration**: Easily integrates LoRA, mixed precision, and custom evaluation strategies.
- **Scalable**: Supports distributed training and gradient accumulation for large models.

This setup is ideal for tasks like machine translation (English-to-Hindi in this case) while optimizing resource usage and training efficiency.

#### **Save Model and Tokenizer Locally**
```python
trainer.save_model("../Model/lora/M2M100/")
tokenizer.save_pretrained("../Model/lora/M2M100/")
```
- **Purpose**:
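  - Saves the fine-tuned weights and the tokenizer for reuse. Because `model` is a PEFT-wrapped model, what is written is the LoRA adapter (weights plus adapter configuration), not a full copy of the base model.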
- **Path**:
  - The model and tokenizer are saved in `../Model/lora/M2M100/`.

---

In [None]:
trainer.save_model("../Model/lora/M2M100/")
tokenizer.save_pretrained("../Model/lora/M2M100/")


#### **Translation with the Saved Model**
```python
text = 'break a leg'
translator = pipeline("translation_en_to_hi", model="../Model/lora/M2M100/")
translator(text)
```
- **Translation Pipeline**:
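  - Loads the saved model using Hugging Face's `pipeline` for English-to-Hindi translation; the task name `translation_en_to_hi` supplies the source and target languages. Since the directory holds a LoRA adapter, loading it this way relies on Transformers' PEFT integration, so the `peft` package must be installed.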
- **Input**:
  - `text = 'break a leg'`: A common English idiom.
- **Output**:
  - The translated text in Hindi.

---

In [None]:
text = 'break a leg'
translator = pipeline("translation_en_to_hi", model="../Model/lora/M2M100/")
translator(text)



#### **Push Model to Hugging Face Hub**
```python
model.push_to_hub("aktheroy/translate_en_hi")
tokenizer.push_to_hub("aktheroy/translate_en_hi")
```
- **Purpose**:
  - Shares the fine-tuned model with the Hugging Face community by uploading it to your repository on the Hugging Face Model Hub.
- **Repository Name**:
  - `aktheroy/translate_en_hi`: The specified repository on the Model Hub.


In [None]:
model.push_to_hub("aktheroy/translate_en_hi")
tokenizer.push_to_hub("aktheroy/translate_en_hi")

#### Steps to Authenticate and Push:
1. Log in to your Hugging Face account:
   ```bash
   huggingface-cli login
   ```
2. When prompted, paste an access token from your Hugging Face account settings. The token needs write permission to push to a repository.
3. Once authenticated, the model files and tokenizer will be uploaded to your specified repository.

---

### **Advantages of Saving and Sharing**
1. **Reusability**:
   - Save the model locally for later use without retraining.
2. **Accessibility**:
   - Sharing the model on the Hugging Face Hub makes it accessible to other users.
3. **Production Readiness**:
   - Easily integrate the fine-tuned model into applications using the Hugging Face `pipeline`.


### **1. Validation and Model Performance Monitoring**
   - **Track Metrics During Training**:
     `metric_for_best_model="eval_loss"` tracks only the loss. For translation, it is good practice to also track a task metric such as BLEU to get a more comprehensive view of model performance. Note that `load_metric` was deprecated in `datasets` and later removed; metrics now live in the `evaluate` library.
     ```python
     import evaluate
     bleu_metric = evaluate.load("bleu")
     ```

   - **Evaluation During Training**:
     Ensure that the model is evaluated on the validation set during training to check for overfitting or underfitting, and adjust hyperparameters accordingly.

   - **Test Set Evaluation**:
     After training, you should evaluate your model on a held-out test set (which you already defined) to assess its performance on unseen data. This would help gauge the generalization capability of the fine-tuned model.
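
To show what a BLEU-style score measures, here is a deliberately simplified sentence-level version: clipped n-gram precision up to bigrams, combined by a geometric mean and multiplied by a brevity penalty. `simple_bleu` is a hypothetical helper for illustration only. Real implementations such as the `bleu` metric in `evaluate` work at corpus level, use up to 4-grams, and handle tokenization and smoothing, so use those for reported results.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            return 0.0
        precisions.append(overlap / total)
    # Penalize candidates that are shorter than the reference
    brevity_penalty = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


reference = "the cat sat on the mat"
print(simple_bleu("the cat sat on the mat", reference))  # 1.0
print(round(simple_bleu("the cat sat", reference), 4))   # 0.3679, brevity penalty only
print(simple_bleu("a dog", reference))                   # 0.0
```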

### **2. Hyperparameter Tuning**
   - **Hyperparameter Search**:
     Consider running a hyperparameter search (e.g., using `optuna` or `Ray Tune`) to optimize parameters like learning rate, batch size, and other training settings.

### **3. Save and Load Checkpoints During Training**
   - **Intermediate Checkpoints**:
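     To save checkpoints periodically rather than once per epoch, switch the save strategy to steps. `save_steps` is ignored while `save_strategy="epoch"`, so set both when constructing `Seq2SeqTrainingArguments`:
     ```python
     save_strategy="steps",
     save_steps=500,  # Saves a checkpoint every 500 optimizer steps
     ```

   - **Best Model Checkpointing**:
     To reload the best checkpoint (according to `metric_for_best_model`) when training ends, pass the following to `Seq2SeqTrainingArguments` as well. It requires the save and evaluation strategies to match.
     ```python
     load_best_model_at_end=True,
     ```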

### **4. Fine-Tuning with Data Augmentation**
   - **Data Augmentation**:
     For machine translation, you might consider augmenting the training data (e.g., using back-translation) to improve the model's robustness, especially if your training dataset is relatively small.

### **5. Model Efficiency Improvements**
   - **Pruning**:
     After fine-tuning, you could explore model pruning (removing redundant neurons) to further optimize the model size and improve inference speed without sacrificing too much accuracy.

### **6. Model Interpretability**
   - **Model Explainability**:
     Consider tools like SHAP or LIME for interpreting the model's decisions, especially when deploying the model in production or in high-stakes applications.

### **7. Deployment and Inference Optimization**
   - **ONNX Export**:
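     If you plan to serve the model outside PyTorch (for example with ONNX Runtime or on an edge device), you can convert it to **ONNX** format for optimized inference. The snippet is schematic: `input_tensor` is a placeholder, and an encoder-decoder model needs example encoder inputs and decoder inputs to be traced.
     ```python
     model.save_pretrained('model_dir')
     # input_tensor is a placeholder: supply example encoder and decoder inputs here
     torch.onnx.export(model, input_tensor, 'model.onnx')
     ```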

   - **Quantization for Inference**:
     For faster inference, you could perform post-training quantization (e.g., using **8-bit quantization** with PyTorch) to reduce the model size and speed up inference on edge devices or low-resource environments.
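
The idea behind 8-bit post-training quantization can be sketched with symmetric absmax quantization on a handful of weights. This is a toy illustration in plain Python with made-up values and hypothetical helper names. Production libraries quantize per channel or per block and run on tensors, but the rounding principle is the same.

```python
def quantize_absmax_int8(weights):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127 if largest > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.03, 1.0]  # made-up example weights
quantized, scale = quantize_absmax_int8(weights)
restored = dequantize(quantized, scale)

print(quantized)  # [50, -127, 3, 100]
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"largest round-trip error: {max_error:.6f}")
```

The round-trip error is bounded by half the scale factor, which is why weights with a few large outliers quantize poorly under a single per-tensor scale.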

### **8. Documentation and Experiment Tracking**
   - **Experiment Tracking**:
     If you’re using a tool like **Weights & Biases (wandb)** or **TensorBoard**, make sure to track training and evaluation metrics for better insight into your experiments. This is useful for managing and analyzing multiple training runs.

   - **Model Documentation**:
     Document the training setup, hyperparameters, and performance metrics. This helps both for reproducibility and understanding how the model was trained when coming back to it later or sharing it with others.

---

### **Summary of Optional Steps:**
1. Track additional evaluation metrics (e.g., BLEU score).
2. Perform hyperparameter tuning.
3. Save intermediate checkpoints during training.
4. Consider data augmentation techniques (e.g., back-translation).
5. Explore pruning for model size reduction.
6. Use tools for model explainability.
7. Optimize inference with ONNX or quantization.
8. Track experiments and document the process.