In [None]:
import pandas
import numpy


| **Approach**                     | **Data Required**      | **Overview**                                                                                             | **Strengths**                       | **Challenges**                    | **References**                                                                                   |
|-----------------------------------|------------------------|----------------------------------------------------------------------------------------------------------|-------------------------------------|------------------------------------|--------------------------------------------------------------------------------------------------|
| **Zero-Shot Translation**         | None                  | Leverage multilingual models trained on large corpora to perform translation without explicit parallel data. | Fast, multilingual support          | May lack domain-specific accuracy | [M2M-100 by Facebook](https://arxiv.org/abs/2010.11125), [XLM](https://arxiv.org/abs/1901.07291) |
| **Back-Translation**              | Monolingual           | Create synthetic parallel data by translating from the target language back to the source language.       | Generates synthetic parallel data   | Translation quality impacts model | [Unsupervised Machine Translation](https://arxiv.org/abs/1804.07755), [MarianMT](https://marian-nmt.github.io/) |
| **Denoising Autoencoders**        | Monolingual           | Train models to reconstruct noisy text, aligning representations across languages without parallel data. | Fully unsupervised                  | Computationally expensive         | [Unsupervised NMT](https://arxiv.org/abs/1710.11041), [XLM Model](https://arxiv.org/abs/1901.07291) |
| **Cross-Lingual Embedding Alignment** | Monolingual       | Train embeddings separately for each language and align them in the same semantic space for translation. | Lightweight, effective for keywords | Limited to word/phrase-level      | [MUSE by Facebook](https://github.com/facebookresearch/MUSE), [FastText](https://fasttext.cc/)   |
| **Unsupervised Pre-Trained Models** | Monolingual         | Use pre-trained models like T5 or GPT, fine-tuned for unsupervised or specific language tasks.            | Robust pre-trained capabilities     | May need fine-tuning              | [T5](https://arxiv.org/abs/1910.10683), [mT5](https://arxiv.org/abs/2010.11934)                 |
| **Iterative Back-Translation**    | Monolingual           | Use back-translation iteratively to improve synthetic data and model performance.                        | Incremental improvement             | Time-consuming process            | [Improving Back-Translation](https://arxiv.org/abs/1805.08241)                                  |
| **Data Augmentation**             | Monolingual           | Generate synthetic data using paraphrasing or augmentation to improve the training dataset size.         | Expands training data               | Noise in generated data           | [Paraphrasing for Translation](https://arxiv.org/abs/1909.13838), [Text Augmentation](https://github.com/jasonwei20/eda_nlp) |
| **Adversarial Training**          | Monolingual           | Use adversarial methods to align representations across languages, enabling unsupervised translation.    | Fully unsupervised                  | Complex to implement              | [Adversarial Alignment for NMT](https://arxiv.org/abs/1706.05075), [GANs in NMT](https://arxiv.org/abs/1703.04887) |
| **Multilingual Fine-Tuning**      | Monolingual           | Fine-tune multilingual pre-trained models (e.g., XLM-R, mT5) on domain-specific monolingual data.        | Adaptable to specific domains       | Needs computing resources         | [XLM-R](https://arxiv.org/abs/1911.02116), [mT5](https://arxiv.org/abs/2010.11934)              |
| **Active Learning**               | Minimal labeled data  | Dynamically label the most useful examples to train a model efficiently with minimal supervision.         | Efficient labeling process          | Needs active learning pipeline    | [Active Learning for NMT](https://arxiv.org/abs/1706.08500), [AL in NLP](https://arxiv.org/abs/2006.11477) |
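
As a rough illustration of the Back-Translation row above, the sketch below builds synthetic (source, target) pairs from monolingual target-side text. The `back_translate` helper and the dictionary-based `toy_target_to_source` "model" are hypothetical stand-ins; in practice the reverse direction would be a real translation model.

```python
from typing import Callable, List, Tuple

def back_translate(
    target_sentences: List[str],
    target_to_source: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build synthetic (source, target) pairs from monolingual target text."""
    pairs = []
    for tgt in target_sentences:
        synthetic_src = target_to_source(tgt)  # machine-generated, possibly noisy
        pairs.append((synthetic_src, tgt))     # human-written text stays on the target side
    return pairs

# Hypothetical toy "model": word-for-word German -> English dictionary lookup.
TOY_DE_EN = {"das": "the", "haus": "house", "ist": "is", "klein": "small"}

def toy_target_to_source(sentence: str) -> str:
    return " ".join(TOY_DE_EN.get(word, word) for word in sentence.lower().split())

synthetic_pairs = back_translate(["Das Haus ist klein"], toy_target_to_source)
print(synthetic_pairs)
```

The synthetic source side is only as good as the reverse model, which is the "translation quality impacts model" challenge noted in the table. Iterative back-translation would retrain both directions on such pairs and repeat.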


**Note:** The approaches listed above were recommended by ChatGPT and date mostly from 2017-2020. They are largely obsolete given the newer approaches available with LLMs such as GPT-4.


# LATEST APPROACHES
| **Approach**                        | **Data Required**      | **Overview**                                                                                          | **Strengths**                          | **Challenges**                    | **References**                                                                                     |
|--------------------------------------|------------------------|-------------------------------------------------------------------------------------------------------|----------------------------------------|------------------------------------|----------------------------------------------------------------------------------------------------|
| **Prompt-Based Learning**            | None or Monolingual    | Uses LLMs (e.g., GPT-4, PaLM-2) with task-specific prompts for translation. Requires minimal or no fine-tuning. | Zero-shot/few-shot capability, domain flexibility | Requires access to large LLMs      | [InstructGPT](https://arxiv.org/abs/2203.02155), [PaLM-2](https://ai.google/static/palm/)          |
| **Adapter-Based Fine-Tuning**        | Monolingual or Limited Parallel | Fine-tune only small parts (adapters) of massive multilingual models like mT5 or XLM-R.              | Lightweight, efficient for updates      | Fine-tuning still required          | [Adapters for mT5](https://arxiv.org/abs/2110.04366), [PELT for NLP](https://arxiv.org/abs/2208.05581) |
| **Multilingual In-Context Learning** | None or Monolingual    | Supply instructions and, optionally, a few example translations in the prompt so the LLM translates in context without any parameter updates. | No task-specific fine-tuning required   | Sensitive to input prompt design    | [GPT-4 Technical Report](https://arxiv.org/abs/2303.08774)                                         |
| **Efficient Transformers (e.g., LongFormers)** | Monolingual or Limited Parallel | Focus on long-context translation tasks using efficient transformer architectures.                   | Handles long texts efficiently           | Requires architectural adaptations  | [LongFormer](https://arxiv.org/abs/2004.05150), [BigBird](https://arxiv.org/abs/2007.14062)         |
| **Contrastive Learning for Translation** | Monolingual           | Uses contrastive objectives to improve representation alignment for multilingual embedding spaces.    | Improves cross-lingual performance       | Resource-intensive training         | [InfoXLM](https://arxiv.org/abs/2101.08296), [CLIP for Text](https://arxiv.org/abs/2103.00020)      |
| **Sparse Models (Mixture of Experts)** | Monolingual           | Mixture-of-Experts (MoE) architectures dynamically activate parts of the model for specific tasks.   | Scales efficiently with large models     | Complex implementation              | [Switch Transformers](https://arxiv.org/abs/2101.03961), [GLaM](https://arxiv.org/abs/2112.06905)   |
| **Self-Supervised Cross-Alignment**  | Monolingual           | Align representations across languages using self-supervised pre-training, avoiding parallel data.   | No parallel data needed, effective       | Requires substantial compute power  | [XLM-E](https://arxiv.org/abs/2204.10487), [ParaMAE](https://arxiv.org/abs/2205.00330)              |
| **Knowledge Distillation for MT**    | Monolingual           | Train smaller, efficient translation models using LLMs as teachers to distill knowledge.             | Reduces size and complexity of models    | Requires large teacher models       | [Distilling Translation Models](https://arxiv.org/abs/2212.07677)                                  |
| **Retrieval-Augmented Translation**  | Monolingual           | Combines neural translation with retrieval systems to improve performance on domain-specific tasks.  | Domain adaptability, better accuracy     | Requires a retrieval database setup | [RETRO](https://arxiv.org/abs/2112.04426), [RAG](https://arxiv.org/abs/2005.11401)                  |
| **Hybrid Neural-Symbolic MT**        | Monolingual           | Combine neural MT with rule-based or symbolic reasoning for enhanced translation quality.            | Better explainability and accuracy       | Hybrid design complexity            | [Neuro-Symbolic AI](https://arxiv.org/abs/2301.02177), [Symbolic MT Approaches](https://arxiv.org/abs/2209.11762) |
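
To make the Prompt-Based Learning and In-Context Learning rows more concrete, here is a minimal sketch of how a zero-shot or few-shot translation prompt could be assembled. The function name `build_translation_prompt` and the prompt wording are assumptions for illustration, not a format required by any particular LLM; sending the prompt to a model is left out.

```python
from typing import Sequence, Tuple

def build_translation_prompt(
    text: str,
    source_lang: str,
    target_lang: str,
    examples: Sequence[Tuple[str, str]] = (),
) -> str:
    """Assemble a zero-shot (no examples) or few-shot translation prompt."""
    lines = [f"Translate the following text from {source_lang} to {target_lang}."]
    for src, tgt in examples:
        # Each demonstration pair shows the model the expected input/output format.
        lines.append(f"{source_lang}: {src}")
        lines.append(f"{target_lang}: {tgt}")
    lines.append(f"{source_lang}: {text}")
    lines.append(f"{target_lang}:")  # left open for the model to complete
    return "\n".join(lines)

prompt = build_translation_prompt(
    "Where is my order?",
    "English",
    "German",
    examples=[("Thank you for your purchase.", "Vielen Dank für Ihren Einkauf.")],
)
print(prompt)
```

Because output quality is sensitive to prompt design, keeping prompt construction in one function makes it easier to compare variants.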


## Evaluation of Machine Translation

Many metrics are available for evaluating machine translation. We should prefer metrics that capture semantic similarity and can also account for domain knowledge.

| **Metric** | **Description** |
|------------|-----------------|
| **BLEU (Bilingual Evaluation Understudy)** | Measures n-gram overlap between the machine translation and reference translation. Higher scores indicate closer alignment with human references. It is widely used but can struggle with synonyms or paraphrasing. |
| **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)** | Focuses on recall-based overlap of n-grams, primarily used in summarization but applicable to MT. ROUGE-L specifically considers sequence alignment. It is helpful for evaluating fluency and coverage. |
| **METEOR (Metric for Evaluation of Translation with Explicit ORdering)** | Considers exact matches, stemming, synonyms, and paraphrases between translation and reference. It uses precision, recall, and a harmonic mean to give a more human-like judgment. It is better at capturing semantic similarity than BLEU. |
| **TER (Translation Edit Rate)** | Calculates the number of edits (insertions, deletions, substitutions, and shifts) needed to make a machine translation match a reference. Lower TER scores mean fewer corrections are needed, reflecting better quality. |
| **chrF (Character F-score)** | Evaluates translation quality at the character level instead of word level, making it suitable for morphologically rich languages. It combines precision and recall to account for partial matches, such as prefixes or suffixes. |
| **COMET (Cross-lingual Optimized Metric for Evaluation of Translation)** | Uses a neural network trained on human judgment to predict translation quality. COMET evaluates translations on adequacy, fluency, and consistency, making it more robust than traditional metrics. |
| **BERTScore** | Compares contextual embeddings of the machine and reference translations using pre-trained models like BERT. It captures semantic similarity and paraphrasing better than n-gram-based metrics. |
| **BLEURT (Bilingual Evaluation Understudy with Representations from Transformers)** | A transformer-based metric fine-tuned on human ratings, designed to predict translation quality. BLEURT captures semantic differences and aligns well with human judgments. |
| **SacreBLEU** | A standardized version of BLEU with consistent preprocessing and tokenization, ensuring reproducibility across evaluations. It is simpler to use than traditional BLEU but shares similar limitations. |
| **Human Evaluation** | The gold standard, involving human raters who evaluate adequacy, fluency, and overall translation quality. It is costly and time-consuming but essential for benchmarking against automated metrics. |
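
To show what an edit-based metric such as TER measures, the sketch below computes a simplified edit rate: the word-level Levenshtein distance divided by the reference length. This is an assumption-laden simplification, since real TER also counts block shifts, which are omitted here. The names `word_edit_distance` and `simple_ter` are made up for this example.

```python
from typing import List

def word_edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Levenshtein distance over words (insert, delete, substitute; no shifts)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(
                prev[j] + 1,         # delete a hypothesis word
                curr[j - 1] + 1,     # insert a reference word
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]

def simple_ter(reference: str, candidate: str) -> float:
    """Edits needed to turn the candidate into the reference, per reference word."""
    ref, hyp = reference.split(), candidate.split()
    if not ref:
        raise ValueError("reference must not be empty")
    return word_edit_distance(hyp, ref) / len(ref)

ter = simple_ter("the cat sat on the mat", "the cat sat on mat")
print(round(ter, 3))  # one missing word out of six -> 0.167
```

Lower is better: an identical candidate scores 0.0, and the score can exceed 1.0 when the candidate is much longer than the reference.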

Human evaluation is the last step, so we first need to focus on the automated metrics. BLEU, ROUGE, TER, chrF, and METEOR are frequency-based (surface-overlap) approaches. Among the transformer-based approaches, we can look into BLEURT, COMET, and SBLEU/DBLEU.


### Recommendation for Our Z+ Case
- Start with pre-trained BLEURT: Test the general pre-trained model to see how well it handles our translations.
- Fine-tune if necessary: If the BLEURT scores do not align with human judgments (e.g., the model fails to recognize Z+ e-commerce-specific terms or paraphrasing), fine-tune it on a smaller, domain-specific dataset. I am not sure we have such a dataset inside Z+.
  - A few thousand high-quality parallel sentences are often sufficient for fine-tuning.
- Ensure high-quality reference data: Regardless of the approach, the quality of our dataset (e.g., the accuracy and fluency of the reference translations) is key to improving performance.
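
One way to check whether a metric's scores are "aligned with human judgments" is to correlate them with human ratings on the same translations. The sketch below computes the Pearson correlation by hand; the scores and ratings are invented purely for illustration and are not Z+ data.

```python
import math
from typing import Sequence

def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally sized samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally sized samples with at least 2 points")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (std_x * std_y)

# Hypothetical data: metric scores and 1-5 human ratings for five translations.
metric_scores = [0.42, 0.55, 0.61, 0.70, 0.83]
human_ratings = [2, 3, 3, 4, 5]

r = pearson(metric_scores, human_ratings)
print(f"Pearson r = {r:.3f}")
```

A low correlation on a domain-specific sample would be the signal to consider fine-tuning. With only a handful of rated sentences the estimate is noisy, so a larger sample is preferable.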

BLEU needs two inputs: a reference text, which is the actual text we compare against, and a candidate, which in our case is the translated text.
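
The reference/candidate relationship can be sketched with a simplified sentence-level BLEU: clipped n-gram precisions, their geometric mean, and a brevity penalty. This is a minimal sketch assuming whitespace tokenization, a single reference, and no smoothing; `simple_bleu` is a made-up name, and for reportable numbers a standardized implementation such as SacreBLEU is the better choice.

```python
import math
from collections import Counter
from typing import List

def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(reference: str, candidate: str, max_n: int = 4) -> float:
    ref, cand = reference.split(), candidate.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing: a zero precision zeroes the whole score
        log_precisions.append(math.log(matched / total))
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "the cat is on the mat"
candidate = "the cat sat on the mat"
print(simple_bleu(reference, candidate, max_n=2))  # unigrams and bigrams only
print(simple_bleu(reference, candidate))           # 0.0: no 4-gram matches
```

The second call illustrates why sentence-level BLEU is usually smoothed or computed at corpus level: one changed word in a short sentence removes every 4-gram match.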




### BLEU SCORES RANGE
| **BLEU** | **Interpretation**                                      |
|----------|---------------------------------------------------------|
| < 0.1    | Almost useless                                          |
| 0.1-0.19 | Hard to get the gist                                    |
| 0.2-0.29 | The gist is clear, but has significant grammatical errors|
| 0.3-0.39 | Understandable to good translations                     |
| 0.4-0.49 | High quality translations                               |
| 0.5-0.59 | Very high quality, adequate, and fluent translations    |
| ≥ 0.6    | Quality often better than humans                        |

Ref: https://codelabsacademy.com/en/blog/understanding-bleu-score-in-nlp-evaluating-translation-quality
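
The bands in the table above can be turned into a small lookup helper, sketched below. It assumes BLEU on a 0-1 scale as in the table; tools such as SacreBLEU report BLEU on a 0-100 scale, so those scores would need dividing by 100 first. `interpret_bleu` is a hypothetical helper name.

```python
from bisect import bisect_right

# Lower bounds of each band after the first, taken from the table above.
BLEU_THRESHOLDS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
BLEU_LABELS = [
    "Almost useless",
    "Hard to get the gist",
    "The gist is clear, but has significant grammatical errors",
    "Understandable to good translations",
    "High quality translations",
    "Very high quality, adequate, and fluent translations",
    "Quality often better than humans",
]

def interpret_bleu(score: float) -> str:
    """Map a 0-1 BLEU score to its interpretation band."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("expected BLEU on a 0-1 scale; divide 0-100 scores by 100")
    # bisect_right counts how many thresholds the score has reached.
    return BLEU_LABELS[bisect_right(BLEU_THRESHOLDS, score)]

print(interpret_bleu(0.35))
```

These bands are rough guidance only; BLEU values are not directly comparable across language pairs, test sets, or tokenization settings.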

### BLEURT/COMET SCORES RANGE

BLEURT and COMET scores typically range from 0 to 1 (the exact scale depends on the checkpoint used), where:

- 0 - 0.2: Poor quality
- 0.2 - 0.4: Fair quality
- 0.4 - 0.6: Good quality
- 0.6 - 0.8: Very good quality
- 0.8 - 1.0: Excellent quality