# Agents in Language Translation: A Comprehensive Tutorial for Data Scientists

## Introduction

Welcome to this tutorial on using agentic frameworks for language translation with Large Language Models (LLMs). As seasoned data scientists, you are likely familiar with machine learning models and natural language processing. Agentic frameworks add a way to orchestrate multi-step interactions with LLMs, which makes outcomes more sophisticated and easier to control.

### In this tutorial, we will:

- Introduce the concept of agents in the context of LLMs.
- Demonstrate how agents can enhance language translation tasks, especially in technical domains.
- Provide a step-by-step implementation of an agent-based translation system.
- Evaluate translation quality using established metrics.

Let's explore how agents can improve language translation through consistent technical terminology and iterative refinement.

## Table of Contents

1. [Understanding Agents](#understanding-agents)
2. [Why Use Agents for Language Translation](#why-use-agents-for-language-translation)
3. [Setting Up the Environment](#setting-up-the-environment)
4. [The Translation Process](#the-translation-process)
5. [Translation Quality Metrics](#translation-quality-metrics)
6. [Implementing the Agent-Based Translation System](#implementing-the-agent-based-translation-system)
7. [Testing the Agent-Based Translation](#testing-the-agent-based-translation)
8. [Comparing with Standard Machine Translation](#comparing-with-standard-machine-translation)
9. [Evaluating Translation Quality](#evaluating-translation-quality)
10. [Conclusion](#conclusion)
11. [Next Steps](#next-steps)


## Understanding Agents
In the context of LLMs, agents are systems that structure a sequence of interactions with language models to accomplish complex tasks. They are more than simple prompt-response pairs; agents can:

- Chain multiple prompts and responses to handle tasks requiring reasoning, iteration, or multiple steps.
- Incorporate logic and decision-making, allowing for conditional flows based on intermediate outputs.
- Collaborate with other agents or tools, enhancing functionality and performance.

Example Workflow of an Agent:
1. Initial Prompt: Provide input to the LLM.
2. LLM Response: Receive output from the LLM.
3. Processing: Analyze or manipulate the output.
4. Next Prompt: Generate a new prompt based on processing.
5. Iteration: Repeat as necessary to achieve the goal.

Agents enable us to harness the full potential of LLMs by orchestrating complex interactions that a single prompt-response cycle cannot handle.
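
The five-step workflow above can be sketched as a small loop. This is a minimal illustration rather than a full framework: `run_agent`, `llm`, and `is_done` are hypothetical names, and the "model" here is a stand-in that returns canned drafts so the loop can run without an API key.

```python
from typing import Callable, List


def run_agent(task: str,
              llm: Callable[[str], str],
              is_done: Callable[[str], bool],
              max_steps: int = 5) -> List[str]:
    """Prompt, read the response, decide, re-prompt; stop when done or out of steps."""
    prompt = task                      # 1. initial prompt
    history = []
    for _ in range(max_steps):         # 5. iteration
        output = llm(prompt)           # 2. LLM response
        history.append(output)
        if is_done(output):            # 3. processing: decide whether to stop
            break
        # 4. next prompt, built from the previous output
        prompt = f"Task: {task}\nPrevious attempt: {output}\nImprove it."
    return history


def make_fake_llm() -> Callable[[str], str]:
    """Stand-in for a real model call: returns three canned drafts in order."""
    drafts = iter(["draft", "better draft", "final draft"])
    return lambda prompt: next(drafts)


history = run_agent("Translate the abstract",
                    make_fake_llm(),
                    is_done=lambda out: out.startswith("final"))
print(history)
```

Passing the model call in as a plain function keeps the control flow testable and lets a real API client be swapped in later.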

## Why Use Agents for Language Translation
Language translation, particularly in technical domains, presents unique challenges:
- Consistency in Terminology: Technical documents often contain domain-specific terms that need consistent translation.
- Iterative Refinement: Translations may require multiple iterations to achieve the desired accuracy and tone.
- Complex Structures: Technical texts may include complex sentences, specialized vocabulary, and critical details.

An agent-based approach offers:
- Structured Workflows: Breaking down the translation process into manageable steps.
- Term Extraction and Management: Identifying and consistently translating technical terms.
- Quality Assurance: Incorporating back-translation and comparison to ensure fidelity.

By using agents, we can automate and enhance the translation process, ensuring higher quality and efficiency.

## Setting Up the Environment

### Set Up OpenAI API + Download NLTK Resources

In [None]:
import os
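# Placeholder only: prefer exporting OPENAI_API_KEY in your shell over hardcoding a key.
# setdefault leaves an already-exported key untouched.
os.environ.setdefault('OPENAI_API_KEY', 'your-api-key')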

In [None]:
import nltk
# METEOR relies on WordNet for synonym matching; download once per environment.
nltk.download('wordnet')
nltk.download('omw-1.4')

## The Translation Process
Understanding the translation process is crucial before implementing our agent-based system.

### Human Translation Workflow
1. Initial Translation
- Review Source Text: Understand context and nuances.
- First Draft: Create an initial translation, focusing on accuracy.
- Cultural Adaptation: Adjust idioms and expressions appropriately.

2. Editing and Refining
- Peer Review: Another translator reviews the draft.
- Terminology Consistency: Use glossaries to ensure consistent terminology.

3. Back Translation
- Back Translation: Translate back to the original language.
- Comparison: Identify discrepancies and refine the translation.

4. Proofreading
- Final Edits: Correct grammar, style, and formatting.

5. Client Review and Feedback

6. Final Delivery

### Machine Translation Workflow
Machine translation systems use statistical and neural methods, but they may miss context and translate technical terms inconsistently. Their output is usually evaluated with quantitative metrics.



## Translation Quality Metrics
To evaluate our translations, we'll use METEOR and ROUGE, two established metrics.

### METEOR
- Metric for Evaluation of Translation with Explicit ORdering
- Focus: Considers synonyms, stemming, and word order.
- Advantages: Captures semantic meaning better than simple n-gram overlap.
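
For reference, NLTK's `meteor_score` combines these ideas roughly as follows, with default parameters $\alpha = 0.9$, $\beta = 3$, and $\gamma = 0.5$. Here $P$ and $R$ are unigram precision and recall over the matched words, $m$ is the number of matched unigrams, and $c$ is the number of contiguous chunks those matches form. This summarizes the library's implementation as we understand it; other METEOR versions use different weights.

$$
F_{mean} = \frac{P \cdot R}{\alpha P + (1 - \alpha) R}, \qquad
\text{Penalty} = \gamma \left(\frac{c}{m}\right)^{\beta}, \qquad
\text{METEOR} = (1 - \text{Penalty}) \cdot F_{mean}
$$

Fewer, longer chunks mean the hypothesis preserves the reference word order, so the penalty shrinks.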

### ROUGE
- Recall-Oriented Understudy for Gisting Evaluation
- Focus: Measures n-gram overlap between translations.
- Variants:
    - ROUGE-N: Overlap of n-grams.
    - ROUGE-L: Longest common subsequence.
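
To make the two variants concrete, here is a hand-rolled sketch of ROUGE-N and the longest-common-subsequence length behind ROUGE-L. It assumes plain whitespace tokenization and no stemming, so its numbers will not match the `rouge_score` package exactly; the helper names and the two sample sentences are illustrative only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(reference, hypothesis, n=1):
    """Return (precision, recall, f1) from clipped n-gram overlap."""
    ref = Counter(ngrams(reference.split(), n))
    hyp = Counter(ngrams(hypothesis.split(), n))
    overlap = sum((ref & hyp).values())          # clipped counts shared by both
    precision = overlap / max(sum(hyp.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


reference = "the protein is oxidized by copper"
hypothesis = "the protein was oxidized by copper ions"

print("ROUGE-1:", rouge_n(reference, hypothesis, 1))
print("ROUGE-2:", rouge_n(reference, hypothesis, 2))
print("LCS length:", lcs_length(reference.split(), hypothesis.split()))
```

In this pair, five of the six reference words reappear in order, so ROUGE-1 recall is 5/6 and the LCS length is 5, while only three of the five reference bigrams survive the substitution of "was" for "is".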


## Implementing METEOR and ROUGE

In [1]:
## METEOR Score Function

from nltk.translate.meteor_score import meteor_score

def calculate_meteor(reference, hypothesis):
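    # Recent NLTK releases expect pre-tokenized input (lists of tokens), not raw strings.
    # Whitespace splitting is a simple choice that works for English text.
    score = meteor_score([reference.split()], hypothesis.split())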
    print(f"METEOR Score: {score:.4f}")
    return score

In [1]:
## ROUGE Score Function

from rouge_score import rouge_scorer

def calculate_rouge(reference, hypothesis):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference, hypothesis)
    for key in scores:
        print(f"{key.upper()}: Precision: {scores[key].precision:.4f}, Recall: {scores[key].recall:.4f}, F1: {scores[key].fmeasure:.4f}")
    return scores

## Implementing the Agent-Based Translation System
We'll now build an agent that emulates the human translation workflow, ensuring technical accuracy and consistency.

### Extracting Technical Terms
We start by extracting technical or scientific terms from the source text.

In [None]:
import openai
import ast
client = openai.OpenAI()

def extract_science_terms(input_text, source_lang, target_lang):
    print("Extracting technical terms...")
    system_prompt = f"""
    You are an expert in {source_lang} and {target_lang} medical terminology.
    Extract all technical or scientific terms from the following text.
    Return them as a Python list of strings. If none, return an empty list.
    """
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": input_text}
        ]
    )
    terms_str = response.choices[0].message.content.strip()
    try:
        terms_list = ast.literal_eval(terms_str)
        if not isinstance(terms_list, list):
            terms_list = []
    except (ValueError, SyntaxError):  # reply was not a valid Python literal
        terms_list = []
    print(f"Extracted terms: {terms_list}")
    return terms_list
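
Model replies often wrap the list in extra prose or a Markdown code fence, in which case `ast.literal_eval` on the whole reply fails and the function above returns an empty list. A more forgiving parser could look like the sketch below. `parse_list_reply` is a hypothetical helper, not part of the tutorial's code, and it assumes the reply contains at most one bracketed list.

```python
import ast


def parse_list_reply(reply: str) -> list:
    """Best-effort parse of a model reply that should contain a Python list of strings."""
    # Keep only the outermost [...] span, ignoring any chatter or fences around it.
    start, end = reply.find("["), reply.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        value = ast.literal_eval(reply[start:end + 1])
    except (ValueError, SyntaxError):
        return []
    if not isinstance(value, list):
        return []
    # Drop anything that is not a string so downstream prompts stay clean.
    return [item for item in value if isinstance(item, str)]


print(parse_list_reply('Sure! Here are the terms: ["組胺酸", "蛋白脢體"] Hope this helps.'))
print(parse_list_reply("I could not find any terms."))
```

Swapping this in for the `try`/`except` block in `extract_science_terms` and `translate_terms` would be one way to reduce silently empty term lists.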


### Translating Technical Terms
Next, we translate the extracted terms.

In [None]:
def translate_terms(terms_list, source_lang, target_lang):
    print("Translating technical terms...")
    terms_str = str(terms_list)
    system_prompt = f"""
    You are a professional translator fluent in {source_lang} and {target_lang}.
    Translate the following list of technical terms from {source_lang} to {target_lang}.
    Return the translated terms as a Python list of strings, preserving the order.
    """
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": terms_str}
        ]
    )
    translated_terms_str = response.choices[0].message.content.strip()
    try:
        translated_terms = ast.literal_eval(translated_terms_str)
        if not isinstance(translated_terms, list):
            translated_terms = []
    except (ValueError, SyntaxError):  # reply was not a valid Python literal
        translated_terms = []
    print(f"Translated terms: {translated_terms}")
    return translated_terms

### Performing the Translation
We translate the main text, ensuring consistent use of technical terms.

In [None]:
def translate_text_with_terms(source_text, source_terms, target_terms, source_lang, target_lang):
    print("Translating main text...")
    term_mapping = dict(zip(source_terms, target_terms))
    system_prompt = f"""
    You are a professional translator fluent in {source_lang} and {target_lang}.
    Translate the following text from {source_lang} to {target_lang}.
    Ensure that technical terms are translated consistently using the provided mapping.
    Only produce the translated text.
    """
    user_prompt = f"""
    Technical Terms Mapping: {term_mapping}
    Text to Translate:
    {source_text}
    """
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )
    translated_text = response.choices[0].message.content.strip()
    return translated_text
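
One caveat with `dict(zip(source_terms, target_terms))`: if the model returns fewer translated terms than it was given, `zip` silently drops the extras and the remaining pairs may be misaligned. The sketch below shows one way to guard against that. `build_term_mapping` and `format_glossary` are hypothetical helpers, and the glossary layout is only one possible choice for the prompt.

```python
def build_term_mapping(source_terms, target_terms):
    """Pair source and target terms, refusing mappings that zip would silently truncate."""
    if len(source_terms) != len(target_terms):
        raise ValueError(
            f"Term count mismatch: {len(source_terms)} source vs "
            f"{len(target_terms)} target terms"
        )
    mapping = {}
    for src, tgt in zip(source_terms, target_terms):
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:
            # Keep the first translation when a source term is repeated.
            mapping.setdefault(src, tgt)
    return mapping


def format_glossary(mapping):
    """Render the mapping as one 'source => target' line per term for use in a prompt."""
    return "\n".join(f"- {src} => {tgt}" for src, tgt in mapping.items())


mapping = build_term_mapping(["組胺酸", "蛋白脢體"], ["histidine", "proteasome"])
print(format_glossary(mapping))
```

Raising on a mismatch makes the failure visible, so the caller can retry term translation instead of translating the full text with a shifted glossary.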

### Back Translation
We back-translate the text to the source language.

In [None]:
def back_translate_text(translated_text, source_terms, target_terms, source_lang, target_lang):
    print("Performing back translation...")
    term_mapping = dict(zip(target_terms, source_terms))
    system_prompt = f"""
    You are a professional translator fluent in {target_lang} and {source_lang}.
    Back-translate the following text from {target_lang} to {source_lang}.
    Ensure that technical terms are translated consistently using the provided mapping.
    Only produce the back-translated text.
    """
    user_prompt = f"""
    Technical Terms Mapping: {term_mapping}
    Text to Back-Translate:
    {translated_text}
    """
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )
    back_translated_text = response.choices[0].message.content.strip()
    return back_translated_text

### Comparing and Refining
We compare the original and back-translated texts to identify discrepancies.

In [None]:
def compare_texts(original_text, back_translated_text):
    print("Comparing texts...")
    system_prompt = """
    You are a senior translator.
    Compare the original and back-translated texts.
    Identify any discrepancies and provide feedback.
    If the texts are sufficiently similar, state 'Translations are consistent.'
    """
    user_prompt = f"""
    Original Text:
    {original_text}

    Back-Translated Text:
    {back_translated_text}
    """
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )
    feedback = response.choices[0].message.content.strip()
    print(f"Feedback: {feedback}")
    return feedback
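
The LLM feedback is free text, so it can help to log a cheap numeric signal next to it. The sketch below uses the standard library's `difflib.SequenceMatcher` to compute a character-level similarity between the original and the back translation. Character-level matching is an assumption made here because Chinese text has no whitespace word boundaries; `char_similarity` is a hypothetical helper, and the ratio is only a rough proxy for fidelity.

```python
from difflib import SequenceMatcher


def char_similarity(original: str, back_translated: str) -> float:
    """Return a 0..1 similarity ratio between two texts, ignoring all whitespace."""
    a = "".join(original.split())
    b = "".join(back_translated.split())
    # autojunk=False keeps frequent characters from being ignored in long texts.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


original = "組胺酸是常見的氧化受損目標物"
round_trip = "組織胺是常見的氧化受損目標物"
print(f"Similarity: {char_similarity(original, round_trip):.2f}")
```

A high ratio does not prove the translation is right (histidine and histamine differ by very few characters), so this works best as a trend indicator across iterations rather than a pass/fail test.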

## Testing the Agent-Based Translation
Now, let's test our agent-based system with a sample text.

### Sample Text
We'll use a Chinese medical abstract.

In [None]:
chinese_text = """中文摘要
 
活性氧化物質造成的蛋白質氧化常與老化及神經退化疾病連結。蛋白質中，能與金屬離子產生螯合作用的組胺酸是常見的氧化受損目標物。2-氧基組胺酸便是近期研究發現組胺酸在金屬催化下氧化的產物，並有潛力作為蛋白質氧化受損之生物標記物。然而，受限於合成方法不足，它的生物物理及生化特性都尚未被透徹地研究了解。我們發展了一個有效的方法，能利用金屬催化氧化反應，在胺基酸單體及胜肽鏈上產生2-氧基組胺酸。在常見的二價銅離子/抗壞血酸鈉/氧氣的金屬催化氧化系統中，我們調整了試劑的比例及反應的緩衝溶液系統，使得反應的產率較文獻報導增加了十倍以上，並且能得到完全氧化的2-氧基組胺酸胜肽鏈，以鑑定其性質。 
含有2-氧基組胺酸之胜肽鏈在液相層析串聯式質譜儀的分析中，於碳-18管柱的滯留時間較長；其結構在質譜儀電噴灑游離及串聯式質譜碰撞誘導解離的過程中展現高度的穩定性。因此，針對2-氧基組胺酸做蛋白質體的質譜分析是可行的。我們合成的2-氧基組胺酸胜肽鏈可以作為這類質譜實驗的標準品，也可進一步用作胜肽探針或誘導抗體產生，對此非酵素性後轉譯修飾的研究，有相當的價值。 
在這份研究中，我們將含有2-氧基組胺酸之胜肽鏈連接到生物素上，作為研究2-氧基組胺酸交互作用體的探針。生物素能將探針固定在修飾有抗生物素蛋白的固相上，和海拉細胞裂解液作用後，便能得到2-氧基組胺酸的交互作用體並且以質譜鑑定。由於新的證據漸漸顯示氧化受損的蛋白質可能藉由蛋白脢體分解，因此我們將2-氧基組胺酸交互作用體與蛋白脢體交互作用體交叉分析，企圖嘗試了解其中細胞對蛋白質品質調控的機制。 
 
關鍵字： 2-氧基組胺酸、活性氧化物質、金屬催化氧化、蛋白脢體、交互作用體
"""

### Running the Agent

In [None]:
source_lang = "Chinese"
target_lang = "English"

# Step 1: Extract technical terms
source_terms = extract_science_terms(chinese_text, source_lang, target_lang)

# Step 2: Translate technical terms
target_terms = translate_terms(source_terms, source_lang, target_lang)

# Step 3: Perform translation
translated_text = translate_text_with_terms(chinese_text, source_terms, target_terms, source_lang, target_lang)
print("\nTranslated Text:")
print(translated_text)

# Step 4: Back translation
back_translated_text = back_translate_text(translated_text, source_terms, target_terms, source_lang, target_lang)

# Step 5: Compare the original with the back translation to get feedback for refinement
feedback = compare_texts(chinese_text, back_translated_text)
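
The run above stops after producing feedback. One way to close the loop is to feed that feedback into another translation round until the reviewer replies with the phrase requested in the `compare_texts` prompt ('Translations are consistent.'). The sketch below is an assumption about how that could be wired, not the tutorial's implementation: `refine_translation` is hypothetical, and it expects a `translate` callable that accepts feedback, which `translate_text_with_terms` does not do as written. Stub functions stand in for the LLM calls so the control flow can be run offline.

```python
CONSISTENT_MARKER = "Translations are consistent."


def refine_translation(source_text, translate, back_translate, compare, max_rounds=3):
    """Translate, back-translate, and compare until the reviewer reports consistency.

    translate(source_text, feedback) -> translation
    back_translate(translation)      -> text in the source language
    compare(source_text, back_text)  -> reviewer feedback
    Returns (translation, rounds_used).
    """
    feedback, translation = "", ""
    for round_number in range(1, max_rounds + 1):
        translation = translate(source_text, feedback)
        back_text = back_translate(translation)
        feedback = compare(source_text, back_text)
        if CONSISTENT_MARKER.lower() in feedback.lower():
            return translation, round_number
    return translation, max_rounds


# Stubs standing in for LLM calls: the first draft confuses histidine with histamine.
def fake_translate(text, feedback):
    return "histidine" if feedback else "histamine"


def fake_back_translate(translation):
    return {"histidine": "組胺酸", "histamine": "組織胺"}[translation]


def fake_compare(original, back_text):
    return CONSISTENT_MARKER if original == back_text else f"Mismatch: got {back_text}"


result = refine_translation("組胺酸", fake_translate, fake_back_translate, fake_compare)
print(result)
```

Capping the loop with `max_rounds` bounds API cost when the reviewer never reports consistency.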

## Comparing with Standard Machine Translation
For comparison, we'll use a standard translation method.

### Using Google Translate
Running the abstract above through Google Translate gives the following English translation.

In [12]:
google_translated_text = """
Chinese abstract

Protein oxidation caused by reactive oxidative species is often associated with aging and neurodegenerative diseases. Among proteins, histamine, which can chelate metal ions, is a common target of oxidative damage. 2-Oxyhistidine is a product of histidine oxidation under metal-catalyzed recent research and has the potential to be used as a biomarker of protein oxidation damage. However, due to insufficient synthetic methods, its biophysical and biochemical properties have not yet been thoroughly studied. We have developed an effective method that utilizes metal-catalyzed oxidation reactions to produce 2-oxyhistidine acid on amino acid monomers and peptide chains. In the common metal catalytic oxidation system of divalent copper ions/sodium ascorbate/oxygen, we adjusted the ratio of reagents and the buffer solution system of the reaction, so that the yield of the reaction increased more than ten times compared with the literature reports, and we were able to obtain a complete Oxidized 2-oxyhistidine peptide chain to identify its properties.
The peptide chain containing 2-oxyhistidine has a longer retention time on the carbon-18 column in the analysis of the liquid chromatography tandem mass spectrometer; its structure is free in the mass spectrometer electrospray and tandem mass spectrometry collision Demonstrates a high degree of stability during induced dissociation. Therefore, mass spectrometry analysis of proteosomes targeting 2-oxyhistidine is feasible. The 2-oxyhistidine peptide chain we synthesized can be used as a standard for such mass spectrometry experiments, and can also be further used as a peptide probe or to induce antibody production. There is considerable research on this non-enzymatic post-translational modification. value.
In this study, we linked a peptide chain containing 2-oxyhistidine to biotin as a probe to study 2-oxyhistidine interactors. Biotin can immobilize the probe on a solid phase modified with avidin, and after reacting with HeLa cell lysate, the 2-oxyhistidine interactor can be obtained and identified by mass spectrometry. Since new evidence gradually shows that oxidatively damaged proteins may be broken down by proteasomes, we cross-analyzed 2-oxyhistidine interactors and proteasome interactors in an attempt to understand the cellular response to protein quality. control mechanism.

Keywords: 2-oxyhistidine, reactive oxidative species, metal-catalyzed oxidation, protease, interactor
"""

## Evaluating Translation Quality

Here is the translation provided by the paper's authors. Like all translations, it is subjective to a degree, but it gives a good idea of the meaning the authors intended, so we use it as the reference when scoring.

In [7]:
reference_translation = """ Protein oxidation by reactive oxygen species has been associated with aging and neurodegenerative disorders, and histidine is a major target for oxidation due to its metal chelating property and susceptibility to metal-catalyzed oxidation. 2-oxohistidine, the major product of histidine oxidation, has been recently identified as a stable marker of oxidative damage in biological systems, but its biophysical and biochemical properties are understudied, partly due to difficulties in its chemical synthesis. We developed an efficient method to generate 2-oxohistidine side chain using metal catalyzed oxidation, applicable to both monomers and peptides. By optimizing reagent ratios and pH value in Cu2+/ascorbate/O2 reaction system, we improved the yield by more than 10-fold compared to reported conditions, which allowed us to obtain homogeneously modified 2-oxohisidine peptides for further characterization.  
Analysis of 2-oxohistidine-containing model peptides by liquid chromatography-tandem mass spectrometry revealed increased retention time in reverse-phase chromatography and general stability of 2-oxohistidine under electrospray ionization and collision-induced dissociation. Thus, large-scale analysis of 2-oxohistidine-modified proteome should be feasible using shotgun protein mass spectrometry. Peptide probes and antigens containing 2-oxohistidine will be important tools to advance the biochemical and proteomic studies of this non-enzymatic post-translational modification. 
Furthermore, 2-oxohistidine containing peptides were conjugated with biotin to make probes for interactome analysis. The 2-oxohistidine peptide probe was bound to monomeric avidin agarose and incubated with HeLa cell lysate to capture 2-oxohistidine interactome, which was finally eluted by biotin and identified with LC-MS/MS. Recent evidence has shown that some oxidized proteins are degraded by proteasomes. The interconnection between 2-oxohistidine interactome and proteasome interactomes was studied as an attempt to understand the underlying quality control mechanism. 
 
Key words: 2-oxohistidine, reactive oxygen species, metal-catalyzed oxidation, proteasome, interactome 
  """

In [None]:
print("\nEvaluating Agent-Based Translation:")
calculate_meteor(reference_translation, translated_text)
calculate_rouge(reference_translation, translated_text)

## Evaluating Google Translation

In [None]:
print("\nEvaluating Google Translation:")
calculate_meteor(reference_translation, google_translated_text)
calculate_rouge(reference_translation, google_translated_text)

## Conclusion
In this tutorial, we've:
- Explored the concept of agents in the context of LLMs.
- Demonstrated how agents can enhance technical language translation.
- Implemented an agent-based translation system that extracts and manages technical terms.
- Compared our agent-based approach with standard machine translation.
- Evaluated translation quality using METEOR and ROUGE metrics.

Key Takeaways:
- Agents enable structured and iterative workflows that can improve translation quality.
- Managing technical terms explicitly helps keep translations consistent and accurate.
- Back translation and comparison help identify and rectify discrepancies.

By incorporating agentic frameworks, data scientists can build more sophisticated and reliable NLP applications.

## Next Steps
- Extend to Other Domains: Adapt the agent for other technical fields like legal or engineering documents.
- Automate Refinement: Implement automated feedback incorporation for iterative improvement.
- Scale Up: Integrate the agent into larger translation pipelines or applications.
