# LLM Translation at Scale with Inference.net Batch API

LLMs are remarkably good at translation, and it doesn't take a particularly strong one: a small 8-30B parameter model is more than capable of translating between most languages. The OpenRouter leaderboard shows that the most popular models used for translation are small and fast, which lets you translate very large amounts of text remarkably cheaply.

![image](openrouter.png)

For most translation tasks, the specific cheap model/provider you use isn't particularly important. But some translation jobs scale to the hundreds of millions to tens of trillions of tokens, and at that point price and rate limits become a factor. 

This is where Inference excels: we serve models extremely cheaply and have no rate limits for time-insensitive batch jobs like this.

Here's how you can get started with LLM translation:


## Setting Up Your Translation Pipeline

The beauty of using Inference.net is that it's compatible with the OpenAI SDK, so you can get started in seconds. Just point the client at our batch endpoint:


In [None]:
%pip install openai -q

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://batch.inference.net/v1",
    api_key=os.getenv("INFERENCE_API_KEY"),
)

## Your First Translation Batch

Let's say you need to translate a batch of product descriptions. With the Batch API, you prepare all your requests in a JSONL file where each line is a complete translation request:


In [2]:
import json

# Some product descriptions to translate
documents = [
    "High-quality wireless headphones with noise cancellation",
    "Ergonomic office chair with lumbar support", 
    "Smart home thermostat with energy-saving features",
    "Professional camera with 4K video recording",
    "Portable power bank with fast charging"
]

# Create the batch file
with open("translation_batch.jsonl", "w") as f:
    for idx, doc in enumerate(documents):
        request = {
            "custom_id": f"translation-{idx}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "meta-llama/llama-3.2-1b-instruct/fp-8",
                "messages": [
                    {
                        "role": "system",
                        "content": "You are a professional translator. Translate the following English text to Spanish, preserving the tone and meaning. Only translate the text, do not add any other text."
                    },
                    {
                        "role": "user",
                        "content": doc
                    }
                ],
                "max_tokens": 100,
                "temperature": 0.3  # Lower temperature for consistent translations
            }
        }
        f.write(json.dumps(request) + "\n")

print(f"Created batch file with {len(documents)} translation requests")


Created batch file with 5 translation requests


The key here is that each request is self-contained. You specify the model, the system prompt (your translation instructions), and the text to translate. Setting temperature to 0.3 gives you consistent, professional translations without too much creativity.
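
Because every line has to stand on its own, it can be worth sanity-checking the file before uploading it. The following is a minimal sketch, not part of any SDK: `validate_batch_lines` is a hypothetical helper that checks that each line parses as JSON, carries the four top-level fields used above, and has a unique `custom_id`.

```python
import json

# Top-level fields used in the batch file above
REQUIRED_KEYS = {"custom_id", "method", "url", "body"}

def validate_batch_lines(lines):
    """Return a list of (line_number, problem) tuples; an empty list means no problems were found."""
    problems = []
    seen_ids = set()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            request = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(request, dict):
            problems.append((number, "line is not a JSON object"))
            continue
        missing = REQUIRED_KEYS - request.keys()
        if missing:
            problems.append((number, f"missing keys: {sorted(missing)}"))
        custom_id = request.get("custom_id")
        if custom_id is not None:
            if custom_id in seen_ids:
                problems.append((number, f"duplicate custom_id: {custom_id}"))
            seen_ids.add(custom_id)
    return problems
```

Since a file object iterates line by line, you could call it as `validate_batch_lines(open("translation_batch.jsonl"))` and only upload when the returned list is empty.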


## Launching the Job

Once your batch file is ready, it's a two-step process: upload the file, then create the batch job.


In [3]:
# Upload the file
with open("translation_batch.jsonl", "rb") as f:
    batch_file = client.files.create(
        file=f,
        purpose="batch"
    )


# Create the batch job
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
    metadata={
        "job_type": "translation",
        "document_count": str(len(documents))
    }
)

print(f"Batch job started: {batch.id}")


Batch job started: dW09IT9URRp-tQiK_4vjr


The batch starts processing immediately. For small jobs like this, it'll complete in seconds. For massive jobs with millions of tokens, it might take a few hours--but that's still faster than hitting rate limits with synchronous APIs.


## Getting Your Translations

Once the batch completes, you download the results and parse them:


In [None]:
"""Batch helper utilities"""

import json
import time
from typing import Any, Dict, List

# Statuses after which the batch will not change any more
TERMINAL_STATUSES = {"completed", "failed", "cancelled", "expired"}

# ── Helpers ─────────────────────────────────────────────────────────────────────

def wait_for_batch(client, batch_id: str, interval: int = 2) -> Any:
    """Poll the batch until it reaches a terminal status; returns the final status object."""
    while True:
        status = client.batches.retrieve(batch_id)
        if status.status in TERMINAL_STATUSES:
            return status
        print("Status:", status.status)
        time.sleep(interval)

def ndjson_to_dicts(client, file_id) -> List[Dict[str, Any]]:
    """Download a file and parse NDJSON into a list of dicts; returns [] when there is no file."""
    if not file_id:
        return []
    text = client.files.content(file_id).text
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def show_translations(docs: List[str], records: List[Dict[str, Any]]) -> None:
    """Pretty-print originals next to their Spanish translations, matched by custom_id."""
    by_id = {rec["custom_id"]: rec for rec in records}
    for i, original in enumerate(docs):
        rec = by_id.get(f"translation-{i}")
        if rec and rec.get("response"):
            translated = rec["response"]["body"]["choices"][0]["message"]["content"]
        else:
            translated = "<missing>"
        print(f"\nOriginal: {original}\nSpanish:  {translated}")

def show_errors(err_records: List[Dict[str, Any]], limit: int = 10) -> None:
    """Display up to *limit* raw error log lines."""
    print("⚠️  Output empty – showing error log lines:\n")
    for rec in err_records[:limit]:
        print(json.dumps(rec))

# ── Main workflow ──────────────────────────────────────────────────────────────

status      = wait_for_batch(client, batch.id)
output      = ndjson_to_dicts(client, status.output_file_id)
error_lines = ndjson_to_dicts(client, getattr(status, "error_file_id", None))

if output:
    show_translations(documents, output)
else:
    show_errors(error_lines)


⚠️  Output empty – showing error log lines:

{"id": "dW09IT9URRp-tQiK_4vjr", "custom_id": "translation-4", "response": null, "error": {"code": "inference_failed", "message": "Maximum retries reached"}}
{"id": "dW09IT9URRp-tQiK_4vjr", "custom_id": "translation-3", "response": null, "error": {"code": "inference_failed", "message": "Maximum retries reached"}}
{"id": "dW09IT9URRp-tQiK_4vjr", "custom_id": "translation-2", "response": null, "error": {"code": "inference_failed", "message": "Maximum retries reached"}}
{"id": "dW09IT9URRp-tQiK_4vjr", "custom_id": "translation-1", "response": null, "error": {"code": "inference_failed", "message": "Maximum retries reached"}}
{"id": "dW09IT9URRp-tQiK_4vjr", "custom_id": "translation-0", "response": null, "error": {"code": "inference_failed", "message": "Maximum retries reached"}}


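The results come back as JSONL too, with each line containing the `custom_id` you specified and the translation. Failed requests, like the ones in the run above, carry the same `custom_id` with a null `response` and an `error` object. Either way, it is trivial to match records back to your original documents.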
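
As a rough sketch of that matching step, the hypothetical helper below splits a downloaded JSONL payload into successes and failures, both keyed by `custom_id`. The failure shape mirrors the error lines shown above; the success shape (`response.body.choices[0].message.content`) is assumed to follow the OpenAI batch output format, so verify it against a real output file before relying on it.

```python
import json

def split_results(jsonl_text):
    """Return (translations, failures), both keyed by custom_id."""
    translations, failures = {}, {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        custom_id = record["custom_id"]
        response = record.get("response")
        if record.get("error") or not response:
            # Failed request: keep the error object so it can be logged or retried
            failures[custom_id] = record.get("error")
            continue
        # Assumed success shape: response -> body -> choices[0] -> message -> content
        translations[custom_id] = response["body"]["choices"][0]["message"]["content"]
    return translations, failures
```

The keys of `failures` are exactly the requests you would write into a follow-up batch file for a retry.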


## Scaling to Millions of Documents

The real power comes when you scale up. Let's say you're translating an entire e-commerce catalog with 1,000,000 products into 5 languages. That's 5,000,000 translations. Before LLMs, this would have been an impractical number of translations. Now it's trivial. Let's calculate the cost of this task.

In [20]:
# Simulating a large catalog
languages = ["es", "fr", "de", "ja", "zh"]
products_per_language = 1000000
total_requests = len(languages) * products_per_language

# Estimate costs (using Llama 3.2 1B)
avg_tokens_per_request = 3000  # ~1500 input, ~1500 output
total_tokens = total_requests * avg_tokens_per_request
cost_per_million_tokens = 0.10  # Check current pricing
total_cost = (total_tokens / 1_000_000) * cost_per_million_tokens

print(f"Translation job size:")
print(f"  Languages: {len(languages)}")
print(f"  Products: {products_per_language:,}")
print(f"  Total requests: {total_requests:,}")
print(f"  Estimated tokens: {total_tokens:,}")
print(f"  Estimated cost: ${total_cost:.2f}")
print(f"  Cost per translation: ${total_cost/total_requests:.4f}")

Translation job size:
  Languages: 5
  Products: 1,000,000
  Total requests: 5,000,000
  Estimated tokens: 15,000,000,000
  Estimated cost: $1500.00
  Cost per translation: $0.0003


At these scales, traditional translation APIs would either reject your requests or charge enterprise rates. With Inference.net's Batch API, you just upload larger JSONL files and wait. No rate limits, no throttling, just results.
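
At this size you probably don't want to build every request in memory before writing the file. Here is a minimal sketch of one way to stream a catalog into a batch file, one line at a time. The helper names, the `product-{idx}-{code}` id scheme, and the prompt wording are illustrative assumptions, not a prescribed format.

```python
import json

def iter_requests(products, languages, model):
    """Lazily yield one batch request per (language, product) pair.

    `languages` maps a short code (used in custom_id) to a language name (used in the prompt).
    """
    for code, name in languages.items():
        for idx, text in enumerate(products):
            yield {
                "custom_id": f"product-{idx}-{code}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": model,
                    "messages": [
                        {
                            "role": "system",
                            "content": f"Translate the following English text to {name}. "
                                       "Only translate the text, do not add any other text.",
                        },
                        {"role": "user", "content": text},
                    ],
                    "temperature": 0.3,
                },
            }

def write_jsonl(requests, path):
    """Stream requests to `path` one line at a time; return the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for request in requests:
            f.write(json.dumps(request, ensure_ascii=False) + "\n")
            count += 1
    return count
```

For example, `write_jsonl(iter_requests(products, {"es": "Spanish", "fr": "French"}, model_name), "catalog.jsonl")` writes two requests per product while holding only one request in memory at a time.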


## Production Tips

Here are some general tips for translation:

**1. Use the smallest model that works.** Llama 3.2 1B is good enough for most translation tasks, but it struggles with following directions and processing long documents. Try different models and see what works well. If you are processing fewer than 100M tokens, just use an 8B model and call it a day.

**2. Add glossaries to your system prompt.** If you have specific terms that must be translated consistently, you can inject glossaries into your prompt.
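
One simple way to do this is to append the glossary to the system prompt as a list of fixed term pairs. The sketch below assumes a plain dict of source-to-target terms; the helper name, the prompt wording, and the sample entries are illustrative only.

```python
def build_system_prompt(target_language, glossary=None):
    """Build a translation system prompt, optionally appending a fixed-term glossary."""
    prompt = (
        "You are a professional translator. Translate the following English text "
        f"to {target_language}, preserving the tone and meaning. "
        "Only translate the text, do not add any other text."
    )
    if glossary:
        # One "- source -> target" line per glossary entry
        terms = "\n".join(f"- {source} -> {target}" for source, target in glossary.items())
        prompt += f"\n\nAlways use these translations for the listed terms:\n{terms}"
    return prompt

# Illustrative glossary entries (not taken from a real style guide)
glossary = {"power bank": "batería externa", "thermostat": "termostato"}
print(build_system_prompt("Spanish", glossary))
```

Keep the glossary short: every entry is sent with every request, so it adds input tokens to each translation.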

**3. Break long texts into parts, translate the parts, and merge the results.** Don't try to translate very long documents (more than about half a page to one page) at once. You may also want to try splitting by paragraph.
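
For plain text, a minimal paragraph-based splitter might look like the sketch below. It greedily packs paragraphs into chunks up to a character budget; the `max_chars=1500` default is an arbitrary assumption to tune for your content, and both helper names are hypothetical.

```python
def split_into_chunks(text, max_chars=1500):
    """Greedily pack paragraphs (separated by blank lines) into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own oversized chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            chunks.append(current)   # budget exceeded: close the current chunk
            current = paragraph      # and start a new one with this paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

def merge_chunks(translated_chunks):
    """Rejoin translated chunks in their original order."""
    return "\n\n".join(translated_chunks)
```

Because chunks only break at paragraph boundaries, joining the translated chunks with blank lines restores the original paragraph structure.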

**4. Use webhooks for large jobs or continuously arriving data.** If new data is constantly coming in, a single static batch job may make less sense than a processing pipeline: you send new texts as they arrive, we process them and send the results to your webhook, and the webhook updates your database.

For more information, see our docs:
[Webhooks Quick Reference](https://docs.inference.net/features/asynchronous-inference/webhooks/quick-reference)


**5. Validate critical translations.** For important content, run a second pass with a different model to check for issues.

In [21]:
def translate_with_model(model_name, text, target_language):
    """Translate text using specified model"""
    response = client.chat.completions.create(
        model=model_name,
        messages=[
            {
                "role": "system",
                "content": f"Translate the following text to {target_language}. Preserve formatting and technical terms. Only translate the text, do not add any other text."
            },
            {
                "role": "user", 
                "content": text
            }
        ]
    )
    return response.choices[0].message.content

def verify_translation(original_text, translation, source_lang, target_lang, model_name):
    """Use a second model to check the translation; returns a dict with is_correct, confidence, and explanation"""
    response = client.chat.completions.create(
        model=model_name,
        messages=[
            {
                "role": "system",
                "content": f"You are a translation quality checker. Evaluate if the {target_lang} translation accurately represents the {source_lang} original text. Consider meaning, context, and technical accuracy. Respond in JSON format."
            },
            {
                "role": "user",
                "content": f"Original ({source_lang}): {original_text}\nTranslation ({target_lang}): {translation}\n\nIs this translation correct?"
            }
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "translation_verification",
                "schema": {
                    "type": "object",
                    "properties": {
                        "is_correct": {"type": "boolean"},
                        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
                        "explanation": {"type": "string"}
                    },
                    "required": ["is_correct", "confidence", "explanation"],
                    "additionalProperties": False
                },
                "strict": True
            }
        }
    )
    
    import json
    result = json.loads(response.choices[0].message.content)
    return result

def flag_for_human_review(text, original):
    """Flag text for human review"""
    print(f"Translation flagged for review:")
    print(f"Original: {original}")
    print(f"Translation: {text}")

# Create a mock product for demonstration
class MockProduct:
    def __init__(self, description, price, is_featured=False):
        self.description = description
        self.price = price
        self.is_featured = is_featured

# Example usage with a high-value product
product = MockProduct("Professional camera with 4K video recording", 1500, is_featured=True)
target_language = "Spanish"

print("Demonstrating translation validation:")
print(f"Original: {product.description}")

# First pass: translate with the primary (27B) model
primary_translation = translate_with_model("google/gemma-3-27b-instruct/bf-16", product.description, target_language)
print(f"Primary translation: {primary_translation}")

# Second pass: verify translation quality with a different (7B) model
if product.is_featured or product.price > 1000:
    print("Verifying translation quality...")
    verification_result = verify_translation(
        original_text=product.description,
        translation=primary_translation,
        source_lang="English",
        target_lang=target_language,
        model_name="qwen/qwen2.5-7b-instruct/bf-16"
    )
    
    print(f"Is correct: {verification_result['is_correct']}")
    print(f"Confidence: {verification_result['confidence']:.2f}")
    print(f"Explanation: {verification_result['explanation']}")
    
    if not verification_result['is_correct']:
        flag_for_human_review(primary_translation, product.description)
    else:
        print("✅ Translation verified as correct!")


Demonstrating translation validation:
Original: Professional camera with 4K video recording
Primary translation: Cámara profesional con grabación de video 4K

Verifying translation quality...
Is correct: True
Confidence: 1.00
Explanation: The translation is accurate. Both the English term 'Professional camera' and 'Cámara profesional' as well as the technical term '4K video recording' and 'grabación de video 4K' accurately and precisely convey the meaning and function of the original text. The translation maintains the exact details and context of the original.
✅ Translation verified as correct!
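
The example above only flags a translation when the checker says it is incorrect. A slightly stricter policy, sketched below, also routes low-confidence verdicts to a human. The helper and its `0.8` threshold are hypothetical, and a model's self-reported confidence is not a calibrated probability, so treat the cutoff as something to tune against reviewed samples.

```python
def needs_human_review(result, min_confidence=0.8):
    """Return True when the checker rejected the translation or was not confident enough.

    `result` is the dict returned by the verification step, with the keys
    "is_correct" (bool) and "confidence" (number between 0 and 1).
    """
    return (not result["is_correct"]) or result["confidence"] < min_confidence
```

With this in place, the final branch could call `needs_human_review(verification_result)` instead of testing `is_correct` alone.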


## Real-World Example: Chunking and Translating Documentation

Let's fetch a real markdown document and translate it in chunks using the batch API:


In [22]:
# Install chonkie for smart chunking
%pip install chonkie requests -q

import requests
from chonkie import RecursiveChunker

# Fetch the markdown content
url = "https://ai-sdk.dev/llms.txt"
print(f"Fetching content from {url}...")
response = requests.get(url)
response.raise_for_status()
markdown_content = response.text

print(f"Fetched {len(markdown_content)} characters")
print("First 200 characters:")
print(markdown_content[:200] + "...")

# Initialize the markdown chunker
chunker = RecursiveChunker.from_recipe("markdown", lang="en")

# Chunk the content on markdown headers
chunks = chunker.chunk(markdown_content)

Fetching content from https://ai-sdk.dev/llms.txt...
Fetched 789804 characters
First 200 characters:
---
title: RAG Chatbot
description: Learn how to build a RAG Chatbot with the AI SDK and Next.js
tags: ['rag', 'chatbot', 'next', 'embeddings', 'database', 'retrieval']
---

# RAG Chatbot Guide

In th...


In [23]:
print(f"\nChunked into {len(chunks)} sections:")
for i, chunk in enumerate(chunks[:5]):  # Show first 5 chunks
    print(f"Chunk {i+1}: {len(chunk.text)} chars, level {chunk.level}")
    print(f"  Preview: {chunk.text[:100].strip()}...")
    print()

print(f"Total chunks to translate: {len(chunks)}")


Chunked into 740 sections:
Chunk 1: 684 chars, level 0
  Preview: ---
title: RAG Chatbot
description: Learn how to build a RAG Chatbot with the AI SDK and Next.js
tag...

Chunk 2: 1554 chars, level 0
  Preview: ### Why is RAG important?

While LLMs are powerful, the information they can reason on is restricted...

Chunk 3: 1988 chars, level 0
  Preview: ### Embedding

[Embeddings](/docs/ai-sdk-core/embeddings) are a way to represent words, phrases, or...

Chunk 4: 1437 chars, level 0
  Preview: ### All Together Now

Combining all of this together, RAG is the process of enabling the model to re...

Chunk 5: 1099 chars, level 0
  Preview: ### Clone Repo

To reduce the scope of this guide, you will be starting with a [repository](https://...

Total chunks to translate: 740


In [24]:
# Create batch translation requests for all chunks
target_languages = ["Spanish", "French"]

print("Creating batch translation requests...")

# Prepare batch requests for all chunks and languages
batch_requests = []
for lang in target_languages:
    for i, chunk in enumerate(chunks):
        request = {
            "custom_id": f"chunk-{i}-{lang.lower()}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "mistralai/mistral-nemo-12b-instruct/fp-8",
                "messages": [
                    {
                        "role": "system",
                        "content": f"""Translate this technical documentation chunk to {lang}.

Rules:
- Preserve ALL markdown formatting (headers, links, code blocks, etc.)
- Keep code examples in English
- Preserve technical terms when appropriate
- Maintain the structure and meaning
- Only translate the content, don't add explanations
- ONLY give me the translated version of the content, no other text
- Code stays exactly as it is
"""
                    },
                    {
                        "role": "user",
                        "content": chunk.text
                    }
                ],
                "max_tokens": len(chunk.text.split()) * 10,  # Generous token limit
                "temperature": 0.3
            }
        }
        batch_requests.append(request)

print(f"Created {len(batch_requests)} translation requests")
print(f"Languages: {target_languages}")
print(f"Chunks per language: {len(chunks)}")

# Print the cumulative number of chunks at each 100-word threshold
for i in range(0, 1000, 100):
    print(f"Chunks with <= {i} words: {len([chunk for chunk in chunks if len(chunk.text.split()) <= i])}")

Creating batch translation requests...
Created 1480 translation requests
Languages: ['Spanish', 'French']
Chunks per language: 740
Chunks with <= 0 words: 0
Chunks with <= 100 words: 248
Chunks with <= 200 words: 602
Chunks with <= 300 words: 734
Chunks with <= 400 words: 740
Chunks with <= 500 words: 740
Chunks with <= 600 words: 740
Chunks with <= 700 words: 740
Chunks with <= 800 words: 740
Chunks with <= 900 words: 740


In [25]:
# Write batch file
batch_filename = f"docs_chunks_translation_{int(time.time())}.jsonl"
with open(batch_filename, "w") as f:
    for req in batch_requests:
        f.write(json.dumps(req) + "\n")

print(f"Batch file created: {batch_filename}")

# Upload and start batch job
print("Uploading batch file...")
with open(batch_filename, "rb") as f:
    batch_file = client.files.create(
        file=f,
        purpose="batch"
    )

# Create the batch job
batch_job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
    metadata={
        "job_type": "documentation_translation",
        "source_url": url,
        "languages": ",".join(target_languages),
        "total_chunks": str(len(chunks))
    }
)

print(f"Batch translation job started: {batch_job.id}")
print("This will translate all chunks into multiple languages simultaneously!")

Batch file created: docs_chunks_translation_1752537608.jsonl
Uploading batch file...
Batch translation job started: zwg5rS3WgDZqvG-va65Bl
This will translate all chunks into multiple languages simultaneously!


In [None]:
# Check batch status and get results
print("Checking batch status...")

# Wait for completion 
while True:
    batch_status = client.batches.retrieve(batch_job.id)
    print(f"Status: {batch_status.status}")
    
    if batch_status.status == "completed":
        print("✅ Batch completed!")
        break
    elif batch_status.status == "failed":
        print("❌ Batch failed!")
        break
    elif batch_status.status in ["cancelled", "expired"]:
        print(f"❌ Batch {batch_status.status}!")
        break
    
    time.sleep(3)

if batch_status.status == "completed":
    # Download and parse results
    print("Downloading results...")
    results_file = client.files.content(batch_status.output_file_id)

    # The error file only exists if some requests failed
    if batch_status.error_file_id:
        error_file = client.files.content(batch_status.error_file_id)
        print(error_file.text)
    
    # Parse all translation results, skipping requests that came back without a response
    translations = {}
    for line in results_file.text.strip().split('\n'):
        if line.strip():
            result = json.loads(line)
            if not result.get('response'):
                continue
            custom_id = result['custom_id']
            translation = result['response']['body']['choices'][0]['message']['content']
            translations[custom_id] = translation
    
    print(f"Received {len(translations)} translations")
    
    # Reconstruct documents by language
    reconstructed_docs = {}
    
    for lang in target_languages:
        print(f"\n📄 Reconstructing {lang} document...")
        lang_key = lang.lower()
        
        # Get all chunks for this language, sorted by chunk number
        lang_chunks = []
        for i in range(len(chunks)):
            chunk_id = f"chunk-{i}-{lang_key}"
            if chunk_id in translations:
                lang_chunks.append((i, translations[chunk_id]))
        
        # Sort by chunk index and concatenate
        lang_chunks.sort(key=lambda x: x[0])
        reconstructed_text = '\n\n'.join([chunk_text for _, chunk_text in lang_chunks])
        reconstructed_docs[lang] = reconstructed_text
        
        print(f"✅ {lang} document reconstructed: {len(reconstructed_text)} characters")
    
    # Show samples from each language
    print("\n" + "="*60)
    print("TRANSLATION RESULTS PREVIEW")
    print("="*60)
    
    for lang, doc in reconstructed_docs.items():
        print(f"\n🌍 {lang.upper()} VERSION:")
        print("-" * 40)
        # Show first 500 characters
        preview = doc[:500].strip()
        print(preview)
        if len(doc) > 500:
            print("...\n[Truncated - full translation available]")
        print()
    
    # Save translated documents to files
    print("💾 Saving translated documents...")
    for lang, doc in reconstructed_docs.items():
        filename = f"ai-sdk-llms_{lang.lower()}.md"
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(doc)
        print(f"Saved: {filename}")
    
    print("\n🎉 Translation complete! All documents have been translated and saved.")
    
else:
    print("❌ Could not retrieve results - batch did not complete successfully.")


We did it! Our results are saved in two files: `ai-sdk-llms_spanish.md` and `ai-sdk-llms_french.md`.

We just executed the following workflow:

1. **Smart Chunking**: Using Chonkie's markdown recipe to intelligently split documentation on headers
2. **Batch Processing**: Creating hundreds of translation requests simultaneously 
3. **Multi-language**: Translating to multiple target languages in a single batch job

This approach scales beautifully - whether you're translating a few pages or an entire documentation site with tens of thousands of pages.
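
One caveat when scaling up: the reconstruction step silently skips any chunk whose request failed, which leaves a gap in the translated document. A small check like the sketch below lists the missing ids so they can be resubmitted in a follow-up batch. The helper name is hypothetical, and it assumes the `chunk-{i}-{language}` id scheme used above.

```python
def find_missing_chunks(translations, num_chunks, languages):
    """Return the custom_ids that were requested but have no translation, in request order."""
    expected = (
        f"chunk-{i}-{lang.lower()}"
        for lang in languages
        for i in range(num_chunks)
    )
    return [custom_id for custom_id in expected if custom_id not in translations]
```

In the workflow above, `find_missing_chunks(translations, len(chunks), target_languages)` would return an empty list when every chunk came back.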
