
# Training‑Free Prompting (Three‑Step) — Two Versions

This notebook implements the three-step method shown in the paper's figure:

1. **Explain** the idiom in the **target language** (default: English) → *explanation / true meaning*
2. **Literal translation** (word‑by‑word) into English
3. **Natural idiomatic translation**, combining (1) + (2)

We provide **two versions**:

- **Version A (Paper Step 3 only):** Use the CSV's `true_meaning` and `literal_translation` for steps (1) and (2), then run step (3) to produce the final idiomatic translation.
- **Version B (Fully LLM‑driven):** Ask the LLM to produce steps (1) and (2), then run step (3).

Each version writes its results to its own CSV file (`OUT_A` and `OUT_B` in the configuration cell).
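
As a rough illustration of how the three steps chain together, here is a minimal sketch that runs the pipeline against a stand-in model. The names `three_step` and `fake_llm` and the shortened prompt wording are assumptions made for this example only; the notebook's actual prompts and API helper are defined in later cells.

```python
from typing import Callable, Dict, List

def three_step(idiom: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Explain -> literal gloss -> natural translation, using any text-in/text-out callable."""
    explanation = llm(f"Explain the meaning of this Chinese idiom in English: {idiom}")
    literal = llm(f"Give a literal, word-by-word English translation: {idiom}")
    # Step 3 sees the outputs of steps 1 and 2.
    final = llm(
        "Give a natural English translation.\n"
        f"Idiom: {idiom}\nExplanation: {explanation}\nLiteral: {literal}"
    )
    return {"src": idiom, "explanation": explanation, "literal": literal, "final": final}

# Hypothetical stand-in for a real model: records each prompt and returns a canned reply.
calls: List[str] = []

def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"<reply {len(calls)}>"

result = three_step("一步登天", fake_llm)
print(result["final"])  # <reply 3>
print(len(calls))       # 3
```

Version A corresponds to skipping the first two `llm` calls and reading `explanation` and `literal` from the CSV instead.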


In [None]:

# (Optional) Install dependencies if needed (tqdm is imported in the configuration cell):
%pip install --quiet openai pandas tqdm



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49m/Users/jimcheng/Desktop/UIUC/cs546/hw1/.venv/bin/python -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


## Configuration

In [None]:

import os, json, re, hashlib, pathlib, asyncio
from dataclasses import dataclass
from typing import List, Dict, Tuple, Any, Optional
from collections import Counter
import pandas as pd
from tqdm.notebook import tqdm

# --- Paths ---
INPUT_CSV = "petci_chinese_english_improved.csv"
OUT_A     = "version_A_results.csv"      # Uses CSV meanings/literals + Step 3
OUT_B     = "version_B_results.csv"      # Full LLM: Steps 1 + 2 + 3

# --- Model / Inference ---
MODEL = os.getenv("GPT5_MODEL", "gpt-5-mini")
TARGET_EXPLANATION_LANGUAGE = "English"  # change to "Chinese" or others if needed

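# Ensure your OpenAI key is available. Prefer exporting it in your shell:
# setdefault keeps an existing key and only falls back to this placeholder.
os.environ.setdefault("OPENAI_API_KEY", "sk-...")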

CACHE_DIR = pathlib.Path("./cache_three_step")
CACHE_DIR.mkdir(parents=True, exist_ok=True)

@dataclass
class InferenceConfig:
    model: str = MODEL
    top_p: float = 1.0
    seed: Optional[int] = None
    system_prompt: str = "You are a precise bilingual translator. Output compact text, no extra commentary."


## Load CSV

In [None]:

def load_idioms(csv_path: str) -> pd.DataFrame:
    df = pd.read_csv(csv_path)
    if "src" not in df.columns:
        raise ValueError("CSV must include a 'src' column.")
    if "true_meaning" not in df.columns:
        df["true_meaning"] = None
    if "literal_translation" not in df.columns:
        df["literal_translation"] = None
    return df

df = load_idioms(INPUT_CSV)
df.head()


Unnamed: 0,src,true_meaning,literal_translation
0,一波未平，一波又起,Suffer a string of reverses,"One wave is not flat, another wave is rising"
1,一板三眼,Following a prescribed pattern in speech or ac...,One board and three eyes
2,一鼻孔出气,Be in tune with,Exhale through one nostril
3,一步登天,Have a meteoric rise,One step to the sky
4,一不做，二不休,"Once it is started, go through with it","Don't do it, never stop"


## GPT‑5 Call Helper (with disk cache)

In [None]:

def _cache_key(payload: Dict[str, Any]) -> str:
    return hashlib.sha256(json.dumps(payload, sort_keys=True, ensure_ascii=False).encode()).hexdigest()

def call_gpt5(user_content: str, cfg: InferenceConfig) -> str:
    # Everything that can change the response goes into the cache key.
    payload = {
        "model": cfg.model,
        "top_p": cfg.top_p,
        "seed": cfg.seed,
        "system": cfg.system_prompt,
        "user": user_content,
    }
    key = _cache_key(payload)
    f = CACHE_DIR / f"{key}.json"
    if f.exists():
        # Cache hit: return the stored text without calling the API.
        return json.loads(f.read_text(encoding="utf-8"))["text"]

    from openai import OpenAI
    client = OpenAI()
    resp = client.chat.completions.create(
        model=cfg.model,
        messages=[
            {"role":"system","content":cfg.system_prompt},
            {"role":"user","content":user_content},
        ],
        top_p=cfg.top_p,
        seed=cfg.seed
    )
    # content may be None (for example on a refusal), so fall back to "".
    text = (resp.choices[0].message.content or "").strip()
    # Write as UTF-8 explicitly: ensure_ascii=False can emit non-ASCII characters.
    f.write_text(json.dumps({"text": text}, ensure_ascii=False), encoding="utf-8")
    return text

CFG = InferenceConfig()
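
To illustrate the caching idea in isolation, the sketch below hashes a request payload into a file name and only calls the model on a cache miss. `cached_call` and `fake_model` are hypothetical names for this example, and a temporary directory stands in for `CACHE_DIR`; no API is contacted.

```python
import hashlib
import json
import pathlib
import tempfile

def cache_key(payload: dict) -> str:
    # sort_keys makes the hash independent of dict insertion order.
    blob = json.dumps(payload, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

def cached_call(payload: dict, compute, cache_dir: pathlib.Path) -> str:
    f = cache_dir / f"{cache_key(payload)}.json"
    if f.exists():
        return json.loads(f.read_text(encoding="utf-8"))["text"]
    text = compute(payload)
    f.write_text(json.dumps({"text": text}, ensure_ascii=False), encoding="utf-8")
    return text

# Hypothetical stand-in for the model; the list counts real (uncached) calls.
model_calls = []

def fake_model(payload: dict) -> str:
    model_calls.append(payload)
    return payload["user"].upper()

with tempfile.TemporaryDirectory() as d:
    cache_dir = pathlib.Path(d)
    a = cached_call({"model": "m", "user": "hello"}, fake_model, cache_dir)
    b = cached_call({"user": "hello", "model": "m"}, fake_model, cache_dir)  # same content, different key order
    c = cached_call({"model": "m", "user": "hello!"}, fake_model, cache_dir)  # different prompt

print(a, b, c)           # HELLO HELLO HELLO!
print(len(model_calls))  # 2
```

Because the key covers the model, sampling settings, and both prompts, changing any of them produces a new cache entry, while re-running the same cell reuses the stored response.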


## Three Prompts (matching the paper steps)

In [None]:

PROMPT_STEP1_EXPLAIN = """
Explain the meaning of the following Chinese idiom in {lang}.
- Audience: educated readers; be concise (<= 2 sentences).
- Do not translate word-by-word; provide the **idiomatic sense**.

Idiom: {idiom}
""".strip()

PROMPT_STEP2_LITERAL = """
Provide a **literal, word-by-word** English translation for the following Chinese idiom.
- Keep it terse and faithful to each component.
- No commentary, just the literal gloss.

Idiom: {idiom}
""".strip()

PROMPT_STEP3_NATURAL = """
Produce a **natural English idiomatic translation** given:
(1) An idiom explanation (idiomatic meaning) and
(2) A literal word-by-word gloss.

Rules:
- Output a single short English phrase/sentence that a native speaker would actually say.
- Prefer clarity and naturalness over literalness.
- No extra commentary.

Idiom: {idiom}
Explanation: {explanation}
Literal: {literal}
Result:
""".strip()
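
For reference, here is a small sketch of what the Step 3 prompt looks like once filled in, using a row from the dataset preview (`一步登天`). The helper `render_step3` is a hypothetical name for this example, and it takes a plain dict with `None` for missing values; the notebook itself checks for missing values with `pd.notna`.

```python
STEP3_TEMPLATE = """
Produce a **natural English idiomatic translation** given:
(1) An idiom explanation (idiomatic meaning) and
(2) A literal word-by-word gloss.

Rules:
- Output a single short English phrase/sentence that a native speaker would actually say.
- Prefer clarity and naturalness over literalness.
- No extra commentary.

Idiom: {idiom}
Explanation: {explanation}
Literal: {literal}
Result:
""".strip()

def render_step3(row: dict) -> str:
    """Fill the Step 3 template; missing fields (None) become empty strings."""
    return STEP3_TEMPLATE.format(
        idiom=row["src"],
        explanation=row.get("true_meaning") or "",
        literal=row.get("literal_translation") or "",
    )

row = {
    "src": "一步登天",
    "true_meaning": "Have a meteoric rise",
    "literal_translation": "One step to the sky",
}
prompt = render_step3(row)
print(prompt)
```

The prompt ends with `Result:` so that the model's reply is the translation itself rather than a restatement of the inputs.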


## Version A — Use CSV (steps 1 & 2 from file) → Run Step 3 only

In [None]:

def version_A_run(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for _, r in df.iterrows():
        idiom = str(r["src"])
        explanation = str(r["true_meaning"]) if pd.notna(r["true_meaning"]) else ""
        literal = str(r["literal_translation"]) if pd.notna(r["literal_translation"]) else ""

        # Step 3 prompt
        p3 = PROMPT_STEP3_NATURAL.format(idiom=idiom, explanation=explanation, literal=literal)
        final = call_gpt5(p3, CFG)

        rows.append({
            "src": idiom,
            "explanation_used": explanation,
            "literal_used": literal,
            "final_translation": final
        })
    return pd.DataFrame(rows)

res_A = version_A_run(df)
res_A.head()


Unnamed: 0,src,explanation_used,literal_used,final_translation
0,一举成名,Achieve instant fame,Rise to fame,Become famous overnight.
1,不胜枚举,Cannot be enumerated one by one,The list goes on,There are too many to list.
2,偷鸡摸狗,Crooked dealings,Stalking the dog,shady dealings
3,口蜜腹剑,Hypocritical and malignant,Honey belly sword,Two-faced and treacherous
4,大势已去,The game is as good as lost,The tide is gone,The game is as good as lost.


## Version A (Fast): Use CSV (steps 1 & 2 from file) → Run Step 3 only, in parallel

In [None]:
# helper: run blocking call_gpt5 in a thread
async def _run_gpt(p3, cfg):
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, call_gpt5, p3, cfg)

async def _worker(row):
    idiom = str(row["src"])
    explanation = str(row["true_meaning"]) if pd.notna(row["true_meaning"]) else ""
    literal = str(row["literal_translation"]) if pd.notna(row["literal_translation"]) else ""

    p3 = PROMPT_STEP3_NATURAL.format(
        idiom=idiom,
        explanation=explanation,
        literal=literal,
    )

    final = await _run_gpt(p3, CFG)

    return {
        "src": idiom,
        "explanation_used": explanation,
        "literal_used": literal,
        "final_translation": final,
    }

async def version_A_run_parallel(df: pd.DataFrame, concurrency: int = 8) -> pd.DataFrame:
    # limit how many GPT calls happen at once
    sem = asyncio.Semaphore(concurrency)
    tasks = []

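    for row in df.to_dict(orient="records"):
        # bind row as a default argument so each task keeps its own row
        async def go(row=row):
            async with sem:
                return await _worker(row)
        tasks.append(go())

    # as_completed yields tasks as they finish, so the output rows are in
    # completion order, not in the CSV's row order
    results = []
    for coro in tqdm(asyncio.as_completed(tasks),
                     total=len(tasks),
                     desc="Translating idioms"):
        results.append(await coro)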

    return pd.DataFrame(results)


res_A = await version_A_run_parallel(df)
res_A.head()


Translating idioms:   0%|          | 0/1623 [00:00<?, ?it/s]

Unnamed: 0,src,explanation_used,literal_used,final_translation
0,乘兴而来，败兴而归,Set out cheerfully and return disappointed,"Come in good times, come back in defeat",Went in high hopes and came back disappointed.
1,一波未平，一波又起,Suffer a string of reverses,"One wave is not flat, another wave is rising","When it rains, it pours."
2,立足之地,Standing-room,Gain a foothold,a foothold
3,革故鼎新,Discard the old ways of life in favor of the new,Revolutionizing the old,Cast off the old and embrace the new.
4,不分彼此,Share everything,Regardless of each other,We share everything.
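
The rows above come back in completion order rather than CSV order, because `asyncio.as_completed` yields tasks as they finish. If input order matters, one option is `asyncio.gather`, which returns results in the order the awaitables were passed. The sketch below shows that variant with a hypothetical blocking function `slow_upper` standing in for `call_gpt5`; it is an alternative to the cell above, not the method used there, and it drops the progress bar for brevity.

```python
import asyncio
import time

def slow_upper(text: str) -> str:
    """Hypothetical blocking call standing in for call_gpt5."""
    time.sleep(0.05)
    return text.upper()

async def run_bounded(items, concurrency: int = 3):
    sem = asyncio.Semaphore(concurrency)  # at most `concurrency` calls in flight
    loop = asyncio.get_running_loop()

    async def one(item):
        async with sem:
            # Run the blocking function in the default thread pool.
            return await loop.run_in_executor(None, slow_upper, item)

    # gather preserves the order of its arguments, regardless of finish order.
    return await asyncio.gather(*(one(x) for x in items))

# In a plain script use asyncio.run; inside a notebook cell, use
# `results = await run_bounded([...])` instead, since a loop is already running.
results = asyncio.run(run_bounded(["a", "b", "c", "d", "e"]))
print(results)  # ['A', 'B', 'C', 'D', 'E']
```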


## Version B — Full LLM (steps 1 & 2 via model) → Run Step 3

In [None]:

def step1_explain(idiom: str, lang: str = TARGET_EXPLANATION_LANGUAGE) -> str:
    p = PROMPT_STEP1_EXPLAIN.format(idiom=idiom, lang=lang)
    return call_gpt5(p, CFG)

def step2_literal(idiom: str) -> str:
    p = PROMPT_STEP2_LITERAL.format(idiom=idiom)
    return call_gpt5(p, CFG)

def step3_natural(idiom: str, explanation: str, literal: str) -> str:
    p = PROMPT_STEP3_NATURAL.format(idiom=idiom, explanation=explanation, literal=literal)
    return call_gpt5(p, CFG)

def version_B_run(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for _, r in df.iterrows():
        idiom = str(r["src"])

        # Step 1 & 2 generated by LLM
        explanation_gen = step1_explain(idiom)
        literal_gen     = step2_literal(idiom)

        # Step 3
        final = step3_natural(idiom, explanation_gen, literal_gen)

        rows.append({
            "src": idiom,
            "explanation_gen": explanation_gen,
            "literal_gen": literal_gen,
            "final_translation": final
        })
    return pd.DataFrame(rows)

res_B = version_B_run(df)
res_B.head()


Unnamed: 0,src,explanation_gen,literal_gen,final_translation
0,一举成名,To achieve sudden or instant fame as the resul...,one move become name,become an overnight sensation
1,不胜枚举,"It means ""too numerous to list individually"" —...",not / able to / (measure word for small items)...,Too numerous to count.
2,偷鸡摸狗,"It describes petty, underhanded behavior—small...",steal chicken touch dog,"petty, underhanded schemes"
3,口蜜腹剑,To flatter with honeyed words while secretly p...,"mouth honey, belly sword",Sweet-talking but backstabbing.
4,大势已去,The overall momentum has decisively turned aga...,big momentum already gone,It's a lost cause.


## Save Results

In [None]:

res_A.to_csv(OUT_A, index=False)
#res_B.to_csv(OUT_B, index=False)   # uncomment to also save Version B
print("Saved:")
print(" - Version A →", OUT_A)
#print(" - Version B →", OUT_B)     # uncomment together with the save above


Saved:
 - Version A → version_A_results.csv
