## 🚀 Project Background

As *Large Language Models* (LLMs) advance rapidly, analyzing how these models are used and perceived by users becomes increasingly important.  
This project analyzes conversation and model-comparison data to:

- Understand **model popularity trends**,
- Identify **main topics** (n-gram features),
- Evaluate **preference/quality** via **Win-Rate** (with a Wilson 95% CI),
- Measure **resolution efficiency** via **Turns-to-Solve (TTS)**,
- Assess **model fit per task category** (*Fit-for-Purpose*: Coding, Writing (`Penulisan`), Data Analysis (`Analisis Data`), Translation (`Terjemahan`)).

These insights are useful for **automatic routing** of models per topic, **product bundling**, and more precise **pricing/SLA** policies.
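
To illustrate why the win-rate is paired with a Wilson 95% interval, the sketch below compares two hypothetical models that share the same observed win-rate (80%) but have very different sample sizes. The function name `wilson_interval` and the counts are assumptions made for illustration; the formula is the standard Wilson score interval with z = 1.96.

```python
from math import sqrt

def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion wins/n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    half = z * sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    return (centre - half) / denom, (centre + half) / denom

# Hypothetical counts: same observed win-rate, different amounts of evidence
small = wilson_interval(8, 10)
large = wilson_interval(800, 1000)
print(f"8/10     -> [{small[0]:.3f}, {small[1]:.3f}]")
print(f"800/1000 -> [{large[0]:.3f}, {large[1]:.3f}]")
```

The interval for the 10-battle model is much wider than the one for the 1000-battle model, so ranking by the point estimate alone can overstate how good a rarely seen model is.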

## 🎯 Business Questions

1. **Model Popularity**: which models are used most often?
2. **Main Topics**: which keywords/themes do users request most often?
3. **Win-Rate**: which models are preferred (with a Wilson 95% CI)?
4. **Turns-to-Solve (TTS)**: how many turns does it take on average until a conversation is "solved"? *(proxy: the last user message contains "thanks/berhasil/works/etc.")*  
5. **Fit-for-Purpose**: which model leads in each task category (Coding, Writing, Data Analysis, Translation)?
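
The "solved" proxy behind TTS can be sketched on toy data as follows. This is a minimal illustration, not the notebook's exact rule: the keyword list `SOLVED`, the helper `turns_to_solve`, and both conversations are hypothetical. A conversation counts as solved when its last user message contains a confirmation keyword, and its TTS is then the number of messages.

```python
import re

# Hypothetical keyword list; word boundaries avoid matching "ok" inside "book"
SOLVED = re.compile(r"\b(?:thanks|thank you|works|solved|fixed|ok)\b", re.IGNORECASE)

def turns_to_solve(conversation):
    """Return the message count if the last user message looks like a
    confirmation, else None (the conversation is not counted as solved)."""
    last_user = next(
        (m["content"] for m in reversed(conversation) if m["role"] == "user"), ""
    )
    return len(conversation) if SOLVED.search(last_user) else None

solved_chat = [
    {"role": "user", "content": "My loop never ends."},
    {"role": "assistant", "content": "Increment the counter inside the loop."},
    {"role": "user", "content": "Thanks, that works!"},
]
open_chat = [
    {"role": "user", "content": "Recommend a book on SQL."},
    {"role": "assistant", "content": "Which dialect do you use?"},
]

print(turns_to_solve(solved_chat))
print(turns_to_solve(open_chat))
```

Because the proxy relies on keywords only, it misses users who leave without saying anything and can misfire on polite but unsatisfied replies, so TTS is best read as a rough signal.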

In [None]:
# Imports & Setup
import os, re, warnings
from math import sqrt

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datasets import load_dataset

warnings.filterwarnings(
    "ignore",
    message="The default of observed=False is deprecated and will be changed to True in a future version of pandas",
    category=FutureWarning
)

sns.set(style="darkgrid")
plt.rcParams["figure.figsize"] = (8, 4.8)

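# Regex proxy for "solved": confirmation/gratitude keywords in the last user message.
# Word boundaries prevent substring hits such as "ok" in "book" or "yes" in "eyes".
OK_PAT = re.compile(
    r"\b(?:thanks|thank you|terima kasih|berhasil|works|solved|mantap|fixed|oke+|ok|done|clear|yes|sip|resolved|great|perfect)\b",
    re.IGNORECASE,
)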

# Topic categorization rules (keyword regex per category, English + Indonesian)
TOPIC_RULES = {
    "Coding": re.compile(r"\b(code|coding|bug|function|class|method|api|regex|python|javascript|java|ts|typescript|cpp|golang|php|html|css|framework|compile|error)\b", re.I),
    "Analisis Data": re.compile(r"\b(data|dataset|pandas|numpy|stat(istik|s)?|regression|cluster|model(ing)?|visualisasi|plot|chart|csv|etl|eda)\b", re.I),
    "Terjemahan": re.compile(r"\b(translate|translat(e|ion)|terjemah|alih ?bahasa|english to indonesian|indonesian to english|b\.?inggris|b\.?indonesia)\b", re.I),
    "Penulisan": re.compile(r"\b(tulis|menulis|writing|essay|artikel|copy|caption|paragraf|ringkas|rangkuman|summary|email|surat|konten)\b", re.I),
}

# Lightweight stopword list (English + Indonesian) for n-grams
STOP = {
    "the","and","for","with","that","this","from","your","have","you","will","just","does","did","can","could",
    "would","there","here","into","them","then","than","what","when","where","which","some","about","like",
    "been","were","they","their","ours","ourselves",
    "kami","kita","kamu","anda","yang","dengan","untuk","atau","dari","pada","dalam","akan","saya","dia",
    "itu","ini","bisa","tidak","iya","dan","atau","jadi","agar","karena","kalau","sehingga"
}


In [None]:
def normalize_model_name(m: str) -> str:
    if m is None: return ""
    m = str(m).strip()
    m = m.replace(" - ", "-")
    m = re.sub(r"\s+", " ", m)
    return m

def _user_text_from_conv(conv):
    if not isinstance(conv, (list, tuple)): return ""
    parts = []
    for msg in conv:
        if isinstance(msg, dict) and msg.get("role") == "user":
            parts.append((msg.get("content") or "").strip())
    return " ".join(parts)

def is_solved_from_conv(conv) -> bool:
    """Proxy for 'solved': True if the last user message matches OK_PAT."""
    if not isinstance(conv, (list, tuple)): return False
    for msg in reversed(conv):
        if isinstance(msg, dict) and msg.get("role") == "user":
            return bool(OK_PAT.search((msg.get("content") or "").lower()))
    return False

def topic_category_from_text(text: str) -> str:
    if not isinstance(text, str): return "Lainnya"
    for label in ["Coding", "Analisis Data", "Terjemahan", "Penulisan"]:
        if TOPIC_RULES[label].search(text):
            return label
    return "Lainnya"

def add_derived_columns(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["model_norm"] = df["model"].apply(normalize_model_name)
    df["user_text"] = df["conversation"].apply(_user_text_from_conv)
    df["is_solved"] = df["conversation"].apply(is_solved_from_conv)
    df["topic_category"] = df["user_text"].apply(topic_category_from_text)
    df["turn"] = df["conversation"].apply(lambda conv: len(conv) if isinstance(conv, (list, tuple)) else np.nan)
    return df

def wilson_ci(k: float, n: float, z: float = 1.96):
    """Wilson score interval (lo, hi) for k successes out of n trials; z=1.96 gives ~95%."""
    if n <= 0: return (0.0, 0.0)
    p = k / n
    denom = 1 + z*z/n
    centre = p + z*z/(2*n)
    adj = z * sqrt((p*(1-p) + z*z/(4*n))/n)
    lo = (centre - adj)/denom
    hi = (centre + adj)/denom
    return lo, hi


In [None]:
# Load data (auto-detect schema + local cache)
SAMPLE_ROWS = 20000
LOCAL_CACHE = "data/arena55k_sample.parquet"
os.makedirs(os.path.dirname(LOCAL_CACHE), exist_ok=True)

# 1) Read the local cache if it exists
if os.path.exists(LOCAL_CACHE):
    try:
        df_raw = pd.read_parquet(LOCAL_CACHE)
        print(f"Using local cache: {LOCAL_CACHE} (rows={len(df_raw):,})")
    except Exception as e:
        print("Failed to read local cache:", e)
        df_raw = None
else:
    df_raw = None

# 2) Download if there is no usable cache
if df_raw is None:
    ds = load_dataset("lmsys/lmsys-arena-human-preference-55k", split="train")
    df_raw = pd.DataFrame(ds)
    df_save = df_raw.sample(SAMPLE_ROWS, random_state=42).reset_index(drop=True) if (SAMPLE_ROWS and SAMPLE_ROWS < len(df_raw)) else df_raw
    df_save.to_parquet(LOCAL_CACHE, index=False)
    print(f"Downloaded and cached to {LOCAL_CACHE} (rows={len(df_save):,})")

# 3) Subsample for the notebook analysis
if SAMPLE_ROWS and SAMPLE_ROWS < len(df_raw):
    df_raw = df_raw.sample(SAMPLE_ROWS, random_state=42).reset_index(drop=True)

# 4) Detect the schema
has_conv = {"model_a","model_b","conversation_a","conversation_b"}.issubset(df_raw.columns)
has_pair = {"model_a","model_b","prompt","response_a","response_b","winner_model_a","winner_model_b","winner_tie"}.issubset(df_raw.columns)

if has_conv:
    SCHEMA = "conversation"
elif has_pair:
    SCHEMA = "pairwise"
else:
    raise ValueError(f"Unrecognized dataset schema. Available columns: {sorted(df_raw.columns.tolist())[:40]} ...")

print("Detected schema:", SCHEMA)


In [None]:
# Normalize to long format + win-rate (Wilson CI)
if SCHEMA == "conversation":
    df_a = df_raw[["model_a", "conversation_a"]].rename(columns={"model_a":"model","conversation_a":"conversation"})
    df_b = df_raw[["model_b", "conversation_b"]].rename(columns={"model_b":"model","conversation_b":"conversation"})
    df_long = pd.concat([df_a, df_b], ignore_index=True).dropna(subset=["model","conversation"])

    if "winner_model" in df_raw.columns:
        wins = df_raw["winner_model"].value_counts()
    elif "winner" in df_raw.columns:
        wins_a = df_raw.loc[df_raw["winner"]=="model_a","model_a"].value_counts()
        wins_b = df_raw.loc[df_raw["winner"]=="model_b","model_b"].value_counts()
        wins = wins_a.add(wins_b, fill_value=0)
    else:
        wins = pd.Series(dtype=float)
    apps = df_raw["model_a"].value_counts().add(df_raw["model_b"].value_counts(), fill_value=0)

elif SCHEMA == "pairwise":
    df_a = df_raw[["model_a","prompt","response_a","winner_model_a","winner_tie"]].copy()
    df_b = df_raw[["model_b","prompt","response_b","winner_model_b","winner_tie"]].copy()
    df_a.rename(columns={"model_a":"model","response_a":"response","winner_model_a":"won"}, inplace=True)
    df_b.rename(columns={"model_b":"model","response_b":"response","winner_model_b":"won"}, inplace=True)

    df_a["conversation"] = df_a.apply(lambda r: [{"role":"user","content":r["prompt"]},{"role":"assistant","content":r["response"]}], axis=1)
    df_b["conversation"] = df_b.apply(lambda r: [{"role":"user","content":r["prompt"]},{"role":"assistant","content":r["response"]}], axis=1)

    df_a.loc[df_a["winner_tie"]==1, "won"] = 0
    df_b.loc[df_b["winner_tie"]==1, "won"] = 0

    df_long = pd.concat([df_a[["model","conversation","won"]], df_b[["model","conversation","won"]]], ignore_index=True).dropna(subset=["model","conversation"])
    wins = df_long.groupby("model")["won"].sum(min_count=1)
    apps = df_long["model"].value_counts()

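# Normalize the index; re-aggregate in case normalization merges two raw names into one
wins.index = wins.index.astype(str).map(normalize_model_name)
apps.index = apps.index.astype(str).map(normalize_model_name)
wins = wins.groupby(level=0).sum()
apps = apps.groupby(level=0).sum()

# Derived columns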
df_long = add_derived_columns(df_long)

if SCHEMA == "pairwise":
    # Override is_solved based on 'won'; each row is one prompt + one response, so turn=2
    if "won" in df_long.columns:
        df_long["is_solved"] = df_long["won"].fillna(0).astype(int) == 1
    df_long["turn"] = 2

# Win-rate + Wilson CI
win_rate = (wins / apps).dropna().sort_values(ascending=False)
wr_df = pd.DataFrame({"wins": wins, "apps": apps}).fillna(0)
wr_df["win_rate"] = wr_df.apply(lambda r: (r["wins"]/r["apps"]) if r["apps"]>0 else np.nan, axis=1)
wr_df[["wr_lo","wr_hi"]] = wr_df.apply(lambda r: pd.Series(wilson_ci(r["wins"], r["apps"])), axis=1)
wr_df.index.name = "model_norm"

df_long.head()
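
The wins/appearances bookkeeping for the pairwise schema can be checked by hand on a toy example. The sketch below is plain Python rather than the pandas code above, and the model names and outcomes are made up; only the column names mirror the pairwise columns used in this notebook.

```python
from collections import Counter

# Toy battles in the pairwise layout; model names and outcomes are made up
battles = [
    {"model_a": "alpha", "model_b": "beta",  "winner_model_a": 1, "winner_model_b": 0, "winner_tie": 0},
    {"model_a": "alpha", "model_b": "gamma", "winner_model_a": 0, "winner_model_b": 1, "winner_tie": 0},
    {"model_a": "beta",  "model_b": "gamma", "winner_model_a": 0, "winner_model_b": 0, "winner_tie": 1},
    {"model_a": "gamma", "model_b": "alpha", "winner_model_a": 0, "winner_model_b": 1, "winner_tie": 0},
]

wins, apps = Counter(), Counter()
for b in battles:
    for side in ("a", "b"):
        model = b[f"model_{side}"]
        apps[model] += 1                          # one appearance per side per battle
        wins[model] += b[f"winner_model_{side}"]  # a tie adds 0 to both sides

win_rate = {m: wins[m] / apps[m] for m in apps}
for m in sorted(win_rate, key=win_rate.get, reverse=True):
    print(f"{m}: {wins[m]}/{apps[m]} = {win_rate[m]:.2f}")
```

A tie counts as an appearance but not as a win, so it lowers the win-rate of both models, and the total number of appearances is twice the number of battles.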


In [None]:
# 1) Model popularity
top_n_pop = 12
order = df_long["model_norm"].value_counts().head(top_n_pop).index
df_pop = df_long[df_long["model_norm"].isin(order)]

ax = sns.countplot(data=df_pop, x="model_norm", order=order, palette="cividis")
ax.set_title("Model Popularity (by number of conversations)")
ax.set_xlabel("Model"); ax.set_ylabel("Number of Conversations")
plt.xticks(rotation=25, ha="right")
plt.show()

print(f"Effective N: {len(df_pop):,} conversations | Models shown: {len(order)}")


In [None]:
# 2) Main topics (n-grams: unigram + bigram)
from collections import Counter

ngram_counts = Counter()
for text in df_long["user_text"].astype(str).str.lower():
    toks = [w for w in re.findall(r"[a-z]{3,}", text) if w not in STOP]
    ngram_counts.update(toks)
    # Build bigrams per row so that a pair never spans two different conversations
    ngram_counts.update(" ".join(pair) for pair in zip(toks, toks[1:]))
freq = pd.Series(dict(ngram_counts.most_common(20)), dtype="int64")

ax = sns.barplot(x=freq.values, y=freq.index, palette="cividis")
ax.set_title("Top 20 N-grams from User Messages")
ax.set_xlabel("Frequency"); ax.set_ylabel("N-gram")
plt.show()

print(f"Effective N: {len(df_long):,} conversations | N-grams shown: {len(freq)}")


In [None]:
# 3) Win-Rate (Wilson 95% CI)
wr_view = wr_df.sort_values("win_rate", ascending=False).copy()
fig, ax = plt.subplots(figsize=(10,5))
x = np.arange(len(wr_view))
ax.errorbar(
    x, wr_view["win_rate"].values,
    yerr=[wr_view["win_rate"].values - wr_view["wr_lo"].values,
          wr_view["wr_hi"].values - wr_view["win_rate"].values],
    fmt="o", capsize=3
)
ax.set_xticks(x)
ax.set_xticklabels(wr_view.index, rotation=25, ha="right")
ax.set_ylabel("Win‑Rate")
ax.set_title("Win‑Rate per Model (error bars = Wilson 95% CI)")
plt.show()

print(f"Total model appearances (2 per battle): {int(wr_view['apps'].sum()):,}")

# (Optional) scatter of Win-Rate vs Avg Turns (relevant for the 'conversation' schema)
if 'turn' in df_long.columns and SCHEMA == 'conversation':
    avg_turns = df_long.groupby("model_norm", observed=True)["turn"].mean()
    comp_idx = wr_view.index.intersection(avg_turns.index)
    comp = pd.DataFrame({"Win-Rate": wr_view.loc[comp_idx, "win_rate"], "Avg Turns": avg_turns.loc[comp_idx]}).dropna()
    if not comp.empty:
        ax = sns.scatterplot(data=comp, x="Avg Turns", y="Win-Rate", s=80)
        for model_name, row in comp.iterrows():
            ax.text(row["Avg Turns"], row["Win-Rate"], model_name, fontsize=8)
        ax.set_title("Win‑Rate vs Avg Turns")
        plt.show()


In [None]:
# 4) TTS (statistics + visuals)
def compute_tts(df_in: pd.DataFrame, min_turn: int = 3, schema: str = "conversation") -> pd.DataFrame:
    if df_in.empty:
        return pd.DataFrame(columns=["n_solved","mean","median","p75"])
    eff_min = 2 if schema == "pairwise" else min_turn
    df_use = df_in[(df_in["is_solved"]) & (df_in["turn"].fillna(0) >= eff_min)]
    if df_use.empty:
        return pd.DataFrame(columns=["n_solved","mean","median","p75"])
    tts = (
        df_use.groupby("model_norm", observed=True)["turn"]
        .agg(n_solved="count", mean="mean", median="median", p75=lambda s: s.quantile(0.75))
        .sort_values("median")
    )
    return tts

def get_tts_samples(df_in: pd.DataFrame, min_turn: int, schema: str) -> pd.DataFrame:
    if df_in.empty:
        return pd.DataFrame(columns=["model_norm","turn"])
    eff_min = 2 if schema == "pairwise" else min_turn
    return df_in[(df_in["is_solved"]) & (df_in["turn"].fillna(0) >= eff_min)][["model_norm","turn"]].copy()

MIN_TURN = 3
tts_stats = compute_tts(df_long, min_turn=MIN_TURN, schema=SCHEMA)
display(tts_stats.round(2))

if not tts_stats.empty:
    order = tts_stats.index.tolist()
    ax = sns.barplot(x=tts_stats.index, y=tts_stats["median"], order=order, palette="cividis")
    ax.set_xlabel(""); ax.set_ylabel("Median TTS (turn)")
    ax.set_title("Median TTS per Model (lower is better)")
    plt.xticks(rotation=20, ha="right")
    for i, model_name in enumerate(order):
        n = int(tts_stats.loc[model_name, "n_solved"])
        ax.text(i, float(tts_stats.loc[model_name, "median"]) + 0.1, f"n={n}", ha="center", va="bottom", fontsize=9)
    plt.show()

eff_min = 2 if SCHEMA == "pairwise" else MIN_TURN
tts_samples = get_tts_samples(df_long, MIN_TURN, SCHEMA)

if SCHEMA == "pairwise":
    print("Note: the pairwise schema has only 1 reply per model → TTS ≈ 2 turns (not very informative).")
    solved_counts = tts_stats["n_solved"].sort_values(ascending=False) if not tts_stats.empty else pd.Series(dtype=int)
    if not solved_counts.empty:
        ax = sns.barplot(x=solved_counts.index, y=solved_counts.values, palette="cividis")
        ax.set_xlabel(""); ax.set_ylabel("Number of Solved Conversations")
        ax.set_title("Solved Volume per Model")
        plt.xticks(rotation=20, ha="right")
        plt.show()
else:
    if not tts_samples.empty:
        max_turns = int(np.nanmax(tts_samples["turn"])) if not tts_samples["turn"].isna().all() else eff_min
        bins = range(1, max(5, max_turns) + 2)
        plt.hist(tts_samples["turn"].dropna(), bins=bins)
        plt.xlabel("Number of Turns"); plt.ylabel("Frequency")
        plt.title("TTS Histogram (Solved Conversations)")
        plt.show()

        TOPK = 10
        top_models_tts = tts_samples["model_norm"].value_counts().head(TOPK).index
        df_box = tts_samples[tts_samples["model_norm"].isin(top_models_tts)]
        if not df_box.empty:
            ax = sns.boxplot(data=df_box, x="model_norm", y="turn", order=top_models_tts, palette="cividis")
            ax.set_xlabel(""); ax.set_ylabel("TTS (turn)")
            ax.set_title("TTS Distribution per Model (Top-K by n_solved)")
            plt.xticks(rotation=20, ha="right")
            plt.show()
    else:
        print("No TTS samples meet the criteria for the distribution plots.")


In [None]:
# 5) Fit-for-Purpose (heatmap Model × Topic + leaders)
TOP_N_HEAT = 8
top_models = df_long["model_norm"].value_counts().head(TOP_N_HEAT).index

perf = (
    df_long[df_long["model_norm"].isin(top_models)]
    .groupby(["topic_category","model_norm"], observed=True)
    .agg(n=("model_norm","size"), solved_rate=("is_solved","mean"))
    .reset_index()
)
# Leave missing model/topic combinations as NaN so they render as blank cells, not as a 0% solved rate
heat = perf.pivot(index="topic_category", columns="model_norm", values="solved_rate")

if heat.empty:
    print("Not enough data for the heatmap.")
else:
    ax = sns.heatmap(heat, cmap="cividis", vmin=0, vmax=1, annot=True, fmt=".0%")
    ax.set_xlabel("Model"); ax.set_ylabel("Topic Category")
    ax.set_title("Solved Rate (Proxy): Model × Topic")
    plt.show()

MIN_N = 30
leaders = (
    perf[perf["n"] >= MIN_N]
    .sort_values(["topic_category","solved_rate"], ascending=[True, False])
    .groupby("topic_category", observed=True)
    .head(1).reset_index(drop=True)
)

leaders_show = leaders.assign(solved_rate=(leaders["solved_rate"]*100).round(1)) \
    .rename(columns={"topic_category":"Topic","model_norm":"Model","n":"N","solved_rate":"Solved Rate (%)"}) \
    [["Topic","Model","N","Solved Rate (%)"]]
leaders_show


## ✅ Conclusions

1. **Model Popularity**: A handful of models account for most conversations, which suggests strong initial preference or brand awareness.  
2. **Main Topics**: The top n-grams point to *Coding*, *Writing*, and *Data Analysis* as the main use cases.  
3. **Win-Rate**: Win-rates differ visibly between models; read them together with the **Wilson 95% CI**, because models with few appearances have wide intervals.  
4. **TTS (Efficiency)**: Some models reach a "solved" conversation in fewer turns; use the **median TTS** to compare efficiency. Under the `pairwise` schema every row has 2 turns, so TTS is only informative under the `conversation` schema.  
5. **Fit-for-Purpose**: The per-topic leader differs by category, which supports **automatic routing** and **product bundling** strategies (e.g. coding vs. writing).

### Narrative Summary
Overall, the analysis shows LLM usage concentrated in three main clusters (coding, writing, data analysis), with a handful of popular models dominating. Win-Rate maps user preference but needs to be read together with its confidence interval. TTS reveals relative efficiency; some models complete tasks in fewer turns. The Fit-for-Purpose heatmap confirms that each model's advantage is contextual to the topic, which opens opportunities for automatic routing, product bundling, and more precise cost/SLA management.
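
As one way to picture the routing idea, the sketch below maps a prompt to a model through per-topic leaders. Everything in it is hypothetical: the model names are placeholders, the `LEADERS` mapping stands in for whatever the leaders table yields on real data, and the keyword rules are a shortened version of the notebook's topic rules.

```python
import re

# Hypothetical per-topic leaders; model names are placeholders, not results
LEADERS = {"Coding": "model-x", "Penulisan": "model-y"}
DEFAULT_MODEL = "model-z"  # assumed fallback when no topic rule matches

# Simplified keyword rules in priority order (first match wins)
RULES = [
    ("Coding", re.compile(r"\b(code|bug|python|function|error)\b", re.I)),
    ("Penulisan", re.compile(r"\b(essay|email|summary|artikel|tulis)\b", re.I)),
]

def route(prompt: str) -> str:
    """Pick a model for a prompt by its first matching topic rule."""
    for topic, pattern in RULES:
        if pattern.search(prompt):
            return LEADERS.get(topic, DEFAULT_MODEL)
    return DEFAULT_MODEL

print(route("Fix this Python bug for me"))
print(route("Write an email to my landlord"))
print(route("What is the capital of Peru?"))
```

A keyword router like this is only as good as its rules and the sample size behind each leader, so a minimum-N threshold and a sensible fallback model matter more than the lookup itself.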

In [None]:
# (Optional) Save summary artifacts
os.makedirs("artifacts", exist_ok=True)
df_long.to_parquet("artifacts/df_long.parquet", index=False)
wr_df.to_csv("artifacts/win_rate_wilson.csv")
try:
    tts_stats.to_csv("artifacts/tts_stats.csv")
except NameError:
    pass
try:
    perf.to_csv("artifacts/fit_for_purpose_perf.csv", index=False)
    leaders_show.to_csv("artifacts/fit_for_purpose_leaders.csv", index=False)
except NameError:
    pass
print("Artifacts saved to artifacts/")
