# 🤖📊 Capstone: Public Data Analytics with Replicate (SMS Spam)
**Goal:** Demonstrate an *end‑to‑end* workflow for analyzing public data with the help of an LLM served through the **Replicate API**.  
Dataset: **SMS Spam Collection (5000 rows)**, with columns `label` (`ham`/`spam`) and `message` (text).

### Final Output
- `predictions.csv`: the data plus the zero‑shot labels (written only if the zero‑shot step was run)
- `report.md`: *Analytical Result*, *Insight & Findings*, *Recommendations* (Markdown)

> Run the cells **in order, from top to bottom**.

## Table of Contents
1. [Prerequisites & Token](#1-Prasyarat--Token)
2. [Project Configuration](#2-Konfigurasi-Proyek)
3. [Load Dataset](#3-Load-Dataset)
4. [Quick EDA](#4-EDA-Singkat)
5. [Text Preprocessing](#5-Preprocessing-Teks)
6. [Zero‑Shot Classification (LLM)](#6-ZeroShot-Classification-LLM)
7. [Supervised Baseline (TF‑IDF + LR)](#7-Baseline-Supervised-TFIDF--LR)
8. [Generate Insights & Recommendations](#8-Generate-Insight--Rekomendasi)
9. [Save Output](#9-Simpan-Output)
10. [Notes & Tips](#10-Catatan--Tips)

## 1) Prerequisites & Token

In [None]:
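# Install dependencies
!pip -q install replicate pandas numpy scikit-learn matplotlib tqdm

# Set and check the Replicate token
# >>> REPLACE the 'xxxxxxxx' placeholder with your real token, or use: %env REPLICATE_API_TOKEN=... <<<
import os
PLACEHOLDER_TOKEN = "xxxxxxxxxxxxxxxx"
os.environ.setdefault("REPLICATE_API_TOKEN", PLACEHOLDER_TOKEN)

# The placeholder is a non-empty string, so test for it explicitly instead of relying on bool() alone
token = os.getenv("REPLICATE_API_TOKEN", "")
print("Token detected:", bool(token) and token != PLACEHOLDER_TOKEN)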

## 2) Project Configuration

In [None]:
# Common imports & configuration
import os, json, re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm

plt.rcParams.update({"figure.figsize": (10,5), "axes.grid": True})

# Replicate LLM to use
MODEL_ID = "meta/meta-llama-3-8b-instruct"   # change to a model available to your account

# Label schema for zero-shot classification
LABELS = ["spam", "ham"]

# Sampling & batching (keeps cost/token usage under control)
SAMPLE_N = 200     # number of texts sent to zero-shot classification
BATCH    = 25      # number of texts per LLM call

print("Configuration ready. MODEL_ID:", MODEL_ID, "| LABELS:", LABELS)
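
As a quick cost check, the number of zero‑shot calls follows from `SAMPLE_N` and `BATCH` by ceiling division. The sketch below is only an illustration; `estimate_calls` is a hypothetical helper that does not exist elsewhere in this notebook.

```python
import math

def estimate_calls(sample_n: int, batch: int) -> int:
    """Number of LLM calls needed to classify sample_n texts in batches of size batch."""
    if sample_n < 0 or batch <= 0:
        raise ValueError("sample_n must be >= 0 and batch must be > 0")
    return math.ceil(sample_n / batch)

print(estimate_calls(200, 25))  # 8
print(estimate_calls(200, 30))  # 7 (the last batch holds only 20 texts)
```

With the values above this gives 8 classification calls, plus one more for the report in section 8. Because `replicate_generate` tries up to three input keys when a call fails, the real number of requests can be higher.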

## 3) Load Dataset

Upload `sms_spam_full.csv` (5000 rows) to the **Files** panel in Colab, then run the cell below.  
If you do not have the file yet, download it from the chat, or change the file name to match where your copy is stored.

In [None]:
DATA_FILE = "sms_spam_full.csv"  # change this if your file has a different name

# Load with a simple existence check
try:
    df = pd.read_csv(DATA_FILE)
except FileNotFoundError as e:
    raise FileNotFoundError(f"File '{DATA_FILE}' not found. Make sure it has been uploaded to Colab Files.") from e

print("Shape:", df.shape)
display(df.head(5))

## 4) Quick EDA

In [None]:
print("Dataset info:")
df.info()

print("\nMissing values per column:")
display(df.isna().sum().to_frame("n_missing").sort_values("n_missing", ascending=False))

print("\nLabel distribution:")
if "label" in df.columns:
    display(df["label"].value_counts())
else:
    print("Column 'label' is missing (fine if you only want zero-shot classification).")

print("\n5 sample rows:")
display(df.sample(5, random_state=42))
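
Beyond raw label counts, class share and message length per label are two further quick checks. The sketch below uses only the standard library so it runs without the CSV file; the `rows` list is made-up toy data standing in for the `label` and `message` columns, not an excerpt from the dataset.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical toy rows standing in for df[["label", "message"]]
rows = [
    ("ham", "Ok lar... see you at 5"),
    ("ham", "I'll call you later"),
    ("spam", "WINNER!! Claim your free prize now, reply YES to 12345"),
    ("ham", "Are we still on for lunch?"),
]

# Share of each class: a strong imbalance affects how accuracy should be read
counts = Counter(label for label, _ in rows)
share = {label: n / len(rows) for label, n in counts.items()}

# Mean message length (in characters) per class
lengths = defaultdict(list)
for label, message in rows:
    lengths[label].append(len(message))
mean_len = {label: mean(values) for label, values in lengths.items()}

print(share)
print(mean_len)
```

On the real DataFrame, `df["label"].value_counts(normalize=True)` and `df["message"].str.len().groupby(df["label"]).mean()` should give the same two summaries.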

## 5) Text Preprocessing

In [None]:
def clean_text(s: str) -> str:
    """Lowercase the text, strip URLs and symbols, and collapse whitespace."""
    if not isinstance(s, str):
        if pd.isna(s):
            return ""   # keep missing values from turning into the literal text "nan"
        s = str(s)
    s = s.lower()
    s = re.sub(r"http\S+|www\S+", " ", s)        # remove URLs
    s = re.sub(r"[^0-9a-zA-Z\s]+", " ", s)        # remove symbols
    s = re.sub(r"\s+", " ", s).strip()            # normalize whitespace
    return s

if "message" not in df.columns:
    raise KeyError("Column 'message' not found. Make sure the dataset has a 'message' column.")

df["_text_clean"] = df["message"].apply(clean_text)
display(df[["message","_text_clean"]].head(5))

# Word frequency plot (optional)
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(max_features=2000, stop_words="english")
X_counts = vec.fit_transform(df["_text_clean"].fillna(""))
word_sums = np.asarray(X_counts.sum(axis=0)).ravel()
vocab = np.array(vec.get_feature_names_out())

top_n = 20
idx = word_sums.argsort()[::-1][:top_n]
top_words = vocab[idx]
top_freqs = word_sums[idx]

plt.bar(range(len(top_words)), top_freqs)
plt.xticks(range(len(top_words)), top_words, rotation=80)
plt.title("Top Word Frequencies (Cleaned)")
plt.tight_layout()
plt.show()
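
To make the effect of `clean_text` concrete, the sketch below applies the same three regex steps to two made-up messages. The function body mirrors the cell above for string input; the example messages are illustrative and not taken from the dataset.

```python
import re

def clean_text(s: str) -> str:
    """Same steps as the notebook cell: lowercase, drop URLs, drop symbols, collapse spaces."""
    s = str(s).lower()
    s = re.sub(r"http\S+|www\S+", " ", s)      # remove URLs
    s = re.sub(r"[^0-9a-zA-Z\s]+", " ", s)     # remove symbols (including non-ASCII such as the pound sign)
    s = re.sub(r"\s+", " ", s).strip()         # normalize whitespace
    return s

print(clean_text("WINNER!! Claim £1000 at http://x.co now"))  # winner claim 1000 at now
print(clean_text("Call 0800-123 4567!!"))                     # call 0800 123 4567
```

Note the trade-off: capitalization, currency symbols, repeated punctuation, and the presence of a URL are all removed, although they may carry useful signal for spam detection. Comparing results with and without this step would be a reasonable experiment.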

## 6) Zero‑Shot Classification (LLM)

**Brief explanation:**  
- We ask the LLM to classify each text without any examples (*zero‑shot*) into one of the labels in `LABELS` (`spam`/`ham`).  
- The prompt requests *JSON-only* output so the result is easy to parse.

In [None]:
import replicate

REPLICATE_API_TOKEN = os.getenv("REPLICATE_API_TOKEN", "")
# An all-"x" value is the placeholder from section 1, not a real token
if not REPLICATE_API_TOKEN or set(REPLICATE_API_TOKEN) == {"x"}:
    raise EnvironmentError("REPLICATE_API_TOKEN is not set. See section 1 (token setup).")

def replicate_generate(prompt: str, model_id: str = MODEL_ID, **extra):
    """Call a text-generation model on Replicate.
    Tries several common input keys in turn: 'prompt', 'input', 'text'.
    Returns the output joined into a single string.
    """
    try_inputs = [
        {"prompt": prompt, **extra},
        {"input": prompt,  **extra},
        {"text":  prompt,  **extra},
    ]
    last_err = None
    for payload in try_inputs:
        try:
            out = replicate.run(model_id, input=payload)
            if isinstance(out, str):
                return out
            # Check dicts before the generic iterable case: joining a dict would return only its keys
            if isinstance(out, dict):
                return json.dumps(out, ensure_ascii=False)
            if hasattr(out, "__iter__"):
                return "".join([str(x) for x in out])
            return json.dumps(out, ensure_ascii=False)
        except Exception as e:
            last_err = e
    raise RuntimeError(f"Replicate error: {last_err}") from last_err

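# Take the first SAMPLE_N non-empty, unique texts to limit token usage (this is not a random sample)
texts = df["_text_clean"].dropna().astype(str)
texts = texts[texts.str.len() > 0].drop_duplicates().tolist()[:SAMPLE_N]

def zero_shot_batch(batch_texts, labels):
    # Named system_msg rather than sys to avoid shadowing the standard-library module name
    system_msg = (
        "You are a strict JSON-only classifier. "
        f"Classify each line into exactly one of: {labels}. "
        "Return ONLY a valid JSON array of objects with keys: 'text' and 'label'. No explanations."
    )
    numbered = "\n".join([f"{i+1}. {t}" for i, t in enumerate(batch_texts)])
    prompt = f"""{system_msg}

Lines:
{numbered}
"""
    raw = replicate_generate(prompt)
    # Parse the JSON; if the model added extra text, fall back to the outermost [...] span
    data = None
    try:
        data = json.loads(raw)
    except Exception:
        m = re.search(r"(\[.*\])", raw, flags=re.S)
        if m:
            try:
                data = json.loads(m.group(1))
            except Exception:
                data = None   # the extracted span was not valid JSON either
    if not isinstance(data, list):
        # Fallback: return a minimal structure so the pipeline keeps running
        data = [{"text": t, "label": None, "raw": raw} for t in batch_texts]
    return data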

# Run batch by batch
zs_rows = []
for i in tqdm(range(0, len(texts), BATCH)):
    part = texts[i:i+BATCH]
    zs_rows.extend(zero_shot_batch(part, LABELS))

zs_df = pd.DataFrame(zs_rows)
display(zs_df.head())

# Merge back into df
if not zs_df.empty and {"text", "label"} <= set(zs_df.columns):
    zs_map = (
        zs_df[["text","label"]]
        .rename(columns={"text":"_text_clean","label":"_label_zs"})
        .drop_duplicates(subset="_text_clean")   # one label per text, so the merge cannot duplicate rows
    )
    # Drop any earlier result so re-running this cell does not create _label_zs_x / _label_zs_y
    df = df.drop(columns=["_label_zs"], errors="ignore").merge(zs_map, on="_text_clean", how="left")
    print("Label distribution (zero‑shot):")
    display(df["_label_zs"].value_counts(dropna=False))
else:
    print("No zero‑shot results available to merge.")
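
The merge above joins on the text that the model echoes back, so any row whose text the model altered ends up without a label. One possible alternative, sketched here under the assumption that the model returns exactly one item per input line in the original order, is to ignore the echoed text and map labels back by position. `parse_labels` is a hypothetical helper, and the raw strings are invented examples of model output.

```python
import json
import re

def parse_labels(raw: str, batch_texts: list, labels: list) -> list:
    """Return one {'text', 'label'} dict per input text; label is None when unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # The model may wrap the array in extra prose: try the outermost [...] span
        match = re.search(r"\[.*\]", raw, flags=re.S)
        try:
            data = json.loads(match.group(0)) if match else None
        except json.JSONDecodeError:
            data = None
    # Trust the reply only if it is a list with exactly one item per input line
    if not isinstance(data, list) or len(data) != len(batch_texts):
        return [{"text": t, "label": None} for t in batch_texts]
    allowed = {label.lower() for label in labels}
    out = []
    for text, item in zip(batch_texts, data):
        label = item.get("label") if isinstance(item, dict) else None
        label = label.strip().lower() if isinstance(label, str) else None
        # Keep the original text and reject labels outside the allowed set
        out.append({"text": text, "label": label if label in allowed else None})
    return out

raw_ok = 'Sure! Here is the result:\n[{"text": "free prize", "label": "SPAM"}, {"text": "see u", "label": "ham"}]'
print(parse_labels(raw_ok, ["free prize now", "see you"], ["spam", "ham"]))
print(parse_labels("I cannot comply.", ["free prize now"], ["spam", "ham"]))
```

The first call recovers both labels even though the reply has a prose prefix, an upper-case label, and altered text. The second falls back to `None`, which makes failed batches easy to count and retry.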

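## 7) Supervised Baseline (TF‑IDF + LR)

**Goal:** Build a quick, interpretable baseline.  
- Features: **TF‑IDF (1–2 grams)**  
- Model: **Logistic Regression**  
- Target: the zero‑shot labels (`_label_zs`) when available, otherwise the original labels (`label`). When the target is `_label_zs`, the scores below measure agreement with the LLM's labels, not accuracy against the ground truth.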

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

target_series = None
if "_label_zs" in df.columns and df["_label_zs"].notna().sum() > 0:
    target_series = df["_label_zs"]
elif "label" in df.columns:
    target_series = df["label"]

if target_series is None:
    print("Skipped: no target label yet (zero‑shot or original).")
else:
    # Build one mask on df's own index so the boolean filter always aligns
    mask = df["_text_clean"].notna() & target_series.notna()
    sup = df.loc[mask].copy()
    y = target_series.loc[sup.index].values
    # Note: stratify requires at least two rows per class
    X_train, X_test, y_train, y_test = train_test_split(
        sup["_text_clean"].values, y,
        test_size=0.2, random_state=42, stratify=y
    )

    vec = TfidfVectorizer(max_features=3000, ngram_range=(1,2))
    Xtr = vec.fit_transform(X_train)
    Xte = vec.transform(X_test)

    clf = LogisticRegression(max_iter=200)
    clf.fit(Xtr, y_train)
    pred = clf.predict(Xte)

    print(classification_report(y_test, pred))

    cm = confusion_matrix(y_test, pred, labels=sorted(list(set(y_test))))
    plt.imshow(cm, aspect="auto")
    plt.title("Confusion Matrix")
    plt.ylabel("True")
    plt.xlabel("Pred")
    plt.colorbar()
    plt.xticks(range(len(set(y_test))), sorted(list(set(y_test))), rotation=45)
    plt.yticks(range(len(set(y_test))), sorted(list(set(y_test))))
    plt.tight_layout()
    plt.show()
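
Because the baseline may be trained on zero‑shot labels, it can help to first check how well those labels agree with the original `label` column. The sketch below is a minimal standard-library version; `agreement_report` is a hypothetical helper and the two lists are toy data. Missing predictions are represented as `None` here, whereas in the DataFrame they would be `NaN` and would need to be filtered with `notna()` first.

```python
def agreement_report(y_true, y_pred, positive="spam"):
    """Accuracy plus precision/recall for the positive class, skipping unlabeled rows."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if p is not None]
    if not pairs:
        raise ValueError("no labeled predictions to compare")
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    return {
        "n": len(pairs),
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

y_true = ["ham", "spam", "ham", "spam", "ham"]   # original labels (toy data)
y_zs   = ["ham", "spam", "spam", None,  "ham"]   # zero-shot labels, one missing

print(agreement_report(y_true, y_zs))
# {'n': 4, 'accuracy': 0.75, 'precision': 0.5, 'recall': 1.0}
```

In the notebook, `classification_report` applied to the rows where `_label_zs` is present would give a comparable summary.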

## 8) Generate Insights & Recommendations

Uses the LLM to write an analytical summary in **Bahasa Indonesia**.  
Output structure: *Analytical Result*, *Insight & Findings*, *Recommendations*.

In [None]:
stats = {
    "n_rows": int(df.shape[0]),
    "n_cols": int(df.shape[1]),
}
if "_label_zs" in df.columns:
    stats["label_counts_zs"] = df["_label_zs"].value_counts(dropna=False).to_dict()
if "label" in df.columns:
    stats["label_counts_original"] = df["label"].value_counts(dropna=False).to_dict()

# The prompt is written in Indonesian on purpose: it asks for the report in Bahasa Indonesia
prompt = f"""Kamu adalah data analyst senior. Berdasarkan statistik/temuan berikut (format JSON):
{json.dumps(stats, ensure_ascii=False, indent=2)}

Tulis output dalam Bahasa Indonesia dengan format Markdown dan struktur:
### Analytical Result
- (3–5 bullet)

### Insight & Findings
- (3–5 bullet)

### Recommendations
- (3–5 bullet, actionable, prioritas jangka pendek & panjang)
"""

insight_md = replicate_generate(prompt)
print(insight_md)
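
The prompt asks for three fixed headings, but nothing in the notebook verifies that the model actually produced them. A minimal sketch of such a check follows, assuming the report is plain Markdown with `#`-style headings; `missing_sections` is a hypothetical helper and the sample text is invented.

```python
import re

REQUIRED_SECTIONS = ["Analytical Result", "Insight & Findings", "Recommendations"]

def missing_sections(markdown: str, required=REQUIRED_SECTIONS) -> list:
    """Return the required section titles that do not appear as Markdown headings."""
    headings = {
        m.group(1).strip().lower()
        for m in re.finditer(r"^#{1,6}[ \t]+(.+?)[ \t]*$", markdown, flags=re.M)
    }
    return [title for title in required if title.lower() not in headings]

sample = "### Analytical Result\n- a\n\n### Insight & Findings\n- b\n"
print(missing_sections(sample))  # ['Recommendations']
```

Calling `missing_sections(insight_md)` before saving would flag an incomplete report, which could then be regenerated.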

## 9) Save Output

In [None]:
out_files = []

if "_label_zs" in df.columns:
    df.to_csv("predictions.csv", index=False)
    out_files.append("predictions.csv")

try:
    _ = insight_md  # noqa
    with open("report.md", "w", encoding="utf-8") as f:
        f.write(insight_md)
    out_files.append("report.md")
except NameError:
    pass

print("Saved:", out_files)

## 10) Notes & Tips
- **Cost/token control**: tune `SAMPLE_N` and `BATCH`.
- The **label set** must be **mutually exclusive** (no overlapping labels).
- If the dataset is **non‑text**, focus on EDA and use the LLM only for the **insight narrative**.
- The baseline can be improved: change the n‑gram range, raise `max_features`, or try another model (such as LinearSVC).
- Document the technical reasoning and trade‑offs (accuracy vs. cost, interpretability, and so on).