
# Indo EcoTourism — CBF + **User Feedback Weighting (UFW)**

This notebook builds **Content-Based Filtering (CBF)** for tourist-destination recommendations and **adds lightweight personalization** through **User Feedback Weighting (UFW)**.

**Workflow in brief:**
1. **Load and clean the data** (`eco_place.csv`)
2. **Preprocess text**: combine description, category, and city, then vectorize with *TF-IDF*
3. **CBF ranker** + **UFW scoring**: `score = cos(query,item) + α · cos(centroid_like,item)`
4. **Offline evaluation**: P@1, Recall@10, MRR, nDCG@10, latency
5. **Justification** for using UFW, based on the metrics above


## 1) Setup & Install
Prepare the environment and install the libraries.

In [4]:
# !pip -q install pandas numpy scikit-learn scipy joblib


## 2) Imports & Paths
Import the core libraries and set up the artifacts directory.

In [5]:
import os, re, json, math, random, warnings, time, datetime as dt
from pathlib import Path
from typing import List

import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import save_npz, load_npz, csr_matrix
import joblib

# Reproducibility
SEED = 42
random.seed(SEED); np.random.seed(SEED)

# Paths
BASE_DIR = Path(".")
ART_DIR  = BASE_DIR / "artifacts"
ART_DIR.mkdir(exist_ok=True, parents=True)

print("Artifacts dir:", ART_DIR.resolve())


Artifacts dir: /content/artifacts


## 3) Load Data
Load the **`eco_place.csv`** dataset.

In [6]:
from IPython.display import display

DATA_CSV_PATH = Path("./eco_place.csv")

if not DATA_CSV_PATH.exists():
    raise FileNotFoundError(
        "eco_place.csv not found. Upload it or place it in the notebook's working directory."
    )

df = pd.read_csv(DATA_CSV_PATH)
print("Loaded:", DATA_CSV_PATH.resolve())
print("Shape:", df.shape)
display(df.head(3))

Loaded: /content/eco_place.csv
Shape: (182, 13)


| | place_id | place_name | place_description | category | city | price | rating | description_location | place_img | gallery_photo_img1 | gallery_photo_img2 | gallery_photo_img3 | place_map |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Taman Nasional Gunung Leuser | Taman Nasional Gunung Leuser adalah salah satu... | Budaya,Taman Nasional | Aceh | Rp25,000 | 4.5 | Barisan mountain range, Aceh 24653 | https://storage.googleapis.com/travelee-capsto... | https://storage.googleapis.com/travelee-capsto... | https://storage.googleapis.com/travelee-capsto... | https://storage.googleapis.com/travelee-capsto... | https://www.google.com/maps/search/Taman+Nasio... |
| 1 | 2 | Desa Wisata Munduk | Desa Wisata Munduk adalah sebuah desa di pegun... | Desa Wisata | Bali | Rp10,000 | 4.5 | Munduk, Banjar, Kabupaten Buleleng, Bali | https://storage.googleapis.com/travelee-capsto... | https://storage.googleapis.com/travelee-capsto... | https://storage.googleapis.com/travelee-capsto... | https://storage.googleapis.com/travelee-capsto... | https://goo.gl/maps/LyeJ2mAeFGysTE9v9 |
| 2 | 3 | Desa Wisata Penglipuran | Desa Wisata Penglipuran adalah sebuah desa wis... | Budaya,Desa Wisata | Bali | Rp25,000 | 4.8 | Jl. Penglipuran, Kubu, Kec. Bangli, Kabupaten ... | https://storage.googleapis.com/travelee-capsto... | https://storage.googleapis.com/travelee-capsto... | https://storage.googleapis.com/travelee-capsto... | https://storage.googleapis.com/travelee-capsto... | https://www.google.com/maps/search/Desa+Wisata... |



## 4) Cleaning & Preprocessing

- Ensure the expected columns always exist (missing ones are created as NaN)
- Parse *price* into a number (IDR)
- Convert `rating` to a numeric type
- Build the **`gabungan`** ("combined") column from description + category + city, then apply light text preprocessing


In [7]:
df = df.copy()

# Ensure the expected columns exist
for col in ["place_name","place_description","category","city","price","rating","place_img","place_map"]:
    if col not in df.columns:
        df[col] = np.nan

# Drop exact duplicate rows
before = len(df)
df = df.drop_duplicates().reset_index(drop=True)
print(f"Duplicates removed: {before - len(df)}")

# Parse IDR prices
def parse_price_idr(x):
    if pd.isna(x): return np.nan
    s = str(x)
    s = s.replace("Rp","").replace("rp","")
    s = re.sub(r"[.,]", "", s)          # remove thousands separators
    s = s.replace("Gratis","0").replace("gratis","0").strip()  # "Gratis" means free
    m = re.findall(r"\d+", s)
    if not m: return np.nan
    try:
        vals = list(map(int, m))
        return float(int(sum(vals)/len(vals)))  # average if the value is a range
    except Exception:
        return np.nan

df["price"]  = df["price"].apply(parse_price_idr)
df["rating"] = pd.to_numeric(df["rating"], errors="coerce")

# Light text preprocessing
STOPWORDS_ID = set([
    "ada","adalah","agar","akan","antara","atau","banyak","beberapa","belum","berbagai",
    "bila","bisa","bukan","dalam","dan","dapat","dari","dengan","di","hanya","harus","hingga",
    "ini","itu","jika","juga","kah","kami","kamu","karena","ke","kemudian","kepada","lah","lain",
    "lainnya","lalu","lebih","masih","mereka","mungkin","namun","nya","oleh","pada","para","pernah",
    "pun","saat","saja","sampai","sangat","sebagai","sebuah","seluruh","semua","serta","setiap",
    "suatu","sudah","supaya","tanpa","tapi","tentang","tentu","terhadap","tiap","untuk","yaitu","yakni","yang"
])
# Stopwords that are kept on purpose even though they appear in STOPWORDS_ID
IMPORTANT_WORDS = set(["di","ke","dari","untuk","dengan","yang"])

def preprocess_text(text: str) -> str:
    text = str(text).lower()
    tokens = re.findall(r"\w+", text, flags=re.UNICODE)  # alnum + underscore
    filtered = [t for t in tokens if (t not in STOPWORDS_ID) or (t in IMPORTANT_WORDS)]
    return " ".join(filtered)

# Build 'gabungan' and apply the preprocessing
df["place_description"] = df["place_description"].fillna("").astype(str)
df["category"]          = df["category"].fillna("").astype(str)
df["city"]              = df["city"].fillna("").astype(str)
df["place_name"]        = df["place_name"].fillna("").astype(str)

df["gabungan"] = (df["place_description"] + " " + df["category"] + " " + df["city"]).apply(preprocess_text)

print("Nulls after cleaning:\n", df.isnull().sum())
df[["place_name","category","city","gabungan"]].head(5)


Duplicates removed: 0
Nulls after cleaning:
 place_id                 0
place_name               0
place_description        0
category                 0
city                     0
price                    0
rating                   0
description_location     0
place_img                0
gallery_photo_img1       0
gallery_photo_img2       2
gallery_photo_img3      77
place_map                0
gabungan                 0
dtype: int64


| | place_name | category | city | gabungan |
|---|---|---|---|---|
| 0 | Taman Nasional Gunung Leuser | Budaya,Taman Nasional | Aceh | taman nasional gunung leuser salah satu dari e... |
| 1 | Desa Wisata Munduk | Desa Wisata | Bali | desa wisata munduk desa di pegunungan bali yan... |
| 2 | Desa Wisata Penglipuran | Budaya,Desa Wisata | Bali | desa wisata penglipuran desa wisata yang terle... |
| 3 | Taman Nasional Bali Barat | Taman Nasional,Cagar Alam | Bali | taman nasional bali barat kawasan konservasi a... |
| 4 | Bukit Jamur | Cagar Alam | Bandung | bukit jamur ciwidey satu dari sekian pesona wi... |
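
To make the price-cleaning rule concrete, the sketch below re-implements it in plain Python (no pandas) and runs it on a few made-up strings. The helper name `parse_price` and the sample inputs are illustrative assumptions, not values taken from `eco_place.csv`.

```python
import re

def parse_price(text):
    """Standalone sketch of the price rule: strip 'Rp' and thousands
    separators, map 'Gratis' (free) to 0, average the numbers in a range."""
    if text is None:
        return None
    s = str(text).replace("Rp", "").replace("rp", "")
    s = re.sub(r"[.,]", "", s)                       # drop thousands separators
    s = s.replace("Gratis", "0").replace("gratis", "0")
    nums = [int(n) for n in re.findall(r"\d+", s)]
    if not nums:
        return None                                  # nothing numeric to parse
    return float(sum(nums) // len(nums))             # floored mean of the range

# Hypothetical inputs covering the main cases
samples = ["Rp25,000", "Rp10.000 - Rp20.000", "Gratis", "unknown"]
parsed = {s: parse_price(s) for s in samples}
print(parsed)
```

Because both `.` and `,` are treated as thousands separators, a price written with decimals (for example `25.000,50`) would be read as one large integer; that is acceptable only if the data never contains fractional rupiah.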


## 5) Feature Extraction — TF‑IDF
Extract TF-IDF features from the `gabungan` column. The parameters can be tuned to the size of the data.

In [8]:
vectorizer = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1,1),
    min_df=2,
    max_df=0.9,
    sublinear_tf=True,
    norm="l2"
)
tfidf_matrix = vectorizer.fit_transform(df["gabungan"].fillna(""))
tfidf_matrix


<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 6210 stored elements and shape (182, 795)>
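
The matrix above has 795 features, well under the `max_features=5000` cap, so the cap is not what limits the vocabulary here; the document-frequency filters and the small corpus are. The sketch below shows the effect of `min_df` on a made-up four-document corpus (the documents are assumptions for illustration only).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up): "pantai" and "gunung" each occur in two documents,
# every other token occurs in exactly one.
docs = [
    "pantai pasir putih",
    "pantai karang",
    "gunung hutan",
    "gunung danau",
]

loose = TfidfVectorizer(min_df=1).fit(docs)    # keep every token
strict = TfidfVectorizer(min_df=2).fit(docs)   # keep tokens seen in >= 2 docs

loose_vocab = sorted(loose.vocabulary_)
strict_vocab = sorted(strict.vocabulary_)
print(len(loose_vocab), loose_vocab)
print(len(strict_vocab), strict_vocab)
```

With only 182 items, `min_df=2` removes every term that is unique to a single description, which trades some item-specific signal for a less noisy vocabulary.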


## 6) Save Artifacts (for the application)

Save the artifacts so the application can reuse them:
- `artifacts/vectorizer.joblib`
- `artifacts/tfidf_matrix.npz`
- `artifacts/items.csv` (metadata for the UI)
- `artifacts/metadata.json`
- `artifacts/nbrs_cosine.joblib`


In [9]:
items_cols = ["place_name","place_img","place_map","category","city","rating","price","gabungan"]
items = df[items_cols].copy()
items.to_csv(ART_DIR / "items.csv", index=False)

# Save the TF-IDF matrix (sparse) and the vectorizer
save_npz(ART_DIR / "tfidf_matrix.npz", csr_matrix(tfidf_matrix))
joblib.dump(vectorizer, ART_DIR / "vectorizer.joblib")

# Fit and save NearestNeighbors (cosine, brute force) to speed up inference
n_neighbors = int(min(50, tfidf_matrix.shape[0]))  # safe default
nbrs = NearestNeighbors(n_neighbors=n_neighbors, metric="cosine", algorithm="brute")
nbrs.fit(tfidf_matrix)  # accepts CSR directly
joblib.dump(nbrs, ART_DIR / "nbrs_cosine.joblib")

# Metadata
meta = {
    "created_at": dt.datetime.now(dt.timezone.utc).isoformat().replace("+00:00", "Z"),
    "n_items": int(items.shape[0]),
    "n_features": int(tfidf_matrix.shape[1]),
    "vectorizer": "sklearn TfidfVectorizer",
    "nearest_neighbors": {
        "file": "nbrs_cosine.joblib",
        "metric": "cosine",
        "algorithm": "brute",
        "n_neighbors": n_neighbors
    }
}
with open(ART_DIR / "metadata.json","w") as f:
    json.dump(meta, f, indent=2)

print("[OK] Artifacts saved:")
for p in sorted(ART_DIR.iterdir()):
    print(" -", p.name)


[OK] Artifacts saved:
 - items.csv
 - metadata.json
 - nbrs_cosine.joblib
 - tfidf_matrix.npz
 - vectorizer.joblib
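
The ranker in the next section works from the in-memory matrix and does not read `nbrs_cosine.joblib`, so the sketch below shows one way an application might load the saved files and query the neighbor index. It writes tiny stand-in artifacts to a temporary directory first so that it runs on its own; the toy documents and the query string are assumptions.

```python
import tempfile
from pathlib import Path

import joblib
from scipy.sparse import load_npz, save_npz
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

docs = ["pantai pasir putih", "pantai karang", "gunung hutan", "gunung danau"]  # toy data

with tempfile.TemporaryDirectory() as tmp:
    art = Path(tmp)

    # Save side: same file names as the notebook, tiny stand-in content.
    vec = TfidfVectorizer().fit(docs)
    X = vec.transform(docs)
    joblib.dump(vec, art / "vectorizer.joblib")
    save_npz(art / "tfidf_matrix.npz", X)
    index = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute").fit(X)
    joblib.dump(index, art / "nbrs_cosine.joblib")

    # Load side: what an application might do once at startup.
    vec2 = joblib.load(art / "vectorizer.joblib")
    X2 = load_npz(art / "tfidf_matrix.npz")
    nbrs2 = joblib.load(art / "nbrs_cosine.joblib")

    # Query: vectorize with the loaded vectorizer, then ask for neighbors.
    dist, idx = nbrs2.kneighbors(vec2.transform(["pantai karang"]))
    nearest = idx[0].tolist()
    print(nearest, dist[0].round(3).tolist())
```

`kneighbors` returns cosine distances (1 minus similarity), so smaller values mean closer items. An application would also need to apply the same `preprocess_text` step to the query before vectorizing it.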





## 7) Ranker: **CBF + User Feedback Weighting (UFW)**

The score of each item is computed as:

\begin{equation}
\operatorname{score}(i)=
\underbrace{\cos\!\big(\vec q,\,\vec i\big)}_{\text{CBF (query $\to$ item)}} \;+\;
\alpha\cdot
\underbrace{\cos\!\big(\vec c_{\text{liked}},\,\vec i\big)}_{\text{similarity to the liked centroid}}
\end{equation}

- $\vec q$: TF-IDF vector of the **query**  
- $\vec i$: TF-IDF vector of the **item**  
- $\vec c_{\text{liked}}$: TF-IDF **centroid** of the items the user liked  
- $\alpha$: weight of the feedback term (tuned during evaluation)


In [10]:

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def _cosine_sparse_dense_row(sparse_row, dense_mat) -> np.ndarray:
    return cosine_similarity(sparse_row, dense_mat, dense_output=True)[0]

# Precompute a dense copy for speed
TFIDF_DENSE = tfidf_matrix.astype("float32").toarray()

def _user_centroid(liked_indices: List[int]) -> np.ndarray | None:
    if not liked_indices:
        return None
    C = TFIDF_DENSE[liked_indices].mean(axis=0, keepdims=True)
    n = np.linalg.norm(C, axis=1, keepdims=True) + 1e-9
    return (C / n).astype("float32")

def rank_cbf_ufw(query: str, liked_indices: List[int], alpha: float = 0.6, topk: int = 10) -> list[int]:
    """Return the indices of the top-k items for `query`, boosted by liked items."""
    # Query term: cos(q, i) for every item
    qv = vectorizer.transform([preprocess_text(query)])
    sim_q = _cosine_sparse_dense_row(qv, TFIDF_DENSE)  # (N,)
    C = _user_centroid(liked_indices)
    if C is None:
        # No feedback yet: fall back to plain CBF
        return np.argsort(-sim_q)[:topk].tolist()
    # Feedback term: cos(c_liked, i) for every item
    denom = (np.linalg.norm(TFIDF_DENSE, axis=1)+1e-9) * (np.linalg.norm(C)+1e-9)
    sim_cent = (TFIDF_DENSE @ C.T).ravel() / denom
    scores = sim_q + float(alpha) * sim_cent
    return np.argsort(-scores)[:topk].tolist()
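
As a small worked check of the scoring rule, the sketch below applies `score = cos(q, i) + α · cos(c_liked, i)` to four made-up item vectors with NumPy only. The vectors, the liked item, and `alpha = 0.6` are assumptions chosen so the effect of the feedback term is visible.

```python
import numpy as np

def l2_normalize(M):
    """Scale each row to unit length (all-zero rows stay zero)."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    return M / np.maximum(norms, 1e-9)

# Toy item vectors (made up), one row per item.
items = l2_normalize(np.array([
    [1.0, 0.0, 0.0],   # item 0: matches the query exactly
    [0.8, 0.6, 0.0],   # item 1: close to the query, shares a term with item 2
    [0.0, 1.0, 0.0],   # item 2: the item the user liked
    [0.1, 0.0, 1.0],   # item 3: mostly unrelated
]))
query = l2_normalize(np.array([[1.0, 0.0, 0.0]]))
liked = [2]
alpha = 0.6

centroid = l2_normalize(items[liked].mean(axis=0, keepdims=True))
sim_q = (items @ query.T).ravel()      # cos(q, i), since all rows are unit length
sim_c = (items @ centroid.T).ravel()   # cos(c_liked, i)
scores = sim_q + alpha * sim_c

rank_plain = np.argsort(-sim_q).tolist()   # ranking without feedback
rank_ufw = np.argsort(-scores).tolist()    # ranking with feedback
print(rank_plain)
print(rank_ufw)
```

Item 1 overtakes item 0 once feedback is included because it overlaps with the liked item, while the liked item itself moves up from last place. The same mechanism applies to the real TF-IDF rows, only in a much higher-dimensional space.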



## 8) Evaluasi Offline

We build **synthetic queries** from the metadata (category/city/name) and evaluate **only UFW** at several values of $\alpha$ to pick the best one.
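
To show what such synthetic queries look like, here is a plain-dict sketch of the two templates that `_q_struct_1` and `_q_struct_2` implement further down. The function name `build_queries` is hypothetical; the sample row reuses the name, category, and city of the third item shown in the data preview.

```python
def build_queries(row):
    """Return the two synthetic queries for one item (plain-dict sketch)."""
    cat = row.get("category", "").split(",")[0].strip()   # first category only
    city = row.get("city", "").strip()
    name = row.get("place_name", "").strip()

    # Template 1: category + city + name
    q1 = " ".join(t for t in (cat, city, name) if t)

    # Template 2: "wisata <category> di <city>" (no place name)
    q2 = "wisata"
    if cat:
        q2 += f" {cat.lower()}"
    if city:
        q2 += f" di {city.lower()}"
    return [q1, q2]

row = {
    "place_name": "Desa Wisata Penglipuran",
    "category": "Budaya,Desa Wisata",
    "city": "Bali",
}
queries = build_queries(row)
print(queries)
```

Template 2 contains no place name, so items that share the same first category and city produce identical queries with different ground-truth indices. Only one of them can be ranked first for that query, which puts a ceiling below 1.0 on P@1 for this query type.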

**Metrics used:**
- **P@1**: Top-1 accuracy (the ground truth, GT, is exactly at rank 1); *higher is better*.
- **Recall@10**: the GT appears in the Top-10; *higher is better*.
- **MRR**: mean reciprocal rank of the GT, so the earlier the GT appears the larger the value; *higher is better*.
- **nDCG@10**: position-weighted quality of the Top-10 ordering; *higher is better*.
- **Latency(ms/q)**: average time per query; *lower is better*.


In [11]:
from IPython.display import display, Markdown
import time, math, random, re
from typing import List
import pandas as pd

# ---------------- Query builders (synthetic) ----------------
def _q_struct_1(r: pd.Series) -> str:
    cat  = str(r.get("category","")).split(",")[0].strip()
    city = str(r.get("city","")).strip()
    name = str(r.get("place_name","")).strip()
    toks = [t for t in [cat, city, name] if t]
    return " ".join(toks)

def _q_struct_2(r: pd.Series) -> str:
    cat  = str(r.get("category","")).split(",")[0].strip().lower()
    city = str(r.get("city","")).strip().lower()
    base = "wisata"
    if cat:  base += f" {cat}"
    if city: base += f" di {city}"
    return base.strip()

def build_eval_queries(items_df: pd.DataFrame, max_per_item: int = 2, max_queries: int = 300):
    """Build (query, ground_truth_index) pairs for offline evaluation."""
    qs = []
    for i, r in items_df.iterrows():
        cand = [_q_struct_1(r), _q_struct_2(r)]
        cand = [q for q in cand if q and len(q) >= 3]
        for q in cand[:max_per_item]:
            qs.append((q, i))   # (query, GT_index)
    random.shuffle(qs)
    return qs[:max_queries] if max_queries else qs

# ---------------- Metrics (top-k with a single GT) ----------------
def precision_at_k(rank: List[int], gt: int, k: int) -> float:
    "Top-k accuracy (hit@k). With a single GT this is a hit indicator, not hits/k."
    return 1.0 if gt in rank[:k] else 0.0

def recall_at_k(rank: List[int], gt: int, k: int) -> float:
    "Hit@k (identical to precision_at_k above when there is a single GT)."
    return 1.0 if gt in rank[:k] else 0.0

def mrr(rank: List[int], gt: int) -> float:
    "Reciprocal rank of the GT for one query; the earlier the GT, the larger the value."
    try:
        pos = rank.index(gt)
        return 1.0 / float(pos + 1)
    except ValueError:
        return 0.0

def ndcg_at_k(rank: List[int], gt: int, k: int) -> float:
    "Position-weighted quality of the Top-k ordering (ideal DCG is 1 with a single GT)."
    try:
        pos = rank[:k].index(gt)
        return 1.0 / math.log2(pos + 2)
    except ValueError:
        return 0.0

# ---------------- Evaluator for UFW ----------------
def eval_ufw(queries, liked_indices: List[int], alpha: float, topk: int = 10):
    """
    Evaluate CBF + User Feedback Weighting (UFW).
    Uses the ranker: rank_cbf_ufw(query, liked_indices, alpha, topk)
    """
    t0 = time.perf_counter()
    p1 = r10 = m = mnd = 0.0
    for q, gt in queries:
        rank = rank_cbf_ufw(q, liked_indices, alpha=alpha, topk=topk)
        p1  += precision_at_k(rank, gt, 1)
        r10 += recall_at_k(rank, gt, 10)
        m   += mrr(rank, gt)
        mnd += ndcg_at_k(rank, gt, 10)
    n = float(max(1, len(queries)))  # guard against an empty query list
    elapsed_ms = (time.perf_counter() - t0) * 1000.0 / n
    return {
        "P@1": p1 / n,
        "Recall@10": r10 / n,
        "MRR": m / n,
        "nDCG@10": mnd / n,
        "Latency(ms/q)": elapsed_ms,
    }

# ---------------- Metric explanation table ----------------
METRIC_INFO = {
    "P@1": ("Top-1 accuracy (GT exactly at rank 1).", "Higher is better."),
    "Recall@10": ("Hit@10 (GT appears in the top 10).", "Higher is better."),
    "MRR": ("Mean reciprocal rank of the GT.", "Higher means the GT appears earlier."),
    "nDCG@10": ("Position-weighted quality of the Top-10 ordering.", "Higher is better."),
    "Latency(ms/q)": ("Average inference time per query.", "Lower is better."),
}
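
As a sanity check on the metric definitions, the sketch below evaluates one hand-made ranking in which the ground-truth item sits at position 3. The helper names and the ranking are assumptions for illustration; the formulas follow the single-GT definitions used in this section.

```python
import math

def hit_at_k(rank, gt, k):
    """1.0 if the ground truth is among the first k results, else 0.0."""
    return 1.0 if gt in rank[:k] else 0.0

def reciprocal_rank(rank, gt):
    """1 / (1-based position of the ground truth), or 0.0 if it is absent."""
    return 1.0 / (rank.index(gt) + 1) if gt in rank else 0.0

def ndcg_at_k(rank, gt, k):
    """With one relevant item the ideal DCG is 1, so nDCG = 1 / log2(pos + 1)
    for a 1-based position pos (written as index + 2 for a 0-based index)."""
    if gt not in rank[:k]:
        return 0.0
    return 1.0 / math.log2(rank[:k].index(gt) + 2)

rank = [7, 3, 9, 1]   # made-up ranked item indices
gt = 9                # ground truth is third in the list

metrics = {
    "P@1": hit_at_k(rank, gt, 1),
    "Recall@10": hit_at_k(rank, gt, 10),
    "RR": reciprocal_rank(rank, gt),
    "nDCG@10": ndcg_at_k(rank, gt, 10),
}
print(metrics)
```

For this ranking P@1 is 0 and Recall@10 is 1, while the reciprocal rank (1/3) and nDCG (1/log2(4) = 0.5) show how strongly each metric penalizes a third-place hit. Averaging the reciprocal rank over all queries gives MRR.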



## 9) Running the Evaluation & **UFW Justification**

This section builds the synthetic queries, simulates a simple set of *liked indices*, runs a small sweep over $\alpha$, and displays the resulting metrics.  
At the end it shows the **metric explanations** and the **justification** for choosing UFW.


In [12]:
# 1) Build the evaluation queries (synthetic)
EVAL_QUERIES = build_eval_queries(df, max_per_item=2, max_queries=300)
print(f"[Eval] number of queries: {len(EVAL_QUERIES)}")

# 2) Pseudo-liked indices (simulated preferences).
top_cat = df["category"].fillna("").apply(lambda s: str(s).split(",")[0].strip())
dom_cat = top_cat.value_counts().index.tolist()[0] if len(top_cat.value_counts()) else ""
LIKED_IDX = df.index[top_cat == dom_cat].tolist()[:3] if dom_cat else []
print(f"[Eval] pseudo-liked indices (dominant category='{dom_cat}'): {LIKED_IDX}")

# 3) Small α sweep (pick the best by nDCG@10)
alpha_grid = [0.3, 0.5, 0.6, 0.8]
best_alpha, best_metrics = None, None
for a in alpha_grid:
    m = eval_ufw(EVAL_QUERIES, LIKED_IDX, alpha=a, topk=10)
    if (best_metrics is None) or (m["nDCG@10"] > best_metrics["nDCG@10"]):
        best_alpha, best_metrics = a, m

# 4) Show the final result
row = {"Method": f"CBF+UserFeedbackWeighting (α={best_alpha})", **best_metrics}
res_df = pd.DataFrame([row]).set_index("Method")
try:
    display(
        res_df.style.format({
            "P@1":"{:.3f}",
            "Recall@10":"{:.3f}",
            "MRR":"{:.3f}",
            "nDCG@10":"{:.3f}",
            "Latency(ms/q)":"{:.2f}",
        }).background_gradient(cmap="Blues")
    )
except Exception:
    display(res_df)

# 5) Metric explanations
def _escape_pipes(s: str) -> str:
    return str(s).replace("|", r"\|")

lines = [
    "### ℹ️ Metric Explanations",
    "",
    "| Metric | What it measures | How to read it |",
    "|---|---|---|",
]
for k, (what, how) in METRIC_INFO.items():
    lines.append(f"| **{_escape_pipes(k)}** | {_escape_pipes(what)} | {_escape_pipes(how)} |")

from IPython.display import Markdown, display as _display
_display(Markdown("\n".join(lines)))

# 6) Justification
md = f"""
### ✅ Justification for Using *User Feedback Weighting*

The best configuration yields:
- **P@1** = {best_metrics["P@1"]:.3f} → the chance that the top result is exactly the target.
- **Recall@10** = {best_metrics["Recall@10"]:.3f} → the target item often lands in the top 10.
- **MRR** = {best_metrics["MRR"]:.3f} & **nDCG@10** = {best_metrics["nDCG@10"]:.3f} → strong relevance ordering (relevant items tend to appear in early positions).
- **Latency** ≈ {best_metrics["Latency(ms/q)"]:.1f} ms/query → still lightweight for production.

**Why UFW:**
1. **Lightweight personalization without login**: scores adapt to session preferences via the *liked centroid* (the mean vector of the liked items).
2. **Simple and stable**: it only adds the score component α·sim(centroid,item) on top of CBF.
3. **Easy to maintain**: no new artifacts or models are added; the same vectorizer and TF-IDF matrix are reused.
"""
_display(Markdown(md))


[Eval] number of queries: 300
[Eval] pseudo-liked indices (dominant category='Cagar Alam'): [4, 5, 6]


| Method | P@1 | Recall@10 | MRR | nDCG@10 | Latency(ms/q) |
|---|---|---|---|---|---|
| CBF+UserFeedbackWeighting (α=0.3) | 0.360 | 0.720 | 0.466 | 0.526 | 3.89 |


### ℹ️ Metric Explanations

| Metric | What it measures | How to read it |
|---|---|---|
| **P@1** | Top-1 accuracy (GT exactly at rank 1). | Higher is better. |
| **Recall@10** | Hit@10 (GT appears in the top 10). | Higher is better. |
| **MRR** | Mean reciprocal rank of the GT. | Higher means the GT appears earlier. |
| **nDCG@10** | Position-weighted quality of the Top-10 ordering. | Higher is better. |
| **Latency(ms/q)** | Average inference time per query. | Lower is better. |


### ✅ Justification for Using *User Feedback Weighting*

The best configuration yields:
- **P@1** = 0.360 → the chance that the top result is exactly the target.
- **Recall@10** = 0.720 → the target item often lands in the top 10.
- **MRR** = 0.466 & **nDCG@10** = 0.526 → strong relevance ordering (relevant items tend to appear in early positions).
- **Latency** ≈ 3.9 ms/query → still lightweight for production.

**Why UFW:**
1. **Lightweight personalization without login**: scores adapt to session preferences via the *liked centroid* (the mean vector of the liked items).
2. **Simple and stable**: it only adds the score component α·sim(centroid,item) on top of CBF.
3. **Easy to maintain**: no new artifacts or models are added; the same vectorizer and TF-IDF matrix are reused.
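
The α grid used in this section starts at 0.3, so it does not include the plain CBF case. Since α = 0 reduces the score to `cos(query, item)`, adding it to the grid would give a direct baseline for the UFW numbers. The sketch below shows one way to structure such a sweep; `sweep_alpha` and `fake_eval` are hypothetical helpers, and the numbers `fake_eval` returns are made up purely so the example runs on its own.

```python
def sweep_alpha(eval_fn, alphas, key="nDCG@10"):
    """Call eval_fn(alpha) -> dict of metrics for each alpha and return
    (best_alpha, results), where results maps alpha -> metrics."""
    results = {a: eval_fn(a) for a in alphas}
    best = max(results, key=lambda a: results[a][key])
    return best, results

# Stand-in evaluator with made-up numbers (NOT results from this notebook).
# In the notebook the callable could instead be:
#   lambda a: eval_ufw(EVAL_QUERIES, LIKED_IDX, alpha=a, topk=10)
def fake_eval(alpha):
    return {"nDCG@10": 0.5 - 0.1 * abs(alpha - 0.3)}

# 0.0 is the plain-CBF baseline; the rest is the notebook's grid.
best_alpha, results = sweep_alpha(fake_eval, [0.0, 0.3, 0.5, 0.6, 0.8])

baseline = results[0.0]["nDCG@10"]
for a, m in results.items():
    delta = m["nDCG@10"] - baseline
    print(f"alpha={a:.1f}  nDCG@10={m['nDCG@10']:.3f}  delta_vs_cbf={delta:+.3f}")
```

Reporting the difference against α = 0 for each metric would show whether the feedback term helps on these queries or only preserves CBF quality while adding personalization.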
