
# Indo EcoTourism: CBF + **User Feedback Weighting (UFW)**

This notebook builds a **Content-Based Filtering (CBF)** recommender for tourist destinations and **adds lightweight personalization** through **User Feedback Weighting (UFW)**.

**Workflow at a glance:**
1. **Load and clean the data** (`eco_place.csv`)
2. **Preprocess text**: concatenate description, category, and city, then vectorize with *TF-IDF*
3. **CBF ranker** + **UFW scoring**: `score = cos(query, item) + α · cos(centroid_liked, item)`
4. **Offline evaluation**: Precision@10, Recall@10, and latency
5. **Justification** for using UFW, based on the metrics above
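
The scoring rule in step 3 can be written out more explicitly. In the sketch below, $\mathbf{q}$ is the TF-IDF vector of the query, $\mathbf{x}_i$ the TF-IDF vector of item $i$, and $L$ the set of liked items. These symbols are notation introduced here for illustration; they do not appear in the notebook's code.

```latex
\[
\operatorname{score}(q, i)
  = \cos(\mathbf{q}, \mathbf{x}_i) + \alpha \, \cos(\mathbf{c}_L, \mathbf{x}_i),
\qquad
\mathbf{c}_L = \frac{1}{|L|} \sum_{j \in L} \mathbf{x}_j
\]
```

When $L$ is empty the second term is dropped and the ranking reduces to plain CBF, which matches the fallback branch in the ranker.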


## 1) Setup & Install
Prepare the environment and install the required libraries.

In [None]:
# !pip -q install pandas numpy scikit-learn scipy joblib


## 2) Imports & Paths
Import the core libraries and set up the artifacts directory.

In [None]:
import os
import re
import json
import time
import random
import datetime as dt
from pathlib import Path

import numpy as np
import pandas as pd
import joblib

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import save_npz, csr_matrix

# Reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# Paths
BASE_DIR = Path(".")
ART_DIR = BASE_DIR / "artifacts"
ART_DIR.mkdir(exist_ok=True, parents=True)

print("Artifacts dir:", ART_DIR.resolve())

Artifacts dir: /content/artifacts


## 3) Load Data

In [None]:
from IPython.display import display

DATA_CSV_PATH = Path("./eco_place.csv")

if not DATA_CSV_PATH.exists():
    raise FileNotFoundError(
        "eco_place.csv not found. Please upload or place the file in the notebook's working directory."
    )

df = pd.read_csv(DATA_CSV_PATH)
print("Loaded:", DATA_CSV_PATH.resolve())
print("Shape:", df.shape)
display(df.head(3))

Loaded: /content/eco_place.csv
Shape: (182, 13)


| | place_id | place_name | place_description | category | city | price | rating | description_location | place_img | gallery_photo_img1 | gallery_photo_img2 | gallery_photo_img3 | place_map |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Taman Nasional Gunung Leuser | Taman Nasional Gunung Leuser adalah salah satu... | Budaya,Taman Nasional | Aceh | Rp25,000 | 4.5 | Barisan mountain range, Aceh 24653 | https://storage.googleapis.com/travelee-capsto... | https://storage.googleapis.com/travelee-capsto... | https://storage.googleapis.com/travelee-capsto... | https://storage.googleapis.com/travelee-capsto... | https://www.google.com/maps/search/Taman+Nasio... |
| 1 | 2 | Desa Wisata Munduk | Desa Wisata Munduk adalah sebuah desa di pegun... | Desa Wisata | Bali | Rp10,000 | 4.5 | Munduk, Banjar, Kabupaten Buleleng, Bali | https://storage.googleapis.com/travelee-capsto... | https://storage.googleapis.com/travelee-capsto... | https://storage.googleapis.com/travelee-capsto... | https://storage.googleapis.com/travelee-capsto... | https://goo.gl/maps/LyeJ2mAeFGysTE9v9 |
| 2 | 3 | Desa Wisata Penglipuran | Desa Wisata Penglipuran adalah sebuah desa wis... | Budaya,Desa Wisata | Bali | Rp25,000 | 4.8 | Jl. Penglipuran, Kubu, Kec. Bangli, Kabupaten ... | https://storage.googleapis.com/travelee-capsto... | https://storage.googleapis.com/travelee-capsto... | https://storage.googleapis.com/travelee-capsto... | https://storage.googleapis.com/travelee-capsto... | https://www.google.com/maps/search/Desa+Wisata... |



## 4) Cleaning & Preprocessing


In [None]:
df = df.copy()

# Make sure the required columns exist
required_cols = ["place_name", "place_description", "category", "city", "price", "rating", "place_img", "place_map"]
for col in required_cols:
    if col not in df.columns:
        df[col] = np.nan

# Drop duplicate rows
before = len(df)
df = df.drop_duplicates().reset_index(drop=True)
print(f"Duplicates removed: {before - len(df)}")

# Parse IDR prices (simplified)
def parse_price_idr(x):
    """Strip 'rp' and thousands separators, then return the first number found."""
    if pd.isna(x):
        return np.nan
    s = str(x).lower()
    s = s.replace("rp", "").replace(".", "").replace(",", "")
    if "gratis" in s:
        return 0.0
    angka = re.findall(r"\d+", s)
    if angka:
        return float(angka[0])
    return np.nan

df["price"] = df["price"].apply(parse_price_idr)
df["rating"] = pd.to_numeric(df["rating"], errors="coerce")

# Indonesian stopwords (simplified: no IMPORTANT_WORDS list)
STOPWORDS_ID = {
    "ada", "adalah", "agar", "akan", "atau", "banyak", "beberapa", "belum",
    "bila", "bisa", "bukan", "dalam", "dan", "dapat", "dari", "dengan", "di",
    "hanya", "harus", "hingga", "ini", "itu", "jika", "juga", "kami", "kamu",
    "karena", "ke", "kepada", "lain", "lalu", "lebih", "masih", "mereka",
    "namun", "nya", "oleh", "pada", "para", "saat", "saja", "sampai", "sangat",
    "sebagai", "sebuah", "seluruh", "semua", "serta", "setiap", "suatu", "sudah",
    "tanpa", "tapi", "tentang", "untuk", "yaitu", "yang"
}

def preprocess_text(text):
    """Preprocess text: lowercase, tokenize, remove stopwords."""
    text = str(text).lower()
    tokens = re.findall(r"\w+", text)
    tokens = [t for t in tokens if t not in STOPWORDS_ID]
    return " ".join(tokens)

# Fill missing values with empty strings
df["place_description"] = df["place_description"].fillna("").astype(str)
df["category"] = df["category"].fillna("").astype(str)
df["city"] = df["city"].fillna("").astype(str)
df["place_name"] = df["place_name"].fillna("").astype(str)

# Build the combined text column
df["gabungan"] = (df["place_description"] + " " + df["category"] + " " + df["city"]).apply(preprocess_text)

print("Nulls after cleaning:\n", df.isnull().sum())
df[["place_name", "category", "city", "gabungan"]].head(5)

Duplicates removed: 0
Nulls after cleaning:
 place_id                 0
place_name               0
place_description        0
category                 0
city                     0
price                    0
rating                   0
description_location     0
place_img                0
gallery_photo_img1       0
gallery_photo_img2       2
gallery_photo_img3      77
place_map                0
gabungan                 0
dtype: int64


| | place_name | category | city | gabungan |
|---|---|---|---|---|
| 0 | Taman Nasional Gunung Leuser | Budaya,Taman Nasional | Aceh | taman nasional gunung leuser salah satu enam t... |
| 1 | Desa Wisata Munduk | Desa Wisata | Bali | desa wisata munduk desa pegunungan bali terken... |
| 2 | Desa Wisata Penglipuran | Budaya,Desa Wisata | Bali | desa wisata penglipuran desa wisata terletak k... |
| 3 | Taman Nasional Bali Barat | Taman Nasional,Cagar Alam | Bali | taman nasional bali barat kawasan konservasi a... |
| 4 | Bukit Jamur | Cagar Alam | Bandung | bukit jamur ciwidey satu sekian pesona wisata ... |
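
To make the behaviour of the price parser concrete, the sketch below repeats the notebook's `parse_price_idr` logic so that it runs on its own and applies it to a few sample strings. Only `"Rp25,000"` comes from the dataset preview; the other inputs are made up for illustration.

```python
import re

import numpy as np
import pandas as pd


def parse_price_idr(x):
    """Same logic as the notebook's parser, repeated so this block runs alone."""
    if pd.isna(x):
        return np.nan
    s = str(x).lower()
    # Both "." and "," are treated as thousands separators and removed,
    # so decimal fractions are not supported by this parser.
    s = s.replace("rp", "").replace(".", "").replace(",", "")
    if "gratis" in s:
        return 0.0
    numbers = re.findall(r"\d+", s)
    return float(numbers[0]) if numbers else np.nan


# Hypothetical inputs, except "Rp25,000" which appears in the data preview
samples = ["Rp25,000", "Rp10.000", "Gratis", "Rp5,000 - Rp15,000", None]
parsed = [parse_price_idr(s) for s in samples]
print(parsed)
```

For a price range, only the lower bound survives because the parser keeps the first number it finds.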


## 5) Feature Extraction — TF‑IDF

In [None]:
vectorizer = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1, 1),
    min_df=2,
    max_df=0.9,
    sublinear_tf=True,
    norm="l2"
)
tfidf_matrix = vectorizer.fit_transform(df["gabungan"].fillna(""))
print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")

TF-IDF matrix shape: (182, 797)
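
The effect of `min_df` and `norm="l2"` is easier to see on a tiny corpus. The sketch below uses four made-up documents (not rows from `eco_place.csv`) and a subset of the settings above; `toy_vectorizer` and `toy_matrix` are hypothetical names.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration
docs = [
    "taman nasional hutan konservasi",
    "desa wisata budaya pegunungan",
    "taman nasional pantai konservasi",
    "desa wisata pantai",
]

toy_vectorizer = TfidfVectorizer(min_df=2, sublinear_tf=True, norm="l2")
toy_matrix = toy_vectorizer.fit_transform(docs)

# min_df=2 keeps only terms that occur in at least two documents,
# so "hutan", "budaya" and "pegunungan" are dropped from the vocabulary
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)
print(toy_matrix.shape)

# With norm="l2", every non-empty row has unit Euclidean length
row_norms = np.sqrt(np.asarray(toy_matrix.multiply(toy_matrix).sum(axis=1)).ravel())
print(row_norms)
```

Because rows are unit-length, the cosine similarity between two items is simply their dot product.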



## 6) Save Artifacts


In [None]:
items_cols = ["place_name", "place_img", "place_map", "category", "city", "rating", "price", "gabungan"]
items = df[items_cols].copy()
items.to_csv(ART_DIR / "items.csv", index=False)

# Save the vectorizer
joblib.dump(vectorizer, ART_DIR / "vectorizer.joblib")

# Save the TF-IDF matrix in SciPy sparse (.npz) format
save_npz(ART_DIR / "tfidf_matrix.npz", csr_matrix(tfidf_matrix))

# Fit and save a brute-force cosine nearest-neighbour index
nbrs = NearestNeighbors(n_neighbors=50, metric="cosine", algorithm="brute")
nbrs.fit(tfidf_matrix)
joblib.dump(nbrs, ART_DIR / "nbrs_cosine.joblib")

# Metadata
meta = {
    # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12
    "created_at": dt.datetime.now(dt.timezone.utc).isoformat().replace("+00:00", "Z"),
    "n_items": int(items.shape[0]),
    "n_features": int(tfidf_matrix.shape[1]),
    "vectorizer": "sklearn TfidfVectorizer"
}
with open(ART_DIR / "metadata.json", "w") as f:
    json.dump(meta, f, indent=2)

print("[OK] Artifacts saved:")
for p in sorted(ART_DIR.iterdir()):
    print(" -", p.name)

[OK] Artifacts saved:
 - items.csv
 - metadata.json
 - nbrs_cosine.joblib
 - tfidf_matrix.npz
 - vectorizer.joblib




## 7) Ranker: **CBF + User Feedback Weighting (UFW)**


In [None]:
"""
Item score = cosine(query, item) + alpha * cosine(centroid_liked, item)

- query: TF-IDF vector of the user's input
- item: TF-IDF vector of the item
- centroid_liked: mean vector of the items the user liked
- alpha: weight of the feedback term (default 0.6)
"""

# Convert to a dense array for the computations below
TFIDF_DENSE = tfidf_matrix.toarray()

def get_user_centroid(liked_indices):
    """Compute the centroid (mean) of the liked items."""
    if not liked_indices:
        return None
    liked_vectors = TFIDF_DENSE[liked_indices]
    centroid = liked_vectors.mean(axis=0).reshape(1, -1)
    return centroid

def rank_cbf_ufw(query, liked_indices, alpha=0.6, topk=10):
    """
    Rank items with CBF + User Feedback Weighting.

    Args:
        query: query string from the user
        liked_indices: list of indices of items the user liked
        alpha: feedback weight (0-1)
        topk: number of results to return

    Returns:
        list of item indices with the highest scores, best first
    """
    # Transform the query into a TF-IDF vector
    query_processed = preprocess_text(query)
    query_vector = vectorizer.transform([query_processed])

    # Similarity between the query and every item
    sim_query = cosine_similarity(query_vector, TFIDF_DENSE)[0]

    # With no liked items, fall back to the pure CBF ranking
    centroid = get_user_centroid(liked_indices)
    if centroid is None:
        top_indices = np.argsort(-sim_query)[:topk]
        return top_indices.tolist()

    # Similarity between the centroid and every item
    sim_centroid = cosine_similarity(centroid, TFIDF_DENSE)[0]

    # Combine the scores: CBF + alpha * UFW
    scores = sim_query + alpha * sim_centroid

    # Take the top-k
    top_indices = np.argsort(-scores)[:topk]
    return top_indices.tolist()
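
To see how the feedback term changes a ranking, the sketch below applies the same scoring rule as `rank_cbf_ufw` to a four-item toy catalogue. It is an illustrative variant, not the notebook's implementation: the documents are made up, `rank_sparse` is a hypothetical helper, and it keeps the matrix sparse instead of using `TFIDF_DENSE`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy catalogue, made up for illustration
docs = [
    "pantai pasir putih snorkeling",  # 0: beach
    "pantai karang snorkeling",       # 1: beach
    "gunung hutan pendakian",         # 2: mountain
    "gunung danau pendakian",         # 3: mountain
]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # stays sparse


def rank_sparse(query, liked, alpha=0.6, topk=3):
    """Score = cos(query, item) + alpha * cos(centroid_liked, item)."""
    scores = cosine_similarity(vec.transform([query]), X)[0]
    if liked:
        # Mean of the liked rows; np.asarray converts the np.matrix result
        centroid = np.asarray(X[liked].mean(axis=0))
        scores = scores + alpha * cosine_similarity(centroid, X)[0]
    return np.argsort(-scores)[:topk].tolist()


no_feedback = rank_sparse("pantai", liked=[])
with_feedback = rank_sparse("pantai", liked=[2])
print(no_feedback)
print(with_feedback)
```

In this toy setup, liking a mountain item is enough to push it above both beach items for the query "pantai", even though it shares no term with the query. This shows how strongly `alpha=0.6` can weigh feedback against the query itself.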


## 8) Offline Evaluation


In [None]:
from IPython.display import display, Markdown

K_EVAL = 10

def build_query(row):
    """Build a synthetic query from the item's metadata."""
    cat = str(row.get("category", "")).split(",")[0].strip()
    city = str(row.get("city", "")).strip()
    name = str(row.get("place_name", "")).strip()
    parts = [p for p in [cat, city, name] if p]
    return " ".join(parts)

def build_eval_queries(items_df, max_queries=300):
    """Build a list of (query, ground_truth_index) pairs for evaluation."""
    queries = []
    for i, row in items_df.iterrows():
        q = build_query(row)
        if q and len(q) >= 3:
            queries.append((q, i))
    random.shuffle(queries)
    return queries[:max_queries]

def precision_at_k(ranked_list, ground_truth, k):
    """Precision@K with a single relevant item: 1/K if the GT is in the top-K, else 0."""
    return (1.0 / k) if ground_truth in ranked_list[:k] else 0.0

def recall_at_k(ranked_list, ground_truth, k):
    """Recall@K: 1 if the GT is in the top-K, else 0."""
    return 1.0 if ground_truth in ranked_list[:k] else 0.0

def evaluate(queries, liked_indices, alpha, k=10):
    """Evaluate the ranker with Precision@K, Recall@K, and latency."""
    total_prec = 0.0
    total_rec = 0.0
    total_time = 0.0

    for query_text, gt_index in queries:
        start = time.perf_counter()
        ranked = rank_cbf_ufw(query_text, liked_indices, alpha=alpha, topk=k)
        elapsed_ms = (time.perf_counter() - start) * 1000

        total_prec += precision_at_k(ranked, gt_index, k)
        total_rec += recall_at_k(ranked, gt_index, k)
        total_time += elapsed_ms

    n = max(1, len(queries))
    return {
        f"precision@{k}": total_prec / n,
        f"recall@{k}": total_rec / n,
        "latency_ms": total_time / n
    }
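
Precision@K and Recall@K as defined above only record whether the ground truth appears in the top-K, not where it appears. Rank-aware metrics such as MRR and nDCG@K capture that position. The sketch below assumes exactly one relevant item per query, as in this evaluation setup; `reciprocal_rank` and `ndcg_at_k` are hypothetical helpers that are not part of `evaluate`.

```python
import math


def reciprocal_rank(ranked_list, ground_truth):
    """1 / rank of the ground truth (1-based), or 0.0 if it is absent."""
    for rank, item in enumerate(ranked_list, start=1):
        if item == ground_truth:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_list, ground_truth, k):
    """nDCG@K with one relevant item: ideal DCG is 1, so nDCG = 1 / log2(rank + 1)."""
    for rank, item in enumerate(ranked_list[:k], start=1):
        if item == ground_truth:
            return 1.0 / math.log2(rank + 1)
    return 0.0


# Hypothetical ranking with the ground truth (item 9) at rank 3
ranked = [7, 3, 9, 1]
rr = reciprocal_rank(ranked, 9)
ndcg_10 = ndcg_at_k(ranked, 9, k=10)
ndcg_2 = ndcg_at_k(ranked, 9, k=2)  # ground truth falls outside the top-2
print(rr, ndcg_10, ndcg_2)
```

MRR is the mean of `reciprocal_rank` over all evaluation queries, in the same way `evaluate` averages precision and recall.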


## 9) Running the Evaluation & **UFW Justification**

In [None]:
# Build the evaluation queries
EVAL_QUERIES = build_eval_queries(df, max_queries=300)
print(f"[Eval] Number of queries: {len(EVAL_QUERIES)}")

# Simulate liked items (take 3 items from the most frequent primary category)
top_categories = df["category"].fillna("").apply(lambda s: str(s).split(",")[0].strip())
dominant_cat = top_categories.value_counts().index[0] if len(top_categories.value_counts()) > 0 else ""
LIKED_IDX = df.index[top_categories == dominant_cat].tolist()[:3] if dominant_cat else []
print(f"[Eval] Pseudo-liked indices (category: '{dominant_cat}'): {LIKED_IDX}")

# Use a fixed alpha = 0.6 (simplified: no tuning)
ALPHA = 0.6

# Run the evaluation
metrics = evaluate(EVAL_QUERIES, LIKED_IDX, alpha=ALPHA, k=K_EVAL)

# Show the results
results_df = pd.DataFrame([{
    "Method": f"CBF+UFW (α={ALPHA})",
    **metrics
}]).set_index("Method")

print("\n=== EVALUATION RESULTS ===")
display(results_df)

# Metric explanation
explanation = f"""
### ℹ️ Metric Notes
- **Precision@{K_EVAL}**: share of relevant items in the Top-{K_EVAL}; with one ground-truth item per query this is 1/{K_EVAL} if it appears and 0 otherwise
- **Recall@{K_EVAL}**: 1 if the ground truth appears in the Top-{K_EVAL}, 0 otherwise
- **Latency (ms)**: average ranking time per query

**Alpha = {ALPHA}** is a fixed default intended to balance query similarity and user preference; it was not tuned in this notebook.
"""
display(Markdown(explanation))

[Eval] Number of queries: 182
[Eval] Pseudo-liked indices (category: 'Cagar Alam'): [4, 5, 6]

=== EVALUATION RESULTS ===


| Method | precision@10 | recall@10 | latency_ms |
|---|---|---|---|
| CBF+UFW (α=0.6) | 0.088462 | 0.884615 | 3.161402 |



### ℹ️ Metric Notes
- **Precision@10**: share of relevant items in the Top-10; with one ground-truth item per query this is 1/10 if it appears and 0 otherwise
- **Recall@10**: 1 if the ground truth appears in the Top-10, 0 otherwise
- **Latency (ms)**: average ranking time per query

**Alpha = 0.6** is a fixed default intended to balance query similarity and user preference; it was not tuned in this notebook.
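
A single row at α = 0.6 does not show what UFW adds by itself. One way to check would be to run the same evaluation at several alpha values, including α = 0, where the score reduces to the plain CBF similarity. The sketch below is an assumption about how that comparison could be organised: `compare_alphas` is a hypothetical helper, and `fake_evaluate` is a stand-in with dummy numbers so the block runs on its own. Its values are not results.

```python
import pandas as pd


def compare_alphas(evaluate_fn, queries, liked_indices, alphas=(0.0, 0.3, 0.6), k=10):
    """Run evaluate_fn once per alpha and collect the metrics in one table."""
    rows = []
    for alpha in alphas:
        metrics = evaluate_fn(queries, liked_indices, alpha=alpha, k=k)
        rows.append({"alpha": alpha, **metrics})
    return pd.DataFrame(rows).set_index("alpha")


# Stand-in evaluator returning dummy numbers (NOT real results);
# in the notebook, pass the real `evaluate` function instead.
def fake_evaluate(queries, liked_indices, alpha, k=10):
    return {f"recall@{k}": 1.0 - 0.1 * alpha}


table = compare_alphas(fake_evaluate, queries=[], liked_indices=[])
print(table)
```

In the notebook this would be called as `compare_alphas(evaluate, EVAL_QUERIES, LIKED_IDX)`, and the α = 0 row would serve as the CBF-only baseline.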
