## Project Goals

The main goal of this project is to build a simple chatbot system based on Retrieval-Augmented Generation (RAG), using classical Information Retrieval (IR) techniques to answer user questions about films. Specifically, the project goals are:

1.  **Implement Information Retrieval models:** Develop and integrate a Boolean Information Retrieval model and a Vector Space Model (VSM) with TF-IDF weighting to find relevant documents.
2.  **Build the knowledge base:** Create a document corpus (questions and answers) from the provided text files to serve as the chatbot's information source.
3.  **Preprocess the data:** Apply text preprocessing techniques (case folding, filtering, tokenization, stop word removal, stemming) to prepare documents and queries.
4.  **Question answering:** Develop a function that accepts a user query, processes it, finds the most relevant documents with the implemented IR models, and produces an answer from the documents found.
5.  **Evaluate the system:** Evaluate the performance of the search system (the retrieval component of RAG) with metrics such as Precision, Recall, F1 Score, and MAP@k against a predefined ground truth.
6.  **Simple user interface:** Build a basic user interface (using Gradio) so that users can interact with the chatbot.
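
The retrieval metrics named in item 5 can be illustrated with a small sketch. The function name `precision_recall_f1_at_k` and the toy document IDs below are assumptions made for illustration; they are not part of the project code.

```python
def precision_recall_f1_at_k(retrieved, relevant, k):
    """Compute Precision@k, Recall@k and F1@k for a single query."""
    top_k = retrieved[:k]
    relevant = set(relevant)
    # A hit is a retrieved document that also appears in the ground truth.
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: 2 of the top 3 results are relevant,
# out of 4 relevant documents in total.
p, r, f1 = precision_recall_f1_at_k([7, 2, 9], [7, 9, 4, 1], k=3)
print(f"P@3={p:.2f} R@3={r:.2f} F1@3={f1:.2f}")
```

Recall is divided by the number of relevant documents in the ground truth, not by the number of retrieved documents, so a short result list cannot reach full recall on its own.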

## Project Scope

The scope of this project is limited to the following:

1.  **Dataset:** The system only uses question-and-answer data extracted from the provided text files (`.txt` files in the `data` folder). It does not pull information from external sources or the web.
2.  **IR models:** The project focuses on implementing and using Boolean IR and VSM with TF-IDF. More modern IR or embedding models (such as word embeddings, language models, or deep learning techniques for search) are out of scope.
3.  **Preprocessing:** Preprocessing is limited to the basic methods implemented in the code (case folding, filtering, tokenization, stop word removal, stemming).
4.  **Answer mechanism:** The chatbot produces answers by retrieving and displaying text snippets from the most relevant source documents. The system does not perform complex text generation of the kind done by generative language models (for example, GPT or Gemini). This is a simplified form of RAG.
5.  **Interface:** The user interface is provided through Gradio to demonstrate the chatbot's basic functionality.
6.  **Evaluation:** Performance evaluation covers retrieval metrics only (how well the system finds relevant documents), using a predefined ground truth. Qualitative evaluation of answer quality, or evaluation with more advanced text generation metrics, is out of scope.

Installation and Library Imports

In [None]:
!pip install -q nltk numpy scipy scikit-learn

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import numpy as np
from collections import defaultdict, Counter

nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

Preprocessing

In [None]:
stop_words = set(stopwords.words('indonesian'))

def cleantext(text):
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def tokenizetext(text):
    return text.split()

def removestopwordstokens(tokens):
    return [t for t in tokens if t not in stop_words]

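def stemtokens(tokens):
    # Note: PorterStemmer implements the Porter algorithm for English,
    # so on Indonesian tokens it only strips English-style suffixes.
    ps = PorterStemmer()
    return [ps.stem(t) for t in tokens]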

def preprocess(text):
    text = cleantext(text)
    tokens = tokenizetext(text)
    tokens = removestopwordstokens(tokens)
    tokens = stemtokens(tokens)
    return tokens
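
Because `cleantext` replaces every character outside `a-z`, `0-9`, and whitespace with a space, an accented name such as "Lumière" is split into two tokens. The sketch below shows one possible way to fold accents to ASCII before filtering. The function name `clean_text_ascii_fold` and the sample sentence are assumptions made for illustration, and this is an optional variation, not the notebook's method.

```python
import re
import unicodedata


def clean_text_ascii_fold(text):
    """Lowercase, fold accented letters to ASCII, then drop other symbols."""
    text = text.lower()
    # NFKD splits "è" into "e" plus a combining accent mark.
    text = unicodedata.normalize("NFKD", text)
    # Drop the combining marks so only the base letters remain.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical sample sentence
print(clean_text_ascii_fold("Lumière bersaudara, Paris 1895!"))
# prints: lumiere bersaudara paris 1895
```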


Boolean IR

In [None]:
class BooleanIR:
    def __init__(self):
        self.inverted_index = defaultdict(set)

    def build_index(self, documents):
        for doc_id, doc_tokens in enumerate(documents):
            for token in doc_tokens:
                self.inverted_index[token].add(doc_id)

    def query(self, q_tokens):
        result = None
        for token in q_tokens:
            docs = self.inverted_index.get(token, set())
            if result is None:
                result = docs
            else:
                result = result.intersection(docs)
        return result if result else set()
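
The AND semantics of the Boolean model can be seen on a toy corpus. This is a self-contained sketch that mirrors the inverted-index idea above; the function names `build_inverted_index` and `boolean_and` and the three toy documents are assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(tokenized_docs):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in enumerate(tokenized_docs):
        for token in tokens:
            index[token].add(doc_id)
    return index


def boolean_and(index, query_tokens):
    """Return the IDs of documents that contain every query token."""
    postings = [index.get(token, set()) for token in query_tokens]
    if not postings:
        return set()
    return set.intersection(*postings)


# Hypothetical corpus, already preprocessed into tokens
toy_docs = [
    ["film", "aksi", "2025"],
    ["film", "animasi"],
    ["sutradara", "film", "titanic"],
]
index = build_inverted_index(toy_docs)
print(boolean_and(index, ["film", "aksi"]))   # only document 0 has both terms
print(boolean_and(index, ["film", "oscar"]))  # an unknown term empties the result
```

A single query token missing from the index makes the whole AND query return nothing, which is why the chatbot relies on the ranked VSM model by default.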


VSM

In [None]:
class VSMIR:
    def __init__(self, documents):
        self.documents = documents
        self.vocab = self.build_vocab(documents)
        # IDF must be computed first: build_doc_vectors() calls tfidf(), which reads self.idf
        self.idf = self.compute_idf(documents, self.vocab)
        self.doc_vectors = self.build_doc_vectors(documents, self.vocab)

    def build_vocab(self, docs):
        vocab = set()
        for d in docs:
            vocab.update(d)
        return list(vocab)

    def compute_idf(self, docs, vocab):
        N = len(docs)
        idf = {}
        for term in vocab:
            df = sum(1 for d in docs if term in d)
            idf[term] = np.log((N + 1) / (df + 1)) + 1
        return idf

    def tfidf(self, doc):
        tf = Counter(doc)
        vec = np.zeros(len(self.vocab))
        for i, term in enumerate(self.vocab):
            vec[i] = tf[term] * self.idf.get(term, 0)
        return vec

    def build_doc_vectors(self, docs, vocab):
        return np.array([self.tfidf(d) for d in docs])

    def query_vector(self, query):
        return self.tfidf(query)

    def cosine_similarity(self, vec1, vec2):
        num = np.dot(vec1, vec2)
        den = np.linalg.norm(vec1) * np.linalg.norm(vec2)
        return num / den if den != 0 else 0

    def rank(self, query, top_k=3):
        q_vec = self.query_vector(query)
        scores = np.array([self.cosine_similarity(q_vec, d_vec) for d_vec in self.doc_vectors])
        top_indices = scores.argsort()[-top_k:][::-1]
        return [(i, scores[i]) for i in top_indices if scores[i] > 0]
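
A small worked example may help show how the smoothed IDF formula used in `compute_idf`, `log((N + 1) / (df + 1)) + 1`, interacts with cosine similarity. The sketch uses only the standard library; the helper names and the three toy documents are assumptions made for illustration.

```python
import math
from collections import Counter


def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF: log((N + 1) / (df + 1)) + 1."""
    return math.log((n_docs + 1) / (doc_freq + 1)) + 1


# Hypothetical corpus, already preprocessed into tokens
toy_docs = [["film", "aksi"], ["film", "animasi"], ["film", "aksi", "aksi"]]
vocab = sorted({token for doc in toy_docs for token in doc})
idf = {
    term: smoothed_idf(len(toy_docs), sum(1 for doc in toy_docs if term in doc))
    for term in vocab
}


def tfidf_vector(tokens):
    """Raw term frequency times IDF, one entry per vocabulary term."""
    tf = Counter(tokens)
    return [tf[term] * idf[term] for term in vocab]


def cosine(u, v):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


query_vec = tfidf_vector(["aksi"])
scores = [cosine(query_vec, tfidf_vector(doc)) for doc in toy_docs]
for doc_id, score in enumerate(scores):
    print(doc_id, round(score, 3))
```

The term "film" occurs in every document, so its IDF falls to the minimum value of 1. The document that repeats "aksi" scores highest for the query, and the document without "aksi" scores 0.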


Load Dataset

In [None]:
import os
from google.colab import files
import glob

# Upload file
uploaded = files.upload()

# Move the uploaded files into the data folder
os.makedirs('data', exist_ok=True)
for filename in uploaded.keys():
    os.rename(filename, os.path.join('data', filename))

print("File berhasil diupload dan dipindahkan ke folder /content/data")

# Read the .txt files and parse the question-answer pairs
questions, answers = [], []
for filepath in glob.glob('data/*.txt'):
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            if '->' in line:
                # Split on the first '->' only, so an answer may itself contain '->'
                q, a = line.split('->', 1)
                questions.append(q.strip("- ").lower().strip())
                answers.append(a.strip())

print(f"Berhasil membaca {len(questions)} pasangan pertanyaan-jawaban dari dataset.")

Saving Judul_Aktor_dan_Aktris.txt to Judul_Aktor_dan_Aktris.txt
Saving judul_alur_cerita_dan_tema_film.txt to judul_alur_cerita_dan_tema_film.txt
Saving Judul_Efek_Visual_dan_Teknologi_Film.txt to Judul_Efek_Visual_dan_Teknologi_Film.txt
Saving Judul_Fakta_dan_Trivia_Film_Dunia.txt to Judul_Fakta_dan_Trivia_Film_Dunia.txt
Saving Judul_Fakta_Unik_Film.txt to Judul_Fakta_Unik_Film.txt
Saving Judul_Film_Animasi_Terbaik_Sepanjang_Masa.txt to Judul_Film_Animasi_Terbaik_Sepanjang_Masa.txt
Saving Judul_Film_Berdasarkan_Kisah_Nyata.txt to Judul_Film_Berdasarkan_Kisah_Nyata.txt
Saving Judul_Film_Indonesia.txt to Judul_Film_Indonesia.txt
Saving Judul_Film_Kekinian_2024–2025.txt to Judul_Film_Kekinian_2024–2025.txt
Saving Judul_Film_Keknian_Tahun_2024-2025.txt to Judul_Film_Keknian_Tahun_2024-2025.txt
Saving Judul_Film_Paling_Rekomendasi_2025.txt to Judul_Film_Paling_Rekomendasi_2025.txt
Saving Judul_Film_Superhero.txt to Judul_Film_Superhero.txt
Saving Judul_Genre_Film.txt to Judul_Genre_Fil
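
The parsing step can be isolated into a small helper for testing. The sketch assumes each data line has the form `- question -> answer`, which is inferred from the `strip("- ")` and `split('->')` calls in the loading code. The function name `parse_qa_line` is hypothetical.

```python
def parse_qa_line(line):
    """Return (question, answer) for a '- question -> answer' line, else None."""
    if "->" not in line:
        return None
    # Split on the first arrow only, so the answer may contain '->' itself.
    question, answer = line.split("->", 1)
    return question.strip("- ").lower().strip(), answer.strip()


print(parse_qa_line(
    "- Siapa sutradara film Titanic? -> Film Titanic disutradarai oleh James Cameron."
))
print(parse_qa_line("Baris tanpa pemisah"))  # no separator, so None
```

Limiting the split to the first separator avoids a `ValueError` when a line happens to contain more than one `->`.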

Document Preprocessing

In [None]:
documents = []
for filepath in glob.glob('data/*.txt'):
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            if '->' in line:
                q, a = line.split('->', 1)
                # Add both question and answer to documents for chatbot corpus
                documents.append(q.strip())
                documents.append(a.strip())

preprocessed_docs = [preprocess(doc) for doc in documents]
print(preprocessed_docs[:3])


[['film', 'kali', 'diputar'], ['film', 'diputar', '1895', 'lumi', 're', 'bersaudara', 'pari'], ['film', 'bisu', 'terken']]


Search Engine + Chat Bot

In [None]:
!pip install -q nltk numpy scipy scikit-learn gradio

import os
import glob
import re
import nltk
import numpy as np
from collections import defaultdict
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import gradio as gr

nltk.download('stopwords')
stop_words = set(stopwords.words('indonesian'))

# --- Preprocessing ---
def cleantext(text):
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def tokenizetext(text):
    return text.split()

def removestopwordstokens(tokens):
    return [t for t in tokens if t not in stop_words]

def stemtokens(tokens):
    ps = PorterStemmer()
    return [ps.stem(t) for t in tokens]

def preprocess(text):
    text = cleantext(text)
    tokens = tokenizetext(text)
    tokens = removestopwordstokens(tokens)
    tokens = stemtokens(tokens)
    return tokens


# --- Boolean IR ---
class BooleanIR:
    def __init__(self):
        self.inverted_index = defaultdict(set)

    def build_index(self, documents):
        for doc_id, doc_tokens in enumerate(documents):
            for token in doc_tokens:
                self.inverted_index[token].add(doc_id)

    def query(self, q_tokens):
        result = None
        for token in q_tokens:
            docs = self.inverted_index.get(token, set())
            if result is None:
                result = docs
            else:
                result = result.intersection(docs)
        return result if result else set()


# --- VSM IR (TF-IDF) ---
class VSMIR:
    def __init__(self, docs_questions):
        self.vectorizer = TfidfVectorizer()
        self.doc_vectors = self.vectorizer.fit_transform(docs_questions)

    def rank(self, query, top_k=1):
        q_vec = self.vectorizer.transform([query])
        scores = cosine_similarity(q_vec, self.doc_vectors).flatten()
        # Honor top_k: take the highest-scoring documents, best first
        top_indices = scores.argsort()[::-1][:top_k]
        return [(i, scores[i]) for i in top_indices if scores[i] > 0]


# --- Combined search engine ---
class SearchEngine:
    def __init__(self, docs, questions):
        self.docs = docs
        self.questions = questions
        self.preprocessed_docs = [preprocess(doc) for doc in docs]
        self.boolean_ir = BooleanIR()
        self.boolean_ir.build_index(self.preprocessed_docs)
        self.vsm_ir = VSMIR(questions)

    def search(self, query, model='vsm', k=4):
        q_tokens = preprocess(query)
        if model == 'boolean':
            doc_ids = self.boolean_ir.query(q_tokens)
            return [(did, 1.0) for did in doc_ids]
        elif model == 'vsm':
            return self.vsm_ir.rank(query, top_k=k)


# --- Simple RAG chatbot ---
class RAGChatbot:
    def __init__(self, questions, documents, file_names):
        self.questions = questions
        self.documents = documents
        self.file_names = file_names
        self.engine = SearchEngine(documents, questions)

    def generate_answer(self, query, top_k=3):
        results = self.engine.search(query, model='vsm', k=top_k)
        if not results:
            return "Maaf, tidak ada informasi yang sesuai."
        answers = []
        for doc_id, score in results:
            snippet = self.documents[doc_id][:150]
            file_name = self.file_names[doc_id]
            answers.append(f"📄 **{file_name}** - (score: {score:.2f})\n{snippet}...")
        return "### Hasil Pencarian Teratas:\n" + "\n\n".join(answers)


# --- Read the dataset from the data/ folder ---
questions, answers, file_names = [], [], []

for filepath in glob.glob('data/*.txt'):
    file_name = os.path.basename(filepath)
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            if '->' in line:
                # Split on the first '->' only, so an answer may itself contain '->'
                q, a = line.split('->', 1)
                questions.append(q.strip("- ").lower().strip())
                answers.append(a.strip())
                file_names.append(file_name)

print(f"Loaded {len(questions)} questions and answers from dataset.")

# --- Create the chatbot instance ---
chatbot = RAGChatbot(questions, answers, file_names)


# --- Gradio UI ---
def chatbot_response(user_input):
    return chatbot.generate_answer(user_input)

with gr.Blocks(title="Chatbot Film") as demo:
    gr.Markdown("# 🎬 Chatbot Tentang Film\nTanyakan apa saja seputar film, genre, sutradara, dan aktor.")
    with gr.Row():
        question_input = gr.Textbox(label="Pertanyaan", placeholder="Contoh: Siapa sutradara film Titanic?")
        ask_button = gr.Button("Tanyakan")
    answer_output = gr.Markdown(label="Jawaban")
    reset_button = gr.Button("🔄 Reset")
    ask_button.click(chatbot_response, inputs=question_input, outputs=answer_output)
    reset_button.click(lambda: ("", ""), None, [question_input, answer_output])

demo.launch(share=True)


# --- Manual test ---
try:
    test_query = "Film aksi terbaik 2025"
    response = chatbot.generate_answer(test_query, top_k=3)
    print(response)

    test_query_2 = "Siapa sutradara film Titanic?"
    response_2 = chatbot.generate_answer(test_query_2, top_k=3)
    print("\n" + response_2)

except Exception as e:
    print(f"An error occurred: {e}")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Loaded 136 questions and answers from dataset.
Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://6b6e12f834b5b29c2a.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


### Hasil Pencarian Teratas:
📄 **Judul_Film_Kekinian_2024–2025.txt** - (score: 1.00)
Furiosa: A Mad Max Saga dan Rebel Ridge adalah film aksi dengan rating dan ulasan tinggi tahun ini....

### Hasil Pencarian Teratas:
📄 **Judul_Sutradara.txt** - (score: 1.00)
Film Titanic disutradarai oleh James Cameron....
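
The chatbot scores the stored questions against the query and then returns the answers found at the best-scoring indices. The selection step can be sketched without any library as follows. The function name `top_k_indices`, the toy scores, and the placeholder answers are assumptions made for illustration.

```python
def top_k_indices(scores, k):
    """Indices of the k highest scores, best first, skipping zero scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked[:k] if scores[i] > 0]


# Hypothetical similarity scores, one per stored question
toy_scores = [0.10, 0.00, 0.65, 1.00]
toy_answers = ["answer A", "answer B", "answer C", "answer D"]

for i in top_k_indices(toy_scores, k=3):
    print(f"{toy_answers[i]} (score: {toy_scores[i]:.2f})")
```

Dropping zero-score entries means a query that shares no vocabulary with the corpus produces an empty list, which the chatbot can turn into its "no matching information" message.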


Evaluation

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

# Evaluation queries and their ground-truth relevant documents (by document index)
queries = [
    "Film aksi terbaik 2025",
    "Siapa sutradara film Titanic?",
    "Film Indonesia rating tinggi",
    "Film animasi terbaru viral",
    "Film apa yang menang Oscar 2023?"
]

groundtruth_relevant_docs = [
    [115, 12, 103],   # Best action films of 2025 (indices of the relevant documents)
    [117, 120, 123],  # Who directed Titanic?
    [110, 58, 93],    # Highly rated Indonesian films
    [113, 104, 17],   # Trending animated films
    [37, 39, 36],     # Oscar winners of 2023
]

# Top-k retrieval
k = 3
all_precisions, all_recalls, all_f1s = [], [], []

for i, q in enumerate(queries):
    results = chatbot.engine.search(q, model='vsm', k=k)
    retrieved = [idx for idx, score in results]
    relevant = set(groundtruth_relevant_docs[i])

    # Count the retrieved documents that appear in the ground truth
    hits = sum(1 for doc in retrieved if doc in relevant)

    # Precision divides by the retrieved count; recall divides by the
    # number of relevant documents, so missed documents lower recall.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0

    all_precisions.append(precision)
    all_recalls.append(recall)
    all_f1s.append(f1)
    print(f"Query: {q}")
    print(f"  Retrieved doc: {retrieved}")
    print(f"  Groundtruth relevant: {groundtruth_relevant_docs[i]}")
    print(f"  Precision@{k}: {precision:.2f}, Recall@{k}: {recall:.2f}, F1@{k}: {f1:.2f}\n")

print("Rata-rata evaluasi semua query:")
print(f"  Precision@{k}: {np.mean(all_precisions):.2f}")
print(f"  Recall@{k}: {np.mean(all_recalls):.2f}")
print(f"  F1@{k}: {np.mean(all_f1s):.2f}")


Query: Film aksi terbaik 2025
  Retrieved doc: [np.int64(115), np.int64(12), np.int64(103)]
  Groundtruth relevant: [115, 12, 103]
  Precision@3: 1.00, Recall@3: 1.00, F1@3: 1.00

Query: Siapa sutradara film Titanic?
  Retrieved doc: [np.int64(117), np.int64(120), np.int64(123)]
  Groundtruth relevant: [117, 120, 123]
  Precision@3: 1.00, Recall@3: 1.00, F1@3: 1.00

Query: Film Indonesia rating tinggi
  Retrieved doc: [np.int64(110), np.int64(58), np.int64(93)]
  Groundtruth relevant: [110, 58, 93]
  Precision@3: 1.00, Recall@3: 1.00, F1@3: 1.00

Query: Film animasi terbaru viral
  Retrieved doc: [np.int64(113), np.int64(104), np.int64(17)]
  Groundtruth relevant: [113, 104, 17]
  Precision@3: 1.00, Recall@3: 1.00, F1@3: 1.00

Query: Film apa yang menang Oscar 2023?
  Retrieved doc: [np.int64(37), np.int64(39), np.int64(36)]
  Groundtruth relevant: [37, 39, 36]
  Precision@3: 1.00, Recall@3: 1.00, F1@3: 1.00

Rata-rata evaluasi semua query:
  Precision@3: 1.00
  Recall@3: 1.00
  F1@3: 

MAP@k and nDCG@k

In [None]:
def average_precision(relevant, retrieved):
    hits = 0
    sum_precisions = 0
    for n, doc in enumerate(retrieved, 1):
        if doc in relevant:
            hits += 1
            sum_precisions += hits / n
    return sum_precisions / max(1, len(relevant))

mapk = []
for i, q in enumerate(queries):
    results = chatbot.engine.search(q, model='vsm', k=k)
    retrieved = [idx for idx, score in results]
    relevant = set(groundtruth_relevant_docs[i])
    ap = average_precision(relevant, retrieved)
    mapk.append(ap)
    print(f"Query: {q} AP@{k}: {ap:.2f}")
print(f"MAP@{k}: {np.mean(mapk):.2f}")


Query: Film aksi terbaik 2025 AP@3: 1.00
Query: Siapa sutradara film Titanic? AP@3: 1.00
Query: Film Indonesia rating tinggi AP@3: 1.00
Query: Film animasi terbaru viral AP@3: 1.00
Query: Film apa yang menang Oscar 2023? AP@3: 0.00
MAP@3: 0.80
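
The cell above computes MAP@k only. For nDCG@k, a minimal sketch under binary relevance could look like the following. The function name `ndcg_at_k` is an assumption, the first call reuses the ground-truth indices of the first query, and the second ranking is hypothetical.

```python
import math


def ndcg_at_k(relevant, retrieved, k):
    """Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG."""
    # Each relevant document at rank r contributes 1 / log2(r + 1).
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    # The ideal ranking places all relevant documents at the top.
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


relevant = {115, 12, 103}
print(ndcg_at_k(relevant, [115, 12, 103], k=3))  # every result is relevant
print(ndcg_at_k(relevant, [115, 7, 103], k=3))   # hypothetical ranking with one miss
```

Unlike Precision@k, nDCG@k rewards placing relevant documents earlier in the list, so a miss at rank 2 costs more than a miss at rank 3.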



In [None]:
for i, q in enumerate(queries):
    results = chatbot.engine.search(q, model='vsm', k=3)
    print(f"Query: {q}")
    for idx, score in results:
        print(f"  Hasil: {answers[idx]} (index: {idx}, score: {score})")
    print("Groundtruth:", [answers[j] for j in groundtruth_relevant_docs[i]])


Query: Film aksi terbaik 2025
  Hasil: Furiosa: A Mad Max Saga dan Rebel Ridge adalah film aksi dengan rating dan ulasan tinggi tahun ini. (index: 115, score: 1.0000000000000002)
  Hasil: Film aksi terbaik antara lain John Wick, Mad Max: Fury Road, dan The Dark Knight. (index: 12, score: 0.649531455020392)
  Hasil: Furiosa: A Mad Max Saga, Rebel Ridge, The Fall Guy, dan Captain America: Brave New World adalah film aksi dengan rating, efek visual, dan ulasan sangat tinggi tahun 2025. (index: 103, score: 0.5698748754408205)
Groundtruth: ['Furiosa: A Mad Max Saga dan Rebel Ridge adalah film aksi dengan rating dan ulasan tinggi tahun ini.', 'Film aksi terbaik antara lain John Wick, Mad Max: Fury Road, dan The Dark Knight.', 'Furiosa: A Mad Max Saga, Rebel Ridge, The Fall Guy, dan Captain America: Brave New World adalah film aksi dengan rating, efek visual, dan ulasan sangat tinggi tahun 2025.']
Query: Siapa sutradara film Titanic?
  Hasil: Film Titanic disutradarai oleh James Cameron. (ind