# Introduction
1. This system relies on semantic similarity: it finds the passage in the document that is most similar to the user's question.
2. If the user's question is phrased very differently from the way the information is expressed in the document, the system may not find the correct answer.
3. Basic functionality covers:
    * Extract text from PDF documents.
    * Perform semantic search to find relevant chunks of text.
    * Clean the output to remove unwanted content.
    * Provide an answer to the user's question (even if the answer is not always perfect).
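
Points 1 and 2 can be illustrated with a toy example. The sketch below is not the notebook's embedding model: it uses plain word counts as a stand-in "embedding" (the helper names `bow_vector`, `cosine_similarity`, and `most_similar` and the sample passages are made up for illustration). A real sentence embedding captures more than exact word overlap, but the ranking idea is the same: score every passage against the question and keep the highest score.

```python
import math
from collections import Counter

def bow_vector(text):
    """Toy 'embedding': lowercase word counts (stand-in for a sentence embedding)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(question, passages):
    """Return (score, passage) for the passage closest to the question."""
    q = bow_vector(question)
    scored = [(cosine_similarity(q, bow_vector(p)), p) for p in passages]
    return max(scored, key=lambda pair: pair[0])

# Made-up passages, for illustration only
passages = [
    "dana bos digunakan untuk kegiatan operasional sekolah",
    "laporan realisasi disampaikan setiap tahun",
]
score, best = most_similar("untuk apa dana bos digunakan", passages)
miss_score, _ = most_similar("pembelian buku", passages)
```

The second query shares nothing with either passage, so its best score is zero: this is the failure mode described in point 2, in its most extreme form.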



## Further Development
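1. Clarify expectations. Example:
    * Chatbot: "Dana BOS digunakan untuk membiayai kegiatan operasional sekolah. Apakah Anda ingin mengetahui contoh kegiatan operasional yang dapat dibiayai oleh Dana BOS?" ("Dana BOS is used to finance school operational activities. Would you like to see examples of operational activities that can be financed by Dana BOS?")
2. Provide a list of example questions that the user can ask, so they can see the types of questions the chatbot answers well. Example:
    * Apa saja syarat pengajuan Dana BOS? (What are the requirements for applying for Dana BOS?)
    * Bagaimana cara melaporkan penggunaan Dana BOS? (How is the use of Dana BOS reported?)
    * Sebutkan contoh kegiatan yang dapat dibiayai oleh Dana BOS. (Give examples of activities that can be financed by Dana BOS.)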
3. Keyword Suggestions: As the user types their question, suggest relevant keywords that they can include to make their question more specific.
4. Intent Recognition (Advanced): Implement a simple intent recognition system. This would analyze the user's question and try to identify the intent behind it (e.g., "find allowed uses," "find reporting requirements"). Based on the intent, the chatbot could automatically rephrase the question to be more targeted. This requires more advanced natural language processing techniques.
5. Expand the Knowledge Base (If Possible): If you have the ability to add more documents to the system, try to find ones that explicitly list the allowed uses of Dana BOS in a clear and structured way. This will make it easier for the semantic search to find the right information.
6. Hybrid Approach (Advanced): Consider combining this semantic search approach with a more traditional keyword-based search. If the semantic search fails to find a good answer, the chatbot could fall back to a keyword search to find any relevant documents and present them to the user.
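
The hybrid approach in item 6 could look roughly like the sketch below. It assumes a hypothetical `semantic_search(question, chunks)` callable that returns `(best_chunk, score)`; that callable, the `min_score` threshold, and the helper names are illustrative and not part of this notebook.

```python
import re

def keyword_search(question, chunks):
    """Return chunks containing at least one question word longer than 3 characters."""
    terms = {w for w in re.findall(r"\w+", question.lower()) if len(w) > 3}
    return [c for c in chunks if any(t in c.lower() for t in terms)]

def hybrid_answer(question, chunks, semantic_search, min_score=0.5):
    """Try semantic search first; fall back to keyword search on a weak match.

    `semantic_search` is a hypothetical callable returning (best_chunk, score).
    """
    best_chunk, score = semantic_search(question, chunks)
    if best_chunk is not None and score >= min_score:
        return {"source": "semantic", "results": [best_chunk]}
    return {"source": "keyword", "results": keyword_search(question, chunks)}

# Made-up chunks and stub searchers, for illustration only
chunks = [
    "Dana BOS digunakan untuk operasional sekolah.",
    "Laporan disampaikan paling lambat bulan Januari.",
]

def weak_search(question, chunks):
    return chunks[0], 0.1   # pretend the semantic match is poor

def strong_search(question, chunks):
    return chunks[0], 0.9   # pretend the semantic match is good

fallback = hybrid_answer("Kapan laporan disampaikan?", chunks, weak_search)
direct = hybrid_answer("Untuk apa Dana BOS?", chunks, strong_search)
```

Returning the `source` alongside the results lets the chatbot tell the user when it is showing keyword matches instead of a semantic answer.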

# Import Library

In [9]:
!pip install pymupdf nltk transformers sentence-transformers faiss-cpu scikit-learn

import os
import re
import json
import fitz
import nltk
import faiss
import requests
import numpy as np

from nltk import sent_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer, CrossEncoder

# Download NLTK resources
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Data Gathering

In [10]:
# ===============================
# 1. DATA GATHERING
# ===============================

GITHUB_RAW_URL = "https://raw.githubusercontent.com/esnanta/ai-chatbot-dana-bos-api/main/knowledge_base/"
FILES = [
    "Permendikbudriset_No_63_Tahun_2023.pdf",
]

# Storage directory in Colab
pdf_dir = "/content/pdf_files"
os.makedirs(pdf_dir, exist_ok=True)

# Download the PDF files from GitHub
for file in FILES:
    file_url = GITHUB_RAW_URL + file
    try:
        response = requests.get(file_url, timeout=30)  # Avoid hanging on a stalled connection
        response.raise_for_status()  # Raise HTTPError for bad responses (4xx or 5xx)
        with open(os.path.join(pdf_dir, file), "wb") as f:
            f.write(response.content)
        print(f"✅ Berhasil mengunduh: {file}")
    except requests.exceptions.RequestException as e:
        print(f"❌ Gagal mengunduh {file}: {e}")
    except Exception as e:
        print(f"❌ Kesalahan tak terduga saat mengunduh {file}: {e}")


# Check which files have been downloaded
print(f"Daftar file di {pdf_dir}: {os.listdir(pdf_dir)}")

✅ Berhasil mengunduh: Permendikbudriset_No_63_Tahun_2023.pdf
Daftar file di /content/pdf_files: ['chunks.json', 'cleaned_texts.json', 'Permendikbudriset_No_63_Tahun_2023.pdf']


In [11]:
# ===============================
# 2. TEXT EXTRACTION FROM PDF FILES
# ===============================

# --- PDF Text Extraction ---
def extract_text_from_pdf(pdf_path):
    """Extracts text from a PDF file."""
    try:
        with fitz.open(pdf_path) as doc:  # Use context manager for safety
            text = ""
            for page in doc:
                text += page.get_text("text") + "\n"
        return text.strip()
    except Exception as e:
        raise RuntimeError(f"Gagal mengekstrak teks dari {pdf_path}: {e}") from e


pdf_files = [f for f in os.listdir(pdf_dir) if f.endswith(".pdf")]
pdf_texts = {}

for pdf_file in pdf_files:
    pdf_path = os.path.join(pdf_dir, pdf_file)
    try:
        text = extract_text_from_pdf(pdf_path)
        pdf_texts[pdf_file] = text
        print(f"✅ Berhasil mengekstrak teks dari: {pdf_file}")
    except RuntimeError as err:  # Do not name this `re`: that would shadow the regex module
        print(f"❌ Kesalahan saat ekstraksi: {err}")
    except Exception as e:
        print(f"❌ Kesalahan tidak terduga pada {pdf_file}: {e}")


# Check the extraction results
print(f"\nTotal file yang berhasil diekstrak: {len(pdf_texts)}")

✅ Berhasil mengekstrak teks dari: Permendikbudriset_No_63_Tahun_2023.pdf

Total file yang berhasil diekstrak: 1


# Preprocessing Data

In [12]:
# ===============================
# 3. TEXT PREPROCESSING
# ===============================

def clean_text(text):
    """Normalize whitespace and strip boilerplate from extracted PDF text."""

    # Replace every run of whitespace (spaces, tabs, newlines) with a single
    # space so the text becomes one continuous line with consistent spacing.
    text = re.sub(r'\s+', ' ', text).strip()

    # Turn "Pasal 17. " into "Pasal 17 ": dropping the period after the
    # article number keeps the sentence tokenizer from splitting there, so
    # "Pasal 17" stays attached to the text that follows it.
    text = re.sub(r'Pasal (\d+)\.\s', r'Pasal \1 ', text)

    # Same for "Ayat (2). ": remove the period, keep the parenthesized number
    text = re.sub(r'Ayat \((\d+[a-z]?)\)\.\s', r'Ayat (\1) ', text)

    text = re.sub(r'http\S+|www\S+', '', text, flags=re.IGNORECASE)  # Remove URLs
    text = re.sub(r'jdih\.kemdikbud\.go\.id', '', text, flags=re.IGNORECASE)  # Remove the JDIH site name

    # Replace page-number markers such as '- 4 -' with '(page 4)'
    text = re.sub(r'\s-\s(\d+)\s-\s', r' (page \1) ', text)

    return text

cleaned_texts = {pdf: clean_text(text) for pdf, text in pdf_texts.items()}
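
To see what these substitutions do, here is a small trace on a made-up sentence (the sample text and the `www.example.com` address are invented for illustration; they are not from the regulation). The whitespace collapse is applied last in this sketch so that the gap left by the removed URL is cleaned up as well.

```python
import re

# Made-up input, for illustration only
sample = "Pasal 17. Dana BOSP dikelola oleh sekolah. - 4 - Lihat www.example.com untuk rincian."

# Drop the period after the article number so "Pasal 17" stays with its sentence
step1 = re.sub(r'Pasal (\d+)\.\s', r'Pasal \1 ', sample)

# Remove URLs
step2 = re.sub(r'http\S+|www\S+', '', step1, flags=re.IGNORECASE)

# Turn the page marker '- 4 -' into '(page 4)'
step3 = re.sub(r'\s-\s(\d+)\s-\s', r' (page \1) ', step2)

# Collapse the double space left behind by the URL removal
cleaned = re.sub(r'\s+', ' ', step3).strip()

print(cleaned)
# Pasal 17 Dana BOSP dikelola oleh sekolah. (page 4) Lihat untuk rincian.
```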

In [13]:
# ===============================
# 4. TEXT CHUNKING
# Splits text into smaller chunks.
# ===============================

def chunk_text(text, chunk_size=500):
    """Greedily pack whole sentences into chunks of about chunk_size characters.

    Chunks of 100 characters or fewer are dropped, not merged, so their text
    is lost. A single sentence longer than chunk_size becomes its own chunk.
    """
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = ""

    for sentence in sentences:
        if len(current_chunk) + len(sentence) + 1 <= chunk_size:
            current_chunk += sentence + " "
        else:
            if len(current_chunk) > 100:  # Skip chunks that are too small
                chunks.append(current_chunk.strip())
            current_chunk = sentence + " "

    if len(current_chunk) > 100:  # Skip a final chunk that is too small
        chunks.append(current_chunk.strip())

    return chunks

def clean_chunk(chunk):
    # Remove digits, hyphens, or bullets at the start of the chunk
    chunk = re.sub(r'^\s*[\d\-\•]+', '', chunk)
    # Remove a standalone number left at the end of the chunk
    chunk = re.sub(r'\s*\d+\s*$', '', chunk)
    return chunk.strip()

def filter_irrelevant_text(chunk):
    irrelevant_patterns = [
        r'\(\d+\)\s*Dihapus',  # "(3) Dihapus", "(4) Dihapus" (deleted clauses)
        r'-\d+-',  # "-9-", "-11-" (probably page numbers)
        r'Pasal\s*\d+',  # "Pasal 52" (note: removes every article reference, with or without context)
        r'^\s*\.\s*$',  # A chunk that consists only of a period
    ]

    for pattern in irrelevant_patterns:
        chunk = re.sub(pattern, '', chunk)

    # Remove standalone numbers unless they are followed by ')' or '}',
    # so clause markers such as "(2)" are kept
    chunk = re.sub(r'\b\d+\b(?![\)}])', ' ', chunk)

    # Collapse extra whitespace
    chunk = re.sub(r'\s+', ' ', chunk).strip()

    return chunk

all_chunks = []
for pdf, text in cleaned_texts.items():
    chunks = chunk_text(text)  # 1️⃣ Chunk first
    cleaned_chunks = [clean_chunk(chunk) for chunk in chunks]  # 2️⃣ Strip leading and trailing numbers
    final_chunks = [filter_irrelevant_text(chunk) for chunk in cleaned_chunks]  # 3️⃣ Remove irrelevant parts
    all_chunks.extend(final_chunks)

print(f"Total chunks: {len(all_chunks)}")

Total chunks: 66
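
The greedy packing used by `chunk_text` can be traced on a tiny input. The sketch below is a simplified variant, not the notebook's function: it swaps `sent_tokenize` for a naive regex splitter (`split_sentences`, a hypothetical helper) so that it runs without NLTK, and it keeps small chunks instead of dropping those of 100 characters or fewer.

```python
import re

def split_sentences(text):
    """Naive splitter on '.', '?' or '!' followed by whitespace (stand-in for sent_tokenize)."""
    return [s for s in re.split(r'(?<=[.?!])\s+', text.strip()) if s]

def pack_sentences(sentences, chunk_size):
    """Greedily pack whole sentences into chunks of at most chunk_size characters.

    A sentence longer than chunk_size is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

# Made-up text, for illustration only
text = "Satu dua tiga. Empat lima enam. Tujuh delapan. Sembilan sepuluh sebelas dua belas."
chunks = pack_sentences(split_sentences(text), chunk_size=32)
for chunk in chunks:
    print(len(chunk), chunk)
# 31 Satu dua tiga. Empat lima enam.
# 14 Tujuh delapan.
# 35 Sembilan sepuluh sebelas dua belas.
```

The first two sentences fit together within 32 characters, the third would overflow so it starts a new chunk, and the last sentence exceeds the limit on its own and is kept whole.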


# SAVING DATA #1

* Chunk File
* Cleaned Texts File

In [14]:
# ===============================
# 5. SAVING DATA
# ===============================

# Define file paths for saving data
chunks_file = os.path.join(pdf_dir, "chunks.json")  # Path to save chunks
cleaned_texts_file = os.path.join(pdf_dir, "cleaned_texts.json") # Path to save cleaned texts

# --------------------------------------
# 1. Saving the Chunks of Text
# --------------------------------------
try:
    with open(chunks_file, "w", encoding="utf-8") as f:
        json.dump(all_chunks, f, ensure_ascii=False, indent=4)
    print(f"Chunks saved to: {chunks_file}")
except Exception as e:
    print(f"Error saving chunks: {e}")

# --------------------------------------
# 2. Saving the Cleaned PDF Texts
# --------------------------------------
try:
    with open(cleaned_texts_file, "w", encoding="utf-8") as f:
        json.dump(cleaned_texts, f, ensure_ascii=False, indent=4)
    print(f"Cleaned texts saved to: {cleaned_texts_file}")
except Exception as e:
    print(f"Error saving cleaned texts: {e}")

Chunks saved to: /content/pdf_files/chunks.json
Cleaned texts saved to: /content/pdf_files/cleaned_texts.json
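
A quick way to confirm that the saved JSON can be read back unchanged is a round trip like the one below. This is a sketch that writes to a temporary directory with made-up chunks, not to the notebook's `pdf_dir`. It also shows the effect of `ensure_ascii=False`: non-ASCII characters are stored as-is and not as `\u` escapes.

```python
import json
import os
import tempfile

# Made-up chunks, for illustration only
chunks = ["Dana BOSP dikelola oleh Satuan Pendidikan.", "• butir pertama dalam daftar"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "chunks.json")

    # Save with the same options the notebook uses
    with open(path, "w", encoding="utf-8") as f:
        json.dump(chunks, f, ensure_ascii=False, indent=4)

    # Read back as parsed JSON and as raw text
    with open(path, "r", encoding="utf-8") as f:
        reloaded = json.load(f)
    with open(path, "r", encoding="utf-8") as f:
        raw = f.read()

print(reloaded == chunks)  # True when the round trip preserved the data
```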


# LOAD MODEL

In [16]:
# ===============================
# 6. MODEL & FAISS INDEXING
# ===============================

# Load model
cross_encoder_model = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-6")
embedder = SentenceTransformer("paraphrase-MiniLM-L3-v2")

# Encode all text chunks
chunk_embeddings = embedder.encode(all_chunks, convert_to_numpy=True)

d = chunk_embeddings.shape[1]  # Embedding dimension (384 for this MiniLM model)

# The number of clusters must not exceed the number of vectors.
# Use one cluster per 4 chunks, capped at 100 and never fewer than 1.
NLIST = max(1, min(100, len(chunk_embeddings) // 4))
NPROBE = 10  # Number of clusters searched per query

# Build the quantizer and the FAISS IVF index
quantizer = faiss.IndexFlatL2(d)  # Quantizer used for clustering
index = faiss.IndexIVFFlat(quantizer, d, NLIST, faiss.METRIC_L2)

# Train the FAISS IVF index on all embeddings
if not index.is_trained:
    print("Training FAISS IVF index...")
    index.train(chunk_embeddings)

# Add the embeddings to the FAISS index
index.add(chunk_embeddings)
print(f"FAISS IVF Index siap dengan {index.ntotal} data.")

Training FAISS IVF index...
FAISS IVF Index siap dengan 66 data.
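
In an IVF index, `nlist` is the number of clusters the vectors are partitioned into and `nprobe` is how many of those clusters are scanned per query. The sizing rule used in this cell can be isolated into small helpers, sketched below. The helper names and the lower bound of 1 are assumptions for illustration; they keep tiny corpora from producing a cluster count of zero or an `nprobe` larger than `nlist`.

```python
def choose_nlist(n_vectors, max_clusters=100, vectors_per_cluster=4):
    """Roughly one cluster per 4 vectors, capped at 100 and never below 1."""
    return max(1, min(max_clusters, n_vectors // vectors_per_cluster))

def clamp_nprobe(nprobe, nlist):
    """Scanning more clusters than exist is pointless, so cap nprobe at nlist."""
    return max(1, min(nprobe, nlist))

for n in (3, 66, 1000):
    nlist = choose_nlist(n)
    print(n, nlist, clamp_nprobe(10, nlist))
# 3 1 1
# 66 16 10
# 1000 100 10
```

For the 66 chunks in this notebook the rule gives 16 clusters, of which 10 are scanned per query. With a corpus this small, an exact flat index such as `faiss.IndexFlatL2` is a reasonable alternative that needs no training.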


# SAVING DATA #2

* Embedding File
* FAISS Index File

In [17]:
# ===============================
# 7. SAVE EMBEDDINGS & INDEX
# ===============================

# Define file paths for saving data
embedding_file = os.path.join(pdf_dir, "chunk_embeddings.npy")
faiss_index_file = os.path.join(pdf_dir, "faiss_index.bin")

# --------------------------------------
# a. Saving the embedding as .npy
# --------------------------------------
try:
    np.save(embedding_file, chunk_embeddings)
    print(f"Embedding saved to: {embedding_file}")
except Exception as e:
    print(f"Error saving embeddings: {e}")

# --------------------------------------
# b. Saving FAISS file
# --------------------------------------
try:
    faiss.write_index(index, faiss_index_file)
    print(f"FAISS IVF index saved to: {faiss_index_file}")
except Exception as e:
    print(f"Error saving FAISS index: {e}")

Embedding saved to: /content/pdf_files/chunk_embeddings.npy
FAISS IVF index saved to: /content/pdf_files/faiss_index.bin


# LOAD EMBEDDINGS & INDEX

In [18]:
# ===============================
# 8. LOAD EMBEDDINGS & INDEX
# ===============================

# Load the embeddings from file
if os.path.exists(embedding_file):
    chunk_embeddings = np.load(embedding_file)
    print(f"Loaded embeddings from: {embedding_file}")
else:
    raise FileNotFoundError(f"Embedding file {embedding_file} not found.")

# Load the FAISS index from file
if os.path.exists(faiss_index_file):
    index = faiss.read_index(faiss_index_file)
    print(f"Loaded FAISS index from: {faiss_index_file}")
else:
    raise FileNotFoundError(f"FAISS index file {faiss_index_file} not found.")

Loaded embeddings from: /content/pdf_files/chunk_embeddings.npy
Loaded FAISS index from: /content/pdf_files/faiss_index.bin


# Answer

In [19]:
# ===============================
# 9. UTILITY FUNCTIONS
# ===============================

# Extract keywords from the question, excluding Indonesian stopwords
def extract_keywords(question, top_n=5):
    vectorizer = TfidfVectorizer(stop_words=stopwords.words('indonesian'))
    try:
        tfidf_matrix = vectorizer.fit_transform([question])
    except ValueError:
        # Raised when the question contains only stopwords (empty vocabulary)
        return set()

    feature_array = np.array(vectorizer.get_feature_names_out())
    # With a single document, TF-IDF reduces to normalized term frequency
    tfidf_sorting = np.argsort(tfidf_matrix.toarray()).flatten()[::-1]

    return set(feature_array[tfidf_sorting][:top_n])

# Filter chunks by the keywords found in the question
def filter_chunks_by_keywords(question, chunks):
    keywords = extract_keywords(question)

    if not keywords:  # No keywords found: use all chunks
        return chunks

    filtered_chunks = [chunk for chunk in chunks if any(keyword.lower() in chunk.lower() for keyword in keywords)]
    return filtered_chunks if filtered_chunks else chunks  # Nothing matched: fall back to all chunks

# Main function for answering a question with FAISS IVF
def answer_question(question, chunks, index_faiss, embedder, cross_encoder_model, top_n=3):

    # Note: the keyword filter only acts as a guard here; the FAISS search
    # below still runs over every indexed chunk.
    filtered_chunks = filter_chunks_by_keywords(question, chunks)
    if not filtered_chunks:
        return "Maaf, saya tidak dapat menemukan informasi yang sesuai."

    # Embed only the question
    question_embedding = embedder.encode([question], convert_to_numpy=True)

    # Set nprobe before searching
    index_faiss.nprobe = NPROBE

    # Retrieve the nearest chunks with FAISS IVF
    D, I = index_faiss.search(question_embedding, min(top_n * 2, len(chunks)))
    # FAISS pads missing results with -1, so keep only valid indices
    candidates = [chunks[i] for i in I[0] if 0 <= i < len(chunks)]
    if not candidates:
        return "Maaf, saya tidak dapat menemukan informasi yang sesuai."

    # Use the cross-encoder to pick the best chunks
    pairs = [(question, chunk) for chunk in candidates]
    scores = cross_encoder_model.predict(pairs)
    top_indices = np.argsort(scores)[::-1][:top_n]

    return "\n".join([candidates[i] for i in top_indices])

def post_process_answer(answer):
    sentences = sent_tokenize(answer)

    # Remove duplicate sentences while keeping their original order
    unique_sentences = list(dict.fromkeys(sentences))

    # Format as a bulleted list, skipping fragments of 10 characters or fewer
    bulleted_list = "\n".join([f"* {sentence.strip()}" for sentence in unique_sentences if len(sentence.strip()) > 10])

    return bulleted_list
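
The retrieve-then-rerank step in `answer_question` can be exercised without loading any model. In the sketch below, `rerank` and `word_overlap` are hypothetical helpers: `word_overlap` stands in for the cross-encoder, and the id list imitates a FAISS result that contains a `-1` placeholder (FAISS returns `-1` when it finds fewer neighbors than requested). Without the `0 <= i` check, Python would read `-1` as "the last chunk".

```python
def rerank(question, chunks, retrieved_ids, score_fn, top_n=3):
    """Keep valid ids from first-stage retrieval, then reorder them with score_fn.

    score_fn(question, chunk) -> float is a hypothetical stand-in for the cross-encoder.
    """
    # Drop -1 placeholders and any out-of-range ids
    valid_ids = [i for i in retrieved_ids if 0 <= i < len(chunks)]
    candidates = [chunks[i] for i in valid_ids]
    ranked = sorted(candidates, key=lambda c: score_fn(question, c), reverse=True)
    return ranked[:top_n]

def word_overlap(question, chunk):
    """Toy relevance score: number of words shared by question and chunk."""
    return len(set(question.lower().split()) & set(chunk.lower().split()))

# Made-up chunks, for illustration only
chunks = [
    "laporan realisasi disampaikan setiap tahun",
    "dana bos kinerja digunakan untuk pengembangan",
    "sekolah menyusun rencana kegiatan",
]
top = rerank("dana bos kinerja", chunks, [2, 1, -1, 0], word_overlap, top_n=2)
print(top[0])
# dana bos kinerja digunakan untuk pengembangan
```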

# TESTING
The chatbot retrieves and displays the 3 text chunks most similar to the user's question as the answer.

In [21]:
# ===============================
# 10. TESTING CHATBOT
# ===============================

# Example questions
test_questions = [
    "Apakah Dana BOSP dapat digunakan untuk pengembangan sumber daya manusia?",
    "Untuk apa saja Dana BOS Kinerja dapat digunakan?",
    "Kapan laporan realisasi penggunaan Dana BOSP harus disampaikan?"
]

for question in test_questions:
    raw_answer = answer_question(question, all_chunks, index, embedder, cross_encoder_model, top_n=3)
    processed_answer = post_process_answer(raw_answer)

    print(f"\n🔹 **Pertanyaan:** {question}")
    print(f"🔸 **Jawaban:**\n{processed_answer}\n")


🔹 **Pertanyaan:** Apakah Dana BOSP dapat digunakan untuk pengembangan sumber daya manusia?
🔸 **Jawaban:**
* Rincian Komponen Penggunaan Dana BOS Kinerja Sekolah yang Melaksanakan Program Sekolah Penggerak a. Pengembangan sumber daya manusia merupakan komponen yang digunakan untuk pembiayaan dalam kegiatan penguatan sumber daya manusia dalam rangka pelaksanaan Program Sekolah Penggerak, seperti: 1) identifikasi, pemetaan potensi dan kebutuhan pelatihan; 2) penguatan pelatihan griyaan (in house training) di Satuan Pendidikan; 3) penguatan komunitas belajar bagi kepala Satuan Pendidikan dan pendidik; 4) pelatihan bersama komunitas belajar; 5) pelaksanaan diskusi terpumpun bersama dengan guru SD kelas awal; 6) peningkatan kapasitas literasi digital; dan/ atau 7) kegiatan lainnya yang relevan dalam rangka pelaksanaan pengembangan sumber daya manusia.
* Rincian Komponen Penggunaan Dana BOP PAUD Kinerja Sekolah yang Melaksanakan Program Sekolah Penggerak a. Pengembangan sumber daya manusia m