# Introduction
1. This system relies on semantic similarity: it retrieves the passages in the document whose meaning is closest to the user's question.
2. If the user's question is phrased very differently from the way the information is expressed in the document, the system may not find the correct answer.
3. Basic functionality:
    * Extract text from PDF documents.
    * Perform semantic search to find relevant chunks of text.
    * Clean the output to remove unwanted content.
    * Provide an answer to the user's question (even if the answer is not always perfect).
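
Points 1 and 2 can be illustrated with a minimal sketch of similarity-based ranking. The three-dimensional vectors and the chunk labels below are made up for illustration; in the notebook the vectors come from a sentence-embedding model and have hundreds of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, one per document chunk.
chunk_vectors = {
    "allowed uses of the funds": [0.9, 0.1, 0.0],
    "reporting deadlines": [0.1, 0.9, 0.2],
    "definitions": [0.2, 0.2, 0.9],
}

# Hypothetical embedding of the user's question.
question_vector = [0.8, 0.2, 0.1]

# Rank chunks from most to least similar to the question.
ranked = sorted(
    chunk_vectors,
    key=lambda name: cosine(question_vector, chunk_vectors[name]),
    reverse=True,
)
best = ranked[0]
print(best)
```

The question vector points in nearly the same direction as the first chunk, so that chunk ranks first. A question whose embedding lands far from every chunk would still return a "best" match, which is why a poorly phrased question can produce a wrong answer.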



## Further Development
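1. Clarify expectations by having the chatbot offer a follow-up question. Example:
    * Chatbot: "Dana BOS digunakan untuk membiayai kegiatan operasional sekolah. Apakah Anda ingin mengetahui contoh kegiatan operasional yang dapat dibiayai oleh Dana BOS?" ("BOS funds are used to finance school operational activities. Would you like to see examples of operational activities that BOS funds can finance?")
2. Provide a list of example questions that the user can ask, so they can see the types of questions the chatbot answers well. Example:
    * Apa saja syarat pengajuan Dana BOS? (What are the requirements for applying for BOS funds?)
    * Bagaimana cara melaporkan penggunaan Dana BOS? (How is the use of BOS funds reported?)
    * Sebutkan contoh kegiatan yang dapat dibiayai oleh Dana BOS. (Give examples of activities that BOS funds can finance.)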
3. Keyword Suggestions: As the user types their question, suggest relevant keywords that they can include to make their question more specific.
4. Intent Recognition (Advanced): Implement a simple intent recognition system. This would analyze the user's question and try to identify the intent behind it (e.g., "find allowed uses," "find reporting requirements"). Based on the intent, the chatbot could automatically rephrase the question to be more targeted. This requires more advanced natural language processing techniques.
5. Expand the Document Collection (If Possible): If you can add more documents to the system, look for ones that explicitly list the allowed uses of Dana BOS in a clear, structured way. This makes it easier for the semantic search to find the right information.
6. Hybrid Approach (Advanced): Consider combining this semantic search approach with a more traditional keyword-based search. If the semantic search fails to find a good answer, the chatbot could fall back to a keyword search to find any relevant documents and present them to the user.
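
Items 4 and 6 could be prototyped roughly as follows. This is only a sketch under stated assumptions: the intent names, the keyword lists, the `min_score` threshold, and all function names are hypothetical and not part of the notebook, and the rules are simple keyword matching rather than a trained intent classifier.

```python
import re

# Hypothetical intent rules: intent name -> trigger keywords (Indonesian).
INTENT_KEYWORDS = {
    "allowed_uses": {"digunakan", "membiayai", "dibiayai"},
    "reporting": {"laporan", "melaporkan", "pelaporan"},
    "requirements": {"syarat", "persyaratan"},
}

def tokenize(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def detect_intent(question):
    """Return the intent sharing the most keywords with the question, or None.

    Ties are resolved in favor of the intent listed first in INTENT_KEYWORDS.
    """
    tokens = tokenize(question)
    scores = {name: len(tokens & words) for name, words in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def keyword_fallback(question, chunks, top_n=3):
    """Rank chunks by how many word tokens they share with the question."""
    q_tokens = tokenize(question)
    scored = [(len(q_tokens & tokenize(chunk)), chunk) for chunk in chunks]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]

def hybrid_answer(question, semantic_hits, chunks, min_score=0.5):
    """Keep semantic hits scoring at least min_score; otherwise use keywords.

    semantic_hits is a list of (similarity, chunk) pairs, best first.
    """
    good = [chunk for score, chunk in semantic_hits if score >= min_score]
    return good if good else keyword_fallback(question, chunks)

# Usage with made-up chunks and a made-up low-confidence semantic hit.
sample_chunks = [
    "Dana BOS digunakan untuk operasional sekolah.",
    "Laporan disampaikan setiap tahun.",
    "Ketentuan penutup.",
]
question = "Kapan laporan disampaikan?"
intent = detect_intent(question)
answer = hybrid_answer(question, [(0.2, sample_chunks[0])], sample_chunks)
print(intent, answer)
```

The fallback only triggers when no semantic hit clears the threshold, so the keyword search never overrides a confident semantic match. A suitable threshold would have to be tuned on real questions.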

# Import Library

In [None]:
!pip install pymupdf nltk sastrawi transformers sentence-transformers

import os
import re
import json
import fitz
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer
from google.colab import drive
from sklearn.metrics.pairwise import cosine_similarity

# Download the required NLTK resources
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

# Data Gathering

In [None]:
# ===============================
# 1. MOUNT GOOGLE DRIVE & CHECK FILES
# ===============================

# Mount Google Drive
drive.mount('/content/drive')

# Path to the directory that stores the PDF files
pdf_dir = "/content/drive/My Drive/Colab Notebooks/AI Chatbot Berbasis Regulasi"

# Check that the directory exists
if not os.path.exists(pdf_dir):
    raise FileNotFoundError(f"Direktori {pdf_dir} tidak ditemukan! Periksa kembali path-nya.")
else:
    print(f"Direktori ditemukan! Daftar file PDF: {os.listdir(pdf_dir)}")


Mounted at /content/drive
Direktori ditemukan! Daftar file PDF: ['Permendikbudriset No. 63 Tahun 2023.pdf', 'Untitled folder', 'chunks.json', 'embeddings.npy', 'cleaned_texts.json']


In [None]:
# ===============================
# 2. EXTRACT TEXT FROM PDF FILES
# ===============================

# --- PDF Text Extraction ---
def extract_text_from_pdf(pdf_path):
    """Extract the plain text of every page in a PDF file."""
    # Open the document in a context manager so the file handle is closed.
    with fitz.open(pdf_path) as doc:
        pages = [page.get_text("text") for page in doc]
    return "\n".join(pages).strip()

pdf_files = [f for f in os.listdir(pdf_dir) if f.endswith(".pdf")]
pdf_texts = {}

for pdf_file in pdf_files:
    pdf_path = os.path.join(pdf_dir, pdf_file)
    try:
        text = extract_text_from_pdf(pdf_path)
        pdf_texts[pdf_file] = text
        print(f"Extracted text from: {pdf_file}")
    except Exception as e:
        print(f"Error extracting text from {pdf_file}: {e}")

Extracted text from: Permendikbudriset No. 63 Tahun 2023.pdf


# Preprocessing Data

In [None]:
# ===============================
# 3. TEXT PREPROCESSING
# ===============================

def clean_text(text):
    """Normalize whitespace and strip layout noise from extracted PDF text."""

    # Collapse runs of consecutive newlines into a single newline.
    # (The next substitution subsumes this step.)
    text = re.sub(r'\n+', '\n', text)

    # Replace every run of whitespace (spaces, tabs, newlines) with a single
    # space, so the whole text becomes one continuous line.
    text = re.sub(r'\s+', ' ', text).strip()

    # This line finds instances like "Pasal 17." and replaces them with
    # "Pasal 17 ". It removes the dot after the number and ensures
    # there is space. This prevents the sentence tokenizer from incorrectly
    # splitting "Pasal 17." into two sentences. It's important to keep
    # "Pasal 17" together as a single unit.
    text = re.sub(r'Pasal (\d+)\.\s', r'Pasal \1 ', text)

    # Same idea for "Ayat (2)." -> "Ayat (2) ": drop the dot but keep the
    # parenthesized number.
    text = re.sub(r'Ayat \((\d+[a-z]?)\)\.\s', r'Ayat (\1) ', text)

    text = re.sub(r'http\S+|www\S+', '', text, flags=re.IGNORECASE)  # Remove URLs
    text = re.sub(r'jdih\.kemdikbud\.go\.id', '', text, flags=re.IGNORECASE)  # Remove specific website

    # Replace page number pattern '- 4 -' with '(page 4)'
    text = re.sub(r'\s-\s(\d+)\s-\s', r' (page \1) ', text)

    return text

cleaned_texts = {pdf: clean_text(text) for pdf, text in pdf_texts.items()}
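
As a quick illustration of what the cleaning step does, the sketch below applies three of the substitutions above to a short sample. The sample string is made up for this example (it is not taken from the regulation), and `clean_sample` is a hypothetical helper that repeats the patterns so the snippet runs on its own.

```python
import re

def clean_sample(text):
    """Apply three of the notebook's substitutions to a short sample."""
    # Collapse all whitespace runs (including newlines) into single spaces.
    text = re.sub(r'\s+', ' ', text).strip()
    # "Pasal 17. " -> "Pasal 17 " so the tokenizer does not split there.
    text = re.sub(r'Pasal (\d+)\.\s', r'Pasal \1 ', text)
    # Page marker "- 4 -" -> "(page 4)".
    text = re.sub(r'\s-\s(\d+)\s-\s', r' (page \1) ', text)
    return text

# Made-up sample imitating the layout of text extracted from a PDF page.
raw = "Pasal 17.\nDana BOSP dikelola\n\n- 4 -\noleh Satuan Pendidikan."
cleaned = clean_sample(raw)
print(cleaned)
# Pasal 17 Dana BOSP dikelola (page 4) oleh Satuan Pendidikan.
```

Note that the page-number pattern only matches after the whitespace has been normalized, so the order of the substitutions matters.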

In [None]:
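# ===============================
# 4. TEXT CHUNKING
# Splits text into smaller chunks.
# ===============================

def chunk_text(text, chunk_size=100):
    """Group sentences into chunks of roughly chunk_size characters.

    A sentence longer than chunk_size becomes a chunk on its own.
    """
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = ""
    for sentence in sentences:
        if len(current_chunk) + len(sentence) + 1 <= chunk_size:
            current_chunk += sentence + " "
        else:
            # Skip the empty chunk that would otherwise be emitted when the
            # very first sentence already exceeds chunk_size.
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = sentence + " "
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks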

all_chunks = []
for pdf, text in cleaned_texts.items():
    chunks = chunk_text(text)
    all_chunks.extend(chunks)

print(f"Total chunks: {len(all_chunks)}")

Total chunks: 144
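
The greedy packing idea behind the chunking step can be tried on a tiny input. This is a simplified sketch, not the notebook's exact function: it swaps NLTK's `sent_tokenize` for a naive regex splitter so it runs without downloads, counts lengths slightly differently, and uses made-up sentences. `split_sentences` and `chunk_sentences` are hypothetical names.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def chunk_sentences(sentences, chunk_size=100):
    """Greedily pack whole sentences into chunks of at most chunk_size chars.

    A single sentence longer than chunk_size still becomes its own chunk.
    """
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

# Three made-up sentences of 27, 31 and 30 characters.
text = ("Dana BOSP dikelola sekolah. Laporan disampaikan tiap tahun. "
        "Ketentuan lain diatur menteri.")
chunks = chunk_sentences(split_sentences(text), chunk_size=60)
for chunk in chunks:
    print(len(chunk), chunk)
```

With `chunk_size=60` the first two sentences fit together (59 characters including the joining space) and the third starts a new chunk. With the notebook's default of 100 characters and long legal sentences, most chunks end up holding a single sentence.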


# TEXT TOKENIZATION

In [None]:
# ===============================
# 5. TEXT TOKENIZATION & EMBEDDING
# ===============================

# Load Sentence Transformer model (multilingual)
model_name = 'paraphrase-multilingual-mpnet-base-v2'  # Swap in another model name if needed
model = SentenceTransformer(model_name)

# Generate embeddings for the chunks
embeddings = model.encode(all_chunks, show_progress_bar=True)


# SAVING DATA

In [None]:
# ===============================
# 6. SAVING DATA
# ===============================

# Define file paths for saving data
embeddings_file = os.path.join(pdf_dir, "embeddings.npy")  # Path to save embeddings
chunks_file = os.path.join(pdf_dir, "chunks.json")  # Path to save chunks
cleaned_texts_file = os.path.join(pdf_dir, "cleaned_texts.json") # Path to save cleaned texts

# ------------------------------------------------------------------
# 1. Saving the SentenceTransformer Model (not necessary)
# ------------------------------------------------------------------
# The model itself does not need to be saved: it can be reloaded from the
# Hugging Face Model Hub by passing model_name to SentenceTransformer.
# Saving the model weights would only take up a lot of extra space.

# --------------------------------------
# 2. Saving the Embeddings
# --------------------------------------
try:
    np.save(embeddings_file, embeddings)
    print(f"Embeddings saved to: {embeddings_file}")
except Exception as e:
    print(f"Error saving embeddings: {e}")

# --------------------------------------
# 3. Saving the Chunks of Text
# --------------------------------------
try:
    with open(chunks_file, "w", encoding="utf-8") as f:
        json.dump(all_chunks, f, ensure_ascii=False, indent=4)
    print(f"Chunks saved to: {chunks_file}")
except Exception as e:
    print(f"Error saving chunks: {e}")

# --------------------------------------
# 4. Saving the Cleaned PDF Texts
# --------------------------------------
try:
    with open(cleaned_texts_file, "w", encoding="utf-8") as f:
        json.dump(cleaned_texts, f, ensure_ascii=False, indent=4)
    print(f"Cleaned texts saved to: {cleaned_texts_file}")
except Exception as e:
    print(f"Error saving cleaned texts: {e}")

Embeddings saved to: /content/drive/My Drive/Colab Notebooks/AI Chatbot Berbasis Regulasi/embeddings.npy
Chunks saved to: /content/drive/My Drive/Colab Notebooks/AI Chatbot Berbasis Regulasi/chunks.json
Cleaned texts saved to: /content/drive/My Drive/Colab Notebooks/AI Chatbot Berbasis Regulasi/cleaned_texts.json


# TESTING

In [None]:
# ===============================
# 7. TESTING CHATBOT
# ===============================

# --- Question Answering ---
def answer_question(question, embeddings, chunks, top_n=3):
    """Return the top_n chunks most similar to the question, joined by newlines.

    Relies on the global `model` loaded in the embedding step.
    """
    question_embedding = model.encode([question])[0]
    similarities = cosine_similarity([question_embedding], embeddings)[0]
    # Indices of the top_n highest similarities, best first.
    top_indices = np.argsort(similarities)[::-1][:top_n]

    # Debugging: Print the top chunks
    print("Top Chunks before post-processing:")
    for i in top_indices:
        print(f"Chunk {i}: {chunks[i]}\n---")

    context = "\n".join([chunks[i] for i in top_indices])

    return context

def post_process_answer(answer):
    """Format the retrieved text as a bulleted list, one sentence per bullet."""
    # Split the answer into sentences
    sentences = sent_tokenize(answer)

    # Create a bulleted list from the sentences
    bulleted_list = "\n".join([f"* {sentence.strip()}" for sentence in sentences])

    return bulleted_list

# --- Example Usage ---
questions = [
    # "Can BOSP funds be used for human resource development?"
    "Apakah Dana BOSP dapat digunakan untuk pengembangan sumber daya manusia?",
    # "What can BOS Kinerja funds be used for?"
    "Untuk apa saja Dana BOS Kinerja dapat digunakan?",
    # "When must the report on the realized use of BOSP funds be submitted?"
    "Kapan laporan realisasi penggunaan Dana BOSP harus disampaikan?",
]

for question in questions:
    raw_answer = answer_question(question, embeddings, all_chunks, top_n=3)
    processed_answer = post_process_answer(raw_answer)

    print(f"Pertanyaan: {question}")
    print(f"Jawaban:\n{processed_answer}")

Top Chunks before post-processing:
Chunk 123: Rincian Komponen Penggunaan Dana BOS Kinerja Sekolah yang Melaksanakan Program Sekolah Penggerak a. Pengembangan sumber daya manusia merupakan komponen yang digunakan untuk pembiayaan dalam kegiatan penguatan sumber daya manusia dalam rangka pelaksanaan Program Sekolah Penggerak, seperti: 1) identifikasi, pemetaan potensi dan kebutuhan pelatihan; 2) penguatan pelatihan griyaan (in house training) di Satuan Pendidikan; 3) penguatan komunitas belajar bagi kepala Satuan Pendidikan dan pendidik; 4) pelatihan bersama komunitas belajar; 5) pelaksanaan diskusi terpumpun bersama dengan guru SD kelas awal; 6) peningkatan kapasitas literasi digital; dan/ atau 7) kegiatan lainnya yang relevan dalam rangka pelaksanaan pengembangan sumber daya manusia.
---
Chunk 19: Dana Bantuan Operasional Penyelenggaraan Pendidikan Kesetaraan yang selanjutnya disebut Dana BOP Kesetaraan adalah Dana BOSP untuk operasional Satuan Pendidikan dalam menyelenggarakan pendid