# Candi Borobudur Chatbot with RAG

## Load Data
To reduce hallucination in the LLM's output, the LLM is grounded with external knowledge from:
- **Borobudur Wikipedia**
- **Scraped Borobudur information** (papers, websites, ...)

#### Borobudur Wikipedia

In [1]:
from langchain_community.document_loaders import WikipediaLoader

wiki_data = WikipediaLoader(query="Borobudur", load_max_docs=1,doc_content_chars_max = 5000,lang='id').load()

**Example**

In [2]:
wiki_data[0].metadata

{'title': 'Borobudur',
 'summary': 'Candi Borobudur (bahasa Jawa: ꦕꦟ꧀ꦝꦶꦧꦫꦧꦸꦝꦸꦂ, translit. Candhi Båråbudhur) adalah sebuah candi Buddha yang terletak di Borobudur, Magelang, Jawa Tengah, Indonesia. Candi ini terletak kurang lebih 100 km di sebelah barat daya Semarang, 86 km di sebelah barat Surakarta, dan 40 km di sebelah barat laut Yogyakarta. Candi dengan banyak stupa ini didirikan oleh para penganut agama Buddha Mahayana sekitar tahun 800-an Masehi pada masa pemerintahan wangsa Syailendra. Borobudur adalah candi atau kuil Buddha terbesar di dunia, sekaligus salah satu monumen Buddha terbesar di dunia.\nMonumen ini terdiri atas enam teras berbentuk bujur sangkar yang di atasnya terdapat tiga pelataran melingkar, pada dindingnya dihiasi dengan 2.672 panel relief dan aslinya terdapat 504 arca Buddha. Borobudur memiliki koleksi relief Buddha terlengkap dan terbanyak di dunia. Stupa utama terbesar terletak di tengah sekaligus memahkotai bangunan ini, dikelilingi oleh tiga barisan melingk

#### Scraped Information

In [3]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("passages.txt",encoding="utf8")
text_data = loader.load()

**Example**

In [4]:
text_data[0].page_content

'"Stupa di Candi Borobudur terkait erat dengan konsep perjalanan spiritual. Kuil itu sendiri dirancang sebagai situs ziarah, dengan jalan setapak yang menuntun pengunjung dalam perjalanan pendakian spiritual. Stupa mewakili tujuan akhir dari perjalanan spiritual ini, melambangkan pencerahan dan pencapaian Nirvana. Saat pengunjung mengelilingi Stupa, mereka terlibat dalam bentuk perjalanan spiritual, bergerak dari alam luar keberadaan menuju pusat, yang mewakili keadaan kesadaran tertinggi. Ukiran dan relief di Stupa menggambarkan berbagai ajaran dan cerita Buddha, berfungsi sebagai panduan visual untuk kontemplasi spiritual selama perjalanan.Tindakan berjalan di sekitar Stupa dan mengamati ukiran dapat dilihat sebagai bentuk meditasi dan refleksi diri, memfasilitasi pertumbuhan dan pemahaman spiritual."\n"Kerusakan awal Candi Borobudur disebabkan oleh berbagai faktor, termasuk letusan gunung berapi, gempa bumi, dan pengabaian setelah keruntuhan Kerajaan Mataram Kuno. Selain itu, air hu

## Process Data

The additional data is processed (split into chunks and embedded) so that it can be searched more easily.

#### Split Text

In [5]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_split = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap = 200)

wiki_split = text_split.split_documents(wiki_data)
passage_split = text_split.split_documents(text_data)

print(f'\nWiki Data Split Amount : {len(wiki_split)}')
print(f'Sample                 : {wiki_split[0].page_content}')
print('=================================================')
print(f'Passage Data Split Amount : {len(passage_split)}')
print(f'Sample                    : {passage_split[0].page_content}')


Wiki Data Split Amount : 11
Sample                 : Candi Borobudur (bahasa Jawa: ꦕꦟ꧀ꦝꦶꦧꦫꦧꦸꦝꦸꦂ, translit. Candhi Båråbudhur) adalah sebuah candi Buddha yang terletak di Borobudur, Magelang, Jawa Tengah, Indonesia. Candi ini terletak kurang lebih 100 km di sebelah barat daya Semarang, 86 km di sebelah barat Surakarta, dan 40 km di sebelah barat laut Yogyakarta. Candi dengan banyak stupa ini didirikan oleh para penganut agama Buddha Mahayana sekitar tahun 800-an Masehi pada masa pemerintahan wangsa Syailendra. Borobudur adalah candi atau kuil Buddha terbesar di dunia, sekaligus salah satu monumen Buddha terbesar di dunia.
Passage Data Split Amount : 450
Sample                    : "Stupa di Candi Borobudur terkait erat dengan konsep perjalanan spiritual. Kuil itu sendiri dirancang sebagai situs ziarah, dengan jalan setapak yang menuntun pengunjung dalam perjalanan pendakian spiritual. Stupa mewakili tujuan akhir dari perjalanan spiritual ini, melambangkan pencerahan dan pencapaian Nirv
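
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-width chunker. It is a simplification written for illustration, not the algorithm `RecursiveCharacterTextSplitter` uses; the function name `chunk_text` and the sample string are assumptions.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive chunker: each chunk starts (chunk_size - chunk_overlap) characters after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
# The last 4 characters of one chunk are repeated at the start of the next one.
print(chunks[0][-4:], chunks[1][:4])
```

With the notebook's settings (`chunk_size=700`, `chunk_overlap=200`), a fixed-width chunker like this would start a new chunk every 500 characters. The real splitter also prefers to break on separators such as paragraph and line breaks, so its chunks will not line up exactly with this version.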

#### Embedding
Instantiate the embedding models, which are responsible for transforming text into numerical representations in a continuous vector space.

In [6]:
import os
os.environ["PYDANTIC_SKIP_VALIDATING_CORE_SCHEMAS"] = "True"

In [7]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_google_genai import GoogleGenerativeAIEmbeddings

# Load an Indonesian sentence embedding model from Hugging Face
indosentencebert_embeddings = HuggingFaceEmbeddings(model_name="firqaaa/indo-sentence-bert-base")

# Load a multilingual embedding model from Hugging Face, allowing custom code execution from the model repo
gte_embeddings = HuggingFaceEmbeddings(model_name="Alibaba-NLP/gte-multilingual-base", 
                                       model_kwargs={'trust_remote_code': True})

# Load the embedding model used by the redundancy filter on the merged retriever (supply your own API key)
filter_embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001", google_api_key="")

Some weights of the model checkpoint at Alibaba-NLP/gte-multilingual-base were not used when initializing NewModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing NewModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing NewModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
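
Once text is mapped into a vector space, "similar meaning" becomes "nearby vectors". The sketch below shows one common closeness measure, cosine similarity, on hand-made three-dimensional vectors. The vectors and labels are toy values chosen for illustration, not real model output, and the vector store may use a different distance internally.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy "embeddings" (assumed values, for illustration only).
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "stupa": [1.0, 0.0, 1.0],
    "relief": [1.0, 1.0, 0.0],
    "tiket": [0.0, 1.0, 0.0],
}

# Rank the documents from most to least similar to the query.
ranked = sorted(
    doc_vecs,
    key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
    reverse=True,
)
print(ranked)
```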


## Create RAG Components

#### Merger Retriever

The MergerRetriever class combines the results of multiple retrievers into a single result list. Drawing on several retrievers (here, each with its own embedding model and search type) can help reduce the bias that any single retriever introduces.

In [None]:
from langchain_community.vectorstores import FAISS

# Instantiate two FAISS indexes, each with a different embedding model.
db_wiki = FAISS.from_documents(wiki_split, embedding=indosentencebert_embeddings)
db_passages = FAISS.from_documents(passage_split, embedding=gte_embeddings)

In [None]:
# Define two retrievers with different embeddings and different search types.
retriever_wiki = db_wiki.as_retriever(
    search_type="similarity", search_kwargs={"k": 5, "include_metadata": True}
)
retriever_passages = db_passages.as_retriever(
    search_type="mmr", search_kwargs={"k": 5, "include_metadata": True}
)
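
The second retriever uses `search_type="mmr"` (maximal marginal relevance), which trades relevance to the query against redundancy with results already picked. Below is a minimal sketch of that idea on toy two-dimensional vectors. The helper names and vectors are assumptions for illustration, and the library's own implementation may differ in detail.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def mmr(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedily pick k document indexes, balancing relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 1.0]
docs = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0]]  # docs 0 and 1 are near-duplicates

by_similarity = sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)[:2]
by_mmr = mmr(query, docs, k=2)
print(by_similarity, by_mmr)
```

Plain similarity search returns the two near-duplicate vectors, while the MMR sketch swaps the second one for the less similar but more diverse third vector.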

In [None]:
from langchain.retrievers import MergerRetriever

# The Lord of the Retrievers will hold the output of both retrievers and can be used as any other
# retriever on different types of chains.
lotr = MergerRetriever(retrievers=[retriever_wiki, retriever_passages])
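
As a rough mental model of what merging does, the sketch below interleaves two result lists in round-robin order. This mirrors how `MergerRetriever` is understood to combine results, but treat that as an assumption; `merge_round_robin` and the placeholder strings are illustrative names, not LangChain APIs.

```python
from itertools import zip_longest


def merge_round_robin(*result_lists):
    """Interleave results: first hit of each list, then second hit of each list, and so on."""
    merged = []
    for group in zip_longest(*result_lists):
        # zip_longest pads shorter lists with None; skip the padding.
        merged.extend(doc for doc in group if doc is not None)
    return merged


wiki_hits = ["wiki-1", "wiki-2", "wiki-3"]
passage_hits = ["passage-1", "passage-2"]
merged = merge_round_robin(wiki_hits, passage_hits)
print(merged)
```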

In [None]:
from langchain_community.document_transformers import EmbeddingsRedundantFilter
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import DocumentCompressorPipeline

# Remove redundant results from both retrievers using yet another embedding model.
# Using different embeddings at different steps may help reduce bias.
# Named `redundant_filter` to avoid shadowing Python's built-in `filter`.
redundant_filter = EmbeddingsRedundantFilter(embeddings=filter_embeddings)
pipeline = DocumentCompressorPipeline(transformers=[redundant_filter])
compression_retriever = ContextualCompressionRetriever(
    base_compressor=pipeline, base_retriever=lotr
)
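
The idea behind an embedding-based redundancy filter can be sketched as: keep a document only if it is not too similar to one that has already been kept. The version below is a simplified greedy variant on unit-length toy vectors. The function name, the vectors, and the `0.95` threshold are assumptions for illustration, and the library's filter may select duplicates differently.

```python
def dot(a, b):
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(x * y for x, y in zip(a, b))


def drop_redundant(docs, unit_vectors, threshold=0.95):
    """Keep a document only if its similarity to every kept document is below the threshold."""
    kept_docs, kept_vecs = [], []
    for doc, vec in zip(docs, unit_vectors):
        if all(dot(vec, kept) < threshold for kept in kept_vecs):
            kept_docs.append(doc)
            kept_vecs.append(vec)
    return kept_docs


docs = ["a", "a-near-duplicate", "b"]
vectors = [[1.0, 0.0], [0.96, 0.28], [0.0, 1.0]]  # all unit length
unique_docs = drop_redundant(docs, vectors)
print(unique_docs)
```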

#### LLM
Instantiate the LLM for the question-answering system.

In [1]:
import os
from langchain_groq import ChatGroq

llm = ChatGroq(temperature=0.5, model_name="llama-3.1-8b-instant",max_tokens=256,api_key="")

#### Reranking
Re-rank the documents returned by the retrieval pipeline so that the most relevant ones are passed to the LLM.

In [None]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import FlashrankRerank

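compressor = FlashrankRerank()
# Wrap the de-duplicating retriever defined earlier with the re-ranker.
# The name `compression_retriever` is reused, so it now refers to the full pipeline.
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=compression_retriever
)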
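
Conceptually, a re-ranker scores each retrieved passage against the query and keeps only the best few. The sketch below uses a crude word-overlap score as a stand-in for the learned relevance model a real re-ranker would apply. The scoring function, `top_n` value, and candidate strings are all assumptions for illustration.

```python
def overlap_score(query: str, passage: str) -> float:
    """Fraction of query words that also appear in the passage (toy relevance score)."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & passage_words) / len(query_words)


def rerank(query, passages, top_n=3, score_fn=overlap_score):
    """Sort passages by descending score and keep the top_n best."""
    ordered = sorted(passages, key=lambda p: score_fn(query, p), reverse=True)
    return ordered[:top_n]


candidates = [
    "Harga tiket masuk candi",
    "Borobudur dibangun pada masa wangsa Syailendra",
    "Kapan candi Borobudur dibangun",
]
top = rerank("kapan borobudur dibangun", candidates, top_n=2)
print(top)
```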

## Build RAG App

#### Create History

In [None]:
from langchain.chains import create_history_aware_retriever
from langchain_core.prompts import MessagesPlaceholder
from langchain_core.prompts import ChatPromptTemplate

# Define a prompt that rewrites a question which may depend on the chat history
# into a clear, standalone question that can be understood without that history.
contextualize_q_system_prompt = (
    "Given a chat history and the latest user question "
    "which might reference context in the chat history, "
    "formulate a standalone question which can be understood "
    "without the chat history. Do NOT answer the question, "
    "just reformulate it if needed and otherwise return it as is."
)

# Create a prompt template that combines system instructions, chat history, and user input.
contextualize_q_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", contextualize_q_system_prompt),
        MessagesPlaceholder("chat_history"),  # Placeholder for chat history
        ("human", "{input}"),  # Placeholder for the user's latest question
    ]
)

# Create a retriever that understands the chat history using the LLM, compression retriever, and the prompt template.
history_aware_retriever = create_history_aware_retriever(
    llm, compression_retriever, contextualize_q_prompt
)
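
The control flow of a history-aware retriever can be summarized as: with no history, search with the user's input directly; otherwise, first rewrite the input into a standalone question. The sketch below shows that branching with stub functions. `fake_reformulate` and `fake_retrieve` are hypothetical stand-ins for the LLM call and the retriever, not LangChain APIs.

```python
def history_aware_retrieve(user_input, chat_history, reformulate, retrieve):
    """Pick the search query based on whether there is any chat history."""
    if not chat_history:
        query = user_input  # nothing to resolve, so use the question as is
    else:
        query = reformulate(chat_history, user_input)  # LLM rewrite step
    return retrieve(query)


def fake_reformulate(history, question):
    """Stub for the LLM: attach the latest history item to the question."""
    return f"{question} [context: {history[-1]}]"


def fake_retrieve(query):
    """Stub for the retriever: echo the query it was asked to search for."""
    return [f"doc for: {query}"]


no_history = history_aware_retrieve("Kapan dibangun?", [], fake_reformulate, fake_retrieve)
with_history = history_aware_retrieve(
    "Kapan dibangun?", ["Borobudur"], fake_reformulate, fake_retrieve
)
print(no_history)
print(with_history)
```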

#### Create Retrieval Chain

In [None]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain


# Define a system prompt for the assistant, named "Bori", which answers questions about Candi Borobudur.
system_prompt = (
    "You are an assistant named Bori, which stands for Borobudur Story, for question-answering tasks about Candi Borobudur. "
    "Use the following pieces of retrieved context to help you answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise and don't answer questions not related to Candi Borobudur."
    "\n\n"
    "{context}"  # Placeholder for the retrieved context relevant to the question.
)

# Create a chat prompt template that combines the system instructions, chat history, and user input.
qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),              # System instructions (how Bori should behave)
        MessagesPlaceholder("chat_history"),    # Placeholder for the chat history
        ("human", "{input}"),                   # Placeholder for the user's latest question
    ]
)

# Create a chain for answering questions based on the context retrieved.
question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)

# Combine the history-aware retriever and the question-answer chain to form a retrieval-augmented generation (RAG) chain.
rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)
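
A "stuff" documents chain places all retrieved passages into the prompt's `{context}` slot at once. The sketch below shows that step in plain Python, assuming the passages are joined with a blank line between them. The template text and passages are illustrative, not the notebook's actual prompt or data.

```python
def stuff_context(passages, separator="\n\n"):
    """Join every retrieved passage into one context string."""
    return separator.join(passages)


# Illustrative template with the same {context} placeholder idea as the system prompt.
system_template = "Answer using the context below.\n\n{context}"
passages = ["Borobudur is a Buddhist temple.", "It is located in Magelang."]

prompt_text = system_template.format(context=stuff_context(passages))
print(prompt_text)
```

Because every passage is included verbatim, the number and size of retrieved chunks directly affect how much of the LLM's context window the prompt uses.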

## Testing

In [None]:
from langchain_core.messages import AIMessage, HumanMessage

chat_history = []

In [None]:
question = "Tentu saja"
ai_msg = rag_chain.invoke({"input": question, "chat_history": chat_history})
chat_history.extend(
    [
        HumanMessage(content=question),
        AIMessage(content=ai_msg["answer"]),
    ]
)
ai_msg['answer']

'Baiklah, saya siap membantu! Apakah kamu memiliki pertanyaan tentang Candi Borobudur?'
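
For multi-turn testing, the invoke-then-extend pattern above can be wrapped in a small helper. The sketch below is dependency-free: `EchoChain` is a hypothetical stand-in for `rag_chain`, and the history is stored as `(role, content)` tuples rather than message objects, which is an assumption made to keep the example self-contained.

```python
def ask(chain, question, chat_history):
    """Send one question, then record both turns in the shared history list."""
    result = chain.invoke({"input": question, "chat_history": chat_history})
    answer = result["answer"]
    chat_history.extend([("human", question), ("ai", answer)])
    return answer


class EchoChain:
    """Hypothetical stand-in for the RAG chain: reports the turn number and echoes the input."""

    def invoke(self, inputs):
        turns_so_far = len(inputs["chat_history"]) // 2
        return {"answer": f"turn {turns_so_far + 1}: {inputs['input']}"}


history = []
chain = EchoChain()
first = ask(chain, "Apa itu Borobudur?", history)
second = ask(chain, "Kapan dibangun?", history)
print(first)
print(second)
print(len(history))
```

With the real chain, storing `HumanMessage` and `AIMessage` objects as in the cell above is the safer choice.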

#### Save Requirements

In [None]:
!pip freeze > requirements.txt