# **HiBlu (RAG) - Vector Database** 

---

## ***Introduction***

---

By [Maverick Team](https://github.com/FTDS-assignment-bay/p2-final-project-mavericks)

<center><img src="https://imgtr.ee/images/2024/07/03/9eea693c3d3aee90f0ea2041012aa6a1.png" alt="9eea693c3d3aee90f0ea2041012aa6a1.png" border="0" /></center> 

Data: [FAQ Blu](https://blubybcadigital.id/info/faq)

---

## ***Objective***

---


HiBlu is an LLM chatbot, specially tuned and integrated with generative AI, built for Blu (a digital banking service by BCA). It is designed to answer customer questions about Blu BCA services quickly and accurately, improving both customer experience and support efficiency.

---

## ***Import Library***

---

In [None]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from pymongo import MongoClient
from langchain_community.vectorstores import MongoDBAtlasVectorSearch
import os

---

## ***Process***

---

In [None]:
# Loading Data
loader = PyPDFLoader("answers.pdf")
data = loader.load()
data

In [None]:
# Display the data type
type(data)

In [None]:
# Display the content of the page at index 31
print(data[31].page_content)

In [None]:
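# Text Splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # maximum chunk length in characters (see length_function)
    chunk_overlap=0,
    separators=["\n\n", "\n", r"(?<=\. )", " "],
    is_separator_regex=True,  # needed so the look-behind pattern is treated as a regex
    length_function=len,
)

In [None]:
# Text Chunk
text_chunk = text_splitter.split_documents(data)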
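
To make the roles of `chunk_size` and `separators` more concrete, the sketch below shows the recursive idea in plain Python: split on the coarsest separator first, pack pieces into a chunk while they fit, and fall back to finer separators only for pieces that are still too long. This is a simplified illustration, not LangChain's implementation; `split_recursive` is a hypothetical helper that only handles literal separators (no regex) and drops the separator at chunk boundaries.

```python
def split_recursive(text, separators, chunk_size):
    """Split text into pieces of at most chunk_size characters.

    Separators are tried in order, from coarsest to finest. A piece that is
    still too long is split again with the remaining (finer) separators.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # A single piece is too long: recurse with finer separators.
            chunks.extend(split_recursive(piece, finer, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks


# Toy input with a small chunk_size so the fallback behaviour is visible.
sample = "aaa bbb\n\nccc ddd eee\n\nfff"
sample_chunks = split_recursive(sample, ["\n\n", "\n", " "], chunk_size=8)
```

In this toy run the middle paragraph does not fit in 8 characters, so it is the only one broken up at spaces; the other paragraphs stay whole.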

In [None]:
# Display one chunk
print(text_chunk[101].page_content)

In [None]:
# Display five chunks (indices 100 to 104)
for idx, chunk in enumerate(text_chunk[100:105], start=100):
    print(f'chunk no. {idx}:\n{chunk.page_content}')
    print(f'\ncharacter length of chunk: {len(chunk.page_content)}')
    print('-'*50)

In [None]:
# Loading Environment Variables
load_dotenv()
KEY = os.getenv("OPEN_AI_MONGO")
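
`os.getenv` returns `None` without complaint when a variable is missing, so a typo in the `.env` file only shows up later as an authentication error. One possible safeguard is to fail fast at load time. The helper below is a hypothetical sketch (`require_env` is not part of `dotenv` or the standard library):

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check your .env file."
        )
    return value
```

In this notebook it could be used as `KEY = require_env("OPEN_AI_MONGO")` right after `load_dotenv()`.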

In [None]:
# Embedding
embedding = OpenAIEmbeddings(openai_api_key=KEY)

In [None]:
# Display the length of an embedding vector
len(embedding.embed_query('my name is danu'))

In [None]:
# Test Embedding
test_embed = embedding.embed_query('saya adalah danu')
test_embed[:10]
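
Embedding vectors are usually compared with cosine similarity: texts with similar meaning should produce vectors that point in a similar direction. The sketch below computes it in plain Python. The three short vectors are made-up values standing in for real embeddings, which are much longer.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors (made-up values, not real embeddings).
v_query = [1.0, 0.0, 1.0]
v_similar = [2.0, 0.0, 2.0]    # same direction as v_query, different length
v_unrelated = [0.0, 1.0, 0.0]  # orthogonal to v_query

score_similar = cosine_similarity(v_query, v_similar)
score_unrelated = cosine_similarity(v_query, v_unrelated)
```

The same function could be applied to two `embedding.embed_query(...)` results to check how close two sentences are in embedding space.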

In [None]:
# Initialize the MongoDB Python client
client = MongoClient("mongodb+srv://Maverick:anakbimbinganmasdanu@cluster0.muggb2k.mongodb.net/?retryWrites=true&w=majority&appName=Cluster0")
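
Embedding the username and password directly in the connection string exposes them to anyone who can read the notebook. A safer pattern is to keep them in the same `.env` file as the OpenAI key and assemble the URI at runtime. The sketch below is one way to do that; `build_mongo_uri` and the variable names `MONGO_USER`, `MONGO_PASSWORD`, and `MONGO_HOST` are hypothetical.

```python
import os
from urllib.parse import quote_plus


def build_mongo_uri(user: str, password: str, host: str) -> str:
    """Assemble a mongodb+srv URI, percent-encoding the credentials."""
    return (
        f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/?retryWrites=true&w=majority"
    )


# Hypothetical usage, assuming the three variables are defined in .env:
# uri = build_mongo_uri(
#     os.environ["MONGO_USER"], os.environ["MONGO_PASSWORD"], os.environ["MONGO_HOST"]
# )
# client = MongoClient(uri)
```

Percent-encoding matters because characters such as `@` or `/` in a password would otherwise be parsed as part of the URI structure.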

In [None]:
# Access the collection
collection = client['Maverick']['Maverick_DB']

In [None]:
# Embed the chunks and store them in MongoDB Atlas for vector search
docsearch = MongoDBAtlasVectorSearch.from_documents(
    text_chunk, embedding, collection=collection, index_name="vector_index"
)
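
`from_documents` writes the chunks and their vectors to the collection, but queries also need an Atlas Vector Search index named `vector_index` on that collection. The sketch below builds a plausible index definition as a Python dictionary. It assumes the vectors are stored under the field `embedding` (LangChain's default `embedding_key`) and have 1536 dimensions; the dimension should match the length printed by the `len(embedding.embed_query(...))` cell. `vector_index_definition` is a hypothetical helper.

```python
import json


def vector_index_definition(num_dimensions: int,
                            path: str = "embedding",
                            similarity: str = "cosine") -> dict:
    """Build an Atlas Vector Search index definition for a single vector field."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,                     # assumed field holding the vectors
                "numDimensions": num_dimensions,  # must equal the embedding length
                "similarity": similarity,
            }
        ]
    }


# 1536 is an assumption; verify it against the embedding length in this notebook.
definition = vector_index_definition(1536)
print(json.dumps(definition, indent=2))
```

The printed JSON can be pasted into the Atlas UI when creating the index on the `Maverick_DB` collection.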

***Insight:***

The text from the PDF file is split into chunks of at most 1,000 characters. Each chunk is then embedded with an OpenAI embedding model, using an API key loaded from the environment. The resulting vectors are stored in a MongoDB database.
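
Once the vectors are stored, retrieval amounts to embedding the user's question and returning the chunks whose vectors are closest to it. The sketch below shows that idea as a brute-force search in memory. It is only an illustration of the concept, not how Atlas executes a query, and the texts and 2-dimensional vectors are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, store, k=2):
    """Return the k stored texts whose vectors are most similar to query_vec."""
    ranked = sorted(
        store,
        key=lambda item: cosine(query_vec, item["embedding"]),
        reverse=True,
    )
    return [item["text"] for item in ranked[:k]]


# Made-up chunks and vectors standing in for the stored documents.
store = [
    {"text": "How to open an account", "embedding": [0.9, 0.1]},
    {"text": "Transfer limits", "embedding": [0.1, 0.9]},
    {"text": "Account requirements", "embedding": [0.8, 0.3]},
]

results = top_k([1.0, 0.0], store, k=2)
```

With the toy query vector pointing along the first axis, the two account-related texts rank highest.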

---