# **Store Data to Vector Store (OJK)**

This notebook shows how to store documents in Redis. The [Load](#load) step differs for each dataset (BI, OJK, and SIKEPO), so write your own `extract_all_documents_in_directory` function.

## **Setup**

In [1]:
# import nest_asyncio
# nest_asyncio.apply()

import warnings
warnings.filterwarnings("ignore")

## **Config**

In [2]:
from utils.config import get_config

config = get_config()

## **Define Model**

In [3]:
from utils.models import ModelName, LLMModelName, EmbeddingModelName, get_model

model_name = ModelName.OPENAI
llm_model, embed_model = get_model(
    model_name=model_name,
    config=config,
    llm_model_name=LLMModelName.GPT_4O_MINI,
    embedding_model_name=EmbeddingModelName.EMBEDDING_3_SMALL,
)

## **Indexing (All)**

In [4]:
import pickle

def clean_document_content(content):
    return content.replace('\x00', '')  # Remove NUL characters

def load_from_pickle(filename):
    # Use a context manager so the file handle is closed after loading
    with open(f'data/pickles/{filename}.pkl', 'rb') as file:
        return pickle.load(file)

## **Indexing**

In [5]:
pickle_path = './data/pickles/'

### **Load**

SIKEPO and BI documents are extracted differently, so write your own `document_extractor` file for them.

In [6]:
LOAD_PICKLE = True

documents = load_from_pickle('all_documents')
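The extractor itself is not shown in this notebook. As a rough sketch only, assuming the raw regulations have already been converted to plain-text files, a hypothetical version could look like the following. `SimpleDocument`, the `*.txt` pattern, and the choice of metadata keys are illustrative assumptions, not the project's actual code.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for the document type used by the pipeline."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def extract_all_documents_in_directory(directory: str, pattern: str = "*.txt") -> list:
    """Read every matching file in `directory` into one document per file."""
    documents = []
    # Sort the paths so the result order is stable across runs
    for path in sorted(Path(directory).glob(pattern)):
        text = path.read_text(encoding="utf-8", errors="ignore")
        documents.append(
            SimpleDocument(
                page_content=text,
                # Assumed keys, chosen to resemble the metadata shown later on
                metadata={"file_id": path.stem, "file_name": path.name},
            )
        )
    return documents
```

A real extractor for BI, OJK, or SIKEPO would most likely parse PDFs and attach richer metadata (title, date, sector), which is why each dataset needs its own version.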

### **Split**

In [7]:
from utils.bi_documents_split import bi_document_splitter
import pickle

LOAD_PICKLE = True

if not LOAD_PICKLE:
    all_splits = bi_document_splitter(docs=documents)
    # Fall back to the integer 0 so the key stays comparable with integer page numbers
    all_splits_sorted = sorted(all_splits, key=lambda x: (x.metadata['file_id'], x.metadata.get('page_number', 0)))
    for split in all_splits_sorted:
        split.page_content = clean_document_content(split.page_content)
    # Open a file and use dump()
    with open(pickle_path + 'documents1.pkl', 'wb') as file:
        # A new file will be created
        pickle.dump(all_splits_sorted, file)

# Open the file in binary mode
with open(pickle_path + 'documents1.pkl', 'rb') as file:
    # Call the load method to deserialize
    all_splits = pickle.load(file)
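The split step orders chunks by file and then by page before pickling them. The sketch below illustrates that ordering on made-up data, assuming `page_number` is stored as an integer (as in the metadata sample in this notebook). The `SimpleNamespace` objects and the `split_sort_key` helper are hypothetical. Note that a string fallback such as `'0'` would raise a `TypeError` in Python 3 as soon as it is compared with an integer page number, so the fallback here is the integer `0`.

```python
from types import SimpleNamespace

# Hypothetical splits; only the `metadata` attribute matters for ordering
splits = [
    SimpleNamespace(metadata={"file_id": "b", "page_number": 2}),
    SimpleNamespace(metadata={"file_id": "a", "page_number": 10}),
    SimpleNamespace(metadata={"file_id": "a"}),  # no page number recorded
    SimpleNamespace(metadata={"file_id": "a", "page_number": 3}),
]


def split_sort_key(split):
    # Tuples compare element by element: file_id first, then page number
    return (split.metadata["file_id"], split.metadata.get("page_number", 0))


ordered = sorted(splits, key=split_sort_key)
order = [split_sort_key(s) for s in ordered]
print(order)  # [('a', 0), ('a', 3), ('a', 10), ('b', 2)]
```

Because the page numbers are integers, page 10 sorts after page 3, which would not be the case if they were compared as strings.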

In [8]:
len(all_splits)

212890

In [9]:
all_splits[12].metadata

{'file_id': '0025aefc',
 'title': 'PERATURAN ANGGOTA DEWAN GUBERNUR NOMOR 21/1/PADG/2019 TANGGAL 17 JANUARI 2019 TENTANG PERUBAHAN ATAS PERATURAN ANGGOTA DEWAN GUBERNUR NOMOR 19/6/PADG/2017 TENTANG PINJAMAN LIKUIDITAS JANGKA PENDEK BAGI BANK UMUM KONVENSIONAL',
 'file_name': 'PADG Nomor 21/1/PADG/2019',
 'file_link': 'https://www.bi.go.id/id/publikasi/peraturan/Documents/PADG_210119.pdf',
 'date': '17 Januari 2019',
 'type_of_regulation': 'Peraturan Anggota Dewan Gubernur',
 'sector': 'Makroprudensial',
 'standardized_extracted_file_name': '',
 'standardized_file_name': 'padg-21_1_padg_2019-17012019-peraturan_anggota_dewan_gubernur_nomor_21_1_padg_2019_tanggal_17_januari_2019_tentang_perubahan_atas_peraturan_anggota_dewan_gubernur_nomor_19_6_padg_2017_tentang_pinjaman_likuiditas_jangka_pendek_bagi_bank_umum_konvensional',
 'page_number': 4}

### **Storing**

In [24]:
from database.vector_store.vector_store import ElasticIndexManager

# vector_store_manager = RedisIndexManager(index_name='ojk', embed_model=embed_model, config=config, db_id=0)
# vector_store_manager = PostgresIndexManager(index_name='ojk', embed_model=embed_model, config=config)
vector_store_manager = ElasticIndexManager(index_name='bi', embed_model=embed_model, config=config)

# vector_store_manager.delete_index() # WARNING: This will delete the index
vector_store_manager.store_vector_index(docs=all_splits, batch_size=100)
vector_store = vector_store_manager.load_vector_index()

Start loading from idx: 130700
Loaded 130701-130800 documents
Loaded 130801-130900 documents
Loaded 130901-131000 documents
Loaded 131001-131100 documents
Loaded 131101-131200 documents
Loaded 131201-131300 documents
Loaded 131301-131400 documents
Loaded 131401-131500 documents
Loaded 131501-131600 documents
Loaded 131601-131700 documents
Loaded 131701-131800 documents
Loaded 131801-131900 documents
Loaded 131901-132000 documents
Loaded 132001-132100 documents
Loaded 132101-132200 documents
Loaded 132201-132300 documents
Loaded 132301-132400 documents
Loaded 132401-132500 documents
Loaded 132501-132600 documents
Loaded 132601-132700 documents
Loaded 132701-132800 documents
Loaded 132801-132900 documents
Loaded 132901-133000 documents
Loaded 133001-133100 documents
Loaded 133101-133200 documents
Loaded 133201-133300 documents
Loaded 133301-133400 documents
Loaded 133401-133500 documents
Loaded 133501-133600 documents
Loaded 133601-133700 documents
Loaded 133701-133800 documents
Loaded 1

APIConnectionError: Connection error.
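The run above stopped on a connection error partway through, and the log suggests the index manager can resume from a saved offset. One way to make long uploads more tolerant of transient failures is to retry each batch with exponential backoff. The following is a minimal sketch under stated assumptions: `store_in_batches` and `store_batch` are hypothetical names and are not part of `ElasticIndexManager`. The built-in `ConnectionError` is used as a placeholder; in this notebook, the exception to catch would be the API client's own `APIConnectionError`.

```python
import time


def store_in_batches(docs, store_batch, batch_size=100, start_idx=0,
                     max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Send `docs` to `store_batch` in slices, retrying a failed slice with backoff.

    Returns the index of the first document that was NOT stored, so a later
    run can pass it back as `start_idx`.
    """
    idx = start_idx
    while idx < len(docs):
        batch = docs[idx: idx + batch_size]
        for attempt in range(max_retries + 1):
            try:
                store_batch(batch)
                break  # batch stored, move on to the next slice
            except ConnectionError:  # placeholder for the client's error type
                if attempt == max_retries:
                    return idx  # give up and report where to resume
                sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
        idx += len(batch)
    return idx
```

If the function returns a value smaller than `len(docs)`, that value can be passed as `start_idx` on the next run instead of starting over.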

## **Experiments**

In [11]:
vector_store.as_retriever().invoke("Apa itu QRIS?")

[Document(metadata={'doc_id': 4, 'title': 'Tata Cara dan Mekanisme Penyampaian Data Transaksi Pendanaan dan Pelaporan Penyelenggara Layanan Pendanaan Bersama Berbasis Teknologi Informasi (LPBBTI)', 'sector': 'IKNB', 'subsector': 'Peraturan Lainnya', 'regulation_type': 'Surat Edaran OJK', 'regulation_number': '1/SEOJK.06/2024', 'effective_date': '2024/07/01', 'file_url': 'https://www.ojk.go.id/id/regulasi/Documents/Pages/Tata-Cara-dan-Mekanisme-Penyampaian-Data-Transaksi-Pendanaan-dan-Pelaporan-Penyelenggara-Layanan-Pendanaan-Bersama-Berbasis/SEOJK%201-SEOJK.06-2024%20Tata%20Cara%20dan%20Mekanisme%20Penyampaian%20Data%20Transaksi%20Pendanaan%20dan%20Pelaporan%20Penyelenggara%20LPBBTI.pdf', 'page_number': 15}, page_content='No \nKode Jenis Kelamin \nPengisian \n1 \nBpk. Adiwena  \nLaki-Laki \n(sesuai dengan referensi) \n2 \nIbu Suci \nWanita \n(sesuai dengan referensi) \n3 \nPT ABC \nkolom dikosongkan \n \n15. Alamat \na. \nKolom ini diisi dengan alamat rumah untuk perorangan dan alamat 

In [12]:
from langchain_community.vectorstores.pgvector import PGVector
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.chains.query_constructor.base import (
    StructuredQueryOutputParser,
    get_query_constructor_prompt,
)
from langchain_core.language_models.base import BaseLanguageModel
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_core.vectorstores import VectorStore
from langchain.retrievers.self_query.elasticsearch import ElasticsearchTranslator
from langchain_core.prompts import PromptTemplate
from langchain_core.structured_query import Operator, Comparator

# Define metadata field information
metadata_field_info = [
    AttributeInfo(
        name="title",
        description="The title of the document of regulation",
        type="string",
    ),
    AttributeInfo(
        name="sector",
        description="""The sector of the regulation""",
        type="string",
    ),
    AttributeInfo(
        name="subsector",
        description="The subsector of the regulation",
        type="string",
    ),
    AttributeInfo(
        name="regulation_type",
        description="""The type of the regulation""",
        type="string",
    ),
    AttributeInfo(
        name="regulation_number",
        description="The number of the regulation",
        type="string",
    ),
    AttributeInfo(
        name="effective_date",
        description="The effective date of the regulation in string format 'YYYY/MM/DD'",
        type="string",
    ),
]

# Define document content description
document_content_description = "The content of the document"

# Define prompt
SCHEMA = """\
<< Structured Request Schema >>
When responding use a markdown code snippet with a JSON object formatted in the following schema:

```json
{{{{
    "query": string \\ text string to compare to document contents
    "filter": string \\ logical condition statement for filtering documents
}}}}
```

The query string should contain only text that is expected to match the contents of documents. Any conditions in the filter should not be mentioned in the query as well.

A logical condition statement is composed of one or more comparison and logical operation statements.

A comparison statement takes the form: `comp(attr, val)`:
- `comp` ({allowed_comparators}): comparator
- `attr` (string):  name of attribute to apply the comparison to
- `val` (string): is the comparison value

A logical operation statement takes the form `op(statement1, statement2, ...)`:
- `op` ({allowed_operators}): logical operator
- `statement1`, `statement2`, ... (comparison statements or logical operation statements): one or more statements to apply the operation to

Make sure that you only use the comparators and logical operators listed above and no others.
Make sure that filters only refer to attributes that exist in the data source.
Make sure that filters only use the attributed names with its function names if there are functions applied on them.
Make sure that filters take into account the descriptions of attributes and only make comparisons that are feasible given the type of data being stored.
Make sure that filters are only used as needed. If there are no filters that should be applied return "NO_FILTER" for the filter value.
Make sure that date attributes are compared using ASCII comparison operators with the date in the string format "YYYY/MM/DD".
"""

SCHEMA_PROMPT = PromptTemplate.from_template(SCHEMA)

# prompt = get_query_constructor_prompt(
#     document_contents=document_content_description,
#     attribute_info=metadata_field_info,
#     schema_prompt=SCHEMA_PROMPT,
    
# )
# output_parser = StructuredQueryOutputParser.from_components()
# query_constructor = prompt | llm_model | output_parser

# query_constructor.invoke("Berikan dokumen yang berlaku pada tanggal 1 Januari 2023 hingga 1 Januari 2024")



# # Create query constructor
# def self_query_ojk(llm_model: BaseLanguageModel, vector_store: VectorStore, search_type: str = "similarity") -> SelfQueryRetriever:
#     retriever = SelfQueryRetriever.from_llm(
#         document_contents=document_content_description,
#         # enable_limit=False,
#         use_original_query=True,
#         llm=llm_model,
#         vectorstore=vector_store,
#         metadata_field_info=metadata_field_info,
#         structured_query_translator=PGVectorTranslator(),
#     )

#     return retriever



def self_query_ojk(llm_model: BaseLanguageModel, vector_store: VectorStore, search_type: str = "similarity") -> SelfQueryRetriever:
    prompt = get_query_constructor_prompt(
        document_contents=document_content_description,
        attribute_info=metadata_field_info,
        schema_prompt=SCHEMA_PROMPT,
    )
    output_parser = StructuredQueryOutputParser.from_components()
    query_constructor = prompt | llm_model | output_parser

    retriever = SelfQueryRetriever(
        query_constructor=query_constructor,
        vectorstore=vector_store,
        search_type=search_type,
        structured_query_translator=ElasticsearchTranslator(),
        verbose=True,
    )

    return retriever

retriever = self_query_ojk(llm_model=llm_model, vector_store=vector_store, search_type="similarity")
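The schema prompt above asks the model to compare dates as `"YYYY/MM/DD"` strings. This works because zero-padded, year-first strings sort lexicographically in chronological order. The small sketch below demonstrates the idea with made-up dates; `in_range` is a hypothetical helper and not part of the retriever.

```python
dates = ["2024/07/01", "2023/12/31", "2025/01/15", "2022/06/30"]


def in_range(date: str, start: str, end: str) -> bool:
    # Plain string comparison; valid only for zero-padded "YYYY/MM/DD" values
    return start <= date <= end


selected = [d for d in dates if in_range(d, "2023/01/01", "2024/12/31")]
print(selected)  # ['2024/07/01', '2023/12/31']

# Without zero padding the ordering breaks: July would sort after October
unpadded_sorts_wrong = "2024/7/1" > "2024/10/01"
print(unpadded_sorts_wrong)  # True
```

This is why the `effective_date` attribute description pins the format: a range filter is only reliable if every stored date follows it exactly.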

In [15]:
# 2024-07-01
from langchain.globals import set_debug

set_debug(True)

context = retriever.invoke('Berikan dokumen yang berlaku dari tahun 20023 hingga 2025')
context

[32;1m[1;3m[chain/start][0m [1m[retriever:Retriever > chain:RunnableSequence] Entering Chain run with input:
[0m{
  "query": "Berikan dokumen yang berlaku dari tahun 20023 hingga 2025"
}
[32;1m[1;3m[chain/start][0m [1m[retriever:Retriever > chain:RunnableSequence > prompt:FewShotPromptTemplate] Entering Prompt run with input:
[0m{
  "query": "Berikan dokumen yang berlaku dari tahun 20023 hingga 2025"
}
[36;1m[1;3m[chain/end][0m [1m[retriever:Retriever > chain:RunnableSequence > prompt:FewShotPromptTemplate] [1ms] Exiting Prompt run with output:
[0m[outputs]
[32;1m[1;3m[llm/start][0m [1m[retriever:Retriever > chain:RunnableSequence > llm:AzureChatOpenAI] Entering LLM run with input:
[0m{
  "prompts": [
    "Human: Your goal is to structure the user's query to match the request schema provided below.\n\n<< Structured Request Schema >>\nWhen responding use a markdown code snippet with a JSON object formatted in the following schema:\n\n```json\n{\n    \"query\": string

[Document(metadata={'doc_id': 4, 'title': 'Tata Cara dan Mekanisme Penyampaian Data Transaksi Pendanaan dan Pelaporan Penyelenggara Layanan Pendanaan Bersama Berbasis Teknologi Informasi (LPBBTI)', 'sector': 'IKNB', 'subsector': 'Peraturan Lainnya', 'regulation_type': 'Surat Edaran OJK', 'regulation_number': '1/SEOJK.06/2024', 'effective_date': '2024/07/01', 'file_url': 'https://www.ojk.go.id/id/regulasi/Documents/Pages/Tata-Cara-dan-Mekanisme-Penyampaian-Data-Transaksi-Pendanaan-dan-Pelaporan-Penyelenggara-Layanan-Pendanaan-Bersama-Berbasis/SEOJK%201-SEOJK.06-2024%20Tata%20Cara%20dan%20Mekanisme%20Penyampaian%20Data%20Transaksi%20Pendanaan%20dan%20Pelaporan%20Penyelenggara%20LPBBTI.pdf', 'page_number': 15}, page_content='No \nKode Jenis Kelamin \nPengisian \n1 \nBpk. Adiwena  \nLaki-Laki \n(sesuai dengan referensi) \n2 \nIbu Suci \nWanita \n(sesuai dengan referensi) \n3 \nPT ABC \nkolom dikosongkan \n \n15. Alamat \na. \nKolom ini diisi dengan alamat rumah untuk perorangan dan alamat 
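In the debug trace, the JSON schema reaches the model with single braces, even though `SCHEMA` is written with `{{{{` and `}}}}`. A likely explanation, stated here as an assumption about how the prompt is assembled, is that the text is formatted twice: once when the comparator and operator lists are filled in, and again when the surrounding few-shot prompt is rendered. Each pass halves the escaped braces. The sketch below reproduces the effect with plain `str.format`; `schema_fragment` is a shortened, made-up stand-in for the real schema.

```python
schema_fragment = '{{{{\n    "query": string\n}}}}\ncomparators: {allowed_comparators}'

# First pass: fill in the comparator list; "{{{{" collapses to "{{"
once = schema_fragment.format(allowed_comparators="eq | gt")

# Second pass: stands in for the outer prompt being rendered; "{{" collapses to "{"
twice = once.format()

print(twice)
```

Writing only `{{` in a custom schema would leave a bare `{` after the first pass, and the second pass would then try to treat it as a template variable.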