# **Store Data to Vector Store (OJK)**

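This notebook shows how to store documents in the vector store (originally Redis; the Storing section below currently uses `PostgresIndexManager`). The [Load](#load) step differs for each data source (BI, OJK, and SIKEPO), so write your own `extract_all_documents_in_directory` function for each source.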

## **Setup**

In [1]:
# import nest_asyncio
# nest_asyncio.apply()

## **Config**

In [2]:
from utils.config import get_config
from utils.models import ModelName, get_model

config = get_config()

## **Define Model**

In [3]:
from utils.models import ModelName, LLMModelName, EmbeddingModelName, get_model

model_name = ModelName.OPENAI
llm_model, embed_model = get_model(model_name=model_name, config=config, llm_model_name=LLMModelName.GPT_35_TURBO, embedding_model_name=EmbeddingModelName.EMBEDDING_3_SMALL)

## **Indexing**

In [4]:
documents_dir = './data/documents3/'
pickle_path = './data/pickles/'
metadata_path = './data/metadata/files_metadata.csv'

LOAD_PICKLE = True

### **Load**

SIKEPO and BI documents are extracted differently, so write a separate document extractor module for each of them.

In [5]:
from utils.documents_extractor.documents_extract_ojk import extract_all_documents_in_directory

if not LOAD_PICKLE:
    documents = extract_all_documents_in_directory(documents_dir, metadata_path, treshold=0.98)
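
A custom extractor for another source could follow roughly the shape below. This is only a minimal sketch under stated assumptions: it reads plain `.txt` files, expects the metadata CSV to have a `file_name` column, and uses a hypothetical `SimpleDocument` dataclass as a stand-in for the LangChain `Document` class. The real OJK extractor also takes a `treshold` argument whose meaning is not shown in this notebook, so it is left out here.

```python
import csv
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document (keeps the sketch dependency-free)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def extract_all_documents_in_directory(documents_dir, metadata_path):
    """Return one document per .txt file, joined with its row from the metadata CSV."""
    # Index the metadata rows by file name (assumes a 'file_name' column exists).
    with open(metadata_path, newline="", encoding="utf-8") as f:
        rows = {row["file_name"]: row for row in csv.DictReader(f)}

    documents = []
    for path in sorted(Path(documents_dir).glob("*.txt")):
        # Files without a metadata row get an empty metadata dict.
        meta = dict(rows.get(path.name, {}))
        meta.pop("file_name", None)
        text = path.read_text(encoding="utf-8")
        documents.append(SimpleDocument(page_content=text, metadata=meta))
    return documents
```

For PDF sources the `read_text` call would be replaced by whatever parser fits that source. The rest of the pipeline only needs objects with `page_content` and `metadata`.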

### **Split**

In [6]:
from utils.documents_split import document_splitter
import pickle

if not LOAD_PICKLE:
    all_splits = document_splitter(docs=documents)
    # Sort by document, then page; the default is 0 (int) so the sort keys stay comparable
    all_splits1 = sorted(all_splits, key=lambda x: (x.metadata['doc_id'], x.metadata.get('page_number', 0)))
    # Serialize the sorted splits; the file is created if it does not exist
    with open(pickle_path + 'documents3.pkl', 'wb') as file:
        pickle.dump(all_splits1, file)

# NOTE: this reads 'documents1.pkl', not the 'documents3.pkl' written above;
# change the file name if the freshly written pickle should be loaded instead.
with open(pickle_path + 'documents1.pkl', 'rb') as file:
    # Deserialize the previously stored splits
    all_splits = pickle.load(file)
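
The cell above follows a "build once, then reuse the pickle" pattern. One way to keep the write path and the read path pointing at the same file is a small helper like the one below. The name `load_or_build` and its parameters are hypothetical and not part of this project's `utils`.

```python
import pickle
from pathlib import Path


def load_or_build(cache_file, build_fn, use_cache=True):
    """Return the cached object if allowed and present; otherwise build it and cache it."""
    cache_file = Path(cache_file)
    if use_cache and cache_file.exists():
        with cache_file.open("rb") as f:
            return pickle.load(f)

    # Cache miss (or caching disabled): compute the result and store it for next time.
    result = build_fn()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as f:
        pickle.dump(result, f)
    return result
```

In this notebook it could be called as `load_or_build(pickle_path + 'documents3.pkl', lambda: document_splitter(docs=documents), use_cache=LOAD_PICKLE)`. Because unpickling can execute arbitrary code, only load pickle files that you created yourself.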

In [7]:
len(all_splits)

132966

In [8]:
all_splits[48853:48953]

[Document(metadata={'doc_id': 418, 'title': 'Penerapan Manajemen Risiko bagi Perusahaan Pialang Asuransi, Perusahaan Pialang Reasuransi, dan Perusahaan Penilai Kerugian Asuransi', 'sector': 'IKNB', 'subsector': 'Asuransi', 'regulation_type': 'Surat Edaran OJK', 'regulation_number': '13/SEOJK.05/2021', 'effective_date': '2021-04-12', 'file_url': 'https://www.ojk.go.id/id/regulasi/Documents/Pages/Penerapan-Manajemen-Risiko-bagi-Perusahaan-Pialang-Asuransi,-Perusahaan-Pialang-Reasuransi,-dan-Perusahaan-Penilai-Kerugian/SEOJK%2013%20-%2005%20-%202021.pdf', 'page_number': 129}, page_content='METADATA:\ntitle: Penerapan Manajemen Risiko bagi Perusahaan Pialang Asuransi, Perusahaan Pialang Reasuransi, dan Perusahaan Penilai Kerugian Asuransi\nsector: IKNB\nsubsector: Asuransi\nregulation_type: Surat Edaran OJK\nregulation_number: 13/SEOJK.05/2021\neffective_date: 12 April 2021\n----------\n\nSource: [13/SEOJK.05/2021](https://www.ojk.go.id/id/regulasi/Documents/Pages/Penerapan-Manajemen-Risik

### **Storing**

In [9]:
from database.vector_store.vector_store import PostgresIndexManager

# vector_store_manager = RedisIndexManager(index_name='ojk', embed_model=embed_model, config=config, db_id=0)
vector_store_manager = PostgresIndexManager(index_name='ojk', embed_model=embed_model, config=config)

# vector_store_manager.delete_index() # WARNING: This will delete the index
# vector_store_manager.store_vector_index(docs=all_splits[0:100], batch_size=100)
vector_store = vector_store_manager.load_vector_index()

Database 'vector_store' already exists.
Vector extension created successfully (if it didn't exist).
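
The commented-out call above stores only the first 100 splits. To push the whole collection in chunks, a generic batching helper such as the sketch below could be used. `iter_batches` is a hypothetical name, and whether `store_vector_index` already batches internally through its `batch_size` argument is not visible in this notebook, so treat the usage as an assumption.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])
```

A loop like `for batch in iter_batches(all_splits, 100): vector_store_manager.store_vector_index(docs=batch, batch_size=100)` would then cover all 132966 splits in 1330 batches, and a failed batch can be retried without restarting from the beginning.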


## **Experiments**

In [10]:
vector_store.as_retriever().invoke("Halo")

In [26]:
from langchain.chains.query_constructor.base import (
    AttributeInfo,
    StructuredQueryOutputParser,
    get_query_constructor_prompt,
)
from langchain_core.language_models.base import BaseLanguageModel
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_core.vectorstores import VectorStore
from langchain.retrievers.self_query.pgvector import PGVectorTranslator
from langchain_core.prompts import PromptTemplate


# Define metadata field information
metadata_field_info = [
    AttributeInfo(
        name="title",
        description="The title of the document of regulation",
        type="string",
    ),
    AttributeInfo(
        name="sector",
        description="""The sector of the regulation""",
        type="string",
    ),
    AttributeInfo(
        name="subsector",
        description="The subsector of the regulation",
        type="string",
    ),
    AttributeInfo(
        name="regulation_type",
        description="""The type of the regulation""",
        type="string",
    ),
    AttributeInfo(
        name="regulation_number",
        description="The number of the regulation",
        type="string",
    ),
    AttributeInfo(
        name="effective_date",
        description="The effective date of the regulation in format string YYYY-MM-DD",
        type="string",
    ),
]

# Define document content description
document_content_description = "The content of the document"

# Define prompt
SCHEMA = """\
<< Structured Request Schema >>
When responding use a markdown code snippet with a JSON object formatted in the following schema:

```json
{{{{
    "query": string \\ text string to compare to document contents
    "filter": string \\ logical condition statement for filtering documents
}}}}
```

The query string should contain only text that is expected to match the contents of documents. Any conditions in the filter should not be mentioned in the query as well.

A logical condition statement is composed of one or more comparison and logical operation statements.

A comparison statement takes the form: `comp(attr, val)`:
- `comp` ({allowed_comparators}): comparator
- `attr` (string):  name of attribute to apply the comparison to
- `val` (string): is the comparison value

A logical operation statement takes the form `op(statement1, statement2, ...)`:
- `op` ({allowed_operators}): logical operator
- `statement1`, `statement2`, ... (comparison statements or logical operation statements): one or more statements to apply the operation to

Make sure that you only use the comparators and logical operators listed above and no others.
Make sure that filters only refer to attributes that exist in the data source.
Make sure that filters only use the attributed names with its function names if there are functions applied on them.
Make sure that filters only use iso format for date comparisons, e.g. '2024-07-01', and make comparisons using ASCII values for string comparisons.
Make sure that filters take into account the descriptions of attributes and only make comparisons that are feasible given the type of data being stored.
Make sure that filters are only used as needed. If there are no filters that should be applied return "NO_FILTER" for the filter value.
"""

SCHEMA_PROMPT = PromptTemplate.from_template(SCHEMA)

# Create query constructor
def self_query_ojk(llm_model: BaseLanguageModel, vector_store: VectorStore, search_type: str = "similarity") -> SelfQueryRetriever:
    prompt = get_query_constructor_prompt(
        document_contents=document_content_description,
        attribute_info=metadata_field_info,
        schema_prompt=SCHEMA_PROMPT,
    )
    output_parser = StructuredQueryOutputParser.from_components()
    query_constructor = prompt | llm_model | output_parser

    retriever = SelfQueryRetriever(
        query_constructor=query_constructor,
        vectorstore=vector_store,
        search_type=search_type,
        structured_query_translator=PGVectorTranslator(),
        verbose=True,
    )

    return retriever

retriever = self_query_ojk(llm_model=llm_model, vector_store=vector_store, search_type="similarity")
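
The quadruple braces (`{{{{` and `}}}}`) in `SCHEMA` are there because the text appears to be formatted twice before it reaches the model: once when the comparators and operators are filled in, and once more when the final prompt is rendered. This two-pass behavior is my reading of the query-constructor prompt code, so treat it as an assumption. The plain `str.format` sketch below shows the effect on a shortened fragment of the schema. The comparator list `"eq | ne"` is only an illustrative value.

```python
# Shortened fragment of the schema text, with the same brace escaping.
SCHEMA_FRAGMENT = (
    '{{{{\n'
    '    "query": string\n'
    '}}}}\n'
    '- `comp` ({allowed_comparators}): comparator'
)

# First pass: fills the named placeholder and halves the escaped braces.
first_pass = SCHEMA_FRAGMENT.format(allowed_comparators="eq | ne")

# Second pass: the text is treated as a template once more.
second_pass = first_pass.format()

print(first_pass)
print(second_pass)
```

With only double braces in the source text, the second pass would see a bare `{` and try to interpret the JSON body as a placeholder.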

In [29]:
# 2024-07-01
# Query (Indonesian): "Give me documents in the IKNB sector"
context = retriever.invoke("Berikan dokumen dengan sektor IKNB")
context
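
For a query like the one above, the query constructor is expected to produce a JSON snippet with a `query` string and a `filter` string, as described in `SCHEMA`. The sketch below parses a hypothetical model reply and applies a single `eq` comparison to in-memory metadata. It only illustrates the filter semantics; it is not how `PGVectorTranslator` or the database evaluates filters. The reply text and the second record are made up for illustration.

```python
import json
import re

FENCE = "`" * 3  # built programmatically so this example does not nest code fences

# Hypothetical LLM reply that follows the structured request schema.
llm_reply = (
    f"{FENCE}json\n"
    '{\n'
    '    "query": "dokumen",\n'
    '    "filter": "eq(\\"sector\\", \\"IKNB\\")"\n'
    '}\n'
    f"{FENCE}"
)


def parse_structured_request(reply):
    """Extract and decode the JSON object inside the markdown snippet."""
    match = re.search(FENCE + r"json\s*(.*?)" + FENCE, reply, re.DOTALL)
    if match is None:
        raise ValueError("no JSON snippet found")
    return json.loads(match.group(1))


def apply_eq_filter(filter_expr, records):
    """Evaluate one eq(attr, val) comparison (or NO_FILTER) against metadata dicts."""
    if filter_expr == "NO_FILTER":
        return list(records)
    match = re.fullmatch(r'eq\("([^"]+)",\s*"([^"]*)"\)', filter_expr)
    if match is None:
        raise ValueError(f"unsupported filter: {filter_expr}")
    attr, val = match.groups()
    return [r for r in records if r.get(attr) == val]


records = [
    {"doc_id": 418, "sector": "IKNB"},
    {"doc_id": 1, "sector": "Perbankan"},  # made-up record for contrast
]
request = parse_structured_request(llm_reply)
matching = apply_eq_filter(request["filter"], records)
```

With `verbose=True` on the retriever, the generated structured query should be logged, which makes it easy to compare the real filter against this expectation.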