# **Store Data to Vector Store (OJK)**

This notebook shows how to store documents in Redis. The [Load](#load) step differs for each data source (BI, OJK, and SIKEPO), so write your own `extract_all_documents_in_directory` function for each one.

## **Setup**

In [1]:
import nest_asyncio
nest_asyncio.apply()

from dotenv import load_dotenv
load_dotenv()

True

## **Config**

In [2]:
from utils.config import get_config
from utils.models import ModelName, get_model

config = get_config()

## **Define Model**

In [3]:
model_name = ModelName.AZURE_OPENAI
llm_model, embed_model = get_model(model_name=model_name, config=config)

## **Indexing**

In [4]:
documents_dir = './data/documents/'
pickle_path = './data/pickles/'
metadata_path = './data/metadata/files_metadata.csv'

LOAD_PICKLE = True

### **Load**

SIKEPO and BI documents are extracted differently from OJK ones, so write your own document extractor module for them.

In [5]:
from utils.documents_extractor.documents_extract_ojk import extract_all_documents_in_directory

if not LOAD_PICKLE:
    documents = extract_all_documents_in_directory(documents_dir, metadata_path, treshold=0.98)
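
As a starting point for a custom extractor, here is a minimal sketch of what such a function might look like. It is not the OJK implementation: it assumes the sources are plain `.txt` files (a real extractor would likely parse PDFs), that the metadata CSV has a `file_name` column, and it uses a hypothetical `Document` dataclass as a stand-in for the document class used in this notebook. It also omits the `treshold` argument, whose meaning is not shown here.

```python
import csv
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Document:
    """Hypothetical stand-in for the Document class used in this notebook."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def extract_all_documents_in_directory(documents_dir, metadata_path):
    """Read every .txt file in documents_dir and attach its CSV metadata row.

    Assumes the CSV has a 'file_name' column matching the file names.
    """
    with open(metadata_path, newline="", encoding="utf-8") as f:
        rows = {row["file_name"]: row for row in csv.DictReader(f)}

    documents = []
    for path in sorted(Path(documents_dir).glob("*.txt")):
        # Files without a metadata row get an empty metadata dict.
        metadata = dict(rows.get(path.name, {}))
        text = path.read_text(encoding="utf-8")
        documents.append(Document(page_content=text, metadata=metadata))
    return documents
```

Keeping the same function name and return type across BI, OJK, and SIKEPO extractors means the Split and Storing cells can stay unchanged.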

### **Split**

In [6]:
from utils.documents_split import document_splitter
import pickle


if not LOAD_PICKLE:
    all_splits = document_splitter(docs=documents)
    # Sort by document, then by page; the integer default keeps the sort keys comparable
    all_splits1 = sorted(all_splits, key=lambda x: (x.metadata['doc_id'], x.metadata.get('page_number', 0)))
    # Write the sorted splits to a pickle file (created if it does not exist)
    with open(pickle_path + 'documents3.pkl', 'wb') as file:
        pickle.dump(all_splits1, file)

# Always load the splits back from the pickle file (binary mode)
with open(pickle_path + 'documents3.pkl', 'rb') as file:
    all_splits = pickle.load(file)
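
The cell above follows a "build once, then reuse the pickle" pattern. A minimal sketch of that pattern as a reusable helper is shown below; `load_or_build` and its parameters are hypothetical names introduced for illustration, not part of `utils`.

```python
import pickle
from pathlib import Path


def load_or_build(cache_file, build_fn, use_cache=True):
    """Return the pickled object at cache_file, or build it and cache it."""
    cache_file = Path(cache_file)
    if use_cache and cache_file.exists():
        with cache_file.open("rb") as f:
            return pickle.load(f)

    # Cache miss (or caching disabled): compute the result and save it.
    result = build_fn()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as f:
        pickle.dump(result, f)
    return result
```

Because `build_fn` is only called on a cache miss, the expensive splitting step could be passed as a lambda and would be skipped whenever the pickle already exists.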

In [7]:
len(all_splits)

39109

In [16]:
all_splits[10195]

Document(metadata={'doc_id': 2445, 'title': 'Peraturan Bank Indonesia Nomor 11/13/PBI/2009', 'sector': 'NO DATA', 'subsector': 'NO DATA', 'regulation_type': 'NO DATA', 'regulation_number': '11/13/PBI/2009', 'effective_date': '17 April 2009', 'file_url': 'https://www.ojk.go.id/id/kanal/perbankan/regulasi/peraturan-bank-indonesia/Documents/145.pdf', 'page_number': 33}, page_content="metadata={'doc_id': 2445, 'title': 'Peraturan Bank Indonesia Nomor 11/13/PBI/2009', 'sector': 'NO DATA', 'subsector': 'NO DATA', 'regulation_type': 'NO DATA', 'regulation_number': '11/13/PBI/2009', 'effective_date': '17 April 2009', 'file_url': 'https://www.ojk.go.id/id/kanal/perbankan/regulasi/peraturan-bank-indonesia/Documents/145.pdf', 'page_number': 33}\n- 4 - \nkepengurusan oleh Direksi dan tidak menghilangkan tanggung jawab \nDireksi sebagai pemutus. \n \nPasal 7 \n \nHuruf a \n \nCukup jelas. \nHuruf b \n \nCukup jelas. \nHuruf c \n \nCukup jelas. \nHuruf d \n \nYang dimaksud dengan hubungan keluarga s

### **Storing**

In [19]:
from databases.vector_store import RedisIndexManager

redis = RedisIndexManager(index_name='ojk', embed_model=embed_model, config=config, db_id=0)
# redis_bi = RedisIndexManager(index_name='bi', embed_model=embed_model, config=config, db_id=0)

# redis.delete_index()
# If this fails with 'Redis failed to connect: Index does not exist.', set the contents of start_store_idx_indexname.txt to 0
redis.store_vector_index(docs=all_splits, batch_size=30)
vector_store = redis.load_vector_index()

Start loading from idx: 38226
Loaded 38227-38256 documents
Loaded 38257-38286 documents
Loaded 38287-38316 documents
Loaded 38317-38346 documents
Loaded 38347-38376 documents
Loaded 38377-38406 documents
Loaded 38407-38436 documents
Loaded 38437-38466 documents
Loaded 38467-38496 documents
Loaded 38497-38526 documents
Loaded 38527-38556 documents
Loaded 38557-38586 documents
Loaded 38587-38616 documents
Loaded 38617-38646 documents
Loaded 38647-38676 documents
Loaded 38677-38706 documents
Loaded 38707-38736 documents
Loaded 38737-38766 documents
Loaded 38767-38796 documents
Loaded 38797-38826 documents
Loaded 38827-38856 documents
Loaded 38857-38886 documents
Loaded 38887-38916 documents
Loaded 38917-38946 documents
Loaded 38947-38976 documents
Loaded 38977-39006 documents
Loaded 39007-39036 documents
Loaded 39037-39066 documents
Loaded 39067-39096 documents
Loaded 39097-39109 documents
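
The log above suggests that storing resumes from a saved index and proceeds in batches of 30. The internals of `RedisIndexManager` are not shown in this notebook, so the following is only a sketch of how a resumable batched store could work. The function `store_in_batches`, the `store_batch` callable, and the checkpoint file format (a single integer) are all assumptions made for illustration.

```python
from pathlib import Path


def store_in_batches(docs, store_batch, checkpoint_file, batch_size=30):
    """Store docs in batches, resuming from the index saved in checkpoint_file.

    store_batch is any callable that persists a list of documents
    (hypothetical; in the notebook that work is done by RedisIndexManager).
    Returns the number of documents stored during this call.
    """
    checkpoint = Path(checkpoint_file)
    start = int(checkpoint.read_text().strip()) if checkpoint.exists() else 0
    print(f"Start loading from idx: {start}")

    for begin in range(start, len(docs), batch_size):
        batch = docs[begin:begin + batch_size]
        store_batch(batch)
        # Record progress only after the batch was stored successfully.
        checkpoint.write_text(str(begin + len(batch)))
        print(f"Loaded {begin + 1}-{begin + len(batch)} documents")

    return max(len(docs) - start, 0)
```

Under this design, writing `0` into the checkpoint file forces a full re-store on the next run, which is one plausible reason for the reset described in the comment on `store_vector_index`.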


In [7]:
from retriever.retriever_bi_ojk.retriever_bi_ojk import get_combined_retriever_bi_ojk
from retriever.retriever_bi.retriever_bi import get_retriever_bi

# retriever = get_combined_retriever_bi_ojk(vector_store_bi=vector_store, vector_store_ojk=vector_store, top_n=7, top_k=20, llm_model=llm_model, embed_model=embed_model, config=config)