# **Store Data to Vector Store (OJK)**

This notebook shows how to store documents in Redis. The [Load](#load) step differs for each data source (BI, OJK, and SIKEPO), so write your own `extract_all_documents_in_directory` function for each one.

## **Setup**

In [1]:
from dotenv import load_dotenv
load_dotenv()

True

## **Config**

In [2]:
from utils.config import get_config
from utils.models import ModelName, get_model

config = get_config()

## **Define Model**

In [9]:
model_name = ModelName.AZURE_OPENAI
llm_model, embed_model = get_model(model_name=model_name, config=config)

## **Indexing**

In [3]:
documents_dir = './data/documents/'
pickle_path = './data/pickles/'
metadata_path = './data/metadata/files_metadata.csv'

LOAD_PICKLE = True

### **Load**

SIKEPO and BI documents are extracted differently, so create your own document extractor module for those sources :D.

In [4]:
from utils.documents_extractor.documents_extract_ojk import extract_all_documents_in_directory

if not LOAD_PICKLE:
    documents = extract_all_documents_in_directory(documents_dir, metadata_path, treshold=0.98)
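Since each source needs its own extractor, the general shape of such a function might look like the sketch below. This is not the OJK extractor used above: the `Document` dataclass is a stand-in for the document objects shown in this notebook, the `file_name` metadata column and the `read_text_pages` reader are hypothetical, and the `treshold` parameter is left out because its meaning is not shown here.

```python
import csv
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Iterable


@dataclass
class Document:
    """Minimal stand-in for the Document objects shown in this notebook."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def read_text_pages(path: Path) -> Iterable[str]:
    """Hypothetical page reader: treats form feeds in a text file as page breaks."""
    return path.read_text(encoding="utf-8").split("\f")


def extract_all_documents_in_directory(
    documents_dir: str,
    metadata_path: str,
    read_pages: Callable[[Path], Iterable[str]] = read_text_pages,
) -> list[Document]:
    # Index the metadata rows by file name (the `file_name` column is an assumption).
    with open(metadata_path, newline="", encoding="utf-8") as f:
        metadata_by_file = {row["file_name"]: row for row in csv.DictReader(f)}

    documents = []
    for path in sorted(Path(documents_dir).iterdir()):
        if not path.is_file():
            continue
        base_metadata = metadata_by_file.get(path.name, {})
        # Emit one Document per page, carrying the file-level metadata along.
        for page_number, text in enumerate(read_pages(path), start=1):
            documents.append(
                Document(
                    page_content=text,
                    metadata={**base_metadata, "page_number": page_number},
                )
            )
    return documents
```

A source-specific version would swap `read_pages` for a reader that understands that source's file format, while keeping the same return type so the Split step does not need to change.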

### **Split**

In [5]:
from utils.documents_split import document_splitter
import pickle


if not LOAD_PICKLE:
    all_splits = document_splitter(docs=documents)
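    # Sort by document, then by page; the default is the int 0 so it compares cleanly with integer page numbers
    all_splits1 = sorted(all_splits, key=lambda x: (x.metadata['doc_id'], x.metadata.get('page_number', 0)))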
    # Write the sorted splits to a pickle file (created if it does not exist)
    with open(pickle_path + 'documents3.pkl', 'wb') as file:
        pickle.dump(all_splits1, file)

# Read the splits back from the pickle file in binary mode
with open(pickle_path + 'documents3.pkl', 'rb') as file:
    # Deserialize the list of splits
    all_splits = pickle.load(file)
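The cell above follows a "compute once, then reload from disk" pattern controlled by `LOAD_PICKLE`. A minimal sketch of the same idea as a reusable helper is shown below; `load_or_build` and its parameters are hypothetical names, not part of this project's `utils` package.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def load_or_build(cache_file: str, build: Callable[[], Any], use_cache: bool = True) -> Any:
    """Return the cached object if allowed and present; otherwise build and cache it."""
    path = Path(cache_file)
    if use_cache and path.exists():
        # Only unpickle files you created yourself: pickle can execute arbitrary code.
        with path.open("rb") as f:
            return pickle.load(f)

    result = build()
    # Create the cache directory on first use, then persist the fresh result.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result
```

Because `build` is only called on a cache miss, the expensive extraction and splitting steps can be wrapped in a function or lambda and skipped entirely when the pickle already exists.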

In [6]:
len(all_splits)

39109

In [7]:
all_splits[0]

Document(metadata={'doc_id': 2212, 'title': 'Surat Edaran Bank Indonesia Nomor 9/30/DPNP', 'sector': 'NO DATA', 'subsector': 'NO DATA', 'regulation_type': 'NO DATA', 'regulation_number': '9/30/DPNP/2007', 'effective_date': '12 Desember 2007', 'file_url': 'https://www.ojk.go.id/id/kanal/perbankan/regulasi/surat-edaran-bank-indonesia/Documents/187.pdf', 'page_number': 1}, page_content="metadata={'doc_id': 2212, 'title': 'Surat Edaran Bank Indonesia Nomor 9/30/DPNP', 'sector': 'NO DATA', 'subsector': 'NO DATA', 'regulation_type': 'NO DATA', 'regulation_number': '9/30/DPNP/2007', 'effective_date': '12 Desember 2007', 'file_url': 'https://www.ojk.go.id/id/kanal/perbankan/regulasi/surat-edaran-bank-indonesia/Documents/187.pdf', 'page_number': 1}\n \n \nNo. 9/30/DPNP \n \n \n \n     \n    Jakarta, 12 Desember 2007 \n \n \nS U R A T    E D A R A N \n \nKepada \nSEMUA BANK UMUM \nDI  INDONESIA \n \nPerihal :  Penerapan Manajemen Risiko dalam Penggunaan Teknologi \nInformasi oleh Bank Umum \n \n

### **Storing**

In [10]:
from databases.vector_store import RedisIndexManager

redis = RedisIndexManager(index_name='ojk', embed_model=embed_model, config=config, db_id=0)
# redis_bi = RedisIndexManager(index_name='bi', embed_model=embed_model, config=config, db_id=0)

# redis.delete_index()
# If you get the error 'Redis failed to connect: Index does not exist.', change the contents of start_store_idx_indexname.txt to 0
redis.store_vector_index(docs=all_splits, batch_size=200)
vector_store = redis.load_vector_index()

Start loading from idx: 0
Loaded 1-200 documents
Loaded 201-400 documents
Loaded 401-600 documents
Loaded 601-800 documents
Loaded 801-1000 documents
Loaded 1001-1200 documents
Loaded 1201-1400 documents
Loaded 1401-1600 documents
Loaded 1601-1800 documents
Loaded 1801-2000 documents
Loaded 2001-2200 documents
Loaded 2201-2400 documents
Loaded 2401-2600 documents
Loaded 2601-2800 documents
Loaded 2801-3000 documents
Loaded 3001-3200 documents
Loaded 3201-3400 documents
Loaded 3401-3600 documents
Loaded 3601-3800 documents
Loaded 3801-4000 documents
Loaded 4001-4200 documents
Loaded 4201-4400 documents
Loaded 4401-4600 documents
Loaded 4601-4800 documents
Loaded 4801-5000 documents
Loaded 5001-5200 documents
Loaded 5201-5400 documents
Loaded 5401-5600 documents
Loaded 5601-5800 documents
Loaded 5801-6000 documents
Loaded 6001-6200 documents
Loaded 6201-6400 documents
Loaded 6401-6600 documents
Loaded 6601-6800 documents
Loaded 6801-7000 documents
Loaded 7001-7200 documents
Loaded 7201-7

KeyboardInterrupt: 
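The run above was interrupted partway through, and the note about `start_store_idx_indexname.txt` suggests that the start index is kept in a text file so a later run can resume. The sketch below illustrates that kind of checkpointed batch loop in general terms. It is an assumption about the approach, not the actual `RedisIndexManager` implementation; `store_in_batches`, `store_batch`, and `checkpoint_file` are hypothetical names.

```python
from pathlib import Path
from typing import Any, Callable, Sequence


def store_in_batches(
    docs: Sequence[Any],
    store_batch: Callable[[Sequence[Any]], None],
    checkpoint_file: str,
    batch_size: int = 200,
) -> int:
    """Store docs batch by batch, recording progress so an interrupted run can resume."""
    checkpoint = Path(checkpoint_file)
    # Resume from the saved index if a checkpoint exists; otherwise start at 0.
    start = int(checkpoint.read_text().strip() or 0) if checkpoint.exists() else 0
    print(f"Start loading from idx: {start}")

    done = start
    for begin in range(start, len(docs), batch_size):
        end = min(begin + batch_size, len(docs))
        store_batch(docs[begin:end])
        # Write the checkpoint only after the batch succeeded.
        checkpoint.write_text(str(end))
        done = end
        print(f"Loaded {begin + 1}-{end} documents")
    return done
```

With this design, resetting the checkpoint file to 0 forces a full reload, which is why doing so is the suggested fix when the index has been deleted but the saved start index still points past the beginning.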

In [7]:
from retriever.retriever_bi_ojk.retriever_bi_ojk import get_combined_retriever_bi_ojk
from retriever.retriever_bi.retriever_bi import get_retriever_bi

# retriever = get_combined_retriever_bi_ojk(vector_store_bi=vector_store, vector_store_ojk=vector_store, top_n=7, top_k=20, llm_model=llm_model, embed_model=embed_model, config=config)