# **Store Data to Vector Store (OJK)**

This notebook shows how to store documents in Redis. The [Load](#load) step differs for each data source (BI, OJK, and SIKEPO), so write your own `extract_all_documents_in_directory` function for each one.

## **Setup**

In [1]:
import nest_asyncio
nest_asyncio.apply()

from dotenv import load_dotenv
load_dotenv()

True

## **Config**

In [2]:
from utils.config import get_config
from utils.models import ModelName, get_model

config = get_config()

## **Define Model**

In [3]:
model_name = ModelName.AZURE_OPENAI
llm_model, embed_model = get_model(model_name=model_name, config=config)

## **Indexing**

In [4]:
documents_dir = './data/documents/'
pickle_path = './data/pickles/'
metadata_path = './data/metadata/files_metadata.csv'

LOAD_PICKLE = True

### **Load**

SIKEPO and BI documents are extracted differently from OJK documents, so write your own document extractor file for each of them.
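As a rough starting point for a custom extractor, the sketch below shows one possible shape for such a function. Everything in it is an assumption made for illustration: the `Doc` stand-in class, the `read_pages` helper (which splits plain text on form-feed characters instead of parsing PDFs), and the `file_name` column in the metadata CSV. It also omits the `treshold` argument used in the OJK extractor call, since its meaning is not described here.

```python
import csv
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Doc:
    """Stand-in for the Document class used by the pipeline (assumption)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def read_pages(path: Path) -> list[str]:
    """Hypothetical page reader: treats form-feed characters as page breaks.

    Swap this for a real parser appropriate to the data source.
    """
    return path.read_text(encoding="utf-8").split("\f")


def extract_all_documents_in_directory(documents_dir: str, metadata_path: str) -> list[Doc]:
    """Return one Doc per page, tagged with the metadata row for its file."""
    with open(metadata_path, newline="", encoding="utf-8") as f:
        # Assumes the CSV has a 'file_name' column identifying each file.
        rows = {row["file_name"]: row for row in csv.DictReader(f)}

    docs = []
    for path in sorted(Path(documents_dir).iterdir()):
        meta = rows.get(path.name)
        if meta is None or not path.is_file():
            continue  # skip directories and files without a metadata row
        for page_number, text in enumerate(read_pages(path), start=1):
            docs.append(Doc(page_content=text, metadata={**meta, "page_number": page_number}))
    return docs
```

Keeping the same function name and return shape across sources means the Split and Storing cells can stay unchanged; only the import line needs to point at the right extractor module.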

In [5]:
from utils.documents_extractor.documents_extract_ojk import extract_all_documents_in_directory

if not LOAD_PICKLE:
    documents = extract_all_documents_in_directory(documents_dir, metadata_path, treshold=0.98)

### **Split**

In [6]:
from utils.documents_split import document_splitter
import pickle


if not LOAD_PICKLE:
    all_splits = document_splitter(docs=documents)
    # Sort by document, then page; default to 0 (an int, like page_number) so the keys stay comparable
    all_splits1 = sorted(all_splits, key=lambda x: (x.metadata['doc_id'], x.metadata.get('page_number', 0)))
    # Serialize the sorted splits; the file is created if it does not exist
    with open(pickle_path + 'documents1.pkl', 'wb') as file:
        pickle.dump(all_splits1, file)

# Load the cached splits (pickle files must be opened in binary mode)
with open(pickle_path + 'documents1.pkl', 'rb') as file:
    all_splits = pickle.load(file)
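The dump-then-load pattern in this cell can also be wrapped in a small helper. The sketch below is one possible way to do that, not part of the project's `utils`; the name `load_or_build` and its parameters are hypothetical.

```python
import pickle
from pathlib import Path


def load_or_build(cache_file, build, use_cache=True):
    """Return the cached object if allowed and present; otherwise build and cache it."""
    cache_file = Path(cache_file)
    if use_cache and cache_file.exists():
        with cache_file.open("rb") as f:
            return pickle.load(f)
    result = build()  # only called when the cache is skipped or missing
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as f:
        pickle.dump(result, f)
    return result
```

Unlike the cell above, this version falls back to rebuilding when the pickle file is missing instead of failing on `open`. As with any pickle file, only load files you created yourself, because unpickling can execute arbitrary code.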

In [7]:
len(all_splits)

113052

In [8]:
all_splits[0]

Document(metadata={'doc_id': 1, 'title': 'Dasar Penilaian Investasi Dana Pensiun', 'sector': 'IKNB', 'subsector': 'Dana Pensiun', 'regulation_type': 'Surat Edaran OJK', 'regulation_number': '4/SEOJK.05/2024', 'effective_date': '1 Juli 2024', 'file_url': 'https://www.ojk.go.id/id/regulasi/Documents/Pages/Dasar-Penilaian-Investasi-Dana-Pensiun/SEOJK%204-SEOJK.05-2024%20Dasar%20Penilaian%20Investasi%20Dana%20Pensiun.pdf', 'page_number': 1}, page_content="metadata={'doc_id': 1, 'title': 'Dasar Penilaian Investasi Dana Pensiun', 'sector': 'IKNB', 'subsector': 'Dana Pensiun', 'regulation_type': 'Surat Edaran OJK', 'regulation_number': '4/SEOJK.05/2024', 'effective_date': '1 Juli 2024', 'file_url': 'https://www.ojk.go.id/id/regulasi/Documents/Pages/Dasar-Penilaian-Investasi-Dana-Pensiun/SEOJK%204-SEOJK.05-2024%20Dasar%20Penilaian%20Investasi%20Dana%20Pensiun.pdf', 'page_number': 1}\n \n \n \n \n \nYth.  \nPengurus Dana Pensiun \ndi tempat. \n \nSALINAN \nSURAT EDARAN OTORITAS JASA KEUANGAN \n

### **Storing**

In [10]:
from databases.vector_store import RedisIndexManager

redis = RedisIndexManager(index_name='ojk', embed_model=embed_model, config=config, db_id=0)

# redis.delete_index()
redis.store_vector_index(docs=all_splits, batch_size=100) # If you get the error 'Redis failed to connect: Index does not exist.', change the contents of start_store_idx_indexname.txt to 0
vector_store = redis.load_vector_index()

Start loading from idx: 0
Loaded 1-100 documents
Loaded 101-200 documents
Loaded 201-300 documents
Loaded 301-400 documents
Loaded 401-500 documents
Loaded 501-600 documents
Loaded 601-700 documents
Loaded 701-800 documents
Loaded 801-900 documents
Loaded 901-1000 documents
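The log above, together with the `start_store_idx_indexname.txt` file mentioned in the storing cell, suggests that storing proceeds in batches and resumes from a saved start index. The internals of `RedisIndexManager` are not shown here, so the following is only a minimal sketch of that general pattern; `store_in_batches`, `store_batch`, and `checkpoint_file` are hypothetical names.

```python
from pathlib import Path


def store_in_batches(docs, store_batch, checkpoint_file, batch_size=100):
    """Store docs in batches, recording progress so a rerun resumes where it stopped."""
    checkpoint = Path(checkpoint_file)
    # A missing checkpoint means nothing has been stored yet.
    start = int(checkpoint.read_text().strip()) if checkpoint.exists() else 0
    print(f"Start loading from idx: {start}")
    for i in range(start, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        store_batch(batch)  # hypothetical callback that writes one batch to the store
        end = i + len(batch)
        checkpoint.write_text(str(end))  # persist progress only after the batch succeeds
        print(f"Loaded {i + 1}-{end} documents")
```

Under this pattern, a checkpoint that still holds a non-zero index after the index itself has been deleted would make the next run resume against an index that no longer exists, which may be why resetting the file to 0 resolves the 'Index does not exist' error.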
