# Vector Embeddings for NCVS Statutory

The purpose of this notebook is to evaluate the embeddings of different models. 

Models used in this experiment are: 

- `hkunlp/instructor-large`
- `jinaai/jina-embeddings-v2-base-en`
- `sentence-transformers/all-MiniLM-L6-v2`

Questions that arose while experimenting:

- Q: What do `chunk_size` and `chunk_overlap` do?
- Q: Does a smaller `chunk_size` mean a bigger database, since a smaller chunk size translates to a larger number of chunks?
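
Both questions can be explored with a small sketch. The helper below, `split_with_overlap`, is a hypothetical fixed-width sliding window written for illustration; it is not the algorithm used by `RecursiveCharacterTextSplitter`, which also tries to break on separators such as paragraphs and sentences. Under that simplification, `chunk_size` caps the length of each chunk and `chunk_overlap` is the number of trailing characters repeated at the start of the next chunk.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Simplified sliding-window splitter (illustrative, not LangChain's algorithm)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


# 100-character dummy text (hypothetical data, stands in for a PDF page)
sample = "".join(chr(ord("a") + i % 26) for i in range(100))

large = split_with_overlap(sample, chunk_size=50, chunk_overlap=10)
small = split_with_overlap(sample, chunk_size=20, chunk_overlap=10)

print(len(large), len(small))          # prints: 3 9
print(large[0][-10:] == large[1][:10]) # prints: True (the overlap is shared)
```

In this toy case, shrinking `chunk_size` from 50 to 20 triples the number of chunks. Since each chunk is stored as one embedding vector of fixed dimension, a smaller `chunk_size` should generally mean more vectors and therefore a larger vector store for the same source text.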

# Document loading and processing

In [None]:
!pip -q install langchain langchain-community langchain-chroma pypdf InstructorEmbedding sentence_transformers==2.2.2 sentencepiece xformers

In [1]:
from tqdm.auto import tqdm

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import (PyPDFLoader, DirectoryLoader)

from InstructorEmbedding import INSTRUCTOR

In [2]:
# load single file
pypdfloader = PyPDFLoader('../assets/ncvs_documents/CHAPTER-4_LIFE_SAVING_APPLIANCES_v.4.4_1708919237619_0.pdf')
documents = pypdfloader.load()

print(documents[0].page_content)

Chapter IV Live - Saving Appliances Bab IV  Perlengkapan Keselamatan NCVS Indonesia
IV - 1REPUBLIK INDONESIA
KEMENTERIAN PERHUBUNGAN
STANDAR KAPAL NON-KONVENSI 
BERBENDERA INDONESIA
BAB IV
BAB IV
PERLENGKAPAN KESELAMATANREPUBLIK INDONESIA
MINISTRY OF TRANSPORTATION
NON-CONVENTION VESSEL STANDARD
INDONESIAN FLAGGED
CHAPTER IV
CHAPTER IV
LIFE-SA VING APPLIANCES
Copyright © 2010 Ministry of Transportation, Republic of Indonesia Hak cipta © 2010 Kementerian Perhubungan, Republik Indonesia
PERLENGKAPAN KESELAMATAN
LIFE - SAVING APPLIANCES STANDAR KAPAL NON-KONVENSI 
BERBENDERA INDONESIA
NON-CONVENTION  VESSEL STANDARD
INDONESIAN FLAGGEDREPUBLIK INDONESIA
KEMENTERIAN PERHUBUNGAN
MINISTRY OF TRANSPORTATION
BAB      
CHAPTER IV
Hak cipta ©2009 Kementerian Perhubungan, Republik Indonesia
Edisi Pertama 2009



In [3]:
# let's stick to 512 chunk size for now, we can change this later on
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
texts = text_splitter.split_documents(documents)
len(texts)

548

# Embedding Model Magic

In [4]:
from langchain_community.embeddings import HuggingFaceInstructEmbeddings
from langchain_chroma import Chroma

cpu_model = {"device":"cpu"}
cuda_model = {"device":"cuda"}

In [None]:
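# INSTRUCTOR embedding model, loaded on the GPU
instructor_l = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-large",
                                             model_kwargs=cuda_model)

# embed every chunk and persist the vectors to a local Chroma store
instructor_db = Chroma.from_documents(documents=texts,
                                      embedding=instructor_l,
                                      persist_directory='instructorDB')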

In [None]:
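# NOTE: this is not an INSTRUCTOR-family model; the generic
# HuggingFaceEmbeddings wrapper may be a better fit here
jina_v2 = HuggingFaceInstructEmbeddings(model_name="jinaai/jina-embeddings-v2-base-en",
                                        model_kwargs=cuda_model)
jina_db = Chroma.from_documents(documents=texts,
                                embedding=jina_v2,
                                persist_directory='jinaDB')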

In [None]:
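# NOTE: this is not an INSTRUCTOR-family model; the generic
# HuggingFaceEmbeddings wrapper may be a better fit here
minilm = HuggingFaceInstructEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2",
                                       model_kwargs=cuda_model)
minilm_db = Chroma.from_documents(documents=texts,
                                  embedding=minilm,
                                  persist_directory='minilmDB')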
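
Each store built above answers a query by embedding the query text and ranking the stored chunk vectors by similarity. The sketch below illustrates that ranking step with plain Python and cosine similarity. The three-dimensional vectors, the chunk ids, and the `top_k` helper are all hypothetical; real embeddings have hundreds of dimensions, and the distance metric a Chroma collection actually uses depends on its configuration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k chunks most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id)
              for chunk_id, vec in chunk_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [chunk_id for _, chunk_id in scored[:k]]


# toy embeddings (hypothetical values, not produced by any of the models above)
toy_chunks = {
    "lifebuoy": [0.9, 0.1, 0.0],
    "liferaft": [0.7, 0.3, 0.1],
    "copyright": [0.0, 0.1, 0.9],
}
toy_query = [1.0, 0.2, 0.0]

ranked = top_k(toy_query, toy_chunks, k=2)
print(ranked)  # prints: ['lifebuoy', 'liferaft']
```

Comparing the ranked ids that different models return for the same query is one simple way to see how their embeddings differ on this document.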

# Embedding Evaluation