<a href="https://colab.research.google.com/github/achrip/fictional-octo-broccoli/blob/main/embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -Uq \
langchain langchain-community langchain-huggingface langchain-chroma \
pypdf transformers accelerate Xformers InstructorEmbedding \
sentencepiece bitsandbytes tiktoken chromadb typer semantic_split


# Document Pre-Processing

- Importing PDFs
- Parsing PDFs
- Splitting into chunks

In [None]:
import os

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from semantic_split import SimilarSentenceSplitter, SentenceTransformersSimilarity, SpacySentenceSplitter

In [None]:
path = "/content/drive/MyDrive/ncvs_documents/"
loader = DirectoryLoader(path=path,
                         glob="*.pdf",
                         loader_cls=PyPDFLoader)
documents = loader.load()

`documents` is a `List[Document]`. Each `Document` carries two attributes: `page_content` (the extracted text) and `metadata` (a dictionary describing where the text came from), as shown below.

In [None]:
documents[36]

Document(page_content='Chapter VIII Manning Bab VIII  Pengawakan NCVS Indonesia\nVIII - 379.3.10. 2nd Class Engineer \n   c. tugas yang berkaitan dengan penyerahan  \n\t \t \t \t dinas\tjaga\tprosedur\tkeselamatan\tdan\t\t \t\n    darurat; mengubah semua sistem remote  \n    control;\n   kewaspadaan keselamatan yang harus dilaku\n\t \t \t kan\tselama\tdinas\tjaga\tdan\ttindakan\tsegera\t\t\n\t \t \t yang\tharus\tdilaksanakan\tjika\tterjadi\tkeba-\n   karan atau kecelakaan, khususnya yang   \n   berkaitan dengan sistem bahan bakar.\n4) Mengoperasikan Sistem pompa: \n sistem pompa dan a. operasi pemompaan rutin \n sistem pengendali b. operasi sistem pompa bilga, ballast dan \n terkait.  muatan\nPengendalian operasional kapal dan kepedulian terhadap kesela-\nmatan orang di atas kapal pada tingkat operasional\n5) memastikan peme- Pencegahan pencemaran lingkungan laut\n nuhan persyaratan dan prosedur anti pencemaran \n pencegahan pence-\n maran  Pengetahuan mengenai kewaspadaan yang  \n   h
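A quick sanity check after loading is to count how many page-level documents came from each file. The sketch below relies only on the `metadata` attribute and assumes the `source` key that `PyPDFLoader` typically sets. `pages_per_source` and the stand-in objects are hypothetical names used for illustration, not part of the notebook.

```python
from collections import Counter
from types import SimpleNamespace


def pages_per_source(docs):
    """Count how many page-level documents were loaded from each source file."""
    return Counter(doc.metadata.get("source", "<unknown>") for doc in docs)


# Stand-in objects exposing the same two attributes as a LangChain Document,
# used here only so the sketch runs without the actual PDFs.
sample_docs = [
    SimpleNamespace(page_content="page one", metadata={"source": "a.pdf", "page": 0}),
    SimpleNamespace(page_content="page two", metadata={"source": "a.pdf", "page": 1}),
    SimpleNamespace(page_content="page one", metadata={"source": "b.pdf", "page": 0}),
]

counts = pages_per_source(sample_docs)
print(counts)
```

In the notebook itself, `pages_per_source(documents)` should list every PDF in the folder; a missing file name would suggest the loader skipped or failed on it.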

## Splitting into Chunks

Two splitters are recommended here: `RecursiveCharacterTextSplitter` from LangChain and [`semantic-split`](https://github.com/agamm/semantic-split) by Agamm.

> Known issues:  
> `semantic-split` takes plain strings as input, so the page metadata attached to each `Document` _could_ be lost during splitting. A workaround is in progress.

In [None]:
recursive_splitter = RecursiveCharacterTextSplitter(chunk_size=512,
                                                   chunk_overlap=64)
semantic_splitter = SimilarSentenceSplitter(similarity_model=SentenceTransformersSimilarity(),
                                           sentence_splitter=SpacySentenceSplitter())
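To build intuition for `chunk_size=512` and `chunk_overlap=64`, here is a deliberately simplified sliding-window chunker. It is not the algorithm `RecursiveCharacterTextSplitter` uses, since that splitter first tries to break on separators such as paragraphs and lines. It only illustrates how consecutive chunks share their boundary characters. `sliding_window_chunks` is a hypothetical helper.

```python
def sliding_window_chunks(text, chunk_size=512, chunk_overlap=64):
    """Cut text into fixed-size character windows that overlap their neighbours."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the final window is not just a repeat of the overlap.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


# Tiny values make the overlap visible: each chunk starts with the
# last character of the previous one.
chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.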



In [None]:
recursive_text = recursive_splitter.split_documents(documents)

# semantic-split works on plain strings, so pass each page's text, not the Document itself
semantic_text = [semantic_splitter.split(doc.page_content) for doc in documents]
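One possible way to keep page metadata when a splitter only accepts strings is to split page by page and copy the metadata onto every resulting chunk. This is a sketch, not the workaround the notebook has in progress. It assumes the splitter returns a list of sentence groups (each group a list of strings), and `split_keeping_metadata` and `toy_split` are hypothetical names.

```python
from types import SimpleNamespace


def split_keeping_metadata(docs, split_fn):
    """Split each document's text and copy its metadata onto every chunk.

    split_fn takes a string and is assumed to return a list of sentence
    groups, where each group is a list of strings.
    """
    chunks = []
    for doc in docs:
        for group in split_fn(doc.page_content):
            chunks.append({
                "page_content": " ".join(group),
                "metadata": dict(doc.metadata),  # copy so chunks do not share one dict
            })
    return chunks


def toy_split(text):
    """Hypothetical stand-in splitter: one group per sentence."""
    return [[sentence.strip()] for sentence in text.split(".") if sentence.strip()]


page = SimpleNamespace(page_content="First point. Second point.",
                       metadata={"source": "a.pdf", "page": 3})
chunks = split_keeping_metadata([page], toy_split)
print(chunks)
```

In the notebook, the call would look like `split_keeping_metadata(documents, semantic_splitter.split)`, and each dictionary could then be turned back into a `Document` before embedding.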

# Export to ChromaDB

The embeddings are stored in ChromaDB. Given an embedding model, Chroma can also act as a retriever over the stored chunks.

In [None]:
from langchain_chroma import Chroma
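A minimal sketch of the export step follows, assuming the chunks in `recursive_text` and a Hugging Face sentence-embedding model. The model name and the `chroma_db` directory are assumptions chosen for illustration; the notebook does not say which embedding model it uses. `build_vector_store` is a hypothetical helper, and its imports are kept inside the function so that defining it does not load the heavy dependencies.

```python
def build_vector_store(
    chunks,
    persist_directory="chroma_db",  # assumed location for the on-disk index
    model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",  # assumed model
):
    """Embed the chunks and store them in a persistent Chroma collection."""
    # Imported lazily so the function can be defined before the packages load.
    from langchain_chroma import Chroma
    from langchain_huggingface import HuggingFaceEmbeddings

    embeddings = HuggingFaceEmbeddings(model_name=model_name)
    return Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=persist_directory,
    )
```

Calling `vector_store = build_vector_store(recursive_text)` and then `vector_store.as_retriever(search_kwargs={"k": 4})` would give a retriever that returns the four most similar chunks for a query. A multilingual model is suggested here only because the sample page contains Indonesian text.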

