<a href="https://colab.research.google.com/github/achrip/fictional-octo-broccoli/blob/main/src/embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -Uq \
langchain langchain-community langchain-huggingface langchain-chroma \
pypdf transformers accelerate Xformers InstructorEmbedding \
sentencepiece bitsandbytes tiktoken chromadb typer semantic_split


# Document Pre-Processing

- Importing PDFs
- Parsing PDFs
- Splitting into chunks

In [2]:
import os

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from semantic_split import SimilarSentenceSplitter, SentenceTransformersSimilarity, SpacySentenceSplitter

In [15]:
path = "/content/drive/MyDrive/ncvs_documents/"
# To load every PDF in the folder instead of a single file:
# loader = DirectoryLoader(path=path,
#                          glob="*.pdf",
#                          loader_cls=PyPDFLoader)
loader = PyPDFLoader(file_path=os.path.join(path, "CHAPTER-4_LIFE_SAVING_APPLIANCES_v.4.4_1708919237619_0.pdf"))

documents = loader.load()
len(documents)

39

`documents` is a `List[Document]`. Each `Document` carries two fields, `page_content` (the page text) and `metadata` (source path and page number), as shown below.

In [16]:
documents[13]

Document(page_content='IV - 14Chapter IV Live Saving Appliances Bab IV Perlengkapan Keselamatan NCVS Indonesia\nto\tthe\tspecified\t load\tline\tin\torder\tto\tmeasure\tthe\t\nfreeboard.\n6) The freshwater freeboard for the life boat with a \nstandard sheer shall be at least 3% of the length of the boat, but shall not be less than 7; 9.5;13 and 16.5 cm respectively for the life boats with depths of 31, 46, 61 and 76 cms.   \n Section 3  DA VITS AND LAUNCHING ARRANGEMENTS  \n    FOR  LIFE BOATS\n3.1. Type of davits3.1.1. Davits for life boats shall be:1)\t Of\t a\tluffing\tgravity\tof\tmechanically\t controlled\t\nsingle arm type, where the mass of the fully equipped life boat, manned and ready for launch -\ning, is less than 2.3 tonnes; or\n2) of gravity type, where such mass exceeds 2.3 \ntonnes.\n3) The mass of a life boat when fully loaded with persons and equipment shall not exceed 20.3 tonnes, a person being deemed for the purposes of this provision to be 75 kilograms.\n3.1.2. Str

Collapse escape characters (`\n`, `\t`) and runs of whitespace into single spaces.

In [17]:
import re

for doc in documents:
  # The raw string avoids an invalid-escape warning for "\s"
  doc.page_content = re.sub(r"\s+", " ", doc.page_content)

In [24]:
documents[13]

Document(page_content='IV - 14Chapter IV Live Saving Appliances Bab IV Perlengkapan Keselamatan NCVS Indonesia to the specified load line in order to measure the freeboard. 6) The freshwater freeboard for the life boat with a standard sheer shall be at least 3% of the length of the boat, but shall not be less than 7; 9.5;13 and 16.5 cm respectively for the life boats with depths of 31, 46, 61 and 76 cms. Section 3 DA VITS AND LAUNCHING ARRANGEMENTS FOR LIFE BOATS 3.1. Type of davits3.1.1. Davits for life boats shall be:1) Of a luffing gravity of mechanically controlled single arm type, where the mass of the fully equipped life boat, manned and ready for launch - ing, is less than 2.3 tonnes; or 2) of gravity type, where such mass exceeds 2.3 tonnes. 3) The mass of a life boat when fully loaded with persons and equipment shall not exceed 20.3 tonnes, a person being deemed for the purposes of this provision to be 75 kilograms. 3.1.2. Strength1) Every davit serving a life boat which is re
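
To see the substitution in isolation, here is a minimal sketch on a made-up string (the sample text is illustrative and only imitates the tab and newline noise of the raw page):

```python
import re

# Hypothetical sample imitating raw PDF text: tabs between words, hard line breaks.
raw = "to\tthe\tspecified\t load\tline\nin order to   measure the\t\nfreeboard."

# \s+ matches any run of whitespace (spaces, tabs, newlines);
# each run is replaced by a single space.
cleaned = re.sub(r"\s+", " ", raw)

print(cleaned)
# to the specified load line in order to measure the freeboard.
```

This step only normalizes whitespace. It does not rejoin words that were hyphenated across line breaks, which is why fragments such as `launch - ing` remain in the cleaned page.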

## Splitting into Chunks

Two splitters are recommended here: the `RecursiveCharacterTextSplitter` from LangChain and [`semantic-split`](https://github.com/agamm/semantic-split) by Agamm.

> Known issues:  
> `semantic-split` takes plain strings as input, so the page metadata attached to each `Document` _could_ be lost. A workaround is in progress.
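
One possible shape for such a workaround, sketched under assumptions and not the author's final solution: split each page's text, then copy that page's metadata onto every resulting chunk. `Chunk`, `split_keep_metadata`, and `toy_split` are hypothetical names introduced for illustration. `Chunk` stands in for LangChain's `Document`, and `split_fn` is assumed to return the nested list-of-lists shape that `semantic_splitter.split` returns in this notebook.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Chunk:
    """Hypothetical stand-in for a LangChain Document (text plus metadata)."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def split_keep_metadata(
    pages: List[Chunk],
    split_fn: Callable[[str], List[List[str]]],
) -> List[Chunk]:
    """Split each page's text and copy the page's metadata onto every chunk."""
    chunks: List[Chunk] = []
    for page in pages:
        # split_fn only ever sees the string, so metadata is re-attached here.
        for group in split_fn(page.page_content):
            text = " ".join(group)
            chunks.append(Chunk(page_content=text, metadata=dict(page.metadata)))
    return chunks


def toy_split(text: str) -> List[List[str]]:
    """Toy splitter for illustration: one single-sentence group per sentence."""
    return [[s.strip() + "."] for s in text.split(".") if s.strip()]


pages = [Chunk("A dinghy shall be buoyant. It shall be visible.", {"page": 18})]
chunks = split_keep_metadata(pages, toy_split)
for c in chunks:
    print(c.metadata["page"], c.page_content)
```

In the notebook itself, the same loop would presumably take `semantic_splitter.split` as `split_fn` and build real `Document` objects instead of `Chunk`.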

In [19]:
recursive_splitter = RecursiveCharacterTextSplitter(chunk_size=512,
                                                    chunk_overlap=64,
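                                                    # "\n\n" and "\n" never match here because the cleaning step
                                                    # above already collapsed all newlines into spaces.
                                                    separators=["\n\n",
                                                                "\n",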
                                                                " ",
                                                                ".",
                                                                ",",
                                                                "\u200b",  # Zero-width space
                                                                "\uff0c",  # Fullwidth comma
                                                                "\u3001",  # Ideographic comma
                                                                "\uff0e",  # Fullwidth full stop
                                                                "\u3002",  # Ideographic full stop
                                                                ""])
semantic_splitter = SimilarSentenceSplitter(similarity_model=SentenceTransformersSimilarity(),
                                           sentence_splitter=SpacySentenceSplitter())
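
To make `chunk_size` and `chunk_overlap` concrete, the sketch below cuts a string into fixed-size character windows that share a few characters with their neighbours. It is a deliberately simplified illustration, not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter also tries to break on the listed separators in order, so its chunks are usually shorter than `chunk_size` and the overlap is not exact. `window_chunks` is a hypothetical helper.

```python
from typing import List


def window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Fixed-size character windows; neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in window_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last three characters of one chunk reappear at the start of the next.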



In [20]:
recursive_text = recursive_splitter.split_documents(documents)

semantic_text = []
for doc in documents:
  semantic_text.append(semantic_splitter.split(doc.page_content))

print(f"Recursive: {len(recursive_text)}")
print(f"Semantic: {len(semantic_text)}")

Recursive: 501
Semantic: 39
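
The two counts above measure different things. `recursive_text` is a flat list of chunks, while `semantic_text` has one entry per page (39, matching `len(documents)`), and each entry is a nested list of sentence groups. A minimal sketch of counting and flattening that nested shape, using made-up data in place of the real output:

```python
from typing import List

# Hypothetical miniature of semantic_text: one entry per page,
# each page a list of groups, each group a list of sentences.
pages: List[List[List[str]]] = [
    [["4.3.2."], ["Life jackets shall be provided.", "Immersion suits also qualify."]],
    [["4.3.3."], ["A waterproof torch."], ["4.3.4."]],
]

n_pages = len(pages)                          # what len(semantic_text) reports
n_groups = sum(len(page) for page in pages)   # number of semantic chunks

# Flatten to one string per group, comparable to the recursive splitter's chunks.
flat_chunks = [" ".join(group) for page in pages for group in page]

print(n_pages, n_groups)
# 2 5
print(flat_chunks[1])
# Life jackets shall be provided. Immersion suits also qualify.
```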


In [21]:
recursive_text[250]

Document(page_content='adhesive and instructions. 4.4. Marking of rescue boats Retro-reflective tapes shall comply and be fitted in accordance with the requirements of rescue boats. Section 5 DINGHY 5.1. Requirement of a dinghy5.1.1. A dinghy shall be 2.4 metres length in minimum and shall be designed and constructed to meet the open waters and reserve buoyancy with level of floatation and swamped stability, and shall be fitted with buoyant materials ac- cordingly. 5.1.2. A dinghy shall be of a highly visible color. 5.1.3. The', metadata={'source': '/content/drive/MyDrive/ncvs_documents/CHAPTER-4_LIFE_SAVING_APPLIANCES_v.4.4_1708919237619_0.pdf', 'page': 18})

In [22]:
semantic_text[18]

[['Chapter IV Live - Saving Appliances Bab IV Perlengkapan Keselamatan NCVS Indonesia IV - 19maximum power capacity shall be determined in accor - dance with the Authority.'],
 ['4.3.2.'],
 ['Life jackets, additional to those required to be carried on the parent vessel, shall be provided for every person as-signed to crew the rescue boat.',
  'Buoyant immersion suits, where provided, shall be deemed to satisfy this require-ment.'],
 ['4.3.3.'],
 ['A waterproof, baterai-operated torch.'],
 ['4.3.4.'],
 ['A bailer of robust construction fitted with a lanyard, or a bilge pump.'],
 ['4.3.5.', 'oars or paddles.4.3.6.', 'A heaving line with rescue boat.'],
 ['4.3.7.'],
 ['For inflatable rescue boats the following equipment shall also be supplied:1)',
  'A bellows or pump complete with hose and adapter to suit inflation/valves and shall be capa- ble of inflating the rescue boat to normal working pressure.',
  '2) Emergency repair kit, complete with suitable patches, adhesive and instructions.

# Export to ChromaDB

The embeddings will be stored in ChromaDB. Given an embedding model, Chroma can also act as a retriever.
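
Conceptually, a vector store acting as a retriever embeds every chunk once, embeds the query with the same model, and returns the chunks whose vectors are most similar. The sketch below illustrates that idea with the standard library only. `ToyVectorStore`, `embed`, and `cosine` are hypothetical stand-ins, the word-count "embedding" replaces a real model, and none of this is Chroma's API.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real model returns dense vectors)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store for illustration; not Chroma's API."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, Counter]] = []

    def add(self, texts: List[str]) -> None:
        # Embed once at insert time and keep the vector next to the text.
        self._rows.extend((t, embed(t)) for t in texts)

    def search(self, query: str, k: int = 1) -> List[str]:
        # Embed the query with the same function, then rank by similarity.
        q = embed(query)
        ranked = sorted(self._rows, key=lambda row: cosine(q, row[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
store.add([
    "A dinghy shall be of a highly visible color.",
    "Davits for life boats shall be of gravity type.",
    "A waterproof torch shall be provided.",
])
print(store.search("davits for life boats", k=1))
# ['Davits for life boats shall be of gravity type.']
```

A real setup would swap the toy pieces for an embedding model and a persistent Chroma collection, but the embed, store, and rank flow stays the same.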

In [None]:
from langchain_chroma import Chroma

