### Part 1: Creating the Vector Database with ChromaDB and Hugging Face Embeddings
**Introduction:**  
In this part, we will create a vector database using ChromaDB to store embeddings generated by a Hugging Face embedding model. This vector database serves as the foundation for the retrieval component of our RAG system.

In [None]:
# !pip install -r requirements.txt
# Or install the required libraries individually:
!pip install langchain
!pip install chromadb
!pip install arxiv
!pip install PyPDF2


In [None]:
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
import arxiv
from PyPDF2 import PdfReader

### 1. Download an example PDF from arXiv
For this RAG example we use the paper "Language Models are Few-Shot Learners" (arXiv ID 2005.14165).

In [None]:
client = arxiv.Client()
search = arxiv.Search(id_list=['2005.14165'])

# Reuse the client created above instead of constructing a second one
paper = next(client.results(search))
print(paper.title)

Language Models are Few-Shot Learners


##### Download the PDF locally

In [None]:
path = paper.download_pdf()

In [None]:
print(path)

./2005.14165v4.Language_Models_are_Few_Shot_Learners.pdf


### 2. Convert the PDF to LangChain Documents
For this example we will use LangChain's Document format.
It lets us store the text in page_content and attach our own metadata, which is used for citing sources.
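
To make the citation use case concrete, here is a minimal sketch of how the metadata could later be turned into a source reference. The helper `format_citation` is hypothetical (it is not part of LangChain), and it assumes the `source`, `page`, and `link` keys used in this notebook; the example values are illustrative only.

```python
def format_citation(metadata: dict) -> str:
    """Build a readable citation from a chunk's metadata (hypothetical helper)."""
    source = metadata.get("source", "unknown source")
    page = metadata.get("page")
    link = metadata.get("link")

    citation = source
    if page is not None:
        citation += f", p. {page}"
    if link:
        citation += f" ({link})"
    return citation


# Illustrative metadata in the same shape as the Documents built in this step
example = {
    "source": "Language Models are Few-Shot Learners",
    "page": "3",
    "link": "http://arxiv.org/abs/2005.14165v4",
}
print(format_citation(example))
```

Missing keys are skipped rather than raising, so the same helper also works for documents that only carry a `source`.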

In [None]:
from langchain.docstore.document import Document

In [None]:
reader = PdfReader("/content/BAI GIANG TRIET 2021.pdf")
doc = []
for idx, page in enumerate(reader.pages):
    # One Document per PDF page; pages are numbered from 1 in the metadata
    doc.append(
        Document(
            page_content=page.extract_text(),
            metadata={'source': paper.title, 'page': str(idx + 1), 'link': paper.entry_id},
            excluded_llm_metadata_keys=['link'],
            excluded_embed_metadata_keys=['source', 'page', 'link'],
        )
    )

print(f'Number of pages {len(doc)}')





### 3. Prepare the documents by splitting the data
Now we will split the 75 pages into chunks to be vectorized.

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=350, chunk_overlap=15)
texts = text_splitter.split_documents(doc)

print(f'Split into {len(texts)} chunks')

Split into 809 chunks
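
To show what `chunk_size` and `chunk_overlap` mean, the sketch below uses a deliberately simplified fixed-width splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraph and line breaks); the function `chunk_text` is a hypothetical stand-in that only illustrates the sliding window and the shared overlap between neighbouring chunks.

```python
def chunk_text(text: str, chunk_size: int = 350, chunk_overlap: int = 15) -> list[str]:
    """Split text into fixed-size windows; neighbours share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# 1000 characters of repeating alphabet as stand-in text
sample = "".join(chr(97 + i % 26) for i in range(1000))
pieces = chunk_text(sample)
print(len(pieces), [len(p) for p in pieces])  # prints: 3 [350, 350, 330]
```

The last 15 characters of each chunk reappear at the start of the next one, which reduces the chance that a sentence cut at a boundary is lost to retrieval.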


#### Setting Device:  
If you are using a Mac or an Nvidia GPU and PyTorch is installed correctly, the detection cell below selects the matching device (`mps` or `cuda`).  
Otherwise it falls back to the CPU.

For details on how to install PyTorch for CUDA see the [Get Started page](https://pytorch.org/get-started/locally/)  
If you are not using CUDA with an Nvidia GPU, you can uncomment the PyTorch install line in the install cell further below:

In [None]:
!pip install pypdf



In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader

# Load every PDF in the folder; PyPDFLoader yields one Document per page
loader = DirectoryLoader("/content/helo", glob="*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()

# Split the pages into overlapping chunks for embedding
text_splitter = RecursiveCharacterTextSplitter(chunk_size=350, chunk_overlap=15)
chunks = text_splitter.split_documents(documents)



In [None]:
chunks

[Document(page_content='HỌC VI ỆN CÔNG NGH Ệ BƯU CHÍNH VI ỄN THÔNG  \nKHOA CƠ B ẢN I \nBỘ MÔN LÝ LU ẬN CHÍNH TR Ị \n⎯⎯⎯⎯⎯⎯⎯⎯⎯ \n \n \n \n \nBÀI GI ẢNG \nTRIẾT HỌC MÁC - LÊNIN  \n \n \n                               \nTs. Ph ạm Minh Ái  \nThs. Ph ạm Th ị Khánh  \n(Đồng ch ủ biên)  \n                                                              \n \n \n \n \nHÀ N ỘI - 2021', metadata={'source': '/content/helo/BAI GIANG TRIET 2021 (1).pdf', 'page': 0}),
 Document(page_content='BỘ TRUY ỀN THÔNG THÔNG TIN  \nHỌC VI ỆN CÔNG NGH Ệ BƯU CHÍNH VI ỄN THÔNG  \nKHOA CƠ B ẢN 1 \n \n \n \nTẬP BÀI GI ẢNG \nTRIẾT HỌC MÁC - LÊNIN  \n \n \n \n \nTrình đ ộ: Đại học \nĐối tượng: Sinh viên và gi ảng viên H ọc viện Công ngh ệ Bưu chính Vi ễn thôn g \n \n \n \n \n \n \n \n \n \n \n \n \nHÀ N ỘI - 2021', metadata={'source': '/content/helo/BAI GIANG TRIET 2021 (1).pdf', 'page': 1}),
 Document(page_content='BÀI GI ẢNG MÔN TRI ẾT HỌC MÁC - LÊNIN   \n \n  \n \nBỘ MÔN LÝ LU ẬN CHÍNH TR Ị - PTIT Page 1 MỤC LỤC  \nMỤ

In [None]:
# Install PyTorch for Mac or Windows PC without Nvidia GPU
# !pip install torch torchvision torchaudio

!pip install sentence_transformers



In [None]:
import torch
# Detect hardware acceleration device
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'

print(f'Using device: {device}')

Using device: cuda


#### Load embedding model:
A good place to start when choosing an embedding model is the [MTEB English Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).

At the time of writing, the [BAAI/bge-small-en-v1.5 model](https://huggingface.co/BAAI/bge-small-en) is the best small model according to the leaderboard. The cell below loads `bkai-foundation-models/vietnamese-bi-encoder` instead, which matches the Vietnamese PDFs loaded above.

In [None]:
from langchain.embeddings import HuggingFaceBgeEmbeddings
model_name = 'bkai-foundation-models/vietnamese-bi-encoder'  # Using open source embedding model

embedding_function = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs={'device': device},
    encode_kwargs={'normalize_embeddings': True} #normalizes the vectors
)
print(f'Loaded {model_name} from HuggingFace')

Loaded bkai-foundation-models/vietnamese-bi-encoder from HuggingFace
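
The `normalize_embeddings` option scales each vector to unit length. The sketch below illustrates why that is convenient: for unit-length vectors the plain dot product equals the cosine similarity. It uses small hand-written vectors rather than real model output, and the helper names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]
a_n, b_n = l2_normalize(a), l2_normalize(b)

# After normalization the dot product and the cosine similarity agree
print(dot(a_n, b_n), cosine(a, b))
```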


### 4. Create and store the Vector DB
* This will use the embedding model loaded above to embed our chunked text into vectors
* Then save those vectors into a ChromaDB named "LC_VectorDB"

**Note**: If a DB with that name already exists, the new vectors are appended to it; otherwise a new DB is created.

In [None]:
persist_directory = 'LC_VectorDB' # Name of the DB

vectordb = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_function,
    persist_directory=persist_directory # This line saves the db to disk
    )
print("DB write complete!")

DB write complete!
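
Conceptually, the retrieval step that this database supports embeds a query and returns the stored chunks whose vectors are closest to it. The toy sketch below illustrates that ranking with a linear scan over a tiny in-memory list. It is not how Chroma is implemented, and the function `top_k`, the chunk texts, and the three-dimensional vectors are all made up for illustration.

```python
import math


def top_k(query_vec, store, k=2):
    """Return the k (text, score) pairs most similar to the query by cosine similarity."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))

    scored = [(text, cosine(query_vec, vec)) for text, vec in store]
    # Highest similarity first, then keep only the first k entries
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up (text, vector) pairs standing in for embedded chunks
store = [
    ("chunk about few-shot learning", [0.9, 0.1, 0.0]),
    ("chunk about tokenization", [0.1, 0.9, 0.0]),
    ("chunk about evaluation", [0.7, 0.3, 0.1]),
]

results = top_k([1.0, 0.0, 0.0], store, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

A real vector store replaces the linear scan with an index so that lookups stay fast as the number of chunks grows.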


In [None]:
import zipfile
import os

def zip_directory(directory_path, output_path):
    """Compress every file under directory_path into a zip archive at output_path."""
    with zipfile.ZipFile(output_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for root, dirs, files in os.walk(directory_path):
            for file in files:
                file_path = os.path.join(root, file)
                # Store paths relative to directory_path inside the archive
                zipf.write(file_path, os.path.relpath(file_path, directory_path))

# Path of the directory to compress
directory_path = '/content/LC_VectorDB'

# Path and file name of the output zip archive
output_path = '/content/alo.zip'

# Call the function to compress the directory
zip_directory(directory_path, output_path)
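
As an alternative, the standard library's `shutil.make_archive` can produce the same kind of archive in one call. The sketch below is self-contained: it builds a throwaway directory as a stand-in for `LC_VectorDB` (the file names inside it are made up), zips it, and lists the archive contents as a sanity check.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the LC_VectorDB directory; the contents here are made up
    db_dir = os.path.join(tmp, "LC_VectorDB")
    os.makedirs(os.path.join(db_dir, "index"))
    with open(os.path.join(db_dir, "index", "data.bin"), "wb") as fh:
        fh.write(b"example")

    # make_archive appends ".zip" to base_name and returns the archive path
    archive_path = shutil.make_archive(os.path.join(tmp, "alo"), "zip", root_dir=db_dir)

    # List the stored paths to confirm they are relative to the DB directory
    with zipfile.ZipFile(archive_path) as zf:
        names = zf.namelist()

print(names)
```

Listing the archive before downloading it is a cheap way to confirm that the stored paths are relative and that no files were missed.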