### Part 1: Creating the Vector Database with ChromaDB and Hugging Face Embeddings
**Introduction:**  
In this part, we create a vector database with ChromaDB to store embeddings generated by a Hugging Face embedding model. This vector database is the foundation for the retrieval component of our RAG system.

In [None]:
# All packages are listed in the requirements file (/content/requi.txt)

!pip install -r /content/requi.txt

Collecting chroma-hnswlib==0.7.3 (from -r /content/requi.txt (line 1))
  Downloading chroma_hnswlib-0.7.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB)
Collecting chromadb==0.4.16 (from -r /content/requi.txt (line 2))
  Downloading chromadb-0.4.16-py3-none-any.whl (496 kB)
Collecting llama-index==0.8.64.post1 (from -r /content/requi.txt (line 3))
  Downloading llama_index-0.8.64.post1-py3-none-any.whl (846 kB)
Collecting llama_cpp_python==0.2.16 (from -r /content/requi.txt (line 4))
  Downloading llama_cpp_python-0.2.16.tar.gz (7.8 MB)

In [None]:

from PyPDF2 import PdfReader

In [None]:
CHUNK_SIZE = 350
CHUNK_OVERLAP = 15

#### 1. Load the example PDF
For this RAG example we use the PDF at `/content/triet.pdf`, a set of Vietnamese lecture notes on Marxist-Leninist philosophy from PTIT (labeled `Triết học PTIT` in the metadata)

#### 2. Convert the PDF to LlamaIndex Documents
For this example we use LlamaIndex's `Document` format.
It lets us store each page's text together with metadata, which is used for citing sources.

In [None]:
from llama_index import Document

In [None]:
reader = PdfReader("/content/triet.pdf")
print(reader.pages)

<PyPDF2._page._VirtualList object at 0x7a127df3b2e0>


In [None]:
reader = PdfReader("/content/triet.pdf")
print(reader.pages)
doc = []
for idx, page in enumerate(reader.pages):
    doc.append(Document(text=page.extract_text(),
                        metadata={'source': 'Triết học PTIT', 'page': f'{idx+1}'},
                        excluded_llm_metadata_keys=['link'],
                        excluded_embed_metadata_keys=['source', 'page', 'link']))

print(f'Number of pages {len(doc)}')



<PyPDF2._page._VirtualList object at 0x7a127df3b5e0>
Number of pages 167


#### 3. Convert Documents into LlamaIndex Nodes
We split our documents into 'chunks' to be embedded.  
Each chunk is what LlamaIndex calls a **Node**.
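
The role of the chunk size and overlap settings can be sketched with a plain sliding window. This is a simplified illustration and not LlamaIndex's implementation: `chunk_tokens` is a hypothetical helper, and plain list items stand in for tokens, whereas the real parser counts tokenizer tokens and also appears to take sentence boundaries into account.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most chunk_size items.

    Each window starts chunk_size - chunk_overlap items after the previous
    one, so consecutive windows share chunk_overlap items.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


# Placeholder "tokens" (an assumption for illustration only)
words = [f"w{i}" for i in range(1000)]
chunks = chunk_tokens(words, chunk_size=350, chunk_overlap=15)

print(len(chunks))
print([len(c) for c in chunks])
print(chunks[1][:15] == chunks[0][-15:])  # the overlap is shared between neighbours
```

A small overlap such as 15 keeps a little context across chunk borders without duplicating much text in the index.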

In [None]:
from llama_index.node_parser import SimpleNodeParser

parser = SimpleNodeParser.from_defaults(
    include_metadata=True,
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
)

nodes = parser.get_nodes_from_documents(doc)

print(f'Parsed the {len(doc)} pages into {len(nodes)} nodes')

[nltk_data] Downloading package punkt to /tmp/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Parsed the 167 pages into 1567 nodes


In [None]:
from llama_index.schema import MetadataMode

In [None]:
# Print the node content as the LLM sees it (metadata included)
print(nodes[50].get_content(metadata_mode=MetadataMode.LLM))

source: Triết học PTIT
page: 9

BÀI GI ẢNG MÔN TRI ẾT HỌC MÁC - LÊNIN   
 
  
 
BỘ MÔN LÝ LU ẬN CHÍNH TR Ị - PTIT Page 7 cứu riêng c ho mình như mô t ả những hi ện tượng tinh th ần, phân tích ng ữ nghĩa, chú 
giải văn b ản. 
Mặc dù v ậy, cái chung trong các h ọc thuy ết triết học là nghiên c ứu những vấn đề 
chung nh ất của giới tự nhiên, c ủa xã h ội và con ngư ời, mối quan h ệ của con ngư ời,


In [None]:
# Print the node content as the embedding model sees it; 'source' and 'page' are excluded as configured
print(nodes[50].get_content(metadata_mode=MetadataMode.EMBED))

BÀI GI ẢNG MÔN TRI ẾT HỌC MÁC - LÊNIN   
 
  
 
BỘ MÔN LÝ LU ẬN CHÍNH TR Ị - PTIT Page 7 cứu riêng c ho mình như mô t ả những hi ện tượng tinh th ần, phân tích ng ữ nghĩa, chú 
giải văn b ản. 
Mặc dù v ậy, cái chung trong các h ọc thuy ết triết học là nghiên c ứu những vấn đề 
chung nh ất của giới tự nhiên, c ủa xã h ội và con ngư ời, mối quan h ệ của con ngư ời,
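
The difference between the two outputs above comes from the excluded-key lists set on each `Document`. A minimal sketch of that filtering idea follows; `render_content` is a hypothetical helper written for illustration, not the LlamaIndex implementation, and the `key: value` layout simply mirrors the printed output.

```python
def render_content(text, metadata, excluded_keys):
    """Prefix text with 'key: value' lines for every metadata key not excluded."""
    visible = [f"{key}: {value}" for key, value in metadata.items()
               if key not in excluded_keys]
    if not visible:
        return text
    return "\n".join(visible) + "\n\n" + text


metadata = {"source": "Triết học PTIT", "page": "9"}
text = "chunk text"

# Same exclusion lists as in the Document construction above
llm_view = render_content(text, metadata, excluded_keys=["link"])
embed_view = render_content(text, metadata, excluded_keys=["source", "page", "link"])

print(llm_view)
print("---")
print(embed_view)
```

Keeping citation metadata out of the embedded text means the vectors reflect only the page content, while the LLM still sees the source and page for citing.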


In [None]:
from llama_index import VectorStoreIndex, ServiceContext
from llama_index.vector_stores import ChromaVectorStore
from llama_index.storage.storage_context import StorageContext

#### Setting Device:  
If you are using a Mac or an Nvidia GPU and have installed PyTorch correctly, the code below selects the matching device (`mps` or `cuda`).  
Otherwise it falls back to the CPU.

For details on installing PyTorch with CUDA support, see the [Get Started page](https://pytorch.org/get-started/locally/).  
If you are not using CUDA with an Nvidia GPU, you can uncomment the lines below:

In [None]:
# Install PyTorch for Mac or Windows PC without Nvidia GPU
# !pip install torch torchvision torchaudio
# !pip install transformers

In [None]:
import torch
# Detect hardware acceleration device
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'

print(f'Using device: {device}')

Using device: cuda


**Load Embedding Model:**  
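A good place to start when choosing an embedding model is the [MTEB English Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).

At the time of writing, the [BAAI/bge-small-en-v1.5 model](https://huggingface.co/BAAI/bge-small-en) is the best small model according to the leaderboard. Note that the cell below loads `bkai-foundation-models/vietnamese-cross-encoder` rather than bge-small-en; the source document is in Vietnamese.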
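
Whichever model is used, this notebook asks the embedding wrapper to normalize its output vectors. A minimal sketch of why that helps is shown here, using plain Python lists rather than real model outputs. The helper names `l2_normalize`, `dot`, and `cosine` are illustrative assumptions. Once vectors have unit length, a plain dot product gives the same value as cosine similarity.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy 2-dimensional "embeddings" (real embeddings have hundreds of dimensions)
a = [3.0, 4.0]
b = [4.0, 3.0]
a_unit = l2_normalize(a)
b_unit = l2_normalize(b)

print(math.sqrt(dot(a_unit, a_unit)))      # length of the normalized vector
print(dot(a_unit, b_unit), cosine(a, b))   # the two similarity values agree
```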

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model_name = 'bkai-foundation-models/vietnamese-cross-encoder'
# Load the embedding model from the Hugging Face Hub
embed_model = HuggingFaceEmbedding(
    model_name=embed_model_name,
    device=device,
    normalize=True,  # normalize the output vectors
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of RobertaModel were not initialized from the model checkpoint at bkai-foundation-models/vietnamese-cross-encoder and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



#### 4. Create and store the Vector DB
* This uses the embedding model loaded above (`embed_model_name`) to embed our chunked text into vectors
* It then saves those vectors in a persistent ChromaDB database at `./RAG_VectorDB`, in a collection named `trietptit`

**Note**: If a collection with that name already exists in the database, it is reused and new vectors are appended; otherwise it is created
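
The get-or-create behaviour described in the note can be illustrated with a toy in-memory stand-in. `ToyClient` is a hypothetical class written for this sketch; it is not the chromadb API and stores plain strings instead of vectors.

```python
class ToyClient:
    """Hypothetical in-memory stand-in for a vector DB client."""

    def __init__(self):
        self._collections = {}

    def get_or_create_collection(self, name):
        # Return the existing collection if present, otherwise create an empty one
        return self._collections.setdefault(name, [])


client = ToyClient()

first = client.get_or_create_collection("trietptit")
first.extend(["node-1", "node-2"])

# A second call with the same name returns the same collection,
# so new items are appended to what is already stored.
second = client.get_or_create_collection("trietptit")
second.append("node-3")

print(second is first)
print(len(second))
```

One consequence worth checking in practice: re-running the indexing cell against an existing database may store the same chunks a second time.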

In [None]:
import chromadb

db = chromadb.PersistentClient(path='./RAG_VectorDB')

collection_metadata = {
    'embedding_used':embed_model_name,
    'Included Papers':"Triết học PTIT"}
chroma_collection = db.get_or_create_collection('trietptit', metadata=collection_metadata)

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

service_context = ServiceContext.from_defaults(
    embed_model=embed_model,
    llm=None,  # We will set the LLM when we open the DB
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
)

vector_store_index = VectorStoreIndex(nodes=nodes,
                                    storage_context=storage_context,
                                    service_context=service_context,
                                    show_progress=True)

print('Completed')

LLM is explicitly disabled. Using MockLLM.


Generating embeddings:   0%|          | 0/1567 [00:00<?, ?it/s]

Completed


In [None]:
# Import shutil to use its make_archive function
import shutil

# Path to the directory to compress
thu_muc = "/content/RAG_VectorDB"

# Base path of the resulting zip file (make_archive appends the .zip extension)
file_nen = "/content/index"

# Compress the directory into a zip archive
shutil.make_archive(file_nen, 'zip', thu_muc)


'/content/index.zip'
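
To use the database in another session, the archive has to be unpacked again. The sketch below shows the round trip with `shutil.make_archive` and `shutil.unpack_archive` in a temporary directory. The file `placeholder.bin` and the directory name `restored_db` are assumptions for illustration and stand in for the real ChromaDB files.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)

    # Stand-in for the persisted database directory
    db_dir = tmp / "RAG_VectorDB"
    db_dir.mkdir()
    (db_dir / "placeholder.bin").write_bytes(b"vector data")

    # Archive the *contents* of db_dir; the return value is the path of the zip file
    archive_path = shutil.make_archive(str(tmp / "index"), "zip", str(db_dir))

    # Unpack into a fresh directory, which is created if it does not exist
    restored_dir = tmp / "restored_db"
    shutil.unpack_archive(archive_path, str(restored_dir))
    restored_files = sorted(p.name for p in restored_dir.iterdir())

print(Path(archive_path).name)
print(restored_files)
```

Because the archive holds the directory's contents rather than the directory itself, it should be unpacked into the folder that the persistent client will later be pointed at.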