This is a playground for using ColBERT. It does not use RAGatouille.
The problems with RAGatouille are:
- It does not expose all the ColBERT configurations.
- It is difficult to integrate directly with LangChain's Embeddings class.
The latest version of colbert-ai (0.2.19), or one of its dependencies, requires pyarrow==14.0.0.
Install faiss-gpu on machines with CUDA.
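The pins above can be checked in the active environment with the standard library. This is a minimal sketch, not part of the repository: the helper name `check_pin` is hypothetical, and the distribution names and versions are the ones quoted above.

```python
from importlib.metadata import PackageNotFoundError, version


def check_pin(dist_name: str, expected: str) -> str:
    """Return 'ok', 'missing', or 'mismatch: <installed>' for one pinned package."""
    try:
        installed = version(dist_name)
    except PackageNotFoundError:
        return "missing"
    return "ok" if installed == expected else f"mismatch: {installed}"


if __name__ == "__main__":
    # Pins taken from the notes above.
    for name, pin in [("colbert-ai", "0.2.19"), ("pyarrow", "14.0.0")]:
        print(f"{name}=={pin}: {check_pin(name, pin)}")
```

Running it before `poetry install` finishes would simply report the packages as missing; it does not install or change anything.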
The code in this folder includes:
- A ColBERT Embedding class
- Astra loader
- Astra vector based retriever, a LangChain compatible retriever
- It runs on CPU and GPU/CUDA (automatically uses all available GPUs)
- A chat bot example of RAG using ColBERT embeddings, the Astra DB vector store, and the retriever (including a default ranker).
How to run the example, and its prerequisites:
- Specify the directory of PDF files
- Create an Astra DB keyspace and specify the keyspace name in the example code
- Download the Secure Connect Bundle and specify its path in the example
- Create an AstraCS token and export it as ASTRA_TOKEN
cd webserver
poetry install
poetry shell
cd webserver
python example.py
- A web server for embedding service
- Dockerfile of the web embedding service
- Indexing and encoding examples to test on GPU.
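This README does not describe how the retriever's default ranker scores passages. ColBERT-style ranking is normally late interaction (MaxSim): each query token embedding is matched to its most similar document token embedding, and those maxima are summed. The sketch below illustrates that scoring in plain Python under that assumption; the function names and the toy 2-dimensional vectors are made up for illustration and are not the repository's API.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the best dot product against any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def rank(query_tokens: Sequence[Vector], passages: Sequence[Sequence[Vector]]) -> List[int]:
    """Return passage indices ordered from highest to lowest MaxSim score."""
    scores = [maxsim_score(query_tokens, p) for p in passages]
    return sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)


# Toy token embeddings (made up for illustration).
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.0, 1.0]]  # matches both query tokens
passage_b = [[1.0, 0.0], [1.0, 0.0]]  # matches only the first query token
print(rank(query, [passage_b, passage_a]))
```

Because the score is computed per token, a passage only ranks highly when most query tokens find a close match somewhere in it. That is the reason for storing one embedding per token rather than one per passage.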
Load, split and prepare the documents
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyPDFLoader
import os
# pip install pypdf
loader = DirectoryLoader(
    path="./files",
    glob="**/*.pdf",
    loader_cls=PyPDFLoader,
    recursive=True,
)
docs = loader.load()
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=500,  # in characters; ColBERT's doc_maxlen (220) is in tokens
    chunk_overlap=100,
    length_function=len,
)
splits = text_splitter.split_documents(docs)
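`chunk_size` and `doc_maxlen` use different units: with `length_function=len`, `chunk_size` counts characters, whereas ColBERT's `doc_maxlen` counts tokens. As a rough check, the sketch below flags chunks that are certain to exceed `doc_maxlen`, relying on the fact that a WordPiece-style tokenizer emits at least one token per whitespace-separated word. The helper name `definitely_truncated` is hypothetical. The check is only a lower bound, so a chunk that passes may still exceed `doc_maxlen` once it is actually tokenized.

```python
from typing import List


def definitely_truncated(chunks: List[str], doc_maxlen: int = 220) -> List[int]:
    """Return indices of chunks whose word count alone exceeds doc_maxlen.

    Word count is a lower bound on the token count, so every flagged chunk
    is too long; unflagged chunks still need a real tokenizer to be sure.
    """
    return [i for i, text in enumerate(chunks) if len(text.split()) > doc_maxlen]


# Toy chunks (made up): 10 words and 300 words.
sample = ["short chunk " * 5, "word " * 300]
print(definitely_truncated(sample))
```

With the loader and splitter used here, it could be called as `definitely_truncated([s.page_content for s in splits])`.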
title = docs[0].metadata['source']
collections = []
for part in splits:
    collections.append(part.page_content)

from embedding import ColbertTokenEmbeddings
colbert = ColbertTokenEmbeddings(
    doc_maxlen=220,
    nbits=1,
    kmeans_niters=4,
    nranks=1,
)
passageEmbeddings = colbert.embed_documents(texts=collections, title=title)

Create tables and load embeddings
from embedding import AstraDB
import os
# astra db
astra = AstraDB(
    secure_connect_bundle="./secure-connect-mingv1.zip",
    astra_token=os.getenv("ASTRA_TOKEN"),
    keyspace="colbert128"
)

from embedding import ColbertAstraRetriever
retriever = ColbertAstraRetriever(astraDB=astra, colbertEmbeddings=colbert)
answers = retriever.retrieve("what's the toll free number to call for help?")

A web embedding service is implemented to provide ColBERT text embeddings over HTTP.
Commands to set up the dev environment:
cd webserver
poetry install
poetry shell
cd webserver
uvicorn main:app --reload