## Modularizing components of the ingestion pipeline

In object-oriented programming, we use classes and methods to split code into reusable components. For our lab, we created three components in the ingestion pipeline: the vectordb, the loader, and the chunker. These components are then composed by the AdamIngest class in ingest.py to form the pipeline. To keep the code clean and make experimentation faster, developers can pass ready-made components by value or by keyword argument, or pass each component's configuration as a dictionary.
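
As a rough illustration of this design, here is a minimal sketch of a registry plus a composing class that accepts either objects or configuration dictionaries. The names `DummyLoader`, `DummyChunker`, `LOADERS`, `CHUNKERS`, `resolve`, and `Pipeline` are hypothetical stand-ins, and the one-entry `{name: kwargs}` dictionary layout is an assumption; the actual AdamIngest implementation is not shown in this notebook and may differ.

```python
from typing import Any, Callable, Dict, Union


class DummyLoader:
    """Hypothetical stand-in for a loader component."""

    def __init__(self, extract_images: bool = False):
        self.extract_images = extract_images


class DummyChunker:
    """Hypothetical stand-in for a chunking component."""

    def __init__(self, chunk_size: int = 1000, chunk_overlap: int = 0):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap


# Registries map a name (as it might appear in a config file) to a class.
LOADERS: Dict[str, Callable[..., Any]] = {"pdf": DummyLoader}
CHUNKERS: Dict[str, Callable[..., Any]] = {"recursive": DummyChunker}


def resolve(component: Union[Any, Dict[str, dict]],
            registry: Dict[str, Callable[..., Any]]) -> Any:
    """Return an object unchanged, or build one from a {name: kwargs} dict."""
    if isinstance(component, dict):
        name = next(iter(component))  # the first key selects the class
        return registry[name](**(component[name] or {}))
    return component


class Pipeline:
    """Composes components given as objects or as configuration dictionaries."""

    def __init__(self, loader, chunker):
        self.loader = resolve(loader, LOADERS)
        self.chunker = resolve(chunker, CHUNKERS)


# Ready-made objects, passed positionally or by keyword
by_value = Pipeline(DummyLoader(), chunker=DummyChunker(chunk_size=500))

# Configuration dictionaries instead of objects
by_config = Pipeline(
    loader={"pdf": {"extract_images": False}},
    chunker={"recursive": {"chunk_size": 1000, "chunk_overlap": 0}},
)
```

Keeping the name-to-class lookup in a registry means a new loader or chunker can be tried by adding one dictionary entry and changing the configuration, without touching the composing class.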

### Ingest with values

In [1]:
import sys
sys.path.append("..")

In [2]:
from rag.utils import read_config
from rag.ingest import ADAM_VECTORDB, ADAM_CHUNKER, ADAM_LOADER
from rag.ingest.ingest import AdamIngest
import os
# Configs
vectordb_cfg = read_config("../configs/ingest.yaml", "vectordb")
loader_cfg = read_config("../configs/ingest.yaml", "loader")
chunker_cfg = read_config("../configs/ingest.yaml", "chunker")

# Vector db initialization
vectordb_choice = next(iter(vectordb_cfg))
vectordb = ADAM_VECTORDB[vectordb_choice](**vectordb_cfg[vectordb_choice], debug=True)

# Loader initialization
loader_choice = next(iter(loader_cfg))
loader = ADAM_LOADER[loader_choice](**loader_cfg[loader_choice], debug=True)

# Chunker initialization
chunker_choice = next(iter(chunker_cfg))
chunker = ADAM_CHUNKER[chunker_choice](**chunker_cfg[chunker_choice], debug=True)

ingestor = AdamIngest(vectordb, loader, chunker, debug=True)
path = r"C:\Users\USER\adamxht\AdamLab\docs"
doc_list = [os.path.join(path, file) for file in os.listdir(path)]
ingestor.ingest(doc_list)

USER_AGENT environment variable not set, consider setting it to identify your requests.


No Chromadb located at: ./db/chroma, initializing new.

            Vector database configurations
            ------------------------------------------
            Vector Database: Chroma
            Embedding model: text-embedding-3-small
        

            Loader configurations
            ------------------------------------------
            Loader:  AdamPyPDFLoader
            Extract images: {'extract_images': False}
            

            Chunker configurations
            ------------------------------------------
            Chunker:  RecursiveCharacterTextSplitter
            Chunk Size: 1000
            Chunk Overlap: 0
            
Vectordb: <rag.ingest.vectordb.ChromaDB object at 0x0000024255BAC510> Initialized.
Loader: <rag.ingest.load.PDFLoader object at 0x00000242686CE190> Initialized.
Chunker: <rag.ingest.chunk.RecursiveChunker object at 0x0000024268752290> Initialized.
...Chunking...
Number of chunks:  6
6 Chunks saved into Chroma               using embedding
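
The debug output above reports a chunk size of 1000 and a chunk overlap of 0. To show what these two parameters control, here is a deliberately simplified fixed-window chunker. It is not the RecursiveCharacterTextSplitter used in the lab, which also tries to break on natural separators; `chunk_text` is a hypothetical helper written only for illustration.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> List[str]:
    """Split text into windows of chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "".join(str(i % 10) for i in range(2500))

no_overlap = chunk_text(sample, chunk_size=1000, chunk_overlap=0)
with_overlap = chunk_text(sample, chunk_size=1000, chunk_overlap=200)
```

With no overlap, the chunks partition the text exactly. With an overlap, the tail of each chunk is repeated at the head of the next one, which can help when an answer straddles a chunk boundary, at the cost of storing more chunks.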

- Test

In [3]:
from model.llm import OpenaiLLM
from model.prompt import zero_shot_prompt
from rag.retrieve import VectorStoreRetriever
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
prompt_template = zero_shot_prompt()
llm = OpenaiLLM()

vectordb_retriever = VectorStoreRetriever(
    vectordb=vectordb,
    search_type="similarity",
    search_kwargs={"k": 6},
)

# Preprocessing
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Chain
rag_chain = (
    {"context": vectordb_retriever.retriever | format_docs, "input": RunnablePassthrough()}
    | prompt_template
    | llm
    | StrOutputParser()
)

# Query
rag_chain.invoke("What is xue hao's cgpa?")

Vectordb: <rag.ingest.vectordb.ChromaDB object at 0x0000024255BAC510> Initialized.
Vector Store Retriever: tags=['Chroma', 'OpenAIEmbeddings'] vectorstore=<langchain_chroma.vectorstores.Chroma object at 0x0000024268E18F10> search_kwargs={'k': 6} created


"Xue Hao's CGPA is 4.00."

### Ingest with configuration file

In [3]:
import sys
sys.path.append("..")

In [2]:
from rag.utils import read_all_config
from rag.ingest.ingest import AdamIngest
# Configs
ingestion_cfg = read_all_config("../configs/ingest.yaml")

ingestor = AdamIngest(**ingestion_cfg, debug=True)
ingestor.ingest(r"C:\Users\USER\adamxht\AdamLab\docs\TayXueHao-Resume.pdf")

USER_AGENT environment variable not set, consider setting it to identify your requests.


No Chromadb located at: ./db/chroma, initializing new.

            Vector database configurations
            ------------------------------------------
            Vector Database: Chroma
            Embedding model: text-embedding-3-small
        
Vectordb: <rag.ingest.vectordb.ChromaDB object at 0x0000024B3FF4DF90> Initialized.

            Loader configurations
            ------------------------------------------
            Loader:  AdamPyPDFLoader
            Extract images: {'extract_images': False}
            
Loader: <rag.ingest.load.PDFLoader object at 0x0000024B52C87990> Initialized.

            Chunker configurations
            ------------------------------------------
            Chunker:  RecursiveCharacterTextSplitter
            Chunk Size: 1000
            Chunk Overlap: 0
            
Chunker: <rag.ingest.chunk.RecursiveChunker object at 0x0000024B52A08F90> Initialized.
...Chunking...
Number of chunks:  6
6 Chunks saved into Chroma               using embedding

- Test

In [3]:
from model.llm import OpenaiLLM
from model.prompt import zero_shot_prompt
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from rag.ingest.vectordb import ChromaDB

vectordb = ChromaDB(persist_directory="db/chroma") # Assuming it is created
prompt_template = zero_shot_prompt()
llm = OpenaiLLM()

retriever = vectordb.vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 6},
)

# Preprocessing
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Chain
rag_chain = (
    {"context": retriever | format_docs, "input": RunnablePassthrough()}
    | prompt_template
    | llm
    | StrOutputParser()
)

# Query
rag_chain.invoke("What is xue hao's cgpa?")

Loading from existing Chromadb located at: db/chroma


"Xue Hao's CGPA is 4.00."

- Now that we know how to use the classes we created, we can call them from a Python script to ingest our documents easily next time.
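
One possible shape for such a script is sketched below. The notebook passes `ingest` both a list of paths and a single path, so the hypothetical helper `collect_documents` normalizes both cases into a list. The `.pdf` filter, the function names, and the command-line layout are assumptions made for this example, and the ingest step is injected as a callable so the sketch runs without the lab's `rag` package.

```python
import argparse
import os
from typing import Callable, List, Optional, Sequence


def collect_documents(path: str, suffix: str = ".pdf") -> List[str]:
    """Return [path] for a single file, or the matching files in a directory."""
    if os.path.isfile(path):
        return [path]
    return sorted(
        os.path.join(path, name)
        for name in os.listdir(path)
        if name.lower().endswith(suffix)  # assumed filter: PDFs only
    )


def main(ingest: Callable[[List[str]], object],
         argv: Optional[Sequence[str]] = None) -> List[str]:
    """Parse the command line, gather documents, and hand them to ingest."""
    parser = argparse.ArgumentParser(
        description="Ingest documents into the vector database."
    )
    parser.add_argument("path", help="A document or a directory of documents")
    args = parser.parse_args(argv)

    docs = collect_documents(args.path)
    ingest(docs)  # in the lab, this could be the ingestor's ingest method
    return docs
```

In a real script, `main` would presumably be called with the `ingest` method of an AdamIngest instance built from `../configs/ingest.yaml`, as in the cells above.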

## Modularizing Ragchain

**After ingesting**, we can create our RAG chain using the existing vectordb. We also modularized the creation of the retriever and the RAG chain.
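
To make the idea concrete, here is a minimal sketch of a callable class that wires a retriever, a prompt template, and a model together. `SimpleRagChain`, `fixed_retriever`, and `echo_llm` are hypothetical stand-ins so the example runs without an API key; the lab's RagChain class is not shown here and may be structured differently.

```python
from typing import Callable, List


class SimpleRagChain:
    """Hypothetical callable wrapper: retrieve -> fill prompt -> generate."""

    def __init__(
        self,
        retriever: Callable[[str], List[str]],
        llm: Callable[[str], str],
        system_prompt: str,
        debug: bool = False,
    ):
        self.retriever = retriever
        self.llm = llm
        self.system_prompt = system_prompt
        self.debug = debug

    def __call__(self, query: str) -> str:
        # Join the retrieved chunks, fill the template, then ask the model
        context = "\n\n".join(self.retriever(query))
        prompt = self.system_prompt.format(context=context, input=query)
        if self.debug:
            print(prompt)
        return self.llm(prompt)


def fixed_retriever(query: str) -> List[str]:
    # Stand-in retriever: a real one would run a similarity search
    return ["chunk one", "chunk two"]


def echo_llm(prompt: str) -> str:
    # Stand-in model: returns the prompt it received
    return prompt


chain = SimpleRagChain(
    retriever=fixed_retriever,
    llm=echo_llm,
    system_prompt="Context:\n{context}\n\nQuestion: {input}",
)
answer = chain("Cgpa")
```

Defining `__call__` lets the chain be used like a function, which matches how `ragchain("Cgpa")` is invoked in the cells that follow.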

- With Values

In [4]:
from rag.ragchain import RagChain
ragchain = RagChain(retriever=vectordb_retriever, llm=llm, system_prompt=prompt_template, debug=True)
ragchain("Cgpa")

Retriever: <rag.retrieve.VectorStoreRetriever object at 0x00000295B5922350> Initialized.
System prompt initialized
LLM: gpt-3.5-turbo-0125 Initialized.

            Ragchain configurations
            ------------------------------------------
            Retriever: <rag.retrieve.VectorStoreRetriever object at 0x00000295B5922350>
            System prompt: input_variables=['context', 'input'] messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'input'], template='\n        Answer the following question based only on the provided context:\n        <context>\n        {context}\n        </context>\n\n        Question: {input}\n        '))]
            LLM: gpt-3.5-turbo-0125
        
[32;1m[1;3m[chain/start][0m [1m[chain:RunnableSequence] Entering Chain run with input:
[0m{
  "input": "Cgpa"
}
[32;1m[1;3m[chain/start][0m [1m[chain:RunnableSequence > chain:RunnableParallel<context,input>] Entering Chain run with input:
[0m{
  "input": "Cgpa"
}


"The CGPA mentioned in the context is 4.00 for the Bachelor's Degree in Computer Science, majoring in Data Science."

- With configuration file

In [1]:
import sys
sys.path.append("..")

In [2]:
from rag.ragchain import RagChain
from rag.utils import read_all_config
retrieval_cfg = read_all_config("../configs/rag.yaml")
ragchain = RagChain(**retrieval_cfg, debug=False)
ragchain("What is xue hao's cgpa?")

USER_AGENT environment variable not set, consider setting it to identify your requests.


Loading from existing Chromadb located at: ./db/chroma
Vectordb: <rag.ingest.vectordb.ChromaDB object at 0x00000140D42F6B10> Initialized.
Vector Store Retriever: tags=['Chroma', 'OpenAIEmbeddings'] vectorstore=<langchain_chroma.vectorstores.Chroma object at 0x00000140D451D5D0> search_kwargs={'k': 6} created
Retriever: <rag.retrieve.VectorStoreRetriever object at 0x00000140D451C790> Initialized.
System prompt initialized
LLM: gpt-3.5-turbo-0125 Initialized.


"Xue Hao's CGPA is 4.00."