Following the DataCamp article: https://www.datacamp.com/tutorial/llama-3-1-rag

Set up the environment

In [28]:
%pip install langchain langchain_community scikit-learn langchain-ollama sentence-transformers tiktoken

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Note: you may need to restart the kernel to use updated packages.
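
The tokenizers warning in the output above names its own fix: set TOKENIZERS_PARALLELISM explicitly. A minimal sketch, assuming it runs in a cell before anything imports sentence-transformers or tokenizers:

```python
import os

# Set the variable the warning asks for. "false" disables tokenizer
# parallelism, so forked subprocesses (such as the one %pip starts)
# no longer trigger the message.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```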


In [29]:
from langchain_community.vectorstores import SKLearnVectorStore
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_ollama import ChatOllama
from langchain.prompts import PromptTemplate

Load and prepare documents
Documents can be anything: I can load a PDF, or use web pages as the source instead.

In [None]:
# List of PDF file paths to load documents from
pdf_paths = [
    "/home/Documents/ai-ml-practice/rag-using-llm/sample-book.pdf"
]

Split documents into chunks

In [31]:
# Load and split documents
docs = [PyPDFLoader(pdf_path).load() for pdf_path in pdf_paths]
docs_list = [item for sublist in docs for item in sublist]

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=250, chunk_overlap=2
)
doc_splits = text_splitter.split_documents(docs_list)
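
To see what chunk_size and chunk_overlap control, here is a simplified sketch in plain Python. It is not how RecursiveCharacterTextSplitter works internally (that splitter breaks text on separators and measures length with tiktoken). This version assumes whitespace-separated words stand in for tokens and uses fixed windows. The function name split_with_overlap is made up for illustration.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most chunk_size tokens,
    where consecutive windows share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks


# Hypothetical toy input: words act as tokens here
tokens = "the quick brown fox jumps over the lazy dog again".split()
chunks = split_with_overlap(tokens, chunk_size=4, chunk_overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

Each window repeats the last two words of the previous one, so a sentence cut at a chunk boundary still appears intact in at least one chunk.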

Initialize embeddings

In [32]:
# Initialize embeddings
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

Create a vector store

In [33]:
# Create vector store
texts = [doc.page_content for doc in doc_splits]
vectorstore = SKLearnVectorStore.from_texts(texts, embedding=embeddings)
# Pass k through search_kwargs; a bare k=4 keyword is not a retriever setting
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
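
Conceptually, the retriever embeds the question and returns the k chunks whose vectors are closest to it. A rough sketch of that ranking step, assuming cosine similarity as the metric and using hand-written 3-dimensional vectors in place of real embeddings (the names top_k, toy_texts, toy_vecs and toy_query are made up for illustration):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, texts, k=4):
    """Return the k texts whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text)
              for vec, text in zip(doc_vecs, texts)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy data: real embeddings from all-MiniLM-L6-v2 have 384 dimensions
toy_texts = ["protein sources", "vitamin D", "amino acids"]
toy_vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.8, 0.3, 0.1]]
toy_query = [1.0, 0.2, 0.0]

top_two = top_k(toy_query, toy_vecs, toy_texts, k=2)
print(top_two)
```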

Initialize LLM

In [34]:
llm = ChatOllama(model="llama3.1:8b")

Define prompt template

In [35]:
prompt_template = """You are an assistant for question-answering tasks.
Use the following documents to answer the question.
If you don't know the answer, just say that you don't know.
Please answer in simple and easy to understand language.
Use two sentences maximum and keep the answer concise:
Question: {question}
Documents: {context}
Answer:"""
prompt = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)

Create retrieval chain

In [36]:
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True
)
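
The "stuff" chain type puts all retrieved chunks into a single prompt. A minimal sketch of that idea in plain Python, assuming the chunks are joined with a blank line between them. The helper stuff_documents and the shortened toy template are made up for illustration and are not LangChain's implementation.

```python
def stuff_documents(question, documents, template, separator="\n\n"):
    """Join all retrieved chunks into one context string and fill the prompt."""
    context = separator.join(documents)
    return template.format(question=question, context=context)


# Shortened stand-in for the prompt template defined earlier
toy_template = "Question: {question}\nDocuments: {context}\nAnswer:"
toy_docs = [
    "Protein supports tissue repair.",
    "Beans and nuts contain protein.",
]

final_prompt = stuff_documents("What does protein do?", toy_docs, toy_template)
print(final_prompt)
```

Because every chunk goes into one prompt, the combined length of the k retrieved chunks has to fit in the model's context window.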

Define RAG application class

In [37]:
class RAGApplication:
    def __init__(self, qa_chain):
        self.qa_chain = qa_chain

    def run(self, question):
        result = self.qa_chain.invoke({"query": question})
        return result["result"]
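
The chain was created with return_source_documents=True, but run() above returns only the answer text. A possible variant that also returns the retrieved chunks is sketched below. The class name RAGApplicationWithSources is made up, and StubChain is a stand-in used only so the sketch runs without a model. With the real qa_chain, the chunks should arrive under the "source_documents" key.

```python
from types import SimpleNamespace


class RAGApplicationWithSources:
    def __init__(self, qa_chain):
        self.qa_chain = qa_chain

    def run(self, question):
        result = self.qa_chain.invoke({"query": question})
        # Collect the text of each retrieved chunk, if the chain returned any
        sources = [doc.page_content
                   for doc in result.get("source_documents", [])]
        return result["result"], sources


class StubChain:
    """Stand-in for qa_chain that mimics the shape of its result dict."""

    def invoke(self, inputs):
        return {
            "query": inputs["query"],
            "result": "stub answer",
            "source_documents": [SimpleNamespace(page_content="chunk one")],
        }


answer, sources = RAGApplicationWithSources(StubChain()).run("any question")
print(answer, sources)
```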

Initialize and run

In [38]:
rag_application = RAGApplication(qa_chain)

question = "What should I know about protein?"
answer = rag_application.run(question)
print("Question:", question)
print("Answer:", answer)

Question: What should I know about protein?
Answer: Here's the answer in two sentences:

Protein is important for growth, tissue repair, immune function, energy production, and preserving lean muscle mass. You can get protein from animal products like meat, eggs, and fish, or plant-based sources like beans, nuts, and grains, as long as you combine them to get all essential amino acids.
