<a href="https://colab.research.google.com/github/genaiconference/RAG_Workshop_DHS2024/blob/main/10_RAG_with_open_source_llms.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Retrieval-Augmented Generation (RAG): LangChain Implementation
This notebook implements an advanced RAG system and covers the following:
* How to handle **multiple data sources**
* How various **advanced retrievers** can be combined to improve retrieval

In [1]:
!git clone https://github.com/genaiconference/RAG_Workshop_DHS2024.git

Cloning into 'RAG_Workshop_DHS2024'...
remote: Enumerating objects: 81, done.
remote: Counting objects: 100% (81/81), done.
remote: Compressing objects: 100% (79/79), done.
remote: Total 81 (delta 28), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (81/81), 29.43 MiB | 7.40 MiB/s, done.
Resolving deltas: 100% (28/28), done.
Updating files: 100% (31/31), done.


In [12]:
!pip install -q langchain==0.0.309 chromadb==0.4.15 pypdf==3.16.2 openai==0.28.1 tiktoken==0.5.1 rank_bm25==0.2.2 sentence_transformers==2.2.2 sentencepiece transformers accelerate einops

  Preparing metadata (setup.py) ... done
  Building wheel for sentence_transformers (setup.py) ... done


Define open source LLM

In [4]:
import torch
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# Confirm that a GPU is visible before loading the model
print(torch.cuda.is_available())

repo_id = 'microsoft/Phi-3-mini-4k-instruct'

# Download the model files into ./phi-3; snapshot_download returns the local path
model_path = snapshot_download(repo_id=repo_id, repo_type="model", local_dir="./phi-3", local_dir_use_symlinks=False)

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Needs 12GB of vRAM to run in float32 (default).
# Setting the default dtype to float16 loads the weights in half precision,
# which roughly halves the memory needed.
torch.set_default_dtype(torch.float16)

llm = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=True)

Fetching 19 files:   0%|          | 0/19 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Define open source Embedding function

In [5]:
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings


embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

## Imports

In [6]:
from RAG_Workshop_DHS2024.utils import pretty_print_result, pretty_print_docs

In [7]:
# @title
# wrap colab output text

from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

## Step1: Document Loading
Load Leave Policy Document

In [8]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader(file_path='/content/RAG_Workshop_DHS2024/data/misc/Leave-Policy-India.pdf')
documents = loader.load()
len(documents)

8

## Step2: Document Splitting

In [9]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

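def chunkByWord(text):
    # Measure length in words (pieces separated by a single space), not characters
    return len(text.split(" "))

# chunk_size and chunk_overlap are measured with length_function, so both are in words here
text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, length_function=chunkByWord, chunk_overlap=50, add_start_index=False)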

docs = text_splitter.split_documents(documents)

print(len(docs))

8
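
Because `length_function` counts words, `chunk_size=700` means roughly 700 words per chunk rather than 700 characters. The sketch below uses only the standard library to illustrate that measure. The sample strings and the helper `fits_in_one_chunk` are made up for illustration and are not part of LangChain. It is also one plausible reading of the output above: if each of the 8 pages holds fewer than 700 words, no page needs to be split, so 8 chunks come out.

```python
def chunk_by_word(text: str) -> int:
    """Same measure as chunkByWord above: pieces separated by a single space."""
    return len(text.split(" "))


def fits_in_one_chunk(text: str, chunk_size: int = 700) -> bool:
    """Hypothetical helper: True if the text is under the word budget."""
    return chunk_by_word(text) < chunk_size


short_page = "Employees accrue leave every month"  # made-up sample text
long_page = " ".join(["word"] * 1000)              # 1000 space-separated words

print(chunk_by_word(short_page))      # 5
print(fits_in_one_chunk(short_page))  # True
print(fits_in_one_chunk(long_page))   # False

# Caveats of splitting on a single space:
print(chunk_by_word("two  spaces"))   # 3, the empty string between the spaces is counted
print(chunk_by_word("line\nbreak"))   # 1, a newline is not treated as a separator
```

If a more exact word count matters, `len(text.split())` (no argument) splits on any run of whitespace and avoids both caveats.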


## Step3: Vector Store Creation

In [10]:
## save to disk
from langchain.vectorstores import Chroma
vectorstore = Chroma.from_documents(documents=docs, embedding=embeddings, persist_directory='vectorstore/open_source_leave_policy/')

## load from disk
# vectorstore = Chroma(persist_directory='vectorstore/open_source_leave_policy/',embedding_function=embeddings)
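
Conceptually, a vector store keeps one embedding per chunk and answers a query by returning the chunks whose embeddings are closest to the query embedding. The following is a minimal pure-Python sketch of that idea using cosine similarity. The three-dimensional vectors and chunk labels are invented for illustration (`all-MiniLM-L6-v2` produces 384-dimensional vectors), and Chroma uses its own index and distance settings, so treat this as a mental model rather than a description of its internals.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k texts whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "vector store": (chunk text, embedding) pairs with made-up values
toy_store = [
    ("chunk about sick leave", [0.9, 0.1, 0.0]),
    ("chunk about maternity leave", [0.7, 0.6, 0.1]),
    ("chunk about office timings", [0.0, 0.2, 0.9]),
]

query_vec = [1.0, 0.0, 0.0]  # pretend embedding of a leave-related question
print(top_k(query_vec, toy_store, k=2))
# ['chunk about sick leave', 'chunk about maternity leave']
```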

## Step4: Retrieval

### Type1 - LOTR (Merger Retriever) - Handling multiple retrievers
Lord of the Retrievers (LOTR), also known as `MergerRetriever`, takes a list of retrievers as input and merges the results of their `get_relevant_documents()` methods into a single list. The merged list contains the documents that each retriever judged relevant to the query, in the order in which that retriever ranked them.

`MergerRetriever` can improve document retrieval in two ways. First, it combines the results of multiple retrievers, which reduces the risk that the blind spots of any single retrieval method bias the results. Second, it preserves each retriever's ranking while merging, so the documents that each retriever scores highest appear near the top of the merged list.
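
The merging step can be pictured as a round-robin interleave of the ranked lists. The sketch below is a simplified stand-in written in plain Python, assuming that interleaving behaviour; the function name `merge_round_robin` and the document labels are hypothetical, and this is not LangChain's actual code.

```python
def merge_round_robin(ranked_lists):
    """Interleave several ranked lists: all rank-1 items first, then rank-2, and so on."""
    merged = []
    longest = max((len(lst) for lst in ranked_lists), default=0)
    for rank in range(longest):
        for lst in ranked_lists:
            if rank < len(lst):
                merged.append(lst[rank])
    return merged


semantic_hits = ["doc_A", "doc_B", "doc_C"]  # made-up output of a vector retriever
keyword_hits = ["doc_B", "doc_D"]            # made-up output of a keyword retriever

print(merge_round_robin([semantic_hits, keyword_hits]))
# ['doc_A', 'doc_B', 'doc_B', 'doc_D', 'doc_C']
```

Note that `doc_B` appears twice in this sketch because nothing removes duplicates. When two retrievers index the same chunks, a de-duplication step after merging is worth considering.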

In [11]:
from langchain.retrievers import MergerRetriever, BM25Retriever

# Define two retrievers that use different search strategies.
# Retriever 1: semantic (vector similarity) search over the Chroma store
retriever_1 = vectorstore.as_retriever(search_type="similarity",
                                       search_kwargs={"k": 5, "include_metadata": True})

# Optional alternative: the same store queried with MMR (maximal marginal relevance)
# retriever_2 = vectorstore.as_retriever(search_type="mmr",
#                                        search_kwargs={"k": 5, "include_metadata": True})

# BM25 retriever for keyword-based retrieval using the BM25 algorithm
bm25_retriever = BM25Retriever.from_documents(docs)

bm25_retriever.k = 3  # number of documents to retrieve

# Hybrid retriever that combines semantic search and keyword search
lotr = MergerRetriever(retrievers=[retriever_1, bm25_retriever])
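
To give a feel for what the keyword retriever scores, here is a minimal pure-Python sketch of Okapi BM25. It assumes whitespace tokenization, the common defaults `k1=1.5` and `b=0.75`, and one widely used IDF variant; the `rank_bm25` package that backs `BM25Retriever` may differ in these details. The three-sentence corpus is invented for illustration and is not taken from the leave policy PDF.

```python
import math
from collections import Counter


def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every document in corpus against query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Number of documents that contain each term at least once
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Term frequency saturates with k1 and is normalized by document length via b
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores


corpus = [
    "casual leave can be taken with prior approval",
    "sick leave requires a medical certificate",
    "office timings are flexible",
]
scores = bm25_scores("sick leave", corpus)
best = max(range(len(corpus)), key=lambda i: scores[i])
print(corpus[best])  # sick leave requires a medical certificate
```

Unlike the embedding-based retriever, this score depends only on exact term overlap, which is why pairing the two in a hybrid retriever can recover documents that either one alone would miss.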

In [12]:
from RAG_Workshop_DHS2024.utils import *

In [33]:
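from langchain.chains import ConversationalRetrievalChain
from langchain.llms import HuggingFacePipeline
from transformers import pipeline
from IPython.display import Markdown

# ConversationalRetrievalChain expects a LangChain LLM. Passing the raw transformers
# model fails with "ValidationError: 1 validation error for LLMChain ... value is not
# a valid dict", so wrap the model in a text-generation pipeline first.
# max_new_tokens=512 is an assumed setting; tune it for your use case.
generation_pipeline = pipeline("text-generation", model=llm, tokenizer=tokenizer, max_new_tokens=512)
langchain_llm = HuggingFacePipeline(pipeline=generation_pipeline)

chain = ConversationalRetrievalChain.from_llm(llm=langchain_llm,
                                              combine_docs_chain_kwargs={"prompt": RAG_prompt},
                                              retriever=lotr,
                                              return_source_documents=True)

In [32]:
# NOTE: this question is about Microsoft revenue, but only the leave policy PDF is
# indexed in this notebook, so expect the retrieved context to be unrelated.
query = "Compare the Revenue trends of Microsoft between the years 2022 and 2023?"
# ConversationalRetrievalChain also expects a chat_history input; start with an empty one
result = chain({"question": query, "chat_history": []})
result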