<a href="https://colab.research.google.com/github/edquestofficial/Gen-AI-Cohort/blob/main/2024/april/Level_2/RAG_with_LangChain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG with LangChain

## Install Required Libraries

In [13]:
!pip install pypdf langchain langchain_openai langchain_community chromadb


## Import Required Libraries

In [8]:
from pathlib import Path
import openai

## Mount Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [3]:
# Set the API key
api_key_path = "/content/drive/MyDrive/Gen AI Course/RAG_For_HDFC_Policy/api_key.txt"

with open(api_key_path, "r") as f:
  # Strip surrounding whitespace so a trailing newline is not sent as part of the key
  api_key = f.read().strip()
  openai.api_key = api_key
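
A key file often ends with a newline, which would otherwise travel along with the key. Below is a minimal sketch of a loader that strips surrounding whitespace and fails early on an empty file. `read_api_key` is a hypothetical helper name introduced for illustration; it is not part of the original notebook.

```python
from pathlib import Path


def read_api_key(path):
    """Return the key stored in `path` with surrounding whitespace removed.

    Hypothetical helper: raises ValueError if the file holds no key.
    """
    key = Path(path).read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"No API key found in {path}")
    return key
```

It could then be used as `api_key = read_api_key(api_key_path)` before assigning `openai.api_key`.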

## Load data and split into chunks

In [11]:
from langchain_community.document_loaders import PyPDFLoader

# Define the directory containing the PDF files
pdf_directory = Path("/content/drive/MyDrive/Gen AI Course/RAG_For_HDFC_Policy/data")

# Collect the loaded pages (LangChain Document objects) from every PDF
data = []

# Loop through all files in the directory
for pdf_path in pdf_directory.glob("*.pdf"):
  pdf_path_str = str(pdf_path)
  loader = PyPDFLoader(pdf_path_str)
  pages = loader.load_and_split()
  data.extend(pages)

In [14]:
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.vectorstores import Chroma

# Split the loaded pages into chunks, embed each chunk and load it into the vector store.
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(data)
db = Chroma.from_documents(documents, OpenAIEmbeddings(openai_api_key=api_key))
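
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch in plain Python. It is not how `CharacterTextSplitter` works internally (that class splits on a separator and then merges the pieces up to the size limit). `chunk_text` is a hypothetical helper that only illustrates what the two parameters mean.

```python
def chunk_text(text, chunk_size, chunk_overlap=0):
    """Cut `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    Illustrative only; not LangChain's splitting algorithm.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=0))
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
```

With an overlap of 0, as in the cell above, neighbouring chunks share no text, so a sentence that straddles a boundary is split between two chunks. A small overlap trades some duplication for a better chance that such a sentence appears whole in at least one chunk.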

## Similarity Search

In [15]:
retriever = db.as_retriever(search_kwargs={"k": 3})

In [17]:
query = "what are Surgical Benefit"

In [18]:
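# invoke() is the current retriever entry point; get_relevant_documents() is deprecated in newer LangChain releases
docs = retriever.invoke(query)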
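
Conceptually, a retriever with `k=3` embeds the query and returns the three stored chunks whose embeddings are closest to it. The sketch below illustrates that idea with toy 2-D vectors and cosine similarity in plain Python. It is an illustration under stated assumptions, not Chroma's actual index or distance metric; `top_k`, `cosine_similarity`, and the sample vectors are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=3):
    """Return the texts of the k entries in `store` most similar to `query_vec`.

    `store` is a list of (text, vector) pairs standing in for a vector database.
    """
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy "embeddings": the first axis loosely means "about surgery"
store = [
    ("surgical benefit clause", [0.9, 0.1]),
    ("premium payment terms", [0.1, 0.9]),
    ("list of covered surgeries", [0.8, 0.3]),
    ("grievance contact details", [0.0, 1.0]),
]
print(top_k([1.0, 0.0], store, k=2))
```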

## Generative Search

In [19]:
# Define the function to generate the response. Provide a comprehensive prompt that passes the user query and the top 3 results to the model

def generate_response(query, results_df):
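    """
    Generate a response with gpt-3.5-turbo via the Chat Completions API, based on the user query and retrieved information.

    `results_df` is interpolated into the prompt as text; in this notebook it receives the list of documents returned by the retriever.
    """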
    messages = [
                {"role": "system", "content":  "You are a helpful assistant in the insurance domain who can effectively answer user queries about insurance policies and documents."},
                {"role": "user", "content": f"""You are a helpful assistant in the insurance domain who can effectively answer user queries about insurance policies and documents.
                                                You have a question asked by the user in '{query}' and you have some search results from a corpus of insurance documents in the dataframe '{results_df}'. These search results are essentially one page of an insurance document that may be relevant to the user query.

                                                The column 'documents' inside this dataframe contains the actual text from the policy document and the column 'metadata' contains the policy name and source page. The text inside the document may also contain tables in the format of a list of lists where each of the nested lists indicates a row.

                                                Use the documents in '{results_df}' to answer the query '{query}'. Frame an informative answer and also, use the dataframe to return the relevant policy names and page numbers as citations.

                                                Follow the guidelines below when performing the task.
                                                1. Try to provide relevant/accurate numbers if available.
                                                2. You don’t necessarily have to use all the information in the dataframe. Only choose information that is relevant.
                                                3. If the document text has tables with relevant information, please reformat the table and return the final information in a tabular format.
                                                4. Use the 'metadata' column in the dataframe to retrieve and cite the policy name(s) and page number(s) as citations.
                                                5. If you can't provide the complete answer, please also provide any information that will help the user to search specific sections in the relevant cited documents.
                                                6. You are a customer-facing assistant, so do not provide any information on internal workings, just answer the query directly.

                                                The generated response should answer the query directly addressing the user and avoiding additional information. If you think that the query is not relevant to the document, reply that the query is irrelevant. Provide the final response as a well-formatted and easily readable text along with the citation. Provide your complete response first with all information, and then provide the citations.
                                                """},
              ]

    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages
    )

    return response.choices[0].message.content.split('\n')
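
The function above interpolates the retrieved results straight into the prompt, so the model sees whatever their string representation happens to be. One possible alternative is to flatten each result into a labelled text block first, which makes the source and page easier to cite. The sketch below assumes each result exposes `page_content` and a `metadata` dict with `source` and `page` keys (as documents loaded with `PyPDFLoader` typically do). `Doc` is a stand-in class and `format_context` a hypothetical helper, neither of which is in the original notebook.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a retrieved document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_context(docs):
    """Render documents as numbered blocks with their source and page."""
    blocks = []
    for i, doc in enumerate(docs, start=1):
        # Fall back to "unknown" when a metadata key is missing
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "unknown")
        blocks.append(f"[{i}] source: {source}, page: {page}\n{doc.page_content}")
    return "\n\n".join(blocks)


sample = [
    Doc("Text A", {"source": "a.pdf", "page": 0}),
    Doc("Text B"),
]
print(format_context(sample))
```

The resulting string could be passed as the second argument in place of the raw list, for example `generate_response(query, format_context(docs))`.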

In [20]:
response = generate_response(query, docs)

In [21]:
response

['The Surgical Benefit is a part of the insurance policy that provides coverage for specific surgeries under certain conditions. ',
 '',
 'From the provided insurance documents, the Surgical Benefit includes:',
 '- The Life Assured undergoing any of the 138 specified surgeries listed in Annexure I.',
 '- The surgery must be performed by a qualified surgeon at a hospital due to injury or sickness for the covered surgical procedures.',
 '- The benefit payable is a fixed percentage of the Sum Insured based on the category of the surgery (Category 1: 100%, Category 2: 60%, Category 3: 40%, Category 4: 20% of the Sum Insured).',
 '- The Policyholder can make multiple claims up to a maximum of 100% of the Sum Insured during the Policy Term.',
 '',
 'The Surgical Benefit will continue to be in force even if the Daily Hospital Cash Benefit and Critical Illness Benefit cease, as long as the conditions for the Surgical Benefit are met. ',
 '',
 'Please refer to the policy document named "HDFC-Li

In [22]:
print("\n".join(response))

The Surgical Benefit is a part of the insurance policy that provides coverage for specific surgeries under certain conditions. 

From the provided insurance documents, the Surgical Benefit includes:
- The Life Assured undergoing any of the 138 specified surgeries listed in Annexure I.
- The surgery must be performed by a qualified surgeon at a hospital due to injury or sickness for the covered surgical procedures.
- The benefit payable is a fixed percentage of the Sum Insured based on the category of the surgery (Category 1: 100%, Category 2: 60%, Category 3: 40%, Category 4: 20% of the Sum Insured).
- The Policyholder can make multiple claims up to a maximum of 100% of the Sum Insured during the Policy Term.

The Surgical Benefit will continue to be in force even if the Daily Hospital Cash Benefit and Critical Illness Benefit cease, as long as the conditions for the Surgical Benefit are met. 

Please refer to the policy document named "HDFC-Life-Easy-Health-101N110V03-Policy-Bond-Sing