<a href="https://colab.research.google.com/github/anshupandey/ms-generativeai-apr2025/blob/main/code15_RAG_chain_and_agent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG Implementation with OpenAI and LangChain

In [1]:
!pip install langchain-community langchainhub langchain-openai langchain-chroma langchain langchain-experimental langgraph --quiet

In [2]:
!pip install pypdf faiss-cpu --quiet

In [3]:
import os
# Model names; the embedding name matches the model passed to OpenAIEmbeddings later on
embedding_model_name = "text-embedding-3-small"
model_name = "gpt-4o"
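
The notebook imports `os` but never shows where the OpenAI credentials come from. `langchain-openai` reads the key from the `OPENAI_API_KEY` environment variable, so one option is to prompt for it when it is missing. A minimal sketch, assuming a hypothetical helper named `ensure_env` (not part of the original notebook); the demo call uses a fake environment and a placeholder key so it runs without prompting:

```python
import os
from getpass import getpass


def ensure_env(name, env=os.environ, prompt=getpass):
    """Return env[name], asking the user for a value only if it is missing or empty."""
    if not env.get(name):
        env[name] = prompt(f"Enter {name}: ")
    return env[name]


# Demo with a fake environment and a placeholder key (assumption: "sk-demo" is not a real key).
demo_env = {}
ensure_env("OPENAI_API_KEY", env=demo_env, prompt=lambda _: "sk-demo")
print(demo_env)
```

In the notebook itself, `ensure_env("OPENAI_API_KEY")` would prompt once without echoing the key and leave an existing value untouched.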

In [4]:
doc_paths = ["https://www.morningstar.com/content/dam/marketing/shared/research/methodology/771945_Morningstar_Rating_for_Funds_Methodology.pdf",
             "https://www.morningstar.in/docs/methodology/CategoryDefinitionsIndiaV3.pdf",
             "https://s21.q4cdn.com/198919461/files/doc_downloads/press_kits/2016/Morningstar-Sustainable-Investing-Handbook.pdf"]

In [5]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# PyPDFLoader returns one Document per PDF page
loaders = [PyPDFLoader(pdf, extract_images=False) for pdf in doc_paths]

# Collect the pages of all PDFs in a single list
docs = []
for loader in loaders:
    docs.extend(loader.load())

In [6]:
len(docs)

46

In [8]:
docs[0].page_content

'The Morningstar Rating\nTM\n for Funds\nMorningstar Methodology\nAugust 2021\nContents\nIntroduction \nMorningstar Categories \nTheory \nCalculations \nThe Morningstar Rating: \nThree-, Five-, and 10-Year \nMorningstar Return and \nMorningstar Risk Rating\nThe Overall Morningstar Rating\nRating Suspensions\nConclusion\nAppendix 1: Risk-Free Rates Applied\nAppendix 2: Methodology Changes\nAppendix 3: Star Ratings for Separately\nManaged Accounts and Models\nImportant Disclosure\nMerged with the existing star rating\nThe conduct of Morningstar’s analysts is governed \nby Code of Ethics/Code of Conduct Policy, Personal \nSecurity Trading Policy (or an equivalent of), \nand Investment Research Policy. For information \nregarding conflicts of interest, please visit: http://\nglobal.morningstar.com/equitydisclosures\n1\n2\n4\n8\n12\n13\n15\n16\n17\n18\n19\n20\nIntroduction\nThis document describes the rationale for, and the formulas and procedures used in, calculating the \nMorningstar Rati

In [9]:
# Drop pages with 100 or fewer characters (e.g. header pages, empty separator pages)
docs = [doc for doc in docs if len(doc.page_content.strip()) > 100]
len(docs)

45

In [10]:
# For the remaining pages, compute the average character count
sum(len(doc.page_content) for doc in docs)/len(docs)

2709.4222222222224

In [11]:
# Split the pages (one Document per PDF page) into chunks of at most 3500 characters,
# with an overlap of 500 characters between consecutive chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=3500, chunk_overlap=500)
splits = text_splitter.split_documents(docs)
len(splits)

53
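
The 45 pages produce only 53 chunks because the average page (about 2,709 characters) already fits within the 3,500-character limit, so most pages pass through unsplit. To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter first tries to break on separators such as paragraph and line breaks); the function name `split_with_overlap` and the 8,000-character page are assumptions made for the example.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


page = "x" * 8000  # hypothetical 8,000-character page
chunks = split_with_overlap(page, chunk_size=3500, chunk_overlap=500)
print([len(c) for c in chunks])  # → [3500, 3500, 2000]
```

Each window starts 3,000 characters after the previous one, so the last 500 characters of one chunk reappear at the start of the next. This keeps sentences that straddle a boundary retrievable from at least one chunk.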

In [12]:
print(splits[1].page_content)

3
3
3
©2021 Morningstar, Inc. All rights reserved. The information in this document is the property of Morningstar, Inc. Reproduction or transcription by any means, in whole or in part, without the prior written 
consent of Morningstar, Inc., is prohibited.
 
The Morningstar RatingTM for Funds    August 2021Page 2 of 21
captured by standard deviation, as would be the case if excess return were normally or lognormally 
distributed, which is not always the case. Also, standard deviation measures variation both above 
and below the mean equally. But investors are generally risk-averse and dislike downside variation 
more than upside variation. Morningstar gives more weight to downside variation when calculating 
Morningstar Risk-Adjusted Return and does not make any assumptions about the distribution of 
excess returns.
The other commonly accepted meaning of “risk-adjusted” is based on assumed investor preferences. 
Under this approach, higher return is “good” and higher risk is “bad” und

In [13]:
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

In [14]:
# Optional alternative (disabled): FAISS vector store with Hugging Face BGE embeddings
#from langchain_community.vectorstores import FAISS
#from langchain.embeddings import HuggingFaceBgeEmbeddings
#embedding_model_name = "BAAI/bge-large-en-v1.5"
#embeddings = HuggingFaceBgeEmbeddings(model_name=embedding_model_name,)

In [None]:
#from langchain_community.vectorstores import FAISS
# Embed the chunks with the embedding model and store the vectors in an in-memory vector DB (FAISS)
#vectorstore = FAISS.from_documents(documents=splits, embedding=embeddings)

# Create a retriever from the vector store to run similarity search
#retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 2})

In [16]:
from langchain_chroma import Chroma
# Embed the chunks and store the vectors in a Chroma collection
vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings, collection_name='morningStar-Data')

In [18]:
# Create a retriever from the vector store; k=2 returns the two most similar chunks per query
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 2})

In [19]:
retrieved_docs = retriever.invoke("What is Large Cap equity fund according to MorningStar")
len(retrieved_docs)

2
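
Conceptually, a similarity retriever embeds the query, scores it against every stored chunk vector, and returns the `k` best matches. The sketch below shows that idea with cosine similarity on toy 3-dimensional vectors. Cosine similarity is an illustrative choice here; the metric Chroma actually applies depends on how the collection is configured, and real stores use approximate indexes rather than a full scan. The names `cosine_similarity` and `top_k` are made up for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]


# Toy stand-ins for chunk embeddings (real embeddings have hundreds of dimensions)
doc_vecs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query_vec = [1.0, 0.05, 0.0]
print(top_k(query_vec, doc_vecs, k=2))  # → [0, 1]
```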

In [20]:
print(retrieved_docs[0].page_content)

? 
 
 
 
 
 
 
 
 
 
 
 
Categor y Definitions  
India 
Equity  
 
Large-Cap 
Large-Cap funds primarily consist of stocks which are the Top 100 stocks by full market capitalization  of 
the equity market. These funds invest at least 80% of total assets in Indian equities and the balance can 
be invested in other asset classes such as fixed income and overseas equities, among others. Funds in 
this category would invest at least 80% of their total assets in large-cap stocks. 
Morningstar Category Index: S&P BSE 100 TR 
 
Mid-Cap 
Mid-Cap funds primarily consist of stocks ranked 101st to 250th by full market capitalization of the 
equity market. These funds invest at least 65% of total assets in Indian equities, and the balance can be 
invested in other asset classes such as fixed income and overseas equities, among others. Funds in this 
category would invest at least 65% of their total assets in mid-cap stocks. 
Morningstar Category Index: S&P BSE Mid Cap TR 
 
Small-Cap 
Small-Cap fun

In [21]:
print(retrieved_docs[1].page_content)

©2019 Morningstar, Inc. All rights reserved. The information in this document is the property of Morningstar, Inc. Reproduction or transcription by any means, in whole or part, without  
the prior written consent of Morningstar, Inc., is prohibited. 
Category Definitions , India  | 26 February 2021  Page 2 of 12  
Multi- Cap 
Multi-Cap funds invest at least 75% of their total assets in Indian equities, and the balance can be 
invested in other asset classes such as fixed income and overseas equities, among others. These funds 
will invest a minimum of 25% each in Large Cap, Mid Cap and Small Cap stocks. 
Morningstar Category Index: S&P BSE 500 TR 
 
 
Large & Mid- Cap 
Large & Mid-Cap funds primarily consist of stocks which are the Top 250 stocks by full market 
capitalization of the equity market. These funds invest at least 70% of total assets in Indian equities and 
the balance can be invested in other asset classes such as fixed income and overseas equities, among 
others. Funds in

In [25]:
retrieved_docs[0]

Document(id='15d570e3-ce5b-453f-a30f-fcc4dded9cf6', metadata={'author': 'KBelapu', 'creationdate': '2021-02-26T12:11:18+05:30', 'creator': 'PScript5.dll Version 5.2.2', 'moddate': '2021-02-26T12:11:18+05:30', 'page': 0, 'page_label': '1', 'producer': 'GPL Ghostscript 9.06', 'source': 'https://www.morningstar.in/docs/methodology/CategoryDefinitionsIndiaV3.pdf', 'title': 'Microsoft Word - India Category_Definitions April 2021', 'total_pages': 12}, page_content='? \n \n \n \n \n \n \n \n \n \n \n \nCategor y Definitions  \nIndia \nEquity  \n \nLarge-Cap \nLarge-Cap funds primarily consist of stocks which are the Top 100 stocks by full market capitalization  of \nthe equity market. These funds invest at least 80% of total assets in Indian equities and the balance can \nbe invested in other asset classes such as fixed income and overseas equities, among others. Funds in \nthis category would invest at least 80% of their total assets in large-cap stocks. \nMorningstar Category Index: S&P BSE

### Implementing RAG Chain

In [22]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

message = """
Answer this question using the provided context only.

{question}

Context:
{context}
"""

prompt = ChatPromptTemplate.from_messages([("human", message)])
prompt

ChatPromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template='\nAnswer this question using the provided context only.\n\n{question}\n\nContext:\n{context}\n'), additional_kwargs={})])

In [23]:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini")

rag_chain = {"context": retriever, "question": RunnablePassthrough()} | prompt | llm


In [24]:
response = rag_chain.invoke("tell me about mid cap market")

print(response.content)

Mid-Cap funds primarily consist of stocks that are ranked 101st to 250th by full market capitalization of the equity market. These funds invest at least 65% of their total assets in Indian equities, while the remaining assets can be allocated to other asset classes such as fixed income and overseas equities. The Morningstar Category Index for Mid-Cap funds is the S&P BSE Mid Cap TR.
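
The chain above feeds the retriever output directly into `{context}`, so the prompt receives the string form of a list of `Document` objects, metadata included. A common refinement is to join only the page text, with a short source label per chunk. A minimal sketch, assuming a hypothetical helper named `format_docs`; the `source` and `page` metadata keys are the ones visible in the `Document` printed earlier, and the stand-in documents below are invented for the demo:

```python
from types import SimpleNamespace


def format_docs(docs) -> str:
    """Join retrieved chunks into one context string, labelling each with its source."""
    parts = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")
        parts.append(f"[source: {source}, page: {page}]\n{doc.page_content.strip()}")
    return "\n\n".join(parts)


# Stand-ins for LangChain Document objects (hypothetical content and file names)
fake_docs = [
    SimpleNamespace(page_content=" Large-Cap text ", metadata={"source": "a.pdf", "page": 0}),
    SimpleNamespace(page_content="Mid-Cap text", metadata={"source": "a.pdf", "page": 1}),
]
print(format_docs(fake_docs))
```

Because LCEL coerces plain functions into runnables, the helper should slot into the chain as `{"context": retriever | format_docs, "question": RunnablePassthrough()} | prompt | llm`.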


### Implementing RAG Agent

### Creating Retriever Tool

In [26]:
from langchain.tools.retriever import create_retriever_tool

# Wrap the retriever as a tool; the agent decides when to call it based on the description
tool = create_retriever_tool(
    retriever,
    "searchCapitalMarket",
    "Searches and returns excerpts about stocks, shares, trading, and capital markets. Use this tool only for capital-market or finance questions.",
)
tools = [tool]

In [27]:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o")


In [28]:
from langchain_core.messages import HumanMessage
from langgraph.prebuilt import create_react_agent

# Build a ReAct-style agent that can call the retriever tool when it decides to
agent_executor = create_react_agent(llm, tools)

response = agent_executor.invoke({"messages": [("human", "Hi, I am Anshu")]})

for message in response["messages"]:
    print(message)


content='Hi, I am Anshu' additional_kwargs={} response_metadata={} id='cde5c911-3f2c-45c1-917c-012212381630'
content='Hello Anshu! How can I assist you today?' additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 13, 'prompt_tokens': 82, 'total_tokens': 95, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_6dd05565ef', 'id': 'chatcmpl-BIZ1LURMUPWhPcUDcLSs7U1YcjXEc', 'finish_reason': 'stop', 'logprobs': None} id='run-ce94eba8-3f1b-4df0-bb27-8dcf00ff4e5b-0' usage_metadata={'input_tokens': 82, 'output_tokens': 13, 'total_tokens': 95, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}}


In [29]:
response = agent_executor.invoke({"messages": [("human", "What is large cap market?")]})

for message in response["messages"]:
    print(message)

content='What is large cap market?' additional_kwargs={} response_metadata={} id='e02b04f1-aa74-4d34-9242-c2808ac49eea'
content='' additional_kwargs={'tool_calls': [{'id': 'call_xQo3qFLGrkQn7J3IPxNJgsDV', 'function': {'arguments': '{"query":"large cap market definition"}', 'name': 'searchCapitalMarket'}, 'type': 'function'}], 'refusal': None} response_metadata={'token_usage': {'completion_tokens': 19, 'prompt_tokens': 82, 'total_tokens': 101, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_6dd05565ef', 'id': 'chatcmpl-BIZ1T0uuGElUcaRPzNKqh6wP2UlsH', 'finish_reason': 'tool_calls', 'logprobs': None} id='run-53c31727-437a-42d2-b9a4-c133fa8cb094-0' tool_calls=[{'name': 'searchCapitalMarket', 'args': {'query': 'large cap market definition'}, 'id': 'call_xQo3qFLGrkQn7J3IPxNJgs
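
Printing raw message objects buries the useful part (who spoke, and whether a tool was called) under token-usage metadata. A compact summary can be built from the `type`, `content`, and `tool_calls` attributes that the messages above expose. A minimal sketch, assuming a hypothetical helper named `summarize_messages`; the demo uses stand-in objects rather than real LangChain messages:

```python
from types import SimpleNamespace


def summarize_messages(messages, max_chars=80):
    """Return one short line per message: role plus text, or role plus tool call."""
    lines = []
    for msg in messages:
        role = getattr(msg, "type", "unknown")
        tool_calls = getattr(msg, "tool_calls", None) or []
        if tool_calls:
            for call in tool_calls:
                lines.append(f"{role} -> tool call: {call['name']}({call['args']})")
        else:
            lines.append(f"{role}: {str(msg.content)[:max_chars]}")
    return lines


# Stand-ins mirroring the first two messages printed above
demo_messages = [
    SimpleNamespace(type="human", content="What is large cap market?"),
    SimpleNamespace(
        type="ai",
        content="",
        tool_calls=[{"name": "searchCapitalMarket",
                     "args": {"query": "large cap market definition"},
                     "id": "call_demo"}],
    ),
]
for line in summarize_messages(demo_messages):
    print(line)
```

With a real run, `summarize_messages(response["messages"])` should give the same kind of overview.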

In [30]:
for s in agent_executor.stream(
    {"messages": [HumanMessage(content="What is large cap market?")]}
):
    print(s)
    print("----")

{'agent': {'messages': [AIMessage(content='The term "large cap" (short for "large market capitalization") refers to companies with a high market capitalization value. Market capitalization is the total market value of a company\'s outstanding shares of stock. It is calculated by multiplying the company\'s share price by its total number of outstanding shares. \n\nTypically, large-cap companies are well-established firms with a strong presence in their industries and are often leaders in their sectors. These businesses tend to be more stable and less volatile compared to smaller companies, providing a more conservative investment choice. They might pay dividends and have steady revenue streams, making them attractive to investors seeking stability and income.\n\nThe specific threshold for what constitutes a large-cap company can vary, but typically it includes companies with a market capitalization of $10 billion or more. Investing in large-cap companies is a common strategy for investo
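
Each streamed chunk is a dictionary keyed by the graph node that produced it, as the `'agent'` key above shows. The first chunk here already contains answer text, which suggests that this particular run answered from the model's own knowledge without calling the retriever tool; the output is truncated, so that is an inference. A small tracer makes such cases easy to spot. This is a sketch with a hypothetical helper named `trace_stream`, and the `'tools'` node name in the demo data is an assumption:

```python
from types import SimpleNamespace


def trace_stream(chunks):
    """Return (node_name, [tool names requested]) for every message in a stream of updates."""
    trace = []
    for chunk in chunks:
        for node_name, update in chunk.items():
            for msg in update.get("messages", []):
                calls = [c["name"] for c in (getattr(msg, "tool_calls", None) or [])]
                trace.append((node_name, calls))
    return trace


# Stand-in stream: the agent requests the tool, the tool replies, the agent answers
demo_stream = [
    {"agent": {"messages": [SimpleNamespace(content="", tool_calls=[{"name": "searchCapitalMarket", "args": {}}])]}},
    {"tools": {"messages": [SimpleNamespace(content="retrieved excerpt")]}},
    {"agent": {"messages": [SimpleNamespace(content="final answer", tool_calls=[])]}},
]
trace = trace_stream(demo_stream)
print(trace)
print("retriever used:", any(calls for _, calls in trace))
```

If the trace of a real run contains no tool calls, the answer was not grounded in the indexed PDFs, and a system prompt or a stricter tool description may be needed.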