RAG Application using Gemini


In [2]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("Python.pdf")
data = loader.load()

In [4]:
len(data)

294

In [28]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)
docs = text_splitter.split_documents(data)

print("Total number of documents : ", len(docs))

Total number of documents :  287


In [29]:
docs[5]

Document(metadata={'source': 'Python.pdf', 'page': 5}, page_content='About the Reviewers\nSébastien Celles  is a professor of applied physics at Universite de Poitiers (working \nin the thermal science department). He has used Python for numerical simulations, \ndata plotting, data predictions, and various other tasks since the early 2000s. He is a member of PyData and was granted commit rights to the pandas DataReader project. He is also involved in several open source projects in the scientific Python ecosystem.\nSebastien is also the author of some Python packages available on PyPi, which are  \nas follows:\n• openweathermap_requests: This is a package used to fetch data from OpenWeatherMap.org using Requests and Requests-cache and to get pandas \nDataFrame with weather history\n• pandas_degreedays: This is a package used to calculate degree days  \n(a measure of heating or cooling) from the pandas time series of temperature\n• pandas_confusion: This is a package used to manage conf

In [30]:
from langchain_chroma import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings

from dotenv import load_dotenv
load_dotenv()

embeddings = GoogleGenerativeAIEmbeddings(model = "models/embedding-001")
vector = embeddings.embed_query("hello world")
vector[:5]

[0.04909781739115715,
 -0.044328298419713974,
 -0.025365285575389862,
 -0.030721040442585945,
 0.019068578258156776]

In [31]:
vectorstore = Chroma.from_documents(documents=docs, embedding=GoogleGenerativeAIEmbeddings(model="models/embedding-001"))

In [32]:
retreiver = vectorstore.as_retriever(search_type = "similarity", seach_kwargs={"k": 10})
retreived_docs = retreiver.invoke("What is new in Python?")


In [33]:
len(retreived_docs)

4

In [35]:
print(retreived_docs[3].page_content)

Getting Started with Raw Data[ 2 ]In this chapter we will cover the following topics:
• Exploring arrays with NumPy
• Handling data with pandas
• Reading and writing data from various formats
• Handling missing data
• Manipulating data
The world of arrays with NumPy
Python, by default, comes with a data structure, such as List, which can be utilized 
for array operations, but a Python list on its own is not suitable to perform heavy 
mathematical operations, as it is not optimized for it.
NumPy is a wonderful Python package produced by Travis Oliphant, which 
has been created fundamentally for scientific computing. It helps handle large 
multidimensional arrays and matrices, along with a large library of high-level 
mathematical functions to operate on these arrays.
A NumPy array would require much less memory to store the same amount of data 
compared to a Python list, which helps in reading and writing from the array in a faster manner.
Creating an array
A list of numbers can be pass

In [36]:
from langchain_google_genai import ChatGoogleGenerativeAI
llm = ChatGoogleGenerativeAI(model = "gemini-1.5-pro", temperature = 0.3, max_tokens= 500)

In [37]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retreived context to answer the question"
    "If you dont know the answer, say Thank you, I don't know "
    "Use three sentences maximum and make the answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system",system_prompt),
        ("human","{input}")
    ]
)


In [39]:
question_answer_chain = create_stuff_documents_chain(llm,prompt)
rag_chain = create_retrieval_chain(retreiver, question_answer_chain)

In [42]:
response = rag_chain.invoke({"input": "What is Pandas?"})
print(response["answer"])

Pandas is an open source Python library specifically designed for data analysis. It was developed by Wes McKinny at AQR Capital Management to provide a flexible tool for quantitative analysis on financial data. Pandas is built on top of NumPy, providing efficient data structures and making data handling easier. 

