<a href="https://colab.research.google.com/github/fredygerman/rag-techniques-demonstration/blob/main/simple_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Simple RAG example with LangChain and OpenAI

In this example, we first connect to an OpenAI chat model through LangChain and send it a few requests. After that, we build a RAG application from a PDF file and extract details from that document.

<p align="center">
  <img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2023/07/langchain3.png" alt="LangChain Logo" width="30%">
  <img src="https://seekvectors.com/files/download/OpenAI%20Logo-01.png" alt="OpenAI Logo" width="30%">
</p>

Sources:

* https://github.com/svpino/llm


# Requirements

* OpenAI API key

# Install the requirements

If an error is raised related to docarray, refer to this solution: https://stackoverflow.com/questions/76880224/error-using-using-docarrayinmemorysearch-in-langchain-could-not-import-docarray

In [126]:
!pip3 install langchain
!pip3 install langchain_community
!pip3 install langchain_pinecone
!pip3 install langchain[docarray]
!pip3 install docarray
!pip3 install pypdf
!pip3 install openai
!pip3 install tiktoken



# Instantiate the LLM and the embedding model


In [154]:
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
import os

# Set the OpenAI API key (replace 'xxx' with your own key and never commit a real one)
os.environ["OPENAI_API_KEY"] = 'xxx'

# Initialize the chat-based OpenAI model
chat_model = ChatOpenAI(model="gpt-4", temperature=0.7)

# Initialize the OpenAI embeddings model (used later to build the vector store)
embeddings = OpenAIEmbeddings()

# Define the prompt for chat
prompt = "Give me an inspirational quote"

# Invoke the chat model to generate a response
response = chat_model.predict(prompt)

# Print the response
print(response)

"The only way to achieve the impossible is to believe it is possible." - Charles Kingsleigh


In [155]:
chat_model.predict("What is 2+2?")

'2+2 is 4.'
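The first cell in this section writes a placeholder key straight into the source. As a minimal sketch of an alternative, the helper below reads the key from the environment and only asks for it interactively when it is missing. The name `ensure_api_key` and its `ask` parameter are hypothetical and are not part of LangChain or the OpenAI SDK.

```python
import os
from getpass import getpass


def ensure_api_key(env_var: str = "OPENAI_API_KEY", ask=getpass) -> str:
    """Return the key stored in env_var, prompting for it only if it is missing.

    `ensure_api_key` and `ask` are hypothetical names used for illustration.
    """
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed value so the key does not end up in the notebook output
        key = ask(f"Enter a value for {env_var}: ")
        os.environ[env_var] = key
    return key
```

Calling `ensure_api_key()` before creating `ChatOpenAI` would replace the hardcoded assignment, assuming the client reads `OPENAI_API_KEY` from the environment as it does in the cell above.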

## Parse the LLM output with a parser provided by LangChain

In [156]:
from langchain_core.output_parsers import StrOutputParser

parser = StrOutputParser()
response_from_model = chat_model.predict("Give me an inspirational quote")
# predict() already returns a plain string, so parse() returns it unchanged here
parsed_response = parser.parse(response_from_model)
print(parsed_response)

"The only limit to our realization of tomorrow will be our doubts of today." - Franklin D. Roosevelt


# Create the prompt template for the conversation with the LLM

We can create a template to structure the conversation effectively.

This template lets us give the large language model (LLM) some general context that is reused for every prompt, so the model has the same background instructions in all interactions.

Additionally, we can include specific context relevant to the particular prompt. This helps the model understand the immediate scenario or topic before addressing the actual question. Following this specific context, we then present the actual question we want the model to answer.

By using this approach, we enhance the model's ability to generate accurate and relevant responses based on both the general and specific contexts provided.

In [157]:
from langchain.prompts import PromptTemplate

template = """
Answer the question based on the context below. If you can't
answer the question, answer with "I don't know".

If you think we need to do a web search, respond with 'WEB_SEARCH'

Context: {context}

Question: {question}
"""

prompt = PromptTemplate.from_template(template)
prompt.format(context="Here is some context", question="Here is a question")

'\nAnswer the question based on the context below. If you can\'t \nanswer the question, answer with "I don\'t know".\n\nIf you think we need to do a web search, respond with \'WEB_SEARCH\'\n\nContext: Here is some context\n\nQuestion: Here is a question\n'

The model can answer prompts based on the context:

In [158]:
formatted_prompt = prompt.format(context="My parents named me John", question="What's my name?")
response_from_model = chat_model.predict(formatted_prompt)
parsed_response = parser.parse(response_from_model)
print(parsed_response)

Your name is John.


But it can't answer what is not provided as context:

In [159]:
formatted_prompt = prompt.format(context="My parents named me Sergio", question="What's my age?")
response_from_model = chat_model.predict(formatted_prompt)
parsed_response = parser.parse(response_from_model)
print(parsed_response)

I don't know


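This restriction is meant to cover even information the model already knows, although GPT-4 may still answer simple questions such as `2+2` from its own knowledge, as the last cell in this section shows. With an empty context and a question about current events, it replies `WEB_SEARCH`: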

In [160]:
formatted_prompt = prompt.format(context="", question="What did Elon Musk do today?")
response_from_model = chat_model.predict(formatted_prompt)
parsed_response = parser.parse(response_from_model)
print(parsed_response)

# If the response is WEB_SEARCH, run a web search function to get results

WEB_SEARCH
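The comment in the cell above says a web search should run when the model replies `WEB_SEARCH`, but the notebook does not implement that step. The sketch below shows one way to route on the sentinel. `resolve_answer` and `fake_web_search` are hypothetical names, and the search function is a stub rather than a real search API.

```python
from typing import Callable

WEB_SEARCH_SENTINEL = "WEB_SEARCH"


def fake_web_search(query: str) -> str:
    """Stub standing in for a real search call (hypothetical, for illustration only)."""
    return f"[search results for: {query}]"


def resolve_answer(model_output: str, question: str,
                   web_search: Callable[[str], str] = fake_web_search) -> str:
    """Return the model's answer, or the search result if the model asked for a search."""
    # Strip whitespace and quotes, since the model may reply 'WEB_SEARCH' with quotes
    cleaned = model_output.strip().strip("'\"")
    if cleaned == WEB_SEARCH_SENTINEL:
        return web_search(question)
    return model_output
```

In the notebook this could be applied as `resolve_answer(parsed_response, question)`, with `fake_web_search` swapped for a real search function of your choice.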


In [161]:
formatted_prompt = prompt.format(context="My parents named me Sergio", question="What is 2+2?")
response_from_model = chat_model.predict(formatted_prompt)
parsed_response = parser.parse(response_from_model)
print(parsed_response)

4


# Load an example PDF to do Retrieval Augmented Generation (RAG)

You can use your own PDF for this example; just update the path passed to `PyPDFLoader`.

In [162]:
from langchain_community.document_loaders import PyPDFLoader


loader = PyPDFLoader("./sample_data/Understanding_Climate_Change.pdf")
pages = loader.load_and_split()
#pages = loader.load()
pages

[Document(metadata={'source': './sample_data/Understanding_Climate_Change.pdf', 'page': 0}, page_content='Understanding Climate Change  \nChapter 1: Introduction to Climate Change  \nClimate change refers to significant, long -term changes in the global climate. The term \n"global climate" encompasses the planet\'s overall weather patterns, including temperature, \nprecipitation, and wind patterns, over an extended period. Over the past cent ury, human \nactivities, particularly the burning of fossil fuels and deforestation, have significantly \ncontributed to climate change.  \nHistorical Context  \nThe Earth\'s climate has changed throughout history. Over the past 650,000 years, there have \nbeen seven cycles of glacial advance and retreat, with the abrupt end of the last ice age about \n11,700 years ago marking the beginning of the modern climate era and  human civilization. \nMost of these climate changes are attributed to very small variations in Earth\'s orbit that \nchange the a

In [163]:
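from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split the pages into chunks of up to 100 characters, with 20 characters of overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
# Only the first 5 chunks are kept, so the vector store built later covers just the start of the PDF;
# remove the slice to index the whole document
text_documents = text_splitter.split_documents(pages)[:5]

# Display the loaded pages again (the chunks themselves are in text_documents)
pages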

[Document(metadata={'source': './sample_data/Understanding_Climate_Change.pdf', 'page': 0}, page_content='Understanding Climate Change  \nChapter 1: Introduction to Climate Change  \nClimate change refers to significant, long -term changes in the global climate. The term \n"global climate" encompasses the planet\'s overall weather patterns, including temperature, \nprecipitation, and wind patterns, over an extended period. Over the past cent ury, human \nactivities, particularly the burning of fossil fuels and deforestation, have significantly \ncontributed to climate change.  \nHistorical Context  \nThe Earth\'s climate has changed throughout history. Over the past 650,000 years, there have \nbeen seven cycles of glacial advance and retreat, with the abrupt end of the last ice age about \n11,700 years ago marking the beginning of the modern climate era and  human civilization. \nMost of these climate changes are attributed to very small variations in Earth\'s orbit that \nchange the a
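To illustrate what `chunk_size=100` and `chunk_overlap=20` mean, here is a simplified fixed-window splitter in plain Python. It is only a sketch of the overlap idea and not the algorithm `RecursiveCharacterTextSplitter` uses, which prefers to split on separators such as paragraph breaks, newlines and spaces. The function name `chunk_text` is hypothetical.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 100, chunk_overlap: int = 20) -> List[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Each chunk repeats the last character of the previous one (overlap of 1)
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
```

The overlap means a sentence cut at a chunk boundary still appears partly in the next chunk, which helps retrieval when the relevant text straddles two chunks.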

# Store the PDF chunks in a vector store

From the LangChain docs:

`DocArrayInMemorySearch is a document index provided by Docarray that stores documents in memory. It is a great starting point for small datasets, where you may not want to launch a database server.`

The execution time of the following block depends on the length and complexity of the PDF provided. Try to keep it small and simple for the example.

In [164]:
from langchain_community.vectorstores import DocArrayInMemorySearch

vectorstore = DocArrayInMemorySearch.from_documents(text_documents, embedding=embeddings)
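As a rough picture of what an in-memory vector store does, the sketch below embeds each text, keeps the vectors in a list, and ranks them by cosine similarity against the query. It is a toy under stated assumptions: the word-count `embed` function stands in for `OpenAIEmbeddings`, and `ToyVectorStore` is a hypothetical class, not the `DocArrayInMemorySearch` implementation.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (assumption, not a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical in-memory store: embeds texts once, then searches by similarity."""

    def __init__(self, texts: List[str]):
        self.items = [(text, embed(text)) for text in texts]

    def search(self, query: str, k: int = 4) -> List[str]:
        query_vector = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(query_vector, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore(["climate change is global", "fossil fuels burn", "the ice age ended"])
print(store.search("climate change", k=1))
```

A real embedding model captures meaning rather than exact word overlap, so it can match a query to a chunk that uses different wording.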

# Create a retriever that returns the most similar chunks to use as context

In [165]:
retriever = vectorstore.as_retriever()
retriever.invoke("climate change")

[Document(metadata={'source': './sample_data/Understanding_Climate_Change.pdf', 'page': 0}, page_content='Climate change refers to significant, long -term changes in the global climate. The term'),
 Document(metadata={'source': './sample_data/Understanding_Climate_Change.pdf', 'page': 0}, page_content='Understanding Climate Change  \nChapter 1: Introduction to Climate Change'),
 Document(metadata={'source': './sample_data/Understanding_Climate_Change.pdf', 'page': 0}, page_content='"global climate" encompasses the planet\'s overall weather patterns, including temperature,'),
 Document(metadata={'source': './sample_data/Understanding_Climate_Change.pdf', 'page': 0}, page_content='activities, particularly the burning of fossil fuels and deforestation, have significantly')]

# Converse with the document to extract details

In [166]:
# Retrieve the chunks most similar to the query "climate change" to use as context
retrieved_context = retriever.invoke("climate change")

In [167]:
questions = [
    "What is climate change?",
    "What are Causes of Climate Change?",
    "The Role of Technology in Climate Change Mitigation"
]

for question in questions:
    # Retrieve the chunks most relevant to this question and join them into one string
    retrieved_context = retriever.invoke(question)
    retrieved_context_str = " ".join([doc.page_content for doc in retrieved_context])

    # Build two prompts: one from the template, one by plain concatenation
    unformatted_prompt = question + " based on the context: " + retrieved_context_str
    formatted_prompt = prompt.format(context=retrieved_context_str, question=question)
    response_from_model = chat_model.predict(formatted_prompt)
    unformatted_prompt_response = chat_model.predict(unformatted_prompt)
    parsed_response = parser.parse(response_from_model)
    unformatted_response = parser.parse(unformatted_prompt_response)

    print(f"Question: {question}")
    print(f"Answer: {parsed_response}")
    print(f"Unformatted Prompt answer: {unformatted_response}")
    print()

Question: What is climate change?
Answer: Climate change refers to significant, long-term changes in the global climate. This includes changes to the planet's overall weather patterns, temperature, and activities. Factors such as the burning of fossil fuels and deforestation have significantly contributed to climate change.
Unformatted Prompt answer: increased the amount of greenhouse gases in the Earth's atmosphere, leading to increased average temperatures around the world, a phenomenon often referred to as global warming. These changes can have a range of impacts on ecosystems, including rising sea levels, severe weather events, and shifts in wildlife populations and habitats. Climate change is a complex issue that poses significant challenges for the future of our planet.

Question: What are Causes of Climate Change?
Answer: The main causes of climate change are human activities, particularly the burning of fossil fuels and deforestation. These activities release large amounts of g
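The question loop above follows a retrieve, format, generate pattern. The sketch below isolates that pattern in a single function with the retriever and the model passed in as plain callables, so it can be run without an API key. `answer_with_rag`, `fake_retrieve` and `fake_llm` are hypothetical names, and the template is a shortened copy of the notebook's prompt.

```python
from typing import Callable, List

TEMPLATE = (
    "Answer the question based on the context below. If you can't\n"
    "answer the question, answer with \"I don't know\".\n\n"
    "Context: {context}\n\n"
    "Question: {question}\n"
)


def answer_with_rag(question: str,
                    retrieve: Callable[[str], List[str]],
                    llm: Callable[[str], str],
                    template: str = TEMPLATE) -> str:
    """Retrieve chunks for this question, fill the template, and ask the model."""
    chunks = retrieve(question)      # 1. retrieval, done per question
    context = " ".join(chunks)       # 2. join the chunks into one context string
    prompt_text = template.format(context=context, question=question)
    return llm(prompt_text)          # 3. generation


# Stubs standing in for the retriever and the chat model (hypothetical)
def fake_retrieve(query: str) -> List[str]:
    return [f"chunk about {query}"]


def fake_llm(prompt_text: str) -> str:
    return prompt_text  # echo the prompt so the filled template can be inspected


print(answer_with_rag("causes of climate change", fake_retrieve, fake_llm))
```

With the notebook's objects, `retrieve` could wrap `retriever.invoke` and return each document's `page_content`, and `llm` could wrap the chat model call.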

# Loop to ask and answer questions continuously

In [168]:
while True:
    print("Say 'exit' or 'quit' to exit the loop")
    question = input('User question: ')
    print(f"Question: {question}")
    if question.strip().lower() in ["exit", "quit"]:
        print("Exiting the conversation. Goodbye!")
        break
    # Retrieve the chunks most relevant to this question and join them into one string
    retrieved_context = retriever.invoke(question)
    retrieved_context_str = " ".join([doc.page_content for doc in retrieved_context])
    # Build two prompts: one from the template, one by plain concatenation
    formatted_prompt = prompt.format(context=retrieved_context_str, question=question)
    unformatted_prompt = question + " based on the context: " + retrieved_context_str
    response_from_model = chat_model.predict(formatted_prompt)
    unformatted_prompt_response = chat_model.predict(unformatted_prompt)
    parsed_response = parser.parse(response_from_model)
    unformatted_response = parser.parse(unformatted_prompt_response)
    print(f"Answer: {parsed_response}")
    print(f"Unformatted Prompt answer: {unformatted_response}")
    print()

Say 'exit' or 'quit' to exit the loop


KeyboardInterrupt: Interrupted by user