# Retrieval Augmented Generation with LangChain

In this notebook, we'll build a simple, naive RAG pipeline with LangChain. We will use OpenAI for embeddings and LLM responses, and [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/), a similarity-search library, as the vector store.

In [1]:
from operator import itemgetter
import faiss
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_community.chat_models import ChatOpenAI
from langchain_community.embeddings import OpenAIEmbeddings

import os

from dotenv import load_dotenv

# Load environment variables (for example OPENAI_API_KEY) from a local .env file, if one exists
load_dotenv()

## Naive RAG

The cells below show a very simple version of RAG, without a document. We index a single sentence in the vector store, retrieve it for each question, and have the LLM generate a response based only on that retrieved sentence.
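To make the retrieve-then-prompt idea concrete before we bring in LangChain, here is a minimal sketch in plain Python. It is a toy stand-in, not what FAISS or OpenAI do internally: the `embed` function just counts words (real embeddings are dense vectors from a model), and the second sentence in `texts` is a made-up distractor added for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (an assumed stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, texts: list[str], k: int = 1) -> list[str]:
    """Return the k texts most similar to the question."""
    q = embed(question)
    return sorted(texts, key=lambda t: cosine(q, embed(t)), reverse=True)[:k]


def build_prompt(question: str, texts: list[str], k: int = 1) -> str:
    """Retrieve context, then fill a template shaped like the one used in this notebook."""
    context = "\n".join(retrieve(question, texts, k))
    return (
        "Answer the question based only on the following context:\n"
        f"{context}\n\n"
        f"Question: {question}\n"
    )


# The second sentence is a hypothetical distractor, not part of the original notebook.
texts = ["Addy ran to CCRB", "The library closes at noon"]
print(build_prompt("where did addy run to", texts))
```

The LangChain cells that follow do the same two steps, with a real embedding model for `embed`, FAISS for the similarity search, and an LLM call on the finished prompt.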

In [2]:
# OpenAIEmbeddings reads OPENAI_API_KEY from the environment; never hardcode keys in a notebook
vectorstore = FAISS.from_texts(
    ["Addy ran to CCRB"], embedding=OpenAIEmbeddings()
)


retriever = vectorstore.as_retriever()

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = PromptTemplate.from_template(template)

# ChatOpenAI also picks up OPENAI_API_KEY from the environment
model = ChatOpenAI()


In [7]:
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)




In [6]:
chain.invoke("who is addy?")

'Addy is a person.'

In [7]:
template = """Answer the question based only on the following context:
{context}

Question: {question}

Answer in the following language: {language}
"""
prompt = PromptTemplate.from_template(template)

chain = (
    {
        "context": itemgetter("question") | retriever,
        "question": itemgetter("question"),
        "language": itemgetter("language"),
    }
    | prompt
    | model
    | StrOutputParser()
)

In [8]:
chain.invoke({"question": "where did addy run to", "language": "italian"})

'Addy corse al CCRB.'
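The dictionary at the head of this chain routes one input dict to several prompt variables: each key gets its own function of the same input. The sketch below mimics that routing in plain Python so you can see what reaches the prompt. `fake_retriever` is a hypothetical stand-in for the real retriever, and the `routes` dict only imitates the behaviour; it is not LangChain's implementation.

```python
from operator import itemgetter

inputs = {"question": "where did addy run to", "language": "italian"}


def fake_retriever(question: str) -> str:
    """Hypothetical stand-in for the retriever: maps a question to some context."""
    return "Addy ran to CCRB"


# Each prompt variable is computed from the same input dict.
routes = {
    # Pull out the question first, then hand it to the retriever.
    "context": lambda d: fake_retriever(itemgetter("question")(d)),
    "question": itemgetter("question"),
    "language": itemgetter("language"),
}

prompt_vars = {key: fn(inputs) for key, fn in routes.items()}
print(prompt_vars)
```

Note that the retriever only ever sees the question string, never the `language` field, which is why the chain uses `itemgetter("question") | retriever` rather than passing the whole dict through.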

### Naive RAG with Documents

Now, we will perform RAG over an Environmental Science text. You can find the PDF in the [Drive](https://drive.google.com/drive/folders/1EBnXiHcnpZNQ3IWwXOFQLbRJCVQG4sXb?usp=drive_link).

In [8]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [9]:
loader = PyPDFLoader("~/LLM-Augmentation-F24/LLM-Augmentation-F24/environmental_sci.pdf")

# The text splitter is used to split the document into chunks
# Mess with the parameters to see how it affects the output
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=100,
    length_function=len,
    is_separator_regex=False,
)

chunks = loader.load_and_split(text_splitter=text_splitter)

print(chunks[25].page_content)

Biochemical cycles
The surface of the Earth can be considered as four distinct regions and because
the planet is spherical each of them is also a sphere. The rocks forming thesolid surface comprise the lithosphere, the oceans, lakes, rivers, and icecapsform the hydrosphere, the air constitutes the atmosphere, and the biospherecontains the entire community of living organisms.
Materials move cyclically among these spheres. They originate in the rocks
(lithosphere) and are released by weathering or by volcanism. They enterwater (hydrosphere) from where those serving as nutrients are taken upby plants and from there enter animals and other organisms (biosphere).From living organisms they may enter the air (atmosphere) or water(hydrosphere). Eventually they enter the oceans (hydrosphere), wherethey are taken up by marine organisms (biosphere). These return them tothe air (atmosphere), from where they are washed to the ground by rain,thus returning to the land.
The idea that biogeochemical 

In [10]:
len(chunks)
print(chunks[4])

page_content='4 Biosphere 137
32. Biosphere, biomes, biogeography 137
33. Major biomes 14134. Nutrient cycles 147
35. Respiration and photosynthesis 151
36. Trophic relationships 15137. Energy, numbers, biomass 160
38. Ecosystems 163
39. Succession and climax 16840. Arrested successions 172
41. Colonization 176
42. Stability, instability, and reproductive strategies 17943. Simplicity and diversity 183
44. Homoeostasis, feedback, regulation 188
45. Limits of tolerance 192
Further reading 197
References 197
5 Biological Resources 200
46. Evolution 200
47. Evolutionary strategies and game theory 206
48. Adaptation 21049. Dispersal mechanisms 214
50. Wildlife species and habitats 218
51. Biodiversity 22252. Fisheries 227
53. Forests 233
54. Farming for food and fibre 23955. Human populations and demographic change 249
56. Genetic engineering 250
Further reading 257Notes 257
References 258
6 Environmental Management 261
57. Wildlife conservation 261
58. Zoos, nature reserves, wilderness 265

In [11]:
# We will now use the from_documents method to create a vectorstore from the chunks
# OpenAIEmbeddings reads OPENAI_API_KEY from the environment; never hardcode keys in a notebook
vectorstore = FAISS.from_documents(
    chunks, embedding=OpenAIEmbeddings()
)

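# The number of retrieved chunks is set through search_kwargs, not a top-level k argument
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})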

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = PromptTemplate.from_template(template)


In [14]:
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

In [15]:
# An overly complicated one-liner to test what the top 5 most similar chunks are to the question
# Use this to make sense of the output of the next cell
print("\n\n".join([x.page_content for x in vectorstore.similarity_search("What is the main cause of global warming?", k=5)]))

the Industrial Revolution was about 280 µmol mol-1 and that the increase since then has been dueentirely to emissions from the burning of fossil fuels. This may not be the case. The solubility ofgases, including carbon dioxide, is inversely proportional to the temperature. A rise in temperature,therefore, will cause dissolved carbon dioxide to bubble out of the oceans. This is called the ‘warmchampagne’ effect. Rising temperature will also stimulate aerobic bacteria. Their respiration willrelease carbon dioxide. This is called the ‘warm beer’ effect (CALDER, 1999).
Carbon dioxide is the best-known greenhouse gas, because it is the most abundant of those over
which we can exert some control, but it is not the only one. Methane, produced naturally, for exampleby termites, but also by farmed livestock and from wet-rice farming (present concentration about 1.7ppm), nitrous oxide (0.31 ppm) and tropospheric ozone (0.06 ppm), products from the burning offuels in furnaces and car engines, and

In [17]:
chain.invoke("What is the main cause of global warming?")

"The main cause of global warming is the increase in atmospheric greenhouse gases, particularly carbon dioxide, methane, nitrous oxide, tropospheric ozone, and CFC compounds, which trap heat in the Earth's atmosphere."

Try RAG yourself! Take a file of your choice and apply the same concepts. 

In [27]:
loader = PyPDFLoader("~/LLM-Augmentation-F24/LLM-Augmentation-F24/Burnett_Arabic_into_Latin.pdf")

# The text splitter is used to split the document into chunks
# Mess with the parameters to see how it affects the output
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1300,
    chunk_overlap=60,
    length_function=len,
    is_separator_regex=False,
)

chunks = loader.load_and_split(text_splitter=text_splitter)

# We will now use the from_documents method to create a vectorstore from the chunks
# OpenAIEmbeddings reads OPENAI_API_KEY from the environment; never hardcode keys in a notebook
vectorstore = FAISS.from_documents(
    chunks, embedding=OpenAIEmbeddings()
)

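# The number of retrieved chunks is set through search_kwargs, not a top-level k argument
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})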

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = PromptTemplate.from_template(template)

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

chain.invoke("What is important about Toledo?")



Mess with the splitting method ([LangChain splitters](https://python.langchain.com/docs/modules/data_connection/document_transformers/)), the parameters to the splitter, and the number of retrieved chunks that are injected into the LLM's prompt as context. These will significantly impact how the LLM performs and answers questions.
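To build some intuition for what `chunk_size` and `chunk_overlap` control, here is a minimal sketch of a fixed-size character chunker. This is a deliberately simpler technique than `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraphs and newlines; the function name `chunk_text` and the sample string are made up for illustration.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if not text:
        return []
    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    # Stop once the remaining text is already covered by the previous window's tail.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


sample = "abcdefghij"
for size in (4, 6):
    chunks = chunk_text(sample, chunk_size=size, chunk_overlap=1)
    print(size, len(chunks), chunks)
```

Smaller chunks give the retriever more, narrower pieces to choose from; larger chunks carry more surrounding context per retrieved piece but use up the prompt faster. The overlap keeps a sentence that straddles a boundary from being cut off in both neighbours.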

## Advanced RAG

We leave this as an (optional) challenge for you. How can we implement advanced RAG methods in LangChain?

1. Find some data that you would like to perform RAG over. 
2. Implement some form of advanced search with LangChain. 

Note: The LangChain [EnsembleRetriever](https://python.langchain.com/docs/modules/data_connection/retrievers/ensemble) may be of use.
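As a starting point, one common way to combine the results of several retrievers (for example a keyword search and a vector search) is Reciprocal Rank Fusion, which, as far as we know, is the approach the EnsembleRetriever documentation describes. The sketch below is a plain-Python illustration of the idea, not LangChain's code: the function name, the document IDs, and the constant `c=60` (a conventional default) are assumptions for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, weights=None, c=60):
    """Fuse several ranked lists: each document scores weight / (c + rank) per list."""
    if weights is None:
        weights = [1.0] * len(rankings)
    if len(weights) != len(rankings):
        raise ValueError("need exactly one weight per ranking")
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += weight / (c + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical result lists from two retrievers over the same corpus.
keyword_hits = ["doc_b", "doc_a", "doc_c"]
vector_hits = ["doc_a", "doc_d", "doc_b"]
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```

Documents that rank well in both lists rise to the top, while a document found by only one retriever still makes the fused list. Changing `weights` lets you trust one retriever more than the other.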

In [None]:
pass