# Retrieval Augmented Generation with LangChain

In this notebook, we'll build a simple, naive RAG pipeline with LangChain. We'll use OpenAI for embeddings and LLM responses, and [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) as the vector store.

In [2]:
from operator import itemgetter
import openai
import faiss
import os
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_community.chat_models import ChatOpenAI
from langchain_community.embeddings import OpenAIEmbeddings
from dotenv import load_dotenv
load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")



## Naive RAG

The cells below show a very simple version of RAG with no source document. We index a single sentence in the vector store and have the LLM answer questions using only that sentence as context.
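To make the retrieval step concrete, here is a minimal sketch of what a vector store does conceptually: embed the stored texts, embed the query, and return the closest matches. The `toy_embed` function is a hypothetical bag-of-words stand-in for a real embedding model, and the brute-force ranking stands in for FAISS; neither reflects how OpenAI embeddings or FAISS work internally.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, texts, k=1):
    """Return the k stored texts most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(texts, key=lambda t: cosine(q, toy_embed(t)), reverse=True)
    return ranked[:k]


texts = ["Addy ran to CCRB", "The weather is sunny today"]
print(retrieve("where did addy run to", texts, k=1))
# prints ['Addy ran to CCRB']
```

The retrieved text is what gets substituted into the `{context}` slot of the prompt before the LLM is called.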

In [4]:
vectorstore = FAISS.from_texts(
    ["Addy ran to CCRB"], embedding=OpenAIEmbeddings(api_key = api_key)
)


retriever = vectorstore.as_retriever()

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = PromptTemplate.from_template(template)

model = ChatOpenAI(api_key= api_key)

In [5]:
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)



In [6]:
chain.invoke("who is addy?")

'Addy is a person.'

In [7]:
template = """Answer the question based only on the following context:
{context}

Question: {question}

Answer in the following language: {language}
"""
prompt = PromptTemplate.from_template(template)

chain = (
    {
        "context": itemgetter("question") | retriever,
        "question": itemgetter("question"),
        "language": itemgetter("language"),
    }
    | prompt
    | model
    | StrOutputParser()
)

In [8]:
chain.invoke({"question": "where did addy run to", "language": "italian"})

'Addy corse al CCRB.'

### Naive RAG with Documents

Now, we will perform RAG over an Environmental Science text. You can find the PDF in the [Drive](https://drive.google.com/drive/folders/1EBnXiHcnpZNQ3IWwXOFQLbRJCVQG4sXb?usp=drive_link).

In [9]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.output_parsers import StrOutputParser

In [10]:
loader = PyPDFLoader("environmental_sci.pdf")

# The text splitter is used to split the document into chunks
# Mess with the parameters to see how it affects the output
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=100,
    length_function=len,
    is_separator_regex=False,
)

chunks = loader.load_and_split(text_splitter=text_splitter)

print(chunks[25].page_content)

form the hydrosphere, the air constitutes the atmosphere, and the biosphere
contains the entire community of living organisms.
Materials move cyclically among these spheres. They originate in the rocks
(lithosphere) and are released by weathering or by volcanism. They enter
water (hydrosphere) from where those serving as nutrients are taken up
by plants and from there enter animals and other organisms (biosphere).
From living organisms they may enter the air (atmosphere) or water
(hydrosphere). Eventually they enter the oceans (hydrosphere), where
they are taken up by marine organisms (biosphere). These return them to
the air (atmosphere), from where they are washed to the ground by rain,
thus returning to the land.
The idea that biogeochemical cycles are components of an overall system raises an obvious question:
what drives this system? It used to be thought that the global system is purely mechanical, driven by
physical forces, and, indeed, this is the way it can seem. Volcanoes, fr
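To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sketch of overlapping chunking using a fixed-size sliding window. The `sliding_chunks` helper is hypothetical and is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and newlines); it only illustrates how the two parameters interact.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size chunks where consecutive chunks share
    chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk has reached the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# prints ['abcd', 'defg', 'ghij']
```

Larger chunks give the LLM more surrounding context per retrieved piece but make each embedding less specific; overlap reduces the chance that a relevant sentence is cut in half at a chunk boundary.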

In [19]:
len(chunks)
print(chunks[4])

page_content='4 Biosphere 137
32. Biosphere, biomes, biogeography 137
33. Major biomes 141
34. Nutrient cycles 147
35. Respiration and photosynthesis 151
36. Trophic relationships 151
37. Energy, numbers, biomass 160
38. Ecosystems 163
39. Succession and climax 168
40. Arrested successions 172
41. Colonization 176
42. Stability, instability, and reproductive strategies 179
43. Simplicity and diversity 183
44. Homoeostasis, feedback, regulation 188
45. Limits of tolerance 192
Further reading 197
References 197
5 Biological Resources 200
46. Evolution 200
47. Evolutionary strategies and game theory 206
48. Adaptation 210
49. Dispersal mechanisms 214
50. Wildlife species and habitats 218
51. Biodiversity 222
52. Fisheries 227
53. Forests 233
54. Farming for food and fibre 239
55. Human populations and demographic change 249
56. Genetic engineering 250
Further reading 257
Notes 257
References 258
6 Environmental Management 261
57. Wildlife conservation 261
58. Zoos, nature reserves, wilder

In [11]:
# We will now use the from_documents method to create a vectorstore from the chunks
vectorstore = FAISS.from_documents(
    chunks, embedding=OpenAIEmbeddings(api_key =api_key)
)

# The number of chunks to retrieve is set through search_kwargs
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = PromptTemplate.from_template(template)

In [12]:
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
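In the chain above, the retriever returns a list of document objects, and that list is placed into the `{context}` slot as-is. A common pattern is to join only the text of the retrieved chunks before it reaches the prompt. The sketch below shows such a helper; `Doc` is a hypothetical stand-in for LangChain's document class so the example runs on its own, and `format_docs` is an illustrative name, not part of this notebook.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join the text of retrieved chunks, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


docs = [
    Doc("Greenhouse gases trap heat.", {"page": 46}),
    Doc("Ice cores record past climates.", {"page": 47}),
]
print(format_docs(docs))

# Assumed usage with the chain in this notebook (not executed here):
#   {"context": retriever | format_docs, "question": RunnablePassthrough()}
```

This keeps metadata such as file names and page numbers out of the prompt, which may shorten it and reduce noise in the context.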

In [13]:
# An overly complicated one-liner to test what the top 5 most similar chunks are to the question
# Use this to make sense of the output of the next cell
print("\n\n".join([x.page_content for x in vectorstore.similarity_search("What is the main cause of global warming?", k=5)]))

46 / Basics of Environmental Science
atmospheric greenhouse effect is real and important, and the gases which cause it are justly known
as ‘greenhouse gases’.
Both the global climate and atmospheric concentrations of greenhouse gases vary from time to time.
Studies of air trapped in bubbles inside ice cores from Greenland and from the Russian Vostok
station in Antarctica have revealed a clear and direct relationship between these variations and air
temperature, in the case of the Vostok cores back to about 160 000 years ago. The correlation is
convincing, although it is possible that the fluctuating greenhouse-gas concentration is an effect of
temperature change rather than the cause of it. As temperatures rose at the end of the last ice age, the
increase in the atmospheric concentration of carbon dioxide lagged behind the temperature (CALDER,
1999) and so carbon dioxide cannot have been the cause of the warming. There is also evidence that
the carbon dioxide concentration was far from

In [14]:
chain.invoke("What is the main cause of global warming?")

'The main cause of global warming is debated, with some scientists suggesting it is due to changes in solar output and volcanic eruptions, while others believe it is more likely due to human intervention.'

Try RAG yourself! Take a file of your choice and apply the same concepts. 

In [15]:
loader = PyPDFLoader("Hope C. Dawson & Antonio Hernandez & Cory Shain - Language Files _ Materials for an Introduction to Language and Linguistics-The Ohio State University Press.pdf")

# The text splitter is used to split the document into chunks
# Mess with the parameters to see how it affects the output
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=100,
    length_function=len,
    is_separator_regex=False,
)

chunks = loader.load_and_split(text_splitter=text_splitter)

print(chunks[25].page_content)

SAGE Publications, Ltd.; permission conveyed through Copyright
Clearance Center, Inc.
File 9.7
Figure (1) adapted from “Functional MRI in the investigation of blast-
related traumatic brain injury,” by John Graner, Terrence R. Oakes,
Louis M. French, and Gerard Riedy. Frontiers in Neurology 4:16. ©
2013 by the authors. Creative Commons Attribution 3.0 Unported
License.
Figure (2) adapted from image provided by Aaron G. Filler, MD, PhD, via
Wikicommons (https://commons.wikimedia.org/wiki/File:DTI_Brain_
Tractographic_Image_Set.jpg). Creative Commons Attribution—Share
Alike 3.0 Unported License.
File 9.8
Data in Exercise 8b from “Linguistics and agrammatism,” by Sergey
Avrutin. GLOT International 5.87–97. © 2001, Blackwell Publishers
Ltd. Reprinted with permission of John Wiley & Sons, Inc.
Data in Exercises 8c and 8d are excerpts from The shattered mind, by
Howard Gardner, © 1974 by Howard Gardner. Used by permission of
Alfred A. Knopf, an imprint of the Knopf Doubleday Publishing
Group

In [17]:
len(chunks)
print(chunks[4])

vectorstore = FAISS.from_documents(
    chunks, embedding=OpenAIEmbeddings(api_key =api_key)
)

# The number of chunks to retrieve is set through search_kwargs
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = PromptTemplate.from_template(template)

page_content='The Ohio State University Press
Columbus' metadata={'source': 'Hope C. Dawson & Antonio Hernandez & Cory Shain - Language Files _ Materials for an Introduction to Language and Linguistics-The Ohio State University Press.pdf', 'page': 5, 'page_label': '6'}


In [18]:
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

In [19]:
print("\n\n".join([x.page_content for x in vectorstore.similarity_search("What are the linguistic facts of life?", k=5)]))

1.
2.
3.
4.
5.
6.
FILE 1.6
Practice
File 1.1—Introducing the Study of Language
Discussion Questions
Look at the list of “surprising but true” facts about language given in
Section 1.1.2. Which items on the list were things you had heard before,
and which were new to you? Which really were surprising? What about
those items surprised you? Do you know any language-related trivia that
can be added to the list?
Look at the list of “common misconceptions” about language given in
Section 1.1.3. How many of these beliefs are ones you have held at some
point or have heard other people express? For each, how do you think that
it came to be widely believed? What sort of evidence do you think
linguists might have that causes them to say that each is false?
File 1.2—What You Know When You Know a Language
Exercises
Explain in a few sentences why linguists tend to ignore speech
performance errors in their study of linguistic competence.
Look back at the illustration on the title page of this chapter

In [20]:
chain.invoke("What are the linguistic facts of life?")

'- There are languages that don’t have words for right and left but use words for cardinal directions instead.\n- Some aspects of language appear to be innate.\n- There are more than 7,000 languages spoken in the world, but 90% of the population speaks only 10% of them.\n- Some languages, such as Turkish, have special verb forms used for gossip and hearsay.\n- Some languages structure sentences by putting the object first and the subject last.\n- In some communities, all or most members can use a signed language.\n- There is nothing inherent about most words that gives them their meaning.'

Experiment with the splitting method ([LangChain splitters](https://python.langchain.com/docs/modules/data_connection/document_transformers/)), the splitter's parameters, and the number of retrieved chunks injected into the LLM's prompt as context. These choices can significantly affect how well the LLM answers questions.

## Advanced RAG

We leave this as an (optional) challenge for you. How can we implement advanced RAG methods in LangChain?

1. Find some data that you would like to perform RAG over. 
2. Implement some form of advanced search with LangChain. 

Note: The LangChain [EnsembleRetriever](https://python.langchain.com/docs/modules/data_connection/retrievers/ensemble) may be of use.
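As a starting point, ensemble-style retrieval typically merges the ranked results of several retrievers (for example, a keyword search and a vector search) into one list. One widely used way to do this is Reciprocal Rank Fusion. The sketch below is a plain-Python illustration of that general technique, not LangChain's exact implementation; the function name, the document IDs, and the constant `c=60` are assumptions chosen for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, weights=None, c=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores weight / (rank + c) for every list it appears in,
    so documents ranked highly by several retrievers rise to the top.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)


keyword_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical keyword results
vector_hits = ["doc_b", "doc_d", "doc_a"]   # hypothetical vector results
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
# prints ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Adjusting `weights` lets one retriever count for more than another, which is useful when, say, exact keyword matches matter more than semantic similarity for your data.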

In [None]:
pass