# Building a RAG application from scratch

Here is a high-level overview of the system we want to build:

<img src='images/system1.png' width="1200">

Let's start by loading the environment variables we need to use.

In [42]:
import os
from dotenv import load_dotenv

load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")


# This is the YouTube video we're going to use.
YOUTUBE_VIDEO = "https://www.youtube.com/watch?v=cdiD-9MMpb0"
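
Since `os.getenv` returns `None` when a variable is missing, it can help to fail early rather than at the first API call. The helper below is a minimal sketch, not part of the original notebook: `missing_env_vars` is a hypothetical name, and the example mapping holds placeholder values only.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Placeholder mapping (an assumption for illustration) so the check is easy to try.
example_env = {"OPENAI_API_KEY": "placeholder-value", "PINECONE_API_KEY": ""}
print(missing_env_vars(["OPENAI_API_KEY", "PINECONE_API_KEY"], example_env))
```

Calling `missing_env_vars(["OPENAI_API_KEY"])` without a mapping checks the real process environment.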

## Setting up the model
Let's define the LLM that we'll use as part of the workflow.

In [9]:
from langchain_openai.chat_models import ChatOpenAI

model = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model="gpt-3.5-turbo")

We can test the model by asking a simple question.

In [10]:
model.invoke("What is the purpose of a climate model?")

AIMessage(content="The purpose of a climate model is to simulate and predict the behavior of the Earth's climate system, taking into account various factors such as atmospheric composition, ocean circulation, land surface processes, and human activities. Climate models help scientists understand how the climate system works, how it is changing over time, and how it may evolve in the future under different scenarios. These models can be used to assess the potential impacts of climate change, inform policy decisions, and develop strategies for mitigating and adapting to the effects of climate change.", response_metadata={'token_usage': {'completion_tokens': 104, 'prompt_tokens': 16, 'total_tokens': 120}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': 'fp_3b956da36b', 'finish_reason': 'stop', 'logprobs': None}, id='run-bb77d67d-56e4-41a1-b363-8f87e81be487-0')

The result from the model is an `AIMessage` instance containing the answer. We can extract this answer by chaining the model with an [output parser](https://python.langchain.com/docs/modules/model_io/output_parsers/).

Here is what chaining the model with an output parser looks like:

<img src='images/chain1.png' width="1200">

For this example, we'll use a simple `StrOutputParser` to extract the answer as a string.

In [11]:
from langchain_core.output_parsers import StrOutputParser

parser = StrOutputParser()

chain = model | parser
chain.invoke("What is the purpose of a climate model?")

"The purpose of a climate model is to simulate and predict how the Earth's climate system works, including interactions between the atmosphere, oceans, land surface, and ice. Climate models help scientists better understand past climate changes, current climate trends, and future climate scenarios. They are used to study the impacts of human activities, natural disturbances, and external factors on the climate system, and to develop strategies for climate adaptation and mitigation. Climate models are also used to make projections of future climate change, which can inform policy decisions and help society prepare for potential impacts."

## Introducing prompt templates

We want to provide the model with some context and the question. [Prompt templates](https://python.langchain.com/docs/modules/model_io/prompts/quick_start) are a simple way to define and reuse prompts.

In [12]:
from langchain.prompts import ChatPromptTemplate

template = """
Answer the question based on the context below. If you can't 
answer the question, reply "I don't know, please contact julien.brajard@nersc.no".

Context: {context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
prompt.format(context="Mary's sister is Susana", question="Who is Mary's sister?")

'Human: \nAnswer the question based on the context below. If you can\'t \nanswer the question, reply "I don\'t know, please contact julien.brajard@nersc.no".\n\nContext: Mary\'s sister is Susana\n\nQuestion: Who is Mary\'s sister?\n'
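
Conceptually, a prompt template is a string with named placeholders that get filled in at call time. The sketch below reproduces that idea with plain `str.format`. It is an illustration only, not how `ChatPromptTemplate` is implemented, and `fill_template` is a hypothetical helper.

```python
TEMPLATE = (
    "Answer the question based on the context below.\n\n"
    "Context: {context}\n\n"
    "Question: {question}\n"
)


def fill_template(template, **values):
    """Fill the named placeholders; str.format raises KeyError if one is missing."""
    return template.format(**values)


filled = fill_template(
    TEMPLATE,
    context="Mary's sister is Susana",
    question="Who is Mary's sister?",
)
print(filled)
```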

We can now chain the prompt with the model and the output parser.

<img src='images/chain2.png' width="1200">

In [13]:
chain = prompt | model | parser
chain.invoke({
    "context": "Mary's sister is Susana",
    "question": "Who is Susana's sister?"
})

"Susana's sister is Mary."

In [14]:
chain = prompt | model | parser
chain.invoke({
    "context": "Mary's sister is Susana",
    "question": "Who is Mary's brother?"
})

"I don't know, please contact julien.brajard@nersc.no"

## Combining chains

We can combine different chains to create more complex workflows. For example, let's create a second chain that translates the answer from the first chain into a different language.

Let's start by creating a new prompt template for the translation chain:

In [15]:
translation_prompt = ChatPromptTemplate.from_template(
    "Translate {answer} to {language}"
)

We can now create a new translation chain that combines the result from the first chain with the translation prompt.

Here is what the new workflow looks like:

<img src='images/chain3.png' width="1200">

In [16]:
from operator import itemgetter

translation_chain = (
    {"answer": chain, "language": itemgetter("language")} | translation_prompt | model | parser
)

translation_chain.invoke(
    {
        "context": "Mary's sister is Susana. She doesn't have any more siblings.",
        "question": "How many sisters does Mary have?",
        "language": "Norwegian",
    }
)

'Mary har en søster, Susana.'
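
The `|` operator used in the chains above feeds the output of one step into the next. A rough way to picture this is a small class that overloads `__or__`. The `Step` class and the fake model and parser below are hypothetical stand-ins for illustration, not LangChain's actual classes.

```python
class Step:
    """Hypothetical stand-in for a chain step; not LangChain's Runnable."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # `a | b` builds a new step that runs a, then feeds its output to b.
        return Step(lambda value: other.invoke(self.invoke(value)))


# A toy "model" that wraps its answer in a dict, and a toy "parser" that unwraps it.
fake_model = Step(lambda question: {"content": f"Echo: {question}"})
fake_parser = Step(lambda message: message["content"])

toy_chain = fake_model | fake_parser
print(toy_chain.invoke("What is the purpose of a climate model?"))
```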

## Loading the PDF file

The context we want to send to the model comes from a PDF file.

In [2]:
from langchain_community.document_loaders import UnstructuredPDFLoader



In [3]:
loader = UnstructuredPDFLoader("gmd-6-687-2013.pdf")

data = loader.load()




[nltk_data] Downloading package punkt to /home/julaja/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/julaja/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [39]:
data

[Document(page_content='Natural Hazards and Earth System SciencesOpen AccessDiscussions\n\nClimate of the PastOpen Access\n\nSolid EarthOpen Access\n\nOpen AccessBiogeosciencesDiscussions\n\nAnnales GeophysicaeOpen Access\n\nHydrology and Earth SystemSciencesOpen Access\n\nEGU Journal Logos (RGB)\n\nGeoscientificInstrumentation Methods andData SystemsOpen Access\n\nAtmospheric MeasurementTechniquesOpen AccessDiscussions\n\nGeoscientificInstrumentation Methods andData SystemsOpen AccessDiscussions\n\nThe CryosphereOpen Access\n\nAtmospheric MeasurementTechniquesOpen Access\n\nEarth System DynamicsOpen Access\n\nNonlinear Processes in GeophysicsOpen Access\n\nNatural Hazards and Earth System SciencesOpen Access\n\nOpen AccessClimate of the PastDiscussions\n\nOpen AccessEarth System DynamicsDiscussions\n\nAtmospheric Chemistryand PhysicsOpen Access\n\nBiogeosciencesOpen Access\n\nOcean ScienceOpen Access\n\nHydrology and Earth SystemSciencesOpen AccessDiscussions\n\nOpen AccessOcean Scien

## Using the entire document as context

If we try to invoke the chain using the entire document as context, the model will return an error because the context is too long.

Large language models support limited context sizes. The paper we are using is too long for the model to handle, so we need to find a different solution.

In [17]:
try:
    chain.invoke({
        "context": data,
        "question": "Is reading papers a good idea?"
    })
except Exception as e:
    print(e)

Error code: 400 - {'error': {'message': "This model's maximum context length is 16385 tokens. However, your messages resulted in 51905 tokens. Please reduce the length of the messages.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
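
A cheap pre-check can catch this before calling the API. The sketch below assumes roughly four characters per token, which is only a rule of thumb for English text; an exact count would need the model's tokenizer. The function names and the number of tokens reserved for the answer are assumptions; the 16385 limit comes from the error message above.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate; the characters-per-token ratio is an assumption."""
    return -(-len(text) // chars_per_token)  # ceiling division


def fits_in_context(text, max_tokens=16385, reserved_for_answer=500):
    """Check whether the text likely fits, leaving room for the model's reply."""
    return estimate_tokens(text) <= max_tokens - reserved_for_answer


short_text = "NorESM is an Earth system model. " * 10
long_text = "NorESM is an Earth system model. " * 10000
print(fits_in_context(short_text), fits_in_context(long_text))
```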


## Splitting the document

Since we can't use the entire document as the context for the model, a potential solution is to split it into smaller chunks. We can then invoke the model using only the chunks that are relevant to a particular question:

<img src='images/system2.png' width="1200">

The paper is already loaded in memory as `data`, so we only need to decide how to split it.

There are many different ways to split a document. For this example, we'll use a simple splitter that splits the document into chunks of a fixed size. Check [Text Splitters](https://python.langchain.com/docs/modules/data_connection/document_transformers/) for more information about different approaches to splitting documents.

For illustration purposes, let's split the document into chunks of 100 characters with an overlap of 20 characters and display the first few chunks:

In [18]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
text_splitter.split_documents(data)[:5]

[Document(page_content='Natural Hazards and Earth System SciencesOpen AccessDiscussions\n\nClimate of the PastOpen Access', metadata={'source': 'gmd-6-687-2013.pdf'}),
 Document(page_content='Solid EarthOpen Access\n\nOpen AccessBiogeosciencesDiscussions\n\nAnnales GeophysicaeOpen Access', metadata={'source': 'gmd-6-687-2013.pdf'}),
 Document(page_content='Hydrology and Earth SystemSciencesOpen Access\n\nEGU Journal Logos (RGB)', metadata={'source': 'gmd-6-687-2013.pdf'}),
 Document(page_content='GeoscientificInstrumentation Methods andData SystemsOpen Access', metadata={'source': 'gmd-6-687-2013.pdf'}),
 Document(page_content='Atmospheric MeasurementTechniquesOpen AccessDiscussions', metadata={'source': 'gmd-6-687-2013.pdf'})]

For our specific application, let's use 1000 characters instead:

In [19]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
documents = text_splitter.split_documents(data)
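
To see what chunk size and overlap mean, here is a simplified sliding-window splitter in plain Python. It is a sketch of fixed-size chunking only. `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph boundaries, which this hypothetical `split_with_overlap` helper does not attempt.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
print(split_with_overlap(sample, chunk_size=10, chunk_overlap=3))
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, so a sentence cut at a boundary still appears intact in at least one chunk when the overlap is large enough.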

## Finding the relevant chunks

Given a particular question, we need to find the relevant chunks from the document to send to the model. Here is where the idea of **embeddings** comes into play.

An embedding is a mathematical representation of the semantic meaning of a word, sentence, or document. It's a projection of a concept in a high-dimensional space. Embeddings have a simple characteristic: The projection of related concepts will be close to each other, while concepts with different meanings will lie far away. You can use [Cohere's Embed Playground](https://dashboard.cohere.com/playground/embed) to visualize embeddings in two dimensions.

To provide the model with the most relevant chunks, we can use the embeddings of the question and the chunks of the document to compute the similarity between them. We can then select the chunks with the highest similarity to the question and use them as the context for the model:

<img src='images/system3.png' width="1200">

Let's generate embeddings for an arbitrary query:

In [20]:
from langchain_openai.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
embedded_query = embeddings.embed_query("Who is Mary's sister?")

print(f"Embedding length: {len(embedded_query)}")
print(embedded_query[:10])

Embedding length: 1536
[-0.001371190081765891, -0.03434698236453119, -0.011476094990116788, 0.0012773800454156574, -0.026166747008526288, 0.009230907949392044, -0.015660022937300136, 0.0017948988196774898, -0.011851335135517721, -0.03324627818637449]


To illustrate how embeddings work, let's first generate the embeddings for two different sentences:

In [21]:
sentence1 = embeddings.embed_query("Mary's sister is Susana")
sentence2 = embeddings.embed_query("Pedro's mother is a teacher")

We can now compute the similarity between the query and each of the two sentences. The closer the embeddings are, the more similar the sentences will be.

We can use [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity) to calculate the similarity between the query and each of the sentences:

In [22]:
from sklearn.metrics.pairwise import cosine_similarity

query_sentence1_similarity = cosine_similarity([embedded_query], [sentence1])[0][0]
query_sentence2_similarity = cosine_similarity([embedded_query], [sentence2])[0][0]

query_sentence1_similarity, query_sentence2_similarity

(0.91745489543827, 0.7680495517171415)
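
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction. The following sketch computes it with the standard library on tiny hand-made vectors; the `cosine` helper is for illustration and is not what `sklearn` runs internally.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal directions
print(cosine([1.0, 2.0], [2.0, 4.0]))  # same direction, different length
```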

## Setting up a Vector Store

We need an efficient way to store document chunks and their embeddings, and to perform similarity searches at scale. To do this, we'll use a **vector store**.

A vector store is a database of embeddings that specializes in fast similarity searches. 

<img src='images/system4.png' width="1200">

To understand how a vector store works, let's create one in memory and add a few embeddings to it:

In [23]:
from langchain_community.vectorstores import DocArrayInMemorySearch

vectorstore1 = DocArrayInMemorySearch.from_texts(
    [
        "Mary's sister is Susana",
        "John and Tommy are brothers",
        "Patricia likes white cars",
        "Pedro's mother is a teacher",
        "Lucia drives an Audi",
        "Mary has two siblings",
    ],
    embedding=embeddings,
)

We can now query the vector store to find the most similar embeddings to a given query:

In [24]:
vectorstore1.similarity_search_with_score(query="Who is Mary's sister?", k=3)

[(Document(page_content="Mary's sister is Susana"), 0.9174549036927803),
 (Document(page_content='Mary has two siblings'), 0.9045440036524318),
 (Document(page_content='John and Tommy are brothers'), 0.8015357441152158)]
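
The idea behind a similarity search can be sketched as a brute-force scan: score every stored vector against the query and keep the top `k`. The `ToyVectorStore` class and the three-dimensional vectors below are hypothetical and stand in for real embeddings. Production vector stores typically use specialized indexes rather than scanning everything.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class ToyVectorStore:
    """Hypothetical brute-force store used only to illustrate the search."""

    def __init__(self):
        self.items = []  # list of (text, vector) pairs

    def add(self, text, vector):
        self.items.append((text, vector))

    def search(self, query_vector, k=3):
        # Score every item, then return the k highest-scoring (text, score) pairs.
        scored = [(text, cosine(query_vector, vec)) for text, vec in self.items]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


store = ToyVectorStore()
# Hand-made vectors (assumed values), not real embeddings.
store.add("Mary's sister is Susana", [0.9, 0.1, 0.0])
store.add("Lucia drives an Audi", [0.0, 0.2, 0.9])
store.add("Mary has two siblings", [0.8, 0.3, 0.1])

top = store.search([1.0, 0.1, 0.0], k=2)
print([text for text, _ in top])
```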

## Connecting the vector store to the chain

We can use the vector store to find the most relevant chunks from the document to send to the model. Here is how we can connect the vector store to the chain:

<img src='images/chain4.png' width="1200">

We need to configure a [Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/). The retriever will run a similarity search in the vector store and return the most similar documents back to the next step in the chain.

We can get a retriever directly from the vector store we created before: 

In [25]:
retriever1 = vectorstore1.as_retriever()
retriever1.invoke("Who is Mary's sister?")

[Document(page_content="Mary's sister is Susana"),
 Document(page_content='Mary has two siblings'),
 Document(page_content='John and Tommy are brothers'),
 Document(page_content="Pedro's mother is a teacher")]

Our prompt expects two parameters, "context" and "question." We can use the retriever to find the chunks we'll use as the context to answer the question.

We can create a map with the two inputs by using the [`RunnableParallel`](https://python.langchain.com/docs/expression_language/how_to/map) and [`RunnablePassthrough`](https://python.langchain.com/docs/expression_language/how_to/passthrough) classes. This will allow us to pass the context and question to the prompt as a map with the keys "context" and "question."

In [26]:
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

setup = RunnableParallel(context=retriever1, question=RunnablePassthrough())
setup.invoke("What color is Patricia's car?")

{'context': [Document(page_content='Patricia likes white cars'),
  Document(page_content='Lucia drives an Audi'),
  Document(page_content="Pedro's mother is a teacher"),
  Document(page_content="Mary's sister is Susana")],
 'question': "What color is Patricia's car?"}
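
The map step can be pictured as calling every entry with the same input and collecting the results under the matching key. The sketch below shows that shape with plain functions. `run_parallel_map`, `fake_retriever`, and `passthrough` are hypothetical names, and the sketch runs the entries one after another rather than concurrently.

```python
def run_parallel_map(steps, value):
    """Call every step with the same input and collect the results by key."""
    return {key: step(value) for key, step in steps.items()}


def fake_retriever(question):
    # Hypothetical retriever: returns a fixed list instead of searching.
    return ["Patricia likes white cars"]


def passthrough(value):
    # Returns the input unchanged, like the question entry in the map above.
    return value


result = run_parallel_map(
    {"context": fake_retriever, "question": passthrough},
    "What color is Patricia's car?",
)
print(result)
```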

Let's now add the setup map to the chain and run it:



In [27]:
chain = setup | prompt | model | parser
chain.invoke("What color is Patricia's car?")

'Based on the context provided, Patricia likes white cars.'

Let's invoke the chain using another example:

In [28]:
chain.invoke("What car does Lucia drive?")

'Lucia drives an Audi.'

## Loading the document into the vector store

We initialized the vector store with a few arbitrary strings. Let's create a new vector store using the chunks from the PDF.

In [29]:
vectorstore2 = DocArrayInMemorySearch.from_documents(documents, embeddings)

Let's set up a new chain using the correct vector store. This time we are using a different but equivalent syntax to specify the [`RunnableParallel`](https://python.langchain.com/docs/expression_language/how_to/map) portion of the chain:

In [31]:
chain = (
    {"context": vectorstore2.as_retriever(), "question": RunnablePassthrough()}
    | prompt
    | model
    | parser
)
chain.invoke("What is NorESM?")

'NorESM is the Norwegian Earth System Model, which is a nationally coordinated effort building on previous research projects and models developed in Norway. It exists in different versions with different resolutions and components.'
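
The chain above hands the retriever's list of documents straight to the prompt, so the context presumably includes each document's printed representation, metadata included. A common refinement is to join only the chunk texts before filling the prompt. The sketch below shows that step with a hypothetical `Doc` stand-in and a hypothetical `format_docs` helper; neither appears in the original notebook.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document chunk."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs, separator="\n\n"):
    """Join the chunk texts into one context string for the prompt."""
    return separator.join(doc.page_content for doc in docs)


retrieved = [
    Doc("NorESM is largely based on CCSM4."),
    Doc("CAM4-Oslo substitutes CAM4 as the atmosphere module."),
]
print(format_docs(retrieved))
```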

In [33]:
chain.invoke("How is AMOC represented in NorESM?")

'It is not clear what causes the vigorous AMOC intensity in the NorESM experiment. The counterclockwise abyssal circulation of Antarctic Bottom Water (AABW) in the Atlantic basin is not very prominent in the simulation. The simulated AMOC in NorESM is on the strong side compared to observation-based estimates and other model simulations. The bias in the AMOC leads to a too warm and saline Atlantic Ocean at depth. However, the mechanisms responsible for the strong AMOC in NorESM are not clear.'

In [35]:
chain.invoke("What is the sea ice model in NorESM?")

'The sea ice model in NorESM is the original CICE4 version used in CCSM4.'

In [36]:
chain.invoke("What is the resolution of the model?")

'The resolution of the model presented in the document is approximately 2° for atmosphere and land components, and 1° for ocean and ice components.'

In [37]:
chain.invoke("What is the main problem of NorESM?")

'The main problem of NorESM is the improvement, implementation, and verification of climate processes that are of particular importance at high latitudes, as well as analyzing climate feedbacks, responses, and sensitivities of low-latitude climate.'

In [38]:
chain.invoke("Will it rain in Norway this summer?")

"I don't know, please contact julien.brajard@nersc.no."

## Setting up Pinecone

So far we've used an in-memory vector store. In practice, we need a vector store that can handle large amounts of data and perform similarity searches at scale. For this example, we'll use [Pinecone](https://www.pinecone.io/).

The first step is to create a Pinecone account, set up an index, get an API key, and set it as an environment variable `PINECONE_API_KEY`.

Then, we can load the document chunks into Pinecone:

In [44]:
from langchain_pinecone import PineconeVectorStore

index_name = "noresmpaper-rag-index"

pinecone = PineconeVectorStore.from_documents(
    documents, embeddings, index_name=index_name
)

Let's now run a similarity search on Pinecone to make sure everything works:

In [45]:
pinecone.similarity_search("What is the main problem of NorESM?")[:3]

[Document(page_content='Central to the NorESM activity is therefore improvement, implementation and veriﬁcation of climate processes that are of particular importance at high (northern) latitudes, and con- sequently for polar climate. As the tropics are of key impor- tance for global heat and moisture budgets, as well as for generating and inﬂuencing major climate variability modes, analysis of climate feedbacks, responses and sensitivities of low-latitude climate are an inherent part of the activity.\n\n2 Model description\n\nNorESM is, as mentioned above, largely based on CCSM4. The main differences are the isopycnic coordinate ocean module in NorESM and that CAM4-Oslo substitutes CAM4 as the atmosphere module. The sea ice and land models in NorESM are basically the same as in CCSM4 and the Com- munity Earth System Model version 1 (CESM1), except that deposited soot and mineral dust aerosols on snow and sea ice are based on the aerosol calculations in CAM4-Oslo.\n\n2.1 Atmospheric co

Let's set up the new chain using Pinecone as the vector store:

In [46]:
chain = (
    {"context": pinecone.as_retriever(), "question": RunnablePassthrough()}
    | prompt
    | model
    | parser
)

chain.invoke("What is the main problem of NorESM?")

'Central to the NorESM activity is therefore improvement, implementation and verification of climate processes that are of particular importance at high (northern) latitudes, and consequently for polar climate. As the tropics are of key importance for global heat and moisture budgets, as well as for generating and influencing major climate variability modes, analysis of climate feedbacks, responses and sensitivities of low-latitude climate are an inherent part of the activity.\n\nTherefore, the main problem of NorESM is the need for continuous improvement and verification of climate processes, especially those relevant to high latitudes and polar climate, as well as the analysis of climate feedbacks and sensitivities in low-latitude regions.'