# 📚 Introduction to Retrieval Augmented Generation with LangChain 🦜🔗

In this notebook you'll learn how to use LangChain for Retrieval Augmented Generation.

We will use an LLM to answer questions about our own documents!

## ⚙️ Setup

👉 Run the cell below to import a couple of basic libraries.

In [None]:
%load_ext autoreload
%autoreload 2
import os
from pprint import pprint
from IPython.display import Markdown

👉 Run the cell below to load our API key again:

In [None]:
from dotenv import load_dotenv

load_dotenv()  # Load environment variables from .env file

## 📚 Why RAG?

An LLM on its own can answer questions about everything it has learned during training.

That has a couple of drawbacks:
- The training data comes from the past and is not updated with the most recent data.
- It only knows the data it was trained on.

We want to use an LLM to work with our own data. That is where RAG, or Retrieval-Augmented Generation, steps in.

1. **Retrieval-Augmented Generation (RAG)** combines a language model with a document retriever to enhance factual accuracy.
2. **It retrieves relevant external documents** (e.g., from a knowledge base) before generating responses.
3. **The language model uses both the prompt and retrieved context** to produce more informed and grounded outputs.

## üá™üá∫ Context

In this challenge, we'll work with documents from the European Parliament.

Imagine you're a reporter, and you want to know what has been said about a certain topic during the European Parliament's plenary sessions. Those sessions take place 12 times a year in Strasbourg and last 4 days each. Transcripts of the sessions are available on the EP's website.

You definitely don't want to go ploughing through all those transcripts. So, let's leverage RAG to make our life easier!

This is good data to work with, because we can always grab brand-new data to test it out.

## 📘 Let's get the data

1. Head to the [EP's website](https://www.europarl.europa.eu/plenary/en/debates-video.html). 
1. That will lead you to the most recent plenary session.
1. Under the first date, click on `HTML` in "▶️ Verbatim reports HTML".
1. Scroll to the bottom of the page and download the PDF file.
1. Save the file in the `data` folder.

We'll start with one document, but you can already download the reports for a couple of other days for later.

Have a look at the document. How many pages does it have? Quickly scroll through the document to get a feel for it.


## 🔢 Embedding documents

Embedding documents means that we will translate whole documents, or chunks of documents, into vectors.
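To make the idea concrete, here is a minimal sketch of turning text into a fixed-length vector. It is a toy stand-in, not the Google embedder used in this notebook: `toy_embed` and its `dim` parameter are hypothetical names, and the hashing trick captures none of the meaning that a real embedding model learns. It only shows the shape of the result, a list of floats of a fixed size.

```python
import hashlib


def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Map a text to a fixed-length vector by hashing each word into a bucket.

    Hypothetical toy embedder for illustration only; a real embedding model
    returns dense vectors that encode meaning.
    """
    vector = [0.0] * dim
    for word in text.lower().split():
        # md5 gives a stable hash across runs (Python's built-in hash() does not)
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vector[bucket] += 1.0
    return vector


sample = toy_embed("What is the capital of France?")
print(type(sample), len(sample))
```

Whatever the length of the input text, the output always has `dim` entries. That fixed size is what makes vectors comparable with each other.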

LangChain🦜🔗 will be very helpful again.

Let's instantiate an embedder and try it out. Because we're using Gemini as our LLM, let's stick to Google's text embedders.

In [None]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings

embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")

👉 Try the embedder's `.embed_query()` to embed a simple piece of text.

In [None]:
# Embed a text like "What is the capital of France?" and save it to a variable `sample_embedding`

# YOUR CODE HERE


👉 Take the time to explore this `sample_embedding`. What does it look like? What's its type? What is the embedding size?

In [None]:
# YOUR CODE HERE

## 💾 Load our real data from PDF

Now that we know what an embedding looks like, it's time to start working with our real data.

👉 Head to the [LangChain documentation](https://python.langchain.com/docs/how_to/document_loader_pdf/), and find out how you can load a PDF.

In the documentation you'll see an `async for` loop. Let's not get into *[asynchronous programming](https://en.wikipedia.org/wiki/Asynchrony_(computer_programming))* now, as it would make things too complex. Use this instead:

```python
pages = []
for page in loader.lazy_load():
    pages.append(page)
```

👉 Then go ahead and load one of the PDFs you downloaded before.

In [None]:
# YOUR CODE HERE

👉 Explore the `pages`:
- What is its data type?
- How many pages do you have?
- What is the type of one page?
- How can you access the content of one page?
- How many characters does the full document have?
- What is in the `metadata` of a page?

In [None]:
# YOUR CODE HERE

## ✂️ Split our data

Our complete document is too long to be embedded in one go. Our text embedder accepts inputs of up to 2,048 tokens. For Gemini models that is about 8,192 characters (roughly 4 characters per token).
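The arithmetic behind that limit, as a quick sketch. The 4-characters-per-token figure is the rough rule of thumb quoted above, not an exact tokenizer count, and `fits_in_embedder` is a hypothetical helper name.

```python
MAX_TOKENS = 2_048     # input limit of the text embedder, as stated above
CHARS_PER_TOKEN = 4    # rough average, not an exact tokenizer count

max_chars = MAX_TOKENS * CHARS_PER_TOKEN


def fits_in_embedder(text: str) -> bool:
    """Rough check: is this text likely short enough to embed in one call?"""
    return len(text) <= max_chars


print(max_chars)
print(fits_in_embedder("A short sentence."))
```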

So we want to split our document into smaller chunks.

We already have a bunch of pages we could work with. But page breaks are rather arbitrary: they often fall in the middle of a sentence.

Also, there is no overlap between the pages, so the first line of a page lacks all the context that came before it. It's better to split the full text with a bit of overlap.
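A minimal sketch of what splitting with overlap means, using a plain sliding window over characters. This is not LangChain's recursive splitter, which also tries to break on paragraph and sentence boundaries; `split_with_overlap` is a hypothetical helper for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Cut `text` into chunks of `chunk_size` characters.

    Each chunk starts `chunk_size - overlap` characters after the previous one,
    so consecutive chunks share `overlap` characters.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, overlap=2)
print(chunks)
```

Every chunk after the first begins with the last `overlap` characters of the previous chunk, so no chunk starts without any preceding context.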

First, we'll load the PDF again, this time without splitting it.

In [None]:
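from langchain_community.document_loaders import PyPDFLoader

# `file_path` is the path to the PDF you loaded above
loader = PyPDFLoader(file_path, mode='single')  # 'single': the whole PDF as one document
pdf_text = loader.load()
len(pdf_text[0].page_content)  # total number of characters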

Now that we have our whole PDF as one document, we can split it into chunks in a smarter way.

👉 Again, head over to the [LangChain documentation](https://python.langchain.com/docs/how_to/recursive_text_splitter/) and find out how to split our `pdf_text` into chunks (called `documents` in LangChain). Store the result in a variable called `all_splits`.

Split it into chunks of 2,000 characters (that's about half a page in our case) with an overlap of 400 characters. You can experiment with other values if you want.

In [None]:
# YOUR CODE HERE

👉 Inspect `all_splits`:
- What is its data type?
- How many splits do you have?
- What is the type of one split?
- How can you access the content of one split?
- How many characters do we have in total now?
- What is in the `metadata` of a split?

In [None]:
# YOUR CODE HERE

## 🗄️ Bring it all together: embed and store our documents in a vector store

We have:
- An embedder
- A loader to load the data
- A text splitter to split our document into chunks

What's missing?

We can embed our documents, but we want to store them somewhere. That's where a vector store comes in: it allows us to save:
- the document (the chunk),
- its embedding,
- its metadata.

In the next step, we'll then be able to retrieve documents efficiently.
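To picture what such a store keeps track of, here is a minimal sketch of an in-memory store. It is a hypothetical toy, not LangChain's `InMemoryVectorStore`: the names `Record`, `ToyVectorStore` and `add_texts` are made up for illustration, and the embedder passed in is a dummy function.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Record:
    """One stored chunk: its text, its embedding and its metadata."""
    text: str
    embedding: list[float]
    metadata: dict = field(default_factory=dict)


class ToyVectorStore:
    """Hypothetical in-memory store, for illustration only."""

    def __init__(self, embed):
        self.embed = embed  # function: str -> list[float]
        self.records: dict[str, Record] = {}

    def add_texts(self, texts, metadatas=None):
        """Embed and store each text; return the generated IDs."""
        metadatas = metadatas or [{} for _ in texts]
        ids = []
        for text, metadata in zip(texts, metadatas):
            doc_id = str(uuid.uuid4())
            self.records[doc_id] = Record(text, self.embed(text), metadata)
            ids.append(doc_id)
        return ids

    def get_by_ids(self, ids):
        """Look up stored records by their IDs."""
        return [self.records[doc_id] for doc_id in ids]


# Dummy embedder: a one-dimensional "vector" holding the text length
store = ToyVectorStore(embed=lambda text: [float(len(text))])
ids = store.add_texts(["first chunk", "second chunk"], [{"page": 0}, {"page": 1}])
first = store.get_by_ids(ids[:1])[0]
print(first.text, first.embedding, first.metadata)
```

The IDs returned when adding documents are the handle for fetching them back later, which is the same role the document IDs play in the exercise below.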

👉 Check the [LangChain documentation](https://python.langchain.com/docs/concepts/vectorstores/) to see how you can create an `InMemoryVectorStore`.

In [None]:
# Import the necessary libraries

# YOUR CODE HERE

# Create an in-memory vector store using the embedder `embeddings` we created earlier

# YOUR CODE HERE

# Add the `all_splits` to the vector store and store the result in a variable called `document_ids`

# YOUR CODE HERE


In [None]:
# Have a look at the first 3 document IDs

# YOUR CODE HERE


In [None]:
# Use the vector store's `get_by_ids` method. You have to give it a list of document IDs.

# YOUR CODE HERE


👉 How can you access the content and metadata of a document in the vector store?

In [None]:
# YOUR CODE HERE

## 🔎 Use the vector store to retrieve similar documents

Now that we have embedded the documents, we can use the vector store to retrieve similar documents.

👉 Check the [LangChain documentation](https://python.langchain.com/docs/concepts/vectorstores/) to see how that works.

Use a query, e.g. "Summarize the discussion on agricultural policy.", and find the most similar documents. You can also specify the number of documents to retrieve.

In [None]:
# Save your question into a variable called `query`

# YOUR CODE HERE

# Use the vector store to find similar documents to the query. Store the result in a variable called `retrieved_docs`

# YOUR CODE HERE


This concludes the so-called "Retrieval" part of RAG: we can now find the documents that are the most similar to our query.
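Under the hood, "most similar" usually means: embed the query, compare it with every stored embedding, and keep the top `k`. Below is a minimal sketch of that idea using cosine similarity. The word-count `embed` function is a toy stand-in for a real embedder, and all names and sample sentences are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a real embedder returns dense float vectors)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_search(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks whose embeddings are closest to the query's."""
    query_vector = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vector, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "farmers asked for changes to the agricultural policy",
    "the budget for space research was debated",
    "agricultural subsidies and policy reform for farmers",
]
top = similarity_search("agricultural policy for farmers", chunks, k=2)
print(top)
```

A real vector store embeds each chunk only once, when it is added, and uses smarter indexing than a full sort. The ranking principle is the same.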

Most of the work is done now!

## 💬 Generate an answer to our question

So far we only used an **embedding model** to enable us to retrieve the most similar documents.

Now, we will use a generative LLM to get an answer to our question: we'll feed it with our retrieved documents, and our question.

The most rudimentary way to do this would be to concatenate all our inputs together, add our question, and see the result.

Let's give it a try.

👉 First instantiate an LLM like in the previous challenges.

In [None]:
# YOUR CODE HERE

Then create a rudimentary prompt:

In [None]:
prompt = '\n\n'.join([doc.page_content for doc in retrieved_docs])
prompt += "\n\n" + query

👉 Now use the prompt:

In [None]:
# YOUR CODE HERE

That's not bad, but we could do better by writing a more extensive prompt, giving the model more guidance.
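As a rough illustration of what "more guidance" can look like, here is a minimal sketch of a hand-written prompt template. The wording of `TEMPLATE` and the `build_prompt` helper are hypothetical; they are just one way to tell the model how to use the context.

```python
TEMPLATE = """You are an assistant answering questions about parliamentary debates.
Use only the context below. If the context does not contain the answer, say that you don't know.

Context:
{context}

Question: {question}
Answer:"""


def build_prompt(question: str, chunks: list[str]) -> str:
    """Join the retrieved chunks and fill them into the template with the question."""
    context = "\n\n".join(chunks)
    return TEMPLATE.format(context=context, question=question)


guided_prompt = build_prompt(
    "What was said about farming?",
    ["Chunk one.", "Chunk two."],
)
print(guided_prompt)
```

Compared with plain concatenation, the template separates context from question and tells the model what to do when the context is not enough.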

It turns out we're not the first ones doing this, and LangChain has a library of pre-made prompts for us.

👉 Run the cell below, and try to understand how it works. (You may get a `LangSmithMissingAPIKeyWarning`; you can disregard it.)

In [None]:
from langchain import hub

prompt_template = hub.pull("rlm/rag-prompt")

example_messages = prompt_template.invoke(
    {"context": "(context goes here)", "question": "(question goes here)"}
).to_messages()

print("\n")
print(example_messages[0].content)

See how LangChain generated a more precise prompt for us? Let's use this for our RAG!

👉 First, join all retrieved docs into one long string, separated by two newlines.

In [None]:
# YOUR CODE HERE

👉 Next, create a `prompt` starting from your query and the retrieved documents. Remember to look at the example above.

In [None]:
# YOUR CODE HERE

👉 Finally, call the LLM with the `prompt` we just created:

In [None]:
# YOUR CODE HERE

🎉 We have finished our first RAG: the LLM generated text ***grounded*** in the documents we provided it with.

## 💾 Persisting our embeddings

So far we have worked with an in-memory vector store. That means that when you close your notebook, you lose all the embeddings.

⚠️ Remember that these embeddings are generated by models running on your provider's platform, in this case on Google's machines. And they don't work for free. 💰

For a single, relatively small document like this one, the cost is low, but it quickly adds up. So far, we have only worked on one day's transcripts. There are 3 more per session, 12 sessions per year, multiple years...

To solve this, we will simply replace our vector store with a persistent one. That's the advantage of LangChain: it's very easy to swap components.
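To illustrate what "persistent" buys us, here is a minimal sketch that writes embeddings to disk and reads them back, so they do not need to be computed (and paid for) again. It uses a plain JSON file and hypothetical helper names; a real persistent vector store does much more than this, and the sketch does not reflect any particular store's API.

```python
import json
import tempfile
from pathlib import Path


def save_embeddings(path: Path, records: list[dict]) -> None:
    """Write the records (text, embedding, metadata) to a JSON file."""
    path.write_text(json.dumps(records), encoding="utf-8")


def load_embeddings(path: Path) -> list[dict]:
    """Read the records back; return an empty list if nothing was saved yet."""
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


records = [
    {"text": "first chunk", "embedding": [0.1, 0.2], "metadata": {"page": 0}},
]

with tempfile.TemporaryDirectory() as folder:
    store_path = Path(folder) / "embeddings.json"
    save_embeddings(store_path, records)
    restored = load_embeddings(store_path)  # what a fresh session would do

print(restored == records)
```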

Our in-memory vector store was great for experimenting, now we'll switch it for another one. We will use [Chroma](https://www.trychroma.com/), a very popular vector store. We can run it locally, and use it through LangChain.

We'll recreate our whole flow. It's a good exercise to try to bring it all together again in a couple of code cells. At the same time we'll refactor everything into some reusable code.

We want to have two functions in the end:

1. `embed_and_store()`: Add another session's transcript to our vector db, so that we have more data to retrieve from.
2. `answer()`: Query our vector store with different questions.

#### 1. Instantiate a Chroma vector store

👉 Look at [LangChain's documentation](https://python.langchain.com/docs/integrations/vectorstores/chroma/) to see how to do that.

In [None]:
# YOUR CODE HERE

#### 2. Create `embed_and_store()`

👉 Complete the code for this function:

In [None]:
def embed_and_store(file_path, vector_store):
    """Load a PDF file, split it into chunks, and store the chunks in a vector store."""
    # Load the PDF file
    pass  # YOUR CODE HERE

    # Split the pages into chunks
    pass  # YOUR CODE HERE

    # # Add the session_date to the metadata
    # for split in all_splits:
    #     split.metadata['session_date'] = session_date

    # Add the chunks to the vector store
    pass  # YOUR CODE HERE

    return document_ids

👉 Try out your function with a file or even two:

In [None]:
# YOUR CODE HERE

#### 3. Create `answer()`

👉 Complete the code for this function:

In [None]:
def answer(query, vector_store, llm, prompt_template=None):
    """Answer a query using the vector store and the language model."""
    # Retrieve similar documents from the vector store
    retrieved_docs = vector_store.similarity_search(query, k=6)

    # Create the prompt
    docs_content = "\n\n".join(doc.page_content for doc in retrieved_docs)

    # If no prompt template is provided, use the default one
    if not prompt_template:
        prompt_template = hub.pull("rlm/rag-prompt")

    prompt = prompt_template.invoke(
        {"context": docs_content, "question": query}
    )

    # Get the answer from the language model
    response = llm.invoke(prompt)

    return response.content

👉 Try out your function with a query of your liking:

In [None]:
# YOUR CODE HERE

🏁 Congratulations! You now master RAG using LangChain, and you learned how to make reusable functions to add more documents to your vector store, and to query it.

## [Optional] Adding metadata

The RAG we set up queries all the documents in the vector store. Imagine we have multiple years' worth of information in there. It would be handy if we could filter on years or dates, right?

How to do that? Remember that the documents in the vector store contain metadata. If we could add the date to it, we could use it later to filter.

Tip: Add your metadata as early as possible in your pipeline. Don't try to add it after your data has already been stored in the vector store.
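A minimal sketch of that tip: tag every chunk with the date right after splitting, before anything is stored. The `Chunk` class is a hypothetical stand-in for a LangChain document (it only mimics the `page_content` and `metadata` attributes), and the date value is a made-up example.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a document chunk with content and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def add_session_date(chunks: list[Chunk], session_date: str) -> list[Chunk]:
    """Add the session date to the metadata of every chunk, keeping existing keys."""
    for chunk in chunks:
        chunk.metadata["session_date"] = session_date
    return chunks


chunks = [Chunk("first chunk", {"page": 0}), Chunk("second chunk")]
tagged = add_session_date(chunks, "2025-01-20")  # made-up example date
print(tagged[0].metadata)
```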

👉 Adapt your `embed_and_store()` function.

In [None]:
def embed_and_store_fancy(file_path, vector_store, session_date):
    """Load a PDF file, split it into chunks, and store the chunks in a vector store.
    Session_date is added to the metadata of each chunk."""
    pass  # YOUR CODE HERE

    return document_ids

👉 Try out your function and check that your vector store contains the extra metadata.

In [None]:
# YOUR CODE HERE

Now we have to limit the retriever to the date asked by the user. 
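Conceptually, this means two steps: first keep only the chunks whose metadata matches, then rank what is left by similarity. Here is a minimal sketch of that combination, with shared-word counting as a toy stand-in for embedding similarity. The `filtered_search` helper, the sample chunks and the dates are all hypothetical.

```python
def filtered_search(query: str, chunks: list[dict], session_date: str, k: int = 2) -> list[dict]:
    """Keep only chunks from `session_date`, then rank them by shared words."""
    query_words = set(query.lower().split())
    candidates = [
        chunk for chunk in chunks
        if chunk["metadata"].get("session_date") == session_date
    ]
    ranked = sorted(
        candidates,
        key=lambda chunk: len(query_words & set(chunk["text"].lower().split())),
        reverse=True,
    )
    return ranked[:k]


# Made-up sample data for illustration
chunks = [
    {"text": "debate on agricultural policy", "metadata": {"session_date": "2025-01-20"}},
    {"text": "vote on agricultural subsidies", "metadata": {"session_date": "2025-02-10"}},
    {"text": "statement on fisheries", "metadata": {"session_date": "2025-02-10"}},
]
results = filtered_search("agricultural policy", chunks, session_date="2025-02-10", k=1)
print(results)
```

Note that the best textual match overall (the first chunk) is excluded because its date does not match: the filter is applied before the ranking.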

👉 Adapt your `answer()` function so it can take a date and filter documents based on the new metadata.

In [None]:
# YOUR CODE HERE

In [None]:
# YOUR CODE HERE

Nice! You have combined similarity search with metadata search to create a powerful RAG system!