# Retrieval Augmented Generation with LangChain 🦜🔗
# (adapted from our Challenge 'Generative AI and RAG' -> 'RAG with Langchain')

This notebook is taken from the challenge and adapted to the **political manifestos** I downloaded.
This was just a test run: the goal of the project is to generate summaries from the **speeches**.

I created a new project environment and duplicated the requirements.txt from the challenge.

https://kitt.lewagon.com/camps/2170/challenges?path=06-Deep-Learning%2F07-GenAI-and-RAG%2F03-RAG-with-LangChain


## ⚙️ Setup

👉 Run the cell below to import a couple of basic libraries.

In [1]:
%load_ext autoreload
%autoreload 2
import os
from pprint import pprint
from IPython.display import Markdown
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.prompts import ChatPromptTemplate

In [10]:
import re
from langchain_chroma import Chroma
from langchain.chat_models import init_chat_model
from langchain_classic import hub

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

👉 Run the cell below to load our API key again:

In [2]:
from dotenv import load_dotenv
load_dotenv()  # Load environment variables from .env file

True

## 📚 Why RAG?

An LLM on its own can answer questions about anything it has learned.

That has a couple of drawbacks:
- The training data comes from the past and is not updated with the most recent data.
- It only knows the data it was trained on.

We want to use an LLM to work with our own data. That is where RAG, or Retrieval-Augmented Generation, steps in.

1. **Retrieval-Augmented Generation (RAG)** combines a language model with a document retriever to enhance factual accuracy.
2. **It retrieves relevant external documents** (e.g., from a knowledge base) before generating responses.
3. **The language model uses both the prompt and retrieved context** to produce more informed and grounded outputs.
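The three steps above can be sketched without any library. This is a toy illustration under loud assumptions, not how LangChain or Chroma work internally: documents are scored by word overlap instead of embeddings, and `tokenize`, `retrieve` and `build_prompt` are hypothetical helper names.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, documents, k=2):
    """Return the k documents that share the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query, retrieved_docs):
    """Combine the retrieved context and the question into one prompt string."""
    context = "\n\n".join(retrieved_docs)
    return f"Context: {context}\n\nQuestion: {query}"

# Made-up example sentences standing in for manifesto chunks
documents = [
    "The manifesto proposes regulated migration pathways.",
    "The party wants to expand renewable energy.",
    "Migration agreements with third countries are planned.",
]

query = "What is said about migration?"
top_docs = retrieve(query, documents, k=2)   # step 2: retrieval
prompt = build_prompt(query, top_docs)       # step 3: prompt plus context
print(prompt)
```

In the real pipeline the string `prompt` would then be sent to the language model, which generates the final answer.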

## 🔢 Embedding documents

In [3]:
embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")
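As a reminder of why embeddings are useful for retrieval, here is a minimal sketch of comparing two vectors with cosine similarity. The three-dimensional vectors are made up for illustration; real embeddings have far more dimensions, and the vector store may use a different distance metric.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for real embeddings
migration = [0.9, 0.1, 0.0]
asylum = [0.8, 0.2, 0.1]
energy = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(migration, asylum)
sim_unrelated = cosine_similarity(migration, energy)
print(sim_related > sim_unrelated)
```

Texts with related meaning should end up with vectors that point in a similar direction, which is what a similarity search exploits.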

Now we know what an embedding looks like, it's time to get working with our real data.

👉 Head to the [LangChain documentation](https://docs.langchain.com/oss/python/integrations/document_loaders/index#pdfs), and find out how you can load a PDF using PyPDF.

👉 Then go ahead and load one of the PDFs you downloaded before.

In [17]:
model = init_chat_model("google_genai:gemini-2.5-flash-lite")

In [37]:
prompt_template = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Use the following context to answer the question. Use maximum 7 sentences. Use specific terms. Highlight important ones."),
    ("human", """Context: {context}  Question: {question}""")
    ])

example_messages = prompt_template.invoke(
    {"context": "(context goes here)", "question": "(question goes here)"}
).to_messages()

Instantiate Vector Store:

In [87]:
vector_store = Chroma(
    collection_name="example_collection",
    embedding_function=embeddings,
)


Write a function to populate the vector store with our own documents.

In [97]:
def embed_and_store(file_path, vector_store):
    """Load a PDF file, split it into chunks, and store the chunks in a vector store."""
    # Load the whole PDF as a single document
    loader = PyPDFLoader(file_path, mode='single')
    pdf = loader.load()

    # Split the document into overlapping chunks
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=200,
    )
    all_splits = splitter.split_documents(pdf)

    # Extract the party name from the file name, e.g. 'data/SPD_25Wahlprogramm.pdf' -> 'SPD'
    pattern = r"(?<=data/)[^_]+(?=_)"
    match = re.search(pattern, file_path)
    if match is None:
        raise ValueError(f"Could not extract a party name from {file_path!r}")

    # Add the party name to the metadata of every chunk
    for split in all_splits:
        split.metadata['party_name'] = match.group()

    # Add the chunks to the vector store
    vector_store.add_documents(documents=all_splits)

    return f'{file_path} embedded'
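To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch with a fixed sliding window. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to cut at separators such as paragraph breaks, so real chunks vary in length); `sliding_window_chunks` is a hypothetical helper for illustration only.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new window starts chunk_size - chunk_overlap characters after the previous one
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

chunks = sliding_window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

The end of each chunk is repeated at the start of the next one, so a sentence that straddles a chunk boundary is less likely to be cut off from its context.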

In [98]:
embed_and_store('data/B90G_25Wahlprogramm.pdf', vector_store)
# embed_and_store('data/AfD_25Wahlprogramm.pdf', vector_store)
# embed_and_store('data/BSW_25Wahlprogramm.pdf', vector_store)
# embed_and_store('data/FDP_25Wahlprogramm.pdf', vector_store)
# embed_and_store('data/SPD_25Wahlprogramm.pdf', vector_store)
# embed_and_store('data/DieLinke_25Wahlprogramm.pdf', vector_store)
# embed_and_store('data/CDU_25Wahlprogramm.pdf', vector_store)


'data/B90G_25Wahlprogramm.pdf embedded'
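The party name stored in the metadata comes from the regular expression in `embed_and_store`. A small check of that pattern on its own, with `extract_party` as a hypothetical helper and the last path invented to show a non-matching case:

```python
import re

# Same pattern as in embed_and_store: the text after "data/" up to the first underscore
pattern = r"(?<=data/)[^_]+(?=_)"

def extract_party(file_path):
    """Return the party prefix of the file name, or None if the path does not match."""
    match = re.search(pattern, file_path)
    return match.group() if match else None

parties = [
    extract_party(path)
    for path in [
        "data/B90G_25Wahlprogramm.pdf",
        "data/DieLinke_25Wahlprogramm.pdf",
        "manifestos/SPD.pdf",  # invented path without the "data/" prefix or an underscore
    ]
]
print(parties)
```

A path outside `data/` or without an underscore yields no match, so the naming convention of the files matters.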

In [78]:
def answer(query, vector_store, llm, party, prompt_template=None):
    """Answer a query using the vector store and the language model."""
    # Retrieve similar documents, restricted to one party if a party is given
    search_filter = {"party_name": party} if party else None
    retrieved_docs = vector_store.similarity_search(query, k=6, filter=search_filter)

    # Create the prompt
    docs_content = "\n\n".join(doc.page_content for doc in retrieved_docs)

    # If no prompt template is provided, use the default one
    if not prompt_template:
        prompt_template = hub.pull("rlm/rag-prompt")

    prompt = prompt_template.invoke(
        {"context": docs_content, "question": query}
    )

    # Get the answer from the language model
    response = llm.invoke(prompt)
    return response.content

👉 Try out your function with a query of your liking:

# Scenario 1

Feed entire text without metadata

In [79]:
query = 'What does AfD say about migration?'

In [99]:
Markdown(answer(query, vector_store, model,None, prompt_template=prompt_template))

The party advocates for **regulated migration pathways** through visa agreements and training partnerships for students, trainees, and skilled workers. They believe in **human rights-based cooperation** with third and transit countries, emphasizing that more regulated migration leads to less irregular migration. The goal is to **effectively and long-term reduce irregular and dangerous migration** to Europe by creating better local living conditions and implementing comprehensive migration agreements. They explicitly oppose outsourcing asylum procedures to third countries, citing cost and legal failures. The party also stresses the importance of distinguishing between flight and labor migration.

# preferred: Scenario 2

Feed entire text with metadata

In [None]:
query = 'What does the party say about migration?'

In [None]:
afd_speeches = ['ID214376', 'ID1694723', 'ID326294']

In [None]:
# Only the party filter is implemented in answer(); the date range ('mindate', 'maxdate') is still a placeholder
Markdown(answer(query, vector_store, model, 'AfD', prompt_template=prompt_template))
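Filtering by party and a date range, as the call above hints at, could be expressed as a Chroma-style metadata filter. This is only a sketch under assumptions: `build_filter` is a hypothetical helper, and the numeric `date` metadata field (for example `20250115`) does not exist anywhere in this notebook; it would have to be added when the speeches are embedded.

```python
def build_filter(party, min_date=None, max_date=None):
    """Build a metadata filter dict for a party and an optional date range.

    Assumes every chunk carries a numeric `date` metadata field (hypothetical).
    """
    conditions = [{"party_name": {"$eq": party}}]
    if min_date is not None:
        conditions.append({"date": {"$gte": min_date}})
    if max_date is not None:
        conditions.append({"date": {"$lte": max_date}})
    # A single condition is passed as is; "$and" is only used for two or more
    return conditions[0] if len(conditions) == 1 else {"$and": conditions}

party_only = build_filter("AfD")
party_and_dates = build_filter("AfD", min_date=20250101, max_date=20250223)
print(party_and_dates)
```

The resulting dict could then be passed as the `filter` argument of `vector_store.similarity_search`, in place of the plain party filter.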