# Retrieval Augmented Generation for European Parliament Transcripts

## Project Overview

This notebook implements a RAG (Retrieval Augmented Generation) system to query European Parliament plenary session transcripts. The EP holds 12 plenary sessions annually in Strasbourg, each lasting 4 days, generating extensive transcripts. Rather than manually searching through hundreds of pages, this system enables semantic search and question-answering over the corpus.

**Objective**: Build a system that can answer specific questions about parliamentary discussions by retrieving relevant document segments and generating contextual responses.

In [1]:
%load_ext autoreload
%autoreload 2

import os
from pprint import pprint
from IPython.display import Markdown, display

In [2]:
# Load environment variables
from dotenv import load_dotenv
load_dotenv()

True

Traditional LLMs have two key limitations for this use case:
1. Training data cutoff - they lack recent information
2. Domain specificity - they weren't trained on our specific document corpus

RAG addresses both issues by:
- Retrieving relevant documents from our own knowledge base
- Augmenting the LLM prompt with retrieved context
- Generating responses grounded in actual source material
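
The three steps above can be sketched in a few lines of plain Python. This is only a toy illustration: word overlap stands in for embedding similarity, `generate` is a placeholder for a real LLM call, and the corpus sentences are invented for the example.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of words, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k passages that share the most words with the question."""
    q_words = tokenize(question)
    ranked = sorted(corpus, key=lambda p: len(q_words & tokenize(p)), reverse=True)
    return ranked[:k]

def augment(question: str, passages: list[str]) -> str:
    """Place the retrieved passages in front of the question."""
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {question}"

def generate(prompt: str) -> str:
    """Placeholder for an LLM call (hypothetical; no model is invoked)."""
    return f"[LLM answer based on a {len(prompt)}-character prompt]"

# Invented sentences, not taken from the transcripts
corpus = [
    "The debate covered fisheries access after 2026.",
    "Members discussed support for young farmers.",
    "The agenda for the week was adopted.",
]
question = "What support do young farmers get?"

passages = retrieve(question, corpus, k=1)
prompt = augment(question, passages)
print(generate(prompt))
```

The rest of the notebook replaces each toy function with a real component: an embedding model and vector store for retrieval, a prompt template for augmentation, and Gemini for generation.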

I'll download recent European Parliament plenary session transcripts. These are available as PDFs from the official EP website.

In [3]:
# Create the target directory first; the download fails if data/ does not exist
!mkdir -p data
!curl -o data/CRE-10-2025-05-05_EN.pdf https://www.europarl.europa.eu/doceo/document/CRE-10-2025-05-05_EN.pdf
!curl -o data/CRE-10-2025-05-06_EN.pdf https://www.europarl.europa.eu/doceo/document/CRE-10-2025-05-06_EN.pdf

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  835k  100  835k    0     0  2443k      0 --:--:-- --:--:-- --:--:-- 2449k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1633k  100 1633k    0     0  3209k      0 --:--:-- --:--:-- --:--:-- 3208k


For semantic search, I need to convert text into vector embeddings. I'm using Google's text-embedding-004 model so that the embeddings and the Gemini LLM used later come from the same provider.

In [4]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings

embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")

In [5]:
# Test the embedding model
sample_embedding = embeddings.embed_query("What is the capital of France?")

print(f"Embedding type: {type(sample_embedding)}")
print(f"Embedding dimensions: {len(sample_embedding)}")
print(f"First 5 values: {sample_embedding[:5]}")

Embedding type: <class 'list'>
Embedding dimensions: 768
First 5 values: [-0.023838477209210396, -0.008524507284164429, 0.010140136815607548, -0.03635908290743828, 0.005881804041564465]


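**Observation**: The model returns 768-dimensional dense vectors. Individual dimensions are not interpretable on their own; what matters is that texts with similar meaning map to nearby vectors, so similarity can be measured with a metric such as cosine similarity.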
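
To make the similarity comparison concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made up for illustration; real embeddings from the model above would have 768 dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for real embeddings
v_query = [0.2, 0.1, 0.9]
v_close = [0.25, 0.05, 0.85]   # points in nearly the same direction
v_far = [-0.7, 0.6, 0.1]       # points in a very different direction

print(f"query vs close: {cosine_similarity(v_query, v_close):.3f}")
print(f"query vs far:   {cosine_similarity(v_query, v_far):.3f}")
```

The first score should come out much higher than the second, which is the property retrieval relies on.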

## Loading PDF Documents

I'll load one of the downloaded transcripts to understand its structure before processing the full corpus.

In [6]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "data/CRE-10-2025-05-05_EN.pdf"
loader = PyPDFLoader(file_path)

# lazy_load() yields one Document per page; collect them into a list
pages = list(loader.lazy_load())

In [7]:
# Analyze the loaded document structure
print(f"Type of pages: {type(pages)}")
print(f"Number of pages: {len(pages)}")
print(f"Type of one page: {type(pages[0])}")
print(f"Total characters: {sum([len(page.page_content) for page in pages])}")

Type of pages: <class 'list'>
Number of pages: 106
Type of one page: <class 'langchain_core.documents.base.Document'>
Total characters: 340488


In [8]:
# Examine a sample page
Markdown(pages[40].page_content)

05-05-2025 41
As no one wishes to speak against, Mr Geadi, do you agree with the alternative proposal from the 
S&D Group, on the title and the resolution?
1-0073-0000
Γεάδης Γεάδη, εξ ονόματος της ομάδας ECR. – Κυρία Πρόεδρε, σε πνεύμα εποικοδομητικής 
συνεργασίας, που οφείλουμε όλοι να επιδεικνύουμε, είμαστε πρόθυμοι να αποδεχθούμε την 
τροποποίηση που προτείνει η Ομάδα S&D για συντόμευση του τίτλου της πρότασης που έχουμε 
υποβάλει. Χαιρόμαστε που η εισήγηση βρίσκει στήριξη και καλλιεργεί πνεύμα ενότητας απέναντι στις 
απαράδεκτες τουρκικές απειλές.
Επιπλέον, παρακαλώ όπως τεθεί σε ονομαστική ψηφοφορία το αίτημά μας να κατατεθεί σχετικό 
ψήφισμα επί του θέματος. Θεωρούμε ότι, πέρα από τη συζήτηση, ένα ψήφισμα από τα μέλη του 
Κοινοβουλίου θα στείλει ισχυρό μήνυμα στήριξης της Κυπριακής Δημοκρατίας και ότι η Ευρωπαϊκή 
Ένωση δεν συμβιβάζεται με την καταπάτηση του διεθνούς δικαίου.
1-0074-0000
President. – So first we will vote on having the debate with the title as amended by the S&D 
Group...
Yes, you can have the floor, Mr Mavrides.
1-0075-0000
Costas Mavrides, on behalf of the S&D Group. – Madam President, although in principle we agree 
with the facts as stated by my colleague, I'd like to say that, on behalf of the S&D group, we have 
proposed an alternative which has already been accepted. However, the title would be 'the illegal 
visit of President Erdoğan to the occupied areas of the Republic of Cyprus' with one round group 
of speakers without resolution. However, I'd like to announce the following: on behalf of my 
group, I also announce our strong commitment to table an oral amendment by the rapporteur on 
the 2023 and 2024 Commission reports on Turkey that would take place during the voting on 
Wednesday, and of course we expect to have the full support of this House.
1-0076-0000
President. – OK, so let me get this clear. We're going to vote on the debate with the title as 
amended by the S&D Group which was accepted by the ECR Group. What is not clear to me is 
whether the S&D would want the debate on Wednesday or on Thursday. You say Wednesday? 
OK, Wednesday. Fine. We'll do it on Wednesday. We just add to our debates on Wednesday.
So we vote first by roll call on adding the statements.
(Parliament approved the request)
Now we vote by roll call on whether we will have a resolution.
(Parliament rejected the request)
We will see with Mr Mavrides what he meant and how we can do it.
Thank you very much. The agenda is adopted. Have a good week.

In [9]:
# Check metadata structure
pages[40].metadata

{'producer': 'Aspose.Words for Java 24.1.0',
 'creator': 'Aspose.Words',
 'creationdate': '',
 'source': 'data/CRE-10-2025-05-05_EN.pdf',
 'total_pages': 106,
 'page': 40,
 'page_label': '41'}

**Analysis**: The PDF contains 106 pages with approximately 340,000 characters in total. Each page object contains the text content and metadata (page number, source file). However, page boundaries are arbitrary and often split a speech mid-sentence, which isn't ideal for semantic retrieval.

The embedding model has an input limit of 2,048 tokens (roughly 8,000 characters, assuming about 4 characters per token for English text). I need to split documents into smaller chunks while:
1. Respecting semantic boundaries (not splitting mid-sentence)
2. Maintaining context through overlap between chunks
3. Keeping chunks small enough for the embedding model

Strategy: Load the full document as a single text, then split with overlap to preserve context across chunk boundaries.
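
The idea of overlapping chunks can be sketched with a fixed-width sliding window. This is a simplified illustration, not what the library splitter does internally: `split_with_overlap` is a hypothetical helper that cuts at exact character offsets, whereas a recursive splitter prefers to cut at separators such as paragraph breaks, newlines, and spaces.

```python
def split_with_overlap(text: str, chunk_size: int, overlap: int) -> list[tuple[int, str]]:
    """Return (start_index, chunk) pairs; neighbouring chunks share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, overlap=4)
for start, chunk in chunks:
    print(start, chunk)
```

Each chunk begins with the last four characters of the previous one, so a sentence cut at one boundary still appears whole in a neighbouring chunk.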

In [10]:
# Load full document as single text
loader = PyPDFLoader(file_path, mode='single')
pdf_text = loader.load()

print(f"Full document length: {len(pdf_text[0].page_content)} characters")

Full document length: 340698 characters


In [11]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Configure splitter with 2000 char chunks and 400 char overlap
# This gives ~20% overlap to maintain context continuity
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2_000,
    chunk_overlap=400,
    add_start_index=True,
)

all_splits = text_splitter.transform_documents(pdf_text)

print(f"Split transcript into {len(all_splits)} chunks")

Split transcript into 214 chunks


In [12]:
# Verify chunk structure
print(f"Type of all_splits: {type(all_splits)}")
print(f"Number of chunks: {len(all_splits)}")
print(f"Type of one chunk: {type(all_splits[0])}")
print(f"Total characters across all chunks: {sum([len(split.page_content) for split in all_splits])}")

Type of all_splits: <class 'list'>
Number of chunks: 214
Type of one chunk: <class 'langchain_core.documents.base.Document'>
Total characters across all chunks: 416147


In [13]:
# Inspect a sample chunk
Markdown(all_splits[80].page_content)

striving to exploit the full potential of the EU's relations with the United Kingdom.
Last March, the Council exchanged views on the state of play. The upcoming first EU‑UK summit 
will provide a unique opportunity to strengthen our relationship. We are like‑minded partners, 
allies and good neighbours. Therefore, we are very much welcoming the EU governments' 
approach, seeking to further strengthen our relations.
We work together from sanctions against Russia to support for Ukraine through security summits 
and joint diplomatic efforts. The ongoing Russian aggression against Ukraine, and our joint 
support for Ukraine, is a strong reminder of why our unity matters more than ever.
At the summit, we will seek to reaffirm our mutual commitment to the full, faithful and timely 
implementation of our agreements, including rights of our citizens. At the same time, there is still 
untapped potential and room for improvement in our relations. Ahead of the upcoming EU‑UK 
summit, the Council presidencies work closely with the Commission to identify and explore areas 
for deepening our cooperation.
A whole range of areas will be discussed with our British hosts during the summit: security and 
defence; sanitary and phytosanitary rules for agricultural products; stronger cooperation on 
energy; access to waters for EU fishermen; and opportunities for young people to live, work and 
study across the border. Together we are working on a package in key areas that will bring 
tangible benefits to citizens and businesses on both sides of the Channel. Let me stress that our 
partnership is about more than just trade flows: it's about people.
Madam President, honourable Members, Commissioner, we should not forget about some 
challenges that remain. The situation in Northern Ireland requires careful monitoring, as does the 
situation of Union citizens that live in the United Kingdom.
In the relations with the UK, we are following the principles, among which there are the

In [14]:
# Check chunk metadata
all_splits[80].metadata

{'producer': 'Aspose.Words for Java 24.1.0',
 'creator': 'Aspose.Words',
 'creationdate': '',
 'source': 'data/CRE-10-2025-05-05_EN.pdf',
 'total_pages': 106,
 'start_index': 127446}

**Result**: Successfully split into 214 chunks. The metadata now includes the starting index in the original document, which will be useful for tracing results back to source material. The total character count across chunks (416,147) is higher than the original (340,698) because of the overlap, which is expected and necessary for context preservation.
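
A quick calculation on the figures printed above shows how much text the overlap duplicates. The numbers are copied from the cell outputs; the variable names are only for this sketch.

```python
original_chars = 340_698   # full document length reported by the loader
chunked_chars = 416_147    # total characters across all chunks
num_chunks = 214

duplicated = chunked_chars - original_chars
overhead = duplicated / original_chars
avg_chunk = chunked_chars / num_chunks
avg_overlap = duplicated / (num_chunks - 1)  # one overlap per chunk boundary

print(f"Duplicated characters: {duplicated}")
print(f"Overhead from overlap: {overhead:.1%}")
print(f"Average chunk length:  {avg_chunk:.0f}")
print(f"Average overlap:       {avg_overlap:.0f}")
```

The average overlap comes out somewhat below the configured 400 characters, which is plausible given that the splitter cuts at separators rather than at exact offsets.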

Now I'll create a vector store to:
1. Store document chunks
2. Store their vector embeddings
3. Store metadata for each chunk
4. Enable efficient similarity search

Starting with an in-memory store for prototyping before moving to persistent storage.
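
Conceptually, an in-memory vector store is a list of (vector, text, metadata) entries plus a brute-force similarity search. The sketch below illustrates that idea with a hypothetical `TinyVectorStore` class and hand-written two-dimensional vectors; it is not LangChain's implementation.

```python
import math
from dataclasses import dataclass, field

@dataclass
class TinyVectorStore:
    """Hypothetical stand-in for an in-memory vector store."""
    entries: list = field(default_factory=list)  # (vector, text, metadata) tuples

    def add(self, vector: list[float], text: str, metadata: dict) -> None:
        self.entries.append((vector, text, metadata))

    def search(self, query_vector: list[float], k: int = 2) -> list[tuple[float, str, dict]]:
        """Score every entry by cosine similarity and return the top k."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.hypot(*a) * math.hypot(*b))

        scored = [(cosine(query_vector, vec), text, meta) for vec, text, meta in self.entries]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:k]

store = TinyVectorStore()
# Made-up vectors; a real store would get them from the embedding model
store.add([1.0, 0.0], "fisheries chunk", {"start_index": 0})
store.add([0.0, 1.0], "housing chunk", {"start_index": 1600})
store.add([0.9, 0.1], "agriculture chunk", {"start_index": 3200})

results = store.search([1.0, 0.1], k=2)
for score, text, meta in results:
    print(f"{score:.4f}  {text}  {meta}")
```

Brute-force search like this scales linearly with the number of chunks, which is fine for a few hundred chunks during prototyping.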

In [15]:
from langchain_core.vectorstores import InMemoryVectorStore

# Initialize vector store with our embedding model
vector_store = InMemoryVectorStore(embeddings)

In [16]:
# Add all document chunks to the vector store
document_ids = vector_store.add_documents(documents=all_splits)

print(f"Added {len(document_ids)} documents to vector store")
print(f"First 3 document IDs: {document_ids[:3]}")

Added 214 documents to vector store
First 3 document IDs: ['73c36040-28b2-4415-a2a0-fcbd589b336e', '82ac08ce-6c5b-497e-8a44-4759a1c58577', '481d3f77-2e7e-433e-920d-cf8f372d3d4d']


In [17]:
# Verify storage by retrieving documents
vector_store.get_by_ids(document_ids[:3])

[Document(id='73c36040-28b2-4415-a2a0-fcbd589b336e', metadata={'producer': 'Aspose.Words for Java 24.1.0', 'creator': 'Aspose.Words', 'creationdate': '', 'source': 'data/CRE-10-2025-05-05_EN.pdf', 'total_pages': 106, 'start_index': 0}, page_content='2024-2029\nПЪЛЕН ПРОТОКОЛ НА РАЗИСКВАНИЯТА DEBAŠU STENOGRAMMA\nACTA LITERAL DE LOS DEBATES POSĖDŽIO STENOGRAMA\nDOSLOVNÝ ZÁZNAM ZE ZASEDÁNÍ AZ ÜLÉSEK SZÓ SZERINTI JEGYZŐKÖNYVE\nFULDSTÆNDIGT FORHANDLINGSREFERAT RAPPORTI VERBATIM TAD-DIBATTITI\nAUSFÜHRLICHE SITZUNGSBERICHTE VOLLEDIG VERSLAG VAN DE VERGADERINGEN\nISTUNGI STENOGRAMM PEŁNE SPRAWOZDANIE Z OBRAD\nΠΛΗΡΗ ΠΡΑΚΤΙΚΑ ΤΩΝ ΣΥΖΗΤΗΣΕΩΝ RELATO INTEGRAL DOS DEBATES\nVERBATIM REPORT OF PROCEEDINGS STENOGRAMA DEZBATERILOR\nCOMPTE RENDU IN EXTENSO DES DÉBATS DOSLOVNÝ ZÁPIS Z ROZPRÁV\nTUARASCÁIL FOCAL AR FHOCAL NA N-IMEACHTAÍ DOBESEDNI ZAPISI RAZPRAV\nDOSLOVNO IZVJEŠĆE SANATARKAT ISTUNTOSELOSTUKSET\nRESOCONTO INTEGRALE DELLE DISCUSSIONI FULLSTÄNDIGT FÖRHANDLINGSREFERAT\nПонеделник - lunes - Pondě

In [18]:
# Access content and metadata of a stored document
one_doc = vector_store.get_by_ids(document_ids[:1])[0]

display(Markdown(one_doc.page_content))
display(one_doc.metadata)

2024-2029
ПЪЛЕН ПРОТОКОЛ НА РАЗИСКВАНИЯТА DEBAŠU STENOGRAMMA
ACTA LITERAL DE LOS DEBATES POSĖDŽIO STENOGRAMA
DOSLOVNÝ ZÁZNAM ZE ZASEDÁNÍ AZ ÜLÉSEK SZÓ SZERINTI JEGYZŐKÖNYVE
FULDSTÆNDIGT FORHANDLINGSREFERAT RAPPORTI VERBATIM TAD-DIBATTITI
AUSFÜHRLICHE SITZUNGSBERICHTE VOLLEDIG VERSLAG VAN DE VERGADERINGEN
ISTUNGI STENOGRAMM PEŁNE SPRAWOZDANIE Z OBRAD
ΠΛΗΡΗ ΠΡΑΚΤΙΚΑ ΤΩΝ ΣΥΖΗΤΗΣΕΩΝ RELATO INTEGRAL DOS DEBATES
VERBATIM REPORT OF PROCEEDINGS STENOGRAMA DEZBATERILOR
COMPTE RENDU IN EXTENSO DES DÉBATS DOSLOVNÝ ZÁPIS Z ROZPRÁV
TUARASCÁIL FOCAL AR FHOCAL NA N-IMEACHTAÍ DOBESEDNI ZAPISI RAZPRAV
DOSLOVNO IZVJEŠĆE SANATARKAT ISTUNTOSELOSTUKSET
RESOCONTO INTEGRALE DELLE DISCUSSIONI FULLSTÄNDIGT FÖRHANDLINGSREFERAT
Понеделник - lunes - Pondělí - mandag - Montag - esmaspäev - Δευτέρα - Monday
lundi - Dé Luain - ponedjeljak - lunedì - pirmdiena - Pirmadienis - hétfő - It-Tnejn
maandag - poniedziałek - Segunda-feira - luni - Pondelok - Ponedeljek - maanantai - måndag
05.05.2025
Единство в многообразието - Unida en la diversidad - Jednotná v rozmanitosti - Forenet i mangfoldighed - In Vielfalt geeint - Ühinenud mitmekesisuses
Eνωμένη στην πολυμορφία - United in diversity - Unie dans la diversité - Aontaithe san éagsúlacht - Ujedinjena u raznolikosti - Unita nella diversità
Vienoti daudzveidībā - Suvienijusi įvairovę - Egyesülve a sokféleségben - Magħquda fid-diversità - In verscheidenheid verenigd - Zjednoczona w różnorodności
Unida na diversidade - Unită în diversitate - Zjednotení v rozmanitosti - Združena v raznolikosti - Moninaisuudessaan yhtenäinen - Förenade i mångfalden
Редактирана версия - Edición revisada - Revidované vydání - Revideret udgave - Überprüfte Ausgabe - Uuendatud versioon
Αναθεωρημένη έκδοση - Revised edition - Edition révisée - Eagrán athbhreithnithe - Revidirano izdanje - Edizione rivista
Pārskatītā redakcija - Atnaujinta informacija - Lektorált változat - Edizzjoni riveduta - Herziene uitgave - Wersja poprawiona

{'producer': 'Aspose.Words for Java 24.1.0',
 'creator': 'Aspose.Words',
 'creationdate': '',
 'source': 'data/CRE-10-2025-05-05_EN.pdf',
 'total_pages': 106,
 'start_index': 0}

**Validation**: Vector store successfully indexed all chunks. Each document has a unique ID and can be retrieved with its full content and metadata intact.

## Testing Semantic Retrieval

Time to test the core retrieval functionality. I'll query the vector store with a question and retrieve the most semantically similar document chunks.

In [19]:
query = "Summarize the discussion on agricultural policy."

# Retrieve top 6 most similar chunks
retrieved_docs = vector_store.similarity_search(query, k=6)

print(f"Retrieved {len(retrieved_docs)} documents")

Retrieved 6 documents


In [20]:
# Display retrieved content
for i, doc in enumerate(retrieved_docs):
    print(f"\n--- Document {i+1} ---")
    display(Markdown(doc.page_content))


--- Document 1 ---


Koninkrijk en de Europese Unie is de Overeenkomst inzake sanitaire en fytosanitaire maatregelen 
(SPS-Overeenkomst) over voedselveiligheid en grenscontroles op levensmiddelen een van de 
belangrijkste dossiers. Mijn land Vlaanderen voert jaarlijks voor 4,8 miljard EUR aan 
landbouwproducten uit naar het Verenigd Koninkrijk. Deze handel wordt echter vertraagd door 
omslachtige procedures, extra controles en uiteenlopende regels, met alle gevolgen van dien. Onze 
boeren en bedrijven verdienen geen onzekerheid, maar duidelijke afspraken. Daarom vraag ik: 
stop met treuzelen, stop met eindeloos uitstellen. We hebben een werkbaar akkoord nodig dat 
handel mogelijk maakt, zonder extra lasten en zonder toegevingen op het gebied van kwaliteit. 
We hebben behoefte aan minder bureaucratie, meer voorspelbaarheid en meer handel. Onze 
boeren hebben recht op een vlotte export en eerlijke toegang tot de markt.
1-0101-0000
Bert-Jan Ruissen (ECR). – Voorzitter, "Beter een goede buur dan een verre vriend" – dat is de 
benadering die we moeten volgen in onze betrekkingen met het Verenigd Koninkrijk. De top van 
19 mei biedt een mooie kans om hier invulling aan te geven. Het is belangrijk om het "chagrijn" 
over de Brexit achter ons te laten. Laten we investeren in onze onderlinge relaties, de handel 
bevorderen en gezamenlijk inspanningen leveren rond defensie. Dit is absoluut geen luxe in deze 
tijden van geopolitieke spanningen.
Een belangrijk aandachtspunt is het gezamenlijk beheer van de visbestanden, met name in de 
Noordzee. Voor onze vissers is het daarbij van cruciaal belang om ook na 2026 de toegang tot de 
Britse wateren te behouden. We moeten voorkomen dat zij opnieuw quota moeten inleveren. Dit 
onderwerp verdient alle aandacht, ook tijdens de komende top. Als goede buren moeten we toch 
in staat zijn om hier goede afspraken over te maken.
1-0102-0000
Nathalie Loiseau (Renew). – Madam President, dear British friends. The EU-UK summit gives us


--- Document 2 ---


the Windsor Framework, for example on SPS. As regards the Trade and Cooperation Agreement, 
it remains the most ambitious free trade agreement the EU has concluded with any third country, 
and it responds to the UK Government's red lines, which remain in place. But this does not mean 
that we cannot more fully exploit the potential of the Trade and Cooperation Agreement and look 
at what more it has to offer. It does not mean that we cannot further develop our cooperation in 
the areas I mentioned previously. On the contrary, there is much we can still do together to 
strengthen our relationship.
The first EU‑UK summit will therefore be an important moment to do just that. I am looking 
forward to hearing your views during this debate, and of course I will be very happy to answer 
your questions. Thank you very much, Madam President.
1-0086-0000
Nina Carberry, on behalf of the PPE Group. – Madam President, Commissioner, since arriving in 
Parliament, I've been struck by an assumption often made here that Brexit is a settled matter. In 
reality, its consequences continue to shape political and economic life in Ireland, the UK and 
across Europe. Anticipation is building ahead of the upcoming EU‑UK summit on 19 May, and in 
a world where economic stability, security and trade openness matter more than ever, the EU and 
the UK have everything to gain from resetting relations.
Although the TCA lays a crucial foundation, the world has changed considerably since its signing 
four years ago. It remains a framework that can and should be built upon. A comprehensive 
veterinary agreement would be an immediate and impactful step forward, unlocking significant 
opportunities for farmers and agri‑food businesses. Progress on mutual recognition for 
professional qualifications would have major benefits. In the same way, bringing the UK closer to 
Erasmus+ would be an undeniable win for students and apprentices.


--- Document 3 ---


wzmacniać mosty, modernizować drogi dojazdowe czy budować je w takich parametrach, żeby 
mogły również służyć celom obronnym. I to nie jest militaryzowanie polityki spójności, ale danie 
możliwości tym regionom, które czują taką potrzebę, realizowania tych celów.
Szanowni państwo, panie komisarzu, bardzo serdecznie dziękuję za te dzisiejsze wystąpienia. 
Dziękuję za współpracę. Mam głębokie przekonanie, że ten dokument, który w czwartek 
przegłosujemy, również pomoże panu, bowiem znamy pana historię zawodową. Wiemy, że jest 
pan samorządowcem. Był pan szefem regionu, ministrem odpowiedzialnym również za politykę 
regionalną, więc wiemy, że rozumie pan potrzeby regionu, potrzeby społeczności lokalnych. Ale 
u nas w Polsce się mówi, że diabeł tkwi w szczegółach. Co do głównych założeń polityki 
spójności zgadzamy się również, że trzeba iść w kierunku modernizacji, ewolucji, nie rewolucji. 
Ale będziemy dyskutować na temat tego, jak to w praktyce ma wyglądać i jak Komisja Europejska 
to widzi. Mam nadzieję, że wspólnie osiągniemy sukces.
1-0214-0000
Presidente. – La discussione è chiusa.
La votazione si svolgerà giovedì.
Dichiarazioni scritte (articolo 178)
1-0214-5000
Dan-Ştefan Motreanu (PPE), în scris. – Doresc să transmit un mesaj clar Comisiei Europene: nu 
avem nevoie de o centralizare a politicii de coeziune și a politicii agricole comune, după modelul 
NextGenerationEU. Dimpotrivă, este nevoie de mai multă descentralizare către autoritățile 
regionale și locale, pentru că deciziile luate mai aproape de cetățeni sunt mai bine adaptate 
realităților din teren, mai eficiente și rapide.
Într-un climat geopolitic tot mai tensionat, reamintesc că politica de coeziune este cel mai vizibil și 
concret instrument al UE pentru majoritatea cetățenilor. Investițiile în spitale, infrastructură, 
educație și IMM-uri consolidează încrederea în proiectul european și combat eficient discursurile 
eurosceptice.
96 05-05-2025


--- Document 4 ---


every applicant, whether in the private sector or in the public sector, to bring about a project 
which really gives a return to European taxpayers.
I noticed very well the remarks on small- and medium-sized enterprises, but also micro 
businesses, and I fully agree the access to credit for these companies, these very small companies, 
who are so important when it comes to the labour market inside the EU, is still an issue we really 
have to worry about and work on, and that's what we are doing as the EIB group. We cannot do 
this directly with SMEs and micro businesses in Europe. We always go through a financial 
intermediary, mostly European commercial banks – a very important element of our business.
I listened very carefully to the remarks on agriculture, and especially young farmers receive our 
attention when it comes to the area of agriculture. For this year, we envisage to invest at least 
EUR 3 billion in this area.
In the area of housing, which was also mentioned by honourable Members, we are trying to 
leverage the financing we are going to make available to a couple of billion euro, hopefully in a 
couple of years, to EUR 300 billion annually. We have three priorities in the area of housing: one – 
innovation, supporting innovative building technologies like modular housing to make 
construction faster, cheaper and easier; second – sustainability, scaling up energy efficient 
renovation to reduce living costs when it comes to energy prices; and three – affordability, 
strengthening support for public investment tailored to the specific needs of each country and 
piloting private investments.
Now, on the issue of climate, which is also close to a bit more than half of what we are doing 
annually. This is about climate adaptation; this is about dealing with droughts, it is about dealing 
with floods – we have seen both inside many countries of the European Union, and they require


--- Document 5 ---


the summit to bring tangible benefits to the people on both sides. For us, clearly, the ambition in 
this area is an indispensable part of the renewed EU‑UK agenda.
56 05-05-2025
Honourable Members have been referring, among other areas, to the importance of fisheries, and I 
would like to reassure all of you that this is clearly a priority for us, as it was raised by Mr Millán 
Mon and Mr Ruissen. The current arrangements for reciprocal access to waters expires in the 
middle of next year, so it is essential for us to reach an early agreement that protects the rights of 
our fishers and provides them with certainty and predictability. We have also been open to an SPS 
agreement with the UK, as Madam Carberry was calling for. We do that because we are convinced 
that this would further facilitate the flow of SPS goods between Great Britain and Northern 
Ireland, beyond what has already been achieved with the Windsor Framework.
On top of this, the ideas mentioned by Mr Andrews, like linking the emissions trading system or 
strengthening cooperation in the field of energy, as was called for by Mr Kelleher and Mr Cowen – 
all these are areas we are currently looking at where I believe we can progress further. When you 
follow the statement of Commission President von der Leyen, she was very clear on this as well. So 
there is more that the EU and UK can do together to exploit our potential in this area, and we will 
be using every single remaining day to achieve this result.
Mr Millán Mon was asking about Gibraltar. I will partially respond to this: I have to underline at 
this stage that we are progressing in a positive direction, and I really would like to thank both 
Foreign Minister Alvarez and Mr Lammy for their exemplary cooperation and for understanding 
the position of all sides, because this will help us to advance on these very complex and difficult 
discussions. We will be working on this at the top level. I believe that we will be successful in that


--- Document 6 ---


británicas, también es cierto que el mercado europeo es el que recibe la gran mayoría de las 
exportaciones británicas de productos del mar.
Termino con una pregunta: señor comisario, ¿puede decirnos algo sobre en qué situación se 
encuentran las larguísimas negociaciones con el Reino Unido respecto de Gibraltar?
1-0099-0000
Idoia Mendia (S&D). – Señora presidenta, señor comisario, en un mundo donde la inestabilidad 
se está convirtiendo en norma, el trabajo común por la paz, la democracia y la cooperación es más 
necesario que nunca. La próxima cumbre representa una oportunidad histórica para acercarnos 
más aún gracias al Gobierno laborista.
Compartimos valores fundamentales y desafíos comunes. Por eso, esta cumbre debe traducirse en 
resultados concretos con una declaración política ambiciosa por una asociación estratégica 
centrada en la seguridad y el bienestar de nuestras ciudadanías, y que reafirme nuestro 
compromiso con el libre comercio, la cooperación y el respeto a los derechos humanos.
Apostamos por la implementación plena y efectiva del Acuerdo, incluidas sus cláusulas sobre 
igualdad de condiciones y no regresión en derechos laborales y estándares medioambientales, y 
por facilitar mucho todo lo relacionado con la movilidad de los trabajadores y reforzar Erasmus 
para ofrecer nuevas oportunidades a los jóvenes europeos y a los británicos para que puedan 
estudiar y trabajar.
50 05-05-2025
Es un momento para demostrar que unidos somos más fuertes.
1-0100-0000
Barbara Bonte (PfE). – Voorzitter, bij de voorbereiding van de top tussen het Verenigd 
Koninkrijk en de Europese Unie is de Overeenkomst inzake sanitaire en fytosanitaire maatregelen 
(SPS-Overeenkomst) over voedselveiligheid en grenscontroles op levensmiddelen een van de 
belangrijkste dossiers. Mijn land Vlaanderen voert jaarlijks voor 4,8 miljard EUR aan 
landbouwproducten uit naar het Verenigd Koninkrijk. Deze handel wordt echter vertraagd door

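**Retrieval Quality**: The results are mixed. Several chunks touch on agriculture (SPS rules, a veterinary agreement, EIB investment for young farmers), but most come from the EU-UK summit debate rather than a dedicated agricultural policy discussion, and Document 3 mentions the common agricultural policy only in passing. The search also matched Dutch, Polish, Romanian, and Spanish passages against an English query, which shows that embedding-based retrieval can match on meaning rather than on exact keywords.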
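
Judging retrieval quality is easier with scores next to each hit. The helper below is a hypothetical sketch that formats (text, score) pairs into a ranked list. The pairs here are invented placeholders; in the notebook they could come from the vector store's `similarity_search_with_score` method, assuming it returns `(Document, score)` tuples from which `doc.page_content` is taken.

```python
def summarize_hits(hits: list[tuple[str, float]], preview_chars: int = 60) -> list[str]:
    """Format (text, score) pairs as ranked one-line previews."""
    lines = []
    for rank, (text, score) in enumerate(hits, start=1):
        preview = " ".join(text.split())[:preview_chars]  # collapse whitespace, then truncate
        lines.append(f"{rank}. score={score:.3f} | {preview}")
    return lines

# Placeholder texts and scores, not real retrieval output
hits = [
    ("placeholder chunk about SPS rules\nfor agricultural products", 0.71),
    ("placeholder chunk about cohesion policy", 0.64),
]

lines = summarize_hits(hits)
for line in lines:
    print(line)
```

A sharp drop in score between consecutive ranks is one hint that the lower-ranked chunks are only loosely related to the query.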

Now I'll complete the RAG pipeline by feeding retrieved documents to an LLM to generate a coherent answer. Starting with a basic approach, then improving with prompt engineering.

In [21]:
from langchain.chat_models import init_chat_model

llm = init_chat_model("gemini-2.0-flash", model_provider="google_genai")

In [22]:
# Basic approach: concatenate retrieved docs with query
prompt = '\n\n'.join([doc.page_content for doc in retrieved_docs])
prompt += "\n\n" + query

response = llm.invoke(prompt)
Markdown(response.content)

Here's a summary of the agricultural policy discussion points extracted from the provided text:

*   **Trade Barriers:** Barbara Bonte highlights the delays and complexities in agricultural trade between Flanders and the UK due to sanitary and phytosanitary (SPS) measures, border controls, and differing rules. She emphasizes the need for a workable agreement that facilitates trade without creating extra burdens or compromising quality. She asks for less bureaucracy, more predictability, and easier access to the market for farmers.
*   **Veterinary Agreement:** Nina Carberry calls for a comprehensive veterinary agreement with the UK, arguing that it would unlock significant opportunities for farmers and agri-food businesses.
*   **Fisheries:** Bert-Jan Ruissen emphasizes the importance of joint management of fish stocks, particularly in the North Sea. He stresses the need for continued access to British waters for EU fishermen after 2026 and to avoid further quota reductions.
*   **SPS Agreement:** Commissioner Šefčovič mentions that the EU has been open to an SPS agreement with the UK, as called for by Madam Carberry. He believes that this would further facilitate the flow of SPS goods between Great Britain and Northern Ireland, beyond what has already been achieved with the Windsor Framework.
*   **EIB Investment:** An EIB representative notes that they are paying attention to agriculture, especially young farmers, and that they plan to invest at least EUR 3 billion in this area this year.

**Initial Results**: The LLM generates a reasonable answer, but the prompt could be more structured. Using a proper prompt template will improve response quality and consistency.

In [23]:
# Use LangChain's RAG prompt template
from langchain import hub

prompt_template = hub.pull("rlm/rag-prompt")

# Examine the template structure
example_messages = prompt_template.invoke(
    {"context": "(context goes here)", "question": "(question goes here)"}
).to_messages()

print("\n=== Prompt Template ===")
print(example_messages[0].content)




=== Prompt Template ===
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: (question goes here) 
Context: (context goes here) 
Answer:


**Prompt Engineering**: The template provides clear instructions to the LLM about using the provided context and being honest about limitations. This structured approach should yield more reliable responses.
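
What the template does can be reproduced with plain string formatting. The sketch below copies the template text printed above into a constant and fills it with a question and a list of chunks; `RAG_TEMPLATE` and `build_prompt` are names invented for this illustration, not part of LangChain.

```python
RAG_TEMPLATE = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, just say that you don't know. "
    "Use three sentences maximum and keep the answer concise.\n"
    "Question: {question} \n"
    "Context: {context} \n"
    "Answer:"
)

def build_prompt(question: str, chunks: list[str]) -> str:
    """Join the chunks with blank lines and insert them into the template."""
    context = "\n\n".join(chunks)
    return RAG_TEMPLATE.format(question=question, context=context)

# Placeholder chunks for illustration
example_prompt = build_prompt(
    "Summarize the discussion on agricultural policy.",
    ["first retrieved chunk", "second retrieved chunk"],
)
print(example_prompt)
```

Keeping the instructions, question, and context in fixed positions makes responses easier to compare across queries than ad hoc concatenation.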

In [24]:
# Prepare context from retrieved documents
docs_content = "\n\n".join(doc.page_content for doc in retrieved_docs)

Markdown(docs_content)

Koninkrijk en de Europese Unie is de Overeenkomst inzake sanitaire en fytosanitaire maatregelen 
(SPS-Overeenkomst) over voedselveiligheid en grenscontroles op levensmiddelen een van de 
belangrijkste dossiers. Mijn land Vlaanderen voert jaarlijks voor 4,8 miljard EUR aan 
landbouwproducten uit naar het Verenigd Koninkrijk. Deze handel wordt echter vertraagd door 
omslachtige procedures, extra controles en uiteenlopende regels, met alle gevolgen van dien. Onze 
boeren en bedrijven verdienen geen onzekerheid, maar duidelijke afspraken. Daarom vraag ik: 
stop met treuzelen, stop met eindeloos uitstellen. We hebben een werkbaar akkoord nodig dat 
handel mogelijk maakt, zonder extra lasten en zonder toegevingen op het gebied van kwaliteit. 
We hebben behoefte aan minder bureaucratie, meer voorspelbaarheid en meer handel. Onze 
boeren hebben recht op een vlotte export en eerlijke toegang tot de markt.
1-0101-0000
Bert-Jan Ruissen (ECR). – Voorzitter, "Beter een goede buur dan een verre vriend" – dat is de 
benadering die we moeten volgen in onze betrekkingen met het Verenigd Koninkrijk. De top van 
19 mei biedt een mooie kans om hier invulling aan te geven. Het is belangrijk om het "chagrijn" 
over de Brexit achter ons te laten. Laten we investeren in onze onderlinge relaties, de handel 
bevorderen en gezamenlijk inspanningen leveren rond defensie. Dit is absoluut geen luxe in deze 
tijden van geopolitieke spanningen.
Een belangrijk aandachtspunt is het gezamenlijk beheer van de visbestanden, met name in de 
Noordzee. Voor onze vissers is het daarbij van cruciaal belang om ook na 2026 de toegang tot de 
Britse wateren te behouden. We moeten voorkomen dat zij opnieuw quota moeten inleveren. Dit 
onderwerp verdient alle aandacht, ook tijdens de komende top. Als goede buren moeten we toch 
in staat zijn om hier goede afspraken over te maken.
1-0102-0000
Nathalie Loiseau (Renew). – Madam President, dear British friends. The EU-UK summit gives us

the Windsor Framework, for example on SPS. As regards the Trade and Cooperation Agreement, 
it remains the most ambitious free trade agreement the EU has concluded with any third country, 
and it responds to the UK Government's red lines, which remain in place. But this does not mean 
that we cannot more fully exploit the potential of the Trade and Cooperation Agreement and look 
at what more it has to offer. It does not mean that we cannot further develop our cooperation in 
the areas I mentioned previously. On the contrary, there is much we can still do together to 
strengthen our relationship.
The first EU‑UK summit will therefore be an important moment to do just that. I am looking 
forward to hearing your views during this debate, and of course I will be very happy to answer 
your questions. Thank you very much, Madam President.
1-0086-0000
Nina Carberry, on behalf of the PPE Group. – Madam President, Commissioner, since arriving in 
Parliament, I've been struck by an assumption often made here that Brexit is a settled matter. In 
reality, its consequences continue to shape political and economic life in Ireland, the UK and 
across Europe. Anticipation is building ahead of the upcoming EU‑UK summit on 19 May, and in 
a world where economic stability, security and trade openness matter more than ever, the EU and 
the UK have everything to gain from resetting relations.
Although the TCA lays a crucial foundation, the world has changed considerably since its signing 
four years ago. It remains a framework that can and should be built upon. A comprehensive 
veterinary agreement would be an immediate and impactful step forward, unlocking significant 
opportunities for farmers and agri‑food businesses. Progress on mutual recognition for 
professional qualifications would have major benefits. In the same way, bringing the UK closer to 
Erasmus+ would be an undeniable win for students and apprentices.

wzmacniać mosty, modernizować drogi dojazdowe czy budować je w takich parametrach, żeby 
mogły również służyć celom obronnym. I to nie jest militaryzowanie polityki spójności, ale danie 
możliwości tym regionom, które czują taką potrzebę, realizowania tych celów.
Szanowni państwo, panie komisarzu, bardzo serdecznie dziękuję za te dzisiejsze wystąpienia. 
Dziękuję za współpracę. Mam głębokie przekonanie, że ten dokument, który w czwartek 
przegłosujemy, również pomoże panu, bowiem znamy pana historię zawodową. Wiemy, że jest 
pan samorządowcem. Był pan szefem regionu, ministrem odpowiedzialnym również za politykę 
regionalną, więc wiemy, że rozumie pan potrzeby regionu, potrzeby społeczności lokalnych. Ale 
u nas w Polsce się mówi, że diabeł tkwi w szczegółach. Co do głównych założeń polityki 
spójności zgadzamy się również, że trzeba iść w kierunku modernizacji, ewolucji, nie rewolucji. 
Ale będziemy dyskutować na temat tego, jak to w praktyce ma wyglądać i jak Komisja Europejska 
to widzi. Mam nadzieję, że wspólnie osiągniemy sukces.
1-0214-0000
Presidente. – La discussione è chiusa.
La votazione si svolgerà giovedì.
Dichiarazioni scritte (articolo 178)
1-0214-5000
Dan-Ştefan Motreanu (PPE), în scris. – Doresc să transmit un mesaj clar Comisiei Europene: nu 
avem nevoie de o centralizare a politicii de coeziune și a politicii agricole comune, după modelul 
NextGenerationEU. Dimpotrivă, este nevoie de mai multă descentralizare către autoritățile 
regionale și locale, pentru că deciziile luate mai aproape de cetățeni sunt mai bine adaptate 
realităților din teren, mai eficiente și rapide.
Într-un climat geopolitic tot mai tensionat, reamintesc că politica de coeziune este cel mai vizibil și 
concret instrument al UE pentru majoritatea cetățenilor. Investițiile în spitale, infrastructură, 
educație și IMM-uri consolidează încrederea în proiectul european și combat eficient discursurile 
eurosceptice.
96 05-05-2025

every applicant, whether in the private sector or in the public sector, to bring about a project 
which really gives a return to European taxpayers.
I noticed very well the remarks on small- and medium-sized enterprises, but also micro 
businesses, and I fully agree the access to credit for these companies, these very small companies, 
who are so important when it comes to the labour market inside the EU, is still an issue we really 
have to worry about and work on, and that's what we are doing as the EIB group. We cannot do 
this directly with SMEs and micro businesses in Europe. We always go through a financial 
intermediary, mostly European commercial banks – a very important element of our business.
I listened very carefully to the remarks on agriculture, and especially young farmers receive our 
attention when it comes to the area of agriculture. For this year, we envisage to invest at least 
EUR 3 billion in this area.
In the area of housing, which was also mentioned by honourable Members, we are trying to 
leverage the financing we are going to make available to a couple of billion euro, hopefully in a 
couple of years, to EUR 300 billion annually. We have three priorities in the area of housing: one – 
innovation, supporting innovative building technologies like modular housing to make 
construction faster, cheaper and easier; second – sustainability, scaling up energy efficient 
renovation to reduce living costs when it comes to energy prices; and three – affordability, 
strengthening support for public investment tailored to the specific needs of each country and 
piloting private investments.
Now, on the issue of climate, which is also close to a bit more than half of what we are doing 
annually. This is about climate adaptation; this is about dealing with droughts, it is about dealing 
with floods – we have seen both inside many countries of the European Union, and they require

the summit to bring tangible benefits to the people on both sides. For us, clearly, the ambition in 
this area is an indispensable part of the renewed EU‑UK agenda.
56 05-05-2025
Honourable Members have been referring, among other areas, to the importance of fisheries, and I 
would like to reassure all of you that this is clearly a priority for us, as it was raised by Mr Millán 
Mon and Mr Ruissen. The current arrangements for reciprocal access to waters expires in the 
middle of next year, so it is essential for us to reach an early agreement that protects the rights of 
our fishers and provides them with certainty and predictability. We have also been open to an SPS 
agreement with the UK, as Madam Carberry was calling for. We do that because we are convinced 
that this would further facilitate the flow of SPS goods between Great Britain and Northern 
Ireland, beyond what has already been achieved with the Windsor Framework.
On top of this, the ideas mentioned by Mr Andrews, like linking the emissions trading system or 
strengthening cooperation in the field of energy, as was called for by Mr Kelleher and Mr Cowen – 
all these are areas we are currently looking at where I believe we can progress further. When you 
follow the statement of Commission President von der Leyen, she was very clear on this as well. So 
there is more that the EU and UK can do together to exploit our potential in this area, and we will 
be using every single remaining day to achieve this result.
Mr Millán Mon was asking about Gibraltar. I will partially respond to this: I have to underline at 
this stage that we are progressing in a positive direction, and I really would like to thank both 
Foreign Minister Alvarez and Mr Lammy for their exemplary cooperation and for understanding 
the position of all sides, because this will help us to advance on these very complex and difficult 
discussions. We will be working on this at the top level. I believe that we will be successful in that

británicas, también es cierto que el mercado europeo es el que recibe la gran mayoría de las 
exportaciones británicas de productos del mar.
Termino con una pregunta: señor comisario, ¿puede decirnos algo sobre en qué situación se 
encuentran las larguísimas negociaciones con el Reino Unido respecto de Gibraltar?
1-0099-0000
Idoia Mendia (S&D). – Señora presidenta, señor comisario, en un mundo donde la inestabilidad 
se está convirtiendo en norma, el trabajo común por la paz, la democracia y la cooperación es más 
necesario que nunca. La próxima cumbre representa una oportunidad histórica para acercarnos 
más aún gracias al Gobierno laborista.
Compartimos valores fundamentales y desafíos comunes. Por eso, esta cumbre debe traducirse en 
resultados concretos con una declaración política ambiciosa por una asociación estratégica 
centrada en la seguridad y el bienestar de nuestras ciudadanías, y que reafirme nuestro 
compromiso con el libre comercio, la cooperación y el respeto a los derechos humanos.
Apostamos por la implementación plena y efectiva del Acuerdo, incluidas sus cláusulas sobre 
igualdad de condiciones y no regresión en derechos laborales y estándares medioambientales, y 
por facilitar mucho todo lo relacionado con la movilidad de los trabajadores y reforzar Erasmus 
para ofrecer nuevas oportunidades a los jóvenes europeos y a los británicos para que puedan 
estudiar y trabajar.
50 05-05-2025
Es un momento para demostrar que unidos somos más fuertes.
1-0100-0000
Barbara Bonte (PfE). – Voorzitter, bij de voorbereiding van de top tussen het Verenigd 
Koninkrijk en de Europese Unie is de Overeenkomst inzake sanitaire en fytosanitaire maatregelen 
(SPS-Overeenkomst) over voedselveiligheid en grenscontroles op levensmiddelen een van de 
belangrijkste dossiers. Mijn land Vlaanderen voert jaarlijks voor 4,8 miljard EUR aan 
landbouwproducten uit naar het Verenigd Koninkrijk. Deze handel wordt echter vertraagd door

In [25]:
# Generate answer using prompt template
prompt = prompt_template.invoke(
    {"context": docs_content, "question": query}
)

answer = llm.invoke(prompt)
Markdown(answer.content)

The discussion on agricultural policy focuses on the EU-UK relationship and the importance of the Sanitary and Phytosanitary (SPS) Agreement for food safety and border controls. Speakers highlighted the need for clear agreements to facilitate trade without extra burdens, reduce bureaucracy, and ensure smooth exports and fair market access for farmers. They also mentioned the significance of fisheries and the need to protect the rights of fishers.

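**Result**: The structured prompt produces a shorter, more focused response, as the template's three-sentence limit requires. Each point in the answer (the SPS agreement, trade friction for farmers, fisheries) can be traced to the retrieved context shown above. The retrieved chunks also include an unrelated cohesion-policy passage, so retrieval precision still has room to improve.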

The in-memory vector store is fine for prototyping, but it has three limitations that matter here:

1. **Data loss**: All embeddings are lost when the kernel restarts
2. **Cost**: Re-embedding the documents on every run repeats the same embedding API calls
3. **Scalability**: The whole index must fit in memory and cannot be shared across notebook sessions, which becomes a problem as more transcripts are added

Solution: Implement persistent storage using Chroma, a vector database that can persist to a local directory.

In [26]:
from langchain_chroma import Chroma

# Initialize persistent vector store
vector_store = Chroma(
    collection_name="ep_plenary",
    embedding_function=embeddings,
    persist_directory="./chroma_ep_follower",
)


To scale this system, I'll create modular functions for:
1. Adding new documents to the vector store
2. Querying the system with questions

This enables easy addition of more transcripts without code duplication.
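
One thing worth guarding against when adding documents: a persistent store keeps whatever it is given, so re-running an ingestion cell on the same PDF stores its chunks a second time under new IDs. A minimal sketch of one possible guard, using only the standard library, is a small manifest file that records a content hash per ingested file. The manifest name `ingested.json` and the helpers `needs_embedding` and `mark_embedded` are assumptions for illustration, not part of Chroma or LangChain.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def _load_manifest(manifest_path):
    """Read the manifest as a dict, or return an empty dict if it does not exist."""
    manifest_file = Path(manifest_path)
    return json.loads(manifest_file.read_text()) if manifest_file.exists() else {}


def needs_embedding(path, manifest_path="ingested.json"):
    """True if this exact file content has not been recorded as ingested."""
    return _load_manifest(manifest_path).get(str(path)) != file_sha256(path)


def mark_embedded(path, manifest_path="ingested.json"):
    """Record the file's digest after a successful ingestion."""
    manifest = _load_manifest(manifest_path)
    manifest[str(path)] = file_sha256(path)
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))


# Demo on a throwaway file; no real PDF or vector store is involved.
with tempfile.TemporaryDirectory() as tmp:
    pdf = Path(tmp) / "sample.pdf"
    pdf.write_bytes(b"dummy transcript bytes")
    manifest = Path(tmp) / "ingested.json"
    first_check = needs_embedding(pdf, manifest)   # nothing recorded yet
    mark_embedded(pdf, manifest)
    second_check = needs_embedding(pdf, manifest)  # digest now matches

print(first_check, second_check)  # True False
```

In practice the ingestion call would sit between the two steps: check `needs_embedding`, add the chunks to the vector store, then call `mark_embedded`.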

In [27]:
def embed_and_store(file_path, vector_store):
    """
    Load a PDF file, split it into chunks, and store in vector database.

    Args:
        file_path: Path to PDF file
        vector_store: Initialized vector store instance

    Returns:
        List of document IDs added to the store
    """
    # Load PDF as single document
    loader = PyPDFLoader(file_path, mode='single')
    pdf_text = loader.load()

    # Split into chunks with overlap
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=2_000,
        chunk_overlap=400,
        add_start_index=True,
    )
    all_splits = text_splitter.split_documents(pdf_text)

    # Add to vector store
    document_ids = vector_store.add_documents(documents=all_splits)
    print(f"Added {len(document_ids)} documents to the vector store.")

    return document_ids
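
To get a feel for what `chunk_size=2_000` and `chunk_overlap=400` imply, the sketch below computes the start offsets of idealized fixed-size windows. This is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line boundaries, so real chunks are often shorter and their offsets will differ. `window_starts` is a hypothetical helper for illustration only.

```python
def window_starts(text_length, chunk_size=2_000, chunk_overlap=400):
    """Start offsets of fixed-size windows over a text of `text_length` characters.

    Each window begins `chunk_size - chunk_overlap` characters after the previous
    one, so consecutive windows share `chunk_overlap` characters.
    """
    stride = chunk_size - chunk_overlap
    starts = []
    pos = 0
    while True:
        starts.append(pos)
        if pos + chunk_size >= text_length:
            break  # this window already reaches the end of the text
        pos += stride
    return starts


starts = window_starts(5_000)
print(starts)  # [0, 1600, 3200]
```

With these settings each additional chunk covers about 1,600 new characters, which gives a rough lower bound on the number of chunks per transcript.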

In [28]:
# Test the function
file_path = "data/CRE-10-2025-05-05_EN.pdf"
document_ids = embed_and_store(file_path, vector_store)

# Verify storage
vector_store.get_by_ids(document_ids[:3])


Added 214 documents to the vector store.


[Document(id='1408cc34-bf1e-4350-8ee3-a656c94eb253', metadata={'creationdate': '', 'creator': 'Aspose.Words', 'producer': 'Aspose.Words for Java 24.1.0', 'source': 'data/CRE-10-2025-05-05_EN.pdf', 'start_index': 0, 'total_pages': 106}, page_content='2024-2029\nПЪЛЕН ПРОТОКОЛ НА РАЗИСКВАНИЯТА DEBAŠU STENOGRAMMA\nACTA LITERAL DE LOS DEBATES POSĖDŽIO STENOGRAMA\nDOSLOVNÝ ZÁZNAM ZE ZASEDÁNÍ AZ ÜLÉSEK SZÓ SZERINTI JEGYZŐKÖNYVE\nFULDSTÆNDIGT FORHANDLINGSREFERAT RAPPORTI VERBATIM TAD-DIBATTITI\nAUSFÜHRLICHE SITZUNGSBERICHTE VOLLEDIG VERSLAG VAN DE VERGADERINGEN\nISTUNGI STENOGRAMM PEŁNE SPRAWOZDANIE Z OBRAD\nΠΛΗΡΗ ΠΡΑΚΤΙΚΑ ΤΩΝ ΣΥΖΗΤΗΣΕΩΝ RELATO INTEGRAL DOS DEBATES\nVERBATIM REPORT OF PROCEEDINGS STENOGRAMA DEZBATERILOR\nCOMPTE RENDU IN EXTENSO DES DÉBATS DOSLOVNÝ ZÁPIS Z ROZPRÁV\nTUARASCÁIL FOCAL AR FHOCAL NA N-IMEACHTAÍ DOBESEDNI ZAPISI RAZPRAV\nDOSLOVNO IZVJEŠĆE SANATARKAT ISTUNTOSELOSTUKSET\nRESOCONTO INTEGRALE DELLE DISCUSSIONI FULLSTÄNDIGT FÖRHANDLINGSREFERAT\nПонеделник - lunes - Pondě

In [29]:
def answer(query, vector_store, llm, prompt_template=None):
    """
    Answer a query using RAG with the vector store and LLM.

    Args:
        query: Question to answer
        vector_store: Vector store containing document embeddings
        llm: Language model for generation
        prompt_template: Optional custom prompt template

    Returns:
        Generated answer as string
    """
    # Retrieve similar documents
    retrieved_docs = vector_store.similarity_search(query, k=6)

    # Prepare context
    docs_content = "\n\n".join(doc.page_content for doc in retrieved_docs)

    # Use default template if none provided
    if prompt_template is None:
        prompt_template = hub.pull("rlm/rag-prompt")

    # Generate answer
    prompt = prompt_template.invoke(
        {"context": docs_content, "question": query}
    )

    # Named `response` so it does not shadow the enclosing function `answer`
    response = llm.invoke(prompt)
    return response.content

In [30]:
# Test the answer function
query = "What is being said about international trade?"
Markdown(answer(query, vector_store, llm, prompt_template))


The context mentions the importance of reaffirming commitment to free trade and the need for greater streamlining of trade exchanges. It also notes that the European market receives the vast majority of British seafood exports. Additionally, trade of agricultural products between Flanders and the United Kingdom is being slowed down by border controls on food products.

**Success**: The refactored functions work correctly. This modular design makes it easy to add more transcripts and query the expanding corpus.

Current limitation: The system searches across all documents in the vector store. For a multi-year corpus, it would be valuable to filter by date (e.g., "What was discussed about trade in May 2025?").

Solution: Add temporal metadata to each chunk and implement filtered retrieval.
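
Since the transcript file names already contain the session date (for example `CRE-10-2025-05-05_EN.pdf`), the date could be derived rather than typed by hand. The following is a minimal sketch under that naming assumption; `session_date_from_filename` and `temporal_metadata` are hypothetical helpers, and the metadata keys mirror the ones used in this notebook.

```python
import re
from datetime import date
from pathlib import Path

# Assumption: file names embed the session date as YYYY-MM-DD,
# as in the two downloaded transcripts.
_DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def session_date_from_filename(file_path):
    """Return 'YYYY-MM-DD' found in the file name, or None if absent or invalid."""
    match = _DATE_PATTERN.search(Path(file_path).name)
    if match is None:
        return None
    try:
        date.fromisoformat(match.group(0))  # rejects e.g. month 13
    except ValueError:
        return None
    return match.group(0)


def temporal_metadata(session_date):
    """Build the three metadata fields from a 'YYYY-MM-DD' string."""
    return {
        "session_date": session_date,
        "year": session_date[:4],    # 'YYYY'
        "month": session_date[:7],   # 'YYYY-MM'
    }


detected = session_date_from_filename("data/CRE-10-2025-05-05_EN.pdf")
print(detected, temporal_metadata(detected))
```

Storing `month` as `YYYY-MM` rather than `MM` keeps a month filter unambiguous once the corpus spans several years.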

In [31]:
def embed_and_store_fancy(file_path, vector_store, session_date):
    """
    Load a PDF file, split it, add temporal metadata, and store in vector database.

    Args:
        file_path: Path to PDF file
        vector_store: Initialized vector store instance
        session_date: Session date in 'YYYY-MM-DD' format

    Returns:
        List of document IDs added to the store
    """
    # Load PDF
    loader = PyPDFLoader(file_path, mode="single")
    pdf_text = loader.load()

    # Add temporal metadata to document
    for doc in pdf_text:
        doc.metadata["session_date"] = session_date
        doc.metadata["year"] = session_date[:4]
        doc.metadata["month"] = session_date[:7]

    # Split into chunks (metadata is preserved)
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=2_000,
        chunk_overlap=400,
        add_start_index=True,
    )
    all_splits = text_splitter.split_documents(pdf_text)

    # Store in vector database
    document_ids = vector_store.add_documents(documents=all_splits)
    print(f"Added {len(document_ids)} documents to the vector store.")

    return document_ids

In [32]:
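# Add document with metadata
# Note: this file was already added in In [28] without temporal metadata, so the
# store now holds two copies of each chunk; only this copy matches date filters.
file_path = "data/CRE-10-2025-05-05_EN.pdf"
document_ids = embed_and_store_fancy(file_path, vector_store, "2025-05-05")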

# Verify metadata was added
vector_store.get_by_ids(document_ids[:3])

Added 214 documents to the vector store.


[Document(id='8e16171d-b393-42c2-8568-ae1df96e9ae5', metadata={'creationdate': '', 'creator': 'Aspose.Words', 'month': '2025-05', 'producer': 'Aspose.Words for Java 24.1.0', 'session_date': '2025-05-05', 'source': 'data/CRE-10-2025-05-05_EN.pdf', 'start_index': 0, 'total_pages': 106, 'year': '2025'}, page_content='2024-2029\nПЪЛЕН ПРОТОКОЛ НА РАЗИСКВАНИЯТА DEBAŠU STENOGRAMMA\nACTA LITERAL DE LOS DEBATES POSĖDŽIO STENOGRAMA\nDOSLOVNÝ ZÁZNAM ZE ZASEDÁNÍ AZ ÜLÉSEK SZÓ SZERINTI JEGYZŐKÖNYVE\nFULDSTÆNDIGT FORHANDLINGSREFERAT RAPPORTI VERBATIM TAD-DIBATTITI\nAUSFÜHRLICHE SITZUNGSBERICHTE VOLLEDIG VERSLAG VAN DE VERGADERINGEN\nISTUNGI STENOGRAMM PEŁNE SPRAWOZDANIE Z OBRAD\nΠΛΗΡΗ ΠΡΑΚΤΙΚΑ ΤΩΝ ΣΥΖΗΤΗΣΕΩΝ RELATO INTEGRAL DOS DEBATES\nVERBATIM REPORT OF PROCEEDINGS STENOGRAMA DEZBATERILOR\nCOMPTE RENDU IN EXTENSO DES DÉBATS DOSLOVNÝ ZÁPIS Z ROZPRÁV\nTUARASCÁIL FOCAL AR FHOCAL NA N-IMEACHTAÍ DOBESEDNI ZAPISI RAZPRAV\nDOSLOVNO IZVJEŠĆE SANATARKAT ISTUNTOSELOSTUKSET\nRESOCONTO INTEGRALE DELLE DISCUS

In [33]:
def answer_fancy(query, vector_store, llm, prompt_template=None,
                 session_date=None, session_year=None, session_month=None):
    """
    Answer a query with optional temporal filtering.

    Args:
        query: Question to answer
        vector_store: Vector store containing document embeddings
        llm: Language model for generation
        prompt_template: Optional custom prompt template
        session_date: Filter by specific date (YYYY-MM-DD)
        session_year: Filter by year (YYYY)
        session_month: Filter by month (YYYY-MM)

    Returns:
        Generated answer as string
    """
    # Build metadata filter; the most specific argument wins (date > month > year)
    if session_date:
        filter_dict = {"session_date": session_date}
    elif session_month:
        filter_dict = {"month": session_month}
    elif session_year:
        filter_dict = {"year": session_year}
    else:
        # None means "no filter"; an empty dict may be rejected by some Chroma versions
        filter_dict = None

    # Retrieve with filter
    retrieved_docs = vector_store.similarity_search(query, k=6, filter=filter_dict)

    # Handle no results
    if not retrieved_docs:
        return "No documents found for the given filters."

    # Prepare context
    docs_content = "\n\n".join(doc.page_content for doc in retrieved_docs)

    # Use default template if none provided
    if prompt_template is None:
        prompt_template = hub.pull("rlm/rag-prompt")

    # Generate answer
    prompt = prompt_template.invoke(
        {"context": docs_content, "question": query}
    )

    # Named `response` to keep it distinct from the `answer` function defined earlier
    response = llm.invoke(prompt)
    return response.content
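
The filter-selection logic is easy to test in isolation if it is pulled out into its own function. The sketch below is one way to do that; `build_filter` is a hypothetical helper, and returning `None` for "no filter" assumes the vector store treats `filter=None` as unfiltered search.

```python
def build_filter(session_date=None, session_month=None, session_year=None):
    """Return a single-key metadata filter, or None when no filter is requested.

    The most specific argument wins: date, then month, then year.
    """
    if session_date:
        return {"session_date": session_date}
    if session_month:
        return {"month": session_month}
    if session_year:
        return {"year": session_year}
    return None  # assumption: None is interpreted as "search everything"


# Precedence check: a date overrides a month given at the same time.
print(build_filter(session_date="2025-05-05", session_month="2025-05"))
print(build_filter(session_month="2025-05"))
print(build_filter())
```

Making the precedence explicit matters because passing both a date and a month silently ignores the month.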

In [34]:
# Add second session for comparison
file_path = "data/CRE-10-2025-05-06_EN.pdf"
document_ids = embed_and_store_fancy(file_path, vector_store, "2025-05-06")

Added 533 documents to the vector store.


In [35]:
# Query specific date
query = "Summarize the discussion on agricultural policy."

print("=== Session on 2025-05-05 ===")
Markdown(answer_fancy(query, vector_store, llm, prompt_template, session_date="2025-05-05"))

=== Session on 2025-05-05 ===


The discussion on agricultural policy focuses on the importance of the EU-UK relationship, particularly regarding food safety and trade. Speakers emphasize the need for clear agreements to facilitate trade, reduce bureaucracy, and ensure fair market access for farmers. Some suggest building upon existing trade agreements and exploring further cooperation in areas like veterinary agreements and recognition of professional qualifications.

In [36]:
print("=== Session on 2025-05-06 ===")
Markdown(answer_fancy(query, vector_store, llm, prompt_template, session_date="2025-05-06"))

=== Session on 2025-05-06 ===


The discussion emphasizes maintaining a robust budget for the Common Agricultural Policy (CAP) and ensuring its structure with two pillars. The first pillar, direct payments, is crucial for stabilizing farmers' incomes and guaranteeing the European Union's strategic autonomy in food. There is also a call to reinforce the POSEI Agriculture for outermost regions.

In [37]:
# Query by month (searches across both sessions)
print("=== All sessions in May 2025 ===")
Markdown(answer_fancy(query, vector_store, llm, prompt_template, session_month="2025-05"))

=== All sessions in May 2025 ===


The discussion on agricultural policy emphasized the importance of maintaining a robust budget for the Common Agricultural Policy (CAP) and ensuring its structure with two pillars. The first pillar, direct payments, is crucial for stabilizing farmers' incomes and guaranteeing the European Union's strategic autonomy in food. Participants also highlighted the need to adapt to inflation and avoid a single fund, advocating for a strong agricultural budget.

This notebook implements a working RAG prototype with:

1. **Semantic retrieval**: Embedding-based search that matches on meaning rather than exact keywords
2. **Persistent storage**: A Chroma database on disk that survives kernel restarts
3. **Modular design**: Reusable functions for adding documents and querying
4. **Temporal filtering**: Metadata-based filtering for date-specific queries
5. **Cost optimization**: Embeddings are computed once per ingestion and reused across queries and sessions

**Next steps** for scaling this system:
- Batch-process all historical sessions
- Implement hybrid search (semantic + keyword)
- Add citation tracking to link answers back to source pages
- Build a simple web interface for non-technical users
- Experiment with different chunk sizes and overlap ratios
- Add document summarization for very long contexts
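
As a starting point for the citation-tracking item, the chunk metadata already carries `source`, `start_index`, and (for chunks added with temporal metadata) `session_date`. Because the PDFs are loaded with `mode="single"`, per-page numbers are not available in the chunk metadata, so this sketch cites character offsets instead. `format_citations` is a hypothetical helper, and `SimpleNamespace` objects stand in for retrieved documents.

```python
from types import SimpleNamespace


def format_citations(docs):
    """Return one citation line per distinct (source, start_index) pair, in order."""
    seen = set()
    lines = []
    for doc in docs:
        meta = doc.metadata
        key = (meta.get("source"), meta.get("start_index"))
        if key in seen:
            continue  # skip duplicate chunks of the same passage
        seen.add(key)
        session = meta.get("session_date", "unknown date")
        lines.append(f"[{len(lines) + 1}] {key[0]} ({session}), offset {key[1]}")
    return lines


# Stand-ins for retrieved documents; only `.metadata` is used here.
fake_docs = [
    SimpleNamespace(metadata={"source": "data/CRE-10-2025-05-05_EN.pdf",
                              "start_index": 0, "session_date": "2025-05-05"}),
    SimpleNamespace(metadata={"source": "data/CRE-10-2025-05-05_EN.pdf",
                              "start_index": 0, "session_date": "2025-05-05"}),
    SimpleNamespace(metadata={"source": "data/CRE-10-2025-05-06_EN.pdf",
                              "start_index": 1600, "session_date": "2025-05-06"}),
]

citations = format_citations(fake_docs)
print("\n".join(citations))
```

Appending these lines to the generated answer would let a reader locate each supporting passage in the source transcript.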