# Use Case Demonstration of EVE
In this notebook, we explore practical applications of EVE (Earth Virtual Explorer), a large language model specialized in Earth Observation (EO). EVE is designed to understand, analyze, and generate text related to EO data and topics, making it a valuable tool for researchers, analysts, and decision-makers in the field.

## Overview
We will demonstrate two key use cases of EVE:

- **Summarization**: In this task, EVE is given a document related to Earth Observation and asked to generate a concise and informative summary. This is useful for quickly understanding lengthy reports, scientific papers, or satellite data documentation.

- **Question Answering (Q&A)**: In this use case, we enhance EVE’s performance by integrating it with a retrieval system, using a technique known as Retrieval-Augmented Generation (RAG). Here, the model first retrieves relevant context from a knowledge base before answering questions, leading to more accurate and grounded responses.

These examples highlight EVE’s potential to streamline information processing and support decision-making in Earth Observation workflows.

Before diving into the use cases, let’s briefly review the core ideas behind Natural Language Processing (NLP) and Large Language Models (LLMs).

# Natural Language Processing
NLP is a field at the intersection of linguistics and machine learning that focuses on understanding human language. The aim of NLP tasks is not only to understand individual words, but also the context in which those words appear. Some classic NLP tasks are the following:
- **Classifying whole sentences**: Getting the sentiment of a review, detecting if an email is spam, determining if a sentence is grammatically correct or whether two sentences are logically related or not
- **Classifying each word in a sentence**: Identifying the grammatical components of a sentence (noun, verb, adjective), or the named entities (person, location, organization)
- **Generating text content**: Completing a prompt with auto-generated text, filling in the blanks in a text with masked words
- ...

## Large Language Models (LLM)
In recent years, the field of NLP has been revolutionized by Large Language Models (LLMs).
LLMs are a powerful subset of NLP models characterized by their massive size, extensive training data, and ability to perform a wide range of language tasks with minimal task-specific training.

The objective of an LLM, and more generally of any language model, is to compute the probability of a sequence of words/tokens:

1. **Chain rule**: probability of a sequence is the product of conditional probabilities
$$P\left(w_1^m\right)=\prod_{n=1}^m P\left(w_n \mid w_1^{n-1}\right)$$
2. **Markov assumption**: N-gram model approximates it by conditioning only on the last N−1 words:

$$P\left(w_n \mid w_1^{n-1}\right) \approx P\left(w_n \mid w_{n-N+1}^{n-1}\right)$$

So in general for **N-grams**
$$P\left(w_1^m\right) \approx P\left(w_1^N\right) \cdot \prod_{n=N+1}^m P\left(w_n \mid w_{n-N+1}^{n-1}\right)$$

**Example** for character-level trigrams (N=3), where $\_$ denotes the space character
$$P(\text { country roads }) \approx P(\text { cou }) \cdot P(n \mid o u) \cdot P(t \mid u n) \cdot P(r \mid n t) \cdot P(y \mid t r) \cdot P(\_ \mid r y) \cdot \ldots$$
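
The trigram factorization above can be made concrete with a small count-based sketch. The toy corpus, the function names, and the lack of smoothing are assumptions made for illustration only; real language models estimate these probabilities with neural networks rather than raw counts.

```python
from collections import Counter

def train_char_ngrams(text, n=3):
    """Count character n-grams and the (n-1)-character contexts that precede them."""
    ngrams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    contexts = Counter(text[i:i + n - 1] for i in range(len(text) - n + 1))
    return ngrams, contexts

def sequence_probability(seq, ngrams, contexts, n=3):
    """Approximate P(seq) as P(first n chars) times the product of P(char | previous n-1 chars)."""
    total = sum(ngrams.values())
    prob = ngrams[seq[:n]] / total  # P(w_1^N), estimated by relative frequency
    for k in range(n, len(seq)):
        context = seq[k - n + 1:k]      # the previous n-1 characters
        gram = seq[k - n + 1:k + 1]     # context plus the current character
        if contexts[context] == 0:
            return 0.0                  # unseen context: no smoothing in this sketch
        prob *= ngrams[gram] / contexts[context]
    return prob

# Hypothetical toy corpus, far too small for meaningful estimates
corpus = "country roads take me home to the place i belong"
ngrams, contexts = train_char_ngrams(corpus, n=3)

p_seen = sequence_probability("country roads", ngrams, contexts)
p_unseen = sequence_probability("xyz", ngrams, contexts)
print(p_seen, p_unseen)
```

Any sequence containing a trigram that never occurs in the corpus gets probability zero, which is why practical N-gram models add smoothing.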

Now let's load an LLM from Hugging Face.

In [None]:
# Install the required libraries (already installed)
#!pip3 install -q datasets
#!pip3 install -q langchain_community
#!pip3 install -q pypdf
#!pip3 install -q sentence-transformers
#!pip3 install -q faiss-cpu
#!pip3 install -q langchain_huggingface
#!pip3 install -q langchain_runpod
#!pip3 install -q qdrant_client
#!pip3 install -q langchain_aws
#!pip3 install evaluate
#!pip3 install rouge_score
#!pip3 install bert_score
#!pip3 install numpy==1.26.4

In [None]:
# Setup env
from dotenv import dotenv_values
from IPython.display import display, Markdown, Latex

config = dotenv_values('/home/sagemaker-user/env.sh')

In [None]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0')

Before feeding a question to our LLM, we need to tokenize it. Tokenization is the process of breaking a text into pieces called tokens. Subword tokenization is one of the most popular and efficient methods: a word is split into subwords, and these subwords are the tokens. For example, the word “football” might be split into “foot” and “ball”.
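
To illustrate the idea, here is a minimal sketch of subword splitting using greedy longest-match against a tiny vocabulary. The vocabulary and the function name are hypothetical, and this is a simplification: the tokenizer loaded in the next cell learns its vocabulary from data and does not necessarily work this way.

```python
def greedy_subword_tokenize(word, vocab):
    """Split a word into the longest vocabulary pieces, scanning left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece matches: fall back to a single character
            tokens.append(word[start])
            start += 1
        else:
            tokens.append(word[start:end])
            start = end
    return tokens

# Hypothetical toy vocabulary
toy_vocab = {"foot", "ball", "space", "craft", "s"}

football_tokens = greedy_subword_tokenize("football", toy_vocab)
spacecraft_tokens = greedy_subword_tokenize("spacecrafts", toy_vocab)
print(football_tokens)
print(spacecraft_tokens)
```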

Let's load a tokenizer and see how it works.


In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0')
message = 'The European Space Agency (ESA) is a 23-member international organization devoted to space exploration'

# Tokenize the message and also return the character offsets of each token
encoded = tokenizer(
    message,
    return_offsets_mapping=True,
    return_tensors=None,
    add_special_tokens=True
)

tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"])
# The '▁' prefix marks a token that starts a new word (i.e. it is preceded by a space)
print(tokens)

### Prompting
Prompting is the process of providing input to a language model in the form of a carefully designed instruction or question to guide its output.

First, we define the prompt, which is made of three parts:
- **System message**: this section provides guidelines for the model on how to behave and interpret the conversation.
- **Human message**: this section contains the input or message from the user.
- **AI message**: this section represents a response generated by the model.

The code below shows the structure of the prompt and the template used to create it. In particular, note the **special tokens** used in the prompt:
- **<|system|> | <|user|> | <|assistant|>**: special tokens that tell the model who each message belongs to.
- **<|end|>**: a special token that marks the end of a message.
- **{message}**: a placeholder that will be replaced with the actual message (or, in later sections, with the context and question).
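
The same layout can also be assembled programmatically. The helper below is a hypothetical sketch (the name `build_prompt` is not part of any library) that mirrors the markers listed above. In practice, the exact markers depend on the chat template the model was trained with, so treat these strings as an assumption.

```python
def build_prompt(system, user):
    """Assemble a chat prompt from a system message and a user message."""
    return (
        f"<|system|>\n{system}<|end|>\n"
        f"<|user|>\n{user}\n<|end|>\n"
        "<|assistant|>\n"  # left open so the model writes the answer here
    )

example_prompt = build_prompt(
    system="You are a helpful chatbot.",
    user="What is ESA?",
)
print(example_prompt)
```

Keeping the template in a function avoids overwriting it when the same layout is reused for several questions.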

In [None]:
prompt = """<|system|>
You are a helpful chatbot, be kind and answer to the user questions.<|end|>
<|user|>
{message}
<|end|>
<|assistant|>
"""

In [None]:
message = 'What is ESA?'
prompt = prompt.format(message=message)

In [None]:
from transformers import pipeline
# Assemble the text-generation pipeline from the model and tokenizer loaded above
llama_pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device_map="auto"  # uses GPU if available
)

Device set to use cuda:0


In [None]:
output = llama_pipe(prompt, max_new_tokens=1000, do_sample=True, temperature=0.7)

In [None]:
# Here we can see the whole conversation with the generated answer by the model
print(output[0]['generated_text'])

## Summarization

In this section, we ask the model to summarize a document, using a passage from a paper in our dataset as input.

In [None]:
from datasets import  load_dataset
import random
# Load documents
docs = load_dataset('eve-esa/summarization_ds_10k_sample_split')['dev']

# Select a random doc from the dataset
doc = docs.select(random.sample(range(len(docs)), 1))[0]

In [None]:
# Passage that we are going to summarize

passage = """### Minimum Volume Ellipsoid and SDP Formulation for Approximating \\(W^{-1}\\)

As explained in the previous section, we need to find a matrix \\(Q\\) such that \\(\\kappa(QW)\\) is close to one. Let \\(A=Q^{T}Q\\succ 0\\). We have \\(W^{T}AW=(QW)^{T}(QW)\\) so that \\(\\sigma_{i}(W^{T}AW)=\\sigma_{i}^{2}(QW)\\) for all \\(i\\), hence it is equivalent to find a matrix \\(A\\) such that \\(\\kappa(W^{T}AW)\\) is close to one since it will imply that \\(\\kappa(QW)\\) is close to one, while we can compute a factorization of \\(A=Q^{T}Q\\) (e.g., a Cholesky decomposition). Ideally, we would like that \\(A=(WW^{T})^{-1}=W^{-T}W^{-1}\\), that is, \\(Q=RW^{-1}\\) for some orthonormal transformation \\(R\\).

The central step of our algorithm is to compute the minimum volume ellipsoid centered at the origin containing all columns of \\(\\tilde{M}\\). An ellipsoid \\(\\mathcal{E}\\) centered at the origin in \\(\\mathbb{R}^{r}\\) is described via a positive definite matrix \\(A\\in\\mathbb{S}_{++}^{r}\\) :

\\[\\mathcal{E}=\\{\\ x\\in\\mathbb{R}^{r}\\ |\\ x^{T}Ax\\leq 1\\ \\}.\\]

The axes of the ellipsoid are given by the eigenvectors of matrix \\(A\\), while their length is equal to the inverse of the square root of the corresponding eigenvalue. The volume of \\(\\mathcal{E}\\) is equal to \\(\\det(A)^{-1/2}\\) times the volume of the unit ball in dimension \\(r\\). Therefore, given a matrix \\(\\tilde{M}\\in\\mathbb{R}^{r\\times n}\\) of rank \\(r\\), we can formulate the minimum volume ellipsoid centered at the origin and containing the columns \\(\\tilde{m}_{i}\\)\\(1\\leq i\\leq n\\) of matrix \\(\\tilde{M}\\) as follows

\\[\\min_{A\\in\\mathbb{S}_{++}^{r}}\\ \\log\\det(A)^{-1}\\quad\\text{such that}\\quad\\ \\tilde{m_{i}}^{T}A\\tilde{m}_{i}\\leq 1\\quad\\text{ for }i=1,2,\\ldots,n. \\tag{1}\\]

This problem is SDP representable [9, p.222] (see also Remark 10). Note that if \\(\\tilde{M}\\) is not full rank, that is, the convex hull of the columns of \\(\\tilde{M}\\) and the origin is not full dimensional, then the objective function value is unbounded below. Otherwise, the optimal solution of the problem exists and is unique [19]."""

In [None]:
from langchain_aws import BedrockLLM

# Load the model using APIs
llm = BedrockLLM(model_id='arn:aws:bedrock:us-west-2:637423382292:imported-model/81wcbcqx5vro', region_name='us-west-2', provider='meta')

In [None]:
# Prompt message

message = f"""
<|system|>
You are an helpful assistant expert in Earth Observation, help the user with his tasks.
<|end|>
<|user|>
Summarize the following document focusing on the main concepts and ideas.

The document starts here:
{passage}

<|end|>
<|assistant|>
"""


In [None]:
# Run this cell to show the final prompt given to the model
display(Latex(message))

In [None]:
output = llm.invoke(message)

In [None]:
# Show the summary generated by the model
display(Latex(output))

# Retrieval Augmented Generation (RAG) with LangChain

In this section, we implement a complete RAG pipeline for answering questions based on a given context. Using the LangChain library, we'll walk through the entire process—from retrieving relevant context to generating accurate answers.


**Roadmap**

1. **Indexing**: Organize the raw documents into a structured format suitable for processing, such as splitting them into chunks or passages for more efficient retrieval.

2. **Embedding**: Convert each text chunk into a dense vector representation using a pre-trained embedding model. These embeddings capture the semantic meaning of the content.

3. **Vector Store**: Store the embeddings in a vector database (Qdrant in our case), allowing fast and scalable similarity search across the document collection.

4. **Retrieval and Generation**: Given a user query, retrieve the most relevant document chunks from the vector store and feed them into a language model (EVE) to generate a context-aware, accurate response.
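
As a rough sketch of step 4, retrieved chunks can be placed into the prompt ahead of the question before calling the model. The names `build_rag_prompt` and `answer`, and the stub retriever and generator, are hypothetical placeholders rather than the notebook's actual components.

```python
def build_rag_prompt(question, chunks):
    """Place numbered context chunks before the question in a chat-style prompt."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "<|system|>\nAnswer the question using only the provided context.<|end|>\n"
        f"<|user|>\nContext:\n{context}\n\nQuestion: {question}\n<|end|>\n"
        "<|assistant|>\n"
    )

def answer(question, retrieve, generate, k=3):
    """Retrieve the k most relevant chunks, then generate an answer grounded in them."""
    chunks = retrieve(question, k)
    return generate(build_rag_prompt(question, chunks))

# Stubs standing in for the vector store and the language model
knowledge_base = [
    "Sentinel-2 carries a multispectral instrument.",
    "Sentinel-1 carries a C-band synthetic aperture radar.",
]
stub_retrieve = lambda question, k: knowledge_base[:k]
stub_generate = lambda prompt: f"(model output for a prompt of {len(prompt)} characters)"

rag_prompt = build_rag_prompt("Which instrument does Sentinel-1 carry?", knowledge_base)
print(rag_prompt)
print(answer("Which instrument does Sentinel-1 carry?", stub_retrieve, stub_generate, k=2))
```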

## Load dataset of Q&A
Let's load our dataset of Q&A pairs about EO. Each sample is composed of a question and an answer.

In [None]:
from datasets import load_dataset

qa = load_dataset('eve-esa/eve-is-open-ended')['train']

# Pick a sample index (fixed here for reproducibility; change it to explore other samples)
idx = 120

question = qa[idx]['question']
answer = qa[idx]['answer']


print('Question: ', question)
print('Answer: ', answer)

## Indexing

The first part of a RAG pipeline is called **indexing**. This is the process of ingesting data from a source and indexing it. The indexing process is composed of three steps:
- **Load**: process and load data in text format.
- **Split**: break the text into smaller chunks. This is useful both for indexing data and for passing it into a model, as large chunks are harder to search over and won't fit in a model's finite context window.
- **Store**: we need somewhere to store and index our splits, so that they can be searched over later. This is often done using a [VectorStore](https://python.langchain.com/docs/concepts/vectorstores/) and [Embeddings](https://python.langchain.com/docs/concepts/embedding_models/) model.

Once the indexing step is done, our knowledge base of scientific papers is indexed and ready to be used as context in the generation step.


<figure>
<img src="https://python.langchain.com/assets/images/rag_indexing-8160f90a90a33253d0154659cf7d453f.png" width="800"/>
<figcaption><a href='https://python.langchain.com/docs/tutorials/rag/'>Source</a></figcaption>
</figure>


Chunking is a fundamental step in working with large language models, especially in retrieval-augmented generation (RAG) pipelines. It refers to the process of splitting long documents or texts into smaller, manageable pieces—called chunks—that can be efficiently stored, indexed, and retrieved when needed. Well-formed chunks are crucial because they ensure that each piece of text contains enough context to be meaningful on its own, which in turn improves the quality of retrieval and the relevance of the information provided to the language model. Poorly chunked text can lead to incomplete or confusing results, so careful design of the chunking strategy is essential.
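
For reference, the simplest chunking strategy is a fixed-size sliding window with some overlap between consecutive chunks. The sketch below is a hypothetical baseline (function name and default sizes are assumptions) that ignores sentence and section boundaries entirely.

```python
def split_fixed(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks

# Synthetic text of 500 characters
sample_text = "".join(str(i % 10) for i in range(500))
fixed_chunks = split_fixed(sample_text, chunk_size=200, overlap=50)
print(len(fixed_chunks), [len(c) for c in fixed_chunks])
```

Because the window boundaries fall at arbitrary positions, a chunk can start or end in the middle of a sentence or a formula.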


Since we are working with structured documents, we want to avoid blindly splitting the text based solely on character or token length. Instead, we can take advantage of the document’s markdown hierarchy to guide our chunking strategy. By doing so, each chunk will correspond to a logical section of the document—such as a heading and its associated content—making the chunks more coherent and meaningful. This structure-aware approach improves the quality of retrieval and the relevance of the context passed to the language model. We will use LangChain to implement this custom splitting logic efficiently.
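
To make the idea concrete before using LangChain, here is a dependency-free sketch of header-based splitting. It is a simplified illustration, not LangChain's implementation: the function name and output format are assumptions, and it does not handle `#` characters inside code fences.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")

def split_by_headers(markdown):
    """Split markdown into chunks, each tagged with the headers it sits under."""
    chunks = []
    path = {}    # header level -> title of the currently open header at that level
    buffer = []  # lines of the section being collected

    def flush():
        content = "\n".join(buffer).strip()
        if content:
            chunks.append({"headers": dict(path), "content": content})
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header closes every open header at the same or a deeper level
            for open_level in [lvl for lvl in path if lvl >= level]:
                del path[open_level]
            path[level] = match.group(2).strip()
        else:
            buffer.append(line)
    flush()
    return chunks

sample_md = "# Title\nIntro text.\n## Data\nField data.\n## Methods\nModel details.\n"
header_chunks = split_by_headers(sample_md)
for chunk in header_chunks:
    print(chunk["headers"], "->", chunk["content"])
```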

In [None]:
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
    ("####", "Header 4"),
    ("#####", "Header 5"),
    ("######", "Header 6")
]

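# Raw string: keeps LaTeX backslashes (e.g. \text) from being read as escape sequences such as a tab
markdown_document = r"""###### Abstract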

Total nitrogen concentration (C\({}_{\text{TN}}\)) enrichment is the primary cause of natural water eutrophication. Accurately estimating C\({}_{\text{TN}}\) and its spatiotemporal dynamics is crucial for formulating monitoring and control measures to alleviate lake eutrophication. A hybrid model was proposed for estimating C\({}_{\text{TN}}\) in optically complex inland waters by incorporating the relationship between C\({}_{\text{TN}}\) and water optical active components for Zhuhai-1 Orbita hyperspectral (OHS) imagery. Compared with other semi-analytical algorithms, the re-adjusted reference wavelength QMA's shows the best performance in a(a) retrieval. The hybrid model for C\({}_{\text{TN}}\) estimation achieves an root mean square deviation (RMSD) of 0.20 mg/L, a mean absolute percentage deviation (MAPD) of 6.96 %, and a unbiased mean absolute percentage deviation (UMAPD) of 6.96 %, with a\({}_{\text{DM}}\)(569) accuracy exerting the greatest influence. Ground-satellite synchronous validation demonstrates robust performance, with an RMSD of 0.28 mg/L, a MAPD of 14.49 %, and a UMAPD of 14.92 %. The hybrid model was applied to OHS observations of Lake Dinch from April 2019 to September 2021. The analysis revealed a generally decreasing trend in C\({}_{\text{TN}}\) during this timeframe. The above results demonstrate that the robustness and applicability of the proposed C\({}_{\text{TN}}\) hybrid model for inland waters with complex optical properties. Furthermore, satellite-based data products provide valuable information for formulating lake management strategies.

## 1 Introduction

Freshwater lakes serve as crucial ecosystems that significantly influence global climate regulation, maintain biodiversity, and provide essential resources for humans (Ho et al., 2019; Li et al., 2024). Human activities, including agricultural runoff and industrial waste discharge, coupled with the effects of global climate change, are causing widespread eutrophication and shrinkage of lakes worldwide (Dai et al., 2023; Grant et al., 2021). Excessive nitrogen elements in aquatic ecosystems have severe adverse effects on aquatic organisms and humans, such as eutrophication, water acidification, and toxicity (Li et al., 2023; Sheikholeslami and Hall, 2023). Therefore, obtaining total nitrogen concentration (C\({}_{\text{TN}}\)) and its spatial distribution is critical for furthering our understanding of the eutrophication process and biogeochemicalprocess of aquatic ecosystems.

Lake Bianchi is the the largest freshwater lake in Yunnan Province of China. It plays a crucial role in the region's ecological balance, water supply, and economic development (Li et al., 2024; Zheng et al., 2023). However, like other reservoirs and lakes around the world, the lake has suffered from severe environmental degradation in recent decades (Feng et al., 2021; Ho et al., 2019). Since the 1970 s, Lake Bianchi has experienced prolonged algal blooms due to increased nitrogen and phosphorus loads, severely affecting the lake's aquatic ecology and the likelihoods of surrounding residents (Cheng et al., 2023; Mu et al., 2021). Therefore, monitoring nutrient concentrations, especially total nitrogen, is crucial for the management of lake eutrophication.

Traditionally, \(\mathrm{C_{IN}}\) monitoring and assessment have relied on field measurements and laboratory analysis (Wang et al., 2022). While these methods provide valuable data points, they are inherently limited in their ability to capture large-scale spatial and temporal variations in \(\mathrm{C_{IN}}\). Additionally, they are both labor-intensive and time-consuming to implement (Cai et al., 2023; Li et al., 2024). Earth observation techniques, with their wide spatial coverage and capability for long-term, continuous monitoring, offer significant potential for uninterrupted water quality monitoring (Liu et al., 2020; Zheng et al., 2023). However, accurately estimating \(\mathrm{C_{IN}}\) remains challenging. Unlike optically active substances (OACs), \(\mathrm{C_{IN}}\) lacks a distinct spectral signature within the detectable wavelength range, and TN is therefore referred to as a non-optically active substance (Li et al., 2023; Li et al., 2017). Therefore, estimating \(\mathrm{C_{IN}}\) in inland entropic waters using the spectral features of remote sensing is difficult.

In the last two decades, researchers have achieved substantial advancements in \(\mathrm{C_{IN}}\) estimation using remote sensing techniques, primarily through direct and indirect estimation models (Chen and Quan, 2012; Dong et al., 2020; Guo et al., 2021; Guo et al., 2022; Li et al., 2023; Li et al., 2017; Qun'ou et al., 2021; Torbick et al., 2013; Wang et al., 2020; Zhu et al., 2023). Direct estimation models establish statistical relationships between remote sensing reflectance (\(\mathrm{R_{\mathrm{\SIUnitSymbolMicro m}}}\)) and \(\mathrm{C_{IN}}\). Multiple linear regression models have been employed for estimating \(\mathrm{C_{IN}}\) in various water bodies, such as Lake Taihu, the Pearl River Estuary, and the Lower Peninsula of Michigan (Chen and Quan, 2012; Guo et al., 2022; Torbick et al., 2013). Beyond traditional methods, researchers have increasingly focused on leveraging machine learning algorithms for \(\mathrm{C_{IN}}\) estimation. Examples include artificial neural networks, multi-spectral scale morphological combined features, support vector machines, AdaBoost Regression, and Gradient Boosting Regression, which have been trained and successfully applied to \(\mathrm{C_{IN}}\) estimation in inland rivers and lakes (Guo et al., 2021; Li et al., 2023; Zhu et al., 2023). While direct retrieval models offer advantages in terms of simplicity and ease of use, their effectiveness and applicability across diverse water bodies are significantly constrained by limitations in data availability and the complexity of the relationship between \(\mathrm{C_{IN}}\) and the selected features (Cai et al., 2023). Additionally, the direct use of \(\mathrm{R_{\mathrm{\SIUnitSymbolMicro m}}}\)(\(\mathrm{\SIUnitSymbolMicro}\)) to estimate \(\mathrm{C_{IN}}\) is controversial, as TN is a non-optically active substance, and changes in \(\mathrm{C_{IN}}\) are difficult to reflect in \(\mathrm{R_{\mathrm{\SIUnitSymbolMicro m}}}\)(Li et al., 2023). 
The second type involves indirect estimation models that leverage the relationships between \(\mathrm{C_{IN}}\) and OACs. Chen and Quan (2012) constructed a multiple linear regression model using OACs as predictor variables and \(\mathrm{C_{IN}}\) as the response variable. This model is applicable to \(\mathrm{C_{IN}}\) estimation in Lake Taihu. Previous empirical models only used band ratios or band combination models to estimate \(\mathrm{C_{IN}}\), and such models are difficult to directly apply to different lakes or different periods within the same lake (Oyama et al., 2009). Compared to direct estimation models, indirect models offer the advantage of incorporating bio-optical relationships between \(\mathrm{C_{IN}}\) and OACs. Therefore, we attempted to develop a novel \(\mathrm{C_{IN}}\) estimation model that considers the relationship between \(\mathrm{C_{IN}}\) and the inherent optical properties of water bodies.

In optically complex waters, various constituents like phytoplankton, suspended solids, and colored dissolved organic matter (CDOM) collectively modulate the water's spectral signature, making it challenging to isolate the signal specific to \(\mathrm{C_{IN}}\)(Xue et al., 2019). Additionally, the absorption coefficients of phytoplankton (\(\mathrm{a_{\mathrm{\SIUnitSymbolMicro m}}}\)(\(\mathrm{\SIUnitSymbolMicro}\))), non-algal particles (\(\mathrm{a_{\mathrm{\SIUnitSymbolMicro m}}}\)(\(\mathrm{\SIUnitSymbolMicro}\))), and colored dissolved organic matter (\(\mathrm{a_{\mathrm{\SIUnitSymbolMicro m}}}\)(\(\mathrm{\SIUnitSymbolMicro}\))) quantify how these optically active constituents absorb light particles, thereby reducing the light energy that penetrates the water column. These coefficients serve as the connection between \(\mathrm{R_{\mathrm{\SIUnitSymbolMicro m}}}\) and various OACs. Consequently, they are frequently incorporated as intermediary variables within models for estimating various water quality parameters (Liu et al., 2020; Zheng et al., 2023). Researchers have proposed various estimation models to estimate these absorption coefficients. These models encompass a range of approaches, including empirical approaches and semi-analytical approaches (Huang et al., 2014; Lee et al., 2014; Liu et al., 2020; Xue et al., 2019; Zheng et al., 2023). Building upon existing algorithms like QAA.V5, researchers have developed improved versions specifically designed for different water types. Lee et al. (2014) introduced QAA.V6 for clear ocean applications, utilizing a 670 nm reference wavelength and recalibrated coefficients. This model has proven effective in retrieving inherent optical properties (IOPs) of clear oceans (Jiang et al., 2019; Jorge et al., 2021). For inland turbid waters, modifications like Huang et al. (2014) obtain (using a 710 nm reference wavelength) and Xue et al. 
(2019) obtain \(\mathrm{QAA_{750}}\) algorithm (using a 750 nm reference wavelength) were developed to address retrieval challenges in these environments. Zheng et al. (2023) proposed a QAA algorithm (\(\mathrm{QAA_{710}}\)) with a reference wavelength of 716 nm and successfully estimated the total absorption coefficient (\(\mathrm{a_{\mathrm{\SIUnitSymbolMicro m}}}\)) and \(\mathrm{a_{\mathrm{\SIUnitSymbolMicro m}}}\)(\(\mathrm{\SIUnitSymbolMicro}\)) of Lake Tianchi. These advancements not only enabled IOP retrieval but also facilitated the successful estimation of chlorophyll-a (Chla) and total suspended matter (TSM), contributing to a more comprehensive assessment of inland water quality. Previous studies have demonstrated that using absorption coefficients, rather than OACs, is a more effective and feasible approach for estimating \(\mathrm{C_{IN}}\) in optically complex inland waters (Shi et al., 2019; Zhang et al., 2018). Therefore, we attempted to develop a model based on \(\mathrm{C_{IN}}\) and absorption coefficients.

This study focused on developing a hybrid remote sensing algorithm specifically designed to estimate \(\mathrm{C_{IN}}\) in inland waters with complex optical properties. The proposed hybrid algorithm integrates the improved QAA algorithm and semi-empirical algorithms to estimate absorption coefficients and their relationship with \(\mathrm{C_{IN}}\). The Zhuhai-1 Orbita hyperspectral (OHS) satellite has emerged as a valuable asset for monitoring \(\mathrm{C_{IN}}\) in inland waters. This next-generation satellite boasts exceptional spatial (10 m) and temporal resolution (2.5 days) alongside a high spectral resolution (32 spectral bands). This unique combination allows for the capture of fine-scale spatial and temporal variations in \(\mathrm{C_{IN}}\) across inland aquatic ecosystems. Specifically, we took the following steps: (1) proposed a hybrid model for estimating \(\mathrm{C_{IN}}\) based on absorption coefficients in inland waters; (2) employed the QAA\({}_{716}\) model to derive target absorption coefficients; (3) obtained the spatiotemporal variation of \(\mathrm{C_{IN}}\) in Lake Tianchi, a typical eutrophic lake; (4) evaluated the feasibility of using OHS data to estimate \(\mathrm{C_{IN}}\) and conduct a radiation performance assessment.

## 2 Study area and data

### Study region

Lake Bianchi, the largest freshwater lake on the Yunnan-Guizhou Plateau, plays a vital role in the regional ecosystem and economy (Fig. 1). It covers an area of approximately 330 km\({}^{2}\) with a mean depth of 4.4 m (Huang et al., 2014). Lake Tianchi is a typical tectonic lake with numerous inflowing tributaries and only one outflowing river system, resulting in distinct closed-semi-closed characteristics. Over the past few decades, rapid urbanization and industrial development have led to a sharp deterioration in water quality, with large amounts of nitrogen and phosphorus nutrients being discharged into the lake (Li et al., 2023). This has resulted in increasing eutrophication of the lake,leading to large-scale algal blooms under suitable conditions, posing a serious threat to Kunming's tourism industry (Li et al., 2024).

### Field data

To capture seasonal variations in water quality and optical properties, field campaigns were conducted in Lake Dianchi during April and November 2017 (Fig. 1B & Table 1).

"""

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)


In [None]:
# Print each chunk together with the headers it belongs to
for header_split in md_header_splits:
  headers = "\n".join(header_split.metadata.values())
  print('Section name: ', headers)
  print('Chunk content:\n', header_split.page_content)
  print()

### Embeddings

An embedding model in Retrieval-Augmented Generation (RAG) is a neural network that converts text into dense vector representations (embeddings) in a **high-dimensional space**. These models take text as input and produce a fixed-length array of numbers, a numerical fingerprint of the text's semantic meaning. Embeddings allow a search system to find relevant documents based not just on keyword matches, but on semantic understanding.

Embedding models are trained on large text corpora using unsupervised learning techniques. They learn to encode the semantic meaning of words, sentences, and documents in a way that captures the relationships between them. For example, an embedding model can learn that "cat" and "dog" are similar because they are both animals, or that "apple" and "orange" are similar because they are both fruits.

There are many pre-trained embedding models available, each suited to different types of data and use cases. For our application, we use [Indus](https://huggingface.co/papers/2405.10725), a fine-tuned encoder-only transformer model trained specifically on scientific journals and articles related to NASA’s Science Mission Directorate (SMD).

Choosing the right embedding model is a critical step in building an effective retrieval system. Ideally, the embedding model should be trained—or at least fine-tuned—on data similar to the target documents. Since our corpus consists of scientific texts focused on Earth Observation, Indus is a better fit than a general-purpose model, as it captures domain-specific terminology and semantics more accurately.
<figure>

<img src="https://weaviate.io/assets/images/embedding-models-0c04d93c0be28dd63a0e8781c4e8685d.jpg" width='800px'>
<figcaption><a href='https://weaviate.io/blog/how-to-choose-an-embedding-model'>Source</a></figcaption>
</figure>



In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings

# Load the embeddings model
model_name = "nasa-impact/nasa-smd-ibm-st-v2"
encode_kwargs = {"normalize_embeddings": True}
indus_embd = HuggingFaceEmbeddings(
    model_name=model_name,  encode_kwargs=encode_kwargs
)

### Vector Store

Vector stores are specialized databases designed to efficiently index and retrieve information using vector representations of data. Vector stores leverage the dense representation by reducing the task of finding similar documents to a search in a high-dimensional space. This search is performed by comparing the vector representation of the **query** with the vector representations of the **documents** in the database. The documents whose vectors are closest to the query vector are considered the most similar to the query.

To wrap up, the retrieval process is composed of the following steps:
- **Embed the documents**
- **Store the embeddings in a VectorStore**
- **Embed the query**
- **Retrieve** the most similar documents to the query


The simplest and most popular setup uses **cosine similarity** to compare the vectors and retrieves the **top k** most similar ones.
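
A minimal sketch of cosine-similarity top-k retrieval follows. The three-dimensional vectors and function names are made up for illustration; real embeddings have hundreds of dimensions, and a vector store such as Qdrant uses approximate indexes rather than this exhaustive scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k=2):
    """Return (document index, similarity) pairs for the k most similar documents."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]

# Hypothetical toy embeddings
doc_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
query_vector = [1.0, 0.0, 0.0]

results = top_k(query_vector, doc_vectors, k=2)
print(results)
```

When the embeddings are normalized to unit length, cosine similarity reduces to a plain dot product.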


<figure>
<img src="https://python.langchain.com/assets/images/vectorstores-2540b4bc355b966c99b0f02cfdddb273.png" width="800"/>
<figcaption><a href='https://python.langchain.com/docs/concepts/vectorstores/'>Source</a></figcaption>
</figure>


## Connect to Qdrant

To save time, the embedding and indexing of documents have already been completed prior to this notebook. These steps can be computationally intensive, so we’ve pre-processed the data to streamline the workflow.

We are using Qdrant as our vector store, which has been preloaded with all the relevant documents needed for retrieval. In this section, we will connect to the Qdrant instance and select the specific collection that contains our indexed data. This will enable us to perform efficient semantic searches and support the Retrieval-Augmented Generation (RAG) process used in our Q&A tasks.

In [None]:
# Example retrieval pipeline using the embedding function and the Qdrant API
import os
from qdrant_client import QdrantClient

# Prefer environment variables; fall back to the values loaded from the env file
qdrant_url = os.getenv('QDRANT_URL', config.get('QDRANT_URL'))
api_key = os.getenv('QDRANT_API_KEY', config.get('QDRANT_API_KEY'))

# Establish a connection with the vector store
client = QdrantClient(
    url=qdrant_url,
    api_key=api_key
)

# Embed the query
query_emb = indus_embd.embed_query(question)

# Perform similarity search using the computed embeddings
search_result = client.search(
    collection_name="esa-nasa-workshop",
    query_vector=query_emb,
    limit=1,
)

data = search_result[0].payload
# Payload containing metadata and text
for key, value in data.items():
  print(f'{key}: {value}')

print('Retrieved chunk:\n', data['text'])

In [None]:
from typing import Dict, List, Optional

import numpy as np
from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever
from pydantic import Field, PrivateAttr
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, PointStruct


# A small retriever class that wraps the Qdrant client behind a simple interface
class QdrantRetriever:
    def __init__(self, embedding, api_key, qdrant_url, collection_name='esa-nasa-workshop', k: int = 3):
        self._client = QdrantClient(url=qdrant_url, api_key=api_key)
        self.embedding = embedding
        self.collection_name = collection_name
        self.k = k  # number of chunks to retrieve per query

    def get_relevant_documents(self, query: str) -> List[Document]:
        """Embed the query and return the k most similar chunks as Documents."""
        query_emb = self.embedding.embed_query(query)

        search_result = self._client.search(
            collection_name=self.collection_name,
            query_vector=query_emb,
            limit=self.k,
        )

        docs = []
        for hit in search_result:
            # The payload holds the chunk text plus its metadata; adjust the keys to your data
            data = hit.payload
            content = data.get("text", "")
            metadata = {key: value for key, value in data.items() if key != "text"}
            docs.append(Document(page_content=content, metadata=metadata))

        return docs


In [None]:
import json

# Let's define our retriever
retriever = QdrantRetriever(indus_embd, api_key, qdrant_url, k=3)

# Format retrieved documents:
def format_docs(docs):
  doc_str = ''
  for i, doc in enumerate(docs):
    doc_str += f'Document n. {i+1}\n'
    doc_str += f'TITLE: {doc.metadata.get("title", "No title")}\n' # Add the title of the paper
    if isinstance(doc.metadata.get("headers", {}), str):
      # Headers may be stored as a JSON string; parse them into a dict
      doc.metadata['headers'] = json.loads(doc.metadata["headers"])
    for key, value in doc.metadata.get("headers", {}).items():
      doc_str += f'{value}\n'
    doc_str += f'URL: {doc.metadata.get("url", "No url")}\n\n' # Add URL of the paper
    doc_str += f'{doc.page_content}\n\n'
  return doc_str



print('Question: ')
print(question)
print()
docs = retriever.get_relevant_documents(question)
print(format_docs(docs))

## Retrieval and generation

The pipeline consists of the following key components:

- Retriever: This component queries the vector store (Qdrant) to fetch the most relevant document chunks based on the user’s question. It performs a semantic search using the pre-computed embeddings to find contextually similar content.

- LLM (Large Language Model): Once the relevant context is retrieved, it is passed to EVE, our Earth Observation-specialized language model. EVE then generates a coherent and informed response based on both the query and the retrieved context.

This approach ensures that the generated answers are grounded in the source documents, improving accuracy and reducing hallucination.

### Prompt

First, we define the prompt using three different templates:
- **SystemMessagePromptTemplate**: the system message gives the model guidelines on how to interact with the user and interpret the conversation.
- **AIMessagePromptTemplate**: the AI message represents a message generated by the model.
- **HumanMessagePromptTemplate**: the human message represents the message sent by the user.


In [None]:
# Customize the system, human, and assistant prompt templates
from langchain.prompts import SystemMessagePromptTemplate, AIMessagePromptTemplate, HumanMessagePromptTemplate

template= '''<|system|>
{message}
<|end|>
'''

# A human message will contain the question and the context. The context will be automatically added by the retriever.
human_template = '''<|user|>
Context: {context}

Question is below:

Question: {question}

<|end|>
<|assistant|>
'''


assistant_template = '''<|assistant|>
{message}
<|end|>
'''

# Define the templates
SystemMessageTemplate = SystemMessagePromptTemplate.from_template(template)
HumanMessageTemplate = HumanMessagePromptTemplate.from_template(human_template)
AIMessageTemplate = AIMessagePromptTemplate.from_template(assistant_template)


In [None]:
# Define the system message

system_message = '''You are an expert assistant that answers questions about different topics.

You are given some extracted parts from science papers along with a question.

If you don't know the answer, just say "I don't know." Don't try to make up an answer.

Use only the following pieces of context to answer the question at the end.

Do not use any prior knowledge.'''


system_msg = SystemMessageTemplate.format(message=system_message)

system_msg

Now that the templates are defined, we can build the chat prompt. LangChain requires a specific structure for the chat prompt: a list of messages. In the code below, our chat template is composed of two messages, the **system message** and the **human message** that contains the input from the user.

In [None]:
from langchain.prompts import MessagesPlaceholder, PromptTemplate, ChatPromptTemplate

chat_template = ChatPromptTemplate.from_messages(
    messages=[
    system_msg,
    human_template,
    ]
)

# As we can see, our prompt is expecting two variables to be filled
print(chat_template)

### Model initialization


In [None]:
from langchain_aws import BedrockLLM

llm = BedrockLLM(model_id='arn:aws:bedrock:us-west-2:637423382292:imported-model/81wcbcqx5vro', region_name='us-west-2', provider='meta')

### LangChain pipelines

LangChain pipelines are a powerful tool for assembling and coordinating different components. Our pipeline will look something like this:

$$\text{user query} \rightarrow \text{retriever} \rightarrow \text{chat prompt} \rightarrow \text{LLM} \rightarrow \text{answer}$$

In LangChain we use the chain operator `|` to assemble our components in series. The chain operator is part of the **LangChain Expression Language** (LCEL), a declarative way to build pipelines. In LCEL, the output of whatever is on the left of `|` becomes the input of whatever is on its right.

Let's build our first pipeline to understand how it works. In the sample pipeline below, we dynamically create a dictionary that is passed as input to our chat template (N.B. as we saw above, our chat template expects two variables, `context` and `question`).

The context value is created by taking the question (from the input dict given to the chain) and using it as input to our retriever. The output of the retriever is then formatted by the `format_docs` function.
The question is passed through unchanged.


A chain is called with the **invoke** method. The invoke method takes as its argument a dictionary that represents the input of the first element of the pipeline.
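To illustrate the left-to-right semantics of `|`, here is a toy re-implementation in plain Python. It is not LangChain's actual code: the `Step` class and the `fake_*` components are hypothetical stand-ins that only mimic how each stage's output feeds the next stage's input.

```python
class Step:
    """Toy stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # `self | other` builds a new step that runs self first, then other
        return Step(lambda value: other.invoke(self.invoke(value)))

# Hypothetical components standing in for the retriever, the prompt and the LLM
fake_retriever = Step(lambda inputs: {**inputs, "context": f"chunk about {inputs['question']}"})
fake_prompt = Step(lambda inputs: f"Context: {inputs['context']}\nQuestion: {inputs['question']}")
fake_llm = Step(lambda prompt: f"[answer based on] {prompt}")

toy_chain = fake_retriever | fake_prompt | fake_llm
toy_output = toy_chain.invoke({"question": "mangroves"})
print(toy_output)
```

The dictionary passed to `invoke` enters the first step, and each later step only ever sees what the previous one returned.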



In [None]:
from operator import itemgetter
from langchain.schema.runnable import RunnableLambda


# Build the pipeline
rag_chain_from_docs = (
    {
        "question": itemgetter('question'),
        "context": itemgetter('question') | RunnableLambda(retriever.get_relevant_documents) | format_docs,
    }
    | RunnableLambda(lambda inputs: {
        **inputs,
        "prompt": chat_template.invoke(inputs)  # Add the rendered prompt explicitly
    })
    | {
        "model_out": itemgetter("prompt") | llm,
        "prompt": itemgetter("prompt"),
    }
)

output = rag_chain_from_docs.invoke({"question": question})

In [None]:
# Print the prompt
print(output['prompt'].to_string())
print()
# Print the model output
print(output['model_out'])

# Evaluation

Evaluating Open-Ended Question Answering (Open-QA) is inherently challenging due to the diversity and variability in natural language. Two syntactically different responses can both be valid and informative answers to the same question, making strict matching metrics less effective in many cases. Below are the main approaches currently used for evaluating open-ended answers:

- **Human Evaluation**: Still considered the gold standard for assessing answer quality.

- **Lexical Matching**: Traditional metrics such as BLEU, ROUGE, and METEOR fall under this category. They rely on surface-level word overlap between the generated and reference answers. These metrics are limited in their ability to handle paraphrasing or synonymy.


- **Neural Method Evaluation**: These methods use pre-trained language models to compute semantic similarity between answers.

- **LLM-as-judge Evaluation**: Large Language Models are increasingly being used as evaluators themselves. These models can provide ratings, rankings, or free-form feedback based on the content and context of answers. LLM evaluation can better capture nuanced meaning, although it introduces its own biases and variability.

In the following section, we will see how **Lexical Matching** and **Neural Method Evaluation** can be used for evaluation.

In [None]:
# Install dependencies
#!pip install rouge_score

In [None]:
# Evaluation samples
# Original question
question = 'Why is Sentinel-1’s Radar instrument useful for monitoring mangroves?'
# Reference answer
reference = 'Sentinel-1’s Radar instrument is useful for monitoring mangroves because it can measure through clouds to show the canopy of the mangroves, and it provides data in the visible spectrum, as well as near-infrared and shortwave infrared.'
# Predicted answer
good_prediction = "Sentinel-1's Radar instrument is useful for monitoring mangroves because it can penetrate clouds and provide all-weather observations, which are essential for understanding and managing these ecosystems. Additionally, Sentinel-1’s radar can measure several parameters related to mangroves, such as tree height and extent, which are important for sustainable mangrove management."
bad_prediction = "Sentinel-1 helps monitor mangroves by collecting temperature data that shows how warm the coastal areas are, which is important for understanding climate impacts."

predictions = [good_prediction, bad_prediction]
references = [reference, reference]

## ROUGE score

[ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge) (Recall-Oriented Understudy for Gisting Evaluation) was originally designed for summarization. It compares n-gram overlaps between candidate and reference texts. ROUGE-1 (unigrams) and ROUGE-L (longest common subsequence) are commonly used in QA tasks. While easy to compute, ROUGE often fails to capture semantic equivalence when the wording differs significantly.
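The unigram overlap behind ROUGE-1 can be sketched by hand. This is a simplified illustration that assumes lowercased whitespace tokenization; the library used in this notebook applies its own tokenization and options, so its numbers will differ. The function name `rouge1_sketch` and the two toy sentences are made up for the example.

```python
from collections import Counter

def rouge1_sketch(candidate, reference):
    """Unigram precision, recall and F1 with naive whitespace tokenization."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: each reference token can be matched at most once
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(cand_tokens) if cand_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

toy_reference = "radar can see through clouds"
toy_paraphrase = "radar penetrates cloud cover"

exact_scores = rouge1_sketch(toy_reference, toy_reference)
paraphrase_scores = rouge1_sketch(toy_paraphrase, toy_reference)
print(exact_scores)
print(paraphrase_scores)
```

The paraphrase shares only the word "radar" with the reference, so it scores low even though it carries the same meaning, which is the weakness described above.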

In [None]:
from evaluate import load

rouge = load("rouge")

# Compute ROUGE for each candidate
results = []
for candidate in predictions:
    score = rouge.compute(predictions=[candidate], references=[reference])
    results.append((candidate, score))
    print(f'Candidate: {candidate}')
    for metric, value in score.items():
        print(f'{metric}: {value}')
    print()


## BERTScore
This [metric](https://huggingface.co/spaces/evaluate-metric/bertscore) leverages contextual embeddings from BERT or similar models to evaluate the similarity between candidate and reference sentences. Unlike ROUGE, BERTScore compares the semantic content of each token using the cosine similarity of their embeddings. It is more robust to paraphrasing and has shown stronger correlation with human judgment in many open-ended tasks.

<figure>
<img src='https://miro.medium.com/v2/resize:fit:866/1*mKfTGKo-nIsOHBFD9QcJiQ.png' width='700px'>
<figcaption><a href='https://sh-tsang.medium.com/brief-review-bertscore-evaluating-text-generation-with-bert-0bc5fc889d7b'>Source</a></figcaption>
</figure>
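The greedy token matching at the core of BERTScore can be sketched with hand-made vectors. This is only an illustration under strong assumptions: the 2-dimensional `toy_embeddings` are invented, whereas the real metric uses contextual embeddings from a pre-trained model and supports optional weighting and rescaling that are omitted here.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def greedy_match_score(cand_vecs, ref_vecs):
    """Precision, recall and F1 from greedy cosine matching of token vectors."""
    # Precision: each candidate token is matched to its most similar reference token
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token is matched to its most similar candidate token
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented vectors: near-synonyms are placed close to each other
toy_embeddings = {
    "radar": [1.0, 0.0],
    "sar": [0.9, 0.1],
    "clouds": [0.0, 1.0],
    "cloud": [0.1, 1.0],
}

reference_vecs = [toy_embeddings[t] for t in ["radar", "clouds"]]
candidate_vecs = [toy_embeddings[t] for t in ["sar", "cloud"]]

toy_precision, toy_recall, toy_f1 = greedy_match_score(candidate_vecs, reference_vecs)
print(round(toy_precision, 4), round(toy_recall, 4), round(toy_f1, 4))
```

Although the candidate shares no exact word with the reference, the score stays high because each token has a close neighbour in embedding space.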

In [None]:
from evaluate import load

bertscore = load("bertscore")

bertscore_result = bertscore.compute(predictions=predictions, references=references, lang="en")

for i in range(len(predictions)):
  print(f'Candidate: {predictions[i]}')
  print(f'Precision: {round(bertscore_result["precision"][i], 4)}')
  print(f'Recall: {round(bertscore_result["recall"][i], 4)}')
  print(f'F1: {round(bertscore_result["f1"][i], 4)}')
  print()


In the example above, both methods correctly ranked the relevant answer above the irrelevant one. However, they should be used with caution. ROUGE often assigns low scores even to correct answers because it relies on strict word-level matching. In contrast, BERTScore tends to yield high scores because it measures **token-level similarity**. As a result, even an incorrect answer can receive a high score if it has the right format and vocabulary.

## LLM-as-judge

As an additional evaluation method, we leverage a large language model (LLM) to perform preference ranking between two different generated answers. The model is provided with specific evaluation guidelines, the original reference answer (used as a gold standard), and the two candidate answers to compare. Based on the provided criteria, the LLM is asked to judge which of the two answers is preferable. For this evaluation, we use OpenAI's o4-mini model as the judge.

Here we can see a sample prompt for preference ranking:

```
You are a helpful and precise evaluator for language model outputs.
Task description (for context): You are an helpful assistant expert in Earth Observation, help the user with his tasks. Answer the following question: Why is Sentinel-1’s Radar instrument useful for monitoring mangroves?

Please evaluate all outputs independently based on:

1. Relevance to the task
2. Accuracy
3. Fluency and grammar
4. Completeness of the answer

The original answer is the following:
Sentinel-1's Radar instrument is useful for monitoring mangroves because it can penetrate clouds and provide all-weather observations, which are essential for understanding and managing these ecosystems. Additionally, Sentinel-1’s radar can measure several parameters related to mangroves, such as tree height and extent, which are important for sustainable mangrove management.

Return a justification text on the order of preference of all outputs based on the previous criteria.
Then return a JSON object with:

* "ranking": a dictionary where 1 = best, 2 = worst Example: { "output_1": 1, "output_2": 2 }

Respond **only** with the justification and a valid JSON object.

### Output 1:

Sentinel-1's Radar instrument is useful for monitoring mangroves because it can penetrate clouds and provide all-weather observations, which are essential for understanding and managing these ecosystems. Additionally, Sentinel-1’s radar can measure several parameters related to mangroves, such as tree height and extent, which are important for sustainable mangrove management.

### Output 2:

Sentinel-1 helps monitor mangroves by collecting temperature data that shows how warm the coastal areas are, which is important for understanding climate impacts.

```

The model's output will be the following:


```
Justification:
Output 1 is clearly the best. It is directly relevant to the task, accurate in its explanation of the radar capabilities of Sentinel-1 (e.g., cloud penetration and all-weather monitoring), fluent in language, and complete in content. It matches the original answer almost verbatim and includes key points such as parameter measurement (tree height and extent) important for mangrove monitoring.

Output 2 is the worst. It is factually inaccurate—Sentinel-1 is a radar satellite and does not collect temperature data. This makes the core premise incorrect. Additionally, the output lacks completeness and relevance to the radar-specific capabilities asked in the question. While the grammar and fluency are acceptable, the fundamental misunderstanding of the instrument's purpose outweighs that.

{
  "ranking": {
    "output_1": 1,
    "output_2": 2
  }
}
```
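Because the judge returns free text followed by a JSON object, its response has to be parsed before it can be aggregated. The sketch below shows one possible way to do this; `parse_judge_response` is a hypothetical helper, and it assumes that the first `{` in the response opens the ranking object, which would fail if the justification itself contained braces.

```python
import json

def parse_judge_response(response_text):
    """Split a judge response into its justification text and ranking dict."""
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end == -1 or end < start:
        raise ValueError("No JSON object found in judge response")
    justification = response_text[:start].strip()
    ranking = json.loads(response_text[start:end + 1])["ranking"]
    return justification, ranking

# Shortened version of the judge output shown above
sample_response = '''Justification:
Output 1 is clearly the best. Output 2 is the worst.

{
  "ranking": {
    "output_1": 1,
    "output_2": 2
  }
}'''

justification, ranking = parse_judge_response(sample_response)
# Rank 1 is the best, so the preferred output has the smallest rank value
preferred = min(ranking, key=ranking.get)
print(preferred)
```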

## Human preference ranking

As with automated evaluation, a human annotator is presented with two different outputs along with the original answer and is asked to indicate which of the two outputs they prefer.

## Evaluation results
The table below presents evaluation results on a sample of 50 open-ended questions. The two models considered are EVE and the standard Llama 3.1 Instruct. Both models were prompted using the same retrieved context:

| Model     | hf_bleu | avg_rouge1 | avg_rougeL | avg_bert | gpt_pref_winrate | human_pref_winrate |
|-----------|---------|------------|------------|----------|------------------|--------------------|
| llama_rag | 0.03    | 0.09       | 0.07       | 0.84     | 0.38             | 0.21               |
| eve_rag   | 0.03    | 0.19       | 0.15       | 0.84     | 0.62             | 0.79               |


The table above shows that EVE achieves higher ROUGE-1 and ROUGE-L scores, indicating better n-gram overlap with the reference answers. However, both models achieve the same BERTScore, suggesting comparable performance in terms of semantic similarity.

The preference columns report each model's win rate, i.e. the fraction of pairwise comparisons in which its answer was preferred over the other model's.
On these preference scores, both the LLM-as-a-judge model (o4-mini) and the human evaluator prefer the answers generated by EVE. It is important to emphasize that evaluating open-ended generation is inherently challenging, and each score represents just one perspective on model performance. Among the reported metrics, the human and automatic preference scores are the most informative, as they better reflect the overall quality and relevance of the generated answers.
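As a rough sketch of how a win rate can be derived from pairwise judgments, the snippet below counts how often each model was preferred. The list `toy_preferences` is invented for illustration and is not the 50-question data behind the table; `win_rates` is a hypothetical helper name.

```python
from collections import Counter

def win_rates(preferences):
    """Fraction of comparisons won by each model, given one winner per comparison."""
    counts = Counter(preferences)
    total = len(preferences)
    return {model: wins / total for model, wins in counts.items()}

# Made-up outcomes of five pairwise comparisons (one preferred model per question)
toy_preferences = ["eve_rag", "eve_rag", "llama_rag", "eve_rag", "llama_rag"]

toy_rates = win_rates(toy_preferences)
print(toy_rates)
```

With exactly two models and no ties, the two win rates sum to 1, as they do in each preference column of the table.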
