## Data Ingestion

**Ingestion** is the next crucial step after chunking the data. It is the process of reading data from a source and indexing it. The indexing process consists of three steps:
- **Load**: process the source data and load it as text.
- **Split**: break the loaded text into smaller chunks. This helps both with indexing the data and with passing it to a model, since large chunks are harder to search over and won't fit in a model's finite context window.
- **Store**: we need somewhere to store and index our splits, so that they can be searched over later. This is often done using a [VectorStore](https://python.langchain.com/docs/concepts/vectorstores/) and an [Embeddings](https://python.langchain.com/docs/concepts/embedding_models/) model.

Once the indexing step is done, our knowledge base of scientific papers is indexed and ready to be used as context in the generation step.
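
To make the three steps concrete, here is a minimal, standard-library-only sketch of a load, split, store pipeline. The function names (`load`, `split`, `store`), the character-window splitting, and the in-memory dictionary used as an index are all assumptions made for illustration; they are not the LangChain or Qdrant APIs used later in this notebook.

```python
# Toy load -> split -> store pipeline (illustrative only).
# All names below are hypothetical and not part of any library.

def load(raw_documents):
    """Load: collapse whitespace so each document is one clean text string."""
    return [" ".join(doc.split()) for doc in raw_documents]


def split(text, chunk_size=40, overlap=10):
    """Split: cut the text into character windows that overlap by `overlap`."""
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


def store(chunks, index):
    """Store: keep each chunk under an integer id so it can be looked up later."""
    for chunk in chunks:
        index[len(index)] = chunk
    return index


raw = ["Sentinel-2 is a European mission   that provides\nmulti-spectral imaging for land services."]
index = {}
for text in load(raw):
    store(split(text), index)

for chunk_id, chunk in index.items():
    print(chunk_id, repr(chunk))
```

In a real pipeline the store step would also compute an embedding for every chunk, which is what the rest of this notebook sets up.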

## Qdrant Vector Database

In our case, we will use the **Qdrant** vector database to store our embeddings. You can sign up for a free cluster [here](https://qdrant.tech/documentation/cloud/create-cluster/). The free tier gives you a 1GB cluster that is free forever, with no credit card required at signup. Unused free-tier clusters are automatically suspended after 1 week of inactivity and deleted after 4 weeks if not reactivated.

### Embeddings

An embedding model in Retrieval-Augmented Generation (RAG) is a neural network that converts text into dense vector representations (embeddings) in a **high-dimensional space**. These models take text as input and produce a fixed-length array of numbers, a numerical fingerprint of the text's semantic meaning. Embeddings allow a search system to find relevant documents based not just on keyword matches, but on semantic similarity.

Embedding models are trained on large text corpora, often using unsupervised or self-supervised learning techniques. They learn to encode the semantic meaning of words, sentences, and documents in a way that captures the relationships between them. For example, an embedding model can learn that "cat" and "dog" are similar because they are both animals, or that "apple" and "orange" are similar because they are both fruits.
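
The idea that "similar meaning" becomes "nearby vectors" can be sketched with cosine similarity. The three-dimensional vectors below are made up by hand for illustration and were not produced by any real model; real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors (assumption): NOT the output of an embedding model.
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
cat_apple = cosine_similarity(toy_embeddings["cat"], toy_embeddings["apple"])
print(f"cat vs dog:   {cat_dog:.3f}")
print(f"cat vs apple: {cat_apple:.3f}")
```

With these toy values, "cat" scores higher against "dog" than against "apple", which is the behaviour a trained model is expected to show on real text.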

There are many pre-trained embedding models available, each suited to different types of data and use cases.

Choosing the right embedding model is a critical step in building an effective retrieval system. Ideally, the embedding model should be trained, or at least fine-tuned, on data similar to the target documents. Since our corpus consists of scientific texts focused on Earth Observation, Indus (the model loaded below) is a better fit than a general-purpose model, as it is more likely to capture domain-specific terminology and semantics accurately.


In [11]:
from langchain_huggingface import HuggingFaceEmbeddings

# Load the embeddings model
model_name = "nasa-impact/nasa-smd-ibm-st-v2"
encode_kwargs = {"normalize_embeddings": True}
embedder = HuggingFaceEmbeddings(
    model_name=model_name, encode_kwargs=encode_kwargs
)

In [12]:
result = embedder.embed_query("hello")
print(len(result))

768


### Vector Store

Vector stores are specialized databases designed to efficiently index and retrieve information using vector representations of data. Vector stores leverage these dense representations, reducing the task of finding similar documents to a nearest-neighbor search in a high-dimensional space. The search compares the vector representation of the **query** with the vector representations of the **documents** in the database; documents whose vectors are closer to the query vector are considered more similar to the query.
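
Conceptually, this search can be sketched as an exhaustive scan that scores every stored vector against the query and keeps the best matches. The code below is a simplified illustration with made-up vectors and hypothetical helper names (`normalize`, `top_k`); production vector stores such as Qdrant rely on approximate indexes instead of scanning every vector.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def top_k(query_vector, document_vectors, k=2):
    """Return (doc_id, score) pairs for the k vectors closest to the query."""
    query = normalize(query_vector)
    scored = []
    for doc_id, vector in document_vectors.items():
        # For unit-length vectors, the dot product equals cosine similarity.
        score = sum(q * d for q, d in zip(query, normalize(vector)))
        scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up 3-dimensional vectors (assumption), standing in for real embeddings.
documents = {
    "sentinel-2": [0.9, 0.1, 0.2],
    "tropomi": [0.1, 0.9, 0.3],
    "copernicus": [0.6, 0.5, 0.4],
}
query = [0.8, 0.2, 0.1]

for doc_id, score in top_k(query, documents):
    print(doc_id, round(score, 3))
```

Normalizing the vectors up front, as `normalize_embeddings=True` does for the embedder in this notebook, lets the ranking use a plain dot product.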

Let's see how to create a Qdrant collection:

In [7]:
import os

from dotenv import load_dotenv
from qdrant_client import QdrantClient, models
from qdrant_client.http.models import Distance, PointStruct, VectorParams

# Load QDRANT_API_KEY and QDRANT_URL from a local .env file, if present
load_dotenv()

# Get your API key and cluster URL from the Qdrant UI
QDRANT_API_KEY = os.environ["QDRANT_API_KEY"]
QDRANT_URL = os.environ["QDRANT_URL"]

collection_name = "ingestion_demo"
vector_size = 768  # must match the embedding dimension printed above

client = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)
client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(
        size=vector_size,
        distance=Distance.COSINE,
        on_disk=True,  # keep the original vectors on disk
    ),
    shard_number=8,  # increase shards for large data
    on_disk_payload=True,  # keep payloads on disk as well
    quantization_config=models.BinaryQuantization(  # binary quantization for faster retrieval
        binary=models.BinaryQuantizationConfig(
            always_ram=False,
        ),
    ),
)

True
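
The collection above enables binary quantization, which replaces each floating-point component of a vector with a single bit. The sketch below illustrates only the general idea (keep the sign, compare bit patterns with Hamming distance) on made-up vectors; it is an assumption-laden toy, not Qdrant's actual implementation, and the helper names are hypothetical.

```python
def binarize(vector):
    """Keep only the sign of each component: 1 if positive, otherwise 0."""
    return [1 if x > 0 else 0 for x in vector]


def hamming_distance(bits_a, bits_b):
    """Count the positions where two bit vectors differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))


# Made-up 4-dimensional vectors (assumption), standing in for real embeddings.
stored = {
    "doc_a": [0.12, -0.40, 0.33, -0.05],
    "doc_b": [-0.20, 0.31, -0.27, 0.09],
}
query = [0.10, -0.35, 0.30, 0.02]

query_bits = binarize(query)
distances = {
    doc_id: hamming_distance(query_bits, binarize(vector))
    for doc_id, vector in stored.items()
}
print(distances)
```

Comparing bit patterns is much cheaper than comparing full floating-point vectors, which is where the retrieval speed-up comes from, at the cost of some precision.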

In [10]:
chunk_1 = """Sentinel-2 is a European mission that utilizes wide-swath, high-resolution, multi-spectral imaging. Dedicated to Europe’s Copernicus programme, 
            the mission supports operational applications primarily for land services, including the monitoring of vegetation, soil and water cover, 
            as well as the observation of inland waterways and coastal areas."""
chunk_2 = """TROPOMI (Tropospheric Monitoring Instrument) is a cutting-edge satellite instrument aboard the European Copernicus Sentinel-5 Precursor (S5P) satellite, launched in October 2017.
            It plays an essential role in gathering data that helps scientists to better understand atmospheric processes and environmental changes. 
            TROPOMI’s high-resolution data contributes to global efforts in monitoring air quality, tracking climate trends, and protecting the ozone layer, 
            making it a key tool for advancing environmental science and policy worldwide."""
chunk_3 = """Copernicus is the European Union's flagship Earth observation initiative, designed to provide valuable information services to citizens and organizations across the EU. 
            It utilizes a combination of satellite Earth observation and in-situ (non-space) data to monitor and understand our planet and its environment."""
chunks = [chunk_1, chunk_2, chunk_3]
print(len(chunks))

3


We can now embed them as a single batch and upload them to the Qdrant collection we just created:

In [19]:
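# Embed all chunks in one call; the result is one vector per chunk
batch_vectors = embedder.embed_documents(chunks)

# Pair each vector with an id and the original text as payload
points = [
    PointStruct(
        id=i,
        vector=vec,
        payload={"content": chunks[i]},
    )
    for i, vec in enumerate(batch_vectors)
]

# Upload the points to the collection, retrying failed requests up to 3 times
client.upload_points(
    collection_name=collection_name,
    points=points,
    parallel=10,
    max_retries=3,
)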
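
With only three chunks, a single call to the embedder is enough. For a larger corpus it may be more practical to embed in fixed-size batches while keeping point ids consistent. The sketch below shows one way to do that; `batched` and `fake_embed` are hypothetical helpers, with `fake_embed` standing in for `embedder.embed_documents` so that the example runs without a model, and plain dictionaries standing in for `PointStruct`.

```python
def batched(items, batch_size):
    """Yield (offset, batch) pairs so that ids stay stable across batches."""
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]


def fake_embed(texts):
    """Hypothetical stand-in for embedder.embed_documents (assumption)."""
    return [[float(len(text))] for text in texts]


chunks = ["chunk a", "chunk bb", "chunk ccc", "chunk dddd", "chunk eeeee"]

points = []
for offset, batch in batched(chunks, batch_size=2):
    vectors = fake_embed(batch)
    for i, (text, vec) in enumerate(zip(batch, vectors)):
        # The id is the chunk's position in the full list, not in the batch.
        points.append({"id": offset + i, "vector": vec, "payload": {"content": text}})

print([point["id"] for point in points])
```

Carrying the batch offset into the id keeps each point tied to its position in the original chunk list, which avoids overwriting points from earlier batches.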

With the chunks uploaded, the collection is ready to be queried. In the next notebook, we will see how to retrieve the most relevant chunks for a given user query.