

# Multimodal RAG with Elasticsearch: The Gotham City Case



This notebook implements the Multimodal RAG (Retrieval-Augmented Generation) pipeline with Elasticsearch as described in the blog. We follow the same structure as the article, with each section explained and implemented in code.

## Environment Setup

First, we need to clone the repository that contains the complete project code.

In [1]:
# Clone the repository on the feature/multimodal-rag-gotham branch
!git clone -b feature/multimodal-rag-gotham https://github.com/salgado/elasticsearch-labs.git

Cloning into 'elasticsearch-labs'...
remote: Enumerating objects: 4343, done.
remote: Counting objects: 100% (688/688), done.
remote: Compressing objects: 100% (239/239), done.
remote: Total 4343 (delta 546), reused 458 (delta 449), pack-reused 3655 (from 1)
Receiving objects: 100% (4343/4343), 98.51 MiB | 40.58 MiB/s, done.
Resolving deltas: 100% (2431/2431), done.


In [2]:
import getpass

Let's navigate to the project directory where the necessary files are located:


In [3]:
cd elasticsearch-labs/supporting-blog-content/building-multimodal-rag-with-elasticsearch-gotham

/Users/jessgarson/elasticsearch-labs/supporting-blog-content/building-multimodal-rag-with-elasticsearch-gotham/notebook/elasticsearch-labs/supporting-blog-content/building-multimodal-rag-with-elasticsearch-gotham




Now let's configure the environment variables needed to connect to Elasticsearch and OpenAI. This is necessary for indexing and searching content, as well as generating the final report.


In [4]:
ELASTICSEARCH_URL = input("Enter the Elasticsearch endpoint url: ")
ELASTICSEARCH_API_KEY = getpass.getpass("Enter the Elasticsearch API key: ")
OPENAI_API_KEY = getpass.getpass("Enter the OpenAI API key: ")

Enter the Elasticsearch endpoint url:  https://getting-started.es.us-east4.gcp.elastic-cloud.com
Enter the Elasticsearch API key:  ········
Enter the OpenAI API key:  ········


In [5]:
import os

os.environ["ELASTICSEARCH_API_KEY"] = ELASTICSEARCH_API_KEY
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
os.environ["ELASTICSEARCH_URL"] = ELASTICSEARCH_URL
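
Before running the later stages, it can help to confirm that all three settings are present. The following is a minimal sketch that assumes only the variable names used above; `missing_settings` is a hypothetical helper and is not part of the repository.

```python
import os

# Names taken from the cells above.
REQUIRED_SETTINGS = ("ELASTICSEARCH_URL", "ELASTICSEARCH_API_KEY", "OPENAI_API_KEY")


def missing_settings(env=None):
    """Return the required setting names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]


missing = missing_settings()
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All settings are present.")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try without touching the real environment.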


## Installing Dependencies

As mentioned in the blog, we need to install the specific dependencies, including the custom ImageBind fork:


In [6]:
# Install base dependencies (quote the version specifiers so the shell
# does not interpret ">" as an output redirect)
!pip install "torch>=2.1.0" "torchvision>=0.16.0" "torchaudio>=2.1.0"
!pip install opencv-python-headless pillow numpy

# Install the specific ImageBind fork
!pip install git+https://github.com/hkchengrex/ImageBind.git

Collecting git+https://github.com/hkchengrex/ImageBind.git
  Cloning https://github.com/hkchengrex/ImageBind.git to /private/var/folders/z9/dz5wy_nd4_v1_gc8dg_5krqr0000gn/T/pip-req-build-4_8958wu
  Running command git clone --filter=blob:none --quiet https://github.com/hkchengrex/ImageBind.git /private/var/folders/z9/dz5wy_nd4_v1_gc8dg_5krqr0000gn/T/pip-req-build-4_8958wu
  Resolved https://github.com/hkchengrex/ImageBind.git to commit 9989650c87d393d7e8c144194182cbf124cd03a0
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting pytorchvideo@ git+https://github.com/facebookresearch/pytorch

In [7]:
!pip -q install elasticsearch


In [8]:
!pip install python-dotenv


In [17]:
!pip install openai


In [9]:
!pip install soundfile


## Stage 1 - Collecting Crime Scene Clues

As explained in the blog, the first step is to verify that we have the correct directory structure and that the evidence files are present. We use `files_check.py` for this.

In [10]:
!python stages/01-stage/files_check.py

All files are correctly organized!
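
To illustrate the kind of check involved, here is a minimal sketch of a directory check. It is not the contents of `files_check.py`: the `find_missing` helper is hypothetical, and the file list only includes paths that appear in the search output later in this notebook, so the real script may verify more files.

```python
from pathlib import Path

# Paths that appear in the search output later in this notebook; the real
# script may check a different or longer list (assumption).
EXPECTED_FILES = (
    "data/images/crime_scene1.jpg",
    "data/audios/joker_laugh.wav",
    "data/texts/riddle.txt",
)


def find_missing(base_dir, expected=EXPECTED_FILES):
    """Return the expected relative paths that do not exist under base_dir."""
    base = Path(base_dir)
    return [rel for rel in expected if not (base / rel).is_file()]


missing = find_missing(".")
if missing:
    print("Missing files:", missing)
else:
    print("All expected files were found.")
```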


## Stage 2 - Generating Embeddings with ImageBind

Now we test embedding generation for an image with ImageBind. As the blog explains, ImageBind generates embeddings for different modalities (image, audio, text) in a shared vector space.


In [11]:
!python stages/02-stage/test_embedding_generation.py

INFO:embedding_generator:Testing model with sample input...
INFO:embedding_generator:🤖 ImageBind model initialized successfully
(1024,)


This script generates a 1024-dimensional embedding for a test image, confirming that the ImageBind model is working correctly.
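
Because every modality lands in the same vector space, two embeddings can be compared directly, typically with cosine similarity. The sketch below is pure Python and uses made-up 3-dimensional vectors in place of the 1024-dimensional ImageBind output; it only illustrates the comparison and is not code from the repository.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors for illustration only (not real embeddings).
image_vec = [0.9, 0.1, 0.0]
audio_vec = [0.8, 0.2, 0.1]
print(round(cosine_similarity(image_vec, audio_vec), 4))
```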



## Stage 3 - Storage and Search in Elasticsearch

### Content Indexing

The next step is to index all multimodal evidence in Elasticsearch. This includes images, audio, text, and depth maps as described in the blog.

In [13]:
!python stages/03-stage/index_all_modalities.py

INFO:embedding_generator:Testing model with sample input...
INFO:embedding_generator:🤖 ImageBind model initialized successfully
INFO:elastic_transport.transport:HEAD https://getting-started.es.us-east4.gcp.elastic-cloud.com:443/multimodal_content [status:200 duration:0.133s]
INFO:elastic_transport.transport:POST https://getting-started.es.us-east4.gcp.elastic-cloud.com:443/multimodal_content/_doc [status:201 duration:0.053s]
INFO:__main__:

Indexed vision: {
  "result": "created",
  "_id": "9BkUSZUBvmLH5RQPHhhg",
  "_index": "multimodal_content"
}
INFO:elastic_transport.transport:POST https://getting-started.es.us-east4.gcp.elastic-cloud.com:443/multimodal_content/_doc [status:201 duration:0.033s]
INFO:__main__:

Indexed vision: {
  "result": "created",
  "_id": "9RkUSZUBvmLH5RQPIBh2",
  "_index": "multimodal_content"
}
INFO:elastic_transport.transport:POST https://getting-started.es.us-east4.gcp.elastic-cloud.com:443/multimodal_content/_doc [status:201 duration:0.030s]
INFO:__main__:



Each piece of evidence is now indexed in Elasticsearch together with its embedding, which enables similarity search.
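
For readers who want to picture what gets stored, here is a sketch of a possible index mapping and document shape. The index name `multimodal_content` and the 1024 dimensions come from the output above; the field names (`embedding`, `modality`, `description`, `file_path`) and the `build_document` helper are assumptions for illustration and may differ from what `index_all_modalities.py` uses.

```python
EMBEDDING_DIMS = 1024  # matches the (1024,) shape printed in Stage 2

# Hypothetical mapping: field names are assumptions, not taken from the repo.
INDEX_MAPPINGS = {
    "properties": {
        "embedding": {
            "type": "dense_vector",
            "dims": EMBEDDING_DIMS,
            "index": True,
            "similarity": "cosine",
        },
        "modality": {"type": "keyword"},
        "description": {"type": "text"},
        "file_path": {"type": "keyword"},
    }
}


def build_document(embedding, modality, description, file_path):
    """Assemble one evidence document, validating the vector length."""
    if len(embedding) != EMBEDDING_DIMS:
        raise ValueError(f"expected {EMBEDDING_DIMS} dimensions, got {len(embedding)}")
    return {
        "embedding": list(embedding),
        "modality": modality,
        "description": description,
        "file_path": file_path,
    }
```

With the 8.x Python client, a mapping like this could be applied with `client.indices.create(index="multimodal_content", mappings=INDEX_MAPPINGS)` and a document stored with `client.index(index="multimodal_content", document=doc)`.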

### Searching by Similarity Across Different Modalities

Now we can test searching for evidence by similarity using different modalities as queries. The blog describes how an input from one modality can retrieve results from all modalities.
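
Cross-modal retrieval works because the query is just a vector: whichever modality produced it, Elasticsearch compares it against every stored embedding. A minimal sketch of a kNN search body follows. The `embedding` field name and the `build_knn_search` helper are assumptions; the search scripts in the repository may build their queries differently.

```python
def build_knn_search(query_vector, k=3, num_candidates=100, field="embedding"):
    """Build a kNN search body for a dense_vector field.

    `field` is an assumed name; use whatever field holds the embeddings.
    """
    if k > num_candidates:
        raise ValueError("k must not exceed num_candidates")
    return {
        "knn": {
            "field": field,
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
        },
        "size": k,
    }


# Toy 3-dimensional vector for illustration; real queries use 1024 dimensions.
body = build_knn_search([0.12, 0.34, 0.56])
print(body["knn"]["k"], body["knn"]["num_candidates"])
```

With the 8.x Python client, the `knn` part can be passed as `client.search(index="multimodal_content", knn=body["knn"])`. No modality filter is applied here, so hits of any type can come back for any query.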

#### Search by Audio


In [14]:
!python stages/03-stage/search_by_audio.py

INFO:embedding_generator:Testing model with sample input...
INFO:embedding_generator:🤖 ImageBind model initialized successfully
INFO:elastic_transport.transport:HEAD https://getting-started.es.us-east4.gcp.elastic-cloud.com:443/multimodal_content [status:200 duration:0.183s]
INFO:elastic_transport.transport:POST https://getting-started.es.us-east4.gcp.elastic-cloud.com:443/multimodal_content/_search [status:200 duration:0.101s]

🔎 Similar evidence found:

1. A sinister laugh captured near the crime scene (audio)
   Similarity: 0.9987
   File path: data/audios/joker_laugh.wav

2. A sinister laugh captured near the crime scene (audio)
   Similarity: 0.9987
   File path: data/audios/joker_laugh.wav

3. A sinister laugh captured near the crime scene (audio)
   Similarity: 0.9987
   File path: data/audios/joker_laugh.wav




This command uses an audio file as the query and retrieves the most similar evidence. In the Gotham case, this helps connect the audio of a sinister laugh to other evidence.
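
The three hits above share the same file path and score, which suggests the same file was indexed more than once (for example, by running the indexing script several times). One way to tidy such results on the client side is to keep only the best hit per file. The sketch below assumes hits shaped like Elasticsearch search hits (`_id`, `_score`, `_source`) and a `file_path` source field, which is an assumed name; `dedupe_hits` is a hypothetical helper.

```python
def dedupe_hits(hits, key="file_path"):
    """Keep the highest-scoring hit per value of `key`, best score first."""
    best = {}
    for hit in hits:
        # Fall back to the document id when the key is absent.
        identity = hit["_source"].get(key, hit.get("_id"))
        current = best.get(identity)
        if current is None or hit["_score"] > current["_score"]:
            best[identity] = hit
    return sorted(best.values(), key=lambda h: h["_score"], reverse=True)


# Hits modeled on the output above; the ids are placeholders.
hits = [
    {"_id": "a", "_score": 0.9987, "_source": {"file_path": "data/audios/joker_laugh.wav"}},
    {"_id": "b", "_score": 0.9987, "_source": {"file_path": "data/audios/joker_laugh.wav"}},
    {"_id": "c", "_score": 0.9987, "_source": {"file_path": "data/audios/joker_laugh.wav"}},
]
print(len(dedupe_hits(hits)))
```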

#### Search by Text

In [15]:
!python stages/03-stage/search_by_text.py

INFO:embedding_generator:Testing model with sample input...
INFO:embedding_generator:🤖 ImageBind model initialized successfully
INFO:elastic_transport.transport:HEAD https://getting-started.es.us-east4.gcp.elastic-cloud.com:443/multimodal_content [status:200 duration:0.091s]
INFO:elastic_transport.transport:POST https://getting-started.es.us-east4.gcp.elastic-cloud.com:443/multimodal_content/_search [status:200 duration:0.168s]

🔎 Similar evidence found:

1. Mysterious note found at the location (text)
   Similarity: 0.7639
   File path: data/texts/riddle.txt

2. Mysterious note found at the location (text)
   Similarity: 0.7589
   File path: data/texts/riddle.txt

3. Mysterious note found at the location (text)
   Similarity: 0.7589
   File path: data/texts/riddle.txt




Here we use a text query ("Why so serious?") to find related evidence.

#### Search by Image


In [18]:
!python stages/03-stage/search_by_image.py

INFO:embedding_generator:Testing model with sample input...
INFO:embedding_generator:🤖 ImageBind model initialized successfully
INFO:elastic_transport.transport:HEAD https://getting-started.es.us-east4.gcp.elastic-cloud.com:443/multimodal_content [status:200 duration:0.152s]
INFO:elastic_transport.transport:POST https://getting-started.es.us-east4.gcp.elastic-cloud.com:443/multimodal_content/_search [status:200 duration:0.081s]

🔎 Similar evidence found:

1. Photo of the crime scene: A dark, rain-soaked alley is filled with playing cards, while a sinister graffiti of the Joker laughing stands out on the brick wall. (vision)
   Similarity: 0.8258
   File path: data/images/crime_scene1.jpg

2. Photo of the crime scene: A dark, rain-soaked alley is filled with playing cards, while a sinister graffiti of the Joker laughing stands out on the brick wall. (vision)
   Similarity: 0.8258
   File path: data/images/crime_scene1.jpg

3. Photo of the crime scene: A dark, rain-soaked alley is filled

This script uses an image from the crime scene to find similar visual evidence.

#### Search by Depth Map


In [16]:
!python stages/03-stage/search_by_depth.py

INFO:embedding_generator:Testing model with sample input...
INFO:embedding_generator:🤖 ImageBind model initialized successfully
INFO:elastic_transport.transport:HEAD https://getting-started.es.us-east4.gcp.elastic-cloud.com:443/multimodal_content [status:200 duration:0.088s]
INFO:elastic_transport.transport:POST https://getting-started.es.us-east4.gcp.elastic-cloud.com:443/multimodal_content/_search [status:200 duration:0.095s]

🔎 Similar evidence found:

1. Photo of the crime scene: A dark, rain-soaked alley is filled with playing cards, while a sinister graffiti of the Joker laughing stands out on the brick wall. (vision)
   Similarity: 0.5053
   File path: data/images/crime_scene1.jpg

2. Photo of the crime scene: A dark, rain-soaked alley is filled with playing cards, while a sinister graffiti of the Joker laughing stands out on the brick wall. (vision)
   Similarity: 0.5053
   File path: data/images/crime_scene1.jpg

3. Photo of the crime scene: A dark, rain-soaked alley is filled

As explained in the blog, depth maps can provide information about the 3D structure of the scene or objects, complementing the other modalities.

## Stage 4 - Evidence Analysis with LLM

Finally, we bring together all the retrieved evidence and use an LLM (GPT-4) to generate a forensic report that identifies the suspect based on the connections between the different modalities.


In [17]:
!python stages/04-stage/rag_crime_analyze.py

INFO:embedding_generator:Testing model with sample input...
INFO:embedding_generator:🤖 ImageBind model initialized successfully
INFO:elastic_transport.transport:HEAD https://getting-started.es.us-east4.gcp.elastic-cloud.com:443/multimodal_content [status:200 duration:0.072s]
INFO:__main__:✅ All components initialized successfully
INFO:__main__:🔍 Collecting evidence...
INFO:elastic_transport.transport:POST https://getting-started.es.us-east4.gcp.elastic-cloud.com:443/multimodal_content/_search [status:200 duration:0.095s]
INFO:__main__:✅ Data retrieved for vision: 2 results
INFO:elastic_transport.transport:POST https://getting-started.es.us-east4.gcp.elastic-cloud.com:443/multimodal_content/_search [status:200 duration:0.028s]
INFO:__main__:✅ Data retrieved for audio: 2 results
INFO:elastic_transport.transport:POST https://getting-started.es.us-east4.gcp.elastic-cloud.com:443/multimodal_content/_search [status:200 duration:0.024s]
INFO:__main__:✅ Data retrieved for text: 2 results
INFO:


This is the final step of the Multimodal RAG pipeline, where the LLM analyzes the evidence retrieved from Elasticsearch and synthesizes it into a coherent report that identifies the Joker as the main suspect.
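
To make the retrieval-to-generation handoff concrete, here is a sketch of how retrieved evidence could be turned into a prompt. The prompt wording, the input shape, and the `build_forensic_prompt` helper are all assumptions for illustration; they are not taken from `rag_crime_analyze.py`.

```python
def build_forensic_prompt(evidence_by_modality):
    """Format retrieved evidence, grouped by modality, into one prompt string."""
    lines = ["You are a forensic analyst reviewing evidence from a crime scene.", ""]
    for modality, items in evidence_by_modality.items():
        lines.append(f"{modality.upper()} evidence:")
        for item in items:
            lines.append(f"- {item['description']} (similarity {item['score']:.4f})")
        lines.append("")
    lines.append(
        "Identify the most likely suspect and explain which evidence supports the conclusion."
    )
    return "\n".join(lines)


# Descriptions and scores copied from the search output earlier in this notebook.
evidence = {
    "audio": [{"description": "A sinister laugh captured near the crime scene", "score": 0.9987}],
    "text": [{"description": "Mysterious note found at the location", "score": 0.7639}],
}
print(build_forensic_prompt(evidence))
```

The resulting string would then be sent to the LLM as the user message. Grouping by modality keeps it visible to the model which kind of evidence each line came from.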

## Conclusion

This completes the Multimodal RAG pipeline with Elasticsearch, following the steps described in the blog. The pipeline shows how different types of media can be analyzed together to surface connections between pieces of evidence that would be difficult to identify manually.
