# How to implement Image search using Elasticsearch

This notebook shows how to implement image search using Elasticsearch. You will index documents with image embeddings (either generated here or pre-generated) and then use an NLP model to search them with a natural language description of the image.

### Prerequisites
Before you start, make sure you have an Elasticsearch cluster running. The cluster must have at least one machine learning (ML) node with enough memory (4GB).
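
As a rough way to check the ML node requirement, the sketch below inspects a node info response (for example, the body returned by `es.nodes.info()` once you are connected) and lists the nodes whose roles include `ml`. The helper name `ml_node_names` and the sample response are illustrative assumptions, not part of the original notebook.

```python
def ml_node_names(nodes_info):
    """Return the sorted names of nodes whose roles include 'ml'."""
    nodes = nodes_info.get("nodes", {})
    return sorted(
        node.get("name", node_id)
        for node_id, node in nodes.items()
        if "ml" in node.get("roles", [])
    )


# Hypothetical, trimmed-down node info response used only for illustration.
sample = {
    "nodes": {
        "abc": {"name": "instance-0000000000", "roles": ["data_hot", "master"]},
        "def": {"name": "instance-0000000001", "roles": ["ml", "remote_cluster_client"]},
    }
}

names = ml_node_names(sample)
if names:
    print("ML nodes found:", names)
else:
    print("No ML node found; the NLP model cannot be deployed.")
```

An empty list means the model deployment step later in the notebook would have nowhere to run.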

### Install Python requirements
Before you start you need to install all required Python dependencies.

In [16]:
!pip install sentence-transformers eland elasticsearch transformers torch tqdm Pillow streamlit



### Upload NLP model for querying
Using the `eland` library, you will import the NLP CLIP model. The model will transform your search query into a vector, which is then used to search over the set of images stored in Elasticsearch.

The model used is [clip-ViT-B-32-multilingual-v1](https://huggingface.co/sentence-transformers/clip-ViT-B-32-multilingual-v1) because the image embeddings are also generated by the CLIP model.

How do you get the Cloud ID? Go to [ESS cloud](https://cloud.elastic.co/logout?redirectTo=%2Fhome&reason=unauthorised) and, on the deployment overview page, copy the Cloud ID.

Authentication uses an API key (`--es-api-key`). Learn how to generate an [API key](https://www.elastic.co/guide/en/kibana/current/api-keys.html#create-api-key).
```
$ eland_import_hub_model --cloud-id $CLOUD_ID  \
  --hub-model-id sentence-transformers/clip-ViT-B-32-multilingual-v1 \
  --task-type text_embedding --es-api-key $API_KEY --start
```

In [None]:
API_KEY=''
CLOUD_ID=''
!eland_import_hub_model --cloud-id $CLOUD_ID --hub-model-id sentence-transformers/clip-ViT-B-32-multilingual-v1 --task-type text_embedding --es-api-key $API_KEY --start
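
The query section later in this notebook refers to the uploaded model as `sentence-transformers__clip-vit-b-32-multilingual-v1`, while the import command uses the Hugging Face id. The sketch below shows the mapping between the two as it can be inferred from those two ids (slash replaced by a double underscore, lower-cased). The helper name is hypothetical, and eland's own naming rule may cover more cases than this.

```python
def hub_id_to_es_model_id(hub_model_id):
    """Approximate the Elasticsearch model id for a Hugging Face hub id.

    Assumption: inferred from the ids that appear in this notebook,
    not taken from eland's source code.
    """
    return hub_model_id.replace("/", "__").lower()


HUB_MODEL_ID = "sentence-transformers/clip-ViT-B-32-multilingual-v1"
print(hub_id_to_es_model_id(HUB_MODEL_ID))
```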

### Import Python libraries

In [17]:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk
import requests
import os
import sys
# import shutil
import zipfile
from tqdm.auto import tqdm
import pandas as pd
from PIL import Image
from sentence_transformers import SentenceTransformer
import urllib.request
# import urllib.error
import json
from getpass import getpass

### Connect to Elasticsearch cluster
Use your own cluster details `ELASTIC_CLOUD_ID`, `API_KEY`.

In [19]:
# ESS Cloud connection definition using an API_KEY
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")
API_KEY = getpass("Elastic API key: ")

# ELASTIC_CLOUD_USER = "elastic"
# ELASTIC_CLOUD_PASSWORD = getpass("Elastic Password")

es = Elasticsearch(
  cloud_id=ELASTIC_CLOUD_ID,
  #basic_auth=(ELASTIC_CLOUD_USER, ELASTIC_CLOUD_PASSWORD),
  api_key=API_KEY,
  request_timeout=600
)

es.info() # should return cluster info

Elastic Cloud ID: ··········
Elastic API key: ··········


ObjectApiResponse({'name': 'instance-0000000000', 'cluster_name': 'a597bbe1e0d047c494e7d4015f67ef37', 'cluster_uuid': 'EnT0vwwSSZeAahPw3Vhsuw', 'version': {'number': '8.8.2', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '98e1271edf932a480e4262a471281f1ee295ce6b', 'build_date': '2023-06-26T05:16:16.196344851Z', 'build_snapshot': False, 'lucene_version': '9.6.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

### Create Index and mappings for Images
Before you can index documents into Elasticsearch, you need to create an index with the correct mappings.

In [None]:
# Destination Index name
INDEX_NAME="images"
# If you want to delete previous version of the Index
DELETE_INDEX=False

INDEX_MAPPING = {
    "properties": {
      "image_embedding": {
        "type": "dense_vector",
        "dims": 512,
        "index": True,
        "similarity": "cosine"
      },
      "photo_id": {
        "type": "keyword"
      },
      "photo_image_url": {
        "type": "keyword"
      },
      "ai_description": {
        "type": "text"
      },
      "photo_description": {
        "type": "text"
      },
      "photo_url": {
        "type": "keyword"
      },
      "photographer_first_name": {
        "type": "keyword"
      },
      "photographer_last_name": {
        "type": "keyword"
      },
      "photographer_username": {
        "type": "keyword"
      },
      "exif_camera_make": {
        "type": "keyword"
      },
      "exif_camera_model": {
        "type": "keyword"
      },
      "exif_iso": {
        "type": "integer"
      }
    }
  }

# Index settings
INDEX_SETTINGS = {
    "index": {
      "number_of_replicas": "1",
      "number_of_shards": "1",
      "refresh_interval": "5s"
    }
}

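# Passing `ignore=` directly to an API call is deprecated in the 8.x Python client;
# transport options such as ignored status codes go through `.options()` instead.
if DELETE_INDEX:
  if es.indices.exists(index=INDEX_NAME):
    print("Deleting existing %s" % INDEX_NAME)
    es.options(ignore_status=[400, 404]).indices.delete(index=INDEX_NAME)

if not es.indices.exists(index=INDEX_NAME):
  print("Creating index %s" % INDEX_NAME)
  es.options(ignore_status=[400, 404]).indices.create(
      index=INDEX_NAME, mappings=INDEX_MAPPING, settings=INDEX_SETTINGS)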
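
The mapping above fixes `image_embedding` at 512 dimensions with `cosine` similarity, so a vector of a different length (or a zero vector, which has no direction for cosine similarity) would be rejected at index time. The following is a minimal sketch of a client-side check that could run before indexing; the function name `validate_embedding` is an assumption made for illustration.

```python
import math

EXPECTED_DIMS = 512  # matches "dims" in INDEX_MAPPING


def validate_embedding(vector, dims=EXPECTED_DIMS):
    """Return the vector unchanged, or raise ValueError if it cannot be indexed."""
    if len(vector) != dims:
        raise ValueError(f"expected {dims} dimensions, got {len(vector)}")
    if not all(math.isfinite(v) for v in vector):
        raise ValueError("embedding contains NaN or infinite values")
    if all(v == 0 for v in vector):
        raise ValueError("zero vector cannot be used with cosine similarity")
    return vector


# A vector of the right length passes; a short one is reported.
validate_embedding([0.1] * 512)
try:
    validate_embedding([0.1, 0.2, 0.3])
except ValueError as err:
    print("Rejected:", err)
```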


### Get image dataset and embeddings
Download:
- The example image dataset is from [Unsplash](https://github.com/unsplash/datasets).
- The [image embeddings](https://raw.githubusercontent.com/radoondas/flask-elastic-nlp/main/embeddings/images/image-embeddings.json.zip) are pre-generated using the CLIP model.

Then unzip both files.

In [None]:
# wget -O does not create missing directories, so make sure `data/` exists first
!mkdir -p data
!wget https://unsplash.com/data/lite/1.2.0 -O data/unsplash-research-dataset-lite-1.2.0.zip
!wget https://raw.githubusercontent.com/radoondas/flask-elastic-nlp/main/embeddings/images/image-embeddings.json.zip -P data

--2023-07-25 14:28:32--  https://unsplash.com/data/lite/1.2.0
Resolving unsplash.com (unsplash.com)... 151.101.65.181, 151.101.1.181, 151.101.129.181, ...
Connecting to unsplash.com (unsplash.com)|151.101.65.181|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://unsplash-datasets.s3.amazonaws.com/lite/1.2.0/unsplash-research-dataset-lite-1.2.0.zip [following]
--2023-07-25 14:28:32--  https://unsplash-datasets.s3.amazonaws.com/lite/1.2.0/unsplash-research-dataset-lite-1.2.0.zip
Resolving unsplash-datasets.s3.amazonaws.com (unsplash-datasets.s3.amazonaws.com)... 52.217.102.84, 3.5.25.253, 52.217.96.188, ...
Connecting to unsplash-datasets.s3.amazonaws.com (unsplash-datasets.s3.amazonaws.com)|52.217.102.84|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 632351052 (603M) [application/zip]
Saving to: ‘data/unsplash-research-dataset-lite-1.2.0.zip’


2023-07-25 14:29:16 (14.2 MB/s) - ‘data/unsplash-research-dataset-lite-1.2.0.zip’

In [None]:
# Unzip downloaded files
UNSPLASH_ZIP_FILE="data/unsplash-research-dataset-lite-1.2.0.zip"
EMBEDDINGS_ZIP_FILE="data/image-embeddings.json.zip"

with zipfile.ZipFile(UNSPLASH_ZIP_FILE, 'r') as zip_ref:
  print('Extracting file ', UNSPLASH_ZIP_FILE, '.')
  zip_ref.extractall('data/unsplash/')

with zipfile.ZipFile(EMBEDDINGS_ZIP_FILE, 'r') as zip_ref:
  print('Extracting file ', EMBEDDINGS_ZIP_FILE, '.')
  zip_ref.extractall("data/embeddings/")

Extracting file  data/unsplash-research-dataset-lite-1.2.0.zip .
Extracting file  data/image-embeddings.json.zip .


# Image embeddings
There are now two options for how to proceed.
1. Import all pre-generated image embeddings (~19k). This is the faster option and makes a lot of images available in a short time.
2. Import a small subset of randomly chosen images to see how image embeddings are generated using an external CLIP model.

In [None]:
# define helper function
def gen_rows(df):
  for doc in df.to_dict(orient='records'):
    yield doc
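
The helper above yields plain documents, so Elasticsearch assigns a random `_id` to each one and re-running the import creates duplicates. As one possible variation (an assumption, not what the original notebook does), the sketch below wraps each record in a bulk action that reuses `photo_id` as the document `_id`, which makes repeated imports overwrite rather than duplicate. The name `gen_actions` is hypothetical.

```python
def gen_actions(records, index_name, id_field="photo_id"):
    """Yield bulk actions that use a field of each record as the document id."""
    for record in records:
        yield {
            "_index": index_name,
            "_id": record[id_field],
            "_source": record,
        }


# Illustrative records; real ones would come from df.to_dict(orient='records').
records = [
    {"photo_id": "a1", "photo_url": "https://example.com/a1"},
    {"photo_id": "b2", "photo_url": "https://example.com/b2"},
]
for action in gen_actions(records, "images"):
    print(action["_id"], "->", action["_index"])
```

With a data frame, this generator could be passed as `actions=gen_actions(df.to_dict(orient='records'), INDEX_NAME)` in place of `gen_rows(df)`.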

## 1. Import all pregenerated image embeddings
This option lets you import ~19k documents with pre-generated image embeddings and metadata.

The process reads the files with image information, merges them, and indexes the result into Elasticsearch.

In [None]:
df_unsplash = pd.read_csv('data/unsplash/' + 'photos.tsv000', sep='\t', header=0)

# The following lines fix inconsistent/missing data by filling empty values.
# Assigning the result back avoids relying on chained `inplace=True` calls.
df_unsplash = df_unsplash.fillna({
    'photo_description': '',
    'ai_description': '',
    'photographer_first_name': '',
    'photographer_last_name': '',
    'photographer_username': '',
    'exif_camera_make': '',
    'exif_camera_model': '',
    'exif_iso': 0,
})
## end of fix

# read subset of columns from the original/downloaded dataset
df_unsplash_subset = df_unsplash[
        ['photo_id', 'photo_url', 'photo_image_url', 'photo_description', 'ai_description', 'photographer_first_name',
         'photographer_last_name', 'photographer_username', 'exif_camera_make', 'exif_camera_model', 'exif_iso']]

# read all pregenerated embeddings
df_embeddings = pd.read_json('data/embeddings/' + 'image-embeddings.json', lines=True)

df_merged = pd.merge(df_unsplash_subset, df_embeddings,
                         on='photo_id',
                         how='inner')

count = 0
for success, info in parallel_bulk(
        client=es,
        actions=gen_rows(df_merged),
        thread_count=5,
        chunk_size=1000,
        index=INDEX_NAME
):
  if success:
    count += 1
    if count % 1000 == 0:
      print('Indexed %s documents' % str(count), flush=True)
      sys.stdout.flush()
  else:
    print('Doc failed', info)

print('Indexed %s image embeddings documents' % str(count), flush=True)
sys.stdout.flush()
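
The `pd.merge(..., how='inner')` call in the cell above keeps only the photos that appear in both the metadata and the embeddings file. To make that behaviour concrete, here is a small plain-Python sketch of the same idea. The helper `inner_join` and the sample rows are illustrative, and the sketch assumes each `photo_id` appears at most once on the embeddings side.

```python
def inner_join(left_rows, right_rows, key):
    """Merge dictionaries from both lists that share the same key value."""
    right_by_key = {row[key]: row for row in right_rows}
    merged = []
    for row in left_rows:
        match = right_by_key.get(row[key])
        if match is not None:
            merged.append({**row, **match})
    return merged


metadata = [
    {"photo_id": "a1", "photo_url": "https://example.com/a1"},
    {"photo_id": "b2", "photo_url": "https://example.com/b2"},
]
embeddings = [{"photo_id": "a1", "image_embedding": [0.1, 0.2]}]

merged = inner_join(metadata, embeddings, "photo_id")
print(len(metadata) - len(merged), "photo(s) dropped because no embedding was found")
```

Comparing the row counts before and after the merge in the same way is a quick sanity check that the two downloaded files line up.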

## 2. Import a small number of random images
In this part you will randomly choose a small number of images and then generate image embeddings for them. The script will then index the documents into Elasticsearch.

In [None]:
NUMBER_OF_IMAGES=20
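# Note: the mapping cell earlier created the index named there ("images").
# Create "images-test" with INDEX_MAPPING as well if you want to run kNN search on it.
INDEX_NAME="images-test"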

df_unsplash = pd.read_csv('data/unsplash/' + 'photos.tsv000', sep='\t', header=0)

## start of fix
# The following lines fix inconsistent/missing data by filling empty values.
# Assigning the result back avoids relying on chained `inplace=True` calls.
df_unsplash = df_unsplash.fillna({
    'photo_description': '',
    'ai_description': '',
    'photographer_first_name': '',
    'photographer_last_name': '',
    'photographer_username': '',
    'exif_camera_make': '',
    'exif_camera_model': '',
    'exif_iso': 0,
})
## end of fix

df_unsplash_subset = df_unsplash[
        ['photo_id', 'photo_url', 'photo_image_url', 'photo_description', 'ai_description', 'photographer_first_name',
         'photographer_last_name', 'photographer_username', 'exif_camera_make', 'exif_camera_model', 'exif_iso']]

df_random_subset = df_unsplash_subset.sample(n=NUMBER_OF_IMAGES, replace=False)
# drop=True keeps the old row index from being indexed as an extra "index" field
df_random_subset = df_random_subset.reset_index(drop=True)

# Load model CLIP
img_model = SentenceTransformer('clip-ViT-B-32')

# new list of image documents for indexing into ES
lst = []
if not os.path.exists("data/images"):
  os.mkdir("data/images")

for index, row in df_random_subset.iterrows():
  #open image from url
  img_path = "data/images/" + row['photo_id']
  try:
    urllib.request.urlretrieve(row['photo_image_url'], img_path)
    print(row['photo_id'] + " " + row['photo_url'])
  except urllib.error.HTTPError as err:
    if err.code == 404:
      print('404 error: Image not found at {}'.format(row['photo_image_url']))
      # nothing was downloaded, so skip this image instead of trying to open it
      continue
    else:
      raise

  img = Image.open(img_path)
  # create doc
  doc = {}
  embedding = img_model.encode(img)
  doc['photo_id'] = row['photo_id']
  doc['image_embedding'] = embedding.tolist()
  lst.append(doc)
  # print(doc)

  # Image cleanup.
  # If file exists, delete it.
  if os.path.exists(img_path):
    os.remove(img_path)

# merge the freshly generated embeddings with the image metadata

df_merged = pd.merge(df_random_subset, pd.DataFrame(lst),
                         on='photo_id',
                         how='inner')
# print(df_merged)

count = 0
for success, info in parallel_bulk(
        client=es,
        actions=gen_rows(df_merged),
        thread_count=5,
        chunk_size=10,
        index=INDEX_NAME
):
  if success:
    count += 1
    if count % 10 == 0:
      print('Indexed %s documents' % str(count), flush=True)
      sys.stdout.flush()
  else:
    print('Doc failed', info)

print('Indexed %s image embeddings documents' % str(count), flush=True)
sys.stdout.flush()

### Query the image dataset
The next step is to run a query to search for images. The example query searches for `"model_text": "Valentine day flowers"` using the model we uploaded to Elasticsearch, `sentence-transformers__clip-vit-b-32-multilingual-v1`.

The process is a single query, even though it internally consists of two tasks. The first transforms your search text into a vector using the NLP model, and the second runs the vector search over the image dataset.
```
POST images/_search
{
  "knn": {
    "field": "image_embedding",
    "k": 5,
    "num_candidates": 10,
    "query_vector_builder": {
      "text_embedding": {
        "model_id": "sentence-transformers__clip-vit-b-32-multilingual-v1",
        "model_text": "Valentine day flowers"
      }
    }
  },
  "fields": [
    "photo_description",
    "ai_description",
    "photo_url"
  ],
  "_source": false
}
```
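
The same `knn` clause can be assembled in Python before it is handed to the client. The sketch below is one way to do that under stated assumptions: the function name `build_knn_query` and its defaults are illustrative, and the only rule it enforces is that `num_candidates` must not be smaller than `k`, which Elasticsearch requires.

```python
MODEL_ID = "sentence-transformers__clip-vit-b-32-multilingual-v1"


def build_knn_query(text, model_id=MODEL_ID, field="image_embedding",
                    k=5, num_candidates=100):
    """Build the `knn` section of a search request for a text query."""
    if not text.strip():
        raise ValueError("search text must not be empty")
    if num_candidates < k:
        raise ValueError("num_candidates must be greater than or equal to k")
    return {
        "field": field,
        "k": k,
        "num_candidates": num_candidates,
        "query_vector_builder": {
            "text_embedding": {
                "model_id": model_id,
                "model_text": text,
            }
        },
    }


query = build_knn_query("Valentine day flowers")
print(query["k"], query["num_candidates"])
```

The resulting dictionary has the same shape as the `query` variable passed to `es.search(knn=...)` in the next cell.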



In [35]:
# Search query
WHAT_ARE_YOU_LOOKING_FOR="Valentine day flowers"
INDEX_IM_EMBED="images"

source_fields = ["photo_description", "ai_description", "photo_url", "photo_image_url", "photographer_first_name",
                     "photographer_username", "photographer_last_name", "photo_id"]
query = {
  "field": "image_embedding",
  "k": 5,
  "num_candidates": 100,
  "query_vector_builder": {
    "text_embedding": {
      "model_id": "sentence-transformers__clip-vit-b-32-multilingual-v1",
      "model_text": WHAT_ARE_YOU_LOOKING_FOR
    }
  }
}

response = es.search(
    index=INDEX_IM_EMBED,
    fields=source_fields,
    knn=query, source=False)

print(response.body)

# the code writes the response into a file for the streamlit UI used in the optional step.
with open('json_data.json', 'w') as outfile:
  json.dump(response.body['hits']['hits'], outfile)

# Pass the hits into a pandas DataFrame
dfr = pd.DataFrame(response.body['hits']['hits'])

# Flatten the nested `fields` object into columns
results = pd.json_normalize(response.body['hits']['hits'])
# results
results[['_id', '_score', 'fields.photo_id', 'fields.photo_image_url',
         'fields.photo_description', 'fields.photographer_first_name',
         'fields.photographer_last_name', 'fields.ai_description',
         'fields.photo_url']]

{'took': 114, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 5, 'relation': 'eq'}, 'max_score': 0.646751, 'hits': [{'_index': 'images', '_id': 'nK5Fh4kBLg4Kd5ySLbKC', '_score': 0.646751, 'fields': {'photo_image_url': ['https://images.unsplash.com/photo-1587404787163-d03a28fcc1f0'], 'photo_id': ['gQFZxLe3m4g'], 'photographer_first_name': ['Vadim'], 'photo_description': ['instagram.com/vadimsadovski'], 'photographer_last_name': ['Sadovski'], 'photo_url': ['https://unsplash.com/photos/gQFZxLe3m4g'], 'photographer_username': ['vadimsadovski'], 'ai_description': ['']}}, {'_index': 'images', '_id': 'Xa5Eh4kBLg4Kd5yS84Qf', '_score': 0.64675057, 'fields': {'photo_image_url': ['https://images.unsplash.com/photo-1574259148543-dd376a61339f'], 'photo_id': ['g5Mhx29yp-A'], 'photographer_first_name': ['Erin'], 'photo_description': ['Cute but grumpy cat in the Austrian mountains'], 'photographer_last_name': ['East'], 'photo_url': [

| | _id | _score | fields.photo_id | fields.photo_image_url | fields.photo_description | fields.photographer_first_name | fields.photographer_last_name | fields.ai_description | fields.photo_url |
|---|---|---|---|---|---|---|---|---|---|
| 0 | nK5Fh4kBLg4Kd5ySLbKC | 0.646751 | [gQFZxLe3m4g] | [https://images.unsplash.com/photo-15874047871... | [instagram.com/vadimsadovski] | [Vadim] | [Sadovski] | [] | [https://unsplash.com/photos/gQFZxLe3m4g] |
| 1 | Xa5Eh4kBLg4Kd5yS84Qf | 0.646751 | [g5Mhx29yp-A] | [https://images.unsplash.com/photo-15742591485... | [Cute but grumpy cat in the Austrian mountains] | [Erin] | [East] | [brown Persian cat on white bench] | [https://unsplash.com/photos/g5Mhx29yp-A] |
| 2 | 265Eh4kBLg4Kd5yS84Uf | 0.642449 | [9KZ0PGNCxNE] | [https://images.unsplash.com/photo-15657721010... | [] | [Perchek] | [Industrie] | [siamese cat] | [https://unsplash.com/photos/9KZ0PGNCxNE] |
| 3 | xq5Fh4kBLg4Kd5ySEpuC | 0.64216 | [VOZQkkKXvY4] | [https://images.unsplash.com/photo-15899942053... | [] | [Andrey] | [Svistunov] | [orange tabby cat on ground covered with snow ... | [https://unsplash.com/photos/VOZQkkKXvY4] |
| 4 | WK5Eh4kBLg4Kd5yS5XcD | 0.641853 | [wmz8y6L6c_k] | [https://images.unsplash.com/photo-1556820161-... | [] | [Phillip] | [Suitcases] | [brown and black kitten close-up photography] | [https://unsplash.com/photos/wmz8y6L6c_k] |
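
Because the `image_embedding` field is mapped with `cosine` similarity, the `_score` of a kNN hit is documented by Elasticsearch as `(1 + cosine) / 2`, which keeps scores in the range 0 to 1. The short sketch below converts between the two so the scores in the table are easier to interpret; the function names are illustrative.

```python
def score_to_cosine(score):
    """Convert a kNN `_score` (cosine similarity mapping) back to the raw cosine."""
    return 2.0 * score - 1.0


def cosine_to_score(cosine):
    """Convert a raw cosine similarity to the `_score` Elasticsearch reports."""
    return (1.0 + cosine) / 2.0


top_score = 0.646751  # `_score` of the first hit in the response above
print("cosine similarity of the top hit:", round(score_to_cosine(top_score), 6))
```

A score around 0.65 therefore corresponds to a cosine of roughly 0.29 between the text vector and the image vector.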


# [Optional] Simple streamlit UI
In the following section, you will view the response in a simple UI for better visualisation.

The query in the previous step wrote the response to a file, `json_data.json`, for the UI to load and visualise.

Follow the steps below to see the results in a table.

### Install tunnel library

In [12]:
!npm install localtunnel

npm WARN saveError ENOENT: no such file or directory, open '/content/package.json'
npm notice created a lockfile as package-lock.json. You should commit this file.
npm WARN enoent ENOENT: no such file or directory, open '/content/package.json'
npm WARN content No description
npm WARN content No repository field.
npm WARN content No README data
npm WARN content No license field.

+ localtunnel@2.0.2
added 22 packages from 22 contributors and audited 22 packages in 5.903s

3 packages are looking for funding
  run `npm fund` for details

found 0 vulnerabilities

### Create application

In [36]:
%%writefile app.py

import streamlit as st
import json
import pandas as pd


def get_image_preview(image_url):
    """Returns an HTML <img> tag with preview of the image."""
    return f"""<img src="{image_url}" width="400" />"""


def get_url_link(photo_url):
    """Returns an HTML <a> tag to the image page."""
    return f"""<a href="{photo_url}"  target="_blank"> {photo_url} </a>"""


def main():
    """Creates a Streamlit app with a table of images."""
    # use a context manager so the file handle is closed after loading
    with open("json_data.json") as infile:
        data = json.load(infile)
    table = []
    for image in data:
        image_url = image["fields"]["photo_image_url"][0]
        image_preview = get_image_preview(image_url)
        photo_url = image["fields"]["photo_url"][0]
        photo_url_link = get_url_link(photo_url)
        table.append([image_preview, image["fields"]["photo_id"][0],
                      image["fields"]["photographer_first_name"][0],
                      image["fields"]["photographer_last_name"][0],
                      image["fields"]["photographer_username"][0],
                      photo_url_link])

    st.write(pd.DataFrame(table, columns=["Image", "ID", "First Name", "Last Name",
                                          "Photographer username", "Photo url"]).to_html(escape = False),
             unsafe_allow_html=True)


if __name__ == "__main__":
    main()



Overwriting app.py
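
The app above indexes `image["fields"][...][0]` directly and places the URL into HTML rendered with `unsafe_allow_html=True`. As a hedged sketch of a more defensive variant (not the notebook's own code), the helpers below fall back to a default when a field is absent from a hit and escape the URL before it is embedded. The name `first_field` is hypothetical.

```python
import html


def first_field(hit, name, default=""):
    """Return the first value of a field in a search hit, or a default."""
    values = hit.get("fields", {}).get(name) or [default]
    return values[0]


def get_image_preview(image_url):
    """Return an HTML <img> tag with the URL escaped for use in an attribute."""
    safe_url = html.escape(image_url, quote=True)
    return f'<img src="{safe_url}" width="400" />'


# Illustrative hit with one field present and one missing.
hit = {"fields": {"photo_image_url": ["https://example.com/a.jpg?w=1&h=2"]}}
print(get_image_preview(first_field(hit, "photo_image_url")))
print(repr(first_field(hit, "photo_description")))
```

These functions could replace the direct `[0]` lookups and the `get_image_preview` helper in `app.py` without changing the layout of the table.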


### Run app
Run the application and check your IP for the tunneling

In [37]:
!streamlit run app.py &>/content/logs.txt & curl ipv4.icanhazip.com

34.138.156.22


### Create the tunnel
Run the tunnel and use the link below to connect to the tunnel.

Use the IP from the previous step to connect to the application.

In [38]:
!npx localtunnel --port 8501

npx: installed 22 in 2.186s
your url is: https://nine-facts-act.loca.lt
^C


# Resources

Blog: https://www.elastic.co/blog/implement-image-similarity-search-elastic

GitHub: https://github.com/radoondas/flask-elastic-image-search
