## HNSW (Hierarchical Navigable Small World)
This notebook demonstrates how to use the `hnswlib` library to run efficient approximate **k-nearest neighbors (k-NN)** searches over large sets of embeddings, specifically embeddings of images and of academic papers.

### 1. Initialization and Setup
This section installs the necessary Python libraries, clones a repository containing embedding utilities *(EmbedX)*, and installs its dependencies.

In [None]:
!pip install gdown hnswlib
!git clone https://github.com/huynguyen6906/EmbedX.git
!pip install -r EmbedX/requirements.txt
!cd EmbedX && pip install .

In [1]:
import hnswlib
import gdown
import h5py
import clip
import embedx
import json
import torch
from sentence_transformers import SentenceTransformer
import os
import numpy as np



### 2. Download Cache
This section checks for the existence of cached files containing pre-computed embeddings and HNSW indices. If any are missing, it uses `gdown` to download them from Google Drive using their IDs.

In [2]:
# Ensure the local directory for caching files exists.
os.makedirs(".cache", exist_ok=True)

# Cached file name -> Google Drive file ID.
cache_files = {
    "Papers_Embedbed_0-1000000.h5": "1hMNur1U24ULce6osanfrePuCAz5uIvc3",
    "hnsw_paper_index.bin": "1RR1o_eX9LKv_3pd9Iz1y4xB90pQIpUuX",
    "hnsw_paper_index.json": "1H_BNLy9fXYwaOCXvZBRawN8S0w_KuGIK",
    "Image_Embedded.h5": "12R_DHBMOVFVEaPrjBEQXOd7tb14Isb1D",
    "hnsw_image_index.bin": "1vWbJFLTdFnT6N2cnpMDA7c85JCfrY-QH",
    "hnsw_image_index.json": "16F8VRuKZb4SJ0KLa1Oh281RvdmQzj6Ft",
}

for name, file_id in cache_files.items():
    path = f".cache/{name}"
    if not os.path.isfile(path):
        # If not present, download the file using its Google Drive ID.
        gdown.download(id=file_id, output=path, quiet=False)

### 3. Search Image by HNSW
This section loads the image embedding data and the HNSW index configuration to perform image search based on text queries.

Reading image URLs and their corresponding embedding vectors from the `.h5` file.

In [3]:
with h5py.File(".cache/Image_Embedded.h5", "r") as f:
  image_urls = f["urls"][:]
  image_embeddings = f["embeddings"][:]

Reading the essential configuration parameters (`dim`, `max_element`, `ef`) for the HNSW index from the JSON file.

In [4]:
image_index_config = ".cache/hnsw_image_index.json"
dim_image = 0
max_element_image = 0
ef_image = 0
with open(image_index_config, 'r') as f:
    data = json.load(f)
    dim_image = data[0]
    max_element_image = data[1]
    ef_image = data[2]
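
The positional layout `[dim, max_element, ef]` is easy to misread. A minimal sketch, assuming the JSON file holds a flat list in that order (as the cell above reads it), wraps the three values in a named tuple and checks them before use. `IndexConfig` and `read_index_config` are hypothetical names, not part of `hnswlib` or EmbedX.

```python
import json
from typing import NamedTuple


class IndexConfig(NamedTuple):
    """Hypothetical container for the three values stored in the JSON file."""
    dim: int
    max_elements: int
    ef: int


def read_index_config(path):
    """Read [dim, max_elements, ef] from a JSON list and validate it."""
    with open(path, "r") as f:
        data = json.load(f)
    if not isinstance(data, list) or len(data) < 3:
        raise ValueError(f"expected a list of at least 3 values in {path}")
    config = IndexConfig(int(data[0]), int(data[1]), int(data[2]))
    if min(config) <= 0:
        raise ValueError(f"all config values must be positive, got {config}")
    return config
```

With such a helper, `dim_image, max_element_image, ef_image = read_index_config(image_index_config)` would replace the three index lookups.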

Initializing and loading the pre-built HNSW index from the binary (`.bin`) file.

In [5]:
image_index_file = ".cache/hnsw_image_index.bin"

if os.path.exists(image_index_file):
    image_search = hnswlib.Index(space='cosine', dim=dim_image)
    image_search.load_index(image_index_file, max_elements=max_element_image)
    image_search.set_ef(ef_image)
else:
    print(f"❌ ERROR: index file not found: {image_index_file}")

Performing text queries, embedding them into vectors using `embedx`, and searching for the top 10 nearest images (k=10) in the HNSW index.

In [None]:
Text_queries = ["computer", "chemical", "single"]
# Embed each text query into a vector.
queries = [embedx.image.embed_Text(query) for query in Text_queries]
# Retrieve the 10 approximate nearest neighbors for all queries in one call.
indices, distances = image_search.knn_query(queries, k=10)
for query, neighbor_ids in zip(Text_queries, indices):
  print(query)
  for idx in neighbor_ids:
    print(image_urls[idx])
del indices
del distances

computer
https://farm4.staticflickr.com/4145/5144385210_11af7a12b3_o.jpg
https://c1.staticflickr.com/8/7066/6986311580_f25426e4c8_o.jpg
https://c4.staticflickr.com/1/115/309069511_061d065296_o.jpg
https://c5.staticflickr.com/2/1055/1040841974_7ec01b4d9f_o.jpg
https://farm6.staticflickr.com/2803/5713954047_4cd45a029d_o.jpg
https://farm7.staticflickr.com/2746/4267363398_b770c0757e_o.jpg
https://c3.staticflickr.com/2/1056/986204626_8d7252a06a_o.jpg
https://farm7.staticflickr.com/206/487304481_a7c6b071cc_o.jpg
https://c1.staticflickr.com/3/2364/1798975528_b15ffc491d_o.jpg
https://farm8.staticflickr.com/4006/4673592688_4f8abde8a7_o.jpg
chemical
https://farm3.staticflickr.com/3795/10044488764_132b77a668_o.jpg
https://c1.staticflickr.com/8/7019/6523859709_3904afb388_o.jpg
https://c7.staticflickr.com/4/3665/9914382424_f77338be49_o.jpg
https://farm1.staticflickr.com/19/99996654_dd3692c886_o.jpg
https://c5.staticflickr.com/7/6208/6136234094_13d9df22f5_o.jpg
https://farm4.staticflickr.com/5012/55
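
HNSW returns approximate neighbors, so it can be useful to check a few queries against an exact search. The sketch below, which assumes cosine distance as used by the index above, computes exact top-k neighbors with NumPy and measures how many of them an approximate result recovered. `exact_cosine_topk` and `recall_at_k` are hypothetical helper names, and the 2-D vectors are made up for the demonstration.

```python
import numpy as np


def exact_cosine_topk(embeddings, queries, k):
    """Brute-force top-k by cosine distance (1 - cosine similarity)."""
    emb = np.asarray(embeddings, dtype=np.float32)
    qry = np.atleast_2d(np.asarray(queries, dtype=np.float32))
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    qry = qry / np.linalg.norm(qry, axis=1, keepdims=True)
    distances = 1.0 - qry @ emb.T
    # Sort each row by ascending distance and keep the first k columns.
    order = np.argsort(distances, axis=1)[:, :k]
    return order, np.take_along_axis(distances, order, axis=1)


def recall_at_k(approx_ids, exact_ids):
    """Mean fraction of exact neighbors that the approximate search found."""
    hits = [len(set(a) & set(e)) / len(e) for a, e in zip(approx_ids, exact_ids)]
    return float(np.mean(hits))


# Tiny demonstration with hand-made 2-D vectors.
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
exact_ids, exact_dist = exact_cosine_topk(embeddings, [[1.0, 0.1]], k=2)
print(exact_ids)
```

In this notebook, `exact_cosine_topk(image_embeddings, queries, 10)` could be compared with the `indices` returned by `image_search.knn_query`, provided the index labels correspond to row positions in `image_embeddings`, as the URL lookup suggests. A brute-force pass over the full array can be slow and memory-hungry, so a small sample of queries is advisable.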

Deleting variables to free up memory after the image search section.

In [7]:
del image_urls
del image_embeddings
del image_index_config
del image_search
del image_index_file
del dim_image
del max_element_image
del ef_image
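
`del` removes a name, not the object itself; the memory is released only once no other reference remains. A small sketch of that behaviour, assuming CPython's reference counting (the `Payload` class is a made-up stand-in for a large array):

```python
import gc
import weakref


class Payload:
    """Stand-in for a large object such as an embedding matrix."""


data = Payload()
alias = data                # a second reference to the same object
probe = weakref.ref(data)   # lets us observe whether the object is alive

del data
alive_after_first_del = probe() is not None   # alias still holds the object

del alias
gc.collect()                # additionally clears unreachable reference cycles
alive_after_second_del = probe() is not None

print(alive_after_first_del, alive_after_second_del)
```

The object survives the first `del` because `alias` still points to it. In a notebook, output history such as `_` or `Out` can keep objects alive in the same way.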

### 4. Search Paper by HNSW
This section loads the academic paper embedding data and the HNSW index configuration to search for relevant papers based on text queries.

Reading paper URLs and their corresponding embedding vectors from the `.h5` file.

In [8]:
with h5py.File(".cache/Papers_Embedbed_0-1000000.h5", "r") as f:
  paper_urls = f["urls"][:]
  paper_embeddings = f["embeddings"][:]

Reading the essential configuration parameters (`dim`, `max_element`, `ef`) for the HNSW index from the JSON file.

In [9]:
paper_index_config = ".cache/hnsw_paper_index.json"
dim_paper = 0
max_element_paper = 0
ef_paper = 0
with open(paper_index_config, 'r') as f:
    data = json.load(f)
    dim_paper = data[0]
    max_element_paper = data[1]
    ef_paper = data[2]

Initializing and loading the pre-built HNSW index from the binary (`.bin`) file.

In [10]:
paper_index_file = ".cache/hnsw_paper_index.bin"

if os.path.exists(paper_index_file):
    paper_search = hnswlib.Index(space='cosine', dim=dim_paper)
    paper_search.load_index(paper_index_file, max_elements=max_element_paper)
    paper_search.set_ef(ef_paper)
else:
    print(f"❌ ERROR: index file not found: {paper_index_file}")

Performing text queries, embedding them into vectors using `embedx`, and searching for the top 10 nearest papers (k=10) in the HNSW index.

In [None]:
Text_queries = ["computer", "chemical"]
# Embed each text query into a vector.
queries = [embedx.text.embed_Text(query) for query in Text_queries]
# Retrieve the 10 approximate nearest neighbors for all queries in one call.
indices, distances = paper_search.knn_query(queries, k=10)
for query, neighbor_ids in zip(Text_queries, indices):
  print(query)
  for idx in neighbor_ids:
    print(paper_urls[idx])
del indices
del distances

computer
https://arxiv.org/pdf/1309.5737.pdf
https://arxiv.org/pdf/1107.3893.pdf
https://arxiv.org/pdf/1107.4217.pdf
https://arxiv.org/pdf/0903.4286.pdf
https://arxiv.org/pdf/1304.1428.pdf
https://arxiv.org/pdf/1712.09404.pd
https://arxiv.org/pdf/1312.2447.pdf
https://arxiv.org/pdf/1703.02944.pd
https://arxiv.org/pdf/1202.5944.pdf
https://arxiv.org/pdf/0910.3440.pdf
chemical
https://arxiv.org/pdf/1606.04381.pd
https://arxiv.org/pdf/1207.3242.pdf
https://arxiv.org/pdf/1307.6360.pdf
https://arxiv.org/pdf/1007.0818.pdf
https://arxiv.org/pdf/0708.1826.pdf
https://arxiv.org/pdf/1301.0833.pdf
https://arxiv.org/pdf/0906.0871.pdf
https://arxiv.org/pdf/1211.3163.pdf
https://arxiv.org/pdf/1601.05356.pd
https://arxiv.org/pdf/0901.0318.pdf
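
The query above discards `distances`, but they can help judge how close each hit is. For `hnswlib`'s `cosine` space the returned distance is 1 minus the cosine similarity, so the similarity can be recovered directly. The sketch below also decodes URLs in case the HDF5 dataset yields `bytes` rather than `str`. `format_hits` is a hypothetical helper, and the sample values are made up for illustration.

```python
import numpy as np


def format_hits(urls, indices, distances):
    """Pair each hit's URL with its cosine similarity (1 - cosine distance)."""
    hits = []
    for idx, dist in zip(indices, distances):
        url = urls[idx]
        if isinstance(url, bytes):
            url = url.decode("utf-8")
        hits.append((url, 1.0 - float(dist)))
    return hits


# Made-up sample data standing in for paper_urls and one row of knn_query output.
sample_urls = np.array([b"https://example.org/a", b"https://example.org/b"])
sample_indices = np.array([1, 0])
sample_distances = np.array([0.25, 0.5], dtype=np.float32)

for url, similarity in format_hits(sample_urls, sample_indices, sample_distances):
    print(f"{similarity:.2f}  {url}")
```

In the notebook this would be called once per query, for example `format_hits(paper_urls, indices[i], distances[i])`, before `indices` and `distances` are deleted.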


Deleting variables to free up memory after the paper search section.

In [25]:
del paper_urls
del paper_embeddings
del paper_index_config
del paper_search
del paper_index_file
del dim_paper
del max_element_paper
del ef_paper