# Information

**This demo notebook automatically builds a Faiss KNN index with the best similarity search parameters it can find.**

It selects the indexing parameters that achieve the highest recall under the given memory and query speed constraints.

GitHub: https://github.com/criteo/autofaiss

# Parameters

In [1]:
#@title Index parameters

max_index_query_time_ms = 10 #@param {type: "number"}
max_index_memory_usage = "10MB" #@param
metric_type = "l2" #@param ['ip', 'l2']

# Embeddings creation (add your own embeddings here)

In [2]:
import numpy as np

# Create embeddings
embeddings = np.float32(np.random.rand(4000, 100))
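
If you switch `metric_type` to `"ip"` and want the inner product to behave like cosine similarity, a common preparation step is to L2-normalize the embeddings before saving them. The sketch below is an optional addition, not part of the original notebook; `l2_normalize` is a hypothetical helper name, not an autofaiss function.

```python
import numpy as np

def l2_normalize(vectors: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit L2 norm (hypothetical helper, not an autofaiss API)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return (vectors / np.maximum(norms, eps)).astype(np.float32)

# Random stand-in data with the same shape as the notebook's embeddings.
rng = np.random.default_rng(0)
sample = rng.random((4000, 100), dtype=np.float32)
normalized = l2_normalize(sample)
```

After normalization, the inner product of two rows equals their cosine similarity, so an `"ip"` index ranks neighbors by cosine similarity.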

# Save your embeddings on the disk

In [3]:
# Create a new folder
import os
import shutil
embeddings_dir = "embeddings_folder"
if os.path.exists(embeddings_dir):
  shutil.rmtree(embeddings_dir)
os.makedirs(embeddings_dir)

# Save your embeddings
# You can split your embeddings into several parts if they are too big
# The data will be read in the lexicographical order of the filenames
np.save(f"{embeddings_dir}/part1.npy", embeddings[:2000]) 
np.save(f"{embeddings_dir}/part2.npy", embeddings[2000:]) 
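
Because the parts are read in lexicographical filename order, names such as `part10.npy` sort before `part2.npy`. The short sketch below illustrates the pitfall and one possible workaround (zero-padded part numbers); the file names are illustrative only.

```python
# Plain string sorting puts "part10" before "part2".
names = [f"part{i}.npy" for i in (1, 2, 10)]
naive_order = sorted(names)

# Zero-padding the part number keeps lexicographic and numeric order aligned.
padded = [f"part{i:03d}.npy" for i in (1, 2, 10)]
padded_order = sorted(padded)

print(naive_order)   # ['part1.npy', 'part10.npy', 'part2.npy']
print(padded_order)  # ['part001.npy', 'part002.npy', 'part010.npy']
```

With only two parts, as in this notebook, the ordering is unaffected; the padding matters once there are ten or more files.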

# Build the KNN index with Autofaiss

In [4]:
os.makedirs("my_index_folder", exist_ok=True)

In [5]:
# Install autofaiss
!pip install autofaiss &> /dev/null

# Build a KNN index
!autofaiss quantize --embeddings_path={embeddings_dir} \
                    --output_path="my_index_folder" \
                    --metric_type={metric_type} \
                    --max_index_query_time_ms={max_index_query_time_ms} \
                    --max_index_memory_usage={max_index_memory_usage}

Launching the whole pipeline 08/02/2021, 13:25:58
	Compute estimated construction time of the index 08/02/2021, 13:25:58
		-> Train: 16.7 minutes
		-> Add: 0.0 seconds
		Total: 16.7 minutes
	>>> Finished "Compute estimated construction time of the index" in 0.0001 secs
	Checking that your have enough memory available to create the index 08/02/2021, 13:25:58
	>>> Finished "Checking that your have enough memory available to create the index" in 0.0006 secs
	Selecting most promising index types given data characteristics 08/02/2021, 13:25:58
	>>> Finished "Selecting most promising index types given data characteristics" in 0.0012 secs
	Creating the index 08/02/2021, 13:25:58
		-> Instanciate the index HNSW32 08/02/2021, 13:25:58
		>>> Finished "-> Instanciate the index HNSW32" in 0.0013 secs
		-> Extract training vectors 08/02/2021, 13:25:58
100% 2/2 [00:00<00:00, 1055.97it/s]
		>>> Finished "-> Extract training vectors" in 0.0138 secs
		-> Training the index wi

# Load the index and play with it

In [6]:
import faiss
import glob
import numpy as np

my_index = faiss.read_index(glob.glob("my_index_folder/*.index")[0])

query_vector = np.float32(np.random.rand(1, 100))
k = 5
distances, indices = my_index.search(query_vector, k)

# With "l2", Faiss reports squared L2 distances (smaller is closer);
# with "ip", it reports inner products (larger is closer).
print(f"Top {k} nearest neighbors in the dataset (metric: {metric_type}):")
for i, (dist, idx) in enumerate(zip(distances[0], indices[0])):
  print(f"{i+1}: Vector number {idx:4} with distance {dist}")

Top 5 nearest neighbors in the dataset (metric: l2):
1: Vector number 2933 with distance 10.404068946838379
2: Vector number  168 with distance 10.53512191772461
3: Vector number 2475 with distance 10.688979148864746
4: Vector number 2525 with distance 10.713528633117676
5: Vector number 3463 with distance 10.774477005004883
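
One way to sanity-check an approximate index is to compare its results against an exact brute-force search. The sketch below shows the idea with NumPy only; `brute_force_l2` and `recall_at_k` are hypothetical helper names, and the data is regenerated randomly here rather than taken from the notebook's `embeddings`.

```python
import numpy as np

def brute_force_l2(data: np.ndarray, query: np.ndarray, k: int):
    """Exact k nearest neighbors by squared L2 distance (hypothetical helper)."""
    sq_dists = np.sum((data - query) ** 2, axis=1)
    top = np.argsort(sq_dists)[:k]
    return sq_dists[top], top

def recall_at_k(approx_ids, exact_ids) -> float:
    """Fraction of the exact neighbors that the approximate search also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

rng = np.random.default_rng(0)
data = rng.random((4000, 100), dtype=np.float32)
query = rng.random((1, 100), dtype=np.float32)
exact_dists, exact_ids = brute_force_l2(data, query, k=5)

# In the notebook, approx_ids would be indices[0] from my_index.search(...).
# Here the exact ids stand in for them so the sketch runs without Faiss.
approx_ids = exact_ids
recall = recall_at_k(approx_ids.tolist(), exact_ids.tolist())
print(recall)
```

In the notebook itself, you would pass `embeddings` and `query_vector` to `brute_force_l2` and compare the returned ids with `indices[0]`.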


# (Bonus) Python version of the CLI

In [7]:
from autofaiss.external.quantize import Quantizer

quantizer = Quantizer()

quantizer.quantize(embeddings_path="embeddings_folder",
                   output_path="my_index_folder",
                   max_index_query_time_ms = max_index_query_time_ms,
                   max_index_memory_usage = max_index_memory_usage,
                   metric_type=metric_type)

Launching the whole pipeline 08/02/2021, 13:26:11
	Compute estimated construction time of the index 08/02/2021, 13:26:11
		-> Train: 16.7 minutes
		-> Add: 0.0 seconds
		Total: 16.7 minutes
	>>> Finished "Compute estimated construction time of the index" in 0.0007 secs
	Checking that your have enough memory available to create the index 08/02/2021, 13:26:11
	>>> Finished "Checking that your have enough memory available to create the index" in 0.0012 secs
	Selecting most promising index types given data characteristics 08/02/2021, 13:26:11
	>>> Finished "Selecting most promising index types given data characteristics" in 0.0043 secs
	Creating the index 08/02/2021, 13:26:11
		-> Instanciate the index HNSW32 08/02/2021, 13:26:11
		>>> Finished "-> Instanciate the index HNSW32" in 0.0021 secs
		-> Extract training vectors 08/02/2021, 13:26:11
100%|██████████| 2/2 [00:00<00:00, 421.77it/s]
		>>> Finished "-> Extract training vectors" in 0.0238 secs
		-> Training the index with 4000 vectors of dim 100 08/02/2021, 13:26:11
		>>> Finished "-> Training the index with 4000 vectors of dim 100" in 0.0000 secs
		-> Adding the vectors to the index 08/02/2021, 13:26:11
100%|██████████| 2/2 [00:00<00:00,  4.55it/s]
		>>> Finished "-> Adding the vectors to the index" in 1.7814 secs
	>>> Finished "Creating the index" in 1.8182 secs
	Computing best hyperparameters 08/02/2021, 13:26:13
	>>> Finished "Computing best hyperparameters" in 3.2071 secs
The best hyperparameters are: efSearch=2077
	Saving the index on local disk 08/02/2021, 13:26:16
	>>> Finished "Saving the index on local disk" in 0.0064 secs
	Compute fast metrics 08/02/2021, 13:26:16
1025
	>>> Finished "Compute fast metrics" in 10.0180 secs
Recap:
{'99p_search_speed_ms': 13.157404919996907,
 'avg_search_speed_ms': 9.750819220487383,
 'compression ratio': 0.5956986092671344,
 'nb vectors': 4000,
 'reconstruction error %': 0.0,
 'size in bytes': 2685922,
 'vectors dimension': 100}
>>> Finished "Launching the whole pipeline" in 15.0867 secs


'Done'
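
The numbers in the recap can be cross-checked with a little arithmetic. The sketch below assumes the compression ratio is the raw float32 size of the vectors divided by the index size on disk; that reading is an inference from the reported values, not something stated in the output.

```python
# Values copied from the recap above.
recap = {
    "nb vectors": 4000,
    "vectors dimension": 100,
    "size in bytes": 2685922,
    "compression ratio": 0.5956986092671344,
    "avg_search_speed_ms": 9.750819220487383,
    "99p_search_speed_ms": 13.157404919996907,
}

BYTES_PER_FLOAT32 = 4
raw_bytes = recap["nb vectors"] * recap["vectors dimension"] * BYTES_PER_FLOAT32

# Assumption: ratio = raw vector size / index size.
ratio = raw_bytes / recap["size in bytes"]

print(raw_bytes)        # 1600000
print(round(ratio, 4))  # 0.5957
```

Under that assumption, a ratio below 1 means the HNSW32 index is larger than the raw vectors, which is consistent with a graph index that stores the full vectors plus neighbor links. At about 2.7 MB the index still fits within the `"10MB"` budget. The average search time (about 9.75 ms) is under the 10 ms target, while the 99th percentile (about 13.2 ms) is above it.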