# 1. Introduction
FAISS (Facebook AI Similarity Search) is an open-source library for similarity search over dense vector embeddings.

## What it does
**FAISS** lets developers quickly find the vectors most similar to a query among the **embeddings of multimedia documents** such as text and images. It's designed to handle **large-scale datasets**, which makes it a *key tool for applications in machine learning, artificial intelligence, and data science.*

## How it works
**FAISS** assumes that *instances are represented as vectors* and can be compared using **L2 (Euclidean) distances or dot products.** It offers a range of index types that trade off search speed, memory use, and accuracy, from exact brute-force search to compressed approximate indexes.
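
To make the two comparison measures concrete, here is a minimal NumPy sketch. It is not FAISS itself; it only computes by brute force the quantities an exact search ranks by. The vectors and variable names (`database`, `query`) are made up for illustration.

```python
import numpy as np

# Hypothetical toy data: three 2-D "embeddings" and one query vector.
database = np.array([[1.0, 0.0], [3.0, 3.0], [0.0, 1.2]], dtype="float32")
query = np.array([1.0, 1.0], dtype="float32")

# Squared L2 (Euclidean) distance: smaller means more similar.
l2_sq = ((database - query) ** 2).sum(axis=1)

# Dot product: larger means more similar.
dots = database @ query

nearest_by_l2 = int(np.argmin(l2_sq))   # index of the closest vector by L2
nearest_by_dot = int(np.argmax(dots))   # index of the best match by dot product

print("Nearest by L2:", nearest_by_l2)
print("Nearest by dot product:", nearest_by_dot)
```

On this toy data the two measures pick different neighbours, because the dot product rewards vectors with a large magnitude. If all vectors are normalized to unit length first, the dot product becomes cosine similarity and the two rankings agree.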

## Why it's useful
**FAISS** addresses a limitation of traditional query search engines, which are optimized for hash-based (exact-match) lookups rather than for finding the nearest neighbours of a vector. It's a valuable tool for applications that require rapid and accurate similarity searches.

## Some of its features
FAISS includes a variety of index structures, including:

* **IndexIVFFlat:** Uses an inverted file (IVF) index: the dataset is partitioned into clusters, and each cluster keeps a list of the vectors assigned to it. At query time only the closest clusters are scanned, which makes this index structure suitable for large-scale applications.
* **IndexIVFPQ:** Combines an inverted file index with Product Quantization (PQ), which compresses the stored vectors, to store and retrieve high-dimensional embeddings with a smaller memory footprint.

## Where to learn more
* You can learn more about FAISS from the [Faiss documentation](https://faiss.ai/), [GitHub](https://github.com/facebookresearch/faiss), and other resources.
* [Implementing FAISS: Vector Similarity Search for Recommendations](https://manangarg.medium.com/implementing-faiss-vector-similarity-search-for-recommendations-faa5149f55de)

# 2. Install libraries

In [1]:
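# Installing relevant libraries
!pip install sentence_transformers
!pip install datasets
# Install only ONE FAISS build: both packages provide the same `faiss` module
!pip install faiss-cpu      # CPU-only build
# !pip install faiss-gpu    # use this instead on a machine with a CUDA GPU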

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl (

# 3. Import libraries

In [2]:
from datasets import load_dataset
import pandas as pd
from sentence_transformers import SentenceTransformer
import faiss
import warnings

# 4. Set options
* *suppress warnings*
* *show all DataFrame columns*
* *display floats with two decimal places*

In [3]:
warnings.simplefilter('ignore')
pd.set_option("display.max_columns", None)
pd.options.display.float_format = '{:.2f}'.format

# 5. Load dataset
Let's use a dataset of book titles and their descriptions for our use case. To keep it small, we load only the `test` split.

In [4]:
dataset = load_dataset('Skelebor/book_titles_and_descriptions_en_clean', split='test')
df = pd.DataFrame(dataset)
df.head()

Generating train split:   0%|          | 0/1032335 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/57352 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/57352 [00:00<?, ? examples/s]

| Unnamed: 0 | title | description |
|---|---|---|
| 0 | The Baby of Their Dreams | Barcelona, baby...bride?\nSeven years ago ER d... |
| 1 | Air Gear, Vol. 8 (Air Gear, #8) | Behemoth has already taken out part of Ikki's ... |
| 2 | Walking Over Eggshells | Walking Over Eggshells is an autobiography tha... |
| 3 | Charmed (Fairy Tale Reform School, #2) | Charmed is the exciting sequel to the wildly p... |
| 4 | Blown Away (Unconventional in Atlanta, #2) | Sometimes love finds you before you think you'... |


# 6. Dataset EDA
* shape
* duplicate rows
* null values

In [4]:
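# Shape of the dataset
print("Shape of dataset".ljust(45, '.'), df.shape)

# Removing duplicate rows
df.drop_duplicates(inplace=True)

# Shape of the dataset after removing duplicates
# (the label is 42 characters long, so the padding width must exceed that)
print("Shape of dataset after removing duplicates".ljust(45, '.'), df.shape)

# Total number of null cells; printed explicitly because a notebook
# only displays the value of the last expression in a cell
print("Total null values".ljust(45, '.'), df.isnull().sum(axis=1).sum())

# Number of nulls per column
df.isnull().sum()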
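
To see what the duplicate and null checks in the cell above compute, here is a small sketch on a made-up DataFrame. The `toy` data is hypothetical and only mirrors the `title` and `description` columns of the real dataset.

```python
import pandas as pd

# Hypothetical toy data: one exact duplicate row and three missing cells.
toy = pd.DataFrame({
    "title": ["A", "A", "B", None],
    "description": ["x", "x", None, None],
})

# drop_duplicates removes rows that repeat an earlier row in every column.
deduped = toy.drop_duplicates()

# Nulls per column, and the grand total across the whole frame.
nulls_per_column = deduped.isnull().sum()
total_nulls = int(deduped.isnull().sum().sum())

print(deduped.shape)
print(nulls_per_column)
print(total_nulls)
```

The per-column counts show where the missing values are, while the total is a quick check of whether any cleaning is needed at all.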