# Similarity Search with Redis
### Exploring Redis as a Vector Database

with Brian Sam-Bodden 

![redis](./images/redis.png)

## The "Unstructured Data" Problem

- The **balance** of data has changed radically...
- **~80%** of the data generated by organizations is **Unstructured**<sup>(IDC report, 2020)</sup>
- This percentage is estimated to keep growing <sup>(with CAGR of 36.5% between 2020 and 2025)</sup>




## But what is "Unstructured" Data?

- Data that does not conform to a **pre-defined** data model
- Data that cannot be easily **"indexed"** by a search engine
- Data is typically **high-dimensional** and **semantically** rich
- Examples include **images**, **videos**, **free-form text**, and **audio**


![data pyramid](./images/data-balance.png)

## Dealing with Unstructured Data

- Unstructured data must be **transformed**
- To deal with the **high-dimensional** nature we extract **"features"**
- Traditional extraction techniques included **labelling**, **tagging**, and **1-hot encoding** 
- The extracted features are commonly encoded as **vectors** 
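
As a minimal sketch of one of those traditional techniques, here is one-hot encoding done by hand. The `COLORS` vocabulary and the `one_hot` helper are made up for illustration and are not part of the demo dataset:

```python
# Hypothetical vocabulary of color tags (an assumption, not taken from the dataset).
COLORS = ['Black', 'Blue', 'Red', 'White']

def one_hot(value, vocabulary):
    """Return a list with a 1 at the position of `value` and 0 everywhere else."""
    return [1 if item == value else 0 for item in vocabulary]

blue_vector = one_hot('Blue', COLORS)
print(blue_vector)  # [0, 1, 0, 0]
```

Each new tag adds one more dimension to the vector, which hints at why hand-built features grow unwieldy.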


## Manual Image Feature Extraction

![manual image feature extraction](./images/image-manual-feature-extraction.png)

## Manual Text Feature Extraction

![manual text feature extraction](./images/text-manual-feature-extraction.png)

## 🏹 Vectors

- They are a **Numeric representation** of something in **N-dimensional** space
- Can represent **anything**... entire documents, images, video, audio 
- Quantifies **features** or **characteristics** of the item
- More importantly... they are **comparable**

## 🏹 Vectors

- A Vector is a tuple of one or more **values** called **scalars**
- Each **scalar** represents the measure of a **feature**
- Different frameworks use different data types to represent them:
  - In **NumPy** they are **NumPy arrays** (`np.ndarray`)
  - In **TensorFlow** they are **Tensors** (`tf.Tensor`)
  - In **PyTorch** they are also **Tensors** (`torch.Tensor`)
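
To make "tuple of scalars" and "comparable" concrete, here is a small NumPy sketch. The three feature scores per review are invented values, and `euclidean` is just a helper name chosen for this example:

```python
import numpy as np

# Three made-up feature scores per review; the values are illustrative only.
review_a = np.array([0.9, 0.2, 0.7], dtype=np.float32)
review_b = np.array([0.8, 0.3, 0.6], dtype=np.float32)
review_c = np.array([0.1, 0.9, 0.2], dtype=np.float32)

def euclidean(u, v):
    """Straight-line distance between two vectors of the same length."""
    return float(np.linalg.norm(u - v))

# A smaller distance means the two reviews score more alike.
print(euclidean(review_a, review_b))
print(euclidean(review_a, review_c))
```

Because the scores live in the same 3-dimensional space, a single number tells us that `review_a` sits closer to `review_b` than to `review_c`.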

## 3 "Bicycle Reviews" Features as a Vector

![representation of a vector](./images/bicycle_vector.png)

## 🧨 Issues with Feature Engineering

- **Time-consuming**: Might require domain knowledge and expertise.
- **High dimensionality**: Can lead to a high-dimensional feature space.
- **Lack of scalability**: Not easily scalable, more data **==** more people.

## Enter "Vector Embeddings"

- **Machine Learning** / **Deep Learning** have leaped forward in the last decade
- ML models **outperform** humans in many tasks nowadays
  - 🔥 **CV** (Computer Vision) models excel at detection/classification
  - 🔥 **LLMs** (Large Language Models) have advanced exponentially
- Today, most vectors are **generated** using pre-trained **ML Models**

## Enter "Vector Embeddings"

- ML models can **extract contextual meaning** from unstructured data
- Reduce semantically-rich high-dimensional inputs and **"flatten"** them 
- Flattened representations retain the semantic information and make for ideal vectors
- Once in vector form, the world of **linear algebra** lets us operate on them

## Vector Embeddings from a CV Model

![vector embedding extraction](./images/embedding-extraction.png)

## Enter "Vector Databases"

- Pure Vector Databases **efficiently store** Vectors (along with **metadata**)
- Enable **searching** for vectors using **"similarity"** and **"distance"** metrics
- Enable **hybrid searches** combining vectors and metadata

## Redis as a Vector Database

- Redis provides **Search Capabilities** for structured/semi-structured data
- Redis supports `TEXT`, `NUMERIC`, `TAG`, `GEO` and `GEOSHAPE` fields
- Redis introduces the **`VECTOR`** schema field type for vector support 
- **`VECTOR`** field allows **indexing**, and **querying** vectors in **Hashes** or **JSON**
- Redis' **in-memory** approach provides **fast** and **efficient** vector searches





## Redis as a Vector Database

- Capabilities:
  - **3** distance metrics: **Euclidean**, **Inner Product** and **Cosine**
  - **2** indexing methods: **HNSW** and **Flat**
  - **Hybrid queries** combined with `GEO`, `TAG`, `TEXT` or `NUMERIC`
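
For intuition, the three metrics can be sketched in NumPy using their textbook definitions. The function names are chosen for this example, and how Redis scales or reports each score internally may differ from these formulas:

```python
import numpy as np

def l2_distance(u, v):
    """Euclidean distance: length of the difference vector."""
    return float(np.linalg.norm(u - v))

def inner_product(u, v):
    """Inner (dot) product: larger means more aligned and/or longer vectors."""
    return float(np.dot(u, v))

def cosine_distance(u, v):
    """1 - cosine similarity: depends on the angle only, not on vector length."""
    cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(1.0 - cos_sim)

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])   # perpendicular to u
w = np.array([2.0, 0.0])   # same direction as u, twice as long

print(l2_distance(u, v), inner_product(u, v), cosine_distance(u, v))
print(l2_distance(u, w), inner_product(u, w), cosine_distance(u, w))
```

Note how `u` and `w` are at cosine distance 0 even though their Euclidean distance is 1: cosine ignores magnitude, which is one reason it is popular for text embeddings.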

## 🛠️ Demo
### The **Redis Fashion Store**

![fashion store](./images/dataset-cover.png)

## The **Redis Fashion Store**

* Populated from Kaggle's [**Fashion Product Images Dataset**](https://www.kaggle.com/datasets/paramaggarwal/fashion-product-images-dataset)
* Contains 44k products with descriptions, metadata, and images 

## Connecting to Redis Stack

* **Redis Stack** instance running locally
* Import `redis-py` client library
* Create a **client connection**

In [None]:
import redis
client = redis.Redis(host='localhost', port=6379, decode_responses=True)

* Use the `PING` command to check that Redis is up and running:

In [None]:
client.ping()

## Inspect the Store Items

* Use the `JSON.GET` command to retrieve the product with key `fashion:17445`:

In [None]:
fashion17445 = client.json().get('fashion:17445')
fashion17445

In [None]:
client.json().objkeys('fashion:17445')

## Generating Embeddings with ML

![ML Models for embeddings](./images/target-model-embeddings-redis.png)

## Where to find pre-trained models?

![Model Zoos](./images/model-zoos.png)

## Sentence Transformers

![SBERT](./images/sbert-net.png)

- **SentenceTransformers** to **generate embeddings** for the product **descriptions** 
- **Sentence-BERT** (**SBERT**) produces **contextually rich** sentence embeddings
- Embeddings provide **efficient sentence-level** semantic similarity
- Improves tasks like **semantic search** and **text grouping**

## Selecting a suitable pre-trained Model

- We must pick a **suitable model** for **generating embeddings**
- We want to query for products using **short queries** against the **longer** product **descriptions**
- This is referred to as **"Asymmetric Semantic Search"** 
- Used when **search query** and the **documents** being searched are of **different nature or structure**

## Selecting a suitable pre-trained Model

- For **asymmetric semantic search** suitable models include pre-trained **MS MARCO** Models
- Optimized for understanding **real-world queries** and producing **relevant responses**
- **Highest performing** MS MARCO model is **`msmarco-distilbert-base-v4`**
  - which is tuned for **cosine-similarity** 

In [None]:
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('msmarco-distilbert-base-v4') 

## Extract the Product's Description

- Let's extract the `description` into the `sample_description` var:

In [None]:
sample_description = fashion17445['description']
sample_description

## Generating an Embedding Vector

- To generate the vector embeddings, we use the `encode` function:

In [None]:
embedding = embedder.encode(sample_description)
VECTOR_DIMENSION = len(embedding)
VECTOR_DIMENSION

- Let's take a peek at the first **5** elements of the generated vector:

In [None]:
print(embedding.tolist()[:5])

## Generate Embeddings for remaining Products Descriptions

* To vectorize all the descriptions in the database, we will first collect all the Redis keys for the products:



In [None]:
all_keys = sorted(client.keys('fashion:*')) 
len(all_keys)

In [None]:
from dotenv import load_dotenv
import os
import json
import preso
load_dotenv()
DEMO_PRODUCTS = json.loads(os.environ.get('DEMO_PRODUCTS'))
print(DEMO_PRODUCTS)

In [None]:
keys = ["fashion:" + str(id) for id in DEMO_PRODUCTS]
print(keys)

## Generate Embeddings for the Products Descriptions

* With the keys in `keys` we can use the Redis `JSON.MGET` command to retrieve just the `description` field
* We'll store all the descriptions in the `descriptions` variable
* The `encode` method can take a List of text passages to encode

In [None]:
import numpy as np

descriptions = client.json().mget(keys, '$.description')
descriptions = [item for sublist in descriptions for item in sublist]
embeddings = embedder.encode(descriptions).astype(np.float32).tolist()

* Let's check that we've generated the correct number of embedding vectors:

In [None]:
len(embeddings)

## Add the embeddings to the JSON documents

- Now we can add the vectorized descriptions to the JSON documents in Redis
- Use the `JSON.SET` command to insert a new field in each of the documents at `$.description_embeddings`
- Use Redis' **pipeline** mode to minimize the round-trip times:

In [None]:
pipeline = client.pipeline()

for key, embedding in zip(keys, embeddings):
    pipeline.json().set(key, '$.description_embeddings', embedding)

pipeline.execute()
print('Vector Sentence Embeddings Saved!')

## Inspect the Products Documents

- Let's inspect one of the vectorized product documents using the `JSON.GET` command:

In [None]:
import json

print(json.dumps(client.json().get('fashion:17445'), indent=2)) 

## Create Search Index for the Products Collection

- To define the index we'll import the `IndexDefinition` and the `IndexType`
- To define the schema fields we'll use the classes `TagField`, `TextField`, `NumericField`, and **`VectorField`**
- We'll create an index named **`idx:fashion`**

In [None]:
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.field import TagField, TextField, NumericField, VectorField
from redis.commands.search.query import Query

INDEX_NAME = 'idx:fashion'
DOC_PREFIX = 'fashion:'

## The Search Index Schema

In [None]:
try:
    client.ft(INDEX_NAME).info()
    print('Index already exists!')
except redis.exceptions.ResponseError:  # raised when the index does not exist yet
    schema = (
        TagField('$.gender', as_name='gender'),  
        TagField('$.subCategory', as_name='subCategory'), 
        TagField('$.season', as_name='season'), 
        NumericField('$.year', as_name='year'),
        TagField('$.usage', as_name='usage'),
        TextField('$.name', no_stem=True, as_name='name'),
        TagField('$.color', as_name='color'),
        TagField('$.type', as_name='type'),
        TagField('$.category', as_name='category'),
        TextField('$.description', as_name='description'),
        VectorField('$.description_embeddings', 'FLAT', {
          'TYPE': 'FLOAT32',
          'DIM': VECTOR_DIMENSION,
          'DISTANCE_METRIC': 'COSINE',
        },  as_name='vector'),
    )

    # index Definition
    definition = IndexDefinition(prefix=[DOC_PREFIX], index_type=IndexType.JSON)

    # create Index
    client.ft(INDEX_NAME).create_index(fields=schema, definition=definition)

## `VECTOR` Schema Field Definition

* **Indexing method**: `FLAT` **(brute-force indexing)** or `HNSW` **(Hierarchical Navigable Small World)**
* **Vector Type**: `FLOAT32` or `FLOAT64`.
* **Vector Dimension**: The length or dimension of our embeddings (`768`).
* **Distance Metric**: `L2` **(Euclidean distance)**, `IP` **(Inner Product)**, or `COSINE` **(Cosine Similarity)** 
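
The demo index uses `FLAT`. For comparison, here is a sketch of the attributes an `HNSW` field could take. `M`, `EF_CONSTRUCTION`, and `EF_RUNTIME` are HNSW tuning attributes; the values shown are illustrative assumptions, not recommendations from this talk:

```python
# Attribute dictionary for a hypothetical HNSW variant of the vector field.
hnsw_attributes = {
    'TYPE': 'FLOAT32',
    'DIM': 768,                # length of the description embeddings
    'DISTANCE_METRIC': 'COSINE',
    'M': 16,                   # max outgoing edges per node in the graph
    'EF_CONSTRUCTION': 200,    # candidates examined while building the graph
    'EF_RUNTIME': 10,          # candidates examined while answering a KNN query
}

# It would be passed in place of the FLAT definition, for example:
# VectorField('$.description_embeddings', 'HNSW', hnsw_attributes, as_name='vector')
print(sorted(hnsw_attributes))
```

`FLAT` compares the query against every vector and is exact; `HNSW` trades a little accuracy for speed on larger collections.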

## Check the state of the Index

- `FT.CREATE` creates the index
- The **indexing process** is automatically started in the **background**
- In the blink of an eye, our JSON documents are indexed and ready to be searched
- To corroborate that, we use the **`FT.INFO`** command:

In [None]:
info = client.ft(INDEX_NAME).info()

num_docs = info['num_docs']
indexing_failures = info['hash_indexing_failures']
total_indexing_time = info['total_indexing_time']
percent_indexed = float(info['percent_indexed']) * 100


print(f"{num_docs} docs ({percent_indexed}%) indexed w/ {indexing_failures} failures in {float(total_indexing_time):.2f} msecs")

## Structured Data Searches with Redis

- Let's test the non-vector part of the index first:

- Retrieve all products where the `type` is `Deodorant`:

In [None]:
query = (
    Query('@type:{Deodorant}')
    .paging(0, 5)
    .return_fields('id', 'name', 'color', 'type', 'year')
)
client.ft(INDEX_NAME).search(query).docs

- Among the `Deodorant` products, find the ones with the word `hydra` in their `name`:

In [None]:
query = (
    Query('@type:{Deodorant} @name:hydra').return_fields('id', 'name', 'color', 'type', 'year')
)
client.ft(INDEX_NAME).search(query).docs

## Semantic Queries

- We want to query for products using short query prompts
- Let's put our queries in a list so we can vectorize them and execute them in bulk:

In [None]:
queries = [
    'Comfortable pants',
    'Casual shades',
    'Floral dress'
]

In [None]:
encoded_queries = embedder.encode(queries)
len(encoded_queries)

## Visualizing Embeddings

- TensorFlow provides the "Embedding Projector", with some sample vector spaces mapped into a 3-D space
- By using Dimensionality Reduction techniques we can visualize and explore embeddings

In [None]:
%%html
<iframe src="https://projector.tensorflow.org/" width="1920" height="540"></iframe>

## Constructing a "Pure KNN" VSS Query

- We'll start with a **K-nearest neighbors** (KNN) query 
- The goal of KNN is to find the **most similar** items to a given query item
- KNN calculates the **distance** between the query vector and each vector in the database
- Returns 'K' items with the **smallest** distances
- These are considered to be the most similar items
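
The steps above can be sketched as a brute-force search in NumPy. This is a toy illustration using cosine distance on made-up 2-D vectors, not how Redis implements it; `knn` and `database` are names invented for the example:

```python
import numpy as np

def knn(query, vectors, k):
    """Return (row index, cosine distance) for the k rows of `vectors` closest to `query`."""
    q = query / np.linalg.norm(query)
    m = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    distances = 1.0 - m @ q                # distance to every vector in the "database"
    nearest = np.argsort(distances)[:k]    # indices of the k smallest distances
    return [(int(i), float(distances[i])) for i in nearest]

# Made-up vectors standing in for stored embeddings.
database = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]])

print(knn(np.array([1.0, 0.1]), database, k=2))
```

Every stored vector is touched once per query, which is exactly the cost that grows with the size of the collection.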

## Constructing a "Pure KNN" VSS Query

In [None]:
query = (
    Query('(*)=>[KNN 3 @vector $query_vector AS vector_score]')
     .sort_by('vector_score')
     .return_fields('vector_score', 'id', 'name', 'color', 'type', 'year', 'description')
     .dialect(2)
)

- The syntax for KNN queries is `(*)=>[<vector_similarity_query>]`
  - where the `(*)` (the `*` meaning all) is the filter query for the search engine
  - `$query_vector` represents the query parameter we'll use to pass the vectorized query prompt
  - results are sorted by `vector_score`
  - the query returns the `vector_score`, the `id` of the matched documents, and their `name`, `color`, `type`, `year`, and `description`
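
The `$query_vector` parameter has to reach Redis as a binary blob of `FLOAT32` values, the same way the image query later in this deck passes `tobytes()`. A minimal sketch of that conversion, where `to_query_params` is a hypothetical helper name and not part of redis-py or of the `preso` module:

```python
import numpy as np

def to_query_params(vector, **extra):
    """Build a params dict whose 'query_vector' entry is the vector as float32 bytes."""
    blob = np.asarray(vector, dtype=np.float32).tobytes()
    return {'query_vector': blob, **extra}

params = to_query_params([0.1, 0.2, 0.3])
print(len(params['query_vector']))  # 3 floats * 4 bytes each = 12
```

The resulting dictionary would then be handed to the search call, along the lines of `client.ft(INDEX_NAME).search(query, params)`.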

## 🏃🏾‍♀️Running the Query

- With the Query prepared in `query`
- and the query prompts in `queries` 
- and the encoded queries in `encoded_queries`
- we can use the `create_query_table` function to generate a table of results:

## 🏃🏾‍♀️Running the Query

In [None]:
preso.create_query_table(client, INDEX_NAME, query, queries, encoded_queries)

## 🔍 Pre-filtering Queries

- "Pure KNN" queries evaluate a query against the **whole space of vectors**
- The larger the collection, the more **computationally expensive** the search
- Unstructured data does not live in isolation
- Rich search experiences must allow searching all data (structured and unstructured) 
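
As a toy illustration of the idea, the sketch below first narrows a made-up catalogue by its structured attributes and only then ranks the survivors by vector distance. The catalogue, the attribute names, and `prefiltered_knn` are all assumptions made for this example:

```python
import numpy as np

# Made-up catalogue rows: (id, color, season, embedding).
products = [
    ('p1', 'Blue', 'Summer', [1.0, 0.0]),
    ('p2', 'Red',  'Summer', [0.9, 0.1]),
    ('p3', 'Blue', 'Winter', [0.8, 0.2]),
    ('p4', 'Blue', 'Summer', [0.0, 1.0]),
]

def cosine_distance(u, v):
    u, v = np.asarray(u), np.asarray(v)
    return float(1.0 - u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def prefiltered_knn(query, k, color, season):
    """Filter on structured fields first, then rank only the remaining vectors."""
    candidates = [p for p in products if p[1] == color and p[2] == season]
    ranked = sorted(candidates, key=lambda p: cosine_distance(query, p[3]))
    return [p[0] for p in ranked[:k]]

print(prefiltered_knn([1.0, 0.0], k=3, color='Blue', season='Summer'))
```

Only two of the four products survive the filter, so the distance computation runs on half the vectors, and `p2` never appears even though its embedding is close to the query.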

## 🔍 Pre-filtering Queries

- For example, users might arrive at your search interface with a color or season preference in mind
- Redis can use this information to pre-filter the search space
- In the hybrid query definition below:
  - we pre-filter using the `color` equal to `Blue` and the `season` being `Summer`
  - before, our primary filter query was `(*)`, AKA everything
  - now we narrow the search space using `(@color:{Blue} @season:{Summer})` before the KNN query

In [None]:
hybrid_query = (
    Query('(@color:{Blue} @season:{Summer})=>[KNN 3 @vector $query_vector AS vector_score]')
     .sort_by('vector_score')
     .return_fields('vector_score', 'id', 'name', 'color', 'type', 'year', 'description')
     .dialect(2)
)

## 🏃🏾‍♀️Running the Query

In [None]:
preso.create_query_table(client, INDEX_NAME, hybrid_query, queries, encoded_queries)

## Range Queries

- Range queries retrieve items within a specific **distance** from a query vector
- We consider **"distance"** to be the **measure of similarity** 
- The smaller the distance, the more similar the items
- For example, to return the top `4` products within a `0.55` "radius" of query: 

```
FT.SEARCH idx:fashion
  @vector:[VECTOR_RANGE $range $query_vector]=>{$YIELD_DISTANCE_AS: vector_score}
  SORTBY vector_score ASC
  LIMIT 0 4
  DIALECT 2
  PARAMS 4 range 0.55 query_vector "\x9d|\x99>bV#\xbfm\x86\x8a\xbd\xa7~$?*...."
```
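
To see what the radius does, here is a brute-force NumPy sketch of the same idea: keep only the vectors within the radius, sort by distance, and cap the result. It uses cosine distance on made-up 2-D vectors, and `range_search` is a name invented for this example:

```python
import numpy as np

def range_search(query, vectors, radius, limit):
    """Return (row index, cosine distance) pairs with distance <= radius, closest first."""
    q = query / np.linalg.norm(query)
    m = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    distances = 1.0 - m @ q
    hits = [(int(i), float(d)) for i, d in enumerate(distances) if d <= radius]
    return sorted(hits, key=lambda hit: hit[1])[:limit]

# Made-up vectors standing in for stored embeddings.
database = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]])

print(range_search(np.array([1.0, 0.0]), database, radius=0.55, limit=4))
```

Unlike KNN, the number of results is not fixed: here only two of the four vectors fall inside the radius, so only two come back even though the limit is 4.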

## Range Queries

- In Python:

In [None]:
range_query = (
    Query('@vector:[VECTOR_RANGE $range $query_vector]=>{$YIELD_DISTANCE_AS: vector_score}') 
    .sort_by('vector_score')
    .return_fields('vector_score', 'id', 'name', 'color', 'type', 'year', 'description')
    .paging(0, 4)
    .dialect(2)
)

## 🏃🏾‍♀️Running the Query

In [None]:
preso.create_query_table(client, INDEX_NAME, range_query, queries, encoded_queries, {'range': 0.75})

## 📸 Searching using Similar Images

* Just like we did with the text descriptions, we can use the images associated with the products
* The "A picture is worth 1000 words" adage applies well to searching

In [None]:
from PIL import Image
sample_image = Image.open(fashion17445['image_url'])
sample_image 

## 📸 Finding an Image Embedding Model

- For generating Image Embeddings I've chosen the CLIP model
- Specifically the `clip-ViT-B-32` which maps text and images to a shared vector space

In [None]:
image_embedder = SentenceTransformer('clip-ViT-B-32')

sample_image_224x224 = sample_image.convert('RGB').resize((224, 224))

image_embedding = image_embedder.encode(sample_image_224x224)
IMAGE_VECTOR_DIMENSION = len(image_embedding)
IMAGE_VECTOR_DIMENSION

In [None]:
print(image_embedding.tolist()[:5])

##  📸 Generate the Image Embeddings for the Demo Products

- As before, we use the `encode` function over the collection of products:

In [None]:
image_urls = client.json().mget(keys, '$.image_url')
image_embeddings = []

for filepath, key in zip(image_urls, keys):    
    image = Image.open(filepath[0]).convert('RGB')
    image = image.resize((224, 224))
    image_embeddings.append(image_embedder.encode(image).astype(np.float32).tolist())

## 🪣 Store the Embeddings in Redis

* Add the new `image_embeddings` field to the JSON documents
* Again we use Redis "pipeline" mode to minimize the network trips

In [None]:
pipeline = client.pipeline()

for key, image_embedding in zip(keys, image_embeddings):
    pipeline.json().set(key, '$.image_embeddings', image_embedding)

pipeline.execute()
print('Vector Image Embeddings Saved!')

## Update the Search Index

* To enable searching against the created image embeddings we alter the schema
* Redis provides the [FT.ALTER](https://redis.io/commands/ft.alter/) command to do so
* Available in `redis-py` as `alter_schema_add`

In [None]:
client.ft(INDEX_NAME).alter_schema_add(
    VectorField('$.image_embeddings', 'FLAT', {
      'TYPE': 'FLOAT32',
      'DIM': IMAGE_VECTOR_DIMENSION,
      'DISTANCE_METRIC': 'COSINE',
    },  as_name='image_vector')
)

## Finding a "Query Image"

* Let's pick a random image from the database to search for similar items visually:

In [None]:
import random

random_index = random.randint(0, len(image_urls) - 1)
query_image = Image.open(image_urls[random_index][0])
query_image

## Image Similarity Vector Query

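* As before, we'll use a KNN query, this time with K=7 and `paging(1, 6)` to skip the top hit (expected to be the query image itself), leaving 6 matches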

In [None]:
image_query = (
    Query('(*)=>[KNN 7 @image_vector $query_vector AS image_vector_score]')
     .sort_by('image_vector_score')
     .return_fields('image_vector_score', 'id')
     .paging(1, 6)
     .dialect(2)
)

## Running the Query

* We run the query by "vectorizing" the query image...
* Then we grab the images corresponding to the matched documents

In [None]:
results = client.ft(INDEX_NAME).search(image_query, { 'query_vector': np.array(image_embeddings[random_index], dtype=np.float32).tobytes() }).docs

image_matches_urls = client.json().mget([doc.id for doc in results], '$.image_url')
image_matches = [Image.open(url[0]) for url in image_matches_urls] 

## Visualizing the Results

* Since we're in a Jupyter Notebook, let's use Matplotlib to show a grid of our visual matches:

In [None]:
preso.plot_images(2, 3, image_matches)

## Recap

- The tools and techniques to unlock the value in **Unstructured Data** have evolved greatly...
- Redis' **in-memory first** approach makes it a perfect fit for vector similarity searches
- Redis natively supports vector searches over **Hashes** and **JSON**
- Redis combines the power of searching over semi-structured and unstructured data
  - with the performance you've come to expect from Redis 



## https://github.com/bsbodden/redis-vss-py

## Learn more at Redis University

## `https://university.redis.com`

![Redis U](./images/redis_university.png)

## Thank You!

![BSB](./images/bsb.png)