<a href="https://colab.research.google.com/github/antonum/Redis-VSS-Streamlit/blob/main/vector_embeddings_redis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vector Similarity Search with Redis

[Always-on demo](https://antonum-redis-vss-streamlit-streamlit-app-p4z5th.streamlitapp.com/)

[GitHub repo](https://github.com/antonum/Redis-VSS-Streamlit)

![Redis](https://redis.com/wp-content/themes/wpx/assets/images/logo-redis.svg?auto=webp&quality=85,75&width=120)

This notebook generates vector embeddings with the pretrained `sentence-transformers/all-distilroberta-v1` model from Hugging Face, loads them into Redis, and runs vector similarity search against the Redis database.

In [None]:
#install Redis client and Hugging Face sentence transformers
!pip install redis sentence_transformers

In [26]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from tqdm.auto import tqdm
from redis import Redis
from redis.commands.search.field import (
    NumericField,
    TagField,
    TextField,
    VectorField,
)
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query


tqdm.pandas()

#load pre-trained model from HuggingFace
model = SentenceTransformer('sentence-transformers/all-distilroberta-v1')

Download the dataset of 12k+ labelled tweets

In [None]:
!wget https://raw.githubusercontent.com/antonum/Redis-VSS-Streamlit/main/Labelled_Tweets.csv

In [None]:
df = pd.read_csv('Labelled_Tweets.csv')
#df=df.head(1000) #trim dataframe to fit results into 30MB Redis database
df


Generate vector embeddings within the dataframe

In [None]:
def text_to_embedding(text):
  return model.encode(text).astype(np.float32).tobytes()

#generate vector embeddings
df["text_embedding"] = df["full_text"].progress_apply(text_to_embedding)
df.head()
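
The embeddings above are stored as raw float32 bytes, which is the form the Redis vector field expects. As a minimal sketch of what that serialization does, the block below round-trips a vector through bytes and back. The random vector is only a stand-in for `model.encode(text)`, and `DIM = 768` is assumed to match the index dimension used later in this notebook.

```python
import numpy as np

DIM = 768  # assumed to match the "DIM" value of the index in this notebook

# Stand-in for model.encode(text): a random vector of the same length.
vector = np.random.default_rng(0).random(DIM)

# Same serialization as text_to_embedding: cast to float32, dump raw bytes.
blob = vector.astype(np.float32).tobytes()

# Each float32 takes 4 bytes, so the blob length should be DIM * 4.
assert len(blob) == DIM * 4

# Decode the bytes back into a float32 array to inspect a stored embedding.
decoded = np.frombuffer(blob, dtype=np.float32)
```

Checking the blob length this way is a cheap guard against a mismatch between the model output size and the index definition.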

Helper functions to:
- save the dataframe to Redis as HASH keys
- create the RediSearch index

In [30]:
def df_to_redis_hash(redis,df,key="tweet",pipesize=100):
  # write each row as a Redis HASH, sending commands in batches of `pipesize`
  pipe = redis.pipeline(transaction=False)
  for i in tqdm(range(len(df))):
    # .iloc is positional, so this also works if the dataframe index is not 0..n-1
    keyname = "{}:{}".format(key,df["id"].iloc[i])
    tweethash = {
        "text": df["full_text"].iloc[i],
        "text_embeddings": df["text_embedding"].iloc[i],
    }
    pipe.hset(keyname, mapping=tweethash)
    if ((i + 1) % pipesize == 0):
      pipe.execute()
      pipe = redis.pipeline(transaction=False)
  # flush the remaining commands
  pipe.execute()

def create_redis_index(redis, idxname="tweet:idx"):
  try:
    redis.ft(idxname).dropindex()
  except Exception:
    # most likely the index does not exist yet
    print("no index found")

  # Create an index
  indexDefinition = IndexDefinition(
      prefix=["tweet:"],
      index_type=IndexType.HASH,
  )

  redis.ft(idxname).create_index(
      (
          TextField("text", no_stem=False, sortable=True),
          VectorField("text_embeddings", "HNSW", {  "TYPE": "FLOAT32", 
                                                    "DIM": 768, 
                                                    "DISTANCE_METRIC": "COSINE",
                                                  })
      ),
      definition=indexDefinition
  )
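
The loader above queues one `HSET` per row on a pipeline and flushes it in batches. The following is a minimal sketch of that batching pattern only. `FakePipeline` and `load_in_batches` are hypothetical names introduced for illustration, not part of redis-py, and the stand-in class just records how many commands each flush would send.

```python
class FakePipeline:
    """Hypothetical stand-in for a Redis pipeline; it only records batch sizes."""

    def __init__(self):
        self.buffer = []
        self.flushed = []

    def hset(self, key, mapping):
        # Queue the command instead of sending it.
        self.buffer.append((key, mapping))

    def execute(self):
        # "Send" everything queued so far and remember the batch size.
        if self.buffer:
            self.flushed.append(len(self.buffer))
        self.buffer = []


def load_in_batches(pipe, rows, key="tweet", pipesize=100):
    """Queue one HSET per row and flush every `pipesize` rows."""
    for count, (row_id, text) in enumerate(rows, start=1):
        pipe.hset(f"{key}:{row_id}", mapping={"text": text})
        if count % pipesize == 0:
            pipe.execute()
    pipe.execute()  # flush whatever is left over


pipe = FakePipeline()
load_in_batches(pipe, [(i, f"tweet {i}") for i in range(250)], pipesize=100)
batch_sizes = pipe.flushed
```

With 250 rows and `pipesize=100`, the sketch flushes two full batches and one partial batch, which is why the final `execute()` after the loop matters.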



## Connect to Redis instance

You need Redis with RediSearch 2.4+ to complete the notebook. You can use a Redis Cloud instance, install [Redis Stack](https://redis.io/docs/stack/) within the Colab runtime, or run the notebook locally and connect it to a local Redis Stack instance.

### Create free Redis Cloud Subscription
Go to Redis.com and choose "Try Free".

Create a free 30MB Fixed subscription.

Capture the following values for the connection cell:
- “Public Endpoint” 
- “Default User Password”
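
The connection cell in this notebook takes the host and port as separate values. As a small sketch, assuming the public endpoint is given as a single `host:port` string, the hypothetical helper below splits it. The endpoint used here is a placeholder, not a real database.

```python
def split_endpoint(endpoint):
    """Split a 'host:port' string into (host, port)."""
    host, _, port = endpoint.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected 'host:port', got {endpoint!r}")
    return host, int(port)


# Placeholder endpoint for illustration only.
host, port = split_endpoint("redis-12345.example.com:12345")
```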

### Install redis-stack in the colab runtime

Create and run a new code cell:

```
%%sh
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update
sudo apt-get install -y redis-stack-server
redis-stack-server --daemonize yes
```

### Local Redis Stack

If you are running this notebook locally, you can use a local installation of [Redis Stack](https://redis.io/docs/stack/). The cell below installs and starts it with apt:

In [None]:
%%sh
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update
sudo apt-get install -y redis-stack-server
redis-stack-server --daemonize yes

In [32]:
# make sure to enter your values here!!!
#host = "redis-18900.c73.us-east-1-2.ec2.cloud.redislabs.com"
#port = 18900
#pwd="sDv0puwA3oMXNBe3e8gdcBQtYXXXXX"

host = "localhost"
port = 6379
pwd=""


# connect to Redis
redis = Redis(host=host, port=port, password=pwd)

# clear Redis database (optional)
redis.flushdb()

# create Index
create_redis_index(redis)

# load data from Dataframe to Redis HASH
df_to_redis_hash(redis,df,key="tweet", pipesize=100)

no index found


  0%|          | 0/12420 [00:00<?, ?it/s]

## Query the database

[Always-on Streamlit app](https://antonum-redis-vss-streamlit-streamlit-app-p4z5th.streamlitapp.com/)


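Try queries like “Oil”, “Oil Reserves”, and “Fossil fuels”.

Lexical full-text search quickly runs out of matches, because it only returns tweets that contain the query terms.

Vector search continues to discover relevant tweets, because it ranks tweets by embedding similarity rather than by exact words.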

In [56]:
user_query="oil down"
# queries to try "oil reserve", "fossil fuels"

In [None]:
#using Full Text Index
q = Query(user_query)\
  .return_fields("text")
res = redis.ft("tweet:idx").search(q)
# errors="ignore" avoids a KeyError when the search returns no documents
res_df = pd.DataFrame([t.__dict__ for t in res.docs ]).drop(columns=["payload"], errors="ignore")
res_df

In [None]:
#using Vector Similarity Index
query_vector=model.encode(user_query).astype(np.float32).tobytes()
q = Query("*=>[KNN 10 @text_embeddings $vector AS result_score]")\
                .return_fields("result_score","text")\
                .dialect(2)\
                .sort_by("result_score", True)
res = redis.ft("tweet:idx").search(q, query_params={"vector": query_vector})
res_df = pd.DataFrame([t.__dict__ for t in res.docs ]).drop(columns=["payload"], errors="ignore")
res_df
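
To make `result_score` easier to interpret, here is a minimal brute-force sketch of a K-nearest-neighbors lookup with cosine distance (1 minus cosine similarity), so smaller scores mean closer matches. It illustrates the idea only and is not how Redis implements the search: the HNSW index is approximate, while this loop is exact. The function names and the toy 2-D vectors are hypothetical stand-ins for the 768-dimensional embeddings.

```python
import numpy as np


def cosine_distance(a, b):
    """Cosine distance: 1 - cosine similarity (0 means same direction)."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def brute_force_knn(query, vectors, k=10):
    """Return (index, distance) pairs for the k closest rows, nearest first."""
    distances = [cosine_distance(query, v) for v in vectors]
    order = np.argsort(distances)[:k]
    return [(int(i), distances[i]) for i in order]


# Toy 2-D vectors standing in for the stored tweet embeddings.
vectors = np.array(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]], dtype=np.float32
)
query = np.array([1.0, 0.1], dtype=np.float32)

neighbors = brute_force_knn(query, vectors, k=2)
```

Sorting ascending by distance mirrors the `.sort_by("result_score", True)` call in the query above.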