<a href="https://colab.research.google.com/github/antonum/Redis-VSS-Streamlit/blob/main/vector_embeddings_redis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vector Similarity Search with Redis

[Always-on demo](https://antonum-redis-vss-streamlit-streamlit-app-p4z5th.streamlitapp.com/)

[GitHub repo](https://github.com/antonum/Redis-VSS-Streamlit)

![Redis](https://redis.com/wp-content/themes/wpx/assets/images/logo-redis.svg?auto=webp&quality=85,75&width=120)

This notebook generates vector embeddings with the pretrained `sentence-transformers/all-distilroberta-v1` model from Hugging Face, loads them into Redis, and runs vector similarity search against the Redis database.

In [5]:
#install Redis client and Hugging Face sentence transformers
!pip install redis sentence_transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting redis
  Using cached redis-4.3.4-py3-none-any.whl (246 kB)
Collecting sentence_transformers
  Using cached sentence_transformers-2.2.2-py3-none-any.whl
Collecting deprecated>=1.2.3
  Using cached Deprecated-1.2.13-py2.py3-none-any.whl (9.6 kB)
Collecting sentencepiece
  Using cached sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub>=0.4.0
  Using cached huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
Collecting transformers<5.0.0,>=4.6.0
  Using cached transformers-4.21.3-py3-none-any.whl (4.7 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Using cached tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers, sentencepiece, deprecated, sentence-transformers, redis
Successfully installed deprecated-1.2.13 huggin

In [6]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from tqdm.auto import tqdm
from redis import Redis
from redis.commands.search.field import (
    NumericField,
    TagField,
    TextField,
    VectorField,
)
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query


tqdm.pandas()

#load pre-trained model from HuggingFace
model = SentenceTransformer('sentence-transformers/all-distilroberta-v1')


In [7]:
!wget https://raw.githubusercontent.com/antonum/Redis-VSS-Streamlit/main/Labelled_Tweets.csv

--2022-09-14 17:53:17--  https://raw.githubusercontent.com/antonum/Redis-VSS-Streamlit/main/Labelled_Tweets.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2486081 (2.4M) [text/plain]
Saving to: ‘Labelled_Tweets.csv.1’


2022-09-14 17:53:17 (258 MB/s) - ‘Labelled_Tweets.csv.1’ saved [2486081/2486081]



In [8]:
df = pd.read_csv('Labelled_Tweets.csv')
df=df.head(1000) #trim dataframe to fit results into 30MB Redis database
df


Unnamed: 0,id,created_at,full_text,score
0,1,2020-04-09 23:59:51+00:00,@KennyDegu very very little volume. With $10T ...,-0.7
1,2,2020-04-09 23:58:55+00:00,#ES_F achieved Target 2780 closing above 50% #...,0.0
2,3,2020-04-09 23:58:52+00:00,RT @KimbleCharting: Silver/Gold indicator crea...,-0.2
3,4,2020-04-09 23:58:27+00:00,@Issaquahfunds Hedged our $MSFT position into ...,-0.4
4,5,2020-04-09 23:57:59+00:00,RT @zipillinois: 3 Surprisingly Controversial ...,0.1
...,...,...,...,...
995,1018,2020-04-09 21:38:07+00:00,Stuck at home during #Coronavirus quarantine? ...,0.0
996,1019,2020-04-09 21:38:02+00:00,$AVGO #Broadcom Inc Broadcom Inc: 1 director s...,-0.1
997,1020,2020-04-09 21:37:49+00:00,RT @OptionsITrader: I am Short the following s...,-0.3
998,1021,2020-04-09 21:37:40+00:00,VIDEO - $DUST Stock Technical Analysis - 04-09...,0.0
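
The 1000-row cutoff is there to keep the data inside the 30MB database. As a rough sanity check, the sketch below estimates the raw size of the embeddings alone, assuming 768-dimensional float32 vectors (the `DIM` used in the index definition later in this notebook). The helper name `raw_vector_bytes` is made up for this illustration, and the figure ignores key, hash, and HNSW index overhead, which come on top.

```python
# Rough estimate of the raw embedding payload only.
# Redis key/hash overhead and the HNSW graph are NOT modelled here.
DIM = 768              # assumed embedding size, matches the index definition
BYTES_PER_FLOAT32 = 4
N_ROWS = 1000          # rows kept by df.head(1000)

def raw_vector_bytes(n_rows: int, dim: int = DIM) -> int:
    """Bytes needed to store n_rows float32 vectors of length dim."""
    return n_rows * dim * BYTES_PER_FLOAT32

payload = raw_vector_bytes(N_ROWS)
print(f"{payload} bytes ~ {payload / 1024 / 1024:.2f} MiB")
```

The raw vectors take only a fraction of the 30MB, which leaves headroom for the tweet text and the index structures.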


In [9]:
def text_to_embedding(text):
  return model.encode(text).astype(np.float32).tobytes()

#generate vector embeddings
df["text_embedding"] = df["full_text"].progress_apply(text_to_embedding)
df.head()


Unnamed: 0,id,created_at,full_text,score,text_embedding
0,1,2020-04-09 23:59:51+00:00,@KennyDegu very very little volume. With $10T ...,-0.7,b'\xba#\xe1;3\x8dQ\xbd\xb8K\xa4<S;\x82;\x93z6\...
1,2,2020-04-09 23:58:55+00:00,#ES_F achieved Target 2780 closing above 50% #...,0.0,b'\x08\x01l=f\xa9B\xbc\xc8\'|<\x07g\x8d;\xb3\x...
2,3,2020-04-09 23:58:52+00:00,RT @KimbleCharting: Silver/Gold indicator crea...,-0.2,b'\x8e\xd0\xc4\xb9>w{\xbdc\x17w\xbce\xaa\xdb;\...
3,4,2020-04-09 23:58:27+00:00,@Issaquahfunds Hedged our $MSFT position into ...,-0.4,b'\xdc2\x0b\xbbi\xba\x1e\xbbmUk<\xef\xfc\xcf\x...
4,5,2020-04-09 23:57:59+00:00,RT @zipillinois: 3 Surprisingly Controversial ...,0.1,b'\xe1K\x17\xbc\xdeXp\xbd\n\x03\x10<L\x0e`\xbd...
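
The `text_embedding` column holds raw bytes rather than arrays, because that is the form in which the vector is stored in Redis. The sketch below shows the round trip in isolation. It uses a random vector as a stand-in for `model.encode(text)`, so the values are not real embeddings; only the byte layout is the point.

```python
import numpy as np

DIM = 768  # assumed embedding size, matches the index definition

# Stand-in for model.encode(text): a random vector, NOT a real embedding.
vec = np.random.default_rng(0).random(DIM)        # float64 by default

# Same conversion as text_to_embedding: cast to float32, then dump raw bytes.
blob = vec.astype(np.float32).tobytes()

# Decode the bytes back into a vector, e.g. to inspect a stored value.
restored = np.frombuffer(blob, dtype=np.float32)

print(len(blob))        # 4 bytes per float32 component
print(restored.shape)
```

The cast to `float32` matters: the byte length has to equal `DIM * 4` for a `FLOAT32` vector field, and a float64 array would produce twice as many bytes.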


In [None]:
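def df_to_redis_hash(redis, df, key="tweet", pipesize=100):
  # store every row as a Redis HASH named "<key>:<id>", sending commands in batches
  pipe = redis.pipeline(transaction=False)
  rows = zip(df["id"], df["full_text"], df["text_embedding"])
  for n, (tweet_id, text, embedding) in enumerate(tqdm(rows, total=len(df)), start=1):
    keyname = "{}:{}".format(key, tweet_id)
    pipe.hset(keyname, mapping={"text": text, "text_embeddings": embedding})
    if n % pipesize == 0:
      # flush a full batch and start a new pipeline
      pipe.execute()
      pipe = redis.pipeline(transaction=False)
  # flush whatever is left in the last, partial batch
  pipe.execute()

def create_redis_index(redis, idxname="tweet:idx"):
  # drop a previous index with the same name, if there is one
  try:
    redis.ft(idxname).dropindex()
  except Exception:
    print("no index found")

  # index every HASH whose key starts with "tweet:"
  indexDefinition = IndexDefinition(
      prefix=["tweet:"],
      index_type=IndexType.HASH,
  )

  redis.ft(idxname).create_index(
      [
          # full-text field, required by the lexical query later in this notebook
          TextField("text"),
          # 768-dimensional float32 vectors, compared by cosine distance
          VectorField("text_embeddings", "HNSW", {  "TYPE": "FLOAT32",
                                                    "DIM": 768,
                                                    "DISTANCE_METRIC": "COSINE",
                                                  }),
      ],
      definition=indexDefinition
  )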
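
`df_to_redis_hash` sends its `HSET` commands through a pipeline and flushes it every `pipesize` rows, so the client makes one network round trip per batch instead of one per tweet. The batching logic on its own can be sketched without Redis as follows. The `batched` helper is a hypothetical name introduced for this illustration and is not part of redis-py.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch          # a full batch: this is where the pipeline would be executed
            batch = []
    if batch:
        yield batch              # the last, partial batch

# 250 rows with pipesize=100 give two full batches and one partial batch.
sizes = [len(b) for b in batched(range(250), 100)]
print(sizes)  # [100, 100, 50]
```

A larger batch means fewer round trips but more commands buffered in client memory before each flush.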



## Create a free Redis Cloud subscription
Sign up at Redis.com (Try Free).

Create a free 30MB Fixed subscription.

Capture the following values for the connection cell below:
- “Public Endpoint” 
- “Default User Password”
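
The connection code needs the host and the port as separate values. Assuming the Public Endpoint is shown as a single `host:port` string, it can be split as in the sketch below. The `split_endpoint` helper is hypothetical, and the endpoint string is just the example host and port used in this notebook.

```python
from typing import Tuple

def split_endpoint(endpoint: str) -> Tuple[str, int]:
    """Split a 'host:port' string into (host, port)."""
    host, _, port = endpoint.strip().rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected 'host:port', got {endpoint!r}")
    return host, int(port)

# Example value only; replace it with your own Public Endpoint.
host, port = split_endpoint(
    "redis-18900.c73.us-east-1-2.ec2.cloud.redislabs.com:18900"
)
print(host, port)
```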


In [None]:
# replace these with your own Public Endpoint (host and port) and Default User Password
host = "redis-18900.c73.us-east-1-2.ec2.cloud.redislabs.com"
port = 18900
pwd = "sDv0puwA3oMXNBe3e8gdcBQtYXXXXX"

# connect to Redis
redis = Redis(host=host, port=port, password=pwd)

# clear Redis database (optional)
redis.flushdb()

# create Index
create_redis_index(redis)

# load data from Dataframe to Redis HASH
df_to_redis_hash(redis,df,key="tweet", pipesize=100)

no index found



## Query the database

[Always-on Streamlit app](https://antonum-redis-vss-streamlit-streamlit-app-p4z5th.streamlitapp.com/)


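Try queries like “Oil”, “Oil Reserves”, or “Fossil fuels”.

Lexical full-text search quickly runs out of matches, because it only returns tweets that contain the query terms.

Vector search continues to discover relevant tweets, because it compares embeddings rather than exact words.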
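
Why lexical search runs out of matches can be seen in a toy example. The sketch below is a deliberately simplified stand-in for a full-text engine: it requires every query term to appear in the document and does no stemming. The three documents are made-up examples, not rows from the dataset.

```python
import re
from typing import List, Set

def tokens(text: str) -> Set[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def lexical_match(query: str, docs: List[str]) -> List[str]:
    """Return the documents that contain ALL query terms."""
    wanted = tokens(query)
    return [d for d in docs if wanted <= tokens(d)]

# Made-up documents, all related to the oil market.
docs = [
    "Oil prices fall as demand collapses",
    "OPEC agrees to cut crude production",
    "Tankers are the big winners this year",
]

print(lexical_match("oil", docs))           # only the first document contains the term
print(lexical_match("fossil fuels", docs))  # no document contains both terms
```

All three documents are about the same topic, yet the second query matches none of them. A vector query has no such requirement on shared words.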

In [None]:
user_query="oil down"

#using Full Text Index
q = Query(user_query)\
  .return_fields("text")
res = redis.ft("tweet:idx").search(q)
res_df = pd.DataFrame([t.__dict__ for t in res.docs ]).drop(columns=["payload"])
res_df

Unnamed: 0,id,text
0,tweet:1017,"RT @leadlagreport: Oil services, Oil &amp; Gas..."


In [None]:
#using Vector Similarity Index
query_vector=model.encode(user_query).astype(np.float32).tobytes()
q = Query("*=>[KNN 10 @text_embeddings $vector AS result_score]")\
                .return_fields("result_score","text")\
                .dialect(2)\
                .sort_by("result_score", True)
res = redis.ft("tweet:idx").search(q, query_params={"vector": query_vector})
res_df = pd.DataFrame([t.__dict__ for t in res.docs ]).drop(columns=["payload"])
res_df

Unnamed: 0,id,result_score,text
0,tweet:531,0.455647051334,#OIL Sentiment ($22.50)\r\n\r\nWhat’s next for...
1,tweet:204,0.488326072693,Bad news for #oil. It’s going to between $10 ...
2,tweet:178,0.53811788559,OH how bullish for #oil LOL\r\n\r\n#OOTT #Oi...
3,tweet:813,0.547377109528,Tankers Are the Big Winners of the 2020 Oil Cr...
4,tweet:751,0.565672159195,Need little help w/your #WTI #OIL #FUTURES #CL...
5,tweet:304,0.590872764587,"""OPEC and allies agree to historic 10 million ..."
6,tweet:761,0.59246712923,OPEC and allies agree to historic 10 million b...
7,tweet:311,0.599368274212,"Oil Prices Rise, Fall As Russia, Saudi Arabia ..."
8,tweet:883,0.623463749886,RT @khmerxbxboi: $XOM $CHK $HUSA = Sunday afte...
9,tweet:636,0.626247644424,Told you Saudi Arabia will bend the knee @jimc...
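
With `DISTANCE_METRIC` set to `COSINE`, `result_score` is a cosine distance, that is, 1 minus the cosine similarity, so lower values mean closer matches. The sketch below computes the same ranking by brute force on a handful of made-up 2-dimensional vectors. It illustrates the metric only: the HNSW index in Redis is an approximate search and works differently internally.

```python
import numpy as np

def cosine_distance(query: np.ndarray, vectors: np.ndarray) -> np.ndarray:
    """Cosine distance (1 - cosine similarity) from query to every row of vectors."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return 1.0 - v @ q

def knn(query: np.ndarray, vectors: np.ndarray, k: int):
    """Exact k nearest neighbours as (row index, distance), closest first."""
    dist = cosine_distance(query, vectors)
    order = np.argsort(dist)[:k]
    return [(int(i), float(dist[i])) for i in order]

# Made-up toy vectors, NOT real embeddings.
vectors = np.array(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]], dtype=np.float32
)
query = np.array([1.0, 0.1], dtype=np.float32)

print(knn(query, vectors, k=2))
```

Row 0 points in almost the same direction as the query and ranks first, followed by row 2. Row 3 points the opposite way and has a distance close to 2, the maximum for this metric.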
