<a href="https://colab.research.google.com/github/antonum/Redis-VSS-Streamlit/blob/main/vector_embeddings_redis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vector Similarity Search with Redis

[Always-on demo](https://antonum-redis-vss-streamlit-streamlit-app-p4z5th.streamlit.app/)

[GitHub repo](https://github.com/antonum/Redis-VSS-Streamlit)

![Redis](https://redis.com/wp-content/themes/wpx/assets/images/logo-redis.svg?auto=webp&quality=85,75&width=120)

This notebook generates vector embeddings with the pretrained `sentence-transformers/all-MiniLM-L6-v2` model from Hugging Face, loads them into Redis, and runs vector similarity search against the Redis database.

In [1]:
#install Redis client and Hugging Face sentence transformers
!pip install -q redis sentence_transformers



Install Redis Stack locally

In [2]:
%%sh
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg 
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list 
sudo apt-get update  > /dev/null 2>&1
sudo apt-get install -y redis-stack-server > /dev/null 2>&1
redis-stack-server --daemonize yes 


deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb focal main
Starting redis-stack-server, database path /var/lib/redis-stack


### Connect to the Redis server

In [3]:
import os
import redis

# Connection settings come from the environment, with defaults for the local Redis Stack
REDIS_HOST = os.getenv("REDIS_HOST", "localhost")
REDIS_PORT = int(os.getenv("REDIS_PORT", "6379"))
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")
# Replace the values above with your own if using a Redis Cloud instance
#REDIS_HOST="redis-12110.c82.us-east-1-2.ec2.cloud.redislabs.com"
#REDIS_PORT=12110
#REDIS_PASSWORD="pobhBJP7Psicp2gV0iqa2ZOc1WdXXXXX"

# Shortcut so shell cells can run: redis-cli $REDIS_CONN <command>
if REDIS_PASSWORD != "":
  os.environ["REDIS_CONN"] = f"-h {REDIS_HOST} -p {REDIS_PORT} -a {REDIS_PASSWORD} --no-auth-warning"
else:
  os.environ["REDIS_CONN"] = f"-h {REDIS_HOST} -p {REDIS_PORT}"

# Note: this rebinds the name `redis` from the module to the client; later cells use it as the client
redis = redis.Redis(
  host=REDIS_HOST,
  port=REDIS_PORT,
  password=REDIS_PASSWORD)
redis.ping()

True

In [4]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from tqdm.auto import tqdm
from redis import Redis
from redis.commands.search.field import (
    NumericField,
    TagField,
    TextField,
    VectorField,
)
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query


tqdm.pandas()



### Embedding generation model

The embeddings come from [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) on Hugging Face, which maps each text to a 384-dimensional vector.



In [5]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')


Download 12k+ tweets

In [6]:
!wget https://raw.githubusercontent.com/antonum/Redis-VSS-Streamlit/main/Labelled_Tweets.csv

--2023-06-05 16:52:47--  https://raw.githubusercontent.com/antonum/Redis-VSS-Streamlit/main/Labelled_Tweets.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2486081 (2.4M) [text/plain]
Saving to: ‘Labelled_Tweets.csv’


2023-06-05 16:52:47 (304 MB/s) - ‘Labelled_Tweets.csv’ saved [2486081/2486081]



In [7]:
df = pd.read_csv('Labelled_Tweets.csv').drop(columns=['created_at','score'])
#df=df.head(3000) #trim dataframe to fit results into 30MB Redis database
df


Unnamed: 0,id,full_text
0,1,@KennyDegu very very little volume. With $10T ...
1,2,#ES_F achieved Target 2780 closing above 50% #...
2,3,RT @KimbleCharting: Silver/Gold indicator crea...
3,4,@Issaquahfunds Hedged our $MSFT position into ...
4,5,RT @zipillinois: 3 Surprisingly Controversial ...
...,...,...
12415,12587,RT @PeterLBrandt: $SPX $ES_F \r\nFollowing thi...
12416,12588,RT @vieiraUAE: Fearless Alex Vieira Calls Best...
12417,12589,$spy $spx $qqq $ndx #nyse going from poking th...
12418,12590,RT @DavidScottAdams: On watch tomorrow // Pt. ...


### Generate Embeddings

Generate a vector embedding for every tweet and store it in a new dataframe column. This step can take 2-3 minutes on a GPU runtime for all 12k records.

In [8]:
def text_to_embedding(text):
  return model.encode(text).astype(np.float32).tobytes()

#generate vector embeddings
df["text_embedding"] = df["full_text"].progress_apply(text_to_embedding)
df.head()

  0%|          | 0/12420 [00:00<?, ?it/s]

Unnamed: 0,id,full_text,text_embedding
0,1,@KennyDegu very very little volume. With $10T ...,b')\x92\x81\xbd\x12h\x8b\xbd}\xdf\xe4\xbc\xbf\...
1,2,#ES_F achieved Target 2780 closing above 50% #...,b'S\x1b\x02\xbd\x16~/\xbd\x9bz\xb1\xbc`\x99\xd...
2,3,RT @KimbleCharting: Silver/Gold indicator crea...,b'\x0f\xaa\xa3\xbdc}\x10\xbd\xcc\xe8\xb9=(\x08...
3,4,@Issaquahfunds Hedged our $MSFT position into ...,b'\xc1\x7f\xd1\xbc\xbc\n`\xbd79 =\xe4\xc0\xef=...
4,5,RT @zipillinois: 3 Surprisingly Controversial ...,b'\xca\r\x1e\xbdV\\\xd4\xbcL/\xa1\xbc\xe8q7=\x...
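
Each value in `text_embedding` is the raw byte form of a `float32` vector. The sketch below illustrates the round trip and the resulting storage footprint. It assumes a 384-dimensional vector (the output size of `all-MiniLM-L6-v2`) and uses a random array as a stand-in for `model.encode(...)`, so it runs without downloading the model. The names `fake_embedding` and `n_rows` are illustrative only.

```python
import numpy as np

# Stand-in for model.encode(text): a random 384-dimensional vector (assumption for illustration).
rng = np.random.default_rng(0)
fake_embedding = rng.random(384)  # float64 by default

# Same conversion as text_to_embedding: cast to float32, then serialize to raw bytes.
blob = fake_embedding.astype(np.float32).tobytes()

# Reading the bytes back requires knowing the dtype; the length follows from the byte count.
restored = np.frombuffer(blob, dtype=np.float32)

bytes_per_vector = len(blob)   # 384 values * 4 bytes each
n_rows = 12420                 # row count reported by the progress bar above
total_mib = n_rows * bytes_per_vector / 2**20

print(bytes_per_vector, restored.shape, round(total_mib, 1))
```

At 1536 bytes per vector, the raw embeddings alone come to roughly 18 MiB before any Redis or index overhead, which is one reason the notebook offers the option of trimming the dataframe for a 30MB database.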


### Create Helper Functions
 
- Save dataframe to Redis HASH
- Create RediSearch Index

In [9]:
def load_dataframe(redis, df, key_prefix="tweet", id_column="id", pipe_size=100):
    # write each dataframe row to a Redis HASH, flushing the pipeline every pipe_size records
    records = df.to_dict(orient="records")
    pipe = redis.pipeline()
    for i, record in enumerate(tqdm(records), start=1):
        key = f"{key_prefix}:{record[id_column]}"
        pipe.hset(key, mapping=record)
        if i % pipe_size == 0:
            pipe.execute()
    # send whatever is left in the last, partially filled batch
    pipe.execute()

def create_redis_index(redis, idxname="tweet:idx"):
    # drop the existing index, if any, so the function can be re-run
    try:
        redis.ft(idxname).dropindex()
    except Exception:
        print("no index found")

    # index every HASH whose key starts with "tweet:"
    indexDefinition = IndexDefinition(
        prefix=["tweet:"],
        index_type=IndexType.HASH,
    )

    redis.ft(idxname).create_index(
        (
            TextField("full_text", no_stem=False, sortable=False),
            # DIM must match the embedding size of all-MiniLM-L6-v2 (384)
            VectorField("text_embedding", "HNSW", {
                "TYPE": "FLOAT32",
                "DIM": 384,
                "DISTANCE_METRIC": "COSINE",
            }),
        ),
        definition=indexDefinition,
    )
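
`load_dataframe` buffers `HSET` commands in a pipeline and sends them in batches instead of making one network round trip per record. The following sketch shows only that batching pattern. `FakePipeline` and `load_in_batches` are hypothetical names invented for this illustration; `FakePipeline` is not part of redis-py and merely counts how many commands each `execute()` call would send.

```python
class FakePipeline:
    """Hypothetical stand-in for a redis-py pipeline; records the size of each flushed batch."""

    def __init__(self):
        self.pending = 0
        self.batch_sizes = []

    def hset(self, key, mapping):
        self.pending += 1  # the command is buffered, not sent

    def execute(self):
        self.batch_sizes.append(self.pending)  # one round trip for the whole buffer
        self.pending = 0


def load_in_batches(pipe, records, pipe_size=100):
    # Buffer one HSET per record and flush every pipe_size records.
    for i, record in enumerate(records, start=1):
        pipe.hset(f"tweet:{record['id']}", mapping=record)
        if i % pipe_size == 0:
            pipe.execute()
    pipe.execute()  # flush the final, partially filled batch


pipe = FakePipeline()
load_in_batches(pipe, [{"id": n} for n in range(250)], pipe_size=100)
print(pipe.batch_sizes)  # [100, 100, 50]
```

A larger `pipe_size` means fewer round trips but more commands held in client memory at once; 100 is simply the default used in this notebook.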



### Create index and load data to Redis

In [10]:
# clear Redis database (optional)
redis.flushdb()

# create Index
create_redis_index(redis)

# load data from Dataframe to Redis HASH
load_dataframe(redis,df,key_prefix="tweet", pipe_size=100)


no index found


  0%|          | 0/12420 [00:00<?, ?it/s]

In [11]:
#Check how the data is stored in Redis
!redis-cli $REDIS_CONN hgetall "tweet:1001"

1) "id"
2) "1001"
3) "text_embedding"
4) "\xd8\x1e\x03:\xdfb\xbd\xbcA\xb0\x94\xbc1K)<\xf4T\x94;%>\x11\xbd\xae3\r=\t\x98\x7f=\x99X\x88<uUR\xbd\xbc\xf6h\xbci\xb5\xd2\xbb\x15;\t\xbd\xf9m#<\xe7\xd4C=\x91\xc8\"\xbd\x1b\xa3-\xbd0^\t\xbdv\x00\xfd\xbc\xb6GE\xbdn\xadZ\xbc]\xae=\xbdH<\x99<\xf2b\xce\xbc\b`\xab<4DT\xbcmM\xd7\xbc\xdb\xf0\x82\xbb\x8e\xb0(=Gv\x8a\xbc\xd1\xa8M\xbdd,P=\bS\x86<\x8fj\xfa<??\xb5<\xd9[\xee\xbb\xcf\x8d\x90=\rW8=R\x81\x82\xbd\x96\xfb&<\xf2-:=Y\xe6\xe4\xbd\xae\x1d\x11\xbc\xd4\xb7\xac\xbd\xfbY<\xbbi\xc9\xcc\xbb\xf29b\xbcI\xe3\x90=0\x9f\xa7=\xef7\x0f=*\xbb\xbb\xbbP\xad\x17=\xb8\xaeE\xbd\xe2\x8e\x90=r\xd7\x02\xbd\xe0\x01\x9f\xbcv\xc3e;\a\xa5\xd8\xbc\x80{f<\x9bx\xaf\xbd\xf2\xc4\x89\xbcPo\x03=wB\x8e\xbc4\x91G;\xf8\xe6\x97\xba\xbd\xc1\x9a=\x12\xf7B\xbd\xb5\r\xca<_\xf2G\xbb\xbf\x0e\x86<\xb2r\x84<\xcd\xf7^\xbd5\xc9\xbe\xbd`K5\xbc\"\xe09\xbd\xc4\xd7==5y\xb3=\xad\x10\xb4\xbdr\xfc\x11\xbd\x11\x87O\xbc\xc9o<\xbd}\x87\x80\xbc\x1e\xa9\"\xbdV) \xbd\xe2\xfa\x85\xba\xc7y\x95=\xf2\xdf\xa6<\x03

## Query the database

[Always-on Streamlit app](https://antonum-redis-vss-streamlit-streamlit-app-p4z5th.streamlit.app/)


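Try queries like “Oil”, “Oil Reserves”, and “Fossil fuels”.

Lexical full-text search quickly runs out of matches, because it only returns tweets that contain the query terms.

Vector search continues to discover relevant tweets, because it ranks tweets by how close their embeddings are to the query embedding.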
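
To see why term matching runs dry, consider the deliberately naive sketch below. It is not how RediSearch works internally (the real full-text engine also tokenizes, stems, and scores results); `naive_term_match` is a hypothetical helper and the sentences are made up for illustration, not taken from the tweet dataset.

```python
def naive_term_match(query, documents):
    """Return the documents that contain every query word (case-insensitive, whitespace-split)."""
    terms = query.lower().split()
    return [
        doc for doc in documents
        if all(term in doc.lower().split() for term in terms)
    ]


# Made-up example sentences, not rows from the tweet dataset.
documents = [
    "Oil price drops again",
    "Crude futures slide as demand weakens",
    "Gasoline gets cheaper at the pump",
]

print(naive_term_match("oil price", documents))  # ['Oil price drops again']
```

The second and third sentences are about the same topic but share no words with the query, so a purely lexical match never returns them. An embedding-based search can still rank them as close neighbours.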

In [43]:
user_query="oil price"
# queries to try "oil reserve", "fossil fuels"

In [44]:
#using Full Text Index
q = Query(user_query)\
  .return_fields("full_text")
res = redis.ft("tweet:idx").search(q)
if res.total==0:
  print("No matches found")
else:
  res_df = pd.DataFrame([t.__dict__ for t in res.docs ]).drop(columns=["payload"])
  display(res_df)

Unnamed: 0,id,full_text
0,tweet:3220,The relative performance of TIPS has historica...
1,tweet:311,"Oil Prices Rise, Fall As Russia, Saudi Arabia ..."
2,tweet:1490,"RT @Benzinga: Oil Prices Rise, Fall As Russia,..."
3,tweet:1585,"RT @Benzinga: Oil Prices Rise, Fall As Russia,..."
4,tweet:1610,"Oil Prices Rise, Fall As Russia, Saudi Arabia ..."
5,tweet:7189,Do higher oil prices help the consumer and sma...
6,tweet:636,Told you Saudi Arabia will bend the knee @jimc...
7,tweet:5405,https://t.co/3IJBXa5wuf Historic oil price plu...
8,tweet:5406,Historic oil price plunge trashes sector's pro...
9,tweet:3865,Today's book recommendation goes for the winne...


In [14]:
#using Vector Similarity Index
query_vector=text_to_embedding(user_query)
# KNN 10: return the 10 nearest neighbours of $vector; the distance is exposed as result_score
q = Query("*=>[KNN 10 @text_embedding $vector AS result_score]")\
                .return_fields("result_score","full_text")\
                .dialect(2)\
                .sort_by("result_score", True)  # ascending: smallest distance (closest match) first
res = redis.ft("tweet:idx").search(q, query_params={"vector": query_vector})
res_df = pd.DataFrame([t.__dict__ for t in res.docs ]).drop(columns=["payload"])
res_df

Unnamed: 0,id,result_score,full_text
0,tweet:444,0.369450807571,Would you spend $2 more a gallon of gasoline i...
1,tweet:11529,0.371095597744,RT @tradingcrudeoil: Crude oil closed up $0.48...
2,tweet:5654,0.381934285164,..and oil still 25.74 LMAO &gt;&gt;&gt;NO DEM...
3,tweet:204,0.396132171154,Bad news for #oil. It’s going to between $10 ...
4,tweet:9189,0.409308671951,Oil erases gains for the day in fall to $25 ht...
5,tweet:7189,0.429816782475,Do higher oil prices help the consumer and sma...
6,tweet:9330,0.430081248283,The price of Texas intermediate oil (WTI) slum...
7,tweet:531,0.431391596794,#OIL Sentiment ($22.50)\r\n\r\nWhat’s next for...
8,tweet:178,0.441844820976,OH how bullish for #oil LOL\r\n\r\n#OOTT #Oi...
9,tweet:6867,0.442979931831,$DXY 99.55-0.57%&lt;==US Dollar lower #Fed $2....
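
Because the index uses `DISTANCE_METRIC` `COSINE`, `result_score` is a distance, so smaller values mean closer matches. The sketch below is a brute-force NumPy equivalent of a KNN query, assuming the score is computed as 1 - cosine similarity. It is an exact search over toy 2-D vectors, whereas the HNSW index in Redis is approximate; `cosine_distance` and `brute_force_knn` are hypothetical helper names.

```python
import numpy as np


def cosine_distance(query, vectors):
    """Cosine distance (1 - cosine similarity) between one query and each row of vectors."""
    query = query / np.linalg.norm(query)
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return 1.0 - vectors @ query


def brute_force_knn(query, vectors, k):
    """Indices and distances of the k rows closest to the query, nearest first."""
    distances = cosine_distance(query, vectors)
    order = np.argsort(distances)[:k]
    return order, distances[order]


# Toy 2-D vectors (illustrative values, not real embeddings).
vectors = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.0])

idx, dist = brute_force_knn(query, vectors, k=3)
print(idx, np.round(dist, 3))
```

An identical direction gives a distance of 0, an orthogonal one gives 1, and an opposite one gives 2, which helps when reading the 0.37 to 0.44 range in the results above.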
