<a href="https://colab.research.google.com/github/antonum/Redis-VSS-Streamlit/blob/main/vector_embeddings_redis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vector Similarity Search with Redis

[Always-on demo](https://antonum-redis-vss-streamlit-streamlit-app-p4z5th.streamlitapp.com/)

[GitHub repo](https://github.com/antonum/Redis-VSS-Streamlit)

![Redis](https://redis.com/wp-content/themes/wpx/assets/images/logo-redis.svg?auto=webp&quality=85,75&width=120)

This notebook generates vector embeddings using pretrained sentence-transformers/all-distilroberta-v1 model from HuggingFace, loads them to Redis and runs Vector Similarity search against Redis database. 

In [25]:
#install Redis client and Hugging Face sentence transformers
!pip install redis sentence_transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [26]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from tqdm.auto import tqdm
from redis import Redis
from redis.commands.search.field import (
    NumericField,
    TagField,
    TextField,
    VectorField,
)
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query


tqdm.pandas()

#load pre-trained model from HuggingFace
model = SentenceTransformer('sentence-transformers/all-distilroberta-v1')

Download 12k+ tweets

In [27]:
!wget https://raw.githubusercontent.com/antonum/Redis-VSS-Streamlit/main/Labelled_Tweets.csv

--2022-10-21 13:04:01--  https://raw.githubusercontent.com/antonum/Redis-VSS-Streamlit/main/Labelled_Tweets.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2486081 (2.4M) [text/plain]
Saving to: ‘Labelled_Tweets.csv.1’


2022-10-21 13:04:02 (259 MB/s) - ‘Labelled_Tweets.csv.1’ saved [2486081/2486081]



In [28]:
df = pd.read_csv('Labelled_Tweets.csv')
#df=df.head(1000) #trim dataframe to fit results into 30MB Redis database
df


Unnamed: 0,id,created_at,full_text,score
0,1,2020-04-09 23:59:51+00:00,@KennyDegu very very little volume. With $10T ...,-0.7
1,2,2020-04-09 23:58:55+00:00,#ES_F achieved Target 2780 closing above 50% #...,0.0
2,3,2020-04-09 23:58:52+00:00,RT @KimbleCharting: Silver/Gold indicator crea...,-0.2
3,4,2020-04-09 23:58:27+00:00,@Issaquahfunds Hedged our $MSFT position into ...,-0.4
4,5,2020-04-09 23:57:59+00:00,RT @zipillinois: 3 Surprisingly Controversial ...,0.1
...,...,...,...,...
12415,12587,2020-04-09 03:38:33+00:00,RT @PeterLBrandt: $SPX $ES_F \r\nFollowing thi...,-0.2
12416,12588,2020-04-09 03:38:23+00:00,RT @vieiraUAE: Fearless Alex Vieira Calls Best...,0.0
12417,12589,2020-04-09 03:38:17+00:00,$spy $spx $qqq $ndx #nyse going from poking th...,0.2
12418,12590,2020-04-09 03:37:45+00:00,RT @DavidScottAdams: On watch tomorrow // Pt. ...,-0.1


Generate vector embeddings within the dataframe

In [29]:
def text_to_embedding(text):
  return model.encode(text).astype(np.float32).tobytes()

#generate vector embeddings
df["text_embedding"] = df["full_text"].progress_apply(text_to_embedding)
df.head()

  0%|          | 0/12420 [00:00<?, ?it/s]

Unnamed: 0,id,created_at,full_text,score,text_embedding
0,1,2020-04-09 23:59:51+00:00,@KennyDegu very very little volume. With $10T ...,-0.7,b'\xba#\xe1;3\x8dQ\xbd\xb8K\xa4<S;\x82;\x93z6\...
1,2,2020-04-09 23:58:55+00:00,#ES_F achieved Target 2780 closing above 50% #...,0.0,b'\x08\x01l=f\xa9B\xbc\xc8\'|<\x07g\x8d;\xb3\x...
2,3,2020-04-09 23:58:52+00:00,RT @KimbleCharting: Silver/Gold indicator crea...,-0.2,b'\x8e\xd0\xc4\xb9>w{\xbdc\x17w\xbce\xaa\xdb;\...
3,4,2020-04-09 23:58:27+00:00,@Issaquahfunds Hedged our $MSFT position into ...,-0.4,b'\xdc2\x0b\xbbi\xba\x1e\xbbmUk<\xef\xfc\xcf\x...
4,5,2020-04-09 23:57:59+00:00,RT @zipillinois: 3 Surprisingly Controversial ...,0.1,b'\xe1K\x17\xbc\xdeXp\xbd\n\x03\x10<L\x0e`\xbd...


Functions to: 
- save dataframe to Redis HASH
- Create RediSearch Index

In [30]:
def df_to_redis_hash(redis,df,key="tweet",pipesize=100):
  tweethash={}
  pipe = redis.pipeline(transaction=False)
  for i in tqdm(range(len(df["id"]))):
    keyname = "{}:{}".format(key,df["id"][i])
    tweethash["text"]=df["full_text"][i]
    tweethash["text_embeddings"]=df["text_embedding"][i]
    pipe.hset(keyname, mapping=tweethash)
    if (i % pipesize == 0):
      pipe.execute()
      pipe = redis.pipeline(transaction=False)
  pipe.execute()

def create_redis_index(redis, idxname="tweet:idx"):
  try:
    redis.ft(idxname).dropindex()
  except:
    print("no index found")

  # Create an index
  indexDefinition = IndexDefinition(
      prefix=["tweet:"],
      index_type=IndexType.HASH,
  )

  redis.ft(idxname).create_index(
      (
          TextField("text", no_stem=False, sortable=True),
          VectorField("text_embeddings", "HNSW", {  "TYPE": "FLOAT32", 
                                                    "DIM": 768, 
                                                    "DISTANCE_METRIC": "COSINE",
                                                  })
      ),
      definition=indexDefinition
  )



## Create free Redis Cloud Subscription
Redis.com - try free

Create Free 30MB Fixed subscription

Capture:
- “Public Endpoint” 
- “Default User Password”

## Install redis-stack in the colab runtime

Create and run new code cell:

```
%%sh
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update
sudo apt-get install redis-stack-server
redis-stack-server --daemonize yes
```

In [31]:
%%sh
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update
sudo apt-get install redis-stack-server
redis-stack-server --daemonize yes

deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb bionic main
Hit:1 http://security.ubuntu.com/ubuntu bionic-security InRelease
Hit:2 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease
Ign:3 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Hit:4 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Hit:5 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Hit:6 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease
Hit:8 http://archive.ubuntu.com/ubuntu bionic InRelease
Hit:9 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease
Hit:10 https://packages.redis.io/deb bionic InRelease
Hit:11 http://archive.ubuntu.com/ubuntu bionic-updates InRelease
Hit:12 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu bionic InRelease
Hit:13 http://archive.ubuntu.com/ubuntu bionic-backports InRelease

gpg: cannot open '/dev/tty': No such device or address
(23) Failed writing body


In [32]:
# make sure to enter your values here!!!
#host = "redis-18900.c73.us-east-1-2.ec2.cloud.redislabs.com"
#port = 18900
#pwd="sDv0puwA3oMXNBe3e8gdcBQtYXXXXX"

host = "localhost"
port = 6379
pwd=""


# connect to Redis
redis = Redis(host=host, port=port, password=pwd)

# clear Redis database (optional)
redis.flushdb()

# create Index
create_redis_index(redis)

# load data from Dataframe to Redis HASH
df_to_redis_hash(redis,df,key="tweet", pipesize=100)

no index found


  0%|          | 0/12420 [00:00<?, ?it/s]

## Query the database

[Alway-on Streamlit app](https://antonum-redis-vss-streamlit-streamlit-app-p4z5th.streamlitapp.com/) 


Try queries like:
“Oil”, “Oil Reserves”, “Fossil fuels”

Lexical Full Text search quickly runs out of matches

Vector search continues to discover relevant tweets

In [53]:
user_query="oil reserve"
# queries to try "oil reserve", "fossil fuels"

In [54]:
#using Full Text Index
q = Query(user_query)\
  .return_fields("text")
res = redis.ft("tweet:idx").search(q)
res_df = pd.DataFrame([t.__dict__ for t in res.docs ]).drop(columns=["payload"])
res_df

Unnamed: 0,id,text
0,tweet:12327,"Dow climbs about 780 points, oil rises as inve..."


In [55]:
#using Vector Similarity Index
query_vector=model.encode(user_query).astype(np.float32).tobytes()
q = Query("*=>[KNN 10 @text_embeddings $vector AS result_score]")\
                .return_fields("result_score","text")\
                .dialect(2)\
                .sort_by("result_score", True)
res = redis.ft("tweet:idx").search(q, query_params={"vector": query_vector})
res_df = pd.DataFrame([t.__dict__ for t in res.docs ]).drop(columns=["payload"])
res_df

Unnamed: 0,id,result_score,text
0,tweet:7755,0.454441964626,oil diving again... careful on the markets $spx
1,tweet:5452,0.465721070766,RT @SWEnergyReport: Tankers Are the Big Winner...
2,tweet:5537,0.504880189896,Tankers Are the Big Winners of the 2020 Oil Cr...
3,tweet:813,0.506099939346,Tankers Are the Big Winners of the 2020 Oil Cr...
4,tweet:8242,0.506734371185,This is the perfect chance to unload O&amp;G s...
5,tweet:6997,0.510683059692,#stocks opec looking to cut 20 million barrels
6,tweet:3880,0.510708749294,"Oil Slides After OPEC's Barkindo Warns Of ""Hor..."
7,tweet:204,0.516302108765,Bad news for #oil. It’s going to between $10 ...
8,tweet:8511,0.520390510559,according to Kuwait 🇰🇼 #oil minister the inten...
9,tweet:2443,0.521108627319,$SPY $SPX $USO OPEC deal is to cut only 10 mil...
