## Conduct vector similarity search on Azure OpenAI embeddings using Azure Managed Redis

- Tutorial: https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-tutorial-vector-similarity
- Code: https://github.com/Azure-Samples/azure-cache-redis-samples/tree/main/tutorial/vector-similarity-search-open-ai

### Install dependencies
Install the Python dependencies required for this notebook. Using a Python virtual environment is recommended.

In [None]:
# Code cell 1

! pip install openai num2words matplotlib plotly scipy scikit-learn pandas tiktoken redis python-dotenv langchain langchain_openai langchain_community langchain-redis
! pip install langchain-huggingface sentence-transformers
# ! pip install -r requirements.txt

Defaulting to user installation because normal site-packages is not writeable
Collecting num2words
  Downloading num2words-0.5.14-py3-none-any.whl.metadata (13 kB)
Collecting plotly
  Downloading plotly-6.0.1-py3-none-any.whl.metadata (6.7 kB)
Collecting scipy
  Downloading scipy-1.15.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting scikit-learn
  Downloading scikit_learn-1.6.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting tiktoken
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting langchain
  Downloading langchain-0.3.21-py3-none-any.whl.metadata (7.8 kB)
Collecting docopt>=0.6.2 (from num2words)
  Using cached docopt-0.6.2.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Collecting narwhals>=1.15.1 (from plotly)
  Downloading narwhals-1.31.0-py3-none-any.whl.metadata (11 kB)
Collecting joblib>=1.2.0 (from scikit-learn)
  Downl

### Import libraries and set up Azure OpenAI and Azure Managed Redis connection info
Fill in your Azure OpenAI and Azure Managed Redis information below. These values are used later to connect to both services, generate the embeddings, and load them into Redis. For simplicity, this example reads the values from a local `.env` file with `load_dotenv()` and keeps them in notebook variables. Outside of tutorials, it's strongly recommended to keep secrets in a secrets manager such as Azure Key Vault.

Note that there are differences between the `OpenAI` and `Azure OpenAI` endpoints. This example uses the configuration for `Azure OpenAI`. See [How to switch between OpenAI and Azure OpenAI endpoints with Python](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/switching-endpoints) for more details.
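A missing value usually only surfaces later as a confusing connection error, so it can help to check the settings up front. The following is a minimal sketch, not part of the original sample: the helper name `missing_settings` is hypothetical, and the setting names simply mirror the `os.getenv` calls in the next cell.

```python
import os

# Names of the settings this notebook reads; they mirror the os.getenv calls in the next cell.
REQUIRED_SETTINGS = (
    "API_KEY",
    "RESOURCE_ENDPOINT",
    "DEPLOYMENT_NAME",
    "MODEL_NAME",
    "REDIS_ENDPOINT",
    "REDIS_PASSWORD",
)


def missing_settings(env=None, required=REQUIRED_SETTINGS):
    """Return the names of required settings that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with a stand-in mapping instead of the real environment.
example_env = {"API_KEY": "placeholder", "RESOURCE_ENDPOINT": ""}
print(missing_settings(example_env))
```

Calling `missing_settings()` with no arguments right after `load_dotenv()` checks the real environment; an empty list means every setting was found.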

In [2]:
# Code cell 2
import re
import os
import pandas as pd
import tiktoken
from typing import List
from dotenv import load_dotenv
from num2words import num2words
from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.vectorstores.redis import Redis as RedisVectorStore
from langchain_community.document_loaders import DataFrameLoader

load_dotenv()

API_KEY = os.getenv('API_KEY')
RESOURCE_ENDPOINT = os.getenv('RESOURCE_ENDPOINT')
DEPLOYMENT_NAME = os.getenv('DEPLOYMENT_NAME')
MODEL_NAME = os.getenv('MODEL_NAME')
REDIS_ENDPOINT = os.getenv('REDIS_ENDPOINT')
REDIS_PASSWORD = os.getenv('REDIS_PASSWORD')

print(f"RESOURCE_ENDPOINT: {RESOURCE_ENDPOINT}")
print(f"REDIS_ENDPOINT: {REDIS_ENDPOINT}")
print(f"DEPLOYMENT_NAME: {DEPLOYMENT_NAME}")
print(f"MODEL_NAME: {MODEL_NAME}")


RESOURCE_ENDPOINT: https://amrdemocbx.openai.azure.com
REDIS_ENDPOINT: amrdemoscbx.australiaeast.redis.azure.net:10000
DEPLOYMENT_NAME: text-embedding-3-large
MODEL_NAME: text-embedding-3-large


### Import dataset

This example uses the [Wikipedia Movie Plots](https://www.kaggle.com/datasets/jrobischon/wikipedia-movie-plots) dataset from Kaggle. Download the CSV file and place it in the same directory as this Jupyter notebook.

In [2]:
# Code cell 3

df=pd.read_csv(os.path.join(os.getcwd(),'wiki_movie_plots_deduped.csv'))
df

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot
0,1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Sm...,"A bartender is working at a saloon, serving dr..."
1,1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Ligh...,"The moon, painted with a smiling face hangs ov..."
2,1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Pre...,"The film, just over a minute long, is composed..."
3,1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_...",Lasting just 61 seconds and consisting of two ...
4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Bea...,The earliest known adaptation of the classic f...
...,...,...,...,...,...,...,...,...
34881,2014,The Water Diviner,Turkish,Director: Russell Crowe,Director: Russell Crowe\r\nCast: Russell Crowe...,unknown,https://en.wikipedia.org/wiki/The_Water_Diviner,"The film begins in 1919, just after World War ..."
34882,2017,Çalgı Çengi İkimiz,Turkish,Selçuk Aydemir,"Ahmet Kural, Murat Cemcir",comedy,https://en.wikipedia.org/wiki/%C3%87alg%C4%B1_...,"Two musicians, Salih and Gürkan, described the..."
34883,2017,Olanlar Oldu,Turkish,Hakan Algül,"Ata Demirer, Tuvana Türkay, Ülkü Duru",comedy,https://en.wikipedia.org/wiki/Olanlar_Oldu,"Zafer, a sailor living with his mother Döndü i..."
34884,2017,Non-Transferable,Turkish,Brendan Bradley,"YouTubers Shanna Malcolm, Shira Lazar, Sara Fl...",romantic comedy,https://en.wikipedia.org/wiki/Non-Transferable...,The film centres around a young woman named Am...


Process the dataset to rename columns whose titles contain spaces or slashes, then filter it to reduce its size. Filtering isn't required, but it reduces the time it takes to generate embeddings and load the index into Redis. Feel free to play around with the filters, or add your own!

In [3]:
# Code cell 4

df.insert(0, 'id', range(0, len(df)))
df['year'] = df['Release Year'].astype(int)
df['origin'] = df['Origin/Ethnicity'].astype(str)
del df['Release Year']
del df['Origin/Ethnicity']
df = df[df.year > 1970] # only movies made after 1970
df = df[df.origin.isin(['American','British','Canadian'])] # only movies from English-speaking cinema
df

Unnamed: 0,id,Title,Director,Cast,Genre,Wiki Page,Plot,year,origin
8626,8626,$ aka Dollars,Richard Brooks,"Warren Beatty, Goldie Hawn",unknown,https://en.wikipedia.org/wiki/$_(film),"Set in Hamburg, West Germany, several criminal...",1971,American
8627,8627,200 Motels,"Tony Palmer, Charles Swenson","Frank Zappa, Ringo Starr, Theodore Bikel",unknown,https://en.wikipedia.org/wiki/200_Motels,"In 200 Motels, the film attempts to portray th...",1971,American
8628,8628,The Anderson Tapes,Sidney Lumet,"Sean Connery, Dyan Cannon, Christopher Walken,...",unknown,https://en.wikipedia.org/wiki/The_Anderson_Tapes,"Burglar John ""Duke"" Anderson (Sean Connery) is...",1971,American
8629,8629,The Andromeda Strain,Robert Wise,"Arthur Hill, James Olson, Kate Reid, David Way...",unknown,https://en.wikipedia.org/wiki/The_Andromeda_St...,"After a satellite, a U.S. government project c...",1971,American
8630,8630,Bad Man's River,Eugenio Martin,"Lee Van Cleef, Gina Lollobrigida",unknown,https://en.wikipedia.org/wiki/Bad_Man%27s_River,Roy King's gang robs a bank and flees to Mexic...,1971,American
...,...,...,...,...,...,...,...,...,...
22428,22428,"Hochelaga, Land of Souls (Hochelaga terre des ...",François Girard,"Raoul Max Trujillo, Tanaya Beatty, David La Haye",historical drama,"https://en.wikipedia.org/wiki/Hochelaga,_Land_...","One night on the campus of McGill University, ...",2017,Canadian
22429,22429,Indian Horse,Stephen Campanelli,"Forrest Goodluck, Michiel Huisman, Michael Mur...",drama,https://en.wikipedia.org/wiki/Indian_Horse_(film),"The Indian Horse family, including six-year-ol...",2017,Canadian
22430,22430,The Little Girl Who Was Too Fond of Matches (L...,Simon Lavoie,,unknown,https://en.wikipedia.org/wiki/The_Little_Girl_...,"In rural 1930s Quebec, Alice lives in house wi...",2017,Canadian
22431,22431,Meditation Park,Mina Shum,"Sandra Oh, Liane Balaban, Don McKellar",drama,https://en.wikipedia.org/wiki/Meditation_Park,"Opened by Mandarin theme song, Meditation Park...",2017,Canadian


Normalize whitespace and stray punctuation in the `Plot` column so the text is cleaner before generating embeddings.

In [4]:
# Code cell 5

pd.options.mode.chained_assignment = None

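def normalize_text(s, sep_token=" \n "):
    """Collapse whitespace and clean up stray punctuation in the input text s."""
    # sep_token is unused; it is kept so the signature stays unchanged
    # collapse every run of whitespace (including newlines) into a single space
    s = re.sub(r'\s+', ' ', s).strip()
    # drop stray ". ," sequences (the dot is escaped so it matches a literal period)
    s = re.sub(r"\. ,", "", s)
    # collapse doubled periods
    s = s.replace("..", ".")
    s = s.replace(". .", ".")
    return s.strip()

df['Plot'] = df['Plot'].apply(normalize_text)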

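Calculate the number of tokens required to generate the embeddings for this dataset. The cell also drops any plot of 8,192 tokens or more, since very long inputs can exceed the embedding model's per-input limit. You may want to filter the dataset more stringently to limit the tokens required.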

In [5]:
# Code cell 6

tokenizer = tiktoken.get_encoding("cl100k_base")
df['n_tokens'] = df["Plot"].apply(lambda x: len(tokenizer.encode(x)))
df = df[df.n_tokens<8192]
print('Number of movies: ' + str(len(df))) # print number of movies remaining in dataset
print('Number of tokens required:' + str(df['n_tokens'].sum())) # print number of tokens

Number of movies: 11125
Number of tokens required:7044844
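The token total above can be turned into a rough cost estimate before you commit to generating embeddings. This is only a sketch: the function name is hypothetical and the price is a placeholder, not an actual Azure OpenAI rate, so substitute the current price for your deployment.

```python
def estimate_embedding_cost(n_tokens, usd_per_million_tokens):
    """Estimate the cost in USD of embedding n_tokens at a given price per 1M tokens."""
    return n_tokens / 1_000_000 * usd_per_million_tokens


total_tokens = 7_044_844   # "Number of tokens required" printed above
placeholder_price = 0.10   # hypothetical USD per 1M tokens; replace with your actual rate

estimate = estimate_embedding_cost(total_tokens, placeholder_price)
print(f"Estimated cost: ${estimate:.2f}")
```

Because the cost scales linearly with the token count, halving the dataset with a stricter filter roughly halves the estimate.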


### Load Dataframe into LangChain
Using the `DataFrameLoader` class allows you to load a pandas dataframe into LangChain. That makes it easy to load your data and use it to generate embeddings using LangChain's other integrations.

In [6]:
# Code cell 7

loader = DataFrameLoader(df, page_content_column="Plot" )
movie_list = loader.load()
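To make the loader's mapping concrete: the column named by `page_content_column` becomes each document's text, and the remaining columns travel along as metadata. The sketch below mimics that mapping with plain dictionaries rather than calling LangChain; `rows_to_documents` and the sample row are hypothetical and only illustrate the shape of the result.

```python
def rows_to_documents(rows, page_content_column):
    """Split each row into page content and metadata, mimicking the loader's mapping."""
    documents = []
    for row in rows:
        # every column except the content column is carried along as metadata
        metadata = {key: value for key, value in row.items() if key != page_content_column}
        documents.append({"page_content": row[page_content_column], "metadata": metadata})
    return documents


# Hypothetical row shaped like the filtered DataFrame.
rows = [
    {"id": 0, "Title": "Example Movie", "Genre": "comedy", "Plot": "A short example plot."},
]
docs = rows_to_documents(rows, page_content_column="Plot")
print(docs[0]["page_content"])
print(docs[0]["metadata"])
```

This is why later cells can read `metadata['Title']` from a search result and filter on `Genre`, even though only the plot text is embedded.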

### Generate embeddings and load them into Azure Managed Redis
Using LangChain, this example connects to Azure OpenAI Service to generate embeddings for the dataset. These embeddings are then loaded into [Azure Managed Redis](https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/managed-redis/managed-redis-overview), a fully managed Redis service on Azure, which features the [RediSearch](https://redis.io/docs/latest/develop/interact/search-and-query/) module that includes vector search capability. Finally, a copy of the index schema is saved. That is useful for loading the index into Redis later if you don't want to regenerate the embeddings.

In [18]:
# Code cell 8

# we will use Azure OpenAI as our embeddings provider
embedding = AzureOpenAIEmbeddings(
    azure_endpoint=RESOURCE_ENDPOINT,
    azure_deployment=DEPLOYMENT_NAME,
    openai_api_key=API_KEY,
    openai_api_version='2024-03-01-preview',
    show_progress_bar=True,
    chunk_size=16)

# name of the Redis search index to create
index_name = "movieindex"

# create a connection string for the Redis Vector Store. Uses Redis-py format: https://redis-py.readthedocs.io/en/stable/connections.html#redis.Redis.from_url
# This example assumes TLS is enabled. If not, use "redis://" instead of "rediss://"
redis_url = "rediss://:" + REDIS_PASSWORD + "@"+ REDIS_ENDPOINT

# Take the first 100 movies
# short_list = movie_list[:100]

# create and load redis with documents
vectorstore = RedisVectorStore.from_documents(
    documents=movie_list,
    embedding=embedding,
    index_name=index_name,
    redis_url=redis_url
)

# save index schema so you can reload in the future without re-generating embeddings
vectorstore.write_schema("redis_schema.yaml")

# This may take up to 10 minutes to complete.

100%|██████████| 696/696 [10:51<00:00,  1.07it/s]
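The progress bar total of 696 lines up with the dataset size and the `chunk_size` setting, assuming the bar advances once per batch of `chunk_size` documents. A quick check of that arithmetic:

```python
import math

n_documents = 11_125  # "Number of movies" printed earlier
chunk_size = 16       # chunk_size passed to AzureOpenAIEmbeddings above

# 695 full batches of 16 cover 11,120 documents; the last 5 need one more batch
n_batches = math.ceil(n_documents / chunk_size)
print(n_batches)  # 696
```

The same calculation gives a rough batch count for any filtered subset, which helps when estimating how long a run will take at the observed rate of about one batch per second.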


### Run search queries
Using the vectorstore we just built in LangChain, we can conduct similarity searches using the `similarity_search_with_score` method. In this example, the top 10 results for a given query are returned.

In [None]:
# Code cell 9

results = vectorstore.similarity_search_with_score(query="Spaceships, aliens, and heroes saving America", k=10)

# each result is a (document, distance) pair; 1 - distance gives a similarity score
for doc, score in results:
    movie_title = str(doc.metadata['Title'])
    similarity_score = str(round(1 - score, 4))
    print(movie_title + ' (Score: ' + similarity_score + ')')


100%|██████████| 1/1 [00:01<00:00,  1.21s/it]
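The search cell reports `1 - score` because the value returned with each document is a distance, where smaller means closer. The sketch below shows the relationship for cosine distance using plain Python lists. It assumes the index uses the cosine metric, which is worth confirming against your saved index schema; the vectors are toy values, not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Cosine distance: 0 for identical directions, 1 for orthogonal vectors."""
    return 1.0 - cosine_similarity(a, b)


query_vec = [1.0, 0.0]
same_direction = [2.0, 0.0]   # same direction, different length
orthogonal = [0.0, 3.0]       # unrelated direction

for name, vec in [("same_direction", same_direction), ("orthogonal", orthogonal)]:
    distance = cosine_distance(query_vec, vec)
    print(name, "distance:", round(distance, 4), "similarity:", round(1 - distance, 4))
```

Under this assumption, a printed score near 1 means the plot embedding points in nearly the same direction as the query embedding, and a score near 0 means the two are largely unrelated.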


### Run hybrid queries

You can also run hybrid queries, which combine vector search with filters on other fields in the dataset. In this case, we restrict the query results to movies tagged with the `comedy` genre. One of the advantages of using LangChain with Redis is that metadata is preserved in the index, so you can use it to filter your results.

In [None]:
# Code cell 10

from langchain_community.vectorstores.redis import RedisText

query = "Spaceships, aliens, and heroes saving America"
genre = "comedy"

genre_filter = RedisText("Genre") == genre

results = vectorstore.similarity_search_with_score(query, filter=genre_filter, k=10)
# each result is a (document, distance) pair; 1 - distance gives a similarity score
for doc, score in results:
    movie_title = str(doc.metadata['Title'])
    similarity_score = str(round(1 - score, 4))
    print(movie_title + ' (Score: ' + similarity_score + ')')

100%|██████████| 1/1 [00:00<00:00,  3.90it/s]

Real Men (Score: 0.4859)
Real Men (Score: 0.4857)
Mars Attacks! (Score: 0.479)
Alien Trespass (Score: 0.4771)
Meet Dave (Score: 0.4693)
Strange Invaders (Score: 0.4642)
Strange Invaders (Score: 0.4642)
My Science Project (Score: 0.455)
My Science Project (Score: 0.455)
Galaxy Quest (Score: 0.4544)
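The output above lists some titles more than once. If you prefer one line per movie, a small post-processing step can keep only the best-ranked entry for each title. This is a sketch on plain `(title, score)` tuples; the helper name is hypothetical. Two different films can share a title, so a stricter key such as the `Wiki Page` metadata field may be a better choice in practice.

```python
def dedupe_by_title(scored_titles):
    """Keep the first (best-ranked) entry per title, preserving the original order."""
    seen = set()
    unique = []
    for title, score in scored_titles:
        if title not in seen:
            seen.add(title)
            unique.append((title, score))
    return unique


# A few of the (title, score) pairs printed above.
printed = [
    ("Real Men", 0.4859),
    ("Real Men", 0.4857),
    ("Mars Attacks!", 0.479),
    ("Strange Invaders", 0.4642),
    ("Strange Invaders", 0.4642),
]
for title, score in dedupe_by_title(printed):
    print(f"{title} (Score: {score})")
```

Since de-duplication shrinks the list, you may want to request a larger `k` from the vector store and trim to the top 10 afterwards.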





### Appendix A: Load index data already in Redis
If you already have embeddings data in Redis, you can load it into your LangChain vectorstore object using the `from_existing_index` method. This is useful if you don't want to re-run your embeddings model. You'll need to provide the index schema that was saved when you generated the embeddings.

In [4]:
# Code cell 11

# we will use Azure OpenAI as our embeddings provider
embedding = AzureOpenAIEmbeddings(
    azure_endpoint=RESOURCE_ENDPOINT,
    azure_deployment=DEPLOYMENT_NAME,
    openai_api_key=API_KEY,
    openai_api_version='2024-03-01-preview',
    show_progress_bar=True,
    chunk_size=16)

# name of the existing Redis search index to load
index_name = "movieindex"

# create a connection string for the Redis Vector Store. Uses Redis-py format: https://redis-py.readthedocs.io/en/stable/connections.html#redis.Redis.from_url
# This example assumes TLS is enabled. If not, use "redis://" instead of "rediss://"
redis_url = "rediss://:" + REDIS_PASSWORD + "@"+ REDIS_ENDPOINT

vectorstore = RedisVectorStore.from_existing_index(
    embedding=embedding,
    redis_url=redis_url,
    index_name=index_name,
    schema="redis_schema.yaml"
)
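The connection string in this notebook is built by plain string concatenation, which can break if the access key contains characters that are special in URLs, such as `@` or `/`. A hedged alternative is to percent-encode the password first. The helper name, endpoint, and password below are hypothetical, and the sketch assumes the client decodes the password when it parses the URL, so verify it against your own instance before relying on it.

```python
from urllib.parse import quote


def build_redis_url(endpoint, password, use_tls=True):
    """Build a redis-py style URL, percent-encoding the password (hypothetical helper)."""
    scheme = "rediss" if use_tls else "redis"
    # safe='' also encodes '/', which quote() would otherwise leave untouched
    return f"{scheme}://:{quote(password, safe='')}@{endpoint}"


# Placeholder endpoint and password for illustration only.
print(build_redis_url("example.redis.azure.net:10000", "p@ss/word"))
```

With this helper, `redis_url = build_redis_url(REDIS_ENDPOINT, REDIS_PASSWORD)` would replace the concatenation used in the cells above.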



### Appendix B: Query Redis using the CLI

In [5]:
# Install RedisVL
! pip install redisvl



In [6]:
# Query Azure Managed Redis
! rvl index listall -u $redis_url
! rvl index info -i movieindex -u $redis_url
! rvl stats -i movieindex -u $redis_url


14:06:43 [RedisVL] INFO   Indices:
14:06:43 [RedisVL] INFO   1. movieindex


Index Information:
╭──────────────┬────────────────┬────────────────────┬─────────────────┬────────────╮
│ Index Name   │ Storage Type   │ Prefixes           │ Index Options   │   Indexing │
├──────────────┼────────────────┼────────────────────┼─────────────────┼────────────┤
│ movieindex   │ HASH           │ ['doc:movieindex'] │ []              │          0 │
╰──────────────┴────────────────┴────────────────────┴─────────────────┴────────────╯
Index Fields:
╭────────────────┬────────────────┬─────────┬────────────────┬────────────────┬────────────────┬────────────────┬────────────────┬────────────────┬─────────────────┬────────────────╮
│ Name           │ Attribute      │ Type    │ Field Option   │ Option Value   │ Field Option   │ Option Value   │ Field Option   │   Option Value │ Field Option    │ Option Value   │
├────────────────┼────────────────┼─

In [17]:
# Destroy movie index
! rvl index destroy -i movieindex -u $redis_url

07:53:54 [RedisVL] INFO   Index deleted successfully


### Appendix C: Simple RAG Chain

In [7]:
from langchain_openai import AzureChatOpenAI

llm = AzureChatOpenAI(
    azure_endpoint=os.getenv("RESOURCE_ENDPOINT"),
    azure_deployment='gpt-4o-mini',
    api_key=os.getenv("API_KEY"),
    openai_api_version="2024-09-01-preview"
)

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

# Prompt
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "human",
            """You are a movie buff who can answer questions about movies, make suggestions, summarise key facts, and provide other useful movie information. Use the following information as context to build your answer. If you are unsure, just say 'I'm unsure'. Only discuss movies from the context provided. Don't discuss other topics not related to the movies.
Question: {question} 
Context: {context} 
Answer:""",
        ),
    ]
)


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 10})

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

while True:
    question = input("What's your question about movie(s)? ")
    if question == 'q' or question == '':
        print('Bye!')
        break
    else:
        answer = rag_chain.invoke(question)

        print(f'\nAnswer:\n{answer}')


100%|██████████| 1/1 [00:00<00:00,  3.85it/s]



Answer:
Tom Cruise stars in several movies within the provided context, primarily featuring the character Ethan Hunt in the "Mission: Impossible" series. In these films, Hunt is an IMF agent who undertakes dangerous missions involving espionage, high-stakes heists, and complex plots against various antagonists.

Additionally, Cruise plays David Aames in "Vanilla Sky," where he navigates a mentally complex and surreal storyline involving love, loss, and dreams. Another noteworthy performance is as Lieutenant Pete "Maverick" Mitchell in "Top Gun," where he is a naval aviator who deals with personal loss and seeks to prove himself among the best pilots.

If you are looking for specific movies or details about a particular role, let me know!
Bye!
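Note that `format_docs` passes only `page_content` (the plot text) to the model, so movie titles stored in the metadata are not part of the context. One possible variation is to label each plot with its title. The sketch below uses a small stand-in class instead of real LangChain documents so it runs on its own; `StandInDocument`, `format_docs_with_titles`, and the sample data are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StandInDocument:
    """Minimal stand-in for a retrieved document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs_with_titles(docs):
    """Join documents into one context string, labelling each plot with its title."""
    parts = []
    for doc in docs:
        title = doc.metadata.get("Title", "Unknown title")
        parts.append(f"Title: {title}\nPlot: {doc.page_content}")
    return "\n\n".join(parts)


# Hypothetical retrieved documents.
sample_docs = [
    StandInDocument("A short example plot.", {"Title": "Example Movie"}),
    StandInDocument("Another example plot."),
]
print(format_docs_with_titles(sample_docs))
```

Swapping this function in for `format_docs` in the chain definition would give the model explicit titles to cite, at the cost of a slightly longer prompt.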
