<a href="https://colab.research.google.com/github/gacerioni/redis-workshop-semantic-cache-llm/blob/master/redis_vector_semantic_cache_llm_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Workshop - Redis as a VectorDB - Semantic Caching (RedisVL)

![Redis](https://redis.io/wp-content/uploads/2024/04/Logotype.svg?auto=webp&quality=85,75&width=120)


Welcome to the workshop! We'll get hands-on experience using Redis for semantic caching, integrating smoothly with your LLM stack.


For the best possible experience, which is what I want you to have, I strongly recommend using Redis Insight (app or web) to help visualize the data.

https://redis.com/redis-enterprise/redis-insight/

# Create a free-forever Redis Cloud account

To create your free Redis Cloud account, follow this colab [here](https://github.com/gacerioni/redis-workshop-notebook-validator/blob/master/redis-workshop-setup-notebook-validator.ipynb).

Click the "Open in Colab" button and follow the steps.

# Introduction

The **RedisVL** client provides a **Semantic Cache** interface that uses Redis's built-in caching capabilities and vector search to store responses to previously answered questions. When a new prompt is semantically close enough to a stored one, the cached response is returned instead of calling the LLM again.

This reduces the number of requests and tokens sent to LLM services, which lowers costs and increases application throughput by cutting the time needed to generate natural-language responses.

This colab will teach you how to use Redis as a semantic cache for your applications.
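To make the idea concrete before touching Redis, here is a minimal in-memory sketch of what a semantic cache does. It is not how RedisVL is implemented: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and `ToySemanticCache` is an illustrative class invented for this example.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: word counts."""
    return Counter(text.lower().replace("?", "").split())


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity; 0.0 means identical direction."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


class ToySemanticCache:
    """Illustrative only: linear scan over stored prompts."""

    def __init__(self, distance_threshold: float) -> None:
        self.distance_threshold = distance_threshold
        self.entries: List[Tuple[Counter, str]] = []

    def store(self, prompt: str, response: str) -> None:
        self.entries.append((toy_embed(prompt), response))

    def check(self, prompt: str) -> Optional[str]:
        query = toy_embed(prompt)
        best_response, best_distance = None, float("inf")
        for embedding, response in self.entries:
            distance = cosine_distance(query, embedding)
            if distance < best_distance:
                best_response, best_distance = response, distance
        # Only return the nearest entry if it is close enough.
        if best_distance <= self.distance_threshold:
            return best_response
        return None


cache = ToySemanticCache(distance_threshold=0.2)
cache.store("What is the capital of Brazil?", "Brasília")
hit = cache.check("What actually is the capital of Brazil?")
miss = cache.check("How do I bake bread?")
print(hit, miss)
```

A real setup replaces the word counts with dense embeddings and the linear scan with a vector index, which is the part Redis takes care of.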

# Hands-on: time to start coding


Let's install a few dependencies right here in the colab.

In [99]:
# install the dependencies we'll use in this lab
!pip install openai redisvl sentence-transformers



## Building the skeleton to intercept the user's prompt

In this block, we define how we'll interact with the LLM, in a very simple way.


In [100]:
import os
import getpass
import time

from openai import OpenAI

import numpy as np

os.environ["TOKENIZERS_PARALLELISM"] = "False"

api_key = os.getenv("OPENAI_API_KEY") or getpass.getpass("Enter your OpenAI API key: ")


client = OpenAI(api_key=api_key)

def ask_openai(question: str) -> str:
    response = client.completions.create(
      model="gpt-3.5-turbo-instruct",
      prompt=question,
      max_tokens=200
    )
    return response.choices[0].text.strip()

Enter your OpenAI API key: ··········


Now let's ask a simple, direct question, and only afterwards start drifting away from its exact wording.

In [101]:
print(ask_openai("What is the capital of Brazil?"))

00:19:56 httpx INFO   HTTP Request: POST https://api.openai.com/v1/completions "HTTP/1.1 200 OK"
The capital of Brazil is Brasília.


## Initializing the SemanticCache

When it is initialized, the SemanticCache automatically creates an index inside Redis for the semantic cache contents.

In [104]:
from redisvl.extensions.llmcache import SemanticCache

llmcache = SemanticCache(
    name="llmcache",                     # underlying search index name
    prefix="llmcache",                   # redis key prefix for hash entries
    redis_url="redis://default:Xfoa8sZGrhrxm8KyY718wIMuuQenXxD0@redis-16962.c11.us-east-1-2.ec2.cloud.redislabs.com:16962",  # redis connection url string
    distance_threshold=0.1               # semantic cache distance threshold
)

00:22:06 sentence_transformers.SentenceTransformer INFO   Use pytorch device_name: cpu
00:22:06 sentence_transformers.SentenceTransformer INFO   Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2


Batches:   0%|          | 0/1 [00:00<?, ?it/s]
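The connection string does not have to be pasted into the cell itself. As a hedged sketch, the helper below builds the URL from environment variables instead. `REDIS_URL` matches the commented-out `export` hint in the next cell; `REDIS_HOST`, `REDIS_PORT`, `REDIS_USERNAME`, and `REDIS_PASSWORD` are variable names assumed for this example, and `redis_url_from_env` is a hypothetical helper, not part of RedisVL.

```python
import os
from typing import Mapping
from urllib.parse import quote


def redis_url_from_env(env: Mapping[str, str] = os.environ) -> str:
    """Build a redis:// URL from environment variables (names are assumptions)."""
    # A full URL, if provided, wins over the individual pieces.
    if env.get("REDIS_URL"):
        return env["REDIS_URL"]

    host = env.get("REDIS_HOST", "localhost")
    port = env.get("REDIS_PORT", "6379")
    username = quote(env.get("REDIS_USERNAME", "default"), safe="")
    password = env.get("REDIS_PASSWORD")

    if password:
        # Percent-encode the password so characters like '@' stay valid in a URL.
        return f"redis://{username}:{quote(password, safe='')}@{host}:{port}"
    return f"redis://{host}:{port}"


print(redis_url_from_env({"REDIS_HOST": "example.invalid", "REDIS_PORT": "16962"}))
```

The result could then be passed as `redis_url=redis_url_from_env()` when constructing the cache, which keeps credentials out of the notebook.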

In [105]:
#!export REDIS_URL=
!rvl index info -i llmcache --url "redis://default:Xfoa8sZGrhrxm8KyY718wIMuuQenXxD0@redis-16962.c11.us-east-1-2.ec2.cloud.redislabs.com:16962"
#!rvl index -h



Index Information:
╭──────────────┬────────────────┬──────────────┬─────────────────┬────────────╮
│ Index Name   │ Storage Type   │ Prefixes     │ Index Options   │   Indexing │
├──────────────┼────────────────┼──────────────┼─────────────────┼────────────┤
│ llmcache     │ HASH           │ ['llmcache'] │ []              │          0 │
╰──────────────┴────────────────┴──────────────┴─────────────────┴────────────╯
Index Fields:
╭───────────────┬───────────────┬─────────┬────────────────┬────────────────┬────────────────┬────────────────┬────────────────┬────────────────┬─────────────────┬────────────────╮
│ Name          │ Attribute     │ Type    │ Field Option   │ Option Value   │ Field Option   │ Option Value   │ Field Option   │   Option Value │ Field Option    │ Option Value   │
├───────────────┼───────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────────────────┼────────────────┤
│ prompt        │ prom

In [107]:
question = "What is the capital of Brazil?"

In [108]:
# Check the semantic cache -- should be empty
if response := llmcache.check(prompt=question):
    print(response)
else:
    print("Empty cache")

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Empty cache


In [109]:
# Cache the question, answer, and arbitrary metadata
llmcache.store(
    prompt=question,
    response="Brasília",
    metadata={"city": "Brasília", "country": "brazil", "most_nerdola_citizen": "gabs"}
)


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

'llmcache:1bb2f7bfcea048cbc97b5639ccaf367ca354e33319f39e0b72fa8256df73a59e'

In [110]:
# Check the cache again
if response := llmcache.check(prompt=question, return_fields=["prompt", "response", "metadata"]):
    print(response)
else:
    print("Empty cache")

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

[{'prompt': 'What is the capital of Brazil?', 'response': 'Brasília', 'metadata': {'city': 'Brasília', 'country': 'brazil', 'most_nerdola_citizen': 'gabs'}, 'key': 'llmcache:1bb2f7bfcea048cbc97b5639ccaf367ca354e33319f39e0b72fa8256df73a59e'}]


In [111]:
# Check for a semantically similar result
question = "What actually is the capital of Brazil?"
llmcache.check(prompt=question)[0]['response']

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

'Brasília'

In [117]:
# Widen the semantic distance threshold
llmcache.set_threshold(0.3)
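To see why widening the threshold changes which prompts count as hits, the sketch below compares a cosine distance against the two thresholds used in this notebook (0.1 and 0.3). The two-dimensional vectors are made up for illustration, and the assumption that the cache compares cosine distance against the threshold should be checked against the RedisVL documentation for your version.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))


def is_hit(query: Sequence[float], stored: Sequence[float], threshold: float) -> bool:
    """A query is a cache hit when it is within the distance threshold."""
    return cosine_distance(query, stored) <= threshold


stored_prompt = [1.0, 0.0]   # made-up embedding of the cached prompt
rephrased = [0.98, 0.2]      # a small rewording: very close
roundabout = [0.8, 0.6]      # an indirect question: further away

for threshold in (0.1, 0.3):
    print(
        threshold,
        is_hit(rephrased, stored_prompt, threshold),
        is_hit(roundabout, stored_prompt, threshold),
    )
```

A wider threshold catches more paraphrases, but it also raises the risk of returning a cached answer to a question that only looks similar.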

In [118]:
# Really try to trick it by asking around the point;
# the query still slips just under our new threshold
question = "What is the capital city of the country in LATAM that also has a city named São Paulo?"
llmcache.check(prompt=question)[0]['response']

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

'Brasília'

In [None]:
# Invalidate the cache completely by clearing it out
llmcache.clear()

# should be empty now
llmcache.check(prompt=question)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

[]

In [None]:
llmcache.set_ttl(5) # 5 seconds


In [None]:
llmcache.store("This is a TTL test", "This is a TTL test response")

time.sleep(5)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
# confirm that the cache has cleared by now on its own
result = llmcache.check("This is a TTL test")

print(result)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

[]


In [None]:
# Reset the TTL to null (long lived data)
llmcache.set_ttl()
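The TTL behaviour above can be pictured with a small in-memory sketch. In the real setup the expiry happens inside Redis; `TTLStore` below is a hypothetical class written only for illustration, with an injectable clock so the example does not need to sleep.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLStore:
    """Illustrative key/value store whose entries expire after `ttl` seconds."""

    def __init__(self, ttl: Optional[float] = None,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl
        self.clock = clock
        self._data: Dict[str, Tuple[str, Optional[float]]] = {}

    def set_ttl(self, ttl: Optional[float] = None) -> None:
        # None means "never expire"; applies to entries stored from now on.
        self.ttl = ttl

    def store(self, key: str, value: str) -> None:
        expires_at = None if self.ttl is None else self.clock() + self.ttl
        self._data[key] = (value, expires_at)

    def check(self, key: str) -> Optional[str]:
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if expires_at is not None and self.clock() >= expires_at:
            del self._data[key]  # lazily drop the expired entry
            return None
        return value


now = [0.0]  # fake clock we can advance by hand
ttl_store = TTLStore(ttl=5, clock=lambda: now[0])
ttl_store.store("This is a TTL test", "This is a TTL test response")
before = ttl_store.check("This is a TTL test")
now[0] = 5.0  # five seconds later
after = ttl_store.check("This is a TTL test")
print(before, after)
```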

----

In [None]:
def answer_question(question: str) -> str:
    """Helper function to answer a simple question using OpenAI with a wrapper
    check for the answer in the semantic cache first.

    Args:
        question (str): User input question.

    Returns:
        str: Response.
    """
    results = llmcache.check(prompt=question)
    if results:
        return results[0]["response"]
    else:
        answer = ask_openai(question)
        return answer

In [None]:
start = time.time()
# asking a question -- openai response time
question = "What was the name of the first US President?"
answer = answer_question(question)
end = time.time()

print(f"Without caching, a call to openAI to answer this simple question took {end-start} seconds.")


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

19:34:10 httpx INFO   HTTP Request: POST https://api.openai.com/v1/completions "HTTP/1.1 200 OK"
Without caching, a call to openAI to answer this simple question took 0.9294853210449219 seconds.


In [None]:
llmcache.store(prompt=question, response="George Washington")

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

'llmcache:67e0f6e28fe2a61c0022fd42bf734bb8ffe49d3e375fd69d692574295a20fc1a'

In [None]:
# Calculate the avg latency for caching over LLM usage
times = []

for _ in range(10):
    cached_start = time.time()
    cached_answer = answer_question(question)
    cached_end = time.time()
    times.append(cached_end-cached_start)

avg_time_with_cache = np.mean(times)
print(f"Avg time taken with LLM cache enabled: {avg_time_with_cache}")
print(f"Percentage of time saved: {round(((end - start) - avg_time_with_cache) / (end - start) * 100, 2)}%")


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Avg time taken with LLM cache enabled: 0.2613499402999878
Percentage of time saved: 71.88%
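The percentage above follows from `(baseline - cached) / baseline * 100`. The sketch below reproduces that arithmetic with the two timings printed in this notebook and adds a small timing helper based on `time.perf_counter`, which is generally better suited to measuring short intervals than `time.time`. Both `time_call` and `percent_saved` are hypothetical helper names introduced for this example.

```python
import time
from statistics import mean
from typing import Any, Callable


def time_call(fn: Callable[..., Any], *args: Any, repeats: int = 1) -> float:
    """Average wall-clock seconds taken by fn(*args) over `repeats` runs."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        durations.append(time.perf_counter() - start)
    return mean(durations)


def percent_saved(baseline_seconds: float, cached_seconds: float) -> float:
    """Share of the baseline time avoided by the cached path, in percent."""
    return round((baseline_seconds - cached_seconds) / baseline_seconds * 100, 2)


# Timings taken from the outputs shown in this notebook.
saved = percent_saved(0.9294853210449219, 0.2613499402999878)
print(f"Percentage of time saved: {saved}%")
```

Note that a single uncached call is a noisy baseline; averaging several uncached calls with `time_call(..., repeats=n)` would give a fairer comparison.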


In [None]:
# Clear the cache AND delete the underlying index
#llmcache.delete()

## Automated Machine


In [None]:
def answer_question(question: str) -> str:
    """Helper function to answer a simple question using OpenAI with a wrapper
    check for the answer in the semantic cache first. If not found, it queries
    OpenAI and stores the response in the cache.

    Args:
        question (str): User input question.

    Returns:
        str: Response.
    """
    # Check if the answer is already in the semantic cache
    results = llmcache.check(prompt=question)

    if results:
        # If found, print message and return the cached response
        print("[CACHE HIT] The answer was found in the semantic cache.")
        return results[0]["response"]
    else:
        # Otherwise, ask the LLM (OpenAI) for the answer
        print("[CACHE MISS] The answer was not in the cache. Querying OpenAI...")
        answer = ask_openai(question)

        # Store the question and its answer in the semantic cache
        print("[CACHE STORE] Storing the new response in the semantic cache.")
        llmcache.store(prompt=question, response=answer)

        # Return the answer
        return answer
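The function above follows the cache-aside pattern: check the cache, fall back to the LLM on a miss, then store the result. The sketch below isolates that control flow so it can be run without Redis or OpenAI. `cache_aside` is a hypothetical helper, the dictionary is an exact-match stand-in for the semantic cache, and `fake_llm` is a stub that only records how often it was called.

```python
from typing import Callable, Dict, List, Optional


def cache_aside(
    question: str,
    check: Callable[[str], Optional[str]],
    ask: Callable[[str], str],
    store: Callable[[str, str], None],
) -> str:
    """Return a cached answer if present; otherwise ask, store, and return."""
    cached = check(question)
    if cached is not None:
        return cached
    answer = ask(question)
    store(question, answer)
    return answer


# Stand-ins for illustration only (exact-match cache, stubbed LLM).
fake_cache: Dict[str, str] = {}
llm_calls: List[str] = []


def fake_llm(question: str) -> str:
    llm_calls.append(question)
    return f"answer to: {question}"


first = cache_aside("What is the capital of Spain?",
                    fake_cache.get, fake_llm, fake_cache.__setitem__)
second = cache_aside("What is the capital of Spain?",
                     fake_cache.get, fake_llm, fake_cache.__setitem__)
print(first == second, len(llm_calls))
```

Passing the cache and LLM in as callables keeps the flow easy to test; in the notebook the same roles are played by `llmcache.check`, `ask_openai`, and `llmcache.store`.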

In [None]:
# Main block to ask user for input and check the cache or query the LLM
while True:
    # Open a prompt for the user to ask a question
    question = input("Enter your question (or 'exit' to stop): ")

    # Break the loop if the user types 'exit'
    if question.lower() == 'exit':
        print("Exiting the demo.")
        break

    # Get the answer using the cached system
    answer = answer_question(question)

    # Display the answer
    print(f"Answer: {answer}\n")

Enter your question (or 'exit' to stop): What is the capital of Spain?


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

[CACHE MISS] The answer was not in the cache. Querying OpenAI...
20:15:30 httpx INFO   HTTP Request: POST https://api.openai.com/v1/completions "HTTP/1.1 200 OK"
[CACHE STORE] Storing the new response in the semantic cache.


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Answer: The capital of Spain is Madrid.

Enter your question (or 'exit' to stop): What is the capital of a country that has a city named as Barcelona?


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

[CACHE HIT] The answer was found in the semantic cache.
Answer: The capital of Spain is Madrid.

Enter your question (or 'exit' to stop): exit
Exiting the demo.
