# LLM Reference Architecture using Redis & Google Cloud Platform

<a href="https://colab.research.google.com/github/bestarch/gcp-redis-llm-stack/blob/main/BigQuery_Palm_Redis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook serves as a getting started guide for working with LLMs on Google Cloud Platform with Redis Enterprise.

## Intro
Google's Vertex AI has expanded its capabilities by introducing [Generative AI](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/overview). It comes with a specialized [in-console studio experience](https://cloud.google.com/vertex-ai/docs/generative-ai/start/quickstarts/quickstart), a [dedicated API](https://cloud.google.com/vertex-ai/docs/generative-ai/start/quickstarts/api-quickstart), and a [Python SDK](https://cloud.google.com/vertex-ai/docs/python-sdk/use-vertex-ai-python-sdk) designed for deploying and managing instances of Google's PaLM language models. PaLM models focus on text generation, summarization, chat completion, and embedding creation.

Redis Enterprise offers robust vector database features, with an efficient API for vector index creation, management, distance metric selection, similarity search, and hybrid filtering. When coupled with its versatile data structures - including lists, hashes, JSON, and sets - Redis Enterprise shines as the optimal solution for crafting high-quality Large Language Model (LLM)-based applications. It embodies a streamlined architecture and exceptional performance, making it an instrumental tool for production environments.

![](https://github.com/RedisVentures/redis-google-llms/blob/main/assets/GCP_RE_GenAI.drawio.png?raw=true)

Below we work through several design patterns with Vertex AI LLMs and Redis Enterprise that help improve performance in production.

___
## Contents
- Setup
    1. Prerequisites
    2. Create BigQuery Table
    3. Generate Embeddings
        
        a. Embed Text Data

    4. Load Embeddings to Redis
    5. Create Index
- Build LLM Applications
- LLM Design Patterns
    1. Semantic Search
    2. Retrieval Augmented Generation (RAG)
    3. Caching
    4. Memory
- Cleanup

___

# Setup

## 1. Prerequisites
Before we begin, we must install some required libraries, authenticate with Google, create a Redis database, and initialize other required components.

### Install required libraries

In [2]:
!pip install redis "google-cloud-aiplatform==1.25.0" --upgrade --user

Collecting redis
  Downloading redis-5.0.1-py3-none-any.whl (250 kB)
Collecting google-cloud-aiplatform==1.25.0
  Downloading google_cloud_aiplatform-1.25.0-py2.py3-none-any.whl (2.6 MB)
Collecting shapely<2.0.0 (from google-cloud-aiplatform==1.25.0)
  Downloading Shapely-1.8.5.post1-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (2.0 MB)
Installing collected packages: shapely, redis, google-cloud-aiplatform
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
lida 0.0.10 requires fasta

### Connect to Redis server
Replace the connection params below with your own if you are connecting to an external Redis instance.

In [1]:
import os
import redis

# Redis connection params
REDIS_HOST = os.getenv("REDIS_HOST", "redis-15850.c1.asia-northeast1-1.gce.cloud.redislabs.com")
REDIS_PORT = os.getenv("REDIS_PORT", "15850")
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "admin")

# Create Redis client
redis_client = redis.Redis(
  host=REDIS_HOST,
  port=REDIS_PORT,
  password=REDIS_PASSWORD
)

# Test connection
redis_client.ping()

True

In [21]:
# Clear Redis database (optional)
redis_client.flushdb()

True

### Authenticate to Google Cloud

In [2]:
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


In [3]:
from getpass import getpass

# input your GCP project ID and region for Vertex AI
PROJECT_ID = getpass("PROJECT_ID:") #'central-beach-194106'
REGION = input("REGION:") #'asia-northeast1'

PROJECT_ID:··········
REGION:asia-northeast1


### Initialize Vertex AI Components



In [4]:
import vertexai
print(REGION)
vertexai.init(project=PROJECT_ID, location=REGION)

asia-northeast1


### Initialize BigQuery client

In [5]:
from google.cloud import bigquery

# Create bigquery client
bq = bigquery.Client(project=PROJECT_ID, location=REGION)


### Populate source dataset in BigQuery

The hackernews dataset contains records of Hacker News posts, including the **title**, **text**, **time**, **id**, and **timestamp**. Let's pull a few sample rows and inspect them.
We will use the `hackernews` table in the `abhi_hackernews_dataset` dataset.

The records in this table were populated from the publicly available dataset `bigquery-public-data.hacker_news` using the query:

`SELECT title, text, time, timestamp, id FROM bigquery-public-data.hacker_news.full WHERE type ='story' LIMIT 2000`

In [6]:
# Query 5 sample records from BQ
query_job = bq.query(f'''
SELECT *
FROM `central-beach-194106.abhi_hackernews_dataset.hackernews`
LIMIT 5
''')

query_job.result().to_dataframe()

|   | id | timestamp | time | text | title |
|---|---|---|---|---|---|
| 0 | 25922721 | 2021-01-27 00:26:34+00:00 | 1611707194 | ZiKoNWuK6Bu0 | D |
| 1 | 22692739 | 2020-03-26 12:13:30+00:00 | 1585224810 | asddddddddddddddd | Ads |
| 2 | 13930560 | 2017-03-22 13:06:44+00:00 | 1490188004 | Is there any one who can help me better under... | Help |
| 3 | 21210935 | 2019-10-10 04:38:16+00:00 | 1570682296 | Como Estas | Hola |
| 4 | 33383542 | 2022-10-29 12:41:31+00:00 | 1667047291 | I am New to HN. What is that 1 advice you woul... | Advice |


## 3. Generate Embeddings

### Create text embeddings with Vertex AI embedding model
Use the [Vertex AI API for text embeddings](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings), developed by Google.

> Text embeddings are a dense vector representation of a piece of content such that, if two pieces of content are semantically similar, their respective embeddings are located near each other in the embedding vector space. This representation can be used to solve common NLP tasks, such as:
> - **Semantic search**: Search text ranked by semantic similarity.
> - **Recommendation**: Return items with text attributes similar to the given text.
> - **Classification**: Return the class of items whose text attributes are similar to the given text.
> - **Clustering**: Cluster items whose text attributes are similar to the given text.
> - **Outlier Detection**: Return items where text attributes are least related to the given text.

The `textembedding-gecko` model accepts a maximum of 3,072 input tokens (sub-word units, so a token is not the same thing as a word) and outputs 768-dimensional vector embeddings.
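
To make "located near each other in the embedding vector space" concrete, here is a minimal sketch that compares toy vectors with cosine similarity. The 4-dimensional vectors and their names are made up for illustration; real embeddings from the model have 768 dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional "embeddings" (real ones have 768 dimensions)
laptop_question = [0.9, 0.1, 0.0, 0.2]
linux_laptop_post = [0.8, 0.2, 0.1, 0.3]
cooking_post = [0.0, 0.1, 0.9, 0.1]

sim_related = cosine_similarity(laptop_question, linux_laptop_post)
sim_unrelated = cosine_similarity(laptop_question, cooking_post)

# Semantically related content should score higher than unrelated content
print(sim_related > sim_unrelated)
```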

### Define embedding helper function
We define a helper function, `embed_text`, to create embeddings from a list of texts while making it resilient to [Vertex AI API quotas](https://cloud.google.com/vertex-ai/docs/quotas) via [exponential backoff](https://en.wikipedia.org/wiki/Exponential_backoff).

We also define a method to convert an array of floats to a byte string for efficient storage in Redis (later on).
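
For readers unfamiliar with exponential backoff, the idea can be sketched in plain Python as follows. This is an illustration only, not how the `tenacity` library implements it; `call_with_backoff` and `flaky` are hypothetical names, and the delay formula is one simple choice among many.

```python
import random
import time

def call_with_backoff(fn, max_attempts=3, min_wait=1.0, max_wait=20.0, sleep=time.sleep):
    """Call fn(); on failure wait a random, exponentially growing delay and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The upper bound doubles with each attempt, capped at max_wait
            upper = min(max_wait, min_wait * 2 ** attempt)
            sleep(random.uniform(min_wait, upper))

# Usage: a hypothetical flaky call that succeeds on the third try
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("quota exceeded")
    return "ok"

# Pass a no-op sleep so the demo does not actually wait
result = call_with_backoff(flaky, sleep=lambda seconds: None)
print(result, calls["n"])
```

Randomizing the delay helps keep many clients from retrying at the same moment after a shared quota error.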



In [7]:
from typing import Generator, List, Any

import numpy as np
from tenacity import retry, stop_after_attempt, wait_random_exponential
from vertexai.preview.language_models import TextEmbeddingModel

# Embedding model definition from VertexAI PaLM API
embedding_model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
VECTOR_DIMENSIONS = 768

# Retry up to 3 times, waiting a random, exponentially growing interval (1-20 s)
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(3))
def embed_text(text: List[str]) -> List[List[float]]:
    embeddings = embedding_model.get_embeddings(text)
    return [each.values for each in embeddings]

# Convert embeddings to bytes for Redis storage
def convert_embedding(emb: List[float]) -> bytes:
    return np.array(emb).astype(np.float32).tobytes()
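
As a quick sanity check of the byte conversion, the sketch below round-trips a stand-in vector through `float32` bytes and back. The random vector is an assumption for illustration; a real embedding would come from the embedding model.

```python
import numpy as np

VECTOR_DIMENSIONS = 768  # matches the embedding size used in this notebook

# A stand-in embedding; a real one would come from the embedding model
embedding = np.random.default_rng(0).random(VECTOR_DIMENSIONS).tolist()

# Same conversion as convert_embedding: float32 values packed into raw bytes
as_bytes = np.array(embedding).astype(np.float32).tobytes()

# Reading the bytes back requires knowing the dtype used when writing
restored = np.frombuffer(as_bytes, dtype=np.float32)

print(len(as_bytes))  # 4 bytes per float32 value
print(np.allclose(restored, embedding, atol=1e-6))
```

The `FLOAT32` type used here has to match the `TYPE` declared on the vector index, otherwise the stored bytes would be interpreted with the wrong width.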

### Embed text data
At the moment, our table in BigQuery (created above), contains records of the hacker news posts that we wish to embed and make available for LLMs.

In order to conserve RAM usage of this machine, we will iterate over batches of posts from BigQuery, create embeddings, and write them to Redis, which is being used as a [vector database](https://redis.com/solutions/use-cases/vector-database).

In [8]:
import pandas as pd
import numpy as np

# ORDER BY keeps LIMIT/OFFSET paging deterministic across batches
QUERY_TEMPLATE = """
SELECT id, title, text
FROM `central-beach-194106.abhi_hackernews_dataset.hackernews`
ORDER BY id
LIMIT {limit} OFFSET {offset};
"""

def query_bigquery_batches(
    max_rows: int,
    rows_per_batch: int,
    start_batch: int = 0
) -> Generator[pd.DataFrame, Any, None]:
    # Generate batches from a table in big query
    for offset in range(start_batch, max_rows, rows_per_batch):
        query = QUERY_TEMPLATE.format(limit=rows_per_batch, offset=offset)
        query_job = bq.query(query)
        rows = query_job.result()
        df = rows.to_dataframe()
        # Join title and text fields
        df["content"] = df.apply(lambda r: "Title: " + r.title + ". Content: " + r.text, axis=1)
        yield df


Below we define a few helper functions for processing a single row of data, writing batches to **Redis**, querying source data from **BigQuery**, and creating text embeddings with **Vertex AI**.

In [9]:
import math
from tqdm.auto import tqdm


# Redis key helper function
def redis_key(key_prefix: str, id: str) -> str:
  return f"{key_prefix}:{id}"

# Process a single dataset record
def process_record(record: dict) -> dict:
  return {
      'id': record['id'],
      'embedding': record['embedding'],
      'text': record['text'],
      'title': record['title']
  }

# Load batch of data into Redis as HASH objects
def load_redis_batch(
    redis_client: redis.Redis,
    dataset: list,
    key_prefix: str = "doc",
    id_column: str = "id",
):
    pipe = redis_client.pipeline()
    for record in tqdm(dataset):
        record = process_record(record)
        key = redis_key(key_prefix, record[id_column])
        pipe.hset(key, mapping=record)
    pipe.execute()

# Run the entire process
def create_embeddings_bigquery_redis(redis_client):
    # Create generator from BigQuery
    max_rows = 1000
    rows_per_batch = 100
    bq_content_query = query_bigquery_batches(max_rows, rows_per_batch)

    for batch in tqdm(bq_content_query):
      # Split batch into smaller chunks for embedding generation
      batch_splits = np.array_split(batch, math.ceil(rows_per_batch/5))
      # Create embeddings
      batch["embedding"] = [
          convert_embedding(embedding)
          for split in batch_splits
          for embedding in embed_text(split.content)
      ]
      # Write batch to Redis
      batch = batch.to_dict("records")
      load_redis_batch(redis_client, batch)
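
The chunking step inside `create_embeddings_bigquery_redis` can be looked at in isolation. The sketch below uses a plain integer array as a stand-in for one BigQuery batch; the per-request limit of 5 texts is an assumption taken from the `rows_per_batch/5` expression above, so check the current API documentation before relying on it.

```python
import math
import numpy as np

rows_per_batch = 100
max_texts_per_call = 5  # assumed per-request limit; check the current API docs

rows = np.arange(rows_per_batch)  # stand-in for the rows of one BigQuery batch
chunks = np.array_split(rows, math.ceil(rows_per_batch / max_texts_per_call))

print(len(chunks))                  # number of embedding requests per batch
print(max(len(c) for c in chunks))  # largest chunk size
```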


## 4. Load Embeddings
Now that we have a function to generate BigQuery batches, create text embeddings, and write batches to Redis, we can run the single function to process our entire dataset:

In [22]:
create_embeddings_bigquery_redis(redis_client)

0it [00:00, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]


In [10]:
# Validate how many records are stored in Redis
redis_client.dbsize()

1000

## 5. Create Vector Index

Now that we have created embeddings that represent the text in our dataset and stored them in Redis, we will create a secondary index that enables efficient search over the embeddings. To learn more about the vector similarity features in Redis, [check out these docs](https://redis.io/docs/interact/search-and-query/search/vectors/) and [these Redis AI resources](https://github.com/RedisVentures/redis-ai-resources).

**Why do we need to enable search?**
Using Redis for vector similarity search allows us to retrieve chunks of text data that are **similar** or **relevant** to an input question or query. This will be extremely helpful for our sample generative AI / LLM application.
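
Conceptually, a brute-force nearest-neighbor search compares the query vector against every stored vector and keeps the closest ones. The NumPy sketch below illustrates that idea with cosine distance (1 minus cosine similarity); it is not Redis code, and `knn_cosine` and the 3-dimensional vectors are hypothetical.

```python
import numpy as np

def knn_cosine(query, vectors, k):
    """Return (index, cosine_distance) pairs for the k closest rows of `vectors`."""
    query = np.asarray(query, dtype=np.float32)
    vectors = np.asarray(vectors, dtype=np.float32)
    sims = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    distances = 1.0 - sims  # smaller distance means more similar
    nearest = np.argsort(distances)[:k]
    return [(int(i), float(distances[i])) for i in nearest]

# Hypothetical 3-dimensional document vectors
docs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
top = knn_cosine([1.0, 0.1, 0.0], docs, k=2)
print(top)
```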

In [19]:
from redis.commands.search.field import (
    NumericField,
    TagField,
    TextField,
    VectorField,
)
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query


INDEX_NAME = "google:idx"
PREFIX = "doc:"
VECTOR_FIELD_NAME = "embedding"

# Create a vector index over the HASH documents already stored in Redis
def create_redis_index(
    redis_client: redis.Redis,
    vector_field_name: str = VECTOR_FIELD_NAME,
    index_name: str = INDEX_NAME,
    prefix: list = [PREFIX],
    dim: int = VECTOR_DIMENSIONS
  ):

    # Drop the index if it already exists (the stored documents are kept)
    try:
        redis_client.ft(index_name).info()
        print("Existing index found. Dropping and recreating the index", flush=True)
        redis_client.ft(index_name).dropindex(delete_documents=False)
    except redis.exceptions.ResponseError:
        # info() raises a ResponseError when the index does not exist
        print("Creating new index", flush=True)

    # Create new index
    redis_client.ft(index_name).create_index(
        [
            VectorField(
                vector_field_name, "FLAT",
                {
                    "TYPE": "FLOAT32",
                    "DIM": dim,
                    "DISTANCE_METRIC": "COSINE",
                }
            )
        ],
        definition=IndexDefinition(prefix=prefix, index_type=IndexType.HASH)
    )

In [25]:
# Create index
create_redis_index(redis_client)

Creating new index


In [26]:
# Inspect index attributes
redis_client.ft(INDEX_NAME).info()

{'index_name': 'google:idx',
 'index_options': [],
 'index_definition': [b'key_type',
  b'HASH',
  b'prefixes',
  [b'doc:'],
  b'default_score',
  b'1'],
 'attributes': [[b'identifier',
   b'embedding',
   b'attribute',
   b'embedding',
   b'type',
   b'VECTOR']],
 'num_docs': '1000',
 'max_doc_id': '1000',
 'num_terms': '0',
 'num_records': '1000',
 'inverted_sz_mb': '0',
 'vector_index_sz_mb': '3.047210693359375',
 'total_inverted_index_blocks': '0',
 'offset_vectors_sz_mb': '0',
 'doc_table_size_mb': '0.07362556457519531',
 'sortable_values_size_mb': '0',
 'key_table_size_mb': '0.040406227111816406',
 'geoshapes_sz_mb': '0',
 'records_per_doc_avg': '1',
 'bytes_per_record_avg': '0',
 'offsets_per_term_avg': '0',
 'offset_bits_per_record_avg': 'nan',
 'hash_indexing_failures': '0',
 'total_indexing_time': '15.75',
 'indexing': '0',
 'percent_indexed': '1',
 'number_of_uses': 2,
 'cleaning': 0,
 'gc_stats': [b'bytes_collected',
  b'0',
  b'total_ms_run',
  b'0',
  b'total_cycles',
  b

In [11]:
# Retrieve a single HASH from Redis
key = redis_client.keys()[1]
redis_client.hgetall(key)

{b'title': b'Camera Galactica',
 b'text': b'I am trying to make a galactic camera. Do take a look.',
 b'embedding': b'x\xee\xdf\xbbm\x9cV<VaA=\x12\xda&=\x0f\x81\xa5=\'q\xe9\xbcM\x81.=\x15\x02<\xbc\x18\xe3,\xbch\xcej\xbc\xd7\xa7\x05=1\x87\xf7;b\xd6\xec\xbb\xe1\xa8\xcc\xbc\xce\xcc\xd9\xbb\x96\x12\xa1\xbcgj\xcf\xbc\x91\x9a\xd1\xbcn\x08K\xbd\xce\xfc\x03=\x9f~\x9a\xbd\x9be\x81\xbc\xee`\xd2<\x95\xbb\xcf\xbb\xa0G\xad\xba\xd9z\xd5\xbd$\xbc\x91<-\x12G<|\xf2\x07\xbbAr\t\xbd\xde\xd7\x90\xba"\x1e ;\x1f\xea1\xbc\x1eN\xb4\xbct\xa3\xfa<\xba;\xb3<\xe2\xdcL\xbdTw\x02\xbd\x02\xda\x96<D_q=\x12B\x1b<\xb9\x8d/\xbd&\x01<=\xa1L\\\xbc\xd6^H=_"\xe9<\xcd\xd57\xbdut8=\xfd\x9b\xbe</d\xa6\xbd\xbe=<\xbcY?\xee;\x18\x9a@=\xddq\x02=\xf0\x1bT\xbd\x04\xd9L=P\xd9n\xbdm\xf1\x8f\xbcI5\xc3<k\x01=\xbd\xb2O\x84;\xfe\xa7U\xbb\x8dS\n=\xba\xe3f\xbc\xddq\x96\xbbT%\x88=(\x03\xa4\xbc\xb6\xcf\xc3\xbc\xd0Z\xa6\xbbE\x11.<HI\xb6<R\xf2\xb6<\xe0\x91\xfe\xbc\xbf\x82\x93\xbb\x8e\xcf\xa8<\x19\x80\x9e<\xedNC<\xa5\xb37=D\xe98=\xda@\x89\xbc\xb

At this point, our **Redis** datastore is completely loaded with a subset of data from **BigQuery** including text embeddings created with **Vertex AI** PaLM APIs.

# Build LLM applications
With Redis fully loaded as a vector database and the PaLM APIs at our disposal, we can build a number of AI applications on this stack. Below we briefly describe some of these applications and use cases:

- **Document Retrieval** - search through documents to return only the most relevant to a given query.
- **Product Recommendations** - recommend products with similar attributes and descriptions to a product the shopper likes.
- **Chatbots** - provide a conversational interface for information retrieval or customer service.
- **Text Summarization & Generation** - Generate new copy from sources of relevant information to accelerate team output.
- **Fraud/Anomaly Detection** - identify anomalous and potentially fraudulent events, transactions, or items based on attribute similarity of other known entities.

# LLM Design Patterns

In order to build these kinds of apps, below we highlight 4 technical design patterns and techniques where Redis Enterprise comes in handy to boost LLM performance:

- **Semantic Search**
- **Retrieval Augmented Generation (RAG)**
- **Caching**
- **Memory**

Leveraging some combination of these patterns is recommended best practice, derived from enterprise use cases and open source users all over the world.

### Simple Semantic Search


**Semantic Search**, in the context of Large Language Models (LLMs), is a sophisticated search technique that goes beyond *literal* keyword matching to understand the contextual meaning and intent behind user queries. Leveraging the power of Google's Vertex AI platform and Redis' vector database capabilities, semantic search can map and extract deep-level knowledge from vast text datasets, including nuanced relationships and hidden patterns.

This allows applications to return search results that are contextually relevant, enhancing user experience by offering meaningful responses, even to complex or ambiguous search terms. Thus, semantic search not only boosts the accuracy and relevancy of search results but also empowers applications to interact with users in a more human-like, intuitive manner.

The general process of semantic search includes 3 steps:
1. Create query vector
2. Perform vector search
3. Review and return results

In [15]:
# 1. Create query vector
query = "What is the best computer operating system for software dev?"
query_vector = embed_text([query])[0]
# Our query has been converted to a list of floats (this is a truncated view)
query_vector[:10]

[0.019742609933018684,
 -0.006285817362368107,
 0.057011738419532776,
 0.0630318894982338,
 0.019907860085368156,
 -0.045217469334602356,
 0.0657828152179718,
 0.027572786435484886,
 -0.056532640010118484,
 0.003752157324925065]

In [20]:
# Helper method to perform KNN similarity search in Redis
def similarity_search(query: str, k: int, return_fields: tuple, index_name: str = INDEX_NAME) -> list:
    # create embedding from query text
    query_vector = embed_text([query])[0]
    # create redis query object
    redis_query = (
        Query(f"*=>[KNN {k} @{VECTOR_FIELD_NAME} $embedding AS score]")
            .sort_by("score")
            .return_fields(*return_fields)
            .paging(0, k)
            .dialect(2)
    )
    # execute the search
    results = redis_client.ft(index_name).search(
        redis_query, query_params={"embedding": convert_embedding(query_vector)}
    )
    return pd.DataFrame([t.__dict__ for t in results.docs ]).drop(columns=["payload"])


In [21]:
# 2. Perform vector similarity search with given query
results = similarity_search(query, k=5, return_fields=("score", "title", "text"))

In [22]:
# 3. Review and return the results
display(results)

|   | id | score | title | text |
|---|---|---|---|---|
| 0 | doc:23429219 | 0.276380896568 | Ask HN: Laptop and OS setup for total web deve... | As a web developer I was handed a MacBook a co... |
| 1 | doc:14396292 | 0.326441407204 | Ask HN: Recommend me a linux compatible laptop | Looking for a new laptop. I do full stack deve... |
| 2 | doc:9320563 | 0.329460144043 | Ask HN: Best server setup for running machine ... | I&#x27;m into applying machine learning for la... |
| 3 | doc:24626052 | 0.332834243774 | What is the best accounting software for a sma... | I just stumblupon this topic, so like to ask y... |
| 4 | doc:6132351 | 0.345210790634 | Ask HN: Software Stack for WebApp Python vs. R... | I have been seriously out of the hard core tec... |


Results above indicate that our search for recommended operating systems for software devs yielded some posts from Hacker News that might be helpful in answering this question.

**Interested in tuning the search results?**
- Try using a different [Distance Metric](https://redis.io/docs/interact/search-and-query/search/vectors/#creation-attributes-per-algorithm)
- Try using a different [Index Type](https://redis.io/docs/interact/search-and-query/search/vectors/#flat)
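
To see why the choice of distance metric matters, the sketch below ranks two hypothetical 2-dimensional candidates against a query using cosine distance and Euclidean (L2) distance. The vectors are contrived so that the two metrics disagree; the code uses the textbook definitions and makes no claim about how Redis scales its scores.

```python
import numpy as np

query = np.array([1.0, 0.0])
candidates = {
    "long_same_direction": np.array([10.0, 1.0]),  # far away, similar direction
    "short_nearby": np.array([0.9, 0.1]),          # close by, slightly different direction
}

def cosine_distance(a, b):
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def l2_distance(a, b):
    return float(np.linalg.norm(a - b))

nearest_by_cosine = min(candidates, key=lambda name: cosine_distance(query, candidates[name]))
nearest_by_l2 = min(candidates, key=lambda name: l2_distance(query, candidates[name]))

print(nearest_by_cosine, nearest_by_l2)
```

Cosine distance ignores vector length and compares direction only, while L2 distance is sensitive to magnitude, so the two can produce different rankings on the same data.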

### Retrieval Augmented Generation (RAG)

**Retrieval Augmented Generation** (RAG), within the scope of Large Language Models (LLMs), is a technique that combines domain-specific data with generative models to produce contextually rich answers to questions. In essence, *RAG* retrieves relevant information from a knowledge base of documents or data before generating a response. This gives generalized foundation models access to these data sources at runtime, and is NOT the same thing as fine-tuning.

RAG exploits the strengths of Redis as a low-latency vector database for efficient retrieval operations and Google's Vertex AI to generate a coherent text response. In LLM applications, RAG enables a deeper comprehension of context, returning highly nuanced responses, even to intricate queries. This pattern enhances the interactive capability of applications, delivering more precise and informative responses, thereby significantly enriching the user experience.


In order to build a RAG pipeline for question answering, we need to use Vertex PaLM API for text generation (`text-bison@001`).

In [23]:
from vertexai.preview.language_models import TextGenerationModel

# Define generation model
generation_model = TextGenerationModel.from_pretrained("text-bison@001")

response = generation_model.predict(prompt="What is a large language model?")

print("Example response:\n", response.text)


Example response:
 A large language model (LLM) is a type of artificial intelligence (AI) model that can understand and generate human language. LLMs are trained on massive datasets of text and code, and they can learn to perform a wide variety of tasks, such as translating languages, writing different kinds of creative content, and answering your questions in an informative way.

LLMs are still under development, but they have the potential to revolutionize many industries. For example, LLMs could be used to create more accurate and personalized customer service experiences, to help doctors diagnose and treat diseases, and to even write entire books and movies.




In order to be able to answer questions **while referencing domain-specific sources** (like our sample hackernews dataset), we must build a RAG pipeline:

1. First perform **Semantic Search** with the user query on the knowledge base (stored in Redis) to find relevant sources that will help the language model answer and respond intelligently.

2. The sources (called context) are "stuffed" into the prompt (input).

3. Lastly, the full prompt is passed on to the language model for text generation.
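
The three steps above can be sketched offline, without Redis or Vertex AI, by swapping in stand-ins. Everything here (`build_rag_prompt`, `fake_retrieve`, `fake_generate`, the template, and the tiny corpus) is hypothetical and exists only to show the data flow.

```python
def build_rag_prompt(query, retrieve, template, k=3):
    """Retrieve k sources, then stuff them into the prompt template."""
    sources = retrieve(query, k)                        # step 1: semantic search
    sources_text = "\n".join(f"-{s}" for s in sources)  # step 2: format the context
    return template.format(sources=sources_text, query=query)

# Hypothetical stand-ins so the sketch runs without Redis or Vertex AI
def fake_retrieve(query, k):
    corpus = [
        "Linux works well for backend work.",
        "macOS is common for iOS work.",
        "Windows has WSL.",
    ]
    return corpus[:k]

def fake_generate(prompt):
    return f"(model output for a prompt of {len(prompt)} characters)"

TEMPLATE = "SOURCES:\n{sources}\n\nQUESTION:\n{query}\n\nANSWER:"

prompt = build_rag_prompt("Which OS suits developers?", fake_retrieve, TEMPLATE, k=2)
answer = fake_generate(prompt)                          # step 3: generation
print(prompt)
```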

In [24]:
def create_prompt(prompt_template: str, **kwargs) -> str:
  return prompt_template.format(**kwargs)

def rag(query: str, prompt: str, verbose: bool = True) -> str:
    """
    Simple pipeline for performing retrieval augmented generation with
    Google Vertex PaLM API and Redis Enterprise.
    """
    # Perform a vector similarity search in Redis
    if verbose:
        print("Pulling relevant data sources from Redis", flush=True)
    relevant_sources = similarity_search(query, k=3, return_fields=("text",))
    if verbose:
        print("Relevant sources found!", flush=True)
    # Combine the relevant sources and inject into the prompt
    sources_text = "-" + "\n-".join([source for source in relevant_sources.text.values])
    full_prompt = create_prompt(
        prompt_template=prompt,
        sources=sources_text,
        query=query
      )
    if verbose:
        print("\nFull prompt:\n\n", full_prompt, flush=True)
    # Perform text generation to get a response from PaLM API
    response = generation_model.predict(prompt=full_prompt)
    return response.text



Below is an example prompt template. Feel free to edit and tweak the initial sentence that sets the context for the language model to perform the action we are anticipating. The process of tuning and iterating on prompt design is widely referred to as "*prompt engineering*".

In [25]:
PROMPT = """You are a helpful virtual technology and IT assistant. Use the hacker news posts below as relevant context and sources to help answer the user question. Don't blindly make things up.

SOURCES:
{sources}

QUESTION:
{query}

ANSWER:"""



In [26]:
query = "What are the best operating systems for software development?"
response = rag(query=query, prompt=PROMPT)
print(response)

Pulling relevant data sources from Redis
Relevant sources found!

Full prompt:

 You are a helpful virtual technology and IT assistant. Use the hacker news posts below as relevant context and sources to help answer the user question. Don't blindly make things up.

SOURCES:
-As a web developer I was handed a MacBook a couple years ago.  Its &#x27;ok&#x27;.  The keyboard is jank and not being able to span windows over multiple monitors is one of my many minor annoyances with MacOS.<p>However, I was recently given the chance to switch to a laptop+OS of my choosing.  I was eyeing up a thinkpad or xps, but I realized that I have no clue how I would facilitate the testing of Mac specific things like Safari and Apple Pay outside of a VM.  Likewise, currently I have a 2nd windows laptop just for the sake of Edge&#x2F;IE11 testing.<p>Folks who need to be able to test IE11&#x2F;Edge&#x2F;Edgium&#x2F;Chrome&#x2F;Firefox&#x2F;Safari are you using VM&#x27;s, juggling multiple physical machines, or 

In [27]:
query = "What are some commonly reported problems with MacBooks and MacOS for software dev purposes?"
response = rag(query=query, prompt=PROMPT)
print(response)

Pulling relevant data sources from Redis
Relevant sources found!

Full prompt:

 You are a helpful virtual technology and IT assistant. Use the hacker news posts below as relevant context and sources to help answer the user question. Don't blindly make things up.

SOURCES:
-As a web developer I was handed a MacBook a couple years ago.  Its &#x27;ok&#x27;.  The keyboard is jank and not being able to span windows over multiple monitors is one of my many minor annoyances with MacOS.<p>However, I was recently given the chance to switch to a laptop+OS of my choosing.  I was eyeing up a thinkpad or xps, but I realized that I have no clue how I would facilitate the testing of Mac specific things like Safari and Apple Pay outside of a VM.  Likewise, currently I have a 2nd windows laptop just for the sake of Edge&#x2F;IE11 testing.<p>Folks who need to be able to test IE11&#x2F;Edge&#x2F;Edgium&#x2F;Chrome&#x2F;Firefox&#x2F;Safari are you using VM&#x27;s, juggling multiple physical machines, or 

Clearly this example dataset (hackernews) is not the only one we could work with, and it's certainly not "production" ready out of the gate. This example also uses only a subset (1000 records) of the actual data for teaching purposes.

However, this example demonstrates how you can combine external sources of data and LLMs to surface more useful information.

### LLM Caching

**LLM Caching** is an advanced strategy used to optimize the performance of Large Language Model (LLM) applications. Utilizing the ultra-fast, in-memory data store of Redis, LLM Caching enables the storage and quick retrieval of pre-computed responses generated by Google's Vertex AI (PaLM). This means the computationally expensive process of response generation, especially for repetitive queries, is significantly reduced, resulting in faster response times and efficient resource utilization. This pairing of Google's powerful generative AI capabilities with Redis' high-performance caching system thus facilitates a more scalable and performant architecture for LLM applications, improving overall user experience and application reliability.

There are primarily two modes of caching for LLMs:
- Standard Caching
- Semantic Caching

#### Standard Caching

Standard caching for LLMs involves simply matching an exact phrase or prompt that has been provided before. We can return the previously used response from the LLM in order to speed up the throughput of the system overall and reduce redundant computation.
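
Exact matching means the cache key depends on every character of the prompt. The small sketch below, using a hypothetical `cache_key` helper built on SHA-256, shows that identical prompts share a key while a minor wording change does not.

```python
import hashlib

def cache_key(prompt, prefix="llmcache:"):
    """Derive a fixed-length key from the exact prompt text."""
    return prefix + hashlib.sha256(prompt.encode("utf-8")).hexdigest()

same_a = cache_key("What is Redis?")
same_b = cache_key("What is Redis?")
different = cache_key("what is redis")

print(same_a == same_b)     # identical text maps to the same key
print(same_a == different)  # a small wording change maps to a different key
```

This sensitivity to wording is the limitation that semantic caching is meant to address.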

In [28]:
# Some boilerplate helper methods
import hashlib

def hash_input(prefix: str, _input: str) -> str:
    return prefix + hashlib.sha256(_input.encode("utf-8")).hexdigest()

def standard_check(key: str):
    # Perform a standard (exact match) cache check; returns None on a miss
    res = redis_client.hgetall(key)
    if res:
        return res[b'response'].decode('utf-8')

def cache_response(query: str, response: str):
    key = hash_input("llmcache:", query)
    redis_client.hset(key, mapping={"prompt": query, "response": response})

# LLM Cache wrapper / decorator function
def standard_llmcache(llm_callable):
    def wrapper(query: str):
        # Check LLM Cache first
        key = hash_input("llmcache:", query)
        response = standard_check(key)
        # Check if we have a cached response we can use
        if response:
            return response
        # Otherwise execute the llm callable here
        response = llm_callable(query)
        # Cache under the query passed to the wrapper, not a global variable
        cache_response(query, response)
        return response

    return wrapper

In [29]:
# Define a function that invokes the PaLM API wrapped with a cache check

@standard_llmcache
def ask_palm(query: str):
  prompt = PROMPT
  response = rag(query, prompt, verbose=False)
  return response

In [31]:
%%time

query = "What are the best operating systems for software development?"

ask_palm(query)

CPU times: user 1.5 ms, sys: 0 ns, total: 1.5 ms
Wall time: 148 ms


'The best operating system for software development is the one that you are most comfortable with. There are many different operating systems available, and each has its own strengths and weaknesses. Some of the most popular operating systems for software development include Windows, macOS, and Linux.\n\nWindows is a popular choice for software development because it is easy to use and has a wide range of software available. macOS is also a good choice for software development because it is stable and has a good developer community. Linux is a good choice for software development because it is free and open source, and it has a wide range of features that can be customized to meet the'

Now if we ask the same question again, the response should come straight from the cache in near real-time, without another call to the LLM.

In [32]:
%%time

ask_palm(query)

CPU times: user 1.91 ms, sys: 122 µs, total: 2.03 ms
Wall time: 148 ms


'The best operating system for software development is the one that you are most comfortable with. There are many different operating systems available, and each has its own strengths and weaknesses. Some of the most popular operating systems for software development include Windows, macOS, and Linux.\n\nWindows is a popular choice for software development because it is easy to use and has a wide range of software available. macOS is also a good choice for software development because it is stable and has a good developer community. Linux is a good choice for software development because it is free and open source, and it has a wide range of features that can be customized to meet the'

#### Semantic Caching
Semantic caching builds on the same concept as above, but takes advantage of Redis' vector database features to look up semantically similar prompts that have been answered before. This makes it possible to serve cached responses for queries that are very similar in meaning but do not use exactly the same words and phrases. The result is a higher cache hit rate, which makes LLM caching even more effective in practice.

**COMING SOON**: we have a client tool/library in the works to best demonstrate this. In the meantime, you can build your own using the functions and tools we've defined thus far.
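As a rough illustration of the idea, here is a minimal sketch of a semantic cache. It is not the upcoming library. To stay self-contained it uses two stand-ins that are assumptions: `toy_embed` (a bag-of-words vector in place of a real embedding model) and an in-memory list (in place of a Redis vector index). The names `SemanticCache`, `semantic_llmcache`, and the `threshold` value are hypothetical.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    # Hypothetical stand-in for a real embedding model: bag-of-words counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    def __init__(self, embed=toy_embed, threshold: float = 0.8):
        self.embed = embed
        self.threshold = threshold  # assumed cutoff; tune for your data
        self.entries = []  # in-memory stand-in for a vector index

    def check(self, query: str):
        # Return the response of the most similar cached prompt,
        # or None if nothing clears the similarity threshold.
        q = self.embed(query)
        best_score, best_response = 0.0, None
        for emb, _prompt, response in self.entries:
            score = cosine_similarity(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def store(self, query: str, response: str):
        self.entries.append((self.embed(query), query, response))


def semantic_llmcache(cache: SemanticCache):
    # Decorator factory mirroring the exact-match cache wrapper pattern.
    def decorator(llm_callable):
        def wrapper(query: str, *args, **kwargs):
            cached = cache.check(query)
            if cached is not None:
                return cached
            response = llm_callable(query, *args, **kwargs)
            cache.store(query, response)
            return response
        return wrapper
    return decorator
```

With this sketch, a rephrased question such as "What are the best operating systems for developing software?" can hit an entry cached for "What are the best operating systems for software development?", which an exact-match hash key would miss. In a real deployment the linear scan would be replaced by a vector similarity query against Redis, and the threshold would need tuning to avoid returning answers to questions that are only superficially similar.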

### Memory

Giving your application access to "memory" for chat history is a common technique to improve the model's ability to reason through recent or past conversations, gain context from previous answers, and thus provide a more accurate and relevant response.

Below we set up simple helper functions to persist and load conversation history in a Redis List data structure.

In [33]:
import json

def add_message(prompt: str, response: str):
    msg = {
        "prompt": prompt,
        "response": response
    }
    redis_client.lpush("chat-history", json.dumps(msg))

def get_messages(k: int = 5):
    # LRANGE bounds are inclusive, so 0..k-1 returns the k most recent messages (newest first)
    return [json.loads(msg) for msg in redis_client.lrange("chat-history", 0, k - 1)]
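The helpers above only store and load history. To actually give the model that context, the messages need to be folded into what is sent to the LLM. The following is a minimal sketch of one way to do that. The function names `format_history` and `build_query_with_history`, and the "User:"/"Assistant:" layout, are assumptions for illustration rather than part of this notebook's pipeline.

```python
from typing import Dict, List


def format_history(messages: List[Dict[str, str]]) -> str:
    # Messages are expected newest-first (LPUSH prepends to the list),
    # so reverse them to present the conversation chronologically.
    lines = []
    for msg in reversed(messages):
        lines.append(f"User: {msg['prompt']}")
        lines.append(f"Assistant: {msg['response']}")
    return "\n".join(lines)


def build_query_with_history(query: str, messages: List[Dict[str, str]]) -> str:
    # Prepend the formatted history to the new question;
    # fall back to the bare question when there is no history yet.
    history = format_history(messages)
    if not history:
        return query
    return f"Conversation so far:\n{history}\n\nCurrent question: {query}"
```

One possible wiring would be to pass `build_query_with_history(query, get_messages())` wherever the plain `query` is currently sent to the model. Note that if the same string also drives document retrieval, the added history will influence which documents are retrieved, which may or may not be desirable.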

In [34]:

query = "Do you have any advice for getting started in the tech field as a software dev?"
response = rag(query, PROMPT, verbose=False)

print(response)

add_message(query, response)

There are a few things you can do to get started in the tech field as a software developer. First, you should learn the basics of programming. There are many resources available online to help you do this, such as tutorials, books, and online courses. Once you have a basic understanding of programming, you can start working on your own projects. This will help you build your skills and experience. You can also start looking for internships or entry-level jobs in the tech industry. This will give you the opportunity to work on real-world projects and learn from experienced developers.


In [35]:
query = "What if I am still in college, any tips there?"
response = rag(query, PROMPT, verbose=False)

print(response)

add_message(query, response)

I would recommend that you take advantage of all the resources that your college has to offer. There are likely to be many clubs and organizations that you can join, which can be a great way to meet new people and learn new things. You should also take advantage of the career center, which can help you with everything from finding internships to preparing for job interviews.


In [36]:
get_messages()

[{'prompt': 'What if I am still in college, any tips there?',
  'response': 'I would recommend that you take advantage of all the resources that your college has to offer. There are likely to be many clubs and organizations that you can join, which can be a great way to meet new people and learn new things. You should also take advantage of the career center, which can help you with everything from finding internships to preparing for job interviews.'},
 {'prompt': 'Do you have any advice for getting started in the tech field as a software dev?',
  'response': 'There are a few things you can do to get started in the tech field as a software developer. First, you should learn the basics of programming. There are many resources available online to help you do this, such as tutorials, books, and online courses. Once you have a basic understanding of programming, you can start working on your own projects. This will help you build your skills and experience. You can also start looking for 