# INTRODUCTION

### This notebook demonstrates how one might use MemoryDB as a recommendation engine.

In this Notebook, we utilize [Amazon Bedrock to create vector embeddings for text](https://docs.aws.amazon.com/bedrock/latest/userguide/titan-embedding-models.html) whenever necessary. This includes the initial creation of vector embeddings for the dataset we have. We also use it to create embeddings of the text we provide to MemoryDB during a [vector similarity search (VSS)](https://docs.aws.amazon.com/memorydb/latest/devguide/vector-search-overview.html).

## SECTION 1 - SETUP

### Requirements

Before beginning, you must have the following:

- MemoryDB Cluster with:
  - VSS Enabled
  - TLS Enabled
  - Username and password with sufficient permissions
- Access to Amazon Bedrock for embeddings
- Access to either:
  - Amazon SageMaker AI Notebooks
  - OR
  - An EC2 system running Jupyter Notebook that has connectivity to MemoryDB.

The Amazon SageMaker AI Notebook or EC2 system must have at least 100 GB of storage free.

### Setup credential information

Create or modify the `env` file to include the following variables:

- `MEMORYDB_ENDPOINT` (ex: "clustercfg....memorydb.us-east-1.amazonaws.com")
- `MEMORYDB_USERNAME` (ex: "my_username")
- `MEMORYDB_PASSWORD` (ex: "my_password")
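
A minimal sketch of a sanity check that could run once these variables have been loaded, so that a missing value is caught before the first connection attempt. The helper name `find_missing_settings` and the sample values are hypothetical and not part of the notebook; the helper only inspects a mapping such as `os.environ`.

```python
REQUIRED_SETTINGS = ("MEMORYDB_ENDPOINT", "MEMORYDB_USERNAME", "MEMORYDB_PASSWORD")


def find_missing_settings(env, required=REQUIRED_SETTINGS):
    """Return the names of required settings that are absent or empty in `env`."""
    return [name for name in required if not env.get(name)]


# Example with a hand-built mapping; in the notebook you would pass os.environ
# after the `env` file has been loaded.
example_env = {
    "MEMORYDB_ENDPOINT": "clustercfg.example",  # placeholder, not a real endpoint
    "MEMORYDB_USERNAME": "my_username",
}
print(find_missing_settings(example_env))  # ['MEMORYDB_PASSWORD']
```

An empty list means every required setting is present and non-empty.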

### Install Python libraries

In [None]:
%%time
import importlib.util
import sys
def install_if_missing(package):
    if importlib.util.find_spec(package) is None:
        !{sys.executable} -m pip install {package}
        
packages = ['valkey[libvalkey]', 'requests', 'pandas', 'python-dotenv',
            'boto3', 'botocore', 'langchain_aws', 'redis', 'numpy', 'langchain']

for package in packages:
    install_if_missing(package)

### Import required libraries

In [None]:
%%time
import ast
import json
import os
import tarfile
import time
from operator import itemgetter
from typing import Dict, List, Tuple
from urllib.parse import quote_plus
import boto3
import numpy as np
import pandas as pd
import requests
from IPython.display import Markdown, display
from botocore.exceptions import ClientError
from dotenv import load_dotenv
from langchain.docstore.document import Document
from langchain_aws.embeddings import BedrockEmbeddings
from langchain_aws.vectorstores.inmemorydb import (
    InMemoryDBTag,
    InMemoryVectorStore,
)
from langchain_aws.vectorstores.inmemorydb.filters import InMemoryDBFilterExpression
from valkey.cluster import ValkeyCluster as MemoryDBCluster

### Load configuration values from environment file

In [None]:
load_dotenv('env')
MEMORYDB_ENDPOINT = os.getenv('MEMORYDB_ENDPOINT')
MEMORYDB_USERNAME = os.getenv('MEMORYDB_USERNAME')
MEMORYDB_PASSWORD = os.getenv('MEMORYDB_PASSWORD')

### Define global variables for later use

In [None]:
MEMORYDB_CLUSTER_URI = f"rediss://{MEMORYDB_USERNAME}:{MEMORYDB_PASSWORD}@{MEMORYDB_ENDPOINT}"

MAX_MOVIES = 15000

MAX_PLOT_LENGTH = 250

# Which metadata field names to use, and in the specific order
metadata_field_names = ['plot', 'release_date',
                        'content', 'genres', 'actors', 'movie_name', 'movie_id']

INDEX_NAME = 'movie_index'
MOVIE_DATA_URL = 'https://www.cs.cmu.edu/~ark/personas/data'
ORIGINAL_MOVIE_FILE = 'MovieSummaries.tar.gz'
DATASET_DIR = 'datasets'
COMPRESSED_FILE = f'{DATASET_DIR}/{ORIGINAL_MOVIE_FILE}'
FULL_DATASET_PATH = f'{DATASET_DIR}/MovieSummaries'
MOVIE_TSV = f'{FULL_DATASET_PATH}/movie.metadata.tsv'
ACTOR_TSV = f'{FULL_DATASET_PATH}/character.metadata.tsv'
PLOTS_TSV = f'{FULL_DATASET_PATH}/plot_summaries.txt'
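
One caveat with building `MEMORYDB_CLUSTER_URI` by plain string interpolation: a username or password containing characters such as `@`, `:` or `/` would produce an ambiguous URI. The following is a minimal sketch of a safer variant using `quote_plus` from `urllib.parse`, which the notebook already imports. The function name `build_cluster_uri` and the sample values are hypothetical, and the sketch assumes the client library decodes percent-encoded credentials when it parses the URI.

```python
from urllib.parse import quote_plus


def build_cluster_uri(username, password, endpoint):
    """Build a TLS (rediss://) URI with percent-encoded credentials."""
    return f"rediss://{quote_plus(username)}:{quote_plus(password)}@{endpoint}"


# Placeholder credentials and endpoint, for illustration only.
uri = build_cluster_uri("my_username", "p@ss:word/1", "clustercfg.example:6379")
print(uri)  # rediss://my_username:p%40ss%3Aword%2F1@clustercfg.example:6379
```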

## SECTION 2 - DOWNLOAD AND EXTRACT DATA

In this section we download a movie dataset including movies, actors and plots.

In [None]:
%%time
print('Wait until you see "DONE" after executing this cell before moving on.')
print('There are several steps and it could take a minute.\n')
# Create a new directory if it doesn't exist
if not os.path.exists(f"{DATASET_DIR}"):
    print(f'Creating dataset directory \'{DATASET_DIR}\'\n')
    os.makedirs(f"{DATASET_DIR}")
else:
    print(
        f'Dataset directory \'{DATASET_DIR}\' already exists, skipping creation of directory.\n')

if not os.path.isfile(COMPRESSED_FILE):
    DOWNLOAD_URL = f'{MOVIE_DATA_URL}/{ORIGINAL_MOVIE_FILE}'
    print(f'Starting download of {DOWNLOAD_URL}\n', flush=True)
    response = requests.get(DOWNLOAD_URL)
    # Fail fast on HTTP errors instead of saving an error page as the archive
    response.raise_for_status()
    print(f'Saving to {COMPRESSED_FILE}\n')
    with open(COMPRESSED_FILE, 'wb') as file:
        file.write(response.content)
else:
    print(f'Found {COMPRESSED_FILE}, skipping download.\n')

if not os.path.exists(FULL_DATASET_PATH):
    print(f'Extracting {COMPRESSED_FILE}.\n', flush=True)
    with tarfile.open(COMPRESSED_FILE, 'r:gz') as tar:
        tar.extractall(path=f'{DATASET_DIR}')
else:
    print(
        f'Compressed file {COMPRESSED_FILE} has already been extracted. Skipping.\n')

print('DONE\n')

### Helper function to remove Freebase ID information.

In [None]:
# We use this function because the source dataset has
# an old Freebase unique ID value, which is confusing to
# work with, and we do not need it, only the values.

def convert_freebase_kv_pairs(field_string):
    try:
        # Convert string to dictionary
        field_dict = ast.literal_eval(field_string)
        # Extract only the values (descriptions)
        return list(field_dict.values())
    except (ValueError, SyntaxError):
        return []  # Return an empty list if parsing fails
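
To illustrate the conversion, here is a small standalone sketch that repeats the helper and applies it to a hand-written sample. The sample string imitates the `{Freebase ID: label}` shape of the `genres` column; the IDs in it are made up for illustration.

```python
import ast


def convert_freebase_kv_pairs(field_string):
    """Parse a dict-like string and keep only its values."""
    try:
        field_dict = ast.literal_eval(field_string)
        return list(field_dict.values())
    except (ValueError, SyntaxError):
        return []  # parsing failed


# Made-up Freebase-style IDs mapped to genre labels.
sample = '{"/m/0000a": "Thriller", "/m/0000b": "Science Fiction"}'
print(convert_freebase_kv_pairs(sample))        # ['Thriller', 'Science Fiction']
print(convert_freebase_kv_pairs("not a dict"))  # []
```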

### Load Movie Data

In [None]:
%%time
movie_column_names = [
    'movie_id', 'freebase_movie_id', 'movie_name', 'release_date',
    'revenue', 'runtime', 'languages', 'countries', 'genres'
]
movie_include_columns = ['movie_id', 'movie_name', 'release_date', 'genres']

# Read TSV into a DataFrame, keeping only the included columns
movies_df = pd.read_csv(
    f'{MOVIE_TSV}',
    sep='\t',
    names=movie_column_names,
    usecols=movie_include_columns,
    index_col='movie_id',
    nrows=MAX_MOVIES
)

# Convert genres field into a list
movies_df['genres'] = movies_df['genres'].apply(convert_freebase_kv_pairs)
movies_df.head()

### Load Actor Data

The data file has one line per actor, so we have to import the actors, and then group them together based on movie id.
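
The grouping step can be pictured with a plain-Python sketch that does not depend on pandas. The rows below are made-up `(movie_id, actor_name)` pairs, and `None` stands in for a missing actor name.

```python
from collections import defaultdict

# Hypothetical rows: one (movie_id, actor_name) pair per line of the file.
rows = [
    (101, "Actor A"),
    (101, "Actor B"),
    (202, "Actor C"),
    (202, None),  # missing name, skipped the way a NaN row would be dropped
]

actors_by_movie = defaultdict(list)
for movie_id, actor_name in rows:
    if actor_name is not None:
        actors_by_movie[movie_id].append(actor_name)

print(dict(actors_by_movie))  # {101: ['Actor A', 'Actor B'], 202: ['Actor C']}
```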

In [None]:
%%time
actor_column_names = [
    'movie_id', 'freebase_movie_id', 'release_date', 'character_name', 'actor_dob',
    'actor_gender', 'actor_height', 'actor_ethnicity', 'actor_name', 'actor_age_at_movie',
    'freebase_character_map', 'misc1', 'misc2']
actor_include_columns = ['movie_id', 'actor_name']
initial_actors_df = pd.read_csv(
    f'{ACTOR_TSV}',
    sep='\t',
    names=actor_column_names,
    usecols=actor_include_columns
)

# Drop nan values
nan_actors_df = initial_actors_df.dropna()

# Group actors based on movie_id
actors_df = nan_actors_df.groupby(
    'movie_id')['actor_name'].apply(list).reset_index()

# Rename the column containing grouped actor_name values to 'actors'
actors_df.columns = ['movie_id', 'actors']
actors_df = actors_df.set_index('movie_id')

actors_df.head()

### Load Movie Plot Data

In [None]:
%%time

plots_column_names = ['movie_id', 'plot']
plots_include_columns = ['movie_id', 'plot']

initial_plots_df = pd.read_csv(
    f'{PLOTS_TSV}',
    sep='\t',
    names=plots_column_names,
    usecols=plots_include_columns,
    index_col='movie_id'
)

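# Remove NaNs; copy so the assignment below works on an independent DataFrame
plots_df = initial_plots_df.dropna().copy()
# Remove remnants of Wikipedia "{{Expand section}}" comments.
# Series.replace only matches whole cell values, so use str.replace for substrings.
plots_df['plot'] = plots_df['plot'].str.replace("{{Expand section}}", "", regex=False)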

plots_df.head()

### Merge all 3 datasets (movie, actors, and movie plots)

In [None]:
# Merge all 3 data frames into a single data frame
FULL_MOVIE_DATASET = pd.concat([movies_df, actors_df, plots_df], axis=1)

# If we have more than MAX_MOVIES, reduce the size
FULL_MOVIE_DATASET = FULL_MOVIE_DATASET.iloc[:MAX_MOVIES]

# remove any entries that have NaN in any column
FULL_MOVIE_DATASET = FULL_MOVIE_DATASET.dropna()

print(f'Resulted in {len(FULL_MOVIE_DATASET)} total movies.\n')
FULL_MOVIE_DATASET.head()

## SECTION 3 - PREP MEMORYDB FOR VECTOR STORE USAGE

### Function to convert String into List of Strings

This function is required to turn a string of multiple values such as `'Value 1, Value 2, Value 3'` into a list of strings like `['Value 1', 'Value 2', 'Value 3']`. This is required if we want to use MemoryDB's TAG functions.
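
As a minimal sketch of the string-to-list conversion described above (the helper name `split_to_list` is hypothetical and separate from the notebook's `preprocess_metadata`):

```python
def split_to_list(value, separator=","):
    """Split a delimited string into a list of trimmed, non-empty strings."""
    return [part.strip() for part in value.split(separator) if part.strip()]


print(split_to_list("Value 1, Value 2, Value 3"))  # ['Value 1', 'Value 2', 'Value 3']
```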

In [None]:
def preprocess_metadata(row):
    metadata = row.to_dict()  # Convert row to dictionary
    for key, value in metadata.items():
        # Convert ndarray to a list
        if isinstance(value, np.ndarray):
            metadata[key] = [f'"{item}"' for item in value.tolist()]
    return metadata

### Create collection of Langchain Document objects.

This simplifies the creation of text embeddings for the data. We will use this collection in the subsequent step when we create the MemoryDB VSS index and ingest the data.

In [None]:
%%time
# Create Documents from the DataFrame
ALL_DOCS = []

counter = 0
for index, row in FULL_MOVIE_DATASET.iterrows():

    metadata = preprocess_metadata(row)
    metadata['movie_id'] = str(index)
    content = f"Movie: {row['movie_name']}\n"
    content += f"Release Date: {row['release_date']}\n"
    content += f"Genres: {row['genres']}\n"
    content += f"Actors: {row['actors']}\n"
    content += f"Plot: {row['plot']}\n"

    # Create a Document object
    doc = Document(
        page_content=content,
        metadata=metadata
    )
    ALL_DOCS.append(doc)

print(f'Created {len(ALL_DOCS)} LangChain Documents.\n', flush=True)

# Uncomment the following line to display the first document to validate.
#ALL_DOCS[0]

### Get vector embeddings for all documents from Amazon Bedrock, then import into MemoryDB vector store

**Note**: For 6,500 documents of this size, this usually takes about 9 minutes when using a remote embedding model.

In [None]:
%%time
vectorstore = InMemoryVectorStore.from_documents(
    ALL_DOCS,
    embedding=BedrockEmbeddings(),
    redis_url=MEMORYDB_CLUSTER_URI,
    index_name=INDEX_NAME)

### Setup MemoryDB client to run various Valkey commands

The following cell creates a connection to MemoryDB to perform operations such as [HMGET](https://valkey.io/commands/hmget/). It is not what we will use for vector-based searches. That is defined in the `perform_query` function later on.

In [None]:
mdb_client = MemoryDBCluster.from_url(
    MEMORYDB_CLUSTER_URI, decode_responses=True)

### Show VSS index information

Here we use the MemoryDB client we created in the previous cell to execute the [FT.INFO](https://docs.aws.amazon.com/memorydb/latest/devguide/vector-search-commands-ft.info.html) command to get information about the VSS index. As a reminder, this client can be used to perform any valid MemoryDB [command](https://valkey.io/commands/) and is not specific to vector searches.

When executing this cell, review the output of the command.

Notice that some fields are of type `TAG` (such as `actors` and `genres`). This is because we included them as `metadata` fields when we created the LangChain documents, and because each one was a list of strings, LangChain automatically configured them as tags.

Notice also that the `content_vector` field is of type `VECTOR`. This field is where the vector embeddings are stored and is specifically what we query when we perform a vector similarity search later on. This field is the embedded equivalent of the text field `content`.

In [None]:
mdb_client.ft(INDEX_NAME).info()

## SECTION 4 - QUERYING AND ENRICHING DOCUMENTS WITH METADATA

### Primary vector search function

The following `perform_query` function will be called whenever we perform a vector search on MemoryDB.

This will use the [LangChain AWS InMemoryVectorStore](https://api.python.langchain.com/en/latest/aws/vectorstores/langchain_aws.vectorstores.inmemorydb.base.InMemoryVectorStore.html) class, which will automatically create a vector embedding for the query text that we pass into it, which it then uses under the covers with the [FT.SEARCH](https://docs.aws.amazon.com/memorydb/latest/devguide/vector-search-commands-ft.search.html) command.

Note that it uses the [similarity_search_with_relevance_scores](https://api.python.langchain.com/en/latest/aws/vectorstores/langchain_aws.vectorstores.inmemorydb.base.InMemoryVectorStore.html#langchain_aws.vectorstores.inmemorydb.base.InMemoryVectorStore.similarity_search_with_relevance_scores) function. The `similarity_search` function would also work, but it does not return a score. We use the scored variant here to show how relevant each result is to the query.

Finally, note that we pass `BedrockEmbeddings` as the embedding function. It converts the `query` text into its vector equivalent so that we can perform fast vector searches in MemoryDB.

In [None]:
def perform_query(query, k=15, filter=None):
    memorydb_vss_client = InMemoryVectorStore(
        redis_url=MEMORYDB_CLUSTER_URI,
        index_name=INDEX_NAME,
        embedding=BedrockEmbeddings()
    )
    results = memorydb_vss_client.similarity_search_with_relevance_scores(
        query, k=k, filter=filter)
    return results
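
Conceptually, a vector similarity search ranks stored vectors by how close they are to the query vector. The following is a brute-force sketch of that idea using cosine similarity, one common metric, on tiny made-up vectors. It is not how MemoryDB or LangChain implement the search, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k):
    """Return the k (key, score) pairs most similar to the query, best first."""
    scored = [(key, cosine_similarity(query_vector, vec)) for key, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional "embeddings"; real embeddings have far more dimensions.
stored_vectors = {
    "movie:space": [0.9, 0.1, 0.0],
    "movie:romance": [0.0, 0.2, 0.9],
    "movie:sci-fi": [0.8, 0.3, 0.1],
}
for key, score in top_k([1.0, 0.0, 0.0], stored_vectors, k=2):
    print(key, round(score, 3))
```

The two keys printed are the ones whose vectors point most nearly in the same direction as the query vector.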

### Create functions for metadata

**Note** again that we are using LangChain (AWS) InMemoryVectorStore. The search functions available in that class do not automatically return all of the fields from MemoryDB, such as `genres`, `actors`, etc. Because of this, we need to create the following helper functions to fetch that metadata from MemoryDB.

In the future, the `InMemoryVectorStore` search functions may have the ability to provide more fields from MemoryDB. In that case, the following functions could be removed.

### Function to fetch all movie data from MemoryDB

This function takes a MemoryDB key name and fetches all of the metadata fields defined in `metadata_field_names` from it.

In [None]:
# Note: we simplify creating a Dictionary by using the `metadata_field_names`
# defined in the GLOBALS cell at the top.
def get_metadata_from_mdb(_id):
    mdb_result = mdb_client.hmget(_id, metadata_field_names)
    result_dict = dict(zip(metadata_field_names, mdb_result))
    result_dict['id'] = _id  # add the key name as part of the metadata
    return result_dict
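
`HMGET` returns values in the same order as the requested fields, with a null entry for any field that is missing from the hash, which is why `zip` is enough to rebuild a dictionary. Here is a standalone sketch that uses a made-up reply in place of a live cluster:

```python
field_names = ["plot", "genres", "movie_name"]

# Stand-in for the list a client would return from HMGET;
# None marks a field that is absent from the hash.
fake_hmget_reply = ["A crew explores a distant planet.", None, "Example Movie"]

metadata = dict(zip(field_names, fake_hmget_reply))
missing = [name for name, value in metadata.items() if value is None]

print(metadata["movie_name"])  # Example Movie
print(missing)                 # ['genres']
```

Code that slices or formats these values may want to handle `None` explicitly.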

### Function to enrich an existing LangChain Document

Now that we have the ability to fetch metadata from MemoryDB, we can use it to enrich a LangChain document. We fetch the MemoryDB key name which is currently the only metadata field that is returned from a similarity search, and is located in the `metadata['id']` field.

In [None]:
# Get the metadata fields from MemoryDB and insert them into the LangChain Document
def enrich_doc_from_mdb(_doc):
    metadata = get_metadata_from_mdb(_doc.metadata['id'])
    new_doc = Document(page_content=_doc.page_content, metadata=metadata)
    return new_doc

### Function to enrich a collection of LangChain Documents

Now that we have defined a function to enrich a LangChain Document (`enrich_doc_from_mdb`) by adding its metadata (`get_metadata_from_mdb`), we need a way to enrich **all** of the results of a vector search query, which is a collection of LangChain Documents. So we iterate through the results and enrich each document.

In [None]:
def enrich_documents(vss_response):
    # Each entry is a (Document, relevance_score) tuple
    new_doc_list = []
    for doc, score in vss_response:
        new_doc = enrich_doc_from_mdb(doc)
        new_doc_list.append((new_doc, score))
    return new_doc_list

## SECTION 5 - INITIAL MOVIE SEARCH

This section is primarily to provide an initial set of movies. It is not an actual recommendation (yet!).

### Verify Functionality

The next step verifies we are able to perform a vector search based on the search terms provided in the `query` field below, enrich the results, and then print those results.

We only print information about the first LangChain Document (`enriched_results[0][0]`) from the results. We explicitly print out the LangChain Document's `metadata` and `page_content` fields.

In [None]:
NUM_RESULTS = 10

query = "universe, planets, space, adventure"
results = perform_query(query, k=NUM_RESULTS)
enriched_results = enrich_documents(results)
print ('** METADATA:\n')
print (f'{enriched_results[0][0].metadata}\n')
print ('** PAGE CONTENT:\n')
print (f'{enriched_results[0][0].page_content}\n')

### Optional Filtering

MemoryDB provides the flexibility to filter search results.

Below you can see we are providing a LangChain `InMemoryDBTag` as a filter on the `genres` field. We are telling it to filter the results so that only movies that have the `Action` genre in the `genres` field are included in the final result.

Depending on the initial query you provided, the result below may differ from the result above.

In [None]:
filter_condition = InMemoryDBTag('genres') == 'Action'

filtered_results = perform_query(query, k=NUM_RESULTS, filter=filter_condition)
enriched_results = enrich_documents(filtered_results)
print ('** METADATA:\n')
print (f'{enriched_results[0][0].metadata}\n')
print ('** PAGE CONTENT:\n')
print (f'{enriched_results[0][0].page_content}\n')
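
Under the covers, a tag filter combined with a vector search is typically expressed as a single `FT.SEARCH` query string: a tag clause followed by a KNN clause. The sketch below only builds such a string to show its general shape. The helper `build_hybrid_query` is hypothetical, the parameter name `vector` is an arbitrary placeholder, and the exact query that LangChain generates may differ.

```python
def build_hybrid_query(tag_field, tag_value, vector_field, k, param_name="vector"):
    """Compose a tag-filtered KNN query string in FT.SEARCH syntax."""
    tag_clause = f"@{tag_field}:{{{tag_value}}}"
    knn_clause = f"[KNN {k} @{vector_field} ${param_name}]"
    return f"({tag_clause})=>{knn_clause}"


print(build_hybrid_query("genres", "Action", "content_vector", 10))
# (@genres:{Action})=>[KNN 10 @content_vector $vector]
```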

## SECTION 6 - READABILITY

The following two functions will improve readability by formatting results into a Markdown table.

### Function to display a markdown table with search results

In [None]:
def display_results(list_of_movies, ignore_movie_id=None):
    markdown = ""
    total_movies = len(list_of_movies)
    print(f"Total number of movies below: {total_movies}\n")
    markdown += "| Movie Name   | Movie Information |\n"
    markdown += "|--------------|-------------------|\n"
    for movie in list_of_movies:
        doc = movie[0]
        vss_score = movie[1]
        metadata = doc.metadata
        if 'id' in doc.metadata and doc.metadata['id'] != ignore_movie_id:
            markdown += get_cell_info_from_movie(movie, vss_score)
        else:
            continue
    display(Markdown(markdown))

### Function to populate a Markdown table with movie information

This is called once per movie by the function above.

In [None]:
def get_cell_info_from_movie(movie, relevance_score=0.0):
    doc = movie[0]
    metadata = doc.metadata
    # Fall back to an empty name cell when the movie has no name
    if metadata.get('movie_name'):
        movie_name_cell = f"|**{metadata['movie_name']}**|"
    else:
        movie_name_cell = "||"
    movie_data_cell = ""
    if 'plot' in metadata:
        plot_text = metadata['plot'][:MAX_PLOT_LENGTH] + '...'
        movie_data_cell += f"**Plot**: {plot_text}<p><p>"
    if 'actors' in metadata:
        movie_data_cell += f"**Actors**: {metadata['actors']}<p><p>"
    if 'id' in metadata:
        movie_data_cell += f"**MemoryDB keyname**: {metadata['id']}<p><p>"
    if 'genres' in metadata:
        movie_data_cell += f"**Genres**: {metadata['genres']}<p><p>"
    movie_data_cell += f"**Relevance Score**: {relevance_score}<p>"
    movie_data_cell += "|"
    return movie_name_cell + movie_data_cell + '\n'
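
One thing to watch for when building Markdown tables from free text: a literal `|` or a newline inside a plot summary would break the row. The following is a minimal sketch of a sanitizing step, assuming a hypothetical helper `to_table_cell` that is not part of the notebook:

```python
def to_table_cell(text, max_length=250):
    """Truncate text and escape characters that would break a Markdown table row."""
    shortened = text[:max_length] + ("..." if len(text) > max_length else "")
    return shortened.replace("|", "\\|").replace("\n", " ")


plot = "Good | evil\nclash."  # made-up plot text containing a pipe and a newline
row = f"|**Example Movie**|**Plot**: {to_table_cell(plot)}|"
print(row)  # |**Example Movie**|**Plot**: Good \| evil clash.|
```

Unlike the notebook's function, this sketch appends `...` only when the text was actually truncated.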

### Simplifying it

This function simplifies the process of performing a search, enriching the documents, and displaying the results.

In [None]:
def query_and_display_results(query, k):
    results = perform_query(query=query, k=k)
    enriched_documents = enrich_documents(results)
    display_results(enriched_documents)

### Now run the above function

Everything up to this point exists so that we can easily search for, and display, the results of a query. Let's get an initial list of 10 movies to review.

Notice that each row contains a `MemoryDB keyname`. This will be used in a follow-up step.

Modify the value of `k` to change the number of results returned.

In [None]:
query_and_display_results(query=query, k=10)

## SECTION 7 - ACTUAL RECOMMENDATIONS

Up until now we have searched on a collection of words that we manually provided. Now we want to provide actual recommendations to our user based on the movie they just watched. So let's create a function that does the following:

1. Takes a document ID (the unique ID of the movie our user just watched).
2. Gets the metadata for that movie.
3. Gets the `content` field from that metadata, which contains the combined text of the movie name, release date, genres, actors, and plot. This is the same text that was embedded into the `content_vector` field.
4. Performs a vector similarity search with that text, which is embedded again at query time. This provides a list of movies based on the similarity of all of these fields.
5. Enriches the movie information with the metadata stored in MemoryDB.
6. Displays the enriched data in a Markdown table (without displaying the original movie).
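
Because the movie the user just watched is removed from the displayed results (step 6), a search for `k` neighbors can yield only `k - 1` recommendations. Here is a minimal sketch of one way to compensate: request one extra result, drop the seed movie, then trim to `k`. The function `drop_seed` and the sample keys are hypothetical.

```python
def drop_seed(results, seed_id, k):
    """Remove the seed movie from (movie_id, score) results and keep at most k."""
    return [(movie_id, score) for movie_id, score in results if movie_id != seed_id][:k]


# Pretend these came back from a search that asked for k + 1 = 4 neighbors.
raw_results = [("doc:a", 1.0), ("doc:b", 0.91), ("doc:c", 0.87), ("doc:d", 0.80)]
print(drop_seed(raw_results, seed_id="doc:a", k=3))
# [('doc:b', 0.91), ('doc:c', 0.87), ('doc:d', 0.8)]
```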

**Why are we performing a vector search against the vector data of an existing record?** We do this because we are simulating a user who has just watched a movie that they like, and we are recommending similar movies to it. We are basing this on the fact that the user might like the movie plot, genres, actors and movie name.

In [None]:
def show_results_from_id(_doc_id, k=15):
    mdb_value = get_metadata_from_mdb(_doc_id)
    content = mdb_value['content']
    results = perform_query(content, k=k)
    enriched_documents = enrich_documents(results)
    display_results(enriched_documents, _doc_id)

### Testing recommendations

Now copy the `MemoryDB keyname:` value from one of the movies in the Markdown table above, and run this cell to see recommendations based on that movie!

Feel free to run this multiple times with different movie IDs (`MemoryDB keyname`) to see the results!

You can also modify the `k` value to change the number of results returned.

In [None]:
MOVIE_ID = ''

show_results_from_id(MOVIE_ID, k=10)

## SECTION 8 - SUMMARY

In this demo you used a movie dataset, created vector embeddings, stored them in MemoryDB, and then received search results based on search terms as well as **recommendations** based on a specific movie ID.

Click to learn more about [MemoryDB's vector search capabilities](https://docs.aws.amazon.com/memorydb/latest/devguide/vector-search-overview.html).