# Movie recommender example with Fireworks + MongoDB + Nomic embedding model

## Introduction
In this tutorial, we'll build a movie recommendation system using retrieval-augmented generation. We'll use the Fireworks API both to generate embeddings with the Nomic embedding model and to produce the final recommendation with a chat model, and MongoDB to store the movie data and run vector search over it.

## Setting Up Your Environment
Before we dive into the code, set up your environment by installing the pymongo, openai, and tqdm packages. Run the following command in your notebook:

In [1]:
!pip install pymongo openai tqdm


## Initializing Fireworks and MongoDB Clients
To interact with Fireworks and MongoDB, we need to initialize their respective clients. The cells below prompt for your MongoDB connection string and your Fireworks API key at run time, so you don't have to hard-code either credential in the notebook.

In [4]:
import openai
import pymongo

mongo_url = input()
client = pymongo.MongoClient(mongo_url)

In [7]:
fw_client = openai.OpenAI(
  api_key=input(),
  base_url="https://api.fireworks.ai/inference/v1"
)
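
If you would rather not type credentials into a prompt on every run, one option is to read them from environment variables and fall back to a hidden prompt. The sketch below is only an illustration: `read_secret` is a hypothetical helper, and the variable names `MONGODB_URI` and `FIREWORKS_API_KEY` are assumptions, not names required by either service.

```python
import os
from getpass import getpass


def read_secret(env_name: str, prompt: str) -> str:
    """Return the value of the environment variable, or ask for it without echoing."""
    value = os.environ.get(env_name)
    if value:
        return value
    return getpass(prompt)


# Assumed variable names; pick whatever fits your setup.
# mongo_url = read_secret("MONGODB_URI", "MongoDB connection string: ")
# fireworks_api_key = read_secret("FIREWORKS_API_KEY", "Fireworks API key: ")
```

The returned strings can then be passed to `pymongo.MongoClient(...)` and `openai.OpenAI(api_key=...)` in place of the `input()` calls.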

## Understanding the Nomic-ai 1.5 Model

The Nomic AI model, specifically the `nomic-ai/nomic-embed-text-v1.5` variant, is an open source embedding model that supports Matryoshka Representation Learning. This means you can shrink the embedding dimension from 768 all the way down to 64 and pick the trade-off between retrieval quality and storage size that you need.
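
To make the Matryoshka idea concrete, here is a minimal sketch of shrinking an embedding on the client side. It assumes the procedure described in the Nomic model card as we understand it: layer-normalize the full vector, keep the first `dim` components, then rescale to unit length. `truncate_embedding` is a hypothetical helper and is not used elsewhere in this tutorial, which keeps the full 768 dimensions.

```python
import math
from typing import List


def truncate_embedding(embedding: List[float], dim: int) -> List[float]:
    """Layer-normalize, keep the first `dim` components, and rescale to unit length."""
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and the embedding length")
    mean = sum(embedding) / len(embedding)
    variance = sum((x - mean) ** 2 for x in embedding) / len(embedding)
    # Layer normalization over the whole vector (small epsilon for stability).
    normed = [(x - mean) / math.sqrt(variance + 1e-5) for x in embedding]
    head = normed[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]


# Toy 8-dimensional vector standing in for a real 768-dimensional embedding.
toy = [0.5, -0.25, 0.1, 0.8, -0.6, 0.3, 0.0, -0.9]
small = truncate_embedding(toy, 4)
print(len(small))
```

If you do store truncated vectors, the `numDimensions` value in the search index has to match the truncated size, and queries must be truncated the same way.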

## Embedding Generation Function
The core of our recommender system is embedding generation. We'll use the Nomic embedding model to create embeddings from text data. The function `generate_embeddings` takes a list of texts, optionally prepends a prefix to each one, and returns one embedding per text.

In [15]:
from typing import List

def generate_embeddings(input_texts: List[str], model_api_string: str, prefix="") -> List[List[float]]:
    """Generate embeddings for a list of texts using the Fireworks embeddings API.

    Args:
        input_texts: a list of string input texts.
        model_api_string: str. An API string for a specific embedding model of your choice.
        prefix: optional string prepended to every input text before embedding.

    Returns:
        A list of embeddings, one per input text, in the same order as the inputs.
    """
    if prefix:
        input_texts = [prefix + text for text in input_texts] 
        print("show updated input texts", input_texts)
    return [x.embedding for x in 
        fw_client.embeddings.create(
        input=input_texts,
        model=model_api_string,
    ).data]
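
Because `generate_embeddings` accepts a list, several texts can be embedded in one request. The sketch below shows one way to split a long list into batches. `chunked` and `embed_in_batches` are hypothetical helpers, and the default batch size of 32 is an arbitrary assumption, not a documented Fireworks limit.

```python
from typing import Callable, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    texts: List[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 32,
) -> List[List[float]]:
    """Call `embed_fn` once per batch and concatenate the results in input order."""
    embeddings: List[List[float]] = []
    for batch in chunked(texts, batch_size):
        embeddings.extend(embed_fn(batch))
    return embeddings
```

In this notebook, `embed_fn` could be something like `lambda batch: generate_embeddings(batch, embedding_model_string)`.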

## Data Processing
Now, let's process our movie data. We'll extract key information from our MongoDB collection and generate embeddings for each movie. Ensure NUM_DOC_LIMIT is set to limit the number of documents processed.

In [21]:
embedding_model_string = 'nomic-ai/nomic-embed-text-v1.5' # model API string from Fireworks.
vector_database_field_name = 'embedding_2k_movies_fw_nomic_1_5' # define your embedding field name.
NUM_DOC_LIMIT = 400 # the number of documents you will process and generate embeddings.

sample_output = generate_embeddings(["This is a test."], embedding_model_string)
print(f"Embedding size is: {str(len(sample_output[0]))}")


Embedding size is: 768


In [17]:
from tqdm import tqdm
from datetime import datetime

db = client.sample_mflix
collection = db.movies

# Note: the sample_mflix movies collection stores genres under "genres" (plural).
keys_to_extract = ["plot", "genres", "cast", "title", "fullplot", "countries", "directors"]
for doc in tqdm(collection.find(
  {
    "fullplot":{"$exists": True},
    "released": { "$gt": datetime(2000, 1, 1, 0, 0, 0)},
  }
).limit(NUM_DOC_LIMIT), desc="Document Processing "):
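  # Skip documents that already carry an embedding from an earlier run.
  if vector_database_field_name in doc:
    continue
  extracted_str = "\n".join([k + ": " + str(doc[k]) for k in keys_to_extract if k in doc])
  embedding = generate_embeddings([extracted_str], embedding_model_string)[0]
  # Write only the new field instead of replacing the whole document.
  collection.update_one({'_id': doc['_id']}, {'$set': {vector_database_field_name: embedding}})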


Document Processing : 400it [00:35, 11.24it/s]
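
To see what text is actually sent to the embedding model for each movie, here is a small standalone sketch of the same join logic. `build_document_text` is a hypothetical helper and the sample document is invented for illustration; keys that are missing from a document are simply skipped.

```python
from typing import Any, Dict, List


def build_document_text(doc: Dict[str, Any], keys: List[str]) -> str:
    """Join the selected fields as 'key: value' lines, skipping missing keys."""
    return "\n".join(f"{k}: {doc[k]}" for k in keys if k in doc)


# Invented document, shaped loosely like a movie record.
sample_doc = {
    "title": "Example Movie",
    "cast": ["Actor A", "Actor B"],
    "plot": "A hero saves the day.",
}
text = build_document_text(sample_doc, ["plot", "cast", "title", "fullplot"])
print(text)
# plot: A hero saves the day.
# cast: ['Actor A', 'Actor B']
# title: Example Movie
```

List-valued fields are rendered with Python's default `str()` formatting, which is what the processing loop above does as well.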


## Setting Up the Search Index
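For our system to efficiently search through movie embeddings, we need to create an Atlas Vector Search index on the `movies` collection in MongoDB. The query in the next section expects the index to be named `movie_index`, and `numDimensions` must match the embedding size printed above (768). Define the index structure as shown: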

In [6]:
"""
{
  "fields": [
    {
      "type": "vector",
      "path": "embedding_2k_movies_fw_nomic_1_5",
      "numDimensions": 768,
      "similarity": "dotProduct"
    }
  ]
}

"""

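
If you prefer to keep the index definition next to the code that depends on it, the following sketch builds the same JSON from Python variables so the field path and dimension count cannot drift apart. `vector_index_definition` is a hypothetical helper; the three similarity names are the options we understand Atlas Vector Search to accept.

```python
import json
from typing import Any, Dict


def vector_index_definition(
    path: str, num_dimensions: int, similarity: str = "dotProduct"
) -> Dict[str, Any]:
    """Build an Atlas Vector Search index definition for a single vector field."""
    allowed = {"euclidean", "cosine", "dotProduct"}
    if similarity not in allowed:
        raise ValueError(f"similarity must be one of {sorted(allowed)}")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


definition = vector_index_definition("embedding_2k_movies_fw_nomic_1_5", 768)
# Paste the printed JSON into the Atlas index editor.
print(json.dumps(definition, indent=2))
```

Recent PyMongo releases also provide `Collection.create_search_index`, which may let you submit a definition like this from code; check the PyMongo documentation for the version you have installed.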

## Querying the Recommender System
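Let's test our recommender system. We embed a query asking for superhero movies and use vector search to retrieve the ten closest movies. The user's preference to avoid Spider-Man is applied later, when the chat model picks a recommendation from these results.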

In [22]:
# Example query.
query = "I love superhero movies, any recommendations?"
prefix="Instruct: Given a user query for movies, retrieve the relevant movie that can fulfill the query.\nQuery: "
query_emb = generate_embeddings([query], embedding_model_string, prefix=prefix)[0]

results = collection.aggregate([
  {
    "$vectorSearch": {
      "queryVector": query_emb,
      "path": vector_database_field_name,
      "numCandidates": 100, # this should be 10-20x the limit
      "limit": 10, # the number of documents to return in the results
      "index": 'movie_index', # the name you gave the vector search index in the previous section
    }
  }
])
results_as_dict = {doc['title']: doc for doc in results}

print(f"From your query \"{query}\", the following movie listings were found:\n")
print("\n".join([str(i+1) + ". " + name for (i, name) in enumerate(results_as_dict.keys())]))


show updated input texts ['Instruct: Given a user query for movies, retrieve the relevant movie that can fulfill the query.\nQuery: I love superhero movies, any recommendations?']
From your query "I love superhero movies, any recommendations?", the following movie listings were found:

1. Spider-Man
2. Lara Croft: Tomb Raider
3. Monkeybone
4. Fantastic Four
5. Titan A.E.
6. Sinbad: Legend of the Seven Seas
7. Charlie's Angels
8. X-Men
9. Alaska: Spirit of the Wild
10. Planet of the Apes
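
When tuning retrieval it can help to see how close each match is. The sketch below builds a variant of the pipeline that adds a `$project` stage exposing the similarity score through `{"$meta": "vectorSearchScore"}`. `build_vector_search_pipeline` is a hypothetical helper, and the default of 10 candidates per returned result simply mirrors the rule of thumb in the code comment above.

```python
from typing import Any, Dict, List


def build_vector_search_pipeline(
    query_vector: List[float],
    path: str,
    index_name: str,
    limit: int = 10,
    candidates_per_result: int = 10,
) -> List[Dict[str, Any]]:
    """Return a $vectorSearch pipeline that also projects the similarity score."""
    return [
        {
            "$vectorSearch": {
                "queryVector": query_vector,
                "path": path,
                "numCandidates": limit * candidates_per_result,
                "limit": limit,
                "index": index_name,
            }
        },
        {
            "$project": {
                "_id": 0,
                "title": 1,
                "plot": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
```

Passing the result to `collection.aggregate(...)` would return only `title`, `plot`, and `score`, so this variant is meant for inspecting matches; the recommendation step below relies on the full documents.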


## Generating Recommendations
Finally, we use Fireworks' chat API to generate a personalized movie recommendation based on the user's query and preferences.



In [23]:
your_task_prompt = (
    "From the given movie listing data, choose a great movie recommendation for superhero movies. "
    "I don't like spider man though. "
    "Tell me the name of the movie and why it works for me."
)

listing_data = ""
for doc in results_as_dict.values():
  listing_data += f"Movie title: {doc['title']}\n"
  for (k, v) in doc.items():
    if not(k in keys_to_extract) or ("embedding" in k): continue
    if k == "name": continue
    listing_data += k + ": " + str(v) + "\n"
  listing_data += "\n"

augmented_prompt = (
    "movie listing data:\n"
    f"{listing_data}\n\n"
    f"{your_task_prompt}"
)


In [24]:
response = fw_client.chat.completions.create(
  messages=[{"role": "user", "content": augmented_prompt}],
  model="accounts/fireworks/models/mixtral-8x7b-instruct",
)

print(response.choices[0].message.content)


Based on your preference to exclude Spider-Man movies, I would recommend "X-Men." This movie is a great choice for superhero fans as it features a team of mutants with unique abilities who fight to protect humanity from a dangerous terrorist organization. The film features impressive special effects, engaging action sequences, and well-developed characters, making it an exciting and entertaining viewing experience. Additionally, the themes of acceptance and prejudice add depth to the story, making it a great pick for those who enjoy thought-provoking superhero movies.
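
One possible refinement is to separate standing instructions from the user's request by using a system message. The sketch below is an assumption about how you might structure the prompt, not what the cells above do: `build_messages` is a hypothetical helper and the system prompt wording is invented for illustration.

```python
from typing import Dict, List


def build_messages(listing_data: str, task_prompt: str) -> List[Dict[str, str]]:
    """Build a chat message list with a system instruction and the augmented user prompt."""
    system = (
        "You are a movie recommendation assistant. "
        "Recommend only titles that appear in the provided movie listing data."
    )
    user = f"movie listing data:\n{listing_data}\n\n{task_prompt}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
```

The returned list could be passed as the `messages` argument of `fw_client.chat.completions.create(...)`. Whether the chosen model gives system messages special weight depends on the model, so compare the outputs before adopting this.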


## Conclusion
And that's it! You've successfully built a movie recommendation system using Fireworks, MongoDB, and the Nomic embedding model. This system can be further customized and scaled to suit various needs.

