# Movie recommender example with Fireworks + MongoDB + Mistral E5 embedding model

## Introduction
In this tutorial, we'll explore how to create an advanced movie recommendation system. We'll leverage the Fireworks API for embedding generation, MongoDB for data storage and retrieval, and the Mistral E5 embedding model for nuanced understanding of movie data.

## Setting Up Your Environment
Before we dive into the code, make sure to set up your environment. This involves installing the `pymongo`, `openai`, and `tqdm` packages. Run the following command in your notebook to install them:

In [1]:
!pip install pymongo openai tqdm



## Initializing Fireworks and MongoDB Clients
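To interact with Fireworks and MongoDB, we need to initialize their respective clients. Fireworks exposes an OpenAI-compatible API, so we use the `openai` client pointed at the Fireworks base URL. Replace "YOUR FIREWORKS API KEY" and "YOUR MONGO URL" with your actual credentials.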

In [2]:
import openai
import pymongo

fw_client = openai.OpenAI(
  api_key="YOUR FIREWORKS API KEY",
  base_url="https://api.fireworks.ai/inference/v1"
)

client = pymongo.MongoClient("YOUR MONGO URL")
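
Hard-coding credentials in a notebook makes them easy to leak. Here is a minimal sketch of one alternative, assuming you export two environment variables before starting the notebook. The names `FIREWORKS_API_KEY` and `MONGO_URL` are hypothetical choices for illustration; neither Fireworks nor MongoDB requires them.

```python
import os


def get_required_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Hypothetical variable names; pick whatever fits your setup.
# fw_api_key = get_required_env("FIREWORKS_API_KEY")
# mongo_url = get_required_env("MONGO_URL")
```

The returned strings can then be passed as `api_key=` to `openai.OpenAI(...)` and as the connection string to `pymongo.MongoClient(...)` in place of the literals above.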

## Understanding the E5 Mistral Model

The E5 Mistral model, specifically the `intfloat/e5-mistral-7b-instruct` variant, is a text embedding model built on top of the Mistral 7B language model. It has 32 layers and an embedding size of 4096, making it well-suited for complex embedding tasks. At the time of writing, it ranked among the top models on the Hugging Face MTEB embedding leaderboard.

### Dynamic Adaptation with Instructions
A unique feature of E5 Mistral is its ability to adapt to different tasks through natural language instructions in queries. This allows the model to be customized for various scenarios without needing separate models or extensive retraining.

### Specialization in English
While E5 Mistral is fine-tuned on multilingual datasets, it is primarily recommended for English due to its predominant training on English data. This makes it particularly effective for English-language tasks, such as our movie recommendation system.

### Application in Movie Recommendations
In our system, we use E5 Mistral's capability to understand context through one-sentence instructions in queries. This feature enables us to generate more accurate and contextually relevant embeddings for movie recommendations.

For more details, you can refer to the [E5 Mistral model page on Hugging Face](https://huggingface.co/intfloat/e5-mistral-7b-instruct).
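
The instruction format described above can be sketched as a small helper. It follows the `Instruct: ...\nQuery: ...` template used for the query later in this tutorial. The function name `build_instructed_query` is hypothetical and only for illustration.

```python
def build_instructed_query(task_description: str, query: str) -> str:
    """Prepend a one-sentence task instruction to a query, in the E5 Mistral query format."""
    return f"Instruct: {task_description}\nQuery: {query}"


task = "Given a user query for movies, retrieve the relevant movie that can fulfill the query."
print(build_instructed_query(task, "I love superhero movies, any recommendations?"))
```

Only the query gets the instruction. The movie documents in this tutorial are embedded without any prefix.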

## Embedding Generation Function
The core of our recommender system is embedding generation. We'll use the Mistral E5 model to create embeddings from text data. The function `generate_embeddings` takes a list of texts and returns embeddings whose dimensionality has been halved from 4096 to 2048 by averaging adjacent elements.

In [3]:
from typing import List

# Average adjacent pairs of elements to reduce the dimensionality from 4096 to 2048,
# so that the vectors fit into the MongoDB vector index.

def generate_embeddings(input_texts: List[str], model_api_string: str, prefix: str = "") -> List[List[float]]:
    """Generate embeddings with the Fireworks API and reduce their size by averaging adjacent elements.

    Args:
        input_texts: a list of string input texts.
        model_api_string: str. An API string for a specific embedding model of your choice.
        prefix: str. Optional text (for example, a task instruction) prepended to every input text.

    Returns:
        reduced_embeddings_list: a list of reduced-size embeddings. Each element corresponds to each input text.
    """
    if prefix:
        input_texts = [prefix + text for text in input_texts] 
        print("show updated input texts", input_texts)
    outputs = fw_client.embeddings.create(
        input=input_texts,
        model=model_api_string,
    )

    def reduce_embedding_size(embedding: List[float]) -> List[float]:
        # Average every adjacent pair of elements in the embedding
        return [(embedding[i] + embedding[i + 1]) / 2 for i in range(0, len(embedding), 2)]

    # Apply the size reduction to each embedding
    reduced_embeddings_list = [reduce_embedding_size(x.embedding) for x in outputs.data]

    return reduced_embeddings_list
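
To make the pairwise averaging concrete, here is a small standalone sketch on a toy vector. The even-length check and the `l2_normalize` helper are additions for illustration and are not part of the function above. Averaging generally changes a vector's length, and MongoDB's documentation recommends unit-length vectors when using `dotProduct` similarity, so re-normalizing after the reduction may be worth considering.

```python
import math
from typing import List


def average_adjacent_pairs(embedding: List[float]) -> List[float]:
    """Halve the dimensionality by averaging elements (0, 1), (2, 3), and so on.

    Assumes an even number of elements, which holds for 4096-dimensional vectors.
    """
    if len(embedding) % 2 != 0:
        raise ValueError("embedding length must be even")
    return [(embedding[i] + embedding[i + 1]) / 2 for i in range(0, len(embedding), 2)]


def l2_normalize(embedding: List[float]) -> List[float]:
    """Scale a vector to unit length (optional extra step, not in the original function)."""
    norm = math.sqrt(sum(x * x for x in embedding))
    if norm == 0:
        return list(embedding)
    return [x / norm for x in embedding]


toy = [1.0, 3.0, 2.0, 6.0]
reduced = average_adjacent_pairs(toy)
print(reduced)  # [2.0, 4.0]
print(l2_normalize(reduced))
```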


## Data Processing
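Now, let's process our movie data. We'll read movies from the `sample_mflix.movies` collection (part of the MongoDB Atlas sample dataset), build a text description of each one from its key fields, and store the resulting embedding back on the document. `NUM_DOC_LIMIT` caps the number of documents processed.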

In [4]:
embedding_model_string = 'intfloat/e5-mistral-7b-instruct' # model API string from Fireworks.
vector_database_field_name = 'embedding_2k_movies_fw_e5_mistral' # define your embedding field name.
NUM_DOC_LIMIT = 400 # the number of documents you will process and generate embeddings.

sample_output = generate_embeddings(["This is a test."], embedding_model_string)
print(f"Embedding size is: {str(len(sample_output[0]))}")


Embedding size is: 2048


In [5]:
from tqdm import tqdm
from datetime import datetime

db = client.sample_mflix
collection = db.movies

keys_to_extract = ["plot", "genres", "cast", "title", "fullplot", "countries", "directors"]
for doc in tqdm(collection.find(
  {
    "fullplot":{"$exists": True},
    "released": { "$gt": datetime(2000, 1, 1, 0, 0, 0)},
  }
).limit(NUM_DOC_LIMIT), desc="Document Processing "):
  extracted_str = "\n".join([k + ": " + str(doc[k]) for k in keys_to_extract if k in doc])
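  if vector_database_field_name not in doc:
    # Only write when the embedding is missing; $set updates this one field instead of replacing the whole document.
    embedding = generate_embeddings([extracted_str], embedding_model_string)[0]
    collection.update_one({'_id': doc['_id']}, {"$set": {vector_database_field_name: embedding}})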


Document Processing : 0it [00:00, ?it/s]

Document Processing : 400it [00:14, 28.51it/s]
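
The loop above sends one text per request. Because `generate_embeddings` accepts a list of texts, several documents could be embedded per call. The following is a minimal sketch of that batching idea, not the approach used in this notebook. The helper names and the batch size of 16 are arbitrary assumptions, and `fake_embed` is a stand-in so the example runs without an API key.

```python
from typing import Callable, Iterator, List


def chunked(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: List[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 16,
) -> List[List[float]]:
    """Call `embed_fn` once per batch and return the embeddings in input order."""
    embeddings: List[List[float]] = []
    for batch in chunked(texts, batch_size):
        embeddings.extend(embed_fn(batch))
    return embeddings


def fake_embed(batch: List[str]) -> List[List[float]]:
    """Stand-in embedder for illustration only: a one-dimensional 'embedding' equal to the text length."""
    return [[float(len(text))] for text in batch]


print(embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2))  # [[1.0], [2.0], [3.0]]
```

In the notebook, `embed_fn` could be `lambda batch: generate_embeddings(batch, embedding_model_string)`. Check the API's request limits before choosing a batch size.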


## Setting Up the Search Index
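For our system to efficiently search through movie embeddings, we need to create an Atlas Vector Search index on the `movies` collection. Create an index named `movie_index` (the name the query in the next section refers to) with the definition below. Note that `numDimensions` must match the reduced embedding size of 2048.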

In [6]:
"""
{
  "fields": [
    {
      "type": "vector",
      "path": "embedding_2k_movies_fw_e5_mistral",
      "numDimensions": 2048,
      "similarity": "dotProduct"
    }
  ]
}

"""


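The same index definition can also be built in Python. The sketch below assumes PyMongo 4.7 or later (where `SearchIndexModel` accepts `type="vectorSearch"`) and an Atlas cluster that supports vector search. The helper names `build_vector_index_definition` and `create_movie_index` are hypothetical. Only the definition is built and printed here; `create_movie_index` is defined but not called, since it needs a live cluster.

```python
import json
from typing import Any, Dict


def build_vector_index_definition(
    path: str, num_dimensions: int, similarity: str = "dotProduct"
) -> Dict[str, Any]:
    """Build a vector search index definition with a single vector field."""
    if similarity not in {"euclidean", "cosine", "dotProduct"}:
        raise ValueError(f"unsupported similarity: {similarity}")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


definition = build_vector_index_definition("embedding_2k_movies_fw_e5_mistral", 2048)
print(json.dumps(definition, indent=2))


def create_movie_index(collection, definition: Dict[str, Any], name: str = "movie_index"):
    """Create the index on a PyMongo collection (assumes PyMongo 4.7+ and Atlas)."""
    # Imported lazily so the sketch above runs even without pymongo installed.
    from pymongo.operations import SearchIndexModel

    model = SearchIndexModel(definition=definition, name=name, type="vectorSearch")
    return collection.create_search_index(model=model)
```

The index is built asynchronously, so queries may return no results until Atlas reports it as ready.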
## Querying the Recommender System
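Let's test our recommender system. We embed a query asking for superhero movies and retrieve the closest matches with a vector search. The user's preference to exclude Spider-Man movies is applied later, when we generate the final recommendation.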

In [7]:
# Example query.
query = "I love superhero movies, any recommendations?"
prefix="Instruct: Given a user query for movies, retrieve the relevant movie that can fulfill the query.\nQuery: "
query_emb = generate_embeddings([query], embedding_model_string, prefix=prefix)[0]

results = collection.aggregate([
  {
    "$vectorSearch": {
      "queryVector": query_emb,
      "path": vector_database_field_name,
      "numCandidates": 100, # this should be 10-20x the limit
      "limit": 10, # the number of documents to return in the results
      "index": "movie_index", # the name of the search index created above.
    }
  }
])
# Key the results by title. Movies that share a title collapse into one entry,
# so fewer than `limit` titles may be listed.
results_as_dict = {doc['title']: doc for doc in results}

print(f"From your query \"{query}\", the following movie listings were found:\n")
print("\n".join([str(i+1) + ". " + name for (i, name) in enumerate(results_as_dict.keys())]))


show updated input texts ['Instruct: Given a user query for movies, retrieve the relevant movie that can fulfill the query.\nQuery: I love superhero movies, any recommendations?']
From your query "I love superhero movies, any recommendations?", the following movie listings were found:

1. Spider-Man
2. Fantastic Four
3. Iron Monkey
4. Gladiator
5. X-Men
6. Final Fantasy: The Spirits Within
7. Akira
8. Crouching Tiger, Hidden Dragon
9. Titan A.E.
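
To see how close each match is, the pipeline can project the similarity score with `{"$meta": "vectorSearchScore"}`. The sketch below only builds the pipeline as plain Python data, so it runs without a database. The function name, the `candidates_per_result` parameter, and the projected fields are illustrative assumptions.

```python
from typing import Any, Dict, List


def build_vector_search_pipeline(
    query_vector: List[float],
    path: str,
    index_name: str,
    limit: int = 10,
    candidates_per_result: int = 10,
) -> List[Dict[str, Any]]:
    """Build a $vectorSearch pipeline that also returns each match's similarity score."""
    return [
        {
            "$vectorSearch": {
                "queryVector": query_vector,
                "path": path,
                # Consider more candidates than results, following the 10-20x rule of thumb.
                "numCandidates": limit * candidates_per_result,
                "limit": limit,
                "index": index_name,
            }
        },
        {
            "$project": {
                "_id": 0,
                "title": 1,
                "plot": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


pipeline = build_vector_search_pipeline([0.1, 0.2], "embedding_2k_movies_fw_e5_mistral", "movie_index")
print(pipeline[0]["$vectorSearch"]["numCandidates"])  # 100
```

Passing the result to `collection.aggregate(...)` would return only `title`, `plot`, and `score`. Add more fields to the `$project` stage if later steps need them.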


## Generating Recommendations
Finally, we use Fireworks' chat API to generate a personalized movie recommendation based on the user's query and preferences.



In [8]:
your_task_prompt = (
    "From the given movie listing data, choose a great movie recommendation for superhero movies. "
    "I don't like spider man though. "
    "Tell me the name of the movie and why it works for me."
)

listing_data = ""
for doc in results_as_dict.values():
  listing_data += f"Movie title: {doc['title']}\n"
  for (k, v) in doc.items():
    # Keep only the descriptive fields and never include the embedding vectors in the prompt.
    if k not in keys_to_extract or "embedding" in k:
      continue
    listing_data += k + ": " + str(v) + "\n"
  listing_data += "\n"

augmented_prompt = (
    "movie listing data:\n"
    f"{listing_data}\n\n"
    f"{your_task_prompt}"
)
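
The prompt assembly above can also be wrapped in a reusable function. This is a minimal sketch under a few assumptions: the name `build_augmented_prompt` is hypothetical, the sample document is made up for illustration, and unlike the cell above it lists the title only once.

```python
from typing import Any, Dict, Iterable, List


def build_augmented_prompt(
    docs: Iterable[Dict[str, Any]], task_prompt: str, keys: List[str]
) -> str:
    """Format the selected fields of each movie, then append the task prompt."""
    sections = []
    for doc in docs:
        lines = [f"Movie title: {doc['title']}"]
        # Only whitelisted keys are included, so embedding fields never reach the prompt.
        lines += [f"{k}: {doc[k]}" for k in keys if k in doc and k != "title"]
        sections.append("\n".join(lines))
    listing = "\n\n".join(sections)
    return f"movie listing data:\n{listing}\n\n{task_prompt}"


# Made-up sample document, for illustration only.
sample_docs = [{"title": "X-Men", "plot": "Mutants clash.", "embedding_2k": [0.1, 0.2]}]
print(build_augmented_prompt(sample_docs, "Pick one.", ["plot", "title"]))
```

In the notebook, this could be called as `build_augmented_prompt(results_as_dict.values(), your_task_prompt, keys_to_extract)`.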


In [9]:
response = fw_client.chat.completions.create(
  messages=[{"role": "user", "content": augmented_prompt}],
  model="accounts/fireworks/models/mixtral-8x7b-instruct",
)

print(response.choices[0].message.content)


Based on your preference to exclude Spider-Man movies and my commitment to provide a helpful, secure, and positive response, I recommend "X-Men." This movie combines action, drama, and science fiction elements, creating an engaging superhero story.

"X-Men" features a variety of characters with unique superpowers, which adds excitement and unpredictability to the plot. The central conflict between mutants and humans, along with the internal struggles within the mutant community, offers depth and intrigue.

Additionally, the movie has a strong cast, including Hugh Jackman, Patrick Stewart, and Ian McKellen, ensuring high-quality performances. "X-Men" is a fantastic choice for those who enjoy superhero movies without focusing on Spider-Man.


## Conclusion
And that's it! You've successfully built a movie recommendation system using Fireworks, MongoDB, and the Mistral E5 embedding model. This system can be further customized and scaled to suit various needs.

