MongoDB, a widely adopted NoSQL document database, stores and retrieves the embeddings and their associated text data in this RAG system. Its flexible document model and scalability make it well suited to managing large text corpora alongside their embeddings.

In a RAG system integrated with MongoDB, the embeddings and text data are stored in a MongoDB collection. Each document in the collection represents a text passage or document, and its embedding is stored within the same document as an array of numbers. This layout enables efficient nearest neighbour search through MongoDB Atlas Vector Search, which the retrieval step of the RAG system depends on.

During the retrieval phase, the input text is encoded into an embedding, and the Atlas Vector Search aggregation stage ($vectorSearch) is used to find the nearest neighbour embeddings in the database, effectively retrieving the most relevant documents or passages.
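
To make the idea of nearest neighbour retrieval concrete, here is a minimal brute-force sketch in plain Python. It is not how MongoDB performs the search internally; the toy documents, their 3-dimensional embeddings, and the helper names `cosine_similarity` and `nearest_neighbours` are all made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbours(query_embedding, documents, k=2):
    """Return the k documents whose embeddings are most similar to the query."""
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_embedding, doc["embedding"]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings (hypothetical; real embeddings have many more dimensions)
documents = [
    {"title": "Space Voyage", "embedding": [0.9, 0.1, 0.0]},
    {"title": "Courtroom Drama", "embedding": [0.0, 0.2, 0.9]},
    {"title": "Alien Frontier", "embedding": [0.8, 0.3, 0.1]},
]

query_embedding = [1.0, 0.0, 0.0]
top = nearest_neighbours(query_embedding, documents, k=2)
print([doc["title"] for doc in top])  # ['Space Voyage', 'Alien Frontier']
```

A database-side vector index serves the same purpose without scanning every document, which matters once the collection grows beyond a handful of entries.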

In [52]:
# !pip install datasets pandas openai pymongo

In [1]:
from datasets import load_dataset
import pandas as pd



In [9]:
# load_dataset does not accept a 'limit' argument, so load the full dataset
dataset = load_dataset("MongoDB/embedded_movies")

# Convert the train split to a pandas dataframe
df = pd.DataFrame(dataset['train'])

df.head(5)

In [10]:
# Print a quick overview: column names, shape, summary statistics, and null counts
for summary in (df.columns, df.shape, df.describe(), df.isnull().sum()):
    print(summary, '\n')

Index(['cast', 'languages', 'fullplot', 'plot', 'writers', 'runtime',
       'directors', 'awards', 'imdb', 'type', 'metacritic',
       'num_mflix_comments', 'rated', 'poster', 'title', 'genres', 'countries',
       'plot_embedding_optimised'],
      dtype='object') 

(1473, 18) 

           runtime  metacritic  num_mflix_comments
count  1459.000000  570.000000         1473.000000
mean    112.208362   51.691228            6.180584
std      42.406701   16.854471           27.616895
min       6.000000    9.000000            0.000000
25%      96.000000   40.000000            0.000000
50%     106.000000   51.000000            0.000000
75%     121.000000   63.000000            1.000000
max    1256.000000   97.000000          158.000000 

cast                          1
languages                     1
fullplot                     21
plot                          0
writers                      13
runtime                      14
directors                    13
awards                        0


In [11]:
df = df.dropna(subset=['plot'])
# Drop any existing 'plot_embedding' column; errors='ignore' avoids a KeyError when it is absent
df = df.drop(columns=['plot_embedding'], errors='ignore')

df

In [54]:
import os

from openai import OpenAI

# Check for the API key first, so a missing key produces a clear error message
api_key = os.getenv("OPENAI_API_KEY")

if api_key is None:
    raise Exception("API key is required. Set OPENAI_API_KEY environment variable.")

client = OpenAI(api_key=api_key)


In [55]:
def get_embedding(text):
    """Generate an embedding for the given text using OpenAI's API."""

    # Check for valid input
    if not text or not isinstance(text, str):
        return None

    try:
        # Call OpenAI API to get the embedding
        embedding = client.embeddings.create(input=text, model="text-embedding-3-small").data[0].embedding
        return embedding
    except Exception as e:
        print(f"Error in get_embedding: {e}")
        return None

df["plot_embedding_optimised"] = df['plot'].apply(get_embedding)

df.head(5)

Unnamed: 0,cast,languages,fullplot,plot,writers,runtime,directors,awards,imdb,type,metacritic,num_mflix_comments,rated,poster,title,genres,countries,plot_embedding_optimised
0,"[Pearl White, Crane Wilbur, Paul Panzer, Edwar...",[English],Young Pauline is left a lot of money when her ...,Young Pauline is left a lot of money when her ...,"[Charles W. Goddard (screenplay), Basil Dickey...",199.0,"[Louis J. Gasnier, Donald MacKenzie]","{'nominations': 0, 'text': '1 win.', 'wins': 1}","{'id': 4465, 'rating': 7.6, 'votes': 744}",movie,,0,,https://m.media-amazon.com/images/M/MV5BMzgxOD...,The Perils of Pauline,[Action],[USA],"[0.015452137216925621, -0.0037907666992396116,..."
1,"[Harold Lloyd, Mildred Davis, 'Snub' Pollard, ...",[English],As a penniless man worries about how he will m...,A penniless young man tries to save an heiress...,[H.M. Walker (titles)],22.0,"[Alfred J. Goulding, Hal Roach]","{'nominations': 1, 'text': '1 nomination.', 'w...","{'id': 10146, 'rating': 7.0, 'votes': 639}",movie,,0,TV-G,https://m.media-amazon.com/images/M/MV5BNzE1OW...,From Hand to Mouth,"[Comedy, Short, Action]",[USA],"[-0.024403313174843788, 0.009798603132367134, ..."
2,"[Ronald Colman, Neil Hamilton, Ralph Forbes, A...",[English],"Michael ""Beau"" Geste leaves England in disgrac...","Michael ""Beau"" Geste leaves England in disgrac...","[Herbert Brenon (adaptation), John Russell (ad...",101.0,[Herbert Brenon],"{'nominations': 0, 'text': '1 win.', 'wins': 1}","{'id': 16634, 'rating': 6.9, 'votes': 222}",movie,,0,,,Beau Geste,"[Action, Adventure, Drama]",[USA],"[-0.031453937292099, 0.07588011771440506, 0.03..."
3,"[Billie Dove, Tempe Pigott, Donald Crisp, Sam ...",,A nobleman vows to avenge the death of his fat...,"Seeking revenge, an athletic young man joins t...","[Douglas Fairbanks (story), Jack Cunningham (a...",88.0,[Albert Parker],"{'nominations': 0, 'text': '1 win.', 'wins': 1}","{'id': 16654, 'rating': 7.2, 'votes': 1146}",movie,,1,,https://m.media-amazon.com/images/M/MV5BMzU0ND...,The Black Pirate,"[Adventure, Action]",[USA],"[0.0217618215829134, 0.06848196685314178, 0.04..."
4,"[Harold Lloyd, Jobyna Ralston, Noah Young, Jim...",[English],"The Uptown Boy, J. Harold Manners (Lloyd) is a...",An irresponsible young millionaire changes his...,"[Ted Wilde (story), John Grey (story), Clyde B...",58.0,[Sam Taylor],"{'nominations': 1, 'text': '1 nomination.', 'w...","{'id': 16895, 'rating': 7.6, 'votes': 918}",movie,,0,PASSED,https://m.media-amazon.com/images/M/MV5BMTcxMT...,For Heaven's Sake,"[Action, Comedy, Romance]",[USA],"[0.008152924478054047, -0.0193649735301733, -0..."


In [59]:
import os

from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi

# Read the connection string from the environment instead of hard-coding credentials
# (MONGODB_URI is an assumed variable name; use whatever your environment defines)
uri = os.environ["MONGODB_URI"]

# Create a new client and connect to the server
mongo_client = MongoClient(uri, server_api=ServerApi('1'))

# Send a ping to confirm a successful connection
try:
    mongo_client.admin.command('ping')
    print("Pinged your deployment. You successfully connected to MongoDB!")
except Exception as e:
    print(e)

Pinged your deployment. You successfully connected to MongoDB!


In [60]:
db = mongo_client['movies']
collection = db['movie_collection']

In [61]:
collection.delete_many({})

DeleteResult({'n': 1473, 'electionId': ObjectId('7fffffff00000000000000f9'), 'opTime': {'ts': Timestamp(1711732336, 476), 't': 249}, 'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1711732336, 478), 'signature': {'hash': b"\x01\xf1:\xca\xe2\xdd\x9f\x8e@\x08p\xda\rM\x81c\xe7\x8c'\xd3", 'keyId': 7314012916322140166}}, 'operationTime': Timestamp(1711732336, 476)}, acknowledged=True)

In [62]:
documents = df.to_dict('records')
collection.insert_many(documents)

print("Data ingestion into MongoDB completed")

Data ingestion into MongoDB completed
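
The search cell that follows refers to an index named "vector_index", which this notebook does not create. The sketch below shows what a matching Atlas Vector Search index definition might look like. The helper `build_vector_index_definition` is hypothetical, the choice of cosine similarity is an assumption, and 1536 is the default output size of text-embedding-3-small. The commented-out creation call assumes a recent PyMongo release (4.7 or later); the index can also be created through the Atlas UI.

```python
def build_vector_index_definition(path, num_dimensions, similarity="cosine"):
    """Build an Atlas Vector Search index definition for a single vector field."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


# The path must match the field that holds the embeddings in each document
index_definition = build_vector_index_definition("plot_embedding_optimised", 1536)
print(index_definition)

# Assumed usage with PyMongo 4.7+ (not run here, requires an Atlas cluster):
# from pymongo.operations import SearchIndexModel
# collection.create_search_index(
#     SearchIndexModel(definition=index_definition, name="vector_index", type="vectorSearch")
# )
```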


In [65]:
def vector_search(user_query, collection):
    """
    Perform a vector search in the MongoDB collection based on the user query.

    Args:
    user_query (str): The user's query string.
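    collection (pymongo.collection.Collection): The MongoDB collection to search.

    Returns:
    list: A list of matching documents (empty if the embedding could not be generated).
    """

    # Generate embedding for the user query
    query_embedding = get_embedding(user_query)

    if query_embedding is None:
        # Return an empty list so callers can always iterate over the result
        print("Invalid query or embedding generation failed.")
        return []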

    # Define the vector search pipeline
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": "plot_embedding_optimised",
                "numCandidates": 150,  # Number of candidate matches to consider
                "limit": 5  # Return top 5 matches
            }
        },
        {
            "$project": {
                "_id": 0,  # Exclude the _id field
                "plot": 1,  # Include the plot field
                "title": 1,  # Include the title field
                "genres": 1, # Include the genres field
                "score": {
                    "$meta": "vectorSearchScore"  # Include the search score
                }
            }
        }
    ]

    # Execute the search
    results = collection.aggregate(pipeline)
    return list(results)

In [66]:
def handle_user_query(query, collection):
    """Retrieve matching movies and ask the chat model to answer using them as context."""

    # Retrieve the most relevant documents for the query
    get_knowledge = vector_search(query, collection)

    # Flatten the retrieved documents into a plain-text context string
    search_result = ''
    for result in get_knowledge:
        search_result += f"Title: {result.get('title', 'N/A')}, Plot: {result.get('plot', 'N/A')}\n"

    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a movie recommendation system."},
            {"role": "user", "content": "Answer this user query: " + query + " with the following context: " + search_result}
        ]
    )

    print(completion.choices[0].message)
    print(search_result)

    return (completion.choices[0].message), search_result

In [72]:
query = "What is the best sci-fi and adventure movie to watch?"
response, source_information = handle_user_query(query, collection)

print(f"Response: {response}")
# print(f"Source Information: \n{source_information}")

ChatCompletionMessage(content='With the context provided, I recommend you watch "Interstellar". It is a sci-fi adventure movie directed by Christopher Nolan that follows a group of astronauts on a mission to find a new habitable planet for humanity. The movie has a captivating story, stunning visuals, and thought-provoking themes that make it a great choice for both sci-fi and adventure enthusiasts.', role='assistant', function_call=None, tool_calls=None)

Response: ChatCompletionMessage(content='With the context provided, I recommend you watch "Interstellar". It is a sci-fi adventure movie directed by Christopher Nolan that follows a group of astronauts on a mission to find a new habitable planet for humanity. The movie has a captivating story, stunning visuals, and thought-provoking themes that make it a great choice for both sci-fi and adventure enthusiasts.', role='assistant', function_call=None, tool_calls=None)


In [73]:
query = "Show me movies that are similar to the movie AlphaGo"

In [74]:
response, source_information = handle_user_query(query, collection)

ChatCompletionMessage(content='If you enjoyed the documentary "AlphaGo" which explores the intersection of artificial intelligence and human intelligence through the ancient game of Go, you may also like the following movies:\n\n1. "Deep Blue" (2003) - A documentary about the legendary chess match between Garry Kasparov and IBM\'s Deep Blue computer.\n\n2. "The Imitation Game" (2014) - A biographical drama about mathematician and cryptanalyst Alan Turing who played a key role in breaking the Enigma code during World War II.\n\n3. "Ex Machina" (2014) - A sci-fi thriller that delves into the relationship between humans and artificial intelligence through the Turing Test.\n\n4. "Her" (2013) - A romantic sci-fi drama that explores the emotional connection between a man and an advanced operating system.\n\nThese movies share themes of artificial intelligence, human-AI interactions, and the impact of technology on society, much like "AlphaGo."', role='assistant', function_call=None, tool_cal

In [78]:
print(f"Response: {response}")

Response: ChatCompletionMessage(content='If you enjoyed the documentary "AlphaGo" which explores the intersection of artificial intelligence and human intelligence through the ancient game of Go, you may also like the following movies:\n\n1. "Deep Blue" (2003) - A documentary about the legendary chess match between Garry Kasparov and IBM\'s Deep Blue computer.\n\n2. "The Imitation Game" (2014) - A biographical drama about mathematician and cryptanalyst Alan Turing who played a key role in breaking the Enigma code during World War II.\n\n3. "Ex Machina" (2014) - A sci-fi thriller that delves into the relationship between humans and artificial intelligence through the Turing Test.\n\n4. "Her" (2013) - A romantic sci-fi drama that explores the emotional connection between a man and an advanced operating system.\n\nThese movies share themes of artificial intelligence, human-AI interactions, and the impact of technology on society, much like "AlphaGo."', role='assistant', function_call=None

In [84]:
content = """
If you enjoyed the documentary "AlphaGo" which explores the intersection of artificial intelligence and human intelligence through the ancient game of Go, you may also like the following movies:

1. "Deep Blue" (2003) - A documentary about the legendary chess match between Garry Kasparov and IBM's Deep Blue computer.

2. "The Imitation Game" (2014) - A biographical drama about mathematician and cryptanalyst Alan Turing who played a key role in breaking the Enigma code during World War II.

3. "Ex Machina" (2014) - A sci-fi thriller that delves into the relationship between humans and artificial intelligence through the Turing Test.

4. "Her" (2013) - A romantic sci-fi drama that explores the emotional connection between a man and an advanced operating system.

These movies share themes of artificial intelligence, human-AI interactions, and the impact of technology on society, much like "AlphaGo."
"""

In [95]:
# Replace newlines with spaces (a second replace on "\n2" would never match after this)
content = content.replace("\n", " ")
content

' If you enjoyed the documentary "AlphaGo" which explores the intersection of artificial intelligence and human intelligence through the ancient game of Go, you may also like the following movies:  1. "Deep Blue" (2003) - A documentary about the legendary chess match between Garry Kasparov and IBMs Deep Blue computer.  2. "The Imitation Game" (2014) - A biographical drama about mathematician and cryptanalyst Alan Turing who played a key role in breaking the Enigma code during World War II.  3. "Ex Machina" (2014) - A sci-fi thriller that delves into the relationship between humans and artificial intelligence through the Turing Test.  4. "Her" (2013) - A romantic sci-fi drama that explores the emotional connection between a man and an advanced operating system.  These movies share themes of artificial intelligence, human-AI interactions, and the impact of technology on society, much like "AlphaGo." '

In [101]:
# Trim the leading and trailing space left over from the triple-quoted string
outcome = content.strip()

In [102]:
outcome

'If you enjoyed the documentary "AlphaGo" which explores the intersection of artificial intelligence and human intelligence through the ancient game of Go, you may also like the following movies:  1. "Deep Blue" (2003) - A documentary about the legendary chess match between Garry Kasparov and IBMs Deep Blue computer.  2. "The Imitation Game" (2014) - A biographical drama about mathematician and cryptanalyst Alan Turing who played a key role in breaking the Enigma code during World War II.  3. "Ex Machina" (2014) - A sci-fi thriller that delves into the relationship between humans and artificial intelligence through the Turing Test.  4. "Her" (2013) - A romantic sci-fi drama that explores the emotional connection between a man and an advanced operating system.  These movies share themes of artificial intelligence, human-AI interactions, and the impact of technology on society, much like "AlphaGo."'
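
The manual clean-up above can be condensed into one step. The sketch below uses a hypothetical helper, `normalise_whitespace`, and a shortened made-up sample string. Since `handle_user_query` returns a ChatCompletionMessage, the same helper could presumably be applied to `response.content` directly instead of pasting the text by hand.

```python
def normalise_whitespace(text):
    """Collapse runs of whitespace (including newlines) into single spaces and trim the ends."""
    return " ".join(text.split())


# Shortened sample standing in for the model's multi-line answer
sample = '\nYou may also like:\n\n1. "Deep Blue" (2003)\n\n2. "Her" (2013)\n'
print(normalise_whitespace(sample))
# You may also like: 1. "Deep Blue" (2003) 2. "Her" (2013)
```

Unlike chained `replace` calls, this leaves single spaces between sentences and needs no separate trimming step.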