Embeddings are dense vector representations of text data that capture the semantic meaning of words and phrases. They are widely used in natural language processing (NLP) tasks, such as text classification, question answering, and information retrieval. Embeddings are generated using techniques like Word2Vec, GloVe, or transformer-based models like BERT.
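
As a minimal illustration of how embeddings capture semantic similarity, the sketch below compares toy vectors using cosine similarity. The three-dimensional vectors and the `cosine_similarity` helper are made up for this example; real embeddings come from a model and typically have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, hand-picked for illustration only
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.82, 0.15]
banana = [0.1, 0.05, 0.95]

print(cosine_similarity(king, queen))   # close to 1: vectors point the same way
print(cosine_similarity(king, banana))  # much lower: vectors point apart
```

Texts with related meanings should end up with vectors that score close to 1, which is the property nearest-neighbor retrieval relies on.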

Retrieval-Augmented Generation (RAG) enhances language models with external knowledge retrieved from a large corpus. It combines the strengths of retrieval and generation models to produce better-informed, context-aware responses. A RAG system has three main components (retriever, reader, and generator), arranged between the input and the output as follows:

                  +---------------+
                  |     Input     |
                  +-------+-------+
                          |
                  +-------+-------+
                  |    Retriever  |
                  +-------+-------+
                          |
                  +-------+-------+
                  |    Corpus     |
                  | (Embeddings + |
                  |   Text Data)  |
                  +-------+-------+
                          |
                  +-------+-------+
                  |    Reader     |
                  +-------+-------+
                          |
                  +-------+-------+
                  |   Generator   |
                  +-------+-------+
                          |
                  +-------v-------+
                  |    Output     |
                  +---------------+

* Retriever: This component takes the input text and queries a large corpus (e.g., Wikipedia) to retrieve relevant documents or passages based on semantic similarity. This is often done using embeddings and nearest-neighbor search techniques.
* Reader: The retrieved documents or passages are then processed by a reading comprehension model (e.g., a transformer-based language model) to extract the most relevant information for the given input.
* Generator: The output from the reader, along with the original input text, is fed into a language generation model (e.g., GPT) to produce the final output response.
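
The three components above can be sketched as plain functions. This is only a toy outline under stated assumptions: the in-memory `CORPUS`, its hand-written embeddings, and the `retrieve`, `read`, and `generate` helpers are all hypothetical stand-ins. A real system would use an embedding model, a reading comprehension model, and a language model in their place.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical in-memory corpus: each entry pairs a passage with a toy embedding
CORPUS = [
    {"text": "MongoDB is a document database.", "embedding": [0.9, 0.1, 0.0]},
    {"text": "BERT is a transformer-based model.", "embedding": [0.1, 0.9, 0.1]},
    {"text": "Bananas are rich in potassium.", "embedding": [0.0, 0.1, 0.9]},
]

def retrieve(query_embedding, corpus, k=2):
    """Retriever: return the k passages whose embeddings are closest to the query."""
    ranked = sorted(
        corpus,
        key=lambda doc: cosine_similarity(query_embedding, doc["embedding"]),
        reverse=True,
    )
    return ranked[:k]

def read(passages):
    """Reader stand-in: simply joins the passage texts into one context string."""
    return " ".join(p["text"] for p in passages)

def generate(question, context):
    """Generator stand-in: builds the prompt a language model would receive."""
    return f"Context: {context}\nQuestion: {question}\nAnswer:"

question = "What kind of database is MongoDB?"
query_embedding = [0.8, 0.2, 0.0]  # pretend an embedding model produced this
passages = retrieve(query_embedding, CORPUS, k=1)
prompt = generate(question, read(passages))
print(prompt)
```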

MongoDB, a widely adopted NoSQL document database, can serve as the storage and retrieval layer of a RAG system: it stores the embeddings alongside their associated text data and retrieves them at query time. Its flexible document model and horizontal scalability make it well suited to managing large text corpora and their corresponding embeddings.

In a RAG system integrated with MongoDB, the embeddings and text data are stored in a MongoDB collection. Each document in the collection represents a text passage or document, and its embedding is stored within the same document as an array of numbers or as binary data. Indexing the embedding field with a vector search index (MongoDB Atlas Vector Search) enables efficient approximate nearest-neighbor search, a crucial aspect of the RAG system's functionality.
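
A minimal sketch of what one stored document and a matching Atlas Vector Search index definition might look like. The field names (`title`, `text`, `embedding`), the helper functions, and the four-dimensional vector are assumptions chosen for illustration; the index definition follows the Atlas Vector Search format, in which `numDimensions` must equal the length of the stored vectors.

```python
def make_passage_document(title, text, embedding):
    """Build one collection document: the passage text plus its embedding as an array."""
    return {
        "title": title,
        "text": text,
        "embedding": [float(x) for x in embedding],
    }

def vector_index_definition(path, num_dimensions, similarity="cosine"):
    """Atlas Vector Search index definition covering a single embedding field."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,                     # field that holds the embedding
                "numDimensions": num_dimensions,  # must match the vector length
                "similarity": similarity,         # e.g. "cosine", "euclidean", "dotProduct"
            }
        ]
    }

# Hypothetical passage with a toy 4-dimensional embedding
doc = make_passage_document(
    "Example passage",
    "A hypothetical passage about document databases.",
    [0.1, -0.2, 0.3, 0.4],
)
index_def = vector_index_definition("embedding", len(doc["embedding"]))

print(doc)
print(index_def)
```

With PyMongo, documents of this shape could be written with `collection.insert_many(...)`, and the index definition supplied when creating a vector search index on the Atlas cluster.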

During the retrieval phase, the input text is encoded into an embedding, and a vector search query (the $vectorSearch aggregation stage in MongoDB Atlas) finds the stored embeddings nearest to it, effectively retrieving the most relevant documents or passages.
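
The retrieval step can be sketched as an aggregation pipeline built around Atlas Vector Search's `$vectorSearch` stage. The helper name, the index name `vector_index`, the field names, and the default limits below are assumptions for illustration. The query embedding would come from the same model that produced the stored embeddings.

```python
def build_vector_search_pipeline(query_embedding, index_name="vector_index",
                                 path="embedding", limit=5, num_candidates=100):
    """Return an aggregation pipeline that finds the passages nearest to the query."""
    if num_candidates < limit:
        raise ValueError("num_candidates must be at least as large as limit")
    return [
        {
            "$vectorSearch": {
                "index": index_name,                  # name of the vector search index
                "path": path,                         # field holding the stored embeddings
                "queryVector": list(query_embedding),
                "numCandidates": num_candidates,      # candidates considered by the ANN search
                "limit": limit,                       # number of results returned
            }
        },
        {
            # Keep only the passage text and the similarity score
            "$project": {
                "_id": 0,
                "text": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]

# Hypothetical toy query embedding
pipeline = build_vector_search_pipeline([0.1, -0.2, 0.3, 0.4], limit=3)
print(pipeline)

# With a PyMongo collection backed by an Atlas cluster, the pipeline would be run as:
# results = list(collection.aggregate(pipeline))
```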


In [4]:
# !pip install datasets pandas openai pymongo

In [5]:
from datasets import load_dataset
import pandas as pd

In [3]:
# Load the dataset from the Hugging Face Hub (downloaded on first use)
dataset = load_dataset("MongoDB/embedded_movies")

# Convert the train split to a pandas DataFrame
df = pd.DataFrame(dataset["train"])

df.head(5)

Downloading readme: 100%|██████████| 6.18k/6.18k [00:00<00:00, 12.8MB/s]
Downloading data: 100%|██████████| 42.3M/42.3M [00:05<00:00, 8.43MB/s]
Generating train split: 1500 examples [00:00, 5125.68 examples/s]


Unnamed: 0,cast,languages,fullplot,plot,writers,runtime,directors,awards,imdb,type,plot_embedding,metacritic,num_mflix_comments,rated,poster,title,genres,countries
0,"[Pearl White, Crane Wilbur, Paul Panzer, Edwar...",[English],Young Pauline is left a lot of money when her ...,Young Pauline is left a lot of money when her ...,"[Charles W. Goddard (screenplay), Basil Dickey...",199.0,"[Louis J. Gasnier, Donald MacKenzie]","{'nominations': 0, 'text': '1 win.', 'wins': 1}","{'id': 4465, 'rating': 7.6, 'votes': 744}",movie,"[0.00072939653, -0.026834568, 0.013515796, -0....",,0,,https://m.media-amazon.com/images/M/MV5BMzgxOD...,The Perils of Pauline,[Action],[USA]
1,"[Harold Lloyd, Mildred Davis, 'Snub' Pollard, ...",[English],As a penniless man worries about how he will m...,A penniless young man tries to save an heiress...,[H.M. Walker (titles)],22.0,"[Alfred J. Goulding, Hal Roach]","{'nominations': 1, 'text': '1 nomination.', 'w...","{'id': 10146, 'rating': 7.0, 'votes': 639}",movie,"[-0.022837115, -0.022941574, 0.014937485, -0.0...",,0,TV-G,https://m.media-amazon.com/images/M/MV5BNzE1OW...,From Hand to Mouth,"[Comedy, Short, Action]",[USA]
2,"[Ronald Colman, Neil Hamilton, Ralph Forbes, A...",[English],"Michael ""Beau"" Geste leaves England in disgrac...","Michael ""Beau"" Geste leaves England in disgrac...","[Herbert Brenon (adaptation), John Russell (ad...",101.0,[Herbert Brenon],"{'nominations': 0, 'text': '1 win.', 'wins': 1}","{'id': 16634, 'rating': 6.9, 'votes': 222}",movie,"[0.00023330493, -0.028511643, 0.014653289, -0....",,0,,,Beau Geste,"[Action, Adventure, Drama]",[USA]
3,"[Billie Dove, Tempe Pigott, Donald Crisp, Sam ...",,A nobleman vows to avenge the death of his fat...,"Seeking revenge, an athletic young man joins t...","[Douglas Fairbanks (story), Jack Cunningham (a...",88.0,[Albert Parker],"{'nominations': 0, 'text': '1 win.', 'wins': 1}","{'id': 16654, 'rating': 7.2, 'votes': 1146}",movie,"[-0.005927917, -0.033394486, 0.0015323418, -0....",,1,,https://m.media-amazon.com/images/M/MV5BMzU0ND...,The Black Pirate,"[Adventure, Action]",[USA]
4,"[Harold Lloyd, Jobyna Ralston, Noah Young, Jim...",[English],"The Uptown Boy, J. Harold Manners (Lloyd) is a...",An irresponsible young millionaire changes his...,"[Ted Wilde (story), John Grey (story), Clyde B...",58.0,[Sam Taylor],"{'nominations': 1, 'text': '1 nomination.', 'w...","{'id': 16895, 'rating': 7.6, 'votes': 918}",movie,"[-0.0059373598, -0.026604708, -0.0070914757, -...",,0,PASSED,https://m.media-amazon.com/images/M/MV5BMTcxMT...,For Heaven's Sake,"[Action, Comedy, Romance]",[USA]
