In this guide, we build a simple retrieval-augmented generation (RAG) chatbot that answers questions about the contents of a PDF document. The tutorial uses Friendli Serverless Endpoints for LLM inference, MongoDB Atlas as the vector store, and OpenAI embeddings to index the document.

## Dependencies

First, let’s install the required packages:

In [None]:
!pip install langchain langchain-community friendli-client pypdf pymongo langchain-openai tiktoken
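
The later cells call OpenAI for embeddings and Friendli for inference, so both services need credentials. The sketch below checks for them up front. It assumes the keys are supplied through the `OPENAI_API_KEY` and `FRIENDLI_TOKEN` environment variables, which the respective LangChain classes read when no key is passed explicitly. The helper `missing_credentials` is a hypothetical name introduced for illustration.

```python
import os

# Assumed variable names: OPENAI_API_KEY is read by OpenAIEmbeddings and
# FRIENDLI_TOKEN by ChatFriendli when no key is passed explicitly.
REQUIRED_ENV_VARS = ("OPENAI_API_KEY", "FRIENDLI_TOKEN")


def missing_credentials(env=os.environ, required=REQUIRED_ENV_VARS):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Set these environment variables before continuing:", ", ".join(missing))
```

Failing early here is cheaper than discovering a missing key halfway through the embedding step.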

## Setting Up MongoDB Atlas
While you can run MongoDB locally, this tutorial uses MongoDB Atlas, a managed service.
A dedicated cluster has been prepared for each team; use the connection information you were given.

With the provided connection info, initialize the MongoDB client:

In [2]:
from pymongo import MongoClient

MONGODB_ATLAS_CLUSTER_URI = "YOUR_MONGO_URI"  # mongodb+srv://team-xx:<password>@cluster0.tyqdayd.mongodb.net/

client = MongoClient(MONGODB_ATLAS_CLUSTER_URI)

DB_NAME = "team-xx"
COLLECTION_NAME = "rag"
ATLAS_VECTOR_SEARCH_INDEX_NAME = "team_xx_rag"

MONGODB_COLLECTION = client[DB_NAME][COLLECTION_NAME]

Test the connection by running:

In [None]:
client.server_info()

## Creating a Vector Search Index

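To use MongoDB as a vector store, you need to create a vector search index on the collection. The `numDimensions` value must match the output size of the embedding model (1536 for the default OpenAI embedding model used below), and `path` must name the document field that holds the embeddings. Configure the search index as follows: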

In [None]:
from pymongo.operations import SearchIndexModel

search_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "numDimensions": 1536,
                "path": "embedding",
                "similarity": "cosine",
                "type": "vector"
            }
        ]
    },
    name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
    type="vectorSearch",
)
client[DB_NAME][COLLECTION_NAME].create_search_index(search_model)
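
Atlas builds search indexes asynchronously, so queries issued right after `create_search_index` may return nothing. A minimal sketch of a readiness check follows. It assumes the index description returned by `list_search_indexes` carries a boolean `queryable` field, and `wait_until_queryable` is a hypothetical helper, not part of pymongo.

```python
import time


def wait_until_queryable(collection, index_name, timeout=300.0, poll_interval=5.0):
    """Poll the collection's search indexes until `index_name` is queryable.

    Returns True once the index reports queryable, or False if `timeout`
    seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        # Filtering by name yields at most one index description document.
        indexes = list(collection.list_search_indexes(index_name))
        if indexes and indexes[0].get("queryable") is True:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


# Usage in the notebook (needs the live cluster, so it is left commented out):
# wait_until_queryable(MONGODB_COLLECTION, ATLAS_VECTOR_SEARCH_INDEX_NAME)
```

Returning a boolean instead of raising keeps the decision about what to do on timeout with the caller.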

## Loading Documents and Embeddings

Now, let’s load a PDF document, split it into chunks, and insert the chunks into MongoDB Atlas along with their embeddings. In our case, we load the BPipe paper from ICML 2023:

In [5]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import MongoDBAtlasVectorSearch
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = PyPDFLoader("https://openreview.net/pdf?id=HVKmLi1iR4")
data = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
docs = text_splitter.split_documents(data)

vector_store = MongoDBAtlasVectorSearch.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(disallowed_special=()),
    collection=MONGODB_COLLECTION,
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
)
retriever = vector_store.as_retriever()
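
To make the `chunk_size=1000, chunk_overlap=150` settings concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break at separators such as paragraph and line boundaries); `sliding_window_chunks` is a hypothetical helper that only illustrates how consecutive chunks share text.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=150):
    """Split `text` into windows of `chunk_size` characters that overlap
    their predecessor by `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2000))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])
print(chunks[0][-150:] == chunks[1][:150])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.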

## Initializing the LLM with Friendli

Now, let’s initialize the LLM with Friendli Serverless Endpoints, using Meta’s Llama 3 70B Instruct model:

In [6]:
from langchain_community.chat_models.friendli import ChatFriendli

llm = ChatFriendli(model="meta-llama-3-70b-instruct")

## Building the RAG Chain

All the components of the RAG pipeline are now in place. The chain below retrieves the chunks most relevant to a question, formats them into the prompt, and passes the prompt to the LLM. Here, we ask what the ‘memory imbalance problem’ is in the context of BPipe.

In [8]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough

template = """Use the following pieces of context to answer the question at the end.
If you don’t know the answer, just say that you don’t know, don’t try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
Always say “thanks for asking!” at the end of the answer.

{context}

Question: {question}

Helpful Answer:"""

prompt = PromptTemplate.from_template(template)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain.invoke("What is the memory imbalance problem that BPipe solves?")

'The memory imbalance problem that BPIPE solves is when certain GPUs face high memory pressure while others underutilize their capacity during pipeline parallelism, resulting in suboptimal training performance. Thanks for asking!'

Running the chain returns a response that correctly describes information from the PDF file, even though the paper was not part of the data used to train the original model.
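
To see what the model actually receives, the first two steps of the chain can be mimicked without any network calls. This is a sketch under stated assumptions: `FakeDoc` is a hypothetical stand-in for a retrieved LangChain document, `build_prompt` is an illustrative helper, and the template is a shortened version of the one above.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved LangChain Document."""
    page_content: str


def format_docs(docs):
    # Same helper as in the chain: join chunk texts with blank lines.
    return "\n\n".join(doc.page_content for doc in docs)


# Shortened version of the tutorial's template, for illustration only.
TEMPLATE = "{context}\n\nQuestion: {question}\n\nHelpful Answer:"


def build_prompt(question, retrieved_docs):
    """Mimic the chain's first two steps: gather context, then fill the template."""
    inputs = {"context": format_docs(retrieved_docs), "question": question}
    return TEMPLATE.format(**inputs)


docs = [FakeDoc("Chunk one."), FakeDoc("Chunk two.")]
print(build_prompt("What is BPipe?", docs))
```

In the real chain, the dictionary step plays the role of `inputs`: the question is sent to the retriever to produce `context` and is passed through unchanged as `question`.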