## RAG Template with LangChain

- Inference requires a [Groq](https://groq.com/) API key (free for testing).

In [None]:
%pip install -q langchain langchain-community langchain-huggingface langchain_groq sentence-transformers faiss-cpu bs4

### Build up Knowledge Base

We first load the document(s) from web URLs:

In [3]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader(["https://ai.meta.com/blog/meta-llama-3-1/",
                        "https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md"])
docs = loader.load()

### Chunking Documents

LangChain offers various text splitters, with the `RecursiveCharacterTextSplitter` being a recommended choice for generic text. This splitter tries to keep paragraphs (and then sentences, and then words) together as long as possible, since those units tend to be the most semantically related pieces of text.

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
splits = text_splitter.split_documents(docs)
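To make the "paragraphs, then sentences, then words" idea concrete, here is a simplified pure-Python sketch of recursive splitting. It is not LangChain's implementation: the function name `recursive_split` and the sample text are hypothetical, and chunk overlap is left out to keep the logic short.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters.

    Simplified illustration (assumption: no overlap handling).
    The coarsest separator is tried first; finer ones are used only
    for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, *finer = separators
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    parts = text.split(sep)
    if len(parts) == 1:
        # This separator does not occur, fall through to a finer one.
        return recursive_split(text, chunk_size, finer)
    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging neighbouring pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # A single piece is too long: split it with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "para one is here.\n\npara two is a bit longer than that.\n\nshort"
chunks = recursive_split(sample, chunk_size=30)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The first paragraph fits and stays whole, while the second one exceeds the limit and is broken at word boundaries instead of mid-word.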

Let's inspect the second and third chunks:

In [5]:
from IPython.display import display, Markdown

def md(s):
    display(Markdown(s))

In [6]:
md(splits[1].page_content)
md(splits[2].page_content)

Our approachResearchProduct experiencesLlamaBlogTry Meta AILarge Language ModelIntroducing Llama 3.1: Our most capable models to dateJuly 23, 2024•15 minute readTakeaways:Meta is committed to openly accessible AI. Read Mark Zuckerberg’s letter detailing why open source is good for developers, good for Meta, and good for the world.Bringing open intelligence to all, our latest models expand context length to 128K, add support across eight languages, and include Llama 3.1 405B—the first

context length to 128K, add support across eight languages, and include Llama 3.1 405B—the first frontier-level open source AI model.Llama 3.1 405B is in a class of its own, with unmatched flexibility, control, and state-of-the-art capabilities that rival the best closed source models. Our new model will enable the community to unlock new workflows, such as synthetic data generation and model distillation.We’re continuing to build out Llama to be a system by providing more components that work

We can see that there is indeed an overlap between those chunks:

In [7]:
md(splits[1].page_content[-100:])
md(splits[2].page_content[:100])

and context length to 128K, add support across eight languages, and include Llama 3.1 405B—the first

context length to 128K, add support across eight languages, and include Llama 3.1 405B—the first fro
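Instead of comparing the two slices by eye, the shared text can be computed. The helper below is a small sketch (the name `shared_overlap` and the shortened example strings are our own, not part of LangChain); it returns the longest suffix of one chunk that is also a prefix of the next.

```python
def shared_overlap(left: str, right: str) -> str:
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""


# Shortened stand-ins for two neighbouring chunks (illustrative only).
left = "expand context length to 128K, add support"
right = "length to 128K, add support across eight languages"
overlap = shared_overlap(left, right)
print(len(overlap), repr(overlap))
```

In the notebook this would be called as `shared_overlap(splits[1].page_content, splits[2].page_content)`. The result can be shorter than `chunk_overlap`, since the splitter appears to cut at separator boundaries rather than at an exact character count, which matches the slightly different starts of the two slices above.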

### Embedding Transformation and Indexing

Let's load the chunks into a vector store with an open-source embedding model. In this example we use FAISS, a similarity-search library that is optimized for large-scale datasets and also supports GPU acceleration (here we install the CPU build, `faiss-cpu`):

In [8]:
%%capture 
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
db = FAISS.from_documents(documents=splits,
                          embedding=HuggingFaceEmbeddings(model_name=embedding_model))
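Conceptually, the vector store maps every chunk to a vector and, at query time, returns the chunks whose vectors are closest to the query vector. The sketch below shows that idea with a brute-force search in pure Python. The 3-dimensional vectors and texts are made up for illustration (the real embedding model produces 384-dimensional vectors), and the use of Euclidean distance is an assumption about the default setup.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query_vec, indexed, k=2):
    """Return the k (text, distance) pairs closest to query_vec."""
    scored = [(text, l2_distance(query_vec, vec)) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy "index": (chunk text, embedding) pairs with invented vectors.
toy_index = [
    ("context length is 128K", [0.9, 0.1, 0.0]),
    ("MMLU benchmark table", [0.1, 0.9, 0.0]),
    ("partner acknowledgements", [0.0, 0.1, 0.9]),
]

hits = nearest([0.8, 0.2, 0.0], toy_index, k=2)
for text, dist in hits:
    print(f"{dist:.3f}  {text}")
```

With the real store, the equivalent lookup is `db.similarity_search(query, k=...)`, which embeds the query with the same model before searching.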

### RAG and Inference

In [9]:
import os
from getpass import getpass

GROQ_API_TOKEN = getpass()
os.environ["GROQ_API_KEY"] = GROQ_API_TOKEN
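When the notebook is re-run often, it can be convenient to prompt only if the key is not already set. A possible variant is sketched below; the helper name `ensure_api_key` and its `prompt_fn` parameter are hypothetical.

```python
import os
from getpass import getpass


def ensure_api_key(env_var="GROQ_API_KEY", prompt_fn=getpass):
    """Return the key from the environment, prompting only if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        key = prompt_fn(f"{env_var}: ")
        os.environ[env_var] = key  # make it visible to client libraries
    return key


# Usage in the notebook (prompts only on the first run of the session):
# ensure_api_key()
```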

In this example we will use Llama 3 8B (`llama3-8b-8192` on Groq):

In [13]:
from langchain_groq import ChatGroq
llm = ChatGroq(temperature=0, model_name="llama3-8b-8192")

In [14]:
from langchain.chains import ConversationalRetrievalChain

chat_history = []
chain = ConversationalRetrievalChain.from_llm(llm,
                                              db.as_retriever(),
                                              return_source_documents=True)

#### Q&A with Source Citation

In [25]:
user_query = "how long is the context length in Llama 3.1 405B?"
llm_output = chain.invoke({"question": user_query, "chat_history": chat_history})

# the answer should be 128k
md(llm_output['answer'])

According to the provided context, the context length in Llama 3.1 405B is 128K.

Note that LangChain includes the sources in the response:

In [26]:
llm_output['source_documents']

[Document(metadata={'source': 'https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md', 'title': 'llama-models/models/llama3_1/MODEL_CARD.md at main · meta-llama/llama-models · GitHub', 'description': 'Utilities intended for use with Llama models. Contribute to meta-llama/llama-models development by creating an account on GitHub.', 'language': 'en'}, page_content='Language\n\nLlama 3.1 8B Instruct\n\nLlama 3.1 70B Instruct\n\nLlama 3.1 405B Instruct\n\n\n\nGeneral\n\nMMLU (5-shot, macro_avg/acc)\n\nPortuguese\n   \n62.12\n   \n80.13\n   \n84.95\n   \n\n\nSpanish\n   \n62.45\n   \n80.05\n   \n85.08\n   \n\n\nItalian\n   \n61.63\n   \n80.4\n   \n85.04\n   \n\n\nGerman\n   \n60.59\n   \n79.27\n   \n84.36\n   \n\n\nFrench\n   \n62.34\n   \n79.82\n   \n84.66\n   \n\n\nHindi\n   \n50.88\n   \n74.52\n   \n80.31\n   \n\n\nThai\n   \n50.32\n   \n72.95\n   \n78.21'),
 Document(metadata={'source': 'https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MO

We can inspect the beginning of the first retrieved source to check what the answer was grounded on:

In [28]:
md(llm_output['source_documents'][0].page_content[:100])

Language

Llama 3.1 8B Instruct

Llama 3.1 70B Instruct

Llama 3.1 405B Instruct



General

MMLU (5
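For a citation list it is usually enough to show each source URL once. The sketch below collects them in retrieval order. The helper name `unique_sources` is our own, and the `SimpleNamespace` objects are stand-ins for the `Document` objects in `llm_output['source_documents']` (the second URL is an invented placeholder).

```python
from types import SimpleNamespace


def unique_sources(source_documents):
    """Return source URLs in retrieval order, without duplicates."""
    seen = []
    for doc in source_documents:
        url = doc.metadata.get("source", "unknown")
        if url not in seen:
            seen.append(url)
    return seen


# Stand-in documents (illustrative, mimicking the `metadata` attribute).
fake_docs = [
    SimpleNamespace(metadata={"source": "https://example.com/model-card"}),
    SimpleNamespace(metadata={"source": "https://example.com/blog"}),
    SimpleNamespace(metadata={"source": "https://example.com/model-card"}),
]
sources = unique_sources(fake_docs)
print(sources)
```

In the notebook the call would be `unique_sources(llm_output['source_documents'])`.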

##### Follow-up Question with Chat History

In [29]:
chat_history = [(user_query, llm_output["answer"])]

In [30]:
user_query = "what about the 8b model?"
llm_output = chain.invoke({"question": user_query, "chat_history": chat_history})
md(llm_output['answer'])

According to the text, the context length in the 8B model is 128K.
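For longer conversations, rebuilding `chat_history` by hand gets tedious. A small wrapper can append each turn automatically. This is only a sketch: the name `ask` is our own, and `EchoChain` is a stand-in so the example runs without an API key; in the notebook you would pass the real `chain`.

```python
def ask(chain, question, chat_history):
    """Run one turn and append (question, answer) to chat_history in place."""
    output = chain.invoke({"question": question, "chat_history": chat_history})
    chat_history.append((question, output["answer"]))
    return output


class EchoChain:
    """Stand-in for the real chain: reports how many turns came before."""

    def invoke(self, inputs):
        previous = len(inputs["chat_history"])
        return {"answer": f"answer after {previous} earlier turn(s)"}


history = []
first = ask(EchoChain(), "how long is the context length?", history)
second = ask(EchoChain(), "what about the 8b model?", history)
print(second["answer"])
```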

##### Follow-up Question *without* Chat History

In [31]:
user_query = "what about the 8b model?"
llm_output = chain.invoke({"question": user_query, "chat_history": []})
md(llm_output['answer'])

The text does not mention the "8b model". It does mention quantizing the 405B model from 16-bit (BF16) to 8-bit (FP8) numerics, but it does not mention an "8b model" specifically.

Without chat history, the retriever only sees the literal follow-up question. It appears to return passages that loosely match the word 'model' in the user question, but nothing about the context length:

In [32]:
for doc in llm_output['source_documents']:
    md(doc.page_content)

Introducing Llama 3.1: Our most capable models to date

this blog post.)While this is our biggest model yet, we believe there’s still plenty of new ground to explore in the future, including more device-friendly sizes, additional modalities, and more investment at the agent platform layer.As always, we look forward to seeing all the amazing products and experiences the community will build with these models.This work was supported by our partners across the AI community. We’d like to thank and acknowledge (in alphabetical order): Accenture, Amazon

parameter model to improve the post-training quality of our smaller models.To support large-scale production inference for a model at the scale of the 405B, we quantized our models from 16-bit (BF16) to 8-bit (FP8) numerics, effectively lowering the compute requirements needed and allowing the model to run within a single server node.Instruction and chat fine-tuningWith Llama 3.1 405B, we strove to improve the helpfulness, quality, and detailed instruction-following capability of the model in

translation. With the release of the 405B model, we’re poised to supercharge innovation—with unprecedented opportunities for growth and exploration. We believe the latest generation of Llama will ignite new applications and modeling paradigms, including synthetic data generation to enable the improvement and training of smaller models, as well as model distillation—a capability that has never been achieved at this scale in open source.As part of this latest release, we’re introducing upgraded
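The difference between the two runs comes from a question-rewriting step: when chat history is present, the chain first asks the LLM to turn the follow-up into a standalone question, and that rewritten question is what gets sent to the retriever. The sketch below imitates this step. The function name, the prompt wording, and the `rewrite` callable (a stand-in for the LLM call) are all assumptions for illustration, not LangChain's actual prompt.

```python
def standalone_question(rewrite, question, chat_history):
    """Return the query that would be sent to the retriever."""
    if not chat_history:
        # Nothing to resolve against: the raw follow-up is used as-is.
        return question
    transcript = "\n".join(
        f"Human: {q}\nAssistant: {a}" for q, a in chat_history
    )
    prompt = (
        "Given the conversation below, rephrase the follow-up question "
        "as a standalone question.\n\n"
        f"{transcript}\n\nFollow-up: {question}\nStandalone question:"
    )
    return rewrite(prompt)


def fake_rewrite(prompt):
    """Stand-in for the LLM: returns a fixed rewrite for the demo."""
    return "What is the context length of the Llama 3.1 8B model?"


history = [("how long is the context length in Llama 3.1 405B?", "128K")]
with_history = standalone_question(fake_rewrite, "what about the 8b model?", history)
without_history = standalone_question(fake_rewrite, "what about the 8b model?", [])
print(with_history)
print(without_history)
```

With history, the retriever receives a query that mentions the context length; without it, the query is just "what about the 8b model?", which explains the unrelated passages above.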

##### Model Hallucination without RAG

In [33]:
result = llm.invoke("how long is the context length in Llama 3.1 405B?")
md(result.content)

According to the official LLaMA documentation, the context length for LLaMA 3.1-405B is 4096 tokens. This means that the model can process sequences of up to 4096 tokens (i.e., words or subwords) at a time. However, it's worth noting that the optimal sequence length may vary depending on the specific use case and task.

Without RAG, the model generates an incorrect response, and the user cannot verify the information because no sources are available.
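To make the contrast concrete: the main thing retrieval adds is that the retrieved chunks are placed into the prompt, so the model can answer from them rather than from memory. The sketch below builds such a prompt by hand. The function name `build_rag_prompt` and the prompt wording are our own assumptions, not the template LangChain uses.

```python
def build_rag_prompt(question, chunks):
    """Combine retrieved chunks and the question into a single prompt."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Illustrative chunk, shortened from the blog text retrieved earlier.
retrieved = [
    "our latest models expand context length to 128K, "
    "add support across eight languages",
]
prompt = build_rag_prompt(
    "how long is the context length in Llama 3.1 405B?", retrieved
)
print(prompt)
```

Passing a prompt like this to `llm.invoke` gives the model the 128K figure directly, which the bare call above did not have.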