# **RAG with Groq and Llama 3**

This notebook demonstrates a complete workflow for building a semantic search and retrieval-based question-answering system using ChromaDB as a persistent vector store. The key components include dataset preparation, embedding generation, vector storage, document retrieval, and answer generation.

**Workflow**:

1. **Dataset Preparation**:
    - A dataset is loaded using the `datasets` library.
    - Unnecessary columns are removed to streamline data processing.
2. **Embedding Generation**:
    - Uses the `dwzhu/e5-base-4k` model from Hugging Face, loaded through `HuggingFaceEncoder` in the `semantic_router.encoders` module, to encode the dataset into vector embeddings.
3. **Vector Storage**:
    - Embeddings are stored in a ChromaDB persistent vector database.
    - Data is added to the vector store in batches for efficiency.
4. **Document Retrieval**:
    - Implements a query function to retrieve relevant documents based on semantic similarity from ChromaDB.
    - Sanity-checks retrieval by running a sample query and inspecting the returned documents.
5. **Answer Generation**:
    - Generates answers to user queries using the Groq API, leveraging retrieved documents for contextual grounding.

### **Data Preparation**:

We start by downloading a dataset that we will encode and store. The dataset `jamescalam/ai-arxiv2-semantic-chunks` contains scraped data from many popular ArXiv papers centred around LLMs and GenAI.

In [None]:
# Install semantic-router with the local extras needed by HuggingFaceEncoder
#!pip install -qU "semantic-router[local]"

In [None]:
import datasets
import groq
import semantic_router

print("Library Versions:")
print(f"datasets: {datasets.__version__}")
print(f"groq: {getattr(groq, '__version__', 'Version attribute not found')}")
print(f"semantic_router: {getattr(semantic_router, '__version__', 'Version attribute not found')}")

Library Versions:
datasets: 3.1.0
groq: 0.13.0
semantic_router: 0.0.72


In [1]:
from datasets import load_dataset

data = load_dataset(
    "jamescalam/ai-arxiv2-semantic-chunks",
    split="train[:10000]"
)

data

Dataset({
    features: ['id', 'title', 'content', 'prechunk_id', 'postchunk_id', 'arxiv_id', 'references'],
    num_rows: 10000
})

The full dataset contains roughly 200K chunks, of which we load the first 10,000; each chunk is about 1-2 paragraphs long. Here is an example of a single record:

In [2]:
data[0]

{'id': '2401.04088#0',
 'title': 'Mixtral of Experts',
 'content': '4 2 0 2 n a J 8 ] G L . s c [ 1 v 8 8 0 4 0 . 1 0 4 2 : v i X r a # Mixtral of Experts Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, LÃ©lio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, ThÃ©ophile Gervet, Thibaut Lavril, Thomas Wang, TimothÃ©e Lacroix, William El Sayed Abstract We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts

Next, we reshape the data into the format we need: each record keeps its `id` and gains a `metadata` dict holding the `title` and `content`, which we will combine and embed later.

In [2]:
data = data.map(lambda x: {
    "id": x["id"],
    "metadata": {
        "title": x["title"],
        "content": x["content"],
    }
})

# drop unneeded columns
data = data.remove_columns([
    "title", "content", "prechunk_id",
    "postchunk_id", "arxiv_id", "references"
])

data

Dataset({
    features: ['id', 'metadata'],
    num_rows: 10000
})

In [4]:
data[0]

{'id': '2401.04088#0',
 'metadata': {'content': '4 2 0 2 n a J 8 ] G L . s c [ 1 v 8 8 0 4 0 . 1 0 4 2 : v i X r a # Mixtral of Experts Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, LÃ©lio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, ThÃ©ophile Gervet, Thibaut Lavril, Thomas Wang, TimothÃ©e Lacroix, William El Sayed Abstract We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected expe
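
To make the reshaping step concrete, here is a minimal sketch of the same transformation applied to a plain Python dict, with no `datasets` dependency. The `reshape_record` helper and the toy record values are hypothetical; they only illustrate the target structure of `id` plus nested `metadata`.

```python
def reshape_record(record: dict) -> dict:
    """Keep the id and nest title/content under a metadata dict."""
    return {
        "id": record["id"],
        "metadata": {
            "title": record["title"],
            "content": record["content"],
        },
    }


# Toy record (hypothetical values) with the same columns as the dataset
raw = {
    "id": "0000.00000#0",
    "title": "Example Paper",
    "content": "Some chunk of text.",
    "prechunk_id": "",
    "postchunk_id": "0000.00000#1",
    "arxiv_id": "0000.00000",
    "references": [],
}

reshaped = reshape_record(raw)
print(sorted(reshaped))  # ['id', 'metadata']
```

Building a fresh dict, rather than deleting keys in place, has the same effect as the `map` followed by `remove_columns` used in the notebook.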

We need an embedding model to create the vectors used for retrieval. We will use a variation of the e5-base model with a longer context length of 4k tokens. Ideally this should run on a GPU, since encoding on CPU is considerably slower.

> The `multilingual-e5-base` model supports over 100 languages, including Portuguese.

In [4]:
from semantic_router.encoders import HuggingFaceEncoder

encoder = HuggingFaceEncoder(name="dwzhu/e5-base-4k")

In [18]:
# Inspect the available methods
print(dir(encoder))

['Config', '__abstractmethods__', '__annotations__', '__call__', '__class__', '__class_vars__', '__config__', '__custom_root_type__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__exclude_fields__', '__fields__', '__fields_set__', '__format__', '__ge__', '__get_validators__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__include_fields__', '__init__', '__init_subclass__', '__iter__', '__json_encoder__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__post_root_validators__', '__pre_root_validators__', '__pretty__', '__private_attributes__', '__reduce__', '__reduce_ex__', '__repr__', '__repr_args__', '__repr_name__', '__repr_str__', '__rich_repr__', '__schema_cache__', '__setattr__', '__setstate__', '__signature__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__try_update_forward_refs__', '__validators__', '_abc_impl', '_calculate_keys', '_copy_and_set_values', '_decompose_class', '_enforce_dict_if_root', '_get_value', '_init_priva

In [None]:
print(encoder._model)  # just checking the _model attribute

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(4096, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=Fals

We can check whether our encoder will use the CPU or a CUDA GPU (where available).

In [19]:
encoder.device

'cpu'
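
If you want to decide the device yourself before building the encoder, one option is to ask PyTorch directly. The sketch below is a hypothetical helper (`pick_device` is not part of `semantic_router`), and it falls back to `"cpu"` when PyTorch is not importable.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # optional dependency in this sketch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

Whether the resulting string can be passed to `HuggingFaceEncoder` at construction time is an assumption based on the encoder exposing a `device` attribute; check the installed version before relying on it.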

We can create embeddings now like so:

In [5]:
embeds = encoder(["this is a test"])  # the class implements a __call__ method, making it directly callable to generate embeddings

We can view the dimensionality of our returned embeddings, which we'll need soon when initializing our vector index:

In [6]:
dims = len(embeds[0]) 
dims

768
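
Retrieval later relies on comparing these vectors to each other. As a rough illustration of what "semantic similarity" means numerically, here is a pure-Python cosine similarity on toy 3-dimensional vectors standing in for the 768-dimensional embeddings. The vectors are made up, and the vector store may use a different metric internally.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical values)
v1 = [1.0, 0.0, 1.0]
v2 = [2.0, 0.0, 2.0]  # same direction as v1, different length
v3 = [0.0, 1.0, 0.0]  # orthogonal to v1

print(cosine_similarity(v1, v2))  # close to 1: same direction
print(cosine_similarity(v1, v3))  # 0.0: nothing in common
```

The length check mirrors why `dims` matters: every vector in an index must share the same dimensionality.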

In [7]:
from chromadb import PersistentClient

# Define the directory for persistent storage
persist_dir = "../chromadb_persist"

# Initialize ChromaDB client
db = PersistentClient(path=persist_dir)

In [8]:
encoder

HuggingFaceEncoder(name='dwzhu/e5-base-4k', score_threshold=0.5, type='huggingface', tokenizer_kwargs={}, model_kwargs={}, device='cpu')

In [None]:
#Check if the collection already exists
existing_collections = [collection.name for collection in db.list_collections()]

# Delete the collection
#db.delete_collection(name=existing_collections[0])

In [23]:
from tqdm.auto import tqdm

# Create the collection in ChromaDB, or retrieve it if it already exists
collection_name = "groq_llama_3_rag"
collection = db.get_or_create_collection(name=collection_name)

# Populate the collection with embeddings
batch_size = 128  # How many embeddings to insert at once

for i in tqdm(range(0, len(data), batch_size)):
    # Find end of batch
    i_end = min(len(data), i + batch_size)
    # Create batch
    batch = data[i:i_end]

    # Extract metadata (ensure correct length)
    batch_metadata = batch['metadata']  # This should have the same number of entries as the batch size

    # Use the 'id' from the data for the current batch
    ids = batch['id']  # Assuming 'id' is a field in your data that is already unique and correct

    # Check that batch size matches metadata and IDs length
    assert len(batch_metadata) == len(ids), f"Batch size mismatch: {len(batch_metadata)} metadata vs {len(ids)} IDs"
 
    # Generate embeddings from content
    chunks = [f'{x["title"]}: {x["content"]}' for x in batch_metadata]
    embeds = encoder(chunks)  # Directly using encoder
    
    # Check if embedding length matches the expected size
    assert len(embeds) == len(ids), f"Mismatch between number of embeddings ({len(embeds)}) and IDs ({len(ids)})"

    # Insert embeddings, metadata, and IDs into ChromaDB
    collection.add(embeddings=embeds, metadatas=batch_metadata, ids=ids)

  0%|          | 0/79 [00:00<?, ?it/s]
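
The progress bar reports 79 iterations because 10,000 records split into batches of 128 leave a final, smaller batch. The sketch below reproduces just the index arithmetic of the loop; `batch_ranges` is a hypothetical helper name.

```python
import math


def batch_ranges(n_items: int, batch_size: int):
    """Yield (start, end) index pairs covering range(n_items) in order."""
    for start in range(0, n_items, batch_size):
        yield start, min(n_items, start + batch_size)


ranges = list(batch_ranges(10_000, 128))
print(len(ranges))               # 79
print(math.ceil(10_000 / 128))   # 79
print(ranges[-1])                # (9984, 10000): the last batch holds 16 items
```

Clamping the end index with `min` is what keeps the last slice from running past the end of the data.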

In [25]:
batch['metadata'][:5]

[{'content': '4 . So the answer is 9Ï 4 . Are there variables in the solution? the form of "1. variable is defined as...". If so, please list the definition of variable in The underlined parts are the type of question, the question itself and the steps in its solution, respectively. The output from the LLM is: Yes. There are variables in the solution. x + yi, where xxx and yyy are real numbers. x + yi 1. zzz is defined as a complex number of the form x + yi',
  'title': 'SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning'},
 {'content': 'The bold part is then saved to form a part of the input in the regeneration stage. Target extraction To get a brief and clear target of the current step, the input to the LLM is: The following is a part of the solution to the problem: Let S be the set of complex numbers z such that the real part of 1 6 . This set forms a curve. Find the area of the 12 region inside the curve. (Step 0) Let z = x + yi be a complex number, where x a

Now let's test retrieval:

In [46]:
def get_docs(query: str, top_k: int, collection) -> list[str]:
    """
    Retrieve documents from ChromaDB based on a query.

    Parameters:
        query (str): The input query string.
        top_k (int): The number of top results to retrieve.
        collection: The ChromaDB collection to query.

    Returns:
        list[str]: A list of document content strings.
    """
    # Encode the query into an embedding
    xq = encoder([query])
    
    # Query the ChromaDB collection
    results = collection.query(
        query_embeddings=xq,
        n_results=top_k,
        include=["metadatas", "distances"]  
    )

    #print(results["distances"])  # Check similarity scores
    #print(results["metadatas"]) # it's a nested list of dicts
    
    # Extract document content from nested metadata structure
    docs = [metadata.get("content", "") for sublist in results["metadatas"] for metadata in sublist]
    
    return docs
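
The double loop in `get_docs` is needed because query results are nested: there is one inner list per query embedding. The mocked response below (hypothetical values, no database involved) shows the shape being flattened and how a missing `content` key falls back to an empty string.

```python
# Hypothetical response for ONE query with n_results=2, mirroring the
# nested layout returned by collection.query().
results = {
    "ids": [["doc-a", "doc-b"]],
    "distances": [[0.12, 0.34]],
    "metadatas": [[
        {"title": "Paper A", "content": "First chunk."},
        {"title": "Paper B"},  # no "content" key, so .get() returns ""
    ]],
}

docs = [
    metadata.get("content", "")
    for per_query in results["metadatas"]
    for metadata in per_query
]
print(docs)  # ['First chunk.', '']
```

With a single query you could also index `results["metadatas"][0]` directly; the flattening form keeps working if several queries are ever sent at once.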

In [None]:
collection = db.get_collection("groq_llama_3_rag")

# Query ChromaDB
query = "can you tell me about the Llama LLMs?"
top_k = 5
documents = get_docs(query, top_k, collection)

print(documents)



In [45]:
len(documents)

5
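
Conceptually, a top-k query ranks stored vectors by their distance to the query vector and keeps the closest `k`. The brute-force sketch below illustrates that idea on a tiny in-memory index. Squared Euclidean distance is an illustrative choice here, and a real vector store uses an approximate index rather than a full scan; all names and values are hypothetical.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec: list[float], index: dict, k: int) -> list[tuple]:
    """Return the k (doc_id, distance) pairs closest to query_vec."""
    scored = [(doc_id, squared_l2(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy 2-dimensional index (hypothetical values)
index = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [3.0, 4.0]}
hits = top_k([0.9, 0.1], index, k=2)
print([doc_id for doc_id, _ in hits])  # ['b', 'a']
```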

Our retrieval component works. Now let's feed the retrieved documents into a Llama 3 70B model hosted by Groq to produce an answer.

In [49]:
import os
from dotenv import load_dotenv
from groq import Groq

# Initialize the Groq client
load_dotenv()
groq_api_key = os.getenv("GROQ_API_KEY")
groq_client = Groq(api_key=groq_api_key)
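
`os.getenv` returns `None` when the variable is missing, which would only surface later as an authentication error. A small guard can fail earlier with a clearer message. The `require_env` helper below is a hypothetical addition, not part of the Groq SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; export it or add it to your .env file"
        )
    return value
```

In the cell above, this would be used as `groq_api_key = require_env("GROQ_API_KEY")` after `load_dotenv()`.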

In [None]:
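def generate(query: str, docs: list[str]):
    # Build the system prompt: instructions first, then the retrieved
    # documents separated by "---". The explicit "+" matters: without it,
    # the adjacent literals merge and become the separator passed to join().
    system_message = (
        "You are a helpful assistant that answers questions about AI using the "
        "context provided below.\n\n"
        "CONTEXT:\n"
        + "\n---\n".join(docs)
    )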
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": query}
    ]
    # generate response
    chat_response = groq_client.chat.completions.create(
        model="llama3-70b-8192",
        messages=messages
    )
    return chat_response.choices[0].message.content
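
Prompt assembly is easy to get subtly wrong, because Python concatenates adjacent string literals before applying a method call: `.join` written directly after a run of literals uses the whole run as the separator. The sketch below isolates prompt construction in a hypothetical `build_system_message` helper so it can be checked without calling the API.

```python
def build_system_message(docs: list[str]) -> str:
    """Emit the instructions once, then the documents joined by a separator."""
    header = (
        "You are a helpful assistant that answers questions about AI using the "
        "context provided below.\n\n"
        "CONTEXT:\n"
    )
    return header + "\n---\n".join(docs)


message = build_system_message(["doc one", "doc two"])
print(message)
```

Keeping the header in its own variable makes it clear that the instructions appear exactly once, regardless of how many documents are retrieved.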

In [53]:
out = generate(query=query, docs=documents)
print(out)

In the context provided, LLaMA refers to a family of Large Language Models (LLMs) developed by Meta AI. There are several variants of LLaMA mentioned, including:

1. LLaMA-2-7B: An updated version of LLaMA, pre-trained on a mixture of publicly available online data of 2 trillion tokens.
2. LLaMA-2-13B: A larger version of LLaMA-2-7B, with a larger model size.
3. LLaMA-13B: Another variant of LLaMA with a larger model size.
4. LLaMA-33B: A larger version of LLaMA with an even larger model size.
5. LLaMA-65B: The largest variant of LLaMA mentioned, with a model size of 65 billion parameters.
6. LLaMA-2-70B: Another variant of LLaMA with a model size of 70 billion parameters.

These LLaMA models are compared to other LLMs, such as GPT-3.5 and GPT-4, in terms of their performance on various benchmarks and tasks. The results suggest that stronger LLMs, like LLaMA-65B, can boost the performance of Multimodal Models (MMMs) and achieve better agreements with human evaluation scores.


In [54]:
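another_query = 'Tell me about LLM and finetuning'
# Note: this reuses the documents retrieved for the first query. Call
# get_docs(another_query, top_k, collection) first to ground the answer
# in context retrieved for this question.
out = generate(query=another_query, docs=documents)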
print(out)

Based on the context provided, here's what I can tell you about LLM (Large Language Models) and finetuning:

**LLMs:**

* LLMs are large-scale language models that have been pre-trained on vast amounts of text data.
* They can be fine-tuned for specific tasks, such as evaluation, summarization, and understanding.
* Examples of LLMs mentioned in the context include GPT-3.5, GPT-4, LLaMA-2-7B, LLaMA-2-13B, and others.

**Finetuning:**

* Finetuning refers to the process of adapting a pre-trained LLM to a specific task or dataset.
* During finetuning, the model is trained on a smaller dataset specific to the task, which adjusts the model's weights to better fit the task at hand.
* Finetuning can significantly improve the performance of LLMs on specific tasks, as seen in the context where using training sets can greatly boost the evaluation results on specific benchmarks.
* However, over-emphasizing a specific task through finetuning can lead to decreased performance on other tasks, as obs