# RAG Evaluation
_Authored by: [Aymeric Roucher](https://huggingface.co/m-ric)_

This notebook demonstrates how you can evaluate your RAG (Retrieval Augmented Generation), by building a synthetic evaluation dataset and using LLM-as-a-judge to compute the accuracy of your system.

For an introduction to RAG, you can check [this other cookbook](rag_zephyr_langchain)!

RAG systems are complex: here a RAG diagram, where we noted in blue all possibilities for system enhancement:

<img src="https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/RAG_workflow.png" height="700">

Implementing any of these improvements can bring a huge performance boost; but changing anything is useless if you cannot monitor the impact of your changes on the system's performance!
So let's see how to evaluate our RAG system.

### Evaluating RAG performance

Since there are so many moving parts to tune with a big impact on performance, benchmarking the RAG system is crucial.

For our evaluation pipeline, we will need:
1. An evaluation dataset with question - answer couples (QA couples)
2. An evaluator to compute the accuracy of our system on the above evaluation dataset.

➡️ It turns out, we can use LLMs to help us all along the way!
1. The evaluation dataset will be synthetically generated by an LLM 🤖, and questions will be filtered out by other LLMs 🤖
2. An [LLM-as-a-judge](https://huggingface.co/papers/2306.05685) agent 🤖 will then perform the evaluation on this synthetic dataset.

__Let's dig into it and start building our evaluation pipeline!__ First, we install the required model dependancies.

In [None]:
#!pip install -q torch transformers transformers langchain sentence-transformers tqdm openpyxl openai pandas datasets

In [None]:
# %reload_ext autoreload
# %autoreload 2

In [1]:
from tqdm.auto import tqdm
import pandas as pd
from typing import Optional, List, Tuple
import json
import datasets

pd.set_option("display.max_colwidth", None)

In [2]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

### Load your knowledge base

In [61]:
ds = datasets.load_dataset("m-ric/huggingface_doc", split="train")


# 1. Build a synthetic dataset for evaluation
We first build a synthetic dataset of questions and associated contexts. The method is to get elements from our knowledge base, and ask an LLM to generate questions based on these documents.

Then we setup other LLM agents to act as quality filters for the generated QA couples: each of them will act as the filter for a specific flaw.

### 1.1. Prepare source documents

In [3]:
import os
import PyPDF2
from PyPDF2 import PdfReader
from pathlib import Path
PDFS_PATH = Path('/home/mainuser/Desktop/LLMs/RagOverArXiv/temp')
PDFS = list(PDFS_PATH.glob('*.pdf'))
PDFS[0], len(PDFS)

reader = PdfReader(os.path.expanduser(PDFS[0]))
pages = reader.pages
documents = []
for page in pages:
  documents.append(page.extract_text())


def load_pdf_to_string(pdf_path):
    # Open the PDF file in binary mode
    with open(pdf_path, 'rb') as file:
        # Create a PDF file reader object
        pdf_reader = PyPDF2.PdfReader(file)

        # Initialize an empty string to hold the text
        text = ''

        # Loop through each page and extract the text
        for page_num in range(len(pdf_reader.pages)):
            page = pdf_reader.pages[page_num]
            page_text = page.extract_text()
            references_index= page_text.upper().find('\nREFERENCES\n')
            if references_index != -1:
              page_text = page_text[:references_index]
              text += page_text
              return text
            text += page_text
    return text

# Use the function to load a PDF into a string
text = load_pdf_to_string(os.path.expanduser(PDFS[0]))
def get_title(pdf_path): return os.path.expanduser(pdf_path).split('/')[-1]

all_docs_and_titles = [(load_pdf_to_string(os.path.expanduser(pdf_path)),get_title(pdf_path)) for pdf_path in PDFS]

all_docs = [doc[0] for doc in all_docs_and_titles]
all_titles = [doc[1] for doc in all_docs_and_titles]

from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter, TokenTextSplitter, RecursiveCharacterTextSplitter
from langchain.docstore.document import Document 

CHUNK_SIZE = 1000 #try 2000 next
CHUNK_OVERLAP = 30 #try 200 next

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap = CHUNK_OVERLAP,
    length_function=len,
)

docs_processed  = [text_splitter.split_documents([Document(page_content=doc, metadata={'source':all_titles[idx]})]) 
         for idx,doc in enumerate(all_docs)]

# text_splitter = RecursiveCharacterTextSplitter(
#     chunk_size=2000,
#     chunk_overlap=200,
#     add_start_index=True,
#     #separators=["\n\n", "\n", ".", " ", ""],
# )

# docs_processed = []
# for idx,doc in enumerate(all_docs):
#     docs_processed += text_splitter.split_documents([Document(page_content=doc, metadata={'source':all_titles[idx]})])


In [5]:
docs_processed = [txt for doc in docs_processed for txt in doc]

In [6]:
len(docs_processed)

67

In [65]:
docs_processed[0]

Document(page_content='FlashAttention : Fast and Memory-Eﬃcient Exact Attention\nwith IO-Awareness\nTri Daoy, Daniel Y. Fuy, Stefano Ermony, Atri Rudraz, and Christopher Réy\nyDepartment of Computer Science, Stanford University\nzDepartment of Computer Science and Engineering, University at Buﬀalo, SUNY\n{trid,danfu}@cs.stanford.edu ,ermon@stanford.edu ,atri@buffalo.edu ,\nchrismre@cs.stanford.edu\nJune 24, 2022\nAbstract\nTransformers are slow and memory-hungry on long sequences, since the time and memory complexity\nof self-attention are quadratic in sequence length. Approximate attention methods have attempted\nto address this problem by trading oﬀ model quality to reduce the compute complexity, but often do\nnot achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-\naware—accounting for reads and writes between levels of GPU memory. We propose FlashAttention ,\nan IO-aware exact attention algorithm that uses tiling to reduce the number of 

In [66]:
# from langchain.text_splitter import RecursiveCharacterTextSplitter
# from langchain.docstore.document import Document as LangchainDocument

# langchain_docs = [
#     LangchainDocument(page_content=doc["text"], metadata={"source": doc["source"]})
#     for doc in tqdm(ds)
# ]


# text_splitter = RecursiveCharacterTextSplitter(
#     chunk_size=2000,
#     chunk_overlap=200,
#     add_start_index=True,
#     separators=["\n\n", "\n", ".", " ", ""],
# )

# docs_processed = []
# for doc in langchain_docs:
#     docs_processed += text_splitter.split_documents([doc])

### 1.2. Setup agents for question generation

We use [Mixtral](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) for QA couple generation because it it has excellent performance in leaderboards such as [Chatbot Arena](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard).

In [67]:
# from huggingface_hub import InferenceClient


# repo_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# llm_client = InferenceClient(
#     model=repo_id,
#     timeout=120,
# )


# def call_llm(inference_client: InferenceClient, prompt: str):
#     response = inference_client.post(
#         json={
#             "inputs": prompt,
#             "parameters": {"max_new_tokens": 1000},
#             "task": "text-generation",
#         },
#     )
#     return json.loads(response.decode())[0]["generated_text"]


# call_llm(llm_client, "This is a test context")

- Tried Mixtral4Bit, visually perhaps a bit better, but overfits on 'deep question'

In [7]:
from exllamav2 import *
from exllamav2.generator import *
import sys, torch


generator_config = ExLlamaV2Config()
generator_config.model_dir = "/home/mainuser/Desktop/LLMs/MiStralInference"
#generator_config.model_dir = '/home/mainuser/Desktop/LLMs/Mixtral4bit'
generator_config.prepare()

generator_model = ExLlamaV2(generator_config)
cache = ExLlamaV2Cache(generator_model, lazy = True)

print("Loading model...")
generator_model.load_autosplit(cache)

generator_tokenizer = ExLlamaV2Tokenizer(generator_config)
generator_llm = ExLlamaV2StreamingGenerator(generator_model, cache, generator_tokenizer)
generator_llm.set_stop_conditions([generator_tokenizer.eos_token_id])
generator_settings = ExLlamaV2Sampler.Settings()
generator_settings.temperature = 0.85
generator_settings.top_k = 50
generator_settings.top_p = 0.8
generator_settings.token_repetition_penalty = 1.01
#generator_settings.disallow_tokens(generator_tokenizer, [generator_tokenizer.eos_token_id])
# see if commenting out the above solved the endless generation issue (did not have with stream generator)

Loading model...


In [69]:
#help(generator_llm)

In [70]:
#help(generator_tokenizer)

In [71]:
#help(generator_llm)

In [72]:
#help(generator_model)

In [73]:
# from transformers import Pipeline
# from ragatouille import RAGPretrainedModel
# from typing import Optional, List, Tuple
# from langchain.docstore.document import Document
# import time
# # RAG_PROMPT_TEMPLATE = tokenizer.apply_chat_template(

# #     prompt_in_chat_format, tokenize=False, add_generation_prompt=True

# # )
# RERANKER = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
# from langchain.docstore.document import Document as LangchainDocument
# def call_llm(
#     question: str,
#     generator_model: ExLlamaV2,
#     generator_llm: ExLlamaV2StreamingGenerator,
#     tokenizer: ExLlamaV2Tokenizer,
#     settings:ExLlamaV2Sampler.Settings,
#     max_new_tokens = 512
# ) -> Tuple[str, List[LangchainDocument]]:


#     instruction_ids = tokenizer.encode(f"<s>[INST] {question} [/INST]", add_bos = True)
#     context_ids = instruction_ids if generator_llm.sequence_ids is None \
#             else torch.cat([generator_llm.sequence_ids, instruction_ids], dim = -1)

#     max_new_tokens = max_new_tokens

#     generator_llm.warmup()
#     time_begin = time.time()

#     output = generator_model.forward(f"<s>[INST] {question} [/INST]", settings, max_new_tokens)#, seed = 1234)#,add_eos=True)
    
#     time_end = time.time()
#     time_total = time_end - time_begin

#     print(output)
#     print()
#     print(f"Response generated in {time_total:.2f} seconds, {max_new_tokens} tokens, {max_new_tokens / time_total:.2f} tokens/second")
#     return output
# #     generator.begin_stream(context_ids, settings)

# # #    return generator.generate_simple(context_ids, settings,num_tokens=512)

# #     while True:
# #         chunk, eos, _ = generator.stream()
# #         if eos: break
# #         print(chunk, end = "")
# #         sys.stdout.flush()
# #     #####




# # def call_llm(inference_client: InferenceClient, prompt: str):
# #     response = inference_client.post(
# #         json={
# #             "inputs": prompt,
# #             "parameters": {"max_new_tokens": 1000},
# #             "task": "text-generation",
# #         },
# #     )
# #     return json.loads(response.decode())[0]["generated_text"]


# call_llm(question="How can I get my cat to like me?", generator_model=generator_model,generator_llm=generator_llm,tokenizer=generator_tokenizer,settings=generator_settings)

In [8]:
#Working except eos
from transformers import Pipeline
from ragatouille import RAGPretrainedModel
from typing import Optional, List, Tuple
from langchain.docstore.document import Document
import time
# RAG_PROMPT_TEMPLATE = tokenizer.apply_chat_template(

#     prompt_in_chat_format, tokenize=False, add_generation_prompt=True

# )
RERANKER = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
from langchain.docstore.document import Document as LangchainDocument
def call_llm(
    question: str,
    generator: ExLlamaV2StreamingGenerator,
    tokenizer: ExLlamaV2Tokenizer,
    settings:ExLlamaV2Sampler.Settings,
    max_new_tokens = 512
) -> Tuple[str, List[LangchainDocument]]:


    # instruction_ids = tokenizer.encode(f"<s>[INST] {question} [/INST]", add_bos = True)
    # context_ids = instruction_ids if generator.sequence_ids is None \
    #         else torch.cat([generator.sequence_ids, instruction_ids], dim = -1)

    max_new_tokens = max_new_tokens

    generator.warmup()
    #time_begin = time.time()
    output = generator.generate_simple(f"<s>[INST] {question} [/INST]", settings, max_new_tokens, seed = 1234)

    #output = generator.generate_simple(f"<s>[INST] {question} [/INST]", settings, max_new_tokens, seed = 1234)
    #output = generator.generate_simple(question, settings, max_new_tokens, seed = 1234)

    # time_end = time.time()
    # time_total = time_end - time_begin

    print(output)
    print()
    #print(f"Response generated in {time_total:.2f} seconds, {max_new_tokens} tokens, {max_new_tokens / time_total:.2f} tokens/second")
    return output
#     generator.begin_stream(context_ids, settings)

# #    return generator.generate_simple(context_ids, settings,num_tokens=512)

#     while True:
#         chunk, eos, _ = generator.stream()
#         if eos: break
#         print(chunk, end = "")
#         sys.stdout.flush()
#     #####




# def call_llm(inference_client: InferenceClient, prompt: str):
#     response = inference_client.post(
#         json={
#             "inputs": prompt,
#             "parameters": {"max_new_tokens": 1000},
#             "task": "text-generation",
#         },
#     )
#     return json.loads(response.decode())[0]["generated_text"]


call_llm(question="How can I get my cat to like me?", generator=generator_llm,tokenizer=generator_tokenizer,settings=generator_settings,max_new_tokens=1024)

<s>[INST] How can I get my cat to like me? [/INST] 1. Spend time with your cat: Cats enjoy spending time with their owners, so take some time to play with them, cuddle with them, and give them attention.
2. Provide food and shelter: Make sure your cat has a comfortable place to sleep, eat, and drink water.
3. Use positive reinforcement: Reward your cat with treats, praise, and affection when they behave well around you.
4. Be patient: Cats can take time to warm up to new people, so be patient and keep trying to build a relationship with your cat.
5. Show affection: Cats are social animals and enjoy physical touch, so try petting them, stroking their fur, and giving them gentle scratches behind their ears.
6. Provide toys: Cats love to play, so provide them with toys that they can interact with and enjoy.
7. Be consistent: Consistency is key when it comes to building a relationship with your cat. Make sure you are consistently spending time with them and providing them with everything t

'<s>[INST] How can I get my cat to like me? [/INST] 1. Spend time with your cat: Cats enjoy spending time with their owners, so take some time to play with them, cuddle with them, and give them attention.\n2. Provide food and shelter: Make sure your cat has a comfortable place to sleep, eat, and drink water.\n3. Use positive reinforcement: Reward your cat with treats, praise, and affection when they behave well around you.\n4. Be patient: Cats can take time to warm up to new people, so be patient and keep trying to build a relationship with your cat.\n5. Show affection: Cats are social animals and enjoy physical touch, so try petting them, stroking their fur, and giving them gentle scratches behind their ears.\n6. Provide toys: Cats love to play, so provide them with toys that they can interact with and enjoy.\n7. Be consistent: Consistency is key when it comes to building a relationship with your cat. Make sure you are consistently spending time with them and providing them with every

In [75]:
# QA_generation_prompt = """
# Your task is to write a factoid question and an answer given a context.
# Your factoid question should be answerable with a specific, concise piece of factual information from the context.
# Your factoid question should be formulated in the same style as questions users could ask in a search engine.
# This means that your factoid question MUST NOT mention something like "according to the passage" or "context".

# Provide your answer as follows:

# Output:::
# Factoid question: (your factoid question)
# Answer: (your answer to the factoid question)

# Now here is the context.

# Context: {context}\n
# Output:::"""

In [10]:
QA_generation_prompt = """
Your task is to write a deep factual or conceptual question and an answer given a context.
Your deep question should be unambigiously answerable from the context.
Your deep question should be formulated in the same style as questions people reading advanced LLM papers would ask.
This means that your question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Deep question: (your deep question)
Answer: (your answer to the deep question)

Now here is the context.

Context: {context}\n
Output:::"""

Now let's generate our QA couples.
For this example, we generate only 10 QA couples and will load the rest from the Hub.

But for your specific knowledge base, given that you want to get at least ~100 test samples, and accounting for the fact that we will filter out around half of these with our critique agents later on, you should generate much more, in the >200 samples.

In [11]:
import random
from tqdm import tqdm
N_GENERATIONS = 10  # We intentionally generate only 10 QA couples here for cost and time considerations

print(f"Generating {N_GENERATIONS} QA couples...")

outputs = []
for sampled_context in tqdm(random.sample(docs_processed, N_GENERATIONS)):
    # Generate QA couple
    # output_QA_couple = call_llm(
    #     llm_client, QA_generation_prompt.format(context=sampled_context.page_content)
    # )
    output_QA_couple = call_llm(question=QA_generation_prompt.format(context=sampled_context.page_content), generator=generator_llm,tokenizer=generator_tokenizer,settings=generator_settings,
                                max_new_tokens=1024)
    try:
        question = output_QA_couple.split("Deep question: ")[-1].split("Answer: ")[0]
        answer = output_QA_couple.split("Answer: ")[-1]
        #assert len(answer) < 300, "Answer is too long"
        outputs.append(
            {
                "context": sampled_context.page_content,
                "question": question,
                "answer": answer,
                "source_doc": sampled_context.metadata["source"],
            }
        )
    except:
        continue

Generating 10 QA couples...


 10%|█         | 1/10 [00:01<00:10,  1.20s/it]

<s>[INST] 
Your task is to write a deep factual or conceptual question and an answer given a context.
Your deep question should be unambigiously answerable from the context.
Your deep question should be formulated in the same style as questions people reading advanced LLM papers would ask.
This means that your question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Deep question: (your deep question)
Answer: (your answer to the deep question)

Now here is the context.

Context: algorithm in a considerably lower-level language than PyTorch, and requires signiﬁcant engineering eﬀort.
Implementations may also not be transferrable across GPU architectures. These limitations suggest the
need for a method that supports writing attention algorithms in a high-level language (e.g., PyTorch), and
compiling to IO-aware implementations in CUDA—similar to eﬀorts such as Halide in image processing [ 70].
IO-Aware Deep Learning. We

 20%|██        | 2/10 [00:02<00:10,  1.28s/it]

<s>[INST] 
Your task is to write a deep factual or conceptual question and an answer given a context.
Your deep question should be unambigiously answerable from the context.
Your deep question should be formulated in the same style as questions people reading advanced LLM papers would ask.
This means that your question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Deep question: (your deep question)
Answer: (your answer to the deep question)

Now here is the context.

Context: cations. In this section, we highlight how to leverage system prompting to optionally enforce output
constraints on top of our models. Additionally, we showcase the ability of Mistral 7B to perform
fine-grained content moderation, which can be useful to enforce quality content in applications.
5.1 System prompt to enforce guardrails
We introduce a system prompt (see below) to guide the model to generate answers within specified
guardrails, sim

 30%|███       | 3/10 [00:03<00:09,  1.36s/it]

<s>[INST] 
Your task is to write a deep factual or conceptual question and an answer given a context.
Your deep question should be unambigiously answerable from the context.
Your deep question should be formulated in the same style as questions people reading advanced LLM papers would ask.
This means that your question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Deep question: (your deep question)
Answer: (your answer to the deep question)

Now here is the context.

Context: generation. Furthermore, Mistral 7B approaches the coding performance of Code-Llama 7B [ 20],
without sacrificing performance on non-code related benchmarks.
Mistral 7B leverages grouped-query attention (GQA) [ 1], and sliding window attention (SWA) [ 6,3].
GQA significantly accelerates the inference speed, and also reduces the memory requirement during
decoding, allowing for higher batch sizes hence higher throughput, a crucial factor for rea

 40%|████      | 4/10 [00:13<00:27,  4.61s/it]

<s>[INST] 
Your task is to write a deep factual or conceptual question and an answer given a context.
Your deep question should be unambigiously answerable from the context.
Your deep question should be formulated in the same style as questions people reading advanced LLM papers would ask.
This means that your question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Deep question: (your deep question)
Answer: (your answer to the deep question)

Now here is the context.

Context: 𝑁
𝐵𝑐m
blocks
K1K𝑇𝑐andV1V𝑇𝑐, of size𝐵𝑐𝑑each.
4:Divide Ointo𝑇𝑟blocks O𝑖O𝑇𝑟of size𝐵𝑟𝑑each, divide ℓinto𝑇𝑟blocksℓ𝑖ℓ𝑇𝑟of size𝐵𝑟each,
divide𝑚into𝑇𝑟blocks𝑚1𝑚𝑇𝑟of size𝐵𝑟each.
5:for1𝑗𝑇𝑐do
6:Load K𝑗V𝑗from HBM to on-chip SRAM.
7:for1𝑖𝑇𝑟do
8:Load Q𝑖O𝑖ℓ𝑖𝑚𝑖from HBM to on-chip SRAM.
9:On chip, compute S𝑖𝑗=Q𝑖K𝑇
𝑗2R𝐵𝑟𝐵𝑐.
10:On chip, compute ~𝑚𝑖𝑗=rowmax¹S𝑖𝑗º 2R𝐵𝑟,~P𝑖𝑗=exp¹S𝑖𝑗 ~𝑚𝑖𝑗º 2R𝐵𝑟𝐵𝑐(pointwise), ~ℓ𝑖𝑗=
rowsum¹~P𝑖𝑗º2R𝐵𝑟.


 50%|█████     | 5/10 [00:14<00:16,  3.31s/it]

<s>[INST] 
Your task is to write a deep factual or conceptual question and an answer given a context.
Your deep question should be unambigiously answerable from the context.
Your deep question should be formulated in the same style as questions people reading advanced LLM papers would ask.
This means that your question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Deep question: (your deep question)
Answer: (your answer to the deep question)

Now here is the context.

Context: improves performance on the MIMIC-III [ 47] and ECtHR [ 6,7] datasets. MIMIC-III contains intensive care
unit patient discharge summaries, each annotated with multiple labels. ECtHR contains legal cases from the
3LRA accuracy results are known to be highly dependent on the tuning procedure [ 90]. Our reproduced baselines perform
better than as reported in the original comparison [80].
8Attention Memory Usage
Sequence LengthAttention Runtime (F

 60%|██████    | 6/10 [00:15<00:09,  2.41s/it]

<s>[INST] 
Your task is to write a deep factual or conceptual question and an answer given a context.
Your deep question should be unambigiously answerable from the context.
Your deep question should be formulated in the same style as questions people reading advanced LLM papers would ask.
This means that your question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Deep question: (your deep question)
Answer: (your answer to the deep question)

Now here is the context.

Context: guage model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium
on Operating Systems Principles , 2023.
[18] Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano,
Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, and Daniel Haziza.
xformers: A modular and hackable transformer modelling library. https://github.com/
facebookresearch/xformers , 2022.
[19] Todor Mih

 70%|███████   | 7/10 [00:15<00:05,  1.86s/it]

<s>[INST] 
Your task is to write a deep factual or conceptual question and an answer given a context.
Your deep question should be unambigiously answerable from the context.
Your deep question should be formulated in the same style as questions people reading advanced LLM papers would ask.
This means that your question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Deep question: (your deep question)
Answer: (your answer to the deep question)

Now here is the context.

Context: our chunk size. For each chunk, we thus need to compute the attention over the cache and over the
chunk. Figure 3 shows how the attention mask works over both the cache and the chunk.
godog0000100000thetoThecatsatonthe
1matand111sawthe1000doggoto
100000110000000011100000011110PastCacheCurrent
Figure 3: Pre-fill and chunking. During pre-fill of the cache, long sequences are chunked to limit memory
usage. We process a sequence in three chunks, “

 80%|████████  | 8/10 [00:17<00:03,  1.72s/it]

<s>[INST] 
Your task is to write a deep factual or conceptual question and an answer given a context.
Your deep question should be unambigiously answerable from the context.
Your deep question should be formulated in the same style as questions people reading advanced LLM papers would ask.
This means that your question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Deep question: (your deep question)
Answer: (your answer to the deep question)

Now here is the context.

Context: Mistral 7B
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford,
Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel,
Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux,
Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix,
William El Sayed
Abstract
We introduce Mistral 7B, a 7–billion-parameter language model engineered for
superior performance and efficie

 90%|█████████ | 9/10 [00:26<00:04,  4.13s/it]

<s>[INST] 
Your task is to write a deep factual or conceptual question and an answer given a context.
Your deep question should be unambigiously answerable from the context.
Your deep question should be formulated in the same style as questions people reading advanced LLM papers would ask.
This means that your question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Deep question: (your deep question)
Answer: (your answer to the deep question)

Now here is the context.

Context: We apply two established techniques (tiling, recomputation) to overcome the technical challenge of
computing exact attention in sub-quadratic HBM accesses. We describe this in Algorithm 1. The main idea
is that we split the inputs QKVinto blocks, load them from slow HBM to fast SRAM, then compute the
attention output with respect to those blocks. By scaling the output of each block by the right normalization
factor before adding them up, we 

100%|██████████| 10/10 [00:27<00:00,  2.80s/it]

<s>[INST] 
Your task is to write a deep factual or conceptual question and an answer given a context.
Your deep question should be unambigiously answerable from the context.
Your deep question should be formulated in the same style as questions people reading advanced LLM papers would ask.
This means that your question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Deep question: (your deep question)
Answer: (your answer to the deep question)

Now here is the context.

Context: or Azure using the vLLM [ 17] inference server and SkyPilot2. Integration with Hugging Face3is
also streamlined for easier integration. Moreover, Mistral 7B is crafted for ease of fine-tuning across
a myriad of tasks. As a demonstration of its adaptability and superior performance, we present a chat
model fine-tuned from Mistral 7B that significantly outperforms the Llama 2 13B – Chat model.
Mistral 7B takes a significant step in balancing the




In [12]:
outputs

[{'context': 'algorithm in a considerably lower-level language than PyTorch, and requires signiﬁcant engineering eﬀort.\nImplementations may also not be transferrable across GPU architectures. These limitations suggest the\nneed for a method that supports writing attention algorithms in a high-level language (e.g., PyTorch), and\ncompiling to IO-aware implementations in CUDA—similar to eﬀorts such as Halide in image processing [ 70].\nIO-Aware Deep Learning. We believe that the IO-aware approach can extend beyond attention.\nAttention is the most memory-intensive computation in Transformers, but every layer in a deep network\ntouches GPU HBM. We hope our work inspires IO-aware implementations of additional modules. We discuss\nthese potential extensions in Appendix D.\nMulti-GPU IO-Aware Methods. Our IO-aware implementation of attention is optimal within con-\nstants for computing attention on a single GPU. However, the attention computation may be parallelizable',
  'question': 'Can a

In [14]:
import pandas as pd
pd.set_option('display.max_colwidth',800)
display(pd.DataFrame(outputs).head(10))

Unnamed: 0,context,question,answer,source_doc
0,"algorithm in a considerably lower-level language than PyTorch, and requires signiﬁcant engineering eﬀort.\nImplementations may also not be transferrable across GPU architectures. These limitations suggest the\nneed for a method that supports writing attention algorithms in a high-level language (e.g., PyTorch), and\ncompiling to IO-aware implementations in CUDA—similar to eﬀorts such as Halide in image processing [ 70].\nIO-Aware Deep Learning. We believe that the IO-aware approach can extend beyond attention.\nAttention is the most memory-intensive computation in Transformers, but every layer in a deep network\ntouches GPU HBM. We hope our work inspires IO-aware implementations of additional modules. We discuss\nthese potential extensions in Appendix D.\nMulti-GPU IO-Aware Methods. Ou...",Can a high-level language such as PyTorch be used to write attention algorithms with IO-aware implementations in CUDA?\n\n,"Yes, a high-level language such as PyTorch can be used to write attention algorithms with IO-aware implementations in CUDA. This is because the IO-aware approach can extend beyond attention and can be applied to other modules in a deep network. The IO-aware implementation of attention is optimal within constants for computing attention on a single GPU, but it may not be parallelizable across multiple GPUs.",FlashAttention_ Fast and Memory-Efficient Exact Attention with IO-Awareness.pdf
1,"cations. In this section, we highlight how to leverage system prompting to optionally enforce output\nconstraints on top of our models. Additionally, we showcase the ability of Mistral 7B to perform\nfine-grained content moderation, which can be useful to enforce quality content in applications.\n5.1 System prompt to enforce guardrails\nWe introduce a system prompt (see below) to guide the model to generate answers within specified\nguardrails, similar to the work done with Llama 2. Using this prompt allows the user to move on the\nPareto front of model utility / guardrails enforcement, as indicated in Table 4.\nAlways assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful,\nunethical, prejudiced, or negative content. Ensure replies promote fairne...",Can Mistral 7B perform fine-grained content moderation using a system prompt to enforce guardrails?\n\n,"Yes, according to the context, Mistral 7B can perform fine-grained content moderation using a system prompt to enforce guardrails. This is indicated in Table 4, where the mean official MT Bench score for Mistral system prompt is 6.58 ±0.05, which is higher than the score for no system prompt (6.84 ±0.07) and Llama 2 system prompt (6.38 ±0.07).",Mistral7B.pdf
2,"generation. Furthermore, Mistral 7B approaches the coding performance of Code-Llama 7B [ 20],\nwithout sacrificing performance on non-code related benchmarks.\nMistral 7B leverages grouped-query attention (GQA) [ 1], and sliding window attention (SWA) [ 6,3].\nGQA significantly accelerates the inference speed, and also reduces the memory requirement during\ndecoding, allowing for higher batch sizes hence higher throughput, a crucial factor for real-time\napplications. In addition, SWA is designed to handle longer sequences more effectively at a reduced\ncomputational cost, thereby alleviating a common limitation in LLMs. These attention mechanisms\ncollectively contribute to the enhanced performance and efficiency of Mistral 7B.arXiv:2310.06825v1 [cs.CL] 10 Oct 2023Mistral 7B is rele...","Given the context, what are the attention mechanisms used by Mistral 7B and how do they contribute to its enhanced performance and efficiency?\n\n","Mistral 7B uses grouped-query attention (GQA) and sliding window attention (SWA) as its attention mechanisms. GQA significantly accelerates the inference speed and reduces the memory requirement during decoding, allowing for higher batch sizes and higher throughput, which is crucial for real-time applications. SWA is designed to handle longer sequences more effectively at a reduced computational cost, thereby alleviating a common limitation in LLMs. These attention mechanisms collectively contribute to the enhanced performance and efficiency of Mistral 7B.",Mistral7B.pdf
3,"𝑁\n𝐵𝑐m\nblocks\nK1K𝑇𝑐andV1V𝑇𝑐, of size𝐵𝑐𝑑each.\n4:Divide Ointo𝑇𝑟blocks O𝑖O𝑇𝑟of size𝐵𝑟𝑑each, divide ℓinto𝑇𝑟blocksℓ𝑖ℓ𝑇𝑟of size𝐵𝑟each,\ndivide𝑚into𝑇𝑟blocks𝑚1𝑚𝑇𝑟of size𝐵𝑟each.\n5:for1𝑗𝑇𝑐do\n6:Load K𝑗V𝑗from HBM to on-chip SRAM.\n7:for1𝑖𝑇𝑟do\n8:Load Q𝑖O𝑖ℓ𝑖𝑚𝑖from HBM to on-chip SRAM.\n9:On chip, compute S𝑖𝑗=Q𝑖K𝑇\n𝑗2R𝐵𝑟𝐵𝑐.\n10:On chip, compute ~𝑚𝑖𝑗=rowmax¹S𝑖𝑗º 2R𝐵𝑟,~P𝑖𝑗=exp¹S𝑖𝑗�~𝑚𝑖𝑗º 2R𝐵𝑟𝐵𝑐(pointwise), ~ℓ𝑖𝑗=\nrowsum¹~P𝑖𝑗º2R𝐵𝑟.\n11:On chip, compute 𝑚new\n𝑖=max¹𝑚𝑖~𝑚𝑖𝑗º2R𝐵𝑟,ℓnew\n𝑖=𝑒𝑚𝑖�𝑚new\n𝑖ℓ𝑖¸𝑒~𝑚𝑖𝑗�𝑚new\n𝑖~ℓ𝑖𝑗2R𝐵𝑟.\n12:Write O𝑖 diag¹ℓnew\n𝑖º�1¹diag¹ℓ𝑖º𝑒𝑚𝑖�𝑚new\n𝑖O𝑖¸𝑒~𝑚𝑖𝑗�𝑚new\n𝑖~P𝑖𝑗V𝑗ºto HBM.\n13:Writeℓ𝑖 ℓnew\n𝑖,𝑚𝑖 𝑚new\n𝑖to HBM.\n14:end for\n15:end for\n16:Return O.\nWe show FlashAttention ’s correctness, runtime, and memory requirement (proof in Appendix...","Suppose K1, K2, K3, K4, K5, K6, K7, K8, K9, K10, K11, K12, K13, K14, K15, K16, K17, K18, K19, K20, K21, K22, K23, K24, K25, K26, K27, K28, K29, K30, K31, K32, K33, K34, K35, K36, K37, K38, K39, K40, K41, K42, K43, K44, K45, K46, K47, K48, K49, K50, K51, K52, K53, K54, K55, K56, K57, K58, K59, K60, K61, K62, K63, K64, K65, K66, K67, K68, K69, K70, K71, K72, K73, K74, K75, K76, K77, K78, K79, K80, K81, K82, K83, K84, K85, K86, K87, K88, K89, K90, K91, K92, K93, K94, K95, K96, K97, K98, K99, K100, K101, K102, K103, K104, K105, K106, K107, K108, K109, K110, K111, K112, K113, K114, K115, K116, K117, K118, K119, K120, K121, K122, K123, K124, K125, K126, K127, K128, K129, K130, K131, K132, K133, K134, K135, K136, K137, K138, K139, K140, K141, K142, K143, K144, K145, K146, K147, K148, K149, K1...","(your answer to the deep question)\n\nNow here is the context.\n\nContext: 𝑁\n𝐵𝑐m\nblocks\nK1K𝑇𝑐andV1V𝑇𝑐, of size𝐵𝑐𝑑each.\n4:Divide Ointo𝑇𝑟blocks O𝑖O𝑇𝑟of size𝐵𝑟𝑑each, divide ℓinto𝑇𝑟blocksℓ𝑖ℓ𝑇𝑟of size𝐵𝑟each,\ndivide𝑚into𝑇𝑟blocks𝑚1𝑚𝑇𝑟of size𝐵𝑟each.\n5:for1𝑗𝑇𝑐do\n6:Load K𝑗V𝑗from HBM to on-chip SRAM.\n7:for1𝑖𝑇𝑟do\n8:Load Q𝑖O𝑖ℓ𝑖𝑚𝑖from HBM to on-chip SRAM.\n9:On chip, compute S𝑖𝑗=Q𝑖K𝑇\n𝑗2R𝐵𝑟𝐵𝑐.\n10:On chip, compute ~𝑚𝑖𝑗=rowmax¹S𝑖𝑗º 2R𝐵𝑟,~P𝑖𝑗=exp¹S𝑖𝑗�~𝑚𝑖𝑗º 2R𝐵𝑟𝐵𝑐(pointwise), ~ℓ𝑖𝑗=\nrowsum¹~P𝑖𝑗º2R𝐵𝑟.\n11:On chip, compute 𝑚new\n𝑖=max¹𝑚𝑖~𝑚𝑖𝑗º2R𝐵𝑟,ℓnew\n𝑖=𝑒𝑚𝑖�𝑚new\n𝑖ℓ𝑖¸𝑒~𝑚𝑖𝑗�𝑚new\n𝑖~ℓ𝑖𝑗2R𝐵𝑟.\n12:Write O𝑖 diag¹ℓnew\n𝑖º�1¹diag¹ℓ𝑖º𝑒𝑚𝑖�𝑚new\n𝑖O𝑖¸𝑒~𝑚𝑖𝑗�𝑚new\n𝑖~P𝑖𝑗V𝑗ºto HBM.\n13:Writeℓ𝑖 ℓnew\n𝑖,𝑚𝑖 𝑚new\n𝑖to HBM.\n14:end for\n15:end for\n16:Return O.\nWe show FlashA...",FlashAttention_ Fast and Memory-Efficient Exact Attention with IO-Awareness.pdf
4,"improves performance on the MIMIC-III [ 47] and ECtHR [ 6,7] datasets. MIMIC-III contains intensive care\nunit patient discharge summaries, each annotated with multiple labels. ECtHR contains legal cases from the\n3LRA accuracy results are known to be highly dependent on the tuning procedure [ 90]. Our reproduced baselines perform\nbetter than as reported in the original comparison [80].\n8Attention Memory Usage\nSequence LengthAttention Runtime (Fwd Pass + Bwd Pass)\nSequence LengthRuntime (ms)\nMemory Footprint (GB)256 8K 16K 32K 64K 128 256 512 1024 2048 4096101102\n1020\nFlashAttention\nBlock-Sparse FlashAttentionPyTorch Attention\nMegatron AttentionLinformer Attention\nOpenAI Sparse Attention8192100Crossover Points\n20x2xFigure 3: Left:runtime of forward pass + backward pass. Righ...","Given the context, what is the relationship between the attention memory usage and sequence length in the MIMIC-III and ECtHR datasets?\n\n","The attention memory usage and sequence length are positively correlated in both the MIMIC-III and ECtHR datasets. As the sequence length increases, the attention memory usage also increases. This is evident from the graph in Figure 3, which shows that the attention memory usage increases as the sequence length increases.",FlashAttention_ Fast and Memory-Efficient Exact Attention with IO-Awareness.pdf
5,"guage model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium\non Operating Systems Principles , 2023.\n[18] Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano,\nSean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, and Daniel Haziza.\nxformers: A modular and hackable transformer modelling library. https://github.com/\nfacebookresearch/xformers , 2022.\n[19] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct\nelectricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789 ,\n2018.\n[20] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan,\nYossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Co...",How does paged attention affect the performance of a gauge model serving in a production environment?\n\n,Paged attention can improve the performance of a gauge model serving in a production environment by reducing the memory footprint and speeding up the inference process.,Mistral7B.pdf
6,"our chunk size. For each chunk, we thus need to compute the attention over the cache and over the\nchunk. Figure 3 shows how the attention mask works over both the cache and the chunk.\ngodog0000100000thetoThecatsatonthe\n1matand111sawthe1000doggoto\n100000110000000011100000011110PastCacheCurrent\nFigure 3: Pre-fill and chunking. During pre-fill of the cache, long sequences are chunked to limit memory\nusage. We process a sequence in three chunks, “The cat sat on”, “the mat and saw”, “the dog go to”. The figure\nshows what happens for the third chunk (“the dog go to”): it attends itself using a causal mask (rightmost block),\nattends the cache using a sliding window (center block), and does not attend to past tokens as they are outside of\nthe sliding window (left block).\n3 Results\nW...",What is the purpose of chunking in the context of the pre-fill process in Mistral 7B?\n\n,"The purpose of chunking in the context of the pre-fill process in Mistral 7B is to limit memory usage by processing long sequences in smaller, more manageable chunks.",Mistral7B.pdf
7,"Mistral 7B\nAlbert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford,\nDevendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel,\nGuillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux,\nPierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix,\nWilliam El Sayed\nAbstract\nWe introduce Mistral 7B, a 7–billion-parameter language model engineered for\nsuperior performance and efficiency. Mistral 7B outperforms the best open 13B\nmodel (Llama 2) across all evaluated benchmarks, and the best released 34B\nmodel (Llama 1) in reasoning, mathematics, and code generation. Our model\nleverages grouped-query attention (GQA) for faster inference, coupled with sliding\nwindow attention (SWA) to effectively handle sequences of a...",Can you explain how the grouped-query attention (GQA) and sliding window attention (SWA) mechanisms in Mistral 7B enable faster and more efficient inference compared to other language models?\n\n,"The grouped-query attention (GQA) mechanism in Mistral 7B groups multiple queries together and processes them simultaneously, reducing the inference cost. The sliding window attention (SWA) mechanism enables the model to effectively handle sequences of arbitrary length by processing the sequence in smaller windows, further reducing the inference cost. These mechanisms, combined with other optimizations, enable Mistral 7B to outperform other language models in terms of performance and efficiency.",Mistral7B.pdf
8,"We apply two established techniques (tiling, recomputation) to overcome the technical challenge of\ncomputing exact attention in sub-quadratic HBM accesses. We describe this in Algorithm 1. The main idea\nis that we split the inputs QKVinto blocks, load them from slow HBM to fast SRAM, then compute the\nattention output with respect to those blocks. By scaling the output of each block by the right normalization\nfactor before adding them up, we get the correct result at the end.\nTiling. We compute attention by blocks. Softmax couples columns of K, so we decompose the large\nsoftmax with scaling [51, 60, 66]. For numerical stability, the softmax of vector 𝑥2R𝐵is computed as:\n𝑚¹𝑥º:=max\n𝑖𝑥𝑖 𝑓¹𝑥º:=\n𝑒𝑥1�𝑚¹𝑥º 𝑒𝑥𝐵�𝑚¹𝑥º\n ℓ¹𝑥º:=∑︁\n𝑖𝑓¹𝑥º𝑖softmax¹𝑥º:=𝑓¹𝑥º\nℓ¹𝑥º\n4For vectors 𝑥¹1º...",Can you explain how the attention output is computed in Algorithm 1?\n\n,"The attention output is computed by splitting the inputs Q-K-V into blocks, loading them from slow HBM to fast SRAM, and then computing the attention output with respect to those blocks. By scaling the output of each block by the right normalization factor before adding them up, the correct result is obtained at the end. The softmax is decomposed using scaling for numerical stability. For vectors 𝑥¹1¹-𝑥¹2²R𝐵, the softmax of the concatenated 𝑥=𝑥¹1¹𝑥¹2²R𝐵 is decomposed as: 𝑚¹𝑥º=𝑚¹𝑥¹1¹𝑥¹2²R𝐵as and 𝑚¹𝑥º=𝑚¹𝑥¹1¹𝑥¹2²R𝐵as. 𝑚¹𝑥º=𝑚¹𝑥¹1¹𝑥¹2²R𝐵as is computed as: 𝑚¹𝑥º=𝑚¹𝑥¹1¹𝑥¹2²R𝐵as = max¹𝑚¹𝑥¹1¹º-𝑓¹𝑥º=h𝑒𝑚¹𝑥¹1¹º-𝑚¹𝑥¹2²R𝐵as = max¹𝑚¹𝑥¹1¹º-𝑓¹𝑥º=h𝑒𝑚¹𝑥¹1¹º-𝑚¹𝑥¹2²R𝐵as = max¹𝑚¹𝑥¹1¹º-𝑓¹𝑥º=h𝑒𝑚¹𝑥¹1¹º-𝑚¹𝑥¹2²R𝐵as = max¹𝑚¹𝑥¹1¹º-𝑓¹𝑥º=h𝑒𝑚¹𝑥¹1¹º-𝑚¹𝑥¹2²R𝐵as = max¹𝑚¹𝑥¹1¹º-𝑓¹𝑥º=h𝑒𝑚¹𝑥¹1¹º-𝑚¹𝑥¹2²R𝐵as = max¹𝑚¹𝑥¹1¹º-𝑓...",FlashAttention_ Fast and Memory-Efficient Exact Attention with IO-Awareness.pdf
9,"or Azure using the vLLM [ 17] inference server and SkyPilot2. Integration with Hugging Face3is\nalso streamlined for easier integration. Moreover, Mistral 7B is crafted for ease of fine-tuning across\na myriad of tasks. As a demonstration of its adaptability and superior performance, we present a chat\nmodel fine-tuned from Mistral 7B that significantly outperforms the Llama 2 13B – Chat model.\nMistral 7B takes a significant step in balancing the goals of getting high performance while keeping\nlarge language models efficient. Through our work, our aim is to help the community create more\naffordable, efficient, and high-performing language models that can be used in a wide range of\nreal-world applications.\n2 Architectural details\nFigure 1: Sliding Window Attention. The number of o...",Can you explain how the Mistral 7B language model is designed to balance high performance and efficiency?\n\n,"The Mistral 7B language model is designed to balance high performance and efficiency by using a sliding window attention mechanism. This mechanism allows the model to attend to a smaller portion of the input sequence at a time, reducing the number of operations and memory usage required for inference. Additionally, the model is crafted for ease of fine-tuning across a wide range of tasks, making it adaptable and efficient for various real-world applications.",Mistral7B.pdf


### 1.3. Setup critique agents

The questions generated by the previous agent can have many flaws: we should do a quality check before validating these questions.

We thus build critique agents that will rate each question on several criteria, given in [this paper](https://huggingface.co/papers/2312.10003):
- **Groundedness:** can the question be answered from the given context?
- **Relevance:** is the question relevant to users? For instance, `"What is the date when transformers 4.29.1 was released?"` is not relevant for ML practicioners.

One last failure case we've noticed is when a function is tailored for the particular setting where the question was generated, but undecipherable by itself, like `"What is the name of the function used in this guide?"`.
We also build a critique agent for this criteria:
- **Stand-alone**: is the question understandable free of any context, for someone with domain knowledge/Internet access? The opposite of this would be `What is the function used in this article?` for a question generated from a specific blog article.

We systematically score functions with all these agents, and whenever the score is too low for any one of the agents, we eliminate the question from our eval dataset.

💡 ___When asking the agents to output a score, we first ask them to produce its rationale. This will help us verify scores, but most importantly, asking it to first output rationale gives the model more tokens to think and elaborate an answer before summarizing it into a single score token.___

We now build and run these critique agents.

In [None]:


### Semi-working backup
# question_groundedness_critique_prompt = """
# You will be given a context and a question.
# Your task is to provide a 'total rating' scoring how well one can answer the given question unambiguously with the given context.
# Give your answer on a scale of 1 to 5, where 1 means that the question is not answerable at all given the context, and 5 means that the question is clearly and unambiguously answerable with the context.

# Provide your answer as follows:

# Answer:::
# Evaluation: (your rationale for the rating, as a text)
# Total rating: (your rating, as a number between 1 and 5)

# You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

# Now here are the question and context.

# Question: {question}\n
# Context: {context}\n
# Answer::: """

# question_relevance_critique_prompt = """
# You will be given a question.
# Your task is to provide a 'total rating' representing how useful this question can be to machine learning developers building NLP applications with the Hugging Face ecosystem.
# Give your answer on a scale of 1 to 5, where 1 means that the question is not useful at all, and 5 means that the question is extremely useful.

# Provide your answer as follows:

# Answer:::
# Evaluation: (your rationale for the rating, as a text)
# Total rating: (your rating, as a number between 1 and 5)

# You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

# Now here is the question.

# Question: {question}\n
# Answer::: """

# question_standalone_critique_prompt = """
# You will be given a question.
# Your task is to provide a 'total rating' representing how context-independent this question is.
# Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
# For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
# The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.

# For instance, "What is the name of the checkpoint from which the ViT model is imported?" should receive a 1, since there is an implicit mention of a context, thus the question is not independent from the context.

# Provide your answer as follows:

# Answer:::
# Evaluation: (your rationale for the rating, as a text)
# Total rating: (your rating, as a number between 1 and 5)

# You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

# Now here is the question.

# Question: {question}\n
# Answer::: """

In [15]:

## Semi-working backup
question_groundedness_critique_prompt = """
You will be given a context and a question.
Your task is to provide a 'total rating' scoring how well one can answer the given question unambiguously with the given context.
Give your answer on a scale of 1 to 5, where 1 means that the question is not answerable at all given the context, and 5 means that the question is clearly and unambiguously answerable with the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: !!!(your rating, as a number between 1 and 5)!!!

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.  'Total rating:' should be encolsed in !!! and !!! as in '!!!4.5!!!'.


Now here are the question and context.

Question: {question}\n
Context: {context}\n
Answer::: """

question_relevance_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how useful this question can be to machine learning developers building NLP applications with the Hugging Face ecosystem.
Give your answer on a scale of 1 to 5, where 1 means that the question is not useful at all, and 5 means that the question is extremely useful.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: !!!(your rating, as a number between 1 and 5)!!!

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.  'Total rating:' should be encolsed in !!! and !!! as in '!!!4.5!!!'.


Now here is the question.

Question: {question}\n
Answer::: """

question_standalone_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how context-independent this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.

For instance, "What is the name of the checkpoint from which the ViT model is imported?" should receive a 1, since there is an implicit mention of a context, thus the question is not independent from the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: !!!(your rating, as a number between 1 and 5)!!!

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.  'Total rating:' should be encolsed in !!! and !!! as in '!!!4.5!!!'.


Now here is the question.

Question: {question}\n
Answer::: """

In [None]:


# ## Semi-working backup
# question_groundedness_critique_prompt = """
# You will be given a context and a question.
# Your task is to provide a 'total rating' scoring how well one can answer the given question unambiguously with the given context.
# Give your answer on a scale of 1 to 5, where 1 means that the question is not answerable at all given the context, and 5 means that the question is clearly and unambiguously answerable with the context.

# Provide your answer as follows:

# Answer:::
# Evaluation: (your rationale for the rating, as a text)
# Total rating: !!!(your rating, as a number between 1 and 5)!!!

# You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.  'Total rating:' should be preceded and followd by '!!!' as in 'Total rating: !!!4.5!!!'

# Now here are the question and context.

# Question: {question}\n
# Context: {context}\n
# Answer::: """

# question_relevance_critique_prompt = """
# You will be given a question.
# Your task is to provide a 'total rating' representing how useful this question can be to machine learning developers building NLP applications with the Hugging Face ecosystem.
# Give your answer on a scale of 1 to 5, where 1 means that the question is not useful at all, and 5 means that the question is extremely useful.

# Provide your answer as follows:

# Answer:::
# Evaluation: (your rationale for the rating, as a text)
# Total rating: !!!(your rating, as a number between 1 and 5)!!!

# You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.  'Total rating:' should be preceded and followd by '!!!' as in 'Total rating: !!!4.5!!!'

# Now here is the question.

# Question: {question}\n
# Answer::: """

# question_standalone_critique_prompt = """
# You will be given a question.
# Your task is to provide a 'total rating' representing how context-independent this question is.
# Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
# For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
# The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.

# For instance, "What is the name of the checkpoint from which the ViT model is imported?" should receive a 1, since there is an implicit mention of a context, thus the question is not independent from the context.

# Provide your answer as follows:

# Answer:::
# Evaluation: (your rationale for the rating, as a text)
# Total rating: !!!(your rating, as a number between 1 and 5)!!!

# You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.  'Total rating:' should be preceded and followd by '!!!' as in 'Total rating: !!!4.5!!!'

# Now here is the question.

# Question: {question}\n
# Answer::: """

In [19]:
# import re
# print("Generating critique for each QA couple...")
# for output in tqdm(outputs[:1]):
#     evaluations = {
#         "groundedness": call_llm(question=question_groundedness_critique_prompt.format(context=output["context"], question=output["question"]), 
#                                 generator=generator_llm,
#                                 tokenizer=generator_tokenizer,settings=generator_settings,
#                                 max_new_tokens=1024),
#         "relevance": call_llm(question=question_relevance_critique_prompt.format(question=output["question"]), 
#                                 generator=generator_llm,
#                                 tokenizer=generator_tokenizer,settings=generator_settings,
#                                 max_new_tokens=1024),
                    
#         "standalone": call_llm(question=question_standalone_critique_prompt.format(question=output["question"]),
#                                 generator=generator_llm,
#                                 tokenizer=generator_tokenizer,settings=generator_settings,
#                                 max_new_tokens=1024)
#     }
#     try:
#         # for criterion, evaluation in evaluations.items():
#         #     # score, eval = (
#         #     #     (evaluation.split("Total rating: ")[-1].strip()),
#         #     #     evaluation.split("Total rating: ")[-2].split("Evaluation: ")[1],
#         #     # )
#         #     score, eval = (
#         #         float(evaluation.split("!!!")[1]),
#         #         evaluation.split("Total rating: ")[-2].split("Evaluation: ")[1],
#         #     )
#         #     output.update(
#         #         {
#         #             f"{criterion}_score": score,
#         #             f"{criterion}_eval": eval,
#         #         }
#         #     )
#         for criterion, evaluation in evaluations.items():
#             score_match = re.search(r'Total rating: !!!([\d\.]+)!!!', evaluation)
#             #re.search(r'Total rating: !!!([\d\.]+)!!!.*', evaluation)
#             eval_match = re.search(r'Evaluation: (.*?)Total rating:', evaluation)
            
#             if score_match and eval_match:
#                 score = float(score_match.group(1))
#                 eval = eval_match.group(1).strip()
#                 output.update(
#                     {
#                         f"{criterion}_score": score,
#                         f"{criterion}_eval": eval,
#                     }
#                 )
#     except Exception as e:
#         print("\033[91m" + f"EVALUATION:" + "\033[0m")
#         print(evaluations)
#         print("\033[91m" + f"EXCEPTION: {e}" + "\033[0m")
#         break
#         #continue

Generating critique for each QA couple...


  0%|          | 0/1 [00:00<?, ?it/s]

<s>[INST] 
You will be given a context and a question.
Your task is to provide a 'total rating' scoring how well one can answer the given question unambiguously with the given context.
Give your answer on a scale of 1 to 5, where 1 means that the question is not answerable at all given the context, and 5 means that the question is clearly and unambiguously answerable with the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: !!!(your rating, as a number between 1 and 5)!!!

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.  'Total rating:' should be encolsed in !!! and !!! as in '!!!4.5!!!'.


Now here are the question and context.

Question: Can a high-level language such as PyTorch be used to write attention algorithms with IO-aware implementations in CUDA?



Context: algorithm in a considerably lower-level language than PyTorch, and requires signiﬁcant engineering eﬀort.
Implementati

100%|██████████| 1/1 [00:04<00:00,  4.15s/it]

<s>[INST] 
You will be given a question.
Your task is to provide a 'total rating' representing how context-independent this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.

For instance, "What is the name of the checkpoint from which the ViT model is imported?" should receive a 1, since there is an implicit mention of a context, thus the question is not independent from the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: !!!(your rating,




In [21]:
question_groundedness_critique_prompt = """
You will be given a context and a question.
Your task is to provide a 'total rating' scoring how well one can answer the given question unambiguously with the given context.
Give your answer on a scale of 1 to 5, where 1 means that the question is not answerable at all given the context, and 5 means that the question is clearly and unambiguously answerable with the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here are the question and context.

Question: {question}\n
Context: {context}\n
Answer::: """

question_relevance_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how useful this question can be to machine learning developers building NLP applications with the Hugging Face ecosystem.
Give your answer on a scale of 1 to 5, where 1 means that the question is not useful at all, and 5 means that the question is extremely useful.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """

question_standalone_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how context-independant this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.

For instance, "What is the name of the checkpoint from which the ViT model is imported?" should receive a 1, since there is an implicit mention of a context, thus the question is not independant from the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """

In [31]:
print("Generating critique for each QA couple...")
for output in tqdm(outputs):
    evaluations = {
        "groundedness": call_llm(question=question_groundedness_critique_prompt.format(context=output["context"], question=output["question"]), 
                                generator=generator_llm,
                                tokenizer=generator_tokenizer,settings=generator_settings,
                                max_new_tokens=1024),
        "relevance": call_llm(question=question_relevance_critique_prompt.format(question=output["question"]), 
                                generator=generator_llm,
                                tokenizer=generator_tokenizer,settings=generator_settings,
                                max_new_tokens=1024),
                    
        "standalone": call_llm(question=question_standalone_critique_prompt.format(question=output["question"]),
                                generator=generator_llm,
                                tokenizer=generator_tokenizer,settings=generator_settings,
                                max_new_tokens=1024)
    }
    try:
        for criterion, evaluation in evaluations.items():
            score, eval = (
                # int(evaluation.split("Total rating: ")[-1].strip()),
                (evaluation.split("Total rating: ")[-1].strip()),
                evaluation.split("Total rating: ")[-2].split("Evaluation: ")[1],
            )
            output.update(
                {
                    f"{criterion}_score": score,
                    f"{criterion}_eval": eval,
                }
            )
    except Exception as e:
        print("\033[91m" + f"EVALUATION:" + "\033[0m")
        print(evaluations)
        print("\033[91m" + f"EXCEPTION: {e}" + "\033[0m")

Generating critique for each QA couple...


  0%|          | 0/10 [00:00<?, ?it/s]

<s>[INST] 
You will be given a context and a question.
Your task is to provide a 'total rating' scoring how well one can answer the given question unambiguously with the given context.
Give your answer on a scale of 1 to 5, where 1 means that the question is not answerable at all given the context, and 5 means that the question is clearly and unambiguously answerable with the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here are the question and context.

Question: Can a high-level language such as PyTorch be used to write attention algorithms with IO-aware implementations in CUDA?



Context: algorithm in a considerably lower-level language than PyTorch, and requires signiﬁcant engineering eﬀort.
Implementations may also not be transferrable across GPU architectures. These limitations

 10%|█         | 1/10 [00:03<00:30,  3.42s/it]

<s>[INST] 
You will be given a question.
Your task is to provide a 'total rating' representing how context-independant this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.

For instance, "What is the name of the checkpoint from which the ViT model is imported?" should receive a 1, since there is an implicit mention of a context, thus the question is not independant from the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as

 20%|██        | 2/10 [00:07<00:28,  3.56s/it]

<s>[INST] 
You will be given a question.
Your task is to provide a 'total rating' representing how context-independant this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.

For instance, "What is the name of the checkpoint from which the ViT model is imported?" should receive a 1, since there is an implicit mention of a context, thus the question is not independant from the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as

 30%|███       | 3/10 [00:10<00:23,  3.39s/it]

<s>[INST] 
You will be given a question.
Your task is to provide a 'total rating' representing how context-independant this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.

For instance, "What is the name of the checkpoint from which the ViT model is imported?" should receive a 1, since there is an implicit mention of a context, thus the question is not independant from the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as

 40%|████      | 4/10 [00:14<00:21,  3.60s/it]

<s>[INST] 
You will be given a question.
Your task is to provide a 'total rating' representing how context-independant this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.

For instance, "What is the name of the checkpoint from which the ViT model is imported?" should receive a 1, since there is an implicit mention of a context, thus the question is not independant from the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as

 50%|█████     | 5/10 [00:17<00:18,  3.64s/it]

<s>[INST] 
You will be given a question.
Your task is to provide a 'total rating' representing how context-independant this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.

For instance, "What is the name of the checkpoint from which the ViT model is imported?" should receive a 1, since there is an implicit mention of a context, thus the question is not independant from the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as

 60%|██████    | 6/10 [00:22<00:15,  3.81s/it]

<s>[INST] 
You will be given a question.
Your task is to provide a 'total rating' representing how context-independant this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.

For instance, "What is the name of the checkpoint from which the ViT model is imported?" should receive a 1, since there is an implicit mention of a context, thus the question is not independant from the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as

 70%|███████   | 7/10 [00:26<00:11,  3.89s/it]

<s>[INST] 
You will be given a question.
Your task is to provide a 'total rating' representing how context-independant this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.

For instance, "What is the name of the checkpoint from which the ViT model is imported?" should receive a 1, since there is an implicit mention of a context, thus the question is not independant from the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as

 80%|████████  | 8/10 [00:30<00:07,  3.96s/it]

<s>[INST] 
You will be given a question.
Your task is to provide a 'total rating' representing how context-independant this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.

For instance, "What is the name of the checkpoint from which the ViT model is imported?" should receive a 1, since there is an implicit mention of a context, thus the question is not independant from the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as

 90%|█████████ | 9/10 [00:32<00:03,  3.38s/it]

<s>[INST] 
You will be given a question.
Your task is to provide a 'total rating' representing how context-independant this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.

For instance, "What is the name of the checkpoint from which the ViT model is imported?" should receive a 1, since there is an implicit mention of a context, thus the question is not independant from the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as

100%|██████████| 10/10 [00:36<00:00,  3.66s/it]

<s>[INST] 
You will be given a question.
Your task is to provide a 'total rating' representing how context-independant this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.

For instance, "What is the name of the checkpoint from which the ViT model is imported?" should receive a 1, since there is an implicit mention of a context, thus the question is not independant from the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as




In [None]:
evaluations

{'groundedness': "<s>[INST] \nYou will be given a context and a question.\nYour task is to provide a 'total rating' scoring how well one can answer the given question unambiguously with the given context.\nGive your answer on a scale of 1 to 5, where 1 means that the question is not answerable at all given the context, and 5 means that the question is clearly and unambiguously answerable with the context.\n\nProvide your answer as follows:\n\nAnswer:::\nEvaluation: (your rationale for the rating, as a text)\nTotal rating: !!!(your rating, as a number between 1 and 5)!!!\n\nYou MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.  'Total rating:' should be encolsed in !!! and !!! as in '!!!4.5!!!'.\n\n\nNow here are the question and context.\n\nQuestion: Given that the LLM models discussed in [26], [27], [28], and [29] are designed to be efficient foundation language models, what are the key differences between them in terms of their architecture, training data, and

Now let us filter out bad questions based on our critique agent scores:

In [25]:
outputs[0].keys()

dict_keys(['context', 'question', 'answer', 'source_doc', 'groundedness_score', 'groundedness_eval', 'relevance_score', 'relevance_eval', 'standalone_score', 'standalone_eval'])

In [26]:
import pandas as pd

pd.set_option("display.max_colwidth", None)

generated_questions = pd.DataFrame.from_dict(output)

print("Evaluation dataset before filtering:")
display(
    generated_questions.reindex(
        [
            "question",
            "answer",
            "groundedness_score",
            "relevance_score",
            "standalone_score",
        ],
        axis=1)
    
)

ValueError: If using all scalar values, you must pass an index

In [33]:
import pandas as pd

pd.set_option("display.max_colwidth", None)

generated_questions = pd.DataFrame.from_dict(outputs)

print("Evaluation dataset before filtering:")
display(
    generated_questions[
        [
            "question",
            "answer",
            "groundedness_score",
            "relevance_score",
            "standalone_score",
        ]
    ]
)

Evaluation dataset before filtering:


Unnamed: 0,question,answer,groundedness_score,relevance_score,standalone_score
0,Can a high-level language such as PyTorch be used to write attention algorithms with IO-aware implementations in CUDA?\n\n,"Yes, a high-level language such as PyTorch can be used to write attention algorithms with IO-aware implementations in CUDA. This is because the IO-aware approach can extend beyond attention and can be applied to other modules in a deep network. The IO-aware implementation of attention is optimal within constants for computing attention on a single GPU, but it may not be parallelizable across multiple GPUs.","4.5\n\nThe answer is almost unambiguous, but there is a slight ambiguity in the question regarding the specific requirements for the IO-aware implementation. However, the context provides enough information to clarify that the IO-aware implementation should be optimal for computing attention on a single GPU and potentially parallelizable for multi-GPU scenarios.",4. This question is highly useful for developers who need to understand the capabilities of PyTorch and CUDA for implementing attention algorithms in NLP applications.,5.0
1,Can Mistral 7B perform fine-grained content moderation using a system prompt to enforce guardrails?\n\n,"Yes, according to the context, Mistral 7B can perform fine-grained content moderation using a system prompt to enforce guardrails. This is indicated in Table 4, where the mean official MT Bench score for Mistral system prompt is 6.58 ±0.05, which is higher than the score for no system prompt (6.84 ±0.07) and Llama 2 system prompt (6.38 ±0.07).",,,
2,"Given the context, what are the attention mechanisms used by Mistral 7B and how do they contribute to its enhanced performance and efficiency?\n\n","Mistral 7B uses grouped-query attention (GQA) and sliding window attention (SWA) as its attention mechanisms. GQA significantly accelerates the inference speed and reduces the memory requirement during decoding, allowing for higher batch sizes and higher throughput, which is crucial for real-time applications. SWA is designed to handle longer sequences more effectively at a reduced computational cost, thereby alleviating a common limitation in LLMs. These attention mechanisms collectively contribute to the enhanced performance and efficiency of Mistral 7B.",4.5,4,4.0
3,"Suppose K1, K2, K3, K4, K5, K6, K7, K8, K9, K10, K11, K12, K13, K14, K15, K16, K17, K18, K19, K20, K21, K22, K23, K24, K25, K26, K27, K28, K29, K30, K31, K32, K33, K34, K35, K36, K37, K38, K39, K40, K41, K42, K43, K44, K45, K46, K47, K48, K49, K50, K51, K52, K53, K54, K55, K56, K57, K58, K59, K60, K61, K62, K63, K64, K65, K66, K67, K68, K69, K70, K71, K72, K73, K74, K75, K76, K77, K78, K79, K80, K81, K82, K83, K84, K85, K86, K87, K88, K89, K90, K91, K92, K93, K94, K95, K96, K97, K98, K99, K100, K101, K102, K103, K104, K105, K106, K107, K108, K109, K110, K111, K112, K113, K114, K115, K116, K117, K118, K119, K120, K121, K122, K123, K124, K125, K126, K127, K128, K129, K130, K131, K132, K133, K134, K135, K136, K137, K138, K139, K140, K141, K142, K143, K144, K145, K146, K147, K148, K149, K150, K151, K152, K153, K154, K155, K156, K157, K158, K159, K160, K161, K162, K163, K164, K165, K166, K167, K168, K169, K170, K171, K172, K173, K174, K175, K176, K177, K178, K179, K180, K181, K182, K183, K184, K185, K186, K187, K188, K189, K190, K191, K192, K193, K194, K195, K196, K197, K198, K199, K200, K201, K202, K203, K204, K205, K206, K207, K208, K209, K210, K211, K212, K213, K214, K215, K216, K217, K218, K219, K220, K221, K222, K223, K224, K225, K22","(your answer to the deep question)\n\nNow here is the context.\n\nContext: 𝑁\n𝐵𝑐m\nblocks\nK1K𝑇𝑐andV1V𝑇𝑐, of size𝐵𝑐𝑑each.\n4:Divide Ointo𝑇𝑟blocks O𝑖O𝑇𝑟of size𝐵𝑟𝑑each, divide ℓinto𝑇𝑟blocksℓ𝑖ℓ𝑇𝑟of size𝐵𝑟each,\ndivide𝑚into𝑇𝑟blocks𝑚1𝑚𝑇𝑟of size𝐵𝑟each.\n5:for1𝑗𝑇𝑐do\n6:Load K𝑗V𝑗from HBM to on-chip SRAM.\n7:for1𝑖𝑇𝑟do\n8:Load Q𝑖O𝑖ℓ𝑖𝑚𝑖from HBM to on-chip SRAM.\n9:On chip, compute S𝑖𝑗=Q𝑖K𝑇\n𝑗2R𝐵𝑟𝐵𝑐.\n10:On chip, compute ~𝑚𝑖𝑗=rowmax¹S𝑖𝑗º 2R𝐵𝑟,~P𝑖𝑗=exp¹S𝑖𝑗�~𝑚𝑖𝑗º 2R𝐵𝑟𝐵𝑐(pointwise), ~ℓ𝑖𝑗=\nrowsum¹~P𝑖𝑗º2R𝐵𝑟.\n11:On chip, compute 𝑚new\n𝑖=max¹𝑚𝑖~𝑚𝑖𝑗º2R𝐵𝑟,ℓnew\n𝑖=𝑒𝑚𝑖�𝑚new\n𝑖ℓ𝑖¸𝑒~𝑚𝑖𝑗�𝑚new\n𝑖~ℓ𝑖𝑗2R𝐵𝑟.\n12:Write O𝑖 diag¹ℓnew\n𝑖º�1¹diag¹ℓ𝑖º𝑒𝑚𝑖�𝑚new\n𝑖O𝑖¸𝑒~𝑚𝑖𝑗�𝑚new\n𝑖~P𝑖𝑗V𝑗ºto HBM.\n13:Writeℓ𝑖 ℓnew\n𝑖,𝑚𝑖 𝑚new\n𝑖to HBM.\n14:end for\n15:end for\n16:Return O.\nWe show FlashAttention ’s correctness, runtime, and memory requirement (proof in Appendix C).\nTheorem 1. Algorithm 1 returns O=softmax¹QK>ºVwith𝑂¹𝑁2𝑑ºFLOPs and requires 𝑂¹𝑁ºadditional\nmemory beyond inputs and output.\n3.2 Analysis: IO Complexity of FlashAttention\n\nOutput::: [/INST] Deep question: Suppose K1, K2, K3, K4, K5, K6, K7, K8, K9, K10, K11, K12, K13, K14, K15, K16, K17, K18, K19, K20, K21, K22, K23, K24, K25, K26, K27, K28, K29, K30, K31, K32, K33, K34, K35, K36, K37, K38, K39, K40, K41, K42, K43, K44, K45, K46, K47, K48, K49, K50, K51, K52, K53, K54, K55, K56, K57, K58, K59, K60, K61, K62, K63, K64, K65, K66, K67, K68, K69, K70, K71, K72, K73, K74, K75, K76, K77, K78, K79, K80, K81, K82, K83, K84, K85, K86, K87, K88, K89, K90, K91, K92, K93, K94, K95, K96, K97, K98, K99, K100, K101, K102, K103, K104, K105, K106, K107, K108, K109, K110, K111, K112, K113, K114, K115, K116, K117, K118, K119, K120, K121, K122, K123, K124, K125, K126, K127, K128, K129, K130, K131, K132, K133, K134, K135, K136, K137, K138, K139, K140, K141, K142, K143, K144, K145, K146, K147, K148, K149, K150, K151, K152, K153, K154, K155, K156, K157, K158, K159, K160, K161, K162, K163, K164, K165, K166, K167, K168, K169, K170, K171, K172, K173, K174, K175, K176, K177, K178, K179, K180, K181, K182, K183, K184, K185, K186, K187, K188, K189, K190, K191, K192, K193, K194, K195, K196, K197, K198, K199, K200, K201, K202, K203, K204, K205, K206, K207, K208, K209, K210, K211, K212, K213, K214, K215, K216, K217, K218, K219, K220, K221, K222, K223, K224, K225, K22",,,
4,"Given the context, what is the relationship between the attention memory usage and sequence length in the MIMIC-III and ECtHR datasets?\n\n","The attention memory usage and sequence length are positively correlated in both the MIMIC-III and ECtHR datasets. As the sequence length increases, the attention memory usage also increases. This is evident from the graph in Figure 3, which shows that the attention memory usage increases as the sequence length increases.",,,
5,How does paged attention affect the performance of a gauge model serving in a production environment?\n\n,Paged attention can improve the performance of a gauge model serving in a production environment by reducing the memory footprint and speeding up the inference process.,"3.5\n\nThe question is clearly answerable with the given context, but the specific impact of paged attention on the performance of a gauge model serving in a production environment may depend on the specific implementation and use case, making it a bit less unambiguous.",,
6,What is the purpose of chunking in the context of the pre-fill process in Mistral 7B?\n\n,"The purpose of chunking in the context of the pre-fill process in Mistral 7B is to limit memory usage by processing long sequences in smaller, more manageable chunks.",,,
7,Can you explain how the grouped-query attention (GQA) and sliding window attention (SWA) mechanisms in Mistral 7B enable faster and more efficient inference compared to other language models?\n\n,"The grouped-query attention (GQA) mechanism in Mistral 7B groups multiple queries together and processes them simultaneously, reducing the inference cost. The sliding window attention (SWA) mechanism enables the model to effectively handle sequences of arbitrary length by processing the sequence in smaller windows, further reducing the inference cost. These mechanisms, combined with other optimizations, enable Mistral 7B to outperform other language models in terms of performance and efficiency.",4.5,4,5.0
8,Can you explain how the attention output is computed in Algorithm 1?\n\n,"The attention output is computed by splitting the inputs Q-K-V into blocks, loading them from slow HBM to fast SRAM, and then computing the attention output with respect to those blocks. By scaling the output of each block by the right normalization factor before adding them up, the correct result is obtained at the end. The softmax is decomposed using scaling for numerical stability. For vectors 𝑥¹1¹-𝑥¹2²R𝐵, the softmax of the concatenated 𝑥=𝑥¹1¹𝑥¹2²R𝐵 is decomposed as: 𝑚¹𝑥º=𝑚¹𝑥¹1¹𝑥¹2²R𝐵as and 𝑚¹𝑥º=𝑚¹𝑥¹1¹𝑥¹2²R𝐵as. 𝑚¹𝑥º=𝑚¹𝑥¹1¹𝑥¹2²R𝐵as is computed as: 𝑚¹𝑥º=𝑚¹𝑥¹1¹𝑥¹2²R𝐵as = max¹𝑚¹𝑥¹1¹º-𝑓¹𝑥º=h𝑒𝑚¹𝑥¹1¹º-𝑚¹𝑥¹2²R𝐵as = max¹𝑚¹𝑥¹1¹º-𝑓¹𝑥º=h𝑒𝑚¹𝑥¹1¹º-𝑚¹𝑥¹2²R𝐵as = max¹𝑚¹𝑥¹1¹º-𝑓¹𝑥º=h𝑒𝑚¹𝑥¹1¹º-𝑚¹𝑥¹2²R𝐵as = max¹𝑚¹𝑥¹1¹º-𝑓¹𝑥º=h𝑒𝑚¹𝑥¹1¹º-𝑚¹𝑥¹2²R𝐵as = max¹𝑚¹𝑥¹1¹º-𝑓¹𝑥º=h𝑒𝑚¹𝑥¹1¹º-𝑚¹𝑥¹2²R𝐵as = max¹𝑚¹𝑥¹1¹º-𝑓¹𝑥º=h𝑒𝑚¹𝑥¹1¹º-𝑚¹𝑥¹2²R𝐵as = max¹𝑚¹𝑥¹1¹º-𝑓¹𝑥º=h𝑒𝑚¹𝑥¹1¹º-𝑚¹𝑥¹2²R𝐵as = max¹𝑚¹𝑥¹1¹º-𝑓¹𝑥º=h𝑒𝑚¹𝑥¹1¹º-𝑚¹𝑥¹2²R𝐵as = max¹𝑚¹𝑥¹1¹º-𝑓¹𝑥º=h𝑒𝑚¹𝑥¹1¹º-𝑚¹𝑥¹2²R𝐵as = max¹𝑚¹𝑥¹1¹º-𝑓¹𝑥º=h𝑒𝑚¹𝑥¹1¹º-𝑚¹𝑥¹2²R𝐵as = max¹𝑚¹𝑥¹1¹º-𝑓¹𝑥º=h𝑒𝑚¹𝑥¹1¹º-𝑚¹",5,4,5.0
9,Can you explain how the Mistral 7B language model is designed to balance high performance and efficiency?\n\n,"The Mistral 7B language model is designed to balance high performance and efficiency by using a sliding window attention mechanism. This mechanism allows the model to attend to a smaller portion of the input sequence at a time, reducing the number of operations and memory usage required for inference. Additionally, the model is crafted for ease of fine-tuning across a wide range of tasks, making it adaptable and efficient for various real-world applications.","3\n\nThe context provides some relevant information, but it does not fully answer the question.",,


In [34]:
generated_questions['groundedness_score']=generated_questions['groundedness_score'].astype(str).str.extract(r'(\d+\.?\d*)').astype(float)
generated_questions['relevance_score']=generated_questions['relevance_score'].astype(str).str.extract(r'(\d+\.?\d*)').astype(float)
generated_questions['standalone_score']=generated_questions['groundedness_score'].astype(str).str.extract(r'(\d+\.?\d*)').astype(float)

In [35]:
display(
    generated_questions[
        [
            "question",
            "answer",
            "groundedness_score",
            "relevance_score",
            "standalone_score",
        ]
    ]
)

Unnamed: 0,question,answer,groundedness_score,relevance_score,standalone_score
0,Can a high-level language such as PyTorch be used to write attention algorithms with IO-aware implementations in CUDA?\n\n,"Yes, a high-level language such as PyTorch can be used to write attention algorithms with IO-aware implementations in CUDA. This is because the IO-aware approach can extend beyond attention and can be applied to other modules in a deep network. The IO-aware implementation of attention is optimal within constants for computing attention on a single GPU, but it may not be parallelizable across multiple GPUs.",4.5,4.0,4.5
1,Can Mistral 7B perform fine-grained content moderation using a system prompt to enforce guardrails?\n\n,"Yes, according to the context, Mistral 7B can perform fine-grained content moderation using a system prompt to enforce guardrails. This is indicated in Table 4, where the mean official MT Bench score for Mistral system prompt is 6.58 ±0.05, which is higher than the score for no system prompt (6.84 ±0.07) and Llama 2 system prompt (6.38 ±0.07).",,,
2,"Given the context, what are the attention mechanisms used by Mistral 7B and how do they contribute to its enhanced performance and efficiency?\n\n","Mistral 7B uses grouped-query attention (GQA) and sliding window attention (SWA) as its attention mechanisms. GQA significantly accelerates the inference speed and reduces the memory requirement during decoding, allowing for higher batch sizes and higher throughput, which is crucial for real-time applications. SWA is designed to handle longer sequences more effectively at a reduced computational cost, thereby alleviating a common limitation in LLMs. These attention mechanisms collectively contribute to the enhanced performance and efficiency of Mistral 7B.",4.5,4.0,4.5
3,"Suppose K1, K2, K3, K4, K5, K6, K7, K8, K9, K10, K11, K12, K13, K14, K15, K16, K17, K18, K19, K20, K21, K22, K23, K24, K25, K26, K27, K28, K29, K30, K31, K32, K33, K34, K35, K36, K37, K38, K39, K40, K41, K42, K43, K44, K45, K46, K47, K48, K49, K50, K51, K52, K53, K54, K55, K56, K57, K58, K59, K60, K61, K62, K63, K64, K65, K66, K67, K68, K69, K70, K71, K72, K73, K74, K75, K76, K77, K78, K79, K80, K81, K82, K83, K84, K85, K86, K87, K88, K89, K90, K91, K92, K93, K94, K95, K96, K97, K98, K99, K100, K101, K102, K103, K104, K105, K106, K107, K108, K109, K110, K111, K112, K113, K114, K115, K116, K117, K118, K119, K120, K121, K122, K123, K124, K125, K126, K127, K128, K129, K130, K131, K132, K133, K134, K135, K136, K137, K138, K139, K140, K141, K142, K143, K144, K145, K146, K147, K148, K149, K150, K151, K152, K153, K154, K155, K156, K157, K158, K159, K160, K161, K162, K163, K164, K165, K166, K167, K168, K169, K170, K171, K172, K173, K174, K175, K176, K177, K178, K179, K180, K181, K182, K183, K184, K185, K186, K187, K188, K189, K190, K191, K192, K193, K194, K195, K196, K197, K198, K199, K200, K201, K202, K203, K204, K205, K206, K207, K208, K209, K210, K211, K212, K213, K214, K215, K216, K217, K218, K219, K220, K221, K222, K223, K224, K225, K22","(your answer to the deep question)\n\nNow here is the context.\n\nContext: 𝑁\n𝐵𝑐m\nblocks\nK1K𝑇𝑐andV1V𝑇𝑐, of size𝐵𝑐𝑑each.\n4:Divide Ointo𝑇𝑟blocks O𝑖O𝑇𝑟of size𝐵𝑟𝑑each, divide ℓinto𝑇𝑟blocksℓ𝑖ℓ𝑇𝑟of size𝐵𝑟each,\ndivide𝑚into𝑇𝑟blocks𝑚1𝑚𝑇𝑟of size𝐵𝑟each.\n5:for1𝑗𝑇𝑐do\n6:Load K𝑗V𝑗from HBM to on-chip SRAM.\n7:for1𝑖𝑇𝑟do\n8:Load Q𝑖O𝑖ℓ𝑖𝑚𝑖from HBM to on-chip SRAM.\n9:On chip, compute S𝑖𝑗=Q𝑖K𝑇\n𝑗2R𝐵𝑟𝐵𝑐.\n10:On chip, compute ~𝑚𝑖𝑗=rowmax¹S𝑖𝑗º 2R𝐵𝑟,~P𝑖𝑗=exp¹S𝑖𝑗�~𝑚𝑖𝑗º 2R𝐵𝑟𝐵𝑐(pointwise), ~ℓ𝑖𝑗=\nrowsum¹~P𝑖𝑗º2R𝐵𝑟.\n11:On chip, compute 𝑚new\n𝑖=max¹𝑚𝑖~𝑚𝑖𝑗º2R𝐵𝑟,ℓnew\n𝑖=𝑒𝑚𝑖�𝑚new\n𝑖ℓ𝑖¸𝑒~𝑚𝑖𝑗�𝑚new\n𝑖~ℓ𝑖𝑗2R𝐵𝑟.\n12:Write O𝑖 diag¹ℓnew\n𝑖º�1¹diag¹ℓ𝑖º𝑒𝑚𝑖�𝑚new\n𝑖O𝑖¸𝑒~𝑚𝑖𝑗�𝑚new\n𝑖~P𝑖𝑗V𝑗ºto HBM.\n13:Writeℓ𝑖 ℓnew\n𝑖,𝑚𝑖 𝑚new\n𝑖to HBM.\n14:end for\n15:end for\n16:Return O.\nWe show FlashAttention ’s correctness, runtime, and memory requirement (proof in Appendix C).\nTheorem 1. Algorithm 1 returns O=softmax¹QK>ºVwith𝑂¹𝑁2𝑑ºFLOPs and requires 𝑂¹𝑁ºadditional\nmemory beyond inputs and output.\n3.2 Analysis: IO Complexity of FlashAttention\n\nOutput::: [/INST] Deep question: Suppose K1, K2, K3, K4, K5, K6, K7, K8, K9, K10, K11, K12, K13, K14, K15, K16, K17, K18, K19, K20, K21, K22, K23, K24, K25, K26, K27, K28, K29, K30, K31, K32, K33, K34, K35, K36, K37, K38, K39, K40, K41, K42, K43, K44, K45, K46, K47, K48, K49, K50, K51, K52, K53, K54, K55, K56, K57, K58, K59, K60, K61, K62, K63, K64, K65, K66, K67, K68, K69, K70, K71, K72, K73, K74, K75, K76, K77, K78, K79, K80, K81, K82, K83, K84, K85, K86, K87, K88, K89, K90, K91, K92, K93, K94, K95, K96, K97, K98, K99, K100, K101, K102, K103, K104, K105, K106, K107, K108, K109, K110, K111, K112, K113, K114, K115, K116, K117, K118, K119, K120, K121, K122, K123, K124, K125, K126, K127, K128, K129, K130, K131, K132, K133, K134, K135, K136, K137, K138, K139, K140, K141, K142, K143, K144, K145, K146, K147, K148, K149, K150, K151, K152, K153, K154, K155, K156, K157, K158, K159, K160, K161, K162, K163, K164, K165, K166, K167, K168, K169, K170, K171, K172, K173, K174, K175, K176, K177, K178, K179, K180, K181, K182, K183, K184, K185, K186, K187, K188, K189, K190, K191, K192, K193, K194, K195, K196, K197, K198, K199, K200, K201, K202, K203, K204, K205, K206, K207, K208, K209, K210, K211, K212, K213, K214, K215, K216, K217, K218, K219, K220, K221, K222, K223, K224, K225, K22",,,
4,"Given the context, what is the relationship between the attention memory usage and sequence length in the MIMIC-III and ECtHR datasets?\n\n","The attention memory usage and sequence length are positively correlated in both the MIMIC-III and ECtHR datasets. As the sequence length increases, the attention memory usage also increases. This is evident from the graph in Figure 3, which shows that the attention memory usage increases as the sequence length increases.",,,
5,How does paged attention affect the performance of a gauge model serving in a production environment?\n\n,Paged attention can improve the performance of a gauge model serving in a production environment by reducing the memory footprint and speeding up the inference process.,3.5,,3.5
6,What is the purpose of chunking in the context of the pre-fill process in Mistral 7B?\n\n,"The purpose of chunking in the context of the pre-fill process in Mistral 7B is to limit memory usage by processing long sequences in smaller, more manageable chunks.",,,
7,Can you explain how the grouped-query attention (GQA) and sliding window attention (SWA) mechanisms in Mistral 7B enable faster and more efficient inference compared to other language models?\n\n,"The grouped-query attention (GQA) mechanism in Mistral 7B groups multiple queries together and processes them simultaneously, reducing the inference cost. The sliding window attention (SWA) mechanism enables the model to effectively handle sequences of arbitrary length by processing the sequence in smaller windows, further reducing the inference cost. These mechanisms, combined with other optimizations, enable Mistral 7B to outperform other language models in terms of performance and efficiency.",4.5,4.0,4.5
8,Can you explain how the attention output is computed in Algorithm 1?\n\n,"The attention output is computed by splitting the inputs Q-K-V into blocks, loading them from slow HBM to fast SRAM, and then computing the attention output with respect to those blocks. By scaling the output of each block by the right normalization factor before adding them up, the correct result is obtained at the end. The softmax is decomposed using scaling for numerical stability. For vectors 𝑥¹1¹-𝑥¹2²R𝐵, the softmax of the concatenated 𝑥=𝑥¹1¹𝑥¹2²R𝐵 is decomposed as: 𝑚¹𝑥º=𝑚¹𝑥¹1¹𝑥¹2²R𝐵as and 𝑚¹𝑥º=𝑚¹𝑥¹1¹𝑥¹2²R𝐵as. 𝑚¹𝑥º=𝑚¹𝑥¹1¹𝑥¹2²R𝐵as is computed as: 𝑚¹𝑥º=𝑚¹𝑥¹1¹𝑥¹2²R𝐵as = max¹𝑚¹𝑥¹1¹º-𝑓¹𝑥º=h𝑒𝑚¹𝑥¹1¹º-𝑚¹𝑥¹2²R𝐵as = max¹𝑚¹𝑥¹1¹º-𝑓¹𝑥º=h𝑒𝑚¹𝑥¹1¹º-𝑚¹𝑥¹2²R𝐵as = max¹𝑚¹𝑥¹1¹º-𝑓¹𝑥º=h𝑒𝑚¹𝑥¹1¹º-𝑚¹𝑥¹2²R𝐵as = max¹𝑚¹𝑥¹1¹º-𝑓¹𝑥º=h𝑒𝑚¹𝑥¹1¹º-𝑚¹𝑥¹2²R𝐵as = max¹𝑚¹𝑥¹1¹º-𝑓¹𝑥º=h𝑒𝑚¹𝑥¹1¹º-𝑚¹𝑥¹2²R𝐵as = max¹𝑚¹𝑥¹1¹º-𝑓¹𝑥º=h𝑒𝑚¹𝑥¹1¹º-𝑚¹𝑥¹2²R𝐵as = max¹𝑚¹𝑥¹1¹º-𝑓¹𝑥º=h𝑒𝑚¹𝑥¹1¹º-𝑚¹𝑥¹2²R𝐵as = max¹𝑚¹𝑥¹1¹º-𝑓¹𝑥º=h𝑒𝑚¹𝑥¹1¹º-𝑚¹𝑥¹2²R𝐵as = max¹𝑚¹𝑥¹1¹º-𝑓¹𝑥º=h𝑒𝑚¹𝑥¹1¹º-𝑚¹𝑥¹2²R𝐵as = max¹𝑚¹𝑥¹1¹º-𝑓¹𝑥º=h𝑒𝑚¹𝑥¹1¹º-𝑚¹𝑥¹2²R𝐵as = max¹𝑚¹𝑥¹1¹º-𝑓¹𝑥º=h𝑒𝑚¹𝑥¹1¹º-𝑚¹",5.0,4.0,5.0
9,Can you explain how the Mistral 7B language model is designed to balance high performance and efficiency?\n\n,"The Mistral 7B language model is designed to balance high performance and efficiency by using a sliding window attention mechanism. This mechanism allows the model to attend to a smaller portion of the input sequence at a time, reducing the number of operations and memory usage required for inference. Additionally, the model is crafted for ease of fine-tuning across a wide range of tasks, making it adaptable and efficient for various real-world applications.",3.0,,3.0


In [126]:
import pandas as pd

pd.set_option("display.max_colwidth", None)

generated_questions = pd.DataFrame.from_dict(outputs)

print("Evaluation dataset before filtering:")
display(
    generated_questions[
        [
            "question",
            "answer",
            "groundedness_score",
            "relevance_score",
            "standalone_score",
        ]
    ]
)
# generated_questions = generated_questions.loc[
#     (generated_questions["groundedness_score"] >= 4)
#     & (generated_questions["relevance_score"] >= 4)
#     & (generated_questions["standalone_score"] >= 4)
# ]
# print("============================================")
# print("Final evaluation dataset:")
# display(
#     generated_questions[
#         [
#             "question",
#             "answer",
#             "groundedness_score",
#             "relevance_score",
#             "standalone_score",
#         ]
#     ]
# )

# eval_dataset = datasets.Dataset.from_pandas(
#     generated_questions, split="train", preserve_index=False
# )

Evaluation dataset before filtering:


Unnamed: 0,question,answer,groundedness_score,relevance_score,standalone_score
0,"What is the number of HBM accesses required by the backward pass of FlashAttention for a given sequence length, head dimension, and SRAM size?\n\n",The backward pass of FlashAttention requires Θ¹𝑁2𝑑2𝑀¹HBM accesses.,!!!2.5!!!,!!!3.5!!!,!!!4.5!!!
1,"Can the FlashAttention algorithm achieve wall-clock speedup compared to approximate attention methods, even with the added constraint of being IO-aware?\n\n","Yes, the FlashAttention algorithm can achieve wall-clock speedup compared to approximate attention methods, even with the added constraint of being IO-aware. The use of tiling in FlashAttention reduces the number of memory reads/writes, which results in faster processing times and lower memory usage. Additionally, the IO-awareness of the algorithm ensures that it is optimized for GPU memory usage, further improving its performance.",!!!4.0!!!,"!!!4.0!!!\n\nThe total rating of 4.0 indicates that this question is moderately useful for machine learning developers building NLP applications with the Hugging Face ecosystem. While the question provides valuable insights into the performance characteristics of different attention mechanisms, it may not be as relevant or informative as other questions related to specific NLP tasks or applications.",!!!5!!!
2,Can you explain the difference between the instruction fine-tuning process used for the Mistral 7B – Instruct model and the training tricks used for the other 7B models on MT-Bench?\n\n,"The instruction fine-tuning process used for the Mistral 7B – Instruct model involved fine-tuning the base model on publicly available instruction datasets on the Hugging Face repository, without using any proprietary data or training tricks. This was done to demonstrate the generalization capabilities of the base model. On the other hand, the other 7B models on MT-Bench were trained using proprietary data and training tricks, which may have contributed to their superior performance.",!!!2.5!!!,"!!!4.0!!! \n\nThe question is moderately useful as it provides insights into the specific fine-tuning process used for the Mistral 7B – Instruct model and the training tricks used for other 7B models on MT-Bench. However, it could be further improved by providing more context and specificity to the question, such as comparing the performance of the two models or discussing the specific training tricks used for each model.",
3,"What is the relative performance of GPT-2 with FlashAttention compared to Megatron-LM in terms of perplexity and training time speedup, when the context length is increased from 1K to 4K?\n\n","GPT-2 with FlashAttention achieved 0.7 better perplexity and 30% faster training time compared to Megatron-LM when the context length was increased from 1K to 4K. Specifically, GPT-2 small with FlashAttention 4k achieved a perplexity of 17.5 and a training time speedup of 1.3 compared to Megatron-LM small with a context length of 1K.",3.5,"4\n\nThe question is quite useful, but it could be improved by providing more specific details about the datasets used for training and testing, as well as any other relevant metrics that could be considered. Additionally, it would be helpful to know if there are any other factors that could impact the performance of these models, such as the size of the training corpus or the specific hardware used for training.",4
4,"In the context of model training, how can the effectiveness of naive kernel fusion be increased in the standard attention implementation?\n\n","In the standard attention implementation, the intermediate values still need to be written to HBM to save for the backward pass, reducing the effectiveness of naive kernel fusion. To increase the effectiveness of naive kernel fusion, one approach is to use techniques such as weight pruning or quantization to reduce the memory requirements of the intermediate values, allowing them to be stored in the HBM without taking up too much memory. Another approach is to use techniques such as batch normalization or layer normalization to stabilize the training process and reduce the need for intermediate values to be written to HBM during the forward pass.","4. While the question is somewhat ambiguous regarding the specifics of how to increase the effectiveness of naive kernel fusion, the provided context provides enough information to give a general understanding of the problem and potential solutions.",4,3
5,"Given the table data, what is the training time speedup achieved by FlashAttention in comparison to the Huggingface implementation for GPT-2 small and medium models?\n\n","The table shows that for GPT-2 small and medium models, FlashAttention achieved a training time speedup of 3.5 and 3.0, respectively, compared to the Huggingface implementation.",4.5,,
6,How does the use of kernel fusion and tiling in the implementation of the FlashAttention algorithm enable faster computation and reduced HBM accesses?\n\n,"The use of kernel fusion and tiling in the implementation of the FlashAttention algorithm allows for faster computation and reduced HBM accesses by enabling all computation steps to be performed in one CUDA kernel, loading input from HBM, performing all the computation steps (matrix multiply, softmax, optionally masking and dropout, matrix multiply), then writing the result back to HBM. This avoids repeatedly reading and writing of inputs and outputs from and to HBM, which can slow down computation and increase memory usage. By minimizing the number of blocks required and reducing the amount of data that needs to be transferred between HBM and on-chip SRAM, the implementation of FlashAttention is able to achieve faster computation and reduced HBM accesses.","4.5\n\nThe use of kernel fusion and tiling in the implementation of the FlashAttention algorithm is described in detail in the context, and the algorithm itself is also provided. The context also explains how the use of these techniques enables faster computation and reduced HBM accesses. Therefore, the question is clearly and unambiguously answerable with the given context.",4,
7,How does the independent human evaluation of the leaderboard on <https://llmboxing.com/leaderboard> compare the performance of Mistral 7B and Llama 2 13B models in terms of preferred responses?\n\n,"According to the evaluation, the outputs generated by Mistral 7B were preferred 5020 times, compared to 4143 times for Llama 2 13B.","5.0\n\nThe question is clearly and unambiguously answerable with the given context, as the context provides all the necessary information to answer the question. The answer is based on the results of an independent human evaluation conducted on the leaderboard of <https://llmboxing.com/leaderboard> and provides a clear comparison of the performance of the two models in terms of preferred responses.",4,
8,Can you explain the difference between Mistral 7B – Instruct and Llama 2 Chat 13B in terms of their ability to classify user prompts and their generated answers as acceptable or falling into categories related to illegal activities or hateful harassment?\n\n,"The difference between Mistral 7B – Instruct and Llama 2 Chat 13B lies in their ability to classify user prompts and their generated answers as acceptable or falling into categories related to illegal activities or hateful harassment. Mistral 7B – Instruct is able to accurately classify user prompts or their generated answers as either acceptable or falling into one of the following categories: Illegal activities such as terrorism, child abuse or fraud; Hateful, harassing or violent content. On the other hand, Llama 2 Chat 13B declines to answer harmful questions even when system prompts are activated and does not provide a correct response for the question ""How to kill a linux process with system prompts activated"", but answers correctly when system prompts are deactivated.","4.5\n\nThe total rating is 4.5 because while Mistral 7B – Instruct has a better ability to classify harmful questions, the context is limited to a specific example and does not provide a comprehensive analysis of the models' performance in this regard.",,
9,How does the rolling buffer cache work in the context of the FlashAttention and xFormers models?\n\n,"The rolling buffer cache works by limiting the cache size to a fixed value of W. The keys and values for the timesteps i are stored in position imodW of the cache. When the position i is larger than W, past values in the cache are overwritten, and the size of the cache stops increasing. This allows for efficient use of memory and can improve the speed of the models.","3.5. While the context provides a good overview of the models and their attention mechanisms, the question asks for a more specific explanation of the rolling buffer cache, which may not be fully covered in the context.","4.\n\nThe rolling buffer cache is an important aspect of the Hugging Face models, and understanding how it works can help developers optimize their models for better performance. However, the question could be improved by providing more context about the FlashAttention and xFormers models, as well as examples of how the rolling buffer cache is used in practice.",3


In [36]:
generated_questions.head()

Unnamed: 0,context,question,answer,source_doc,groundedness_score,groundedness_eval,relevance_score,relevance_eval,standalone_score,standalone_eval
0,"algorithm in a considerably lower-level language than PyTorch, and requires signiﬁcant engineering eﬀort.\nImplementations may also not be transferrable across GPU architectures. These limitations suggest the\nneed for a method that supports writing attention algorithms in a high-level language (e.g., PyTorch), and\ncompiling to IO-aware implementations in CUDA—similar to eﬀorts such as Halide in image processing [ 70].\nIO-Aware Deep Learning. We believe that the IO-aware approach can extend beyond attention.\nAttention is the most memory-intensive computation in Transformers, but every layer in a deep network\ntouches GPU HBM. We hope our work inspires IO-aware implementations of additional modules. We discuss\nthese potential extensions in Appendix D.\nMulti-GPU IO-Aware Methods. Our IO-aware implementation of attention is optimal within con-\nstants for computing attention on a single GPU. However, the attention computation may be parallelizable",Can a high-level language such as PyTorch be used to write attention algorithms with IO-aware implementations in CUDA?\n\n,"Yes, a high-level language such as PyTorch can be used to write attention algorithms with IO-aware implementations in CUDA. This is because the IO-aware approach can extend beyond attention and can be applied to other modules in a deep network. The IO-aware implementation of attention is optimal within constants for computing attention on a single GPU, but it may not be parallelizable across multiple GPUs.",FlashAttention_ Fast and Memory-Efficient Exact Attention with IO-Awareness.pdf,4.5,"This question is asking about the possibility of using PyTorch to write attention algorithms with IO-aware implementations in CUDA. The context provides some background information on the limitations of writing attention algorithms in lower-level languages and the need for a high-level language approach. It also discusses the benefits of IO-aware implementations and their potential extensions to other modules. The question is specific to PyTorch and CUDA, and the context provides relevant information to answer the question.\n\n",4.0,The question is useful for machine learning developers building NLP applications with the Hugging Face ecosystem because it asks about the possibility of using a high-level language such as PyTorch to write attention algorithms with IO-aware implementations in CUDA. This is an important consideration for developers who want to optimize their NLP models for efficient execution on GPUs.\n\n,4.5,\nThe question is context-independent and does not require any additional information to be understood. It refers to the technical capabilities of PyTorch in implementing attention algorithms with IO-aware implementations in CUDA.\n\n
1,"cations. In this section, we highlight how to leverage system prompting to optionally enforce output\nconstraints on top of our models. Additionally, we showcase the ability of Mistral 7B to perform\nfine-grained content moderation, which can be useful to enforce quality content in applications.\n5.1 System prompt to enforce guardrails\nWe introduce a system prompt (see below) to guide the model to generate answers within specified\nguardrails, similar to the work done with Llama 2. Using this prompt allows the user to move on the\nPareto front of model utility / guardrails enforcement, as indicated in Table 4.\nAlways assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful,\nunethical, prejudiced, or negative content. Ensure replies promote fairness and positivity.\nGuardrails MT Bench\nNo system prompt 6.84 ±0.07\nLlama 2 system prompt 6.38 ±0.07\nMistral system prompt 6.58 ±0.05\nTable 4: System prompts. Mean official\nMT Bench score over 10 iterations with",Can Mistral 7B perform fine-grained content moderation using a system prompt to enforce guardrails?\n\n,"Yes, according to the context, Mistral 7B can perform fine-grained content moderation using a system prompt to enforce guardrails. This is indicated in Table 4, where the mean official MT Bench score for Mistral system prompt is 6.58 ±0.05, which is higher than the score for no system prompt (6.84 ±0.07) and Llama 2 system prompt (6.38 ±0.07).",Mistral7B.pdf,,,,,,
2,"generation. Furthermore, Mistral 7B approaches the coding performance of Code-Llama 7B [ 20],\nwithout sacrificing performance on non-code related benchmarks.\nMistral 7B leverages grouped-query attention (GQA) [ 1], and sliding window attention (SWA) [ 6,3].\nGQA significantly accelerates the inference speed, and also reduces the memory requirement during\ndecoding, allowing for higher batch sizes hence higher throughput, a crucial factor for real-time\napplications. In addition, SWA is designed to handle longer sequences more effectively at a reduced\ncomputational cost, thereby alleviating a common limitation in LLMs. These attention mechanisms\ncollectively contribute to the enhanced performance and efficiency of Mistral 7B.arXiv:2310.06825v1 [cs.CL] 10 Oct 2023Mistral 7B is released under the Apache 2.0 license. This release is accompanied by a reference\nimplementation1facilitating easy deployment either locally or on cloud platforms such as AWS, GCP,","Given the context, what are the attention mechanisms used by Mistral 7B and how do they contribute to its enhanced performance and efficiency?\n\n","Mistral 7B uses grouped-query attention (GQA) and sliding window attention (SWA) as its attention mechanisms. GQA significantly accelerates the inference speed and reduces the memory requirement during decoding, allowing for higher batch sizes and higher throughput, which is crucial for real-time applications. SWA is designed to handle longer sequences more effectively at a reduced computational cost, thereby alleviating a common limitation in LLMs. These attention mechanisms collectively contribute to the enhanced performance and efficiency of Mistral 7B.",Mistral7B.pdf,4.5,"Mistral 7B is a large language model that uses grouped-query attention (GQA) and sliding window attention (SWA) as attention mechanisms. GQA significantly accelerates the inference speed and reduces the memory requirement during decoding, allowing for higher batch sizes and higher throughput. SWA is designed to handle longer sequences more effectively at a reduced computational cost, thereby alleviating a common limitation in LLMs. These attention mechanisms collectively contribute to the enhanced performance and efficiency of Mistral 7B.\n\n",4.0,"The given question is useful for machine learning developers building NLP applications with the Hugging Face ecosystem because it asks about the attention mechanisms used by Mistral 7B and how they contribute to its enhanced performance and efficiency. Attention mechanisms are an important aspect of NLP models, and understanding how they work can help developers improve the performance of their own models. Additionally, Mistral 7B is a state-of-the-art NLP model that has achieved impressive results on various benchmarks, so exploring its attention mechanisms can provide valuable insights into best practices for building high-performing NLP models.\n\n",4.5,"\nThe question is asking for information about the attention mechanisms used by Mistral 7B and how they contribute to its performance and efficiency. It is not explicitly stated what context is being referred to, so it is likely that the question is context-independent. \n\n"
3,"𝑁\n𝐵𝑐m\nblocks\nK1K𝑇𝑐andV1V𝑇𝑐, of size𝐵𝑐𝑑each.\n4:Divide Ointo𝑇𝑟blocks O𝑖O𝑇𝑟of size𝐵𝑟𝑑each, divide ℓinto𝑇𝑟blocksℓ𝑖ℓ𝑇𝑟of size𝐵𝑟each,\ndivide𝑚into𝑇𝑟blocks𝑚1𝑚𝑇𝑟of size𝐵𝑟each.\n5:for1𝑗𝑇𝑐do\n6:Load K𝑗V𝑗from HBM to on-chip SRAM.\n7:for1𝑖𝑇𝑟do\n8:Load Q𝑖O𝑖ℓ𝑖𝑚𝑖from HBM to on-chip SRAM.\n9:On chip, compute S𝑖𝑗=Q𝑖K𝑇\n𝑗2R𝐵𝑟𝐵𝑐.\n10:On chip, compute ~𝑚𝑖𝑗=rowmax¹S𝑖𝑗º 2R𝐵𝑟,~P𝑖𝑗=exp¹S𝑖𝑗�~𝑚𝑖𝑗º 2R𝐵𝑟𝐵𝑐(pointwise), ~ℓ𝑖𝑗=\nrowsum¹~P𝑖𝑗º2R𝐵𝑟.\n11:On chip, compute 𝑚new\n𝑖=max¹𝑚𝑖~𝑚𝑖𝑗º2R𝐵𝑟,ℓnew\n𝑖=𝑒𝑚𝑖�𝑚new\n𝑖ℓ𝑖¸𝑒~𝑚𝑖𝑗�𝑚new\n𝑖~ℓ𝑖𝑗2R𝐵𝑟.\n12:Write O𝑖 diag¹ℓnew\n𝑖º�1¹diag¹ℓ𝑖º𝑒𝑚𝑖�𝑚new\n𝑖O𝑖¸𝑒~𝑚𝑖𝑗�𝑚new\n𝑖~P𝑖𝑗V𝑗ºto HBM.\n13:Writeℓ𝑖 ℓnew\n𝑖,𝑚𝑖 𝑚new\n𝑖to HBM.\n14:end for\n15:end for\n16:Return O.\nWe show FlashAttention ’s correctness, runtime, and memory requirement (proof in Appendix C).\nTheorem 1. Algorithm 1 returns O=softmax¹QK>ºVwith𝑂¹𝑁2𝑑ºFLOPs and requires 𝑂¹𝑁ºadditional\nmemory beyond inputs and output.\n3.2 Analysis: IO Complexity of FlashAttention","Suppose K1, K2, K3, K4, K5, K6, K7, K8, K9, K10, K11, K12, K13, K14, K15, K16, K17, K18, K19, K20, K21, K22, K23, K24, K25, K26, K27, K28, K29, K30, K31, K32, K33, K34, K35, K36, K37, K38, K39, K40, K41, K42, K43, K44, K45, K46, K47, K48, K49, K50, K51, K52, K53, K54, K55, K56, K57, K58, K59, K60, K61, K62, K63, K64, K65, K66, K67, K68, K69, K70, K71, K72, K73, K74, K75, K76, K77, K78, K79, K80, K81, K82, K83, K84, K85, K86, K87, K88, K89, K90, K91, K92, K93, K94, K95, K96, K97, K98, K99, K100, K101, K102, K103, K104, K105, K106, K107, K108, K109, K110, K111, K112, K113, K114, K115, K116, K117, K118, K119, K120, K121, K122, K123, K124, K125, K126, K127, K128, K129, K130, K131, K132, K133, K134, K135, K136, K137, K138, K139, K140, K141, K142, K143, K144, K145, K146, K147, K148, K149, K150, K151, K152, K153, K154, K155, K156, K157, K158, K159, K160, K161, K162, K163, K164, K165, K166, K167, K168, K169, K170, K171, K172, K173, K174, K175, K176, K177, K178, K179, K180, K181, K182, K183, K184, K185, K186, K187, K188, K189, K190, K191, K192, K193, K194, K195, K196, K197, K198, K199, K200, K201, K202, K203, K204, K205, K206, K207, K208, K209, K210, K211, K212, K213, K214, K215, K216, K217, K218, K219, K220, K221, K222, K223, K224, K225, K22","(your answer to the deep question)\n\nNow here is the context.\n\nContext: 𝑁\n𝐵𝑐m\nblocks\nK1K𝑇𝑐andV1V𝑇𝑐, of size𝐵𝑐𝑑each.\n4:Divide Ointo𝑇𝑟blocks O𝑖O𝑇𝑟of size𝐵𝑟𝑑each, divide ℓinto𝑇𝑟blocksℓ𝑖ℓ𝑇𝑟of size𝐵𝑟each,\ndivide𝑚into𝑇𝑟blocks𝑚1𝑚𝑇𝑟of size𝐵𝑟each.\n5:for1𝑗𝑇𝑐do\n6:Load K𝑗V𝑗from HBM to on-chip SRAM.\n7:for1𝑖𝑇𝑟do\n8:Load Q𝑖O𝑖ℓ𝑖𝑚𝑖from HBM to on-chip SRAM.\n9:On chip, compute S𝑖𝑗=Q𝑖K𝑇\n𝑗2R𝐵𝑟𝐵𝑐.\n10:On chip, compute ~𝑚𝑖𝑗=rowmax¹S𝑖𝑗º 2R𝐵𝑟,~P𝑖𝑗=exp¹S𝑖𝑗�~𝑚𝑖𝑗º 2R𝐵𝑟𝐵𝑐(pointwise), ~ℓ𝑖𝑗=\nrowsum¹~P𝑖𝑗º2R𝐵𝑟.\n11:On chip, compute 𝑚new\n𝑖=max¹𝑚𝑖~𝑚𝑖𝑗º2R𝐵𝑟,ℓnew\n𝑖=𝑒𝑚𝑖�𝑚new\n𝑖ℓ𝑖¸𝑒~𝑚𝑖𝑗�𝑚new\n𝑖~ℓ𝑖𝑗2R𝐵𝑟.\n12:Write O𝑖 diag¹ℓnew\n𝑖º�1¹diag¹ℓ𝑖º𝑒𝑚𝑖�𝑚new\n𝑖O𝑖¸𝑒~𝑚𝑖𝑗�𝑚new\n𝑖~P𝑖𝑗V𝑗ºto HBM.\n13:Writeℓ𝑖 ℓnew\n𝑖,𝑚𝑖 𝑚new\n𝑖to HBM.\n14:end for\n15:end for\n16:Return O.\nWe show FlashAttention ’s correctness, runtime, and memory requirement (proof in Appendix C).\nTheorem 1. Algorithm 1 returns O=softmax¹QK>ºVwith𝑂¹𝑁2𝑑ºFLOPs and requires 𝑂¹𝑁ºadditional\nmemory beyond inputs and output.\n3.2 Analysis: IO Complexity of FlashAttention\n\nOutput::: [/INST] Deep question: Suppose K1, K2, K3, K4, K5, K6, K7, K8, K9, K10, K11, K12, K13, K14, K15, K16, K17, K18, K19, K20, K21, K22, K23, K24, K25, K26, K27, K28, K29, K30, K31, K32, K33, K34, K35, K36, K37, K38, K39, K40, K41, K42, K43, K44, K45, K46, K47, K48, K49, K50, K51, K52, K53, K54, K55, K56, K57, K58, K59, K60, K61, K62, K63, K64, K65, K66, K67, K68, K69, K70, K71, K72, K73, K74, K75, K76, K77, K78, K79, K80, K81, K82, K83, K84, K85, K86, K87, K88, K89, K90, K91, K92, K93, K94, K95, K96, K97, K98, K99, K100, K101, K102, K103, K104, K105, K106, K107, K108, K109, K110, K111, K112, K113, K114, K115, K116, K117, K118, K119, K120, K121, K122, K123, K124, K125, K126, K127, K128, K129, K130, K131, K132, K133, K134, K135, K136, K137, K138, K139, K140, K141, K142, K143, K144, K145, K146, K147, K148, K149, K150, K151, K152, K153, K154, K155, K156, K157, K158, K159, K160, K161, K162, K163, K164, K165, K166, K167, K168, K169, K170, K171, K172, K173, K174, K175, K176, K177, K178, K179, K180, K181, K182, K183, K184, K185, K186, K187, K188, K189, K190, K191, K192, K193, K194, K195, K196, K197, K198, K199, K200, K201, K202, K203, K204, K205, K206, K207, K208, K209, K210, K211, K212, K213, K214, K215, K216, K217, K218, K219, K220, K221, K222, K223, K224, K225, K22",FlashAttention_ Fast and Memory-Efficient Exact Attention with IO-Awareness.pdf,,,,,,
4,"improves performance on the MIMIC-III [ 47] and ECtHR [ 6,7] datasets. MIMIC-III contains intensive care\nunit patient discharge summaries, each annotated with multiple labels. ECtHR contains legal cases from the\n3LRA accuracy results are known to be highly dependent on the tuning procedure [ 90]. Our reproduced baselines perform\nbetter than as reported in the original comparison [80].\n8Attention Memory Usage\nSequence LengthAttention Runtime (Fwd Pass + Bwd Pass)\nSequence LengthRuntime (ms)\nMemory Footprint (GB)256 8K 16K 32K 64K 128 256 512 1024 2048 4096101102\n1020\nFlashAttention\nBlock-Sparse FlashAttentionPyTorch Attention\nMegatron AttentionLinformer Attention\nOpenAI Sparse Attention8192100Crossover Points\n20x2xFigure 3: Left:runtime of forward pass + backward pass. Right:attention memory usage.\nEuropean Court of Human Rights, each of which is mapped to articles of the Convention of Human Rights","Given the context, what is the relationship between the attention memory usage and sequence length in the MIMIC-III and ECtHR datasets?\n\n","The attention memory usage and sequence length are positively correlated in both the MIMIC-III and ECtHR datasets. As the sequence length increases, the attention memory usage also increases. This is evident from the graph in Figure 3, which shows that the attention memory usage increases as the sequence length increases.",FlashAttention_ Fast and Memory-Efficient Exact Attention with IO-Awareness.pdf,,,,,,


Now our synthetic evaluation dataset is complete! We can evaluate different RAG systems on this evaluation dataset.

We have generated only a few QA couples here to reduce time and cost. But let's kick start the next part by loading a pre-generated dataset:

In [13]:
#eval_dataset = datasets.load_dataset("m-ric/huggingface_doc_qa_eval", split="train")

Downloading readme:   0%|          | 0.00/915 [00:00<?, ?B/s]

Downloading data: 100%|██████████| 297k/297k [00:00<00:00, 990kB/s]


Generating train split:   0%|          | 0/67 [00:00<?, ? examples/s]

# 2. Build our RAG System

### 2.1. Preprocessing documents to build our vector database

- In this part, __we split the documents from our knowledge base into smaller chunks__: these will be the snippets that are picked by the Retriever, to then be ingested by the Reader LLM as supporting elements for its answer.
- The goal is to build semantically relevant snippets: not too small to be sufficient for supporting an answer, and not too large too avoid diluting individual ideas.

Many options exist for text splitting:
- split every `n` words / characters, but this has the risk of cutting in half paragraphs or even sentences
- split after `n` words / character, but only on sentence boundaries
- **recursive split** tries to preserve even more of the document structure, by processing it tree-like way, splitting first on the largest units (chapters) then recursively splitting on smaller units (paragraphs, sentences).

To learn more about chunking, I recommend you read [this great notebook](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb) by Greg Kamradt.

[This space](https://huggingface.co/spaces/m-ric/chunk_visualizer) lets you visualize how different splitting options affect the chunks you get.

> In the following, we use Langchain's `RecursiveCharacterTextSplitter`.

💡 _To measure chunk length in our Text Splitter, our length function will not be the count of characters, but the count of tokens in the tokenized text: indeed, for subsequent embedder that processes token, measuring length in tokens is more relevant and empirically performs better._

In [14]:
from langchain.docstore.document import Document as LangchainDocument

RAW_KNOWLEDGE_BASE = [
    LangchainDocument(page_content=doc["text"], metadata={"source": doc["source"]})
    for doc in tqdm(ds)
]

  0%|          | 0/2647 [00:00<?, ?it/s]

In [15]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer


def split_documents(
    chunk_size: int,
    knowledge_base: List[LangchainDocument],
    tokenizer_name: str,
) -> List[LangchainDocument]:
    """
    Split documents into chunks of size `chunk_size` characters and return a list of documents.
    """
    text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
        AutoTokenizer.from_pretrained(tokenizer_name),
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size / 10),
        add_start_index=True,
        strip_whitespace=True,
        separators=["\n\n", "\n", ".", " ", ""],
    )

    docs_processed = []
    for doc in knowledge_base:
        docs_processed += text_splitter.split_documents([doc])

    # Remove duplicates
    unique_texts = {}
    docs_processed_unique = []
    for doc in docs_processed:
        if doc.page_content not in unique_texts:
            unique_texts[doc.page_content] = True
            docs_processed_unique.append(doc)

    return docs_processed_unique

### 2.2. Retriever - embeddings 🗂️
The __retriever acts like an internal search engine__: given the user query, it returns the most relevant documents from your knowledge base.

> For the knowledge base, we use Langchain vector databases since __it offers a convenient [FAISS](https://github.com/facebookresearch/faiss) index and allows us to keep document metadata throughout the processing__.

🛠️ __Options included:__

- Tune the chunking method:
    - Size of the chunks
    - Method: split on different separators, use [semantic chunking](https://python.langchain.com/docs/modules/data_connection/document_transformers/semantic-chunker)...
- Change the embedding model

In [16]:
from langchain.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.utils import DistanceStrategy
import os


def load_embeddings(
    langchain_docs: List[LangchainDocument],
    chunk_size: int,
    embedding_model_name: Optional[str] = "thenlper/gte-small",
) -> FAISS:
    """
    Creates a FAISS index from the given embedding model and documents. Loads the index directly if it already exists.

    Args:
        langchain_docs: list of documents
        chunk_size: size of the chunks to split the documents into
        embedding_model_name: name of the embedding model to use

    Returns:
        FAISS index
    """
    # load embedding_model
    embedding_model = HuggingFaceEmbeddings(
        model_name=embedding_model_name,
        multi_process=True,
        model_kwargs={"device": "cuda"},
        encode_kwargs={
            "normalize_embeddings": True
        },  # set True to compute cosine similarity
    )

    # Check if embeddings already exist on disk
    index_name = (
        f"index_chunk:{chunk_size}_embeddings:{embedding_model_name.replace('/', '~')}"
    )
    index_folder_path = f"./data/indexes/{index_name}/"
    if os.path.isdir(index_folder_path):
        return FAISS.load_local(
            index_folder_path,
            embedding_model,
            distance_strategy=DistanceStrategy.COSINE,
        )

    else:
        print("Index not found, generating it...")
        docs_processed = split_documents(
            chunk_size,
            langchain_docs,
            embedding_model_name,
        )
        knowledge_index = FAISS.from_documents(
            docs_processed, embedding_model, distance_strategy=DistanceStrategy.COSINE
        )
        knowledge_index.save_local(index_folder_path)
        return knowledge_index

### 2.3. Reader - LLM 💬

In this part, the __LLM Reader reads the retrieved documents to formulate its answer.__

🛠️ Here we tried the following options to improve results:
- Switch reranking on/off
- Change the reader model

In [17]:
RAG_PROMPT_TEMPLATE = """
<|system|>
Using the information contained in the context,
give a comprehensive answer to the question.
Respond only to the question asked, response should be concise and relevant to the question.
Provide the number of the source document when relevant.
If the answer cannot be deduced from the context, do not give an answer.</s>
<|user|>
Context:
{context}
---
Now here is the question you need to answer.

Question: {question}
</s>
<|assistant|>
"""

In [20]:
from langchain_community.llms import HuggingFaceHub

repo_id = "HuggingFaceH4/zephyr-7b-beta"
READER_MODEL_NAME = "zephyr-7b-beta"

READER_LLM = HuggingFaceHub(
    repo_id=repo_id,
    task="text-generation",
    model_kwargs={
        "max_new_tokens": 512,
        "top_k": 30,
        "temperature": 0.1,
        "repetition_penalty": 1.03,
    },
)

ValidationError: 1 validation error for HuggingFaceHub
__root__
  Did not find huggingfacehub_api_token, please add an environment variable `HUGGINGFACEHUB_API_TOKEN` which contains it, or pass `huggingfacehub_api_token` as a named parameter. (type=value_error)

In [None]:
from ragatouille import RAGPretrainedModel
from langchain_core.vectorstores import VectorStore
from langchain_core.language_models.llms import LLM


def answer_with_rag(
    question: str,
    llm: LLM,
    knowledge_index: VectorStore,
    reranker: Optional[RAGPretrainedModel] = None,
    num_retrieved_docs: int = 30,
    num_docs_final: int = 7,
) -> Tuple[str, List[LangchainDocument]]:
    """Answer a question using RAG with the given knowledge index."""
    # Gather documents with retriever
    relevant_docs = knowledge_index.similarity_search(
        query=question, k=num_retrieved_docs
    )
    relevant_docs = [doc.page_content for doc in relevant_docs]  # keep only the text

    # Optionally rerank results
    if reranker:
        relevant_docs = reranker.rerank(question, relevant_docs, k=num_docs_final)
        relevant_docs = [doc["content"] for doc in relevant_docs]

    relevant_docs = relevant_docs[:num_docs_final]

    # Build the final prompt
    context = "\nExtracted documents:\n"
    context += "".join(
        [f"Document {str(i)}:::\n" + doc for i, doc in enumerate(relevant_docs)]
    )

    final_prompt = RAG_PROMPT_TEMPLATE.format(question=question, context=context)

    # Redact an answer
    answer = llm(final_prompt)

    return answer, relevant_docs

# 3. Benchmarking the RAG system

The RAG system and the evaluation datasets are now ready. The last step is to judge the RAG system's output on this evlauation dataset.

To this end, __we setup a judge agent__. ⚖️🤖

Out of [the different RAG evaluation metrics](https://docs.ragas.io/en/latest/concepts/metrics/index.html), we choose to focus only on faithfulness since it the best end-to-end metric of our system's performance.

> We use GPT4 as a judge for its empirically good performance, but you could try with other models such as [kaist-ai/prometheus-13b-v1.0](https://huggingface.co/kaist-ai/prometheus-13b-v1.0) or [BAAI/JudgeLM-33B-v1.0](https://huggingface.co/BAAI/JudgeLM-33B-v1.0).

💡 _In the evaluation prompt, we give a detailed description each metric on the scale 1-5, as is done in [Prometheus's prompt template](https://huggingface.co/kaist-ai/prometheus-13b-v1.0): this helps the model ground its metric precisely. If instead you give the judge LLM a vague scale to work with, the outputs will not be consistent enough between different examples._

💡 _Again, prompting the LLM to output rationale before giving its final score gives it more tokens to help it formalize and elaborate a judgement._

In [None]:
def run_rag_tests(
    eval_dataset: datasets.Dataset,
    llm: BaseChatModel,
    knowledge_index: VectorStore,
    output_file: str,
    reranker: Optional[RAGPretrainedModel] = None,
    verbose: Optional[bool] = True,
    test_settings: Optional[str] = None,  # To document the test settings used
):
    """Runs RAG tests on the given dataset and saves the results to the given output file."""
    try:  # load previous generations if they exist
        with open(output_file, "r") as f:
            outputs = json.load(f)
    except:
        outputs = []

    for example in tqdm(eval_dataset):
        question = example["question"]
        if question in [output["question"] for output in outputs]:
            continue

        answer, relevant_docs = answer_with_rag(
            question, llm, knowledge_index, reranker=reranker
        )
        if verbose:
            print("=======================================================")
            print(f"Question: {question}")
            print(f"Answer: {answer}")
            print(f'True answer: {example["answer"]}')
        result = {
            "question": question,
            "true_answer": example["answer"],
            "source_doc": example["source_doc"],
            "generated_answer": answer,
            "retrieved_docs": [doc for doc in relevant_docs],
        }
        if test_settings:
            result["test_settings"] = test_settings
        outputs.append(result)

        with open(output_file, "w") as f:
            json.dump(outputs, f)

In [None]:
EVALUATION_PROMPT = """###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: {{write a feedback for criteria}} [RESULT] {{an integer number between 1 and 5}}\"
4. Please do not generate any other opening, closing, and explanations. Be sure to include [RESULT] in your output.

###The instruction to evaluate:
{instruction}

###Response to evaluate:
{response}

###Reference Answer (Score 5):
{reference_answer}

###Score Rubrics:
[Is the response correct, accurate, and factual based on the reference answer?]
Score 1: The response is completely incorrect, inaccurate, and/or not factual.
Score 2: The response is mostly incorrect, inaccurate, and/or not factual.
Score 3: The response is somewhat correct, accurate, and/or factual.
Score 4: The response is mostly correct, accurate, and factual.
Score 5: The response is completely correct, accurate, and factual.

###Feedback:"""

from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.schema import SystemMessage


evaluation_prompt_template = ChatPromptTemplate.from_messages(
    [
        SystemMessage(content="You are a fair evaluator language model."),
        HumanMessagePromptTemplate.from_template(EVALUATION_PROMPT),
    ]
)

In [None]:
from langchain.chat_models import ChatOpenAI

eval_chat_model = ChatOpenAI(model="gpt-4-1106-preview", temperature=0)
evaluator_name = "GPT4"


def evaluate_answers(
    answer_path: str,
    eval_chat_model: BaseChatModel,
    evaluator_name: str,
    evaluation_prompt_template: ChatPromptTemplate,
) -> None:
    """Evaluates generated answers. Modifies the given answer file in place for better checkpointing."""
    answers = []
    if os.path.isfile(answer_path):  # load previous generations if they exist
        answers = json.load(open(answer_path, "r"))

    for experiment in tqdm(answers):
        if f"eval_score_{evaluator_name}" in experiment:
            continue

        eval_prompt = evaluation_prompt_template.format_messages(
            instruction=experiment["question"],
            response=experiment["generated_answer"],
            reference_answer=experiment["true_answer"],
        )
        eval_result = eval_chat_model.invoke(eval_prompt)
        feedback, score = [
            item.strip() for item in eval_result.content.split("[RESULT]")
        ]
        experiment[f"eval_score_{evaluator_name}"] = score
        experiment[f"eval_feedback_{evaluator_name}"] = feedback

        with open(answer_path, "w") as f:
            json.dump(answers, f)

🚀 Let's run the tests and evaluate answers!👇

In [None]:
if not os.path.exists("./output"):
    os.mkdir("./output")

for chunk_size in [200]:  # Add other chunk sizes (in tokens) as needed
    for embeddings in ["thenlper/gte-small"]:  # Add other embeddings as needed
        for rerank in [True, False]:
            settings_name = f"chunk:{chunk_size}_embeddings:{embeddings.replace('/', '~')}_rerank:{rerank}_reader-model:{READER_MODEL_NAME}"
            output_file_name = f"./output/rag_{settings_name}.json"

            print(f"Running evaluation for {settings_name}:")

            print("Loading knowledge base embeddings...")
            knowledge_index = load_embeddings(
                RAW_KNOWLEDGE_BASE,
                chunk_size=chunk_size,
                embedding_model_name=embeddings,
            )

            print("Running RAG...")
            reranker = (
                RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
                if rerank
                else None
            )
            run_rag_tests(
                eval_dataset=eval_dataset,
                llm=READER_LLM,
                knowledge_index=knowledge_index,
                output_file=output_file_name,
                reranker=reranker,
                verbose=False,
                test_settings=settings_name,
            )

            print("Running evaluation...")
            evaluate_answers(
                output_file_name,
                eval_chat_model,
                evaluator_name,
                evaluation_prompt_template,
            )

### Inspect results

In [None]:
import glob

outputs = []
for file in glob.glob("./output/*.json"):
    output = pd.DataFrame(json.load(open(file, "r")))
    output["settings"] = file
    outputs.append(output)
result = pd.concat(outputs)

In [None]:
result["eval_score_GPT4"] = result["eval_score_GPT4"].apply(
    lambda x: int(x) if isinstance(x, str) else 1
)
result["eval_score_GPT4"] = (result["eval_score_GPT4"] - 1) / 4

In [None]:
average_scores = result.groupby("settings")["eval_score_GPT4"].mean()
average_scores.sort_values()

settings
./output/rag_chunk:200_embeddings:thenlper~gte-small_rerank:False_reader-model:zephyr-7b-beta.json       0.884328
./output/rag_chunk:200_embeddings:BAAI~bge-base-en-v1.5_rerank:False_reader-model:zephyr-7b-beta.json    0.906716
./output/rag_chunk:200_embeddings:BAAI~bge-base-en-v1.5_rerank:True_reader-model:zephyr-7b-beta.json     0.906716
./output/rag_chunk:200_embeddings:thenlper~gte-small_rerank:True_reader-model:mixtral.json               0.906716
./output/rag_chunk:200_embeddings:thenlper~gte-small_rerank:True_reader-model:zephyr-7b-beta.json        0.921642
./output/rag_chunk:200_embeddings:thenlper~gte-small_rerank:True_reader-model:mixtral0.json              0.947761
Name: eval_score_GPT4, dtype: float64

## Example results

Let us load the results that I obtained by tweaking the different options available in this notebook.
For more detail on why these options could work on not, see the notebook on [advanced_RAG](advanced_rag).

As you can see in the graph below, some tweaks do not bring any improvement, some give huge performance boosts.

➡️ ___There is no single good recipe: you should try several different directions when tuning your RAG systems.___


In [None]:
import plotly.express as px

scores = datasets.load_dataset("m-ric/rag_scores_cookbook", split="train")
scores = pd.Series(scores["score"], index=scores["settings"])

In [None]:
fig = px.bar(
    scores,
    color=scores,
    labels={
        "value": "Accuracy",
        "settings": "Configuration",
    },
    color_continuous_scale="bluered",
)
fig.update_layout(w
    width=1000,
    height=600,
    barmode="group",
    yaxis_range=[0, 100],
    title="<b>Accuracy of different RAG configurations</b>",
    xaxis_title="RAG settings",
    font=dict(size=15),
)
fig.layout.yaxis.ticksuffix = "%"
fig.update_coloraxes(showscale=False)
fig.update_traces(texttemplate="%{y:.1f}", textposition="outside")
fig.show()

<img src="https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/RAG_settings_accuracy.png" height="500" width="800">

As you can see, these had varying impact on performance. In particular, tuning the chunk size is both easy and very impactful.

But this is our case: your results could be very different: now that you have a robust evaluation pipeline, you can set on to explore other options! 🗺️