# Agentic RAG: turbocharge your RAG with query reformulation and self-query! 🚀
_Authored by: [Aymeric Roucher](https://huggingface.co/m-ric)_

> This tutorial is advanced. You should have notions from [this other cookbook](advanced_rag) first!

> Reminder: Retrieval-Augmented-Generation (RAG) is “using an LLM to answer a user query, but basing the answer on information retrieved from a knowledge base”. It has many advantages over using a vanilla or fine-tuned LLM: to name a few, it allows to ground the answer on true facts and reduce confabulations, it allows to provide the LLM with domain-specific knowledge, and it allows fine-grained control of access to information from the knowledge base.

But vanilla RAG has limitations, most importantly these two:
- It **performs only one retrieval step**: if the results are bad, the generation in turn will be bad.
- __Semantic similarity is computed with the *user query* as a reference__, which might be suboptimal: for instance, the user query will often be a question and the document containing the true answer will be in affirmative voice, so its similarity score will be downgraded compared to other source documents in the interrogative form, leading to a risk of missing the relevant information.

But we can alleviate these problems by making a **RAG agent: very simply, an agent armed with a retriever tool!**

This agent will: ✅ Formulate the query itself and ✅ Critique to re-retrieve if needed.

So it should naively recover some advanced RAG techniques!
- Instead of directly using the user query as the reference in semantic search, the agent formulates itself a reference sentence that can be closer to the targeted documents, as in [HyDE](https://huggingface.co/papers/2212.10496)
- The agent can the generated snippets and re-retrieve if needed, as in [Self-Query](https://docs.llamaindex.ai/en/stable/examples/evaluation/RetryQuery/)

Let's build this system. 🛠️

Run the line below to install required dependencies:

In [1]:
!pip install -qqq datasets

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/485.4 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m485.4/485.4 kB[0m [31m17.6 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/116.3 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m9.0 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/143.5 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m143.5/143.5 kB[0m [31m11.9 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/194.8 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m194.8/194.8 kB[0m [31m13.6 MB/s[0m eta [36m0:00:00[0m
[?25h

In [2]:
!pip install pandas langchain langchain-community sentence-transformers faiss-cpu smolagents --upgrade -q

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m89.9/89.9 kB[0m [31m8.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.1/13.1 MB[0m [31m65.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.5/2.5 MB[0m [31m64.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m30.7/30.7 MB[0m [31m65.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m101.8/101.8 kB[0m [31m9.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m363.4/363.4 MB[0m [31m4.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.8/13.8 MB[0m [31m67.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.6/24.6 MB[0m [31m46.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Let's login in order to call the HF Inference API:

In [13]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

We first load a knowledge base on which we want to perform RAG: this dataset is a compilation of the documentation pages for many `huggingface` packages, stored as markdown.

In [3]:
import datasets

knowledge_base = datasets.load_dataset("m-ric/huggingface_doc", split="train")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/21.0 [00:00<?, ?B/s]

huggingface_doc.csv:   0%|          | 0.00/22.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/2647 [00:00<?, ? examples/s]

Now we prepare the knowledge base by processing the dataset and storing it into a vector database to be used by the retriever.

We use [LangChain](https://python.langchain.com/) for its excellent vector database utilities.
For the embedding model, we use [thenlper/gte-small](https://huggingface.co/thenlper/gte-small) since it performed well in our `RAG_evaluation` cookbook.

In [4]:
from tqdm import tqdm
from transformers import AutoTokenizer
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.utils import DistanceStrategy

source_docs = [
    Document(page_content=doc["text"], metadata={"source": doc["source"].split("/")[1]})
    for doc in knowledge_base
]

text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    AutoTokenizer.from_pretrained("thenlper/gte-small"),
    chunk_size=200,
    chunk_overlap=20,
    add_start_index=True,
    strip_whitespace=True,
    separators=["\n\n", "\n", ".", " ", ""],
)

# Split docs and keep only unique ones
print("Splitting documents...")
docs_processed = []
unique_texts = {}
for doc in tqdm(source_docs):
    new_docs = text_splitter.split_documents([doc])
    for new_doc in new_docs:
        if new_doc.page_content not in unique_texts:
            unique_texts[new_doc.page_content] = True
            docs_processed.append(new_doc)

print(
    "Embedding documents... This should take a few minutes (5 minutes on MacBook with M1 Pro)"
)
embedding_model = HuggingFaceEmbeddings(model_name="thenlper/gte-small",model_kwargs={"device": "cuda"})
vectordb = FAISS.from_documents(
    documents=docs_processed,
    embedding=embedding_model,
    distance_strategy=DistanceStrategy.COSINE,
)

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

tokenizer_config.json:   0%|          | 0.00/394 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

Splitting documents...


100%|██████████| 2647/2647 [01:25<00:00, 30.78it/s]
  embedding_model = HuggingFaceEmbeddings(model_name="thenlper/gte-small",model_kwargs={"device": "cuda"})


Embedding documents... This should take a few minutes (5 minutes on MacBook with M1 Pro)


modules.json:   0%|          | 0.00/385 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/68.1k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/57.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/583 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/66.7M [00:00<?, ?B/s]

1_Pooling%2Fconfig.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Now the database is ready: let’s build our agentic RAG system!

👉 We only need a `RetrieverTool` that our agent can leverage to retrieve information from the knowledge base.

Since we need to add a vectordb as an attribute of the tool, we cannot simply use the [simple tool constructor](https://huggingface.co/docs/transformers/main/en/agents#create-a-new-tool) with a `@tool` decorator: so we will follow the advanced setup highlighted in the [advanced agents documentation](https://huggingface.co/docs/transformers/main/en/agents_advanced#directly-define-a-tool-by-subclassing-tool-and-share-it-to-the-hub).

In [5]:
from smolagents import Tool
from langchain_core.vectorstores import VectorStore


class RetrieverTool(Tool):
    name = "retriever"
    description = "Using semantic similarity, retrieves some documents from the knowledge base that have the closest embeddings to the input query."
    inputs = {
        "query": {
            "type": "string",
            "description": "The query to perform. This should be semantically close to your target documents. Use the affirmative form rather than a question.",
        }
    }
    output_type = "string"

    def __init__(self, vectordb: VectorStore, **kwargs):
        super().__init__(**kwargs)
        self.vectordb = vectordb

    def forward(self, query: str) -> str:
        assert isinstance(query, str), "Your search query must be a string"

        docs = self.vectordb.similarity_search(
            query,
            k=7,
        )

        return "\nRetrieved documents:\n" + "".join(
            [
                f"===== Document {str(i)} =====\n" + doc.page_content
                for i, doc in enumerate(docs)
            ]
        )

Now it’s straightforward to create an agent that leverages this tool!

The agent will need these arguments upon initialization:
- *`tools`*: a list of tools that the agent will be able to call.
- *`model`*: the LLM that powers the agent.

Our `model` must be a callable that takes as input a list of [messages](https://huggingface.co/docs/transformers/main/chat_templating) and returns text. It also needs to accept a `stop_sequences` argument that indicates when to stop its generation. For convenience, we directly use the `HfApiModel` class provided in the package to get a LLM engine that calls our [Inference API](https://huggingface.co/docs/api-inference/en/index).

And we use [meta-llama/Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct), served for free on Hugging Face's Inference API!

_Note:_ The Inference API hosts models based on various criteria, and deployed models may be updated or replaced without prior notice. Learn more about it [here](https://huggingface.co/docs/api-inference/supported-models).

In [16]:
from smolagents import HfApiModel, ToolCallingAgent,OpenAIServerModel

# model = OpenAIServerModel(model_id="deepseek-chat",
#                           api_base="https://api.deepseek.com",
#                           api_key="sk-ff38cdc7b1c44fa79eb537816c82af94")
model = HfApiModel()
retriever_tool = RetrieverTool(vectordb)
agent = ToolCallingAgent(
    tools=[retriever_tool], model=model, #verbose=True
)

Since we initialized the agent as a `ReactJsonAgent`, it has been automatically given a default system prompt that tells the LLM engine to process step-by-step and generate tool calls as JSON blobs (you could replace this prompt template with your own as needed).

Then when its `.run()` method is launched, the agent takes care of calling the LLM engine, parsing the tool call JSON blobs and executing these tool calls, all in a loop that ends only when the final answer is provided.

In [17]:
agent_output = agent.run("How can I push a model to the Hub?")

print("Final output:")
print(agent_output)

Final output:
To push a model to the Hugging Face Hub, follow these steps:\n1. Ensure you have git-lfs installed.\n2. Log into your Hugging Face account using `huggingface-cli login`.\n3. Identify the model you want to push and specify `push_to_hub=True` in your training configuration.\n4. Optionally, provide additional metadata such as `finetuned_from`, `tasks`, `dataset`, and `tags`.\n5. Use the `push_to_hub()` method from the model's library (e.g., `transformers` library) to upload your model to the Hub.


## Agentic RAG vs. standard RAG

Does the agent setup make a better RAG system? Well, let's compare it to a standard RAG system using LLM Judge!

We will use [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) for evaluation since it's one of the strongest OS models we tested for LLM judge use cases.

In [18]:
eval_dataset = datasets.load_dataset("m-ric/huggingface_doc_qa_eval", split="train")

README.md:   0%|          | 0.00/893 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/289k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/65 [00:00<?, ? examples/s]

Before running the test let's make the agent less verbose.

In [19]:
import logging

agent.logger.setLevel(logging.WARNING) # Let's reduce the agent's verbosity level

eval_dataset = datasets.load_dataset("m-ric/huggingface_doc_qa_eval", split="train")

AttributeError: 'AgentLogger' object has no attribute 'setLevel'

In [20]:
outputs_agentic_rag = []

for example in tqdm(eval_dataset):
    question = example["question"]

    enhanced_question = f"""Using the information contained in your knowledge base, which you can access with the 'retriever' tool,
give a comprehensive answer to the question below.
Respond only to the question asked, response should be concise and relevant to the question.
If you cannot find information, do not give up and try calling your retriever again with different arguments!
Make sure to have covered the question completely by calling the retriever tool several times with semantically different queries.
Your queries should not be questions but affirmative form sentences: e.g. rather than "How do I load a model from the Hub in bf16?", query should be "load a model from the Hub bf16 weights".

Question:
{question}"""
    answer = agent.run(enhanced_question)
    print("=======================================================")
    print(f"Question: {question}")
    print(f"Answer: {answer}")
    print(f'True answer: {example["answer"]}')

    results_agentic = {
        "question": question,
        "true_answer": example["answer"],
        "source_doc": example["source_doc"],
        "generated_answer": answer,
    }
    outputs_agentic_rag.append(results_agentic)

  0%|          | 0/65 [00:00<?, ?it/s]

  2%|▏         | 1/65 [00:07<07:46,  7.29s/it]

Question: What architecture is the `tokenizers-linux-x64-musl` binary designed for?

Answer: The `tokenizers-linux-x64-musl` binary is designed for the x86_64-unknown-linux-musl architecture.
True answer: x86_64-unknown-linux-musl


  3%|▎         | 2/65 [00:38<22:37, 21.55s/it]

Question: What is the purpose of the BLIP-Diffusion model?

Answer: BLIP-Diffusion is a pre-trained subject representation model that enables zero-shot subject-driven generation and control-guided zero-shot generation. It is designed to work in conjunction with the BLIP model, which performs multi-modal tasks such as image captioning, visual question answering, and image-text retrieval. BLIP-Diffusion specifically enhances these capabilities by utilizing diffusion models to generate more detailed and controlled images based on text prompts, leveraging the pre-trained capabilities of BLIP to understand and interpret subjects in images.
True answer: The BLIP-Diffusion model is designed for controllable text-to-image generation and editing.


  5%|▍         | 3/65 [01:00<22:17, 21.58s/it]

Question: How can a user claim authorship of a paper on the Hugging Face Hub?

Answer: A user can claim authorship of a paper on the Hugging Face Hub by clicking on their name in the corresponding Paper page and selecting 'claim authorship'. This redirects to the paper settings where the request can be confirmed. The admin team will validate the request, and once confirmed, the Paper page will be marked as verified.
True answer: By clicking their name on the corresponding Paper page and clicking "claim authorship", then confirming the request in paper settings for admin team validation.


  6%|▌         | 4/65 [01:05<15:13, 14.98s/it]

Question: What is the purpose of the /healthcheck endpoint in the Datasets server API?

Answer: The purpose of the /healthcheck endpoint in the Datasets server API is to ensure that the app is running.
True answer: Ensure the app is running


  8%|▊         | 5/65 [01:10<11:31, 11.53s/it]

Question: What is the default context window size for Local Attention in the LongT5 model?

Answer: The default context window size for Local Attention in the LongT5 model is 127 tokens, with `r=127` by default.
True answer: 127 tokens


  9%|▉         | 6/65 [01:52<21:26, 21.81s/it]

Question: What method is used to load a checkpoint for a task using `AutoPipeline`?

Answer: The method used to load a checkpoint for a task using `AutoPipeline` is the `from_pretrained()` method. This method automatically detects the correct pipeline class to use based on the provided task name or path to the pretrained weights. For example, `AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder")` loads a checkpoint for the text-to-image task.
True answer: from_pretrained()


 11%|█         | 7/65 [02:12<20:31, 21.23s/it]

Question: What is the purpose of Diffusers library?

Answer: The purpose of the Diffusers library is to provide a state-of-the-art, user-friendly, and modular toolbox for working with diffusion models. It allows users to generate images, audio, and even 3D molecular structures using pretrained models. The library emphasizes usability, simplicity, and customizability, making it accessible to both entry-level and advanced users. It supports both inference and training of diffusion models while adhering to the design principles of the PyTorch library.
True answer: To serve as a modular toolbox for both inference and training of state-of-the-art pretrained diffusion models across multiple modalities.


 12%|█▏        | 8/65 [02:17<15:15, 16.06s/it]

Question: What method does the EulerAncestralDiscreteScheduler use for sampling?

Answer: The EulerAncestralDiscreteScheduler uses ancestral sampling with Euler method steps for sampling.
True answer: Ancestral sampling with Euler method steps.


 14%|█▍        | 9/65 [03:01<23:12, 24.86s/it]

Question: What is the name of the large multimodal model that can solve image-text tasks and is based on Flamingo?

Answer: The large multimodal model that can solve image-text tasks and is based on Flamingo is IDEFICS, which comes in two variants - 80 billion parameters and 9 billion parameters.
True answer: IDEFICS


 15%|█▌        | 10/65 [03:07<17:23, 18.97s/it]

Question: What is the purpose of the `gradio.Blocks` API?

Answer: The purpose of the `gradio.Blocks` API is to provide a low-level approach for designing web apps with flexible layouts and data flows, allowing users to have full control over the data flows and layout of their application and build complex, multi-step applications.
True answer: The `gradio.Blocks` API allows you to have full control over the data flows and layout of your application, enabling the building of complex, multi-step applications.


 17%|█▋        | 11/65 [03:54<24:41, 27.44s/it]

Question: What is the purpose of the two-stage model proposed in the paper "Hierarchical Text-Conditional Image Generation with CLIP Latents"?

Answer: The purpose of the two-stage model proposed in the paper 'Hierarchical Text-Conditional Image Generation with CLIP Latents' is to generate a CLIP image embedding from a text caption using a prior model, and then use a diffusion model to decode the embedding into an image. This approach improves image diversity with minimal loss in photorealism and caption similarity while also enabling text-guided image manipulations in a zero-shot fashion.
True answer: The purpose of the two-stage model is to generate a CLIP image embedding given a text caption and then generate an image conditioned on the image embedding.


 18%|█▊        | 12/65 [04:06<20:08, 22.81s/it]

Question: What command is used to install the requirements for a research project using 🤗 Transformers?

Answer: To install the requirements for a research project using 🤗 Transformers, navigate to the folder of your choice and run the command `pip install -r requirements.txt`.
True answer: pip install -r requirements.txt


 20%|██        | 13/65 [04:20<17:32, 20.24s/it]

Question: What task does the `roberta-large-mnli` checkpoint perform?

Answer: The `roberta-large-mnli` checkpoint is used for the natural language inference (NLI) task. It is fine-tuned on the Multi-Genre Natural Language Inference (MNLI) dataset, which involves determining the logical relationship between pairs of sentences.
True answer: Text classification


 22%|██▏       | 14/65 [05:31<30:18, 35.65s/it]

Question: What service is replacing the Paid tier of the Inference API at Hugging Face?

Answer: Hugging Face Inference Endpoints are replacing the Paid tier of the Inference API. Inference Endpoints provide a secure, production-ready solution to deploy models with features like autoscaling, cost reduction, and advanced security, making it easier for developers and data scientists to manage their machine learning models in production without handling infrastructure.
True answer: Inference Endpoints


 23%|██▎       | 15/65 [05:36<21:53, 26.27s/it]

Question: What architectural feature does SqueezeBERT use instead of fully-connected layers for the Q, K, V, and FFN layers?

Answer: grouped convolutions
True answer: Grouped convolutions


 25%|██▍       | 16/65 [05:42<16:21, 20.03s/it]

Question: What type of license is the HuggingFace Team's software distributed under?

Answer: The HuggingFace Team's software is distributed under the Apache License, Version 2.0 (the "License"). You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
True answer: Apache License, Version 2.0


 26%|██▌       | 17/65 [05:47<12:31, 15.65s/it]

Question: What are the two parameter-reduction techniques proposed in the ALBERT model to lower memory consumption and increase training speed?

Answer: The two parameter-reduction techniques proposed in the ALBERT model to lower memory consumption and increase training speed are: splitting the embedding matrix into two smaller matrices and using repeating layers split among groups.
True answer: Splitting the embedding matrix into two smaller matrices and using repeating layers split among groups.


 28%|██▊       | 18/65 [06:07<13:12, 16.87s/it]

Question: What are the three main steps for fine-tuning a model with the 🤗 Datasets library?

Answer: The three main steps for fine-tuning a model with the 🤗 Datasets library and 🤗 Transformers library are:\n1. Load a dataset from the Hugging Face Hub.\n2. Preprocess the data with `Dataset.map()`.\n3. Load and compute metrics.
True answer: 1. Load a dataset from the Hugging Face Hub. 2. Preprocess the data with `Dataset.map()`. 3. Load and compute metrics.


 29%|██▉       | 19/65 [06:12<10:11, 13.29s/it]

Question: What is the maximum improvement in throughput achieved by Hugging Face Infinity compared to vanilla transformers?

Answer: Hugging Face Infinity can deliver up to 800% higher throughput compared to vanilla transformers.
True answer: +800%


 31%|███       | 20/65 [06:21<09:11, 12.25s/it]

Question: What is the command to upload a spaCy pipeline to the Hugging Face Hub?

Answer: The command to upload a spaCy pipeline to the Hugging Face Hub, once packaged with `spacy package` and with the `--build wheel` option, is:\n```bash"huggingface-cli login\npython -m spacy huggingface-hub push en_ner_fashion-0.0.0-py3-none-any.whl\n```
True answer: python -m spacy huggingface-hub push en_ner_fashion-0.0.0-py3-none-any.whl


 32%|███▏      | 21/65 [07:26<20:32, 28.01s/it]

Question: What is the time and memory complexity of the Nyströmformer's approximation of self-attention?

Answer: The Nyströmformer's approximation of self-attention reduces the time and memory complexity from the quadratic \\(O(n^2)\\) complexity of standard self-attention to linear \\(O(n)\\) complexity. This is achieved by approximating the self-attention mechanism using the Nyström method, which involves sampling a subset of rows and columns from the attention matrix, thus avoiding the need to compute the full \\(QK^T\\) product.
True answer: O(n)


 34%|███▍      | 22/65 [07:36<16:08, 22.52s/it]

Question: What is the goal of the Named Entity Recognition task in token classification?

Answer: The goal of the Named Entity Recognition task in token classification is to find and label entities in a piece of text, such as person, location, or organization. This task is formulated as labeling each token with a class for each entity type, and another class for tokens that do not belong to any entity.
True answer: The goal of the Named Entity Recognition task is to find the entities in a piece of text, such as person, location, or organization.


 35%|███▌      | 23/65 [07:41<12:03, 17.23s/it]

Question: What is the resolution of images used by the CLIPSeg model?

Answer: The resolution of images used by the CLIPSeg model is 352 x 352 pixels.
True answer: 352 x 352 pixels


 37%|███▋      | 24/65 [08:17<15:41, 22.96s/it]

Question: What can you use Gradio for?

Answer: Gradio is a versatile Python library that enables you to quickly create customizable web applications for your machine learning models and data processing pipelines. You can use Gradio to:

1. **Build Interactive Demos**: Easily create interactive demos for your machine learning models, allowing users to input data and see the model's predictions in real-time.
2. **Deploy on Various Platforms**: Deploy your Gradio apps on your own web server using Nginx or for free on [Hugging Face Spaces](https://hf.space).
3. **Complex Multi-step Applications**: Use the Blocks API to create sophisticated, multi-step applications, providing full control over the data flows and layout.
4. **Share Applications**: Share your Gradio apps with others, including testing teams, enabling them to test models and provide feedback.
5. **Serverless Deployments**: Utilize Gradio-Lite (`@gradio/lite`), which allows you to run Gradio applications entirely in the browse

 38%|███▊      | 25/65 [08:26<12:29, 18.74s/it]

Question: What TensorFlow API function is used to load a saved tensor file?

Answer: The TensorFlow API function used to load a saved tensor file, specifically a SavedModel, is `tf.keras.models.load_model`. This function can be used to load a saved model that was previously saved using the `model.save` or `model.save_pretrained` method.
True answer: safetensors.tensorflow.load_file


 40%|████      | 26/65 [08:34<10:03, 15.48s/it]

Question: Where can you access the logs of your Endpoints in Hugging Face Endpoints?

Answer: Access to the logs of your Hugging Face Endpoints can be performed through the UI in the “Logs” tab of your Endpoint. You will have access to the build logs of your Image artifacts as well as the Container Logs during inference. Note that Container Logs are only available when your Endpoint is in the “Running” state. If your Endpoint creation fails, you can check the Build Logs to diagnose issues such as wrong version of a dependency.
True answer: In the "Logs" tab of your Endpoint through the UI.


 42%|████▏     | 27/65 [08:42<08:23, 13.24s/it]

Question: What is the latest task added to Hugging Face AutoTrain for Computer Vision?

Answer: The latest task added to Hugging Face AutoTrain for Computer Vision is Image Classification.
True answer: Image Classification


 43%|████▎     | 28/65 [08:48<06:48, 11.04s/it]

Question: What is the default repository type created by the `create_repo` function on Hugging Face Hub?

Answer: By default, the `create_repo` function on Hugging Face Hub creates a model repository.
True answer: model


 45%|████▍     | 29/65 [09:08<08:15, 13.75s/it]

Question: How many splits does the "duorc" dataset have?

Answer: The Duorc dataset has six splits and two configurations.
True answer: Six


 46%|████▌     | 30/65 [09:16<06:58, 11.96s/it]

Question: What is the purpose of Fully Sharded Data Parallel (FSDP) in distributed training?

Answer: Fully Sharded Data Parallel (FSDP) in distributed training is a method that shards a model's parameters, gradients, and optimizer states across available GPUs, significantly reducing memory usage compared to traditional DistributedDataParallel (DDP). It enables training of large pretrained models with up to 1T parameters by efficiently sharing the computational load and minimizing memory footprint, allowing for larger batch sizes and model sizes.
True answer: FSDP is developed for distributed training of large pretrained models up to 1T parameters by sharding the model parameters, gradients, and optimizer states across data parallel processes.


 48%|████▊     | 31/65 [09:21<05:38,  9.96s/it]

Question: What file format is used to save and store PyTorch model weights more securely than `.bin` files?

Answer: The file format used to save and store PyTorch model weights more securely than `.bin` files is `.safetensors`. 
True answer: `.safetensors`


 49%|████▉     | 32/65 [09:29<05:11,  9.44s/it]

Question: What type of security certification does Hugging Face have?

Answer: Hugging Face is SOC2 Type 2 certified, meaning they provide security certification to their customers and actively monitor and patch any security weaknesses.
True answer: SOC2 Type 2 certified


 51%|█████     | 33/65 [09:35<04:28,  8.39s/it]

Question: What do RAG models combine to generate outputs?

Answer: RAG models combine the powers of pretrained dense retrieval (DPR) and sequence-to-sequence models. They retrieve relevant documents, pass them to a seq2seq model, and then generate outputs by marginalizing the information from both components.
True answer: Pretrained dense retrieval (DPR) and sequence-to-sequence models.


 52%|█████▏    | 34/65 [09:40<03:48,  7.38s/it]

Question: What library does MarkupLMFeatureExtractor use to extract data from HTML and XML files?

Answer: MarkupLMFeatureExtractor uses BeautifulSoup, a Python library, to extract data from HTML and XML files.
True answer: Beautiful Soup


 54%|█████▍    | 35/65 [09:52<04:21,  8.70s/it]

Question: What is the file size limit for syncing to HF Spaces without using Git-LFS?

Answer: The file size limit for syncing to HF Spaces without using Git-LFS is 10MB. Files larger than this size require Git-LFS.
True answer: 10MB


 55%|█████▌    | 36/65 [09:57<03:41,  7.63s/it]

Question: What is the title of the paper introducing the ByT5 model?

Answer: The title of the paper introducing the ByT5 model is "ByT5: Towards a token-free future with pre-trained byte-to-byte models".
True answer: ByT5: Towards a token-free future with pre-trained byte-to-byte models


 57%|█████▋    | 37/65 [10:05<03:34,  7.66s/it]

Question: What is the dimension of the feature vector for the base BERT model?

Answer: The dimension of the feature vector for the base BERT model is 768.
True answer: 768


 58%|█████▊    | 38/65 [10:11<03:15,  7.22s/it]

Question: What special identifier does the WordPiece Model use for continuing subwords?

Answer: The WordPiece Model uses the `##` prefix to identify subwords that are part of a word (i. e., not starting a word).
True answer: ##


 60%|██████    | 39/65 [10:53<07:42, 17.78s/it]

Question: What is the purpose of the 🧨 Diffusers tutorials?

Answer: The purpose of the 🧨 Diffusers tutorials is to provide a beginner-friendly introduction to diffusion models and guide users through understanding the fundamental components of the Diffusers library. These tutorials aim to teach how to use the library for inference, deconstructing pipelines to understand the library as a modular toolbox, and even how to train your own diffusion model.
True answer: To provide a gentle introduction to diffusion models and help understand the library fundamentals.


 62%|██████▏   | 40/65 [10:57<05:39, 13.58s/it]

Question: What is the default setting for the `allow_flagging` parameter in Gradio's `Interface`?

Answer: manual
True answer: "manual"


 63%|██████▎   | 41/65 [11:01<04:16, 10.70s/it]

Question: Where can the full code for the Stable Diffusion demo be found?

Answer: https://hf.co/spaces/stabilityai/stable-diffusion/tree/main
True answer: https://hf.co/spaces/stabilityai/stable-diffusion/tree/main


 65%|██████▍   | 42/65 [11:10<03:52, 10.10s/it]

Question: What transformation does the FNet model use to replace the self-attention layer in a BERT model?

Answer: The FNet model replaces the self-attention layer in a BERT model with a Fourier Transform, which returns only the real parts of the transform. This transformation serves as a linear mixing of the input tokens, achieving about 92-97% accuracy of BERT counterparts on the GLUE benchmark while being much faster to train.
True answer: Fourier transform


 66%|██████▌   | 43/65 [12:03<08:24, 22.93s/it]

Question: What type of test should typically accompany a bug fix in Gradio's testing strategy?

Answer: According to the retrieved document, a type of test that should typically accompany a bug fix in Gradio's testing strategy is a dynamic code test. However, it could also be a linting rule or a new type of test if that is more appropriate.
True answer: Dynamic code test


 68%|██████▊   | 44/65 [12:45<10:03, 28.76s/it]

Question: How can you force mixed precision training when initializing the Accelerator in 🤗 Accelerate?

Answer: To force mixed precision training when initializing the Accelerator in 🤗 Accelerate, you can specify the mixed precision settings directly in the configuration. If you are configuring Accelerate via the command line, you can use the `accelerate config` command to set up a configuration interactively, where you can enable mixed precision (`fp16` or `bf16`). 

For a more programmatic approach, you can use the `Accelerator` class with the `mixed_precision` argument. Here is a basic example:

```python
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision='fp16')  # or 'bf16' depending on your GPU support
```

This will initialize the Accelerator with mixed precision training enabled. Make sure your GPU supports the precision format you are choosing (`fp16` is commonly supported, while `bf16` is supported by more recent GPUs like NVIDIA's Ampere architect

 69%|██████▉   | 45/65 [12:51<07:15, 21.76s/it]

Question: What is the purpose of tokenizers in the NLP pipeline?

Answer: The purpose of tokenizers in the NLP pipeline is to translate text into data that can be processed by the model by converting raw text inputs into numerical data, as models can only process numbers.
True answer: To translate text into data that can be processed by the model.


 71%|███████   | 46/65 [13:02<05:53, 18.59s/it]

Question: What is the purpose of the Safety Checker in the Diffusers library?

Answer: The Safety Checker in the Diffusers library is a component that screens against harmful content by checking and comparing the class probability of a set of hard-coded harmful concepts in the embedding space against an image after it has been generated. It is designed to flag inappropriate content generated during inference and helps users interact with generative models responsibly and ethically. Model creators can choose to incorporate this component into their models.
True answer: The Safety Checker checks and compares the class probability of a set of hard-coded harmful concepts in the embedding space against an image after it has been generated to mitigate the risk of generating harmful content.


 72%|███████▏  | 47/65 [13:48<08:02, 26.79s/it]

Question: What Python class allows you to retrieve Discussions and Pull Requests from a given repository on the Hugging Face Hub?

Answer: The Python class that allows you to retrieve Discussions and Pull Requests from a given repository on the Hugging Face Hub is `HfApi`. Use methods like `get_repo_discussions` to retrieve these.
True answer: HfApi


 74%|███████▍  | 48/65 [14:17<07:46, 27.41s/it]

Question: What is the name of the new library introduced by Hugging Face for hosting scikit-learn models?

Answer: Hugging Face does not appear to have introduced a specific new library dedicated solely for hosting scikit-learn models as of the documents provided. The huggingface_hub library allows for the hosting and sharing of models and datasets on the Hugging Face Hub, but it supports a variety of machine learning libraries, not just scikit-learn.
True answer: Skops


 75%|███████▌  | 49/65 [14:24<05:43, 21.49s/it]

Question: What is the purpose of Textual Inversion?

Answer: Textual Inversion is a training technique for personalizing image generation models with just a few example images. It works by learning new text embeddings tied to a special word used in prompts, allowing models like Stable Diffusion to generate images reflecting the learned concepts or styles from the provided images. The resulting embeddings are small in size (a few KBs) and can be loaded into the text encoder for use in generating personalized images.
True answer: Textual Inversion is a training method for personalizing models by learning new text embeddings from a few example images.


 77%|███████▋  | 50/65 [14:53<05:56, 23.75s/it]

Question: What is the recommended multiple of batch size for fp16 data type on an A100 GPU?

Answer: Based on the information provided from the documents, there isn't a specific recommended multiple of batch size for fp16 data type on an A100 GPU. However, we see examples of batch sizes such as 16 used on A100 GPUs. It also mentions being able to use DeepSpeed Zero-Stage 2 to achieve up to a batch size of 200 without running into OOM errors. The specific multiple likely depends on the model size and architecture.
True answer: 64


 78%|███████▊  | 51/65 [14:59<04:18, 18.49s/it]

Question: How do you run a Gradio Blocks app in reload mode using a Python IDE?

Answer: To run a Gradio Blocks app in reload mode using a Python IDE, instead of running `python app.py`, type `gradio app.py` in your terminal. This will launch the demo in reload mode, automatically reloading the app whenever you make changes to the file.
True answer: Run `gradio run.py` in the terminal.


 80%|████████  | 52/65 [15:52<06:12, 28.65s/it]

Question: How can you install the Hugging Face Unity API in your Unity project?

Answer: To install the Hugging Face Unity API in your Unity project, follow these steps:\n1. Open your Unity project.\n2. Go to `Window` -> `Package Manager`.\n3. Click `+` and select `Add Package from git URL`.\n4. Enter `https://github.com/huggingface/unity-api.git`.\n5. Once installed, the Unity API wizard should pop up. If not, go to `Window` -> `Hugging Face API Wizard`.\n
True answer: To install the Hugging Face Unity API in your Unity project, go to `Window` -> `Package Manager`, click `+` and select `Add Package from git URL`, then enter `https://github.com/huggingface/unity-api.git`.


 82%|████████▏ | 53/65 [16:35<06:34, 32.90s/it]

Question: What is the pretraining objective of the Wav2Vec2 context network?

Answer: The pretraining objective of the Wav2Vec2 context network is a contrastive task. The model has to predict the true quantized speech representation of the masked prediction from a set of false ones, encouraging the model to find the most similar context vector and quantized speech unit (the target label).
True answer: The pretraining objective of the Wav2Vec2 context network is a contrastive task where the model has to predict the true quantized speech representation of the masked prediction from a set of false ones.


 83%|████████▎ | 54/65 [16:42<04:36, 25.12s/it]

Question: What is the default checkpoint used by the sentiment analysis pipeline in the Transformers library?

Answer: The default checkpoint used by the sentiment analysis pipeline in the Transformers library is `distilbert-base-uncased-finetuned-sst-2-english`.
True answer: distilbert base uncased finetuned sst2 english


 85%|████████▍ | 55/65 [16:57<03:42, 22.23s/it]

Question: What is the purpose of the notebook "How to use DeepSpeed to train models with billions of parameters on Habana Gaudi"?

Answer: The purpose of the notebook 'How to use DeepSpeed to train models with billions of parameters on Habana Gaudi' is to show how to use DeepSpeed to pre-train and fine-tune large models, such as the 1.6B-parameter GPT2-XL, for causal language modeling on Habana Gaudi hardware. The notebook provides a detailed guide on leveraging DeepSpeed along with the Habana Gaudi hardware to efficiently train and fine-tune large-scale models.
True answer: To show how to use DeepSpeed to pre-train/fine-tune the 1.6B-parameter GPT2-XL for causal language modeling on Habana Gaudi.


 86%|████████▌ | 56/65 [17:05<02:40, 17.86s/it]

Question: What command line module does PyTorch provide to run a script on multiple GPUs?

Answer: PyTorch provides the `torchrun` command line module to run a script on multiple GPUs. You can specify the number of GPUs to use with the `--nproc_per_node` argument. For example, to run a script on two GPUs, you would use the following command:\n```bash\torchrun --nproc_per_node=2 example_script.py```
True answer: torchrun


 88%|████████▊ | 57/65 [17:10<01:53, 14.18s/it]

Question: What is the most popular vision transformer model on the Hugging Face Model Hub for image classification?

Answer: The most popular vision transformer model on the Hugging Face Model Hub for image classification is `google/vit-base-patch16-224`.
True answer: google/vit-base-patch16-224


 89%|████████▉ | 58/65 [17:18<01:26, 12.35s/it]

Question: What is the command to upload an ESPnet model to a Hugging Face repository?

Answer: The command to upload an ESPnet model to a Hugging Face repository is:\n```bash ./run.sh --stage 15 --skip_upload_hf false --hf_repo username/model_repo ```
True answer: ./run.sh --stage 15 --skip_upload_hf false --hf_repo username/model_repo


 91%|█████████ | 59/65 [17:29<01:11, 11.92s/it]

Question: What file should be added to a model repository to install custom Python dependencies for Inference Endpoints?

Answer: To install custom Python dependencies for Inference Endpoints, you should add a `requirements.txt` file to your model repository on the Hugging Face Hub. This file should list all the necessary dependencies that need to be installed. When your Endpoint and Image artifacts are created, the dependencies listed in the `requirements.txt` file will be installed automatically.
True answer: requirements.txt


 92%|█████████▏| 60/65 [17:35<00:49,  9.92s/it]

Question: How many images are needed to teach new concepts to Stable Diffusion using Textual Inversion?

Answer: Stable Diffusion can teach new concepts using just 3-5 images through the method of Textual Inversion.
True answer: 3-5 images


 94%|█████████▍| 61/65 [17:43<00:37,  9.49s/it]

Question: What is the maximum size of a model checkpoint before it is automatically sharded in Transformers version 4.18.0?

Answer: 10GB
True answer: 10GB


 95%|█████████▌| 62/65 [17:49<00:25,  8.57s/it]

Question: What is the purpose of Weights and Biases (W&B) for data scientists and machine learning scientists?

Answer: Weights and Biases (W&B) is a tool that allows data scientists and machine learning scientists to track their machine learning experiments at every stage, from training to production. It provides a customizable and searchable dashboard to aggregate and visualize any metrics over samples.
True answer: To track their machine learning experiments at every stage, from training to production.


 97%|█████████▋| 63/65 [17:58<00:16,  8.43s/it]

Question: What is the name of the open-source library created by Hugging Face to simplify Transformer acceleration?

Answer: The open-source library created by Hugging Face to simplify Transformer acceleration is called Optimum.
True answer: Optimum


 98%|█████████▊| 64/65 [18:03<00:07,  7.55s/it]

Question: What parameter is used to ensure that elements in a row have the same height in Gradio?

Answer: The `equal_height` parameter is used to ensure that elements in a row have the same height in Gradio. You can pass this parameter to the `.style()` method of `gr.Row()`.
True answer: equal_height


100%|██████████| 65/65 [18:07<00:00, 16.73s/it]

Question: What is the command to install the latest version of Optimum with OpenVINO support?

Answer: pip install optimum[openvino]
True answer: pip install --upgrade-strategy eager optimum["openvino"]





In [21]:
from huggingface_hub import InferenceClient

reader_llm = InferenceClient("Qwen/Qwen2.5-72B-Instruct")

outputs_standard_rag = []

for example in tqdm(eval_dataset):
    question = example["question"]
    context = retriever_tool(question)

    prompt = f"""Given the question and supporting documents below, give a comprehensive answer to the question.
Respond only to the question asked, response should be concise and relevant to the question.
Provide the number of the source document when relevant.
If you cannot find information, do not give up and try calling your retriever again with different arguments!

Question:
{question}

{context}
"""
    messages = [{"role": "user", "content": prompt}]
    answer = reader_llm.chat_completion(messages).choices[0].message.content

    print("=======================================================")
    print(f"Question: {question}")
    print(f"Answer: {answer}")
    print(f'True answer: {example["answer"]}')

    results_agentic = {
        "question": question,
        "true_answer": example["answer"],
        "source_doc": example["source_doc"],
        "generated_answer": answer,
    }
    outputs_standard_rag.append(results_agentic)

  2%|▏         | 1/65 [00:02<02:13,  2.08s/it]

Question: What architecture is the `tokenizers-linux-x64-musl` binary designed for?

Answer: The `tokenizers-linux-x64-musl` binary is designed for the **x86_64-unknown-linux-musl** architecture. (Source: Document 0)
True answer: x86_64-unknown-linux-musl


  3%|▎         | 2/65 [00:06<03:27,  3.29s/it]

Question: What is the purpose of the BLIP-Diffusion model?

Answer: The BLIP-Diffusion model is designed for controllable text-to-image generation and editing. It leverages pre-trained subject representation to enable zero-shot subject-driven generation and control-guided zero-shot generation, allowing for more precise and context-aware image creation and manipulation. (Source: Document 6)
True answer: The BLIP-Diffusion model is designed for controllable text-to-image generation and editing.


  5%|▍         | 3/65 [00:10<03:59,  3.87s/it]

Question: How can a user claim authorship of a paper on the Hugging Face Hub?

Answer: To claim authorship of a paper on the Hugging Face Hub, follow these steps:

1. Go to the paper page on the Hugging Face Hub.
2. Click on your name in the corresponding paper page.
3. Click "claim authorship."
4. You will be redirected to your paper settings where you can confirm the request.
5. The admin team will review and validate your request.
6. Once confirmed, the paper page will show as verified. (Source: Document 1)
True answer: By clicking their name on the corresponding Paper page and clicking "claim authorship", then confirming the request in paper settings for admin team validation.


  6%|▌         | 4/65 [00:12<03:05,  3.04s/it]

Question: What is the purpose of the /healthcheck endpoint in the Datasets server API?

Answer: The purpose of the /healthcheck endpoint in the Datasets server API is to ensure that the application is running correctly. (Source: Document 0)
True answer: Ensure the app is running


  8%|▊         | 5/65 [00:16<03:14,  3.23s/it]

Question: What is the default context window size for Local Attention in the LongT5 model?

Answer: The default context window size for Local Attention in the LongT5 model is defined by `config.attention_window` (Document 1). This parameter can be set to a specific value or different values for each layer if it is a list. However, the exact default value is not specified in the provided documents. Typically, this value needs to be set or retrieved from the model configuration.
True answer: 127 tokens


  9%|▉         | 6/65 [00:19<03:22,  3.44s/it]

Question: What method is used to load a checkpoint for a task using `AutoPipeline`?

Answer: The method used to load a checkpoint for a task using `AutoPipeline` is the `from_pretrained()` method. This method automatically retrieves the relevant pipeline given the name or path to the pretrained weights. (Source: Document 0)

Additionally, if you need to switch between tasks using the same checkpoint without consuming additional memory, you can use the `from_pipe()` method. (Source: Document 5)
True answer: from_pretrained()


 11%|█         | 7/65 [00:24<03:40,  3.80s/it]

Question: What is the purpose of Diffusers library?

Answer: The purpose of the Diffusers library is to provide a modular and accessible toolbox for state-of-the-art pretrained diffusion models. It is designed to support both simple inference solutions and the training of custom diffusion models, focusing on usability, simplicity, and customizability. The library enables the generation of images, audio, and 3D structures of molecules, making advanced generative AI accessible to a broader audience (Sources: Document 1, Document 3, Document 4).
True answer: To serve as a modular toolbox for both inference and training of state-of-the-art pretrained diffusion models across multiple modalities.


 12%|█▏        | 8/65 [00:26<03:08,  3.31s/it]

Question: What method does the EulerAncestralDiscreteScheduler use for sampling?

Answer: The EulerAncestralDiscreteScheduler uses ancestral sampling with Euler method steps for sampling. This method is designed to be fast and can often generate good outputs in 20-30 steps. (Source: Document 0)
True answer: Ancestral sampling with Euler method steps.


 14%|█▍        | 9/65 [00:29<03:03,  3.28s/it]

Question: What is the name of the large multimodal model that can solve image-text tasks and is based on Flamingo?

Answer: The name of the large multimodal model that can solve image-text tasks and is based on Flamingo is **IDEFICS**. This model is an open-access reproduction of Flamingo and performs comparably to the original model on various image-text understanding benchmarks. (Source: Document 2)
True answer: IDEFICS


 15%|█▌        | 10/65 [00:33<03:02,  3.33s/it]

Question: What is the purpose of the `gradio.Blocks` API?

Answer: The `gradio.Blocks` API is designed to provide a low-level approach for creating web apps with more flexible layouts and data flows. It allows you to control where components appear on the page, handle complex data flows (such as using outputs as inputs to other functions), and update properties or visibility of components based on user interaction, all within Python. (Sources: [0], [1])
True answer: The `gradio.Blocks` API allows you to have full control over the data flows and layout of your application, enabling the building of complex, multi-step applications.


 17%|█▋        | 11/65 [00:38<03:26,  3.83s/it]

Question: What is the purpose of the two-stage model proposed in the paper "Hierarchical Text-Conditional Image Generation with CLIP Latents"?

Answer: The two-stage model proposed in the paper "Hierarchical Text-Conditional Image Generation with CLIP Latents" serves the purpose of generating high-quality images from text descriptions. The first stage, known as the Prior Transformer, converts a text caption into a CLIP image embedding. The second stage, typically a diffusion model, decodes this CLIP image embedding into a final image. This hierarchical approach allows for more controlled and detailed image generation, leveraging the strengths of both the text-to-embedding and embedding-to-image processes (Document 0, Document 4).
True answer: The purpose of the two-stage model is to generate a CLIP image embedding given a text caption and then generate an image conditioned on the image embedding.


 18%|█▊        | 12/65 [00:40<03:00,  3.40s/it]

Question: What command is used to install the requirements for a research project using 🤗 Transformers?

Answer: The command used to install the requirements for a research project using 🤗 Transformers is:
```
pip install -r requirements.txt
```
This command should be run inside the specific folder of the project you are working on. (Source: Document 0)
True answer: pip install -r requirements.txt


 20%|██        | 13/65 [00:44<02:59,  3.45s/it]

Question: What task does the `roberta-large-mnli` checkpoint perform?

Answer: The `roberta-large-mnli` checkpoint performs the **Multi-NLI (Multi-Genre Natural Language Inference)** task. This task involves predicting the relationship between a pair of sentences, specifically whether one sentence entails, contradicts, or is neutral with respect to the other. (Source: Document 5)
True answer: Text classification


 22%|██▏       | 14/65 [00:46<02:43,  3.20s/it]

Question: What service is replacing the Paid tier of the Inference API at Hugging Face?

Answer: The service replacing the Paid tier of the Inference API at Hugging Face is **Inference Endpoints**. This service offers a more robust and scalable solution for deploying models in production, with features like autoscaling, scale-to-zero, and advanced security. (Source: Document 4)
True answer: Inference Endpoints


 23%|██▎       | 15/65 [00:48<02:18,  2.78s/it]

Question: What architectural feature does SqueezeBERT use instead of fully-connected layers for the Q, K, V, and FFN layers?

Answer: SqueezeBERT uses grouped convolutions instead of fully-connected layers for the Q, K, V, and FFN layers. (Source: Document 0)
True answer: Grouped convolutions


 25%|██▍       | 16/65 [00:51<02:18,  2.83s/it]

Question: What type of license is the HuggingFace Team's software distributed under?

Answer: The HuggingFace Team's software is distributed under the Apache License, Version 2.0. (Sources: 0, 1, 2, 3, 4, 5, 6)
True answer: Apache License, Version 2.0


 26%|██▌       | 17/65 [00:55<02:25,  3.04s/it]

Question: What are the two parameter-reduction techniques proposed in the ALBERT model to lower memory consumption and increase training speed?

Answer: The two parameter-reduction techniques proposed in the ALBERT model to lower memory consumption and increase training speed are:

1. **Splitting the embedding matrix into two smaller matrices.**
2. **Using repeating layers split among groups (parameter sharing).**

These techniques help reduce the number of parameters and memory usage, thereby increasing training efficiency. (Source: Document 0)
True answer: Splitting the embedding matrix into two smaller matrices and using repeating layers split among groups.


 28%|██▊       | 18/65 [00:57<02:18,  2.95s/it]

Question: What are the three main steps for fine-tuning a model with the 🤗 Datasets library?

Answer: The three main steps for fine-tuning a model with the 🤗 Datasets library are:

1. Load a dataset from the Hugging Face Hub.
2. Preprocess the data with `Dataset.map()`.
3. Load and compute metrics.

These steps are outlined in Document 0.
True answer: 1. Load a dataset from the Hugging Face Hub. 2. Preprocess the data with `Dataset.map()`. 3. Load and compute metrics.


 29%|██▉       | 19/65 [00:59<01:59,  2.59s/it]

Question: What is the maximum improvement in throughput achieved by Hugging Face Infinity compared to vanilla transformers?

Answer: The maximum improvement in throughput achieved by Hugging Face Infinity compared to vanilla transformers is **800%**. (Source: Document 0)
True answer: +800%


 31%|███       | 20/65 [01:05<02:45,  3.67s/it]

Question: What is the command to upload a spaCy pipeline to the Hugging Face Hub?

Answer: The command to upload a spaCy pipeline to the Hugging Face Hub is:

```bash
python -m spacy huggingface-hub push [whl_path] [--org] [--msg] [--local-repo] [--verbose]
```

This command is used after you have packaged your pipeline and generated a .whl file. For example:

```bash
python -m spacy huggingface-hub push en_ner_fashion-0.0.0-py3-none-any.whl
```

Make sure you have authenticated with your Hugging Face credentials using `huggingface-cli login` before running the upload command. (Source Document 4 and Document 5)
True answer: python -m spacy huggingface-hub push en_ner_fashion-0.0.0-py3-none-any.whl


 32%|███▏      | 21/65 [01:09<02:34,  3.51s/it]

Question: What is the time and memory complexity of the Nyströmformer's approximation of self-attention?

Answer: The Nyströmformer's approximation of self-attention has a time and memory complexity of \(O(n)\), where \(n\) is the sequence length. This is achieved by avoiding the computation of the full \(QK^T\) product, which typically has a complexity of \(O(n^2)\). (Source: Document 1 and Document 2)
True answer: O(n)


 34%|███▍      | 22/65 [01:12<02:31,  3.51s/it]

Question: What is the goal of the Named Entity Recognition task in token classification?

Answer: The goal of the Named Entity Recognition (NER) task in token classification is to identify and label specific entities in a piece of text, such as persons, locations, organizations, and other predefined categories. Each token in the text is labeled with one of these entity classes or a class indicating it does not belong to any entity. (Sources: 0, 1, 4, 5, 6)
True answer: The goal of the Named Entity Recognition task is to find the entities in a piece of text, such as person, location, or organization.


 35%|███▌      | 23/65 [01:48<09:11, 13.13s/it]

Question: What is the resolution of images used by the CLIPSeg model?

Answer: The resolution of images used by the CLIPSeg model is 352 x 352 pixels. (Source: Document 0)
True answer: 352 x 352 pixels


 37%|███▋      | 24/65 [01:52<07:04, 10.35s/it]

Question: What can you use Gradio for?

Answer: Gradio is a Python library that allows you to quickly create customizable web apps for your machine learning models and data processing pipelines. You can use it to:

- Create web interfaces for machine learning models.
- Share your applications with others.
- Deploy your apps on platforms like Hugging Face Spaces.
- Build and test demos for models like Automatic Speech recognition (ASR).

Sources: 2, 3, 4, 5, 6
True answer: Create a demo for your machine learning model, share your machine learning model with others, and debug your model.


 38%|███▊      | 25/65 [01:53<05:07,  7.69s/it]

Question: What TensorFlow API function is used to load a saved tensor file?

Answer: The TensorFlow API function used to load a saved tensor file is `safetensors.tensorflow.load_file`.

Source: Document 0
True answer: safetensors.tensorflow.load_file


 40%|████      | 26/65 [02:03<05:25,  8.35s/it]

Question: Where can you access the logs of your Endpoints in Hugging Face Endpoints?

Answer: You can access the logs of your Endpoints in Hugging Face Endpoints through the UI in the “Logs” tab of your Endpoint. You will have access to both the build logs of your image artifacts and the container logs during inference. The container logs are only available when your Endpoint is in the “Running” state. (Source: Document 0)
True answer: In the "Logs" tab of your Endpoint through the UI.


 42%|████▏     | 27/65 [02:04<03:59,  6.29s/it]

Question: What is the latest task added to Hugging Face AutoTrain for Computer Vision?

Answer: The latest task added to Hugging Face AutoTrain for Computer Vision is **Image Classification**. (Source: Document 0)
True answer: Image Classification


 43%|████▎     | 28/65 [02:07<03:10,  5.15s/it]

Question: What is the default repository type created by the `create_repo` function on Hugging Face Hub?

Answer: The default repository type created by the `create_repo` function on the Hugging Face Hub is a **model repository**. You can specify a different type, such as a dataset repository, using the `repo_type` parameter. (Source: Document 2)
True answer: model


 45%|████▍     | 29/65 [02:10<02:38,  4.41s/it]

Question: How many splits does the "duorc" dataset have?

Answer: The "duorc" dataset has six splits, corresponding to the `test`, `train`, and `validation` splits for its two configurations, `ParaphraseRC` and `SelfRC`. (Source: Document 0, Document 5)
True answer: Six


 46%|████▌     | 30/65 [02:14<02:38,  4.53s/it]

Question: What is the purpose of Fully Sharded Data Parallel (FSDP) in distributed training?

Answer: The purpose of Fully Sharded Data Parallel (FSDP) in distributed training is to enhance memory efficiency and scalability by sharding the model parameters, gradients, and optimizer states across multiple GPUs. This allows for training larger models with bigger batch sizes on fewer GPUs by reducing memory usage. FSDP can also offload sharded model parameters to the CPU, further optimizing memory usage and enabling the training of models that would otherwise not fit into GPU memory (Document 0, Document 4).
True answer: FSDP is developed for distributed training of large pretrained models up to 1T parameters by sharding the model parameters, gradients, and optimizer states across data parallel processes.


 48%|████▊     | 31/65 [02:17<02:14,  3.95s/it]

Question: What file format is used to save and store PyTorch model weights more securely than `.bin` files?

Answer: The file format used to save and store PyTorch model weights more securely than `.bin` files is the `.safetensors` format. This format is recommended for its enhanced security features compared to the traditional `.bin` files which are often pickled. (Source: Document 0)
True answer: `.safetensors`


 49%|████▉     | 32/65 [02:19<01:53,  3.45s/it]

Question: What type of security certification does Hugging Face have?

Answer: Hugging Face is SOC 2 Type 2 certified. This certification confirms that they have security controls in place to protect customer data and actively monitor and patch any security weaknesses. (Source: Document 0, Document 2)
True answer: SOC2 Type 2 certified


 51%|█████     | 33/65 [02:23<01:48,  3.40s/it]

Question: What do RAG models combine to generate outputs?

Answer: RAG models combine pretrained dense retrieval (DPR) and sequence-to-sequence (Seq2Seq) models to generate outputs. Specifically, they retrieve documents using the DPR model, pass these documents to a Seq2Seq model, and then marginalize to generate the final output. (Sources: Document 1, Document 2)
True answer: Pretrained dense retrieval (DPR) and sequence-to-sequence models.


 52%|█████▏    | 34/65 [02:24<01:29,  2.89s/it]

Question: What library does MarkupLMFeatureExtractor use to extract data from HTML and XML files?

Answer: The `MarkupLMFeatureExtractor` uses the **Beautiful Soup** library to extract data from HTML and XML files. (Source: Document 0)
True answer: Beautiful Soup


 54%|█████▍    | 35/65 [02:26<01:17,  2.57s/it]

Question: What is the file size limit for syncing to HF Spaces without using Git-LFS?

Answer: The file size limit for syncing to HF Spaces without using Git-LFS is 10MB. (Source Document 0, Document 2, Document 3)
True answer: 10MB


 55%|█████▌    | 36/65 [02:29<01:16,  2.62s/it]

Question: What is the title of the paper introducing the ByT5 model?

Answer: The title of the paper introducing the ByT5 model is "ByT5: Towards a token-free future with pre-trained byte-to-byte models" (Document 0, Document 1, Document 3, Document 4, Document 5).
True answer: ByT5: Towards a token-free future with pre-trained byte-to-byte models


 57%|█████▋    | 37/65 [02:31<01:07,  2.42s/it]

Question: What is the dimension of the feature vector for the base BERT model?

Answer: The dimension of the feature vector for the base BERT model is 768. (Source 0, source 2, source 3, source 5 )
True answer: 768


 58%|█████▊    | 38/65 [02:34<01:10,  2.59s/it]

Question: What special identifier does the WordPiece Model use for continuing subwords?

Answer: The special identifier used by the WordPiece model for continuing subwords is `##`. This prefix is added to subwords that are not at the beginning of a word. For example, the word "word" would be tokenized as `w ##o ##r ##d` (Document 2).
True answer: ##


 60%|██████    | 39/65 [02:39<01:25,  3.30s/it]

Question: What is the purpose of the 🧨 Diffusers tutorials?

Answer: The purpose of the 🧨 Diffusers tutorials is to provide a beginner-friendly introduction to diffusion models and the Diffusers library. These tutorials aim to help users understand the library fundamentals, including its core components and how to use it as a modular toolbox for building custom diffusion systems. They cover how to use pipelines for inference, generate content, and train your own diffusion models. By completing these tutorials, users will gain the skills needed to explore the library independently and apply it to their own projects and applications. (Source: Document 0)
True answer: To provide a gentle introduction to diffusion models and help understand the library fundamentals.


 62%|██████▏   | 40/65 [02:41<01:15,  3.03s/it]

Question: What is the default setting for the `allow_flagging` parameter in Gradio's `Interface`?

Answer: The default setting for the `allow_flagging` parameter in Gradio's `Interface` is `"manual"`. This means users will see a button to flag, and samples are only flagged when the button is clicked. (Source: Document 5)
True answer: "manual"


 63%|██████▎   | 41/65 [02:44<01:10,  2.94s/it]

Question: Where can the full code for the Stable Diffusion demo be found?

Answer: The full code for the Stable Diffusion demo can be found at the following link: [https://hf.co/spaces/stabilityai/stable-diffusion/tree/main](https://hf.co/spaces/stabilityai/stable-diffusion/tree/main) (Source Document 0).
True answer: https://hf.co/spaces/stabilityai/stable-diffusion/tree/main


 65%|██████▍   | 42/65 [02:46<01:00,  2.63s/it]

Question: What transformation does the FNet model use to replace the self-attention layer in a BERT model?

Answer: The FNet model replaces the self-attention layer in a BERT model with a Fourier transform, specifically using only the real parts of the transform (Document 0).
True answer: Fourier transform


 66%|██████▌   | 43/65 [02:49<00:59,  2.69s/it]

Question: What type of test should typically accompany a bug fix in Gradio's testing strategy?

Answer: Typically, a bug fix in Gradio's testing strategy should be accompanied by a dynamic code test that failed before the fix and passes afterwards. This test could also be a linting rule or new type if that is appropriate, but dynamic code tests are generally preferred. (Source Document 0)
True answer: Dynamic code test


 68%|██████▊   | 44/65 [02:56<01:27,  4.15s/it]

Question: How can you force mixed precision training when initializing the Accelerator in 🤗 Accelerate?

Answer: To force mixed precision training when initializing the Accelerator in 🤗 Accelerate, you can use the `--fp16` flag when launching your training script. This flag enables mixed precision training using 16-bit floating point (fp16) precision. Here's how you can do it:

```bash
accelerate config --fp16
```

Alternatively, if you are using the `Accelerator` class in your code, you can specify the use of mixed precision by setting the `mixed_precision` parameter to `"fp16"`:

```python
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="fp16")
```

This will configure the training to use mixed precision, which can significantly speed up training on compatible GPUs. (Source: Document 0)
True answer: By passing `fp16=True` to the Accelerator init.


 69%|██████▉   | 45/65 [02:59<01:14,  3.71s/it]

Question: What is the purpose of tokenizers in the NLP pipeline?

Answer: The purpose of tokenizers in the NLP pipeline is to translate raw text into numerical data that can be processed by the model. Tokenizers convert text inputs into tokens, which are numerical representations that the model can understand and process.. <<= Document 0.
True answer: To translate text into data that can be processed by the model.


 71%|███████   | 46/65 [03:02<01:06,  3.49s/it]

Question: What is the purpose of the Safety Checker in the Diffusers library?

Answer: The Safety Checker in the Diffusers library is designed to screen and flag harmful or inappropriate content generated by models during inference. It checks the generated images against a set of hard-coded harmful concepts and is intended to help users interact with generative models responsibly and ethically. (Source: Document 3, Document 4)
True answer: The Safety Checker checks and compares the class probability of a set of hard-coded harmful concepts in the embedding space against an image after it has been generated to mitigate the risk of generating harmful content.


 72%|███████▏  | 47/65 [03:08<01:18,  4.35s/it]

Question: What Python class allows you to retrieve Discussions and Pull Requests from a given repository on the Hugging Face Hub?

Answer: The Python class that allows you to retrieve Discussions and Pull Requests from a given repository on the Hugging Face Hub is `HfApi`. Specifically, you can use the `get_repo_discussions` method of the `HfApi` class to retrieve these discussions and pull requests. Here is an example:

```python
from huggingface_hub import get_repo_discussions

for discussion in get_repo_discussions(repo_id="bigscience/bloom"):
    print(f"{discussion.num} - {discussion.title}, pr: {discussion.is_pull_request}")
```

This method also supports filtering by author, type (Pull Request or Discussion), and status (`open` or `closed`). (Source: Document 1)
True answer: HfApi


 74%|███████▍  | 48/65 [03:12<01:10,  4.14s/it]

Question: What is the name of the new library introduced by Hugging Face for hosting scikit-learn models?

Answer: The retrieved documents do not explicitly mention a new library introduced by Hugging Face specifically for hosting scikit-learn models. However, the documents highlight that the Hugging Face Hub supports models from various third-party libraries and provides utilities for integrating these libraries with the Hub (Document 1). For the most accurate and up-to-date information, you might want to check the Hugging Face Hub documentation or their official announcements.
True answer: Skops


 75%|███████▌  | 49/65 [03:15<01:01,  3.82s/it]

Question: What is the purpose of Textual Inversion?

Answer: The purpose of Textual Inversion is to personalize image generation models by learning and updating text embeddings to match specific concepts from a few example images. This allows the model to generate new images conditioned on these learned concepts, providing more control and customization over the generated content. (Source: Document 0, Document 1, Document 4)
True answer: Textual Inversion is a training method for personalizing models by learning new text embeddings from a few example images.


 77%|███████▋  | 50/65 [03:23<01:15,  5.06s/it]

Question: What is the recommended multiple of batch size for fp16 data type on an A100 GPU?

Answer: The recommended multiple of batch size for fp16 data type on an A100 GPU is not explicitly stated in the provided documents. However, based on the information given, using fp16 is generally recommended for better memory efficiency and performance. For specific batch sizes, you can refer to the following:

- **Document 0**: The benchmark for batch size 1 using fp16 on an A100 GPU is provided.
- **Document 1**: Mixed precision (fp16 for older architectures) is recommended, and starting with a batch size of 1 to avoid out-of-memory (OOM) errors is suggested.
- **Document 5**: For batch size 1, the memory usage with fp16 is significantly reduced (1010.4M).

To determine the optimal batch size, start with a batch size of 1 and incrementally increase it while monitoring for OOM errors. The exact multiple will depend on the specific model and input data size.
True answer: 64


 78%|███████▊  | 51/65 [03:34<01:37,  7.00s/it]

Question: How do you run a Gradio Blocks app in reload mode using a Python IDE?

Answer: To run a Gradio Blocks app in reload mode using a Python IDE, you need to use the `gradio` command followed by the name of your script file instead of using `python`. For example, if your script is named `run.py`, you would run the following command in your terminal:

```bash
gradio run.py
```

This command will start the app in hot reload mode, automatically reloading the app whenever you make changes to the file. (Source: Document 2, Document 3, Document 5)
True answer: Run `gradio run.py` in the terminal.


 80%|████████  | 52/65 [03:46<01:48,  8.31s/it]

Question: How can you install the Hugging Face Unity API in your Unity project?

Answer: To install the Hugging Face Unity API in your Unity project, follow these steps:

1. Open your Unity project.
2. Go to `Window` -> `Package Manager`.
3. Click `+` and select `Add Package from git URL`.
4. Enter `https://github.com/huggingface/unity-api.git`.
5. Once installed, the Unity API wizard should pop up. If not, go to `Window` -> `Hugging Face API Wizard`.
6. Enter your API key. Your API key can be created in your [Hugging Face account settings](https://huggingface.co/settings/tokens).
7. Test the API key by clicking `Test API key` in the API Wizard.
8. Optionally, change the model endpoints to use different models. You can find the model endpoint by going to the model on the Hugging Face website, clicking `Deploy` -> `Inference API`, and copying the URL from the `API_URL` field.
9. Configure advanced settings if desired.
10. To see examples of how to use the API, click `Install Examples`. 

 82%|████████▏ | 53/65 [04:58<05:28, 27.36s/it]

Question: What is the pretraining objective of the Wav2Vec2 context network?

Answer: The pretraining objective of the Wav2Vec2 context network is a *contrastive task*. The model is trained to predict the true quantized speech representation of the masked prediction from a set of false ones, encouraging the model to find the most similar context vector and quantized speech unit (the target label). (Source: Document 0)
True answer: The pretraining objective of the Wav2Vec2 context network is a contrastive task where the model has to predict the true quantized speech representation of the masked prediction from a set of false ones.


 83%|████████▎ | 54/65 [05:02<03:43, 20.35s/it]

Question: What is the default checkpoint used by the sentiment analysis pipeline in the Transformers library?

Answer: The default checkpoint used by the sentiment analysis pipeline in the Transformers library is `distilbert-base-uncased-finetuned-sst-2-english`. This can be seen in the model card [here](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) (Source: Document 0).
True answer: distilbert base uncased finetuned sst2 english


 85%|████████▍ | 55/65 [05:05<02:33, 15.34s/it]

Question: What is the purpose of the notebook "How to use DeepSpeed to train models with billions of parameters on Habana Gaudi"?

Answer: The purpose of the notebook "How to use DeepSpeed to train models with billions of parameters on Habana Gaudi" is to demonstrate how to use DeepSpeed to pre-train or fine-tune a 1.6B-parameter GPT2-XL model for causal language modeling on the Habana Gaudi hardware platform. (Source: Document 0)
True answer: To show how to use DeepSpeed to pre-train/fine-tune the 1.6B-parameter GPT2-XL for causal language modeling on Habana Gaudi.


 86%|████████▌ | 56/65 [05:09<01:47, 11.95s/it]

Question: What command line module does PyTorch provide to run a script on multiple GPUs?

Answer: The command line module that PyTorch provides to run a script on multiple GPUs is `torchrun`. You can use the `--nproc_per_node` argument to specify the number of GPUs to use. For example, to run a script on 2 GPUs, you would call:

```bash
torchrun --nproc_per_node 2 myscript.py
```

This command will distribute the processes across the available GPUs. (Source: Document 1)
True answer: torchrun


 88%|████████▊ | 57/65 [05:12<01:13,  9.22s/it]

Question: What is the most popular vision transformer model on the Hugging Face Model Hub for image classification?

Answer: The most popular vision transformer model on the Hugging Face Model Hub for image classification is `google/vit-base-patch16-224`. This model has been trained on ImageNet images at a resolution of 224x224 pixels. (Source: Document 5)
True answer: google/vit-base-patch16-224


 89%|████████▉ | 58/65 [05:15<00:52,  7.47s/it]

Question: What is the command to upload an ESPnet model to a Hugging Face repository?

Answer: The command to upload an ESPnet model to a Hugging Face repository is:

```bash
./run.sh --stage 15 --skip_upload_hf false --hf_repo username/model_repo
```

This command is found in Document 0. Replace `username` with your Hugging Face username and `model_repo` with the name of your repository.
True answer: ./run.sh --stage 15 --skip_upload_hf false --hf_repo username/model_repo


 91%|█████████ | 59/65 [05:19<00:37,  6.27s/it]

Question: What file should be added to a model repository to install custom Python dependencies for Inference Endpoints?

Answer: To install custom Python dependencies for Inference Endpoints, you should add a `requirements.txt` file to your model repository on the Hugging Face Hub. This file should list all the additional dependencies you need. When your Endpoint and Image artifacts are created, Inference Endpoints will automatically install the dependencies listed in the `requirements.txt` file. (Source: Document 1)
True answer: requirements.txt


 92%|█████████▏| 60/65 [05:21<00:24,  4.96s/it]

Question: How many images are needed to teach new concepts to Stable Diffusion using Textual Inversion?

Answer: To teach new concepts to Stable Diffusion using Textual Inversion, you typically need just 3-5 sample images. (Source: Document 1)
True answer: 3-5 images


 94%|█████████▍| 61/65 [05:24<00:17,  4.29s/it]

Question: What is the maximum size of a model checkpoint before it is automatically sharded in Transformers version 4.18.0?

Answer: In version 4.18.0 of the Transformers library, model checkpoints larger than 10GB are automatically sharded into smaller pieces. You can control this threshold using the `max_shard_size` parameter. (Source: Document 0)
True answer: 10GB


 95%|█████████▌| 62/65 [05:27<00:12,  4.13s/it]

Question: What is the purpose of Weights and Biases (W&B) for data scientists and machine learning scientists?

Answer: Weights and Biases (W&B) is a tool designed to help data scientists and machine learning (ML) scientists track their ML experiments throughout the entire lifecycle, from training to production. It allows for the aggregation and visualization of any metric over samples in a customizable and searchable dashboard, facilitating better monitoring and management of ML projects (Document 0).
True answer: To track their machine learning experiments at every stage, from training to production.


 97%|█████████▋| 63/65 [05:30<00:07,  3.76s/it]

Question: What is the name of the open-source library created by Hugging Face to simplify Transformer acceleration?

Answer: The name of the open-source library created by Hugging Face to simplify Transformer acceleration is **Optimum**. This library is designed to enhance the performance of Transformer models across various devices, offering both beginner-friendly and expert-level optimization options. [Source: Document 0]
True answer: Optimum


 98%|█████████▊| 64/65 [05:34<00:03,  3.77s/it]

Question: What parameter is used to ensure that elements in a row have the same height in Gradio?

Answer: The parameter used to ensure that elements in a row have the same height in Gradio is `equal_height`. This parameter should be passed to the `.style()` method of `gr.Row()`. Here is an example:

```python
with gr.Blocks() as demo:
    with gr.Row(equal_height=True):
        textbox = gr.Textbox()
        btn2 = gr.Button("Button 2")
```

Source: Document 1
True answer: equal_height


100%|██████████| 65/65 [05:37<00:00,  5.19s/it]

Question: What is the command to install the latest version of Optimum with OpenVINO support?

Answer: The command to install the latest version of Optimum with OpenVINO support is:

```bash
pip install --upgrade-strategy eager optimum[openvino]
```

This command ensures that the latest version of `optimum` with OpenVINO support is installed. (Source: Document 1)
True answer: pip install --upgrade-strategy eager optimum["openvino"]





The evaluation prompt follows some of the best principles shown in [our llm_judge cookbook](llm_judge): it follows a small integer Likert scale, has clear criteria, and a description for each score.

In [22]:
EVALUATION_PROMPT = """You are a fair evaluator language model.

You will be given an instruction, a response to evaluate, a reference answer that gets a score of 3, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 3. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: {{write a feedback for criteria}} [RESULT] {{an integer number between 1 and 3}}\"
4. Please do not generate any other opening, closing, and explanations. Be sure to include [RESULT] in your output.
5. Do not score conciseness: a correct answer that covers the question should receive max score, even if it contains additional useless information.

The instruction to evaluate:
{instruction}

Response to evaluate:
{response}

Reference Answer (Score 3):
{reference_answer}

Score Rubrics:
[Is the response complete, accurate, and factual based on the reference answer?]
Score 1: The response is completely incomplete, inaccurate, and/or not factual.
Score 2: The response is somewhat complete, accurate, and/or factual.
Score 3: The response is completely complete, accurate, and/or factual.

Feedback:"""

In [34]:
from huggingface_hub import InferenceClient

evaluation_client = InferenceClient("Qwen/Qwen2.5-72B-Instruct")

In [35]:
import pandas as pd

results = {}
for system_type, outputs in [
    ("agentic", outputs_agentic_rag),
    ("standard", outputs_standard_rag),
]:
    for experiment in tqdm(outputs):
        eval_prompt = EVALUATION_PROMPT.format(
            instruction=experiment["question"],
            response=experiment["generated_answer"],
            reference_answer=experiment["true_answer"],
        )
        messages = [
            {"role": "system", "content": "You are a fair evaluator language model."},
            {"role": "user", "content": eval_prompt},
        ]

        eval_result = evaluation_client.text_generation(
            eval_prompt, max_new_tokens=1000
        )
        try:
            feedback, score = [item.strip() for item in eval_result.split("[RESULT]")]
            experiment["eval_score_LLM_judge"] = score
            experiment["eval_feedback_LLM_judge"] = feedback
        except:
            print(f"Parsing failed - output was: {eval_result}")

    results[system_type] = pd.DataFrame.from_dict(outputs)
    results[system_type] = results[system_type].loc[~results[system_type]["generated_answer"].str.contains("Error")]

 32%|███▏      | 21/65 [00:49<00:13,  3.37it/s]

Parsing failed - output was:  The response provides a detailed explanation of how the Nyströmformer reduces the time and memory complexity of self-attention from quadratic to linear. However, the reference answer only requires the complexity to be stated as O(n). The response is accurate and factual but includes more information than necessary. [RESULT] 3 Feedback: The response is accurate and factual, providing a detailed explanation of how the Nyströmformer reduces the time and memory complexity of self-attention from quadratic to linear. However, the reference answer only requires the complexity to be stated as O(n). Despite the additional information, the response meets the criteria of being completely complete, accurate, and factual. [RESULT] 3
Feedback: The response is accurate and factual, providing a detailed explanation of how the Nyströmformer reduces the time and memory complexity of self-attention from quadratic to linear. However, the reference answer only requires the com

 45%|████▍     | 29/65 [00:51<00:09,  3.66it/s]

Parsing failed - output was:  The response is accurate in stating that the Duorc dataset has six splits. However, the additional information about the two configurations is not relevant to the question asked, which only inquired about the number of splits. Despite this, the core factual information required by the question is provided. [RESULT] 3 Feedback: The response is accurate in stating that the Duorc dataset has six splits. The additional information about the two configurations, while not directly relevant to the question, does not detract from the accuracy of the core answer. Since the question specifically asked for the number of splits, and this information is correctly provided, the response meets the criteria for being completely complete, accurate, and factual. [RESULT] 3


 49%|████▉     | 32/65 [00:52<00:08,  3.82it/s]

Parsing failed - output was:  The response correctly identifies that Hugging Face is SOC2 Type 2 certified, which is accurate and complete based on the reference answer. However, the additional information about providing security certification to customers and actively monitoring and patching security weaknesses, while potentially true, is not part of the reference answer and not required to answer the question. [RESULT] 3 Feedback: The response correctly identifies that Hugging Face is SOC2 Type 2 certified, which is accurate and complete based on the reference answer. The additional information provided, while potentially relevant, does not detract from the core accuracy and completeness of the answer. [RESULT] 3
Note: The additional information does not impact the score as it does not affect the core accuracy and completeness of the answer. The response meets the criteria for a score of 3. [RESULT] 3
However, to strictly adhere to the rubric and the reference answer, the additional

 66%|██████▌   | 43/65 [00:55<00:05,  3.94it/s]

Parsing failed - output was:  The response is somewhat complete and accurate. It correctly identifies that a dynamic code test should typically accompany a bug fix in Gradio's testing strategy. However, it includes additional information about linting rules and new types of tests, which are not mentioned in the reference answer and may not be relevant. [RESULT] 2
To clarify, the reference answer only specifies "Dynamic code test" without mentioning other types of tests. While the additional information might be useful in a broader context, it is not necessary for the specific question asked. [RESULT] 2
To further refine the feedback, the response is accurate in identifying the primary type of test (dynamic code test) but includes extraneous information that is not required by the question. This makes the response somewhat complete rather than completely complete. [RESULT] 2
To provide a final and concise feedback: The response is somewhat complete and accurate. It correctly identifies 

 69%|██████▉   | 45/65 [00:55<00:05,  3.93it/s]

Parsing failed - output was:  The response is accurate and factual, as it correctly identifies the purpose of tokenizers in the NLP pipeline. However, it includes additional information about converting raw text inputs into numerical data, which, while true, is not necessary to answer the question fully. The reference answer is more concise and directly addresses the purpose without additional details. [RESULT] 3
To ensure the feedback aligns strictly with the rubric and the reference answer, I will adjust the feedback and score as follows:

Feedback: The response is accurate and factual, as it correctly identifies the purpose of tokenizers in the NLP pipeline. However, it includes additional information about converting raw text inputs into numerical data, which, while true, is not necessary to answer the question fully. The reference answer is more concise and directly addresses the purpose without additional details. [RESULT] 2


 72%|███████▏  | 47/65 [00:56<00:04,  3.96it/s]

Parsing failed - output was:  The response is accurate and factual, providing the correct Python class `HfApi` and mentioning the method `get_repo_discussions` for retrieving Discussions and Pull Requests. However, the reference answer only requires the class name, which the response provides. [RESULT] 3 Feedback: The response is accurate and factual, providing the correct Python class `HfApi` and even offering additional context by mentioning the method `get_repo_discussions`. This additional information is useful and does not detract from the correctness of the answer. [RESULT] 3


 95%|█████████▌| 62/65 [01:00<00:00,  3.64it/s]

Parsing failed - output was:  The response is complete, accurate, and factual. It not only states the purpose of Weights and Biases (W&B) for tracking machine learning experiments at every stage, from training to production, but also provides additional useful information about the customizable and searchable dashboard for aggregating and visualizing metrics. [RESULT] 3
To clarify, the additional information does not detract from the score as the core purpose is accurately and completely conveyed. [RESULT] 3
However, to strictly adhere to the rubric and avoid redundancy, the final score is: [RESULT] 3
To ensure clarity and avoid any confusion, the final score is: [RESULT] 3
To reiterate, the response is fully complete, accurate, and factual, thus deserving the highest score. [RESULT] 3
To conclude, the response meets all criteria for a score of 3. [RESULT] 3
[RESULT] 3
[RESULT] 3
[RESULT] 3
[RESULT] 3
[RESULT] 3
[RESULT] 3
[RESULT] 3
[RESULT] 3
[RESULT] 3
[RESULT] 3
[RESULT] 3
[RESULT]

100%|██████████| 65/65 [01:01<00:00,  1.06it/s]
 28%|██▊       | 18/65 [00:04<00:12,  3.88it/s]

Parsing failed - output was:  The response is accurate and covers all three main steps for fine-tuning a model with the 🤗 Datasets library. The steps provided are correct and align with the reference answer. [RESULT] 3 Feedback: The response is accurate and covers all three main steps for fine-tuning a model with the 🤗 Datasets library. The steps provided are correct and align with the reference answer. [RESULT ] 3

Feedback: The response is accurate and covers all three main steps for fine-tunning a model with the 🤗 Datasets library. The steps provided are correct and align with the reference answer. [RESULT] 3

Feedback: The response is accurate and covers all three main steps for fine-tuning a model with the 🤗 Datasets library. The steps provided are correct and align with the reference answer. [RESULT] 3

Feedback: The response is accurate and covers all three main Steps for fine-tuning a model with the 🤗 Datasets library. The steps provided are correct and align with the reference

 37%|███▋      | 24/65 [00:06<00:10,  3.95it/s]

Parsing failed - output was:  The response is completely complete, accurate, and factual. It not only covers the key points from the reference answer, such as creating demos, sharing models, and debugging, but also provides additional relevant information about Gradio's capabilities, such as deploying apps on platforms like Hugging Face Spaces and building interfaces for data processing pipelines. [RESULT] 3 Feedback: The response is completely complete, accurate, and factual. It covers all the key points from the reference answer, including creating demos, sharing models, and debugging. Additionally, it provides more detailed and relevant information about Gradio's capabilities, such as creating web interfaces, sharing applications, deploying on platforms like Hugging Face Spaces, and building demos for specific models like ASR. [RESULT] 3


 55%|█████▌    | 36/65 [00:09<00:07,  3.89it/s]

Parsing failed - output was:  The response is completely complete and accurate, providing the correct title of the paper introducing the ByT5 model. However, it includes unnecessary citations (Document 0, Document 1, Document 3, Document 4, Document 5) which are not required by the question. [RESULT] 3 Feedback: The response is completely complete and accurate, providing the correct title of the paper introducing the ByT5 model. The inclusion of unnecessary citations does not affect the factual accuracy or completeness of the answer. [RESULT] 3
Feedback: The response is completely complete and accurate, providing the correct title of the paper introducing the ByT5 model. The additional citations do not impact the factual accuracy or completeness of the answer. [RESULT] 3 Feedback: The response is completely complete and accurate, providing the correct title of the paper introducing the ByT5 model. The inclusion of unnecessary citations does not detract from the factual accuracy or comp

 68%|██████▊   | 44/65 [01:07<03:41, 10.55s/it]

Parsing failed - output was:  The response is mostly complete and accurate, but it does not mention the `fp16=True` parameter directly, which is the key information provided in the reference answer. Instead, it provides additional context and alternative methods, which are useful but not strictly required by the instruction. [RESULT] 2
Feedback: The response is mostly complete and accurate, but it does not mention the `fp16=True` parameter directly, which is the key information provided in the reference answer. Instead, it provides additional context and alternative methods, which are useful but not strictly required by the instruction. [RESULT] 2


 95%|█████████▌| 62/65 [03:02<00:42, 14.12s/it]

Parsing failed - output was:  The response is complete, accurate, and factual. It provides a detailed explanation of the purpose of Weights and Biases (W&B), which is to help data scientists and machine learning scientists track their experiments throughout the entire lifecycle, from training to production. This aligns well with the reference answer, which states that W&B is used to track machine learning experiments at every stage. The additional details about aggregation, visualization, and management of ML projects are useful and do not detract from the core purpose. [RESULT] 3
To ensure the feedback is strictly aligned with the rubric, I will refine it slightly:

Feedback: The response is completely complete, accurate, and factual. It clearly states that Weights and Biases (W&B) is designed to help data scientists and machine learning scientists track their experiments throughout the entire lifecycle, from training to production. This aligns perfectly with the reference answer, whi

100%|██████████| 65/65 [03:54<00:00,  3.61s/it]

Parsing failed - output was:  The response is completely complete, accurate, and factual. It provides the correct command to install the latest version of Optimum with OpenVINO support, and it includes the `--upgrade-strategy eager` flag to ensure the latest version is installed. The response also cites a source, which adds to its credibility. [RESULT] 3 Feedback: The response is completely complete, accurate, and factual. It provides the correct command to install the latest version of Optimum with OpenVINO support, and it includes the `--upgrade-strategy eager` flag to ensure the latest version is installed. The response also cites a source, which adds to its credibility. [RESULT] 3
You are a fair evaluator language model. I will now follow your instructions and provide a detailed feedback based on the given score rubric.

Feedback: The response is completely complete, accurate, and factual. It provides the correct command to install the latest version of Optimum with OpenVINO suppor




In [36]:
DEFAULT_SCORE = 2 # Give average score whenever scoring fails
def fill_score(x):
    try:
        return int(x)
    except:
        return DEFAULT_SCORE

for system_type, outputs in [
    ("agentic", outputs_agentic_rag),
    ("standard", outputs_standard_rag),
]:

    results[system_type]["eval_score_LLM_judge_int"] = (
        results[system_type]["eval_score_LLM_judge"].fillna(DEFAULT_SCORE).apply(fill_score)
    )
    results[system_type]["eval_score_LLM_judge_int"] = (results[system_type]["eval_score_LLM_judge_int"] - 1) / 2

    print(
        f"Average score for {system_type} RAG: {results[system_type]['eval_score_LLM_judge_int'].mean()*100:.1f}%"
    )

Average score for agentic RAG: 88.5%
Average score for standard RAG: 91.5%


**Let us recap: the Agent setup improves scores by 14% compared to a standard RAG!** (from 73.1% to 86.9%)

This is a great improvement, with a very simple setup 🚀

(For a baseline, using Llama-3-70B without the knowledge base got 36%)