<a href="https://colab.research.google.com/github/docfhsp/fhsp-memorial/blob/main/wikipedia_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**In this notebook, we showcase how to use the KVpress pipeline by answering questions about the NVIDIA Wikipedia article.**

The notebook explains:

1.   How to set up a press and use it in a transformers pipeline
2.   How to answer multiple questions while reusing the compressed context
3.   How to achieve high compression ratios by using SnapKVPress and adding the question to the context



In [None]:
# TensorFlow is not needed here and is compiled against numpy<2.0, which causes an import error because we use numpy>=2.0
!pip uninstall tensorflow -y


In [None]:
!pip install kvpress --quiet

**Please restart the session if you encounter an import issue below.**

This is a known issue with Google Colab.

In [None]:
import requests
from bs4 import BeautifulSoup

import torch
from transformers import pipeline

from kvpress import (
    ExpectedAttentionPress,
    KnormPress,
    ObservedAttentionPress,
    RandomPress,
    SnapKVPress,
    StreamingLLMPress,
)

# Load the pipeline and data

In [None]:
# Load pipeline

device = "cuda:0"
ckpt = "Qwen/Qwen2.5-1.5B-Instruct"
# use attn_implementation = "eager" for ObservedAttentionPress or attn_implementation = "flash_attention_2" if you can use flash attention
# flash_attention_2 is not fully supported on T4 GPUs, so we are using sdpa
attn_implementation = "sdpa"
pipe = pipeline("kv-press-text-generation", model=ckpt, device=device, torch_dtype=torch.float16, model_kwargs={"attn_implementation":attn_implementation})

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
# Load data
url = "https://en.wikipedia.org/wiki/Nvidia"
content = requests.get(url).content
soup = BeautifulSoup(content, "html.parser")
context = "".join([p.text for p in soup.find_all("p")]) + "\n\n"
tokens = pipe.tokenizer.encode(context, return_tensors="pt").to(device)
print(f"Number of tokens: {tokens.size(1)}")

Number of tokens: 9775


# Use the pipeline with a press

In [None]:
# First we ensure that the question cannot be answered using the model's internal knowledge
question = "What happened on March 1, 2024?"
true_answer = "Nvidia became the third company in the history of the United States to close with a market capitalization in excess of $2 trillion."
pred_answer = pipe(" ", question=question, press=ExpectedAttentionPress(0.0))["answer"]

print(f"Question:   {question}")
print(f"Answer:     {true_answer}")
print(f"Prediction: {pred_answer}")

Question:   What happened on March 1, 2024?
Answer:     Nvidia became the third company in the history of the United States to close with a market capitalization in excess of $2 trillion.
Prediction: I'm sorry, but I'm not able to provide information about specific events or dates. My knowledge cutoff is 2021, so I don't have up-to-date information about events that occurred on March 1, 202


In [None]:
# Pick a press and a compression ratio. You can rerun the following cells with different presses.
compression_ratio = 0.3
press = ExpectedAttentionPress(compression_ratio)
# press = KnormPress(compression_ratio)
# press = RandomPress(compression_ratio)
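
To get a feel for what a given `compression_ratio` means, the sketch below estimates how many context tokens remain in the KV cache after compression. It assumes that a press with ratio `r` keeps roughly `n * (1 - r)` tokens, rounded down. `kept_tokens` is a hypothetical helper, not part of kvpress, and the exact rounding inside the library may differ.

```python
def kept_tokens(n_tokens: int, compression_ratio: float) -> int:
    """Estimate how many context tokens remain in the KV cache after compression.

    Assumption: a press with ratio r keeps roughly n * (1 - r) tokens, rounded down.
    """
    if not 0.0 <= compression_ratio < 1.0:
        raise ValueError("compression_ratio must be in [0, 1)")
    return int(n_tokens * (1 - compression_ratio))


n_context = 9775  # token count printed when the article was loaded
for ratio in (0.0, 0.3, 0.5, 0.7):
    kept = kept_tokens(n_context, ratio)
    print(f"ratio={ratio:.1f} -> about {kept} of {n_context} tokens kept")
```

Since the KV cache grows linearly with the number of cached tokens, the same fraction gives a rough idea of the memory saved for the context.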

In [None]:
# Run the pipeline on a single question

question = "What happened on March 1, 2024?"
true_answer = "Nvidia became the third company in the history of the United States to close with a market capitalization in excess of $2 trillion."
pred_answer = pipe(context, question=question, press=press)["answer"]

print(f"Question:   {question}")
print(f"Answer:     {true_answer}")
print(f"Prediction: {pred_answer}")

The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.


Question:   What happened on March 1, 2024?
Answer:     Nvidia became the third company in the history of the United States to close with a market capitalization in excess of $2 trillion.
Prediction: On March 1, 2024, Nvidia became the third company in the history of the United States to close with a market capitalization in excess of $2 trillion. This milestone was reached after only 180 days from reaching


In [None]:
# Increasing the compression ratio to 0.5 causes the model to give an incorrect answer.
# The optimal compression ratio depends on the specific model, as well as on the model size.

question = "What happened on March 1, 2024?"
true_answer = "Nvidia became the third company in the history of the United States to close with a market capitalization in excess of $2 trillion."
pred_answer = pipe(context, question=question, press=ExpectedAttentionPress(compression_ratio=0.5))["answer"]

print(f"Question:   {question}")
print(f"Answer:     {true_answer}")
print(f"Prediction: {pred_answer}")

Question:   What happened on March 1, 2024?
Answer:     Nvidia became the third company in the history of the United States to close with a market capitalization in excess of $2 trillion.
Prediction: Based on the information provided, there is no specific event or news item mentioned for March 1, 2024. The text does not contain any details about what happened on that particular date. To provide accurate information, I would need more
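
The largest ratio that still yields a correct answer can be found with a small sweep. The sketch below assumes a hypothetical `answer_fn(ratio)` callable that returns the model's answer for a given compression ratio. In this notebook it could wrap `pipe(context, question=question, press=ExpectedAttentionPress(compression_ratio=ratio))["answer"]`. The stub used here only imitates the behaviour observed above (correct at 0.3, incorrect at 0.5) so that the example runs without a GPU.

```python
from typing import Callable, Iterable, Optional


def largest_working_ratio(
    answer_fn: Callable[[float], str],
    ratios: Iterable[float],
    expected_substring: str,
) -> Optional[float]:
    """Return the largest ratio whose answer still contains the expected text.

    Returns None if no ratio produces a matching answer.
    """
    best = None
    for ratio in sorted(ratios):
        answer = answer_fn(ratio)
        if expected_substring.lower() in answer.lower():
            best = ratio
    return best


def stub_answer_fn(ratio: float) -> str:
    """Stand-in for the real pipeline call (assumption: answers degrade above 0.3)."""
    if ratio <= 0.3:
        return "Nvidia became the third company to close with a market capitalization in excess of $2 trillion."
    return "There is no specific event mentioned for March 1, 2024."


best_ratio = largest_working_ratio(stub_answer_fn, [0.1, 0.3, 0.5, 0.7], "$2 trillion")
print(f"Largest ratio with a correct answer: {best_ratio}")
```

A substring check is a crude correctness test. It is enough for a single factual question but would need a sturdier metric for open-ended answers.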


In [None]:
# Run the pipeline on multiple questions. The context is compressed only once and reused for each question.

questions = [
    "What happened on March 1, 2024?",
    "What was the unofficial company motto of Nvidia during the early days?",
]

true_answers = [
    "Nvidia became the third company in the history of the United States to close with a market capitalization in excess of $2 trillion",
    "Our company is thirty days from going out of business",
]

pred_answers = pipe(context, questions=questions, press=press)["answers"]
for question, pred_answer, true_answer in zip(questions, pred_answers, true_answers):
    print(f"Question:   {question}")
    print(f"Answer:     {true_answer}")
    print(f"Prediction: {pred_answer}")
    print()

Question:   What happened on March 1, 2024?
Answer:     Nvidia became the third company in the history of the United States to close with a market capitalization in excess of $2 trillion
Prediction: On March 1, 2024, Nvidia became the third company in the history of the United States to close with a market capitalization in excess of $2 trillion. This milestone was reached after only 180 days from reaching

Question:   What was the unofficial company motto of Nvidia during the early days?
Answer:     Our company is thirty days from going out of business
Prediction: According to the information provided, the unofficial company motto of Nvidia during the early days was:

"Our company is thirty days from going out of business."

This motto was reportedly coined by Huang during a time of extreme desperation and financial difficulty for the company. It
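
With several questions, checking predictions by eye gets tedious. A minimal sketch of an automatic check follows, assuming that a prediction counts as correct when it contains the reference answer after lowercasing and collapsing whitespace. `normalize` and `contains_answer` are hypothetical helpers rather than part of kvpress, and the sample strings are copied from the output above.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace so formatting differences are ignored."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def contains_answer(prediction: str, reference: str) -> bool:
    """Return True if the normalized reference appears inside the normalized prediction."""
    return normalize(reference) in normalize(prediction)


sample_true_answers = [
    "Nvidia became the third company in the history of the United States "
    "to close with a market capitalization in excess of $2 trillion",
    "Our company is thirty days from going out of business",
]
sample_pred_answers = [
    "On March 1, 2024, Nvidia became the third company in the history of the United States "
    "to close with a market capitalization in excess of $2 trillion. This milestone was reached",
    "According to the information provided, the unofficial company motto of Nvidia "
    'during the early days was:\n\n"Our company is thirty days from going out of business."\n\n'
    "This motto was reportedly coined by Huang",
]

results = [contains_answer(p, t) for p, t in zip(sample_pred_answers, sample_true_answers)]
print(f"{sum(results)} of {len(results)} predictions contain the reference answer")
```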



In [None]:
# Use an answer prefix and limit the number of tokens in the answer

question = "What is GTC ?"
true_answer = "Nvidia's GPU Technology Conference (GTC) is a series of technical conferences held around the world."
answer_prefix = "Come on you don't know GTC ? Everyone"
max_new_tokens = 30

pred_answer_with_prefix = pipe(context, question=question, answer_prefix=answer_prefix, press=press, max_new_tokens=max_new_tokens)["answer"]
pred_answer_without_prefix = pipe(context, question=question, press=press, max_new_tokens=max_new_tokens)["answer"]

print(f"Question:              {question}")
print(f"Answer:                {true_answer}")
print(f"Prediction w/o prefix: {pred_answer_without_prefix}")
print(f"Prediction w/ prefix : {answer_prefix + pred_answer_with_prefix}")

Question:              What is GTC ?
Answer:                Nvidia's GPU Technology Conference (GTC) is a series of technical conferences held around the world.
Prediction w/o prefix: GTC stands for GPU Technology Conference. It is an annual conference that focuses on the latest developments and advancements in graphics processing units (GPUs) and
Prediction w/ prefix : Come on you don't know GTC ? Everyone knows GTC. It's the GPU Technology Conference. It's a major event in the graphics processing unit (GPU) industry. It's where top


In [None]:
# SnapKV uses the latest queries to prune the KV cache. It is therefore more effective to include the question during compression, since the latest queries then correspond to the question.
# However, this also implies that SnapKV cannot compress the context well independently of the question (e.g. in a chat use case).


question = "What happened on March 1, 2024?"
true_answer = "Nvidia became the third company in the history of the United States to close with a market capitalization in excess of $2 trillion."

press = SnapKVPress(compression_ratio=0.7)

pred_answer_with_question = pipe(context + question, press=press)["answer"]
pred_answer_without_question = pipe(context, question=question, press=press, max_new_tokens=400)["answer"]

print(f"Question:         {question}")
print(f"Answer:           {true_answer}")
print(f"Prediction w/ Q:  {pred_answer_with_question}")
print(f"Prediction w/o Q: {pred_answer_without_question}")

Question:         What happened on March 1, 2024?
Answer:           Nvidia became the third company in the history of the United States to close with a market capitalization in excess of $2 trillion.
Prediction w/ Q:  On March 1, 2024, Nvidia became the third company in the S&P 500 to reach a market capitalization of $2 trillion. This milestone was reached during trading hours, and Nvidia needed only 18
Prediction w/o Q: On March 1, 2024, Nvidia CEO Jensen Huang announced at the company's annual meeting that Nvidia would be adding a new "Inception" program to its portfolio of AI-focused initiatives. This new program would focus on developing and deploying AI technologies in a wide range of industries, including healthcare, manufacturing, and transportation. The goal of the Inception program is to accelerate the development and deployment of AI solutions that can help solve some of the world's most pressing challenges.
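
The difference between the two predictions follows from how SnapKV scores the cache. The toy sketch below illustrates the idea for a single attention head in pure Python: the last `window_size` queries vote on which earlier keys to keep. It is a simplified illustration under stated assumptions, not the kvpress implementation. It skips the pooling step and multi-head handling, normalizes attention over the earlier keys only, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def snapkv_like_keep_indices(
    queries: Sequence[Sequence[float]],
    keys: Sequence[Sequence[float]],
    window_size: int,
    compression_ratio: float,
) -> List[int]:
    """Pick which key positions to keep, using only the last `window_size` queries."""
    n_tokens = len(keys)
    if not 0 < window_size < n_tokens:
        raise ValueError("window_size must be between 1 and len(keys) - 1")
    n_prefix = n_tokens - window_size
    scale = math.sqrt(len(keys[0]))
    # Accumulate the attention each earlier key receives from the window queries.
    scores = [0.0] * n_prefix
    for query in queries[-window_size:]:
        logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys[:n_prefix]]
        for i, weight in enumerate(softmax(logits)):
            scores[i] += weight
    # Keep the highest-scoring earlier positions plus the whole observation window.
    n_kept_prefix = max(0, int(n_tokens * (1 - compression_ratio)) - window_size)
    top = sorted(range(n_prefix), key=lambda i: scores[i], reverse=True)[:n_kept_prefix]
    return sorted(top) + list(range(n_prefix, n_tokens))


# Six tokens with 2-dimensional keys; only position 1 aligns with the window queries.
toy_keys = [[0.0, 1.0], [3.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
toy_queries = [[0.0, 0.0]] * 4 + [[3.0, 0.0], [3.0, 0.0]]
kept = snapkv_like_keep_indices(toy_queries, toy_keys, window_size=2, compression_ratio=0.5)
print(f"Kept positions: {kept}")
```

When the question is appended to the context, the window queries come from the question tokens, so keys relevant to the question score highly. Without it, the window covers only the tail of the article, and the passage about March 1, 2024 may be pruned.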
