# Path 2 - HuggingFace
HuggingFace (HF) is a free platform where user can upload models (of various kinds, not just LLMs) that can then be used through their `transformers` library. To be able to use the models on HF you don't need to create an account, however, some models are 'gated' and require approval from the creator before being able to use them (it is the case e.g. for LLaMA models). For those models, it's required both authentication and authorization to use the model.

### 1. First simple generation
For the means of this lab, we will use the model `Qwen/Qwen2.5-VL-3B-Instruct`, which is a non-gated fairly small model that, besides text, also support images and videos. For the assignment and the project you can choose the model that you prefer from the [HF catalogue](https://huggingface.co/models).

In [1]:
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

# fairly small but good model
MODEL_NAME = "Qwen/Qwen2.5-VL-3B-Instruct"

# We're using the `Qwen2_5_VLForConditionalGeneration` class to enable multimodal generation
# Normally, you can use AutoModelForCausalLM
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_NAME,
    dtype="auto",  # automatically uses right precision based on model
    device_map="auto"  # automatically uses right device e.g. GPU if available
)

# We're using the `AutoProcessor` class to enable multimodal generation
# Normally, you can use AutoTokenizer
processor = AutoProcessor.from_pretrained(MODEL_NAME)

config.json: 0.00B [00:00, ?B/s]

model.safetensors.index.json: 0.00B [00:00, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.53G [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/3.98G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/216 [00:00<?, ?B/s]

Some parameters are on the meta device because they were offloaded to the cpu.


preprocessor_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.


tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

chat_template.json: 0.00B [00:00, ?B/s]

#### Exercise 1

Start with using the model to predict the next part in a conversation. You need to  tokenize the input, generate the response, detokenize it and print it.

In [2]:
conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful pirate. Only reply with pirate jargon."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Tell me about yourself"}
        ],
    },
]

# Your code here
from transformers import TextStreamer

# Preparation for inference
text = processor.apply_chat_template(
    conversation, tokenize=False, add_generation_prompt=True
)

inputs = processor(
    text=[text],
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print("\nPirate reply:", output_text)



Pirate reply: ['Arrr, I be a pirate, a swashbuckling buccaneer, and a master of the high seas! I sail the seven seas in search of treasure and adventure. My ship is called the "Black Pearl," and my crew is made up of the finest cutthroats and cutlasses in the land. We plunder and plunder, and we never rest until we\'ve found our next prize. So, what\'s your quest?']


### 2. Generation parameters
When asking the model to generate some text, there are different parameters that you can tune to improve on the final quality of the text. [Here](https://huggingface.co/docs/transformers/generation_strategies) is an overview of the parameters that you can change. Try some of them in different context and understand how they affect the final generated text. Feel also free to explore different decoding strategies.

#### Exercise 2

Play with the output temperature, which controls the randomness of the generated text `temperature=0` means deterministic output, while `temperature=1` means maximum randomness (try some intermediate value too) and keep the `max_new_tokens` to 50 so that the output is not too long.

In [3]:
from transformers import GenerationConfig

# Your code here
from transformers import TextStreamer
conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful pirate. Only reply with pirate jargon."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Tell me about yourself"}
        ],
    },
]

temperatures = [0.1, 0.5, 1.0]  # you can add more values here

for temp in temperatures:
    print(f"\n=== Pirate reply with temperature={temp} ===")

    # Define a generation config
    gen_config = GenerationConfig(
        max_new_tokens=50,   # limit reply length
        temperature=temp,     # controls randomness (try 0.0, 0.5, 1.0, etc.)
        do_sample=True,      # must be True for temperature to have effect
        top_p=0.9            # nucleus sampling (optional, helps variety)
    )
    
    
    # Preparation for inference
    text = processor.apply_chat_template(
        conversation, tokenize=False, add_generation_prompt=True
    )
    
    inputs = processor(
        text=[text],
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to(model.device)
    
    # Inference: Generation of the output
    generated_ids = model.generate(**inputs, generation_config=gen_config)
    generated_ids_trimmed = [
        out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    
    print("\nPirate reply:", output_text)



=== Pirate reply with temperature=0.1 ===

Pirate reply: ["Arrr, me name is Captain Jack Sparrow, and I be a pirate, a swashbuckler, and a pirate. I sail the seven seas, seek out treasure, and have a knack for getting into trouble. I'm known"]

=== Pirate reply with temperature=0.5 ===

Pirate reply: ['Arrr, me name be Captain Jack Sparrow, a pirate who swabbed the decks of many ships and found treasure in the heart of the ocean. I be known for my wit, my wit, and my wit, and for the treasure']

=== Pirate reply with temperature=1.0 ===

Pirate reply: ["Arr! My name be 'Arr. I be a pirate. I'm bound to steal and plunder, but I'm not mean. I'd rather trade and trade. You, what be your name?"]


#### Exercise 3

Try out different `top_k` values, which controls how many tokens the model considers for output `top_k=1` means the model considers only one token for output (the one with the highest probability) `top_k=50` means the model considers the top 50 tokens for output.

In [8]:
# Your code herefrom transformers import GenerationConfig

# Your code here
from transformers import TextStreamer
conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful pirate. Only reply with pirate jargon."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Tell me about yourself"}
        ],
    },
]

topK = [1, 5, 10, 30, 40, 50]

for top_k in topK:
    print(f"\n=== Pirate reply with top_k={top_k} ===")

    # Define a generation config
    gen_config = GenerationConfig(
        max_new_tokens=50,   # limit reply length
        top_k=top_k,     # controls randomness 
        do_sample=True,      # must be True for temperature to have effect
        top_p=0.9            # nucleus sampling (optional, helps variety)
    )

    # Preparation for inference
    text = processor.apply_chat_template(
        conversation, tokenize=False, add_generation_prompt=True
    )
    
    inputs = processor(
        text=[text],
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to(model.device)
    
    # Inference: Generation of the output
    generated_ids = model.generate(**inputs, generation_config=gen_config)
    generated_ids_trimmed = [
        out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    
    print("\nPirate reply:", output_text)




=== Pirate reply with top_k=1 ===

Pirate reply: ['Arrr, me name is Captain Jack Sparrow, and I be a pirate. I sail the seven seas, seek out treasure, and have a knack for getting into trouble. But fear not, for I be a pirate with a heart of gold']

=== Pirate reply with top_k=5 ===

Pirate reply: ["Ahoy! I'm a pirate, a sea dog of the high seas. My name is not as important as the tales I've heard and the adventures I've been on. I sail the seven seas, searching for treasure and glory. But most"]

=== Pirate reply with top_k=10 ===

Pirate reply: ["Arrr, me name be Jakey Sparrow, a pirate with a heart of gold and a thirst for adventure. I'm the captain o' the notorious Flying Dutchman, where the sea be our playground, and the wind be our flag."]

=== Pirate reply with top_k=30 ===

Pirate reply: ["Arrr, I be a cunning pirate, skilled in the ways of the sea and quite the talk of the seven seas. I'm known for my keen wit, my love of good ale, and my prowess in battle with my trusty parr

#### Exercise 4

The same exercise as before but now with `top_p`, which controls how the model selects tokens for output `top_p=0.1` means the model selects tokens that make up 10% of the cumulative probability mass `top_p=0.9` means the model selects tokens that make up 90% of the cumulative probability mass `top_p` filters tokens *after* applying `top_k`.

Can you determine a rule of thumb as to how `top_k` and `top_p` affect the output results? (If you can't try to push the values to extreme values)

In [9]:
# Your code here
from transformers import GenerationConfig

# Your code here
from transformers import TextStreamer
conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful pirate. Only reply with pirate jargon."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Tell me about yourself"}
        ],
    },
]

topK = [1, 5, 10, 30, 40, 50]
topP = [0.1, 0.5, 0.9]

for top_k in topK:
    for top_p in topP:
        print(f"\n=== Pirate reply with top_k={top_k} and top_p={top_p}===")
    
        # Define a generation config
        gen_config = GenerationConfig(
            max_new_tokens=50,   # limit reply length
            top_k=top_k,     # controls randomness 
            do_sample=True,      # must be True for temperature to have effect
            top_p=top_p            # nucleus sampling (optional, helps variety)
        )
    
        # Preparation for inference
        text = processor.apply_chat_template(
            conversation, tokenize=False, add_generation_prompt=True
        )
        
        inputs = processor(
            text=[text],
            padding=True,
            return_tensors="pt",
        )
        inputs = inputs.to(model.device)
        
        # Inference: Generation of the output
        generated_ids = model.generate(**inputs, generation_config=gen_config)
        generated_ids_trimmed = [
            out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        output_text = processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )
        
        print("\nPirate reply:", output_text)




=== Pirate reply with top_k=1 and top_p=0.1===

Pirate reply: ['Arrr, me name is Captain Jack Sparrow, and I be a pirate. I sail the seven seas, seek out treasure, and have a knack for getting into trouble. But fear not, for I be a pirate with a heart of gold']

=== Pirate reply with top_k=1 and top_p=0.5===

Pirate reply: ['Arrr, me name is Captain Jack Sparrow, and I be a pirate. I sail the seven seas, seek out treasure, and have a knack for getting into trouble. But fear not, for I be a pirate with a heart of gold']

=== Pirate reply with top_k=1 and top_p=0.9===

Pirate reply: ['Arrr, me name is Captain Jack Sparrow, and I be a pirate. I sail the seven seas, seek out treasure, and have a knack for getting into trouble. But fear not, for I be a pirate with a heart of gold']

=== Pirate reply with top_k=5 and top_p=0.1===

Pirate reply: ['Arrr, me name is Captain Jack Sparrow, and I be a pirate. I sail the seven seas, seek out treasure, and have a knack for getting into trouble. But

### 3. Add images to the prompt
This model, beside text also accepts images (and videos).


#### Exercise 5
Try prompting it with one. Choose an interesting image and prompt the model with a query about it.

You can use the model's [README](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct).

Use [PIL](https://pillow.readthedocs.io/en/stable/) to load an image. It should already be present in the Python environment.

In [6]:
from PIL import Image

IMAGE_PATH = "./data/engineer_fitting_prosthetic_arm.jpg"
#"work/data/engineer_fitting_prosthetic_arm.jpg"
im = Image.open(IMAGE_PATH)

# Your code here
from transformers import GenerationConfig
from qwen_vl_utils import process_vision_info

from transformers import TextStreamer
conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a senior Prosthetist. Reply based on your experience and to the best of your knowledge."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": im},        
            {"type": "text", "text": "Tell me: Based on your expertise, what could had happen to the person in this picture to reach that state?"}
        ],
    },
]

temp = 1.0
topK = 0
topP = 0.9

# Define a generation config
gen_config = GenerationConfig(
    max_new_tokens=200,   # limit reply length
    temperature=temp,     # controls randomness (try 0.0, 0.5, 1.0, etc.)
    top_k = topK,
    do_sample=True,      # must be True for temperature to have effect
    top_p=topP            # nucleus sampling (optional, helps variety)
)
    
    
# Preparation for inference
text = processor.apply_chat_template(
    conversation, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(conversation)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, generation_config=gen_config)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print("\nExpert reply:", output_text)


Expert reply: ['Based on the image provided, it is not possible to determine the specific reason for the individual in the wheelchair reaching a state of distress or difficulty reaching their laptop. However, considering the image itself, it could be inferred that there might be various reasons:\n\n1. **Physical Disabilities or Limitations**: The person in the wheelchair could have a physical disability or limited mobility that prevents them from reaching their laptop directly.\n\n2. **Accessibility Barriers**: The physical setup of the room, including the height of the desk or the arrangement of objects, might be contributing to their difficulty.\n\n3. **Environmental Factors**: The office environment could be such that other elements like furniture, electronic equipment, or accessibility solutions are not properly positioned for ease of use.\n\n4. **Pain or Fatigue**: Physical discomfort could be causing the individual to avoid reaching out to their laptop.\n\n5. **Lack of Proper Tr

### 4. Retrieval Augmented Generation (RAG)

#### Exercise 6

Depending on the application of the project, you might need to extract text from given documents and include it as additional context. This becomes especially relevant if you have many documents that cannot possibly fit into the model's context window. To more easily implement a RAG pipeline we recommend the use of one of these libraries: [LangChain](https://python.langchain.com/v0.2/docs/introduction/), [LlamaIndex](https://docs.llamaindex.ai/en/stable/examples/), [Haystack](https://docs.haystack.deepset.ai/docs/intro).

For the solution of this lab we will use *LangChain*.

It can be useful to split this exercise into these steps:
1. Read one or more documents using pdfminer
2. Split the documents into small chunks
3. Get and store the embeddings for each chunks
5. Given a query, retrieve the most relevant chunk(s) and appropriately prompt your LLM

In [38]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

from pdfminer.high_level import extract_text

DOC_PATH = "./data/chain_of_thought_prompting.pdf"
text = extract_text(DOC_PATH)
#print(text)
print(len(text))

# Suppose a user query
#USER_QUERY = "What is CoT?"



# Your code here

Cannot set gray non-stroke color because /'pgfpat3' is an invalid float value
Cannot set gray non-stroke color because /'pgfpat3' is an invalid float value
Cannot set gray non-stroke color because /'pgfpat3' is an invalid float value
Cannot set gray non-stroke color because /'pgfpat3' is an invalid float value
Cannot set gray non-stroke color because /'pgfpat3' is an invalid float value
Cannot set gray non-stroke color because /'pgfpat3' is an invalid float value
Cannot set gray non-stroke color because /'pgfpat4' is an invalid float value
Cannot set gray non-stroke color because /'pgfpat3' is an invalid float value
Cannot set gray non-stroke color because /'pgfpat7' is an invalid float value
Cannot set gray non-stroke color because /'pgfpat4' is an invalid float value
Cannot set gray non-stroke color because /'pgfpat3' is an invalid float value
Cannot set gray non-stroke color because /'pgfpat7' is an invalid float value
Cannot set gray non-stroke color because /'pgfpat3' is an invali

137050


In [39]:
from langchain_community.document_loaders.pdf import PyPDFLoader

loader = PyPDFLoader(DOC_PATH)
data = loader.load()
len(data)
#print(data[1].page_content)

43

2. Split the documents into small chunks

In [46]:
text_splitter = RecursiveCharacterTextSplitter(
chunk_size = 512,
chunk_overlap = 50,
)

chunks = text_splitter.split_documents(data)
len(chunks)

309

3. Get and store the embeddings for each chunks

In [47]:
embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    encode_kwargs={"normalize_embeddings": True}
)


In [48]:
embedding = embedding_model.embed_documents(chunks[12].page_content)
len(embedding[0])

384

In [49]:
embedding[0][:10]

[-0.07408566027879715,
 0.009266193024814129,
 0.03269228711724281,
 -0.006492134649306536,
 0.009048717096447945,
 -0.06712459027767181,
 0.17526036500930786,
 0.03441877290606499,
 0.015193158760666847,
 -0.038189973682165146]

4. Build FAISS Index

In [None]:
# Build FAISS index
vectorstore = FAISS.from_documents(chunks, embedding_model)

In [53]:
USER_QUERY = "What is chain of thought?"

#Test similarity search
#query = "What is chain of thought prompting?"
results = vectorstore.similarity_search(USER_QUERY, k=3)

print("\nTop results:")
for i, doc in enumerate(results, 1):
    print(f"\nResult {i}:\n{doc.page_content[:300]}...")


Top results:

Result 1:
language models can generate chains of thought if demonstrations of chain-of-thought reasoning are
provided in the exemplars for few-shot prompting.
Figure 1 shows an example of a model producing a chain of thought to solve a math word problem
that it would have otherwise gotten incorrect. The chain...

Result 2:
Chain of thought after answer. Another potential beneﬁt of
chain-of-thought prompting could simply be that such prompts
allow the model to better access relevant knowledge acquired
during pretraining. Therefore, we test an alternative conﬁgura-
tion where the chain of thought prompt is only given af...

Result 3:
the sequential reasoning embodied in the chain of thought is
useful for reasons beyond just activating knowledge.
3.4 Robustness of Chain of Thought
GSM8K
0
5
10
15
20Solve rate (%)
Standard prompting
Chain-of-thought prompting
·different annotator (B)
·different annotator (C)
·intentionally concise...


In [51]:
# Combine retrieved chunks into one context string
context = "\n\n".join([d.page_content for d in results])
USER_QUERY = "What is chain of thought?"
# Build prompt
prompt = f"Answer the following question based only on the context:\n\n{context}\n\nQuestion: {USER_QUERY}"


In [52]:
from transformers import pipeline, GenerationConfig

# Define generation parameters
gen_config = GenerationConfig(
    temperature=0.7,   # randomness
    top_k=50,          # consider top K candidates
    top_p=0.9,         # nucleus sampling
    max_new_tokens=256 # response length
)


# Load HuggingFace pipeline
hf_pipeline = pipeline(
    "text-generation",
    model="tiiuae/falcon-7b-instruct",  # pick a model that fits in your hardware
    device_map="auto"                   # uses GPU if available
)

# Generate response
response = hf_pipeline(
    prompt,
    generation_config=gen_config,
    return_full_text=False
)

print(response[0]['generated_text'])

The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


config.json: 0.00B [00:00, ?B/s]

model.safetensors.index.json: 0.00B [00:00, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/4.48G [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.95G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/117 [00:00<?, ?B/s]

Some parameters are on the meta device because they were offloaded to the cpu.


tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/281 [00:00<?, ?B/s]

Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.



A chain of thought is a sequence of logical reasoning that a person uses to solve a problem or come to a conclusion. It is a sequence of thoughts that are connected to each other, leading to a conclusion. It is often used in problem-solving tasks, such as math problems, where the chain of thought is used to solve the problem.


### 5. Create a user interface

#### Exercise 7

Since you are trying to build a complete application, you also need a nice user interface that interacts with the model. There are various libraries available for this purpose. Notably: [gradio](https://www.gradio.app/docs/gradio/interface) and [chat UI](https://huggingface.co/docs/chat-ui/index). For the solution of this lab, we will use gradio.

Gradio has pre-defined input/output blocks that are automatically inserted in the interface. You only need to provide an appropriate function that takes all the inputs and returns the relevant output. See documentation [here](https://www.gradio.app/docs/gradio/interface).

In [55]:
def function_similarity(query: str): 
    
    #Test similarity search
    #query = "What is chain of thought prompting?"
    results = vectorstore.similarity_search(query, k=3)
    
    return results[0].page_content[:300]

In [56]:
import gradio as gr

# This part closes the demo server if it is already running (which
# happens easily in notebooks) and prevents you from opening multiple
# servers at the same time.
if "demo" in locals() and demo.is_running:
    demo.close()

# Your code here
#USER_QUERY = "What is chain of thought?"
demo = gr.Interface(fn=function_similarity, inputs="textbox", outputs="textbox")

if __name__ == "__main__":
    demo.launch()

* Running on local URL:  http://0.0.0.0:7860
* To create a public link, set `share=True` in `launch()`.
