# Prompt engineering | Homework 5 | Konstantinos Vakalopoulos 12223236

The first option was chosen for homework 5.

## ***1.1 Option 1***

Create a Python implementation for the food-for-thought generating prompts strategy
presented by Prof. Gottlob in his lecture. Refer to his slide 60 for details.
The program will accept an original user query as an input and return two answers
of a large language model as an output:

1. The original answer of the model without any further prompting.

2. The answer of the model after it was primed using Gottlob's food-for-thought
method.

The program can be submitted either as a single runnable .py with example input
and outputs of an actual run as comments, or in the form of a Jupyter notebook
including that information.

### ***1.1.1 Selection of LLM***

You can use any LLM you can access and we encourage you to do so. If you otherwise do not have access to a decent model, you can use https://huggingface.co/HuggingFaceH4/zephyr-7b-beta?inference_api=true for your implementation,
which is free to use in the cloud with an—also free—HuggingFace account.

### ***1.1.2 Remembering chat history***

Keep in mind that the model is stateless, which means that you have to provide the
context of what has been said before in every call, as the model "forgets" everything
said immediately. Different models were trained using different prompts and therefore
require different formats. You will have to send the full chat history in each request.
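
As an illustration of what "sending the full chat history" means, the sketch below renders a list of messages into the prompt layout that appears in the printed outputs of this notebook (`<|system|>`, `<|user|>`, `<|assistant|>` tags, each message closed by `</s>`). The function `render_zephyr_prompt` is a hypothetical, hand-written approximation for explanation only; in the notebook itself the formatting is done by `pipe.tokenizer.apply_chat_template`, and the exact whitespace it produces may differ.

```python
from typing import Dict, List


def render_zephyr_prompt(messages: List[Dict[str, str]]) -> str:
    """Approximate the Zephyr chat layout seen in this notebook's outputs."""
    # One block per message: the role tag on its own line, then the content and </s>
    parts = [f"<|{m['role']}|>\n{m['content']}</s>\n" for m in messages]
    # The trailing assistant tag asks the model to write the next reply
    return "".join(parts) + "<|assistant|>\n"


# Every earlier turn is repeated in the request, because the model keeps no state
history = [
    {"role": "system", "content": "You are a helpful chatbot."},
    {"role": "user", "content": "What causes traffic jams?"},
    {"role": "assistant", "content": "Mostly demand exceeding road capacity."},
    {"role": "user", "content": "How can we reduce them?"},
]
print(render_zephyr_prompt(history))
```

Each new request therefore grows by one user turn and one assistant turn, which is why long conversations cost more tokens per call.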


## Documentation

First, the torch package was installed together with the transformers library, which provides the API for loading and running the model. The model used is zephyr-7b-beta from Hugging Face. The model page linked in the exercise provides sample code showing how to load the LLM and how to prompt it. Compared with that sample code, the changes made here are the questions that are asked and a for loop that appends each answer from the LLM to the chat history.

It is worth mentioning that, according to its creators, this model (Zephyr-7B-β) has not been aligned to human preferences for safety within the RLHF phase, nor is it deployed with in-the-loop filtering of responses as ChatGPT is, so it can produce problematic outputs. However, the outputs obtained in this notebook were reasonable in both quality and length.

In [1]:
# Install transformers from source - only needed for versions <= v4.34
# pip install git+https://github.com/huggingface/transformers.git
# pip install accelerate

!pip install torch
!pip install git+https://github.com/huggingface/transformers.git
!pip install accelerate

Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-p6iusd0e
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-p6iusd0e
  Resolved https://github.com/huggingface/transformers.git to commit e547458c43dfdbbb8f6a7757237e234c44e20a8f
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ... done
  Created wheel for transformers: filename=transformers-4.38.0.dev0-py3-none-any.whl size=8404873 sha256=dbf30f92e8883e81dc8dee39400ee6139c62ebc9af7d62b4b5262f1d8b4951b5
  Stored in directory: /tmp/pip-ephem-wheel-cache-5wvu00lz/wheels/e7/9c/5b/e1a9c8007c343041e61cc484433d512ea9274272e3fcbe7c16
Successfully bu

In [2]:
# Import torch and transformers package

import torch
from transformers import pipeline
torch.cuda.empty_cache()
# Set up the model for the text generation pipeline
# Running this may take a minute
pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta", torch_dtype=torch.bfloat16, device_map="auto")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


The code block below sends the original question to the LLM without any additional prompting.

In [61]:
job_description = "You are a knowledgeable chatbot providing information about everything!"

# We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
messages_without = [
    {
        "role": "system",
        "content": job_description,
    },
]

# Original prompt without prompting
Original_Prompt = "How can we reduce traffic congestion in urban areas?"
messages_without.append({"role": "user", "content": Original_Prompt})
prompt = pipe.tokenizer.apply_chat_template(messages_without, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
output = outputs[0]["generated_text"]
print(output)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<|system|>
You are a knowledgeable chatbot providing information about everything!</s>
<|user|>
How can we reduce traffic congestion in urban areas?</s>
<|assistant|>
Here are some ways to reduce traffic congestion in urban areas:

1. Encourage public transportation: Invest in and expand public transportation systems such as buses, subways, and trains to provide convenient and affordable alternatives to driving. This can also help to reduce the number of cars on the road.

2. Promote carpooling and ride-sharing: Encourage people to share rides by offering incentives such as preferred parking, toll discounts, or subsidies. This can help to reduce the number of cars on the road and alleviate congestion.

3. Implement congestion pricing: Charge drivers a fee to enter congested areas during peak hours. This can help to deter unnecessary driving and encourage the use of public transportation or other alternative modes of transportation.

4. Improve road infrastructure: Invest in infrastruct

The code block below sends the prompting prompt to the LLM, following slide 60 of the lecture notes. The LLM is asked to propose two questions, Q1 and Q2, which it will answer later so that its answer to the original prompt can improve.

In [62]:
job_description = "You are a knowledgeable chatbot providing information about everything!"

# We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
message = [
    {
        "role": "system",
        "content": job_description,
    },
]

# Prompting Prompt
PP = "Which are two questions Q1, Q2 whose answers would be most helpful to solve the following problem: How can we reduce traffic congestion in urban areas?"

message.append({"role": "user", "content": PP})
prompt = pipe.tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
output = outputs[0]["generated_text"]

print(output)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<|system|>
You are a knowledgeable chatbot providing information about everything!</s>
<|user|>
Which are two questions Q1, Q2 whose answers would be most helpful to solve the following problem: How can we reduce traffic congestion in urban areas?</s>
<|assistant|>
1. Q1: What are the most congested areas in our city during peak hours?
2. Q2: What are the major causes of traffic congestion in our city during peak hours?

These questions would provide valuable insights into identifying the specific locations and factors contributing to traffic congestion, which can then be addressed through targeted solutions. Some possible solutions could include improving public transportation, implementing intelligent transportation systems, encouraging carpooling and bike-sharing, and optimizing traffic signals.


The code block below performs a few text-processing steps to extract the two questions from the LLM's answer.

In [63]:
import re

# Keep only the text generated after the last assistant tag
assistant_tag = '<|assistant|>'
response_start = output.rfind(assistant_tag)
questions_string = output[response_start + len(assistant_tag):]

Original_Prompt = "How can we reduce traffic congestion in urban areas?"

# Match numbered lines such as "1. Q1: ...?" and capture the "Q1: ..." part
pattern = re.compile(r'\d+\.\s(Q\d+:\s.+?)\n')

# Find all matches in the input string
matches = pattern.findall(questions_string)

# Stop here if the model did not return two questions in the expected format;
# otherwise the next lines would fail with an undefined variable
if len(matches) < 2:
    raise ValueError("Could not find both questions.")

# Ask the two generated questions first, then the original question
questions_list = [matches[0], matches[1], "Final Question: " + Original_Prompt]
questions_list

['Q1: What are the most congested areas in our city during peak hours?',
 'Q2: What are the major causes of traffic congestion in our city during peak hours?',
 'Final Question: How can we reduce traffic congestion in urban areas?']
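
The pattern used above only matches lines of the form `1. Q1: ...` that are followed by a newline, so a question on the very last line of the reply would be missed. A slightly more tolerant parser could look like the sketch below. The helper name `extract_questions` and the accepted variations (optional list number, `:`, `.` or `)` after the label) are assumptions about what the model might return, not behaviour observed in this run.

```python
import re
from typing import List

# Optional "1." or "1)" list number, then "Q<n>" followed by ":", "." or ")",
# then the question text up to the end of the line (or of the string)
QUESTION_LINE = re.compile(
    r"^[ \t]*(?:\d+[.)][ \t]*)?Q(\d+)[ \t]*[:.)][ \t]*(.+?)[ \t]*$",
    re.MULTILINE,
)


def extract_questions(reply: str, expected: int = 2) -> List[str]:
    """Return the first `expected` questions found in the reply as 'Qn: text'."""
    found = [f"Q{num}: {text}" for num, text in QUESTION_LINE.findall(reply)]
    if len(found) < expected:
        raise ValueError(f"expected {expected} questions, found {len(found)}")
    return found[:expected]


# The second question ends the string without a trailing newline
reply = "1. Q1: Where is traffic worst?\n2. Q2: What causes it?"
print(extract_questions(reply))
```

Raising an error when fewer questions are found than expected makes a malformed reply visible immediately instead of failing later in the notebook.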

Finally, the LLM answers the two questions and then the original prompt within one conversation. The final answer differs from the answer obtained above without prompting.

In [64]:
messages = [
    {
        "role": "system",
        "content": job_description,
    },
]

assistant_tag = '<|assistant|>'

for question in questions_list:
    print(question)
    messages.append({"role": "user", "content": question})
    prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    outputs = pipe(prompt, max_new_tokens=128, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
    output = outputs[0]["generated_text"]

    # generated_text repeats the whole prompt, so keep only the text after the last assistant tag
    response_start = output.rfind(assistant_tag)
    answer = output[response_start + len(assistant_tag):]

    # Store only the new answer in the history; storing the full output would
    # duplicate the earlier prompt inside every later request
    messages.append({"role": "assistant", "content": answer.strip()})

    # Print only the answer, chat-style, instead of the full system/user/assistant transcript
    print(answer)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Q1: What are the most congested areas in our city during peak hours?


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



I don't have access to real-time traffic data, but based on historical traffic patterns and statistics, some of the most congested areas in [city name] during peak hours (usually between 7-10 am and 5-7 pm) are:

1. Downtown area: this includes major intersections like [intersection names], as well as busy streets like [street names].
2. Central business district (cbd): the area around [cbd name] is heavily congested during peak hours due to a high volume of commuters and office workers.
3
Q2: What are the major causes of traffic congestion in our city during peak hours?


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



1. High volume of commuters: a large number of people travel to and from work, school, and other destinations during peak hours, leading to a significant increase in traffic.
2. Insufficient infrastructure: the city's road network is often inadequate to handle the high volume of vehicles during peak hours, resulting in congestion.
3. Public transportation issues: overcrowded buses, trains, and subways during peak hours can lead to delays, causing more people to take their cars, further contributing to traffic congestion.
4. Road accidents and construction: any road accidents or
Final Question: How can we reduce traffic congestion in urban areas?

1. Encourage public transportation: Investing in and improving public transportation systems, such as buses, trains, and subways, can significantly reduce the number of cars on the road during peak hours.

2. Promote carpooling and ride-sharing: Encouraging people to share rides can help reduce the number of cars on the road, as well as reduc
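
The cells above can be condensed into one function that takes the user query and returns both answers. The sketch below is one possible arrangement, not the code that produced the outputs shown here. `food_for_thought`, `ask`, and the `generate` callable are hypothetical names, and `generate` is assumed to receive the full chat history and return only the newly generated reply. `stub_model` is a stand-in so that the example runs without downloading a model.

```python
import re
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]
# Assumed interface: full chat history in, new reply text out
Generate = Callable[[List[Message]], str]

PRIMING_TEMPLATE = (
    "Which are two questions Q1, Q2 whose answers would be most helpful "
    "to solve the following problem: {query}"
)


def ask(history: List[Message], question: str, generate: Generate) -> str:
    """Add a user turn, generate a reply from the whole history, and store it."""
    history.append({"role": "user", "content": question})
    reply = generate(history)
    history.append({"role": "assistant", "content": reply})
    return reply


def food_for_thought(query: str, generate: Generate, system_prompt: str) -> Tuple[str, str]:
    """Return (answer without priming, answer after priming) for one query."""
    def new_chat() -> List[Message]:
        return [{"role": "system", "content": system_prompt}]

    # 1. Baseline: the original question in a fresh conversation
    plain = ask(new_chat(), query, generate)

    # 2. Let the model propose two helper questions
    proposal = ask(new_chat(), PRIMING_TEMPLATE.format(query=query), generate)
    helpers = re.findall(r"(Q\d+:\s*.+)", proposal)[:2]
    if len(helpers) < 2:
        raise ValueError("Could not find both questions.")

    # 3. Answer the helper questions, then the original one, in a single conversation
    chat = new_chat()
    for helper in helpers:
        ask(chat, helper, generate)
    primed = ask(chat, "Final Question: " + query, generate)
    return plain, primed


def stub_model(history: List[Message]) -> str:
    """Stand-in for a real LLM call, used only to make the sketch runnable."""
    last = history[-1]["content"]
    if last.startswith("Which are two questions"):
        return "1. Q1: Where is it worst?\n2. Q2: What causes it?\n\nThese would help."
    return f"reply to '{last}' given {len(history)} messages"


plain, primed = food_for_thought("How can we reduce traffic?", stub_model, "You are a chatbot.")
print(plain)
print(primed)
```

To use it with the model from this notebook, `generate` could wrap `pipe.tokenizer.apply_chat_template` and the `pipe(...)` call; passing `return_full_text=False` to the text-generation pipeline should return only the new text, which would remove the need to search for the assistant tag.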

## Comments on the results

The questions and answers all concern reducing traffic congestion in urban areas. The notebook presents the original question answered with and without the additional prompting, together with the two additional questions generated by the LLM and its answers to them. These two questions are responsible for the modified final answer. The difference between the answers with and without prompting is noticeable: the final answer has been slightly modified by the two prompt questions.

Three limitations are worth mentioning. First, because of the limit on generated tokens (`max_new_tokens`), all answers shown above are cut off. Second, running the code again will produce different results, so the examples above come from a single execution and serve only as a demonstration. Third, only two questions were used because of the limited computational resources available on Google Colab.
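
Results change between runs because the generation calls sample the next token at random (`do_sample=True` with `temperature`, `top_k` and `top_p`). The toy sketch below illustrates the idea on a three-word vocabulary. It is a simplified illustration in pure Python and not the implementation used by transformers: `sample_token` and the example logits are made up, and probabilities are not renormalized between the top-k and top-p steps. It also shows that a fixed random seed makes the sequence repeatable; for the real pipeline, calling `transformers.set_seed` before generation should play a similar role on the same hardware and library versions.

```python
import math
import random


def sample_token(logits, temperature=0.7, top_k=50, top_p=0.95, rng=random):
    """Pick one token from a dict of token -> logit (simplified sampling)."""
    # Temperature scaling followed by a numerically stable softmax
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((w / total, tok) for tok, w in weights.items()), reverse=True)

    # top-k: keep only the k most likely tokens
    ranked = ranked[:top_k]

    # top-p: keep the shortest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for prob, tok in ranked:
        kept.append((prob, tok))
        cumulative += prob
        if cumulative >= top_p:
            break

    tokens = [tok for _, tok in kept]
    probs = [prob for prob, _ in kept]
    return rng.choices(tokens, weights=probs, k=1)[0]


def draw(seed, logits, count=10):
    """Sample `count` tokens with a generator seeded by `seed`."""
    rng = random.Random(seed)
    return [sample_token(logits, rng=rng) for _ in range(count)]


toy_logits = {"bus": 2.0, "car": 1.0, "bike": 0.5}  # made-up values
print(draw(1, toy_logits))
print(draw(1, toy_logits) == draw(1, toy_logits))  # same seed, same sequence: True
```

With `top_k=1` the function always returns the most likely token, which corresponds to turning sampling off.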