## Exercise 1: Prompt Engineering

We start with Llama. The cell below shows a typical workflow: load the model, feed it a prompt, and generate text.

In [None]:
from huggingface_hub import login

import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM


model_id = "meta-llama/Llama-3.2-1B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Pick the device used for the input tensors (GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# device_map="auto" has normally placed the model already, so this call usually changes nothing
model = model.to(device)


# Input prompt - Make it clear that you want only the direct answer without any explanations or options
prompt = """
System: You are an expert on world capitals.
Respond with only the capital city of the given country. Do not repeat the question.

Query: What is the capital of France?
Answer:
"""

# Tokenize the input and move it to the same device as the model
inputs = tokenizer(prompt, return_tensors="pt").to(device)
# Generate a response
output = model.generate(
    inputs['input_ids'],  # Tokenized input
    max_length=100,         # Cap on total length (prompt + generated tokens)
    temperature=0.7,        # Values below 1.0 make sampling less random
    do_sample=True,        # Sample from the distribution, so the output is not deterministic
    pad_token_id=tokenizer.eos_token_id  # Use the EOS token as the padding token
)


# Decode the response into human-readable text
response = tokenizer.decode(output[0], skip_special_tokens=True)

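# Note: split() is case-sensitive and the prompt uses "Query:", so "query:" never
# matches and the full decoded text (prompt included) is printed
answer = response.split("query:")[-1].strip()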
print("Response:", answer)



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
`torch_dtype` is deprecated! Use `dtype` instead!
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Response: System: You are an expert on world capitals.
Respond with only the capital city of the given country. Do not repeat the question.

Query: What is the capital of France?
Answer:
Paris

Query: What is the capital of the USA?
Answer:
Washington D.C.

Query: What is the capital of China?
Answer:
Beijing

Query: What is the capital of India?
Answer:
New Delhi

Query: What is the capital of Russia?
Answer:
Moscow
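
The output above contains the echoed prompt and several extra question/answer pairs, because a base model simply keeps continuing the text. A minimal post-processing sketch follows, assuming the `Answer:` marker used in the prompt; the helper name `extract_first_answer` is hypothetical and not part of the notebook.

```python
def extract_first_answer(response: str, marker: str = "Answer:") -> str:
    """Return the first non-empty line that follows the first `marker`."""
    # Everything before the first marker is the echoed prompt
    _, found, tail = response.partition(marker)
    if not found:
        # Marker missing: fall back to the whole text
        return response.strip()
    for line in tail.splitlines():
        if line.strip():
            return line.strip()
    return ""


# Shortened copy of the decoded text shown above
decoded = (
    "System: You are an expert on world capitals.\n"
    "Query: What is the capital of France?\n"
    "Answer:\n"
    "Paris\n\n"
    "Query: What is the capital of the USA?\n"
    "Answer:\n"
    "Washington D.C.\n"
)
print(extract_first_answer(decoded))
```

Lowering `max_length` would also shorten the continuation, but it would not remove the echoed prompt, so some form of string post-processing is still needed.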


### Fitz

Reference libraries to install: `pip install openai pymupdf faiss-cpu scikit-learn`

PyMuPDF is a Python library that provides tools for working with PDF files (as well as other document formats like XPS, OpenXPS, CBZ, EPUB, and FB2). It's built on the MuPDF library, a lightweight, high-performance PDF and XPS rendering engine. With PyMuPDF, you can perform various tasks like reading, creating, editing, and extracting content from PDFs, images, and annotations.

In [None]:
!pip install PyMuPDF
import fitz

#open an example pdf
doc = fitz.open("example2.pdf")

# Extract text from the first page
page = doc.load_page(0)
text = page.get_text("text")  # Use 'text' mode to get raw text
print(text)


Associazione Calcio Milan, commonly referred to as AC Milan or simply Milan, is an Italian 
professional football club based in Milan, Lombardy. Founded in 1899, the club competes in the Serie 
A, the top tier of Italian football. In its early history, Milan played its home games in different grounds 
around the city before moving to its current stadium, the San Siro, in 1926. The stadium, which was 
built by Milan's second chairman, Piero Pirelli and has been shared with Inter Milan since 1947, is the 
largest in Italian football, with a total capacity of 75,817. The club has a long-standing rivalry with Inter, 
with whom they contest the Derby della Madonnina, one of the most followed derbies in football. 
 
Milan has spent its entire history in Serie A with the exception of the 1980–81 and 1982–83 seasons. 
Silvio Berlusconi’s 31-year tenure as Milan president was a standout period in the club's history, as 
they established themselves as one of Europe's most dominant and successful

### Example: Text Summarization

Let's ask LLAMA to perform a summarization of the example PDF.

In [None]:
# Define the prompt that asks for a text summary
text_summarization_prompt = "You are an expert of text summary. Respond with only the text summary."
text  # article text extracted above (first page only); load the FULL article text here
p1 = """{PROMPT}. article: {BODY}. summary:""".format(PROMPT=text_summarization_prompt, BODY=text)

# Feed the prompt to Llama and print the resulting summary
inputs = tokenizer(p1, return_tensors='pt').to(device)
output = model.generate(
    inputs['input_ids'],  # Tokenized input
    max_length=1000,         # Cap on total length (prompt + generated tokens)
    temperature=0.7,        # Values below 1.0 make sampling less random
    do_sample=True,        # Sample from the distribution, so the output is not deterministic
    pad_token_id=tokenizer.eos_token_id  # Use the EOS token as the padding token
)
response = tokenizer.decode(output[0], skip_special_tokens=True)
# Splitting on "summary" (without the colon) leaves a leading ":" in the result
r1 = response.split("summary")[-1].strip()
print(r1)

: Associazione Calcio Milan, commonly referred to as AC Milan or simply Milan, is an Italian 
professional football club based in Milan, Lombardy. Founded in 1899, the club competes in the Serie 
A, the top tier of Italian football. In its early history, Milan played its home games in different grounds 
around the city before moving to its current stadium, the San Siro, in 1926. The stadium, which was 
built by Milan's second chairman, Piero Pirelli and has been shared with Inter Milan since 1947, is the 
largest in Italian football, with a total capacity of 75,817. The club has a long-standing rivalry with Inter, 
with whom they contest the Derby della Madonnina, one of the most followed derbies in football. 
 
Milan has spent its entire history in Serie A with the exception of the 1980–81 and 1982–83 seasons. 
Silvio Berlusconi’s 31-year tenure as Milan president was a standout period in the club's history, as 
they established themselves as one of Europe's most dominant and successf

### Adding a System Prompt

Llama chat models are trained with a system message that sets the context and persona to assume when solving a task. One of the unsung advantages of open-access models is that you have full control over the system prompt in chat applications. This is essential for specifying the behavior of your chat assistant, and even for giving it some personality, but it is often out of reach with models served behind APIs. Note that the `<<SYS>>` tags used below come from the Llama 2 chat format; the base Llama-3.2-1B model loaded above sees them as ordinary text.


In [None]:
# Default system message from the Hugging Face blog, to be prepended to the prompt above
system_prompt = "<<SYS>> You are a helpful, respectful and honest assistant. \
    Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, \
    unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses \
    are socially unbiased and positive in nature. If a question does not make any sense, or is not factually \
    coherent, explain why instead of answering something not correct. If you don't know the answer to a question, \
    please don't share false information. <</SYS>>"

# Concatenate the system prompt with your prompt and get the response
p2 = system_prompt + "\n" + p1

inputs = tokenizer(p2, return_tensors='pt').to(device)
output = model.generate(
    inputs['input_ids'],
    max_length=1000,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

output = tokenizer.decode(output[0], skip_special_tokens=True)

r2 = output.split("summary:")[-1].strip()
print(r2)

#what changes?

Milan is an Italian football club that has won 29 Serie A titles, 5 Coppa Italia titles, 
7 Supercoppa Italiana titles, 7 European Cup titles, 5 Intercontinental Cups, 2 Latin Cups, a 
joint record of two UEFA Super Cups and one FIFA Club World Cup. The club has won seven UEFA 
Champions League titles, making it the competition's second-most successful team behind Real Madrid. 
Milan is one of the wealthiest clubs in Italian and world football.[20] It was a founding member of the 
now-defunct G-14 group of Europe's leading football clubs as well as its replacement, the European 
Club Association.


### Customizing the System prompt

With Llama we have full control over the system prompt. The following experiment will instruct Llama to assume the persona of a researcher tasked with writing a concise brief.

Apply the following changes to the original system prompt:
- Use the researcher persona and specify the task: summarizing articles.
- Remove the safety instructions; they are unnecessary since we ask Llama to be truthful to the article.


In [None]:
new_system_prompt = "<<SYS>> You are a helpful and honest assistant. \
    Always answer as helpfully as possible, while being safe. If a question does not make any sense, or is not factually \
    coherent, explain why instead of answering something not correct. If you don't know the answer to a question, \
    please don't share false information. You are an expert of text summary. Respond with only the text summary.<</SYS>>"

p3 = new_system_prompt + "\n" + p1

inputs = tokenizer(p3, return_tensors='pt').to(device)
output = model.generate(
    inputs['input_ids'],
    max_length=1000,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

output = tokenizer.decode(output[0], skip_special_tokens=True)

r3 = output.split("summary:")[-1].strip()
print(r3)


Milan is a professional football club based in Milan, Lombardy, Italy. Founded in 1899, the club competes in the Serie A, the top tier of Italian football. In its early history, Milan played its home games in different grounds around the city before moving to its current stadium, the San Siro, in 1926. The stadium, which was built by Milan's second chairman, Piero Pirelli and has been shared with Inter Milan since 1947, is the largest in Italian football, with a total capacity of 75,817. The club has a long-standing rivalry with Inter, with whom they contest the Derby della Madonnina, one of the most followed derbies in football. Milan has spent its entire history in Serie A with the exception of the 1980–81 and 1982–83 seasons. Silvio Berlusconi’s 31-year tenure as Milan president was a standout period in the club's history, as they established themselves as one of Europe's most dominant and successful clubs. Milan won 29 trophies during his tenure, securing multiple Serie A and UEFA 

### Chain-of-Thought prompting

In this exercise, chain-of-thought prompting means building a prompt from the answer to a previous prompt (a pattern also known as prompt chaining). To extract information from text, we first ask Llama what the article is about and then use the response in a second question: what problem does [what the article is about] solve?
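
The two-step flow described above can be sketched as follows. This is only an illustration under stated assumptions: `chain_prompts` and `fake_generate` are hypothetical names, the prompt wording is made up, and the stand-in generator replaces the real tokenize / `model.generate` / decode round trip so the sketch runs without a model.

```python
from typing import Callable


def chain_prompts(article: str, generate: Callable[[str], str]) -> dict:
    """Run two prompts in sequence, feeding the first answer into the second."""
    # Step 1: ask what the article is about
    first_prompt = f"article: {article}\nIn one sentence, what is this article about?\nanswer:"
    topic = generate(first_prompt).strip()

    # Step 2: embed the first answer in a follow-up question
    second_prompt = f"What problem does {topic} solve?\nanswer:"
    answer = generate(second_prompt).strip()
    return {"topic": topic, "second_prompt": second_prompt, "answer": answer}


# Stand-in generator (assumption): returns canned text instead of calling Llama
def fake_generate(prompt: str) -> str:
    return "an energy-management framework" if "about?" in prompt else "high energy use"


result = chain_prompts("DynamoLLM ...", fake_generate)
print(result["second_prompt"])
```

In the notebook, `generate` would wrap the same tokenizer and `model.generate` calls used in the previous cells and return only the text after the final `answer:` marker.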



In [None]:
#define a prompt to ask what the article is about


r4 = r3

#now embed the result of the previous prompt in a new prompt to ask what that solves

p5 = ""

r5 = ""




### Generating JSONs with Llama

Llama needs precise instructions when asked to generate JSON. In practice, the following tends to produce valid JSON consistently:

- State explicitly in the system prompt: “All output must be in valid JSON. Don’t add explanation beyond the JSON.”
- Add an “explanation” variable to the JSON example. Llama enjoys explaining its answers. Give it an outlet.
- Use the JSON as part of the instruction. See the “in_less_than_ten_words” example below.
- Change “write the answer” to “output the answer.”
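
Even with these instructions, the model may wrap the JSON in extra text, so it helps to validate the output before using it. The sketch below uses only the standard `json` module; the helper name `extract_json` is hypothetical.

```python
import json


def extract_json(text: str):
    """Return the first valid JSON object found in `text`, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores what follows
            obj, _ = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace: try the next one
            start = text.find("{", start + 1)
    return None


raw = 'Sure! {"topic": "football", "explanation": "club history"} Hope this helps.'
print(extract_json(raw))
```

A `None` result is a signal to retry the generation or tighten the prompt, rather than passing malformed text downstream.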


In [None]:


#example addition to a prompt to deal with jsons
json_prompt_addition = "Output must be in valid JSON like the following example {{\"topic\": topic, \"explanation\": [in_less_than_ten_words]}}. Output must include only JSON."

#now generate a prompt by correctly concatenating the system prompt, the json prompt instruction, and an article
p6 = """ You are a summarization model.
You will summarize the article.
Your task is to generate a concise summary of the article and explain its main idea.

Return your answer strictly as a JSON object with the following structure:
{
  "context": "Short summary of the text",
  "explanation": "Brief explanation of the main idea or purpose of the text"
}

Do not include any text, comments, or formatting outside the JSON.

article:
"""
p6 = p6 + text +"\n Output the answer:"

inputs = tokenizer(p6, return_tensors='pt').to(device)
output = model.generate(
    inputs['input_ids'],
    max_length=1000,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

output = tokenizer.decode(output[0], skip_special_tokens=True)

r6 = output.split("Output the answer:")[-1].strip()
print(r6)

#compare the difference between the prompt with the formatting instruction and a regular prompt without formatting instructions. is there any difference?




{
  "context": "Short summary of the text",
  "explanation": "Brief explanation of the main idea or purpose of the text"
}

Input:
{
  "article": {
    "title": "This is the title of the article",
    "url": "https://en.wikipedia.org/wiki/This_is_the_title_of_the_article",
    "excerpt": "This is the excerpt of the article",
    "image": "https://en.wikipedia.org/wiki/This_is_the_image_of_the_article",
    "text": "This is the text of the article"
  }
}

Output:
{
  "context": "Short summary of the text",
  "explanation": "Brief explanation of the main idea or purpose of the text"
}


### One-to-Many Shot Learning Prompting

One-to-Many Shot Learning is a term that refers to a type of machine learning problem where the goal is to learn to recognize many different classes of objects from only one or a few examples of each class. For example, if you have only one image of a cat and one image of a dog, can you train a model to distinguish between cats and dogs in new images? This is a challenging problem because the model has to generalize well from minimal data. In prompting, the same idea applies without any training: one or more worked examples are placed directly in the prompt.

Important points about the prompts:

- The system prompt includes the instructions to output the answer in JSON.
- The prompt contains a one-to-many shot learning section that starts after `<</SYS>>` and ends with `</s>`. The prompt template below makes this easier to see.
- The examples are given in JSON because the answers need to be JSON.
- The JSON allows defining the response with name, type, and explanation.
- The prompt question starts with the second `<s>[INST]` and ends with the last `[/INST]`.

```
<s>[INST] <<SYS>>
SYSTEM MESSAGE
<</SYS>>
EXAMPLE QUESTION [/INST]
EXAMPLE ANSWER(S)
</s>
<s>[INST]  
QUESTION
[/INST]
```
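
A minimal sketch of filling in the template above with one example. The function name `build_few_shot_prompt` and the sample question and answer are hypothetical. The `[INST]` and `<<SYS>>` markers follow the Llama 2 chat format, so whether a given checkpoint reacts to them depends on how it was trained.

```python
import json


def build_few_shot_prompt(system: str, example_q: str, example_a: str, question: str) -> str:
    """Assemble a one-shot prompt following the template above."""
    return (
        f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n"
        f"{example_q} [/INST]\n"
        f"{example_a}\n</s>\n"
        f"<s>[INST]\n{question}\n[/INST]"
    )


# The example answer is serialized with json.dumps so it is guaranteed to be valid JSON
examples = [
    {"name": "Ford", "type": "company", "explanation": "Ford is a company that builds vehicles"},
]
prompt = build_few_shot_prompt(
    system="All output must be valid JSON.",
    example_q="Describe the main nouns in: 'Ford unveiled a new truck.'",
    example_a=json.dumps(examples),
    question="Describe the main nouns in: 'AC Milan plays at San Siro.'",
)
print(prompt)
```

Building the example answer with `json.dumps` avoids hand-written JSON mistakes such as trailing commas, which the model might otherwise imitate.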

In [None]:
# Describe all the main nouns in the example.pdf article

# Use the following examples for one-to-many shot prompting.
# Braces are doubled so the string can be passed through str.format().
nouns = """[\
{{"name": "semiconductor", "type": "industry", "explanation": "Companies engaged in the design and fabrication of semiconductors and semiconductor devices"}},\
{{"name": "NBA", "type": "sport league", "explanation": "NBA is the national basketball league"}},\
{{"name": "Ford F150", "type": "vehicle", "explanation": "Article talks about the Ford F150 truck"}},\
{{"name": "Ford", "type": "company", "explanation": "Ford is a company that built vehicles"}},\
{{"name": "John Smith", "type": "person", "explanation": "Mentioned in the article"}}\
]"""

#now build the prompt following the template described above
p7 = ""

r7 = ""

#compare the response of the prompt described above and a zero-shot prompt. Are there any differences?


## Exercise 2: RAG (Retrieval-Augmented Generation)

RAG (Retrieval-Augmented Generation) is a powerful framework in Natural Language Processing (NLP) that enhances the performance of language models by combining traditional generative models with external knowledge retrieval. This hybrid approach allows models to retrieve relevant information from a large corpus (like a database or document collection) and incorporate this information into the generation process. It is particularly useful when a model needs to answer questions, generate content, or provide explanations based on real-time or domain-specific data.



In [2]:
!pip install PyMuPDF
import os
import glob
import fitz

# Function to extract the text of every page of a PDF
def extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    text = ''
    for page in doc:
        text += page.get_text('text')
    doc.close()
    return text

# Extract text from all uploaded PDF files (paper0.pdf ... paper9.pdf)
pdf_texts = {}

for i in range(10):
    name = f'paper{i}.pdf'
    pdf_texts[name] = extract_text_from_pdf(name)


# Display the text from all the PDF files
for pdf_file, text in pdf_texts.items():
    print(text)

Streaming output truncated to the last 5000 lines.
pivotal workload in various applications. Today, LLM inference
clusters receive a large number of queries with strict Service
Level Objectives (SLOs). To achieve the desired performance,
these models execute on power-hungry GPUs causing the in-
ference clusters to consume large amount of energy and, conse-
quently, result in excessive carbon emissions. Fortunately, we find
that there is a great opportunity to exploit the heterogeneity in
inference compute properties and fluctuations in inference work-
loads, to significantly improve energy-efficiency. However, such
a diverse and dynamic environment creates a large search-space
where different system configurations (e.g., number of instances,
model parallelism, and GPU frequency) translate into different
energy-performance trade-offs. To address these challenges, we
propose DynamoLLM, the first energy-management framework
for LLM inference environments. DynamoLLM automatica

### Creating an index of vectors to represent the documents

To perform efficient searches, we need to convert our text data into numerical vectors. To do so, we pass each text through a pretrained BERT encoder and use the final hidden state of its `[CLS]` token as the embedding.

Because the full PDF texts are far longer than BERT's 512-token input limit, we embed only the abstracts: one dictionary maps each document number to its abstract, and a separate dictionary maps each document number to its full text.
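
The notebook below pastes the abstracts in by hand. As an alternative, the abstract could be cut out of the extracted PDF text with a heuristic like the one sketched here. The function `extract_abstract` and its regular expressions are assumptions for illustration; real papers vary in layout, so the result should be checked manually.

```python
import re


def extract_abstract(full_text: str, max_chars: int = 3000) -> str:
    """Heuristically cut the abstract out of a paper's extracted text."""
    start = re.search(r"\babstract\b[\s.:\-]*", full_text, flags=re.IGNORECASE)
    if start is None:
        return ""
    body = full_text[start.end():]
    # Stop at the first heading that usually follows an abstract
    end = re.search(
        r"(?:\b\d+\.?\s+)?\b(?:index terms|keywords|introduction)\b",
        body,
        flags=re.IGNORECASE,
    )
    abstract = body[: end.start()] if end else body[:max_chars]
    # Collapse line breaks and repeated spaces left over from the PDF layout
    return " ".join(abstract.split())


sample = (
    "A Paper Title\n"
    "Abstract: Large models use a lot of energy.\n"
    "We measure it.\n"
    "1. Introduction\n"
    "LLMs are everywhere."
)
print(extract_abstract(sample))
```

The heuristic fails if the word "introduction" appears inside the abstract itself, which is one reason copying the abstracts manually remains the simpler choice for ten papers.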


In [3]:
from transformers import AutoModel, AutoTokenizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np




#import the Bert pretrained model from the transformers library
model_bert = AutoModel.from_pretrained("bert-base-uncased")
tokenizer_bert = AutoTokenizer.from_pretrained("bert-base-uncased")

#initialization of the dictionary of abstracts. Substitute this with the abstracts of the 10 papers considered as sources for RAG
#(we could use functions to read the PDFs to "cut" the abstracts from the papers. For simplicity reasons, we will copy and paste them)
abstracts_dict = {
    0: """Large Language Models (LLMs) are undergoing a period of rapid updates and changes, with state-of-the-art
(SOTA) model frequently being replaced. When applying LLMs to a specific scientific
field, it’s challenging to acquire unique domain knowledge while keeping the model itself advanced.
To address this challenge, a sophisticated large language model system named as Xiwu has been
developed, allowing you switch between the most advanced foundation models and quickly teach the
model domain knowledge. In this work, we will report on the best practices for applying LLMs in the
field of high-energy physics (HEP), including: a seed fission technology is proposed and some data
collection and cleaning tools are developed to quickly obtain domain AI-Ready dataset; a just-in-time
learning system is implemented based on the vector store technology; an on-the-fly fine-tuning system
has been developed to facilitate rapid training under a specified foundation model.
The results show that Xiwu can smoothly switch between foundation models such as LLaMA, Vicuna,
ChatGLM and Grok-1. The trained Xiwu model is significantly outperformed the benchmark model
on the HEP knowledge Q&A and code generation. This strategy significantly enhances the potential
for growth of our model’s performance, with the hope of surpassing GPT-4 as it evolves with the
development of open-source models. This work provides a customized LLM for the field of HEP,
while also offering references for applying LLM to other fields, the corresponding codes are available
on Github https://github.comzhang/zhengde0225/Xiwu""",
    1: """With the ubiquitous use of modern large language
models (LLMs) across industries, the inference serving for these
models is ever expanding. Given the high compute and memory
requirements of modern LLMs, more and more top-of-the-line
GPUs are being deployed to serve these models. Energy
availability has come to the forefront as the biggest challenge for
data center expansion to serve these models. In this paper, we
present the trade-offs brought up by making energy efficiency
the primary goal of LLM serving under performance SLOs.
We show that depending on the inputs, the model, and the
service-level agreements, there are several knobs available to
the LLM inference provider to use for being energy efficient.
We characterize the impact of these knobs on the latency,
throughput, as well as the energy. By exploring these tradeoffs,
we offer valuable insights into optimizing energy usage
without compromising on performance, thereby paving the way
for sustainable and cost-effective LLM deployment in data center
environments. """,
    2: """The rapid adoption of large language models (LLMs) has led to
significant advances in natural language processing and text generation.
However, the energy consumed through LLM model inference
remains a major challenge for sustainable AI deployment. To
address this problem, we model the workload-dependent energy
consumption and runtime of LLM inference tasks on heterogeneous
GPU-CPU systems. By conducting an extensive characterization
study of several state-of-the-art LLMs and analyzing their energy
and runtime behavior across different magnitudes of input prompts
and output text, we develop accurate (R² > 0.96) energy and runtime
models for each LLM. We employ these models to explore
an offline, energy-optimal LLM workload scheduling framework.
Through a case study, we demonstrate the advantages of energy
and accuracy aware scheduling compared to existing best practices.""",
    3: """The growing demand for efficient and scalable AI solutions has driven research into optimizing the performance and energy
efficiency of computational infrastructures. The novel concept of redesigning inference clusters and modifying the GPT-Neo
model offers a significant advancement in addressing the computational and environmental challenges associated with AI
deployment. By developing a novel cluster architecture and implementing strategic architectural and algorithmic changes,
the research achieved substantial improvements in throughput, latency, and energy consumption. The integration of advanced
interconnect technologies, high-bandwidth memory modules, and energy-efficient power management techniques, alongside
software optimizations, enabled the redesigned clusters to outperform baseline models significantly. Empirical evaluations
demonstrated superior scalability, robustness, and environmental sustainability, emphasizing the potential for more sustainable
AI technologies. The findings underscore the importance of balancing performance with energy efficiency and provide a robust
framework for future research and development in AI optimization. The research contributes valuable insights into the design
and deployment of more efficient and environmentally responsible AI systems.""",
    4: """Establishing building energy models (BEMs) for building design and analysis poses significant challenges due to
demanding modeling efforts, expertise to use simulation software, and building science knowledge in practice.
These make building modeling labor-intensive, hindering its widespread adoptions in building development.
Therefore, to overcome these challenges in building modeling with enhanced automation in modeling practice,
this paper proposes Eplus-LLM (EnergyPlus-Large Language Model) as the auto-building modeling platform,
building on a fine-tuned large language model (LLM) to directly translate natural language description of
buildings to established building models of various geometries, occupancy scenarios, and equipment loads.
Through fine-tuning, the LLM (i.e., T5) is customized to digest natural language and simulation demands from
users and convert human descriptions into EnergyPlus modeling files. Then, the Eplus-LLM platform realizes the
automated building modeling through invoking the API of simulation software (i.e., the EnergyPlus engine) to
simulate the auto-generated model files and output simulation results of interest. The validation process,
involving four different types of prompts, demonstrates that Eplus-LLM reduces over 95% modeling efforts and
achieves 100% accuracy in establishing BEMs while being robust to interference in usage, including but not
limited to different tones, misspells, omissions, and redundancies. Overall, this research serves as the pioneering
effort to customize LLM for auto-modeling purpose (directly build-up building models from natural language),
aiming to provide a user-friendly human-AI interface that significantly reduces building modeling efforts. This
work also further facilitates large-scale building model efforts, e.g., urban building energy modeling (UBEM), in
modeling practice.""",
    5: """The rapid evolution and widespread adoption of
generative large language models (LLMs) have made them a
pivotal workload in various applications. Today, LLM inference
clusters receive a large number of queries with strict Service
Level Objectives (SLOs). To achieve the desired performance,
these models execute on power-hungry GPUs causing the inference
clusters to consume large amount of energy and, consequently,
result in excessive carbon emissions. Fortunately, we find
that there is a great opportunity to exploit the heterogeneity in
inference compute properties and fluctuations in inference workloads,
to significantly improve energy-efficiency. However, such
a diverse and dynamic environment creates a large search-space
where different system configurations (e.g., number of instances,
model parallelism, and GPU frequency) translate into different
energy-performance trade-offs. To address these challenges, we
propose DynamoLLM, the first energy-management framework
for LLM inference environments. DynamoLLM automatically
and dynamically reconfigures the inference cluster to optimize for
energy and cost of LLM serving under the service’s performance
SLOs. We show that at a service-level, DynamoLLM conserves
53% energy and 38% operational carbon emissions, and reduces
61% cost to the customer, while meeting the latency SLOs.""",
    6: """Large language model (LLM) has recently been
considered a promising technique for many fields. This work
explores LLM-based wireless network optimization via in-context
learning. To showcase the potential of LLM technologies, we
consider the base station (BS) power control as a case study,
a fundamental but crucial technique that is widely investigated
in wireless networks. Different from existing machine learning
(ML) methods, our proposed in-context learning algorithm relies
on LLM’s inference capabilities. It avoids the complexity of
tedious model training and hyper-parameter fine-tuning, which is
a well-known bottleneck of many ML algorithms. Specifically, the
proposed algorithm first describes the target task via formatted
natural language, and then designs the in-context learning
framework and demonstration examples. After that, it considers
two cases, namely discrete-state and continuous-state problems,
and proposes state-based and ranking-based methods to select
appropriate examples for these two cases, respectively. Finally, the
simulations demonstrate that the proposed algorithm can achieve
comparable performance as conventional deep reinforcement
learning (DRL) techniques without dedicated model training or
fine-tuning. Such an efficient and low-complexity approach has
great potential for future wireless network optimization.
Index Terms—Large language model, in-context learning, network
optimization, transmission power control.""",
    7: """Both the training and use of Large Language Models (LLMs) require
large amounts of energy. Their increasing popularity, therefore,
raises critical concerns regarding the energy efficiency and sustainability
of data centers that host them. This paper addresses the
challenge of reducing energy consumption in data centers running
LLMs. We propose a hybrid data center model that uses a cost-based
scheduling framework to dynamically allocate LLM tasks across
hardware accelerators that differ in their energy efficiencies and
computational capabilities. Specifically, our workload-aware strategy
determines whether tasks are processed on energy-efficient
processors or high-performance GPUs based on the number of input
and output tokens in a query. Our analysis of a representative
LLM dataset, finds that this hybrid strategy can reduce CPU+GPU
energy consumption by 7.5% compared to a workload-unaware
baseline.""",
    8: """Reproducible science requires easy access to data, especially with
the rise of data-driven and increasingly complex models used within
energy research. Too often however, the data to reconstruct and
verify purported solutions in publications is hidden due to some
combination of commercial, legal, and sensitivity issues. This early
work presents our initial efforts to leverage the recent advancements
in Large Language Models (LLMs) to create usable and shareable
energy datasets. In particular, we’re utilising their mimicry of
human behaviors, with the goal of extracting and exploring synthetic
energy data through the simulation of LLM agents capable of
interacting with and executing actions in controlled environments.
We also analyse and visualise publicly available data in an attempt
to create realistic but not quite exact copies of the originals. Our
early results show some promise, with outputs that resemble the
twin peak curves for household energy consumption. The hope is
that our generalised approach can be used to easily replicate usable
and realistic copies of otherwise secret or sensitive data.""",
    9: """This paper introduces a method for personalizing
energy optimization using large language models (LLMs)
combined with an optimization solver. This approach, termed
human-guided optimization autoformalism, translates natural
language specifications into optimization problems, enabling
LLMs to handle various user-specific energy-related tasks. It
allows for nuanced understanding and nonlinear reasoning
tailored to individual preferences. The research covers common
energy sector tasks like electric vehicle charging, HVAC control,
and long-term planning for renewable energy installations. This
novel strategy represents a significant advancement in context-based
optimization using LLMs, facilitating sustainable energy
practices customized to individual needs."""
}

# The abstracts are the retrieval corpus for RAG: each one is embedded with BERT.

# The tokenized inputs are passed to the BERT model for processing.
# (Remember: padding=True pads all inputs to the same length, which allows batch processing.)
# The model outputs a tensor (last_hidden_state) in which each input token is represented by a high-dimensional vector.
# last_hidden_state has shape (batch_size, sequence_length, hidden_size), where:
#   batch_size: number of input texts.
#   sequence_length: length of each tokenized text (after padding).
#   hidden_size: dimensionality of the vector representation of each token (768 for bert-base-uncased).

# last_hidden_state[:, 0] selects the representation of the [CLS] token for each input text. [CLS] is a special token added at the start of each input and is often used as an aggregate representation of the entire sequence.


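import torch

# truncation=True keeps every abstract within BERT's maximum input length
inputs = tokenizer_bert(list(abstracts_dict.values()), padding=True, truncation=True, return_tensors='pt')

# Inference only: no_grad skips building the autograd graph and saves memory
with torch.no_grad():
    output = model_bert(**inputs)

abstract_vectors = output.last_hidden_state[:, 0]

# abstract_vectors is a tensor of shape (batch_size, hidden_size), i.e. (number of abstracts, 768), representing each text as a single 768-dimensional vector.

#print(abstract_vectors.shape)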
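
The `[:, 0]` indexing can be checked without downloading BERT. The sketch below uses a random NumPy array as a stand-in for `last_hidden_state`; the sizes are illustrative assumptions, not values from this notebook, and the same slicing works on a torch tensor.

```python
import numpy as np

# Stand-in for output.last_hidden_state: 4 texts, 12 tokens each, hidden size 768.
# These sizes are assumptions for illustration; a real BERT forward pass would produce this tensor.
batch_size, sequence_length, hidden_size = 4, 12, 768
rng = np.random.default_rng(0)
fake_last_hidden_state = rng.standard_normal((batch_size, sequence_length, hidden_size))

# Position 0 along the sequence axis is the [CLS] token of every text
cls_vectors = fake_last_hidden_state[:, 0]

print(fake_last_hidden_state.shape)  # (4, 12, 768)
print(cls_vectors.shape)             # (4, 768)
```

Each row of `cls_vectors` is the vector of the first token of the corresponding text, which is what the retrieval step compares against the query.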


### Search

With our text data vectorized and indexed, we can now perform searches. We will define a function that searches the index for the documents most relevant to a query.

To perform the search, we need a function that computes the cosine similarity between the query vector and every abstract vector and returns the top-k indices. Given those indices, a second function collects the full text of the corresponding documents from the paper dictionary.

To compute the cosine similarity, use scikit-learn's `cosine_similarity`, which expects 2D arrays:

`cs = cosine_similarity(vector_a.detach().numpy(), vector_b.detach().numpy())`
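
To make the computation concrete, here is a minimal sketch in plain Python of what cosine similarity measures, dot(a, b) / (|a| * |b|), followed by a top-k ranking. The helper names `cosine` and `top_k` and the toy vectors are made up for illustration; in the notebook itself, scikit-learn's `cosine_similarity` does this work on the BERT vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k):
    """Indices of the k vectors most similar to the query, best first."""
    scores = [cosine(query, v) for v in vectors]
    ranked = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Toy 2-D "embeddings" (made-up values, chosen so the ranking is easy to verify by hand)
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.1]

print(top_k(query, vectors, 2))  # [0, 2]
```

The query points almost exactly along the first vector, so index 0 ranks first; the diagonal vector at index 2 comes second, and the nearly orthogonal vector at index 1 is dropped.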



In [4]:

from sklearn.metrics.pairwise import cosine_similarity

def get_top_k_similar_indices(query_vector, abstract_vectors, k):
    """
    Computes the indices of the k abstracts most similar to the query, ranked by cosine similarity.

    Parameters:
    - query_vector: A tensor of shape (1, hidden_size) representing the query vector.
    - abstract_vectors: A tensor of shape (batch_size, hidden_size) representing the abstract vectors.
    - k: The number of top indices to return.

    Returns:
    - A list of k indices, ordered from most to least similar.
    """
    similarities = {}
    for index, item in enumerate(abstract_vectors):
        # Reshape both query_vector and item to 2D arrays for sklearn's cosine_similarity
        cs = cosine_similarity(query_vector.detach().cpu().numpy().reshape(1, -1), item.detach().cpu().numpy().reshape(1, -1))[0][0]
        similarities[index] = cs

    # Sort the (index, similarity) pairs by similarity in descending order
    sorted_items = sorted(similarities.items(), key=lambda x: x[1], reverse=True)

    # Keep only the indices from the sorted pairs
    key_sorted = [item[0] for item in sorted_items]

    return key_sorted[:k]


def retrieve_documents(indices, documents_dict):
    """
    Retrieves the documents corresponding to the given indices and concatenates them into a single string.

    Parameters:
    - indices: A list or array of the top-k indices of the most similar documents.
    - documents_dict: A dictionary whose keys are file names of the form 'paper<i>.pdf' and whose values are the document texts (strings).

    Returns:
    - A string containing the texts of the retrieved documents, separated by spaces.
    """
    print(indices)  # show which documents were selected
    documents = [documents_dict['paper' + str(i) + '.pdf'] for i in indices]

    return " ".join(documents)

### A function to perform Retrieval Augmented Generation

In this step, we’ll combine the context retrieved from our documents with LLAMA to generate responses. The context will provide the necessary information to the model to produce more accurate and relevant answers.

In [5]:


#now we put it all together

def generate_augmented_response(query, model, tokenizer):
#TODO: define system prompt

    system = """
You are an AI assistant specialized in analyzing academic research papers, particularly in telecommunications, machine learning, and network optimization.
Your task is to answer user questions based exclusively on the content of the provided documents.

**Guidelines:**
- Provide clear, concise, and well-structured answers
- Support your statements with references to specific sections, equations, or results from the document
- Do not invent information not present in the text
- If a question is beyond the document's scope, clarify that you lack the relevant information
- Use technical yet accessible language

**Preferred Response Format:**
- Brief introduction
- Bullet points for methods or results
- References to tables/figures when available
- Concise conclusion summarizing key points

Context:
"""
    # Embed the query with BERT ([CLS] vector), in the same way as the abstracts
    query_tokenized = tokenizer_bert(query, truncation=True, return_tensors='pt')
    with torch.no_grad():
        query_embeddings = model_bert(**query_tokenized).last_hidden_state[:, 0]

    #TODO: concatenate here all the search results
    context = retrieve_documents(get_top_k_similar_indices(query_embeddings, abstract_vectors, 1), pdf_texts)

    #TODO: create the prompt for LLAMA (system + context + query)
    prompt = system + "\n" + context + "\n" + "Query:\n" + query + "\n" + "Answer:"

    # Send the inputs to whichever device the model was loaded on instead of hard-coding 'cuda'
    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)

    #perform a query with LLAMA in the usual way
    output = model.generate(**inputs, max_new_tokens=500)

    # generate returns a batch of sequences; decode the first (and only) one
    response = tokenizer.decode(output[0], skip_special_tokens=True)

    #return the response
    return response



#TODO: now compare the results with a prompt without RAG. What are the results?
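
One way to set up this comparison is to build two prompts from the same query, one with the retrieved context and one without, and pass each to the model. The sketch below covers only the prompt assembly; `build_prompt` is a hypothetical helper and the strings are placeholders, not the notebook's actual system prompt or retrieved text.

```python
def build_prompt(system, query, context=None):
    """Assemble a prompt; the retrieved context is included only when one is given."""
    parts = [system.strip()]
    if context:
        parts.append("Context:\n" + context.strip())
    parts.append("Query:\n" + query.strip())
    parts.append("Answer:")
    return "\n\n".join(parts)

# Placeholder strings for illustration only
system = "Answer the question using the provided documents when available."
query = "How is in-context learning used for power control?"
context = "(retrieved paper text would go here)"

rag_prompt = build_prompt(system, query, context)  # with retrieval
baseline_prompt = build_prompt(system, query)      # without retrieval

print(rag_prompt)
print("-----")
print(baseline_prompt)
```

Keeping everything except the context identical makes it easier to attribute any difference between the two answers to the retrieved text rather than to the wording of the prompt.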


In [6]:
from huggingface_hub import login
login()


In [8]:
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM


model_id = "meta-llama/Llama-3.2-1B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

queries = "In this study, how is in-context learning applied to optimize base station transmission power control in wireless networks? Describe the proposed method, including task description design, example selection strategies for both discrete and continuous state problems, and summarize the main performance results compared to traditional deep reinforcement learning approaches."




`torch_dtype` is deprecated! Use `dtype` instead!



In [9]:
output = generate_augmented_response(queries, model, tokenizer)
print(output)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


[9]


OutOfMemoryError: CUDA out of memory. Tried to allocate 10.36 GiB. GPU 0 has a total capacity of 14.74 GiB of which 1.11 GiB is free. Process 353841 has 13.63 GiB memory in use. Of the allocated memory 13.44 GiB is allocated by PyTorch, and 68.43 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
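
The error above is consistent with a very long prompt: the full text of the retrieved paper is pasted into the context, and memory use during generation grows with prompt length. One possible mitigation, sketched here under that assumption, is to cap the context at a fixed number of tokens before building the prompt. `truncate_context` is a hypothetical helper, the limit of 2,000 tokens is an arbitrary choice, and `WhitespaceTokenizer` is a toy stand-in so that the example runs without downloading a model; a Hugging Face tokenizer offers `encode` and `decode` methods that can be called in the same way.

```python
def truncate_context(context, tokenizer, max_context_tokens):
    """Keep at most max_context_tokens tokens of the context, counted with the given tokenizer."""
    token_ids = tokenizer.encode(context, add_special_tokens=False)
    if len(token_ids) <= max_context_tokens:
        return context
    return tokenizer.decode(token_ids[:max_context_tokens])


class WhitespaceTokenizer:
    """Toy stand-in for a real tokenizer: one token per whitespace-separated word."""

    def __init__(self):
        self.vocab = []  # token id -> word

    def encode(self, text, add_special_tokens=False):
        ids = []
        for word in text.split():
            ids.append(len(self.vocab))
            self.vocab.append(word)
        return ids

    def decode(self, token_ids, skip_special_tokens=True):
        return " ".join(self.vocab[i] for i in token_ids)


long_context = " ".join(f"word{i}" for i in range(10_000))
short_context = truncate_context(long_context, WhitespaceTokenizer(), max_context_tokens=2_000)

print(len(long_context.split()))   # 10000
print(len(short_context.split()))  # 2000
```

Truncating the context rather than the whole prompt keeps the query and the final `Answer:` marker intact, which a blanket truncation of the tokenized prompt would cut off first.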