<a href="https://colab.research.google.com/github/emiliawisnios/Masterclass_Text_Analysis/blob/main/Masterclass.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Large Language Models (LLMs) and Natural Language Processing (NLP) Tasks

Welcome to this interactive notebook on LLMs and their applications in NLP tasks! This notebook is designed to complement the workshop on "Introduction to large language models (LLMs) implementations into NLP tasks" and provide hands-on experience with LLMs.

## Objective
The primary goal of this notebook is to bridge the gap between theoretical understanding and practical implementation of LLMs in text analysis tasks. By the end of this session, you will have gained practical experience in:

1. Understanding the basics of LLMs
2. Implementing various prompting techniques
3. Applying LLMs to simple NLP tasks
4. Evaluating LLM performance on different tasks

## Structure
This notebook is divided into several sections, each focusing on a different aspect of LLMs and their implementation:

- **Environment Setup**: We'll begin by setting up our working environment with the necessary libraries and tools.
- **Dataset Preparation**: We'll load and prepare a dataset for our NLP tasks.
- **Data Annotation**: You'll get hands-on experience with data annotation, a crucial step in many NLP tasks.
- **Model Selection**: We'll explore how to choose and load an appropriate LLM for our tasks.
- **Prompting Techniques**: We'll dive into various prompting methods, including:
  - Zero-shot prompting
  - One-shot prompting
  - Few-shot learning
  - Chain-of-thought prompting

- **Parameter Selection**: We'll experiment with different model parameters and their effects on outputs.
- **Reporting and Evaluation**: Finally, we'll look at how to report and evaluate the results of our LLM implementations.

By working through this notebook, you'll gain practical insights into how LLMs can be leveraged for various NLP tasks, complementing the theoretical knowledge from the workshop. Let's get started on this exciting journey into the world of Large Language Models!

# Environment setup

In [None]:
!pip install transformers -q

In [None]:
import pandas as pd

# Dataset preparation

In [None]:
!wget https://raw.githubusercontent.com/emiliawisnios/Masterclass_Text_Analysis/refs/heads/main/data/UN_intervention.csv

In [None]:
df = pd.read_csv('/content/UN_intervention.csv')

In [None]:
df.head()

In [None]:
df_ = df.sample(30)

# Data annotation

Your task is to annotate the speaker's sentiment towards the word *intervention* in the context of each sentence. Run the cells below and, for each sentence, type:
- `NEG` for negative sentiment towards the word *intervention*
- `NET` for neutral sentiment towards the word *intervention*
- `POS` for positive sentiment towards the word *intervention*
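
Typos are easy to make during manual annotation, and a mistyped label would later be counted as a model error. Below is a minimal sketch of a label check; the helper name `normalise_label` is hypothetical and not part of the original notebook.

```python
VALID_LABELS = {"NEG", "NET", "POS"}

def normalise_label(raw):
    """Return the cleaned label, or None if it is not one of NEG/NET/POS."""
    label = raw.strip().upper()
    return label if label in VALID_LABELS else None

print(normalise_label(" pos "))    # POS
print(normalise_label("neutral"))  # None
```

In the annotation loop, one option would be to repeat `input()` until `normalise_label` returns something other than `None`.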

In [None]:
from IPython.display import HTML

In [None]:
sentiment = []
for index, row in df_.iterrows():
  display(HTML(f"<div>{row['sentence']}</div>"))
  sentiment.append(input())

df_['sentiment'] = sentiment

In [None]:
df_.head()

**Save and DOWNLOAD the result (in case of session disruption)**

In [None]:
df_.to_csv('sentiment_annotated.csv', index=False)

# Getting Hugging Face (HF) token

To get the HF token, log in to the [Hugging Face website](https://huggingface.co/). Go to Profile -> Settings -> Access Tokens -> Create new token.

Enter a token name and select the option *Read access to contents of all public gated repos you can access*. This option should be enough for today's class.

Paste the token in the cell below.

**REMEMBER NEVER TO SHARE YOUR SECRET TOKENS ANYWHERE, INCLUDING GITHUB.**

In [None]:
# @title HF token
HF_TOKEN='' #@param {type:"string"}

# Model selection

In today's class we'll be using an open-source model. We've chosen the smallest model in the Llama 3.2 family because of our limited computational resources.

Let's start by downloading the model:

In [None]:
import torch
from transformers import pipeline

model_id = "meta-llama/Llama-3.2-1B-Instruct"
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    token=HF_TOKEN
)

Now let's run some sample code to check if everything is working:

In [None]:
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]
outputs = pipe(
    messages,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)
print(outputs[0]["generated_text"][-1])

Great!

Now let's talk about three basic roles in LLM prompting:
- `system`: A system prompt is essentially an instruction or a set of instructions you provide to an LLM before you give it a specific task or question.
Think of it as setting the stage for the AI model. You're giving it a role, context, or guidelines to follow.

*System prompts are crucial for several reasons:*

**Guidance and Context**: They tell the LLM how to approach the task. For example, you might instruct it to be concise, detailed, formal, informal, or adopt a specific persona.

**Consistency of Output**: By providing clear instructions, you increase the chances of getting consistently relevant and appropriate responses.

- `user`: The user provides the prompt, which can be a question, a command, a statement, or even a piece of code. This input is what triggers the LLM to generate a response. The user's prompts steer the conversation or task. Follow-up questions, clarifications, and new instructions all come from the user, shaping the direction of the interaction.

- `assistant`: It's the role the LLM takes on during the interaction.
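
The three roles map directly onto the list of dictionaries that is passed to `pipe`. Below is a minimal sketch of a helper that assembles such a list. The function name `build_messages` and its `examples` argument are assumptions made for illustration; earlier `user`/`assistant` turns are optional and are simply replayed before the new user input.

```python
def build_messages(system_prompt, user_text, examples=()):
    """Assemble a chat in the role/content format used by the pipeline.

    `examples` is an optional sequence of (user_text, assistant_reply) pairs
    that are replayed as earlier turns of the conversation.
    """
    messages = [{"role": "system", "content": system_prompt}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": user_text})
    return messages

chat = build_messages(
    "You are a pirate chatbot who always responds in pirate speak!",
    "Who are you?",
)
for message in chat:
    print(message["role"], "->", message["content"])
```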

### Example with OpenAI API

Another option is to use a paid API from a provider such as OpenAI, Anthropic or Google. Below is an example that does the same thing as the code above, but with the OpenAI API.


```
# Requires the openai package, version 1.0 or later
from openai import OpenAI

# Create a client with your OpenAI API key (never commit a real key)
client = OpenAI(api_key='your-openai-api-key')

# Define messages for the conversation
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Call the OpenAI API to generate a response
response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # Or "gpt-4" if you want to use GPT-4
    messages=messages,
    max_tokens=256,
    temperature=0.7,
    top_p=0.95,
    frequency_penalty=1.15  # rough counterpart of repetition_penalty, on a different scale
)

# Print the generated response (from the assistant)
print(response.choices[0].message.content)

```

## Zero-shot prompting

Now your task will be to find the best possible prompt to assign sentiment to the word `intervention`.

You will focus on the system prompt for now. Try to state the task clearly. Do not give additional examples; we will try that later.

In [None]:
prompt1 = 'Your task is to classify the sentiment of the speaker towards the word intervention. Respond only with NEG (for negative sentiment), NET (for neutral sentiment), POS (for positive sentiment).'
prompt2 = ''
prompt3 = ''
prompt4 = ''
prompt5 = ''
prompts = [prompt1, prompt2, prompt3, prompt4, prompt5]

In [None]:
df_zero_shot = df_.copy()

# Try each candidate system prompt in turn; i numbers the prompts starting from 1
for i, prompt in enumerate(prompts, start=1):
  answers = []
  for index, row in df_zero_shot.iterrows():

    messages = [
        {"role": "system", "content": prompt},
        {"role": "user", "content": row['sentence']},
    ]
    outputs = pipe(
        messages,
        max_new_tokens=256,
        temperature=0.1,
    )
    # Strip whitespace so that a reply such as 'NEG\n' still matches the label 'NEG'
    answers.append(outputs[0]["generated_text"][-1]['content'].strip())

  df_zero_shot[f'prompt{i}'] = answers
  print(f'prompt{i} done')


prompt_to_acc = {}
for i, prompt in enumerate(prompts):
  prompt_to_acc[f'prompt{i+1}'] = (df_zero_shot['sentiment'] == df_zero_shot[f'prompt{i+1}']).mean()

for k, v in prompt_to_acc.items():
  print(f'{k} - {v}')

Now display the dataframe and check the answers for each prompt.

Where are the mistakes? Was your instruction clear enough? Or did you forget to specify some cases?

What do you think could help the model understand the task better?
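
A small model often wraps the label in extra words, which an exact string comparison counts as a mistake. The sketch below shows one possible way to pull the label out of a reply and list the remaining disagreements. The helper names and the sample data are invented for illustration. Note that upper-casing the reply means ordinary words such as "net" would also be matched.

```python
import re

LABEL_PATTERN = re.compile(r"\b(NEG|NET|POS)\b")

def extract_label(reply):
    """Return the first NEG/NET/POS found in a model reply, or None."""
    match = LABEL_PATTERN.search(reply.upper())
    return match.group(1) if match else None

def list_errors(sentences, gold, replies):
    """Collect (sentence, gold label, predicted label) for every mismatch."""
    errors = []
    for sentence, gold_label, reply in zip(sentences, gold, replies):
        predicted = extract_label(reply)
        if predicted != gold_label:
            errors.append((sentence, gold_label, predicted))
    return errors

# Made-up sentences and replies, for illustration only
sentences = ["Sentence A", "Sentence B", "Sentence C"]
gold = ["NEG", "POS", "NET"]
replies = ["NEG", "The sentiment is POS.", "I cannot decide."]
print(list_errors(sentences, gold, replies))
```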

In [None]:
#########################
## YOUR CODE GOES HERE ##
#########################

## One-shot prompting

A task is always easier when you have an example to follow, and the same is true for LLMs. In one-shot prompting we simulate one round of conversation with the model, as in the example below:

In [None]:
messages = [
    {"role": "system", "content": "Your task is to return a second letter of a surname given. Return only the letter."},
    {"role": "user", "content": "John Smith"},
    {"role": "assistant", "content": "m"},
    {"role": "user", "content": "Adam Nowak"}
]
outputs = pipe(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])

Great! Now let's try something similar with our task. Remember NOT to use examples from the evaluation dataset.
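
One way to make sure that prompt examples never come from the evaluation data is to set them aside before any prompting. Below is a minimal sketch under that assumption; the function name `split_demo_and_eval` and the placeholder sentences are hypothetical.

```python
import random

def split_demo_and_eval(items, n_demo, seed=0):
    """Set aside n_demo items as prompt examples; evaluate on the rest."""
    rng = random.Random(seed)
    demo_indices = set(rng.sample(range(len(items)), n_demo))
    demo = [item for i, item in enumerate(items) if i in demo_indices]
    evaluation = [item for i, item in enumerate(items) if i not in demo_indices]
    return demo, evaluation

sentences = [f"sentence {i}" for i in range(10)]  # placeholder data
demo, evaluation = split_demo_and_eval(sentences, n_demo=2)
print(len(demo), len(evaluation))   # 2 8
print(set(demo) & set(evaluation))  # set()
```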

In [None]:
prompt = ''

In [None]:
df_one_shot = df_.copy()

answers = []
for index, row in df_one_shot.iterrows():

  messages = [
      {"role": "system", "content": prompt},
      {"role": "user", "content": 'TODO'},
      {"role": "assistant", "content": 'TODO'},
      {"role": "user", "content": row['sentence']},

  ]
  outputs = pipe(
      messages,
      max_new_tokens=256,
      temperature=0.01,
  )
  # Strip whitespace so that a reply such as 'NEG\n' still matches the label 'NEG'
  answers.append(outputs[0]["generated_text"][-1]['content'].strip())

df_one_shot['prompt'] = answers


print((df_one_shot['sentiment'] == df_one_shot['prompt']).mean())

Again, check the mistakes. Where are the problems now?

In [None]:
#########################
## YOUR CODE GOES HERE ##
#########################

## Few-shot learning

Sometimes one example is not enough, especially if the task is complicated. We can simulate more rounds of conversation with the model. Note that the more diverse the examples are, the better the model can grasp the essence of the task. Let's first try with our dummy task.

In [None]:
messages = [
    {"role": "system", "content": "Your task is to return a second letter of a surname given. Return only the letter."},
    {"role": "user", "content": "John Smith"},
    {"role": "assistant", "content": "m"},
    {"role": "user", "content": "Adam Nowak"},
    {"role": "assistant", "content": "o"},
    {"role": "user", "content": "Maria Krzywa"},
    {"role": "assistant", "content": "r"},
    {"role": "user", "content": "Jan Pieprz"},
    {"role": "assistant", "content": "i"},
    {"role": "user", "content": "Izabela Lecka"}
]
outputs = pipe(
    messages,
    max_new_tokens=256,
    temperature=1
)
print(outputs[0]["generated_text"][-1])

If the model returns the correct letter, it has understood the dummy task. Let's see whether it also copes with our main task.

In [None]:
prompt = ''

In [None]:
df_few_shot = df_.copy()

answers = []
for index, row in df_few_shot.iterrows():

  messages = [
      {"role": "system", "content": prompt},
      {"role": "user", "content": 'TODO'},
      {"role": "assistant", "content": 'TODO'},
      {"role": "user", "content": 'TODO'},
      {"role": "assistant", "content": 'TODO'},
      {"role": "user", "content": 'TODO'},
      {"role": "assistant", "content": 'TODO'},
      {"role": "user", "content": 'TODO'},
      {"role": "assistant", "content": 'TODO'},
      {"role": "user", "content": row['sentence']},

  ]
  outputs = pipe(
      messages,
      max_new_tokens=256,
      temperature=0.01,
  )
  # Strip whitespace so that a reply such as 'NEG\n' still matches the label 'NEG'
  answers.append(outputs[0]["generated_text"][-1]['content'].strip())

df_few_shot['prompt'] = answers


print((df_few_shot['sentiment'] == df_few_shot['prompt']).mean())

What were the mistakes now? Maybe you need more examples for complicated cases?

In [None]:
#########################
## YOUR CODE GOES HERE ##
#########################

## Chain-of-thought prompting

For more analytical tasks, we can also try a method called chain-of-thought prompting. The implementation is easy: we just need to add '*Let's think step by step.*' to our prompt. Be careful though: it has been reported to be most effective in areas such as mathematics.

In [None]:
messages = [
    {"role": "system", "content": "Your task is to calculate the sum of two numbers."},
    {"role": "user", "content": "What is 11 + 7?"},
]
outputs = pipe(
    messages,
    max_new_tokens=256,
    temperature=0.01
)
print(outputs[0]["generated_text"][-1]['content'])

In [None]:
messages = [
    {"role": "system", "content": "Your task is to calculate the sum of two numbers."},
    {"role": "user", "content": "What is 11 + 7? Let's think step by step."},
]
outputs = pipe(
    messages,
    max_new_tokens=256,
    temperature=0.01
)
print(outputs[0]["generated_text"][-1]['content'])

Now check on one example if this type of reasoning would help in our case.
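
With chain-of-thought prompting the reply contains reasoning as well as the label, so it can no longer be compared directly with the annotation. One possible workaround, sketched below, is to ask for a final line of the form `Answer: <label>` and parse only that line. The suffix wording, the helper names, and the sample reply are assumptions made for illustration, not model output.

```python
import re

COT_SUFFIX = (
    " Let's think step by step, then finish with a line of the form"
    " 'Answer: <label>'."
)

def add_cot(user_text):
    """Append the chain-of-thought instruction to a user message."""
    return user_text + COT_SUFFIX

def final_answer(reply):
    """Return the label from the last 'Answer: ...' occurrence, or None."""
    matches = re.findall(r"Answer:\s*(NEG|NET|POS)\b", reply)
    return matches[-1] if matches else None

# Hand-written reply, for illustration only
sample_reply = (
    "The speaker calls the intervention unlawful.\n"
    "That is a criticism.\n"
    "Answer: NEG"
)
print(final_answer(sample_reply))  # NEG
```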

In [None]:
messages = [
    {"role": "system", "content": "TODO"},
    {"role": "user", "content": "TODO"},
]
outputs = pipe(
    messages,
    max_new_tokens=256,
    temperature=0.01
)
print(outputs[0]["generated_text"][-1]['content'])

# Parameter selection

Your task is to ask the model to summarize the text below while experimenting with the generation parameters.

In [None]:
text = """It is with gratitude and pride that I come here today to express the voice of the Italian people, a generous and responsible people, who every day have shown their commitment to saving the lives of hundreds of their brothers and sisters in the heart of the Mediterranean region. This Hall calls for profound thinking rather than slogans. In every part of the world, political life is more and more fixated on the present. It is tied to discussions shaped by 24/7 news stations, the Internet and social media. Let me be clear. I belong to a generation for whom social media represent an extraordinary tool, a horizon of freedom that allows us to change lives and prospects. There is nevertheless the risk of reducing that horizon to a discussion of the next opinion poll or tweet. I think we should reject what has become a dictatorship of the moment, and take the time to pay homage to this Hall for its efforts in engaging in more meaningful reflection. I am thinking of my country, which, on a map, appears to be shaped like a bridge, a bridge connecting North and South, Europe and Africa and East and West, a bridge that spans from the Middle East to the Balkans. Because of its geography, and especially its culture, Italy has always been a kind of extraordinary cultural laboratory affected by influences of every kind. That is the reason that we were the first country in Europe to grasp the momentous dimension of what is happening in the Mediterranean region. From the very beginning — even in this Hall last year — we have said that the refugee question is not a question of numbers (see A/69/PV.9). The problem of migration is not one of organization or statistics. The problem is fear, the fear that runs through our societies and that we must take seriously if we wish to defeat it. In Greek mythology, Phobos was the god of fear, able to paralyse the best armies and cause the most easily fought battles to be lost. 
That is why the glorious and ancient city of Sparta built a great temple to Phobos and did everything to gain his favour. Europe was born to defeat fear and replace it with the ideal of courage, peace, cooperation and civilization. And for a long time, Europe embodied that ideal. Over the past 70 years, our continent has left behind centuries of war and civil war. Europe had become a true miracle. For those like me, who as a young man witnessed the fall of the Berlin Wall and found in that event a reason to devote my life to public service, to see new walls going up today is intolerable. Europe was reborn to tear down walls, not to build them. That is why Italy is on the front line in rescuing thousands of migrants fleeing from war and despair. For that same reason, I had the privilege of accompanying Secretary-General Ban Ki-moon on one of our ships currently participating in the rescue operations. Addressing migratory flows requires the capacity to respond to that emergency with a global and comprehensive strategy. In that vein, Italy has partnered with African countries through a broad array of initiatives, and in particular with the African Union, a cooperation about which I an the opportunity to speak recently at the third International Conference on Financing for Development, held in Addis Ababa, which produced the Addis Ababa Action Agenda. In the 70 years since the birth of the Charter of the United Nations in San Francisco, the Organization has learned how crucial its role is. It has had the wisdom to recognize its mistakes and the strength to correct them by writing a new chapter that will ensure a better future for all of our children. I think that it will take an effort on the part of everyone, and Italy will not shirk from its responsibilities. That is why we decided to present our candidature to the Security Council for a non-permanent seat for 2017-2018, with the ideal in mind of building the peace 48/51 15-29431
 29/09/2015 A/70/PV.16 of tomorrow. We believe that it is the job of each one of us here today to create an alternative to the culture of violence and nihilism exemplified by the recent crises in the Mediterranean, the Middle East, Europe and at the borders of Europe. I am thinking, for example, of the consolidation of the ceasefire in Ukraine. I am thinking of the great joy with which we welcomed the news of the agreement between the United States and Cuba, one of extraordinary historic proportions. I am thinking of the hope that each of us now has as a consequence of the agreement with Iran on that country’s nuclear programme, which begins a new hopeful phase. While we are committed to the implementation of that agreement, we also firmly reiterate the right of the people and the State of Israel to exist. Only through dialogue and negotiation will we be able to find a future for coming generations. Moreover, on the delicate question of Israel and Palestine, there is no alternative to dialogue. It is essential to return to the negotiating table, with the goal of reaching a solution based on two States living side-by-side in peace and security. This open debate of the General Assembly has been characterized by many discussions on Syria. All of us have acknowledged and felt, on a very personal level, the failure that years of inertia has produced. We believe that the only way out of that quagmire is through a political solution that leads to a process of genuine transition. That will work only if we have the courage to stare reality in the face and acknowledge the presence of an enemy of unprecedented danger at our doors, namely, Daesh, the embodiment of extremism and terrorism. Through its Carabinieri corps, which plays an important role in Italy and the world, Italy is proud to lead the coalition for training the Iraqi police force. 
We know that the work of the security forces is decisive in ensuring daily security, enabling a family to return home without incident and enabling a mother to reassure her children. We will continue working with the global coalition to counter ISIL, in particular the United States and Saudi Arabia, and will maintain our leadership role in the working group to counter financing for Daesh. At the same time, we underscore that Daesh is not limited to the specific region of the Middle East, even if there is an extraordinary mosaic of pluralism and beauty there. Daesh may reaffirm itself with strength in Africa, starting with Libya. From this rostrum, I renew my appeal to all the parties who hope for peace and a unified nation in Libya. We must unite our forces. Our Libyan brothers and sisters must know that they are not alone, that the General Assembly has not forgotten them. Italy is ready to collaborate with a national unity Government and to restore cooperation in key areas so as to give Libya back its future. If the new Libyan Government asks us, Italy is ready to take on the leadership role in a mechanism, authorized by the international community, to assist in the stabilization of the country. There are many reasons for our role in the fight against terrorism. It is a battle for values, a battle for culture. The terrorists want us to die. Failing that, they want us to live under their rules. That is why the battle that we are waging is a battle against darkness and fear, because fear is the playground of terrorism. The first area in which we see that is that of culture. When terrorists attack Palmyra or the Bardo Museum in Tunisia, or a school or a university, from Asia to Africa, they are not attacking the past, they are targeting our future. Italy is the country where the culture of the conservation of cultural assets was born. Proud of our roots and of our Renaissance, we have the highest concentration of UNESCO cultural heritage sites in the world. 
That is why, together with our partner countries and friends, we aspire to be the guardians of culture throughout the world, carrying out concrete actions, both here in New York and at UNESCO headquarters in Paris, through United4Heritage, the Blue Helmets of culture. On the basis of a model developed in our country, we are proposing the establishment of an international task force, with military and civilian members, for operations to protect and rebuild art historical sites. That is our identity. That task force will be available to UNESCO, and it could be deployed in the framework of United Nations peacekeeping operations. Let us not forget that even Europe runs the risk, in the absence of a major educational project that would show that the evil seed of terrorism is growing on European soil as well. Let us not forget that what has happened in recent months and weeks — from Charlie Hebdo in Paris to what took place in Belgium and in Denmark — involved women and men born in European countries, raised and educated in European countries and yet transformed into terrorists who sought to undermine human rights and the very raison d’être of our continent. I think, therefore, that it is important that we all succeed in this educational challenge together and that our peacekeeping model, 15-29431 49/51
 A/70/PV.16 29/09/2015 which President Obama noted yesterday and for which we thank him, can serve as an established model that can be in deployed in various countries, such as is happening now in Afghanistan. I wish to recall Italy’s commitment to honouring the women and men who have sacrificed their lives for our collective security, in particular in that country. We are proud of the work of our soldiers and civilians aimed at supporting the Afghan Government on the road to peace and prosperity. The Security Council is at the centre of the challenge. That is not a bureaucratic issue, but rather a political one. The Uniting for Consensus group is ready to continue to work with all members. Human rights, which are today under attack, are for us a reference point at every level. I am thinking about Security Council resolution 1325 (2000), on women and peace and security. I am thinking of resolution 69/186 adopted by the General Assembly last December, with its moratorium on the death penalty, an issue on which we will tirelessly work. I also recall the words that the Holy Father Pope Francis pronounced here (see A/70/PV.3) and at the United States Congress. The resolutions against forced and early marriage (resolution 68/148) and against female genital mutilation (resolution 67/146) are clear signs of the shared commitment of our world community. The deep connection between peace and security and between human rights and development is also the message of the current Universal Exposition in Milan. The slogan of Expo 2015, “Feeding the Planet: Energy for Life”, is a message that brings together many of the aspirations of the General Assembly, in particular that of promoting sustainable agriculture. I wish to make a commitment, especially to the African countries, that we will never stop working in that direction, bolstered by Italian know-how and the desire to work together. 
Guaranteeing access to food for all, fighting world hunger, changing consumption patterns, ensuring the centrality of women as central stakeholders in agriculture, defending smallholder farmers, as well as easing tensions and conflicts caused by the degradation of arable land and the scarcity of water for irrigation, are not secondary issues. The legacy of Expo Milan is assured by the Charter of Milan and by the commitments of each of us to fight climate change. Italy stands alongside Secretary- General Ban Ki-moon and is mobilizing the necessary resources to ensure that the conferences in Lima and Paris are successful. With the adoption of the 2030 Agenda for Sustainable Development (resolution 70/1), Italy has accepted the challenge of the five Ps — people, prosperity, partnership, planet and peace, which we recognize and which inspire our action for the future. But let me say that Italy intends to contribute with strength, in particular in those battles in which some countries seem to be alone. In the next few weeks in Milan, we will welcome our partners, the small island developing States, which are considered small States but are actually great States for their value, to the events on climate-change adaptation that will take place in mid-October at Expo Milan, and we will bring a large delegation to Venice, where we will show participants, in one of the most beautiful artistic cities in the world, how we are working to combat the risks associated with the presence of high waters and the lack of attention on the part of the international community. In conclusion, as a candidate for a non-permanent seat on the Security Council, Italy wants the values we have discussed to occupy a central place in the Security Council. But I do not want us to think of those values in an abstract way. I do not want us to forget that what brings us here is not a document. It is a face; it is many faces. 
In Italian schools, our children learn about the strong connection among the ancient civilizations of the Mediterranean, Africa and the Middle East. Today, those children are not just extras in a movie. They are the reason why we are here today. We believe that of all the values we teach our children at school we cannot forget that the first value is life. Faced with the migration crisis, many of us were deeply moved this summer by the photo of a little boy named Aylan. He was a child from Kobani, who, together with his older brother, fell asleep without ever being able to see the future. He was photographed, dead, on the beach at Bodrum. We must not limit our commitment to the emotion of the moment. We must bear that image in mind and commit to doing our best. Many children have died in the heart of the Mediterranean. They died on the ships launched in the direction of Europe by traffickers, the new slave traders of today. However, together with all of those children who are no longer with us, I want to recall the names of children whom no one talks about: Yambambi, 50/51 15-29431
 29/09/2015 A/70/PV.16 Salvatore, Idris Ibrahim and Francesca Marina. They are some of the children who were born on the ships of the Italian Marines and Coast Guard, which saved thousands of women, and in some cases enabled them to give birth on those ships. I wish to thank my fellow citizens for the extraordinary work that they have carried out. I want their names to be remembered with the names of those who did not make it. Their heroic actions should serve as an admonishment for all of us. Politics can be restored to dignity when we are aware of the enormity of our challenges. The old Europe, born in the name of courage, does not give in to fear. Italy will proudly do its part."""

## Temperature

In [None]:
messages = [
    {"role": "system", "content": "TODO"},
    {"role": "user", "content": text},
]
outputs = pipe(
    messages,
    max_new_tokens=256,
    temperature="TODO"
)
print(outputs[0]["generated_text"][-1]['content'])

How does the output change for different temperature values?
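
To build some intuition: temperature divides the model's raw token scores before they are turned into probabilities. The toy sketch below uses made-up scores for three candidate tokens and does not call the model. Low values concentrate almost all probability on the top token, while high values flatten the distribution.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [value / temperature for value in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - largest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for temperature in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```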

## Top-p

In [None]:
messages = [
    {"role": "system", "content": "TODO"},
    {"role": "user", "content": text},
]
outputs = pipe(
    messages,
    max_new_tokens=256,
    temperature=1,
    top_p="TODO"
)
print(outputs[0]["generated_text"][-1]['content'])

How does the output change for different top-p values?
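
Top-p (nucleus) sampling keeps only the most likely tokens whose combined probability reaches `top_p` and samples from that reduced set. Below is a toy sketch with an invented four-token distribution; the function name `top_p_filter` is hypothetical.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, prob in ranked:
        kept[token] = prob
        cumulative += prob
        if cumulative >= top_p:
            break
    total = sum(kept.values())  # renormalise so the kept tokens sum to 1
    return {token: prob / total for token, prob in kept.items()}

toy_probs = {"peace": 0.5, "war": 0.3, "trade": 0.15, "banana": 0.05}  # made up
print(top_p_filter(toy_probs, 0.9))
print(top_p_filter(toy_probs, 0.5))
```

With `top_p=0.9` the unlikely token is dropped; with `top_p=0.5` only the single most likely token survives, which makes the output deterministic for this toy distribution.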

## Repetition penalty

In [None]:
messages = [
    {"role": "system", "content": "TODO"},
    {"role": "user", "content": text},
]
outputs = pipe(
    messages,
    max_new_tokens=256,
    temperature=1,
    repetition_penalty="TODO"
)
print(outputs[0]["generated_text"][-1]['content'])

How does the output change for different repetition penalty values?
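
A repetition penalty lowers the score of tokens that have already been generated. The toy sketch below follows a commonly used rule: positive scores are divided by the penalty and negative ones are multiplied by it. The token names and scores are invented, and the function is an illustration rather than the library's own code.

```python
def apply_repetition_penalty(logits, generated_tokens, penalty):
    """Lower the score of every token that has already been generated."""
    adjusted = dict(logits)
    for token in set(generated_tokens):
        if token not in adjusted:
            continue
        score = adjusted[token]
        # Dividing a positive score or multiplying a negative one both reduce it
        adjusted[token] = score / penalty if score > 0 else score * penalty
    return adjusted

toy_logits = {"peace": 2.0, "war": -1.0, "trade": 0.5}  # made-up scores
print(apply_repetition_penalty(toy_logits, ["peace", "war"], penalty=2.0))
```

A penalty of 1.0 leaves all scores unchanged, which is why 1.0 means "no penalty".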

## Max new tokens

In [None]:
messages = [
    {"role": "system", "content": "TODO"},
    {"role": "user", "content": text},
]
outputs = pipe(
    messages,
    max_new_tokens="TODO",
    temperature=1,
)
print(outputs[0]["generated_text"][-1]['content'])

How does the output change for different values of max new tokens?

# Reporting Your Work with LLMs

Effectively communicating your work with Large Language Models (LLMs) is crucial in academic and professional settings. This section will guide you through the key elements to include when reporting on LLM-based projects.

## 1. Model Specification

Clearly state the LLM you used:
- Model name and version (e.g., meta-llama/Llama-3.2-1B-Instruct, GPT-3.5, BERT-large)
- Source/provider of the model (e.g., Meta, OpenAI, Hugging Face)
- Model size (number of parameters)
- Any fine-tuning or adaptations made to the base model

Example:
```
We used the meta-llama/Llama-3.2-1B-Instruct model. This model has approximately 1 billion parameters and was used without any additional fine-tuning.
```

## 2. Dataset Description

Describe the data used in your project:
- Source of the data
- Size of the dataset (number of samples, tokens, etc.)
- Any preprocessing steps applied
- If using a benchmark dataset, cite the relevant paper

Example:
```
We used 30 sentences sampled at random from the UN speeches dataset (UN_intervention.csv) and annotated them manually for this experiment.
```

## 3. Task Description

Clearly define the NLP task you're addressing:
- Type of task (e.g., sentiment analysis, text classification, summarization)
- Input and output format
- Any specific constraints or requirements

Example:
```
Our task was sentiment analysis of sentences containing the word "intervention". The input was a sentence, and the output was a sentiment label (NEG, NET, or POS) indicating the speaker's sentiment towards the word "intervention" in the context of the sentence.
```

## 4. Prompting Strategy

Detail your prompting approach:
- Type of prompting used (zero-shot, one-shot, few-shot, chain-of-thought)
- Exact prompt templates used
- Any iterations or refinements made to the prompts

Example:
```
We experimented with various prompting strategies:

1. Zero-shot: "Analyze the sentiment towards the word 'intervention' in the given sentence. Respond with NEG for negative, NET for neutral, or POS for positive."

2. Few-shot: We provided 4 examples of sentences with their correct sentiment labels before asking the model to classify a new sentence.

3. Chain-of-thought: We added "Let's think step by step" to our prompt to encourage more detailed reasoning.
```

## 5. Hyperparameters

Specify the key hyperparameters used:
- Temperature
- Top-p (nucleus sampling)
- Repetition penalty
- Max new tokens
- Any other relevant parameters

Example:
```
We experimented with different hyperparameter settings:
- Temperature: 0.1, 0.5, 1.0
- Top-p: 0.9, 0.95, 1.0
- Repetition penalty: 1.0, 1.15, 1.3
- Max new tokens: 64, 128, 256
```

## 6. Evaluation Metrics

Describe how you evaluated the model's performance:
- Metrics used (e.g., accuracy, F1-score, BLEU)
- How ground truth was established
- Any statistical tests performed

Example:
```
We evaluated our model using accuracy, comparing the model's predictions to human-annotated labels. We calculated accuracy for each prompting strategy and hyperparameter setting.
```
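
Accuracy alone can hide poor performance on a rare label, so it is often reported together with macro-averaged F1. Below is a minimal sketch of both metrics in plain Python; the function names and the example labels are made up for illustration.

```python
def accuracy(gold, predicted):
    """Share of items where the prediction equals the gold label."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

def macro_f1(gold, predicted, labels=("NEG", "NET", "POS")):
    """Average the per-label F1 scores, giving each label equal weight."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)

gold = ["NEG", "NEG", "NET", "POS"]       # made-up labels
predicted = ["NEG", "POS", "NET", "POS"]
print(accuracy(gold, predicted))
print(round(macro_f1(gold, predicted), 3))
```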

## 7. Results and Analysis

Present your findings:
- Overall performance metrics
- Comparison of different approaches
- Error analysis and challenging cases
- Visualizations of results

Example:
```
The few-shot prompting strategy achieved the highest accuracy of 85%. Zero-shot performance was 70%, while chain-of-thought reasoning improved results slightly to 75%. We observed that the model struggled most with neutral sentiment cases.
```

## 8. Limitations and Biases

Discuss any limitations or potential biases in your approach:
- Model limitations
- Dataset biases
- Potential ethical considerations

Example:
```
Our model showed a slight bias towards positive sentiment. The limited size of our dataset (30 samples) may not be representative of all possible use cases. Further testing on a larger, more diverse dataset is recommended.
```

## 9. Reproducibility Information

Provide details to ensure reproducibility:
- Seed values for any random processes
- Hardware specifications
- Software versions
- Any additional libraries or tools used

Example:
```
Experiments were run on a Google Colab notebook with a T4 GPU. We used Python 3.8 and the transformers library version 4.28.1. Random seed was set to 42 for all experiments.
```
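
Rather than typing version numbers by hand, they can be read from the running environment. The sketch below uses only the standard library; the helper name `package_version` is hypothetical, and packages that are not installed are reported as such.

```python
import platform
import random
from importlib import metadata

def package_version(name):
    """Return the installed version of a package, or 'not installed'."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"

SEED = 42  # the value used in the example report above
random.seed(SEED)  # seeds Python's own random module only

report = {"python": platform.python_version(), "seed": SEED}
for package in ("transformers", "torch", "pandas"):
    report[package] = package_version(package)
print(report)
```

In the notebook itself, `transformers.set_seed(SEED)` seeds Python, NumPy and PyTorch in one call, and `df.sample(30, random_state=SEED)` makes the sampled sentences repeatable.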

## 10. Conclusion and Future Work

Summarize your findings and suggest next steps:
- Key takeaways from your experiments
- Potential improvements or extensions
- Broader implications of your work

Example:
```
Our experiments demonstrate the effectiveness of few-shot prompting for sentiment analysis in intervention-related contexts. Future work could explore larger models, more diverse datasets, and the integration of domain-specific knowledge to improve performance.
```

By including these elements in your report, you provide a comprehensive and transparent account of your work with LLMs, facilitating reproducibility and advancing the field of NLP research.

# Bonus - Multiple-Choice Questions (MCQs)

In [None]:
pipe = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.bfloat16, device_map="auto")

In [None]:
messages = [
    {"role": "system", "content": "Your task is to select the correct answer to the question. Return only A or B."},
    {"role": "user", "content": "What is the capital of Wakanda? \n Options: \n A. Warsaw \n B. Paris \n Answer:"},
]
outputs = pipe(
    messages,
    max_new_tokens=256,
    do_sample=False  # greedy decoding, the deterministic counterpart of temperature 0
)
print(outputs[0]["generated_text"][-1]['content'])

In [None]:
messages = [
    {"role": "system", "content": "Your task is to select the correct answer to the question. Return only A or B."},
    {"role": "user", "content": "What is the capital of Wakanda? \n Options: \n A. Paris \n B. Warsaw \n Answer:"},
]
outputs = pipe(
    messages,
    max_new_tokens=256,
    do_sample=False  # greedy decoding, the deterministic counterpart of temperature 0
)
print(outputs[0]["generated_text"][-1]['content'])

What are the options for evaluating MCQs?
1. Generate the answers with temperature 0 and report the consistency of the answers, that is, whether the model picks the same answer across the permutations of the options.
2. Generate answers multiple times (for example, 5 per permutation) and report the consistency averaged over the runs.
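
Consistency has to be measured on the option text, not on the letter, because the same letter points at a different option in each permutation. Below is a minimal sketch of such a measure. The helper names are hypothetical, and the answers are hand-written to mimic a model that always replies "A".

```python
def answer_text(letter, options):
    """Map a returned letter back to the option text it pointed at."""
    return options.get(letter.strip().upper())

def consistency(runs):
    """Share of runs that chose the most frequently chosen option text."""
    chosen = [answer_text(letter, options) for letter, options in runs]
    most_common = max(set(chosen), key=chosen.count)
    return chosen.count(most_common) / len(chosen)

# Hand-written answers for the two permutations, for illustration only
runs = [
    ("A", {"A": "Warsaw", "B": "Paris"}),
    ("A", {"A": "Paris", "B": "Warsaw"}),
]
print(consistency(runs))  # 0.5
```

A value of 0.5 across two permutations means the model followed the position of the option rather than its content.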