# W14 Lab Exercise
This is the lab exercise for MIS590: Information Retrieval. <br>
In this lab, you will gain experience with the following:<br>
- Loading an open-source large language model (LLM)
- Running inference with the LLM
<br>

**Note:** When you see a pencil icon ✏️ in this notebook, it's time for you to code or answer the question!

# 1. Preliminaries
## 1.1 Install and Import Libraries

In [1]:
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

In [None]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

In [None]:
!pip install tqdm



In [None]:
import pandas as pd
import json
import random
from tqdm import tqdm
import torch
from unsloth import FastLanguageModel
from transformers import TextStreamer

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


## 1.2 Model Setup

In [None]:
max_seq_length = 2048
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# Pre-quantized 4-bit models that Unsloth supports, for 4x faster downloading and to help avoid out-of-memory errors.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 2x faster
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # 4bit for 405b!
    "unsloth/Mistral-Small-Instruct-2409",     # Mistral 22b 2x faster
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster

    "unsloth/Llama-3.2-1B-bnb-4bit",           # Llama 3.2 models
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
] # More models at https://huggingface.co/unsloth

# Load the selected model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit", # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

==((====))==  Unsloth 2024.12.4: Fast Llama patching. Transformers:4.46.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/55.4k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/340 [00:00<?, ?B/s]

In [None]:
# Switch the model to Unsloth's inference mode; the cell output shows the model architecture
FastLanguageModel.for_inference(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096, padding_idx=128004)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaExtendedRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): L

### ✏️ Observe the results above and discuss the following:
- Explain the parameters we passed to ``FastLanguageModel.from_pretrained``
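
As a rough aid for thinking about `load_in_4bit`, the sketch below counts the weights in the `Linear4bit` layers printed above and compares their storage at 4 bits versus 16 bits per weight. It is a back-of-the-envelope estimate only: it ignores quantization constants, the embedding and normalization weights, and anything past the truncated part of the printout. The helper names `count_linear_weights` and `storage_gb` are made up for this illustration.

```python
# Layer shapes copied from the printed architecture: (in_features, out_features).
ATTENTION_SHAPES = [(4096, 4096), (4096, 1024), (4096, 1024), (4096, 4096)]  # q, k, v, o
MLP_SHAPES = [(4096, 14336), (4096, 14336), (14336, 4096)]  # gate, up, down
NUM_LAYERS = 32  # "(0-31): 32 x LlamaDecoderLayer"


def count_linear_weights():
    """Count the weights in all Linear4bit layers across the decoder stack."""
    per_layer = sum(n_in * n_out for n_in, n_out in ATTENTION_SHAPES + MLP_SHAPES)
    return per_layer * NUM_LAYERS


def storage_gb(num_weights, bits_per_weight):
    """Approximate storage in GB (1 GB = 1e9 bytes), ignoring any overhead."""
    return num_weights * bits_per_weight / 8 / 1e9


linear_weights = count_linear_weights()
print(f"Linear4bit weights: {linear_weights:,}")
print(f"Stored at 4 bits:  {storage_gb(linear_weights, 4):.2f} GB")
print(f"Stored at 16 bits: {storage_gb(linear_weights, 16):.2f} GB")
```

Comparing the two estimates with the 14.748 GB that the loading log reports for the Tesla T4 may help when discussing why the notebook loads a 4-bit checkpoint.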

# 2. Inference with LLM
## 2.1 Prompt Construction

In [None]:
# Prior research suggests that role-playing with LLMs can enhance text generation performance
chatbot_role = "You are a linguistic expert specialized in extracting temporal knowledge from text."

# The instruction provided to the LLM, explaining how to complete the task
instruction = """In the following, you will be given an article. Extract the temporal knowledge from text and format as a quadruple (head entity; relation; tail entity; timestamp). Extract every complete quadruple you can find from the article. Make sure no element in the quadruple is missing. Do not output anything else other than the quadruple."""

# A one-shot setting
examplar = """Following is an example:
The inauguration of Joe Biden as the 46th president of the United States took place on Wednesday, January 20, 2021.
Temporal knowledge quadruple: (Joe Biden; president_of; The United States;  2021/01/20), (Joe Biden; deliver; United States presidential inauguration; 2021/01/20)"""

# Formatting the prompt
prompt = """Instruction: {0}
Example: {1}

Article: {2}"""
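
The template above has three positional slots that `str.format` fills in order. A minimal sketch of that step follows, using a hypothetical helper `build_prompt` and placeholder strings rather than the lab's real instruction and article. Stripping the article's surrounding whitespace is an assumption added here, not something the notebook does.

```python
# Same three-slot layout as the notebook's template.
PROMPT_TEMPLATE = """Instruction: {0}
Example: {1}

Article: {2}"""


def build_prompt(instruction, exemplar, article):
    """Fill the template slots in order: instruction, exemplar, article."""
    # Only the template is scanned for {...} fields, so braces inside the
    # arguments are inserted literally.
    return PROMPT_TEMPLATE.format(instruction, exemplar, article.strip())


demo = build_prompt(
    "Extract the quadruples.",      # placeholder instruction
    "(A; rel; B; 2021/01/20)",      # placeholder exemplar
    "\nSome article text.\n",       # placeholder article
)
print(demo)
```

Keeping the template separate from its arguments makes it easy to swap in a different exemplar or article without touching the layout.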

In [None]:
article = """
The 96th Academy Awards ceremony, presented by the Academy of Motion Picture Arts and Sciences (AMPAS), took place on March 10, 2024, at the Dolby Theatre in Hollywood, Los Angeles.
"""

complete_prompt = prompt.format(instruction, examplar, article)
# Set up the chat scenario with roles
messages = [
    {"role": "system", "content": chatbot_role},
    {"role": "user", "content": complete_prompt}
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,  # Must add for generation
    return_tensors="pt",
).to("cuda")
# Create an all-ones attention mask (a single unpadded prompt) to avoid a warning
attention_mask = torch.ones_like(inputs)  # ones_like keeps the device and shape of inputs

# Get the length of the input prompt
len_input = inputs.shape[1]

# Generate the response
generated_ids = model.generate(
    input_ids=inputs,
    max_new_tokens=100,
    use_cache=True,
    temperature=0.7,
    min_p=0.1,
    attention_mask=attention_mask
)

# Extract only the generated tokens (excluding the prompt)
generated_tokens = generated_ids[0][len_input:]

# Decode the generated tokens to get the model's output
output_string = tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(output_string)

(96th Academy Awards ceremony; took_place; March 10, 2024), 
(The Academy of Motion Picture Arts and Sciences; presented; 96th Academy Awards ceremony; 2024/03/10)


### ✏️ Observe the results above and discuss the following:
- Do you think the LLM did a good job? Why?
- Can we conclude that LLMs are capable of inferring temporal relations from text based on this example?
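
One way to ground the discussion is to check the output mechanically against the format the instruction asked for. The sketch below is an illustration, not part of the lab's pipeline: it splits each parenthesized group on `;` and flags groups that do not have exactly four non-empty fields. The helper names are hypothetical, and the check covers format only; it says nothing about whether the extracted facts are correct.

```python
import re


def parse_tuples(text):
    """Return each parenthesized group as a list of ';'-separated fields."""
    groups = re.findall(r"\(([^()]*)\)", text)
    return [[field.strip() for field in group.split(";")] for group in groups]


def is_complete_quadruple(fields):
    """True if there are exactly four fields and none of them is empty."""
    return len(fields) == 4 and all(fields)


# The model output shown above, copied verbatim.
model_output = """(96th Academy Awards ceremony; took_place; March 10, 2024),
(The Academy of Motion Picture Arts and Sciences; presented; 96th Academy Awards ceremony; 2024/03/10)"""

for fields in parse_tuples(model_output):
    status = "complete" if is_complete_quadruple(fields) else "incomplete"
    print(f"{status}: {fields}")
```

The regular expression assumes no nested parentheses inside a tuple, which holds for this output but may not hold in general.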

# Assignment 4
## 1. Discussion Questions
Answer the discussion questions above (those marked with the ✏️ icon).
## 2. LLM-Based Sentiment Analysis
Let's try using LLMs for a sentiment analysis task!
- **Dataset Preparation:**
  - Obtain the [IMDb Movie Reviews Dataset](https://paperswithcode.com/dataset/imdb-movie-reviews).
  - Randomly select 50 movie reviews from this dataset to form your sample set for this assignment.
- **Prompt Engineering and Sentiment Analysis:**
  - Carefully design a prompt to be used with an LLM (like the one you experimented with earlier) to analyze the sentiment expressed in each movie review within your sample set.
  - Experiment with different prompt strategies and settings to achieve optimal performance in sentiment classification.
  - The goal is to classify each review as having either positive or negative sentiment.
- **Evaluation:**
  - Calculate the following evaluation metrics to assess the performance of your LLM-based sentiment analysis system: **Accuracy**, **Precision**, and **Recall**.
  - Analyze the results, comparing the performance of different prompting strategies if you have tried any.
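
The sampling and evaluation steps above could be sketched as follows. This is a minimal plain-Python illustration under stated assumptions, not a required solution: the helper names are hypothetical, the toy labels and outputs are made up, "positive" is treated as the positive class, and any LLM output that does not start with "positive" or "negative" is counted as a wrong prediction.

```python
import random


def sample_reviews(reviews, k=50, seed=0):
    """Draw k reviews without replacement; a fixed seed keeps the sample reproducible."""
    return random.Random(seed).sample(reviews, k)


def normalize_label(llm_output):
    """Map raw LLM text to 'positive' / 'negative', or None if it is unclear."""
    text = llm_output.strip().lower()
    for label in ("positive", "negative"):
        if text.startswith(label):
            return label
    return None


def evaluate(predictions, gold, positive_label="positive"):
    """Compute accuracy, precision, and recall for a binary task."""
    pairs = list(zip(predictions, gold))
    tp = sum(p == positive_label and g == positive_label for p, g in pairs)
    fp = sum(p == positive_label and g != positive_label for p, g in pairs)
    fn = sum(p != positive_label and g == positive_label for p, g in pairs)
    correct = sum(p == g for p, g in pairs)
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Toy data for illustration only (not real model output).
gold_labels = ["positive", "positive", "negative", "negative"]
raw_outputs = ["Positive.", "negative", "Positive", "I am not sure"]
predicted = [normalize_label(text) for text in raw_outputs]
metrics = evaluate(predicted, gold_labels)
print(metrics)
```

Reporting how many outputs could not be mapped to a label is also worth considering, since that number can differ between prompting strategies.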

## 💻 Assignment Submission 💻
Write your code and display the results in this Jupyter Notebook. Then, export it as an HTML file and submit both the Jupyter Notebook and the HTML file to Cyber University. (I will show you how to download this notebook and export it as HTML in class.) <br>
**Please ensure that the code is executed and the outputs are visible when exporting the HTML file.**