**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part III: LLM

Please see the description of the assignment in the README file (section 3) <br>
**Guide notebook**: [guides/llm_guide.ipynb](guides/llm_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: how do they compare with the results from Part I (BoW) and Part II (BERT)? Are there any hyperparameters or prompting techniques that are particularly important?

* You should follow the steps given in the `llm_guide` notebook.

<br>


***

In [19]:
# imports for the project

import pandas as pd
from sklearn.metrics import classification_report 
from tqdm import tqdm
from decouple import config
from ibm_watsonx_ai import APIClient
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference
from ibm_watsonx_ai.foundation_models.schema import TextGenParameters

### 1. Load the data

We can load this data directly from the Hugging Face Hub (see [Hugging Face Datasets](https://huggingface.co/docs/datasets/)) into a pandas DataFrame. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

You are welcome to increase the `frac` parameter of `preprocess` to sample more data.

In [20]:

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
# train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

'(ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')), '(Request ID: 9cbd8221-0f88-42bf-8f3f-3af0ea00b1c9)')' thrown while requesting GET https://huggingface.co/datasets/fancyzhx/ag_news/resolve/main/data/test-00000-of-00001.parquet
Retrying in 1s [Retry 1/5].


In [21]:
label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

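def preprocess(df: pd.DataFrame, frac: float = 1e-2, label_map: dict = label_map, seed: int = 42) -> pd.DataFrame:
    """Map integer labels to category names and sample `frac` of the rows from each label."""
    return (
        df
        .assign(label=lambda x: x['label'].map(label_map))        # integer label -> category name
        [lambda df: df['label'].isin(label_map.values())]         # drop rows whose label was not mapped
        .groupby('label')
        .apply(lambda x: x.sample(frac=frac, random_state=seed))  # same fraction from every class
        .reset_index(drop=True)
    )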

# train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.01)

# clear up some memory by deleting the original dataframes
# del train
del test

test_df.shape  # train_df is not loaded, so only the test shape is shown

(76, 2)
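
The shape `(76, 2)` follows from the per-class sampling: each of the four labels contributes 19 rows (the `support` column in the reports further down), giving 76 rows in total. Below is a minimal pure-Python sketch of the same stratified idea, using synthetic rows as a stand-in for the real DataFrame. The figure of 1,900 rows per class is an assumption consistent with 19 rows at `frac=0.01`. `stratified_sample` is a hypothetical helper, not part of the notebook, and its random draws will not match the rows pandas selects.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, frac=0.01, seed=42):
    """Sample the same fraction of rows from every label group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["label"]].append(row)
    sampled = []
    for label in sorted(groups):
        rng = random.Random(seed)  # re-seed for each group, mirroring random_state=seed per group
        k = round(frac * len(groups[label]))
        sampled.extend(rng.sample(groups[label], k))
    return sampled

# Synthetic stand-in data: 4 labels x 1,900 rows (assumed class size)
labels = ["World", "Sports", "Business", "Sci/Tech"]
rows = [{"text": f"story {i}", "label": label} for label in labels for i in range(1900)]

sample = stratified_sample(rows)
counts = Counter(r["label"] for r in sample)
print(len(sample), dict(counts))
```

Because the fraction is applied within each label, the sample stays balanced when the source data is balanced.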

### 2. Establish connection and set model parameters

In [22]:
# Load the environment variables using python-decouple

WX_API_KEY = config('WX_API_KEY')

credentials = Credentials(
    url = "https://us-south.ml.cloud.ibm.com",
    api_key = WX_API_KEY
)

client = APIClient(
    credentials=credentials, 
    project_id="32b6a744-5b49-4e5b-aee3-d9c56284fecf"
)

In [23]:
PARAMS = TextGenParameters(
    temperature=0,              # Higher temperature means more randomness; here we want deterministic output
    max_new_tokens=10,          # Maximum number of tokens to generate
    stop_sequences=[".", "\n"], # Stop generating text when these sequences are encountered
)

model = ModelInference(
    api_client=client,
    model_id="ibm/granite-13b-instruct-v2",  # We could also try a larger model!
    params=PARAMS
)
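
These settings interact with the prompts defined in the next section. With `max_new_tokens=10` and stop sequences of `"."` and `"\n"`, the model has little room to write out reasoning, so a numbered step such as `1. ...` may be cut off almost immediately. The following is a local illustration of that truncation, not the service's implementation. `truncate_at_stop` is a hypothetical helper that approximates tokens by whitespace-separated words, and it makes no claim about whether the real service returns the stop sequence itself.

```python
def truncate_at_stop(text, stop_sequences=(".", "\n"), max_new_tokens=10):
    """Cut `text` at the earliest stop sequence, then cap its length in words."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    words = text[:cut].split()
    return " ".join(words[:max_new_tokens])

short_answer = truncate_at_stop("Sports")
reasoned_answer = truncate_at_stop(
    "1. The story covers a championship game.\nFinal Category: Sports"
)
print(repr(short_answer))     # a bare label passes through unchanged
print(repr(reasoned_answer))  # a numbered reasoning step is cut at the first "."
```

If step-by-step output is wanted, a larger token budget and a stop sequence that does not fire on the first period would probably be needed.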

### 3. Create basic, few-shot, and chain-of-thought system prompts

In [24]:
SYSTEM_PROMPT = """Your task is to classify news stories into one of four categories

CATEGORIES:
{categories}

TEXT:
{text}

Please assign the correct category to the text. Answer with the correct category and nothing else.

Category:
"""

SYSTEM_PROMPT_FEWSHOT = """You are a news classifier. Your task is to assign one of the following categories to a news story.

CATEGORIES:
{categories}

Examples:
Example 1:
TEXT: "The United Nations convened today to discuss global strategies on climate change."
Category: World

Example 2:
TEXT: "The local team clinched the championship title in a stunning overtime finish."
Category: Sports

Now, classify the following news text:

TEXT: {text}

Answer with the correct category and nothing else.

Category:
"""

SYSTEM_PROMPT_CHAIN = """You are a news classifier. Your job is to analyze the news text and determine the most appropriate category from the list provided.

CATEGORIES:
{categories}

Process:
1. Briefly summarize the key points in the text.
2. Identify important keywords or themes.
3. Decide which category best fits the text.
4. Provide only the final category as the answer. Answer with the correct category and nothing else.

TEXT: {text}

Let's think step-by-step:
1.
2.
3.
Final Category:
"""
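
Each template contains `{categories}` and `{text}` placeholders that are filled with `str.format` before the prompt is sent. Below is a minimal sketch of that rendering step, using a shortened stand-in template and a made-up headline (both are illustrative, not the notebook's data).

```python
TEMPLATE = """CATEGORIES:
{categories}

TEXT:
{text}

Category:
"""

labels = ["World", "Sports", "Business", "Sci/Tech"]
categories = "- " + "\n- ".join(labels)  # one bullet per category

# Braces inside the substituted text are safe; only braces in the template are interpreted.
prompt = TEMPLATE.format(categories=categories, text="Chip maker {ACME} posts record profit")
print(prompt)
```

A template that needs literal braces of its own would have to escape them as `{{` and `}}`.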

### 4. Generate predictions

In [25]:
CATEGORIES = "- " + "\n- ".join(test_df["label"].unique())  # Create a string with all categories

# Create a dictionary of prompt variants
prompts = {
    "baseline": SYSTEM_PROMPT,
    "few_shot": SYSTEM_PROMPT_FEWSHOT,
    "chain_of_thought": SYSTEM_PROMPT_CHAIN
}


# Dictionary to hold the predictions for each prompt type
results = {}

for prompt_name, prompt_template in prompts.items():
    predictions = []
    # Loop over the test texts with a progress bar indicating the prompt variant being processed
    for text in tqdm(test_df["text"], desc=f"Processing {prompt_name} prompt"):
        # Format the current prompt with the test text
        prompt = prompt_template.format(categories=CATEGORIES, text=text)
        
        # Generate the response from the model
        response = model.generate(prompt)
        
        # Extract the generated text and strip any extra whitespace
        prediction = response["results"][0]["generated_text"].strip()
        
        # Append the prediction to the list
        predictions.append(prediction)
    
    # Store predictions for the current prompt type
    results[prompt_name] = predictions

Processing baseline prompt: 100%|██████████| 76/76 [00:29<00:00,  2.57it/s]
Processing few_shot prompt: 100%|██████████| 76/76 [00:28<00:00,  2.71it/s]
Processing chain_of_thought prompt: 100%|██████████| 76/76 [00:30<00:00,  2.51it/s]
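
The loop stores the raw generated text, so any answer that is not exactly one of the four labels (different casing, a trailing quote, an echoed `Final Category:` marker) is counted as its own class in the evaluation. One possible remedy is to map each raw output onto a known label before scoring. The sketch below assumes the four label names from `label_map`. `normalize_prediction` and the `"Unknown"` fallback are hypothetical additions, not part of the notebook.

```python
import re

LABELS = ["World", "Sports", "Business", "Sci/Tech"]

def normalize_prediction(raw, labels=LABELS, fallback="Unknown"):
    """Map free-form model output onto one of the known labels."""
    # If the model echoed a "Final Category:" marker, keep only what follows it.
    text = re.split(r"final category\s*:", raw, flags=re.IGNORECASE)[-1]
    text = text.strip().strip(".\"'").lower()
    # Prefer an exact match.
    for label in labels:
        if text == label.lower():
            return label
    # Otherwise take the label mentioned earliest in the text, if any.
    hits = [(text.find(label.lower()), label) for label in labels if label.lower() in text]
    return min(hits)[1] if hits else fallback

for raw in [" sports ", "Final Category: Sci/Tech.", "The category is Business", "1"]:
    print(repr(raw), "->", normalize_prediction(raw))
```

Applied in the loop as `predictions.append(normalize_prediction(prediction))`, this would leave at most one extra class (`Unknown`) in the report.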


### 5. Evaluate performance

In [26]:
# Evaluate each set of predictions using classification_report
for prompt_variant, preds in results.items():
    print(f"Evaluation for prompt variant: {prompt_variant}")
    print(classification_report(test_df["label"], preds))

Evaluation for prompt variant: baseline
              precision    recall  f1-score   support

    Business       0.55      0.95      0.69        19
    Sci/Tech       1.00      0.26      0.42        19
      Sports       1.00      0.89      0.94        19
       World       0.76      0.84      0.80        19

    accuracy                           0.74        76
   macro avg       0.83      0.74      0.71        76
weighted avg       0.83      0.74      0.71        76

Evaluation for prompt variant: few_shot
              precision    recall  f1-score   support

    Business       0.50      0.95      0.65        19
    Sci/Tech       0.89      0.42      0.57        19
      Sports       1.00      0.84      0.91        19
       World       0.87      0.68      0.76        19

    accuracy                           0.72        76
   macro avg       0.81      0.72      0.73        76
weighted avg       0.81      0.72      0.73        76

Evaluation for prompt variant: chain_of_thought
  

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
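
The `_warn_prf` lines are emitted by scikit-learn when a metric is undefined for some label. This typically happens when free-form predictions introduce labels that never occur in `test_df["label"]`, or when a true label is never predicted. The chain-of-thought report itself is not shown above, so this is a likely explanation rather than a confirmed one. The minimal pure-Python sketch below, with toy data and a hypothetical `per_class_prf` helper, shows where the division by zero arises.

```python
from collections import Counter

def per_class_prf(y_true, y_pred, zero_division=0.0):
    """Per-label precision, recall, F1, and support over the union of labels."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    predicted = Counter(y_pred)
    actual = Counter(y_true)
    report = {}
    for label in labels:
        # Precision is undefined when the label is never predicted,
        # recall is undefined when the label never occurs in y_true.
        precision = tp[label] / predicted[label] if predicted[label] else zero_division
        recall = tp[label] / actual[label] if actual[label] else zero_division
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else zero_division
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": actual[label]}
    return report

# Toy data: "1" is a stray model output, and "Sci/Tech" is never predicted.
y_true = ["World", "Sports", "Business", "Sci/Tech"]
y_pred = ["World", "1", "Business", "Business"]
report = per_class_prf(y_true, y_pred)
for label, scores in report.items():
    print(label, scores)
```

With scikit-learn itself, passing `labels=` and `zero_division=0` to `classification_report` restricts the report to the expected classes and suppresses the warning.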
