**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part III: LLM

Please see the description of the assignment in the README file (section 3) <br>
**Guide notebook**: [guides/llm_guide.ipynb](guides/llm_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: how do they compare with the results from Part I (BoW) and Part II (BERT)? Are there any hyperparameters or prompting techniques that are particularly important?

* You should follow the steps given in the `llm_guide` notebook

<br>


***

In [1]:
# imports for the project

import pandas as pd
from sklearn.metrics import classification_report 
from tqdm import tqdm
from decouple import config
from ibm_watsonx_ai import APIClient
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference
from ibm_watsonx_ai.foundation_models.schema import TextGenParameters

### 1. Load the data

We can load this data directly from [Hugging Face Datasets](https://huggingface.co/docs/datasets/) on the Hugging Face Hub into a Pandas DataFrame. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

You are welcome to increase the `frac` parameter to load more data.

In [2]:

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
# train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

In [3]:
label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

def preprocess(df: pd.DataFrame, frac=2e-2, label_map=label_map, seed=42) -> pd.DataFrame:
    """Map integer labels to names and draw a stratified sample (`frac` of each class)."""
    return (
        df
        .assign(label=lambda x: x['label'].map(label_map))
        [lambda df: df['label'].isin(label_map.values())]         # keep only rows with a known label
        .groupby('label')
        .apply(lambda x: x.sample(frac=frac, random_state=seed))  # same fraction from every class
        .reset_index(drop=True)
    )

# train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.04)

# clear up some memory by deleting the original dataframes
# del train
del test

test_df.shape #, train_df.shape, 

(304, 2)
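
As a quick sanity check on the stratified sampling above, the sketch below applies the same idea to a toy DataFrame. It is an illustration, not the notebook's code: `toy` and `stratified_sample` are hypothetical names, and it uses `DataFrameGroupBy.sample` instead of `groupby(...).apply(...)`, so it may select different rows than `preprocess` for the same seed.

```python
import pandas as pd

# Toy data (hypothetical): 50 rows for each of the four AG News labels
labels = ["World", "Sports", "Business", "Sci/Tech"]
toy = pd.DataFrame({
    "text": [f"story {i}" for i in range(200)],
    "label": [labels[i % 4] for i in range(200)],
})

def stratified_sample(df: pd.DataFrame, frac: float, seed: int = 42) -> pd.DataFrame:
    """Draw the same fraction of rows from every label group."""
    return (
        df
        .groupby("label")
        .sample(frac=frac, random_state=seed)
        .reset_index(drop=True)
    )

sample = stratified_sample(toy, frac=0.04)
counts = sample["label"].value_counts()
print(counts)  # every label should appear equally often
```

On the real test split, the same check (`test_df["label"].value_counts()`) should show 76 rows per class, which is consistent with the `(304, 2)` shape above.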

### 2. Establish connection and set model parameters

In [4]:
# Load the environment variables using python-decouple

WX_API_KEY = config('WX_API_KEY')

credentials = Credentials(
    url = "https://us-south.ml.cloud.ibm.com",
    api_key = WX_API_KEY
)

client = APIClient(
    credentials=credentials, 
    project_id="32b6a744-5b49-4e5b-aee3-d9c56284fecf"
)

In [5]:
PARAMS = TextGenParameters(
    temperature=0,              # Higher temperature means more randomness; here we don't want randomness
    max_new_tokens=70,          # Upper bound on generated tokens, intended to leave room for CoT reasoning
    stop_sequences=[".", "\n"], # Stop generating as soon as a period or a newline is produced
)

model = ModelInference(
    api_client=client,
    model_id="ibm/granite-13b-instruct-v2",
    params=PARAMS
)
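
To get a feel for how the stop sequences interact with longer answers, here is a plain-Python illustration that does not call watsonx. `truncate_at_stop` is a hypothetical helper that mimics cutting the output at the first stop sequence; whether the service returns the stop sequence itself as part of the text is an assumption in this sketch.

```python
def truncate_at_stop(text: str, stop_sequences: list[str]) -> str:
    """Cut `text` right after the earliest stop sequence (hypothetical helper)."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            # Keep the stop sequence itself (assumption about the service's behaviour)
            cut = min(cut, idx + len(stop))
    return text[:cut]

stops = [".", "\n"]
short_answer = "Business\nSports"
reasoned_answer = "The story is about stock prices. Final Answer: Business"

print(repr(truncate_at_stop(short_answer, stops)))     # 'Business\n'
print(repr(truncate_at_stop(reasoned_answer, stops)))  # 'The story is about stock prices.'
```

Under these assumptions, a one-word label passes through intact, while a reasoned answer is cut after its first sentence, before any final label appears.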

### 3. Create baseline, few-shot, and chain-of-thought prompts

In [6]:
SYSTEM_PROMPT = """Your task is to classify news stories into one of four categories, exactly as written below.

CATEGORIES:
{categories}

TEXT:
{text}

Please assign the correct category to the text. Answer with the correct category and nothing else.

Category:
"""

SYSTEM_PROMPT_FEWSHOT = """You are a news classifier. Your task is to assign one of the following four CATEGORIES to a news story - only use the four categories listed below, nothing else!!!

CATEGORIES:
{categories}

Examples:
Example 1:
TEXT: "The United Nations convened today to discuss global strategies on climate change."
Category: World

Example 2:
TEXT: "The local team clinched the championship title in a stunning overtime finish."
Category: Sports

Example 3:
TEXT: "The stock market rallied today after a series of positive economic reports."
Category: Business

Example 4:
TEXT: "Scientists have unveiled a groundbreaking discovery in renewable energy technology."
Category: Sci/Tech

Now, classify the following news text:

TEXT: {text}

Answer with the correct category and nothing else (you can choose between the four categories I gave you).

Category:
"""

SYSTEM_PROMPT_CHAIN = """You are a news classifier. Analyze the following news text and determine the most appropriate category from the list below — exactly as written.

CATEGORIES:
{categories}

Instructions:
1. Briefly think through the key points in the text.
2. Identify important keywords or themes.
3. Decide which category best fits the text.
4. Output only a single line starting with 'Final Answer:' followed by one of the following exact labels: World, Sports, Business, Sci/Tech. Do not include any additional text.

TEXT: {text}

Let's think step-by-step internally. 

Final Answer:
"""

### 4. Generate predictions

In [7]:
CATEGORIES = "- " + "\n- ".join(test_df["label"].unique())  # Create a string with all categories

# Create a dictionary of prompt variants
prompts = {
    "baseline": SYSTEM_PROMPT,
    "few_shot": SYSTEM_PROMPT_FEWSHOT,
    "chain_of_thought": SYSTEM_PROMPT_CHAIN
}


# Dictionary to hold the predictions for each prompt type
results = {}

for prompt_name, prompt_template in prompts.items():
    predictions = []
    # Loop over the test texts with a progress bar indicating the prompt variant being processed
    for text in tqdm(test_df["text"], desc=f"Processing {prompt_name} prompt"):
        # Format the current prompt with the test text
        prompt = prompt_template.format(categories=CATEGORIES, text=text)
        
        # Generate the response from the model
        response = model.generate(prompt)
        
        # Extract the generated text and strip any extra whitespace
        prediction = response["results"][0]["generated_text"].strip()
        
        # Append the prediction to the list
        predictions.append(prediction)
    
    # Store predictions for the current prompt type
    results[prompt_name] = predictions

Processing baseline prompt: 100%|██████████| 304/304 [01:38<00:00,  3.08it/s]
Processing few_shot prompt: 100%|██████████| 304/304 [01:36<00:00,  3.16it/s]
Processing chain_of_thought prompt: 100%|██████████| 304/304 [01:34<00:00,  3.22it/s]
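
Because `classification_report` compares strings exactly, raw outputs such as `Business.` or `Final Answer: Sci/Tech` would count as wrong (and show up as extra classes). The sketch below shows one possible way to map raw outputs onto the four labels before evaluation. `normalize_prediction`, `LABELS`, and the `"Unknown"` fallback are hypothetical and were not part of the run reported here.

```python
import re

LABELS = ["World", "Sports", "Business", "Sci/Tech"]

def normalize_prediction(raw: str, labels=LABELS, fallback: str = "Unknown") -> str:
    """Map a raw model output to one of `labels`, or to `fallback` if none matches."""
    # Drop an optional "Final Answer:" / "Category:" prefix
    cleaned = re.sub(r"^(final answer|category)\s*:\s*", "", raw.strip(), flags=re.IGNORECASE)
    # Remove surrounding whitespace, periods, and quotes, then ignore case
    cleaned = cleaned.strip(" .\"'\n\t").lower()
    for label in labels:
        if cleaned == label.lower():
            return label
    # Otherwise accept a label that appears somewhere in the output (checked in list order)
    for label in labels:
        if label.lower() in cleaned:
            return label
    return fallback

print(normalize_prediction("Business."))
print(normalize_prediction(" Final Answer: Sci/Tech\n"))
print(normalize_prediction("Entertainment"))
```

In the loop above this could be applied as `predictions.append(normalize_prediction(prediction))`. Outputs that match no label are kept as `"Unknown"` so they remain visible as a separate row in the report instead of being silently dropped.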


### 5. Evaluate performance

In [8]:
# Evaluate each set of predictions using classification_report
for prompt_variant, preds in results.items():
    print(f"Evaluation for prompt variant: {prompt_variant}")
    print(classification_report(test_df["label"], preds))

Evaluation for prompt variant: baseline
              precision    recall  f1-score   support

    Business       0.54      0.95      0.69        76
    Sci/Tech       0.90      0.34      0.50        76
      Sports       0.93      0.91      0.92        76
       World       0.84      0.75      0.79        76

    accuracy                           0.74       304
   macro avg       0.80      0.74      0.72       304
weighted avg       0.80      0.74      0.72       304

Evaluation for prompt variant: few_shot
              precision    recall  f1-score   support

    Business       0.47      0.95      0.63        76
    Sci/Tech       0.90      0.34      0.50        76
      Sports       0.91      0.89      0.90        76
       World       0.81      0.50      0.62        76

    accuracy                           0.67       304
   macro avg       0.77      0.67      0.66       304
weighted avg       0.77      0.67      0.66       304

Evaluation for prompt variant: chain_of_thought
  

### Reflection on Hyperparameters and Performance

In this final part of the project, I implemented an LLM-based text classification system using IBM’s `granite-13b-instruct-v2` model via the watsonx platform. The system was tasked with assigning AG News samples to one of four categories using **prompt engineering**, without any model fine-tuning.

I experimented with several prompting strategies:
- **Baseline** (zero-shot classification with structured instructions)
- **Few-shot** (adding 1 example per class)
- **Chain-of-Thought (CoT)** (step-by-step reasoning before choosing a label)


Across all techniques, the baseline zero-shot prompt achieved the highest accuracy (~74%) and macro F1-score (~0.72). Neither few-shot nor chain-of-thought prompting improved the performance: the few-shot prompt, for example, dropped to ~67% accuracy and ~0.66 macro F1. Both led to **lower accuracy and consistency** across classes.
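
For reference, the macro F1 figures quoted here are the unweighted mean of the per-class F1 scores, and each per-class F1 is the harmonic mean of precision and recall. The short sketch below recomputes them from the rounded values printed in the classification reports; `macro_average` and `f1` are illustrative helper names, and because the inputs are already rounded, the results agree with the reports only up to rounding.

```python
# Per-class F1 scores copied from the classification reports above
baseline_f1 = {"Business": 0.69, "Sci/Tech": 0.50, "Sports": 0.92, "World": 0.79}
few_shot_f1 = {"Business": 0.63, "Sci/Tech": 0.50, "Sports": 0.90, "World": 0.62}

def macro_average(per_class: dict[str, float]) -> float:
    """Unweighted mean over classes (each class counts equally)."""
    return sum(per_class.values()) / len(per_class)

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(f"baseline macro F1 ~ {macro_average(baseline_f1):.2f}")
print(f"few-shot macro F1 ~ {macro_average(few_shot_f1):.2f}")
print(f"baseline Business F1 ~ {f1(0.54, 0.95):.2f}")
```

Since all four classes have the same support (76), the macro and weighted averages coincide in the reports.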

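I also adjusted parameters like `max_new_tokens`, but I was not able to surpass the baseline performance. The Sci/Tech category was particularly difficult: its recall was only 0.34 under both the baseline and few-shot prompts. Business showed the opposite pattern (recall 0.95, with precision of 0.54 and 0.47), which suggests that many Sci/Tech stories were labelled as Business.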
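
One way to look into the low Sci/Tech recall is to tally which label each true class is predicted as. The sketch below does this with the standard library only; `confusion_counts` and `recall_per_class` are hypothetical helpers, and the labels in the example are made up for illustration rather than taken from the notebook's predictions.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))

def recall_per_class(y_true, y_pred):
    """Recall = correctly predicted items of a class / all items of that class."""
    pairs = confusion_counts(y_true, y_pred)
    totals = Counter(y_true)
    return {label: pairs[(label, label)] / totals[label] for label in totals}

# Made-up labels for illustration only (not the notebook's predictions)
y_true = ["Sci/Tech", "Sci/Tech", "Sci/Tech", "Business", "Business", "Sports"]
y_pred = ["Business", "Business", "Sci/Tech", "Business", "Business", "Sports"]

for (true_label, pred_label), n in sorted(confusion_counts(y_true, y_pred).items()):
    print(f"true={true_label:<9} predicted={pred_label:<9} count={n}")
print(recall_per_class(y_true, y_pred))
```

With the notebook's data, this could be called as `confusion_counts(test_df["label"], results["baseline"])`; `sklearn.metrics.confusion_matrix` provides the same information in matrix form.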

Unfortunately, despite these efforts, **performance did not improve**, and I am unsure of the specific reasons. This may relate to:
- Limits of zero-shot/few-shot prompting for classification
- The model's sensitivity to prompt phrasing or structure
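- Difficulty in classifying short, context-light texts with LLMs without tuning
- The generation settings: with `stop_sequences=[".", "\n"]` and a prompt that already ends in `Final Answer:`, the chain-of-thought variant has little room to write out any reasoning before generation stops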

### Comparison with BoW & BERT models
| Model              | Accuracy | Macro F1 | Key Takeaway |
|--------------------|----------|----------|--------------|
| **BoW + LogReg**   | ~82%     | ~0.82    | Simple but effective baseline |
| **BERT + LogReg**  | ~88%     | ~0.88    | Most robust and balanced overall |
| **LLM (Prompted)** | ~74%     | ~0.72    | Good zero-shot, but no improvement from advanced prompting |

Compared to both traditional BoW and BERT-based classifiers, the LLM approach **underperformed**. The BERT-based model, in particular, offered the best trade-off between performance and effort. It outperformed the LLM without any prompting tricks or large token budgets.
