**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part III: LLM

Please see the description of the assignment in the README file (section 3) <br>
**Guide notebook**: [guides/llm_guide.ipynb](guides/llm_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: how do they compare with the results from Part I (BoW) and Part II (BERT)? Are there any hyperparameters or prompting techniques that are particularly important?

* You should follow the steps given in the `llm_guide` notebook

<br>


***

In [3]:
# imports for the project

import pandas as pd
from decouple import config  # needed below: config('WX_API_KEY') reads the key from a .env file or the environment
from ibm_watsonx_ai import APIClient
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference

In [None]:
# API key for watsonx.ai
WX_API_KEY = config('WX_API_KEY')
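
The cell above relies on `config` from `python-decouple`. If that package is not installed, a minimal standard-library sketch could read the key from an environment variable instead. The helper name `read_api_key` is hypothetical and not part of any library.

```python
import os


def read_api_key(name: str = "WX_API_KEY") -> str:
    """Return the API key stored in the environment variable `name`."""
    key = os.environ.get(name, "").strip()
    if not key:
        # Fail early with a readable message instead of a later authentication error
        raise RuntimeError(f"Environment variable {name} is not set")
    return key
```

With the variable exported in the shell, `WX_API_KEY = read_api_key()` would replace the `config(...)` call.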

### 1. Load the data

We can load this data directly from the Hugging Face Hub (see [Hugging Face Datasets](https://huggingface.co/docs/datasets/)) into a Pandas DataFrame. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

You are welcome to increase the `frac` parameter to load more data.

In [5]:

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
# train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])



In [6]:
credentials = Credentials(
    url = "https://us-south.ml.cloud.ibm.com",
    api_key = WX_API_KEY
)

client = APIClient(
    credentials=credentials, 
    project_id="c6cdb28d-c346-49c4-968a-d04e28b75eb7"
)

Testing the connection


In [None]:
model = ModelInference(
    api_client=client,
    model_id="ibm/granite-3-8b-instruct",
)

prompt = "How do I make a cake?"
generated_response = model.generate(prompt)

generated_response

{'model_id': 'ibm/granite-3-8b-instruct',
 'model_version': '1.1.0',
 'created_at': '2025-03-30T12:51:06.035Z',
 'results': [{'generated_text': '\n\n1. Preheat your oven to the temperature specified in your recipe.\n2.',
   'generated_token_count': 20,
   'input_token_count': 8,
   'stop_reason': 'max_tokens'}],
    'id': 'unspecified_max_new_tokens',
    'additional_properties': {'limit': 0,
     'new_value': 20,
     'parameter': 'parameters.max_new_tokens',
     'value': 0}}]}}
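
The response is a plain dictionary, and its `stop_reason` field shows that this answer was cut off at the token limit rather than finished. A small helper can make that visible. It is sketched here on a hand-written dictionary that mirrors the fields shown above; `summarise_response` is a hypothetical name.

```python
def summarise_response(response: dict) -> tuple[str, bool]:
    """Return the generated text and whether it was cut off by the token limit."""
    result = response["results"][0]
    text = result["generated_text"].strip()
    truncated = result.get("stop_reason") == "max_tokens"
    return text, truncated


# Hand-written example with the same fields as the output above
example = {
    "results": [
        {
            "generated_text": "\n\n1. Preheat your oven to the temperature specified in your recipe.\n2.",
            "generated_token_count": 20,
            "input_token_count": 8,
            "stop_reason": "max_tokens",
        }
    ]
}

text, truncated = summarise_response(example)
print(truncated)  # True
```

For classification, a truncated answer is a hint that `max_new_tokens` is too small or that the model is not returning a bare category name.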

Parameter tuning

In [8]:
from ibm_watsonx_ai.foundation_models.schema import TextGenParameters

TextGenParameters.show()

+-----------------------+----------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+
| PARAMETER             | TYPE                                   | EXAMPLE VALUE                                                                                                                             |
| decoding_method       | str, TextGenDecodingMethod, NoneType   | sample                                                                                                                                    |
+-----------------------+----------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+
| length_penalty        | dict, TextGenLengthPenalty, NoneType   | {'decay_factor': 2.5, 'start_index': 5}                                                                  

In [9]:
# LLMs for classification
import pandas as pd
from sklearn.metrics import classification_report 
from tqdm import tqdm

Load data and start modelling

In [10]:
splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
# train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

In [11]:
label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

def preprocess(df: pd.DataFrame, frac : float = 1e-2, label_map : dict[int, str] = label_map, seed : int = 42) -> pd.DataFrame:
    """ Preprocess the dataset 

    Operations:
    - Map the label to the corresponding category
    - Filter out the labels not in the label_map
    - Sample a fraction of the dataset (stratified by label)

    Args:
    - df (pd.DataFrame): The dataset to preprocess
    - frac (float): The fraction of the dataset to sample in each category
    - label_map (dict): A mapping of the original label to the new label
    - seed (int): The random seed for reproducibility

    Returns:
    - pd.DataFrame: The preprocessed dataset
    """

    return  (
        df
        .assign(label=lambda x: x['label'].map(label_map))
        [lambda df: df['label'].isin(label_map.values())]
        .groupby('label')[["text", "label"]]
        .apply(lambda x: x.sample(frac=frac, random_state=seed))
        .reset_index(drop=True)

    )

# train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.1)

# clear up some memory by deleting the original dataframes
# del train
del test

test_df.shape # , train_df.shape

(760, 2)
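
The shape follows from the stratified sampling: the AG News test split holds 1,900 stories per category, so `frac=0.1` keeps 190 per category and 760 in total. The same idea can be sketched without pandas. `stratified_sample` and the toy rows below are illustrative only and are not how `preprocess` is implemented.

```python
import random
from collections import defaultdict


def stratified_sample(rows, frac, seed=42):
    """Sample the same fraction from every label group.

    `rows` is a list of (text, label) pairs.
    """
    groups = defaultdict(list)
    for text, label in rows:
        groups[label].append((text, label))

    rng = random.Random(seed)
    sample = []
    for label in sorted(groups):
        k = round(len(groups[label]) * frac)
        sample.extend(rng.sample(groups[label], k))
    return sample


# Toy data with the same class balance as the AG News test split
labels = ["World", "Sports", "Business", "Sci/Tech"]
rows = [(f"story {i}", label) for label in labels for i in range(1900)]

subset = stratified_sample(rows, frac=0.1)
print(len(subset))  # 760
```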

Setting model parameters

In [41]:
# Creating a parameter configuration for classification

PARAMS = TextGenParameters(
    temperature=0,            # Higher temperature means more randomness - In this case we don't want randomness
    max_new_tokens=25,          # Maximum number of tokens to generate
    stop_sequences=[".", "\n"], # Stop generating text when these sequences are encountered
)

model = ModelInference(
    api_client=client,
    model_id="ibm/granite-20b-code-instruct",  # IBM Granite code-instruct model (not a Meta model)
    params=PARAMS
)

In [33]:
response = model.generate(prompt)
response

{'model_id': 'ibm/granite-20b-code-instruct',
 'model_version': '1.1.0',
 'created_at': '2025-03-30T13:54:25.017Z',
 'results': [{'generated_text': 'Business',
   'generated_token_count': 2,
   'input_token_count': 156,
   'stop_reason': 'eos_token'}]}

In [42]:
SYSTEM_PROMPT = """Your task is to classify a news story into one of the following four categories (and only those four categories):

{categories}

News Story:
{text}

Identify and return only the correct category name that best fits the news story. Do not include any additional commentary or text.

Category:"""
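
One prompting technique worth trying is few-shot prompting, where a handful of labelled stories are placed before the story to classify. The sketch below builds such a prompt with plain string formatting. The template, `build_few_shot_prompt`, and the example stories are made up for illustration and were not used for the results in this notebook.

```python
FEW_SHOT_TEMPLATE = """Classify the news story into one of these categories:

{categories}

{examples}

News Story:
{text}

Category:"""


def build_few_shot_prompt(text, categories, examples):
    """Fill the template with the category list and labelled example stories."""
    category_block = "\n".join(f"- {c}" for c in categories)
    example_block = "\n\n".join(
        f"News Story:\n{story}\n\nCategory: {label}" for story, label in examples
    )
    return FEW_SHOT_TEMPLATE.format(
        categories=category_block, examples=example_block, text=text
    )


# Invented stories, only to show the resulting layout
examples = [
    ("The central bank raised interest rates by a quarter point.", "Business"),
    ("The home team won the cup final after extra time.", "Sports"),
]
prompt = build_few_shot_prompt(
    "A new smartphone chip promises longer battery life.",
    ["World", "Sports", "Business", "Sci/Tech"],
    examples,
)
print(prompt)
```

Few-shot prompts are longer, so the input token count per request grows with every example added.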

In [43]:
CATEGORIES = "- " + "\n- ".join(test_df["label"].unique())  # Create a string with all categories

predictions = []

for text in tqdm(test_df["text"]):

    # format the prompt with the categories and the text
    prompt = SYSTEM_PROMPT.format(categories=CATEGORIES, text=text)
    
    # generate the response from the model
    response = model.generate(prompt)

    # extract the generated text from the response
    prediction = response["results"][0]["generated_text"].strip()

    # append the prediction to the list of predictions
    predictions.append(prediction)

100%|██████████| 760/760 [03:58<00:00,  3.18it/s]
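
The loop above stops at the first failed request, which is costly after several minutes of calls. A hedged sketch of a retry wrapper follows. `generate_with_retry` is a hypothetical helper, and `generate` stands for any callable such as `model.generate`. Which exceptions the client actually raises is not covered here, so the sketch catches `Exception` broadly.

```python
import time


def generate_with_retry(generate, prompt, retries=3, wait_seconds=1.0):
    """Call `generate(prompt)` and retry with a growing pause if it raises."""
    for attempt in range(1, retries + 1):
        try:
            return generate(prompt)
        except Exception:
            if attempt == retries:
                raise  # give up after the last attempt
            time.sleep(wait_seconds * attempt)


# Stand-in for model.generate that fails twice before succeeding
calls = {"count": 0}


def flaky_generate(prompt):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return {"results": [{"generated_text": " Business"}]}


response = generate_with_retry(flaky_generate, "some prompt", wait_seconds=0.01)
print(response["results"][0]["generated_text"].strip())  # Business
```

In the loop, `model.generate(prompt)` would become `generate_with_retry(model.generate, prompt)`.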


In [44]:
print(classification_report(test_df.label, predictions))

              precision    recall  f1-score   support

    Business       0.59      0.51      0.55       190
    Politics       0.00      0.00      0.00         0
    Sci/Tech       0.80      0.22      0.34       190
      Sports       0.93      0.81      0.86       190
       World       0.45      0.90      0.60       190

    accuracy                           0.61       760
   macro avg       0.56      0.49      0.47       760
weighted avg       0.69      0.61      0.59       760
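
A classification report does not show which categories are confused with each other, for example which true labels end up predicted as World. Counting (true, predicted) pairs answers that. The sketch uses a short made-up list instead of the real predictions; in the notebook the inputs would be `test_df.label` and `predictions`.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


# Made-up labels, not the notebook's results
y_true = ["Sci/Tech", "Sci/Tech", "Business", "World", "Sports"]
y_pred = ["World", "Sci/Tech", "World", "World", "Sports"]

counts = confusion_counts(y_true, y_pred)
for (true_label, predicted), n in sorted(counts.items()):
    print(f"{true_label:>8} -> {predicted:<8} {n}")
```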






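### Classification report for model ibm/granite-3-8b-instruct

Hyperparameters:

* `temperature=0`
* `max_new_tokens=10`
* `stop_sequences=[".", "\n"]`
* `repetition_penalty=None`

| Metric       | Precision | Recall | F1-score | Support | Notes                        |
|--------------|-----------|--------|----------|---------|------------------------------|
| Business     | 0.54      | 0.91   | 0.68     | 190     |                              |
| Sci/Tech     | 0.89      | 0.35   | 0.50     | 190     | Low recall for Sci/Tech news |
| Sports       | 0.96      | 0.91   | 0.94     | 190     |                              |
| World        | 0.80      | 0.78   | 0.79     | 190     |                              |
| Accuracy     |           |        | 0.74     | 760     |                              |
| Macro Avg    | 0.80      | 0.74   | 0.73     | 760     |                              |
| Weighted Avg | 0.80      | 0.74   | 0.73     | 760     |                              |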

Initially, the LLM performs worse than both of the other models: the BoW model from Part I and the BERT model from Part II.

### 2nd model: meta-llama/llama-3-2-3b-instruct

Same parameters as for the Granite model above.

| Metric       | Precision | Recall | F1-score | Support |
|--------------|-----------|--------|----------|---------|
| Business     | 0.28      | 0.99   | 0.44     | 190     |
| Sci/Tech     | 1.00      | 0.01   | 0.01     | 190     |
| Sports       | 0.97      | 0.40   | 0.57     | 190     |
| World        | 0.00      | 0.00   | 0.00     | 190     |
| Accuracy     |           |        | 0.35     | 760     |
| Macro Avg    | 0.45      | 0.28   | 0.20     | 760     |
| Weighted Avg | 0.56      | 0.35   | 0.25     | 760     |

Very poor predictive power with the settings used in this notebook. A second attempt with a higher temperature gave an even worse result.

### 3rd model: ibm/granite-20b-code-instruct

Parameters:

* `decoding_method="greedy"`
* `temperature=0.1`
* `max_new_tokens=25`
* `stop_sequences=[".", "\n"]`

Classification report

| Metric       | Precision | Recall | F1-score | Support | Notes                                                   |
|--------------|-----------|--------|----------|---------|---------------------------------------------------------|
| Business     | 0.48      | 0.92   | 0.63     | 190     |                                                         |
| Sci/Tech     | 0.66      | 0.49   | 0.57     | 190     |                                                         |
| Sports       | 0.76      | 0.89   | 0.82     | 190     |                                                         |
| World        | 0.74      | 0.12   | 0.21     | 190     | Does not seem to capture a pattern for "World" articles |
| Accuracy     |           |        | 0.61     | 760     |                                                         |
| Macro Avg    | 0.66      | 0.61   | 0.56     | 760     |                                                         |
| Weighted Avg | 0.66      | 0.61   | 0.56     | 760     |                                                         |





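### Refined system prompt and no temperature

Same model as in the report above.

| Metric       | Precision | Recall | F1-score | Support | Notes                                                                                 |
|--------------|-----------|--------|----------|---------|---------------------------------------------------------------------------------------|
| Business     | 0.63      | 0.61   | 0.62     | 190     |                                                                                       |
| Politics     | 0.00      | 0.00   | 0.00     | 0       | The model hallucinated a fifth category; if we exclude it, the model performs fairly well |
| Sci/Tech     | 0.80      | 0.23   | 0.35     | 190     |                                                                                       |
| Sports       | 0.95      | 0.85   | 0.90     | 190     |                                                                                       |
| World        | 0.49      | 0.91   | 0.63     | 190     |                                                                                       |
| Accuracy     |           |        | 0.65     | 760     |                                                                                       |
| Macro Avg    | 0.57      | 0.52   | 0.50     | 760     |                                                                                       |
| Weighted Avg | 0.72      | 0.65   | 0.62     | 760     |                                                                                       |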
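
The macro average is the unweighted mean over every label that appears in the report, so the empty "Politics" row counts as a zero. A quick check with the rounded F1 scores from the report above shows how much the hallucinated category pulls the average down:

```python
# F1 scores copied from the report above (rounded to two decimals)
f1 = {
    "Business": 0.62,
    "Politics": 0.00,
    "Sci/Tech": 0.35,
    "Sports": 0.90,
    "World": 0.63,
}

# Mean over all five labels, as in the report's "macro avg" row
macro_all = sum(f1.values()) / len(f1)

# Same mean without the hallucinated category
valid = {label: score for label, score in f1.items() if label != "Politics"}
macro_valid = sum(valid.values()) / len(valid)

print(f"{macro_all:.3f} {macro_valid:.3f}")  # 0.500 0.625
```

The first value matches the macro F1 of 0.50 in the report. The weighted average is unaffected by the extra row, because its support is 0.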

### Reflections
The experiments show how sensitive LLM performance is to both the prompt and the decoding parameters. Small changes to the temperature, the maximum number of tokens, or the prompt wording can significantly alter precision and recall, and can even lead the model to hallucinate unintended categories (e.g., "Politics").

* The `ibm/granite-3-8b-instruct` model achieves moderate accuracy (74%) but struggles with Sci/Tech recall (0.35).
* The `meta-llama/llama-3-2-3b-instruct` model performs poorly (35% accuracy): it predicts Business for almost every story (recall 0.99, precision 0.28) and neglects the other categories.
* The `ibm/granite-20b-code-instruct` model shows improvements in some areas (Sci/Tech recall rises to 0.49) yet remains sensitive, especially for the World category, where recall is 0.12 with the first prompt and 0.91 with the refined one.

**Prompt refinement:** A refined prompt helps (accuracy for `ibm/granite-20b-code-instruct` rises from 0.61 to 0.65), but it does not rule out hallucinations such as the invented "Politics" category. This shows the need for explicit instructions to select only from the predefined categories.
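
Besides tightening the prompt, the raw output can be checked against the allowed labels after generation. The sketch below is one possible post-processing step and was not used for the results above; `normalise_prediction` and the `"Unknown"` fallback are hypothetical choices.

```python
LABELS = ["World", "Sports", "Business", "Sci/Tech"]


def normalise_prediction(raw: str, labels=LABELS, fallback: str = "Unknown") -> str:
    """Map raw model output to one of the allowed labels, or to `fallback`."""
    cleaned = raw.strip().strip(".").lower()
    for label in labels:
        if cleaned == label.lower():
            return label
    # Accept outputs that merely contain exactly one label, e.g. "Category: Sports"
    matches = [label for label in labels if label.lower() in cleaned]
    if len(matches) == 1:
        return matches[0]
    return fallback


raw_outputs = ["Business", " sports", "Category: World", "Politics"]
print([normalise_prediction(r) for r in raw_outputs])
# ['Business', 'Sports', 'World', 'Unknown']
```

Counting how often the fallback is returned gives a direct measure of how often the model leaves the predefined categories.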

**Overall insight:** Success depends on clear, constrained prompts and on carefully tuned decoding parameters.