**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part III: LLM

Please see the description of the assignment in the README file (section 3) <br>
**Guide notebook**: [guides/llm_guide.ipynb](guides/llm_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: how do they compare with the results from Part I (BoW) and Part II (BERT)? Are there any hyperparameters or prompting techniques that are particularly important?

* You should follow the steps given in the `llm_guide` notebook

<br>


***

In [1]:
# imports for the project

import pandas as pd

## 1. Load the data

We can load this data directly from the Hugging Face Hub ([Hugging Face Datasets](https://huggingface.co/docs/datasets/)) into a Pandas DataFrame. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

You are welcome to increase the `frac` parameter to load more data.
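To avoid the repeated download mentioned in the note, one option is to keep a local copy of the DataFrame. The following is a minimal sketch, not part of the guide notebook: the helper name `load_cached` and the cache path are hypothetical, and it assumes a pickle file on local disk is an acceptable cache format.

```python
from pathlib import Path

import pandas as pd


def load_cached(source: str, cache_path: str, loader=pd.read_parquet) -> pd.DataFrame:
    """Load a DataFrame from a local pickle cache, fetching it only on a cache miss.

    `load_cached` is a hypothetical helper, not part of pandas or the guide notebook.
    """
    cache_file = Path(cache_path)
    if cache_file.exists():
        # Cache hit: read the local copy, no network access needed
        return pd.read_pickle(cache_file)
    # Cache miss: fetch from the source, then store a local copy for next time
    df = loader(source)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    df.to_pickle(cache_file)
    return df
```

With this helper, the test split could be loaded as `load_cached("hf://datasets/fancyzhx/ag_news/" + splits["test"], "cache/ag_news_test.pkl")`, where the cache location is an arbitrary choice.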

In [2]:

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
# train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

In [3]:
label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

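def preprocess(df: pd.DataFrame, frac: float = 1e-2, label_map: dict = label_map, seed: int = 42) -> pd.DataFrame:
    """Map integer labels to names and draw a stratified sample of `frac` rows per class."""
    return (
        df
        # replace the integer labels with their category names
        .assign(label=lambda x: x['label'].map(label_map))
        # drop rows whose label is not in the label map
        [lambda df: df['label'].isin(label_map.values())]
        # sample the same fraction from every class, reproducibly
        .groupby('label')
        .apply(lambda x: x.sample(frac=frac, random_state=seed))
        .reset_index(drop=True)
    )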

# train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.1)

# clear up some memory by deleting the original dataframes
# del train
del test

test_df.shape, # train_df.shape,

((760, 2),)

## 2. Setup LLM Pipeline

In [7]:
from decouple import config
from dotenv import load_dotenv
from ibm_watsonx_ai import APIClient
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models.schema import TextGenParameters
from ibm_watsonx_ai.foundation_models import ModelInference

load_dotenv(dotenv_path="../.env")

WX_API_KEY = config('WX_API_KEY')

credentials = Credentials(
    url = "https://us-south.ml.cloud.ibm.com", # using Dallas region as the doc specified to do so
    api_key = WX_API_KEY
)

client = APIClient(
    credentials=credentials,
    project_id="933e6007-4781-432a-9591-21b932da4bcb"
)

PARAMS = TextGenParameters(
    temperature=0,              # Higher temperature means more randomness - In this case we don't want randomness
    max_new_tokens=10,          # Maximum number of tokens to generate
    stop_sequences=[".", "\n"], # Stop generating text when these sequences are encountered
)

def get_model(model_id: str):
    """ Get the model from the API """
    return ModelInference(
        api_client=client,
        model_id=model_id,
        params=PARAMS
    )

### 2.1 Create a system prompt

In [19]:
SYSTEM_PROMPT = """Your task is to classify news stories into one of the four following categories.

CATEGORIES:
{categories}

Here are some examples:

Example 1:
TEXT: LEGO Group reported record-breaking revenue for Q4, driven by strong holiday sales and demand for licensed sets.
Category: Business

Example 2:
TEXT: Apple unveiled its latest line of MacBook Pros featuring the new M3 chip, promising faster performance and improved battery life.
Category: Sci/Tech

Example 3:
TEXT: Manchester United narrowly defeated Liverpool in a thrilling 3-2 match at Old Trafford.
Category: Sports

Example 4:
TEXT: The European Union has proposed new regulations to combat climate change, aiming for a 55% reduction in emissions by 2030.
Category: World

Now classify the following news story:

TEXT: {text}

Please assign the correct category to the text. Answer with the correct category and nothing else. So only the category specified in the examples above, where it is most fitting.

Category:
"""

## 3. Generate predictions

In [20]:
from tqdm import tqdm

CATEGORIES = "- " + "\n- ".join(test_df["label"].unique())  # Create a string with all categories

# ibm/granite model
ibm_model = get_model("ibm/granite-13b-instruct-v2")
predictions_ibm_granite = []

# meta/llama model
meta_model = get_model("meta-llama/llama-3-405b-instruct")
predictions_meta_llama = []

models = [ibm_model, meta_model]

for model in models:
    # List to store the predictions of the current model
    predictions = []

    # Classify every text in the test set with the current model
    for text in tqdm(test_df["text"]):

        # format the prompt with the categories and the text
        prompt = SYSTEM_PROMPT.format(categories=CATEGORIES, text=text)

        # generate the response from the model
        response = model.generate(prompt)

        # extract the generated text from the response
        prediction = response["results"][0]["generated_text"].strip()

        # append the prediction to the list of predictions
        predictions.append(prediction)

    # Store the predictions in the correct variable
    if model.model_id == "ibm/granite-13b-instruct-v2":
        predictions_ibm_granite = predictions
    elif model.model_id == "meta-llama/llama-3-405b-instruct":
        predictions_meta_llama = predictions

100%|██████████| 760/760 [04:29<00:00,  2.82it/s]
100%|██████████| 760/760 [04:19<00:00,  2.93it/s]


## 4. Evaluate performance

In [21]:
from sklearn.metrics import classification_report

print("IBM Granite")
print(classification_report(test_df.label, predictions_ibm_granite))
print("Meta Llama")
print(classification_report(test_df.label, predictions_meta_llama))

IBM Granite
               precision    recall  f1-score   support

     Business       0.55      0.93      0.69       190
    Interview       0.00      0.00      0.00         0
          Law       0.00      0.00      0.00         0
Miscellaneous       0.00      0.00      0.00         0
     Sci/Tech       0.87      0.44      0.58       190
        Space       0.00      0.00      0.00         0
       Sports       0.93      0.92      0.92       190
          War       0.00      0.00      0.00         0
        World       0.86      0.67      0.75       190

     accuracy                           0.74       760
    macro avg       0.36      0.33      0.33       760
 weighted avg       0.80      0.74      0.74       760

Meta Llama
                    precision    recall  f1-score   support

(Your answer here)       0.00      0.00      0.00         0
                 ?       0.00      0.00      0.00         0
          Business       0.78      0.92      0.84       190
          Sci/Tech



## 5. Reflections

The Meta Llama model reached an accuracy of 0.89 and the IBM Granite model 0.74. Compared with the models from Part I and Part II, the Llama model was both faster and more accurate than the BERT model. The BoW model still has the best results (though it was also trained on more data), but it was a few minutes slower than the Llama model.
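Because the models sometimes emit labels outside the four categories, the reports gain extra rows with zero support, which pulls down the macro average. A sketch of one way to restrict the report to the known categories, using scikit-learn's `labels` and `zero_division` parameters; the toy `y_true` and `y_pred` lists below are illustrative stand-ins, not the notebook's data.

```python
from sklearn.metrics import accuracy_score, classification_report

# The four categories defined in label_map
LABELS = ["Business", "Sci/Tech", "Sports", "World"]

# Toy data standing in for test_df.label and a model's raw predictions
y_true = ["Business", "Sci/Tech", "Sports", "World", "World"]
y_pred = ["Business", "Space", "Sports", "World", "War"]

# labels=... limits the rows to the known categories;
# zero_division=0 reports undefined metrics as 0 without raising warnings
report = classification_report(y_true, y_pred, labels=LABELS, zero_division=0)
print(report)

# When labels is a subset of the labels that occur, the report shows a
# "micro avg" row instead of "accuracy", so accuracy is computed separately
print("accuracy:", accuracy_score(y_true, y_pred))
```

Out-of-category predictions still count as errors in recall and accuracy; they only stop appearing as separate rows.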

In the first iteration of our `SYSTEM_PROMPT`, the IBM model returned many categories that we had not defined, even though we used few-shot prompting. After we specified that it should only return the categories we defined, it worked a little better but still added other categories.
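One way to handle such out-of-category answers is to post-process the raw completions before evaluation. The sketch below is an assumption about how this could be done, not something we ran in this notebook; `normalize_prediction` and the `"Unknown"` fallback label are hypothetical names.

```python
import re

# The four categories defined in label_map
ALLOWED = ["World", "Sports", "Business", "Sci/Tech"]


def normalize_prediction(raw: str, allowed=ALLOWED, fallback: str = "Unknown") -> str:
    """Map a raw model completion to one of the allowed categories (hypothetical helper)."""
    cleaned = raw.strip().lower()
    # Exact match, ignoring case and trailing punctuation
    for label in allowed:
        if cleaned.rstrip(".!?:") == label.lower():
            return label
    # Otherwise accept the answer only if exactly one label appears as a whole word
    hits = [
        label for label in allowed
        if re.search(rf"(?<!\w){re.escape(label.lower())}(?!\w)", cleaned)
    ]
    return hits[0] if len(hits) == 1 else fallback
```

Applied as `[normalize_prediction(p) for p in predictions_ibm_granite]`, this would collapse all invented categories into a single fallback class, so the classification report shows at most one extra row.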

The Llama model was better at sticking to the categories. For the task of categorising news articles, the Llama model could therefore be a quicker solution that is only a few percentage points behind the BoW model. If accuracy matters most, the BoW model is presumably the better option based on our assignments. With more time, we would test the Llama model on more data to be more certain, since the BoW model was trained on all of the data.
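To get a rough sense of how much a test set of 760 samples limits this comparison, the uncertainty of the Llama accuracy can be estimated. This is a back-of-the-envelope sketch that assumes independent samples and uses the normal (Wald) approximation; `accuracy_interval` is a hypothetical helper.

```python
import math


def accuracy_interval(accuracy: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) interval for an accuracy measured on n samples."""
    # Standard error of a proportion, scaled by z (1.96 for roughly 95% coverage)
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - half_width, accuracy + half_width


# Accuracy and sample size reported above for the Llama model
low, high = accuracy_interval(0.89, 760)
print(f"Llama accuracy: 0.89, approx. 95% interval: [{low:.3f}, {high:.3f}]")
```

The interval is roughly ±0.02 wide on each side, so a difference of a few percentage points between models sits close to the noise level of this sample.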