<a href="https://colab.research.google.com/github/wanadzhar913/aitinkerers-hackathon-supa-team-werecooked/blob/master/notebooks-benchmarking-exercises/03_benchmark_malaysian_mistral_llmasajudge_v3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we benchmark a [Malaysian Mistral model](https://huggingface.co/wanadzhar913/malaysian-mistral-llmasajudge-v3) fine-tuned to classify whether texts are logically consistent, generate reasoning for that classification, and answer yes/no questions. On the validation dataset, it achieves the following metrics:

| Language         | Accuracy (%) | F1 Score (%) | Precision (%) | Recall (%) |
|------------------|--------------|--------------|---------------|------------|
| Malay + English  |    61.3    |   69.1   |    68.6   |  69.7  |
| Malay            |    61.0    |   68.3   |    69.7   |  66.9  |

- weave (Malay + English): https://wandb.ai/adzhar-faiq/benchmark_malaysian_mistral_llmasajudge_v3/r/call/0192b9f7-843d-7793-b4c6-f67c5f63af5e
- weave (Malay): https://wandb.ai/adzhar-faiq/benchmark_malaysian_mistral_llmasajudge_v3/r/call/0192ba10-b104-7fb3-9c2c-175b78f579dd





### 0.0 Load datasets & dependencies

In [None]:
!wget https://raw.githubusercontent.com/wanadzhar913/aitinkerers-hackathon-supa-team-werecooked/refs/heads/master/datasets/for_presentation/boolq-eng-val-200.jsonl -q
!wget https://raw.githubusercontent.com/wanadzhar913/aitinkerers-hackathon-supa-team-werecooked/refs/heads/master/datasets/for_presentation/boolq-malay-val-200.jsonl -q
!wget https://raw.githubusercontent.com/wanadzhar913/aitinkerers-hackathon-supa-team-werecooked/refs/heads/master/datasets/for_presentation/fib-eng-val-200.jsonl -q
!wget https://raw.githubusercontent.com/wanadzhar913/aitinkerers-hackathon-supa-team-werecooked/refs/heads/master/datasets/for_presentation/fib-malay-val-200.jsonl -q

In [1]:
!pip install weave flash_attn accelerate bitsandbytes -U -q

In [2]:
# Print the library versions used for this run
import torch
import weave
import transformers
import flash_attn
import accelerate
import bitsandbytes

print(f'torch version: {torch.__version__}')
print(f'transformers version: {transformers.__version__}')
print(f'weave version: {weave.__version__}')
print(f'flash_attn version: {flash_attn.__version__}')
print(f'accelerate version: {accelerate.__version__}')
print(f'bitsandbytes version: {bitsandbytes.__version__}')

torch version: 2.5.0+cu121
transformers version: 4.46.2
weave version: 0.51.19
flash_attn version: 2.7.0.post2
accelerate version: 1.1.1
bitsandbytes version: 0.44.1


In [None]:
import re
import json
from glob import glob
from typing import Dict

import weave
from tqdm.notebook import tqdm

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, \
                         BitsAndBytesConfig, pipeline

In [None]:
PROJECT_NAME = 'benchmark_malaysian_mistral_llmasajudge_v3'

weave.init(PROJECT_NAME)

Logged in as Weights & Biases user: adzhar-faiq.
View Weave data at https://wandb.ai/adzhar-faiq/benchmark_malaysian_mistral_llmasajudge_v3/weave


<weave.trace.weave_client.WeaveClient at 0x7fd5356faf50>

In [None]:
!nvidia-smi

Wed Oct 23 15:20:27 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P0              46W / 400W |      2MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

### 1.0 Load models and prepare validation dataset



In [None]:
dataset_list = glob('*.jsonl')
dataset_list

['boolq-malay-val-200.jsonl',
 'boolq-eng-val-200.jsonl',
 'fib-eng-val-200.jsonl',
 'fib-malay-val-200.jsonl']

In [None]:
# construct Malay + English dataset
data_all = []

for k in dataset_list:
    with open(k) as fopen:
        for d in tqdm(fopen):
            d = json.loads(d)
            data_all.append(d)

print(f'Size of dataset: {len(data_all)}')

Size of dataset: 800


In [None]:
data_all[0]

{'question': 'bolehkah anda memandu di kanada dengan lesen AS',
 'answer': 1,
 'passage': 'Orang yang memandu masuk ke Kanada mesti mempunyai dokumen pendaftaran kenderaan mereka dan bukti insurans.',
 'language': 'Malay'}

In [None]:
# construct Malay-only dataset (files with 'malay' in the filename)
data_malay = []

for k in dataset_list:
    if 'malay' in k:
        with open(k) as fopen:
            for d in tqdm(fopen):
                d = json.loads(d)
                data_malay.append(d)

print(f'Size of dataset: {len(data_malay)}')

Size of dataset: 400
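
The Malay subset above is selected by filename. Since the sample record shown earlier carries a `language` field, the same split could also be made per record. The sketch below is only an illustration on toy rows: `filter_by_language` is a hypothetical helper, and the `'English'` label is an assumption (only `'Malay'` appears in the output above).

```python
from typing import Dict, List


def filter_by_language(rows: List[Dict], language: str) -> List[Dict]:
    """Keep only rows whose 'language' field matches, ignoring case."""
    wanted = language.lower()
    return [row for row in rows if str(row.get("language", "")).lower() == wanted]


# Toy rows shaped like the record printed above; the values are made up.
toy_rows = [
    {"question": "q1", "answer": 1, "passage": "p1", "language": "Malay"},
    {"question": "q2", "answer": 0, "passage": "p2", "language": "English"},  # label assumed
    {"question": "q3", "answer": 1, "passage": "p3", "language": "Malay"},
]

malay_rows = filter_by_language(toy_rows, "Malay")
print(f"Size of Malay subset: {len(malay_rows)}")
```

Filtering on the field rather than the filename would keep working if the files were ever merged or renamed, provided every record really has the field.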


In [None]:
TORCH_DTYPE = 'bfloat16'

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=getattr(torch, TORCH_DTYPE)
)

In [None]:
tokenizer = AutoTokenizer.from_pretrained('wanadzhar913/malaysian-mistral-llmasajudge-v3')
model = AutoModelForCausalLM.from_pretrained(
    'wanadzhar913/malaysian-mistral-llmasajudge-v3',
    attn_implementation='flash_attention_2',  # replaces the deprecated use_flash_attention_2=True
    quantization_config=nf4_config,
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
`low_cpu_mem_usage` was None, now set to True since model is quantized.



### 2.0 Create scoring metrics and classes



In [None]:
pipe = pipeline(
    "text-generation",
    tokenizer=tokenizer,
    model=model,
)

In [None]:
@weave.op()
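def call_llm(message: str) -> str:
    """Call the LLM on a formatted prompt and return only the newly generated text."""
    return pipe(
        message,
        max_new_tokens=18,       # cap the reply at 18 new tokens
        return_full_text=False,  # exclude the prompt from the returned text
        do_sample=True,
        temperature=0.1,         # low temperature keeps sampling close to greedy decoding
        top_p=0.97,
        top_k=50,
    )[0]['generated_text']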

In [None]:
class MalaysianMistralAsAJudge(weave.Model):
    prompt: str

    @weave.op
    def create_message(self, passage: str, question: str):
        return self.prompt.format(passage=passage, question=question)

    @weave.op
    def predict(self, passage:str, question:str):
        message = self.create_message(passage, question)
        return call_llm(message=message)

Below are the scoring metrics we use to evaluate the LLM's outputs.

In [None]:
def accuracy(model_output, answer):
    try:
        model_output = json.loads(model_output)
        class_model_output = model_output.get('consistency', None)
    except json.JSONDecodeError:
        # to handle edge cases where the LLM outputs improper JSON like this: '1 {"consistency": 0'
        match = re.search(r'\"consistency\":\s*([01])', model_output)

        if match:
            number = match.group(1)
            class_model_output = int(number)
        else:
            # to handle this pattern: # Nota: Output yang diberikan adalah "1", yang
            match = re.search(r'\d+', model_output)

            if match:
                number = match.group()
                class_model_output = int(number)
            else:
                class_model_output = None
    return {"accuracy": class_model_output == answer}
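
The `accuracy` scorer above and `BinaryMetrics.score` repeat the same three-step parsing (JSON, then a `"consistency"` regex, then the first integer). A minimal sketch of how that logic could be factored into one helper follows. `parse_consistency` is a hypothetical name, not part of the notebook, and it deliberately differs from the cells in two ways: it accepts single-quoted keys, and it does not fail when `json.loads` returns a bare number. Swapping it in could therefore change the reported scores.

```python
import json
import re
from typing import Optional


def parse_consistency(model_output: str, key: str = "consistency") -> Optional[int]:
    """Extract the 0/1 label from raw model text; return None if nothing is found."""
    # 1) Well-formed JSON object, e.g. '{"consistency": 1}'
    try:
        parsed = json.loads(model_output)
        if isinstance(parsed, dict):
            return parsed.get(key)
    except json.JSONDecodeError:
        pass

    # 2) Truncated or noisy JSON; accept single or double quotes around the key
    pattern = r"[\"']" + re.escape(key) + r"[\"']:\s*([01])"
    match = re.search(pattern, model_output)
    if match:
        return int(match.group(1))

    # 3) Last resort: the first integer anywhere in the text
    match = re.search(r"\d+", model_output)
    return int(match.group()) if match else None


# Examples taken from the comments in the scorer, plus a single-quoted variant
samples = [
    '{"consistency": 1}',
    '1 {"consistency": 0',
    "{'consistency': 1}",
    'Nota: Output yang diberikan adalah "1", yang',
    "tiada jawapan",
]
for text in samples:
    print(repr(text), "->", parse_consistency(text))
```

With such a helper, each scorer would reduce to one call followed by a comparison against `answer`.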

In [None]:
class BinaryMetrics(weave.Scorer):
    class_name: str
    eps: float = 1e-8

    @weave.op()
    def summarize(self, score_rows) -> dict:
        # filter out None rows, model may error out sometimes...
        score_rows = [score for score in score_rows if score["correct"] is not None]
        # Compute f1, precision, recall
        tp = sum([not score["negative"] and score["correct"] for score in score_rows])
        fp = sum([not score["negative"] and not score["correct"] for score in score_rows])
        fn = sum([score["negative"] and not score["correct"] for score in score_rows])
        precision = tp / (tp + fp + self.eps)
        recall = tp / (tp + fn + self.eps)
        f1 = 2 * precision * recall / (precision + recall + self.eps)
        result = {"f1": f1, "precision": precision, "recall": recall}
        return result

    @weave.op()
    def score(self, answer: int, model_output: str) -> dict:
        try:
            model_output = json.loads(model_output)
            class_model_output = model_output.get(self.class_name, None)
        except json.JSONDecodeError:
            # to handle edge cases where the LLM outputs improper JSON like this: '1 {"consistency": 1'
            match = re.search(r'\"consistency\":\s*([01])', model_output)

            if match:
                number = match.group(1)
                class_model_output = int(number)
            else:
                # to handle this pattern: # Nota: Output yang diberikan adalah "1", yang
                match = re.search(r'\d+', model_output)

                if match:
                    number = match.group()
                    class_model_output = int(number)
                else:
                    class_model_output = None
        result = {
            "correct": class_model_output == answer,
            "negative": not class_model_output,
        }
        return result

F1 = BinaryMetrics(class_name="consistency")
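
To make the `summarize` arithmetic concrete, the sketch below reproduces the same counting rules as a plain function (no Weave dependency) and runs it on five hand-made rows. `summarize_binary` and the toy rows are illustrative assumptions, not part of the notebook.

```python
from typing import Dict, List


def summarize_binary(score_rows: List[Dict[str, bool]], eps: float = 1e-8) -> Dict[str, float]:
    """Compute precision, recall and F1 from rows with 'correct' and 'negative' flags."""
    # Predicted positive and correct -> true positive
    tp = sum(not r["negative"] and r["correct"] for r in score_rows)
    # Predicted positive but wrong -> false positive
    fp = sum(not r["negative"] and not r["correct"] for r in score_rows)
    # Predicted negative but wrong -> false negative
    fn = sum(r["negative"] and not r["correct"] for r in score_rows)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return {"f1": f1, "precision": precision, "recall": recall}


toy_score_rows = [
    {"correct": True, "negative": False},   # true positive
    {"correct": True, "negative": False},   # true positive
    {"correct": False, "negative": False},  # false positive
    {"correct": False, "negative": True},   # false negative
    {"correct": True, "negative": True},    # true negative, ignored by all three metrics
]

toy_metrics = summarize_binary(toy_score_rows)
print(toy_metrics)
```

Here tp = 2, fp = 1 and fn = 1, so precision, recall and F1 all come out at roughly 2/3; the small `eps` only guards against division by zero.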

### 3.0 Run evaluations

In [None]:
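# Define prompt_v1 (written in Malay). In English, it tells the model that it is an
# expert at detecting factual inconsistencies and hallucinations, and asks it to label a
# question/statement as 0 (not supported by, or contradicting, the document) or
# 1 (supported/answered by the document), replying in the form {'consistency': 0 or 1}.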
prompt_v1 = """Anda adalah pakar dalam mengesan ketidakkonsistenan fakta dan halusinasi. Anda akan diberi satu dokumen dan satu soalan/kenyataan. Baca
dokumen dan soalan/kenyataan yang diberikan dengan teliti dan kenal pasti Ketidakkonsistenan Fakta (iaitu mana-mana soalan/kenyataan yang
tidak disokong atau bercanggah dengan maklumat dalam dokumen).

### Anda perlu memilih antara dua pilihan berikut:
- Tidak Konsisten dengan Fakta: Jika mana-mana soalan/kenyataan tidak disokong, terjawab atau bercanggah dengan dokumen, labelkannya sebagai 0.
- Konsisten dengan Fakta: Jika semua soalan/kenyataan disokong/terjawab oleh dokumen, labelkannya sebagai 1.

### Sebagai contoh:
Dokumen: "Gajah adalah mamalia besar yang biasanya ditemui di Afrika dan Asia. Mereka hidup dalam kumpulan yang dikenali sebagai kawanan dan terkenal kerana mempunyai ingatan yang baik."

Soalan/Kenyataan: "Gajah adalah mamalia besar yang biasanya ditemui di Eropah."
Jawapan: {{'consistency': 0}}

Soalan/Kenyataan: "Gajah adalah mamalia besar yang biasanya ditemui di Afrika dan Asia."
Jawapan: {{'consistency': 1}}

### Jawab berdasarkan dokumen dan soalan/kenyataan berikut:
Dokumen: {passage}
Soalan/Kenyataan: {question}

Kembalikan pilihan konsistenan dalam format JSON untuk pilihan yang diberikan. Anda wajib beri menjawab dalam format ini: {{'consistency': 1}} atau {{'consistency': 0}}.
"""
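
The doubled braces in `prompt_v1` matter because `create_message` calls `str.format`: `{{` and `}}` render as literal braces, while `{passage}` and `{question}` are substituted. The sketch below uses a shortened, hypothetical template to show this, and also checks that the single-quoted answer format is not valid JSON. If the model copies that format literally, `json.loads` in the scorers raises and the regex fallbacks are what recover the label.

```python
import json

# Hypothetical miniature of the prompt; only the brace handling matters here.
template = (
    "Dokumen: {passage}\n"
    "Soalan/Kenyataan: {question}\n"
    "Jawapan: {{'consistency': 1}}"
)

rendered = template.format(passage="P", question="Q")
example_answer = rendered.split("Jawapan: ")[1]
print(example_answer)

# Single-quoted keys are not valid JSON, so strict parsing fails.
try:
    json.loads(example_answer)
    is_valid_json = True
except json.JSONDecodeError:
    is_valid_json = False

print(f"valid JSON: {is_valid_json}")
```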

In [None]:
mistralasajudge = MalaysianMistralAsAJudge(prompt=prompt_v1)

#### 3.1 Evaluate performance on English & Malay texts

In [None]:
evaluation_all = weave.Evaluation(dataset=data_all, scorers=[accuracy, F1])

In [None]:
await evaluation_all.evaluate(mistralasajudge)

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset

🍩 https://wandb.ai/adzhar-faiq/benchmark_malaysian_mistral_llmasajudge_v3/r/call/0192b9f7-843d-7793-b4c6-f67c5f63af5e


{'accuracy': {'accuracy': {'true_count': 490, 'true_fraction': 0.6125}},
 'BinaryMetrics': {'f1': 0.6912350547475041,
  'precision': 0.6857707509745895,
  'recall': 0.6967871485803858},
 'model_latency': {'mean': 12.576710639894008}}

#### 3.2 Evaluate performance on Malay texts only

In [None]:
evaluation_malay = weave.Evaluation(dataset=data_malay, scorers=[accuracy, F1])

In [None]:
await evaluation_malay.evaluate(mistralasajudge)

🍩 https://wandb.ai/adzhar-faiq/benchmark_malaysian_mistral_llmasajudge_v3/r/call/0192ba10-b104-7fb3-9c2c-175b78f579dd


{'accuracy': {'accuracy': {'true_count': 244, 'true_fraction': 0.61}},
 'BinaryMetrics': {'f1': 0.6829268242425971,
  'precision': 0.6970954356557222,
  'recall': 0.6693227091366803},
 'model_latency': {'mean': 13.331802816987038}}
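
As a quick sanity check, the sketch below recomputes accuracy and F1 from the values printed by the two evaluations, using the same harmonic-mean formula (with `eps = 1e-8`) as `BinaryMetrics.summarize`. The dictionary layout and the helper name `f1_from` are illustrative; the numbers are copied from the outputs above, and the dataset sizes (800 and 400) come from the loading cells.

```python
EPS = 1e-8  # same default as BinaryMetrics.eps


def f1_from(precision: float, recall: float, eps: float = EPS) -> float:
    """Harmonic mean of precision and recall, with the same eps guard as summarize()."""
    return 2 * precision * recall / (precision + recall + eps)


reported = {
    "Malay + English": {
        "true_count": 490, "n": 800,
        "precision": 0.6857707509745895,
        "recall": 0.6967871485803858,
        "f1": 0.6912350547475041,
    },
    "Malay": {
        "true_count": 244, "n": 400,
        "precision": 0.6970954356557222,
        "recall": 0.6693227091366803,
        "f1": 0.6829268242425971,
    },
}

recomputed = {}
for name, row in reported.items():
    recomputed[name] = {
        "accuracy": row["true_count"] / row["n"],
        "f1": f1_from(row["precision"], row["recall"]),
    }
    print(name, recomputed[name])
```

Both recomputed F1 values agree with the logged ones to well within rounding, which is consistent with the summary table at the top of the notebook.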