<a href="https://colab.research.google.com/github/anjelammcgraw/Evaluating-Models-with-Eletheur-AI-s-Evaluation-Harness/blob/main/5_Model_Evaluation_Baseline_Mistral_7B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### ⚠ IMPORTANT ⚠

Please ensure your Colab runtime is set to the following:

A100 GPU


## Baseline Evaluation

Let's start with Mistral AI's `Mistral-7B` model.

We're going to load and compare everything in 4-bit quantization.

First, we install the dependencies and load the model.

### Load Mistral AI's Mistral-7B in 4-bit Quantization


In [None]:
!pip install -qU bitsandbytes datasets accelerate loralib peft transformers trl

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m105.0/105.0 MB[0m [31m14.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m507.1/507.1 kB[0m [31m47.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m270.9/270.9 kB[0m [31m33.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m168.3/168.3 kB[0m [31m23.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m8.4/8.4 MB[0m [31m104.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m150.9/150.9 kB[0m [31m21.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m115.3/115.3 kB[0m [31m17.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m21.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━

In [None]:
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig

In [None]:
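bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit precision
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.float16,  # run matmuls in half precision
)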
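To get a feel for why we load in 4-bit, here is a back-of-the-envelope estimate of weight memory. It is only a sketch: the parameter count (`NUM_PARAMS`) is an assumed round figure, and the helper `weight_memory_gib` is a hypothetical name. It ignores quantization constants, layers that stay in higher precision, activations, and the KV cache.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Return the memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

NUM_PARAMS = 7.2e9  # assumption: approximate parameter count for a "7B" model

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)  # half-precision baseline
nf4_gib = weight_memory_gib(NUM_PARAMS, 4)    # 4-bit quantized weights

print(f"fp16 weights:  {fp16_gib:.1f} GiB")
print(f"4-bit weights: {nf4_gib:.1f} GiB")
```

The real footprint on the GPU will be somewhat higher than the 4-bit estimate, for the reasons listed above.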

In [None]:
model_id = "mistralai/Mistral-7B-v0.1"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map='auto',
)

tokenizer = AutoTokenizer.from_pretrained(model_id)


#### ❓ Question:

Looking at the [model card](https://huggingface.co/mistralai/Mistral-7B-v0.1) and the resources it links to, is this an instruct-tuned model or not?

**ANSWER:** No. The model card for `mistralai/Mistral-7B-v0.1` describes it as a pretrained base model, not an instruct-tuned one.

### Collect and Load the EleutherAI Evaluation Harness


In [None]:
!git clone https://github.com/EleutherAI/lm-evaluation-harness
%cd lm-evaluation-harness
!pip install -e .

In [None]:
import lm_eval
from lm_eval.models.huggingface import HFLM

# Wrap the already-loaded 4-bit model so the harness can query it
eval_model = HFLM(model, batch_size=4)



In [None]:
lm_eval.tasks.initialize_tasks()



In [None]:
results = lm_eval.simple_evaluate(
    model=eval_model,
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=0,
    batch_size=16,
)

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
INFO:lm-eval:num_fewshot has been set to 0 for hellaswag in its config. Manual configuration will be ignored.
INFO:lm-eval:num_fewshot has been set to 0 for arc_easy in its config. Manual configuration will be ignored.
INFO:lm-eval:Building contexts for task on rank 0...
INFO:lm-eval:Building contexts for task on rank 0...
INFO:lm-eval:Running loglikelihood requests
100%|██████████| 49669/49669 [17:51<00:00, 46.37it/s]


In [None]:
import pandas as pd

pd.DataFrame(results["results"])

| | hellaswag | arc_easy |
|---|---|---|
| `acc,none` | 0.608743 | 0.798822 |
| `acc_stderr,none` | 0.00487 | 0.008226 |
| `acc_norm,none` | 0.805616 | 0.786195 |
| `acc_norm_stderr,none` | 0.003949 | 0.008413 |
| `alias` | hellaswag | arc_easy |
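The results include both `acc` and `acc_norm`, and on HellaSwag the two differ a lot. As a rough sketch of the difference, assume `acc` picks the answer choice with the highest total log-likelihood, while `acc_norm` first divides each log-likelihood by the length of the choice text. The helper `pick_answers` and the toy numbers below are hypothetical and only illustrate the idea.

```python
def pick_answers(loglikelihoods, choices):
    """Return (raw_pick, normalized_pick) as indices into `choices`."""
    indices = range(len(choices))
    # Raw: choose the continuation with the highest total log-likelihood.
    raw_pick = max(indices, key=lambda i: loglikelihoods[i])
    # Normalized: divide by the continuation's character length first, so a
    # long continuation is not penalized just for having more text to score.
    norm_pick = max(indices, key=lambda i: loglikelihoods[i] / len(choices[i]))
    return raw_pick, norm_pick

# Toy example (made-up values): the longer ending has a lower total
# log-likelihood but a better per-character score.
choices = ["stops.", "keeps walking down the street toward the store."]
loglikelihoods = [-6.0, -14.0]

raw_pick, norm_pick = pick_answers(loglikelihoods, choices)
print("raw pick:", choices[raw_pick])
print("normalized pick:", choices[norm_pick])
```

Tasks with answer choices of very different lengths are where the two metrics tend to diverge.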


### Few-Shot MMLU

In [None]:
fs_mmlu_results = lm_eval.simple_evaluate(
    model=eval_model,
    tasks=["mmlu_flan_n_shot_loglikelihood_machine_learning"],
    num_fewshot=5,
    batch_size=16,
)

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


Generating test split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating dev split: 0 examples [00:00, ? examples/s]

INFO:lm-eval:Building contexts for task on rank 0...
INFO:lm-eval:Running loglikelihood requests
100%|██████████| 448/448 [00:37<00:00, 11.95it/s]


In [None]:
import pandas as pd

pd.DataFrame(fs_mmlu_results["results"])

| | mmlu_flan_n_shot_loglikelihood_machine_learning |
|---|---|
| `acc,none` | 0.401786 |
| `acc_norm,none` | 0.401786 |
| `acc_norm_stderr,none` | 0.046533 |
| `acc_stderr,none` | 0.046533 |
| `alias` | mmlu_flan_n_shot_loglikelihood_machine_learning |
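The standard error here is large because this subtask is small. As a sanity check, the sketch below recomputes it from the accuracy alone. It assumes 112 questions (the 448 log-likelihood requests in the log divided by 4 answer choices each) and that the standard error is the sample standard deviation of the 0/1 scores divided by the square root of the number of questions. The helper name `accuracy_stderr` is hypothetical.

```python
import math

def accuracy_stderr(accuracy: float, n_items: int) -> float:
    """Standard error of a mean of 0/1 scores, using the n - 1 sample variance."""
    return math.sqrt(accuracy * (1.0 - accuracy) / (n_items - 1))

n_items = 448 // 4   # assumption: 4 answer choices per question
acc = 0.401786       # five-shot accuracy reported above

stderr = accuracy_stderr(acc, n_items)
# Approximate 95% interval under a normal approximation
ci_low, ci_high = acc - 1.96 * stderr, acc + 1.96 * stderr

print(f"stderr: {stderr:.6f}")
print(f"95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

If the assumptions hold, the recomputed value should agree with the reported `acc_stderr` to several decimal places. The interval is wide, so treat the accuracy on this subtask as a noisy estimate.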


### Zero-Shot MMLU

In [None]:
zs_mmlu_results = lm_eval.simple_evaluate(
    model=eval_model,
    tasks=["mmlu_flan_n_shot_loglikelihood_machine_learning"],
    num_fewshot=0,
    batch_size=16,
)

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
INFO:lm-eval:Building contexts for task on rank 0...
INFO:lm-eval:Running loglikelihood requests
100%|██████████| 448/448 [00:09<00:00, 44.81it/s]


In [None]:
import pandas as pd

pd.DataFrame(zs_mmlu_results["results"])

| | mmlu_flan_n_shot_loglikelihood_machine_learning |
|---|---|
| `acc,none` | 0.3125 |
| `acc_norm,none` | 0.3125 |
| `acc_norm_stderr,none` | 0.043995 |
| `acc_stderr,none` | 0.043995 |
| `alias` | mmlu_flan_n_shot_loglikelihood_machine_learning |
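Zero-shot accuracy (0.3125) is lower than the five-shot accuracy (0.401786) from the earlier run, but both come with large standard errors. The sketch below runs a simple z-test on the difference. It treats the two runs as independent, which is an assumption: both use the same questions, so a paired test would be more appropriate. The helper `z_score` is a hypothetical name.

```python
import math

def z_score(acc_a: float, se_a: float, acc_b: float, se_b: float) -> float:
    """z statistic for the difference of two accuracies, assuming independence."""
    return (acc_a - acc_b) / math.sqrt(se_a**2 + se_b**2)

# Values copied from the two results tables
five_shot_acc, five_shot_se = 0.401786, 0.046533
zero_shot_acc, zero_shot_se = 0.3125, 0.043995

z = z_score(five_shot_acc, five_shot_se, zero_shot_acc, zero_shot_se)
significant = abs(z) > 1.96  # two-sided test at the 5% level

print(f"z = {z:.2f}, significant at 5%: {significant}")
```

Under these assumptions the gap does not clear the usual 5% threshold, so a larger evaluation set would be needed before drawing firm conclusions about few-shot versus zero-shot prompting.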


### Chain of Thought

In [None]:
cot_mmlu_results = lm_eval.simple_evaluate(
    model=eval_model,
    tasks=["mmlu_flan_cot_zeroshot_machine_learning"],
    num_fewshot=0,
    batch_size=16,
)

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
INFO:lm-eval:num_fewshot has been set to 0 for mmlu_flan_cot_zeroshot_machine_learning in its config. Manual configuration will be ignored.
INFO:lm-eval:Building contexts for task on rank 0...
INFO:lm-eval:Running generate_until requests
100%|██████████| 11/11 [01:06<00:00,  6.02s/it]


In [None]:
import pandas as pd

pd.DataFrame(cot_mmlu_results["results"])

| | mmlu_flan_cot_zeroshot_machine_learning |
|---|---|
| `alias` | mmlu_flan_cot_zeroshot_machine_learning |
| `exact_match,get-answer` | 0.0 |
| `exact_match_stderr,get-answer` | 0.0 |
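An exact-match score of 0.0 does not necessarily mean every answer was wrong. Unlike the log-likelihood tasks, this task generates free text, extracts an answer from it (the `get-answer` filter), and compares that to the reference. One plausible reason for a zero score is that a base model rarely produces text in the format the extraction step expects. The sketch below illustrates this with a made-up pattern and made-up generations; `ANSWER_PATTERN`, `extract_answer`, and `exact_match` are hypothetical and are not the harness's actual filter.

```python
import re

# Hypothetical extraction pattern, for illustration only
ANSWER_PATTERN = re.compile(r"answer is \(?([A-D])\)?")

def extract_answer(generation: str) -> str:
    """Pull the answer letter out of a generation, or flag it as invalid."""
    match = ANSWER_PATTERN.search(generation)
    return match.group(1) if match else "[invalid]"

def exact_match(generations, references) -> float:
    """Fraction of generations whose extracted answer equals the reference."""
    hits = [extract_answer(g) == r for g, r in zip(generations, references)]
    return sum(hits) / len(hits)

# Made-up generations: the second one never states an answer in the
# expected format, so it scores zero whatever it "meant".
generations = [
    "Let's think step by step. Overfitting grows with model capacity, so the answer is (B).",
    "Machine learning is a field of study concerned with algorithms that learn from data.",
]
references = ["B", "B"]

score = exact_match(generations, references)
print(f"exact match: {score}")
```

Inspecting a few raw generations from the run is the most direct way to check whether formatting, rather than knowledge, explains the score.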
