# Mistral 7B LM Eval Testing

Code to evaluate three variants of Mistral-7B on the benchmarks used by the Open LLM Leaderboard.

In [1]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0


In [2]:
!nvidia-smi

Wed Dec 27 01:06:53 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P0              52W / 400W |      2MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
!pip install -q -U transformers peft torch accelerate einops sentencepiece bitsandbytes


In [None]:
# clone repository
!git clone https://github.com/EleutherAI/lm-evaluation-harness.git

In [None]:
# change to repo directory
import os

os.chdir("/content/lm-evaluation-harness")
# install
!pip install -e .

In [None]:
import datetime

now = datetime.datetime.now()
now = now.strftime("%Y_%m_%d_%H_%M_%S")

# one sub-folder per benchmark; makedirs also creates the parent run folder
for task in ["arc", "hellaswag", "mmlu", "truthfulqa", "winogrande", "gsm8k"]:
    os.makedirs(f"/content/{now}/{task}", exist_ok=True)

In [None]:
os.environ["now_log_folder"] = now

# ARC Challenge

AI2 Reasoning Challenge (25-shot) - a set of grade-school science questions.


In [None]:
!lm_eval --model hf \
    --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,dtype="bfloat16" \
    --tasks arc_challenge \
    --batch_size 16 \
    --write_out \
    --output_path /content/$now_log_folder/arc/arc_challenge_mistralai_Mistral-7B-Instruct-v0.2_lm_eval.json \
    --device cuda:0 \
    --num_fewshot 25 \
    --verbosity DEBUG
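
The invocation above is the template for every benchmark in this notebook: only the task name, the few-shot count, and the output file change. As a rough sketch, the argument list could be generated from a small table instead of being retyped. The helper name `build_lm_eval_args` and its parameters are hypothetical and not part of lm-evaluation-harness; the flags and few-shot counts are the ones used in this notebook's cells.

```python
# Task name -> few-shot count, copied from the cells of this notebook.
FEWSHOT = {
    "arc_challenge": 25,
    "hellaswag": 10,
    "mmlu": 5,
    "truthfulqa_mc2": 0,
    "winogrande": 5,
    "gsm8k": 5,
}


def build_lm_eval_args(task, model, out_dir, batch_size=16, device="cuda:0"):
    """Return the lm_eval argument list for one task (hypothetical helper)."""
    # Model ids contain "/", which would otherwise create a nested path.
    safe_model = model.replace("/", "_")
    return [
        "lm_eval", "--model", "hf",
        "--model_args", f"pretrained={model},dtype=bfloat16",
        "--tasks", task,
        "--batch_size", str(batch_size),
        "--write_out",
        "--output_path", f"{out_dir}/{task}_{safe_model}_lm_eval.json",
        "--device", device,
        "--num_fewshot", str(FEWSHOT[task]),
        "--verbosity", "DEBUG",
    ]
```

The returned list can be passed to `subprocess.run(args, check=True)`, which avoids shell quoting issues; building the list does not run anything by itself.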


# HellaSwag

HellaSwag (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.


In [None]:
!lm_eval --model hf \
    --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,dtype="bfloat16" \
    --tasks hellaswag \
    --batch_size 16 \
    --write_out \
    --output_path /content/$now_log_folder/hellaswag/hellaswag_mistralai_Mistral-7B-Instruct-v0.2_lm_eval.json \
    --device cuda:0 \
    --num_fewshot 10 \
    --verbosity DEBUG


# MMLU

MMLU (5-shot) - a test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.


In [None]:
!lm_eval --model hf \
    --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,dtype="bfloat16" \
    --tasks mmlu \
    --batch_size 16 \
    --write_out \
    --output_path /content/$now_log_folder/mmlu/mmlu_mistralai_Mistral-7B-Instruct-v0.2_lm_eval.json \
    --device cuda:0 \
    --num_fewshot 5 \
    --verbosity DEBUG


# TruthfulQA

TruthfulQA (0-shot) - a test to measure a model's propensity to reproduce falsehoods commonly found online. Note: TruthfulQA in the Harness is effectively at least a 6-shot task, because 6 examples are always prepended to the prompt, even when the run is launched with 0 few-shot examples.

In [None]:
!lm_eval --model hf \
    --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,dtype="bfloat16" \
    --tasks truthfulqa_mc2 \
    --batch_size 16 \
    --write_out \
    --output_path /content/$now_log_folder/truthfulqa/truthfulqa_mc2_mistralai_Mistral-7B-Instruct-v0.2_lm_eval.json \
    --device cuda:0 \
    --num_fewshot 0 \
    --verbosity DEBUG


# Winogrande

Winogrande (5-shot) - an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning.



In [None]:
!lm_eval --model hf \
    --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,dtype="bfloat16" \
    --tasks winogrande \
    --batch_size 16 \
    --write_out \
    --output_path /content/$now_log_folder/winogrande/winogrande_mistralai_Mistral-7B-Instruct-v0.2_lm_eval.json \
    --device cuda:0 \
    --num_fewshot 5 \
    --verbosity DEBUG


# GSM8k

GSM8k (5-shot) - diverse grade school math word problems to measure a model's ability to solve multi-step mathematical reasoning problems.

In [None]:

!lm_eval --model hf \
    --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,dtype="bfloat16" \
    --tasks gsm8k \
    --batch_size 16 \
    --write_out \
    --output_path /content/$now_log_folder/gsm8k/gsm8k_mistralai_Mistral-7B-Instruct-v0.2_lm_eval.json \
    --device cuda:0 \
    --num_fewshot 5 \
    --verbosity DEBUG
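
With all six runs finished, the Open LLM Leaderboard's headline number is the unweighted mean of the six benchmark scores. The sketch below only does the averaging. It assumes the per-task scores have already been read out of the result JSON files (the metric key names differ between harness versions, so that step is left out), and `leaderboard_average` is a hypothetical helper name.

```python
from statistics import mean

# The six benchmarks evaluated in this notebook.
LEADERBOARD_TASKS = (
    "arc_challenge",
    "hellaswag",
    "mmlu",
    "truthfulqa_mc2",
    "winogrande",
    "gsm8k",
)


def leaderboard_average(scores):
    """Average the six task scores; refuse to average an incomplete set."""
    missing = [task for task in LEADERBOARD_TASKS if task not in scores]
    if missing:
        raise ValueError(f"missing scores for: {', '.join(missing)}")
    return mean(scores[task] for task in LEADERBOARD_TASKS)
```

Raising on a missing task is a deliberate choice here: silently averaging five scores would produce a number that is not comparable to the leaderboard.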


# Zip Results

In [None]:
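# archive the whole run folder (all six result sub-folders) into a single zip file
!cd /content && zip -r "$now_log_folder".zip "$now_log_folder"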
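
As an alternative to the shell `zip` command, the run folder can be archived from Python with the standard library. This is a minimal sketch, assuming the results live in a single folder such as `/content/<timestamp>`; `zip_results` is a hypothetical helper name.

```python
import shutil
from pathlib import Path


def zip_results(run_dir):
    """Create <run_dir>.zip next to run_dir and return the archive path."""
    run_dir = Path(run_dir).resolve()
    return shutil.make_archive(
        base_name=str(run_dir),    # archive path without the ".zip" suffix
        format="zip",
        root_dir=run_dir.parent,   # paths inside the archive are relative to this
        base_dir=run_dir.name,     # only this folder is added to the archive
    )
```

Calling `zip_results(f"/content/{now}")` would write the archive alongside the run folder, with the timestamped folder name as the top-level entry inside the zip.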

# Task List

In [None]:
!lm_eval --tasks list