# Mistral 7B LM Eval Testing

Code to evaluate three variants of Mistral-7B on the Open LLM Leaderboard benchmark suite.

In [None]:
!nvcc --version

In [None]:
!nvidia-smi

In [None]:
!pip install -q -U transformers peft torch accelerate einops sentencepiece bitsandbytes

In [None]:
# clone repository
!git clone https://github.com/EleutherAI/lm-evaluation-harness.git

In [None]:
# change to repo directory
import os

os.chdir("/content/lm-evaluation-harness")
# install
!pip install -e .

In [None]:
import datetime

# timestamped folder name for this run's logs
now = datetime.datetime.now().strftime("%Y_%m_%d_%H_%M_%S")

# one output subfolder per benchmark
for task in ["arc", "hellaswag", "mmlu", "truthfulqa", "winogrande", "gsm8k"]:
    os.makedirs(f"/content/{now}/{task}")

In [None]:
# expose the folder name to the shell commands below as $now_log_folder
os.environ["now_log_folder"] = now

# ARC Challenge

AI2 Reasoning Challenge (25-shot) - a set of grade-school science questions.


In [None]:
!lm_eval --model hf \
    --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,dtype="bfloat16" \
    --tasks arc_challenge \
    --batch_size 16 \
    --write_out \
    --output_path /content/$now_log_folder/arc/arc_challenge_mistralai_Mistral-7B-Instruct-v0.2_lm_eval.json \
    --device cuda:0 \
    --num_fewshot 25 \
    --verbosity DEBUG


# HellaSwag

HellaSwag (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.


In [None]:
!lm_eval --model hf \
    --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,dtype="bfloat16" \
    --tasks hellaswag \
    --batch_size 16 \
    --write_out \
    --output_path /content/$now_log_folder/hellaswag/hellaswag_mistralai_Mistral-7B-Instruct-v0.2_lm_eval.json \
    --device cuda:0 \
    --num_fewshot 10 \
    --verbosity DEBUG


# MMLU

MMLU (5-shot) - a test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.


In [None]:
!lm_eval --model hf \
    --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,dtype="bfloat16" \
    --tasks mmlu \
    --batch_size 4 \
    --write_out \
    --output_path /content/$now_log_folder/mmlu/mmlu_mistralai_Mistral-7B-Instruct-v0.2_lm_eval.json \
    --device cuda:0 \
    --num_fewshot 5 \
    --verbosity DEBUG


# TruthfulQA

TruthfulQA (0-shot) - a test to measure a model's propensity to reproduce falsehoods commonly found online. Note: TruthfulQA in the Harness is technically at least a 6-shot task, because 6 examples are always prepended to the prompt, even when the number of few-shot examples is set to 0.

In [None]:
!lm_eval --model hf \
    --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,dtype="bfloat16" \
    --tasks truthfulqa_mc2 \
    --batch_size 16 \
    --write_out \
    --output_path /content/$now_log_folder/truthfulqa/truthfulqa_mc2_mistralai_Mistral-7B-Instruct-v0.2_lm_eval.json \
    --device cuda:0 \
    --num_fewshot 0 \
    --verbosity DEBUG


# Winogrande

Winogrande (5-shot) - an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning.



In [None]:
!lm_eval --model hf \
    --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,dtype="bfloat16" \
    --tasks winogrande \
    --batch_size 16 \
    --write_out \
    --output_path /content/$now_log_folder/winogrande/winogrande_mistralai_Mistral-7B-Instruct-v0.2_lm_eval.json \
    --device cuda:0 \
    --num_fewshot 5 \
    --verbosity DEBUG


# GSM8K

GSM8K (5-shot) - diverse grade school math word problems that measure a model's ability to solve multi-step mathematical reasoning problems.

In [None]:

!lm_eval --model hf \
    --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,dtype="bfloat16" \
    --tasks gsm8k \
    --batch_size 16 \
    --write_out \
    --output_path /content/$now_log_folder/gsm8k/gsm8k_mistralai_Mistral-7B-Instruct-v0.2_lm_eval.json \
    --device cuda:0 \
    --num_fewshot 5 \
    --verbosity DEBUG
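
The six evaluation cells above differ only in task name, output subfolder, few-shot count, and batch size. As a rough sketch, the same commands could be generated from a small table instead of being written out by hand. The helper `build_command`, the `RUNS` table, and the `/content/example_run` log folder are hypothetical names introduced for illustration; the flags and values are copied from the cells above.

```python
import shlex

MODEL = "mistralai/Mistral-7B-Instruct-v0.2"

# (task name, output subfolder, few-shot count, batch size), copied from the cells above
RUNS = [
    ("arc_challenge", "arc", 25, 16),
    ("hellaswag", "hellaswag", 10, 16),
    ("mmlu", "mmlu", 5, 4),
    ("truthfulqa_mc2", "truthfulqa", 0, 16),
    ("winogrande", "winogrande", 5, 16),
    ("gsm8k", "gsm8k", 5, 16),
]


def build_command(task, folder, num_fewshot, batch_size, log_root, model=MODEL):
    """Return the lm_eval argument list for one task (hypothetical helper)."""
    model_tag = model.replace("/", "_")
    output_path = f"{log_root}/{folder}/{task}_{model_tag}_lm_eval.json"
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model},dtype=bfloat16",
        "--tasks", task,
        "--batch_size", str(batch_size),
        "--write_out",
        "--output_path", output_path,
        "--device", "cuda:0",
        "--num_fewshot", str(num_fewshot),
        "--verbosity", "DEBUG",
    ]


# "/content/example_run" is a placeholder; the notebook uses /content/<timestamp>
commands = [build_command(*run, log_root="/content/example_run") for run in RUNS]
for cmd in commands:
    print(shlex.join(cmd))
```

Each argument list could then be passed to `subprocess.run(cmd, check=True)` to launch the evaluations one after another, or the printed lines could be pasted into `!` cells.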


# Zip Results

In [None]:
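# archive this run's log folder so it can be downloaded in one file
!zip -r "/content/$now_log_folder".zip "/content/$now_log_folder"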
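
For a quick look at the scores without opening each file, something like the following may help. It is a minimal sketch that assumes each run leaves at least one JSON file with a top-level `results` key somewhere under the log folder; the exact file layout and metric names vary between harness versions, so the helper `collect_results` (a hypothetical name) makes no assumption about either and simply prints whatever it finds.

```python
import json
import os
from pathlib import Path


def collect_results(log_root):
    """Map each JSON file under log_root to its top-level "results" entry, if any."""
    collected = {}
    for path in sorted(Path(log_root).rglob("*.json")):
        try:
            with open(path) as f:
                data = json.load(f)
        except (ValueError, OSError):
            continue  # skip files that cannot be read or parsed as JSON
        # assumption: result files are JSON objects with a "results" key
        if isinstance(data, dict) and "results" in data:
            collected[str(path.relative_to(log_root))] = data["results"]
    return collected


# now_log_folder is the environment variable set earlier in this notebook
log_root = Path("/content") / os.environ.get("now_log_folder", "")
if log_root.is_dir():
    for name, results in collect_results(log_root).items():
        print(name)
        print(json.dumps(results, indent=2))
```

Sample dumps and other JSON files without a `results` key are silently skipped, so the output lists only the per-task summaries.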

# Task List

In [None]:
!lm_eval --tasks list