# Mixtral 8x7B LM Eval Testing

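Code to evaluate a Mixtral-8x7B model (the `mistralai/Mixtral-8x7B-v0.1` base with the `dfurman/Mixtral-8x7B-Instruct-v0.1` PEFT adapter, loaded in 4-bit) on the six Open LLM Leaderboard benchmarks.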

In [None]:
!nvcc --version

In [None]:
!nvidia-smi

In [None]:
!pip install -q -U transformers peft torch accelerate einops sentencepiece bitsandbytes

In [None]:
# clone repository
!git clone https://github.com/EleutherAI/lm-evaluation-harness.git

In [None]:
# change to repo directory
import os

os.chdir("/content/lm-evaluation-harness")

In [None]:
# install
!pip install -e .

In [None]:
import datetime

now = datetime.datetime.now()
now = now.strftime("%Y_%m_%d_%H_%M_%S")

# one sub-folder per benchmark, under a timestamped run folder
for task in ["arc", "hellaswag", "mmlu", "truthfulqa", "winogrande", "gsm8k"]:
    os.makedirs(f"/content/{now}/{task}")

In [None]:
os.environ["now_log_folder"] = now

In [None]:
os.environ["now_log_folder"]
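
The evaluation cells in the rest of this notebook differ only in task name, few-shot count, batch size, and output path. A minimal sketch of driving them from one table follows. `EVALS`, `MODEL_ARGS`, `build_lm_eval_command`, and the `/content/run/` output folder are hypothetical names introduced for illustration; the flags, model arguments, shot counts, and batch sizes are copied from the cells in this notebook.

```python
import shlex

# Task name -> (few-shot count, batch size), copied from the cells in this notebook.
EVALS = {
    "arc_challenge": (25, 8),
    "hellaswag": (10, 8),
    "mmlu": (5, 2),
    "truthfulqa_mc2": (0, 16),
    "winogrande": (5, 16),
    "gsm8k": (5, 8),
}

# Same model arguments as the lm_eval cells (shell quotes around bfloat16 dropped).
MODEL_ARGS = (
    "pretrained=mistralai/Mixtral-8x7B-v0.1,dtype=bfloat16,"
    "peft=dfurman/Mixtral-8x7B-Instruct-v0.1,load_in_4bit=True"
)


def build_lm_eval_command(task, num_fewshot, batch_size, output_path):
    """Return the lm_eval argument list for one task (hypothetical helper)."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", MODEL_ARGS,
        "--tasks", task,
        "--num_fewshot", str(num_fewshot),
        "--batch_size", str(batch_size),
        "--device", "cuda:0",
        "--output_path", output_path,
    ]


for task, (shots, batch) in EVALS.items():
    # "/content/run" is a placeholder folder, not the timestamped one created above.
    cmd = build_lm_eval_command(task, shots, batch, f"/content/run/{task}_lm_eval.json")
    print(shlex.join(cmd))
```

The loop only prints the commands. To execute one, the list could be passed to `subprocess.run(cmd, check=True)`.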

# ARC Challenge

AI2 Reasoning Challenge (25-shot) - a set of grade-school science questions.
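
"25-shot" means each test question is preceded by 25 solved examples in the prompt. The sketch below illustrates the idea only: `build_few_shot_prompt` is a hypothetical helper, the questions are made up rather than taken from ARC, and the harness uses its own per-task prompt templates that differ from this format.

```python
def build_few_shot_prompt(solved_examples, question, k):
    """Join the first k solved (question, answer) pairs, then the unanswered question."""
    shots = [f"Question: {q}\nAnswer: {a}" for q, a in solved_examples[:k]]
    shots.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(shots)


# Made-up examples for illustration; not drawn from the ARC dataset.
examples = [
    ("Which gas do plants absorb from the air?", "Carbon dioxide"),
    ("What force pulls objects toward Earth?", "Gravity"),
]
prompt = build_few_shot_prompt(
    examples, "What is the boiling point of water in Celsius?", k=2
)
print(prompt)
```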


In [None]:
!lm_eval --model hf \
    --model_args pretrained=mistralai/Mixtral-8x7B-v0.1,dtype="bfloat16",peft=dfurman/Mixtral-8x7B-Instruct-v0.1,load_in_4bit=True \
    --tasks arc_challenge \
    --num_fewshot 25 \
    --batch_size 8 \
    --device cuda:0 \
    --output_path /content/$now_log_folder/arc/arc_challenge_formatted_lm_eval.json

In [None]:
#!lm_eval --model hf \
#    --model_args pretrained=mistralai/Mixtral-8x7B-v0.1,dtype="bfloat16",peft=dfurman/Mixtral-8x7B-Instruct-v0.1,load_in_4bit=True \
#    --tasks arc_challenge \
#    --batch_size 8 \
#    --write_out \
#    --output_path /content/$now_log_folder/arc/arc_challenge_formatted_lm_eval.json \
#    --device cuda:0 \
#    --num_fewshot 25 \
#    --verbosity DEBUG

# HellaSwag

HellaSwag (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.


In [None]:
!lm_eval --model hf \
    --model_args pretrained=mistralai/Mixtral-8x7B-v0.1,dtype="bfloat16",peft=dfurman/Mixtral-8x7B-Instruct-v0.1,load_in_4bit=True \
    --tasks hellaswag \
    --batch_size 8 \
    --output_path /content/$now_log_folder/hellaswag/hellaswag_mistralai_lm_eval.json \
    --device cuda:0 \
    --num_fewshot 10

# MMLU

MMLU (5-shot) - a test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.


In [None]:
!lm_eval --model hf \
    --model_args pretrained=mistralai/Mixtral-8x7B-v0.1,dtype="bfloat16",peft=dfurman/Mixtral-8x7B-Instruct-v0.1,load_in_4bit=True \
    --tasks mmlu \
    --batch_size 2 \
    --output_path /content/$now_log_folder/mmlu/mmlu_lm_eval.json \
    --device cuda:0 \
    --num_fewshot 5

# TruthfulQA

TruthfulQA (0-shot) - a test to measure a model's propensity to reproduce falsehoods commonly found online. Note: TruthfulQA in the Harness is in practice at least a 6-shot task, because 6 examples are always prepended to the prompt, even when the number of few-shot examples is set to 0.

In [None]:
!lm_eval --model hf \
    --model_args pretrained=mistralai/Mixtral-8x7B-v0.1,dtype="bfloat16",peft=dfurman/Mixtral-8x7B-Instruct-v0.1,load_in_4bit=True \
    --tasks truthfulqa_mc2 \
    --batch_size 16 \
    --output_path /content/$now_log_folder/truthfulqa/truthfulqa_mc2_lm_eval.json \
    --device cuda:0 \
    --num_fewshot 0


# Winogrande

Winogrande (5-shot) - an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning.



In [None]:
!lm_eval --model hf \
    --model_args pretrained=mistralai/Mixtral-8x7B-v0.1,dtype="bfloat16",peft=dfurman/Mixtral-8x7B-Instruct-v0.1,load_in_4bit=True \
    --tasks winogrande \
    --batch_size 16 \
    --output_path /content/$now_log_folder/winogrande/winogrande_lm_eval.json \
    --device cuda:0 \
    --num_fewshot 5

# GSM8k

GSM8k (5-shot) - diverse grade school math word problems to measure a model's ability to solve multi-step mathematical reasoning problems.

In [None]:

!lm_eval --model hf \
    --model_args pretrained=mistralai/Mixtral-8x7B-v0.1,dtype="bfloat16",peft=dfurman/Mixtral-8x7B-Instruct-v0.1,load_in_4bit=True \
    --tasks gsm8k \
    --batch_size 8 \
    --output_path /content/$now_log_folder/gsm8k/gsm8k_lm_eval.json \
    --device cuda:0 \
    --num_fewshot 5
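
Once all six runs have finished, the per-task scores can be combined into a single leaderboard-style number. The sketch below assumes that number is the unweighted mean of the benchmark scores, and that each parsed output file has the layout `{"results": {task: {metric_key: value}}}`. Metric key names such as `"acc,none"` depend on the harness version and should be checked against the actual files. `pick_metric` and `leaderboard_average` are hypothetical helpers, and the numbers are placeholders, not measured results.

```python
from statistics import mean


def pick_metric(results_json, task, metric_key):
    """Return one metric for one task from a parsed lm_eval results dict.

    Assumes the layout {"results": {task: {metric_key: value}}}; the metric
    key names are version-dependent and must be checked in the output file.
    """
    return results_json["results"][task][metric_key]


def leaderboard_average(scores):
    """Unweighted mean of per-benchmark scores (0-1 scale), as a percentage."""
    return 100 * mean(scores.values())


# Placeholder numbers for illustration only; these are not measured results.
fake_file = {"results": {"winogrande": {"acc,none": 0.5}}}
scores = {
    "winogrande": pick_metric(fake_file, "winogrande", "acc,none"),
    "gsm8k": 0.25,
}
print(leaderboard_average(scores))
```

With real runs, each file written to `--output_path` would be loaded with `json.load` and passed to `pick_metric` in place of `fake_file`.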