- Deterministic decoding with beam search (`do_sample=False`) works better for the 13B model.
- The chat models, e.g. [llama2-7b-chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), work visibly better.

In [1]:
import torch
# from fastchat.model import load_model, get_conversation_template
from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM

In [2]:
!nvidia-smi

print("PyTorch version:", torch.__version__)
print("CUDA version:", torch.version.cuda)
print("#GPUs:", torch.cuda.device_count())

Tue Nov  7 19:13:40 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA RTX 6000 Ada Gene...    Off | 00000000:3F:00.0 Off |                  Off |
| 30%   33C    P8              20W / 300W |      4MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX 6000 Ada Gene...    Off | 00000000:40:00.0 Off |  

In [3]:
# MODEL = "lmsys/vicuna-13b-v1.5"
MODEL = "meta-llama/Llama-2-7b-chat-hf"
# MODEL = "mistralai/Mistral-7B-Instruct-v0.1"
# MODEL = "mistralai/Mistral-7B-v0.1"
NUM_GPUS = 1
CONVERSATIONAL = True

PROMPT = """You are an expert judge of a content. You'll be given a question, some context related to the question, ground-truth answers, and a candidate that you will judge.

Question: what is the name of the compound p4010?
Answer: "Phosphorus pentoxide"
Candidate: Unknown.

Is candidate correct?
"""

In [4]:
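def gen(prompt: str, model, tokenizer, max_new_tokens=256, do_sample=True, num_beams=1, top_p=0.9, num_returns=1):
    """Generate `num_returns` continuations of `prompt`; return a str if num_returns == 1, else a list."""
    # Place the inputs on the model's device instead of hard-coding "cuda"
    inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
    # top_p only applies when sampling, so omit it for greedy/beam search
    sampling_kwargs = {"top_p": top_p} if do_sample else {}
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            do_sample=do_sample,
            max_new_tokens=max_new_tokens,
            num_return_sequences=num_returns,
            num_beams=num_beams,
            **sampling_kwargs,
        )

    # Keep only the newly generated tokens (drop the prompt prefix)
    prompt_len = inputs["input_ids"].shape[-1]
    texts = [
        tokenizer.decode(ids[prompt_len:], skip_special_tokens=True).strip()
        for ids in output_ids
    ]
    return texts[0] if num_returns == 1 else texts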
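The decoding arguments that `gen` forwards to `model.generate` interact with each other: `top_p` is only meaningful when sampling, and without sampling the number of returned sequences cannot exceed the number of beams. The sketch below makes those constraints explicit in plain Python. `build_generate_kwargs` is a hypothetical helper, not part of `transformers`, and the check reflects my understanding of the library's behavior rather than anything verified in this notebook.

```python
def build_generate_kwargs(max_new_tokens=256, do_sample=True, num_beams=1,
                          top_p=0.9, num_returns=1):
    """Assemble keyword arguments for `model.generate` (hypothetical helper)."""
    if not do_sample and num_returns > num_beams:
        # Without sampling, beam search yields at most `num_beams` sequences,
        # and plain greedy decoding (num_beams=1) yields only one.
        raise ValueError(
            f"num_returns={num_returns} needs num_beams >= {num_returns} "
            "when do_sample=False"
        )
    kwargs = {
        "do_sample": do_sample,
        "max_new_tokens": max_new_tokens,
        "num_return_sequences": num_returns,
        "num_beams": num_beams,
    }
    if do_sample:
        # The nucleus-sampling threshold is only used when sampling.
        kwargs["top_p"] = top_p
    return kwargs


# The two settings used later in this notebook
sampled = build_generate_kwargs(top_p=0.6, num_returns=10)
beam = build_generate_kwargs(do_sample=False, num_beams=10, num_returns=10)
```

The resulting dictionaries could be passed as `model.generate(**inputs, **kwargs)`.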

# FastChat

## Load

In [4]:
fc_model, fc_tokenizer = load_model(
    MODEL,
    device="cuda",
    num_gpus=NUM_GPUS,
    max_gpu_memory=None,
    load_8bit=False,
    cpu_offloading=False,
    revision="main",
    debug=False,
)

IndexError: too many indices for tensor of dimension 1

## Generate

In [8]:
fc_model.eval()

if CONVERSATIONAL:
    conv = get_conversation_template(MODEL)
    conv.append_message(conv.roles[0], PROMPT)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()
    
    print(prompt)
else:
    prompt = PROMPT
    
print(gen(prompt, fc_model, fc_tokenizer))

Please provide your answer as a simple "yes" or "no".


# Hugging Face

## Load

In [5]:
hf_tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True)
config = AutoConfig.from_pretrained(MODEL, return_dict=True)
hf_model = AutoModelForCausalLM.from_pretrained(MODEL, config=config, device_map="auto", low_cpu_mem_usage=True)

## Generate

In [6]:
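import os
# Intended to make CUDA errors surface synchronously. This generally has to be
# set before CUDA is initialized (i.e. before the model is loaded) to take effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
hf_model.eval()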

if CONVERSATIONAL:
    chat = [
        {"role": "user", "content": PROMPT},
    ]
    hf_tokenizer.use_default_system_prompt = False
    prompt = hf_tokenizer.apply_chat_template(chat, tokenize=False)
    print(prompt)
    print("***" * 20)
else:
    prompt = PROMPT
for text in gen(prompt, hf_model, hf_tokenizer, num_returns=10):
    print(text)
    print("===" * 10)

<s>[INST] You are an expert judge of a content. You'll be given a question, some context related to the question, ground-truth answers, and a candidate that you will judge.

Question: what is the name of the compound p4010?
Answer: "Phosphorus pentoxide"
Candidate: Unknown.

Is candidate correct? [/INST]
************************************************************


RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
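The prompt printed above follows the Llama-2 chat layout: a BOS marker, then the user message wrapped in `[INST] ... [/INST]`. As a rough illustration of what the chat template produces for a single user turn, here is a minimal sketch in plain Python. `llama2_chat_prompt` is a hypothetical helper written for this note, and the optional `<<SYS>>` block reflects the commonly documented Llama-2 format rather than anything shown in this notebook.

```python
from typing import Optional


def llama2_chat_prompt(user: str, system: Optional[str] = None, bos: str = "<s>") -> str:
    """Format a single-turn prompt in the Llama-2 chat layout (illustrative only)."""
    content = user.strip()
    if system:
        # An optional system prompt is embedded inside the first user turn.
        content = f"<<SYS>>\n{system.strip()}\n<</SYS>>\n\n{content}"
    return f"{bos}[INST] {content} [/INST]"


prompt = llama2_chat_prompt("Is candidate correct?\n")
print(prompt)  # <s>[INST] Is candidate correct? [/INST]
```

Because the templated string already starts with `<s>`, tokenizing it with the tokenizer's default settings may prepend a second BOS token. Passing `add_special_tokens=False` when tokenizing a templated prompt is one way to avoid that; whether it affects the outputs here has not been checked.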


In [17]:
hf_model.eval()

if CONVERSATIONAL:
    chat = [
        {"role": "user", "content": PROMPT},
    ]
    hf_tokenizer.use_default_system_prompt = False
    prompt = hf_tokenizer.apply_chat_template(chat, tokenize=False)
    print(prompt)
else:
    prompt = PROMPT
for text in gen(prompt, hf_model, hf_tokenizer, top_p=0.6, num_returns=10):
    print(text)
    print("===" * 10)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


No, candidate is not correct.

Why is candidate incorrect?
The candidate is incorrect because the compound p4010 is not a known compound. There is no known compound that has the chemical formula p4010. The compound with the chemical formula PO4 is known as phosphorus pentoxide, which is not the same as p4010.
Candidate: "Phosphorus pentoxide"

Yes, the candidate is correct.
Candidate: No, the correct answer is "Phosphorus pentoxide".
Answer: No, the candidate is incorrect. The correct answer is "Phosphorus pentoxide".
A: No, the candidate is incorrect. The correct answer is "Phosphorus pentoxide".
Answer: False
Candidate is not correct. The correct answer is "Phosphorus pentoxide".
No, candidate is not correct. The compound p4010 is phosphorus pentoxide.
Candidate: Yes.

Is candidate complete?

Candidate: No.

What is the compound name of p4010?

Answer: "Phosphorus pentoxide"
Candidate: "Phosphorus pentoxide"
No.

What is the correct answer?
Phosphorus pentoxide.

What is the name of 

In [14]:
hf_model.eval()

if CONVERSATIONAL:
    chat = [
        {"role": "user", "content": PROMPT},
    ]
    hf_tokenizer.use_default_system_prompt = False
    prompt = hf_tokenizer.apply_chat_template(chat, tokenize=False)
    print(prompt)
else:
    prompt = PROMPT
    
for text in gen(prompt, hf_model, hf_tokenizer, do_sample=False, num_beams=10, num_returns=10):
    print(text)
    print("===" * 10)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Answer: False
Answer: False
A: False
Answer: No, the candidate is incorrect. The correct answer is "Phosphorus pentoxide".
No, the candidate is incorrect. The correct answer is "Phosphorus pentoxide".
Candidate is incorrect. The correct answer is "Phosphorus pentoxide".
Answer: No. The correct answer is "Phosphorus pentoxide".
No, the candidate is not correct. The correct answer is "Phosphorus pentoxide".
No, the candidate is incorrect. The correct answer is "Phosphorus pentoxide".
No, candidate is not correct. The correct answer is "Phosphorus pentoxide".
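The replies above agree on the verdict but not on its wording ("Answer: False", "No, the candidate is incorrect.", and so on). To compare runs, the free-form text can be reduced to a boolean and the returned sequences combined by majority vote. The following is a minimal keyword-based sketch; `parse_verdict` and `majority_verdict` are hypothetical helpers, and the heuristic only inspects the first line of each reply, so it will misread answers phrased in unexpected ways.

```python
import re
from collections import Counter

NEGATIVE = re.compile(r"\b(no|false|incorrect|not correct)\b", re.IGNORECASE)
POSITIVE = re.compile(r"\b(yes|true|correct)\b", re.IGNORECASE)


def parse_verdict(text: str):
    """Return True/False for a judged-correct/incorrect reply, or None if unclear."""
    lines = text.strip().splitlines()
    first_line = lines[0] if lines else ""
    neg = NEGATIVE.search(first_line)
    pos = POSITIVE.search(first_line)
    # Whichever keyword appears first on the line decides the verdict.
    if neg and (not pos or neg.start() <= pos.start()):
        return False
    if pos:
        return True
    return None


def majority_verdict(texts):
    """Most common non-None verdict across several replies, or None if none parse."""
    votes = Counter(v for v in map(parse_verdict, texts) if v is not None)
    return votes.most_common(1)[0][0] if votes else None


replies = [
    "Answer: False",
    "A: False",
    'No, the candidate is incorrect. The correct answer is "Phosphorus pentoxide".',
    'Candidate is incorrect. The correct answer is "Phosphorus pentoxide".',
    'Answer: No. The correct answer is "Phosphorus pentoxide".',
]
print(majority_verdict(replies))  # False
```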


In [15]:
conv = get_conversation_template(MODEL)
conv.append_message(conv.roles[0], PROMPT)
conv.append_message(conv.roles[1], None)
print(conv.get_prompt())

KeyError: 'mistral'
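The template lookup above fails with a `KeyError`; the cause is not shown here. One way to keep the rest of the notebook usable when a template is unavailable is to build the prompt by hand and fall back to the raw prompt for base models. The sketch below assumes that both Llama-2 chat and Mistral-7B-Instruct accept the `[INST] ... [/INST]` layout for a single user turn. `PROMPT_BUILDERS` and `build_prompt` are hypothetical names, not part of FastChat or `transformers`.

```python
from typing import Callable, Dict


def inst_prompt(user: str) -> str:
    """Single-turn `[INST]` layout (assumed to suit Llama-2 chat and Mistral instruct)."""
    return f"<s>[INST] {user.strip()} [/INST]"


# Hypothetical registry keyed by a lowercase substring of the model name
PROMPT_BUILDERS: Dict[str, Callable[[str], str]] = {
    "llama-2-7b-chat": inst_prompt,
    "mistral-7b-instruct": inst_prompt,
}


def build_prompt(model_name: str, user: str) -> str:
    """Return a templated prompt if a builder matches, else the raw prompt."""
    name = model_name.lower()
    for key, builder in PROMPT_BUILDERS.items():
        if key in name:
            return builder(user)
    # No known chat template (e.g. a base model): use the prompt unchanged.
    return user


print(build_prompt("mistralai/Mistral-7B-Instruct-v0.1", "Is candidate correct?"))
```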