In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import lm_eval
from lm_eval.models.huggingface import HFLM

In [2]:
device = "cuda" if torch.cuda.is_available() else "cpu"

In [3]:
# `device` was selected in the previous cell; device_map="auto" below decides where the weights are placed

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-MoE-A2.7B-Chat",
    dtype="auto",
    device_map="auto",
    trust_remote_code=False
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-MoE-A2.7B-Chat")

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
# Move the inputs to wherever device_map="auto" placed the model's first parameters
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
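The warning above is raised because `generate` received only `input_ids`. A minimal sketch of the same generation step with the attention mask passed explicitly follows. The helper name `generate_reply` is hypothetical and not part of the notebook; it assumes `model` and `tokenizer` are the objects loaded in the cell above and that `text` is the chat-templated prompt.

```python
def generate_reply(model, tokenizer, text, max_new_tokens=512):
    """Generate a completion for `text`, passing the attention mask explicitly.

    Hypothetical helper: `model` and `tokenizer` are assumed to be the
    Hugging Face objects created earlier in the notebook.
    """
    # The tokenizer returns both input_ids and attention_mask.
    inputs = tokenizer([text], return_tensors="pt").to(model.device)

    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # avoids the pad/eos ambiguity
        max_new_tokens=max_new_tokens,
    )

    # Drop the prompt tokens so only the newly generated part is decoded.
    new_tokens = [
        out[len(inp):] for inp, out in zip(inputs["input_ids"], output_ids)
    ]
    return tokenizer.batch_decode(new_tokens, skip_special_tokens=True)[0]
```

With the notebook's objects, this would be called as `response = generate_reply(model, tokenizer, text)`.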


In [4]:
print(response)

A large language model is a type of artificial intelligence that is designed to understand and generate human-like language. These models are typically very large in terms of their size, often consisting of billions of parameters, and are trained on vast amounts of text data in order to learn the patterns and structures of language.

Large language models are used in a variety of applications, including natural language processing (NLP) tasks such as language translation, sentiment analysis, and chatbot development. They can also be used for text generation, such as generating creative writing or summarizing long documents.

One of the key advantages of large language models is their ability to handle complex language tasks, as they have been trained on a wide range of language data and have learned to understand the nuances of language. However, this also means that they can sometimes produce unexpected or nonsensical output, which requires careful monitoring and tuning when they are 

In [5]:
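# Wrap the already-loaded model for lm_eval; pass the matching tokenizer explicitly
eval_model = HFLM(
    pretrained=model,
    tokenizer=tokenizer,
    device=device,
    batch_size=8
)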

# Choose tasks (HellaSwag for commonsense sentence completion, ARC-Easy for grade-school science questions)
tasks = ["hellaswag", "arc_easy"]

# Run evaluation
results = lm_eval.simple_evaluate(
    model=eval_model,
    tasks=tasks,
    num_fewshot=0,  # Zero-shot evaluation
    limit=1000      # Evaluate only the first 1000 examples per task (quick test, not a full-benchmark score)
)

# Print results
for task, metrics in results["results"].items():
    print(f"{task}: {metrics}")

`pretrained` model kwarg is not of type `str`. Many other model arguments may be ignored. Please do not launch via accelerate or use `parallelize=True` if passing an existing model this way.
Passed an already-initialized model through `pretrained`, assuming single-process call to evaluate() or custom distributed integration
Overwriting default num_fewshot of arc_easy from None to 0
Overwriting default num_fewshot of hellaswag from None to 0
100%|██████████████████████████████████████| 1000/1000 [00:02<00:00, 420.53it/s]
100%|█████████████████████████████████████| 1000/1000 [00:00<00:00, 1336.37it/s]
Running loglikelihood requests: 100%|███████| 7997/7997 [22:27<00:00,  5.93it/s]


arc_easy: {'alias': 'arc_easy', 'acc,none': 0.703, 'acc_stderr,none': 0.014456832294801103, 'acc_norm,none': 0.679, 'acc_norm_stderr,none': 0.014770821817934638}
hellaswag: {'alias': 'hellaswag', 'acc,none': 0.515, 'acc_stderr,none': 0.015812179641814895, 'acc_norm,none': 0.669, 'acc_norm_stderr,none': 0.014888272588203941}
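To read the numbers above, it helps to turn each accuracy and its standard error into an interval. The sketch below copies the `acc` values from the output and assumes n = 1000 examples per task (from `limit=1000`). The `(n - 1)` formula is an assumption about how the standard error is computed; it reproduces the reported `acc_stderr` values but is not taken from the harness source. The helper names are hypothetical.

```python
import math

# Values copied from the evaluation output above; n matches limit=1000.
n = 1000
reported = {
    "arc_easy": {"acc": 0.703, "acc_stderr": 0.014456832294801103},
    "hellaswag": {"acc": 0.515, "acc_stderr": 0.015812179641814895},
}

def binomial_stderr(p, n):
    """Standard error of the mean of n 0/1 outcomes (sample variance, n - 1)."""
    return math.sqrt(p * (1 - p) / (n - 1))

def normal_ci95(p, se):
    """Approximate 95% confidence interval under a normal approximation."""
    return (p - 1.96 * se, p + 1.96 * se)

summary = {}
for task, m in reported.items():
    se = binomial_stderr(m["acc"], n)
    low, high = normal_ci95(m["acc"], se)
    summary[task] = {"stderr": se, "ci95": (low, high)}
    print(f"{task}: acc={m['acc']:.3f} +/- {se:.4f}  95% CI=({low:.3f}, {high:.3f})")
```

With 1000 examples per task, each interval spans roughly three percentage points on either side of the accuracy, so small differences between runs at this sample size should be read with caution.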
