Quantized mistral model on Inf2 with Neuron? #856
Comments
Hi @jpaye,

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
from transformers_neuronx import (
    NeuronAutoModelForCausalLM,
    NeuronConfig,
    QuantizationConfig,
    HuggingFaceGenerationModelAdapter,
)

name = 'mistralai/Mistral-7B-Instruct-v0.2'


def load_neuron_int8():
    config = AutoConfig.from_pretrained(name)
    model = NeuronAutoModelForCausalLM.from_pretrained(
        name,
        tp_degree=2,
        amp='bf16',
        n_positions=[256],  # Limited seqlen for faster compilation
        neuron_config=NeuronConfig(
            quant=QuantizationConfig(
                quant_dtype='s8',
                dequant_dtype='bf16'
            )
        )
    )
    model.to_neuron()
    return HuggingFaceGenerationModelAdapter(config, model)


def load_cpu_bf16():
    return AutoModelForCausalLM.from_pretrained(name)


def infer(model):
    tokenizer = AutoTokenizer.from_pretrained(name)
    prompt = "[INST] What is your favourite condiment? [/INST]"
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    output = model.generate(input_ids, top_k=1, max_new_tokens=256 - input_ids.shape[1])
    print('Output:')
    print(tokenizer.decode(output[0]))


if __name__ == '__main__':
    infer(load_neuron_int8())
    infer(load_cpu_bf16())
```

You’ll notice that the Neuron quantized int8 version and the CPU bf16 version produce slightly different greedy results due to precision loss:

Neuron Output:

CPU Output:

We will look at updating the documentation for clarity.
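For intuition, here is an illustrative sketch in plain PyTorch (not transformers-neuronx internals) of what int8 weight storage does and why the dequantized weights no longer match the original values exactly, which is what shifts the greedy outputs slightly:

```python
import torch

# Illustrative only: weights are stored as s8 plus a per-row scale and
# dequantized back to bf16 before use; the rounding step is the precision loss.
def quantize_weight_s8(w: torch.Tensor):
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0  # per-output-channel scale
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_to_bf16(q: torch.Tensor, scale: torch.Tensor):
    return (q.to(torch.float32) * scale).to(torch.bfloat16)

w = torch.randn(8, 16)
q, scale = quantize_weight_s8(w)
w_hat = dequantize_to_bf16(q, scale).to(torch.float32)
print('max abs reconstruction error:', (w - w_hat).abs().max().item())
```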
@jluntamazon thank you very much for the help! Attempting to run the code on

Will update!
@jluntamazon thanks again for the help! I got through the above issue, but now debugging the below: I hit this when attempting to save the quantized model with

Will keep working on it, just posting in case it's an issue that's familiar to you.
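In case it helps others hitting the same step: a minimal sketch of one common save/reload pattern, assuming the `save_pretrained_split` helper from `transformers_neuronx.module`. The failing call isn't shown above, so treat this as one possible workflow rather than a fix for that specific error; the directory name is made up.

```python
from transformers import AutoModelForCausalLM
from transformers_neuronx.module import save_pretrained_split

name = 'mistralai/Mistral-7B-Instruct-v0.2'
split_dir = './mistral-7b-instruct-split'  # hypothetical local path

# Save the float checkpoint once, split into per-layer files, so later Neuron
# loads can read from local disk; quantization is still applied at load time
# through NeuronConfig(quant=...), as in the snippet above.
cpu_model = AutoModelForCausalLM.from_pretrained(name)
save_pretrained_split(cpu_model, split_dir)
```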
@jluntamazon just an update -- I was able to work out the issues and did get this to work! However, I didn't really see the performance bump I would have expected -- in my testing the inference wasn't faster than the non-quantized model (on inf2.xlarge). Wondering if that's expected? I had been hoping that I'd see lower inference latency.
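For what it's worth, a minimal timing sketch (not from this thread) that could help pin down whether the quantized path is actually faster; it reuses `load_neuron_int8()` and the `generate()` call from the snippet above, so the same assumptions apply:

```python
import time
from transformers import AutoTokenizer

name = 'mistralai/Mistral-7B-Instruct-v0.2'

def time_generate(model, tokenizer, prompt, max_new_tokens=128):
    # Greedy decode with the same settings as the earlier snippet; report
    # wall-clock time plus tokens/second for the newly generated tokens.
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    start = time.perf_counter()
    output = model.generate(input_ids, top_k=1, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    new_tokens = output.shape[1] - input_ids.shape[1]
    return elapsed, new_tokens / elapsed

tokenizer = AutoTokenizer.from_pretrained(name)
model = load_neuron_int8()  # from the earlier snippet; swap in the non-quantized loader to compare

# Warm-up call so one-time setup cost doesn't skew the measurement.
time_generate(model, tokenizer, "[INST] Warm up [/INST]", max_new_tokens=8)

secs, tps = time_generate(model, tokenizer, "[INST] What is your favourite condiment? [/INST]")
print(f'{secs:.2f}s total, {tps:.1f} tokens/s')
```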
I'm trying to quantize a Mistral (7B) model to run with aws-neuron on an Inf2 instance.

It seems like the int8 weight storage feature is what I'm looking for, but it seems it is not yet supported for Mistral.

Just wondering if there are any options currently for running a quantized Mistral model on Inf2, or if I should just wait for this feature to be released? And if so, do we know when it will be?

Thank you!