Quantized mistral model on Inf2 with Neuron? #856
Comments
Hi @jpaye,

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
from transformers_neuronx import (
    NeuronAutoModelForCausalLM,
    NeuronConfig,
    QuantizationConfig,
    HuggingFaceGenerationModelAdapter,
)

name = 'mistralai/Mistral-7B-Instruct-v0.2'


def load_neuron_int8():
    config = AutoConfig.from_pretrained(name)
    model = NeuronAutoModelForCausalLM.from_pretrained(
        name,
        tp_degree=2,
        amp='bf16',
        n_positions=[256],  # Limited seqlen for faster compilation
        neuron_config=NeuronConfig(
            quant=QuantizationConfig(
                quant_dtype='s8',
                dequant_dtype='bf16'
            )
        )
    )
    model.to_neuron()
    return HuggingFaceGenerationModelAdapter(config, model)


def load_cpu_bf16():
    return AutoModelForCausalLM.from_pretrained(name)


def infer(model):
    tokenizer = AutoTokenizer.from_pretrained(name)
    prompt = "[INST] What is your favourite condiment? [/INST]"
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    output = model.generate(input_ids, top_k=1, max_new_tokens=256 - input_ids.shape[1])
    print('Output:')
    print(tokenizer.decode(output[0]))


if __name__ == '__main__':
    infer(load_neuron_int8())
    infer(load_cpu_bf16())
```

You’ll notice that the Neuron quantized int8 version and the CPU bf16 version produce slightly different greedy results due to precision loss:

Neuron Output:

CPU Output:

We will look at updating the documentation for clarity.
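For intuition, here is an illustrative sketch in plain PyTorch (not transformers-neuronx internals) of what int8 weight storage does and why the dequantized weights no longer match the original values exactly, which is what shifts the greedy outputs slightly:

```python
import torch

# Illustrative only: weights are stored as s8 plus a per-row scale and
# dequantized back to bf16 before use; the rounding step is the precision loss.
def quantize_weight_s8(w: torch.Tensor):
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0  # per-output-channel scale
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_to_bf16(q: torch.Tensor, scale: torch.Tensor):
    return (q.to(torch.float32) * scale).to(torch.bfloat16)

w = torch.randn(8, 16)
q, scale = quantize_weight_s8(w)
w_hat = dequantize_to_bf16(q, scale).to(torch.float32)
print('max abs reconstruction error:', (w - w_hat).abs().max().item())
```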
@jluntamazon thank you very much for the help! Attempting to run the code on

Will update!
@jluntamazon thanks again for the help! I got through the above issue, but now debugging the below: I hit this when attempting to save the quantized model with

Will keep working on it, just posting in case it's an issue that's familiar to you.
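In case it helps others hitting the same step: a minimal sketch of one common save/reload pattern, assuming the `save_pretrained_split` helper from `transformers_neuronx.module`. The failing call isn't shown above, so treat this as one possible workflow rather than a fix for that specific error; the directory name is made up.

```python
from transformers import AutoModelForCausalLM
from transformers_neuronx.module import save_pretrained_split

name = 'mistralai/Mistral-7B-Instruct-v0.2'
split_dir = './mistral-7b-instruct-split'  # hypothetical local path

# Save the float checkpoint once, split into per-layer files, so later Neuron
# loads can read from local disk; quantization is still applied at load time
# through NeuronConfig(quant=...), as in the snippet above.
cpu_model = AutoModelForCausalLM.from_pretrained(name)
save_pretrained_split(cpu_model, split_dir)
```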
@jluntamazon just an update -- I was able to work out the issues and did get this to work! However, I didn't really see the performance bump I would have expected -- in my testing the inference wasn't faster than the non-quantized model (on inf2.xlarge). Wondering if that's expected? I had been hoping that I'd see lower inference latency.
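For what it's worth, a minimal timing sketch (not from this thread) that could help pin down whether the quantized path is actually faster; it reuses `load_neuron_int8()` and the `generate()` call from the snippet above, so the same assumptions apply:

```python
import time
from transformers import AutoTokenizer

name = 'mistralai/Mistral-7B-Instruct-v0.2'

def time_generate(model, tokenizer, prompt, max_new_tokens=128):
    # Greedy decode with the same settings as the earlier snippet; report
    # wall-clock time plus tokens/second for the newly generated tokens.
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    start = time.perf_counter()
    output = model.generate(input_ids, top_k=1, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    new_tokens = output.shape[1] - input_ids.shape[1]
    return elapsed, new_tokens / elapsed

tokenizer = AutoTokenizer.from_pretrained(name)
model = load_neuron_int8()  # from the earlier snippet; swap in the non-quantized loader to compare

# Warm-up call so one-time setup cost doesn't skew the measurement.
time_generate(model, tokenizer, "[INST] Warm up [/INST]", max_new_tokens=8)

secs, tps = time_generate(model, tokenizer, "[INST] What is your favourite condiment? [/INST]")
print(f'{secs:.2f}s total, {tps:.1f} tokens/s')
```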
I'm trying to quantize a Mistral (7B) model to run with aws-neuron on an Inf2 instance.

It seems like the int8 weight storage feature is what I'm looking for, but it seems it is not yet supported for Mistral.

Just wondering if there are any options currently for running a quantized Mistral model on Inf2, or if I should just wait for this feature to be released? And if so, do we know when it will be?

Thank you!