
Quantization Time #28

Closed
DRXD1000 opened this issue Feb 24, 2024 · 13 comments

@DRXD1000

How long is the expected time to quantize a 7B Mistral model?

@Vahe1994
Owner

Hello!
On 2 GPUs it would take approximately 6–10 hours; it depends on the hyperparameters.

@iamwavecut

iamwavecut commented Feb 26, 2024

I tried to quantize another Mistral variant last week, but it was still working on layer 0 after about 4 hours on an 8 x 4090 NVLinked EPYC-class server, so I aborted it due to the projected costs.

Is it normal for it to be this slow with the mentioned configuration? I'm just trying to understand the optimal way of doing this.

@DRXD1000
Author

Hello!
On 2 GPUs it would take approximately 6–10 hours; it depends on the hyperparameters.

Thanks for the fast response. I use the exact same hyperparameters as your example:
export CUDA_VISIBLE_DEVICES=0   # or e.g. 0,1,2,3
export MODEL_PATH=<PATH_TO_MODEL_ON_HUB>
export DATASET_PATH=
export SAVE_PATH=/path/to/save/quantized/model/
export WANDB_PROJECT=MY_AQ_EXPS
export WANDB_NAME=COOL_EXP_NAME

python main.py $MODEL_PATH $DATASET_PATH --nsamples=1024 \
 --num_codebooks=1 --nbits_per_codebook=16 --in_group_size=8 \
 --relative_mse_tolerance=0.01 --finetune_relative_mse_tolerance=0.001 \
 --finetune_batch_size=32 --local_batch_size=1 --offload_activations \
 --wandb --save $SAVE_PATH

And after almost 24 hours on 2 A40 48 GB GPUs, the script was only on layer 8.

@iamwavecut

@Vahe1994, please take a look.

@Vahe1994
Owner

Thank you for reporting this!
I haven't experimented with quantization on a 4090 (by the way, 8 x 4090s might be overkill) or on an A40, but the processing time appears to be unusually slow. It's possible that recent code changes have caused this slowdown in the quantization process, though I'm not certain. I'll have a look at this and provide an update if I discover anything.

@DRXD1000
Author

If you need more details or a test after code updates, I will be happy to help.

@iamwavecut

+1

@Vahe1994
Owner

Hi!
Unfortunately, I could not obtain access to either a 4090 or an A40, so I conducted several experiments on A100s.
I tried to quantize Llama-2 7B with the provided parameters, both on the recent commit and on a commit from a month ago; both gave a full quantization time of 14.5 hours on 2 A100s, including ppl evaluation. Then I tried quantizing the Mistral-7B model on the recent commit with 2 A100s and got a slightly longer but acceptable time of around 17 hours.
Note that quantization with 16-bit codebooks is much slower than with smaller codebooks, and the quantization time also depends on the relative tolerance.

Can you please try quantizing with the same config, but on one GPU, reducing the number of samples proportionally to fit in memory? This is necessary to understand whether the problem is related to inter-GPU communication or to local computation.
For instance, I once encountered such a problem with a faulty GPU-to-GPU PCIe bus.
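
As a minimal sketch of that single-GPU test (assuming the same environment variables as in the command above; the halved --nsamples=512 is only an illustrative way to fit into a single card's memory, not a recommended value):

# Single visible GPU; all other flags kept identical to the reference command.
export CUDA_VISIBLE_DEVICES=0
python main.py $MODEL_PATH $DATASET_PATH --nsamples=512 \
 --num_codebooks=1 --nbits_per_codebook=16 --in_group_size=8 \
 --relative_mse_tolerance=0.01 --finetune_relative_mse_tolerance=0.001 \
 --finetune_batch_size=32 --local_batch_size=1 --offload_activations \
 --wandb --save $SAVE_PATH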


@iamwavecut

Now I'm trying with a single H100 / Intel Xeon Platinum 8468 (192) / 2 TB RAM, and I'm getting

Saving layer 0... to model-name-masked/0.pth                                                                  
{'layer_time': 3053.6494579315186, 'out_loss': 0.04229441657662392, 'Step': 0}

or about 50 min per layer.

So, if I extrapolate, I predict ~26.5 hours of quantizing, and that's... not expected :(

Launch flags:

python main.py $MODEL_NAME $DATASET_NAME --nsamples=1024 --num_codebooks=1 \
 --nbits_per_codebook=16 --in_group_size=8 --local_batch_size=1 \
 --save ${MODEL_NAME}-AQLM --dtype bfloat16 --beam_size 1 --max_epochs 100 \
 --relative_mse_tolerance=0.01 --finetune_max_epochs 0 --offload_activations
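
For reference, the extrapolation above is just the per-layer time multiplied by the number of decoder layers (32 is assumed here for a Mistral-7B-class model):

# Back-of-envelope projection from the logged layer_time (seconds) and the rounded 50 min/layer.
awk 'BEGIN { printf "%.1f h\n", 3053.65 * 32 / 3600 }'   # ~27.1 h
awk 'BEGIN { printf "%.1f h\n", 50 * 32 / 60 }'          # ~26.7 h, close to the ~26.5 h figure above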

@iamwavecut

iamwavecut commented Mar 2, 2024

I was able to complete the quantization of the Mistral 7B variant mentioned previously on a top-tier GH200 within a shocking 22 hours. That can't be possible, right?

P.S. It seems to be struggling the most on the 'initializing with kmeans:' stage, spending about 5 minutes at the 'mlp.*' sublayer stages.

@Vahe1994
Owner

Vahe1994 commented Mar 3, 2024

Hello, @iamwavecut!
As I mentioned earlier, using two A100 GPUs, the quantization time for the Mistral-7B model is approximately 17 hours. In light of this, the numbers you reported for one H100 (~26 hours) seem to be OK. Although quantization with AQLM is relatively time-consuming, it should be noted that this is a one-time process.

To expedite the model quantization process, consider using multiple GPUs in parallel (the provided code supports the use of multiple GPUs for a single model). If you're looking to reduce the quantization time further and are willing to make a potential compromise on perplexity (ppl), you can adjust quantization parameters such as nsamples, relative_mse_tolerance, finetune_relative_mse_tolerance, nbits_per_codebook, init_max_iter, init_max_points_per_centroid, etc.
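
As a purely illustrative sketch of that trade-off (the specific values below are assumptions for demonstration, not tuned recommendations), a faster but likely less accurate run could relax some of those parameters:

# Hypothetical faster settings: fewer calibration samples, a smaller codebook,
# and looser tolerances than the reference 16-bit / 1024-sample configuration.
python main.py $MODEL_PATH $DATASET_PATH --nsamples=512 \
 --num_codebooks=1 --nbits_per_codebook=15 --in_group_size=8 \
 --relative_mse_tolerance=0.03 --finetune_relative_mse_tolerance=0.01 \
 --finetune_batch_size=32 --local_batch_size=1 --offload_activations \
 --save $SAVE_PATH

Whether the resulting perplexity is still acceptable would need to be checked per model.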
Hope this helps. If you have any additional questions, please feel free to ask.


github-actions bot commented Apr 3, 2024

This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Apr 3, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
