
Quantization Time #28

Closed
DRXD1000 opened this issue Feb 24, 2024 · 13 comments

@DRXD1000

How long is the expected time to quantize a 7B Mistral model?

@Vahe1994
Owner

Hello!
On 2 GPUs it would take approximately 6–10 hours; it depends on the hyperparameters.

@iamwavecut

iamwavecut commented Feb 26, 2024

I tried to quantize another Mistral variant last week, but it was still working on layer 0 after about 4 hours on an 8 x 4090 NVLinked EPYC-class server, so I aborted it due to the projected costs.

Is it normal for it to be this slow with the mentioned configuration? I'm just trying to understand the optimal way of doing this.

@DRXD1000
Author

Hello!
On 2 GPUs it would take approximately 6–10 hours; it depends on the hyperparameters.

Thanks for the fast response. I use the exact same hyperparameters as your example:
export CUDA_VISIBLE_DEVICES=0   # or e.g. 0,1,2,3
export MODEL_PATH=<PATH_TO_MODEL_ON_HUB>
export DATASET_PATH=
export SAVE_PATH=/path/to/save/quantized/model/
export WANDB_PROJECT=MY_AQ_EXPS
export WANDB_NAME=COOL_EXP_NAME

python main.py $MODEL_PATH $DATASET_PATH --nsamples=1024 \
 --num_codebooks=1 --nbits_per_codebook=16 --in_group_size=8 \
 --relative_mse_tolerance=0.01 --finetune_relative_mse_tolerance=0.001 \
 --finetune_batch_size=32 --local_batch_size=1 --offload_activations \
 --wandb --save $SAVE_PATH

And after almost 24 hours on 2 A40 48 GB GPUs, the script was only on layer 8.

@iamwavecut

@Vahe1994, please take a look.

@Vahe1994
Owner

Thank you for reporting this!
I haven't experimented with quantization on a 4090 (by the way, 8 x 4090s might be overkill) or on an A40, but the processing time appears to be unusually slow. It's possible that recent code changes have caused this slowdown in the quantization process, though I'm not certain. I'll have a look at this and provide an update if I discover anything.

@DRXD1000
Author

If you need more details or a test after code updates, I will be happy to help.

@iamwavecut

+1

@Vahe1994
Owner

Hi!
Unfortunately, I could not obtain access to either a 4090 or an A40, so I conducted several experiments on A100s.
I tried to quantize Llama-2 7B with the provided parameters, both on the recent commit and on a commit from a month ago; both gave a full quantization time of 14.5 hours on 2 A100s, including ppl evaluation. Then I tried quantizing the Mistral-7B model on the recent commit with 2 A100s and got a slightly longer but acceptable time of around 17 hours.
Note that quantization with 16-bit codebooks is much slower than with smaller codebooks, and the quantization time also depends on the relative tolerance.

Can you please try quantizing with the same config, but on one GPU, reducing the number of samples proportionally to fit in memory? This is necessary to understand whether the problem is related to inter-GPU communication or to local computation.
For instance, I once encountered such a problem with a faulty GPU-to-GPU PCIe bus.
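
As a minimal sketch of that single-GPU test (assuming the same environment variables as in the command above; the halved --nsamples=512 is only an illustrative way to fit into a single card's memory, not a recommended value):

# Single visible GPU; all other flags kept identical to the reference command.
export CUDA_VISIBLE_DEVICES=0
python main.py $MODEL_PATH $DATASET_PATH --nsamples=512 \
 --num_codebooks=1 --nbits_per_codebook=16 --in_group_size=8 \
 --relative_mse_tolerance=0.01 --finetune_relative_mse_tolerance=0.001 \
 --finetune_batch_size=32 --local_batch_size=1 --offload_activations \
 --wandb --save $SAVE_PATH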


@iamwavecut

Now I'm trying with a single H100 / Intel Xeon Platinum 8468 (192) / 2 TB RAM, and I'm getting

Saving layer 0... to model-name-masked/0.pth                                                                  
{'layer_time': 3053.6494579315186, 'out_loss': 0.04229441657662392, 'Step': 0}

or about 50 min per layer.

So, if I extrapolate, I predict ~26.5 hours of quantizing, and that's... not expected :(

Launch flags:

python main.py $MODEL_NAME $DATASET_NAME --nsamples=1024 --num_codebooks=1 \
 --nbits_per_codebook=16 --in_group_size=8 --local_batch_size=1 \
 --save ${MODEL_NAME}-AQLM --dtype bfloat16 --beam_size 1 --max_epochs 100 \
 --relative_mse_tolerance=0.01 --finetune_max_epochs 0 --offload_activations
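
For reference, the extrapolation above is just the per-layer time multiplied by the number of decoder layers (32 is assumed here for a Mistral-7B-class model):

# Back-of-envelope projection from the logged layer_time (seconds) and the rounded 50 min/layer.
awk 'BEGIN { printf "%.1f h\n", 3053.65 * 32 / 3600 }'   # ~27.1 h
awk 'BEGIN { printf "%.1f h\n", 50 * 32 / 60 }'          # ~26.7 h, close to the ~26.5 h figure above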

@iamwavecut

iamwavecut commented Mar 2, 2024

I was able to complete the quantization of the Mistral 7B variant mentioned previously on a top-tier GH200 within a shocking 22 hours. That can't be possible, right?

P.S. It seems to be struggling the most on the 'initializing with kmeans:' stage, spending about 5 minutes at the 'mlp.*' sublayer stages.

@Vahe1994
Owner

Vahe1994 commented Mar 3, 2024

Hello, @iamwavecut!
As I mentioned earlier, using two A100 GPUs, the quantization time for the Mistral-7B model is approximately 17 hours. In light of this, the numbers you reported for one H100 (~26 hours) seem to be OK. Although quantization with AQLM is relatively time-consuming, it should be noted that this is a one-time process.

To expedite the model quantization process, consider using multiple GPUs in parallel (the provided code supports the use of multiple GPUs for a single model). If you're looking to reduce the quantization time further and are willing to make a potential compromise on perplexity (ppl), you can adjust quantization parameters such as nsamples, relative_mse_tolerance, finetune_relative_mse_tolerance, nbits_per_codebook, init_max_iter, init_max_points_per_centroid, etc.
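
As a purely illustrative sketch of that trade-off (the specific values below are assumptions for demonstration, not tuned recommendations), a faster but likely less accurate run could relax some of those parameters:

# Hypothetical faster settings: fewer calibration samples, a smaller codebook,
# and looser tolerances than the reference 16-bit / 1024-sample configuration.
python main.py $MODEL_PATH $DATASET_PATH --nsamples=512 \
 --num_codebooks=1 --nbits_per_codebook=15 --in_group_size=8 \
 --relative_mse_tolerance=0.03 --finetune_relative_mse_tolerance=0.01 \
 --finetune_batch_size=32 --local_batch_size=1 --offload_activations \
 --save $SAVE_PATH

Whether the resulting perplexity is still acceptable would need to be checked per model.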
Hope this helps. If you have any additional questions, please feel free to ask.


github-actions bot commented Apr 3, 2024

This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Apr 3, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
