Quantization Time #28

How long is the expected time to quantize a 7B Mistral model?
Hello!
I tried to quantize another variant of Mistral last week, but it spent about 4 hours working on layer 0. Is it OK for it to be this slow with the mentioned configuration? Just trying to understand the optimal way of doing this.
Thanks for the fast response. I use the exact same hyperparameters as your example, and after almost 24 hours on 2 A40 48 GB GPUs the script was only on layer 8.
@Vahe1994 please take a look.
Thank you for reporting this!
If you need more details or a test after code updates, I will be happy to help.
+1
Hi! Can you please try quantizing with the same config, but on one GPU, and reduce the number of samples proportionally so it fits in memory? This is necessary to understand whether the problem is related to inter-GPU communication or to local computation.
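For illustration, a minimal sketch of what such a single-GPU diagnostic run might look like, assuming the same example hyperparameters discussed in this thread and that the script uses every CUDA device it can see (so `CUDA_VISIBLE_DEVICES` restricts it to one). Halving `--nsamples` from 1024 to 512 is just the proportional reduction for going from 2 GPUs to 1; the values are illustrative, not a recommendation:

```bash
# Sketch of the suggested single-GPU diagnostic run (illustrative values).
# Exposing only one device and halving the calibration samples keeps the
# per-GPU memory footprint roughly the same as the original 2-GPU run.
CUDA_VISIBLE_DEVICES=0 python main.py $MODEL_NAME $DATASET_NAME \
    --nsamples=512 \
    --num_codebooks=1 --nbits_per_codebook=16 --in_group_size=8 \
    --local_batch_size=1 --beam_size 1 --max_epochs 100 \
    --relative_mse_tolerance=0.01 --finetune_max_epochs 0 \
    --offload_activations --dtype bfloat16 \
    --save ${MODEL_NAME}-AQLM
```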
Here's how it's going now with a single GPU: about 50 min per layer. So, if I extrapolate, I predict ~26.5 hours of quantizing, and that's... not expected :(

Launch flags:
python main.py $MODEL_NAME $DATASET_NAME --nsamples=1024 --num_codebooks=1 --nbits_per_codebook=16 --in_group_size=8 --local_batch_size=1 --save ${MODEL_NAME}-AQLM --dtype bfloat16 --beam_size 1 --max_epochs 100 --relative_mse_tolerance=0.01 --finetune_max_epochs 0 --offload_activations
I was able to complete the quantization of the Mistral 7B variant mentioned previously using the top-tier GPU.
P.S. It seems to struggle the most on the 'initializing with kmeans:' stage, spending about 5 minutes at the 'mlp.*' sublayer stages.
Hello, @iamwavecut! To expedite the model quantization process, consider using multiple GPUs in parallel (the provided code supports the use of multiple GPUs for a single model). If you're looking to reduce the quantization time further and are willing to make a potential compromise on perplexity (ppl), you can adjust quantization parameters such as `nsamples`, `relative_mse_tolerance`, `finetune_relative_mse_tolerance`, `nbits_per_codebook`, `init_max_iter`, `init_max_points_per_centroid`, etc.
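As a rough sketch only, here is how such a speed-for-quality trade might look using the flags already shown in this thread. The concrete values are made up, and it is an assumption that `init_max_iter` and `init_max_points_per_centroid` are exposed as command-line flags under those exact names:

```bash
# Hypothetical lower-fidelity launch (illustrative values, not recommendations).
# Fewer calibration samples, a looser per-layer stopping tolerance, and a
# smaller codebook (2^15 instead of 2^16 entries) all shorten quantization
# at some cost in perplexity. The two init_* flags, which would mainly speed
# up the k-means initialization stage, are assumed to exist as CLI options.
python main.py $MODEL_NAME $DATASET_NAME \
    --nsamples=512 \
    --num_codebooks=1 --nbits_per_codebook=15 --in_group_size=8 \
    --local_batch_size=1 --beam_size 1 --max_epochs 100 \
    --relative_mse_tolerance=0.02 --finetune_max_epochs 0 \
    --init_max_iter 50 --init_max_points_per_centroid 500 \
    --offload_activations --dtype bfloat16 \
    --save ${MODEL_NAME}-AQLM
```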
This issue is stale because it has been open for 30 days with no activity. |
This issue was closed because it has been inactive for 14 days since being marked as stale. |