
Quantization #164

Open
KnutJaegersberg opened this issue Sep 29, 2023 · 2 comments
Labels: enhancement (New feature or request)

Comments

@KnutJaegersberg

🚀 Feature

HF transformers implements 8-bit and 4-bit quantization. It would be nice if that feature could be leveraged for the xlm-r-xxl machine translation evaluation model.
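For reference, a minimal sketch of what the HF quantization API looks like, assuming `bitsandbytes` is installed; `facebook/xlm-roberta-xxl` stands in here for whichever encoder checkpoint the eval model wraps:

```python
# Sketch: loading an XLM-R-sized encoder with HF transformers quantization.
import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # or load_in_8bit=True
    bnb_4bit_compute_dtype=torch.float16,  # compute dtype for the 4-bit layers
)

tokenizer = AutoTokenizer.from_pretrained("facebook/xlm-roberta-xxl")
model = AutoModel.from_pretrained(
    "facebook/xlm-roberta-xxl",
    quantization_config=quant_config,
    device_map="auto",  # shard/offload across available devices
)
```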

Motivation

The large xlm-r-xxl model is too big for most commodity GPUs. To widen access to top-performing translation evaluation, please provide a quantized version.

Alternatives

I have seen a few libraries outside the HF ecosystem that quantize BERT-style models.

Additional context

I tried to load the big model in 8-bit with HF, without automatic device mapping. The model loaded and used about 14 GB of VRAM, but I don't know how to use it for evaluation afterwards.
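For the "how to use it" part, a rough sketch of running the 8-bit encoder once it is loaded, again with `facebook/xlm-roberta-xxl` as a stand-in checkpoint and `bitsandbytes` assumed installed:

```python
# Sketch: a quantized encoder is called exactly like the full-precision one.
import torch
from transformers import AutoModel, AutoTokenizer

name = "facebook/xlm-roberta-xxl"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, load_in_8bit=True, device_map="auto")

inputs = tokenizer(["Hello world", "Hallo Welt"],
                   return_tensors="pt", padding=True).to(model.device)
with torch.no_grad():
    outputs = model(**inputs)

# Last-layer token embeddings; the eval model's scoring head would pool
# these and regress a quality score, which this sketch omits.
embeddings = outputs.last_hidden_state
```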

@KnutJaegersberg added the enhancement (New feature or request) label on Sep 29, 2023
@ricardorei
Collaborator

Loading in 8-bit and using flash attention would be great enhancements. There is a good example of RoBERTa with flash-attention.
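A minimal sketch of requesting flash attention through the generic flag that recent transformers versions expose; whether the RoBERTa/XLM-R classes support it depends on the installed version, and otherwise a patched model such as the RoBERTa example in the flash-attention repo would be needed:

```python
# Sketch: opting into flash attention at load time (version-dependent).
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "facebook/xlm-roberta-xxl",           # stand-in checkpoint
    torch_dtype=torch.float16,            # flash attention needs fp16/bf16
    attn_implementation="flash_attention_2",
)
```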

@ricardorei
Collaborator

ricardorei commented Oct 2, 2023

This also connects to @BramVanroy's suggestion to use BetterTransformer (#117).
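For comparison, the BetterTransformer path is a one-line conversion via HF Optimum; a sketch assuming `optimum` is installed and the architecture is supported:

```python
# Sketch: BetterTransformer conversion (requires the `optimum` package).
from transformers import AutoModel

model = AutoModel.from_pretrained("facebook/xlm-roberta-xxl")  # stand-in
model = model.to_bettertransformer()  # swaps in fused attention kernels
```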
