
Adding bert - WIP #328

Open · wants to merge 1 commit into main

Conversation

michaelfeil

#4 Trying to implement BERT. Opening this PR for visibility.

I am blocked on the following:

  • Some layers have no previous LayerNorm, so they cannot be scaled based on the previous layer and are currently ignored (sketched below).
  • The other MLP layers are not converted.
    => The outputs are very far from identical.

Feel free to pick up this PR if it is helpful
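
For context, a minimal sketch of the scale-folding step that is blocked: AWQ computes per-input-channel scales for a group of linear layers and folds the inverse scale into the operation that feeds them, usually the preceding LayerNorm. The helper name and signature below are illustrative, not code from this PR:

import torch
import torch.nn as nn

@torch.no_grad()
def fold_scales_into_layernorm(prev_ln: nn.LayerNorm, linears: list[nn.Linear], scales: torch.Tensor):
    # Divide the preceding LayerNorm's affine parameters by the scales and
    # multiply the linear weights column-wise by the same scales, so the
    # composed mapping is numerically unchanged. Without a preceding
    # LayerNorm (or linear) there is nothing to absorb the inverse scale,
    # which is the first blocker above.
    prev_ln.weight.div_(scales)
    if prev_ln.bias is not None:
        prev_ln.bias.div_(scales)
    for fc in linears:
        fc.weight.mul_(scales.view(1, -1))

Applying this fold to every LayerNorm -> linear pair and checking that the FP16 outputs are unchanged is a useful sanity check before quantizing anything.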

@casper-hansen
Owner

Hi @michaelfeil, great work on this! I am indeed interested in having support for BERT models. However, the main issues you highlighted were the same ones I ran into.

Do you have any ideas on how to solve the blockers? Or do you plan to leave it as-is for now?

@michaelfeil
Author

@casper-hansen I saw that the outputs of the model differ substantially in embedding space.

  • Do I need to quantize all layers? I saw that all layers are replaced with GEMM, but I only quantized a few of them (see the code).
  • Do you have any idea what the reason for this difference could be?

I don't have the time to invest in this PR during the week at the moment.

@casper-hansen
Owner

  • Do I need to quantize all layers? I saw that all layers are replaced with GEMM, but I only quantized a few of them (see the code).

The layers that are not defined will use the RTN (round-to-nearest) method to round down to 4-bit. You can also make use of the modules_to_not_convert argument like we do for Mixtral.
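
For reference, this is roughly how the argument is passed in the Mixtral example (the model path is a placeholder and the exact quant_config keys may differ between versions):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # placeholder model path

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
    # modules listed here are kept in FP16 instead of being quantized
    "modules_to_not_convert": ["gate"],
}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model.quantize(tokenizer, quant_config=quant_config)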

  • Do you have any idea what the reason for this difference could be?

A good start is to use a standard benchmark for the model. For LLMs, we usually measure perplexity, and a 1-2% degradation in a benchmark is acceptable. The reasons could be manifold and are hard to reason about; one potential issue is that some layers are very sensitive to quantization.
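
As a concrete example of the perplexity check, a minimal sketch over WikiText-2 (the checkpoint path and window size are placeholders):

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/model-under-test"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto").eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids

window, nlls, n_pred = 2048, [], 0
for i in range(0, ids.size(1), window):
    chunk = ids[:, i : i + window].to(model.device)
    if chunk.size(1) < 2:
        break
    with torch.no_grad():
        # labels=chunk makes the model return the mean NLL over chunk.size(1) - 1 positions
        loss = model(chunk, labels=chunk).loss
    nlls.append(loss.float() * (chunk.size(1) - 1))
    n_pred += chunk.size(1) - 1

print("perplexity:", torch.exp(torch.stack(nlls).sum() / n_pred).item())

Comparing this number between the FP16 checkpoint and the quantized one gives the degradation figure mentioned above.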

@michaelfeil
Author

Thanks for the hint. I have not tried out modules_to_not_convert yet - are you referring to this example?

modules_to_not_convert = ["gate"]

I am trying to directly use the cosine similarity between a query and a paragraph as the metric; in this case, the result was similar to that of a randomly initialized model.
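
For concreteness, a sketch of that check with mean-pooled embeddings (the checkpoint and texts are illustrative; the idea is to run the same function on the FP16 reference and on the quantized model and compare the scores):

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

def query_paragraph_similarity(model_name: str, query: str, paragraph: str) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    batch = tokenizer([query, paragraph], padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state      # (2, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean-pool over real tokens only
    return F.cosine_similarity(emb[0:1], emb[1:2]).item()

print(query_paragraph_similarity(
    "bert-base-uncased",  # illustrative checkpoint
    "what is activation-aware weight quantization?",
    "AWQ protects the most salient weight channels by scaling them before quantization.",
))

If the quantized model scores near a randomly initialized model, comparing hidden states layer by layer against the FP16 model is one way to narrow down where the divergence starts.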
