[CODE] port HF bloom to hivemind #2

Closed
justheuristic opened this issue May 31, 2022 · 5 comments

justheuristic commented May 31, 2022

Why: so we can play with it in inference mode

Original bloom code: huggingface/transformers#17474

The quest is to:

  • implement the bloom transformer layer as a hivemind expert,
  • prepare a huggingface model that only has the bloom embeddings and logits, but runs all transformer layers via hivemind.RemoteExpert (see the sketch below).
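Roughly, the client-side model could look like this. A hedged sketch only: DistributedBloom, the uid/endpoint arguments, and the exact RemoteExpert constructor are illustrative and version-dependent, not a final API.

```python
import torch
import torch.nn as nn
import hivemind

class DistributedBloom(nn.Module):
    """Hypothetical client model: local embeddings/logits, remote transformer layers."""

    def __init__(self, config, expert_uids, endpoint):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size)  # runs locally
        # each transformer layer lives on some server and is called over the network
        self.remote_layers = [hivemind.RemoteExpert(uid, endpoint) for uid in expert_uids]
        self.ln_f = nn.LayerNorm(config.hidden_size)  # final layernorm, also local

    def forward(self, input_ids):
        hidden = self.word_embeddings(input_ids)
        for layer in self.remote_layers:
            hidden = layer(hidden)  # one network round-trip per layer
        hidden = self.ln_f(hidden)
        # logits via tied embeddings, computed on the client
        return hidden @ self.word_embeddings.weight.T
```
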
justheuristic self-assigned this Jun 2, 2022

justheuristic commented Jun 2, 2022

On client-side computations

Since we plan to run embeddings/logits on the client side, we need to compute them efficiently.
Embedding computation is cheap AF, but the logits are more complicated.

Computing the final logits on a Colab CPU takes over a minute per token:
[screenshot: per-token CPU timing]

Solution 1: use fast KNN

  • the ~100 most likely tokens hold over 99% of the probability mass,
  • use HNSW to find the top-100 tokens with the highest dot product against the final hidden state,
  • FAISS or ScaNN for fast nearest neighbor search (see the sketch below).
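A minimal sketch of solution 1, assuming tied output embeddings of shape (vocab_size, hidden_size). The FAISS index types are real; the dimensions and variable names are illustrative.

```python
import numpy as np
import faiss  # ScaNN would work similarly

hidden_size, vocab_size = 14336, 250880  # BLOOM-176B dimensions
emb = np.random.randn(vocab_size, hidden_size).astype(np.float32)  # stand-in for output embeddings

# HNSW over the output embeddings; inner product with the hidden state = (unnormalized) logit
index = faiss.IndexHNSWFlat(hidden_size, 32, faiss.METRIC_INNER_PRODUCT)
index.add(emb)

def approx_top_tokens(hidden_state: np.ndarray, k: int = 100):
    """Approximate top-k token ids and their logits for one final hidden state."""
    logits, token_ids = index.search(hidden_state[None, :].astype(np.float32), k)
    return token_ids[0], logits[0]
```

The search is approximate, so the top-100 set can occasionally miss a token; for greedy decoding or top-k sampling that error should be tolerable.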

Solution 2: just use GPU (the naive matmul, sketched below)

  • Colab T4 in fp16: 30 ms per token (no longer a bottleneck)
  • kudesnik M40 (~2x a Colab K80): 67 ms, still very much acceptable
    [screenshot: per-token GPU timing]
  • GPUs might not always be available
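For reference, solution 2 is just the matmul that the timings above measure, in fp16 on GPU (dimensions illustrative):

```python
import torch

emb = torch.randn(250880, 14336, dtype=torch.float16, device="cuda")  # output embeddings
hidden = torch.randn(1, 14336, dtype=torch.float16, device="cuda")    # final hidden state
logits = hidden @ emb.T  # ~30 ms/token on a Colab T4 per the measurement above
```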

Current opinion: use GPUs, think about fast CPU mode later.

@justheuristic justheuristic transferred this issue from another repository Jun 12, 2022
justheuristic commented:

Copied the bloom version huggingface/transformers@ca2a55e from huggingface.

Their attention code is spectacularly bad, see #1


justheuristic commented Jun 12, 2022

Next steps:

  • push individual bloom layers to the huggingface hub
  • implement BloomBlock as a hivemind.moe.server.ExpertBackend (see the sketch below)
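A rough sketch of the second bullet. The exact ExpertBackend signature differs across hivemind versions, BloomBlock/config come from the copied HF code, and the schema shapes and expert name are placeholders.

```python
import torch
from hivemind.moe.server import ExpertBackend
from hivemind.utils import BatchTensorDescriptor

block = BloomBlock(config)  # one transformer layer from the copied HF code

backend = ExpertBackend(
    name="bloom.block0",  # hypothetical expert uid
    expert=block,
    optimizer=torch.optim.SGD(block.parameters(), lr=0.0),  # placeholder: inference-only
    args_schema=(BatchTensorDescriptor(config.hidden_size),),  # illustrative shapes
    outputs_schema=BatchTensorDescriptor(config.hidden_size),
    max_batch_size=16,
)
```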

justheuristic commented:

As of 8221469:

  • Pushing a model to the hub is handled via python -m cli.convert_model --many_args_here, see the README for a usage example
  • The server can run forward, backward and inference of bloom blocks, see the README for instructions on how to start a server (and the smoke-test sketch below)
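A hedged smoke test of that second point, assuming a server from the README is already running. The uid/endpoint are placeholders, and the RemoteExpert constructor differs between hivemind versions.

```python
import torch
import hivemind

block = hivemind.RemoteExpert("bloom.block0", "127.0.0.1:1337")  # hypothetical uid/endpoint
hidden = torch.randn(1, 8, 4096, requires_grad=True)  # (batch, seq_len, hidden_size), illustrative
out = block(hidden)       # forward pass runs on the server
out.sum().backward()      # backward is also executed server-side
print(hidden.grad.shape)  # gradients arrive back on the client
```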
