[CODE] miscellaneous small issues for later #11

Closed
5 tasks done
justheuristic opened this issue Jun 19, 2022 · 0 comments
justheuristic commented Jun 19, 2022

Things that can be done to improve the code, but were left out to launch the MVP faster:

  • server-side: connection_handler, backend, runtime
    • modify task pool to deal with cache handles as pure-python integers? (currently they are converted to tensors)
    • when running inference over multiple layers on the same server, avoid passing layer activations between CPU and GPU by storing them in MemoryCache
      - moved to Miscellaneous server-side improvements #68
    • optimize disk space. Right now, a server will eventually download all bloom blocks and store them in the HF cache. Check for free disk space in advance and/or figure out a cache eviction policy (see the disk-space sketch after this list).
  • server-side: MemoryCache
    • in allocate_cache, if there is not enough memory, wait for memory to be freed by existing tasks, up to a given timeout
      - note: this can be done using mp.Condition (see the sketch after this list)
    • allocate cache as one contiguous buffer to avoid fragmentation
      - note: this feature is active as of #779959bc; we will eventually switch back to the non-cached version. Rationale: we did not observe significant issues from fragmentation, but contiguous buffers did complicate the code
    • quantize cached values using bitsandbytes
      - wontfix (as of 2022.01.02): our current code relies on transformers' default bloom implementation, so we can't intervene in attention internals
    • LRU-offload cache from GPU to RAM?
      - moved to Miscellaneous server-side improvements #68
  • client-side: internals
    • make begin_inference_session into a contextmanager (see the context-manager sketch after this list)
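
A minimal sketch of the disk-space pre-flight check mentioned above; the function name and the idea of checking before each block download are assumptions for illustration, not existing code:

```python
import shutil

def ensure_free_disk_space(cache_dir: str, required_bytes: int) -> None:
    """Hypothetical pre-flight check: refuse to download another bloom block
    if the HF cache directory does not have enough free space left."""
    free_bytes = shutil.disk_usage(cache_dir).free
    if free_bytes < required_bytes:
        raise RuntimeError(
            f"Not enough disk space in {cache_dir}: "
            f"need {required_bytes} bytes, only {free_bytes} available"
        )
```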
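A rough sketch of the mp.Condition idea for allocate_cache; the class layout, attribute names (`_free_bytes`, `_memory_freed`), and the integer "handle" are hypothetical and only illustrate the wait-with-timeout pattern:

```python
import multiprocessing as mp
import time

class MemoryCacheSketch:
    """Hypothetical stand-in for MemoryCache, showing only the wait-for-memory logic."""

    def __init__(self, max_bytes: int):
        self._free_bytes = mp.Value("q", max_bytes)  # bytes currently available
        self._memory_freed = mp.Condition()          # notified whenever memory is released

    def allocate_cache(self, num_bytes: int, timeout: float = 10.0) -> int:
        deadline = time.monotonic() + timeout
        with self._memory_freed:
            # wait until existing tasks free enough memory, or give up after `timeout`
            while self._free_bytes.value < num_bytes:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    raise TimeoutError("not enough cache memory was freed in time")
                self._memory_freed.wait(remaining)
            self._free_bytes.value -= num_bytes
        return num_bytes  # a real implementation would return an actual cache handle

    def free_cache(self, num_bytes: int) -> None:
        with self._memory_freed:
            self._free_bytes.value += num_bytes
            self._memory_freed.notify_all()  # wake up allocators blocked in allocate_cache
```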
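A small sketch of wrapping begin_inference_session in a context manager; `session.close()` and the `step()` call in the usage note are assumed method names for illustration:

```python
from contextlib import contextmanager

@contextmanager
def inference_session(model):
    """Hypothetical wrapper: open an inference session and guarantee it is closed."""
    session = model.begin_inference_session()
    try:
        yield session
    finally:
        session.close()  # assumed cleanup method; runs even if the caller raises

# usage sketch:
# with inference_session(model) as session:
#     outputs = session.step(input_embeddings)
```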