Things that can be done to improve the code, but were left out to launch the MVP faster:
server-side: connection_handler, backend, runtime
modify the task pool to pass cache handles around as pure-Python integers? (currently they are converted to tensors)
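A minimal sketch of the integer-handle idea; `HandleRegistry` and its method names are hypothetical, not Petals' actual task-pool API. Handles are issued from a counter and map back to cached objects, so only a plain int ever needs to cross process boundaries:

```python
import itertools

class HandleRegistry:
    """Hypothetical registry that issues plain-int cache handles."""

    def __init__(self):
        self._counter = itertools.count()
        self._storage = {}

    def register(self, value) -> int:
        handle = next(self._counter)  # a plain Python int, cheap to serialize
        self._storage[handle] = value
        return handle

    def lookup(self, handle: int):
        return self._storage[handle]

    def release(self, handle: int) -> None:
        del self._storage[handle]
```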
when running inference over multiple layers on the same server, avoid passing layer activations between CPU and GPU by storing them in MemoryCache
- moved to Miscellaneous server-side improvements #68
optimize disk space. Right now, a server will eventually download all BLOOM blocks and store them in the HF cache. Check for available disk space in advance and/or figure out a cache eviction policy.
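The advance check could look like the following standard-library sketch; `has_enough_disk_space` is an illustrative name, not an existing Petals or Hugging Face helper:

```python
import shutil
from pathlib import Path

def has_enough_disk_space(cache_dir: str, required_bytes: int) -> bool:
    """Return True if the filesystem holding cache_dir can fit required_bytes."""
    path = Path(cache_dir)
    path.mkdir(parents=True, exist_ok=True)  # ensure the cache dir exists
    return shutil.disk_usage(path).free >= required_bytes
```

Calling this before fetching each block would let the server refuse new blocks (or trigger eviction) instead of filling the disk.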
server-side: MemoryCache
in allocate_cache, if there is not enough memory, wait for memory to be freed by existing tasks, up to a given timeout
- note: this can be done using mp.Condition
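The note above can be sketched as follows; `ToyMemoryCache` is an illustrative stand-in for the real MemoryCache and tracks only a byte counter:

```python
import multiprocessing as mp
import time

class ToyMemoryCache:
    """Toy cache that blocks allocations until memory frees up or a timeout expires."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self._used = mp.Value("l", 0)  # bytes currently allocated
        self._cond = mp.Condition()    # signaled whenever memory is freed

    def allocate(self, nbytes: int, timeout: float) -> bool:
        deadline = time.monotonic() + timeout
        with self._cond:
            while self._used.value + nbytes > self.max_bytes:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False            # timed out waiting for memory
                self._cond.wait(remaining)  # re-check the condition after wakeup
            self._used.value += nbytes
            return True

    def free(self, nbytes: int) -> None:
        with self._cond:
            self._used.value -= nbytes
            self._cond.notify_all()  # wake any waiters blocked in allocate()
```

The wait loop re-checks available memory after every wakeup, so spurious or late notifications are harmless; callers get `False` only when the deadline actually passes.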
allocate cache as one contiguous buffer to avoid fragmentation
- note: this feature is active as of #779959bc; we will eventually switch back to the non-contiguous version. Rationale: we did not observe significant issues from fragmentation, but contiguous buffers did complicate the code.
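A minimal illustration of the contiguous-buffer idea (not the actual implementation from the commit above): one big `bytearray` with bump-pointer allocation, which rules out fragmentation but, as noted, shifts complexity into freeing and reuse:

```python
class ContiguousCache:
    """One contiguous buffer; allocations are slices handed out in order."""

    def __init__(self, total_bytes: int):
        self.buffer = bytearray(total_bytes)  # single contiguous allocation
        self.offset = 0                       # bump pointer

    def allocate(self, nbytes: int) -> memoryview:
        if self.offset + nbytes > len(self.buffer):
            raise MemoryError("contiguous cache exhausted")
        view = memoryview(self.buffer)[self.offset:self.offset + nbytes]
        self.offset += nbytes
        return view
```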
quantize cached values using bitsandbytes
- wontfix (as of 2022.01.02): our current code relies on transformers' default BLOOM implementation, so we can't intervene in the attention internals
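Although this item is wontfix, the intended compression is easy to illustrate. Below is a toy absmax 8-bit quantizer in pure Python, not bitsandbytes' actual API, showing the kind of lossy round-trip the cached values would go through:

```python
def quantize_absmax(values):
    """Map floats to the int8 range [-127, 127] using a shared absmax scale."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # avoid division by zero
    return [round(v / scale) for v in values], scale

def dequantize(qvalues, scale):
    """Recover approximate floats from the quantized ints."""
    return [q * scale for q in qvalues]
```

The worst-case error per value is half the scale, i.e. it grows with the largest magnitude in the group, which is why real schemes like bitsandbytes quantize in small blocks.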