Improve tensor allocations in servings #233

Merged

jonatanklosko merged 2 commits into main from jk-serving-improvements

Aug 3, 2023

Member

jonatanklosko commented Aug 3, 2023

Closes #217.

Tokenization input is now always allocated using the binary backend, because that is zero-copy and there is no reason to involve XLA that early.
Adds a new :preallocate_params option that moves params to the device specified by :defn_options. This can be useful with multiple GPUs: params can be loaded onto the CPU first, and with :preallocate_params each serving partition then allocates them on its corresponding device.
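
For illustration, the multi-GPU setup described above might look roughly like the sketch below. It is not taken from the PR diff. The model repository, the fill-mask task, the `:compile` shapes, the `:host`/`:cuda` client names, and the `MyApp.Serving` name are all assumptions chosen for the example; only `:preallocate_params` and `:defn_options` come from the description.

```elixir
# Assumed model; any Bumblebee-supported repository would do.
repo = {:hf, "bert-base-uncased"}

# Load params onto the host (CPU) so no single GPU holds them up front.
{:ok, model_info} =
  Bumblebee.load_model(repo, backend: {EXLA.Backend, client: :host})

{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)

serving =
  Bumblebee.Text.fill_mask(model_info, tokenizer,
    # Assumed shapes, only so the computation is compiled at init.
    compile: [batch_size: 4, sequence_length: 128],
    # Device on which params end up when preallocated.
    defn_options: [compiler: EXLA, client: :cuda],
    # Move params to the :defn_options device when the serving initializes.
    preallocate_params: true
  )

# With partitions: true, Nx.Serving starts one partition per device,
# so each partition allocates its own copy of the params on its GPU.
children = [
  {Nx.Serving, serving: serving, name: MyApp.Serving, partitions: true}
]

Supervisor.start_link(children, strategy: :one_for_one)
```

Running this requires EXLA built with CUDA support and network access to download the model, so treat it as a configuration sketch and not a tested recipe.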

jonatanklosko added 2 commits

August 3, 2023 18:39


Allocate tokenizer output using binary backend (aa21be5)

Add option to preallocate params on serving init (e90cb20)

josevalim approved these changes

jonatanklosko merged commit 333ba09 into main

2 checks passed

jonatanklosko deleted the jk-serving-improvements branch

August 3, 2023 18:10
