Add python backend support #86
Conversation
- Modify Dockerfile to include bitsandbytes, transformers, and the latest version of PyTorch
- Minor modifications in utils/codegen.py so that the same client works with FT and the Python backend
- Minor modifications in launch.sh (no need to name models by GPU)
- Add an installation script for adding a new Python model (with a super simple config_template)
- Modify setup.sh so that it works with both FT and Python backend models

Signed-off-by: Parth Thakkar <thakkarparth007@gmail.com>
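For context, here is a minimal sketch of the loading path these new dependencies (transformers, bitsandbytes, accelerate) enable in the Python backend. The model name and flags are assumptions for illustration, not necessarily what the install script configures:

```python
# Sketch only: loading a causal LM via transformers with bitsandbytes int8
# weights. Model name and arguments are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Salesforce/codegen-350M-mono"  # hypothetical example model

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # let accelerate place weights across devices
    load_in_8bit=True,   # requires bitsandbytes; skip on CPU-only machines
)

inputs = tokenizer("def hello_world():", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```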
Oh this is wonderful! This was one of my big wishlist items :D I will try to take this for a test drive later this week and hopefully get it merged shortly after #73 goes in!
Dockerfile (Outdated)

```dockerfile
# Install dependencies: torch
RUN pip3 install -U torch --extra-index-url https://download.pytorch.org/whl/cu116
RUN pip3 install -U transformers bitsandbytes accelerate
```
`moyix/triton_with_ft:22.09` comes from this repo: https://github.com/moyix/fastertransformer_backend/blob/main/docker/Dockerfile

It seems more logical to me to add the dependencies there?
I've made a PR for you that resolves the merge conflicts etc.: thakkarparth007#1
copilot_proxy/models.py (Outdated)

```diff
@@ -4,7 +4,7 @@
 class OpenAIinput(BaseModel):
-    model: str
+    model: str = "fastertransformer|py-model"
```
Does the `|` act as an OR operator? If not, then I think it should be `fastertransformer` or no default at all. What do you think?
Oh, it was meant to signal to users that both fastertransformer and py-model are valid values. I wasn't sure how else to convey that, so I ended up using this unusual format. I guess a better way would be to just note it in the README/docs and let users who want the Python backend set it on their own.

Or maybe there's a way to annotate this information such that it shows up in Swagger UI. I'll try to see if there's such an option, and fall back to the above otherwise.
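One option along those lines: FastAPI surfaces pydantic field metadata in the generated OpenAPI schema, so a `Field` with a `description` shows up in Swagger UI. A minimal sketch, where the description wording is an assumption rather than anything in this PR:

```python
from pydantic import BaseModel, Field

class OpenAIinput(BaseModel):
    # The description is rendered in Swagger UI via the OpenAPI schema;
    # wording here is illustrative only.
    model: str = Field(
        default="fastertransformer",
        description='Which backend to use: "fastertransformer" or "py-model".',
    )
```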
```yaml
image: moyix/triton_with_ft:22.09
build:
  context: .
  dockerfile: Dockerfile
```
As mentioned in my other comment, I think the dependencies should be added in `moyix/triton_with_ft`.
Sounds good to me. I can make a PR to that repo instead.
```yaml
command: bash -c "CUDA_VISIBLE_DEVICES=${GPUS} mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver --model-repository=/model"
shm_size: '2gb'
volumes:
  - ${MODEL_DIR}:/model
  - ${HF_CACHE_DIR}:/root/.cache/huggingface
```
If no `HF_CACHE_DIR` is set, this breaks the deployment. Maybe we should default to true?
The current version sets `HF_CACHE_DIR` to `/tmp/hf_cache`. The reason I didn't default it to true is that Docker messes up the file permissions by setting many of them to root (unless rootless Docker is used). So the cache is shared only if the user knows what they're doing.

I could default it to true and warn the user about the permission issues. Not sure which is the better option.
@thakkarparth007 if a user does not use the cache, the volume is set to empty, which cannot be mounted and causes `docker compose up` to fail. Hence why I suggested defaulting to true, but I understand the permission issues you mentioned.

I think adding something like `HF_DATASETS_CACHE="fauxpilot/.hf-cache"` and removing the option to cache would always cache and not mess up permissions? The cache path needs to be verified; consider it just an example.
Whoops, I just saw your comment @fdegier.

So currently, if you look at setup.sh (https://github.com/moyix/fauxpilot/pull/86/files#diff-4209d788ad32c40cbda3c66b3de47eefb929308ca703bb77a6382625986add17R148), you'll see that `HF_CACHE_DIR` is set to `/tmp/hf_cache` if the user doesn't want to share the Hugging Face cache.

But yes, perhaps it's better to store the cache in the fauxpilot directory itself. Updated it!
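For reference, Compose's variable-default syntax is another way to keep the mount valid when the variable is unset. A sketch only; the fallback path here is an assumption, not what setup.sh actually writes:

```yaml
volumes:
  - ${MODEL_DIR}:/model
  # Fall back to a repo-local cache directory when HF_CACHE_DIR is unset,
  # so the bind mount never receives an empty host path.
  - ${HF_CACHE_DIR:-./.hf-cache}:/root/.cache/huggingface
```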
Signed-off-by: Parth Thakkar <thakkarparth007@gmail.com>
@fdegier thanks for your PR that merges the python_backend code on top of #73. I made a few small changes on top of those (except your last commit, since my changes were already in progress). Specifically, I fixed some issues in the setup.sh file and added a test script.

@moyix this test script builds towards your CI feature from the wishlist. I've run the test locally (just need to invoke the script). With a few modifications, it can be used for checking the fastertransformer backend as well. I don't know if the fastertransformer backend can work without a GPU, though.
Signed-off-by: Parth Thakkar <thakkarparth007@gmail.com>
Signed-off-by: Parth Thakkar <thakkarparth007@gmail.com>
Update: Fixed the test and merged with main. For running the test, you can just run the test script.
I'd been using dlpack for copying Triton tensors to torch tensors, which I did because it was advertised to perform zero-copy transfers. Turns out that only worked on my laptop and didn't work on other machines; IDK why. For now, I'm just copying the tensors as triton<->numpy<->torch. That works on the VM on which the earlier code was segfaulting.

Signed-off-by: Parth <thakkarparth007@gmail.com>
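A minimal sketch of that numpy round-trip inside a Triton Python-backend model; the tensor name is hypothetical and the actual code in this PR likely differs:

```python
import torch
import triton_python_backend_utils as pb_utils

def triton_to_torch(request, tensor_name):
    # as_numpy() copies the Triton tensor into a host numpy array;
    # torch.from_numpy then wraps that buffer without a further copy.
    np_array = pb_utils.get_input_tensor_by_name(request, tensor_name).as_numpy()
    return torch.from_numpy(np_array)

# Hypothetical usage inside TritonPythonModel.execute():
#   input_ids = triton_to_torch(request, "input_ids")
```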
Changes:

I've tested the workflow from top to bottom and it seems to work for me.

Limitations:

- Unused parameters keep their defaults in `config_template.pbtxt`.
- Generation parameters are passed through to `model.generate`.
- Models are loaded with `device_map="auto"`.

Caveats:
Signed-off-by: Parth Thakkar <thakkarparth007@gmail.com>