
Add python backend support #86

Merged: 8 commits, Nov 23, 2022

Conversation

@thakkarparth007 (Collaborator) commented Oct 17, 2022

Changes:

  • Modify Dockerfile to include bitsandbytes, transformers, and the latest version of PyTorch
  • Minor modifications in utils/codegen.py so that the same client works with FT and the Python backend
  • Minor modifications in launch.sh (no need to name models by GPU)
  • Add an installation script for adding a new Python model (with a super simple config_template)
  • Modify setup.sh so that it works with both FT and Python backend models

I've tested the workflow from top to bottom and it seems to work for me.

Limitations:

  • Many parameters are unused while generating the output (e.g. beam_width). See config_template.pbtxt for unused parameters.
  • Doesn't take care of batching multiple requests together.
  • Slow performance. FasterTransformer seems to be 6-7x faster than the Python backend, but this is still useful if you want to play with different models or try out a bigger model (e.g., I can fit the 2B model in my tiny 4GB GPU, which I can't with FT). Some numbers: for ~20 tokens, FT takes ~120ms while the Python backend takes 800-900ms. Most of the Python backend's time is spent in model.generate (see the sketch after this list).
  • Didn't test in a multi-GPU setup (but it should work because of device_map="auto").
  • The code is limited to installing codegen-{350M,2B}-{multi,mono} models. It can trivially allow others, but I'm not sure of their memory requirements (it should be on the order of model_params/1B GB, plus a constant).
  • Didn't modify README yet.
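For reference, here is a minimal sketch of the HuggingFace path described above (not the actual backend code in this PR; the model name and generation parameters are illustrative):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Salesforce/codegen-350M-mono"  # codegen-2B-{multi,mono} works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
# device_map="auto" lets accelerate place/shard the weights across whatever
# devices are available, which is why no per-GPU copy of the model is needed.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Most of the Python backend's latency is spent inside generate().
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=True, temperature=0.2)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))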

Caveats:

  • Model directory style is slightly different because we don't need a separate copy of the model based on the number of available GPUs. Accelerate takes care of that for us.
  • Haven't tested with models other than the codegen models. The code should work for all decoder-only models IMO, and will probably need to be tweaked a bit to work with encoder-decoder models.

Signed-off-by: Parth Thakkar thakkarparth007@gmail.com

@moyix (Collaborator) commented Oct 18, 2022

Oh this is wonderful! This was one of my big wishlist items :D I will try to take this for a test drive later this week and hopefully get it merged shortly after #73 goes in!

Dockerfile (outdated):

# Install dependencies: torch
RUN pip3 install -U torch --extra-index-url https://download.pytorch.org/whl/cu116
RUN pip3 install -U transformers bitsandbytes accelerate
Collaborator:
moyix/triton_with_ft:22.09 comes from this repo https://github.com/moyix/fastertransformer_backend/blob/main/docker/Dockerfile

It seems more logical to me to add the dependencies there?

@fdegier (Collaborator) left a comment:

I've made a PR for you that resolves the merge conflicts etc.: thakkarparth007#1

@@ -4,7 +4,7 @@
 class OpenAIinput(BaseModel):
-    model: str
+    model: str = "fastertransformer|py-model"
Collaborator:

Does the | act as an or-operator? If not, then I think it should be fastertransformer or no default at all. What do you think?

@thakkarparth007 (Collaborator, Author):

Oh, it was meant to signal to users that both fastertransformer and py-model are valid values. I wasn't sure how else to convey that, so I ended up using this unusual format. I guess a better way to convey it would be to just note it in the README/docs and let users who want the Python backend set it on their own.

Or maybe there's a way to annotate this information so that it shows up in the Swagger UI. I'll try to see if there's such an option, or fall back to the above option.
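For example, one possible way to surface the valid values in the generated Swagger UI would be to constrain the field like this (a sketch only; I haven't verified how FastAPI renders it, and the rest of the model is omitted):

from typing import Literal

from pydantic import BaseModel, Field


class OpenAIinput(BaseModel):
    # Literal restricts the accepted values and shows up as an enum in the
    # OpenAPI schema; the description appears next to the field in Swagger UI.
    model: Literal["fastertransformer", "py-model"] = Field(
        default="fastertransformer",
        description="Which backend to use: 'fastertransformer' or 'py-model'.",
    )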

    image: moyix/triton_with_ft:22.09
    build:
      context: .
      dockerfile: Dockerfile
Collaborator:

As mentioned in my other comment, I think the dependencies should be added in moyix/triton_with_ft

@thakkarparth007 (Collaborator, Author):

Sounds good to me. I can make a PR to that repo instead.

    command: bash -c "CUDA_VISIBLE_DEVICES=${GPUS} mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver --model-repository=/model"
    shm_size: '2gb'
    volumes:
      - ${MODEL_DIR}:/model
      - ${HF_CACHE_DIR}:/root/.cache/huggingface
Collaborator:

If no HF_CACHE_DIR is set, this breaks the deployment. Maybe we should default to true?

@thakkarparth007 (Collaborator, Author):

The current version sets HF_CACHE_DIR to /tmp/hf_cache. The reason I didn't default it to true is that Docker messes up the file permissions by setting many of them to root (unless rootless Docker is used). So the cache is shared only if the user knows what they're doing.

I could default it to true and warn the user about the permission issues. Not sure which is the better option.

@fdegier, @moyix thoughts?

@fdegier (Collaborator) commented Oct 25, 2022:

@thakkarparth007 if a user does not use the cache, the volume is set to empty, which cannot be mounted and causes docker compose up to fail; hence why I suggested defaulting to true, but I understand the permission issues you mentioned.

I think adding something like HF_DATASETS_CACHE="fauxpilot/.hf-cache" and removing the option to cache would mean we always cache and don't mess up permissions? The cache path needs to be verified; consider it just an example.

@thakkarparth007 (Collaborator, Author):

Whoops, I just saw your comment @fdegier.

Currently, if you look at setup.sh (https://github.com/moyix/fauxpilot/pull/86/files#diff-4209d788ad32c40cbda3c66b3de47eefb929308ca703bb77a6382625986add17R148), you'll see that HF_CACHE_DIR is set to /tmp/hf_cache if the user doesn't want to share the HF cache.

But yes, perhaps it's better to store the cache in the fauxpilot directory itself. Updated it!

thakkarparth007 and others added 2 commits October 21, 2022 13:23
@thakkarparth007 (Collaborator, Author) commented Oct 21, 2022

@fdegier thanks for your PR that merges the python_backend code on top of #73.

I made a few small changes on top of that (except your last commit, since my changes were already in progress). Specifically, I fixed some issues in the setup.sh file and added a test_setup.py script that performs an E2E test of the Python backend functionality.

@moyix this test script builds towards your CI feature from the wishlist. I've run the test locally (you just need to invoke the script using pytest). The script currently doesn't need a GPU, so it can be integrated into a free-tier CI tool.

With a few modifications, it could be used to check the fastertransformer backend as well. I don't know whether the fastertransformer backend can work without a GPU, though.
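To give an idea of the shape of such a check, here is a rough sketch (the real test_setup.py does more, e.g. running setup.sh and bringing the stack up first; the endpoint path, port, and payload fields below are assumptions based on the OpenAI-style proxy):

import requests


def test_py_model_completion():
    # Ask the proxy for a short completion from the python-backend model.
    resp = requests.post(
        "http://localhost:5000/v1/engines/py-model/completions",
        json={"prompt": "def hello():", "max_tokens": 16, "temperature": 0.1},
        timeout=120,
    )
    assert resp.status_code == 200
    body = resp.json()
    # The proxy mimics the OpenAI completions response format.
    assert len(body["choices"]) > 0
    assert isinstance(body["choices"][0]["text"], str)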

@moyix mentioned this pull request Oct 21, 2022
@thakkarparth007 (Collaborator, Author) commented Nov 9, 2022

Update: fixed the test and merged with main.

To run the test, you can just run pytest tests from the root directory.

I'd been using dlpack for copying Triton tensors to torch tensors, which I did because it was advertised to perform zero-copy transfers. It turns out that only worked on my laptop and didn't work on other machines; I don't know why. For now, I'm just copying the tensors as triton<->numpy<->torch. That works on the VM on which the earlier code was segfaulting.
Signed-off-by: Parth <thakkarparth007@gmail.com>
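For illustration, the two transfer strategies look roughly like this (a sketch of the idea, not a verbatim excerpt of this PR's backend code):

import torch
import triton_python_backend_utils as pb_utils
from torch.utils.dlpack import from_dlpack


def triton_input_to_torch(request, name):
    in_tensor = pb_utils.get_input_tensor_by_name(request, name)

    # dlpack path: advertised as zero-copy, but it segfaulted on some machines.
    # return from_dlpack(in_tensor.to_dlpack())

    # numpy path used instead: an extra copy (triton -> numpy -> torch), but portable.
    return torch.from_numpy(in_tensor.as_numpy())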
@moyix merged commit 92dc571 into fauxpilot:main on Nov 23, 2022