Some minor ergonomic changes for python backend #135
Conversation
- Add a validation rule to ensure `model` is set to fastertransformer or python-backend
- Add a warning if the model is unavailable, since the user has likely not set `model` correctly

Signed-off-by: Parth Thakkar <thakkarparth007@gmail.com>
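The validation rule described above can be sketched as follows. This is a hypothetical illustration, not FauxPilot's actual code; the allowed values are taken from the description, and the function and constant names are made up here:

```python
# Hypothetical sketch of the 'model' validation rule described above.
# The names ALLOWED_MODELS and validate_model are illustrative only.
ALLOWED_MODELS = {"fastertransformer", "python-backend"}

def validate_model(payload: dict) -> str:
    """Reject requests whose 'model' field is missing or unknown."""
    model = payload.get("model")
    if model not in ALLOWED_MODELS:
        raise ValueError(
            f"'model' must be one of {sorted(ALLOWED_MODELS)}, got {model!r}; "
            "this usually means the client did not set 'model' correctly"
        )
    return model
```

Failing fast like this turns the silent "empty choices" symptom into an explicit error message.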
Using this PR, I experimented with python-backend inference to work around an unexpected issue, but I didn't get a normal inference result. Others have also questioned whether it works. Here is my evaluation result. 😭

Step 1: Set up and launch FauxPilot

(deepspeed) invain@mymate:/work/qtlab/fauxpilot$ ./setup.sh
.env already exists, do you want to delete .env and recreate it? [y/n] y
Deleting .env
Checking for curl ...
/usr/bin/curl
Checking for zstd ...
/home/invain/anaconda3/envs/deepspeed/bin/zstd
Checking for docker ...
/usr/bin/docker
Enter number of GPUs [1]:
External port for the API [5000]:
Address for Triton [triton]:
Port of Triton host [8001]:
Where do you want to save your models [/work/qtlab/fauxpilot/models]?
Choose your backend:
[1] FasterTransformer backend (faster, but limited models)
[2] Python backend (slower, but more models, and allows loading with int8)
Enter your choice [1]: 2
Models available:
[1] codegen-350M-mono (1GB total VRAM required; Python-only)
[2] codegen-350M-multi (1GB total VRAM required; multi-language)
[3] codegen-2B-mono (4GB total VRAM required; Python-only)
[4] codegen-2B-multi (4GB total VRAM required; multi-language)
Enter your choice [4]: 2
Do you want to share your huggingface cache between host and docker container? y/n [n]: n
Do you want to use int8? y/n [y]:
Config written to /work/qtlab/fauxpilot/models/py-Salesforce-codegen-350M-multi/py-model/config.pbtxt
docker: 'compose' is not a docker command.
See 'docker --help'
[+] Building 4.3s (17/17) FINISHED
=> [fauxpilot-triton internal] load build definition from Dockerfile 0.6s
=> => transferring dockerfile: 32B 0.0s
=> [fauxpilot-copilot_proxy internal] load build definition from Dockerfile 0.8s
=> => transferring dockerfile: 32B 0.0s
=> [fauxpilot-triton internal] load .dockerignore 1.2s
=> => transferring context: 35B 0.0s
=> [fauxpilot-copilot_proxy internal] load .dockerignore 1.0s
=> => transferring context: 35B 0.0s
=> [fauxpilot-copilot_proxy internal] load metadata for docker.io/library/python:3.10-slim-buster 2.4s
=> [fauxpilot-triton internal] load metadata for docker.io/moyix/triton_with_ft:22.09 0.0s
=> [fauxpilot-triton 1/3] FROM docker.io/moyix/triton_with_ft:22.09 0.0s
=> CACHED [fauxpilot-triton 2/3] RUN python3 -m pip install --disable-pip-version-check -U torch --extra-index-url https://download.pytorch.org/whl/cu116 0.0s
=> CACHED [fauxpilot-triton 3/3] RUN python3 -m pip install --disable-pip-version-check -U transformers bitsandbytes accelerate 0.0s
=> [fauxpilot-copilot_proxy] exporting to image 1.7s
=> => exporting layers 0.0s
=> => writing image sha256:1e3a6721f024a29012f8f41e5bfdca2dc7c0dbdfedfe95edfe31c0fb1d2c5bcc 0.1s
=> => naming to docker.io/library/fauxpilot-triton 0.0s
=> => writing image sha256:6c1ee95a123bb52f3504bd38cf5699e861da93448bd15c345d4b6734f130c231 0.0s
=> => naming to docker.io/library/fauxpilot-copilot_proxy 0.0s
=> [auth] library/python:pull token for registry-1.docker.io 0.0s
=> [fauxpilot-copilot_proxy internal] load build context 0.2s
=> => transferring context: 1.15kB 0.0s
=> [fauxpilot-copilot_proxy 1/5] FROM docker.io/library/python:3.10-slim-buster@sha256:b0f095dee13b2b4552d545be4f0f1c257f26810c079720c0902dc5e7f3e6b514 0.0s
=> CACHED [fauxpilot-copilot_proxy 2/5] WORKDIR /python-docker 0.0s
=> CACHED [fauxpilot-copilot_proxy 3/5] COPY copilot_proxy/requirements.txt requirements.txt 0.0s
=> CACHED [fauxpilot-copilot_proxy 4/5] RUN pip3 install --no-cache-dir -r requirements.txt 0.0s
=> CACHED [fauxpilot-copilot_proxy 5/5] COPY copilot_proxy . 0.0s
Config complete, do you want to run FauxPilot? [y/n] y
unknown flag: --remove-orphans
[+] Running 2/0
⠿ Container fauxpilot-copilot_proxy-1 Running 0.0s
⠿ Container fauxpilot-triton-1 Running 0.0s
Attaching to fauxpilot-copilot_proxy-1, fauxpilot-triton-1
fauxpilot-copilot_proxy-1 | INFO: Shutting down
fauxpilot-copilot_proxy-1 | INFO: Waiting for application shutdown.
fauxpilot-copilot_proxy-1 | INFO: Application shutdown complete.
fauxpilot-copilot_proxy-1 | INFO: Finished server process [1]
fauxpilot-copilot_proxy-1 exited with code 0
fauxpilot-copilot_proxy-1 exited with code 0
fauxpilot-triton-1 | I0103 02:23:34.782117 89 server.cc:257] Waiting for in-flight requests to complete.
fauxpilot-triton-1 | I0103 02:23:34.782160 89 server.cc:273] Timeout 30: Found 0 model versions that have in-flight inferences
fauxpilot-triton-1 | I0103 02:23:34.782170 89 model_repository_manager.cc:1223] unloading: py-model:1
fauxpilot-triton-1 | I0103 02:23:34.782295 89 server.cc:288] All models are stopped, unloading models
fauxpilot-triton-1 | I0103 02:23:34.782305 89 server.cc:295] Timeout 30: Found 1 live models and 0 in-flight non-inference requests
........... Omission ..........

Step 2: Run a client with the REST API

(deepspeed) invain@mymate:/work/qtlab/fauxpilot$ curl -s -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{"model":"py-model","prompt":"int hello(){","max_tokens":50,"temperature":0.1,"stop":["\n\n"], "logprobs": 0}' http://localhost:5000/v1/engines/codegen/completions
{"id": "cmpl-qEEtJHgU4QXZXojADgcNCT6u2OkaL", "choices": []}

(deepspeed) invain@mymate:/work/qtlab/fauxpilot$ curl -s -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{"prompt":"int hello(){","max_tokens":50,"temperature":0.1,"stop":["\n\n"]}' http://localhost:5000/v1/engines/codegen/completions
{"id": "cmpl-4Lbo1AMRmrszM0TVo2SvOTj3Rojln", "choices": []}
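For reference, the first curl call above can be reproduced from Python with the standard library. This is a generic sketch (no FauxPilot client code is assumed); the endpoint and payload are copied from the transcript, and the empty-`choices` response is the symptom being reported:

```python
import json
import urllib.request

# Payload copied from the curl command in the transcript above.
payload = {
    "model": "py-model",
    "prompt": "int hello(){",
    "max_tokens": 50,
    "temperature": 0.1,
    "stop": ["\n\n"],
    "logprobs": 0,
}

def completions_request(url: str, payload: dict) -> urllib.request.Request:
    """Build the same POST request that the curl command sends."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Accept": "application/json",
                 "Content-Type": "application/json"},
        method="POST",
    )

req = completions_request(
    "http://localhost:5000/v1/engines/codegen/completions", payload)
# urllib.request.urlopen(req) would send it. The transcript shows the
# server answering with an empty "choices" list, for example:
response = json.loads('{"id": "cmpl-example", "choices": []}')
assert response["choices"] == []  # the bug: no completions returned
```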
It seems I'm missing something. I'm looking for the cause in the FauxPilot server's log messages.
I found a cause of the problem with this PR and submitted PR #137. Please verify it.
I think what's happening is that your docker images are cached. See how even the last line shows [...]. I'm not certain, but could you try running [...]?

I do run that command manually. Ideally it should be included in [...].
@thakkarparth007, please refer to PR #137.
Neat. Let's go ahead.
Acked-by: Geunsik Lim geunsik.lim@samsung.com
Very simple changes:

- Add a validation rule to ensure `model` is set to fastertransformer or python-backend. This is to avoid confusion as in "Due to the Python backend, FauxPilot returns an inference error" (#134).
- Make `model` a required, non-default field.
- `models.py` has `logprobs` set as an optional integer, defaulting to None. The code that reads this variable from the payload is like `data.get('logprobs', None)`, and if it's None, the logprobs value is set to 1. It seems best not to compute logprobs unless the client explicitly asks for it.

Signed-off-by: Parth Thakkar <thakkarparth007@gmail.com>
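The logprobs change can be illustrated with a minimal sketch. The function names here are illustrative, not FauxPilot's exact code; only the `data.get('logprobs', None)` read is taken from the description:

```python
from typing import Optional

def logprobs_before(data: dict) -> int:
    """Old behaviour: a missing/None 'logprobs' was coerced to 1,
    so log-probabilities were always computed."""
    logprobs = data.get("logprobs", None)
    return 1 if logprobs is None else logprobs

def logprobs_after(data: dict) -> Optional[int]:
    """Changed behaviour: keep None, so log-probabilities are only
    computed when the client explicitly asks for them."""
    return data.get("logprobs", None)
```

With the change, a request that omits `logprobs` skips the extra computation entirely instead of silently getting `logprobs=1`.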