Problem with getting code-completions from fauxpilot server #196
andreybond started this conversation in General
I'm running setup & launching the server:
./setup.sh
.env already exists, do you want to delete .env and recreate it? [y/n] y
Deleting .env
Checking for curl ...
/usr/bin/curl
Checking for zstd ...
/usr/bin/zstd
Checking for docker ...
/usr/bin/docker
Enter number of GPUs [1]: 2
External port for the API [5000]:
Address for Triton [triton]: 192.168.1.15
Port of Triton host [8001]:
Where do you want to save your models [/home/andrey/works/ctco/codegen/fauxpilot/models]?
Choose your backend:
[1] FasterTransformer backend (faster, but limited models)
[2] Python backend (slower, but more models, and allows loading with int8)
Enter your choice [1]:
Models available:
[1] codegen-350M-mono (2GB total VRAM required; Python-only)
[2] codegen-350M-multi (2GB total VRAM required; multi-language)
[3] codegen-2B-mono (7GB total VRAM required; Python-only)
[4] codegen-2B-multi (7GB total VRAM required; multi-language)
[5] codegen-6B-mono (13GB total VRAM required; Python-only)
[6] codegen-6B-multi (13GB total VRAM required; multi-language)
[7] codegen-16B-mono (32GB total VRAM required; Python-only)
[8] codegen-16B-multi (32GB total VRAM required; multi-language)
Enter your choice [6]: 2
/home/andrey/works/ctco/codegen/fauxpilot/models/codegen-350M-multi-2gpu
Converted model for codegen-350M-multi-2gpu already exists.
Do you want to re-use it? y/n: y
Re-using model
Config complete, do you want to run FauxPilot? [y/n] y
[+] Building 1.1s (16/16) FINISHED
................
[+] Running 2/2
⠿ Container fauxpilot-copilot_proxy-1 Recreated 7.8s
⠿ Container fauxpilot-triton-1 Recreated
................
fauxpilot-triton-1 | I0428 07:24:39.094419 94 libfastertransformer.cc:307] Before Loading Model:
fauxpilot-triton-1 | [FT][WARNING] Custom All Reduce only supports 8 Ranks currently. Using NCCL as Comm.
fauxpilot-triton-1 | after allocation, free 10.60 GB total 10.92 GB
fauxpilot-triton-1 | after allocation, free 10.60 GB total 10.91 GB
fauxpilot-triton-1 | [WARNING] gemm_config.in is not found; using default GEMM algo
fauxpilot-triton-1 | [WARNING] gemm_config.in is not found; using default GEMM algo
fauxpilot-triton-1 | after allocation, free 3.50 GB total 10.92 GB
fauxpilot-triton-1 | after allocation, free 3.50 GB total 10.91 GB
fauxpilot-triton-1 | I0428 07:25:00.672894 94 libfastertransformer.cc:321] After Loading Model:
fauxpilot-triton-1 | I0428 07:25:00.673516 94 libfastertransformer.cc:537] Model instance is created on GPU NVIDIA GeForce GTX 1080 Ti
fauxpilot-triton-1 | I0428 07:25:00.673835 94 model_repository_manager.cc:1345] successfully loaded 'fastertransformer' version 1
fauxpilot-triton-1 | I0428 07:25:00.674023 94 server.cc:556]
fauxpilot-triton-1 | +------------------+------+
fauxpilot-triton-1 | | Repository Agent | Path |
fauxpilot-triton-1 | +------------------+------+
fauxpilot-triton-1 | +------------------+------+
fauxpilot-triton-1 |
fauxpilot-triton-1 | I0428 07:25:00.674169 94 server.cc:583]
fauxpilot-triton-1 | +-------------------+-----------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
fauxpilot-triton-1 | | Backend | Path | Config |
fauxpilot-triton-1 | +-------------------+-----------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
fauxpilot-triton-1 | | fastertransformer | /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so | {"cmdline":{"auto-complete-config":"false","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}} |
fauxpilot-triton-1 | +-------------------+-----------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
fauxpilot-triton-1 |
fauxpilot-triton-1 | I0428 07:25:00.674275 94 server.cc:626]
fauxpilot-triton-1 | +-------------------+---------+--------+
fauxpilot-triton-1 | | Model | Version | Status |
fauxpilot-triton-1 | +-------------------+---------+--------+
fauxpilot-triton-1 | | fastertransformer | 1 | READY |
fauxpilot-triton-1 | +-------------------+---------+--------+
fauxpilot-triton-1 |
fauxpilot-triton-1 | I0428 07:25:00.684605 94 metrics.cc:650] Collecting metrics for GPU 0: NVIDIA GeForce GTX 1080 Ti
fauxpilot-triton-1 | I0428 07:25:00.684642 94 metrics.cc:650] Collecting metrics for GPU 1: NVIDIA GeForce GTX 1080 Ti
fauxpilot-triton-1 | I0428 07:25:00.685160 94 tritonserver.cc:2159]
fauxpilot-triton-1 | +----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
fauxpilot-triton-1 | | Option | Value |
fauxpilot-triton-1 | +----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
fauxpilot-triton-1 | | server_id | triton |
fauxpilot-triton-1 | | server_version | 2.23.0 |
fauxpilot-triton-1 | | server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
fauxpilot-triton-1 | | model_repository_path[0] | /model |
fauxpilot-triton-1 | | model_control_mode | MODE_NONE |
fauxpilot-triton-1 | | strict_model_config | 1 |
fauxpilot-triton-1 | | rate_limit | OFF |
fauxpilot-triton-1 | | pinned_memory_pool_byte_size | 268435456 |
fauxpilot-triton-1 | | cuda_memory_pool_byte_size{0} | 67108864 |
fauxpilot-triton-1 | | cuda_memory_pool_byte_size{1} | 67108864 |
fauxpilot-triton-1 | | response_cache_byte_size | 0 |
fauxpilot-triton-1 | | min_supported_compute_capability | 6.0 |
fauxpilot-triton-1 | | strict_readiness | 1 |
fauxpilot-triton-1 | | exit_timeout | 30 |
fauxpilot-triton-1 | +----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
fauxpilot-triton-1 |
fauxpilot-triton-1 | I0428 07:25:00.686302 94 grpc_server.cc:4587] Started GRPCInferenceService at 0.0.0.0:8001
fauxpilot-triton-1 | I0428 07:25:00.686586 94 http_server.cc:3303] Started HTTPService at 0.0.0.0:8000
fauxpilot-triton-1 | I0428 07:25:00.733005 94 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002
Invoking the service via curl:
curl -s -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{"prompt":"def hello_world():","max_tokens":100,"temperature":0.1,"stop":["\n\n"]}' http://192.168.1.15:5000/v1/engines/codegen/completions
This gives an empty result:
{"id": "cmpl-cDUwBbCvDXL8Cn7E0pBGQ2VZVluBn", "choices": []}
Moreover, nvtop shows that GPU memory is occupied (I have two 1080 Ti cards, 11 GB each), but the GPUs themselves stay at 0% utilization during the request.
Here are the logs from server:
fauxpilot-copilot_proxy-1 | [StatusCode.UNAVAILABLE] failed to connect to all addresses
fauxpilot-copilot_proxy-1 | WARNING: Model 'fastertransformer' is not available. Please ensure that model is set to either 'fastertransformer' or 'py-model' depending on your installation
fauxpilot-copilot_proxy-1 | Returned completion in 20003.748178482056 ms
fauxpilot-copilot_proxy-1 | INFO: 2023-04-28 07:30:09,487 :: 192.168.1.245:45330 - "POST /v1/engines/codegen/completions HTTP/1.1" 200 OK
I'm a bit worried about that "StatusCode.UNAVAILABLE": could it be related to the empty responses I'm receiving?
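For what it's worth, `StatusCode.UNAVAILABLE` is a gRPC client-side error, which usually means the proxy never reached Triton at all. One way to sanity-check this (a sketch with a hypothetical helper, assuming Triton's HTTP port 8000 is reachable from wherever the proxy runs) is to probe Triton's standard readiness endpoint:

```python
import urllib.error
import urllib.request


def triton_ready_url(host, http_port=8000):
    """Triton's KServe-v2 readiness endpoint (served on the HTTP port,
    not the gRPC port 8001 that the proxy connects to)."""
    return f"http://{host}:{http_port}/v2/health/ready"


def is_triton_ready(host, http_port=8000, timeout=5):
    """Return True if Triton answers HTTP 200 on its readiness endpoint."""
    try:
        with urllib.request.urlopen(
            triton_ready_url(host, http_port), timeout=timeout
        ) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused / timed out: Triton is not reachable from here.
        return False


if __name__ == "__main__":
    print(is_triton_ready("192.168.1.15"))
```

If this returns `False` from the machine running the proxy container, the empty completions are almost certainly just the proxy failing to reach Triton at the configured address.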
Calling from Python gives the same empty results.
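For illustration, a minimal sketch of such a Python call (helper names are mine; the endpoint and payload match the curl command above):

```python
import json
import urllib.request

# Same endpoint as the curl example.
API_URL = "http://192.168.1.15:5000/v1/engines/codegen/completions"


def build_payload(prompt, max_tokens=100, temperature=0.1, stop=("\n\n",)):
    """Build the same JSON body the curl example sends."""
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stop": list(stop),
    }


def complete(prompt):
    """POST the prompt to the FauxPilot proxy and return the parsed response."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        API_URL,
        data=data,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(complete("def hello_world():"))
```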
Just in case: I'm running FauxPilot on my local server and connecting from a laptop on the same local network. curl, though, gives the same results regardless of where it is invoked (laptop or GPU server).