Problem with getting code-completions from fauxpilot server #196
andreybond started this conversation in General
I'm running setup & launching the server:
./setup.sh
.env already exists, do you want to delete .env and recreate it? [y/n] y
Deleting .env
Checking for curl ...
/usr/bin/curl
Checking for zstd ...
/usr/bin/zstd
Checking for docker ...
/usr/bin/docker
Enter number of GPUs [1]: 2
External port for the API [5000]:
Address for Triton [triton]: 192.168.1.15
Port of Triton host [8001]:
Where do you want to save your models [/home/andrey/works/ctco/codegen/fauxpilot/models]?
Choose your backend:
[1] FasterTransformer backend (faster, but limited models)
[2] Python backend (slower, but more models, and allows loading with int8)
Enter your choice [1]:
Models available:
[1] codegen-350M-mono (2GB total VRAM required; Python-only)
[2] codegen-350M-multi (2GB total VRAM required; multi-language)
[3] codegen-2B-mono (7GB total VRAM required; Python-only)
[4] codegen-2B-multi (7GB total VRAM required; multi-language)
[5] codegen-6B-mono (13GB total VRAM required; Python-only)
[6] codegen-6B-multi (13GB total VRAM required; multi-language)
[7] codegen-16B-mono (32GB total VRAM required; Python-only)
[8] codegen-16B-multi (32GB total VRAM required; multi-language)
Enter your choice [6]: 2
/home/andrey/works/ctco/codegen/fauxpilot/models/codegen-350M-multi-2gpu
Converted model for codegen-350M-multi-2gpu already exists.
Do you want to re-use it? y/n: y
Re-using model
Config complete, do you want to run FauxPilot? [y/n] y
[+] Building 1.1s (16/16) FINISHED
................
[+] Running 2/2
⠿ Container fauxpilot-copilot_proxy-1 Recreated 7.8s
⠿ Container fauxpilot-triton-1 Recreated
................
fauxpilot-triton-1 | I0428 07:24:39.094419 94 libfastertransformer.cc:307] Before Loading Model:
fauxpilot-triton-1 | [FT][WARNING] Custom All Reduce only supports 8 Ranks currently. Using NCCL as Comm.
fauxpilot-triton-1 | after allocation, free 10.60 GB total 10.92 GB
fauxpilot-triton-1 | after allocation, free 10.60 GB total 10.91 GB
fauxpilot-triton-1 | [WARNING] gemm_config.in is not found; using default GEMM algo
fauxpilot-triton-1 | [WARNING] gemm_config.in is not found; using default GEMM algo
fauxpilot-triton-1 | after allocation, free 3.50 GB total 10.92 GB
fauxpilot-triton-1 | after allocation, free 3.50 GB total 10.91 GB
fauxpilot-triton-1 | I0428 07:25:00.672894 94 libfastertransformer.cc:321] After Loading Model:
fauxpilot-triton-1 | I0428 07:25:00.673516 94 libfastertransformer.cc:537] Model instance is created on GPU NVIDIA GeForce GTX 1080 Ti
fauxpilot-triton-1 | I0428 07:25:00.673835 94 model_repository_manager.cc:1345] successfully loaded 'fastertransformer' version 1
fauxpilot-triton-1 | I0428 07:25:00.674023 94 server.cc:556]
fauxpilot-triton-1 | +------------------+------+
fauxpilot-triton-1 | | Repository Agent | Path |
fauxpilot-triton-1 | +------------------+------+
fauxpilot-triton-1 | +------------------+------+
fauxpilot-triton-1 |
fauxpilot-triton-1 | I0428 07:25:00.674169 94 server.cc:583]
fauxpilot-triton-1 | +-------------------+-----------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
fauxpilot-triton-1 | | Backend | Path | Config |
fauxpilot-triton-1 | +-------------------+-----------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
fauxpilot-triton-1 | | fastertransformer | /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so | {"cmdline":{"auto-complete-config":"false","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}} |
fauxpilot-triton-1 | +-------------------+-----------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
fauxpilot-triton-1 |
fauxpilot-triton-1 | I0428 07:25:00.674275 94 server.cc:626]
fauxpilot-triton-1 | +-------------------+---------+--------+
fauxpilot-triton-1 | | Model | Version | Status |
fauxpilot-triton-1 | +-------------------+---------+--------+
fauxpilot-triton-1 | | fastertransformer | 1 | READY |
fauxpilot-triton-1 | +-------------------+---------+--------+
fauxpilot-triton-1 |
fauxpilot-triton-1 | I0428 07:25:00.684605 94 metrics.cc:650] Collecting metrics for GPU 0: NVIDIA GeForce GTX 1080 Ti
fauxpilot-triton-1 | I0428 07:25:00.684642 94 metrics.cc:650] Collecting metrics for GPU 1: NVIDIA GeForce GTX 1080 Ti
fauxpilot-triton-1 | I0428 07:25:00.685160 94 tritonserver.cc:2159]
fauxpilot-triton-1 | +----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
fauxpilot-triton-1 | | Option | Value |
fauxpilot-triton-1 | +----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
fauxpilot-triton-1 | | server_id | triton |
fauxpilot-triton-1 | | server_version | 2.23.0 |
fauxpilot-triton-1 | | server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
fauxpilot-triton-1 | | model_repository_path[0] | /model |
fauxpilot-triton-1 | | model_control_mode | MODE_NONE |
fauxpilot-triton-1 | | strict_model_config | 1 |
fauxpilot-triton-1 | | rate_limit | OFF |
fauxpilot-triton-1 | | pinned_memory_pool_byte_size | 268435456 |
fauxpilot-triton-1 | | cuda_memory_pool_byte_size{0} | 67108864 |
fauxpilot-triton-1 | | cuda_memory_pool_byte_size{1} | 67108864 |
fauxpilot-triton-1 | | response_cache_byte_size | 0 |
fauxpilot-triton-1 | | min_supported_compute_capability | 6.0 |
fauxpilot-triton-1 | | strict_readiness | 1 |
fauxpilot-triton-1 | | exit_timeout | 30 |
fauxpilot-triton-1 | +----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
fauxpilot-triton-1 |
fauxpilot-triton-1 | I0428 07:25:00.686302 94 grpc_server.cc:4587] Started GRPCInferenceService at 0.0.0.0:8001
fauxpilot-triton-1 | I0428 07:25:00.686586 94 http_server.cc:3303] Started HTTPService at 0.0.0.0:8000
fauxpilot-triton-1 | I0428 07:25:00.733005 94 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002
Invoking the service via curl:
curl -s -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{"prompt":"def hello_world():","max_tokens":100,"temperature":0.1,"stop":["\n\n"]}' http://192.168.1.15:5000/v1/engines/codegen/completions
This gives an empty result:
{"id": "cmpl-cDUwBbCvDXL8Cn7E0pBGQ2VZVluBn", "choices": []}
Moreover, nvtop shows that GPU memory is occupied (I have two 1080 Ti cards, 11 GB each), but the GPUs themselves stay at 0% utilization during the request.
Here are the logs from server:
fauxpilot-copilot_proxy-1 | [StatusCode.UNAVAILABLE] failed to connect to all addresses
fauxpilot-copilot_proxy-1 | WARNING: Model 'fastertransformer' is not available. Please ensure that model is set to either 'fastertransformer' or 'py-model' depending on your installation
fauxpilot-copilot_proxy-1 | Returned completion in 20003.748178482056 ms
fauxpilot-copilot_proxy-1 | INFO: 2023-04-28 07:30:09,487 :: 192.168.1.245:45330 - "POST /v1/engines/codegen/completions HTTP/1.1" 200 OK
I'm a bit worried about that "StatusCode.UNAVAILABLE": could it be related to the empty responses I'm receiving?
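For what it's worth, `StatusCode.UNAVAILABLE` is a gRPC client-side error, which usually means the proxy never reached Triton at all. One way to sanity-check this (a sketch with a hypothetical helper, assuming Triton's HTTP port 8000 is reachable from wherever the proxy runs) is to probe Triton's standard readiness endpoint:

```python
import urllib.error
import urllib.request


def triton_ready_url(host, http_port=8000):
    """Triton's KServe-v2 readiness endpoint (served on the HTTP port,
    not the gRPC port 8001 that the proxy connects to)."""
    return f"http://{host}:{http_port}/v2/health/ready"


def is_triton_ready(host, http_port=8000, timeout=5):
    """Return True if Triton answers HTTP 200 on its readiness endpoint."""
    try:
        with urllib.request.urlopen(
            triton_ready_url(host, http_port), timeout=timeout
        ) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused / timed out: Triton is not reachable from here.
        return False


if __name__ == "__main__":
    print(is_triton_ready("192.168.1.15"))
```

If this returns `False` from the machine running the proxy container, the empty completions are almost certainly just the proxy failing to reach Triton at the configured address.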
Calling from Python gives the same empty results.
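For illustration, a minimal sketch of such a Python call (helper names are mine; the endpoint and payload match the curl command above):

```python
import json
import urllib.request

# Same endpoint as the curl example.
API_URL = "http://192.168.1.15:5000/v1/engines/codegen/completions"


def build_payload(prompt, max_tokens=100, temperature=0.1, stop=("\n\n",)):
    """Build the same JSON body the curl example sends."""
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stop": list(stop),
    }


def complete(prompt):
    """POST the prompt to the FauxPilot proxy and return the parsed response."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        API_URL,
        data=data,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(complete("def hello_world():"))
```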
Just in case: I'm running FauxPilot on my local server and connecting from a laptop on the same local network. curl, though, gives the same results regardless of where it is invoked (laptop or GPU server).