From 18239fa7fbff25d01cefd072d1e2d0740fb5f5da Mon Sep 17 00:00:00 2001
From: Pierrick HYMBERT
Date: Sun, 25 Feb 2024 19:30:15 +0100
Subject: [PATCH 1/5] server: docs - refresh and tease a little bit more the
 http server

---
 README.md                 |  5 +++++
 examples/server/README.md | 20 +++++++++++++++++---
 2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index d61f9171b1b62..3067004c0d2a7 100644
--- a/README.md
+++ b/README.md
@@ -114,6 +114,11 @@ Typically finetunes of the base models below are supported as well.
 - [x] [MobileVLM 1.7B/3B models](https://huggingface.co/models?search=mobileVLM)
 - [x] [Yi-VL](https://huggingface.co/models?search=Yi-VL)
 
+**HTTP server**
+
+We offer a fast, lightweight [OpenAI API](https://github.com/openai/openai-openapi) compatible HTTP server.
+
+[LLaMA.cpp web server](./examples/server) can be used to serve local models and easily connect them to existing clients.
 
 **Bindings:**
 
diff --git a/examples/server/README.md b/examples/server/README.md
index cb3fd6054095b..9973271965ef7 100644
--- a/examples/server/README.md
+++ b/examples/server/README.md
@@ -1,8 +1,22 @@
-# llama.cpp/example/server
+# LLaMA.cpp HTTP Server
 
-This example demonstrates a simple HTTP API server and a simple web front end to interact with llama.cpp.
+Fast, lightweight, production ready pure C/C++ HTTP server based on [httplib](https://github.com/yhirose/cpp-httplib), [nlohmann::json](https://github.com/nlohmann/json) and **llama.cpp**.
 
-Command line options:
+Set of LLM REST APIs and a simple web front end to interact with llama.cpp.
+
+**Features:**
+ * SOTA LLM inference performance with GGUF quantized models on GPU and CPU
+ * [OpenAI API](https://github.com/openai/openai-openapi) compatibles chat completions and embeddings routes
+ * Continuous batching
+ * KV cache attention
+ * Embedding
+ * Multimodal
+ * API Key security
+ * Production ready monitoring endpoints
+
+The project is under active development, and we are [looking for feedback and contributors](https://github.com/ggerganov/llama.cpp/issues/4216).
+
+**Command line options:**
 
 - `--threads N`, `-t N`: Set the number of threads to use during generation.
 - `-tb N, --threads-batch N`: Set the number of threads to use during batch and prompt processing. If not specified, the number of threads will be set to the number of threads used for generation.

From 1294debdc50106ee7e51a579225ce0149b28ea1d Mon Sep 17 00:00:00 2001
From: Pierrick Hymbert
Date: Sun, 25 Feb 2024 21:23:19 +0100
Subject: [PATCH 2/5] Rephrase README.md server doc

Co-authored-by: Georgi Gerganov
---
 README.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 3067004c0d2a7..d0af5d0b9b077 100644
--- a/README.md
+++ b/README.md
@@ -116,9 +116,7 @@ Typically finetunes of the base models below are supported as well.
 
 **HTTP server**
 
-We offer a fast, lightweight [OpenAI API](https://github.com/openai/openai-openapi) compatible HTTP server.
-
-[LLaMA.cpp web server](./examples/server) can be used to serve local models and easily connect them to existing clients.
+[llama.cpp web server](./examples/server) is a lightweight [OpenAI API](https://github.com/openai/openai-openapi) compatible HTTP server that can be used to serve local models and easily connect them to existing clients.
 
 **Bindings:**
 

From 42d781e2644f5aeb733eb75cbc4c93dad8374a66 Mon Sep 17 00:00:00 2001
From: Pierrick Hymbert
Date: Sun, 25 Feb 2024 21:43:01 +0100
Subject: [PATCH 3/5] Update examples/server/README.md

Co-authored-by: Georgi Gerganov
---
 examples/server/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/server/README.md b/examples/server/README.md
index 9973271965ef7..2fdaeac7000ed 100644
--- a/examples/server/README.md
+++ b/examples/server/README.md
@@ -1,6 +1,6 @@
 # LLaMA.cpp HTTP Server
 
-Fast, lightweight, production ready pure C/C++ HTTP server based on [httplib](https://github.com/yhirose/cpp-httplib), [nlohmann::json](https://github.com/nlohmann/json) and **llama.cpp**.
+Fast, lightweight, pure C/C++ HTTP server based on [httplib](https://github.com/yhirose/cpp-httplib), [nlohmann::json](https://github.com/nlohmann/json) and **llama.cpp**.
 
 Set of LLM REST APIs and a simple web front end to interact with llama.cpp.
 

From e647ed4ada79bf3036fabd3c6010a4be22a6b645 Mon Sep 17 00:00:00 2001
From: Pierrick Hymbert
Date: Sun, 25 Feb 2024 21:43:34 +0100
Subject: [PATCH 4/5] Update examples/server/README.md

Co-authored-by: Georgi Gerganov
---
 examples/server/README.md | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/examples/server/README.md b/examples/server/README.md
index 2fdaeac7000ed..086a5ba9962af 100644
--- a/examples/server/README.md
+++ b/examples/server/README.md
@@ -5,14 +5,13 @@ Fast, lightweight, pure C/C++ HTTP server based on [httplib](https://github.com/
 Set of LLM REST APIs and a simple web front end to interact with llama.cpp.
 
 **Features:**
- * SOTA LLM inference performance with GGUF quantized models on GPU and CPU
- * [OpenAI API](https://github.com/openai/openai-openapi) compatibles chat completions and embeddings routes
+ * LLM inference of F16 and quantum models on GPU and CPU
+ * [OpenAI API](https://github.com/openai/openai-openapi) compatible chat completions and embeddings routes
+ * Parallel decoding with multi-user support
  * Continuous batching
- * KV cache attention
- * Embedding
- * Multimodal
- * API Key security
- * Production ready monitoring endpoints
+ * Multimodal (wip)
+ * API key security
+ * Monitoring endpoints
 
 The project is under active development, and we are [looking for feedback and contributors](https://github.com/ggerganov/llama.cpp/issues/4216).
 

From 69e8e66afd486d471f3d9ebf3b12cd65ea5f5668 Mon Sep 17 00:00:00 2001
From: Pierrick Hymbert
Date: Sun, 25 Feb 2024 21:44:11 +0100
Subject: [PATCH 5/5] Update README.md

---
 examples/server/README.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/examples/server/README.md b/examples/server/README.md
index 086a5ba9962af..0e9bd7fd404ba 100644
--- a/examples/server/README.md
+++ b/examples/server/README.md
@@ -10,7 +10,6 @@ Set of LLM REST APIs and a simple web front end to interact with llama.cpp.
  * Parallel decoding with multi-user support
  * Continuous batching
  * Multimodal (wip)
- * API key security
  * Monitoring endpoints
 
 The project is under active development, and we are [looking for feedback and contributors](https://github.com/ggerganov/llama.cpp/issues/4216).
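
A note on the route these patches document: because the chat completions endpoint follows the OpenAI schema, any plain HTTP client can exercise it. Below is a minimal Python sketch, not part of the patch series itself; it assumes the server is running at its default `http://localhost:8080`, and both `YOUR_API_KEY` (only relevant if the server was started with an API key) and the `model` value are illustrative placeholders.

```python
import requests

# Minimal sketch of a request against the server's OpenAI API compatible
# chat completions route. Base URL assumes the server's default host/port;
# the API key and model name are placeholders, not values from the patches.
BASE_URL = "http://localhost:8080"

response = requests.post(
    f"{BASE_URL}/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        # Only needed if the server was configured to require an API key.
        "Authorization": "Bearer YOUR_API_KEY",
    },
    json={
        "model": "local-model",  # illustrative; a single-model server serves its loaded model
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Say hello in one sentence."},
        ],
    },
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```

Since the route mirrors the OpenAI schema, pointing an official OpenAI client library at the same base URL should work as well.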