Server example with API Rest #1443

Merged
14 commits merged into ggerganov:master on May 21, 2023

Conversation

FSSRepo (Collaborator) commented May 14, 2023

Hi. These days I have been working on adding an API to llama.cpp using cpp-httplib. I think it could help people integrate llama.cpp into their projects easily. I know there are already several alternatives, such as bindings for Node.js and Python, but with this example I intend to provide it natively in C++.

For now, it can only be compiled with CMake on Windows, Linux, and macOS. All usage and API information can be found in the README.md file inside the examples/server directory. It doesn't require any external dependencies to be installed; the headers it uses are bundled with the example.

Edit:

Current available features:

  • Completion (wait to end, in loop)
  • Custom prompt (generation and interactive behavior)
  • Tokenize
  • Embeddings

Any suggestions or contributions to this PR are welcome.
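
For readers unfamiliar with cpp-httplib, the rough shape of such an endpoint looks like the sketch below. This is a minimal illustration only: the endpoint name, JSON fields, and the commented-out llama call are placeholders, not the actual code in examples/server.

```cpp
// Minimal sketch of a JSON endpoint using the bundled httplib.h and a
// single-header JSON library; names and fields here are illustrative.
#include "httplib.h"
#include "json.hpp"

#include <string>

using json = nlohmann::json;

int main() {
    httplib::Server svr;

    svr.Post("/tokenize", [](const httplib::Request & req, httplib::Response & res) {
        json body = json::parse(req.body);
        std::string content = body.value("content", "");

        // The real server would call llama_tokenize() on the loaded context
        // here and fill this array with the resulting token ids.
        json tokens = json::array();
        (void) content;

        res.set_content(json{{"tokens", tokens}}.dump(), "application/json");
    });

    svr.listen("127.0.0.1", 8080);
    return 0;
}
```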

CRD716 (Contributor) commented May 14, 2023

I feel like this goes pretty directly against the no dependencies rule.

FSSRepo (Collaborator, Author) commented May 14, 2023

@CRD716 why?

CRD716 (Contributor) commented May 14, 2023

Because this adds extra headers and an external library in the form of cpp-httplib.

Avoid adding third-party dependencies, extra files, extra headers, etc.

  • Readme

FSSRepo (Collaborator, Author) commented May 14, 2023

So, would it be better if this were a separate project that will never be part of the master repository? Is there no other way, or should I just close this PR? cuBLAS and CLBlast are also third-party libraries.

CRD716 (Contributor) commented May 14, 2023

I personally think this would be better as a separate project, but it's really up to ggerganov whether this is acceptable within the examples or not.

prusnak (Sponsor Collaborator) commented May 14, 2023

@CRD716 this has already been discussed in #1025 (which this PR supersedes), and we want to have a server example in this repository.

What I am not sure about is introducing further dependencies (json11 in this PR, especially since json11 is no longer being developed/maintained). If we really want JSON support, maybe we can use https://github.com/nlohmann/json, which is a) maintained and b) a single-file, include-only library?

ggerganov (Owner):

Yes, what @prusnak says

In general, for this kind of example that depends on something extra, make sure to:

  • Not add them to the Makefile. Use only CMake and put the example behind a CMake option that is disabled by default
  • Put all 3rd party dependencies (i.e. json lib, etc.) inside the new folder in the examples. They will be used only by that example
  • If possible, provide a separate CI job for that example so that it gets long-term support when breaking changes occur

Keep in mind that we might decide to remove such examples at any point if the maintenance effort becomes too big and there is nobody to do it.

An alternative approach is to make a fork and add a README example in llama.cpp that describes the purpose of the example and links to the fork. This way, the maintenance effort will be up to the owner of the fork.


Note that the rules in the README refer to third-party dependencies added to the core llama and ggml code.
The examples can link to small dependencies, but we still have to be thoughtful about what we link and choose the minimal option.

FSSRepo (Collaborator, Author) commented May 14, 2023

If we really want JSON support, maybe we can use https://github.com/nlohmann/json which is a) maintained and b) single-file include only library?

I will change it.


examples/server/server.h (outdated)
fprintf(stderr, " -h, --help show this help message and exit\n");
fprintf(stderr, " -s SEED, --seed SEED RNG seed (default: -1, use random seed for < 0)\n");
fprintf(stderr, " --memory_f32 use f32 instead of f16 for memory key+value\n");
fprintf(stderr, " --keep number of tokens to keep from the initial prompt (default: %d, -1 = all)\n", params.n_keep);

Sponsor Collaborator:

n_keep should be passed to the API, since it determines how any given input is evaluated.

FSSRepo (Author):

You're right

Comment on lines 687 to 699
auto role = ctx_msg["role"].get<std::string>();
if (role == "system")
{
    llama->params.prompt = ctx_msg["content"].get<std::string>() + "\n\n";
}
else if (role == "user")
{
    llama->params.prompt += llama->user_tag + " " + ctx_msg["content"].get<std::string>() + "\n";
}
else if (role == "assistant")
{
    llama->params.prompt += llama->assistant_tag + " " + ctx_msg["content"].get<std::string>() + "\n";
}

Sponsor Collaborator:

This formatting seems too fixed, with the spaces and newlines. I don't think it can work for Alpaca at all.

What happens if the role is not system or user or assistant?

FSSRepo (Author):

When the role is not user, assistant, or system, the content is ignored.

Regarding the Alpaca model, passing the following options to setting-context should work:

const axios = require('axios');

async function Test() {
    let result = await axios.post("http://127.0.0.1:8080/setting-context", {
        context: [
            { role: "system", content: "Below is an instruction that describes a task. Write a response that appropriately completes the request." },
            { role: "user", content: "Write 3 random words" },
            { role: "assistant", content: "1. Sunshine\n2. Elephant\n3. Symphony" }
        ],
        tags: { user: "### Instruction:", assistant: "### Response:" },
        batch_size: 256,
        temperature: 0.2,
        top_k: 40,
        top_p: 0.9,
        n_predict: 2048,
        threads: 5
    });
    result = await axios.post("http://127.0.0.1:8080/set-message", {
        message: 'How to do a Hello word in C++, step by step'
    });
    if(result.data.can_inference) {
        result = await axios.get("http://127.0.0.1:8080/completion?stream=true", { responseType: 'stream' });
        result.data.on('data', (data) => {
            let dat = JSON.parse(data.toString());
            // token by token completion
            process.stdout.write(dat.content);
            if(dat.stop) {
                console.log("Completed");
            }
        });
    }
}

Test();

Output:

node test.js


1. Include the <iostream> header file
2. Use the "cout" object to output the text "Hello World!"
3. Add a new line character at the end of the output
4. Display the output on the screen

Example code:
'''c++
#include <iostream>

int main() {
    cout << "Hello World!" << endl;
    return 0;
}
'''
Completed

Sponsor Collaborator:

I mean it will actually format it as "### Instruction: Write 3 random words\n### Response: 1. Sunshine\n2. Elephant\n3. Symphony", whereas Alpaca is trained on "### Instruction:\nWrite 3 random words\n\n### Response:\n1. Sunshine\n2. Elephant\n3. Symphony". It does work, but that's because LLMs are flexible; it's not optimal.

Don't get me wrong, I am trying to look out for you: if you decide to add this prompt templating to the API, you will be fixing it and adding features to it for a long time as people come up with new models and ideas.

How about this?

const context = [
    { user: "Write 3 random words", assistant: "1. Sunshine\n2. Elephant\n3. Symphony" }
]
const system = 'Below is an instruction that describes a task. Write a response that appropriately completes the request.'
const message = 'How to do a Hello word in C++, step by step'
const prompt = `${system}

${context.map(c => `### Instruction:
${c.user}

### Response:
${c.assistant}`).join('\n\n')}

### Instruction:
${message}
`

// I think these two calls could be just merged into one endpoint
await axios.post("http://127.0.0.1:8080/setting-context", {
    content: prompt,
    batch_size: 256, // actually these could be command line arguments, since
    threads: 5,      // they are relevant to the server CPU usage
})
let result = await axios.get("http://127.0.0.1:8080/completion?stream=true", {
    responseType: 'stream',
    params: {
        temperature: 0.2,  // sampling parameters are used only for
        top_k: 40,         // generating new text
        top_p: 0.9,
        n_predict: 2048,

        // this is why it's nice to be able to get tokens
        // (assuming the tokenize endpoint returns something like { tokens: [...] })
        keep: (await axios.post("http://127.0.0.1:8080/tokenize", { content: system })).data.tokens.length,
        // alternative is to pass keep as a string itself, but then you need to check if its tokens match what was given before

        stop: ['###', '\n'], // stop generating when any of these strings is generated;
                             // when streaming you have to make sure that '#' or '##' is not
                             // shown to the user because it could be a part of the stop keyword,
                             // so there needs to be a little buffer (see the sketch below).
    },
})
let answer = ''
result.data.on('data', (data) => {
    let dat = JSON.parse(data.toString())
    // token by token completion
    process.stdout.write(dat.content)
    answer += dat.content
    if(dat.stop) {
        console.log("Completed")
        // save into chat log for the next question
        context.push({ user: message, assistant: answer })
    }
})

Actually, there are libraries for this and other kinds of prompts, in LangChain for example.
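
The partial-stop-word buffering mentioned in the code comments above can be implemented with a small helper on either side. A sketch of the idea, with a hypothetical function name that is not part of this PR:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Return how many characters of `generated` are safe to show to the user.
// Any suffix that is still a prefix of a stop word (e.g. "#" or "##" when
// "###" is a stop keyword) is held back until it either completes the stop
// word (handled elsewhere) or stops matching.
size_t safe_to_emit(const std::string & generated, const std::vector<std::string> & stop_words) {
    size_t hold_back = 0;
    for (const std::string & stop : stop_words) {
        if (stop.empty()) {
            continue;
        }
        // longest suffix of `generated` that is a proper prefix of `stop`
        size_t max_len = std::min(stop.size() - 1, generated.size());
        for (size_t len = max_len; len > 0; --len) {
            if (generated.compare(generated.size() - len, len, stop, 0, len) == 0) {
                hold_back = std::max(hold_back, len);
                break;
            }
        }
    }
    return generated.size() - hold_back;
}
```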

FSSRepo (Author):

I have a problem with cpp-httplib: it only allows streaming of data with GET requests. I tried to create a single POST endpoint called 'completion' that allows defining options, and it works. However, when it comes to streaming, it doesn't work for some reason.
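
For reference, the way streaming is usually done with cpp-httplib is a chunked content provider on the response. A rough, self-contained sketch (assuming the bundled httplib version provides set_chunked_content_provider; the token source is a stub, not this PR's generation loop):

```cpp
#include "httplib.h"

#include <memory>
#include <string>
#include <vector>

int main() {
    httplib::Server svr;

    svr.Get("/completion", [](const httplib::Request &, httplib::Response & res) {
        // Stub standing in for the actual llama.cpp generation loop.
        auto pieces = std::make_shared<std::vector<std::string>>(
            std::vector<std::string>{"Hello", ", ", "world", "!"});
        auto index = std::make_shared<size_t>(0);

        res.set_chunked_content_provider(
            "application/json",
            [pieces, index](size_t /*offset*/, httplib::DataSink & sink) {
                if (*index < pieces->size()) {
                    // In the real server each chunk would be a JSON object
                    // such as {"content": "...", "stop": false}.
                    const std::string & piece = (*pieces)[*index];
                    sink.write(piece.data(), piece.size());
                    ++(*index);
                    return true; // more chunks to come
                }
                sink.done();     // signals the end of the chunked response
                return true;
            });
    });

    svr.listen("127.0.0.1", 8080);
    return 0;
}
```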

FSSRepo (Author):

The idea you propose seems interesting, but there are limitations, such as the time it takes to evaluate the prompt, especially if the prompts are very long. It also involves restarting the prompt evaluation, and I'm not sure how that could be done.

FSSRepo (Author):

Currently in this server, setting-context needs to be called only once; set-message and completion need to be called whenever a response is desired, which allows for fast responses. The interaction context is stored in "embd_inp" in the server instance.

I'm thinking of something like changing the behavior of the API with an option called behavior that has the choices instruction, chat, and generation. There could also be an endpoint for performing embeddings and tokenization.

Sponsor Collaborator:

If the client is sending a string that starts with the same text as last time, there is no need to evaluate it again: tokenize the new string, find how many tokens are the same at the start, and set that as n_past. Anyway, at this point with a GPU we can evaluate even 512 tokens in less than 10 seconds (depending on hardware).

For the streaming, maybe the easiest solution is to have no streaming, just return the generated text. If the client wants to show individual tokens appearing to the user, it could just limit n_predict to 1 and call the API in a loop.
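
A small sketch of the prefix matching being suggested here (helper name is hypothetical; in llama.cpp a token is just an int):

```cpp
#include <cstddef>
#include <vector>

// Count how many tokens at the start of the new prompt are identical to the
// tokens that were already evaluated last time. That count can be used as
// n_past so that only the differing tail has to be evaluated again.
std::size_t common_prefix_length(const std::vector<int> & cached_tokens,
                                 const std::vector<int> & new_tokens) {
    std::size_t n = 0;
    while (n < cached_tokens.size() && n < new_tokens.size() &&
           cached_tokens[n] == new_tokens[n]) {
        n++;
    }
    return n;
}
```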

FSSRepo (Author):

I will try it, good idea. So I have to compare the new string's tokens with embd_inp?

SlyEcho (Sponsor Collaborator) commented May 15, 2023

What is this API based on? Does it follow some existing standard or service so that it could be used with other software?

Overall, there is a lot of code dealing with prompt generation, when there should really just be two operations: evaluate text and generate text (with stop keywords). It's not that hard to generate a prompt from a past chat in JavaScript, for example.

There could also be a tokenization endpoint so that the client can estimate the need for truncation or summarization.

But these are my opinions.

FSSRepo (Collaborator, Author) commented May 15, 2023

Initially, this server implementation was geared more towards chatbot-like interaction. It was roughly based on the OpenAI ChatGPT API, although there were unfortunately many limitations, such as the fact that prompt evaluation is a bit slow, so the initial context stays constant from the beginning. It is also not possible to reset the context without reloading the model, or at least I haven't found a way to do so.

gpt_params params;
params.model = "ggml-model.bin";

std::string hostname = "0.0.0.0";

Collaborator:

It may be better to only listen on 127.0.0.1 by default; exposing the server to the network can be dangerous.

FSSRepo (Author):

You're right. I will fix it.
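
A minimal sketch of the kind of change being discussed (the --host flag name is hypothetical):

```cpp
#include <string>

int main(int argc, char ** argv) {
    // Default to loopback so the server is not exposed to the network unless
    // the user explicitly asks for it (e.g. via a hypothetical --host flag).
    std::string hostname = "127.0.0.1";
    for (int i = 1; i < argc; i++) {
        if (std::string(argv[i]) == "--host" && i + 1 < argc) {
            hostname = argv[++i];
        }
    }
    // ... later: svr.listen(hostname.c_str(), port);
    return 0;
}
```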

examples/CMakeLists.txt (outdated)

howard0su (Collaborator):

If you can simulate the OpenAI REST API, this example will become much more useful (even with some limitations).
https://platform.openai.com/docs/api-reference/completions
https://platform.openai.com/docs/api-reference/embeddings
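
A rough sketch of what mapping the completions endpoint shape onto this server could look like; the request and response field names follow the OpenAI documentation linked above, while the handler body and the generate() helper are placeholders:

```cpp
#include "httplib.h"
#include "json.hpp"

#include <string>

using json = nlohmann::json;

// Placeholder for the actual llama.cpp generation call.
static std::string generate(const std::string & prompt, int max_tokens, double temperature) {
    (void) prompt; (void) max_tokens; (void) temperature;
    return "...generated text...";
}

int main() {
    httplib::Server svr;

    // Shape loosely follows POST /v1/completions from the OpenAI API docs.
    svr.Post("/v1/completions", [](const httplib::Request & req, httplib::Response & res) {
        json body = json::parse(req.body);

        std::string prompt = body.value("prompt", "");
        int max_tokens     = body.value("max_tokens", 16);
        double temperature = body.value("temperature", 1.0);

        json choice = {
            {"text", generate(prompt, max_tokens, temperature)},
            {"index", 0},
            {"finish_reason", "stop"}
        };
        json out = {
            {"object", "text_completion"},
            {"model", body.value("model", "llama")},
            {"choices", json::array({choice})}
        };
        res.set_content(out.dump(), "application/json");
    });

    svr.listen("127.0.0.1", 8080);
    return 0;
}
```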

github-actions (bot) left a comment

clang-tidy made some suggestions

There were too many comments to post at once. Showing the first 25 out of 212. Check the log or trigger a new build to see more.


inline bool parse_multipart_boundary(const std::string &content_type,
std::string &boundary) {
auto boundary_keyword = "boundary=";

warning: 'auto boundary_keyword' can be declared as 'const auto *boundary_keyword' [readability-qualified-auto]

Suggested change
- auto boundary_keyword = "boundary=";
+ const auto *boundary_keyword = "boundary=";

auto len = static_cast<size_t>(m.length(1));
bool all_valid_ranges = true;
split(&s[pos], &s[pos + len], ',', [&](const char *b, const char *e) {
if (!all_valid_ranges) return;

warning: statement should be inside braces [readability-braces-around-statements]

Suggested change
- if (!all_valid_ranges) return;
+ if (!all_valid_ranges) { return;
+ }

Comment on lines +4135 to +4136
bool start_with_case_ignore(const std::string &a,
const std::string &b) const {

warning: method 'start_with_case_ignore' can be made static [readability-convert-member-functions-to-static]

Suggested change
- bool start_with_case_ignore(const std::string &a,
- const std::string &b) const {
+ static bool start_with_case_ignore(const std::string &a,
+ const std::string &b) {

Comment on lines +4156 to +4157
bool start_with(const std::string &a, size_t spos, size_t epos,
const std::string &b) const {

warning: method 'start_with' can be made static [readability-convert-member-functions-to-static]

Suggested change
- bool start_with(const std::string &a, size_t spos, size_t epos,
- const std::string &b) const {
+ static bool start_with(const std::string &a, size_t spos, size_t epos,
+ const std::string &b) {


inline std::string to_lower(const char *beg, const char *end) {
std::string out;
auto it = beg;

warning: 'auto it' can be declared as 'const auto *it' [readability-qualified-auto]

Suggested change
- auto it = beg;
+ const auto *it = beg;

Comment on lines +5056 to +5064
} else if (n <= static_cast<ssize_t>(size)) {
memcpy(ptr, read_buff_.data(), static_cast<size_t>(n));
return n;
} else {
memcpy(ptr, read_buff_.data(), size);
read_buff_off_ = size;
read_buff_content_size_ = static_cast<size_t>(n);
return static_cast<ssize_t>(size);
}

warning: do not use 'else' after 'return' [readability-else-after-return]

Suggested change
- } else if (n <= static_cast<ssize_t>(size)) {
- memcpy(ptr, read_buff_.data(), static_cast<size_t>(n));
- return n;
- } else {
- memcpy(ptr, read_buff_.data(), size);
- read_buff_off_ = size;
- read_buff_content_size_ = static_cast<size_t>(n);
- return static_cast<ssize_t>(size);
- }
+ } if (n <= static_cast<ssize_t>(size)) {
+ memcpy(ptr, read_buff_.data(), static_cast<size_t>(n));
+ return n;
+ } else {
+ memcpy(ptr, read_buff_.data(), size);
+ read_buff_off_ = size;
+ read_buff_content_size_ = static_cast<size_t>(n);
+ return static_cast<ssize_t>(size);
+ }

return *this;
}

inline Server &Server::set_error_handler(Handler handler) {

warning: the parameter 'handler' is copied for each invocation but only used as a const reference; consider making it a const reference [performance-unnecessary-value-param]

examples/server/httplib.h:707:

-   Server &set_error_handler(Handler handler);
+   Server &set_error_handler(const Handler& handler);
Suggested change
- inline Server &Server::set_error_handler(Handler handler) {
+ inline Server &Server::set_error_handler(const Handler& handler) {


inline bool Server::bind_to_port(const std::string &host, int port,
int socket_flags) {
if (bind_internal(host, port, socket_flags) < 0) return false;

warning: statement should be inside braces [readability-braces-around-statements]

  if (bind_internal(host, port, socket_flags) < 0) return false;
                                                  ^

this fix will not be applied because it overlaps with another fix

Comment on lines +5334 to +5335
if (bind_internal(host, port, socket_flags) < 0) return false;
return true;

warning: redundant boolean literal in conditional return statement [readability-simplify-boolean-expr]

Suggested change
- if (bind_internal(host, port, socket_flags) < 0) return false;
- return true;
+ return bind_internal(host, port, socket_flags) >= 0;

}
}

inline bool Server::parse_request_line(const char *s, Request &req) {

warning: method 'parse_request_line' can be made static [readability-convert-member-functions-to-static]

examples/server/httplib.h:786:

-   bool parse_request_line(const char *s, Request &req);
+   static bool parse_request_line(const char *s, Request &req);

ggerganov merged commit 7e4ea5b into ggerganov:master on May 21, 2023
21 checks passed