[WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389

Draft · ochafik wants to merge 73 commits into master from agent-example

Conversation

ochafik (Collaborator) commented Mar 29, 2024

Still very rough, but sharing a draft to get early feedback on the general direction.

This is an experiment in adding grammar-constrained tool support to llama.cpp, with a simple example of running agentic code on top, and support for sandboxing unsafe tools (e.g. Python interpreter).

Instead of bloating server.cpp any further, this slaps a Python layer in front of it to handle tool calling (partly because it's hard to do this well without proper jinja2 support, as chat templates handle tool calling peculiarly at best, and partly because this could be a way to simplify the C++ server and focus it on performance and security rather than on schemas and chat templates; WDYT?).
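
To make the direction concrete, here's a rough sketch of the kind of shim this amounts to (illustrative only, not the actual code in examples/openai; the prompt rendering is deliberately naive and the /completion fields are just the server's usual parameters):

# Rough sketch of the proxy idea (illustration, not the code in examples/openai):
# accept an OpenAI-style chat request, render a prompt, and forward generation
# to the plain C++ server's /completion endpoint, optionally with a GBNF grammar.
from fastapi import FastAPI
from pydantic import BaseModel
import httpx

LLAMA_SERVER = "http://localhost:8080"  # the unmodified llama.cpp server

app = FastAPI()

class ChatRequest(BaseModel):
    messages: list[dict]
    grammar: str | None = None  # optional grammar to constrain the output

@app.post("/v1/chat/completions")
async def chat_completions(req: ChatRequest):
    # The real layer applies the model's jinja2 chat template and injects tool
    # schemas here; this just concatenates messages to keep the sketch short.
    prompt = "\n".join(f"{m['role']}: {m['content']}" for m in req.messages) + "\nassistant:"
    payload = {"prompt": prompt, "n_predict": 512}
    if req.grammar:
        payload["grammar"] = req.grammar
    async with httpx.AsyncClient(timeout=600.0) as client:
        r = await client.post(f"{LLAMA_SERVER}/completion", json=payload)
    content = r.json()["content"]
    return {"choices": [{"message": {"role": "assistant", "content": content}}]}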

So this PR has a long way to go, but here's what can be done with it:

Install instructions:
git clone https://github.com/ochafik/llama.cpp --branch agent-example --single-branch llama.cpp-agent
cd llama.cpp-agent
conda create -n agent python=3.11
conda activate agent
pip install -r examples/agent/requirements.txt

make clean && make -j server
python -m examples.agent \
    --model mixtral-8x7b-instruct-v0.1.Q8_0.gguf \
    --tools examples/agent/tools/example_math_tools.py \
    --greedy \
    --goal "What is the sum of 2535 squared and 32222000403 then multiplied by one and a half. What's a third of the result?"
Output:
πŸ› οΈ  add, divide, multiply, pow
πŸ’­ First, I will calculate the square of 2535, then add it to 32222000403. After that, I will multiply the result by 1.5 and finally, I will divide the result by 3.
βš™οΈ  pow(value=2535, power=2) -> 6426225.0
πŸ’­ Now that I have calculated the square of 2535, I will calculate the sum of 6426225 and 32222000403.
βš™οΈ  add(a=6426225, b=32222000403) -> 32228426628
πŸ’­ Now that I have calculated the sum, I will multiply it by 1.5.
βš™οΈ  multiply(a=32228426628, b=1.5) -> 48342639942.0
πŸ’­ Now that I have calculated the product, I will divide it by 3.
βš™οΈ  divide(a=48342639942.0, b=3) -> 16114213314.0
➑️  The result of the calculation is 16114213314.0.
python -c "print((2535**2 + 32222000403)*1.5 / 3)"
# 16114213314.0
python -m examples.agent \
    --tools examples/agent/tools/fake_weather_tools.py \
    --goal "What is the weather going to be like in San Francisco and Glasgow over the next 4 days." \
    --greedy
Output:
πŸ› οΈ  get_current_weather, get_n_day_weather_forecast
πŸ’­ I will first get the current weather in San Francisco, then get the 4-day weather forecast for both San Francisco and Glasgow.
βš™οΈ  get_current_weather(location=San Francisco, format=fahrenheit) -> ...
πŸ’­ I will first get the current weather in San Francisco, then get the 4-day weather forecast for both San Francisco and Glasgow.
βš™οΈ  get_n_day_weather_forecast(location=San Francisco, format=fahrenheit, num_days=4) -> ...
πŸ’­ I will first get the current weather in San Francisco, then get the 4-day weather forecast for both San Francisco and Glasgow.
βš™οΈ  get_n_day_weather_forecast(location=Glasgow, format=celsius, num_days=4) -> ...
The current weather in San Francisco is sunny and 87.8F. Here is the 4-day weather forecast:

For San Francisco:
- In 1 day: Cloudy, 60.8F
- In 2 days: Sunny, 73.4F
- In 3 days: Cloudy, 62.6F

For Glasgow:
- In 1 day: Cloudy, 16C
- In 2 days: Sunny, 23C
- In 3 days: Cloudy, 17C
python -m examples.agent --std-tools --goal "Say something nice in 1 minute."
Output:
πŸ› οΈ  ask_user, say_out_loud, wait_for_date, wait_for_duration
πŸ’­ Thinking about what to say in the next minute.
βš™οΈ  say_out_loud(something="In the next minute, I'll share a kind and uplifting message. Please wait...") -> None
πŸ’­ Waiting for the specified duration.
βš™οΈ  wait_for_duration(duration={"seconds": 60}) -> None
πŸ’­ Thinking about what to say after the waiting period.
βš™οΈ  say_out_loud(something="Thank you for your patience. Here's a nice message for you: 'A smile is the prettiest thing you can wear. So let your smile shine through.' - Dolly Parton") -> None
➑️ "The task of saying something nice in 1 minute is complete."

Add --verbose to see what's going on, and look at examples/agent/README & examples/openai/README for more details.

Tool sandboxing

Since tools can quickly become unsafe (you don't want a rogue AI poking at your files), I've added a simple script to sandbox them. It wraps a Python module as a REST server inside a Docker container that exposes its port, and since it uses FastAPI, it produces a neat OpenAPI schema that the agent code can consume.
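
To give an idea of what gets wrapped, a tool module boils down to plain Python functions with type hints (which is what lets FastAPI derive the OpenAPI schema). A hypothetical sketch in the spirit of unsafe_python_tools.py (not the actual file):

# Hypothetical tool module (a sketch, not the actual unsafe_python_tools.py):
# plain functions that run_sandboxed_tools.sh can wrap as REST endpoints.
def execute_python(source: str) -> dict:
    """Execute a Python snippet and return whatever it bound to `result`."""
    env: dict = {}
    exec(source, env)  # unsafe by design; that's exactly why it runs in the sandbox
    return {"result": env["result"]} if "result" in env else {}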

Run this in a separate terminal to get a sandboxed python interpreter (DATA_DIR will contain any files created by Python programs):

# Note: with limactl, the default sandbox location ~/.llama.cpp/sandbox won't be writable
# (see https://github.com/lima-vm/lima/discussions/393)
# export DATA_DIR=/tmp/lima/llama.cpp/sandbox
PORT=9999 examples/agent/run_sandboxed_tools.sh \
        examples/agent/tools/unsafe_python_tools.py

# INFO: using DATA_DIR: /Users/ochafik/.llama.cpp/sandbox
# ...
# INFO:     Uvicorn running on http://0.0.0.0:9999 (Press CTRL+C to quit)

Then tell the agent to discover tools at the new endpoint:

python -m examples.agent \
    --tools http://localhost:9999 \
    --goal "Whats cos(123) / 23 * 12.6 ?"
Output:
πŸ’­ To calculate the given mathematical expression, I will use the 'execute_python' tool.
βš™οΈ  execute_python(source="import math\nresult = math.cos(123) / 23 * 12.6") -> {'result': -0.4864525314920599}
➑️ "-0.4864525314920599"
python -m examples.agent \
    --tools http://localhost:9999 \
    --goal "Create a file with 100k spaces"
Output:
πŸ’­ To accomplish the task of creating a file with 100k spaces, I need to consider how to generate and save this content. I will use the 'execute_python' tool to create a Python script that generates the spaces and writes them to a file.
βš™οΈ  execute_python(source="with open('spaces_file.txt', 'w') as f:\n\tf.write(' ' * 100000)\n") -> {}
➑️ "File 'spaces_file.txt' has been created with 100,000 spaces."
wc -c ~/.llama.cpp/sandbox/spaces_file.txt
# 100000 ~/.llama.cpp/sandbox/spaces_file.txt
python -m examples.agent \
    --tools http://localhost:9999 \
    --goal "Write and run a program with a syntax error, then fix it"
Output:
πŸ’­ I need to create a Python program with a syntax error, run it, and then correct the error.
βš™οΈ  execute_python(source="def hello:\\n    print('Hello, world')\\nheelo()") -> Error: expected '(' (<string>, line 1)
πŸ’­ I see that there is a syntax error in the code due to a typo. I need to correct the typo and run the code again.
βš™οΈ  execute_python(source="def hello():\\n    print('Hello, world')\\nhello()") -> Error: unexpected character after line continuation character (<string>, line 1)
πŸ’­ It seems there is another syntax error in the corrected code. I need to investigate the issue and run the code again.
βš™οΈ  execute_python(source="def hello():\n    print('Hello, world')\nhello()") -> {}
➑️ "Hello, world"

Everybody gets tool calling support!

Some models have been explicitly fine-tuned for tool usage (e.g. Functionary, with tentative support in #5695, or Hermes 2 Pro Mistral 7B, which has a nice repo about it).

Some other models don't officially have tool-call support, at least not in their open-source releases... (Mixtral πŸ‘€)

But since #5978, all can be coerced into sticking to a specific JSON schema.
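
Under the hood that coercion is just a GBNF grammar derived from the schema. A minimal illustration against a running server (the grammar is hand-written here to stand in for the converter's output; the /completion fields are the existing server parameters):

# Minimal illustration of schema-constrained output (a sketch, not this PR's code).
# The grammar below is hand-written to match {"answer": <integer>}; in practice it
# would be generated from a JSON schema by the converter improved in #5978.
import requests

GRAMMAR = r'''
root    ::= "{" space "\"answer\"" space ":" space integer space "}"
integer ::= "-"? [0-9]+
space   ::= " "?
'''

resp = requests.post("http://localhost:8080/completion", json={
    "prompt": "How many legs does a spider have? Answer as JSON.\n",
    "grammar": GRAMMAR,
    "n_predict": 32,
})
print(resp.json()["content"])  # e.g. {"answer": 8}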

This example supports the following tool prompting strategies in examples/openai/prompting.py (see dizzying combos of outputs):

  • --style=thoughtful_steps: the default unless Functionary template is detected.

    Constrains the output to JSON with the following TypeScript-style signature (advertised to the model as a JSON schema), which fully constrains all of the function arguments (a rough JSON-schema rendering is sketched right after this list):

      {
        thought_about_next_step_only: string,
        next_step: (
          {result: T} |
          {
            tool_calls: (
              {name: "function1", arguments: {"arg1": ..., "arg2":...}} |
              {name: "function1", arguments: {"arg1": ..., "arg2":...}} |
              ...
            )[]
          }
        )
      }
      // Where T is the output JSON schema from the --format flag, or 'any'

    It seems quite important to give the model some space to think before it even decides whether it has the final output or needs extra steps (thought might work just as well, YMMV). Note that by default only 1 tool call is allowed, but for models that support parallel tool calling, you can pass --parallel-calls (Functionary does this well, but Mixtral-instruct tends to hallucinate)

  • --style=functionary_v2: besides using the proper template, this formats the signatures as TypeScript and deals with interesting edge cases (TODO: check whether this model's template is the only one that expects function call arguments as a JSON string rather than a JSON object)

  • --style=short / long: announces tools in a <tools>...schemas...</tools> system message, and uses less constrained output that allows mixing text and <tool_call>{json}</tool_call> inserts.

    Since there is no negative lookahead (nor reluctant repetition modifier), I found it hard to write a grammar that allows "any text not containing <tool_call>", followed by an optional <tool_call>. I settled for something a bit brittle (content := [^<] | "<" [^t<] | "<t" [^o<]); suggestions welcome!

  • --style=mixtral: OK now it gets weird. Mixtral works well w/ --style=thoughtful_steps (I just had to collapse system and tool messages into user messages as its chat template is very restrictive), but when prompted w/ You have these tools <tools>{json schemas}</tools>, it spontaneously calls tools with the semi-standard syntax used by Hermes too... except it has spurious underscore escapes πŸ€”

    Imma tell you what i'm doin'
    <tool\_call>
    {"arguments": ..., "name": "my\_weirdly\_escaped\_function\_name"}
    </tool\_call>
    

    So in the mixtral style I just unescape underscores and we get a tool-calling Mixtral (the style is otherwise much like long / short and would also benefit from more grammar features).
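
For reference, here's a rough JSON-schema rendering of the thoughtful_steps shape above (a hypothetical helper; the schema actually emitted by examples/openai/prompting.py may differ in details):

# Rough JSON-schema rendering of the thoughtful_steps shape (hypothetical helper,
# not the schema actually emitted by examples/openai/prompting.py).
def thoughtful_steps_schema(tools: list[dict], result_schema: dict) -> dict:
    tool_call_alternatives = [
        {
            "type": "object",
            "properties": {
                "name": {"const": tool["name"]},
                "arguments": tool["parameters"],  # each tool's own parameters schema
            },
            "required": ["name", "arguments"],
        }
        for tool in tools
    ]
    return {
        "type": "object",
        "properties": {
            "thought_about_next_step_only": {"type": "string"},
            "next_step": {
                "oneOf": [
                    {
                        "type": "object",
                        "properties": {"result": result_schema},  # T from --format, or any
                        "required": ["result"],
                    },
                    {
                        "type": "object",
                        "properties": {
                            "tool_calls": {
                                "type": "array",
                                # without --parallel-calls this would be capped at one item
                                "items": {"oneOf": tool_call_alternatives},
                            }
                        },
                        "required": ["tool_calls"],
                    },
                ]
            },
        },
        "required": ["thought_about_next_step_only", "next_step"],
    }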

TODOs

  • Auto discover tools exposed by an OpenAPI endpoint
  • Fix / mitigate occasional "lost child process" issue
  • Add a Python code tool + make the sandbox work
  • Turn Python tool to notebook / capture stdout / handle errors properly
  • Wait for spawned servers to be healthy
  • Add model URL / HF loading support
  • Send separate PR w/ (cleaner) JSON number length cap (context: I was asking Mixtral to divide by 3 and it multiplied by... 0.33333333333333...) β†’ JSON schema conversion: ⚑️ faster repetitions, min/maxLength for strings, cap number length #6555
  • Send separate PR w/ gguf utils (kv reading, lazy kv map, reader example update)
  • Get some general approval & integrate feedback
  • Measure token overhead of schemas & compress tool call syntax if possible (TS signatures borrowed from Functionary already help)
  • Stream all the things
  • Finish unit tests (wrap grammar parsing tester from python?)
  • Open discussion re/ grammar features: reluctant modifier maybe? Is it actually needed?
  • Server integration tests
  • Prepare code for review (doc strings)

phymbert (Collaborator) commented:

Thanks for the effort to bring this nice feature πŸ₯‡. Please mind pushing commits to your fork first, as it triggers a lot of CI runs on the main repo.

github-actions bot (Contributor) commented Apr 9, 2024

πŸ“ˆ llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 532 iterations πŸš€

Details (for performance-related PRs):
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8802.46ms p(95)=21471.34ms fails=, finish reason: stop=478 truncated=54
  • Prompt processing (pp): avg=108.38tk/s p(95)=484.77tk/s
  • Token generation (tg): avg=32.26tk/s p(95)=46.99tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=agent-example commit=298c098d51dba049681225bc24cbe9196420081d

Benchmark charts (time series over the 10-minute run, omitted here): prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing.

ochafik changed the title from "[WIP] agent example (w/ Tools!) & improved OAI compatibility layer (in Python)" to "[WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python)" on Apr 10, 2024
ochafik (Collaborator, Author) commented Apr 10, 2024

Please mind pushing commits to your fork first, as it triggers a lot of CI runs on the main repo.

@phymbert sorry for the CI noise again today, wanted to get the PR in good working order. Branch agent-example-tmp will get most of the updates going forward (I might also experiment with skip-checks:true).

phymbert (Collaborator) commented Apr 10, 2024

@phymbert sorry for the CI noise again today, wanted to get the PR in good working order.

Please forgive my comment: firstly, I am sure you are doing your best; secondly, I personally pushed 60+ commits over the last 4 days ;); and now the CI cancels concurrent jobs anyway. Good luck!

HanClinto (Collaborator) commented Apr 30, 2024

This is a bit off-topic, but I noticed in your example call:

python -m examples.agent \
    --model mixtral-8x7b-instruct-v0.1.Q8_0.gguf \
    --tools examples/agent/tools/example_math_tools.py \
    --greedy

Is there a particular reason you're using greedy sampling? If so, I think there is an opportunity for speedup when using grammars and greedy sampling, but I wasn't sure how frequently greedy sampling was used, so I haven't chased it down yet.

skoulik commented May 6, 2024

Good day @ochafik ,
I know you are busy with important things, but still.
I've been experimenting with your agent-example branch in an attempt to pair llama.cpp with the LlamaIndex agentic API.
It almost works as expected. The only glitch I've found so far is that LlamaIndex barks at me because it expects function arguments to be a string (JSON-escaped and converted to a string?) while your agent returns them as a pure JSON object. Is this supposed to be configurable? Or is it just a WIP? An omission?

According to the official OpenAI API, they produce a string too; look here, for instance:
https://cookbook.openai.com/examples/how_to_call_functions_with_chat_models

Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='call_ujD1NwPxzeOSCbgw2NOabOin', function=Function(arguments='{\n "location": "Glasgow, Scotland",\n "format": "celsius",\n "num_days": 5\n}', name='get_n_day_weather_forecast'), type='function')]), internal_metrics=[{'cached_prompt_tokens': 128, 'total_accepted_tokens': 0, 'total_batched_tokens': 273, 'total_predicted_tokens': 0, 'total_rejected_tokens': 0, 'total_tokens_in_completion': 274, 'cached_embeddings_bytes': 0, 'cached_embeddings_n': 0, 'uncached_embeddings_bytes': 0, 'uncached_embeddings_n': 0, 'fetched_embeddings_bytes': 0, 'fetched_embeddings_n': 0, 'n_evictions': 0, 'sampling_steps': 40, 'sampling_steps_with_predictions': 0, 'batcher_ttft': 0.035738229751586914, 'batcher_initial_queue_time': 0.0007979869842529297}])

Might be something to do with extra security/agents isolation.
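
For reference, matching that expectation on the agent side is mostly a matter of serializing the parsed arguments object back to a string, along these lines (a sketch, not the fix that later landed in the branch):

# Sketch of the compatibility tweak being discussed (not the branch's actual fix):
# OpenAI-style clients expect `function.arguments` to be a JSON *string*, so the
# parsed arguments object is serialized before being returned to the client.
import json

def to_openai_tool_call(call_id: str, name: str, arguments: dict) -> dict:
    return {
        "id": call_id,
        "type": "function",
        "function": {
            "name": name,
            "arguments": json.dumps(arguments),  # string, as OpenAI / LlamaIndex expect
        },
    }

# to_openai_tool_call("call_1", "add", {"a": 6426225, "b": 32222000403})["function"]["arguments"]
# -> '{"a": 6426225, "b": 32222000403}'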

skoulik commented May 6, 2024

Good day @ochafik , I know you are busy with important things, but still. I've been experimenting with your agent-example branch [...]

Found expects_stringified_function_arguments ...

mofosyne added the labels "enhancement (New feature or request)" and "Review Complexity: High (Generally require in-depth knowledge of LLMs or GPUs)" on May 10, 2024
ochafik (Collaborator, Author) commented May 18, 2024

@skoulik thanks so much for testing this out, and for reporting this stringification issue! The expects_stringified_function_arguments thing was me trying to work around some templates hard-coding that expectation of stringified args, but I hadn't dug enough to realize that's how things were actually meant to be πŸ˜…

Preparing a fix.

Also, I've been toying w/ moving some or all of that logic to C++, stay tuned :-D

Good day @ochafik , I know you are busy with important things

I mean, this is important to me too haha, but yeah I've been very side-tracked by real-life stuff / the Spring here 😎

github-actions bot added the labels "examples", "python (python script changes)" and "server" on May 18, 2024
skoulik commented May 20, 2024

Preparing a fix.

I've done a quick test and can confirm that it works with the LlamaIndex example now. LangChain will most likely work too.

Also, I've been toying w/ moving some or all of that logic to C++, stay tuned :-D

My five cents: I noticed that you've been experimenting with prompts throughout your commit history. I reckon it might be beneficial to make them customizable too (in addition to the hard-coded templates). People are going to use the server with a plethora of different models, some of which might work better with other flavours of prompts.

Good day @ochafik , I know you are busy with important things

I mean, this is important to me too haha, but yeah I've been very side-tracked by real-life stuff / the Spring here 😎
No rush here, take your time and enjoy life :)
