Integrate Functionary v2.5 + Refactor Functionary Code #1509
Conversation
I have tested this PR and it seems to be working just fine. I used this server config:

{
  "host": "0.0.0.0",
  "port": 8080,
  "models": [
    {
      "model": "functionary-small-v2.5.Q4_0.gguf",
      "model_alias": "functionary-small-v2.5-auto-tokenizer",
      "hf_model_repo_id": "meetkai/functionary-small-v2.5-GGUF",
      "hf_pretrained_model_name_or_path": "meetkai/functionary-small-v2.5-GGUF",
      "chat_format": "functionary",
      "n_gpu_layers": 32,
      "offload_kqv": true,
      "n_ctx": 8192,
      "use_mlock": false
    }
  ]
}

Start the server with this config, then run the following client script:

from openai import OpenAI
import yaml

BASE_URL = "http://0.0.0.0:8080/v1"
API_KEY = "sk-xxx"
MODEL_ID = "functionary-small-v2.5-auto-tokenizer"
client = OpenAI(base_url=BASE_URL, api_key=API_KEY)
messages = [
{"role": "user", "content": "what's the weather like in Hanoi?"}
]
tools = [ # For functionary-7b-v2 we use "tools"; for functionary-7b-v1.4 we use "functions" = [{"name": "get_current_weather", "description":..., "parameters": ....}]
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g., San Francisco, CA",
}
},
"required": ["location"]
}
}
}
]
chat_response = client.chat.completions.create(
model=MODEL_ID,
messages=messages,
tools=tools,
)
print(yaml.dump(chat_response.model_dump(), sort_keys=False, explicit_start=True))

This gives the following response:

---
id: chatcmpl-0a698bb0-d38a-4a77-b2e6-2226e09f71c9
choices:
- finish_reason: tool_calls
index: 0
logprobs: null
message:
content: null
role: assistant
function_call: null
tool_calls:
- id: call_ePVfb4Gg0aNbCV41cKc9tkAi
function:
arguments: '{"location": "Hanoi"}'
name: get_current_weather
type: function
created: 1720904370
model: functionary-small-v2.5-auto-tokenizer
object: chat.completion
service_tier: null
system_fingerprint: null
usage:
completion_tokens: 12
prompt_tokens: 126
total_tokens: 133
I'm not super familiar with the code base, but I'm motivated to use this feature, so I thought I would help with review. Great work, and thank you for contributing this; I am very eager to have working function calling (I had trouble with earlier versions, so I'm hoping this version is more robust).
@@ -1328,7 +1329,7 @@ def format_gemma(
 # Tricky chat formats that require custom chat handlers

-@register_chat_completion_handler("functionary")
+@register_chat_completion_handler("functionary-old")
Which models can still use this version?
Maybe functionary-legacy is a better name?
completion = cast(
    llama_types.Completion, self.llama.create_completion(
Could this be written as
return self.llama.create_completion(
given that this function itself has no mypy typing on the return value? My impression is that the cast isn't really doing anything. If the idea is to have the typing be correct, then consider implementing typing for the return value.
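A minimal sketch of what that alternative could look like, assuming the helper only ever returns a non-streaming completion; the parameter list here is illustrative, not the PR's actual signature:

```python
from typing import Any, List, Optional

from llama_cpp import llama_types


def create_completion(
    self, prompt: str, stop: List[str], grammar: Optional[Any]
) -> llama_types.Completion:
    # Method sketch: with an explicit return annotation, mypy can check
    # callers directly and the runtime cast() becomes unnecessary.
    return self.llama.create_completion(prompt=prompt, stop=stop, grammar=grammar)
```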
def get_grammar(self, function_name: str):
    function_body = None
    for function in self.functions or []:
My impression is that functions vs tools was something used in older versions. Perhaps only support the newer API for streaming rather than both? Or do the conversion between the formats once.
Edit: swapped tools vs functions.
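As a rough illustration of the "convert once" idea (the helper name and the exact dict shapes are my assumptions, not the handler's actual API), the legacy arguments could be normalized up front so the rest of the code only deals with tools/tool_choice:

```python
from typing import Any, Dict, List, Optional, Tuple


def normalize_functions_to_tools(
    functions: Optional[List[Dict[str, Any]]],
    function_call: Optional[Any],
) -> Tuple[Optional[List[Dict[str, Any]]], Optional[Any]]:
    """Map legacy functions/function_call onto tools/tool_choice once,
    so downstream code only has to handle the newer format."""
    if functions is None:
        return None, function_call
    tools = [{"type": "function", "function": f} for f in functions]
    tool_choice = function_call
    if isinstance(function_call, dict) and "name" in function_call:
        # A legacy forced call {"name": ...} becomes the tool_choice shape.
        tool_choice = {"type": "function", "function": {"name": function_call["name"]}}
    return tools, tool_choice
```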
@@ -1858,11 +2575,12 @@ def prepare_messages_for_inference(
        ),
    )
)

if tools is not None and tool_choice != "none" and any([tool["type"] == "code_interpreter" for tool in tools]):
The code_interpreter support seems like a great feature, though perhaps this PR would be faster to review if that were added in a follow-up PR.
)
for chunk in completion:
    if chunk["choices"][0]["text"]:
        delta_text = chunk["choices"][0]["text"].strip()
This is overwritten on each iteration of the loop. Is that intentional? Same in v2, I think. v2.5 seems to append them all into a list. Given max_tokens=1 in all of them, perhaps it's not worth having the list -- would you want to just always return either None or a token? That could reduce complexity in generate_streaming, since it has a big loop to handle this, including a double break with to_break etc.
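A hedged sketch of the "return either None or a single token" idea; the create_completion callable and the chunk layout are assumed to match what the handler already uses with max_tokens=1:

```python
from typing import Callable, Iterable, Optional


def sample_single_token(
    create_completion: Callable[..., Iterable[dict]], prompt: str
) -> Optional[str]:
    """With max_tokens=1 at most one non-empty chunk can arrive, so return
    that token (or None) instead of accumulating a list the caller unpacks."""
    for chunk in create_completion(prompt=prompt, max_tokens=1):
        text = chunk["choices"][0]["text"]
        if text:
            return text.strip()
    return None
```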
generator = self.generate_streaming_tool_call(prompt=prompt)
tool_id = "".join([random.choice(string.ascii_letters + string.digits) for _ in range(24)])

for response, finish_reason, logprobs, chunk, prompt in generator:
From what I read, finish_reason and logprobs are always None here, so they can be removed from the generator (and set explicitly in yield_response).
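For example (a sketch with a hypothetical shape for yield_response, not the PR's actual helper), the generator could stop yielding the two constant values and yield_response could set them explicitly:

```python
from typing import Any, Dict


def yield_response(response: str, chunk: Dict[str, Any], prompt: str) -> Dict[str, Any]:
    """Build the streamed item; finish_reason and logprobs are always None
    on this path, so they are set here rather than threaded through the
    generator."""
    return {
        "response": response,
        "finish_reason": None,
        "logprobs": None,
        "chunk": chunk,
        "prompt": prompt,
    }
```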
prompt += completion_text + "\n"
# Generate function args
stops = [self.tool_call_token, self.stop_token]
grammar = self.get_grammar(function_name) if function_name != "python" else None
This appears different from the logic in prepare_for_generation_with_tool_func_choice -- is it necessary to diverge?
content = "" | ||
calls = [] | ||
completion_tokens = 0 | ||
# If tool_choice/function_call is provided |
Consider pushing this logic inside generate_tool_call, with some extra arguments for an already-chosen tool call, given that the second half of the function is identical and this is repeated a few times (prepare_for_generation_with_tool_func_choice can essentially become a separate branch in there). While it would make that function more complex, it may reduce the total number of things happening by collapsing a few cases together.
I think this could also simplify the streaming state management.
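A rough sketch of the shape this could take; generate_tool_call, sample_function_name, and generate_arguments are hypothetical names standing in for the PR's actual helpers:

```python
from typing import Any, Optional


def generate_tool_call(self, prompt: str, forced_function_name: Optional[str] = None) -> Any:
    """Single entry point for tool-call generation.

    When forced_function_name is given (the tool_choice/function_call
    branch), skip sampling the function name; otherwise sample it first.
    The argument-generation half is shared by both branches, so it is no
    longer duplicated."""
    if forced_function_name is not None:
        function_name = forced_function_name
    else:
        function_name = self.sample_function_name(prompt)  # hypothetical helper
    grammar = self.get_grammar(function_name) if function_name != "python" else None
    return self.generate_arguments(prompt, function_name, grammar)  # hypothetical helper
```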
)

chunk_id, chunk_created = None, None
delta = {"role": None, "content": None, "function_call": None, "tool_calls": None}
Consider simplifying how delta is managed, given that it is mutable and a lot of care needs to be taken to make sure that all other fields are cleared. It might be simpler to fully create the delta each time, then fill in the missing fields inside yield_response when creating the response. This also simplifies the final response.
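One possible shape for the "build the whole delta each time" suggestion; the field names follow the OpenAI chat-chunk schema, and the helper itself is hypothetical:

```python
from typing import Any, Dict, List, Optional


def make_delta(
    role: Optional[str] = None,
    content: Optional[str] = None,
    function_call: Optional[Dict[str, Any]] = None,
    tool_calls: Optional[List[Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Return a fresh delta dict every time instead of mutating a shared
    one, so stale fields cannot leak from a previous chunk."""
    return {
        "role": role,
        "content": content,
        "function_call": function_call,
        "tool_calls": tool_calls,
    }
```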
    ],
)

chunk_id, chunk_created = None, None
Given these are only used in the "auto" case, push the scope of these variables inside that if statement.
Also, I realize some of my comments are on existing code that was intentionally left unchanged to make review easier -- but given how much is changing already, it may be worth simplifying where possible.
completion_tokens += completion["usage"]["completion_tokens"]
prompt += completion_text
content_prefix = "<|start_header_id|>assistant<|end_header_id|>\n\n"
content += prompt[prompt.rindex(content_prefix) + len(content_prefix):] + completion_text
When testing this out I see responses like "The smart curtains have been closed. smart curtains have been closed."
I am not sure, but from what I can see here, this first appends completion_text to the prompt, then takes the tail of the prompt and adds completion_text again to the content. Does this mean completion_text is added twice?
Applying this fix seems to address the issue:

-content += prompt[prompt.rindex(content_prefix) + len(content_prefix):] + completion_text
+content += prompt[prompt.rindex(content_prefix) + len(content_prefix):]
stops = [self.tool_call_token, self.stop_token]
completion = self.create_completion(prompt=prompt, stop=stops, grammar=None)
for chunk in completion:
    delta_text = chunk["choices"][0]["text"]
Does generate_streaming_content for v2.5 also need to strip the content_prefix from the assistant message?
Amazing job 🚀! Tested with the Instructor library; I just encountered a problem with passing the function result back to the server.
When using the example from Functions.ipynb, I am getting the following error on the server side.
Exception: 'dict object' has no attribute 'name'
Traceback (most recent call last):
File "/opt/app-root/lib64/python3.11/site-packages/llama_cpp/server/errors.py", line 171, in custom_route_handler
response = await original_route_handler(request)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/app-root/lib64/python3.11/site-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/app-root/lib64/python3.11/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/app-root/lib64/python3.11/site-packages/llama_cpp/server/app.py", line 483, in create_chat_completion
] = await run_in_threadpool(llama.create_chat_completion, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/app-root/lib64/python3.11/site-packages/starlette/concurrency.py", line 42, in run_in_threadpool
return await anyio.to_thread.run_sync(func, *args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/app-root/lib64/python3.11/site-packages/anyio/to_thread.py", line 56, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/app-root/lib64/python3.11/site-packages/anyio/_backends/_asyncio.py", line 2177, in run_sync_in_worker_thread
return await future
^^^^^^^^^^^^
File "/opt/app-root/lib64/python3.11/site-packages/anyio/_backends/_asyncio.py", line 859, in run
result = context.run(func, *args)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/app-root/lib64/python3.11/site-packages/llama_cpp/llama.py", line 1748, in create_chat_completion
return handler(
^^^^^^^^
File "/opt/app-root/lib64/python3.11/site-packages/llama_cpp/llama_chat_format.py", line 2610, in functionary_new_chat_handler
prompt = prepare_messages_for_inference(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/app-root/lib64/python3.11/site-packages/llama_cpp/llama_chat_format.py", line 2598, in prepare_messages_for_inference
tokenizer.hf_tokenizer.apply_chat_template(all_messages, add_generation_prompt=True, tokenize=False)
File "/opt/app-root/lib64/python3.11/site-packages/transformers/tokenization_utils_base.py", line 1812, in apply_chat_template
rendered_chat = compiled_template.render(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/app-root/lib64/python3.11/site-packages/jinja2/environment.py", line 1304, in render
self.environment.handle_exception()
File "/opt/app-root/lib64/python3.11/site-packages/jinja2/environment.py", line 939, in handle_exception
raise rewrite_traceback_stack(source=source)
File "<template>", line 9, in top-level template code
jinja2.exceptions.UndefinedError: 'dict object' has no attribute 'name'
Debugger
Running the function-calling example in the debugger, the first call works perfectly: I get the expected response with the right formatting.
However, the second call, where we add the tool response to the messages, hits the error above.
Here is the messages content before sending it for the second call:
[
{
'role': 'user',
'content': "what's the weather like in Hanoi?"
},
ChatCompletionMessage(
content=None,
role='assistant',
function_call=None,
tool_calls=[
ChatCompletionMessageToolCall(
id='call_L4tj8aG7VYqlL1XCLzyHeH8z',
function=Function(
arguments='{"location": "Hanoi"}',
name='get_current_weather'),
type='function'
)
]
),
{
'tool_call_id': 'call_L4tj8aG7VYqlL1XCLzyHeH8z',
'role': 'tool',
'name': 'get_current_weather',
'content': '{"location": "Hanoi", "temperature": "unknown"}'
}
]
Reproduce
To reproduce without using the Jupyter notebook, here is the Python script:
import json
import openai
BASE_URL = "http://0.0.0.0:8000/v1"
API_KEY = "sk-xxx"
MODEL_ID = "/models/functionary-small-v2.5.Q4_0.gguf"
def get_current_weather(location, unit="fahrenheit"):
return json.dumps({"location": location, "temperature": "unknown"})
client = openai.OpenAI(base_url=BASE_URL, api_key=API_KEY)
messages = [
{"role": "user", "content": "what's the weather like in Hanoi?"}
]
tools = [ # For functionary-7b-v2 we use "tools"; for functionary-7b-v1.4 we use "functions" = [{"name": "get_current_weather", "description":..., "parameters": ....}]
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g., San Francisco, CA",
}
},
"required": ["location"]
}
}
}
]
response = client.chat.completions.create(
model=MODEL_ID,
messages=messages,
tools=tools,
)
response_message = response.choices[0].message
tool_calls = response_message.tool_calls
if not tool_calls:
exit(1)
available_functions = {
"get_current_weather": get_current_weather,
}
messages.append(response_message)
for tool_call in tool_calls:
function_name = tool_call.function.name
function_to_call = available_functions[function_name]
function_args = json.loads(tool_call.function.arguments)
function_response = function_to_call(
location=function_args.get("location"),
unit=function_args.get("unit"),
)
messages.append(
{
"tool_call_id": tool_call.id,
"role": "tool",
"name": function_name,
"content": function_response,
}
) # extend conversation with function response
second_response = client.chat.completions.create(
model=MODEL_ID,
messages=messages,
)
Edit
After some investigation, the name property is filtered out because it is not on the ChatCompletionRequestToolMessage class. It must be added there to fix the problem.
llama-cpp-python/llama_cpp/llama_types.py
Lines 221 to 224 in f6ed21f

class ChatCompletionRequestToolMessage(TypedDict):
    role: Literal["tool"]
    content: Optional[str]
    tool_call_id: str
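For reference, the fix being suggested would look roughly like this; whether name should be required or optional is an open question, and NotRequired is just one way to express it:

```python
from typing import Literal, Optional

from typing_extensions import NotRequired, TypedDict


class ChatCompletionRequestToolMessage(TypedDict):
    role: Literal["tool"]
    content: Optional[str]
    tool_call_id: str
    name: NotRequired[str]  # proposed addition so the field is not filtered out
```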
@@ -1826,7 +2544,6 @@ def generate_schema_from_functions(functions, namespace="functions") -> str:
def prepare_messages_for_inference(
    messages: List[llama_types.ChatCompletionRequestMessage],
For some very weird reason, the messages do not contain the name property when the role is tool.
Here, in the example I provided in #1509 (review), messages is equal to

[
    ....
    {
        "role": "tool",
        "content": "{\"location\": \"Hanoi, Vietnam\", \"temperature\": \"unknown\"}",
        "tool_call_id": "call_yQZk3sIQsIwGuco1lV4QVbTN"
    }
]

In the tool message we do not have the name, which is required by the chat format, leading to the error 'dict object' has no attribute 'name' raised by Jinja.
- chat_format: functionary instead of functionary-v1, functionary-v2, functionary-v2.5. I changed the chat_format for the old handler to functionary-old. Hope it is not breaking any code on your side. Anyway, can we remove the old handler function already?
- Code interpreter: requested by passing {"type": "code_interpreter"} in tools. Returns a tool call to a tool called python with the argument being the generated code (see the sketch below).
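For example, against the OpenAI-compatible client used earlier in this thread, requesting the code interpreter might look like this (a hedged sketch; the exact response shape depends on the handler):

```python
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8080/v1", api_key="sk-xxx")

chat_response = client.chat.completions.create(
    model="functionary-small-v2.5-auto-tokenizer",
    messages=[{"role": "user", "content": "Compute the first 10 Fibonacci numbers."}],
    tools=[{"type": "code_interpreter"}],
)

# Per the description above, the handler returns a tool call to a tool
# named "python" whose arguments are the generated code.
print(chat_response.choices[0].message.tool_calls)
```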