Thanks for sharing the code, inspirational!
Streaming: I noticed the OpenAI compatibility only covers single completions, not streaming?
- it would be nice to have it, to be a true drop-in replacement for existing frontends, taking the response and delivering it. Here's a rudimentary code snippet which works with litellm client-based apps but is otherwise not much tested yet:
```python
if data.get('stream', False):
    from flask import Response
    import json

    def generate():
        # emit the (already computed) final response as a single SSE chunk
        r = {
            'object': "chat.completion.chunk",
            'choices': [{
                'index': 0,
                'delta': {
                    'role': 'assistant',
                    'content': final_response,
                },
                'finish_reason': None,
            }],
        }
        yield f"data: {json.dumps(r)}\n\n"
        yield "data: [DONE]\n\n"

    return Response(generate(), mimetype='text/event-stream')
else:
    ...
```
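For what it's worth, the chunks the snippet emits can be consumed back on the client side like this (a sketch with no client library assumed; `parse_sse_chunks` is just an illustrative name):

```python
import json

def parse_sse_chunks(raw):
    """Collect the JSON payloads from an SSE stream; stop at the [DONE] sentinel."""
    chunks = []
    for line in raw.splitlines():
        if not line.startswith('data: '):
            continue
        payload = line[len('data: '):]
        if payload == '[DONE]':
            break
        chunks.append(json.loads(payload))
    return chunks

raw = 'data: {"object": "chat.completion.chunk", "choices": []}\n\ndata: [DONE]\n\n'
print(parse_sse_chunks(raw))  # → [{'object': 'chat.completion.chunk', 'choices': []}]
```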
Context: it's a single-shot response; any kind of conversation context always takes only the first message?
- it would be good to at least take the last user message (`mesg["user"]`)
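A minimal sketch of what I mean, assuming the OpenAI-style `messages` list the proxy already receives (`last_user_content` is just an illustrative name):

```python
def last_user_content(messages):
    """Return the content of the most recent user message, or None."""
    for msg in reversed(messages):
        if msg.get('role') == 'user':
            return msg.get('content')
    return None

messages = [
    {'role': 'user', 'content': 'first question'},
    {'role': 'assistant', 'content': 'first answer'},
    {'role': 'user', 'content': 'follow-up question'},
]
print(last_user_content(messages))  # → follow-up question
```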
Port: The listening port 8000 is fixed; perhaps also allow passing it as a CLI argument, e.g. `--port=8100` or so (I run a local LLM on port 8000 already).
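Something like this would do it (a sketch using stdlib `argparse`; the Flask app would then be started with `app.run(port=args.port)`):

```python
import argparse

def parse_cli(argv=None):
    """Parse the proxy's CLI arguments; --port avoids clashing with other local servers."""
    parser = argparse.ArgumentParser(description='optimizing proxy')
    parser.add_argument('--port', type=int, default=8000,
                        help='listening port (default: 8000)')
    return parser.parse_args(argv)

args = parse_cli(['--port', '8100'])
print(args.port)  # → 8100
```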
Proxy vs Library
I personally wrote my own AI.py which uses litellm internally, and works like this:
```python
# using <provider>:<model> respectively <provider>:<base_url>#<model> convention
m = AI("openai:http://localhost:8000/v1#llama-3.1-8b")

def streaming(t):
    print(t, end="", flush=True)

r = m.query(q)                  # blocking query
m.query(q, stream=streaming)    # streaming via callback
```
I recently added `.reason(q)`, which queries the same model with Chain-of-Thought "step-by-step" planning, then queries each step back to the model, and returns only the final response, a bit like you did.
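Roughly, that `.reason(q)` loop looks like this (a sketch; `query_model` is a stand-in for the actual litellm call, and the prompts are illustrative):

```python
def reason(q, query_model):
    """Chain-of-Thought planning: ask for a step-by-step plan,
    run each step back through the model, return only the final answer."""
    plan = query_model(f"Break the following task into numbered steps, one per line:\n{q}")
    steps = [line for line in plan.splitlines() if line.strip()]
    context = q
    for step in steps:
        result = query_model(f"Context so far:\n{context}\n\nCarry out this step: {step}")
        context += f"\n{step}\n{result}"
    return query_model(f"Given the work below, give only the final answer:\n{context}")
```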
So, I'm thinking of adapting your approaches as `.reason(q, approach="cot_reflection")` rather than using the proxy layer; I'd like to have some additional flexibility. Are you considering a library approach as well?
In my chat bot I started the following convention:
- `!what separates complexity domains as in scale in nature?` inserts the prefix `f"Use chain of thought to compose a reply. {q}"`
- `^how much is 23+32*128?` uses CoT planning, and then queries each step separately.
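The `!` convention is just simple prompt rewriting before the query is sent (a minimal sketch):

```python
def preprocess(line):
    """Apply the '!' prefix convention: inject a chain-of-thought instruction."""
    if line.startswith('!'):
        q = line[1:]
        return f"Use chain of thought to compose a reply. {q}"
    return line

print(preprocess('!what separates complexity domains as in scale in nature?'))
# → Use chain of thought to compose a reply. what separates complexity domains as in scale in nature?
```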
In this sense, I think I'll map your approaches to a single character or some other intuitive convention, so I can use multiple strategies:
- `mcts`
- `bon`
- `moa`
- `rto`
- `z3`
- `self_consistency` or `sc`
- `pvg`
- `rstar` or `r*`
- `cot_reflection` or `cotr`
- `plansearch` or `ps`
- `leap`
and then use something like `^{approach}<space>{query}`, e.g. `^r* what is the actual source of gravity of atoms?`.
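The dispatch for that convention is straightforward (a sketch; the alias table and `parse_command` are my own naming, not anything from your code):

```python
# map of shorthand aliases to strategy names, per the list above
ALIASES = {
    'mcts': 'mcts', 'bon': 'bon', 'moa': 'moa', 'rto': 'rto', 'z3': 'z3',
    'self_consistency': 'self_consistency', 'sc': 'self_consistency',
    'pvg': 'pvg', 'rstar': 'rstar', 'r*': 'rstar',
    'cot_reflection': 'cot_reflection', 'cotr': 'cot_reflection',
    'plansearch': 'plansearch', 'ps': 'plansearch', 'leap': 'leap',
}

def parse_command(line):
    """Split '^{approach} {query}' into (strategy, query); None if it doesn't match."""
    if not line.startswith('^'):
        return None
    head, _, query = line[1:].partition(' ')
    strategy = ALIASES.get(head)
    if strategy is None or not query:
        return None
    return strategy, query

print(parse_command('^r* what is the actual source of gravity of atoms?'))
# → ('rstar', 'what is the actual source of gravity of atoms?')
```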