
Streaming, Context, Port & Proxy vs Library #3

@Spiritdude


Thanks for sharing the code, inspirational!

Streaming: I noticed the compatibility layer only covers single (non-streaming) completions, not streaming?

  • it would be nice to have it, to make this a true drop-in replacement for existing frontends, which take the response and deliver it as it arrives. Here is a rudimentary code snippet which works with litellm client-based apps but is otherwise not much tested yet:
  if data.get('stream', False):
      from flask import Response
      import json

      def generate():
          # send the complete reply as a single SSE chunk,
          # then the OpenAI-style [DONE] terminator
          r = {
              'object': "chat.completion.chunk",
              'choices': [{
                  'index': 0,
                  'delta': {
                      'role': 'assistant',
                      'content': final_response,
                  },
                  'finish_reason': None,
              }],
          }
          yield f"data: {json.dumps(r)}\n\n"
          yield "data: [DONE]\n\n"

      return Response(generate(), mimetype='text/event-stream')
  else:
      ...
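The snippet above emits the whole reply as one chunk. A closer match to OpenAI's streaming behavior would be to split the finished reply over several chunks; a minimal sketch (the function name and chunk size are made up, `final_response` is assumed to hold the completed reply):

```python
import json

def sse_chunks(final_response, chunk_size=16):
    # illustrative: slice the finished reply into several SSE chunks
    # instead of sending it all at once
    for i in range(0, len(final_response), chunk_size):
        r = {
            "object": "chat.completion.chunk",
            "choices": [{
                "index": 0,
                "delta": {"role": "assistant",
                          "content": final_response[i:i + chunk_size]},
                "finish_reason": None,
            }],
        }
        yield f"data: {json.dumps(r)}\n\n"
    yield "data: [DONE]\n\n"
```

This would plug into the same `Response(..., mimetype='text/event-stream')` call as above.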

Context: it's a single-shot response; with any kind of conversation context it always takes the first message?

  • it would be good to at least take the last message with role "user"
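A minimal sketch of picking the last user message from an OpenAI-style messages list (the function name is illustrative):

```python
def last_user_message(messages):
    # walk the conversation backwards and return the most recent
    # message with role "user", rather than the first one
    for msg in reversed(messages):
        if msg.get("role") == "user":
            return msg.get("content", "")
    return ""
```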

Port: The listening port 8000 is fixed; perhaps allow passing it as a CLI argument (I run a local LLM on port 8000 already).

  • using --port=8100 or so
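A sketch of what that could look like with argparse; the --port flag and its default are assumptions, and the Flask startup line is hypothetical:

```python
import argparse

def parse_args(argv=None):
    # illustrative CLI: make the listening port configurable,
    # defaulting to the current hardcoded 8000
    parser = argparse.ArgumentParser()
    parser.add_argument("--port", type=int, default=8000,
                        help="port the proxy listens on")
    return parser.parse_args(argv)

# app.run(port=parse_args().port)  # hypothetical Flask startup
```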

Proxy vs Library
I personally wrote my own AI.py which uses litellm internally, and works like this:

m = AI("openai:http://localhost:8000/v1#llama-3.1-8b")     # using the <provider>:<model> or <provider>:<base_url>#<model> convention

def streaming(t):
    print(t, end="", flush=True)

r = m.query(q)                # blocking query
m.query(q, stream=streaming)  # streamed via callback
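For what it's worth, one way the spec string could be parsed; this is a guess at the convention described above, not the actual AI.py code:

```python
def parse_model_spec(spec):
    # illustrative parser for "<provider>:<model>" or
    # "<provider>:<base_url>#<model>"
    provider, rest = spec.split(":", 1)
    if "#" in rest:
        base_url, model = rest.rsplit("#", 1)
        return {"provider": provider, "base_url": base_url, "model": model}
    return {"provider": provider, "base_url": None, "model": rest}
```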

Recently I added .reason(q), which queries the same model with Chain-of-Thought "step-by-step" planning, then sends each step back to the model, and finally returns only the final response, a bit like you did.

So I'm thinking of adapting your approaches as .reason(q, approach="cot_reflection") and not using the proxy layer; I like to have some additional flexibility. Are you considering a library approach as well?

In my chat bot I started the following convention:

  • !what separates complexity domains as in scale in nature?, which prepends the instruction, i.e. sends f"Use chain of thought to compose a reply. {q}"
  • ^how much is 23+32*128?, which uses CoT planning and then queries each step separately.

In this sense, I think I'm going to map your approaches to a single character or some other intuitive convention, so I can use multiple strategies:

  • mcts
  • bon
  • moa
  • rto
  • z3
  • self_consistency or sc
  • pvg
  • rstar or r*
  • cot_reflection or cotr
  • plansearch or ps
  • leap

and then use something like ^{approach}<space>{query} like ^r* what is the actual source of gravity of atoms?.
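Such a dispatcher could be sketched like this; the alias table and function name are illustrative, and the approach names mirror the list above:

```python
# map short aliases to canonical approach names (illustrative)
ALIASES = {
    "sc": "self_consistency",
    "r*": "rstar",
    "cotr": "cot_reflection",
    "ps": "plansearch",
}
APPROACHES = {"mcts", "bon", "moa", "rto", "z3", "self_consistency",
              "pvg", "rstar", "cot_reflection", "plansearch", "leap"}

def parse_command(line):
    """Return (approach, query) for '^{approach} {query}', else (None, line)."""
    if line.startswith("^"):
        head, _, rest = line[1:].partition(" ")
        approach = ALIASES.get(head, head)
        if approach in APPROACHES and rest:
            return approach, rest
    return None, line
```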
