Thanks for sharing the code, inspirational!
Streaming: I noticed the OpenAI compatibility only covers single completions, not streaming?
- it would be nice to have it, to be a true drop-in replacement for existing frontends, taking the response and delivering it. Here's a rudimentary code snippet which works with litellm client-based apps but is otherwise not much tested yet:
```python
if data.get('stream', False):
    from flask import Response
    import json

    def generate():
        # emit the (already computed) final response as a single SSE chunk
        r = {
            'object': "chat.completion.chunk",
            'choices': [{
                'index': 0,
                'delta': {
                    'role': 'assistant',
                    'content': final_response,
                },
                'finish_reason': None,
            }],
        }
        yield f"data: {json.dumps(r)}\n\n"
        yield "data: [DONE]\n\n"

    return Response(generate(), mimetype='text/event-stream')
else:
    ...
```
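For what it's worth, the chunks the snippet emits can be consumed back on the client side like this (a sketch with no client library assumed; `parse_sse_chunks` is just an illustrative name):

```python
import json

def parse_sse_chunks(raw):
    """Collect the JSON payloads from an SSE stream; stop at the [DONE] sentinel."""
    chunks = []
    for line in raw.splitlines():
        if not line.startswith('data: '):
            continue
        payload = line[len('data: '):]
        if payload == '[DONE]':
            break
        chunks.append(json.loads(payload))
    return chunks

raw = 'data: {"object": "chat.completion.chunk", "choices": []}\n\ndata: [DONE]\n\n'
print(parse_sse_chunks(raw))  # → [{'object': 'chat.completion.chunk', 'choices': []}]
```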
Context: it's a single-shot response; any kind of conversation context always takes only the first message?
- it would be good to at least take the last user message (`mesg["user"]`)
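A minimal sketch of what I mean, assuming the OpenAI-style `messages` list the proxy already receives (`last_user_content` is just an illustrative name):

```python
def last_user_content(messages):
    """Return the content of the most recent user message, or None."""
    for msg in reversed(messages):
        if msg.get('role') == 'user':
            return msg.get('content')
    return None

messages = [
    {'role': 'user', 'content': 'first question'},
    {'role': 'assistant', 'content': 'first answer'},
    {'role': 'user', 'content': 'follow-up question'},
]
print(last_user_content(messages))  # → follow-up question
```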
Port: The listening port 8000 is fixed; perhaps also allow passing it as a CLI argument, e.g. `--port=8100` or so (I run a local LLM on port 8000 already).
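Something like this would do it (a sketch using stdlib `argparse`; the Flask app would then be started with `app.run(port=args.port)`):

```python
import argparse

def parse_cli(argv=None):
    """Parse the proxy's CLI arguments; --port avoids clashing with other local servers."""
    parser = argparse.ArgumentParser(description='optimizing proxy')
    parser.add_argument('--port', type=int, default=8000,
                        help='listening port (default: 8000)')
    return parser.parse_args(argv)

args = parse_cli(['--port', '8100'])
print(args.port)  # → 8100
```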
Proxy vs Library
I personally wrote my own AI.py which uses litellm internally, and works like this:
```python
# using <provider>:<model> respectively <provider>:<base_url>#<model> convention
m = AI("openai:http://localhost:8000/v1#llama-3.1-8b")

def streaming(t):
    print(t, end="", flush=True)

r = m.query(q)                  # blocking query
m.query(q, stream=streaming)    # streaming via callback
```
I recently added `.reason(q)`, which queries the same model with Chain-of-Thought "step-by-step" planning, then queries each step back to the model, and returns only the final response, a bit like you did.
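Roughly, that `.reason(q)` loop looks like this (a sketch; `query_model` is a stand-in for the actual litellm call, and the prompts are illustrative):

```python
def reason(q, query_model):
    """Chain-of-Thought planning: ask for a step-by-step plan,
    run each step back through the model, return only the final answer."""
    plan = query_model(f"Break the following task into numbered steps, one per line:\n{q}")
    steps = [line for line in plan.splitlines() if line.strip()]
    context = q
    for step in steps:
        result = query_model(f"Context so far:\n{context}\n\nCarry out this step: {step}")
        context += f"\n{step}\n{result}"
    return query_model(f"Given the work below, give only the final answer:\n{context}")
```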
So, I'm thinking of adapting your approaches as `.reason(q, approach="cot_reflection")` rather than using the proxy layer; I'd like to have some additional flexibility. Are you considering a library approach as well?
In my chat bot I started the following convention:
- `!what separates complexity domains as in scale in nature?` inserts the prefix `f"Use chain of thought to compose a reply. {q}"`
- `^how much is 23+32*128?` uses CoT planning, and then queries each step separately.
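The `!` convention is just simple prompt rewriting before the query is sent (a minimal sketch):

```python
def preprocess(line):
    """Apply the '!' prefix convention: inject a chain-of-thought instruction."""
    if line.startswith('!'):
        q = line[1:]
        return f"Use chain of thought to compose a reply. {q}"
    return line

print(preprocess('!what separates complexity domains as in scale in nature?'))
# → Use chain of thought to compose a reply. what separates complexity domains as in scale in nature?
```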
In this sense, I think I'll map your approaches to a single character or some other intuitive convention, so I can use multiple strategies:
- `mcts`
- `bon`
- `moa`
- `rto`
- `z3`
- `self_consistency` or `sc`
- `pvg`
- `rstar` or `r*`
- `cot_reflection` or `cotr`
- `plansearch` or `ps`
- `leap`
and then use something like `^{approach}<space>{query}`, e.g. `^r* what is the actual source of gravity of atoms?`.
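The dispatch for that convention is straightforward (a sketch; the alias table and `parse_command` are my own naming, not anything from your code):

```python
# map of shorthand aliases to strategy names, per the list above
ALIASES = {
    'mcts': 'mcts', 'bon': 'bon', 'moa': 'moa', 'rto': 'rto', 'z3': 'z3',
    'self_consistency': 'self_consistency', 'sc': 'self_consistency',
    'pvg': 'pvg', 'rstar': 'rstar', 'r*': 'rstar',
    'cot_reflection': 'cot_reflection', 'cotr': 'cot_reflection',
    'plansearch': 'plansearch', 'ps': 'plansearch', 'leap': 'leap',
}

def parse_command(line):
    """Split '^{approach} {query}' into (strategy, query); None if it doesn't match."""
    if not line.startswith('^'):
        return None
    head, _, query = line[1:].partition(' ')
    strategy = ALIASES.get(head)
    if strategy is None or not query:
        return None
    return strategy, query

print(parse_command('^r* what is the actual source of gravity of atoms?'))
# → ('rstar', 'what is the actual source of gravity of atoms?')
```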