The Python API for FlexFlow Serve enables users to initialize, manage, and interact with large language models (LLMs) via FastAPI or Gradio. It requires:
- A FlexFlow Serve setup with the necessary configurations.
- FastAPI and Uvicorn for running the API server.
Users can configure the API using FastAPI to handle requests and manage the model:
- FastAPI Application Initialization: Initialize the FastAPI application to create API endpoints.
- Request Model Definition: Define the model for API requests using Pydantic.
- Global Variable for LLM Model: Declare a global variable to store the LLM model.
```python
from fastapi import FastAPI
from pydantic import BaseModel
import flexflow.serve as ff

app = FastAPI()

class PromptRequest(BaseModel):
    prompt: str

llm = None
```
Create API endpoints for LLM interactions to handle generation requests.
- Initialize Model on Startup: Use the FastAPI startup event handler to initialize and compile the LLM model when the API server starts.
- Generate Response Endpoint: Create a POST endpoint to generate responses based on the user's prompt.
```python
@app.on_event("startup")
async def startup_event():
    global llm
    # Initialize and compile the LLM model; the runtime and model
    # settings below are illustrative placeholders.
    ff.init(num_gpus=1, memory_per_gpu=14000, zero_copy_memory_per_node=30000)
    llm = ff.LLM("meta-llama/Llama-2-7b-hf")  # any FlexFlow-supported model
    generation_config = ff.GenerationConfig(do_sample=False)
    llm.compile(
        generation_config,
        # ... other params as needed
    )
    llm.start_server()
```
```python
@app.post("/generate/")
async def generate(prompt_request: PromptRequest):
    # ... exception handling
    full_output = llm.generate([prompt_request.prompt])[0].output_text.decode('utf-8')
    # ... split prompt and response text for returning results
    return {"prompt": prompt_request.prompt, "response": full_output}
```
Instructions for running and testing the FastAPI server:
- Run the FastAPI Server: Use Uvicorn to run the FastAPI server with the specified host and port.
- Testing the API: Make requests to the API endpoints and verify the responses, for example with the snippet shown after the run command below.
```bash
# Running within the inference/python folder:
uvicorn entrypoint.fastapi_incr:app --reload --port 3000
```
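A quick way to exercise the `/generate/` endpoint is with Python's `requests` library (a minimal sketch; the port matches the uvicorn command above, and the prompt text is arbitrary):

```python
import requests

# POST a prompt to the running server and print the JSON response.
resp = requests.post(
    "http://localhost:3000/generate/",
    json={"prompt": "Here are some travel tips for Tokyo:"},
)
resp.raise_for_status()
print(resp.json())  # {"prompt": ..., "response": ...}
```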
A complete code example for a web-document Q&A using FlexFlow can be found here: