Summary
Request to add Amazon SageMaker AI as a supported LLM provider for both the indexing (tree generation) and retrieval phases.
Motivation
Enterprise users with existing SageMaker deployments would benefit from:
- Custom model hosting: Use fine-tuned or proprietary models deployed on SageMaker endpoints
- Cost optimization: Leverage reserved capacity, Savings Plans, or spot instances
- VPC integration: Keep all inference within private VPCs with no internet egress
- Model flexibility: Deploy any Hugging Face, custom, or third-party model
Use Cases
- Self-hosted open models: Deploy Llama, Mistral, Qwen, or other open models on SageMaker
- Fine-tuned models: Use domain-specific fine-tuned models for better accuracy
- Air-gapped environments: Operate in environments without external API access
- Cost control: Predictable pricing with provisioned endpoints vs per-token API costs
Proposed Implementation
Extend the LLMProvider abstraction to support SageMaker endpoints:
```python
import asyncio
import json

import boto3


class SageMakerProvider(LLMProvider):  # LLMProvider: existing provider abstraction in this repo
    def __init__(self, endpoint_name: str, region: str = "us-east-1"):
        self.client = boto3.client('sagemaker-runtime', region_name=region)
        self.endpoint_name = endpoint_name

    def call(self, prompt: str) -> str:
        # Payload format depends on the deployed container (HuggingFace TGI, vLLM, etc.)
        payload = {
            "inputs": prompt,
            "parameters": {"max_new_tokens": 4096, "temperature": 0},
        }
        response = self.client.invoke_endpoint(
            EndpointName=self.endpoint_name,
            ContentType="application/json",
            Body=json.dumps(payload),
        )
        result = json.loads(response['Body'].read().decode())
        return result[0]['generated_text']

    async def call_async(self, prompt: str) -> str:
        # boto3 is synchronous, so run the blocking call in a thread pool
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self.call, prompt)
```
Usage Example
```bash
# With SageMaker endpoint
python run_pageindex.py --pdf_path doc.pdf \
    --provider sagemaker \
    --endpoint my-llama-endpoint

# Environment-based
export PAGEINDEX_PROVIDER=sagemaker
export SAGEMAKER_ENDPOINT=my-llama-endpoint
export AWS_REGION=us-east-1
python run_pageindex.py --pdf_path doc.pdf
```
Implementation Considerations
- Endpoint format variability: different model containers (TGI, vLLM, Triton) expose different request/response formats (see the sketch after this list)
- Streaming support: Some endpoints support streaming responses
- Batching: SageMaker supports batch inference for cost optimization during indexing
- IAM authentication: Standard boto3 credential chain
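
To make the format-variability point concrete, here is a minimal sketch of per-container request/response adapters. The container-type keys, the `REQUEST_BUILDERS`/`RESPONSE_PARSERS` tables, and the exact schemas shown (a TGI container returning `[{"generated_text": ...}]`, an OpenAI-compatible vLLM server returning `{"choices": [{"text": ...}]}`) are assumptions for illustration, not part of the existing codebase:

```python
import json

# Hypothetical adapters keyed by container type; the schemas below are
# assumptions about common deployments, not guarantees.
REQUEST_BUILDERS = {
    # HuggingFace TGI-style containers
    "tgi": lambda prompt: {
        "inputs": prompt,
        "parameters": {"max_new_tokens": 4096, "temperature": 0},
    },
    # vLLM served behind an OpenAI-compatible completions schema
    "vllm-openai": lambda prompt: {
        "prompt": prompt,
        "max_tokens": 4096,
        "temperature": 0,
    },
}

RESPONSE_PARSERS = {
    "tgi": lambda body: json.loads(body)[0]["generated_text"],
    "vllm-openai": lambda body: json.loads(body)["choices"][0]["text"],
}


def build_endpoint_payload(container_type: str, prompt: str) -> str:
    """Serialize a prompt into the request body the container expects."""
    return json.dumps(REQUEST_BUILDERS[container_type](prompt))


def parse_endpoint_response(container_type: str, body: bytes) -> str:
    """Normalize the raw endpoint response body to plain generated text."""
    return RESPONSE_PARSERS[container_type](body.decode())
```

A `--container_type` CLI flag or a `SAGEMAKER_CONTAINER_TYPE` environment variable (both hypothetical) could then select the adapter pair without changing the provider class itself.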
Related
- Issue #104: Feature Request: Add Amazon Bedrock support for indexing and retrieval
- Issue #90: Support custom models
- Issue #27: Ollama support
- PR #43: Feat: Add multi-provider LLM support (OpenAI + Gemini)
Happy to contribute a PR if this feature is welcome!