Feature Request: Add Amazon SageMaker AI support for indexing and retrieval #105

@dgallitelli

Summary

Request to add Amazon SageMaker AI as a supported LLM provider for both the indexing (tree generation) and retrieval phases.

Motivation

Enterprise users with existing SageMaker deployments would benefit from:

  • Custom model hosting: Use fine-tuned or proprietary models deployed on SageMaker endpoints
  • Cost optimization: Leverage SageMaker Savings Plans or reserved endpoint capacity
  • VPC integration: Keep all inference within private VPCs with no internet egress
  • Model flexibility: Deploy any Hugging Face, custom, or third-party model

Use Cases

  1. Self-hosted open models: Deploy Llama, Mistral, Qwen, or other open models on SageMaker
  2. Fine-tuned models: Use domain-specific fine-tuned models for better accuracy
  3. Air-gapped environments: Operate in environments without external API access
  4. Cost control: Predictable pricing with provisioned endpoints vs per-token API costs

Proposed Implementation

Extend the LLMProvider abstraction to support SageMaker endpoints:

import asyncio
import json

import boto3


class SageMakerProvider(LLMProvider):
    def __init__(self, endpoint_name: str, region: str = "us-east-1"):
        self.client = boto3.client("sagemaker-runtime", region_name=region)
        self.endpoint_name = endpoint_name

    def call(self, prompt: str) -> str:
        # Payload shape depends on the deployed container
        # (HuggingFace TGI shown here; vLLM, Triton, etc. differ).
        payload = {
            "inputs": prompt,
            # Greedy decoding; note TGI rejects temperature=0.
            "parameters": {"max_new_tokens": 4096, "do_sample": False},
        }
        response = self.client.invoke_endpoint(
            EndpointName=self.endpoint_name,
            ContentType="application/json",
            Body=json.dumps(payload),
        )
        result = json.loads(response["Body"].read().decode())
        return result[0]["generated_text"]  # TGI response shape

    async def call_async(self, prompt: str) -> str:
        # boto3 is synchronous; run the blocking call in a worker thread.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self.call, prompt)
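
For a quick check outside the CLI, the provider could be exercised directly. A minimal sketch, assuming the class above and an already-deployed endpoint (the endpoint name is a placeholder):

provider = SageMakerProvider(endpoint_name="my-llama-endpoint", region="us-east-1")
print(provider.call("Return the title of the following document: ..."))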

Usage Example

# With SageMaker endpoint
python run_pageindex.py --pdf_path doc.pdf \
    --provider sagemaker \
    --endpoint my-llama-endpoint

# Environment-based
export PAGEINDEX_PROVIDER=sagemaker
export SAGEMAKER_ENDPOINT=my-llama-endpoint
export AWS_REGION=us-east-1
python run_pageindex.py --pdf_path doc.pdf
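
Internally, the environment-based path could resolve the provider with a small factory. A sketch only; provider_from_env is a hypothetical helper, and the variable names follow the proposal above:

import os

def provider_from_env() -> LLMProvider:
    # Hypothetical helper: dispatches on the proposed PAGEINDEX_PROVIDER variable.
    if os.environ.get("PAGEINDEX_PROVIDER") == "sagemaker":
        return SageMakerProvider(
            endpoint_name=os.environ["SAGEMAKER_ENDPOINT"],
            region=os.environ.get("AWS_REGION", "us-east-1"),
        )
    raise ValueError("Unsupported or missing PAGEINDEX_PROVIDER")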

Implementation Considerations

  1. Endpoint format variability: Different model containers (TGI, vLLM, Triton) use different request/response schemas; a sketch of isolating this behind a small adapter follows this list
  2. Streaming support: Some endpoints support streamed responses via invoke_endpoint_with_response_stream
  3. Batching: SageMaker Batch Transform could reduce cost during indexing, where many calls are issued up front
  4. IAM authentication: The standard boto3 credential chain (environment variables, profiles, instance roles) works unchanged
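
One way to contain the format variability from item 1 is to isolate payload construction and response parsing per container. A sketch assuming a TGI container and a vLLM container exposing an OpenAI-compatible completions API; the exact schemas and the served model name should be verified against the deployed image:

import json

def build_payload(container: str, prompt: str) -> dict:
    # Request shape differs per serving container.
    if container == "tgi":
        return {"inputs": prompt,
                "parameters": {"max_new_tokens": 4096, "do_sample": False}}
    if container == "vllm-openai":  # vLLM's OpenAI-compatible server
        return {"model": "served-model",  # placeholder; depends on deployment
                "prompt": prompt, "max_tokens": 4096}
    raise ValueError(f"Unknown container: {container}")

def parse_response(container: str, body: bytes) -> str:
    # Response shape differs as well.
    result = json.loads(body.decode())
    if container == "tgi":
        return result[0]["generated_text"]
    if container == "vllm-openai":
        return result["choices"][0]["text"]
    raise ValueError(f"Unknown container: {container}")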

Happy to contribute a PR if this feature is welcome!
