Open-Source Document Processing Platform for Intelligent Search and Indexing
Pipestream AI is an open-source platform that transforms documents into searchable knowledge using AI-powered processing. It provides a flexible, graph-based architecture for ingesting, parsing, chunking, and embedding documents for intelligent search and indexing.
Key features:
- Network Graph Architecture - Not a linear pipeline, but a flexible network with fan-in and fan-out capabilities
- Multiple Entry Points - Connectors, direct API calls, or Kafka events
- Flexible Storage - S3 repository or in-memory processing
- Multiple Chunking Strategies - Apply different chunking approaches to the same document
- Multiple Embedding Models - Generate vector embeddings using multiple models simultaneously
- OpenSearch Integration - Full-text, vector, and hybrid search capabilities
- Transport Flexibility - gRPC for low latency, Kafka for high throughput
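To make the fan-out idea concrete, here is a minimal sketch of running several chunking strategies and several embedding models over the same document. Everything in it is an assumption for illustration: the function names, the two toy chunkers, and the hash-based stand-in "embedders" are hypothetical and are not part of the Pipestream API.

```python
import hashlib
from itertools import product

def chunk_fixed(text, size=40):
    """Split text into fixed-size character windows."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def chunk_sentences(text):
    """Split text on periods, keeping one sentence per chunk."""
    return [s.strip() + "." for s in text.split(".") if s.strip()]

def make_toy_embedder(dim):
    """Return a stand-in embedder (NOT a real model): hashes text into `dim` floats."""
    def embed(chunk):
        digest = hashlib.sha256(chunk.encode("utf-8")).digest()  # 32 bytes
        return [digest[i] / 255 for i in range(dim)]
    return embed

# Hypothetical registries of strategies and models.
CHUNKERS = {"fixed": chunk_fixed, "sentence": chunk_sentences}
EMBEDDERS = {"toy-4d": make_toy_embedder(4), "toy-8d": make_toy_embedder(8)}

def fan_out(text):
    """Run every chunker/embedder combination over the same document."""
    results = {}
    for (c_name, chunker), (e_name, embedder) in product(CHUNKERS.items(), EMBEDDERS.items()):
        chunks = chunker(text)
        results[(c_name, e_name)] = [(chunk, embedder(chunk)) for chunk in chunks]
    return results

doc = "Pipestream ingests documents. It parses them to text. It indexes the result."
outputs = fan_out(doc)
for key, pairs in outputs.items():
    print(key, len(pairs), "chunks")
```

Each (chunker, embedder) pair produces its own set of vectors, so one document can be indexed in several ways at once.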
Documentation and resources:
- Document Journey Guide - Comprehensive guide to how documents flow through the platform
- Architecture Overview - Platform architecture and design documentation
- Website - Visit https://pipestream.ai for more information
The Pipestream Platform operates as a network graph rather than a linear pipeline. The Pipeline Engine is the central routing hub and orchestrates data flow between processing nodes across four stages:
- Data Loading - Digital assets are ingested
- Data Transformation - Assets are transformed to text (parsing)
- Data Enhancement - Text is enhanced with chunking, embeddings, and AI processing
- Sink - Data is indexed to a search engine (OpenSearch)
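The stages above can be pictured as a small directed graph. The sketch below is illustrative only: the node names and the adjacency-map layout are assumptions, not Pipestream's actual configuration format. It enumerates every path a document could take from an entry point to the sink, which shows fan-out (one parser feeding two chunkers) and fan-in (two chunkers feeding one embedder).

```python
# Hypothetical graph: each node maps to the nodes it forwards documents to.
GRAPH = {
    "connector": ["parser"],                 # data loading
    "api": ["parser"],                       # second entry point (fan-in at the parser)
    "parser": ["chunker-a", "chunker-b"],    # data transformation, then fan-out
    "chunker-a": ["embedder"],               # data enhancement
    "chunker-b": ["embedder"],               # fan-in at the embedder
    "embedder": ["opensearch-sink"],
    "opensearch-sink": [],                   # sink: no downstream nodes
}

def journeys(graph, node):
    """Return every path from `node` to a sink. Assumes the graph has no cycles."""
    downstream = graph[node]
    if not downstream:
        return [[node]]
    return [[node] + rest for nxt in downstream for rest in journeys(graph, nxt)]

for path in journeys(GRAPH, "connector"):
    print(" -> ".join(path))
```

A linear pipeline would yield exactly one path per entry point; here the same document reaches the sink along two routes, one per chunker.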
Core components:
- Connectors - Discover, authenticate, and stream documents from various sources
- Repository Service - Manages S3 storage and metadata, publishes events
- Pipeline Engine - Orchestrates routing and transport between modules
- Processing Modules - Parsers, chunkers, embedders, and specialized processors
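As a rough illustration of how routing and transport might be described together, the sketch below attaches a transport to each hop between modules, using gRPC where low latency matters and Kafka where throughput matters. The `Route` and `Transport` types and the `pick_transport` helper are hypothetical names invented for this example; the real engine's configuration may look quite different.

```python
from dataclasses import dataclass
from enum import Enum

class Transport(Enum):
    GRPC = "grpc"    # direct call, suited to low-latency hops
    KAFKA = "kafka"  # queued delivery, suited to high-throughput hops

@dataclass(frozen=True)
class Route:
    """One hop between two modules (hypothetical structure)."""
    source: str
    target: str
    transport: Transport

def pick_transport(latency_sensitive: bool) -> Transport:
    """Choose gRPC for latency-sensitive hops, Kafka otherwise."""
    return Transport.GRPC if latency_sensitive else Transport.KAFKA

routes = [
    Route("parser", "chunker", pick_transport(latency_sensitive=True)),
    Route("chunker", "embedder", pick_transport(latency_sensitive=False)),
]

for route in routes:
    print(f"{route.source} -> {route.target} via {route.transport.value}")
```

Keeping the transport on the route rather than on the module lets the same module be reached over either transport, depending on the hop.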
This organization contains multiple repositories:
- Core Services - Platform services and orchestration
- Processing Modules - Specialized document processors
- Connectors - Document ingestion from various sources
- Frontend - Web interface and management tools
Pipestream AI is open source under the MIT License. We welcome contributions!
- Check out our documentation
- Review the architecture
- Open issues or pull requests in the relevant repositories
This project is licensed under the MIT License; see the LICENSE file in each repository for details.
- Website: https://pipestream.ai
- GitHub Organization: https://github.com/ai-pipestream
- Documentation: docs/
Building the future of intelligent document processing. 🚀