Collector is a Python library that collects data from sources such as databases, big data files, cloud storage, and APIs, and transforms it into a unified output structure. You define the collection and transformation rules in a custom configuration file format (.col), which keeps data integration tasks streamlined and maintainable.
- Features
- Getting Started
- Configuration File (.col)
- Connectors
- Transformations
- Output Formats
- Examples
- Contributing
- License
- Multiple Data Sources: Supports SQL databases, cloud storage (AWS S3, Google Cloud Storage, Azure Blob), CSV files, APIs, JSON, Parquet, and more.
- Flexible Transformation Rules: Apply type conversions, renaming, formatting, and custom transformations.
- Unified Output: Output data in various formats such as CSV, JSON, and Parquet with custom options.
- Modular Configuration: Use .col files to define data sources, transformations, and outputs, with support for imports to reuse configurations.
- Data Collection Modes: Choose between parallel and sequential data collection modes for improved performance.
- Extensible Architecture: Easily add new connectors and transformations to expand functionality.
Follow these steps to get started with Collector:
- Install Dependencies: Install required dependencies by running:
    pip install -r requirements.txt
- Define a Configuration File (.col): Create a .col file that specifies your data sources, transformation rules, and output configuration.
- Run the Collector: Use the provided script to run the collector with your configuration file:
    python scripts/run_collector.py <your_col_file.col>
The .col file is the heart of Collector, allowing you to define how data should be collected, transformed, and output. Below is a basic example of a .col file:
VERSION 1.0

# Optional: Set Collection Mode (default is 'sequence')
COLLECT_MODE parallel  # Can be 'parallel' or 'sequence'

# Define Data Sources
SOURCE sales_db TYPE sql {
    HOST "localhost"
    PORT 5432
    USERNAME "user"
    PASSWORD "pass"
    DATABASE "sales"
    QUERY "SELECT * FROM sales_data"
}

# Define Transformations
TRANSFORM unified_sales FROM sales_db {
    FIELD sale_date TYPE date FORMAT "%Y-%m-%d"
    FIELD amount TYPE float DEFAULT 0.0
}

# Define Output
OUTPUT unified_data TYPE parquet {
    PATH "/output/unified_sales.parquet"
    OPTIONS {
        COMPRESSION "gzip"
    }
}
- parallel: Data from all sources is collected concurrently, speeding up the process for large datasets or slower APIs.
- sequence (default): Data is collected sequentially, one source at a time.
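To make the difference between the two modes concrete, here is a minimal sketch using only the Python standard library. It is not Collector's implementation: the `collect` function, the `make_slow_source` helper, and the `sales_api` source name are hypothetical, and each source is assumed to be a zero-argument callable that returns a list of rows.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def collect(sources, mode="sequence"):
    """Run every fetch callable and return {source_name: rows}.

    `sources` maps a source name to a zero-argument callable; these are
    hypothetical stand-ins for real connectors.
    """
    if mode == "sequence":
        # One source at a time, in definition order.
        return {name: fetch() for name, fetch in sources.items()}
    if mode == "parallel":
        # Submit every fetch up front so I/O-bound calls overlap in threads.
        with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
            futures = {name: pool.submit(fetch) for name, fetch in sources.items()}
            return {name: future.result() for name, future in futures.items()}
    raise ValueError(f"unknown COLLECT_MODE: {mode!r}")


def make_slow_source(rows, delay=0.2):
    """Build a fake source that sleeps to imitate a slow database or API."""
    def fetch():
        time.sleep(delay)
        return rows
    return fetch


sources = {
    "sales_db": make_slow_source([{"amount": "10.5"}]),
    "sales_api": make_slow_source([{"amount": "3"}]),
}

sequential_result = collect(sources, mode="sequence")
parallel_result = collect(sources, mode="parallel")
```

Both calls return the same data. In this sketch the sequential run waits for each delay in turn, while the parallel run lets the delays overlap, which is why parallel collection tends to help most when sources are slow to respond rather than expensive to process.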
Collector includes connectors for various data sources:
- SQL Connector: Connect to SQL databases such as MySQL and PostgreSQL.
- CSV Connector: Read data from CSV files with customizable options.
- API Connector: Fetch data from RESTful APIs using GET, POST, and other methods.
- Parquet Connector: Read data from Parquet files with compression options.
- MongoDB Connector: Fetch data from MongoDB collections.
- Cloud Storage Connectors:
  - AWS S3: Fetch data from Amazon S3 buckets.
  - Google Cloud Storage: Fetch data from Google Cloud Storage buckets.
  - Azure Blob Storage: Fetch data from Azure Blob containers.
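One common way to keep a set of connectors like this extensible is a registry that maps each TYPE keyword to a class. The sketch below illustrates that pattern with the standard library only. Every name in it (`CONNECTORS`, `register_connector`, `build_connector`, `CsvConnector`, `MemoryConnector`) is hypothetical and is not taken from Collector's code, and the CSV connector reads from an in-memory string rather than a file to keep the example self-contained.

```python
import csv
import io

# Hypothetical registry: TYPE keyword -> connector class.
CONNECTORS = {}


def register_connector(type_name):
    """Class decorator that files a connector under its TYPE keyword."""
    def decorator(cls):
        CONNECTORS[type_name] = cls
        return cls
    return decorator


@register_connector("csv")
class CsvConnector:
    """Parse CSV text into a list of dict rows."""

    def __init__(self, text, delimiter=","):
        self.text = text
        self.delimiter = delimiter

    def fetch(self):
        reader = csv.DictReader(io.StringIO(self.text), delimiter=self.delimiter)
        return [dict(row) for row in reader]


@register_connector("memory")
class MemoryConnector:
    """Return rows that are already in memory (useful in tests)."""

    def __init__(self, rows):
        self.rows = rows

    def fetch(self):
        return list(self.rows)


def build_connector(type_name, **options):
    """Look up the connector class for a TYPE keyword and instantiate it."""
    try:
        cls = CONNECTORS[type_name]
    except KeyError:
        raise ValueError(f"no connector registered for TYPE {type_name!r}") from None
    return cls(**options)


rows = build_connector(
    "csv", text="sale_date;amount\n2024-01-31;19.99\n", delimiter=";"
).fetch()
```

With this layout, adding a new source type means writing one class with a `fetch` method and decorating it. The code that reads the configuration only ever calls `build_connector`.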
Define transformation rules in your .col file to:
- Convert data types (e.g., string to date, int to float).
- Rename fields.
- Apply conditional transformations.
- Set default values.
TRANSFORM unified_sales FROM sales_db {
    FIELD sale_date TYPE date FORMAT "%Y-%m-%d"
    FIELD amount TYPE float DEFAULT 0.0
}
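In plain Python, the two FIELD rules above could behave roughly as follows. This is a sketch under assumptions, not Collector's transformation engine: the `apply_rules` function and the dict-based `RULES` layout are hypothetical, and it assumes that DEFAULT applies when a value is missing or an empty string.

```python
from datetime import datetime


def apply_rules(row, rules):
    """Return a copy of `row` with each rule's field converted."""
    out = dict(row)
    for rule in rules:
        name = rule["field"]
        raw = row.get(name)
        if raw is None or raw == "":
            # Assumption: DEFAULT fills in missing or empty values.
            out[name] = rule.get("default")
        elif rule["type"] == "date":
            out[name] = datetime.strptime(raw, rule["format"]).date()
        elif rule["type"] == "float":
            out[name] = float(raw)
        else:
            raise ValueError(f"unsupported TYPE: {rule['type']!r}")
    return out


# Hypothetical in-memory form of the two FIELD lines.
RULES = [
    {"field": "sale_date", "type": "date", "format": "%Y-%m-%d"},
    {"field": "amount", "type": "float", "default": 0.0},
]

complete = apply_rules({"sale_date": "2024-01-31", "amount": "19.99"}, RULES)
missing = apply_rules({"sale_date": "2024-02-01", "amount": ""}, RULES)
```

Here `complete["sale_date"]` becomes a `datetime.date` and `complete["amount"]` a float, while `missing["amount"]` falls back to the default of 0.0.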
Collector supports various output formats:
- CSV: Output data to CSV files with customizable delimiters and headers.
- JSON: Save data as JSON with options for pretty printing.
- Parquet: Export data to Parquet files with optional compression.
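The CSV and JSON options mentioned above (delimiter, headers, pretty printing) correspond closely to parameters in Python's standard library. The sketch below shows that mapping; the `to_csv` and `to_json` helpers are hypothetical names, not Collector functions, and they return strings instead of writing files so the example stays self-contained. Parquet is left out because it needs a third-party library.

```python
import csv
import io
import json


def to_csv(rows, delimiter=",", header=True):
    """Serialize a list of dict rows to CSV text."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(rows[0]),  # column order follows the first row
        delimiter=delimiter,
        lineterminator="\n",
    )
    if header:
        writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows, pretty=False):
    """Serialize rows to JSON, indented when `pretty` is set."""
    return json.dumps(rows, indent=2 if pretty else None)


rows = [{"sale_date": "2024-01-31", "amount": 19.99}]
csv_text = to_csv(rows, delimiter=";")
json_text = to_json(rows)
pretty_json_text = to_json(rows, pretty=True)
```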
Check out the examples/ directory for sample .col files demonstrating different configurations:
- basic_example.col: A simple example using SQL and CSV sources.
- advanced_example.col: An advanced configuration with multiple data sources and transformations.
- parallel_example.col: Demonstrates parallel data collection from multiple sources.
- shared_sources.col: Demonstrates importing shared data sources across configurations.
We welcome contributions to improve Collector! To contribute:
- Fork the repository.
- Create a new branch for your feature or bug fix.
- Commit your changes and push to your fork.
- Open a pull request with a detailed description of your changes.
Please ensure that your code follows the project's coding standards and includes appropriate tests.
This project is licensed under the MIT License - see the LICENSE file for details.