The Collector

Collector is a Python library designed to collect data from various sources such as databases, big data files, cloud storage, APIs, and more, and transform the data into a unified output structure. This flexible and extensible tool allows you to define data collection and transformation rules using a custom configuration file format (.col), making data integration tasks streamlined and maintainable.

Features

Multiple Data Sources: Supports SQL databases, cloud storage (AWS S3, Google Cloud Storage, Azure Blob), CSV files, APIs, JSON, Parquet, and more.
Flexible Transformation Rules: Apply type conversions, renaming, formatting, and custom transformations.
Unified Output: Output data in various formats such as CSV, JSON, and Parquet with custom options.
Modular Configuration: Use .col files to define data sources, transformations, and outputs, with support for imports to reuse configurations.
Data Collection Modes: Choose between parallel and sequential data collection modes for improved performance.
Extensible Architecture: Easily add new connectors and transformations to expand functionality.

Getting Started

Follow these steps to get started with Collector:

Install Dependencies: Install required dependencies by running:
```
pip install -r requirements.txt
```
Define a Configuration File (.col): Create a .col file that specifies your data sources, transformation rules, and output configuration.
Run the Collector: Use the provided script to run the collector with your configuration file:
```
python scripts/run_collector.py <your_col_file.col>
```

Configuration File (.col)

The .col file is the heart of Collector, allowing you to define how data should be collected, transformed, and output. Below is a basic example of a .col file:

VERSION 1.0

# Optional: Set Collection Mode (default is 'sequence')
COLLECT_MODE parallel  # Can be 'parallel' or 'sequence'

# Define Data Sources
SOURCE sales_db TYPE sql {
    HOST "localhost"
    PORT 5432
    USERNAME "user"
    PASSWORD "pass"
    DATABASE "sales"
    QUERY "SELECT * FROM sales_data"
}

# Define Transformations
TRANSFORM unified_sales FROM sales_db {
    FIELD sale_date TYPE date FORMAT "%Y-%m-%d"
    FIELD amount TYPE float DEFAULT 0.0
}

# Define Output
OUTPUT unified_data TYPE parquet {
    PATH "/output/unified_sales.parquet"
    OPTIONS {
        COMPRESSION "gzip"
    }
}

Collect Mode

parallel: Data from all sources is collected concurrently, speeding up the process for large datasets or slower APIs.
sequence (default): Data is collected sequentially, one source at a time.

Connectors

Collector includes connectors for various data sources:

SQL Connector: Connect to SQL databases like MySQL, PostgreSQL, etc.
CSV Connector: Read data from CSV files with customizable options.
API Connector: Fetch data from RESTful APIs using GET, POST, and other methods.
Parquet Connector: Read data from Parquet files with compression options.
MongoDB Connector: Fetch data from MongoDB collections.
Cloud Storage Connectors:
- AWS S3: Fetch data from Amazon S3 buckets.
- Google Cloud Storage: Fetch data from Google Cloud Storage buckets.
- Azure Blob Storage: Fetch data from Azure Blob containers.

Transformations

Define transformation rules in your .col file to:

Convert data types (e.g., string to date, int to float).
Rename fields.
Apply conditional transformations.
Set default values.

Example Transformation

TRANSFORM unified_sales FROM sales_db {
    FIELD sale_date TYPE date FORMAT "%Y-%m-%d"
    FIELD amount TYPE float DEFAULT 0.0
}

Output Formats

Collector supports various output formats:

CSV: Output data to CSV files with customizable delimiters and headers.
JSON: Save data as JSON with options for pretty printing.
Parquet: Export data to Parquet files with optional compression.

Examples

Check out the examples/ directory for sample .col files demonstrating different configurations:

basic_example.col: A simple example using SQL and CSV sources.
advanced_example.col: An advanced configuration with multiple data sources and transformations.
parallel_example.col: Demonstrates parallel data collection from multiple sources.
shared_sources.col: Demonstrates importing shared data sources across configurations.

Contributing

We welcome contributions to improve Collector! To contribute:

Fork the repository.
Create a new branch for your feature or bug fix.
Commit your changes and push to your fork.
Open a pull request with a detailed description of your changes.

Please ensure that your code follows the project's coding standards and includes appropriate tests.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

The Collector

Table of Contents

Features

Getting Started

Configuration File (.col)

Collect Mode

Connectors

Transformations

Example Transformation

Output Formats

Examples

Contributing

License

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 71 Commits
collector		collector
examples		examples
scripts		scripts
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
setup.py		setup.py

Folders and files

Latest commit

History

Repository files navigation

The Collector

Table of Contents

Features

Getting Started

Configuration File (.col)

Collect Mode

Connectors

Transformations

Example Transformation

Output Formats

Examples

Contributing

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages