This project simulates a mini donation data pipeline that:
- Ingests raw donation data in JSON format
- Validates the data using schema rules (via pandera)
- Transforms the data with derived features
- Aggregates key metrics for reporting
- Stores outputs in Parquet format
- Reads JSON/CSV/Parquet files using pandas
- Validates data types, ranges, and formats
- Saves cleaned data to a staging area
- Adds derived features: donation date, hour, tier (see the pandas sketch after this list)
- Computes aggregated metrics:
- Donations per campaign per day
- Total donations per day
- Saves outputs to the final storage
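A minimal sketch of the derived-feature and aggregation logic with pandas. The column names (campaign_id, amount, created_at) and tier boundaries below are assumptions for illustration, not the project's actual schema:

```python
# Illustrative only: column names and tier boundaries are assumptions.
import pandas as pd


def transform(donations: pd.DataFrame) -> pd.DataFrame:
    """Add derived features: donation date, hour, and tier."""
    out = donations.copy()
    out["created_at"] = pd.to_datetime(out["created_at"])
    out["donation_date"] = out["created_at"].dt.date
    out["donation_hour"] = out["created_at"].dt.hour
    out["donation_tier"] = pd.cut(
        out["amount"],
        bins=[0, 10, 100, float("inf")],
        labels=["small", "medium", "large"],
    )
    return out


def aggregate(donations: pd.DataFrame) -> pd.DataFrame:
    """Compute donations per campaign per day."""
    # Per-day totals follow the same pattern, grouping on donation_date alone.
    return (
        donations.groupby(["campaign_id", "donation_date"])["amount"]
        .agg(donation_count="count", total_amount="sum")
        .reset_index()
    )


# Usage: read staged data, derive features, aggregate, and persist as Parquet.
# staged = pd.read_parquet("staging/donations.parquet")
# aggregate(transform(staged)).to_parquet("output/daily_metrics.parquet", index=False)
```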
There are two major layers:
- service layer: service classes (business logic)
- process layer: processes that serve as the main entry points for Python jobs
Folder structure:
|-- src
| |-- constant
| |-- process
| |-- service
|-- tests
| |-- resources
| |-- process
| |-- service
|-- Pipfile
|-- pyproject.toml
|-- README.md
- src/: root of the source code
- tests/: root of the test source code
- pyproject.toml: the heart of modern Python packaging; it defines the project's metadata and build system
- README.md: a description of the project
- Interface and class implementation: Python's ABC module is used to define abstract base classes that act as interfaces (see the combined sketch after this list). Link: https://docs.python.org/3.9/library/abc.html
- Dependency Injection: the dependency-injector library is used; a Container retrieves/injects the concrete service implementation needed at runtime. PyPI link: https://pypi.org/project/dependency-injector/
- Schema validation: Pandera, an open-source package that provides a flexible and expressive API for performing data validation on dataframe-like objects.
- Pandas: a tool for processing and manipulating tabular data.
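The sketch below shows how these pieces can fit together: an abstract base service (abc), a Pandera schema for validation, and a dependency-injector container. All class, column, and provider names here are assumptions for illustration; the project's actual interfaces may differ.

```python
# Illustrative sketch only: the class, column, and provider names are assumptions.
from abc import ABC, abstractmethod

import pandas as pd
import pandera as pa
from dependency_injector import containers, providers


class BaseService(ABC):
    """Abstract base class acting as the service 'interface'."""

    @abstractmethod
    def run(self, source_data_path: str, output_dir_path: str, file_type: str) -> None:
        ...


# Pandera schema: validates data types, ranges, and formats on a DataFrame.
donation_schema = pa.DataFrameSchema(
    {
        "donation_id": pa.Column(str),
        "campaign_id": pa.Column(str),
        "amount": pa.Column(float, pa.Check.gt(0), coerce=True),
    }
)


class IngestionService(BaseService):
    """Concrete service: read raw data with pandas, validate, stage as Parquet."""

    def run(self, source_data_path: str, output_dir_path: str, file_type: str) -> None:
        readers = {"json": pd.read_json, "csv": pd.read_csv, "parquet": pd.read_parquet}
        raw = readers[file_type](source_data_path)
        validated = donation_schema.validate(raw)
        validated.to_parquet(f"{output_dir_path}/donations.parquet", index=False)


class Container(containers.DeclarativeContainer):
    """DI container: resolves the concrete service implementation at runtime."""

    ingestion_service = providers.Factory(IngestionService)
```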
# Ingestion Process
python process/ingestion_process.py --source_data_path <source_data_path> --output_dir_path <output_dir_path> --file_type <file_type>
# Transformation Process
python process/transformation_process.py --source_data_path <source_data_path> --output_dir_path <output_dir_path> --file_type <file_type>
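The process layer wires these pieces together. A hedged sketch of what an entry point such as process/ingestion_process.py might look like; the argument names come from the commands above, while the container and service module paths are assumptions:

```python
# Sketch of a process-layer entry point; module paths and names are assumptions.
import argparse

from service.container import Container  # hypothetical module providing the DI container


def main() -> None:
    parser = argparse.ArgumentParser(description="Ingest raw donation data.")
    parser.add_argument("--source_data_path", required=True)
    parser.add_argument("--output_dir_path", required=True)
    parser.add_argument("--file_type", required=True, choices=["json", "csv", "parquet"])
    args = parser.parse_args()

    # Resolve the concrete service from the DI container and run the job.
    container = Container()
    service = container.ingestion_service()
    service.run(args.source_data_path, args.output_dir_path, args.file_type)


if __name__ == "__main__":
    main()
```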
Pytest is used for testing the application.
# Test the whole app
pytest
# Example: test a single class or module
pytest tests/service/test_base_service.py
Test coverage is checked during testing. It is configured in pyproject.toml; the setting can be overridden from the command line:
pytest --cov=src/service
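A hedged sketch of what a unit test under tests/service/ might look like; the service class, its signature, and the expected behavior are assumptions for illustration:

```python
# tests/service/test_ingestion_service.py -- illustrative only; the real tests
# and service APIs may differ.
import pandas as pd
import pytest

from service.ingestion_service import IngestionService  # hypothetical import path


def test_ingestion_writes_parquet(tmp_path):
    # Arrange: write a tiny raw JSON file.
    raw = pd.DataFrame(
        {"donation_id": ["d1"], "campaign_id": ["c1"], "amount": [25.0]}
    )
    source = tmp_path / "donations.json"
    raw.to_json(source, orient="records")

    # Act: run the service end to end.
    IngestionService().run(str(source), str(tmp_path), "json")

    # Assert: the staged Parquet file exists and round-trips the data.
    staged = pd.read_parquet(tmp_path / "donations.parquet")
    assert len(staged) == 1
    assert staged.loc[0, "amount"] == pytest.approx(25.0)
```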
Pipenv is recommended. It is a Python virtualenv management tool that combines pip, virtualenv, and Pipfile into a single unified interface. It creates and manages virtual environments for your projects automatically, while also maintaining a Pipfile for package requirements and a Pipfile.lock for deterministic builds.
https://pipenv.pypa.io/en/latest/
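A typical Pipenv workflow for this project looks like the following (standard Pipenv commands; adapt to your setup):

```bash
# Install runtime and dev dependencies from the Pipfile and create the virtualenv
pipenv install --dev

# Spawn a shell inside the virtualenv...
pipenv shell

# ...or run a one-off command without activating the shell
pipenv run pytest
```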
Distributing your Python code in a standardized, efficient format is crucial for sharing your work and ensuring it's easily usable by others. The "wheel" format is the modern standard for Python distributions, offering faster installation and greater reliability compared to older formats.
You'll need the build package to create your wheel file. It's best to install it in your project's virtual environment:
pip install build
Once you have your pyproject.toml configured and build installed, creating the wheel is a single command executed in the root of your project directory:
python -m build
This command will create a dist/ directory in your project root containing two files. The .whl file is the packaged project, ready to be installed by pip or uploaded to the Python Package Index (PyPI).
- data-pipeline-0.1.0-py3-none-any.whl: the wheel file
- data-pipeline-0.1.0.tar.gz: the source distribution (sdist)
Note: One needs to manually copy the dependencies from the [packages] section in Pipfile into the dependencies list in the pyproject.toml. There is no universally adopted, standard tool to automatically convert a Pipfile directly into packaging metadata.
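For example, a Pipfile [packages] section like the one sketched below (package names taken from this project's stack, pins illustrative) has to be mirrored by hand in the [project] dependencies list of pyproject.toml:

```toml
# Pipfile (illustrative)
[packages]
pandas = "*"
pandera = "*"
dependency-injector = "*"
```

```toml
# pyproject.toml (illustrative)
[project]
dependencies = [
    "pandas",
    "pandera",
    "dependency-injector",
]
```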