diff --git a/.gitattributes b/.gitattributes new file mode 100644 index 00000000..45e90372 --- /dev/null +++ b/.gitattributes @@ -0,0 +1,49 @@ +# Auto detect text files and perform LF normalization +* text=auto + +# Source code +*.cc text +*.h text +*.py text +*.md text +*.txt text +*.toml text +*.yml text +*.yaml text +*.json text +*.cmake text +*.in text + +# Scripts +*.sh text eol=lf +*.bash text eol=lf + +# Documentation +*.rst text +*.ipynb text + +# Binary files +*.so binary +*.pyd binary +*.dylib binary +*.dll binary +*.a binary +*.o binary +*.png binary +*.jpg binary +*.jpeg binary +*.gif binary +*.ico binary +*.pdf binary + +# Git +.gitattributes export-ignore +.gitignore export-ignore +.github export-ignore + +# Language statistics for GitHub +*.h linguist-language=C++ +*.cc linguist-language=C++ +include/prtree/** linguist-language=C++ +src/cpp/** linguist-language=C++ +benchmarks/cpp/** linguist-language=C++ diff --git a/.github/ISSUE_TEMPLATE/bug_report.yml b/.github/ISSUE_TEMPLATE/bug_report.yml new file mode 100644 index 00000000..5c66f8db --- /dev/null +++ b/.github/ISSUE_TEMPLATE/bug_report.yml @@ -0,0 +1,103 @@ +name: Bug Report +description: Report a bug or unexpected behavior +title: "[Bug]: " +labels: ["bug", "needs-triage"] +body: + - type: markdown + attributes: + value: | + Thanks for taking the time to report a bug! Please fill out the information below. + + - type: textarea + id: description + attributes: + label: Bug Description + description: A clear and concise description of what the bug is. + placeholder: Describe the bug... + validations: + required: true + + - type: textarea + id: reproduce + attributes: + label: Steps to Reproduce + description: Steps to reproduce the behavior + placeholder: | + 1. Create a tree with... + 2. Call query with... + 3. See error... + validations: + required: true + + - type: textarea + id: expected + attributes: + label: Expected Behavior + description: What did you expect to happen? 
+ placeholder: Expected to return... + validations: + required: true + + - type: textarea + id: actual + attributes: + label: Actual Behavior + description: What actually happened? Include any error messages. + placeholder: | + Error message: + ``` + paste error here + ``` + validations: + required: true + + - type: textarea + id: code + attributes: + label: Minimal Reproducible Example + description: Please provide a minimal code example that reproduces the issue + placeholder: | + ```python + from python_prtree import PRTree2D + # your code here + ``` + render: python + validations: + required: true + + - type: input + id: version + attributes: + label: python_prtree Version + description: What version are you using? + placeholder: "0.7.0" + validations: + required: true + + - type: input + id: python-version + attributes: + label: Python Version + description: What Python version are you using? + placeholder: "3.11" + validations: + required: true + + - type: dropdown + id: os + attributes: + label: Operating System + options: + - Linux + - macOS + - Windows + - Other + validations: + required: true + + - type: textarea + id: additional + attributes: + label: Additional Context + description: Add any other context about the problem here + placeholder: Any additional information... diff --git a/.github/ISSUE_TEMPLATE/feature_request.yml b/.github/ISSUE_TEMPLATE/feature_request.yml new file mode 100644 index 00000000..dc0a18d9 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/feature_request.yml @@ -0,0 +1,54 @@ +name: Feature Request +description: Suggest a new feature or enhancement +title: "[Feature]: " +labels: ["enhancement"] +body: + - type: markdown + attributes: + value: | + Thanks for suggesting a feature! Please fill out the information below. + + - type: textarea + id: problem + attributes: + label: Problem Statement + description: Is your feature request related to a problem? Please describe. + placeholder: I'm always frustrated when... 
+ validations: + required: true + + - type: textarea + id: solution + attributes: + label: Proposed Solution + description: Describe the solution you'd like + placeholder: I would like to be able to... + validations: + required: true + + - type: textarea + id: alternatives + attributes: + label: Alternatives Considered + description: Describe alternatives you've considered + placeholder: I've considered... + + - type: textarea + id: example + attributes: + label: Example Usage + description: How would you use this feature? + placeholder: | + ```python + # Example code showing desired API + tree.new_feature(...) + ``` + render: python + + - type: checkboxes + id: contribution + attributes: + label: Contribution + description: Would you be willing to contribute this feature? + options: + - label: I'm willing to submit a PR for this feature diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md new file mode 100644 index 00000000..eb1ba587 --- /dev/null +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -0,0 +1,77 @@ +## Description + + + +Fixes #(issue) + +## Type of Change + + + +- [ ] Bug fix (non-breaking change which fixes an issue) +- [ ] New feature (non-breaking change which adds functionality) +- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected) +- [ ] Documentation update +- [ ] Code refactoring +- [ ] Performance improvement +- [ ] Test addition or modification + +## Changes Made + + + +- +- +- + +## Testing + + + +- [ ] All existing tests pass (`make test` or `pytest`) +- [ ] Added new tests for new functionality +- [ ] Tested on multiple Python versions (if applicable) +- [ ] Tested on multiple platforms (if applicable) + +### Test Commands Run + +```bash +# List the test commands you ran +make test +pytest tests/unit/test_*.py -v +``` + +## Documentation + +- [ ] Updated docstrings for modified functions/classes +- [ ] Updated README.md (if needed) +- [ ] Updated CHANGES.md +- [ ] 
Updated type hints (if applicable) + +## Checklist + +- [ ] My code follows the project's code style (`make format` and `make lint` pass) +- [ ] I have performed a self-review of my code +- [ ] I have commented my code, particularly in hard-to-understand areas +- [ ] My changes generate no new warnings +- [ ] I have added tests that prove my fix is effective or that my feature works +- [ ] New and existing unit tests pass locally with my changes +- [ ] Any dependent changes have been merged and published + +## Performance Impact + + + +- [ ] No performance impact +- [ ] Performance improvement (describe below) +- [ ] Potential performance regression (describe below and justify) + +## Breaking Changes + + + +N/A + +## Additional Notes + + diff --git a/.github/workflows/cibuildwheel.yml b/.github/workflows/cibuildwheel.yml index 75c941c4..55faac35 100644 --- a/.github/workflows/cibuildwheel.yml +++ b/.github/workflows/cibuildwheel.yml @@ -45,12 +45,11 @@ jobs: python-version: ${{ matrix.python }} - name: Install dependencies run: | - python -m pip install --upgrade pip wheel setuptools - python -m pip install numpy pytest + python -m pip install --upgrade pip build - name: Build and install - run: python -m pip install -e . 
+ run: python -m pip install -e ".[dev]" - name: Run tests - run: pytest tests -vv + run: python -m pytest tests -vv build_wheels: # Skip wheel builds on PRs - only build on main branch and tags diff --git a/.gitignore b/.gitignore index 3fcee396..fe449e1e 100644 --- a/.gitignore +++ b/.gitignore @@ -1,55 +1,59 @@ -cmake-build-*/ -docker/ -ldata/ +# Build artifacts build/ -build_*/ dist/ -_build/ -_generate/ +*.egg-info/ *.so -*.so.* +*.pyd +*.dylib +*.dll *.a -*.py[cod] -*.egg-info -.eggs/ -.idea/ -input/* -!input/.gitkeep +*.o + +# Python __pycache__/ -.ipynb_checkpoints/ +*.py[cod] +*$py.class +*.egg +.Python +.pytest_cache/ +.coverage +htmlcov/ +.tox/ +.nox/ +.hypothesis/ +.mypy_cache/ +.dmypy.json +dmypy.json +.ruff_cache/ + +# IDEs .vscode/ +.idea/ +*.swp +*.swo +*~ .DS_Store -*.prof - -# Test coverage -htmlcov/ -.coverage -.coverage.* -coverage.xml -*.cover -# Pytest -.pytest_cache/ -.pytest_cache +# CMake +CMakeCache.txt +CMakeFiles/ +cmake_install.cmake +Makefile +compile_commands.json -# Build artifacts -*.o -*.obj -*.lib -*.exp +# Profiling +*.prof +*.log +callgrind.* +perf.data* -# Temporary files -*.tmp -*.bak -*~ +# Documentation +docs/_build/ +site/ -# Phase 0 profiling artifacts (keep templates, ignore generated data) -docs/baseline/reports/*.txt -docs/baseline/reports/*.out -docs/baseline/reports/*.data -docs/baseline/flamegraphs/*.svg -*_benchmark_results.csv -*.prof -perf.data -perf.data.old -cachegrind.out* \ No newline at end of file +# Local development +.env +.venv +venv/ +ENV/ +env/ diff --git a/CMakeLists.txt b/CMakeLists.txt index ecf1e8ff..cc3ee0ba 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -41,7 +41,16 @@ elseif(ENABLE_UBSAN) endif() project(PRTree) -file(GLOB MYCPP ${CMAKE_CURRENT_SOURCE_DIR}/cpp/*) + +# Source files +set(PRTREE_SOURCES + ${CMAKE_CURRENT_SOURCE_DIR}/src/cpp/bindings/python_bindings.cc +) + +# Include directories +set(PRTREE_INCLUDE_DIRS + ${CMAKE_CURRENT_SOURCE_DIR}/include +) option(SNAPPY_BUILD_TESTS "" OFF) 
option(SNAPPY_BUILD_BENCHMARKS "" OFF) @@ -57,7 +66,13 @@ add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/third/pybind11/) add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/third/cereal/) add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/third/snappy/) -pybind11_add_module(PRTree ${MYCPP}) +pybind11_add_module(PRTree ${PRTREE_SOURCES}) + +# Include directories +target_include_directories(PRTree PRIVATE + ${PRTREE_INCLUDE_DIRS} +) + set_target_properties(snappy PROPERTIES POSITION_INDEPENDENT_CODE ON C_VISIBILITY_PRESET hidden @@ -93,8 +108,8 @@ if(BUILD_BENCHMARKS) message(STATUS "Building performance benchmarks") # Construction benchmark - add_executable(benchmark_construction benchmarks/benchmark_construction.cpp) - target_include_directories(benchmark_construction PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/cpp) + add_executable(benchmark_construction benchmarks/cpp/benchmark_construction.cpp) + target_include_directories(benchmark_construction PRIVATE ${PRTREE_INCLUDE_DIRS}) target_link_libraries(benchmark_construction PRIVATE cereal snappy) set_target_properties(benchmark_construction PROPERTIES CXX_STANDARD 20 @@ -103,8 +118,8 @@ if(BUILD_BENCHMARKS) ) # Query benchmark - add_executable(benchmark_query benchmarks/benchmark_query.cpp) - target_include_directories(benchmark_query PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/cpp) + add_executable(benchmark_query benchmarks/cpp/benchmark_query.cpp) + target_include_directories(benchmark_query PRIVATE ${PRTREE_INCLUDE_DIRS}) target_link_libraries(benchmark_query PRIVATE cereal snappy) set_target_properties(benchmark_query PROPERTIES CXX_STANDARD 20 @@ -113,8 +128,8 @@ if(BUILD_BENCHMARKS) ) # Multithreaded benchmark - add_executable(benchmark_parallel benchmarks/benchmark_parallel.cpp) - target_include_directories(benchmark_parallel PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/cpp) + add_executable(benchmark_parallel benchmarks/cpp/benchmark_parallel.cpp) + target_include_directories(benchmark_parallel PRIVATE ${PRTREE_INCLUDE_DIRS}) 
 target_link_libraries(benchmark_parallel PRIVATE cereal snappy) set_target_properties(benchmark_parallel PROPERTIES CXX_STANDARD 20 @@ -123,8 +138,8 @@ if(BUILD_BENCHMARKS) ) # Stress test - add_executable(stress_test_concurrent benchmarks/stress_test_concurrent.cpp) - target_include_directories(stress_test_concurrent PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/cpp) + add_executable(stress_test_concurrent benchmarks/cpp/stress_test_concurrent.cpp) + target_include_directories(stress_test_concurrent PRIVATE ${PRTREE_INCLUDE_DIRS}) target_link_libraries(stress_test_concurrent PRIVATE cereal snappy pthread) set_target_properties(stress_test_concurrent PROPERTIES CXX_STANDARD 20 diff --git a/MANIFEST.in b/MANIFEST.in index bdde255c..a54b12c2 100644 --- a/MANIFEST.in +++ b/MANIFEST.in @@ -1,6 +1,21 @@ -include README.md LICENSE -include requirements.txt +include README.md LICENSE CHANGES.md CONTRIBUTING.md docs/DEVELOPMENT.md docs/ARCHITECTURE.md +include pyproject.toml setup.py global-include CMakeLists.txt *.cmake -recursive-include cpp * -recursive-include src * -recursive-include third * \ No newline at end of file + +# C++ headers and source +recursive-include include *.h +recursive-include src/cpp *.h *.cc *.cpp + +# Python source +recursive-include src/python_prtree *.py *.typed + +# Third-party dependencies (git submodules) +recursive-include third * +exclude third/.git* +prune third/**/.git + +# Exclude build artifacts and caches +global-exclude *.pyc __pycache__ *.so *.pyd *.dylib +prune build +prune dist +prune .pytest_cache \ No newline at end of file diff --git a/Makefile b/Makefile index 7a145085..5a8fff53 100644 --- a/Makefile +++ b/Makefile @@ -135,21 +135,28 @@ install: ## Install package $(PIP) install . @echo "$(GREEN)✓ Installation complete$(RESET)" -dev-install: ## Install in development mode (pip install -e .) +dev-install: ## Install in development mode with all dependencies @echo "$(BOLD)Installing in development mode...$(RESET)" - $(PIP) install -e .
+ $(PIP) install -e ".[dev,docs,benchmark]" @echo "$(GREEN)✓ Development installation complete$(RESET)" install-deps: ## Install development dependencies @echo "$(BOLD)Installing development dependencies...$(RESET)" - $(PIP) install pytest pytest-cov pytest-xdist numpy + $(PIP) install -e ".[dev]" @echo "$(GREEN)✓ Dependencies installed$(RESET)" -format: ## Format C++ code (requires clang-format) +format: ## Format code (Python with black, C++ with clang-format) + @echo "$(BOLD)Formatting Python code...$(RESET)" + @if command -v black >/dev/null 2>&1 || $(PYTHON) -m black --version >/dev/null 2>&1; then \ + $(PYTHON) -m black $(SRC_DIR) $(TEST_DIR); \ + echo "$(GREEN)✓ Python formatting complete$(RESET)"; \ + else \ + echo "$(YELLOW)Warning: black not installed (pip install black)$(RESET)"; \ + fi + @echo "$(BOLD)Formatting C++ code...$(RESET)" @if command -v clang-format >/dev/null 2>&1; then \ - echo "$(BOLD)Formatting C++ code...$(RESET)"; \ find $(CPP_DIR) -name '*.h' -o -name '*.cc' | xargs clang-format -i; \ - echo "$(GREEN)✓ Formatting complete$(RESET)"; \ + echo "$(GREEN)✓ C++ formatting complete$(RESET)"; \ else \ echo "$(YELLOW)Warning: clang-format not installed$(RESET)"; \ fi @@ -162,15 +169,25 @@ lint-cpp: ## Lint C++ code (requires clang-tidy) echo "$(YELLOW)Warning: clang-tidy not installed$(RESET)"; \ fi -lint-python: ## Lint Python code (requires flake8) - @if command -v flake8 >/dev/null 2>&1; then \ - echo "$(BOLD)Linting Python code...$(RESET)"; \ - flake8 $(SRC_DIR) $(TEST_DIR) --max-line-length=100; \ +lint-python: ## Lint Python code (requires ruff) + @echo "$(BOLD)Linting Python code with ruff...$(RESET)" + @if command -v ruff >/dev/null 2>&1 || $(PYTHON) -m ruff --version >/dev/null 2>&1; then \ + $(PYTHON) -m ruff check $(SRC_DIR) $(TEST_DIR); \ + echo "$(GREEN)✓ Linting complete$(RESET)"; \ + else \ + echo "$(YELLOW)Warning: ruff not installed (pip install ruff)$(RESET)"; \ + fi + +type-check: ## Type check Python code (requires mypy) + 
@echo "$(BOLD)Type checking Python code...$(RESET)" + @if command -v mypy >/dev/null 2>&1 || $(PYTHON) -m mypy --version >/dev/null 2>&1; then \ + $(PYTHON) -m mypy $(SRC_DIR); \ + echo "$(GREEN)✓ Type checking complete$(RESET)"; \ else \ - echo "$(YELLOW)Warning: flake8 not installed$(RESET)"; \ + echo "$(YELLOW)Warning: mypy not installed (pip install mypy)$(RESET)"; \ fi -lint: lint-cpp lint-python ## Lint all code +lint: lint-cpp lint-python type-check ## Lint all code docs: ## Generate documentation (requires Doxygen) @if command -v doxygen >/dev/null 2>&1; then \ diff --git a/README.md b/README.md index 4b0025c2..dd72d5f8 100644 --- a/README.md +++ b/README.md @@ -184,17 +184,16 @@ results = tree.batch_query(queries) # Returns [[], [], ...] ## Installation from Source ```bash -# Install dependencies -pip install -U cmake pybind11 numpy - # Clone with submodules -git clone --recursive https://github.com/atksh/python_prtree +git clone --recursive https://github.com/atksh/python_prtree.git cd python_prtree -# Build and install -python setup.py install +# Install in development mode with all dependencies +pip install -e ".[dev]" ``` +For detailed development setup, see [DEVELOPMENT.md](docs/DEVELOPMENT.md). + ## API Reference ### PRTree2D / PRTree3D / PRTree4D @@ -238,6 +237,14 @@ Lars Arge, Mark de Berg, Herman Haverkort, Ke Yi SIGMOD 2004 [Paper](https://www.cse.ust.hk/~yike/prtree/) +## Documentation + +- **[CONTRIBUTING.md](CONTRIBUTING.md)** - How to contribute to the project +- **[CHANGES.md](CHANGES.md)** - Version history and changelog +- **[docs/DEVELOPMENT.md](docs/DEVELOPMENT.md)** - Development environment setup +- **[docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)** - Codebase structure and design +- **[docs/MIGRATION.md](docs/MIGRATION.md)** - Migration guide between versions + ## License See LICENSE file for details. 
diff --git a/benchmarks/benchmark_construction.cpp b/benchmarks/cpp/benchmark_construction.cpp similarity index 100% rename from benchmarks/benchmark_construction.cpp rename to benchmarks/cpp/benchmark_construction.cpp diff --git a/benchmarks/benchmark_parallel.cpp b/benchmarks/cpp/benchmark_parallel.cpp similarity index 100% rename from benchmarks/benchmark_parallel.cpp rename to benchmarks/cpp/benchmark_parallel.cpp diff --git a/benchmarks/benchmark_query.cpp b/benchmarks/cpp/benchmark_query.cpp similarity index 100% rename from benchmarks/benchmark_query.cpp rename to benchmarks/cpp/benchmark_query.cpp diff --git a/benchmarks/benchmark_utils.h b/benchmarks/cpp/benchmark_utils.h similarity index 100% rename from benchmarks/benchmark_utils.h rename to benchmarks/cpp/benchmark_utils.h diff --git a/benchmarks/stress_test_concurrent.cpp b/benchmarks/cpp/stress_test_concurrent.cpp similarity index 100% rename from benchmarks/stress_test_concurrent.cpp rename to benchmarks/cpp/stress_test_concurrent.cpp diff --git a/benchmarks/workloads.h b/benchmarks/cpp/workloads.h similarity index 100% rename from benchmarks/workloads.h rename to benchmarks/cpp/workloads.h diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md new file mode 100644 index 00000000..8376e840 --- /dev/null +++ b/docs/ARCHITECTURE.md @@ -0,0 +1,335 @@ +# Project Architecture + +This document describes the architecture and directory structure of python_prtree. + +## Overview + +python_prtree is a Python package that provides fast spatial indexing using the Priority R-Tree data structure. It consists of: + +1. **C++ Core**: High-performance implementation of the Priority R-Tree algorithm +2. **Python Bindings**: pybind11-based bindings exposing C++ functionality to Python +3. 
**Python Wrapper**: User-friendly Python API with additional features + +## Directory Structure + +``` +python_prtree/ +├── include/ # C++ Public Headers (API) +│ └── prtree/ +│ ├── core/ # Core algorithm headers +│ │ └── prtree.h # Main PRTree class template +│ └── utils/ # Utility headers +│ ├── parallel.h # Parallel processing utilities +│ └── small_vector.h # Optimized vector implementation +│ +├── src/ # Source Code +│ ├── cpp/ # C++ Implementation +│ │ ├── core/ # Core implementation (future) +│ │ └── bindings/ # Python bindings +│ │ └── python_bindings.cc # pybind11 bindings +│ │ +│ └── python_prtree/ # Python Package +│ ├── __init__.py # Package entry point +│ ├── core.py # PRTree2D/3D/4D classes +│ └── py.typed # Type hints marker (PEP 561) +│ +├── tests/ # Test Suite +│ ├── unit/ # Unit tests (individual features) +│ │ ├── test_construction.py +│ │ ├── test_query.py +│ │ ├── test_insert.py +│ │ ├── test_erase.py +│ │ └── ... +│ ├── integration/ # Integration tests (workflows) +│ │ ├── test_insert_query_workflow.py +│ │ ├── test_persistence_query_workflow.py +│ │ └── ... 
+│ ├── e2e/ # End-to-end tests +│ │ ├── test_readme_examples.py +│ │ └── test_user_workflows.py +│ └── conftest.py # Shared test fixtures +│ +├── benchmarks/ # Performance Benchmarks +│ ├── cpp/ # C++ benchmarks +│ │ ├── benchmark_construction.cpp +│ │ ├── benchmark_query.cpp +│ │ ├── benchmark_parallel.cpp +│ │ └── stress_test_concurrent.cpp +│ └── python/ # Python benchmarks (future) +│ └── README.md +│ +├── docs/ # Documentation +│ ├── examples/ # Example notebooks and scripts +│ │ └── experiment.ipynb +│ ├── images/ # Documentation images +│ └── baseline/ # Benchmark baseline data +│ +├── tools/ # Development Tools +│ ├── analyze_baseline.py # Benchmark analysis +│ ├── profile.py # Profiling script +│ ├── profile.sh # Profiling shell script +│ └── profile_all_workloads.sh +│ +└── third/ # Third-party Dependencies (git submodules) + ├── pybind11/ # Python bindings framework + ├── cereal/ # Serialization library + └── snappy/ # Compression library +``` + +## Architectural Layers + +### 1. Core C++ Layer (`include/prtree/core/`) + +**Purpose**: Implements the Priority R-Tree algorithm + +**Key Components**: +- `prtree.h`: Main template class `PRTree` + - `T`: Index type (typically `int64_t`) + - `B`: Branching factor (default: 8) + - `D`: Dimensions (2, 3, or 4) + +**Design Principles**: +- Header-only template library for performance +- No Python dependencies at this layer +- Pure C++ with C++20 features + +### 2. Utilities Layer (`include/prtree/utils/`) + +**Purpose**: Supporting data structures and algorithms + +**Components**: +- `parallel.h`: Thread-safe parallel processing utilities +- `small_vector.h`: Cache-friendly vector with small size optimization + +**Design Principles**: +- Reusable utilities independent of PRTree +- Optimized for performance (SSE, cache-locality) + +### 3. 
Python Bindings Layer (`src/cpp/bindings/`) + +**Purpose**: Expose C++ functionality to Python using pybind11 + +**Key File**: `python_bindings.cc` + +**Responsibilities**: +- Create Python classes from C++ templates +- Handle numpy array conversions +- Expose methods with Python-friendly signatures +- Provide module-level documentation + +**Design Principles**: +- Thin binding layer (minimal logic) +- Direct mapping to C++ API +- Efficient numpy integration + +### 4. Python Wrapper Layer (`src/python_prtree/`) + +**Purpose**: User-friendly Python API with safety features + +**Key Files**: +- `__init__.py`: Package entry point and version info +- `core.py`: Main user-facing classes (`PRTree2D`, `PRTree3D`, `PRTree4D`) + +**Added Features**: +- Empty tree safety (prevent segfaults) +- Python object storage (pickle serialization) +- Convenient APIs (auto-indexing, return_obj parameter) +- Type hints and documentation + +**Design Principles**: +- Safety over raw performance +- Pythonic API design +- Backwards compatibility considerations + +## Data Flow + +### Construction +``` +User Code + ↓ (numpy arrays) +PRTree2D/3D/4D (Python) + ↓ (arrays + validation) +_PRTree2D/3D/4D (pybind11) + ↓ (type conversion) +PRTree (C++) + ↓ (algorithm) +Optimized R-Tree Structure +``` + +### Query +``` +User Code + ↓ (query box) +PRTree2D.query() (Python) + ↓ (empty tree check) +_PRTree2D.query() (pybind11) + ↓ (type conversion) +PRTree::find_one() (C++) + ↓ (tree traversal) +Result Indices + ↓ (optional: object retrieval) +User Code +``` + +## Separation of Concerns + +### By Functionality + +1. **Core Algorithm** (`include/prtree/core/`) + - Spatial indexing logic + - Tree construction and traversal + - No I/O, no Python + +2. **Utilities** (`include/prtree/utils/`) + - Generic helpers + - Reusable across projects + +3. **Bindings** (`src/cpp/bindings/`) + - Python/C++ bridge + - Type conversions only + +4. 
**Python API** (`src/python_prtree/`) + - User interface + - Safety and convenience + +### By Testing + +1. **Unit Tests** (`tests/unit/`) + - Test individual features in isolation + - Fast, focused tests + - Examples: `test_insert.py`, `test_query.py` + +2. **Integration Tests** (`tests/integration/`) + - Test feature interactions + - Workflow-based tests + - Examples: `test_insert_query_workflow.py` + +3. **E2E Tests** (`tests/e2e/`) + - Test complete user scenarios + - Documentation examples + - Examples: `test_readme_examples.py` + +## Build System + +### CMake Configuration + +**Key Variables**: +- `PRTREE_SOURCES`: Source files to compile +- `PRTREE_INCLUDE_DIRS`: Header search paths + +**Targets**: +- `PRTree`: Main Python extension module +- `benchmark_*`: C++ benchmark executables (optional) + +**Options**: +- `BUILD_BENCHMARKS`: Enable benchmark compilation +- `ENABLE_PROFILING`: Build with profiling symbols +- `ENABLE_ASAN/TSAN/UBSAN`: Enable sanitizers + +### Build Process + +``` +User runs: pip install -e . + ↓ +setup.py invoked + ↓ +CMakeBuild.build_extension() + ↓ +CMake configuration + - Find dependencies (pybind11, cereal, snappy) + - Set compiler flags + - Configure include paths + ↓ +CMake build + - Compile C++ to shared library (.so/.pyd) + - Link dependencies + ↓ +Extension installed in src/python_prtree/ +``` + +## Design Decisions + +### Header-Only Core + +**Decision**: Keep core PRTree as header-only template library + +**Rationale**: +- Enables full compiler optimization +- Simplifies distribution +- No need for .cc files at core layer + +**Trade-offs**: +- Longer compilation times +- Larger binary size + +### Separate Bindings File + +**Decision**: Single `python_bindings.cc` file separate from core + +**Rationale**: +- Clear separation: core C++ vs. 
Python interface +- Core can be reused in C++-only projects +- Easier to maintain Python API changes + +### Python Wrapper Layer + +**Decision**: Add Python wrapper on top of pybind11 bindings + +**Rationale**: +- Safety: prevent segfaults on empty trees +- Convenience: Pythonic APIs, object storage +- Evolution: can change API without C++ recompilation + +**Trade-offs**: +- Extra layer adds slight overhead +- More code to maintain + +### Test Organization + +**Decision**: Three-tier test structure (unit/integration/e2e) + +**Rationale**: +- Fast feedback loop with unit tests +- Comprehensive coverage with integration tests +- Real-world validation with e2e tests +- Easy to run subsets: `pytest tests/unit -v` + +## Future Improvements + +1. **Split prtree.h**: Large monolithic header could be split into: + - `prtree_fwd.h`: Forward declarations + - `prtree_node.h`: Node implementation + - `prtree_query.h`: Query algorithms + - `prtree_insert.h`: Insert/erase logic + +2. **C++ Core Library**: Extract core into `src/cpp/core/` for: + - Faster compilation + - Better code organization + - Easier testing of C++ layer independently + +3. **Python Benchmarks**: Add `benchmarks/python/` for: + - Performance regression testing + - Comparison with other Python libraries + - Memory profiling + +4. **Documentation**: Add `docs/api/` with: + - Sphinx-generated API docs + - Architecture diagrams + - Performance tuning guide + +## Contributing + +When adding new features, follow the separation of concerns: + +1. **Core algorithm changes**: Modify `include/prtree/core/prtree.h` +2. **Expose to Python**: Update `src/cpp/bindings/python_bindings.cc` +3. **Python API enhancements**: Update `src/python_prtree/core.py` +4. **Add tests**: Unit tests for features, integration tests for workflows + +See [DEVELOPMENT.md](DEVELOPMENT.md) for detailed contribution guidelines. 
+ +## References + +- **Priority R-Tree Paper**: Arge et al., SIGMOD 2004 +- **pybind11**: https://pybind11.readthedocs.io/ +- **Python Packaging**: PEP 517, PEP 518, PEP 621 diff --git a/docs/DEVELOPMENT.md b/docs/DEVELOPMENT.md new file mode 100644 index 00000000..f1b5c64f --- /dev/null +++ b/docs/DEVELOPMENT.md @@ -0,0 +1,359 @@ +# Development Guide + +Welcome to the python_prtree development guide! This document will help you get started with contributing to the project. + +## Project Structure + +``` +python_prtree/ +├── include/ # C++ public headers +│ └── prtree/ +│ ├── core/ # Core algorithm +│ └── utils/ # Utilities +├── src/ # Source code +│ ├── cpp/ # C++ implementation +│ │ └── bindings/ # Python bindings +│ └── python_prtree/ # Python package +├── tests/ # Test suite +│ ├── unit/ # Unit tests +│ ├── integration/ # Integration tests +│ └── e2e/ # End-to-end tests +├── benchmarks/ # Performance benchmarks +│ ├── cpp/ # C++ benchmarks +│ └── python/ # Python benchmarks +├── docs/ # Documentation +│ ├── examples/ # Example code +│ ├── images/ # Images +│ └── baseline/ # Benchmark data +├── tools/ # Development tools +├── .github/workflows/ # CI/CD configuration +└── third/ # Third-party dependencies (git submodules) +``` + +For a detailed explanation of the architecture, see [ARCHITECTURE.md](ARCHITECTURE.md). + +## Prerequisites + +- Python 3.8 or higher +- CMake 3.22 or higher +- C++20 compatible compiler +- Git (for submodules) + +### Platform-Specific Requirements + +**macOS:** +```bash +brew install cmake +``` + +**Ubuntu/Debian:** +```bash +sudo apt-get install cmake build-essential +``` + +**Windows:** +- Visual Studio 2019 or later with C++ development tools +- CMake (can be installed via Visual Studio installer or from cmake.org) + +## Getting Started + +### 1. Clone the Repository + +```bash +git clone https://github.com/atksh/python_prtree.git +cd python_prtree +``` + +### 2.
Initialize Submodules + +The project uses git submodules for third-party dependencies: + +```bash +git submodule update --init --recursive +``` + +Or use the Makefile: + +```bash +make init +``` + +### 3. Set Up Development Environment + +#### Using pip (recommended) + +```bash +# Install in development mode with all dependencies +pip install -e ".[dev,docs,benchmark]" +``` + +#### Using make + +```bash +# Initialize submodules and install dependencies +make dev +``` + +This will: +- Initialize git submodules +- Install the package in editable mode +- Install all development dependencies + +### 4. Build the C++ Extension + +```bash +# Build in debug mode (default) +make build + +# Or build in release mode +make build-release +``` + +## Development Workflow + +### Running Tests + +```bash +# Run all tests +make test + +# Run tests in parallel (faster) +make test-fast + +# Run tests with coverage report +make test-coverage + +# Run specific test +make test-one TEST=test_insert +``` + +Or use pytest directly: + +```bash +pytest tests -v +pytest tests/unit/test_insert.py -v +pytest tests -k "test_insert" -v +``` + +### Code Quality + +#### Format Code + +```bash +# Format both Python and C++ code +make format + +# Format only Python (uses black) +python -m black src/ tests/ + +# Format only C++ (uses clang-format) +find include src/cpp -name '*.h' -o -name '*.cc' | xargs clang-format -i +``` + +#### Lint Code + +```bash +# Lint all code +make lint + +# Lint only Python (uses ruff) +make lint-python + +# Lint only C++ (uses clang-tidy) +make lint-cpp + +# Type check Python code (uses mypy) +make type-check +``` + +### Building Documentation + +```bash +make docs +``` + +### Cleaning Build Artifacts + +```bash +# Remove build artifacts +make clean + +# Clean everything including submodules +make clean-all +``` + +## Project Configuration + +All project metadata and dependencies are defined in `pyproject.toml`: + +- **Project metadata**: name, version, description, authors +- **Dependencies**: runtime and
development dependencies +- **Build system**: setuptools with CMake integration +- **Tool configurations**: pytest, black, ruff, mypy, coverage + +## Testing Guidelines + +### Test Organization + +- `tests/unit/`: Unit tests for individual components +- `tests/integration/`: Tests for component interactions +- `tests/e2e/`: End-to-end workflow tests +- `tests/legacy/`: Legacy test suite + +### Writing Tests + +```python +from python_prtree import PRTree2D + +# Boxes are [xmin, ymin, xmax, ymax]; the first argument lists their indices. + +def test_query_overlapping_box(): + """A query box that overlaps stored rectangles returns results.""" + tree = PRTree2D([1, 2], [[0, 0, 1, 1], [2, 2, 3, 3]]) + results = tree.query([0.5, 0.5, 2.5, 2.5]) + assert len(results) > 0 + +def test_query_disjoint_box(): + """A query box far from every stored rectangle returns nothing.""" + tree = PRTree2D([1, 2], [[0, 0, 1, 1], [2, 2, 3, 3]]) + results = tree.query([10, 10, 11, 11]) + assert len(results) == 0 +``` + +### Running Specific Test Categories + +```bash +# Run only unit tests +pytest tests/unit -v + +# Run only integration tests +pytest tests/integration -v + +# Run only e2e tests +pytest tests/e2e -v +``` + +## C++ Development + +### Building with Debug Symbols + +```bash +make debug-build +``` + +### Profiling + +```bash +# Run profiling scripts +./tools/profile.sh +python tools/profile.py +``` + +### Benchmarks + +```bash +# Run benchmarks (if available) +make benchmark +``` + +## Continuous Integration + +The project uses GitHub Actions for CI/CD: + +- **Pull Requests**: Runs unit tests on multiple platforms (Linux, macOS, Windows) and Python versions (3.8-3.14) +- **Main Branch**: Builds wheels for all platforms and Python versions +- **Version Tags**: Publishes packages to PyPI + +## Making Changes + +### Workflow + +1. Create a new branch: + ```bash + git checkout -b feature/my-feature + ``` + +2. Make your changes and write tests + +3. Run tests and linting: + ```bash + make test + make lint + ``` + +4. Commit your changes: + ```bash + git add . + git commit -m "feat: description" + ``` + +5.
Push and create a pull request: + ```bash + git push origin feature/my-feature + ``` + +### Code Style + +- **Python**: Follow PEP 8, use black for formatting (100 char line length) +- **C++**: Follow Google C++ Style Guide, use clang-format +- **Commits**: Use conventional commit messages + - `feat:` for new features + - `fix:` for bug fixes + - `docs:` for documentation + - `test:` for test changes + - `refactor:` for refactoring + - `chore:` for maintenance tasks + +## Troubleshooting + +### Submodules Not Initialized + +```bash +git submodule update --init --recursive +``` + +### Build Fails + +1. Ensure CMake is installed and up to date +2. Check that all submodules are initialized +3. Try cleaning and rebuilding: + ```bash + make clean + make build + ``` + +### Tests Fail + +1. Ensure the extension is built: + ```bash + make build + ``` + +2. Check that all dependencies are installed: + ```bash + pip install -e ".[dev]" + ``` + +### Import Errors + +Ensure you've installed the package in development mode: +```bash +pip install -e . +``` + +## Additional Resources + +- [CONTRIBUTING.md](CONTRIBUTING.md) - Contribution guidelines +- [README.md](README.md) - Project overview +- [CHANGES.md](CHANGES.md) - Version history +- [GitHub Issues](https://github.com/atksh/python_prtree/issues) - Bug reports and feature requests + +## Questions? + +If you have questions or need help, please: + +1. Check existing [GitHub Issues](https://github.com/atksh/python_prtree/issues) +2. Open a new issue with your question +3. See [CONTRIBUTING.md](CONTRIBUTING.md) for more details + +Happy coding! 🎉 diff --git a/docs/MIGRATION.md b/docs/MIGRATION.md new file mode 100644 index 00000000..66259ecf --- /dev/null +++ b/docs/MIGRATION.md @@ -0,0 +1,196 @@ +# Migration Guide + +This document helps users migrate between major versions and structural changes. 
+ +## v0.7.0 Project Restructuring + +### Overview + +Version 0.7.0 introduces a major project restructuring with clear separation of concerns. **The Python API remains 100% backwards compatible** - no code changes are needed. + +### What Changed + +#### For End Users (Python API) + +**No action required!** All existing code continues to work: + +```python +from python_prtree import PRTree2D + +# All existing code works exactly the same +tree = PRTree2D([1, 2], [[0, 0, 1, 1], [2, 2, 3, 3]]) +results = tree.query([0.5, 0.5, 2.5, 2.5]) +``` + +#### For Contributors (Project Structure) + +If you've been developing on the codebase, note these changes: + +**Directory Structure Changes:** + +``` +Old Structure → New Structure +━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ +cpp/ → include/prtree/core/ + ├── prtree.h → └── prtree.h + ├── parallel.h → include/prtree/utils/parallel.h + ├── small_vector.h → include/prtree/utils/small_vector.h + └── main.cc → src/cpp/bindings/python_bindings.cc + +src/python_prtree/ → src/python_prtree/ + └── __init__.py → ├── __init__.py (simplified) + → ├── core.py (new, main classes) + → └── py.typed (new, type hints) + +benchmarks/ → benchmarks/ + └── *.cpp → ├── cpp/ (C++ benchmarks) + → └── python/ (future) + +docs/ → docs/ + ├── experiment.ipynb → ├── examples/experiment.ipynb + ├── images/ → ├── images/ + └── baseline/ → └── baseline/ + +scripts/ → tools/ (consolidated) +run_*.sh → tools/*.sh +``` + +**Build System:** + +- `requirements.txt` → removed (use `pyproject.toml`) +- `requirements-dev.txt` → removed (use `pip install -e ".[dev]"`) +- CMake paths updated to use `include/` and `src/cpp/` + +**Development Workflow:** + +```bash +# Old way +pip install -r requirements.txt +pip install -r requirements-dev.txt +pip install -e . + +# New way (single command) +pip install -e ".[dev]" +``` + +### Migration Steps for Contributors + +#### 1. 
Update Your Development Environment + +```bash +# Clean old build artifacts +make clean + +# Update dependencies +pip install -e ".[dev]" + +# Rebuild +make build +``` + +#### 2. Update Include Paths (if you have C++ code) + +```cpp +// Old includes +#include "prtree.h" +#include "parallel.h" + +// New includes +#include "prtree/core/prtree.h" +#include "prtree/utils/parallel.h" +``` + +#### 3. Update Git Submodules + +```bash +git submodule update --init --recursive +``` + +#### 4. Update Your Fork + +```bash +git pull upstream main +git push origin main +``` + +### Benefits of New Structure + +1. **Clear Separation**: C++ core, bindings, and Python API are clearly separated +2. **Better Documentation**: Each layer has its own README +3. **Modern Tooling**: Uses pyproject.toml, type hints, modern linters +4. **Easier Contribution**: Clear where to add code for different types of changes +5. **Future-Ready**: Structure supports future modularization and improvements + +### Troubleshooting + +#### Build Errors + +**Error**: `prtree.h: No such file or directory` + +**Solution**: Clean and rebuild: +```bash +make clean +git submodule update --init --recursive +make build +``` + +#### Import Errors + +**Error**: `ImportError: cannot import name 'PRTree2D'` + +**Solution**: Reinstall the package: +```bash +pip uninstall python-prtree +pip install -e ".[dev]" +``` + +#### Test Failures + +**Error**: Tests fail after upgrading + +**Solution**: Ensure you're on the latest version: +```bash +git pull +pip install -e ".[dev]" +make test +``` + +### Getting Help + +If you encounter issues during migration: + +1. Check existing [GitHub Issues](https://github.com/atksh/python_prtree/issues) +2. See [DEVELOPMENT.md](DEVELOPMENT.md) for setup instructions +3. See [ARCHITECTURE.md](ARCHITECTURE.md) for structure details +4. 
Open a new issue with: + - Your Python version + - Your OS + - Error messages + - Steps you've tried + +## Future Migrations + +### v0.8.0 (Planned): C++ Modularization + +The large `prtree.h` file (1617 lines) will be split into modules: + +``` +prtree.h → { + prtree/core/detail/types.h + prtree/core/detail/bounding_box.h + prtree/core/detail/nodes.h + prtree/core/detail/pseudo_tree.h + prtree/core/prtree.h (main interface) +} +``` + +**Impact**: None for Python users. C++ users will need to include the main header only. + +### v1.0.0 (Future): Stable API + +Version 1.0 will mark API stability: +- Semantic versioning strictly followed +- No breaking changes without major version bump +- Long-term support for stable API + +Stay tuned for updates! diff --git a/docs/baseline/BASELINE_SUMMARY.md b/docs/baseline/BASELINE_SUMMARY.md deleted file mode 100644 index 8a4a8483..00000000 --- a/docs/baseline/BASELINE_SUMMARY.md +++ /dev/null @@ -1,228 +0,0 @@ -# Phase 0 Baseline Performance Summary - -**Date**: [YYYY-MM-DD] -**System**: [CPU model, cores, cache sizes, RAM] -**Compiler**: [Version and flags] -**Build Configuration**: [Release/Debug, optimization level] - ---- - -## Executive Summary - -[2-3 paragraph overview of key findings. Example:] - -> Performance profiling reveals that PRTree construction is dominated by cache misses during the partitioning phase, accounting for approximately 40% of total execution time on large datasets. The primary bottleneck is the random memory access pattern in `PseudoPRTree::construct`, which exhibits a 15% L3 cache miss rate. -> -> Query operations show excellent cache locality for small queries but degrade significantly for large result sets due to pointer chasing through the tree structure. Branch prediction is generally effective (>95% accuracy) except during tree descent in skewed data distributions. 
-> -> Parallel construction scales well up to 8 threads but shows diminishing returns beyond that point due to memory bandwidth saturation and false sharing in shared metadata structures. - ---- - -## Performance Bottlenecks (Priority Order) - -### 1. [Bottleneck Name - e.g., "L3 Cache Misses in Tree Construction"] -- **Impact**: [% of total execution time] -- **Root Cause**: [Technical explanation] -- **Evidence**: [Metric - e.g., "15% L3 miss rate, 2.5M misses per 100K elements"] -- **Affected Workloads**: [List workloads] -- **Recommendation**: [Optimization strategy for Phase 7+] - -### 2. [Second Bottleneck] -[Same structure as above] - -### 3. [Third Bottleneck] -[Same structure as above] - -[Continue for top 5-7 bottlenecks] - ---- - -## Hardware Counter Summary - -### Construction Phase - -| Workload | Elements | Time (ms) | Cycles (M) | IPC | L1 Miss% | L3 Miss% | Branch Miss% | Memory BW (GB/s) | -|----------|----------|-----------|------------|-----|----------|----------|--------------|------------------| -| small_uniform | 10K | - | - | - | - | - | - | - | -| large_uniform | 1M | - | - | - | - | - | - | - | -| clustered | 500K | - | - | - | - | - | - | - | -| skewed | 1M | - | - | - | - | - | - | - | -| sequential | 100K | - | - | - | - | - | - | - | - -### Query Phase - -| Workload | Queries | Avg Time (μs) | Throughput (K/s) | L1 Miss% | L3 Miss% | Branch Miss% | -|----------|---------|---------------|------------------|----------|----------|--------------| -| small_uniform | 1K | - | - | - | - | - | -| large_uniform | 10K | - | - | - | - | - | -| clustered | 5K | - | - | - | - | - | -| skewed | 10K | - | - | - | - | - | -| sequential | 1K | - | - | - | - | - | - ---- - -## Hotspot Analysis - -### Construction Hotspots (by CPU Time) - -| Rank | Function | CPU Time% | L3 Misses% | Branch Misses% | Notes | -|------|----------|-----------|------------|----------------|-------| -| 1 | `PseudoPRTree::construct` | - | - | - | - | -| 2 | `std::nth_element` | 
- | - | - | - | -| 3 | `BB::expand` | - | - | - | - | -| ... | ... | ... | ... | ... | ... | - -### Query Hotspots (by CPU Time) - -| Rank | Function | CPU Time% | L3 Misses% | Branch Misses% | Notes | -|------|----------|-----------|------------|----------------|-------| -| 1 | `PRTree::find` | - | - | - | - | -| 2 | `BB::intersects` | - | - | - | - | -| 3 | `refine_candidates` | - | - | - | - | -| ... | ... | ... | ... | ... | ... | - ---- - -## Cache Hierarchy Behavior - -### Cache Hit Ratios - -| Cache Level | Construction Hit Rate | Query Hit Rate | Notes | -|-------------|----------------------|----------------|-------| -| L1 Data | - | - | - | -| L2 | - | - | - | -| L3 (LLC) | - | - | - | -| TLB | - | - | - | - -### Cache-Line Utilization -- **Average bytes used per cache line**: [X bytes / 64 bytes = Y%] -- **False sharing detected**: [Yes/No, details in c2c reports] -- **Cold miss ratio**: [%] -- **Capacity miss ratio**: [%] -- **Conflict miss ratio**: [%] - ---- - -## Data Structure Layout Analysis - -### Critical Structures (from `pahole`) - -#### `DataType` -``` -struct DataType { - int64_t first; /* 0 8 */ - struct BB<2> second; /* 8 32 */ - - /* size: 40, cachelines: 1, members: 2 */ - /* sum members: 40, holes: 0, sum holes: 0 */ - /* padding: 24 */ - /* last cacheline: 40 bytes */ -}; -``` -**Analysis**: [Padding waste, alignment issues, potential improvements] - -#### [Other hot structures] -[Similar breakdown] - ---- - -## Thread Scaling Analysis - -### Parallel Construction Speedup - -| Threads | Time (ms) | Speedup | Efficiency | Scaling Bottleneck | -|---------|-----------|---------|------------|-------------------| -| 1 | - | 1.0x | 100% | Baseline | -| 2 | - | - | - | - | -| 4 | - | - | - | - | -| 8 | - | - | - | - | -| 16 | - | - | - | - | - -**Observations**: -- [Linear scaling up to X threads] -- [Memory bandwidth saturation at Y threads] -- [False sharing impact: Z%] - ---- - -## NUMA Effects (if applicable) - -### Memory Allocation 
Patterns -- **Local memory access**: [%] -- **Remote memory access**: [%] -- **Inter-node traffic**: [GB during construction] - -### NUMA-Aware Recommendations -[Suggestions for Phase 7 if NUMA effects are significant] - ---- - -## Memory Usage - -| Workload | Elements | Tree Size (MB) | Peak RSS (MB) | Overhead% | Bytes/Element | -|----------|----------|----------------|---------------|-----------|---------------| -| small_uniform | 10K | - | - | - | - | -| large_uniform | 1M | - | - | - | - | -| clustered | 500K | - | - | - | - | -| skewed | 1M | - | - | - | - | -| sequential | 100K | - | - | - | - | - ---- - -## Optimization Priorities for Subsequent Phases - -Based on the profiling data, we recommend the following optimization priorities: - -### High Priority (Phase 7 - Data Layout) -1. **[Optimization 1]**: [Expected impact X%, feasibility Y] -2. **[Optimization 2]**: [Expected impact X%, feasibility Y] -3. **[Optimization 3]**: [Expected impact X%, feasibility Y] - -### Medium Priority (Phase 8+) -1. **[Optimization 4]**: [Details] -2. **[Optimization 5]**: [Details] - -### Low Priority (Future) -1. **[Optimization 6]**: [Details] - ---- - -## Regression Detection - -All baseline metrics have been committed to `docs/baseline/reports/` for future comparison. 
The CI system will automatically compare future benchmarks against this baseline and fail if: -- Construction time regresses >5% -- Query time regresses >5% -- Cache miss rate increases >10% -- Memory usage increases >20% - -**Baseline Git Commit**: [commit SHA] - ---- - -## Approvals - -- **Engineer**: [Name, Date] -- **Tech Lead**: [Name, Date] -- **Architect**: [Name, Date] - ---- - -## References - -- Raw `perf stat` outputs: `docs/baseline/reports/perf_*.txt` -- Flamegraphs: `docs/baseline/flamegraphs/*.svg` -- Cachegrind reports: `docs/baseline/reports/cache_*.txt` -- C2C reports: `docs/baseline/reports/c2c_*.txt` -- Profiling scripts: `scripts/profile_*.sh` - ---- - -## Next Steps - -Upon approval of this baseline: -1. Proceed to **Phase 1**: Critical bugs + TSan infrastructure -2. Re-run benchmarks after Phase 1 to detect any regressions -3. Use this baseline for all future performance comparisons - -**Phase 0 Status**: [COMPLETE / IN PROGRESS / BLOCKED] diff --git a/docs/baseline/README.md b/docs/baseline/README.md deleted file mode 100644 index 820280e1..00000000 --- a/docs/baseline/README.md +++ /dev/null @@ -1,183 +0,0 @@ -# Phase 0: Microarchitectural Baseline Profiling - -This directory contains the baseline performance characteristics of PRTree before any optimizations are applied. All measurements must be completed and documented before proceeding with Phase 1. - -## 🔴 CRITICAL: Go/No-Go Gate - -**Phase 0 is complete ONLY when:** -- ✅ All artifacts generated for all workloads -- ✅ Baseline summary memo reviewed and approved -- ✅ Raw data committed to repository (for regression detection) -- ✅ Automated benchmark suite integrated into CI -- ✅ Performance regression detection scripts validated - -**If metrics cannot be collected: STOP. 
Fix tooling before proceeding.** - -## Directory Structure - -``` -baseline/ -├── README.md # This file -├── BASELINE_SUMMARY.md # Executive summary (REQUIRED) -├── perf_counters.md # Hardware counter baselines -├── hotspots.md # Top performance bottlenecks -├── layout_analysis.md # Data structure memory layout -├── numa_analysis.md # NUMA behavior (if applicable) -├── flamegraphs/ # Flamegraph visualizations -│ ├── construction_small.svg -│ ├── construction_large.svg -│ ├── construction_clustered.svg -│ ├── query_small.svg -│ ├── query_large.svg -│ └── batch_query_parallel.svg -└── reports/ # Raw profiling data - ├── construction_*.txt # Call-graph reports - ├── cache_*.txt # Cachegrind reports - └── c2c_*.txt # Cache-to-cache transfer reports -``` - -## Required Tooling - -### Linux Tools (Mandatory) -```bash -# Hardware performance counters -sudo apt-get install linux-tools-generic linux-tools-$(uname -r) - -# Cache topology -sudo apt-get install hwloc lstopo - -# Valgrind with Cachegrind -sudo apt-get install valgrind - -# FlameGraph generator -git clone https://github.com/brendangregg/FlameGraph.git -``` - -### macOS Tools -```bash -# Instruments (part of Xcode) -xcode-select --install - -# Homebrew tools -brew install hwloc valgrind -``` - -## Standard Workloads - -All benchmarks must be run with these representative workloads: - -1. **small_uniform**: 10,000 elements, uniform distribution, 1,000 small queries -2. **large_uniform**: 1,000,000 elements, uniform distribution, 10,000 medium queries -3. **clustered**: 500,000 elements, clustered distribution (10 clusters), 5,000 mixed queries -4. **skewed**: 1,000,000 elements, Zipfian distribution, 10,000 large queries -5. 
**sequential**: 100,000 elements, sequential data, 1,000 small queries - -## Metrics to Collect - -### Construction Phase -For each workload, collect: -- **Performance Counters**: cycles, instructions, IPC, cache misses (L1/L2/L3), TLB misses, branch misses -- **Call Graph**: Hotspot functions with CPU time percentages -- **Cache Behavior**: Cachegrind annotations showing cache line utilization -- **Memory Usage**: Peak RSS, allocations - -### Query Phase -Same metrics as construction phase, plus: -- **Query throughput**: Queries per second -- **Latency distribution**: P50, P95, P99 - -### Multithreaded Construction -For parallel construction, collect: -- **Thread scaling**: 1, 2, 4, 8, 16 threads -- **NUMA effects**: Local vs remote memory access -- **Cache-to-cache transfers**: False sharing detection -- **Parallel speedup**: Actual vs theoretical - -## How to Run Profiling - -### Step 1: Build with Profiling Symbols -```bash -mkdir -p build_profile -cd build_profile -cmake -DBUILD_BENCHMARKS=ON -DENABLE_PROFILING=ON .. -make -j$(nproc) -``` - -### Step 2: Run Benchmarks and Collect Metrics -```bash -# From repository root -./scripts/profile_all_workloads.sh -``` - -This will: -1. Run each benchmark with `perf stat` for hardware counters -2. Run with `perf record` for flamegraphs -3. Run with `valgrind --tool=cachegrind` for cache analysis -4. Generate reports in `docs/baseline/reports/` -5. 
Generate flamegraphs in `docs/baseline/flamegraphs/` - -### Step 3: Analyze and Document -```bash -# Generate summary analysis -./scripts/analyze_baseline.py -``` - -This creates: -- `perf_counters.md` - Tabulated counter results -- `hotspots.md` - Top 10 functions by various metrics -- `BASELINE_SUMMARY.md` - Executive summary with recommendations - -## Validation Checklist - -Before considering Phase 0 complete, verify: - -- [ ] All 5 workloads profiled successfully -- [ ] Hardware counters collected for all workloads -- [ ] Flamegraphs generated and readable -- [ ] Cachegrind reports show detailed cache line info -- [ ] Hotspot analysis identifies top bottlenecks -- [ ] Data structure layout documented with `pahole` -- [ ] Thread scaling measured (if applicable) -- [ ] NUMA analysis complete (if multi-socket system) -- [ ] Baseline summary memo written and reviewed -- [ ] All raw data committed to git -- [ ] CI integration tested and passing - -## Expected Timeline - -- **Tooling setup**: 2 hours -- **Benchmark implementation**: 4 hours -- **Data collection**: 2 hours (automated) -- **Analysis and documentation**: 4 hours -- **Review and approval**: 2 hours - -**Total: 2-3 days** - -## Troubleshooting - -### "perf_event_open failed: Permission denied" -```bash -# Temporary (until reboot) -sudo sysctl -w kernel.perf_event_paranoid=-1 - -# Permanent -echo 'kernel.perf_event_paranoid = -1' | sudo tee -a /etc/sysctl.conf -``` - -### "Cannot find debug symbols" -Ensure you built with `-DENABLE_PROFILING=ON` which adds `-g` and `-fno-omit-frame-pointer`. 
- -### "Cachegrind too slow" -For large workloads, you can sample: -```bash -valgrind --tool=cachegrind --cachegrind-out-file=cache.out \ - --I1=32768,8,64 --D1=32768,8,64 --LL=8388608,16,64 \ - ./benchmark_construction large_uniform -``` - -## References - -- [perf documentation](https://perf.wiki.kernel.org/index.php/Tutorial) -- [Cachegrind manual](https://valgrind.org/docs/manual/cg-manual.html) -- [FlameGraph guide](https://www.brendangregg.com/flamegraphs.html) -- [Intel VTune tutorial](https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top.html) diff --git a/docs/experiment.ipynb b/docs/examples/experiment.ipynb similarity index 100% rename from docs/experiment.ipynb rename to docs/examples/experiment.ipynb diff --git a/include/prtree/core/detail/bounding_box.h b/include/prtree/core/detail/bounding_box.h new file mode 100644 index 00000000..836de6e3 --- /dev/null +++ b/include/prtree/core/detail/bounding_box.h @@ -0,0 +1,138 @@ +/** + * @file bounding_box.h + * @brief Axis-Aligned Bounding Box (AABB) implementation + * + * Provides the BB class for D-dimensional bounding boxes with + * geometric operations like intersection, union, and area calculation. 
+ */ +#pragma once + +#include +#include + +#include + +#include "prtree/core/detail/types.h" + +using Real = float; + +template class BB { +private: + Real values[2 * D]; + +public: + BB() { clear(); } + + BB(const Real (&minima)[D], const Real (&maxima)[D]) { + Real v[2 * D]; + for (int i = 0; i < D; ++i) { + v[i] = -minima[i]; + v[i + D] = maxima[i]; + } + validate(v); + for (int i = 0; i < D; ++i) { + values[i] = v[i]; + values[i + D] = v[i + D]; + } + } + + BB(const Real (&v)[2 * D]) { + validate(v); + for (int i = 0; i < D; ++i) { + values[i] = v[i]; + values[i + D] = v[i + D]; + } + } + + Real min(const int dim) const { + if (unlikely(dim < 0 || D <= dim)) { + throw std::runtime_error("Invalid dim"); + } + return -values[dim]; + } + Real max(const int dim) const { + if (unlikely(dim < 0 || D <= dim)) { + throw std::runtime_error("Invalid dim"); + } + return values[dim + D]; + } + + bool validate(const Real (&v)[2 * D]) const { + bool flag = false; + for (int i = 0; i < D; ++i) { + if (unlikely(-v[i] > v[i + D])) { + flag = true; + break; + } + } + if (unlikely(flag)) { + throw std::runtime_error("Invalid Bounding Box"); + } + return flag; + } + void clear() noexcept { + for (int i = 0; i < 2 * D; ++i) { + values[i] = -1e100; + } + } + + Real val_for_comp(const int &axis) const noexcept { + const int axis2 = (axis + 1) % (2 * D); + return values[axis] + values[axis2]; + } + + BB operator+(const BB &rhs) const { + Real result[2 * D]; + for (int i = 0; i < 2 * D; ++i) { + result[i] = std::max(values[i], rhs.values[i]); + } + return BB(result); + } + + BB operator+=(const BB &rhs) { + for (int i = 0; i < 2 * D; ++i) { + values[i] = std::max(values[i], rhs.values[i]); + } + return *this; + } + + void expand(const Real (&delta)[D]) noexcept { + for (int i = 0; i < D; ++i) { + values[i] += delta[i]; + values[i + D] += delta[i]; + } + } + + bool operator()( + const BB &target) const { // whether this and target has any intersect + + Real minima[D]; + Real maxima[D]; 
+ bool flags[D]; + bool flag = true; + + for (int i = 0; i < D; ++i) { + minima[i] = std::min(values[i], target.values[i]); + maxima[i] = std::min(values[i + D], target.values[i + D]); + } + for (int i = 0; i < D; ++i) { + flags[i] = -minima[i] <= maxima[i]; + } + for (int i = 0; i < D; ++i) { + flag &= flags[i]; + } + return flag; + } + + Real area() const { + Real result = 1; + for (int i = 0; i < D; ++i) { + result *= max(i) - min(i); + } + return result; + } + + inline Real operator[](const int i) const { return values[i]; } + + template void serialize(Archive &ar) { ar(values); } +}; diff --git a/include/prtree/core/detail/data_type.h b/include/prtree/core/detail/data_type.h new file mode 100644 index 00000000..02016442 --- /dev/null +++ b/include/prtree/core/detail/data_type.h @@ -0,0 +1,47 @@ +/** + * @file data_type.h + * @brief Data storage structures for PRTree + * + * Contains DataType class for storing index-bounding box pairs + * and related utility functions. + */ +#pragma once + +#include + +#include "prtree/core/detail/bounding_box.h" +#include "prtree/core/detail/types.h" + +// Phase 8: Apply C++20 concept constraints +template class DataType { +public: + BB second; + T first; + + DataType() noexcept = default; + + DataType(const T &f, const BB &s) { + first = f; + second = s; + } + + DataType(T &&f, BB &&s) noexcept { + first = std::move(f); + second = std::move(s); + } + + void swap(DataType& other) noexcept { + using std::swap; + swap(first, other.first); + swap(second, other.second); + } + + template void serialize(Archive &ar) { ar(first, second); } +}; + +template +void clean_data(DataType *b, DataType *e) { + for (DataType *it = e - 1; it >= b; --it) { + it->~DataType(); + } +} diff --git a/include/prtree/core/detail/nodes.h b/include/prtree/core/detail/nodes.h new file mode 100644 index 00000000..46234cec --- /dev/null +++ b/include/prtree/core/detail/nodes.h @@ -0,0 +1,166 @@ +/** + * @file nodes.h + * @brief PRTree node implementations + 
* + * Contains PRTreeLeaf, PRTreeNode, PRTreeElement classes and utility + * functions for the actual PRTree structure. + */ +#pragma once + +#include +#include +#include + +#include "prtree/core/detail/bounding_box.h" +#include "prtree/core/detail/data_type.h" +#include "prtree/core/detail/pseudo_tree.h" +#include "prtree/core/detail/types.h" + +// Phase 8: Apply C++20 concept constraints +template class PRTreeLeaf { +public: + BB mbb; + svec, B> data; + + PRTreeLeaf() { mbb = BB(); } + + PRTreeLeaf(const Leaf &leaf) { + mbb = leaf.mbb; + data = leaf.data; + } + + Real area() const { return mbb.area(); } + + void update_mbb() { + mbb.clear(); + for (const auto &datum : data) { + mbb += datum.second; + } + } + + void operator()(const BB &target, vec &out) const { + if (mbb(target)) { + for (const auto &x : data) { + if (x.second(target)) { + out.emplace_back(x.first); + } + } + } + } + + void del(const T &key, const BB &target) { + if (mbb(target)) { + auto remove_it = + std::remove_if(data.begin(), data.end(), [&](auto &datum) { + return datum.second(target) && datum.first == key; + }); + data.erase(remove_it, data.end()); + } + } + + void push(const T &key, const BB &target) { + data.emplace_back(key, target); + update_mbb(); + } + + template void save(Archive &ar) const { + vec> _data; + for (const auto &datum : data) { + _data.push_back(datum); + } + ar(mbb, _data); + } + + template void load(Archive &ar) { + vec> _data; + ar(mbb, _data); + for (const auto &datum : _data) { + data.push_back(datum); + } + } +}; + +// Phase 8: Apply C++20 concept constraints +template class PRTreeNode { +public: + BB mbb; + std::unique_ptr> leaf; + std::unique_ptr> head, next; + + PRTreeNode() {} + PRTreeNode(const BB &_mbb) { mbb = _mbb; } + + PRTreeNode(BB &&_mbb) noexcept { mbb = std::move(_mbb); } + + PRTreeNode(Leaf *l) { + leaf = std::make_unique>(); + mbb = l->mbb; + leaf->mbb = std::move(l->mbb); + leaf->data = std::move(l->data); + } + + bool operator()(const BB &target) 
{ return mbb(target); } +}; + +// Phase 8: Apply C++20 concept constraints +template class PRTreeElement { +public: + BB mbb; + std::unique_ptr> leaf; + bool is_used = false; + + PRTreeElement() { + mbb = BB(); + is_used = false; + } + + PRTreeElement(const PRTreeNode &node) { + mbb = BB(node.mbb); + if (node.leaf) { + Leaf tmp_leaf = Leaf(*node.leaf.get()); + leaf = std::make_unique>(tmp_leaf); + } + is_used = true; + } + + bool operator()(const BB &target) { return is_used && mbb(target); } + + template void serialize(Archive &archive) { + archive(mbb, leaf, is_used); + } +}; + +// Phase 8: Apply C++20 concept constraints +template +void bfs( + const std::function> &)> &func, + vec> &flat_tree, const BB target) { + queue que; + auto qpush_if_intersect = [&](const size_t &i) { + PRTreeElement &r = flat_tree[i]; + // std::cout << "i " << (long int) i << " : " << (bool) r.leaf << std::endl; + if (r(target)) { + // std::cout << " is pushed" << std::endl; + que.emplace(i); + } + }; + + // std::cout << "size: " << flat_tree.size() << std::endl; + qpush_if_intersect(0); + while (!que.empty()) { + size_t idx = que.front(); + // std::cout << "idx: " << (long int) idx << std::endl; + que.pop(); + PRTreeElement &elem = flat_tree[idx]; + + if (elem.leaf) { + // std::cout << "func called for " << (long int) idx << std::endl; + func(elem.leaf); + } else { + for (size_t offset = 0; offset < B; offset++) { + size_t jdx = idx * B + offset + 1; + qpush_if_intersect(jdx); + } + } + } +} diff --git a/include/prtree/core/detail/pseudo_tree.h b/include/prtree/core/detail/pseudo_tree.h new file mode 100644 index 00000000..6652bd0a --- /dev/null +++ b/include/prtree/core/detail/pseudo_tree.h @@ -0,0 +1,225 @@ +/** + * @file pseudo_tree.h + * @brief Pseudo PRTree structures used during construction + * + * Contains Leaf, PseudoPRTreeNode, and PseudoPRTree classes that form + * the intermediate data structure during PRTree construction. 
+ */ +#pragma once + +#include +#include +#include +#include +#include +#include + +#include "prtree/core/detail/bounding_box.h" +#include "prtree/core/detail/data_type.h" +#include "prtree/core/detail/types.h" + +// Phase 8: Apply C++20 concept constraints +template class Leaf { +public: + BB mbb; + svec, B> data; // You can swap when filtering + int axis = 0; + + // T is type of keys(ids) which will be returned when you post a query. + Leaf() { mbb = BB(); } + Leaf(const int _axis) { + axis = _axis; + mbb = BB(); + } + + void set_axis(const int &_axis) { axis = _axis; } + + void push(const T &key, const BB &target) { + data.emplace_back(key, target); + update_mbb(); + } + + void update_mbb() { + mbb.clear(); + for (const auto &datum : data) { + mbb += datum.second; + } + } + + bool filter(DataType &value) { // false means given value is ignored + // Phase 2: C++20 requires explicit 'this' capture + auto comp = [this](const auto &a, const auto &b) noexcept { + return a.second.val_for_comp(axis) < b.second.val_for_comp(axis); + }; + + if (data.size() < B) { // if there is room, just push the candidate + auto iter = std::lower_bound(data.begin(), data.end(), value, comp); + DataType tmp_value = DataType(value); + data.insert(iter, std::move(tmp_value)); + mbb += value.second; + return true; + } else { // if there is no room, check the priority and swap if needed + if (data[0].second.val_for_comp(axis) < value.second.val_for_comp(axis)) { + size_t n_swap = + std::lower_bound(data.begin(), data.end(), value, comp) - + data.begin(); + std::swap(*data.begin(), value); + auto iter = data.begin(); + for (size_t i = 0; i < n_swap - 1; ++i) { + std::swap(*(iter + i), *(iter + i + 1)); + } + update_mbb(); + } + return false; + } + } +}; + +// Phase 8: Apply C++20 concept constraints +template class PseudoPRTreeNode { +public: + Leaf leaves[2 * D]; + std::unique_ptr left, right; + + PseudoPRTreeNode() { + for (int i = 0; i < 2 * D; i++) { + leaves[i].set_axis(i); + } + } + 
PseudoPRTreeNode(const int axis) { + for (int i = 0; i < 2 * D; i++) { + const int j = (axis + i) % (2 * D); + leaves[i].set_axis(j); + } + } + + template void serialize(Archive &archive) { + // archive(cereal::(left), cereal::defer(right), leaves); + archive(left, right, leaves); + } + + void address_of_leaves(vec *> &out) { + for (auto &leaf : leaves) { + if (leaf.data.size() > 0) { + out.emplace_back(&leaf); + } + } + } + + template auto filter(const iterator &b, const iterator &e) { + auto out = std::remove_if(b, e, [&](auto &x) { + for (auto &l : leaves) { + if (l.filter(x)) { + return true; + } + } + return false; + }); + return out; + } +}; + +// Phase 8: Apply C++20 concept constraints +template class PseudoPRTree { +public: + std::unique_ptr> root; + vec *> cache_children; + const int nthreads = std::max(1, (int)std::thread::hardware_concurrency()); + + PseudoPRTree() { root = std::make_unique>(); } + + template PseudoPRTree(const iterator &b, const iterator &e) { + if (!root) { + root = std::make_unique>(); + } + construct(root.get(), b, e, 0); + clean_data(b, e); + } + + template void serialize(Archive &archive) { + archive(root); + // archive.serializeDeferments(); + } + + template + void construct(PseudoPRTreeNode *node, const iterator &b, + const iterator &e, const int depth) { + if (e - b > 0 && node != nullptr) { + bool use_recursive_threads = std::pow(2, depth + 1) <= nthreads; +#ifdef MY_DEBUG + use_recursive_threads = false; +#endif + + vec threads; + threads.reserve(2); + PseudoPRTreeNode *node_left, *node_right; + + const int axis = depth % (2 * D); + auto ee = node->filter(b, e); + auto m = b; + std::advance(m, (ee - b) / 2); + std::nth_element(b, m, ee, + [axis](const DataType &lhs, + const DataType &rhs) noexcept { + return lhs.second[axis] < rhs.second[axis]; + }); + + if (m - b > 0) { + node->left = std::make_unique>(axis); + node_left = node->left.get(); + if (use_recursive_threads) { + threads.push_back( + std::thread([&]() { 
construct(node_left, b, m, depth + 1); })); + } else { + construct(node_left, b, m, depth + 1); + } + } + if (ee - m > 0) { + node->right = std::make_unique>(axis); + node_right = node->right.get(); + if (use_recursive_threads) { + threads.push_back( + std::thread([&]() { construct(node_right, m, ee, depth + 1); })); + } else { + construct(node_right, m, ee, depth + 1); + } + } + std::for_each(threads.begin(), threads.end(), + [&](std::thread &x) { x.join(); }); + } + } + + auto get_all_leaves(const int hint) { + if (cache_children.empty()) { + using U = PseudoPRTreeNode; + cache_children.reserve(hint); + auto node = root.get(); + queue que; + que.emplace(node); + + while (!que.empty()) { + node = que.front(); + que.pop(); + node->address_of_leaves(cache_children); + if (node->left) + que.emplace(node->left.get()); + if (node->right) + que.emplace(node->right.get()); + } + } + return cache_children; + } + + std::pair *, DataType *> as_X(void *placement, + const int hint) { + DataType *b, *e; + auto children = get_all_leaves(hint); + T total = children.size(); + b = reinterpret_cast *>(placement); + e = b + total; + for (T i = 0; i < total; i++) { + new (b + i) DataType{i, children[i]->mbb}; + } + return {b, e}; + } +}; diff --git a/include/prtree/core/detail/types.h b/include/prtree/core/detail/types.h new file mode 100644 index 00000000..2eaab722 --- /dev/null +++ b/include/prtree/core/detail/types.h @@ -0,0 +1,123 @@ +/** + * @file types.h + * @brief Common types, concepts, and utility functions for PRTree + * + * This file contains: + * - Type aliases and concepts + * - Utility functions for Python/C++ interop + * - Common constants and macros + */ +#pragma once + +#include +#include +#include +#include + +#include +#include + +#include "prtree/utils/small_vector.h" + +namespace py = pybind11; + +// === Versioning === + +constexpr uint16_t PRTREE_VERSION_MAJOR = 1; +constexpr uint16_t PRTREE_VERSION_MINOR = 0; + +// === C++20 Concepts === + +template +concept 
IndexType = std::integral && !std::same_as; + +template +concept SignedIndexType = IndexType && std::is_signed_v; + +// === Type Aliases === + +template +using vec = std::vector; + +template +using svec = itlib::small_vector; + +template +using deque = std::deque; + +template +using queue = std::queue>; + +// === Constants === + +static const float REBUILD_THRE = 1.25; + +// === Branch Prediction Hints === + +#if defined(__GNUC__) || defined(__clang__) +#define likely(x) __builtin_expect(!!(x), 1) +#define unlikely(x) __builtin_expect(!!(x), 0) +#else +#define likely(x) (x) +#define unlikely(x) (x) +#endif + +// === Python Interop Utilities === + +/** + * @brief Convert a C++ sequence to a numpy array with zero-copy + * + * Transfers ownership of the sequence data to Python. + */ +template +inline py::array_t as_pyarray(Sequence &seq) { + auto size = seq.size(); + auto data = seq.data(); + std::unique_ptr seq_ptr = + std::make_unique(std::move(seq)); + auto capsule = py::capsule(seq_ptr.get(), [](void *p) { + std::unique_ptr(reinterpret_cast(p)); + }); + seq_ptr.release(); + return py::array(size, data, capsule); +} + +/** + * @brief Convert nested vector to tuple of numpy arrays + * + * Returns (sizes, flattened_data) where sizes[i] is the length of out_ll[i] + * and flattened_data contains all elements concatenated. 
+ */ +template +auto list_list_to_arrays(vec> out_ll) { + vec out_s; + out_s.reserve(out_ll.size()); + std::size_t sum = 0; + for (auto &&i : out_ll) { + out_s.push_back(i.size()); + sum += i.size(); + } + vec out; + out.reserve(sum); + for (const auto &v : out_ll) + out.insert(out.end(), v.begin(), v.end()); + + return make_tuple(std::move(as_pyarray(out_s)), std::move(as_pyarray(out))); +} + +// === Compression Utilities === + +#include +#include + +inline std::string compress(std::string &data) { + std::string output; + snappy::Compress(data.data(), data.size(), &output); + return output; +} + +inline std::string decompress(std::string &data) { + std::string output; + snappy::Uncompress(data.data(), data.size(), &output); + return output; +} diff --git a/cpp/prtree.h b/include/prtree/core/prtree.h similarity index 65% rename from cpp/prtree.h rename to include/prtree/core/prtree.h index 18979ff1..41624ef6 100644 --- a/cpp/prtree.h +++ b/include/prtree/core/prtree.h @@ -1,4 +1,6 @@ #pragma once + +// Standard Library Includes #include #include #include @@ -22,9 +24,11 @@ #include #include #include -// Phase 8: C++20 features + +// C++20 features #include +// External Dependencies #include #include #include @@ -34,611 +38,31 @@ #include #include #include -#include //for smart pointers +#include #include #include #include -#include "parallel.h" -#include "small_vector.h" #include +// PRTree Modular Components +#include "prtree/core/detail/types.h" +#include "prtree/core/detail/bounding_box.h" +#include "prtree/core/detail/data_type.h" +#include "prtree/core/detail/pseudo_tree.h" +#include "prtree/core/detail/nodes.h" + +#include "prtree/utils/parallel.h" +#include "prtree/utils/small_vector.h" + #ifdef MY_DEBUG #include #endif using Real = float; -// Phase 4: Versioning for serialization -constexpr uint16_t PRTREE_VERSION_MAJOR = 1; -constexpr uint16_t PRTREE_VERSION_MINOR = 0; - namespace py = pybind11; -// Phase 8: C++20 Concepts for type safety -template 
-concept IndexType = std::integral && !std::same_as; - -template -concept SignedIndexType = IndexType && std::is_signed_v; - -template using vec = std::vector; - -template -inline py::array_t as_pyarray(Sequence &seq) { - - auto size = seq.size(); - auto data = seq.data(); - std::unique_ptr seq_ptr = - std::make_unique(std::move(seq)); - auto capsule = py::capsule(seq_ptr.get(), [](void *p) { - std::unique_ptr(reinterpret_cast(p)); - }); - seq_ptr.release(); - return py::array(size, data, capsule); -} - -template auto list_list_to_arrays(vec> out_ll) { - vec out_s; - out_s.reserve(out_ll.size()); - std::size_t sum = 0; - for (auto &&i : out_ll) { - out_s.push_back(i.size()); - sum += i.size(); - } - vec out; - out.reserve(sum); - for (const auto &v : out_ll) - out.insert(out.end(), v.begin(), v.end()); - - return make_tuple(std::move(as_pyarray(out_s)), std::move(as_pyarray(out))); -} - -template -using svec = itlib::small_vector; - -template using deque = std::deque; - -template using queue = std::queue>; - -static const float REBUILD_THRE = 1.25; - -// Phase 8: Branch prediction hints -// Note: C++20 provides [[likely]] and [[unlikely]] attributes, but we keep -// these macros for backward compatibility and cleaner syntax in conditions. 
-// Future refactoring could replace: if (unlikely(x)) with if (x) [[unlikely]] -#if defined(__GNUC__) || defined(__clang__) -#define likely(x) __builtin_expect(!!(x), 1) -#define unlikely(x) __builtin_expect(!!(x), 0) -#else -#define likely(x) (x) -#define unlikely(x) (x) -#endif - -std::string compress(std::string &data) { - std::string output; - snappy::Compress(data.data(), data.size(), &output); - return output; -} - -std::string decompress(std::string &data) { - std::string output; - snappy::Uncompress(data.data(), data.size(), &output); - return output; -} - -template class BB { -private: - Real values[2 * D]; - -public: - BB() { clear(); } - - BB(const Real (&minima)[D], const Real (&maxima)[D]) { - Real v[2 * D]; - for (int i = 0; i < D; ++i) { - v[i] = -minima[i]; - v[i + D] = maxima[i]; - } - validate(v); - for (int i = 0; i < D; ++i) { - values[i] = v[i]; - values[i + D] = v[i + D]; - } - } - - BB(const Real (&v)[2 * D]) { - validate(v); - for (int i = 0; i < D; ++i) { - values[i] = v[i]; - values[i + D] = v[i + D]; - } - } - - Real min(const int dim) const { - if (unlikely(dim < 0 || D <= dim)) { - throw std::runtime_error("Invalid dim"); - } - return -values[dim]; - } - Real max(const int dim) const { - if (unlikely(dim < 0 || D <= dim)) { - throw std::runtime_error("Invalid dim"); - } - return values[dim + D]; - } - - bool validate(const Real (&v)[2 * D]) const { - bool flag = false; - for (int i = 0; i < D; ++i) { - if (unlikely(-v[i] > v[i + D])) { - flag = true; - break; - } - } - if (unlikely(flag)) { - throw std::runtime_error("Invalid Bounding Box"); - } - return flag; - } - void clear() noexcept { - for (int i = 0; i < 2 * D; ++i) { - values[i] = -1e100; - } - } - - Real val_for_comp(const int &axis) const noexcept { - const int axis2 = (axis + 1) % (2 * D); - return values[axis] + values[axis2]; - } - - BB operator+(const BB &rhs) const { - Real result[2 * D]; - for (int i = 0; i < 2 * D; ++i) { - result[i] = std::max(values[i], 
rhs.values[i]); - } - return BB(result); - } - - BB operator+=(const BB &rhs) { - for (int i = 0; i < 2 * D; ++i) { - values[i] = std::max(values[i], rhs.values[i]); - } - return *this; - } - - void expand(const Real (&delta)[D]) noexcept { - for (int i = 0; i < D; ++i) { - values[i] += delta[i]; - values[i + D] += delta[i]; - } - } - - bool operator()( - const BB &target) const { // whether this and target has any intersect - - Real minima[D]; - Real maxima[D]; - bool flags[D]; - bool flag = true; - - for (int i = 0; i < D; ++i) { - minima[i] = std::min(values[i], target.values[i]); - maxima[i] = std::min(values[i + D], target.values[i + D]); - } - for (int i = 0; i < D; ++i) { - flags[i] = -minima[i] <= maxima[i]; - } - for (int i = 0; i < D; ++i) { - flag &= flags[i]; - } - return flag; - } - - Real area() const { - Real result = 1; - for (int i = 0; i < D; ++i) { - result *= max(i) - min(i); - } - return result; - } - - inline Real operator[](const int i) const { return values[i]; } - - template void serialize(Archive &ar) { ar(values); } -}; - -// Phase 8: Apply C++20 concept constraints -template class DataType { -public: - BB second; - T first; - - DataType() noexcept = default; - - DataType(const T &f, const BB &s) { - first = f; - second = s; - } - - DataType(T &&f, BB &&s) noexcept { - first = std::move(f); - second = std::move(s); - } - - void swap(DataType& other) noexcept { - using std::swap; - swap(first, other.first); - swap(second, other.second); - } - - template void serialize(Archive &ar) { ar(first, second); } -}; - -template -void clean_data(DataType *b, DataType *e) { - for (DataType *it = e - 1; it >= b; --it) { - it->~DataType(); - } -} - -// Phase 8: Apply C++20 concept constraints -template class Leaf { -public: - BB mbb; - svec, B> data; // You can swap when filtering - int axis = 0; - - // T is type of keys(ids) which will be returned when you post a query. 
- Leaf() { mbb = BB(); } - Leaf(const int _axis) { - axis = _axis; - mbb = BB(); - } - - void set_axis(const int &_axis) { axis = _axis; } - - void push(const T &key, const BB &target) { - data.emplace_back(key, target); - update_mbb(); - } - - void update_mbb() { - mbb.clear(); - for (const auto &datum : data) { - mbb += datum.second; - } - } - - bool filter(DataType &value) { // false means given value is ignored - // Phase 2: C++20 requires explicit 'this' capture - auto comp = [this](const auto &a, const auto &b) noexcept { - return a.second.val_for_comp(axis) < b.second.val_for_comp(axis); - }; - - if (data.size() < B) { // if there is room, just push the candidate - auto iter = std::lower_bound(data.begin(), data.end(), value, comp); - DataType tmp_value = DataType(value); - data.insert(iter, std::move(tmp_value)); - mbb += value.second; - return true; - } else { // if there is no room, check the priority and swap if needed - if (data[0].second.val_for_comp(axis) < value.second.val_for_comp(axis)) { - size_t n_swap = - std::lower_bound(data.begin(), data.end(), value, comp) - - data.begin(); - std::swap(*data.begin(), value); - auto iter = data.begin(); - for (size_t i = 0; i < n_swap - 1; ++i) { - std::swap(*(iter + i), *(iter + i + 1)); - } - update_mbb(); - } - return false; - } - } -}; - -// Phase 8: Apply C++20 concept constraints -template class PseudoPRTreeNode { -public: - Leaf leaves[2 * D]; - std::unique_ptr left, right; - - PseudoPRTreeNode() { - for (int i = 0; i < 2 * D; i++) { - leaves[i].set_axis(i); - } - } - PseudoPRTreeNode(const int axis) { - for (int i = 0; i < 2 * D; i++) { - const int j = (axis + i) % (2 * D); - leaves[i].set_axis(j); - } - } - - template void serialize(Archive &archive) { - // archive(cereal::(left), cereal::defer(right), leaves); - archive(left, right, leaves); - } - - void address_of_leaves(vec *> &out) { - for (auto &leaf : leaves) { - if (leaf.data.size() > 0) { - out.emplace_back(&leaf); - } - } - } - - template 
auto filter(const iterator &b, const iterator &e) { - auto out = std::remove_if(b, e, [&](auto &x) { - for (auto &l : leaves) { - if (l.filter(x)) { - return true; - } - } - return false; - }); - return out; - } -}; - -// Phase 8: Apply C++20 concept constraints -template class PseudoPRTree { -public: - std::unique_ptr> root; - vec *> cache_children; - const int nthreads = std::max(1, (int)std::thread::hardware_concurrency()); - - PseudoPRTree() { root = std::make_unique>(); } - - template PseudoPRTree(const iterator &b, const iterator &e) { - if (!root) { - root = std::make_unique>(); - } - construct(root.get(), b, e, 0); - clean_data(b, e); - } - - template void serialize(Archive &archive) { - archive(root); - // archive.serializeDeferments(); - } - - template - void construct(PseudoPRTreeNode *node, const iterator &b, - const iterator &e, const int depth) { - if (e - b > 0 && node != nullptr) { - bool use_recursive_threads = std::pow(2, depth + 1) <= nthreads; -#ifdef MY_DEBUG - use_recursive_threads = false; -#endif - - vec threads; - threads.reserve(2); - PseudoPRTreeNode *node_left, *node_right; - - const int axis = depth % (2 * D); - auto ee = node->filter(b, e); - auto m = b; - std::advance(m, (ee - b) / 2); - std::nth_element(b, m, ee, - [axis](const DataType &lhs, - const DataType &rhs) noexcept { - return lhs.second[axis] < rhs.second[axis]; - }); - - if (m - b > 0) { - node->left = std::make_unique>(axis); - node_left = node->left.get(); - if (use_recursive_threads) { - threads.push_back( - std::thread([&]() { construct(node_left, b, m, depth + 1); })); - } else { - construct(node_left, b, m, depth + 1); - } - } - if (ee - m > 0) { - node->right = std::make_unique>(axis); - node_right = node->right.get(); - if (use_recursive_threads) { - threads.push_back( - std::thread([&]() { construct(node_right, m, ee, depth + 1); })); - } else { - construct(node_right, m, ee, depth + 1); - } - } - std::for_each(threads.begin(), threads.end(), - [&](std::thread &x) 
{ x.join(); }); - } - } - - auto get_all_leaves(const int hint) { - if (cache_children.empty()) { - using U = PseudoPRTreeNode; - cache_children.reserve(hint); - auto node = root.get(); - queue que; - que.emplace(node); - - while (!que.empty()) { - node = que.front(); - que.pop(); - node->address_of_leaves(cache_children); - if (node->left) - que.emplace(node->left.get()); - if (node->right) - que.emplace(node->right.get()); - } - } - return cache_children; - } - - std::pair *, DataType *> as_X(void *placement, - const int hint) { - DataType *b, *e; - auto children = get_all_leaves(hint); - T total = children.size(); - b = reinterpret_cast *>(placement); - e = b + total; - for (T i = 0; i < total; i++) { - new (b + i) DataType{i, children[i]->mbb}; - } - return {b, e}; - } -}; - -// Phase 8: Apply C++20 concept constraints -template class PRTreeLeaf { -public: - BB mbb; - svec, B> data; - - PRTreeLeaf() { mbb = BB(); } - - PRTreeLeaf(const Leaf &leaf) { - mbb = leaf.mbb; - data = leaf.data; - } - - Real area() const { return mbb.area(); } - - void update_mbb() { - mbb.clear(); - for (const auto &datum : data) { - mbb += datum.second; - } - } - - void operator()(const BB &target, vec &out) const { - if (mbb(target)) { - for (const auto &x : data) { - if (x.second(target)) { - out.emplace_back(x.first); - } - } - } - } - - void del(const T &key, const BB &target) { - if (mbb(target)) { - auto remove_it = - std::remove_if(data.begin(), data.end(), [&](auto &datum) { - return datum.second(target) && datum.first == key; - }); - data.erase(remove_it, data.end()); - } - } - - void push(const T &key, const BB &target) { - data.emplace_back(key, target); - update_mbb(); - } - - template void save(Archive &ar) const { - vec> _data; - for (const auto &datum : data) { - _data.push_back(datum); - } - ar(mbb, _data); - } - - template void load(Archive &ar) { - vec> _data; - ar(mbb, _data); - for (const auto &datum : _data) { - data.push_back(datum); - } - } -}; - -// Phase 8: 
Apply C++20 concept constraints -template class PRTreeNode { -public: - BB mbb; - std::unique_ptr> leaf; - std::unique_ptr> head, next; - - PRTreeNode() {} - PRTreeNode(const BB &_mbb) { mbb = _mbb; } - - PRTreeNode(BB &&_mbb) noexcept { mbb = std::move(_mbb); } - - PRTreeNode(Leaf *l) { - leaf = std::make_unique>(); - mbb = l->mbb; - leaf->mbb = std::move(l->mbb); - leaf->data = std::move(l->data); - } - - bool operator()(const BB &target) { return mbb(target); } -}; - -// Phase 8: Apply C++20 concept constraints -template class PRTreeElement { -public: - BB mbb; - std::unique_ptr> leaf; - bool is_used = false; - - PRTreeElement() { - mbb = BB(); - is_used = false; - } - - PRTreeElement(const PRTreeNode &node) { - mbb = BB(node.mbb); - if (node.leaf) { - Leaf tmp_leaf = Leaf(*node.leaf.get()); - leaf = std::make_unique>(tmp_leaf); - } - is_used = true; - } - - bool operator()(const BB &target) { return is_used && mbb(target); } - - template void serialize(Archive &archive) { - archive(mbb, leaf, is_used); - } -}; - -// Phase 8: Apply C++20 concept constraints -template -void bfs( - const std::function> &)> &func, - vec> &flat_tree, const BB target) { - queue que; - auto qpush_if_intersect = [&](const size_t &i) { - PRTreeElement &r = flat_tree[i]; - // std::cout << "i " << (long int) i << " : " << (bool) r.leaf << std::endl; - if (r(target)) { - // std::cout << " is pushed" << std::endl; - que.emplace(i); - } - }; - - // std::cout << "size: " << flat_tree.size() << std::endl; - qpush_if_intersect(0); - while (!que.empty()) { - size_t idx = que.front(); - // std::cout << "idx: " << (long int) idx << std::endl; - que.pop(); - PRTreeElement &elem = flat_tree[idx]; - - if (elem.leaf) { - // std::cout << "func called for " << (long int) idx << std::endl; - func(elem.leaf); - } else { - for (size_t offset = 0; offset < B; offset++) { - size_t jdx = idx * B + offset + 1; - qpush_if_intersect(jdx); - } - } - } -} - -// Phase 8: Apply C++20 concept constraints for type 
safety -// T must be an integral type (used as index), not bool template class PRTree { private: vec> flat_tree; diff --git a/cpp/parallel.h b/include/prtree/utils/parallel.h similarity index 100% rename from cpp/parallel.h rename to include/prtree/utils/parallel.h diff --git a/cpp/small_vector.h b/include/prtree/utils/small_vector.h similarity index 100% rename from cpp/small_vector.h rename to include/prtree/utils/small_vector.h diff --git a/init_develop.sh b/init_develop.sh deleted file mode 100644 index de030f86..00000000 --- a/init_develop.sh +++ /dev/null @@ -1,2 +0,0 @@ -pip install -r requirements-dev.txt -pip install -r requirements.txt \ No newline at end of file diff --git a/pyproject.toml b/pyproject.toml new file mode 100644 index 00000000..a5e3c2ad --- /dev/null +++ b/pyproject.toml @@ -0,0 +1,191 @@ +[build-system] +requires = ["setuptools>=61.0", "wheel", "cmake>=3.22", "pybind11>=2.9.0", "numpy>=1.16"] +build-backend = "setuptools.build_meta" + +[project] +name = "python_prtree" +version = "0.7.0" +description = "Python implementation of Priority R-Tree" +readme = "README.md" +requires-python = ">=3.8" +license = {text = "MIT"} +authors = [ + {name = "atksh"}, +] +maintainers = [ + {name = "atksh"}, +] +keywords = ["priority-rtree", "r-tree", "prtree", "rtree", "pybind11", "spatial-index", "data-structures"] +classifiers = [ + "Development Status :: 4 - Beta", + "Intended Audience :: Developers", + "Intended Audience :: Science/Research", + "License :: OSI Approved :: MIT License", + "Operating System :: OS Independent", + "Programming Language :: Python :: 3", + "Programming Language :: Python :: 3.8", + "Programming Language :: Python :: 3.9", + "Programming Language :: Python :: 3.10", + "Programming Language :: Python :: 3.11", + "Programming Language :: Python :: 3.12", + "Programming Language :: Python :: 3.13", + "Programming Language :: C++", + "Topic :: Scientific/Engineering", + "Topic :: Software Development :: Libraries :: Python 
Modules", +] +dependencies = [ + "numpy>=1.16", +] + +[project.optional-dependencies] +dev = [ + "pytest>=7.1.2", + "pytest-cov>=3.0.0", + "pytest-xdist>=2.5.0", + "black>=22.0.0", + "ruff>=0.1.0", + "mypy>=1.0.0", +] +docs = [ + "sphinx>=5.0.0", + "sphinx-rtd-theme>=1.0.0", + "myst-parser>=0.18.0", +] +benchmark = [ + "matplotlib>=3.5.0", + "pandas>=1.4.0", +] + +[project.urls] +Homepage = "https://github.com/atksh/python_prtree" +Repository = "https://github.com/atksh/python_prtree" +"Bug Tracker" = "https://github.com/atksh/python_prtree/issues" +Documentation = "https://github.com/atksh/python_prtree#readme" +Changelog = "https://github.com/atksh/python_prtree/blob/main/CHANGES.md" + +[tool.setuptools] +zip-safe = false + +[tool.setuptools.packages.find] +where = ["src"] +include = ["python_prtree*"] +namespaces = false + +[tool.pytest.ini_options] +testpaths = ["tests"] +python_files = ["test_*.py"] +python_classes = ["Test*"] +python_functions = ["test_*"] +addopts = [ + "-v", + "--strict-markers", + "--strict-config", + "--showlocals", +] +markers = [ + "slow: marks tests as slow (deselect with '-m \"not slow\"')", + "integration: marks tests as integration tests", + "unit: marks tests as unit tests", + "e2e: marks tests as end-to-end tests", +] +filterwarnings = [ + "error", + "ignore::DeprecationWarning", + "ignore::PendingDeprecationWarning", +] + +[tool.coverage.run] +source = ["src/python_prtree"] +branch = true +parallel = true +omit = [ + "*/tests/*", + "*/test_*.py", +] + +[tool.coverage.report] +precision = 2 +show_missing = true +skip_covered = false +exclude_lines = [ + "pragma: no cover", + "def __repr__", + "raise AssertionError", + "raise NotImplementedError", + "if __name__ == .__main__.:", + "if TYPE_CHECKING:", + "@abstractmethod", +] + +[tool.black] +line-length = 100 +target-version = ["py38", "py39", "py310", "py311", "py312"] +include = '\.pyi?$' +extend-exclude = ''' +/( + # directories + \.eggs + | \.git + | \.hg + | \.mypy_cache + | 
\.tox + | \.venv + | build + | dist + | third +)/ +''' + +[tool.ruff] +line-length = 100 +target-version = "py38" +extend-exclude = ["third", "build", "dist"] + +[tool.ruff.lint] +select = [ + "E", # pycodestyle errors + "W", # pycodestyle warnings + "F", # pyflakes + "I", # isort + "B", # flake8-bugbear + "C4", # flake8-comprehensions + "UP", # pyupgrade +] +ignore = [ + "E501", # line too long, handled by black + "B008", # do not perform function calls in argument defaults + "C901", # too complex +] + +[tool.ruff.lint.per-file-ignores] +"__init__.py" = ["F401"] # imported but unused +"tests/*" = ["B011"] # assert False + +[tool.ruff.lint.isort] +known-first-party = ["python_prtree"] + +[tool.mypy] +python_version = "3.8" +warn_return_any = true +warn_unused_configs = true +disallow_untyped_defs = false +disallow_incomplete_defs = false +check_untyped_defs = true +disallow_untyped_decorators = false +no_implicit_optional = true +warn_redundant_casts = true +warn_unused_ignores = true +warn_no_return = true +strict_equality = true +exclude = [ + "third/", + "build/", + "dist/", +] + +[[tool.mypy.overrides]] +module = [ + "numpy.*", + "pytest.*", +] +ignore_missing_imports = true diff --git a/requirements-dev.txt b/requirements-dev.txt deleted file mode 100644 index d129afbb..00000000 --- a/requirements-dev.txt +++ /dev/null @@ -1,3 +0,0 @@ -pytest==7.1.2 -pybind11==2.9.0 -cmake==3.22.4 \ No newline at end of file diff --git a/requirements.txt b/requirements.txt deleted file mode 100644 index d9f5ff0c..00000000 --- a/requirements.txt +++ /dev/null @@ -1 +0,0 @@ -numpy>=1.16 diff --git a/run_test.sh b/run_test.sh deleted file mode 100755 index 4cac5a50..00000000 --- a/run_test.sh +++ /dev/null @@ -1,6 +0,0 @@ -set -e - -rm -rf build dist .pytest_cache -pip uninstall python_prtree -y || true -pip install -v -e . 
-python -m pytest tests -vv --capture=no diff --git a/setup.py b/setup.py index 8124c1d9..e4c7b1ee 100644 --- a/setup.py +++ b/setup.py @@ -1,3 +1,10 @@ +""" +Setup script for building C++ extensions. + +Note: Project metadata is defined in pyproject.toml. +This file is only used for building the C++ extensions via CMake. +""" + import os import platform import re @@ -6,21 +13,9 @@ from distutils.version import LooseVersion from multiprocessing import cpu_count -from setuptools import Extension, find_packages, setup +from setuptools import Extension, setup from setuptools.command.build_ext import build_ext -version = "v0.7.0" - -sys.path.append("./tests") - -here = os.path.abspath(os.path.dirname(__file__)) -with open(os.path.join(here, "README.md"), encoding="utf-8") as f: - long_description = f.read() - - -def _requires_from_file(filename): - return open(filename).read().splitlines() - class CMakeExtension(Extension): def __init__(self, name, sourcedir=""): @@ -154,32 +149,6 @@ def build_extension(self, ext): setup( - name="python_prtree", - version=version, - license="MIT", - description="Python implementation of Priority R-Tree", - author="atksh", - url="https://github.com/atksh/python_prtree", ext_modules=[CMakeExtension("python_prtree.PRTree")], cmdclass=dict(build_ext=CMakeBuild), - zip_safe=False, - python_requires=">=3.8", - install_requires=_requires_from_file("requirements.txt"), - package_dir={"": "src"}, - packages=find_packages("src"), - test_suite="test_PRTree.suite", - long_description=long_description, - long_description_content_type="text/markdown", - keywords="priority-rtree r-tree prtree rtree pybind11", - classifiers=[ - "License :: OSI Approved :: MIT License", - "Programming Language :: Python :: 3", - "Programming Language :: Python :: 3.8", - "Programming Language :: Python :: 3.9", - "Programming Language :: Python :: 3.10", - "Programming Language :: Python :: 3.11", - "Programming Language :: Python :: 3.12", - "Programming Language :: 
Python :: 3.13", - "Programming Language :: Python :: 3.14", - ], ) diff --git a/cpp/main.cc b/src/cpp/bindings/python_bindings.cc similarity index 99% rename from cpp/main.cc rename to src/cpp/bindings/python_bindings.cc index a5a7a791..2cccb713 100644 --- a/cpp/main.cc +++ b/src/cpp/bindings/python_bindings.cc @@ -1,4 +1,4 @@ -#include "prtree.h" +#include "prtree/core/prtree.h" #include #include #include diff --git a/src/python_prtree/__init__.py b/src/python_prtree/__init__.py index 24036624..26d57d57 100644 --- a/src/python_prtree/__init__.py +++ b/src/python_prtree/__init__.py @@ -1,137 +1,41 @@ -import codecs -import pickle - -from .PRTree import _PRTree2D, _PRTree3D, _PRTree4D +""" +python_prtree - Fast spatial indexing with Priority R-Tree + +This package provides efficient 2D, 3D, and 4D spatial indexing using +the Priority R-Tree data structure with C++ performance. + +Main classes: + - PRTree2D: 2D spatial indexing + - PRTree3D: 3D spatial indexing + - PRTree4D: 4D spatial indexing + +Example: + >>> from python_prtree import PRTree2D + >>> import numpy as np + >>> + >>> # Create tree with bounding boxes + >>> indices = np.array([1, 2, 3]) + >>> boxes = np.array([ + ... [0.0, 0.0, 1.0, 1.0], + ... [1.0, 1.0, 2.0, 2.0], + ... [2.0, 2.0, 3.0, 3.0], + ... 
]) + >>> tree = PRTree2D(indices, boxes) + >>> + >>> # Query overlapping boxes + >>> results = tree.query([0.5, 0.5, 1.5, 1.5]) + >>> print(results) # [1, 2] + +For more information, see the documentation at: +https://github.com/atksh/python_prtree +""" + +from .core import PRTree2D, PRTree3D, PRTree4D + +__version__ = "0.7.0" __all__ = [ "PRTree2D", "PRTree3D", "PRTree4D", ] - - -def dumps(obj): - if obj is None: - return None - else: - return pickle.dumps(obj) - - -def loads(obj): - if obj is None: - return None - else: - return pickle.loads(obj) - - -class PRTree2D: - Klass = _PRTree2D - - def __init__(self, *args, **kwargs): - self._tree = self.Klass(*args, **kwargs) - - def __getattr__(self, name): - def handler_function(*args, **kwargs): - # Handle empty tree cases for methods that cause segfaults - if self.n == 0 and name in ('rebuild', 'save'): - # These operations are not meaningful/safe on empty trees - if name == 'rebuild': - return # No-op for empty tree - elif name == 'save': - raise ValueError("Cannot save empty tree") - - ret = getattr(self._tree, name)(*args, **kwargs) - return ret - - return handler_function - - @property - def n(self): - return self._tree.size() - - def __len__(self): - return self.n - - def erase(self, idx): - if self.n == 0: - raise ValueError("Nothing to erase") - - # Handle erasing the last element (library limitation workaround) - if self.n == 1: - # Call underlying erase to validate index, then handle the library bug - try: - self._tree.erase(idx) - # If we get here, erase succeeded (shouldn't happen with n==1) - return - except RuntimeError as e: - error_msg = str(e) - if "Given index is not found" in error_msg: - # Index doesn't exist - re-raise the error - raise - elif "#roots is not 1" in error_msg: - # This is the library bug we're working around - # Index was valid, so recreate empty tree - self._tree = self.Klass() - return - else: - # Some other RuntimeError - re-raise it - raise - - self._tree.erase(idx) - - def 
set_obj(self, idx, obj): - objdumps = dumps(obj) - self._tree.set_obj(idx, objdumps) - - def get_obj(self, idx): - obj = self._tree.get_obj(idx) - return loads(obj) - - def insert(self, idx=None, bb=None, obj=None): - if idx is None and obj is None: - raise ValueError("Specify index or obj") - if idx is None: - idx = self.n + 1 - if bb is None: - raise ValueError("Specify bounding box") - - objdumps = dumps(obj) - if self.n == 0: - self._tree = self.Klass([idx], [bb]) - self._tree.set_obj(idx, objdumps) - else: - self._tree.insert(idx, bb, objdumps) - - def query(self, *args, return_obj=False): - # Handle empty tree case to prevent segfault - if self.n == 0: - return [] - - if len(args) == 1: - out = self._tree.query(*args) - else: - out = self._tree.query(args) - if return_obj: - objs = [self.get_obj(i) for i in out] - return objs - else: - return out - - def batch_query(self, queries, *args, **kwargs): - # Handle empty tree case to prevent segfault - if self.n == 0: - # Return empty list for each query - import numpy as np - if hasattr(queries, 'shape'): - return [[] for _ in range(len(queries))] - return [] - - return self._tree.batch_query(queries, *args, **kwargs) - - -class PRTree3D(PRTree2D): - Klass = _PRTree3D - - -class PRTree4D(PRTree2D): - Klass = _PRTree4D diff --git a/src/python_prtree/core.py b/src/python_prtree/core.py new file mode 100644 index 00000000..7e9c1ff5 --- /dev/null +++ b/src/python_prtree/core.py @@ -0,0 +1,249 @@ +"""Core PRTree classes for 2D, 3D, and 4D spatial indexing.""" + +import pickle +from typing import Any, List, Optional, Sequence, Union + +from .PRTree import _PRTree2D, _PRTree3D, _PRTree4D + +__all__ = [ + "PRTree2D", + "PRTree3D", + "PRTree4D", +] + + +def _dumps(obj: Any) -> Optional[bytes]: + """Serialize Python object using pickle.""" + if obj is None: + return None + return pickle.dumps(obj) + + +def _loads(obj: Optional[bytes]) -> Any: + """Deserialize Python object using pickle.""" + if obj is None: + return None + 
return pickle.loads(obj) + + +class PRTreeBase: + """ + Base class for PRTree implementations. + + Provides common functionality for 2D, 3D, and 4D spatial indexing + with Priority R-Tree data structure. + """ + + Klass = None # To be overridden by subclasses + + def __init__(self, *args, **kwargs): + """Initialize PRTree with optional indices and bounding boxes.""" + if self.Klass is None: + raise NotImplementedError("Use PRTree2D, PRTree3D, or PRTree4D") + self._tree = self.Klass(*args, **kwargs) + + def __getattr__(self, name): + """Delegate attribute access to underlying C++ tree.""" + def handler_function(*args, **kwargs): + # Handle empty tree cases for methods that cause segfaults + if self.n == 0 and name in ('rebuild', 'save'): + # These operations are not meaningful/safe on empty trees + if name == 'rebuild': + return # No-op for empty tree + elif name == 'save': + raise ValueError("Cannot save empty tree") + + ret = getattr(self._tree, name)(*args, **kwargs) + return ret + + return handler_function + + @property + def n(self) -> int: + """Get the number of bounding boxes in the tree.""" + return self._tree.size() + + def __len__(self) -> int: + """Return the number of bounding boxes in the tree.""" + return self.n + + def erase(self, idx: int) -> None: + """ + Remove a bounding box by index. 
+ + Args: + idx: Index of the bounding box to remove + + Raises: + ValueError: If tree is empty or index not found + """ + if self.n == 0: + raise ValueError("Nothing to erase") + + # Handle erasing the last element (library limitation workaround) + if self.n == 1: + # Call underlying erase to validate index, then handle the library bug + try: + self._tree.erase(idx) + # If we get here, erase succeeded (shouldn't happen with n==1) + return + except RuntimeError as e: + error_msg = str(e) + if "Given index is not found" in error_msg: + # Index doesn't exist - re-raise the error + raise + elif "#roots is not 1" in error_msg: + # This is the library bug we're working around + # Index was valid, so recreate empty tree + self._tree = self.Klass() + return + else: + # Some other RuntimeError - re-raise it + raise + + self._tree.erase(idx) + + def set_obj(self, idx: int, obj: Any) -> None: + """ + Store a Python object associated with a bounding box. + + Args: + idx: Index of the bounding box + obj: Any picklable Python object + """ + objdumps = _dumps(obj) + self._tree.set_obj(idx, objdumps) + + def get_obj(self, idx: int) -> Any: + """ + Retrieve the Python object associated with a bounding box. + + Args: + idx: Index of the bounding box + + Returns: + The stored Python object, or None if not set + """ + obj = self._tree.get_obj(idx) + return _loads(obj) + + def insert( + self, + idx: Optional[int] = None, + bb: Optional[Sequence[float]] = None, + obj: Any = None + ) -> None: + """ + Insert a new bounding box into the tree. 
+ + Args: + idx: Index for the bounding box (auto-assigned if None, which requires obj) + bb: Bounding box coordinates (required) + obj: Optional Python object to associate + + Raises: + ValueError: If neither idx nor obj is given, or if bb is not specified + """ + if idx is None and obj is None: + raise ValueError("Specify index or obj") + if idx is None: + idx = self.n + 1 + if bb is None: + raise ValueError("Specify bounding box") + + objdumps = _dumps(obj) + if self.n == 0: + self._tree = self.Klass([idx], [bb]) + self._tree.set_obj(idx, objdumps) + else: + self._tree.insert(idx, bb, objdumps) + + def query( + self, + *args, + return_obj: bool = False + ) -> Union[List[int], List[Any]]: + """ + Find all bounding boxes that overlap with the query box. + + Args: + *args: Query bounding box coordinates + return_obj: If True, return stored objects instead of indices + + Returns: + List of indices or objects that overlap with the query + """ + # Handle empty tree case to prevent segfault + if self.n == 0: + return [] + + if len(args) == 1: + out = self._tree.query(*args) + else: + out = self._tree.query(args) + + if return_obj: + objs = [self.get_obj(i) for i in out] + return objs + else: + return out + + def batch_query(self, queries, *args, **kwargs): + """ + Perform multiple queries in parallel. + + Args: + queries: Array of query bounding boxes + *args, **kwargs: Additional arguments passed to C++ implementation + + Returns: + List of result lists, one per query + """ + # Handle empty tree case to prevent segfault + if self.n == 0: + # Return empty list for each query + import numpy as np + if hasattr(queries, 'shape'): + return [[] for _ in range(len(queries))] + return [] + + return self._tree.batch_query(queries, *args, **kwargs) + + +class PRTree2D(PRTreeBase): + """ + 2D Priority R-Tree for spatial indexing. 
+ + Supports efficient querying of 2D bounding boxes: + [xmin, ymin, xmax, ymax] + + Example: + >>> tree = PRTree2D([1, 2], [[0, 0, 1, 1], [2, 2, 3, 3]]) + >>> results = tree.query([0.5, 0.5, 2.5, 2.5]) + >>> print(results) # [1, 2] + """ + Klass = _PRTree2D + + +class PRTree3D(PRTreeBase): + """ + 3D Priority R-Tree for spatial indexing. + + Supports efficient querying of 3D bounding boxes: + [xmin, ymin, zmin, xmax, ymax, zmax] + + Example: + >>> tree = PRTree3D([1], [[0, 0, 0, 1, 1, 1]]) + >>> results = tree.query([0.5, 0.5, 0.5, 1.5, 1.5, 1.5]) + """ + Klass = _PRTree3D + + +class PRTree4D(PRTreeBase): + """ + 4D Priority R-Tree for spatial indexing. + + Supports efficient querying of 4D bounding boxes. + Useful for spatio-temporal data or higher-dimensional spaces. + """ + Klass = _PRTree4D diff --git a/src/python_prtree/py.typed b/src/python_prtree/py.typed new file mode 100644 index 00000000..c0ec82ae --- /dev/null +++ b/src/python_prtree/py.typed @@ -0,0 +1,2 @@ +# Marker file for PEP 561 +# This package supports type hints diff --git a/third/cereal/include/cereal/external/rapidxml/license.txt b/third/cereal/include/cereal/external/rapidxml/license.txt index 0095bc72..14098318 100644 --- a/third/cereal/include/cereal/external/rapidxml/license.txt +++ b/third/cereal/include/cereal/external/rapidxml/license.txt @@ -1,52 +1,52 @@ -Use of this software is granted under one of the following two licenses, -to be chosen freely by the user. - -1. 
Boost Software License - Version 1.0 - August 17th, 2003 -=============================================================================== - -Copyright (c) 2006, 2007 Marcin Kalicinski - -Permission is hereby granted, free of charge, to any person or organization -obtaining a copy of the software and accompanying documentation covered by -this license (the "Software") to use, reproduce, display, distribute, -execute, and transmit the Software, and to prepare derivative works of the -Software, and to permit third-parties to whom the Software is furnished to -do so, all subject to the following: - -The copyright notices in the Software and this entire statement, including -the above license grant, this restriction and the following disclaimer, -must be included in all copies of the Software, in whole or in part, and -all derivative works of the Software, unless such copies or derivative -works are solely in the form of machine-executable object code generated by -a source language processor. - -THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT -SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE -FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, -ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER -DEALINGS IN THE SOFTWARE. - -2. 
The MIT License -=============================================================================== - -Copyright (c) 2006, 2007 Marcin Kalicinski - -Permission is hereby granted, free of charge, to any person obtaining a copy -of this software and associated documentation files (the "Software"), to deal -in the Software without restriction, including without limitation the rights -to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies -of the Software, and to permit persons to whom the Software is furnished to do so, -subject to the following conditions: - -The above copyright notice and this permission notice shall be included in all -copies or substantial portions of the Software. - -THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL -THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS -IN THE SOFTWARE. +Use of this software is granted under one of the following two licenses, +to be chosen freely by the user. + +1. 
Boost Software License - Version 1.0 - August 17th, 2003 +=============================================================================== + +Copyright (c) 2006, 2007 Marcin Kalicinski + +Permission is hereby granted, free of charge, to any person or organization +obtaining a copy of the software and accompanying documentation covered by +this license (the "Software") to use, reproduce, display, distribute, +execute, and transmit the Software, and to prepare derivative works of the +Software, and to permit third-parties to whom the Software is furnished to +do so, all subject to the following: + +The copyright notices in the Software and this entire statement, including +the above license grant, this restriction and the following disclaimer, +must be included in all copies of the Software, in whole or in part, and +all derivative works of the Software, unless such copies or derivative +works are solely in the form of machine-executable object code generated by +a source language processor. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT +SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE +FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, +ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER +DEALINGS IN THE SOFTWARE. + +2. 
The MIT License +=============================================================================== + +Copyright (c) 2006, 2007 Marcin Kalicinski + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies +of the Software, and to permit persons to whom the Software is furnished to do so, +subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL +THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS +IN THE SOFTWARE. diff --git a/scripts/analyze_baseline.py b/tools/analyze_baseline.py similarity index 100% rename from scripts/analyze_baseline.py rename to tools/analyze_baseline.py diff --git a/docs/run_profile.py b/tools/profile.py similarity index 100% rename from docs/run_profile.py rename to tools/profile.py diff --git a/run_profile.sh b/tools/profile.sh similarity index 100% rename from run_profile.sh rename to tools/profile.sh diff --git a/scripts/profile_all_workloads.sh b/tools/profile_all_workloads.sh similarity index 100% rename from scripts/profile_all_workloads.sh rename to tools/profile_all_workloads.sh