AETHER ISR Data Preparation

This repository contains a reproducible, script-based data preparation pipeline for satellite-to-simulation research. The pipeline discovers source archives, extracts them safely, inventories dataset formats, performs conservative cleaning, harmonises labels into a shared taxonomy, prepares segmentation and classification-ready outputs, derives geometry/topology/context features, and generates QA reports.

Folder Structure

data/                     Canonical dataset estate.
data/raw/                 Raw-source catalogues and manifests.
data/raw/archives/        Immutable source ZIP/TAR inputs.
data/raw/direct_sources/  Canonical raw directory datasets that are not archive-based.
data/extracted/           Safe, traceable archive extractions.
data/interim/             Quarantine, intermediate manifests, and repair products.
data/processed/           Non-final processed copies where needed.
data/final/               Research-ready task datasets.
reports/                  Inventory, cleaning, feature, QA, and preview reports.
docs/                     Methodology, data card, and class mapping notes.
logs/                     Stage logs.
scripts/                  Reproducible Python pipeline scripts.
notebooks/                Audit notebooks for cleaning, preprocessing, and features.
configs/                  Central pipeline configuration.

Run Order

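Set up the environment (the activation command is for Windows PowerShell; on Linux or macOS, use source .venv/bin/activate instead):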

python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip setuptools wheel
python -m pip install -r requirements-dev.txt

Run the full conservative pipeline:

python scripts/run_pipeline.py --config configs/pipeline.yaml

Tidy legacy root-level archives, old data_* folders, and recognised root-level direct-source dataset directories into the canonical data/ estate:

python scripts/migrate_data_layout.py --config configs/pipeline.yaml

Run the unit tests:

pytest

Run the standard project quality gate:

python scripts/quality_gate.py

Check whether the project detected local/on-prem, RTX 4090, Colab, and Google Drive correctly:

python scripts/check_runtime.py --config configs/pipeline.yaml

Validate the required research/industry folder structure:

python scripts/validate_project_structure.py --config configs/pipeline.yaml

Before release or controlled handoff, run the quality gate in strict mode:

python scripts/quality_gate.py --strict

Review the notebook audit companions:

jupyter notebook notebooks/01_data_cleaning.ipynb
jupyter notebook notebooks/02_preprocessing_dataset_preparation.ipynb
jupyter notebook notebooks/03_feature_engineering.ipynb

The notebooks call the same .py stage scripts and then read the generated reports; they are not a separate implementation.

To run individual stages instead of the full pipeline:

python scripts/extract_archives.py --config configs/pipeline.yaml
python scripts/inspect_datasets.py --config configs/pipeline.yaml
python scripts/clean_files.py --config configs/pipeline.yaml
python scripts/harmonise_labels.py --config configs/pipeline.yaml
python scripts/prepare_segmentation.py --config configs/pipeline.yaml
python scripts/prepare_classification.py --config configs/pipeline.yaml
python scripts/derive_context_features.py --config configs/pipeline.yaml
python scripts/prepare_contextual_prediction.py --config configs/pipeline.yaml
python scripts/feature_engineering.py --config configs/pipeline.yaml
python scripts/qa_checks.py --config configs/pipeline.yaml
python scripts/generate_reports.py --config configs/pipeline.yaml

The extraction and inventory stages should be reviewed before accepting any class harmonisation or standardisation decisions for a paper or thesis appendix.
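
For illustration only, the stage list above can be driven from a small Python wrapper that stops after a chosen stage, which makes it easier to pause and review the extraction and inventory outputs. This is a sketch, not part of the repository: build_stage_commands, run_stages, and stop_after are hypothetical names, and it assumes the listed order is the intended execution order. scripts/run_pipeline.py remains the supported entry point.

```python
import subprocess
import sys
from typing import List

# Stage scripts in the order they are listed in this README.
STAGE_SCRIPTS = [
    "extract_archives.py",
    "inspect_datasets.py",
    "clean_files.py",
    "harmonise_labels.py",
    "prepare_segmentation.py",
    "prepare_classification.py",
    "derive_context_features.py",
    "prepare_contextual_prediction.py",
    "feature_engineering.py",
    "qa_checks.py",
    "generate_reports.py",
]


def build_stage_commands(config: str = "configs/pipeline.yaml") -> List[List[str]]:
    """Return one argv list per stage, using the current interpreter."""
    return [
        [sys.executable, f"scripts/{name}", "--config", config]
        for name in STAGE_SCRIPTS
    ]


def run_stages(config: str = "configs/pipeline.yaml", stop_after: str = "") -> None:
    """Run stages in order; check=True aborts on the first non-zero exit."""
    for name, command in zip(STAGE_SCRIPTS, build_stage_commands(config)):
        subprocess.run(command, check=True)
        if name == stop_after:
            break
```

For example, run_stages(stop_after="inspect_datasets.py") would run extraction and inventory only, leaving the later stages until the inventory has been reviewed.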

Current Assumptions

Source archives are stored under data/raw/archives/ and treated as immutable raw inputs.
Non-archive direct-source datasets are stored under data/raw/direct_sources/ and treated as immutable raw inputs.
ZIP and TAR archives are supported, including nested archives and multipart .tar.001 sequences.
Existing train/validation/test splits are preserved when detected.
Missing splits are handled with deterministic, seed-controlled assignment.
The unified segmentation taxonomy is building, road, runway, vegetation, water, and bare_open_ground, with background at index 0.
Ambiguous or irrelevant small-object categories are logged and excluded unless explicitly mapped in configs/pipeline.yaml.
Image resizing is disabled by default. Originals remain preserved; final outputs are copies or derived masks.
The Google/Open Buildings direct-source dataset is treated as a vector-only building-footprint source. It is cleaned, paired, standardised, and carried into classification/context feature engineering without duplicating the raw 200+ GB source tree.
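
As a sketch of what deterministic, seed-controlled split assignment can look like, the example below hashes each sample identifier together with a seed and maps the result onto cumulative split fractions. The hash-based approach, the function name assign_split, the default seed, and the 80/10/10 ratios are all assumptions for illustration; the pipeline's actual method and ratios are defined by its scripts and configs/pipeline.yaml.

```python
import hashlib


def assign_split(
    sample_id: str,
    seed: int = 42,
    ratios=(("train", 0.8), ("validation", 0.1), ("test", 0.1)),
) -> str:
    """Map a sample identifier to a split, reproducibly for a given seed."""
    digest = hashlib.sha256(f"{seed}:{sample_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it into [0, 1).
    position = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for name, fraction in ratios:
        cumulative += fraction
        if position < cumulative:
            return name
    # Guard against floating-point rounding when the fractions sum to 1.0.
    return ratios[-1][0]
```

Because the assignment depends only on the identifier and the seed, adding or removing other samples does not move an existing sample between splits, which a seeded shuffle of the full list would not guarantee.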

Key Outputs

reports/inventory/dataset_inventory.csv
reports/inventory/dataset_inventory.md
reports/inventory/dataset_inventory.json
reports/inventory/direct_source_catalog.csv
reports/cleaning/cleaning_report.md
reports/cleaning/pairing_manifest.csv
reports/project_structure.md
reports/runtime_memory.csv
reports/stages/index.md
reports/stages/*.md
reports/stages/*.json
reports/features/object_geometry.csv
reports/features/scene_context.csv
reports/features/object_context_windows.csv
reports/features/contextual_relation_priors.csv
reports/google_open_buildings_shard_manifest.csv
reports/features/google_open_buildings_shard_features.csv
reports/features/google_open_buildings_geometry_sample.csv
reports/features/engineered_sample_features.csv
reports/features/engineered_classification_features.csv
reports/features/feature_dictionary.md
reports/qa/qa_report.md
docs/data_card.md
docs/methodology.md
docs/class_mapping.md
docs/feature_engineering.md
docs/contextual_prediction.md
docs/pairing_methodology.md
docs/notebook_methodology.md
docs/label_governance.md
docs/reproducibility.md
docs/environment_setup.md
docs/runtime_environment.md
docs/project_structure_standard.md
docs/memory_management.md
docs/quality_gates.md
docs/operations_runbook.md
docs/defence_research_governance.md
docs/research_ethics_and_responsible_ai.md
docs/documentation_standard.md
docs/review_and_approval.md
docs/branch_protection.md
docs/model_card_template.md
docs/release_checklist.md
docs/quality_assurance.md
docs/risk_register.md
docs/google_open_buildings_integration.md
CODE_OF_CONDUCT.md
GOVERNANCE.md
SUPPORT.md
MAINTAINERS.md
SECURITY.md

Current Run Summary

The pipeline discovered and inventoried 14 extracted dataset roots from the archives in this folder, including ZIP, TAR, nested ZIP, and multipart TAR inputs. Two large image ZIPs had malformed ZIP64 central-directory offsets; the extractor repaired the offset shift reproducibly and logged the retry provenance in reports/inventory/extraction_provenance_retry_bad_zip64.csv.
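
This README does not spell out what "safe" extraction involves. One common guard, sketched below under that caveat, is to reject any archive member whose resolved path would land outside the destination directory. The function names resolve_member_path and safe_extract_zip are hypothetical, the sketch covers ZIP only, and it does not attempt the ZIP64 offset repair described above.

```python
import zipfile
from pathlib import Path


def resolve_member_path(dest_root: Path, member_name: str) -> Path:
    """Return the resolved target path, or raise if it escapes dest_root."""
    root = dest_root.resolve()
    target = (root / member_name).resolve()
    try:
        target.relative_to(root)
    except ValueError:
        raise ValueError(f"unsafe archive member: {member_name!r}") from None
    return target


def safe_extract_zip(archive_path: Path, dest_root: Path) -> list:
    """Validate every member name first, then extract; return the names."""
    with zipfile.ZipFile(archive_path) as archive:
        names = archive.namelist()
        for name in names:
            resolve_member_path(dest_root, name)
        archive.extractall(dest_root)
    return names
```

Validating all names before extracting anything means a single bad member leaves the destination untouched. TAR archives need additional checks (for example, symbolic links and device files) that this sketch does not cover.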

The archive-based counts below describe the last completed baseline run before the Google/Open Buildings direct-source integration. After rerunning the full pipeline, consult reports/final_pipeline_report.md, reports/google_open_buildings_shard_manifest.csv, and reports/features/google_open_buildings_shard_features.csv for the refreshed combined audit trail.

Prepared outputs from the current run:

Output	Count
Final segmentation samples	13,487
Classification rows	56,096
Existing source classification rows	41,999
Segmentation-derived classification rows	13,487
All-source fallback classification rows	610
Context scenes	13,487
Geometry objects/components	244,135
Topology adjacency edges	65,816
Engineered sample feature rows	13,487
Engineered classification feature rows	56,096
QA preview images	117

Final segmentation datasets were prepared from:

Dataset root	Samples
5706578__6d6f2a1fca	4,191
archive_5__cf3bd094bb	3,842
annotations1024__34ad76d450	2,492
archive_2__51be8d6b24	1,171
archive_1__ea8ef46010	803
satellite.v1i.coco-segmentation__849f18754f	735
train_labels__3c0f552efa	220
airport_satellite_image_dataset.v1i.coco-segmentation__13691e9cc0	33

Segmentation preparation now uses raster masks, COCO polygons, XML oriented/horizontal boxes, vehicle TXT polygons, and xView GeoJSON pixel bounds. QA found no final image-mask alignment failures and no duplicate-image leakage across final segmentation splits. One unreadable macOS artifact image from train_images__feb124b8a4 was quarantined under data/interim/quarantine/.
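
A duplicate-image leakage check of the kind mentioned above can be approximated by hashing file contents and flagging any digest that appears in more than one split. The sketch below is an assumption about how such a check might work, not the implementation in scripts/qa_checks.py; file_digest and find_split_leakage are hypothetical names.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List, Set, Tuple


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file's bytes, read in chunks to bound memory use."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_split_leakage(samples: Iterable[Tuple[str, Path]]) -> Dict[str, List[str]]:
    """Return {digest: sorted splits} for images found in more than one split."""
    splits_by_digest: Dict[str, Set[str]] = defaultdict(set)
    for split, path in samples:
        splits_by_digest[file_digest(path)].add(split)
    return {
        digest: sorted(splits)
        for digest, splits in splits_by_digest.items()
        if len(splits) > 1
    }
```

An empty result means no byte-identical image is shared between splits. This only catches exact copies; re-encoded, resized, or cropped duplicates would need a perceptual comparison instead.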

Every pipeline stage also writes a paired Markdown and JSON audit report under reports/stages/, including the final standardised segmentation taxonomy and classification label set used by the run.

The repository now also supports the Google/Open Buildings direct-source building-footprint corpus under data/raw/direct_sources/google_open_buildings/. Because that dataset is vector-only and very large, the pipeline keeps the raw files in place, creates a deterministic paired shard manifest, adds one classification-compatible row per paired shard, and engineers sampled point-density plus polygon-geometry probe features for audit and modelling.

Research Team

ME5-1 (Dr.) Lim Yong Zhi, Research Supervisor
Chua Jia Jun (NSF)
Goh Kun Ming (AI Research Intern)

Defence Research Handling

This project is treated as controlled defence research for AETHER (Air Emerging Technology High-Speed Experimentation and Research), under RAiD (RSAF Agile innovation Digital), under RSAF, SAF, and MINDEF. Follow official classification and handling guidance from the responsible authority. The repository now includes .env.example, dependency files, quality gates, security guidance, release checks, conduct rules, support guidance, responsible-AI notes, and governance notes, but these do not replace official security, licensing, export, or operational release decisions.

Important Review Items

5706578.zip includes a LoveDA datasheet. Its mask values were mapped as: 2 -> building, 3 -> road, 4 -> water, 5 -> bare_open_ground, 6/7 -> vegetation, and 0/1 -> background.
archive_1__ea8ef46010 and archive_2__51be8d6b24 include RGB class dictionaries; these were used for mask conversion.
Object-centric COCO categories such as aircraft, ships, vehicles, and helicopters are standardised into additional segmentation classes so polygon data can be used for segmentation without forcing them into surface classes. Ambiguous classes remain documented in reports/class_mapping_resolved.csv.
All extracted source imagery now contributes to classification. Rows are labelled as derived_from_segmentation, existing_dataset, or all_source_fallback so weak or fallback labels are auditable.
Invalid or placeholder class names such as bad, unknown, scenes, train, valid, test, and folder-only image labels are mapped to unlabelled or excluded by the class-governance policy.
Dataset/annotation pairing now uses same-dataset-first resolution plus a constrained related-dataset fallback for split archives such as image-only and label-only bundles. Pairing decisions and ambiguities are written to reports/cleaning/pairing_manifest.csv and reports/segmentation_pairing_manifest.csv.
Dense object scenes can make pairwise topology expensive. The context stage computes object geometry for all samples and skips adjacency only when a scene exceeds context.max_adjacency_components_per_sample in the config.
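
The LoveDA mapping in the first review item can be expressed as a 256-entry lookup table applied to a flat 8-bit mask buffer. This is a minimal sketch: only "background = 0" is stated in this README, so the other taxonomy indices below are assumed to follow the listed class order, and build_lookup_table and remap_mask are hypothetical helper names.

```python
# Unified taxonomy indices. Only background = 0 is stated in this README;
# the remaining indices are ASSUMED to follow the listed class order.
TAXONOMY = {
    "background": 0,
    "building": 1,
    "road": 2,
    "runway": 3,
    "vegetation": 4,
    "water": 5,
    "bare_open_ground": 6,
}

# LoveDA source value -> unified class name, as listed in the review item.
LOVEDA_TO_CLASS = {
    0: "background", 1: "background",
    2: "building", 3: "road", 4: "water",
    5: "bare_open_ground", 6: "vegetation", 7: "vegetation",
}


def build_lookup_table(mapping, taxonomy, default="background") -> bytes:
    """Build a 256-entry table so every possible uint8 value has a target."""
    table = bytearray([taxonomy[default]] * 256)
    for source_value, class_name in mapping.items():
        table[source_value] = taxonomy[class_name]
    return bytes(table)


def remap_mask(mask_bytes: bytes, table: bytes) -> bytes:
    """Remap a flat uint8 mask buffer in one pass with bytes.translate."""
    return mask_bytes.translate(table)
```

With these assumed indices, remap_mask(bytes([2, 4, 7]), build_lookup_table(LOVEDA_TO_CLASS, TAXONOMY)) yields bytes([1, 5, 4]). Unmapped source values fall back to background silently in this sketch; a pipeline that logs and excludes ambiguous categories would record them instead.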
