This repository contains a reproducible, script-based data preparation pipeline for satellite-to-simulation research. The pipeline discovers source archives, extracts them safely, inventories dataset formats, performs conservative cleaning, harmonises labels into a shared taxonomy, prepares segmentation and classification-ready outputs, derives geometry/topology/context features, and generates QA reports.
data/ Canonical dataset estate.
data/raw/archives/ Immutable source ZIP/TAR inputs.
data/raw/direct_sources/ Canonical raw directory datasets that are not archive-based.
data/raw/ Raw-source catalogues and manifests.
data/extracted/ Safe, traceable archive extractions.
data/interim/ Quarantine, intermediate manifests, and repair products.
data/processed/ Non-final processed copies where needed.
data/final/ Research-ready task datasets.
reports/ Inventory, cleaning, feature, QA, and preview reports.
docs/ Methodology, data card, and class mapping notes.
logs/ Stage logs.
scripts/ Reproducible Python pipeline scripts.
notebooks/ Audit notebooks for cleaning, preprocessing, and features.
configs/ Central pipeline configuration.
Set up the environment:
python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip setuptools wheel
python -m pip install -r requirements-dev.txt

Run the full conservative pipeline:
python scripts/run_pipeline.py --config configs/pipeline.yaml

Tidy legacy root-level archives, old data_* folders, and recognised root-level direct-source dataset directories into the canonical data/ estate:
python scripts/migrate_data_layout.py --config configs/pipeline.yaml

Run the unit tests:
pytest

Run the standard project quality gate:
python scripts/quality_gate.py

Check whether the project detected local/on-prem, RTX 4090, Colab, and Google Drive correctly:
python scripts/check_runtime.py --config configs/pipeline.yaml

Validate the required research/industry folder structure:
python scripts/validate_project_structure.py --config configs/pipeline.yaml

Before release or controlled handoff:
python scripts/quality_gate.py --strict

Review the notebook audit companions:
jupyter notebook notebooks/01_data_cleaning.ipynb
jupyter notebook notebooks/02_preprocessing_dataset_preparation.ipynb
jupyter notebook notebooks/03_feature_engineering.ipynb

The notebooks call the same .py stage scripts and then read the generated reports; they are not a separate implementation.

To run individual stages instead of the full pipeline:
python scripts/extract_archives.py --config configs/pipeline.yaml
python scripts/inspect_datasets.py --config configs/pipeline.yaml
python scripts/clean_files.py --config configs/pipeline.yaml
python scripts/harmonise_labels.py --config configs/pipeline.yaml
python scripts/prepare_segmentation.py --config configs/pipeline.yaml
python scripts/prepare_classification.py --config configs/pipeline.yaml
python scripts/derive_context_features.py --config configs/pipeline.yaml
python scripts/prepare_contextual_prediction.py --config configs/pipeline.yaml
python scripts/feature_engineering.py --config configs/pipeline.yaml
python scripts/qa_checks.py --config configs/pipeline.yaml
python scripts/generate_reports.py --config configs/pipeline.yaml

The extraction and inventory stages should be reviewed before accepting any class harmonisation or standardisation decisions for a paper or thesis appendix.
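
The stage commands listed above can also be driven from a small script, for example to rerun everything from one stage onward after reviewing the extraction and inventory outputs. The following is a minimal sketch, not part of the repository: `build_stage_commands` and `run_from` are hypothetical helpers, and the sketch assumes the listed order is also the dependency order.

```python
import subprocess
import sys

# Stage scripts in the order they are listed in this README.
STAGES = [
    "extract_archives",
    "inspect_datasets",
    "clean_files",
    "harmonise_labels",
    "prepare_segmentation",
    "prepare_classification",
    "derive_context_features",
    "prepare_contextual_prediction",
    "feature_engineering",
    "qa_checks",
    "generate_reports",
]


def build_stage_commands(start_from, config="configs/pipeline.yaml"):
    """Return one command per stage, beginning at `start_from`."""
    if start_from not in STAGES:
        raise ValueError(f"unknown stage: {start_from}")
    remaining = STAGES[STAGES.index(start_from):]
    return [
        [sys.executable, f"scripts/{name}.py", "--config", config]
        for name in remaining
    ]


def run_from(start_from, config="configs/pipeline.yaml"):
    """Run the remaining stages in order, stopping at the first failure."""
    for command in build_stage_commands(start_from, config):
        print("running:", " ".join(command))
        # check=True raises CalledProcessError if a stage exits non-zero.
        subprocess.run(command, check=True)
```

For example, `run_from("harmonise_labels")` would rerun label harmonisation and every later stage while leaving the extraction, inventory, and cleaning outputs untouched.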
- Source archives are stored under data/raw/archives/ and treated as immutable raw inputs.
- Non-archive direct-source datasets are stored under data/raw/direct_sources/ and treated as immutable raw inputs.
- ZIP and TAR archives are supported, including nested archives and multipart .tar.001 sequences.
- Existing train/validation/test splits are preserved when detected.
- Missing splits are handled with deterministic, seed-controlled assignment.
- The unified segmentation taxonomy is: building, road, runway, vegetation, water, and bare_open_ground, with background index 0.
- Ambiguous or irrelevant small-object categories are logged and excluded unless explicitly mapped in configs/pipeline.yaml.
- Image resizing is disabled by default. Originals remain preserved; final outputs are copies or derived masks.
- The Google/Open Buildings direct-source dataset is treated as a vector-only building-footprint source. It is cleaned, paired, standardised, and carried into classification/context feature engineering without duplicating the raw 200+ GB source tree.
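
One way to get the deterministic, seed-controlled split assignment mentioned in the list above is to hash each sample identifier together with the seed, so the result does not depend on file order or on which other samples are present. This is a sketch under assumptions: the function name `assign_split`, the seed value, and the 80/10/10 ratios are illustrative and are not taken from configs/pipeline.yaml, and the pipeline itself may use a different scheme.

```python
import hashlib

# Illustrative ratios; the real values would come from the pipeline config.
DEFAULT_RATIOS = (("train", 0.8), ("validation", 0.1), ("test", 0.1))


def assign_split(sample_id, seed=42, ratios=DEFAULT_RATIOS):
    """Map a sample ID to a split using a seeded hash."""
    digest = hashlib.sha256(f"{seed}:{sample_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for name, share in ratios:
        cumulative += share
        if fraction < cumulative:
            return name
    # Guards against floating-point rounding at the upper edge.
    return ratios[-1][0]
```

Because the assignment depends only on the seed and the identifier, rerunning the pipeline or adding new samples never moves an existing sample to a different split.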

Key reports and documentation:
- reports/inventory/dataset_inventory.csv
- reports/inventory/dataset_inventory.md
- reports/inventory/dataset_inventory.json
- reports/inventory/direct_source_catalog.csv
- reports/cleaning/cleaning_report.md
- reports/cleaning/pairing_manifest.csv
- reports/project_structure.md
- reports/runtime_memory.csv
- reports/stages/index.md
- reports/stages/*.md
- reports/stages/*.json
- reports/features/object_geometry.csv
- reports/features/scene_context.csv
- reports/features/object_context_windows.csv
- reports/features/contextual_relation_priors.csv
- reports/google_open_buildings_shard_manifest.csv
- reports/features/google_open_buildings_shard_features.csv
- reports/features/google_open_buildings_geometry_sample.csv
- reports/features/engineered_sample_features.csv
- reports/features/engineered_classification_features.csv
- reports/features/feature_dictionary.md
- reports/qa/qa_report.md
- docs/data_card.md
- docs/methodology.md
- docs/class_mapping.md
- docs/feature_engineering.md
- docs/contextual_prediction.md
- docs/pairing_methodology.md
- docs/notebook_methodology.md
- docs/label_governance.md
- docs/reproducibility.md
- docs/environment_setup.md
- docs/runtime_environment.md
- docs/project_structure_standard.md
- docs/memory_management.md
- docs/quality_gates.md
- docs/operations_runbook.md
- docs/defence_research_governance.md
- docs/research_ethics_and_responsible_ai.md
- docs/documentation_standard.md
- docs/review_and_approval.md
- docs/branch_protection.md
- docs/model_card_template.md
- docs/release_checklist.md
- docs/quality_assurance.md
- docs/risk_register.md
- docs/google_open_buildings_integration.md
- CODE_OF_CONDUCT.md
- GOVERNANCE.md
- SUPPORT.md
- MAINTAINERS.md
- SECURITY.md

The pipeline discovered and inventoried 14 extracted dataset roots from the
archives in this folder, including ZIP, TAR, nested ZIP, and multipart TAR
inputs. Two large image ZIPs had malformed ZIP64 central-directory offsets; the
extractor repaired the offset shift reproducibly and logged the retry provenance
in reports/inventory/extraction_provenance_retry_bad_zip64.csv.
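
To illustrate what safe extraction can mean for the ZIP inputs described above, the sketch below refuses any archive member whose resolved path would fall outside the destination directory, and returns the member list so it can be logged for provenance. It is an assumption-laden example rather than the code in scripts/extract_archives.py: `safe_extract_zip` is a hypothetical name, it needs Python 3.9+ for `Path.is_relative_to`, and it does not cover TAR, nested, or multipart archives or the ZIP64 offset repair.

```python
import zipfile
from pathlib import Path


def safe_extract_zip(archive_path, dest_dir):
    """Extract a ZIP, refusing any member that would land outside dest_dir."""
    dest = Path(dest_dir).resolve()
    extracted = []
    with zipfile.ZipFile(archive_path) as archive:
        for member in archive.infolist():
            target = (dest / member.filename).resolve()
            # Reject absolute paths and ".." traversal instead of rewriting them.
            if not target.is_relative_to(dest):
                raise ValueError(f"unsafe member path: {member.filename}")
            archive.extract(member, dest)
            extracted.append(member.filename)
    return extracted
```

Raising on a suspicious member, rather than silently sanitising its path, keeps the extraction conservative: the archive is either extracted exactly as recorded or flagged for review.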
The archive-based counts below describe the last completed baseline run before
the Google/Open Buildings direct-source integration. After rerunning the full
pipeline, consult reports/final_pipeline_report.md,
reports/google_open_buildings_shard_manifest.csv, and
reports/features/google_open_buildings_shard_features.csv for the refreshed
combined audit trail.
Prepared outputs from the baseline run:
| Output | Count |
|---|---|
| Final segmentation samples | 13,487 |
| Classification rows | 56,096 |
| Existing source classification rows | 41,999 |
| Segmentation-derived classification rows | 13,487 |
| All-source fallback classification rows | 610 |
| Context scenes | 13,487 |
| Geometry objects/components | 244,135 |
| Topology adjacency edges | 65,816 |
| Engineered sample feature rows | 13,487 |
| Engineered classification feature rows | 56,096 |
| QA preview images | 117 |
Final segmentation datasets were prepared from:
| Dataset root | Samples |
|---|---|
| 5706578__6d6f2a1fca | 4,191 |
| archive_5__cf3bd094bb | 3,842 |
| annotations1024__34ad76d450 | 2,492 |
| archive_2__51be8d6b24 | 1,171 |
| archive_1__ea8ef46010 | 803 |
| satellite.v1i.coco-segmentation__849f18754f | 735 |
| train_labels__3c0f552efa | 220 |
| airport_satellite_image_dataset.v1i.coco-segmentation__13691e9cc0 | 33 |
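
The counts in the two tables above are internally consistent, which can be confirmed with a few lines of arithmetic. The snippet below only restates numbers from this README; the variable names are illustrative.

```python
# Classification rows by source, copied from the outputs table.
classification_rows = {
    "Existing source classification rows": 41_999,
    "Segmentation-derived classification rows": 13_487,
    "All-source fallback classification rows": 610,
}

# Final segmentation samples per dataset root, copied from the table above.
segmentation_samples = {
    "5706578__6d6f2a1fca": 4_191,
    "archive_5__cf3bd094bb": 3_842,
    "annotations1024__34ad76d450": 2_492,
    "archive_2__51be8d6b24": 1_171,
    "archive_1__ea8ef46010": 803,
    "satellite.v1i.coco-segmentation__849f18754f": 735,
    "train_labels__3c0f552efa": 220,
    "airport_satellite_image_dataset.v1i.coco-segmentation__13691e9cc0": 33,
}

classification_total = sum(classification_rows.values())
segmentation_total = sum(segmentation_samples.values())

# The three classification sources add up to the reported 56,096 rows.
assert classification_total == 56_096
# The eight dataset roots add up to the reported 13,487 segmentation samples.
assert segmentation_total == 13_487

print(classification_total, segmentation_total)  # prints: 56096 13487
```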
Segmentation preparation now uses raster masks, COCO polygons, XML
oriented/horizontal boxes, vehicle TXT polygons, and xView GeoJSON pixel bounds.
QA found no final image-mask alignment failures and no duplicate-image leakage
across final segmentation splits. One unreadable macOS artifact image from
train_images__feb124b8a4 was quarantined under data/interim/quarantine/.
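
A duplicate-image leakage check of the kind reported above can be approximated by hashing file contents and looking for the same digest in more than one split. This sketch assumes exact byte-level duplicates and a hypothetical one-directory-per-split layout; the QA stage in scripts/qa_checks.py may use a different definition of duplicate (for example, perceptual hashing).

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_cross_split_duplicates(split_dirs):
    """Return {sha256: {split: [paths]}} for content found in more than one split."""
    seen = defaultdict(lambda: defaultdict(list))
    for split, directory in split_dirs.items():
        for path in sorted(Path(directory).rglob("*")):
            if path.is_file():
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
                seen[digest][split].append(str(path))
    # Keep only content that appears under at least two different splits.
    return {
        digest: dict(by_split)
        for digest, by_split in seen.items()
        if len(by_split) > 1
    }
```

An empty result corresponds to the "no duplicate-image leakage" outcome; any entry lists exactly which files collide and in which splits.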
Every pipeline stage also writes a paired Markdown and JSON audit report under
reports/stages/, including the final standardised segmentation taxonomy and
classification label set used by the run.
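
The paired Markdown and JSON stage reports described above can be produced from a single summary dictionary so the two files never disagree. The following is a minimal sketch: `write_stage_report` and the two-column table layout are assumptions made for illustration, not the format the pipeline scripts actually emit.

```python
import json
from pathlib import Path


def write_stage_report(stage, summary, out_dir="reports/stages"):
    """Write <stage>.json and <stage>.md from the same summary dict."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    json_path = out / f"{stage}.json"
    md_path = out / f"{stage}.md"

    # Machine-readable copy, with sorted keys for stable diffs.
    json_path.write_text(
        json.dumps(summary, indent=2, sort_keys=True) + "\n", encoding="utf-8"
    )

    # Human-readable copy built from exactly the same fields.
    lines = [f"# Stage report: {stage}", "", "| Field | Value |", "|---|---|"]
    lines += [f"| {key} | {summary[key]} |" for key in sorted(summary)]
    md_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return md_path, json_path
```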
The repository now also supports the Google/Open Buildings direct-source
building-footprint corpus under data/raw/direct_sources/google_open_buildings/.
Because that dataset is vector-only and very large, the pipeline keeps the raw
files in place, creates a deterministic paired shard manifest, adds one
classification-compatible row per paired shard, and engineers sampled
point-density plus polygon-geometry probe features for audit and modelling.
- ME5-1 (Dr.) Lim Yong Zhi, Research Supervisor
- Chua Jia Jun (NSF)
- Goh Kun Ming (AI Research Intern)
This project is treated as controlled defence research for AETHER (Air Emerging
Technology High-Speed Experimentation and Research), under RAiD (RSAF Agile
innovation Digital), under RSAF, SAF, and MINDEF. Follow official classification
and handling guidance from the responsible authority. The repository now includes
.env.example, dependency files, quality gates, security guidance, release
checks, conduct rules, support guidance, responsible-AI notes, and governance
notes, but these do not replace official security,
licensing, export, or operational release decisions.
- 5706578.zip includes a LoveDA datasheet. Its mask values were mapped as: 2 -> building, 3 -> road, 4 -> water, 5 -> bare_open_ground, 6/7 -> vegetation, and 0/1 -> background.
- archive_1__ea8ef46010 and archive_2__51be8d6b24 include RGB class dictionaries; these were used for mask conversion.
- Object-centric COCO categories such as aircraft, ships, vehicles, and helicopters are standardised into additional segmentation classes so polygon data can be used for segmentation without forcing them into surface classes. Ambiguous classes remain documented in reports/class_mapping_resolved.csv.
- All extracted source imagery now contributes to classification. Rows are labelled as derived_from_segmentation, existing_dataset, or all_source_fallback so weak or fallback labels are auditable.
- Invalid or placeholder class names such as bad, unknown, scenes, train, valid, test, and folder-only image labels are mapped to unlabelled or excluded by the class-governance policy.
- Dataset/annotation pairing now uses same-dataset-first resolution plus a constrained related-dataset fallback for split archives such as image-only and label-only bundles. Pairing decisions and ambiguities are written to reports/cleaning/pairing_manifest.csv and reports/segmentation_pairing_manifest.csv.
- Dense object scenes can make pairwise topology expensive. The context stage computes object geometry for all samples and skips adjacency only when a scene exceeds context.max_adjacency_components_per_sample in the config.