fishman7337/ISR

AETHER ISR Data Preparation

This repository contains a reproducible, script-based data preparation pipeline for satellite-to-simulation research. The pipeline discovers source archives, extracts them safely, inventories dataset formats, performs conservative cleaning, harmonises labels into a shared taxonomy, prepares segmentation and classification-ready outputs, derives geometry/topology/context features, and generates QA reports.

Folder Structure

data/                     Canonical dataset estate.
data/raw/archives/        Immutable source ZIP/TAR inputs.
data/raw/direct_sources/  Canonical raw directory datasets that are not archive-based.
data/raw/                 Raw-source catalogues and manifests.
data/extracted/           Safe, traceable archive extractions.
data/interim/             Quarantine, intermediate manifests, and repair products.
data/processed/           Non-final processed copies where needed.
data/final/               Research-ready task datasets.
reports/                  Inventory, cleaning, feature, QA, and preview reports.
docs/                     Methodology, data card, and class mapping notes.
logs/                     Stage logs.
scripts/                  Reproducible Python pipeline scripts.
notebooks/                Audit notebooks for cleaning, preprocessing, and features.
configs/                  Central pipeline configuration.

Run Order

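Set up the environment (the activation command below is for Windows PowerShell; on Linux or macOS, use source .venv/bin/activate instead):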

python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip setuptools wheel
python -m pip install -r requirements-dev.txt

Run the full conservative pipeline:

python scripts/run_pipeline.py --config configs/pipeline.yaml

Migrate legacy root-level archives, old data_* folders, and recognised root-level direct-source dataset directories into the canonical data/ estate:

python scripts/migrate_data_layout.py --config configs/pipeline.yaml

Run the unit tests:

pytest

Run the standard project quality gate:

python scripts/quality_gate.py

Check whether the runtime environment (local/on-prem, RTX 4090, Colab, Google Drive) was detected correctly:

python scripts/check_runtime.py --config configs/pipeline.yaml

Validate the required research/industry folder structure:

python scripts/validate_project_structure.py --config configs/pipeline.yaml

Run the strict quality gate before release or controlled handoff:

python scripts/quality_gate.py --strict

Review the notebook audit companions:

jupyter notebook notebooks/01_data_cleaning.ipynb
jupyter notebook notebooks/02_preprocessing_dataset_preparation.ipynb
jupyter notebook notebooks/03_feature_engineering.ipynb

The notebooks call the same .py stage scripts and then read the generated reports; they are not a separate implementation.

As an alternative to the full pipeline, run individual stages in this order:

python scripts/extract_archives.py --config configs/pipeline.yaml
python scripts/inspect_datasets.py --config configs/pipeline.yaml
python scripts/clean_files.py --config configs/pipeline.yaml
python scripts/harmonise_labels.py --config configs/pipeline.yaml
python scripts/prepare_segmentation.py --config configs/pipeline.yaml
python scripts/prepare_classification.py --config configs/pipeline.yaml
python scripts/derive_context_features.py --config configs/pipeline.yaml
python scripts/prepare_contextual_prediction.py --config configs/pipeline.yaml
python scripts/feature_engineering.py --config configs/pipeline.yaml
python scripts/qa_checks.py --config configs/pipeline.yaml
python scripts/generate_reports.py --config configs/pipeline.yaml

The extraction and inventory stages should be reviewed before accepting any class harmonisation or standardisation decisions for a paper or thesis appendix.
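
The stage commands listed above can also be chained from Python so that a failing stage stops the run before later stages consume its output. The following is a minimal sketch, not a description of how scripts/run_pipeline.py is implemented; the helper names build_stage_commands and run_stages are hypothetical, and only the script names and the --config flag come from this README.

```python
import subprocess
import sys

# Stage script names in the order listed in this README.
STAGES = [
    "extract_archives",
    "inspect_datasets",
    "clean_files",
    "harmonise_labels",
    "prepare_segmentation",
    "prepare_classification",
    "derive_context_features",
    "prepare_contextual_prediction",
    "feature_engineering",
    "qa_checks",
    "generate_reports",
]


def build_stage_commands(config="configs/pipeline.yaml", stages=STAGES):
    """Return one argv list per stage (hypothetical helper)."""
    return [
        [sys.executable, f"scripts/{stage}.py", "--config", config]
        for stage in stages
    ]


def run_stages(commands):
    """Run each command in order; stop at the first non-zero exit code."""
    for command in commands:
        print("Running:", " ".join(command))
        # check=True raises subprocess.CalledProcessError on failure,
        # so later stages never run on top of a failed one.
        subprocess.run(command, check=True)
```

Calling run_stages(build_stage_commands()[:2]) from the repository root would run only extraction and inventory, which leaves room to review those outputs before the harmonisation stages start.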

Current Assumptions

  • Source archives are stored under data/raw/archives/ and treated as immutable raw inputs.
  • Non-archive direct-source datasets are stored under data/raw/direct_sources/ and treated as immutable raw inputs.
  • ZIP and TAR archives are supported, including nested archives and multipart .tar.001 sequences.
  • Existing train/validation/test splits are preserved when detected.
  • Missing splits are handled with deterministic, seed-controlled assignment.
  • The unified segmentation taxonomy is: building, road, runway, vegetation, water, and bare_open_ground, with background index 0.
  • Ambiguous or irrelevant small-object categories are logged and excluded unless explicitly mapped in configs/pipeline.yaml.
  • Image resizing is disabled by default. Originals remain preserved; final outputs are copies or derived masks.
  • The Google/Open Buildings direct-source dataset is treated as a vector-only building-footprint source. It is cleaned, paired, standardised, and carried into classification/context feature engineering without duplicating the raw 200+ GB source tree.
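
The split assumptions above (preserve detected splits, otherwise assign deterministically from a seed) can be illustrated with a seeded hash. This is a sketch under stated assumptions rather than the pipeline's actual method: the function name resolve_split, the default seed, and the 80/10/10 ratios are all hypothetical.

```python
import hashlib

# Hypothetical default ratios; the real values would come from configs/pipeline.yaml.
DEFAULT_RATIOS = (("train", 0.8), ("validation", 0.1), ("test", 0.1))


def resolve_split(sample_id, existing_split=None, seed=42, ratios=DEFAULT_RATIOS):
    """Keep a detected split; otherwise derive one from a seeded hash."""
    if existing_split is not None:
        return existing_split  # existing splits are preserved as-is

    # Hash the seed together with the sample ID so the result depends only
    # on these two values, never on file ordering or run-to-run state.
    digest = hashlib.sha256(f"{seed}:{sample_id}".encode("utf-8")).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1]

    cumulative = 0.0
    for name, share in ratios:
        cumulative += share
        if fraction < cumulative:
            return name
    return ratios[-1][0]  # guard against floating-point rounding at the top end
```

Because the assignment is a pure function of the seed and the sample ID, adding or removing other samples does not move an existing sample to a different split.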

Key Outputs

  • reports/inventory/dataset_inventory.csv
  • reports/inventory/dataset_inventory.md
  • reports/inventory/dataset_inventory.json
  • reports/inventory/direct_source_catalog.csv
  • reports/cleaning/cleaning_report.md
  • reports/cleaning/pairing_manifest.csv
  • reports/project_structure.md
  • reports/runtime_memory.csv
  • reports/stages/index.md
  • reports/stages/*.md
  • reports/stages/*.json
  • reports/features/object_geometry.csv
  • reports/features/scene_context.csv
  • reports/features/object_context_windows.csv
  • reports/features/contextual_relation_priors.csv
  • reports/google_open_buildings_shard_manifest.csv
  • reports/features/google_open_buildings_shard_features.csv
  • reports/features/google_open_buildings_geometry_sample.csv
  • reports/features/engineered_sample_features.csv
  • reports/features/engineered_classification_features.csv
  • reports/features/feature_dictionary.md
  • reports/qa/qa_report.md
  • docs/data_card.md
  • docs/methodology.md
  • docs/class_mapping.md
  • docs/feature_engineering.md
  • docs/contextual_prediction.md
  • docs/pairing_methodology.md
  • docs/notebook_methodology.md
  • docs/label_governance.md
  • docs/reproducibility.md
  • docs/environment_setup.md
  • docs/runtime_environment.md
  • docs/project_structure_standard.md
  • docs/memory_management.md
  • docs/quality_gates.md
  • docs/operations_runbook.md
  • docs/defence_research_governance.md
  • docs/research_ethics_and_responsible_ai.md
  • docs/documentation_standard.md
  • docs/review_and_approval.md
  • docs/branch_protection.md
  • docs/model_card_template.md
  • docs/release_checklist.md
  • docs/quality_assurance.md
  • docs/risk_register.md
  • docs/google_open_buildings_integration.md
  • CODE_OF_CONDUCT.md
  • GOVERNANCE.md
  • SUPPORT.md
  • MAINTAINERS.md
  • SECURITY.md

Current Run Summary

The pipeline discovered and inventoried 14 extracted dataset roots from the archives in this folder, including ZIP, TAR, nested ZIP, and multipart TAR inputs. Two large image ZIPs had malformed ZIP64 central-directory offsets; the extractor repaired the offset shift reproducibly and logged the retry provenance in reports/inventory/extraction_provenance_retry_bad_zip64.csv.

The archive-based counts below describe the last completed baseline run before the Google/Open Buildings direct-source integration. After rerunning the full pipeline, consult reports/final_pipeline_report.md, reports/google_open_buildings_shard_manifest.csv, and reports/features/google_open_buildings_shard_features.csv for the refreshed combined audit trail.

Prepared outputs from the last completed baseline run:

Output                                     Count
Final segmentation samples                 13,487
Classification rows                        56,096
Existing source classification rows        41,999
Segmentation-derived classification rows   13,487
All-source fallback classification rows    610
Context scenes                             13,487
Geometry objects/components                244,135
Topology adjacency edges                   65,816
Engineered sample feature rows             13,487
Engineered classification feature rows     56,096
QA preview images                          117

Final segmentation datasets were prepared from:

Dataset root                                                         Samples
5706578__6d6f2a1fca                                                  4,191
archive_5__cf3bd094bb                                                3,842
annotations1024__34ad76d450                                          2,492
archive_2__51be8d6b24                                                1,171
archive_1__ea8ef46010                                                803
satellite.v1i.coco-segmentation__849f18754f                          735
train_labels__3c0f552efa                                             220
airport_satellite_image_dataset.v1i.coco-segmentation__13691e9cc0    33

Segmentation preparation now uses raster masks, COCO polygons, XML oriented/horizontal boxes, vehicle TXT polygons, and xView GeoJSON pixel bounds. QA found no final image-mask alignment failures and no duplicate-image leakage across final segmentation splits. One unreadable macOS artifact image from train_images__feb124b8a4 was quarantined under data/interim/quarantine/.
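
A duplicate-image leakage check of the kind mentioned above can be approximated by hashing file contents and looking for digests that occur in more than one split. This README does not state how the QA stage detects duplicates, so the following is only a sketch: the function names are hypothetical, and it catches byte-identical files only, not re-encoded or resized copies.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_cross_split_duplicates(
    split_dirs: Dict[str, Path],
) -> Dict[str, Dict[str, List[Path]]]:
    """Map digest -> {split: [paths]} for content present in more than one split."""
    seen = defaultdict(lambda: defaultdict(list))
    for split, root in split_dirs.items():
        for path in sorted(root.rglob("*")):
            if path.is_file():
                seen[file_digest(path)][split].append(path)
    # Keep only digests that appear under at least two different splits.
    return {
        digest: dict(by_split)
        for digest, by_split in seen.items()
        if len(by_split) > 1
    }
```

An empty result means no byte-identical file is shared between the given split directories; any non-empty entry lists the offending paths per split for review.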

Every pipeline stage also writes a paired Markdown and JSON audit report under reports/stages/, including the final standardised segmentation taxonomy and classification label set used by the run.

The repository now also supports the Google/Open Buildings direct-source building-footprint corpus under data/raw/direct_sources/google_open_buildings/. Because that dataset is vector-only and very large, the pipeline keeps the raw files in place, creates a deterministic paired shard manifest, adds one classification-compatible row per paired shard, and engineers sampled point-density plus polygon-geometry probe features for audit and modelling.

Research Team

  • ME5-1 (Dr.) Lim Yong Zhi, Research Supervisor
  • Chua Jia Jun (NSF)
  • Goh Kun Ming (AI Research Intern)

Defence Research Handling

This project is treated as controlled defence research for AETHER (Air Emerging Technology High-Speed Experimentation and Research), under RAiD (RSAF Agile innovation Digital), under RSAF, SAF, and MINDEF. Follow official classification and handling guidance from the responsible authority. The repository now includes .env.example, dependency files, quality gates, security guidance, release checks, conduct rules, support guidance, responsible-AI notes, and governance notes, but these do not replace official security, licensing, export, or operational release decisions.

Important Review Items

  • 5706578.zip includes a LoveDA datasheet. Its mask values were mapped as: 2 -> building, 3 -> road, 4 -> water, 5 -> bare_open_ground, 6/7 -> vegetation, and 0/1 -> background.
  • archive_1__ea8ef46010 and archive_2__51be8d6b24 include RGB class dictionaries; these were used for mask conversion.
  • Object-centric COCO categories such as aircraft, ships, vehicles, and helicopters are standardised into additional segmentation classes so polygon data can be used for segmentation without forcing them into surface classes. Ambiguous classes remain documented in reports/class_mapping_resolved.csv.
  • All extracted source imagery now contributes to classification. Rows are labelled as derived_from_segmentation, existing_dataset, or all_source_fallback so weak or fallback labels are auditable.
  • Invalid or placeholder class names such as bad, unknown, scenes, train, valid, test, and folder-only image labels are mapped to unlabelled or excluded by the class-governance policy.
  • Dataset/annotation pairing now uses same-dataset-first resolution plus a constrained related-dataset fallback for split archives such as image-only and label-only bundles. Pairing decisions and ambiguities are written to reports/cleaning/pairing_manifest.csv and reports/segmentation_pairing_manifest.csv.
  • Dense object scenes can make pairwise topology expensive. The context stage computes object geometry for all samples and skips adjacency only when a scene exceeds context.max_adjacency_components_per_sample in the config.
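
The LoveDA mask mapping in the first review item can be expressed as a 256-entry lookup table applied to the raw mask bytes. This is an illustrative sketch, not the repository's conversion code. Only background = 0 is stated in this README; the remaining unified class indices below follow the taxonomy listing order and are an assumption, as are the function names.

```python
# background = 0 comes from this README; the other indices are ASSUMED
# to follow the order in which the taxonomy is listed.
UNIFIED_INDEX = {
    "background": 0,
    "building": 1,
    "road": 2,
    "runway": 3,
    "vegetation": 4,
    "water": 5,
    "bare_open_ground": 6,
}

# LoveDA mask value -> unified class name, as given in the review items.
LOVEDA_TO_UNIFIED = {
    0: "background",
    1: "background",
    2: "building",
    3: "road",
    4: "water",
    5: "bare_open_ground",
    6: "vegetation",
    7: "vegetation",
}


def build_lookup_table(mapping=LOVEDA_TO_UNIFIED):
    """Build a 256-byte translation table for 8-bit mask values."""
    table = bytearray(256)
    for source_value, class_name in mapping.items():
        table[source_value] = UNIFIED_INDEX[class_name]
    return bytes(table)


def remap_mask(mask_bytes, mapping=LOVEDA_TO_UNIFIED):
    """Remap raw 8-bit mask pixels; reject values missing from the mapping."""
    unexpected = set(mask_bytes) - set(mapping)
    if unexpected:
        # Fail loudly instead of silently folding unknown labels into background.
        raise ValueError(f"Unmapped mask values: {sorted(unexpected)}")
    return mask_bytes.translate(build_lookup_table(mapping))
```

For example, remap_mask(bytes([0, 1, 2, 3, 4, 5, 6, 7])) returns bytes([0, 0, 1, 2, 5, 6, 4, 4]) under the assumed indices. Raising on unmapped values keeps the conversion conservative: an unexpected label surfaces as an error for review rather than as extra background pixels.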

About

Research-grade satellite imagery pipeline for ISR dataset preparation, geometry/topology feature engineering, preliminary model screening, and W&B-tracked training orchestration.
