This directory contains the self-contained demo and inference package for the Reliability-Aware Trimodal Disaster Severity Assessment project. It provides all model architectures, locked configurations, final checkpoints, and aligned test dataset features required to run the GUI demo on another machine.
- Final Fusion B2 Model: The promoted calibrated reliability-weighted fusion model (`checkpoints/fusion_b2_best_model.pt`).
- Branch Checkpoints: Best checkpoints for the Satellite (SiameseUNet), Social Text (BiLSTM & RoBERTa), and Social Image (ResNet34 & EfficientNet-B2) branches.
- Aligned Precomputed Test Data ($N=368$): Extracted predictions, probabilities, reliability scores, and embeddings for all branches, enabling instant Quick Mode execution.
- Quick Inference Script: `demo_inference.py` for evaluating samples, listing test cases, and explaining B2 gating decisions.
- Package Verification Script: `verify_demo_package.py` to ensure package integrity, sizes, shapes, and checkpoints.
Our trimodal framework integrates satellite imagery and social media streams into a unified severity classifier:
Pre/Post Satellite Pair ────→ [SiameseUNet] ───→ Probabilities & Reliability ┐
│
Social Text Tweet Text ─────→ [BiLSTM] ────────→ Probabilities & Reliability ├─→ [Fusion B2] ─→ Final Severity
│ (Gated Weighting)
Social Media Image ─────────→ [EfficientNet] ──→ Probabilities & Reliability ┘
- Fusion B2 (Calibrated Reliability-Aware Fusion) projects heterogeneous branch outputs into a unified fusion space.
- Branches are dynamically weighted using their computed Reliability Scores, adjusted by a temperature parameter ($\tau=0.5$) and a focal loss calibration layer, to concentrate weight on more reliable branches.
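The gating idea can be illustrated with a minimal sketch. It assumes that the temperature is applied as a softmax over the three reliability scores and that branch probabilities are then mixed linearly. The actual B2 model in `fusion_model.py` also projects branch outputs into a learned fusion space and may work differently. The function name `reliability_gate` and all example values are hypothetical.

```python
import numpy as np

def reliability_gate(reliability, tau=0.5):
    """Softmax over reliability scores with temperature tau (assumed form)."""
    r = np.asarray(reliability, dtype=float) / tau
    r = r - r.max()  # subtract the max for numerical stability
    w = np.exp(r)
    return w / w.sum()

# Hypothetical reliability scores for the satellite, text, and image branches.
reliability = [0.30, 0.90, 0.60]
weights = reliability_gate(reliability, tau=0.5)

# Hypothetical per-branch class probabilities (None, Minor, Severe).
branch_probs = np.array([
    [0.50, 0.30, 0.20],  # satellite
    [0.10, 0.20, 0.70],  # text
    [0.20, 0.30, 0.50],  # image
])

# Reliability-weighted mixture of the branch probabilities.
fused = weights @ branch_probs
print(weights.round(3), fused.round(3))
```

Under this assumed form, a temperature below 1 sharpens the softmax relative to $\tau=1$, so the most reliable branch receives a larger share of the weight.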
- xBD (Satellite): Overlapping pre- and post-disaster RGB tile pairs for damage segmentation.
- HumAID (Social Text): Crisis-related tweets categorized by humanitarian tasks.
- CrisisMMD (Social Image + Text): Geotagged tweets containing text-image pairs with severity annotations.
Note
Why Harvey and Mexico Only? The trimodal alignment is constrained to Hurricane Harvey and the Mexico Earthquake because these are the only overlapping events represented across all three disjoint source datasets (xBD, HumAID, and CrisisMMD).
The official locked evaluation set contains 368 aligned samples:
- Hurricane Harvey: 309 samples
- Mexico Earthquake: 59 samples
- Sample IDs are listed in `config/eval_sample_ids.txt` (MD5: `1f6c42d86c5630841a35c5ccc741a079` when LF-normalized).
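The checksum can be recomputed with a short script. This is a sketch that assumes "LF normalized" means converting CRLF line endings to LF before hashing; the helper names are hypothetical, and the package's own verification script may normalize differently.

```python
import hashlib
from pathlib import Path

# MD5 value quoted in this README for the LF-normalized ID file.
EXPECTED_MD5 = "1f6c42d86c5630841a35c5ccc741a079"

def normalize_lf(data: bytes) -> bytes:
    """Convert Windows (CRLF) line endings to Unix (LF)."""
    return data.replace(b"\r\n", b"\n")

def md5_lf_normalized(path) -> str:
    """Return the MD5 hex digest of a file after CRLF-to-LF conversion."""
    return hashlib.md5(normalize_lf(Path(path).read_bytes())).hexdigest()

if __name__ == "__main__":
    ids_file = Path("config/eval_sample_ids.txt")
    if ids_file.exists():
        digest = md5_lf_normalized(ids_file)
        print("match" if digest == EXPECTED_MD5 else f"mismatch: {digest}")
```

Normalizing before hashing keeps the digest stable when the file is checked out on Windows with automatic line-ending conversion.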
The B2 model was promoted based on the following Set B canonical evaluation metrics:
| Metric | Canonical Value (Set B) | Legacy Value (Set A)* | Status vs Promotion Gate |
|---|---|---|---|
| Accuracy | 0.8179 | 0.7989 | ✅ PASSED |
| Macro-F1 | 0.7678 | 0.7516 | ✅ PASSED (> 0.7403) |
| Weighted-F1 | 0.8153 | 0.7963 | ✅ PASSED |
| F1-None | 0.7191 | 0.6974 | ✅ PASSED |
| F1-Minor | 0.7039 | 0.6779 | ✅ PASSED |
| F1-Severe | 0.8803 | 0.8647 | ✅ PASSED |
| Harvey Macro-F1 | 0.7755 | 0.7533 | ✅ PASSED |
| Mexico Macro-F1 | 0.7096 | 0.7096 | ✅ PASSED (> 0.50) |
Warning
Canonical Results Warning: Always use the Set B metrics for publication and demo validation. The legacy training-loop metrics (Set A: Accuracy=0.7989, Macro-F1=0.7516) are obsolete.
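As a quick arithmetic check, assuming Macro-F1 is the standard unweighted mean of the three per-class F1 scores (None, Minor, Severe), the Set B per-class values in the table reproduce the reported Macro-F1 of 0.7678 after rounding:

```python
# Per-class F1 scores for Set B, copied from the metrics table.
f1_per_class = {"None": 0.7191, "Minor": 0.7039, "Severe": 0.8803}

# Macro-F1 is the unweighted mean of the per-class F1 scores.
macro_f1 = sum(f1_per_class.values()) / len(f1_per_class)
print(round(macro_f1, 4))

# Promotion gate threshold for Macro-F1, as quoted in the table.
MACRO_F1_GATE = 0.7403
print(macro_f1 > MACRO_F1_GATE)
```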
Branch reliability scores are derived directly from model confidence and annotation quality:
- Satellite Branch Reliability: $$\text{Reliability} = \text{mean\_confidence} \times (1 - \text{entropy\_norm})$$ where $\text{entropy\_norm}$ is the Shannon entropy of the pixel classification maps normalized by $\log_2(5)$.
- Social Text Reliability: $$\text{Reliability} = 0.5 \times \text{model\_confidence} + 0.5$$
- Social Image Reliability: $$\text{Reliability} = 0.4 \times \text{annotation\_confidence} + 0.4 \times \text{model\_confidence} + 0.2$$
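The three formulas translate directly into code. The sketch below is illustrative only: the function names are hypothetical, and computing `entropy_norm` as the mean per-pixel entropy over a `(5, H, W)` probability map is an assumption, since the aggregation over pixels is not specified here.

```python
import numpy as np

def normalized_entropy(probs, num_classes=5):
    """Mean per-pixel Shannon entropy (bits) divided by log2(num_classes).

    probs: array of shape (num_classes, H, W) whose class axis sums to 1.
    Averaging over pixels is an assumption of this sketch.
    """
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    entropy = -(p * np.log2(p)).sum(axis=0)  # per-pixel entropy in bits
    return float(entropy.mean() / np.log2(num_classes))

def satellite_reliability(mean_confidence, entropy_norm):
    return mean_confidence * (1.0 - entropy_norm)

def text_reliability(model_confidence):
    return 0.5 * model_confidence + 0.5

def image_reliability(annotation_confidence, model_confidence):
    return 0.4 * annotation_confidence + 0.4 * model_confidence + 0.2

# A uniform map over 5 classes is maximally uncertain.
uniform = np.full((5, 4, 4), 0.2)
print(normalized_entropy(uniform))
print(satellite_reliability(0.9, 0.25))
print(text_reliability(0.8))
print(image_reliability(1.0, 0.5))
```

For confidences in $[0, 1]$, the text reliability lies in $[0.5, 1]$ and the image reliability in $[0.2, 1]$, so neither social branch can be assigned a reliability of zero by these formulas.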
pip install -r requirements_demo.txt
Run the verification suite to ensure all dimensions, hashes, and checkpoints are intact:
python verify_demo_package.py
- List samples in the test set:
python demo_inference.py --list-samples
- Inference by Index (0 to 367):
python demo_inference.py --index 0
- Inference by Sample ID:
python demo_inference.py --sample-id 905930890735439873
- Filter by Event:
python demo_inference.py --event hurricane_harvey --limit 5
To run the PyTorch forward pass on precomputed embeddings:
python demo_inference.py --run-fusion --index 0
For wrapping this package in a GUI frontend (e.g., Streamlit, Gradio, or Electron):
- Default Examples: Query `test_data/demo_sample_index.json` to populate a select box with 6 hand-picked showcase samples representing Harvey and Mexico (covering successes, failures, and low-reliability edge cases).
- Table Views: Load `test_data/aligned_fusion_test_368.csv` to display the overall performance grid.
- Visualization Dashboard: Plot the branch gating weights (`fusion_weight_sat`, `fusion_weight_text`, `fusion_weight_image`) as a pie chart or horizontal bar chart to illustrate the dynamic trimodal attention.
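A frontend needs the three gating weights per sample before it can plot them. The sketch below uses only the standard library and renders a text bar chart as a stand-in for a real chart widget. It assumes the weight columns are stored in `test_data/aligned_fusion_test_368.csv`, which is not confirmed here; the helper names and the inline sample row are hypothetical.

```python
import csv
import io

WEIGHT_COLUMNS = ("fusion_weight_sat", "fusion_weight_text", "fusion_weight_image")

def gating_weights(row):
    """Extract the three branch gating weights from one CSV row (a dict)."""
    return {name: float(row[name]) for name in WEIGHT_COLUMNS}

def text_bar_chart(weights, width=40):
    """Render weights as horizontal text bars, scaled to the largest weight."""
    largest = max(weights.values()) or 1.0
    lines = []
    for name, value in weights.items():
        bar = "#" * round(width * value / largest)
        lines.append(f"{name:<20} {bar} {value:.3f}")
    return "\n".join(lines)

# Hypothetical one-row stand-in for the real CSV, which may hold more columns.
sample_csv = io.StringIO(
    "fusion_weight_sat,fusion_weight_text,fusion_weight_image\n"
    "0.10,0.60,0.30\n"
)
rows = list(csv.DictReader(sample_csv))
weights = gating_weights(rows[0])
print(text_bar_chart(weights))
```

In a Streamlit or Gradio app, the dictionary returned by `gating_weights` could be passed to the framework's own bar or pie chart component instead of `text_bar_chart`.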
demo_model/
├── README.md # Presentation readme
├── requirements_demo.txt # Demo requirements
├── fusion_model.py # Fusion B2 model definition
├── fusion_dataset.py # Aligned dataset loader
├── demo_inference.py # CLI inference script
├── verify_demo_package.py # Integrity verification script
├── social_text_baseline_b.py # Text BiLSTM architecture
├── social_text_baseline_c.py # Text RoBERTa architecture
├── social_image_baseline_a.py # Image ResNet34 architecture
├── social_image_baseline_bc.py # Image EfficientNet architecture
├── models/ # Satellite models
│ ├── siamese_unet.py
│ ├── satellite_unet_baseline.py
│ └── changeformer.py
├── config/ # Locked config and ID files
│ ├── test_ids.json
│ ├── taxonomy_map.json
│ ├── reliability_weights.json
│ ├── blacklist_unique.json
│ ├── satellite_branch_locked.json
│ ├── branch_file_hashes.json
│ ├── eval_sample_ids.txt
│ └── eval_sample_ids_hash.txt
├── checkpoints/ # Best PyTorch model weight checkpoints
│ ├── satellite_siamese_best_model.pt
│ ├── text_bilstm_best_model.pt
│ ├── text_roberta_best_model.pt
│ ├── image_resnet34_best_model.pt
│ ├── image_efficientnet_best_model.pt
│ └── fusion_b2_best_model.pt
├── test_data/ # Precomputed aligned test features (N=368)
│ ├── aligned_fusion_test_368.csv
│ ├── aligned_metadata_368.json
│ ├── demo_sample_index.json
│ ├── sat_embeddings_368.npy
│ ├── sat_predictions_368.csv
│ ├── text_bilstm_embeddings_368.npy
│ ├── text_bilstm_predictions_368.csv
│ ├── text_roberta_embeddings_368.npy
│ ├── text_roberta_predictions_368.csv
│ ├── image_resnet34_embeddings_368.npy
│ ├── image_resnet34_predictions_368.csv
│ ├── image_effnet_embeddings_368.npy
│ ├── image_effnet_predictions_368.csv
│ ├── labels_368.npy
│ └── sample_ids_368.txt
├── raw_branch_exports/ # Complete branch outputs for auditability
└── logs/ # Verification logs
    ├── file_hashes.json
    ├── missing_files.json
    └── verification_report.md
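Before launching a demo on a new machine, a quick presence check on the checkpoint files listed in the tree can catch incomplete copies. This is a minimal sketch, not the logic of `verify_demo_package.py`; the function name `find_missing` is hypothetical, and only file existence is checked (no sizes, shapes, or hashes).

```python
from pathlib import Path

# Checkpoint file names as listed under checkpoints/ in the tree.
EXPECTED_CHECKPOINTS = [
    "satellite_siamese_best_model.pt",
    "text_bilstm_best_model.pt",
    "text_roberta_best_model.pt",
    "image_resnet34_best_model.pt",
    "image_efficientnet_best_model.pt",
    "fusion_b2_best_model.pt",
]

def find_missing(package_root, subdir="checkpoints", expected=EXPECTED_CHECKPOINTS):
    """Return the expected file names that are absent from package_root/subdir."""
    root = Path(package_root) / subdir
    return [name for name in expected if not (root / name).is_file()]

if __name__ == "__main__":
    missing = find_missing("demo_model")
    print("all checkpoints present" if not missing else f"missing: {missing}")
```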
- Small Test Set: Trimodal overlap limits the evaluation test set to 368 samples.
- Geographic Domain Gap: Satellite damage detection fails on Mexico Earthquake samples because training is dominated by flood-damaged Harvey tiles (flood textures versus building collapses).
- Quick Mode Dependency: Real-time feature extraction on raw tweets and images requires heavy encoders (RoBERTa, EfficientNet), which Quick Mode bypasses by using precomputed arrays.