In [2]:
import os
import pandas as pd

Current best submission
---

**Status: 1 June 2024, 19:00**

*Key: Super voter combination -- best_dev_combination_2024-06-01*

*Score: 0.6934 // MPA_ALL_MODALITIES: 0.7184 // MPA_TOP_VIEW: 0.6684*
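The three numbers above are consistent with the overall score being the unweighted mean of the two MPA values. The notebook does not state this relationship explicitly, so the check below treats it as an assumption:

```python
# Values copied from the score line above.
mpa_all_modalities = 0.7184
mpa_top_view = 0.6684
reported_score = 0.6934

# Assumption: the overall score is the plain average of the two MPA values.
mean_mpa = (mpa_all_modalities + mpa_top_view) / 2
print(round(mean_mpa, 4), reported_score)
```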


## Model combination



|country_id | count | street | evaluation key | based on |dev set|
|---|---|---|---|---|---|
|QCD| 2089| yes |finetune_QCD_2| topview_streetview_05-27_A |0.7236|
|HUN| 78| yes |finetune_QCD_2| topview_streetview_05-27_A |0.4400|
|FMW| 1428| no |finetune_FMW|topmodal_swin_05-24_A|TODO|
|PNN| 934| no |finetune_PNN|topmodal_swin_05-24_A |TODO|
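As a rough sketch, the table above can be held in a small lookup that maps each country to its evaluation key. The names `MODEL_COMBINATION` and `EVALUATION_KEY` are hypothetical, and the share computed at the end assumes that `count` is the number of test samples per country and that `street` marks the samples that come with a streetview image:

```python
# Rows copied from the table: (country_id, count, street, evaluation key).
MODEL_COMBINATION = [
    ("QCD", 2089, True, "finetune_QCD_2"),
    ("HUN", 78, True, "finetune_QCD_2"),
    ("FMW", 1428, False, "finetune_FMW"),
    ("PNN", 934, False, "finetune_PNN"),
]

# Hypothetical helper: which fine-tuned model predicts which country.
EVALUATION_KEY = {country: key for country, _, _, key in MODEL_COMBINATION}

total = sum(count for _, count, _, _ in MODEL_COMBINATION)
with_street = sum(count for _, count, street, _ in MODEL_COMBINATION if street)
street_share = with_street / total  # fraction of samples in "street = yes" rows

print(EVALUATION_KEY["HUN"], total, round(street_share, 3))
```

Note that HUN has no model of its own in this combination and reuses the QCD fine-tuned model.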


## X-Multi-Modal Late Fusion
-----

We create a model that can process all three modalities. The backbones are Swin transformers from the TIMM collection, pretrained on ImageNet-1K.

- The pretrained transformers provide embeddings for each modality.
- These are combined through an attention module (mode: attention2).
- At inference time, the model works even when the streetview image is not present.
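The idea behind the bullets above can be illustrated with a generic attention-weighted late fusion that simply skips a missing modality. This is not the `attention2` module itself, whose details are not given here. The function `attention_fuse`, the `query` vector (standing in for learned parameters) and the toy embeddings are all assumptions, and plain Python lists replace tensors for readability:

```python
import math

def attention_fuse(embeddings, query):
    """Fuse per-modality embeddings with softmax attention weights.

    embeddings: dict mapping modality name -> list of floats, or None if missing.
    query: list of floats used to score each embedding (hypothetical stand-in
           for learned attention parameters).
    """
    present = {m: e for m, e in embeddings.items() if e is not None}
    if not present:
        raise ValueError("at least one modality is required")

    # Dot-product score per available modality.
    scores = {m: sum(q * x for q, x in zip(query, e)) for m, e in present.items()}

    # Numerically stable softmax over the available modalities only.
    peak = max(scores.values())
    exps = {m: math.exp(s - peak) for m, s in scores.items()}
    norm = sum(exps.values())
    weights = {m: v / norm for m, v in exps.items()}

    # Weighted sum of the embeddings.
    fused = [sum(weights[m] * present[m][i] for m in present)
             for i in range(len(query))]
    return fused, weights

# Toy example: the streetview embedding is missing at inference time.
sample = {
    "orthophoto": [0.2, 0.8, -0.1],
    "sentinel2": [0.5, 0.1, 0.3],
    "streetview": None,
}
fused, weights = attention_fuse(sample, query=[1.0, 0.5, -0.5])
print(sorted(weights), fused)
```

Because the softmax is taken only over the modalities that are present, the weights still sum to one when the streetview image is absent.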

This model can be applied as a one-size-fits-all solution with satisfactory performance on the development set. For this challenge, however, it proved better to train separate models for samples with and without streetview images.

### For samples with all modalities

For about half the test set, streetview images are available. The multi-modal model uses the following modalities and backbones:

- Orthophotos: SwinV2 transformer (base)
- Street photos: SwinV2 transformer (base)

Link to the pretrained backbone: https://huggingface.co/timm/swinv2_base_window12to24_192to384.ms_in22k_ft_in1k

### For samples with only top modalities

For the other test set samples, we use the following modalities and backbones:

- Orthophotos: SwinV2 transformer (small)
- Sentinel-2 data: SwinV2 transformer (small)

Link to the pretrained backbone: https://huggingface.co/timm/swinv2_small_window16_256.ms_in1k


Data description
---

- ~All samples from HUN are excluded. They are underrepresented in the test set, and proved very hard to predict even with overfitting on a single batch.~
- 5-fold cross validation from StratifiedGroupKFold (stratified on class labels, grouped by city)
- Images are resized to fit the pretrained model
  - Orthophotos: Resize (since they all are the same resolution)
  - Streetview: RandomResizeAndCrop (since the views and resolution vary)
  - Sentinel-2: A patch is created from four 64x64 images with 3 channels each
- Data augmentation:
  - Orthophotos: Random horizontal flips and vertical flips with p=0.5; Color jitter (brightness, contrast, and saturation)
  - Streetview: Random horizontal flips with p=0.5; Color jitter
  - Sentinel-2: Color jitter
  
### Training procedure
---

#### Fine-tune to multi-modal data

- Early stopping monitors the mean of the diagonal of the validation set confusion matrix.
- Train on full node (2 GPUs).
- Cross-entropy loss weighted with class weights from inverse sample count
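
A minimal sketch of the two quantities named above, in plain Python. It assumes that the confusion matrix is row-normalised (so its diagonal holds per-class recall) and that the inverse-count weights are rescaled to a mean of 1. Neither detail is stated in the notebook, and the function names are hypothetical:

```python
from collections import Counter

def inverse_count_weights(labels):
    """Weight each class by 1 / (number of samples in that class)."""
    counts = Counter(labels)
    raw = {c: 1.0 / n for c, n in counts.items()}
    scale = len(raw) / sum(raw.values())  # assumption: rescale to mean weight 1
    return {c: w * scale for c, w in raw.items()}

def confusion_diagonal_mean(y_true, y_pred):
    """Mean of the diagonal of the row-normalised confusion matrix."""
    recalls = []
    for c in sorted(set(y_true)):
        preds_for_c = [p for t, p in zip(y_true, y_pred) if t == c]
        recalls.append(sum(p == c for p in preds_for_c) / len(preds_for_c))
    return sum(recalls) / len(recalls)

# Toy labels: class 0 is frequent, class 2 is rare.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 0, 2]

weights = inverse_count_weights(y_true)
metric = confusion_diagonal_mean(y_true, y_pred)
print(weights, metric)
```

In PyTorch, weights of this kind would typically be passed as the `weight` tensor of `torch.nn.CrossEntropyLoss`.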

#### Fine-tune to country

- The trained models are refined further on the individual countries
- Performed for QCD, PNN, FMW separately
- 5-fold cross validation
- QCD: 5 checkpoints x 5 folds = 25 models
- PNN, FMW: 5 checkpoints x 1 fold = 5 models each


Test predictions and post-processing
---

Test set predictions are generated for each fold and combined by majority vote. In case of a tie, the tied class with the higher probability is chosen, where the probability is based on the train set, grouped by country ID.
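
The voting rule can be sketched as follows. The sketch assumes that "probability" means the class frequency in the train set for the sample's country. The function `vote` and the prior values are made up for illustration:

```python
from collections import Counter

def vote(fold_predictions, class_prior):
    """Majority vote over fold predictions, ties broken by the class prior.

    fold_predictions: predicted class per fold for one test sample.
    class_prior: dict class -> train set frequency for the sample's country.
    """
    counts = Counter(fold_predictions)
    top = max(counts.values())
    tied = [c for c, n in counts.items() if n == top]
    # With a single winner, `tied` has one element and the prior is irrelevant.
    return max(tied, key=lambda c: class_prior.get(c, 0.0))

# Hypothetical class frequencies per country (not taken from the data).
prior_by_country = {"QCD": {0: 0.1, 1: 0.5, 2: 0.4}}

clear_winner = vote([2, 2, 2, 1, 0], prior_by_country["QCD"])
tie_broken = vote([1, 2, 2, 1, 0], prior_by_country["QCD"])
print(clear_winner, tie_broken)
```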

Development set
---

~All Modalities
- Accuracy score: 0.7234
- MAP:            0.7033

Only Topview Modalities
- Accuracy score: 0.6293
- MAP:            0.5976~

Separate models
---

- *topmodal_swin_05-24_A* for samples without a streetview image (to beat: 0.6195 DEV)
- *topview_streetview_05-27_A_alpha* for samples including a streetview image (to beat: 0.7141 DEV)

In [1]:
# TODO
import os  # repeated here so the cell also runs before the import cell

best_submission_root = '../submissions/current_best_model/train/'
best_experiments = os.listdir(best_submission_root)
best_experiments

In [7]:
runs = os.listdir(os.path.join(best_submission_root, best_experiments[0]))

In [8]:
with open(os.path.join(best_submission_root, best_experiments[0], runs[0], 'config_tree.log')) as f:
    config_tree = f.readlines()

In [9]:
config_tree

['CONFIG\n',
 '├── datamodule\n',
 '│   └── _target_: ai4eo_mapyourcity.datamodules.mapyourcity_datamodule.MapYourCi\n',
 '│       batch_size: 16                                                          \n',
 '│       num_workers: 8                                                          \n',
 '│       pin_memory: false                                                       \n',
 '│       dataset_options:                                                        \n',
 '│         transform: default                                                    \n',
 '│         data_dir: /work/ka1176/caroline/gitlab/AI4EO-MapYourCity/scripts/../da\n',
 '│         fold: 1                                                               \n',
 '│         fold_dir: /work/ka1176/caroline/gitlab/AI4EO-MapYourCity/scripts/../da\n',
 '│         fold_key: random_stratified_labels_cities_noHUN                       \n',
 '│         model_id:                                                             \n',
 '│      

## References

- Ma et al., "Are Multimodal Transformers Robust to Missing Modality?", CVPR 2022. https://openaccess.thecvf.com/content/CVPR2022/papers/Ma_Are_Multimodal_Transformers_Robust_to_Missing_Modality_CVPR_2022_paper.pdf