This notebook assumes you have run the `huggingface-cli login` command in a terminal or notebook, and that you have write permissions to the Hugging Face datasets.
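
Before running the cells, it may help to confirm that a login token is actually available. The sketch below uses only the standard library and checks the places where `huggingface_hub` normally keeps a token. The helper name `find_hf_token` is hypothetical, and the locations (`HF_TOKEN`, `HF_TOKEN_PATH`, `HF_HOME`, `~/.cache/huggingface/token`) are assumptions based on the library's defaults that may differ between versions.

```python
import os
from pathlib import Path


def find_hf_token(env=None):
    """Return a locally stored Hugging Face token, or None if none is found."""
    env = os.environ if env is None else env

    # 1. An explicit token in the environment takes priority.
    token = (env.get("HF_TOKEN") or "").strip()
    if token:
        return token

    # 2. Otherwise look for the token file written by the login command
    #    (assumed default location; see the note above).
    cache_home = env.get("XDG_CACHE_HOME") or os.path.expanduser("~/.cache")
    hf_home = env.get("HF_HOME") or os.path.join(cache_home, "huggingface")
    token_path = Path(env.get("HF_TOKEN_PATH") or os.path.join(hf_home, "token")).expanduser()
    if token_path.is_file():
        return token_path.read_text().strip() or None
    return None


print("token found" if find_hf_token() else "no token found; run huggingface-cli login first")
```

A token being present does not prove that it carries write permission; that only becomes visible when `push_to_hub` is called.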

In [1]:
from util import ImageDataset, readSetFromFile

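def load_images(
    masks_glob,
    includes_file=None,
    imagery_folder="imagery/",
    masks_folder="masks/",
    fraction=0.1,
    exclude=None,
):
    """Build the Train and Test subsets; return (names, image paths, mask paths) for each."""
    # fraction is forwarded to ImageDataset; in the logs below, fraction=0 leaves every mask in Train.
    include_masks = list(readSetFromFile(includes_file, str)) if includes_file else None
    # Avoid a mutable default argument: fall back to a fresh empty list on each call.
    exclude = [] if exclude is None else exclude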

    train_data = ImageDataset(
        "../learning/" + imagery_folder, 
        "../learning/" + masks_folder, 
        masks_glob, 
        include_masks=include_masks, 
        exclude_masks=exclude, 
        subset="Train", 
        fraction=fraction,
    )
    print("")
    val_data = ImageDataset(
        "../learning/" + imagery_folder, 
        "../learning/" + masks_folder, 
        masks_glob, 
        include_masks=include_masks, 
        exclude_masks=exclude, 
        subset="Test", 
        fraction=fraction,
    )

    train_image_names = [str(path) for path in train_data.image_names]
    train_mask_names = [str(path) for path in train_data.mask_names]

    val_image_names = [str(path) for path in val_data.image_names]
    val_mask_names = [str(path) for path in val_data.mask_names]

    return (train_data.names, train_image_names, train_mask_names), (val_data.names, val_image_names, val_mask_names)
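
The logs in the cells below suggest that `fraction` controls how much data is marked for Test, with `fraction=0` leaving everything in Train. How `ImageDataset` splits internally is not shown in this notebook, so the following is only a hypothetical sketch of a deterministic fraction-based split. The function name `split_names` and the `seed` parameter are assumptions made for illustration.

```python
import random


def split_names(names, fraction, seed=0):
    """Split names into (train, test), placing roughly `fraction` of them in test."""
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")

    # Sort first so the result does not depend on the input order,
    # then shuffle with a fixed seed so the split is reproducible.
    ordered = sorted(names)
    random.Random(seed).shuffle(ordered)

    n_test = int(round(len(ordered) * fraction))
    return ordered[n_test:], ordered[:n_test]


train, test = split_names([f"img_{i}" for i in range(10)], fraction=0)
print(len(train), len(test))
```

With `fraction=0` the test list is empty, which mirrors the "Subset of 0 ... marked for Test" lines printed by the notebook.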

In [2]:
from datasets import Dataset, DatasetDict, Image

def create_dataset(names, image_paths, label_paths):
    dataset = Dataset.from_dict({"image": image_paths, "label": label_paths, "name": names})
    dataset = dataset.cast_column("image", Image())
    dataset = dataset.cast_column("label", Image())

    return dataset
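
`create_dataset` relies on the three lists being aligned row by row. A small pre-check, sketched here with the standard library only, can surface a mismatch with a readable message before the dataset is built. The helper name `check_aligned` and the example paths are hypothetical.

```python
from collections import Counter


def check_aligned(names, image_paths, label_paths):
    """Return the row count if the three columns line up, else raise ValueError."""
    lengths = {
        "name": len(names),
        "image": len(image_paths),
        "label": len(label_paths),
    }
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Column lengths differ: {lengths}")

    # Duplicate names would make rows ambiguous when filtering by name later.
    duplicates = [name for name, count in Counter(names).items() if count > 1]
    if duplicates:
        raise ValueError(f"Duplicate names: {duplicates[:5]}")
    return lengths["name"]


rows = check_aligned(
    ["a", "b"],
    ["imagery/a.png", "imagery/b.png"],
    ["masks/a_binary.png", "masks/b_binary.png"],
)
print(rows, "aligned rows")
```

Calling `check_aligned(*paths)` right before `create_dataset(*paths)` keeps the check next to the data it guards.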

In [3]:
paths_manual, _ = load_images("*_corrected.png", "../data/train_images.txt", fraction=0)
validation_dataset_manual = create_dataset(*paths_manual)

Found and loaded 4382 images with glob *_corrected.png.
Pruned 4156 masks based on set of 0 included masks.
Pruned 0 masks from set of 0 excluded masks.
Subset of 226 ground truth segmentation masks marked for Train.

Found and loaded 4382 images with glob *_corrected.png.
Pruned 4156 masks based on set of 0 included masks.
Pruned 0 masks from set of 0 excluded masks.
Subset of 0 ground truth segmentation masks marked for Test.


The image names written to the exclusion file below are additional data points reserved for validation experiments and must not be trained on.

In [4]:
from util import writeSetToFile

exclude_names = paths_manual[0]
writeSetToFile("../data/exclude_images.txt", exclude_names)
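
Since the excluded names must never appear in a training split, a leakage check is a cheap safeguard. The sketch below compares two collections of names and fails loudly on any overlap. The function name `assert_no_leakage` and the example names are hypothetical.

```python
def assert_no_leakage(train_names, held_out_names):
    """Raise ValueError if any held-out name also appears in the training names."""
    overlap = set(train_names) & set(held_out_names)
    if overlap:
        sample = sorted(overlap)[:5]
        raise ValueError(
            f"{len(overlap)} held-out names found in training data, e.g. {sample}"
        )


# Example with made-up names: the two sets are disjoint, so nothing is raised.
assert_no_leakage(["tile_001", "tile_002"], ["tile_900", "tile_901"])
print("no overlap between training and held-out names")
```

In this notebook the natural call would be `assert_no_leakage(train_paths[0], exclude_names)` for each training split, since the first element of each returned tuple holds the names.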

[stodoran/elwha-segmentation-v1](https://huggingface.co/datasets/stodoran/elwha-segmentation-v1)

In [5]:
train_paths_v1, _ = load_images("*_binary.png", "../data/useful_images.txt", fraction=0, exclude=exclude_names)
train_dataset_v1 = create_dataset(*train_paths_v1)

dataset = DatasetDict({
    "train": train_dataset_v1,
    "validation": validation_dataset_manual,
})
# dataset.push_to_hub("stodoran/elwha-segmentation-v1")

Found and loaded 4382 images with glob *_binary.png.
Pruned 3109 masks based on set of 226 included masks.
Pruned 162 masks from set of 226 excluded masks.
Subset of 1111 ground truth segmentation masks marked for Train.

Found and loaded 4382 images with glob *_binary.png.
Pruned 3109 masks based on set of 226 included masks.
Pruned 162 masks from set of 226 excluded masks.
Subset of 0 ground truth segmentation masks marked for Test.


[stodoran/elwha-segmentation-v2](https://huggingface.co/datasets/stodoran/elwha-segmentation-v2)

In [6]:
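# Caution (assuming ImageDataset uses standard glob matching): [!_manualfix] is a
# character class, not a negated suffix. It rejects any file whose last character
# before ".png" is one of the characters in "_manualfix".
train_paths_v2, _ = load_images("*[!_manualfix].png", masks_folder="corrections_v1/", fraction=0, exclude=exclude_names)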
train_dataset_v2 = create_dataset(*train_paths_v2)

dataset = DatasetDict({
    "train": train_dataset_v2,
    "validation": validation_dataset_manual,
})
# dataset.push_to_hub("stodoran/elwha-segmentation-v2")

Found and loaded 1148 images with glob *[!_manualfix].png.
Pruned 91 masks from set of 226 excluded masks.
Subset of 1057 ground truth segmentation masks marked for Train.

Found and loaded 1148 images with glob *[!_manualfix].png.
Pruned 91 masks from set of 226 excluded masks.
Subset of 0 ground truth segmentation masks marked for Test.
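
The glob `*[!_manualfix].png` used above deserves a closer look. Under standard glob rules, `[!...]` negates a set of single characters rather than a whole suffix. The sketch below contrasts that pattern with an explicit suffix test, using `fnmatch` from the standard library. The file names are made up, and whether `ImageDataset` uses the same matching rules is an assumption.

```python
from fnmatch import fnmatchcase


def matches_char_class(name):
    """True if the name matches the glob used in the notebook."""
    return fnmatchcase(name, "*[!_manualfix].png")


def lacks_suffix(name, suffix="_manualfix"):
    """True if the name is a .png whose stem does not end with the suffix."""
    return name.endswith(".png") and not name[: -len(".png")].endswith(suffix)


# Hypothetical file names for illustration only.
for name in ["tile_001.png", "tile_001_manualfix.png", "tile_001_final.png"]:
    print(f"{name:28} glob={matches_char_class(name)!s:5} suffix_test={lacks_suffix(name)}")
```

The two tests agree on the first two names but differ on `tile_001_final.png`: the glob rejects it because `l` is one of the characters in the class, while the suffix test keeps it.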


[stodoran/elwha-segmentation-predict](https://huggingface.co/datasets/stodoran/elwha-segmentation-predict)

In [7]:
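# The imagery folder doubles as the masks folder here, so the "label" column
# presumably just repeats each image as a placeholder for prediction.
train_paths_pred, _ = load_images("*.png", imagery_folder="imagery/", masks_folder="imagery/", fraction=0)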
train_dataset_pred = create_dataset(*train_paths_pred)

dataset = DatasetDict({
    "data": train_dataset_pred,
})
# dataset.push_to_hub("stodoran/elwha-segmentation-predict")

Found and loaded 4382 images with glob *.png.
Pruned 0 masks from set of 0 excluded masks.
Subset of 4382 ground truth segmentation masks marked for Train.

Found and loaded 4382 images with glob *.png.
Pruned 0 masks from set of 0 excluded masks.
Subset of 0 ground truth segmentation masks marked for Test.


[stodoran/elwha-segmentation-all](https://huggingface.co/datasets/stodoran/elwha-segmentation-all)

In [8]:
train_paths_all, _ = load_images("*_binary.png", fraction=0, exclude=exclude_names)
train_dataset_all = create_dataset(*train_paths_all)

dataset = DatasetDict({
    "train": train_dataset_all,
    "validation": validation_dataset_manual,
})
# dataset.push_to_hub("stodoran/elwha-segmentation-all")

Found and loaded 4382 images with glob *_binary.png.
Pruned 226 masks from set of 226 excluded masks.
Subset of 4156 ground truth segmentation masks marked for Train.

Found and loaded 4382 images with glob *_binary.png.
Pruned 226 masks from set of 226 excluded masks.
Subset of 0 ground truth segmentation masks marked for Test.
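
The pruning counts printed in the logs above can be cross-checked with simple arithmetic: images found, minus every pruned count, should equal the size of the Train subset. A small sketch, with the numbers copied from the logs and the dictionary layout chosen only for illustration:

```python
# (images found, [pruned counts], reported Train subset size), copied from the logs.
LOG_COUNTS = {
    "manual validation": (4382, [4156, 0], 226),
    "v1": (4382, [3109, 162], 1111),
    "v2": (1148, [91], 1057),
    "predict": (4382, [0], 4382),
    "all": (4382, [226], 4156),
}


def inconsistent_entries(log_counts):
    """Return the names of entries where found - sum(pruned) != reported subset size."""
    return [
        name
        for name, (found, pruned, reported) in log_counts.items()
        if found - sum(pruned) != reported
    ]


bad = inconsistent_entries(LOG_COUNTS)
print("all counts consistent" if not bad else f"inconsistent: {bad}")
```

For example, the v1 split gives 4382 - 3109 - 162 = 1111, which matches the reported Train subset.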
