# Dataset 101 : Mechanics

This notebook introduces lours's `Dataset` object.
It shows how to load a dataset from a known format, how to merge two datasets, how to remap classes, and how to write the result to disk in the format you want.

In [1]:
%load_ext autoreload

%autoreload 2
from lours.dataset import Dataset, from_coco
from lours.utils.testing import assert_dataset_equal

Load the COCO validation set from the test folder. Note that you can also load datasets in the cAIpy and darknet formats.

In [2]:
COCO_dataset = from_coco("notebook_data/coco_valid.json")

In [3]:
COCO_dataset

VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…

## Dataset Sampling

You can use the `.loc[]` or `.iloc[]` interface to sample the sub-datasets you want at the image level. To sample at the annotation level, use the `.loc_annot[]` and `.iloc_annot[]` interfaces.

Notes:

 - For `iloc`, image indices are not considered, only the row number (as in `pandas.DataFrame.iloc`), so you might want to reorder the images beforehand, or use `loc`, which uses indices.
 - Indexing with a single number, e.g. `dataset[0]`, gives you a dataset of only one image, but it is still a Dataset object with two dataframes.
 - Images are never loaded by the dataset object itself; you need to load them yourself in your pipeline.
 - The `[]` method is equivalent to `iloc[]`.
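
The position-versus-label distinction in the notes above mirrors pandas. Here is a minimal sketch on a plain pandas DataFrame, not a lours `Dataset`. The three image ids come from this notebook, and the third image's size is made up for illustration.

```python
import pandas as pd

# Toy stand-in for an images table indexed by image id (third size is illustrative).
images = pd.DataFrame(
    {"width": [425, 640, 500], "height": [640, 480, 375]},
    index=pd.Index([352582, 113354, 58393], name="id"),
)

# Position-based: row 0 is whichever image happens to come first.
first_row = images.iloc[[0]]

# Label-based: select by image id, regardless of row order.
by_id = images.loc[[113354]]

# Every other row, the pandas analogue of slicing with `[::2]`.
every_other = images.iloc[::2]

print(list(first_row.index), list(by_id.index), list(every_other.index))
```

Because `iloc` depends on row order, sorting the table first changes what `iloc[[0]]` returns, while `loc[[113354]]` always returns the same image.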

### Image based sampling

In [4]:
# Only taking 50% of the images
COCO_dataset.iloc[::2]

VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…

In [5]:
ids_to_keep = COCO_dataset.images.index[COCO_dataset.images.index > 30_000]
print(ids_to_keep)
COCO_dataset.loc[ids_to_keep]

Index([352582, 113354,  58393, 147729, 310072,  50149, 519208, 356125,  38048,
       567825,
       ...
       166478, 185409, 577976, 189806, 363188, 311180, 302030, 105455, 428280,
       349837],
      dtype='int64', name='id', length=4722)


VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…

This is equivalent to using the `filter_images` method with the `loc` mode.

In [6]:
COCO_dataset.filter_images(ids_to_keep, mode="loc")

VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…

### Annotation based sampling

Remove half the annotations

In [7]:
COCO_dataset.iloc_annot[::2]

VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…

Remove half the annotations, remove images emptied of annotations (but keep the ones that were already empty)

In [8]:
to_keep = COCO_dataset.annotations.index[::2]
filtered = COCO_dataset.filter_annotations(
    to_keep, mode="loc", remove_emptied_images=True
)
display(filtered)

VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…

You can also use `slice(None, None, 2)` with the `iloc` mode

In [9]:
filtered_2 = COCO_dataset.filter_annotations(
    slice(None, None, 2), mode="iloc", remove_emptied_images=True
)

assert_dataset_equal(filtered, filtered_2)
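
The `remove_emptied_images` behaviour described above can be pictured with plain dictionaries. This is only a sketch of the stated semantics: images that lose all their annotations are dropped, and images that were already empty are kept. The function name and toy ids are hypothetical, and this is not lours's implementation.

```python
# Toy data: image id -> annotation ids. Image 3 has no annotation from the start.
annotations_per_image = {1: [10, 11], 2: [12], 3: []}
kept_annotations = {10, 11}  # hypothetical selection of annotation ids


def sketch_filter_annotations(per_image, kept, remove_emptied_images):
    """Keep only the selected annotations, optionally dropping emptied images."""
    result = {}
    for image_id, annot_ids in per_image.items():
        remaining = [a for a in annot_ids if a in kept]
        # "Emptied" means the image had annotations and lost all of them.
        emptied = bool(annot_ids) and not remaining
        if remove_emptied_images and emptied:
            continue
        result[image_id] = remaining
    return result


with_removal = sketch_filter_annotations(annotations_per_image, kept_annotations, True)
without_removal = sketch_filter_annotations(annotations_per_image, kept_annotations, False)
print(with_removal)     # image 2 is gone, image 3 is kept
print(without_removal)  # image 2 stays, with no annotation left
```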

### Iterating through the dataset

You can iterate through the dataset

In [10]:
for single_image_dataset in COCO_dataset[:2]:
    display(single_image_dataset)

VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…

VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…

The `iter_images` method gives you the image row and its annotations dataframe directly, instead of a Dataset object with a single image.


In [11]:
for image, annotations in COCO_dataset[:2].iter_images():
    print(image)
    display(annotations)

width                                      425
height                                     640
relative_path    Images/valid/000000352582.jpg
type                                      .jpg
split                                    valid
Name: 352582, dtype: object


id,image_id,category_str,category_id,split,box_x_min,box_y_min,box_width,box_height,area
460450,352582,person,1,valid,112.43,195.32,214.78,438.19,48685.6791
535917,352582,person,1,valid,0.0,256.0,80.54,376.81,22650.738
602093,352582,frisbee,34,valid,171.63,424.03,85.89,40.67,2605.7209


width                                      640
height                                     480
relative_path    Images/valid/000000113354.jpg
type                                      .jpg
split                                    valid
Name: 113354, dtype: object


id,image_id,category_str,category_id,split,box_x_min,box_y_min,box_width,box_height,area
589077,113354,zebra,24,valid,260.99,158.88,141.52,194.11,9978.94125
589740,113354,zebra,24,valid,366.49,174.59,115.67,142.71,5784.6862
592005,113354,zebra,24,valid,3.24,151.28,265.34,175.82,16206.3748


In [12]:
image, annotations = COCO_dataset[:2].get_one_frame(0)
print(image)
display(annotations)

width                                      425
height                                     640
relative_path    Images/valid/000000352582.jpg
type                                      .jpg
split                                    valid
Name: 352582, dtype: object


id,image_id,category_str,category_id,split,box_x_min,box_y_min,box_width,box_height,area
460450,352582,person,1,valid,112.43,195.32,214.78,438.19,48685.6791
535917,352582,person,1,valid,0.0,256.0,80.54,376.81,22650.738
602093,352582,frisbee,34,valid,171.63,424.03,85.89,40.67,2605.7209


## Remap classes

Here we use the COCO -> Pascal VOC preset to convert COCO classes into Pascal VOC classes.

In [13]:
COCO_dataset.label_map

{1: 'person',
 2: 'bicycle',
 3: 'car',
 4: 'motorcycle',
 5: 'airplane',
 6: 'bus',
 7: 'train',
 8: 'truck',
 9: 'boat',
 10: 'traffic light',
 11: 'fire hydrant',
 13: 'stop sign',
 14: 'parking meter',
 15: 'bench',
 16: 'bird',
 17: 'cat',
 18: 'dog',
 19: 'horse',
 20: 'sheep',
 21: 'cow',
 22: 'elephant',
 23: 'bear',
 24: 'zebra',
 25: 'giraffe',
 27: 'backpack',
 28: 'umbrella',
 31: 'handbag',
 32: 'tie',
 33: 'suitcase',
 34: 'frisbee',
 35: 'skis',
 36: 'snowboard',
 37: 'sports ball',
 38: 'kite',
 39: 'baseball bat',
 40: 'baseball glove',
 41: 'skateboard',
 42: 'surfboard',
 43: 'tennis racket',
 44: 'bottle',
 46: 'wine glass',
 47: 'cup',
 48: 'fork',
 49: 'knife',
 50: 'spoon',
 51: 'bowl',
 52: 'banana',
 53: 'apple',
 54: 'sandwich',
 55: 'orange',
 56: 'broccoli',
 57: 'carrot',
 58: 'hot dog',
 59: 'pizza',
 60: 'donut',
 61: 'cake',
 62: 'chair',
 63: 'couch',
 64: 'potted plant',
 65: 'bed',
 67: 'dining table',
 70: 'toilet',
 72: 'tv',
 73: 'laptop',
 74: 'mo

In [14]:
COCO_pascal = COCO_dataset.remap_from_preset("coco", "pascalvoc")

See how the label map tab has changed.

In [15]:
COCO_pascal

VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…

### Remap from dictionaries

Fictional use case where we only want vehicles, objects and animals.
If given, `new_names` must provide one name for each distinct output value in `class_mapping`.

In [16]:
COCO_RT = COCO_pascal.remap_classes(
    class_mapping={
        1: 2,
        2: 2,
        3: 1,
        4: 2,
        5: 3,
        6: 2,
        7: 2,
        8: 1,
        9: 3,
        10: 1,
        11: 3,
        12: 1,
        13: 1,
        14: 2,
        16: 3,
        17: 1,
        18: 3,
        19: 2,
        20: 3,
    },
    new_names={1: "Animal", 2: "Vehicle", 3: "Object"},
)

In [17]:
COCO_RT

VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…
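
To make the two arguments concrete, here is a minimal pure-Python sketch of what a dictionary-based remapping amounts to. It uses a few entries of the mapping above and a hypothetical list of category ids. It illustrates the idea only and does not reproduce `remap_classes` itself.

```python
# A few entries taken from the class_mapping above (input id -> output id).
class_mapping = {1: 2, 3: 1, 5: 3, 8: 1}
new_names = {1: "Animal", 2: "Vehicle", 3: "Object"}

# Every distinct output id needs a name.
assert set(class_mapping.values()) <= set(new_names)

# Hypothetical category ids of five annotations.
category_ids = [1, 3, 3, 5, 8]

remapped_ids = [class_mapping[c] for c in category_ids]
remapped_names = [new_names[c] for c in remapped_ids]
print(remapped_ids)
print(remapped_names)
```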

### Remap from dataframe

Dataframe for remapping must have at least 2 columns : `input_category_id` and `output_category_id`

If available, `output_category_name` will be used to replace the names of the remapped ids.

`input_category_name` only serves an informative purpose.

In [18]:
import pandas as pd

class_table = (
    pd.Series(COCO_pascal.label_map).rename("input_category_name").sort_index()
)
class_table.index.rename("input_category_id", inplace=True)
# Drop row 15 (input_category_id 16) so that this class has no mapping entry
class_table = class_table.reset_index().drop(15)
class_table["output_category_id"] = [
    2,
    2,
    1,
    2,
    3,
    2,
    2,
    1,
    3,
    1,
    3,
    1,
    1,
    2,
    3,
    1,
    3,
    2,
    3,
]
class_table["output_category_name"] = class_table["output_category_id"].replace(
    {1: "animal", 2: "vehicle", 3: "object"}
)

In [19]:
class_table

Unnamed: 0,input_category_id,input_category_name,output_category_id,output_category_name
0,1,aeroplane,2,vehicle
1,2,bicycle,2,vehicle
2,3,bird,1,animal
3,4,boat,2,vehicle
4,5,bottle,3,object
5,6,bus,2,vehicle
6,7,car,2,vehicle
7,8,cat,1,animal
8,9,chair,3,object
9,10,cow,1,animal


In [20]:
COCO_RT_DF = COCO_pascal.remap_from_dataframe(class_table)

In [21]:
COCO_RT_DF

VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…

### Remap from CSV

Basically the same as remap from dataframe, except the input is a csv file with the same data

In [22]:
csv_file = "remap.csv"
class_table.to_csv(csv_file, index=False)

In [23]:
!cat remap.csv

input_category_id,input_category_name,output_category_id,output_category_name
1,aeroplane,2,vehicle
2,bicycle,2,vehicle
3,bird,1,animal
4,boat,2,vehicle
5,bottle,3,object
6,bus,2,vehicle
7,car,2,vehicle
8,cat,1,animal
9,chair,3,object
10,cow,1,animal
11,diningtable,3,object
12,dog,1,animal
13,horse,1,animal
14,motorbike,2,vehicle
15,person,3,object
17,sheep,1,animal
18,sofa,3,object
19,train,2,vehicle
20,tvmonitor,3,object


In [24]:
COCO_RT_CSV = COCO_pascal.remap_from_csv(csv_file)

In [25]:
COCO_RT_CSV

VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…
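
For readers who want to inspect such a file without lours, the sketch below reads the same column layout with the standard `csv` module and rebuilds the two dictionaries used earlier. The CSV content is a three-row excerpt of the file shown above, held in memory. The way the dictionaries are built is an illustration, not the behaviour of `remap_from_csv`.

```python
import csv
import io

# Three-row excerpt of remap.csv, kept in memory for the example.
csv_text = """input_category_id,input_category_name,output_category_id,output_category_name
1,aeroplane,2,vehicle
3,bird,1,animal
5,bottle,3,object
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# input id -> output id
class_mapping = {
    int(row["input_category_id"]): int(row["output_category_id"]) for row in rows
}
# output id -> output name
new_names = {
    int(row["output_category_id"]): row["output_category_name"] for row in rows
}
print(class_mapping)
print(new_names)
```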

### Remap from other dataset

This method will try to retrieve the label names in the other dataset and apply a remapping accordingly.

Classes that are not in the other dataset are mapped to a free id with respect to the other dataset's label map.

In [26]:
COCO_RT_other = COCO_pascal.remap_from_other(COCO_RT_CSV)
COCO_RT_other

Using the following class remapping dictionary :
{1: 22,
 2: 21,
 3: 23,
 4: 4,
 5: 5,
 6: 6,
 7: 7,
 8: 8,
 9: 9,
 10: 10,
 11: 11,
 12: 12,
 13: 13,
 14: 14,
 15: 15,
 16: 16,
 17: 17,
 18: 18,
 19: 19,
 20: 20}


VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…
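
The name-based matching can be sketched in a few lines of plain Python. The label maps below are toy examples, and the rule for picking a free id (the next integer above the other map's largest id) is an assumption made for this illustration. lours may choose free ids differently.

```python
def sketch_remap_by_name(label_map, other_label_map):
    """Build {old_id: new_id} by matching class names against another label map."""
    name_to_other_id = {name: cat_id for cat_id, name in other_label_map.items()}
    next_free_id = max(other_label_map) + 1  # assumed policy for unmatched names
    mapping = {}
    for cat_id, name in sorted(label_map.items()):
        if name in name_to_other_id:
            mapping[cat_id] = name_to_other_id[name]
        else:
            mapping[cat_id] = next_free_id
            next_free_id += 1
    return mapping


# Toy label maps: ids 2 and 3 match by name, id 1 does not.
other = {1: "aeroplane", 2: "bicycle", 3: "bird"}
mine = {1: "something else", 2: "bicycle", 3: "bird"}

mapping = sketch_remap_by_name(mine, other)
print(mapping)
```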

## Dataset Reindexing

### Resetting index

The `reset_index` method allows you to reorder the dataset's dataframes according to some column values

In [27]:
COCO_dataset.reset_index()

VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…

Sort the annotations by category string first : Get the dataframe to start with airplanes and finish with zebra.

In [28]:
reset_COCO_dataset = COCO_dataset.reset_index(
    start_image_id=10,
    start_annotations_id=2,
    sort_annotations_by=("category_str", "image_id"),
)
reset_COCO_dataset

VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…

### Reindex with mapping

Akin to class remapping, you can also remap the dataset's dataframe indexes with dictionaries. Note that unmapped index values will be reset to a range index, but they will not be sorted. Be sure to sort the dataframes the way you want before calling the method `reset_index_from_mapping` with an incomplete index mapping.

In [29]:
COCO_dataset.reset_index_from_mapping(
    images_index_map={58393: 0}, annotations_index_map={331107: 0}
)

VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…

### Reindex images index from other dataframe

This feature is similar to pandas' [merge](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) function: by selecting columns to merge on, the dataset constructs an index mapping for entries that are in both the original images dataframe and the other dataframe, and optionally remaps the other rows to a simple range index.

In [30]:
matched_COCO = COCO_dataset.match_index(reset_COCO_dataset.images, on="relative_path")
display(matched_COCO)

matched_COCO.images.sort_index()

VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…

id,width,height,relative_path,type,split
10,640,426,Images/valid/000000000139.jpg,.jpg,valid
11,586,640,Images/valid/000000000285.jpg,.jpg,valid
12,640,483,Images/valid/000000000632.jpg,.jpg,valid
13,375,500,Images/valid/000000000724.jpg,.jpg,valid
14,428,640,Images/valid/000000000776.jpg,.jpg,valid
...,...,...,...,...,...
5005,640,354,Images/valid/000000581317.jpg,.jpg,valid
5006,612,612,Images/valid/000000581357.jpg,.jpg,valid
5007,640,427,Images/valid/000000581482.jpg,.jpg,valid
5008,478,640,Images/valid/000000581615.jpg,.jpg,valid
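
The index mapping built from a shared column can be pictured as follows. This pure-Python sketch matches two tables on `relative_path`. The two original image ids and their paths come from this notebook, while the ids in the other table are hypothetical. It illustrates the idea and is not the `match_index` implementation.

```python
# relative_path per image id in the original dataset.
images = {
    352582: "Images/valid/000000352582.jpg",
    113354: "Images/valid/000000113354.jpg",
}

# Other table with hypothetical new ids, plus a path absent from `images`.
other = {
    10: "Images/valid/000000113354.jpg",
    11: "Images/valid/000000352582.jpg",
    12: "Images/valid/000000000139.jpg",
}

# Invert the other table, then look up each original path.
path_to_new_id = {path: new_id for new_id, path in other.items()}
index_map = {
    old_id: path_to_new_id[path]
    for old_id, path in images.items()
    if path in path_to_new_id
}
print(index_map)
```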


## Dataset merge

### Regular merge

Here, we divide COCO in two and merge them again to show how it works

In [31]:
half1 = COCO_dataset[::2]
half2 = COCO_dataset[1::2]

In [32]:
from lours.utils.testing import assert_dataset_equal

merged_back = half1 + half2
display(merged_back)
display(COCO_dataset)
assert_dataset_equal(COCO_dataset, merged_back)

VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…

VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…

### Merge with `ignore_index`

The `merge` method can be used with `ignore_index` when image ids overlap.

In [33]:
half1 = half1.reset_index()
half2 = half2.reset_index()

In [34]:
merged_back = half1.merge(half2, ignore_index=True)
assert_dataset_equal(merged_back, COCO_dataset, ignore_index=True)
merged_back

VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…

### Merging with overlapping ids

If your datasets have images with overlapping ids, they can still be merged as long as the overlapping subsets are exactly the same.

In [35]:
half1 = Dataset.from_template(
    COCO_dataset, annotations=COCO_dataset.annotations.iloc[::2]
)
display(half1)
half2 = Dataset.from_template(
    COCO_dataset, annotations=COCO_dataset.annotations.iloc[1::2]
)
display(half2)
merged_back = half1 + half2
assert_dataset_equal(COCO_dataset, merged_back)

VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…

VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…

Merging overlapping ids can be turned off by setting `allow_overlapping_image_ids` to False.

In [36]:
half1.merge(half2, allow_overlapping_image_ids=False)

ValueError: Overlapping image ids not permitted. Consider using the allow_overlapping_image_ids or ignore_index options
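
The rule for overlapping ids can be summarised with a small dictionary-based sketch: overlapping entries are accepted only if they are identical, and the check can be made strict with a flag. The function name, data and second error message are hypothetical. Only the flag name is taken from the cell above.

```python
def sketch_merge_images(images_a, images_b, allow_overlapping_image_ids=True):
    """Merge two {image_id: relative_path} tables under the overlap rule."""
    overlap = images_a.keys() & images_b.keys()
    if overlap and not allow_overlapping_image_ids:
        raise ValueError("Overlapping image ids not permitted")
    for image_id in overlap:
        if images_a[image_id] != images_b[image_id]:
            raise ValueError(f"Image {image_id} differs between the two datasets")
    return {**images_a, **images_b}


half_a = {1: "a.jpg", 2: "b.jpg"}
half_b = {2: "b.jpg", 3: "c.jpg"}  # id 2 overlaps and is identical

merged = sketch_merge_images(half_a, half_b)
print(merged)

try:
    sketch_merge_images(half_a, half_b, allow_overlapping_image_ids=False)
    strict_failed = False
except ValueError:
    strict_failed = True
print(strict_failed)
```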

### Incompatible Label maps

If the label map of one dataset is not a subset of the other's, and vice versa, the label maps are incompatible.

In [37]:
new_label_map = {**COCO_pascal.label_map, **{1: "something else"}}
COCO_incompatible = COCO_pascal.from_template(label_map=new_label_map)

In [38]:
COCO_pascal.merge(COCO_incompatible)

IncompatibleLabelMapsError: Label maps are incompatible

If we compare the two label maps, we can see that the class labels are not the same for class id 1 ('aeroplane' vs 'something else').

In [39]:
for k, name in COCO_pascal.label_map.items():
    other_name = COCO_incompatible.label_map.get(k)
    if other_name is not None and other_name != name:
        print(
            f"Incompatible label map for category_id {k} : '{name}' vs '{other_name}'"
        )

Incompatible label map for category_id 1 : 'aeroplane' vs 'something else'
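
The subset criterion stated at the start of this section can be written compactly with dictionary item views. This is a sketch of the rule as worded in this notebook, on toy label maps. The helper name is hypothetical and it is not lours's own check.

```python
def sketch_label_maps_compatible(label_map_a, label_map_b):
    """True when one label map is a subset of the other, id and name included."""
    items_a, items_b = label_map_a.items(), label_map_b.items()
    return items_a <= items_b or items_b <= items_a


pascal_like = {1: "aeroplane", 2: "bicycle"}
subset = {1: "aeroplane"}
conflicting = {1: "something else", 2: "bicycle"}

subset_ok = sketch_label_maps_compatible(pascal_like, subset)
conflict_ok = sketch_label_maps_compatible(pascal_like, conflicting)
print(subset_ok, conflict_ok)
```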


### Automatic remapping

It is possible though to remap a dataset to match another dataset's label map by retrieving categories with the same names.

We can use either the `remap_from_other` method or the addition operator directly, as it will fall back to the automatic remapping with a warning.

Note that the merge is effective but you should avoid this fallback mechanism if possible, because label names are not supposed to be used as ids.

In [40]:
remapped = COCO_incompatible.remap_from_other(COCO_pascal)
merged = COCO_pascal.merge(remapped)
merged

Using the following class remapping dictionary :
{1: 21,
 2: 2,
 3: 3,
 4: 4,
 5: 5,
 6: 6,
 7: 7,
 8: 8,
 9: 9,
 10: 10,
 11: 11,
 12: 12,
 13: 13,
 14: 14,
 15: 15,
 16: 16,
 17: 17,
 18: 18,
 19: 19,
 20: 20}


VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…

In [41]:
merged = COCO_incompatible + COCO_pascal
merged

Using the following class remapping dictionary :
{1: 21,
 2: 2,
 3: 3,
 4: 4,
 5: 5,
 6: 6,
 7: 7,
 8: 8,
 9: 9,
 10: 10,
 11: 11,
 12: 12,
 13: 13,
 14: 14,
 15: 15,
 16: 16,
 17: 17,
 18: 18,
 19: 19,
 20: 20}


  warn(


VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…

## Adding annotations to dataset

### Standalone annotation addition

Similar to [pandas.DataFrame.append](https://pandas.pydata.org/pandas-docs/version/1.4/reference/api/pandas.DataFrame.append.html), you can append one annotation row to your annotations dataframe.

Notice the `format_string` option, which lets the method take care of the conversion itself. See [lours.utils.bbox_converter](../generated/lours.utils.bbox_converter.rst) for naming conventions. For example, YOLO bounding boxes give the box center x and y coordinates plus the box width and height, all normalized by the frame size. The format is thus `cxcywh`.

First, create a dataset with 2 images and no annotation

In [42]:
empty = COCO_pascal.loc_annot[[]].iloc[:2]
display(empty)

VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…

Here, we add one bounding box for the first image. The box is a quarter of the image (half the height and half the width) and, with its center at (0.75, 0.75), sits in the bottom-right corner of the image.

In [43]:
empty.add_detection_annotation(
    format_string="cxcywh",
    image_id=352582,
    bbox_coordinates=[0.75, 0.75, 0.5, 0.5],
    confidence=0.5,
    category_id=20,
)

VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…
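
To see what the normalized `cxcywh` coordinates mean in pixels, the sketch below converts them by hand to an absolute `x_min, y_min, width, height` box. It uses the size of image 352582 shown earlier (425 x 640). The helper is hypothetical and written for this illustration. It is not the lours converter.

```python
def sketch_cxcywh_to_xywh(box, image_width, image_height):
    """Convert a normalized (cx, cy, w, h) box to absolute (x_min, y_min, w, h)."""
    cx, cy, w, h = box
    x_min = (cx - w / 2) * image_width
    y_min = (cy - h / 2) * image_height
    return x_min, y_min, w * image_width, h * image_height


absolute_box = sketch_cxcywh_to_xywh(
    [0.75, 0.75, 0.5, 0.5], image_width=425, image_height=640
)
print(absolute_box)  # (212.5, 320.0, 212.5, 320.0)
```

The box therefore spans x from 212.5 to 425 and y from 320 to 640.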

### Introducing the AnnotationAppender context manager

Similarly to [pandas.DataFrame.append](https://pandas.pydata.org/pandas-docs/version/1.4/reference/api/pandas.DataFrame.append.html), calling this method multiple times is discouraged, because each time it creates a new dataframe with only one more row.

What you can do instead is use the `annotation_append` method as a context manager. This appender caches all the added annotations and only appends the consolidated data when exiting the context.

This is very useful when running an inference on a whole dataset.

Note that this operation is performed in place!

In [44]:
with empty.annotation_append(format_string="cxcywh") as appender:
    appender.append(
        image_id=352582,
        bbox_coordinates=[0.75, 0.75, 0.5, 0.5],
        confidence=0.5,
        category_id=20,
    )
    appender.append(
        image_id=113354,
        bbox_coordinates=[0.25, 0.25, 0.5, 0.5],
        confidence=0.5,
        category_id=21,
    )
    print(empty.len_annot())  # Note that the dataset is not changed here

display(empty)

0


  warn(


VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…
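
The buffering pattern behind such an appender can be sketched with `contextlib`. Rows are collected in a temporary list and written to the target in one go when the `with` block exits. This is a generic illustration of the pattern with hypothetical names, not the lours `AnnotationAppender`.

```python
from contextlib import contextmanager


@contextmanager
def sketch_buffered_append(target):
    """Collect rows in a buffer and add them to `target` once, on exit."""
    buffer = []
    yield buffer
    # Reached only when the block exits without an exception.
    target.extend(buffer)


annotations = []
with sketch_buffered_append(annotations) as pending:
    pending.append({"image_id": 352582, "category_id": 20})
    pending.append({"image_id": 113354, "category_id": 21})
    count_inside = len(annotations)  # target is still untouched here

count_after = len(annotations)
print(count_inside, count_after)
```

As in the cell above, the length read inside the block is 0 and the rows only show up once the context has been exited.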