# Schema mechanics

The Caipy loader can parse Caipy JSON files and efficiently transform them into column-wise data, which is more compatible with pandas.

Schemas help check the Caipy JSON structure, automatically normalize the data so that it is no longer nested, and booleanize the sets. See the related tutorial on booleanization: [Demo booleanization](7_demo_booleanize.ipynb).

## What is a schema?

A schema is a way to specify a data structure. For each key of an object, it gives the expected type, which ranges from simple types such as a string or a float to more complex types such as lists and objects. That way, you can specify how the data should be nested.

See more information about JSON schemas in the [official documentation](https://json-schema.org/specification).
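As a small illustration, here is a hypothetical schema (not the default Caipy one) written as a Python dictionary. It combines the kinds of types discussed in this notebook: simple types, an `enum`, an `array` with `uniqueItems`, and a nested `object`. All field names are made up for the example.

```python
import json

# Hypothetical schema, for illustration only; the field names are assumptions
# and do not come from the default Caipy schema.
toy_schema = {
    "type": "object",
    "properties": {
        "file_name": {"type": "string"},
        "width": {"type": "number"},
        "tags": {
            "type": "object",
            "properties": {
                # A single value chosen from a closed list
                "weather": {"type": "string", "enum": ["sunny", "rainy", "cloudy"]},
                # A list without duplicates, i.e. an unordered set
                "colors": {
                    "type": "array",
                    "uniqueItems": True,
                    "items": {"type": "string", "enum": ["red", "green", "blue"]},
                },
            },
        },
    },
    "required": ["file_name"],
}

# A document that follows the schema above
toy_document = {
    "file_name": "image1.jpg",
    "width": 640,
    "tags": {"weather": "sunny", "colors": ["red", "blue"]},
}

print(json.dumps(toy_schema, indent=2))
```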

This library provides a default schema, but you can also provide your own schema with a path or a URL.

.. nbinfo::
    For better readability, we use Mercury's [JSON displayer](https://runmercury.com/examples/display-json-jupyter-notebook/), but you can simply replace `mr.JSON` with `display` for every dictionary shown.


In [1]:
%%capture

%load_ext autoreload
%autoreload 2
import json

import mercury as mr

from lours.dataset.io.schema_util import load_json_schema

app = mr.App(title="Display notebook", static_notebook=True)

In [2]:
default_caipy_schema = load_json_schema("default")
# Show the json with mercury, for better readability
mr.JSON(default_caipy_schema)

The interesting types for us are:
- `enum`, which can be converted to a [pandas categorical](https://pandas.pydata.org/docs/user_guide/categorical.html)
- `array` with `uniqueItems` set to `True`, which can be seen as an unordered set and thus can be [booleanized](7_demo_booleanize.ipynb)
- `object`, which tells us that the data is nested and thus needs to be normalized. For example, the `weather` tag for images is inside the `tags` object. In the images dataframe, this will go in the `tags.weather` column.

In the future, we might have to deal with `array` objects that are not unordered sets, that is, ordered lists. They could be converted to categorical data within the columns `array.1`, `array.2`, etc., but this is not supported for the moment.
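The normalization of `object` types described above can be sketched in plain Python. This is only an illustrative stand-in, not the implementation used by the library; the `flatten` helper and the sample image dictionary are assumptions made for the example.

```python
def flatten(nested: dict, prefix: str = "") -> dict:
    """Flatten nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in nested.items():
        column = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, extending the dotted prefix
            flat.update(flatten(value, prefix=column))
        else:
            flat[column] = value
    return flat


# Hypothetical image entry with a nested "tags" object
image = {"file_name": "image1.jpg", "tags": {"weather": "sunny", "time": "day"}}
flat_image = flatten(image)
print(flat_image)
# {'file_name': 'image1.jpg', 'tags.weather': 'sunny', 'tags.time': 'day'}
```

Each dotted key corresponds to one dataframe column, which is how `weather` ends up in `tags.weather`.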

### Data checking

The first obvious use of schemas is validation.

For example, the following data structure is rejected because the list `custom_dict["annotations"][0]["attributes"]["colors"]` contains `turquoise`, whereas each item must be one of the values specified in the schema: `default_caipy_schema["properties"]["annotations"]["items"]["properties"]["attributes"]["properties"]["colors"]["items"]["enum"]`

In [3]:
with open("../../test_lours/test_data/caipy_dataset/tags/custom_schema/785.json") as f:
    custom_dict = json.load(f)

mr.JSON(custom_dict)

mr.JSON(
    default_caipy_schema["properties"]["annotations"]["items"]["properties"][
        "attributes"
    ]["properties"]["colors"]["items"]["enum"]
)

In [None]:
from jsonschema_rs import validator_for

validator = validator_for(default_caipy_schema)

validator.validate(custom_dict)

ValidationError: "turquoise" is not one of ["red","green","yellow","blue","white","black","orange","purple","grey","brown","pink","beige","cyan"]

Failed validating "enum" in schema["properties"]["annotations"]["items"]["properties"]["attributes"]["properties"]["colors"]["items"]

On instance["annotations"][0]["attributes"]["colors"][1]:
    "turquoise"

### List and enum formatting

As mentioned above, we can use schemas to construct a dataframe with the right columns and dtypes, even if some values are not present.

A common pitfall is the mix of fields that are present in some annotations but not in others. Pandas represents missing values with `NaN` and `None`, but we can replace them with the right default value. For example, if we know that a field is a list, we can give it the empty list as its default value.
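The idea of schema-driven defaults can be sketched without pandas. The helpers `default_for` and `fill_missing` below are hypothetical and only mimic the principle; they are not the functions provided by the library, and the attribute schema is a simplified assumption.

```python
def default_for(field_schema: dict):
    """Return a default for a missing field: an empty list for arrays, None otherwise."""
    if field_schema.get("type") == "array":
        return []
    return None


def fill_missing(record: dict, properties: dict) -> dict:
    """Return a copy of `record` with every field declared in `properties` present."""
    filled = dict(record)
    for name, field_schema in properties.items():
        if name not in filled:
            filled[name] = default_for(field_schema)
    return filled


# Simplified, hypothetical schema for annotation attributes
attribute_properties = {
    "colors": {"type": "array", "uniqueItems": True},
    "position": {"type": "array", "uniqueItems": True},
    "occluded": {"type": "boolean"},
}

# "position" is absent from this annotation
attributes = {"colors": ["red", "white"], "occluded": True}
filled_attributes = fill_missing(attributes, attribute_properties)
print(filled_attributes)
# {'colors': ['red', 'white'], 'occluded': True, 'position': []}
```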

In [5]:
from lours.dataset.io.schema_util import (
    fill_with_dtypes_and_default_value,
    get_enums,
    get_remapping_dict_from_schema,
)

image_schema = default_caipy_schema["properties"]["image"]
annotations_schema = default_caipy_schema["properties"]["annotations"]["items"]

To better understand how the data is flattened, we can look at some schema utility functions.

The `get_remapping_dict_from_schema` function constructs a nested dict whose values are the destination column names.

The `get_enums` function searches for arrays with unique items and retrieves all their possible values. These are used to construct boolean columns that tell us, for each possible value, whether it was in the original list for that row.
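To make the boolean columns concrete, here is a minimal sketch for a single row. The `booleanize` helper and the `prefix.value` column naming are assumptions made for the example, not the library's API; see the booleanization tutorial for the actual behavior.

```python
def booleanize(values: list, possible_values: list, prefix: str) -> dict:
    """Build one boolean entry per possible value, True if it is in `values`."""
    present = set(values)
    return {f"{prefix}.{value}": value in present for value in possible_values}


# Hypothetical subset of the possible colors listed in a schema
possible_colors = ["red", "white", "grey"]

row = booleanize(["red", "white"], possible_colors, prefix="attributes.colors")
print(row)
# {'attributes.colors.red': True, 'attributes.colors.white': True, 'attributes.colors.grey': False}
```

Because the possible values come from the schema rather than from the data, a column exists for `grey` even though no row in this example contains it.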

In [6]:
mr.JSON(get_remapping_dict_from_schema(image_schema))

In [7]:
enums = get_enums(annotations_schema)
# convert sets to list for json serialization
mr.JSON({k: list(v) for k, v in enums.items()})

In the following cells, we use the Caipy loader with and without the default schema. Notice that in the annotation data of image1, there is no "position" key in the dictionary.

If we load the Caipy dataset without a schema, we can still flatten the JSON data thanks to [pandas.json_normalize](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html), but the missing data will be set to `None`, while it should be an empty list.

Using the schema helps set the right default value.

In [8]:
from lours.dataset import from_caipy

with open(
    "../../test_lours/test_data/caipy_dataset/tags/small_tagged_dataset/Annotations/image1.json"
) as f:
    caipy_json = json.load(f)
mr.JSON(caipy_json["annotations"][0])
with open(
    "../../test_lours/test_data/caipy_dataset/tags/small_tagged_dataset/Annotations/image2.json"
) as f:
    caipy_json = json.load(f)
mr.JSON(caipy_json["annotations"][0])

In [9]:
no_schema_dataset = from_caipy(
    "../../test_lours/test_data/caipy_dataset/tags/small_tagged_dataset/",
    use_schema=False,
)

schema_dataset = from_caipy(
    "../../test_lours/test_data/caipy_dataset/tags/small_tagged_dataset/",
    use_schema=True,
    booleanize=False,
)


In [10]:
no_schema_dataset.annotations

| id | image_id | category_str | category_id | box_x_min | box_y_min | box_width | box_height | area | attributes.colors | attributes.occluded | attributes.position |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 269791 | 6091 | stop sign | 13 | 100.22 | 117.54 | 253.43 | 274.9 | 46726.4303 | [red, white] | True | |
| 1161234 | 10395 | teddy bear | 88 | 49.19 | 66.2 | 378.97 | 379.68 | 84587.4391 | [grey] | | [front] |


In [11]:
schema_dataset.annotations

| id | image_id | category_str | category_id | box_x_min | box_y_min | box_width | box_height | area | attributes.colors | attributes.occluded | attributes.position |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 269791 | 6091 | stop sign | 13 | 100.22 | 117.54 | 253.43 | 274.9 | 46726.4303 | [red, white] | True | [] |
| 1161234 | 10395 | teddy bear | 88 | 49.19 | 66.2 | 378.97 | 379.68 | 84587.4391 | [grey] | | [front] |


Note that the schema tool `fill_with_dtypes_and_default_value` also lets us fill in default values afterward.

In [12]:
fill_with_dtypes_and_default_value(annotations_schema, schema_dataset.annotations)

| id | image_id | category_str | category_id | box_x_min | box_y_min | box_width | box_height | area | attributes.colors | attributes.occluded | attributes.position |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 269791 | 6091 | stop sign | 13 | 100.22 | 117.54 | 253.43 | 274.9 | 46726.4303 | [red, white] | True | [] |
| 1161234 | 10395 | teddy bear | 88 | 49.19 | 66.2 | 378.97 | 379.68 | 84587.4391 | [grey] | | [front] |


### From dataframe to nested JSON

Once you have manipulated your dataset, you can save it again according to the schema.

In [13]:
from lours.dataset.io.schema_util import remap_dict

flat_dict = schema_dataset.annotations.iloc[0].to_dict()
mr.JSON(flat_dict)

nested_dict = remap_dict(flat_dict, get_remapping_dict_from_schema(annotations_schema))
mr.JSON(nested_dict)
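The round trip from dotted column names back to nested JSON can be sketched in plain Python. The `unflatten` helper below is a simplified, hypothetical stand-in that splits keys on dots; it does not use a remapping dictionary and is not the library's implementation.

```python
def unflatten(flat: dict) -> dict:
    """Rebuild nested dictionaries from a flat dictionary with dotted keys."""
    nested = {}
    for column, value in flat.items():
        *parents, leaf = column.split(".")
        node = nested
        for parent in parents:
            # Create intermediate objects on the fly
            node = node.setdefault(parent, {})
        node[leaf] = value
    return nested


# Hypothetical flat row, similar to one annotation of the dataframe
flat_row = {
    "category_str": "stop sign",
    "attributes.colors": ["red", "white"],
    "attributes.occluded": True,
}
nested_row = unflatten(flat_row)
print(nested_row)
# {'category_str': 'stop sign', 'attributes': {'colors': ['red', 'white'], 'occluded': True}}
```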

## Using a custom schema

If you have custom data that you want to work with, you can provide your own JSON schema instead of the ones provided by the official package.

The schema given to `from_caipy` and `from_caipy_generic` can be either a path to a JSON file or a dictionary.
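Accepting either form usually comes down to a small normalization step. The `resolve_schema` helper below is hypothetical and only illustrates the principle; how the library handles this internally is not shown in this notebook.

```python
import json
import tempfile
from pathlib import Path


def resolve_schema(schema):
    """Return a schema dictionary from either a dictionary or a path to a JSON file."""
    if isinstance(schema, dict):
        return schema
    with open(Path(schema)) as f:
        return json.load(f)


# Hypothetical minimal schema, written to a temporary file for the demonstration
small_schema = {"type": "object", "properties": {"name": {"type": "string"}}}

with tempfile.TemporaryDirectory() as folder:
    schema_path = Path(folder) / "small_schema.json"
    schema_path.write_text(json.dumps(small_schema))
    from_path = resolve_schema(schema_path)

from_dict = resolve_schema(small_schema)
print(from_path == from_dict)
```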

In the following schema, the value "turquoise" is now considered a valid value in the image's spectrum. In addition, the possible list items for annotation colors (`annotations["attributes"]["colors"]`) and annotation actions (`annotations["attributes"]["actions"]`) are reduced to only two each: "blue" and "white" for colors, and "sitting" and "laying" for actions.


In [14]:
with open("../../test_lours/test_data/caipy_dataset/tags/custom_schema.json") as f:
    custom_schema = json.load(f)

mr.JSON(custom_schema)

In [15]:
from lours.dataset import from_caipy_generic

As mentioned above, the following cell will fail because, by default, Caipy expects the CA-V5.b schema.

In [16]:
dataset = from_caipy_generic(
    images_folder=None,
    annotations_folder="../../test_lours/test_data/caipy_dataset/tags/custom_schema/",
    use_schema=True,
)

specifying a fictive path for images : ../../test_lours/test_data/caipy_dataset/tags/Images



ValidationError: "turquoise" is not one of ["red","green","yellow","blue","white","black","orange","purple","grey","brown","pink","beige","cyan"]

Failed validating "enum" in schema["properties"]["annotations"]["items"]["properties"]["attributes"]["properties"]["colors"]["items"]

On instance["annotations"][0]["attributes"]["colors"][1]:
    "turquoise"

This one will succeed.

Also note that there are fewer booleanized columns for `attributes.actions` and `attributes.colors`.

In [17]:
dataset = from_caipy_generic(
    images_folder=None,
    annotations_folder="../../test_lours/test_data/caipy_dataset/tags/custom_schema/",
    use_schema=True,
    json_schema=custom_schema,
)

specifying a fictive path for images : ../../test_lours/test_data/caipy_dataset/tags/Images



In [18]:
dataset
