# Booleanize columns of lists in your Dataset dataframes

This notebook shows how the `booleanize` and `debooleanize` methods can be used for easy attribute/tag filtering.

Booleanization is the action of converting a column of lists into a set of boolean columns. Each boolean column tells whether a given element is present in the original list or not.

It also shows how the display widget lets you choose between showing boolean values or list values.
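
To make the idea concrete, here is a minimal sketch in plain pandas, independent of lours. The dataframe, its `colors` column, and the `colors.<value>` naming are illustrative assumptions, not a description of how the library implements `booleanize` internally.

```python
import pandas as pd

# Hypothetical annotations table with one list-valued column.
annotations = pd.DataFrame(
    {"colors": [["red", "blue"], ["green"], []]},
    index=[10, 11, 12],
)

# Collect every distinct value found in the lists.
values = sorted({value for cell in annotations["colors"] for value in cell})

# Build one boolean column per value: True when the value is in the row's list.
booleanized = pd.DataFrame(
    {
        f"colors.{value}": [value in cell for cell in annotations["colors"]]
        for value in values
    },
    index=annotations.index,
)
print(booleanized)
```

Each row of `booleanized` carries the same information as the original list, apart from the order of the elements.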

In [1]:
%load_ext autoreload

%autoreload 2
import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)
import lours
from lours.dataset import from_caipy
from lours.utils.testing import assert_dataset_equal

## Booleanization example

### Note on widget interface

Selecting the "Annotations" tab shows that you can choose to display the dataframes as booleanized or not, and with nested columns or not.

Keep in mind that under the hood the columns are booleanized and not nested; the alternative views exist only to make the widget more readable.

## Automatic booleanization

By default, `from_caipy` booleanizes the dataset when `use_schema` is set to `True`.

See more info about advanced parsing with caipy and schemas in the related tutorial: [Demo schemas](6_demo_schemas.ipynb)

In [2]:
from_caipy(
    "../../test_lours/test_data/caipy_dataset/tags/small_tagged_dataset/",
    use_schema=True,
    booleanize=True,
)

  0%|          | 0/2 [00:00<?, ?it/s]

VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…

Also note that by default, columns are shown as they appear in the dataframe, i.e. raw columns with boolean values. The default display options can be changed in the `lours.utils` module.

In [3]:
lours.utils.DISPLAY_NESTED_COLUMNS = True
lours.utils.DISPLAY_UNBOOLEANIZED = True
from_caipy(
    "../../test_lours/test_data/caipy_dataset/tags/small_tagged_dataset/",
    use_schema=True,
    booleanize=True,
)

  0%|          | 0/2 [00:00<?, ?it/s]

VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…

## Manual Booleanization

If you set `booleanize=False` when loading with `from_caipy`, the original list column is kept as is.
To booleanize it manually, call the `.booleanize` method. Make sure that every cell of the columns you pass to that method contains an iterable (be it a set or a list).
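
One way to check that requirement before booleanizing is sketched below in plain pandas. The dataframe and the `only_collections` helper are hypothetical and not part of lours. The helper deliberately accepts only lists, sets, and tuples, since a string is also iterable but is rarely what you want to booleanize.

```python
import pandas as pd

# Hypothetical annotations: one clean column and one with mixed cell types.
annotations = pd.DataFrame(
    {
        "attributes.colors": [["red"], {"blue", "green"}, []],
        "attributes.size": ["small", ["large"], None],
    }
)


def only_collections(column: pd.Series) -> bool:
    """Return True if every cell of the column is a list, set, or tuple."""
    is_collection = column.map(lambda cell: isinstance(cell, (list, set, tuple)))
    return bool(is_collection.all())


print(only_collections(annotations["attributes.colors"]))
print(only_collections(annotations["attributes.size"]))
```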

In [4]:
dataset = from_caipy(
    "../../test_lours/test_data/caipy_dataset/tags/small_tagged_dataset/",
    json_schema="default",
    booleanize=False,
)

  0%|          | 0/2 [00:00<?, ?it/s]

In [5]:
dataset

VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…

In [6]:
booleanized = dataset.booleanize("attributes.colors")
booleanized

VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…

## Working with booleanized data

See how much simpler it is to filter annotations based on `attributes.colors`:

In this example, we are interested in keeping the annotations that have "red" in their "colors".

With a regular dataset, you need to call the very inefficient `.apply` method, which runs a Python function on every row.

In [7]:
dataset.loc_annot[dataset.annotations["attributes.colors"].apply(lambda x: "red" in x)]

VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…

With a booleanized dataset, you can index `.loc_annot` directly with the `attributes.colors.red` column.

In [8]:
booleanized.loc_annot[booleanized.annotations["attributes.colors.red"]]

VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…
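
The two filtering styles can be compared in plain pandas, without lours. In the sketch below, both dataframes are built by hand for illustration; only the column names mirror the ones used above. It also shows that boolean columns combine naturally with `&` and `~`, which would require a more involved lambda with `.apply`.

```python
import pandas as pd

# Hypothetical list-valued annotations.
annotations = pd.DataFrame(
    {"attributes.colors": [["red", "blue"], ["green"], ["red"]]}
)

# Hand-built booleanized equivalent of the same three rows.
booleanized = pd.DataFrame(
    {
        "attributes.colors.blue": [True, False, False],
        "attributes.colors.green": [False, True, False],
        "attributes.colors.red": [True, False, True],
    }
)

# List column: a Python function is evaluated once per row.
mask_apply = annotations["attributes.colors"].apply(lambda cell: "red" in cell)

# Boolean column: the column itself is already the mask.
mask_bool = booleanized["attributes.colors.red"]
assert mask_apply.tolist() == mask_bool.tolist()

# Masks compose with vectorized operators, e.g. "red but not blue".
red_not_blue = mask_bool & ~booleanized["attributes.colors.blue"]
print(booleanized[red_not_blue])
```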

## Debooleanization

Although the original list columns are dropped in favor of the boolean ones, we keep track of them in a special attribute, `Dataset.booleanized_columns`.

As such, we can use the `debooleanize` method to get back to the original dataset. Note that this method has to be used for several I/O methods:
 - `to_caipy`
 - `to_caipy_generic`
 - `to_coco`
 - `to_fiftyone`
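
Conceptually, debooleanization is the reverse mapping: for each row, collect the names of the boolean columns that are `True`. The following is a minimal pandas sketch of that idea, using a hand-built dataframe and an assumed `colors.` prefix; it is not how lours implements `debooleanize`.

```python
import pandas as pd

# Hypothetical booleanized table, as produced from a "colors" list column.
booleanized = pd.DataFrame(
    {
        "colors.blue": [True, False, False],
        "colors.green": [False, True, False],
        "colors.red": [True, False, False],
    }
)

prefix = "colors."
bool_columns = [name for name in booleanized.columns if name.startswith(prefix)]

# For each row, keep the suffix of every column whose flag is True.
colors = [
    [name[len(prefix):] for name, flag in zip(bool_columns, row) if flag]
    for row in booleanized[bool_columns].itertuples(index=False, name=None)
]
restored = pd.DataFrame({"colors": colors}, index=booleanized.index)
print(restored)
```

In this sketch the elements of each rebuilt list follow the column order, so the original ordering inside a list is not recovered.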

In [9]:
debool = booleanized.debooleanize()
assert_dataset_equal(debool, dataset)
debool

VBox(children=(HTML(value="<p><span style='white-space: pre-wrap; font-weight: bold'>Dataset object containing…