# Prepare Dataset

*[Florian Roscheck](https://www.linkedin.com/in/florianroscheck/), 2024-03-30*

In this notebook, we prepare the TACO waste dataset for using it with machine learning on Azure ML. We will download the data and make it available as data asset on Azure ML. To use the labels with Azure Machine Learning, we have to transform them into a different format. Since, in this project, we do not want to leverage the full taxonomy of the TACO-supplied labels, but a more general taxonomy, we will process the labels. Finally, we will create an MLTable with the data that Azure ML can ingest.

## Download Dataset

In this section, we download the dataset. To do this, we will clone the [TACO dataset GitHub repository](https://github.com/pedropro/TACO) and then use the provided download script to receive images and annotations. Since the downloaded data directory structure does not match our best practices, we will move the files to a fitting location in the `data/raw` directory.

In [1]:
# Create directory for cloning repo into

!mkdir -p ../src/data/TACO

In [1]:
# Clone repo

!git clone https://github.com/pedropro/TACO ../src/data/TACO

Cloning into '../src/data/TACO'...
remote: Enumerating objects: 740, done.[K
remote: Counting objects: 100% (160/160), done.[K
remote: Compressing objects: 100% (72/72), done.[K
remote: Total 740 (delta 117), reused 128 (delta 88), pack-reused 580[K
Receiving objects: 100% (740/740), 98.70 MiB | 22.80 MiB/s, done.
Resolving deltas: 100% (494/494), done.
Updating files: 100% (25/25), done.


In [1]:
# Run data download script

!python ../src/data/TACO/download.py --dataset_path ../src/data/TACO/data/annotations.json

Note. If for any reason the connection is broken. Just call me again and I will start where I left.
Downloading: [..............................] - 1500/1500


In [1]:
# Create directory to move downloaded data into

!mkdir -p ../data/raw/TACO/images

In [1]:
# Move downloaded data into data/raw directory

from pathlib import Path

for dir in Path('../src/data/TACO/data/').glob('*'):
    for file in dir.glob('*'):
        if file.name.startswith('.aml'):
            # When manually exploring data through the Azure ML
            # frontend, Azure ML creates temporary files that 
            # we don't want in our image dataset. Here, we remove 
            # these files before moving the directory.
            file.unlink()
    if dir.is_dir():
        dir.rename(Path('../data/raw/TACO/images/').joinpath(dir.name))

Path('../src/data/TACO/data/annotations.json').rename('../data/raw/TACO/annotations.json')

PosixPath('../data/raw/TACO/annotations.json')

## Upload and Register the Dataset

In this section, we upload and register the dataset so we can use it in machine learning through Azure ML. We will upload the dataset to the blob storage attached to the Azure ML workspace. We register the dataset so that it becomes available as data asset and we can use it in machine learning workflows.

In [1]:
# Make a connection to the Azure ML workspace and
# its default blob storage so we can interact with it.

from azureml.core import Workspace

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

print('Data will be uploaded here:')
print(datastore)

Data will be uploaded here:
{
  "name": "***",
  "container_name": "***",
  "account_name": "***",
  "protocol": "https",
  "endpoint": "core.windows.net"
}


Now that we have a connection to the blob storage, we can upload the data to it. 

In [1]:
# Upload the data to the blob storage

from azureml.core import Dataset

# Here is the subdirectory which the data will be uploaded to 
# on the blob storage
blob_storage_path = 'data/raw/TACO'

uploaded_data = Dataset.File.upload_directory(
    src_dir='../data/raw/TACO/',
    target=(datastore, blob_storage_path),
    overwrite=True
    )

Validating arguments.
Arguments validated.
'overwrite' is set to True. Any file already present in the target will be overwritten.
Uploading files from '/mnt/batch/tasks/shared/LS_root/mounts/clusters/florian-inference-test/code/Users/florian.roscheck/trash_recognizer/notebooks/../data/raw/TACO' to 'data/raw/TACO'
Copying 66 files with concurrency set to 2
Copied /mnt/batch/tasks/shared/LS_root/mounts/clusters/florian-inference-test/code/Users/florian.roscheck/trash_recognizer/notebooks/../data/raw/TACO/batch_1/.amlignore, file 1 out of 66. Destination path: https://***.blob.core.windows.net/***/data/raw/TACO/batch_1/.amlignore
Copied /mnt/batch/tasks/shared/LS_root/mounts/clusters/florian-inference-test/code/Users/florian.roscheck/trash_recognizer/notebooks/../data/raw/TACO/batch_1/000001.jpg, file 2 out of 66. Destination path: https://***.blob.core.windows.net/***/data/raw/TACO/batch_1/000001.jpg
Copied /mnt/batch/tasks/shared/LS_root/mounts/clusters/florian-inference-test/code/User

Through the `uploaded_data` object we now have a `azureml.data.file_dataset.FileDataset` to work with. We can use its convenience method `register` to register it as a data asset on Azure ML.

In [1]:
# Register data asset in workspace
# Note that the description field is a great way to inform your team
# members of important information about the dataset, enabling them
# to use it.

uploaded_data.register(
    workspace=ws,
    name='TACO',
    description=(
        "Image dataset of trash pictures, from here: http://tacodataset.org/ ",
        "(open-source licenses, see https://github.com/pedropro/TACO/blob/master/data/annotations.json ",
        " and https://openlittermap.com/about)")
)

That's it! We have now uploaded and registered the dataset on Azure ML. This is the first step in using the data in our experiments.

## Transform Annotations

In order to use the annotations of the images in machine learning with AzureML, we have to convert them into the JSONL format. This is what we are going to do in this section.

The labels in the TACO dataset are provided in [COCO Format](http://cocodataset.org/#format-data). Unfortunately, Azure ML expects data to be in [JSONL](https://learn.microsoft.com/en-us/azure/machine-learning/reference-automl-images-schema?view=azureml-api-2#instance-segmentation) format for use with Azure automated machine learning. To not lose time here, we are going to leverage code written by [Microsoft](https://github.com/Azure/azureml-examples/tree/36296068c9292a37323a83bc6d1ae23a2c2bc87f/sdk/python/jobs/automl-standalone-jobs/jsonl-conversion) which implements this conversion from COCO to JSONL. Let's download this code and use it for transforming the labels!

In [None]:
# Create directory for JSONL conversion code

!mkdir -p ../src/features/jsonl_conversion

In [None]:
# Download conversion code

!wget https://raw.githubusercontent.com/Azure/azureml-examples/36296068c9292a37323a83bc6d1ae23a2c2bc87f/sdk/python/jobs/automl-standalone-jobs/jsonl-conversion/base_jsonl_converter.py -P ../src/features/jsonl_conversion/
!wget https://raw.githubusercontent.com/Azure/azureml-examples/36296068c9292a37323a83bc6d1ae23a2c2bc87f/sdk/python/jobs/automl-standalone-jobs/jsonl-conversion/coco_jsonl_converter.py -P ../src/features/jsonl_conversion/
!wget https://raw.githubusercontent.com/Azure/azureml-examples/36296068c9292a37323a83bc6d1ae23a2c2bc87f/sdk/python/jobs/automl-standalone-jobs/jsonl-conversion/masktools.py -P ../src/features/jsonl_conversion/

To execute the newly downloaded code, we need to install some dependencies. Let's do that now!

In [None]:
%pip install pycocotools simplification

Now we are ready to convert the TACO COCO file to JSONL. Let's import the newly downloaded code.

In [1]:
# Import downloaded code

import sys

sys.path.append('../src/features/jsonl_conversion')

from base_jsonl_converter import write_json_lines
from coco_jsonl_converter import COCOJSONLConverter

Although we try to reuse as much as possible, there is one conflict between the TACO data and the `COCOJSONLConverter`: In its method `_populate_image_url`, it removes all directories from the image path:

```python

def _populate_image_url(self, index, coco_image):
    """
    populates image url for jsonl entry

    Parameters:
        index (int): image entry index
        coco_image (dict): image entry from coco data file
    """
    image_url = coco_image["file_name"]
    self.jsonl_data[index]["image_url"] = (
        self.base_url + image_url[image_url.rfind("/") + 1 :] # <-- removal of directories
    )
    self.image_id_to_data_index[coco_image["id"]] = index

```

Since the images are stored in subdirectories in the TACO dataset, and the image paths in `annotations.json` file point to these subdirectories, we need to modify the method to keep the subdirectories. Let's do this quickly:

In [1]:
# Create a new class with a patched _populate_image_url method
# that keeps the subdirectories.

class ModifiedCOCOJSONLConverter(COCOJSONLConverter):
    def _populate_image_url(self, index, coco_image):
        """
        populates image url for jsonl entry

        Parameters:
            index (int): image entry index
            coco_image (dict): image entry from coco data file
        """
        image_url = coco_image["file_name"]
        self.jsonl_data[index]["image_url"] = (
            self.base_url + image_url
        )
        self.image_id_to_data_index[coco_image["id"]] = index

Now, to have any existing directory reference in the TACO COCO annotations point to the right location, we will have to inform the converter of the location of the TACO images on our blob storage. To get the correct directory here, we will leverage the Azure ML Python library and get a reference to the data asset we created earlier.

In [1]:
# Connect Azure ML client to Azure ML workspace

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(), 
    ws.subscription_id, 
    ws.resource_group, 
    ws.name
)
data_asset = ml_client.data.get("TACO", version="1")

print(data_asset.path)

azureml://subscriptions/***/resourcegroups/***/workspaces/***/datastores/***/paths/data/raw/TACO/


Now that we have the libraries imported, patched, and the new path to the images set, we can use the `COCOJSONLConverter` to convert the TACO-supplied `annotations.json` file to a JSONL file we can use with Azure ML.

In [1]:
# Convert TACO-supplied annotations.json file to JSONL

from pathlib import Path

converter = ModifiedCOCOJSONLConverter(
    base_url=data_asset.path, 
    coco_file='../data/raw/TACO/annotations.json'
)

target_file = Path('../data/processed/TACO/annotations.jsonl')
target_file.parent.mkdir(parents=True, exist_ok=True)

write_json_lines(converter, target_file)

Conversion completed. Converted 1500 lines.


## Process Labels

The TACO dataset offers a [rich taxonomy](http://tacodataset.org/taxonomy). But for sorting trash in a modern society, usually just a couple of categories are needed. It is mostly important into which trash can a specific item goes, not what the item specifically is. For example, a cardbox and a paper wrapper both go into the paper recycling trash can.

> Please note that the simplification presented here is a simplified and opinionated categorization based on experience living in Germany and being subject to the German recycling system – it is not an expert-informed assessment. Some important categories like battery or non-glass bottle recycling are not considered. If you want to implement trash sorting for the "real world", it is advisable you understand the trash processing and recycling system in the environment you are aiming to use the system in. The purpose of the categorization presented here is to show how labels of existing datasets can be manipulated before training with Azure ML, not to build a highly reliable trash categorization system.

Here are the categories (trash cans) we want to sort trash into:
- Blue: Paper recycling
- Yellow: Plastics, metal, composite material recycling
- Glass: Glass recycling
- Other: Anything else

Let's first have a glance at the existing labels and then develop a mapping mechanism for simplifying them so they end up as the categories listed above.

In [1]:
# Read in JSONL file as Pandas DataFrame

import pandas as pd

df = pd.read_json(target_file, lines=True)
df.head()

Unnamed: 0,image_url,image_details,label
0,azureml://subscriptions/***...,"{'format': 'jpg', 'width': 1537, 'height': 2049}","[{'label': 'Glass bottle', 'polygon': [[0.3649..."
1,azureml://subscriptions/***...,"{'format': 'jpg', 'width': 1537, 'height': 2049}","[{'label': 'Meal carton', 'polygon': [[0.60377..."
2,azureml://subscriptions/***...,"{'format': 'jpg', 'width': 1537, 'height': 2049}","[{'label': 'Clear plastic bottle', 'polygon': ..."
3,azureml://subscriptions/***...,"{'format': 'jpg', 'width': 2049, 'height': 1537}","[{'label': 'Clear plastic bottle', 'polygon': ..."
4,azureml://subscriptions/***...,"{'format': 'jpg', 'width': 1537, 'height': 2049}","[{'label': 'Drink can', 'polygon': [[0.6714378..."


In [1]:
# Extract labels and sort by occurence

labels_series = df['label'].explode().apply(lambda x: x['label'])
labels_series.value_counts(normalize=True)

label
Cigarette                    0.139423
Unlabeled litter             0.108069
Plastic film                 0.094273
Clear plastic bottle         0.059574
Other plastic                0.057065
Other plastic wrapper        0.054348
Drink can                    0.047868
Plastic bottle cap           0.043687
Plastic straw                0.032818
Broken glass                 0.028846
Styrofoam piece              0.023411
Disposable plastic cup       0.021739
Glass bottle                 0.021739
Pop tab                      0.020694
Other carton                 0.019440
Normal paper                 0.017140
Metal bottle cap             0.016722
Plastic lid                  0.016095
Paper cup                    0.014005
Corrugated carton            0.013378
Aluminium foil               0.012960
Single-use carrier bag       0.012751
Other plastic bottle         0.010452
Drink carton                 0.009406
Tissues                      0.008779
Crisp packet                 0.008152
Dispos

Based on the unique labels, we can now develop a mapping function. To facilitate, we first transform all labels to lowercase. Then, we go through the list above and manually extract as many key terms as necessary to build a mapping function that encompasses all existing labels.

In [1]:
# Build mapping function

# Transform all labels to lowercase
labels = set([label.lower() for label in labels_series.values.tolist()])

# Collect key terms
# (To belong to a specific category, the term needs to appear in the existing
# label in full.)
key_terms = {'yellow': ('plastic', 'six pack rings', 'carrier bag', 'can', 'aluminum', 'aluminium', 
                        'metal', 'foam', 'polypropylene', 'tupperware', 'aerosol', 'garbage bag',
                        'pop tab'), 
            'blue': ('paper', 'carton', 'toilet tube'), 
            'glass': ('glass',)}

# Produce mapping
bins = {}
for keyword, items in key_terms.items():
    if keyword not in bins:
        bins[keyword] = []
    for item in items:
        bins[keyword].extend([label for label in labels if item in label])
    labels = labels - set(bins[keyword])
bins['other'] = list(labels)

Let's sanity check the mapping. We want to see a list of the original item category and, next to it, the newly assigned category.

In [1]:
bins_inverse = {}
for keyword, items in bins.items():
    for item in items:
        bins_inverse[item] = keyword

bins_inverse

{'plastic straw': 'yellow',
 'disposable plastic cup': 'yellow',
 'other plastic cup': 'yellow',
 'plastic film': 'yellow',
 'plastic utensils': 'yellow',
 'plastic lid': 'yellow',
 'other plastic wrapper': 'yellow',
 'plastic glooves': 'yellow',
 'other plastic bottle': 'yellow',
 'plastic bottle cap': 'yellow',
 'clear plastic bottle': 'yellow',
 'other plastic': 'yellow',
 'other plastic container': 'yellow',
 'six pack rings': 'yellow',
 'single-use carrier bag': 'yellow',
 'drink can': 'yellow',
 'food can': 'yellow',
 'aluminium foil': 'yellow',
 'aluminium blister pack': 'yellow',
 'metal bottle cap': 'yellow',
 'metal lid': 'yellow',
 'scrap metal': 'yellow',
 'styrofoam piece': 'yellow',
 'foam cup': 'yellow',
 'foam food container': 'yellow',
 'polypropylene bag': 'yellow',
 'tupperware': 'yellow',
 'aerosol': 'yellow',
 'garbage bag': 'yellow',
 'pop tab': 'yellow',
 'paper straw': 'blue',
 'paper cup': 'blue',
 'magazine paper': 'blue',
 'normal paper': 'blue',
 'paper bag'

This mapping generally looks good, but for some items we might need a more differentiated category. For example, for a pizza box, we might risk discarding it into the wrong trash can, other, if it is made out of paper and overall clean – in which case it should go into the blue can. But for the sake of this exercise, the quality of the mapping it sufficient.

Let's apply it to the annotations!

In [1]:
# Replace labels in JSONL DataFrame

def replace_label(row):
    for entry in row:
        entry['label'] = bins_inverse[entry['label'].lower()]
    return row

df['label'].apply(replace_label)

0       [{'label': 'glass', 'polygon': [[0.36499674690...
1       [{'label': 'blue', 'polygon': [[0.603773584905...
2       [{'label': 'yellow', 'polygon': [[0.4359141184...
3       [{'label': 'yellow', 'polygon': [[0.1727672035...
4       [{'label': 'yellow', 'polygon': [[0.6714378659...
                              ...                        
1495    [{'label': 'blue', 'polygon': [[0.545504385964...
1496    [{'label': 'yellow', 'polygon': [[0.5526315789...
1497    [{'label': 'glass', 'polygon': [[0.10350000000...
1498    [{'label': 'blue', 'polygon': [[0.211622807017...
1499    [{'label': 'yellow', 'polygon': [[0.75, 0.4437...
Name: label, Length: 1500, dtype: object

Great! Now, we should write the modified data into JSONL format so we can use it with Azure ML. We are going to leverage Pandas' `DataFrame.to_json()` functionality. There is just one gotcha: The `to_json()` command will escape `/` characters like `\/` – but this affects the file paths stored in the JSONL file. So, this is why we write the JSONL file to a string buffer and manipulate the buffer to undo the escaping before we finally write it to the labels file.

In [1]:
# Write new mapping to JSONL file

from io import StringIO

buffer = StringIO()
df.to_json(buffer, lines=True, orient='records')

unescaped_buffer = buffer.getvalue().replace('\/','/')

modified_annotations_file = Path('../data/processed/TACO/annotations_modified.jsonl')

with open(modified_annotations_file, 'w') as file:
    file.write(unescaped_buffer)

Perfect! We have modified the annotations and are now ready for the final step of data preparation.

## Create MLTable

Azure ML uses so-called "Azure Machine Learning tables" (find out more on the [Azure ML website](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-mltable?view=azureml-api-2&tabs=cli) to facilitate working with data. These tables are like blueprints for loading and modifying data. I personally find them especially useful for computer vision tasks, as their native support on Azure ML helps leverage Azure ML's features to the fullest. Let's create an Azure ML table for the newly created annotations file.

In [1]:
# Create directory for MLTable and move annotations file
# into this directory

mltable_path = Path('../data/processed/TACO_labels')
mltable_path.mkdir(parents=True, exist_ok=True)

Path(modified_annotations_file).rename(mltable_path.joinpath(f'./{modified_annotations_file.name}'))


PosixPath('../data/processed/TACO_labels/annotations_modified.jsonl')

In [1]:
# Create MLTable file

content = f"""paths:
  - file: ./{modified_annotations_file.name}
transformations:
  - read_json_lines:
        encoding: utf8
        invalid_lines: error
        include_path_column: false
  - convert_column_types:
      - columns: image_url
        column_type: stream_info
"""

with open(str(mltable_path) + '/MLTable', 'w') as file:
    file.write(content)

Now we can register the file as a data asset on Azure ML so we can use it to train a model with the added benefit that any training job on Azure ML refers to this data asset. This enables model inputs to be reproducible.

In [1]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# Define the Data asset object
dataset = Data(
    path=mltable_path,
    type=AssetTypes.MLTABLE,
    description=("Annotations for TACO dataset, with yellow, blue, ",
                 "glass, and other classes of trash"),
    name="TACO-annotations",
    version="1",
)

# Create the data asset in the workspace
ml_client.data.create_or_update(dataset)

Data({'skip_validation': False, 'mltable_schema_url': None, 'referenced_uris': ['./annotations_modified.jsonl'], 'type': 'mltable', 'is_anonymous': False, 'auto_increment_version': False, 'auto_delete_setting': None, 'name': 'TACO-annotations', 'description': "('Annotations for TACO dataset, with yellow, blue, ', 'glass, and other classes of trash')", 'tags': {}, 'properties': {}, 'print_as_yaml': True, 'id': '/subscriptions/***/resourceGroups/***/providers/Microsoft.MachineLearningServices/workspaces/***/data/TACO-annotations/versions/1', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/florian-inference-test/code/Users/florian.roscheck/trash_recognizer/notebooks', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7fb2f7514d00>, 'serialize': <msrest.serialization.Serializer object at 0x7fb2f75178b0>, 'version': '1', 'latest_version': None, 'path': 'azureml://subscriptions/***/resourcegroups/***/workspaces/***/data

If you now explore the dataset in the UI of Azure ML, you will notice that the preview window shows images with labels included!