**Objective:** download images of the required classes from Open Images and save them in the ImageClassificationDirectoryTree format.

**Data:** Open Images Dataset V6 + Extensions ([source](https://storage.googleapis.com/openimages/web/index.html)).

## 1. Setup

Images from Open Images can be downloaded in several ways, all of which are described on the corresponding page of the source. Here we use the open-source FiftyOne tool ([documentation](https://voxel51.com/docs/fiftyone/)), which we need to install first.

In [None]:
!pip install fiftyone

We also mount Google Drive so that the selected images can be saved there in the layout we need.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive



The Open Images dataset is available directly from the FiftyOne Dataset Zoo.

In [None]:
import fiftyone as fo
import fiftyone.zoo as foz

In [None]:
# Import ViewField, the class used to build expressions on sample fields
from fiftyone import ViewField as F

## 2. Downloading images

For the classification task solved in the other notebooks, we chose two classes, "Rabbit" and "Squirrel", so we only need images of these classes from the training, validation, and test splits. We could simply download them, but the resulting storage layout does not suit us well, and not every downloaded image will turn out to belong to one of our classes. So we first load the images into a FiftyOne dataset and then export them in a format that is convenient for us.
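
Before committing to the full download, it can help to try the same call on a handful of images. The sketch below assumes that `load_zoo_dataset` accepts `splits` and, for Open Images, `max_samples` (a cap that, as far as we understand the FiftyOne documentation, applies per split). The helper names `build_load_kwargs` and `load_trial_dataset` are made up for illustration and are not part of this notebook.

```python
def build_load_kwargs(classes, splits=("train", "validation", "test"), max_samples=None):
    """Collect keyword arguments for foz.load_zoo_dataset (hypothetical helper)."""
    kwargs = {
        "splits": list(splits),
        "label_types": ["classifications"],
        "classes": list(classes),
    }
    if max_samples is not None:
        # Assumption: limits the number of samples loaded per split
        kwargs["max_samples"] = max_samples
    return kwargs


def load_trial_dataset(classes, max_samples=50):
    """Load a small trial subset; FiftyOne is imported only when this is called."""
    import fiftyone.zoo as foz

    return foz.load_zoo_dataset(
        "open-images-v6", **build_load_kwargs(classes, max_samples=max_samples)
    )


# Inspect the arguments without downloading anything
trial_kwargs = build_load_kwargs(["Rabbit", "Squirrel"], max_samples=50)
print(trial_kwargs)
```

Calling `load_trial_dataset(['Rabbit', 'Squirrel'])` in Colab would then fetch only a small subset, which is enough to check the label fields before starting the full download.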

In [None]:
# Loading an image dataset
dataset_img = foz.load_zoo_dataset(
    'open-images-v6',
    label_types=['classifications'],
    classes=['Rabbit', 'Squirrel'],
    dataset_name='rabbit_squirrel_image',
)

Downloading split 'train' to '/root/fiftyone/open-images-v6/train'
Downloading https://storage.googleapis.com/openimages/v5/class-descriptions-boxable.csv to /root/fiftyone/open-images-v6/train/metadata/classes.csv
Downloading https://storage.googleapis.com/openimages/2018_04/bbox_labels_600_hierarchy.json to /tmp/tmppgndgntz/metadata/hierarchy.json
Downloading https://storage.googleapis.com/openimages/v5/train-annotations-human-imagelabels-boxable.csv to /root/fiftyone/open-images-v6/train/labels/classifications.csv
Downloading train samples
 100% |███████████████| 3961/3961 [4.6m elapsed, 0s remaining, 15.2 samples/s]      
Found 3961 samples
Downloading split 'test' to '/root/fiftyone/open-images-v6/test'
Downloading https://storage.googleapis.com/openimages/v5/class-descriptions-boxable.csv to /root/fiftyone/open-images-v6/test/metadata/classes.csv
Downloading https://storage.googleapis.com/openimages/2018_04/bbox_labels_600_hierarchy.json to /tmp/tmpj8gefijp/metadata/hierarchy.jso

In total, for both classes combined, we loaded 3961 samples for the training split, 145 for validation, and 488 for test.
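
As a quick sanity check, the per-split counts can be added up and compared with the total of 4594 samples that the dataset summary reports. This is plain arithmetic on the numbers quoted in this notebook. With a live dataset, `dataset_img.count_sample_tags()` should return the per-split counts, assuming that method is available in the installed FiftyOne version.

```python
# Per-split sample counts quoted in the text above
split_counts = {"train": 3961, "validation": 145, "test": 488}

# Total number of loaded samples across all splits
total = sum(split_counts.values())

# Share of each split in percent, rounded to one decimal place
shares = {split: round(100 * n / total, 1) for split, n in split_counts.items()}

print(total)
print(shares)
```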

In [None]:
# Look at the schema of the dataset
dataset_img

Name:        rabbit_squirrel_image
Media type:  image
Num samples: 4594
Persistent:  False
Tags:        ['test', 'train', 'validation']
Sample fields:
    filepath:        fiftyone.core.fields.StringField
    tags:            fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:        fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
    positive_labels: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classifications)
    negative_labels: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classifications)
    open_images_id:  fiftyone.core.fields.StringField
    ground_truth:    fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)

As we can see, the dataset has both positive and negative labels. According to the documentation, a positive label indicates that an object of the given class is present in the image, and a negative label indicates that it is absent. We are interested in the positive labels, so we will work with them from here on.

If we look at how the downloaded images are stored on disk, we see the following layout:

```
dataset_dir/
    info.json
    train/
        metadata/
            hierarchy.json
            classes.csv
        labels/
            classifications.csv
        data/
           image1.jpg
           image2.jpg
           ...
    validation/
        ...
    test/
        ...
```



The "validation" and "test" directories are organized in the same way as the "train" directory.

This layout is probably convenient for machine processing, but not for manually checking whether the images are labeled correctly, removing some of them from a class, and so on. So we export the data to a different storage format that also separates the images by class. To do this, we need a new sample-level field that holds the name of the class the image belongs to, derived from its positive labels.
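
The idea can be sketched in plain Python, independent of FiftyOne. The toy records below are invented for illustration; each one lists the classes for which an image has a positive label. The loop mirrors the approach of the next cell: for every class of interest, each image with a matching positive label receives that class (in lowercase) as its single label.

```python
CLASSES = ["Rabbit", "Squirrel"]

# Toy samples: image id -> classes with a positive label (made-up data)
positive_labels = {
    "img_a": ["Squirrel"],
    "img_b": ["Rabbit"],
    "img_c": [],                      # e.g. only a negative label was present
    "img_d": ["Rabbit", "Squirrel"],  # both animals in one image
}

# Every image starts without a class
ground_truth = {image_id: None for image_id in positive_labels}

for label in CLASSES:
    for image_id, labels in positive_labels.items():
        if label in labels:
            # A class processed later overwrites one assigned earlier
            ground_truth[image_id] = label.lower()

print(ground_truth)
```

Note that an image with positive labels for both classes ends up in whichever class is processed last, which is worth keeping in mind when checking the exported folders by hand.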

In [None]:
# Set the 'ground_truth' field on every sample with a positive label for one of our classes
for label in ['Rabbit', 'Squirrel']:
    for sample in dataset_img.filter_labels('positive_labels', F('label') == label):
        sample['ground_truth'] = fo.Classification(label=label.lower())
        sample.save()

In [None]:
# Create a view of the training split to inspect examples
dataset_img_train_view = dataset_img.match_tags('train')

In [None]:
# Look at the first sample of the selected classes
dataset_img_train_view.filter_labels('ground_truth', F('label').is_in(['rabbit', 'squirrel'])).first()

<SampleView: {
    'id': '60acdc100135b5d26aa8efba',
    'media_type': 'image',
    'filepath': '/root/fiftyone/open-images-v6/train/data/760961d593750514.jpg',
    'tags': BaseList(['train']),
    'metadata': None,
    'positive_labels': <Classifications: {
        'classifications': BaseList([
            <Classification: {
                'id': '60acdc100135b5d26aa8efb9',
                'tags': BaseList([]),
                'label': 'Squirrel',
                'confidence': 1.0,
                'logits': None,
            }>,
        ]),
        'logits': None,
    }>,
    'negative_labels': <Classifications: {'classifications': BaseList([]), 'logits': None}>,
    'open_images_id': '760961d593750514',
    'ground_truth': <Classification: {
        'id': '60ace42f0135b5d26aa97c4c',
        'tags': BaseList([]),
        'label': 'squirrel',
        'confidence': None,
        'logits': None,
    }>,
}>

The output above shows that the new field has been added to the sample and that its class matches the positive label.

In [None]:
# Look at the first sample that does not belong to the selected classes
dataset_img_train_view.match(~F('ground_truth.label').is_in(['rabbit', 'squirrel'])).first()

<SampleView: {
    'id': '60acdc100135b5d26aa8efc6',
    'media_type': 'image',
    'filepath': '/root/fiftyone/open-images-v6/train/data/f7b7a26d252f9c6a.jpg',
    'tags': BaseList(['train']),
    'metadata': None,
    'positive_labels': <Classifications: {'classifications': BaseList([]), 'logits': None}>,
    'negative_labels': <Classifications: {
        'classifications': BaseList([
            <Classification: {
                'id': '60acdc100135b5d26aa8efc5',
                'tags': BaseList([]),
                'label': 'Rabbit',
                'confidence': 0.0,
                'logits': None,
            }>,
        ]),
        'logits': None,
    }>,
    'open_images_id': 'f7b7a26d252f9c6a',
    'ground_truth': None,
}>

But if an image has no positive label for either of our classes, the new field stays empty (`None`), so such images form a separate, unlabeled group.

We can now export the dataset to a different storage format (ImageClassificationDirectoryTree), keeping the split into training, validation, and test data and storing only the images that belong to our two classes. If we exported the entire dataset, the unlabeled images would be saved in a separate "_unlabeled" folder.
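
The routing of images to folders can be pictured with a small helper. This is only an illustration of the layout described above, not FiftyOne's exporter; the function name `target_folder` and the example root path are hypothetical.

```python
import posixpath


def target_folder(root, split, label):
    """Return the folder an image would land in for a given split and class label."""
    # Images without a class label go to the "_unlabeled" folder
    class_dir = label if label is not None else "_unlabeled"
    return posixpath.join(root, split, class_dir)


print(target_folder("/data/images_by_class", "train", "rabbit"))
print(target_folder("/data/images_by_class", "test", None))
```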

In [None]:
# Specify the directory for storing images
image_store_dir = '/content/drive/MyDrive/Colab_Notebooks/Rabbit_Squirrel_Project/Images_by_class/'

In [None]:
# Export the dataset to the selected format
import os

label_field = 'ground_truth'

for split in ['train', 'validation', 'test']:
    # One sub-directory per split inside the target directory
    export_dir = os.path.join(image_store_dir, split)
    # Keep only the samples of this split that belong to our two classes
    (dataset_img.match_tags(split)
                .match(F('ground_truth.label').is_in(['rabbit', 'squirrel']))
                .export(export_dir=export_dir, label_field=label_field,
                        dataset_type=fo.types.ImageClassificationDirectoryTree))

Directory '/content/drive/MyDrive/Colab_Notebooks/Rabbit_Squirrel_Project/Images_by_class/train' already exists; export will be merged with existing files
 100% |███████████████| 2964/2964 [12.0m elapsed, 0s remaining, 4.7 samples/s]      
Directory '/content/drive/MyDrive/Colab_Notebooks/Rabbit_Squirrel_Project/Images_by_class/validation' already exists; export will be merged with existing files
 100% |█████████████████| 105/105 [27.9s elapsed, 0s remaining, 3.5 samples/s]      
Directory '/content/drive/MyDrive/Colab_Notebooks/Rabbit_Squirrel_Project/Images_by_class/test' already exists; export will be merged with existing files
 100% |█████████████████| 326/326 [1.5m elapsed, 0s remaining, 3.4 samples/s]      


Only 2964 images from the training split, 105 from validation, and 326 from test were exported, which means that only these images have a positive label for one of our classes.
After the export, the images are stored on disk in the following layout:

```
image_store_dir/
    train/
        rabbit/
            image1.jpg
            image2.jpg
            ...
        squirrel/
            image3.jpg
            ...
    validation/
        ...
    test/
        ...
```

As before, the "validation" and "test" directories are organized in the same way as the "train" directory.
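
To confirm that the export matches the numbers above, the files in each class folder can be counted. The sketch below uses only the standard library; `count_images_by_class` is a hypothetical helper, and the demo builds a tiny throwaway tree of empty placeholder files instead of touching Google Drive.

```python
import tempfile
from pathlib import Path


def count_images_by_class(root):
    """Return {split: {class_name: file_count}} for a split/class/file directory tree."""
    counts = {}
    for split_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        counts[split_dir.name] = {
            class_dir.name: sum(1 for f in class_dir.iterdir() if f.is_file())
            for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir())
        }
    return counts


# Build a small demo tree with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    demo_layout = [("train", "rabbit", 3), ("train", "squirrel", 2), ("test", "rabbit", 1)]
    for split, label, n in demo_layout:
        class_dir = Path(tmp) / split / label
        class_dir.mkdir(parents=True)
        for i in range(n):
            (class_dir / f"image{i}.jpg").touch()
    demo_counts = count_images_by_class(tmp)

print(demo_counts)
```

Pointing the helper at `image_store_dir` should give per-split totals of 2964, 105, and 326, provided the export directories did not already contain files from an earlier run.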

Our objective is complete. We can now delete the originally downloaded dataset, because we no longer need it.

In [None]:
# Delete a local copy of the dataset
foz.delete_zoo_dataset('open-images-v6')