# 1. Load from disk and upload to the Hub

## Setup

In [None]:
import os
import urllib.request  # the submodule must be imported explicitly to use urlretrieve
import zipfile

from detection_datasets import DetectionDataset

## Download the files

The files (images and annotations) are stored in S3, and the links for downloading them are provided in a [GitHub repository](https://github.com/cvdfoundation/fashionpedia).


The dataset is formatted in the [COCO format](https://cocodataset.org/#format-data).
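
For orientation, a COCO annotation file is a single JSON object whose main lists are `images`, `annotations`, and `categories`. The sketch below builds a tiny example by hand. The file name, ids, and category name are made up for illustration, and the real Fashionpedia files carry additional fields (such as attributes) that are omitted here.

```python
# A hand-written, minimal COCO-style annotation object (illustrative values only).
coco = {
    "images": [
        {"id": 1, "file_name": "example.jpg", "width": 640, "height": 480},
    ],
    "categories": [
        {"id": 0, "name": "shirt"},  # hypothetical category
    ],
    "annotations": [
        {
            "id": 10,
            "image_id": 1,      # refers to an entry in "images"
            "category_id": 0,   # refers to an entry in "categories"
            "bbox": [100.0, 50.0, 200.0, 300.0],  # [x_min, y_min, width, height]
            "iscrowd": 0,
        },
    ],
}


def bbox_to_corners(bbox):
    """Convert a COCO [x, y, width, height] box to [x_min, y_min, x_max, y_max]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]


corners = bbox_to_corners(coco["annotations"][0]["bbox"])
print(corners)  # [100.0, 50.0, 300.0, 350.0]
```

Note that COCO boxes store width and height rather than the bottom-right corner, which is a common source of off-by-format bugs when switching between libraries.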

Link for the `train` images:  
https://s3.amazonaws.com/ifashionist-dataset/images/train2020.zip    
Link for the `validation` images:   
https://s3.amazonaws.com/ifashionist-dataset/images/val_test2020.zip   

Link for the `train` annotations:   
https://s3.amazonaws.com/ifashionist-dataset/annotations/instances_attributes_train2020.json    
Link for the `validation` annotations:   
https://s3.amazonaws.com/ifashionist-dataset/annotations/instances_attributes_val2020.json    

You may notice that the `test` split is absent: the dataset was part of a Kaggle competition, where submissions were evaluated on a held-out test set that is not public. 
See notebook 2 for how to create your own custom splits nevertheless.

Let's first define some constants:

In [None]:
# Download from S3
RAW_TRAIN_IMAGES = 'https://s3.amazonaws.com/ifashionist-dataset/images/train2020.zip'
RAW_VAL_IMAGES = 'https://s3.amazonaws.com/ifashionist-dataset/images/val_test2020.zip'
RAW_TRAIN_ANNOTATIONS = 'https://s3.amazonaws.com/ifashionist-dataset/annotations/instances_attributes_train2020.json'
RAW_VAL_ANNOTATIONS = 'https://s3.amazonaws.com/ifashionist-dataset/annotations/instances_attributes_val2020.json'

# to local disk
DATA_DIR = os.path.join(os.getcwd(), 'data')
TRAIN_ANNOTATIONS = 'train.json'
VAL_ANNOTATIONS = 'val.json'

And now download the images and annotations:

In [None]:
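def download(url, target):
    """Download `url`; zip archives are extracted into `target`, other files are saved as `target`."""
    
    # Images: download the archive to a temporary file, extract it, then clean up
    if url.endswith('.zip'):
        path, _ = urllib.request.urlretrieve(url=url)
        with zipfile.ZipFile(path, "r") as f:
            f.extractall(target)
            
        os.remove(path)
    
    # Annotations: save the JSON file directly at the target path
    else:
        urllib.request.urlretrieve(url=url, filename=target)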


os.makedirs(DATA_DIR, exist_ok=True)

download(url=RAW_TRAIN_ANNOTATIONS, target=os.path.join(DATA_DIR, TRAIN_ANNOTATIONS))
download(url=RAW_VAL_ANNOTATIONS, target=os.path.join(DATA_DIR, VAL_ANNOTATIONS))

download(url=RAW_TRAIN_IMAGES, target=DATA_DIR)
download(url=RAW_VAL_IMAGES, target=DATA_DIR)

Here are the files and directories we have just downloaded:

In [None]:
os.listdir(DATA_DIR)

Note that the validation images are in the 'test' folder.
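
Because the folder name does not match the split name, it can be worth checking that each annotation file lines up with its image directory. The following is a minimal sketch, not part of `detection_datasets`: `find_missing_images` is a hypothetical helper, and it assumes each `file_name` in the annotation file is a bare file name located directly inside the image directory. The demo runs on a temporary directory with made-up file names.

```python
import json
import os
import tempfile


def find_missing_images(annotation_path, images_dir):
    """Return file names listed in a COCO annotation file that are absent from images_dir."""
    with open(annotation_path) as f:
        coco = json.load(f)
    on_disk = set(os.listdir(images_dir))
    return [img["file_name"] for img in coco["images"] if img["file_name"] not in on_disk]


# Self-contained demo: one image exists on disk, one is only referenced in the JSON.
with tempfile.TemporaryDirectory() as tmp:
    images_dir = os.path.join(tmp, "test")
    os.makedirs(images_dir)
    open(os.path.join(images_dir, "a.jpg"), "wb").close()

    annotation_path = os.path.join(tmp, "val.json")
    with open(annotation_path, "w") as f:
        json.dump(
            {"images": [{"id": 1, "file_name": "a.jpg"}, {"id": 2, "file_name": "b.jpg"}]},
            f,
        )

    missing = find_missing_images(annotation_path, images_dir)

print(missing)  # ['b.jpg']
```

With the constants defined earlier, the same check could be run as `find_missing_images(os.path.join(DATA_DIR, VAL_ANNOTATIONS), os.path.join(DATA_DIR, 'test'))`; an empty list would mean every referenced image was found.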

## Read the downloaded files

In [None]:
config = {
    'dataset_format': 'coco',                        # the format of the dataset on disk
    'path': DATA_DIR,                                # where the dataset is located
    'splits': {                                      # how to read the files
        'train': (TRAIN_ANNOTATIONS, 'train'),       # name of the split (annotation file, images directory)
        'val': (VAL_ANNOTATIONS, 'test'),            # the validation archive is unzipped into 'test'
    },
}

dd = DetectionDataset().from_disk(**config)

## Analyse the data

### DataFrame

Internally the data is stored in a Pandas DataFrame.

It can be viewed grouped by image (the default):

In [None]:
dd.data      # This is the same as calling dd.get_data(index='image')

Or it can be viewed with one row for each annotation:

In [None]:
dd.get_data(index='bbox')
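
To make the difference between the two views concrete, here is a small sketch in plain pandas. It does not reproduce how `detection_datasets` stores or converts its data internally; the column names and values are hypothetical, and passing several columns to `explode` assumes pandas 1.3 or later.

```python
import pandas as pd

# Toy image-level table: one row per image, with annotations stored as lists.
per_image = pd.DataFrame(
    {
        "image_id": [1, 2],
        "category": [["shirt", "shoe"], ["hat"]],
        "bbox": [[[0, 0, 10, 10], [5, 5, 20, 20]], [[1, 2, 3, 4]]],
    }
).set_index("image_id")

# Annotation-level table: explode the list columns together so that
# each annotation gets its own row, repeating the image id as needed.
per_bbox = per_image.explode(["category", "bbox"]).reset_index()

print(len(per_image), len(per_bbox))  # 2 3
```

The image-level view is convenient for per-image operations such as splitting or plotting, while the annotation-level view makes it easier to count or filter by category.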

### Image

We can show an image and its annotations:

In [None]:
dd.show()

### Numbers

In [None]:
dd.n_images

In [None]:
dd.n_bbox

As mentioned earlier, there is no 'test' split here:

In [None]:
dd.splits

In [None]:
dd.split_proportions

We also see that more than 97.5% of the images belong to the training split.
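
As a rough illustration of what such proportions mean, they can be computed from one split label per image. This is only a sketch with made-up counts, not the library's implementation; `split_proportions` here is a hypothetical standalone function.

```python
from collections import Counter


def split_proportions(split_labels):
    """Return the fraction of items that fall in each split."""
    counts = Counter(split_labels)
    total = sum(counts.values())
    return {split: n / total for split, n in counts.items()}


# Toy labels: 39 training images for every validation image.
labels = ["train"] * 39 + ["val"] * 1
props = split_proportions(labels)
print(props)  # {'train': 0.975, 'val': 0.025}
```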

### Categories

There are 46 categories in this dataset. We can get their number and the full list of names:

In [None]:
dd.n_categories

In [None]:
dd.category_names

Let's also show the categories with their ids:

In [None]:
dd.categories

## Upload to the Hub

We can now upload the dataset to the Hugging Face Hub:

In [None]:
dd.to_hub(dataset_name='fashionpedia', repo_name='detection-datasets')