## Explore FLAIR labels
This notebook explores the distribution of users and labels in the FLAIR dataset.
The `labels_and_metadata.json` file contains one record per image, including its `user_id`, `labels` and `fine_grained_labels`.

`fine_grained_labels` are human-annotated labels from a taxonomy of 1,628 categories.
`labels` are higher-order categories: the 1,628 `fine_grained_labels` map to 17 `labels`.

For example, `fine_grained_labels: ["waffle", "bread"]` maps to `labels: ["food"]`.
`fine_grained_labels` are also expected to contain the 17 coarse-grained categories, because the subject of an image cannot always be assigned to a finer-grained category.

### Load the labels and metadata

In [1]:
import json
import os

dataset_dir = "../flair-data/"  # replace with the path to the directory where you downloaded the dataset

with open(os.path.join(dataset_dir, "labels_and_metadata.json")) as f:
    metadata_list = json.load(f)

print(f"Loaded metadata and labels for {len(metadata_list)} images")
print("Example metadata and labels for one image:\n" + json.dumps(metadata_list[0], indent=4))

Loaded metadata and labels for 429078 images
Example metadata and labels for one image:
{
    "user_id": "59769174@N00",
    "image_id": "14913474848",
    "labels": [
        "equipment",
        "material",
        "structure"
    ],
    "partition": "train",
    "fine_grained_labels": [
        "bag",
        "document",
        "furniture",
        "material",
        "printed_page"
    ]
}
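
The example record above lists `material` under both `labels` and `fine_grained_labels`, which matches the note that coarse-grained category names can also appear as fine-grained labels. A minimal sketch of how to collect all such shared names follows. The helper `shared_label_names` and the two sample records are hypothetical and only mimic the record layout shown above; on the real data you would pass `metadata_list` instead.

```python
def shared_label_names(records):
    """Return label names that occur both as a coarse and as a fine-grained label."""
    coarse = set()
    fine = set()
    for record in records:
        coarse.update(record["labels"])
        fine.update(record["fine_grained_labels"])
    return coarse & fine


# Hypothetical records that mimic the layout of labels_and_metadata.json
sample_records = [
    {"labels": ["equipment", "material"], "fine_grained_labels": ["bag", "material"]},
    {"labels": ["food"], "fine_grained_labels": ["waffle", "bread"]},
]

print(sorted(shared_label_names(sample_records)))
```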


### Count labels over users and images

In [2]:
from collections import Counter, defaultdict

# Per-image label counts (plain lists, one entry per image)
image_label_counter = []
image_fine_grained_counter = []

# Counters for label statistics
label_counter = Counter()
fine_grained_label_counter = Counter()

# Counters for user label statistics
user_image_counter = Counter()
user_label_counter = defaultdict(Counter)
user_fine_grained_label_counter = defaultdict(Counter)

n_train_images, n_val_images, n_test_images = 0, 0, 0
train_users, val_users, test_users = set(), set(), set()

for metadata in metadata_list:
    image_label_counter.append(len(metadata["labels"]))
    image_fine_grained_counter.append(len(metadata["fine_grained_labels"]))
    # Increment count for overall label distribution
    label_counter.update(metadata["labels"])
    fine_grained_label_counter.update(metadata["fine_grained_labels"])
    # Increment count for user label distribution
    user_image_counter[metadata["user_id"]] += 1
    user_label_counter[metadata["user_id"]].update(metadata["labels"])
    user_fine_grained_label_counter[metadata["user_id"]].update(metadata["fine_grained_labels"])
    # train/val/test counts (each image belongs to exactly one partition)
    if metadata["partition"] == "train":
        train_users.add(metadata["user_id"])
        n_train_images += 1
    elif metadata["partition"] == "val":
        val_users.add(metadata["user_id"])
        n_val_images += 1
    elif metadata["partition"] == "test":
        test_users.add(metadata["user_id"])
        n_test_images += 1

print(f"{len(label_counter)} unique labels\n" 
      f"{len(fine_grained_label_counter)} unique fine-grained labels\n" 
      f"{len(user_image_counter)} users")
print(f"Number of train/val/test images: {n_train_images}/{n_val_images}/{n_test_images}")
print(f"Number of train/val/test users: {len(train_users)}/{len(val_users)}/{len(test_users)}")

17 unique labels
1628 unique fine-grained labels
51414 users
Number of train/val/test images: 345879/39239/43960
Number of train/val/test users: 41131/5141/5142
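
The per-partition user counts add up to the total number of users (41131 + 5141 + 5142 = 51414), which suggests that no user appears in more than one partition. The sketch below shows one way to check this directly. The helpers `users_by_partition` and `overlapping_users` and the toy records are hypothetical; on the real data you would pass `metadata_list`.

```python
from collections import defaultdict
from itertools import combinations


def users_by_partition(records):
    """Group user IDs by the partition their images belong to."""
    users = defaultdict(set)
    for record in records:
        users[record["partition"]].add(record["user_id"])
    return dict(users)


def overlapping_users(users):
    """Return the user IDs shared by each pair of partitions (empty if disjoint)."""
    return {
        (a, b): users[a] & users[b]
        for a, b in combinations(sorted(users), 2)
    }


# Hypothetical records; only the two fields used here are included
sample_records = [
    {"user_id": "u1", "partition": "train"},
    {"user_id": "u1", "partition": "train"},
    {"user_id": "u2", "partition": "val"},
    {"user_id": "u3", "partition": "test"},
]

overlaps = overlapping_users(users_by_partition(sample_records))
print(all(len(shared) == 0 for shared in overlaps.values()))
```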


### Per-image label statistics

The table below displays the per-image label statistics. 
On average, each image has 2.79 labels and 4.61 fine-grained labels. 

In [3]:
import pandas as pd

pd.DataFrame(zip(image_label_counter, image_fine_grained_counter), columns=['label count', 'fine-grained label count']).describe()

| | label count | fine-grained label count |
|---|---|---|
| count | 429078.0 | 429078.0 |
| mean | 2.785249 | 4.614781 |
| std | 1.149218 | 2.729608 |
| min | 1.0 | 1.0 |
| 25% | 2.0 | 3.0 |
| 50% | 3.0 | 4.0 |
| 75% | 4.0 | 6.0 |
| max | 9.0 | 36.0 |


### Histogram of coarse-grained labels
The table below shows the counts of the 17 higher-order labels over all images.

In [4]:
pd.DataFrame(label_counter.most_common(), columns=['label', 'count'])

| | label | count |
|---|---|---|
| 0 | structure | 228923 |
| 1 | equipment | 225862 |
| 2 | material | 139733 |
| 3 | outdoor | 131322 |
| 4 | plant | 123363 |
| 5 | food | 110792 |
| 6 | animal | 68858 |
| 7 | liquid | 68677 |
| 8 | art | 37230 |
| 9 | interior_room | 32042 |
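
Because one image can carry several labels, these counts add up to more than the number of images. It can be easier to read them as the fraction of images that carry each label. The sketch below assumes each label appears at most once per image; the helper `label_fractions` and the toy counts are hypothetical.

```python
from collections import Counter


def label_fractions(label_counter, n_images):
    """Map each label to the fraction of images that carry it, most common first."""
    return {label: count / n_images for label, count in label_counter.most_common()}


# Hypothetical toy counts over 4 images
toy_counter = Counter({"structure": 3, "food": 1, "plant": 2})
fractions = label_fractions(toy_counter, n_images=4)
for label, fraction in fractions.items():
    print(f"{label}: {fraction:.2f}")
```

On the real data this would be called as `label_fractions(label_counter, len(metadata_list))`.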


### Histogram of the top 20 fine-grained labels
The table below shows the counts of the 20 most common fine-grained labels over all images.

In [5]:
sorted_fine_grained_labels = fine_grained_label_counter.most_common()
pd.DataFrame(sorted_fine_grained_labels[:20], columns=['Head fine-grained label', 'count'])

| | Head fine-grained label | count |
|---|---|---|
| 0 | wood_processed | 82336 |
| 1 | material | 74807 |
| 2 | structure | 61517 |
| 3 | grass | 44741 |
| 4 | plate | 42392 |
| 5 | plant | 42009 |
| 6 | blue_sky | 41043 |
| 7 | foliage | 40793 |
| 8 | cloudy | 36225 |
| 9 | tree | 33984 |
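
The table above covers only the head of the distribution. To look at the other end, a sketch along the following lines lists the labels that occur at most a given number of times. The helper `rare_labels`, the threshold, and the toy counts are hypothetical.

```python
from collections import Counter


def rare_labels(counter, max_count):
    """Return labels that occur at most `max_count` times, in alphabetical order."""
    return sorted(label for label, count in counter.items() if count <= max_count)


# Hypothetical toy counts standing in for fine_grained_label_counter
toy_counter = Counter({"wood_processed": 8, "grass": 4, "waffle": 1, "bag": 1})
tail = rare_labels(toy_counter, max_count=1)
print(f"{len(tail)} of {len(toy_counter)} labels occur at most once: {tail}")
```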


### Per-user image statistics
The table below displays the per-user image statistics.
On average, each user has 8.35 images. The distribution is strongly right-skewed with a long tail: the median is 2 images per user, while the most active user has 4,151.

In [6]:
pd.DataFrame(user_image_counter.values(), columns=["user image counts"]).describe()

| | user image counts |
|---|---|
| count | 51414.0 |
| mean | 8.345548 |
| std | 51.27598 |
| min | 1.0 |
| 25% | 1.0 |
| 50% | 2.0 |
| 75% | 5.0 |
| max | 4151.0 |
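
One way to quantify this skew is to ask what share of all images belongs to the most active users. The sketch below is a hedged illustration: the helper `top_user_share` and the toy counts are hypothetical, and the rounding rule for the number of top users is an arbitrary choice.

```python
def top_user_share(image_counts, top_fraction):
    """Fraction of all images owned by the most active `top_fraction` of users."""
    counts = sorted(image_counts, reverse=True)
    # Always include at least one user, even for very small fractions
    n_top = max(1, round(len(counts) * top_fraction))
    return sum(counts[:n_top]) / sum(counts)


# Hypothetical per-user image counts (10 users, 100 images in total)
toy_counts = [1, 1, 1, 2, 2, 3, 5, 5, 10, 70]
print(top_user_share(toy_counts, top_fraction=0.1))
```

On the real data this could be called as `top_user_share(user_image_counter.values(), 0.01)` to see the share held by the top 1% of users.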


### Per-user label statistics
The table below displays the per-user label statistics.
On average, each user has 4.6 distinct labels.

In [7]:
user_num_labels = [len(counter) for counter in user_label_counter.values()]
pd.DataFrame(user_num_labels, columns=["user label counts"]).describe()

| | user label counts |
|---|---|
| count | 51414.0 |
| mean | 4.608647 |
| std | 2.811109 |
| min | 1.0 |
| 25% | 3.0 |
| 50% | 4.0 |
| 75% | 6.0 |
| max | 17.0 |


### Per-user fine-grained label statistics
The table below displays the per-user fine-grained label statistics.
On average, each user has 16.8 distinct fine-grained labels.

In [8]:
user_num_fine_grained_labels = [len(counter) for counter in user_fine_grained_label_counter.values()]
pd.DataFrame(user_num_fine_grained_labels, columns=["user fine-grained label counts"]).describe()

| | user fine-grained label counts |
|---|---|
| count | 51414.0 |
| mean | 16.813242 |
| std | 32.666895 |
| min | 1.0 |
| 25% | 4.0 |
| 50% | 7.0 |
| 75% | 16.0 |
| max | 838.0 |
