# Datasets Overlap Assessment

#### Here we will collect the number of images per class per dataset and analyze overlap.
From the mapping, we know that neither CIFAR-100 nor ImageNet-1000 has Wasp or Mosquito. Additionally, CIFAR-100 does not have Ant, Dragonfly, Fly, Grasshopper, or Ladybug.

Class overlap:
- Clean: 11 classes
- CIFAR-100: 4 of the clean classes map (4 fine matches)
- ImageNet-1000: 9 of the clean classes map (27 fine matches)
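
The overlap counts above can be reproduced with plain set arithmetic. This is a minimal sketch, assuming the 11 clean classes are the ones named in this notebook (the 8 folders loaded below plus Beetle, Mosquito, and Wasp). That list is inferred from the text here, not read from `utils.label_mappings`.

```python
# Assumed list of the 11 clean classes (inferred from this notebook's text).
clean_classes = {
    "Ant", "Bee", "Beetle", "Butterfly", "Dragonfly", "Fly",
    "Grasshopper", "Ladybug", "Mosquito", "Spider", "Wasp",
}

# Classes with no counterpart, as stated in the paragraph above.
missing_in_both = {"Wasp", "Mosquito"}
missing_in_cifar100 = missing_in_both | {
    "Ant", "Dragonfly", "Fly", "Grasshopper", "Ladybug",
}
missing_in_imagenet1k = set(missing_in_both)

# Clean classes that do have a counterpart in each dataset.
cifar100_overlap = clean_classes - missing_in_cifar100
imagenet1k_overlap = clean_classes - missing_in_imagenet1k

print(len(clean_classes), sorted(cifar100_overlap), len(imagenet1k_overlap))
```

The "fine matches" counts (4 and 27) depend on how many dataset labels map to each clean class, so they cannot be derived from these sets alone.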

In [1]:
import sys
import os
from PIL import Image
import matplotlib.pyplot as plt
import pandas as pd

sys.path.append(os.path.abspath('..'))

from utils.label_mappings import *
from datasets import load_dataset

### CIFAR-100 dataset

In [2]:
cifar100 = load_dataset("uoft-cs/cifar100")

In [3]:
cifar100['train']  # want to match cifar schema

Dataset({
    features: ['img', 'fine_label', 'coarse_label'],
    num_rows: 50000
})

In [4]:
cifar100['train'][0]  # want image type to be same

{'img': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=32x32>,
 'fine_label': 19,
 'coarse_label': 11}

### ImageNet-1000 dataset

In [5]:
# Awaiting DSMLP setup: ImageNet-1000 needs a lot of memory (1.28 million images).

### Loading clean dataset for compatible formats

In [6]:
base_path = '../clean_insect_images/'

class_dirs = ['Ant','Bee','Butterfly','Dragonfly','Fly','Grasshopper','Ladybug','Spider']

clean_ds = {'image':[], 'label':[], 'file_path':[]}

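for c in class_dirs:
    target_dir = os.path.join(base_path, c)
    image_files = os.listdir(target_dir)
    for f in image_files:
        full_image_path = os.path.join(target_dir, f)
        # Image.open() is lazy and would otherwise keep each file open,
        # so copy the pixel data and let the context manager close the file.
        with Image.open(full_image_path) as img:
            clean_ds['image'].append(img.copy())
        clean_ds['label'].append(c)
        clean_ds['file_path'].append(f)  # file name only, relative to the class folder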
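
`os.listdir` returns every entry in a folder, including hidden or non-image files, which `Image.open` would fail on. The sketch below shows one possible way to list only image files per class before opening them. The helper `list_image_files` and the `IMAGE_EXTENSIONS` set are hypothetical and not part of this notebook; adjust the extensions to whatever the clean dataset actually contains.

```python
from pathlib import Path

# Assumed set of image extensions; not taken from the clean dataset itself.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def list_image_files(base_path, class_dirs):
    """Hypothetical helper: return {class_name: sorted list of image paths}."""
    files_by_class = {}
    for c in class_dirs:
        target_dir = Path(base_path) / c
        # Keep regular files whose extension looks like an image.
        files_by_class[c] = sorted(
            p for p in target_dir.iterdir()
            if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
        )
    return files_by_class
```

Calling `list_image_files(base_path, class_dirs)` would also give the per-class image counts directly, via `len()` on each list.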



In [7]:
cifar100_df = pd.DataFrame({'fine_label': cifar100['train']['fine_label']})
imgnt1k_df = ...
clean_df = pd.DataFrame({'label': clean_ds['label']})

In [9]:
def map_to_clean_label(label):
    # Return the clean class for a CIFAR-100 fine label, or None if there is no match.
    return cifar100_to_clean_map.get(label)

cifar100_df['clean_label'] = cifar100_df['fine_label'].apply(map_to_clean_label)

In [11]:
cifar100_df.groupby('clean_label').count()

| clean_label | fine_label |
|---|---|
| Bee | 500 |
| Beetle | 500 |
| Butterfly | 500 |
| Spider | 500 |
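
Only four classes appear in this output because `groupby` drops rows whose key is missing by default, so every fine label that maps to `None` is left out of the counts. The sketch below illustrates the same counting logic with the standard library. `example_map` is a hypothetical stand-in: the real `cifar100_to_clean_map` is defined in `utils.label_mappings` and may use different keys, and the integer keys and toy labels here are illustrative only.

```python
from collections import Counter

# Hypothetical stand-in for cifar100_to_clean_map (keys are illustrative).
example_map = {6: "Bee", 7: "Beetle", 14: "Butterfly", 79: "Spider"}

# Toy fine labels: 19 has no clean counterpart here, so it maps to None.
fine_labels = [6, 6, 7, 14, 19, 79, 19]

mapped = [example_map.get(label) for label in fine_labels]

# Like groupby('clean_label'), count only the rows that actually mapped.
counts = Counter(m for m in mapped if m is not None)

# Rows without a clean counterpart are tracked separately.
unmapped = sum(m is None for m in mapped)

print(dict(counts), unmapped)
```

Tracking the unmapped count alongside the per-class counts makes it easier to see how much of a dataset falls outside the clean classes.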
