# WildlifeReID-10k creation - part 2

This is the second part of creating the WildlifeReID-10k dataset. It creates the training/testing split for the dataset obtained in the first part.

First, we load the necessary packages.

In [1]:
import sys
sys.path.insert(0, '..')

import os
import numpy as np
from wildlife_datasets import datasets, splits

We specify the roots, load the dataset and the features, and verify that the ordering of the dataset matches the ordering of the features.

In [2]:
root = '/data/wildlife_datasets/data/WildlifeReID10k'
save_clusters_prefix = 'clusters/cluster'
d = datasets.WildlifeReID10k(root)
df = d.df
if not np.array_equal(df.index, range(len(df))):
    raise Exception('Index must be 0..n')
features_names = np.load('features/features_dino_names.npy', allow_pickle=True)
if not np.array_equal(df['path'], features_names):
    raise Exception('Features were computed for different indices')
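If the check above fails only because the features were saved in a different order, the rows can be realigned by path. The following is a minimal sketch of that idea on toy data. The helper `align_features` is hypothetical (it is not part of `wildlife-datasets`), and it assumes that the names in the features file are unique.

```python
import numpy as np

def align_features(paths, feature_names, features):
    """Reorder `features` so that row i corresponds to paths[i].

    Hypothetical helper; assumes the entries of `feature_names` are unique.
    """
    # Map each stored name to its row position in the features array.
    position = {name: i for i, name in enumerate(feature_names)}
    missing = [p for p in paths if p not in position]
    if missing:
        raise ValueError(f'{len(missing)} paths have no features')
    order = np.array([position[p] for p in paths])
    return features[order]

# Toy data: features were stored in the order c, a, b.
toy_paths = ['a.jpg', 'b.jpg', 'c.jpg']
toy_names = np.array(['c.jpg', 'a.jpg', 'b.jpg'], dtype=object)
toy_features = np.array([[3.0], [1.0], [2.0]])

aligned = align_features(toy_paths, toy_names, toy_features)
```

After the call, row `i` of `aligned` belongs to `toy_paths[i]`, so an equality check like the one in the cell above would pass for the reordered names.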

We normalize each feature vector to unit Euclidean length.

In [3]:
features = np.load('features/features_dino.npy')
for i in range(len(features)):
    features[i] /= np.linalg.norm(features[i])
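A practical consequence of unit-length features is that the dot product of two rows equals their cosine similarity. The sketch below shows this on random toy data, using a vectorized form of the same normalization. Whether the resplitting step uses cosine similarity internally is an assumption here, not something stated in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_features = rng.normal(size=(5, 8))

# Row-wise L2 norms, kept as a column so that the division broadcasts per row.
norms = np.linalg.norm(toy_features, axis=1, keepdims=True)
toy_normalized = toy_features / norms

# With unit-length rows, the Gram matrix holds the pairwise cosine similarities.
cosine_similarity = toy_normalized @ toy_normalized.T
```

Every row of `toy_normalized` has norm one, so the diagonal of `cosine_similarity` is one up to floating-point error.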

We create the splits dataset by dataset. We first use `OpenSetSplit` from `wildlife-datasets`, which creates an open-set split with approximately 80% of the images in the training set. Approximately 10% of the images depict new individuals, which appear only in the testing set and not in the training set. We then resplit the open-set split using `resplit_by_features`. For each individual, it keeps the same number of images in the training and testing sets but reshuffles them so that images with similar `features` all end up in the training set. This prevents information from leaking from the training set to the testing set when images in both sets are similar or even identical.
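To make the resplitting idea concrete, here is a toy sketch for a single individual. It is not the algorithm implemented in `resplit_by_features`; the function `toy_resplit`, the similarity threshold and the greedy ordering are all assumptions made for illustration. It groups images whose cosine similarity exceeds a threshold and fills the training set with the largest groups first, keeping the training set size fixed.

```python
import numpy as np

def toy_resplit(features, n_train, threshold=0.9):
    """Pick n_train indices for training, preferring groups of similar images.

    Hypothetical illustration; expects unit-length feature rows.
    """
    n = len(features)
    similarity = features @ features.T

    # Union-find: merge every pair of images that is more similar than the threshold.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i, j] > threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)

    # Largest clusters go first, so groups of near-duplicates land in training.
    # A cluster can still be cut if it straddles the n_train boundary.
    ordered = sorted(clusters.values(), key=len, reverse=True)
    flat = [i for cluster in ordered for i in cluster]
    return sorted(flat[:n_train]), sorted(flat[n_train:])

# Toy individual: images 0, 2 and 4 are near-duplicates, the rest are distinct.
e = np.eye(4)
toy = np.array([e[0], e[1], e[0] + 0.05 * e[1], e[2], e[0] + 0.05 * e[2], e[3]])
toy /= np.linalg.norm(toy, axis=1, keepdims=True)

toy_idx_train, toy_idx_test = toy_resplit(toy, n_train=3)
```

In this toy case the three near-duplicates `[0, 2, 4]` form the training set and the three distinct images `[1, 3, 5]` form the testing set, so no testing image has a near-copy in the training set.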

In [4]:
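for name, df_dataset in df.groupby('dataset'):
    print(name)
    # Initial open-set split: about 80% of images in training, about 10% of images from new individuals
    splitter = splits.OpenSetSplit(0.8, 0.1, seed=666)
    idx_train0, idx_test0 = splitter.split(df_dataset)[0]
    # Reshuffle images within each individual so that similar images end up in the training set
    idx_train, idx_test = splitter.resplit_by_features(df_dataset, features[df_dataset.index], idx_train0, save_clusters_prefix=save_clusters_prefix)

    df.loc[idx_train, 'split'] = 'train'
    df.loc[idx_test, 'split'] = 'test'
# Drop the image_id column and save the metadata together with the new split column
df = df.drop('image_id', axis=1)
df.to_csv(os.path.join(root, 'metadata.csv'), index=False)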

AAUZebraFish


100%|██████████| 6/6 [00:03<00:00,  1.84it/s]


ATRW


100%|██████████| 182/182 [01:31<00:00,  1.99it/s]


AerialCattle2017


100%|██████████| 23/23 [00:16<00:00,  1.42it/s]


BelugaID


100%|██████████| 788/788 [04:35<00:00,  2.86it/s]


BirdIndividualID


100%|██████████| 50/50 [01:04<00:00,  1.29s/it]


CTai


100%|██████████| 71/71 [00:57<00:00,  1.24it/s]


CZoo


100%|██████████| 24/24 [00:18<00:00,  1.29it/s]


CatIndividualImages


100%|██████████| 509/509 [04:26<00:00,  1.91it/s]


CowDataset


100%|██████████| 13/13 [00:06<00:00,  1.91it/s]


Cows2021


100%|██████████| 179/179 [03:08<00:00,  1.05s/it]


DogFaceNet


100%|██████████| 1393/1393 [03:40<00:00,  6.31it/s]


FriesianCattle2015


100%|██████████| 25/25 [00:03<00:00,  7.18it/s]


FriesianCattle2017


100%|██████████| 89/89 [00:14<00:00,  6.18it/s]


GiraffeZebraID


100%|██████████| 2056/2056 [02:51<00:00, 11.97it/s]


Giraffes


100%|██████████| 178/178 [00:28<00:00,  6.24it/s]


HyenaID2022


100%|██████████| 256/256 [00:58<00:00,  4.41it/s]


IPanda50


100%|██████████| 50/50 [00:44<00:00,  1.13it/s]


LeopardID2022


100%|██████████| 430/430 [01:29<00:00,  4.80it/s]


MPDD


100%|██████████| 191/191 [00:38<00:00,  5.01it/s]


NDD20


100%|██████████| 82/82 [01:14<00:00,  1.09it/s]


NyalaData


100%|██████████| 237/237 [00:37<00:00,  6.30it/s]


OpenCows2020


100%|██████████| 46/46 [00:40<00:00,  1.14it/s]


PolarBearVidID


100%|██████████| 13/13 [00:11<00:00,  1.17it/s]


SMALST


100%|██████████| 10/10 [00:10<00:00,  1.05s/it]


SeaStarReID2023


100%|██████████| 95/95 [00:58<00:00,  1.62it/s]


SeaTurtleID2022


100%|██████████| 438/438 [03:08<00:00,  2.32it/s]


SealID


100%|██████████| 57/57 [00:58<00:00,  1.02s/it]


StripeSpotter


100%|██████████| 45/45 [00:10<00:00,  4.10it/s]


WhaleSharkID


100%|██████████| 543/543 [03:07<00:00,  2.89it/s]


ZindiTurtleRecall


100%|██████████| 2265/2265 [06:02<00:00,  6.25it/s]
