# WildlifeReID-10k creation - part 1

This is the first part of creating the WildlifeReID-10k dataset. It copies the files to a separate folder, applies bounding boxes and masks, and combines the metadata of the individual datasets. The split is created in the second part.

First load the necessary packages.

In [5]:
import os
import pandas as pd
from wildlife_datasets.preparation import prepare_functions, species_conversion

Then specify the roots where the datasets are located. The parameter `transform` specifies the transform (for example, resizing to a given size) applied to the images, and the parameter `copy_files` specifies whether the files are copied from `root_datasets` to `root`. Since bounding boxes and masks are applied and black borders are cropped, copying is relatively time-consuming.

In [8]:
root_datasets = '/data/wildlife_datasets/data'
root = os.path.join(root_datasets, 'WildlifeReID10k')
root_images = os.path.join(root, 'images')
root_metadata = os.path.join(root, 'metadata')
transform = None
copy_files = True
names_permissible = [
    'AAUZebraFish',
    'AerialCattle2017',
    'AmvrakikosTurtles',
    'ATRW',
    'BelugaID',
    'BirdIndividualID',
    'CatIndividualImages',
    'CTai',
    'CZoo',
    'Chicks4FreeID',
    'CowDataset',
    'Cows2021',
    'DogFaceNet',
    'FriesianCattle2015',
    'FriesianCattle2017',
    'GiraffeZebraID',
    'Giraffes',
    'HyenaID2022',
    'IPanda50',
    'LeopardID2022',
    'MPDD',
    'MultiCamCows2024',
    'NDD20',
    'NyalaData',
    'OpenCows2020',
    'PolarBearVidID',
    'PrimFace',
    'ReunionTurtles',
    'SealID',
    'SeaStarReID2023',
    'SeaTurtleID2022',
    'SMALST',
    'SouthernProvinceTurtles',
    'StripeSpotter',
    'WhaleSharkID',
    'ZakynthosTurtles',
    'ZindiTurtleRecall',
]
remove_str = ['[', ']']
replace_extensions = {'.webp': '.jpg'}
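
To illustrate what `remove_str` and `replace_extensions` appear to be for, here is a minimal sketch. It is not the code used by `wildlife_datasets`: the helper `clean_path` is hypothetical, and it assumes that the strings in `remove_str` are deleted from file paths and that file extensions are swapped according to `replace_extensions`.

```python
import os

def clean_path(path, remove_str, replace_extensions):
    """Hypothetical helper: sanitize a relative path as the settings suggest."""
    # Assumption: every string in remove_str is simply deleted from the path.
    for s in remove_str:
        path = path.replace(s, '')
    # Assumption: extensions listed in replace_extensions are swapped, others kept.
    base, ext = os.path.splitext(path)
    return base + replace_extensions.get(ext.lower(), ext)

example = clean_path('cat[01]/img.webp', ['[', ']'], {'.webp': '.jpg'})
print(example)
```

Changing only the extension in the path does not convert the image itself; the actual conversion would have to happen when the file is written.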

Create metadata for each dataset and potentially copy the files.

In [None]:
for name, prepare in prepare_functions.items():
    if name in names_permissible:
        print(name)
        os.makedirs(f'{root_metadata}/{name}/', exist_ok=True)
        metadata_part = prepare(f'{root_datasets}/{name}', f'{root_images}/{name}', transform=transform, copy_files=copy_files, remove_str=remove_str, replace_extensions=replace_extensions)
        metadata_part.to_csv(f'{root_metadata}/{name}/metadata.csv', index=False)
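
Because the loop iterates over `prepare_functions` and filters by `names_permissible`, a misspelled name in `names_permissible` is skipped silently. The sketch below shows one way to detect this. The helper `missing_names` and the toy inputs are illustrative assumptions, not part of the library.

```python
def missing_names(names_permissible, prepare_functions):
    """Return permissible names that have no entry in prepare_functions."""
    return sorted(set(names_permissible) - set(prepare_functions))

# Toy stand-ins for illustration; the real objects come from the cells above.
toy_prepare_functions = {'ATRW': None, 'BelugaID': None}
toy_names = ['ATRW', 'BelugaID', 'TypoDataset']

missing = missing_names(toy_names, toy_prepare_functions)
print(missing)
```

With the real `names_permissible` and `prepare_functions`, an empty result would mean that every listed dataset has a matching prepare function.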

The next cell adds additional information to the metadata of each dataset and combines the parts into a single table. After this cell, the dataset is complete except for the splits. To compute the splits, we first need to compute the features with `extract_features.py` and then compute the actual splits in `prepare_wildlife_reid_10k_2.ipynb`.

In [None]:
metadata = []
for name in prepare_functions:
    if name in names_permissible:
        metadata_part = pd.read_csv(f'{root_metadata}/{name}/metadata.csv')
        metadata_part['dataset'] = name
        metadata_part['identity'] = name + '_' + metadata_part['identity'].astype(str)
        metadata_part['path'] = 'images/' + name + '/' + metadata_part['path']
        metadata_part['species'] = metadata_part['species'].apply(lambda x: species_conversion[x])
        metadata.append(metadata_part)
metadata = pd.concat(metadata).reset_index(drop=True)
metadata = metadata.drop('image_id', axis=1)
metadata['image_id'] = range(len(metadata))
idx = ~metadata['date'].isnull()
idx = metadata.index[idx]
metadata.loc[idx, 'date'] = pd.to_datetime(metadata.loc[idx, 'date'].astype(str).apply(lambda x: x[:10]), format='%Y-%m-%d').astype(str)
metadata['orientation'] = metadata['orientation'].replace({'below': 'down', 'up': 'top', 'above': 'top'})
metadata.to_csv(f'{root}/metadata.csv', index=False)
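
The normalization steps in the cell above can be tried on a small table. The sketch below uses made-up rows and a hypothetical dataset name, `ToyDataset`. It mirrors the identity prefixing, the truncation of dates to `YYYY-MM-DD`, and the unification of orientation labels.

```python
import pandas as pd

# Toy metadata; the values are made up for illustration.
toy = pd.DataFrame({
    'identity': [1, 2, 3],
    'date': ['2020-05-17 13:45:00', None, '2019-01-02'],
    'orientation': ['below', 'above', 'left'],
})
name = 'ToyDataset'  # hypothetical dataset name

# Prefix identities with the dataset name so they stay unique after concatenation.
toy['identity'] = name + '_' + toy['identity'].astype(str)

# Keep only the YYYY-MM-DD part of the non-missing dates.
idx = toy.index[~toy['date'].isnull()]
toy.loc[idx, 'date'] = pd.to_datetime(
    toy.loc[idx, 'date'].astype(str).str[:10], format='%Y-%m-%d'
).astype(str)

# Unify orientation labels; labels not in the mapping are left unchanged.
toy['orientation'] = toy['orientation'].replace(
    {'below': 'down', 'up': 'top', 'above': 'top'}
)
print(toy)
```

Missing dates stay missing, and a date whose first ten characters do not match `%Y-%m-%d` raises an error in `pd.to_datetime` rather than being silently dropped.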