# WildlifeReID-10k creation - part 2

This is the second part of creating the WildlifeReID-10k dataset. It creates the training/testing split for the dataset obtained in the first part.

First, we load the necessary packages.

In [None]:
import os
import numpy as np
from wildlife_datasets import datasets, splits

We specify the roots, load the dataset, and verify that the ordering of the dataset matches the ordering of the precomputed features.

In [None]:
root = '/data/wildlife_datasets/data/WildlifeReID10k'
save_clusters_prefix = 'clusters/cluster'
os.makedirs(save_clusters_prefix, exist_ok=True)
d = datasets.WildlifeReID10k(root)
df = d.df
# The index of the dataframe is required to be 0, 1, ..., n-1
if not np.array_equal(df.index, range(len(df))):
    raise Exception('Index must be 0..n-1')
# The saved feature names must list the same image paths in the same order
for name, df_dataset in df.groupby('dataset'):
    features_names = np.load(f'features/names_{name}.npy', allow_pickle=True)
    if not np.array_equal(df_dataset['path'], features_names):
        raise Exception('Features were computed for different indices')

We create the splits dataset by dataset. We first use `OpenSetSplit` from `wildlife-datasets`, which creates an open-set split with approximately 80% of the images in the training set. Approximately 10% of the images depict new individuals, which appear only in the testing set and not in the training set. We then resplit the open-set split using `resplit_by_features`. For each individual, it keeps the same number of images in the training and testing sets but reassigns which images go where, so that images with similar `features` all end up in the training set. This prevents information leaking from the training set to the testing set when images in both sets are similar or even identical.
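To make the resplitting idea concrete, here is a minimal sketch for a single individual on toy data. It is not the implementation of `resplit_by_features`. The function `resplit_sketch`, the cosine-similarity `threshold` and the toy features are all assumptions made for illustration. The sketch marks an image as a near-duplicate if some other image is very similar to it, and then fills the training set with near-duplicates first while keeping the number of training images unchanged.

```python
import numpy as np

def resplit_sketch(features, is_train, threshold=0.9):
    """Toy resplit for ONE individual (hypothetical helper, not the library code)."""
    # Normalize rows so that dot products are cosine similarities.
    features = features / np.linalg.norm(features, axis=1, keepdims=True)
    similarity = features @ features.T
    np.fill_diagonal(similarity, -1.0)  # ignore the similarity of an image to itself
    # An image is a near-duplicate if some other image is very similar to it.
    has_near_duplicate = similarity.max(axis=1) >= threshold
    n_train = int(np.sum(is_train))  # the training size must stay the same
    # Stable sort: near-duplicates come first, the original order is kept otherwise.
    order = np.argsort(~has_near_duplicate, kind='stable')
    new_is_train = np.zeros(len(features), dtype=bool)
    new_is_train[order[:n_train]] = True
    return new_is_train

# Five toy images of one individual; images 0 and 3 are almost identical.
toy_features = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.99, 0.05, 0.0],
    [-1.0, 0.0, 0.0],
])
# Original split: image 0 is in training, its near-duplicate 3 is in testing.
old_is_train = np.array([True, True, True, False, False])
new_is_train = resplit_sketch(toy_features, old_is_train)
print(new_is_train)
```

In this toy case, images 0 and 3 both move into the training set and image 2 moves to the testing set, so the training set still contains three images. If there were more near-duplicates than training slots, this simple sketch would still place some of them in the testing set.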

In [None]:
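for name, df_dataset in df.groupby('dataset'):
    print(name)
    # Load the features and normalize each row to unit length
    features = np.load(f'features/features_{name}.npy')
    features = features / np.linalg.norm(features, axis=1, keepdims=True)
    # Open-set split: about 80% of images in training, about 10% showing new individuals
    splitter = splits.OpenSetSplit(0.8, 0.1, seed=666)
    idx_train0, idx_test0 = splitter.split(df_dataset)[0]
    # Reassign images so that those with similar features end up in the training set
    idx_train, idx_test = splitter.resplit_by_features(df_dataset, features, idx_train0, save_clusters_prefix=save_clusters_prefix)

    df.loc[idx_train, 'split'] = 'train'
    df.loc[idx_test, 'split'] = 'test'
# Drop the image_id column and save the metadata with the new split column
df = df.drop('image_id', axis=1)
df.to_csv(os.path.join(root, 'metadata.csv'), index=False)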
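A quick sanity check of the resulting split could compare the observed proportions with the intended ones (approximately 80% of images in training, approximately 10% showing new individuals). The following is a minimal sketch on a toy dataframe. The helper `summarize_split` is hypothetical, and it assumes that the individual is stored in a column named `identity`, which is not shown in this notebook.

```python
import pandas as pd

def summarize_split(df):
    """Return the training share and the share of images showing unseen individuals."""
    is_train = df['split'] == 'train'
    train_identities = set(df.loc[is_train, 'identity'])
    # Testing images whose individual never appears in the training set.
    is_unseen = ~is_train & ~df['identity'].isin(train_identities)
    return is_train.mean(), is_unseen.mean()

# Toy data: individual 'c' appears only in the testing set.
toy = pd.DataFrame({
    'identity': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c'],
    'split': ['train', 'train', 'train', 'test', 'train',
              'train', 'train', 'train', 'test', 'test'],
})
train_share, unseen_share = summarize_split(toy)
print(train_share, unseen_share)
```

Under the same column-name assumption, the helper could be applied per dataset, for example with `df.groupby('dataset')`, to see how close each dataset is to the targets.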