# 🐾 AnimalCLEF2025 Competition: Official Starter notebook

The **goal** of the [AnimalCLEF2025](https://www.kaggle.com/competitions/animal-clef-2025/) competition is to identify individual animals (lynxes, salamanders, and sea turtles) in photos. This notebook visualizes the provided dataset and proposes a baseline solution based on the state-of-the-art re-identification model [MegaDescriptor](https://huggingface.co/BVRA/MegaDescriptor-L-384). The dataset is split into database and query sets. For each image in the query set, the goal is to:

- Predict whether the depicted individual is in the database.
- If not, the prediction is `new_individual`.
- If so, the prediction should be the identity of the matching individual in the database.
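
The decision described above can be sketched as a simple rule on a similarity score. This is a minimal illustration, not the competition's prescribed method: the function name `predict_identity`, the threshold value `0.6`, and the identity string in the usage lines are all hypothetical, and it assumes that a best-matching database identity and its similarity score have already been computed.

```python
def predict_identity(best_identity, best_score, threshold=0.6):
    """Return the matched identity, or 'new_individual' if the match is too weak.

    best_identity: identity of the most similar database image (assumed given).
    best_score:    similarity to that image, higher meaning more similar.
    threshold:     hypothetical cut-off; it would need tuning on held-out data.
    """
    if best_score >= threshold:
        return best_identity
    return "new_individual"


# Hypothetical identity label, used only for illustration.
print(predict_identity("some_lynx_identity", 0.82))
print(predict_identity("some_lynx_identity", 0.31))
```

A single global threshold is the simplest choice; a per-species threshold may work better, since similarity scores are not necessarily comparable across lynxes, salamanders, and sea turtles.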

## Dependencies installation
For the competition, we provide two Python packages: [wildlife-datasets](https://github.com/WildlifeDatasets/wildlife-datasets) for loading and preprocessing the available datasets, and [wildlife-tools](https://github.com/WildlifeDatasets/wildlife-tools) with tools and methods for animal re-identification.

In [1]:
!pip install git+https://github.com/WildlifeDatasets/wildlife-datasets@develop
!pip install git+https://github.com/WildlifeDatasets/wildlife-tools

Collecting git+https://github.com/WildlifeDatasets/wildlife-datasets@develop
  Cloning https://github.com/WildlifeDatasets/wildlife-datasets (to revision develop) to /tmp/pip-req-build-j0c11ej5
  Running command git clone --filter=blob:none --quiet https://github.com/WildlifeDatasets/wildlife-datasets /tmp/pip-req-build-j0c11ej5
  Running command git checkout -b develop --track origin/develop
  Switched to a new branch 'develop'
  Branch 'develop' set up to track remote branch 'develop' from 'origin'.
  Resolved https://github.com/WildlifeDatasets/wildlife-datasets to commit 753d9bf64861c3e17011136b3436bf58bf02317f
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: wildlife-datasets
  Building wheel for wildlife-datasets (pyproject.toml) ... done
  Created wheel for wildlife-datasets: filename=wildlife_datasets-1.0.6

## Dependencies import
We load all the required packages and then define the function `create_sample_submission`, which converts the provided predictions into a submission file for the competition.

In [2]:
import os
import numpy as np
import pandas as pd
import timm
import torch
import torchvision.transforms as T
from wildlife_datasets.datasets import AnimalCLEF2025
from wildlife_tools.features import DeepFeatures
from wildlife_tools.similarity import CosineSimilarity

def create_sample_submission(dataset_query, predictions, file_name='sample_submission.csv'):
    """Write one row per query image (image_id, predicted identity) to a CSV file."""
    df = pd.DataFrame({
        'image_id': dataset_query.metadata['image_id'],
        'identity': predictions
    })
    df.to_csv(file_name, index=False)

We need to specify `root`, the directory where the data are stored, and two image transformations:
1. The first transform only resizes the images and is used for visualization.
2. The second transform additionally converts the images to torch tensors and normalizes them; it is used as input to the neural network.

In [5]:
root = '/kaggle/input/animal-clef-2025'
transform_display = T.Compose([
    T.Resize([384, 384]),
    ])
transform = T.Compose([
    *transform_display.transforms,
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225))
    ])
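
To make the second transform concrete, the sketch below reproduces the arithmetic of `ToTensor` followed by `Normalize` in plain NumPy: pixel values are scaled from [0, 255] to [0, 1], the layout changes from height x width x channel to channel x height x width, and each channel is shifted and scaled by the mean and standard deviation given above. The helper name `to_normalized_chw` is hypothetical, and the sketch assumes an 8-bit RGB image that has already been resized.

```python
import numpy as np

# Channel statistics copied from the transform above.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def to_normalized_chw(image_hwc_uint8):
    """Mimic ToTensor + Normalize for an H x W x 3 uint8 array (illustrative only)."""
    scaled = image_hwc_uint8.astype(np.float32) / 255.0  # [0, 255] -> [0, 1]
    chw = np.transpose(scaled, (2, 0, 1))                # HWC -> CHW
    # Broadcast the per-channel statistics over height and width.
    return (chw - MEAN[:, None, None]) / STD[:, None, None]


# A synthetic all-white image standing in for a resized photo.
image = np.full((384, 384, 3), 255, dtype=np.uint8)
normalized = to_normalized_chw(image)
print(normalized.shape)
```
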

## Inference with MegaDescriptor

Instead of training a classifier, we can use an off-the-shelf pretrained model, [MegaDescriptor](https://huggingface.co/BVRA/MegaDescriptor-L-384), to extract features from all images.

**Note:** _Using GPU acceleration is highly recommended._

In [7]:
# Loading the dataset
dataset = AnimalCLEF2025(root, transform=transform, load_label=True)
dataset_database = dataset.get_subset(dataset.metadata['split'] == 'database')
dataset_query = dataset.get_subset(dataset.metadata['split'] == 'query')
n_query = len(dataset_query)

In [9]:
# Setting up the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Loading the model (num_classes=0 drops the classifier head, so the model returns feature vectors)
name = 'hf-hub:BVRA/MegaDescriptor-L-384'
model = timm.create_model(name, num_classes=0, pretrained=True)

# Extracting features for the database and query images
extractor = DeepFeatures(model, device=device, batch_size=32, num_workers=0)
features_database = extractor(dataset_database)
features_query = extractor(dataset_query)


100%|█████████████████████████████████████████████████████████████| 409/409 [11:56<00:00,  1.75s/it]
100%|███████████████████████████████████████████████████████████████| 67/67 [02:25<00:00,  2.17s/it]


In [21]:
features_database.features.shape

(13074, 1536)

In [26]:
features_database.col_label

'identity'

In [15]:
len(features_database.features[0])

1536

In [16]:
len(features_query.features[0])

1536
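
Since the database and query features share the same dimension (1536), each query image can be compared with every database image. The sketch below shows one way to do this with cosine similarity in plain NumPy. It is a stand-in for illustration and not the `wildlife_tools` API: the function name `cosine_top1` is hypothetical, the arrays are random placeholders rather than real embeddings, and it assumes no feature vector is all zeros.

```python
import numpy as np


def cosine_top1(query_features, database_features):
    """For each query row, return the index and cosine similarity of the closest database row."""
    # L2-normalize the rows so that a dot product equals the cosine similarity.
    q = query_features / np.linalg.norm(query_features, axis=1, keepdims=True)
    d = database_features / np.linalg.norm(database_features, axis=1, keepdims=True)
    similarity = q @ d.T                                   # shape: (n_query, n_database)
    best_index = similarity.argmax(axis=1)
    best_score = similarity[np.arange(len(q)), best_index]
    return best_index, best_score


# Random placeholder features with the same dimension as the real embeddings.
rng = np.random.default_rng(0)
database = rng.normal(size=(5, 1536))
query = database[[3, 1]] * 2.0  # scaled copies of database rows 3 and 1
best_index, best_score = cosine_top1(query, database)
print(best_index, best_score)
```

With the real features, `best_index` would be mapped to the `identity` column of the database metadata to obtain a candidate label for each query image.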

In [11]:
features_query.metadata

| Unnamed: 0 | image_id | identity | path | date | orientation | species | split | dataset |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 | | images/LynxID2025/query/003b89301c7b9f6d18f722... | | back | lynx | query | LynxID2025 |
| 1 | 5 | | images/LynxID2025/query/004d500301a70ec9b5ba08... | | left | lynx | query | LynxID2025 |
| 2 | 12 | | images/LynxID2025/query/00d97c67f0cb0d13a3a449... | | left | lynx | query | LynxID2025 |
| 3 | 13 | | images/LynxID2025/query/00dcbabf03826937bcf6a0... | | right | lynx | query | LynxID2025 |
| 4 | 18 | | images/LynxID2025/query/011d81e0402d1be66bccab... | | right | lynx | query | LynxID2025 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2130 | 15204 | | images/SeaTurtleID2022/query/images/fecd2dfed0... | 2024-06-07 | | loggerhead turtle | query | SeaTurtleID2022 |
| 2131 | 15205 | | images/SeaTurtleID2022/query/images/ff1a0c812b... | 2023-06-28 | | loggerhead turtle | query | SeaTurtleID2022 |
| 2132 | 15206 | | images/SeaTurtleID2022/query/images/ff22f1cfa6... | 2024-06-09 | | loggerhead turtle | query | SeaTurtleID2022 |
| 2133 | 15207 | | images/SeaTurtleID2022/query/images/ff5d5116d1... | 2023-06-21 | | loggerhead turtle | query | SeaTurtleID2022 |


In [28]:
pd.DataFrame(features_database.features).to_parquet("db_embeddings_features.parquet", engine="pyarrow")
pd.DataFrame(features_database.metadata).to_parquet("db_embeddings_metadata.parquet", engine="pyarrow")

In [29]:
pd.DataFrame(features_query.features).to_parquet("query_embeddings.parquet", engine="pyarrow")
pd.DataFrame(features_query.metadata).to_parquet("query_embeddings_metadata.parquet", engine="pyarrow")