# Pollinator Classification Challenge – Starting Kit

This notebook is the official **Starting Kit** for the Pollinator Classification Challenge.

It is designed to guide participants through:
- Understanding the provided data
- Using the given baselines
- Running a minimal end-to-end pipeline
- Generating a valid submission file for Codabench

By running this notebook from top to bottom, you will obtain a ready-to-submit ZIP file.


## Challenge Description

The goal of this challenge is to **classify pollinator species** from sensor-based data.

Each data sample corresponds to a pollinator event.
For each event, the task is to predict the correct pollinator class among a fixed set of species.

The evaluation is handled automatically by Codabench using the provided ingestion and scoring programs.
Participants only need to generate predictions in the correct format.


## Provided Data

Two data representations are provided:

### 1. CNN-Extracted Features
This dataset contains numerical feature vectors extracted using a Convolutional Neural Network (CNN).
Each sample is represented by a fixed-length vector.

This format is:
- Lightweight
- Easy to manipulate
- Recommended for building a baseline

### 2. Raw Data in HDF5 (.h5) Format
The raw sensor data is stored in `.h5` files.
This format contains richer information (e.g. signals, timestamps) and is intended for more advanced approaches.

In this starting kit, **we use the CNN-extracted features**, following the provided baseline notebook.


In [None]:
import sys
COLAB = "google.colab" in sys.modules  # True on Google Colab; also safe outside IPython

## Baseline Reference Notebooks

This starting kit is based on the following provided notebooks:

- `read_pollinator_CNN_Extracted_data.ipynb`  
  Demonstrates how to load CNN-extracted features, labels, and class mappings.

- `read_pollinator_h5_data.ipynb`  
  Demonstrates how to read the raw HDF5 data structure.

The present notebook reuses the same data organization principles,
but focuses on a **simplified and reproducible baseline**.


In [None]:
import os
import json
import numpy as np
from pathlib import Path

## Loading the CNN-Extracted Features

We now load:
- The feature matrix `X`
- The label vector `y`
- The class mapping file

Each row in `X` corresponds to one pollinator sample.
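
Before modelling, it can help to confirm that features and labels are aligned. The sketch below is a minimal check on toy data, assuming only that `X` and `y` support `len()` and iteration (true for NumPy arrays and plain lists). The helper name `check_alignment` is hypothetical and is not part of the provided notebooks.

```python
from collections import Counter


def check_alignment(X, y):
    """Return (n_samples, n_distinct_labels); raise if X and y differ in length.

    Hypothetical helper: it relies only on len() and iteration, so it accepts
    NumPy arrays as well as plain Python lists.
    """
    if len(X) != len(y):
        raise ValueError(f"X has {len(X)} rows but y has {len(y)} labels")
    return len(X), len(Counter(y))


# Toy stand-in data (not the challenge data): 4 samples, 2 features each.
X_demo = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
y_demo = [0, 1, 1, 0]
n_samples, n_labels = check_alignment(X_demo, y_demo)
print(n_samples, n_labels)  # 4 2
```

Calling `check_alignment(X, y)` right after loading would surface a mismatched pair of files before any model is fitted.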


In [None]:
DATA_DIR = Path("./Data/Pollinator-Data-CNN-Extracted")

X = np.load(DATA_DIR / "X_features.npy")
y = np.load(DATA_DIR / "y_labels.npy")

with open(DATA_DIR / "class_type_mapper.json", "r") as f:
    class_mapper = json.load(f)

print("Feature matrix shape:", X.shape)
print("Label vector shape:", y.shape)
print("Number of classes:", len(class_mapper))

## Baseline Model

This starting kit implements a **very simple baseline model**.

The purpose of this baseline is **not to achieve high performance**, but to:
- Illustrate the full training and submission workflow
- Provide a reliable reference implementation
- Ensure compatibility with Codabench evaluation

The baseline predicts the **most frequent class** observed in the labels.
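
Because this baseline always outputs one class, its accuracy on the labels it was derived from equals the relative frequency of that class. The sketch below illustrates this on toy labels; the function name `majority_baseline` and the label values are illustrative assumptions, not part of the challenge code.

```python
from collections import Counter


def majority_baseline(labels):
    """Return (most_frequent_label, accuracy_of_always_predicting_it)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Toy labels for illustration only (not the challenge data).
toy_labels = ["bee", "fly", "bee", "wasp", "bee"]
label, accuracy = majority_baseline(toy_labels)
print(label, accuracy)  # bee 0.6
```

This value is a convenient floor: a trained model that does not beat it has likely learned little beyond the class frequencies.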


In [None]:
from collections import Counter

most_frequent_class = Counter(y).most_common(1)[0][0]
print("Most frequent class:", most_frequent_class)

## Generating Predictions

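Predictions must be generated for all samples in the test set.

In this baseline, the same class is predicted for every sample, which yields a minimal but valid submission.
Because this notebook loads no separate test file, the cell below sizes the prediction vector from the loaded labels `y`.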


In [None]:
y_pred = np.full(shape=len(y), fill_value=most_frequent_class)
print("Prediction vector shape:", y_pred.shape)

## Submission Format

Submissions must be provided as a **ZIP archive**.

The archive must contain a NumPy file with the predicted labels.
The ingestion and scoring programs provided in the starting kit
will automatically read and evaluate this file.
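
Before uploading, the archive can be checked locally. The sketch below uses only the standard library and assumes the expected member is named `predictions.npy`, as in the `Submission` class of this notebook. The helper `check_submission_zip` is hypothetical, and the demo archive holds placeholder bytes rather than real predictions.

```python
import tempfile
import zipfile
from pathlib import Path


def check_submission_zip(zip_path, expected_member="predictions.npy"):
    """Return True if the archive is readable and contains expected_member."""
    with zipfile.ZipFile(zip_path) as zf:
        # testzip() returns the first corrupt member name, or None if all are fine.
        if zf.testzip() is not None:
            return False
        return expected_member in zf.namelist()


# Demonstration on a throwaway archive (placeholder content, not real predictions).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("predictions.npy", b"placeholder")
    ok = check_submission_zip(demo_zip)
    missing = check_submission_zip(demo_zip, expected_member="other.npy")
print(ok, missing)  # True False
```

This only verifies the archive layout; it does not confirm that the array inside has the shape or label encoding the scoring program expects.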


In [None]:
import zipfile
import datetime

class Submission:
    def __init__(self, predictions, output_dir="submission"):
        self.predictions = predictions
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)  # also create missing parent directories

    def save(self):
        timestamp = datetime.datetime.now().strftime("%y-%m-%d-%H-%M")
        zip_path = self.output_dir / f"Submission_{timestamp}.zip"

        pred_file = self.output_dir / "predictions.npy"
        np.save(pred_file, self.predictions)

        with zipfile.ZipFile(zip_path, "w") as zf:
            zf.write(pred_file, arcname="predictions.npy")

        pred_file.unlink()
        return zip_path

## Creating the Submission File

We now generate the final ZIP archive that can be uploaded directly to Codabench.


In [None]:
submission = Submission(y_pred)
zip_path = submission.save()

print("Submission ZIP saved at:", zip_path)

## Conclusion

This notebook provides a complete and minimal starting point for the
Pollinator Classification Challenge.

Participants are encouraged to:
- Replace the baseline with more advanced models
- Explore the raw HDF5 data
- Improve validation and feature engineering strategies
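
As one possible starting point for validation, the sketch below builds a stratified hold-out split using only the standard library. The function name `stratified_split`, the 20% validation fraction, and the toy labels are all assumptions made for illustration; the challenge does not prescribe a validation scheme.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx), holding out about val_fraction of each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Hold out at least one sample per class, unless the class has only one.
        n_val = max(1, round(len(indices) * val_fraction)) if len(indices) > 1 else 0
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Toy labels for illustration only: 10 samples of class 0, 5 of class 1.
toy_labels = [0] * 10 + [1] * 5
train_idx, val_idx = stratified_split(toy_labels)
print(len(train_idx), len(val_idx))  # 12 3
```

The returned index lists can be used to slice `X` and `y`, so that a model is scored on samples it did not see during fitting.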

If this notebook runs successfully from start to end,
the generated ZIP file is ready to be submitted to Codabench.
