# Dataset Sufficiency Analysis for Object Detection Tutorial


## Problem Statement

In machine learning, we often evaluate a model on a small, preliminary dataset first. When data collection is expensive, we would like to extrapolate from those results to estimate how the model would perform with a larger dataset.

DAML provides a method for projecting performance via _sufficiency curves_.


## When to Use

Use the `Sufficiency` class when you would like to extrapolate hypothetical performance.
For example, if you have a small dataset, it can help you decide whether collecting more data is worthwhile.


## What you will need

1. A particular model architecture.
2. One or more metrics to evaluate.
3. A dataset of interest.


### Setting up

Let's import the libraries needed to set up a minimal working example.
Note that this tutorial is run from inside the `yolov5` directory; YOLOv5 is a tool for object detection, and `val` below is its validation script.


In [None]:
import os
import shutil
from typing import Dict

import numpy as np
import val

from daml.workflows import Sufficiency

trains = os.listdir("../datasets/VisDrone/VisDrone2019-DET-train/full_labels/")

In [None]:
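class YoloMdelWrapper:
    """Placeholder model object passed to Sufficiency.

    In this tutorial YOLOv5 is trained and evaluated through train.py and val,
    so the wrapper holds no weights; its methods only reset a counter.
    """

    def __init__(self):
        self.trained = 0

    def reset_parameters(self):
        self.trained = 0
        return self

    def apply(self, fn):
        # fn is ignored; there are no modules to apply it to.
        self.trained = 0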

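YOLOv5 trains on every image that has a file in the `labels` folder, so we define two utility functions to control which data the model sees: one empties the `labels` folder, and the other copies a chosen subset of label files into it from `full_labels`.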


In [None]:
def delete_labels():
    folder = "../datasets/VisDrone/VisDrone2019-DET-train/labels/"
    for filename in os.listdir(folder):
        file_path = os.path.join(folder, filename)
        try:
            if os.path.isfile(file_path) or os.path.islink(file_path):
                os.unlink(file_path)
            elif os.path.isdir(file_path):
                shutil.rmtree(file_path)
        except Exception as e:
            print(f"Failed to delete {file_path}. Reason: {e}")


def copy_labels(names):
    for i in names:
        shutil.copy(
            "../datasets/VisDrone/VisDrone2019-DET-train/full_labels/" + i,
            "../datasets/VisDrone/VisDrone2019-DET-train/labels/",
        )
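
The delete-then-copy pattern used by the two helpers above can be tried out safely on a throwaway directory before pointing it at the real dataset. The following is a minimal sketch, not part of the tutorial's pipeline: the function name `reset_labels`, the file names, and the label contents are all hypothetical.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def reset_labels(full_dir: Path, active_dir: Path, names: List[str]) -> None:
    """Make active_dir contain exactly the label files listed in names."""
    # Remove whatever the previous training step left behind.
    if active_dir.exists():
        shutil.rmtree(active_dir)
    active_dir.mkdir(parents=True)
    # Copy only the selected files from the untouched master copy.
    for name in names:
        shutil.copy(full_dir / name, active_dir / name)


# Demonstrate on a temporary directory (hypothetical file names and contents).
with tempfile.TemporaryDirectory() as tmp:
    full = Path(tmp) / "full_labels"
    active = Path(tmp) / "labels"
    full.mkdir()
    for i in range(4):
        (full / f"img{i}.txt").write_text("0 0.5 0.5 0.1 0.1\n")

    reset_labels(full, active, ["img0.txt", "img2.txt"])
    first = sorted(p.name for p in active.iterdir())

    reset_labels(full, active, ["img3.txt"])
    second = sorted(p.name for p in active.iterdir())
```

After each call, the active folder holds only the requested subset, which is the property the training function relies on.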

### Define functions

Use only the first 500 images for training.


In [None]:
# os.listdir returns names in arbitrary order; sort so the subset is reproducible
trains = sorted(trains)[:500]

For the purposes of this example, we will use subsets of the training and test data.

Finally, we define our custom training and evaluation functions.
`Sufficiency` requires the evaluation function to return a dictionary that maps each metric name to its value.


In [None]:
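def custom_train(model, dataset, indices):
    # Defined only for this testing scenario.
    # Rebuild the labels folder so it holds only the selected subset.
    delete_labels()
    copy_labels([dataset[i] for i in indices])
    # ruff: noqa: E501
    if len(indices) == 5:
        # A 5-image subset is treated as the first step in this setup: train from scratch.
        !python train.py --data VisDrone.yaml --epochs 10 --weights '' --cfg yolov5n.yaml --img 640 --noval --exist-ok
    else:
        # Every other step resumes from the checkpoint written by the previous one.
        !python train.py --epochs 10 --data VisDrone.yaml --weights 'runs/train/exp/weights/last.pt' --cfg yolov5n.yaml --img 640 --noval --exist-ok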


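def custom_eval(model, dataset) -> Dict[str, float]:
    # The model and dataset arguments are unused; val.run reads the checkpoint and data config from disk.
    metrics = val.run("./data/VisDrone.yaml", "./runs/train/exp/weights/last.pt")
    # metrics[0] holds the summary results; index 2 is mAP@0.5 in YOLOv5's val.py.
    mean_ap = metrics[0][2]
    print(mean_ap)
    return {"mAP": mean_ap}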
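
To make the expected shape of these two functions easier to see without running YOLOv5, here is a minimal sketch of the same contract with a toy model. The argument lists mirror `custom_train` and `custom_eval` above; `CountingModel`, `toy_train`, `toy_eval`, and the made-up score are hypothetical and only illustrate the interface.

```python
from typing import Dict, Sequence


class CountingModel:
    """Hypothetical stand-in for a model: it only remembers how many samples it saw."""

    def __init__(self) -> None:
        self.seen = 0


def toy_train(model: CountingModel, dataset: Sequence[str], indices: Sequence[int]) -> None:
    # Select the subset by index, as custom_train does with dataset[i].
    subset = [dataset[i] for i in indices]
    model.seen = len(subset)


def toy_eval(model: CountingModel, dataset: Sequence[str]) -> Dict[str, float]:
    # Return metric name -> scalar value; the formula is invented for illustration.
    return {"score": model.seen / (model.seen + 10.0)}


files = [f"img{i}.txt" for i in range(100)]
model = CountingModel()
toy_train(model, files, indices=[0, 5, 7, 42])
result = toy_eval(model, files)
```

The training function mutates state and returns nothing, while the evaluation function returns a flat dictionary of named scalar metrics.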

### Initialize sufficiency metric

Pass the custom training and evaluation functions to the `Sufficiency` metric. Then set `runs`, the number of models to train in parallel (more runs give a more stable curve), and `substeps`, the number of steps along the learning curve at which to evaluate.


In [None]:
mymodel = YoloMdelWrapper()
# Instantiate sufficiency metric
suff = Sufficiency(
    model=mymodel,
    train_ds=trains,
    test_ds=np.array([0]),
    train_fn=custom_train,
    eval_fn=custom_eval,
    runs=1,
    substeps=4,
)
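
This tutorial does not state how `Sufficiency` chooses the subset size at each substep. One schedule that is consistent with `custom_train` checking for `len(indices) == 5` is geometric spacing that starts at 1% of the dataset. The sketch below is an assumption for illustration only; `geometric_steps` and `start_fraction` are hypothetical names, not DAML's API.

```python
from typing import List


def geometric_steps(n_samples: int, substeps: int, start_fraction: float = 0.01) -> List[int]:
    """Return `substeps` subset sizes spaced geometrically up to n_samples."""
    first = start_fraction * n_samples
    ratio = n_samples / first
    sizes = []
    for i in range(substeps):
        # The exponent runs from 0 (smallest subset) to 1 (full dataset).
        exponent = i / (substeps - 1) if substeps > 1 else 1.0
        sizes.append(int(first * ratio**exponent))
    return sizes


# Assumed inputs matching this tutorial: 500 training images, 4 substeps.
sizes = geometric_steps(n_samples=500, substeps=4)
```

Under this assumption, the smallest subset has 5 images and the largest uses all 500, with the intermediate sizes growing by a constant factor.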

### Evaluate Sufficiency

Now we can evaluate the metric to train the models and produce the learning curve.


In [None]:
# Train & test model
output = suff.evaluate()

In [None]:
# Print out sufficiency output in a table format
from tabulate import tabulate

print(tabulate(output, headers=list(output.keys()), tablefmt="pretty"))

In [None]:
# Print out projected output values
projection = Sufficiency.project(output, [1000, 2000, 4000])
print(tabulate(projection, list(projection.keys()), tablefmt="pretty"))

In [None]:
# Plot the output using the convenience function
%matplotlib inline
_ = Sufficiency.plot(output)

### Results

Using this learning curve, we can project how the same model would perform on much larger datasets.
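
To give some intuition for what such a projection involves, the sketch below fits a simple power law to a handful of (dataset size, score) points and extrapolates it. This is not DAML's implementation: the functional form `1 - score = a * n**(-b)` is an assumption, the helper names `fit_power_law` and `project` are hypothetical, and the data points are synthetic rather than YOLOv5 results.

```python
import math
from typing import List, Tuple


def fit_power_law(sizes: List[int], scores: List[float]) -> Tuple[float, float]:
    """Fit (1 - score) = a * n**(-b) by least squares in log-log space."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(1.0 - s) for s in scores]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    # Ordinary least-squares slope and intercept of log(error) against log(n).
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope


def project(a: float, b: float, n: int) -> float:
    """Predict the score at dataset size n from the fitted curve."""
    return 1.0 - a * n ** (-b)


# Synthetic measurements generated from a known curve (a=0.9, b=0.25).
sizes = [5, 23, 107, 500]
scores = [1.0 - 0.9 * n ** (-0.25) for n in sizes]

a, b = fit_power_law(sizes, scores)
projected = {n: project(a, b, n) for n in (1000, 2000, 4000)}
```

Because the synthetic points lie exactly on a power law, the fit recovers the generating parameters. Real measurements are noisy, so projections far beyond the largest measured dataset size should be read as rough estimates.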
