In [5]:
import pandas as pd

from detect_sleep_states.metric import score

In [6]:
# Ground truth: one sleep onset and one wakeup for a single series.
solution = pd.DataFrame(
    {'series_id': [1, 1], 'step': [1000, 10000], 'event': ['onset', 'wakeup']})
# Matching tolerances (in steps) for each event type.
tolerances = {'onset': [12, 36, 60, 90, 120, 150, 180, 240, 300, 360],
              'wakeup': [12, 36, 60, 90, 120, 150, 180, 240, 300, 360]}
# Column names passed to score() as keyword arguments.
column_names = {
    'series_id_column_name': 'series_id',
    'time_column_name': 'step',
    'event_column_name': 'event',
    'score_column_name': 'score',
}

## Perfect

In [8]:
submission = pd.DataFrame({
    'series_id': [1, 1], 
    'step': [1000, 10000],
    'event': ['onset', 'wakeup'],
    'score': [1., 1]
})
score(solution, submission, tolerances, **column_names)[0]

1.0

## Closer pred has higher score

In [9]:
submission = pd.DataFrame({
    'series_id': [1, 1] * 2, 
    'step': [1e3, 1e4, 1e3-100, 1e4-100],
    'event': ['onset', 'wakeup'] * 2,
    'score': [1., 1, 0.5, 0.5]
})
score(solution, submission, tolerances, **column_names)[0]

1.0

## Not perfect

In [11]:
submission = pd.DataFrame({
    'series_id': [1, 1], 
    'step': [1e3-100, 1e4-100],
    'event': ['onset', 'wakeup'],
    'score': [1., 1.]
})
score(solution, submission, tolerances, **column_names)[0]

0.6
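
The 0.6 can be checked by hand. The sketch below assumes a prediction counts as a match only when its distance from the true event is strictly below the tolerance (my reading of the metric, not taken from its source). Under that assumption, a 100-step miss fails the four tightest tolerances and passes the other six:

```python
tolerances = [12, 36, 60, 90, 120, 150, 180, 240, 300, 360]
miss = 100  # both predictions are 100 steps early

# Assumption: a prediction matches when its error is strictly below the tolerance.
matched = [t for t in tolerances if miss < t]
fraction_matched = len(matched) / len(tolerances)
print(matched, fraction_matched)
```

Onset and wakeup behave identically here, so averaging over the two event types leaves the result unchanged.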

## Closer pred has lower score

In [10]:
submission = pd.DataFrame({
    'series_id': [1, 1] * 2, 
    'step': [1e3, 1e4, 1e3-100, 1e4-100],
    'event': ['onset', 'wakeup'] * 2,
    'score': [0.5, 0.5, 1., 1.]
})
score(solution, submission, tolerances, **column_names)[0]

0.8
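
To see where 0.8 comes from, here is a simplified re-implementation for a single true event. It is a sketch, not the competition code: `average_precision` is a hypothetical helper, and it assumes predictions are visited from highest to lowest score, that the first one within tolerance claims the event, and that every other prediction is a false positive.

```python
def average_precision(truth_step, preds, tolerance):
    """Simplified AP for one true event; preds is a list of (step, score)."""
    matched = False
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    # Visit predictions from most to least confident.
    for step, _score in sorted(preds, key=lambda p: p[1], reverse=True):
        if not matched and abs(step - truth_step) < tolerance:
            matched = True  # this prediction claims the true event
            tp += 1
        else:
            fp += 1  # too far away, or the event is already claimed
        precision = tp / (tp + fp)
        recall = float(tp)  # there is exactly one true event
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


tolerances = [12, 36, 60, 90, 120, 150, 180, 240, 300, 360]
# Onset predictions from the cell above: exact hit at 0.5, 100-step miss at 1.0.
preds = [(1000, 0.5), (900, 1.0)]
per_tolerance = [average_precision(1000, preds, t) for t in tolerances]
mean_ap = sum(per_tolerance) / len(per_tolerance)
print(per_tolerance, mean_ap)
```

Below 100 steps of tolerance the confident miss ranks first as a false positive, so the exact hit arrives at precision 0.5. From 120 steps up, the miss itself claims the event at precision 1.0. Four tolerances at 0.5 and six at 1.0 average to 0.8.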

## Adding even more lower score preds

In [14]:
submission = pd.DataFrame({
    'series_id': [1, 1] * 3, 
    'step': [1e3, 1e4, 1e3-100, 1e4-100, 1e3-700, 1e4-700],
    'event': ['onset', 'wakeup'] * 3,
    'score': [1., 1, 0.5, 0.5, 0.25, 0.25]
})
score(solution, submission, tolerances, **column_names)[0]

1.0

# Summary:
- Adding farther-away predictions with lower scores does not hurt the metric (it stays at 1.0)
- Adding a closer prediction with a lower score improves the metric (from 0.6 to 0.8)

# Thoughts:
To be useful for sleep analysis, the predictions need to be meaningful. However, the metric either stays the same or improves when multiple predictions are made for the same event, which is unrealistic.

In a real-world setting, we would need to choose a score cutoff above which a prediction is accepted as part of the output.

In the case of a farther-away prediction with a lower score, we would have to make sure the cutoff is high enough to exclude that prediction. If the cutoff were too low, we would output multiple predictions for the same event, which would hurt the output and make the analysis unusable. Currently, the metric is unaffected by lower-score predictions that are farther from the event.

In the case of a closer prediction with a lower score, either multiple predictions would be output, or the closer prediction would be dropped by the cutoff. Either way, it would hurt the prediction for that event, not help it. Currently, it improves the metric, which I think is artificial and unrealistic.
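
As a rough illustration of the cutoff problem, the sketch below thresholds the onset predictions from the "Closer pred has lower score" cell. The `final_output` helper and both cutoff values are hypothetical, chosen only to show the two outcomes:

```python
# Onset predictions from the "Closer pred has lower score" cell, as (step, score).
preds = [(1000, 0.5), (900, 1.0)]


def final_output(preds, cutoff):
    """Keep the steps of predictions whose score reaches the cutoff."""
    return [step for step, score in preds if score >= cutoff]


both_kept = final_output(preds, cutoff=0.4)       # two onsets for one event
closer_dropped = final_output(preds, cutoff=0.8)  # only the 100-step miss
print(both_kept)
print(closer_dropped)
```

Neither output is better for analysis than submitting the 100-step miss alone, yet the metric rewards the extra prediction.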

In either case, I believe the metric was poorly chosen, as it doesn't measure how the predictions would be used in a real-world setting.

People were able to exploit the flaw in the metric to increase scores.

I believe a metric that operates on the final output, such as an F1 score, perhaps weighted by distance from the true event, would have been more appropriate.
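
One possible shape for such a metric is sketched below. Everything in it is an assumption made for illustration: the `weighted_f1` name, the linear credit `1 - error / max_tolerance`, the 360-step limit, and the nearest-prediction matching. It scores the thresholded output for one event type, so there are no confidence scores involved.

```python
def weighted_f1(truth_steps, pred_steps, max_tolerance=360):
    """Hypothetical distance-weighted F1 for one event type."""
    remaining = list(pred_steps)
    credit = 0.0
    for t in truth_steps:
        if not remaining:
            break
        # Match each true event to the nearest unused prediction.
        nearest = min(remaining, key=lambda p: abs(p - t))
        error = abs(nearest - t)
        if error < max_tolerance:
            remaining.remove(nearest)
            credit += 1.0 - error / max_tolerance  # closer matches earn more
    if not pred_steps or not truth_steps or credit == 0.0:
        return 0.0
    precision = credit / len(pred_steps)
    recall = credit / len(truth_steps)
    return 2 * precision * recall / (precision + recall)


exact_only = weighted_f1([1000], [1000])
exact_plus_extra = weighted_f1([1000], [1000, 900])
miss_only = weighted_f1([1000], [900])
print(exact_only, exact_plus_extra, miss_only)
```

With this kind of scoring, the extra prediction lowers precision, so padding the output with additional guesses for the same event costs points instead of being free.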
