# Dynascoring

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2023"

## Contents

1. [Overview](#Overview)
1. [Set-up](#Set-up)
1. [Scoring function](#Scoring-function)
1. [Example](#Example)

## Overview

This notebook implements the dynascoring method of [Ma et al. 2021](https://papers.nips.cc/paper/2021/hash/55b1927fdafef39c48e5b73b5d61ea60-Abstract.html). A Dynascore combines multiple metrics into a single score, with a weight on each metric expressing your assessment of its relative importance. The notebook defines the scoring function and then illustrates it with an example from the paper.
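
As a rough guide to what the scoring function in this notebook computes, the calculation can be written out as follows. The notation is introduced here for illustration and is not taken from the paper. Sort the models by ascending performance, let $x_j(m)$ be the value of metric $m$ for model $j$ after any direction multiplier and offset have been applied, and let $\Delta_i(m)$ be the difference in $x(m)$ between the model at position $i$ and the one just before it. With $S$ the set of positions whose performance gap exceeds the cutoff and $w_m$ the weight on metric $m$:

$$
\mathrm{AMRS}(m) = \frac{1}{|S|} \sum_{i \in S} \frac{|\Delta_i(m)|}{\Delta_i(\mathrm{perf})},
\qquad
\mathrm{Dynascore}(j) = \sum_{m} \frac{w_m}{\sum_{m'} w_{m'}} \cdot \frac{x_j(m)}{|\mathrm{AMRS}(m)|}
$$

Under this reading, and assuming the converted performance values are still in ascending order, the performance metric has an AMRS of 1, so its values pass through unchanged and every other metric is rescaled into performance-equivalent units.
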

## Set-up

In [2]:
import pandas as pd

## Scoring function

In [3]:
def dynascore(
        data,
        weights,
        perf_metric_field_name,
        direction_multipliers,
        offsets,
        delta_cutoff_proportion=0.0001):
    """Implementation of dynascoring.

    Parameters
    ----------
    data: `pd.DataFrame`
        Column names must include at least the metrics in `weights`.
    weights: `pd.Series`
        Map from metric names to their weights for dynascoring.
        The weights are normalized to sum to 1.
    perf_metric_field_name: `str`
        The metric in `weights` to use for performance.
    direction_multipliers: `pd.Series` or `None`
        If not `None`, then this should have the same index as
        `weights`, with value `1` for no change in direction and `-1`
        to change direction for the metric.
    offsets: `pd.Series` or `None`
        If not `None`, then this should have the same index as
        `weights` and provide a value to add to each metric column
        in `data` after the direction multiplier is applied.
    delta_cutoff_proportion: `float`
        Default of 0.0001. Rows whose performance differs from the
        preceding row (in performance-sorted order) by no more than
        this proportion of the maximum performance value are left
        out of the AMRS estimates.

    Returns
    -------
    pd.DataFrame containing the adjusted scores and a new column
    `Dynascore`, sorted by `Dynascore` in descending order.

    """
    converted_data = data.copy(deep=True)
    converted_data.sort_values(perf_metric_field_name, inplace=True)

    metrics = weights.index

    # Convert the data:
    for metric in metrics:
        if direction_multipliers is not None:
            converted_data[metric] *= direction_multipliers[metric]
        if offsets is not None:
            converted_data[metric] += offsets[metric]

    converted_data["Dynascore"] = 0

    # Normalize the weights:
    weights = weights / weights.sum()

    # We don't want small denominators to make AMRS super sensitive to
    # noise in the model submissions.
    delta = converted_data.diff()
    delta_threshold = (
        converted_data[perf_metric_field_name].max() * delta_cutoff_proportion
    )
    # Positions of rows whose performance gap from the preceding row
    # exceeds the threshold. The first row has a NaN delta and is skipped.
    satisfied_indices = [
        index
        for index, value in enumerate(delta[perf_metric_field_name])
        if abs(value) > delta_threshold
    ]

    for metric in metrics:
        # Use `.iloc` because `satisfied_indices` holds positions, not labels.
        AMRS = (
            delta[metric].iloc[satisfied_indices].abs()
            / delta[perf_metric_field_name].iloc[satisfied_indices]
        ).mean(skipna=True)
        converted_data[metric] = converted_data[metric] / abs(AMRS)
        converted_data["Dynascore"] += converted_data[metric] * weights.get(
            metric, 0
        )

    return converted_data.sort_values("Dynascore", ascending=False)
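
To make the AMRS step concrete, here is a minimal pure-Python sketch of the same calculation for a single metric. The helper `amrs` and the toy values are hypothetical and chosen only so the arithmetic is easy to follow; they are not from the paper, and the sketch assumes the rows are already sorted by ascending performance and that at least one performance gap exceeds the cutoff.

```python
# Toy values (hypothetical), already sorted by ascending performance.
perf = [60.0, 64.0, 70.0]
throughput = [12.0, 10.0, 7.0]


def amrs(metric_values, perf_values, cutoff_proportion=0.0001):
    """Mean of |delta metric| / delta perf over adjacent rows whose
    performance gap exceeds the cutoff threshold."""
    threshold = max(perf_values) * cutoff_proportion
    ratios = []
    for i in range(1, len(perf_values)):
        d_perf = perf_values[i] - perf_values[i - 1]
        d_metric = metric_values[i] - metric_values[i - 1]
        # Skip near-ties in performance so tiny denominators
        # do not dominate the average.
        if abs(d_perf) > threshold:
            ratios.append(abs(d_metric) / d_perf)
    return sum(ratios) / len(ratios)


throughput_amrs = amrs(throughput, perf)
perf_amrs = amrs(perf, perf)

# Dividing by the AMRS expresses the metric in performance-equivalent units.
rescaled_throughput = [value / abs(throughput_amrs) for value in throughput]
```

Here the performance gaps are 4 and 6 and the throughput gaps are 2 and 3 in absolute value, so both ratios are 0.5 and the throughput values are doubled by the rescaling, while performance itself has an AMRS of 1.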

## Example

These numbers are from [Ma et al. 2021](https://papers.nips.cc/paper/2021/hash/55b1927fdafef39c48e5b73b5d61ea60-Abstract.html), Table 1, top (NLI example). The output scores differ somewhat from those in the paper. I assume this is because the paper's scoring was based on the unrounded values.

In [4]:
data = {
    "Model": ["DeBERTa", "RoBERTa", "ALBERT", "T5", "BERT", "Majority Baseline", "FastText"],
    "Perf": [69.54, 69.07, 67.29, 67.16, 64.82, 32.41, 31.29],
    "Throughput": [7.41, 9.23, 9.60, 7.10, 9.39, 77.33, 73.94],
    "Memory": [5.71, 4.82, 2.18, 10.62, 4.13, 1.15, 2.20],
    "Fairness": [91.97, 90.94, 89.94, 91.89, 92.11, 100.00, 83.23],
    "Robustness": [75.70, 74.82, 74.12, 73.47, 66.38, 100.00, 69.14]
}

data = pd.DataFrame(data).set_index("Model")

Here's a look at a full set-up. Only `weights` and `perf_metric_field_name` need real values; `direction_multipliers` and `offsets` must still be passed, but each can be `None`:

In [5]:
# The implementation normalizes the weights:
weights = pd.Series({
    "Perf": 4,
    "Throughput": 1,
    "Memory": 1,
    "Fairness": 1,
    "Robustness": 1})

perf_metric_field_name = "Perf"

# All our metrics are ones we want to increase, so these values
# are all 1. We could use -1 to reverse direction.
direction_multipliers = pd.Series(
    {'Perf': 1,
     'Throughput': 1,
     'Memory': 1,
     "Fairness": 1,
     "Robustness": 1})

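# Offsets are added after the direction multiplier is applied. The
# paper scores memory as (16GB - memory used). With the Memory
# multiplier of 1 above, this set-up computes (memory used + 16);
# a Memory multiplier of -1 would give (16 - memory used).
offsets = pd.Series(
    {'Perf': 0,
     'Throughput': 0,
     'Memory': 16,
     "Fairness": 0,
     "Robustness": 0})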

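The conversion that `dynascore` applies to each metric value is a multiplication by the direction multiplier followed by the addition of the offset. The sketch below isolates that step for one Memory value from the data above. The helper `convert` is hypothetical and exists only for this illustration.

```python
def convert(value, multiplier=1, offset=0):
    """Apply the direction multiplier first, then add the offset,
    in the same order as the conversion loop in `dynascore`."""
    return value * multiplier + offset


memory_used = 5.71  # DeBERTa's Memory value from the data above

# With a multiplier of 1, the offset is simply added to the raw value.
added = convert(memory_used, multiplier=1, offset=16)

# With a multiplier of -1, the same offset yields (16 - memory used),
# so models that use less memory receive larger values.
headroom = convert(memory_used, multiplier=-1, offset=16)

print(round(added, 2), round(headroom, 2))
```

The two settings give roughly 21.71 and 10.29 respectively, so the choice of multiplier determines whether higher memory use raises or lowers the converted value.
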
Example run:

In [6]:
scored = dynascore(
    data,
    weights,
    perf_metric_field_name=perf_metric_field_name,
    direction_multipliers=direction_multipliers,
    offsets=offsets,
    delta_cutoff_proportion=0.0001)

In [7]:
scored

| Model | Perf | Throughput | Memory | Fairness | Robustness | Dynascore |
|---|---|---|---|---|---|---|
| DeBERTa | 69.54 | 1.511594 | 1.806587 | 16.68947 | 11.68017 | 38.730978 |
| RoBERTa | 69.07 | 1.882863 | 1.732527 | 16.50256 | 11.54439 | 38.492792 |
| ALBERT | 67.29 | 1.95834 | 1.51284 | 16.321093 | 11.436383 | 37.548582 |
| T5 | 67.16 | 1.448356 | 2.215171 | 16.674953 | 11.336091 | 37.539321 |
| BERT | 64.82 | 1.915502 | 1.675109 | 16.714875 | 10.242136 | 36.228453 |
| Majority Baseline | 32.41 | 15.77484 | 1.427129 | 18.146646 | 15.429551 | 22.552271 |
| FastText | 31.29 | 15.083301 | 1.514504 | 15.103453 | 10.667992 | 20.941156 |
