# Task 2 Evaluation

This notebook contains the evaluation for Task 2 of the TREC Fair Ranking track.

## Setup

We begin by loading necessary libraries:

In [None]:
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import gzip
import binpickle

Set up progress bar and logging support:

In [None]:
from tqdm.auto import tqdm
tqdm.pandas(leave=False)

In [None]:
import sys, logging
logging.basicConfig(level=logging.INFO, stream=sys.stderr)
log = logging.getLogger('task2-eval')

Import metric code:

In [None]:
import metrics
from trecdata import scan_runs

Finally, load the saved metric object itself:

In [None]:
metric = binpickle.load('task2-eval-metric.bpk')

## Importing Data



Let's load the runs now:

In [None]:
runs = pd.DataFrame.from_records(row for (task, rows) in scan_runs() if task == 2 for row in rows)
runs

In [None]:
runs.head()

We also need to load our topic eval data:

In [None]:
topics = pd.read_json('data/eval-topics.json.gz', lines=True)
topics.head()

Tier 2 is the top 5 documents of the first 25 rankings. We also did not complete Tier 2 for all topics, so we keep only the topics whose `max_tier` is at least 2:

In [None]:
t2_topics = topics.loc[topics['max_tier'] >= 2, 'id']

In [None]:
r_top5 = runs['rank'] <= 5
r_first25 = runs['seq_no'] <= 25
r_done = runs['topic_id'].isin(t2_topics)
runs = runs[r_done & r_top5 & r_first25]
runs.info()
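To make the effect of the three masks concrete, here is a minimal sketch on a toy table. The column names follow the notebook, but `toy_runs`, `toy_t2_topics`, and all values are made up for illustration and are not track data.

```python
import pandas as pd

# Toy stand-in for the runs table (hypothetical values).
toy_runs = pd.DataFrame({
    'topic_id': [1, 1, 1, 2, 2],
    'seq_no':   [1, 1, 30, 1, 2],
    'rank':     [1, 6, 2, 3, 5],
})
# Hypothetical: only topic 1 reached Tier 2.
toy_t2_topics = pd.Series([1], name='id')

# Combine the three conditions with element-wise AND.
keep = (
    (toy_runs['rank'] <= 5)                       # top 5 documents
    & (toy_runs['seq_no'] <= 25)                  # first 25 rankings
    & toy_runs['topic_id'].isin(toy_t2_topics)    # topics with Tier 2 done
)
toy_filtered = toy_runs[keep]
print(toy_filtered)
```

Only the first toy row passes all three conditions; the others fail on rank, sequence number, or topic respectively.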

## Computing Metrics

We are now ready to compute the metric for each (system, topic) pair.  Let's go!

In [None]:
rank_exp = runs.groupby(['run_name', 'topic_id']).progress_apply(metric)
rank_exp

Now let's average over topics to get one score per run:

In [None]:
run_scores = rank_exp.groupby('run_name').mean()
run_scores
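The two-step pattern above (apply a metric per group, then average per run) can be sketched on toy data. This is only an illustration: `toy_metric` is a hypothetical stand-in that returns a named `Series`, and it is not the metric loaded from `task2-eval-metric.bpk`; the score names `score_a` and `score_b` are likewise invented.

```python
import pandas as pd

def toy_metric(group):
    # Hypothetical stand-in metric: returns named scores for one
    # (run_name, topic_id) group. Not the real evaluation metric.
    return pd.Series({
        'score_a': group['rank'].mean(),
        'score_b': float(len(group)),
    })

toy_runs = pd.DataFrame({
    'run_name': ['A', 'A', 'A', 'B', 'B'],
    'topic_id': [1, 1, 2, 1, 2],
    'rank':     [1, 2, 1, 1, 3],
})

# One row of scores per (run_name, topic_id) pair.
toy_exp = toy_runs.groupby(['run_name', 'topic_id'])[['rank']].apply(toy_metric)

# Average over topics: one row of scores per run.
toy_scores = toy_exp.groupby('run_name').mean()
print(toy_scores)
```

Because the metric returns a `Series`, the per-group results stack into a frame indexed by `(run_name, topic_id)`, which is what makes the later `groupby('run_name')` and `groupby('topic_id')` calls work on index levels.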

## Analyzing Scores

What is the distribution of scores?

In [None]:
run_scores.describe()

In [None]:
sns.displot(x='EE-L', data=run_scores)
plt.show()

In [None]:
run_scores.sort_values('EE-L', ascending=False)

In [None]:
sns.relplot(x='EE-D', y='EE-R', data=run_scores)
sns.rugplot(x='EE-D', y='EE-R', data=run_scores)
plt.show()

## Per-Topic Stats

We need to return per-topic stats to each participant, at least for the score.

In [None]:
topic_stats = rank_exp.groupby('topic_id').agg(['mean', 'median', 'min', 'max'])
topic_stats

For the final score analysis, we keep the per-topic `EE-L` statistics and drop the mean, leaving the median, min, and max:

In [None]:
topic_range = topic_stats.loc[:, 'EE-L']
topic_range = topic_range.drop(columns=['mean'])
topic_range
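Aggregating with a list of functions yields two-level columns (metric, statistic), which is why selecting `'EE-L'` returns a plain frame of statistics. A small sketch with invented values shows the shape; `toy_exp` and its numbers are assumptions for illustration only.

```python
import pandas as pd

# Toy per-(run, topic) scores; values are invented.
toy_exp = pd.DataFrame(
    {'EE-L': [0.25, 0.5, 0.75, 1.0], 'EE-D': [1.0, 2.0, 3.0, 4.0]},
    index=pd.MultiIndex.from_tuples(
        [('A', 1), ('A', 2), ('B', 1), ('B', 2)],
        names=['run_name', 'topic_id'],
    ),
)

# Columns become a MultiIndex: (metric, statistic).
toy_stats = toy_exp.groupby('topic_id').agg(['mean', 'min', 'max'])

# Selecting the top-level key drops that level, leaving one column per statistic.
toy_range = toy_stats.loc[:, 'EE-L'].drop(columns=['mean'])
print(toy_range)
```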

And now we combine scores with these results to return to participants.

In [None]:
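ret_dir = Path('results')
ret_dir.mkdir(parents=True, exist_ok=True)  # make sure the output directory exists
for system, sys_scores in rank_exp.groupby('run_name'):
    # use a distinct loop variable so the `runs` frame is not overwritten
    aug = sys_scores.join(topic_range).reset_index().drop(columns=['run_name'])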
    fn = ret_dir / f'{system}.tsv'
    log.info('writing %s', fn)
    aug.to_csv(fn, sep='\t', index=False)
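As a sanity check on the output format, the following sketch repeats the join-and-write step on toy data in a temporary directory and reads one file back. All names prefixed with `toy_` and all values are hypothetical; the real files go to `results/`.

```python
import tempfile
from pathlib import Path
import pandas as pd

# Toy per-(run, topic) scores and per-topic ranges (invented values).
toy_exp = pd.DataFrame(
    {'EE-L': [0.25, 0.5, 0.75, 1.0]},
    index=pd.MultiIndex.from_tuples(
        [('A', 1), ('A', 2), ('B', 1), ('B', 2)],
        names=['run_name', 'topic_id'],
    ),
)
toy_range = pd.DataFrame(
    {'min': [0.25, 0.5], 'max': [0.75, 1.0]},
    index=pd.Index([1, 2], name='topic_id'),
)

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / 'results'
    out_dir.mkdir(parents=True, exist_ok=True)
    for system, sys_scores in toy_exp.groupby('run_name'):
        # The join aligns on the shared 'topic_id' index level.
        aug = sys_scores.join(toy_range).reset_index().drop(columns=['run_name'])
        aug.to_csv(out_dir / f'{system}.tsv', sep='\t', index=False)
    written = sorted(p.name for p in out_dir.iterdir())
    round_trip = pd.read_csv(out_dir / 'A.tsv', sep='\t')

print(written)
print(round_trip)
```

Each participant file then holds one row per topic, with the run's own score next to the cross-run statistics for that topic.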