# Metric Inputs

This notebook serves to *re-import* the metric input data (qrels, page alignments) that were prepared in `PageAlignments`.
It can be imported as a Python module, and is intended to support the following usage in task-specific alignment & target
notebooks:

```python
from MetricInputs import *
```

On its own, it just shows summaries of that data.

## Setup

Import some libraries:

In [None]:
import warnings
import logging
import pandas as pd
import xarray as xr
from pathlib import Path

We're now going to set up the data mode, if necessary.

In [None]:
import wptrec
DATA_MODE = getattr(wptrec, 'DATA_MODE', None)
if DATA_MODE is None:
    warnings.warn('No DATA_MODE specified, assuming ‘train’')
    DATA_MODE = 'train'
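
The fallback above reads an optional attribute from the `wptrec` module, so a caller can presumably set it before this notebook is imported. A minimal sketch of the same pattern, using a `SimpleNamespace` as a stand-in for the module; the helper name `resolve_data_mode` and the value `'eval'` are made up for illustration:

```python
import warnings
from types import SimpleNamespace


def resolve_data_mode(module, default='train'):
    """Return module.DATA_MODE if it is set, else warn and use the default."""
    mode = getattr(module, 'DATA_MODE', None)
    if mode is None:
        warnings.warn(f'No DATA_MODE specified, assuming {default!r}')
        mode = default
    return mode


# Stand-ins for the real module: one configured, one not.
configured = SimpleNamespace(DATA_MODE='eval')  # 'eval' is a hypothetical mode
unconfigured = SimpleNamespace()

print(resolve_data_mode(configured))
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # silence the fallback warning in this demo
    print(resolve_data_mode(unconfigured))
```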

And the data directory:

In [None]:
DATA_DIR = Path('data/metric-tables')

In [None]:
_log = logging.getLogger(__name__)

## Topics

Now we will load the topics:

In [None]:
topics = pd.read_json(f'data/trec_2022_{DATA_MODE}_reldocs.jsonl', lines=True)
topics.head()

In [None]:
topics.rename(columns={'id': 'topic_id'}, inplace=True)

Now we are going to explode this into a set of `qrels`:

In [None]:
qrels = topics[['topic_id', 'rel_docs']].explode('rel_docs', ignore_index=True)
qrels.rename(columns={'rel_docs': 'page_id'}, inplace=True)
qrels['page_id'] = qrels['page_id'].astype('i4')
qrels = qrels.drop_duplicates()
qrels.head()
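
To make the explode step concrete, here is a small sketch on a toy frame. The topic and page IDs are invented; only the column names match the cells above.

```python
import pandas as pd

# Toy stand-in for the topics frame; the IDs are made up for illustration.
toy_topics = pd.DataFrame({
    'topic_id': [1, 2],
    'rel_docs': [[10, 11, 10], [12]],
})

# One row per (topic, page) pair.
toy_qrels = toy_topics[['topic_id', 'rel_docs']].explode('rel_docs', ignore_index=True)
toy_qrels = toy_qrels.rename(columns={'rel_docs': 'page_id'})

# explode leaves an object column, so cast to a compact integer type.
toy_qrels['page_id'] = toy_qrels['page_id'].astype('i4')

# Page 10 is listed twice for topic 1; keep a single row for it.
toy_qrels = toy_qrels.drop_duplicates()
print(toy_qrels)
```

Note that `drop_duplicates` keeps the original row labels, so the index may have gaps afterwards.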

## Page Alignments

Next we load the page alignments, using a helper function that returns each one as both a Pandas data frame and an xarray `DataArray`:

In [None]:
def _load_page_align(key):
    fn = DATA_DIR / f'page-{key}-align.parquet'
    _log.info('reading %s', fn)
    df = pd.read_parquet(fn)
    df.index.name = 'page_id'
    df.name = key
    dfx = xr.DataArray(df, dims=['page', key])
    return df, dfx
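
The data frame to `DataArray` conversion in the helper can be sketched on a toy alignment table. The values below are invented, and the sketch assumes (as the helper does) that pages are on the index and categories are on the columns:

```python
import pandas as pd
import xarray as xr

# Toy alignment table: rows are pages, columns are categories (made-up values).
toy_align = pd.DataFrame(
    {'female': [1.0, 0.0], 'male': [0.0, 1.0]},
    index=pd.Index([10, 11], name='page_id'),
)

# The frame's index and columns become the coordinates of the two named dims.
toy_xr = xr.DataArray(toy_align, dims=['page', 'gender'])

# Label-based lookup along both dimensions.
print(toy_xr.sel(page=10, gender='female').item())
```

Naming the second dimension after the alignment key means arrays for different attributes can later be combined without their category axes colliding.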

In [None]:
sub_geo_align, sub_geo_xr = _load_page_align('sub-geo')

In [None]:
src_geo_align, src_geo_xr = _load_page_align('src-geo')

In [None]:
gender_align, gender_xr = _load_page_align('gender')

In [None]:
occ_align, occ_xr = _load_page_align('occ')

In [None]:
alpha_align, alpha_xr = _load_page_align('alpha')

In [None]:
age_align, age_xr = _load_page_align('age')

In [None]:
pop_align, pop_xr = _load_page_align('pop')

In [None]:
langs_align, langs_xr = _load_page_align('langs')

## Geographic Background

Our geographic target needs the world population distribution to establish an equity target. This data comes from Wikipedia's [List of continents and continental subregions by population](https://en.wikipedia.org/wiki/List_of_continents_and_continental_subregions_by_population).

In [None]:
world_pop = pd.read_csv('data/world-pop.csv')
world_pop

Process it into a distribution series:

In [None]:
world_pop = world_pop.set_index('Name')['Population']
world_pop /= world_pop.sum()
world_pop.name = 'geography'
world_pop.sort_index(inplace=True)
world_pop
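
The normalization above turns raw counts into shares that sum to 1. A minimal sketch with made-up region names and counts (not real population figures):

```python
import pandas as pd

# Invented counts, for illustration only.
toy_pop = pd.DataFrame({
    'Name': ['Region B', 'Region A', 'Region C'],
    'Population': [300, 100, 600],
})

# Index by region name, divide by the total, and sort for a stable order.
toy_dist = toy_pop.set_index('Name')['Population']
toy_dist = toy_dist / toy_dist.sum()
toy_dist = toy_dist.sort_index()
print(toy_dist)
```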

## Gender Background

And a global gender target:

In [None]:
gender_tgt = pd.Series({
    'female': 0.495,
    'male': 0.495,
    'NB': 0.01
})
gender_tgt.name = 'gender'
gender_tgt.sum()

## Static Data

The work-needed codes have an order:

In [None]:
work_order = [
    'Stub',
    'Start',
    'C',
    'B',
    'GA',
    'FA',
]

And finally, a name for the unknown category:

In [None]:
UNKNOWN = '@UNKNOWN'

## Page Quality

And we can load the page quality data:

In [None]:
page_quality = pd.read_parquet(DATA_DIR / 'page-quality.parquet')
page_quality = page_quality.set_index('page_id')['quality']
page_quality = page_quality.astype('category').cat.reorder_categories(work_order)
page_quality = page_quality.to_frame()
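
As a sketch of what the category reordering does, the toy series below uses the same quality codes with invented page data. The `ordered=True` variant is not used in the cell above; it is shown only as an option in case comparisons between codes are wanted.

```python
import pandas as pd

work_order = ['Stub', 'Start', 'C', 'B', 'GA', 'FA']

# Toy quality labels (made up); every code must appear in work_order.
toy_quality = pd.Series(['GA', 'Stub', 'C', 'FA', 'B', 'Start'], name='quality')

# Categories default to sorted order; reorder them to follow work_order.
toy_cat = toy_quality.astype('category').cat.reorder_categories(work_order)
print(list(toy_cat.cat.categories))

# Optional: mark the categories as ordered so comparisons are allowed.
toy_ordered = toy_quality.astype('category').cat.reorder_categories(
    work_order, ordered=True
)
print((toy_ordered >= 'B').tolist())
```

`reorder_categories` raises a `ValueError` if the new list does not contain exactly the existing categories, so it also acts as a check that the data contains every code in `work_order` and nothing else.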

## Dimension Lists

We're going to make a list of dimensions, along with their targets.
We have a class to define these:

In [None]:
from wptrec.dimension import FairDim

In [None]:
dimensions = [
    FairDim(sub_geo_align, sub_geo_xr, world_pop, True),
    FairDim(src_geo_align, src_geo_xr, world_pop, True),
    FairDim(gender_align, gender_xr, gender_tgt, True),
    FairDim(occ_align, occ_xr, None, True),
    FairDim(alpha_align, alpha_xr),
    FairDim(age_align, age_xr),
    FairDim(pop_align, pop_xr),
    FairDim(langs_align, langs_xr),
]