# Task 2 Alignment

This notebook computes the target distributions and retrieved page alignments for **Task 2**.
It depends on the output of the PageAlignments notebook.

This notebook can be run in two modes: 'train', to process the training topics, and 'eval' for the eval topics.

In [1]:
DATA_MODE = 'train'

## Setup

We begin by loading necessary libraries:

In [2]:
import sys
from pathlib import Path
import pandas as pd
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import gzip
import json
from natural.size import binarysize

Set up progress bar and logging support:

In [3]:
from tqdm.auto import tqdm
tqdm.pandas(leave=False)

In [4]:
import sys, logging
logging.basicConfig(level=logging.INFO, stream=sys.stderr)
log = logging.getLogger('Task2Alignment')

And set up an output directory:

In [5]:
from wptrec.save import OutRepo
output = OutRepo('data/metric-tables')

In [21]:
from wptrec import metrics

## Data and Helpers

Most data loading is outsourced to `MetricInputs`.  First we save the data mode where metric inputs can find it:

In [6]:
import wptrec
wptrec.DATA_MODE = DATA_MODE

In [7]:
from MetricInputs import *

INFO:MetricInputs:reading data\metric-tables\page-sub-geo-align.parquet
INFO:MetricInputs:reading data\metric-tables\page-src-geo-align.parquet
INFO:MetricInputs:reading data\metric-tables\page-gender-align.parquet
INFO:MetricInputs:reading data\metric-tables\page-occ-align.parquet
INFO:MetricInputs:reading data\metric-tables\page-alpha-align.parquet
INFO:MetricInputs:reading data\metric-tables\page-age-align.parquet
INFO:MetricInputs:reading data\metric-tables\page-pop-align.parquet
INFO:MetricInputs:reading data\metric-tables\page-langs-align.parquet


In [8]:
dimensions

[<dimension "sub-geo": 21 levels>,
 <dimension "src-geo": 21 levels>,
 <dimension "gender": 4 levels>,
 <dimension "occ": 33 levels>,
 <dimension "alpha": 4 levels>,
 <dimension "age": 4 levels>,
 <dimension "pop": 4 levels>,
 <dimension "langs": 3 levels>]

### qrel join

We want a function to join alignments with qrels:

In [9]:
def qr_join(align):
    return qrels.join(align, on='page_id').set_index(['topic_id', 'page_id'])

### norm_dist

And a function to normalize to a distribution:

In [12]:
def norm_dist_df(mat):
    sums = mat.sum('columns')
    return mat.divide(sums, 'rows')
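As a quick check of what `norm_dist_df` does, here is a minimal sketch that applies it to a small made-up frame. The values and column labels are illustrative only, not task data, and the helper is restated with keyword arguments so the block runs on its own. Each row is divided by its own sum, so every row becomes a distribution.

```python
import pandas as pd

def norm_dist_df(mat):
    # equivalent to the helper above: divide each row by its row sum
    sums = mat.sum(axis='columns')
    return mat.divide(sums, axis='rows')

# toy counts for two topics over three groups (made-up values)
toy = pd.DataFrame(
    {'a': [1.0, 2.0], 'b': [1.0, 2.0], 'c': [2.0, 4.0]},
    index=pd.Index([1, 2], name='topic_id'),
)
toy_dist = norm_dist_df(toy)
print(toy_dist)
```

Note that a row whose sum is zero would come out as NaN, since the division is not guarded.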

## Work and Target Exposure

The first thing we need to do to prepare the metric is to compute the work-needed for each topic's pages, and use that to compute the target exposure for each (relevant) page in the topic.

This is because an ideal ranking orders relevant documents in decreasing order of work needed, followed by irrelevant documents.  All relevant documents at a given work level should receive the same expected exposure.

First, look up the work for each query page ('query page work', or qpw):

In [13]:
qpw = qrels.join(page_quality, on='page_id')
qpw

,topic_id,page_id,quality
0,84,572,Start
1,84,627,GA
2,84,678,C
3,84,903,C
4,84,1193,C
...,...,...,...
2088301,2859,69878035,Start
2088302,2859,69879576,Stub
2088303,2859,69882349,Stub
2088304,2859,69887896,Stub


Now use that to count the documents at each work level:

In [14]:
qwork = qpw.groupby(['topic_id', 'quality'])['page_id'].count()
qwork

topic_id  quality
84        Stub        3631
          Start       1872
          C           1069
          B            575
          GA           260
                     ...  
2859      Start      25214
          C          13446
          B           4925
          GA          1976
          FA           115
Name: page_id, Length: 300, dtype: int64

Now we need to convert this into target exposure levels.  This function will, given a series of counts for each work level, compute the expected exposure a page at that work level should receive.

In [24]:
def qw_tgt_exposure(qw_counts: pd.Series) -> pd.Series:
    if 'topic_id' == qw_counts.index.names[0]:
        qw_counts = qw_counts.reset_index(level='topic_id', drop=True)
    qwc = qw_counts.reindex(work_order, fill_value=0).astype('i4')
    tot = int(qwc.sum())
    da = metrics.discount(tot)
    qwp = qwc.shift(1, fill_value=0)
    qwc_s = qwc.cumsum()
    qwp_s = qwp.cumsum()
    res = pd.Series(
        [np.mean(da[s:e]) for (s, e) in zip(qwp_s, qwc_s)],
        index=qwc.index
    )
    return res
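The block-averaging idea in `qw_tgt_exposure` can be sketched on a toy topic. This is only an illustration: `toy_discount` is a hypothetical logarithmic discount standing in for `metrics.discount`, whose actual form is not shown here, and the counts are made up.

```python
import numpy as np

# stand-in discount (an assumption; the notebook uses metrics.discount)
def toy_discount(n):
    ranks = np.arange(1, n + 1)
    return 1.0 / np.log2(ranks + 1)

counts = np.array([2, 3, 1])   # pages at each work level, first-ranked level first
ends = np.cumsum(counts)       # exclusive end rank of each level
starts = ends - counts         # start rank of each level
disc = toy_discount(int(counts.sum()))

# every page in a level gets the mean discount over the ranks that level occupies
level_exposure = np.array([disc[s:e].mean() for s, e in zip(starts, ends)])
print(level_exposure)
```

Because each level receives the mean of its own slice of the discount vector, the total exposure handed out across all pages equals the sum of the discount vector, and earlier levels never receive less than later ones.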

We'll then apply this to each topic, to determine the per-topic target exposures:

In [25]:
qw_pp_target = qwork.groupby('topic_id').apply(qw_tgt_exposure)
qw_pp_target.name = 'tgt_exposure'
qw_pp_target

topic_id  quality
84        Stub       0.099625
          Start      0.082342
          C          0.079633
          B          0.078472
          GA         0.077948
                       ...   
2859      Start      0.065043
          C          0.062777
          B          0.061996
          GA         0.061735
          FA         0.061659
Name: tgt_exposure, Length: 300, dtype: float32

We can now merge the relevant document work categories with this exposure, to compute the target exposure for each relevant document:

In [26]:
qp_exp = qpw.join(qw_pp_target, on=['topic_id', 'quality'])
qp_exp = qp_exp.set_index(['topic_id', 'page_id'])['tgt_exposure']
qp_exp

topic_id  page_id 
84        572         0.082342
          627         0.077948
          678         0.079633
          903         0.079633
          1193        0.079633
                        ...   
2859      69878035    0.065043
          69879576    0.075569
          69882349    0.075569
          69887896    0.075569
          69891491    0.075569
Name: tgt_exposure, Length: 2088306, dtype: float32

## Subject Geography

Subject geography targets the average of the relevant set alignments and the world population.

In [12]:
qr_sub_geo_align = qr_join(sub_geo_align)
qr_sub_geo_align

topic_id,page_id,@UNKNOWN,Antarctica,Caribbean,Central America,Central Asia,Eastern Africa,Eastern Asia,Eastern Europe,Middle Africa,Northern Africa,...,Northern Europe,Oceania,South America,South-eastern Asia,Southern Africa,Southern Asia,Southern Europe,Western Africa,Western Asia,Western Europe
84,572,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
84,627,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
84,678,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
84,903,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
84,1193,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2859,69878035,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2859,69879576,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2859,69882349,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2859,69887896,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We can just average with the world pop, with a bit of normalization.

In [13]:
qr_sub_geo_tgt = qr_sub_geo_align.groupby('topic_id').sum()
qr_sub_geo_tgt = qr_sub_geo_tgt.iloc[:, 1:]
qr_sub_geo_tgt = norm_dist_df(qr_sub_geo_tgt)
qr_sub_geo_tgt = (qr_sub_geo_tgt + world_pop) * 0.5
qr_sub_geo_tgt.head()

topic_id,Antarctica,Caribbean,Central America,Central Asia,Eastern Africa,Eastern Asia,Eastern Europe,Middle Africa,Northern Africa,Northern America,Northern Europe,Oceania,South America,South-eastern Asia,Southern Africa,Southern Asia,Southern Europe,Western Africa,Western Asia,Western Europe
84,7.721177e-08,0.00576,0.015795,0.006975,0.035669,0.131219,0.034804,0.010847,0.016884,0.156412,0.080714,0.087933,0.038566,0.054365,0.009957,0.148701,0.052571,0.029303,0.025218,0.058306
111,7.721177e-08,0.011925,0.021943,0.004494,0.133176,0.11409,0.020969,0.014967,0.015383,0.060492,0.009321,0.107196,0.190632,0.074043,0.029067,0.124993,0.011745,0.023559,0.017587,0.014419
265,0.005345676,0.00359,0.014793,0.006669,0.029535,0.138823,0.044902,0.015195,0.024134,0.22865,0.05904,0.022009,0.048497,0.047861,0.010372,0.134395,0.039977,0.025388,0.02283,0.077994
323,0.0001021319,0.007558,0.018673,0.00661,0.034132,0.133108,0.047561,0.016861,0.019721,0.209493,0.064425,0.024642,0.05663,0.05621,0.008836,0.136204,0.030463,0.029258,0.026721,0.072791
396,7.721177e-08,0.004347,0.017819,0.004845,0.026839,0.143674,0.036382,0.009807,0.017386,0.228696,0.077268,0.01846,0.040173,0.054573,0.005905,0.169444,0.038324,0.025741,0.023708,0.05661


Make sure the rows are distributions:

In [14]:
qr_sub_geo_tgt.sum('columns').describe()

count    5.000000e+01
mean     1.000000e+00
std      1.382671e-16
min      1.000000e+00
25%      1.000000e+00
50%      1.000000e+00
75%      1.000000e+00
max      1.000000e+00
dtype: float64

Everything is 1, we're good to go!
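The reason the rows come out as distributions is that the average of two distributions is itself a distribution. A minimal sketch of the drop-unknown, normalize, and average steps on made-up numbers follows; the group names and the `background` series are hypothetical stand-ins for the real regions and `world_pop`.

```python
import pandas as pd

# made-up alignment sums for two topics; '@UNKNOWN' is first, as in the tables above
sums = pd.DataFrame(
    {'@UNKNOWN': [5.0, 1.0], 'north': [3.0, 0.0], 'south': [1.0, 4.0]},
    index=pd.Index([1, 2], name='topic_id'),
)
# hypothetical background distribution over the known groups
background = pd.Series({'north': 0.4, 'south': 0.6})

known = sums.iloc[:, 1:]                                       # drop the unknown column
known = known.divide(known.sum(axis='columns'), axis='rows')   # rows now sum to 1
target = (known + background) * 0.5                            # average with the background
print(target)
```

Adding the series to the frame aligns the series index with the frame's columns, so each row is averaged with the same background vector.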

In [15]:
output.save_table(qr_sub_geo_tgt, f'task1-{DATA_MODE}-sub-geo-target', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\task1-train-sub-geo-target.csv.gz
INFO:wptrec.save:data\metric-tables\task1-train-sub-geo-target.csv.gz: 9.56 KiB
INFO:wptrec.save:saving Parquet to data\metric-tables\task1-train-sub-geo-target.parquet
INFO:wptrec.save:data\metric-tables\task1-train-sub-geo-target.parquet: 24.36 KiB


## Source Geography

Source geography works the same way.

In [16]:
qr_src_geo_align = qr_join(src_geo_align)
qr_src_geo_align

topic_id,page_id,@UNKNOWN,Antarctica,Caribbean,Central America,Central Asia,Eastern Africa,Eastern Asia,Eastern Europe,Middle Africa,Northern Africa,...,Northern Europe,Oceania,South America,South-eastern Asia,Southern Africa,Southern Asia,Southern Europe,Western Africa,Western Asia,Western Europe
84,572,0.800000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.200000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000
84,627,0.381443,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.134021,0.015464,0.0,0.0,0.0,0.005155,0.072165,0.0,0.0,0.082474
84,678,0.181818,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.545455
84,903,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.000000,0.000000,0.0,0.0,0.0,0.000000,1.000000,0.0,0.0,0.000000
84,1193,0.628571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028571,...,0.057143,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2859,69878035,0.200000,0.0,0.0,0.0,0.0,0.0,0.0,0.6,0.0,0.000000,...,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000
2859,69879576,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.090909,0.0,0.0,0.454545
2859,69882349,1.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000
2859,69887896,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.000000,0.000000,0.0,0.0,0.0,0.750000,0.000000,0.0,0.0,0.000000


For purely geographic fairness, the target is easy - average with world pop.

In [17]:
qr_src_geo_tgt = qr_src_geo_align.groupby('topic_id').sum()
qr_src_geo_tgt = qr_src_geo_tgt.iloc[:, 1:]
qr_src_geo_tgt = norm_dist_df(qr_src_geo_tgt)
qr_src_geo_tgt = (qr_src_geo_tgt + world_pop) * 0.5
qr_src_geo_tgt.head()

topic_id,Antarctica,Caribbean,Central America,Central Asia,Eastern Africa,Eastern Asia,Eastern Europe,Middle Africa,Northern Africa,Northern America,Northern Europe,Oceania,South America,South-eastern Asia,Southern Africa,Southern Asia,Southern Europe,Western Africa,Western Asia,Western Europe
84,7.721177e-08,0.004013,0.012604,0.00497,0.027754,0.122795,0.025676,0.009553,0.014945,0.254032,0.084208,0.085304,0.035862,0.050074,0.007148,0.141661,0.036544,0.024211,0.020331,0.038314
111,7.721177e-08,0.004262,0.01334,0.004538,0.026227,0.118847,0.065773,0.009529,0.014695,0.289826,0.084273,0.046079,0.036684,0.04754,0.005957,0.126207,0.013778,0.023125,0.017658,0.051662
265,7.721177e-08,0.003712,0.012426,0.004583,0.026108,0.117771,0.02635,0.009502,0.014786,0.366511,0.067383,0.010005,0.030015,0.043952,0.005025,0.124363,0.01718,0.023143,0.018127,0.079057
323,1.249956e-06,0.004553,0.014491,0.00498,0.027809,0.126423,0.043497,0.009764,0.015062,0.268999,0.106848,0.023592,0.037743,0.050226,0.006738,0.133011,0.019052,0.023814,0.021962,0.061435
396,1.082931e-06,0.003535,0.015282,0.004697,0.026445,0.136402,0.036323,0.009502,0.015784,0.28806,0.07733,0.018547,0.034743,0.051686,0.005586,0.157908,0.027634,0.024457,0.02223,0.043849


Make sure the rows are distributions:

In [18]:
qr_src_geo_tgt.sum('columns').describe()

count    5.000000e+01
mean     1.000000e+00
std      1.248844e-16
min      1.000000e+00
25%      1.000000e+00
50%      1.000000e+00
75%      1.000000e+00
max      1.000000e+00
dtype: float64

Everything is 1, we're good to go!

In [19]:
output.save_table(qr_src_geo_tgt, f'task1-{DATA_MODE}-src-geo-target', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\task1-train-src-geo-target.csv.gz
INFO:wptrec.save:data\metric-tables\task1-train-src-geo-target.csv.gz: 9.77 KiB
INFO:wptrec.save:saving Parquet to data\metric-tables\task1-train-src-geo-target.parquet
INFO:wptrec.save:data\metric-tables\task1-train-src-geo-target.parquet: 24.54 KiB


## Gender

Now we're going to grab the gender alignments.  Again, we ignore UNKNOWN.

In [20]:
qr_gender_align = qr_join(gender_align)
qr_gender_align.head()

topic_id,page_id,@UNKNOWN,female,male,NB
84,572,1.0,0.0,0.0,0.0
84,627,1.0,0.0,0.0,0.0
84,678,1.0,0.0,0.0,0.0
84,903,1.0,0.0,0.0,0.0
84,1193,1.0,0.0,0.0,0.0


In [21]:
qr_gender_tgt = qr_gender_align.iloc[:, 1:].groupby('topic_id').sum()
qr_gender_tgt = norm_dist_df(qr_gender_tgt)
qr_gender_tgt = (qr_gender_tgt + gender_tgt) * 0.5
qr_gender_tgt.head()

topic_id,female,male,NB
84,0.359248,0.635752,0.005
111,0.345061,0.649939,0.005
265,0.333219,0.661244,0.005537
323,0.301029,0.693971,0.005
396,0.431714,0.562898,0.005389


In [22]:
output.save_table(qr_gender_tgt, f'task1-{DATA_MODE}-gender-target', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\task1-train-gender-target.csv.gz
INFO:wptrec.save:data\metric-tables\task1-train-gender-target.csv.gz: 1.40 KiB
INFO:wptrec.save:saving Parquet to data\metric-tables\task1-train-gender-target.parquet
INFO:wptrec.save:data\metric-tables\task1-train-gender-target.parquet: 4.81 KiB


## Occupation

Occupation is more straightforward, since we don't have a global target to average with.  We do need to drop unknown.

In [23]:
qr_occ_align = qr_join(occ_align)
qr_occ_tgt = qr_occ_align.iloc[:, 1:].groupby('topic_id').sum()
qr_occ_tgt = norm_dist_df(qr_occ_tgt)
qr_occ_tgt.head()

topic_id,activist,agricultural worker,artist,athlete,biologist,businessperson,chemist,civil servant,clergyperson,computer scientist,...,military personnel,musician,performing artist,physicist,politician,scientist,social scientist,sportsperson (non-athlete),transportation occupation,writer
84,0.032103,0.1802,0.013243,0.010337,0.096144,0.118009,0.022222,0.010752,0.007103,0.0,...,0.012829,0.001619,0.006693,0.004837,0.119983,0.090064,0.029409,0.001658,0.000652,0.076647
111,0.012048,0.0,0.016064,0.0,0.697791,0.023092,0.0,0.0,0.012048,0.0,...,0.0,0.0,0.006024,0.0,0.004016,0.083333,0.0,0.0,0.0,0.024096
265,0.001554,0.000365,0.009564,0.000852,0.002707,0.045472,0.00292,0.00148,0.002875,0.001989,...,0.002735,0.001778,0.000477,0.108297,0.005021,0.648089,0.002571,0.000335,0.001071,0.019928
323,0.002604,0.001543,0.006877,0.017871,0.000706,0.055789,0.000825,0.003464,0.002237,0.000371,...,0.307782,0.001369,0.004737,0.003836,0.020113,0.012174,0.001489,0.003178,0.39318,0.0137
396,0.001455,0.000187,0.162101,0.004497,7e-05,0.043703,1.4e-05,0.000244,5.9e-05,4.9e-05,...,0.000764,0.040067,0.512512,2e-05,0.002737,0.000353,0.000778,0.000342,0.000379,0.079161


In [24]:
output.save_table(qr_occ_tgt, f'task1-{DATA_MODE}-occ-target', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\task1-train-occ-target.csv.gz
INFO:wptrec.save:data\metric-tables\task1-train-occ-target.csv.gz: 14.37 KiB
INFO:wptrec.save:saving Parquet to data\metric-tables\task1-train-occ-target.parquet
INFO:wptrec.save:data\metric-tables\task1-train-occ-target.parquet: 37.65 KiB


## Remaining Attributes

The remaining attributes don't need any further processing, as they are completely known.

In [25]:
qr_age_align = qr_join(age_align)
qr_age_tgt = norm_dist_df(qr_age_align.groupby('topic_id').sum())
output.save_table(qr_age_tgt, f'task1-{DATA_MODE}-age-target', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\task1-train-age-target.csv.gz
INFO:wptrec.save:data\metric-tables\task1-train-age-target.csv.gz: 2.13 KiB
INFO:wptrec.save:saving Parquet to data\metric-tables\task1-train-age-target.parquet
INFO:wptrec.save:data\metric-tables\task1-train-age-target.parquet: 6.23 KiB


In [26]:
qr_alpha_align = qr_join(alpha_align)
qr_alpha_tgt = norm_dist_df(qr_alpha_align.groupby('topic_id').sum())
output.save_table(qr_alpha_tgt, f'task1-{DATA_MODE}-alpha-target', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\task1-train-alpha-target.csv.gz
INFO:wptrec.save:data\metric-tables\task1-train-alpha-target.csv.gz: 2.12 KiB
INFO:wptrec.save:saving Parquet to data\metric-tables\task1-train-alpha-target.parquet
INFO:wptrec.save:data\metric-tables\task1-train-alpha-target.parquet: 5.99 KiB


In [27]:
qr_langs_align = qr_join(langs_align)
qr_langs_tgt = norm_dist_df(qr_langs_align.groupby('topic_id').sum())
output.save_table(qr_langs_tgt, f'task1-{DATA_MODE}-langs-target', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\task1-train-langs-target.csv.gz
INFO:wptrec.save:data\metric-tables\task1-train-langs-target.csv.gz: 1.68 KiB
INFO:wptrec.save:saving Parquet to data\metric-tables\task1-train-langs-target.parquet
INFO:wptrec.save:data\metric-tables\task1-train-langs-target.parquet: 5.20 KiB


In [28]:
qr_pop_align = qr_join(pop_align)
qr_pop_tgt = norm_dist_df(qr_pop_align.groupby('topic_id').sum())
output.save_table(qr_pop_tgt, f'task1-{DATA_MODE}-pop-target', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\task1-train-pop-target.csv.gz
INFO:wptrec.save:data\metric-tables\task1-train-pop-target.csv.gz: 2.17 KiB
INFO:wptrec.save:saving Parquet to data\metric-tables\task1-train-pop-target.parquet
INFO:wptrec.save:data\metric-tables\task1-train-pop-target.parquet: 6.14 KiB


### Geographic Alignment

Now that we've computed per-page target exposure, we're ready to set up the geographic alignment vectors for computing the per-*group* expected exposure with geographic data.

We're going to start by getting the alignments for relevant documents for each topic:

In [None]:
qp_geo_align = qrels.join(page_geo_align, on='page_id').set_index(['id', 'page_id'])
qp_geo_align.index.names = ['q_id', 'page_id']
qp_geo_align

Now we need to compute the per-query target exposures.  This starts with aligning our vectors:

In [None]:
qp_geo_exp, qp_geo_align = qp_exp.align(qp_geo_align, fill_value=0)

Now we can multiply the alignment vectors by the exposure vector and sum by topic; this is equivalent to a matrix-vector multiplication on a topic-by-topic basis.

In [None]:
qp_aexp = qp_geo_align.multiply(qp_geo_exp, axis=0)
q_geo_align = qp_aexp.groupby('q_id').sum()
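To see why scaling each page's alignment row by its exposure and summing over pages is a matrix-vector product, here is a small sketch for a single topic. The matrix `A` and vector `e` are made-up values, not task data.

```python
import numpy as np

# made-up alignment matrix: 3 pages x 2 groups
A = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])
# made-up target exposure for each page
e = np.array([0.5, 0.3, 0.2])

weighted_sum = (A * e[:, None]).sum(axis=0)   # scale each page's row, then sum over pages
mat_vec = A.T @ e                             # the same quantity as a matrix-vector product
print(weighted_sum, mat_vec)
```

Both expressions give the exposure each group should receive; the `groupby(...).sum()` in the notebook does the same thing for every topic at once.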

Now things get a *little* weird.  We want to average the empirical distribution with the world population to compute our fairness target.  However, we don't have empirical data on the distribution of articles that do or do not have geographic alignments.

Therefore, we are going to average only the *known-geography* vector with the world population.  This proceeds in four steps:

1. Normalize the known-geography matrix so its rows sum to 1.
2. Average each row with the world population.
3. De-normalize the known-geography matrix so it is back in its original scale, but adjusted with the world population.
4. Normalize the *entire* matrix so its rows sum to 1.
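The steps above can be sketched for a single topic as follows. All numbers are made up, and `background` is a hypothetical two-group stand-in for the world population vector.

```python
import numpy as np

# exposure-weighted alignment for one topic (made-up values)
unknown = 0.4                          # mass with unknown geography
known = np.array([0.45, 0.15])         # mass for two known groups
background = np.array([0.5, 0.5])      # hypothetical stand-in for the world population

k_sum = known.sum()
k_dist = known / max(k_sum, 1.0e-6)    # 1. normalize the known part
k_avg = (k_dist + background) * 0.5    # 2. average with the background
k_adj = k_avg * k_sum                  # 3. restore the original scale
full = np.concatenate([[unknown], k_adj])
target = full / full.sum()             # 4. normalize the whole vector
print(target)
```

Because the averaged known part still sums to 1 before it is rescaled, the known groups keep their combined share and the unknown share is unchanged; only the split among known groups moves toward the background.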

Let's go.

In [None]:
qg_known = q_geo_align.drop(columns=['Unknown'])

Normalize (adding a small value to avoid division by zero - affected entries will have a zero numerator anyway):

In [None]:
qg_ksums = qg_known.sum(axis=1)
qg_kd = qg_known.divide(np.maximum(qg_ksums, 1.0e-6), axis=0)

Average:

In [None]:
qg_kd = (qg_kd + world_pop) * 0.5

De-normalize:

In [None]:
qg_known = qg_kd.multiply(qg_ksums, axis=0)

Recombine with the Unknown column:

In [None]:
q_geo_tgt = q_geo_align[['Unknown']].join(qg_known)

Normalize targets:

In [None]:
q_geo_tgt = q_geo_tgt.divide(q_geo_tgt.sum(axis=1), axis=0)
q_geo_tgt

These are our group exposure target distributions for each query, for the geographic data.  We're now ready to set up the matrix.

In [None]:
train_geo_qtgt = q_geo_tgt.loc[train_topics['id']]
eval_geo_qtgt = q_geo_tgt.loc[eval_topics['id']]

And save data.

In [None]:
save_table(train_geo_qtgt, 'task2-train-geo-targets')
save_table(eval_geo_qtgt, 'task2-eval-geo-targets')

### Intersectional Alignment

Now we need to compute the intersectional targets for Task 2.  We're going to take a slightly different approach here, based on the intersectional logic for Task 1, because we've come up with better ways to write the code, but the effect is the same: only known aspects are averaged.

We'll write a function very similar to the one for Task 1:

In [None]:
def query_xideal(qdf, ravel=True):
    pages = qdf['page_id']
    pages = pages[pages.isin(page_xalign.indexes['page'])]
    q_xa = page_xalign.loc[pages.values, :, :]
    
    # now we need to get the exposure for the pages, and multiply
    p_exp = qp_exp.loc[qdf.name]
    assert p_exp.index.is_unique
    p_exp = xr.DataArray(p_exp, dims=['page'])
    
    # and we multiply!
    q_xa = q_xa * p_exp

    # normalize into a matrix (this time we don't clear)
    q_am = q_xa.sum(axis=0)
    q_am = q_am / q_am.sum()
    
    # compute fractions in each section - combined with q_am[0,0], this should be about 1
    q_fk_all = q_am[1:, 1:].sum()
    q_fk_geo = q_am[1:, :1].sum()
    q_fk_gen = q_am[:1, 1:].sum()
    
    # known average
    q_am[1:, 1:] *= 0.5
    q_am[1:, 1:] += int_tgt * 0.5 * q_fk_all
    
    # known-geo average
    q_am[1:, :1] *= 0.5
    q_am[1:, :1] += geo_tgt_xa * 0.5 * q_fk_geo
    
    # known-gender average
    q_am[:1, 1:] *= 0.5
    q_am[:1, 1:] += gender_tgt_xa * 0.5 * q_fk_gen
    
    # and return the result
    if ravel:
        return pd.Series(q_am.values.ravel())
    else:
        return q_am
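The block-wise averaging in `query_xideal` can be illustrated on a tiny made-up matrix. This sketch uses plain NumPy instead of xarray, and the three target arrays are hypothetical uniform stand-ins for `int_tgt`, `geo_tgt_xa`, and `gender_tgt_xa`. Row 0 and column 0 play the role of "unknown".

```python
import numpy as np

# made-up 3x3 joint alignment that sums to 1; row/column 0 stand for "unknown"
q_am = np.array([[0.10, 0.10, 0.10],
                 [0.10, 0.30, 0.10],
                 [0.05, 0.05, 0.10]])
# hypothetical background targets; each sums to 1
int_tgt = np.full((2, 2), 0.25)        # both attributes known
geo_tgt = np.array([[0.5], [0.5]])     # first attribute known only
gen_tgt = np.array([[0.5, 0.5]])       # second attribute known only

before = (q_am[1:, 1:].sum(), q_am[1:, :1].sum(), q_am[:1, 1:].sum())

out = q_am.copy()
for block, tgt in [((slice(1, None), slice(1, None)), int_tgt),
                   ((slice(1, None), slice(0, 1)), geo_tgt),
                   ((slice(0, 1), slice(1, None)), gen_tgt)]:
    mass = out[block].sum()
    out[block] = 0.5 * out[block] + 0.5 * tgt * mass   # average, scaled to the block's mass

after = (out[1:, 1:].sum(), out[1:, :1].sum(), out[:1, 1:].sum())
print(out)
```

Each block is averaged with a background scaled to that block's own mass, so the mass in every block (and in the fully unknown cell) is preserved and the matrix still sums to 1.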

Test this function out:

In [None]:
query_xideal(qdf, ravel=False)

And let's go!

In [None]:
q_xtgt = qrels.groupby('id').progress_apply(query_xideal)
q_xtgt

In [None]:
train_qtgt = q_xtgt.loc[train_topics['id']]
eval_qtgt = q_xtgt.loc[eval_topics['id']]

And save our tables:

In [None]:
save_table(train_qtgt, 'task2-train-int-targets')

In [None]:
save_table(eval_qtgt, 'task2-eval-int-targets')

## Task 2B - Equity of Underexposure

For 2022, we are using a different version of the metric. **Equity of Underexposure** looks at each page's underexposure (system exposure is less than target exposure), and looks for underexposure to be equitably distributed between groups.

On its own, this isn't too difficult; averaging with background distributions, however, gets rather subtle.  Background distributions are at the group level, but we need to propagate that to the page level, so we can compute the difference between system and target exposure at the page level, and then aggregate the underexposure within each group.

The idea of equity of underexposure is that we compute the system exposure $\epsilon = \operatorname{E}_\pi [\eta]$ and the target exposure $\epsilon^* = \operatorname{E}_\tau [\eta]$.  We then compute $u = \min(\epsilon - \epsilon^*, 0)$, which is nonzero (and negative) only where system exposure falls below the target, and aggregate it by group; if $A$ is our page alignment matrix and $\vec{u}$ is the vector of per-page underexposure, we compute the group underexposure by $A^T \vec{u}$.
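A minimal numeric sketch of this aggregation, taking underexposure as the negative part of system minus target exposure so that it is nonzero only when a page receives less than its target. The alignment matrix and exposure values are made up for illustration.

```python
import numpy as np

# made-up alignment matrix (3 pages x 2 groups) and exposures
A = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])
sys_exp = np.array([0.6, 0.1, 0.3])   # expected exposure under the system
tgt_exp = np.array([0.4, 0.3, 0.3])   # expected exposure under the target

u = np.minimum(sys_exp - tgt_exp, 0.0)   # nonzero only where the system falls short
group_under = A.T @ u                    # aggregate underexposure by group
print(u, group_under)
```

Only the second page is underexposed here, and because it is split evenly between the two groups, each group absorbs half of its shortfall.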

That's the key idea.  However, we want to use a target $\epsilon^\dagger$ that is the page-level equivalent of averaging group-aggregated $\epsilon^*$ with global target distributions $w_g$.  We can do this in a few stages.  First, we compute the total alignment of each group, and use that to compute the fraction of group global weight that should go to each unit of alignment:

\begin{align*}
s_g & = \sum_d a_{dg} \\
\hat{w}_g & = \frac{w_g}{s_g}
\end{align*}

We can then average:

\begin{align*}
\epsilon^\dagger_d & = \frac{1}{2}\left(\epsilon^*_d + \sum_g a_{dg} \hat{w}_g \epsilon^*_{\mathrm{total}} \right) \\
\end{align*}

This is all on a per-topic basis.
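The two formulas can be checked numerically on a toy topic. The alignments, exposures, and global weights below are made-up values; the only assumptions are that $w_g$ sums to 1 and that every group has nonzero total alignment.

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])              # made-up alignments a_dg (pages x groups)
eps_star = np.array([0.5, 0.3, 0.2])    # made-up per-page target exposure
w = np.array([0.7, 0.3])                # hypothetical global target w_g

s = A.sum(axis=0)                       # s_g: total alignment of each group
w_hat = w / s                           # global weight per unit of alignment
eps_dagger = 0.5 * (eps_star + (A @ w_hat) * eps_star.sum())
print(eps_dagger)
```

Under those assumptions the global term distributes exactly the topic's total target exposure across pages, so the averaged target has the same total as the original one.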

### Demo Topic

We're going to reuse demo topic data from before:

In [None]:
q_xa

Compute the total for each attribute:

In [None]:
s_xg = q_xa.sum(axis=0) + 1e-10
s_xg

Let's get some fractions out of that:

In [None]:
s_xgf = s_xg / s_xg.sum()
s_xgf

Now, let's make a copy, and start building up a world target matrix that properly accounts for missing values:

In [None]:
W = s_xgf.copy()

Now, let's put in the known intersectional targets:

In [None]:
W[1:, 1:] = int_tgt * W[1:, 1:].sum()

Now we need the known-gender / unknown-geo targets:

In [None]:
W[0, 1:] = int_tgt.sum(axis=0) * W[0, 1:].sum()

And the known-geo / unknown-gender targets:

In [None]:
W[1:, 0] = int_tgt.sum(axis=1) * W[1:, 0].sum()

Let's see what we have:

In [None]:
W

Now we normalize it by $s_g$:

In [None]:
Wh = W / s_xg
Wh

The massive values are only where we have no relevant items, so they'll never actually be used.

We can now compute the query-aligned target matrix.

In [None]:
qp_gt = (q_xa * (Wh * qp_exp[1].sum())).sum(axis=(1,2)).to_series()
qp_gt.index.name = 'page_id'
qp_gt

In [None]:
qp_exp[1]

In [None]:
qp_tgt = 0.5 * (qp_exp[1] + qp_gt)
qp_tgt

### Setting Up Matrix

Now that we have the math worked out, we can create actual global target frames for each query.

In [None]:
def topic_page_tgt(qdf):
    pages = qdf['page_id']
    pages = pages[pages.isin(page_xalign.indexes['page'])]
    q_xa = page_xalign.loc[pages.values, :, :]
    
    # now we need to get the exposure for the pages
    p_exp = qp_exp.loc[qdf.name]
    assert p_exp.index.is_unique
    
    # need our sums
    s_xg = q_xa.sum(axis=0) + 1e-10
    
    # set up the global target
    W = s_xg / s_xg.sum()
    W[1:, 1:] = int_tgt * W[1:, 1:].sum()
    W[0, 1:] = int_tgt.sum(axis=0) * W[0, 1:].sum()
    W[1:, 0] = int_tgt.sum(axis=1) * W[1:, 0].sum()
    
    # per-unit global weights, de-normalized by total exposure
    Wh = W / s_xg
    Wh *= p_exp.sum()
    
    # compute global target
    gtgt = q_xa * Wh
    gtgt = gtgt.sum(axis=(1,2)).to_series()
    
    # compute average target
    avg_tgt = 0.5 * (p_exp + gtgt)
    avg_tgt.index.name = 'page'
    
    return avg_tgt

Test it quick:

In [None]:
topic_page_tgt(qdf)

And create our targets:

In [None]:
qp_tgt = qrels.groupby('id').progress_apply(topic_page_tgt)
qp_tgt

In [None]:
save_table(qp_tgt.to_frame('target'), 'task2-all-page-targets')

In [None]:
train_qptgt = qp_tgt.loc[train_topics['id']].to_frame('target')
eval_qptgt = qp_tgt.loc[eval_topics['id']].to_frame('target')

In [None]:
save_table(train_qptgt, 'task2-train-page-targets')
save_table(eval_qptgt, 'task2-eval-page-targets')