# Task 2 Alignment

This notebook computes the target distributions and retrieved page alignments for **Task 2**.
It depends on the output of the PageAlignments notebook, as imported by MetricInputs.

This notebook can be run in two modes: 'train', to process the training topics, and 'eval' for the eval topics.

In [1]:
DATA_MODE = 'eval'

## Setup

We begin by loading necessary libraries:

In [2]:
import sys
import operator
from functools import reduce
from itertools import product
from collections import namedtuple
from pathlib import Path
import pandas as pd
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import gzip
import json
import warnings
from natural.size import binarysize

Set up progress bar and logging support:

In [3]:
from tqdm.auto import tqdm
tqdm.pandas(leave=False)

In [4]:
import sys, logging
logging.basicConfig(level=logging.INFO, stream=sys.stderr)
log = logging.getLogger('Task2Alignment')

And set up an output directory:

In [5]:
from wptrec.save import OutRepo
output = OutRepo('data/metric-tables')

In [6]:
from wptrec import metrics
from wptrec.dimension import sum_outer

## Data and Helpers

Most data loading is outsourced to `MetricInputs`.  First we save the data mode where metric inputs can find it:

In [7]:
import wptrec
wptrec.DATA_MODE = DATA_MODE

In [8]:
from MetricInputs import *

INFO:MetricInputs:reading data\metric-tables\page-sub-geo-align.parquet
INFO:MetricInputs:reading data\metric-tables\page-src-geo-align.parquet
INFO:MetricInputs:reading data\metric-tables\page-gender-align.parquet
INFO:MetricInputs:reading data\metric-tables\page-occ-align.parquet
INFO:MetricInputs:reading data\metric-tables\page-alpha-align.parquet
INFO:MetricInputs:reading data\metric-tables\page-age-align.parquet
INFO:MetricInputs:reading data\metric-tables\page-pop-align.parquet
INFO:MetricInputs:reading data\metric-tables\page-langs-align.parquet


In [9]:
dimensions

[<dimension "sub-geo": 21 levels>,
 <dimension "src-geo": 21 levels>,
 <dimension "gender": 4 levels>,
 <dimension "occ": 33 levels>,
 <dimension "alpha": 4 levels>,
 <dimension "age": 4 levels>,
 <dimension "pop": 4 levels>,
 <dimension "langs": 3 levels>]

### qrel join

We want a function to join alignments with qrels:

In [10]:
def qr_join(align):
    return qrels.join(align, on='page_id').set_index(['topic_id', 'page_id'])

### norm_dist

And a function to normalize to a distribution:

In [11]:
def norm_dist_df(mat):
    sums = mat.sum('columns')
    return mat.divide(sums, 'rows')
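
As a minimal sketch of what this normalization does, here is a plain-Python version that needs no pandas. The helper name `normalize_rows` and the sample values are illustrative assumptions, not part of the notebook. Unlike `norm_dist_df`, this sketch leaves an all-zero row unchanged rather than producing NaN.

```python
def normalize_rows(rows):
    """Scale each row so it sums to 1; all-zero rows are returned unchanged."""
    out = []
    for row in rows:
        total = sum(row)
        out.append([v / total for v in row] if total else list(row))
    return out


# hypothetical un-normalized target rows (one row per topic)
raw = [[2.0, 1.0, 1.0], [0.5, 0.0, 1.5]]
dist = normalize_rows(raw)
print(dist)
```

Each output row sums to 1, so it can be read as a distribution over the groups.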

## Work and Target Exposure

The first thing we need to do to prepare the metric is to compute the work-needed for each topic's pages, and use that to compute the target exposure for each (relevant) page in the topic.

This is because an ideal ranking orders relevant documents in decreasing order of work needed, followed by irrelevant documents.  All relevant documents at a given work level should receive the same expected exposure.

First, look up the work for each query page ('query page work', or qpw):

In [12]:
qpw = qrels.join(page_quality, on='page_id')
qpw

Unnamed: 0,topic_id,page_id,quality
0,187,682,B
1,187,954,C
2,187,1170,C
3,187,1315,B
4,187,1322,B
...,...,...,...
2737607,2872,69877511,Stub
2737608,2872,69878912,C
2737609,2872,69879322,Start
2737610,2872,69881345,Stub


And now use that to compute the number of documents at each work level:

In [13]:
qwork = qpw.groupby(['topic_id', 'quality'])['page_id'].count()
qwork

topic_id  quality
187       Stub       31076
          Start      20015
          C          11853
          B           4146
          GA          1479
                     ...  
2872      Start      21769
          C           9480
          B           2627
          GA           806
          FA            69
Name: page_id, Length: 300, dtype: int64

Now we need to convert this into target exposure levels.  This function will, given a series of counts for each work level, compute the expected exposure a page at that work level should receive.

In [14]:
def qw_tgt_exposure(qw_counts: pd.Series) -> pd.Series:
    if 'topic_id' == qw_counts.index.names[0]:
        qw_counts = qw_counts.reset_index(level='topic_id', drop=True)
    qwc = qw_counts.reindex(work_order, fill_value=0).astype('i4')
    tot = int(qwc.sum())
    da = metrics.discount(tot)
    qwp = qwc.shift(1, fill_value=0)
    qwc_s = qwc.cumsum()
    qwp_s = qwp.cumsum()
    res = pd.Series(
        [np.mean(da[s:e]) for (s, e) in zip(qwp_s, qwc_s)],
        index=qwc.index
    )
    return res
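
To make the slicing logic concrete, the following plain-Python sketch mirrors the function above on a tiny example. The discount `1 / log2(rank + 1)` is only an assumed stand-in for `metrics.discount`, whose actual definition is not shown in this notebook; `log_discount` and `target_exposure` are hypothetical names.

```python
import math


def log_discount(n):
    """Assumed stand-in for metrics.discount: 1 / log2(rank + 1) for ranks 1..n."""
    return [1.0 / math.log2(rank + 1) for rank in range(1, n + 1)]


def target_exposure(counts):
    """counts: number of pages per work level, in ideal ranking order."""
    disc = log_discount(sum(counts))
    result = []
    start = 0
    for c in counts:
        # pages at this level occupy ranks start..start+c in the ideal ranking
        block = disc[start:start + c]
        # an empty level has no ranks to average, so its exposure is undefined
        result.append(sum(block) / len(block) if block else float('nan'))
        start += c
    return result


# two pages at the first level, none at the second, one at the third
exposure = target_exposure([2, 0, 1])
print(exposure)
```

Every page at a level receives the mean discount of the rank positions that level occupies. A level with no pages yields NaN, which is harmless as long as no page is ever joined against that level.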

We'll then apply this to each topic, to determine the per-topic target exposures:

In [15]:
qw_pp_target = qwork.groupby('topic_id').apply(qw_tgt_exposure)
qw_pp_target.name = 'tgt_exposure'
qw_pp_target

  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)


topic_id  quality
187       Stub       0.075443
          Start      0.065321
          C          0.063307
          B          0.062546
          GA         0.062307
                       ...   
2872      Start      0.062570
          C          0.061352
          B          0.060958
          GA         0.060853
          FA         0.060827
Name: tgt_exposure, Length: 300, dtype: float32

We can now merge the relevant document work categories with this exposure, to compute the target exposure for each relevant document:

In [16]:
qp_exp = qpw.join(qw_pp_target, on=['topic_id', 'quality'])
qp_exp = qp_exp.set_index(['topic_id', 'page_id'])['tgt_exposure']
qp_exp

topic_id  page_id 
187       682         0.062546
          954         0.063307
          1170        0.063307
          1315        0.062546
          1322        0.062546
                        ...   
2872      69877511    0.071035
          69878912    0.061352
          69879322    0.062570
          69881345    0.071035
          69883661    0.062570
Name: tgt_exposure, Length: 2737612, dtype: float32

## Subject Geography

Subject geography targets the average of the relevant set alignments and the world population.

In [17]:
qr_sub_geo_align = qr_join(sub_geo_align)
qr_sub_geo_align

Unnamed: 0_level_0,Unnamed: 1_level_0,@UNKNOWN,Antarctica,Caribbean,Central America,Central Asia,Eastern Africa,Eastern Asia,Eastern Europe,Middle Africa,Northern Africa,...,Northern Europe,Oceania,South America,South-eastern Asia,Southern Africa,Southern Asia,Southern Europe,Western Africa,Western Asia,Western Europe
topic_id,page_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
187,682,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
187,954,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
187,1170,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
187,1315,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
187,1322,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2872,69877511,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2872,69878912,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2872,69879322,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2872,69881345,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


Compute a raw target, factoring in weights:

In [18]:
qr_sub_geo_tgt = qr_sub_geo_align.multiply(qp_exp, axis='rows').groupby('topic_id').sum()
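
The weighting step can be sketched in plain Python as follows. The topic IDs, page IDs, alignment vectors, and exposures are made-up illustrative values; the point is only that each page's alignment vector is scaled by its target exposure and then summed within its topic.

```python
from collections import defaultdict

# (topic_id, page_id) -> alignment over groups [unknown, A, B] (hypothetical data)
align = {
    (1, 10): [1.0, 0.0, 0.0],
    (1, 11): [0.0, 0.5, 0.5],
    (2, 20): [0.0, 1.0, 0.0],
}
# (topic_id, page_id) -> target exposure (hypothetical data)
exposure = {(1, 10): 0.4, (1, 11): 0.2, (2, 20): 0.3}

raw_target = defaultdict(lambda: [0.0, 0.0, 0.0])
for key, vec in align.items():
    topic = key[0]
    for g, a in enumerate(vec):
        # weight the page's alignment by its exposure, accumulate per topic
        raw_target[topic][g] += a * exposure[key]
raw_target = dict(raw_target)
print(raw_target)
```

The per-topic rows are exposure masses, not distributions; they still need to be normalized.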

And now we need to average the known-geo with the background.

In [19]:
qr_sub_geo_fk = qr_sub_geo_tgt.iloc[:, 1:].sum('columns')
qr_sub_geo_tgt.iloc[:, 1:] *= 0.5
qr_sub_geo_tgt.iloc[:, 1:] += qr_sub_geo_fk.apply(lambda k: world_pop * k * 0.5)
qr_sub_geo_tgt.head()

Unnamed: 0_level_0,@UNKNOWN,Antarctica,Caribbean,Central America,Central Asia,Eastern Africa,Eastern Asia,Eastern Europe,Middle Africa,Northern Africa,...,Northern Europe,Oceania,South America,South-eastern Asia,Southern Africa,Southern Asia,Southern Europe,Western Africa,Western Asia,Western Europe
topic_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
187,758.390795,0.000309,19.328449,59.318134,21.122305,109.070472,538.04105,160.059755,39.543253,70.002691,...,629.703684,93.268184,144.134466,206.460548,22.555736,554.706743,289.10963,97.05851,112.50682,279.935254
270,967.129024,0.000233,69.231601,59.217295,23.466741,147.703296,423.954414,216.488611,39.756533,68.587236,...,234.564466,82.918686,151.424207,154.497647,35.594573,402.112346,174.454966,106.516209,91.341129,224.075828
359,641.628435,0.00022,59.452325,50.316697,12.890808,74.364014,418.392361,59.464259,27.060083,41.801031,...,23.286746,19.606259,102.636117,124.679452,12.871306,348.961634,36.449902,66.136114,49.743391,42.552754
365,481.82171,0.000148,18.649567,31.101461,9.853304,53.778891,251.973232,88.15213,28.231344,36.627649,...,70.381866,38.938076,88.112552,128.244847,9.595809,242.239786,158.985835,58.330269,70.821535,78.130276
400,2137.392223,0.000465,39.636681,106.665337,28.905838,168.78729,825.397926,224.909498,61.032641,101.411871,...,623.047574,189.24003,250.638365,297.782968,44.41248,820.697054,224.451157,154.170802,152.89115,466.963698


These are **not** distributions; let's fix that!

In [20]:
qr_sub_geo_tgt = norm_dist_df(qr_sub_geo_tgt)
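
The background averaging used for this target can be illustrated on a single row with plain Python. The function name `average_with_background` and the numbers are assumptions for illustration. The unknown column is left untouched, and the known columns are averaged with the background scaled to the known mass, so the row total is preserved.

```python
def average_with_background(row, background):
    """row[0] is the unknown mass; row[1:] are the known groups."""
    known = row[1:]
    known_mass = sum(known)
    # half the observed known mass plus half the background spread over that mass
    blended = [0.5 * k + 0.5 * known_mass * b for k, b in zip(known, background)]
    return [row[0]] + blended


row = [4.0, 3.0, 1.0]      # unknown, group A, group B (hypothetical)
background = [0.5, 0.5]    # hypothetical background distribution over A and B
mixed = average_with_background(row, background)
print(mixed)
```

Because the background sums to 1, the known columns still carry the same total mass after averaging; only their split between groups moves toward the background.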

In [21]:
output.save_table(qr_sub_geo_tgt, f'task2-{DATA_MODE}-sub-geo-target', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\task2-eval-sub-geo-target.csv.gz
INFO:wptrec.save:data\metric-tables\task2-eval-sub-geo-target.csv.gz: 10.67 KiB
INFO:wptrec.save:saving Parquet to data\metric-tables\task2-eval-sub-geo-target.parquet
INFO:wptrec.save:data\metric-tables\task2-eval-sub-geo-target.parquet: 25.97 KiB


## Source Geography

Source geography works the same way.

In [22]:
qr_src_geo_align = qr_join(src_geo_align)
qr_src_geo_align

Unnamed: 0_level_0,Unnamed: 1_level_0,@UNKNOWN,Antarctica,Caribbean,Central America,Central Asia,Eastern Africa,Eastern Asia,Eastern Europe,Middle Africa,Northern Africa,...,Northern Europe,Oceania,South America,South-eastern Asia,Southern Africa,Southern Asia,Southern Europe,Western Africa,Western Asia,Western Europe
topic_id,page_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
187,682,0.400000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.150000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000,0.050000
187,954,0.257143,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.285714,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000,0.171429
187,1170,0.368421,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.052632,0.052632,0.0,0.0,0.0,0.0,0.000000,0.0,0.000,0.000000
187,1315,0.375000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.125,0.000000
187,1322,0.428571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.571429,0.0,0.000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2872,69877511,1.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000,0.000000
2872,69878912,0.366667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.1,0.0,0.0,0.000000,0.0,0.000,0.000000
2872,69879322,0.200000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.600,0.000000
2872,69881345,0.500000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.5,0.0,0.0,0.000000,0.0,0.000,0.000000


And now we repeat these computations!

In [23]:
qr_src_geo_tgt = qr_src_geo_align.multiply(qp_exp, axis='rows').groupby('topic_id').sum()

In [24]:
qr_src_geo_fk = qr_src_geo_tgt.iloc[:, 1:].sum('columns')
qr_src_geo_tgt.iloc[:, 1:] *= 0.5
qr_src_geo_tgt.iloc[:, 1:] += qr_src_geo_fk.apply(lambda k: world_pop * k * 0.5)
qr_src_geo_tgt.head()

Unnamed: 0_level_0,@UNKNOWN,Antarctica,Caribbean,Central America,Central Asia,Eastern Africa,Eastern Asia,Eastern Europe,Middle Africa,Northern Africa,...,Northern Europe,Oceania,South America,South-eastern Asia,Southern Africa,Southern Asia,Southern Europe,Western Africa,Western Asia,Western Europe
topic_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
187,1892.369467,0.000221,10.629244,38.085204,13.580445,76.133518,368.32416,91.512896,27.29295,43.321614,...,518.410807,53.513703,92.059446,137.914296,14.561724,383.65477,94.527498,67.295433,64.363892,158.997834
270,1682.383393,0.000177,14.119208,32.195247,10.694027,62.403539,291.840082,76.966103,21.889625,34.559267,...,175.35488,40.800385,81.370613,104.008024,13.928225,291.129055,74.411532,54.628367,45.613038,244.158227
359,1349.305462,0.000166,10.812019,28.257371,9.637102,55.966959,288.368964,44.738826,20.314799,31.459874,...,27.303292,12.455743,62.663487,93.284401,9.119594,262.22393,24.364869,49.612454,37.221849,34.950278
365,899.571884,0.000116,24.578317,20.16328,7.058418,40.687112,195.637555,43.383442,18.29035,22.941681,...,51.127647,29.052902,48.15703,91.720399,6.46741,185.621213,93.737043,35.679642,47.403614,70.055359
400,3510.441727,0.00212,20.067844,67.028356,21.829953,124.063329,603.794372,146.012407,44.342043,71.306778,...,564.339012,159.334304,158.261887,220.074822,27.370868,631.930727,129.760455,111.504376,103.379965,207.421067


Make sure the rows are distributions:

In [25]:
qr_src_geo_tgt = norm_dist_df(qr_src_geo_tgt)

In [26]:
output.save_table(qr_src_geo_tgt, f'task2-{DATA_MODE}-src-geo-target', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\task2-eval-src-geo-target.csv.gz
INFO:wptrec.save:data\metric-tables\task2-eval-src-geo-target.csv.gz: 10.62 KiB
INFO:wptrec.save:saving Parquet to data\metric-tables\task2-eval-src-geo-target.parquet
INFO:wptrec.save:data\metric-tables\task2-eval-src-geo-target.parquet: 25.97 KiB


## Gender

Now we're going to grab the gender alignments.  This works the same way, with `gender_tgt` as the background.

In [27]:
qr_gender_align = qr_join(gender_align)
qr_gender_align.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,@UNKNOWN,female,male,NB
topic_id,page_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
187,682,1.0,0.0,0.0,0.0
187,954,0.0,0.0,1.0,0.0
187,1170,1.0,0.0,0.0,0.0
187,1315,1.0,0.0,0.0,0.0
187,1322,1.0,0.0,0.0,0.0


In [28]:
qr_gender_tgt = qr_gender_align.multiply(qp_exp, axis='rows').groupby('topic_id').sum()

In [29]:
qr_gender_fk = qr_gender_tgt.iloc[:, 1:].sum('columns')
qr_gender_tgt.iloc[:, 1:] *= 0.5
qr_gender_tgt.iloc[:, 1:] += qr_gender_fk.apply(lambda k: gender_tgt * k * 0.5)
qr_gender_tgt.head()

Unnamed: 0_level_0,@UNKNOWN,female,male,NB
topic_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
187,4231.726279,159.708759,364.436851,2.704633
270,1461.677295,1029.567013,1476.707985,12.917147
359,1164.86894,601.468537,1714.967051,11.64038
365,1012.069178,445.784544,938.953553,6.958483
400,94.885554,3323.222661,4707.223206,42.888097


In [30]:
qr_gender_tgt = norm_dist_df(qr_gender_tgt)

In [31]:
output.save_table(qr_gender_tgt, f'task2-{DATA_MODE}-gender-target', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\task2-eval-gender-target.csv.gz
INFO:wptrec.save:data\metric-tables\task2-eval-gender-target.csv.gz: 2.24 KiB
INFO:wptrec.save:saving Parquet to data\metric-tables\task2-eval-gender-target.parquet
INFO:wptrec.save:data\metric-tables\task2-eval-gender-target.parquet: 6.90 KiB


## Occupation

Occupation is more straightforward, since we don't have a global target to average with.  We do need to drop unknown.

In [32]:
qr_occ_align = qr_join(occ_align).multiply(qp_exp, axis='rows')
qr_occ_tgt = qr_occ_align.iloc[:, 1:].groupby('topic_id').sum()
qr_occ_tgt = norm_dist_df(qr_occ_tgt)
qr_occ_tgt.head()

Unnamed: 0_level_0,activist,agricultural worker,artist,athlete,biologist,businessperson,chemist,civil servant,clergyperson,computer scientist,...,military personnel,musician,performing artist,physicist,politician,scientist,social scientist,sportsperson (non-athlete),transportation occupation,writer
topic_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
187,0.001742,0.000423,0.046779,0.003448,0.001719,0.025363,4.6e-05,0.001743,0.00076,0.000127,...,0.00302,0.001195,0.000999,0.000467,0.009501,0.010534,0.004196,0.000352,0.000268,0.01291
270,0.00022,0.000236,0.000887,0.963747,0.000225,0.001724,0.00019,0.001045,0.000168,1.5e-05,...,0.001331,0.000621,0.001608,3.5e-05,0.003671,0.000428,0.000431,0.013288,0.000434,0.001255
359,0.000308,7.3e-05,0.000855,0.913443,6.6e-05,0.007293,9.4e-05,0.000495,7.1e-05,0.0,...,0.002207,0.001371,0.004169,1.4e-05,0.002663,5.4e-05,6.9e-05,0.047321,8.5e-05,0.002104
365,0.000134,3e-05,0.000319,0.87482,3.9e-05,0.003116,8.9e-05,0.000365,0.000151,2.4e-05,...,0.001284,0.000451,0.002861,0.0,0.001718,0.0001,0.000131,0.106472,0.00016,0.001281
400,0.00446,0.000402,0.331925,0.003775,0.001594,0.020481,0.000277,0.002385,0.001815,0.000278,...,0.002132,0.011309,0.133634,0.000404,0.007652,0.003079,0.003482,0.0017,0.000531,0.262259


In [33]:
output.save_table(qr_occ_tgt, f'task2-{DATA_MODE}-occ-target', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\task2-eval-occ-target.csv.gz
INFO:wptrec.save:data\metric-tables\task2-eval-occ-target.csv.gz: 14.69 KiB
INFO:wptrec.save:saving Parquet to data\metric-tables\task2-eval-occ-target.parquet
INFO:wptrec.save:data\metric-tables\task2-eval-occ-target.parquet: 37.48 KiB


## Remaining Attributes

The remaining attributes don't need any further processing, as they are completely known.

In [34]:
qr_age_align = qr_join(age_align).multiply(qp_exp, axis='rows')
qr_age_tgt = norm_dist_df(qr_age_align.groupby('topic_id').sum())
output.save_table(qr_age_tgt, f'task2-{DATA_MODE}-age-target', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\task2-eval-age-target.csv.gz
INFO:wptrec.save:data\metric-tables\task2-eval-age-target.csv.gz: 1.20 KiB
INFO:wptrec.save:saving Parquet to data\metric-tables\task2-eval-age-target.parquet
INFO:wptrec.save:data\metric-tables\task2-eval-age-target.parquet: 5.24 KiB


In [35]:
qr_alpha_align = qr_join(alpha_align).multiply(qp_exp, axis='rows')
qr_alpha_tgt = norm_dist_df(qr_alpha_align.groupby('topic_id').sum())
output.save_table(qr_alpha_tgt, f'task2-{DATA_MODE}-alpha-target', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\task2-eval-alpha-target.csv.gz
INFO:wptrec.save:data\metric-tables\task2-eval-alpha-target.csv.gz: 1.16 KiB
INFO:wptrec.save:saving Parquet to data\metric-tables\task2-eval-alpha-target.parquet
INFO:wptrec.save:data\metric-tables\task2-eval-alpha-target.parquet: 5.00 KiB


In [36]:
qr_langs_align = qr_join(langs_align).multiply(qp_exp, axis='rows')
qr_langs_tgt = norm_dist_df(qr_langs_align.groupby('topic_id').sum())
output.save_table(qr_langs_tgt, f'task2-{DATA_MODE}-langs-target', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\task2-eval-langs-target.csv.gz
INFO:wptrec.save:data\metric-tables\task2-eval-langs-target.csv.gz: 978.00 iB
INFO:wptrec.save:saving Parquet to data\metric-tables\task2-eval-langs-target.parquet
INFO:wptrec.save:data\metric-tables\task2-eval-langs-target.parquet: 4.46 KiB


In [37]:
qr_pop_align = qr_join(pop_align).multiply(qp_exp, axis='rows')
qr_pop_tgt = norm_dist_df(qr_pop_align.groupby('topic_id').sum())
output.save_table(qr_pop_tgt, f'task2-{DATA_MODE}-pop-target', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\task2-eval-pop-target.csv.gz
INFO:wptrec.save:data\metric-tables\task2-eval-pop-target.csv.gz: 1.24 KiB
INFO:wptrec.save:saving Parquet to data\metric-tables\task2-eval-pop-target.parquet
INFO:wptrec.save:data\metric-tables\task2-eval-pop-target.parquet: 5.15 KiB


## Multidimensional Alignment

Now let's dive into the multidimensional alignment.  This is going to proceed a lot like the Task 1 alignment.

### Dimension Definitions

Let's define background distributions for some of our dimensions:

In [38]:
dim_backgrounds = {
    'sub-geo': world_pop,
    'src-geo': world_pop,
    'gender': gender_tgt,
}

Now we'll make a list of dimensions to treat with averaging:

In [39]:
DR = namedtuple('DimRec', ['name', 'align', 'background'], defaults=[None])
avg_dims = [
    DR(d.name, d.page_align_xr, xr.DataArray(dim_backgrounds[d.name], dims=[d.name]))
    for d in dimensions
    if d.name in dim_backgrounds
]
[d.name for d in avg_dims]

['sub-geo', 'src-geo', 'gender']

And a list of dimensions to use as-is:

In [40]:
raw_dims = [
    DR(d.name, d.page_align_xr)
    for d in dimensions
    if d.name not in dim_backgrounds
]
[d.name for d in raw_dims]

['occ', 'alpha', 'age', 'pop', 'langs']

Now: these dimensions are in the original order - `dimensions` has the averaged dimensions before the non-averaged ones. **This is critical for the rest of the code to work.**

### Data Subsetting

Also from Task 1.

In [41]:
avg_cases = list(product(*[[True, False] for d in avg_dims]))
avg_cases.pop()
avg_cases

[(True, True, True),
 (True, True, False),
 (True, False, True),
 (True, False, False),
 (False, True, True),
 (False, True, False),
 (False, False, True)]

In [42]:
def case_selector(case):
    def mksel(known):
        if known:
            # select all but 1st column
            return slice(1, None, None)
        else:
            # select 1st column
            return 0
    
    return tuple(mksel(k) for k in case)
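
As a small standalone sketch of how the cases and selectors fit together, the following uses two averaged dimensions and a nested list in place of an xarray array. The `select` helper and the `grid` data are hypothetical and exist only to show which cells each selector picks.

```python
from itertools import product


def case_selector(case):
    # known -> all but the first (unknown) level; unknown -> only the first level
    return tuple(slice(1, None) if known else 0 for known in case)


n_dims = 2
cases = list(product([True, False], repeat=n_dims))
cases.pop()  # drop the all-unknown case, which has no background to average


def select(grid, sel):
    """Apply a (row, column) selector to a nested list (hypothetical helper)."""
    row_sel, col_sel = sel
    rows = grid[row_sel] if isinstance(row_sel, slice) else [grid[row_sel]]
    return [r[col_sel] if isinstance(col_sel, slice) else [r[col_sel]] for r in rows]


# 3x3 grid whose first row and first column stand for the unknown level
grid = [[f'r{i}c{j}' for j in range(3)] for i in range(3)]
for case in cases:
    print(case, select(grid, case_selector(case)))
```

With the first dimension known and the second unknown, the selector keeps rows 1 onward and only column 0.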

### Background Averaging

We're now going to define our background-averaging function; this is reused from the Task 1 alignment code.

For each condition, we are going to proceed as follows:

1. Compute an appropriate intersectional background distribution (based on the dimensions that are "known")
2. Select the subset of the target matrix with this known status
3. Compute the sum of this subset
4. Re-normalize the subset to sum to 1
5. Compute a normalization table such that each coordinate in the distributions to correct sums to 1 (so multiplying this by the background distribution spreads the background across the other dimensions appropriately), and use this to spread the background distribution
6. Average with the spread background distribution
7. Re-normalize to preserve the original sum

Let's define the whole process as a function:

In [43]:
def avg_with_bg(tm, verbose=False):
    tm = tm.copy()
    
    tail_names = [d.name for d in raw_dims]
    
    # compute the tail mass for each coordinate (can be done once)
    tail_mass = tm.sum(tail_names)
    
    # now some things don't have any mass, but we still need to distribute background distributions.
    # solution: we impute the marginal tail distribution
    # first compute it
    tail_marg = tm.sum([d.name for d in avg_dims])
    # then impute that where we don't have mass
    tm_imputed = xr.where(tail_mass > 0, tm, tail_marg)
    # and re-compute the tail mass
    tail_mass = tm_imputed.sum(tail_names)
    # and finally we compute the rescaled matrix
    tail_scale = tm_imputed / tail_mass
    del tm_imputed
    
    for case in avg_cases:
        # for debugging: get names
        known_names = [d.name for (d, known) in zip(avg_dims, case) if known]
        if verbose:
            print('processing known:', known_names)
        
        # Step 1: background
        bg = reduce(operator.mul, [
            d.background
            for (d, known) in zip(avg_dims, case)
            if known
        ])
        if not np.allclose(bg.sum(), 1.0):
            warnings.warn('background distribution for {} sums to {}, expected 1'.format(known_names, bg.values.sum()))
        
        # Step 2: selector
        sel = case_selector(case)
        
        # Step 3: sum in preparation for normalization
        c_sum = tm[sel].sum()
        
        # Step 5: spread the background
        bg_spread = bg * tail_scale[sel] * c_sum
        if not np.allclose(bg_spread.sum(), c_sum):
            warnings.warn('rescaled background sums to {}, expected {}'.format(bg_spread.values.sum(), float(c_sum)))
        
        # Step 4 & 6: average with the background
        tm[sel] *= 0.5
        bg_spread *= 0.5
        tm[sel] += bg_spread
                        
        if not np.allclose(tm[sel].sum(), c_sum):
            warnings.warn('target distribution for {} sums to {}, expected {}'.format(known_names, tm[sel].values.sum(), c_sum))
    
    return tm
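
Step 1 of the function multiplies the backgrounds of the known dimensions together. The plain-Python sketch below shows the same idea with dictionaries instead of xarray arrays. The dimension names, level names, and probabilities are illustrative assumptions, and `joint_background` is a hypothetical helper.

```python
from itertools import product
from math import prod

# hypothetical per-dimension background distributions
backgrounds = {
    'geo': {'north': 0.25, 'south': 0.75},
    'gender': {'female': 0.5, 'male': 0.5},
}


def joint_background(known_dims):
    """Product of the marginal backgrounds for the dimensions that are known."""
    levels = [backgrounds[d].items() for d in known_dims]
    return {
        tuple(name for name, _ in combo): prod(p for _, p in combo)
        for combo in product(*levels)
    }


joint = joint_background(['geo', 'gender'])
print(joint)
```

Since each marginal sums to 1, so does their product, which is the property the first `np.allclose` check in `avg_with_bg` verifies.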

### Computing Targets

We're now ready to compute a multidimensional target. This works like Task 1, with the difference that we also propagate work needed into the targets; the input is a series whose *index* is page IDs and whose values are the target exposures derived from the work levels.

In [44]:
def query_xalign(pages):
    # compute targets to average
    avg_pages = reduce(operator.mul, [d.align.loc[pages.index] for d in avg_dims])
    raw_pages = reduce(operator.mul, [d.align.loc[pages.index] for d in raw_dims])
    
    # weight the left pages
    pages.index.name = 'page'
    qpw = xr.DataArray.from_series(pages)
    avg_pages = avg_pages * qpw

    # convert to query distribution
    tgt = sum_outer(avg_pages, raw_pages)
    tgt /= qpw.sum()

    # average with background distributions
    tgt = avg_with_bg(tgt)
    
    # and return the result
    return tgt
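
The core of this computation can be sketched without xarray. The sketch assumes that `sum_outer` sums, over pages, the outer product of each page's two alignment vectors; that reading is inferred from the name and is not confirmed by this notebook. `weighted_joint` and the sample pages are hypothetical.

```python
def weighted_joint(pages):
    """pages: list of (weight, a_vec, b_vec) triples, one per page.

    Returns the weight-normalized sum of per-page outer products a_vec x b_vec.
    """
    n_a = len(pages[0][1])
    n_b = len(pages[0][2])
    total = sum(w for w, _, _ in pages)
    joint = [[0.0] * n_b for _ in range(n_a)]
    for w, a, b in pages:
        for i, ai in enumerate(a):
            for j, bj in enumerate(b):
                joint[i][j] += w * ai * bj
    return [[v / total for v in row] for row in joint]


# hypothetical pages: (target exposure, alignment on dim A, alignment on dim B)
pages = [
    (0.6, [1.0, 0.0], [0.5, 0.5]),
    (0.4, [0.0, 1.0], [1.0, 0.0]),
]
tgt = weighted_joint(pages)
print(tgt)
```

When every page's alignment vectors each sum to 1, the resulting joint table also sums to 1, so it is a distribution over intersectional groups before any background averaging.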

### Applying Computations

Now let's run this thing - compute all the target distributions:

In [45]:
q_ids = qp_exp.index.levels[0].copy()
q_ids

Int64Index([ 187,  270,  359,  365,  400,  404,  480,  517,  568,  596,  715,
             807,  834,  881,  883,  949,  951,  955,  995, 1018, 1180, 1233,
            1328, 1406, 1417, 1448, 1449, 1479, 1499, 1548, 1558, 1647, 1685,
            1806, 1821, 1877, 1884, 1890, 2000, 2028, 2106, 2153, 2160, 2229,
            2244, 2448, 2483, 2758, 2867, 2872],
           dtype='int64', name='topic_id')

In [48]:
q_tgts = [query_xalign(qp_exp.loc[q]) for q in tqdm(q_ids)]


In [49]:
q_tgts = xr.concat(q_tgts, q_ids)
q_tgts

Save this to NetCDF (xarray's recommended format):

In [50]:
output.save_xarray(q_tgts, f'task2-{DATA_MODE}-int-targets')

INFO:wptrec.save:saving NetCDF to data\metric-tables\task2-eval-int-targets.nc


## Task 2B - Equity of Underexposure - NOT YET DONE

For 2022, we are using a different version of the metric. **Equity of Underexposure** looks at each page's underexposure (system exposure is less than target exposure), and looks for underexposure to be equitably distributed between groups.

On its own, this isn't too difficult; averaging with background distributions, however, gets rather subtle.  Background distributions are at the group level, but we need to propagate that into the page level, so we can compute the difference between system and target exposure at the page level, and then aggregate the underexposure within each group.

The idea of equity of underexposure is that we define $\epsilon = \operatorname{E}_\pi [\eta]$ (system exposure) and $\epsilon^* = \operatorname{E}_\tau [\eta]$ (target exposure).  We then compute $u = \min(\epsilon - \epsilon^*, 0)$, which is non-positive and nonzero only for underexposed pages, and aggregate it by group; if $A$ is our page alignment matrix and $\vec{u}$ is the vector of per-page underexposures, we compute the group underexposure by $A^T \vec{u}$.
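
A minimal plain-Python sketch of the group aggregation follows. It takes underexposure as the non-positive part of system exposure minus target exposure, which is an assumption about the sign convention; the alignment matrix and exposure values are made up for illustration.

```python
def group_underexposure(align, system_exp, target_exp):
    """Compute A^T u, where u_d = min(system_d - target_d, 0) for each page d."""
    u = [min(s - t, 0.0) for s, t in zip(system_exp, target_exp)]
    n_groups = len(align[0])
    return [
        sum(align[d][g] * u[d] for d in range(len(u)))
        for g in range(n_groups)
    ]


# rows are pages, columns are groups (hypothetical alignments)
align = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
system_exp = [0.5, 0.2, 0.3]
target_exp = [0.4, 0.4, 0.2]
group_u = group_underexposure(align, system_exp, target_exp)
print(group_u)
```

Only the second page is underexposed here, and its shortfall is split evenly between the two groups it is aligned with.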

That's the key idea.  However, we want to use an $\epsilon^\dagger$ that is the page-level equivalent of averaging group-aggregated $\epsilon^*$ with global target distributions $w_g$.  We can do this in a few stages.  First, we compute the total alignment of each group, and use that to compute the fraction of the group's global weight that should go to each unit of alignment:

\begin{align*}
s_g & = \sum_d a_{dg} \\
\hat{w}_g & = \frac{w_g}{s_g}
\end{align*}

We can then average:

\begin{align*}
\epsilon^\dagger_d & = \frac{1}{2}\left(\epsilon^*_d + \sum_g a_{dg} \hat{w}_g \epsilon^*_{\mathrm{total}} \right) \\
\end{align*}

This is all on a per-topic basis.
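
The two formulas above can be checked on a toy topic with plain Python. The function name `averaged_page_targets` and all input values are illustrative assumptions, and the sketch assumes every group has at least one aligned page so that $s_g > 0$.

```python
def averaged_page_targets(align, target_exp, global_w):
    """Average each page's target exposure with its share of the global target."""
    n_pages = len(align)
    n_groups = len(global_w)
    # s_g: total alignment mass of each group
    s = [sum(align[d][g] for d in range(n_pages)) for g in range(n_groups)]
    # w_hat_g: global weight per unit of alignment
    w_hat = [global_w[g] / s[g] for g in range(n_groups)]
    total = sum(target_exp)
    return [
        0.5 * (target_exp[d]
               + total * sum(align[d][g] * w_hat[g] for g in range(n_groups)))
        for d in range(n_pages)
    ]


align = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # pages x groups (hypothetical)
target_exp = [0.5, 0.3, 0.2]                   # epsilon* per page (hypothetical)
global_w = [0.5, 0.5]                          # global target over groups
eps_dagger = averaged_page_targets(align, target_exp, global_w)
print(eps_dagger)
```

When the global weights sum to 1 and every group has positive alignment mass, the averaged targets keep the same total exposure as the original targets.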

### Demo Topic

We're going to reuse demo topic data from before:

In [None]:
q_xa

Compute the total for each attribute:

In [None]:
s_xg = q_xa.sum(axis=0) + 1e-10
s_xg

Let's get some fractions out of that:

In [None]:
s_xgf = s_xg / s_xg.sum()
s_xgf

Now, let's make a copy, and start building up a world target matrix that properly accounts for missing values:

In [None]:
W = s_xgf.copy()

Now, let's put in the known intersectional targets:

In [None]:
W[1:, 1:] = int_tgt * W[1:, 1:].sum()

Now we need the known-gender / unknown-geo targets:

In [None]:
W[0, 1:] = int_tgt.sum(axis=0) * W[0, 1:].sum()

And the known-geo / unknown-gender targets:

In [None]:
W[1:, 0] = int_tgt.sum(axis=1) * W[1:, 0].sum()

Let's see what we have:

In [None]:
W

Now we normalize it by $s_g$:

In [None]:
Wh = W / s_xg
Wh

The massive values are only where we have no relevant items, so they'll never actually be used.

We can now compute the query-aligned target matrix.

In [None]:
qp_gt = (q_xa * (Wh * qp_exp[1].sum())).sum(axis=(1,2)).to_series()
qp_gt.index.name = 'page_id'
qp_gt

In [None]:
qp_exp[1]

In [None]:
qp_tgt = 0.5 * (qp_exp[1] + qp_gt)
qp_tgt

### Setting Up Matrix

Now that we have the math worked out, we can create actual global target frames for each query.

In [None]:
def topic_page_tgt(qdf):
    pages = qdf['page_id']
    pages = pages[pages.isin(page_xalign.indexes['page'])]
    q_xa = page_xalign.loc[pages.values, :, :]
    
    # now we need to get the exposure for the pages
    p_exp = qp_exp.loc[qdf.name]
    assert p_exp.index.is_unique
    
    # need our sums
    s_xg = q_xa.sum(axis=0) + 1e-10
    
    # set up the global target
    W = s_xg / s_xg.sum()
    W[1:, 1:] = int_tgt * W[1:, 1:].sum()
    W[0, 1:] = int_tgt.sum(axis=0) * W[0, 1:].sum()
    W[1:, 0] = int_tgt.sum(axis=1) * W[1:, 0].sum()
    
    # per-unit global weights, de-normalized by total exposure
    Wh = W / s_xg
    Wh *= p_exp.sum()
    
    # compute global target
    gtgt = q_xa * Wh
    gtgt = gtgt.sum(axis=(1,2)).to_series()
    
    # compute average target
    avg_tgt = 0.5 * (p_exp + gtgt)
    avg_tgt.index.name = 'page'
    
    return avg_tgt

Test it quickly:

In [None]:
topic_page_tgt(qdf)

And create our targets:

In [None]:
qp_tgt = qrels.groupby('id').progress_apply(topic_page_tgt)
qp_tgt

In [None]:
save_table(qp_tgt.to_frame('target'), 'task2-all-page-targets')

In [None]:
train_qptgt = qp_tgt.loc[train_topics['id']].to_frame('target')
eval_qptgt = qp_tgt.loc[eval_topics['id']].to_frame('target')

In [None]:
save_table(train_qptgt, 'task2-train-page-targets')
save_table(eval_qptgt, 'task2-eval-page-targets')