# Task 1 Alignment

This notebook computes the target distributions and retrieved page alignments for **Task 1**.
It depends on the output of the PageAlignments notebook.

This notebook can be run in two modes: 'train', to process the training topics, and 'eval' for the eval topics.

In [1]:
DATA_MODE = 'eval'

## Setup

We begin by loading necessary libraries:

In [2]:
import sys
import warnings
from collections import namedtuple
from functools import reduce
from itertools import product
import operator
from pathlib import Path

In [3]:
import pandas as pd
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import gzip
import json
from natural.size import binarysize
from natural.number import number

Set up progress bar and logging support:

In [4]:
from tqdm.auto import tqdm
tqdm.pandas(leave=False)

In [5]:
import sys, logging
logging.basicConfig(level=logging.INFO, stream=sys.stderr)
log = logging.getLogger('Task1Alignment')

And set up an output directory:

In [6]:
from wptrec.save import OutRepo
output = OutRepo('data/metric-tables')

## Data and Helpers

Most data loading is outsourced to `MetricInputs`.  First we save the data mode where metric inputs can find it:

In [7]:
import wptrec
wptrec.DATA_MODE = DATA_MODE

In [8]:
from MetricInputs import *

INFO:MetricInputs:reading data\metric-tables\page-sub-geo-align.parquet
INFO:MetricInputs:reading data\metric-tables\page-src-geo-align.parquet
INFO:MetricInputs:reading data\metric-tables\page-gender-align.parquet
INFO:MetricInputs:reading data\metric-tables\page-occ-align.parquet
INFO:MetricInputs:reading data\metric-tables\page-alpha-align.parquet
INFO:MetricInputs:reading data\metric-tables\page-age-align.parquet
INFO:MetricInputs:reading data\metric-tables\page-pop-align.parquet
INFO:MetricInputs:reading data\metric-tables\page-langs-align.parquet


In [9]:
dimensions

[<dimension "sub-geo": 21 levels>,
 <dimension "src-geo": 21 levels>,
 <dimension "gender": 4 levels>,
 <dimension "occ": 33 levels>,
 <dimension "alpha": 4 levels>,
 <dimension "age": 4 levels>,
 <dimension "pop": 4 levels>,
 <dimension "langs": 3 levels>]

### qrel join

We want a function to join alignments with qrels:

In [10]:
def qr_join(align):
    return qrels.join(align, on='page_id').set_index(['topic_id', 'page_id'])
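
As a minimal sketch of what this join produces, the example below uses toy stand-ins for `qrels` and an alignment frame. The values and the two-column layout are assumptions for illustration only; the real frames come from `MetricInputs`.

```python
import pandas as pd

# Toy stand-ins (assumed shapes): qrels has one row per relevant (topic, page)
# pair, and the alignment frame is indexed by page_id.
qrels = pd.DataFrame({'topic_id': [1, 1, 2], 'page_id': [10, 11, 10]})
align = pd.DataFrame(
    {'@UNKNOWN': [1.0, 0.0], 'Oceania': [0.0, 1.0]},
    index=pd.Index([10, 11], name='page_id'),
)

def qr_join(align):
    # match the page_id column of qrels against the index of the alignment frame
    return qrels.join(align, on='page_id').set_index(['topic_id', 'page_id'])

joined = qr_join(align)
print(joined)
```

Page 10 is relevant to both toy topics, so its alignment row appears once per topic in the result.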

### norm_dist

And a function to normalize to a distribution:

In [11]:
def norm_dist_df(mat):
    sums = mat.sum('columns')
    return mat.divide(sums, 'rows')
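
A small illustration of the normalization, restated with keyword arguments and applied to made-up counts (the frame and its values are hypothetical):

```python
import pandas as pd

def norm_dist_df(mat):
    # divide every row by its own sum so that each row becomes a distribution
    sums = mat.sum(axis='columns')
    return mat.divide(sums, axis='index')

# Hypothetical per-topic totals for two categories.
counts = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 2.0]}, index=[187, 270])
dist = norm_dist_df(counts)
print(dist)
```

Note that a row whose entries are all zero would come out as NaN, since it divides zero by zero.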

## Prep Overview

Now that we have our alignments and qrels, we are ready to prepare the Task 1 metrics.

We're first going to prepare the target distributions; then we will compute the alignments for the retrieved pages.

## Subject Geography

Subject geography targets the average of the relevant set alignments and the world population.

In [12]:
qr_sub_geo_align = qr_join(sub_geo_align)
qr_sub_geo_align

topic_id,page_id,@UNKNOWN,Antarctica,Caribbean,Central America,Central Asia,Eastern Africa,Eastern Asia,Eastern Europe,Middle Africa,Northern Africa,...,Northern Europe,Oceania,South America,South-eastern Asia,Southern Africa,Southern Asia,Southern Europe,Western Africa,Western Asia,Western Europe
187,682,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
187,954,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
187,1170,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
187,1315,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
187,1322,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2872,69877511,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2872,69878912,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2872,69879322,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2872,69881345,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


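For purely geographic fairness, we average the known (non-`@UNKNOWN`) portion of each topic's alignment with the world population, scaling the population by the known mass so that each row still sums to 1: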

In [13]:
qr_sub_geo_tgt = qr_sub_geo_align.groupby('topic_id').mean()
qr_sub_geo_fk = qr_sub_geo_tgt.iloc[:, 1:].sum('columns')
qr_sub_geo_tgt.iloc[:, 1:] *= 0.5
qr_sub_geo_tgt.iloc[:, 1:] += qr_sub_geo_fk.apply(lambda k: world_pop * k * 0.5)
qr_sub_geo_tgt.head()

topic_id,@UNKNOWN,Antarctica,Caribbean,Central America,Central Asia,Eastern Africa,Eastern Asia,Eastern Europe,Middle Africa,Northern Africa,...,Northern Europe,Oceania,South America,South-eastern Asia,Southern Africa,Southern Asia,Southern Europe,Western Africa,Western Asia,Western Europe
187,0.161757,6.47222e-08,0.004007,0.012384,0.004401,0.02283,0.112412,0.03344,0.008264,0.014711,...,0.133172,0.020594,0.030093,0.043274,0.004694,0.11635,0.059294,0.020306,0.023583,0.058312
270,0.242805,5.84644e-08,0.017378,0.014851,0.005852,0.037144,0.106411,0.053948,0.009914,0.017165,...,0.058914,0.020977,0.038029,0.03875,0.008852,0.101007,0.044103,0.026599,0.022927,0.055952
359,0.183666,6.30306e-08,0.017007,0.014391,0.003689,0.021289,0.118833,0.017016,0.007747,0.011968,...,0.006663,0.005588,0.029521,0.035681,0.003675,0.099904,0.010362,0.018935,0.014239,0.012154
365,0.20137,6.166361e-08,0.007572,0.012774,0.004079,0.022296,0.104172,0.03595,0.011613,0.015012,...,0.029218,0.016421,0.036189,0.053554,0.003956,0.100548,0.065794,0.024046,0.029213,0.031859
400,0.258172,5.727783e-08,0.004827,0.013104,0.003552,0.020758,0.101462,0.027533,0.007496,0.01244,...,0.076621,0.023341,0.030668,0.036634,0.005453,0.101073,0.027173,0.018965,0.018795,0.056502


Make sure the rows are distributions:

In [14]:
qr_sub_geo_tgt.sum('columns').describe()

count    5.000000e+01
mean     1.000000e+00
std      1.409697e-16
min      1.000000e+00
25%      1.000000e+00
50%      1.000000e+00
75%      1.000000e+00
max      1.000000e+00
dtype: float64

Everything is 1, we're good to go!
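
To see why the averaging keeps each row a distribution, here is a toy NumPy version of the same arithmetic. The three regions and all numbers are hypothetical; column 0 plays the role of `@UNKNOWN`.

```python
import numpy as np

# Mean alignment of a topic's relevant pages: [unknown, region A, region B, region C].
topic_mean = np.array([0.2, 0.6, 0.2, 0.0])
# Assumed background distribution over the known regions (sums to 1).
world_pop = np.array([0.5, 0.3, 0.2])

# Fraction of the topic's mass that has a known region.
known_mass = topic_mean[1:].sum()

target = topic_mean.copy()
# Half of the known mass keeps its observed shape; the other half follows the background.
target[1:] = 0.5 * topic_mean[1:] + 0.5 * known_mass * world_pop
print(target, target.sum())
```

The known columns still sum to `known_mass` after averaging, so adding the untouched unknown column gives 1.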

In [15]:
output.save_table(qr_sub_geo_tgt, f'task1-{DATA_MODE}-sub-geo-target', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\task1-eval-sub-geo-target.csv.gz
INFO:wptrec.save:data\metric-tables\task1-eval-sub-geo-target.csv.gz: 10.66 KiB
INFO:wptrec.save:saving Parquet to data\metric-tables\task1-eval-sub-geo-target.parquet
INFO:wptrec.save:data\metric-tables\task1-eval-sub-geo-target.parquet: 25.97 KiB


## Source Geography

Source geography works the same way.

In [16]:
qr_src_geo_align = qr_join(src_geo_align)
qr_src_geo_align

topic_id,page_id,@UNKNOWN,Antarctica,Caribbean,Central America,Central Asia,Eastern Africa,Eastern Asia,Eastern Europe,Middle Africa,Northern Africa,...,Northern Europe,Oceania,South America,South-eastern Asia,Southern Africa,Southern Asia,Southern Europe,Western Africa,Western Asia,Western Europe
187,682,0.400000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.150000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000,0.050000
187,954,0.257143,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.285714,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000,0.171429
187,1170,0.368421,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.052632,0.052632,0.0,0.0,0.0,0.0,0.000000,0.0,0.000,0.000000
187,1315,0.375000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.125,0.000000
187,1322,0.428571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.571429,0.0,0.000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2872,69877511,1.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000,0.000000
2872,69878912,0.366667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.1,0.0,0.0,0.000000,0.0,0.000,0.000000
2872,69879322,0.200000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.600,0.000000
2872,69881345,0.500000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.5,0.0,0.0,0.000000,0.0,0.000,0.000000


And repeat:

In [17]:
qr_src_geo_tgt = qr_src_geo_align.groupby('topic_id').mean()
qr_src_geo_fk = qr_src_geo_tgt.iloc[:, 1:].sum('columns')
qr_src_geo_tgt.iloc[:, 1:] *= 0.5
qr_src_geo_tgt.iloc[:, 1:] += qr_src_geo_fk.apply(lambda k: world_pop * k * 0.5)
qr_src_geo_tgt.head()

topic_id,@UNKNOWN,Antarctica,Caribbean,Central America,Central Asia,Eastern Africa,Eastern Asia,Eastern Europe,Middle Africa,Northern Africa,...,Northern Europe,Oceania,South America,South-eastern Asia,Southern Africa,Southern Asia,Southern Europe,Western Africa,Western Asia,Western Europe
187,0.391787,4.696121e-08,0.00225,0.00807,0.002876,0.016153,0.077938,0.019365,0.00579,0.009195,...,0.110871,0.011692,0.019483,0.02928,0.003079,0.081422,0.019888,0.014278,0.013633,0.033541
270,0.420047,4.477917e-08,0.003611,0.008171,0.002702,0.015759,0.073673,0.019488,0.005524,0.008721,...,0.044787,0.010542,0.020577,0.026281,0.003505,0.073534,0.018938,0.013802,0.011568,0.061875
359,0.372489,4.845126e-08,0.003072,0.00826,0.002821,0.016384,0.084042,0.013101,0.005947,0.009209,...,0.007908,0.003628,0.018333,0.027301,0.002669,0.076759,0.00712,0.014524,0.010901,0.010185
365,0.364985,4.903066e-08,0.010223,0.008492,0.002984,0.017147,0.082518,0.018251,0.007674,0.009672,...,0.021657,0.012542,0.020322,0.038885,0.00273,0.078353,0.03996,0.015051,0.020196,0.029345
400,0.422769,2.798744e-07,0.002478,0.008311,0.002702,0.015381,0.074893,0.018031,0.005497,0.008827,...,0.069702,0.019709,0.019562,0.027291,0.003381,0.078346,0.015888,0.013821,0.012813,0.025499


Make sure the rows are distributions:

In [18]:
qr_src_geo_tgt.sum('columns').describe()

count    5.000000e+01
mean     1.000000e+00
std      1.218255e-16
min      1.000000e+00
25%      1.000000e+00
50%      1.000000e+00
75%      1.000000e+00
max      1.000000e+00
dtype: float64

Everything is 1, we're good to go!

In [19]:
output.save_table(qr_src_geo_tgt, f'task1-{DATA_MODE}-src-geo-target', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\task1-eval-src-geo-target.csv.gz
INFO:wptrec.save:data\metric-tables\task1-eval-src-geo-target.csv.gz: 10.64 KiB
INFO:wptrec.save:saving Parquet to data\metric-tables\task1-eval-src-geo-target.parquet
INFO:wptrec.save:data\metric-tables\task1-eval-src-geo-target.parquet: 25.97 KiB


## Gender

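Now we're going to grab the gender alignments.  As with geography, the `@UNKNOWN` column is left out of the averaging: only the known columns are averaged with the background target (`gender_tgt`).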

In [20]:
qr_gender_align = qr_join(gender_align)
qr_gender_align.head()

topic_id,page_id,@UNKNOWN,female,male,NB
187,682,1.0,0.0,0.0,0.0
187,954,0.0,0.0,1.0,0.0
187,1170,1.0,0.0,0.0,0.0
187,1315,1.0,0.0,0.0,0.0
187,1322,1.0,0.0,0.0,0.0


In [21]:
qr_gender_tgt = qr_gender_align.groupby('topic_id').mean()
qr_gender_fk = qr_gender_tgt.iloc[:, 1:].sum('columns')
qr_gender_tgt.iloc[:, 1:] *= 0.5
qr_gender_tgt.iloc[:, 1:] += qr_gender_fk.apply(lambda k: gender_tgt * k * 0.5)
qr_gender_tgt.head()

topic_id,@UNKNOWN,female,male,NB
187,0.888195,0.03391,0.077336,0.000574
270,0.371833,0.257322,0.367774,0.003231
359,0.340156,0.170558,0.486007,0.003299
365,0.424643,0.183396,0.389116,0.002877
400,0.011697,0.408054,0.575302,0.005275


In [22]:
output.save_table(qr_gender_tgt, f'task1-{DATA_MODE}-gender-target', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\task1-eval-gender-target.csv.gz
INFO:wptrec.save:data\metric-tables\task1-eval-gender-target.csv.gz: 2.22 KiB
INFO:wptrec.save:saving Parquet to data\metric-tables\task1-eval-gender-target.parquet
INFO:wptrec.save:data\metric-tables\task1-eval-gender-target.parquet: 6.80 KiB


## Remaining Attributes

The remaining attributes aren't averaged with a background distribution, so the only processing they need is to sum the alignments per topic and normalize each row to a distribution.

In [23]:
qr_occ_align = qr_join(occ_align)
qr_occ_tgt = qr_occ_align.groupby('topic_id').sum()
qr_occ_tgt = norm_dist_df(qr_occ_tgt)
qr_occ_tgt.head()

topic_id,@UNKNOWN,activist,agricultural worker,artist,athlete,biologist,businessperson,chemist,civil servant,clergyperson,...,military personnel,musician,performing artist,physicist,politician,scientist,social scientist,sportsperson (non-athlete),transportation occupation,writer
187,0.891108,0.000192,4.9e-05,0.005105,0.000383,0.000193,0.002763,5e-06,0.000194,8.1e-05,...,0.000335,0.000128,0.00011,5.2e-05,0.001044,0.001168,0.000461,4e-05,3.1e-05,0.001421
270,0.379033,0.000143,0.000153,0.000569,0.597543,0.000145,0.001116,0.000123,0.000671,0.00011,...,0.000867,0.000404,0.001072,2.4e-05,0.002388,0.000277,0.000275,0.00855,0.000281,0.000811
359,0.355009,0.000216,4.8e-05,0.000564,0.587417,4.5e-05,0.004931,6.2e-05,0.000336,4.6e-05,...,0.001501,0.000922,0.002827,1e-05,0.001808,3.7e-05,4.5e-05,0.031237,5.9e-05,0.001414
365,0.427646,8.1e-05,1.6e-05,0.000186,0.499385,2.3e-05,0.001868,4.7e-05,0.000207,9.4e-05,...,0.000696,0.000274,0.001756,0.0,0.001031,6.3e-05,7e-05,0.061864,9.4e-05,0.000777
400,0.044346,0.004397,0.000387,0.316302,0.003669,0.00153,0.019926,0.000269,0.002284,0.001724,...,0.002074,0.010823,0.128105,0.000393,0.007384,0.003,0.003345,0.001635,0.00052,0.249432


In [24]:
output.save_table(qr_occ_tgt, f'task1-{DATA_MODE}-occ-target', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\task1-eval-occ-target.csv.gz
INFO:wptrec.save:data\metric-tables\task1-eval-occ-target.csv.gz: 14.99 KiB
INFO:wptrec.save:saving Parquet to data\metric-tables\task1-eval-occ-target.parquet
INFO:wptrec.save:data\metric-tables\task1-eval-occ-target.parquet: 38.59 KiB


In [25]:
qr_age_align = qr_join(age_align)
qr_age_tgt = norm_dist_df(qr_age_align.groupby('topic_id').sum())
output.save_table(qr_age_tgt, f'task1-{DATA_MODE}-age-target', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\task1-eval-age-target.csv.gz
INFO:wptrec.save:data\metric-tables\task1-eval-age-target.csv.gz: 2.13 KiB
INFO:wptrec.save:saving Parquet to data\metric-tables\task1-eval-age-target.parquet
INFO:wptrec.save:data\metric-tables\task1-eval-age-target.parquet: 6.23 KiB


In [26]:
qr_alpha_align = qr_join(alpha_align)
qr_alpha_tgt = norm_dist_df(qr_alpha_align.groupby('topic_id').sum())
output.save_table(qr_alpha_tgt, f'task1-{DATA_MODE}-alpha-target', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\task1-eval-alpha-target.csv.gz
INFO:wptrec.save:data\metric-tables\task1-eval-alpha-target.csv.gz: 2.11 KiB
INFO:wptrec.save:saving Parquet to data\metric-tables\task1-eval-alpha-target.parquet
INFO:wptrec.save:data\metric-tables\task1-eval-alpha-target.parquet: 5.99 KiB


In [27]:
qr_langs_align = qr_join(langs_align)
qr_langs_tgt = norm_dist_df(qr_langs_align.groupby('topic_id').sum())
output.save_table(qr_langs_tgt, f'task1-{DATA_MODE}-langs-target', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\task1-eval-langs-target.csv.gz
INFO:wptrec.save:data\metric-tables\task1-eval-langs-target.csv.gz: 1.67 KiB
INFO:wptrec.save:saving Parquet to data\metric-tables\task1-eval-langs-target.parquet
INFO:wptrec.save:data\metric-tables\task1-eval-langs-target.parquet: 5.20 KiB


In [28]:
qr_pop_align = qr_join(pop_align)
qr_pop_tgt = norm_dist_df(qr_pop_align.groupby('topic_id').sum())
output.save_table(qr_pop_tgt, f'task1-{DATA_MODE}-pop-target', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\task1-eval-pop-target.csv.gz
INFO:wptrec.save:data\metric-tables\task1-eval-pop-target.csv.gz: 2.17 KiB
INFO:wptrec.save:saving Parquet to data\metric-tables\task1-eval-pop-target.parquet
INFO:wptrec.save:data\metric-tables\task1-eval-pop-target.parquet: 6.14 KiB


## Multidimensional Alignment

Now, we need to set up the *multidimensional* alignment.  The basic version is just to multiply the targets, but that doesn't include the target averaging we want to do for geographic and gender targets.

Doing that averaging further requires us to very carefully handle the unknown cases.

We are going to proceed in three steps:

1. Define the averaged dimensions (with their background targets) and the un-averaged dimensions
2. Demonstrate the logic by working through the alignment computations for a single topic
3. Apply step (2) to all topics

### Dimension Definitions

Let's define background distributions for some of our dimensions:

In [29]:
dim_backgrounds = {
    'sub-geo': world_pop,
    'src-geo': world_pop,
    'gender': gender_tgt,
}

Now we'll make a list of dimensions to treat with averaging:

In [30]:
DR = namedtuple('DimRec', ['name', 'align', 'background'], defaults=[None])
avg_dims = [
    DR(d.name, d.page_align_xr, xr.DataArray(dim_backgrounds[d.name], dims=[d.name]))
    for d in dimensions
    if d.name in dim_backgrounds
]
[d.name for d in avg_dims]

['sub-geo', 'src-geo', 'gender']

And a list of dimensions to use as-is:

In [31]:
raw_dims = [
    DR(d.name, d.page_align_xr)
    for d in dimensions
    if d.name not in dim_backgrounds
]
[d.name for d in raw_dims]

['occ', 'alpha', 'age', 'pop', 'langs']

Now: these dimensions are in the original order - `dimensions` has the averaged dimensions before the non-averaged ones. **This is critical for the rest of the code to work.**

### Demo

To demonstrate how the logic works, let's first work it out in cells for one query (the first topic in `qrels`).

What are its documents?

In [32]:
qno = qrels['topic_id'].iloc[0]
qdf = qrels[qrels['topic_id'] == qno]
qdf.name = qno
qdf

,topic_id,page_id
0,187,682
1,187,954
2,187,1170
3,187,1315
4,187,1322
...,...,...
68641,187,69882575
68642,187,69890514
68643,187,69891122
68644,187,69891390


We can use these page IDs to get its alignments.

In [33]:
q_pages = qdf['page_id'].values

#### Accumulating Initial Targets

We're now going to grab the dimensions that have targets, and create a single xarray with all of them:

In [34]:
q_xta = reduce(operator.mul, [d.align.loc[q_pages] for d in avg_dims])
q_xta

We can similarly do this for the dimensions without targets:

In [35]:
q_raw_xta = reduce(operator.mul, [d.align.loc[q_pages] for d in raw_dims])
q_raw_xta
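
Multiplying per-page alignment arrays that share only the page dimension yields, for each page, the outer product of its alignment vectors. The sketch below shows the same idea in plain NumPy, where the broadcasting that named dimensions give for free has to be spelled out by hand. The two small matrices are made-up examples.

```python
import numpy as np

# Two toy per-page alignment matrices (rows are pages); each row is a distribution.
geo = np.array([[1.0, 0.0],
                [0.5, 0.5]])                  # page x geo
gender = np.array([[0.0, 1.0, 0.0],
                   [0.2, 0.3, 0.5]])          # page x gender

# Per-page outer product: joint[p, i, j] = geo[p, i] * gender[p, j].
joint = geo[:, :, None] * gender[:, None, :]
print(joint.shape)
```

Because every input row sums to 1, every page's joint slice also sums to 1.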

Now, we need to combine this with the other matrix to produce a complete alignment matrix, which we then will collapse into a query target matrix.  However, we don't have enough memory to do the whole thing in one go, so we will do it page by page.

The `mean_outer` function does this:

In [36]:
from wptrec.dimension import mean_outer

In [37]:
q_tam = mean_outer(q_xta, q_raw_xta)
q_tam
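
The idea behind a page-by-page mean of outer products can be sketched as follows. `mean_outer_sketch` is a hypothetical NumPy stand-in written for illustration; it is not the `wptrec.dimension.mean_outer` implementation, whose details are not shown in this notebook.

```python
import numpy as np

def mean_outer_sketch(left, right):
    """Average, over pages, of the outer product of each page's two alignment rows.

    Hypothetical illustration only; not the wptrec implementation.
    """
    n_pages = left.shape[0]
    total = np.zeros(left.shape[1:] + right.shape[1:])
    for p in range(n_pages):
        # handling one page at a time keeps peak memory at a single joint tensor
        total += np.multiply.outer(left[p], right[p])
    return total / n_pages

# Toy inputs: two pages, two levels in each block of dimensions.
left = np.array([[1.0, 0.0], [0.5, 0.5]])
right = np.array([[0.0, 1.0], [1.0, 0.0]])
result = mean_outer_sketch(left, right)
print(result, result.sum())
```

Accumulating a running sum avoids materializing the full page-by-everything tensor, which is the memory concern described above.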

In [38]:
q_tam

In [39]:
q_tam.sum()

In 2021, we ignored fully-unknown for Task 1. However, it isn't clear how to properly do that with some attributes that are never fully unknown - they still need to be counted. Therefore, we consistently treat fully-unknown as a distinct category for both Task 1 and Task 2 metrics.

#### Data Subsetting

Before we average, we need to be able to select data by its known/unknown status.

Let's start by making a list of cases - the known/unknown status of each dimension.

In [40]:
avg_cases = list(product(*[[True, False] for d in avg_dims]))
avg_cases

[(True, True, True),
 (True, True, False),
 (True, False, True),
 (True, False, False),
 (False, True, True),
 (False, True, False),
 (False, False, True),
 (False, False, False)]

The last entry is the all-unknown case - remove it:

In [41]:
avg_cases.pop()
avg_cases

[(True, True, True),
 (True, True, False),
 (True, False, True),
 (True, False, False),
 (False, True, True),
 (False, True, False),
 (False, False, True)]

We now want the ability to create an indexer to look up the subset of the alignment frame corresponding to a case. Let's write that function:

In [42]:
def case_selector(case):
    def mksel(known):
        if known:
            # select all but 1st column
            return slice(1, None, None)
        else:
            # select 1st column
            return 0
    
    return tuple(mksel(k) for k in case)

Let's quickly test this function:

In [43]:
case_selector(avg_cases[0])

(slice(1, None, None), slice(1, None, None), slice(1, None, None))

In [44]:
case_selector(avg_cases[-1])

(0, 0, slice(1, None, None))

And make sure we can use it:

In [45]:
q_tam[case_selector(avg_cases[1])]

Fantastic! Given a case (known and unknown statuses), we can select the subset of the target matrix with exactly those.
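
To make the selector concrete, here it is applied to a small NumPy array. The 3x3 array is a made-up stand-in with two averaged dimensions, where level 0 of each is the unknown level; the selector is restated compactly with the same behavior.

```python
import numpy as np

def case_selector(case):
    # known -> every level except the first; unknown -> the first level only
    return tuple(slice(1, None) if known else 0 for known in case)

# Toy target: two averaged dimensions with 3 levels each (level 0 = unknown).
tam = np.arange(9, dtype=float).reshape(3, 3)

both_known = tam[case_selector((True, True))]    # known x known block
first_only = tam[case_selector((True, False))]   # first known, second unknown
print(both_known)
print(first_only)
```

Any trailing dimensions not named in the selector are kept whole, which is how the un-averaged dimensions ride along with each case.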

#### Averaging

Ok, now we have to - very carefully - average with our target modifier.  For each dimension that is not fully-unknown, we average with the intersectional target defined over the known dimensions.

At all times, we also need to respect the fraction of the total it represents.

We'll use the selection capabilities above to handle this.

First, let's make sure that our target matrix sums to 1 to start with:

In [46]:
q_tam.sum()

Fantastic.  This means that if we sum up a subset of the data, it will give us the fraction of the distribution that has that combination of known/unknown status.

For each condition, we are going to proceed as follows:

1. Compute an appropriate intersectional background distribution (based on the dimensions that are "known")
2. Select the subset of the target matrix with this known status
3. Compute the sum of this subset
4. Re-normalize the subset to sum to 1
5. Compute a normalization table in which, for each coordinate of the averaged dimensions, the values across the remaining dimensions sum to 1 (so multiplying it by the background distribution spreads the background across the other dimensions appropriately), and use this to spread the background distribution
6. Average with the spread background distribution
7. Re-normalize to preserve the original sum
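
The steps above can be sketched on a tiny example with one averaged dimension (level 0 is unknown) and one un-averaged "tail" dimension, so there is only a single case to process. All numbers are hypothetical, and the code is a NumPy simplification of the idea rather than the function defined below.

```python
import numpy as np

# Target matrix: rows = averaged dimension (row 0 unknown), columns = tail dimension.
tm = np.array([[0.1, 0.3],
               [0.4, 0.2],
               [0.0, 0.0]])
bg = np.array([0.25, 0.75])          # assumed background over the two known levels

# Tail distribution per row; rows with no mass borrow the marginal tail distribution.
tail_mass = tm.sum(axis=1, keepdims=True)
tail_marg = tm.sum(axis=0, keepdims=True)
imputed = np.where(tail_mass > 0, tm, tail_marg)
tail_scale = imputed / imputed.sum(axis=1, keepdims=True)

known = slice(1, None)               # select the known rows
c_sum = tm[known].sum()              # mass of this known/unknown case

# Spread the background across the tail dimension, scaled to the case's mass.
bg_spread = bg[:, None] * tail_scale[known] * c_sum

out = tm.copy()
out[known] = 0.5 * tm[known] + 0.5 * bg_spread   # average, preserving c_sum
print(out, out.sum())
```

The third row has no observed mass, yet it still receives its share of the background, distributed according to the marginal tail distribution.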

Let's define the whole process as a function:

In [47]:
def avg_with_bg(tm, verbose=False):
    tm = tm.copy()
    
    tail_names = [d.name for d in raw_dims]
    
    # compute the tail mass for each coordinate (can be done once)
    tail_mass = tm.sum(tail_names)
    
    # now some things don't have any mass, but we still need to distribute background distributions.
    # solution: we impute the marginal tail distribution
    # first compute it
    tail_marg = tm.sum([d.name for d in avg_dims])
    # then impute that where we don't have mass
    tm_imputed = xr.where(tail_mass > 0, tm, tail_marg)
    # and re-compute the tail mass
    tail_mass = tm_imputed.sum(tail_names)
    # and finally we compute the rescaled matrix
    tail_scale = tm_imputed / tail_mass
    del tm_imputed
    
    for case in avg_cases:
        # for debugging: get names
        known_names = [d.name for (d, known) in zip(avg_dims, case) if known]
        if verbose:
            print('processing known:', known_names)
        
        # Step 1: background
        bg = reduce(operator.mul, [
            d.background
            for (d, known) in zip(avg_dims, case)
            if known
        ])
        if not np.allclose(bg.sum(), 1.0):
            warnings.warn('background distribution for {} sums to {}, expected 1'.format(known_names, bg.values.sum()))
        
        # Step 2: selector
        sel = case_selector(case)
        
        # Step 3: sum in preparation for normalization
        c_sum = tm[sel].sum()
        
        # Step 5: spread the background
        bg_spread = bg * tail_scale[sel] * c_sum
        if not np.allclose(bg_spread.sum(), c_sum):
            warnings.warn('rescaled background sums to {}, expected {}'.format(bg_spread.values.sum(), c_sum))
        
        # Step 4 & 6: average with the background
        tm[sel] *= 0.5
        bg_spread *= 0.5
        tm[sel] += bg_spread
                        
        if not np.allclose(tm[sel].sum(), c_sum):
            warnings.warn('target distribution for {} sums to {}, expected {}'.format(known_names, tm[sel].values.sum(), c_sum))
    
    return tm

And apply it:

In [48]:
q_target = avg_with_bg(q_tam, True)
q_target.sum()

processing known: ['sub-geo', 'src-geo', 'gender']
processing known: ['sub-geo', 'src-geo']
processing known: ['sub-geo', 'gender']
processing known: ['sub-geo']
processing known: ['src-geo', 'gender']
processing known: ['src-geo']
processing known: ['gender']


In [49]:
q_target

In [50]:
print(number(q_target.values.size), 'values taking', binarysize(q_target.nbytes))

11,176,704 values taking 89.41 MiB


Is it still a distribution?

In [51]:
q_target.sum()

We can unravel this value into a single-dimensional array representing the multidimensional target:

In [52]:
q_target.values.ravel()

array([3.90778732e-05, 9.13512756e-04, 0.00000000e+00, ...,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00])
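
For reference, `ravel()` flattens in row-major (C) order by default, so the last dimension varies fastest. The toy array below (a hypothetical 2x3x4 shape) shows how a flat position maps back to coordinates:

```python
import numpy as np

toy = np.arange(24).reshape(2, 3, 4)   # hypothetical 3-dimensional target
flat = toy.ravel()                     # row-major: last axis varies fastest

# Map a flat position back to its coordinates in the original shape.
coords = np.unravel_index(17, toy.shape)
print(coords, toy[coords], flat[17])
```

This matters when comparing the flattened target against other flattened arrays: both must use the same dimension order.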

Now we have all the pieces to compute this for each of our queries.

### Implementing Function

To perform this combination for every query, we'll use a function that takes a data frame for a query's relevant docs and performs all of the above operations:

In [53]:
def query_xalign(pages):
    # compute targets to average
    avg_pages = reduce(operator.mul, [d.align.loc[pages] for d in avg_dims])
    raw_pages = reduce(operator.mul, [d.align.loc[pages] for d in raw_dims])

    # convert to query distribution
    tgt = mean_outer(avg_pages, raw_pages)

    # average with background distributions
    tgt = avg_with_bg(tgt)
    
    # and return the result
    return tgt

Make sure it works:

In [54]:
query_xalign(qdf.page_id.values)

### Computing Query Targets

Now with that function, we can compute the alignment vector for each query.  Extract queries into a dictionary:

In [55]:
queries = {
    t: df['page_id'].values
    for (t, df) in qrels.groupby('topic_id')
}

Make an index that we'll need later for setting up the XArray dimension:

In [56]:
q_ids = pd.Index(queries.keys(), name='topic_id')
q_ids

Int64Index([ 187,  270,  359,  365,  400,  404,  480,  517,  568,  596,  715,
             807,  834,  881,  883,  949,  951,  955,  995, 1018, 1180, 1233,
            1328, 1406, 1417, 1448, 1449, 1479, 1499, 1548, 1558, 1647, 1685,
            1806, 1821, 1877, 1884, 1890, 2000, 2028, 2106, 2153, 2160, 2229,
            2244, 2448, 2483, 2758, 2867, 2872],
           dtype='int64', name='topic_id')

Now let's create targets for each of these:

In [57]:
q_tgts = [query_xalign(queries[q]) for q in tqdm(q_ids)]

Assemble a composite xarray:

In [58]:
q_tgts = xr.concat(q_tgts, q_ids)
q_tgts

Save this to NetCDF (xarray's recommended format):

In [59]:
output.save_xarray(q_tgts, f'task1-{DATA_MODE}-int-targets')

INFO:wptrec.save:saving NetCDF to data\metric-tables\task1-eval-int-targets.nc
