# Alignments

This notebook analyzes page alignments and prepares metrics for final use.

## Setup

We begin by loading necessary libraries:

In [1]:
from pathlib import Path
import pandas as pd
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import gzip
import pickle
import binpickle
from natural.size import binarysize

In [2]:
codec = binpickle.codecs.Blosc('zstd')

Set up progress bar and logging support:

In [3]:
from tqdm.auto import tqdm
tqdm.pandas(leave=False)

In [4]:
import sys, logging
logging.basicConfig(level=logging.INFO, stream=sys.stderr)
log = logging.getLogger('alignment')

Import metric code:

In [5]:
%load_ext autoreload
%autoreload 1

In [6]:
%aimport metrics
from trecdata import scan_runs

## Loading Data

We first load the page metadata:

In [7]:
pages = pd.read_json('data/trec_metadata_eval.json.gz', lines=True)
pages = pages.drop_duplicates('page_id')
pages.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6023415 entries, 0 to 6023435
Data columns (total 5 columns):
 #   Column                Dtype  
---  ------                -----  
 0   page_id               int64  
 1   quality_score         float64
 2   quality_score_disc    object 
 3   geographic_locations  object 
 4   gender                object 
dtypes: float64(1), int64(1), object(3)
memory usage: 275.7+ MB


Now we will load the evaluation topics:

In [8]:
eval_topics = pd.read_json('data/eval-topics-with-qrels.json.gz', lines=True)
eval_topics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49 entries, 0 to 48
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   id             49 non-null     int64 
 1   title          49 non-null     object
 2   rel_docs       49 non-null     object
 3   assessed_docs  49 non-null     object
 4   max_tier       49 non-null     int64 
dtypes: int64(2), object(3)
memory usage: 2.0+ KB


In [9]:
train_topics = pd.read_json('data/trec_topics.json.gz', lines=True)
train_topics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57 entries, 0 to 56
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        57 non-null     int64 
 1   title     57 non-null     object
 2   keywords  57 non-null     object
 3   scope     57 non-null     object
 4   homepage  57 non-null     object
 5   rel_docs  57 non-null     object
dtypes: int64(1), object(5)
memory usage: 2.8+ KB


Train and eval topics use a disjoint set of IDs:

In [10]:
train_topics['id'].describe()

count    57.000000
mean     29.000000
std      16.598193
min       1.000000
25%      15.000000
50%      29.000000
75%      43.000000
max      57.000000
Name: id, dtype: float64

In [11]:
eval_topics['id'].describe()

count     49.000000
mean     125.346939
std       14.687794
min      101.000000
25%      113.000000
50%      125.000000
75%      138.000000
max      150.000000
Name: id, dtype: float64

This allows us to create a single, integrated topics list for convenience:

In [12]:
topics = pd.concat([train_topics, eval_topics], ignore_index=True)
topics['eval'] = False
topics.loc[topics['id'] >= 100, 'eval'] = True
topics.head()

Unnamed: 0,id,title,keywords,scope,homepage,rel_docs,assessed_docs,max_tier,eval
0,1,Agriculture,"[agriculture, crops, livestock, forests, farming]",This WikiProject strives to develop and improv...,https://en.wikipedia.org/wiki/Wikipedia:WikiPr...,"[572, 627, 903, 1193, 1542, 1634, 3751, 3866, ...",,,False
1,2,Architecture,"[architecture, skyscraper, landscape, building...",This WikiProject aims to: 1. Thoroughly explor...,https://en.wikipedia.org/wiki/Wikipedia:WikiPr...,"[682, 954, 1170, 1315, 1322, 1324, 1325, 1435,...",,,False
2,3,Athletics,"[athletics, player, sports, game, gymnastics]","WikiProject Athletics, a project focused on im...",https://en.wikipedia.org/wiki/Wikipedia:WikiPr...,"[5729, 8490, 9623, 10391, 12231, 13791, 16078,...",,,False
3,4,Aviation,"[aviation, aircraft, airplane, airship, pilot,...",The project generally considers any article re...,https://en.wikipedia.org/wiki/Wikipedia:WikiPr...,"[849, 852, 1293, 1902, 1942, 2039, 2075, 2082,...",,,False
4,5,Baseball,[baseball],Articles pertaining to baseball including base...,https://en.wikipedia.org/wiki/Wikipedia:WikiPr...,"[1135, 1136, 1293, 1893, 2129, 2140, 3797, 380...",,,False


Finally, a bit of hard-coded data - each region's share of the world population:

In [13]:
world_pop = pd.Series({
    'Africa': 0.155070563,
    'Antarctica': 1.54424E-07,
    'Asia': 0.600202585,
    'Europe': 0.103663858,
    'Latin America and the Caribbean': 0.08609797,
    'Northern America': 0.049616733,
    'Oceania': 0.005348137,
})
world_pop.name = 'geography'

And a gender global target:

In [14]:
gender_tgt = pd.Series({
    'female': 0.495,
    'male': 0.495,
    'third': 0.01
})
gender_tgt.name = 'gender'
gender_tgt.sum()

1.0

Xarray intersectional global target:

In [15]:
geo_tgt_xa = xr.DataArray(world_pop, dims=['geography'])
gender_tgt_xa = xr.DataArray(gender_tgt, dims=['gender'])
int_tgt = geo_tgt_xa * gender_tgt_xa
int_tgt
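Because the two `DataArray`s have different dimension names, xarray broadcasts their product into a geography-by-gender grid, which is an outer product of the two marginal targets. A minimal NumPy sketch of the same arithmetic, using made-up toy targets rather than the real ones:

```python
import numpy as np

# Toy marginal targets (illustrative values, not the notebook's data).
geo = np.array([0.25, 0.75])              # two hypothetical regions
gender = np.array([0.495, 0.495, 0.01])   # three gender categories

# Outer product: cell (i, j) = geo[i] * gender[j]
joint = np.outer(geo, gender)

print(joint.shape)  # (2, 3)
print(joint.sum())  # close to 1.0, since both marginals sum to 1
```

Summing the grid over the gender axis recovers the geography target, so the intersectional target stays consistent with both marginals.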

And the order of work-needed codes:

In [16]:
work_order = [
    'Stub',
    'Start',
    'C',
    'B',
    'GA',
    'FA',
]

## Query Relevance

We now need to get the qrels for the topics.  This is done by creating frames with entries for every relevant document; missing documents are assumed irrelevant (0).

First the training topics:

In [17]:
train_qrels = train_topics[['id', 'rel_docs']].explode('rel_docs', ignore_index=True)
train_qrels.rename(columns={'rel_docs': 'page_id'}, inplace=True)
train_qrels['page_id'] = train_qrels['page_id'].astype('i4')
train_qrels = train_qrels.drop_duplicates()
train_qrels.head()

Unnamed: 0,id,page_id
0,1,572
1,1,627
2,1,903
3,1,1193
4,1,1542


In [18]:
eval_qrels = eval_topics[['id', 'rel_docs']].explode('rel_docs', ignore_index=True)
eval_qrels.rename(columns={'rel_docs': 'page_id'}, inplace=True)
eval_qrels['page_id'] = eval_qrels['page_id'].astype('i4')
eval_qrels = eval_qrels.drop_duplicates()
eval_qrels.head()

Unnamed: 0,id,page_id
0,101,915
1,101,2948
2,101,9110
3,101,9742
4,101,10996


And concatenate:

In [19]:
qrels = pd.concat([train_qrels, eval_qrels], ignore_index=True)
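As a small illustration of the reshaping used above, here is the same explode, rename, and deduplicate sequence on a hypothetical two-topic frame (the topic and page IDs are invented):

```python
import pandas as pd

# Hypothetical topics frame: each topic carries a list of relevant page IDs.
toy_topics = pd.DataFrame({
    'id': [1, 2],
    'rel_docs': [[10, 11, 11], [20]],
})

# One row per (topic, page) pair; repeated pages within a topic are dropped.
toy_qrels = (
    toy_topics.explode('rel_docs', ignore_index=True)
    .rename(columns={'rel_docs': 'page_id'})
    .astype({'page_id': 'i4'})
    .drop_duplicates()
)
print(toy_qrels)
```

Any (topic, page) pair absent from this long frame is treated as irrelevant, so no explicit zero rows are needed.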

## Page Alignments

All of our metrics require page "alignments": the protected-group membership of each page.

### Geography

Let's start with the straight page geography alignment for the public evaluation of the training queries.  The page metadata has that; let's get the geography column.

In [20]:
page_geo = pages[['page_id', 'geographic_locations']].explode('geographic_locations', ignore_index=True)
page_geo.head()

Unnamed: 0,page_id,geographic_locations
0,12,
1,25,
2,39,
3,290,
4,303,Northern America


And we will now pivot this into a matrix so we get page alignment vectors:

In [21]:
page_geo_align = page_geo.assign(x=1).pivot(index='page_id', columns='geographic_locations', values='x')
page_geo_align.rename(columns={np.nan: 'Unknown'}, inplace=True)
page_geo_align.fillna(0, inplace=True)
page_geo_align.head()

page_id,Unknown,Africa,Antarctica,Asia,Europe,Latin America and the Caribbean,Northern America,Oceania
12,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
39,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
290,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
303,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
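To make the pivot concrete, here is a sketch on hypothetical data. Unlike the cell above, it fills the missing location with `'Unknown'` before pivoting, which sidesteps renaming a NaN column label; that ordering is a choice made for this example only.

```python
import pandas as pd

# Hypothetical long-format data: page 2 is tagged with two regions,
# page 3 has no known location.
toy_geo = pd.DataFrame({
    'page_id': [1, 2, 2, 3],
    'geographic_locations': ['Asia', 'Asia', 'Europe', None],
})
toy_geo['geographic_locations'] = toy_geo['geographic_locations'].fillna('Unknown')

# One row per page, one 0/1 column per location.
toy_align = (
    toy_geo.assign(x=1)
    .pivot(index='page_id', columns='geographic_locations', values='x')
    .fillna(0)
)
print(toy_align)
```

Note that a page tagged with several regions gets a 1 in each of them, so rows of the alignment matrix need not sum to 1.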


And convert this to an xarray for multidimensional usage:

In [22]:
page_geo_xr = xr.DataArray(page_geo_align, dims=['page', 'geography'])
page_geo_xr

In [23]:
binarysize(page_geo_xr.nbytes)

'385.50 MiB'

### Gender

The "undisclosed personal attribute" is gender.  Not all articles have gender as a relevant variable - articles not about a living being generally will not.

We're going to follow the same approach for gender:

In [24]:
page_gender = pages[['page_id', 'gender']].explode('gender', ignore_index=True)
page_gender.fillna('unknown', inplace=True)
page_gender.head()

Unnamed: 0,page_id,gender
0,12,unknown
1,25,unknown
2,39,unknown
3,290,unknown
4,303,unknown


We need to do a little targeted repair - there is an erroneous record whose gender value is "Taira no Kiyomori" (the subject is actually male). The cell below drops that record rather than relabeling it:

In [25]:
page_gender = page_gender.loc[page_gender['gender'] != 'Taira no Kiyomori']

Now, we're going to do a little more work to reduce the dimensionality of the space.  Points:

1. Trans men are men
2. Trans women are women
3. Cisgender is an adjective that can be dropped for the present purposes

The result is that we will collapse "transgender female" and "cisgender female" into "female", and likewise for "male".

The **downside** to this is that trans men are probably significantly under-represented, but are now being collapsed into the dominant group.

In [26]:
pgcol = page_gender['gender']
pgcol = pgcol.str.replace(r'(?:tran|ci)sgender\s+((?:fe)?male)', r'\1', regex=True)
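The effect of that pattern can be checked with the standard-library `re` module. The label strings below are assumptions for illustration (only "transgender female" and "cisgender female" are named in the text above):

```python
import re

# Same pattern as in the cell above: "tran" or "ci", then "sgender",
# whitespace, and a captured "male" or "female".
pattern = re.compile(r'(?:tran|ci)sgender\s+((?:fe)?male)')

labels = ['transgender female', 'cisgender male', 'female', 'non-binary']
collapsed = [pattern.sub(r'\1', s) for s in labels]
print(collapsed)  # ['female', 'male', 'female', 'non-binary']
```

Labels that do not match the pattern pass through unchanged, which is why a separate step is still needed to group the remaining identities.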

Now, we're going to group the remaining gender identities together under the label 'third'.  As noted above, this is a debatable exercise that collapses a lot of identity.

In [27]:
genders = ['unknown', 'male', 'female', 'third']
pgcol[~pgcol.isin(genders)] = 'third'

Now put this column back in the frame and deduplicate.

In [28]:
page_gender['gender'] = pgcol
page_gender = page_gender.drop_duplicates()

And make an alignment matrix (reordering so 'unknown' is first for consistency):

In [29]:
page_gend_align = page_gender.assign(x=1).pivot(index='page_id', columns='gender', values='x')
page_gend_align.fillna(0, inplace=True)
page_gend_align = page_gend_align.reindex(columns=['unknown', 'female', 'male', 'third'])
page_gend_align.head()

page_id,unknown,female,male,third
12,1.0,0.0,0.0,0.0
25,1.0,0.0,0.0,0.0
39,1.0,0.0,0.0,0.0
290,1.0,0.0,0.0,0.0
303,1.0,0.0,0.0,0.0


Let's see how frequent each of the genders is:

In [30]:
page_gend_align.sum(axis=0).sort_values(ascending=False)

gender
unknown    4246540.0
male       1441813.0
female      334946.0
third          452.0
dtype: float64

And convert to an xarray:

In [31]:
page_gend_xr = xr.DataArray(page_gend_align, dims=['page', 'gender'])
page_gend_xr

In [32]:
binarysize(page_gend_xr.nbytes)

'192.75 MiB'

### Intersectional Alignment

We'll now combine the geography and gender data arrays into an **intersectional** alignment array:

In [33]:
page_xalign = page_geo_xr * page_gend_xr
page_xalign

In [34]:
binarysize(page_xalign.nbytes)

'1.54 GiB'
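The product above aligns the two arrays on their shared `page` dimension and broadcasts over the other two, giving one geography-by-gender grid per page. A NumPy sketch of the equivalent operation on toy one-hot data (the arrays are invented for illustration):

```python
import numpy as np

# Toy alignments: 2 pages, 3 geography columns, 2 gender columns.
geo = np.array([[1., 0., 0.],
                [0., 1., 1.]])
gender = np.array([[1., 0.],
                   [0., 1.]])

# Per-page outer product: xalign[p, g, h] = geo[p, g] * gender[p, h]
xalign = np.einsum('pg,ph->pgh', geo, gender)
print(xalign.shape)  # (2, 3, 2)
```

Each page's grid has a 1 exactly where both its geography and its gender columns are set.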

Make sure that did the right thing and we have intersectional numbers:

In [35]:
page_xalign.sum(axis=0)

And make sure combination with targets works as expected:

In [36]:
(page_xalign.sum(axis=0) + int_tgt) * 0.5

## Task 1 Metric Preparation

Now that we have our alignments and qrels, we are ready to prepare the Task 1 metrics.

Task 1 ignores the "unknown" alignment category, so we're going to create a `kga` frame (for **K**nown **G**eographic **A**lignment), and corresponding frames for intersectional alignment.

In [37]:
page_kga = page_geo_align.iloc[:, 1:]
page_kga.head()

page_id,Africa,Antarctica,Asia,Europe,Latin America and the Caribbean,Northern America,Oceania
12,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25,0.0,0.0,0.0,0.0,0.0,0.0,0.0
39,0.0,0.0,0.0,0.0,0.0,0.0,0.0
290,0.0,0.0,0.0,0.0,0.0,0.0,0.0
303,0.0,0.0,0.0,0.0,0.0,1.0,0.0


Intersectional is a little harder to do, because things can be **intersectionally unknown**: we may know gender but not geography, or vice versa.  To deal with these missing values for Task 1, we're going to ignore *totally unknown* values, but keep partially-known as a category.

We also need to ravel our tensors into a matrix for compatibility with the metric code. Since 'unknown' is the first value on each axis, we can ravel, and then drop the first column.

In [38]:
xshp = page_xalign.shape
xshp = (xshp[0], xshp[1] * xshp[2])
page_xa_df = pd.DataFrame(page_xalign.values.reshape(xshp), index=page_xalign.indexes['page'])
page_xa_df.head()

page,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
12,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
39,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
290,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
303,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


And drop unknown, to get our page alignment vectors:

In [39]:
page_kia = page_xa_df.iloc[:, 1:]
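The column numbers in the raveled frame follow row-major order over the (geography, gender) grid. A short sketch of that mapping, assuming 8 geography and 4 gender categories with 'unknown' first on each axis as in the arrays above:

```python
import numpy as np

n_geo, n_gender = 8, 4  # both counts include the leading 'unknown'
grid = np.arange(n_geo * n_gender).reshape(n_geo, n_gender)

flat = grid.ravel()            # flat index = geo * n_gender + gender
assert flat[0] == grid[0, 0]   # (unknown, unknown) always comes first
known = flat[1:]               # drop only the fully unknown cell

# Map a flat column number back to (geo, gender) positions.
geo_pos, gender_pos = np.unravel_index(24, (n_geo, n_gender))
print(int(geo_pos), int(gender_pos))
```

Column 24 maps to geography position 6 and gender position 0, which matches page 303 in the output above: Northern America with unknown gender.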

### Geographic Alignment

We'll start with the metric configuration for public training data, considering only geographic alignment.  We configure the metric to do this for both the training and the eval queries.

#### Training Queries

In [40]:
train_qalign = train_qrels.join(page_kga, on='page_id').drop(columns=['page_id']).groupby('id').sum()
tqa_sums = train_qalign.sum(axis=1)
train_qalign = train_qalign.divide(tqa_sums, axis=0)

In [41]:
train_qalign.head()

id,Africa,Antarctica,Asia,Europe,Latin America and the Caribbean,Northern America,Oceania
1,0.049495,0.0,0.121886,0.356566,0.03165,0.261616,0.178788
2,0.013388,0.0,0.112008,0.574026,0.026105,0.228715,0.045758
3,0.109664,0.0,0.125529,0.456033,0.10004,0.158419,0.050316
4,0.062495,0.00025,0.116161,0.327272,0.079514,0.369277,0.045032
5,0.000835,0.0,0.065433,0.010149,0.064755,0.850192,0.008636


In [42]:
train_qtarget = (train_qalign + world_pop) * 0.5
train_qtarget.head()

id,Africa,Antarctica,Asia,Europe,Latin America and the Caribbean,Northern America,Oceania
1,0.102283,7.7212e-08,0.361044,0.230115,0.058874,0.155616,0.092068
2,0.084229,7.7212e-08,0.356105,0.338845,0.056101,0.139166,0.025553
3,0.132367,7.7212e-08,0.362866,0.279848,0.093069,0.104018,0.027832
4,0.108783,0.0001250113,0.358182,0.215468,0.082806,0.209447,0.02519
5,0.077953,7.7212e-08,0.332818,0.056906,0.075427,0.449904,0.006992
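The target computation is a row-normalization followed by an equal-weight average with the population vector. A sketch with hypothetical counts and population shares:

```python
import pandas as pd

# Hypothetical per-query counts of relevant pages by region.
counts = pd.DataFrame(
    {'A': [3.0, 0.0], 'B': [1.0, 4.0]},
    index=pd.Index([1, 2], name='id'),
)
# Hypothetical population shares for the same regions (sum to 1).
pop = pd.Series({'A': 0.5, 'B': 0.5})

dist = counts.divide(counts.sum(axis=1), axis=0)  # each row now sums to 1
target = (dist + pop) * 0.5                       # Series aligns on column labels
print(target)
```

Because both inputs are distributions, each target row is again a distribution. The same arithmetic reproduces the table above: for query 1, (0.049495 + 0.155070563) / 2 is about 0.102283 for Africa.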


And we can prepare a metric and save it:

In [43]:
t1_train_metric = metrics.Task1Metric(train_qrels.set_index('id'), page_kga, train_qtarget)
binpickle.dump(t1_train_metric, 'task1-train-geo-metric.bpk', codec=codec)

INFO:binpickle.write:pickled 337312647 bytes with 5 buffers


#### Eval Queries

Do the same thing for the eval data for a geo-only eval metric:

In [44]:
eval_qalign = eval_qrels.join(page_kga, on='page_id').drop(columns=['page_id']).groupby('id').sum()
eqa_sums = eval_qalign.sum(axis=1)
eval_qalign = eval_qalign.divide(eqa_sums, axis=0)
eval_qtarget = (eval_qalign + world_pop) * 0.5
t1_eval_metric = metrics.Task1Metric(eval_qrels.set_index('id'), page_kga, eval_qtarget)
binpickle.dump(t1_eval_metric, 'task1-eval-geo-metric.bpk', codec=codec)

INFO:binpickle.write:pickled 337312643 bytes with 5 buffers


### Intersectional Alignment

Now we need to apply similar logic, but for the intersectional (geography * gender) alignment.

As noted above, we need to carefully handle the unknown cases.

#### Demo

To demonstrate how the logic works, let's first work it out in cells for one query (1).

What are its documents?

In [45]:
qdf = qrels[qrels['id'] == 1]
qdf.name = 1
qdf

Unnamed: 0,id,page_id
0,1,572
1,1,627
2,1,903
3,1,1193
4,1,1542
...,...,...
6959,1,67066971
6960,1,67075177
6961,1,67178925
6962,1,67190032


We can use these page IDs to get its alignments:

In [46]:
q_xa = page_xalign.loc[qdf['page_id'].values, :, :]
q_xa

Summing over the first axis ('page') will produce an alignment matrix:

In [47]:
q_am = q_xa.sum(axis=0)
q_am

Now we need to reset the (0,0) coordinate (fully unknown) and normalize to a proportion.

In [48]:
q_am[0, 0] = 0
q_am = q_am / q_am.sum()
q_am

Ok, now we have to - very carefully - average with our target modifier.  There are three groups:

- known (use intersectional target)
- known-geo (use geo target)
- known-gender (use gender target)

For each of these, we need to respect the fraction of the total it represents.  Let's compute those fractions:

In [49]:
q_fk_all = q_am[1:, 1:].sum()
q_fk_geo = q_am[1:, :1].sum()
q_fk_gen = q_am[:1, 1:].sum()
q_fk_all, q_fk_geo, q_fk_gen

(<xarray.DataArray ()>
 array(0.12383613),
 <xarray.DataArray ()>
 array(0.79795158),
 <xarray.DataArray ()>
 array(0.07821229))

And now do some surgery.  Weighted-average to incorporate the target for fully-known:

In [50]:
q_tm = q_am.copy()
q_tm[1:, 1:] *= 0.5
q_tm[1:, 1:] += int_tgt * 0.5 * q_fk_all
q_tm

And for known-geo:

In [51]:
q_tm[1:, :1] *= 0.5
q_tm[1:, :1] += geo_tgt_xa * 0.5 * q_fk_geo

And known-gender:

In [52]:
q_tm[:1, 1:] *= 0.5
q_tm[:1, 1:] += gender_tgt_xa * 0.5 * q_fk_gen

In [53]:
q_tm

Now we can ravel this and drop the first entry:

In [54]:
q_tm.values.ravel()[1:]

array([2.74270639e-02, 5.03941651e-02, 3.91061453e-04, 8.17328395e-02,
       6.61502352e-03, 5.83910794e-03, 9.60166894e-05, 6.16114376e-08,
       4.73300933e-09, 4.73300933e-09, 9.56163501e-11, 2.89435265e-01,
       2.01028882e-02, 2.28961843e-02, 3.71633817e-04, 1.87231499e-01,
       6.74645100e-03, 1.80748185e-02, 6.41866532e-05, 4.66104719e-02,
       3.88031961e-03, 3.72513649e-03, 5.33101956e-05, 1.15699041e-01,
       5.86585240e-03, 2.18497134e-02, 3.07217202e-05, 7.72424054e-02,
       1.09501611e-03, 6.52642517e-03, 3.31146285e-06])
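One property worth checking is that this block-wise averaging preserves total mass: each block is mixed with a target scaled to that block's own fraction, so the result still sums to 1. A NumPy sketch of the same steps on an invented 3x3 count matrix (the function name and inputs are hypothetical, and the targets are assumed to sum to 1):

```python
import numpy as np

def blend_with_targets(counts, geo_tgt, gender_tgt):
    """Average each known block of a (geo x gender) matrix with its target.

    Row 0 and column 0 are the 'unknown' categories.
    """
    am = np.asarray(counts, dtype=float).copy()
    am[0, 0] = 0.0                 # ignore fully unknown pages
    am /= am.sum()                 # proportions over the remaining cells

    fk_all = am[1:, 1:].sum()      # both attributes known
    fk_geo = am[1:, 0].sum()       # geography known, gender unknown
    fk_gen = am[0, 1:].sum()       # gender known, geography unknown

    am[1:, 1:] = 0.5 * am[1:, 1:] + 0.5 * fk_all * np.outer(geo_tgt, gender_tgt)
    am[1:, 0] = 0.5 * am[1:, 0] + 0.5 * fk_geo * geo_tgt
    am[0, 1:] = 0.5 * am[0, 1:] + 0.5 * fk_gen * gender_tgt
    return am

toy_counts = [[5, 1, 1],
              [2, 3, 1],
              [2, 1, 1]]
toy_geo_tgt = np.array([0.5, 0.5])
toy_gender_tgt = np.array([0.5, 0.5])
blended = blend_with_targets(toy_counts, toy_geo_tgt, toy_gender_tgt)
print(blended.sum())
```

The fully-known block keeps its original share of the total (here one half), only redistributed toward the intersectional target.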

#### Implementation

Now, to do this for every query, we'll use a function that takes a data frame for a query's relevant docs and performs all of the above operations:

In [55]:
def query_xalign(qdf):
    pages = qdf['page_id']
    pages = pages[pages.isin(page_xalign.indexes['page'])]
    q_xa = page_xalign.loc[pages.values, :, :]
    q_am = q_xa.sum(axis=0)

    # clear and normalize
    q_am[0, 0] = 0
    q_am = q_am / q_am.sum()
    
    # compute fractions in each section
    q_fk_all = q_am[1:, 1:].sum()
    q_fk_geo = q_am[1:, :1].sum()
    q_fk_gen = q_am[:1, 1:].sum()
    
    # known average
    q_am[1:, 1:] *= 0.5
    q_am[1:, 1:] += int_tgt * 0.5 * q_fk_all
    
    # known-geo average
    q_am[1:, :1] *= 0.5
    q_am[1:, :1] += geo_tgt_xa * 0.5 * q_fk_geo
    
    # known-gender average
    q_am[:1, 1:] *= 0.5
    q_am[:1, 1:] += gender_tgt_xa * 0.5 * q_fk_gen
    
    # and return the result
    return pd.Series(q_am.values.ravel()[1:])

In [56]:
query_xalign(qdf)

0     2.742706e-02
1     5.039417e-02
2     3.910615e-04
3     8.173284e-02
4     6.615024e-03
5     5.839108e-03
6     9.601669e-05
7     6.161144e-08
8     4.733009e-09
9     4.733009e-09
10    9.561635e-11
11    2.894353e-01
12    2.010289e-02
13    2.289618e-02
14    3.716338e-04
15    1.872315e-01
16    6.746451e-03
17    1.807482e-02
18    6.418665e-05
19    4.661047e-02
20    3.880320e-03
21    3.725136e-03
22    5.331020e-05
23    1.156990e-01
24    5.865852e-03
25    2.184971e-02
26    3.072172e-05
27    7.724241e-02
28    1.095016e-03
29    6.526425e-03
30    3.311463e-06
dtype: float64

Now with that function, we can compute the alignment vector for each query.

In [57]:
train_qtarget = train_qrels.groupby('id').apply(query_xalign)
train_qtarget

id,0,1,2,3,4,5,6,7,8,9,...,21,22,23,24,25,26,27,28,29,30
1,0.027427,0.050394,0.000391,0.081733,0.006615,0.005839,9.6e-05,6.161144e-08,4.733009e-09,4.733009e-09,...,0.003725,5.3e-05,0.115699,0.005866,0.02185,3.1e-05,0.077242,0.001095,0.006526,3.311463e-06
2,0.012235,0.032073,0.000232,0.073168,0.003571,0.003669,7e-05,6.684003e-08,3.431828e-09,3.431828e-09,...,0.003276,3.9e-05,0.114341,0.00341,0.015195,2.2e-05,0.021868,0.000564,0.001981,2.401088e-06
3,0.022553,0.035541,0.000292,0.023527,0.040981,0.059556,0.000574,1.665751e-08,2.774295e-08,2.774295e-08,...,0.03963,0.000312,0.024684,0.023416,0.049637,0.000202,0.006253,0.007383,0.012547,1.941043e-05
4,0.012472,0.029112,0.000209,0.09484,0.004409,0.004901,8.6e-05,0.0001197781,4.231748e-09,4.231748e-09,...,0.002998,4.8e-05,0.176399,0.003847,0.020421,2.7e-05,0.020782,0.000413,0.00294,2.960754e-06
5,0.023416,0.063398,0.000436,0.020521,0.024932,0.025194,0.000504,2.031701e-08,2.482831e-08,2.482831e-08,...,0.038382,0.00028,0.12127,0.014276,0.274932,0.000161,0.002668,0.000967,0.002729,1.737119e-05
6,0.12682,0.201558,0.001691,2.1e-05,0.028097,0.030215,0.000519,2.098131e-11,2.559433e-08,5.05775e-06,...,0.019846,0.000288,5.2e-05,0.043116,0.107266,0.000206,1.1e-05,0.006256,0.011167,2.797145e-05
7,0.050837,0.115432,0.000836,2.6e-05,0.034747,0.051416,0.000646,2.510823e-11,3.182076e-08,3.182076e-08,...,0.052265,0.000358,7.8e-05,0.019805,0.095237,0.000212,2.5e-05,0.005422,0.022694,2.226348e-05
8,0.038785,0.054361,0.000521,6.5e-05,0.038044,0.038746,0.000702,6.44277e-11,3.460808e-08,3.460808e-08,...,0.029176,0.000396,0.000206,0.078311,0.12929,0.00039,9e-06,0.006967,0.01004,3.745849e-05
9,0.059002,0.157276,0.001087,0.00563,0.028051,0.056632,0.000554,5.304181e-09,2.72867e-08,2.72867e-08,...,0.074846,0.000307,0.018965,0.013625,0.092816,0.000177,0.003089,0.002289,0.014736,1.909121e-05
10,0.064617,0.137545,0.001016,0.046254,0.008038,0.008038,0.000162,4.535438e-08,8.004062e-09,8.004062e-09,...,0.00503,9e-05,0.035855,0.009666,0.019314,5.2e-05,0.00299,0.000561,0.000703,5.600064e-06


And save:

In [58]:
t1_train_metric = metrics.Task1Metric(train_qrels.set_index('id'), page_kia, train_qtarget)
binpickle.dump(t1_train_metric, 'task1-train-metric.bpk', codec=codec)

INFO:binpickle.write:pickled 1493808204 bytes with 5 buffers


Do the same for eval:

In [59]:
eval_qtarget = eval_qrels.groupby('id').apply(query_xalign)
t1_eval_metric = metrics.Task1Metric(eval_qrels.set_index('id'), page_kia, eval_qtarget)
binpickle.dump(t1_eval_metric, 'task1-eval-metric.bpk', codec=codec)

INFO:binpickle.write:pickled 1493808200 bytes with 5 buffers


## Task 2 Metric Preparation

Task 2 requires some different preparation.

We're going to start by computing work-needed information:

In [60]:
page_work = pages.set_index('page_id').quality_score_disc.astype(pd.CategoricalDtype(ordered=True))
page_work = page_work.cat.reorder_categories(work_order)
page_work.name = 'quality'

### Work and Target Exposure

The first thing we need to do to prepare the metric is to compute the work-needed for each topic's pages, and use that to compute the target exposure for each (relevant) page in the topic.

This is because an ideal ranking orders relevant documents in decreasing order of work needed, followed by irrelevant documents.  All relevant documents at a given work level should receive the same expected exposure.

First, look up the work for each query page ('query page work', or qpw):

In [61]:
qpw = qrels.join(page_work, on='page_id')
qpw

Unnamed: 0,id,page_id,quality
0,1,572,C
1,1,627,FA
2,1,903,C
3,1,1193,B
4,1,1542,GA
...,...,...,...
2199072,150,63656179,Start
2199073,150,63807245,
2199074,150,64614938,C
2199075,150,64716982,C


And now use that to compute the number of documents at each work level:

In [62]:
qwork = qpw.groupby(['id', 'quality'])['page_id'].count()
qwork

id   quality
1    Stub       1527
     Start      2822
     C          1603
     B           610
     GA          240
                ... 
150  Start       138
     C           127
     B            35
     GA           16
     FA            8
Name: page_id, Length: 636, dtype: int64

Now we need to convert this into target exposure levels.  This function will, given a series of counts for each work level, compute the expected exposure a page at that work level should receive.

In [63]:
def qw_tgt_exposure(qw_counts: pd.Series) -> pd.Series:
    if 'id' == qw_counts.index.names[0]:
        qw_counts = qw_counts.reset_index(level='id', drop=True)
    qwc = qw_counts.reindex(work_order, fill_value=0).astype('i4')
    tot = int(qwc.sum())
    da = metrics.discount(tot)
    qwp = qwc.shift(1, fill_value=0)
    qwc_s = qwc.cumsum()
    qwp_s = qwp.cumsum()
    res = pd.Series(
        [np.mean(da[s:e]) for (s, e) in zip(qwp_s, qwc_s)],
        index=qwc.index
    )
    return res
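To see what this computes, here is a standalone sketch. `metrics.discount` is not shown in this notebook, so the sketch substitutes a hypothetical logarithmic discount, 1 / log2(rank + 1); the real discount may differ. Pages are laid out in work-level order, and each level receives the mean discount over the block of ranks it occupies.

```python
import numpy as np

def toy_discount(n):
    """Hypothetical stand-in for metrics.discount: 1 / log2(rank + 1)."""
    ranks = np.arange(1, n + 1)
    return 1.0 / np.log2(ranks + 1)

def level_exposure(counts):
    """Mean discount over the rank block occupied by each work level."""
    counts = np.asarray(counts)
    ends = counts.cumsum()
    starts = ends - counts
    disc = toy_discount(int(counts.sum()))
    # An empty level has no ranks; report NaN instead of averaging nothing.
    return np.array([
        disc[s:e].mean() if e > s else np.nan
        for s, e in zip(starts, ends)
    ])

# Three levels with 2, 0, and 1 pages respectively.
exp = level_exposure([2, 0, 1])
print(exp)
```

In the function above, a work level with zero pages leads to `np.mean` over an empty slice, which yields NaN and a NumPy `RuntimeWarning`; the explicit guard in this sketch produces the same NaN without the warning.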

We'll then apply this to each topic, to determine the per-topic target exposures:

In [64]:
qw_pp_target = qwork.groupby('id').apply(qw_tgt_exposure)
qw_pp_target.name = 'tgt_exposure'
qw_pp_target

  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)


id   quality
1    Stub       0.114738
     Start      0.087373
     C          0.081146
     B          0.079298
     GA         0.078702
                  ...   
150  Start      0.154202
     C          0.127359
     B          0.120441
     GA         0.118827
     FA         0.118126
Name: tgt_exposure, Length: 636, dtype: float32

We can now merge the relevant document work categories with this exposure, to compute the target exposure for each relevant document:

In [65]:
qp_exp = qpw.join(qw_pp_target, on=['id', 'quality'])
qp_exp = qp_exp.set_index(['id', 'page_id'])['tgt_exposure']
qp_exp.index.names = ['q_id', 'page_id']
qp_exp

q_id  page_id 
1     572         0.081146
      627         0.078438
      903         0.081146
      1193        0.079298
      1542        0.078702
                    ...   
150   63656179    0.154202
      63807245         NaN
      64614938    0.127359
      64716982    0.127359
      65355704    0.127359
Name: tgt_exposure, Length: 2199077, dtype: float32

### Geographic Alignment

Now that we've computed per-page target exposure, we're ready to set up the geographic alignment vectors for computing the per-*group* expected exposure with geographic data.

We're going to start by getting the alignments for relevant documents for each topic:

In [66]:
qp_geo_align = qrels.join(page_geo_align, on='page_id').set_index(['id', 'page_id'])
qp_geo_align.index.names = ['q_id', 'page_id']
qp_geo_align

q_id,page_id,Unknown,Africa,Antarctica,Asia,Europe,Latin America and the Caribbean,Northern America,Oceania
1,572,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,627,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,903,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1193,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1542,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...
150,63656179,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
150,63807245,,,,,,,,
150,64614938,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
150,64716982,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now we need to compute the per-query target exposures.  This starts with aligning our vectors:

In [67]:
qp_geo_exp, qp_geo_align = qp_exp.align(qp_geo_align, fill_value=0)

And now we can multiply the alignment vectors by the exposure vector and sum by topic. This is equivalent to a matrix-vector multiplication on a topic-by-topic basis.

In [68]:
qp_aexp = qp_geo_align.multiply(qp_geo_exp, axis=0)
q_geo_align = qp_aexp.groupby('q_id').sum()
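The equivalence can be checked on a toy example. The index values, exposures, and the two alignment columns below are invented for illustration:

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(1, 10), (1, 11), (2, 20)], names=['q_id', 'page_id'])
align = pd.DataFrame(
    {'Unknown': [1.0, 0.0, 0.0], 'Asia': [0.0, 1.0, 1.0]}, index=idx)
exposure = pd.Series([0.6, 0.4, 1.0], index=idx)

# Scale each page's alignment row by its exposure, then total per query.
grouped = align.multiply(exposure, axis=0).groupby('q_id').sum()

# The same numbers for query 1 via an explicit vector-matrix product.
direct = exposure.loc[1].values @ align.loc[1].values
print(grouped.loc[1].values, direct)
```

The grouped form computes this product for every query at once without materializing a per-query matrix.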

Now things get a *little* weird.  We want to average the empirical distribution with the world population to compute our fairness target.  However, we don't have empirical data on the distribution of articles that do or do not have geographic alignments.

Therefore, we are going to average only the *known-geography* vector with the world population.  This proceeds in four steps:

1. Normalize the known-geography matrix so its rows sum to 1.
2. Average each row with the world population.
3. De-normalize the known-geography matrix so it is in the original scale, but adjusted with the world population
4. Normalize the *entire* matrix so its rows sum to 1

Let's go.

In [69]:
qg_known = q_geo_align.drop(columns=['Unknown'])

Normalize (clamping the denominator to a small positive value to avoid division by zero - affected rows have all-zero numerators anyway):

In [70]:
qg_ksums = qg_known.sum(axis=1)
qg_kd = qg_known.divide(np.maximum(qg_ksums, 1.0e-6), axis=0)

Average:

In [71]:
qg_kd = (qg_kd + world_pop) * 0.5

De-normalize:

In [72]:
qg_known = qg_kd.multiply(qg_ksums, axis=0)
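The normalize, average, and de-normalize steps can be traced on a toy matrix. The values below are invented, and the population shares are assumed to sum to 1:

```python
import numpy as np

known = np.array([[3.0, 1.0],
                  [0.0, 0.0]])   # second query has no known-geography exposure
pop = np.array([0.5, 0.5])       # hypothetical population shares

sums = known.sum(axis=1, keepdims=True)
norm = known / np.maximum(sums, 1.0e-6)  # rows sum to 1, or stay all-zero
avg = (norm + pop) * 0.5                 # average with the population
adjusted = avg * sums                    # back to the original scale

print(adjusted)
```

Each row of `adjusted` keeps its original total, so the split between known and unknown exposure is unchanged; only the distribution within the known columns moves toward the population shares. A row with no known exposure stays at zero after de-normalizing.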

Recombine with the Unknown column:

In [73]:
q_geo_tgt = q_geo_align[['Unknown']].join(qg_known)

Normalize targets:

In [74]:
q_geo_tgt = q_geo_tgt.divide(q_geo_tgt.sum(axis=1), axis=0)
q_geo_tgt

q_id,Unknown,Africa,Antarctica,Asia,Europe,Latin America and the Caribbean,Northern America,Oceania
1,0.575338,0.043635,3.278897e-08,0.153851,0.098450,0.025042,0.065388,0.038296
2,0.173889,0.069608,6.378567e-08,0.294269,0.280798,0.046323,0.115193,0.019920
3,0.234897,0.101882,5.907510e-08,0.278161,0.215027,0.071196,0.077784,0.021053
4,0.312664,0.076008,8.262075e-05,0.246140,0.145192,0.058319,0.143947,0.017648
5,0.182143,0.063760,6.314834e-08,0.273795,0.046710,0.061549,0.366345,0.005697
...,...,...,...,...,...,...,...,...
146,0.292441,0.090378,5.463208e-08,0.299627,0.067556,0.045686,0.178497,0.025815
147,0.434276,0.060053,4.368069e-08,0.195520,0.130625,0.061604,0.091005,0.026916
148,0.637050,0.033542,2.802409e-08,0.233693,0.045680,0.018613,0.025322,0.006099
149,0.370828,0.061724,4.857964e-08,0.243518,0.172170,0.040886,0.073876,0.036999


These are our group exposure target distributions for each query, for the geographic data.  We're now ready to set up the metric.

In [75]:
train_geo_qtgt = q_geo_tgt.loc[train_topics['id']]
eval_geo_qtgt = q_geo_tgt.loc[eval_topics['id']]

In [76]:
t2_train_geo_metric = metrics.Task2Metric(train_qrels.set_index('id'), 
                                          page_geo_align, page_work, 
                                          train_geo_qtgt)
binpickle.dump(t2_train_geo_metric, 'task2-train-geo-metric.bpk', codec=codec)

INFO:binpickle.write:pickled 2018 bytes with 9 buffers


In [77]:
t2_eval_geo_metric = metrics.Task2Metric(eval_qrels.set_index('id'), 
                                         page_geo_align, page_work, 
                                         eval_geo_qtgt)
binpickle.dump(t2_eval_geo_metric, 'task2-eval-geo-metric.bpk', codec=codec)

INFO:binpickle.write:pickled 2014 bytes with 9 buffers


### Intersectional Alignment

Now we need to compute the intersectional targets for Task 2.  We're going to take a slightly different approach here, based on the intersectional logic for Task 1, because we've come up with better ways to write the code, but the effect is the same: only known aspects are averaged.

We'll write a function very similar to the one for Task 1:

In [78]:
def query_xideal(qdf, ravel=True):
    pages = qdf['page_id']
    pages = pages[pages.isin(page_xalign.indexes['page'])]
    q_xa = page_xalign.loc[pages.values, :, :]
    
    # now we need to get the exposure for the pages, and multiply
    p_exp = qp_exp.loc[qdf.name]
    assert p_exp.index.is_unique
    p_exp = xr.DataArray(p_exp, dims=['page'])
    
    # and we multiply!
    q_xa = q_xa * p_exp

    # normalize into a matrix (this time we don't clear)
    q_am = q_xa.sum(axis=0)
    q_am = q_am / q_am.sum()
    
    # compute fractions in each section - combined with q_am[0,0], this should be about 1
    q_fk_all = q_am[1:, 1:].sum()
    q_fk_geo = q_am[1:, :1].sum()
    q_fk_gen = q_am[:1, 1:].sum()
    
    # known average
    q_am[1:, 1:] *= 0.5
    q_am[1:, 1:] += int_tgt * 0.5 * q_fk_all
    
    # known-geo average
    q_am[1:, :1] *= 0.5
    q_am[1:, :1] += geo_tgt_xa * 0.5 * q_fk_geo
    
    # known-gender average
    q_am[:1, 1:] *= 0.5
    q_am[:1, 1:] += gender_tgt_xa * 0.5 * q_fk_gen
    
    # and return the result
    if ravel:
        return pd.Series(q_am.values.ravel())
    else:
        return q_am
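
To make the block-wise averaging easier to follow, here is a minimal sketch of the same idea on a toy NumPy matrix. It is not the notebook's code: `blend_known_blocks`, the 3×3 shape, and the uniform targets are all hypothetical, and we assume each target sums to 1 so that the mass of each block is preserved. Row 0 and column 0 stand for the "unknown" category of each attribute.

```python
import numpy as np

def blend_known_blocks(q_am, int_tgt, geo_tgt, gender_tgt):
    """Average each known block of an alignment matrix with its target.

    Hypothetical helper: row 0 / column 0 hold the "unknown" category,
    and each target is assumed to sum to 1.
    """
    q_am = q_am / q_am.sum()      # normalize to total mass 1 (this makes a copy)

    # mass in each block, measured before any blending
    fk_all = q_am[1:, 1:].sum()   # both attributes known
    fk_geo = q_am[1:, :1].sum()   # row attribute known, column attribute unknown
    fk_gen = q_am[:1, 1:].sum()   # column attribute known, row attribute unknown

    # half observed alignment, half target scaled to the block's mass
    q_am[1:, 1:] = 0.5 * q_am[1:, 1:] + 0.5 * fk_all * int_tgt
    q_am[1:, :1] = 0.5 * q_am[1:, :1] + 0.5 * fk_geo * geo_tgt
    q_am[:1, 1:] = 0.5 * q_am[:1, 1:] + 0.5 * fk_gen * gender_tgt
    return q_am

# toy exposure-weighted alignment (unknown + 2 categories per attribute)
raw = np.array([[2.0, 1.0, 1.0],
                [1.0, 3.0, 0.0],
                [1.0, 0.0, 1.0]])
int_tgt = np.full((2, 2), 0.25)       # uniform over the known-known cells
geo_tgt = np.array([[0.5], [0.5]])    # column vector for the known-row block
gender_tgt = np.array([[0.5, 0.5]])   # row vector for the known-column block

result = blend_known_blocks(raw, int_tgt, geo_tgt, gender_tgt)
```

The fully-unknown cell `result[0, 0]` is never touched, and because every target is scaled by the mass of its own block, the matrix still sums to 1 after blending.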

Test this function out:

In [79]:
query_xideal(qdf, ravel=False)

And let's go!

In [80]:
q_xtgt = qrels.groupby('id').progress_apply(query_xideal)
q_xtgt

id,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
1,0.540211,0.012290,0.022661,0.000176,0.038091,0.002908,0.002593,0.000043,2.855279e-08,2.096911e-09,...,0.001652,0.000024,0.053265,0.002515,0.009594,0.000014,0.034868,0.000471,0.002955,1.467109e-06
2,0.135109,0.010633,0.027958,0.000201,0.063400,0.003032,0.003115,0.000059,5.789347e-08,2.916166e-09,...,0.002811,0.000033,0.099662,0.002826,0.012684,0.000019,0.017811,0.000468,0.001639,2.040304e-06
3,0.185923,0.018891,0.029817,0.000245,0.018607,0.033486,0.049321,0.000471,1.300894e-08,2.280358e-08,...,0.032680,0.000257,0.019061,0.018746,0.039812,0.000164,0.004908,0.005947,0.010179,1.595459e-05
4,0.283620,0.008665,0.020234,0.000145,0.069568,0.003021,0.003361,0.000059,8.261490e-05,2.894759e-09,...,0.002071,0.000033,0.127457,0.002601,0.013870,0.000019,0.015398,0.000279,0.001968,2.025326e-06
5,0.102865,0.021347,0.057531,0.000396,0.017768,0.022647,0.022888,0.000458,1.758741e-08,2.255280e-08,...,0.034300,0.000254,0.104245,0.012692,0.249255,0.000146,0.002342,0.000883,0.002456,1.577913e-05
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
146,0.242344,0.017108,0.032738,0.000250,0.033631,0.031692,0.024843,0.000212,3.349054e-08,1.046506e-08,...,0.009259,0.000118,0.164611,0.010455,0.003362,0.000068,0.013121,0.012324,0.000362,7.321910e-06
147,0.380085,0.025582,0.026067,0.001304,0.028472,0.017849,0.014999,0.000207,2.317530e-08,1.019747e-08,...,0.018809,0.000115,0.050914,0.022379,0.017457,0.000066,0.016122,0.005498,0.005220,7.134685e-06
148,0.620663,0.005550,0.010755,0.000082,0.031143,0.001188,0.001188,0.000024,2.563480e-08,1.182699e-09,...,0.000659,0.000013,0.020264,0.000380,0.004670,0.000008,0.003416,0.000041,0.002642,8.274784e-07
149,0.365415,0.002870,0.002516,0.000027,0.060143,0.000783,0.000783,0.000016,4.700522e-08,7.793387e-10,...,0.002786,0.000009,0.071839,0.000250,0.001781,0.000005,0.036944,0.000027,0.000027,5.452665e-07
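
Each row above is one query's target matrix flattened with `.ravel()`, so the column numbers are row-major cell positions. As a small sketch of how such a row maps back to a matrix, the example below uses a hypothetical 2×3 shape and made-up values; for the real data the shape would be that of the two non-page dimensions of `page_xalign`.

```python
import numpy as np
import pandas as pd

# Hypothetical shape and values, for illustration only.
shape = (2, 3)
mats = {
    1: np.arange(6, dtype=float).reshape(shape),
    2: np.ones(shape),
}

# One row per query, one column per flattened cell (row-major, like .ravel()).
flat = pd.DataFrame({qid: m.ravel() for qid, m in mats.items()}).T
flat.index.name = 'id'

# Recover the matrix for one query by reshaping its row.
restored = flat.loc[1].to_numpy().reshape(shape)

# Flat column k corresponds to cell (k // ncols, k % ncols).
k = 4
cell = (k // shape[1], k % shape[1])
```

With row-major order, the first flat column is always the cell where both attributes are unknown, which matches the `[0, 0]` corner used in `query_xideal`.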


In [81]:
train_qtgt = q_xtgt.loc[train_topics['id']]
eval_qtgt = q_xtgt.loc[eval_topics['id']]

In [82]:
t2_train_metric = metrics.Task2Metric(train_qrels.set_index('id'), 
                                      page_xa_df, page_work, 
                                      train_qtgt)
binpickle.dump(t2_train_metric, 'task2-train-metric.bpk', codec=codec)

INFO:binpickle.write:pickled 1879 bytes with 9 buffers


In [83]:
t2_eval_metric = metrics.Task2Metric(eval_qrels.set_index('id'), 
                                     page_xa_df, page_work, 
                                     eval_qtgt)
binpickle.dump(t2_eval_metric, 'task2-eval-metric.bpk', codec=codec)

INFO:binpickle.write:pickled 1875 bytes with 9 buffers
