# Page Alignments

This notebook computes the *page alignments* from the Wikipedia metadata.  These are then used by the
task-specific alignment notebooks to compute target distributions and page alignment subsets for retrieved pages.

**Warning:** this notebook needs a lot of memory to run: several of the alignment frames it builds are larger than 1 GB each.

## Setup

We begin by loading necessary libraries:

In [1]:
import sys
from pathlib import Path
import pandas as pd
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import gzip
import json
from natural.size import binarysize

Set up progress bar and logging support:

In [2]:
from tqdm.auto import tqdm
tqdm.pandas(leave=False)

In [3]:
import sys, logging
logging.basicConfig(level=logging.INFO, stream=sys.stderr)
log = logging.getLogger('PageAlignments')

And set up an output directory:

In [4]:
from wptrec.save import OutRepo
output = OutRepo('data/metric-tables')

## Loading Data

Now we need to load the data.

### Static Data
We need a set of subregions that are folded into [Oceania](https://en.wikipedia.org/wiki/United_Nations_geoscheme_for_Oceania):

In [5]:
oc_regions = [
    'Australia and New Zealand',
    'Melanesia',
    'Micronesia',
    'Polynesia',
]

And finally a name for unknown:

In [6]:
UNKNOWN = '@UNKNOWN'

Now all our background data is set up.

### Page Data

Finally, we load the page metadata.  We parse the file by hand, one line at a time, to keep memory usage manageable.  Two tricks help:

- Only import the things we need
- Use `sys.intern` for strings representing categoricals to decrease memory use

As a bonus, tracking our position in the compressed file gives us a progress bar.
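To illustrate the interning trick, here is a minimal sketch.  The strings are built by hand for illustration and are not read from the page file; the point is only that `sys.intern` hands back one shared object for equal strings.

```python
import sys

# Build two equal strings at runtime, so they start out as separate objects.
a = ''.join(['Northern', ' ', 'America'])
b = ''.join(['Northern', ' ', 'America'])

# sys.intern returns one canonical object per distinct string value, so a
# region name repeated across millions of records is stored only once.
ia = sys.intern(a)
ib = sys.intern(b)

same_value = (a == b)
same_object = (ia is ib)
print(same_value, same_object)
```

Since the categorical attributes only take a handful of distinct values, interning them should save a good deal of memory compared with keeping one string object per record.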

In [7]:
page_path = Path('data/trec_2022_articles_discrete.json.gz')
page_file_size = page_path.stat().st_size
binarysize(page_file_size)

'236.81 MiB'

#### Definitions

Let's define the different attributes we need to extract:

In [8]:
SUB_GEO_ATTR = 'page_subcont_regions'
SRC_GEO_ATTR = 'source_subcont_regions'
GENDER_ATTR = 'gender'
OCC_ATTR = 'occupations'
BASIC_ATTRS = [
    'page_id',
    'first_letter_category',
    'creation_date_category',
    'relative_pageviews_category',
    'num_sitelinks_category',
]

#### Read Data

We process the file by accumulating lists of records that we can later assemble with `pd.DataFrame.from_records`.  We'll fill these with tuples and dictionaries as appropriate.
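As a small sketch of how the two record shapes behave (the values below are illustrative, not taken from the real file): tuples need explicit column names, while dictionaries supply their own and leave `NaN` wherever a key is absent.

```python
import pandas as pd

# Tuples: column names are supplied separately.
pair_recs = [(303, 'Northern America'), (330, 'Southern Europe')]
pairs = pd.DataFrame.from_records(pair_recs, columns=['page_id', 'sub_geo'])

# Dictionaries: keys become columns; a key missing from a record becomes NaN.
dict_recs = [
    {'page_id': 12, 'Northern America': 44, 'Northern Europe': 38},
    {'page_id': 25, 'Northern America': 37},
]
wide = pd.DataFrame.from_records(dict_recs)

print(pairs)
print(wide)
```

The dictionary form is why the source geography frame comes out wide with many missing cells, while the tuple form yields compact long tables.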

In [9]:
qual_recs = []
sub_geo_recs = []
src_geo_recs = []
gender_recs = []
occ_recs = []
att_recs = []
seen_pages = set()

And we're off.

In [10]:
with tqdm(total=page_file_size, desc='compressed input', unit='B', unit_scale=True) as fpb:
    with open(page_path, 'rb') as gzf, gzip.GzipFile(fileobj=gzf, mode='r') as decoded:
        for line in decoded:
            line = json.loads(line)
            page = line['page_id']
            if page in seen_pages:
                continue
            else:
                seen_pages.add(page)
            
            # page quality
            qual_recs.append((page, line['qual_cat']))
            
            # page geography
            for geo in line[SUB_GEO_ATTR]:
                sub_geo_recs.append((page, sys.intern(geo)))
            
            # src geography
            psg = {'page_id': page}
            for g, v in line[SRC_GEO_ATTR].items():
                if g == 'UNK':
                    g = UNKNOWN
                psg[sys.intern(g)] = v
            src_geo_recs.append(psg)
            
            # genders
            for g in line[GENDER_ATTR]:
                gender_recs.append((page, sys.intern(g)))
            
            # occupations
            for occ in line[OCC_ATTR]:
                occ_recs.append((page, sys.intern(occ)))
            
            # other attributes
            att_recs.append(tuple((sys.intern(line[a]) if isinstance(line[a], str) else line[a])
                                  for a in BASIC_ATTRS))
            
            fpb.update(gzf.tell() - fpb.n)  # update the progress bar

compressed input:   0%|          | 0.00/237M [00:00<?, ?B/s]
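The progress logic in the loop above relies on asking the *raw* file handle, not the decompressed stream, how far it has read.  Here is a minimal sketch of that idea without `tqdm`, using a hypothetical in-memory file as a stand-in for the real input:

```python
import gzip
import io
import json

# Hypothetical in-memory stand-in for the compressed page file.
lines = [json.dumps({'page_id': i, 'pad': 'x' * 50}) for i in range(2000)]
compressed = gzip.compress('\n'.join(lines).encode('utf8'))

raw = io.BytesIO(compressed)
positions = []
n_lines = 0
with gzip.GzipFile(fileobj=raw, mode='r') as decoded:
    for line in decoded:
        json.loads(line)
        n_lines += 1
        # raw.tell() counts compressed bytes consumed so far; it advances in
        # buffered jumps, which is still good enough for a progress bar.
        positions.append(raw.tell())

print(n_lines, positions[-1], len(compressed))
```

Because the total is the compressed file size, the bar can be set up from `stat()` alone, without decompressing the file first to count lines.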

#### Reassemble DFs

Now we will assemble these records into data frames.

In [11]:
quality = pd.DataFrame.from_records(qual_recs, columns=['page_id', 'quality'])

In [11]:
sub_geo = pd.DataFrame.from_records(sub_geo_recs, columns=['page_id', 'sub_geo'])
sub_geo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3773443 entries, 0 to 3773442
Data columns (total 2 columns):
 #   Column   Dtype 
---  ------   ----- 
 0   page_id  int64 
 1   sub_geo  object
dtypes: int64(1), object(1)
memory usage: 57.6+ MB


In [12]:
src_geo = pd.DataFrame.from_records(src_geo_recs)
src_geo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6460210 entries, 0 to 6460209
Data columns (total 25 columns):
 #   Column                     Dtype  
---  ------                     -----  
 0   page_id                    int64  
 1   @UNKNOWN                   float64
 2   Northern America           float64
 3   Northern Europe            float64
 4   Western Europe             float64
 5   Central America            float64
 6   Australia and New Zealand  float64
 7   Eastern Asia               float64
 8   Southern Europe            float64
 9   South America              float64
 10  Western Asia               float64
 11  Eastern Europe             float64
 12  Northern Africa            float64
 13  Southern Asia              float64
 14  Polynesia                  float64
 15  South-eastern Asia         float64
 16  Caribbean                  float64
 17  Western Africa             float64
 18  Southern Africa            float64
 19  Middle Africa              float64
 20  Ea

In [13]:
gender = pd.DataFrame.from_records(gender_recs, columns=['page_id', 'gender'])
gender.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1850219 entries, 0 to 1850218
Data columns (total 2 columns):
 #   Column   Dtype 
---  ------   ----- 
 0   page_id  int64 
 1   gender   object
dtypes: int64(1), object(1)
memory usage: 28.2+ MB


In [14]:
occupations = pd.DataFrame.from_records(occ_recs, columns=['page_id', 'occ'])
occupations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2445899 entries, 0 to 2445898
Data columns (total 2 columns):
 #   Column   Dtype 
---  ------   ----- 
 0   page_id  int64 
 1   occ      object
dtypes: int64(1), object(1)
memory usage: 37.3+ MB


In [15]:
cat_attrs = pd.DataFrame.from_records(att_recs, columns=BASIC_ATTRS)
cat_attrs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6460210 entries, 0 to 6460209
Data columns (total 5 columns):
 #   Column                       Dtype 
---  ------                       ----- 
 0   page_id                      int64 
 1   first_letter_category        object
 2   creation_date_category       object
 3   relative_pageviews_category  object
 4   num_sitelinks_category       object
dtypes: int64(1), object(4)
memory usage: 246.4+ MB


In [16]:
all_pages = np.array(list(seen_pages))
all_pages = np.sort(all_pages)
all_pages = pd.Series(all_pages)

In [17]:
del src_geo_recs, sub_geo_recs
del gender_recs, occ_recs
del seen_pages

In [18]:
%reset -f out

Flushing output cache (1 entries)


In [19]:
import gc
gc.collect()

0

## Helper Functions

These functions will help with further computations.

### Normalize Distribution

We are going to compute a number of data frames of alignment vectors, in which each row should be a multinomial distribution (non-negative entries that sum to 1).  This function
normalizes such a frame.

In [20]:
def norm_align_matrix(df):
    # missing alignments count as zero mass
    df = df.fillna(0)
    # divide each row by its total so the row sums to 1
    sums = df.sum(axis='columns')
    return df.div(sums, axis='index')
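A quick sketch of what the normalization does on a toy frame.  The function is copied here so the example is self-contained (written with `axis='index'`, the documented name for the row axis), and the numbers are made up:

```python
import numpy as np
import pandas as pd

def norm_align_matrix(df):
    df = df.fillna(0)
    sums = df.sum(axis='columns')
    return df.div(sums, axis='index')

demo = pd.DataFrame(
    {'a': [2.0, np.nan, 0.0], 'b': [6.0, 5.0, 0.0]},
    index=pd.Index([10, 11, 12], name='page_id'),
)
normed = norm_align_matrix(demo)
print(normed)
```

Note that a row with no mass at all (page 12 here) divides zero by zero and comes out as `NaN`, which is one reason every page is given an explicit unknown column before normalizing.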

## Page Alignments

All of our metrics require page "alignments": the protected-group membership of each page.

### Quality

Quality isn't an alignment, but we're going to save it here:

In [12]:
output.save_table(quality, 'page-quality', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\page-quality.csv.gz
INFO:wptrec.save:data\metric-tables\page-quality.csv.gz: 35.62 MiB
INFO:wptrec.save:saving Parquet to data\metric-tables\page-quality.parquet
INFO:wptrec.save:data\metric-tables\page-quality.parquet: 9.14 MiB


### Page Geography

Let's start with the straight page geography alignment for the public evaluation of the training queries.  We've already loaded it above.

We need to do a little cleanup on this data:

- Align pages with no known geography with '@UNKNOWN' (to sort before known categories)
- Replace Oceania subregions with Oceania

In [21]:
sub_geo.head()

Unnamed: 0,page_id,sub_geo
0,303,Northern America
1,307,Northern America
2,316,Northern America
3,324,Northern America
4,330,Southern Europe


Let's start by turning this into a wide frame:

In [22]:
sub_geo_align = sub_geo.assign(x=1).pivot(index='page_id', columns='sub_geo', values='x')
sub_geo_align.fillna(0, inplace=True)
sub_geo_align.head()

sub_geo,Antarctica,Australia and New Zealand,Caribbean,Central America,Central Asia,Eastern Africa,Eastern Asia,Eastern Europe,Melanesia,Micronesia,...,Northern Europe,Polynesia,South America,South-eastern Asia,Southern Africa,Southern Asia,Southern Europe,Western Africa,Western Asia,Western Europe
page_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
303,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
307,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
316,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
324,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
330,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
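To make the long-to-wide step concrete, here is a toy version with three made-up (page, region) pairs; the pattern is the same `assign(x=1)` followed by `pivot`:

```python
import pandas as pd

long = pd.DataFrame({
    'page_id': [303, 330, 330],
    'sub_geo': ['Northern America', 'Southern Europe', 'Western Europe'],
})

# Each (page, region) pair becomes a 1 in the wide frame; absent pairs are NaN
# until we fill them with 0.
wide = long.assign(x=1).pivot(index='page_id', columns='sub_geo', values='x')
wide = wide.fillna(0)
print(wide)
```

One caveat: `pivot` raises a `ValueError` if the same (page, region) pair appears twice, so this pattern assumes the long table has no duplicate rows.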


Now we need to collapse Oceania into one column.

In [23]:
ocean = sub_geo_align.loc[:, oc_regions].sum(axis='columns')
sub_geo_align = sub_geo_align.drop(columns=oc_regions)
sub_geo_align['Oceania'] = ocean
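The same collapse on a toy frame (made-up pages) shows what happens when a page sits in more than one Oceania subregion:

```python
import pandas as pd

oc_regions = ['Australia and New Zealand', 'Melanesia', 'Micronesia', 'Polynesia']
toy = pd.DataFrame(
    0.0,
    index=pd.Index([1, 2, 3], name='page_id'),
    columns=oc_regions + ['Western Europe'],
)
toy.loc[1, 'Melanesia'] = 1.0
toy.loc[2, 'Polynesia'] = 1.0
toy.loc[2, 'Australia and New Zealand'] = 1.0
toy.loc[3, 'Western Europe'] = 1.0

# Sum the subregion columns, drop them, and attach the total as one column.
ocean = toy.loc[:, oc_regions].sum(axis='columns')
collapsed = toy.drop(columns=oc_regions)
collapsed['Oceania'] = ocean
print(collapsed)
```

Page 2 ends up with an Oceania value of 2 at this stage; the later normalization rescales each row to sum to 1, so the raw total only matters relative to the page's other regions.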

Next we need to add the unknown column and expand the frame to cover all pages.

Sum each row to find the pages with at least one known region, then create a series that flags the pages with none:

In [24]:
sub_geo_sums = sub_geo_align.sum(axis='columns')
sub_geo_unknown = ~(sub_geo_sums > 0)
sub_geo_unknown = sub_geo_unknown.astype('f8')
sub_geo_unknown = sub_geo_unknown.reindex(all_pages, fill_value=1)
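A toy sketch of the unknown flag, assuming a three-page corpus in which page 12 never appeared in the geography records:

```python
import pandas as pd

all_pages = pd.Series([12, 303, 330])   # every page in the (toy) corpus
align = pd.DataFrame(
    {'Northern America': [1.0, 0.0], 'Southern Europe': [0.0, 1.0]},
    index=pd.Index([303, 330], name='page_id'),
)

sums = align.sum(axis='columns')
unknown = (~(sums > 0)).astype('f8')    # 0.0 for pages with a known region
# Pages absent from the alignment frame are new index labels, so they take
# the fill value of 1 (unknown).
unknown = unknown.reindex(all_pages, fill_value=1)
print(unknown)
```

The `fill_value` only applies to labels that `reindex` introduces, which is exactly the set of pages with no geography records.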

Now let's join this with the original frame:

In [25]:
sub_geo_align = sub_geo_unknown.to_frame(UNKNOWN).join(sub_geo_align, how='left')
sub_geo_align = norm_align_matrix(sub_geo_align)
sub_geo_align.head()

Unnamed: 0,@UNKNOWN,Antarctica,Caribbean,Central America,Central Asia,Eastern Africa,Eastern Asia,Eastern Europe,Middle Africa,Northern Africa,...,Northern Europe,South America,South-eastern Asia,Southern Africa,Southern Asia,Southern Europe,Western Africa,Western Asia,Western Europe,Oceania
12,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
39,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
290,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
303,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [66]:
sub_geo_align.sort_index(axis='columns', inplace=True)
sub_geo_align.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6460210 entries, 12 to 70194530
Data columns (total 21 columns):
 #   Column              Dtype  
---  ------              -----  
 0   @UNKNOWN            float64
 1   Antarctica          float64
 2   Caribbean           float64
 3   Central America     float64
 4   Central Asia        float64
 5   Eastern Africa      float64
 6   Eastern Asia        float64
 7   Eastern Europe      float64
 8   Middle Africa       float64
 9   Northern Africa     float64
 10  Northern America    float64
 11  Northern Europe     float64
 12  Oceania             float64
 13  South America       float64
 14  South-eastern Asia  float64
 15  Southern Africa     float64
 16  Southern Asia       float64
 17  Southern Europe     float64
 18  Western Africa      float64
 19  Western Asia        float64
 20  Western Europe      float64
dtypes: float64(21)
memory usage: 1.3 GB


And convert this to an xarray for multidimensional usage:

In [27]:
sub_geo_xr = xr.DataArray(sub_geo_align, dims=['page', 'sub_geo'])
sub_geo_xr
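For readers less familiar with xarray, here is a minimal sketch of the conversion on a made-up two-page frame.  The frame's index and columns become the coordinates, and `dims` names the two axes:

```python
import pandas as pd
import xarray as xr

frame = pd.DataFrame(
    {'@UNKNOWN': [1.0, 0.0], 'Oceania': [0.0, 1.0]},
    index=pd.Index([12, 303], name='page_id'),
)

arr = xr.DataArray(frame, dims=['page', 'sub_geo'])
# Label-based selection works along the named dimensions.
row = arr.sel(page=303)
cell = float(arr.sel(page=303, sub_geo='Oceania'))
print(arr.dims, cell)
```

Naming the dimensions is what lets several alignment arrays be combined along distinct axes in later computations.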

In [28]:
binarysize(sub_geo_xr.nbytes)

'1.90 GiB'

In [67]:
output.save_table(sub_geo_align, 'page-sub-geo-align', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\page-sub-geo-align.csv.gz
INFO:wptrec.save:data\metric-tables\page-sub-geo-align.csv.gz: 23.97 MiB
INFO:wptrec.save:saving Parquet to data\metric-tables\page-sub-geo-align.parquet
INFO:wptrec.save:data\metric-tables\page-sub-geo-align.parquet: 13.10 MiB


### Page Source Geography

We now need to do a similar setup for page source geography, which already comes to us in wide form as per-region counts that we will normalize into a multinomial distribution.

In [30]:
src_geo.head()

Unnamed: 0,page_id,@UNKNOWN,Northern America,Northern Europe,Western Europe,Central America,Australia and New Zealand,Eastern Asia,Southern Europe,South America,...,South-eastern Asia,Caribbean,Western Africa,Southern Africa,Middle Africa,Eastern Africa,Central Asia,Antarctica,Melanesia,Micronesia
0,12,52.0,44.0,38.0,,,,,,,...,,,,,,,,,,
1,25,160.0,37.0,16.0,2.0,,,,,,...,,,,,,,,,,
2,39,25.0,24.0,6.0,5.0,,,,,,...,,,,,,,,,,
3,290,15.0,15.0,1.0,1.0,,,,,,...,,,,,,,,,,
4,303,27.0,199.0,6.0,4.0,2.0,,,,,...,,,,,,,,,,


Set up the index:

In [31]:
src_geo.set_index('page_id', inplace=True)

Expand to all pages, then put 1 in the unknown column for every page with no source counts:

In [32]:
src_geo_align = src_geo.reindex(all_pages, fill_value=0)
src_geo_align.loc[src_geo_align.sum('columns') == 0, UNKNOWN] = 1
src_geo_align

Unnamed: 0,@UNKNOWN,Northern America,Northern Europe,Western Europe,Central America,Australia and New Zealand,Eastern Asia,Southern Europe,South America,Western Asia,...,South-eastern Asia,Caribbean,Western Africa,Southern Africa,Middle Africa,Eastern Africa,Central Asia,Antarctica,Melanesia,Micronesia
12,52.0,44.0,38.0,,,,,,,,...,,,,,,,,,,
25,160.0,37.0,16.0,2.0,,,,,,,...,,,,,,,,,,
39,25.0,24.0,6.0,5.0,,,,,,,...,,,,,,,,,,
290,15.0,15.0,1.0,1.0,,,,,,,...,,,,,,,,,,
303,27.0,199.0,6.0,4.0,2.0,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70194419,1.0,,,,,,,,,,...,,,,,,,,,,
70194480,1.0,,,,,,,,,,...,,,,,,,,,,
70194481,1.0,7.0,,,,,,,,,...,,,,,,,,,,
70194489,2.0,,,,,1.0,,,,,...,,,,,,,,,,
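A toy sketch of this step (made-up counts, with page 39 assumed to be missing from the source records) shows which cells `fill_value` touches and which it leaves alone:

```python
import numpy as np
import pandas as pd

UNKNOWN = '@UNKNOWN'
all_pages = pd.Series([12, 25, 39])
counts = pd.DataFrame(
    {UNKNOWN: [52.0, np.nan], 'Northern America': [44.0, 37.0]},
    index=pd.Index([12, 25], name='page_id'),
)

# New rows (page 39) are filled with 0; NaNs already in the frame are kept.
expanded = counts.reindex(all_pages, fill_value=0)
# Pages whose counts sum to zero have no known sources at all.
expanded.loc[expanded.sum(axis='columns') == 0, UNKNOWN] = 1
print(expanded)
```

This matches the output above, where existing pages still show `NaN` for regions they never cite; those cells are zeroed later by the normalization function.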


Collapse Oceania:

In [33]:
ocean = src_geo_align.loc[:, oc_regions].sum(axis='columns')
src_geo_align = src_geo_align.drop(columns=oc_regions)
src_geo_align['Oceania'] = ocean

And normalize.

In [34]:
src_geo_align = norm_align_matrix(src_geo_align)

In [65]:
src_geo_align.sort_index(axis='columns', inplace=True)
src_geo_align.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6460210 entries, 12 to 70194530
Data columns (total 21 columns):
 #   Column              Dtype  
---  ------              -----  
 0   @UNKNOWN            float64
 1   Antarctica          float64
 2   Caribbean           float64
 3   Central America     float64
 4   Central Asia        float64
 5   Eastern Africa      float64
 6   Eastern Asia        float64
 7   Eastern Europe      float64
 8   Middle Africa       float64
 9   Northern Africa     float64
 10  Northern America    float64
 11  Northern Europe     float64
 12  Oceania             float64
 13  South America       float64
 14  South-eastern Asia  float64
 15  Southern Africa     float64
 16  Southern Asia       float64
 17  Southern Europe     float64
 18  Western Africa      float64
 19  Western Asia        float64
 20  Western Europe      float64
dtypes: float64(21)
memory usage: 1.1 GB


And convert to an xarray:

In [35]:
src_geo_xr = xr.DataArray(src_geo_align, dims=['page', 'src_geo'])
src_geo_xr

And save:

In [68]:
output.save_table(src_geo_align, 'page-src-geo-align', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\page-src-geo-align.csv.gz
INFO:wptrec.save:data\metric-tables\page-src-geo-align.csv.gz: 43.17 MiB
INFO:wptrec.save:saving Parquet to data\metric-tables\page-src-geo-align.parquet
INFO:wptrec.save:data\metric-tables\page-src-geo-align.parquet: 28.10 MiB


### Gender

Now let's work on extracting gender - this is going to work a lot like page geography.

In [37]:
gender.head()

Unnamed: 0,page_id,gender
0,307,male
1,308,male
2,339,female
3,340,male
4,344,male


And summarize:

In [38]:
gender['gender'].value_counts()

male                        1495445
female                       353301
transgender female              636
non-binary                      329
transgender male                197
intersex                         94
eunuch                           70
genderfluid                      29
genderqueer                      27
cisgender female                 18
two-spiriit                      11
travesti                         10
transgender person               10
cisgender male                    7
agender                           6
transmasculine                    6
neutral sex                       5
transfeminine                     4
bigender                          4
third gender                      2
demiboy                           2
fa'afafine                        2
neutrois                          1
assigned female at birth          1
māhū                              1
hijra                             1
Name: gender, dtype: int64

Now, we're going to do a little more work to reduce the dimensionality of the space.  Points:

1. Trans men are men
2. Trans women are women
3. Cis/trans status is an adjective that can be dropped for the present purposes

The result is that we will collapse "transgender female" and "cisgender female" into "female", and likewise "transgender male" and "cisgender male" into "male".

The **downside** to this is that trans men are probably significantly under-represented, but are now being collapsed into the dominant group.

In [39]:
pgcol = gender['gender']
pgcol = pgcol.str.replace(r'(?:tran|ci)sgender\s+((?:fe)?male)', r'\1', regex=True)
pgcol.value_counts()

male                        1495649
female                       353955
non-binary                      329
intersex                         94
eunuch                           70
genderfluid                      29
genderqueer                      27
two-spiriit                      11
transgender person               10
travesti                         10
agender                           6
transmasculine                    6
neutral sex                       5
transfeminine                     4
bigender                          4
third gender                      2
demiboy                           2
fa'afafine                        2
māhū                              1
hijra                             1
neutrois                          1
assigned female at birth          1
Name: gender, dtype: int64
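To see exactly which labels the pattern rewrites, here is the same regular expression applied to a handful of labels from the table above:

```python
import pandas as pd

labels = pd.Series([
    'transgender female', 'cisgender male', 'female',
    'transgender person', 'non-binary',
])

# The capture group keeps only 'female' or 'male'; labels that do not match
# the whole pattern (e.g. 'transgender person') pass through unchanged.
collapsed = labels.str.replace(
    r'(?:tran|ci)sgender\s+((?:fe)?male)', r'\1', regex=True)
print(collapsed.tolist())
```

Labels such as "transgender person" are left for the next step to handle, since the pattern requires "male" or "female" after the prefix.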

Now, we're going to group the remaining gender identities together under the label 'NB'.  As noted above, this is a debatable exercise that collapses a lot of identity.

In [40]:
gender_labels = [UNKNOWN, 'female', 'male', 'NB']
pgcol[~pgcol.isin(gender_labels)] = 'NB'
pgcol.value_counts()

male      1495649
female     353955
NB            615
Name: gender, dtype: int64

Now put this column back in the frame and deduplicate.

In [41]:
page_gender = gender.assign(gender=pgcol)
page_gender = page_gender.drop_duplicates()
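A toy sketch of why the deduplication matters, using a made-up page that carries two identities which both fall under 'NB':

```python
import pandas as pd

UNKNOWN = '@UNKNOWN'
gender_labels = [UNKNOWN, 'female', 'male', 'NB']
toy = pd.DataFrame({
    'page_id': [1, 1, 2],
    'gender': ['genderfluid', 'non-binary', 'male'],
})

col = toy['gender'].copy()
col[~col.isin(gender_labels)] = 'NB'
recoded = toy.assign(gender=col)        # page 1 now has two identical rows
deduped = recoded.drop_duplicates()

# With duplicates removed, each (page, gender) pair is unique and can pivot.
wide = deduped.assign(x=1).pivot(index='page_id', columns='gender', values='x')
print(len(recoded), len(deduped))
```

Without `drop_duplicates`, the later pivot would fail on the repeated (page, 'NB') pair.  This is probably also why fewer pages end up aligned with 'NB' in the final matrix than there are 'NB' rows in the count above.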

In [42]:
del pgcol

Now we need to add unknown genders.

In [43]:
kg_mask = all_pages.isin(page_gender['page_id'])
unknown = all_pages[~kg_mask]
page_gender = pd.concat([
    page_gender,
    pd.DataFrame({'page_id': unknown, 'gender': UNKNOWN})
], ignore_index=True)
page_gender

Unnamed: 0,page_id,gender
0,307,male
1,308,male
2,339,female
3,340,male
4,344,male
...,...,...
6460607,70194419,@UNKNOWN
6460608,70194480,@UNKNOWN
6460609,70194481,@UNKNOWN
6460610,70194489,@UNKNOWN
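The same pattern on a three-page toy corpus, where page 12 is assumed to have no gender record:

```python
import pandas as pd

UNKNOWN = '@UNKNOWN'
all_pages = pd.Series([12, 307, 339])
known = pd.DataFrame({'page_id': [307, 339], 'gender': ['male', 'female']})

# Pages not present in the known table get one row with the unknown label.
mask = all_pages.isin(known['page_id'])
missing = all_pages[~mask]
full = pd.concat([
    known,
    pd.DataFrame({'page_id': missing, 'gender': UNKNOWN}),
], ignore_index=True)
print(full)
```

After this step every page has at least one row, so the pivot that follows produces a matrix covering the whole corpus.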


And make an alignment matrix:

In [44]:
gender_align = page_gender.reset_index().assign(x=1).pivot(index='page_id', columns='gender', values='x')
gender_align.fillna(0, inplace=True)
gender_align = gender_align.reindex(columns=gender_labels)
gender_align.head()

gender,@UNKNOWN,female,male,NB
page_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
12,1.0,0.0,0.0,0.0
25,1.0,0.0,0.0,0.0
39,1.0,0.0,0.0,0.0
290,1.0,0.0,0.0,0.0
303,1.0,0.0,0.0,0.0


Let's see how frequent each of the genders is:

In [45]:
gender_align.sum(axis=0).sort_values(ascending=False)

gender
@UNKNOWN    4610461.0
male        1495647.0
female       353933.0
NB              571.0
dtype: float64

And convert to an xarray:

In [46]:
gender_xr = xr.DataArray(gender_align, dims=['page', 'gender'])
gender_xr

In [47]:
binarysize(gender_xr.nbytes)

'206.73 MiB'

In [48]:
output.save_table(gender_align, 'page-gender-align', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\page-gender-align.csv.gz
INFO:wptrec.save:data\metric-tables\page-gender-align.csv.gz: 18.80 MiB
INFO:wptrec.save:saving Parquet to data\metric-tables\page-gender-align.parquet
INFO:wptrec.save:data\metric-tables\page-gender-align.parquet: 9.33 MiB


### Occupation

Occupation works like gender, but without the need for processing.

Convert to a matrix:

In [49]:
occ_align = occupations.assign(x=1).pivot(index='page_id', columns='occ', values='x')
occ_align.head()

occ,activist,agricultural worker,artist,athlete,biologist,businessperson,chemist,civil servant,clergyperson,computer scientist,...,military personnel,musician,performing artist,physicist,politician,scientist,social scientist,sportsperson (non-athlete),transportation occupation,writer
page_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
307,,1.0,,,,,,,,,...,1.0,,,,1.0,,,,,1.0
308,,,,,1.0,,,,,,...,,,,1.0,,1.0,,,,1.0
339,,,,,,,,,,,...,,,,,,,,,,1.0
340,,,,,,1.0,,,,,...,,,,,,,,,,
344,,,1.0,,,1.0,,,,,...,,,,,,,,,,1.0


Set up unknown and merge:

In [50]:
occ_unk = pd.Series(1.0, index=all_pages)
occ_unk.index.name = 'page_id'
occ_kmask = all_pages.isin(occ_align.index)
occ_kmask.index = all_pages
occ_unk[occ_kmask] = 0
occ_align = occ_unk.to_frame(UNKNOWN).join(occ_align, how='left')
occ_align = norm_align_matrix(occ_align)
occ_align.head()

Unnamed: 0_level_0,@UNKNOWN,activist,agricultural worker,artist,athlete,biologist,businessperson,chemist,civil servant,clergyperson,...,military personnel,musician,performing artist,physicist,politician,scientist,social scientist,sportsperson (non-athlete),transportation occupation,writer
page_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
12,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
39,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
290,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
303,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
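Unlike the gender matrix, a page can carry several occupations, so normalization spreads its mass across them.  Here is a toy sketch with made-up pages and a simplified version of the unknown-column logic:

```python
import pandas as pd

UNKNOWN = '@UNKNOWN'
all_pages = pd.Index([12, 307, 339], name='page_id')
occ = pd.DataFrame({
    'page_id': [307, 307, 339],
    'occ': ['politician', 'writer', 'writer'],
})
occ_wide = occ.assign(x=1).pivot(index='page_id', columns='occ', values='x')

# Start with every page unknown, then clear the flag for pages with a record.
unk = pd.Series(1.0, index=all_pages)
unk[all_pages.isin(occ_wide.index)] = 0
merged = unk.to_frame(UNKNOWN).join(occ_wide, how='left').fillna(0)
shares = merged.div(merged.sum(axis='columns'), axis='index')
print(shares)
```

Page 307 gets 0.5 for each of its two occupations, while page 12 puts all of its mass on the unknown column.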


In [51]:
occ_xr = xr.DataArray(occ_align, dims=['page', 'occ'])
occ_xr

And save:

In [52]:
output.save_table(occ_align, 'page-occ-align', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\page-occ-align.csv.gz
INFO:wptrec.save:data\metric-tables\page-occ-align.csv.gz: 26.18 MiB
INFO:wptrec.save:saving Parquet to data\metric-tables\page-occ-align.parquet
INFO:wptrec.save:data\metric-tables\page-occ-align.parquet: 12.66 MiB


### Other Attributes

The other attributes don't require as much re-processing - they can be used as-is as categorical variables.  Let's save!

In [53]:
pages = cat_attrs.set_index('page_id')
pages

Unnamed: 0_level_0,first_letter_category,creation_date_category,relative_pageviews_category,num_sitelinks_category
page_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
12,a-d,2001-2006,High,5+ languages
25,a-d,2001-2006,High,5+ languages
39,a-d,2001-2006,High,5+ languages
290,a-d,2001-2006,High,5+ languages
303,a-d,2001-2006,High,5+ languages
...,...,...,...,...
70194419,l-r,2017-2022,Low,2-4 languages
70194480,a-d,2017-2022,Low,English only
70194481,a-d,2017-2022,Low,English only
70194489,l-r,2017-2022,Low,2-4 languages


Now each of these needs to become another table.  The `get_dummies` function is our friend.

In [54]:
alpha_align = pd.get_dummies(pages['first_letter_category'])
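For reference, here is what `get_dummies` produces on a toy column (the page-to-category assignments are made up):

```python
import pandas as pd

pages = pd.DataFrame(
    {'first_letter_category': ['a-d', 'l-r', 'a-d']},
    index=pd.Index([12, 25, 39], name='page_id'),
)

# One indicator column per category; each row has exactly one nonzero entry.
alpha = pd.get_dummies(pages['first_letter_category'])
# Optional cast: the indicator dtype differs between pandas versions.
alpha = alpha.astype('f8')
print(alpha)
```

Because every page has exactly one value for these attributes, the rows already sum to 1 and no normalization or unknown column is needed.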

In [55]:
output.save_table(alpha_align, 'page-alpha-align', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\page-alpha-align.csv.gz
INFO:wptrec.save:data\metric-tables\page-alpha-align.csv.gz: 19.47 MiB
INFO:wptrec.save:saving Parquet to data\metric-tables\page-alpha-align.parquet
INFO:wptrec.save:data\metric-tables\page-alpha-align.parquet: 10.52 MiB


In [56]:
alpha_xr = xr.DataArray(alpha_align, dims=['page', 'alpha'])

In [57]:
age_align = pd.get_dummies(pages['creation_date_category'])
output.save_table(age_align, 'page-age-align', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\page-age-align.csv.gz
INFO:wptrec.save:data\metric-tables\page-age-align.csv.gz: 17.29 MiB
INFO:wptrec.save:saving Parquet to data\metric-tables\page-age-align.parquet
INFO:wptrec.save:data\metric-tables\page-age-align.parquet: 7.53 MiB


In [58]:
age_xr = xr.DataArray(age_align, dims=['page', 'age'])

In [59]:
pop_align = pd.get_dummies(pages['relative_pageviews_category'])
output.save_table(pop_align, 'page-pop-align', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\page-pop-align.csv.gz
INFO:wptrec.save:data\metric-tables\page-pop-align.csv.gz: 18.69 MiB
INFO:wptrec.save:saving Parquet to data\metric-tables\page-pop-align.parquet
INFO:wptrec.save:data\metric-tables\page-pop-align.parquet: 9.52 MiB


In [60]:
pop_xr = xr.DataArray(pop_align, dims=['page', 'pop'])

In [61]:
langs_align = pd.get_dummies(pages['num_sitelinks_category'])
output.save_table(langs_align, 'page-langs-align', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\page-langs-align.csv.gz
INFO:wptrec.save:data\metric-tables\page-langs-align.csv.gz: 18.64 MiB
INFO:wptrec.save:saving Parquet to data\metric-tables\page-langs-align.parquet
INFO:wptrec.save:data\metric-tables\page-langs-align.parquet: 9.80 MiB


In [62]:
langs_xr = xr.DataArray(langs_align, dims=['page', 'langs'])

## Working with Alignments

At this point, we have computed an alignment matrix for each of our attributes, and saved the page quality table.

We will use the data saved from this in separate notebooks to compute targets and alignments for tasks.