# 1Lib1Ref Equity Analysis
The goal of this analysis is to understand the impact of the 1Lib1Ref campaign -- a large-scale editing campaign -- from the perspective of content equity. In particular, I examine the distribution of the Wikipedia articles that were improved in terms of gender equity and geographic equity. In both cases, I ask whether the 1Lib1Ref campaign maintained the status quo (generally biased towards articles about men and about the United States or language-relevant countries) or led to contributions to a more balanced distribution of articles.

I look at 1Lib1Ref specifically for a few reasons:
* The edit task is not directly related to equity (almost all articles would benefit from references)
* It has good data -- the edits are hashtagged and it is large-scale enough for stratifying the data

How do editors choose their tasks:
* Unlike e.g., SuggestedEdits or Newcomer Tasks where the algorithm for serving tasks to users is clear, #1Lib1Ref is a public campaign and so participants likely use a variety of tools / processes to select their articles to edit. It's not easy (impossible?) to tie each edit to what might have influenced the article choice but a quick analysis of the [programs](https://outreachdashboard.wmflabs.org/campaigns/1lib1ref_may_2021/programs) through which participants may have joined suggests:
  * [African Librarians Week](https://outreachdashboard.wmflabs.org/courses/AfLIA/AfLIA_1Lib1Ref_African_Librarians_Week_2021_(May_-_June_2021)) added a lot of references related to Africa and English was its home wiki. We should expect a shift in geographic impact on English Wikipedia from US/UK -> Africa.
  * [Australia and Aotearoa New Zealand](https://outreachdashboard.wmflabs.org/courses/Wikimedia_Australia_and_Wikimedia_Aotearoa_New_Zealand/1Lib1Ref_Australia_and_Aotearoa_New_Zealand_2021) had English Wikipedia as a home wiki. We should expect a shift in geographic impact on English Wikipedia from US/UK -> Australia / New Zealand.
  * [BiblioAssNat](https://outreachdashboard.wmflabs.org/courses/Biblioth%C3%A8que_de_l'Assembl%C3%A9e_nationale_du_Qu%C3%A9bec/1Bib1Ref_BiblioAssNat) has a home wiki of French Wikipedia and potentially shifted the geographic impact on French Wikipedia towards Canada.
  * [1Lib1Ref in fa.wiki](https://outreachdashboard.wmflabs.org/courses/Iranian_User_group/1Lib1Ref_in_fa.wiki_(2021)) had an article list but it's less clear how this would have impacted gender/geography.
  * There were a number of other programs that likely shifted French Wikipedia towards France and perhaps had other, smaller geographic impacts. None of the campaigns/tools seem to have had a stated goal around gender, but a focus on the gender gap could very easily have been incorporated.

Summary of results:
* Overview:
  * While the impact of the programs on gender is less clear, they had a notable impact on the geographic distribution of content. Sometimes this reinforced existing trends on the wiki and sometimes it expanded the representation of content. Clearly, organized campaigns are a powerful tool, but their impact on content equity depends heavily on their focus.
* Gender:
  * Edits to English and French Wikipedia were more balanced in terms of gender representation than baseline but Persian Wikipedia was even more skewed towards men in terms of impact.
* Geography:
  * For English Wikipedia this was a shift towards Oceania and Africa and away from the US and UK. For French Wikipedia, France and Canada saw the largest boosts (which further biased content towards France).
  * For Persian Wikipedia, even more edits than expected went to content about Iran, which was already the most prominent region.
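One way to read "more balanced than baseline" is as a ratio between a group's share of campaign edits and its share of all biographies on the wiki. The sketch below works this through with the `pct_edits` and `pct_baseline` values from the enwiki gender table further down in this notebook. The helper name `representation_ratio` is hypothetical and is not part of the original analysis.

```python
# Shares copied from the enwiki gender table in this notebook:
# pct_edits = share of campaign edits to biographies,
# pct_baseline = share of all enwiki biographies with a gender value.
enwiki_gender = {
    "male": {"pct_edits": 0.678, "pct_baseline": 0.808},
    "female": {"pct_edits": 0.322, "pct_baseline": 0.191},
}


def representation_ratio(pct_edits, pct_baseline):
    """Hypothetical helper: > 1 means more campaign edits than the baseline share would predict."""
    return pct_edits / pct_baseline


ratios = {
    gender: round(representation_ratio(**shares), 2)
    for gender, shares in enwiki_gender.items()
}
print(ratios)
```

A ratio above 1 for women and below 1 for men is what "more balanced than baseline" means here. Biographies of men still received the majority of the edits.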

## Setup

In [1]:
from collections import defaultdict
import os
import re

import pandas as pd

import wmfdata

In [2]:
spark = wmfdata.spark.get_session(app_name='pyspark regular; 1Lib1Ref eval',
                                  type='yarn-large', # local, yarn-regular, yarn-large
                                  )  

PySpark executors will use /usr/lib/anaconda-wmf/bin/python3.


## Parameters

In [3]:
print("Mediawiki partitions:")
spark.sql("SHOW PARTITIONS wmf_raw.mediawiki_project_namespace_map").show(50, False)

print("\nWikidata partitions:")
spark.sql("SHOW PARTITIONS wmf.wikidata_item_page_link").show(50, False)

Mediawiki partitions:
+------------------------+
|partition               |
+------------------------+
|snapshot=2016-12_private|
|snapshot=2017-07_private|
|snapshot=2021-04        |
|snapshot=2021-05        |
|snapshot=2021-06        |
|snapshot=2021-07        |
|snapshot=2021-08        |
|snapshot=2021-09        |
+------------------------+


Wikidata partitions:
+-------------------+
|partition          |
+-------------------+
|snapshot=2021-08-30|
|snapshot=2021-09-06|
|snapshot=2021-09-13|
|snapshot=2021-09-20|
|snapshot=2021-09-27|
|snapshot=2021-10-04|
+-------------------+



In [4]:
# Important parameters
mediawiki_snapshot = '2021-08'
# Substantial edits per: https://outreachdashboard.wmflabs.org/campaigns/1lib1ref_may_2021/articles_csv.csv
wiki_dbs = ('frwiki','enwiki','zhwiki')
wikidata_snapshot = '2021-09-06'
edit_subset_tablename = 'isaacj.oneliboneref'
start_date = '2021-05-15'  # reduce amount of data so pipeline doesn't fail
end_date = '2021-06-06'
gen_table = 'isaacj.gender_wikidata'
geo_table = 'isaacj.qid_to_country_2021_08_02'


In [5]:
print("Gender data example:")
spark.sql(f'SELECT * FROM {gen_table} LIMIT 5').show(50, False)

print("\nGeography data example:")
spark.sql(f'SELECT * FROM {geo_table} LIMIT 5').show(50, False)

Gender data example:
+----------+--------+
|item_id   |gender  |
+----------+--------+
|Q100066   |Q6581072|
|Q100137735|Q6581097|
|Q100146397|Q6581097|
|Q100152085|Q6581097|
|Q100166821|Q6581097|
+----------+--------+


Geography data example:
+---------+--------+-------+
|qid      |property|country|
+---------+--------+-------+
|Q7068891 |P625    |Tuvalu |
|Q34967204|P625    |Tuvalu |
|Q11762732|P625    |Tuvalu |
|Q15256609|P625    |Tuvalu |
|Q3394461 |P625    |Tuvalu |
+---------+--------+-------+



In [6]:
def qid_to_gender_category(qid):
    """Map individual Wikidata gender values to a few broader categories so that long-tail values are more likely to be represented."""
    # male, male organism, eunuch, cisgender male
    if qid in ('Q6581097', 'Q44148', 'Q179294', 'Q15145778'):
        return 'male'
    # female, female organism, cisgender female
    elif qid in ('Q6581072', 'Q43445', 'Q15145779'):
        return 'female'
    # transgender male, transmasculine
    elif qid in ('Q2449503', 'Q27679766'):
        return 'transgender male'
    # transgender female, transfeminine
    elif qid in ('Q1052281', 'Q27679684'):
        return 'transgender female'
    # contains identities like non-binary, transgender person, two-spirit, genderfluid, etc.
    # See for more details: https://www.wikidata.org/wiki/Property_talk:P21
    else:
        return 'non-binary'
    
spark.udf.register('gender_lbl', qid_to_gender_category, 'String')

<function __main__.qid_to_gender_category(qid)>
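The mapping can be sanity-checked in plain Python before it is used as a Spark UDF. The following is a table-driven sketch of the same logic, not the registered function itself. The names `GENDER_CATEGORIES`, `QID_TO_CATEGORY` and `gender_category` are made up for illustration, while the QID lists are copied from the cell above.

```python
# QID lists copied from qid_to_gender_category above.
GENDER_CATEGORIES = {
    'male': ('Q6581097', 'Q44148', 'Q179294', 'Q15145778'),
    'female': ('Q6581072', 'Q43445', 'Q15145779'),
    'transgender male': ('Q2449503', 'Q27679766'),
    'transgender female': ('Q1052281', 'Q27679684'),
}

# Invert to a flat QID -> category lookup.
QID_TO_CATEGORY = {
    qid: category
    for category, qids in GENDER_CATEGORIES.items()
    for qid in qids
}


def gender_category(qid):
    # Any value not listed above falls through to 'non-binary',
    # which matches the else-branch of the original function.
    return QID_TO_CATEGORY.get(qid, 'non-binary')


print(gender_category('Q6581072'))
print(gender_category('Q6581097'))
print(gender_category('Q48270'))  # a QID that is in none of the lists
```

Because the fallback is a catch-all, any unexpected or malformed value is also labelled `non-binary`. This is why the queries filter on `gender IS NOT NULL` before calling the UDF.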

## 1Lib1Ref Edits
Gather data on 1Lib1Ref edits as identified by the `#1Lib1Ref` hashtag.
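The hashtag filter used in the query below is `LOWER(event_comment) LIKE '%#1lib1ref%'`, i.e. a case-insensitive substring match on the edit summary. A minimal plain-Python equivalent is sketched here. The function name `has_campaign_hashtag` and the example edit summaries are invented for illustration.

```python
def has_campaign_hashtag(comment, hashtag='#1lib1ref'):
    """Case-insensitive substring match on an edit summary."""
    if comment is None:
        # A NULL comment never satisfies the LIKE condition in SQL either.
        return False
    return hashtag in comment.lower()


# Invented edit summaries, for illustration only.
examples = [
    'Added citation #1Lib1Ref',
    'fix ref #1LIB1REF',
    'typo',
    None,
    'campaign wrap-up #1lib1ref2021',
]
for comment in examples:
    print(repr(comment), has_campaign_hashtag(comment))
```

Since this is a substring match, longer hashtags that merely start with `#1lib1ref` are also counted, as the last example shows.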

In [8]:
create_table_query = f"""
    CREATE TABLE IF NOT EXISTS {edit_subset_tablename} (
        wiki_db                         STRING        COMMENT 'Wiki -- e.g., enwiki for English Wikipedia',
        page_id                         BIGINT        COMMENT 'Page ID on the wiki',
        page_title                      STRING        COMMENT 'Page title',
        qid                             STRING        COMMENT 'Wikidata item ID',
        user_id                         BIGINT        COMMENT 'User ID; -1 if anonymous',
        user_text                       STRING        COMMENT 'User text; IP address if anonymous',
        revision_id                     BIGINT        COMMENT 'Revision ID',
        parent_rev_id                   BIGINT        COMMENT 'Parent revision ID',
        revision_timestamp              TIMESTAMP     COMMENT 'Revision timestamp (UTC)',
        revision_is_identity_reverted   BOOLEAN       COMMENT 'Was this revision reverted via identity revert?',
        gender                          STRING        COMMENT 'Gender value if exists and is human (as QID)',
        regions                         ARRAY<STRING> COMMENT 'List of regions associated with items'
    )
"""

print(create_table_query)
spark.sql(create_table_query)


    CREATE TABLE IF NOT EXISTS isaacj.oneliboneref (
        wiki_db                         STRING        COMMENT 'Wiki -- e.g., enwiki for English Wikipedia',
        page_id                         BIGINT        COMMENT 'Page ID on the wiki',
        page_title                      STRING        COMMENT 'Page title',
        qid                             STRING        COMMENT 'Wikidata item ID',
        user_id                         BIGINT        COMMENT 'User ID; -1 if anonymous',
        user_text                       STRING        COMMENT 'User text; IP address if anonymous',
        revision_id                     BIGINT        COMMENT 'Revision ID',
        parent_rev_id                   BIGINT        COMMENT 'Parent revision ID',
        revision_timestamp              TIMESTAMP     COMMENT 'Revision timestamp (UTC)',
        revision_is_identity_reverted   BOOLEAN       COMMENT 'Was this revision reverted via identity revert?',
        gender                        

DataFrame[]

In [12]:
# Populate table with all edits tagged with #1Lib1Ref
#
# CTE explanations:
# * edits: gather all #1lib1ref edits from mediawiki history snapshot
# * pid_to_qid: build map of wiki+page ID -> Wikidata ID for joining in equity components
# * regions: build map of Wikidata ID -> any associated regions (this data has been precomputed)
# * edits_with_qid: join in Wikidata IDs to edit data
# * edits_with_equity_facets: join in region + gender data
# * INSERT OVERWRITE: fill in table
#
# NOTE: the date bounds are exclusive, so edits made on start_date or end_date themselves are not included

print_for_hive = False
do_execute = True

query = f"""
WITH edits AS (
  SELECT
    wiki_db,
    page_id,
    page_title,
    event_user_id AS user_id,
    REPLACE(event_user_text, ' ', '_') AS user_text,
    revision_id,
    revision_parent_id,
    CAST(event_timestamp AS TIMESTAMP) as revision_timestamp,
    revision_is_identity_reverted
  FROM wmf.mediawiki_history
  WHERE
    snapshot = '{mediawiki_snapshot}'
    AND page_namespace = 0
    AND NOT SIZE(event_user_is_bot_by) > 0
    AND event_type = 'create'
    AND event_entity = 'revision'
    AND CAST(event_timestamp AS DATE) > '{start_date}'
    AND CAST(event_timestamp AS DATE) < '{end_date}'
    AND LOWER(event_comment) LIKE '%#1lib1ref%'
),
pid_to_qid AS (
    SELECT
      wiki_db,
      page_id,
      item_id
    FROM wmf.wikidata_item_page_link
    WHERE
      snapshot = '{wikidata_snapshot}'
      AND page_namespace = 0
),
regions AS (
  SELECT
    qid AS qid,
    COLLECT_SET(country) AS regions
  FROM {geo_table} g
  INNER JOIN pid_to_qid p
    ON (g.qid = p.item_id)
  GROUP BY
    qid
),
edits_with_qid AS (
  SELECT
    e.*,
    pq.item_id AS qid
  FROM edits e
  LEFT JOIN pid_to_qid pq
    ON (e.wiki_db = pq.wiki_db
        AND e.page_id = pq.page_id)
),
edits_with_equity_facets AS (
  SELECT
    e.*,
    g.gender AS gender,
    r.regions AS regions
  FROM edits_with_qid e
  LEFT JOIN {gen_table} g
    ON (e.qid = g.item_id)
  LEFT JOIN regions r
    ON (e.qid = r.qid)
)
INSERT OVERWRITE TABLE {edit_subset_tablename}
  SELECT
    wiki_db,
    page_id,
    page_title,
    qid,
    user_id,
    user_text,
    revision_id,
    revision_parent_id,
    revision_timestamp,
    revision_is_identity_reverted,
    gender,
    regions
  FROM edits_with_equity_facets
"""

if print_for_hive:
    print(re.sub(' +', ' ', re.sub('\n', ' ', query)).strip())
else:
    print(query)

if do_execute:
    result = spark.sql(query)


WITH edits AS (
  SELECT
    wiki_db,
    page_id,
    page_title,
    event_user_id AS user_id,
    REPLACE(event_user_text, ' ', '_') AS user_text,
    revision_id,
    revision_parent_id,
    CAST(event_timestamp AS TIMESTAMP) as revision_timestamp,
    revision_is_identity_reverted
  FROM wmf.mediawiki_history
  WHERE
    snapshot = '2021-08'
    AND page_namespace = 0
    AND NOT SIZE(event_user_is_bot_by) > 0
    AND event_type = 'create'
    AND event_entity = 'revision'
    AND CAST(event_timestamp AS DATE) > '2021-05-15'
    AND CAST(event_timestamp AS DATE) < '2021-06-06'
    AND LOWER(event_comment) LIKE '%#1lib1ref%'
),
pid_to_qid AS (
    SELECT
      wiki_db,
      page_id,
      item_id
    FROM wmf.wikidata_item_page_link
    WHERE
      snapshot = '2021-09-06'
      AND page_namespace = 0
),
regions AS (
  SELECT
    qid AS qid,
    COLLECT_SET(country) AS regions
  FROM isaacj.qid_to_country_2021_08_02 g
  INNER JOIN pid_to_qid p
    ON (g.qid = p.item_id)
  GROUP BY

### Descriptive stats

In [13]:
# basic summary stats for edits for all wikis
# NOTE: num_rows should equal num_edits but occasionally a row is duplicated when joining in gender info
print_for_hive = False
do_execute = True

query = f"""
SELECT
  wiki_db,
  COUNT(1) AS num_rows,
  COUNT(DISTINCT(revision_id)) AS num_edits,
  COUNT(DISTINCT(page_id)) AS num_pages,
  COUNT(DISTINCT(user_id)) AS num_users,
  SUM(IF(gender IS NOT NULL, 1, 0)) AS edits_to_bios,
  SUM(IF(regions IS NOT NULL, 1, 0)) AS edits_to_geos,
  SUM(IF(revision_is_identity_reverted, 1, 0)) / COUNT(DISTINCT(revision_id)) AS pct_reverted
FROM {edit_subset_tablename}
GROUP BY
  wiki_db
ORDER BY
  num_edits DESC
"""

if print_for_hive:
    print(re.sub(' +', ' ', re.sub('\n', ' ', query)).strip())
else:
    print(query)

if do_execute:
    spark.sql(query).show(50, False)


SELECT
  wiki_db,
  COUNT(1) AS num_rows,
  COUNT(DISTINCT(revision_id)) AS num_edits,
  COUNT(DISTINCT(page_id)) AS num_pages,
  COUNT(DISTINCT(user_id)) AS num_users,
  SUM(IF(gender IS NOT NULL, 1, 0)) AS edits_to_bios,
  SUM(IF(regions IS NOT NULL, 1, 0)) AS edits_to_geos,
  SUM(IF(revision_is_identity_reverted, 1, 0)) / COUNT(DISTINCT(revision_id)) AS pct_reverted
FROM isaacj.oneliboneref
GROUP BY
  wiki_db
ORDER BY
  num_edits DESC

+------------+--------+---------+---------+---------+-------------+-------------+--------------------+
|wiki_db     |num_rows|num_edits|num_pages|num_users|edits_to_bios|edits_to_geos|pct_reverted        |
+------------+--------+---------+---------+---------+-------------+-------------+--------------------+
|enwiki      |7573    |7573     |2733     |152      |2181         |5292         |0.07579558959461244 |
|frwiki      |1399    |1394     |630      |50       |798          |1091         |0.007173601147776184|
|ukwiki      |880     |880      |673     
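The gap between `num_rows` and `num_edits` for frwiki (1399 vs. 1394) matches the NOTE in the cell above about rows being duplicated by the gender join. The sketch below assumes the cause is an item carrying more than one gender value in the gender table, which the notebook does not confirm. It reproduces the effect with invented rows and shows why `COUNT(DISTINCT(revision_id))` is the safer edit count.

```python
from collections import defaultdict

# Invented data: two edits, and a gender table in which Q1 has two values.
edits = [
    {'revision_id': 101, 'qid': 'Q1'},
    {'revision_id': 102, 'qid': 'Q2'},
]
gender_rows = [('Q1', 'Q6581097'), ('Q1', 'Q6581072'), ('Q2', 'Q6581072')]

genders_by_qid = defaultdict(list)
for qid, gender in gender_rows:
    genders_by_qid[qid].append(gender)

# Emulate a LEFT JOIN: one output row per matching gender row,
# or a single row with gender=None when nothing matches.
joined = []
for edit in edits:
    for gender in genders_by_qid.get(edit['qid']) or [None]:
        joined.append({**edit, 'gender': gender})

num_rows = len(joined)
num_edits = len({row['revision_id'] for row in joined})
print(num_rows, num_edits)
```

Any metric built on `COUNT(1)` or `SUM(...)` over rows, such as `edits_to_bios`, inherits the duplicated rows. Counting distinct `revision_id` values does not.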

In [14]:
wikis_to_analyze = ['enwiki', 'frwiki', 'fawiki']

In [15]:
# gender equity for edits by language
#
# CTEs:
# * qids: gather all QIDs with a Wikipedia article in a given language
# * baseline: get baseline gender distribution for the language by joining qids with gender data (articles without gender ignored)
# * baseline_pct: convert counts into percentages
# * individual_edits: get campaign edit data with gender info
# * edit_counts: edit counts by gender category
# * SELECT: join together edit gender data and baseline gender data
# 

print_for_hive = False
do_execute = True

for wikidb in wikis_to_analyze:
    qids_cte = f"""
    with qids AS (
        SELECT
          item_id
        FROM wmf.wikidata_item_page_link
        WHERE
          snapshot = '{wikidata_snapshot}'
          AND wiki_db = '{wikidb}'
          AND page_namespace = 0
    ),
    """
    if wikidb == 'wikidatawiki':
        qids_cte = f"""
        WITH wikipedia_projects AS (
            SELECT DISTINCT
              dbname
            FROM wmf_raw.mediawiki_project_namespace_map
            WHERE
              snapshot = '{mediawiki_snapshot}'
              AND hostname LIKE '%wikipedia%'
        ),
        qids AS (
            SELECT DISTINCT
              item_id
            FROM wmf.wikidata_item_page_link wd
            INNER JOIN wikipedia_projects wp
              ON (wd.wiki_db = wp.dbname)
            WHERE
              snapshot = '{wikidata_snapshot}'
              AND page_namespace = 0
        ),
        """
        
    print(f"\n== Analyzing {wikidb} ==")
    query = f"""
    {qids_cte}
    baseline AS (
        SELECT
          gender_lbl(gender) AS gender_cat,
          COUNT(1) AS num_bios
        FROM {gen_table} g
        INNER JOIN qids q
          ON (g.item_id = q.item_id)
        WHERE
          gender IS NOT NULL
        GROUP BY
          gender_cat
    ),
    baseline_pct AS (
        SELECT
          gender_cat,
          num_bios / (SUM(num_bios) OVER ()) AS pct_bios
        FROM baseline
    ),
    individual_edits AS (
        SELECT
          page_id,
          user_id,
          gender_lbl(gender) AS gender_cat
        FROM {edit_subset_tablename}
        WHERE
          wiki_db = '{wikidb}'
          AND gender IS NOT NULL
    ),
    edit_counts AS (
        SELECT
          gender_cat,
          COUNT(DISTINCT(user_id)) AS num_users,
          COUNT(DISTINCT(page_id)) AS num_pages,
          COUNT(1) AS num_edits
        FROM individual_edits
        GROUP BY
          gender_cat
    )
    SELECT
      b.gender_cat AS gender,
      COALESCE(num_users, 0) AS num_users,
      COALESCE(num_pages, 0) AS num_pages,
      COALESCE(num_edits, 0) AS num_edits,
      ROUND(num_edits / SUM(num_edits) OVER (), 3) AS pct_edits,
      ROUND(num_pages / SUM(num_pages) OVER (), 3) AS pct_pages,
      ROUND(pct_bios, 3) AS pct_baseline
    FROM baseline_pct b
    LEFT JOIN edit_counts i
      ON (b.gender_cat = i.gender_cat)
    ORDER BY
      num_edits DESC
    """

    if do_execute:
        result = spark.sql(query)
        result.show(500, False)


== Analyzing enwiki ==
+------------------+---------+---------+---------+---------+---------+------------+
|gender            |num_users|num_pages|num_edits|pct_edits|pct_pages|pct_baseline|
+------------------+---------+---------+---------+---------+---------+------------+
|male              |70       |577      |1478     |0.678    |0.589    |0.808       |
|female            |48       |403      |703      |0.322    |0.411    |0.191       |
|non-binary        |0        |0        |0        |null     |null     |0.0         |
|transgender male  |0        |0        |0        |null     |null     |0.0         |
|transgender female|0        |0        |0        |null     |null     |0.0         |
+------------------+---------+---------+---------+---------+---------+------------+


== Analyzing frwiki ==
+------------------+---------+---------+---------+---------+---------+------------+
|gender            |num_users|num_pages|num_edits|pct_edits|pct_pages|pct_baseline|
+------------------+-------
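The geographic breakdown in the next cell uses `EXPLODE(regions)`, so an edit to an article linked to several countries is counted once per country, and the resulting shares are normalized over these exploded rows rather than over edits. The sketch below illustrates that effect in plain Python. The three edits are invented, and the country names are only labels.

```python
from collections import Counter

# Invented edits; the second article is associated with two countries.
edits = [
    {'page_id': 1, 'regions': ['Nigeria']},
    {'page_id': 2, 'regions': ['Nigeria', 'Ghana']},
    {'page_id': 3, 'regions': ['Kenya']},
]

# Equivalent of EXPLODE(regions): one (page_id, country) row per country.
exploded = [(e['page_id'], country) for e in edits for country in e['regions']]

num_edits = Counter(country for _, country in exploded)
total_rows = sum(num_edits.values())

# Share normalized over exploded rows (what pct_edits reports).
pct_edits = {c: n / total_rows for c, n in num_edits.items()}
# Share of edits that touch the country at all (not reported by the query).
pct_of_edits_touching = {c: n / len(edits) for c, n in num_edits.items()}

print(pct_edits)
print(pct_of_edits_touching)
```

In this toy case Nigeria gets half of the normalized share even though two of the three edits touch it. This is the same gap the comments in the next cell describe for the baseline percentages.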

In [17]:
# geographic breakdown of edits by wiki
# 
# CTEs:
# * qids: gather all QIDs with a Wikipedia article in a given language
# * baseline: get count of articles for each country on wiki (articles with no countries ignored)
# * baseline_pct: convert counts into percentages
#    * NOTE: a single article can have many associated countries but the percentages are still normalized to add to 100%
#    *  so if e.g., 20% of geographic articles are attributed to UK here, it's maybe 30% of articles actually associated with the UK
# * individual_edits: get edit data and countries
#    * If an article that was edited is associated with e.g., 3 countries, it'll show up 3 times, once with each country
# * edit_counts: stratify counts by geography
# * SELECT: join together edit geo data and baseline geo data
# 
# NOTE: geographic baseline varies greatly by language though United States is usually top-3
# 
# NOTE: if desired, could use regions, continents, global north/south as geographic aggregations too
#   * Would just need to join against isaacj.country_to_region

print_for_hive = False
do_execute = True

for wikidb in wikis_to_analyze:
    print(f"\n== Analyzing {wikidb} ==")
    query = f"""
    with qids AS (
        SELECT
          item_id
        FROM wmf.wikidata_item_page_link
        WHERE
          snapshot = '{wikidata_snapshot}'
          AND wiki_db = '{wikidb}'
          AND page_namespace = 0
    ),
    baseline AS (
        SELECT
          country,
          COUNT(DISTINCT(qid)) AS num_articles
        FROM {geo_table} g
        INNER JOIN qids q
          ON (g.qid = q.item_id)
        GROUP BY
          country
    ),
    baseline_pct AS (
        SELECT
          country,
          num_articles / (SUM(num_articles) OVER ()) AS pct_articles
        FROM baseline
    ),
    individual_edits AS (
        SELECT
          page_id,
          user_id,
          EXPLODE(regions) AS country
        FROM {edit_subset_tablename}
        WHERE
          wiki_db = '{wikidb}'
          AND regions IS NOT NULL
          AND SIZE(regions) > 0
    ),
    edit_counts AS (
        SELECT
          country,
          COUNT(DISTINCT(user_id)) AS num_users,
          COUNT(DISTINCT(page_id)) AS num_pages,
          COUNT(1) AS num_edits
        FROM individual_edits
        GROUP BY
          country
    )
    SELECT
      b.country AS country,
      COALESCE(num_users, 0) AS num_users,
      COALESCE(num_pages, 0) AS num_pages,
      COALESCE(num_edits, 0) AS num_edits,
      ROUND(num_edits / SUM(num_edits) OVER (), 3) AS pct_edits,
      ROUND(num_pages / SUM(num_pages) OVER (), 3) AS pct_pages,
      ROUND(pct_articles, 3) AS pct_baseline
    FROM baseline_pct b
    LEFT JOIN edit_counts i
      ON (b.country = i.country)
    WHERE
      (pct_articles >= 0.01 OR num_edits > 25)
    ORDER BY
      num_edits DESC
    """

    if do_execute:
        result = spark.sql(query)
        result.show(500, False)


== Analyzing enwiki ==
+--------------------------------+---------+---------+---------+---------+---------+------------+
|country                         |num_users|num_pages|num_edits|pct_edits|pct_pages|pct_baseline|
+--------------------------------+---------+---------+---------+---------+---------+------------+
|Nigeria                         |20       |186      |1784     |0.333    |0.089    |0.003       |
|New Zealand                     |8        |725      |869      |0.162    |0.348    |0.009       |
|Ghana                           |8        |266      |399      |0.075    |0.128    |0.002       |
|Kenya                           |7        |133      |334      |0.062    |0.064    |0.002       |
|Australia                       |26       |187      |309      |0.058    |0.09     |0.031       |
|Zambia                          |11       |56       |260      |0.049    |0.027    |0.0         |
|United States of America        |41       |141      |229      |0.043    |0.068    |0.272     