## Section ID Data

Section ID data was not limited to top-level (H2) sections during data capture, so section ID click data required post-capture processing. Post-processing included scraping section data from HTML pages. Because scraping occurred after click capture, some data may have been lost (for example, when a section ID changed or a WPM page went missing in the interim). Only WPM pages were scraped, on 2019-04-23 PDT, so no comparison to W pages is possible.

Example of the capture issue: https://en.wikipedia.org/wiki/Hepatitis#Signs_and_symptoms.
Clicks on links under the "Acute hepatitis" subsection were captured with section_id Acute_hepatitis, not with the enclosing H2 section ID Signs_and_symptoms.

Post-capture data augmentation: H2 sections were extracted from the public HTML of WPM pages on 2019-04-23.

See [section-extraction.ipynb](section-extraction.ipynb) for extraction details.
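
To make the mapping idea concrete, here is a minimal sketch of how a subsection ID could be mapped to its enclosing H2 section from page HTML. It is not the code in section-extraction.ipynb. The class name `SectionMapper` is hypothetical, and the sketch assumes that each heading wraps a `<span class="mw-headline" id="...">` element, which may not hold for every page.

```python
from html.parser import HTMLParser


class SectionMapper(HTMLParser):
    """Map every heading anchor ID to the ID of its enclosing H2 section.

    Hypothetical helper; assumes headings contain <span class="mw-headline" id="...">.
    """

    HEADINGS = {"h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.open_heading = None   # heading tag currently open, if any
        self.current_h2 = None     # ID of the most recent H2 section
        self.section_to_h2 = {}    # section_id -> section_h2

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.open_heading = tag
        elif tag == "span" and self.open_heading:
            attrs = dict(attrs)
            classes = (attrs.get("class") or "").split()
            if "mw-headline" in classes and attrs.get("id"):
                if self.open_heading == "h2":
                    self.current_h2 = attrs["id"]
                # headings that appear before the first H2 are skipped
                if self.current_h2 is not None:
                    self.section_to_h2[attrs["id"]] = self.current_h2

    def handle_endtag(self, tag):
        if tag == self.open_heading:
            self.open_heading = None


# toy HTML modelled on the Hepatitis example above
sample_html = """
<h2><span class="mw-headline" id="Signs_and_symptoms">Signs and symptoms</span></h2>
<h3><span class="mw-headline" id="Acute_hepatitis">Acute hepatitis</span></h3>
<h2><span class="mw-headline" id="Causes">Causes</span></h2>
"""

mapper = SectionMapper()
mapper.feed(sample_html)
section_to_h2 = mapper.section_to_h2
```

In this toy input, `Acute_hepatitis` resolves to `Signs_and_symptoms`, which is the attribution the captured click data lacked.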


In [1]:
# basic setup
# use PySpark YARN kernel
import pyspark
import re
import pyspark.sql
from pyspark.sql import *
import pandas as pd
import matplotlib.pyplot as plt
import hashlib
import os.path
from pyspark.sql.functions import desc
from datetime import timedelta, date
from IPython.display import Markdown, display

%matplotlib inline
spark_hive = pyspark.sql.HiveContext(sc)

In [2]:
## basic data defaults (copied from pageload-event.ipynb)

# set date ranges for all queries
start_date = date(2019, 3, 29)
end_date = date(2019, 4, 22)
date_format = '%Y-%m-%d'
start_date_string = start_date.strftime(date_format)
end_date_string = end_date.strftime(date_format)

# for iterating over the range of study dates (used in daily count of events queries)
def daterange(start_date, end_date):
    for n in range((end_date - start_date).days + 1):
        yield start_date + timedelta(n)

# convenience method for converting dates to 'YYYY-MM-DD%' for SQL queries
def date_to_dt(date):
    return str(date.year) + '-' + '{0:02d}'.format(date.month) + '-' + '{0:02d}'.format(date.day) + '%'

## common exclusion SQL
#
# exclude event data that either:
# - has page or revision ID of zero (pages not yet created as per bmansurov https://phabricator.wikimedia.org/T213969#4998281)
# - is 'extClick' but is an internal link improperly coded as external as per bmansurov https://phabricator.wikimedia.org/T213969#5003710
event_exclusion_sql = """
AND (citationusage.page_id = 0 OR citationusage.revision_id = 0) = FALSE
AND (citationusage.action = 'extClick' AND 
    (citationusage.link_url LIKE 'https://en.wikipedia.org%' 
    OR citationusage.link_url LIKE 'https://en.m.wikipedia.org%')) = FALSE
"""
# exclude pageload data that:
# - has page or revision ID of zero (pages not yet created as per bmansurov https://phabricator.wikimedia.org/T213969#4998281)
pageload_exclusion_sql = """
AND (citationusagepageload.page_id = 0 OR citationusagepageload.revision_id = 0) = FALSE
"""
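
As a quick check of the date helpers defined above, the sketch below redefines them in a standalone form and applies them to the study range. `date_to_dt` is written here with `strftime`, which is assumed to be equivalent to the string concatenation used in the notebook for these dates; the variable names `study_days` and `daily_patterns` are illustrative only.

```python
from datetime import date, timedelta


def daterange(start_date, end_date):
    """Yield every date from start_date through end_date, inclusive."""
    for n in range((end_date - start_date).days + 1):
        yield start_date + timedelta(n)


def date_to_dt(d):
    """Format a date as 'YYYY-MM-DD%' for use in a SQL LIKE pattern."""
    return d.strftime('%Y-%m-%d') + '%'


# the study range used throughout this notebook
study_days = list(daterange(date(2019, 3, 29), date(2019, 4, 22)))
daily_patterns = [date_to_dt(d) for d in study_days]
```

Both endpoints are included, so the range covers 25 days, and each pattern ends in `%` so that it can match any time of day in a timestamp string.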

#### load anonymized data from parquet files extracted with [DatasetAnonymization.ipynb](DatasetAnonymization.ipynb)

In [3]:
parquetFilePageloads = spark.read.parquet("/user/ryanmax/anonymous_pageloads_april.parquet")
parquetFileCitationusage = spark.read.parquet("/user/ryanmax/anonymous_citationusage_april.parquet")
parquetFilePageloads.createOrReplaceTempView("citationusagepageload")
parquetFileCitationusage.createOrReplaceTempView("citationusage")

In [4]:
# Total count of events (by all event types) for each top-level (H2) section ID for WP:M pages only
pm_section_events_query = """
SELECT wpm_sections.section_h2, action, count(*) count
FROM 
    citationusage
    LEFT JOIN ryanmax.wpm_sections 
        ON 
        wpm_sections.page_id = citationusage.page_id 
        AND wpm_sections.section_id = citationusage.section_id
WHERE
    wiki = 'enwiki'
    AND citationusage.page_id IN (
                            SELECT DISTINCT page_id 
                            FROM ryanmax.projmed_with_extlinks 
                            WHERE to_date(dt) >= '{}' AND to_date(dt) <= '{}'
                        )
    {}
    AND to_date(citationusage.event_time) >= '{}'
    AND to_date(citationusage.event_time) <= '{}'
    AND useragent_is_bot = FALSE
    AND session_id in (
        SELECT session_id
        FROM citationusagepageload
        WHERE wiki = 'enwiki'
        {}
        AND to_date(event_time) >= '{}'
        AND to_date(event_time) <= '{}'
        AND useragent_is_bot = FALSE
        )
GROUP BY wpm_sections.section_h2, action
ORDER BY count desc
"""

pm_section_events = spark.sql(
    pm_section_events_query.format(
        start_date_string, end_date_string,
        event_exclusion_sql, start_date_string, end_date_string,
        pageload_exclusion_sql, start_date_string, end_date_string,
    ))
pm_section_events_rdd = pm_section_events.rdd
pm_section_events_df = sqlContext.createDataFrame(pm_section_events_rdd)
pm_section_events_pandas = pm_section_events_df.toPandas()
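
The LEFT JOIN in the query above keeps every captured event, including events whose section ID has no match in `wpm_sections`. The following sketch reproduces that behavior on toy data with an in-memory SQLite database, as a stand-in for Spark SQL. The table and column names follow the query; the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE citationusage (page_id INTEGER, section_id TEXT, action TEXT);
CREATE TABLE wpm_sections (page_id INTEGER, section_id TEXT, section_h2 TEXT);

-- toy mapping: a subsection and its H2 both map to the same H2 ID
INSERT INTO wpm_sections VALUES
    (1, 'Signs_and_symptoms', 'Signs_and_symptoms'),
    (1, 'Acute_hepatitis',    'Signs_and_symptoms');

-- toy events: two mappable, one with an unknown section ID, one with no section ID
INSERT INTO citationusage VALUES
    (1, 'Acute_hepatitis',    'fnClick'),
    (1, 'Signs_and_symptoms', 'fnClick'),
    (1, 'Renamed_section',    'fnClick'),
    (1, NULL,                 'fnHover');
""")

rows = conn.execute("""
SELECT wpm_sections.section_h2, action, count(*) AS n
FROM citationusage
LEFT JOIN wpm_sections
    ON wpm_sections.page_id = citationusage.page_id
    AND wpm_sections.section_id = citationusage.section_id
GROUP BY wpm_sections.section_h2, action
ORDER BY n DESC, action
""").fetchall()
conn.close()
```

Both the event with an unmatched section ID and the event with no section ID end up with a NULL `section_h2`, so they are counted together in the same group.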


In [5]:
section_pda = pm_section_events_pandas.copy()
# replace missing (NaN) section_h2 values with the '-- missing --' label
section_pda['section_h2'] = section_pda['section_h2'].fillna('-- missing --')
# limit to counts of 1K or more
df_filtered = section_pda.query('count >= 1000').copy()
# set precision before pivot
df_filtered['count'] = df_filtered['count'].map(lambda x: '{0:.0f}'.format(x))
display(Markdown("**Table I**: Total count of events (by all event types) for each top-level (H2) section ID for WP:M pages only. Limited to >= 1000 events."))
display(Markdown('_Numbers may not exactly match "raw" data (table II) because some section IDs could not be mapped to H2 section IDs (changed section ID, missing WPM page, etc.)_'))
display(Markdown('***missing*** values are largely because section IDs were not recorded "if the section is the Main Section" as per Schema:CitationUsage.'))
df_filtered.pivot(index='section_h2', columns='action', values='count')

**Table I**: Total count of events (by all event types) for each top-level (H2) section ID for WP:M pages only. Limited to >= 1000 events.

_Numbers may not exactly match "raw" data (table II) because some section IDs could not be mapped to H2 section IDs (changed section ID, missing WPM page, etc.)_

***missing*** values are largely because section IDs were not recorded "if the section is the Main Section" as per Schema:CitationUsage.

| section_h2 | extClick | fnClick | fnHover | upClick |
|---|---|---|---|---|
| -- missing -- | 37057 | 93880 | 133710 | |
| Adverse_effects | | 2109 | 3785 | |
| Background | | | 1211 | |
| Cause | | 2374 | 5726 | |
| Causes | | 5255 | 10052 | |
| Chemistry | | | 1100 | |
| Classification | | | 2108 | |
| Diagnosis | | 4042 | 9029 | |
| Effects | | | 1181 | |
| Epidemiology | | 2549 | 5189 | |
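
The empty cells in Table I follow from filtering before pivoting: any section/action pair below the threshold is dropped, and the pivot then leaves that cell as NaN. A minimal sketch on toy data (all values invented) illustrates this.

```python
import pandas as pd

# toy long-format counts, shaped like the query result
counts = pd.DataFrame({
    "section_h2": ["Causes", "Causes", None, "Background"],
    "action": ["fnClick", "fnHover", "extClick", "fnHover"],
    "count": [1500, 3000, 2000, 500],
})

# label events that have no H2 section
counts["section_h2"] = counts["section_h2"].fillna("-- missing --")

# keep only section/action pairs with 1000 or more events
kept = counts[counts["count"] >= 1000]

# one row per section, one column per action; absent pairs become NaN
table = kept.pivot(index="section_h2", columns="action", values="count")
```

Here `Background` disappears entirely because its only count is below the threshold, while `Causes` keeps an empty `extClick` cell.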


In [6]:
# "raw" section data from captured events to show extent of the capture issue described above
pm_section_events_raw_query = """
SELECT section_id, action, count(*) count
FROM 
    citationusage 
WHERE page_id IN (
                        SELECT DISTINCT page_id 
                        FROM ryanmax.projmed_with_extlinks
                        WHERE to_date(dt) >= '{}' AND to_date(dt) <= '{}'
                        )
    AND wiki = 'enwiki'
    {}
    AND to_date(event_time) >= '{}'
    AND to_date(event_time) <= '{}'
    AND useragent_is_bot = FALSE
    AND session_id in (
        SELECT session_id
        FROM citationusagepageload
        WHERE wiki = 'enwiki'
        {}
        AND to_date(event_time) >= '{}'
        AND to_date(event_time) <= '{}'
        AND useragent_is_bot = FALSE
        )
GROUP BY section_id, action
ORDER BY count desc
LIMIT 100
"""

pm_section_events_raw = spark.sql(
    pm_section_events_raw_query.format(
        start_date_string, end_date_string,
        event_exclusion_sql, start_date_string, end_date_string,
        pageload_exclusion_sql, start_date_string, end_date_string,
    ))
pm_section_events_raw_rdd = pm_section_events_raw.rdd
pm_section_events_raw_df = sqlContext.createDataFrame(pm_section_events_raw_rdd)
pm_section_events_raw_pandas = pm_section_events_raw_df.toPandas()
#pm_section_events_raw_pandas
#pm_section_events_pandas.pivot(index='section_id', columns='action', values='count')
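
Both queries fill eight positional `{}` placeholders, so the argument order in `.format()` must match the template exactly. As one possible alternative, not what this notebook does, named placeholders make that mapping explicit. The shortened template and the placeholder names below are hypothetical.

```python
from datetime import date

date_format = '%Y-%m-%d'
start_date_string = date(2019, 3, 29).strftime(date_format)
end_date_string = date(2019, 4, 22).strftime(date_format)

# shortened stand-in for the exclusion SQL defined earlier
event_exclusion_sql = """
AND (citationusage.page_id = 0 OR citationusage.revision_id = 0) = FALSE
"""

# hypothetical, shortened template with named placeholders
query_template = """
SELECT section_id, action, count(*) count
FROM citationusage
WHERE wiki = 'enwiki'
    {event_exclusions}
    AND to_date(event_time) >= '{start}'
    AND to_date(event_time) <= '{end}'
GROUP BY section_id, action
"""

# each named value can be reused wherever the template needs it
query = query_template.format(
    start=start_date_string,
    end=end_date_string,
    event_exclusions=event_exclusion_sql,
)
```

With named fields, the same date strings can appear several times in a template without repeating them in the argument list.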

In [7]:
section_pda_raw = pm_section_events_raw_pandas.copy()
# replace missing (NaN) section_id values with the '-- missing --' label
section_pda_raw['section_id'] = section_pda_raw['section_id'].fillna('-- missing --')
# limit to counts of 1K or more
df_filtered_raw = section_pda_raw.query('count >= 1000').copy()
# set precision before pivot
df_filtered_raw['count'] = df_filtered_raw['count'].map(lambda x: '{0:.0f}'.format(x))
display(Markdown("**Table II**: Raw total count of events (by all event types) for each section ID for WP:M pages only. Limited to >= 1000 events."))
df_filtered_raw.pivot(index='section_id', columns='action', values='count')

**Table II**: Raw total count of events (by all event types) for each section ID for WP:M pages only. Limited to >= 1000 events.

| section_id | extClick | fnClick | fnHover | upClick |
|---|---|---|---|---|
| -- missing -- | 36996 | 93425 | 133043 | |
| Adverse_effects | | 1157 | 1679 | |
| Cause | | | 2076 | |
| Causes | | 2875 | 4731 | |
| Classification | | | 1711 | |
| Diagnosis | | 1870 | 3742 | |
| Epidemiology | | 2084 | 4027 | |
| External_links | 29687 | | | |
| Further_reading | 3302 | | | |
| Genetics | | | 1835 | |
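
The extent of the capture issue can be estimated by subtracting the raw counts from the H2-mapped counts for sections that appear in both tables. The sketch below does this for the fnHover column, using only the rows shown in Table I and Table II. Reading each difference as "events captured under a subsection ID and later attributed to the H2 section" is an assumption, and for the `-- missing --` row the difference is assumed to reflect section IDs that could not be mapped.

```python
# fnHover counts copied from the rows shown in Table I (mapped to H2)
mapped_fn_hover = {
    "-- missing --": 133710,
    "Adverse_effects": 3785,
    "Cause": 5726,
    "Causes": 10052,
    "Classification": 2108,
    "Diagnosis": 9029,
    "Epidemiology": 5189,
}

# fnHover counts copied from the rows shown in Table II (raw section IDs)
raw_fn_hover = {
    "-- missing --": 133043,
    "Adverse_effects": 1679,
    "Cause": 2076,
    "Causes": 4731,
    "Classification": 1711,
    "Diagnosis": 3742,
    "Epidemiology": 4027,
}

# events gained by each section once subsection IDs are mapped to H2 sections
reassigned = {
    section: mapped_fn_hover[section] - raw_fn_hover[section]
    for section in mapped_fn_hover
    if section in raw_fn_hover
}
```

For several of these sections the mapped count is more than double the raw count, which suggests that a large share of events was originally recorded under subsection IDs.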
