## Section ID Data

Section ID data was not limited to top-level (H2) sections during event data capture, so events were associated with sections at any level of the heading hierarchy.

Example of the capture issue: https://en.wikipedia.org/wiki/Hepatitis#Signs_and_symptoms.
Clicks on links under "Acute hepatitis" were captured with section_id `Acute_hepatitis`, not `Signs_and_symptoms`.

To rectify this, we extract all section headings from the XML dump files (2019-04-01 and 2019-04-20) and map each section ID in the event data to its parent H2 section heading.

Although this dump-based approach has the advantage of allowing a comparison between W and WP:M pages, it is also problematic: section IDs as recorded in event data may not match section IDs derived from the dumps, for instance when a dump heading still contains unexpanded templates, anchors, or `<ref>` tags. Examples:

 - 2722104 {{Flagdeco|Finland}}_Finland    {{Flagdeco|Finland}}_Finland
 - 2722905 {{Anchor|MASTER-ELECTION}}Terminology   Replication_models_in_distributed_systems_=
 - 2778573 {{anchor|Health}}Healthcare     {{anchor|Health}}Healthcare
 - 2781683 Toppinen's_idols{{citation_needed|date=January_2018}}   Toppinen's_idols{{citation_needed|date=January_2018}}
 - 2803532 {{flag|Serbia}} {{flag|Serbia}}
 - 32611733        Track_listing<ref>{{cite_web|url=http://www.heistorhitrecords.com/category/blog/catalogue/sd-double/_|title=Archived_copy_|accessdate=2011-08-02_|deadurl=yes_|archiveurl=https://web.archive.org/web/20110711161819/http://www.heistorhitrecords.com/category/blog/catalogue/sd-double/_|archivedate=2011-07-11_|df=_}}</ref>        Track_listing<ref>{{cite_web|url=http://www.heistorhitrecords.com/category/blog/catalogue/sd-double/_|title=Archived_copy_|accessdate=2011-08-02_|deadurl=yes_|archiveurl=https://web.archive.org/web/20110711161819/http://www.heistorhitrecords.com/category/blog/catalogue/sd-double/_|archivedate=2011-07-11_|df=_}}</ref>
 - 49038574        General_information<ref>{{cite_web|title=桜の馬場　城彩苑|url=http://www.sakuranobaba-johsaien.jp/english/|website=桜の馬場　城彩苑|accessdate=8_January_2016}}</ref>      General_information<ref>{{cite_web|title=桜の馬場　城彩苑|url=http://www.sakuranobaba-johsaien.jp/english/|website=桜の馬場　城彩苑|accessdate=8_January_2016}}</ref>
 - 49102570        {{wp|_Hungary_}}        {{wp|_Hungary_}}
 - 49108440        {{wpw|Hungary}} {{wpw|Hungary}}
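
The mismatches listed above mostly come from wiki markup (templates, anchors, `<ref>` tags) that survives in the raw dump headings. One possible mitigation, which is not applied in this notebook, is to strip that markup before comparing IDs. The sketch below is an illustration only: `normalize_heading` and the three patterns are hypothetical names, and the approach assumes that deleting templates is acceptable. Headings that consist solely of a template (such as `{{flag|Serbia}}`) collapse to an empty string, so they would still need separate handling.

```python
import re

# Hypothetical helper patterns; not part of the original analysis.
SELF_CLOSING_REF = re.compile(r'<ref[^>]*/>')
PAIRED_REF = re.compile(r'<ref[^>]*>.*?</ref>', re.DOTALL)
INNERMOST_TEMPLATE = re.compile(r'\{\{[^{}]*\}\}')

def normalize_heading(heading):
    """Strip <ref> tags and {{templates}} from a dump-derived section ID."""
    heading = SELF_CLOSING_REF.sub('', heading)
    heading = PAIRED_REF.sub('', heading)
    # remove templates from the inside out so nested templates are handled
    previous = None
    while previous != heading:
        previous = heading
        heading = INNERMOST_TEMPLATE.sub('', heading)
    # drop spaces/underscores left dangling at either end
    return heading.strip(' _')

cleaned = [
    normalize_heading('{{anchor|Health}}Healthcare'),
    normalize_heading("Toppinen's_idols{{citation_needed|date=January_2018}}"),
    normalize_heading('{{flag|Serbia}}'),
]
```

Here `cleaned` holds `'Healthcare'`, `"Toppinen's_idols"`, and `''`. The empty result for the flag template shows the limit of plain stripping: the rendered page would show text there, so matching such headings would need actual template expansion.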
 
#### use section_ids-anonymizedData.ipynb instead


In [1]:
# basic defaults, including study dates, common SQL exclusions and parquet files for anonymized data
%run -i 'data-defaults.py'

In [2]:
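import re
from pyspark.sql import Row  # may already be in scope via data-defaults.py

WIKIPEDIA_XML_DUMPS = 'enwiki-201904*-pages-articles-multistream.xml.bz2'

# group 1: run of '=' marking the heading level; group 2: heading text
# NOTE: the greedy (.+) keeps part of the closing '=' run for H3 and deeper
# headings, which is why IDs such as 'Classical_=' appear in the output
SECTION_REGEX = re.compile(r'(={2,})(.+)={2,}')

def extract_sections(entity):
    """Return one Row per heading on a page, tagged with its enclosing H2."""
    page_text = entity.revision.text._VALUE
    sections = SECTION_REGEX.findall(page_text)
    rows = []
    h2 = ''  # most recent H2 heading; empty until the first H2 is seen
    for section in sections:
        heading_level = len(section[0])
        # replace spaces with underscores to match section IDs in event data
        heading = section[1].strip().replace(' ', '_')
        if heading_level == 2:
            h2 = heading
        rows.append(Row(page_id=entity.id, section_h2=h2, section_id=heading))
    return rows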

wikipedia = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='page').load(WIKIPEDIA_XML_DUMPS)

articles = wikipedia\
    .filter("ns = '0'")\
    .filter("redirect._title is null") \
    .filter("revision.text._VALUE is not null") \
    .filter("length(revision.text._VALUE) > 0")

# flatMap over the per-page lists directly; pages without headings contribute no rows
sections = sqlContext.createDataFrame(articles.rdd.flatMap(extract_sections))

In [3]:
sections.show()

+-------+--------------------+--------------------+
|page_id|          section_h2|          section_id|
+-------+--------------------+--------------------+
|     12|Etymology,_termin...|Etymology,_termin...|
|     12|             History|             History|
|     12|             History|Prehistoric_and_a...|
|     12|             History|Classical_anarchi...|
|     12|             History|Post-World_War_II...|
|     12|Anarchist_schools...|Anarchist_schools...|
|     12|Anarchist_schools...|         Classical_=|
|     12|Anarchist_schools...|        Mutualism_==|
|     12|Anarchist_schools...|Collectivist_anar...|
|     12|Anarchist_schools...|Anarcho-communism_==|
|     12|Anarchist_schools...|Anarcho-syndicali...|
|     12|Anarchist_schools...|Individualist_ana...|
|     12|Anarchist_schools...|    Post-classical_=|
|     12|Anarchist_schools...| Anarcha-feminism_==|
|     12|Internal_issues_a...|Internal_issues_a...|
|     12|  Topics_of_interest|  Topics_of_interest|
|     12|  T
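
The trailing `=` characters in IDs such as `Classical_=` and `Mutualism_==` above appear to come from the greedy `(.+)` in `SECTION_REGEX`: it consumes as much of the closing run of `=` as it can, leaving only two for the final `={2,}`. The sketch below reproduces that behavior on a small sample and contrasts it with a stricter pattern that uses a backreference so the closing run must match the opening one. `STRICT_REGEX` and `headings` are hypothetical names, and the stricter pattern is an assumption about what would suit this data; it was not used to produce any table in this notebook.

```python
import re

# pattern as used in the notebook
GREEDY_REGEX = re.compile(r'(={2,})(.+)={2,}')
# hypothetical alternative: lazy heading text, closing run must equal the opening run (\1)
STRICT_REGEX = re.compile(r'^(={2,})[ \t]*(.+?)[ \t]*\1[ \t]*$', re.MULTILINE)

SAMPLE = (
    "== Anarchist schools of thought ==\n"
    "=== Classical ===\n"
    "==== Mutualism ====\n"
)

def headings(pattern, text):
    """Return (heading_level, section_id) pairs found by the given pattern."""
    return [(len(marks), title.strip().replace(' ', '_'))
            for marks, title in pattern.findall(text)]

greedy = headings(GREEDY_REGEX, SAMPLE)
strict = headings(STRICT_REGEX, SAMPLE)
```

With the greedy pattern the H3 and H4 headings come out as `Classical_=` and `Mutualism_==`, matching the output above; with the stricter pattern they come out as `Classical` and `Mutualism`. H2 headings are unaffected either way, so the `section_h2` values are the same under both patterns.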

In [4]:
# write section data for later use
sections.createOrReplaceTempView("temp_sections")
sqlContext.sql("DROP TABLE IF EXISTS ryanmax.sections")
sqlContext.sql("CREATE TABLE ryanmax.sections AS SELECT DISTINCT page_id, section_h2, section_id FROM temp_sections")



DataFrame[]

In [5]:
# count of top-level (H2) section IDs for WP:M pages only
pm_sections_query = """
SELECT section_h2, count(*) count
FROM
    ryanmax.sections
WHERE 
    page_id IN (
                SELECT DISTINCT page_id 
                FROM ryanmax.projmed_with_extlinks 
                WHERE to_date(dt) >= '{}' AND to_date(dt) <= '{}'
                )
GROUP BY section_h2
ORDER BY count desc, section_h2
"""

pm_sections = spark.sql(pm_sections_query.format(start_date_string, end_date_string))
pm_sections_rdd = pm_sections.rdd
pm_sections_df = sqlContext.createDataFrame(pm_sections_rdd)
pm_sections_df.toPandas()

| | section_h2 | count |
|---|---|---|
| 0 | References | 31510 |
| 1 | External_links | 18774 |
| 2 | See_also | 13494 |
| 3 | Diagnosis | 8588 |
| 4 | History | 8136 |
| 5 | Treatment | 7244 |
| 6 | Signs_and_symptoms | 4631 |
| 7 | Causes | 4390 |
| 8 | Further_reading | 3287 |
| 9 | Society_and_culture | 3062 |


In [6]:
# Total count of events (by all event types) for each top-level (H2) section ID for WP:M pages only
pm_section_events_query = """
SELECT sections.section_h2, action, count(*) count
FROM 
    citationusage
    LEFT JOIN ryanmax.sections 
        ON 
        sections.page_id = citationusage.page_id 
        AND sections.section_id = citationusage.section_id
WHERE
    wiki = 'enwiki'
    AND citationusage.page_id IN (
                            SELECT DISTINCT page_id 
                            FROM ryanmax.projmed_with_extlinks 
                            WHERE to_date(dt) >= '{}' AND to_date(dt) <= '{}'
                        )
    {}
    AND to_date(citationusage.event_time) >= '{}'
    AND to_date(citationusage.event_time) <= '{}'
    AND useragent_is_bot = FALSE
    AND session_id in (
        SELECT session_id
        FROM citationusagepageload
        WHERE wiki = 'enwiki'
        {}
        AND to_date(event_time) >= '{}'
        AND to_date(event_time) <= '{}'
        AND useragent_is_bot = FALSE
        )
GROUP BY sections.section_h2, action
ORDER BY count desc
"""

pm_section_events = spark.sql(
    pm_section_events_query.format(
        start_date_string, end_date_string,
        event_exclusion_sql, start_date_string, end_date_string,
        pageload_exclusion_sql, start_date_string, end_date_string,
    ))
pm_section_events_rdd = pm_section_events.rdd
pm_section_events_df = sqlContext.createDataFrame(pm_section_events_rdd)
pm_section_events_pandas = pm_section_events_df.toPandas()


In [7]:
section_pda = pm_section_events_pandas.copy()
# label events with no matching H2 section (NaN after the LEFT JOIN) as '-- missing --'
section_pda['section_h2'] = section_pda['section_h2'].fillna('-- missing --')
# limit to counts above 1,000
df_filtered = section_pda.query('count>1000').copy()
# format counts as whole numbers before pivot
df_filtered['count'] = df_filtered['count'].map(lambda x: '{0:.0f}'.format(x))
display(Markdown("**Table I**: Total count of events (by all event types) for each top-level (H2) section ID for WP:M pages only. Limited to > 1000 events."))
display(Markdown('_Numbers may not exactly match "raw" data (Table II) because some section IDs could not be mapped to H2 section IDs (changed section ID, missing WP:M page, etc.)_'))
display(Markdown('**-- missing --** values are largely because section IDs were not recorded "if the section is the Main Section" as per Schema:CitationUsage.'))
df_filtered.pivot(index='section_h2', columns='action', values='count')

**Table I**: Total count of events (by all event types) for each top-level (H2) section ID for WP:M pages only. Limited to > 1000 events.

_Numbers may not exactly match "raw" data (Table II) because some section IDs could not be mapped to H2 section IDs (changed section ID, missing WP:M page, etc.)_

**-- missing --** values are largely because section IDs were not recorded "if the section is the Main Section" as per Schema:CitationUsage.

| section_h2 | extClick | fnClick | fnHover | upClick |
|---|---|---|---|---|
| -- missing -- | 42548 | 152896 | 267933 | |
| Adverse_effects | | 1089 | 1585 | |
| Cause | | | 2065 | |
| Causes | | 2856 | 4690 | |
| Diagnosis | | 1842 | 3702 | |
| Epidemiology | | 2060 | 4000 | |
| External_links | 29667 | | | |
| Further_reading | 3282 | | | |
| History | | 5973 | 12083 | |
| Management | | | 1041 | |
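
The `LEFT JOIN` in the query above keeps events whose `section_id` has no row in `ryanmax.sections`; their `section_h2` comes back NULL and ends up in the `-- missing --` row. That row therefore likely combines two cases: events recorded without a section ID, and events whose ID could not be matched to a dump heading. This would be consistent with the `-- missing --` counts in Table I being higher than those in Table II. The plain-Python sketch below mimics the join on made-up data; `count_by_h2`, the page ID, and the sample events are hypothetical and only illustrate the mechanism.

```python
from collections import Counter

# (page_id, section_id) -> parent H2 heading, standing in for ryanmax.sections
sections = {
    (1, 'Signs_and_symptoms'): 'Signs_and_symptoms',
    (1, 'Acute_hepatitis'): 'Signs_and_symptoms',
}

# (page_id, section_id, action), standing in for citationusage rows (made-up values)
events = [
    (1, 'Acute_hepatitis', 'fnClick'),
    (1, 'Signs_and_symptoms', 'fnClick'),
    (1, None, 'extClick'),               # no section ID recorded
    (1, 'Renamed_heading', 'fnHover'),   # ID with no match in the dump
]

def count_by_h2(events, sections, missing_label='-- missing --'):
    """Count events per (H2 heading, action), like the LEFT JOIN plus GROUP BY."""
    counts = Counter()
    for page_id, section_id, action in events:
        # an unmatched key behaves like a NULL section_h2 after a LEFT JOIN
        h2 = sections.get((page_id, section_id), missing_label)
        counts[(h2, action)] += 1
    return counts

counts = count_by_h2(events, sections)
```

Both the event without a section ID and the event with an unmatched ID land under `-- missing --`, while the H3-level `Acute_hepatitis` click is rolled up into `Signs_and_symptoms`.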


In [8]:
# "raw" section data from captured events to show extent of the capture issue described above
pm_section_events_raw_query = """
SELECT section_id, action, count(*) count
FROM 
    citationusage 
WHERE page_id IN (
                        SELECT DISTINCT page_id 
                        FROM ryanmax.projmed_with_extlinks
                        WHERE to_date(dt) >= '{}' AND to_date(dt) <= '{}'
                        )
    AND wiki = 'enwiki'
    {}
    AND to_date(event_time) >= '{}'
    AND to_date(event_time) <= '{}'
    AND useragent_is_bot = FALSE
    AND session_id in (
        SELECT session_id
        FROM citationusagepageload
        WHERE wiki = 'enwiki'
        {}
        AND to_date(event_time) >= '{}'
        AND to_date(event_time) <= '{}'
        AND useragent_is_bot = FALSE
        )
GROUP BY section_id, action
ORDER BY count desc
LIMIT 100
"""

pm_section_events_raw = spark.sql(
    pm_section_events_raw_query.format(
        start_date_string, end_date_string,
        event_exclusion_sql, start_date_string, end_date_string,
        pageload_exclusion_sql, start_date_string, end_date_string,
    ))
pm_section_events_raw_rdd = pm_section_events_raw.rdd
pm_section_events_raw_df = sqlContext.createDataFrame(pm_section_events_raw_rdd)
pm_section_events_raw_pandas = pm_section_events_raw_df.toPandas()
#pm_section_events_raw_pandas
#pm_section_events_pandas.pivot(index='section_id', columns='action', values='count')

In [9]:
section_pda_raw = pm_section_events_raw_pandas.copy()
# label events recorded without a section_id (NaN) as '-- missing --'
section_pda_raw['section_id'] = section_pda_raw['section_id'].fillna('-- missing --')
# limit to counts above 1,000
df_filtered_raw = section_pda_raw.query('count>1000').copy()
# format counts as whole numbers before pivot
df_filtered_raw['count'] = df_filtered_raw['count'].map(lambda x: '{0:.0f}'.format(x))
display(Markdown("**Table II**: Raw total count of events (by all event types) for each section ID for WP:M pages only. Limited to > 1000 events."))
df_filtered_raw.pivot(index='section_id', columns='action', values='count')

**Table II**: Raw total count of events (by all event types) for each section ID for WP:M pages only. Limited to > 1000 events.

| section_id | extClick | fnClick | fnHover | upClick |
|---|---|---|---|---|
| -- missing -- | 36996 | 93425 | 133043 | |
| Adverse_effects | | 1157 | 1679 | |
| Cause | | | 2076 | |
| Causes | | 2875 | 4731 | |
| Classification | | | 1711 | |
| Diagnosis | | 1870 | 3742 | |
| Epidemiology | | 2084 | 4027 | |
| External_links | 29687 | | | |
| Further_reading | 3302 | | | |
| Genetics | | | 1835 | |
