# Regression Testing 2019-03-05
* tickets
  - https://phabricator.wikimedia.org/T212937
  - https://phabricator.wikimedia.org/T213969
* docs
  - [CitationUsage schema and data collection review](https://docs.google.com/document/d/1WOJT2Fg3M-sW2D3PBU-dPgtf4Lic3I7hqRE2jvRSKFY)
  - [section_id examples for Baha](https://docs.google.com/spreadsheets/d/1iDkyLcn_xM4pm995qjE_hZM1Qe033_skPwf5Q4KyPG4)
  - [Updated section_id document](https://docs.google.com/spreadsheets/d/1iDkyLcn_xM4pm995qjE_hZM1Qe033_skPwf5Q4KyPG4)


In [1]:
# basic setup
import pyspark
import re
import pyspark.sql
from pyspark.sql import *
import pandas as pd
import matplotlib.pyplot as plt
import hashlib
import os.path
from pyspark.sql.functions import desc
from datetime import timedelta, date

%matplotlib inline
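# `sc` (and `spark` / `sqlContext`, used in later cells) are expected to be
# provided by the notebook's PySpark kernel; they are not defined in this notebook.
spark_hive = pyspark.sql.HiveContext(sc)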

schema_revision = "18810892"

## miscoded external links

In [2]:

query = """
select event.action, (event.link_url like 'https://en.wikipedia.org%') as likely_internal, count(*) as count 
from event.citationusage 
where hour > -1 
and revision = '{}'
group by event.action, (event.link_url like 'https://en.wikipedia.org%') 
order by event.action, count desc limit 100
"""

events = spark.sql(query.format(schema_revision))
sqlContext.createDataFrame(events.rdd).toPandas()

| | action | likely_internal | count |
|---|---|---|---|
| 0 | extClick | False | 33258 |
| 1 | extClick | True | 611 |
| 2 | fnClick | True | 9434 |
| 3 | fnClick | False | 8180 |
| 4 | fnHover | True | 24718 |
| 5 | fnHover | False | 1228 |
| 6 | upClick | True | 6068 |
| 7 | upClick | False | 60 |


**Summary**: 1.8% of extClick events (611 of 33,869) have a link_url starting with `https://en.wikipedia.org`, so they are likely internal links miscoded as external.
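
The 1.8% figure can be reproduced from the counts in the table above. The following is a minimal sketch in plain Python; the dictionary and the `share` helper are illustrative names, not part of the notebook.

```python
# Counts copied from the extClick rows of the table above,
# keyed by the likely_internal flag.
ext_click_counts = {False: 33258, True: 611}


def share(part, total):
    """Return part/total as a percentage rounded to one decimal place."""
    return round(100.0 * part / total, 1)


total_ext_clicks = sum(ext_click_counts.values())
miscoded_pct = share(ext_click_counts[True], total_ext_clicks)

print(total_ext_clicks, miscoded_pct)
```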

### potential extClick exclusion criteria: missing page/revision data + miscoded external links

In [3]:
query = """
select ((event.page_id = 0 and event.revision_id = 0) or event.link_url like 'https://en.wikipedia.org%') as excluded, count(*) as count 
from event.citationusage 
where hour > -1 
and revision = '{}'
and event.action = 'extClick'
group by ((event.page_id = 0 and event.revision_id = 0) or event.link_url like 'https://en.wikipedia.org%')
"""

events = spark.sql(query.format(schema_revision))
sqlContext.createDataFrame(events.rdd).toPandas()

| | excluded | count |
|---|---|---|
| 0 | True | 617 |
| 1 | False | 33252 |


**Summary**: Excluding events with missing page/revision data (page_id = 0 and revision_id = 0) or with miscoded external links would drop 1.8% of extClick events (617 of 33,869).
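
For reference, the exclusion rule used in the query above can be restated as a plain Python predicate. This is only a sketch: the function name, the constant, and the sample events are hypothetical, and unlike the SQL it does not handle NULL values.

```python
# Prefix matched by the SQL pattern 'https://en.wikipedia.org%'.
INTERNAL_PREFIX = "https://en.wikipedia.org"


def is_excluded(page_id, revision_id, link_url):
    """Mirror the SQL condition: missing page/revision data OR likely-internal link."""
    missing_page_data = page_id == 0 and revision_id == 0
    likely_internal = link_url.startswith(INTERNAL_PREFIX)
    return missing_page_data or likely_internal


# Hypothetical sample events: (page_id, revision_id, link_url)
sample_events = [
    (0, 0, "https://example.org/source"),          # missing page data
    (12, 345, "https://en.wikipedia.org/wiki/X"),  # miscoded internal link
    (12, 345, "https://example.org/source"),       # kept
]

flags = [is_excluded(*event) for event in sample_events]
print(flags)
```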

### link_occurrence and ext_position
**Weakness**: When an external link occurs more than once on a page, the position of the last occurrence is reported, which may bias the data toward indicating that clicks occur lower on the page.
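
The effect can be illustrated with a toy model. The sketch below is not the actual instrumentation code: it assumes a page is a list of link URLs in document order and that positions are 1-based, and it simply reports the last occurrence of the clicked URL, as described above.

```python
def reported_position(links, clicked_index):
    """Return the 1-based position of the LAST occurrence of the clicked URL.

    `links` is a hypothetical list of external link URLs in page order;
    `clicked_index` is the 0-based index of the link the reader clicked.
    """
    url = links[clicked_index]
    last_index = len(links) - 1 - links[::-1].index(url)
    return last_index + 1


# Hypothetical page where link "a" appears first and last.
page_links = ["a", "b", "c", "a"]

clicked_first_a = reported_position(page_links, 0)  # reader clicked the top link
clicked_b = reported_position(page_links, 1)        # unique link, unaffected

print(clicked_first_a, clicked_b)
```

In this toy page, a click on the first link is logged at the bottom position, while a link that occurs only once is logged where it was clicked.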

In [4]:
query = """
select (event.link_occurrence > 1) as greater_than_one_link_occurrence, count(*) as count 
from event.citationusage 
where hour > -1 
and revision = '{}'
and event.action = 'extClick'
group by (event.link_occurrence > 1)
"""

events = spark.sql(query.format(schema_revision))
sqlContext.createDataFrame(events.rdd).toPandas()

| | greater_than_one_link_occurrence | count |
|---|---|---|
| 0 | True | 7703 |
| 1 | False | 26166 |


**Summary**: 23% of extClick events (7,703 of 33,869) have a link_occurrence greater than 1, potentially skewing ext_position toward positions lower on the page.
For ext_position to be useful, analysis may need to exclude links that occur more than once.
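
One possible way to apply that exclusion is to add a filter to the query pattern used in the cells above. This is a sketch under assumptions: it assumes the field is addressable as `event.ext_position` like the other event fields, and it uses `link_occurrence <= 1` as the complement of the grouping expression above. The query is only built here; running it requires the notebook's Spark session.

```python
schema_revision = "18810892"  # same value as in the setup cell

# Same filters as the link_occurrence cell above, restricted to links
# that occur at most once on the page (assumed complement of "> 1").
query = """
select event.ext_position, count(*) as count
from event.citationusage
where hour > -1
and revision = '{}'
and event.action = 'extClick'
and event.link_occurrence <= 1
group by event.ext_position
order by event.ext_position
"""

single_occurrence_query = query.format(schema_revision)
print(single_occurrence_query)

# In the notebook this could then be run as in the earlier cells, e.g.:
# events = spark.sql(single_occurrence_query)
```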