# Regression Testing 2019-03-01
* tickets
  - https://phabricator.wikimedia.org/T212937
  - https://phabricator.wikimedia.org/T213969
* docs
  - [CitationUsage schema and data collection review](https://docs.google.com/document/d/1WOJT2Fg3M-sW2D3PBU-dPgtf4Lic3I7hqRE2jvRSKFY)
  - [section_id examples for Baha](https://docs.google.com/spreadsheets/d/1iDkyLcn_xM4pm995qjE_hZM1Qe033_skPwf5Q4KyPG4)
  - [Updated section_id document](https://docs.google.com/spreadsheets/d/1iDkyLcn_xM4pm995qjE_hZM1Qe033_skPwf5Q4KyPG4)


In [1]:
# basic setup
import pyspark
import re
import pyspark.sql
from pyspark.sql import *
import pandas as pd
import matplotlib.pyplot as plt
import hashlib
import os.path
from pyspark.sql.functions import desc
from datetime import timedelta, date

%matplotlib inline
spark_hive = pyspark.sql.HiveContext(sc)

In [2]:
# show data by date just to see distribution and revision #
events_query = """
SELECT CONCAT(year, '-', month, '-', day) as date, revision, wiki, event.action, count(*) count
FROM event.citationusage
WHERE dt like '2019-02-%'
GROUP BY year, month, day, revision, wiki, event.action
ORDER BY year, month, day
"""
# run the query and collect the small aggregated result into pandas for display
daily_events = spark.sql(events_query)
events_pandas = daily_events.toPandas()
events_pandas

Unnamed: 0,date,revision,wiki,action,count
0,2019-2-1,18359729,enwiki,fnHover,1
1,2019-2-1,18359729,enwiki,extClick,3
2,2019-2-2,18359729,enwiki,extClick,1
3,2019-2-2,18359729,enwiki,fnHover,1
4,2019-2-3,18359729,enwiki,extClick,2
5,2019-2-3,18359729,enwiki,fnHover,1
6,2019-2-4,18359729,enwiki,fnHover,3
7,2019-2-5,18359729,enwiki,fnHover,4
8,2019-2-5,18359729,enwiki,fnClick,1
9,2019-2-5,18359729,enwiki,extClick,1
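
To make the per-day distribution easier to scan, the rows above can be regrouped into one entry per date with a count per action. The following is a minimal sketch in plain Python that uses only the ten rows shown in the output; the helper name `daily_action_counts` is hypothetical and not part of the notebook. It also zero-pads the dates, since `CONCAT(year, '-', month, '-', day)` yields strings such as `2019-2-1` that do not sort chronologically as text.

```python
from collections import defaultdict
from datetime import datetime

# (date, action, count) taken from the first rows of the output above
rows = [
    ("2019-2-1", "fnHover", 1),
    ("2019-2-1", "extClick", 3),
    ("2019-2-2", "extClick", 1),
    ("2019-2-2", "fnHover", 1),
    ("2019-2-3", "extClick", 2),
    ("2019-2-3", "fnHover", 1),
    ("2019-2-4", "fnHover", 3),
    ("2019-2-5", "fnHover", 4),
    ("2019-2-5", "fnClick", 1),
    ("2019-2-5", "extClick", 1),
]


def daily_action_counts(rows):
    """Return {iso_date: {action: count}} with zero-padded ISO dates."""
    table = defaultdict(lambda: defaultdict(int))
    for raw_date, action, count in rows:
        # strptime accepts the unpadded month/day produced by the CONCAT
        iso_date = datetime.strptime(raw_date, "%Y-%m-%d").date().isoformat()
        table[iso_date][action] += count
    return {day: dict(actions) for day, actions in table.items()}


counts = daily_action_counts(rows)
for day in sorted(counts):
    print(day, counts[day])
```

Actions with no events on a given day are simply absent from that day's entry.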


## freely_accessible

Revision 18359729 is the previous data run: rate = 0.01%

| freely_accessible | count    |
|-------------------|----------|
| false             | 52181036 |
| true              | 6113     |

In [3]:
# rev 18359729 is a previous data run: 
# freely_accessible	count
# false	52181036
# true	6113
# rate: 0.01%

# rev 18810892 is new 1% sample: Mon, Feb 25, 11:37 AM - Wed, Feb 27, 9:28 AM
schema_revision = "18810892"

freely_accessible_query = """
select event.freely_accessible, count(*) as count 
from event.citationusage 
where hour>-1 
and event.action='extClick'
and revision = '{}'
group by event.freely_accessible
"""

# run the query for the chosen schema revision and collect the small result into pandas
free_events = spark.sql(freely_accessible_query.format(schema_revision))
free_events_pandas = free_events.toPandas()
free_events_pandas

Unnamed: 0,freely_accessible,count
0,True,75
1,False,33770


In [4]:

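# select the freely_accessible count by value, not by row position:
# the query has no ORDER BY, so the row order is not guaranteed
is_free = free_events_pandas['freely_accessible'].eq(True)
free_count = free_events_pandas.loc[is_free, 'count'].sum()
total = free_events_pandas['count'].sum()

print("rate: ", (free_count / total) * 100, "or ", "{0:.1%}".format(free_count / total))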


rate:  0.221598463584 or  0.2%


**The old rate was 0.01%; the new rate is 0.2%, which is a large improvement.**
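
As a worked check of the comparison, the two rates can be recomputed directly from the counts reported in this section. This is a small sketch using only those four numbers; the variable names are illustrative.

```python
# counts for extClick events, taken from the tables in this section
old_true, old_false = 6113, 52181036   # revision 18359729 (previous run)
new_true, new_false = 75, 33770        # revision 18810892 (new 1% sample)

old_rate = old_true / (old_true + old_false)
new_rate = new_true / (new_true + new_false)

print("old rate: {:.4%}".format(old_rate))
print("new rate: {:.4%}".format(new_rate))
print("new / old: {:.1f}x".format(new_rate / old_rate))
```

With these counts the new rate comes out roughly 19 times the old one, though the new sample is far smaller (about 34 thousand extClick events versus about 52 million).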



## section_id

[previous data](https://docs.google.com/spreadsheets/d/1UTsp1T3Dac94ny0O80U2mVwXoK3E2B2eAmpjuDUjRk4/edit#gid=498998310)

| action   | null_section_id | count    | percent of action |
|----------|-----------------|----------|-------------------|
| extClick | TRUE            | 6102236  | 16.25%            |
| fnHover  | TRUE            | 24338619 | 68.87%            |
| fnClick  | TRUE            | 8556200  | 39.65%            |
| upClick  | TRUE            | 44101    | 5.79%             |


In [5]:
# new data
schema_revision = "18810892"
section_id_query = """
select event.action, isnull(event.section_id) as null_section_id, count(*) as count 
from event.citationusage
where event.in_infobox = false 
and hour > -1 
and revision = '{}'
group by event.action, isnull(event.section_id)
order by event.action, isnull(event.section_id)
"""

# run the query for the chosen schema revision and collect the small result into pandas
section_id_events = spark.sql(section_id_query.format(schema_revision))
section_id_events_pandas = section_id_events.toPandas()
section_id_events_pandas

Unnamed: 0,action,null_section_id,count
0,extClick,False,22954
1,extClick,True,2035
2,fnClick,False,13555
3,fnClick,True,2903
4,fnHover,False,20437
5,fnHover,True,4680
6,upClick,False,6125


[new data](https://docs.google.com/spreadsheets/d/1iDkyLcn_xM4pm995qjE_hZM1Qe033_skPwf5Q4KyPG4/edit#gid=498998310)

| action   | null_section_id | count | percent of action |
|----------|-----------------|-------|-------------------|
| extClick | TRUE            | 2035  | 8.14%             |
| fnClick  | TRUE            | 2903  | 17.64%            |
| fnHover  | TRUE            | 4680  | 18.63%            |


**8% of extClick actions are missing section_id data. This is an improvement over the old rate of 16%.**

A manual spot-check did not find instrumentation errors.
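
The "percent of action" column is the null count divided by the total count for that action. The following sketch recomputes it from the query output above in plain Python; the function name `null_share_by_action` is hypothetical, and an action with no null rows (upClick here) is treated as 0%.

```python
# (action, null_section_id) -> count, copied from the query output above
counts = {
    ("extClick", False): 22954,
    ("extClick", True): 2035,
    ("fnClick", False): 13555,
    ("fnClick", True): 2903,
    ("fnHover", False): 20437,
    ("fnHover", True): 4680,
    ("upClick", False): 6125,
}


def null_share_by_action(counts):
    """Return {action: fraction of events with a null section_id}."""
    totals = {}
    nulls = {}
    for (action, is_null), n in counts.items():
        totals[action] = totals.get(action, 0) + n
        if is_null:
            nulls[action] = nulls.get(action, 0) + n
    # actions without any null rows get a share of 0.0
    return {action: nulls.get(action, 0) / totals[action] for action in totals}


share = null_share_by_action(counts)
for action in sorted(share):
    print("{:<9} {:.2%}".format(action, share[action]))
```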


## missing page_id and revision_id

query: select (event.page_id = 0 and event.revision_id = 0) as missing_pageId_and_revId, count(*) as count from event.citationusage where hour > -1 and revision = 18810892 group by (event.page_id = 0 and event.revision_id = 0);

| missing_pageid_and_revid | count |
|--------------------------|-------|
| false                    | 83077 |
| true                     | 392   |

**Missing: 0.47% (392 of 83,469 events). Not huge, but potentially significant.**

## missing page_title

query: select isnull(event.page_title) as null_page_title, count(*) as count from event.citationusage where hour > -1 and revision = 18810892 group by isnull(event.page_title);

| missing_page_title | count |
|--------------------|-------|
| true               | 83469 |

**100% of samples are missing page_title. Is this intentional?**

## pageload data

In [16]:
# show data by date just to see distribution by revision #
events_query = """
SELECT CONCAT(year, '-', month, '-', day) as date, revision, wiki, event.action, count(*) count
FROM event.citationusagepageload
WHERE dt like '2019-02-%'
GROUP BY year, month, day, revision, wiki, event.action
ORDER BY year, month, day
"""
# run the query and collect the small aggregated result into pandas for display
daily_events = spark.sql(events_query)
events_pandas = daily_events.toPandas()
events_pandas

Unnamed: 0,date,revision,wiki,action,count
0,2019-2-1,18359580,enwiki,pageLoad,200
1,2019-2-2,18359580,enwiki,pageLoad,187
2,2019-2-3,18359580,enwiki,pageLoad,177
3,2019-2-4,18359580,enwiki,pageLoad,171
4,2019-2-5,18359580,enwiki,pageLoad,191
5,2019-2-6,18359580,enwiki,pageLoad,193
6,2019-2-7,18359580,enwiki,pageLoad,178
7,2019-2-8,18359580,enwiki,pageLoad,161
8,2019-2-9,18359580,enwiki,pageLoad,173
9,2019-2-10,18359580,enwiki,pageLoad,156


**page_title is missing from pageload data as well ... likely intentional**

query: select isnull(event.page_title) as null_page_title, count(*) as count from event.citationusagepageload where hour > -1 and revision = 18502712 group by isnull(event.page_title);


**page_id and revision_id are also missing in some pageload events, at a lower rate than in citationusage: 0.13%**

query: select (event.page_id = 0 and event.revision_id = 0) as missing_pageId_and_revId, count(*) as count from event.citationusagepageload where hour > -1 and revision = 18502712 group by (event.page_id = 0 and event.revision_id = 0);

| missing_pageid_and_revid | count   |
|--------------------------|---------|
| false                    | 3183029 |
| true                     | 4033    |
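
The missing page_id/revision_id check is run against both `event.citationusage` and `event.citationusagepageload`, with only the table name and schema revision changing. A minimal sketch of one way to generate both queries from a single template follows; `MISSING_IDS_TEMPLATE`, `KNOWN_TABLES`, and `missing_ids_query` are hypothetical names introduced here, and the resulting string would be passed to `spark.sql(...)` as in the cells above.

```python
# template built from the missing page_id / revision_id query used in this notebook
MISSING_IDS_TEMPLATE = """
select (event.page_id = 0 and event.revision_id = 0) as missing_pageId_and_revId,
       count(*) as count
from event.{table}
where hour > -1
and revision = {revision}
group by (event.page_id = 0 and event.revision_id = 0)
"""

# only the two tables examined in this notebook are allowed
KNOWN_TABLES = {"citationusage", "citationusagepageload"}


def missing_ids_query(table, revision):
    """Build the missing-ID query for one table and schema revision."""
    if table not in KNOWN_TABLES:
        raise ValueError("unexpected table: {}".format(table))
    # int() rejects anything that is not a plain revision number
    return MISSING_IDS_TEMPLATE.format(table=table, revision=int(revision))


print(missing_ids_query("citationusage", 18810892))
print(missing_ids_query("citationusagepageload", 18502712))
```

Restricting the table name to a known set and coercing the revision to an integer keeps the string formatting from producing an unintended query.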