# Enfield, Andrew - DATA 512, A2: Bias in Data

TBD UPDATE

The assignment is at https://wiki.communitydata.cc/HCDS_(Fall_2017)/Assignments#A2:_Bias_in_data.

TBD remove

This notebook pulls, prepares, and analyzes data about English Wikipedia articles by country: the page data, mid-2015 country populations, and article quality predictions from ORES. For more information about the work and data, refer to the [README](Readme.md).

A few notes:
- Normally I'd prefer to keep the explanation and background that's in the README here in the notebook, so everything's in a single file, but I've split it up this time as that's what the assignment requested. I won't copy/paste because keeping duplicate content in sync is horrible.
- Real reproducibility needs tests for the code. A lot of my implementation below is in functions. I'd normally put these functions in at least one separate file that I import into this notebook, and I'd have tests in an additional file. For this assignment I'll just keep everything in this file, for simplicity, even though it means I can't test the code the way I normally would.

# Prereqs

This code requires the libraries imported below.

In [32]:
# load data
import requests
import json
# import os

# load, prepare, and analyze data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#from mpl_toolkits.axes_grid.anchored_artists import AnchoredText # for addtl annotations in charts
#from matplotlib.ticker import FuncFormatter # for custom axis labels
from IPython.core.pylabtools import figsize
import seaborn as sns # for formatting
%matplotlib inline 

In [3]:
sns.set_style("whitegrid")
figsize(14,7)

# Load data

This section loads the two input files: `page_data.csv` (one row per Wikipedia article, with its country, page name, and last-edit revision ID) and `Population Mid-2015.csv` (population by location).

In [30]:
d_wikipedia = pd.read_csv('page_data.csv')
d_wikipedia.shape

(47997, 3)

In [31]:
d_wikipedia[:3]

Unnamed: 0,country,page,last_edit
0,Abkhazia,Zurab Achba,802551672
1,Abkhazia,Garri Aiba,774499188
2,Abkhazia,Zaur Avidzba,803841397


In [18]:
d_population = pd.read_csv('Population Mid-2015.csv', skiprows=2, thousands=',')
d_population.shape

(210, 6)

In [29]:
d_wikipedia.groupby(['country']).size().sort_values(ascending=False)[:10]

country
France           1858
Australia        1610
Pakistan         1268
China            1261
Mexico           1137
United States    1115
Russia           1109
Iran             1055
Spain            1003
India             993
dtype: int64

In [19]:
d_population[:3]

Unnamed: 0,Location,Location Type,TimeFrame,Data Type,Data,Footnotes
0,Afghanistan,Country,Mid-2015,Number,32247000,
1,Albania,Country,Mid-2015,Number,2892000,
2,Algeria,Country,Mid-2015,Number,39948000,


## Pull article scores

Docs: https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model

In [175]:
user_agent = 'https://github.com/aenfield'

def get_full_ores_score_json(rev_id):
    """Return the full ORES wp10 score JSON (as a dict) for a single enwiki revision ID.

    API docs: https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model
    """
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/{revid}/{model}'

    # TODO update to hardcode enwiki and wp10?
    params = {'project': 'enwiki',
              'model': 'wp10',
              'revid': rev_id}

    api_call = requests.get(endpoint.format(**params), headers={'User-Agent': user_agent})
    return api_call.json()

def get_multiple_full_ores_score_json(rev_ids):
    """Return the full ORES wp10 score JSON (as a dict) for a batch of enwiki revision IDs.

    NOTE: rev_ids is currently ignored. The request always uses the hard-coded,
    pipe-delimited list below, which is the first 20 last_edit values in d_wikipedia.
    """
    endpoint = 'https://ores.wikimedia.org/v3/scores/enwiki?models=wp10&revids={rev_ids_delimited}'

    # Example URL, with the '|' delimiter percent-encoded as %7C:
    # https://ores.wikimedia.org/v3/scores/enwiki?models=wp10&revids=802551672%7C774499188%7C785284614

    rev_ids_delimited = '802551672|774499188|803841397|789818648|785284614|798644673|728644481|788591677|758713659|802860970|797469371|804349394|799618550|805063877|718383950|805775169|778690357|779839643|803055503|805920528'
    params = {'rev_ids_delimited': rev_ids_delimited}

    api_call = requests.get(endpoint.format(**params), headers={'User-Agent': user_agent})
    return api_call.json()

def get_ores_prediction_from_score_json(score_json):
    """Return the most likely article class, per ORES. Assumes a JSON dict from ORES with only a single article."""
    # The parameter is named score_json so it doesn't shadow the imported json module.
    scores = score_json['enwiki']['scores']
    first_rev_id = list(scores.keys())[0]
    return scores[first_rev_id]['wp10']['score']['prediction']

def get_ores_prediction(rev_id):
    """Return the ORES wp10 prediction (e.g. 'Stub', 'Start', 'GA') for a single revision ID."""
    j = get_full_ores_score_json(rev_id)
    return get_ores_prediction_from_score_json(j)
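
The batch function above does not yet use its `rev_ids` argument. As a minimal sketch of how the batch URL might be built from an actual list of IDs, the helper below joins the IDs with `|` and percent-encodes the query string using only the standard library. The name `build_batch_scores_url` and its `project` and `model` parameters are hypothetical and are not part of this notebook.

```python
from urllib.parse import urlencode


def build_batch_scores_url(rev_ids, project='enwiki', model='wp10'):
    """Build an ORES v3 batch-scoring URL for a list of revision IDs.

    Hypothetical helper: rev_ids may be ints or strings. They are joined
    with '|', which urlencode percent-encodes as %7C.
    """
    rev_ids_delimited = '|'.join(str(rev_id) for rev_id in rev_ids)
    query = urlencode({'models': model, 'revids': rev_ids_delimited})
    return 'https://ores.wikimedia.org/v3/scores/{}?{}'.format(project, query)


url = build_batch_scores_url([802551672, 774499188, 785284614])
print(url)
```

The resulting URL could then be passed to `requests.get` with the same `User-Agent` header used in the functions above.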

In [173]:
%time foo = get_multiple_full_ores_score_json('foo')
foo

CPU times: user 26.7 ms, sys: 2.91 ms, total: 29.6 ms
Wall time: 303 ms


<Response [200]>

In [174]:
foo.text

'{\n  "enwiki": {\n    "models": {\n      "wp10": {\n        "version": "0.5.0"\n      }\n    },\n    "scores": {\n      "718383950": {\n        "wp10": {\n          "score": {\n            "prediction": "Stub",\n            "probability": {\n              "B": 0.00867108446083565,\n              "C": 0.010201201419512923,\n              "FA": 0.0012308185869063053,\n              "GA": 0.002347248512459868,\n              "Start": 0.08362945430408676,\n              "Stub": 0.8939201927161984\n            }\n          }\n        }\n      },\n      "728644481": {\n        "wp10": {\n          "score": {\n            "prediction": "Stub",\n            "probability": {\n              "B": 0.011743849354179213,\n              "C": 0.018551150267917215,\n              "FA": 0.0017216633333940244,\n              "GA": 0.004963145860590862,\n              "Start": 0.2475321059686044,\n              "Stub": 0.7154880852153144\n            }\n          }\n        }\n      },\n      "758713659"

In [177]:
%time bar = get_multiple_full_ores_score_json(3)

CPU times: user 27.2 ms, sys: 3.81 ms, total: 31 ms
Wall time: 298 ms


In [178]:
bar

{'enwiki': {'models': {'wp10': {'version': '0.5.0'}},
  'scores': {'718383950': {'wp10': {'score': {'prediction': 'Stub',
      'probability': {'B': 0.00867108446083565,
       'C': 0.010201201419512923,
       'FA': 0.0012308185869063053,
       'GA': 0.002347248512459868,
       'Start': 0.08362945430408676,
       'Stub': 0.8939201927161984}}}},
   '728644481': {'wp10': {'score': {'prediction': 'Stub',
      'probability': {'B': 0.011743849354179213,
       'C': 0.018551150267917215,
       'FA': 0.0017216633333940244,
       'GA': 0.004963145860590862,
       'Start': 0.2475321059686044,
       'Stub': 0.7154880852153144}}}},
   '758713659': {'wp10': {'score': {'prediction': 'C',
      'probability': {'B': 0.13051629022094477,
       'C': 0.661331647033431,
       'FA': 0.008345185189334319,
       'GA': 0.0506476033411268,
       'Start': 0.13543386979975322,
       'Stub': 0.01372540441540996}}}},
   '774499188': {'wp10': {'score': {'prediction': 'Stub',
      'probability': {'B'

In [179]:
#bar['enwiki']['scores'][list(j['enwiki']['scores'].keys())[0]]['wp10']['score']['prediction']

IndexError: list index out of range

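**TODO** Check that the output of the batch call below matches the output of calling the API individually (further below). At a glance the results didn't match, but I was only comparing the lists of predictions in order. The batch response is keyed by rev_id in a different order than the request (its first key, 718383950, is the 15th ID requested), so the comparison needs to be done per rev_id. For the IDs visible in both outputs (718383950, 728644481, 758713659, 774499188), the predictions agree.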

In [188]:
[bar['enwiki']['scores'][list(bar['enwiki']['scores'].keys())[i]]['wp10']['score']['prediction'] for i in range(10)]

['Stub', 'Stub', 'C', 'Stub', 'Start', 'Stub', 'Start', 'Start', 'Start', 'GA']
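
Because the keys of the batch response are not in request order, one way to make the comparison is to turn the response into a dict keyed by revision ID. The sketch below assumes the response shape shown in the output above. The function name `predictions_by_rev_id` is hypothetical, and mapping entries without a `score` key to `None` is an assumption about how failed revisions might appear.

```python
def predictions_by_rev_id(response_json, project='enwiki', model='wp10'):
    """Map each revision ID (as an int) in a batch response to its predicted class.

    Entries without a 'score' key map to None (an assumption, see lead-in).
    """
    scores = response_json[project]['scores']
    result = {}
    for rev_id, models in scores.items():
        score = models[model].get('score')
        result[int(rev_id)] = score['prediction'] if score else None
    return result


# A trimmed response with the same shape as the batch output above.
sample = {'enwiki': {'models': {'wp10': {'version': '0.5.0'}},
                     'scores': {'718383950': {'wp10': {'score': {'prediction': 'Stub'}}},
                                '758713659': {'wp10': {'score': {'prediction': 'C'}}}}}}

batch = predictions_by_rev_id(sample)

# Predictions for the same two IDs from the one-at-a-time calls.
individual = {718383950: 'Stub', 758713659: 'C'}

# Revision IDs whose batch prediction differs from the individual prediction.
mismatches = {rev_id for rev_id in batch if batch[rev_id] != individual.get(rev_id)}
print(mismatches)
```

Keying by integer ID also lines the result up with the `int64` values in `d_wikipedia['last_edit']`.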

In [128]:
type(j)

dict

In [129]:
get_ores_prediction('797882120')

'Start'

In [130]:
j = get_full_ores_score_json('797882120')
j

{'enwiki': {'models': {'wp10': {'version': '0.5.0'}},
  'scores': {'797882120': {'wp10': {'score': {'prediction': 'Start',
      'probability': {'B': 0.0325056273665757,
       'C': 0.10161634736900718,
       'FA': 0.003680032854794337,
       'GA': 0.021044772033944954,
       'Start': 0.8081343649161963,
       'Stub': 0.033018855459481376}}}}}}}

In [131]:
list(j['enwiki']['scores'].keys())[0]

'797882120'

In [138]:
d_wikipedia['last_edit'][:10]

0    802551672
1    774499188
2    803841397
3    789818648
4    785284614
5    798644673
6    728644481
7    788591677
8    758713659
9    802860970
Name: last_edit, dtype: int64

In [143]:
d_wikipedia['last_edit'][:10].values

array([802551672, 774499188, 803841397, 789818648, 785284614, 798644673,
       728644481, 788591677, 758713659, 802860970])

In [151]:
'|'.join([str(i) for i in d_wikipedia['last_edit'][:20].values])

'802551672|774499188|803841397|789818648|785284614|798644673|728644481|788591677|758713659|802860970|797469371|804349394|799618550|805063877|718383950|805775169|778690357|779839643|803055503|805920528'

In [146]:
'|'.join(['1','2','3'])

'1|2|3'
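
To cover every row rather than only the first 20, the IDs would need to be split into batches before joining. A minimal sketch follows. The helper name `batches` is hypothetical, and the batch size of 20 is only the size tried in the experiment above, not a documented limit of the API.

```python
def batches(values, batch_size=20):
    """Yield successive lists of at most batch_size items from values."""
    values = list(values)
    for start in range(0, len(values), batch_size):
        yield values[start:start + batch_size]


# Stand-in for d_wikipedia['last_edit'].values: 45 fake IDs, 1 through 45.
fake_rev_ids = range(1, 46)

# One pipe-delimited string per batch, in the same form as the join above.
delimited = ['|'.join(str(v) for v in batch) for batch in batches(fake_rev_ids)]
print(len(delimited), delimited[-1])
```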

In [132]:
j['enwiki']['scores']['797882120']['wp10']['score']['prediction']

'Start'

In [133]:
list(j['enwiki']['scores'].keys())[0]

'797882120'

In [134]:
j['enwiki']['scores'][list(j['enwiki']['scores'].keys())[0]]['wp10']['score']['prediction']

'Start'

Based on one run of the cell below, processing 100 IDs one at a time takes 29.2 s, or 0.292 s/ID. With about 48,000 IDs, that is roughly 3.9 hours. The batch API appears much faster: requesting 20 IDs at once (just fetching the data, not extracting the results) takes 300-800 ms, or about 15-40 ms/ID. If that rate held, all 48,000 IDs would take roughly 12-32 minutes.

In [137]:
%time d_wikipedia['last_edit'][:100].apply(get_ores_prediction)

CPU times: user 2.6 s, sys: 117 ms, total: 2.72 s
Wall time: 29.2 s


0         C
1      Stub
2         C
3     Start
4     Start
5      Stub
6      Stub
7     Start
8         C
9     Start
10       GA
11    Start
12    Start
13     Stub
14     Stub
15    Start
16    Start
17     Stub
18       GA
19       GA
20     Stub
21        C
22    Start
23    Start
24        B
25        C
26     Stub
27     Stub
28        C
29     Stub
      ...  
70        C
71       GA
72     Stub
73     Stub
74       GA
75    Start
76     Stub
77     Stub
78     Stub
79    Start
80        B
81    Start
82        C
83        C
84       GA
85    Start
86       GA
87    Start
88    Start
89        C
90    Start
91    Start
92     Stub
93        C
94     Stub
95        C
96    Start
97     Stub
98        C
99     Stub
Name: last_edit, Length: 100, dtype: object
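
The runtime estimates can be reproduced with a few lines of arithmetic. This sketch uses only the timings measured above (29.2 s for 100 single-ID calls, 300-800 ms per 20-ID batch) and the row count of `page_data.csv`. It assumes those rates hold for the full run, which a single timing run can't confirm.

```python
import math

n_ids = 47997                       # rows in page_data.csv

# One request per ID: 100 IDs took 29.2 s of wall time.
single_sec_per_id = 29.2 / 100
single_hours = n_ids * single_sec_per_id / 3600

# Batched requests: 20 IDs per request, 0.3 to 0.8 s per request.
batch_size = 20
n_batches = math.ceil(n_ids / batch_size)
batch_minutes_low = n_batches * 0.3 / 60
batch_minutes_high = n_batches * 0.8 / 60

print('single: {:.1f} hours'.format(single_hours))
print('batched: {} requests, {:.0f}-{:.0f} minutes'.format(
    n_batches, batch_minutes_low, batch_minutes_high))
```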