# Enfield, Andrew - DATA 512, A2: Bias in Data

TBD UPDATE

The assignment is at https://wiki.communitydata.cc/HCDS_(Fall_2017)/Assignments#A2:_Bias_in_data.

TBD remove

This notebook pulls, prepares, and analyzes data about the amount of monthly English Wikipedia traffic from January 1, 2008 through September 30, 2017. For more information about the work and data, refer to the [README](Readme.md).

A few notes:
- Normally I'd prefer to keep the explanation and background that's in the README here in the notebook, so everything's in a single file, but I've split it up this time as that's what the assignment requested. I won't copy/paste because keeping duplicate content in sync is horrible.
- Real reproducibility needs tests for the code. A lot of my implementation below is in functions. I'd normally put these functions in at least one separate file that I import into this notebook, and I'd have tests in an additional file. For this assignment I'll just keep everything in this file, for simplicity, even though it means I can't test the code the way I normally would.

# Prereqs

This code requires the libraries imported below.

In [32]:
# load data
import requests
import json
# import os

# load, prepare, and analyze data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#from mpl_toolkits.axes_grid.anchored_artists import AnchoredText # for addtl annotations in charts
#from matplotlib.ticker import FuncFormatter # for custom axis labels
from IPython.core.pylabtools import figsize
import seaborn as sns # for formatting
%matplotlib inline 

In [3]:
sns.set_style("whitegrid")
figsize(14,7)

# Load data

TBD UPDATE

This section loads the data from the two APIs described in the README, producing five separate .json files, one for each API and access combination.

In [30]:
d_wikipedia = pd.read_csv('page_data.csv')
d_wikipedia.shape

(47997, 3)

In [31]:
d_wikipedia[:3]

Unnamed: 0,country,page,last_edit
0,Abkhazia,Zurab Achba,802551672
1,Abkhazia,Garri Aiba,774499188
2,Abkhazia,Zaur Avidzba,803841397


In [18]:
d_population = pd.read_csv('Population Mid-2015.csv', skiprows=2, thousands=',')
d_population.shape

(210, 6)

In [29]:
d_wikipedia.groupby(['country']).size().sort_values(ascending=False)[:10]

country
France           1858
Australia        1610
Pakistan         1268
China            1261
Mexico           1137
United States    1115
Russia           1109
Iran             1055
Spain            1003
India             993
dtype: int64

In [19]:
d_population[:3]

Unnamed: 0,Location,Location Type,TimeFrame,Data Type,Data,Footnotes
0,Afghanistan,Country,Mid-2015,Number,32247000,
1,Albania,Country,Mid-2015,Number,2892000,
2,Algeria,Country,Mid-2015,Number,39948000,


## Pull article scores

Docs: https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model and https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context

Note that when I call the multiple-rev-ID API with a set of valid IDs plus one that's a text string, it returns a 500 and no data at all. When I instead include an ID that's numeric but doesn't exist, like -1, I get good data for the valid IDs and output like the following for the invalid one. This seems good: I'll go ahead and try just pulling batches of IDs without further error handling.

    "-1": {
        "wp10": {
          "error": {
            "message": "RevisionNotFound: Could not find revision ({revision}:-1)",
            "type": "RevisionNotFound"
          }
        }
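
Because an invalid rev ID produces an `error` entry rather than a `score` entry, a lookup that assumes `score` is always present would raise a `KeyError` for that ID. The following is a minimal sketch of a more tolerant lookup, assuming the response layout shown above. The helper name `get_prediction_or_none` is hypothetical and is not part of this notebook's code.

```python
def get_prediction_or_none(score_json, rev_id, project='enwiki', model='wp10'):
    """Return the predicted class for rev_id, or None if no score is available.

    Hypothetical helper: assumes the v3 response layout shown in this notebook.
    """
    entry = score_json.get(project, {}).get('scores', {}).get(str(rev_id), {})
    model_result = entry.get(model, {})
    if 'score' not in model_result:
        # Covers both an 'error' entry and a rev ID missing from the response.
        return None
    return model_result['score']['prediction']


# Response fragment trimmed from outputs shown in this notebook.
sample = {
    'enwiki': {
        'scores': {
            '774499188': {'wp10': {'score': {'prediction': 'Stub'}}},
            '-1': {'wp10': {'error': {'type': 'RevisionNotFound'}}},
        }
    }
}

print(get_prediction_or_none(sample, 774499188))  # Stub
print(get_prediction_or_none(sample, -1))         # None
```

Returning `None` keeps a batch run going when one ID is bad; whether to drop or retry those rows afterwards is a separate decision.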

In [351]:
user_agent = 'https://github.com/aenfield'

def get_full_ores_score_json(rev_id):
    """Return the full ORES score JSON for a single rev ID. See https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model"""
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/{revid}/{model}'

    # TODO update to hardcode enwiki and wp10?
    params = {'project' : 'enwiki',
              'model' : 'wp10',
              'revid' : rev_id
              }

    api_call = requests.get(endpoint.format(**params), headers = {'User-Agent':'{}'.format(user_agent)})
    return api_call.json()

def get_multiple_full_ores_score_json(rev_ids):
    """Return the full ORES score JSON for multiple rev IDs, with rev_ids as a list."""
    endpoint = 'https://ores.wikimedia.org/v3/scores/enwiki?models=wp10&revids={rev_ids_delimited}'

    #rev_ids_delimited = '802551672|774499188|803841397'
    rev_ids_delimited = '|'.join([str(i) for i in rev_ids])
    #print(rev_ids_delimited)
    #print()

    params = { 'rev_ids_delimited' : rev_ids_delimited }

    api_call = requests.get(endpoint.format(**params), headers = {'User-Agent':'{}'.format(user_agent)})
    return api_call.json()
    

#def get_ores_prediction_from_score_json(score_json, index=0):
def get_ores_prediction_from_score_json(score_json, rev_id):
    """Return the most likely article type, per ORES. Assumes a JSON dict from Ores. """
    #return score_json['enwiki']['scores'][list(score_json['enwiki']['scores'].keys())[index]]['wp10']['score']['prediction']
    return score_json['enwiki']['scores'][str(rev_id)]['wp10']['score']['prediction']

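def get_ores_prediction(rev_id):
    """Return the ORES prediction for a single rev ID."""
    j = get_full_ores_score_json(rev_id)
    # The lookup is keyed by rev ID, so pass it through along with the JSON.
    return get_ores_prediction_from_score_json(j, rev_id)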

**TODO** Update the code above to handle the fact that the rev IDs in the JSON aren't guaranteed to come back in the same order as they were specified in the call. It might be as simple as passing the rev ID itself as the parameter instead of an index, and then looking up the rev ID directly in the dict instead of looking up a key by index.

**TODO** Then eyeball the results in the file and make sure they match the results the single-ID call gives.
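
One way to make that eyeball check less manual is to compare the two sets of predictions programmatically. This is only a sketch: the function name `find_prediction_mismatches` is hypothetical, and it assumes the predictions have already been collected into dicts keyed by rev ID string.

```python
def find_prediction_mismatches(batch_predictions, single_predictions):
    """Return {rev_id: (batch_value, single_value)} for IDs whose predictions differ.

    Both arguments are dicts mapping rev ID (as a string) to a predicted class.
    IDs present in only one dict are reported with None for the missing side.
    """
    mismatches = {}
    for rev_id in sorted(set(batch_predictions) | set(single_predictions)):
        batch_value = batch_predictions.get(rev_id)
        single_value = single_predictions.get(rev_id)
        if batch_value != single_value:
            mismatches[rev_id] = (batch_value, single_value)
    return mismatches


# Predictions for the first two rev IDs, taken from outputs in this notebook.
batch = {'802551672': 'C', '774499188': 'Stub'}
single = {'802551672': 'C', '774499188': 'Stub'}
print(find_prediction_mismatches(batch, single))  # {}
```

An empty dict means the batch and single-ID calls agree for every ID checked.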

In [345]:
%time foo = get_multiple_full_ores_score_json(d_wikipedia['last_edit'][:2].values)
foo

802551672|774499188

CPU times: user 26.1 ms, sys: 6.64 ms, total: 32.7 ms
Wall time: 267 ms


{'enwiki': {'models': {'wp10': {'version': '0.5.0'}},
  'scores': {'774499188': {'wp10': {'score': {'prediction': 'Stub',
      'probability': {'B': 0.03488477079112925,
       'C': 0.06953258948284814,
       'FA': 0.0025762575670963965,
       'GA': 0.007911851615317388,
       'Start': 0.4106575723489943,
       'Stub': 0.4744369581946146}}}},
   '802551672': {'wp10': {'score': {'prediction': 'C',
      'probability': {'B': 0.17804496652032606,
       'C': 0.4950097267737061,
       'FA': 0.010481758617897869,
       'GA': 0.05901957893404478,
       'Start': 0.25371601149510725,
       'Stub': 0.0037279576589180907}}}}}}}

In [346]:
foo['enwiki']['scores']['774499188']['wp10']['score']['prediction']

'Stub'

In [347]:
get_ores_prediction_from_score_json(foo, '802551672')

'C'

In [348]:
def chunker(seq, size):
    """Get a generator that returns chunks of size 'size' of the sequence in 'seq'.
    
    From: https://stackoverflow.com/questions/434287/what-is-the-most-pythonic-way-to-iterate-over-a-list-in-chunks
    """
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))
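
As a quick check of how `chunker` splits the data, the sketch below applies the same function to a stand-in list of 47,997 items (the row count of `page_data.csv`) with a chunk size of 140. The stand-in list is an assumption for illustration only; the notebook itself chunks the array of `last_edit` values.

```python
def chunker(seq, size):
    """Return a generator of successive slices of seq, each at most size items long."""
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))


# Stand-in for the 47,997 rev IDs; only the length matters here.
stand_in_ids = list(range(47997))
chunks = list(chunker(stand_in_ids, 140))

print(len(chunks))      # 343
print(len(chunks[0]))   # 140
print(len(chunks[-1]))  # 117
```

The last chunk is shorter because 47,997 is not a multiple of 140; slicing past the end of a sequence simply returns the remaining items.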

In [352]:
import csv
with open('article_scores.csv', 'w', newline='') as output_file:
    writer = csv.writer(output_file, delimiter=',')
    
    progress_frequency = 5
    count_of_rev_ids_in_chunk = 140
    rev_ids_in_chunks = [x for x in chunker(d_wikipedia['last_edit'].values, count_of_rev_ids_in_chunk)]
    
    #for chunk_index, chunk_of_rev_ids in enumerate(rev_ids_in_chunks[:2]):
    for chunk_index, chunk_of_rev_ids in enumerate(rev_ids_in_chunks):
        if (chunk_index % progress_frequency == 0): print(f"Retrieving chunk with index {chunk_index}.")
        
        #print(chunk_of_rev_ids)
        #print()
        
        scores_json = get_multiple_full_ores_score_json(chunk_of_rev_ids)
        for rev_id in chunk_of_rev_ids:
        #for (rev_id, rev_id_index) in zip(chunk_of_rev_ids, range(len(chunk_of_rev_ids))):
            #print(f'Processing {rev_id}.')
            #print(f'With {scores_json}.')
            writer.writerow([rev_id, get_ores_prediction_from_score_json(scores_json, rev_id)])
            
    print(f"Retrieved {chunk_index + 1} chunks. Done.")

Retrieving chunk with index 0.
Retrieving chunk with index 5.
Retrieving chunk with index 10.
Retrieving chunk with index 15.
Retrieving chunk with index 20.
Retrieving chunk with index 25.
Retrieving chunk with index 30.
Retrieving chunk with index 35.
Retrieving chunk with index 40.
Retrieving chunk with index 45.
Retrieving chunk with index 50.
Retrieving chunk with index 55.
Retrieving chunk with index 60.
Retrieving chunk with index 65.
Retrieving chunk with index 70.
Retrieving chunk with index 75.
Retrieving chunk with index 80.
Retrieving chunk with index 85.
Retrieving chunk with index 90.
Retrieving chunk with index 95.
Retrieving chunk with index 100.
Retrieving chunk with index 105.
Retrieving chunk with index 110.
Retrieving chunk with index 115.
Retrieving chunk with index 120.
Retrieving chunk with index 125.
Retrieving chunk with index 130.
Retrieving chunk with index 135.
Retrieving chunk with index 140.
Retrieving chunk with index 145.
Retrieving chunk with index 150.

In [317]:
len(d_wikipedia['last_edit'])

47997

In [310]:
for i in range(50):
    if (i % 5 == 0): print(i)

0
5
10
15
20
25
30
35
40
45


In [291]:
list(seqs[3])

[421,
 422,
 423,
 424,
 425,
 426,
 427,
 428,
 429,
 430,
 431,
 432,
 433,
 434,
 435,
 436,
 437,
 438,
 439,
 440,
 441,
 442,
 443,
 444,
 445,
 446,
 447,
 448,
 449,
 450,
 451,
 452,
 453,
 454,
 455,
 456,
 457,
 458,
 459,
 460,
 461,
 462,
 463,
 464,
 465,
 466,
 467,
 468,
 469,
 470,
 471,
 472,
 473,
 474,
 475,
 476,
 477,
 478,
 479,
 480,
 481,
 482,
 483,
 484,
 485,
 486,
 487,
 488,
 489,
 490,
 491,
 492,
 493,
 494,
 495,
 496,
 497,
 498,
 499,
 500,
 501,
 502,
 503,
 504,
 505,
 506,
 507,
 508,
 509,
 510,
 511,
 512,
 513,
 514,
 515,
 516,
 517,
 518,
 519,
 520,
 521,
 522,
 523,
 524,
 525,
 526,
 527,
 528,
 529,
 530,
 531,
 532,
 533,
 534,
 535,
 536,
 537,
 538,
 539,
 540,
 541,
 542,
 543,
 544,
 545,
 546,
 547,
 548,
 549,
 550,
 551,
 552,
 553,
 554,
 555,
 556,
 557,
 558,
 559,
 560]

In [290]:
d_wikipedia['last_edit'][421:561].values

array([791693230, 786975664, 786628383, 789270814, 779725350, 801084884,
       790720159, 788182293, 790720559, 790720433, 788250379, 798610427,
       802444424, 797116789, 799304187, 799467293, 799450784, 786199183,
       790719984, 788182602, 802437823, 799695433, 745805099, 790208674,
       788889966, 789270784, 788890514, 802437830, 788182587, 797855253,
       796896129, 788180240, 778418380, 802442569, 802437953, 802438008,
       802437918, 796600327, 791415808, 795321320, 802438004, 782099774,
       789270614, 803812055, 751432945, 789270021, 802444443, 789270004,
       801554031, 788181008, 802437935, 802444467, 764898690, 796371886,
       789269944, 747064542, 802437981, 749473060, 789270096, 789270087,
       790720862, 798292485, 789270464, 802438011, 791694358, 790721337,
       751070415, 802438016, 802438013, 788910415, 747887240, 789270517,
       749828657, 802444462, 750648438, 789270025, 789269991, 802437920,
       788182594, 789451460, 750544980, 752913381, 

In [289]:
d_wikipedia['last_edit'][seqs[3]].values

KeyError: range(421, 561)

In [272]:
range(1, 5000)[30:50]

range(31, 51)

In [229]:
get_ores_prediction_from_score_json(foo)

'Stub'

In [230]:
get_ores_prediction_from_score_json(foo, 1)

'C'

In [231]:
get_ores_prediction_from_score_json(foo, 2)

IndexError: list index out of range

In [211]:
d_wikipedia['last_edit'][:20].values

array([802551672, 774499188, 803841397, 789818648, 785284614, 798644673,
       728644481, 788591677, 758713659, 802860970, 797469371, 804349394,
       799618550, 805063877, 718383950, 805775169, 778690357, 779839643,
       803055503, 805920528])

In [208]:
%time foo = get_multiple_full_ores_score_json('foo')
foo

CPU times: user 25.3 ms, sys: 4.28 ms, total: 29.6 ms
Wall time: 294 ms


<Response [500]>

In [200]:
foo['enwiki']['scores'][list(foo['enwiki']['scores'].keys())[4]]['wp10']['score']['prediction']

'Start'

In [191]:
%time bar = get_multiple_full_ores_score_json(3)

CPU times: user 26.4 ms, sys: 3.56 ms, total: 29.9 ms
Wall time: 313 ms


In [178]:
bar

{'enwiki': {'models': {'wp10': {'version': '0.5.0'}},
  'scores': {'718383950': {'wp10': {'score': {'prediction': 'Stub',
      'probability': {'B': 0.00867108446083565,
       'C': 0.010201201419512923,
       'FA': 0.0012308185869063053,
       'GA': 0.002347248512459868,
       'Start': 0.08362945430408676,
       'Stub': 0.8939201927161984}}}},
   '728644481': {'wp10': {'score': {'prediction': 'Stub',
      'probability': {'B': 0.011743849354179213,
       'C': 0.018551150267917215,
       'FA': 0.0017216633333940244,
       'GA': 0.004963145860590862,
       'Start': 0.2475321059686044,
       'Stub': 0.7154880852153144}}}},
   '758713659': {'wp10': {'score': {'prediction': 'C',
      'probability': {'B': 0.13051629022094477,
       'C': 0.661331647033431,
       'FA': 0.008345185189334319,
       'GA': 0.0506476033411268,
       'Start': 0.13543386979975322,
       'Stub': 0.01372540441540996}}}},
   '774499188': {'wp10': {'score': {'prediction': 'Stub',
      'probability': {'B'

In [179]:
#bar['enwiki']['scores'][list(j['enwiki']['scores'].keys())[0]]['wp10']['score']['prediction']

IndexError: list index out of range

In [188]:
[bar['enwiki']['scores'][list(bar['enwiki']['scores'].keys())[i]]['wp10']['score']['prediction'] for i in range(10)]

['Stub', 'Stub', 'C', 'Stub', 'Start', 'Stub', 'Start', 'Start', 'Start', 'GA']

In [128]:
type(j)

dict

In [129]:
get_ores_prediction('797882120')

'Start'

In [130]:
j = get_full_ores_score_json('797882120')
j

{'enwiki': {'models': {'wp10': {'version': '0.5.0'}},
  'scores': {'797882120': {'wp10': {'score': {'prediction': 'Start',
      'probability': {'B': 0.0325056273665757,
       'C': 0.10161634736900718,
       'FA': 0.003680032854794337,
       'GA': 0.021044772033944954,
       'Start': 0.8081343649161963,
       'Stub': 0.033018855459481376}}}}}}}

In [131]:
list(j['enwiki']['scores'].keys())[0]

'797882120'

In [138]:
d_wikipedia['last_edit'][:10]

0    802551672
1    774499188
2    803841397
3    789818648
4    785284614
5    798644673
6    728644481
7    788591677
8    758713659
9    802860970
Name: last_edit, dtype: int64

In [143]:
d_wikipedia['last_edit'][:10].values

array([802551672, 774499188, 803841397, 789818648, 785284614, 798644673,
       728644481, 788591677, 758713659, 802860970])

In [151]:
'|'.join([str(i) for i in d_wikipedia['last_edit'][:20].values])

'802551672|774499188|803841397|789818648|785284614|798644673|728644481|788591677|758713659|802860970|797469371|804349394|799618550|805063877|718383950|805775169|778690357|779839643|803055503|805920528'

In [146]:
'|'.join(['1','2','3'])

'1|2|3'

In [132]:
j['enwiki']['scores']['797882120']['wp10']['score']['prediction']

'Start'

In [133]:
list(j['enwiki']['scores'].keys())[0]

'797882120'

In [134]:
j['enwiki']['scores'][list(j['enwiki']['scores'].keys())[0]]['wp10']['score']['prediction']

'Start'

Based on one run of the cell below, it takes 29.2 s to process 100 IDs, or 0.292 s/ID. Since we have about 48,000 IDs, this is ~3.9 hours. The 'many at the same time' API appears much faster: requesting 20 IDs at once (just getting the data, not pulling out the results) takes 300-800 ms.

In [137]:
%time d_wikipedia['last_edit'][:100].apply(get_ores_prediction)

CPU times: user 2.6 s, sys: 117 ms, total: 2.72 s
Wall time: 29.2 s


0         C
1      Stub
2         C
3     Start
4     Start
5      Stub
6      Stub
7     Start
8         C
9     Start
10       GA
11    Start
12    Start
13     Stub
14     Stub
15    Start
16    Start
17     Stub
18       GA
19       GA
20     Stub
21        C
22    Start
23    Start
24        B
25        C
26     Stub
27     Stub
28        C
29     Stub
      ...  
70        C
71       GA
72     Stub
73     Stub
74       GA
75    Start
76     Stub
77     Stub
78     Stub
79    Start
80        B
81    Start
82        C
83        C
84       GA
85    Start
86       GA
87    Start
88    Start
89        C
90    Start
91    Start
92     Stub
93        C
94     Stub
95        C
96    Start
97     Stub
98        C
99     Stub
Name: last_edit, Length: 100, dtype: object
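
The timing estimate above can be reproduced with a few lines of arithmetic. The batch figures are a rough sketch only: they assume that every 20-ID request takes the same 300-800 ms observed in a few trial calls, which was not measured across the full data set or for larger batch sizes.

```python
import math

n_ids = 48000               # approximate number of rev IDs (47,997 in the data)
secs_per_id = 29.2 / 100    # measured: 29.2 s wall time for 100 single-ID calls

single_call_hours = n_ids * secs_per_id / 3600
print(f"Single-ID calls: about {single_call_hours:.1f} hours")

batch_size = 20             # batch size used in the trial calls
n_batches = math.ceil(n_ids / batch_size)

# Assumed per-request latency range, taken from the 300-800 ms observation.
low_minutes = n_batches * 0.3 / 60
high_minutes = n_batches * 0.8 / 60
print(f"Batched calls: roughly {low_minutes:.0f}-{high_minutes:.0f} minutes")
```

Even at the slow end of that range, batching would cut the run from hours to minutes, which is why the batch endpoint is the one used for the full pull.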