# Enfield, Andrew - DATA 512, A2: Bias in Data

The assignment is at https://wiki.communitydata.cc/HCDS_(Fall_2017)/Assignments#A2:_Bias_in_data.

This notebook loads Wikipedia page data by country and mid-2015 population data, pulls an ORES article quality prediction for each page's last revision, and merges the predictions with the page data. For more information about the work and data, refer to the [README](Readme.md).

A few notes:
- Normally I'd prefer to keep the explanation and background that's in the README here in the notebook, so everything's in a single file, but I've split it up this time as that's what the assignment requested. I won't copy/paste because keeping duplicate content in sync is horrible.
- Real reproducibility needs tests for the code. A lot of my implementation below is in functions. I'd normally put these functions in at least one separate file that I import into this notebook, and I'd have tests in an additional file. For this assignment I'll just keep everything in this file, for simplicity, even though it means I can't test the code the way I normally would.

# Prereqs

This code requires the libraries imported below.

In [1]:
# retrieve, load data
import requests
import json
import csv
import os

# prepare and analyze data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#from mpl_toolkits.axes_grid.anchored_artists import AnchoredText # for addtl annotations in charts
#from matplotlib.ticker import FuncFormatter # for custom axis labels
from IPython.core.pylabtools import figsize
import seaborn as sns # for formatting
%matplotlib inline 

In [2]:
sns.set_style("whitegrid")
figsize(14,7)

# Load data

This section loads the two local data files: `page_data.csv`, which lists Wikipedia pages with their country and last revision ID, and `Population Mid-2015.csv`, which lists mid-2015 population by location.

In [3]:
d_wikipedia = pd.read_csv('page_data.csv')
d_wikipedia.shape

(47997, 3)

In [4]:
d_wikipedia[:3]

Unnamed: 0,country,page,last_edit
0,Abkhazia,Zurab Achba,802551672
1,Abkhazia,Garri Aiba,774499188
2,Abkhazia,Zaur Avidzba,803841397


In [5]:
d_population = pd.read_csv('Population Mid-2015.csv', skiprows=2, thousands=',')
d_population.shape

(210, 6)

In [6]:
d_wikipedia.groupby(['country']).size().sort_values(ascending=False)[:10]

country
France            1695
Australia         1568
China             1138
United States     1097
Mexico            1079
Pakistan          1069
India              981
Russia             956
Spain              907
United Kingdom     865
dtype: int64

In [7]:
d_population[:3]

Unnamed: 0,Location,Location Type,TimeFrame,Data Type,Data,Footnotes
0,Afghanistan,Country,Mid-2015,Number,32247000,
1,Albania,Country,Mid-2015,Number,2892000,
2,Algeria,Country,Mid-2015,Number,39948000,


## Pull article scores

Docs: https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model and https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context

Note that when I call the multiple rev ID API with a batch of valid IDs plus one that's a text string, it returns a 500 and no data at all. However, when I call it with a batch of valid IDs plus a numeric ID that doesn't exist, like -1, I get good data for the valid IDs and output like the following for the invalid one. This seems good: I'll go ahead and pull batches of IDs without further error handling.

    "-1": {
      "wp10": {
        "error": {
          "message": "RevisionNotFound: Could not find revision ({revision}:-1)",
          "type": "RevisionNotFound"
        }
      }
    }
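
An error entry like this has no `score` key, so looking up `['score']['prediction']` on it raises a `KeyError`. Below is a minimal sketch of a more tolerant lookup, assuming the response shape shown above. The helper name `get_prediction_or_none` and the `sample` dict are hypothetical and only for illustration.

```python
def get_prediction_or_none(score_json, rev_id, context='enwiki', model='wp10'):
    """Return the ORES prediction for rev_id, or None if ORES returned an error entry."""
    model_result = score_json[context]['scores'][str(rev_id)][model]
    if 'error' in model_result:
        return None
    return model_result['score']['prediction']

# Hand-built sample that mimics the response shape (not real API output).
sample = {
    'enwiki': {
        'scores': {
            '802551672': {'wp10': {'score': {'prediction': 'C'}}},
            '-1': {'wp10': {'error': {
                'message': 'RevisionNotFound: Could not find revision ({revision}:-1)',
                'type': 'RevisionNotFound'}}},
        }
    }
}

print(get_prediction_or_none(sample, 802551672))
print(get_prediction_or_none(sample, -1))
```

Returning `None` would leave an empty cell in the CSV for that revision, which pandas reads back as `NaN`.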

In [8]:
user_agent = 'https://github.com/aenfield'

def get_multiple_full_ores_score_json(rev_ids):
    """Return the parsed JSON from the ORES v3 scores API for the wp10 model, with rev_ids as a list of revision IDs."""
    endpoint = 'https://ores.wikimedia.org/v3/scores/enwiki?models=wp10&revids={rev_ids_delimited}'

    # the API takes multiple revision IDs in one request, separated by '|'
    rev_ids_delimited = '|'.join(str(i) for i in rev_ids)

    api_call = requests.get(endpoint.format(rev_ids_delimited=rev_ids_delimited), headers={'User-Agent': user_agent})
    return api_call.json()
    
def get_ores_prediction_from_score_json(score_json, rev_id):
    """Return the most likely article quality class for rev_id, per ORES.

    Assumes a JSON dict from ORES; raises KeyError if ORES returned an error entry for rev_id.
    """
    return score_json['enwiki']['scores'][str(rev_id)]['wp10']['score']['prediction']
            
def chunker(seq, size):
    """Get a generator that returns chunks of size 'size' of the sequence in 'seq'.
    
    From: https://stackoverflow.com/questions/434287/what-is-the-most-pythonic-way-to-iterate-over-a-list-in-chunks
    """
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

def get_article_scores_data(rev_ids, force=False, verbose=False):
    """Download ORES predictions for the specified revision IDs and save them to a local CSV file, by default only if the local file doesn't exist. Return the file's contents as a dataframe indexed by revision ID.
    
    rev_ids - a sequence of revision IDs, for example d_wikipedia['last_edit'].values
    force - download data and overwrite local file, even if file already exists; default is False
    verbose - print diagnostic data; default is False
    """
    
    filename = 'article_scores.csv'

    if force or not os.path.exists(filename):
        # download and save the data locally
        if verbose: print("Local file doesn't exist or download was forced. Downloading...")
        with open(filename, 'w', newline='') as output_file:
            writer = csv.writer(output_file, delimiter=',')
            writer.writerow(['RevisionId','Score'])

            progress_frequency = 25
            count_of_rev_ids_in_chunk = 140
            rev_ids_in_chunks = list(chunker(rev_ids, count_of_rev_ids_in_chunk))

            for chunk_index, chunk_of_rev_ids in enumerate(rev_ids_in_chunks):
                if (chunk_index % progress_frequency == 0): print(f"Retrieving chunk with index {chunk_index}.")

                scores_json = get_multiple_full_ores_score_json(chunk_of_rev_ids)
                for rev_id in chunk_of_rev_ids:
                    writer.writerow([rev_id, get_ores_prediction_from_score_json(scores_json, rev_id)])

            if verbose: print(f"Retrieved {len(rev_ids_in_chunks)} chunks and saved to {filename}. Done.")
    else:
        if verbose: print("Local file exists already.")
                  
    # now open and return dataframe
    return pd.read_csv(filename, index_col='RevisionId')

In [9]:
d_scores = get_article_scores_data(d_wikipedia['last_edit'].values)
d_scores.shape

(47997, 1)
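
To see how the batching behaves, here is a small sketch that repeats the `chunker` helper and applies it to a stand-in list with the same length as the real list of revision IDs. The stand-in values are made up; only the count and the chunk size of 140 come from the code above.

```python
def chunker(seq, size):
    """Get a generator that returns chunks of size 'size' of the sequence in 'seq'."""
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

# Stand-in for d_wikipedia['last_edit'].values: 47997 made-up IDs.
stand_in_rev_ids = list(range(47997))

chunks = list(chunker(stand_in_rev_ids, 140))

# 342 full chunks of 140 IDs, plus one final chunk with the remaining 117.
print(len(chunks), len(chunks[-1]))  # prints 343 117
```

Each chunk becomes one API request, so a full download makes 343 requests.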

In [10]:
d_scores[:3]

RevisionId,Score
802551672,C
774499188,Stub
803841397,C


In [23]:
d_wikipedia_with_scores = pd.merge(left=d_wikipedia, right=d_scores, left_on='last_edit', right_index=True, how='left')
d_wikipedia_with_scores.shape

(49827, 4)
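
The merged frame has more rows than `d_wikipedia` (49827 versus 47997). A left merge returns one row per matching pair of keys, so it only preserves the left row count when the right-hand keys are unique. Below is a minimal sketch on toy frames showing the effect, and how the `validate` argument of `pd.merge` can surface it at merge time. The frames `left` and `right` and their values are illustrative, not taken from the dataset.

```python
import pandas as pd

# Toy stand-ins for d_wikipedia and d_scores (values are illustrative only).
left = pd.DataFrame({'page': ['A', 'B', 'C'], 'last_edit': [10, 10, 20]})
right = pd.DataFrame({'Score': ['Start', 'Start', 'Stub']},
                     index=pd.Index([10, 10, 20], name='RevisionId'))

merged = pd.merge(left, right, left_on='last_edit', right_index=True, how='left')

# ID 10 appears twice on each side, so it yields 2 x 2 = 4 rows; ID 20 yields 1.
print(len(left), len(merged))  # prints 3 5

# validate='many_to_one' asks pandas to check that the right-hand keys are unique.
try:
    pd.merge(left, right, left_on='last_edit', right_index=True,
             how='left', validate='many_to_one')
    guard_tripped = False
except pd.errors.MergeError:
    guard_tripped = True

print(guard_tripped)  # prints True
```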

In [24]:
d_wikipedia_with_scores[:3]

Unnamed: 0,country,page,last_edit,Score
0,Abkhazia,Zurab Achba,802551672,C
1,Abkhazia,Garri Aiba,774499188,Stub
2,Abkhazia,Zaur Avidzba,803841397,C


In [25]:
d_wikipedia_with_scores['Score'].value_counts(dropna=False)

Stub     25336
Start    16239
C         6203
GA         937
B          777
FA         335
Name: Score, dtype: int64

In [26]:
d_wikipedia.shape

(47997, 3)

In [31]:
len(d_wikipedia_with_scores) - len(d_wikipedia)

1830

In [38]:
d_scores.shape

(47997, 1)

In [27]:
d_wikipedia_with_scores.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 49827 entries, 0 to 47996
Data columns (total 4 columns):
country      49827 non-null object
page         49827 non-null object
last_edit    49827 non-null int64
Score        49827 non-null object
dtypes: int64(1), object(3)
memory usage: 1.9+ MB


In [32]:
# d_wikipedia_with_scores.to_excel('eyeball.xlsx')

In [36]:
len(d_wikipedia_with_scores['last_edit'].unique())

47122

In [37]:
len(d_wikipedia['last_edit'].unique())

47122

In [46]:
len(d_scores.index.unique())

47122

In [47]:
d_wikipedia[d_wikipedia['page'] == 'George Grey']

Unnamed: 0,country,page,last_edit
9821,Cape Colony,George Grey,647367482


In [49]:
d_wikipedia[d_wikipedia['last_edit'] == 647367482]

Unnamed: 0,country,page,last_edit
9821,Cape Colony,George Grey,647367482
41369,South Africa,Klaas Afrikaner,647367482


In [50]:
d_wikipedia_with_scores[d_wikipedia_with_scores['page'] == 'George Grey']

Unnamed: 0,country,page,last_edit,Score
9821,Cape Colony,George Grey,647367482,Start
9821,Cape Colony,George Grey,647367482,Start


In [51]:
d_wikipedia_with_scores[d_wikipedia_with_scores['last_edit'] == 647367482]

Unnamed: 0,country,page,last_edit,Score
9821,Cape Colony,George Grey,647367482,Start
9821,Cape Colony,George Grey,647367482,Start
41369,South Africa,Klaas Afrikaner,647367482,Start
41369,South Africa,Klaas Afrikaner,647367482,Start
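
The rows above suggest the cause: revision ID 647367482 appears on two rows of `d_wikipedia`, so it is also scored and saved twice, and the merge matches each of the two page rows against both score rows. One possible fix, sketched here on toy data rather than the real frames (all names and values are illustrative), is to keep a single score row per revision ID before merging.

```python
import pandas as pd

# Toy stand-ins: two pages share revision ID 10, and the scores repeat it too.
pages = pd.DataFrame({'page': ['A', 'B', 'C'], 'last_edit': [10, 10, 20]})
scores = pd.DataFrame({'Score': ['Start', 'Start', 'Stub']},
                      index=pd.Index([10, 10, 20], name='RevisionId'))

# Keep only the first score row for each revision ID.
unique_scores = scores[~scores.index.duplicated(keep='first')]

pages_with_scores = pd.merge(pages, unique_scores, left_on='last_edit',
                             right_index=True, how='left',
                             validate='many_to_one')

# One output row per page, with both pages for ID 10 getting the same score.
print(len(pages), len(pages_with_scores))  # prints 3 3
```

This assumes the repeated score rows for a revision ID are identical, as they are in the output above, so keeping the first one loses nothing.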
