# A2: Bias in Data Assignment

### DATA 512

#### Emily Yamauchi

The goal of this assignment is to explore the concept of bias through data on Wikipedia articles - specifically, articles on political figures from a variety of countries. For this assignment, you will combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.  

You are expected to perform an analysis of how the coverage of politicians on Wikipedia and the quality of articles about politicians vary between countries. Your analysis will consist of a series of tables that show:  

1. the countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
2. the countries with the highest and lowest proportion of high quality articles about politicians.
3. a ranking of geographic regions by articles-per-person and proportion of high quality articles.  
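
The first two tables come down to two ratios per country: articles per person and the share of articles rated high quality. The sketch below shows that arithmetic only. The labels and counts in it are made-up illustrative values, not results; treating FA and GA as "high quality" is an assumption, since this notebook does not state the cutoff; the population is assumed to be given in millions; and `coverage_pct` and `high_quality_pct` are hypothetical helper names.

```python
# Assumption: "high quality" means FA or GA; adjust if your definition differs.
HIGH_QUALITY = {"FA", "GA"}


def coverage_pct(article_count, population_millions):
    """Articles per person, expressed as a percentage of the population."""
    return 100 * article_count / (population_millions * 1_000_000)


def high_quality_pct(predictions):
    """Share of predicted labels that fall in HIGH_QUALITY, as a percentage."""
    if not predictions:
        return 0.0
    hits = sum(1 for label in predictions if label in HIGH_QUALITY)
    return 100 * hits / len(predictions)


# Hypothetical country: 4 articles, population of 2 million.
labels = ["Stub", "GA", "Start", "FA"]
print(coverage_pct(len(labels), 2.0))
print(high_quality_pct(labels))
```

Keeping both ratios as percentages makes countries with very different populations easier to compare in one table.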

You are also expected to write a short reflection on the project that focuses on how both your findings from this analysis and the process you went through to reach those findings help you understand the causes and consequences of biased data in large, complex data science projects.


## Step 1: Getting the Article and Population Data

The first step is getting the data, which lives in several different places. The Wikipedia [politicians by country dataset](https://figshare.com/articles/Untitled_Item/5513449) can be found on Figshare. Read through the documentation for this repository, then download and unzip it to extract the data file, which is called `page_data.csv`.  

The population data is available in CSV format as [`WPDS_2020_data.csv`](https://docs.google.com/spreadsheets/d/1CFJO2zna2No5KqNm9rPK5PCACoXKzb-nycJFhV689Iw/edit?usp=sharing). This dataset is drawn from the [world population data sheet](https://www.prb.org/international/indicator/population/table/) published by the Population Reference Bureau.

In [1]:
from zipfile import ZipFile
import os

import pandas as pd

In [2]:
## Step 0: download the two files above to data_raw directory

In [3]:
# Unzip politician file

# extract next to the archive without changing the working directory
with ZipFile('data_raw/country.zip') as zipfiles:
    zipfiles.extractall('data_raw')

os.getcwd()

'C:\\Users\\admin\\Documents\\UW\\DATA512\\Assignments\\A2'

In [4]:
# load country politician data from unzipped folder

pols = pd.read_csv('data_raw/country/data/page_data.csv')

pols.head()

| | page | country | rev_id |
|---|---|---|---|
| 0 | Template:ZambiaProvincialMinisters | Zambia | 235107991 |
| 1 | Bir I of Kanem | Chad | 355319463 |
| 2 | Template:Zimbabwe-politician-stub | Zimbabwe | 391862046 |
| 3 | Template:Uganda-politician-stub | Uganda | 391862070 |
| 4 | Template:Namibia-politician-stub | Namibia | 391862409 |


In [5]:
# load population data

# skip the 4 rows that precede the column header in the export
pops = pd.read_csv('data_raw/export.csv', skiprows=4)

pops.head()

| | FIPS | Name | Type | TimeFrame | Data |
|---|---|---|---|---|---|
| 0 | WORLD | WORLD | World | 2019 | 7772.85 |
| 1 | AFRICA | AFRICA | Sub-Region | 2019 | 1337.918 |
| 2 | NORTHERN AFRICA | NORTHERN AFRICA | Sub-Region | 2019 | 244.344 |
| 3 | DZ | Algeria | Country | 2019 | 44.357 |
| 4 | EG | Egypt | Country | 2019 | 100.803 |


## Step 2: Cleaning the Data

Both `page_data.csv` and `WPDS_2020_data.csv` contain some rows that you will need to filter out and/or ignore when you combine the datasets in the next step. In the case of `page_data.csv`, the dataset contains some page names that start with the string "`Template:`". These pages are not Wikipedia articles, and should not be included in your analysis.  

Similarly, `WPDS_2020_data.csv` contains some rows that provide cumulative regional population counts, rather than country-level counts. These rows are distinguished by having ALL CAPS values in the 'geography' field (e.g. `AFRICA`, `OCEANIA`). These rows won't match the country values in `page_data.csv`, but you will want to retain them (either in the original file, or a separate file) so that you can report coverage and quality by region in the analysis section.
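
One way to keep the regional rows while still matching countries is to walk the rows in file order and remember the most recent ALL CAPS name. The sketch below is written in plain Python so it runs without the data files. It assumes each region row appears before the countries that belong to it, which the first rows of the population file suggest but which should be checked against the full file. `assign_regions` is a hypothetical helper, and the sample rows are the first five rows of the population data shown in Step 1.

```python
def assign_regions(rows):
    """Split (name, population) rows into regions and countries.

    Assumes rows are in file order, with each ALL CAPS region row
    appearing before the countries that belong to it.
    """
    regions = {}
    countries = []
    current_region = None
    for name, population in rows:
        if name.isupper():
            # Region (or WORLD) row: remember it for the countries that follow.
            regions[name] = population
            current_region = name
        else:
            countries.append((name, current_region, population))
    return regions, countries


rows = [
    ("WORLD", 7772.85),
    ("AFRICA", 1337.918),
    ("NORTHERN AFRICA", 244.344),
    ("Algeria", 44.357),
    ("Egypt", 100.803),
]
regions, countries = assign_regions(rows)
print(countries)
```

Note that a country is tagged with the closest preceding region, so Algeria maps to `NORTHERN AFRICA` rather than `AFRICA` in this sample.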

In [6]:
# how many template files?

pols.loc[pols['page'].str.startswith('Template:')].shape

(496, 3)

In [236]:
# drop the template pages

pols_keep = pols[~pols['page'].str.startswith('Template:')].reset_index(drop=True)

pols_keep.head()

| | page | country | rev_id |
|---|---|---|---|
| 0 | Bir I of Kanem | Chad | 355319463 |
| 1 | Information Minister of the Palestinian Nation... | Palestinian Territory | 393276188 |
| 2 | Yos Por | Cambodia | 393822005 |
| 3 | Julius Gregr | Czech Republic | 395521877 |
| 4 | Edvard Gregr | Czech Republic | 395526568 |


In [8]:
# population types?

pops.Type.unique()

array(['World', 'Sub-Region', 'Country'], dtype=object)

In [9]:
# keep just countries

pops_country = pops.loc[pops.Type == 'Country'].copy().reset_index(drop=True)

pops_country.head()

| | FIPS | Name | Type | TimeFrame | Data |
|---|---|---|---|---|---|
| 0 | DZ | Algeria | Country | 2019 | 44.357 |
| 1 | EG | Egypt | Country | 2019 | 100.803 |
| 2 | LY | Libya | Country | 2019 | 6.891 |
| 3 | MA | Morocco | Country | 2019 | 35.952 |
| 4 | SD | Sudan | Country | 2019 | 43.849 |


In [237]:
# write clean files to csv

pols_keep.to_csv('data_clean/politicians.csv', index=False)

pops_country.to_csv('data_clean/populations.csv', index=False)

## Step 3: Getting Article Quality Predictions

Now you need to get the predicted quality scores for each article in the Wikipedia dataset. We're using ORES, a machine learning service that estimates Wikipedia article quality. The name was originally an acronym for "Objective Revision Evaluation Service", but the service is now simply called "ORES". The article quality estimates are, from best to worst:  

1. FA - Featured article
2. GA - Good article
3. B - B-class article
4. C - C-class article
5. Start - Start-class article
6. Stub - Stub-class article  

These labels were learned from articles in Wikipedia that were peer-reviewed using the [Wikipedia content assessment](https://en.wikipedia.org/wiki/Wikipedia:Content_assessment) procedures. These quality classes are a subset of the quality assessment categories developed by Wikipedia editors. For this assignment, you only need to know that these categories exist, and that ORES will assign one of these 6 categories to any `rev_id` you send it.  

In order to get article predictions for each article in the Wikipedia dataset, you will first need to read `page_data.csv` into Python (or R), and then read through the dataset line by line, using the value of the `rev_id` column to make an API query.

#### Option 1:   

Install and run the ORES client (Python only)

#### Option 2:

Use the REST API endpoint (Python or R)  

The ORES REST API is configured fairly similarly to the pageviews API we used for Assignment 1. You should review the ORES REST [documentation](https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model). It expects a revision ID, which is the third column in the Wikipedia dataset, and a model, which is "`articlequality`".  
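
As a rough illustration, the query URL can be assembled as below without sending any request. The endpoint path and the `models` and `revids` parameter names are the ones used in the code cells of this notebook; `build_ores_url` is a hypothetical helper name, and joining several IDs with `|` mirrors how the batch calls in this notebook are built.

```python
ORES_ENDPOINT = "https://ores.wikimedia.org/v3/scores/enwiki"


def build_ores_url(rev_ids, model="articlequality"):
    """Return the scoring URL for one or more revision IDs."""
    revids = "|".join(str(rev_id) for rev_id in rev_ids)
    return f"{ORES_ENDPOINT}?models={model}&revids={revids}"


print(build_ores_url([355319463, 393276188]))
```

Building the URL in one place keeps the single-revision and batch cases consistent, since a single ID is just a list of length one.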

Whether you query the API or use the client, you will notice that ORES returns a `prediction` value that contains the name of one category, as well as `probability` values for each of the 6 quality categories. For this assignment, you only need to capture and use the value for `prediction`.  

Note: It's possible that you will be unable to get a score for a particular article. If that happens, make sure to maintain a log of articles for which you were not able to retrieve an ORES score. This log can be saved as a separate file, or (if it's only a few articles), simply printed and logged within the notebook. The choice is up to you.
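
A minimal sketch of capturing `prediction` while logging failures is shown below. It assumes each entry under `scores` holds either a `score` or an `error` under `articlequality`, which is the shape of the responses printed in this notebook. `split_scores` is a hypothetical helper, and the sample dict is an abbreviated copy of two such responses.

```python
def split_scores(scores):
    """Separate successful predictions from revisions that returned no score."""
    predictions = {}
    failed = []
    for rev_id, result in scores.items():
        score = result.get("articlequality", {}).get("score")
        if score is None:
            # No score (for example RevisionNotFound): keep the ID for the log.
            failed.append(rev_id)
        else:
            predictions[rev_id] = score["prediction"]
    return predictions, failed


# Abbreviated sample shaped like the ORES responses in this notebook.
scores = {
    "355319463": {"articlequality": {"score": {"prediction": "Stub"}}},
    "807484325": {"articlequality": {"error": {"type": "RevisionNotFound"}}},
}
predictions, failed = split_scores(scores)
print(predictions)  # {'355319463': 'Stub'}
print(failed)       # ['807484325']
```

The `failed` list can then be printed in the notebook or written to a separate file, as the note above allows.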


In [350]:
import requests
import json
import numpy as np

In [239]:
page_data = pd.read_csv('data_clean/politicians.csv')    

print('page_data.shape: ', page_data.shape)

page_data.shape:  (46701, 3)


In [50]:
rev_ids = page_data.rev_id

In [140]:
url = 'https://ores.wikimedia.org/v3/scores/enwiki?models=articlequality&revids={rev_id}'

headers = {
    'User-Agent': 'https://github.com/emi90',
    'From': 'eyamauch@uw.edu'
}

In [44]:
def api_call(url, rev_id):
    
    call = requests.get(url.format(rev_id=rev_id), headers=headers)
    response = call.json()
    
    return response

In [300]:
# test

api_call(url, rev_ids[0])

{'enwiki': {'models': {'articlequality': {'version': '0.8.2'}},
  'scores': {'355319463': {'articlequality': {'score': {'prediction': 'Stub',
      'probability': {'B': 0.005643168767502225,
       'C': 0.005641424870624224,
       'FA': 0.0010757577110297029,
       'GA': 0.001543343686495854,
       'Start': 0.010537503531047517,
       'Stub': 0.9755588014333005}}}}}}}

In [450]:
# function to get batch number array

def get_batches(n):
    
    """
    Function to get the batch number array
    n: number of api calls within a batch
    Returns: batch array and number of batches
    """
    
    total = page_data.shape[0]
    
    # number of batches to create (the last one may be smaller than n)
    num_batch = int(np.ceil(total / n))
    
    # row i belongs to batch i // n; this also works when total divides evenly by n
    batch_no = np.arange(total) // n
    
    return batch_no, num_batch

In [451]:
# check

print('batch_no.shape: ', get_batches(50)[0].shape)
print('page_data.shape: ', page_data.shape)

batch_no.shape:  (46701,)
page_data.shape:  (46701, 3)


In [457]:
# test - first 50 revids

batch_arr = get_batches(50)[0]
rev_ids[batch_arr==0][0:5]

0    355319463
1    393276188
2    393822005
3    395521877
4    395526568
Name: rev_id, dtype: int64
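
An alternative to labelling every row with a batch number is to slice the IDs directly. The sketch below is one such approach in plain Python, not the method used in this notebook. `chunked` is a hypothetical helper, and the example uses the five revision IDs printed above with a batch size of 2 to keep the output short.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


ids = [355319463, 393276188, 393822005, 395521877, 395526568]
batches = list(chunked(ids, 2))
print(len(batches))  # 3
print(batches[-1])   # [395526568]
```

Slicing handles an uneven final batch without a separate remainder calculation, and it yields no empty batch when the total divides evenly.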

In [453]:
# function to call api for each batch count

def api_call_batch(url, n, batch):
    
    """
    Function to call api for given batch, given n calls in each batch
    url: endpoint url
    n: number of api calls in a batch
    batch: ith batch call
    """
    
    batch_arr = get_batches(n)[0] # batch numbers
    batch_ids = rev_ids[batch_arr==batch] # just the rev_id for those that match the ith batch
    rev_id = "|".join(str(x) for x in batch_ids) # join to single string
    
    call = requests.get(url.format(rev_id=rev_id), headers=headers)
    response = call.json()
    
    return response

In [468]:
api_call_batch(url, n, 934)

{'enwiki': {'models': {'articlequality': {'version': '0.8.2'}},
  'scores': {'807484325': {'articlequality': {'error': {'message': 'RevisionNotFound: Could not find revision ({revision}:807484325)',
      'type': 'RevisionNotFound'}}}}}}

In [469]:
def get_json(url, n):
    
    """
    Function to collect the scores from every batch, given n calls in each batch
    url: endpoint url
    n: number of api calls in a batch
    Returns: dict of scores keyed by rev_id, and list of batch numbers that failed
    """
    
    total_batches = get_batches(n)[1]
    print('total batches: ', total_batches)
    
    all_data = {}
    err_batches = []
    
    for i in range(total_batches):
        if i % 100 == 0:
            print('Currently processing batch', i)
        resp = api_call_batch(url, n, i)
        
        try:
            # merge the per-revision scores, not the top-level response
            all_data.update(resp['enwiki']['scores'])
        except KeyError:
            # response had no scores; record the batch number for a retry
            err_batches.append(i)
            
    return all_data, err_batches

In [470]:
# this will take a while
import time

start = time.time()

res = get_json(url, 50)

end = time.time()

print('time: ', end-start)

total batches:  935
Currently processing batch 0
Currently processing batch 100
Currently processing batch 200
Currently processing batch 300
Currently processing batch 400
Currently processing batch 500
Currently processing batch 600
Currently processing batch 700
Currently processing batch 800
Currently processing batch 900
time:  312.40019035339355


[]

In [456]:
res

({'807484325': {'articlequality': {'error': {'message': 'RevisionNotFound: Could not find revision ({revision}:807484325)',
     'type': 'RevisionNotFound'}}}},
 [])

In [416]:
batch_arr = get_batches(50)[0]
rev_ids[batch_arr == 934]

Series([], Name: rev_id, dtype: int64)

In [426]:
batch_arr[-2]

933.0

In [413]:
err_revids

Series([], Name: rev_id, dtype: int64)

In [384]:
json_dir = 'data_clean/json_files/'

json_files = os.listdir(json_dir)

all_data = {}

for jsn in json_files:
    
    with open(os.path.join(json_dir, jsn)) as f:
        json_text = json.load(f)
        # NOTE: each file has the same top-level 'enwiki' key, so update()
        # keeps only the scores from the last file read
        all_data.update(json_text)

In [388]:
all_data['enwiki']['scores']

{'722705539': {'articlequality': {'score': {'prediction': 'Stub',
    'probability': {'B': 0.00689144111526763,
     'C': 0.007624425453316215,
     'FA': 0.0012407463086529457,
     'GA': 0.001928379221398795,
     'Start': 0.020866036614712348,
     'Stub': 0.961448971286652}}}},
 '722725032': {'articlequality': {'score': {'prediction': 'Stub',
    'probability': {'B': 0.017773652798148267,
     'C': 0.022180336300355496,
     'FA': 0.002615834261356982,
     'GA': 0.005275107364488364,
     'Start': 0.21978100818838803,
     'Stub': 0.7323740610872628}}}},
 '722745112': {'articlequality': {'score': {'prediction': 'Stub',
    'probability': {'B': 0.010992607002648128,
     'C': 0.013158958801671726,
     'FA': 0.0016884063135995156,
     'GA': 0.002481886643808845,
     'Start': 0.04208215964318175,
     'Stub': 0.92959598159509}}}},
 '722745301': {'articlequality': {'score': {'prediction': 'Stub',
    'probability': {'B': 0.009883904072726669,
     'C': 0.01920371930770394,
     'FA

In [379]:
all_data['enwiki']['scores'].keys()

dict_keys(['722705539', '722725032', '722745112', '722745301', '722771207', '722771593', '722780352', '722784593', '722785100', '722789661', '722806951', '722809464', '722813751', '722814265', '722814345', '722816132', '722820209', '722833303', '722845139', '722845833', '722853260', '722853919', '722854384', '722858228', '722859616', '722874876', '722891369', '722893652', '722893996', '722895901', '722903661', '722908046', '722908287', '722920330', '722921768', '722934060', '722937211', '722941632', '722943338', '722949480', '722951661', '722955105', '722956416', '722956986', '722957867', '722958477', '722959223', '722964404', '722965184', '722975222'])

In [370]:
os.path.join(json_dir, json_files[0])

'data_clean/json_files/batch_0.json'

In [218]:
# function to call api for each batch count

def api_call_batch(url, n, batch):
    
    """
    Function to call api for given batch, given n calls in each batch
    url: endpoint url
    n: number of api calls in a batch
    batch: ith batch call
    """
    
    batch_no = get_batches(n)[0] # batch number for every row
    batch_ids = rev_ids[batch_no==batch] # just the rev_id for those that match the ith batch
    rev_id = "|".join(str(x) for x in batch_ids) # join to single string
    
    call = requests.get(url.format(rev_id=rev_id), headers=headers)
    response = call.json()
    
    return response

In [230]:
# function to return json

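def get_json(url, n, nbatch=None):
    """
    Function to return json from api call, given n calls in each batch
    url: endpoint url
    n: number of api calls in a batch
    nbatch: optional number of batches to process (default: all)
    Returns: json of all the api calls
    """
    
    # named `merged` so the json module is not shadowed
    merged = {}
    
    if nbatch is None:
        num_batch = get_batches(n)[1]
    else:
        num_batch = nbatch
    
    for i in range(num_batch):
        
        #print('Currently processing ', i, 'th batch')
        # NOTE: every response has the same top-level 'enwiki' key, so update()
        # replaces the previous batch instead of adding to it
        merged.update(api_call_batch(url, n, i))
    
    return merged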

In [291]:
test_json = {}
arr_batch = np.arange(0,5)
for i in arr_batch:
    call_json = api_call_batch(url, 50, i)
    test_json.update(call_json)
    
def show_keys(json):
    return json['enwiki']['scores'].keys()

In [362]:
test_json

{'enwiki': {'models': {'articlequality': {'version': '0.8.2'}},
  'scores': {'677568976': {'articlequality': {'score': {'prediction': 'Stub',
      'probability': {'B': 0.007682049326841926,
       'C': 0.012996430164975486,
       'FA': 0.0014879307394895609,
       'GA': 0.0035305154178284135,
       'Start': 0.05970950439642935,
       'Stub': 0.9145935699544353}}}},
   '677724847': {'articlequality': {'score': {'prediction': 'Stub',
      'probability': {'B': 0.007587304305834236,
       'C': 0.010365850569134727,
       'FA': 0.0012843060394473315,
       'GA': 0.0031575328893562113,
       'Start': 0.01775463667837411,
       'Stub': 0.9598503695178535}}}},
   '677747943': {'articlequality': {'score': {'prediction': 'Stub',
      'probability': {'B': 0.019490930905103745,
       'C': 0.043058033936536934,
       'FA': 0.0031191702709674877,
       'GA': 0.008300674559984888,
       'Start': 0.3239688375477002,
       'Stub': 0.6020623527797068}}}},
   '677750304': {'articlequalit

In [288]:
test_json = {}
call_json = api_call_batch(url, 50, 0)

In [289]:
test_json.update(call_json)

In [299]:
#test_json = {}

call_json = api_call_batch(url, 50, 5)
test_json.update(call_json)


len(show_keys(test_json))

50
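
The count stays at 50 after several batches because `dict.update` replaces the value stored under a key that already exists, and every response shares the same top-level `enwiki` key. The sketch below reproduces the effect with two tiny placeholder responses and shows one way to merge at the `scores` level instead. `merge_scores` is a hypothetical helper, and `"1"` and `"2"` are placeholders, not real revision IDs.

```python
def merge_scores(merged, response, wiki="enwiki"):
    """Add the per-revision scores from one response to `merged` in place."""
    merged.update(response.get(wiki, {}).get("scores", {}))
    return merged


# Placeholder responses; "1" and "2" are not real revision IDs.
first = {"enwiki": {"scores": {"1": {"articlequality": {}}}}}
second = {"enwiki": {"scores": {"2": {"articlequality": {}}}}}

shallow = {}
shallow.update(first)
shallow.update(second)  # replaces the whole "enwiki" entry
print(len(shallow["enwiki"]["scores"]))  # 1

merged = {}
merge_scores(merged, first)
merge_scores(merged, second)
print(sorted(merged))  # ['1', '2']
```

Merging on the inner `scores` dict works here because revision IDs are unique keys, so batches never overwrite one another.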

In [268]:
test_json['enwiki']['scores'].keys()

dict_keys(['671462644', '671473289', '671475479', '671483959', '671484594', '671511760', '671515606', '671516033', '671637265', '671868722', '671890885', '671891037', '671895550', '671897239', '671903264', '672119030', '672156082', '672284154', '672515581', '672547562', '672709444', '672862914', '673008587', '673393730', '674487383', '674576443', '674702694', '674919905', '675271208', '675303236', '675376529', '676040670', '676043559', '676044272', '676498385', '676610041', '676612108', '676661990', '676691070', '676732644', '676734572', '676782180', '676828191', '676841743', '676947124', '677094551', '677317570', '677435096', '677516121', '677523953'])

In [262]:
rev_ids[batch_no==4]

200    671462644
201    671473289
202    671475479
203    671483959
204    671484594
205    671511760
206    671515606
207    671516033
208    671637265
209    671868722
210    671890885
211    671891037
212    671895550
213    671897239
214    671903264
215    672119030
216    672156082
217    672284154
218    672515581
219    672547562
220    672709444
221    672862914
222    673008587
223    673393730
224    674487383
225    674576443
226    674702694
227    674919905
228    675271208
229    675303236
230    675376529
231    676040670
232    676043559
233    676044272
234    676498385
235    676610041
236    676612108
237    676661990
238    676691070
239    676732644
240    676734572
241    676782180
242    676828191
243    676841743
244    676947124
245    677094551
246    677317570
247    677435096
248    677516121
249    677523953
Name: rev_id, dtype: int64

In [240]:
get_batches(50)[1]

935

In [241]:
# This will take a while

json = get_json(url, 50)

Currently processing  0 th batch
Currently processing  1 th batch
Currently processing  2 th batch
...
Currently processing  735 th batch
...

In [249]:
json['enwiki']['models']

{'articlequality': {'version': '0.8.2'}}

In [166]:
# separate into batches

# single batch length: per slack discussion, more than 50 rev_ids per call causes errors
n = 50

# number of batches to create
num_batch = np.ceil(page_data.shape[0]/n)

# mod for last batch count
mod = np.mod(page_data.shape[0], n)

# get batch number array
batch_no = np.repeat(np.arange(0, num_batch-1), n)

# append final uneven array
batch_no = np.append(batch_no, np.repeat(num_batch-1, mod)) # last batch is numbered num_batch-1

In [172]:
# check

print('batch_no.shape: ', batch_no.shape)
print('page_data.shape: ', page_data.shape)
print('number of batches: ', len(np.unique(batch_no)))

batch_no.shape:  (46701,)
page_data.shape:  (46701, 3)
number of batches:  935


In [198]:
json = {}

json.update(api_call_batch(url, 0))

In [185]:
test_dict = {}
new_dict = first
test_dict.update(first)

In [190]:
json0 = api_call_batch(url, 0)

In [191]:
json0.update(api_call_batch(url, 1))

In [159]:
ids = "|".join(str(x) for x in rev_ids[batch_no==0])

In [160]:
first = api_call(url, ids)

In [161]:
ids = "|".join(str(x) for x in rev_ids[batch_no==1])

In [162]:
second = api_call(url, ids)

In [163]:
first.update(second)

In [164]:
first

{'enwiki': {'models': {'articlequality': {'version': '0.8.2'}},
  'scores': {'627432937': {'articlequality': {'score': {'prediction': 'Stub',
      'probability': {'B': 0.007549306609112667,
       'C': 0.009658383617862537,
       'FA': 0.0013389176611320235,
       'GA': 0.002591174490342612,
       'Start': 0.019747936069590622,
       'Stub': 0.9591142815519595}}}},
   '627547024': {'articlequality': {'error': {'message': 'RevisionNotFound: Could not find revision ({revision}:627547024)',
      'type': 'RevisionNotFound'}}},
   '628261896': {'articlequality': {'score': {'prediction': 'Stub',
      'probability': {'B': 0.007722074597760802,
       'C': 0.010298779948291212,
       'FA': 0.0014512195174666005,
       'GA': 0.0027890605439809244,
       'Start': 0.02127044532293234,
       'Stub': 0.9564684200695682}}}},
   '628268705': {'articlequality': {'score': {'prediction': 'Stub',
      'probability': {'B': 0.008339353567238268,
       'C': 0.010911344181544996,
       'FA': 0.