# DATA 512 - A2: Bias in Data
**Corey Christopherson**
**10/17/2019**  

The purpose of this project is to investigate the quality of Wikipedia articles on political figures from different countries and explore the concept of any suspected bias based on the results.

Project information and data can be found in the Github repo https://github.com/chrico7/data-512-a2

Three data sets were used to conduct this analysis:
1. page_data.csv - Wikipedia Politicians by Country Data <br>
   https://figshare.com/articles/Untitled_Item/5513449
2. WPDS_2018_data.csv - Population Reference Bureau World Population Datasheet <br>
   https://canvas.uw.edu/courses/1319253/files/
3. Objective Revision Evaluation Service (ORES) Article Quality Scores API <br>
   https://www.mediawiki.org/wiki/ORES

In [1]:
import numpy as np
import pandas as pd
import requests
import json
import time

## Data Acquisition
The two data sets in the list above were downloaded to a local directory as csv files and read in using Pandas read_csv. These raw files can be viewed in hte Github repo noted above.

In [2]:
#
### ACQUIRE DATA ###
#

In [3]:
# Define data paths
path = r'C:/Users/chrico7/Documents/__Corey Christopherson/MS Data Science/Courses/HCDE 512/Week 2/Homework/'
polPath = r'C:/Users/chrico7/Documents/__Corey Christopherson/MS Data Science/Courses/HCDE 512/Week 2/Homework/Data/country/data/'
popPath = r'C:/Users/chrico7/Documents/__Corey Christopherson/MS Data Science/Courses/HCDE 512/Week 2/Homework/Data/'

In [4]:
# Read in csv data
polData_raw = pd.read_csv(r'{}page_data.csv'.format(polPath))
popData_raw = pd.read_csv(r'{}WPDS_2018_data.csv'.format(popPath))

## Data Cleaning
Both data sets were cleaned to ensure the quality of the final data by executing the following actions:
1. polData <br>
   Page names that begin with the string 'Template' were removed since these are not Wikipedia pages
2. popData <br> 
   Records in the 'geography' field that were ALL CAPS were broken out to a separate table because these are aggregates and not countries

In [5]:
#
### CLEAN DATA ###
#

In [6]:
# Politician Data (polData)
polData = polData_raw.copy()
# Remove page names that start with the string 'Template'
polData = polData[~polData['page'].str.contains('Template',regex=True)].reset_index(drop=True)

In [7]:
# Population Data (popData)
popData = popData_raw.copy()
# Break out records with ALL CAP records in 'geography' field
popData_agg = popData[popData['Geography'].str.isupper()].reset_index(drop=True)
popData = popData[~popData['Geography'].str.isupper()].reset_index(drop=True)

A separate table was then generated from the aggregate fields extracted from the population data to map regions to the corresponding countries

In [9]:
# Derive map between Geogrpahy and Country
geoMap = pd.DataFrame()
for i in range(popData_agg['Geography'].shape[0]):
    # derive the start and stop points for each region
    start = popData_raw.loc[popData_raw['Geography']==popData_agg['Geography'][i],:].index[0]
    if i == popData_agg['Geography'].shape[0]:
        stop = popData_raw.loc[popData_raw['Geography']==popData_agg['Geography'][i+1],:].index[0]
    else:
        stop = popData_raw.loc[popData_raw['Geography']==popData_raw.iloc[-1,0],:].index[0]
    # extract countries for Region i
    temp = popData_raw.iloc[start:stop,:]
    temp.loc[:,'Region'] = popData_agg['Geography'][i]
    # add country list to geoMap frame
    geoMap = geoMap.append(temp[['Region','Geography']], ignore_index=True, sort=False)
# trim out Region rows from geoMap frame
geoMap = geoMap[geoMap['Region']!=geoMap['Geography']]

## Data Processing

First, the quality score for each article was obtained from the ORES API based on the rev_id article identifier.

In [10]:
#
### PROCESS DATA ###
#

In [11]:
# Get article quality predictions

A function to use the ORES API was obtained with permission from https://github.com/Ironholds/data-512-a2 and modified to better work with Pandas.

The function passes a list of rev_ids to the API using the python requests library and then converts the response to json

In [12]:
def get_ores_data(revision_ids):
    """
    Function to get ores data when passed a list of revision ids
    """
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    
    # Specify the parameters - smushing all the revision IDs together separated by | marks.  
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    # Call the API and convert to json
    api_call = requests.get(endpoint.format(**params))
    response = api_call.json()
    
    return response

Exploration of the API revealed that the maximum number of rev_ids it could process in one pull was 100. The data frame was therefore divided into 100 element chunks and fed into the API in order.

In [15]:
# Loop through 100 line chunks and compile cumulative ores data frame
start = time.time()
oresData_raw = pd.DataFrame()
for chunk in np.array_split(polData.iloc[:,2],np.ceil(polData.iloc[:,:].shape[0] / 100)):
    temp_dict = get_ores_data(chunk)
    temp_revid = pd.Series(list(temp_dict['enwiki']['scores'].keys()))
    temp_df = pd.io.json.json_normalize(temp_dict['enwiki']['scores'].values())
    temp_df.loc[:,'rev_id.rev_id'] = temp_revid
    oresData_raw = oresData_raw.append(temp_df, ignore_index=True, sort=False)
print(time.time() - start)

198.28700017929077


The frame columns were then renamed to more readable values.

In [50]:
# Rename columns to common terms and set data types
colDict = dict(zip(pd.Series(oresData_raw.columns),
                   pd.Series(oresData_raw.columns).str.rsplit('.',n=1,expand=True)[1]))
oresData = oresData_raw.rename(colDict, axis='columns')
oresData.rename({'message':'error.message','type':'error.type'},axis=1,inplace=True)
oresData.loc[:,'rev_id'] = oresData['rev_id'].astype(int)

In [59]:
oresData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46701 entries, 0 to 46700
Data columns (total 10 columns):
prediction       46546 non-null object
B                46546 non-null float64
C                46546 non-null float64
FA               46546 non-null float64
GA               46546 non-null float64
Start            46546 non-null float64
Stub             46546 non-null float64
error.message    155 non-null object
error.type       155 non-null object
rev_id           46701 non-null int32
dtypes: float64(6), int32(1), object(3)
memory usage: 3.4+ MB


Some of the rev_ids returned errors instead of scores. The records were extracted and placed in a seprate frame for future reference if needed.

In [60]:
# Extract good and bad data
oresData_good = oresData[oresData['error.type'].isnull()]
oresData_bad = oresData[~oresData['error.type'].isnull()]

The article score was extracted and added on to the polData frame. Unmatched rows were then removed and stored.

In [61]:
# Add article score to polData and extract bad rows
polData_score = pd.merge(polData,
                         oresData_good[['rev_id','prediction']],
                         how='outer',on='rev_id')
polData_noScore = polData_score[polData_score['prediction'].isnull()]
polData_score = polData_score[~polData_score['prediction'].isnull()]

The population data was then added to the data to form a common frame. Again, unmatched rows were then removed and stored.

In [62]:
# Add popData to pol data and extract bad rows
allData_raw = pd.merge(polData_score,
                       popData,how='outer',left_on='country',right_on='Geography')
allData_raw_noGeo = allData_raw[allData_raw['Geography'].isnull()]
allData_raw_noPage = allData_raw[allData_raw['page'].isnull()]
allData_raw_good = allData_raw[(~allData_raw['Geography'].isnull())&
                               (~allData_raw['page'].isnull())]

Finally, the columns were renamed to more user friendly terms, reordered, and the population counts were converted to raw counts from millions

In [63]:
# Format columns
finalCols = ['country','article_name','revision_id','article_quality','population']
allData = allData_raw_good.rename({'page':'article_name',
                                   'rev_id':'revision_id',
                                   'prediction':'article_quality',
                                   'Population mid-2018 (millions)':'population'},axis=1)
allData = allData[finalCols]
# Format data types
allData.loc[:,'population'] = allData['population'].str.replace(',','',regex=True).astype(float)*1000000

In [64]:
allData.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44464 entries, 0 to 46539
Data columns (total 5 columns):
country            44464 non-null object
article_name       44464 non-null object
revision_id        44464 non-null float64
article_quality    44464 non-null object
population         44464 non-null float64
dtypes: float64(2), object(3)
memory usage: 2.0+ MB


In [65]:
# Output unmatched rows
badData = pd.concat([polData_noScore,allData_raw_noGeo,allData_raw_noPage], sort=False)
badData.to_csv(r'{}wp_wpds_countries-no_match.csv'.format(path),header=True,index=False)

In [66]:
# Output final data
allData.to_csv(r'{}wp_wpds_politicians_by_country.csv'.format(path),header=True,index=False)

## Data Analysis
The final data was then split into several different frames to calculate following metrics
1. Country article count
2. Country population total
3. Country article count for good articles (FA and GA)
4. Region article count
5. Region population sum
6. Country article count for good articles (FA and GA)

These tables were then used to generate the following tables

1. Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population
2. Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
3. Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
4. Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
5. Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population
6. Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality

In [67]:
#
### ANALYZE DATA ###
#

In [68]:
allData.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44464 entries, 0 to 46539
Data columns (total 5 columns):
country            44464 non-null object
article_name       44464 non-null object
revision_id        44464 non-null float64
article_quality    44464 non-null object
population         44464 non-null float64
dtypes: float64(2), object(3)
memory usage: 2.0+ MB


In [69]:
# Make a frame with Region information
allData_region = pd.merge(allData, geoMap, how='left',left_on='country',right_on='Geography').drop('Geography',axis=1)

In [70]:
# Calculate metrics
country_article_count = (allData.groupby('country')['article_name'].count()
                         .reset_index().rename({'article_name':'article_count'},axis=1))

country_pop_sum = (allData.groupby('country')['population'].mean()
                   .reset_index().rename({'population':'population_sum'},axis=1))

country_article_good = (allData[(allData['article_quality']=='GA')|
                                (allData['article_quality']=='FA')]
                        .groupby('country')['article_name'].count()
                        .reset_index().rename({'article_name':'article_count_good'},axis=1))

region_article_count = (allData_region.groupby('Region')['article_name'].count()
                        .reset_index().rename({'article_name':'article_count'},axis=1))

region_pop_sum = (allData_region.groupby('Region')['population'].mean()
                  .reset_index().rename({'population':'population_sum'},axis=1))

region_article_good = (allData_region[(allData_region['article_quality']=='GA')|
                                      (allData_region['article_quality']=='FA')]
                       .groupby('Region')['article_name'].count()
                       .reset_index().rename({'article_name':'article_count_good'},axis=1))

In [71]:
# Calculate data tables
country_cov = pd.merge(country_article_count,country_pop_sum,how='left',on='country')
country_cov.loc[:,'coverage%'] = country_cov['article_count']/country_cov['population_sum']*100

country_qual = pd.merge(country_article_good,country_article_count,how='left',on='country')
country_qual.loc[:,'quality%'] = country_qual['article_count_good']/country_qual['article_count']*100

region_cov = pd.merge(region_article_count,region_pop_sum,how='left',on='Region')
region_cov.loc[:,'coverage%'] = region_cov['article_count']/region_cov['population_sum']*100

region_qual = pd.merge(region_article_good,region_article_count,how='left',on='Region')
region_qual.loc[:,'quality%'] = region_qual['article_count_good']/region_qual['article_count']*100

In [72]:
# Output data tables

In [73]:
# Top 10 Countries by Coverage
country_cov.sort_values('coverage%',ascending=False)[0:10].reset_index(drop=True)

Unnamed: 0,country,article_count,population_sum,coverage%
0,Tuvalu,54,10000.0,0.54
1,Nauru,52,10000.0,0.52
2,San Marino,81,30000.0,0.27
3,Monaco,40,40000.0,0.1
4,Liechtenstein,28,40000.0,0.07
5,Tonga,63,100000.0,0.063
6,Marshall Islands,37,60000.0,0.061667
7,Iceland,201,400000.0,0.05025
8,Andorra,34,80000.0,0.0425
9,Grenada,36,100000.0,0.036


In [74]:
# Bottom 10 Countries by Coverage
country_cov.sort_values('coverage%',ascending=True)[0:10].reset_index(drop=True)

Unnamed: 0,country,article_count,population_sum,coverage%
0,India,980,1371300000.0,7.1e-05
1,Indonesia,210,265200000.0,7.9e-05
2,China,1130,1393800000.0,8.1e-05
3,Uzbekistan,28,32900000.0,8.5e-05
4,Ethiopia,101,107500000.0,9.4e-05
5,"Korea, North",36,25600000.0,0.000141
6,Zambia,25,17700000.0,0.000141
7,Thailand,112,66200000.0,0.000169
8,Mozambique,58,30500000.0,0.00019
9,Bangladesh,319,166400000.0,0.000192


In [75]:
# Top 10 Countries by Relative Quality
country_qual.sort_values('quality%',ascending=False)[0:10].reset_index(drop=True)

Unnamed: 0,country,article_count_good,article_count,quality%
0,"Korea, North",7,36,19.444444
1,Saudi Arabia,15,118,12.711864
2,Mauritania,6,48,12.5
3,Central African Republic,8,66,12.121212
4,Romania,39,343,11.370262
5,Tuvalu,5,54,9.259259
6,Bhutan,3,33,9.090909
7,Dominica,1,12,8.333333
8,Syria,10,128,7.8125
9,Benin,7,91,7.692308


In [76]:
# Bottom 10 Countries by Relative Quality
country_qual.sort_values('quality%',ascending=True)[0:10].reset_index(drop=True)

Unnamed: 0,country,article_count_good,article_count,quality%
0,Belgium,1,520,0.192308
1,Tanzania,1,405,0.246914
2,Switzerland,1,402,0.248756
3,Nepal,1,357,0.280112
4,Peru,1,350,0.285714
5,Nigeria,2,677,0.295421
6,Colombia,1,285,0.350877
7,Lithuania,1,244,0.409836
8,Fiji,1,197,0.507614
9,Azerbaijan,1,179,0.558659


In [77]:
# Geographic Regions by Coverage
region_cov.sort_values('coverage%',ascending=False).reset_index(drop=True)

Unnamed: 0,Region,article_count,population_sum,coverage%
0,EUROPE,18934,31485320.0,0.060136
1,AFRICA,44406,117284900.0,0.037862
2,NORTHERN AMERICA,37555,130358200.0,0.028809
3,LATIN AMERICA AND THE CARIBBEAN,35634,126583000.0,0.028151
4,ASIA,30465,137274900.0,0.022193
5,OCEANIA,3070,14031980.0,0.021879


In [78]:
# Geographic Regions by Relative Quality
region_qual.sort_values('quality%',ascending=False).reset_index(drop=True)

Unnamed: 0,Region,article_count_good,article_count,quality%
0,NORTHERN AMERICA,863,37555,2.297963
1,ASIA,695,30465,2.281306
2,AFRICA,988,44406,2.224925
3,LATIN AMERICA AND THE CARIBBEAN,764,35634,2.14402
4,OCEANIA,63,3070,2.052117
5,EUROPE,385,18934,2.033379


## Reflections and Implications

The results of this study were surprising because almost none of my expectations were met. Specifically, I expected to see areas with strong institutional freedoms score highly in both the coverage and quality tables. However, there didn't seem to be much rhyme or reason to any of the table. I believe that this is because of two factors: the coverage and quality percentage calculations were biased by extremes in the numerator or denominators; and the lack of quality of the 'quality' indicator.

Ratios are always suspect because they are vulnurable to extremes in either the numerator or the denominator. For examples, India scores very low in the coverage category because of their vast population. I don't think that there are even enough unique articles to be written that could overcome such an imbalance.

Also, measures of quality are always suspect. This is most clearly shown in this exercise by the very high ranking of North Korea in the quality score. This is likely due to the oppressive nature of the country that limits the types of articles coming out of the country which has clearly biased the results.

As with anything in data science, it is critical to be aware of the biases that can crop up in data as well as the underlying logic used to derive any specific metric.