# DATA 512 - A2: Bias in Data
**Corey Christopherson**
**10/10/2019**  

The purpose of this project is to explore the concept of bias through data on Wikipedia articles, specifically articles on political figures from a variety of countries.

In [1]:
import numpy as np
import pandas as pd
import requests
import json
import time

## Data Acquisition
Two data sets were obtained for this project:
1. page_data.csv - Wikipedia politicians by country dataset 
2. WPDS_2018_data.csv - Population Reference Bureau world population datasheet

In [2]:
#
### ACQUIRE DATA ###
#

In [72]:
# Define data paths
path = r'C:/Users/chrico7/Documents/__Corey Christopherson/MS Data Science/Courses/HCDE 512/Week 2/Homework/'
polPath = r'C:/Users/chrico7/Documents/__Corey Christopherson/MS Data Science/Courses/HCDE 512/Week 2/Homework/Data/country/data/'
popPath = r'C:/Users/chrico7/Documents/__Corey Christopherson/MS Data Science/Courses/HCDE 512/Week 2/Homework/Data/'

In [4]:
# Read in csv data
polData_raw = pd.read_csv(r'{}page_data.csv'.format(polPath))
popData_raw = pd.read_csv(r'{}WPDS_2018_data.csv'.format(popPath))

## Data Cleaning
Both data sets were cleaned to ensure the quality of the final data, as follows:
1. polData - Remove page names that begin with 'Template', since these are templates rather than articles
2. popData - Break out records with ALL CAPS values in the 'Geography' field into a separate table, because these are regional aggregates

Additionally, a separate table was generated to map regions to the corresponding countries.
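As a rough illustration of the two cleaning rules, the sketch below applies them to toy data. The frames `pages` and `geo` and all of their values are made up for the example; only the column names follow the real files.

```python
import pandas as pd

# Toy stand-ins for the two raw tables (all values are made up)
pages = pd.DataFrame({
    'page': ['Template:Zambia-politician-stub', 'Bir I of Kanem'],
    'country': ['Zambia', 'Chad'],
    'rev_id': [1, 2],
})
geo = pd.DataFrame({
    'Geography': ['AFRICA', 'Algeria', 'Egypt'],
    'Population mid-2018 (millions)': ['1,284', '42.7', '97'],
})

# Rule 1: drop page names that begin with 'Template'
pages_clean = pages[~pages['page'].str.startswith('Template')].reset_index(drop=True)

# Rule 2: split ALL CAPS aggregate rows from country rows
is_aggregate = geo['Geography'].str.isupper()
geo_agg = geo[is_aggregate].reset_index(drop=True)
geo_countries = geo[~is_aggregate].reset_index(drop=True)

print(pages_clean['page'].tolist())
print(geo_agg['Geography'].tolist())
print(geo_countries['Geography'].tolist())
```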

In [5]:
#
### CLEAN DATA ###
#

In [6]:
# Politician Data (polData)
polData = polData_raw.copy()
# Remove page names that start with the string 'Template'
polData = polData[~polData['page'].str.startswith('Template')].reset_index(drop=True)

In [7]:
# Population Data (popData)
popData = popData_raw.copy()
# Break out records with ALL CAPS values in the 'Geography' field (regional aggregates)
popData_agg = popData[popData['Geography'].str.isupper()].reset_index(drop=True)
popData = popData[~popData['Geography'].str.isupper()].reset_index(drop=True)

In [153]:
# Derive map between Geography (country) and Region
# Note: rows after the last aggregate record are not covered by this loop
geoFrames = []
firstGeo = popData_agg['Geography'][0]
for i in popData_agg['Geography'][1:]:
    start = popData_raw.loc[popData_raw['Geography']==firstGeo,:].index[0]
    stop = popData_raw.loc[popData_raw['Geography']==i,:].index[0]
    # Copy the slice so the Region column can be added without a chained-assignment warning
    temp = popData_raw.iloc[start:stop,:].copy()
    temp.loc[:,'Region'] = firstGeo
    geoFrames.append(temp[['Region','Geography']])
    firstGeo = popData_raw.iloc[stop,0]
geoMap = pd.concat(geoFrames, ignore_index=True, sort=False)
# Drop the aggregate rows themselves
geoMap = geoMap[geoMap['Region']!=geoMap['Geography']]
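An alternative way to build the same kind of map is to forward-fill the aggregate name down the table, which also covers the countries listed after the final aggregate row. This is only a sketch on toy data: the frame `raw` is a made-up stand-in for `popData_raw`, and it assumes every aggregate row appears directly above the countries it contains.

```python
import pandas as pd

# Toy stand-in for popData_raw: aggregates in ALL CAPS precede their countries
raw = pd.DataFrame({'Geography': ['AFRICA', 'Algeria', 'Egypt',
                                  'OCEANIA', 'Australia', 'Fiji']})

is_aggregate = raw['Geography'].str.isupper()
# Keep the name on aggregate rows, blank it elsewhere, then carry it forward
region = raw['Geography'].where(is_aggregate).ffill()

# Attach the region, drop the aggregate rows, keep the two mapping columns
geo_map = (raw.assign(Region=region)[~is_aggregate]
              [['Region', 'Geography']]
              .reset_index(drop=True))
print(geo_map)
```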

## Data Processing

First, the quality score for each article was obtained from the ORES API. All data was then combined into a common table, with unmatched rows extracted for reference.

In [8]:
#
### PROCESS DATA ###
#

In [9]:
# Get article quality predictions

In [10]:
def get_ores_data(revision_ids):
    """
    Function to get ores data when passed a list of revision ids
    """
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    
    # Specify the parameters - smushing all the revision IDs together separated by | marks.  
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    # Call the API and convert to json
    api_call = requests.get(endpoint.format(**params))
    response = api_call.json()
    # Convert json to a pandas data frame (pd.json_normalize requires pandas >= 1.0)
    temp_df = pd.json_normalize(response['enwiki'])
    # Rename columns to common terms
    colDict = dict(zip(pd.Series(temp_df.columns),
                       pd.Series(temp_df.columns).str.rsplit('.',n=1,expand=True)[1]))
    temp_df.rename(colDict, axis='columns',inplace=True)
    
    return temp_df

In [11]:
polData.shape

(46701, 3)

In [20]:
polData.head(2)

| | page | country | rev_id |
|---|---|---|---|
| 0 | Bir I of Kanem | Chad | 355319463 |
| 1 | Information Minister of the Palestinian Nation... | Palestinian Territory | 393276188 |


In [13]:
# Loop through rows and compile cumulative ores data frames
start = time.time()
good_frames = []
bad_frames = []
for row in polData.itertuples():
    rev_id = [row.rev_id]
    temp_df = get_ores_data(rev_id)
    temp_df.loc[:,'rev_id'] = rev_id
    # Successful responses yield 9 columns; error responses yield fewer
    if temp_df.shape[1]==9:
        good_frames.append(temp_df)
    else:
        bad_frames.append(temp_df)
# Concatenate once at the end (DataFrame.append was removed in pandas 2.0)
oresData_raw_good = pd.concat(good_frames, ignore_index=True, sort=False) if good_frames else pd.DataFrame()
oresData_raw_bad = pd.concat(bad_frames, ignore_index=True, sort=False) if bad_frames else pd.DataFrame()
print(time.time() - start)

14633.435400724411
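The loop above issues one request per revision, which accounts for the run time of roughly four hours. Since `get_ores_data` already joins a list of revision IDs with `|`, one possible speed-up is to send the IDs in batches. The sketch below shows only the batching step; the helper `chunked`, the batch size of 2, and the last revision ID are hypothetical and chosen purely for illustration.

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Hypothetical batch size and a made-up final ID, for illustration only
rev_ids = [355319463, 393276188, 516633096, 550682925, 123456789]
batches = list(chunked(rev_ids, 2))
print(batches)

# Each batch could then be passed to get_ores_data(batch). Note that the
# column-renaming step in that function assumes one revision per response
# and would need to be reworked for multi-revision responses.
```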


In [14]:
oresData_raw_good.shape

(46546, 9)

In [18]:
oresData_raw_good.head(2)

| | version | prediction | B | C | FA | GA | Start | Stub | rev_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.8.1 | Stub | 0.005417 | 0.007053 | 0.001113 | 0.001616 | 0.01102 | 0.97378 | 355319463 |
| 1 | 0.8.1 | Stub | 0.008993 | 0.009415 | 0.001568 | 0.002877 | 0.047603 | 0.929544 | 393276188 |


In [16]:
oresData_raw_bad.shape

(155, 4)

In [19]:
oresData_raw_bad.head(2)

| | version | message | type | rev_id |
|---|---|---|---|---|
| 0 | 0.8.1 | RevisionNotFound: Could not find revision ({re... | RevisionNotFound | 516633096 |
| 1 | 0.8.1 | RevisionNotFound: Could not find revision ({re... | RevisionNotFound | 550682925 |


In [37]:
# Add article score to polData and extract bad rows
polData_score = pd.merge(polData,
                         oresData_raw_good[['rev_id','prediction']],
                         how='outer',on='rev_id')
polData_noScore = polData_score[polData_score['prediction'].isnull()]
polData_score = polData_score[~polData_score['prediction'].isnull()]

In [63]:
# Add popData to pol data and extract bad rows
allData_raw = pd.merge(polData_score,
                       popData,how='outer',left_on='country',right_on='Geography')
allData_raw_noGeo = allData_raw[allData_raw['Geography'].isnull()]
allData_raw_noPage = allData_raw[allData_raw['page'].isnull()]
allData_raw_good = allData_raw[(~allData_raw['Geography'].isnull())&
                               (~allData_raw['page'].isnull())]
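The matched and unmatched rows of an outer merge can also be separated with the `indicator` option of `pd.merge`, which records the origin of each row in a `_merge` column. The following is a minimal sketch on toy data; the frames `pol` and `pop` and their values are made up.

```python
import pandas as pd

# Toy frames (values are made up): one matching country, one unmatched on each side
pol = pd.DataFrame({'page': ['A', 'B'], 'country': ['Chad', 'Atlantis']})
pop = pd.DataFrame({'Geography': ['Chad', 'Fiji'], 'pop': [15.4, 0.9]})

merged = pd.merge(pol, pop, how='outer', left_on='country',
                  right_on='Geography', indicator=True)

matched = merged[merged['_merge'] == 'both']        # article and population found
no_geo = merged[merged['_merge'] == 'left_only']    # article without population data
no_page = merged[merged['_merge'] == 'right_only']  # country without any article
print(merged)
```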

In [122]:
# Format columns
finalCols = ['country','article_name','revision_id','article_quality','population']
allData = allData_raw_good.rename({'page':'article_name',
                                   'rev_id':'revision_id',
                                   'prediction':'article_quality',
                                   'Population mid-2018 (millions)':'population'},axis=1)
allData = allData[finalCols]
# Format data types
allData.loc[:,'population'] = allData['population'].str.replace(',','',regex=True).astype(float)*1000000

In [123]:
allData.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44464 entries, 0 to 46539
Data columns (total 5 columns):
country            44464 non-null object
article_name       44464 non-null object
revision_id        44464 non-null float64
article_quality    44464 non-null object
population         44464 non-null float64
dtypes: float64(2), object(3)
memory usage: 2.0+ MB


In [124]:
# Output unmatched rows
badData = pd.concat([polData_noScore,allData_raw_noGeo,allData_raw_noPage], sort=False)
badData.to_csv(r'{}wp_wpds_countries-no_match.csv'.format(path),header=True,index=False)

In [125]:
# Output final data
allData.to_csv(r'{}wp_wpds_politicians_by_country.csv'.format(path),header=True,index=False)

## Data Analysis
The final data was then split into several frames to calculate the following metrics
1. Country article count
2. Country population total
3. Country article count for good articles (FA and GA)
4. Region article count
5. Region population sum
6. Region article count for good articles (FA and GA)

These metrics were then used to generate the following tables

1. Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population
2. Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
3. Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
4. Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
5. Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population
6. Geographic regions by relative quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality
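Written out, the two percentages behind these tables take the following form. The symbol names simply mirror the column names used in the code cells; this is a restatement of those calculations rather than an additional definition.

```latex
\text{coverage\%} = \frac{\text{article\_count}}{\text{population}} \times 100,
\qquad
\text{quality\%} = \frac{\text{article\_count\_good}}{\text{article\_count}} \times 100
```

For example, Tuvalu has 54 articles for a population of 10,000, giving a coverage of 54 / 10,000 × 100 = 0.54%. Romania has 39 GA or FA articles out of 343, giving a quality of 39 / 343 × 100 ≈ 11.37%.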

In [118]:
#
### ANALYZE DATA ###
#

In [126]:
allData.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44464 entries, 0 to 46539
Data columns (total 5 columns):
country            44464 non-null object
article_name       44464 non-null object
revision_id        44464 non-null float64
article_quality    44464 non-null object
population         44464 non-null float64
dtypes: float64(2), object(3)
memory usage: 2.0+ MB


In [168]:
# Make a frame with Region information
allData_region = pd.merge(allData, geoMap, how='left',left_on='country',right_on='Geography').drop('Geography',axis=1)

In [175]:
# Calculate metrics
country_article_count = (allData.groupby('country')['article_name'].count()
                         .reset_index().rename({'article_name':'article_count'},axis=1))

country_pop_sum = (allData.groupby('country')['population'].mean()
                   .reset_index().rename({'population':'population_sum'},axis=1))

country_article_good = (allData[(allData['article_quality']=='GA')|
                                (allData['article_quality']=='FA')]
                        .groupby('country')['article_name'].count()
                        .reset_index().rename({'article_name':'article_count_good'},axis=1))

region_article_count = (allData_region.groupby('Region')['article_name'].count()
                        .reset_index().rename({'article_name':'article_count'},axis=1))

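# NOTE: .mean() averages population over article rows (one row per article), so this
# is an article-weighted mean of country populations rather than a regional total
region_pop_sum = (allData_region.groupby('Region')['population'].mean()
                  .reset_index().rename({'population':'population_sum'},axis=1))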

region_article_good = (allData_region[(allData_region['article_quality']=='GA')|
                                      (allData_region['article_quality']=='FA')]
                       .groupby('Region')['article_name'].count()
                       .reset_index().rename({'article_name':'article_count_good'},axis=1))
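Because the table has one row per article, a regional population total needs each country to be counted once. The sketch below contrasts a row-level mean with a de-duplicated sum on toy data. The frame `rows` is hypothetical and the Bhutan figure is made up. With the real data, such a total would cover only countries that have at least one matched article.

```python
import pandas as pd

# Toy article-level rows (one row per article)
rows = pd.DataFrame({
    'Region': ['EUROPE', 'EUROPE', 'EUROPE', 'ASIA'],
    'country': ['Iceland', 'Iceland', 'Monaco', 'Bhutan'],
    'population': [400000.0, 400000.0, 40000.0, 800000.0],
})

# Averaging over article rows weights each country by its article count
row_mean = rows.groupby('Region')['population'].mean()

# Summing one row per country gives the regional total
region_total = (rows.drop_duplicates(['Region', 'country'])
                    .groupby('Region')['population'].sum())

print(row_mean)
print(region_total)
```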

In [191]:
# Calculate data tables
country_cov = pd.merge(country_article_count,country_pop_sum,how='left',on='country')
country_cov.loc[:,'coverage%'] = country_cov['article_count']/country_cov['population_sum']*100

country_qual = pd.merge(country_article_good,country_article_count,how='left',on='country')
country_qual.loc[:,'quality%'] = country_qual['article_count_good']/country_qual['article_count']*100

region_cov = pd.merge(region_article_count,region_pop_sum,how='left',on='Region')
region_cov.loc[:,'coverage%'] = region_cov['article_count']/region_cov['population_sum']*100

region_qual = pd.merge(region_article_good,region_article_count,how='left',on='Region')
region_qual.loc[:,'quality%'] = region_qual['article_count_good']/region_qual['article_count']*100
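One consequence of starting the quality merge from the good-article counts is that countries with no GA or FA articles never enter the quality tables. A possible variant, sketched here on made-up data, starts from the full article counts and fills the missing good-article counts with zero. The frames `counts` and `good` are hypothetical.

```python
import pandas as pd

# Toy inputs: only country 'A' has any good articles
counts = pd.DataFrame({'country': ['A', 'B', 'C'], 'article_count': [10, 20, 5]})
good = pd.DataFrame({'country': ['A'], 'article_count_good': [2]})

# Left-merge from the full list so countries with zero good articles are kept
qual = pd.merge(counts, good, how='left', on='country')
qual['article_count_good'] = qual['article_count_good'].fillna(0).astype(int)
qual['quality%'] = qual['article_count_good'] / qual['article_count'] * 100
print(qual)
```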

In [None]:
# Output data tables

In [203]:
# Top 10 Countries by Coverage
country_cov.sort_values('coverage%',ascending=False)[0:10].reset_index(drop=True)

| | country | article_count | population_sum | coverage% |
|---|---|---|---|---|
| 0 | Tuvalu | 54 | 10000.0 | 0.54 |
| 1 | Nauru | 52 | 10000.0 | 0.52 |
| 2 | San Marino | 81 | 30000.0 | 0.27 |
| 3 | Monaco | 40 | 40000.0 | 0.1 |
| 4 | Liechtenstein | 28 | 40000.0 | 0.07 |
| 5 | Tonga | 63 | 100000.0 | 0.063 |
| 6 | Marshall Islands | 37 | 60000.0 | 0.061667 |
| 7 | Iceland | 201 | 400000.0 | 0.05025 |
| 8 | Andorra | 34 | 80000.0 | 0.0425 |
| 9 | Grenada | 36 | 100000.0 | 0.036 |


In [204]:
# Bottom 10 Countries by Coverage
country_cov.sort_values('coverage%',ascending=True)[0:10].reset_index(drop=True)

| | country | article_count | population_sum | coverage% |
|---|---|---|---|---|
| 0 | India | 980 | 1371300000.0 | 7.1e-05 |
| 1 | Indonesia | 210 | 265200000.0 | 7.9e-05 |
| 2 | China | 1130 | 1393800000.0 | 8.1e-05 |
| 3 | Uzbekistan | 28 | 32900000.0 | 8.5e-05 |
| 4 | Ethiopia | 101 | 107500000.0 | 9.4e-05 |
| 5 | Korea, North | 36 | 25600000.0 | 0.000141 |
| 6 | Zambia | 25 | 17700000.0 | 0.000141 |
| 7 | Thailand | 112 | 66200000.0 | 0.000169 |
| 8 | Mozambique | 58 | 30500000.0 | 0.00019 |
| 9 | Bangladesh | 319 | 166400000.0 | 0.000192 |


In [205]:
# Top 10 Countries by Relative Quality
country_qual.sort_values('quality%',ascending=False)[0:10].reset_index(drop=True)

| | country | article_count_good | article_count | quality% |
|---|---|---|---|---|
| 0 | Korea, North | 7 | 36 | 19.444444 |
| 1 | Saudi Arabia | 15 | 118 | 12.711864 |
| 2 | Mauritania | 6 | 48 | 12.5 |
| 3 | Central African Republic | 8 | 66 | 12.121212 |
| 4 | Romania | 39 | 343 | 11.370262 |
| 5 | Tuvalu | 5 | 54 | 9.259259 |
| 6 | Bhutan | 3 | 33 | 9.090909 |
| 7 | Dominica | 1 | 12 | 8.333333 |
| 8 | Syria | 10 | 128 | 7.8125 |
| 9 | Benin | 7 | 91 | 7.692308 |


In [206]:
# Bottom 10 Countries by Relative Quality
country_qual.sort_values('quality%',ascending=True)[0:10].reset_index(drop=True)

| | country | article_count_good | article_count | quality% |
|---|---|---|---|---|
| 0 | Belgium | 1 | 520 | 0.192308 |
| 1 | Tanzania | 1 | 405 | 0.246914 |
| 2 | Switzerland | 1 | 402 | 0.248756 |
| 3 | Nepal | 1 | 357 | 0.280112 |
| 4 | Peru | 1 | 350 | 0.285714 |
| 5 | Nigeria | 2 | 677 | 0.295421 |
| 6 | Colombia | 1 | 285 | 0.350877 |
| 7 | Lithuania | 1 | 244 | 0.409836 |
| 8 | Fiji | 1 | 197 | 0.507614 |
| 9 | Azerbaijan | 1 | 179 | 0.558659 |


In [207]:
# Geographic Regions by Coverage
region_cov.sort_values('coverage%',ascending=False).reset_index(drop=True)

| | Region | article_count | population_sum | coverage% |
|---|---|---|---|---|
| 0 | EUROPE | 15864 | 34862890.0 | 0.045504 |
| 1 | AFRICA | 6851 | 45621520.0 | 0.015017 |
| 2 | LATIN AMERICA AND THE CARIBBEAN | 5169 | 63567160.0 | 0.008132 |
| 3 | ASIA | 11531 | 310982200.0 | 0.003708 |
| 4 | NORTHERN AMERICA | 1921 | 200387100.0 | 0.000959 |


In [208]:
# Geographic Regions by Relative Quality
region_qual.sort_values('quality%',ascending=False).reset_index(drop=True)

| | Region | article_count_good | article_count | quality% |
|---|---|---|---|---|
| 0 | NORTHERN AMERICA | 99 | 1921 | 5.153566 |
| 1 | ASIA | 310 | 11531 | 2.688405 |
| 2 | EUROPE | 322 | 15864 | 2.029753 |
| 3 | AFRICA | 125 | 6851 | 1.824551 |
| 4 | LATIN AMERICA AND THE CARIBBEAN | 69 | 5169 | 1.334881 |


## Reflections and Implications

The results of this study were surprising because almost none of my expectations were met. Specifically, I expected to see areas with strong institutional freedoms score highly in both the coverage and quality tables. However, there didn't seem to be much rhyme or reason to any of the tables. I believe that this is because of two factors: the coverage and quality percentage calculations were biased by extremes in the numerators or denominators, and the 'quality' indicator is itself of questionable quality.

Ratios are always suspect because they are vulnerable to extremes in either the numerator or the denominator. For example, India scores very low in the coverage category because of its vast population. I don't think that there are even enough unique articles to be written that could overcome such an imbalance.
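To make the size of this imbalance concrete, the short calculation below uses the article counts and populations reported in the coverage tables for Tuvalu and India. The variable names are only illustrative.

```python
# Figures taken from the coverage tables above
tuvalu_articles, tuvalu_pop = 54, 10_000
india_articles, india_pop = 980, 1_371_300_000

tuvalu_cov = tuvalu_articles / tuvalu_pop * 100
india_cov = india_articles / india_pop * 100

# Articles India would need to match Tuvalu's per-capita coverage
india_needed = tuvalu_articles / tuvalu_pop * india_pop
print(tuvalu_cov, india_cov, round(india_needed))
```

By this estimate, India would need roughly 7.4 million politician articles to match Tuvalu's coverage, compared with the 980 it has in this data set.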

Also, measures of quality are always suspect. This is most clearly shown in this exercise by the very high ranking of North Korea in the quality table. This is likely due to the oppressive nature of the country, which limits the types of articles coming out of the country and has clearly biased the results.

As with anything in data science, it is critical to be aware of the biases that can crop up in data as well as the underlying logic used to derive any specific metric.