# Assignment A2: Bias in Data

This report covers Assignment A2, "Bias in Data", for the course DATA 512 (fall quarter 2021) in the MSDS program at UW.

The goal of this assignment was to explore the concept of bias using data on Wikipedia articles: specifically, articles on political figures from a variety of countries, and the estimated quality of those articles as predicted by a machine learning service called ORES.

## Stage 1: Data Acquisition

There are three data sources for this project:

1. A list of Wikipedia politician articles by country
2. A list of countries and corresponding population
3. A list with the estimated quality ranking for each article

### 1.1 List of politician articles

The dataset "Politicians by Country from the English-language Wikipedia" is available from figshare.com:

> Keyes, Os (2017): Politicians by Country from the English-language Wikipedia. figshare. Dataset. https://doi.org/10.6084/m9.figshare.5513449.v6 

At the time of this writing, the latest version available and used in this project was "Version 6, posted on 28.10.2017, 10:49", downloaded as "page_data.csv".

In [1]:
import pandas as pd
import numpy as np

In [2]:
pages_raw = pd.read_csv("page_data.csv")

In [3]:
pages_raw

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409
...,...,...,...
47192,Yahya Jammeh,Gambia,807482007
47193,Lucius Fairchild,United States,807483006
47194,Fahd of Saudi Arabia,Saudi Arabia,807483153
47195,Francis Fessenden,United States,807483270


The dataset includes placeholder and template pages, which are not politician articles. They are excluded by filtering out page names that start with the prefix "Template:".

In [4]:
pages = pages_raw[pages_raw['page'].str.startswith('Template:') == False]

In [5]:
pages = pages.rename(columns = {'rev_id': 'revision_id'})

In [6]:
pages

Unnamed: 0,page,country,revision_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568
...,...,...,...
47192,Yahya Jammeh,Gambia,807482007
47193,Lucius Fairchild,United States,807483006
47194,Fahd of Saudi Arabia,Saudi Arabia,807483153
47195,Francis Fessenden,United States,807483270


### 1.2 List of countries and population

The list of countries and their corresponding population and region was obtained from the [assignment page](https://docs.google.com/spreadsheets/d/1CFJO2zna2No5KqNm9rPK5PCACoXKzb-nycJFhV689Iw/edit?usp=sharing) and was originally sourced from the Population Reference Bureau at https://www.prb.org/international/indicator/population/table/

In [7]:
geography = pd.read_csv("WPDS_2020_data.csv")

In [8]:
geography

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.850,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000
...,...,...,...,...,...,...
229,WS,Samoa,Country,2019,0.200,200000
230,SB,Solomon Islands,Country,2019,0.715,715000
231,TO,Tonga,Country,2019,0.099,99000
232,TV,Tuvalu,Country,2019,0.010,10000


Note that the file has no 'Region' column. Instead, rows with an all-caps name and a 'Type' of 'Sub-Region' act as section headers: every country row that follows belongs to that region until the next 'Sub-Region' row appears.

Here the sub-region names are forward-filled into a new column called 'Region'.

In [9]:
geography['Region'] = geography['Name'].mask(geography['Type'] != 'Sub-Region')
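# Carry each sub-region name forward to the country rows below it
# (the `downcast` argument is deprecated in recent pandas and is not needed here)
geography['Region'] = geography['Region'].ffill()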

In [10]:
geography

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population,Region
0,WORLD,WORLD,World,2019,7772.850,7772850000,
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000,AFRICA
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000,NORTHERN AFRICA
3,DZ,Algeria,Country,2019,44.357,44357000,NORTHERN AFRICA
4,EG,Egypt,Country,2019,100.803,100803000,NORTHERN AFRICA
...,...,...,...,...,...,...,...
229,WS,Samoa,Country,2019,0.200,200000,OCEANIA
230,SB,Solomon Islands,Country,2019,0.715,715000,OCEANIA
231,TO,Tonga,Country,2019,0.099,99000,OCEANIA
232,TV,Tuvalu,Country,2019,0.010,10000,OCEANIA
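
The forward-fill step can be easier to follow on a tiny example. The sketch below uses a made-up six-row table (the `toy` frame is an assumption for illustration, not the WPDS file) and `Series.where`, which is the complement of the `mask` call used above.

```python
import pandas as pd

# Toy stand-in for the WPDS layout: sub-region header rows followed by their countries.
toy = pd.DataFrame({
    "Name": ["WORLD", "NORTHERN AFRICA", "Algeria", "Egypt", "OCEANIA", "Samoa"],
    "Type": ["World", "Sub-Region", "Country", "Country", "Sub-Region", "Country"],
})

# Keep the name only on sub-region rows (NaN elsewhere), then carry it forward.
toy["Region"] = toy["Name"].where(toy["Type"] == "Sub-Region").ffill()

# Only country rows are kept for the analysis.
toy_countries = toy[toy["Type"] == "Country"]
print(toy_countries[["Name", "Region"]])
```

Rows that appear before the first sub-region header (here, WORLD) have nothing to inherit and keep a missing region, which matches the empty 'Region' cell in the first row of the output above.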


Finally, non-country rows are removed and the unneeded columns are dropped.

In [11]:
countries = geography[geography['Type'] == 'Country']
countries = countries.drop(['FIPS', 'Type', 'TimeFrame', 'Data (M)'], axis=1)
countries = countries.rename(columns = {'Name': 'country', 'Population': 'population', 'Region': 'region'})

In [12]:
countries

Unnamed: 0,country,population,region
3,Algeria,44357000,NORTHERN AFRICA
4,Egypt,100803000,NORTHERN AFRICA
5,Libya,6891000,NORTHERN AFRICA
6,Morocco,35952000,NORTHERN AFRICA
7,Sudan,43849000,NORTHERN AFRICA
...,...,...,...
229,Samoa,200000,OCEANIA
230,Solomon Islands,715000,OCEANIA
231,Tonga,99000,OCEANIA
232,Tuvalu,10000,OCEANIA


### 1.3 List of estimated quality for each article

Wikipedia labels the quality of articles using the following scores, from highest quality to lowest:

- FA - Featured article
- GA - Good article
- B - B-class article
- C - C-class article
- Start - Start-class article
- Stub - Stub-class article

Given an article's revision id, the article's estimated quality can be obtained from a machine learning system called ORES (originally an acronym for "Objective Revision Evaluation Service").

The service and API are documented at https://ores.wikimedia.org/ and the Swagger documentation is available at https://ores.wikimedia.org/v3/. A Python client is also available as the 'ores' package.

The API accepts a list of up to 50 revision ids, so the list of pages has to be split into multiple batch calls.

In [13]:
revision_ids = pages.revision_id.tolist()

In [14]:
def split_list(l, n):
    for i in range(0, len(l), n):
        yield l[i:i + n]

In [15]:
max_batch_size = 50
revision_ids_batches = split_list(revision_ids, max_batch_size)
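
As a quick sanity check of the batching logic, here is a small sketch on made-up ids (`example_ids` is an assumption for illustration). It shows that the last batch holds the remainder and that the number of batches is the ceiling of the list length divided by the batch size.

```python
import math

def split_list(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

example_ids = list(range(1, 124))  # 123 made-up revision ids
batches = list(split_list(example_ids, 50))

print([len(batch) for batch in batches])  # [50, 50, 23]
print(len(batches) == math.ceil(len(example_ids) / 50))  # True
```

Note that `split_list` returns a generator, so it can be iterated only once; wrapping it in `list(...)` as above is one way to reuse the batches.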

In [16]:
import statistics
from ores import api
from timeit import default_timer as timer

In [17]:
ores_session = api.Session(
    "https://ores.wikimedia.org",
    "DATA 512 - Assignment 2 - <your_email_address@uw.edu>") # E-mail removed before commit

results = []
request_times = []
for batch in revision_ids_batches:
    start = timer()
    results.extend(ores_session.score("enwiki", ["articlequality"], batch))
    end = timer()
    request_times.append(end - start)
    
print(f"Retrieved {len(results)} results in {len(request_times)} requests. Average time per request: {statistics.mean(request_times)} seconds.")

Retrieved 46701 results in 935 requests. Average time per request: 0.6457659775764695 seconds.


The predictions are then parsed from the JSON results. Note that ORES does not always produce a score; for some revisions it returns an error instead. These errors are converted into an 'Unknown' score here.

In [18]:
predictions = []
for rev_id, result in zip(revision_ids, results):
    quality = result['articlequality']
    predictions.append({
        'revision_id': rev_id,
        # ORES returns an 'error' entry instead of a 'score' for some revisions
        'score': 'Unknown' if 'error' in quality else quality['score']['prediction']
    })

In [19]:
ratings = pd.DataFrame.from_records(predictions)

In [20]:
ratings

Unnamed: 0,revision_id,score
0,355319463,Stub
1,393276188,Stub
2,393822005,Stub
3,395521877,Stub
4,395526568,Stub
...,...,...
46696,807482007,GA
46697,807483006,C
46698,807483153,GA
46699,807483270,C
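
The parsing rule can be tried without calling the service. The sketch below runs it on mocked results shaped like the ones handled in the parsing cell above; the ids, the error payload, and the `extract_prediction` helper are placeholders introduced for illustration, not actual ORES output or part of the 'ores' package.

```python
# Mocked per-revision results: either a 'score' entry or an 'error' entry.
mock_results = [
    {"articlequality": {"score": {"prediction": "Stub"}}},
    {"articlequality": {"error": {"type": "ExampleError", "message": "placeholder"}}},
    {"articlequality": {"score": {"prediction": "GA"}}},
]
mock_revision_ids = [101, 102, 103]  # made-up ids

def extract_prediction(result, model="articlequality", fallback="Unknown"):
    """Return the predicted class, or `fallback` when the result holds an error."""
    entry = result[model]
    if "error" in entry:
        return fallback
    return entry["score"]["prediction"]

mock_predictions = [
    {"revision_id": rev_id, "score": extract_prediction(result)}
    for rev_id, result in zip(mock_revision_ids, mock_results)
]
print(mock_predictions)
```

Pairing ids and results with `zip` relies on the results coming back in the same order as the ids that were sent.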


## Stage 2: Data Processing

After retrieving all data sources, further processing is necessary to join the tables and compute the necessary metrics for the analysis.

First, the lists of pages and countries are merged so that each article row carries the population and region of its country.

In [21]:
politicians_by_country = pages.merge(countries, on='country', how='outer')

In [22]:
politicians_by_country = politicians_by_country.merge(ratings, on='revision_id', how='left')

In [23]:
politicians_by_country

Unnamed: 0,page,country,revision_id,population,region,score
0,Bir I of Kanem,Chad,355319463.0,16877000.0,MIDDLE AFRICA,Stub
1,Abdullah II of Kanem,Chad,498683267.0,16877000.0,MIDDLE AFRICA,Stub
2,Salmama II of Kanem,Chad,565745353.0,16877000.0,MIDDLE AFRICA,Stub
3,Kuri I of Kanem,Chad,565745365.0,16877000.0,MIDDLE AFRICA,Stub
4,Mohammed I of Kanem,Chad,565745375.0,16877000.0,MIDDLE AFRICA,Stub
...,...,...,...,...,...,...
46722,,French Polynesia,,280000.0,OCEANIA,
46723,,Guam,,175000.0,OCEANIA,
46724,,New Caledonia,,295000.0,OCEANIA,
46725,,Palau,,18000.0,OCEANIA,


Some countries have no articles, and some country names do not match between the Wikipedia data and the WPDS data. Analyzing these incomplete rows is outside the scope of this project, but they are saved to a separate file for future analysis.

In [24]:
wp_wpds_countries_no_match = politicians_by_country[politicians_by_country.isna().any(axis=1)]
wp_wpds_countries_no_match.to_csv('wp_wpds_countries-no_match.csv', index=False)
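
For future analysis of the excluded rows, it may help to record why each row failed to match. The sketch below is one possible approach using the `indicator=True` option of `DataFrame.merge`, shown on toy frames; "Atlantis" and the page names are made up, and the `match_status` column is an assumption, not part of the published files.

```python
import pandas as pd

toy_pages = pd.DataFrame({
    "page": ["A", "B", "C"],
    "country": ["Chad", "Chad", "Atlantis"],  # "Atlantis" is a made-up unmatched name
})
toy_countries = pd.DataFrame({
    "country": ["Chad", "Palau"],
    "population": [16877000, 18000],
})

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'.
merged = toy_pages.merge(toy_countries, on="country", how="outer", indicator=True)

merged["match_status"] = merged["_merge"].astype(str).map({
    "left_only": "country missing from population data",
    "right_only": "country has no articles",
    "both": "matched",
})

status_counts = merged["match_status"].value_counts().to_dict()
print(status_counts)
```

This separates the two cases described above (a country with no articles versus a name that exists only in the Wikipedia data), which a plain `isna()` check lumps together.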

Finally, the list of politician articles is filtered to include only complete rows, and columns are adjusted to more appropriate names.

In [25]:
politicians_by_country = politicians_by_country.dropna()
politicians_by_country = politicians_by_country.rename(columns = {'page': 'article_name', 'score': 'article_quality_est'})
politicians_by_country = politicians_by_country[['country', 'region', 'article_name', 'revision_id', 'article_quality_est', 'population']]

In [26]:
wp_wpds_politicians_by_country = politicians_by_country.drop('region', axis=1)

In [27]:
wp_wpds_politicians_by_country

Unnamed: 0,country,article_name,revision_id,article_quality_est,population
0,Chad,Bir I of Kanem,355319463.0,Stub,16877000.0
1,Chad,Abdullah II of Kanem,498683267.0,Stub,16877000.0
2,Chad,Salmama II of Kanem,565745353.0,Stub,16877000.0
3,Chad,Kuri I of Kanem,565745365.0,Stub,16877000.0
4,Chad,Mohammed I of Kanem,565745375.0,Stub,16877000.0
...,...,...,...,...,...
46690,Seychelles,Rita Sinon,800323154.0,Stub,98000.0
46691,Seychelles,Sylvette Frichot,800323798.0,Stub,98000.0
46692,Seychelles,May De Silva,800969960.0,Start,98000.0
46693,Seychelles,Vincent Meriton,802051093.0,Stub,98000.0


In [28]:
wp_wpds_politicians_by_country.to_csv('wp_wpds_politicians_by_country.csv', index=False)

Here, the following 6 tables are calculated:

1. Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population
2. Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
3. Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
4. Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
5. Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population
6. Article quality by geographic regions: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality


First, the number of articles per country is calculated.

In [29]:
articles_by_country = politicians_by_country.groupby(['country', 'population'])['article_name'].count()

In [30]:
articles_by_country = articles_by_country.to_frame().reset_index()
articles_by_country = articles_by_country.rename(columns= {'article_name': 'articles_count'})

In [31]:
articles_by_country

Unnamed: 0,country,population,articles_count
0,Afghanistan,38928000.0,322
1,Albania,2838000.0,457
2,Algeria,44357000.0,116
3,Andorra,82000.0,34
4,Angola,32522000.0,106
...,...,...,...
178,Venezuela,28645000.0,131
179,Vietnam,96209000.0,187
180,Yemen,29826000.0,118
181,Zambia,18384000.0,25


Next, the number of articles as a percentage of each country's population is calculated, yielding the first two tables.

In [32]:
articles_by_country['article_by_population_percentage'] = articles_by_country['articles_count'] * 100.0 / articles_by_country['population']

#### 1. Top 10 countries by coverage

The 10 highest-ranked countries in terms of number of politician articles as a proportion of country population.

In [33]:
articles_by_country.sort_values('article_by_population_percentage', ascending = False).head(10)

Unnamed: 0,country,population,articles_count,article_by_population_percentage
169,Tuvalu,10000.0,54,0.54
117,Nauru,11000.0,52,0.472727
138,San Marino,34000.0,81,0.238235
110,Monaco,38000.0,40,0.105263
95,Liechtenstein,39000.0,28,0.071795
104,Marshall Islands,57000.0,37,0.064912
164,Tonga,99000.0,63,0.063636
70,Iceland,368000.0,202,0.054891
3,Andorra,82000.0,34,0.041463
52,Federated States of Micronesia,106000.0,36,0.033962


#### 2. Bottom 10 countries by coverage

The 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population.

In [34]:
articles_by_country.sort_values('article_by_population_percentage', ascending = False).tail(10)

Unnamed: 0,country,population,articles_count,article_by_population_percentage
13,Bangladesh,169809000.0,321,0.000189
114,Mozambique,31166000.0,58,0.000186
162,Thailand,66534000.0,112,0.000168
84,"Korea, North",25779000.0,36,0.00014
181,Zambia,18384000.0,25,0.000136
51,Ethiopia,114916000.0,101,8.8e-05
176,Uzbekistan,34174000.0,28,8.2e-05
34,China,1402385000.0,1133,8.1e-05
72,Indonesia,271739000.0,211,7.8e-05
71,India,1400100000.0,985,7e-05
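
To make the scale of these percentages concrete, the short sketch below recomputes the coverage metric for the first row of the top table and the last row of the bottom table, using the counts and populations shown above. The `coverage_percentage` helper is introduced here only for illustration.

```python
def coverage_percentage(articles_count, population):
    """Articles per 100 inhabitants, matching the formula used in the notebook."""
    return articles_count * 100.0 / population

tuvalu = coverage_percentage(54, 10_000)          # values from the top-10 table
india = coverage_percentage(985, 1_400_100_000)   # values from the bottom-10 table

print(f"Tuvalu: {tuvalu:.6f}")  # Tuvalu: 0.540000
print(f"India:  {india:.6f}")   # India:  0.000070
print(f"Ratio:  {tuvalu / india:.0f}x")
```

India has roughly 18 times as many articles as Tuvalu but about 140,000 times the population, which is why the two ends of the ranking differ by several orders of magnitude.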


Next, focusing only on high quality articles (Featured and Good Articles), similar metrics are calculated.

In [35]:
high_quality_articles_by_country = politicians_by_country[politicians_by_country['article_quality_est'].isin(['FA', 'GA'])]

In [36]:
high_quality_articles_by_country = high_quality_articles_by_country.groupby('country')['article_name'].count()
high_quality_articles_by_country = high_quality_articles_by_country.to_frame().reset_index()
high_quality_articles_by_country = high_quality_articles_by_country.rename(columns= {'article_name': 'top_articles_count'})

In [37]:
high_quality_articles_by_country

Unnamed: 0,country,top_articles_count
0,Afghanistan,13
1,Albania,3
2,Algeria,2
3,Argentina,16
4,Armenia,5
...,...,...
141,Vanuatu,3
142,Venezuela,3
143,Vietnam,13
144,Yemen,3


In [38]:
articles_by_country = articles_by_country.merge(high_quality_articles_by_country, on='country', how='left').fillna(0)

In [39]:
articles_by_country

Unnamed: 0,country,population,articles_count,article_by_population_percentage,top_articles_count
0,Afghanistan,38928000.0,322,0.000827,13.0
1,Albania,2838000.0,457,0.016103,3.0
2,Algeria,44357000.0,116,0.000262,2.0
3,Andorra,82000.0,34,0.041463,0.0
4,Angola,32522000.0,106,0.000326,0.0
...,...,...,...,...,...
178,Venezuela,28645000.0,131,0.000457,3.0
179,Vietnam,96209000.0,187,0.000194,13.0
180,Yemen,29826000.0,118,0.000396,3.0
181,Zambia,18384000.0,25,0.000136,0.0


In [40]:
articles_by_country['high_quality_articles_percentage'] = articles_by_country['top_articles_count'] * 100.0 / articles_by_country['articles_count']

#### 3. Top 10 countries by relative quality

The 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality.

In [41]:
articles_by_country.sort_values('high_quality_articles_percentage', ascending = False)[['country', 'high_quality_articles_percentage']].head(10)

Unnamed: 0,country,high_quality_articles_percentage
84,"Korea, North",22.222222
140,Saudi Arabia,12.711864
135,Romania,12.244898
31,Central African Republic,12.121212
176,Uzbekistan,10.714286
106,Mauritania,10.416667
64,Guatemala,8.433735
44,Dominica,8.333333
158,Syria,7.751938
18,Benin,7.692308


#### 4. Bottom 10 countries by relative quality

The 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality.

In [42]:
articles_by_country.sort_values('high_quality_articles_percentage', ascending = True)[['country', 'high_quality_articles_percentage']].head(10)

Unnamed: 0,country,high_quality_articles_percentage
63,Guadeloupe,0.0
164,Tonga,0.0
148,Solomon Islands,0.0
138,San Marino,0.0
67,Guyana,0.0
166,Tunisia,0.0
139,Sao Tome and Principe,0.0
62,Grenada,0.0
81,Kazakhstan,0.0
168,Turkmenistan,0.0


Notice there are more than 10 countries with no high quality articles.
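
Because all of these countries tie at 0%, which ten of them appear in the table depends on how the sort happens to order ties. The sketch below, on made-up data, shows one way to count the tied countries and to make the ordering explicit with a secondary sort key; choosing `articles_count` as the tie-breaker is an assumption, not something the assignment prescribes.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],          # made-up countries
    "articles_count": [10, 20, 5, 8],
    "top_articles_count": [0, 2, 0, 0],
})
toy["high_quality_articles_percentage"] = (
    toy["top_articles_count"] * 100.0 / toy["articles_count"]
)

# Countries with no FA or GA articles at all.
zero_quality = toy[toy["top_articles_count"] == 0]
print(len(zero_quality))

# Break the tie explicitly: among 0% countries, list those with more articles first.
ranked = zero_quality.sort_values("articles_count", ascending=False)
print(ranked["country"].tolist())
```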

Finally, the total population and article count are calculated per region.

In [43]:
articles_by_region = politicians_by_country.groupby(['region', 'population'])['article_name'].count()

In [44]:
articles_by_region = articles_by_region.to_frame().reset_index()
articles_by_region = articles_by_region.rename(columns= {'article_name': 'articles_count'})

In [45]:
articles_by_region = articles_by_region.groupby('region').sum().reset_index()

In [46]:
articles_by_region

Unnamed: 0,region,population,articles_count
0,CARIBBEAN,39056000.0,697
1,CENTRAL AMERICA,162267000.0,1545
2,CENTRAL ASIA,74960000.0,247
3,Channel Islands,105680000.0,3781
4,EAST ASIA,1632883000.0,2477
5,EASTERN AFRICA,443825000.0,2509
6,EASTERN EUROPE,281186000.0,3771
7,MIDDLE AFRICA,90189000.0,669
8,NORTHERN AFRICA,243748000.0,902
9,NORTHERN AMERICA,368068000.0,1940
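
One caveat about grouping by `['region', 'population']`: two countries in the same region that happen to share the same population value would fall into one group, so that population would be counted only once. Whether this occurs in the actual data was not checked here. The sketch below, on made-up data, shows the effect and a variant that includes `country` in the grouping key.

```python
import pandas as pd

toy_articles = pd.DataFrame({
    "region": ["R1", "R1", "R1", "R2"],
    "country": ["X", "X", "Y", "Z"],          # X and Y share the same population
    "population": [1000, 1000, 1000, 5000],
    "article_name": ["a", "b", "c", "d"],
})

# Grouping without the country: X and Y collapse into one (region, population) group.
naive = (
    toy_articles.groupby(["region", "population"])["article_name"]
    .count().rename("articles_count").reset_index()
    .groupby("region")[["population", "articles_count"]].sum()
)

# Grouping with the country keeps one row per country before summing per region.
per_country = (
    toy_articles.groupby(["region", "country", "population"])["article_name"]
    .count().rename("articles_count").reset_index()
)
per_region = per_country.groupby("region")[["population", "articles_count"]].sum()

print(naive.loc["R1", "population"], per_region.loc["R1", "population"])
```

Article counts are the same in both variants; only the summed population differs when a collision exists.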


In [47]:
articles_by_region['articles_by_population_percentage'] = articles_by_region['articles_count'] * 100.0 / articles_by_region['population']

#### 5. Geographic regions by coverage

Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population.

In [48]:
articles_by_region.sort_values('articles_by_population_percentage', ascending = False).head(10)

Unnamed: 0,region,population,articles_count,articles_by_population_percentage
10,OCEANIA,42031000.0,3132,0.007452
3,Channel Islands,105680000.0,3781,0.003578
15,SOUTHERN EUROPE,151136000.0,3729,0.002467
18,WESTERN EUROPE,195479000.0,4577,0.002341
0,CARIBBEAN,39056000.0,697,0.001785
6,EASTERN EUROPE,281186000.0,3771,0.001341
14,SOUTHERN AFRICA,66628000.0,635,0.000953
1,CENTRAL AMERICA,162267000.0,1545,0.000952
17,WESTERN ASIA,272499000.0,2580,0.000947
7,MIDDLE AFRICA,90189000.0,669,0.000742


Here, only high quality articles by region are considered.

In [49]:
high_quality_articles_by_region = politicians_by_country[politicians_by_country['article_quality_est'].isin(['FA', 'GA'])]

In [50]:
high_quality_articles_by_region = high_quality_articles_by_region.groupby('region')['article_name'].count()
high_quality_articles_by_region = high_quality_articles_by_region.to_frame().reset_index()
high_quality_articles_by_region = high_quality_articles_by_region.rename(columns= {'article_name': 'top_articles_count'})

In [51]:
articles_by_region = articles_by_region.merge(high_quality_articles_by_region, on='region', how='left').fillna(0)

In [52]:
articles_by_region['high_quality_articles_percentage'] = articles_by_region['top_articles_count'] * 100.0 / articles_by_region['articles_count']

#### 6. Article quality by geographic regions

Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality.

In [53]:
articles_by_region.sort_values('high_quality_articles_percentage', ascending = False)[['region', 'high_quality_articles_percentage']].head(10)

Unnamed: 0,region,high_quality_articles_percentage
9,NORTHERN AMERICA,5.360825
13,SOUTHEAST ASIA,3.588987
17,WESTERN ASIA,3.449612
6,EASTERN EUROPE,3.129143
4,EAST ASIA,3.068228
2,CENTRAL ASIA,2.834008
3,Channel Islands,2.697699
7,MIDDLE AFRICA,2.391629
8,NORTHERN AFRICA,2.10643
10,OCEANIA,2.011494


## Stage 3: Analysis

A quick look at the findings:

- Tables 1 and 2 show that the countries with the highest coverage are mostly islands or small countries such as Liechtenstein and Monaco, while the list of lowest-coverage countries is dominated by highly populated countries. Even China, despite having over a thousand articles, ranks near the bottom because of its large population. This is largely expected: the number of politician articles varies far less from country to country than population does, even for countries with several states or regions, so the ratio is driven mostly by the denominator (population size).
- Table 3 shows North Korea at the top spot for countries with the highest proportion of high quality articles - with only 36 articles, North Korea has over 20% ranked as FA or GA. It would be interesting to understand if these high quality articles about North Korean politicians were authored by/for English speakers.
- Table 4 reveals several countries with no high-quality articles, including small countries and islands.
- Table 5, like Table 1, shows regions with smaller populations at the top.
- Table 6 shows Northern America with the highest proportion of high-quality articles.

### Bias

Considering that Northern American countries appear neither among the top countries by coverage nor among the top countries by proportion of high-quality articles, it is reasonable to suspect that some bias exists when Northern America ranks as the top region in Table 6.

An obvious source of bias is the fact that this assignment is based on English language pages. This has a few implications such as:

- (i) the articles were written by and for English speakers.
- (ii) politicians and the general population in other countries might not invest as much time and resources in creating politician pages for the English-language Wikipedia.
- (iii) the ORES machine learning model was probably trained on pages and labels from the English-language Wikipedia.

Other considerations include:

- (iv) Accessibility: Not all the countries listed have the same level of internet access for their general population. Given that Wikipedia is a crowdsourced encyclopedia, lower participation may reduce the actual quality of articles.
- (v) Cultural differences: politician articles generally include some biographical details and the way to portray these aspects of a person's history or career might differ depending on their country. This will, in part, affect the machine learning evaluation as mentioned in (iii).

### Possible Improvements

One possible way to improve these results and minimize biases would be to train the machine learning model with labels from articles related to people/politicians from a diverse set of countries.

Another possible way would be to pivot the analysis to consider the score of each article in its own language. This would allow comparing the quality of articles on American politicians in the English Wikipedia with the quality of articles on Brazilian politicians in the Portuguese Wikipedia, on German politicians in the German Wikipedia, and so on.

Finally, the data sources could also be improved so no countries are excluded due to simple mismatches between their names in different tables.
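
As a rough sketch of that last idea, country names could be normalized before the merge. The alias table below is hypothetical (the actual mismatches would have to be read from the excluded-rows file), and `normalize_country` is a helper introduced here for illustration only.

```python
import pandas as pd

# Hypothetical alias table: name used in one source -> name used in the other.
COUNTRY_ALIASES = {"Czech Republic": "Czechia"}

def normalize_country(name, aliases=COUNTRY_ALIASES):
    """Collapse stray whitespace, then map known aliases to a single spelling."""
    cleaned = " ".join(name.split())
    return aliases.get(cleaned, cleaned)

toy_pages = pd.DataFrame({
    "page": ["P1", "P2"],
    "country": ["Czech Republic", "Chad "],   # note the trailing space
})
toy_countries = pd.DataFrame({
    "country": ["Czechia", "Chad"],
    "population": [1, 2],                      # placeholder values
})

toy_pages["country"] = toy_pages["country"].map(normalize_country)
matched = toy_pages.merge(toy_countries, on="country", how="inner")
print(matched)
```

Without the normalization step, neither toy row would match; with it, both do.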