Anushna Prakash  
DATA 512 - Human-Centered Data Science  
October 14, 2021  
# A2 - Bias in Data
The goal of this assignment is to explore the concept of bias through data on Wikipedia articles - specifically, articles on political figures from a variety of countries. For this assignment, you will combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.
You are expected to perform an analysis of how the coverage of politicians on Wikipedia and the quality of articles about politicians varies between countries. Your analysis will consist of a series of tables that show:  
- the countries with the greatest and least coverage of politicians on Wikipedia compared to their population.  
- the countries with the highest and lowest proportion of high-quality articles about politicians.  
- a ranking of geographic regions by articles-per-person and proportion of high-quality articles.  
You are also expected to write a short reflection on the project that focuses on how both your findings from this analysis and the process you went through to reach those findings help you understand the causes and consequences of biased data in large, complex data science projects.

## Step 0: Set Up Notebook

In [1]:
# Optional: Make notebook width wider
# (display and HTML are imported from IPython.display; the IPython.core.display path is deprecated)
from IPython.display import display, HTML
display(HTML("<style>.container { width:55% !important; }</style>"))

In [2]:
# Import libraries
import pandas as pd
import numpy as np
import json
import requests
import math

## Step 1: Import data

In [3]:
# See README.md for where data was downloaded from originally.
# Import from data_raw folder assuming we are running from the src folder.
page_data = pd.read_csv('../data_raw/country/data/page_data.csv')
population = pd.read_csv('../data_raw/WPDS_2020_data.csv')

In [4]:
page_data.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [5]:
page_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47197 entries, 0 to 47196
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   page     47197 non-null  object
 1   country  47197 non-null  object
 2   rev_id   47197 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 1.1+ MB


In [7]:
page_data.isna().sum()

page       0
country    0
rev_id     0
dtype: int64

In [8]:
population.head()

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000


In [9]:
population.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 234 entries, 0 to 233
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   FIPS        233 non-null    object 
 1   Name        234 non-null    object 
 2   Type        234 non-null    object 
 3   TimeFrame   234 non-null    int64  
 4   Data (M)    234 non-null    float64
 5   Population  234 non-null    int64  
dtypes: float64(1), int64(2), object(3)
memory usage: 11.1+ KB


In [10]:
population.isna().sum()

FIPS          1
Name          0
Type          0
TimeFrame     0
Data (M)      0
Population    0
dtype: int64

In [12]:
population.loc[population['FIPS'].isna(),]

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
62,,Namibia,Country,2019,2.541,2541000


## Step 2: Cleaning the Data  
Both `page_data.csv` and `WPDS_2020_data.csv` contain some rows that you will need to filter out and/or ignore when you combine the datasets in the next step. In the case of `page_data.csv`, the dataset contains some page names that start with the string "Template:". These pages are not Wikipedia articles, and should not be included in your analysis.  
Similarly, `WPDS_2020_data.csv` contains some rows that provide cumulative regional population counts, rather than country-level counts. These rows are distinguished by having ALL CAPS values in the `geography` field (the `Name` column in the file loaded above; e.g. AFRICA, OCEANIA). These rows won't match the country values in `page_data.csv`, but you will want to retain them (either in the original file, or a separate file) so that you can report coverage and quality by region in the analysis section.

In [13]:
# Remove page names that begin with 'Template:' since these are not Wikipedia articles
page_data = page_data.loc[~page_data['page'].str.startswith('Template:'), ]

In [14]:
page_data

Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568
...,...,...,...
47192,Yahya Jammeh,Gambia,807482007
47193,Lucius Fairchild,United States,807483006
47194,Fahd of Saudi Arabia,Saudi Arabia,807483153
47195,Francis Fessenden,United States,807483270


**Can I just filter on `Type == 'Country'`? Why do we have to check whether the name is all caps? The all-caps check ends up including the Channel Islands, which is typed as a Sub-Region.**
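
To make the difference between the two filters concrete, here is a minimal sketch on toy rows modeled on the WPDS output shown in this notebook. The frame `population_demo` and the variable names are hypothetical and exist only for illustration.

```python
import pandas as pd

# Toy rows modeled on the WPDS file; not the real dataset.
population_demo = pd.DataFrame({
    'Name': ['AFRICA', 'Algeria', 'Channel Islands', 'OCEANIA', 'Samoa'],
    'Type': ['Sub-Region', 'Country', 'Sub-Region', 'Sub-Region', 'Country'],
})

# Filter 1: drop rows whose name is written in ALL CAPS.
by_caps = population_demo.loc[~population_demo['Name'].str.isupper(), 'Name'].tolist()

# Filter 2: keep only rows explicitly typed as a country.
by_type = population_demo.loc[population_demo['Type'] == 'Country', 'Name'].tolist()

# Rows that survive the ALL CAPS filter but are not typed as a country.
only_in_caps_filter = sorted(set(by_caps) - set(by_type))
print(only_in_caps_filter)
```

On these toy rows the two filters disagree only on the Channel Islands, which is mixed case but typed as a Sub-Region.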

In [15]:
# Remove rows that have regional population counts and not country-level counts. Ex: AFRICA, OCEANIA
# population = population.loc[population['Type'] == 'Country', ]
# population.info()

In [16]:
original_population = population.copy()
population = population.loc[~population['Name'].str.isupper(), ]
population.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 210 entries, 3 to 233
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   FIPS        209 non-null    object 
 1   Name        210 non-null    object 
 2   Type        210 non-null    object 
 3   TimeFrame   210 non-null    int64  
 4   Data (M)    210 non-null    float64
 5   Population  210 non-null    int64  
dtypes: float64(1), int64(2), object(3)
memory usage: 11.5+ KB


In [17]:
population.loc[population['Type'] != 'Country',]

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
168,Channel Islands,Channel Islands,Sub-Region,2019,0.172,172000


In [132]:
population.to_csv('../data_clean/WPDS_countries_only.csv', index = False)

## Step 3: Getting Article Quality Predictions

Now you need to get the predicted quality scores for each article in the Wikipedia dataset. We're using a machine learning system called ORES. The name was originally an acronym for "Objective Revision Evaluation Service", but the service was later renamed simply “ORES”. ORES is a machine learning tool that can provide estimates of Wikipedia article quality. The article quality estimates are, from best to worst:  
- FA - Featured article  
- GA - Good article  
- B - B-class article  
- C - C-class article  
- Start - Start-class article  
- Stub - Stub-class article  

These were learned based on articles in Wikipedia that were peer-reviewed using the Wikipedia content assessment procedures. These quality classes are a subset of the quality assessment categories developed by Wikipedia editors. For this assignment, you only need to know that these categories exist, and that ORES will assign one of these 6 categories to any rev_id you send it.  
In order to get article predictions for each article in the Wikipedia dataset, you will first need to read page_data.csv into Python (or R), and then read through the dataset line by line, using the value of the rev_id column to make an API query.

I wasn't able to use the `ores` package directly to make all of the `rev_id` calls at once, so instead I'll use the ORES API and send the `rev_id`s in batches of no more than 50 IDs each to prevent API call failures.
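
As a rough illustration of the batching idea, the sketch below splits a list of revision IDs into contiguous chunks of at most 50 and builds one request URL per chunk. This is an alternative to the modulo-based batch assignment used in this notebook, not a copy of it; the helper names `chunk_rev_ids` and `build_url` and the made-up IDs are assumptions. No network request is made.

```python
def chunk_rev_ids(rev_ids, max_per_batch=50):
    """Yield successive lists of at most max_per_batch revision IDs."""
    for start in range(0, len(rev_ids), max_per_batch):
        yield rev_ids[start:start + max_per_batch]


def build_url(batch, endpoint):
    """Join the IDs in a batch with '|' and substitute them into the endpoint."""
    return endpoint.format(revid='|'.join(str(r) for r in batch))


# Same endpoint template as used in this notebook.
ENDPOINT = 'https://ores.wikimedia.org/v3/scores/enwiki?models=articlequality&revids={revid}'

demo_ids = list(range(1000, 1120))  # 120 made-up revision IDs
batches = list(chunk_rev_ids(demo_ids))
batch_sizes = [len(b) for b in batches]
first_url = build_url(batches[0][:3], ENDPOINT)

print(batch_sizes)
print(first_url)
```

Contiguous chunks keep neighbouring rows in the same file, which can make it easier to trace a failed batch back to its rows.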

In [18]:
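def api_call(endpoint, parameters):
    # Fill the endpoint template, send the request, and return the parsed JSON.
    # Uses the module-level `headers` dict defined below.
    call = requests.get(endpoint.format(**parameters), headers=headers)
    response = call.json()

    return response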

# Fill with your own information if reproducing
headers = {
    'User-Agent': 'https://github.com/anushnap',
    'From': 'anushnap@uw.edu'
}

endpoint = 'https://ores.wikimedia.org/v3/scores/enwiki?models=articlequality&revids={revid}'

df = page_data.copy()
n_batches = math.ceil(len(df) / 49)
df['row_num'] = np.arange(len(df))
df['batch_num'] = df['row_num'] % n_batches

Uncomment and run the next cell if you are re-running and re-downloading the predictions from the API. Otherwise, skip it and load the already-saved data from the .json files in the `data_raw/api_dump` folder.

In [19]:
# for n in range(n_batches):
#     id_str = '|'.join(df.loc[df['batch_num'] == n, 'rev_id'].astype(str))
#     params = {
#         "revid" : id_str
#     }
    
#     call = api_call(endpoint, params)
#     filename = 'ores_scores_enwiki_articlequality_batchnum-' + str(n) + '.json'
    
#     with open('../data_raw/api_dump/' + filename, 'w', encoding='utf-8') as f:
#         json.dump(call, f, ensure_ascii = False, indent=4)

In [20]:
# Read the data back in and get the prediction if it exists
for n in range(n_batches):
    filename = '../data_raw/api_dump/ores_scores_enwiki_articlequality_batchnum-' + str(n) + '.json'
    # Use a context manager so each file handle is closed after reading
    with open(filename, encoding='utf-8') as f:
        temp = json.load(f)['enwiki']['scores']

    for i in temp:
        int_id = int(i)
        try:
            prediction = temp[i]['articlequality']['score']['prediction']
        except KeyError:
            # No score was returned for this revision
            prediction = np.nan
        df.loc[df['rev_id'] == int_id, 'prediction'] = prediction
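
The lookup of a prediction inside each saved response can also be sketched as a small standalone function. The nesting `['articlequality']['score']['prediction']` is the one used in this notebook; the shape of the failed entry in `demo_scores` is an assumption made for illustration, and `extract_predictions` is a hypothetical helper name.

```python
import math


def extract_predictions(scores):
    """Map each revision ID (as int) to its predicted class, or NaN if no score exists."""
    predictions = {}
    for rev_id, result in scores.items():
        try:
            predictions[int(rev_id)] = result['articlequality']['score']['prediction']
        except KeyError:
            # No 'score' entry for this revision, so record a missing value.
            predictions[int(rev_id)] = float('nan')
    return predictions


# Toy payload: one scored revision and one without a score (assumed error shape).
demo_scores = {
    '355319463': {'articlequality': {'score': {'prediction': 'Stub'}}},
    '111': {'articlequality': {'error': {'message': 'hypothetical error'}}},
}
demo_predictions = extract_predictions(demo_scores)
print(demo_predictions)
```

With a dictionary like this, something like `df['rev_id'].map(demo_predictions)` would fill the prediction column in one vectorized step instead of one `.loc` assignment per revision.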

In [21]:
df.isna().sum()

page            0
country         0
rev_id          0
row_num         0
batch_num       0
prediction    276
dtype: int64

Get a list of the articles for which we were unable to get a prediction and save this in `data_clean`.

In [22]:
df.loc[df['prediction'].isnull()].to_csv('../data_clean/articles_missing_prediction.csv', index = False)

## Step 4: Combining the Datasets

Some processing of the data will be necessary! In particular, after retrieving and including the ORES data for each article, you'll need to merge the Wikipedia data and population data together. Both have fields containing country names for just that purpose. After merging the data, you'll invariably run into entries which cannot be merged: either the population dataset does not have an entry for the equivalent Wikipedia country, or vice versa.  
Please remove any rows that do not have matching data, and output them to a CSV file called: `wp_wpds_countries-no_match.csv`  
Consolidate the remaining data into a single CSV file called: `wp_wpds_politicians_by_country.csv`  

The schema for that file should look something like this:  

| Column |
|--------|
| country      |
| article_name      |
| revision_id      |
| article_quality_est.      |
| population      |


Note: `revision_id` here is the same thing as `rev_id`, which you used to get scores from ORES.
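
One way to separate matched from unmatched rows in a single pass is the `indicator=True` option of `DataFrame.merge`, which adds a `_merge` column with the values `left_only`, `right_only`, or `both`. The sketch below uses toy frames with made-up population values; `articles_demo` and `population_demo` are hypothetical names.

```python
import pandas as pd

# Toy inputs: 'Czech Republic' and 'Czechia' deliberately fail to match.
articles_demo = pd.DataFrame({
    'country': ['Chad', 'Czech Republic', 'Gambia'],
    'page': ['Bir I of Kanem', 'Julius Gregr', 'Yahya Jammeh'],
})
population_demo = pd.DataFrame({
    'Name': ['Chad', 'Czechia', 'Gambia'],
    'Population': [1000, 2000, 3000],  # made-up values
})

merged = articles_demo.merge(
    population_demo, how='outer', left_on='country', right_on='Name', indicator=True
)

# Rows present in both tables go to the combined file; the rest are the no-match rows.
matched = merged.loc[merged['_merge'] == 'both'].drop(columns='_merge')
no_match = merged.loc[merged['_merge'] != 'both']

print(matched)
print(no_match)
```

The `_merge` column also records which side a no-match row came from, which helps when deciding whether a country name simply needs to be harmonized.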

In [23]:
# Join all data together
full_results = df.merge(population, how = 'outer', left_on = 'country', right_on = 'Name')

# Find data that is missing in either table and save assuming we are running from src folder
missing_results = full_results.loc[(full_results['country'].isnull() | full_results['Name'].isnull())]
missing_results.to_csv('../data_clean/wp_wpds_countries-no_match.csv', index = False)

In [24]:
print(missing_results['country'].unique())
print(missing_results['Name'].unique())

['Czech Republic' 'Salvadoran' 'Rhodesian' 'Congo, Dem. Rep. of'
 'East Timorese' 'Faroese' 'Cape Colony' 'South Korean' 'Samoan'
 'Montserratian' 'Pitcairn Islands' 'Saint Kitts and Nevis' 'Macedonia'
 'Abkhazia' 'Niuean' 'Ivorian' 'Carniolan' 'Saint Lucian'
 'South African Republic' 'Hondura' 'Incan' 'Chechen' 'Jersey' 'Guernsey'
 'Saint Vincent and the Grenadines' 'South Ossetian' 'Cook Island' 'Omani'
 'Tokelauan' 'Swaziland' 'Dagestani' 'Greenlandic' 'Ossetian' 'Palauan'
 'Somaliland' 'Rojava' nan]
[nan 'Western Sahara' "Cote d'Ivoire" 'Mayotte' 'Reunion'
 'Congo, Dem. Rep.' 'eSwatini' 'El Salvador' 'Honduras' 'Curacao'
 'Puerto Rico' 'St. Kitts-Nevis' 'Saint Lucia'
 'St. Vincent and the Grenadines' 'Georgia' 'Oman' 'Brunei' 'Timor-Leste'
 'China, Hong Kong SAR' 'China, Macao SAR' 'Channel Islands' 'Czechia'
 'North Macedonia' 'French Polynesia' 'Guam' 'New Caledonia' 'Palau'
 'Samoa']


In [36]:
# Join data together, but only non-missing data
results = df.merge(population, how = 'inner', left_on = 'country', right_on = 'Name')\
    [['country', 'page', 'rev_id', 'prediction', 'Population']]\
    .rename(
        {'page': 'article_name', 'rev_id': 'revision_id', 'prediction': 'article_quality_est.', 'Population': 'population'}, 
        axis = 1)

# Write results to a table assuming we are in the src folder
results.to_csv('../data_clean/wp_wpds_politicians_by_country.csv', index = False)

## Step 5: Analysis

Your analysis will consist of calculating the proportion (as a percentage) of articles-per-population and high-quality articles for each country AND for each geographic region. By "high quality" articles, in this case we mean the number of articles about politicians in a given country that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) classes.  
Examples:  
- if a country has a population of 10,000 people, and you found 10 FA or GA class articles about politicians from that country, then the percentage of articles-per-population would be 0.1%.  
- if a country has 10 articles about politicians, and 2 of them are FA or GA class articles, then the percentage of high-quality articles would be 20%.
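
The two percentages can be sketched on a toy table as follows. This is one possible reading, assuming coverage counts all articles for a country while quality counts only FA and GA articles; the toy data and the `pct_*` column names are hypothetical. Aggregating the boolean flag with `sum` keeps countries that have no high-quality articles, with a count of zero.

```python
import pandas as pd

# Toy data: country A has 4 articles (2 high quality), country B has 2 (none high quality).
results_demo = pd.DataFrame({
    'country': ['A', 'A', 'A', 'A', 'B', 'B'],
    'population': [10_000] * 4 + [5_000] * 2,
    'article_quality_est.': ['FA', 'GA', 'Stub', 'C', 'Stub', 'Start'],
})
results_demo['high_quality'] = results_demo['article_quality_est.'].isin(['FA', 'GA'])

# One row per country: total article count and number of FA/GA articles.
per_country = results_demo.groupby(['country', 'population'], as_index=False).agg(
    total_articles=('high_quality', 'size'),
    num_high_quality_articles=('high_quality', 'sum'),
)

# Express both measures as percentages.
per_country['pct_articles_per_pop'] = (
    100 * per_country['total_articles'] / per_country['population']
)
per_country['pct_high_quality'] = (
    100 * per_country['num_high_quality_articles'] / per_country['total_articles']
)
print(per_country)
```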

In [96]:
# High quality articles are ones that are classified as FA or GA
results['high_quality'] = results['article_quality_est.'].isin(['FA', 'GA'])
country_stats = results.groupby(['country', 'population', 'high_quality'], dropna = False, as_index = False)\
    .agg({'revision_id': 'count'})

In [101]:
country_stats['total_ids'] = country_stats.groupby(['country', 'population'])['revision_id'].transform('sum')
country_stats['percentage_of_articles'] = country_stats['revision_id'] / country_stats['total_ids']
country_stats['percentage_of_pop'] = country_stats['revision_id'] / country_stats['population']
articles_by_country = country_stats.loc[country_stats['high_quality'] == True].drop(['high_quality'], axis = 1)\
    .rename({'revision_id': 'num_high_quality_articles', 'total_ids': 'total_articles'}, axis = 1)
articles_by_country

Unnamed: 0,country,population,num_high_quality_articles,total_articles,percentage_of_articles,percentage_of_pop
1,Afghanistan,38928000,13,322,0.040373,3.339499e-07
3,Albania,2838000,3,457,0.006565,1.057082e-06
5,Algeria,44357000,2,116,0.017241,4.508871e-08
10,Argentina,45377000,16,491,0.032587,3.526015e-07
12,Armenia,2956000,5,196,0.025510,1.691475e-06
...,...,...,...,...,...,...
319,Vanuatu,321000,3,60,0.050000,9.345794e-06
321,Venezuela,28645000,3,131,0.022901,1.047303e-07
323,Vietnam,96209000,13,187,0.069519,1.351225e-07
325,Yemen,29826000,3,118,0.025424,1.005834e-07


I built this region hierarchy table manually in Microsoft Excel. It maps each country to its regional hierarchy classification.
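
As a possible alternative to a hand-built table, the closest region for each country could be derived programmatically. The sketch below assumes that the WPDS file lists each ALL CAPS region row directly above the countries that belong to it, as the first rows of the file suggest; that ordering is an assumption, and the approach recovers only the nearest region header, not the full hierarchy.

```python
import pandas as pd

# Toy rows in the same order as the WPDS file: region headers precede their countries.
wpds_demo = pd.DataFrame({
    'Name': ['WORLD', 'AFRICA', 'NORTHERN AFRICA', 'Algeria', 'Egypt', 'OCEANIA', 'Samoa'],
    'Type': ['World', 'Sub-Region', 'Sub-Region', 'Country', 'Country', 'Sub-Region', 'Country'],
})

# Region headers are the rows whose name is entirely upper case.
is_region = wpds_demo['Name'].str.isupper()

# Keep the name only on header rows, then carry it forward to the rows below.
wpds_demo['region'] = wpds_demo['Name'].where(is_region).ffill()

country_regions = wpds_demo.loc[~is_region, ['Name', 'region']].reset_index(drop=True)
print(country_regions)
```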

In [134]:
region_hierarchy = pd.read_csv('../data_clean/WPDS_countries_with_region_hierarchy.csv')
region_hierarchy

Unnamed: 0,Name,Region_0,Region_1,Region_2
0,Algeria,World,AFRICA,NORTHERN AFRICA
1,Egypt,World,AFRICA,NORTHERN AFRICA
2,Libya,World,AFRICA,NORTHERN AFRICA
3,Morocco,World,AFRICA,NORTHERN AFRICA
4,Sudan,World,AFRICA,NORTHERN AFRICA
...,...,...,...,...
205,Samoa,World,OCEANIA,OCEANIA
206,Solomon Islands,World,OCEANIA,OCEANIA
207,Tonga,World,OCEANIA,OCEANIA
208,Tuvalu,World,OCEANIA,OCEANIA


In [124]:
# Manually create mapping of all countries and their regions


## Step 6: Results

Your results from this analysis will be published in the form of data tables. You are being asked to produce six total tables, that show:  
- Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population  
- Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population  
- Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality  
- Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality  
- Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population  
- Geographic regions by relative quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality  

Embed these tables in your Jupyter notebook. You do not need to graph or otherwise visualize the data for this assignment, although you are welcome to do so in addition to generating the data tables described above, if you wish.  
Reminder: you will find the list of geographic regions, which countries are in each region, and total regional population in the raw `WPDS_2020_data.csv` file. See "Step 2: Cleaning the data" above for more information.
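
A minimal sketch of how the tables listed above might be produced, using made-up numbers. The `region` column, the ratio column names, and the choice to sum country populations per region are assumptions for illustration; the assignment points to the regional totals in the raw file, which may differ from a sum over matched countries. Two rows per table are shown instead of ten to keep the toy data small.

```python
import pandas as pd

# Made-up per-country summary with a hypothetical region label.
per_country = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'region': ['R1', 'R1', 'R2', 'R2'],
    'population': [1_000, 4_000, 2_000, 8_000],
    'total_articles': [10, 8, 2, 4],
    'num_high_quality_articles': [1, 4, 0, 1],
})
per_country['articles_per_person'] = per_country['total_articles'] / per_country['population']
per_country['high_quality_share'] = (
    per_country['num_high_quality_articles'] / per_country['total_articles']
)

# Country tables: highest coverage and lowest relative quality.
top_coverage = per_country.nlargest(2, 'articles_per_person')['country'].tolist()
bottom_quality = per_country.nsmallest(2, 'high_quality_share')['country'].tolist()

# Region tables: sum the counts first, then compute the ratios on the totals.
per_region = per_country.groupby('region', as_index=False)[
    ['population', 'total_articles', 'num_high_quality_articles']
].sum()
per_region['articles_per_person'] = per_region['total_articles'] / per_region['population']
per_region['high_quality_share'] = (
    per_region['num_high_quality_articles'] / per_region['total_articles']
)
region_ranking = per_region.sort_values('articles_per_person', ascending=False)

print(top_coverage, bottom_quality)
print(region_ranking)
```

Computing regional ratios from summed counts, instead of averaging the country ratios, weights each country by its size.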