The goal of this project is to analyze potential bias in Wikipedia's coverage of politicians by country and region. The analysis looks at the number of articles (or lack thereof) relative to each country's population, as well as the proportion of high-quality articles, as rated by the Wikimedia ORES (Objective Revision Evaluation Service) algorithm.

ORES estimates a probability for each of 6 quality categories and reports the most likely one as its prediction for an article. The scores take into account the timing of the evaluation. The 6 categories, from best to worst, are as follows:

* FA - Featured article
* GA - Good article
* B - B-class article
* C - C-class article
* Start - Start-class article
* Stub - Stub-class article

The documentation for the ORES API can be found at https://ores.wikimedia.org/. Version 3 of the API was used.
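As a rough illustration of what a version 3 request and response look like, the sketch below builds a scores URL and pulls the predicted class out of a response dictionary. The `enwiki` context, the `models`/`revids` query parameters, the nested response layout, and the helper names `build_ores_url` and `extract_predictions` are assumptions for illustration; only the `articlequality` → `score` → `prediction` keys are taken from the code used later in this notebook. The sample response is hand-written, not real ORES output.

```python
from urllib.parse import urlencode

# Assumed endpoint for the English Wikipedia context of the v3 API
ORES_ENDPOINT = "https://ores.wikimedia.org/v3/scores/enwiki/"


def build_ores_url(rev_ids):
    """Build a scores URL for a batch of revision IDs (hypothetical helper)."""
    query = urlencode({
        "models": "articlequality",
        "revids": "|".join(str(r) for r in rev_ids),
    })
    return f"{ORES_ENDPOINT}?{query}"


def extract_predictions(response, context="enwiki"):
    """Map each revision ID to its predicted class, or None when no score exists."""
    predictions = {}
    for rev_id, models in response[context]["scores"].items():
        result = models["articlequality"]
        # Assumption: unscorable revisions carry an "error" key instead of "score"
        predictions[rev_id] = result["score"]["prediction"] if "score" in result else None
    return predictions


# Hand-written sample in the assumed response shape (not real ORES output)
sample_response = {
    "enwiki": {
        "scores": {
            "355319463": {"articlequality": {"score": {"prediction": "Stub"}}},
            "1": {"articlequality": {"error": {"type": "RevisionNotFound"}}},
        }
    }
}

url = build_ores_url([355319463, 393276188])
predictions = extract_predictions(sample_response)
```

Keeping the URL construction separate from the parsing makes it possible to check the parsing logic against a saved response without calling the service.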

The politician article data was retrieved from https://figshare.com/articles/Untitled_Item/5513449 on 10/16/2019. Version 6 was used for this analysis. The data is released under the CC-BY-SA 4.0 license. The following data description was copied directly from the page.

<i>The data was extracted via the Wikimedia API using the associated code. It is formatted as a CSV and saved as page_data.csv in the "data" directory. Columns are:

1. "country", containing the sanitised country name, extracted from the category name;
2. "page", containing the unsanitised page title.
3. "last_edit", containing the edit ID of the last edit to the page.

Country codes are inconsistent. Where possible, they have been modified to match the country names found in http://www.prb.org/DataFinder/Topic/Rankings.aspx?ind=14 - but the PRB dataset contains nations not found in Wikipedia, and vice versa.

The actual recursion only went 2 levels deep into the category tree: someone listed as an Antiguan politician, say, is included - someone exclusively listed as an Antiguan politician who was assassinated is not.</i>

The world population data was retrieved from https://www.prb.org/international/indicator/population/table/, using the population labeled as of mid-2018. The columns are:
* Geography: The geographical area for the entry. Entries in all capitals (e.g. ASIA) represent aggregated regions
* Population mid-2018 (millions): The recorded population (in millions) of the associated geographical area as of mid-2018

<i>Note: There is an issue with attempting to use the Python ORES API on Windows 10. The pyenchant dependency, which in turn depends on the enchant C library, has been an abandoned project since 2018 and will likely not be fixed. The pickle file provided to circumvent this issue is courtesy of Alexander Van Roijen.</i>

In [730]:
import pandas as pd
from pandas import DataFrame as DF
import numpy as np
import pickle as pk

from IPython.display import clear_output
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# pickle file provided courtesy of Alexander Van Roijen
qualities = pk.load(open('allqualities.pk', 'rb'))

First we will address the page data. The "article_quality" column will be added here.
Also, there are entries in the page data whose page names begin with "Template". These are not articles and will be removed from consideration.

In [408]:
pg_data = pd.read_csv('page_data.csv')
artQual = DF(np.asarray([None]*len(pg_data)), columns=['article_quality'])
pg_data = pd.concat([pg_data, artQual], axis=1)

# remove rows for Template pages, which are not articles
pg_data = pg_data[~pg_data['page'].str.contains('Template:', na=False)]
pg_data.head()

Unnamed: 0,page,country,rev_id,article_quality
1,Bir I of Kanem,Chad,355319463,
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,
12,Yos Por,Cambodia,393822005,
23,Julius Gregr,Czech Republic,395521877,
24,Edvard Gregr,Czech Republic,395526568,


The following code block will populate the ORES prediction scores. This process will take a long time. It is recommended to use the wp_wpds_politicians_by_country.csv file instead for the sake of time.

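<i>Note: The lack of a loop condition is intentional. The loop ends when the index runs past the end of the qualities list and raises the IndexError shown below.</i>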

In [496]:
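# Add the prediction scores to the dataframe. This will take a long time.
# The loop has no exit condition: it stops when qualities[i] raises IndexError.
x = 0
while True:
    clear_output()
    y = x + 50

    print(x)
    for i in range(x, y):
        if 'score' in qualities[i]['articlequality']:
            # .loc avoids chained assignment, which may silently fail to write
            pg_data.loc[pg_data.index[i], 'article_quality'] = qualities[i]['articlequality']['score']['prediction']
    # advance only after the batch is processed so rows 0-49 are not skipped
    x = y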

46700


IndexError: list index out of range
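
For comparison, here is a minimal sketch of the same step with an explicit bound, so that it finishes without raising an exception. It assumes, as the loop above does, that `qualities[i]` belongs to the i-th row of `pg_data` by position; the function name `add_predictions` is hypothetical.

```python
import pandas as pd


def add_predictions(pg_data, qualities):
    """Return a copy of pg_data with article_quality filled in by position."""
    out = pg_data.copy()
    # Start with "no prediction" for every row
    predictions = [None] * len(out)
    # Assumption: qualities[i] describes the i-th row of pg_data
    for i, entry in enumerate(qualities[:len(out)]):
        result = entry['articlequality']
        if 'score' in result:
            predictions[i] = result['score']['prediction']
    # Assigning a full-length list sets the column row by row, in order
    out['article_quality'] = predictions
    return out


# Tiny hand-made example (not real data)
demo_pages = pd.DataFrame({'page': ['A', 'B', 'C']}, index=[1, 10, 12])
demo_qualities = [
    {'articlequality': {'score': {'prediction': 'Stub'}}},
    {'articlequality': {'error': {}}},
]
demo_result = add_predictions(demo_pages, demo_qualities)
```

Rows without a score, and rows beyond the end of `qualities`, are left empty rather than stopping the run.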

In [498]:
# Intermediate save
pg_data.to_csv('wp_wpds_politicians_by_country.csv', index=False)

Next, we will handle the world population data.
Here, we'll add each country's geographical region as a separate column for later use. We will also remove the aggregated region rows from the table.

In [499]:
WPDS = pd.read_csv('WPDS_2018_data.csv')
WPDS['Region'] = None
# Each region header row is followed by its countries, at fixed positions in this file.
# .loc with positional index labels avoids chained assignment.
WPDS.loc[WPDS.index[0:56], 'Region'] = WPDS['Geography'][0] # AFRICA
WPDS.loc[WPDS.index[56:59], 'Region'] = WPDS['Geography'][56] # NORTHERN AMERICA
WPDS.loc[WPDS.index[59:95], 'Region'] = WPDS['Geography'][59] # LATIN AMERICA AND THE CARIBBEAN
WPDS.loc[WPDS.index[95:144], 'Region'] = WPDS['Geography'][95] # ASIA
WPDS.loc[WPDS.index[144:189], 'Region'] = WPDS['Geography'][144] # EUROPE
WPDS.loc[WPDS.index[189:], 'Region'] = WPDS['Geography'][189] # OCEANIA

# Drop the aggregated region rows, keeping only countries
regions = ['AFRICA', 'ASIA', 'EUROPE', 'LATIN AMERICA AND THE CARIBBEAN',
           'NORTHERN AMERICA', 'OCEANIA']
WPDS = WPDS[~WPDS.Geography.isin(regions)]

WPDS.head()

Unnamed: 0,Geography,Population mid-2018 (millions),Region
1,Algeria,42.7,AFRICA
2,Egypt,97.0,AFRICA
3,Libya,6.5,AFRICA
4,Morocco,35.2,AFRICA
5,Sudan,41.7,AFRICA
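
The hard-coded row positions above depend on the exact layout of this file. A minimal sketch of an alternative uses the fact that region rows are written in all capitals and forward-fills the most recent region name onto the country rows beneath it. The function name `label_regions` is hypothetical, and the sketch assumes the Geography column has no missing values and that no country name is written entirely in capitals.

```python
import pandas as pd


def label_regions(wpds):
    """Add a Region column from all-caps header rows, then drop those rows."""
    out = wpds.copy()
    # True for aggregated region rows such as "AFRICA" or "ASIA"
    is_region = out['Geography'].str.isupper()
    # Keep region names only on header rows, then carry each one downward
    out['Region'] = out['Geography'].where(is_region).ffill()
    return out[~is_region].reset_index(drop=True)


# Tiny hand-made example (not the real WPDS file)
demo_wpds = pd.DataFrame({
    'Geography': ['AFRICA', 'Algeria', 'Egypt', 'ASIA', 'China'],
})
demo_labeled = label_regions(demo_wpds)
```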


Now, we will merge the two data frames, joining on the "country" and "Geography" columns. An outer join lets us pull out the countries without matches: some countries exist in the article data without a matching geography, and vice versa.

The list of countries without matches will be condensed into a single DataFrame and saved as wp_wpds_countries-no_match.csv.

The DataFrame housing the completed entries will be saved as wp_wpds_politicians_by_country.csv.

In [500]:
# merge
article_data = pg_data.merge(WPDS, how='outer', indicator=True, left_on='country', right_on='Geography')
article_data.head()

Unnamed: 0,page,country,rev_id,article_quality,Geography,Population mid-2018 (millions),Region,_merge
0,Bir I of Kanem,Chad,355319463.0,Stub,Chad,15.4,AFRICA,both
1,Abdullah II of Kanem,Chad,498683267.0,Stub,Chad,15.4,AFRICA,both
2,Salmama II of Kanem,Chad,565745353.0,Stub,Chad,15.4,AFRICA,both
3,Kuri I of Kanem,Chad,565745365.0,Stub,Chad,15.4,AFRICA,both
4,Mohammed I of Kanem,Chad,565745375.0,Stub,Chad,15.4,AFRICA,both


In [502]:
# Countries that appear in one dataset, but not the other.
no_match = DF(np.append(article_data['country'][article_data._merge != 'both'].unique(), article_data['Geography'][article_data._merge != 'both'].unique()), columns = ['country']).sort_values('country')[0:60]
no_match.to_csv('wp_wpds_countries-no_match.csv', index=False)
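
As a sketch of how the `_merge` indicator separates the two kinds of mismatch, the example below merges only the distinct names from each side and reads off the `left_only` and `right_only` rows. The function name `unmatched_countries` and the sample frames are hypothetical; the column names follow the ones used above.

```python
import pandas as pd


def unmatched_countries(pages, population):
    """List names that appear in only one of the two datasets."""
    merged = pages[['country']].drop_duplicates().merge(
        population[['Geography']].drop_duplicates(),
        how='outer', left_on='country', right_on='Geography', indicator=True,
    )
    # left_only: in the article data only; right_only: in the population data only
    left_only = merged.loc[merged['_merge'] == 'left_only', 'country']
    right_only = merged.loc[merged['_merge'] == 'right_only', 'Geography']
    names = pd.concat([left_only, right_only]).sort_values().reset_index(drop=True)
    return pd.DataFrame({'country': names})


# Tiny hand-made example (not real data)
demo_pages = pd.DataFrame({'country': ['Chad', 'Chad', 'Abkhazia']})
demo_population = pd.DataFrame({'Geography': ['Chad', 'Algeria']})
demo_no_match = unmatched_countries(demo_pages, demo_population)
```

Selecting by indicator value avoids having to trim missing values from the combined list afterwards.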

In [738]:
# Keep only the rows that matched in both datasets
df_articles = article_data[article_data._merge == 'both']
df_articles = df_articles.drop(['_merge'], axis = 1).rename(columns={'Population mid-2018 (millions)': 'population', 'page' : 'article_name'})
df_articles = df_articles[['country', 'article_name', 'rev_id', 'article_quality', 'population', 'Region']]
df_articles['population'] = df_articles['population'].str.replace(',', '').astype(float)

df_articles.to_csv('wp_wpds_politicians_by_country.csv', index=False)

df_articles.head()

Unnamed: 0,country,article_name,rev_id,article_quality,population,Region
0,Chad,Bir I of Kanem,355319463.0,Stub,15.4,AFRICA
1,Chad,Abdullah II of Kanem,498683267.0,Stub,15.4,AFRICA
2,Chad,Salmama II of Kanem,565745353.0,Stub,15.4,AFRICA
3,Chad,Kuri I of Kanem,565745365.0,Stub,15.4,AFRICA
4,Chad,Mohammed I of Kanem,565745375.0,Stub,15.4,AFRICA


For the next phase, we will focus on the analysis. This will require manipulation of our data structures.

First, we will calculate the count of articles by country. Then, we will construct a DataFrame that shows the number of politician articles per million people in each country.

In [857]:
# Gather the count of articles by country
counts = []
for ab in df_articles.country.unique():
    artCount = len(df_articles[df_articles['country'] == ab])
    counts.append((ab, artCount))

art_Counts = DF(np.asarray(counts), columns=['country', 'article_count'])
art_Counts['article_count'] = art_Counts['article_count'].astype(int)

# Build the article proportions dataframe
df_proportions = df_articles.merge(art_Counts, how='left', left_on='country', right_on='country')
df_proportions.article_count = (df_proportions.article_count/df_proportions.population)

# Proportions by country
df_proportions = df_proportions.rename(columns={'article_count': 'article_proportion'})[pd.Series(['country', 'article_proportion'])]

# Separate DataFrames for regions, countries and proportion values
countries = df_proportions.sort_values('country')['country'].unique()
proportions = df_proportions.groupby('country', group_keys=False).max()['article_proportion']
df_countries = DF(countries)

# Join the country and proportions into one table
country_proportions = []
for c, p in zip(countries, proportions):
    #reg = df_articles[df_articles['country'] == c]['Region'][0]
    country_proportions.append((c,p))

country_proportions = DF(country_proportions, columns=['country', 'proportion'])
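
The same table can be sketched more compactly with a single `groupby`, which counts articles and picks up each country's population in one pass. This is an alternative sketch rather than the method used above; the function name `articles_per_million` is hypothetical, and it assumes every row for a country carries the same population value.

```python
import pandas as pd


def articles_per_million(df_articles):
    """Compute articles per million people for each country."""
    grouped = df_articles.groupby('country').agg(
        article_count=('article_name', 'size'),   # number of rows per country
        population=('population', 'first'),       # population is constant per country
    )
    grouped['proportion'] = grouped['article_count'] / grouped['population']
    return grouped.reset_index()[['country', 'proportion']]


# Tiny hand-made example (not real data); population is in millions
demo_articles = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B'],
    'article_name': ['a1', 'a2', 'a3', 'b1'],
    'population': [1.5, 1.5, 1.5, 4.0],
})
demo_proportions = articles_per_million(demo_articles)
```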

The following view shows the top 10 countries by proportional coverage (articles per million people).

In [740]:
country_proportions.sort_values('proportion', ascending=False).head(10)

Unnamed: 0,country,proportion
166,Tuvalu,5400.0
115,Nauru,5200.0
135,San Marino,2700.0
108,Monaco,1000.0
93,Liechtenstein,700.0
161,Tonga,630.0
103,Marshall Islands,616.666667
68,Iceland,505.0
3,Andorra,425.0
61,Grenada,360.0


The following view shows the bottom 10 countries by proportional coverage (articles per million people).

In [732]:
country_proportions.sort_values('proportion', ascending=True).head(10)

Unnamed: 0,country,proportion
69,India,0.718297
70,Indonesia,0.795626
34,China,0.812886
173,Uzbekistan,0.851064
51,Ethiopia,0.939535
82,"Korea, North",1.40625
178,Zambia,1.412429
159,Thailand,1.691843
112,Mozambique,1.901639
13,Bangladesh,1.929087


This project served primarily to remind me how difficult it can be to get data structures to cooperate. A lot of time was lost getting columns aligned or aggregated correctly and structuring the data with code that was not intuitive.
In the end, there was not enough analysis performed to discover any stories regarding bias. There does, however, appear to be a trend among the countries with low proportional coverage: they lie primarily in the Eastern Hemisphere, and the lowest proportions are within Asia.

Meanwhile, the highest proportions are within Oceania and Europe. However, given how much higher the proportions are for Tuvalu through Monaco than for the rest, the data may be easy to misinterpret. It is quite possible that looking only at article counts over population favors countries with small populations.
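
A quick arithmetic sketch shows why a small denominator can dominate this metric. The article counts and populations below are hypothetical, chosen only to illustrate the effect, and are not taken from the datasets.

```python
def per_million(article_count, population_millions):
    """Articles per million people, the metric used in the tables above."""
    return article_count / population_millions


# Hypothetical micro-state: 50 articles, 10,000 people (0.01 million)
small_country = per_million(50, 0.01)

# Hypothetical large country: 1,000 articles, 1 billion people (1,000 million)
large_country = per_million(1000, 1000)

# The larger country has 20 times as many articles, yet its score is
# thousands of times lower because of the difference in population.
ratio = small_country / large_country
```

With these numbers the micro-state scores roughly 5,000 articles per million against 1 for the large country, even though it has far fewer articles in total.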

Countries with large deviations from the mean population size or article count could also skew the results. This isn't to say the data for this analysis is without use. Potentially, it could be used to quantify differences in media coverage for certain regions if paired with another dataset. Without other sources of information, though, the data lacks the context required for further interpretation. There is also the question of who the source of the article content is. It is highly uncommon for governments to speak negatively about themselves. If they are the source of much of the content on, say, incumbent politicians, then there is a bias to account for. Using this data for interpretations beyond Wikipedia, without additional context, is therefore likely a flawed approach.