The goal of this project is to analyze potential bias in Wikipedia's coverage of politicians by country and region. The analysis looks at the number of articles (or lack thereof) relative to each geography's population, as well as the proportion of high-quality articles, as rated by the Wikimedia ORES (Objective Revision Evaluation Service) algorithm.

ORES assigns each article one of 6 quality categories, based on the probability it estimates for each category. The scores take into account the timing of the evaluation. The 6 categories, from best to worst, are as follows:

* FA - Featured article
* GA - Good article
* B - B-class article
* C - C-class article
* Start - Start-class article
* Stub - Stub-class article

The documentation for the ORES API can be found at https://ores.wikimedia.org/. Version 3 of the API was used.
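
To make the link between the per-category probabilities and the reported category concrete, here is a minimal sketch. The `score` dictionary is an invented example shaped like the `score` entries read later in this notebook (a `prediction` plus a `probability` mapping, the latter being an assumption about the response layout). The numbers are illustrative, not real ORES output, and the sketch assumes the prediction is simply the most probable category.

```python
# Hypothetical score entry; the probabilities are invented for illustration.
score = {
    "prediction": "Stub",
    "probability": {
        "FA": 0.01, "GA": 0.02, "B": 0.05,
        "C": 0.10, "Start": 0.22, "Stub": 0.60,
    },
}

# Assumption: the reported category is the one with the highest probability.
most_likely = max(score["probability"], key=score["probability"].get)
print(most_likely)
```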

The politician article data was retrieved from https://figshare.com/articles/Untitled_Item/5513449 on 10/16/2019. Version 6 was used for this analysis. The data is released under the CC-BY-SA 4.0 license. The following data description was copied directly from the page.

<i>The data was extracted via the Wikimedia API using the associated code. It is formatted as a CSV and saved as page_data.csv in the "data" directory. Columns are:

1. "country", containing the sanitised country name, extracted from the category name;
2. "page", containing the unsanitised page title.
3. "last_edit", containing the edit ID of the last edit to the page.

Country codes are inconsistent. Where possible, they have been modified to match the country names found in http://www.prb.org/DataFinder/Topic/Rankings.aspx?ind=14 - but the PRB dataset contains nations not found in Wikipedia, and vice versa.

The actual recursion only went 2 levels deep into the category tree: someone listed as an Antiguan politician, say, is included - someone exclusively listed as an Antiguan politician who was assassinated is not.</i>

The world population data was retrieved from https://www.prb.org/international/indicator/population/table/, using the population labeled as of mid-2018. The columns are:
* Geography: The geographical area for the entry. Entries in all capitals (e.g. ASIA) represent aggregated regions
* Population mid-2018 (millions): The recorded population (in millions) of the associated geographical area as of mid-2018
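
As a rough illustration of how these two columns can be handled, the sketch below builds a toy table with the same column names. The names and populations are copied from tables shown later in this notebook; storing the populations as strings with thousands separators is an assumption, as are the variable names `wpds_toy` and `is_region`.

```python
import pandas as pd

# Toy rows shaped like the PRB table (string populations are an assumption).
wpds_toy = pd.DataFrame({
    "Geography": ["AFRICA", "Algeria", "Egypt", "NORTHERN AMERICA", "Canada"],
    "Population mid-2018 (millions)": ["1,284", "42.7", "97.0", "365", "37.2"],
})

# Aggregated regions are the entries written entirely in capitals.
is_region = wpds_toy["Geography"].str.isupper()

# Strip thousands separators before converting the population to a number.
wpds_toy["population"] = (
    wpds_toy["Population mid-2018 (millions)"].str.replace(",", "").astype(float)
)
print(wpds_toy[is_region])
```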

<i>Note: There is an issue with attempting to use the Python ORES API on Windows 10. The pyenchant dependency, which in turn depends on the enchant C library, is an abandoned project as of 2018 and will likely not be fixed. The pickle file provided to circumvent this issue is courtesy of Alexander Van Roijen.</i>

In [730]:
import pandas as pd
from pandas import DataFrame as DF
import numpy as np
import pickle as pk

from IPython.display import clear_output
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# pickle file provided courtesy of Alexander Van Roijen
qualities = pk.load(open('allqualities.pk', 'rb'))

First, we will address the page data. The "article_quality" column will be added here.
Also, some entries in the page data have "Template" prepended to their page name. These are not articles and will be removed from consideration.

In [408]:
pg_data = pd.read_csv('page_data.csv')
# placeholder column, filled in later with the ORES predictions
pg_data['article_quality'] = None

# removing rows containing Template pages
pg_data = pg_data[~pg_data['page'].str.contains('Template:', na=False)].copy()
pg_data.head()

Unnamed: 0,page,country,rev_id,article_quality
1,Bir I of Kanem,Chad,355319463,
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,
12,Yos Por,Cambodia,393822005,
23,Julius Gregr,Czech Republic,395521877,
24,Edvard Gregr,Czech Republic,395526568,


The following code block will populate the ORES prediction scores. This process will take a long time. It is recommended to use the wp_wpds_politicians_by_country.csv file instead for the sake of time.

<i>Note: The lack of a loop condition is intentional</i>

In [496]:
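# Add the prediction scores to the dataframe. This will take a long time.
# The loop ends with an IndexError once i runs past the end of the data.
x = 0
while True:
    clear_output()
    y = x + 50

    print(x)
    for i in range(x, y):
        if 'score' in qualities[i]['articlequality']:
            # .loc avoids chained assignment, which may silently write to a copy
            pg_data.loc[pg_data.index[i], 'article_quality'] = qualities[i]['articlequality']['score']['prediction']
    # advance only after the batch is processed so the first 50 entries are not skipped
    x = y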

46700


IndexError: list index out of range

In [498]:
# Intermediate save
pg_data.to_csv('wp_wpds_politicians_by_country.csv', index=False)
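
The open-ended loop above stops only when it indexes past the end of `qualities`. A bounded alternative is sketched below: it walks the list once and records `None` wherever no score is present. The helper name `extract_predictions` and the toy entries (including the contents of the `error` entry) are hypothetical; only the `articlequality` / `score` / `prediction` nesting is taken from the code above.

```python
def extract_predictions(qualities):
    """Return one predicted class per entry, or None where no score is present."""
    predictions = []
    for entry in qualities:
        result = entry["articlequality"]
        if "score" in result:
            predictions.append(result["score"]["prediction"])
        else:
            predictions.append(None)
    return predictions

# Invented entries shaped like the ones read from the pickle file.
toy_qualities = [
    {"articlequality": {"score": {"prediction": "Stub"}}},
    {"articlequality": {"error": {"message": "revision not found"}}},
    {"articlequality": {"score": {"prediction": "GA"}}},
]
print(extract_predictions(toy_qualities))
```

The resulting list could then be assigned to the `article_quality` column in a single step, provided its length and order match the rows of the page data.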

Next, we will handle the world population data.
Here, we'll add each country's geographical region as a separate column for later use. We will also remove the aggregated region rows from the Geography column and store them in their own table.

In [1112]:
WPDS = pd.read_csv('WPDS_2018_data.csv')
# Each region header row (all capitals) precedes the countries that belong to it.
# .loc slices are label-based and inclusive, so 0:55 covers rows 0 through 55.
WPDS['Region'] = None
WPDS.loc[0:55, 'Region'] = WPDS.loc[0, 'Geography'] # AFRICA
WPDS.loc[56:58, 'Region'] = WPDS.loc[56, 'Geography'] # NORTHERN AMERICA
WPDS.loc[59:94, 'Region'] = WPDS.loc[59, 'Geography'] # LATIN AMERICA AND THE CARIBBEAN
WPDS.loc[95:143, 'Region'] = WPDS.loc[95, 'Geography'] # ASIA
WPDS.loc[144:188, 'Region'] = WPDS.loc[144, 'Geography'] # EUROPE
WPDS.loc[189:, 'Region'] = WPDS.loc[189, 'Geography'] # OCEANIA

REGION_NAMES = ['AFRICA', 'NORTHERN AMERICA', 'LATIN AMERICA AND THE CARIBBEAN',
                'ASIA', 'EUROPE', 'OCEANIA']
is_region = WPDS.Geography.isin(REGION_NAMES)

# Region rows go into their own table
WPDS_R = WPDS[is_region].rename(columns={'Population mid-2018 (millions)' : 'population'})

# Country rows stay in WPDS
WPDS = WPDS[~is_region].rename(columns={'Population mid-2018 (millions)' : 'population'})

WPDS.head()
WPDS_R.head()

Unnamed: 0,Geography,population,Region
1,Algeria,42.7,AFRICA
2,Egypt,97.0,AFRICA
3,Libya,6.5,AFRICA
4,Morocco,35.2,AFRICA
5,Sudan,41.7,AFRICA


Unnamed: 0,Geography,population,Region
0,AFRICA,1284,AFRICA
56,NORTHERN AMERICA,365,NORTHERN AMERICA
59,LATIN AMERICA AND THE CARIBBEAN,649,LATIN AMERICA AND THE CARIBBEAN
95,ASIA,4536,ASIA
144,EUROPE,746,EUROPE


Now, we will merge the two data frames, linking them across the "country" and "Geography" columns. We will perform an outer join so that we can pull out the countries without matches: some countries exist in the article data without a matching geography, and vice versa.

The list of countries without matches will be condensed into a single DataFrame and saved as wp_wpds_countries-no_match.csv.

The DataFrame housing the completed entries will be saved as wp_wpds_politicians_by_country.csv.

In [1113]:
# merge
article_data = pg_data.merge(WPDS, how='outer', indicator=True, left_on='country', right_on='Geography')
article_data.head()

Unnamed: 0,page,country,rev_id,article_quality,Geography,population,Region,_merge
0,Bir I of Kanem,Chad,355319463.0,Stub,Chad,15.4,AFRICA,both
1,Abdullah II of Kanem,Chad,498683267.0,Stub,Chad,15.4,AFRICA,both
2,Salmama II of Kanem,Chad,565745353.0,Stub,Chad,15.4,AFRICA,both
3,Kuri I of Kanem,Chad,565745365.0,Stub,Chad,15.4,AFRICA,both
4,Mohammed I of Kanem,Chad,565745375.0,Stub,Chad,15.4,AFRICA,both


In [1114]:
# Countries that occupied one dataset, but not the other.
# sort_values places missing values last; the slice keeps the first 60 sorted entries.
unmatched = article_data[article_data._merge != 'both']
no_match = DF(
    np.append(unmatched['country'].unique(), unmatched['Geography'].unique()),
    columns=['country']
).sort_values('country')[0:60]
no_match.to_csv('wp_wpds_countries-no_match.csv', index=False)
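
The outer join with `indicator=True` is what makes the unmatched names recoverable. The toy example below shows the idea on two tiny frames; "Atlantis" and "Narnia" are invented names standing in for entries that exist in only one dataset.

```python
import pandas as pd

pages = pd.DataFrame({"page": ["A", "B", "C"],
                      "country": ["Chad", "Chad", "Atlantis"]})    # "Atlantis" is invented
populations = pd.DataFrame({"Geography": ["Chad", "Narnia"],        # "Narnia" is invented
                            "population": [15.4, 1.0]})

merged = pages.merge(populations, how="outer", indicator=True,
                     left_on="country", right_on="Geography")

# The _merge column records which side each row came from.
only_pages = merged.loc[merged["_merge"] == "left_only", "country"].unique()
only_population = merged.loc[merged["_merge"] == "right_only", "Geography"].unique()
no_match_toy = sorted(set(only_pages) | set(only_population))
print(no_match_toy)
```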

In [1115]:
# Keep only the rows that matched in both datasets
df_articles = article_data[article_data._merge == 'both']
df_articles = df_articles.drop(['_merge'], axis=1).rename(columns={'page' : 'article_name'})
df_articles = df_articles[['Region', 'country', 'article_name', 'rev_id', 'article_quality', 'population']].copy()
# Populations were read as text; strip thousands separators before converting
df_articles['population'] = df_articles['population'].str.replace(',', '').astype(float)

df_articles.to_csv('wp_wpds_politicians_by_country.csv', index=False)

df_articles = df_articles.drop(['rev_id'], axis=1)
df_articles.head()

Unnamed: 0,Region,country,article_name,article_quality,population
0,AFRICA,Chad,Bir I of Kanem,Stub,15.4
1,AFRICA,Chad,Abdullah II of Kanem,Stub,15.4
2,AFRICA,Chad,Salmama II of Kanem,Stub,15.4
3,AFRICA,Chad,Kuri I of Kanem,Stub,15.4
4,AFRICA,Chad,Mohammed I of Kanem,Stub,15.4


Next, we create a DataFrame containing only high-quality articles, meaning those with a predicted quality of 'FA' or 'GA'.

In [1096]:
df_articlesHQ = df_articles[(df_articles.article_quality == 'FA') | (df_articles.article_quality == 'GA')]
df_articlesHQ.head()

Unnamed: 0,Region,country,article_name,article_quality,population
61,AFRICA,Chad,Mahamat Nouri,FA,15.4
83,AFRICA,Chad,Hissène Habré,GA,15.4
441,ASIA,Cambodia,Norodom Chakrapong,GA,16.0
461,ASIA,Cambodia,Norodom Sihanouk,FA,16.0
482,ASIA,Cambodia,Nuon Chea,GA,16.0


For the next phase, we will focus on the analysis. This will require manipulation of our data structures.

First, we will calculate the count of articles by country and region.

In [1199]:
# Gather the count of articles by country, in order of first appearance
art_Counts = df_articles.groupby('country', sort=False).size().rename('article_count').reset_index()
art_CountsHQ = df_articlesHQ.groupby('country', sort=False).size().rename('article_count').reset_index()

# Append article counts with population and region data
temp_df = df_articles.drop(['article_name', 'article_quality'], axis=1)
df_art = art_Counts.set_index('country').join(temp_df.groupby('country').min(), how='left').reset_index()

# Sum the article counts per region and attach them to the region populations
temp_df = df_art.groupby('Region')[['article_count']].sum()
df_art_R = WPDS_R.set_index('Geography').join(temp_df, how='left').reset_index()\
    [['Geography', 'article_count', 'population']]
df_art_R['population'] = df_art_R['population'].str.replace(',', '').astype(float)

df_art = df_art[['country', 'article_count', 'population']]

# Repeat the same steps for the high quality articles
temp_df = df_articlesHQ.drop(['article_name', 'article_quality'], axis=1)
df_artHQ = art_CountsHQ.set_index('country').join(temp_df.groupby('country').min(), how='left').reset_index()

temp_df = df_artHQ.groupby('Region')[['article_count']].sum()
df_artHQ_R = WPDS_R.set_index('Geography').join(temp_df, how='left').reset_index()\
    [['Geography', 'article_count', 'population']]
df_artHQ_R['population'] = df_artHQ_R['population'].str.replace(',', '').astype(float)

df_artHQ = df_artHQ[['country', 'article_count', 'population']]

df_art.head()
df_art_R.head()
df_artHQ.head()
df_artHQ_R.head()

Unnamed: 0,country,article_count,population
0,Chad,97,15.4
1,Cambodia,213,16.0
2,Canada,848,37.2
3,Egypt,237,97.0
4,Pakistan,1040,200.6


Unnamed: 0,Geography,article_count,population
0,AFRICA,6861,1284.0
1,NORTHERN AMERICA,1940,365.0
2,LATIN AMERICA AND THE CARIBBEAN,5174,649.0
3,ASIA,11588,4536.0
4,EUROPE,15923,746.0


Unnamed: 0,country,article_count,population
0,Chad,2,15.4
1,Cambodia,4,16.0
2,Canada,22,37.2
3,Egypt,9,97.0
4,Pakistan,19,200.6


Unnamed: 0,Geography,article_count,population
0,AFRICA,125,1284.0
1,NORTHERN AMERICA,99,365.0
2,LATIN AMERICA AND THE CARIBBEAN,69,649.0
3,ASIA,310,4536.0
4,EUROPE,322,746.0


Now, we will add a column to each DataFrame showing the number of politician articles per million people (article count divided by population in millions).

In [1193]:
# Article proportions by country
df_art['article_proportion'] = (df_art.article_count/df_art.population)
# Article proportions by region
df_art_R['article_proportion'] = (df_art_R.article_count/df_art_R.population)
# High quality article proportions by country
df_artHQ['article_proportion'] = (df_artHQ.article_count/df_artHQ.population)
# High quality article proportions by region
df_artHQ_R['article_proportion'] = (df_artHQ_R.article_count/df_artHQ_R.population)
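
As a quick check of what the `article_proportion` column means, the sketch below recomputes it for two countries. The counts and populations are copied from the country tables in this notebook; the frame name `toy` is illustrative.

```python
import pandas as pd

# Counts and populations copied from the tables in this notebook.
toy = pd.DataFrame({
    "country": ["Tuvalu", "India"],
    "article_count": [54, 985],
    "population": [0.01, 1371.3],   # millions
})

# Articles per million people
toy["article_proportion"] = toy["article_count"] / toy["population"]
print(toy)
```

Tuvalu's 54 articles over 0.01 million people come to roughly 5400 articles per million, while India's 985 articles over 1371.3 million people come to roughly 0.72.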

The following view shows the top 10 countries by proportional coverage.

In [1176]:
df_art.sort_values('article_proportion', ascending=False).head(10)

Unnamed: 0,country,article_count,population,article_proportion
99,Tuvalu,54,0.01,5400.0
149,Nauru,52,0.01,5200.0
42,San Marino,81,0.03,2700.0
65,Monaco,40,0.04,1000.0
98,Liechtenstein,28,0.04,700.0
87,Tonga,63,0.1,630.0
105,Marshall Islands,37,0.06,616.666667
68,Iceland,202,0.4,505.0
166,Andorra,34,0.08,425.0
78,Grenada,36,0.1,360.0


The following view shows the bottom 10 countries by proportional coverage.

In [1177]:
df_art.sort_values('article_proportion', ascending=True).head(10)

Unnamed: 0,country,article_count,population,article_proportion
6,India,985,1371.3,0.718297
60,Indonesia,211,265.2,0.795626
22,China,1133,1393.8,0.812886
150,Uzbekistan,28,32.9,0.851064
107,Ethiopia,101,107.5,0.939535
163,"Korea, North",36,25.6,1.40625
178,Zambia,25,17.7,1.412429
126,Thailand,112,66.2,1.691843
125,Mozambique,58,30.5,1.901639
116,Bangladesh,321,166.4,1.929087


The following view shows the top 10 countries by high-quality articles relative to population.

In [1179]:
df_artHQ.sort_values('article_proportion', ascending=False).head(10)

Unnamed: 0,country,article_count,population,article_proportion
85,Tuvalu,5,0.01,500.0
140,Dominica,1,0.07,14.285714
69,Grenada,1,0.1,10.0
14,Vanuatu,3,0.3,10.0
60,Iceland,2,0.4,5.0
31,Ireland,21,4.9,4.285714
106,Bhutan,3,0.8,3.75
96,Maldives,1,0.4,2.5
51,New Zealand,12,4.9,2.44898
110,Israel,20,8.5,2.352941


The following view shows the bottom 10 countries by high-quality articles relative to population.

In [1181]:
df_artHQ.sort_values('article_proportion', ascending=True).head(10)

Unnamed: 0,country,article_count,population,article_proportion
34,Nigeria,2,195.9,0.010209
6,India,17,1371.3,0.012397
18,Tanzania,1,59.1,0.01692
100,Bangladesh,3,166.4,0.018029
91,Ethiopia,2,107.5,0.018605
47,Colombia,1,49.8,0.02008
35,Brazil,6,209.4,0.028653
21,China,41,1393.8,0.029416
75,Peru,1,32.2,0.031056
72,Nepal,1,29.7,0.03367


The following view ranks the regions by total coverage relative to the total population of the region.

In [1195]:
df_art_R.sort_values('article_proportion', ascending=False)

Unnamed: 0,Geography,article_count,population,article_proportion
5,OCEANIA,3132,41.0,76.390244
4,EUROPE,15923,746.0,21.344504
2,LATIN AMERICA AND THE CARIBBEAN,5174,649.0,7.972265
0,AFRICA,6861,1284.0,5.343458
1,NORTHERN AMERICA,1940,365.0,5.315068
3,ASIA,11588,4536.0,2.554674


The following view ranks the regions by high-quality articles relative to the total population of the region.

In [1196]:
df_artHQ_R.sort_values('article_proportion', ascending=False)

Unnamed: 0,Geography,article_count,population,article_proportion
5,OCEANIA,66,41.0,1.609756
4,EUROPE,322,746.0,0.431635
1,NORTHERN AMERICA,99,365.0,0.271233
2,LATIN AMERICA AND THE CARIBBEAN,69,649.0,0.106317
0,AFRICA,125,1284.0,0.097352
3,ASIA,310,4536.0,0.068342


This project served primarily to remind me how difficult it can be to get data structures to cooperate. A lot of time was lost trying to get columns aligned or aggregated correctly, and structuring the data with code that was not intuitive.
Not enough analysis was performed to discover any stories regarding bias, but there does appear to be a trend among the countries with low proportional coverage: they reside primarily in the Eastern Hemisphere, and the lowest proportions are within Asia.

Meanwhile, the highest proportions are within Oceania and Europe. However, the proportions for Tuvalu through Monaco are so much higher than the rest that the data could easily be misinterpreted. It is quite possible that dividing article counts by population favors countries with smaller populations.
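
One way to probe this sensitivity is to rank the countries again after excluding very small populations. The sketch below does so on a handful of rows copied from the country tables above; the 1 million cut-off (`MIN_POPULATION`) is an arbitrary value chosen only for illustration, not part of the original analysis.

```python
import pandas as pd

# Rows copied from the country tables above.
toy = pd.DataFrame({
    "country": ["Tuvalu", "Nauru", "Iceland", "China", "India"],
    "article_count": [54, 52, 202, 1133, 985],
    "population": [0.01, 0.01, 0.4, 1393.8, 1371.3],   # millions
})
toy["article_proportion"] = toy["article_count"] / toy["population"]

MIN_POPULATION = 1.0   # millions; arbitrary cut-off chosen only for this sketch
filtered = toy[toy["population"] >= MIN_POPULATION]

# Ranking with and without the very small countries
ranked_all = toy.sort_values("article_proportion", ascending=False)["country"].tolist()
ranked_filtered = filtered.sort_values("article_proportion", ascending=False)["country"].tolist()
print(ranked_all)
print(ranked_filtered)
```

With the small countries removed, the remaining proportions sit within the same order of magnitude, which makes differences between them easier to compare.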

Countries whose population or article count deviates greatly from the mean could also skew the results. This isn't to say the data for this analysis is without use. Potentially, it could be used to quantify differences in media coverage for certain regions if paired with another dataset. Without other sources of information, however, the data lacks the context required for further interpretation. There is also the question of who the source of the article content is. It is highly uncommon for governments to speak negatively of themselves. If they are the source of much of the content on, say, incumbent politicians, then there is a bias to account for. So using this data for interpretations beyond Wikipedia, without additional context, is likely a flawed approach.

### Update after extension

At the country level, countries with smaller populations have significantly higher proportional counts than their more populous counterparts. For instance, China has the highest raw article count (1,133) of any country in the top 10 and bottom 10 tables of articles by population, but its much larger population ranks it near the bottom. China also has a higher number of high-quality articles (41), but not by the same magnitude as its total article count.

A similar finding holds for the regional proportions. Asia contains China and India, two of the largest populations on the planet. Meanwhile, Northern America contains only two countries altogether. Without this context, the results are easy to misinterpret. Also, the metrics by which ORES rates articles may favor a particular writing style that differs across cultures, which could reflect bias in the measuring system. Styles of government and inclinations to contribute to Wikipedia content also differ between countries. For instance, Nepal and Dominica are quite far from each other in the rankings, but each has a single article rated as high quality. The difference is that Nepal has a much larger population.
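
The Nepal and Dominica comparison can be worked through directly. The figures below are copied from the high-quality article tables above; the variable names are illustrative.

```python
# High-quality article counts and populations (millions) from the tables above.
nepal_hq, nepal_pop = 1, 29.7
dominica_hq, dominica_pop = 1, 0.07

# High-quality articles per million people
nepal_rate = nepal_hq / nepal_pop
dominica_rate = dominica_hq / dominica_pop

print(round(nepal_rate, 5), round(dominica_rate, 5))
# How many times larger Dominica's rate is, despite the identical article count
print(round(dominica_rate / nepal_rate))
```

Both countries have exactly one high-quality article, so the entire gap in the ranking comes from the population in the denominator.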