## Step 1: Data Acquisition

The data used below can be found at https://figshare.com/articles/dataset/Untitled_Item/5513449 and https://docs.google.com/spreadsheets/d/1CFJO2zna2No5KqNm9rPK5PCACoXKzb-nycJFhV689Iw/edit?usp=sharing.


In [1]:
import pandas as pd
import json
from pandas import json_normalize
import numpy as np
import requests

In [2]:
# reading in the csv files found in the raw-data folder
page_data = pd.read_csv("raw-data/page_data.csv")
population_data = pd.read_csv("raw-data/WPDS_2020_data - WPDS_2020_data.csv")

In [3]:
# previewing page data file
page_data.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [4]:
# previewing population data file
population_data.head()

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000


## Step 2: Data Cleaning
Next I cleaned the data. For page_data I removed pages whose names start with "Template:", as these are not Wikipedia articles. I then mapped each country to its corresponding region, assuming that each country is listed beneath its region in the population_data file. I then separated the countries from the regions so they could be joined with the page_data file later on.

In [5]:
# removing pages that start with "Template:"
page_data = page_data[page_data["page"].str.startswith("Template:")==False]

In [6]:
# filtering out regions by assuming all countries have type equal to country
population_country_lvl = population_data[population_data['Type'] == "Country"]

In [7]:
# outputting cleaned data files into csvs in the clean-data folder
page_data.to_csv("clean-data/page_data.csv", index=False)
population_country_lvl.to_csv("clean-data/population_country_lvl.csv", index=False)

In [8]:
# creating new list for country and region
country_region_map = []
# mapping all countries to the region that is above them in the population data file.
for i in population_data.index[population_data["Type"]=="Country"].tolist():
    j = i
    while population_data.iloc[j]["Type"] == "Country":
        j = j - 1
    country_region_map.append((population_data.iloc[j]["Name"], population_data.iloc[i]['Name']))
# converting list into data frame with columns Region and Country
country_region_map = pd.DataFrame(data = country_region_map, columns=['Region', 'Country'])
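
The loop above walks upward from each country row until it reaches the nearest row that is not a country. As a rough sketch of the same idea, the mapping can also be expressed with a forward fill. The toy table below is hypothetical and only mimics the `Name` and `Type` columns. Note that, like the loop, this treats any non-country row as a region label.

```python
import pandas as pd

# Hypothetical miniature of the population table (illustrative rows only).
toy = pd.DataFrame({
    "Name": ["AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt", "WESTERN AFRICA", "Benin"],
    "Type": ["Sub-Region", "Sub-Region", "Country", "Country", "Sub-Region", "Country"],
})

# Keep the name only on non-country rows, then carry it down to the country rows below.
region_label = toy["Name"].where(toy["Type"] != "Country").ffill()

# Attach the label and keep only the country rows.
toy_map = (
    toy.assign(Region=region_label)
       .loc[toy["Type"] == "Country", ["Region", "Name"]]
       .rename(columns={"Name": "Country"})
       .reset_index(drop=True)
)
print(toy_map)
```

Because the label comes from the closest non-country row above, it is worth checking the resulting region names against the expected list of regions before using them in the analysis.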


In [9]:
# retaining region information in separate file for use in analysis section
# outputting to csv in clean-data folder
country_region_map.to_csv("clean-data/country_region_map.csv")

## Step 3: Getting Article Quality Predictions
Next I retrieved the predicted quality score for each Wikipedia article using the ORES API, a machine learning service that predicts the quality of a given Wikipedia article revision. The documentation for the ORES API can be found here: https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model. For my purposes I used the value of the prediction field as the quality score.

In [10]:
# API endpoint for enwiki and articlequality model
endpoint = 'https://ores.wikimedia.org/v3/scores/enwiki?models=articlequality&revids={rev_id}'

In [11]:
# My github and email contact information
headers = {
    'User-Agent': 'https://github.com/geiercc',
    'From': 'geiercc@uw.edu'
}

In [12]:
# api_call takes in an endpoint and two ints, i and n, which mark the start (inclusive)
# and end (exclusive) positions of a batch GET call. api_call returns the parsed JSON response
def api_call(endpoint, i, n):
    # calling multiple rev_ids at once in a batch
    rev_id =  "|".join(str(rev_id) for rev_id in page_data.rev_id.iloc[i:n])
    call = requests.get(endpoint.format(rev_id = rev_id), headers=headers)
    response = call.json() 
    return response

In [13]:
# creating separate lists for articles that have a score associated with them and those that don't
no_score = [] 
ores_data = []

# api can only GET 50 responses at a time so I step through all the rev_ids in page_data in batches of 50
for i in range(0, len(page_data["rev_id"]), 50):
    # end position of the current batch
    n = min(i + 50, len(page_data["rev_id"]))
    response = api_call(endpoint, i, n)
    # storing the rev_id and prediction when a score exists, otherwise storing the error message
    for rev, result in response["enwiki"]["scores"].items():
        scores = result["articlequality"]
        if scores.get("score") is None:
            no_score.append(scores["error"]["message"])
        else:
            ores_data.append([rev, scores["score"]["prediction"]])
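
To make the response handling easier to follow, here is a minimal offline sketch of the same parsing step. The `split_scores` helper and the `sample` dictionary are hypothetical: the sample is hand-written to mirror only the fields the notebook reads and is not real API output. Unlike the cell above, this version keeps the rev_id alongside each error message.

```python
def split_scores(response, wiki="enwiki", model="articlequality"):
    """Separate revisions that have a prediction from those that returned an error."""
    scored, errors = [], []
    for rev_id, models in response[wiki]["scores"].items():
        result = models[model]
        if "score" in result:
            scored.append((rev_id, result["score"]["prediction"]))
        else:
            errors.append((rev_id, result["error"]["message"]))
    return scored, errors


# Hand-written sample shaped like the fields used above (assumed structure, not real output).
sample = {
    "enwiki": {
        "scores": {
            "111": {"articlequality": {"score": {"prediction": "Stub"}}},
            "222": {"articlequality": {"error": {"message": "revision not found"}}},
        }
    }
}

scored, errors = split_scores(sample)
print(scored)
print(errors)
```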


In [14]:
# outputting the rev_ids with scores and the error messages for rev_ids with no scores to separate csv files in the clean-data folder
ores_data = pd.DataFrame(ores_data, columns=['revision_id', 'article_quality_est.'])
ores_data.to_csv("clean-data/ores_data.csv")

no_score = pd.DataFrame(data = no_score, columns = ['Error Message'])
no_score.to_csv("clean-data/no_quality_score.csv")

## Step 4: Combining the Datasets

Next I merged the ORES data for each article with the Wikipedia page data and the population data.
Rows without matching data are written to a CSV file called wp_wpds_countries-no_match.csv, while the remaining rows are written to a CSV file called wp_wpds_politicians_by_country.csv.

In [15]:
# merging page_data with population data in an inner join to get rid of rows that don't match up
wp_wpds_politicians = pd.merge(page_data, population_country_lvl, left_on = "country", right_on = "Name", how="inner")

In [16]:
# finding rows that don't match through the indicator and keeping those that are in either right only or left only but not both
wp_wpds_politicians_no_match = pd.merge(page_data, population_country_lvl, left_on = "country", right_on = "Name", how='outer', indicator=True)
wp_wpds_politicians_no_match = wp_wpds_politicians_no_match[wp_wpds_politicians_no_match["_merge"].isin(["right_only", "left_only"])]
# outputting these rows to csv in the clean-data folder
wp_wpds_politicians_no_match.to_csv("clean-data/wp_wpds_countries-no_match.csv", index=False)
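
As an illustration of how the `indicator` column separates matched from unmatched rows, here is a small sketch on made-up data. The page names, the country "Atlantis", and the population values are all hypothetical. A single outer merge is enough to produce both subsets.

```python
import pandas as pd

# Hypothetical toy inputs (illustrative values only).
pages = pd.DataFrame({"page": ["A", "B", "C"], "country": ["Chad", "Zambia", "Atlantis"]})
pops = pd.DataFrame({"Name": ["Chad", "Zambia", "Monaco"], "Population": [1, 2, 3]})

# The outer merge keeps every row from both sides; _merge records where each row came from.
merged = pd.merge(pages, pops, left_on="country", right_on="Name", how="outer", indicator=True)

# Rows found on both sides correspond to the inner-join result.
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
# Rows found on only one side are the no-match rows.
unmatched = merged[merged["_merge"] != "both"]

print(matched)
print(unmatched)
```

Here "Atlantis" appears as `left_only` (a page with no population row) and "Monaco" as `right_only` (a population row with no pages).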

In [17]:
# merging ores_data with the merged page data and population data, making sure the rev_id is the same type
# using an inner join so all articles have a score
ores_data = ores_data.astype({"revision_id": np.int64})
wp_wpds_politicians = pd.merge(wp_wpds_politicians, ores_data, left_on = "rev_id", right_on = "revision_id", how="inner")

In [18]:
# keeping relevant columns - country, page, revision_id, article_quality_est., and Population
wp_wpds_politicians_by_country = wp_wpds_politicians[["country", "page", "revision_id", "article_quality_est.", "Population"]]
# renaming page column to article_name for clarity
wp_wpds_politicians_by_country = wp_wpds_politicians_by_country.rename(columns={"page": "article_name"})

In [19]:
# outputting this final merged file to a csv in results called wp_wpds_politicians_by_country.csv
wp_wpds_politicians_by_country.to_csv("results/wp_wpds_politicians_by_country.csv", index = False)

## Step 5: Analysis

This step calculates two percentages for each country and geographic region: the number of politician articles relative to population, and the number of high-quality articles relative to all politician articles. High-quality is defined as an article with either a Featured Article (FA) or Good Article (GA) rating.
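
As a quick worked example of the two percentages, the sketch below recomputes two values using counts taken from the results tables in Step 6 (Tuvalu: 54 articles, population 10,000; "Korea, North": 8 high-quality articles out of 36). The `pct` helper is a hypothetical name used only for this illustration.

```python
def pct(numerator, denominator):
    """Return numerator / denominator expressed as a percentage."""
    return numerator / denominator * 100.0


# Articles per population for Tuvalu, using the counts reported in Step 6.
tuvalu_coverage = pct(54, 10_000)
# Share of high-quality articles for "Korea, North", using the counts reported in Step 6.
nk_quality = pct(8, 36)

print(round(tuvalu_coverage, 6))
print(round(nk_quality, 6))
```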

In [20]:
# calculating the total article count for each country
article_count = wp_wpds_politicians_by_country.groupby('country').count()['article_name']

In [21]:
# calculating the number of quality articles (FA or GA) for each country
quality_count = wp_wpds_politicians_by_country[wp_wpds_politicians_by_country['article_quality_est.'].isin(["FA", "GA"])].groupby('country').count()['article_name']

In [22]:
# converting both article and quality counts to data frames
article_count = article_count.to_frame()
quality_count = quality_count.to_frame()
# renaming column to article_count
article_count = article_count.rename(columns={"article_name": "country_article_count"})
# renaming column to quality_count
quality_count = quality_count.rename(columns={"article_name": "quality_count"})

In [23]:
# adding article_count column to cleaned final dataframe
wp_wpds_politicians_analysis = pd.merge(wp_wpds_politicians_by_country, article_count, on = "country")

In [24]:
# selecting relevant columns country, Population, and country_article_count
wp_wpds_politicians_analysis = wp_wpds_politicians_analysis[["country", "Population", "country_article_count"]]

In [25]:
# calculating articles per population percentage from country_article_count and Population
wp_wpds_politicians_analysis["articles-per-pop"] = (wp_wpds_politicians_analysis["country_article_count"] / wp_wpds_politicians_analysis["Population"]) * 100.0

In [26]:
# dropping duplicate country rows
country_coverage = wp_wpds_politicians_analysis.drop_duplicates()
# getting descending sort 
country_coverage_desc = country_coverage.sort_values(by = "articles-per-pop", ascending=False)
# getting ascending sort
country_coverage_asc = country_coverage.sort_values(by = "articles-per-pop", ascending=True)

In [27]:
# getting top 10 and bottom ten countries by coverage
top_10_countries_by_coverage = country_coverage_desc.head(10)
bottom_10_countries_by_coverage = country_coverage_asc.head(10)

In [28]:
# merging new dataframe with article count with the quality counts, 
# using a left join to account for countries with no quality articles
wp_wpds_politicians_analysis_with_quality = pd.merge(wp_wpds_politicians_analysis, quality_count, on = "country", how = "left")
# replacing NaN values with 0 for countries with no quality articles
wp_wpds_politicians_analysis_with_quality = wp_wpds_politicians_analysis_with_quality.replace(np.nan, 0)

In [29]:
# calculating the quality article percentage
wp_wpds_politicians_analysis_with_quality["quality_articles_pct"] = (wp_wpds_politicians_analysis_with_quality["quality_count"] / wp_wpds_politicians_analysis_with_quality["country_article_count"]) * 100.0

In [30]:
# selecting the relevant columns of country, Population, country_article_count, quality_count, and quality_articles_pct
wp_wpds_politicians_analysis_for_quality = wp_wpds_politicians_analysis_with_quality[["country", "Population", "country_article_count", "quality_count", "quality_articles_pct"]]

In [31]:
# getting ascending and descending sort for quality article percentage
country_level_quality_asc = wp_wpds_politicians_analysis_for_quality.sort_values(by = "quality_articles_pct", ascending=True).drop_duplicates()
country_level_quality_desc = wp_wpds_politicians_analysis_for_quality.sort_values(by = "quality_articles_pct", ascending=False).drop_duplicates()

In [32]:
# getting top and bottom 10 countries by quality article percentage
bottom_10_countries_by_quality = country_level_quality_asc.head(10)
top_10_countries_by_quality = country_level_quality_desc.head(10)

In [33]:
# dropping all duplicate country rows for region level analysis
region_level_analysis = wp_wpds_politicians_analysis_with_quality.drop_duplicates()
# merging this with the mapping from earlier of a country to its region
region_level_analysis = pd.merge(region_level_analysis, country_region_map, left_on="country", right_on="Country")

In [34]:
# grouping by region to get the sum of all the counts for each region instead of each country
# numeric_only=True keeps the text columns (country names) out of the sum
region_level_analysis = region_level_analysis.groupby('Region').sum(numeric_only=True)
# calculating the regional level article and quality percentages
region_level_analysis["regional_article_pct"] = (region_level_analysis["country_article_count"] / region_level_analysis["Population"]) * 100.0
region_level_analysis["regional_quality_pct"] = (region_level_analysis["quality_count"] / region_level_analysis["country_article_count"]) * 100.0
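
The regional percentages above are computed from summed counts rather than by averaging the country-level percentages. The sketch below, on hypothetical numbers for a made-up region "X", shows how the two approaches can differ when countries in a region have very different populations.

```python
import pandas as pd

# Hypothetical region with one small and one large country (illustrative values only).
toy = pd.DataFrame({
    "Region": ["X", "X"],
    "Population": [1_000, 9_000],
    "country_article_count": [10, 10],
})

# Pooled approach (as in the cell above): sum the counts first, then divide.
pooled = toy.groupby("Region")[["Population", "country_article_count"]].sum()
pooled["regional_article_pct"] = pooled["country_article_count"] / pooled["Population"] * 100.0

# Alternative: average the per-country percentages, which weights each country equally.
country_pct = toy["country_article_count"] / toy["Population"] * 100.0
mean_of_pcts = country_pct.groupby(toy["Region"]).mean()

print(pooled["regional_article_pct"])
print(mean_of_pcts)
```

The pooled figure weights each country by its population, which matches the definition of articles as a proportion of total regional population.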

In [35]:
# renaming country_article_count column to signify it is on a regional level
region_level_analysis = region_level_analysis.rename(columns={"country_article_count": "regional_article_count"})
# selecting relevant columns
region_level_analysis = region_level_analysis[["Population", "regional_article_count", "quality_count", "regional_article_pct", "regional_quality_pct"]]
# getting descending sorted rows for article coverage percentage and quality coverage percentage
region_level_analysis_article_count = region_level_analysis.sort_values(by = "regional_article_pct", ascending=False)
region_level_analysis_quality_count = region_level_analysis.sort_values(by = "regional_quality_pct", ascending=False)

## Step 6: Results
The six tables below show the results of the analysis.

### Top 10 Countries By Coverage
Below are the 10 highest-ranked countries in terms of number of politician articles as a proportion of country population.

In [36]:
top_10_countries_by_coverage

Unnamed: 0,country,Population,country_article_count,articles-per-pop
34683,Tuvalu,10000,54,0.54
42891,Nauru,11000,52,0.472727
18981,San Marino,34000,81,0.238235
27090,Monaco,38000,40,0.105263
34655,Liechtenstein,39000,28,0.071795
35093,Marshall Islands,57000,37,0.064912
32935,Tonga,99000,63,0.063636
27522,Iceland,368000,201,0.05462
43985,Andorra,82000,34,0.041463
44322,Federated States of Micronesia,106000,36,0.033962


### Bottom 10 Countries By Coverage
Below are the 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population.


In [37]:
bottom_10_countries_by_coverage

Unnamed: 0,country,Population,country_article_count,articles-per-pop
3642,India,1400100000,968,6.9e-05
25121,Indonesia,271739000,209,7.7e-05
10856,China,1402385000,1129,8.1e-05
42943,Uzbekistan,34174000,28,8.2e-05
36689,Ethiopia,114916000,101,8.8e-05
44522,Zambia,18384000,25,0.000136
43818,"Korea, North",25779000,36,0.00014
39497,Thailand,66534000,112,0.000168
39439,Mozambique,31166000,58,0.000186
38316,Bangladesh,169809000,317,0.000187


### Top 10 countries by relative quality
Below are the 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA or FA quality.

In [38]:
top_10_countries_by_quality

Unnamed: 0,country,Population,country_article_count,quality_count,quality_articles_pct
43852,"Korea, North",25779000,36,8.0,22.222222
44213,Saudi Arabia,35041000,117,15.0,12.820513
21956,Romania,19241000,343,42.0,12.244898
43764,Central African Republic,4830000,66,8.0,12.121212
42964,Uzbekistan,34174000,28,3.0,10.714286
41923,Mauritania,4650000,48,5.0,10.416667
43262,Guatemala,18066000,83,7.0,8.433735
44367,Dominica,72000,12,1.0,8.333333
22560,Syria,19398000,128,10.0,7.8125
21666,Benin,12209000,91,7.0,7.692308


### Bottom 10 countries by relative quality
Below are the 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA or FA quality. Note that all of these countries have 0 quality articles, and that other countries with 0 quality articles would be "tied" with them for the bottom 10.


In [39]:
bottom_10_countries_by_quality

Unnamed: 0,country,Population,country_article_count,quality_count,quality_articles_pct
44567,Seychelles,98000,21,0.0,0.0
23733,Moldova,3535000,421,0.0,0.0
7438,French Guiana,294000,27,0.0,0.0
27098,Monaco,38000,40,0.0,0.0
34739,Sao Tome and Principe,210000,21,0.0,0.0
34682,Liechtenstein,39000,28,0.0,0.0
35102,Marshall Islands,57000,37,0.0,0.0
34446,Solomon Islands,715000,97,0.0,0.0
39466,Mozambique,31166000,58,0.0,0.0
39041,Kiribati,125000,30,0.0,0.0


### Geographic regions by coverage
Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population.

In [40]:
display(region_level_analysis_article_count)

Region,Population,regional_article_count,quality_count,regional_article_pct,regional_quality_pct
OCEANIA,42031000,3126,63.0,0.007437,2.015355
Channel Islands,105680000,3763,102.0,0.003561,2.710603
SOUTHERN EUROPE,151136000,3710,74.0,0.002455,1.994609
WESTERN EUROPE,195479000,4560,56.0,0.002333,1.22807
CARIBBEAN,39056000,695,13.0,0.001779,1.870504
EASTERN EUROPE,281186000,3732,118.0,0.001327,3.161844
SOUTHERN AFRICA,66628000,634,9.0,0.000952,1.419558
CENTRAL AMERICA,162267000,1543,23.0,0.000951,1.490603
WESTERN ASIA,272499000,2563,89.0,0.000941,3.472493
MIDDLE AFRICA,90189000,665,16.0,0.000737,2.406015


### Geographic regions by high quality article coverage
Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA or FA quality.

In [41]:
display(region_level_analysis_quality_count)

Region,Population,regional_article_count,quality_count,regional_article_pct,regional_quality_pct
NORTHERN AMERICA,368068000,1901,104.0,0.000516,5.470805
SOUTHEAST ASIA,660056000,2020,73.0,0.000306,3.613861
WESTERN ASIA,272499000,2563,89.0,0.000941,3.472493
EASTERN EUROPE,281186000,3732,118.0,0.001327,3.161844
EAST ASIA,1632883000,2473,76.0,0.000151,3.07319
CENTRAL ASIA,74960000,245,7.0,0.000327,2.857143
Channel Islands,105680000,3763,102.0,0.003561,2.710603
MIDDLE AFRICA,90189000,665,16.0,0.000737,2.406015
NORTHERN AFRICA,243748000,899,19.0,0.000369,2.113459
OCEANIA,42031000,3126,63.0,0.007437,2.015355
