Getting the Article and Population Data

The data for politicians by country and population by country was obtained by crawling a general list of Wikipedia pages. It is stored in CSV files in the `raw` folder.

Considerations in the raw data:
1. Duplicate category labels
2. Cumulative regional population counts are also present
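
Both points can be checked with a few lines of pandas. The sketch below uses small made-up frames rather than the real files, and it assumes (as the population table shown later in this notebook suggests) that regional rows are the ones whose `Geography` value is written in all capitals. The names `politicians_demo` and `population_demo` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the raw files; the rows are illustrative only.
politicians_demo = pd.DataFrame({
    "name": ["A. Example", "A. Example", "B. Example"],
    "country": ["Albania", "Albania", "Armenia"],
})
population_demo = pd.DataFrame({
    "Geography": ["WORLD", "AFRICA", "Algeria", "Egypt"],
    "Population (millions)": [7963.0, 1419.0, 44.9, 103.5],
})

# 1. Rows that repeat an earlier row exactly.
duplicate_rows = politicians_demo[politicians_demo.duplicated()]

# 2. Regional aggregates, assumed here to be the all-caps Geography values.
is_region = population_demo["Geography"].str.isupper()
countries_only = population_demo[~is_region]
```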

Getting Article Quality Predictions

ORES, a machine learning tool, provides an estimate of article quality. The possible quality classes, from highest to lowest, are:
1. FA - Featured article
2. GA - Good article
3. B - B-class article
4. C - C-class article
5. Start - Start-class article
6. Stub - Stub-class article

ORES needs the revision ID of a specific article revision before it can predict a label.
The MediaWiki Action API is used to make a page info request for the current revision ID of each article, and that ID is then passed to ORES to retrieve a score.

To get a Wikipedia page quality prediction from ORES for each politician’s article page, you will need to:
1. read each line of politicians_by_country.SEPT.2022.csv,
2. make a page info request to get the current page revision, and
3. make an ORES request using the page title and current revision ID.
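
Step 2 comes down to pulling `lastrevid` out of the page info response. The sketch below shows one way to do that on hand-written sample responses. The helper name `extract_lastrevid` is hypothetical, and the samples are trimmed to the keys this notebook reads, so a real response will carry more fields.

```python
def extract_lastrevid(pageinfo_json):
    """Return (title, lastrevid); lastrevid is None when the page has no revision."""
    if not pageinfo_json:
        return (None, None)
    pages = pageinfo_json.get("query", {}).get("pages", {})
    for page in pages.values():
        # With a single title per request there is only one entry to inspect.
        return (page.get("title"), page.get("lastrevid"))
    return (None, None)


# Hand-written samples: an existing page is keyed by its page id,
# a page that could not be found is keyed by "-1".
found = {"query": {"pages": {"69537737": {
    "pageid": 69537737, "title": "Shahjahan Noori", "lastrevid": 1099689043}}}}
missing = {"query": {"pages": {"-1": {"title": "Kang Sun-nam", "missing": ""}}}}

print(extract_lastrevid(found))
print(extract_lastrevid(missing))
```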

In [3]:
# import libraries
import json, time, urllib.parse
import requests
import pandas as pd
import numpy as np

In [45]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<anuhyabs@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022',
}

# Load the raw politician and population data; the politician names are the article titles to look up
politicians = pd.read_csv('raw/politicians_by_country_SEPT.2022.csv')
population = pd.read_csv('raw/population_by_country_2022.csv')
ARTICLE_TITLES = politicians['name']

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making a page info request
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}

# The current ORES API endpoint
API_ORES_SCORE_ENDPOINT = "https://ores.wikimedia.org/v3"
# A template for mapping to the URL
API_ORES_SCORE_PARAMS = "/scores/{context}/{revid}/{model}"

# This template lists the basic parameters for making an ORES request
ORES_PARAMS_TEMPLATE = {
    "context": "enwiki",        # which WMF project for the specified revid
    "revid" : "",               # the revision to be scored - this will probably change each call
    "model": "articlequality"   # the AI/ML scoring model to apply to the revision
}
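
To make the constants concrete, the following sketch repeats three of them and shows the ORES URL they produce for one revision ID, along with the request rate implied by the throttle settings. The revision ID is taken from the results further down. `requests_per_second` is a name introduced here for illustration, and the rate is only as good as the assumed 2 ms latency.

```python
API_ORES_SCORE_ENDPOINT = "https://ores.wikimedia.org/v3"
API_ORES_SCORE_PARAMS = "/scores/{context}/{revid}/{model}"

# Fill the URL template the same way request_ores_score_per_article does.
params = {"context": "enwiki", "revid": 1099689043, "model": "articlequality"}
request_url = API_ORES_SCORE_ENDPOINT + API_ORES_SCORE_PARAMS.format(**params)
print(request_url)

API_LATENCY_ASSUMED = 0.002
API_THROTTLE_WAIT = (1.0 / 100.0) - API_LATENCY_ASSUMED

# Each request costs the wait plus the assumed latency, so the ceiling is
# one request per 10 ms.
requests_per_second = 1.0 / (API_THROTTLE_WAIT + API_LATENCY_ASSUMED)
print(requests_per_second)
```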

In [3]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    # Make sure we have an article title
    if not article_title: return None
    
    # work on a copy so the shared template is not modified between calls
    request_template = dict(request_template, titles=article_title)
        
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

def request_ores_score_per_article(article_revid = None, 
                                   endpoint_url = API_ORES_SCORE_ENDPOINT, 
                                   endpoint_params = API_ORES_SCORE_PARAMS, 
                                   request_template = ORES_PARAMS_TEMPLATE,
                                   headers = REQUEST_HEADERS,
                                   features=False):
    # Make sure we have an article revision id
    if not article_revid: return None
    
    # set the revision id into a copy of the template so the shared default is left untouched
    request_template = dict(request_template, revid=article_revid)
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_template)
    
    # the features used by the ML model can sometimes be returned as well as scores
    if features:
        request_url = request_url+"?features=true"
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
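
Both functions return `None` when the request fails, so the caller has to be ready for a missing or incomplete response. Below is a minimal sketch of a defensive lookup for the ORES prediction. The helper `extract_prediction` is hypothetical, and the sample response is trimmed to the keys this notebook reads.

```python
def extract_prediction(ores_json, revid, context="enwiki", model="articlequality"):
    """Return the predicted quality class, or None if the response holds no score."""
    if not ores_json:
        return None
    model_result = (
        ores_json.get(context, {})
        .get("scores", {})
        .get(str(revid), {})   # scores are keyed by the revision id as a string
        .get(model, {})
    )
    return model_result.get("score", {}).get("prediction")


sample = {"enwiki": {"scores": {"1099689043": {
    "articlequality": {"score": {"prediction": "GA"}}}}}}

print(extract_prediction(sample, 1099689043))
print(extract_prediction(None, 1099689043))
```

Returning `None` instead of raising lets a long scoring loop continue and leaves the gaps visible as missing values afterwards.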

In [4]:
page_info = {}
no_val = pd.DataFrame(columns=['title'])
df = pd.DataFrame(columns=['title','lastrevid'])
for index, article in enumerate(ARTICLE_TITLES):
    if index % 500 == 0:
        print(f"Done with {index} records")
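    page_info = request_pageinfo_per_article(article)
    if page_info is None:
        # the request failed; record the title as having no revision id
        no_val = pd.concat([no_val, pd.DataFrame({'title': [article]})])
        continue
    page_info = page_info['query']['pages']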
    info_df = pd.DataFrame.from_dict(page_info, orient = 'index')
    if 'lastrevid' in info_df.columns:
        info_df = info_df[['title','lastrevid']]
        df = pd.concat([df, info_df])
    else:
        no_val = pd.concat([no_val, info_df[['title']]])


Done with 0 records
Done with 500 records
Done with 1000 records
Done with 1500 records
Done with 2000 records
Done with 2500 records
Done with 3000 records
Done with 3500 records
Done with 4000 records
Done with 4500 records
Done with 5000 records
Done with 5500 records
Done with 6000 records
Done with 6500 records
Done with 7000 records
Done with 7500 records


The following articles have no last revision ID, so ORES cannot score them:

In [16]:
no_val

Unnamed: 0,title
-1,Prince Ofosu Sefah
-1,Harjit Kaur Talwandi
-1,Abd al-Razzaq al-Hasani
-1,Kang Sun-nam
-1,Segun “Aeroland” Adewale
-1,Nhlanhla “Lux” Dlamini


In [6]:
df.head(10)

Unnamed: 0,title,lastrevid
69537737,Shahjahan Noori,1099689043
42972519,Abdul Ghafar Lakanwal,943562276
10483286,Majah Ha Adrif,852404094
11966231,Haroon al-Afghani,1095102390
46841383,Tayyab Agha,1104998382
68624823,Ahmadullah Wasiq,1109361754
47805901,Aziza Ahmadyar,1087211008
70019038,Muqadasa Ahmadzai,1082489593
27664854,Mohammad Sarwar Ahmedzai,1038918070
12084570,Amir Muhammad Akhundzada,1069322182


In [7]:
i = 0
for index, row in df.iterrows():
    if(i%500 == 0):
        print(f"Done with {i} records")
    lastrevid = row["lastrevid"]
    score = request_ores_score_per_article(lastrevid)
    score = score["enwiki"]["scores"][str(lastrevid)]["articlequality"]["score"]["prediction"]
    df.loc[index,'score'] = score
    i += 1


Done with 0 records
Done with 500 records
Done with 1000 records
Done with 1500 records
Done with 2000 records
Done with 2500 records
Done with 3000 records
Done with 3500 records
Done with 4000 records
Done with 4500 records
Done with 5000 records
Done with 5500 records
Done with 6000 records
Done with 6500 records
Done with 7000 records
Done with 7500 records


In [8]:
df.head()

Unnamed: 0,title,lastrevid,score
69537737,Shahjahan Noori,1099689043,GA
42972519,Abdul Ghafar Lakanwal,943562276,Start
10483286,Majah Ha Adrif,852404094,Start
11966231,Haroon al-Afghani,1095102390,B
46841383,Tayyab Agha,1104998382,Start


In [9]:
df['score'].isnull().values.any()

False

In [43]:
combined = pd.merge(politicians, df, left_on='name', right_on='title', how='left')
combined.head()

Unnamed: 0,name,url,country,title,lastrevid,score
0,Shahjahan Noori,https://en.wikipedia.org/wiki/Shahjahan_Noori,Afghanistan,Shahjahan Noori,1099689043,GA
1,Abdul Ghafar Lakanwal,https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak...,Afghanistan,Abdul Ghafar Lakanwal,943562276,Start
2,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan,Majah Ha Adrif,852404094,Start
3,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan,Haroon al-Afghani,1095102390,B
4,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan,Tayyab Agha,1104998382,Start


In [47]:
population.head()

Unnamed: 0,Geography,Population (millions)
0,WORLD,7963.0
1,AFRICA,1419.0
2,NORTHERN AFRICA,251.0
3,Algeria,44.9
4,Egypt,103.5


In [48]:
for i in range(0,len(population)):
    region = population.loc[i,'Geography'] 
    population.loc[i,'region'] = region if region.isupper() else population.loc[i-1,'region'] 


In [49]:
population.head()

Unnamed: 0,Geography,Population (millions),region
0,WORLD,7963.0,WORLD
1,AFRICA,1419.0,AFRICA
2,NORTHERN AFRICA,251.0,NORTHERN AFRICA
3,Algeria,44.9,NORTHERN AFRICA
4,Egypt,103.5,NORTHERN AFRICA
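
The same region labels can be produced without an explicit loop. The sketch below is one vectorized way to do it, run on the five rows shown above. It relies on the same assumption as the loop: a region row is one whose `Geography` value is all capitals.

```python
import pandas as pd

population_demo = pd.DataFrame({
    "Geography": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt"],
    "Population (millions)": [7963.0, 1419.0, 251.0, 44.9, 103.5],
})

geography = population_demo["Geography"]
# Keep the name on all-caps rows, blank it elsewhere, then carry the
# most recent region name forward onto the country rows below it.
population_demo["region"] = geography.where(geography.str.isupper()).ffill()
print(population_demo)
```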


In [117]:
population['population'] = population['Population (millions)'] * 1000000

In [118]:
population.head()

Unnamed: 0,Geography,Population (millions),region,population
0,WORLD,7963.0,WORLD,7963000000.0
1,AFRICA,1419.0,AFRICA,1419000000.0
2,NORTHERN AFRICA,251.0,NORTHERN AFRICA,251000000.0
3,Algeria,44.9,NORTHERN AFRICA,44900000.0
4,Egypt,103.5,NORTHERN AFRICA,103500000.0


In [116]:
population[population['Population (millions)']==0]['Geography'].unique()

array(['Liechtenstein', 'Monaco', 'San Marino', 'Nauru', 'Palau',
       'Tuvalu'], dtype=object)

In [50]:
combined = pd.merge(combined, population, left_on='country', right_on='Geography', how='left')
combined.head()

Unnamed: 0,name,url,country,title,lastrevid,score,Geography,Population (millions),region
0,Shahjahan Noori,https://en.wikipedia.org/wiki/Shahjahan_Noori,Afghanistan,Shahjahan Noori,1099689043,GA,Afghanistan,41.1,SOUTH ASIA
1,Abdul Ghafar Lakanwal,https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak...,Afghanistan,Abdul Ghafar Lakanwal,943562276,Start,Afghanistan,41.1,SOUTH ASIA
2,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan,Majah Ha Adrif,852404094,Start,Afghanistan,41.1,SOUTH ASIA
3,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan,Haroon al-Afghani,1095102390,B,Afghanistan,41.1,SOUTH ASIA
4,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan,Tayyab Agha,1104998382,Start,Afghanistan,41.1,SOUTH ASIA
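
A left merge keeps every politician row even when the country has no match in the population table, so unmatched countries show up only as missing values. The `indicator=True` option of `pd.merge` makes them explicit. The sketch below uses made-up frames, with article names `P1` and `P2` as placeholders.

```python
import pandas as pd

articles_demo = pd.DataFrame({"name": ["P1", "P2"], "country": ["Egypt", "Korean"]})
population_demo = pd.DataFrame({"Geography": ["Egypt"], "Population (millions)": [103.5]})

merged = pd.merge(articles_demo, population_demo, left_on="country",
                  right_on="Geography", how="left", indicator=True)

# Rows marked "left_only" found no partner in the population table.
unmatched_countries = merged.loc[merged["_merge"] == "left_only", "country"].unique()
print(unmatched_countries)
```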


In [51]:
combined = combined.drop(columns=['title','url', 'Geography'])
combined.head()

Unnamed: 0,name,country,lastrevid,score,Population (millions),region
0,Shahjahan Noori,Afghanistan,1099689043,GA,41.1,SOUTH ASIA
1,Abdul Ghafar Lakanwal,Afghanistan,943562276,Start,41.1,SOUTH ASIA
2,Majah Ha Adrif,Afghanistan,852404094,Start,41.1,SOUTH ASIA
3,Haroon al-Afghani,Afghanistan,1095102390,B,41.1,SOUTH ASIA
4,Tayyab Agha,Afghanistan,1104998382,Start,41.1,SOUTH ASIA


In [52]:
combined = combined.rename(columns={'name':'article_title', 'lastrevid': 'revision_id', 'score':'article_quality', 'Population (millions)':'population'})
combined.head()

Unnamed: 0,article_title,country,revision_id,article_quality,population,region
0,Shahjahan Noori,Afghanistan,1099689043,GA,41.1,SOUTH ASIA
1,Abdul Ghafar Lakanwal,Afghanistan,943562276,Start,41.1,SOUTH ASIA
2,Majah Ha Adrif,Afghanistan,852404094,Start,41.1,SOUTH ASIA
3,Haroon al-Afghani,Afghanistan,1095102390,B,41.1,SOUTH ASIA
4,Tayyab Agha,Afghanistan,1104998382,Start,41.1,SOUTH ASIA


In [53]:
duplicateRows = combined[combined.duplicated()]
duplicateRows

Unnamed: 0,article_title,country,revision_id,article_quality,population,region
198,Visar Ymeri,Albania,1036757024,Stub,2.8,SOUTHERN EUROPE
382,Hrant Maloyan,Armenia,1114902744,C,3.0,WESTERN ASIA
418,Count Wenzel Chotek of Chotkow and Wognin,Austria,1083654825,Start,9.0,WESTERN EUROPE
434,Eduard Hedvicek,Austria,1072556655,Stub,9.0,WESTERN EUROPE
442,Konstantin Jireček,Austria,1100601439,C,9.0,WESTERN EUROPE
...,...,...,...,...,...,...
7066,Manuel Carrascalão,Timor-Leste,1071880738,Start,1.3,SOUTHEAST ASIA
7290,Sergey Abisov,Ukraine,1113303500,Start,41.0,EASTERN EUROPE
7445,Torokul Dzhanuzakov,Uzbekistan,1092752457,C,35.6,CENTRAL ASIA
7446,Torokul Dzhanuzakov,Uzbekistan,1092752457,C,35.6,CENTRAL ASIA


In [54]:
combined = combined.drop_duplicates()

In [56]:
len(combined[combined['article_title'].duplicated()])

48
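
`drop_duplicates()` removes only rows that match in every column, so any title that is still duplicated afterwards must differ in at least one other column, for example the country. The toy frame below illustrates the difference between the two checks. The values are made up.

```python
import pandas as pd

demo = pd.DataFrame({
    "article_title": ["X", "X", "Y", "Y"],
    "country": ["Austria", "Austria", "Austria", "Hungary"],
})

exact_duplicates = demo.duplicated().sum()      # whole row repeated
deduplicated = demo.drop_duplicates()
# Titles that still repeat after exact duplicates are gone.
repeated_titles = deduplicated["article_title"].duplicated().sum()
print(exact_duplicates, repeated_titles)
```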

In [57]:
combined.isnull().values.any()

True

In [58]:
combined.isnull().sum()

article_title       0
country             0
revision_id         6
article_quality     6
population         70
region             70
dtype: int64

In [81]:
no_match = pd.DataFrame(
    combined[combined['region'].isnull()]["country"].unique(),
    columns=["country"],
)
no_match

Unnamed: 0,country
0,Korean


In [94]:
no_match.to_csv("data/wp_countries-no_match.csv", index=False)

In [87]:
no_score = combined[combined['article_quality'].isnull()]
no_score

Unnamed: 0,article_title,country,revision_id,article_quality,population,region
2480,Prince Ofosu Sefah,Ghana,,,33.5,WESTERN AFRICA
3024,Harjit Kaur Talwandi,India,,,1417.2,SOUTH ASIA
3253,Abd al-Razzaq al-Hasani,Iraq,,,44.5,WESTERN ASIA
3834,Kang Sun-nam,"Korea, North",,,26.1,EAST ASIA
4943,Segun “Aeroland” Adewale,Nigeria,,,218.5,WESTERN AFRICA
6434,Nhlanhla “Lux” Dlamini,South Africa,,,60.6,SOUTHERN AFRICA


In [93]:
no_score.to_csv("data/wp_article_quality-no_match.csv", index=False)

In [90]:
combined = combined[combined['region'].notna()]
combined = combined[combined['article_quality'].notna()]
combined.head()

Unnamed: 0,article_title,country,revision_id,article_quality,population,region
0,Shahjahan Noori,Afghanistan,1099689043,GA,41.1,SOUTH ASIA
1,Abdul Ghafar Lakanwal,Afghanistan,943562276,Start,41.1,SOUTH ASIA
2,Majah Ha Adrif,Afghanistan,852404094,Start,41.1,SOUTH ASIA
3,Haroon al-Afghani,Afghanistan,1095102390,B,41.1,SOUTH ASIA
4,Tayyab Agha,Afghanistan,1104998382,Start,41.1,SOUTH ASIA


In [91]:
combined.isnull().sum()

article_title      0
country            0
revision_id        0
article_quality    0
population         0
region             0
dtype: int64

In [109]:
combined.reset_index(drop=True).to_csv("data/wp_politicians_by_country.csv", index=False)

In [4]:
combined = pd.read_csv("data/wp_politicians_by_country.csv")

In [5]:
combined[combined['population']==0]['country'].unique()

array(['Liechtenstein', 'Monaco', 'Nauru', 'Palau', 'San Marino',
       'Tuvalu'], dtype=object)

In [6]:
total_article = combined.groupby(['country', 'population'])['country'].count().reset_index(name='count')

In [7]:
total_article['article_per_capita'] = total_article['count']/(total_article['population'] * 1000000)

In [27]:
ans1 = total_article[total_article['article_per_capita'] != np.inf].sort_values(by=['article_per_capita'], ascending=False)[:10]

In [28]:
ans1

Unnamed: 0,country,population,count,article_per_capita
5,Antigua and Barbuda,0.1,17,0.00017
54,Federated States of Micronesia,0.1,13,0.00013
3,Andorra,0.1,10,0.0001
13,Barbados,0.3,28,9.3e-05
104,Marshall Islands,0.1,9,9e-05
110,Montenegro,0.6,36,6e-05
143,Seychelles,0.1,6,6e-05
97,Luxembourg,0.7,37,5.3e-05
18,Bhutan,0.8,41,5.1e-05
64,Grenada,0.1,5,5e-05
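
The population column is in millions, so a value of 0.0 turns the per-capita division into a division by zero, which pandas reports as `inf`. The sketch below reproduces the calculation on three rows and filters with `np.isfinite`. The Andorra and Barbados rows come from the table above, while the Monaco article count is invented for illustration.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "country": ["Andorra", "Monaco", "Barbados"],
    "population": [0.1, 0.0, 0.3],   # millions, as in the population file
    "count": [10, 4, 28],            # the Monaco count is made up
})

demo["article_per_capita"] = demo["count"] / (demo["population"] * 1_000_000)

# Keep only rows where the ratio is a finite number, then rank them.
finite = demo[np.isfinite(demo["article_per_capita"])]
ranked = finite.sort_values("article_per_capita", ascending=False)
print(ranked)
```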


In [10]:
total_quality = combined.groupby(['country', 'population'])['article_quality'].count().reset_index(name='count')

In [11]:
total_quality

Unnamed: 0,country,population,count
0,Afghanistan,41.1,118
1,Albania,2.8,83
2,Algeria,44.9,34
3,Andorra,0.1,10
4,Angola,35.6,42
...,...,...,...
179,Venezuela,28.3,62
180,Vietnam,99.4,27
181,Yemen,33.7,61
182,Zambia,20.0,13


In [12]:
total_quality['quality_per_capita'] = total_quality['count']/(total_quality['population'] * 1000000)

In [13]:
total_quality[total_quality['quality_per_capita'] != np.inf].sort_values(by=['quality_per_capita'], ascending=False)

Unnamed: 0,country,population,count,quality_per_capita
5,Antigua and Barbuda,0.1,17,1.700000e-04
54,Federated States of Micronesia,0.1,13,1.300000e-04
3,Andorra,0.1,10,1.000000e-04
13,Barbados,0.3,28,9.333333e-05
104,Marshall Islands,0.1,9,9.000000e-05
...,...,...,...,...
73,India,1417.2,178,1.255998e-07
134,Romania,19.0,2,1.052632e-07
140,Saudi Arabia,36.7,3,8.174387e-08
106,Mexico,127.5,1,7.843137e-09
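
The count used above includes every article that has a prediction, so it equals the total article count per country. If the goal were to count only high-quality articles, one option is to flag them before grouping. The sketch below assumes that "high quality" means the FA and GA classes. That definition is an assumption and is not stated anywhere in this notebook. The rows are made up.

```python
import pandas as pd

demo = pd.DataFrame({
    "country": ["Albania", "Albania", "Armenia", "Armenia"],
    "population": [2.8, 2.8, 3.0, 3.0],   # millions
    "article_quality": ["GA", "Stub", "C", "Start"],
})

HIGH_QUALITY = ["FA", "GA"]  # assumed definition of "high quality"

high_counts = (
    demo.assign(high=demo["article_quality"].isin(HIGH_QUALITY))
    .groupby(["country", "population"])["high"].sum()   # True counts as 1
    .reset_index(name="count")
)
high_counts["quality_per_capita"] = (
    high_counts["count"] / (high_counts["population"] * 1_000_000)
)
print(high_counts)
```

Summing a boolean flag keeps countries with zero high-quality articles in the result, which a plain filter followed by `count()` would drop.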


In [17]:
total_region = combined.groupby('region').agg(count=("article_title", "count"), population=("population", "sum")).reset_index()

In [18]:
total_region

Unnamed: 0,region,count,population
0,CARIBBEAN,201,1239.5
1,CENTRAL AMERICA,195,1755.7
2,CENTRAL ASIA,106,1788.4
3,EAST ASIA,245,21763.7
4,EASTERN AFRICA,648,19032.8
5,EASTERN EUROPE,736,37316.1
6,MIDDLE AFRICA,203,7919.0
7,NORTHERN AFRICA,227,7639.9
8,NORTHERN EUROPE,262,1348.4
9,OCEANIA,86,110.1


In [19]:
total_region['article_per_capita'] = total_region['count']/(total_region['population'] * 1000000)

In [20]:
total_region[total_region['article_per_capita'] != np.inf].sort_values(by=['article_per_capita'], ascending=False)

Unnamed: 0,region,count,population,article_per_capita
9,OCEANIA,86,110.1,7.811081e-07
8,NORTHERN EUROPE,262,1348.4,1.943044e-07
0,CARIBBEAN,201,1239.5,1.621622e-07
1,CENTRAL AMERICA,195,1755.7,1.110668e-07
2,CENTRAL ASIA,106,1788.4,5.927086e-08
14,SOUTHERN EUROPE,890,19444.6,4.577106e-08
16,WESTERN ASIA,686,15532.6,4.416518e-08
4,EASTERN AFRICA,648,19032.8,3.404649e-08
7,NORTHERN AFRICA,227,7639.9,2.971243e-08
6,MIDDLE AFRICA,203,7919.0,2.563455e-08


In [22]:
total_quality_region = combined.groupby('region').agg(count=("article_quality", "count"), population=("population", "sum")).reset_index()

In [24]:
total_quality_region['quality_per_capita'] = total_quality_region['count']/(total_quality_region['population'] * 1000000)

In [25]:
total_quality_region[total_quality_region['quality_per_capita'] != np.inf].sort_values(by=['quality_per_capita'], ascending=False)

Unnamed: 0,region,count,population,quality_per_capita
9,OCEANIA,86,110.1,7.811081e-07
8,NORTHERN EUROPE,262,1348.4,1.943044e-07
0,CARIBBEAN,201,1239.5,1.621622e-07
1,CENTRAL AMERICA,195,1755.7,1.110668e-07
2,CENTRAL ASIA,106,1788.4,5.927086e-08
14,SOUTHERN EUROPE,890,19444.6,4.577106e-08
16,WESTERN ASIA,686,15532.6,4.416518e-08
4,EASTERN AFRICA,648,19032.8,3.404649e-08
7,NORTHERN AFRICA,227,7639.9,2.971243e-08
6,MIDDLE AFRICA,203,7919.0,2.563455e-08
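
Because `combined` holds one row per article, summing `population` within a region adds each country's population once for every article it has. A sketch of one way to count each country only once is shown below on made-up rows. The article titles are placeholders.

```python
import pandas as pd

demo = pd.DataFrame({
    "article_title": ["a", "b", "c"],
    "country": ["Algeria", "Algeria", "Egypt"],
    "population": [44.9, 44.9, 103.5],   # millions
    "region": ["NORTHERN AFRICA"] * 3,
})

article_counts = demo.groupby("region")["article_title"].count()

# Reduce to one row per country before summing the population.
region_population = (
    demo.drop_duplicates(subset=["region", "country"])
    .groupby("region")["population"].sum()
)
per_capita = article_counts / (region_population * 1_000_000)

naive_population = demo.groupby("region")["population"].sum()
print(region_population["NORTHERN AFRICA"], naive_population["NORTHERN AFRICA"])
```

Here the per-article sum counts Algeria twice, while the deduplicated sum adds Algeria and Egypt once each.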
