In [1]:
#importing libraries
import pandas as pd
import numpy as np

import json, time, urllib.parse
import requests

# Step 1: Getting the Article and Population Data

We read the Excel files stored in the data_raw folder. These correspond to Wikipedia's politicians-by-country dataset and the Population Reference Bureau's world-population-by-country dataset.

In [2]:
politicians = pd.read_excel('../data_raw/politicians_by_country_SEPT.2022.csv.xlsx')
population = pd.read_excel('../data_raw/population_by_country_2022.csv.xlsx')

## Handling data inconsistencies
We explore the data to look for inconsistencies and, where we find any, describe how we decided to handle each one.

In [3]:
politicians[politicians.duplicated(subset=['name'])]

Unnamed: 0,name,url,country
1566,Rudi Kolak,https://en.wikipedia.org/wiki/Rudi_Kolak,Croatia
1654,Count Wenzel Chotek of Chotkow and Wognin,https://en.wikipedia.org/wiki/Count_Wenzel_Cho...,Czechia
1669,Eduard Hedvicek,https://en.wikipedia.org/wiki/Eduard_Hedvicek,Czechia
1676,Konstantin Jireček,https://en.wikipedia.org/wiki/Konstantin_Jireček,Czechia
1680,Maximilian Ulrich von Kaunitz,https://en.wikipedia.org/wiki/Maximilian_Ulric...,Czechia
1711,"Leopold, Count von Thun und Hohenstein","https://en.wikipedia.org/wiki/Leopold,_Count_v...",Czechia
1914,Ibrahim Harun,https://en.wikipedia.org/wiki/Ibrahim_Harun,Ethiopia
2513,José Alejandro de Aycinena,https://en.wikipedia.org/wiki/José_Alejandro_d...,Guatemala
2659,José Francisco Barrundia,https://en.wikipedia.org/wiki/José_Francisco_B...,Honduras
3419,Luca Rovinalti,https://en.wikipedia.org/wiki/Luca_Rovinalti,Italy


We observe that in the population data the aggregate rows (regions) are written in ALL CAPS. We keep them, as they will come in handy later in the analysis.

Lastly, we notice that there are 50 duplicated politicians in our data source. Checking the articles case by case, we find that each of these people has been a politician in every country they are listed under. This means that the data isn't corrupted; it's just an edge case. Considering that:

1. It doesn't seem logical to assign these politicians to only one country, given that they have participated in politics in each of the countries they are listed under.
2. These duplicates represent only 1.3% of the data (100/7584 ≈ 0.013).

We decide to allow these duplicates to exist, but we further check whether any rows are duplicated across all columns.
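The difference between name-level duplicates and full-row duplicates can be sketched on a small table. The rows below are hypothetical placeholders, not records from the dataset; only the column names follow the politicians table.

```python
import pandas as pd

# Hypothetical toy data: one politician listed under two countries,
# and one row repeated verbatim.
toy = pd.DataFrame({
    "name": ["A. Example", "A. Example", "B. Sample", "B. Sample", "C. Unique"],
    "url": ["u/a", "u/a", "u/b", "u/b", "u/c"],
    "country": ["Croatia", "Czechia", "Somalia", "Somalia", "Italy"],
})

# Rows whose name already appeared earlier (the first occurrence is not flagged).
name_dupes = toy[toy.duplicated(subset=["name"])]

# Rows identical to an earlier row in every column.
full_dupes = toy[toy.duplicated()]

# Share of rows involved in a name-level duplicate (keep=False flags all copies).
share = toy.duplicated(subset=["name"], keep=False).mean()

print(len(name_dupes), len(full_dupes), share)
```

Only the full-row duplicate would be dropped by `drop_duplicates()` with its default arguments; the politician listed under two countries is kept in both.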

In [4]:
politicians[politicians.duplicated()]
#politicians = politicians.drop_duplicates(subset=['name'], keep=False)

Unnamed: 0,name,url,country
6295,Abdirahman Aw Ali Farrah,https://en.wikipedia.org/wiki/Abdirahman_Aw_Al...,Somalia
6309,Ibrahim Megag Samatar,https://en.wikipedia.org/wiki/Ibrahim_Megag_Sa...,Somalia


We find two politicians whose rows are duplicated across all columns. These duplicates are not justifiable and are removed from the data.

In [5]:
politicians = politicians.drop_duplicates()

# Step 2: Getting Article Quality Predictions

We begin by making a page info request to get the current page revision. The example relies on some constants that help make the code a bit more readable.

Basic English Wikipedia API endpoint

In [7]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<uwnetid@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022',
}

# This is just a list of English Wikipedia article titles that we can use for example requests
ARTICLE_TITLES = politicians.name.tolist()

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}
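As a rough check of what the throttling constants imply, the sketch below works out the wait per request and a lower bound on the total run time. The helper name `estimated_runtime_seconds` and the request count of 7,500 are illustrative assumptions, and real network latency is likely higher than the assumed 2 ms.

```python
# Same constants as above, repeated so the snippet runs on its own.
API_LATENCY_ASSUMED = 0.002
API_THROTTLE_WAIT = (1.0 / 100.0) - API_LATENCY_ASSUMED


def estimated_runtime_seconds(n_requests,
                              wait=API_THROTTLE_WAIT,
                              latency=API_LATENCY_ASSUMED):
    """Lower bound on wall-clock time if every request took exactly `latency`."""
    return n_requests * (wait + latency)


# The wait is chosen so that wait + latency is 1/100 s, i.e. about 100 requests per second.
print(round(API_THROTTLE_WAIT, 3))
print(round(estimated_runtime_seconds(7500), 1))
```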


The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. Therefore the parameter most likely to change is the article_title.

In [8]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    # Make sure we have an article title
    if not article_title: return None
    
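    # Work on a copy so the shared module-level template is never mutated
    request_template = dict(request_template)
    request_template['titles'] = article_title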
        
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

We send one request per article and store the data obtained in a dictionary called ARTICLE_REVISIONS. Any article that does not have a revision, or whose request resulted in an error, is stored in the list ARTICLE_REVISIONS_NULL.

In [9]:
ARTICLE_REVISIONS = {}
ARTICLE_REVISIONS_NULL = []
for article in ARTICLE_TITLES:
    info = request_pageinfo_per_article(article)
    # A failed request returns None; treat it like a missing revision
    pages = (info or {}).get('query', {}).get('pages', {})
    # One title per request, so there is at most one page entry
    page = next(iter(pages.values()), {})
    last_rev_id = page.get('lastrevid')
    if last_rev_id is not None:
        ARTICLE_REVISIONS[article] = last_rev_id
    else:
        ARTICLE_REVISIONS_NULL.append(article)
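To make the shape of the page info response easier to see, here is a minimal sketch that pulls `lastrevid` out of two hand-written responses. The helper name `extract_lastrevid` and all values in the sample dictionaries are made up for illustration; the nesting (`query` → `pages` → page entry) follows the loop above.

```python
def extract_lastrevid(info):
    """Return the last revision ID from a page info response, or None."""
    if not info:
        return None
    pages = info.get("query", {}).get("pages", {})
    # One title per request, so there is at most one page entry.
    for page in pages.values():
        return page.get("lastrevid")
    return None


# Hypothetical responses (values are made up).
found = {"query": {"pages": {"123": {"pageid": 123, "title": "Some Article", "lastrevid": 456}}}}
missing = {"query": {"pages": {"-1": {"title": "No Such Article", "missing": ""}}}}

print(extract_lastrevid(found))
print(extract_lastrevid(missing))
print(extract_lastrevid(None))
```

A page entry without `lastrevid`, or a request that failed and returned `None`, both yield `None`, which is the case collected in ARTICLE_REVISIONS_NULL.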

The following articles have to be removed from the analysis due to their lack of revision information.

In [10]:
ARTICLE_REVISIONS_NULL

['Prince Ofosu Sefah',
 'Harjit Kaur Talwandi',
 'Abd al-Razzaq al-Hasani',
 'Kang Sun-nam',
 'Segun “Aeroland” Adewale',
 'Nhlanhla “Lux” Dlamini']

In [11]:
ARTICLE_REVISIONS

{'Shahjahan Noori': 1099689043,
 'Abdul Ghafar Lakanwal': 943562276,
 'Majah Ha Adrif': 852404094,
 'Haroon al-Afghani': 1095102390,
 'Tayyab Agha': 1104998382,
 'Ahmadullah Wasiq': 1109361754,
 'Aziza Ahmadyar': 1087211008,
 'Muqadasa Ahmadzai': 1082489593,
 'Mohammad Sarwar Ahmedzai': 1038918070,
 'Amir Muhammad Akhundzada': 1069322182,
 'Nasrullah Baryalai Arsalai': 1095526840,
 'Mohammad Asim Asim': 1013838830,
 'Atiqullah Atifmal': 1112407669,
 'Abdul Rahim Ayoubi': 1108886061,
 'Alhaj Mutalib Baig': 1111494041,
 'Ismael Balkhi': 1112534409,
 'Abdul Baqi Turkistani': 889226470,
 'Mohammad Ghous Bashiri': 1102150221,
 'Abas Basir': 1098419766,
 'Jan Baz': 997027082,
 'Ahmad Behzad': 1103948295,
 'Bashir Ahmad Bezan': 1060707209,
 'Rafiullah Bidar': 977208323,
 'Mohammad Siddiq Chakari': 1105913099,
 'Cheragh Ali Cheragh': 1087211968,
 'Nasir Ahmad Durrani': 988838315,
 'Elay Ershad': 1102489654,
 'Muhammad Hashim Esmatullahi': 949986748,
 'Ezatullah (Nangarhar)': 947885788,
 'Aimal

Now that we have the last revision ID for each article, we can request its score from the ORES API endpoint.

In [12]:
#########
#
#    CONSTANTS
#

# The current ORES API endpoint
API_ORES_SCORE_ENDPOINT = "https://ores.wikimedia.org/v3"
# A template for mapping to the URL
API_ORES_SCORE_PARAMS = "/scores/{context}/{revid}/{model}"

# Use some delays so that we do not hammer the API with our requests
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<uwnetid@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022'
}


# This template lists the basic parameters for making an ORES request
ORES_PARAMS_TEMPLATE = {
    "context": "enwiki",        # which WMF project for the specified revid
    "revid" : "",               # the revision to be scored - this will probably change each call
    "model": "articlequality"   # the AI/ML scoring model to apply to the revision
}
#
# The current ML models for English wikipedia are:
#   "articlequality"
#   "articletopic"
#   "damaging"
#   "version"
#   "draftquality"
#   "drafttopic"
#   "goodfaith"
#   "wp10"
#
# The specific documentation on these is scattered so if you want to use one you'll have to look around.
#

The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article revisions. Therefore, the main parameter is article_revid.

In [13]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, 
                                   endpoint_url = API_ORES_SCORE_ENDPOINT, 
                                   endpoint_params = API_ORES_SCORE_PARAMS, 
                                   request_template = ORES_PARAMS_TEMPLATE,
                                   headers = REQUEST_HEADERS,
                                   features=False):
    # Make sure we have an article revision id
    if not article_revid: return None
    
    # set the revision id into a copy of the template, so the shared dict is not mutated
    request_template = dict(request_template, revid=article_revid)
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_template)
    
    # the features used by the ML model can sometimes be returned as well as scores
    if features:
        request_url = request_url+"?features=true"
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
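The URL that the procedure requests is built by filling the path template with the context, revision ID and model. A minimal sketch of just that step follows; `build_ores_url` is a hypothetical helper name, and the revision ID is the one listed for 'Shahjahan Noori' in ARTICLE_REVISIONS.

```python
API_ORES_SCORE_ENDPOINT = "https://ores.wikimedia.org/v3"
API_ORES_SCORE_PARAMS = "/scores/{context}/{revid}/{model}"


def build_ores_url(revid, context="enwiki", model="articlequality", features=False):
    """Build the request URL without touching any shared template dict."""
    url = API_ORES_SCORE_ENDPOINT + API_ORES_SCORE_PARAMS.format(
        context=context, revid=revid, model=model
    )
    if features:
        # Ask for the model's input features in addition to the score.
        url += "?features=true"
    return url


print(build_ores_url(1099689043))
```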


We send one request per article and store the data obtained in a dictionary called ARTICLE_SCORES. Any article that does not have a score, or whose request resulted in an error, is stored in the list ARTICLE_SCORES_NULL.

In [14]:
ARTICLE_SCORES = {}
ARTICLE_SCORES_NULL = []
for article, last_rev_id in ARTICLE_REVISIONS.items():
    score = request_ores_score_per_article(last_rev_id)
    # Guard against a failed request (None) or a response without a 'score' entry
    result = (score or {}).get('enwiki', {}).get('scores', {}).get(str(last_rev_id), {})
    classification = result.get('articlequality', {}).get('score', {}).get('prediction')
    if classification is not None:
        ARTICLE_SCORES[article] = classification
    else:
        ARTICLE_SCORES_NULL.append(article)

In [15]:
ARTICLE_SCORES

{'Shahjahan Noori': 'GA',
 'Abdul Ghafar Lakanwal': 'Start',
 'Majah Ha Adrif': 'Start',
 'Haroon al-Afghani': 'B',
 'Tayyab Agha': 'Start',
 'Ahmadullah Wasiq': 'Start',
 'Aziza Ahmadyar': 'Start',
 'Muqadasa Ahmadzai': 'Start',
 'Mohammad Sarwar Ahmedzai': 'Start',
 'Amir Muhammad Akhundzada': 'Start',
 'Nasrullah Baryalai Arsalai': 'Start',
 'Mohammad Asim Asim': 'Stub',
 'Atiqullah Atifmal': 'Start',
 'Abdul Rahim Ayoubi': 'Start',
 'Alhaj Mutalib Baig': 'Start',
 'Ismael Balkhi': 'Start',
 'Abdul Baqi Turkistani': 'Stub',
 'Mohammad Ghous Bashiri': 'Start',
 'Abas Basir': 'C',
 'Jan Baz': 'Stub',
 'Ahmad Behzad': 'Start',
 'Bashir Ahmad Bezan': 'Start',
 'Rafiullah Bidar': 'Stub',
 'Mohammad Siddiq Chakari': 'Stub',
 'Cheragh Ali Cheragh': 'Start',
 'Nasir Ahmad Durrani': 'Stub',
 'Elay Ershad': 'Stub',
 'Muhammad Hashim Esmatullahi': 'Stub',
 'Ezatullah (Nangarhar)': 'Stub',
 'Aimal Faizi': 'Stub',
 'Mohammad Nasim Faqiri': 'Start',
 'Mohammad Nazar Faqiri': 'Stub',
 'Sharif Ghal

In [16]:
ARTICLE_SCORES_NULL

[]

We observe that all articles had a score!

# Step 3: Combining the Datasets

We first convert the article quality predictions and the revision IDs into DataFrames so that they are easier to handle.

In [17]:
article_quality_df = pd.DataFrame(list(ARTICLE_SCORES.items()),columns = ['name','article_quality'])
revision_id_df = pd.DataFrame(list(ARTICLE_REVISIONS.items()),columns = ['name','revision_id'])

We assign a region to each country in the country list (**Referenced in step 4**)

In [18]:
def populate_region(row):
    if row['Geography'].isupper():
        return row['Geography']
    else:
        return None
population['region'] = population.apply(populate_region,axis=1)
population.ffill(inplace=True)
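The same region assignment can also be written without a row-wise `apply`. The sketch below is an alternative under the same assumption that ALL-CAPS rows are aggregates; the table is a hypothetical excerpt and its figures are illustrative only.

```python
import pandas as pd

# Hypothetical excerpt with the same layout: ALL-CAPS rows are aggregates.
toy = pd.DataFrame({
    "Geography": ["AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt", "WESTERN AFRICA", "Benin"],
    "Population (millions)": [1419.0, 251.0, 44.9, 103.5, 430.0, 13.4],
})

is_region = toy["Geography"].str.isupper()
# Keep the name on aggregate rows, leave NaN elsewhere, then carry it downward.
toy["region"] = toy["Geography"].where(is_region).ffill()

print(toy)
```

Because each country row inherits the most recent aggregate above it, a country ends up in the lowest region of the hierarchy. Unlike `population.ffill(inplace=True)`, this variant forward-fills only the new column.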


We obtain the following region populated table

In [22]:
population

Unnamed: 0,Geography,Population (millions),region
0,WORLD,7963.0,WORLD
1,AFRICA,1419.0,AFRICA
2,NORTHERN AFRICA,251.0,NORTHERN AFRICA
3,Algeria,44.9,NORTHERN AFRICA
4,Egypt,103.5,NORTHERN AFRICA
...,...,...,...
228,Samoa,0.2,OCEANIA
229,Solomon Islands,0.7,OCEANIA
230,Tonga,0.1,OCEANIA
231,Tuvalu,0.0,OCEANIA


We now can merge the tables to obtain the desired final table and save the final table in a CSV file

In [23]:
#We first merge with politicians to have the country of each politician
article_quality_country_df = article_quality_df.merge(politicians[['name','country']], left_on='name', right_on='name')
#We merge with population to have the associated population size of the politicians country
wp_politicians_by_country = article_quality_country_df.merge(population, left_on='country', right_on='Geography').drop(columns='Geography')
#We merge with the revision_id_df to associate the revision_id for each article in the table
wp_politicians_by_country = wp_politicians_by_country.merge(revision_id_df, left_on='name', right_on='name')
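#We rename to have standard names in the saved table (columns= is required, otherwise rename targets the index);
#the in-memory table keeps its original column names because the later steps refer to them
wp_politicians_to_save = wp_politicians_by_country.rename(columns={'Population (millions)':'population','name':'article_title'})
#We save the final table to CSV format
wp_politicians_to_save.to_csv('../results/wp_politicians_by_country.csv', index=False)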

We notice that some countries have no match. We collect these in a separate list and save it as a text file at '../results/wp_countries-no_match.txt'.

In [24]:
#We create sets of countries in politicians and in populations that we will compare later
politicians_country_set = set(article_quality_country_df.country)
population_country_set = set(population.Geography)
#We create the lists where we will store the results
match = []
population_not_in_politicians = []
politicians_not_in_population = []
for country in population_country_set:
    if country in politicians_country_set:
        #they matched so no problem
        match.append(country)
    else:
        #we enter the case where it is in the population but not in the politicians set.
        #we check if it is a region
        if country.isupper():
            #not a country so we don't append
            continue
        else:
            population_not_in_politicians.append(country)
#we now check for the other case, namely, when it is in politicians but not in population set
for country in politicians_country_set:
    if country in population_country_set:
        continue
    else:
        politicians_not_in_population.append(country)

#We join the two lists to have a single file with all the mismatches
wp_countries_no_match = population_not_in_politicians + politicians_not_in_population
with open('../results/wp_countries-no_match.txt', 'w') as f:
    for country in wp_countries_no_match:
        f.write(country)
        f.write('\n')
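The two loops above amount to a symmetric difference between the two sets of names once the ALL-CAPS aggregates are set aside. A compact sketch with hypothetical miniature sets:

```python
# Hypothetical miniature versions of the two sets.
politicians_countries = {"Algeria", "Egypt", "Korean"}
population_geographies = {"WORLD", "AFRICA", "Algeria", "Egypt", "Canada"}

# Drop ALL-CAPS aggregates so that only countries are compared.
population_countries = {g for g in population_geographies if not g.isupper()}

# Symmetric difference: names present in exactly one of the two sets.
no_match = sorted(population_countries ^ politicians_countries)

print(no_match)
```

Sorting is optional; it only makes the output file stable between runs.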

We notice that the politicians data source is missing several important countries, such as the USA, Canada, the UK, New Zealand, Australia and the Philippines, among others. It also lacks data for some smaller countries and territories, such as New Caledonia, Puerto Rico and French Polynesia.

We also notice one record in the politicians data source that does not appear in the population data: 'Korean'. Looking at the articles case by case, we observe that most of these politicians are from older eras, when Korea was a single country. Given that the newer South and North Korea are both represented, and that we do not have the population of what was once Korea, these records have no population to match against and are left out of the merged table.

# Step 4: Analysis

In this analysis a country can only exist in one region. The population_by_country_2022.csv file represents regions in a hierarchical order, so each country is placed in the closest (lowest in the hierarchy) region.
**This has already been done in Step 3.**

We consider "high quality" articles to be those that ORES predicted to be in either the "FA" (featured article) or "GA" (good article) class. We create a function and apply it to the 'article_quality' column in order to obtain this dichotomous variable.


In [25]:
def populate_high_quality(row):
    if row['article_quality'] in ['FA', 'GA']:
        return 1
    else:
        return 0
wp_politicians_by_country['high_quality'] = wp_politicians_by_country.apply(populate_high_quality,axis=1)
wp_politicians_by_country

Unnamed: 0,name,article_quality,country,Population (millions),region,revision_id,high_quality
0,Shahjahan Noori,GA,Afghanistan,41.1,SOUTH ASIA,1099689043,1
1,Abdul Ghafar Lakanwal,Start,Afghanistan,41.1,SOUTH ASIA,943562276,0
2,Majah Ha Adrif,Start,Afghanistan,41.1,SOUTH ASIA,852404094,0
3,Haroon al-Afghani,B,Afghanistan,41.1,SOUTH ASIA,1095102390,0
4,Tayyab Agha,Start,Afghanistan,41.1,SOUTH ASIA,1104998382,0
...,...,...,...,...,...,...,...
7501,Rekayi Tangwena,Stub,Zimbabwe,16.3,EASTERN AFRICA,1073818982,0
7502,Josiah Tongogara,C,Zimbabwe,16.3,EASTERN AFRICA,1106932400,0
7503,Langton Towungana,Stub,Zimbabwe,16.3,EASTERN AFRICA,904246837,0
7504,Herbert Ushewokunze,Stub,Zimbabwe,16.3,EASTERN AFRICA,959111842,0
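The same 0/1 indicator can be obtained without a row-wise function by using `isin`. This is an alternative sketch on hypothetical rows, not a change to the cell above:

```python
import pandas as pd

# Hypothetical rows using the quality classes seen in this notebook.
toy = pd.DataFrame({"article_quality": ["GA", "Start", "FA", "Stub", "B", "C"]})

# isin gives a boolean Series; astype(int) turns it into the 0/1 indicator.
toy["high_quality"] = toy["article_quality"].isin(["FA", "GA"]).astype(int)

print(toy["high_quality"].tolist())
```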


In [26]:
wp_politicians_by_country[wp_politicians_by_country['Population (millions)'] == 0.0]

Unnamed: 0,name,article_quality,country,Population (millions),region,revision_id,high_quality
5265,Alfons Goop,Stub,Liechtenstein,0.0,WESTERN EUROPE,1030085950,0
5266,Peter Kaiser (historian),Start,Liechtenstein,0.0,WESTERN EUROPE,1086304418,0
5635,Jean-Charles Allavena,Stub,Monaco,0.0,WESTERN EUROPE,1003397193,0
5636,Laurent Anselmi,Stub,Monaco,0.0,WESTERN EUROPE,1115213350,0
5637,Camille Blanc,Start,Monaco,0.0,WESTERN EUROPE,1100905415,0
5638,Robert Calcagno,C,Monaco,0.0,WESTERN EUROPE,1030845142,0
5639,Jean Castellini,Stub,Monaco,0.0,WESTERN EUROPE,1084495128,0
5640,Paul Demange (politician),Stub,Monaco,0.0,WESTERN EUROPE,1079067018,0
5641,Catherine Fautrier,Stub,Monaco,0.0,WESTERN EUROPE,1106302712,0
5642,Paul Masseron,Start,Monaco,0.0,WESTERN EUROPE,1089373743,0


We observe that some populations are 0. Dividing by these populations to compute the per capita figures would mean dividing by 0, so we remove these countries from the analysis and results.

In [27]:
wp_politicians_by_country = wp_politicians_by_country[wp_politicians_by_country['Population (millions)'] != 0.0]
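A different way to avoid the division by zero, shown here only as a sketch and not as the approach taken in this notebook, is to treat a population of 0.0 as unknown so that the ratio becomes NaN instead of infinite. The article counts below are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical per-country counts; populations are in millions.
toy = pd.DataFrame({
    "country": ["Monaco", "Andorra", "Algeria"],
    "number_of_articles": [8, 10, 34],
    "Population (millions)": [0.0, 0.1, 44.9],
})

# Treat a population recorded as 0.0 as missing rather than as a true zero.
pop = toy["Population (millions)"].replace(0.0, np.nan)
toy["total_articles_per_population"] = toy["number_of_articles"] / pop

print(toy)
```

Rows with a NaN ratio are then ignored by `sort_values` rankings unless they are explicitly kept, so the effect on the top and bottom lists is similar to removing them.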

# Step 5: Results

We first prepare the tables from which the results below are obtained.

## total-articles-per-population (a ratio representing the number of articles per million people, since population is given in millions)

In [28]:
#Country-by-country
#We groupby country, counting the number of articles per country in column "number_of_articles"
count_by_country = wp_politicians_by_country.groupby('country').count()['name'].reset_index().rename({"name":"number_of_articles"}, axis=1)
#We merge with population table to obtain the population data for each country
total_articles_per_population_country = count_by_country.merge(population, how='left', left_on= 'country', right_on = 'Geography')
#We create a new column with the total_articles_per_population which is the division of number of articles by population. 
total_articles_per_population_country['total_articles_per_population'] = total_articles_per_population_country['number_of_articles']/total_articles_per_population_country['Population (millions)']
total_articles_per_population_country


Unnamed: 0,country,number_of_articles,Geography,Population (millions),region,total_articles_per_population
0,Afghanistan,118,Afghanistan,41.1,SOUTH ASIA,2.871046
1,Albania,83,Albania,2.8,SOUTHERN EUROPE,29.642857
2,Algeria,34,Algeria,44.9,NORTHERN AFRICA,0.757238
3,Andorra,10,Andorra,0.1,SOUTHERN EUROPE,100.000000
4,Angola,42,Angola,35.6,MIDDLE AFRICA,1.179775
...,...,...,...,...,...,...
173,Venezuela,62,Venezuela,28.3,SOUTH AMERICA,2.190813
174,Vietnam,27,Vietnam,99.4,SOUTHEAST ASIA,0.271630
175,Yemen,61,Yemen,33.7,WESTERN ASIA,1.810089
176,Zambia,13,Zambia,20.0,EASTERN AFRICA,0.650000


In [29]:
#Regional
#We groupby region, counting the number of articles per region in column "number_of_articles"
regional = wp_politicians_by_country.groupby('region').count()['name'].reset_index().rename({"name":"number_of_articles"}, axis=1)
#We merge with population table to obtain the population data for each region
total_articles_per_population_region = regional.merge(population[['Geography','Population (millions)']], how='left', left_on= 'region', right_on = 'Geography')
#We create a new column with the total_articles_per_population which is the division of number of articles by population. 
total_articles_per_population_region['total_articles_per_population'] = total_articles_per_population_region['number_of_articles']/total_articles_per_population_region['Population (millions)']
total_articles_per_population_region


Unnamed: 0,region,number_of_articles,Geography,Population (millions),total_articles_per_population
0,CARIBBEAN,201,CARIBBEAN,44.0,4.568182
1,CENTRAL AMERICA,195,CENTRAL AMERICA,178.0,1.095506
2,CENTRAL ASIA,106,CENTRAL ASIA,78.0,1.358974
3,EAST ASIA,245,EAST ASIA,1674.0,0.146356
4,EASTERN AFRICA,648,EASTERN AFRICA,473.0,1.369979
5,EASTERN EUROPE,736,EASTERN EUROPE,287.0,2.56446
6,MIDDLE AFRICA,203,MIDDLE AFRICA,196.0,1.035714
7,NORTHERN AFRICA,227,NORTHERN AFRICA,251.0,0.904382
8,NORTHERN EUROPE,262,NORTHERN EUROPE,107.0,2.448598
9,OCEANIA,72,OCEANIA,44.0,1.636364


## high-quality-articles-per-population (a ratio representing the number of high quality articles per million people, since population is given in millions)

In [30]:
#Country-by-country
#We groupby country, summing the number of articles that are classified as high quality per country in column "number_of_high_quality"
#(the column is selected before summing so that non-numeric columns are not aggregated)
count_by_country_hq = wp_politicians_by_country.groupby('country')['high_quality'].sum().reset_index().rename({"high_quality":"number_of_high_quality"}, axis=1)
#We merge with population table to obtain the population data for each country
hq_articles_per_population_country = count_by_country_hq.merge(population, how='left', left_on= 'country', right_on = 'Geography')
#We create a new column with the hq_articles_per_population which is the division of number of high quality articles by population.
#The population comes from this same table, so the division does not depend on another table's row order
hq_articles_per_population_country['hq_articles_per_population'] = hq_articles_per_population_country["number_of_high_quality"]/hq_articles_per_population_country['Population (millions)']
hq_articles_per_population_country

Unnamed: 0,country,number_of_high_quality,Geography,Population (millions),region,hq_articles_per_population
0,Afghanistan,6,Afghanistan,41.1,SOUTH ASIA,0.145985
1,Albania,6,Albania,2.8,SOUTHERN EUROPE,2.142857
2,Algeria,0,Algeria,44.9,NORTHERN AFRICA,0.000000
3,Andorra,2,Andorra,0.1,SOUTHERN EUROPE,20.000000
4,Angola,0,Angola,35.6,MIDDLE AFRICA,0.000000
...,...,...,...,...,...,...
173,Venezuela,0,Venezuela,28.3,SOUTH AMERICA,0.000000
174,Vietnam,2,Vietnam,99.4,SOUTHEAST ASIA,0.020121
175,Yemen,2,Yemen,33.7,WESTERN ASIA,0.059347
176,Zambia,0,Zambia,20.0,EASTERN AFRICA,0.000000


In [31]:
#Regional
#We groupby region, summing the number of articles that are classified as high quality per region in column "number_of_high_quality"
regional_hq = wp_politicians_by_country.groupby('region')['high_quality'].sum().reset_index().rename({"high_quality":"number_of_high_quality"}, axis=1)
#We merge with population table to obtain the population data for each region
hq_articles_per_population_region = regional_hq.merge(population[['Geography','Population (millions)']], how='left', left_on= 'region', right_on = 'Geography')
#We create a new column with the hq_articles_per_population which is the division of number of high quality articles by population. 
hq_articles_per_population_region['hq_articles_per_population'] = hq_articles_per_population_region['number_of_high_quality']/hq_articles_per_population_region['Population (millions)']
hq_articles_per_population_region


Unnamed: 0,region,number_of_high_quality,Geography,Population (millions),hq_articles_per_population
0,CARIBBEAN,8,CARIBBEAN,44.0,0.181818
1,CENTRAL AMERICA,10,CENTRAL AMERICA,178.0,0.05618
2,CENTRAL ASIA,3,CENTRAL ASIA,78.0,0.038462
3,EAST ASIA,16,EAST ASIA,1674.0,0.009558
4,EASTERN AFRICA,15,EASTERN AFRICA,473.0,0.031712
5,EASTERN EUROPE,39,EASTERN EUROPE,287.0,0.135889
6,MIDDLE AFRICA,5,MIDDLE AFRICA,196.0,0.02551
7,NORTHERN AFRICA,6,NORTHERN AFRICA,251.0,0.023904
8,NORTHERN EUROPE,8,NORTHERN EUROPE,107.0,0.074766
9,OCEANIA,1,OCEANIA,44.0,0.022727
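The article count and the high quality count can also be computed in a single pass with named aggregation, which avoids merging back with the population table. The sketch below uses hypothetical rows in the shape of wp_politicians_by_country; `summary` and the `population` column name are illustrative choices.

```python
import pandas as pd

# Hypothetical rows; populations are in millions.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "country": ["Albania", "Albania", "Algeria", "Algeria", "Algeria"],
    "Population (millions)": [2.8, 2.8, 44.9, 44.9, 44.9],
    "high_quality": [1, 0, 0, 0, 0],
})

summary = (
    toy.groupby("country")
       .agg(
           number_of_articles=("name", "count"),
           number_of_high_quality=("high_quality", "sum"),
           # The population is constant within a country, so "first" is enough.
           population=("Population (millions)", "first"),
       )
       .reset_index()
)
summary["total_articles_per_population"] = summary["number_of_articles"] / summary["population"]
summary["hq_articles_per_population"] = summary["number_of_high_quality"] / summary["population"]

print(summary)
```

This works for the country level only; regional populations come from the aggregate rows of the population table, so the regional tables still need the merge.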


### 1. Top 10 countries by coverage: The 10 countries with the highest total articles per capita (in descending order).

In [32]:
total_articles_per_population_country.sort_values(by=['total_articles_per_population'], ascending=False)[:10]

Unnamed: 0,country,number_of_articles,Geography,Population (millions),region,total_articles_per_population
5,Antigua and Barbuda,17,Antigua and Barbuda,0.1,CARIBBEAN,170.0
54,Federated States of Micronesia,13,Federated States of Micronesia,0.1,OCEANIA,130.0
3,Andorra,10,Andorra,0.1,SOUTHERN EUROPE,100.0
13,Barbados,28,Barbados,0.3,CARIBBEAN,93.333333
103,Marshall Islands,9,Marshall Islands,0.1,OCEANIA,90.0
108,Montenegro,36,Montenegro,0.6,SOUTHERN EUROPE,60.0
138,Seychelles,6,Seychelles,0.1,EASTERN AFRICA,60.0
96,Luxembourg,37,Luxembourg,0.7,WESTERN EUROPE,52.857143
18,Bhutan,41,Bhutan,0.8,SOUTH ASIA,51.25
64,Grenada,5,Grenada,0.1,CARIBBEAN,50.0


### 2. Bottom 10 countries by coverage: The 10 countries with the lowest total articles per capita (in ascending order).

In [33]:
total_articles_per_population_country.sort_values(by=['total_articles_per_population'])[:10]

Unnamed: 0,country,number_of_articles,Geography,Population (millions),region,total_articles_per_population
32,China,2,China,1436.6,EAST ASIA,0.001392
105,Mexico,1,Mexico,127.5,CENTRAL AMERICA,0.007843
135,Saudi Arabia,3,Saudi Arabia,36.7,WESTERN ASIA,0.081744
130,Romania,2,Romania,19.0,EASTERN EUROPE,0.105263
73,India,178,India,1417.2,SOUTH ASIA,0.1256
148,Sri Lanka,3,Sri Lanka,22.4,SOUTH ASIA,0.133929
48,Egypt,14,Egypt,103.5,NORTHERN AFRICA,0.135266
53,Ethiopia,25,Ethiopia,123.4,EASTERN AFRICA,0.202593
156,Taiwan,5,Taiwan,23.2,EAST ASIA,0.215517
174,Vietnam,27,Vietnam,99.4,SOUTHEAST ASIA,0.27163


### 3. Top 10 countries by high quality: The 10 countries with the highest high quality articles per capita (in descending order).

In [34]:
hq_articles_per_population_country.sort_values(by=['hq_articles_per_population'], ascending=False)[:10]

Unnamed: 0,country,number_of_high_quality,Geography,Population (millions),region,hq_articles_per_population
3,Andorra,2,Andorra,0.1,SOUTHERN EUROPE,20.0
108,Montenegro,3,Montenegro,0.6,SOUTHERN EUROPE,5.0
1,Albania,6,Albania,2.8,SOUTHERN EUROPE,2.142857
152,Suriname,1,Suriname,0.6,SOUTH AMERICA,1.666667
20,Bosnia-Herzegovina,5,Bosnia-Herzegovina,3.4,SOUTHERN EUROPE,1.470588
95,Lithuania,3,Lithuania,2.8,NORTHERN EUROPE,1.071429
39,Croatia,4,Croatia,3.8,SOUTHERN EUROPE,1.052632
142,Slovenia,2,Slovenia,2.1,SOUTHERN EUROPE,0.952381
122,Palestinian Territory,5,Palestinian Territory,5.4,WESTERN ASIA,0.925926
58,Gabon,2,Gabon,2.4,MIDDLE AFRICA,0.833333


### 4. Bottom 10 countries by high quality: The 10 countries with the lowest high quality articles per capita (in ascending order).

In [35]:
hq_articles_per_population_country.sort_values(by=['hq_articles_per_population'])[:10]

Unnamed: 0,country,number_of_high_quality,Geography,Population (millions),region,hq_articles_per_population
177,Zimbabwe,0,Zimbabwe,16.3,EASTERN AFRICA,0.0
176,Zambia,0,Zambia,20.0,EASTERN AFRICA,0.0
87,Kuwait,0,Kuwait,4.1,WESTERN ASIA,0.0
148,Sri Lanka,0,Sri Lanka,22.4,SOUTH ASIA,0.0
149,St. Kitts-Nevis,0,St. Kitts-Nevis,0.1,CARIBBEAN,0.0
79,Jamaica,0,Jamaica,2.8,CARIBBEAN,0.0
78,Italy,0,Italy,58.9,SOUTHERN EUROPE,0.0
77,Israel,0,Israel,9.5,WESTERN ASIA,0.0
72,Iceland,0,Iceland,0.4,NORTHERN EUROPE,0.0
150,St. Vincent and the Grenadines,0,St. Vincent and the Grenadines,0.1,CARIBBEAN,0.0


### 5. Geographic regions by total coverage: A rank ordered list of geographic regions (in descending order) by total articles per capita.

In [36]:
total_articles_per_population_region.sort_values(by=['total_articles_per_population'], ascending=False)[:10]

Unnamed: 0,region,number_of_articles,Geography,Population (millions),total_articles_per_population
14,SOUTHERN EUROPE,888,SOUTHERN EUROPE,151.0,5.880795
0,CARIBBEAN,201,CARIBBEAN,44.0,4.568182
17,WESTERN EUROPE,684,WESTERN EUROPE,197.0,3.472081
5,EASTERN EUROPE,736,EASTERN EUROPE,287.0,2.56446
8,NORTHERN EUROPE,262,NORTHERN EUROPE,107.0,2.448598
16,WESTERN ASIA,686,WESTERN ASIA,294.0,2.333333
13,SOUTHERN AFRICA,117,SOUTHERN AFRICA,69.0,1.695652
9,OCEANIA,72,OCEANIA,44.0,1.636364
4,EASTERN AFRICA,648,EASTERN AFRICA,473.0,1.369979
2,CENTRAL ASIA,106,CENTRAL ASIA,78.0,1.358974


### 6. Geographic regions by high quality coverage: Rank ordered list of geographic regions (in descending order) by high quality articles per capita.

In [37]:
hq_articles_per_population_region.sort_values(by=['hq_articles_per_population'], ascending=False)[:10]

Unnamed: 0,region,number_of_high_quality,Geography,Population (millions),hq_articles_per_population
14,SOUTHERN EUROPE,46,SOUTHERN EUROPE,151.0,0.304636
0,CARIBBEAN,8,CARIBBEAN,44.0,0.181818
5,EASTERN EUROPE,39,EASTERN EUROPE,287.0,0.135889
17,WESTERN EUROPE,22,WESTERN EUROPE,197.0,0.111675
16,WESTERN ASIA,28,WESTERN ASIA,294.0,0.095238
8,NORTHERN EUROPE,8,NORTHERN EUROPE,107.0,0.074766
13,SOUTHERN AFRICA,4,SOUTHERN AFRICA,69.0,0.057971
1,CENTRAL AMERICA,10,CENTRAL AMERICA,178.0,0.05618
2,CENTRAL ASIA,3,CENTRAL ASIA,78.0,0.038462
12,SOUTHEAST ASIA,24,SOUTHEAST ASIA,676.0,0.035503
