# Article Page Info MediaWiki API Example
This example illustrates how to access page info data using the [MediaWiki Action API for English Wikipedia](https://www.mediawiki.org/wiki/API:Main_page). It shows how to request summary 'page info' for a single article page. The API documentation, [API:Info](https://www.mediawiki.org/wiki/API:Info), covers additional details that may be helpful when trying to use or understand this example.

## License
This code example was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.2 - September 16, 2024



### Libraries

In [1]:
# 
# These are standard python modules
import json, time, urllib.parse
#
# The 'requests' and 'pandas' modules are not standard Python modules. You will need to install them with pip/pip3 if you do not already have them
import requests
import pandas as pd

### Step 0: Define configurations

Wiki configurations

In [28]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
API_HEADER_AGENT = 'User-Agent'

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
email_address = "nguyenbh@uw.edu"
REQUEST_HEADERS = {
    'User-Agent': f'<{email_address}>, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

# This is a string of additional page properties that can be returned. See the Info documentation for
# what can be included. If you don't want any, this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this request
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}
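To make the template concrete, here is a minimal sketch of how a filled-in copy of these parameters turns into a request URL. `requests` performs an equivalent encoding itself when given `params=`; `urllib.parse.urlencode` is used here only to make the result visible without a network call. The helper name `build_pageinfo_url` and the sample title are illustrative assumptions, not part of the original notebook.

```python
import urllib.parse

API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",
    "prop": "info",
    "inprop": "talkid|url|watched|watchers",
}

def build_pageinfo_url(title, template=PAGEINFO_PARAMS_TEMPLATE):
    """Hypothetical helper: return the full GET URL for one article title."""
    params = dict(template)      # copy so the shared template stays unchanged
    params["titles"] = title
    return API_ENWIKIPEDIA_ENDPOINT + "?" + urllib.parse.urlencode(params)

url = build_pageinfo_url("Tayyab Agha")
print(url)
```

Copying the template before filling it in keeps one call from leaking its title into the next.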


ORES Configurations

In [16]:
#########
#
#    CONSTANTS
#

#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = ((60.0*60.0)/5000.0)-API_LATENCY_ASSUMED  # The key authorizes 5000 requests per hour

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#    
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': f"<{email_address}>, University of Washington, MSDS DATA 512 - AUTUMN 2024",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "",         # your email address should go here
    'access_token'  : ""          # the access token you create will need to go here
}

#
#    A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
#
# ARTICLE_REVISIONS = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }

#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # must be English - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#
#    These are used later - defined here so they, at least, have empty values
#
USERNAME = ""
ACCESS_TOKEN = ""
#
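The two throttle constants follow the same arithmetic: the time budget per request, minus the assumed latency. The sketch below works through both. The function name `throttle_wait` and the clamp to zero are assumptions added for illustration. Note that both configuration cells assign `API_THROTTLE_WAIT`, and both request functions read that name when they are called, so if the cells are run in order the later (ORES) value applies to both.

```python
def throttle_wait(max_requests, per_seconds, assumed_latency=0.002):
    """Seconds to sleep before each request to stay under a rate limit."""
    # Clamp at zero so a very generous limit never yields a negative sleep
    return max(0.0, per_seconds / max_requests - assumed_latency)

wiki_wait = throttle_wait(100, 1.0)            # 100 requests per second
ores_wait = throttle_wait(5000, 60.0 * 60.0)   # 5000 requests per hour
print(f"{wiki_wait:.3f} {ores_wait:.3f}")
```

Giving the two waits distinct names, as here, would be one way to keep the Wikipedia requests on their shorter delay.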

Wikipedia username & Access Token

In [17]:
from apikeys.KeyManager import KeyManager
keyman = KeyManager()

#
#   This is my Wikipedia/Wikimedia username. They suggest you request your keys using your Wikipedia username, so I
#   also stored the API key using my Wikipedia username.
#
#   You should probably use your own username here.
USERNAME = "nguyenbhuw"
key_info = keyman.findRecord(USERNAME)
ACCESS_TOKEN = key_info[0]['access_token']
print(key_info[0]['description'])

A key for the wikimedia API


### Step 1: Define function to get page information for a given Wikipedia article

In [3]:
#########
#
#    PROCEDURES/FUNCTIONS
#


def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    """
    Requests page information for a given Wikipedia article.
    Parameters:
    -----------
    article_title : str, optional
        The title of the Wikipedia article to request information for. If not provided, 
        the title must be included in the request_template.
    endpoint_url : str, optional
        The URL endpoint for the Wikipedia API. Defaults to API_ENWIKIPEDIA_ENDPOINT.
    request_template : dict, optional
        The template dictionary containing request parameters. Defaults to PAGEINFO_PARAMS_TEMPLATE.
    headers : dict, optional
        The headers to include in the request. Defaults to REQUEST_HEADERS.
    Returns:
    --------
    dict or None
        The JSON response from the Wikipedia API if the request is successful, otherwise None.
    Raises:
    -------
    Exception
        If no article title is provided or included in the request_template.
        If the headers do not include the required API_HEADER_AGENT field.
        If the API_HEADER_AGENT field does not contain a valid UW email address.
    """
    
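    # article title can be passed as a parameter to the call or preset in the request_template
    # (work on a copy so the shared default template is not modified between calls)
    request_template = dict(request_template)
    if article_title:
        request_template['titles'] = article_title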

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    if API_HEADER_AGENT not in headers:
        raise Exception(f"The header data should include a '{API_HEADER_AGENT}' field that contains your UW email address.")

    if 'uwnetid@uw' in headers[API_HEADER_AGENT]:
        raise Exception(f"Use your UW email address in the '{API_HEADER_AGENT}' field.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
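The dictionary returned by this function still has to be unpacked. The following is a minimal sketch, assuming the usual `prop=info` layout in which `query.pages` is keyed by page ID and each page entry carries `pageid`, `lastrevid`, and (when requested through `inprop`) `talkid`. A missing page is keyed `-1` and flagged with `missing`. The helper name `extract_page_ids` and the sample responses are hypothetical and hand-written; the ID values are made up.

```python
def extract_page_ids(pageinfo_response):
    """Return (page_id, last_rev_id, talk_id) for the first page in a response.

    All three are None when the API reports the page as missing.
    """
    pages = pageinfo_response["query"]["pages"]
    page = next(iter(pages.values()))
    if "missing" in page:
        return (None, None, None)
    return (page.get("pageid"), page.get("lastrevid"), page.get("talkid"))

# Hand-written sample responses; the ID values are made up for illustration
sample_found = {"query": {"pages": {"1234": {
    "pageid": 1234, "ns": 0, "title": "Example article",
    "lastrevid": 987654321, "talkid": 1235}}}}
sample_missing = {"query": {"pages": {"-1": {
    "ns": 0, "title": "No such article", "missing": ""}}}}

print(extract_page_ids(sample_found))
print(extract_page_ids(sample_missing))
```

Reading `lastrevid` explicitly keeps the page ID and the revision ID from being confused, since only the page ID appears as the dictionary key.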


### Step 2: Define a function to make the ORES API request

The API request is made through a function that encapsulates the call and makes it reusable in other notebooks. The function is parameterized and relies on the constants above for its defaults. The primary assumption is that it will be used to request data for a set of article revisions, so the main parameter is `article_revid`: pass in a new article revision ID on each call and get back a Python dictionary as the result. A valid result contains the probabilities that the specific revision belongs to each of six article quality levels. Generally, the quality level with the highest probability is taken as the quality level for the article. This can be tricky when two (or more) quality levels are highly probable.

In [8]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT, 
                                   model_name = API_ORES_EN_QUALITY_MODEL, 
                                   request_data = ORES_REQUEST_DATA_TEMPLATE, 
                                   header_format = REQUEST_HEADER_TEMPLATE, 
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):
    
    """
    Requests an ORES score for a given article revision ID.

    Parameters:
    -----------
    article_revid : int, optional
        The revision ID of the article to be scored. Default is None.
    email_address : str, optional
        The email address to be included in the request header. Default is None.
    access_token : str, optional
        The access token to be included in the request header. Default is None.
    endpoint_url : str, optional
        The endpoint URL for the ORES API. Default is API_ORES_LIFTWING_ENDPOINT.
    model_name : str, optional
        The model name to be used for scoring. Default is API_ORES_EN_QUALITY_MODEL.
    request_data : dict, optional
        The template for the request data. Default is ORES_REQUEST_DATA_TEMPLATE.
    header_format : dict, optional
        The template for the request header format. Default is REQUEST_HEADER_TEMPLATE.
    header_params : dict, optional
        The template for the request header parameters. Default is REQUEST_HEADER_PARAMS_TEMPLATE.
    
    Returns:
    --------
    dict or None
        The JSON response from the ORES API if the request is successful, otherwise None.
   
    Raises:
    -------
    Exception
        Raised if any of the required values 
        (article_revid, email_address, access_token) is not provided.
    """

    
    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token
    
    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")
    
    # Create the request URL with the specified model parameter - default is an article quality score request
    request_url = endpoint_url.format(model_name=model_name)
    
    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
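To illustrate the "two highly probable levels" case, the sketch below pulls the prediction out of a response and also reports the runner-up class and the probability margin between the top two. It assumes the response nesting that this notebook reads later (`enwiki` → `scores` → revision → `articlequality` → `score`), plus a `probability` dictionary alongside `prediction`. The helper name `summarize_quality` and the sample values are hypothetical.

```python
def summarize_quality(score_info, model="articlequality"):
    """Return (prediction, runner_up, margin) from a LiftWing-style response."""
    # Only one revision is scored per request, so take the first entry
    rev_scores = next(iter(score_info["enwiki"]["scores"].values()))
    score = rev_scores[model]["score"]
    ranked = sorted(score["probability"].items(),
                    key=lambda kv: kv[1], reverse=True)
    (_, p_best), (second, p_second) = ranked[0], ranked[1]
    return score["prediction"], second, p_best - p_second

# Hand-written sample; the revision ID and probabilities are made up
sample = {"enwiki": {"scores": {"12345": {"articlequality": {"score": {
    "prediction": "Start",
    "probability": {"B": 0.10, "C": 0.30, "FA": 0.01,
                    "GA": 0.02, "Start": 0.35, "Stub": 0.22}}}}}}}

prediction, runner_up, margin = summarize_quality(sample)
print(prediction, runner_up, round(margin, 2))
```

A small margin, as in this sample, is a signal that the single predicted label should be treated with some caution.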


### Read in data

Read in the politician and population data. The population data contains both region names and country names, arranged in a hierarchy: region rows (in all uppercase) are followed by the rows of the countries they contain. Based on this hierarchy, create a new column called `region` and fill in the region for each country (use the lowest region in the hierarchy).
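The forward-fill idea can be seen on a toy table. This is a minimal sketch with made-up numbers, not the real population file. It selects country rows with `~str.isupper()`, which is a substitution for the `islower() | istitle()` test used in the cell below; unlike that test, it also keeps mixed-case names that are neither lowercase nor title case (for example, a name containing an all-caps abbreviation).

```python
import pandas as pd

# Toy table in the same shape as the population file; the numbers are made up
toy = pd.DataFrame({
    "geography": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt",
                  "WESTERN AFRICA", "Benin"],
    "population": [100.0, 50.0, 30.0, 10.0, 20.0, 20.0, 20.0],
})

is_region = toy["geography"].str.isupper()
# Keep the name only on region rows, then carry it down to the rows below
toy["region"] = toy["geography"].where(is_region).ffill()
# Country rows are the ones that are not all uppercase
toy_countries = toy[~is_region].reset_index(drop=True)
print(toy_countries)
```

Because the fill carries the most recent uppercase name downward, each country picks up the lowest region listed above it.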

In [126]:
politicians_data_path = 'raw/politicians_by_country_AUG.2024.csv'
populations_data_path = 'raw/population_by_country_AUG.2024.csv'
output_path = 'data/wp_politicians_by_country.csv' # Output file path
error_path = 'data/wp_errors.txt' # Error log file path

politician_df = pd.read_csv(politicians_data_path)
population_df = pd.read_csv(populations_data_path)
population_df.columns = population_df.columns.str.lower()

# Process the population data
population_df['region'] = population_df['geography'].where(population_df['geography'].str.isupper())
# Forward-fill the region values down the DataFrame
population_df['region'] = population_df['region'].ffill()

# Filter out rows where the Geography column is a region (i.e., in all uppercase)
country_df = population_df[population_df['geography'].str.islower() | population_df['geography'].str.istitle()]
country_df

print(politician_df.head(10).to_string())
print(population_df.head(10).to_string())
print(country_df.tail(10).to_string())

                         name                                                       url      country
0              Majah Ha Adrif              https://en.wikipedia.org/wiki/Majah_Ha_Adrif  Afghanistan
1           Haroon al-Afghani           https://en.wikipedia.org/wiki/Haroon_al-Afghani  Afghanistan
2                 Tayyab Agha                 https://en.wikipedia.org/wiki/Tayyab_Agha  Afghanistan
3        Khadija Zahra Ahmadi        https://en.wikipedia.org/wiki/Khadija_Zahra_Ahmadi  Afghanistan
4              Aziza Ahmadyar              https://en.wikipedia.org/wiki/Aziza_Ahmadyar  Afghanistan
5           Muqadasa Ahmadzai           https://en.wikipedia.org/wiki/Muqadasa_Ahmadzai  Afghanistan
6    Mohammad Sarwar Ahmedzai    https://en.wikipedia.org/wiki/Mohammad_Sarwar_Ahmedzai  Afghanistan
7    Amir Muhammad Akhundzada    https://en.wikipedia.org/wiki/Amir_Muhammad_Akhundzada  Afghanistan
8  Nasrullah Baryalai Arsalai  https://en.wikipedia.org/wiki/Nasrullah_Baryalai_Arsalai  Af

### Step 3: Combining the Datasets

Check the dataset to see which countries appear in one csv file but not the other. Save non-matching countries to a txt file.

For matching countries, join the politician & population data together by country name.
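The non-matching check amounts to a symmetric difference between two sets of names. A minimal sketch with made-up names follows. The `^` operator on sets is equivalent to the union of the two one-way differences used in the cell below. Sorting the result is an addition here, so that the saved file has a stable order. The helper name `unmatched_names` is hypothetical.

```python
def unmatched_names(left, right):
    """Names that appear in exactly one of the two collections, sorted."""
    return sorted(set(left) ^ set(right))

# Made-up names standing in for the two country columns
population_names = ["Algeria", "Benin", "North Korea"]
politician_names = ["Algeria", "Benin", "Korea, North"]

no_match = unmatched_names(population_names, politician_names)
print(no_match)
```

Spelling variants like the pair above show up on both sides of the difference, which is why the count of non-matching names can exceed the number of countries actually lost in the join.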

In [128]:
# Find the non-matching countries
popul_countries = set(country_df['geography'])
poli_countries = set(politician_df['country'])
difference = (popul_countries - poli_countries).union(poli_countries - popul_countries)
# Save to txt file
no_match_path = 'data/wp_countries-no_match.txt'
with open(no_match_path, 'w') as f:
    for item in difference:
        f.write("%s\n" % item)

print(f"Number of non-matching countries: {len(difference)}")

Number of non-matching countries: 48


In [130]:
# Merge the dataframes on the country and geography columns
merged_df = pd.merge(politician_df, country_df, left_on='country', right_on='geography', how='inner')
# delete the 'Geography' column
merged_df.drop(columns=['geography'], inplace=True)
# Rename the 'name' column to 'article_title'
merged_df.rename(columns={'name': 'article_title'}, inplace=True)
merged_df


| | article_title | url | country | population | region |
|---|---|---|---|---|---|
| 0 | Majah Ha Adrif | https://en.wikipedia.org/wiki/Majah_Ha_Adrif | Afghanistan | 42.4 | SOUTH ASIA |
| 1 | Haroon al-Afghani | https://en.wikipedia.org/wiki/Haroon_al-Afghani | Afghanistan | 42.4 | SOUTH ASIA |
| 2 | Tayyab Agha | https://en.wikipedia.org/wiki/Tayyab_Agha | Afghanistan | 42.4 | SOUTH ASIA |
| 3 | Khadija Zahra Ahmadi | https://en.wikipedia.org/wiki/Khadija_Zahra_Ah... | Afghanistan | 42.4 | SOUTH ASIA |
| 4 | Aziza Ahmadyar | https://en.wikipedia.org/wiki/Aziza_Ahmadyar | Afghanistan | 42.4 | SOUTH ASIA |
| ... | ... | ... | ... | ... | ... |
| 6873 | Josiah Tongogara | https://en.wikipedia.org/wiki/Josiah_Tongogara | Zimbabwe | 16.7 | EASTERN AFRICA |
| 6874 | Langton Towungana | https://en.wikipedia.org/wiki/Langton_Towungana | Zimbabwe | 16.7 | EASTERN AFRICA |
| 6875 | Sengezo Tshabangu | https://en.wikipedia.org/wiki/Sengezo_Tshabangu | Zimbabwe | 16.7 | EASTERN AFRICA |
| 6876 | Herbert Ushewokunze | https://en.wikipedia.org/wiki/Herbert_Ushewokunze | Zimbabwe | 16.7 | EASTERN AFRICA |


Iterate through all politicians in the politician data that have matching countries in the population data. For each politician, make a request to the Wikipedia API to get the page info and the **revision id**. Use this revision id to make a request to the ORES API to get the **score of the article quality**. Add these two fields to the combined dataset from step 3, and save it as a CSV file.
During this process, record errors in a text file.

In [196]:
errors = []
scores = []
revision_ids = []

for i, row in merged_df.iterrows():
    print(i)
    article = row['article_title']
    try:
        info = request_pageinfo_per_article(article)
    except Exception as e:
        print(e)
        errors.append((i,e))
        scores.append(None)
        revision_ids.append(None)
        # raise Exception(e)
        continue
    try:
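        info_json = json.dumps(info['query']['pages'],indent=4)
        # NOTE: the keys of 'pages' are page IDs (-1 when the page is missing); the API reports
        # the latest revision ID separately in each page's 'lastrevid' field
        revid = int(list(info['query']['pages'].keys())[0])
        talk_id = int(list(info['query']['pages'].values())[0]['talkid'])
        revision_ids.append(revid)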
    except Exception as e:
        print(e)
        errors.append((i,e))
        scores.append(None)
        revision_ids.append(None)
        continue
    try:
        score_info = request_ores_score_per_article(article_revid=revid,
                                       email_address=email_address,
                                       access_token=ACCESS_TOKEN)
        if 'error' in score_info:
            # Retry using the talk page ID in place of the revision ID
            score_info = request_ores_score_per_article(article_revid=talk_id,
                                       email_address=email_address,
                                       access_token=ACCESS_TOKEN)
        if 'error' in score_info:
            # Both attempts failed: log the error message returned by the API
            print(score_info['error'])
            errors.append((i,score_info['error']))
            score = None
        else:
            score =  list(score_info['enwiki']['scores'].values())[0]['articlequality']['score']['prediction']
        scores.append(score)
    except Exception as e:
        print(e)
        # keep the (index, error) shape so the error log can unpack every entry
        errors.append((i,e))
        scores.append(None)
    
merged_df['revision_id'] = revision_ids
merged_df['article_quality'] = scores

if len(errors) > 0:
    error_df = merged_df[merged_df['article_quality'].isna()]
    print(f"Total errors: {len(error_df.index)}. Error rate: {len(error_df.index)/len(merged_df)*100:.2f}%")
non_revid_idx = [x for x in revision_ids if x is not None and x<0]  # skip rows where no ID was recorded
print(f"Number of articles with no revision ID: {len(non_revid_idx)}")

# Save the data to a CSV file
merged_df.drop(columns=['url'], inplace=False).to_csv(output_path, index=False) # drop url column

merged_df


Total errors: 9. Error rate: 0.13%
Number of articles with no revision ID: 8


| | article_title | url | country | population | region | revision_id | article_quality |
|---|---|---|---|---|---|---|---|
| 0 | Majah Ha Adrif | https://en.wikipedia.org/wiki/Majah_Ha_Adrif | Afghanistan | 42.4 | SOUTH ASIA | 10483286 | Start |
| 1 | Haroon al-Afghani | https://en.wikipedia.org/wiki/Haroon_al-Afghani | Afghanistan | 42.4 | SOUTH ASIA | 11966231 | B |
| 2 | Tayyab Agha | https://en.wikipedia.org/wiki/Tayyab_Agha | Afghanistan | 42.4 | SOUTH ASIA | 46841383 | Stub |
| 3 | Khadija Zahra Ahmadi | https://en.wikipedia.org/wiki/Khadija_Zahra_Ah... | Afghanistan | 42.4 | SOUTH ASIA | 71600382 | Stub |
| 4 | Aziza Ahmadyar | https://en.wikipedia.org/wiki/Aziza_Ahmadyar | Afghanistan | 42.4 | SOUTH ASIA | 47805901 | B |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 6873 | Josiah Tongogara | https://en.wikipedia.org/wiki/Josiah_Tongogara | Zimbabwe | 16.7 | EASTERN AFRICA | 633594 | Start |
| 6874 | Langton Towungana | https://en.wikipedia.org/wiki/Langton_Towungana | Zimbabwe | 16.7 | EASTERN AFRICA | 16375315 | Stub |
| 6875 | Sengezo Tshabangu | https://en.wikipedia.org/wiki/Sengezo_Tshabangu | Zimbabwe | 16.7 | EASTERN AFRICA | 75270547 | Stub |
| 6876 | Herbert Ushewokunze | https://en.wikipedia.org/wiki/Herbert_Ushewokunze | Zimbabwe | 16.7 | EASTERN AFRICA | 11742819 | B |


Log the errors

In [200]:
# Get the rows with errors
error_df = merged_df[merged_df['article_quality'].isna()]
display(error_df)

# Get the revision_id column
revision_ids = merged_df['revision_id'].tolist()
article_titles = merged_df['article_title'].tolist()

# Add revision_id & article titles to the errors list
errors_updates = [(i, revision_ids[i], article_titles[i], e) for i, e in errors]

# Save errors to a txt file
with open(error_path, 'w') as f:
    for error in errors_updates:
        f.write(f"Index: {error[0]}, Revision ID: {error[1]}, Article Title: {error[2]}, Error: {error[3]}\n")

| | article_title | url | country | population | region | revision_id | article_quality |
|---|---|---|---|---|---|---|---|
| 397 | Barbara Eibinger-Miedl | https://en.wikipedia.org/wiki/Barbara_Eibinger... | Austria | 9.2 | WESTERN EUROPE | -1 | |
| 483 | Mehrali Gasimov | https://en.wikipedia.org/wiki/Mehrali_Gasimov | Azerbaijan | 10.2 | WESTERN ASIA | -1 | |
| 1159 | Kyaw Myint | https://en.wikipedia.org/wiki/Kyaw_Myint | Myanmar | 55.4 | SOUTHEAST ASIA | -1 | |
| 1301 | André Ngongang Ouandji | https://en.wikipedia.org/wiki/André_Ngongang_O... | Cameroon | 28.1 | MIDDLE AFRICA | -1 | |
| 1861 | Tomás Pimentel | https://en.wikipedia.org/wiki/Tomás_Pimentel | Dominican Republic | 11.3 | CARIBBEAN | -1 | |
| 2333 | Richard Sumah | https://en.wikipedia.org/wiki/Richard_Sumah | Ghana | 34.1 | WESTERN AFRICA | -1 | |
| 3503 | Adem Hodža | https://en.wikipedia.org/wiki/Adem_Hodža | Kosovo | 1.7 | SOUTHERN EUROPE | 65356695 | |
| 4322 | Segun ''Aeroland'' Adewale | https://en.wikipedia.org/wiki/Segun_''Aeroland... | Nigeria | 223.8 | WESTERN AFRICA | -1 | |
| 5538 | Bashir Bililiqo | https://en.wikipedia.org/wiki/Bashir_Bililiqo | Somalia | 18.1 | EASTERN AFRICA | -1 | |


[(397,
  'The MW API does not have any info related to the rev-id provided as input (-1), therefore it is not possible to extract features properly. One possible cause is the deletion of the page related to the revision id. Please contact the ML-Team if you need more info.'),
 (483,
  'The MW API does not have any info related to the rev-id provided as input (-1), therefore it is not possible to extract features properly. One possible cause is the deletion of the page related to the revision id. Please contact the ML-Team if you need more info.'),
 (1159,
  'The MW API does not have any info related to the rev-id provided as input (-1), therefore it is not possible to extract features properly. One possible cause is the deletion of the page related to the revision id. Please contact the ML-Team if you need more info.'),
 (1301,
  'The MW API does not have any info related to the rev-id provided as input (-1), therefore it is not possible to extract features properly. One possible cause

### Step 4: Analysis

The analysis consists of 2 calculations:
1. Total articles per capita
2. High quality articles per capita   
    - Consider "high quality" articles to be articles that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) classes.
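Both calculations reduce to a count divided by a head count. The sketch below works through the arithmetic for one country, assuming (as the code below does) that the `population` column is in millions. The function name `per_capita` is hypothetical. The count of 85 is an assumption, chosen because it reproduces the Afghanistan figure in the output below from the population of 42.4 shown in the merged table; the notebook itself does not state that count.

```python
def per_capita(article_count, population_millions):
    """Articles per person; 0.0 when the population rounds to zero."""
    people = population_millions * 10**6
    if people <= 0:
        # Per-capita coverage is undefined here; report 0 as the notebook does
        return 0.0
    return article_count / people

afghanistan = per_capita(85, 42.4)
print(f"{afghanistan:.6e}")
```

The zero-population branch matters for very small countries whose population rounds to 0.0 million: their coverage is reported as 0 even if they have articles.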

Analysis 1

In [248]:
article_count_by_country = merged_df.groupby('country')['article_title'].count()
article_count_by_country = article_count_by_country.astype(float)

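# Population is reported in millions, so convert it to a head count
population_by_country = merged_df.groupby('country')['population'].first()*10**6
# Filter out countries where the population is 0 (their per-capita value is undefined and becomes 0 below)
population_by_country = population_by_country[population_by_country > 0]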

article_per_capita = article_count_by_country / population_by_country
article_per_capita = article_per_capita.fillna(0)

article_per_capita.to_csv('data/wp_article_per_capita.csv', header=True)
article_per_capita


country
Afghanistan    2.004717e-06
Albania        2.592593e-05
Algeria        1.517094e-06
Angola         1.580381e-06
Argentina      1.382289e-06
                   ...     
Venezuela      1.944444e-06
Vietnam        3.640040e-07
Yemen          9.302326e-07
Zambia         1.485149e-07
Zimbabwe       4.131737e-06
Length: 159, dtype: float64

Analysis 2

In [222]:
high_qualities = ['FA', 'GA']
hi_quality_articles = merged_df.loc[merged_df['article_quality'].isin(high_qualities)]
hi_article_count_by_country = hi_quality_articles.groupby('country')['article_title'].count()
hi_article_count_by_country = hi_article_count_by_country.astype(float)

hi_article_per_capita = hi_article_count_by_country / population_by_country
hi_article_per_capita = hi_article_per_capita.fillna(0)

hi_article_per_capita.to_csv('data/wp_high_quality_article_per_capita.csv', header=True)
hi_article_per_capita

country
Afghanistan    9.433962e-08
Albania        1.111111e-06
Algeria        6.410256e-08
Angola         5.449591e-08
Argentina      4.319654e-08
                   ...     
Venezuela      3.472222e-08
Vietnam        4.044489e-08
Yemen          8.720930e-08
Zambia         0.000000e+00
Zimbabwe       0.000000e+00
Length: 159, dtype: float64

### Step 5: Results

This result section shows 6 tables to summarize the data.
1. Top 10 countries by coverage: The 10 countries with the highest total articles per capita (in descending order).
2. Bottom 10 countries by coverage: The 10 countries with the lowest total articles per capita (in ascending order).
3. Top 10 countries by high quality: The 10 countries with the highest high quality articles per capita (in descending order).
4. Bottom 10 countries by high quality: The 10 countries with the lowest high quality articles per capita (in ascending order).
5. Geographic regions by total coverage: A rank ordered list of geographic regions (in descending order) by total articles per capita.
6. Geographic regions by high quality coverage: Rank ordered list of geographic regions (in descending order) by high quality articles per capita.



Table 1: Top 10 countries by coverage

In [250]:
# Get the top 10 countries by articles per capita
top_10_countries_by_coverage = article_per_capita.sort_values(ascending=False).head(10)
top_10_countries_by_coverage

country
Marshall Islands    0.000130
Tonga               0.000100
Barbados            0.000083
Montenegro          0.000060
Seychelles          0.000060
Bhutan              0.000055
Maldives            0.000055
Samoa               0.000040
Luxembourg          0.000039
Bahrain             0.000027
dtype: float64

Table 2: Bottom 10 countries by coverage

In [255]:
bot_10_countries_by_coverage = article_per_capita.sort_values(ascending=True).head(10)
bot_10_countries_by_coverage

country
Monaco          0.000000e+00
Tuvalu          0.000000e+00
China           1.133707e-08
India           1.056979e-07
Ghana           1.173021e-07
Saudi Arabia    1.355014e-07
Zambia          1.485149e-07
Norway          1.818182e-07
Israel          2.040816e-07
Egypt           3.041825e-07
dtype: float64

Table 3: Top 10 countries by high quality

In [253]:
top_10_countries_by_hi_qual = hi_article_per_capita.sort_values(ascending=False).head(10)
top_10_countries_by_hi_qual

country
Seychelles               1.000000e-05
Barbados                 3.333333e-06
Luxembourg               2.857143e-06
Maldives                 1.666667e-06
Guyana                   1.250000e-06
Albania                  1.111111e-06
Comoros                  1.111111e-06
Moldova                  8.823529e-07
Palestinian Territory    7.272727e-07
Bahrain                  6.250000e-07
dtype: float64

Table 4: Bottom 10 countries by high quality

In [256]:
bot_10_countries_by_hi_qual = hi_article_per_capita.sort_values(ascending=True).head(10)
bot_10_countries_by_hi_qual

country
Armenia                     0.0
Belgium                     0.0
Belize                      0.0
Benin                       0.0
Belarus                     0.0
Bahamas                     0.0
Botswana                    0.0
Bhutan                      0.0
Central African Republic    0.0
Cape Verde                  0.0
dtype: float64

Table 5: Geographic regions by total coverage

In [260]:
# Get a rank ordered list of geographic regions (in descending order) by total articles per capita.

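# Count the articles per region and sum the population per region
# NOTE: merged_df has one row per article, so this sum adds a country's population once for each of its articles
region_article_counts = merged_df.groupby('region')['article_title'].count()
region_population_sums = merged_df.groupby('region')['population'].sum()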

# Calculate articles per capita for each region
region_articles_per_capita = region_article_counts / (region_population_sums * 10**6)

region_articles_per_capita_sorted = region_articles_per_capita.sort_values(ascending=False)
region_articles_per_capita_sorted

region
OCEANIA            5.287147e-07
NORTHERN EUROPE    1.643576e-07
CENTRAL AMERICA    1.325063e-07
CARIBBEAN          1.161868e-07
CENTRAL ASIA       5.343819e-08
WESTERN ASIA       4.562624e-08
SOUTHERN EUROPE    4.438479e-08
MIDDLE AFRICA      4.338289e-08
EASTERN AFRICA     2.777639e-08
WESTERN EUROPE     2.625212e-08
NORTHERN AFRICA    2.480717e-08
EASTERN EUROPE     2.441048e-08
SOUTHERN AFRICA    2.065734e-08
SOUTH AMERICA      1.648602e-08
SOUTHEAST ASIA     8.794450e-09
WESTERN AFRICA     8.607011e-09
EAST ASIA          4.062781e-09
SOUTH ASIA         2.536326e-09
dtype: float64
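The cell above sums `population` over every article row, so a country's population is added once per article. If the intent is for each country to contribute its population once, one alternative is to de-duplicate by country before summing. The sketch below contrasts the two on made-up data; it is an illustration under that assumption, not a recomputation of the table above. Because `merged_df` comes from an inner join, even the de-duplicated sum would only cover countries that have at least one politician row.

```python
import pandas as pd

# Made-up rows: one row per article, population in millions
toy = pd.DataFrame({
    "article_title": ["a1", "a2", "a3", "b1"],
    "country":       ["A",  "A",  "A",  "B"],
    "population":    [10.0, 10.0, 10.0, 5.0],
    "region":        ["X",  "X",  "X",  "X"],
})

articles = toy.groupby("region")["article_title"].count()

# Row-wise sum, as in the cell above: country A is counted three times
row_sum = toy.groupby("region")["population"].sum()

# Alternative: keep one row per country before summing
unique_sum = (toy.drop_duplicates("country")
                 .groupby("region")["population"].sum())

per_capita_rows = articles / (row_sum * 10**6)
per_capita_unique = articles / (unique_sum * 10**6)
print(row_sum["X"], unique_sum["X"])
```

In this toy case the row-wise denominator is 35 million against a de-duplicated 15 million, so the row-wise ratio understates coverage for regions whose countries have many articles.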

Table 6: Geographic regions by high quality coverage

In [262]:
# Rank ordered list of geographic regions (in descending order) by high quality articles per capita.

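# Count the high quality articles per region and sum the population per region
# NOTE: hi_quality_articles has one row per article, so this sum adds a country's population once for each of its high quality articles
region_hi_article_counts = hi_quality_articles.groupby('region')['article_title'].count()
region_population_sums = hi_quality_articles.groupby('region')['population'].sum()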

# Calculate high quality articles per capita for each region
region_hi_articles_per_capita = region_hi_article_counts / (region_population_sums * 10**6)

region_hi_articles_per_capita_sorted = region_hi_articles_per_capita.sort_values(ascending=False)
region_hi_articles_per_capita_sorted


region
CARIBBEAN          1.310044e-07
SOUTHERN EUROPE    1.094092e-07
CENTRAL AMERICA    1.057082e-07
NORTHERN EUROPE    9.523810e-08
WESTERN ASIA       4.212860e-08
MIDDLE AFRICA      2.955665e-08
EASTERN AFRICA     2.636667e-08
NORTHERN AFRICA    2.604920e-08
WESTERN EUROPE     2.580275e-08
SOUTH AMERICA      2.572220e-08
EASTERN EUROPE     1.956309e-08
SOUTHERN AFRICA    1.647446e-08
SOUTHEAST ASIA     1.250120e-08
EAST ASIA          1.188119e-08
WESTERN AFRICA     6.032403e-09
SOUTH ASIA         3.466338e-09
dtype: float64