# Considering Bias in Data
Author: Amit Peled

Homework 2

## Requesting ORES scores through LiftWing ML Service API

The Wikimedia Foundation (WMF) is reworking access to its APIs. In the coming years, all API access will likely require some kind of authentication, either through a simple key/token or through some version of OAuth. For now this is still a work in progress; you can follow it on the [API portal](https://api.wikimedia.org/wiki/Main_Page). Another ongoing change is better control over API services that require additional computational resources beyond simply serving the text of a web page (i.e., the text of an article). Services like ORES, which run an ML model over the text of an article page, are examples of compute-intensive API services.

Wikimedia is implementing a new Machine Learning (ML) service infrastructure that they call [LiftWing](https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing). Given that ORES already has several ML models that have been well used, ORES is the first set of APIs that are being moved to LiftWing.

This example illustrates how to generate article quality estimates for article revisions using the LiftWing version of [ORES](https://www.mediawiki.org/wiki/ORES). The [ORES API documentation](https://ores.wikimedia.org) can be accessed from the main ORES page. The [ORES LiftWing documentation](https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Usage) is very thin ... even thinner than the standard ORES documentation. Further, it is clear that some parameters have been renamed (e.g., "revid" in the old ORES API is now "rev_id" in the LiftWing ORES API).

## Article Page Info MediaWiki API 
This notebook also illustrates how to access page info data using the [MediaWiki Action API for English Wikipedia](https://www.mediawiki.org/wiki/API:Main_page). It requests summary 'page info' for a single article page. The API documentation, [API:Info](https://www.mediawiki.org/wiki/API:Info), covers additional details that may be helpful when trying to use or understand this example.

## License
This code example was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.2 - September 16, 2024

Note: This project was developed with the assistance of ChatGPT, which helped format code and organize the data and analysis in an efficient and clear manner. All content has been verified and tested by the author.

This notebook demonstrates how to use the LiftWing API to generate article quality estimates for Wikipedia article revisions using the ORES model.

* ORES API Documentation: [ORES API documentation](https://ores.wikimedia.org)
* LiftWing Documentation: [LiftWing ORES API Documentation](https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Usage)

# Notebook Overview
In this notebook, I aim to analyze the coverage and quality of Wikipedia articles about political figures across various countries. I will:

- Retrieve article metadata and ORES quality predictions using the Wikimedia API and the LiftWing infrastructure.
- Merge this data with population statistics for each country and geographic region.
- Calculate per capita coverage for both articles and high-quality articles.
- Analyze geographic patterns by producing tables that rank countries and regions by article coverage and quality.

## Loading and Preparing Data

### Import Libraries

In [48]:
#
# These are standard Python modules
import json, time, urllib.parse
#
# The 'requests' and 'pandas' modules are not standard Python modules. You will need to install them with pip/pip3 if you do not already have them
import requests
import pandas as pd

### Load CSV Data (Politicians and Population) 
These files are located in the repository of this project and were provided to me by the teaching staff of DATA 512 for the purposes of this assignment.

In [49]:
# Load politicians data
politicians_df = pd.read_csv('data/politicians_by_country_AUG.2024.csv')

# Load population data
population_df = pd.read_csv('data/population_by_country_AUG.2024.csv')

# Display the first few rows of each dataframe
print(politicians_df.head())
print(population_df.head())


                   name                                                url  \
0        Majah Ha Adrif       https://en.wikipedia.org/wiki/Majah_Ha_Adrif   
1     Haroon al-Afghani    https://en.wikipedia.org/wiki/Haroon_al-Afghani   
2           Tayyab Agha          https://en.wikipedia.org/wiki/Tayyab_Agha   
3  Khadija Zahra Ahmadi  https://en.wikipedia.org/wiki/Khadija_Zahra_Ah...   
4        Aziza Ahmadyar       https://en.wikipedia.org/wiki/Aziza_Ahmadyar   

       country  
0  Afghanistan  
1  Afghanistan  
2  Afghanistan  
3  Afghanistan  
4  Afghanistan  
         Geography  Population
0            WORLD      8009.0
1           AFRICA      1453.0
2  NORTHERN AFRICA       256.0
3          Algeria        46.8
4            Egypt       105.2


### Data Cleaning and Handling Inconsistencies

In [50]:
# Check for duplicates in politicians data
print("Duplicates in Politicians Data:", politicians_df.duplicated().sum())

# Check for missing values in politicians data
print("Missing values in Politicians Data:", politicians_df.isnull().sum())

# Split population data into regions and countries
regions_df = population_df[population_df['Geography'].str.isupper()]
countries_df = population_df[~population_df['Geography'].str.isupper()]

# Check the first few rows to ensure correct splitting
print(regions_df.head())
print(countries_df.head())

Duplicates in Politicians Data: 0
Missing values in Politicians Data: name       0
url        0
country    0
dtype: int64
          Geography  Population
0             WORLD      8009.0
1            AFRICA      1453.0
2   NORTHERN AFRICA       256.0
10   WESTERN AFRICA       442.0
27   EASTERN AFRICA       483.0
  Geography  Population
3   Algeria        46.8
4     Egypt       105.2
5     Libya         6.9
6   Morocco        37.0
7     Sudan        48.1
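The split above relies on a naming convention in the population file: region rows are written in all capitals and country rows are not. The following is a minimal sketch of that idea on a few rows copied from the output above; the `toy_` names are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Toy rows copied from the population output above (population in millions)
toy_population = pd.DataFrame({
    "Geography": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt"],
    "Population": [8009.0, 1453.0, 256.0, 46.8, 105.2],
})

# str.isupper() is True only when every cased character is uppercase,
# so spaces inside "NORTHERN AFRICA" do not affect the result
is_region = toy_population["Geography"].str.isupper()

toy_regions = toy_population[is_region]
toy_countries = toy_population[~is_region]

print(toy_regions["Geography"].tolist())
print(toy_countries["Geography"].tolist())
```

One consequence of this design is that a country name written entirely in capitals would be treated as a region, so it is worth scanning the region list once by eye.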


## Making API Requests for Article Metadata and ORES Quality Scores

### Constants and Setup for API Requests

In [51]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
ORES_ENDPOINT = "https://ores.wikimedia.org/v3/scores/enwiki"
API_HEADER_AGENT = 'User-Agent'

#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = ((60.0*60.0)/5000.0)-API_LATENCY_ASSUMED  # The key authorizes 5000 requests per hour

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#    
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "apeled@uw.edu, University of Washington, MSDS DATA 512 - AUTUMN 2024",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<apeled@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "",         # your email address should go here
    'access_token'  : ""          # the access token you create will need to go here
}

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}

#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # required that its english - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#
#    These are used later - defined here so they, at least, have empty values
#
USERNAME = ""
ACCESS_TOKEN = ""
#
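To make the throttling constants concrete, here is a small worked sketch of the arithmetic behind `API_THROTTLE_WAIT`. It reuses the 5000 requests per hour limit and the 2 ms latency assumption stated above; the helper `estimated_wait_minutes` is hypothetical and only estimates time spent sleeping, not total run time.

```python
REQUESTS_PER_HOUR = 5000   # rate limit stated in the constants above
LATENCY_ASSUMED = 0.002    # seconds, same assumption as API_LATENCY_ASSUMED

# Seconds to sleep before each request so the hourly limit is not exceeded
throttle_wait = (60.0 * 60.0) / REQUESTS_PER_HOUR - LATENCY_ASSUMED


def estimated_wait_minutes(n_requests, wait=throttle_wait):
    """Minutes spent sleeping for n_requests throttled calls (hypothetical helper)."""
    return n_requests * wait / 60.0


print(f"{throttle_wait:.3f} seconds between requests")
print(f"{estimated_wait_minutes(1000):.1f} minutes of waiting per 1000 requests")
```

Because both the page info request and the ORES request sleep before calling the API, each article costs roughly two of these waits.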

### Function to Get Wikipedia Article Metadata (Page Info)

In [52]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    
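    # The article title can be passed as a parameter to the call or set in the request_template.
    # Copy the template first so the shared module-level dict is not modified between calls.
    request_template = dict(request_template)
    if article_title:
        request_template['titles'] = article_title

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")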

    if API_HEADER_AGENT not in headers:
        raise Exception(f"The header data should include a '{API_HEADER_AGENT}' field that contains your UW email address.")

    if 'uwnetid@uw' in headers[API_HEADER_AGENT]:
        raise Exception(f"Use your UW email address in the '{API_HEADER_AGENT}' field.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
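The page info response is a nested dictionary, and the value of interest for scoring is the latest revision id. Below is a minimal sketch of pulling that id out, assuming the `query` → `pages` → `lastrevid` layout that the rest of this notebook reads. The helper name and the sample responses are hypothetical, and the ids and titles in them are placeholders rather than real data.

```python
def extract_last_revision_id(pageinfo_response):
    """Return the first 'lastrevid' found in a page info response, or None."""
    if not pageinfo_response:
        return None
    pages = pageinfo_response.get("query", {}).get("pages", {})
    for page_id, page_details in pages.items():
        if "lastrevid" in page_details:
            return page_details["lastrevid"]
    return None


# Placeholder responses: ids and titles are made up for illustration only
found_response = {
    "query": {"pages": {"111": {"pageid": 111, "title": "Example", "lastrevid": 222}}}
}
missing_response = {
    "query": {"pages": {"-1": {"title": "No such page", "missing": ""}}}
}

print(extract_last_revision_id(found_response))
print(extract_last_revision_id(missing_response))
```

Returning `None` for a missing page lets the caller log the title as an error instead of raising in the middle of a long crawl.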


### Function to Get ORES Article Quality Scores

In [53]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT, 
                                   model_name = API_ORES_EN_QUALITY_MODEL, 
                                   request_data = ORES_REQUEST_DATA_TEMPLATE, 
                                   header_format = REQUEST_HEADER_TEMPLATE, 
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):
    
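    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    #    Copy the templates first so the shared module-level dicts are not modified between calls
    request_data = dict(request_data)
    header_params = dict(header_params)
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token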
    
    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")
    
    # Create the request URL with the specified model parameter - default is an article quality score request
    request_url = endpoint_url.format(model_name=model_name)
    
    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


## Running Requests for Each Article

### Retrieve Revision IDs for Each Article and then the ORES scores:

In [None]:
# Replace these with your actual email and LiftWing API access token.
# Never commit a real access token to a repository; the token below is a placeholder.
email_address = "apgsw30@gmail.com"
access_token = "<your_access_token>"

# Function to retrieve the ORES quality score for each article
def process_article(article_title, email_address, access_token):
    # Get the page info for the article (to get revision ID)
    page_info = request_pageinfo_per_article(article_title=article_title)
    if page_info:
        pages_data = page_info.get('query', {}).get('pages', {})
        for page_id, page_details in pages_data.items():
            revision_id = page_details.get('lastrevid', None)
            if revision_id:
                # Get the ORES score using the revision ID
                quality_score = request_ores_score_per_article(revision_id, email_address, access_token)
                return revision_id, quality_score
    return None, None


# Initialize list to store errors
errors = []

# Process articles one by one
for index, row in politicians_df.iterrows():
    article_title = row['name']
    print(f"Processing article: {article_title}")
    
    # Get the ORES quality score for the article
    revision_id, quality_score = process_article(article_title, email_address, access_token)
    
    if revision_id and quality_score:
        politicians_df.at[index, 'revision_id'] = revision_id
        politicians_df.at[index, 'quality_prediction'] = quality_score
    else:
        errors.append(article_title)

# Save the final output to a CSV file
output_filename = 'politicians_with_ores_predictions.csv'
politicians_df.to_csv(output_filename, index=False)
print(f"Data saved to {output_filename}")

# Save the errors to a log file
error_log_filename = 'ores_errors_log.txt'
with open(error_log_filename, 'w') as f:
    for article in errors:
        f.write(f"{article}\n")
print(f"Errors logged to {error_log_filename}")
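The loop above stores whatever `request_ores_score_per_article` returns, which is the full JSON response rather than a single label, while the saved CSV used later holds labels such as `Start` or `B`. The sketch below shows one way a label could be pulled out of a response. The nested layout (`enwiki` → `scores` → revision id → `articlequality` → `score` → `prediction`) is an assumption about the LiftWing response and should be checked against a real response; the helper name and sample dictionary are hypothetical.

```python
def extract_quality_prediction(score_response, wiki="enwiki", model="articlequality"):
    """Return the predicted quality label, or None if the assumed layout is absent."""
    if not isinstance(score_response, dict):
        return None
    scores = score_response.get(wiki, {}).get("scores", {})
    for rev_id, models in scores.items():
        prediction = models.get(model, {}).get("score", {}).get("prediction")
        if prediction is not None:
            return prediction
    return None


# Placeholder response following the ASSUMED layout; the revision id is made up
sample_response = {
    "enwiki": {
        "scores": {
            "123456789": {
                "articlequality": {"score": {"prediction": "Start"}}
            }
        }
    }
}

print(extract_quality_prediction(sample_response))
print(extract_quality_prediction({"error": "placeholder error payload"}))
```

Storing only the label keeps the output CSV small and makes later filters such as `isin(['FA', 'GA'])` straightforward.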


To avoid repeating the time-consuming crawl, we load the previously extracted results from the repository file `politicians_with_ores_predictions.csv`.

In [31]:
# Load politicians data with scores
politicians_with_ores_predictions = pd.read_csv('data/politicians_with_ores_predictions.csv')

In [32]:
# Find articles where no ORES quality prediction was obtained
missing_scores = politicians_with_ores_predictions[politicians_with_ores_predictions['quality_prediction'].isnull()]
print(f"Number of articles missing ORES scores: {len(missing_scores)}")

# Calculate the error rate for missing ORES scores as a share of all articles in the loaded file
error_rate = len(missing_scores) / len(politicians_with_ores_predictions)
print(f"Error rate: {error_rate:.2%}")

Number of articles missing ORES scores: 8
Error rate: 0.11%


In [33]:
# Check the structure of the dataframe
print(politicians_with_ores_predictions.columns)

# Preview the dataframe to verify the column
print(politicians_with_ores_predictions.head())

Index(['name', 'url', 'country', 'revision_id', 'quality_prediction'], dtype='object')
                   name                                                url  \
0        Majah Ha Adrif       https://en.wikipedia.org/wiki/Majah_Ha_Adrif   
1     Haroon al-Afghani    https://en.wikipedia.org/wiki/Haroon_al-Afghani   
2           Tayyab Agha          https://en.wikipedia.org/wiki/Tayyab_Agha   
3  Khadija Zahra Ahmadi  https://en.wikipedia.org/wiki/Khadija_Zahra_Ah...   
4        Aziza Ahmadyar       https://en.wikipedia.org/wiki/Aziza_Ahmadyar   

       country   revision_id quality_prediction  
0  Afghanistan  1.233203e+09              Start  
1  Afghanistan  1.230460e+09                  B  
2  Afghanistan  1.225662e+09              Start  
3  Afghanistan  1.234742e+09               Stub  
4  Afghanistan  1.195651e+09              Start  


## Combining the Datasets

Merge the politician data I just obtained with population statistics for each country and geographic region.

In [34]:
# Split the population data into regions and countries
regions_df = population_df[population_df['Geography'].str.isupper()]
countries_df = population_df[~population_df['Geography'].str.isupper()]

# Assign each country to the region row that most recently preceded it in the population file
country_region_mapping = []
current_region = None

for idx, row in population_df.iterrows():
    if row['Geography'].isupper():
        current_region = row['Geography']  # Track the most recent region
    else:
        country_region_mapping.append({
            'country': row['Geography'].lower(),  # Convert to lowercase for consistency
            'region': current_region,
            'population': row['Population']
        })

# Convert the country-region mapping into a DataFrame
country_region_df = pd.DataFrame(country_region_mapping)

# Merge the politician data with the country-region mapping
politicians_with_ores_predictions['country'] = politicians_with_ores_predictions['country'].str.lower()  # Convert to lowercase
merged_df = pd.merge(politicians_with_ores_predictions, country_region_df, on='country', how='left')

# Identify countries that did not match between the datasets
no_match_df = merged_df[merged_df['population'].isnull()]['country'].unique()

# Save the list of unmatched countries to 'wp_countries-no_match.txt'
with open('wp_countries-no_match.txt', 'w') as f:
    for country in no_match_df:
        f.write(f"{country}\n")

# Filter out the unmatched countries (remove NaN population entries)
merged_df = merged_df[merged_df['population'].notnull()]

# Select and rename the required columns for the final CSV
final_df = merged_df[['country', 'region', 'population', 'name', 'revision_id', 'quality_prediction']]
final_df = final_df.rename(columns={
    'name': 'article_title',
    'quality_prediction': 'article_quality'
})

# Save the consolidated dataset to 'wp_politicians_by_country.csv'
final_df.to_csv('data/wp_politicians_by_country.csv', index=False)

print("Files created successfully:")
print("- wp_countries-no_match.txt")
print("- wp_politicians_by_country.csv")


Files created successfully:
- wp_countries-no_match.txt
- wp_politicians_by_country.csv
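The region assignment and the unmatched-country check above can also be expressed without an explicit loop. The following is a sketch of an alternative, not the method used in this notebook: it forward-fills the most recent all-caps region name and uses the `indicator` option of `merge` to flag politicians whose country has no population row. The population rows are copied from outputs in this notebook; the politician names and the country `atlantis` are placeholders.

```python
import pandas as pd

# Population rows copied from this notebook's outputs (population in millions)
toy_population = pd.DataFrame({
    "Geography": ["AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt", "WESTERN AFRICA", "Ghana"],
    "Population": [1453.0, 256.0, 46.8, 105.2, 442.0, 34.1],
})

# Keep the name only on region rows, then carry it forward to the rows below
is_region = toy_population["Geography"].str.isupper()
toy_population["region"] = toy_population["Geography"].where(is_region).ffill()

toy_countries = toy_population.loc[~is_region].copy()
toy_countries = toy_countries.rename(columns={"Geography": "country", "Population": "population"})
toy_countries["country"] = toy_countries["country"].str.lower()

# Placeholder politicians; 'atlantis' stands in for a country with no population row
toy_politicians = pd.DataFrame({
    "name": ["Person A", "Person B", "Person C"],
    "country": ["algeria", "ghana", "atlantis"],
})

# indicator=True adds a '_merge' column that records where each row was found
merged = toy_politicians.merge(toy_countries, on="country", how="left", indicator=True)
unmatched = sorted(merged.loc[merged["_merge"] == "left_only", "country"].unique())

print(toy_countries[["country", "region"]])
print(unmatched)
```

The `_merge` column makes the reason for a missing population explicit, which is slightly more direct than testing for a null population after the merge.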


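Some countries have a population of zero in the dataset, which is not accurate (populations are reported in millions, so very small populations presumably round down to zero). I therefore looked up the populations of the following countries on Wikipedia and entered them manually: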

* [Monaco](https://en.wikipedia.org/wiki/Monaco)
* [Tuvalu](https://en.wikipedia.org/wiki/Tuvalu)

In [35]:
# Manually set the population values for Tuvalu and Monaco
final_df.loc[final_df['country'] == 'tuvalu', 'population'] = 0.0119  # In millions (11,900 people)
final_df.loc[final_df['country'] == 'monaco', 'population'] = 0.03836  # In millions (38,369 people, truncated to 38,360)

## Calculate Articles-per-Capita
This is the ratio of the total number of articles for a country or region to its population. Because population is given in millions, the resulting values are articles per million people.

In [36]:
# Calculate total articles-per-capita
final_df['total_articles'] = final_df.groupby('country')['article_title'].transform('count')
final_df['articles_per_capita'] = final_df['total_articles'] / final_df['population']

# Show top 5 rows of the updated dataframe
print(final_df[['country', 'population', 'articles_per_capita', 'total_articles']].head())

       country  population  articles_per_capita  total_articles
0  afghanistan        42.4             2.004717              85
1  afghanistan        42.4             2.004717              85
2  afghanistan        42.4             2.004717              85
3  afghanistan        42.4             2.004717              85
4  afghanistan        42.4             2.004717              85
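As a quick check of the per capita arithmetic, the Afghanistan row above can be reproduced by hand. This is a minimal sketch using only the two numbers printed in the output; the function name is hypothetical.

```python
def articles_per_million(total_articles, population_millions):
    """Articles divided by population expressed in millions of people."""
    return total_articles / population_millions


# Values taken from the output above: 85 articles, 42.4 million people
afghanistan_ratio = articles_per_million(85, 42.4)

print(round(afghanistan_ratio, 6))        # 2.004717
print(afghanistan_ratio / 1_000_000)      # the same ratio expressed per person
```

Reading the column as "articles per million people" avoids the very small numbers that a literal per-person ratio would produce.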


## Calculate High-Quality Articles-per-Capita
Now filter the dataset to consider only the high-quality articles (FA or GA) and calculate the high-quality articles per capita.

In [37]:
# Filter for high-quality articles (FA and GA)
high_quality_df = final_df[final_df['article_quality'].isin(['FA', 'GA'])].copy()

# Calculate high-quality articles-per-capita
high_quality_df['total_high_quality_articles'] = high_quality_df.groupby('country')['article_title'].transform('count')
high_quality_df['high_quality_per_capita'] = high_quality_df['total_high_quality_articles'] / high_quality_df['population']

# Group by country and get the first instance of each country
country_high_quality_df = high_quality_df[['country', 'total_articles', 'population', 'high_quality_per_capita']].drop_duplicates(subset=['country'])


## 1. Top 10 Countries by Total Articles per Capita (in descending order)

In [38]:
# Calculate total articles-per-capita for each country
final_df['articles_per_capita'] = round(final_df.groupby('country')['article_title'].transform('count') / (final_df['population']), 2)

# Group by country and get the first instance of each country
country_coverage_df = final_df[['country', 'total_articles', 'population', 'articles_per_capita']].drop_duplicates(subset=['country'])

# Sort by articles_per_capita and select the top 10
top_10_countries_by_coverage = country_coverage_df.sort_values(by='articles_per_capita', ascending=False).head(10)

# Display the top 10 countries by total articles per capita in a formatted table
top_10_countries_by_coverage = top_10_countries_by_coverage.style.format({
    'total_articles': '{:.0f}',
    'population': '{:.2f}',                   # 2 decimal places for population (in millions)
    'articles_per_capita': '{:.6f}'       # 6 decimal places for per capita ratios
})

# Show formatted table
top_10_countries_by_coverage

| | country | total_articles | population | articles_per_capita |
|---|---|---|---|---|
| 284 | antigua and barbuda | 33 | 0.1 | 330.0 |
| 4260 | monaco | 10 | 0.04 | 260.69 |
| 4178 | federated states of micronesia | 14 | 0.1 | 140.0 |
| 4135 | marshall islands | 13 | 0.1 | 130.0 |
| 6598 | tonga | 10 | 0.1 | 100.0 |
| 6749 | tuvalu | 1 | 0.01 | 84.03 |
| 699 | barbados | 25 | 0.3 | 83.33 |
| 4279 | montenegro | 36 | 0.6 | 60.0 |
| 5590 | seychelles | 6 | 0.1 | 60.0 |
| 856 | bhutan | 44 | 0.8 | 55.0 |


This table shows the top 10 countries with the highest number of Wikipedia articles per capita.

## 2. Bottom 10 Countries by Total Articles per Capita (in ascending order)

In [39]:
# Sort by articles_per_capita in ascending order and select the bottom 10
bottom_10_countries_by_coverage = country_coverage_df.sort_values(by='articles_per_capita', ascending=True).head(10)

# Display the bottom 10 countries by total articles per capita
bottom_10_countries_by_coverage = bottom_10_countries_by_coverage.style.format({
    'total_articles': '{:.0f}',
    'population': '{:.2f}',               # 2 decimal places for population (in millions)
    'articles_per_capita': '{:.6f}'       # 6 decimal places for per capita ratios
})

# Show formatted table
bottom_10_countries_by_coverage

| | country | total_articles | population | articles_per_capita |
|---|---|---|---|---|
| 1454 | china | 16 | 1411.3 | 0.01 |
| 2699 | india | 151 | 1428.6 | 0.11 |
| 2424 | ghana | 4 | 34.1 | 0.12 |
| 5475 | saudi arabia | 5 | 36.9 | 0.14 |
| 7083 | zambia | 3 | 20.2 | 0.15 |
| 4744 | norway | 1 | 5.5 | 0.18 |
| 3125 | israel | 2 | 9.8 | 0.2 |
| 2009 | egypt | 32 | 105.2 | 0.3 |
| 3275 | cote d'ivoire | 10 | 30.9 | 0.32 |
| 4379 | mozambique | 12 | 33.9 | 0.35 |


This table shows the bottom 10 countries with the lowest number of Wikipedia articles per capita.

## 3. Top 10 Countries by High-Quality Articles (Highest High-Quality Articles Per Capita)

In [40]:
# Filter for high-quality articles (FA and GA)
high_quality_df = final_df[final_df['article_quality'].isin(['FA', 'GA'])].copy()
print(f"Number of high-quality articles: {len(high_quality_df)}")

# Calculate the total number of high-quality articles per country
high_quality_df['total_high_quality_articles'] = high_quality_df.groupby('country')['article_title'].transform('count')

# Calculate high-quality articles-per-capita
high_quality_df['high_quality_per_capita'] = high_quality_df['total_high_quality_articles'] / high_quality_df['population']

# Group by country and get the first instance of each country
country_high_quality_df = high_quality_df[['country', 'total_high_quality_articles', 'population', 'high_quality_per_capita']].drop_duplicates(subset=['country'])

# Sort by high_quality_per_capita and select the top 10
top_10_countries_by_high_quality = country_high_quality_df.sort_values(by='high_quality_per_capita', ascending=False).head(10)

# Display the top 10 countries by high-quality articles per capita
top_10_countries_by_high_quality = top_10_countries_by_high_quality.style.format({
    'total_high_quality_articles': '{:.0f}',
    'population': '{:.2f}',                   # 2 decimal places for population (in millions)
    'high_quality_per_capita': '{:.6f}'       # 6 decimal places for per capita ratios
})

# Show formatted table
top_10_countries_by_high_quality


Number of high-quality articles: 304


| | country | total_high_quality_articles | population | high_quality_per_capita |
|---|---|---|---|---|
| 4279 | montenegro | 3 | 0.6 | 5.0 |
| 3903 | luxembourg | 2 | 0.7 | 2.857143 |
| 85 | albania | 7 | 2.7 | 2.592593 |
| 3667 | kosovo | 4 | 1.7 | 2.352941 |
| 4089 | maldives | 1 | 0.6 | 1.666667 |
| 3851 | lithuania | 4 | 2.9 | 1.37931 |
| 1726 | croatia | 5 | 3.8 | 1.315789 |
| 2551 | guyana | 1 | 0.8 | 1.25 |
| 4874 | palestinian territory | 6 | 5.5 | 1.090909 |
| 5663 | slovenia | 2 | 2.1 | 0.952381 |


This table shows the top 10 countries with the highest number of high-quality (FA or GA) Wikipedia articles per capita. Note that only countries with at least one high-quality article are included.

## 4. Bottom 10 Countries by High-Quality Articles (Lowest High-Quality Articles Per Capita)

In [41]:
# Sort by high_quality_per_capita in ascending order and select the bottom 10
bottom_10_countries_by_high_quality = country_high_quality_df.sort_values(by='high_quality_per_capita', ascending=True).head(10)

# Display the bottom 10 countries by high quality articles per capita
bottom_10_countries_by_high_quality = bottom_10_countries_by_high_quality.style.format({
    'total_high_quality_articles': '{:.0f}',
    'population': '{:.2f}',                   # 2 decimal places for population (in millions)
    'high_quality_per_capita': '{:.6f}'       # 6 decimal places for per capita ratios
})

# Show formatted table
bottom_10_countries_by_high_quality

| | country | total_high_quality_articles | population | high_quality_per_capita |
|---|---|---|---|---|
| 636 | bangladesh | 1 | 173.5 | 0.005764 |
| 2025 | egypt | 1 | 105.2 | 0.009506 |
| 2101 | ethiopia | 2 | 126.5 | 0.01581 |
| 3318 | japan | 2 | 124.5 | 0.016064 |
| 4775 | pakistan | 4 | 240.5 | 0.016632 |
| 1501 | colombia | 1 | 52.2 | 0.019157 |
| 1617 | congo dr | 2 | 102.3 | 0.01955 |
| 7049 | vietnam | 2 | 98.9 | 0.020222 |
| 6776 | uganda | 1 | 48.6 | 0.020576 |
| 176 | algeria | 1 | 46.8 | 0.021368 |


This table displays the bottom 10 countries with the lowest number of high-quality Wikipedia articles per capita. Note that only countries with at least one high-quality article are included, so countries with none do not appear.
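Because the high-quality tables are built from a dataframe that was already filtered to FA and GA rows, countries with no high-quality articles drop out before any ranking happens. The sketch below shows an alternative that counts high-quality articles per country without filtering first, so zero-count countries stay visible. It is an illustration only, not the method used above; the populations are taken from tables in this notebook, while the quality labels are toy values.

```python
import pandas as pd

# Populations (in millions) come from this notebook's tables; quality labels are toy values
toy_articles = pd.DataFrame({
    "country": ["albania", "albania", "albania", "norway", "norway"],
    "population": [2.7, 2.7, 2.7, 5.5, 5.5],
    "article_quality": ["GA", "Stub", "FA", "Start", "C"],
})

# Flag high-quality rows instead of dropping the others
toy_articles["is_high_quality"] = toy_articles["article_quality"].isin(["FA", "GA"])

# One row per country: summing the boolean flag counts the high-quality articles
per_country = toy_articles.groupby("country").agg(
    population=("population", "first"),
    total_articles=("article_quality", "size"),
    high_quality_articles=("is_high_quality", "sum"),
).reset_index()

per_country["high_quality_per_million"] = (
    per_country["high_quality_articles"] / per_country["population"]
)

print(per_country)
```

With this shape, a "bottom 10" ranking would be dominated by countries at exactly zero, which is arguably the more informative picture of missing coverage.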

## 5. Geographic Regions by Total Coverage (Ranked by Articles Per Capita)

In [42]:
# I will use the population data from regions_df for each region (provided by the DATA 512 teaching team)
# regions_df contains the population for each region as per the table provided

# Make a copy of regions_df
regions_df = regions_df.copy()

# Standardize 'Geography' names in regions_df
regions_df['Geography'] = regions_df['Geography'].str.strip().str.upper()

# Group by region and count the total number of articles per region from the final_df
region_coverage_df = final_df.groupby('region').size().reset_index(name='total_articles')

# Merge region_coverage_df with regions_df to get the correct population data for each region
region_coverage_df = pd.merge(region_coverage_df, regions_df, left_on='region', right_on='Geography')

# Calculate total articles-per-capita for each region using the population from regions_df
region_coverage_df['articles_per_capita'] = region_coverage_df['total_articles'] / region_coverage_df['Population']

# Sort regions by articles-per-capita and display the result
regions_by_total_coverage = region_coverage_df.sort_values(by='articles_per_capita', ascending=False)

# Display the result: total articles, population, and articles-per-capita for each region
regions_by_total_coverage = regions_by_total_coverage.style.format({
    'total_articles': '{:.0f}', 
    'Population': '{:.2f}',               # 2 decimal places for population (in millions)
    'articles_per_capita': '{:.6f}'       # 6 decimal places for per capita ratios
})

# Show formatted table
regions_by_total_coverage

| | region | total_articles | Geography | Population | articles_per_capita |
|---|---|---|---|---|---|
| 14 | SOUTHERN EUROPE | 797 | SOUTHERN EUROPE | 152.0 | 5.243421 |
| 0 | CARIBBEAN | 219 | CARIBBEAN | 44.0 | 4.977273 |
| 17 | WESTERN EUROPE | 498 | WESTERN EUROPE | 199.0 | 2.502513 |
| 5 | EASTERN EUROPE | 709 | EASTERN EUROPE | 285.0 | 2.487719 |
| 16 | WESTERN ASIA | 610 | WESTERN ASIA | 299.0 | 2.040134 |
| 8 | NORTHERN EUROPE | 191 | NORTHERN EUROPE | 108.0 | 1.768519 |
| 13 | SOUTHERN AFRICA | 123 | SOUTHERN AFRICA | 70.0 | 1.757143 |
| 9 | OCEANIA | 72 | OCEANIA | 45.0 | 1.6 |
| 4 | EASTERN AFRICA | 665 | EASTERN AFRICA | 483.0 | 1.376812 |
| 10 | SOUTH AMERICA | 569 | SOUTH AMERICA | 426.0 | 1.335681 |


This table shows a rank-ordered list of geographic regions by total articles per capita (highest to lowest).

## 6. Geographic Regions by High-Quality Coverage (Ranked by High-Quality Articles Per Capita)

In [29]:
# Group by region and count the total number of high-quality articles per region from high_quality_df
region_high_quality_df = high_quality_df.groupby('region').size().reset_index(name='total_high_quality_articles')

# Merge region_high_quality_df with regions_df to get the correct population data for each region
region_high_quality_df = pd.merge(region_high_quality_df, regions_df, left_on='region', right_on='Geography')

# Calculate high-quality articles-per-capita for each region using the population from regions_df
region_high_quality_df['high_quality_per_capita'] = region_high_quality_df['total_high_quality_articles'] / region_high_quality_df['Population']

# Sort regions by high-quality articles-per-capita and display the result
regions_by_high_quality_coverage = region_high_quality_df.sort_values(by='high_quality_per_capita', ascending=False)

# Display the result
regions_by_high_quality_coverage = regions_by_high_quality_coverage.style.format({
    'total_high_quality_articles': '{:.0f}',  # No decimal places for article counts
    'Population': '{:.2f}',                   # 2 decimal places for population (in millions)
    'high_quality_per_capita': '{:.6f}'       # 6 decimal places for per capita ratios
})

# Show formatted table
regions_by_high_quality_coverage

| | region | total_high_quality_articles | Geography | Population | high_quality_per_capita |
|---|---|---|---|---|---|
| 14 | SOUTHERN EUROPE | 53 | SOUTHERN EUROPE | 152.0 | 0.348684 |
| 0 | CARIBBEAN | 9 | CARIBBEAN | 44.0 | 0.204545 |
| 5 | EASTERN EUROPE | 38 | EASTERN EUROPE | 285.0 | 0.133333 |
| 13 | SOUTHERN AFRICA | 8 | SOUTHERN AFRICA | 70.0 | 0.114286 |
| 17 | WESTERN EUROPE | 21 | WESTERN EUROPE | 199.0 | 0.105528 |
| 16 | WESTERN ASIA | 27 | WESTERN ASIA | 299.0 | 0.090301 |
| 8 | NORTHERN EUROPE | 9 | NORTHERN EUROPE | 108.0 | 0.083333 |
| 7 | NORTHERN AFRICA | 17 | NORTHERN AFRICA | 256.0 | 0.066406 |
| 2 | CENTRAL ASIA | 5 | CENTRAL ASIA | 80.0 | 0.0625 |
| 1 | CENTRAL AMERICA | 10 | CENTRAL AMERICA | 182.0 | 0.054945 |


This table shows a rank-ordered list of geographic regions by high-quality (FA/GA) articles per capita.
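The regional figures come from joining per-region article counts to the region rows of the population file and dividing. Below is a small sketch that reproduces the first two rows of both regional tables from the counts and populations printed above. The dataframe names are hypothetical, and `validate="one_to_one"` is an extra safeguard added here, not something the cells above use.

```python
import pandas as pd

# Counts and populations copied from the two regional tables above
region_counts = pd.DataFrame({
    "region": ["SOUTHERN EUROPE", "CARIBBEAN"],
    "total_articles": [797, 219],
    "total_high_quality_articles": [53, 9],
})
region_population = pd.DataFrame({
    "Geography": ["SOUTHERN EUROPE", "CARIBBEAN"],
    "Population": [152.0, 44.0],   # in millions
})

# validate raises an error if a region name appears more than once on either side
summary = region_counts.merge(
    region_population, left_on="region", right_on="Geography", validate="one_to_one"
)

summary["articles_per_capita"] = summary["total_articles"] / summary["Population"]
summary["high_quality_per_capita"] = (
    summary["total_high_quality_articles"] / summary["Population"]
)

print(summary.round(6))
```

Note that this divides by the population reported on the region row itself, as the cells above do, rather than by the sum of the member countries' populations.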