# Considering Bias in Data

The goal of this notebook is to explore the concept of bias in data using Wikipedia articles. This notebook will consider articles on political figures from different countries. We will combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.
We will then perform an analysis of how the coverage of politicians on Wikipedia and the quality of articles about politicians varies among countries.


We start by importing the necessary libraries.

In [2]:
import pandas as pd
import requests
import json
import time
import os
from dotenv import load_dotenv

## 1. Getting the Data

For this notebook, we start with the two files in the `raw-data` folder, which contain the population data and the Wikipedia politician article data. We then augment the article data with ORES quality predictions through a series of API requests.

### Article Page Info MediaWiki API 
The cell below accesses page info data using the [MediaWiki Action API for English Wikipedia](https://www.mediawiki.org/wiki/API:Main_page). The API documentation, [API:Info](https://www.mediawiki.org/wiki/API:Info), covers additional details that may be helpful when trying to use or understand this code.

#### License
The code below is built upon the example developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.2 - September 16, 2024

In [2]:

# Load the CSV file with article titles (representing article IDs)
csv_file_path = "politicians_by_country_AUG.2024.csv"
politicians_df = pd.read_csv(csv_file_path)

# Set up Wikimedia API endpoint and request header
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"
HEADERS = {
    'User-Agent': 'asheera@uw.edu, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# Log file to store failed article titles and those with missing 'lastrevid'
failed_page_requests_log = "failed_page_requests.log"
missing_lastrevid_log = "missing_lastrevid.log"

# Function to log failed requests
def log_failed_titles(titles, log_file_path):
    with open(log_file_path, "a") as log_file:  # Open in append mode
        log_file.write("\n".join(titles) + "\n")

# Function to make batch requests for a group of article titles
def fetch_page_info(titles):
    params = {
        "action": "query",
        "format": "json",
        "prop": "info",
        "titles": "|".join(titles),
        "inprop": "url"
    }
    response = requests.get(API_ENDPOINT, headers=HEADERS, params=params)
    if response.status_code == 200:
        return response.json().get('query', {}).get('pages', {})
    else:
        # Log the failed titles
        log_failed_titles(titles, failed_page_requests_log)
        return {}

# Split article names into batches of 50 for API requests
batch_size = 50
article_batches = [politicians_df['name'][i:i + batch_size].tolist() for i in range(0, len(politicians_df), batch_size)]

# Dictionary to store lastrevid for each article
lastrevid_dict = {}

# Loop through each batch and make API requests
for batch_number, batch in enumerate(article_batches, start=1):
    print(f"Fetching page info for Batch:{batch_number}")
    page_info = fetch_page_info(batch)
    
    # Extract lastrevid for each article and store in the dictionary
    for page_id, page_data in page_info.items():
        title = page_data['title']
        lastrevid = page_data.get('lastrevid', None)
        
        if lastrevid is not None:
            lastrevid_dict[title] = lastrevid
        else:
            # Log the title with missing 'lastrevid'
            log_failed_titles([title], missing_lastrevid_log)

    # Respect API limits with a small delay
    time.sleep(API_THROTTLE_WAIT)  

# Add lastrevid as a new column to the DataFrame
politicians_df['lastrevid'] = politicians_df['name'].map(lastrevid_dict)

# Save the updated DataFrame to a new CSV file
output_csv_path = "politicians_with_lastrevid.csv"
politicians_df.to_csv(output_csv_path, index=False)

print(f"Updated CSV saved as {output_csv_path}")
print(f"Failed page requests logged to {failed_page_requests_log}")
print(f"Titles with missing 'lastrevid' logged to {missing_lastrevid_log}")

Fetching page info for Batch:1
Fetching page info for Batch:2
Fetching page info for Batch:3
Fetching page info for Batch:4
Fetching page info for Batch:5
Fetching page info for Batch:6
Fetching page info for Batch:7
Fetching page info for Batch:8
Fetching page info for Batch:9
Fetching page info for Batch:10
Fetching page info for Batch:11
Fetching page info for Batch:12
Fetching page info for Batch:13
Fetching page info for Batch:14
Fetching page info for Batch:15
Fetching page info for Batch:16
Fetching page info for Batch:17
Fetching page info for Batch:18
Fetching page info for Batch:19
Fetching page info for Batch:20
Fetching page info for Batch:21
Fetching page info for Batch:22
Fetching page info for Batch:23
Fetching page info for Batch:24
Fetching page info for Batch:25
Fetching page info for Batch:26
Fetching page info for Batch:27
Fetching page info for Batch:28
Fetching page info for Batch:29
Fetching page info for Batch:30
Fetching page info for Batch:31
Fetching page inf
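
One caveat when mapping results back by title: the Action API may normalize the titles it receives (for example, turning underscores into spaces). When that happens, the `title` in each page record differs from the string that was sent, and the response lists the change under `query.normalized`. The sketch below is not part of the original pipeline; it shows one possible way to map results back to the requested names. The helper name `map_lastrevid_to_requested` and the sample response are hypothetical.

```python
def map_lastrevid_to_requested(requested_titles, query):
    """Map each requested title to its lastrevid, following title normalization.

    `query` is the 'query' object of an Action API response (hypothetical helper).
    """
    # 'normalized' is a list of {"from": requested, "to": canonical} records
    canonical = {n["from"]: n["to"] for n in query.get("normalized", [])}

    # Index the returned pages by their canonical title
    revid_by_title = {
        page["title"]: page.get("lastrevid")
        for page in query.get("pages", {}).values()
    }

    result = {}
    for title in requested_titles:
        resolved = canonical.get(title, title)
        revid = revid_by_title.get(resolved)
        if revid is not None:  # missing pages carry no lastrevid
            result[title] = revid
    return result


# Made-up response fragment, for illustration only
sample_query = {
    "normalized": [{"from": "Jane_Doe", "to": "Jane Doe"}],
    "pages": {
        "101": {"pageid": 101, "title": "Jane Doe", "lastrevid": 555},
        "-1": {"title": "No Such Page", "missing": ""},
    },
}
mapping = map_lastrevid_to_requested(["Jane_Doe", "No Such Page"], sample_query)
print(mapping)
```

Using this with the cell above would require `fetch_page_info` to return the whole `query` object rather than only its `pages` entry.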

### Requesting ORES scores through LiftWing ML Service API

The cell below generates article quality estimates for article revisions using the LiftWing version of [ORES](https://www.mediawiki.org/wiki/ORES). The [ORES API documentation](https://ores.wikimedia.org) can be accessed from the main ORES page.

You will need a Wikimedia user account to get access to Lift Wing (the ML API service). You can either [create an account or log in](https://api.wikimedia.org/w/index.php?title=Special:UserLogin). This notebook uses a [Personal API token](https://api.wikimedia.org/wiki/Authentication), which is stored in a `.env` file.

#### License
The code below builds upon the example developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.0 - August 15, 2023


In [3]:
#########
#
#    CONSTANTS
#

#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = ((60.0*60.0)/5000.0)-API_LATENCY_ASSUMED  # The key authorizes 5000 requests per hour

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#    
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<{email_address}>, University of Washington, MSDS DATA 512 - AUTUMN 2024",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "",         # your email address should go here
    'access_token'  : ""          # the access token you create will need to go here
}

#
#    A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
#
ARTICLE_REVISIONS = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }

#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # must be English - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#
#    These are used later - defined here so they, at least, have empty values
#
USERNAME = "Apoorvasheera"
ACCESS_TOKEN = ""
#
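
The throttle constant above follows from simple arithmetic: 5000 requests per hour allows one request every 3600/5000 = 0.72 seconds, and subtracting the assumed 2 ms latency gives roughly 0.718 seconds of sleep per request. As a minimal sketch, the same calculation can be wrapped in a function; `throttle_wait` is a hypothetical helper, not part of the original example code.

```python
def throttle_wait(requests_per_hour, latency_assumed=0.002):
    """Seconds to sleep between requests to stay under an hourly quota."""
    seconds_per_request = 3600.0 / requests_per_hour
    # Never return a negative sleep time, even for very generous quotas
    return max(0.0, seconds_per_request - latency_assumed)


wait = throttle_wait(5000)
print(f"{wait:.3f} s between requests")
```

At this rate, scoring N articles takes about 0.72 × N seconds, which is worth keeping in mind before starting a long run.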

Load the personal access token from the `.env` file.

In [4]:
load_dotenv()  # Loads the .env file

CLIENT_ID = os.getenv('CLIENT_ID')
CLIENT_SECRET = os.getenv('CLIENT_SECRET')
ACCESS_TOKEN = os.getenv('ACCESS_TOKEN')
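
`os.getenv` returns `None` when a variable is missing, so a misnamed or absent `.env` entry only surfaces later as a failed request. The sketch below shows one way to fail early instead. `require_env` is a hypothetical helper, and the demonstration uses a stand-in dictionary rather than the real environment.

```python
import os


def require_env(name, environ=None):
    """Return the value of an environment variable or raise a clear error."""
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; check your .env file")
    return value


# Demonstration with a stand-in mapping instead of the real environment
fake_env = {"ACCESS_TOKEN": "example-token"}
token = require_env("ACCESS_TOKEN", fake_env)
print(token)
```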

Define a function to encapsulate the API call.

In [5]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT, 
                                   model_name = API_ORES_EN_QUALITY_MODEL, 
                                   request_data = ORES_REQUEST_DATA_TEMPLATE, 
                                   header_format = REQUEST_HEADER_TEMPLATE, 
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):
    
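    #    Work on copies so the module-level templates used as default arguments are never mutated
    request_data = dict(request_data)
    header_params = dict(header_params)

    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token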
    
    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")
    
    # Create the request URL with the specified model parameter - default is an article quality score request
    request_url = endpoint_url.format(model_name=model_name)
    
    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


Example usage of the function and output format from the API

In [8]:
#   
#
#   Which article - the key for the article dictionary defined above
article_title = "Bison"
#
print(f"Getting LiftWing ORES scores for '{article_title}' with revid: {ARTICLE_REVISIONS[article_title]:d}")
#
#    Make the call, just pass in the article revision ID, email address, and access token
score = request_ores_score_per_article(article_revid= ARTICLE_REVISIONS[article_title],
                                       email_address="apoorvasheera98@gmail.com",
                                       access_token=ACCESS_TOKEN)
#
#    Output the result
print(json.dumps(score,indent=4))
#

Getting LiftWing ORES scores for 'Bison' with revid: 1085687913
{
    "enwiki": {
        "models": {
            "articlequality": {
                "version": "0.9.2"
            }
        },
        "scores": {
            "1085687913": {
                "articlequality": {
                    "score": {
                        "prediction": "FA",
                        "probability": {
                            "B": 0.07895665991827401,
                            "C": 0.03728215742560417,
                            "FA": 0.5629436065906797,
                            "GA": 0.30547854835374505,
                            "Start": 0.011061807252218824,
                            "Stub": 0.00427722045947826
                        }
                    }
                }
            }
        }
    }
}
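
The response nests the predicted class and the per-class probabilities several levels deep, keyed by the revision ID as a string. As a small sketch of how to read it, the hypothetical helper `summarize_score` below returns the predicted label together with its probability, using a trimmed copy of the output shown above.

```python
def summarize_score(response, rev_id, wiki="enwiki", model="articlequality"):
    """Return (predicted label, probability of that label) from a LiftWing response."""
    # Revision IDs appear as string keys in the JSON response
    score = response[wiki]["scores"][str(rev_id)][model]["score"]
    prediction = score["prediction"]
    return prediction, score["probability"][prediction]


# Trimmed copy of the 'Bison' response shown above
sample = {
    "enwiki": {
        "scores": {
            "1085687913": {
                "articlequality": {
                    "score": {
                        "prediction": "FA",
                        "probability": {
                            "B": 0.07895665991827401,
                            "C": 0.03728215742560417,
                            "FA": 0.5629436065906797,
                            "GA": 0.30547854835374505,
                            "Start": 0.011061807252218824,
                            "Stub": 0.00427722045947826,
                        },
                    }
                }
            }
        }
    }
}

label, prob = summarize_score(sample, 1085687913)
print(label, round(prob, 3))
```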


#### Retrieving ORES Quality Predictions for Wikipedia Articles

In this step, we load the CSV file of politician articles and their most recent revision IDs. If a partially completed predictions file (`politicians_with_partial_prediction.csv`) exists, we reload it and resume from a manually specified index, set to the point where the previous run failed. Otherwise, we start from the beginning with an empty predictions list.

To retrieve the quality predictions, we send each revision ID to the LiftWing ORES API; the article title is used only for logging. The API is rate-limited, so we pause between requests to stay under the limit. Failed requests and errors are written to a separate log file (`ores_request_errors.log`) for review.

The main functionality includes:
- Loading the access token from an environment variable for API authentication.
- Defining helper functions for logging failed requests and extracting predictions from the API response.
- Iterating through each article and making requests to the ORES API to get quality predictions.
- Saving the updated DataFrame to a CSV file (`politicians_with_partial_prediction.csv`) after each request, so that progress is not lost.

Because progress is saved after every request, an interrupted run can be restarted from the point of failure without losing completed predictions.
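
As an alternative to tracking a restart index by hand, the rows that still need a request can be derived from the saved predictions themselves. The sketch below is one possible approach, not what the cell that follows does. The helpers `is_missing` and `pending_indices` are hypothetical; missing values are assumed to be either `None` or the `NaN` that pandas produces when reading empty CSV cells.

```python
import math


def is_missing(value):
    """True for None or a float NaN (what pandas yields for empty CSV cells)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def pending_indices(predictions):
    """Positions that have no prediction yet and still need an ORES request."""
    return [i for i, value in enumerate(predictions) if is_missing(value)]


# Made-up example: two rows scored, two still missing, one scored
predictions = ["Stub", "C", float("nan"), None, "GA"]
print(pending_indices(predictions))
```

Iterating over only the pending positions would skip completed rows and also re-attempt any earlier rows whose requests failed.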


In [33]:
# Load the CSV file
csv_file_path = "politicians_with_lastrevid.csv"
politicians_df = pd.read_csv(csv_file_path)

# Check if you already have some predictions and resume from there
output_csv_path = "politicians_with_partial_prediction.csv"
if os.path.exists(output_csv_path):
    politicians_df = pd.read_csv(output_csv_path)
    predictions = politicians_df['prediction'].tolist()  # Load existing predictions
    start_index = 6779 # Manually restarting from where previous request failed
else:
    predictions = [None] * len(politicians_df)  # Initialize empty predictions
    start_index = 0  # Start from the beginning if no previous results

# Log file to store failed article titles or errors
log_file_path = "ores_request_errors.log"

# ORES API constants
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"
API_LATENCY_ASSUMED = 0.002
API_THROTTLE_WAIT = ((60.0*60.0)/5000.0) - API_LATENCY_ASSUMED

# Load access token and other environment variables
from dotenv import load_dotenv
load_dotenv()

ACCESS_TOKEN = os.getenv('ACCESS_TOKEN')

# Function to log failed requests
def log_failed_request(title, error_message):
    with open(log_file_path, "a") as log_file:
        log_file.write(f"{title}: {error_message}\n")

# Function to make ORES request for each article
def request_ores_score_per_article(article_title, article_revid, email_address, access_token):
    # Define request data and header
    request_data = {
        "lang": "en",
        "rev_id": article_revid,
        "features": True
    }
    
    headers = {
        'User-Agent': f"{email_address}, University of Washington, MSDS DATA 512 - AUTUMN 2024",
        'Content-Type': 'application/json',
        'Authorization': f"Bearer {access_token}"
    }
    
    request_url = API_ORES_LIFTWING_ENDPOINT.format(model_name=API_ORES_EN_QUALITY_MODEL)
    
    try:
        time.sleep(API_THROTTLE_WAIT)  # Throttle to avoid hitting rate limits
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        response.raise_for_status()  # Raise an error for bad responses
        json_response = response.json()
        return json_response
    
    except Exception as e:
        log_failed_request(article_title, str(e))
        return None

# Function to extract prediction from the ORES API response
def extract_prediction(json_response, article_revid):
    try:
        prediction = json_response['enwiki']['scores'][str(article_revid)]["articlequality"]["score"]["prediction"]
        return prediction
    except (KeyError, TypeError):
        # The response did not have the expected structure
        return None

# Loop through each article and make ORES requests
# Resume from the next index after the last completed request
for index, row in politicians_df.iloc[start_index:].iterrows():
    article_title = row['name']
    article_revid = row['lastrevid']
    
    if pd.notnull(article_revid):  # Ensure revid exists
        article_revid = int(article_revid)
        print(f"Getting LiftWing ORES scores for '{article_title}' with revid: {article_revid}")
        json_response = request_ores_score_per_article(article_title, article_revid, "apoorvasheera98@gmail.com", ACCESS_TOKEN)
        
        if json_response:
            print("Success")
            prediction = extract_prediction(json_response, article_revid)
            if prediction:
                predictions[index] = prediction
            else:
                log_failed_request(article_title, "Prediction missing in ORES response")
                predictions[index] = None
        else:
            print("Failure")
            predictions[index] = None
    else:
        log_failed_request(article_title, "Missing revid")
        predictions[index] = None

    # Save the updated DataFrame with predictions after each request
    politicians_df['prediction'] = predictions
    politicians_df.to_csv(output_csv_path, index=False)


print(f"Updated CSV saved as {output_csv_path}")
print(f"Failed requests logged to {log_file_path}")


Getting LiftWing ORES scores for 'Ajok Lucy' with revid: 1236299635
Success
Getting LiftWing ORES scores for 'Jack Odur Lutanywa' with revid: 1141688429
Success
Getting LiftWing ORES scores for 'James Mamawi' with revid: 1246243211
Success
Getting LiftWing ORES scores for 'Margaret Rwebyambu' with revid: 1246348833
Success
Getting LiftWing ORES scores for 'Margret Okunga Makoha' with revid: 1246243219
Success
Getting LiftWing ORES scores for 'Jack Maumbe Mukhwana' with revid: 1246244527
Success
Getting LiftWing ORES scores for 'Willy Mayambala' with revid: 1249062770
Success
Getting LiftWing ORES scores for 'Kibirige Mayanja' with revid: 1108117021
Success
Getting LiftWing ORES scores for 'Tamale Mirundi' with revid: 1242742635
Success
Getting LiftWing ORES scores for 'Milly Mugeni' with revid: 1164574518
Success
Getting LiftWing ORES scores for 'Isaac Mulindwa' with revid: 1246242602
Success
Getting LiftWing ORES scores for 'Besueri Kiwanuka Lusse Mulondo' with revid: 1061210640
Succe

#### Handling Timeout Errors and Retrying ORES Requests

In this cell, we handle the cases where previous ORES requests failed due to a 504 Gateway Timeout error. The goal is to identify these failed requests, retry them, and update the predictions in our dataset.

The process is broken into two functions:
1. **get_titles_with_timeout_errors:**  
   This function reads through the error log file (`ores_request_errors.log`) and extracts the titles of articles that encountered a 504 error during the ORES request. These titles are returned as a list, allowing us to retry the requests for just these problematic articles.
   
2. **retry_ores_requests_for_timeout_titles:**  
   This function takes the list of titles with timeout errors and retries the ORES API request for each one. It locates the relevant article in the dataset, makes the ORES request again, and updates the prediction if a valid response is received. After processing all the retries, the updated DataFrame is saved back to the CSV file (`politicians_with_partial_prediction.csv`).

By retrying only the failed requests, we ensure that the entire dataset is updated without reprocessing articles that were already completed successfully. This approach improves efficiency and ensures a complete dataset for analysis.
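
Transient errors such as gateway timeouts can also be retried at the time of the request, with an increasing pause between attempts. The following is a minimal sketch of exponential backoff under the assumption that the request function returns `None` on failure, as the ORES helper in this notebook does. The name `call_with_retries` and the stand-in `flaky_request` are hypothetical.

```python
import time


def call_with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it returns a non-None result or attempts run out.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts.
    """
    for attempt in range(max_attempts):
        result = func()
        if result is not None:
            return result
        if attempt < max_attempts - 1:  # no need to wait after the final attempt
            sleep(base_delay * (2 ** attempt))
    return None


# Stand-in for a request that fails twice and then succeeds
attempts = []

def flaky_request():
    attempts.append(1)
    return {"ok": True} if len(attempts) == 3 else None

# Record the delays instead of actually sleeping
delays = []
result = call_with_retries(flaky_request, max_attempts=4, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

The log-based retry described above remains useful as a second pass for anything that still fails after the in-line attempts.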


In [35]:
# Read the log file and extract titles with the 504 Gateway Timeout error
def get_titles_with_timeout_errors(log_file_path):
    titles_with_timeout = []
    marker = ": 504 Server Error"
    with open(log_file_path, "r") as log_file:
        for line in log_file:
            if marker in line:
                # Split on the error marker rather than the first colon so that
                # titles containing a colon are kept intact
                title = line.split(marker, 1)[0].strip()
                if title not in titles_with_timeout:  # skip repeated log entries
                    titles_with_timeout.append(title)
    return titles_with_timeout

# Retry the ORES requests for the titles with 504 errors
def retry_ores_requests_for_timeout_titles(titles_with_timeout, csv_file_path):
    # Load the existing CSV with predictions
    politicians_df = pd.read_csv(csv_file_path)

    # Loop through the titles and retry requests
    for title in titles_with_timeout:
        row = politicians_df[politicians_df['name'] == title]
        if not row.empty:
            article_revid = row['lastrevid'].values[0]
            if pd.notnull(article_revid):  # Ensure revid exists
                article_revid = int(article_revid)
                print(f"Retrying ORES request for '{title}' with revid: {article_revid}")
                json_response = request_ores_score_per_article(title, article_revid, "apoorvasheera98@gmail.com", ACCESS_TOKEN)
                
                if json_response:
                    prediction = extract_prediction(json_response, article_revid)
                    if prediction:
                        # Update the prediction in the DataFrame
                        politicians_df.loc[politicians_df['name'] == title, 'prediction'] = prediction
                        print(f"Updated prediction for '{title}'")
                    else:
                        print(f"Prediction missing in ORES response for '{title}'")
    
    # Save the updated DataFrame
    politicians_df.to_csv(csv_file_path, index=False)
    print(f"Updated CSV saved as {csv_file_path}")

# Path to the log file and CSV file
log_file_path = "ores_request_errors.log"
csv_file_path = "politicians_with_partial_prediction.csv"

# Get the list of titles with 504 errors
titles_with_timeout_errors = get_titles_with_timeout_errors(log_file_path)

# Retry requests and update the predictions
retry_ores_requests_for_timeout_titles(titles_with_timeout_errors, csv_file_path)

Retrying ORES request for 'André Resampa' with revid: 1191255852
Updated prediction for 'André Resampa'
Retrying ORES request for 'Malik Allahyar Khan' with revid: 1233679331
Updated prediction for 'Malik Allahyar Khan'
Retrying ORES request for 'József Klekl (politician)' with revid: 1185528975
Updated prediction for 'József Klekl (politician)'
Retrying ORES request for 'Jožef Krajnc' with revid: 1239252872
Updated prediction for 'Jožef Krajnc'
Retrying ORES request for 'Matija Majar' with revid: 1195006555
Updated prediction for 'Matija Majar'
Retrying ORES request for 'Aziz Feyzi Pirinççizâde' with revid: 1240060888
Updated prediction for 'Aziz Feyzi Pirinççizâde'
Retrying ORES request for 'Mammetmyrat Geldinyyazov' with revid: 1138728249
Updated prediction for 'Mammetmyrat Geldinyyazov'
Retrying ORES request for 'Karubanga Jacob' with revid: 1234106478
Updated prediction for 'Karubanga Jacob'
Updated CSV saved as politicians_with_partial_prediction.csv


## 2. Creating the Dataset

#### Structuring the Population Data with Regions

We process the population dataset (`population_by_country_AUG.2024.csv`) to add a column for geographic regions. The original dataset includes both countries and regions (e.g., continents or groups of countries), but it does not explicitly associate each country with its region. This restructuring will allow us to later analyze Wikipedia article coverage and quality on both a country-by-country and regional basis.

Key steps in this process:
1. **Remove irrelevant rows:**  
   The row for 'WORLD' is manually removed as it's not useful for our analysis.
   
2. **Detect countries vs. regions:**  
   We use a function to identify whether a row represents a country or a region by checking if the 'Geography' value is in uppercase. Rows with all-uppercase 'Geography' values represent regions (e.g., "AFRICA"), while mixed-case values represent countries.

3. **Assign regions to countries:**  
   We iterate through the dataset and, for each country, assign it to the most recent region (from the uppercase rows). This helps ensure that each country is associated with its respective region.

4. **Save the updated data:**  
   The resulting DataFrame, with the new 'region' column, is saved as `population_by_country_with_regions.csv`, which will be used in subsequent analysis steps.

This restructuring is essential for our analysis of Wikipedia articles at both country and regional levels.
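
The same "carry the most recent region forward" idea can also be expressed without an explicit loop. The sketch below is an alternative formulation, not the code used in this notebook: it keeps the region names with `Series.where`, forward-fills them with `ffill`, and then drops the region rows. The toy rows and population values are placeholders, not real figures.

```python
import pandas as pd

# Toy data with placeholder population values (not real figures)
toy = pd.DataFrame({
    "Geography": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt",
                  "WESTERN AFRICA", "Benin"],
    "Population": [100.0, 50.0, 20.0, 5.0, 15.0, 30.0, 3.0],
})

toy = toy[toy["Geography"] != "WORLD"].copy()

# Region rows are the ones written entirely in uppercase
is_region = toy["Geography"].str.isupper()

# Keep the name on region rows, leave NaN on country rows, then fill downwards
toy["region"] = toy["Geography"].where(is_region).ffill()

# Keep only the country rows
countries = toy[~is_region].reset_index(drop=True)
print(countries)
```

Unlike the loop in the next cell, this version removes the region rows from the result, so the choice depends on whether regional population totals are needed later.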


In [4]:
# Convert the population file to a structured format by adding a 'region' column
# Load the population data
population_df = pd.read_csv("raw-data/population_by_country_AUG.2024.csv")

# Remove the 'WORLD' row manually since it's not relevant
# (.copy() avoids pandas' SettingWithCopyWarning when the 'region' column is added below)
population_df = population_df[population_df['Geography'] != 'WORLD'].copy()

# Create an empty column for region
population_df['region'] = None

# Function to detect if a row is a country by checking if the 'Geography' is NOT all uppercase
def is_country(row):
    return not row['Geography'].isupper()

# Temporary variable to keep track of the current region
current_region = None

# Loop through the rows and update the region for each country
for index, row in population_df.iterrows():
    geography = row['Geography']
    
    # If it's not a country (i.e., the geography name is in all caps), it's a region
    if not is_country(row):
        current_region = geography  # Set the region
    else:
        # If it's a country, assign the current region to the 'region' column
        population_df.at[index, 'region'] = current_region

# Save the updated population DataFrame
output_csv_path = "intermediate-data/population_by_country_with_regions.csv"
population_df.to_csv(output_csv_path, index=False)

print(f"Updated population data with regions saved as {output_csv_path}")


Updated population data with regions saved as intermediate-data/population_by_country_with_regions.csv


#### Merging Politicians and Population Data

We merge the politician article data with the population data to perform country and regional-level analysis. The goal is to combine the Wikipedia article information with country population statistics and geographic region data, allowing us to analyze the distribution of articles and their quality across different countries and regions.

Key steps:
1. **Load the Data:**  
   We load the previously processed `politicians_with_partial_prediction.csv` file, which contains the Wikipedia article data, and reuse the population DataFrame built in the previous step (with its `region` column).

2. **Merging the DataFrames:**  
   The two datasets are merged based on the country field. We use an outer join to ensure that countries missing from either dataset are retained for further investigation. The `_merge` indicator column helps us identify where matches failed.

3. **Identifying Unmatched Countries:**  
   Countries that exist in the Wikipedia dataset but not in the population data (and vice versa) are logged into a text file (`wp_countries-no_match.txt`). This helps us identify any inconsistencies between the datasets and handle them appropriately.

4. **Filtering Matched Data:**  
   We filter the merged DataFrame to retain only the rows where both the Wikipedia and population data matched. These matched rows are used for further analysis.

5. **Renaming and Formatting Columns:**  
   The columns are renamed and reformatted to meet the required format for the final output, which includes country, region, population, article title, revision ID, and article quality.

This step ensures that our dataset is correctly structured for analysis by country and region, while also handling any inconsistencies between the two datasets.
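
To make the role of the `_merge` indicator concrete, here is a small, self-contained illustration on toy data. The country names and values are made up for the example; only the `pd.merge` arguments mirror the ones used in this notebook.

```python
import pandas as pd

# Toy inputs: "Atlantis" has no population row, "Egypt" has no articles
articles = pd.DataFrame({
    "country": ["Benin", "Benin", "Atlantis"],
    "name": ["Article A", "Article B", "Article C"],
})
population = pd.DataFrame({
    "Geography": ["Benin", "Egypt"],
    "Population": [1.0, 2.0],
})

merged = pd.merge(
    articles, population,
    left_on="country", right_on="Geography",
    how="outer", indicator=True,
)

# 'left_only' rows exist only in articles, 'right_only' only in population
left_only = sorted(merged.loc[merged["_merge"] == "left_only", "country"].unique())
right_only = sorted(merged.loc[merged["_merge"] == "right_only", "Geography"].unique())
matched = merged[merged["_merge"] == "both"]

print(left_only, right_only, len(matched))
```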

In [11]:
import pandas as pd

# Load the politicians_with_partial_prediction.csv file
politicians_df = pd.read_csv("intermediate-data/politicians_with_partial_prediction.csv")

# population_df (with its 'region' column) is reused from the previous cell rather than reloaded
# population_df = pd.read_csv("raw-data/population_by_country_AUG.2024.csv")

# Merge the two DataFrames: 'country' from politicians_df and 'Geography' from population_df
merged_df = pd.merge(
    politicians_df[['country', 'name', 'lastrevid', 'prediction']],
    population_df[['Geography', 'Population', 'region']],
    left_on='country',
    right_on='Geography',
    how='outer',  # Use 'outer' to ensure we capture non-matching countries on both sides
    indicator=True  # This will add a column to indicate where the match failed
)

# Separate unmatched countries from Wikipedia (left_only) and unmatched countries from Population (right_only)
unmatched_wp_countries = merged_df[merged_df['_merge'] == 'left_only']['country'].unique()
unmatched_population_countries = merged_df[merged_df['_merge'] == 'right_only']['Geography'].unique()

# Write unmatched countries to wp_countries-no_match.txt
log_file_path = "wp_countries-no_match.txt"
with open(log_file_path, "w") as log_file:
    # Log unmatched countries from Wikipedia dataset
    log_file.write("Unmatched Wikipedia Countries:\n")
    for country in unmatched_wp_countries:
        log_file.write(f"{country}\n")

    # Log unmatched countries from Population dataset
    log_file.write("\nUnmatched Population Countries:\n")
    for country in unmatched_population_countries:
        log_file.write(f"{country}\n")

print(f"Unmatched countries from both datasets logged to {log_file_path}")

# Keep only the rows that matched in both datasets for the final output
matched_df = merged_df[merged_df['_merge'] == 'both']

# Rename and reformat the columns to match the final CSV requirement
matched_df = matched_df.rename(columns={
    'name': 'article_title',
    'lastrevid': 'revision_id',
    'prediction': 'article_quality',
    'Population': 'population'
})

# Select and reorder the required columns for the final output
final_columns = ['country', 'region', 'population', 'article_title', 'revision_id', 'article_quality']
final_df = matched_df[final_columns]


Unmatched countries from both datasets logged to wp_countries-no_match.txt


#### Handling Unmatched Countries and Applying Corrections

We load the list of unmatched countries and apply corrections for known issues (e.g., punctuation differences). For each corrected country, we merge the corresponding population and region data into the final dataset. We then update the unmatched countries list by removing corrected entries and filtering out regions or continents. Finally, we log any remaining unmatched countries and save the updated dataset to `wp_politicians_by_country.csv`.
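
Another way to handle known spelling differences is to translate the country names into a separate merge key before joining, so that corrected countries flow through the same merge as everything else. This is a sketch of that alternative, not the approach taken in the cell below. The `corrections` mapping is the one used in this notebook, while the `merge_key` column, the toy rows, and the population values are placeholders.

```python
import pandas as pd

# Wikipedia spelling -> population-file spelling (same mapping as in this notebook)
corrections = {
    "Guinea-Bissau": "GuineaBissau",
    "Korea, South": "Korea (South)",
}

# Toy inputs with placeholder values
politicians = pd.DataFrame({
    "country": ["Guinea-Bissau", "Korea, South", "Benin"],
    "name": ["Article 1", "Article 2", "Article 3"],
})
population = pd.DataFrame({
    "Geography": ["GuineaBissau", "Korea (South)", "Benin"],
    "Population": [1.0, 2.0, 3.0],
})

# Series.replace with a dict swaps exact matches and leaves other values untouched
politicians["merge_key"] = politicians["country"].replace(corrections)

merged = politicians.merge(population, left_on="merge_key", right_on="Geography", how="left")
print(merged[["country", "Population"]])
```

Keeping the original `country` column alongside the key preserves the Wikipedia spelling in the final output.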


In [12]:
# Load the unmatched countries file (initial version)
with open("wp_countries-no_match.txt", "r") as log_file:
    unmatched_countries = log_file.read().splitlines()

# Known punctuation issues to be corrected
corrections = {
    "Guinea-Bissau": "GuineaBissau",
    "Korea, South": "Korea (South)"
}

# 1. For each correction, take the politician rows for that country and attach the matching population data
for incorrect, correct in corrections.items():
    # Politician rows use the key spelling; population rows use the corrected spelling
    corrected_row = politicians_df[politicians_df['country'] == incorrect].copy()
    population_data = population_df[population_df['Geography'] == correct].copy()
    
    # If a match is found, add the corrected population and region to the corrected_row
    if not population_data.empty:
        corrected_row['population'] = population_data['Population'].values[0]
        corrected_row['region'] = population_data['region'].values[0]
        
        corrected_row = corrected_row.drop(columns=['url'])
        # Rename columns to match the final format
        corrected_row = corrected_row.rename(columns={
            'name': 'article_title',
            'lastrevid': 'revision_id',
            'prediction': 'article_quality'
        })

        
        # Append the corrected row back to matched_df
        final_df = pd.concat([final_df, corrected_row])


# 2. Remove the corrected countries (in either spelling) from the unmatched list
resolved_names = set(corrections.keys()) | set(corrections.values())
unmatched_countries = [country for country in unmatched_countries if country not in resolved_names]

# 3. Remove all-uppercase rows (regions/continents) from the unmatched list
unmatched_countries = [country for country in unmatched_countries if not country.isupper()]

# Overwrite the log file with the updated unmatched list
log_file_path = "wp_countries-no_match.txt"
with open(log_file_path, "w") as log_file:
    for country in unmatched_countries:
        log_file.write(f"{country}\n")

print(f"Updated unmatched countries logged to {log_file_path}")

# Save the updated final dataset with corrections applied
final_df.to_csv("wp_politicians_by_country.csv", index=False)

print("Updated dataframe saved as wp_politicians_by_country.csv")

Updated unmatched countries logged to wp_countries-no_match.txt
Updated dataframe saved as wp_politicians_by_country.csv
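
The same corrections could also be applied before merging, by mapping the Wikipedia spellings onto the population dataset's spellings. The following is a minimal sketch on toy data, assuming the column names `country`, `Geography`, and `Population` used above. The frames `toy_politicians` and `toy_population`, the helper column `geo_key`, and the population numbers are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical toy frames; the real data comes from the CSV files loaded earlier
toy_politicians = pd.DataFrame({
    "name": ["Person A", "Person B", "Person C"],
    "country": ["Guinea-Bissau", "Korea, South", "Albania"],
})
toy_population = pd.DataFrame({
    "Geography": ["GuineaBissau", "Korea (South)", "Albania"],
    "Population": [2.2, 51.3, 2.7],  # illustrative values, in millions
})

# Wikipedia spelling -> population-dataset spelling
corrections = {"Guinea-Bissau": "GuineaBissau", "Korea, South": "Korea (South)"}

# Build a join key without overwriting the original country name
toy_politicians["geo_key"] = toy_politicians["country"].replace(corrections)

toy_merged = toy_politicians.merge(
    toy_population,
    left_on="geo_key",
    right_on="Geography",
    how="left",
    indicator=True,
)

# With the mapping applied, every row finds a population match
print(toy_merged[["country", "Population", "_merge"]])
```

Keeping the original `country` column and joining on a separate key means the final dataset still reports the Wikipedia spelling.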


## 3. Analysis

#### Calculating Articles per Capita and High-Quality Articles

We calculate total articles and high-quality articles (FA or GA) per million people for each country. First, we group the data by country to count its articles and take its population, which the dataset records in millions. Dividing the article count by that population gives articles per million people. We then filter for high-quality articles, count them per country, merge the counts into the country-level data (countries with none get 0), and compute the high-quality rate the same way.


In [18]:
# Load the final dataset
final_df = pd.read_csv("wp_politicians_by_country.csv")

# Group by country to get the number of articles and the population per country
# (population is repeated on every article row, so 'first' picks it up once)
country_grouped = final_df.groupby('country').agg(
    total_articles=('article_title', 'count'),
    total_population=('population', 'first')
)

# Articles per million people (population is recorded in millions)
country_grouped['articles_per_mil_capita'] = country_grouped['total_articles'] / country_grouped['total_population']

# Filter high-quality articles (FA, GA)
high_quality_articles = final_df[final_df['article_quality'].isin(['FA', 'GA'])]

# Group by country to get high-quality articles count
high_quality_grouped = high_quality_articles.groupby('country').agg(
    high_quality_articles=('article_title', 'count')
)

# Merge high-quality articles count into the country-level data
country_grouped = country_grouped.merge(high_quality_grouped, on='country', how='left').fillna(0)

# High-quality articles per million people
country_grouped['high_quality_articles_per_mil_capita'] = country_grouped['high_quality_articles'] / country_grouped['total_population']

# Show the calculated values
country_grouped.head()

country,total_articles,total_population,articles_per_mil_capita,high_quality_articles,high_quality_articles_per_mil_capita
Afghanistan,85,42.4,2.004717,3.0,0.070755
Albania,70,2.7,25.925926,7.0,2.592593
Algeria,71,46.8,1.517094,1.0,0.021368
Angola,58,36.7,1.580381,2.0,0.054496
Antigua and Barbuda,33,0.1,330.0,0.0,0.0
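
Because `population` is expressed in millions (Afghanistan is listed as 42.4), dividing an article count by it gives articles per million people directly. The sketch below reproduces the calculation for two rows of the output above, with the counts and populations copied from that output. The frame name `toy` is hypothetical.

```python
import pandas as pd

# Values copied from the output above; population is in millions
toy = pd.DataFrame(
    {
        "total_articles": [85, 33],
        "total_population": [42.4, 0.1],
        "high_quality_articles": [3, 0],
    },
    index=pd.Index(["Afghanistan", "Antigua and Barbuda"], name="country"),
)

# Same formulas as in the cell above
toy["articles_per_mil_capita"] = toy["total_articles"] / toy["total_population"]
toy["high_quality_articles_per_mil_capita"] = (
    toy["high_quality_articles"] / toy["total_population"]
)

print(toy.round(6))
```

A population of 0.1 million turns 33 articles into a rate of 330, so the rates for very small countries are sensitive to rounding in the population figure.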


#### Calculating Total Articles and High-Quality Articles Per Capita (Regional-Level)

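This step calculates total and high-quality articles per million people at the regional level. We group the data by region to count articles and sum `population`, then divide. Because `final_df` has one row per article, this sum adds each country's population once per article, so the `total_population` values below exceed the regions' actual populations and the resulting rates are understated. High-quality articles are grouped by region in the same way, and the results are merged to compute the high-quality rate for each region.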

In [19]:
# Group by region for regional totals
region_grouped = final_df.groupby('region').agg(
    total_articles=('article_title', 'count'),
    total_population=('population', 'sum')
)

# Articles per million people for each region
region_grouped['articles_per_mil_capita'] = region_grouped['total_articles'] / region_grouped['total_population']

# Group high-quality articles by region
high_quality_region_grouped = high_quality_articles.groupby('region').agg(
    high_quality_articles=('article_title', 'count')
)

# Merge with the regional data
region_grouped = region_grouped.merge(high_quality_region_grouped, on='region', how='left').fillna(0)

# High-quality articles per million people for each region
region_grouped['high_quality_articles_per_mil_capita'] = region_grouped['high_quality_articles'] / region_grouped['total_population']

# Show the calculated values
region_grouped.head()


region,total_articles,total_population,articles_per_mil_capita,high_quality_articles,high_quality_articles_per_mil_capita
CARIBBEAN,219,1414.9,0.154781,9,0.006361
CENTRAL AMERICA,188,1418.8,0.132506,10,0.007048
CENTRAL ASIA,106,1983.6,0.053438,5,0.002521
EAST ASIA,230,41422.0,0.005553,13,0.000314
EASTERN AFRICA,665,23941.2,0.027776,17,0.00071
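
One way to avoid counting a country's population once per article is to reduce the data to one row per country before summing within each region. The following is a minimal sketch on hypothetical toy data; the frame name `toy_articles`, the region and country labels, and the numbers are illustrative and not taken from the dataset.

```python
import pandas as pd

# Hypothetical article-level data: one row per article, population in millions
toy_articles = pd.DataFrame({
    "region": ["R1", "R1", "R1", "R2"],
    "country": ["A", "A", "B", "C"],
    "population": [10.0, 10.0, 5.0, 20.0],
    "article_title": ["a1", "a2", "b1", "c1"],
})

# Summing over article rows counts country A's population twice
naive_population = toy_articles.groupby("region")["population"].sum()

# Keep one row per country, then sum within each region
region_population = (
    toy_articles.drop_duplicates(subset="country")
    .groupby("region")["population"]
    .sum()
)

# Articles are still counted over all article rows
region_articles = toy_articles.groupby("region")["article_title"].count()
articles_per_million = region_articles / region_population

print(naive_population["R1"], region_population["R1"])
print(articles_per_million)
```

In this toy example the naive sum for `R1` is 25.0 while the deduplicated sum is 15.0, which changes the rate from 0.12 to 0.2 articles per million.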


#### Result 1: Top 10 Countries by Coverage (Articles Per Capita)

In [20]:
# Top 10 countries by total articles per capita
top_10_countries_by_coverage = country_grouped.sort_values('articles_per_mil_capita', ascending=False).head(10)

# Display the top 10 countries
top_10_countries_by_coverage[['total_articles', 'articles_per_mil_capita']].style.format({'articles_per_mil_capita': '{:.10f}'})

country,total_articles,articles_per_mil_capita
Monaco,10,inf
Tuvalu,1,inf
Antigua and Barbuda,33,330.0
Federated States of Micronesia,14,140.0
Marshall Islands,13,130.0
Tonga,10,100.0
Barbados,25,83.3333333333
Seychelles,6,60.0
Montenegro,36,60.0
Maldives,33,55.0
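
The `inf` values for Monaco and Tuvalu arise because pandas returns infinity when a nonzero count is divided by 0, which suggests that these populations are recorded as 0.0 million after rounding. A possible way to keep such rows out of the ranking is to treat a zero population as missing before dividing. The sketch below uses hypothetical toy values and a hypothetical frame name `toy_countries`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; a population of 0.0 stands for "rounds to zero millions"
toy_countries = pd.DataFrame(
    {"total_articles": [10, 33, 25], "total_population": [0.0, 0.1, 0.3]},
    index=pd.Index(["Tiny", "Small", "Medium"], name="country"),
)

# Unguarded division produces inf for the zero-population row
unguarded = toy_countries["total_articles"] / toy_countries["total_population"]

# Replace 0 with NaN so the rate becomes NaN instead of inf
safe_population = toy_countries["total_population"].replace(0, np.nan)
toy_countries["articles_per_mil_capita"] = (
    toy_countries["total_articles"] / safe_population
)

# Rows without a usable rate are dropped before ranking
ranked = toy_countries.dropna(subset=["articles_per_mil_capita"]).sort_values(
    "articles_per_mil_capita", ascending=False
)
print(ranked)
```

Whether to exclude such countries or to look up a more precise population figure is a judgment call; the sketch only shows the exclusion.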


#### Result 2: Bottom 10 Countries by Coverage (Articles Per Capita)

In [22]:
# Bottom 10 countries by total articles per capita
bottom_10_countries_by_coverage = country_grouped.sort_values('articles_per_mil_capita', ascending=True).head(10)

# Display the bottom 10 countries
bottom_10_countries_by_coverage[['total_articles', 'articles_per_mil_capita']].style.format({'articles_per_mil_capita': '{:.10f}'})

country,total_articles,articles_per_mil_capita
China,16,0.0113370651
India,151,0.105697886
Ghana,4,0.1173020528
Saudi Arabia,5,0.135501355
Zambia,3,0.1485148515
Norway,1,0.1818181818
Israel,2,0.2040816327
Egypt,32,0.3041825095
Cote d'Ivoire,10,0.3236245955
Ethiopia,44,0.347826087


#### Result 3: Top 10 Countries by High-Quality Articles Per Capita

In [24]:
# Top 10 countries by high-quality articles per capita
top_10_countries_by_high_quality = country_grouped.sort_values('high_quality_articles_per_mil_capita', ascending=False).head(10)

# Display the top 10 countries
top_10_countries_by_high_quality[['high_quality_articles', 'high_quality_articles_per_mil_capita']].style.format({'high_quality_articles_per_mil_capita': '{:.10f}'})

country,high_quality_articles,high_quality_articles_per_mil_capita
Montenegro,3.0,5.0
Luxembourg,2.0,2.8571428571
Albania,7.0,2.5925925926
Kosovo,4.0,2.3529411765
Maldives,1.0,1.6666666667
Lithuania,4.0,1.3793103448
Croatia,5.0,1.3157894737
Guyana,1.0,1.25
Palestinian Territory,6.0,1.0909090909
Slovenia,2.0,0.9523809524


#### Result 4: Bottom 10 Countries by High-Quality Articles Per Capita

In [25]:
# Bottom 10 countries by high-quality articles per capita
bottom_10_countries_by_high_quality = country_grouped.sort_values('high_quality_articles_per_mil_capita', ascending=True).head(10)

# Display the bottom 10 countries
bottom_10_countries_by_high_quality[['high_quality_articles', 'high_quality_articles_per_mil_capita']].style.format({'high_quality_articles_per_mil_capita': '{:.10f}'})

country,high_quality_articles,high_quality_articles_per_mil_capita
Zimbabwe,0.0,0.0
Qatar,0.0,0.0
Grenada,0.0,0.0
Gambia,0.0,0.0
Samoa,0.0,0.0
Senegal,0.0,0.0
Federated States of Micronesia,0.0,0.0
Estonia,0.0,0.0
Eritrea,0.0,0.0
Equatorial Guinea,0.0,0.0


#### Result 5: Geographic Regions by Total Coverage

In [27]:
# Geographic regions by total articles per capita
regions_by_coverage = region_grouped.sort_values('articles_per_mil_capita', ascending=False)

# Display the regions ranked by total coverage
regions_by_coverage[['total_articles', 'articles_per_mil_capita']].style.format({'articles_per_mil_capita': '{:.10f}'})

region,total_articles,articles_per_mil_capita
OCEANIA,72,0.6480648065
NORTHERN EUROPE,191,0.1643576284
CARIBBEAN,219,0.1547812566
CENTRAL AMERICA,188,0.1325063434
CENTRAL ASIA,106,0.0534381932
WESTERN ASIA,610,0.0456262388
SOUTHERN EUROPE,797,0.0443847944
EASTERN AFRICA,665,0.0277763855
WESTERN EUROPE,498,0.0262521152
NORTHERN AFRICA,302,0.0248071694


#### Result 6: Geographic Regions by High-Quality Coverage

In [28]:
# Geographic regions by high-quality articles per capita
regions_by_high_quality_coverage = region_grouped.sort_values('high_quality_articles_per_mil_capita', ascending=False)

# Display the regions ranked by high-quality coverage
regions_by_high_quality_coverage[['high_quality_articles', 'high_quality_articles_per_mil_capita']].style.format({'high_quality_articles_per_mil_capita': '{:.10f}'})

region,high_quality_articles,high_quality_articles_per_mil_capita
OCEANIA,1,0.0090009001
NORTHERN EUROPE,9,0.0077446003
CENTRAL AMERICA,10,0.0070482098
CARIBBEAN,9,0.0063608736
SOUTHERN EUROPE,53,0.002951561
CENTRAL ASIA,5,0.0025206695
WESTERN ASIA,27,0.002019522
NORTHERN AFRICA,17,0.0013964301
SOUTHERN AFRICA,8,0.0013435668
EASTERN EUROPE,38,0.0013083192
