## Imports and Dependencies

This section imports the libraries used throughout the notebook for data processing, HTTP requests, and timing.

### Libraries Overview

- **`pandas`**: A data analysis library used here to load, clean, and transform tabular data.
- **`requests`**: A Python library used for sending HTTP requests to interact with external APIs.
- **`time`**: A module that provides time-related functions, used to handle delays or capture timestamps.
- **`json`**: A module for working with JSON data, which allows easy parsing and manipulation of JSON files.

In [44]:
#Python Standard Libraries
import time
import json

# Install these with pip/pip3 if you do not already have them
import pandas as pd
import requests

## Data Loading

In this section, we load the datasets that will be used for further analysis. We are working with two datasets:

1. **Politicians Data**: This dataset contains information about politicians and their associated Wikipedia articles.
2. **Population Data**: This dataset includes the population of different countries and regions as of August 2024. The values appear to be in millions (for example, 8009.0 for the world).

In [59]:
# Import the dataset containing politicians' information
politicians_df = pd.read_csv('../data/politicians_by_country_AUG.2024.csv')

# Import the dataset containing population statistics by country
population_df = pd.read_csv('../data/population_by_country_AUG.2024.csv')

# Display the first few records of the politicians dataset to verify it loaded correctly
print(politicians_df.head())

# Count the non-null entries in each column of the politicians dataset
# (not shown in the output: a notebook only echoes the last expression of a cell)
politicians_df.count()

# Display the first few records of the population dataset to verify it loaded correctly
print(population_df.head())

# Display the count of non-null entries in each column of the population dataset
population_df.count()



                   name                                                url  \
0        Majah Ha Adrif       https://en.wikipedia.org/wiki/Majah_Ha_Adrif   
1     Haroon al-Afghani    https://en.wikipedia.org/wiki/Haroon_al-Afghani   
2           Tayyab Agha          https://en.wikipedia.org/wiki/Tayyab_Agha   
3  Khadija Zahra Ahmadi  https://en.wikipedia.org/wiki/Khadija_Zahra_Ah...   
4        Aziza Ahmadyar       https://en.wikipedia.org/wiki/Aziza_Ahmadyar   

       country  
0  Afghanistan  
1  Afghanistan  
2  Afghanistan  
3  Afghanistan  
4  Afghanistan  
         Geography  Population
0            WORLD      8009.0
1           AFRICA      1453.0
2  NORTHERN AFRICA       256.0
3          Algeria        46.8
4            Egypt       105.2


Geography     233
Population    233
dtype: int64

## Duplicate Detection

In this section, we identify any potential duplicates in the dataset of politicians. We will perform checks based on two different criteria:

1. **All Columns**: Checking for exact duplicates across all columns.
2. **Name or URL**: Checking separately for rows whose name or URL repeats one that appeared earlier in the dataset.


In [60]:
# Identify and count duplicates considering all columns
duplicate_politicians_all = politicians_df[politicians_df.duplicated()]
print(len(duplicate_politicians_all))

# Identify and count duplicates based only on the 'name' column
duplicate_politicians_name = politicians_df[politicians_df.duplicated(subset=['name'])]

# Identify and count duplicates based only on the 'url' column
duplicate_politicians_url = politicians_df[politicians_df.duplicated(subset=['url'])]

# Display the number of duplicated entries found for name and URL
print(len(duplicate_politicians_name))
print(len(duplicate_politicians_url))

# Explanation: No rows are exact duplicates, yet names and URLs repeat,
# so the repeated rows differ only in the country field.

0
44
44
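To make the three counts above easier to interpret, here is a minimal sketch of how `DataFrame.duplicated` behaves on a toy table. The names, URLs, and countries are hypothetical placeholders, not rows from the dataset. By default (`keep='first'`) only the later occurrences are flagged, so a count of 44 means 44 repeated rows, not necessarily 44 distinct people.

```python
import pandas as pd

# Hypothetical toy data: one article listed under two countries
toy = pd.DataFrame({
    "name": ["A. Example", "A. Example", "B. Sample"],
    "url": [
        "https://en.wikipedia.org/wiki/A_Example",
        "https://en.wikipedia.org/wiki/A_Example",
        "https://en.wikipedia.org/wiki/B_Sample",
    ],
    "country": ["Country X", "Country Y", "Country X"],
})

# Exact duplicates across all columns: none, because the country differs
exact_count = int(toy.duplicated().sum())

# Default keep='first': only the second occurrence of the URL is flagged
later_count = int(toy.duplicated(subset=["url"]).sum())

# keep=False: every row that shares its URL with another row is flagged
every_count = int(toy.duplicated(subset=["url"], keep=False).sum())

print(exact_count, later_count, every_count)
```

Passing `keep=False` is one way to see all rows involved in a repetition, including the first occurrence.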


## Data Cleaning: Removing Zero Population Rows

In this step, we clean the population dataset by removing rows where the population value is zero. This ensures that we are working with valid data and that any countries or regions with a population of zero are excluded from further analysis.


In [61]:
# Remove rows where the population value is 0
# This ensures that we only consider countries or regions with a valid population for analysis
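# .copy() gives an independent DataFrame, so adding a column later does not trigger a SettingWithCopyWarning
population_df = population_df[population_df['Population'] > 0].copy()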

# After filtering, only countries/regions with a population greater than 0 remain in the DataFrame


## Checking for Duplicates in Population Data

To ensure data accuracy, we check the population dataset for duplicates, both across all columns and within the `Geography` column alone. This step only identifies redundant entries; it does not remove any rows.


In [62]:
# Check for duplicates across all columns in the population DataFrame
# This will identify any rows that are exact duplicates (i.e., all column values are the same)
duplicate_population_all = population_df[population_df.duplicated()]
print(f"Number of duplicate rows across all columns: {len(duplicate_population_all)}")

# Check for duplicates based only on the 'Geography' column
# This will check if any countries or regions appear more than once in the dataset based on their 'Geography'
duplicate_population_geo = population_df[population_df.duplicated(subset=['Geography'])]
print(f"Number of duplicate entries based on 'Geography': {len(duplicate_population_geo)}")



Number of duplicate rows across all columns: 0
Number of duplicate entries based on 'Geography': 0


## Handling Politicians Appearing in Multiple Countries

Some politicians appear in the dataset under more than one country. Based on the information retrieved from the Wikipedia API, this is either because of their nationalities or because they have served in more than one country. The duplicate check flagged 44 such repeated rows.

Since it makes sense for these politicians to be part of all the countries they are associated with, we will retain them in the dataset for each country. However, for record-keeping purposes, we will create and store a separate file containing these duplicates.


In [63]:
# 44 rows repeat a politician already listed under another country, reflecting either
# multiple nationalities or service in more than one country (based on the Wikipedia API).
# These politicians are kept under every country they are listed in.

# Combine the rows flagged by name and by URL.
# Note: duplicated() leaves the first occurrence unflagged by default,
# so only the repeated rows are collected here.
combined_duplicates = pd.concat([duplicate_politicians_name, duplicate_politicians_url]).drop_duplicates()

# Save the combined duplicate entries for future reference or analysis.
# This will store the list of politicians who appear in multiple countries into a CSV file.
combined_duplicates.to_csv('../data/duplicate_politicians.csv', index=False)
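As a complementary view, the following sketch counts how many distinct countries each article URL is listed under, which gives the number of distinct politicians rather than the number of repeated rows. The toy rows are hypothetical placeholders; on the real data, `toy` would be replaced by `politicians_df`.

```python
import pandas as pd

# Hypothetical toy data standing in for politicians_df
toy = pd.DataFrame({
    "name": ["A. Example", "A. Example", "B. Sample"],
    "url": [
        "https://en.wikipedia.org/wiki/A_Example",
        "https://en.wikipedia.org/wiki/A_Example",
        "https://en.wikipedia.org/wiki/B_Sample",
    ],
    "country": ["Country X", "Country Y", "Country X"],
})

# Number of distinct countries per article URL
countries_per_article = toy.groupby("url")["country"].nunique()

# Articles listed under more than one country
multi_country = countries_per_article[countries_per_article > 1]

# All rows for those articles, including the first occurrence
multi_country_rows = toy[toy["url"].isin(multi_country.index)]

print(len(multi_country), len(multi_country_rows))
```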




## Checking for Missing Values

Before proceeding with the analysis, it is important to check for any missing values in the dataset. Missing values can affect the accuracy of the analysis, and identifying them allows us to take appropriate action, such as filling or removing them.


In [64]:
# Check for missing values in each column of the politicians dataset
# This will display the number of missing (null) values for each column in the DataFrame
missing_values_politicians = politicians_df.isnull().sum()
print("Missing values in politicians dataset:")
print(missing_values_politicians)

# Check for missing values in each column of the population dataset
# This will display the count of missing values per column, helping to identify any data quality issues
missing_values_population = population_df.isnull().sum()
print("\nMissing values in population dataset:")
print(missing_values_population)



Missing values in politicians dataset:
name       0
url        0
country    0
dtype: int64

Missing values in population dataset:
Geography     0
Population    0
dtype: int64


## Constants and API Request Setup

To fetch revision IDs and other information for Wikipedia articles, we set up the constants, headers, and functions needed to query the Wikipedia API while staying within its rate limits.

### API Constants

The following constants have been defined for the API requests:
- **API_ENWIKIPEDIA_ENDPOINT**: The base URL for Wikipedia's API.
- **API_HEADER_AGENT**: The HTTP header for the user-agent string, identifying the requester.
- **API_LATENCY_ASSUMED**: Assumed latency (in seconds) for API requests (roughly 2ms).
- **API_THROTTLE_WAIT**: The calculated wait time to ensure we don’t exceed the API rate limit (100 requests per second).
- **REQUEST_HEADERS**: Headers used in the API request, including the user-agent information that specifies the course and university details.
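The throttle arithmetic described above can be worked through as a small sketch. `MAX_REQUESTS_PER_SECOND` and `n_requests` are hypothetical names introduced only for illustration. The total is a lower bound under the assumed 2 ms latency; real network latency is usually higher, so an actual run would take longer.

```python
import math

API_LATENCY_ASSUMED = 0.002        # seconds, as in the notebook
MAX_REQUESTS_PER_SECOND = 100.0    # hypothetical name for the rate limit

# Time budget per request minus the assumed latency
throttle_wait = (1.0 / MAX_REQUESTS_PER_SECOND) - API_LATENCY_ASSUMED

# Lower bound on total run time for a hypothetical number of requests
n_requests = 1000
min_total_seconds = n_requests * (throttle_wait + API_LATENCY_ASSUMED)

print(f"Wait per request: {throttle_wait:.3f} s")
print(f"Minimum time for {n_requests} requests: {min_total_seconds:.1f} s")
```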

### Request Parameters Template

We have created a template for querying the Wikipedia API for page information:
- **PAGEINFO_PARAMS_TEMPLATE**: A dictionary that defines the basic structure of the request, including:
  - **action**: Specifies the action to be performed, in this case, querying for information.
  - **format**: The format of the response, set to `json`.
  - **titles**: Placeholder for the article title.
  - **prop**: Property to retrieve, in this case, basic info about the page.
  - **inprop**: Additional properties to include, such as the URL and talk page ID.
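To show what the template turns into on the wire, this sketch builds the query string with the standard library's `urlencode`, without making a network call. It assumes `requests` encodes the parameters in a comparable way. The article title is taken from the sample rows shown earlier.

```python
from urllib.parse import urlencode

API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",
    "prop": "info",
    "inprop": "url|talkid",
}

# Copy the template and fill in the title, leaving the template itself unchanged
params = dict(PAGEINFO_PARAMS_TEMPLATE, titles="Tayyab_Agha")

# The '|' separator is percent-encoded as %7C in the query string
url = f"{API_ENWIKIPEDIA_ENDPOINT}?{urlencode(params)}"
print(url)
```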


In [65]:
# Constants

# Wikipedia API endpoint to fetch page information
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# Header for the API request, which includes the User-Agent for proper identification
API_HEADER_AGENT = 'User-Agent'

# Assumed network latency for the API requests (in seconds)
API_LATENCY_ASSUMED = 0.002  # Assuming roughly 2ms latency on the API and network

# API throttle limit to prevent overloading the server (100 requests per second max)
API_THROTTLE_WAIT = (1.0 / 100.0) - API_LATENCY_ASSUMED  # Adjusting for latency

# Request headers used in each API request (User-Agent identifies the request source)
REQUEST_HEADERS = {
    'User-Agent': '<trips@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

# Template for requesting page info from Wikipedia API (this template will be copied and updated for each request)
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",        # The API action (query)
    "format": "json",         # The format for the response (JSON)
    "titles": "",             # Placeholder for the article title (filled in dynamically)
    "prop": "info",           # Retrieve basic info about the page
    "inprop": "url|talkid"    # Include the URL and the talk page ID in the response
}

# Function to request page info and retrieve revision ID for a specific article
def request_pageinfo_per_article(article_title):
    # Make a copy of the template and update the title field with the given article title
    request_template = PAGEINFO_PARAMS_TEMPLATE.copy()
    request_template['titles'] = article_title

    # Add a small wait to respect the API rate limit and avoid overloading the server
    if API_THROTTLE_WAIT > 0.0:
        time.sleep(API_THROTTLE_WAIT)  # Sleep for the calculated throttle time

    try:
        # Send a GET request to the Wikipedia API with the request template and headers
        response = requests.get(API_ENWIKIPEDIA_ENDPOINT, headers=REQUEST_HEADERS, params=request_template)
        # Parse the JSON response
        json_response = response.json()
        
        # Extract the 'pages' section of the JSON response to get page info
        pages = json_response["query"]["pages"]
        for page_id, page_info in pages.items():
            # Extract the last revision ID for the article
            revision_id = page_info.get("lastrevid", None)
            if revision_id:
                return revision_id  # Return the revision ID if available
            else:
                return None  # Return None if no revision ID found
    except Exception as e:
        # Handle any errors that occur during the API request
        print(f"Error fetching page info for {article_title}: {e}")
        return None  # Return None in case of error
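The parsing step in the function above can be exercised offline. The sketch below uses two hand-written responses whose IDs and titles are made up for illustration. The shape of the missing-page response (a `"-1"` key with a `"missing"` field) is an assumption about how the API reports a title it cannot find. `extract_lastrevid` is a hypothetical helper, not part of the notebook.

```python
def extract_lastrevid(json_response):
    """Return the lastrevid of the first page in a pageinfo response, or None."""
    pages = json_response.get("query", {}).get("pages", {})
    for page_info in pages.values():
        # A missing page carries no 'lastrevid', so .get() yields None
        return page_info.get("lastrevid")
    return None  # No pages at all in the response

# Illustrative response for an existing page (IDs are made up)
found = {"query": {"pages": {"1": {"pageid": 1, "title": "Example", "lastrevid": 1000}}}}

# Illustrative response for a missing page (assumed shape)
missing = {"query": {"pages": {"-1": {"ns": 0, "title": "No Such Page", "missing": ""}}}}

print(extract_lastrevid(found))
print(extract_lastrevid(missing))
```

Using `.get()` with defaults avoids a `KeyError` when the response lacks a `query` section, for example when the API returns an error payload instead.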


## ORES API Setup

In this section, we configure and define functions to interact with the **Objective Revision Evaluation Service (ORES)** provided by the Wikimedia Foundation. ORES lets us retrieve quality predictions for Wikipedia articles based on their revision IDs.

### ORES API Constants

We define the constants required for making requests to the ORES API:
- **ORES_ENDPOINT**: The base URL for querying ORES.
- **ORES_MODEL**: The machine learning model used to predict the quality of Wikipedia articles. For this project, we are using the `wp10` model, which categorizes articles into various quality classes (e.g., Stub, Start, B, GA, FA).

### Function to Retrieve ORES Quality Prediction

The function `get_ores_quality_prediction(article_title, revision_id)` takes an article title and its revision ID as input and queries the ORES API for the article's quality prediction. It processes the API response and returns the predicted quality class (e.g., `FA` or `GA`), or `None` if the request fails.


In [66]:
# ORES API Constants

# Endpoint for querying article quality scores using the ORES API
ORES_ENDPOINT = "https://ores.wikimedia.org/v3/scores/enwiki/"

# ORES model used for quality prediction; 'wp10' is the model used to classify article quality
ORES_MODEL = "wp10"

# Function to request article quality prediction from the ORES API
def get_ores_quality_prediction(article_title, revision_id):
    """
    This function takes an article title and its Wikipedia revision ID, sends a request to the ORES API to 
    get the article's quality prediction, and returns the predicted quality score (e.g., 'FA', 'GA', etc.).
    """
    try:
        # Construct the full URL for querying ORES API with the model and revision ID
        ores_url = f"{ORES_ENDPOINT}?models={ORES_MODEL}&revids={revision_id}"
        
        # Send a GET request to the ORES API to fetch the quality prediction
        response = requests.get(ores_url)

        # Check if the request was successful (HTTP status 200)
        if response.status_code == 200:
            # Parse the JSON response to extract the predicted quality score
            data = response.json()
            # Navigate to the specific part of the response that holds the prediction for the revision ID
            scores = data['enwiki']['scores'][str(revision_id)][ORES_MODEL]['score']['prediction']
            return scores  # Return the quality prediction (e.g., 'FA', 'GA')
        else:
            # Handle cases where the API request fails (e.g., bad response)
            print(f"Failed to get ORES score for {article_title} (Revision ID: {revision_id})")
            return None
    except Exception as e:
        # Handle any exceptions (e.g., network or parsing issues)
        print(f"Error getting ORES score for {article_title}: {e}")
        return None
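The nested lookup used in the function above can be tried offline on hand-written responses. In this sketch the revision IDs and probabilities are made up. The `error` entry is an assumption about how the service reports a revision it cannot score. `extract_prediction` is a hypothetical helper, not part of the notebook.

```python
def extract_prediction(data, revision_id, model="wp10", wiki="enwiki"):
    """Return the predicted quality class for a revision, or None if absent."""
    model_result = (
        data.get(wiki, {})
            .get("scores", {})
            .get(str(revision_id), {})
            .get(model, {})
    )
    # On failure there is no 'score' entry, so the lookup falls through to None
    return model_result.get("score", {}).get("prediction")

# Illustrative successful response (values are made up)
ok = {"enwiki": {"scores": {"1000": {"wp10": {"score": {
    "prediction": "Stub",
    "probability": {"Stub": 0.9, "Start": 0.1},
}}}}}}

# Illustrative failure response (assumed shape)
failed = {"enwiki": {"scores": {"1001": {"wp10": {"error": {
    "message": "revision could not be scored",
    "type": "ExampleError",
}}}}}}

print(extract_prediction(ok, 1000))
print(extract_prediction(failed, 1001))
```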




## Fetching Revision IDs and Article Quality Scores


In this section, we retrieve the **revision IDs** and **quality scores** for each politician's Wikipedia article. Using the ORES API, we classify the quality of each article based on the `wp10` model predictions.

### Steps:
1. **Initialize Error Logging**:
   - We create a list `error_log` to track articles that fail to retrieve revision IDs or quality scores.
   
2. **Add New Columns**:
   - We add two new columns, `revision_id` and `quality_score`, to store the retrieved data for each politician's article.

3. **Loop Through Politicians**:
   - For each row in the `politicians_df` DataFrame, we:
     - Extract the article title from the article URL.
     - Request the page's revision ID using the function `request_pageinfo_per_article`.
     - If the revision ID is retrieved successfully, we fetch the article's quality score using `get_ores_quality_prediction`.
     - If the data is fetched successfully, it is stored in the DataFrame.
     - Articles that fail to retrieve data are logged in the `error_log`.

4. **Save the Error Log**:
   - The `error_log` is saved as a `.txt` file for further inspection, which can be used to debug or retry the failed articles later.

5. **Save Updated Politicians Data**:
   - The updated DataFrame, now including revision IDs and quality scores, is saved as `articles_scores.csv`.

6. **Calculate and Print Error Rate**:
   - The script calculates the percentage of articles that failed to retrieve a quality score, providing an error rate to assess how often the ORES API request failed.
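The title-extraction step deserves a closer look. If a URL in the dataset were percent-encoded, passing the raw last segment to `requests` would encode it a second time. The sketch below is one possible way to guard against that, using only the standard library. `title_from_url` is a hypothetical helper, and the second URL is a made-up example.

```python
from urllib.parse import unquote, urlsplit

def title_from_url(url):
    """Extract a decoded article title from a Wikipedia article URL."""
    path = urlsplit(url).path            # drops any query string or fragment
    title = path.split("/wiki/", 1)[-1]  # keeps titles that themselves contain '/'
    return unquote(title)                # decodes percent-encoded characters

print(title_from_url("https://en.wikipedia.org/wiki/Tayyab_Agha"))
print(title_from_url("https://en.wikipedia.org/wiki/Jos%C3%A9_Example"))
```

For URLs that are already unencoded, as in the sample rows shown earlier, this returns the same title as splitting on `/` and taking the last segment.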


In [67]:
# Initialize an empty list to log article titles that could not fetch quality scores
error_log = []

# Add columns to the politicians DataFrame to store the revision ID and quality score from the Wikipedia and ORES APIs
politicians_df['revision_id'] = None
politicians_df['quality_score'] = None

# Loop through each row (each politician) to retrieve the Wikipedia revision ID and corresponding quality score
for index, row in politicians_df.iterrows():
    # Extract the article title from the URL by splitting the string and taking the last segment after '/'
    article_title = row['url'].split('/')[-1]
    
    # Request the revision ID using the Wikipedia API
    revision_id = request_pageinfo_per_article(article_title)
    
    # If the revision ID is successfully retrieved, fetch the ORES quality score
    quality_score = None
    if revision_id:
        quality_score = get_ores_quality_prediction(article_title, revision_id)
        # Store the revision ID and quality score in the DataFrame at the respective index
        politicians_df.at[index, 'revision_id'] = revision_id
        politicians_df.at[index, 'quality_score'] = quality_score
        print(f"Data Fetched for: {revision_id}")  # Output fetched data to console for monitoring

    # Log the article if either the revision ID or the quality score could not be retrieved
    if quality_score is None:
        error_log.append(article_title)

# Save the error log to a text file to track articles where no quality score was fetched
with open('../data/ores_error_log.txt', 'w') as log_file:
    log_file.write("\n".join(error_log))

# Save the updated DataFrame with revision ID and quality score to a CSV file
politicians_df.to_csv('../data/articles_scores.csv', index=False)

# Calculate and print the error rate (i.e., the proportion of articles for which no quality score was fetched)
error_rate = len(error_log) / len(politicians_df)
print(f"Error rate: {error_rate:.2%}")


Data Fetched for: 1233202991
Data Fetched for: 1230459615
Data Fetched for: 1225661708
Data Fetched for: 1234741562
Data Fetched for: 1195651393
Data Fetched for: 1235521766
Data Fetched for: 1176429234
Data Fetched for: 1247931713
Data Fetched for: 1225385278
Data Fetched for: 1226326055
Data Fetched for: 1244521219
Data Fetched for: 1231655023
Data Fetched for: 1237694188
Data Fetched for: 1227635806
Data Fetched for: 1248505877
Data Fetched for: 1197443408
Data Fetched for: 1134129082
Data Fetched for: 1193992206
Data Fetched for: 988838315
Data Fetched for: 949986748
Data Fetched for: 1158302291
Data Fetched for: 1185105938
Data Fetched for: 1212323536
Data Fetched for: 1245967190
Data Fetched for: 1207743719
Data Fetched for: 1227026187
Data Fetched for: 1158659195
Data Fetched for: 1240993642
Data Fetched for: 1136611354
Data Fetched for: 1234514565
Data Fetched for: 1234743111
Data Fetched for: 1179137138
Data Fetched for: 1246566795
Data Fetched for: 1243745950
Data Fetched for

## Assigning Regions to Countries

In this section, we process the **population dataset** to map each country to its respective region. The region information is present as uppercase entries in the 'Geography' column, which signifies the start of a new region. The countries that follow each region entry are assigned to that region.

### Steps:
1. **Identify Regions**:
   - We identify all rows in the `population_df` where the 'Geography' entry is in all uppercase, indicating that these rows represent regions.

2. **Add a Region Column**:
   - We create a new column `region` in `population_df` to store the region name for each country.

3. **Assign Regions to Countries**:
   - We iterate through the dataset and, for each row, check whether it represents a region or a country:
     - If the row represents a region, we set it as the current region.
     - If the row represents a country, we assign the current region to that country.

4. **Remove Region Rows**:
   - After assigning regions to all countries, we remove the rows corresponding to the region names, keeping only the country rows.

5. **Rename Columns**:
   - The columns are renamed to `country`, `population`, and `region` for clarity.

6. **Save Transformed Data**:
   - The final dataset with countries and their corresponding regions is saved as `country_population_with_region.csv` for future analysis.
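The steps above can also be sketched without an explicit loop, using a forward fill. This is an alternative formulation for illustration, not the notebook's own method. The first rows reuse values from the sample output shown earlier; `EXAMPLE REGION` and `Exampleland` are made-up placeholders added to show the region switching.

```python
import pandas as pd

toy = pd.DataFrame({
    "Geography": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt",
                  "EXAMPLE REGION", "Exampleland"],
    "Population": [8009.0, 1453.0, 256.0, 46.8, 105.2, 2.0, 1.0],
})

# Rows written in ALL CAPS are region headers
is_region = toy["Geography"].str.isupper()

# Keep the name only on region rows, then carry it forward to the rows below
toy["region"] = toy["Geography"].where(is_region).ffill()

# Drop the region rows and rename the columns
countries = (
    toy[~is_region]
    .rename(columns={"Geography": "country", "Population": "population"})
    .reset_index(drop=True)
)
print(countries)
```

Both approaches rely on the same assumption: every country row sits below the header of the region it belongs to.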


In [58]:
# Identify the rows that represent regions (ALL CAPS entries in 'Geography')
regions = population_df[population_df['Geography'].str.isupper()].copy()

# Create an empty column for regions in the population_df
population_df['region'] = None

# Iterate over the regions and assign the corresponding region to countries
current_region = None
for index, row in population_df.iterrows():
    if row['Geography'].isupper():
        # If the row is a region, set it as the current region
        current_region = row['Geography']
    else:
        # If the row is a country, assign the current region
        population_df.at[index, 'region'] = current_region

# Remove the rows corresponding to regions and keep only the country rows
country_population_df = population_df[~population_df['Geography'].str.isupper()].copy()

# Rename the columns to 'country', 'population', and 'region'
country_population_df = country_population_df.rename(columns={'Geography': 'country', 'Population': 'population'})

# Display the transformed DataFrame
display(country_population_df.head())

# Save the final DataFrame to a CSV file for future use
country_population_df.to_csv('../data/country_population_with_region.csv', index=False)


   country  population           region
3  Algeria        46.8  NORTHERN AFRICA
4    Egypt       105.2  NORTHERN AFRICA
5    Libya         6.9  NORTHERN AFRICA
6  Morocco        37.0  NORTHERN AFRICA
7    Sudan        48.1  NORTHERN AFRICA
