# Data Acquisition
This notebook contains the code for calling the MediaWiki API and for extracting and storing the latest article revision ID for each politician. The output of this notebook is the JSON file "rev_ids.json", which the code below writes to the "data_intermediate" folder.

This is the first of four notebooks that should be run. Run "data_scoring.ipynb" next.

## Article Page Info MediaWiki API 
This notebook illustrates how to access page info data using the [MediaWiki Action API for English Wikipedia](https://www.mediawiki.org/wiki/API:Main_page). Each request asks for summary 'page info' for a single article page. The API documentation, [API:Info](https://www.mediawiki.org/wiki/API:Info), covers additional details that may be helpful when trying to use or understand this code.
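
As a rough illustration of what a single page info request looks like, the sketch below assembles the request URL using only the standard library, without contacting Wikipedia. The helper name `build_pageinfo_url` is hypothetical and is not part of this notebook. The endpoint and parameters mirror the constants defined further down.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint and parameters mirror the constants defined later in this notebook
ENDPOINT = "https://en.wikipedia.org/w/api.php"

def build_pageinfo_url(title, inprop="talkid|url|watched|watchers"):
    """Hypothetical helper: return the URL for a prop=info query on one title."""
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "info",
        "inprop": inprop,
    }
    return f"{ENDPOINT}?{urlencode(params)}"

example_url = build_pageinfo_url("Tayyab Agha")

# Decode the query string again to confirm the parameters survive the round trip
decoded = parse_qs(urlsplit(example_url).query)
print(decoded["titles"], decoded["prop"])
```

The notebook itself passes the same parameters to `requests.get`, which performs the URL encoding on its own.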

## License
This code was adapted from an example developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.2 - September 16, 2024



## Importing Libraries

In [42]:
import json
import time
import pandas as pd
import requests

## Loading in the List of Article Titles

In [43]:
### Reading in the article titles ###
input_data = pd.read_csv("../data_raw/politicians_by_country_AUG.2024.csv")

# Extract the politician names
article_titles = input_data['name']

## Defining Constants for the API Call

In [44]:
# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
API_HEADER_AGENT = 'User-Agent'

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<anetzley@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

# The English Wikipedia article titles (politician names) read from the input CSV above
ARTICLE_TITLES = article_titles

# This is a string of additional page properties that can be returned; see the Info documentation for
# what can be included. If you don't want any, this can simply be the empty string.
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making a page info request
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to keep things simple, request a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}
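
The throttle constants above aim for roughly 100 requests per second: each request is budgeted 1/100 of a second, of which 2 ms is assumed to be network latency and the rest is spent sleeping. The sketch below works through that arithmetic. `estimated_minimum_seconds` and `N_EXAMPLE` are hypothetical names introduced for illustration. The result is only a lower bound, since real latency is often higher than the assumed value.

```python
# Values copied from the constants above
API_LATENCY_ASSUMED = 0.002
API_THROTTLE_WAIT = (1.0 / 100.0) - API_LATENCY_ASSUMED

def estimated_minimum_seconds(n_requests,
                              wait=API_THROTTLE_WAIT,
                              latency=API_LATENCY_ASSUMED):
    """Hypothetical helper: lower bound on run time if every request costs wait + latency."""
    return n_requests * (wait + latency)

# N_EXAMPLE is an illustrative count, not the size of the real input file
N_EXAMPLE = 1000
print(f"sleep per request: {API_THROTTLE_WAIT:.3f} s")
print(f"at least {estimated_minimum_seconds(N_EXAMPLE):.1f} s for {N_EXAMPLE} requests")
```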


## Defining the API Call
The following two functions make the API calls. The first, "request_pageinfo_per_article", executes the API call for a single article. The second, "store_rev_ids", calls "request_pageinfo_per_article" for every article title and collects each latest revision ID in a dictionary keyed by article name. That dictionary is written to a JSON file in the next section.

In [84]:
def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    
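    # Work on a copy so the shared module-level template is not mutated between calls
    request_template = dict(request_template)

    # The article title can be passed as a parameter to the call or set in the request_template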
    if article_title:
        request_template['titles'] = article_title

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    if API_HEADER_AGENT not in headers:
        raise Exception(f"The header data should include a '{API_HEADER_AGENT}' field that contains your UW email address.")

    if 'uwnetid@uw' in headers[API_HEADER_AGENT]:
        raise Exception(f"Use your UW email address in the '{API_HEADER_AGENT}' field.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

def store_rev_ids(article_titles = ARTICLE_TITLES):

    #Create a new dictionary to hold all of the rev_ids 
    rev_ids = {}
    for art in article_titles:
        #Execute the API call
        response = request_pageinfo_per_article(art)

        # Dig into the JSON and extract the last revision ID, if the call was successful
        if response is not None: # Ensure we got output from the API call
            pages = response.get('query', {}).get('pages', {})
            # A page ID of "-1" marks a title the API could not resolve to an article, so skip it
            if pages and list(pages.keys())[0] != '-1':
                rev_id = list(pages.values())[0]['lastrevid']

                # Store the revision ID in the dictionary
                rev_ids[art] = rev_id

    return rev_ids
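
Because "store_rev_ids" silently drops titles whose request fails or whose page is missing, it can help to see the parsing step in isolation. The sketch below is one possible way to separate found and skipped titles. It runs on hand-written sample responses rather than live API output. The helper names `extract_lastrevid` and `collect_rev_ids`, the page ID `12345`, and the titles "No Such Article" and "Failed Request" are all hypothetical placeholders.

```python
def extract_lastrevid(response):
    """Hypothetical helper: return the lastrevid from a prop=info response, or None."""
    if not response:
        return None
    pages = response.get("query", {}).get("pages", {})
    for page_id, page in pages.items():
        # Negative page IDs mark titles the API could not resolve to an article
        if not page_id.startswith("-") and "lastrevid" in page:
            return page["lastrevid"]
    return None

def collect_rev_ids(responses_by_title):
    """Split titles into those with a revision ID and those that were skipped."""
    rev_ids, skipped = {}, []
    for title, response in responses_by_title.items():
        rev_id = extract_lastrevid(response)
        if rev_id is None:
            skipped.append(title)
        else:
            rev_ids[title] = rev_id
    return rev_ids, skipped

# Hand-written sample responses shaped like the JSON the notebook parses
sample_responses = {
    "Tayyab Agha": {"query": {"pages": {"12345": {
        "pageid": 12345, "title": "Tayyab Agha", "lastrevid": 1225661708}}}},
    "No Such Article": {"query": {"pages": {"-1": {
        "title": "No Such Article", "missing": ""}}}},
    "Failed Request": None,  # what request_pageinfo_per_article returns on an exception
}

found, skipped = collect_rev_ids(sample_responses)
print(found)
print(skipped)
```

Keeping a list of skipped titles makes it possible to report which politicians have no revision ID instead of losing them without a trace.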

## Execute the API Call

In [85]:
# Call the above function
rev_ids = store_rev_ids(article_titles = ARTICLE_TITLES)
#print(json.dumps(rev_ids,indent=4))

# Store the output JSON in the data_intermediate folder
with open('../data_intermediate/rev_ids.json', 'w') as file:
    json.dump(rev_ids, file, indent=4)

{
    "Majah Ha Adrif": 1233202991,
    "Haroon al-Afghani": 1230459615,
    "Tayyab Agha": 1225661708,
    "Khadija Zahra Ahmadi": 1234741562,
    "Aziza Ahmadyar": 1195651393,
    "Muqadasa Ahmadzai": 1235521766,
    "Mohammad Sarwar Ahmedzai": 1176429234,
    "Amir Muhammad Akhundzada": 1247931713,
    "Nasrullah Baryalai Arsalai": 1225385278,
    "Abdul Rahim Ayoubi": 1226326055,
    "Ismael Balkhi": 1244521219,
    "Abdul Baqi Turkistani": 1231655023,
    "Mohammad Ghous Bashiri": 1237694188,
    "Jan Baz": 1227635806,
    "Bashir Ahmad Bezan": 1248505877,
    "Rafiullah Bidar": 1197443408,
    "Mohammad Siddiq Chakari": 1134129082,
    "Cheragh Ali Cheragh": 1193992206,
    "Nasir Ahmad Durrani": 988838315,
    "Muhammad Hashim Esmatullahi": 949986748,
    "Ezatullah (Nangarhar)": 1158302291,
    "Aimal Faizi": 1185105938,
    "Gajinder Singh Safri": 1212323536,
    "Sharif Ghalib": 1245967190,
    "Hashmat Ghani Ahmadzai": 1207743719,
    "Abdul Ghani Ghani": 1227026187,
    "Gh