# Article Page Info MediaWiki API Example
This example illustrates how to access page info data using the [MediaWiki Action API for the English Wikipedia](https://www.mediawiki.org/wiki/API:Main_page). It shows how to request summary 'page info' for a single article page. The API documentation, [API:Info](https://www.mediawiki.org/wiki/API:Info), covers additional details that may be helpful when trying to use or understand this example.

## License
This code example was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.2 - September 16, 2024



In [42]:
# 
# These modules are part of the Python standard library
import json
import time
import urllib.parse
#
# The 'pandas' and 'requests' modules are not part of the standard library. You will need to install them with pip/pip3 if you do not already have them
import pandas as pd
import requests

## Loading in the List of Article Titles

In [43]:
### Reading in the article titles ###
input_data = pd.read_csv("../data_raw/politicians_by_country_AUG.2024.csv")

# Extract the article titles (the politicians' names) from the 'name' column
article_titles = input_data['name']

The example relies on some constants that help make the code a bit more readable.

In [44]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
API_HEADER_AGENT = 'User-Agent'

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<anetzley@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

# The English Wikipedia article titles loaded from the CSV file above, used for the requests below
ARTICLE_TITLES = article_titles

# This is a string of additional page properties that can be returned. See the Info documentation for
# what can be included. If you don't want any, this can simply be the empty string.
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making a page info request
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}
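
To make the role of the parameter template more concrete, the sketch below shows roughly how a filled-in template turns into a request URL, and how the throttle wait works out numerically. It repeats the endpoint and parameter values from the constants above so that it runs on its own; the article title is just one of the titles that appears later in this notebook. The `requests` library builds the query string itself, so this is only an approximation of what gets sent, not something the notebook needs to do by hand.

```python
import urllib.parse

# Values copied from the constants above so this sketch is self-contained
endpoint = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "titles": "Tayyab Agha",   # example title, filled into the template
    "prop": "info",
    "inprop": "talkid|url|watched|watchers",
}

# urlencode() percent-encodes reserved characters such as "|" (as %7C)
# and encodes spaces as "+"
query_string = urllib.parse.urlencode(params)
full_url = f"{endpoint}?{query_string}"
print(full_url)

# Throttle arithmetic: a budget of 1/100 of a second per request,
# minus the assumed 2ms of latency, leaves the time to sleep
latency_assumed = 0.002
throttle_wait = (1.0 / 100.0) - latency_assumed
print(f"Sleep roughly {throttle_wait:.3f} seconds before each request")
```

With these values the wait is about 8ms, which caps the request rate at roughly 100 requests per second.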


## Defining the API Call

In [45]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    
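    # work on a copy so that setting 'titles' does not modify the shared default
    # template (or the caller's dictionary) from one call to the next
    request_template = dict(request_template)

    # the article title can be supplied as a parameter to the call or in the request_template
    if article_title:
        request_template['titles'] = article_title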

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    if API_HEADER_AGENT not in headers:
        raise Exception(f"The header data should include a '{API_HEADER_AGENT}' field that contains your UW email address.")

    if 'uwnetid@uw' in headers[API_HEADER_AGENT]:
        raise Exception(f"Use your UW email address in the '{API_HEADER_AGENT}' field.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

def store_rev_ids(article_titles = ARTICLE_TITLES):
    # Titles are requested one at a time; batching several titles into one
    # request would reduce the number of API calls

    # Create a new dictionary to hold the last revision ID for each article title
    rev_ids = {}
    for art in article_titles:
        # Execute the API call
        response = request_pageinfo_per_article(art)

        # The request function returns None if the request or the JSON decoding failed
        if not response:
            print(f"No response for: {art}")
            continue

        # Dig within the JSON and extract the last revision ID. A page entry
        # without a 'lastrevid' field is stored as None instead of raising a KeyError
        pages = response.get('query', {}).get('pages', {})
        page = next(iter(pages.values()), {})

        # Store the revision ID in the dictionary
        rev_ids[art] = page.get('lastrevid')

    return rev_ids
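
The lookup of `lastrevid` assumes every title resolves to an existing page. As a minimal sketch of a more defensive lookup, the hypothetical helper `extract_last_rev_id` below (not part of the original example) returns `None` when there is no response or no revision ID. The two sample responses are hand-written: the first is trimmed from the kind of output shown later in this notebook, and the second is an assumed shape for a title that does not exist, where the API reports the page under a negative ID with a `missing` key.

```python
def extract_last_rev_id(response):
    """Return the 'lastrevid' of the first page in a response, or None."""
    if not response:
        return None
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        # Pages without a revision (for example, missing pages) yield None
        return page.get("lastrevid")
    return None


# Hand-written sample responses (assumed shapes, trimmed for brevity)
found = {
    "query": {"pages": {"71600382": {
        "pageid": 71600382,
        "title": "Khadija Zahra Ahmadi",
        "lastrevid": 1234741562,
    }}}
}
missing = {
    "query": {"pages": {"-1": {
        "ns": 0,
        "title": "A Title With No Article",   # hypothetical title
        "missing": "",
    }}}
}

print(extract_last_rev_id(found))
print(extract_last_rev_id(missing))
print(extract_last_rev_id(None))
```

Returning `None` rather than raising keeps a long loop over many titles running, at the cost of having to filter out the `None` values afterwards.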


## Execute the API Call

In [46]:
rev_ids = store_rev_ids(article_titles = ARTICLE_TITLES[:10])
print(json.dumps(rev_ids,indent=4))

{
    "Majah Ha Adrif": 1233202991,
    "Haroon al-Afghani": 1230459615,
    "Tayyab Agha": 1225661708,
    "Khadija Zahra Ahmadi": 1234741562,
    "Aziza Ahmadyar": 1195651393,
    "Muqadasa Ahmadzai": 1235521766,
    "Mohammad Sarwar Ahmedzai": 1176429234,
    "Amir Muhammad Akhundzada": 1247931713,
    "Nasrullah Baryalai Arsalai": 1225385278,
    "Abdul Rahim Ayoubi": 1226326055
}


In [47]:
with open('../data_clean/rev_ids.json', 'w') as file:
    json.dump(rev_ids, file, indent=4)
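
Writing the file fails if the output directory does not exist yet. The sketch below shows one way to create the directory first and then confirm that the saved JSON reads back unchanged. It writes into a temporary directory so that it can run anywhere; the sample dictionary holds two of the revision IDs printed above, and the `data_clean` folder name simply mirrors the path used in this notebook.

```python
import json
import tempfile
from pathlib import Path

# Two entries copied from the output above, used as sample data
rev_ids = {"Tayyab Agha": 1225661708, "Aziza Ahmadyar": 1195651393}

with tempfile.TemporaryDirectory() as tmp_dir:
    output_path = Path(tmp_dir) / "data_clean" / "rev_ids.json"

    # Create the parent directory if it is not already there
    output_path.parent.mkdir(parents=True, exist_ok=True)

    with open(output_path, "w") as file:
        json.dump(rev_ids, file, indent=4)

    # Read the file back to check that nothing was lost on the way
    with open(output_path) as file:
        reloaded = json.load(file)

print(reloaded == rev_ids)
```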

## All Code Below Is from the Original Example

In [11]:
print(f"Getting page info data for: {ARTICLE_TITLES[3]}")
info = request_pageinfo_per_article(ARTICLE_TITLES[3])
print(json.dumps(info,indent=4))

Getting page info data for: Khadija Zahra Ahmadi
{
    "batchcomplete": "",
    "query": {
        "pages": {
            "71600382": {
                "pageid": 71600382,
                "ns": 0,
                "title": "Khadija Zahra Ahmadi",
                "contentmodel": "wikitext",
                "pagelanguage": "en",
                "pagelanguagehtmlcode": "en",
                "pagelanguagedir": "ltr",
                "touched": "2024-10-07T14:08:50Z",
                "lastrevid": 1234741562,
                "length": 2569,
                "talkid": 71610138,
                "fullurl": "https://en.wikipedia.org/wiki/Khadija_Zahra_Ahmadi",
                "editurl": "https://en.wikipedia.org/w/index.php?title=Khadija_Zahra_Ahmadi&action=edit",
                "canonicalurl": "https://en.wikipedia.org/wiki/Khadija_Zahra_Ahmadi"
            }
        }
    }
}


In [12]:
print(f"Getting page info data for: {ARTICLE_TITLES[1]}")
info = request_pageinfo_per_article(ARTICLE_TITLES[1])
print(json.dumps(info['query']['pages'],indent=4))

Getting page info data for: Haroon al-Afghani
{
    "11966231": {
        "pageid": 11966231,
        "ns": 0,
        "title": "Haroon al-Afghani",
        "contentmodel": "wikitext",
        "pagelanguage": "en",
        "pagelanguagehtmlcode": "en",
        "pagelanguagedir": "ltr",
        "touched": "2024-10-05T14:27:29Z",
        "lastrevid": 1230459615,
        "length": 17027,
        "talkid": 15250816,
        "fullurl": "https://en.wikipedia.org/wiki/Haroon_al-Afghani",
        "editurl": "https://en.wikipedia.org/w/index.php?title=Haroon_al-Afghani&action=edit",
        "canonicalurl": "https://en.wikipedia.org/wiki/Haroon_al-Afghani"
    }
}


Information for multiple pages can be requested at the same time by separating the page titles with the vertical bar "|" character. However, this approach has limits: the API accepts at most 50 titles per request for ordinary clients (500 for clients with higher limits, such as bots). Check the API documentation before requesting multiple pages in a single call, and note that the response is keyed by page ID, so the pages do not necessarily come back in the order they were requested.
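
As a rough sketch of how a longer list of titles could be split up for this kind of request, the hypothetical helper `batch_titles` below (not part of the original example) groups titles and joins each group with "|". The default batch size of 50 is an assumption based on the usual per-request cap for ordinary clients and should be checked against the API documentation.

```python
def batch_titles(titles, batch_size=50):
    """Yield '|'-joined strings holding at most batch_size titles each."""
    titles = list(titles)
    for start in range(0, len(titles), batch_size):
        yield "|".join(titles[start:start + batch_size])


# A few titles from this notebook, with a small batch size for illustration
sample_titles = [
    "Majah Ha Adrif",
    "Tayyab Agha",
    "Aziza Ahmadyar",
    "Haroon al-Afghani",
    "Muqadasa Ahmadzai",
]
batches = list(batch_titles(sample_titles, batch_size=2))
for batch in batches:
    print(batch)
```

Each resulting string can be placed in the `titles` field of a copy of the parameter template. Because a batched response contains several pages, results should be matched to titles through each page's `title` field rather than by position.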

The cell below requests three pages in a single call. It also illustrates creating a copy of the template, setting values in the copy, and then calling the function with that copy to supply the parameters for the API request.

In [13]:
page_titles = f"{ARTICLE_TITLES[0]}|{ARTICLE_TITLES[2]}|{ARTICLE_TITLES[4]}"
print(f"Getting page info data for: {page_titles}")
request_info = PAGEINFO_PARAMS_TEMPLATE.copy()
request_info['titles'] = page_titles
info = request_pageinfo_per_article(request_template=request_info)
print(json.dumps(info['query']['pages'],indent=4))

Getting page info data for: Majah Ha Adrif|Tayyab Agha|Aziza Ahmadyar
{
    "47805901": {
        "pageid": 47805901,
        "ns": 0,
        "title": "Aziza Ahmadyar",
        "contentmodel": "wikitext",
        "pagelanguage": "en",
        "pagelanguagehtmlcode": "en",
        "pagelanguagedir": "ltr",
        "touched": "2024-10-08T13:30:38Z",
        "lastrevid": 1195651393,
        "length": 3790,
        "talkid": 47806200,
        "fullurl": "https://en.wikipedia.org/wiki/Aziza_Ahmadyar",
        "editurl": "https://en.wikipedia.org/w/index.php?title=Aziza_Ahmadyar&action=edit",
        "canonicalurl": "https://en.wikipedia.org/wiki/Aziza_Ahmadyar"
    },
    "10483286": {
        "pageid": 10483286,
        "ns": 0,
        "title": "Majah Ha Adrif",
        "contentmodel": "wikitext",
        "pagelanguage": "en",
        "pagelanguagehtmlcode": "en",
        "pagelanguagedir": "ltr",
        "touched": "2024-09-30T14:32:18Z",
        "lastrevid": 1233202991,
        "length