# Article Page Info MediaWiki API Example
This notebook contains the code required to retrieve page information from the MediaWiki API. The code is written in Python and uses the `requests` library to make the API calls. It is designed to run in a Jupyter notebook, but could be adapted to run as a standalone Python script.

## License
Parts of this code are adapted from the examples developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.2 - September 16, 2024



List of required packages:
1. json
2. time
3. urllib
4. dotenv
5. requests: The 'requests' module is not part of the Python standard library. Install it with pip/pip3 if you do not already have it.
6. pandas: Also a third-party package. Install it with pip/pip3 if you do not already have it.

In [15]:
# These are standard Python modules
import json, time, urllib.parse

# The 'pandas' and 'requests' modules are not standard Python modules.
# You will need to install them with pip/pip3 if you do not already have them
import pandas as pd
import requests

First, we load the politicians data provided in the assignment.

### Make sure to copy the csv file from the data_files folder and place it in the path of this notebook.

In [16]:
csv_file = pd.read_csv('politicians_by_country_AUG.2024.csv')
print(csv_file)

                      name                                                url  \
0           Majah Ha Adrif       https://en.wikipedia.org/wiki/Majah_Ha_Adrif   
1        Haroon al-Afghani    https://en.wikipedia.org/wiki/Haroon_al-Afghani   
2              Tayyab Agha          https://en.wikipedia.org/wiki/Tayyab_Agha   
3     Khadija Zahra Ahmadi  https://en.wikipedia.org/wiki/Khadija_Zahra_Ah...   
4           Aziza Ahmadyar       https://en.wikipedia.org/wiki/Aziza_Ahmadyar   
...                    ...                                                ...   
7150      Josiah Tongogara     https://en.wikipedia.org/wiki/Josiah_Tongogara   
7151     Langton Towungana    https://en.wikipedia.org/wiki/Langton_Towungana   
7152     Sengezo Tshabangu    https://en.wikipedia.org/wiki/Sengezo_Tshabangu   
7153   Herbert Ushewokunze  https://en.wikipedia.org/wiki/Herbert_Ushewokunze   
7154          Denis Walker         https://en.wikipedia.org/wiki/Denis_Walker   

          country  
0     A

For each article (identified by the `name` column), we want to get the `lastrevid`, which is the revision ID of the most recent edit to the page.

To do this, we start by defining the base URL and the parameters required to request the page information.

In [17]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
API_HEADER_AGENT = 'User-Agent'

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': 'dvabhinav31@gmail.com, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

# The article titles come from the 'name' column of the CSV file.
# Spaces are replaced with underscores to match Wikipedia's title format
ARTICLE_TITLES = list(csv_file['name'])
clenaed_article_titles = [article.replace(' ', '_') for article in ARTICLE_TITLES]

# This is a string of additional page properties that can be returned. See the Info documentation for
# what can be included. If you don't want any, this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making a page info request
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}
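
To make the throttle arithmetic concrete, here is a small sketch using the same numbers as the constants above. The variable names are illustrative and not part of the notebook; the target of 100 requests per second is simply what the `1.0/100.0` term implies.

```python
# Illustrative names; the values mirror the throttle constants defined above.
TARGET_REQUESTS_PER_SECOND = 100.0   # implied by the (1.0/100.0) term
LATENCY_ASSUMED = 0.002              # seconds, assumed network + API latency

# Time budget per request, minus the part already spent waiting on the network
throttle_wait = (1.0 / TARGET_REQUESTS_PER_SECOND) - LATENCY_ASSUMED

print(f"sleep before each request: {throttle_wait * 1000:.1f} ms")
```

With these values the procedure sleeps for roughly 8 ms before each call, which keeps the request rate at or below about 100 per second if the latency assumption holds.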


The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. Therefore the parameter most likely to change is the name.
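
When a procedure takes a shared template dictionary as a default argument, it is safest to fill in the title on a copy, so that one request cannot leak its title into the next. The sketch below shows one way to do that. `build_pageinfo_params` and `EXAMPLE_TEMPLATE` are hypothetical names used only for illustration, and `urlencode` is used here just to show the query string that would be sent.

```python
from urllib.parse import urlencode

# Hypothetical stand-in for the notebook's parameter template
EXAMPLE_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",
    "prop": "info",
    "inprop": "talkid|url|watched|watchers",
}

def build_pageinfo_params(article_title, template=EXAMPLE_TEMPLATE):
    """Return a new parameter dict for one request, leaving the template untouched."""
    if not article_title:
        raise ValueError("Must supply an article title to make a pageinfo request.")
    params = dict(template)          # shallow copy is enough: all values are strings
    params["titles"] = article_title
    return params

params = build_pageinfo_params("Tayyab_Agha")
print(urlencode(params))                  # query string appended to the endpoint URL
print(EXAMPLE_TEMPLATE["titles"] == "")   # the shared template is unchanged
```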



In [18]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    
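    # Work on a copy so the shared template is not modified between calls
    request_template = dict(request_template)

    # The article title can be passed as a parameter to the call or set in the request_template
    if article_title:
        request_template['titles'] = article_title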

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    if API_HEADER_AGENT not in headers:
        raise Exception(f"The header data should include a '{API_HEADER_AGENT}' field that contains your UW email address.")

    if 'uwnetid@uw' in headers[API_HEADER_AGENT]:
        raise Exception(f"Use your UW email address in the '{API_HEADER_AGENT}' field.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


There is a way to get the information for multiple pages at the same time, by separating the page titles with the vertical bar "|" character. However, this approach has limits. You should probably check the API documentation if you want to do multiple pages in a single request - and limit the number of pages in one request reasonably.

The batch size is set to 50, which is the maximum number of titles that a standard (non-bot) client can request in a single call. This reduces the number of API calls from 7,155 to 144.
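
As a rough illustration of the batching arithmetic, the sketch below splits a list of titles into groups of at most 50 and joins each group with `|`. The helper `batch_titles` and the placeholder titles are hypothetical and not part of the notebook.

```python
import math

def batch_titles(titles, batch_size=50):
    """Yield '|'-joined strings containing at most batch_size titles each."""
    for start in range(0, len(titles), batch_size):
        yield "|".join(titles[start:start + batch_size])

# Placeholder titles standing in for the 7,155 article names
example_titles = [f"Article_{n}" for n in range(7155)]
batches = list(batch_titles(example_titles))

print(len(batches))                   # number of API calls needed
print(math.ceil(7155 / 50))           # same count, computed directly
print(len(batches[-1].split("|")))    # size of the final, partial batch
```

Because 7,155 is not a multiple of 50, the last batch is smaller than the others; slicing past the end of a Python list handles this without special-casing.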

We only need the `lastrevid` and the name of each article. We store this information in a list of lists, which is converted to a DataFrame and saved as a CSV file in the cells that follow.
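
For orientation, a `prop=info` response in JSON format nests the page records under `query` and then `pages`, keyed by page ID. The sample below is a hand-written, abbreviated stand-in (the IDs and revision numbers are invented), and `extract_revisions` is a hypothetical helper. Titles the API cannot find come back without a `lastrevid`, which is why the lookup needs a fallback.

```python
# Hand-written, abbreviated sample; IDs and revision numbers are invented
sample_response = {
    "query": {
        "pages": {
            "12345": {"pageid": 12345, "title": "Example Politician", "lastrevid": 987654321},
            "-1": {"title": "No Such Page", "missing": ""},
        }
    }
}

def extract_revisions(response):
    """Return [title, lastrevid] pairs; lastrevid is '' when the page has none."""
    if not response:
        return []
    pages = response.get("query", {}).get("pages", {})
    return [[page["title"], page.get("lastrevid", "")] for page in pages.values()]

rows = extract_revisions(sample_response)
print(rows)
```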

In [19]:
wiki_information = []
batch_size = 50
for i in range(0, len(clenaed_article_titles), batch_size):
    names = '|'.join(clenaed_article_titles[i:i+batch_size])
    info = request_pageinfo_per_article(names)
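    # request_pageinfo_per_article returns None when the request fails, so check for that first
    if info and 'query' in info and 'pages' in info['query']: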
        response = info['query']['pages']
        for page_id in response:
            name = response[page_id]['title']
            if 'lastrevid' in response[page_id]:
                wiki_information.append([name,response[page_id]['lastrevid']])
            else:
                wiki_information.append([name,''])

Here, we convert the list into a DataFrame and count how often each name and revision ID pair occurs, in preparation for writing it to a CSV file for later analysis.

In [28]:
#groupby the wiki_information based on the revid and name
wiki_information_df = pd.DataFrame(wiki_information, columns = ['name', 'revid'])
wiki_information_df = wiki_information_df.groupby(['revid','name']).size().reset_index(name='count')

# Basic Analysis
There are a total of 7155 articles in the dataset.

I wanted to check if there are any repeated name and revision id pairs. This is important because we are using the revision id as a unique identifier for the page. If there are any duplicates, we need to investigate further.
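
The same check can be sketched without pandas by counting (name, revid) pairs directly. The rows below are invented for illustration and only mimic the `[name, revid]` shape used in this notebook.

```python
from collections import Counter

# Invented rows in the same [name, revid] shape as wiki_information
example_rows = [
    ["Alice Example", 111],
    ["Bob Example", 222],
    ["Alice Example", 111],   # exact repeat of the first row
    ["Carol Example", ""],    # page with no lastrevid
]

# Count each (name, revid) pair and keep only those seen more than once
pair_counts = Counter((name, revid) for name, revid in example_rows)
repeated = {pair: n for pair, n in pair_counts.items() if n > 1}
print(repeated)
```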


In [29]:
wiki_information_df[wiki_information_df['count'] > 1]

Unnamed: 0,revid,name,count
152,1046006361,Mohammad Toaha,2
509,1120771298,Melko Čingrija,2
559,1130454039,Yat Hwaidi,2
567,1131754356,Ibrahim Harun,2
711,1144472397,Josip Ferfolja,2
1055,1166369458,Moinuddin Ahmed Chowdhury,2
1130,1171543658,Luigi Adwok,2
2091,1209003568,Manuel Carrascalão,2
2360,1214682376,Count Václav Antonín Chotek of Chotkov and Vojnín,2
2385,1215043825,Eduard Hedvicek,2


We do not need these duplicates, so we remove them, keeping the first row for each `revid` and dropping the rest. Note that this comparison uses `revid` only, so any rows with an empty `revid` are also collapsed into a single row.

In [13]:
wiki_information_df = wiki_information_df.drop_duplicates(subset=['revid'], keep='first')

In [31]:
# Write wiki_information_df (article name and revision ID) to a CSV file
wiki_information_df.to_csv('wiki_information.csv', index = False)

In [14]:
wiki_information_df

Unnamed: 0,name,revid
0,Abdul Baqi Turkistani,1231655023
1,Abdul Ghani Ghani,1227026187
2,Abdul Rahim Ayoubi,1226326055
3,Ahmad Wali Massoud,1221720658
4,Aimal Faizi,1185105938
...,...,...
7150,Denis Walker,1247902630
7151,Herbert Ushewokunze,959111842
7152,Josiah Tongogara,1203429435
7153,Langton Towungana,1246280093


In [23]:
len(wiki_information)

7155

In [30]:
len(wiki_information_df)

7111