#### DATA ACQUISITION

This notebook collects monthly pageview counts (desktop, mobile web, and mobile app) for a subset of dinosaur-related articles on English Wikipedia.

The notebook outputs three datasets, each saved as a JSON file in which every article title is a key and that article's time series is the value.

In [None]:
# Import libraries
import json, time, urllib.parse
import requests
import pandas as pd

The following block defines the constants used throughout the notebook: the Pageviews API URL, the request headers, the file from which the dinosaur article titles are read, and the paths of the JSON output files.

In [None]:
#########
#
#    CONSTANTS
#

# The REST API 'pageviews' URL - this is the common URL/endpoint for all 'pageviews' API requests
API_REQUEST_PAGEVIEWS_ENDPOINT = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'

# This is a parameterized string that specifies what kind of pageviews request we are going to make
# In this case it will be a 'per-article' based request. The string is a format string so that we can
# replace each parameter with an appropriate value before making the request
API_REQUEST_PER_ARTICLE_PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

# The Pageviews API asks that we not exceed 100 requests per second, so we add a small delay to each request
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making a request to the Wikimedia API, they ask that you include a "unique ID" that will allow them to
# contact you if something goes wrong, such as your code exceeding request limits or causing some other error
REQUEST_HEADERS = {
    'User-Agent': '<anuhyabs@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022',
}

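# Titles of a subset of English Wikipedia dinosaur-related articles, read from the "name" column
ARTICLE_TITLES = pd.read_csv('../data/dinosaur_genera.cleaned.SEPT.2022.csv')["name"]

# Request window as YYYYMMDDHH timestamps (July 2015 through September 2022)
START_DATE = "2015070100"
END_DATE = "2022093000"

# Output paths. Note that the file names say 201501, while the data starts at START_DATE (July 2015).
DESKTOP_DATA = "../data/dino_monthly_desktop_201501-202209.json"
MOBILE_DATA = "../data/dino_monthly_mobile_201501-202209.json"
CUMULATIVE_DATA = "../data/dino_monthly_cumulative_201501-202209.json"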

# This template is used to map parameter values into the API_REQUEST_PER_ARTICLE_PARAMS portion of an API request. The dictionary has a
# field/key for each of the required parameters.
ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE = {
    "project":     "en.wikipedia.org",
    "access":      "",      # this value will be changed for the different access types
    "agent":       "user",
    "article":     "",             # this value will be set/changed before each request
    "granularity": "monthly",
    "start":       START_DATE,
    "end":         END_DATE   
}

The request_pageviews_per_article function takes an article title, an access type, the endpoint URL, the request parameters (defined earlier), and the headers, and requests the pageview data for that article. It returns the parsed JSON response as a dictionary, or None if the title is empty or the request fails.
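
To make the URL construction concrete, here is a minimal sketch that builds the request URL without sending anything. The helper name `build_pageviews_url` is hypothetical and not part of the notebook, and passing `safe=''` to `urllib.parse.quote` is a choice made in this sketch so that a "/" inside a title is also percent-encoded.

```python
import urllib.parse

# Endpoint and path template copied from the constants defined earlier in this notebook
ENDPOINT = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'
PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'


def build_pageviews_url(article_title, access="desktop"):
    """Hypothetical helper: return the request URL without making a request."""
    values = {
        "project": "en.wikipedia.org",
        "access": access,
        "agent": "user",
        # Spaces become underscores, then the title is percent-encoded.
        # safe='' (an assumption of this sketch) also encodes "/".
        "article": urllib.parse.quote(article_title.replace(' ', '_'), safe=''),
        "granularity": "monthly",
        "start": "2015070100",
        "end": "2022093000",
    }
    return ENDPOINT + PARAMS.format(**values)


print(build_pageviews_url("Tyrannosaurus rex", access="mobile-web"))
```

Printing the URL before running the full loop is a cheap way to check that titles are encoded as expected.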

In [None]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageviews_per_article(article_title = None, 
                                  access = "desktop",
                                  endpoint_url = API_REQUEST_PAGEVIEWS_ENDPOINT, 
                                  endpoint_params = API_REQUEST_PER_ARTICLE_PARAMS, 
                                  request_template = ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE,
                                  headers = REQUEST_HEADERS):
    # Make sure we have an article title
    if not article_title: return None
    
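    # Titles are supposed to have spaces replaced with "_" and be URL encoded.
    # safe='' also encodes "/", which would otherwise be read as a path separator.
    article_title_encoded = urllib.parse.quote(article_title.replace(' ','_'), safe='')
    # Work on a copy so the shared template dictionary is not modified between calls
    request_template = dict(request_template)
    request_template['article'] = article_title_encoded
    request_template['access'] = access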
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_template)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


The generate_views function takes the subset of articles and an access type as parameters. It loops over the articles and calls request_pageviews_per_article for each one.

For the mobile access type, the API requires two separate requests per article (mobile-app and mobile-web), so the code sums their monthly pageviews and returns the combined mobile views.
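
One way to combine the two mobile series without relying on both lists being in the same order and length is to sum by timestamp. The sketch below is an illustration under that assumption, not the notebook's method: the helper name `sum_views_by_timestamp` is hypothetical, the records are made up, and only the `timestamp` and `views` fields are kept.

```python
from collections import defaultdict


def sum_views_by_timestamp(*series):
    """Hypothetical helper: add up 'views' across several lists of monthly records."""
    totals = defaultdict(int)
    for items in series:
        for record in items:
            totals[record["timestamp"]] += record["views"]
    # Timestamps are fixed-width YYYYMMDDHH strings, so string order is chronological
    return [{"timestamp": ts, "views": totals[ts]} for ts in sorted(totals)]


# Made-up records for illustration only (not real pageview counts)
app = [{"timestamp": "2015070100", "views": 10},
       {"timestamp": "2015080100", "views": 20}]
web = [{"timestamp": "2015080100", "views": 5},
       {"timestamp": "2015090100", "views": 7}]

print(sum_views_by_timestamp(app, web))
```

A month that appears in only one of the two series is kept with that series' count rather than being dropped or misaligned.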

For the cumulative dataset, the code requests the all-access pageviews (all access types combined) and converts them into a running total: each month's value becomes the sum of that month's views and the views of all prior months, through September 2022.
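
The running total can be illustrated on a few made-up records. This is a small sketch using `itertools.accumulate`; the helper name `to_running_total` is hypothetical, and unlike the notebook's loop it returns new records instead of modifying the originals.

```python
from itertools import accumulate


def to_running_total(items):
    """Hypothetical helper: return copies of the records with 'views' as a running total."""
    running = accumulate(record["views"] for record in items)
    return [{**record, "views": total} for record, total in zip(items, running)]


# Made-up monthly counts for illustration only
monthly = [{"timestamp": "2015070100", "views": 3},
           {"timestamp": "2015080100", "views": 4},
           {"timestamp": "2015090100", "views": 5}]

cumulative = to_running_total(monthly)
print([record["views"] for record in cumulative])  # [3, 7, 12]
```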

For the desktop access type, the monthly pageviews are returned exactly as received from the API.

The function returns a dictionary with each article title as a key and that article's list of monthly records as the value.
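
Because request_pageviews_per_article can return None, and a response may not contain an 'items' list, indexing `['items']` directly can raise an error partway through a long run. A possible guard is sketched below. The helper name `extract_items` is hypothetical, and the error payload shown is a made-up stand-in, not a documented API response.

```python
def extract_items(response_json):
    """Hypothetical helper: return the list of monthly records, or [] if there is none."""
    if not isinstance(response_json, dict):
        return []
    return response_json.get("items", [])


print(extract_items(None))                       # []
print(extract_items({"title": "Not found."}))    # [] (made-up error payload)
print(extract_items({"items": [{"views": 1}]}))  # [{'views': 1}]
```

With a guard like this, an article whose request fails would end up with an empty list in the output rather than stopping the whole run. Whether that is preferable to failing loudly depends on how the datasets are used afterwards.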

In [None]:
# A function to generate datasets for all articles at different access levels
def generate_views(articles, access):
    views_json = {}

    if access == "mobile":
        # The API separates mobile access types into two separate requests: mobile-app and mobile-web.
        # The following code sums these to make one count for all mobile pageviews.
        for title in articles:
            mobile_app_views = request_pageviews_per_article(title, "mobile-app")['items']
            mobile_web_views = request_pageviews_per_article(title, "mobile-web")['items']

            # Add the mobile-web views onto the mobile-app records month by month.
            # Records are paired by position, so this assumes both responses cover the same months.
            mobile_views = mobile_app_views
            for app_month, web_month in zip(mobile_views, mobile_web_views):
                app_month['views'] += web_month['views']
            views_json[title] = mobile_views
    elif access == "all-access":
        for title in articles:
            views = request_pageviews_per_article(title, access)
            total_views = 0
            views = views['items']
            for month in views:
                total_views += month['views']
                month['views'] = total_views
            views_json[title] = views
    else:        
        for title in articles:
            views = request_pageviews_per_article(title, access)
            views = views['items']
            views_json[title] = views

    return views_json

The write_file function saves the dictionary returned by generate_views to a JSON file. The output file paths were defined with the constants above.
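
As a quick check that a saved dataset can be loaded back unchanged, the sketch below writes a made-up sample to a temporary directory and reads it again. It repeats write_file so that it runs on its own; `read_file` is a hypothetical counterpart that the notebook does not define.

```python
import json
import os
import tempfile


def write_file(path, views):
    """Same behavior as the notebook's write_file: dump the dictionary as JSON."""
    with open(path, "w") as f:
        json.dump(views, f)


def read_file(path):
    """Hypothetical counterpart to write_file: load a saved dataset into a dictionary."""
    with open(path) as f:
        return json.load(f)


# Made-up sample in the same shape as the notebook's output
sample = {"Tyrannosaurus": [{"timestamp": "2015070100", "views": 3}]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.json")
    write_file(path, sample)
    restored = read_file(path)

print(restored == sample)  # True
```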

In [None]:
# Saving the datasets as JSON files
def write_file(path, views):
    with open(path, "w") as f:
        json.dump(views, f)

Finally, I call generate_views and write_file for each of the three datasets:

In [None]:
# Creating monthly desktop views
desktop_views = generate_views(ARTICLE_TITLES, "desktop")
write_file(DESKTOP_DATA, desktop_views)

In [None]:
# Creating monthly cumulative views
cumulative_views = generate_views(ARTICLE_TITLES, "all-access")
write_file(CUMULATIVE_DATA, cumulative_views)

In [None]:
# Creating monthly mobile views
mobile_views = generate_views(ARTICLE_TITLES, "mobile")
write_file(MOBILE_DATA, mobile_views)