# Article Page Views API Example
This example illustrates how to access page view data using the [Wikimedia REST API](https://www.mediawiki.org/wiki/Wikimedia_REST_API). It shows how to request monthly page view counts for a specific article. The API documentation, [pageviews/per-article](https://wikimedia.org/api/rest_v1/#/Pageviews%20data), covers additional details that may help when using or trying to understand this example.

## License
This code example was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.1 - May 5, 2022



In [10]:
!pip install pandas
!pip install tqdm
!pip install xarray


In [12]:
#
# These are standard Python modules
import json, time, urllib.parse, os
#
# The modules below are not part of the standard library. You will need to install them with pip/pip3 if you do not already have them
import requests
import pandas as pd
from tqdm import tqdm
import xarray as xa

The example relies on some constants that help make the code a bit more readable.

In [2]:
#########
#
#    CONSTANTS
#

# The REST API 'pageviews' URL - this is the common URL/endpoint for all 'pageviews' API requests
API_REQUEST_PAGEVIEWS_ENDPOINT = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'

# This is a parameterized string that specifies what kind of pageviews request we are going to make
# In this case it will be a 'per-article' based request. The string is a format string so that we can
# replace each parameter with an appropriate value before making the request
API_REQUEST_PER_ARTICLE_PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

# The Pageviews API asks that we not exceed 100 requests per second, we add a small delay to each request
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making a request to the Wikimedia API they ask that you include a "unique ID" that will allow them to
# contact you if something happens - such as - your code exceeding request limits - or some other error happens
REQUEST_HEADERS = {
    'User-Agent': '<amb7896@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022',
}

# read list of all dinosaur article titles
dinosaur_wiki_df = pd.read_csv("/Users/amrit/Documents/Courses/Human Centered Data Science/Assignments/data-512-homework_1/dinosaur_genera.cleaned.SEPT.2022.csv")
# This is just a list of English Wikipedia article titles that we can use for example requests
# ARTICLE_TITLES = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

ARTICLE_TITLES = list(dinosaur_wiki_df.name)

# This template is used to map parameter values into the API_REQUEST_PER_ARTICLE_PARAMS portion of an API request. The dictionary has a
# field/key for each of the required parameters. In the example below, we only vary the article name, so the majority of the fields
# can stay constant for each request. Of course, these values *could* be changed if necessary.
ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE = {
    "project":     "en.wikipedia.org",
    "access":      "desktop",      # this should be changed for the different access types
    "agent":       "user",
    "article":     "",             # this value will be set/changed before each request
    "granularity": "monthly",
    "start":       "2015010100",
    "end":         "2022093000"    # this is likely the wrong end date
}


The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. Therefore the parameter most likely to change is the article_title.
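To make the URL construction concrete, the following minimal sketch builds a request URL from the endpoint, the parameterized string and a template dictionary, without contacting the API. The two constants and the template values are copied from the constants cell; the helper name `build_pageviews_url` is an illustrative assumption and is not part of the original notebook. The title `Northern flicker` comes from the commented-out example list.

```python
import urllib.parse

# Constants copied from the notebook's CONSTANTS cell
API_REQUEST_PAGEVIEWS_ENDPOINT = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'
API_REQUEST_PER_ARTICLE_PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

def build_pageviews_url(article_title, template):
    """Hypothetical helper: return the request URL for one article without sending it."""
    # Copy the template so the caller's dictionary is left unchanged
    params = dict(template)
    # Spaces become underscores, then the title is URL encoded
    params['article'] = urllib.parse.quote(article_title.replace(' ', '_'))
    return API_REQUEST_PAGEVIEWS_ENDPOINT + API_REQUEST_PER_ARTICLE_PARAMS.format(**params)

example_template = {
    "project": "en.wikipedia.org", "access": "desktop", "agent": "user", "article": "",
    "granularity": "monthly", "start": "2015010100", "end": "2022093000",
}
example_url = build_pageviews_url('Northern flicker', example_template)
print(example_url)
```

Printing the URL before sending any requests is a cheap way to check the parameter values against the API documentation.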

In [3]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageviews_per_article(article_title = None, 
                                  endpoint_url = API_REQUEST_PAGEVIEWS_ENDPOINT, 
                                  endpoint_params = API_REQUEST_PER_ARTICLE_PARAMS, 
                                  request_template = ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE,
                                  headers = REQUEST_HEADERS):
    # Make sure we have an article title
    if not article_title: return None
    
    # Titles are supposed to have spaces replaced with "_" and be URL encoded
    article_title_encoded = urllib.parse.quote(article_title.replace(' ','_'))
    request_template['article'] = article_title_encoded
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_template)
    # print(request_url)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
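Because `request_pageviews_per_article` returns `None` when the request fails, callers need to check the result before indexing into it. The sketch below shows one way to do that. The helper name `extract_monthly_views` and the sample record are made up for illustration; only the `items`, `timestamp`, `access` and `views` keys are taken from how the rest of this notebook reads the response.

```python
def extract_monthly_views(response):
    """Hypothetical helper: return (timestamp, views) pairs, or [] if there is no usable data."""
    # None (a failed request) and responses without an 'items' list both yield no data
    if not response or 'items' not in response:
        return []
    return [(item['timestamp'], item['views']) for item in response['items']]

# Made-up response in the shape the notebook expects; the numbers are not real data
sample_response = {'items': [{'timestamp': '2015070100', 'access': 'desktop', 'views': 10}]}
monthly_views = extract_monthly_views(sample_response)
print(monthly_views)
print(extract_monthly_views(None))
```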


### Collecting data from the Pageviews API and saving it to avoid re-downloading if the kernel crashes

In [None]:
device_access_modes = ['all-access', 'desktop', 'mobile-app', 'mobile-web']
access_modes_dict = {}
for mode in device_access_modes:
    ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE["access"] = mode
    mode_time_series = {}
    # tqdm shows a progress bar, so we can monitor how long the downloads from the Pageviews API take
    for dino_article in tqdm(ARTICLE_TITLES):
        # print(f"Getting pageview data for {dino_article} in {mode} mode")
        views = request_pageviews_per_article(dino_article)
        # The request returns None on failure, and error responses have no 'items' key.
        # Skip such articles rather than storing the previous article's data under this title.
        if not views or 'items' not in views:
            print(f"No pageview data for {dino_article} in {mode} mode: {views}")
            continue
        # Remove the 'access' field, as it is misleading in the mobile and cumulative files
        mode_time_series[dino_article] = [{key: item[key] for key in item if key != 'access'}
                                          for item in views['items']]
    access_modes_dict[mode] = mode_time_series

In [7]:
# Save the JSON downloaded from the Pageviews API to a local file, to avoid downloading it again if the kernel dies
with open("pageview_download.json", "w") as outfile:
    json.dump(access_modes_dict, outfile)

In [4]:
# Load the data previously downloaded from the Pageviews API
with open('pageview_download.json', 'r') as openfile:
    # Read from the JSON file
    access_modes_dict = json.load(openfile)
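The save and load cells above can be combined so that the download only runs when no saved copy exists. This is a minimal sketch under that assumption; `load_or_download` and its `download` argument are hypothetical names, not part of the original notebook.

```python
import json
import os

def load_or_download(cache_path, download):
    """Hypothetical helper: reuse the JSON file at cache_path, or call download() and save its result."""
    # A saved copy exists, so skip the download entirely
    if os.path.exists(cache_path):
        with open(cache_path, 'r') as infile:
            return json.load(infile)
    # No saved copy yet: run the download once and write the result to disk
    data = download()
    with open(cache_path, 'w') as outfile:
        json.dump(data, outfile)
    return data
```

In this notebook, `download` would be a function that wraps the download loop and returns `access_modes_dict`, and `cache_path` would be `"pageview_download.json"`.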

### Converting dictionary data to dataframe

### Summing mobile-app and mobile-web data 

In [5]:
all_dinos_mobile_dict = {}
for dino, dino_all_months in access_modes_dict["mobile-app"].items():
    # Index the mobile-web views by timestamp so that each month is a direct lookup
    web_views_by_month = {month['timestamp']: month['views'] for month in access_modes_dict["mobile-web"][dino]}
    dino_mobile_list = []
    for dino_mob_app_monthly_views in dino_all_months:
        # Create a new dict for every month. Reusing one dict across iterations appends the same
        # object each time, so every entry in the list ends up showing the last month's values.
        mobile_monthly_dict = dict(dino_mob_app_monthly_views)
        mobile_monthly_dict['views'] = (dino_mob_app_monthly_views['views']
                                        + web_views_by_month[dino_mob_app_monthly_views['timestamp']])
        dino_mobile_list.append(mobile_monthly_dict)
    all_dinos_mobile_dict[dino] = dino_mobile_list
access_modes_dict['mobile'] = all_dinos_mobile_dict
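The per-month dictionary in the cell above is created inside the loop for a reason: appending one dictionary that was created outside the loop stores the same object repeatedly, so every list entry reflects the most recent update. The short sketch below reproduces that effect with made-up values; the variable names are illustrative only.

```python
# One dict created outside the loop: the list holds three references to the same object
shared = {}
aliased = []
for views in [1, 2, 3]:
    shared['views'] = views
    aliased.append(shared)

# A new dict created on each iteration: the list holds three independent records
fresh = []
for views in [1, 2, 3]:
    fresh.append({'views': views})

print(aliased)
print(fresh)
```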

### Create monthly mobile, desktop and cumulative files 

In [24]:
with open("dino_monthly_mobile_201507-202209.json", "w") as outfile:
    json.dump(access_modes_dict["mobile"], outfile)
    
with open("dino_monthly_desktop_201507-202209.json", "w") as outfile:
    json.dump(access_modes_dict["desktop"], outfile)
    
with open("dino_monthly_cumulative_201507-202209.json", "w") as outfile:
    json.dump(access_modes_dict["all-access"], outfile)

### Hop into dino_analysis.ipynb for Step2: Analysis

In [23]:
# Flatten each access mode's {article: [monthly records]} mapping into a single list of monthly records
mobile_access_modes_dict = access_modes_dict["mobile"]
mobile_access_time_series_json = [record for records in mobile_access_modes_dict.values() for record in records]

desktop_access_modes_dict = access_modes_dict["desktop"]
desktop_access_time_series_json = [record for records in desktop_access_modes_dict.values() for record in records]

cumulative_access_modes_dict = access_modes_dict["all-access"]
cumulative_access_time_series_json = [record for records in cumulative_access_modes_dict.values() for record in records]

Each of the lists built above should contain one dictionary per article per month, including that month's view count.