# Data Acquisition
## Article Page Views API
This notebook illustrates how to access page view data using the [Wikimedia REST API](https://www.mediawiki.org/wiki/Wikimedia_REST_API). 

## License
This code was adapted from an example developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.3 - August 16, 2024



## Importing Libraries

In [1]:
import json
import time
import urllib.parse
import pandas as pd
import requests
import copy
import matplotlib.pyplot as plt

## Loading in the list of article titles

In [2]:
### Reading in the article titles ###
input_data = pd.read_csv("rare-disease_cleaned.AUG.2024.csv")

# Extract the disease names
disease_names = input_data['disease']
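
A title column read from a CSV can contain missing values, stray whitespace, or duplicates, and each of these would lead to a wasted or failing API request. The sketch below shows one possible way to tidy such a column before using it. The inline CSV text, its `url` column, and the helper name `clean_titles` are hypothetical stand-ins; only the `disease` column name comes from the notebook.

```python
import io
import pandas as pd

def clean_titles(titles: pd.Series) -> list:
    """Return unique, non-empty titles in their original order (hypothetical helper)."""
    cleaned = titles.dropna().astype(str).str.strip()
    cleaned = cleaned[cleaned != ""]
    return cleaned.drop_duplicates().tolist()

# Stand-in for the real input file; the rows are made up for illustration
sample_csv = io.StringIO(
    "disease,url\n"
    "Klinefelter syndrome,a\n"
    "  Aarskog-Scott syndrome ,b\n"
    ",c\n"
    "Klinefelter syndrome,d\n"
)
sample = pd.read_csv(sample_csv)
sample_titles = clean_titles(sample["disease"])
print(sample_titles)
```

If the real file is already clean, as its name suggests, this step changes nothing and can be skipped.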

## Defining Constants

In [17]:
# The REST API 'pageviews' URL - this is the common URL/endpoint for all 'pageviews' API requests
API_REQUEST_PAGEVIEWS_ENDPOINT = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'

# This is a parameterized string that specifies what kind of pageviews request we are going to make
# In this case it will be a 'per-article' based request. The string is a format string so that we can
# replace each parameter with an appropriate value before making the request
API_REQUEST_PER_ARTICLE_PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

# The Pageviews API asks that we not exceed 100 requests per second, so we add a small delay before each request
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making a request to the Wikimedia API they ask that you include your email address which will allow them
# to contact you if something happens - such as - your code exceeding rate limits - or some other error 
REQUEST_HEADERS = {
    'User-Agent': '<anetzley@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}

# This is the list of diseases that we previously read in
ARTICLE_TITLES = disease_names 

# This template is used to map parameter values into the API_REQUEST_PER_ARTICLE_PARAMS portion of an API request. The dictionary has a
# field/key for each of the required parameters. In the example, below, we only vary the article name, so the majority of the fields
# can stay constant for each request. Of course, these values *could* be changed if necessary.
ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE = {
    "project":     "en.wikipedia.org",
    "access":      "",      # this should be changed for the different access types
    "agent":       "user",
    "article":     "",             # this value will be set/changed before each request
    "granularity": "monthly",
    "start":       "2015070100",   # Starting on July 1, 2015
    "end":         "2024093000"    # Ending on September 30, 2024
}
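
To make the constants concrete, the sketch below computes the throttle delay and formats one request URL from a filled-in copy of the parameter template. The endpoint and parameter strings are repeated locally so the snippet runs on its own. The article title and the `desktop` access value are chosen only for illustration, and passing `safe=''` to `urllib.parse.quote` is an assumption made here so that a `/` in a title would also be encoded.

```python
import urllib.parse

endpoint = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'
params = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

# 100 requests per second allows 10 ms per request; subtract the assumed 2 ms latency
throttle_wait = (1.0 / 100.0) - 0.002
print(f"{throttle_wait:.3f} seconds between requests")

# A filled-in copy of the template for a single, illustrative article
example = {
    "project": "en.wikipedia.org",
    "access": "desktop",
    "agent": "user",
    "article": urllib.parse.quote("Klinefelter syndrome".replace(' ', '_'), safe=''),
    "granularity": "monthly",
    "start": "2015070100",
    "end": "2024093000",
}
example_url = endpoint + params.format(**example)
print(example_url)
```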


## Defining the API Call
The following three functions make and process the API calls. The first, "request_pageviews_per_article", executes the API call for a single article. The second, "store_pageviews_per_access", calls "request_pageviews_per_article" in a loop and builds a dictionary that stores the pageviews of every article for one access type. The last, "combine_pageviews", loops through the three raw per-access results (desktop, mobile app, and mobile web) and combines them into the three final pageview stratifications requested: desktop, all mobile, and combined.

In [18]:
def request_pageviews_per_article(article_title = None, 
                                  endpoint_url = API_REQUEST_PAGEVIEWS_ENDPOINT, 
                                  endpoint_params = API_REQUEST_PER_ARTICLE_PARAMS, 
                                  request_template = ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE,
                                  headers = REQUEST_HEADERS):

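    # work on a copy so the caller's template (a shared default argument) is not modified
    request_template = dict(request_template)

    # article title can be passed as a parameter to the call or set in the request_template
    if article_title:
        request_template['article'] = article_title

    if not request_template['article']:
        raise Exception("Must supply an article title to make a pageviews request.")

    # Titles are supposed to have spaces replaced with "_" and be URL encoded;
    # safe='' also encodes "/" so that a title containing a slash does not break the URL path
    article_title_encoded = urllib.parse.quote(request_template['article'].replace(' ','_'), safe='')
    request_template['article'] = article_title_encoded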
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_template)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

def store_pageviews_per_access(access, article_titles = ARTICLE_TITLES, template = ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE):
    #Create a new dictionary to hold all of the pageviews for this access type
    all_views = {}
    for art in article_titles:
        template['access'] = access
        #print("Getting pageview data for: ", art)
        
        #Execute the API call
        views = request_pageviews_per_article(art, request_template = template)

        #Store the views into the dictionary (views is None if the request raised an exception)
        if views and 'items' in views:
            all_views[art] = views['items']
        else:
            all_views[art] = None

    return all_views

def combine_pageviews(desktop_all_views, mobile_app_all_views, mobile_web_all_views, article_titles = ARTICLE_TITLES):
    #Initialize empty dicts to hold the combined views data
    mobile_all_views = {}
    combined_all_views = {}

    #Loop through each article and combine views to generate the mobile all_views and combined_all_views
    for article_name in article_titles:

        #Fill in the dictionaries with the rest of the data from the mobile app source to serve as background
        mobile_all_views[article_name] = copy.deepcopy(mobile_app_all_views[article_name])
        combined_all_views[article_name] = copy.deepcopy(mobile_app_all_views[article_name])
        #print(mobile_app_all_views[name])
        #print(mobile_all_views[name])

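        #Iterate through each month of data for the article and combine views
        #Records are paired by list position, so skip articles where any access type returned no data
        if (mobile_web_all_views[article_name] is not None
                and mobile_app_all_views[article_name] is not None
                and desktop_all_views[article_name] is not None):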
            for i, key in enumerate(mobile_web_all_views[article_name]):

                #Extract the corresponding viewcounts from the other two sources
                app_views = mobile_app_all_views[article_name][i]['views']
                desktop_views = desktop_all_views[article_name][i]['views']
                web_views = key['views']
                print('App Views:', app_views)
                print('Desktop Views:', desktop_views)
                print('Web Views:', web_views)

                #print(mobile_all_views[article_name][i])

                #Assign the aggregate view counts to each respective dict
                mobile_all_views[article_name][i]['views'] = app_views + web_views
                combined_all_views[article_name][i]['views'] = app_views + web_views + desktop_views

                print('Mobile All Views:',mobile_all_views[article_name][i]['views'])
                print('Combined All Views:',combined_all_views[article_name][i]['views'])
                print('\n')

                #Lastly, remove the 'access' key from all of the dicts
                desktop_all_views[article_name][i].pop('access', None)
                mobile_all_views[article_name][i].pop('access', None)
                combined_all_views[article_name][i].pop('access', None)
            
    return desktop_all_views, mobile_all_views, combined_all_views
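
The combining step pairs monthly records by list position, which assumes that all three access types return the same months in the same order. As a point of comparison, the sketch below sums views by matching on the `timestamp` field instead. It is an alternative illustration, not the method used in this notebook. The helper name `sum_views_by_timestamp` is hypothetical, and the records are hand-made in the shape of the API's `items` entries, with made-up view counts.

```python
from collections import defaultdict

def sum_views_by_timestamp(*record_lists):
    """Sum 'views' across several lists of monthly records, matching on 'timestamp'."""
    totals = defaultdict(int)
    for records in record_lists:
        for rec in records or []:      # treat a missing (None) list as empty
            totals[rec["timestamp"]] += rec["views"]
    return dict(sorted(totals.items()))

# Made-up records; note that the mobile-app list is missing July 2015
desktop = [{"timestamp": "2015070100", "access": "desktop", "views": 100},
           {"timestamp": "2015080100", "access": "desktop", "views": 120}]
mobile_app = [{"timestamp": "2015080100", "access": "mobile-app", "views": 5}]
mobile_web = [{"timestamp": "2015070100", "access": "mobile-web", "views": 40},
              {"timestamp": "2015080100", "access": "mobile-web", "views": 50}]

mobile_totals = sum_views_by_timestamp(mobile_app, mobile_web)
combined_totals = sum_views_by_timestamp(desktop, mobile_app, mobile_web)
print(mobile_totals)
print(combined_totals)
```

Matching on the timestamp keeps the totals correct when one access type has no record for a month, a case in which position-based pairing would add together counts from different months.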

## Main
The main section below performs the data acquisition: it executes the API calls and combines the pageviews using the functions described above. It stores the final processed data in three separate JSON files located in the "data_clean" folder.

In [21]:
#Execute the raw API Calls
desktop_all_views = store_pageviews_per_access('desktop')
mobile_app_all_views = store_pageviews_per_access('mobile-app')
mobile_web_all_views = store_pageviews_per_access('mobile-web')

# with open('mapp_temp.json', 'w') as file:
#     json.dump(mobile_app_all_views, file, indent=4)

# with open('desktop_temp.json', 'w') as file:
#     json.dump(desktop_all_views, file, indent=4)

# with open('mweb_temp.json', 'w') as file:
#     json.dump(mobile_web_all_views, file, indent=4)

#Combine pageviews
desktop_all_views, mobile_all_views, combined_all_views = combine_pageviews(desktop_all_views, mobile_app_all_views, mobile_web_all_views)

# #Printing a few results for a sanity check
# print(f"Collected {len(mobile_all_views[ARTICLE_TITLES[0]])} months of pageview data")
# for name in ARTICLE_TITLES[:4]:
#     for month in mobile_all_views[name]:
#         print(json.dumps(month,indent=4))

#Write the outputs to the three JSON files
with open('rare-disease_monthly_mobile_201507_202409.json', 'w') as file:
    json.dump(mobile_all_views, file, indent=4)

with open('rare-disease_monthly_desktop_201507_202409.json', 'w') as file:
    json.dump(desktop_all_views, file, indent=4)

with open('rare-disease_monthly_combined_201507_202409.json', 'w') as file:
    json.dump(combined_all_views, file, indent=4)
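
Each output file maps an article title to a list of monthly records, or to `null` when no data was returned for that article. The sketch below shows one way such a structure could be flattened into a pandas DataFrame for later analysis. The sample dictionary is a hypothetical miniature with made-up counts, and it is round-tripped through JSON text in memory rather than read from one of the files above.

```python
import json
import pandas as pd

# Hypothetical miniature of one output file: article title -> list of monthly records
sample_output = {
    "Klinefelter syndrome": [
        {"article": "Klinefelter_syndrome", "timestamp": "2015070100", "views": 140},
        {"article": "Klinefelter_syndrome", "timestamp": "2015080100", "views": 175},
    ],
    "Some missing article": None,   # articles without data are stored as null
}

# Round-trip through JSON text, as json.dump and json.load would do with a file
loaded = json.loads(json.dumps(sample_output))

# One row per article and month, skipping articles that have no records
rows = [
    {"article": title, "month": rec["timestamp"][:6], "views": rec["views"]}
    for title, records in loaded.items()
    if records is not None
    for rec in records
]
views_df = pd.DataFrame(rows)
print(views_df)
```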
