# Program Overview

The purpose of this notebook is to acquire the raw data necessary to analyze monthly article traffic on English Wikipedia from July 1, 2015 through September 30, 2023.

The source data uses the following license:
**EKRC INSERT LICENSE HERE

We also use the Wikimedia Foundation REST API which has the following
terms of use:
https://www.mediawiki.org/wiki/REST_API#Terms_and_conditions

The Pageviews API documentation can be found at the following link:
https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews

We will leverage code developed by Dr. David W. McDonald for use in Data 512, which is provided under a Creative Commons CC-BY license. (https://creativecommons.org/ and https://creativecommons.org/licenses/by/4.0/)

Additionally, we will be using a list of Wikipedia article titles for Academy Award winners provided by Dr. McDonald. A link to the list can be found here: https://drive.google.com/drive/folders/1lPJF73GX5Vyu2uAvT5VpAY-xGwP2fCCx

## API Pull Setup

The following sections of code will create a function to call the Pageviews API. The code was taken from Dr. McDonald's scripts noted in the Program Overview section above. Limited changes were made: variable names were lower-cased, the date range for the data pull was updated, and line lengths were shortened where possible.

We begin by importing the necessary Python libraries. Users may not have "requests" installed, and can use pip to add it to their machine.

In [16]:
#Import python libraries
import json
import time
import urllib.parse
import requests
import pandas as pd

The following code will create constants leveraged in the API calls in addition to a list of test article titles to verify the code is working correctly.

In [14]:
#Creating list of constants leveraged in API calls

# The REST API 'pageviews' URL - this is the common URL/endpoint for all 'pageviews' API requests
api_request_pageviews_endpoint = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'


# This is a parameterized string that specifies what kind of pageviews request we are going to make
# In this case it will be a 'per-article' based request. The string is a format string so that we can
# replace each parameter with an appropriate value before making the request
api_request_per_article_params = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

# The Pageviews API asks that we not exceed 100 requests per second, so we add a small delay to each request
api_latency_assumed = 0.002       # Assuming roughly 2ms latency on the API and network
api_throttle_wait = (1.0/100.0)-api_latency_assumed

# When making a request to the Wikimedia API they ask that you include your email address which will allow them
# to contact you if something happens - such as - your code exceeding rate limits - or some other error 
request_headers = {
    'User-Agent': 'ekrolen@uw.edu, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}

# This is just a list of English Wikipedia article titles that we can use for example requests
ex_article_titles = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

# This template is used to map parameter values into the api_request_per_article_params portion of an API request.
# The dictionary has a field/key for each of the required parameters. In the example below, we only vary the article name,
# so the majority of the fields can stay constant for each request. Of course, these values *could* be changed if necessary.
article_pageviews_params_template = {
    "project":     "en.wikipedia.org",
    "access":      "desktop",      # this should be changed for the different access types
    "agent":       "user",
    "article":     "",             # this value will be set/changed before each request
    "granularity": "monthly",
    "start":       "2015070100",   # July 1, 2015 in YYYYMMDDHH format
    "end":         "2023093000"    # September 30, 2023 in YYYYMMDDHH format
}
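
To make the role of the format string and the template concrete, the sketch below expands them into a full request URL for one example title without sending anything to the API. The names `endpoint`, `params_format`, `template`, and `build_request_url` are hypothetical stand-ins used only here, and passing `safe=''` to `urllib.parse.quote` (so that a "/" inside a title is percent-encoded too) is an assumption of this sketch.

```python
import urllib.parse

# Local stand-ins for the constants defined above (hypothetical names for this sketch)
endpoint = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'
params_format = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'
template = {
    "project":     "en.wikipedia.org",
    "access":      "desktop",
    "agent":       "user",
    "article":     "",
    "granularity": "monthly",
    "start":       "2015070100",
    "end":         "2023093000",
}

def build_request_url(article_title):
    """Return the request URL for one article without calling the API."""
    params = dict(template)  # copy, so the shared template is left untouched
    # Replace spaces with "_" and percent-encode the rest of the title
    params["article"] = urllib.parse.quote(article_title.replace(' ', '_'), safe='')
    return endpoint + params_format.format(**params)

example_url = build_request_url('Northern flicker')
print(example_url)
```

Printing the URL before running a long pull is a cheap way to confirm that the access type and date range are the ones intended.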

As described in Dr. McDonald's code, "The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. Therefore the parameter most likely to change is the article_title."

In [9]:
#Building the api request 

def request_pageviews_per_article(article_title = None, 
                                  endpoint_url = api_request_pageviews_endpoint, 
                                  endpoint_params = api_request_per_article_params,
                                  request_template = article_pageviews_params_template,
                                  headers = request_headers):

    # Work on a copy so the shared template (a mutable default argument) is never modified
    request_params = dict(request_template)

    # The article title can be passed as a parameter to the call or set in the request_template
    if article_title:
        request_params['article'] = article_title

    if not request_params['article']:
        raise Exception("Must supply an article title to make a pageviews request.")

    # Titles are supposed to have spaces replaced with "_" and be URL encoded;
    # safe='' also encodes any "/" in a title so it is not read as a path separator
    article_title_encoded = urllib.parse.quote(request_params['article'].replace(' ','_'), safe='')
    request_params['article'] = article_title_encoded
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_params)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or other community sources
        if api_throttle_wait > 0.0:
            time.sleep(api_throttle_wait)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
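
Because the function returns `None` when an exception occurs, and an error payload from the API generally has no `items` key, callers may want to check the result before indexing into it. The sketch below shows one way to do that; `extract_monthly_views` is a hypothetical helper, not part of Dr. McDonald's code, and `sample_response` is a hand-built dictionary in the shape of a monthly per-article response.

```python
from datetime import datetime

def extract_monthly_views(json_response):
    """Return a list of (month_start_date, views) pairs, or [] if the response is unusable."""
    # None signals a failed request; a dict without 'items' is treated as an error payload
    if not json_response or 'items' not in json_response:
        return []
    rows = []
    for item in json_response['items']:
        # Timestamps use the same YYYYMMDDHH format as the start/end parameters
        month_start = datetime.strptime(item['timestamp'], '%Y%m%d%H').date()
        rows.append((month_start, item['views']))
    return rows

# Hand-built sample in the shape of a per-article response
sample_response = {
    "items": [
        {"project": "en.wikipedia", "article": "Northern_flicker", "granularity": "monthly",
         "timestamp": "2015070100", "access": "desktop", "agent": "user", "views": 4018},
        {"project": "en.wikipedia", "article": "Northern_flicker", "granularity": "monthly",
         "timestamp": "2015080100", "access": "desktop", "agent": "user", "views": 3116},
    ]
}

monthly_views = extract_monthly_views(sample_response)
for month_start, views in monthly_views:
    print(month_start.isoformat(), views)
```

Returning an empty list instead of raising keeps a long loop over many titles running when a single article has no data; whether that is preferable to stopping on the first failure is a design choice.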

The following code can be uncommented to verify the API call is working correctly.

In [12]:
'''print("Getting pageview data for: ",ex_article_titles[1])
views = request_pageviews_per_article(ex_article_titles[1])

#print(json.dumps(views,indent=4))
print("Have %d months of pageview data"%(len(views['items'])))
for month in views['items']:
    print(json.dumps(month,indent=4))'''

Getting pageview data for:  Northern flicker
Have 99 months of pageview data
{
    "project": "en.wikipedia",
    "article": "Northern_flicker",
    "granularity": "monthly",
    "timestamp": "2015070100",
    "access": "desktop",
    "agent": "user",
    "views": 4018
}
{
    "project": "en.wikipedia",
    "article": "Northern_flicker",
    "granularity": "monthly",
    "timestamp": "2015080100",
    "access": "desktop",
    "agent": "user",
    "views": 3116
}
{
    "project": "en.wikipedia",
    "article": "Northern_flicker",
    "granularity": "monthly",
    "timestamp": "2015090100",
    "access": "desktop",
    "agent": "user",
    "views": 4802
}
{
    "project": "en.wikipedia",
    "article": "Northern_flicker",
    "granularity": "monthly",
    "timestamp": "2015100100",
    "access": "desktop",
    "agent": "user",
    "views": 5373
}
{
    "project": "en.wikipedia",
    "article": "Northern_flicker",
    "granularity": "monthly",
    "timestamp": "2015110100",
    "access": 

The following section of code will build the list of articles we want to retrieve data for. See the Program Overview section for more information on the source of this list. 

thank_the_academy.AUG.2023.csv should be downloaded from https://docs.google.com/spreadsheets/d/1A1h_7KAo7KXaVxdScJmIVPTvjb3IuY9oZhNV4ZHxrxw/edit#gid=1229854301 and saved to this repository's "raw_data" directory before running this code.

In [24]:
#Create a dataframe from the csv
article_title_list = pd.read_csv('raw_data/thank_the_academy.AUG.2023.csv')

#Save the names as a list
article_titles = list(article_title_list['name'])
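
Since one API request is made per title, it can be worth checking the list for blank or repeated names before starting the pull. The sketch below is a minimal, dependency-free illustration using the standard `csv` module on an in-memory sample rather than the real file; only the `name` column comes from the notebook, while the sample rows and the `unique_titles` helper are hypothetical.

```python
import csv
import io

def unique_titles(rows, column='name'):
    """Return non-empty titles in first-seen order with duplicates removed."""
    seen = set()
    titles = []
    for row in rows:
        title = (row.get(column) or '').strip()  # tolerate missing values and stray whitespace
        if title and title not in seen:
            seen.add(title)
            titles.append(title)
    return titles

# Made-up sample standing in for the real CSV; it includes a duplicate and padded whitespace
sample_csv = io.StringIO("name\nBison\nNorthern flicker\nBison\n Red squirrel \n")
checked_titles = unique_titles(csv.DictReader(sample_csv))
print(checked_titles)
```

With pandas already loaded, the same check could be done on the DataFrame column directly; the standard-library version is used here only to keep the sketch self-contained.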

## Acquire monthly mobile access data

The following section of code will acquire monthly mobile access data using the request_pageviews_per_article method built 