## Data Acquisition

This notebook pulls pageview data for the Academy Awards movie articles from Wikimedia using the Pageviews API. The first step is to load the Python packages needed for data acquisition.

In [None]:
import json, time, urllib.parse

import requests
from collections import OrderedDict
import pandas as pd

### Helper Code
The code below is from Dr. David McDonald. I will be using it to access the pageviews API. 

In [4]:
#########
#
#    CONSTANTS
#

# The REST API 'pageviews' URL - this is the common URL/endpoint for all 'pageviews' API requests
API_REQUEST_PAGEVIEWS_ENDPOINT = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'

# This is a parameterized string that specifies what kind of pageviews request we are going to make
# In this case it will be a 'per-article' based request. The string is a format string so that we can
# replace each parameter with an appropriate value before making the request
API_REQUEST_PER_ARTICLE_PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

# The Pageviews API asks that we not exceed 100 requests per second, we add a small delay to each request
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making a request to the Wikimedia API they ask that you include your email address which will allow them
# to contact you if something happens - such as - your code exceeding rate limits - or some other error 
REQUEST_HEADERS = {
    'User-Agent': '<uwnetid@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}

# This is just a list of English Wikipedia article titles that we can use for example requests
ARTICLE_TITLES = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

# This template is used to map parameter values into the API_REQUEST_PER_ARTICLE_PARAMS portion of an API request. The dictionary has a
# field/key for each of the required parameters. In the example, below, we only vary the article name, so the majority of the fields
# can stay constant for each request. Of course, these values *could* be changed if necessary.
ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE = {
    "project":     "en.wikipedia.org",
    "access":      "desktop",      # this should be changed for the different access types
    "agent":       "user",
    "article":     "",             # this value will be set/changed before each request
    "granularity": "monthly",
    "start":       "2015010100",   # start and end dates need to be set
    "end":         "2023040100"    # this is likely the wrong end date
}
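
To make the roles of the endpoint, the parameter string, and the template concrete, here is a minimal sketch of how the three combine into a request URL. The helper name `build_pageviews_url` and the dates in `example_template` are illustrative assumptions, not part of Dr. McDonald's code.

```python
import urllib.parse

# Values copied from the constants cell above
endpoint = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'
params = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

def build_pageviews_url(template, article_title):
    """Hypothetical helper: return the request URL for one article.

    Works on a copy, so the template passed in is left unchanged.
    """
    filled = dict(template)
    # Spaces become underscores, then the title is URL encoded
    filled['article'] = urllib.parse.quote(article_title.replace(' ', '_'), safe='')
    return endpoint + params.format(**filled)

# Example values only; the dates here are assumptions for illustration
example_template = {
    "project":     "en.wikipedia.org",
    "access":      "desktop",
    "agent":       "user",
    "article":     "",
    "granularity": "monthly",
    "start":       "2015070100",
    "end":         "2023093000",
}

url = build_pageviews_url(example_template, 'Northern flicker')
print(url)
```

Printing the URL before sending any requests is a cheap way to confirm that every placeholder in the parameter string has been filled.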

### Main Code

We will first adjust the values of several parameters used by Dr. McDonald's function. Three changes need to be made. First, I will add my email and contact information to the request headers. Then I will set the start and end dates in the Pageviews API parameters so that we only request pageview data from July 1, 2015 through September 30, 2023.

In [14]:
# Adjusting the given example code
REQUEST_HEADERS['User-Agent'] = 'eyfy@uw.edu, University of Washington, MSDS DATA 512 - AUTUMN 2023'

# Setting the start and end date to July 1, 2015 to September 30, 2023
# Dates are given as YYYYMMDDHH strings, matching the format used in the template
ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE['start'] = "2015070100"
ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE['end'] = "2023093000"
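
As a sanity check on the date range, the following sketch builds the two timestamps from `datetime.date` objects and computes how many monthly records a fully covered article should have. The helper `to_api_timestamp` is a hypothetical name introduced here, and the sketch assumes the hour component is always `00`.

```python
from datetime import date

def to_api_timestamp(day):
    """Hypothetical helper: format a date as YYYYMMDDHH with the hour fixed at 00."""
    return day.strftime('%Y%m%d') + '00'

first_day = date(2015, 7, 1)
last_day = date(2023, 9, 30)

start = to_api_timestamp(first_day)
end = to_api_timestamp(last_day)

# Number of calendar months from July 2015 through September 2023, inclusive
expected_months = (last_day.year - first_day.year) * 12 + (last_day.month - first_day.month) + 1

print(start, end, expected_months)
```

An article that existed for the whole period should return at most `expected_months` records per access type; articles created later will return fewer.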

##### Specifications
The goal is to extract the monthly **user** pageview requests from July 1, 2015 to September 30, 2023. We will create three separate datasets and store them as three different JSON files based upon three different access modes.
1. Monthly mobile access
    - academy_monthly_mobile_201507-202309.json
2. Monthly desktop access
    - academy_monthly_desktop_201507-202309.json
3. Monthly cumulative
    - academy_monthly_cumulative_201507-202309.json


Additional Notes:
1. Mobile access requires two separate requests, one for each of the `mobile-web` and `mobile-app` access values, whose views are then summed
2. We are only interested in monthly activity, i.e. `granularity = monthly`
3. JSON file format:
    - Ordered using article titles as a key for the resulting time series data
    - Remove the `access` field
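
The file format described in the notes can be sketched as follows. The records in `sample_items` are made-up values shaped like the items the API returns, and `strip_access` is a hypothetical helper used only for this illustration.

```python
import json

# Made-up records in the shape of the API's per-article items
sample_items = [
    {"project": "en.wikipedia", "article": "Bison", "granularity": "monthly",
     "timestamp": "2015070100", "access": "desktop", "agent": "user", "views": 100},
    {"project": "en.wikipedia", "article": "Bison", "granularity": "monthly",
     "timestamp": "2015080100", "access": "desktop", "agent": "user", "views": 120},
]

def strip_access(items):
    """Hypothetical helper: return copies of the records without the 'access' field."""
    return [{key: value for key, value in item.items() if key != 'access'}
            for item in items]

# One entry per article title, each holding that article's time series
output = {"Bison": strip_access(sample_items)}
print(json.dumps(output, indent=4))
```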


##### Loading the new parameters into the function

We will now define the `request_pageviews_per_article` function below. Because the constants above are already defined and updated, they serve as the default arguments for the function.

In [60]:
#########
#
#    PROCEDURES/FUNCTIONS from Dr. McDonald
#

def request_pageviews_per_article(article_title = None, 
                                  endpoint_url = API_REQUEST_PAGEVIEWS_ENDPOINT, 
                                  endpoint_params = API_REQUEST_PER_ARTICLE_PARAMS, 
                                  request_template = ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE,
                                  headers = REQUEST_HEADERS):

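    # work on a copy so that this call does not modify the shared template dictionary
    request_template = dict(request_template)

    # article title can be passed as a parameter to the call or set in the request_template
    if article_title:
        request_template['article'] = article_title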

    if not request_template['article']:
        raise Exception("Must supply an article title to make a pageviews request.")

    # Titles are supposed to have spaces replaced with "_" and be URL encoded
    article_title_encoded = urllib.parse.quote(request_template['article'].replace(' ','_'), safe='')
    request_template['article'] = article_title_encoded
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_template)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
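
Because the function returns `None` when the request raises, and a failed lookup may come back as a dictionary without an `items` key (an assumption about the error payload that is worth confirming against a real failed request), it can help to normalize the result before using it. The helper `extract_items` below is a hypothetical addition, not part of Dr. McDonald's code.

```python
def extract_items(json_response):
    """Hypothetical helper: return the list of monthly records, or [] if unusable."""
    if not isinstance(json_response, dict):
        return []
    items = json_response.get('items')
    return items if isinstance(items, list) else []

# Made-up responses covering the three cases
ok_response = {"items": [{"timestamp": "2015070100", "views": 42}]}
error_response = {"title": "Not found."}

print(len(extract_items(ok_response)))     # one record
print(len(extract_items(error_response)))  # no records
print(len(extract_items(None)))            # no records
```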


#### Extracting the Data

Now that our functions are defined, we can load the raw data file, a CSV file containing the name of each movie (the page name) and its corresponding URL. We can load it with the pandas package.

In [81]:
# Reading in the raw data
data_raw = pd.read_csv('../data_raw/thank_the_academy_AUG_2023.csv')
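
Before looping over the titles, it may be worth checking that the expected columns are present and whether any title appears more than once, since each duplicate would trigger redundant API calls. The sketch below uses a small in-memory stand-in for the CSV. The `name` column comes from the code in this notebook, while the `url` column name and the sample rows are assumptions.

```python
import io
import pandas as pd

# Hypothetical stand-in for the raw CSV; the real file is not reproduced here
sample_csv = io.StringIO(
    "name,url\n"
    "Bison,https://en.wikipedia.org/wiki/Bison\n"
    "Bison,https://en.wikipedia.org/wiki/Bison\n"
    "Red squirrel,https://en.wikipedia.org/wiki/Red_squirrel\n"
)
sample_raw = pd.read_csv(sample_csv)

# Columns we expect but did not find
missing = {'name', 'url'} - set(sample_raw.columns)

# Titles that occur more than once, and the de-duplicated list to iterate over
duplicate_names = sample_raw['name'][sample_raw['name'].duplicated()].tolist()
unique_titles = sample_raw['name'].drop_duplicates().tolist()

print(missing, duplicate_names, unique_titles)
```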

Now we will iterate over the movie pages and pull pageview data from the Wikimedia API. Because there are several access types, we need to make multiple API calls for each movie.

In [80]:
mobile_dict = {}
desktop_dict = {}
cumulative_dict = {}

# Iterating through the names in data_raw
for _, item in data_raw.iterrows():
    try:
        # Load all access types
        ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE['access'] = 'mobile-web'
        mobile_web = request_pageviews_per_article(item['name'], request_template=ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE)
        ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE['access'] = 'mobile-app'
        mobile_app = request_pageviews_per_article(item['name'], request_template=ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE)
        ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE['access'] = 'desktop'
        desktop = request_pageviews_per_article(item['name'], request_template=ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE)

        # Convert all to pandas dataframes
        mobile_web_df = pd.DataFrame(mobile_web['items'])
        mobile_app_df = pd.DataFrame(mobile_app['items'])
        desktop_df = pd.DataFrame(desktop['items'])

        # Drop the access column in all dataframes
        mobile_web_df.drop(columns=['access'], inplace=True)
        mobile_app_df.drop(columns=['access'], inplace=True)
        desktop_df.drop(columns=['access'], inplace=True)


        # Start the combined dataframes from copies so the originals are untouched
        mobile_combine_df = mobile_web_df.copy()
        cumulative_df = desktop_df.copy()
        # Combine the two mobile access types (web and app)
        mobile_combine_df['views'] = mobile_web_df['views'] + mobile_app_df['views']
        # Combine mobile with desktop to get total views
        cumulative_df['views'] = desktop_df['views'] + mobile_combine_df['views']

        # Convert back to dictionary
        mobile_output_dict = mobile_combine_df.to_dict(orient='records')
        desktop_output_dict = desktop_df.to_dict(orient='records')
        cumulative_output_dict = cumulative_df.to_dict(orient='records')

        # Storing movie views for each access type
        mobile_dict[item['name']] = mobile_output_dict
        desktop_dict[item['name']] = desktop_output_dict
        cumulative_dict[item['name']] = cumulative_output_dict

    except Exception as e:
        print(item['name'])
        print("Failed!", e)
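
One caveat with the loop above: adding the `views` columns directly relies on pandas aligning rows by position, so the sums are only correct when every access type returns the same months in the same order. If one series is missing a month, later rows pair up with the wrong month and the trailing rows become `NaN`. A minimal alternative, assuming only that each record carries `timestamp` and `views`, is to sum by timestamp. The helper name `sum_views_by_timestamp` and the sample numbers are made up for illustration.

```python
def sum_views_by_timestamp(*series):
    """Hypothetical helper: total the views per timestamp across several record lists."""
    totals = {}
    for items in series:
        for item in items:
            timestamp = item['timestamp']
            totals[timestamp] = totals.get(timestamp, 0) + item['views']
    # YYYYMMDDHH strings sort chronologically
    return [{'timestamp': ts, 'views': totals[ts]} for ts in sorted(totals)]

# Made-up example: the app series has no record for August 2015
web_items = [
    {'timestamp': '2015070100', 'views': 10},
    {'timestamp': '2015080100', 'views': 20},
    {'timestamp': '2015090100', 'views': 30},
]
app_items = [
    {'timestamp': '2015070100', 'views': 1},
    {'timestamp': '2015090100', 'views': 3},
]

mobile_totals = sum_views_by_timestamp(web_items, app_items)
print(mobile_totals)
```

This sketch keeps only `timestamp` and `views`; the remaining fields (`project`, `article`, `granularity`, `agent`) would need to be carried over separately to match the output format used in this notebook.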

The final steps are to order each dictionary by movie name using `OrderedDict` from the `collections` module and then store the dictionaries as JSON files. We will store the extracted data in the `data_final` folder.

In [None]:
# Sorting dictionaries by key
ordered_mobile = OrderedDict(sorted(mobile_dict.items()))
ordered_desktop = OrderedDict(sorted(desktop_dict.items()))
ordered_cumulative = OrderedDict(sorted(cumulative_dict.items()))

# Writing dictionaries to output files
with open('../data_final/academy_monthly_mobile_201507-202309.json', 'w+') as f:
    json.dump(ordered_mobile, f, indent=4)
with open('../data_final/academy_monthly_desktop_201507-202309.json', 'w+') as f:
    json.dump(ordered_desktop, f, indent=4)
with open('../data_final/academy_monthly_cumulative_201507-202309.json', 'w+') as f:
    json.dump(ordered_cumulative, f, indent=4)
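
A quick way to confirm that the ordering survives the write is to load a file back and inspect its keys. The sketch below does this round trip in a temporary directory with made-up data rather than the real output files; the file name and values are assumptions.

```python
import json
import os
import tempfile
from collections import OrderedDict

# Made-up data, deliberately out of alphabetical order
sample = {
    "Red squirrel":     [{"timestamp": "2015070100", "views": 3}],
    "Bison":            [{"timestamp": "2015070100", "views": 1}],
    "Northern flicker": [{"timestamp": "2015070100", "views": 2}],
}
ordered_sample = OrderedDict(sorted(sample.items()))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample_monthly_desktop.json')
    with open(path, 'w') as f:
        json.dump(ordered_sample, f, indent=4)
    with open(path) as f:
        reloaded = json.load(f)

# json.load returns a dict whose keys keep the order found in the file
titles = list(reloaded)
print(titles)
```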