## Goal

This notebook provides the code and instructions to acquire, process, and analyze pageview data for a subset of Wikipedia articles about rare diseases, in order to generate insights into how users interact with those articles.

### Import necessary libraries

In [1]:
import json, time, urllib.parse
import requests
import os
import pandas as pd

### Data Acquisition
To get the list of all rare disease articles, download this [CSV](https://drive.google.com/file/d/15_FiKhBgXB2Ch9c0gAGYzKjF0DBhEPlY/view) to the `data` folder.

### Define configurations
Define all the constants and configurations to be used for the data acquisition, including:
- Wikipedia API metadata
- the time range of the data to collect
- rare disease csv file path
- data directory to save files to

This code cell follows this [template](https://drive.google.com/file/d/1fYTIX79t9jk-Jske8IwysV-rbRkD4_dc/view).

In [2]:
#########
#
#    CONSTANTS
#

# The REST API 'pageviews' URL - this is the common URL/endpoint for all 'pageviews' API requests
API_REQUEST_PAGEVIEWS_ENDPOINT = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'

# This is a parameterized string that specifies what kind of pageviews request we are going to make
# In this case it will be a 'per-article' based request. The string is a format string so that we can
# replace each parameter with an appropriate value before making the request
API_REQUEST_PER_ARTICLE_PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

# The Pageviews API asks that we not exceed 100 requests per second, we add a small delay to each request
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making a request to the Wikimedia API they ask that you include your email address which will allow them
# to contact you if something happens - such as - your code exceeding rate limits - or some other error 
REQUEST_HEADERS = {
    'User-Agent': '<uwid@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}

# This is the path to the list of article titles that we are going to request pageviews for
disease_articles_path = 'data/rare-disease_cleaned.AUG.2024.csv'
ARTICLE_TITLES = pd.read_csv(disease_articles_path)['disease'].tolist()

ACCESS_MAPPING = {
    'all-access': 'all-access', 
    'desktop': 'desktop', 
    'mobile-app': 'mobile', 
    'mobile-web': 'mobile'
    }

# This template is used to map parameter values into the API_REQUEST_PER_ARTICLE_PARAMS portion of an API request. The dictionary has a
# field/key for each of the required parameters. In this notebook we only vary the article name and the access type, so the majority
# of the fields can stay constant for each request. Of course, these values *could* be changed if necessary.
ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE = {
    "project":     "en.wikipedia.org",
    "access":      "",      # this should be changed for the different access types
    "agent":       "user",
    "article":     "",             # this value will be set/changed before each request
    "granularity": "monthly",
    "start":       "2015070100",   # start and end dates need to be set
    "end":         "2024093000"    
}

#########
# SAVE DATA

data_dir = 'data/jsons/'
startYYYYMM = "201507"
endYYYYMM = "202409"
name_template = {
    'mobile': f'rare-disease_monthly_mobile_{startYYYYMM}-{endYYYYMM}.json',
    'desktop': f'rare-disease_monthly_desktop_{startYYYYMM}-{endYYYYMM}.json',
    'all-access': f'rare-disease_monthly_cumulative_{startYYYYMM}-{endYYYYMM}.json'
}
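
As a rough illustration of how these constants fit together, the sketch below fills the parameter string from a template dictionary and joins it to the endpoint, then computes the throttle delay. The values are copied from the cell above; the article title `Cystic_fibrosis` and the `desktop` access type are only example values chosen for this sketch, and no request is sent.

```python
# Values copied from the configuration cell; nothing here contacts the API.
endpoint = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'
params = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

template = {
    "project":     "en.wikipedia.org",
    "access":      "desktop",            # example access type (assumption for this sketch)
    "agent":       "user",
    "article":     "Cystic_fibrosis",    # example title (assumption for this sketch)
    "granularity": "monthly",
    "start":       "2015070100",
    "end":         "2024093000",
}

# str.format(**template) substitutes each {name} placeholder with the matching dictionary value
request_url = endpoint + params.format(**template)
print(request_url)

# 100 requests per second allows 10 ms per request; subtracting the assumed 2 ms latency leaves the wait
throttle_wait = (1.0 / 100.0) - 0.002
print(f"wait {throttle_wait:.3f} seconds between requests")
```

Because every placeholder in the parameter string must have a matching key, a missing key in the template raises a `KeyError` at format time rather than producing a malformed URL.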


### Function Documentation: 
`request_pageviews_per_article_per_access`
This function retrieves the pageview data for a specific article from the Wikimedia Analytics API. The function allows for specifying the type of access (e.g., desktop, mobile) and returns the JSON response containing pageviews over a specified time period. 

This code cell follows this [template](https://drive.google.com/file/d/1fYTIX79t9jk-Jske8IwysV-rbRkD4_dc/view).

In [13]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageviews_per_article_per_access(article_title = None,
                                  access_type = 'all-access',
                                  endpoint_url = API_REQUEST_PAGEVIEWS_ENDPOINT, 
                                  endpoint_params = API_REQUEST_PER_ARTICLE_PARAMS, 
                                  request_template = ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE,
                                  headers = REQUEST_HEADERS):
    """
    This helper function makes a request to the Wikimedia Pageviews API for a specific article and access type.
    The function will return the JSON response from the API. The function will also handle any exceptions
    that occur during the request process and return None if an exception occurs.
    
    Parameters:
    article_title (str) : the title of the article to request pageviews for
    access_type (str) : the access type to request pageviews for
    endpoint_url (str) : the base URL for the API request
    endpoint_params (str) : the parameterized string that specifies the type of request
    request_template (dict) : a dictionary that contains the parameters for the request
    headers (dict) : a dictionary that contains the headers for the request

    Returns:
    json_response (dict) : the JSON response from the API request
    """

    # work on a copy so that the shared template (used as a default argument) is not modified between calls
    request_params = dict(request_template)

    # article title can be given as a parameter to the call or in the request_template
    if article_title:
        request_params['article'] = article_title

    if not request_params['article']:
        raise Exception("Must supply an article title to make a pageviews request.")

    # Titles are supposed to have spaces replaced with "_" and be URL encoded
    article_title_encoded = urllib.parse.quote(request_params['article'].replace(' ','_'), safe='')
    request_params['article'] = article_title_encoded

    # access type can be set in the request_template or as a parameter to the call
    if access_type:
        request_params['access'] = access_type
    
    if not request_params['access']:
        raise Exception("Must supply an access type to make a pageviews request.")
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_params)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
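
The title encoding step can be tried in isolation. The sketch below wraps the same `urllib.parse.quote` call in a small helper, `encode_title`, which is a hypothetical name introduced only for this illustration; the titles are example inputs and are not necessarily in the rare disease list.

```python
import urllib.parse

def encode_title(title):
    """Replace spaces with underscores, then percent-encode everything that is not URL-safe."""
    # safe='' means that even '/' is encoded, so a title cannot add extra path segments to the URL
    return urllib.parse.quote(title.replace(' ', '_'), safe='')

for title in ["Klinefelter syndrome", "Sjögren's syndrome", "HIV/AIDS"]:
    print(title, "->", encode_title(title))
```

Passing `safe=''` matters for titles that contain a slash: with the default `safe='/'` the slash would be left as-is and the API would read it as a path separator.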

`request_pageviews_per_article`

This function retrieves the pageview data for a specific article from the Wikimedia Analytics API across multiple access types (e.g., desktop, mobile-web, mobile-app). It calls the request_pageviews_per_article_per_access function for each access type and returns a dictionary containing the pageview data for each access type.

In [4]:
def request_pageviews_per_article(article_title = None,
                                  access_types = ACCESS_MAPPING,
                                  endpoint_url = API_REQUEST_PAGEVIEWS_ENDPOINT, 
                                  endpoint_params = API_REQUEST_PER_ARTICLE_PARAMS, 
                                  request_template = ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE,
                                  headers = REQUEST_HEADERS):
    """
    This function makes requests to the Wikimedia Pageviews API for a specific article, one request per access type.
    The function will return a dictionary of the JSON responses from the API, keyed by access type.

    Parameters:
    article_title (str) : the title of the article to request pageviews for
    access_types (dict) : a dictionary that contains the access types to request pageviews for
    endpoint_url (str) : the base URL for the API request
    endpoint_params (str) : the parameterized string that specifies the type of request
    request_template (dict) : a dictionary that contains the parameters for the request
    headers (dict) : a dictionary that contains the headers for the request

    Returns:
    pageviews (dict) : a dictionary that contains the JSON responses from the API request

    """
    pageviews = {}
    for access_type,_ in access_types.items():
        pageviews[access_type] = request_pageviews_per_article_per_access(
            article_title, access_type, endpoint_url, endpoint_params, request_template, headers
        )

    return pageviews
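
Since the per-access request function returns `None` when an exception occurs, it may be worth checking the returned dictionary before processing it. The following is a minimal sketch of such a check; `has_items` and `all_access_types_ok` are hypothetical helpers that are not part of the notebook, and the shape of the error payload used below is an assumption made for illustration.

```python
def has_items(response):
    """Return True when a response looks like a pageviews payload with an 'items' list."""
    return isinstance(response, dict) and isinstance(response.get('items'), list)

def all_access_types_ok(pageviews):
    """Return True only if every access type in the dictionary has a usable response."""
    return all(has_items(response) for response in pageviews.values())

# Hand-made examples; the error payload shape is an assumption for this sketch
good = {'desktop': {'items': [{'views': 10}]}, 'mobile-web': {'items': []}}
failed = {'desktop': {'items': [{'views': 10}]}, 'mobile-web': None}
error = {'desktop': {'title': 'Not found.'}}

print(all_access_types_ok(good), all_access_types_ok(failed), all_access_types_ok(error))
```

A check like this could be used to skip or log an article instead of letting a missing `items` key raise an error later in the pipeline.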

`process_views`

The `process_views` function processes pageview data for various access types (e.g., desktop, mobile-app, mobile-web) retrieved from an API response, and sums the views for mobile access by combining the "mobile-app" and "mobile-web" data for each article. It returns a pandas DataFrame with the processed data, where each row represents pageview information for a specific access type and timestamp.

In [5]:
def process_views(views, access_mapping):
    """
    This function processes the JSON response from the Pageviews API and returns a DataFrame with the relevant fields.
    This step converts the access types defined by the Wikimedia API into the desired access types.

    Parameters:
    views (dict) : a dictionary that contains the JSON responses from the API request
    access_mapping (dict) : a dictionary that maps the access types to the desired access types

    Returns:
    df (DataFrame) : a DataFrame that contains the relevant fields from the JSON response
    """
    # Stack the items from every requested access type into a single DataFrame
    df = pd.concat([pd.DataFrame(views[access]['items']) for access in access_mapping], ignore_index=True)

    mobile_df = df[df['access'].isin(['mobile-app', 'mobile-web'])]

    # Group by relevant fields and sum the views for mobile access
    mobile_summed = mobile_df.groupby(['project', 'article', 'granularity', 'timestamp', 'agent'], as_index=False)['views'].sum()

    # Add a new column 'access' with value 'mobile'
    mobile_summed['access'] = 'mobile'
    df = pd.concat([df, mobile_summed], ignore_index=True)
    
    # Drop the rows where access is 'mobile-app' or 'mobile-web'
    df = df[~df['access'].isin(['mobile-app', 'mobile-web'])]
    df = df.drop_duplicates()

    # Sort the DataFrame by timestamp and reset index
    df = df.sort_values(by=['article', 'access', 'timestamp']).reset_index(drop=True)
    
    return df
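
To see the mobile aggregation on its own, the sketch below applies the same filter, group, and sum steps to a small hand-made table. The rows and the article name `Example` are invented for illustration, and only a subset of the real API fields is included.

```python
import pandas as pd

# Invented rows with a subset of the fields that the API returns
rows = [
    {'article': 'Example', 'timestamp': '2024010100', 'access': 'mobile-app', 'views': 5},
    {'article': 'Example', 'timestamp': '2024010100', 'access': 'mobile-web', 'views': 20},
    {'article': 'Example', 'timestamp': '2024010100', 'access': 'desktop',    'views': 30},
    {'article': 'Example', 'timestamp': '2024020100', 'access': 'mobile-app', 'views': 7},
    {'article': 'Example', 'timestamp': '2024020100', 'access': 'mobile-web', 'views': 13},
]
df = pd.DataFrame(rows)

# Select the two mobile access types and sum their views per article and month
is_mobile = df['access'].isin(['mobile-app', 'mobile-web'])
mobile = df[is_mobile].groupby(['article', 'timestamp'], as_index=False)['views'].sum()
mobile['access'] = 'mobile'

# Keep the non-mobile rows and add the combined mobile rows
result = pd.concat([df[~is_mobile], mobile], ignore_index=True)
result = result.sort_values(['article', 'access', 'timestamp']).reset_index(drop=True)
print(result)
```

Each month ends up with a single `mobile` row whose view count is the sum of the app and web counts, while the `desktop` row is left unchanged.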

`save_json`

The `save_json` function saves the queried data as JSON files, one for each access type (e.g., desktop, mobile, all-access). For each unique value of 'access', the corresponding rows are filtered, the 'access' column is dropped, and the data is saved to a JSON file. The function accepts a naming template that gives the file name for each access type.

In [6]:
def save_json(data, path, name_template):
    """
    This function saves the data to the JSON file based on the access type.
    For the given path that contains existing JSON files, this will not overwrite the existing data,
    but will append the new data to the existing data.

    Parameters:
    data (DataFrame) : a DataFrame that contains the relevant fields from the JSON response
    path (str) : the path to save the JSON file
    name_template (dict) : a dictionary that contains the name of the JSON file based on the access type

    Returns:
    None

    """

    unique_access = data['access'].unique()
    for access in unique_access:
        # Filter the data based on the access type
        final_df = data.loc[data['access']==access].drop(columns=['access'])

        # Make sure the output directory exists before reading or writing
        if path:
            os.makedirs(path, exist_ok=True)
        full_path = os.path.join(path, name_template[access])

        # Load any data already saved for this access type
        try:
            with open(full_path, 'r') as f:
                existing_data = json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            # no file yet, or a file that cannot be parsed: start from an empty list
            existing_data = []

        # Append the new data to the existing data
        curr_data = final_df.to_dict(orient='records')
        updated_data = existing_data + curr_data
        # Write the updated data back to the file
        with open(full_path, 'w') as f:
            json.dump(updated_data, f, indent=4)
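
The read, append, and rewrite pattern used here can be checked with a temporary file. The sketch below is a simplified stand-in for `save_json`, not the function itself: `append_records` is a hypothetical helper, and the records are invented.

```python
import json
import os
import tempfile

def append_records(full_path, new_records):
    """Load the list stored at full_path (if any), extend it, and write it back."""
    try:
        with open(full_path, 'r') as f:
            existing = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        existing = []   # first write, or an unreadable file
    with open(full_path, 'w') as f:
        json.dump(existing + new_records, f, indent=4)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.json')
    append_records(path, [{'article': 'A', 'views': 1}])
    append_records(path, [{'article': 'B', 'views': 2}])
    append_records(path, [{'article': 'B', 'views': 2}])   # repeating a call stores the record twice
    with open(path) as f:
        saved = json.load(f)

print(len(saved), [record['article'] for record in saved])
```

Because the data is appended rather than overwritten, running the pipeline a second time against the same output files would store every record twice, so it is safer to start from an empty output directory.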


### Run the pipeline for data acquisition and processing

In [23]:
total = len(ARTICLE_TITLES)
for i, title in enumerate(ARTICLE_TITLES):
    progress = (i+1)/total*100
    # print a status line whenever the integer percentage is a multiple of 10
    if int(progress) % 10 == 0:
        print(f"Processing {title}: {progress:.1f}%")
    # request all access types for this article, combine the mobile views, and append to the JSON files
    views = request_pageviews_per_article(title)
    df = process_views(views, ACCESS_MAPPING)
    save_json(df, data_dir, name_template)
