### Wikipedia Pageview Data Collection

### Importing Libraries

This section imports essential Python libraries required for making API requests, processing data, and visualizing results.

#### Libraries:
- **Standard Libraries**: 
  - `json`, `time`, `urllib.parse`: Used for handling JSON data, adding delays between requests, and encoding URLs.
  - `datetime`: Assists with date handling and formatting.
  
- **Third-Party Libraries**:
  - `requests`: Required to make API calls. Install it with `pip install requests` if it is not already available.
  - `pandas`: Used for data manipulation and storage in DataFrames.
  - `matplotlib.pyplot`: Used for data visualization.

#### Reproducibility:
Declaring every dependency in a single cell shows what must be installed before the notebook will run, which makes it easier to reproduce the workflow in a different environment.


In [1]:
#
# These are standard Python modules
import json, time, urllib.parse
from datetime import datetime
#
# You will need to install these with pip/pip3 if you do not already have them
import requests
import pandas as pd
import matplotlib.pyplot as plt

### Constants for API Requests

This section defines constants used to construct and manage API requests to the Wikimedia Pageviews API, ensuring consistency and reproducibility throughout the workflow.

#### Key Elements:
- **API Request URL**: The base URL (`API_REQUEST_PAGEVIEWS_ENDPOINT`) for all pageviews requests.
- **Parameterized Request String**: The string (`API_REQUEST_PER_ARTICLE_PARAMS`) contains placeholders for project, access type, article name, and date range, which will be dynamically replaced.
- **API Throttling**: A small delay (`API_THROTTLE_WAIT`) is added to ensure we don't exceed Wikimedia’s rate limit of 100 requests per second.
- **Headers**: The request includes a User-Agent in `REQUEST_HEADERS` to comply with Wikimedia’s API guidelines.
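
As a rough illustration of how the elements above fit together, the sketch below repeats the endpoint, the parameterized string, and the throttle arithmetic from the constants cell, then fills the placeholders for a single article. The names `example_params` and `example_url` are introduced here only for illustration and are not part of the notebook.

```python
# Values copied from the constants cell of this notebook
API_REQUEST_PAGEVIEWS_ENDPOINT = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'
API_REQUEST_PER_ARTICLE_PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

# Throttle arithmetic: 100 requests per second allows 10 ms per request;
# subtracting the assumed 2 ms of latency leaves roughly 8 ms to sleep
API_LATENCY_ASSUMED = 0.002
API_THROTTLE_WAIT = (1.0 / 100.0) - API_LATENCY_ASSUMED

# Hypothetical parameter values for one request (illustration only)
example_params = {
    "project": "en.wikipedia.org",
    "access": "desktop",
    "agent": "user",
    "article": "Bison",
    "granularity": "monthly",
    "start": "2015070100",
    "end": "2024093000",
}

# Each {placeholder} in the format string is replaced by the matching dictionary value
example_url = API_REQUEST_PAGEVIEWS_ENDPOINT + API_REQUEST_PER_ARTICLE_PARAMS.format(**example_params)
print(example_url)
```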

#### Reproducibility:
By centralizing these values, we ensure that all requests use consistent parameters, which simplifies debugging, enhances transparency, and allows for easy modifications without altering the core code.

#### License
Code owner: Dr. David W. McDonald

Creative Commons CC-BY license


In [2]:
#########
#
#    CONSTANTS
#

# The REST API 'pageviews' URL - this is the common URL/endpoint for all 'pageviews' API requests
API_REQUEST_PAGEVIEWS_ENDPOINT = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'

# This is a parameterized string that specifies what kind of pageviews request we are going to make
# In this case it will be a 'per-article' based request. The string is a format string so that we can
# replace each parameter with an appropriate value before making the request
API_REQUEST_PER_ARTICLE_PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

# The Pageviews API asks that we not exceed 100 requests per second, so we add a small delay to each request
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making a request to the Wikimedia API they ask that you include your email address which will allow them
# to contact you if something happens - such as - your code exceeding rate limits - or some other error 
REQUEST_HEADERS = {
    'User-Agent': '<trips@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024',
}

# This is just a list of English Wikipedia article titles that we can use for example requests
ARTICLE_TITLES = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

# This template is used to map parameter values into the API_REQUEST_PER_ARTICLE_PARAMS portion of an API request. The dictionary has a
# field/key for each of the required parameters. In the example below, we only vary the article name, so the majority of the fields
# can stay constant for each request. Of course, these values *could* be changed if necessary.
ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE = {
    "project":     "en.wikipedia.org",
    "access":      "desktop",      # this should be changed for the different access types
    "agent":       "user",
    "article":     "",             # this value will be set/changed before each request
    "granularity": "monthly",
    "start":       "2015070100",   #start date
    "end":         "2024093000"    #end date
}

### API Data Request Function

The `request_pageviews_per_article` function automates the process of requesting pageview data from the Wikimedia Pageviews API. It handles API calls, parameters, and responses, ensuring consistency and reproducibility.

#### Key Parameters:
- **article_title**: Wikipedia article name (URL-encoded for special characters).
- **access_type**: Specifies the platform (`desktop`, `mobile-web`, `mobile-app`, or `all-access`).
- **endpoint_url**: Base URL for the API.
- **endpoint_params**: Placeholder string for request parameters.
- **request_template**: Standard parameters for API requests (e.g., project, date range).
- **headers**: User metadata for API compliance.

#### Workflow:
1. **Set Article Title**: Dynamically assigns the article title in the request.
2. **Encode Title**: Handles special characters via URL encoding.
3. **Construct API Request**: Combines endpoint and parameters for the request URL.
4. **Make Request**: Fetches data from the API, ensuring API rate limits are respected.
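5. **Error Handling**: Catches exceptions raised during the request or while parsing the response, prints them, and returns `None`. The HTTP status code is not checked, so an error payload from the API is returned as-is.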

#### Return:
The function returns the parsed JSON response as a Python dictionary, or `None` if an exception occurred.

#### Reproducibility:
Encapsulating the API logic ensures automated, repeatable data collection across multiple articles and access types, reducing manual steps and minimizing errors.
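
To make the return value described above more concrete, here is a minimal sketch of how a caller might read it. The `sample_response` dictionary follows the shape of a per-article response, but its view counts are made up, and `extract_monthly_views` is a hypothetical helper that does not exist in this notebook.

```python
# Illustrative response in the shape of a per-article result;
# the view counts are invented for this example
sample_response = {
    "items": [
        {"project": "en.wikipedia", "article": "Bison", "granularity": "monthly",
         "timestamp": "2015070100", "access": "desktop", "agent": "user", "views": 1200},
        {"project": "en.wikipedia", "article": "Bison", "granularity": "monthly",
         "timestamp": "2015080100", "access": "desktop", "agent": "user", "views": 1350},
    ]
}

def extract_monthly_views(json_response):
    """Return (timestamp, views) pairs, or an empty list when no data came back.

    Hypothetical helper: it treats both None (an exception occurred) and a
    payload without an 'items' key (for example an error body) as "no data".
    """
    if not json_response or "items" not in json_response:
        return []
    return [(item["timestamp"], item["views"]) for item in json_response["items"]]

monthly = extract_monthly_views(sample_response)
print(monthly)
```

Checking for the `items` key, rather than only for `None`, matters because the function does not inspect the HTTP status code.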


In [3]:
#########
#
#    PROCEDURES/FUNCTIONS
#

# Function to request pageviews per article
def request_pageviews_per_article(article_title=None, 
                                  access_type="desktop",
                                  endpoint_url=API_REQUEST_PAGEVIEWS_ENDPOINT, 
                                  endpoint_params=API_REQUEST_PER_ARTICLE_PARAMS, 
                                  request_template=ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE,
                                  headers=REQUEST_HEADERS):

    # Work on a copy so the shared template (a mutable default argument) is never modified between calls
    params = dict(request_template)

    # Set the article title
    if article_title:
        params['article'] = article_title

    if not params['article']:
        raise Exception("Must supply an article title to make a pageviews request.")

    # Set the access type
    params['access'] = access_type
    
    # Encode the article title for URL
    # Note: quote() leaves '/' unescaped by default, so a title containing '/' adds an extra path segment to the URL
    params['article'] = urllib.parse.quote(params['article'].replace(' ','_'))
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**params)
    
    # Make the request and handle exceptions
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


### Loading Article Titles from CSV

This section loads Wikipedia article titles related to rare diseases from a CSV file. The `'disease'` column in the file contains the article names, which will be used in API requests.

#### Steps:
1. **Load CSV**: The CSV file is read into a DataFrame using `pandas`, ensuring a structured format.
2. **Extract Titles**: Article titles are converted into a list, which will be iterated over for data fetching.

#### Reproducibility:
Automating the loading of article titles ensures consistent input for the workflow and reduces manual intervention, improving reproducibility.

In [4]:
# Load the article titles from the provided CSV file
# The column 'disease' contains the titles of the Wikipedia articles
df_articles = pd.read_csv("../data/rare-disease_cleaned.AUG.2024.csv")
article_titles = df_articles['disease'].tolist()  # Convert the 'disease' column into a Python list of article titles

### Data Acquisition

This section automates the collection of pageview data for each article from the Wikimedia Pageviews API. We retrieve data for three access types: desktop, mobile (combined mobile-web and mobile-app), and cumulative (all-access).

#### Steps:
1. **Desktop Views**: Fetches and stores monthly desktop pageview data for each article.
2. **Mobile Views**: Combines mobile-web and mobile-app views for each article and stores the monthly sum.
3. **Cumulative Views**: Collects the total monthly pageviews across all access types (desktop, mobile-web, mobile-app).

#### Reproducibility:
Automating the data collection ensures consistent, repeatable results, minimizing the risk of manual errors while fetching large-scale data across multiple articles and access types.
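
Step 2 above relies on the mobile-web and mobile-app records lining up month for month. As a hedged alternative, the sketch below sums the two sources by timestamp instead of by position, so a month missing from one source does not shift the pairing. `combine_mobile_views` is a hypothetical helper, and the sample counts are invented.

```python
def combine_mobile_views(web_items, app_items):
    """Sum mobile-web and mobile-app views per timestamp.

    Hypothetical helper: months present in only one source keep that
    source's count instead of being dropped.
    """
    totals = {}
    for record in list(web_items) + list(app_items):
        timestamp = record["timestamp"]
        totals[timestamp] = totals.get(timestamp, 0) + record["views"]
    # Timestamps are fixed-width YYYYMMDDHH strings, so sorting them as
    # strings also sorts them chronologically
    return [{"timestamp": ts, "views": totals[ts]} for ts in sorted(totals)]

# Invented sample data: the app source has no record for July 2015
web_sample = [{"timestamp": "2015070100", "views": 10},
              {"timestamp": "2015080100", "views": 20}]
app_sample = [{"timestamp": "2015080100", "views": 5}]

combined = combine_mobile_views(web_sample, app_sample)
print(combined)
```

Whether a month reported by only one source should be kept or discarded is a design decision; this sketch keeps it.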


In [5]:
#########
#
#    DATA ACQUISITION
#

# Initialize empty lists to store pageview data for desktop, mobile, and cumulative views
# (despite the _dict suffix, each variable is a list of per-month dictionaries)
desktop_dict = []      # Data for desktop views will be stored here
mobile_dict = []       # Data for combined mobile-web and mobile-app views will be stored here
cumulative_dict = []   # Data for cumulative pageviews (all-access) will be stored here

# Iterate over each article title and fetch pageview data for the different access types
for article in article_titles:
    # Fetch desktop pageviews for the article
    desktop_pageviews = request_pageviews_per_article(article_title=article, access_type="desktop")
    
    # If the API returns valid desktop data, extract the monthly view counts and store them
    if desktop_pageviews and 'items' in desktop_pageviews:
        for month_record in desktop_pageviews['items']:
            # Each month is represented by a dictionary with the article title, timestamp, and views
            desktop_month_data = {
                "article_title": article,
                "timestamp": month_record['timestamp'],  # Date of the views
                "views": month_record['views']  # Number of views for that month
            }
            desktop_dict.append(desktop_month_data)  # Add the data to the desktop data list
    else:
        print(f"No desktop data found for {article}")  # Message if no desktop data is found

    # Fetch mobile views by combining mobile-web and mobile-app pageviews
    mobile_web_pageviews = request_pageviews_per_article(article_title=article, access_type="mobile-web")
    mobile_app_pageviews = request_pageviews_per_article(article_title=article, access_type="mobile-app")
    
    # If both mobile-web and mobile-app data are available, sum their views for each month
    if mobile_web_pageviews and 'items' in mobile_web_pageviews and mobile_app_pageviews and 'items' in mobile_app_pageviews:
        for web_record, app_record in zip(mobile_web_pageviews['items'], mobile_app_pageviews['items']):
            # zip() pairs records by position, so confirm that both records refer to the same month;
            # pairs whose timestamps differ are skipped
            if web_record['timestamp'] == app_record['timestamp']:
                mobile_month_data = {
                    "article_title": article,
                    "timestamp": web_record['timestamp'],  # Date of the views
                    "views": web_record['views'] + app_record['views']  # Sum of mobile-web and mobile-app views
                }
                mobile_dict.append(mobile_month_data)  # Add the combined mobile data to the mobile list
    else:
        print(f"No mobile-web or mobile-app data found for {article}")  # Message if no mobile data is found

    # Fetch cumulative views (all-access combines desktop, mobile-web, and mobile-app views)
    all_access_pageviews = request_pageviews_per_article(article_title=article, access_type="all-access")
    
    # If cumulative data (all-access) is returned, add it to the cumulative list
    if all_access_pageviews and 'items' in all_access_pageviews:
        for cumulative_month in all_access_pageviews['items']:
            cumulative_month_data = {
                "article_title": article,
                "timestamp": cumulative_month['timestamp'],  # Date of the views
                "views": cumulative_month['views']  # Total views from all access types (desktop, mobile-web, mobile-app)
            }
            cumulative_dict.append(cumulative_month_data)  # Add the cumulative data to the cumulative list
    else:
        print(f"No all-access data found for {article}")  # Message if no all-access data is found


No desktop data found for Sulfadoxine/pyrimethamine
No mobile-web or mobile-app data found for Sulfadoxine/pyrimethamine
No all-access data found for Sulfadoxine/pyrimethamine
No desktop data found for Cystine/glutamate transporter
No mobile-web or mobile-app data found for Cystine/glutamate transporter
No all-access data found for Cystine/glutamate transporter
No desktop data found for Trimethoprim/sulfamethoxazole
No mobile-web or mobile-app data found for Trimethoprim/sulfamethoxazole
No all-access data found for Trimethoprim/sulfamethoxazole
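
All three titles reported above contain a `/`. A plausible explanation is that `urllib.parse.quote` treats `/` as a safe character by default, so the slash stays in the URL and is read as an extra path segment. The sketch below shows the difference when `safe=""` is passed; whether the API then returns data for these titles would need to be confirmed by rerunning the acquisition cell.

```python
import urllib.parse

title = "Sulfadoxine/pyrimethamine"

# Default behaviour: '/' is considered safe and is left unescaped
default_encoded = urllib.parse.quote(title.replace(" ", "_"))

# With safe="" the slash is percent-encoded and stays inside the article segment
fully_encoded = urllib.parse.quote(title.replace(" ", "_"), safe="")

print(default_encoded)  # Sulfadoxine/pyrimethamine
print(fully_encoded)    # Sulfadoxine%2Fpyrimethamine
```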


### Data Processing and Output

This section processes the collected pageview data and saves it in a structured format for further analysis.

#### Steps:
1. **Convert Data to DataFrames**: 
   The collected pageview data (desktop, mobile, and cumulative views) is converted into `pandas` DataFrames to facilitate easy manipulation and analysis.
   
2. **Save Data to JSON**: 
   Each DataFrame is saved as a JSON file with clear filenames indicating the type of data (desktop, mobile, or cumulative) and the time range. The data is stored in a record-oriented format, with each observation as a separate JSON object.

#### Reproducibility:
Saving the processed data in a standard format like JSON ensures that it can be easily shared and reproduced by others, maintaining consistency and transparency in the workflow.
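
For readers who want to consume the saved files, the sketch below shows the record-oriented layout and one way to turn the API's `YYYYMMDDHH` timestamp into a year-month label. The JSON string is an invented stand-in for file contents, and the `month` key is added only for illustration.

```python
import json
from datetime import datetime

# Invented stand-in for the contents of one output file (record-oriented JSON)
records_json = '[{"article_title":"Bison","timestamp":"2015070100","views":1200}]'

records = json.loads(records_json)
for record in records:
    # The API reports timestamps as YYYYMMDDHH strings
    parsed = datetime.strptime(record["timestamp"], "%Y%m%d%H")
    record["month"] = parsed.strftime("%Y-%m")

print(records[0]["month"])  # 2015-07
```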


In [None]:
#########
#
#    DATA PROCESSING AND OUTPUT
#

# Convert the collected data into pandas DataFrames
# Each DataFrame holds data corresponding to one of the access types: desktop, mobile, or cumulative
df_desktop_views = pd.DataFrame(desktop_dict)  # DataFrame for desktop views
df_mobile_views = pd.DataFrame(mobile_dict)    # DataFrame for combined mobile-web and mobile-app views
df_all_access_views = pd.DataFrame(cumulative_dict)  # DataFrame for cumulative views (all-access)

# Sort DataFrames by article title and timestamp
df_desktop_views = df_desktop_views.sort_values(by=['article_title', 'timestamp'])
df_mobile_views = df_mobile_views.sort_values(by=['article_title', 'timestamp'])
df_all_access_views = df_all_access_views.sort_values(by=['article_title', 'timestamp'])

# Save the DataFrames as JSON files
# The data is saved as JSON, with each observation being an individual record (row)
df_desktop_views.to_json("../data/rare-disease_monthly_desktop_201507-202409.json", orient='records')
df_mobile_views.to_json("../data/rare-disease_monthly_mobile_201507-202409.json", orient='records')
df_all_access_views.to_json("../data/rare-disease_monthly_cumulative_201507-202409.json", orient='records')

# The filenames reflect the type of data and the date range they cover