# Dataset creation from API notebook
The purpose of this notebook is to use the Pageviews API  
from https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews  
to create three datasets of url traffic data from July 2015 through the previous complete month.  
The pages that will be considered for these datasets are located in /Data/ArticleNames.csv.  

The three datasets are described to be:  
```
Dataset1: Get all monthly mobile pageviews per article
Dataset2: Get all monthly desktop pageviews per article
Dataset3: Get all monthly desktop+mobile pageviews per article
```

# Import python packages

In [1]:
import os 
import json
import time
import urllib.parse
import requests
import csv

# Load in article data that we want to process into list of lists
#### Step 1: First, we want to open the comma-separated file that is of the format: Title,URL .
#### Step 2: Next, we will loop through csv file and load data into a list.
#### Step 3: After that, we will remove the header row from list to prevent it showing up later.
#### Step 4: Finally, we will fonfirm data was properly loaded by printing out the first few results and the total number of articles read in.

In [10]:
#Step1: Open csv file that is of the format: Title, URL 
ArticleInfo = []
with open('../Data/ArticleNames.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    #Step 2: Loop through csv file and load data into a list  
    for row in reader: ArticleInfo.append([row[0], row[1]]) 

#Step 3: Remove header row from list to prevent it showing up later
ArticleInfo.pop(0)

#Step 4: Confirm data was loading by printing out the first few results and the total number of articles read in
for i in range(3): print(ArticleInfo[i])
print("Number of articles, should be 1359:", len(ArticleInfo))

['Everything Everywhere All at Once', 'https://en.wikipedia.org/wiki/Everything_Everywhere_All_at_Once']
['All Quiet on the Western Front (2022 film)', 'https://en.wikipedia.org/wiki/All_Quiet_on_the_Western_Front_(2022_film)']
['The Whale (2022 film)', 'https://en.wikipedia.org/wiki/The_Whale_(2022_film)']
Number of articles, should be 1359: 1359


# Helper functions and variables for using the Pageviews API

The code in this cell was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.2 - August 14, 2023. Some of the original code made be modified, however it still falls under the Creative Commons license.

#### The code below simply defines a handful of constants used for requesting data from pageviews API, there are no specific steps here. 

In [12]:
# Below are CONSTANTS used by the API that are relevant to all subsequent API calls
# The REST API 'pageviews' URL - this is the common URL/endpoint for all 'pageviews' API requests
API_REQUEST_PAGEVIEWS_ENDPOINT = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'

# This is a parameterized string that specifies what kind of pageviews request we are going to make
# In this case it will be a 'per-article' based request. The string is a format string so that we can
# replace each parameter with an appropriate value before making the request
API_REQUEST_PER_ARTICLE_PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

# The Pageviews API asks that we not exceed 100 requests per second, we add a small delay to each request
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making a request to the Wikimedia API they ask that you include your email address which will allow them
# to contact you if something happens - such as - your code exceeding rate limits - or some other error 
REQUEST_HEADERS = {
    'User-Agent': '<zbowyer@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}

def request_pageviews_per_article(article_title = None, endpoint_url = API_REQUEST_PAGEVIEWS_ENDPOINT, 
                                  endpoint_params = API_REQUEST_PER_ARTICLE_PARAMS, 
                                  request_template = ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE,
                                  headers = REQUEST_HEADERS):
    '''
    Description:
        Used to request data for a single article, uses defined constants above for params
    Input(s):
        article_title - String
        endpoint_url - String
        endpoint_params - String
        request_template - Dictionary
        headers - Dictionary
    Outputs:
        json_response - JSON string/object
    '''
    # article title can be as a parameter to the call or in the request_template
    if article_title:
        request_template['article'] = article_title

    if not request_template['article']:
        raise Exception("Must supply an article title to make a pageviews request.")

    # Titles are supposed to have spaces replaced with "_" and be URL encoded
    title_replaced = request_template['article'].replace(' ','_')
    article_title_encoded = urllib.parse.quote(title_replaced)
    request_template['article'] = article_title_encoded
    article_title_encoded = (article_title_encoded.replace('/', '%2F'))
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_template)
    #print(request_url)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


## Create first dataset: Get all monthly mobile pageviews per article
#### Step 1: We want to define the json request objects used to retreive data in an abstract manner for both forms of mobile data.
#### Step 2: Then we will loop over each article name that we stored earlier and store the total mobile views for each monthin a dictionary. 
#### Step 3: Next, we will sort that dictionary alphabetically by the name of the article in a descending fashion.
#### Step 4: Finally, we will save the dictionary as a javascript object notation file.

File will be named: academy_monthly_mobile_201501-202309.json

In [13]:
#Step 1: Define the json request objects used to retreive data in an abstract manner for both forms of mobile data.
ARTICLE_PAGEVIEWS_MOBILE_app = {
    "project":     "en.wikipedia.org",
    "access":      "mobile-app",      # this should be changed for the different access types
    "agent":       "user",
    "article":     "",             # this value will be set/changed before each request
    "granularity": "monthly",
    "start":       "2015010100",   # start and end dates need to be set
    "end":         "2023100100"    # this is likely the wrong end date
}
ARTICLE_PAGEVIEWS_MOBILE_web = {
    "project":     "en.wikipedia.org",
    "access":      "mobile-web",      # this should be changed for the different access types
    "agent":       "user",
    "article":     "",             # this value will be set/changed before each request
    "granularity": "monthly",
    "start":       "2015010100",   # start and end dates need to be set
    "end":         "2023100100"    # this is likely the wrong end date
}

#Step 2: Loop over each article name that we stored earlier and store the total mobile views for each monthin a dictionary. 
results = {}
for article in ArticleInfo:
    #Get mobile app and mobile web views
    request1 = request_pageviews_per_article(article[0], request_template=ARTICLE_PAGEVIEWS_MOBILE_app)
    request2 = request_pageviews_per_article(article[0], request_template=ARTICLE_PAGEVIEWS_MOBILE_web)
    
    #Store combined views for each month
    inner_dictionary = {"Views": {}}
    if(request1.__contains__("items") and request2.__contains__("items")):
        for i in range(len(request1['items'])):
            month = request1['items'][i]['timestamp']
            views = request1['items'][i]['views'] + request2['items'][i]['views']
            inner_dictionary["Views"][month] = views
    elif(request2.__contains__("items")):
        for i in range(len(request1['items'])):
            month = request2['items'][i]['timestamp']
            views = request2['items'][i]['views'] 
            inner_dictionary["Views"][month] = views
    elif(request1.__contains__("items")):
         for i in range(len(request1['items'])):
            month = request1['items'][i]['timestamp']
            views = request1['items'][i]['views'] 
            inner_dictionary["Views"][month] = views
    else:
        print("Missing data for", article)
        continue
    
    #Add to final dictionary
    results[article[0]] = inner_dictionary

#Step 3: Sort dictionary alphabetically by the name of the article in a descending fashion
results = sorted(results.items(), key=lambda x:x[0])

#Step 4: Write to file
with open("../Data/academy_monthly_mobile_201501-202309.json", "w") as outfile:
    json.dump(results, outfile, indent=2)

Missing data for ['Victor/Victoria', 'https://en.wikipedia.org/wiki/Victor/Victoria']


## Create Dataset2: Get all monthly desktop pageviews per article
#### Step 1: We want to define the json request object used to retreive data in an abstract manner for desktop data.
#### Step 2: Then we will loop over each article name that we stored earlier and store the total desktop views for each month in a dictionary. 
#### Step 3: Next, we will sort that dictionary alphabetically by the name of the article in a descending fashion.
#### Step 4: Finally, we will save the dictionary as a javascript object notation file.

File will be named: academy_monthly_desktop_201501-202309.json

In [15]:
#Step 1: Define the json request objects used to retreive data in an abstract manner for dekstop data.
ARTICLE_PAGEVIEWS_desktop = {
    "project":     "en.wikipedia.org",
    "access":      "desktop",      # this should be changed for the different access types
    "agent":       "user",
    "article":     "",             # this value will be set/changed before each request
    "granularity": "monthly",
    "start":       "2015010100",   # start and end dates need to be set
    "end":         "2023100100"    # this is likely the wrong end date
}

#Step 2: Loop over each article name that we stored earlier and store the total desktop views for each monthin a dictionary. 
results = {}
for article in ArticleInfo:
    #Get desktop view data
    request1 = request_pageviews_per_article(article[0], request_template=ARTICLE_PAGEVIEWS_desktop)
    
    if(request1.__contains__("items")):
        #Store combined views for each month
        inner_dictionary = {"Views": {}}
        for i in range(len(request1['items'])):
            month = request1['items'][i]['timestamp']
            views = request1['items'][i]['views']
            inner_dictionary["Views"][month] = views
    else:
        print("Missing data for", article)
        continue
    
    #Add to final dictionary
    results[article[0]] = inner_dictionary

#Step 3: Sort dictionary alphabetically by the name of the article in a descending fashion
results = sorted(results.items(), key=lambda x:x[0])

#Step 4: Save the dictionary as a javascript object notation file
with open("../Data/academy_monthly_desktop_201501-202309.json", "w") as outfile:
    json.dump(results, outfile, indent=2)

Missing data for ['Victor/Victoria', 'https://en.wikipedia.org/wiki/Victor/Victoria']


## Create Dataset3: Get all monthly desktop+mobile pageviews per article
#### Step 1: We want to define the json request object used to retreive data in an abstract manner for all data.
#### Step 2: Then we will loop over each article name that we stored earlier and store the total all views for each month in a dictionary. 
#### Step 3: Next, we will sort that dictionary alphabetically by the name of the article in a descending fashion.
#### Step 4: Finally, we will save the dictionary as a javascript object notation file.

File will be named: academy_monthly_cumulative_201501-202309.json

In [17]:
#Step 1: Define the json request objects used to retreive data in an abstract manner for all data.
ARTICLE_PAGEVIEWS_all = {
    "project":     "en.wikipedia.org",
    "access":      "all-access",      # this should be changed for the different access types
    "agent":       "user",
    "article":     "",             # this value will be set/changed before each request
    "granularity": "monthly",
    "start":       "2015010100",   # start and end dates need to be set
    "end":         "2023100100"    # this is likely the wrong end date
}

#Step 2: Loop over each article name that we stored earlier and store the total cumulative views for each monthin a dictionary. 
results = {}
for article in ArticleInfo:
    #Get cumulative view data
    request1 = request_pageviews_per_article(article[0], request_template=ARTICLE_PAGEVIEWS_all)
  
    if(request1.__contains__("items")):
        #Store combined views for each month
        inner_dictionary = {"Views": {}}
        for i in range(len(request1['items'])):
            month = request1['items'][i]['timestamp']
            views = request1['items'][i]['views']
            inner_dictionary["Views"][month] = views
    else:
        print("Missing data for", article)
        continue
    
    #Add to final dictionary
    results[article[0]] = inner_dictionary

#Step 3: Sort dictionary alphabetically by the name of the article in a descending fashion
results = sorted(results.items(), key=lambda x:x[0])

#Step 4: Save the dictionary as a javascript object notation file
with open("../Data/academy_monthly_cumulative_201501-202309.json", "w") as outfile:
    json.dump(results, outfile, indent=2)

Missing data for ['Victor/Victoria', 'https://en.wikipedia.org/wiki/Victor/Victoria']
