# Dataset creation from API notebook
The purpose of this notebook is to use the Pageviews API  
(documented at https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews)  
to create three datasets of article traffic data from July 2015 through the previous complete month.  
The pages considered for these datasets are listed in /Data/ArticleNames.csv.  

The three datasets are:  
```
Dataset1: Get all monthly mobile pageviews per article
Dataset2: Get all monthly desktop pageviews per article
Dataset3: Get all monthly desktop+mobile pageviews per article
```
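
The date range described above (July 2015 through the previous complete month) could be computed instead of hard-coded. The following is a minimal sketch: the function name `last_complete_month_range` is hypothetical, and using the last day of the previous month as the end timestamp is an assumption about how the range should be expressed, not something taken from the original notebook.

```python
from datetime import date, timedelta

def last_complete_month_range(today=None):
    """Return (start, end) timestamps in YYYYMMDDHH form.

    start is fixed at July 1, 2015; end is the last day of the month
    before `today` (assumed convention for "previous complete month").
    """
    today = today or date.today()
    # Stepping back one day from the first of this month lands on the
    # last day of the previous month, including across a year boundary.
    last_day_prev_month = today.replace(day=1) - timedelta(days=1)
    return "2015070100", last_day_prev_month.strftime("%Y%m%d") + "00"

start, end = last_complete_month_range(date(2023, 10, 12))
print(start, end)  # 2015070100 2023093000
```

The two values could then be assigned to the `"start"` and `"end"` fields of a request template before any requests are made.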

# Import Python packages

In [3]:
import os 
import json
import time
import urllib.parse
import requests
import csv

# Load the article data to process into a list of lists
Each inner list has two elements:  
Element 1: Article title  
Element 2: Article URL

In [41]:
#Open csv and put data into list of lists
ArticleInfo = []
with open('../Data/ArticleNames.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        ArticleInfo.append([row[0], row[1]])

#Remove header row
ArticleInfo.pop(0)

#Print out first few results as proof it is working
for i in range(3):
    print(ArticleInfo[i])

#Show number of articles
print("Number of articles, should be 1359:", len(ArticleInfo))

['Everything Everywhere All at Once', 'https://en.wikipedia.org/wiki/Everything_Everywhere_All_at_Once']
['All Quiet on the Western Front (2022 film)', 'https://en.wikipedia.org/wiki/All_Quiet_on_the_Western_Front_(2022_film)']
['The Whale (2022 film)', 'https://en.wikipedia.org/wiki/The_Whale_(2022_film)']
Number of articles, should be 1359: 1359


# Helper functions and variables for using the Pageviews API

The code in this cell was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.2 - August 14, 2023. Some of the original code may have been modified; however, it still falls under the Creative Commons license.

In [42]:
# Below are CONSTANTS used by the API that are relevant to all subsequent API calls
# The REST API 'pageviews' URL - this is the common URL/endpoint for all 'pageviews' API requests
API_REQUEST_PAGEVIEWS_ENDPOINT = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'

# This is a parameterized string that specifies what kind of pageviews request we are going to make
# In this case it will be a 'per-article' based request. The string is a format string so that we can
# replace each parameter with an appropriate value before making the request
API_REQUEST_PER_ARTICLE_PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

# The Pageviews API asks that we not exceed 100 requests per second, so we add a small delay to each request
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# The Wikimedia API asks that each request include your email address so that they can contact you
# if something goes wrong, such as your code exceeding rate limits or causing some other error
REQUEST_HEADERS = {
    'User-Agent': '<zbowyer@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}

# This template is used to map parameter values into the API_REQUEST_PER_ARTICLE_PARAMS portion of an API request. The dictionary has a
# field/key for each of the required parameters. In the example below, we only vary the article name, so the majority of the fields
# can stay constant for each request. Of course, these values *could* be changed if necessary.
ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE = {
    "project":     "en.wikipedia.org",
    "access":      "desktop",      # this should be changed for the different access types
    "agent":       "user",
    "article":     "",             # this value will be set/changed before each request
    "granularity": "monthly",
    "start":       "2015010100",   # start and end dates need to be set
    "end":         "2023040100"    # this is likely the wrong end date
}

def request_pageviews_per_article(article_title = None, 
                                  endpoint_url = API_REQUEST_PAGEVIEWS_ENDPOINT, 
                                  endpoint_params = API_REQUEST_PER_ARTICLE_PARAMS, 
                                  request_template = ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE,
                                  headers = REQUEST_HEADERS):
    '''
    Description:
        Used to request data for a single article, uses defined constants above for params
    Input(s):
        article_title - String
        endpoint_url - String
        endpoint_params - String
        request_template - Dictionary
        headers - Dictionary
    Outputs:
        json_response - JSON string/object
    '''
    # Work on a copy so the caller's template (and the shared default) is never modified
    request_template = dict(request_template)

    # The article title can be passed as a parameter to the call or set in the request_template
    if article_title:
        request_template['article'] = article_title

    if not request_template['article']:
        raise Exception("Must supply an article title to make a pageviews request.")

    # Titles are supposed to have spaces replaced with "_" and be URL encoded
    article_title_encoded = urllib.parse.quote(request_template['article'].replace(' ','_'))
    request_template['article'] = article_title_encoded
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_template)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
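
To make the URL construction above easier to inspect, the following offline sketch builds a request URL without sending it. The helper name `build_pageviews_url` and its default dates are hypothetical. Passing `safe=''` to `urllib.parse.quote` is also an assumption that differs from the function above: it additionally percent-encodes `/`, which keeps a title containing a slash from being read as extra path segments.

```python
import urllib.parse

ENDPOINT = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'
PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

def build_pageviews_url(title, access="desktop",
                        start="2015070100", end="2023093000"):
    """Return the per-article request URL for one title (no network call)."""
    # Replace spaces with underscores, then percent-encode everything else
    # that is not a letter, digit, or one of "_.-~"
    encoded = urllib.parse.quote(title.replace(' ', '_'), safe='')
    return ENDPOINT + PARAMS.format(project="en.wikipedia.org",
                                    access=access,
                                    agent="user",
                                    article=encoded,
                                    granularity="monthly",
                                    start=start,
                                    end=end)

url = build_pageviews_url("The Whale (2022 film)")
print(url)
```

Printing the URL before running the full loop is a cheap way to confirm that the access type and date range are what was intended.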


## Create Dataset1: Get all monthly mobile pageviews per article

File will be named: academy_monthly_mobile_201501-202309.json

In [None]:
# Define params for mobile viewership (app and web) from start date to current time
ARTICLE_PAGEVIEWS_MOBILE_app = {
    "project":     "en.wikipedia.org",
    "access":      "mobile-app",   # mobile app traffic
    "agent":       "user",
    "article":     "",             # this value will be set/changed before each request
    "granularity": "monthly",
    "start":       "2015010100",   # start and end dates need to be set
    "end":         "2023100100"    # this is likely the wrong end date
}
ARTICLE_PAGEVIEWS_MOBILE_web = {
    "project":     "en.wikipedia.org",
    "access":      "mobile-web",   # mobile web traffic
    "agent":       "user",
    "article":     "",             # this value will be set/changed before each request
    "granularity": "monthly",
    "start":       "2015010100",   # start and end dates need to be set
    "end":         "2023100100"    # this is likely the wrong end date
}

#Store API results to dictionary, which will be written to file
results = {}

#Loop over each article of interest
for article in ArticleInfo:
    #Get mobile app and mobile web views
    request1 = request_pageviews_per_article(article[0], request_template=ARTICLE_PAGEVIEWS_MOBILE_app)
    request2 = request_pageviews_per_article(article[0], request_template=ARTICLE_PAGEVIEWS_MOBILE_web)
    
    #Store combined views for each month, matching the two responses by timestamp
    #A failed request (None) or an error response without 'items' contributes no views
    inner_dictionary = {"Views": {}}
    print(article[0])
    for response in (request1, request2):
        for item in (response or {}).get('items', []):
            month = item['timestamp']
            inner_dictionary["Views"][month] = inner_dictionary["Views"].get(month, 0) + item['views']
    
    #Keep months in chronological order (YYYYMMDDHH strings sort correctly as text)
    inner_dictionary["Views"] = dict(sorted(inner_dictionary["Views"].items()))
    
    #Add to final dictionary
    results[article[0]] = inner_dictionary

#Sort dictionary by article name in ascending alphabetical order (dict() keeps it a dictionary)
results = dict(sorted(results.items(), key=lambda x:x[0]))

#Write to file
with open("../Data/academy_monthly_mobile_201501-202309.json", "w") as outfile:
    json.dump(results, outfile, indent=2)

Everything Everywhere All at Once
All Quiet on the Western Front (2022 film)
The Whale (2022 film)
Top Gun: Maverick
Black Panther: Wakanda Forever
Avatar: The Way of Water
Women Talking (film)
Guillermo del Toro's Pinocchio
Navalny (film)
The Elephant Whisperers
An Irish Goodbye
The Boy, the Mole, the Fox and the Horse (film)
RRR (film)
CODA (2021 film)
Dune (2021 film)
The Eyes of Tammy Faye (2021 film)
No Time to Die
The Windshield Wiper
The Long Goodbye (Riz Ahmed album)
The Queen of Basketball
Summer of Soul
Drive My Car (film)
Encanto
West Side Story (2021 film)
Belfast (film)
The Power of the Dog (film)
King Richard (film)
Cruella (film)
Nomadland (film)
The Father (2020 film)
Judas and the Black Messiah
Minari (film)
Mank
Sound of Metal
Ma Rainey's Black Bottom (film)
Promising Young Woman
Tenet (film)
Soul (2020 film)
Another Round (film)
My Octopus Teacher
Colette (2020 film)
If Anything Happens I Love You
Two Distant Strangers
Parasite (2019 film)
Ford v Ferrari
Learning to 

The Ghost and the Darkness
Kolya
The Nutty Professor (1996 film)
Quest (1996 film)
When We Were Kings
Breathing Lessons: The Life and Work of Mark O'Brien
Dear Diary (1996 film)
Braveheart
Apollo 13 (film)
Pocahontas (1995 film)
The Usual Suspects
Restoration (1995 film)
Babe (film)
Sense and Sensibility (film)
Il Postino: The Postman
Dead Man Walking (film)
Leaving Las Vegas
Mighty Aphrodite
Anne Frank Remembered
A Close Shave
Lieberman in Love
One Survivor Remembers
Antonia's Line
Toy Story
Forrest Gump
The Lion King
Speed (1994 film)
Ed Wood (film)
Pulp Fiction
Bullets Over Broadway
The Madness of King George
Legends of the Fall
A Time for Justice
Franz Kafka's It's a Wonderful Life
Maya Lin: A Strong Clear Vision
Burnt by the Sun
Trevor (film)
The Adventures of Priscilla, Queen of the Desert
Bob's Birthday
Blue Sky (1994 film)
Schindler's List
The Piano
Jurassic Park (film)
Philadelphia (film)
The Fugitive (1993 film)
The Age of Innocence (1993 film)
The Wrong Trousers
Belle Epoque

## Create Dataset2: Get all monthly desktop pageviews per article

File will be named: academy_monthly_desktop_201501-202309.json
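
A minimal sketch of how this dataset could be assembled, following the same per-article structure used for Dataset1. The function `collect_desktop_views` and the stub `fake_fetch` are hypothetical, and the view counts in the stub are made up so that the example runs without calling the API.

```python
import json

def collect_desktop_views(titles, fetch):
    """Build {title: {"Views": {timestamp: views}}} from one request per title.

    `fetch` is any callable that takes a title and returns the parsed JSON
    response, or None if the request failed.
    """
    results = {}
    for title in titles:
        response = fetch(title) or {}
        # An error response has no 'items' key, so it yields an empty dict
        views = {item['timestamp']: item['views']
                 for item in response.get('items', [])}
        results[title] = {"Views": views}
    # Sort by article name in ascending order, keeping a dictionary
    return dict(sorted(results.items()))

# Stub standing in for the API (made-up numbers, for illustration only)
def fake_fetch(title):
    return {"items": [{"timestamp": "2015070100", "views": 10},
                      {"timestamp": "2015080100", "views": 12}]}

desktop = collect_desktop_views(["Toy Story", "Braveheart"], fake_fetch)
print(json.dumps(desktop, indent=2))
```

In the notebook, `fetch` could wrap `request_pageviews_per_article` with a template whose `"access"` is `"desktop"` and whose `"end"` date has been updated, and the result could be written with `json.dump` to the file named above.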

## Create Dataset3: Get all monthly desktop+mobile pageviews per article

File will be named: academy_monthly_cumulative_201501-202309.json
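
One way to produce the desktop+mobile totals is to add the two earlier datasets month by month. The sketch below assumes both are dictionaries of the form `{title: {"Views": {timestamp: views}}}`. The function name `combine_views` is hypothetical, and the sample numbers are made up for illustration.

```python
def combine_views(mobile, desktop):
    """Sum mobile and desktop views per article and per month."""
    combined = {}
    # Cover every article that appears in either dataset, in sorted order
    for title in sorted(set(mobile) | set(desktop)):
        totals = {}
        for dataset in (mobile, desktop):
            months = dataset.get(title, {}).get("Views", {})
            for month, views in months.items():
                totals[month] = totals.get(month, 0) + views
        # Timestamps are YYYYMMDDHH strings, so text order is date order
        combined[title] = {"Views": dict(sorted(totals.items()))}
    return combined

# Made-up sample data, for illustration only
mobile = {"Toy Story": {"Views": {"2015070100": 5, "2015080100": 7}}}
desktop = {"Toy Story": {"Views": {"2015070100": 10}},
           "Braveheart": {"Views": {"2015070100": 3}}}

cumulative = combine_views(mobile, desktop)
print(cumulative)
```

Summing by timestamp rather than by list position means a month that is missing from one access type is still counted from the other.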