# English Wikipedia Page Views, 2008 - 2017

This code will retrieve traffic volumes for English-language Wikipedia pages over time.  It will track desktop, mobile app, and mobile web requests for all months for which the data is provided.  It will request these data through the Wikimedia REST APIs, consolidate the data, and, time permitting, present the data as a graph of page counts as they evolve over time.

Import the Python modules required for this code.  Declare and initialize the common Wikimedia REST API endpoints, and the common headers used by the API calls.

In [72]:
import copy
import csv
import datetime
from dateutil.relativedelta import relativedelta
import json
import requests

endpoint_pagecounts = 'https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end}'
endpoint_pageviews  = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}'

headers = {'User-Agent': 'https://github.com/garygr2002', 'From': 'garygr@uw.edu'}

This function will format a date object to include year, month, day and hour.

In [73]:
def format_date_long(date):
    return date.strftime('%Y%m%d%H')



This function will format a date object to include only year and month.

In [74]:
def format_date_short(date):
    return date.strftime('%Y%m')
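
As a quick illustration of the two formatters, the sketch below applies the same `strftime` patterns to an arbitrarily chosen sample date.  The two functions are repeated so that the snippet runs on its own; the sample date is an assumption made only for this example.

```python
import datetime

def format_date_long(date):
    # Year, month, day and hour: the timestamp form used in the API URLs.
    return date.strftime('%Y%m%d%H')

def format_date_short(date):
    # Year and month only: the form used as the by-month dictionary key.
    return date.strftime('%Y%m')

sample = datetime.datetime(2015, 7, 1, 0, 0, 0)  # arbitrary sample date
long_form = format_date_long(sample)
short_form = format_date_short(sample)
print(long_form, short_form)
```

The long form feeds the `start` and `end` URL fields, while the short form later doubles as the source of the `year` and `month` CSV columns.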



This function will call the Wikimedia Page Counts REST API given access type ('desktop-site' or 'mobile-site'), a start date, and an end date.  Dates are assumed to be given on month boundaries.

In [75]:
def get_page_counts(access, start_date, end_date):
    
    params = {'project' : 'en.wikipedia.org',
              'access-site' : access,
              'granularity' : 'monthly',
              'start' : format_date_long(start_date),
              'end' : format_date_long(end_date)
             }
    
    # Pass the common headers so the request identifies its sender.
    api_call = requests.get(endpoint_pagecounts.format(**params), headers=headers)
    return api_call.json()
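
To make the URL construction concrete, here is a minimal sketch that fills in the Page Counts endpoint template without making any network request.  The endpoint string and parameter names are copied from above; the two timestamps are sample values chosen for illustration.

```python
endpoint_pagecounts = ('https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/'
                       'aggregate/{project}/{access-site}/{granularity}/{start}/{end}')

params = {'project': 'en.wikipedia.org',
          'access-site': 'desktop-site',
          'granularity': 'monthly',
          'start': '2008010100',   # sample start: first hour of January 2008
          'end': '2008020100'}     # sample end: first hour of February 2008

# The hyphenated field name {access-site} works here because the values are
# supplied by unpacking a dictionary rather than as literal keyword arguments.
url = endpoint_pagecounts.format(**params)
print(url)
```

Printing the URL before sending it is a cheap way to check the parameters, since a malformed path is the most likely cause of an error response.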



This function will call the Wikimedia Page Views REST API given access type ('desktop', 'mobile-app', or 'mobile-web'), a start date, and an end date.  Dates are assumed to be given on month boundaries.

In [76]:
def get_page_views(access, start_date, end_date):
    
    params = {'project' : 'en.wikipedia.org',
              'access' : access,
              'agent' : 'user',
              'granularity' : 'monthly',
              'start' : format_date_long(start_date),
              'end' : format_date_long(end_date)
             }
    
    # Pass the common headers so the request identifies its sender.
    api_call = requests.get(endpoint_pageviews.format(**params), headers=headers)
    return api_call.json()



Declare and initialize the column headers for the CSV file that this code will produce.

In [77]:
pagecount_all_views = 'pagecount_all_views'
pagecount_desktop_views = 'pagecount_desktop_views'
pagecount_mobile_views = 'pagecount_mobile_views'
pageview_all_views = 'pageview_all_views'
pageview_desktop_views = 'pageview_desktop_views'
pageview_mobile_views = 'pageview_mobile_views'


Declare and initialize a lookup dictionary that will allow CSV column headers to be indexed by a concatenation of REST API type and access type.

In [78]:
key_lookup_dictionary = {'pagecounts/desktop-site' : pagecount_desktop_views,
                        'pagecounts/mobile-site' : pagecount_mobile_views,
                        'pageviews/desktop' : pageview_desktop_views,
                        'pageviews/mobile-app' : pageview_mobile_views,
                        'pageviews/mobile-web' : pageview_mobile_views}


Declare and initialize an initial traffic counts dictionary that will show zero requests for each of the page accesses that this code will track.  Declare and initialize an initial traffic dictionary that is empty.  We will populate it with traffic counts indexed by month.

In [79]:
initial_traffic = {pagecount_all_views : 0,
                  pagecount_desktop_views : 0,
                  pagecount_mobile_views : 0,
                  pageview_all_views : 0,
                  pageview_desktop_views : 0,
                  pageview_mobile_views : 0}

initial_traffic_dictionary = { }
# Copy rather than alias, so the empty initial dictionary is never modified.
traffic_dictionary = dict(initial_traffic_dictionary)


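This function will extract and return the page count from the JSON returned by a Wikimedia REST API.  The Page Views API reports the value under the key 'views'; the legacy Page Counts API reports it under 'count'.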

In [80]:
def access_count(api_name, api_data):
    
    if (api_name == 'pageviews'):
        key = 'views'
    else:
        key = 'count'
    return api_data.get('items', None)[0][key]
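
The sketch below exercises `access_count` on two hand-made payloads.  These payloads are placeholders, not real API output: they contain only the fields this code reads (`items`, plus `count` or `views`), and the numbers are invented for the example.

```python
def access_count(api_name, api_data):
    # Same logic as above: pick the field name that the given API uses.
    if (api_name == 'pageviews'):
        key = 'views'
    else:
        key = 'count'
    return api_data.get('items', None)[0][key]

# Invented placeholder payloads, reduced to the fields that are actually read.
sample_pagecounts = {'items': [{'count': 100}]}
sample_pageviews = {'items': [{'views': 200}]}

legacy_value = access_count('pagecounts', sample_pagecounts)
current_value = access_count('pageviews', sample_pageviews)
print(legacy_value, current_value)
```

Note that a response without an `items` list would make this function raise a `TypeError`, so a failed request surfaces immediately rather than being recorded as a zero.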



This function will initialize the by-month traffic data dictionary.

In [81]:
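def clear_traffic_dictionary():
    # Empty the shared dictionary in place.  A plain assignment here would only
    # bind a new local name and leave the module-level dictionary untouched.
    traffic_dictionary.clear()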



This function will update the by-month traffic dictionary.  It looks up the entry for the month of the given start date, starting from a zeroed entry if none exists yet.  It then adds the count from the API response to the column selected by the API name and access type, and stores the entry back under the month key.

In [82]:
def update_traffic_dictionary(start_date, api_name, access, api_data):
    
    date_key = format_date_short(start_date)
    traffic = traffic_dictionary.get(date_key, copy.deepcopy(initial_traffic))
    access_key = key_lookup_dictionary['{}/{}'.format(api_name, access)]
    count = traffic.get(access_key, 0)
    count += access_count(api_name, api_data)
    traffic[access_key] = count
    traffic_dictionary[date_key] = traffic
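
Because both 'mobile-app' and 'mobile-web' map to the same column, their counts accumulate in one place.  The following condensed sketch shows that behaviour with invented counts; `add_count`, `key_lookup`, and `traffic_by_month` are hypothetical names used only here, standing in for the fuller definitions above.

```python
import copy

# Reduced versions of the structures above, limited to the Page Views columns.
initial_traffic = {'pageview_desktop_views': 0, 'pageview_mobile_views': 0}
key_lookup = {'pageviews/desktop': 'pageview_desktop_views',
              'pageviews/mobile-app': 'pageview_mobile_views',
              'pageviews/mobile-web': 'pageview_mobile_views'}
traffic_by_month = {}

def add_count(date_key, api_name, access, count):
    # Hypothetical condensed form of update_traffic_dictionary: fetch or
    # create the month's entry, then add the count to the mapped column.
    traffic = traffic_by_month.setdefault(date_key, copy.deepcopy(initial_traffic))
    traffic[key_lookup['{}/{}'.format(api_name, access)]] += count

add_count('201507', 'pageviews', 'mobile-app', 10)   # invented counts
add_count('201507', 'pageviews', 'mobile-web', 30)
add_count('201507', 'pageviews', 'desktop', 50)
print(traffic_by_month)
```

The mobile column ends up holding the sum of the app and web counts, which is why the order of the API calls does not matter.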



This function updates traffic totals for all entries in the traffic dictionary (given by month), by summing desktop and mobile views, and updating the 'all views' member.

In [83]:
def calculate_traffic_totals():
    
    for key, val in traffic_dictionary.items():
        val[pagecount_all_views] = val[pagecount_desktop_views] + val[pagecount_mobile_views]
        val[pageview_all_views] = val[pageview_desktop_views] + val[pageview_mobile_views]
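
As a small worked example of the totals step, the sketch below applies the same two sums to a single month of invented data.  The string keys match the column headers declared earlier.

```python
# One month of invented counts; the 'all views' members start at zero.
traffic_dictionary = {
    '201507': {'pagecount_all_views': 0,
               'pagecount_desktop_views': 6,
               'pagecount_mobile_views': 4,
               'pageview_all_views': 0,
               'pageview_desktop_views': 5,
               'pageview_mobile_views': 3}}

for val in traffic_dictionary.values():
    val['pagecount_all_views'] = val['pagecount_desktop_views'] + val['pagecount_mobile_views']
    val['pageview_all_views'] = val['pageview_desktop_views'] + val['pageview_mobile_views']

print(traffic_dictionary['201507'])
```

Months that one API does not cover simply keep zeros in that API's columns, so its total for those months is zero as well.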



This function calls an API (either Page Counts or Page Views) for a given start date, and number of iterations.  Iterations define the number of months requested.

In [84]:
def perform_api_call(api_name, function, access, initial_start_date, iterations):

    start_date = initial_start_date
    traffic = { }
    for i in range(0, iterations):
        # print('Request starting \'{}\'...'.format(start_date.strftime('%Y%m%d')))
        end_date = start_date + relativedelta(months=+1)
        api_data = function(access, start_date, end_date)
        update_traffic_dictionary(start_date, api_name, access, api_data)
        traffic[start_date.strftime('%Y%m%d')] = api_data
        start_date = end_date
        
    #
    #  Note: At this stage we have traffic data for the API name for all the months
    #  that have been requested.  Here we open a JSON file for the API name, access
    #  type, and date range.  We then write the file.
    #
        
    with open('{}_{}_{}_{}.json'.format(api_name,
                                        access,
                                        format_date_short(initial_start_date),
                                        format_date_short(end_date)),
              'w') as outfile:  
        json.dump(traffic, outfile)
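
To see which month windows the loop requests, the sketch below generates the (start, end) pairs without calling any API.  It swaps `relativedelta` for plain standard-library arithmetic so that it runs with no third-party package; `add_one_month` and `month_windows` are hypothetical helpers, and the arithmetic assumes dates on the first of a month, as this code does throughout.

```python
import datetime

def add_one_month(date):
    # Hypothetical stand-in for `date + relativedelta(months=+1)`.
    # Only safe for dates on the first of a month.
    year = date.year + date.month // 12
    month = date.month % 12 + 1
    return date.replace(year=year, month=month)

def month_windows(initial_start_date, iterations):
    # Mirror the loop above: each window ends where the next one starts.
    start_date = initial_start_date
    windows = []
    for _ in range(iterations):
        end_date = add_one_month(start_date)
        windows.append((start_date.strftime('%Y%m%d%H'),
                        end_date.strftime('%Y%m%d%H')))
        start_date = end_date
    return windows

# 104 months starting January 2008, matching the legacy desktop settings.
windows = month_windows(datetime.datetime(2008, 1, 1), 104)
print(windows[0])
print(windows[-1])
```

The last window starts in August 2016 and ends on the first hour of September 2016, which is the end date that appears in the legacy desktop JSON file name.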



Declare and initialize common API names and access keys.

In [85]:
api_name_current = 'pageviews'
api_name_legacy = 'pagecounts'

key_access = 'access'
key_api_name = 'apiname'
key_function = 'function'
key_months = 'months'
key_start_date = 'startdate'


Declare and initialize parameter grouping dictionaries that include API name, function to call, access string, start date, and number of requested months.

In [86]:
legacy_mobile = {key_api_name : api_name_legacy,
                 key_function : get_page_counts,
                 key_access : 'mobile-site',
                 key_start_date : datetime.datetime(2014, 10, 1, 0, 0, 0),
                 key_months : 23 } # 1 }

legacy_desktop = {key_api_name : api_name_legacy,
                 key_function : get_page_counts,
                 key_access : 'desktop-site',
                 key_start_date : datetime.datetime(2008, 1, 1, 0, 0, 0),
                 key_months : 104 } # 1 }

current_desktop = {key_api_name : api_name_current,
                   key_function : get_page_views,
                   key_access : 'desktop',
                   key_start_date : datetime.datetime(2015, 7, 1, 0, 0, 0),
                   key_months : 26 } # 1 }

current_mobile_app = {key_api_name : api_name_current,
                      key_function : get_page_views,
                      key_access : 'mobile-app',
                      key_start_date : datetime.datetime(2015, 7, 1, 0, 0, 0),
                      key_months : 26 } # 1 }

current_mobile_web = {key_api_name : api_name_current,
                      key_function : get_page_views,
                      key_access : 'mobile-web',
                      key_start_date : datetime.datetime(2015, 7, 1, 0, 0, 0),
                      key_months : 26 } # 1 }


Declare and initialize a list of the parameter grouping dictionaries.  Clear the traffic data dictionary, then perform the API calls for each parameter group in turn.

In [87]:
access_types = [legacy_desktop, legacy_mobile, current_desktop, current_mobile_web, current_mobile_app]

clear_traffic_dictionary()
for access_type in access_types:
    print('Starting a new one...')
    perform_api_call(access_type[key_api_name],
                     access_type[key_function],
                     access_type[key_access],
                     access_type[key_start_date],
                     access_type[key_months])

    

Starting a new one...
Starting a new one...
Starting a new one...
Starting a new one...
Starting a new one...


Now calculate the traffic totals by summing desktop and mobile access.  Open the output CSV file, create a CSV writer, and write the header row.  Sort the traffic dictionary keys by date, and for each key (in ascending order), write a row that gives the year, the month, all pagecount views, pagecount desktop views, pagecount mobile views, all pageview views, pageview desktop views, and pageview mobile views.  We are done!

In [88]:
calculate_traffic_totals()
with open('en-wikipedia_traffic_200801-201709.csv', 'w', newline='') as csvfile:
    
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(['year', 'month', pagecount_all_views, pagecount_desktop_views, pagecount_mobile_views,
                        pageview_all_views, pageview_desktop_views, pageview_mobile_views])
    
    sorted_keys = sorted(traffic_dictionary)
    for key in sorted_keys:
        
        item = traffic_dictionary[key]
        csvwriter.writerow([key[:4],
                            key[4:],
                            item[pagecount_all_views],
                            item[pagecount_desktop_views],
                            item[pagecount_mobile_views],
                            item[pageview_all_views],
                            item[pageview_desktop_views],
                            item[pageview_mobile_views]])
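
As a check on the row layout, the sketch below writes two months of invented data to an in-memory buffer instead of a file, then reads the rows back.  It keeps a single count column for brevity; the key slicing and the sort are the same as above.

```python
import csv
import io

# Invented data, deliberately out of order to show the effect of sorting.
traffic_dictionary = {'201508': {'pageview_all_views': 7},
                      '201507': {'pageview_all_views': 5}}

buffer = io.StringIO(newline='')   # in-memory stand-in for the CSV file
csvwriter = csv.writer(buffer)
csvwriter.writerow(['year', 'month', 'pageview_all_views'])
for key in sorted(traffic_dictionary):
    # 'YYYYMM' keys split into a four-character year and a two-character month.
    csvwriter.writerow([key[:4], key[4:], traffic_dictionary[key]['pageview_all_views']])

buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows)
```

Sorting the 'YYYYMM' strings works as a date sort because the keys are fixed-width with the year first.  Values read back from a CSV are strings, so counts need converting with `int` before any arithmetic.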
