# English Wikipedia page views, 2008 - 2017

For this assignment, the goal is to analyze traffic on English Wikipedia over time, and then to document the process, the resulting dataset, and the visualization according to the best practices for open research outlined in class.

### Example API request
You can use this example API request as a starting point for building your API queries. Note that the [Legacy Pagecounts API](https://wikitech.wikimedia.org/wiki/Analytics/AQS/Legacy_Pagecounts) has a slightly different schema than the [Pageviews API](https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews) shown here.

This sample API request would get you all pageviews by web crawlers on the mobile website for English Wikipedia during the month of September, 2017.
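A request matching that description could be assembled as in the following minimal sketch. The `build_url` helper, the `daily` granularity, and the exact `start`/`end` values are assumptions chosen for illustration; the endpoint template is the Pageviews one used in Step 1 of this notebook.

```python
# Pageviews endpoint template (the same template used in Step 1 of this notebook)
ENDPOINT = ('https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/'
            '{project}/{access}/{agent}/{granularity}/{start}/{end}')

def build_url(params):
    """Fill the endpoint template with path parameters (hypothetical helper)."""
    return ENDPOINT.format(**params)

# Assumed values: crawler ("spider") traffic on the mobile website,
# September 1 through September 30, 2017
example_params = {'project': 'en.wikipedia.org',
                  'access': 'mobile-web',
                  'agent': 'spider',
                  'granularity': 'daily',
                  'start': '2017090100',
                  'end': '2017093000'}

example_url = build_url(example_params)
print(example_url)
```

The resulting URL can then be passed to `requests.get` together with identifying headers, as the cells in Step 1 do.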

## Step 1: Data acquisition

#### Collect a dataset of monthly traffic on English Wikipedia from January 1, 2008 through September 30, 2017.

Using the Legacy Pagecounts API, from January 2008 through July 2016:
* By Desktop
    -- pagecounts_desktop-site_200801-201607.json
* By Mobile
    -- pagecounts_mobile-site_200801-201607.json

Variables:
* pcStart = 2008010100
* pcEnd = 2016080100
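
The end value is the first day of the month after the last month wanted, which is why data through July 2016 uses `2016080100`. As a small sketch of that rule, a hypothetical helper (`month_after` is not part of the original notebook) could derive the end timestamp from the last month to include:

```python
def month_after(yyyymm):
    """Return the YYYYMM0100 timestamp for the first day of the month after yyyymm."""
    year, month = int(yyyymm[:4]), int(yyyymm[4:6])
    if month == 12:
        # December rolls over to January of the next year
        year, month = year + 1, 1
    else:
        month += 1
    return '{:04d}{:02d}0100'.format(year, month)

pc_end = month_after('201607')  # last Legacy Pagecounts month in this notebook
pv_end = month_after('201709')  # last Pageviews month in this notebook
print(pc_end, pv_end)
```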
       

In [1]:
# Required libraries
import requests
import json

In [2]:
# API reference: https://wikimedia.org/api/rest_v1/#/Legacy_data
# Legacy Pagecounts endpoint; the path parameters are filled in for each request
endpoint = 'https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end}'
headers={'User-Agent' : 'https://github.com/abhishekanand', 'From' : 'anand1@uw.edu'}

In [3]:
# Get pagecounts from the Legacy API (user + spider traffic): desktop site
OUTPUT_File = 'pagecounts_desktop-site_200801-201607.json'
params = {'project' : 'en.wikipedia.org',
            'access-site' : 'desktop-site',
            'granularity' : 'monthly',
            'start' : '2008010100',
            'end' : '2016080100'  # first day of the following month, so the last full month is included
            }

# Send the identifying headers defined above with the request
api_call = requests.get(endpoint.format(**params), headers=headers)
#print(endpoint.format(**params))
response = api_call.json()
#print(response) # Print response from API call

# https://stackoverflow.com/questions/17043860/python-dump-dict-to-json-file
with open(OUTPUT_File, 'w') as fp:
    json.dump(response, fp)
    

In [4]:
# Get pagecounts from the Legacy API (user + spider traffic): mobile site
OUTPUT_File = 'pagecounts_mobile-site_200801-201607.json'
params = {'project' : 'en.wikipedia.org',
            'access-site' : 'mobile-site',
            'granularity' : 'monthly',
            'start' : '2008010100',
            'end' : '2016080100'  # first day of the following month, so the last full month is included
            }

# Send the identifying headers defined above with the request
api_call = requests.get(endpoint.format(**params), headers=headers)
#print(endpoint.format(**params))
response = api_call.json()
#print(response) # Print response from API call

# https://stackoverflow.com/questions/17043860/python-dump-dict-to-json-file
with open(OUTPUT_File, 'w') as fp:
    json.dump(response, fp)
    

Using the Pageviews API, from July 2015 through September 2017:
* By Desktop
    -- pageviews_desktop-site_201501-201711.json
* By Mobile web
    -- pageviews_mobileweb-site_201501-201711.json
* By Mobile App
    -- pageviews_mobileapp-site_201501-201711.json

Variables:
* pvStart = 2015070100
* pvEnd = 2017100100

In [5]:

endpoint = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}'
headers={'User-Agent' : 'https://github.com/abhishekanand', 'From' : 'anand1@uw.edu'}

# Using the Pageviews API, from July 2015 through September 2017
#pvStart = 2015070100
#pvEnd = 2017100100


In [6]:
# Get pageviews from the Pageviews API (user traffic only): desktop
OUTPUT_File = 'pageviews_desktop-site_201501-201711.json'
params = {'project' : 'en.wikipedia.org',
            'access' : 'desktop',
            'agent' : 'user',
            'granularity' : 'monthly',
            'start' : '2015070100',
            'end' : '2017100100'  # first day of the following month, so the last full month is included
            }

# Send the identifying headers defined above with the request
api_call = requests.get(endpoint.format(**params), headers=headers)
response = api_call.json()
# print(response) # Print response from API call

# https://stackoverflow.com/questions/17043860/python-dump-dict-to-json-file
with open(OUTPUT_File, 'w') as fp:
    json.dump(response, fp)
    

In [7]:
# Get pageviews from the Pageviews API (user traffic only): mobile app
OUTPUT_File = 'pageviews_mobileapp-site_201501-201711.json'
params = {'project' : 'en.wikipedia.org',
            'access' : 'mobile-app',
            'agent' : 'user',
            'granularity' : 'monthly',
            'start' : '2015070100',
            'end' : '2017100100'  # first day of the following month, so the last full month is included
            }

# Send the identifying headers defined above with the request
api_call = requests.get(endpoint.format(**params), headers=headers)
response = api_call.json()
# print(response) # Print response from API call

# https://stackoverflow.com/questions/17043860/python-dump-dict-to-json-file
with open(OUTPUT_File, 'w') as fp:
    json.dump(response, fp)

In [8]:
# Get pageviews from the Pageviews API (user traffic only): mobile web
OUTPUT_File = 'pageviews_mobileweb-site_201501-201711.json'
params = {'project' : 'en.wikipedia.org',
            'access' : 'mobile-web',
            'agent' : 'user',
            'granularity' : 'monthly',
            'start' : '2015070100',
            'end' : '2017100100'  # first day of the following month, so the last full month is included
            }

# Send the identifying headers defined above with the request
api_call = requests.get(endpoint.format(**params), headers=headers)
response = api_call.json()
# print(response) # Print response from API call

# https://stackoverflow.com/questions/17043860/python-dump-dict-to-json-file
with open(OUTPUT_File, 'w') as fp:
    json.dump(response, fp)

## Step 2: Data processing

For data collected from the Pageviews API, combine the monthly values for mobile-app and mobile-web to create a total mobile traffic count for each month.

In [None]:
# Code 
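
One possible way to combine the two mobile series is sketched here. It assumes each saved Pageviews response is a dictionary with an `items` list whose entries carry `timestamp` and `views` keys; the `combine_mobile` helper and the sample numbers are hypothetical and only illustrate the summation.

```python
def combine_mobile(app_response, web_response):
    """Sum mobile-app and mobile-web views for each timestamp."""
    totals = {}
    for response in (app_response, web_response):
        for item in response['items']:
            ts = item['timestamp']
            totals[ts] = totals.get(ts, 0) + item['views']
    return totals

# Made-up numbers for illustration only (not real traffic data)
sample_app = {'items': [{'timestamp': '2015070100', 'views': 10},
                        {'timestamp': '2015080100', 'views': 20}]}
sample_web = {'items': [{'timestamp': '2015070100', 'views': 100},
                        {'timestamp': '2015080100', 'views': 200}]}

mobile_totals = combine_mobile(sample_app, sample_web)
print(mobile_totals)
```

With the real files, the two arguments would be the dictionaries loaded with `json.load` from the mobile-app and mobile-web JSON files saved in Step 1.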

For all data, split the timestamp value into a four-digit year (YYYY) and a two-digit month (MM), and discard the day and hour (DDHH).
Combine all data into a single CSV file with the following headers:

| Column | Value |
| --- | --- |
| year | YYYY |
| month | MM |
| pagecount_all_views | num_views |
| pagecount_desktop_views | num_views |
| pagecount_mobile_views | num_views |
| pageview_all_views | num_views |
| pageview_desktop_views | num_views |
| pageview_mobile_views | num_views |


In [None]:
#Code
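
The timestamp split and the CSV layout described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's final code: each input is assumed to be a `{timestamp: count}` dictionary already extracted from the JSON files, the "all" columns are assumed to be desktop plus mobile, and months missing from a source are assumed to count as 0. The helper names and sample counts are hypothetical.

```python
import csv
import io

COLUMNS = ['year', 'month',
           'pagecount_all_views', 'pagecount_desktop_views', 'pagecount_mobile_views',
           'pageview_all_views', 'pageview_desktop_views', 'pageview_mobile_views']

def split_timestamp(timestamp):
    """Split a YYYYMMDDHH timestamp into (YYYY, MM), discarding DDHH."""
    return timestamp[:4], timestamp[4:6]

def build_rows(pc_desktop, pc_mobile, pv_desktop, pv_mobile):
    """Merge four {timestamp: count} dicts into one row per month.

    Assumption: a month absent from a source is recorded as 0.
    """
    timestamps = sorted(set(pc_desktop) | set(pc_mobile) | set(pv_desktop) | set(pv_mobile))
    rows = []
    for ts in timestamps:
        year, month = split_timestamp(ts)
        pcd, pcm = pc_desktop.get(ts, 0), pc_mobile.get(ts, 0)
        pvd, pvm = pv_desktop.get(ts, 0), pv_mobile.get(ts, 0)
        rows.append([year, month, pcd + pcm, pcd, pcm, pvd + pvm, pvd, pvm])
    return rows

def write_csv(rows, fp):
    """Write the header and the rows to an open text file object."""
    writer = csv.writer(fp, lineterminator='\n')
    writer.writerow(COLUMNS)
    writer.writerows(rows)

# Made-up counts for illustration only
rows = build_rows({'2016070100': 5}, {'2016070100': 3},
                  {'2016070100': 4, '2016080100': 6},
                  {'2016070100': 2, '2016080100': 1})
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

To produce a file instead of an in-memory buffer, `write_csv` can be given a file object opened with `open(path, 'w', newline='')`, where `path` is whatever output file name the assignment requires.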