# Human Centered Data Science: Assignment A1
### University of Washington, Fall 2017 
#### October 19, 2017 
Alyssa Goodrich

## Objective:
The goal of this notebook is to construct, analyze, and publish a dataset of monthly traffic on English Wikipedia from January 1 2008 through September 30 2017.


## Step 1: Data Acquisition
Our first step is to access the data from the API. In this case we must use two separate endpoints because one provides recent data and the other provides legacy data.

NOTE: The two data sets are not directly comparable. The legacy "pagecounts" data includes traffic from scrapers/crawlers, whereas the pageviews data collected here includes only user-generated traffic.

Please see the README file for more information on the API, endpoints, and license.

The goal of this step is to access the data from the API and save it to JSON files that we can later use to extract the data we need and put it in a format we can analyze.

**The first step is to access data from the pageviews endpoint and save it to JSON files. There will be one file for each of the desktop, mobile-app, and mobile-web access points.**

In [19]:
#This code will access data from wikimedia api (referenced in endpoint below), and save it to a JSON file
#This code was adapted from code written by Jonathan Morgan. It was accessed 10/13/2017 at the site: http://hcds.yuvi.in/user/alyssacolony/notebooks/data-512-a1/hcds-a1-data-curation.ipynb

import requests
import json

endpoint = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}'

headers={'User-Agent' : 'https://github.com/alyssacolony', 'From' : 'alcgood@uw.edu'}

# Choose start and end date. Pageviews data is available beginning July 1, 2015. Earlier data is available from the legacy pagecounts API 'https://wikimedia.org/api/rest_v1/#!/Pagecounts_data_(legacy)/get_metrics_legacy_pagecounts_aggregate_project_access_site_granularity_start_end'
# Dates should be formatted YYYYMMDDHH, so July 1, 2015 is formatted as 2015070100
end_date = '2017100100' # First day of current month
start_date = '2015070100' # Start date is when pageviews data becomes available

# Set a variable that determines which access type we want to collect data for
#Options are: 'desktop', 'mobile-app','mobile-web'
access = ['desktop', 'mobile-app','mobile-web'] 
# Set a variable that determines which agent type we want to collect data for 
#options are 'user', 'spider', 'all-agents'
agent = ['user'] 

# Create the base parameter dictionary; getData() below overwrites 'access' and 'agent' for each request
params = {'project' : 'en.wikipedia.org',
          'access' : access[0],
          'agent' : agent[0], #We choose only user
          'granularity' : 'monthly',
          'start' : start_date,
          'end' : end_date}

#Defines a function that gets data based on specified parameters and saves to a file
#Filename structured as "pageViews_AccessPoint_StartDate_EndDate.json", with dates given as YYYYMM
def getData():
    #Requests monthly data for every agent/access combination and saves each response to its own JSON file
    #NOTE: the filename does not include the agent, so listing more than one agent would overwrite files
    for thisAgent in agent:
        for thisAccess in access:
            params['agent'] = thisAgent
            params['access'] = thisAccess
            # Pass the identifying headers defined above along with the request
            api_call = requests.get(endpoint.format(**params), headers=headers)
            response = api_call.json()
            filename = 'pageViews_' + params['access'] + '_' + start_date[0:6] + '_' + end_date[0:6] + '.json'
            with open(filename, 'w') as outfile:
                json.dump(response, outfile)

# Run function to collect data
getData()
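
To make the request pattern above concrete, the sketch below shows how the endpoint template expands into a request URL for each access type. The helper name `build_pageviews_url` is hypothetical (it is not part of this notebook), and it only builds the string; it makes no network call.

```python
# Minimal sketch: expand the pageviews endpoint template into a request URL.
# build_pageviews_url is a hypothetical helper, not part of the original notebook.
ENDPOINT = ('https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/'
            '{project}/{access}/{agent}/{granularity}/{start}/{end}')

def build_pageviews_url(access, agent='user', start='2015070100', end='2017100100'):
    """Return the request URL for one access type without calling the API."""
    params = {'project': 'en.wikipedia.org',
              'access': access,
              'agent': agent,
              'granularity': 'monthly',
              'start': start,
              'end': end}
    return ENDPOINT.format(**params)

# Print one URL per access type used in this notebook
for access_type in ['desktop', 'mobile-app', 'mobile-web']:
    print(build_pageviews_url(access_type))
```

Printing the URLs before sending any requests is a cheap way to catch a misspelled parameter or a badly formatted date.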

**The next step is to access data from the pagecounts endpoint and save it to JSON files. There will be one file for each of the desktop-site and mobile-site access points.**

In [20]:
#This code will access data from wikimedia api (referenced in endpoint below), and save it to a JSON file
#This code was adapted from code written by Jonathan Morgan. It was accessed 10/13/2017 at the site: http://hcds.yuvi.in/user/alyssacolony/notebooks/data-512-a1/hcds-a1-data-curation.ipynb

endpoint = 'https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end}'
headers={'User-Agent' : 'https://github.com/alyssacolony', 'From' : 'alcgood@uw.edu'}

# Choose start and end date. This legacy endpoint supplies the data for the period before the pageviews data begins. API documentation: 'https://wikimedia.org/api/rest_v1/#!/Pagecounts_data_(legacy)/get_metrics_legacy_pagecounts_aggregate_project_access_site_granularity_start_end'
# Dates should be formatted YYYYMMDDHH, so January 1, 2008 is formatted as 2008010100
end_date = '2016073100' # Last date this data is available
start_date = '2008010100' # Start date is earliest date data is available

# Set a variable that determines which access type we want to collect data for
#Options are: 'desktop-site', 'mobile-site'
access = ['desktop-site', 'mobile-site'] 

# Create the base parameter dictionary; getData() below overwrites 'access-site' for each request
params = {'project' : 'en.wikipedia.org',
          'access-site' : access[0],
          'granularity' : 'monthly',
          'start' : start_date,
          'end' : end_date}

#Defines a function that gets data based on specified parameters and saves to a file
#Filename structured as "PageCounts_AccessSite_StartDate_EndDate.json", with dates given as YYYYMM
def getData():
    #Requests monthly data for every access site and saves each response to its own JSON file
    for thisAccess in access:
        params['access-site'] = thisAccess
        # Pass the identifying headers defined above along with the request
        api_call = requests.get(endpoint.format(**params), headers=headers)
        response = api_call.json()
        filename = 'PageCounts_' + params['access-site'] + '_' + start_date[0:6] + '_' + end_date[0:6] + '.json'
        with open(filename, 'w') as outfile:
            json.dump(response, outfile)

# Run function to collect data
getData()
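
Before processing, it can help to confirm what a saved response contains. The sketch below assumes the shape that the processing code in Step 2 relies on: a top-level `items` list whose records carry `access-site`, `timestamp`, and `count`. The helper `extract_counts` is hypothetical, and the numbers in `sample` are made-up placeholders, not real traffic figures.

```python
import json

def extract_counts(payload):
    """Return (timestamp, count) pairs from a pagecounts-style payload."""
    if 'items' not in payload:
        # A payload without 'items' suggests the request did not succeed
        raise ValueError('payload has no "items" list')
    return [(item['timestamp'], item['count']) for item in payload['items']]

# Placeholder payload; the counts are invented for illustration
sample = {'items': [
    {'access-site': 'desktop-site', 'timestamp': '2008010100', 'count': 100},
    {'access-site': 'desktop-site', 'timestamp': '2008020100', 'count': 120},
]}

# Round-trip through JSON text to mimic saving and reloading a file
reloaded = json.loads(json.dumps(sample))
print(extract_counts(reloaded))
```

Raising an error when `items` is missing keeps a failed request from silently producing an empty data set later on.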

## Step 2: Data processing
In this step we will process the raw data to put it in a format that we can analyze.  

Key steps to processing include:  
1) Extract the data we need from the JSON files and put it into a combined dictionary  
2) Format and name all necessary data fields to comply with the required naming conventions, and combine fields where needed, particularly mobile-app and mobile-web, to get a total mobile views number for the pageviews data  
3) Write a CSV file with all required fields: date, year, month, pagecount_all_views, pagecount_desktop_views, pagecount_mobile_views, pageview_all_views, pageview_desktop_views, pageview_mobile_views
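
As a small illustration of steps 1 and 2, the sketch below merges records from different sources into one row per timestamp and then sums the two mobile fields. The input values are made-up placeholders, and the variable names `records` and `rows` are chosen for illustration only.

```python
from collections import defaultdict

# Placeholder records as (timestamp, field name, value); the values are invented
records = [
    ('2015070100', 'user_desktop', 50),
    ('2015070100', 'user_mobile-app', 3),
    ('2015070100', 'user_mobile-web', 30),
    ('2015070100', 'all-agents_desktop', 70),
]

# Step 1: group every field under its timestamp
rows = defaultdict(dict)
for timestamp, field, value in records:
    rows[timestamp][field] = value

# Step 2: total mobile views = mobile-app views + mobile-web views
for row in rows.values():
    if 'user_mobile-app' in row and 'user_mobile-web' in row:
        row['pageview_mobile_views'] = row['user_mobile-app'] + row['user_mobile-web']

print(rows['2015070100']['pageview_mobile_views'])  # prints 33
```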


In [21]:
import glob
import json
from collections import defaultdict
import operator
from datetime import datetime

# Get list of files to extract from
files =  glob.glob('*.json')
files


# Step 1: The code block below combines our JSON files into a single dictionary

# NOTE TO OLIVER: I had originally chosen names that would specify "all agent" or "user only" in an attempt to consider users 
# who are not familiar with the implications of the pageview vs. pagecount API. I later discovered the required naming conventions
# and adjusted the names in a later code block. That is why there are extra steps.

All_PageViews = defaultdict(dict)

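for file in files:
    # The with-block closes each file once it has been read
    with open(file, 'r') as input_file:
        json_decode = json.load(input_file)
    for item in json_decode['items']:
        name = None
        # Pageviews records carry 'agent' and 'access'; legacy pagecounts records carry 'access-site'
        if item.get('agent') is not None and item.get('access') is not None:
            name = str(item.get('agent')) + "_" + str(item.get('access'))
        elif item.get('access-site') == 'desktop-site':
            name = "all-agents_desktop"
        elif item.get('access-site') == 'mobile-site':
            name = "all-agents_mobile"
        if name is None:
            continue  # Skip records that cannot be labeled
        # Pageviews records store the value under 'views'; pagecounts records store it under 'count'
        if item.get('views') is not None:
            All_PageViews[item.get('timestamp')][name] = item.get('views')
        if item.get('count') is not None:
            All_PageViews[item.get('timestamp')][name] = item.get('count')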


# Step 2
# Create and rename fields to ensure we have all necessary fields for CSV 
for item in All_PageViews:
    row = All_PageViews[item]  # All fields collected for this timestamp
    # Create a field for all mobile views from the pageview API
    if row.get('user_mobile-app') is not None and row.get('user_mobile-web') is not None:
        row['pageview_mobile_views'] = row['user_mobile-app'] + row['user_mobile-web']
    # Rename the "user_desktop" field to comply with naming conventions
    if row.get('user_desktop') is not None:
        row['pageview_desktop_views'] = row.pop('user_desktop')
    # Create a field for all page views from the pageview API source
    if row.get('pageview_desktop_views') is not None and row.get('pageview_mobile_views') is not None:
        row['pageview_all_views'] = row['pageview_desktop_views'] + row['pageview_mobile_views']
    # Rename the "all-agents_desktop" field to comply with naming conventions
    if row.get('all-agents_desktop') is not None:
        row['pagecount_desktop_views'] = row.pop('all-agents_desktop')
    # Rename the "all-agents_mobile" field to comply with naming conventions
    if row.get('all-agents_mobile') is not None:
        row['pagecount_mobile_views'] = row.pop('all-agents_mobile')
    # Create a field for all page views from the pagecount API source
    # (months without mobile data use the desktop count alone)
    if row.get('pagecount_desktop_views') is not None and row.get('pagecount_mobile_views') is None:
        row['pagecount_all_views'] = row['pagecount_desktop_views']
    if row.get('pagecount_desktop_views') is not None and row.get('pagecount_mobile_views') is not None:
        row['pagecount_all_views'] = row['pagecount_desktop_views'] + row['pagecount_mobile_views']
    # Create Month, Year, and Date fields from the YYYYMMDDHH timestamp
    row['Month'] = str(item[4:6])
    row['Year'] = str(item[0:4])
    row['Date'] = datetime.strptime(str(item), '%Y%m%d%H').strftime('%m/%Y')

Sorted_PageViews = sorted(All_PageViews.items(), key=operator.itemgetter(0))
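
The date handling above can be checked on a single timestamp. The sketch below uses a hypothetical helper, `date_fields`, to show the three values derived from a `YYYYMMDDHH` string, and shows why sorting on the timestamp key gives chronological order.

```python
from datetime import datetime

def date_fields(timestamp):
    """Derive the Year, Month, and Date fields from a YYYYMMDDHH string."""
    return {'Year': timestamp[0:4],
            'Month': timestamp[4:6],
            'Date': datetime.strptime(timestamp, '%Y%m%d%H').strftime('%m/%Y')}

print(date_fields('2008010100'))
# {'Year': '2008', 'Month': '01', 'Date': '01/2008'}

# Fixed-width YYYYMMDDHH strings sort chronologically when sorted as text
print(sorted(['2015070100', '2008010100', '2008120100']))
# ['2008010100', '2008120100', '2015070100']
```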



In [22]:
# Step 3: Write CSV with all necessary data and headers
import csv

# define headers
headers = ['Date',
           'Month',
           'Year',
           'pagecount_all_views',
           'pagecount_desktop_views',
           'pagecount_mobile_views',
           'pageview_all_views',
           'pageview_desktop_views',
           'pageview_mobile_views',
               ]

# Open file and initialize writer; the with-block closes the file even if a write fails
with open('en-wikipedia_traffic_200801-201709.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(headers)

    # Sorted_PageViews is ordered by timestamp, so rows are written in date order
    for date, fields in Sorted_PageViews:
        line = []
        for header in headers:
            # Missing or zero values are written as empty cells
            if fields.get(header) is None or fields[header] == 0:
                line.append(None)
            else:
                line.append(fields[header])
        writer.writerow(line)
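
The writer above appends `None` for missing values. As a quick check of what that produces, the sketch below writes a few placeholder rows to an in-memory buffer instead of a file and reads them back. The column subset and the numbers are invented for illustration.

```python
import csv
import io

# Write to an in-memory buffer so no file is created
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['Date', 'pagecount_mobile_views', 'pageview_mobile_views'])
writer.writerow(['01/2008', None, None])  # No mobile data for this month
writer.writerow(['07/2015', 10, 33])      # Placeholder values
csv_text = buffer.getvalue()

# csv.writer turns None into an empty cell, which reads back as an empty string
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]['pagecount_mobile_views'] == '')  # prints True
print(rows[1]['pageview_mobile_views'])         # prints 33
```

Note that every value read back from a CSV is a string, so numeric columns need converting before any arithmetic.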

## Step 3: Analysis


In this step we analyze the data that we have extracted, with the goal of creating a visualization that shows page view trends on English Wikipedia. The visualization includes legacy pagecount data, which counts all views, and current pageview data, which allows us to consider only user views and exclude spiders from the count.

Because of the ridiculously long time I spent on the above steps, I am electing to do the analysis in Google Sheets. (Obviously I am very new at this. But I am definitely getting my money's worth from this program.)

**Key steps to the Google Sheets chart analysis:**  
1) Copy data from CSV into new sheet at: https://docs.google.com/spreadsheets/u/0/  
2) Select "Insert Chart" and choose line chart  
3) Specify each series to be viewed, and specify data for the x axis   
4) Adjust formatting and titles  
5) Publish to web  
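
If the chart were built in Python instead of a spreadsheet, the first task would be the same as step 3 above: pick an x axis and one series per column. The sketch below is one possible way to do that with the standard library only. The helper `column_series` is hypothetical, and `sample_csv` holds made-up placeholder values in the same layout as the CSV written in Step 2.

```python
import csv
import io

def column_series(csv_text, column):
    """Return (dates, values) for one column; blank cells become None."""
    dates, values = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        dates.append(row['Date'])
        values.append(int(row[column]) if row[column] else None)
    return dates, values

# Placeholder data; the numbers are invented for illustration
sample_csv = 'Date,pageview_all_views\n06/2015,\n07/2015,83\n'
dates, values = column_series(sample_csv, 'pageview_all_views')
print(dates)   # ['06/2015', '07/2015']
print(values)  # [None, 83]
```

Each `(dates, values)` pair corresponds to one line on the chart, and could be handed to a plotting library of your choice (none is used in this notebook).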

![title](pageviewsOnEnglishWikipedia.png)

The chart can be viewed here:
https://docs.google.com/spreadsheets/d/e/2PACX-1vQaZBhXDCLDwUUGy-9XC7n1uqQ7y0QQ9-yK9m6Oy-CPRRkUd9Ano_VAM0TJrCtLDcorqwh-AESNV5R2/pubchart?oid=463731903&format=image