# A1: Data Curation Assignment
### DATA512  
#### Emily Yamauchi

The goal of this assignment is to construct, analyze, and publish a dataset of monthly traffic on English Wikipedia from January 1 2008 through August 30 2021. All analysis should be performed in a single Jupyter notebook. Your Jupyter notebook and your data files will be uploaded to your own GitHub repository. A link to that repository will be submitted to enable the grading of this assignment.  

The purpose of the assignment is to demonstrate that you can follow best practices for open scientific research in designing and implementing your project, and make your project fully reproducible by others: from data collection to data analysis.
For this assignment, you will combine data about Wikipedia page traffic from two different [Wikimedia REST API endpoints](https://www.mediawiki.org/wiki/REST_API) into a single dataset, perform some simple data processing steps on the data, and then analyze that data.

### Step 0: Read about reproducibility

Review the chapters "Assessing Reproducibility" and "The Basic Reproducible Workflow Template" from The Practice of Reproducible Research.

>Rokem, Marwick, and Staneva. [Assessing Reproducibility](http://www.practicereproducibleresearch.org/core-chapters/2-assessment.html) in Kitzes, J., Turek, D., & Deniz, F. (Eds.). (2018). The Practice of Reproducible Research: Case Studies and Lessons from the Data-Intensive Sciences. Oakland, CA: University of California Press.

>Kitzes. [The Basic Reproducible Workflow Template](http://www.practicereproducibleresearch.org/core-chapters/3-basic.html) in Kitzes, J., Turek, D., & Deniz, F. (Eds.). (2018). The Practice of Reproducible Research: Case Studies and Lessons from the Data-Intensive Sciences. Oakland, CA: University of California Press.

### Step 1: Data acquisition

To measure Wikipedia traffic from 2008 through 2021, you will need to collect data from two different API endpoints: the Legacy Pagecounts API and the Pageviews API.  

- The Legacy Pagecounts API ([documentation](https://wikitech.wikimedia.org/wiki/Analytics/AQS/Legacy_Pagecounts), [endpoint](https://wikimedia.org/api/rest_v1/#!/Pagecounts_data_(legacy)/get_metrics_legacy_pagecounts_aggregate_project_access_site_granularity_start_end)) provides access to desktop and mobile traffic data from December 2007 through July 2016.
- The Pageviews API ([documentation](https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews), [endpoint](https://wikimedia.org/api/rest_v1/#!/Pageviews_data/get_metrics_pageviews_aggregate_project_access_agent_granularity_start_end)) provides access to desktop, mobile web, and mobile app traffic data from July 2015 through last month.  

For each API, you will need to collect data for all months where data is available and then save the raw results into 5 separate JSON source data files (one file per API query type) before continuing to Step 2.  

To get you started, you can refer to [this example Notebook](http://paws-public.wmflabs.org/paws-public/User:Jtmorgan/data512_a1_example.ipynb) that contains sample code for API calls ([download the notebook](http://paws-public.wmflabs.org/paws-public/User:Jtmorgan/data512_a1_example.ipynb?format=raw)). This sample code is [licensed CC0](https://creativecommons.org/share-your-work/public-domain/cc0/) so feel free to reuse any of the code in that notebook without attribution.  

Each JSON-formatted source data file must contain the complete and unedited output of the corresponding API query. The naming convention for the source data files is:  

>`apiname_accesstype_firstmonth-lastmonth.json`

For example, your filename for monthly page views for devices requesting the desktop versions of pages should be:
>`pagecounts_desktop-site_200712-202108.json`
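
The naming convention can be sketched as a small helper. The function name `source_filename` and its arguments are illustrative assumptions, not part of the assignment; months are passed as `YYYYMM` strings.

```python
def source_filename(api_name, access_type, first_month, last_month):
    """Build a name of the form apiname_accesstype_firstmonth-lastmonth.json.

    Hypothetical helper: first_month and last_month are YYYYMM strings.
    """
    for month in (first_month, last_month):
        if len(month) != 6 or not month.isdigit():
            raise ValueError(f"expected YYYYMM, got {month!r}")
    return f"{api_name}_{access_type}_{first_month}-{last_month}.json"


# Reproduces the example file name given above
print(source_filename("pagecounts", "desktop-site", "200712", "202108"))
```

Validating the month strings up front keeps full API timestamps such as `2015070100` from leaking into file names.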

Important notes:  
- As much as possible, we're interested in organic (user) traffic, as opposed to traffic by web crawlers or spiders. The Pageviews API (but not the Legacy Pagecounts API) allows you to filter by `agent=user`. You should do that.
- There was about 1 year of overlapping traffic data between the two APIs. You need to gather, and later graph, data from both APIs for this period of time.
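
To make the overlap concrete, the months covered by both APIs can be enumerated from the coverage dates stated earlier (legacy data through July 2016, Pageviews data from July 2015). This is a rough sketch: `month_range` is a hypothetical helper, and the August 2021 end month for Pageviews is an assumed example.

```python
def month_range(first, last):
    """Return all YYYYMM strings from first to last, inclusive."""
    year, month = int(first[:4]), int(first[4:])
    end_year, end_month = int(last[:4]), int(last[4:])
    months = []
    while (year, month) <= (end_year, end_month):
        months.append(f"{year:04d}{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


# Coverage per the API descriptions: legacy runs December 2007 through July 2016,
# Pageviews starts in July 2015 (the end month here is an assumed example).
legacy_months = month_range("200712", "201607")
pageview_months = month_range("201507", "202108")

# Months present in both ranges
overlap = sorted(set(legacy_months) & set(pageview_months))
print(overlap[0], overlap[-1], len(overlap))
```

Under these assumptions the overlap runs from July 2015 through July 2016, which is 13 months.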

In [1]:
import json
import requests

In [93]:
# get endpoints for legacy/pageview

base = 'https://wikimedia.org/api/rest_v1/metrics/'

endpoints = {
    "pagecounts" : base + 'legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end}',
    "pageviews" : base + 'pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}'
}

Parameters for each API call:  

- Legacy, desktop `"access-site" : "desktop-site"`
- Legacy, mobile `"access-site" : "mobile-site"`
- Pageviews, desktop `"access" : "desktop"`
- Pageviews, mobile web `"access" : "mobile-web"`
- Pageviews, mobile app `"access" : "mobile-app"`
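
As an illustration of how these parameters fill the endpoint template, the sketch below builds the legacy desktop URL without making a network request. The names `BASE`, `LEGACY_TEMPLATE`, and `legacy_desktop` are assumptions local to this example; the date values mirror the ones used in this notebook.

```python
BASE = "https://wikimedia.org/api/rest_v1/metrics/"
LEGACY_TEMPLATE = (
    BASE + "legacy/pagecounts/aggregate/"
    "{project}/{access-site}/{granularity}/{start}/{end}"
)

legacy_desktop = {
    "project": "en.wikipedia.org",
    "access-site": "desktop-site",
    "granularity": "monthly",
    "start": "2007120100",
    "end": "2016080100",
}

# str.format looks up each {placeholder} by key, so the hyphenated
# "access-site" key works when the dict is unpacked with **.
url = LEGACY_TEMPLATE.format(**legacy_desktop)
print(url)
```

Printing the URL before calling `requests.get` is a cheap way to check that every placeholder was filled as intended.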

In [84]:
# define all parameters

all_params = {
    "legacy_desktop": {
        "access" : "desktop-site",
        "data_type" : "pagecounts"},
    "legacy_mobile" : {
        "access" : "mobile-site",
        "data_type" : "pagecounts"},
    "new_desktop" : {
        "access" : "desktop",
        "data_type" : "pageviews"},
    "new_web" : {
        "access" : "mobile-web",
        "data_type" : "pageviews"},
    "new_app" : {
        "access" : "mobile-app",
        "data_type" : "pageviews"}
}

dates = {
    "pagecounts" : {
        "start" : "2007120100",
        "end" : "2016080100"},
    "pageviews" : {
        "start" : "2015070100",
        "end" : "2021100100"
    }
}

In [89]:
def get_params(access_type, agent="user"):
    
    data_type = all_params[access_type]["data_type"]
    
    if data_type == "pagecounts":
        
        # legacy API: no agent filter is available
        params = {"project" : "en.wikipedia.org",
                  "access-site" : all_params[access_type]["access"],
                  "granularity" : "monthly",
                  "start" : dates[data_type]["start"],
                  "end" : dates[data_type]["end"]}
    
    elif data_type == "pageviews":
        
        params = {"project" : "en.wikipedia.org",
                  "access" : all_params[access_type]["access"],
                  "agent" : agent,
                  "granularity" : "monthly",
                  "start" : dates[data_type]["start"],
                  "end" : dates[data_type]["end"]}
    
    else:
        # avoid returning an undefined `params` for an unknown API name
        raise ValueError("unknown data_type: " + data_type)
    
    return params

In [90]:
# test for api call type

get_params("new_app")

{'project': 'en.wikipedia.org',
 'access': 'mobile-app',
 'agent': 'user',
 'granularity': 'monthly',
 'start': '2015070100',
 'end': '2021100100'}

In [94]:
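# function to get api response, defaults to agent="user"
# note: relies on the `headers` dict defined in the sample-parameters cell (In [9])

def api_call(access_type, agent="user"):
    
    data_type = all_params[access_type]["data_type"]
    
    endpoint = endpoints[data_type]
    params = get_params(access_type, agent)
    
    call = requests.get(endpoint.format(**params), headers=headers)
    call.raise_for_status()  # fail on HTTP errors instead of returning an error body
    response = call.json()
    
    return response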

In [95]:
# test for pageview mobile app

api_call("new_app")

{'items': [{'project': 'en.wikipedia',
   'access': 'mobile-app',
   'agent': 'user',
   'granularity': 'monthly',
   'timestamp': '2015070100',
   'views': 109624146},
  {'project': 'en.wikipedia',
   'access': 'mobile-app',
   'agent': 'user',
   'granularity': 'monthly',
   'timestamp': '2015080100',
   'views': 109669149},
  {'project': 'en.wikipedia',
   'access': 'mobile-app',
   'agent': 'user',
   'granularity': 'monthly',
   'timestamp': '2015090100',
   'views': 96221684},
  {'project': 'en.wikipedia',
   'access': 'mobile-app',
   'agent': 'user',
   'granularity': 'monthly',
   'timestamp': '2015100100',
   'views': 94523777},
  {'project': 'en.wikipedia',
   'access': 'mobile-app',
   'agent': 'user',
   'granularity': 'monthly',
   'timestamp': '2015110100',
   'views': 94353925},
  {'project': 'en.wikipedia',
   'access': 'mobile-app',
   'agent': 'user',
   'granularity': 'monthly',
   'timestamp': '2015120100',
   'views': 99438956},
  ...
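
The response is a dict whose `items` list holds one record per month. A minimal sketch of reducing it to a month-to-views mapping follows; `monthly_views` is a hypothetical helper, and `sample_response` copies the first three records from the output above.

```python
# First three records copied from the output above
sample_response = {
    "items": [
        {"project": "en.wikipedia", "access": "mobile-app", "agent": "user",
         "granularity": "monthly", "timestamp": "2015070100", "views": 109624146},
        {"project": "en.wikipedia", "access": "mobile-app", "agent": "user",
         "granularity": "monthly", "timestamp": "2015080100", "views": 109669149},
        {"project": "en.wikipedia", "access": "mobile-app", "agent": "user",
         "granularity": "monthly", "timestamp": "2015090100", "views": 96221684},
    ]
}


def monthly_views(response):
    """Map YYYYMM -> view count for each item in a Pageviews response."""
    return {item["timestamp"][:6]: item["views"] for item in response["items"]}


views = monthly_views(sample_response)
print(views)
```

The timestamps are `YYYYMMDDHH` strings, so the first six characters identify the month.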

In [102]:
import os

def save_json(access_type, agent="user"):
    
    resp = api_call(access_type, agent)
    # relative path: a leading "/" points at the filesystem root, not the repository
    path_name = "data_raw/{data_type}_{access_type}_{start}-{end}.json"
    
    data_type = all_params[access_type]["data_type"]
    
    # file names use YYYYMM, so drop the DDHH suffix from the API timestamps
    # note: "end" is the exclusive bound passed to the API, not the last month of data
    names = {"data_type" : data_type,
            "access_type" : all_params[access_type]["access"],
            "start" : dates[data_type]["start"][:6],
            "end" : dates[data_type]["end"][:6]}
    
    file_path = path_name.format(**names)
    
    # create the output directory if it does not exist yet
    os.makedirs(os.path.dirname(file_path), exist_ok=True)
    
    with open(file_path, "w") as f:
        json.dump(resp, f)

In [103]:
save_json("new_app")
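
Because the source files must hold the complete, unedited API output, it can help to confirm that a saved file reads back identically. A minimal sketch, assuming a hypothetical `save_raw` helper, a stand-in payload, and a temporary directory in place of `data_raw`:

```python
import json
import tempfile
from pathlib import Path


def save_raw(response, out_dir, filename):
    """Write response as JSON under out_dir, creating the directory if needed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    file_path = out_dir / filename
    with open(file_path, "w") as f:
        json.dump(response, f)
    return file_path


# Stand-in payload; a real run would pass the API response instead.
payload = {"items": [{"timestamp": "2015070100", "views": 109624146}]}

with tempfile.TemporaryDirectory() as tmp:
    saved = save_raw(payload, Path(tmp) / "data_raw", "example.json")
    with open(saved) as f:
        round_trip = json.load(f)

# True when the file content matches the in-memory response exactly
print(round_trip == payload)
```

Creating the directory with `mkdir(parents=True, exist_ok=True)` means the write does not depend on the folder already existing.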

In [9]:
# SAMPLE parameters for getting aggregated legacy view data 
# see: https://wikimedia.org/api/rest_v1/#!/Legacy_data/get_metrics_legacy_pagecounts_aggregate_project_access_site_granularity_start_end
example_params_legacy = {"project" : "en.wikipedia.org",
                 "access-site" : "desktop-site",
                 "granularity" : "monthly",
                 "start" : "2001010100",
                # for end use 1st day of month following final month of data
                 "end" : "2018100100"
                    }

# SAMPLE parameters for getting aggregated current standard pageview data
# see: https://wikimedia.org/api/rest_v1/#!/Pageviews_data/get_metrics_pageviews_aggregate_project_access_agent_granularity_start_end
example_params_pageviews = {"project" : "en.wikipedia.org",
                    "access" : "desktop",
                    "agent" : "user",
                    "granularity" : "monthly",
                    "start" : "2001010100",
                    # for end use 1st day of month following final month of data
                    "end" : "2018100100"
                        }

# Customize these with your own information
headers = {
    'User-Agent': 'https://github.com/emi90',
    'From': 'eyamauch@uw.edu'
}

In [10]:
def api_call(endpoint,parameters):
    call = requests.get(endpoint.format(**parameters), headers=headers)
    response = call.json()
    
    return response

In [11]:
example_monthly_pageviews = api_call(endpoints["pageviews"], example_params_pageviews)