# A1: Data Curation Assignment
### Data 512: Human Centered Data Science
#### Aaliyah Hänni
#### 10/7/2021

## Project Description
The goal of this assignment is to construct, analyze, and publish a dataset of monthly traffic on English Wikipedia from January 1 2008 through August 30 2021. 

## Data Documentation


### Wikimedia REST API

In order to measure Wikipedia traffic from 2008-2021, you will need to collect data from two different API endpoints, the Legacy Pagecounts API and the Pageviews API.The Legacy Pagecounts API provides access to desktop and mobile traffic data from December 2007 through July 2016.The Pageviews API provides access to desktop, mobile web, and mobile app traffic data from July 2015 through last month.

Legacy Pagecounts API Documentation: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Legacy_Pagecounts

Pageviews API Documentation:https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews

Wikimedia REST APIEndpoint: https://wikimedia.org/api/rest_v1/#/Pagecounts_data_(legacy)/get_metrics_legacy_pagecounts_aggregate_project_access_site_granularity_start_end

For each API, you will need to collect data for all months where data is available and then save the raw results into 5 separate JSON source data files (one file per API query type) before continuing to Step 2.

## Data Acquisition

In [1]:
import json
import requests

In [2]:
endpoint_legacy = 'https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end}'

endpoint_pageviews = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}'

In [3]:
# Parameters for getting aggregated legacy view data 
# see: https://wikimedia.org/api/rest_v1/#!/Legacy_data/get_metrics_legacy_pagecounts_aggregate_project_access_site_granularity_start_end
params_legacy_mobile = {"project" : "en.wikipedia.org",
                 "access-site" : "mobile-site",
                 "granularity" : "monthly",
                 "start" : "2008010100",
                # for end use 1st day of month following final month of data
                 "end" : "2016080100"
                    }

params_legacy_desktop = {"project" : "en.wikipedia.org",
                 "access-site" : "desktop-site",
                 "granularity" : "monthly",
                 "start" : "2008010100",
                # for end use 1st day of month following final month of data
                 "end" : "2016080100"
                    }

# Parameters for getting aggregated current standard pageview data
# see: https://wikimedia.org/api/rest_v1/#!/Pageviews_data/get_metrics_pageviews_aggregate_project_access_agent_granularity_start_end
params_pageviews_mobile_app = {"project" : "en.wikipedia.org",
                    "access" : "mobile-app",
                    "agent" : "user",
                    "granularity" : "monthly",
                    "start" : "2015070100",
                    # for end use 1st day of month following final month of data
                    "end" : '2021090100'
                        }
params_pageviews_mobile_web = {"project" : "en.wikipedia.org",
                    "access" : "mobile-web",
                    "agent" : "user",
                    "granularity" : "monthly",
                    "start" : "2015070100",
                    # for end use 1st day of month following final month of data
                    "end" : '2021090100'
                        }
params_pageviews_desktop = {"project" : "en.wikipedia.org",
                    "access" : "desktop",
                    "agent" : "user",
                    "granularity" : "monthly",
                    "start" : "2015070100",
                    # for end use 1st day of month following final month of data
                    "end" : '2021090100'
                        }

# Customize these with your own information
headers = {
    'User-Agent': 'https://github.com/aaliyahfiala42',
    'From': 'fialaa@uw.edu'
}

In [4]:
def api_call(endpoint,parameters):
    call = requests.get(endpoint.format(**parameters), headers=headers)
    response = call.json()
    
    return response

In [5]:
#Get monthly pages views for mobile from Legacy API
monthly_legacy_mobile = api_call(endpoint_legacy, params_legacy_mobile)

#Get monthly pages views for desktop from Legacy API
monthly_legacy_desktop = api_call(endpoint_legacy, params_legacy_desktop)

In [6]:
#print(monthly_legacy_mobile)
#print(monthly_legacy_desktop)

In [7]:
#Get monthly pages views for mobile apps from Pageviews API
monthly_pageviews_mobile_app = api_call(endpoint_pageviews, params_pageviews_mobile_app)

#Get monthly pages views for mobile web from Pageviews API
monthly_pageviews_mobile_web = api_call(endpoint_pageviews, params_pageviews_mobile_web)

#Get monthly pages views for desktop from Pageviews API
monthly_pageviews_desktop = api_call(endpoint_pageviews, params_pageviews_desktop)


In [8]:
#print(monthly_pageviews_mobile_app)
#print(monthly_pageviews_mobile_web)
#print(monthly_pageviews_mobile_desktop)

In [9]:
#Saving data from APIs to JSON files in the format apiname_accesstype_firstmonth-lastmonth.json

with open('legacy_mobile-site_200801-201607.json', 'w') as outfile:
    json.dump(monthly_legacy_mobile, outfile)
    
with open('legacy_desktop-site_200801-201607.json', 'w') as outfile:
    json.dump(monthly_legacy_desktop, outfile)

with open('pageviews_mobile-app_201507-202108.json', 'w') as outfile:
    json.dump(monthly_pageviews_mobile_app, outfile)

with open('pageviews_mobile-web_201507-202108.json', 'w') as outfile:
    json.dump(monthly_pageviews_mobile_web, outfile)
    
with open('pageviews_desktop_201507-202108.json', 'w') as outfile:
    json.dump(monthly_legacy_desktop, outfile)

## Data Processing


To begin, below we convert the data into Python Dataframes using the Pandas library to allow for ease of processing.

In [10]:
import pandas as pd

In [11]:
#reading in the Legacy Mobile JSON file
with open('legacy_mobile-site_200801-201607.json', 'r') as f:
    legacy_mobile = json.loads(f.read())
    
#flattening JSON to format into columns correctly
legacy_mobile = pd.json_normalize(legacy_mobile, record_path = ['items'])

#preview data to check that it all read in correctly
#legacy_mobile.head()



#reading in the Legacy Desktop JSON file
with open('legacy_desktop-site_200801-201607.json', 'r') as f:
    legacy_desktop = json.loads(f.read())
    
#flattening JSON to format into columns correctly
legacy_desktop = pd.json_normalize(legacy_desktop, record_path = ['items'])

#preview data to check that it all read in correctly
#legacy_desktop.head()



#reading in the Pageviews Mobile App JSON file
with open('pageviews_mobile-app_201507-202108.json', 'r') as f:
    pageviews_mobile_app = json.loads(f.read())
    
#flattening JSON to format into columns correctly
pageviews_mobile_app = pd.json_normalize(pageviews_mobile_app, record_path = ['items'])

#preview data to check that it all read in correctly
#pageviews_mobile_app.head()



#reading in the Pageviews Mobile Web JSON file
with open('pageviews_mobile-web_201507-202108.json', 'r') as f:
    pageviews_mobile_web = json.loads(f.read())
    
#flattening JSON to format into columns correctly
pageviews_mobile_web = pd.json_normalize(pageviews_mobile_web, record_path = ['items'])

#preview data to check that it all read in correctly
#pageviews_mobile_web.head()



#reading in the Pageviews Desktop JSON file
with open('pageviews_mobile-web_201507-202108.json', 'r') as f:
    pageviews_desktop = json.loads(f.read())
    
#flattening JSON to format into columns correctly
pageviews_desktop = pd.json_normalize(pageviews_desktop, record_path = ['items'])

#preview data to check that it all read in correctly
pageviews_desktop.head()

Unnamed: 0,project,access,agent,granularity,timestamp,views
0,en.wikipedia,mobile-web,user,monthly,2015070100,3179131148
1,en.wikipedia,mobile-web,user,monthly,2015080100,3192663889
2,en.wikipedia,mobile-web,user,monthly,2015090100,3073981649
3,en.wikipedia,mobile-web,user,monthly,2015100100,3173975355
4,en.wikipedia,mobile-web,user,monthly,2015110100,3142247145


### Creating Mobile Total Variable 
In the steps below, the monthly values for mobile-app and mobile-web from the PageViews API will be combined to obtain the total monthly mobile traffic count for each month.

In [12]:
#replace all null count months with zeroes
pageviews_mobile_web['views'] = pageviews_mobile_web['views'].fillna(0)
pageviews_mobile_app['views'] = pageviews_mobile_app['views'].fillna(0)

In [13]:
#removing the following fields as they are not needed for further analysis: agent, granularity, project, and access
pageviews_mobile_web = pageviews_mobile_web.drop(['agent', 'granularity', 'project', 'access'], axis = 1)
pageviews_mobile_app = pageviews_mobile_app.drop(['agent', 'granularity', 'project', 'access'], axis = 1)

In [14]:
#merging Pageview Mobile Web and App dataframes 
pageviews_mobile = pd.merge(pageviews_mobile_web, pageviews_mobile_app, on='timestamp')

#creating a new field that is the sum of all mobile views
pageviews_mobile_sum = pageviews_mobile['views_x'] + pageviews_mobile['views_y']

#adding the new field back into the dataframe
pageviews_mobile = pd.concat([pageviews_mobile, pageviews_mobile_sum], axis = 1, names = ['views'])

#removing the individual web and app views
pageviews_mobile = pageviews_mobile.drop(['views_x', 'views_y'], axis = 1)

#relabing the summed column to views for clarity
#pageviews_mobile = pageviews_mobile.rename({'0' : 'views'}, axis = 1)

In [15]:

pageviews_mobile

Unnamed: 0,timestamp,0
0,2015070100,3288755294
1,2015080100,3302333038
2,2015090100,3170203333
3,2015100100,3268499132
4,2015110100,3236601070
...,...,...
69,2021040100,4759095083
70,2021050100,4976579558
71,2021060100,4584510417
72,2021070100,4778909421


### Time Formating
All date timestamp variables modified to segregate the month (MM) and year (YYYY), discarding days and hours

In [16]:
## Time stamp formating for Legacy Mobile
#getting the year (YYYY) from the timestamp (format = YYYYMMHHDD)
year = legacy_mobile['timestamp'].str.slice(0, 4)

#getting the month (MM) from the timestamp (format = YYYYMMHHDD)
month = legacy_mobile['timestamp'].str.slice(4, 6)

#removing the timestamp and appending the year and month as columns
legacy_mobile = legacy_mobile.drop(['timestamp', 'project', 'granularity', 'access-site'], axis = 1)
legacy_mobile = pd.concat([legacy_mobile, year, month], axis = 1)

#renaming columns to be more intuitive
legacy_mobile.columns = ['views', 'year', 'month']



## Time stamp formating for Legacy Desktop
#getting the year (YYYY) from the timestamp (format = YYYYMMHHDD)
year = legacy_desktop['timestamp'].str.slice(0, 4)

#getting the month (MM) from the timestamp (format = YYYYMMHHDD)
month = legacy_desktop['timestamp'].str.slice(4, 6)

#removing the timestamp and appending the year and month as columns
legacy_desktop = legacy_desktop.drop(['timestamp', 'project', 'granularity', 'access-site'], axis = 1)
legacy_desktop = pd.concat([legacy_desktop, year, month], axis = 1)

#renaming columns to be more intuitive
legacy_desktop.columns = ['views', 'year', 'month']



## Time stamp formating for Pageviews Mobile
#getting the year (YYYY) from the timestamp (format = YYYYMMHHDD)
year = pageviews_mobile['timestamp'].str.slice(0, 4)

#getting the month (MM) from the timestamp (format = YYYYMMHHDD)
month = pageviews_mobile['timestamp'].str.slice(4, 6)

#removing the timestamp and appending the year and month as columns
pageviews_mobile = pageviews_mobile.drop(['timestamp'], axis = 1)
pageviews_mobile = pd.concat([pageviews_mobile, year, month], axis = 1)

#renaming columns to be more intuitive
pageviews_mobile.columns = ['views', 'year', 'month']



## Time stamp formating for Pageviews Desktop
#getting the year (YYYY) from the timestamp (format = YYYYMMHHDD)
year = pageviews_desktop['timestamp'].str.slice(0, 4)

#getting the month (MM) from the timestamp (format = YYYYMMHHDD)
month = pageviews_desktop['timestamp'].str.slice(4, 6)

#removing the timestamp and appending the year and month as columns
pageviews_desktop = pageviews_desktop.drop(['timestamp', 'project', 'granularity', 'access', 'agent'], axis = 1)
pageviews_desktop = pd.concat([pageviews_desktop, year, month], axis = 1)

#renaming columns to be more intuitive
pageviews_desktop.columns = ['views', 'year', 'month']

In [17]:
#legacy_mobile.head()

In [18]:
#legacy_desktop.head()

In [19]:
#pageviews_mobile.head()

In [20]:
#pageviews_desktop.head()

### Final Data Set
For ease of data analysis, all four about files will be combined and stored into a singe CSV file with the following headers (format): year (YYYY), month (MM), pagecount_all_views (int), pagecount_desktop_views (int), pagecount_mobile_views (int), pageview_all_views (int), pageview_desktop_views (int), pageview_mobile_views (int)


## Data Analysis


## Sources

https://towardsdatascience.com/how-to-parse-json-data-with-python-pandas-f84fbd0b1025 

https://www.codegrepper.com/code-examples/python/add+two+dataframes+pandas

https://www.codegrepper.com/code-examples/python/parse+string+in+pandas+dataframe