# Program Overview

The purpose of this notebook is to process the raw data necessary to analyze monthly article traffic from English Wikipedia articles published 07-01-2015 through 09-30-2023.

The source data uses the following license:
**EKRC INSERT LICENSE HERE

Things to do:
    1. remove the "Access" fields from all of the dictionaries
    2. Combine the mobile files to make the mobile file
    3. Combine the mobile and desktop files to make the cumulative file

We will begin by reading in some basic Python libraries.

In [2]:
#Import python libraries
import json
import time
import urllib.parse
import requests
import pandas as pd

## Create Monthly Desktop Access file

This clean data file requires minimal processing - simply the removal of the "access" field from the raw data and a new file name. We will begin by reading in the data from our raw_data directory.

In [41]:
#Reading in the desktop access file
desktop_file = open('../raw_data/desktop_pageviews.json')
 
#Makes desktop_file a dictionary
desktop_dict = json.load(desktop_file)

Next, we will remove the "access" field from each of the value dictionaries in the desktop file.

In [42]:
#Iterate through each article title, then through each month to remove
#"access"
desktop_keys = list(desktop_dict.keys())
for key in desktop_keys:
    for month in desktop_dict[key]:
        del month['access']

Finally, we will save the dictionary as a JSON in the clean data folder.

In [46]:
#Saving the desktop_dict as a JSON
desktop_json_object = json.dumps(desktop_dict, indent = 4) 

#Writing to the file
with open('../clean_data/academy_monthly_desktop_201507-202309.json', 'w') as outfile:
    outfile.write(desktop_json_object)

## Create a fuction which combines the views in two JSON files

To create the monthly mobile access file and monthly cumulative file we will need to sum the views across two files. We have created a function which does this to increase reusability of the code.

The function first takes in two files which the user wants to combine views for. It then tests if they have the same keys. If the keys are the same, for each key, we look at each month of data. If the timestamps of the month match between the two input files, then we sum their views together. We add this information, along with project, article, granularity, timestamp, and agent information, to the combined dictionary (comb_dict) that we established in the function. The function ultimately returns the combined dictionary. Should the two input dictionaries have different keys, different timestamps/month, 

In [None]:
#Creating a method which sums the views between two files
def view_combiner(file_1, file_2):
    #Testing if the two files have the same keys
    file_1_keys = list(file_1.keys())
    file_2_keys = list(file_2.keys())
    if file_1_keys == file_2_keys:
        comb_dict = {}
        for key in file_1_keys: #could have been either - they match
            comb_dict[key] = []
            for i in range(len(file_1[key])): #for each month
                if file_1[key][i]['timestamp'] == file_2[key][i]['timestamp']:
                    tot_views = file_1[key][i]['views'] + file_2[key][i]['views']
                    comb_dict[key].append(
                        {'project': file_1[key][i]['project'],
                         'article': file_1[key][i]['article'],
                         'granularity': file_1[key][i]['granularity'],
                         'timestamp': file_1[key][i]['timestamp'],
                         'agent': file_1[key][i]['agent'],
                         'views': tot_views})
                else:
                    print("Please ensure both files have matching timestamps"+
                    "for each of they array values for each of the keys")
    else:
        print("Keys don't match - please ensure both files have matching"+
              " keys")
    return comb_dict
        

mobile_dict = view_combiner(mobile_web_dict, mobile_app_dict)

## Create Monthly Mobile Access file

This file requires more processing than the desktop file. We will begin similarly - by reading in the two mobile file from the raw_data directory and removing the "access" field from each of the value dictionaries.

In [48]:
#Reading in the mobile_app and mobile_web files
mobile_app_file = open('../raw_data/mobile_app_pageviews.json')
mobile_web_file = open('../raw_data/mobile_web_pageviews.json')
 
#Make the files into dictionaries
mobile_app_dict = json.load(mobile_app_file)
mobile_web_dict = json.load(mobile_web_file)

#Iterate through each article title, then through each month to remove
#"access"
mobile_dict_list = [mobile_app_dict, mobile_web_dict]
for mobile_dict in mobile_dict_list:
    mobile_dict_keys = list(mobile_dict.keys())
    for key in mobile_dict_keys:
        for month in mobile_dict[key]:
            del month['access']

Next we need to combine views from mobile app pages and mobile web pages into a single mobile views dataset. We will do this by using the view_combiner function created earlier.

In [95]:
mobile_dict[test_name]

[{'project': 'en.wikipedia',
  'article': 'Everything_Everywhere_All_at_Once',
  'granularity': 'monthly',
  'timestamp': '2020010100',
  'agent': 'user',
  'views': 2306},
 {'project': 'en.wikipedia',
  'article': 'Everything_Everywhere_All_at_Once',
  'granularity': 'monthly',
  'timestamp': '2020020100',
  'agent': 'user',
  'views': 5107},
 {'project': 'en.wikipedia',
  'article': 'Everything_Everywhere_All_at_Once',
  'granularity': 'monthly',
  'timestamp': '2020030100',
  'agent': 'user',
  'views': 4547},
 {'project': 'en.wikipedia',
  'article': 'Everything_Everywhere_All_at_Once',
  'granularity': 'monthly',
  'timestamp': '2020040100',
  'agent': 'user',
  'views': 9824},
 {'project': 'en.wikipedia',
  'article': 'Everything_Everywhere_All_at_Once',
  'granularity': 'monthly',
  'timestamp': '2020050100',
  'agent': 'user',
  'views': 8109},
 {'project': 'en.wikipedia',
  'article': 'Everything_Everywhere_All_at_Once',
  'granularity': 'monthly',
  'timestamp': '2020060100',

In [52]:
test_name = 'Everything Everywhere All at Once'

In [None]:
#Testing to see if the mobile apps have the same amount of data in them
desktop_keys = list(desktop_dict.keys())
for key in desktop_keys:
    if len(mobile_web_dict[key]) != len(mobile_app_dict[key]):
        print("Not same amount of information")
        break
else:
    print("The files entered have the same amount of data")