# Program Overview

This notebook processes the raw data needed to analyze monthly traffic to English Wikipedia articles from 07-01-2015 through 09-30-2023.

The source data comes from English Wikipedia, the text of which is licensed under the Creative Commons Attribution-ShareAlike license (https://en.wikipedia.org/wiki/Wikipedia:Text_of_the_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License).

We will begin by importing some basic Python libraries.

In [100]:
#Import python libraries
import json
import time
import urllib.parse
import requests
import pandas as pd

## Create Monthly Desktop Access file

This data file requires minimal processing - simply the removal of the "access" field from the raw data and a new file name. We will begin by reading in the data from our raw_data directory.

In [101]:
#Reading in the desktop access file and making it a dictionary;
#the with block closes the file once it has been loaded
with open('../raw_data/desktop_pageviews.json') as desktop_file:
    desktop_dict = json.load(desktop_file)

Next, we will remove the "access" field from each of the value dictionaries in the desktop file.

In [102]:
#Iterate through each article title, then through each month to remove
#"access"
desktop_keys = list(desktop_dict.keys())
for key in desktop_keys:
    for month in desktop_dict[key]:
        del month['access']
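
To make this step concrete, the sketch below applies the same removal to a single hand-made record. The field names are the ones this notebook relies on (project, article, granularity, timestamp, access, agent, views); the article title and every value are invented for illustration and are not taken from the raw data. It uses `pop` with a default instead of `del`, which is an assumption about what is convenient here: `del month['access']` raises a `KeyError` if the cell is run a second time on the same dictionary, while `pop` does not.

```python
# Toy stand-in for desktop_dict: one article title mapped to a list of
# monthly records. All values here are invented for illustration.
toy_dict = {
    "Example_Article": [
        {"project": "en.wikipedia", "article": "Example_Article",
         "granularity": "monthly", "timestamp": "2015070100",
         "access": "desktop", "agent": "user", "views": 10},
    ]
}

# Equivalent to the loop above, except that pop() with a default also
# tolerates records that no longer have an "access" field.
for records in toy_dict.values():
    for month in records:
        month.pop("access", None)

# The remaining field names, sorted for a stable display.
print(sorted(toy_dict["Example_Article"][0].keys()))
```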

Finally, we will save the dictionary as a JSON in the clean data folder.

In [103]:
#Saving the desktop_dict as a JSON
desktop_json_object = json.dumps(desktop_dict, indent = 4) 

#Writing to the file
with open('../clean_data/academy_monthly_desktop_201507-202309.json', 'w') as outfile:
    outfile.write(desktop_json_object)
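
One optional way to check a saved file is to read it back and compare it with the dictionary that was written. The sketch below does this on invented toy data in a temporary directory, so it does not touch the clean data folder; `toy_dict` and `toy_output.json` are placeholder names. It also uses `json.dump`, which writes directly to the file handle and produces the same file contents as `json.dumps` followed by `write`.

```python
import json
import os
import tempfile

# Invented toy data standing in for desktop_dict.
toy_dict = {"Example_Article": [{"timestamp": "2015070100", "views": 10}]}

# Write to a throwaway directory rather than ../clean_data.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_output.json")

    # json.dump serializes straight to the open file.
    with open(out_path, "w") as outfile:
        json.dump(toy_dict, outfile, indent=4)

    # Read the file back so it can be compared with what was written.
    with open(out_path) as infile:
        reloaded = json.load(infile)

# True when the file on disk held exactly the data that was saved.
round_trip_ok = reloaded == toy_dict
print(round_trip_ok)
```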

## Create a function which combines the views in two JSON files

To create the monthly mobile access file and monthly cumulative file we will need to sum the views across two files. We have created a function which does this to increase reusability of the code.

The function takes two dictionaries (the loaded contents of the two files whose views the user wants to combine) and first tests whether they have the same keys. If they do, it steps through each month of data for each key. Where the timestamps of a month match between the two inputs, it sums their views and appends the total, along with the project, article, granularity, timestamp, and agent fields, to the combined dictionary (comb_dict) built inside the function. The function ultimately returns the combined dictionary. If the two inputs have different keys, or different timestamps under the same key, the function prints a message asking the user to supply files with matching keys and/or matching timestamps for each key.

In [104]:
#Creating a function which sums the views between two files
def view_combiner(file_1, file_2):
    #Start with an empty result so there is always a dictionary to return
    comb_dict = {}
    #Testing if the two files have the same keys, in any order
    if file_1.keys() == file_2.keys():
        for key in file_1: #could have been either - they match
            comb_dict[key] = []
            for i in range(len(file_1[key])): #for each month
                if file_1[key][i]['timestamp'] == file_2[key][i]['timestamp']:
                    tot_views = file_1[key][i]['views'] + file_2[key][i]['views']
                    comb_dict[key].append(
                        {'project': file_1[key][i]['project'],
                         'article': file_1[key][i]['article'],
                         'granularity': file_1[key][i]['granularity'],
                         'timestamp': file_1[key][i]['timestamp'],
                         'agent': file_1[key][i]['agent'],
                         'views': tot_views})
                else:
                    print("Please ensure both files have matching timestamps "+
                          "for each of the array values for each of the keys")
    else:
        print("Keys don't match - please ensure both files have matching"+
              " keys")
    return comb_dict
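
When the function prints one of its messages, it does not say which keys or months are at fault. The sketch below is one possible way to locate them before combining. `find_mismatches` is a hypothetical helper that is not part of this notebook, and the toy inputs are invented; it assumes only that each input maps a key to a list of records carrying a `timestamp` field.

```python
def find_mismatches(file_1, file_2):
    """Return (keys found in only one input, {key: [mismatched month indexes]})."""
    # Symmetric difference: keys present in exactly one of the two inputs.
    unmatched_keys = sorted(set(file_1) ^ set(file_2))

    bad_months = {}
    for key in set(file_1) & set(file_2):
        months_1, months_2 = file_1[key], file_2[key]
        # Positions where both inputs have a record but timestamps differ.
        bad = [i for i, (m1, m2) in enumerate(zip(months_1, months_2))
               if m1['timestamp'] != m2['timestamp']]
        # Positions that exist in only one of the two lists.
        bad += list(range(min(len(months_1), len(months_2)),
                          max(len(months_1), len(months_2))))
        if bad:
            bad_months[key] = bad
    return unmatched_keys, bad_months


# Invented toy inputs: "B" is missing from the second input, and the
# second month of "A" carries a different timestamp.
toy_1 = {"A": [{"timestamp": "2015070100"}, {"timestamp": "2015080100"}],
         "B": [{"timestamp": "2015070100"}]}
toy_2 = {"A": [{"timestamp": "2015070100"}, {"timestamp": "2015090100"}]}

print(find_mismatches(toy_1, toy_2))
```

Two empty results mean the inputs line up key for key and month for month.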

## Create Monthly Mobile Access file

This file requires more processing than the desktop file. We will begin similarly - by reading in the two mobile files from the raw_data directory and removing the "access" field from each of the value dictionaries.

In [105]:
#Reading in the mobile_app and mobile_web files and making them into
#dictionaries; the with blocks close each file once it has been loaded
with open('../raw_data/mobile_app_pageviews.json') as mobile_app_file:
    mobile_app_dict = json.load(mobile_app_file)
with open('../raw_data/mobile_web_pageviews.json') as mobile_web_file:
    mobile_web_dict = json.load(mobile_web_file)

#Iterate through each article title, then through each month to remove
#"access"
mobile_dict_list = [mobile_app_dict, mobile_web_dict]
for mobile_dict in mobile_dict_list:
    mobile_dict_keys = list(mobile_dict.keys())
    for key in mobile_dict_keys:
        for month in mobile_dict[key]:
            del month['access']

Next we need to combine views from mobile app pages and mobile web pages into a single mobile views dataset. We will do this by using the view_combiner function created earlier.

In [106]:
#Combining the mobile views data
mobile_dict = view_combiner(mobile_web_dict, mobile_app_dict)
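
Because view_combiner only appends months whose timestamps match, and prints a message rather than stopping, a month could be left out of the result without halting the notebook. A simple check, sketched below on invented toy data, is to compare total views before and after combining. `total_views` is a hypothetical helper that is not part of this notebook; it assumes each record has a numeric `views` field.

```python
def total_views(view_dict):
    """Sum the 'views' field over every article and every month."""
    return sum(month['views']
               for months in view_dict.values()
               for month in months)


# Invented toy data: two inputs and the combined result one would expect.
toy_web = {"A": [{"timestamp": "2015070100", "views": 7}]}
toy_app = {"A": [{"timestamp": "2015070100", "views": 3}]}
toy_combined = {"A": [{"timestamp": "2015070100", "views": 10}]}

# If no month was skipped, the combined total equals the sum of the inputs.
totals_match = (total_views(toy_combined)
                == total_views(toy_web) + total_views(toy_app))
print(totals_match)
```

In this notebook, the same comparison would be between `total_views(mobile_dict)` and `total_views(mobile_web_dict) + total_views(mobile_app_dict)`.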

Finally, we will save the mobile_dict output to a JSON file.

In [107]:
#Saving the mobile_dict as a JSON
mobile_json_object = json.dumps(mobile_dict, indent = 4) 

#Writing to the file
with open('../clean_data/academy_monthly_mobile_201507-202309.json', 'w') as outfile:
    outfile.write(mobile_json_object)

## Create Monthly Cumulative Access file

This file requires similar processing to the mobile access file. Because we have already removed the "access" fields from the mobile and desktop data, that step is not required here. We will proceed directly to using the view_combiner function written earlier in the code, and saving the output to the clean data folder.

In [108]:
#Combining all views data
combo_dict = view_combiner(mobile_dict, desktop_dict)

#Saving the combined dict as a JSON
combo_json_object = json.dumps(combo_dict, indent = 4) 

#Writing to the file
with open('../clean_data/academy_monthly_cumulative_201507-202309.json', 'w') as outfile:
    outfile.write(combo_json_object)