# Program Overview

The purpose of this notebook is to analyze monthly traffic for English Wikipedia articles from 07-01-2015 through 09-30-2023. Specifically, we will produce graphs for the articles with the:
- Highest and lowest average monthly view rates for both desktop and mobile
- Top 10 peak page views for both desktop and mobile access
- Fewest months of data for recent award winners on both mobile and desktop

The source data uses the following license:
**EKRC INSERT LICENSE HERE

We begin by loading common Python libraries. Note that saving Plotly visualizations as images requires the kaleido package, which the user may need to install.

In [2]:
#Import python libraries
import json
from datetime import datetime
import urllib.parse
import requests
import pandas as pd
import plotly.express as px
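
Since image export depends on kaleido, it can help to check for it before running the rest of the notebook. The following is a minimal sketch, not part of the original workflow; the helper name `kaleido_available` is hypothetical, and the install command in the message is only a suggestion.

```python
import importlib.util

def kaleido_available():
    """Return True if the kaleido package can be found in this environment."""
    # find_spec returns None when a top-level package is not installed
    return importlib.util.find_spec("kaleido") is not None

if not kaleido_available():
    print("kaleido not found; install it (for example with `pip install kaleido`) "
          "before saving figures.")
```

If the check prints the warning, the plotting cells will still display figures, but the calls to `fig.write_image` are likely to fail until kaleido is installed.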

### Maximum Average and Minimum Average

The following sections of code will find the max and min average pageviews per month for mobile and desktop articles. We will begin by loading in the mobile and desktop files from the clean_data folder as dictionaries.

In [3]:
#Loading in desktop and mobile files and turning them into dicts
#(the with blocks close each file once it has been read)
with open("../clean_data/academy_monthly_mobile_201507-202309.json") as mobile_file:
    mobile_dict = json.load(mobile_file)
with open("../clean_data/academy_monthly_desktop_201507-202309.json") as desktop_file:
    desktop_dict = json.load(desktop_file)

Because we must repeat the process for both desktop and mobile articles, we create a reusable function that builds a list of article titles and a parallel list of average monthly pageviews. The indices of the max and min averages in the second list then identify the articles with the highest and lowest average monthly pageviews.

In [4]:
#Creating function to return article name w/ max average monthly pageviews
#and article name with min monthly pageviews

def min_max_articles(dict_file):
    #Finding the average pageviews per month & adding to list
    article_list = list(dict_file.keys())
    avg_monthly_views = []
    for key in article_list:
        num_months = 0
        tot_views = 0
        for month in dict_file[key]:
            num_months = num_months + 1
            tot_views = tot_views + month['views']
        avg_monthly_views.append(tot_views/num_months)
    #Getting the index of the max and min pageviews
    min_views = min(avg_monthly_views)
    max_views = max(avg_monthly_views)
    min_index = avg_monthly_views.index(min_views)
    max_index = avg_monthly_views.index(max_views)
    #Getting the name of the max and min articles
    min_article = article_list[min_index]
    max_article = article_list[max_index]
    return min_article, max_article

#Use the following line to test if the method is working
#min_name, max_name= min_max_articles(mobile_dict)
#print(min_name, max_name)
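
To illustrate the idea on something small, the sketch below computes average monthly views for a toy dictionary and picks the min and max articles. The toy data and the helper `average_views` are hypothetical; the dictionary shape (article title mapped to a list of records with `timestamp` and `views` fields) is assumed from how the function above reads the data. It uses `min`/`max` with a `key` argument instead of list indices, which is an alternative to the notebook's approach rather than the same code.

```python
# Hypothetical toy data in the shape the notebook's functions read:
# {article title: [{"timestamp": ..., "views": ...}, ...]}
toy_dict = {
    "Article A": [{"timestamp": "2015070100", "views": 10},
                  {"timestamp": "2015080100", "views": 30}],
    "Article B": [{"timestamp": "2015070100", "views": 100}],
    "Article C": [{"timestamp": "2015070100", "views": 5},
                  {"timestamp": "2015080100", "views": 5}],
}

def average_views(records):
    """Mean of the 'views' field across a list of monthly records."""
    return sum(r["views"] for r in records) / len(records)

# Map each title to its average, then pick the titles with the extreme averages
averages = {title: average_views(records) for title, records in toy_dict.items()}
min_article = min(averages, key=averages.get)
max_article = max(averages, key=averages.get)
print(min_article, max_article)  # prints: Article C Article B
```

Note that an article with only one month of data (like "Article B" here) is averaged over that single month, so articles with short histories can rank very high or very low.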

Next, we will create a function to return a dataframe with the timeseries data for the articles with the min and max monthly views from a single datasource.

In [5]:
#Creating function to return timeseries information with monthly pageviews
#for min and max avg. pageviews articles

def min_max_timeseries(dict_file, min_name, max_name, access_format):
    timeseries_list = []
    for month in dict_file[min_name]:        
        raw_time = month['timestamp']
        format_time = datetime.strptime(raw_time, '%Y%m%d%H').strftime('%Y-%m')
        timeseries_list.append([min_name,"min "+access_format,
                             format_time, month['views']])
    for month in dict_file[max_name]:
        raw_time = month['timestamp']
        format_time = datetime.strptime(raw_time, '%Y%m%d%H').strftime('%Y-%m')
        timeseries_list.append([max_name,"max "+access_format,
                             format_time, month['views']])
    timeseries_df = pd.DataFrame(timeseries_list)
    timeseries_df.columns = ['article_title', 'kind', 'date', 'pageviews']
    return timeseries_df

#Use the following lines to test if the function is working
#mobile_df = min_max_timeseries(mobile_dict, min_name, max_name, "mobile")
#mobile_df
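
The timestamp conversion inside the function can be checked on its own. This small sketch uses a hypothetical timestamp string in the `YYYYMMDDHH` form that the `'%Y%m%d%H'` format string expects.

```python
from datetime import datetime

raw_time = "2015070100"  # hypothetical example: 1 July 2015, hour 00
# Parse the compact timestamp, then keep only the year and month
format_time = datetime.strptime(raw_time, "%Y%m%d%H").strftime("%Y-%m")
print(format_time)  # prints: 2015-07
```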

Now that we've created the two functions above, we will use them to pull the min and max article names from the desktop and mobile data sources respectively. We will then use those names to generate two dataframes with timeseries data, and combine them into a single dataframe for later graphing.

In [6]:
#Pulling desktop and mobile info
min_desktop_name, max_desktop_name = min_max_articles(desktop_dict)
#print(min_desktop_name, max_desktop_name)
min_mobile_name, max_mobile_name = min_max_articles(mobile_dict)
#print(min_mobile_name, max_mobile_name)

#Generating dataframes with min and max desktop and mobile data respectively
desktop_df = min_max_timeseries(desktop_dict, min_desktop_name, max_desktop_name, "desktop")
mobile_df = min_max_timeseries(mobile_dict, min_mobile_name, max_mobile_name, "mobile")

#Combining the dataframes together
df_list = [desktop_df, mobile_df]
min_max_avg_df = pd.concat(df_list)
min_max_avg_df

| | article_title | kind | date | pageviews |
|---|---|---|---|---|
| 0 | Project Hope (film) | min desktop | 2015-07 | 56 |
| 1 | Project Hope (film) | min desktop | 2015-08 | 31 |
| 2 | Project Hope (film) | min desktop | 2015-09 | 60 |
| 3 | Project Hope (film) | min desktop | 2015-10 | 56 |
| 4 | Project Hope (film) | min desktop | 2015-11 | 40 |
| ... | ... | ... | ... | ... |
| 109 | The Whale (2022 film) | max mobile | 2023-05 | 194589 |
| 110 | The Whale (2022 film) | max mobile | 2023-06 | 210556 |
| 111 | The Whale (2022 film) | max mobile | 2023-07 | 167920 |
| 112 | The Whale (2022 film) | max mobile | 2023-08 | 192969 |
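
One detail worth knowing about `pd.concat`: by default it keeps each input frame's own row labels, so labels can repeat in the combined frame. That does not affect the plot, but it matters if rows are later selected by label. The sketch below shows the difference on hypothetical toy frames; passing `ignore_index=True` is an optional variation, not something the cell above does.

```python
import pandas as pd

# Hypothetical toy frames standing in for desktop_df and mobile_df
desktop_toy = pd.DataFrame({"kind": ["min desktop", "max desktop"],
                            "pageviews": [56, 900]})
mobile_toy = pd.DataFrame({"kind": ["min mobile", "max mobile"],
                           "pageviews": [31, 1200]})

# Default behaviour: each frame's labels are kept, so 0 and 1 appear twice
kept = pd.concat([desktop_toy, mobile_toy])
# ignore_index=True renumbers the combined rows from 0
renumbered = pd.concat([desktop_toy, mobile_toy], ignore_index=True)

print(list(kept.index))        # prints: [0, 1, 0, 1]
print(list(renumbered.index))  # prints: [0, 1, 2, 3]
```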


To make it clearer in our legend which article title represents which max/min desktop/mobile article, we are creating a "legend" column which combines the article name with its kind (e.g., max mobile).

In [None]:
#Combining cols into new col
min_max_avg_df['legend'] = min_max_avg_df['article_title']+" ("+min_max_avg_df['kind']+")"
#min_max_avg_df

Now we will graph the timeseries data for the min and max desktop and mobile article views. We will use the Plotly Express library, drawing one line per article that shows its pageviews over time.

In [None]:
#Creating Plotly Express plot of multi-line timeseries
fig = px.line(min_max_avg_df, x = 'date', y = 'pageviews', color = 'legend',
             labels = {
                 'date': 'Date',
                 'pageviews': "Total Monthly Pageviews",
                 'legend': "Legend"
             },
             title = "Articles with Min and Max Avg. Monthly Pageviews on Desktop and Mobile")
fig.update_layout(font_size = 8,
                    title=dict(font=dict(size=12)))
fig.show()
fig.write_image('../results/min_avg_max_avg.png')
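
Saving the image assumes the `../results` folder already exists. As a minimal sketch under that assumption, a small helper (the name `ensure_parent_dir` is hypothetical) can create the folder first if it is missing.

```python
from pathlib import Path

def ensure_parent_dir(file_path):
    """Create the parent folder of file_path if it is missing; return the path."""
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path

# Hypothetical usage before saving the figure:
# fig.write_image(str(ensure_parent_dir('../results/min_avg_max_avg.png')))
```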