# INFO 3401 – Module Assignment 3

[Brian C. Keegan, Ph.D.](http://brianckeegan.com/)  
[Assistant Professor, Department of Information Science](https://www.colorado.edu/cmci/people/information-science/brian-c-keegan)  
University of Colorado Boulder  

Copyright and distributed under an [MIT License](https://opensource.org/licenses/MIT).

## Learning Objectives
This is one of two required sub-assignments for Module Assignment 3. In this assignment we want you to analyze time series data about a Wikipedia article. 


In [2]:
# Our usual libraries for working with data
import pandas as pd
import numpy as np

# Our usual libraries for visualizing data
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sb

# Some new libraries for retrieving data from the web, working with time, etc.
import requests, json
from datetime import datetime
from urllib.parse import unquote, quote
from copy import deepcopy

## Two functions to start

Below are two functions that I use in my research to retrieve data about Wikipedia articles. While Module 6 on "Structured Data" at the end of the term will go into more detail, you already have all the knowledge from INFO 1201 and 2201 to interpret what these functions are doing. That said, you do not need to edit these functions: you can treat them as black boxes and skip down to "Step 0" if you're genuinely uncurious.

### Get the revision history for an article

Wikipedia keeps track of every single change made to every single article going back to approximately 2002. This is called a revision history: [this is the revision history for the "University of Colorado Boulder" article](https://en.wikipedia.org/w/index.php?title=University_of_Colorado_Boulder&action=history). You could retrieve the content of previous versions of an article from the API if you wanted (again, check back in for Module 6), but this function simply retrieves the metadata about every revision: the ID of the revision, the ID of the user, the comment the user left (if any), the timestamp of the change, the username, the size of the page, and a SHA1 hash, which is helpful for checking whether two versions of an article are identical.

I would **very** strongly caution you against trying to retrieve the complete revision histories for popular topics like "Donald Trump", "Hillary Clinton", etc. This function will work, but because these can have tens of thousands of revisions, it could take several minutes to retrieve them all.

In [3]:
def response_to_revisions(json_response):
    if type(json_response['query']['pages']) == dict:
        page_id = list(json_response['query']['pages'].keys())[0]
        return json_response['query']['pages'][page_id]['revisions']
    elif type(json_response['query']['pages']) == list:
        if 'revisions' in json_response['query']['pages'][0]:
            return json_response['query']['pages'][0]['revisions']
        else:
            return list()
    else:
        raise ValueError("There are no revisions in the JSON")

def get_all_page_revisions(page_title, endpoint='en.wikipedia.org/w/api.php', redirects=1):
    """Takes Wikipedia page title and returns a DataFrame of revisions
    
    page_title - a string with the title of the page on Wikipedia
    endpoint - a string that points to the web address of the API.
        This defaults to the English Wikipedia endpoint: 'en.wikipedia.org/w/api.php'
        Changing the two letter language code will return a different language edition
        The Wikia endpoints are slightly different, e.g. 'starwars.wikia.com/api.php'
    redirects - a Boolean value for whether to follow redirects to another page
        
    Returns:
    df - a pandas DataFrame where each row is a revision and columns correspond
         to meta-data such as parentid, revid, sha1, size, timestamp, and user name
    """
    
    # A container to store all the revisions
    revision_list = list()
    
    # Set up the query
    query_url = "https://{0}".format(endpoint)
    query_params = {}
    query_params['action'] = 'query'
    query_params['titles'] = page_title
    query_params['prop'] = 'revisions'
    query_params['rvprop'] = 'ids|userid|comment|timestamp|user|size|sha1'
    query_params['rvlimit'] = 500
    query_params['rvdir'] = 'newer'
    query_params['format'] = 'json'
    query_params['redirects'] = redirects
    query_params['formatversion'] = 2
    
    # Make the query
    json_response = requests.get(url = query_url, params = query_params).json()

    # Add the temporary list to the parent list
    revision_list += response_to_revisions(json_response)

    # Loop for the rest of the revisions
    while True:

        # Newer versions of the API return paginated results this way
        if 'continue' in json_response:
            query_continue_params = deepcopy(query_params)
            query_continue_params['rvcontinue'] = json_response['continue']['rvcontinue']
            json_response = requests.get(url = query_url, params = query_continue_params).json()
            revision_list += response_to_revisions(json_response)
        
        # Older versions of the API return paginated results this way
        elif 'query-continue' in json_response:
            query_continue_params = deepcopy(query_params)
            query_continue_params['rvstartid'] = json_response['query-continue']['revisions']['rvstartid']
            json_response = requests.get(url = query_url, params = query_continue_params).json()
            revision_list += response_to_revisions(json_response)
        
        # If there are no more revisions, stop
        else:
            break

    # Convert to a DataFrame
    df = pd.DataFrame(revision_list)

    # Add in some helpful fields to the DataFrame
    df['page'] = json_response['query']['pages'][0]['title']
    
    return df
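
The pagination loop in `get_all_page_revisions` can be illustrated without touching the network. This is a minimal sketch, assuming a hypothetical `fake_api` function that stands in for `requests.get(...).json()` and serves made-up revisions two at a time; `fake_api`, `FAKE_REVISIONS`, and `collect_all` are illustrative names, not part of the MediaWiki API or the function above.

```python
# Made-up revisions; a real response would carry the fields listed in 'rvprop'
FAKE_REVISIONS = [{'revid': i, 'size': 100 + i} for i in range(5)]

def fake_api(params, page_size=2):
    """Hypothetical stand-in for the API: returns one page of revisions."""
    start = int(params.get('rvcontinue', 0))
    batch = FAKE_REVISIONS[start:start + page_size]
    response = {'query': {'pages': [{'title': 'Example', 'revisions': batch}]}}
    # Only include a 'continue' key when more revisions remain
    if start + page_size < len(FAKE_REVISIONS):
        response['continue'] = {'rvcontinue': str(start + page_size)}
    return response

def collect_all(params):
    """Keep requesting pages until the response has no 'continue' key."""
    response = fake_api(params)
    revisions = list(response['query']['pages'][0]['revisions'])
    while 'continue' in response:
        next_params = dict(params, rvcontinue=response['continue']['rvcontinue'])
        response = fake_api(next_params)
        revisions += response['query']['pages'][0]['revisions']
    return revisions

all_revisions = collect_all({'titles': 'Example'})
print(len(all_revisions))
```

The key idea is that each response tells you where to resume, and the loop ends when that hint is absent.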

### Get the pageviews for an article

Wikipedia also keeps track of all the times an article was accessed, ***but only back to July 2015***. [Here are the recent page views for the "University of Colorado Boulder" article](https://pageviews.toolforge.org/?project=en.wikipedia.org&platform=all-access&agent=user&redirects=0&range=latest-20&pages=University_of_Colorado_Boulder). Getting all five years of data takes longer than getting a few months, so please keep that in mind, but there's no penalty for getting the pageview data for a popular topic compared to an unpopular one.

In [7]:
def get_pageviews(page_title,endpoint='en.wikipedia.org',date_from='20150701',date_to='today'):
    """Takes a Wikipedia page title and returns its daily pageview counts
    
    page_title - a string with the title of the page on Wikipedia
    endpoint - a string with the domain of the project, defaults to 'en.wikipedia.org'
        Changing the two letter language code will return a different language edition
    date_from - a date string in a YYYYMMDD format, defaults to 20150701 (earliest date)
    date_to - a date string in a YYYYMMDD format, defaults to today
        
    Returns:
    s - a Series indexed by date with page views as values
    """
    if date_to == 'today':
        date_to = str(datetime.today().date()).replace('-','')
        
    quoted_page_title = quote(page_title, safe='')
    date_from = datetime.strftime(pd.to_datetime(date_from),'%Y%m%d')
    date_to = datetime.strftime(pd.to_datetime(date_to),'%Y%m%d')
    
    url = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/{1}/{2}/{3}/{0}/daily/{4}/{5}".format(quoted_page_title,endpoint,'all-access','user',date_from,date_to)
    json_response = requests.get(url).json()
    
    if 'items' in json_response:
        df = pd.DataFrame(json_response['items'])
    else:
        raise KeyError('There is no "items" key in the JSON response.')
        
    df = df[['timestamp','views']]
    df['timestamp'] = pd.to_datetime(df['timestamp'],format='%Y%m%d%H')
    s = df.set_index('timestamp')['views']
        
    return s
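
To see what request `get_pageviews` actually sends, it can help to build the URL on its own. The sketch below assumes a hypothetical helper name, `build_pageviews_url`, and fixed example dates; it mirrors the URL template used above but makes no network request.

```python
from urllib.parse import quote

def build_pageviews_url(page_title, project='en.wikipedia.org',
                        date_from='20150701', date_to='20150731'):
    """Hypothetical helper: assemble the per-article pageviews URL."""
    # safe='' percent-encodes every reserved character, including '/'
    quoted_title = quote(page_title, safe='')
    template = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
                "{0}/all-access/user/{1}/daily/{2}/{3}")
    return template.format(project, quoted_title, date_from, date_to)

url = build_pageviews_url('University of Colorado Boulder')
print(url)
```

Pasting the printed URL into a browser is a quick way to inspect the raw JSON before it becomes a Series.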

## Step 0: Write down a research question and motivation

Why are you choosing this article to analyze? What kinds of information production (revisions) and consumption (pageviews) behavior are you expecting to observe? For whom does this matter and why? Because of the limitations of the data, try to come up with questions/motivations/hypotheses for behavior in or after July 2015.

## Step 1: Retrieve the revision history for a single article

### Step 1a: Use the `get_all_page_revisions` function on your article

In [4]:
# cub_rev_df = get_all_page_revisions('University of Colorado Boulder')

### Step 1b: Inspect the top and print out the number of revisions

### Step 1c: When was the first revision made? The most recent?

### Step 1d: What "user" made the most revisions to this article?
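
One possible approach to Steps 1c and 1d, sketched on a toy DataFrame rather than real data. The usernames and timestamps below are made up, and `toy_rev_df` is an illustrative name; your own DataFrame will have more columns.

```python
import pandas as pd

# Toy stand-in for a revision DataFrame; all values are invented
toy_rev_df = pd.DataFrame({
    'user': ['Alice', 'Bob', 'Alice', 'Carol', 'Alice', 'Bob'],
    'timestamp': ['2002-01-05T10:00:00Z', '2002-03-01T12:30:00Z',
                  '2003-07-14T08:15:00Z', '2010-11-02T22:45:00Z',
                  '2015-07-01T00:00:00Z', '2020-02-29T16:20:00Z'],
})

# Count revisions per user and pick the label with the largest count
revisions_per_user = toy_rev_df['user'].value_counts()
top_user = revisions_per_user.idxmax()

# Timestamps in this fixed-width UTC format sort chronologically as strings,
# so min/max work even before converting them to Timestamps
first_revision = toy_rev_df['timestamp'].min()
latest_revision = toy_rev_df['timestamp'].max()

print(top_user, first_revision, latest_revision)
```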

## Step 2: Engineer new temporal features

We want to make several new columns that capture temporal behavior.

### Step 2a: Convert the "timestamp" column from `str` to `pd.Timestamp`

### Step 2b: Create a column "date" that has only the dates of the revision

### Step 2c: Create a column "weekday" that has the day of the week of the revision

### Step 2d: Create a column "hour" that has the hour of the revision
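
Steps 2a through 2d can be sketched together with the `.dt` accessor. This is a minimal example on two made-up timestamps, assuming the strings follow the same ISO 8601 format the API returns; `toy_rev_df` is an illustrative name.

```python
import pandas as pd

toy_rev_df = pd.DataFrame({'timestamp': ['2020-03-02T14:05:00Z',
                                         '2020-03-07T23:59:00Z']})

# 2a: parse the strings; the trailing 'Z' makes the result timezone-aware (UTC)
toy_rev_df['timestamp'] = pd.to_datetime(toy_rev_df['timestamp'])

# 2b-2d: pull out pieces of each Timestamp with the .dt accessor
toy_rev_df['date'] = toy_rev_df['timestamp'].dt.date
toy_rev_df['weekday'] = toy_rev_df['timestamp'].dt.day_name()
toy_rev_df['hour'] = toy_rev_df['timestamp'].dt.hour

print(toy_rev_df)
```

Note that the hours are in UTC, not Mountain Time, which matters when you interpret time-of-day patterns later.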

### Step 2e: Create a column "diff" that is the change in the "size" since the last revision

This is not a temporal variable but it will require you to use a method we discussed in the pre-class lectures.
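
One method that fits this step is `Series.diff`; whether it is the one covered in the pre-class lectures is an assumption on my part. A minimal sketch on made-up page sizes:

```python
import pandas as pd

# Made-up page sizes (in bytes) for four consecutive revisions
toy_rev_df = pd.DataFrame({'size': [1200, 1350, 1340, 2000]})

# diff() subtracts the previous row from the current row;
# the first row has no predecessor, so it becomes NaN
toy_rev_df['diff'] = toy_rev_df['size'].diff()

print(toy_rev_df)
```

Positive values mean content was added and negative values mean content was removed.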

### Step 2f: Create a column "lag" that has the time elapsed since the previous edit

Make sure this is stored as a float or an integer of a reasonable unit (minutes, hours, days, weeks, etc.) and **not as a Timedelta**.

### Step 2g: Create a column "age" that has the time elapsed since the first edit

Make sure this is stored as a float or an integer of a reasonable unit (minutes, hours, days, weeks, etc.) and **not as a Timedelta**.
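
A sketch of one way to get "lag" and "age" as plain floats: dividing a Timedelta by a one-unit `pd.Timedelta` returns a number in that unit. The timestamps are made up and the choice of hours and days is only an example.

```python
import pandas as pd

toy_rev_df = pd.DataFrame({'timestamp': pd.to_datetime([
    '2020-01-01T00:00:00Z', '2020-01-01T06:00:00Z', '2020-01-03T06:00:00Z'])})

# Time since the previous edit, expressed in hours (first row is NaN)
toy_rev_df['lag'] = toy_rev_df['timestamp'].diff() / pd.Timedelta(hours=1)

# Time since the first edit, expressed in days
first_edit = toy_rev_df['timestamp'].min()
toy_rev_df['age'] = (toy_rev_df['timestamp'] - first_edit) / pd.Timedelta(days=1)

print(toy_rev_df.dtypes)
```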

## Step 3: Visualize the revision data

### Step 3a: Use the "hour" and "weekday" columns to make a revision heatmap

Use your data reshaping skills like `pivot_table` and read up on how to make a [heatmap in Seaborn](https://seaborn.pydata.org/generated/seaborn.heatmap.html) ([also](https://towardsdatascience.com/heatmap-basics-with-pythons-seaborn-fb92ea280a6c)) showing when revisions on your article tend to happen. Your revision heatmap should have "weekday" as columns and "hour" as an index and a count of revisions as values.
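
The reshaping half of this step might look like the sketch below, which uses made-up rows and counts the "revid" column; any column without missing values would work for counting. The plotting call is left as a comment so the example runs without Seaborn installed.

```python
import pandas as pd

toy_rev_df = pd.DataFrame({
    'weekday': ['Monday', 'Monday', 'Tuesday', 'Monday'],
    'hour': [9, 9, 9, 17],
    'revid': [101, 102, 103, 104],
})

# Rows are hours, columns are weekdays, cells are revision counts;
# fill_value=0 replaces hour/weekday combinations with no revisions
heatmap_df = toy_rev_df.pivot_table(index='hour', columns='weekday',
                                    values='revid', aggfunc='count',
                                    fill_value=0)
print(heatmap_df)

# With seaborn imported as sb, the table can then be drawn with:
# sb.heatmap(heatmap_df)
```

The weekday columns come out in alphabetical order; `heatmap_df.reindex(columns=[...])` with the days listed in calendar order is one way to fix that.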

### Step 3b: Interpret and summarize some interesting features about the revision heatmap

During what part of the day and week are most revisions to this article made? Is this surprising or expected? What kinds of mechanisms or social conventions for time could explain why there is or is not a clear pattern? If you had to tell someone to give you a report on the state of the article every week, when should they look at the article?

### Step 3c: Make a line plot of the number of revisions over time

Use pandas's groupby, reindex, and/or resample functionality to count the number of revisions by day. Make sure that the date range is continuous without gaps and dates without revision activity have an appropriate value. Plot the data.
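
As one possible route, `resample` handles both the counting and the gap-filling in a single step, because empty days get a count of 0. A minimal sketch on made-up timestamps:

```python
import pandas as pd

toy_rev_df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2020-01-01T03:00:00Z',
                                 '2020-01-01T20:00:00Z',
                                 '2020-01-04T12:00:00Z']),
    'revid': [1, 2, 3],
})

# Put the timestamps in the index, then count revisions in each calendar day
daily_revisions = toy_rev_df.set_index('timestamp')['revid'].resample('D').count()
print(daily_revisions)

# daily_revisions.plot() would then draw the line plot
```

January 2 and 3 have no revisions in the toy data but still appear in the result with a value of 0.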

### Step 3d: Interpret and summarize some interesting features about the revision activity line plot

Are there any trends of increasing or decreasing revision activity? Are there any instances of "bursts" of revision activity? If so, do these bursts correspond to meaningful events in the "real" world?

### Step 3e: Make a scatter plot of "lag" and "diff"

Make a scatterplot with the "lag" column on the x-axis and the "diff" column on the y-axis.

### Step 3f: Interpret and summarize some interesting features about the lag-diff scatter plot

## Step 4: Retrieve the pageviews for a single article

### Step 4a: Use the `get_pageviews` function on your article

In [8]:
# cub_pv_df = get_pageviews('University of Colorado Boulder')

### Step 4b: Inspect the top and print out the number of pageview observations

### Step 4c: Make sure the time variables are `Timestamp`s

### Step 4d: What date had the most pageviews? The fewest?
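
Because the pageviews come back as a Series indexed by date, `idxmax` and `idxmin` return the dates directly. A sketch with invented values, where `toy_pv` is an illustrative name:

```python
import pandas as pd

toy_pv = pd.Series(
    [120, 95, 310, 80],
    index=pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04']),
    name='views',
)

# idxmax/idxmin return the index label (here a Timestamp), not the value
busiest_day = toy_pv.idxmax()
quietest_day = toy_pv.idxmin()

print(busiest_day, toy_pv.max())
print(quietest_day, toy_pv.min())
```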

## Step 5: Visualize pageview data

### Step 5a: Make a line plot of the number of pageviews over time

### Step 5b: Make a barplot of the pageview activity by day of week
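
One way to summarize by day of week is to group on the index's `dayofweek` attribute, where Monday is 0 and Sunday is 6. The sketch uses two weeks of made-up pageviews and takes the mean; a sum or median would be equally defensible choices.

```python
import pandas as pd

# Two weeks of invented daily pageviews, starting on Monday 2020-03-02
toy_pv = pd.Series([100, 110, 105, 120, 90, 50, 60,
                    120, 130, 115, 100, 110, 70, 40],
                   index=pd.date_range('2020-03-02', periods=14, freq='D'))

# Average pageviews for each day of the week (0 = Monday, 6 = Sunday)
views_by_weekday = toy_pv.groupby(toy_pv.index.dayofweek).mean()
print(views_by_weekday)

# views_by_weekday.plot.bar() would then draw the barplot
```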

### Step 5c: Make an autocorrelation plot of the pageview data
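
Autocorrelation measures how strongly a series resembles a shifted copy of itself. Before plotting, it can help to compute a couple of lags by hand with `Series.autocorr`. The data below is synthetic, built to repeat exactly every seven days, so the weekly lag stands out more than it would in real pageviews.

```python
import pandas as pd

# Four weeks of synthetic pageviews with an identical weekly pattern
weekly_pattern = [100, 110, 105, 120, 90, 50, 60]
toy_pv = pd.Series(weekly_pattern * 4,
                   index=pd.date_range('2020-03-02', periods=28, freq='D'))

# Correlation between the series and itself shifted by 7 and by 3 days
lag7 = toy_pv.autocorr(lag=7)
lag3 = toy_pv.autocorr(lag=3)
print(lag7, lag3)

# pd.plotting.autocorrelation_plot(toy_pv) draws every lag at once
```

A peak at lag 7 (and its multiples) is the usual signature of a weekly rhythm.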

### Step 5d: Interpret and summarize some interesting features about the pageview behavior

## Step 6: Combining revision and pageview data

### Step 6a: Make a DataFrame containing only daily revisions and pageviews
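
A minimal sketch of combining two daily Series into one DataFrame with `pd.concat`, using made-up values. The names `daily_revisions` and `daily_pageviews` are assumptions standing in for whatever you produced in earlier steps.

```python
import pandas as pd

days = pd.date_range('2020-01-01', periods=4, freq='D')
daily_revisions = pd.Series([2, 0, 0, 1], index=days)
daily_pageviews = pd.Series([120, 95, 310], index=days[1:])

# Passing a dict names the columns; axis=1 aligns the Series on their dates
combined_df = pd.concat({'revisions': daily_revisions,
                         'pageviews': daily_pageviews}, axis=1)
print(combined_df)
```

Dates present in only one Series get NaN in the other column. Also note that revision timestamps parsed from the API are timezone-aware (UTC) while the pageview index is timezone-naive, so the two may not align as expected unless one is converted, for example with `tz_localize(None)`.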

### Step 6b: Plot both these time series on the same figure

Use good visualization practices to ensure that a reader can identify meaningful variability in both. See using [Scales](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html#scales), [Plotting on a secondary y-axis](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html#plotting-on-a-secondary-y-axis), or [Subplots](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html#subplots) for ideas.

### Step 6c: Interpret and summarize some interesting features about revision and pageview behavior

### Step 6d: Revisit your research question/hypotheses/motivation and discuss in light of your findings