## Tutorial 30: MediaWiki Page History

Here, we return to our study of the MediaWiki API. This will be a useful
review of how to grab and parse data, both of which are key steps in data
science work. Specifically, we will see how to grab and parse the history
of a Wikipedia page.

Start by loading some useful modules:

In [None]:
import json
import os
from os.path import join
import re
import requests

import wiki
import iplot
import matplotlib.pyplot as plt

I will start by demonstrating with the Wikipedia page for
coffee. You'll be able to look at other pages later in the tutorial.
To start, let's grab the page JSON file (either from disk or downloaded
through the functions provided in `wiki.py`). I'll also print out the
keys in the returned dictionary object.

In [None]:
page_json = wiki.get_wiki_json("Coffee")
page_json.keys()

We have not done anything with the 'revid' so far, but it will be key for our work
today. The 'revid' is a unique key that identifies a particular version of a page.
It is unique *on its own*; you do not need to combine it with the 'pageid' (I
found this confusing in the MediaWiki documentation).

### Query API

In order to get previous revision ids for a given page, we need to use the
'query' API. Last time, and throughout the semester so far, we have only been
using the 'parse' functionality of the MediaWiki API. Let's see how the query
API works. Recall that we need a base API URL that the query is sent to. This
remains unchanged from tutorial 8:

In [None]:
lang = 'en'
base_api_url = 'https://' + lang + '.wikipedia.org/w/api.php?'
base_api_url

Now, we produce a query by providing key-value parameters separated by the
ampersand (&) symbol. For the query API we need to specify the following:

- `action=query` to describe the action that we want to take
- `format=json` to let the API know we want data returned as JSON objects
- `prop=revisions` to tell the API that we want to see the page revisions
- `rvprop=ids|flags|timestamp|comment|user` to indicate what data to return
about each revision
- `pageids=######` to indicate the page we want, with the proper pageid
filled in for the `######`
- `rvstartid=######` to indicate the revision we want to start at, with the
proper revid filled in for the `######`
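
As a minimal sketch of how parameters like these combine into a query string, the standard library's `urllib.parse.urlencode` can join key-value pairs with `&`. The `pageids` and `rvstartid` values below are made-up placeholders rather than real identifiers, and this is only an alternative to the manual string concatenation used in this tutorial.

```python
from urllib.parse import urlencode

# Placeholder values for illustration only; real ids come from page_json.
params = {
    "action": "query",
    "format": "json",
    "prop": "revisions",
    "rvprop": "ids|flags|timestamp|comment|user",
    "pageids": 12345,
    "rvstartid": 67890,
}

# safe="|" keeps the pipe character readable instead of percent-encoding it.
query_string = urlencode(params, safe="|")
example_url = "https://en.wikipedia.org/w/api.php?" + query_string
print(example_url)
```

Keeping the parameters in a dictionary makes it easy to add or change one of them later without editing a long string.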


Start by extracting the pageid and revid for the Coffee page:

In [None]:
pageid = page_json['pageid']
pageid

In [None]:
revid = page_json['revid']
revid

And then construct the API query:

In [None]:
api_query = base_api_url + "action=query&" + "format=json&" + \
            "prop=revisions&" + "rvprop=ids|flags|timestamp|comment|user&" + \
            "pageids={0:d}&".format(pageid) + \
            "rvstartid={0:d}&".format(revid)
api_query

Finally, we can call the API and have the page data returned to us. Take a few
minutes to look at the output before moving on to the next section.

In [None]:
r = requests.get(api_query)
page_data = r.json()
page_data

What information is shown about each revision? What information
does the parameter 'parentid' give you about the revision?
How many revisions are returned here?

**Answer**:

### Parsing page dates

I want you to try writing some code again. Cycle through the output of `page_data` 
to produce a Python list named `rev_years` that gives the year of each revision.
Note: you might start by storing the entire timestamp and then modify the code to
store just the year. Second note: You don't need to be too clever to get the year
from the timestamp; just take the first four characters in the string.
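
To illustrate the second note, here is a small sketch on a single made-up timestamp string (the value is a placeholder, written in the ISO 8601 form that MediaWiki timestamps take). It shows the slicing approach alongside `datetime.strptime`, an optional alternative that is not otherwise used in this tutorial.

```python
from datetime import datetime

# A made-up timestamp for illustration.
timestamp = "2018-03-14T09:26:53Z"

# Slicing: the first four characters are the year, as a string.
year_str = timestamp[:4]

# Alternative: parse the full timestamp and read the year as an integer.
year_int = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").year

print(year_str, year_int)
```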

Once you have created the list `rev_years`, run the following to show how many
revisions there are in each year.

In [None]:
import collections

collections.Counter(rev_years)

The output likely won't be too interesting yet because you only have the
10 most recent revisions of the page. Let's see how to rectify that now.

### Increasing the page limit

By default, MediaWiki returns only the 10 most recent revisions. We can fetch up to
500 revisions at a time by adding the parameter `rvlimit=max` to the query. Modify the
variable `api_query` and fetch the modified data. I suggest NOT printing out
the results of `page_data` because there will be a lot of them!

Now, copy the code you had above to find the distribution of years for the
500 most recent revisions.

You should see about 112 revisions for 2018, 386 for 2017, and
just 2 for 2016. This is a bit more interesting, but still not the
whole story because even the maximum query returns just 500 revisions.

### Continuing the query

In order to get more results from the API we must place another query request.
This is similar to seeing the "next page" of results on a site such as Google
or Amazon when searching for a webpage or product. The idea here is common to
many APIs, such as those provided by Google, Twitter, and Facebook.

Notice that the variable `page_data` contains an element named `continue`:

In [None]:
page_data['continue']

Create a variable `api_query_continue` that adds the parameter
`rvcontinue=######` to the variable `api_query`, with the `######`
filled in from the `rvcontinue` value shown above.

Next, call this query and load the data into the variable `page_data`. 

And now see the distribution of page years in this new chunk of data:

You should see about 322 revisions in 2016 and another 178 in 2015.

### Getting all of the pages

We still do not have all of the revisions of the Coffee page, just the next 500 of
them. In order to get all of the revisions we have to cycle through these continue
statements until we reach the end. Let's write the code to handle this now.
Rather than just grabbing the years, we will construct a list of all the information
about each revision.

I've written the function here for you to generate the list of revisions, but
you should be able to understand what is going on in all of the code. It spits
out the progress of the API by showing the number of revisions grabbed as well
as indicating what the last timestamp grabbed was.

In [None]:
def wiki_page_revisions(page_title):
    page_json = wiki.get_wiki_json(page_title)
    pageid = page_json['pageid']
    revid = page_json['revid']
    
    api_query = base_api_url + "action=query&" + "format=json&" + \
                "prop=revisions&" + "rvprop=ids|flags|timestamp|comment|user&" + \
                "rvlimit=max&" + \
                "pageids={0:d}&".format(pageid) + \
                "rvstartid={0:d}&".format(revid)
    r = requests.get(api_query)
    page_data = r.json()

    rev_data = page_data['query']['pages'][str(pageid)]['revisions']

    while 'continue' in page_data:
        api_query_continue = api_query + \
                             "rvcontinue={0:s}&".format(page_data['continue']['rvcontinue'])
        r = requests.get(api_query_continue)
        page_data = r.json()
        rev_data += page_data['query']['pages'][str(pageid)]['revisions']
        msg = "Loaded {0:d} revisions, through {1:s}"
        print(msg.format(len(rev_data), rev_data[-1]['timestamp']))
        
    return rev_data

In [None]:
rev_data = wiki_page_revisions("Coffee")
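
The continuation pattern inside `wiki_page_revisions` can also be tried offline. The sketch below replaces the network call with a hypothetical `fake_fetch` function that serves three small hand-made batches; only the shape of the `continue` element is borrowed from the real responses. It is meant to show the control flow, not the MediaWiki API itself.

```python
# Hand-made batches standing in for API responses (hypothetical data).
FAKE_BATCHES = {
    None: {"revisions": [{"revid": 5}, {"revid": 4}],
           "continue": {"rvcontinue": "tok1"}},
    "tok1": {"revisions": [{"revid": 3}, {"revid": 2}],
             "continue": {"rvcontinue": "tok2"}},
    "tok2": {"revisions": [{"revid": 1}]},  # no 'continue' key: last batch
}

def fake_fetch(token=None):
    """Stand-in for requests.get(...).json(); returns one batch per token."""
    return FAKE_BATCHES[token]

def collect_all(fetch):
    """Keep requesting batches until the response has no 'continue' element."""
    data = fetch()
    revisions = list(data["revisions"])
    while "continue" in data:
        data = fetch(data["continue"]["rvcontinue"])
        revisions += data["revisions"]
    return revisions

all_revs = collect_all(fake_fetch)
print([rev["revid"] for rev in all_revs])
```

The loop ends as soon as a response arrives without a `continue` element, which is the same stopping condition the function above relies on.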

Just looking at the message output, how many revisions have been made to the
Coffee page and when was the page first created?

**Answer**:

Modify the code you used above to grab the list `rev_years` from
the list `rev_data` (the code will actually be a bit cleaner than
getting it from the raw query JSON).

In what year were the most revisions completed? Has 2018 had an unusually high
or low number of revisions at this point in the year?

**Answer**:

Finally, you can even use the following code to produce a line plot of the
number of revisions in each year.

In [None]:
plt.rcParams["figure.figsize"] = (12, 10)

In [None]:
cnt = collections.Counter(rev_years).items()
cnt = sorted(cnt, key=lambda x: x[0])
plt.xticks(rotation=90)
plt.plot([x[0] for x in cnt], [x[1] for x in cnt], 'k-', lw=2)

### Revision data

We now have some information about each of the page revisions, but we still have not
seen how to grab the actual page data from a given revision. To do this, we need to
return to the "parse" API with our revision ids in hand. Essentially, all we need to
do is call the parse action, specify the format as JSON, and provide the revid to the
parameter `oldid`. So, to get the very first version of the Coffee page we would first
get the revision id:

In [None]:
revid = rev_data[-1]['revid']
revid

And then place the following API query:

In [None]:
api_query = base_api_url + "action=parse&" + "format=json&" + \
            "oldid={0:d}&".format(revid)
r = requests.get(api_query)
page_data = r.json()['parse']

This page is now in the format we have been working with for the rest of the semester, but
gives an old version of the page, way back from 2004.

In [None]:
page_data.keys()

It would be fun to see what this page actually looks like rendered as HTML.
Let's write it to disk with some header information to make it look reasonable:

In [None]:
with open('temp.html', 'w', encoding='utf-8') as fout:
    # Wrap the page body in minimal HTML so a browser can render it
    fout.write("<html><body>")
    fout.write(page_data['text']['*'])
    fout.write("</body></html>")

You should see that the page was very basic back in 2004!

### A different page

Repeat the code above (you can drop all of the steps into one or two code blocks)
to look at another page that interests you.

How does the pattern of number of changes differ from the coffee page?

**Answer**:

### What next

We certainly could grab the revision history for every change that has been made to the Coffee
page. Most changes, though, are not particularly interesting on their own (at least for our level
of study here). Instead, I want to focus on large-scale change over time by grabbing one page for
each year in the collection. Starting with the values in `rev_data`, I want you to use the space
below to grab the last version of the Coffee page for each year. That is, start with the
current page, then get the last version from 2017, then 2016, and so forth until you get back to the
version at the end of 2004. Store these pages as a list named `page_history`.
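
If you get stuck on the first step, one possible approach relies on `rev_data` being ordered from newest to oldest (an assumption, though consistent with `rev_data[-1]` being the very first version of the page): the first revision encountered for each year is then the last one made in that year. The sketch below uses a tiny hand-made list with placeholder ids rather than real data, and stops short of downloading the pages themselves.

```python
# Hand-made stand-in for rev_data, newest first (placeholder values).
toy_rev_data = [
    {"revid": 50, "timestamp": "2018-02-01T10:00:00Z"},
    {"revid": 40, "timestamp": "2017-12-30T08:00:00Z"},
    {"revid": 30, "timestamp": "2017-05-05T12:00:00Z"},
    {"revid": 20, "timestamp": "2015-11-11T11:00:00Z"},
    {"revid": 10, "timestamp": "2015-01-02T03:00:00Z"},
]

last_rev_by_year = {}
for rev in toy_rev_data:
    year = rev["timestamp"][:4]
    # Keep only the first revision seen for each year.
    if year not in last_rev_by_year:
        last_rev_by_year[year] = rev["revid"]

print(last_rev_by_year)
```

Each revid collected this way can then be passed to the parse API through `oldid`, as in the previous section.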

In [None]:
# Note: run this cell just once to refresh the value in `rev_data`
# from running the code to grab a different page above
rev_data = wiki_page_revisions("Coffee")

Then, when you are done with that, cycle through the pages to extract the length of the
page in characters, the number of internal links, the number of external links, the 
number of images, and the number of sections.

Finally, plot the data across time for each of these variables. For example,
if you stored the data for the number of internal links as a list named
`num_ilinks`, you should be able to run something like this:

In [None]:
plt.xticks(rotation=90)
plt.plot(list(range(2018,2003,-1)), num_ilinks, 'k-', lw=2)

Take note of any interesting patterns that arise over time with the pages.

### Even more practice

My guess is that the above tasks will take up most of the class time. If you want
extra practice or finish early, wrap up the code above in a function that takes
just a page name and returns all of the metrics as a pandas DataFrame object. Include
the timestamp as the first column of the data frame and make sure that you handle the
case where the number of years may be different and there may even be no revisions
in a given year.
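
For the case of a year with no revisions, one possible choice (an assumption on my part, not the only reasonable one) is to carry the previous year's last revision forward, since that is what the page looked like at the end of the empty year. The sketch below shows the idea on a placeholder mapping from year to revid, using plain Python rather than pandas.

```python
# Placeholder mapping from year to the last revid made in that year.
last_rev_by_year = {2015: 20, 2017: 40, 2018: 50}

filled = {}
current = None
for year in range(min(last_rev_by_year), max(last_rev_by_year) + 1):
    # Use the year's own revision if it has one; otherwise keep the previous one.
    current = last_rev_by_year.get(year, current)
    filled[year] = current

print(filled)
```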