## Tutorial 10: Lists and Loops

Today we will learn about using lists and `for` loops in Python.
This will allow us to use the MediaWiki API functions to grab
data from several Wikipedia pages.

Let's start by importing the modules we will need and the functions
for loading JSON objects in Python. Notice that I've made a few
small changes to the code to make it function a bit better for us.

In [None]:
import json
import logging
import os
import re
import requests
import time
import urllib.parse

from os.path import join

logging.basicConfig(level=logging.INFO)


def wiki_json_path(page_title, lang='en'):
    """Returns local path to JSON file for Wikipedia page data
    
    This function is used to determine where the dump of a 
    call to the MediaWiki API, using the parse method, should
    be stored. As an extra action, the function also checks that
    the relevant directory exists and creates it if it does not.
    
    Args:
        page_title: A string containing the page title.
        lang: Two letter language code describing the Wikipedia
            language used to grab the data.
            
    Returns:
        A string describing a relative path to file.
    """
    page_title = re.sub("[ /]", "_", page_title)
    stat289_base_dir = os.path.dirname(os.getcwd())
    
    dir_name = join(stat289_base_dir, "data", lang)
    if not os.path.exists(dir_name):
        os.makedirs(dir_name)
        
    file_name = page_title + ".json"
    file_path = join(dir_name, file_name)
    
    return file_path


def get_mediawiki_request(page_title, lang):
    """Returns URL to make parse request to the MediaWiki API
        
    Args:
        page_title: A string containing the page title.
        lang: Two letter language code describing the Wikipedia
            language used to grab the data.
            
    Returns:
        A string giving the complete request URL.
    """
    page_title = re.sub(" ", "_", page_title)
    page_title = urllib.parse.urlencode({'page': page_title})
    page_title = re.sub('%2F', '/', page_title) # don't encode '/'
    
    base_api_url = 'https://' + lang + '.wikipedia.org/w/api.php'
    default_query = 'action=parse&format=json&'

    url = base_api_url + "?" + default_query + page_title
    return url


def get_wiki_json(page_title, lang='en'):
    """Returns JSON data as a dictionary for the Wikipedia page
    
    This function either loads a cached version of the page or,
    if a local version of the page is not available, calls the
    MediaWiki API directly.
    
    Args:
        page_title: A string containing the page title.
        lang: Two letter language code describing the Wikipedia
            language used to grab the data.
            
    Returns:
        A dictionary object with the complete parsed JSON data.
    """
    file_path = wiki_json_path(page_title, lang)
    
    # if page does not exist, grab it from Wikipedia
    if not os.path.exists(file_path):
        print("Pulling data from MediaWiki API: '" + page_title + "'")
        url = get_mediawiki_request(page_title, lang)
        r = requests.get(url)
        if r.status_code != requests.codes.ok:
            raise IOError('Website cannot be reached')
        page_data = r.json()
        if 'parse' not in page_data:
            raise IOError('Wikipedia page not found')
        with open(file_path, 'w') as outfile:
            json.dump(page_data['parse'], outfile)
        time.sleep(0.5) # sleep for half second to avoid API limits
            
    # read the JSON data from local filesystem
    with open(file_path, 'r') as infile:
        new_data = json.load(infile)
    
    return new_data


def links_as_list(data):
    """Extracts MediaWiki JSON links as a list object.
    
    This helper function extracts valid MediaWiki links from
    Wikipedia and returns the data as a list.
    
    Args:
        data: Data returned from the function get_wiki_json.
            
    Returns:
        A list containing the internal links on the page.
    """
    output = []
    for link in data['links']:
        if link['ns'] == 0 and 'exists' in link:
            output.append(link['*'])
            
    return output
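
To see what the filter in `links_as_list` keeps, here is a minimal sketch that applies the same test to a small hand-written dictionary. The entries in `sample_data` are made up for illustration; they only mimic the shape of the `links` element (the `ns`, `exists`, and `*` keys) that the function above reads.

```python
# Hypothetical data shaped like the 'links' element of a parse result.
sample_data = {
    'links': [
        {'ns': 0, 'exists': '', '*': 'Virginia'},            # kept
        {'ns': 0, '*': 'A page that was never written'},     # no 'exists' key
        {'ns': 14, 'exists': '', '*': 'Category:Colleges'},  # ns is not 0
    ]
}

kept = []
for link in sample_data['links']:
    # same condition used in links_as_list
    if link['ns'] == 0 and 'exists' in link:
        kept.append(link['*'])

print(kept)   # ['Virginia']
```

Only the first entry passes both checks, so the other two are dropped.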

### Links data

We saw last time that internal links, links to other pages on
Wikipedia, are returned as a particular element of the JSON data
returned by the MediaWiki API. I wrote a small helper function,
`links_as_list` (defined above), to extract this as a list in Python.
Let's use this to get all of the links on the University of Richmond
page.

In [None]:
data = get_wiki_json("University of Richmond")
links = links_as_list(data)
links

Now, a reasonable next step would be to grab the data associated with
each of these pages. To download the data for the first link we would
just do this:

In [None]:
data = get_wiki_json(links[0])
data

How do we do this automatically for all of the links? We want to make
use of a `for` loop. A `for` loop cycles through all of the elements of
a list and applies a set of instructions to each element.

Here's an example where we take each element in the list of links and
print out just the first three letters:

In [None]:
for link in links:
    print(link[:3])
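
A loop can also build a new list as it goes: start with an empty list and `append` one value on each pass. The sketch below uses a short made-up list named `titles` (not data from the API) so that it runs without any downloads.

```python
# Made-up page titles standing in for the real list of links.
titles = ["Richmond, Virginia", "Virginia", "Liberal arts college"]

first_three = []                   # start with an empty list
for title in titles:
    first_three.append(title[:3])  # add one item per pass through the loop

print(first_three)   # ['Ric', 'Vir', 'Lib']
```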

If we want to grab the webpage data for each link from the UR page,
we can now just do this (this will take a while the first time you
run it, but will be quick the second time):

In [None]:
for link in links:
    get_wiki_json(link)
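
Because `get_wiki_json` raises an `IOError` when a page cannot be found, a single bad link stops the whole loop. One possible way to keep going is to wrap the call in `try`/`except`. The sketch below is an assumption about how you might do this, not part of the tutorial code; `fake_get_wiki_json` and the titles are hypothetical stand-ins so that the block runs without network access.

```python
def fake_get_wiki_json(page_title):
    """Hypothetical stand-in for get_wiki_json that fails for one title."""
    if page_title == "Missing page":
        raise IOError('Wikipedia page not found')
    return {'title': page_title, 'links': []}


titles = ["Virginia", "Missing page", "Richmond"]
loaded = []   # data for pages that were read successfully
failed = []   # titles that raised an error

for title in titles:
    try:
        loaded.append(fake_get_wiki_json(title))
    except IOError:
        failed.append(title)   # remember the title and move on

print(len(loaded), failed)
```

Keeping the failed titles in their own list makes it easy to inspect them afterwards.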

### Using the MediaWiki data

Now, finally, we have the code and functionality to look at a
collection of Wikipedia pages. Let's start with a simple task:
counting how many links each of the pages linked from the University
of Richmond page has. Pay attention to how I do this!

In [None]:
num_links = []
ur_links = links_as_list(get_wiki_json("University of Richmond"))

for link in ur_links:
    data = get_wiki_json(link)
    new_links = links_as_list(data)
    num_links.append(len(new_links))

Now, let's look at the results:

In [None]:
print(num_links)

What can we do with this? For starters, what's the average
number of links on each page?

In [None]:
sum(num_links) / len(num_links)
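
The same kind of summary can be tried on a small list first. This sketch uses made-up counts in `toy_counts` (not real results) and also shows `statistics.mean` from the standard library as an alternative to dividing by hand.

```python
import statistics

# Made-up link counts standing in for num_links.
toy_counts = [120, 45, 300, 15]

by_hand = sum(toy_counts) / len(toy_counts)   # total divided by item count
with_module = statistics.mean(toy_counts)     # same value from the library

print(by_hand, with_module)
print(min(toy_counts), max(toy_counts))       # smallest and largest counts
```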

How does this compare to the number of links on the University of Richmond page?

In [None]:
len(ur_links)

**Answer**:

## Practice

Take a look at the Wikipedia page on Rock and Roll Hall of
Fame inductees:

> https://en.wikipedia.org/wiki/List_of_Rock_and_Roll_Hall_of_Fame_inductees

While some of the links on the page point to other references,
most refer to the winners or inductors of the Hall of Fame.

Below, write code that:

1. Downloads all of the links from the Rock and Roll Hall of Fame
Wikipedia page.
2. Extracts from each page all of the links from **that** page
and puts them together in one combined list called `all_links`.
3. Uses the `collections.Counter` object to find the 20 links that
are used most across all of the pages.
4. Think about the most frequent 20 pages and try to reason why
these are the most common.
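
As a starting point for steps 2 and 3, here is a minimal sketch of combining several lists into one and tallying the result with `collections.Counter`. The nested list `toy_pages` is made up for illustration; in the exercise you would build `all_links` from the real pages instead.

```python
from collections import Counter

# Made-up lists of links, one inner list per page.
toy_pages = [
    ["Guitar", "Blues"],
    ["Guitar", "Rock music", "Blues"],
    ["Guitar"],
]

combined = []
for page_links in toy_pages:
    combined.extend(page_links)   # extend adds every item, not the list itself

counts = Counter(combined)
print(counts.most_common(2))   # [('Guitar', 3), ('Blues', 2)]
```

Note the use of `extend` rather than `append` here: `append` would add each inner list as a single element, which `Counter` cannot tally.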