## Tutorial 10: Lists and Loops

Today we will learn about using lists and `for` loops in Python.
This will allow us to use the MediaWiki API functions to grab
data from many Wikipedia pages at once.

Let's start by importing the modules we will need and defining the
functions for loading JSON objects in Python. Notice that I've made a few
small changes to the code to make it function a bit better for us.

In [98]:
import json
import logging
import os
import re
import requests
import time
import urllib.parse

from os.path import join

logging.basicConfig(level=logging.INFO)


def wiki_json_path(page_title, lang='en'):
    """Returns local path to JSON file for Wikipedia page data
    
    This function is used to determine where the dump of a 
    call to the MediaWiki API, using the parse method, should
    be stored. As an extra action, the function also checks that
    the relevant directory exists and creates it if it does not.
    
    Args:
        page_title: A string containing the page title.
        lang: Two letter language code describing the Wikipedia
            language used to grab the data.
            
    Returns:
        A string describing a relative path to file.
    """
    page_title = re.sub("[ /]", "_", page_title)
    stat289_base_dir = os.path.dirname(os.getcwd())
    
    dir_name = join(stat289_base_dir, "data", lang)
    if not os.path.exists(dir_name):
        os.makedirs(dir_name)
        
    file_name = page_title + ".json"
    file_path = join(dir_name, file_name)
    
    return file_path


def get_mediawiki_request(page_title, lang):
    """Returns URL to make parse request to the MediaWiki API
        
    Args:
        page_title: A string containing the page title.
        lang: Two letter language code describing the Wikipedia
            language used to grab the data.
            
    Returns:
        A string giving the complete request URL.
    """
    page_title = re.sub(" ", "_", page_title)
    page_title = urllib.parse.urlencode({'page': page_title})
    page_title = re.sub('%2F', '/', page_title) # don't encode '/'
    
    base_api_url = 'https://' + lang + '.wikipedia.org/w/api.php'
    default_query = 'action=parse&format=json&'

    url = base_api_url + "?" + default_query + page_title
    return url


def get_wiki_json(page_title, lang='en'):
    """Returns JSON data as a dictionary for the Wikipedia page
    
    This function either loads a cached version of the page or,
    if a local version of the page is not available, calls the
    MediaWiki API directly.
    
    Args:
        page_title: A string containing the page title.
        lang: Two letter language code describing the Wikipedia
            language used to grab the data.
            
    Returns:
        A dictionary object with the complete parsed JSON data.
    """
    file_path = wiki_json_path(page_title, lang)
    
    # if page does not exist, grab it from Wikipedia
    if not os.path.exists(file_path):
        print("Pulling data from MediaWiki API: '" + page_title + "'")
        url = get_mediawiki_request(page_title, lang)
        r = requests.get(url)
        if r.status_code != requests.codes.ok:
            raise IOError('Website cannot be reached')
        page_data = r.json()
        if 'parse' not in page_data:
            raise IOError('Wikipedia page not found')
        with open(file_path, 'w') as outfile:
            json.dump(page_data['parse'], outfile)
        time.sleep(0.5) # sleep for half second to avoid API limits
            
    # read the JSON data from local filesystem
    with open(file_path, 'r') as infile:
        new_data = json.load(infile)
    
    return new_data


def links_as_list(data):
    """Extracts MediaWiki JSON links as a list object.
    
    This helper function extracts valid MediaWiki links from
    Wikipedia and returns the data as a list.
    
    Args:
        data: Data returned from the function get_wiki_json.
            
    Returns:
        A list containing the internal links on the page.
    """
    output = []
    for link in data['links']:
        if link['ns'] == 0 and 'exists' in link:
            output.append(link['*'])
            
    return output
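
To see what the request helper produces without touching the network, here is a minimal sketch that repeats the same steps as `get_mediawiki_request`. The name `build_parse_url` is made up for this illustration and is not part of the tutorial code.

```python
import re
import urllib.parse


def build_parse_url(page_title, lang="en"):
    """Hypothetical stand-in that mirrors get_mediawiki_request."""
    # Wikipedia titles use underscores in place of spaces
    title = re.sub(" ", "_", page_title)
    # urlencode turns {'page': ...} into 'page=...' and escapes odd characters
    query = urllib.parse.urlencode({"page": title})
    # keep '/' readable, as the original helper does
    query = re.sub("%2F", "/", query)
    base = "https://" + lang + ".wikipedia.org/w/api.php"
    return base + "?action=parse&format=json&" + query


url = build_parse_url("University of Richmond")
print(url)
```

Pasting the printed URL into a browser is a quick way to inspect the raw JSON that the API returns.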

### Links data

We saw last time that internal links, links to other pages on
Wikipedia, are returned as a particular element of the JSON data
returned by the MediaWiki API. I wrote a small helper function
`links_as_list` (defined above) to extract this as a list in Python.
Let's use this to get all of the links of the University of Richmond
page.

In [63]:
data = get_wiki_json("University of Richmond")
links = links_as_list(data)
links

['2008 Montana Grizzlies football team',
 '2008 Richmond Spiders football team',
 "2010-11 Richmond Spiders men's basketball team",
 "2010–11 Kansas Jayhawks men's basketball team",
 "2011 Atlantic 10 Men's Basketball Tournament",
 "2011 NCAA Men's Division I Basketball Tournament",
 'A cappella',
 'Afroman',
 'Alcoa',
 'Alpha Kappa Alpha',
 'Alpha Phi Alpha',
 'Alpha Phi Omega',
 'Altria Group',
 'Aluminum',
 'Alumnus',
 'American Civil War',
 'American Jobs Act',
 'Appalachian College of Pharmacy',
 'Appalachian School of Law',
 'Associated Colleges of the South',
 'Athletic nickname',
 'Atlantic 10 Conference',
 'Atlantic University',
 'Auburn University',
 'Averett University',
 'Baptist Theological Seminary at Richmond',
 'Baptists',
 'Barracks',
 'Baylor University',
 'Bill Clinton',
 'Birmingham–Southern College',
 'Blackstone College for Girls',
 'Bluefield College',
 'Bon Secours Memorial College of Nursing',
 'Bonner Scholars',
 'Bridgewater College',
 'BusinessWeek',
 'Capit
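
It may help to see which entries `links_as_list` keeps and which it drops. The sketch below applies the same filter to a small hand-made dictionary; the three entries in `sample` are invented for illustration and only imitate the shape of the API's `links` element.

```python
# Hypothetical data shaped like the 'links' element of a parse result
sample = {
    "links": [
        {"ns": 0, "exists": "", "*": "Alcoa"},                    # article that exists
        {"ns": 0, "*": "A page nobody has written yet"},          # no 'exists' key
        {"ns": 14, "exists": "", "*": "Category:Universities"},   # not an article
    ]
}

kept = []
for link in sample["links"]:
    # same test as links_as_list: article namespace (0) and the page exists
    if link["ns"] == 0 and "exists" in link:
        kept.append(link["*"])

print(kept)
```

Only the first entry passes both checks, so links to missing pages and to non-article pages such as categories never reach our list.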

Now, a reasonable next step would be to grab the data associated with
each of these pages. To download the data for the first link we would
just do this:

In [34]:
data = get_wiki_json(links[0])
data

{'title': '2008 Montana Grizzlies football team',
 'pageid': 23897286,
 'revid': 841752145,
 'text': {'*': '<div class="mw-parser-output"><div class="mw-parser-output"><table class="infobox vevent" style="width:22em;width: 25em"><tbody><tr><th colspan="2" class="summary" style="text-align:center;font-size:125%;font-weight:bold;font-size: 125%"><span class="dtstart">2008</span> <span class="vcard attendee fn org"><a href="/wiki/Montana_Grizzlies_football" title="Montana Grizzlies football">Montana Grizzlies football</a></span></th></tr><tr><td colspan="2" style="text-align:center">\n<a href="/wiki/File:Montana_Griz_logo.svg" class="image"><img alt="Montana Griz logo.svg" src="//upload.wikimedia.org/wikipedia/commons/thumb/3/3b/Montana_Griz_logo.svg/150px-Montana_Griz_logo.svg.png" width="150" height="109" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/3/3b/Montana_Griz_logo.svg/225px-Montana_Griz_logo.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/3/3b/Montana_Griz

How do we do this automatically for all of the links? We want to make
use of a `for` loop. A `for` loop cycles through all of the elements of
a list and applies a set of instructions to each element.

Here's an example where we take each element in the list of links and
print out just the first three letters:

In [35]:
for link in links:
    print(link[:3])

200
200
201
201
201
201
A c
Afr
Alc
Alp
Alp
Alp
Alt
Alu
Alu
Ame
Ame
App
App
Ass
Ath
Atl
Atl
Aub
Ave
Bap
Bap
Bar
Bay
Bil
Bir
Bla
Blu
Bon
Bon
Bri
Bus
Cap
Cap
Cat
Cen
Cen
Cha
Cha
Chr
Chr
Cle
Col
Col
Col
Col
Col
Com
Con
Cor
Cor
Cou
Cry
Dav
Dav
Daw
Day
DeV
Del
Del
Del
Duk
Duq
E. 
E. 
E. 
E. 
ECP
Eas
Eas
Eas
Edw
Eig
Eli
Emo
Fer
Fin
Flo
Flo
Flo
Flo
Flo
For
Fre
Fre
Fre
Fur
Gee
Geo
Geo
Geo
Geo
Geo
Geo
Geo
Geo
Geo
Geo
Gib
Goo
Gor
Got
Gov
H. 
Hab
Ham
Ham
Har
Hen
Hen
Hen
Hol
Hon
Ida
Int
Jam
Jef
Jef
Jep
Joh
Joi
Kap
Kap
Kap
Kap
La 
Lam
Lat
Lea
Lib
Lib
Lis
Lis
Lon
Lou
Lou
Mac
Mal
Mar
Mar
Mar
Mar
Mas
Mid
Mil
Mis
Mod
Mor
NCA
NCA
Nat
Nat
Nat
Nee
New
Non
Nor
Nor
Nor
Nor
Ohi
Old
Pat
Phi
Phi
Phi
Pi 
Pos
Pre
Pre
Pri
Pri
Pri
Rad
Ral
Ran
Ran
Reg
Rey
Rho
Rho
Ric
Ric
Ric
Ric
Ric
Ric
Ric
Ric
Ric
Roa
Rob
Rol
Ron
Ros
Ryl
Sai
Sai
Sai
Sch
Sew
Sha
She
Sig
Sig
Sma
Sou
Sou
Sou
Spe
Spi
St.
Str
Stu
Stu
Sub
Sul
Swe
Tex
The
The
The
The
The
The
Tri
Tru
Tul
U.S
UMa
Und
Uni
Uni
Uni
Uni
Uni
Uni
Uni
Uni
Uni
Uni
Uni
Uni
Uni
Uni


If we want to grab the webpage data for each link from the UR page,
we can now just do this (this will take a while the first time you
run it, but will be quick the second time):

In [64]:
for link in links:
    get_wiki_json(link)
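
The second run is fast because `get_wiki_json` only calls the API when no local file exists. Here is a minimal sketch of that caching pattern with the network call swapped for a fake function, so it runs offline. The names `fake_fetch` and `load_cached` are hypothetical and stand in for the real request and for `get_wiki_json`.

```python
import json
import os
import tempfile

calls = []  # records every time the "slow" fetch actually runs


def fake_fetch(title):
    """Hypothetical stand-in for the MediaWiki request."""
    calls.append(title)
    return {"title": title, "links": []}


def load_cached(title, cache_dir):
    """Fetch and save the page only if no local copy exists, then read it."""
    path = os.path.join(cache_dir, title.replace(" ", "_") + ".json")
    if not os.path.exists(path):
        with open(path, "w") as outfile:
            json.dump(fake_fetch(title), outfile)
    with open(path, "r") as infile:
        return json.load(infile)


with tempfile.TemporaryDirectory() as cache_dir:
    first = load_cached("Bob Dylan", cache_dir)
    second = load_cached("Bob Dylan", cache_dir)  # served from disk

print(len(calls))
```

The fetch runs once even though the page is requested twice, which is why looping over all of the links a second time finishes almost immediately.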

### Using the MediaWiki data

Now, finally, we have the code and functionality to look at a
collection of Wikipedia pages. Let's start with a simple task:
counting how many links each page linked from the Richmond
page has. Pay attention to how I do this!

In [87]:
num_links = []
ur_links = links_as_list(get_wiki_json("University of Richmond"))

for link in ur_links:
    data = get_wiki_json(link)
    new_links = links_as_list(data)
    num_links.append(len(new_links))

Now, let's look at the results:

In [79]:
print(num_links)

[218, 183, 1, 208, 107, 1, 273, 52, 118, 624, 690, 198, 1, 2, 22, 1207, 267, 111, 176, 33, 54, 250, 98, 506, 179, 92, 777, 123, 625, 1618, 447, 99, 286, 94, 1, 181, 1, 81, 274, 486, 349, 475, 144, 44, 212, 322, 439, 369, 787, 611, 1, 169, 1, 930, 98, 916, 18, 32, 427, 218, 526, 308, 297, 381, 364, 401, 802, 604, 30, 48, 14, 240, 15, 347, 212, 154, 318, 1, 96, 420, 251, 134, 262, 379, 237, 612, 791, 462, 25, 8, 283, 338, 373, 286, 2026, 277, 500, 200, 827, 181, 793, 702, 293, 62, 418, 137, 1, 32, 16, 89, 547, 643, 202, 432, 429, 171, 453, 1, 228, 528, 465, 91, 176, 61, 126, 208, 282, 442, 318, 271, 169, 291, 683, 1, 1, 410, 206, 535, 358, 679, 429, 1, 121, 213, 91, 280, 276, 1122, 1, 400, 349, 160, 540, 494, 364, 2, 249, 232, 175, 1952, 75, 436, 460, 470, 171, 397, 446, 155, 152, 258, 109, 267, 248, 3, 36, 826, 268, 392, 330, 219, 506, 592, 345, 101, 102, 544, 699, 206, 1129, 250, 88, 184, 197, 16, 48, 507, 105, 419, 22, 479, 171, 227, 185, 369, 41, 359, 280, 227, 372, 263, 99, 71, 267,

What can we do with this? For starters, what's the average
number of links on each page?

In [85]:
sum(num_links) / len(num_links)

330.9732441471572
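
An average can be pulled around by a handful of very long or very short pages, so it is worth checking a few other summaries too. As a rough sketch, here are the mean, median, and extremes for just the first ten values printed above, using Python's built-in `statistics` module; the variable name `first_ten` is only for this illustration.

```python
import statistics

# First ten values from the num_links output above
first_ten = [218, 183, 1, 208, 107, 1, 273, 52, 118, 624]

mean_links = statistics.mean(first_ten)
median_links = statistics.median(first_ten)

print(mean_links)                      # 178.5
print(median_links)                    # 150.5
print(min(first_ten), max(first_ten))  # 1 624
```

The same calls work on the full `num_links` list.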

How does this compare to the number of links on the Richmond page itself?

In [88]:
len(ur_links)

299

**Answer**:

## Practice

Take a look at the Wikipedia page on Rock and Roll Hall of
Fame inductees:

> https://en.wikipedia.org/wiki/List_of_Rock_and_Roll_Hall_of_Fame_inductees

While some of the links on the page point to other references,
most refer to the inductees of the Hall of Fame or to the people
who inducted them.

Below, write code that:

1. Downloads all of the links from the Rock and Roll Hall of Fame
Wikipedia page.
2. Extracts from each page all of the links from **that** page
and puts them together in one appended list called `all_links`.
3. Uses the `collections.Counter` object to find the 20 links that
are used most across all of the pages.
4. Think about the most frequent 20 pages and try to reason why
these are the most common.
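
If you have not used `Counter` before, here is a small sketch of how it tallies repeated items in a list. The list `demo_links` is made up for illustration; your real `all_links` will be much longer.

```python
from collections import Counter

# Hypothetical stand-in for a combined list of links
demo_links = [
    "Rock music", "Rolling Stone", "Rock music",
    "IMDb", "Rock music", "Rolling Stone",
]

counts = Counter(demo_links)
top_two = counts.most_common(2)  # list of (item, count) pairs, largest first

print(top_two)  # [('Rock music', 3), ('Rolling Stone', 2)]
```

Passing a different number to `most_common` changes how many of the top items you get back.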

In [102]:
# Make sure all of the links are downloaded
rr_links = links_as_list(get_wiki_json("List_of_Rock_and_Roll_Hall_of_Fame_inductees"))

for link in rr_links:
    data = get_wiki_json(link)

In [108]:
# Now, collect all of the links as a single list
all_links = []

for link in rr_links:
    data = get_wiki_json(link)
    new_links = links_as_list(data)
    all_links = all_links + new_links

In [109]:
# Now, count most frequent links
from collections import Counter

Counter(all_links).most_common(30)

[('MusicBrainz', 923),
 ('Virtual International Authority File', 866),
 ('International Standard Name Identifier', 855),
 ('Library of Congress Control Number', 743),
 ('Rock and Roll Hall of Fame', 726),
 ('Bibliothèque nationale de France', 723),
 ('Integrated Authority File', 713),
 ('International Standard Book Number', 676),
 ('Système universitaire de documentation', 491),
 ('Rolling Stone', 404),
 ('IMDb', 383),
 ('Biblioteca Nacional de España', 380),
 ('Rock music', 370),
 ('Wayback Machine', 357),
 ('List of Rock and Roll Hall of Fame inductees', 355),
 ('AllMusic', 354),
 ('SNAC', 308),
 ('Billboard Hot 100', 285),
 ('Billboard (magazine)', 263),
 ('The Beatles', 258),
 ('Rhythm and blues', 252),
 ('BIBSYS', 245),
 ('Grammy Award', 233),
 ('Billboard 200', 226),
 ('Rock and roll', 224),
 ('Bob Dylan', 219),
 ('The New York Times', 218),
 ('Eric Clapton', 212),
 ('Pop music', 206),
 ('Columbia Records', 203)]