## Tutorial 10: Looping through Wikipedia

In this tutorial, we combine lists and loops with the MediaWiki API
functions to grab data from several Wikipedia pages in an automated way.

### Modules

We will need functions that I gave last class for loading data from
Wikipedia again today, as well as for the foreseeable future. Rather
than having to copy and paste them each time, there is an easy way to
load these functions from a common file. 

I've created the file `wiki.py`, which you should download from the course
website and put into the same directory where you store your tutorials.
You can open and edit the file in Jupyter, which I suggest you do right
now to get a sense of what the file looks like. It is basically one long
code cell. To load the functions in this file, we write `import` along 
with the name of the file (without the extension).

In [1]:
import wiki

Now, to get one of the functions in the module, we use the normal 
"module name" + "." + "function name" calling convention. So, to
get the function `wiki_json_path` we would do this:

In [2]:
wiki.wiki_json_path("University of Richmond")

'/Users/taylor/gh/stat289-f18/assets/data/en/University_of_Richmond.json'
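The same dotted convention applies to any Python module, not only `wiki`. As a small sketch, here it is with the standard library's `math` module (chosen purely for illustration; it is not part of the course code):

```python
import math

# "module name" + "." + "function name"
root = math.sqrt(16)
print(root)  # 4.0

# A single function can also be imported by name, after which the
# module prefix is no longer needed.
from math import floor

print(floor(2.7))  # 2
```

Keeping the module prefix, as in `wiki.wiki_json_path`, makes it obvious where each function comes from, which is likely why the tutorial uses that form.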

Remember that you can see the help page for a function like this:

In [3]:
help(wiki.wiki_json_path)

Help on function wiki_json_path in module wiki:

wiki_json_path(page_title, lang='en')
    Returns local path to JSON file for Wikipedia page data.
    
    This function is used to determine where the dump of a 
    call to the MediaWiki API, using the parse method, should
    be stored. As an extra action, the function also checks that
    the relevant directory exists and creates it if it does not.
    
    Args:
        page_title: A string containing the page title.
        lang: Two letter language code describing the Wikipedia
            language used to grab the data.
            
    Returns:
        A string describing a relative path to file.



I've made a few small changes to the code in `wiki.py` to make it function a bit
better for us and to deal with some annoying edge cases. I may need to fix some
other edge cases as we work through the data (pages like "AC/DC" and "Guns & Roses"
failed on the original code).

### Dictionaries

We saw last time that internal links, links to other pages on
Wikipedia, are returned as a particular element of the JSON data
returned by the MediaWiki API. Here, we will show how to extract
data from the JSON object. 

Let's start by loading the data from a single Wikipedia page. As
I mentioned briefly last time, the Python object that stores JSON
data is called a "dict" (short for dictionary).

In [4]:
data = wiki.get_wiki_json("University of Richmond")
type(data)

Pulling data from MediaWiki API: 'University of Richmond'


dict

A dictionary is similar to a list in that it stores a collection of items. While
a list keeps all of the items in a particular order, a dictionary associates each
element with a named "key". We saw these keys in the JSON file from last time. To
see all of the keys in a particular dictionary, use the `keys` method:

In [5]:
data.keys()
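As a minimal sketch of how keys behave, here is a toy dictionary. The variable `page` and its values are made up for illustration and are not real API output:

```python
page = {"title": "Example page", "pageid": 123, "links": []}

# keys() returns a view of the key names; wrapping it in list()
# gives an ordinary list that can be indexed or printed.
print(list(page.keys()))  # ['title', 'pageid', 'links']

# The `in` operator checks whether a key is present.
print("title" in page)      # True
print("langlinks" in page)  # False
```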



To grab an element from the dictionary, we use square brackets with the name
(in quotes) of the desired key. Again, similar to a list but with a twist.
Here I'll print out the title of the page.

In [6]:
data['title']

'University of Richmond'

The title is a single string, but it's possible for a dictionary element to be a
list or even another dictionary.

In [7]:
type(data['langlinks'])

list

In [8]:
data['langlinks']

[{'lang': 'az',
  'url': 'https://az.wikipedia.org/wiki/Ri%C3%A7mond_Universiteti',
  'langname': 'Azerbaijani',
  'autonym': 'azərbaycanca',
  '*': 'Riçmond Universiteti'},
 {'lang': 'de',
  'url': 'https://de.wikipedia.org/wiki/University_of_Richmond',
  'langname': 'German',
  'autonym': 'Deutsch',
  '*': 'University of Richmond'},
 {'lang': 'es',
  'url': 'https://es.wikipedia.org/wiki/Universidad_de_Richmond',
  'langname': 'Spanish',
  'autonym': 'español',
  '*': 'Universidad de Richmond'},
 {'lang': 'it',
  'url': 'https://it.wikipedia.org/wiki/Universit%C3%A0_di_Richmond',
  'langname': 'Italian',
  'autonym': 'italiano',
  '*': 'Università di Richmond'},
 {'lang': 'fi',
  'url': 'https://fi.wikipedia.org/wiki/Richmondin_yliopisto',
  'langname': 'Finnish',
  'autonym': 'suomi',
  '*': 'Richmondin yliopisto'},
 {'lang': 'sv',
  'url': 'https://sv.wikipedia.org/wiki/University_of_Richmond',
  'langname': 'Swedish',
  'autonym': 'svenska',
  '*': 'University of Richmond'},
 {'la

What if we want information about the Azerbaijani page for the University of Richmond?
Well, this is just a list so grab the first element with `[0]` as usual:

In [9]:
data['langlinks'][0]

{'lang': 'az',
 'url': 'https://az.wikipedia.org/wiki/Ri%C3%A7mond_Universiteti',
 'langname': 'Azerbaijani',
 'autonym': 'azərbaycanca',
 '*': 'Riçmond Universiteti'}

And what data type is this element? It's another dictionary:

In [10]:
type(data['langlinks'][0])

dict

And so we could grab an element, such as the language name, like this:

In [11]:
data['langlinks'][0]['langname']

'Azerbaijani'
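The chain of square brackets can be read one step at a time. The sketch below uses a cut-down, hand-typed copy of the structure (the variable `sample` is hypothetical and holds only a few of the real fields):

```python
# A dict whose 'langlinks' key holds a list of dicts.
sample = {
    "title": "University of Richmond",
    "langlinks": [
        {"lang": "az", "langname": "Azerbaijani"},
        {"lang": "de", "langname": "German"},
    ],
}

langlinks = sample["langlinks"]  # step 1: dict -> list
first = langlinks[0]             # step 2: list -> dict
print(first["langname"])         # step 3: dict -> string, prints Azerbaijani

# Asking for a missing key with square brackets raises a KeyError;
# the .get() method returns a default value instead.
print(first.get("autonym", "unknown"))  # unknown
```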

And if we want all of the language links? We need to combine our looping
knowledge with the dictionary methods:

In [12]:
lang_names = []

for lang in data['langlinks']:
    lang_names = lang_names + [lang['langname']]
    
print(lang_names)

['Azerbaijani', 'German', 'Spanish', 'Italian', 'Finnish', 'Swedish', 'Urdu', 'Chinese']
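For comparison, here are two other common ways to build the same kind of list, shown on a hand-typed subset of the language links (the variable names are only for illustration). Both give the same result as the loop above:

```python
langlinks = [
    {"lang": "az", "langname": "Azerbaijani"},
    {"lang": "de", "langname": "German"},
    {"lang": "es", "langname": "Spanish"},
]

# Loop version using append, which adds one item to the list in place.
names_loop = []
for lang in langlinks:
    names_loop.append(lang["langname"])

# Equivalent list comprehension: the loop written as one expression.
names_comp = [lang["langname"] for lang in langlinks]

print(names_loop)                # ['Azerbaijani', 'German', 'Spanish']
print(names_loop == names_comp)  # True
```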


### Links data

Now, let's do something similar to get the internal links from our Wikipedia page. These
are stored in the element named 'links' from the object `data`. Print out this object 
below:

Now, what kind of object are the links stored in? Use the `type` function below to 
figure this out:

You should see that the links are stored as a list. Each element of the list
is a particular link. Below, grab just the first (remember, this is element '0')
link in the list:

Use the `type` function again to detect the object type of a particular
link.

You should see that this is a dictionary. Now (yes, there's more!) print out the names
of the keys for this dictionary:

You should see that there are three elements in the dictionary. Here is what
the three elements mean:

- **ns**: an integer giving the "namespace" of the link. Each type of page has
its own namespace. The links to "real" pages all have a code of '0'.
- **exists**: this is an empty string. The value itself carries no information;
what matters is that the key 'exists' is present only if the link is not dead
(in other words, it links to a real page).
- **`*`**: this is the actual internal link.

Print out the namespace of the first link:

You should see that the namespace is 14 because the first link is to a Category
page (Categories are always 14).

Now, do something similar to what I did in the prior section to create a list named
`internal_links` that grabs all of the links (the elements under `*`). Print out
the list at the bottom of the cell.

### Using `links_as_list`

I wrote a small helper function `links_as_list` (defined in `wiki.py`) to
extract the list of links from a page's data. It should work very similarly to
the code you wrote above (open the code file and check it!), but additionally
only includes a link if (1) its namespace is equal to 0 and (2) the page
actually exists.
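To make the two filtering conditions concrete, here is a rough sketch of a function with that behavior. It is not the code in `wiki.py`: the name `filter_links`, the `namespace` parameter, and the toy data are all assumptions made for illustration.

```python
def filter_links(data, namespace=0):
    """Return link titles in the given namespace whose page exists."""
    output = []
    for link in data["links"]:
        # Keep the link only if it is in the wanted namespace and the
        # 'exists' key is present (dead links lack this key).
        if link["ns"] == namespace and "exists" in link:
            output.append(link["*"])
    return output


# Hand-made data in the same shape as the 'links' element.
toy = {
    "links": [
        {"ns": 14, "exists": "", "*": "Category:Example"},
        {"ns": 0, "exists": "", "*": "Real page"},
        {"ns": 0, "*": "Dead link"},
    ]
}

print(filter_links(toy))  # ['Real page']
```

Note that the test `"exists" in link` checks for the key rather than its value, since the value is always an empty string.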

Let's use this to get all of the links of the University of Richmond page.

In [13]:
data = wiki.get_wiki_json("University of Richmond")
links = wiki.links_as_list(data)
links

['2008 Montana Grizzlies football team',
 '2008 Richmond Spiders football team',
 "2010-11 Richmond Spiders men's basketball team",
 "2010–11 Kansas Jayhawks men's basketball team",
 "2011 Atlantic 10 Men's Basketball Tournament",
 "2011 NCAA Men's Division I Basketball Tournament",
 'A cappella',
 'Afroman',
 'Alcoa',
 'Alpha Kappa Alpha',
 'Alpha Phi Alpha',
 'Alpha Phi Omega',
 'Altria Group',
 'Aluminum',
 'Alumnus',
 'American Civil War',
 'American Jobs Act',
 'Appalachian College of Pharmacy',
 'Appalachian School of Law',
 'Associated Colleges of the South',
 'Athletic nickname',
 'Atlantic 10 Conference',
 'Atlantic University',
 'Auburn University',
 'Averett University',
 'Baptist Theological Seminary at Richmond',
 'Baptists',
 'Barracks',
 'Baylor University',
 'Bill Clinton',
 'Birmingham–Southern College',
 'Blackstone College for Girls',
 'Bluefield College',
 'Bon Secours Memorial College of Nursing',
 'Bonner Scholars',
 'Bridgewater College',
 'BusinessWeek',
 'Capit

Now, a reasonable next step would be to grab the data associated with
each of these pages. To download the data for the first link we would
just do this:

In [14]:
data = wiki.get_wiki_json(links[0])
data

Pulling data from MediaWiki API: '2008 Montana Grizzlies football team'


{'title': '2008 Montana Grizzlies football team',
 'pageid': 23897286,
 'revid': 841752145,
 'text': {'*': '<div class="mw-parser-output"><table class="infobox vevent" style="width:22em;width: 25em"><tbody><tr><th colspan="2" class="summary" style="text-align:center;font-size:125%;font-weight:bold;font-size: 125%"><span class="dtstart">2008</span> <span class="vcard attendee fn org"><a href="/wiki/Montana_Grizzlies_football" title="Montana Grizzlies football">Montana Grizzlies football</a></span></th></tr><tr><td colspan="2" style="text-align:center">\n<a href="/wiki/File:Montana_Griz_logo.svg" class="image"><img alt="Montana Griz logo.svg" src="//upload.wikimedia.org/wikipedia/commons/thumb/3/3b/Montana_Griz_logo.svg/150px-Montana_Griz_logo.svg.png" width="150" height="109" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/3/3b/Montana_Griz_logo.svg/225px-Montana_Griz_logo.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/3/3b/Montana_Griz_logo.svg/300px-Montana_Griz_l

How do we do this automatically for all of the links? We want to make use
of a `for` loop. A for loop cycles through all of the elements of a
list and applies a set of instructions to each element.

Here's an example where we take each element in the list of links and
print out just the first three characters:

In [15]:
for link in links:
    print(link[:3])

200
200
201
201
201
201
A c
Afr
Alc
Alp
Alp
Alp
Alt
Alu
Alu
Ame
Ame
App
App
Ass
Ath
Atl
Atl
Aub
Ave
Bap
Bap
Bar
Bay
Bil
Bir
Bla
Blu
Bon
Bon
Bri
Bus
Cap
Cap
Cat
Cen
Cen
Cha
Cha
Chr
Chr
Cle
Col
Col
Col
Col
Col
Com
Con
Cor
Cor
Cou
Cry
Dav
Dav
Daw
Day
DeV
Del
Del
Del
Duk
Duq
E. 
E. 
E. 
E. 
ECP
Eas
Eas
Eas
Edw
Eig
Eli
Emo
Fer
Fin
Flo
Flo
Flo
Flo
Flo
For
Fre
Fre
Fre
Fur
Gee
Geo
Geo
Geo
Geo
Geo
Geo
Geo
Geo
Geo
Geo
Gib
Goo
Gor
Got
Gov
H. 
Hab
Ham
Ham
Har
Hen
Hen
Hen
Hol
Hon
Ida
Int
Jam
Jef
Jef
Jep
Joh
Joi
Kap
Kap
Kap
Kap
La 
Lam
Lat
Lea
Lib
Lib
Lis
Lis
Lon
Lou
Lou
Mac
Mal
Mar
Mar
Mar
Mar
Mas
Mid
Mil
Mis
Mod
Mor
NCA
NCA
Nat
Nat
Nat
Nee
New
Non
Nor
Nor
Nor
Nor
Ohi
Old
Pat
Phi
Phi
Phi
Pi 
Pos
Pre
Pre
Pri
Pri
Pri
Rad
Ral
Ran
Ran
Reg
Rey
Rho
Rho
Ric
Ric
Ric
Ric
Ric
Ric
Ric
Ric
Ric
Roa
Rob
Rol
Ron
Ros
Ryl
Sai
Sai
Sai
Sch
Sew
Sha
She
Sig
Sig
Sma
Sou
Sou
Sou
Spe
Spi
St.
Str
Stu
Stu
Sub
Sul
Swe
Tex
The
The
The
The
The
The
Tri
Tru
Tul
U.S
UMa
Und
Uni
Uni
Uni
Uni
Uni
Uni
Uni
Uni
Uni
Uni
Uni
Uni
Uni
Uni


If we want to grab the webpage data for each link from the UR page,
we can now just do this (this will take a while the first time you
run it, but will be quick the second time):

In [16]:
for link in links:
    wiki.get_wiki_json(link)

Pulling data from MediaWiki API: '2008 Richmond Spiders football team'
Pulling data from MediaWiki API: '2010-11 Richmond Spiders men's basketball team'
Pulling data from MediaWiki API: '2010–11 Kansas Jayhawks men's basketball team'
Pulling data from MediaWiki API: '2011 Atlantic 10 Men's Basketball Tournament'
Pulling data from MediaWiki API: '2011 NCAA Men's Division I Basketball Tournament'
Pulling data from MediaWiki API: 'A cappella'
Pulling data from MediaWiki API: 'Afroman'
Pulling data from MediaWiki API: 'Alcoa'
Pulling data from MediaWiki API: 'Alpha Kappa Alpha'
Pulling data from MediaWiki API: 'Alpha Phi Alpha'
Pulling data from MediaWiki API: 'Alpha Phi Omega'
Pulling data from MediaWiki API: 'Altria Group'
Pulling data from MediaWiki API: 'Aluminum'
Pulling data from MediaWiki API: 'Alumnus'
Pulling data from MediaWiki API: 'American Civil War'
Pulling data from MediaWiki API: 'American Jobs Act'
Pulling data from MediaWiki API: 'Appalachian College of Pharmacy'
Pulling 

Pulling data from MediaWiki API: 'Marine Corps University'
Pulling data from MediaWiki API: 'Marion College, Virginia'
Pulling data from MediaWiki API: 'Mary Baldwin University'
Pulling data from MediaWiki API: 'Marymount University'
Pulling data from MediaWiki API: 'Massachusetts Institute of Technology'
Pulling data from MediaWiki API: 'Mid-Atlantic States'
Pulling data from MediaWiki API: 'Millsaps College'
Pulling data from MediaWiki API: 'Mississippi State University'
Pulling data from MediaWiki API: 'Model United Nations'
Pulling data from MediaWiki API: 'Morehouse College'
Pulling data from MediaWiki API: 'NCAA Division I'
Pulling data from MediaWiki API: 'NCAA Division I Football Championship'
Pulling data from MediaWiki API: 'National Public Radio'
Pulling data from MediaWiki API: 'National Register of Historic Places'
Pulling data from MediaWiki API: 'National Science Foundation'
Pulling data from MediaWiki API: 'Need-blind admission'
Pulling data from MediaWiki API: 'New Eng

Pulling data from MediaWiki API: 'Virginia Intermont College'
Pulling data from MediaWiki API: 'Virginia International University'
Pulling data from MediaWiki API: 'Virginia Military Institute'
Pulling data from MediaWiki API: 'Virginia State University'
Pulling data from MediaWiki API: 'Virginia Tech'
Pulling data from MediaWiki API: 'Virginia Theological Seminary'
Pulling data from MediaWiki API: 'Virginia Union University'
Pulling data from MediaWiki API: 'Virginia University of Lynchburg'
Pulling data from MediaWiki API: 'Virginia Wesleyan University'
Pulling data from MediaWiki API: 'Virginia–Maryland College of Veterinary Medicine'
Pulling data from MediaWiki API: 'WDCE'
Pulling data from MediaWiki API: 'Warren H. Manning'
Pulling data from MediaWiki API: 'Warren Manning'
Pulling data from MediaWiki API: 'Washington and Lee University'
Pulling data from MediaWiki API: 'West Virginia University'
Pulling data from MediaWiki API: 'Westwood College'
Pulling data from MediaWiki API: '
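The speed-up on a second run suggests that each page is saved to a local JSON file and read back from disk when it is requested again. The sketch below shows that general caching pattern under stated assumptions: `cached_fetch` and `fake_fetch` are hypothetical names, the "download" is faked so that nothing touches the network, and the real `wiki.py` may work differently (for instance, this sketch does not handle titles containing characters such as "/").

```python
import json
import os
import tempfile


def cached_fetch(title, fetch, cache_dir):
    """Return data for `title`, calling `fetch` only on a cache miss."""
    path = os.path.join(cache_dir, title.replace(" ", "_") + ".json")
    if os.path.exists(path):
        # Cache hit: read the saved copy instead of downloading again.
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    # Cache miss: fetch the data, then save it for next time.
    data = fetch(title)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data


calls = []


def fake_fetch(title):
    # Stand-in for a slow network request; records each call.
    calls.append(title)
    return {"title": title}


cache_dir = tempfile.mkdtemp()
first = cached_fetch("A cappella", fake_fetch, cache_dir)
second = cached_fetch("A cappella", fake_fetch, cache_dir)
print(len(calls))  # 1, because the second request was served from disk
```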

### Using the MediaWiki data

Now, finally, we have the code and functionality to look at a
collection of Wikipedia pages. Let's start with a simple task:
counting how many links each of the pages linked from the Richmond
page has. Pay attention to how I do this!

In [17]:
num_links = []
data_json = wiki.get_wiki_json("University of Richmond")
ur_links = wiki.links_as_list(data_json)

for link in ur_links:
    data = wiki.get_wiki_json(link)
    new_links = wiki.links_as_list(data)
    num_links.append(len(new_links))

Now, let's look at the results:

In [18]:
print(num_links)

[218, 183, 1, 208, 108, 1, 273, 52, 118, 624, 690, 198, 1, 2, 22, 1223, 267, 111, 176, 33, 54, 250, 98, 506, 179, 92, 770, 123, 625, 1618, 447, 99, 286, 94, 1, 181, 1, 81, 274, 486, 349, 475, 144, 44, 212, 322, 439, 369, 787, 611, 1, 169, 1, 930, 98, 916, 18, 32, 427, 218, 526, 308, 297, 381, 364, 401, 802, 604, 30, 48, 14, 240, 15, 347, 212, 155, 318, 1, 96, 420, 251, 134, 262, 379, 237, 612, 791, 462, 25, 8, 283, 338, 373, 286, 2026, 277, 500, 200, 829, 181, 793, 702, 293, 62, 418, 137, 1, 32, 16, 89, 547, 643, 202, 432, 429, 171, 453, 1, 228, 528, 465, 91, 176, 61, 126, 208, 282, 442, 318, 271, 169, 291, 683, 1, 1, 410, 206, 535, 358, 679, 428, 1, 121, 213, 91, 284, 276, 1122, 1, 400, 349, 159, 540, 494, 364, 2, 249, 232, 175, 1953, 75, 436, 460, 470, 171, 397, 446, 155, 152, 258, 109, 267, 248, 3, 36, 827, 274, 392, 330, 219, 506, 592, 345, 101, 102, 544, 699, 206, 1129, 250, 88, 184, 198, 16, 48, 507, 105, 419, 22, 479, 171, 227, 185, 369, 41, 359, 280, 227, 372, 263, 99, 71, 267,

What can we do with this? For starters, what's the average
number of links on each page?

In [19]:
sum(num_links) / len(num_links)

330.9397993311037
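The average can be pulled around by a few pages with a very large number of links, so it may help to look at the median as well. Here is a small sketch using the standard library's `statistics` module on just the first ten counts printed above (the subset is used only to keep the example short):

```python
import statistics

# First ten values from the num_links output above.
subset = [218, 183, 1, 208, 108, 1, 273, 52, 118, 624]

print(sum(subset) / len(subset))  # 178.6
print(statistics.median(subset))  # 150.5
print(max(subset), min(subset))   # 624 1
```

When the mean sits well above the median, as it does here, a handful of link-heavy pages are raising the average.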

How does this compare to the number of links from the Richmond page?

In [20]:
len(ur_links)

299

**Answer**:

## Practice

Take a look at the Wikipedia page on Richmond, Virginia:

> https://en.wikipedia.org/wiki/Richmond,_Virginia

Below, write code that:

1. Downloads all of the links from the Richmond, Virginia
Wikipedia page.
2. Extracts from each linked page all of the links on **that** page
and puts them together in one appended list called `all_links`.
3. Uses the `collections.Counter` object to find the 40 links that
are used most across all of the pages.
4. Then, think about the most frequent 40 pages and try to reason why
these are the most common.
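If `collections.Counter` is new to you, here is a minimal sketch of how it counts repeated items, using a short made-up list (`toy_links` is only for illustration):

```python
from collections import Counter

toy_links = [
    "Virginia",
    "United States",
    "Virginia",
    "Richmond, Virginia",
    "Virginia",
    "United States",
]

counts = Counter(toy_links)

# most_common(n) returns the n most frequent items as (item, count)
# pairs, sorted from most to least frequent.
print(counts.most_common(2))  # [('Virginia', 3), ('United States', 2)]

# A Counter can also be indexed like a dictionary.
print(counts["Richmond, Virginia"])  # 1
```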

In [25]:
# Make sure all of the links are downloaded
data_json = wiki.get_wiki_json("Richmond,_Virginia")
rr_links = wiki.links_as_list(data_json)

for link in rr_links:
    data = wiki.get_wiki_json(link)

Pulling data from MediaWiki API: 'Kansas'
Pulling data from MediaWiki API: 'Kentucky'
Pulling data from MediaWiki API: 'King George County, Virginia'
Pulling data from MediaWiki API: 'King William County, Virginia'
Pulling data from MediaWiki API: 'King and Queen County, Virginia'
Pulling data from MediaWiki API: 'Kingsport–Bristol–Bristol, Tennessee-Virginia Metropolitan Statistical Area'
Pulling data from MediaWiki API: 'Laburnum Park Historic District'
Pulling data from MediaWiki API: 'Lancaster County, Virginia'
Pulling data from MediaWiki API: 'Lansing, Michigan'
Pulling data from MediaWiki API: 'Law of Virginia'
Pulling data from MediaWiki API: 'Lee County, Virginia'
Pulling data from MediaWiki API: 'Lexington, Virginia'
Pulling data from MediaWiki API: 'Libby Hill, Richmond'
Pulling data from MediaWiki API: 'Lincoln, Nebraska'
Pulling data from MediaWiki API: 'List of National Historic Landmarks in Virginia'
Pulling data from MediaWiki API: 'List of United States Representatives

Pulling data from MediaWiki API: 'Southside (Virginia)'
Pulling data from MediaWiki API: 'Southwest Virginia'
Pulling data from MediaWiki API: 'Spotsylvania County, Virginia'
Pulling data from MediaWiki API: 'Springfield, Illinois'
Pulling data from MediaWiki API: 'Stadium neighborhood'
Pulling data from MediaWiki API: 'Stafford County, Virginia'
Pulling data from MediaWiki API: 'State Fair of Virginia'
Pulling data from MediaWiki API: 'Staunton, Virginia'
Pulling data from MediaWiki API: 'Staunton–Waynesboro metropolitan area'
Pulling data from MediaWiki API: 'Suffolk, Virginia'
Pulling data from MediaWiki API: 'Surry County, Virginia'
Pulling data from MediaWiki API: 'Sussex County, Virginia'
Pulling data from MediaWiki API: 'Tallahassee, Florida'
Pulling data from MediaWiki API: 'Tazewell County, Virginia'
Pulling data from MediaWiki API: 'Tennessee'
Pulling data from MediaWiki API: 'Tennessee Valley'
Pulling data from MediaWiki API: 'Texas'
Pulling data from MediaWiki API: 'Texas B

In [26]:
# Now, collect all of the links as a single list
all_links = []

for link in rr_links:
    data = wiki.get_wiki_json(link)
    new_links = wiki.links_as_list(data)
    all_links = all_links + new_links

In [27]:
# Now, count most frequent links
from collections import Counter

Counter(all_links).most_common(40)

[('Richmond, Virginia', 685),
 ('Virginia', 576),
 ('Geographic coordinate system', 479),
 ('United States', 469),
 ('International Standard Book Number', 446),
 ('Wayback Machine', 334),
 ('Norfolk, Virginia', 311),
 ('United States Census Bureau', 306),
 ('List of regions of the United States', 306),
 ('Democratic Party (United States)', 297),
 ('2010 United States Census', 294),
 ('American Civil War', 285),
 ('Republican Party (United States)', 285),
 ('Washington, D.C.', 285),
 ('Henrico County, Virginia', 282),
 ('Petersburg, Virginia', 282),
 ('2000 United States Census', 280),
 ('1990 United States Census', 276),
 ('1980 United States Census', 275),
 ('1970 United States Census', 274),
 ('List of National Historic Landmarks in Virginia', 274),
 ('1960 United States Census', 272),
 ('1950 United States Census', 269),
 ('1930 United States Census', 268),
 ('1940 United States Census', 267),
 ('1920 United States Census', 266),
 ('1910 United States Census', 265),
 ('Chesterfield 

## For next time

On Tuesday we are going to start doing some network analysis. This means that we will
need to use the **networkx** module, which is not included in the standard Anaconda
Python installation. Please make sure that you have this downloaded correctly by running
the following:

In [None]:
import networkx as nx

If there is a problem, please let me know before the end of class today.
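One way to check for a module without triggering an error is to ask Python whether it can find it. This is a small sketch using the standard library's `importlib`; the helper name `module_available` is made up for illustration:

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module `name` can be found."""
    return importlib.util.find_spec(name) is not None


print(module_available("json"))      # True (part of the standard library)
print(module_available("networkx"))  # True only if networkx is installed
```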