## Tutorial 10: Looping through Wikipedia

In this tutorial, we combine our lists and loops with the MediaWiki API
functions to grab data from several Wikipedia pages in an automated way.

### Modules

We will need functions that I gave last class for loading data from
Wikipedia again today, as well as for the foreseeable future. Rather
than having to copy and paste them each time, there is an easy way to
load these functions from a common file. 

I've created the file `wiki.py`, which you should download from the course
website and put in the same directory where you store your tutorials.
You can open and edit the file in Jupyter, which I suggest you do right
now to get a sense of what the file looks like. It is basically one long
code cell. To load the functions in this file, we write `import` along
with the name of the file (without the extension).

In [None]:
import wiki

Now, to get one of the functions in the module, we use the normal 
"module name" + "." + "function name" calling convention. So, to
get the function `wiki_json_path` we would do this:

In [None]:
wiki.wiki_json_path("University of Richmond")
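
The same dot convention works for any module, not just `wiki`. As a small illustration that does not depend on the course file, here is the standard-library `math` module used the same way:

```python
# Load the standard-library math module, just as we loaded wiki.
import math

# "module name" + "." + "function name"
root = math.sqrt(16)
print(root)
```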

Remember that you can see the help page for a function like this:

In [None]:
help(wiki.wiki_json_path)

I've made a few small changes to the code in `wiki.py` to make it function a bit
better for us and to deal with some annoying edge cases. I may need to fix some
other edge cases as we work through the data (pages like "AC/DC" and "Guns & Roses"
failed on the original code).

### Dictionaries

We saw last time that internal links, links to other pages on
Wikipedia, are returned as a particular element of the JSON data
returned by the MediaWiki API. Here, we will show how to extract
data from the JSON object. 

Let's start by loading the data from a single Wikipedia page. As
I mentioned briefly last time, the Python object that stores JSON
data is called a "dict" (short for dictionary).

In [None]:
data = wiki.get_wiki_json("University of Richmond")
type(data)

A dictionary is similar to a list in that it stores a collection of items. While
a list keeps all of the items in a particular order, a dictionary associates each
element with a named "key". We saw these keys in the JSON file from last time. To
see all of the keys in a particular dictionary, use the `keys` method:

In [None]:
data.keys()

To grab an element from the dictionary, we use square brackets with the name
(in quotes) of the desired key. Again, similar to a list but with a twist.
Here I'll print out the title of the page.

In [None]:
data['title']
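
To make the contrast with lists concrete, here is a small hand-made example. The values are invented for illustration and are not what the API returns:

```python
# A list looks items up by position ...
as_list = ["Example page", 12345]
print(as_list[0])

# ... while a dictionary looks them up by key.
as_dict = {"title": "Example page", "pageid": 12345}
print(as_dict["title"])

# The keys method lists every key stored in the dictionary.
print(list(as_dict.keys()))
```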

The title is a single string, but it's possible for a dictionary element to be a
list or even another dictionary.

In [None]:
type(data['langlinks'])

In [None]:
data['langlinks']

What if we want information about the Azerbaijani page for the University of Richmond?
Well, this is just a list so grab the first element with `[0]` as usual:

In [None]:
data['langlinks'][0]

And what data type is this element? It's another dictionary:

In [None]:
type(data['langlinks'][0])

And so we could grab an element, such as the language name, like this:

In [None]:
data['langlinks'][0]['langname']

And if we want all of the language links? We need to combine our looping
knowledge with the dictionary methods:

In [None]:
lang_names = []

for lang in data['langlinks']:
    lang_names = lang_names + [lang['langname']]
    
print(lang_names)
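
The cells later in this tutorial build lists with the `append` method rather than `+`. Both do the same job here. The sketch below runs both on a made-up list shaped like `data['langlinks']` (the entries are invented so that the cell works without downloading anything):

```python
# Invented stand-in for data['langlinks']: a list of dictionaries.
fake_langlinks = [
    {"lang": "de", "langname": "German"},
    {"lang": "fr", "langname": "French"},
    {"lang": "es", "langname": "Spanish"},
]

# Version 1: build a new list each time with +.
names_plus = []
for lang in fake_langlinks:
    names_plus = names_plus + [lang["langname"]]

# Version 2: add to the existing list with append.
names_append = []
for lang in fake_langlinks:
    names_append.append(lang["langname"])

print(names_plus)
print(names_append == names_plus)
```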

### Links data

Now, let's do something similar to get the internal links from our Wikipedia page. These
are stored in the element named 'links' from the object `data`. Print out this object 
below:

Now, what kind of object are the links stored in? Use the `type` function below to 
figure this out:

You should see that the links are stored as a list. Each element of the list
is a particular link. Below, grab just the first (remember, this is element '0')
link in the list:

Use the `type` function again to detect the object type of a particular
link.

You should see that this is a dictionary. Now (yes, there's more!) print out the names
of the keys for this dictionary:

You should see that there are three elements in the dictionary. Here is what
the three elements mean:

- **ns**: an integer giving the "namespace" of the link. Each type of page has
its own namespace. The links to "real" pages (ordinary articles) all have a code of 0.
- **exists**: this is an empty string. Its value does not matter; what matters is
that the key 'exists' is only present if the link is not dead (in other words, it
links to a real page).
- **`*`**: this is the actual internal link.
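
Because the `exists` key is either present or missing, one way to test it is with the `in` keyword rather than by looking at its value. A minimal sketch on two hand-made links (both invented for illustration, not taken from the real page):

```python
# A link to a page that exists: the 'exists' key is present (its value is empty).
live_link = {"ns": 0, "exists": "", "*": "Richmond, Virginia"}

# A dead link: the 'exists' key is missing entirely.
dead_link = {"ns": 0, "*": "A page nobody has written"}

live_ok = "exists" in live_link
dead_ok = "exists" in dead_link

print(live_ok)
print(dead_ok)
print(live_link["*"])
```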

Print out the namespace of the first link:

You should see that the namespace is 14 because the first link is to a Category
page (Categories are always 14).

Now, do something similar to what I did in the prior section to create a list named
`internal_links` that grabs all of the links (the elements under `*`). Print out
the list at the bottom of the cell.

### Using `links_as_list`

I wrote a small helper function `links_as_list` (defined in `wiki.py`) to
extract the list of links from a webpage. It should work very similarly to
the code you wrote above (open the code file and check it!), but additionally
only includes a link if (1) the namespace is equal to 0 and (2) the page
actually exists.
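
To give a sense of what such a filter might look like, here is a rough sketch. It is not the code in `wiki.py`; the function name `filter_links`, its `namespace` argument, and the sample data are all made up for illustration, and it takes a list of link dictionaries rather than the full page data:

```python
def filter_links(link_dicts, namespace=0):
    """Return the targets of links in one namespace that point to existing pages."""
    output = []
    for link in link_dicts:
        # Namespace 0 is MediaWiki's main (article) namespace; the 'exists'
        # key is only present when the linked page is real.
        if link["ns"] == namespace and "exists" in link:
            output.append(link["*"])
    return output

# Invented sample data: a category link, a live article link, and a dead link.
fake_links = [
    {"ns": 14, "exists": "", "*": "Category:Example"},
    {"ns": 0, "exists": "", "*": "Richmond, Virginia"},
    {"ns": 0, "*": "A page nobody has written"},
]

kept = filter_links(fake_links)
print(kept)
```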

Let's use this to get all of the links of the University of Richmond page.

In [None]:
data = wiki.get_wiki_json("University of Richmond")
links = wiki.links_as_list(data)
links

Now, a reasonable next step would be to grab the data associated with
each of these pages. To download the data for the first link we would
just do this:

In [None]:
data = wiki.get_wiki_json(links[0])
data

How do we do this automatically for all of the links? We want to make use
of a `for` loop. A for loop cycles through all of the elements of a
list and applies a set of instructions to each element.

Here's an example where we take each element in the list of links and
print out just the first three letters:

In [None]:
for link in links:
    print(link[:3])

If we want to grab the webpage data for each link from the UR page,
we can now just do this (this will take a while the first time you
run it, but will be quick the second time):

In [None]:
for link in links:
    wiki.get_wiki_json(link)

### Using the MediaWiki data

Now, finally, we have the code and functionality to look at a
collection of Wikipedia pages. Let's start with a simple task:
counting how many links each page linked from the University of Richmond
page has. Pay attention to how I do this!

In [None]:
num_links = []
data_json = wiki.get_wiki_json("University of Richmond")
ur_links = wiki.links_as_list(data_json)

for link in ur_links:
    data = wiki.get_wiki_json(link)
    new_links = wiki.links_as_list(data)
    num_links.append(len(new_links))

Now, let's look at the results:

In [None]:
print(num_links)

What can we do with this? For starters, what's the average
number of links on each page?

In [None]:
sum(num_links) / len(num_links)
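
The pattern in the loop above (start with an empty list, append one number per page, then summarize) can be tried out without waiting for any downloads. In this sketch the dictionary `fake_pages` is a made-up stand-in for calling `wiki.get_wiki_json` and `wiki.links_as_list` on each title:

```python
# Invented data: each title maps to the list of links found on that page.
fake_pages = {
    "Page A": ["Page B", "Page C"],
    "Page B": ["Page A"],
    "Page C": ["Page A", "Page B", "Page D"],
}

fake_counts = []
for title in ["Page A", "Page B", "Page C"]:
    new_links = fake_pages[title]        # stand-in for the download step
    fake_counts.append(len(new_links))   # one number per page

average = sum(fake_counts) / len(fake_counts)
print(fake_counts)
print(average)
```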

How does this compare to the number of links from the University of Richmond page?

In [None]:
len(ur_links)

**Answer**:

## Practice

Take a look at the Wikipedia page on Richmond, Virginia:

> https://en.wikipedia.org/wiki/Richmond,_Virginia

Below, write code that:

1. Downloads all of the links from the Richmond, Virginia
Wikipedia page.
2. Extracts from each of those pages all of the links from **that** page
and puts them together in one appended list called `all_links`.
3. Uses the `collections.Counter` object to find the 40 links that
are used most across all of the pages.
4. Think about the most frequent 40 pages and try to reason why
these are the most common.
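
If you have not used `collections.Counter` before, here is a small sketch of how it counts repeated items. The link names are invented, the variable names are only for this demo, and `most_common(2)` stands in for the 40 asked for above:

```python
from collections import Counter

# Invented data: the links found on three different pages.
links_per_page = [
    ["Virginia", "James River"],
    ["Virginia"],
    ["Virginia", "James River", "Tobacco"],
]

# Put every link into one long list.
all_links_demo = []
for new_links in links_per_page:
    all_links_demo = all_links_demo + new_links

# Count how often each link appears and keep the most frequent ones.
counts = Counter(all_links_demo)
top_two = counts.most_common(2)
print(top_two)
```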

## For next time

On Tuesday we are going to start doing some network analysis. This means that we will
need to use the **networkx** module, which is not included in the standard Anaconda
Python installation. Please make sure that you have this installed correctly by running
the following:

In [None]:
import networkx as nx

If there is a problem, please let me know before the end of class today.