# Week 3: Getting data—scraping and APIs.

This week is about getting data from the big ol' Internet, with Wikipedia as our guinea pig. The main task today is to retrieve the Wikipedia pages of **all Marvel characters** using the MediaWiki **API**. There are three parts to this exercise set.

* Learn the basics of how to retrieve data from Wiki sites using the MediaWiki API
* Download all Marvel character Wikipedia articles
* Begin to explore the data

You will be working with the data you acquire today for the remainder of the semester. Try to get as far as possible, structure the data nicely, and write your code so that it still makes sense to you in the coming weeks.

Also, there's an **important practice** you should start getting used to—which matters when we grade assignments. 
1. Openly reflect on how you solve a problem. It can be code comments, or markup below/above the code cell, just as long as you share your thoughts. 
2. Comment on your results, discussing:
    * Whether they make sense
    * If they look somewhat as you expected, and if not, what the reasons for this difference might be
    * What—interesting or not—insight they reveal about the given system you analyze
    
    *Note: of course you can't always say something profound about every little thing, so rest assured, I will only expect explanations in your assignments when **it makes sense** that there should be one.*

## Exercises

**Why use an API?** You could just go ahead and scrape the HTML from a Wikipedia page as simply as:

    import requests as rq
    rq.get("https://en.wikipedia.org/wiki/Batman").text
    
Well... navigating data in HTML format is not always easy. Therefore, MediaWiki offers its users direct use of its API. To load the MediaWiki markup using the API, one would do something like:

    rq.get("https://en.wikipedia.org/w/api.php?format=json&action=query&titles=Batman&prop=revisions&rvprop=content").json()
    
This assumes the data is JSON formatted and returns a `dict` object inside which you can find all sorts of information about the page, including the latest revision of the Batman page markup.
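
Long query strings are easy to mistype. As a small sketch (not part of the original exercise), the same URL can be assembled from a dictionary of parameters using only the standard library. The helper name `build_query_url` is made up for illustration.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

def build_query_url(title):
    """Return the API URL that asks for the latest revision content of `title`."""
    params = {
        "format": "json",      # response format
        "action": "query",     # we want to fetch data
        "titles": title,       # page to look up
        "prop": "revisions",   # ask for revision information ...
        "rvprop": "content",   # ... including the markup of the revision
    }
    # urlencode escapes spaces and special characters in the values
    return API_ENDPOINT + "?" + urlencode(params)

url = build_query_url("Batman")
print(url)
```

With `requests`, passing the same dictionary as `rq.get(API_ENDPOINT, params=params)` should have the same effect, and it spares you from escaping titles by hand.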

**Helpful code to display `dict` object as a tree.** Have a look at it, make sure you understand it.

In [1]:
def print_dict_tree(d, indent=0):
    """Print tree of keys in `dict` object.
    
    Prints the different levels of nested keys in a `dict` object. When there
    are no more dictionaries to key into, prints the object's type and the
    length of its string representation.

    Input
    -----
    d : dict
    indent : int
        Current nesting level (used in the recursive calls).
    """
    for key, value in d.items():
        print('    ' * indent + str(key), end=' ')
        if isinstance(value, dict):
            # Nested dictionary: end the line and recurse one level deeper
            print()
            print_dict_tree(value, indent + 1)
        else:
            # Leaf: report the type name and the length of str(value)
            print(":", type(value).__name__, "-", len(str(value)))
            
# Example
import requests as rq
data = rq.get("https://en.wikipedia.org/w/api.php?format=json&action=query&titles=Batman&prop=revisions&rvprop=content").json()
print_dict_tree(data)

batchcomplete : str - 0
    main 
        * : str - 287
    revisions 
        * : str - 163
query 
    pages 
        4335 
            pageid : int - 4
            ns : int - 1
            title : str - 6
            revisions : list - 206253


### Part 0: Learn to access Wikipedia data with Python

Understand how Wikipedia markup works. You'll need to know a bit about the formatting of MediaWiki pages so that you can parse the markup that you retrieve from Wikipedia. See http://www.mediawiki.org/wiki/Help:Formatting. In particular, look into how links work and how tables work, and make sure you can answer the following questions.

>**Ex. 3.0.1**: How do you link to another Wikipedia page from within a Wikipedia page, using the MediaWiki markup? Write down a simple example that links to a specific section in another page.

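In MediaWiki markup, `[[Page title]]` links to another page on the same wiki. Appending `#Section name` to the title targets a specific section, and `|label` changes the displayed text. For example, `[[California#History|history of California]]` links to https://en.wikipedia.org/wiki/California#History and is displayed as "history of California".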

> **Ex. 3.0.2**: What is the MediaWiki markup to create a simple table like the one below?
>
>| True Positive  | False Positive |
| -------------- |:--------------:|
| False Negative | True Negative  |

{| class="wikitable" style="margin:auto" <br>
! True Positive !! False Positive <br>
|- <br>
| False Negative || True Negative <br>
|}

> **Ex. 3.0.3**: Figure out how to download pages from Wikipedia. Familiarize yourself with [the API](http://www.mediawiki.org/wiki/API:Main_page) (there's a nice little [tutorial](https://www.mediawiki.org/wiki/API:Tutorial), and further info about the [Query action](https://www.mediawiki.org/wiki/API:Query)) and learn how to extract the markup. The API query that returns the markup of the Batman page is:
>
>`https://en.wikipedia.org/w/api.php?format=json&action=query&titles=Batman&prop=revisions&rvprop=content`
>
>1. Explain the structure of this query. What are the parameters and arguments and what do they mean? What happens if you remove `rvprop=content`?
2. Download the Batman page using the API and save it in a new variable. Extract the markup from the `dict` object and save it to a file called "batman.txt". We usually get hung up on this in class, so the first student to successfully extract the markup can share their solution with me so I can validate it and then share it with the class; s/he  gets **one extra credit point**!
>
> *Hint: 2. Use `print_dict_tree` to understand the hierarchy of keys and values in the data you get from the API. To extract the markup, you need to first key into 'query' then 'pages', and so on.*
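
The exact nesting depends on the request. Without an `rvslots` parameter the API answers in a legacy layout where the markup sits under a `'*'` key of the revision; with `rvslots=main` it moves one level deeper, under `slots` and `main`. The following is a minimal sketch that handles both, using a hand-made dictionary in place of a real response. The helper name `extract_markup` and the sample values are invented for illustration.

```python
def extract_markup(data):
    """Return the wiki markup of the single page contained in an API response."""
    pages = data["query"]["pages"]
    # The page ID is the dictionary key and is not known in advance,
    # so take the first (and only) value
    page = next(iter(pages.values()))
    revision = page["revisions"][0]
    # With rvslots=main the content is nested under slots -> main
    if "slots" in revision:
        revision = revision["slots"]["main"]
    return revision["*"]

# Hand-made stand-in for a response (not real API output)
sample = {
    "batchcomplete": "",
    "query": {"pages": {"4335": {
        "pageid": 4335, "ns": 0, "title": "Batman",
        "revisions": [{"contentformat": "text/x-wiki",
                       "*": "'''Batman''' is a superhero ..."}],
    }}},
}
print(extract_markup(sample))
```

Keying into the page ID by hand (`data['query']['pages']['4335']`) works just as well for a single page; taking the first value simply avoids hard-coding the ID.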

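3.0.3.1: The endpoint and parameters of the GET request can be broken down as follows: <br>
  >  `https://en.wikipedia.org/w/api.php` is the English Wikipedia API endpoint (the `en.` subdomain selects the English-language wiki) <br>
  >  `format=json` means that we want the response to be JSON <br>
  >  `action=query` signifies that we want to fetch data from a wiki <br>
  >  `titles=Batman` means that we want data on the page titled "Batman" <br>
  >  `prop=revisions` signifies that we want revision information for that page; unless more are requested, only the latest revision is returned <br>
  >  `rvprop=content` means that each returned revision should include its content, i.e. the full markup of the page at that revision <br>
  >
  >  Without `rvprop=content`, the response still lists the latest revision, but only with its default metadata (such as IDs, timestamp, user and edit comment) and not the markup.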

In [2]:
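# 3.0.3.2
import requests as rq

bman = rq.get("https://en.wikipedia.org/w/api.php?format=json&action=query&titles=Batman&prop=revisions&rvprop=content").json()
# The page ID key is not known in advance, so take the single entry in 'pages'
page = next(iter(bman['query']['pages'].values()))
# In this response format the markup of the latest revision sits under the '*' key
markup = page['revisions'][0]['*']
# Save only the markup, as the exercise asks, using UTF-8 to keep special characters
with open('batman.txt', 'w', encoding='utf-8') as f:
    f.write(markup)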

### Part 1: Get data (main part)

For a good part of this course we will be working with data from Wikipedia. Today, your objective is to crawl a large dataset with good and bad characters from the Marvel universe.

>**Ex. 3.1.1**: From the Wikipedia API, get a list of all Marvel superheroes and another list of all Marvel supervillains. Use the `get_categorymembers` function below to get the characters in each category: 'Category:Marvel_Comics_supervillains' and 'Category:Marvel_Comics_superheroes'. Make sure you spend some time understanding the code.  How is the query formed?  Why does it take that form?  It will help to look at the [Categorymembers API](https://www.mediawiki.org/wiki/API:Categorymembers).  Moreover, understand the need for the while loop and the role played by the `cmcontinue` variable and query argument.

>After you've obtained the lists for superheroes and supervillains, write some code to answer:
1. How many characters are *ambiguous*, i.e. are both heroes and villains? What is the [Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index) between the two groups?
2. How many superheroes are there? How many supervillains? Do not include ambiguous characters in these counts!
>
>*Hint: Google something like "get list all pages in category wikimedia api" if you're struggling with the query. Also, you may notice that not only Marvel character pages are returned, but also names of subcategories. For now just ignore this and treat them as if they are also characters.*
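
For reference, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. A minimal sketch on toy sets (the names below are placeholders, not the real category contents, and the function name `jaccard` is chosen here for illustration):

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)

# Toy sets for illustration only
toy_heroes = {"Spider-Man", "Deadpool", "Hulk"}
toy_villains = {"Deadpool", "Thanos"}

# 1 shared name out of 4 distinct names
print(jaccard(toy_heroes, toy_villains))  # 0.25
```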

In [3]:
# Use this function that gets the categorymembers of a category
def get_categorymembers(category):
    members = []
    cmcontinue = ""
    while True:

        # Download data
        data = rq.get('https://en.wikipedia.org/w/api.php?format=json&action=query&list=categorymembers&cmtitle=%s&cmlimit=max&cmcontinue=%s' % (category, cmcontinue)).json()    
        #print(data)
        
        # Add member titles
        members.extend(
            [m['title'] for m in data['query']['categorymembers']]
        )

        # If there is a 'continue' key in `data` then fetch the next 'cmcontinue' value
        if 'continue' in data:
            cmcontinue = data['continue']['cmcontinue']

        # Otherwise break
        else:
            break
            
    return members
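
The continuation logic can be tried out offline. The sketch below replaces the network call with a made-up function `fake_fetch` that serves two hard-coded batches; only the shape of the dictionaries (a `query` key, and a `continue` key while more results remain) mirrors what the function above relies on.

```python
def fake_fetch(cmcontinue=""):
    """Stand-in for one API call: one batch of results plus a token if more remain."""
    batches = {
        "": {"query": {"categorymembers": [{"title": "A"}, {"title": "B"}]},
             "continue": {"cmcontinue": "page2"}},
        "page2": {"query": {"categorymembers": [{"title": "C"}]}},
    }
    return batches[cmcontinue]

def collect_all(fetch):
    """Keep fetching batches until the response no longer has a 'continue' key."""
    members = []
    cmcontinue = ""
    while True:
        data = fetch(cmcontinue)
        members.extend(m["title"] for m in data["query"]["categorymembers"])
        if "continue" not in data:
            break  # last batch reached
        # Pass the token back so the next call resumes where this one stopped
        cmcontinue = data["continue"]["cmcontinue"]
    return members

all_members = collect_all(fake_fetch)
print(all_members)  # ['A', 'B', 'C']
```

Without the loop, only the first batch would ever be collected, which is why a single request is not enough for large categories.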

In [6]:
villains = set(get_categorymembers('Category:Marvel_Comics_supervillains'))
heroes = set(get_categorymembers('Category:Marvel_Comics_superheroes'))
# print the ambiguous characters:
ambiguous_characters = villains & heroes
jaccard = len(ambiguous_characters) / len(heroes | villains)
print('There are %i ambiguous characters, and the heroes and villains have a Jaccard Index of %f' % (len(ambiguous_characters), jaccard))
true_heroes = heroes - villains
true_villains = villains - heroes
print('there are %i \'true\' heroes and %i \'true\' villains' % (len(true_heroes), len(true_villains)))

There are 113 ambiguous characters, and the heroes and villains have a Jaccard Index of 0.094482
there are 443 'true' heroes and 640 'true' villains


>**Ex. 3.1.2**: Using these three lists you now want to download all data you can about each character and store it on your hard drive.
* Create three folders in your working directory, one for *heroes*, one for *villains*, and one for *ambiguous*.
* For each character, download the markup on their pages (just like you did for Batman in 3.0.3) and save in a new file in the corresponding hero/villain/ambiguous folder.  Use the character's name as the filename.
* **Importantly** do not put ambiguous characters into the hero or villains folder!
>
>*Hint: Some of the characters have funky names. The first problem you may encounter is with encoding. To solve that you can call `.encode('utf-8')` on your markup string. Another problem you may encounter is that some characters have a slash in their names. You should just replace the slash with some other meaningful character.*
>Once your code starts running, it will take some time to download the data and create the files (30-40 minutes on my computer).  You might wish you had a measure of progress while the code is running, something like a progress bar.  Look no further than `tqdm`.  Here's an [example](https://www.geeksforgeeks.org/python-how-to-make-a-terminal-progress-bar-using-tqdm/) of how to download and use it.
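
The two pitfalls from the hint (slashes in names and non-ASCII characters) can be handled in one small helper. This is a sketch under assumptions: the function names `safe_filename` and `save_markup` are invented, the dash is just one possible replacement character, and the markup string is a placeholder.

```python
import re
import tempfile
from pathlib import Path

def safe_filename(title):
    """Replace path separators so the title can be used as a filename."""
    return re.sub(r"[/\\]", "-", title)

def save_markup(folder, title, markup):
    """Write `markup` to <folder>/<safe title>.txt as UTF-8 and return the path."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    path = folder / (safe_filename(title) + ".txt")
    path.write_text(markup, encoding="utf-8")
    return path

# Try it out in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    path = save_markup(Path(tmp) / "villains",
                       "Señor Muerte / Señor Suerte",
                       "'''Señor Muerte''' ...")
    saved_name = path.name
    saved_text = path.read_text(encoding="utf-8")

print(saved_name)
```

Passing `encoding="utf-8"` when writing and reading keeps accented characters intact without having to call `.encode` on the string yourself.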

In [55]:
import os
import re
from tqdm import tqdm

for cat_name, character_names in (('villains', true_villains), ('heroes', true_heroes), ('ambiguous', ambiguous_characters)):
    # Make sure the target folder exists before writing into it
    os.makedirs(f'marvel_characters/{cat_name}', exist_ok=True)
    for character_name in tqdm(character_names, desc=cat_name):
        # Query with the original title; requests takes care of URL-encoding it
        params = {'format': 'json', 'action': 'query', 'titles': character_name, 'prop': 'revisions', 'rvprop': 'content'}
        page_json = rq.get('https://en.wikipedia.org/w/api.php', params=params).json()
        # Extract the markup of the latest revision (the page ID key is not known in advance)
        page = next(iter(page_json['query']['pages'].values()))
        markup = page['revisions'][0]['*']
        # str.replace does not understand regex patterns, so use re.sub to swap slashes for dashes
        file_name = re.sub(r'[/\\]', '-', character_name)
        with open(f'marvel_characters/{cat_name}/{file_name}.txt', 'w', encoding='utf-8') as f:
            f.write(markup)

In [49]:
#print('ambiguous: ' + str(ambiguous_characters) + '\n\n\n')
print('villains' + str(true_villains) + '\n\n\n')
#print('heroes' + str(true_heroes))
import re
# str.replace treats its first argument literally, so use re.sub for the character class
cleaned = re.sub(r'[/\\]', '-', 'hello \\ how are you Señor Suerte')
print(cleaned)

villains{'U-Man', 'Baron Mordo', 'Demon Bear', 'Maestro (character)', 'Marduk Kurios', 'Benedict Kine', "Rl'nnd", 'Marco Delgado (comics)', 'Occulus', 'Ego the Living Planet', 'Cletus Kasady', 'Impala (Marvel Comics)', 'Mojo (comics)', 'Stygyro', 'Zarek (comics)', 'Doomsday Man', 'Ripjak', 'Man-Elephant', 'Riptide (Marvel Comics)', 'Rev (comics)', 'Super Sabre (comics)', 'Shriek (character)', 'Shepard (comics)', 'Auntie Freeze', 'Ghaur', 'Masque (comics)', 'Mike Asher', 'Awesome Android', 'Titannus', 'Jackal (Marvel Comics character)', 'Equinox (comics)', 'Big Wheel (character)', 'Alex (comics)', 'Spot (comics)', 'Randall Darby', 'Xorr the God-Jewel', 'Cyclone (Marvel Comics)', 'Eliminator (comics)', 'Ringer (comics)', 'Spymaster (character)', 'Baron Macabre', 'Master of the World (comics)', 'Roderick Kingsley', 'Black Knight (Nathan Garrett)', 'Visimajoris', 'Shockwave (comics)', 'Whiplash (Marvel Comics)', 'Crusader (Marvel Comics)', 'Tombstone (character)', 'Mercurio the 4-D Man', '

### Part 2: Explore data

#### Page lengths

>**Ex. 3.2.1**: Extract the length of the page of each character (to do so you will have to open the corresponding file) and plot the distribution of this variable for each class (heroes/villains/ambiguous). Can you say anything about the popularity of characters in the Marvel universe based on your visualization?
>
>*Hint: The simplest thing is to make a probability mass function, i.e. a normalized histogram. [My figure](https://github.com/lucian979/CarletonBD/blob/main/plots/ex3.2.1.pdf) looks like this. Use `plt.hist` on a list of page lengths, with the argument `density=True`. Other distribution plots are fine too, though.*
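
One subtlety is worth spelling out: `density=True` scales the bars so that their total *area* is 1, whereas a probability mass function has bar *heights* that sum to 1. With equal-width bins the two differ only by a constant factor. The sketch below computes both by hand with plain Python on made-up page lengths; the helper `histogram` is written here for illustration and only mimics the usual binning convention (half-open bins, last bin closed).

```python
def histogram(values, bin_edges):
    """Count values per [left, right) bin; the last bin also includes its right edge."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            left, right = bin_edges[i], bin_edges[i + 1]
            is_last = i == len(counts) - 1
            if left <= v < right or (is_last and v == right):
                counts[i] += 1
                break
    return counts

page_lengths = [1200, 3500, 800, 15000, 4200, 2500, 900, 7000]  # made-up lengths
edges = [0, 2000, 4000, 8000, 16000]

counts = histogram(page_lengths, edges)
total = sum(counts)
widths = [edges[i + 1] - edges[i] for i in range(len(counts))]

# Probability mass: fraction of pages in each bin (heights sum to 1)
fractions = [c / total for c in counts]
# Density: fraction divided by bin width (areas sum to 1)
densities = [f / w for f, w in zip(fractions, widths)]

print(counts)     # [3, 2, 2, 1]
print(fractions)  # [0.375, 0.25, 0.25, 0.125]
```

Because page lengths tend to span several orders of magnitude, unequal (for example logarithmically spaced) bins are a reasonable choice, and that is exactly the case where the distinction between fractions and densities matters.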

>**Ex. 3.2.2**: Find the 10 characters from each class with the longest Wikipedia pages. Visualize their page lengths with bar charts. Comment on the result.
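
Picking the longest pages is a small top-n selection. A minimal sketch, assuming the page lengths have already been collected in a dictionary from character name to length (the values below are made up, and `longest_pages` is a name chosen for illustration):

```python
import heapq

def longest_pages(lengths, n=10):
    """Return the n (name, length) pairs with the largest lengths, longest first."""
    return heapq.nlargest(n, lengths.items(), key=lambda item: item[1])

# Made-up page lengths for illustration
toy_lengths = {"A": 5200, "B": 18100, "C": 950, "D": 7400}

top_two = longest_pages(toy_lengths, n=2)
print(top_two)  # [('B', 18100), ('D', 7400)]
```

The resulting names and lengths can be passed straight to a bar chart, one chart per class.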

#### Alliances

>**Ex. 3.2.3**: In this exercise you want to find out the biggest alliances in the Marvel universe and their members. The data that will help you in doing this is in the *alliances*-field of the markup of each character -- open up a couple of character files and look for that field; get a sense for how the information is stored so that you can then write code to retrieve it. Below I suggest steps you can take to solve the problem if you get stuck.
* Use the regular expression `alliances[\w\W]+?\n` to extract the *alliances*-field of a character's markup.
* Use the regular expression `\[\[.+?[\]\|]` to extract links (i.e. each team) from the *alliances*-field.
* You want to store alliance names and the corresponding members (hint: use a `defaultdict`).
* Inspect your team names. Are there any that result from inconsistencies in the information on the pages? How do you deal with this?
* **Print the 10 largest alliances and their number of members.**
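
The steps above can be sketched end to end on a few made-up markup snippets. The character names and field contents below are placeholders standing in for the saved files; only the two regular expressions come from the exercise text.

```python
import re
from collections import defaultdict

# Made-up markup snippets standing in for the saved character files
markups = {
    "Example Hero": "| alliances = [[Avengers (comics)|Avengers]]<br>[[S.H.I.E.L.D.]]\n| powers = ...\n",
    "Example Villain": "| alliances = [[S.H.I.E.L.D.]]\n",
    "Loner": "| powers = ...\n",
}

alliances = defaultdict(set)  # team name -> set of member names
for name, markup in markups.items():
    fields = re.findall(r"alliances[\w\W]+?\n", markup)
    if not fields:
        continue  # this character has no alliances field
    for link in re.findall(r"\[\[.+?[\]\|]", fields[0]):
        team = link[2:-1]  # strip the leading "[[" and the trailing "]" or "|"
        alliances[team].add(name)

# Sort teams by number of members, largest first
largest = sorted(alliances.items(), key=lambda kv: len(kv[1]), reverse=True)
for team, members in largest[:10]:
    print(team, len(members))
```

Using the link target (the part before `|`) rather than the displayed label means differently labelled links to the same page are still grouped together, although inconsistencies such as redirects would need separate handling.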

In [None]:
#example of using regex
import re

for title in titles:
        
    # Need to replace / with - before loading file
    title = title.replace("/", "-")
        
    # Load character markup
    with open(f"../data/{folder}/{title}.txt") as fp:
        markup = fp.read()
    
    # Get alliance field
    alliances_field = re.findall(r"alliances[\w\W]+?\n", markup)
    #...

#### Timeline

>**Ex. 3.2.4 EXTRA**: We are interested in knowing if there is a time-trend in the debut of characters.
* Extract into three lists, debut years of heroes, villains, and ambiguous characters.
* Do all pages have a debut year? Do some have multiple? How do you handle these inconsistencies?
1. For heroes, villains and ambiguous characters separately, visualize the number of characters introduced over time. You choose how you want to visualize this data, but please comment on your choice.
2. Make a plot that shows what fraction of the characters introduced each year are heroes. Taken together, **comment on your visualizations** and what they say about the system you're analyzing.
>
>*Hint: The debut year is given on the debut row in the info table of a character's Wiki-page. There are many ways that you can extract this variable. You should try to have a go at it yourself, but if you are short on time, you can use this horribly ugly regular expression code:*<br><br>
*`re.findall(r"\d{4}\)", re.findall(r"debut.+?\n", markup_text)[0])[0][:-1]`*
>
> ***Will not be included in assignment. Worth up to 5 extra credit.***
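
The one-liner in the hint raises an `IndexError` when a page has no debut row or no year in it. A slightly more defensive sketch built on the same two regular expressions is shown below; the function name `debut_year` and the sample strings are made up for illustration.

```python
import re

def debut_year(markup):
    """Return the first four-digit year followed by ')' on the debut line, or None."""
    lines = re.findall(r"debut.+?\n", markup)
    if not lines:
        return None  # no debut row at all
    years = re.findall(r"\d{4}\)", lines[0])
    if not years:
        return None  # debut row present, but no year in parentheses
    return int(years[0][:-1])  # drop the trailing ")"

# Made-up samples covering the three cases
with_year = "| debut = ''Example Comics'' #1 (March 1963)\n"
without_year = "| debut = Unknown\n"
no_field = "| powers = ...\n"

print(debut_year(with_year))     # 1963
print(debut_year(without_year))  # None
print(debut_year(no_field))      # None
```

Returning `None` keeps the missing cases visible, so they can be counted and reported instead of silently crashing the loop. When a row lists several years, this sketch keeps only the first one, which is a choice worth commenting on in your answer.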