Scraping Wikipedia for Names of Climbers and Mountaineers
===

If you've ever worked with Docker you might have been pleasantly surprised by container names such as `furious_einstein`, `agitated_curie`, `wizardly_lovelace`, and `romantic_darwin`. 

The code responsible for generating names such as these [(moby/names-generator.go)](https://github.com/moby/moby/blob/master/pkg/namesgenerator/names-generator.go) works by pairing a long list of adjectives together with a list of last names. [<sup>1</sup>](#code-style)

I wanted to write a tool that generates similar strings, but with a climbing flair, such as `beautiful_honnold`, `amazing_sharma`, and `cool_hill`. The plan is to get the route setters at my local climbing gym to use it to name our indoor routes, so that my friends and I have an easier time referring to routes when we aren't on site.

However, I am much too lazy to write down all of these last names by hand. Instead, I wanted to automagically scrape as many of them as I could off the internet. Conveniently, Wikipedia has a page listing several well-known [climbers and mountaineers](https://en.wikipedia.org/wiki/List_of_climbers_and_mountaineers) that we can, of course, scrape.

# Getting the Raw Data

The aforementioned Wikipedia page, `https://en.wikipedia.org/wiki/List_of_climbers_and_mountaineers`, lists climbers and mountaineers by their last name, from A to Z.

To begin, we want to fetch the page and soupify it,

In [1]:
from bs4 import BeautifulSoup
import requests

list_of_climbers_url = 'https://en.wikipedia.org/wiki/List_of_climbers_and_mountaineers'

def get_html(url):
    return requests.get(url).text

soup = BeautifulSoup(get_html(list_of_climbers_url), 'lxml')

If you visit the [Wikipedia page](https://en.wikipedia.org/wiki/List_of_climbers_and_mountaineers) and inspect its source, you'll notice that it is composed of a bunch of unordered lists, and that the unordered lists that are interesting to us are all preceded by an `h2` containing a `span` whose `id` is a single uppercase letter.

Thus, if we iterate across all `h2`s that satisfy these conditions, we'll be able to grab all list entries within these unordered lists. Each such list entry is, conveniently, a climber.

We could write some code that grabs all of these list entries in the following manner,

In [2]:
import string # So we can check if the id is an uppercase letter

climber_lis = []

for h2 in soup.find_all('h2'):
    if h2.span: # Will be None (False) if there isn't a span to grab onto
        # Is a single letter in the range A..Z
        if len(h2.span['id']) == 1 and h2.span['id'] in string.ascii_uppercase:
            climber_lis.append(h2.find_next('ul').find_all('li'))

The above gives us a nested list of lists, such that the first sublist contains all climbers whose last name starts with the letter `A`, the second sublist contains all climbers whose last name starts with the letter `B`, and so on.

In [3]:
climber_lis[0][0] # The 'li' for Vitaly Abalakov

<li><a href="/wiki/Vitaly_Abalakov" title="Vitaly Abalakov">Vitaly Abalakov</a> (1906–1992) Russia, climbed <a href="/wiki/Lenin_Peak" title="Lenin Peak">Lenin Peak</a> (1934) and <a href="/wiki/Khan_Tengri" title="Khan Tengri">Khan Tengri</a> (1936)</li>

In [4]:
climber_lis[1][0] # The 'li' for Samina Baig

<li><a href="/wiki/Samina_Baig" title="Samina Baig">Samina Baig</a> - <a href="/wiki/Gilgit-Baltistan" title="Gilgit-Baltistan">Gilgit-Baltistan</a>, 3rd Pakistani and only Pakistani woman to climb <a href="/wiki/Mount_Everest" title="Mount Everest">Mount Everest</a></li>

However, while the code is short, and complete with some helpful comments, it could be made more legible with some convenience functions.

Adding such functions rarely takes long, and while it might interrupt our flow and cadence at times, it has been my experience that optimizing for development speed in the short term has a long-term cost whenever one has to re-read or change the code. I try to remind myself of the [Parable of the Road Line Painter](https://davembush.github.io/the-parable-of-the-road-line-painter/) whenever I find myself blazing ahead too quickly.

Even with a small project such as this one, it is usually worth the effort. The following convenience functions are enough to improve the legibility,

In [5]:
def has_span_attribute(h2):
    return h2.span is not None

def is_single_letter(s):
    return len(s) == 1

def is_uppercase_letter(c):
    import string
    return c in string.ascii_uppercase

def find_all_climbers(soup):
    climber_lis = []
    
    for h2 in soup.find_all('h2'):
        if has_span_attribute(h2):
            span_id = h2.span['id'] 
            if is_single_letter(span_id) and is_uppercase_letter(span_id):
                climber_lis.append(h2.find_next('ul').find_all('li'))
                
    return climber_lis
                
climber_lis = find_all_climbers(soup)
climber_lis[0][0] # Still works the same as before

<li><a href="/wiki/Vitaly_Abalakov" title="Vitaly Abalakov">Vitaly Abalakov</a> (1906–1992) Russia, climbed <a href="/wiki/Lenin_Peak" title="Lenin Peak">Lenin Peak</a> (1934) and <a href="/wiki/Khan_Tengri" title="Khan Tengri">Khan Tengri</a> (1936)</li>
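
One fragile spot in the loop above is `h2.span['id']`, which raises a `KeyError` if a heading's first `span` has no `id`. Below is a minimal sketch of a more defensive variant. The inline HTML is a made-up stand-in for the real page, and `find_climbers_by_letter` is a hypothetical name that is not part of the notebook. It uses the built-in `html.parser` so that it runs without `lxml`.

```python
import string

from bs4 import BeautifulSoup

# Made-up miniature of the page layout described above (not real Wikipedia markup).
SAMPLE_HTML = """
<h2><span class="no-id">Contents</span></h2>
<ul><li>Not a climber</li></ul>
<h2><span id="A">A</span></h2>
<ul><li><a href="/wiki/Vitaly_Abalakov">Vitaly Abalakov</a> Russia</li></ul>
<h2><span id="B">B</span></h2>
<ul><li><a href="/wiki/Samina_Baig">Samina Baig</a> Pakistan</li></ul>
<h2><span id="See_also">See also</span></h2>
<ul><li>Another list</li></ul>
"""


def find_climbers_by_letter(soup):
    """Hypothetical variant of find_all_climbers that tolerates spans without an id."""
    groups = []
    for h2 in soup.find_all('h2'):
        if h2.span is None:
            continue
        # Tag.get returns the default instead of raising KeyError when 'id' is missing.
        span_id = h2.span.get('id', '')
        if len(span_id) == 1 and span_id in string.ascii_uppercase:
            groups.append(h2.find_next('ul').find_all('li'))
    return groups


sample_soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
sample_groups = find_climbers_by_letter(sample_soup)
names_by_letter = [[li.a.text for li in group] for group in sample_groups]
```

Only the lists under the `A` and `B` headings survive; the heading without an `id` and the one whose `id` is longer than a single letter are skipped.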

## Reshaping the Data

The nested structure, wherein the list is grouped by letter, is not something we need, as we want to operate on the entire set of climbers. So we flatten it into a single list,

In [6]:
def flatten_list(l):
    from functools import reduce
    import operator
    
    return reduce(operator.add, l)

climber_lis = flatten_list(climber_lis)
climber_lis[0] # Still Vitaly

<li><a href="/wiki/Vitaly_Abalakov" title="Vitaly Abalakov">Vitaly Abalakov</a> (1906–1992) Russia, climbed <a href="/wiki/Lenin_Peak" title="Lenin Peak">Lenin Peak</a> (1934) and <a href="/wiki/Khan_Tengri" title="Khan Tengri">Khan Tengri</a> (1936)</li>
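
`reduce(operator.add, l)` works here, but it builds a new intermediate list for every sublist and raises a `TypeError` when given an empty list. As an alternative sketch (not what the notebook uses), `itertools.chain.from_iterable` flattens one level in a single pass. The name `flatten_one_level` is hypothetical and the sample data is made up.

```python
from itertools import chain


def flatten_one_level(nested):
    """Flatten a list of lists by exactly one level."""
    return list(chain.from_iterable(nested))


# Made-up stand-in for the per-letter groups of list entries.
grouped = [['Abalakov', 'Allain'], ['Baig'], [], ['Caldwell']]
flat = flatten_one_level(grouped)

# Unlike reduce without an initial value, an empty input is fine.
empty = flatten_one_level([])
```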

## Pruning Unwanted (Unfortunate) Entries

Now we want to filter away climbers with more complex last names, such as those that

1. Contain non-ASCII characters, as not all label makers support those characters.
2. Consist of several "words", as they are presumably
    1. long, and
    2. do not fit the Docker container name schema.

And so we want to create a predicate function for that.

But first, how do we extract the complete name of the climber from these list entries? First recall that `climber_lis` is a (now flat) list of all climbers. Most of these entries start off with a link to the respective Wikipedia page for that climber, and the link text is conveniently the name of that particular climber.

Therefore, by retaining all `li`s that have a link in them, we should have an easy time grabbing the names of the climbers, to which we later apply our predicate. And so, we begin by filtering out all list entries without a link,

In [7]:
def contains_a_href(climber_li):
    return climber_li.a is not None

climber_lis = list(filter(contains_a_href, climber_lis))

For the remaining entries, we will assume then that the full name of the climber is in the text portion of the link, like so:

In [8]:
def full_name(climber_li):
    assert(contains_a_href(climber_li))
    return climber_li.a.text

Now we can define the predicate for whether the last name of a climber is simple enough for us to use,

In [9]:
def can_be_printed_on_any_label_maker(s):
    return s.isascii()

def consists_of_two_words(s):
    return len(s.split(' ')) == 2

def last_name(full_name):
    return full_name.split(' ')[1]

def has_simple_lastname(full_name):
    return consists_of_two_words(full_name) and can_be_printed_on_any_label_maker(last_name(full_name))
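
To make the predicate's behaviour concrete, here is a small self-contained check. The helpers are repeated so that the block runs on its own, and the names are hand-picked examples rather than scraped data. Note that only the last name is tested for ASCII, and that `str.isascii` requires Python 3.7 or newer.

```python
def can_be_printed_on_any_label_maker(s):
    return s.isascii()

def consists_of_two_words(s):
    return len(s.split(' ')) == 2

def last_name(full_name):
    return full_name.split(' ')[1]

def has_simple_lastname(full_name):
    # The word-count check runs first, so last_name is never called on a one-word name.
    return (consists_of_two_words(full_name)
            and can_be_printed_on_any_label_maker(last_name(full_name)))

# Hand-picked examples, not taken from the scraped list.
samples = [
    'Tommy Caldwell',       # two words, ASCII last name
    'Gaston Rébuffat',      # non-ASCII last name
    'Chloé Graftiaux',      # non-ASCII first name only
    'Carlos Soria Fontán',  # three words
    'Messner',              # one word
]
results = {name: has_simple_lastname(name) for name in samples}
```

With these inputs, only `'Tommy Caldwell'` and `'Chloé Graftiaux'` pass: a non-ASCII first name does not disqualify an entry, since the first name is never printed.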

Now, if all we wanted to do was grab all the last names, toss them into a list, and start pairing them with a bunch of adjectives, we'd be almost done at this point. Really, this is all we'd have left to do:

In [10]:
full_names = map(full_name, climber_lis)
printable_names = filter(has_simple_lastname, full_names)
all_last_names = list(map(last_name, printable_names))
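# Every remaining last name should be a single word. (Asserting a bare
# generator expression always passes, so it needs to be wrapped in all().)
assert all(len(name.split(' ')) == 1 for name in all_last_names)

# You can print out all_last_names if you want, but I'm just going to show you a few values for the benefit of
# static rendering on GitHub,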
all_last_names[0:5]

['Abalakov', 'Abalakov', 'Agarwal', 'Allain', 'Almer']

If all you want are the last names, you are done here!
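
With the last names in hand, the Docker-style pairing described at the top can be sketched in a few lines. The adjective list, the sample last names, and the function name `random_route_name` are all placeholders for illustration. This is a sketch of the same idea, not a port of moby's code: join a random adjective and a lowercased last name with an underscore.

```python
import random

# Placeholder inputs: swap in your own adjectives and the scraped all_last_names.
ADJECTIVES = ['beautiful', 'amazing', 'cool', 'crimpy', 'pumpy']
SAMPLE_LAST_NAMES = ['Abalakov', 'Agarwal', 'Allain', 'Almer']


def random_route_name(adjectives, last_names, rng=random):
    """Return a name such as 'cool_allain' from one adjective and one last name."""
    return f'{rng.choice(adjectives)}_{rng.choice(last_names).lower()}'


# A seeded generator makes the sequence of names reproducible between runs.
rng = random.Random(2024)
route_names = [random_route_name(ADJECTIVES, SAMPLE_LAST_NAMES, rng) for _ in range(3)]
```

Passing the random generator in as a parameter keeps the function easy to test, and makes it possible to reprint the same set of labels later.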

# Grabbing More than Just Names

But, if you have a look at [(moby/names-generator.go)](https://github.com/moby/moby/blob/master/pkg/namesgenerator/names-generator.go) you'd see that each person there has an associated description, like so,

```
// names-generator.go
...
// Sophie Wilson designed the first Acorn Micro-Computer and the instruction set for ARM processors. https://en.wikipedia.org/wiki/Sophie_Wilson
"wilson",

// Jeannette Wing - co-developed the Liskov substitution principle. - https://en.wikipedia.org/wiki/Jeannette_Wing
"wing",

// Steve Wozniak invented the Apple I and Apple II. https://en.wikipedia.org/wiki/Steve_Wozniak
"wozniak",
...
```

and right now, our `li`s already contain descriptions of this kind.

Wouldn't it be neat if we could keep the descriptions around, so that when we render out our route names later we can render out the description for that climber as well? I think so, and I'd very much like it if we did just that. To accomplish this, we have to put in a bit more effort. We want to perform the same filtering as we did, but keep around the original `li` so we can grab the description later.

This is something we can accomplish easily with a dictionary comprehension,

In [11]:
last_names_with_li = {last_name(full_name(li)): li for li in climber_lis if has_simple_lastname(full_name(li))}

# Again, this is a pretty "big" dataset, in the sense that it looks kind of bad if we render it all out, so
# let's just look at a single item to see what we have,
last_names_with_li['Caldwell']

<li><a href="/wiki/Tommy_Caldwell" title="Tommy Caldwell">Tommy Caldwell</a> (born 1978) US, rock climber, free climbed Nose of El Capitan</li>

The output above is the raw `li` element. What we would like instead is a tuple of two elements. Namely, we want to create a tuple consisting of

1. The link target, i.e. `'/wiki/Tommy_Caldwell'`, and
2. the complete contents of the `li`, i.e. `'<a href="/wiki/Tommy_Caldwell" title="Tommy Caldwell">Tommy Caldwell</a> (born 1978) US, rock climber, free climbed Nose of El Capitan'`.

How do we do this?

## Grabbing the Link Target

Grabbing the link target is super easy, almost not worth a subsection to be honest,

In [12]:
last_names_with_li['Caldwell'].a['href']

'/wiki/Tommy_Caldwell'

## Grabbing the `li` Contents

Getting the contents is _almost_ as easy. Notice that the code below returns a list,

In [13]:
last_names_with_li['Caldwell'].contents

[<a href="/wiki/Tommy_Caldwell" title="Tommy Caldwell">Tommy Caldwell</a>,
 ' (born 1978) US, rock climber, free climbed Nose of El Capitan']

That seems like almost what we want, just not quite. The keen observer will note that the first element is rendered without surrounding quotes, so this is clearly not just a list of strings. But what is it a list of, then? By mapping the `type` function across the list we can figure this out,

In [14]:
list(map(type, last_names_with_li['Caldwell'].contents))

[bs4.element.Tag, bs4.element.NavigableString]

Aha, it's a list of BeautifulSoup objects: a `Tag` and a `NavigableString`. Conveniently for us, the BeautifulSoup documentation describes how to get the raw HTML of such objects [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#non-pretty-printing). We simply need to map the `str` function over the elements and join them together, i.e. we want to do the following,

In [15]:
''.join(map(str, last_names_with_li['Caldwell'].contents))

'<a href="/wiki/Tommy_Caldwell" title="Tommy Caldwell">Tommy Caldwell</a> (born 1978) US, rock climber, free climbed Nose of El Capitan'
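
Putting the two pieces together gives the tuple described earlier. The sketch below wraps both steps in a hypothetical helper, `link_and_description`, and runs it on a single hard-coded `li` (copied from the output above) so that it works without fetching the page. It uses the built-in `html.parser` rather than `lxml`.

```python
from bs4 import BeautifulSoup

# One list entry, copied from the notebook output, so no network access is needed.
CALDWELL_LI = (
    '<li><a href="/wiki/Tommy_Caldwell" title="Tommy Caldwell">Tommy Caldwell</a>'
    ' (born 1978) US, rock climber, free climbed Nose of El Capitan</li>'
)


def link_and_description(li):
    """Return (link target, inner HTML) for a list entry that contains a link."""
    inner_html = ''.join(map(str, li.contents))
    return li.a['href'], inner_html


caldwell_li = BeautifulSoup(CALDWELL_LI, 'html.parser').li
link, description = link_and_description(caldwell_li)
```

Applied to the whole dictionary, this could look like `{name: link_and_description(li) for name, li in last_names_with_li.items()}`. Keep in mind that the dictionary is keyed by last name, so climbers who share one (the list starts with two Abalakovs) overwrite each other and only the last entry per name survives.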

# Footnotes

<span id="code-style">Note 1:</span> To me, this bit of code is _just_ right. No unnecessary complexity: it's simple and to the point, and anyone with a modicum of programming experience could understand it, even if they do not know Go, the programming language that the code is written in.