# Collect Nobel Prize Laureates

We will begin by scraping a [Wikipedia](https://en.wikipedia.org/wiki/Main_Page) list of [Nobel Laureates in Physics from 1901 to 2017](https://en.wikipedia.org/w/index.php?title=List_of_Nobel_laureates_in_Physics&oldid=862097595). Nobel Laureates in Chemistry are included in the nomination process for the Nobel Prize in Physics, and some of them had or have a professional or personal relationship with one or more Physics Nobel Laureates. Therefore, to examine how these relationships might affect the awarding of the Nobel Prize in Physics, we will also scrape a Wikipedia list of [Nobel Laureates in Chemistry from 1901 to 2017](https://en.wikipedia.org/w/index.php?title=List_of_Nobel_laureates_in_Chemistry&oldid=860639110). You should recognize a few of the more famous names on both lists, even if you do not recognize them all. OK, time to get scraping.

## Setting up the Environment

An initialization step is needed to set up the environment:
- An environment variable needs to be set to disable loading of `user-config.py` for [pywikibot](https://github.com/wikimedia/pywikibot).

In [None]:
%env PYWIKIBOT_NO_USER_CONFIG=1

In [None]:
import os

import numpy as np
import pandas as pd
import pywikibot as pwb
import wikitextparser as wtp

from src.data.url_utils import urls_progress_bar
from src.data.wiki_utils import FORCED_REDIRECTS
from src.data.wiki_utils import get_redirected_titles

## Scraping the Nobel Physics Laureates

We will use a combination of [pywikibot](https://github.com/wikimedia/pywikibot) and [wikitextparser](https://github.com/5j9/wikitextparser) to scrape the laureate data from the tables in the Wikipedia pages above and store it in a [pandas](https://pandas.pydata.org/) DataFrame. Later, we will need to fetch data about the physicists from [DBpedia](https://wiki.dbpedia.org/about) using the links in the table, so [requests](http://docs.python-requests.org/en/master/) and [beautifulsoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) will be used to obtain these links. An important point is that we actually need to send HTTP requests to fetch the linked pages, because some of them redirect to different URLs. The tricky part is that the redirects are done via JavaScript, so they are not detected by requests. As a result, we have to parse the JavaScript to find the redirect link. Even after all of this, some of the redirected Wikipedia links do not match the DBpedia links, which results in the wrong resource being retrieved. To avoid this, we manually force the correct redirects now.
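To make the redirect handling described above more concrete, here is a minimal sketch. It is not the implementation of `get_redirected_titles` in `src.data.wiki_utils`. It assumes that the redirect target is embedded in the page's inline JavaScript under a `wgInternalRedirectTargetUrl` key, and the helper names `find_js_redirect_title` and `resolve_title`, as well as the sample HTML and titles, are hypothetical.

```python
import re
from urllib.parse import unquote

# Assumption: the redirect target appears in the page's inline JavaScript
# config as "wgInternalRedirectTargetUrl":"/wiki/<Target_Page>".
_REDIRECT_RE = re.compile(r'"wgInternalRedirectTargetUrl":"([^"]+)"')


def find_js_redirect_title(html):
    """Return the redirected page title found in `html`, or None."""
    match = _REDIRECT_RE.search(html)
    if match is None:
        return None
    # keep only the page name after "/wiki/" and turn it into a title
    name = match.group(1).split('/wiki/', 1)[-1]
    return unquote(name).replace('_', ' ')


def resolve_title(title, html, forced_redirects):
    """Prefer a manually forced redirect, then a detected one, then the title."""
    if title in forced_redirects:
        return forced_redirects[title]
    return find_js_redirect_title(html) or title


# Hypothetical page source and titles, for illustration only.
sample_html = ('<script>RLCONF={"wgInternalRedirectTargetUrl":'
               '"/wiki/Target_Page"};</script>')
print(resolve_title('Old Title', sample_html, {}))  # Target Page
print(resolve_title('Old Title', sample_html,
                    {'Old Title': 'Forced Title'}))  # Forced Title
```

Checking the forced mapping first means a manual correction always wins over whatever the page itself reports, which is the behaviour needed when the Wikipedia redirect and the DBpedia resource disagree.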

Let's start with the physics laureates.

In [None]:
def get_nobel_laureates(code='en', fam='wikipedia',
                        title='List of Nobel laureates in Physics',
                        oldid=None):
    """Get a table of Nobel Laureates from Wikipedia.

    Args:
        code (str): Language code as defined by `pywikibot`.
        fam (str): Family name or object as defined by `pywikibot`.
        title (str): The title of the Wikipedia page as defined by `pywikibot`.
            This is essentially the path name of the URL with any underscores
            replaced by spaces.
            e.g. `https://en.wikipedia.org/wiki/List_of_Nobel_laureates_in_Physics`
            has title `List of Nobel laureates in Physics`.
        oldid (int): The revid of the desired revision of the page. If
            `None`, the latest revision is fetched.

    Returns:
        pandas.DataFrame: DataFrame containing the table information.

    """
    site = pwb.Site(code=code, fam=fam)
    return _get_nobel_laureates(site, title, oldid=oldid)


def _get_nobel_laureates(site, title, oldid=None):
    code = _get_page_wikicode(site, title, oldid=oldid)
    table = code.tables[0].data()
    laureates = pd.DataFrame(table[1:], columns=[
        'Year', 'Image', 'Laureate', 'Country', 'Rationale', 'Ref'])
    return laureates


def _get_page_wikicode(site, page_title, oldid=None):
    page = pwb.Page(site, page_title)
    if oldid:
        text = page.getOldVersion(oldid=oldid)
    else:
        text = page.get()
    return wtp.parse(text)

In [None]:
physics_laureates = get_nobel_laureates(title='List of Nobel laureates in Physics', oldid=862097595)
physics_laureates.head(30)

This table is extremely messy and contains lots of wiki markup, so let's clean it up.
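To illustrate the kind of markup involved, the following sketch replaces each wikilink with its display text. It deliberately uses a plain regular expression instead of the `wikitextparser`-based approach used in this notebook, and the function name `strip_wikilinks_sketch` and the sample strings are only illustrative.

```python
import re

# [[target]] or [[target|display text]]; nested links are not handled
_WIKILINK_RE = re.compile(r'\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]')


def strip_wikilinks_sketch(markup):
    """Replace each wikilink with its display text (or its target)."""
    def _display_text(match):
        target, text = match.group(1), match.group(2)
        return text if text else target
    return _WIKILINK_RE.sub(_display_text, markup)


print(strip_wikilinks_sketch('[[Target page|Display text]]'))  # Display text
print(strip_wikilinks_sketch('the discovery of [[X-ray]]s'))   # the discovery of X-rays
```

A regular expression is enough for simple cells like these, but a real wikitext parser copes better with nested templates and links, which is presumably why the notebook relies on one.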

In [None]:
def clean_laureates_dataframe(table, progress_bar=None):
    """Cleanup a table of Nobel Laureates from Wikipedia.

    Args:
        table (pandas.DataFrame): DataFrame containing the table
            information.
        progress_bar (progressbar.ProgressBar): Progress bar.

    Returns:
        pandas.DataFrame: DataFrame containing the cleaned-up table
            information.

    """

    # drop uninteresting columns
    cleaned_table = table.copy().drop('Image', axis=1)
    cleaned_table = cleaned_table.drop('Ref', axis=1)

    # cleanup the columns
    cleaned_table['Year'] = cleaned_table.Year.apply(
        _strip_wikicode).astype('int64')
    cleaned_table['Laureate'] = cleaned_table.Laureate.apply(_clean_laureate)
    cleaned_table['Rationale'] = cleaned_table.Rationale.apply(_strip_wikicode)
    cleaned_table['Country'] = cleaned_table.Country.apply(_strip_wikicode)
    cleaned_table['Country'] = cleaned_table.Country.apply(_clean_country)

    # set the fields to NaN for years in which the prize was not awarded
    cleaned_table.loc[cleaned_table.Rationale.str.contains('Not awarded'),
                      ['Laureate', 'Country', 'Rationale']] = np.nan

    # get the redirect title (if any) from an HTTP request
    laureates = cleaned_table.Laureate.values.tolist()
    redirected_titles = get_redirected_titles(
        laureates,
        forced_redirects=FORCED_REDIRECTS,
        max_workers=10,
        progress_bar=progress_bar)
    cleaned_table['Laureate'] = cleaned_table.Laureate.apply(
        lambda title: redirected_titles[title] if isinstance(title, str) else title)
    
    return cleaned_table


def _clean_laureate(markup):
    # return the target of the first wikilink, otherwise the cell as is
    links = wtp.parse(markup).wikilinks
    if links:
        return links[0].target
    return markup


def _strip_wikicode(markup):
    # store the text and target of all wikilinks
    wikilinks = {}
    parsed = wtp.parse(markup)
    for link in parsed.wikilinks:
        if link.text:
            wikilinks[link.string] = link.text
        else:
            wikilinks[link.string] = link.target

    # replace all wikilinks with the associated text
    stripped_string = markup
    for link_markup in wikilinks:
        stripped_string = stripped_string.replace(
            link_markup, wikilinks[link_markup])
    return stripped_string


def _strip_wikilinks(markup):
    return markup.replace('[[', '').replace(']]', '')


def _clean_country(markup):
    countries = []
    templates = wtp.parse(markup).templates
    for template in templates:
        if template.name == 'flag' or template.name == 'flagcountry':
            for argument in template.arguments:
                try:
                    int(argument.value)  # skip if string is a year
                except ValueError:
                    countries.append(argument.value)

    # e.g. "Not awarded" case
    if not countries:
        return markup
    return '|'.join(countries)
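The country clean-up above keeps every argument of a `flag` or `flagcountry` template except those that look like a year. The sketch below shows the same idea on sample strings, using a regular expression in place of `wikitextparser`. The function name `extract_countries_sketch` and the sample markup are hypothetical.

```python
import re

# matches {{flag|...}} and {{flagcountry|...}} and captures the arguments
_FLAG_RE = re.compile(r'\{\{\s*(?:flag|flagcountry)\s*\|([^{}]*)\}\}')


def extract_countries_sketch(markup):
    """Join the non-numeric flag template arguments with '|'."""
    countries = []
    for arguments in _FLAG_RE.findall(markup):
        for value in arguments.split('|'):
            value = value.strip()
            if value and not value.isdigit():  # skip years such as 1903
                countries.append(value)
    # e.g. the "Not awarded" case has no flag templates at all
    return '|'.join(countries) if countries else markup


print(extract_countries_sketch('{{flag|Country A}} {{flagcountry|Country B|1903}}'))
# Country A|Country B
print(extract_countries_sketch('Not awarded'))  # Not awarded
```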

In [None]:
NUM_URLS = 213
physics_laureates = clean_laureates_dataframe(physics_laureates, urls_progress_bar(NUM_URLS))

In [None]:
with pd.option_context('display.max_rows', 300, 'display.max_colwidth', 500):
    display(physics_laureates)

This looks much better, but let's do a sanity check. The Nobel Prize in Physics:

- Has been awarded 111 times between 1901 and 2017
- to 207 Nobel Laureates (206 distinct individuals)
- *John Bardeen* is the only Nobel Laureate who has been awarded the Nobel Prize in Physics twice, in 1956 and 1972
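The quantities in this checklist can be sketched on a toy table using only the standard library. The rows and names below are invented for illustration and are not taken from the scraped data.

```python
from collections import Counter

# Toy (year, laureate) rows; None marks a year the prize was not awarded.
rows = [
    (2001, 'Laureate A'),
    (2001, 'Laureate B'),
    (2002, None),
    (2003, 'Laureate A'),
]

# drop the years without an award, like `dropna()` does for the DataFrame
awarded = [(year, name) for year, name in rows if name is not None]

num_award_years = len({year for year, _ in awarded})  # times awarded
num_laureates = len(awarded)                          # laureate entries
counts = Counter(name for _, name in awarded)
num_individuals = len(counts)                         # distinct individuals
repeat_winners = sorted(name for name, n in counts.items() if n > 1)

print(num_award_years, num_laureates, num_individuals, repeat_winners)
# 2 3 2 ['Laureate A']
```

Note that each count is computed first and only then compared with the expected value. Taking the length of an element-wise comparison would instead give the number of rows, which is truthy for any non-empty table.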

In [None]:
# number of years in which the prize was awarded
assert len(physics_laureates.dropna().Year.unique()) == 111
# every year from 1901 to 2017 is present
assert np.array_equal(physics_laureates.Year.unique(), np.array(range(1901, 2018)))
# number of laureates and the single double winner
assert len(physics_laureates.dropna().Laureate) == 207
assert len(physics_laureates.loc[physics_laureates.Laureate == 'John Bardeen']) == 2
assert np.array_equal(physics_laureates.loc[physics_laureates.Laureate == 'John Bardeen'].Year,
                      [1956, 1972])
# number of distinct individuals
assert len(physics_laureates.dropna().Laureate.unique()) == 206

It looks good, so let's write the data to a CSV file for later use.

In [None]:
physics_laureates.to_csv('../data/raw/nobel-physics-prize-laureates.csv', index=False)

## Scraping the Nobel Chemistry Laureates

Let's use the functions we created above, but this time for the chemistry laureates.

In [None]:
chemistry_laureates = get_nobel_laureates(title='List of Nobel laureates in Chemistry', oldid=860639110)
chemistry_laureates.head(30)

In [None]:
NUM_URLS = 186
chemistry_laureates = clean_laureates_dataframe(chemistry_laureates, urls_progress_bar(NUM_URLS))

In [None]:
with pd.option_context('display.max_rows', 200, 'display.max_colwidth', 500):
    display(chemistry_laureates)

This looks good, but let's do a sanity check. The Nobel Prize in Chemistry:

- Has been awarded 109 times between 1901 and 2017
- to 178 Nobel Laureates (177 distinct individuals)
- Frederick Sanger is the only Nobel Laureate who has been awarded the Nobel Prize in Chemistry twice, in 1958 and 1980

In [None]:
# number of years in which the prize was awarded
assert len(chemistry_laureates.dropna().Year.unique()) == 109
# every year from 1901 to 2017 is present
assert np.array_equal(chemistry_laureates.Year.unique(), np.array(range(1901, 2018)))
# number of laureates and the single double winner
assert len(chemistry_laureates.dropna().Laureate) == 178
assert len(chemistry_laureates.loc[chemistry_laureates.Laureate == 'Frederick Sanger']) == 2
assert np.array_equal(chemistry_laureates.loc[chemistry_laureates.Laureate == 'Frederick Sanger'].Year,
                      [1958, 1980])
# number of distinct individuals
assert len(chemistry_laureates.dropna().Laureate.unique()) == 177

It looks good, so let's write the data to a CSV file for later use.

In [None]:
chemistry_laureates.to_csv('../data/raw/nobel-chemistry-prize-laureates.csv', index=False)

## Cleaning Up

A cleanup step is needed:

- Unset the environment variable that was set above.

In [None]:
del os.environ['PYWIKIBOT_NO_USER_CONFIG']
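Outside a notebook, setting and unsetting the variable in separate steps can be replaced by a small context manager that restores the previous state even if an error occurs in between. This is only a sketch of one possible design, and `temporary_env` is a hypothetical helper, not part of `pywikibot` or of this project.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(name, value):
    """Set an environment variable and restore its old state on exit."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if previous is None:
            # the variable did not exist before, so remove it again
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous


with temporary_env('PYWIKIBOT_NO_USER_CONFIG', '1'):
    # code that relies on the variable would run here
    print(os.environ['PYWIKIBOT_NO_USER_CONFIG'])  # 1
```

Because `pop` is called with a default, the cleanup does not raise a `KeyError` if the variable has already been removed, unlike a bare `del os.environ[...]`.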