# Collect Physicists

For this project, I need a list of physicists who are notable for their achievements. Wikipedia contains two such lists: a general [list of physicists](https://en.wikipedia.org/wiki/List_of_physicists) and a [list of theoretical physicists](https://en.wikipedia.org/wiki/List_of_theoretical_physicists). I will scrape both lists and merge them into a single list. Note that some of these physicists have won the *Nobel Prize* and some have not, and that some are *dead* and some are *alive*. You should recognize at least a few of the more famous names in the list even if you do not recognize them all. The entire analysis in this project is based on the data acquired for these physicists. OK, time to get scraping.

## Setting up the Environment

An initialization step is needed to set up the environment:
- The locale needs to be set for all categories to the user’s default setting (typically specified in the LANG environment variable) to enable correct sorting of physicists' names with accents.

In [None]:
import locale
import string
import time
from urllib import parse

from bs4 import BeautifulSoup
from bs4.element import NavigableString
import numpy as np
import progressbar as pb
import requests

In [None]:
locale.setlocale(locale.LC_ALL, '')
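
To illustrate why the locale matters for sorting, here is a minimal sketch that compares a plain code-point sort with a `locale.strxfrm` sort. The sample names and the helper `sort_names` are illustrative assumptions, not part of the scraper. The locale-aware order depends on the locale available on the machine, so no expected output is shown for it.

```python
import locale


def sort_names(names):
    """Return names sorted by the active locale's collation rules.

    Hypothetical helper used only for this illustration.
    """
    return sorted(names, key=locale.strxfrm)


try:
    # Use the user's default locale, as in the cell above.
    locale.setlocale(locale.LC_ALL, '')
except locale.Error:
    # Fall back to the portable "C" locale if the default is unavailable.
    locale.setlocale(locale.LC_ALL, 'C')

# Illustrative sample names, two of which start with accented capitals.
sample = ['Émilie du Châtelet', 'Zhores Alferov',
          'Anders Ångström', 'Ernst Mach']

by_code_point = sorted(sample)   # 'É' sorts after 'Z' by code point
by_locale = sort_names(sample)   # order depends on the active locale

print(by_code_point)
print(by_locale)
```

With a plain sort, 'Émilie du Châtelet' ends up after 'Zhores Alferov'; under many UTF-8 locales the `strxfrm` key places it among the names starting with 'E' instead.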

## Scraping the Physicists

I use a combination of *requests* and *beautifulsoup* to scrape the links from the Wikipedia pages, then filter the links down to only those that point to physicists. Note that I need to send an HTTP request for the page behind each link, because some of them are redirected to different URLs. The tricky part is that the redirects are done via JavaScript, so *requests* does not detect them. As a result, I have to parse the JavaScript to find the redirect link.
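
The redirect parsing can be sketched offline as follows. This is only an illustration under stated assumptions: the helper `extract_redirect_title` and the sample script text are made up, and a regular expression stands in for the manual string slicing used in the scraper. The key `wgInternalRedirectTargetUrl` is the one the scraper looks for.

```python
import re
from urllib import parse

# Key that the scraper searches for inside inline <script> tags.
REDIRECT_KEY = '"wgInternalRedirectTargetUrl":'


def extract_redirect_title(script_code):
    """Return the redirect target title found in script_code, or None.

    Hypothetical helper: same idea as the scraper, but with a regex.
    """
    pattern = re.escape(REDIRECT_KEY) + r'\s*"([^"]*)"'
    match = re.search(pattern, script_code)
    if match is None:
        return None
    path = match.group(1)  # for example '/wiki/Some_Title'
    title = path.replace('/wiki/', '', 1).replace('_', ' ')
    return parse.unquote(title)  # decode %XX escapes such as %C3%B6


# Made-up script text resembling what a redirected page might contain.
sample_script = (
    'RLCONF={"wgPageName":"Erwin_Schr%C3%B6dinger",'
    '"wgInternalRedirectTargetUrl":"/wiki/Erwin_Schr%C3%B6dinger"};'
)

print(extract_redirect_title(sample_script))
# → Erwin Schrödinger
print(extract_redirect_title('RLCONF={"wgPageName":"Niels_Bohr"};'))
# → None
```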

Even after all of this, some of the redirected Wikipedia links are not in sync with the DBpedia links. This means that when I later try to fetch the data from DBpedia, the links resolve to the wrong resource. So I force these redirects manually here.
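
A forced redirect is just a lookup from the Wikipedia title to the title that DBpedia expects. The sketch below shows one way to apply such a mapping; the helper `apply_forced_redirects` is a hypothetical name, and the two mapping entries are taken from the mapping used in this notebook.

```python
def apply_forced_redirects(titles, forced_redirects=None):
    """Replace each title that has a forced redirect; keep the rest as-is.

    Hypothetical helper used only for this illustration.
    """
    forced_redirects = forced_redirects or {}
    return [forced_redirects.get(title, title) for title in titles]


# Two entries from the forced redirects in this notebook.
forced = {
    'James Jeans': 'James Hopwood Jeans',
    'Thales of Miletus': 'Thales',
}
titles = ['Albert Einstein', 'James Jeans', 'Thales of Miletus']

print(apply_forced_redirects(titles, forced))
# → ['Albert Einstein', 'James Hopwood Jeans', 'Thales']
```

Unlike an index-based replacement, a `dict.get` lookup does not raise an error when a mapped title is missing from the list, which may or may not be the behavior you want.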

In [None]:
WIKI_URL = 'https://en.wikipedia.org/wiki/'

def get_notable_physicists(progress_bar=None):
    """Get a list of notable physicists.
    Args:
        progress_bar (progressbar.ProgressBar): Progress bar.

    Returns:
        list (str): List of names of notable physicists.

    """

    # get the theoretical physicists
    theoretical_physicists = _get_linked_article_titles(
        WIKI_URL + 'List_of_theoretical_physicists',
        section_titles=[
            'Ancient times',
            'Middle_Ages',
            '15th–16th century',
            '16th–17th century',
            '17th–18th century',
            '18th–19th century',
            '19th century',
            '19th–20th century',
            '20th century',
            '20th–21st century'
        ]
    )
    assert(len(theoretical_physicists) == 267)

    # get the physicists
    blacklist_links = [
            'Newcastle University',  # university
            # leads to the physicist at this foreign language link:
            # https://tr.wikipedia.org/wiki/Victor_Twersky_(fizik%C3%A7i)
            'Ernst equation',  # equation
            'Matthew Sanders',  # not a physicist
            'Ricardo Carezani',  # not found in DBpedia (misspelt there?)
            'Twersky#Twersky'  # a group of people of this name
        ]
    physicists = _get_linked_article_titles(
        WIKI_URL + 'List_of_physicists',
        section_titles=list(string.ascii_uppercase),
        blacklist_links=blacklist_links
    )
    assert(len(physicists) == 974)
    assert(not set(blacklist_links).intersection(set(physicists)))

    # merge the lists
    notable_physicists = list(set(theoretical_physicists + physicists))

    # get the redirect title (if any) from an HTTP request
    forced_redirects = {
        # DBpedia not in sync with Wikipedia for these names
        # so force these redirects
        'Ea Ea': 'Craige Schensted',
        'Gian Carlo Wick': 'Gian-Carlo Wick',
        'Hans Adolf Buchdahl': 'Hans Adolph Buchdahl',
        'James Jeans': 'James Hopwood Jeans',
        'Lawrence Bragg': 'William Lawrence Bragg',
        "Shin'ichirō Tomonaga": "Sin'ichirō_Tomonaga",
        'Thales of Miletus': 'Thales'
    }
    notable_physicists = _get_redirected_titles(notable_physicists,
                                                forced_redirects,
                                                progress_bar)
    assert(len(notable_physicists) == 1059)
    assert(not set(forced_redirects.keys()).intersection(
        set(notable_physicists)))
    assert(set(forced_redirects.values()).intersection(
        set(notable_physicists)))

    # sort the list
    notable_physicists.sort(key=locale.strxfrm)
    return notable_physicists


def _get_linked_article_titles(url, section_titles,
                               blacklist_links=None):
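    # fetch the page and fail early on an HTTP error
    response = requests.get(url)
    response.raise_for_status()
    # name the parser explicitly to avoid bs4's "no parser specified" warning
    # (assumption: the stdlib 'html.parser' is adequate for these pages)
    soup = BeautifulSoup(response.text, 'html.parser')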

    # loop to find links
    article_titles = []
    for section_title in section_titles:
        span_id = section_title.replace(' ', '_')
        section = soup.find('span', id=span_id)
        ul = section.find_next('ul')
        for link in ul.find_all('a', href=True):
            if not link['href'].startswith('/wiki/'):
                # skip external and dead (redlink=1) links
                continue
            article_title = (
                link['href'].replace('/wiki/', '').replace('_', ' ')
            )
            if (not blacklist_links or 
                article_title not in blacklist_links):
                article_titles.append(article_title)
    return article_titles


def _get_redirected_titles(titles, forced_redirects=None,
                           progress_bar=None):
    redirected_titles = set()

    for i in range(len(titles)):
        # fetch the page
        url = WIKI_URL + str(titles[i]).replace(' ', '_')
        time.sleep(1)  # delay to crawl responsibly
        response = requests.get(url)
        response.raise_for_status()

        # parse javascript for redirects
        redirected = False
        if response.status_code == requests.codes.ok:
            REDIRECT = '"wgInternalRedirectTargetUrl":'
            # explicit parser (assumption: 'html.parser' is adequate here)
            soup = BeautifulSoup(response.text, 'html.parser')
            for script_tag in soup.find_all(name='script'):
                script_code = script_tag.string
                if (isinstance(script_code, NavigableString) and
                        REDIRECT in script_code):
                    start = script_code.find(REDIRECT)
                    end = script_code.find('"', start + len(REDIRECT) + 1)
                    redirected_title = (
                        script_code[start + len(REDIRECT) + 1:end]
                        .replace('/wiki/', '').replace('_', ' ')
                    )
                    # some physicist names contain unicode characters
                    # which have been quoted when in a url
                    # unquote for sorting
                    redirected_titles.add(parse.unquote(redirected_title))
                    redirected = True
        if not redirected:
            redirected_titles.add(parse.unquote(titles[i]))
        
        if progress_bar:
            progress_bar.update(i)
            
    # force redirects (forced_redirects defaults to None, so guard against it)
    redirected_titles = list(redirected_titles)
    for key, value in (forced_redirects or {}).items():
        redirected_titles[redirected_titles.index(key)] = value
            
    return redirected_titles

In [None]:
widgets = [
    'Checking: ', pb.Counter(),
    ' / ' + str(1084) + ' urls',
    ' ', pb.Bar(marker='█'),
    ' ', pb.Percentage(),
    ' ', pb.Timer(),
    ' ', pb.ETA()
]
bar = pb.ProgressBar(max_value=1084, widgets=widgets,
                     redirect_stdout=True).start()

notable_physicists = get_notable_physicists(bar)
bar.finish()

Let's check that there are no duplicate names and see how many names we got.

In [None]:
assert(len(np.unique(notable_physicists)) == len(notable_physicists))
len(notable_physicists)
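
If the uniqueness assertion ever fails, it helps to know which names are duplicated. Here is a minimal sketch using `collections.Counter`; the helper `find_duplicates` and the sample names are illustrative assumptions.

```python
from collections import Counter


def find_duplicates(names):
    """Return the names that occur more than once, sorted alphabetically.

    Hypothetical helper used only for this illustration.
    """
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


sample = ['Niels Bohr', 'Marie Curie', 'Niels Bohr', 'Paul Dirac']

print(find_duplicates(sample))
# → ['Niels Bohr']
print(find_duplicates(['Niels Bohr', 'Marie Curie']))
# → []
```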

Let's write the list to a file for future use and check the list of names. 

In [None]:
with open('../data/raw/physicists.txt', mode='w', encoding='utf-8') as file:
    # the joined names form a single string, so write() is the right call
    file.write('\n'.join(notable_physicists))

In [None]:
%pycat ../data/raw/physicists.txt
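
As a sanity check on the file format, the names can be written and read back to confirm that accented characters survive the round trip. This is a sketch under assumptions: it writes to a temporary directory instead of `../data/raw/`, and the helpers `write_names` and `read_names` are hypothetical.

```python
import os
import tempfile


def write_names(path, names):
    """Write one name per line as UTF-8 (hypothetical helper)."""
    with open(path, mode='w', encoding='utf-8') as file:
        file.write('\n'.join(names))


def read_names(path):
    """Read the names back, one per line (hypothetical helper)."""
    with open(path, encoding='utf-8') as file:
        return file.read().splitlines()


sample = ['Anders Ångström', 'Erwin Schrödinger', 'Niels Bohr']

# Use a temporary directory so the real data file is left untouched.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'physicists.txt')
    write_names(path, sample)
    restored = read_names(path)

print(restored == sample)
# → True
```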

## Cleaning Up

A few clean up steps are needed:

- Convert the notebook to an HTML file with all the output.
- Convert the notebook to another notebook with the output removed.

In [None]:
!jupyter nbconvert --ClearOutputPreprocessor.enabled=False --output-dir html_output --to html 1.0-collect-physicists.ipynb

In [None]:
!jupyter nbconvert --ClearOutputPreprocessor.enabled=True --to notebook 1.0-collect-physicists.ipynb