# Collect Physicists

For this project, I need a list of physicists who are notable for their achievements. Wikipedia contains two such lists: a general [list of physicists](https://en.wikipedia.org/wiki/List_of_physicists) and a [list of theoretical physicists](https://en.wikipedia.org/wiki/List_of_theoretical_physicists). I will scrape both and unify them into a single list. Note that some of these physicists have won the *Nobel Prize* and some have not, and that some are *dead* and some are *alive*. You should recognize at least a few of the more famous names, even if you do not recognize them all. The entire analysis in this project is based on the data acquired for these physicists. OK, time to get scraping.

## Setting the Environment

A few initialization steps are needed to set up the environment:

- The top-level module directory of the repository needs to be added to the system path to enable the loading of Python modules.
- The locale needs to be set for all categories to the user’s default setting (typically specified in the `LANG` environment variable) to enable correct sorting of physicists’ names with accents.

In [None]:
import locale
import sys

repo_dir = '../'
if repo_dir not in sys.path:
    sys.path.append(repo_dir)

locale.setlocale(locale.LC_ALL, '')
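
To illustrate why the sort key matters for accented names, here is a small sketch. Because the result of `locale.strxfrm` depends on the locale configured on the machine, the sketch uses a locale-independent stand-in (stripping accents with `unicodedata`) purely for illustration. The helper name `accent_insensitive_key` is hypothetical and is not part of this project.

```python
import unicodedata


def accent_insensitive_key(name):
    """Sort key that ignores accents and case (illustrative stand-in only)."""
    # decompose accented characters into base character + combining marks
    decomposed = unicodedata.normalize('NFKD', name)
    # drop the combining marks, keeping only the base characters
    stripped = ''.join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return stripped.casefold()


names = ['Dirac', 'Čerenkov', 'Bohr']

# plain sorting compares code points, so the accented name goes last
plain_sorted = sorted(names)

# with the key, the accented name sorts next to its unaccented neighbours
accent_sorted = sorted(names, key=accent_insensitive_key)
```

The notebook itself relies on `locale.strxfrm` as the sort key, which delegates the ordering rules to the user's locale rather than hard-coding them.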

In [None]:
import string
import time

import numpy as np
from bs4 import BeautifulSoup

from src.data.url_utils import urls_progress_bar
from src.data.wiki_utils import BLACKLIST_LINKS
from src.data.wiki_utils import FORCED_REDIRECTS
from src.data.wiki_utils import SECTION_TITLES
from src.data.wiki_utils import WIKI_OLD_URL
from src.data.wiki_utils import get_linked_article_titles
from src.data.wiki_utils import get_redirected_titles

## Scraping the Physicists

I use a combination of *requests* and *beautifulsoup* to scrape the links from the Wikipedia pages, then filter the links down to only those that point to physicists. An important point is that I need to send an HTTP request for each of these links, since some of them redirect to different URLs. The tricky part is that the redirects are done via JavaScript, so *requests* does not detect them. As a result, I have to parse the JavaScript to find the redirect link.
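
As a rough sketch of what parsing a JavaScript redirect can look like, the snippet below searches a page's inline script for a redirect target. It assumes the target is exposed under a configuration key named `wgInternalRedirectTargetUrl`; that key, the sample HTML, and the helper name `find_js_redirect_title` are assumptions for illustration, and the actual parsing in `src.data.wiki_utils` may work differently.

```python
import json
import re
from urllib.parse import unquote

# Assumption: the redirect target appears in the inline JavaScript as a JSON
# string under the key "wgInternalRedirectTargetUrl".
REDIRECT_PATTERN = re.compile(
    r'"wgInternalRedirectTargetUrl"\s*:\s*("(?:[^"\\]|\\.)*")')


def find_js_redirect_title(html):
    """Return the redirect target title found in inline JavaScript, or None."""
    match = REDIRECT_PATTERN.search(html)
    if match is None:
        return None
    # decode the JSON string literal, e.g. "/wiki/Niels_Bohr"
    target_url = json.loads(match.group(1))
    # keep only the article title part of the URL
    title = target_url.rsplit('/wiki/', 1)[-1]
    return unquote(title).replace('_', ' ')


# hypothetical page fragment, not fetched from Wikipedia
sample_html = '''
<script>RLCONF={"wgInternalRedirectTargetUrl":"/wiki/Niels_Bohr"};</script>
'''

redirect_title = find_js_redirect_title(sample_html)
no_redirect = find_js_redirect_title('<html></html>')
```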

Even after all of this, some of the redirected Wikipedia links are not in sync with the DBpedia links. This means that when I later try to fetch the data from DBpedia, the links resolve to the wrong resource. So I force these redirects manually here.
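
Forcing a redirect amounts to overriding the resolved title with a manually chosen one. The sketch below assumes the mapping goes from an original title to the title that should be used instead; the helper `apply_forced_redirects` and the placeholder titles are hypothetical and do not reflect the real contents of `FORCED_REDIRECTS`.

```python
def apply_forced_redirects(redirected, forced_redirects):
    """Override resolved titles with manually specified ones."""
    return {title: forced_redirects.get(title, target)
            for title, target in redirected.items()}


# hypothetical placeholder titles, not real entries from FORCED_REDIRECTS
redirected = {'Title A': 'Title A (old)', 'Title B': 'Title B'}
forced = {'Title A': 'Title A (physicist)'}

fixed = apply_forced_redirects(redirected, forced)
```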

In [None]:
def get_notable_physicists(progress_bar=None):
    """Get a list of notable physicists.

    Args:
        progress_bar (progressbar.ProgressBar): Progress bar.

    Returns:
        list (str): List of names of notable physicists.

    """

    # get the theoretical physicists
    theoretical_physicists = get_linked_article_titles(
        WIKI_OLD_URL + 'List_of_theoretical_physicists&oldid=855745137',
        section_titles=SECTION_TITLES
    )
    assert(len(theoretical_physicists) == 266)

    # get the physicists
    physicists = get_linked_article_titles(
        WIKI_OLD_URL + 'List_of_physicists&oldid=861832841',
        section_titles=list(string.ascii_uppercase),
        blacklist_links=BLACKLIST_LINKS
    )
    assert(len(physicists) == 976)
    assert(not set(BLACKLIST_LINKS).intersection(set(physicists)))

    # merge the lists
    notable_physicists = list(set(theoretical_physicists + physicists))

    # get the redirect title (if any) from an HTTP request
    notable_physicists = get_redirected_titles(
        notable_physicists,
        forced_redirects=FORCED_REDIRECTS,
        max_workers=20,
        progress_bar=progress_bar)
    assert(set(FORCED_REDIRECTS.values()).intersection(
        set(notable_physicists.values())))

    # remove duplicates, sort and return list
    notable_physicists = list(set(notable_physicists.values()))
    notable_physicists.sort(key=locale.strxfrm)
    return notable_physicists
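
The `max_workers` argument suggests that the titles are resolved concurrently. One plausible way to do this, shown here only as a sketch, is a thread pool from `concurrent.futures`. The helper `resolve_titles` and the stand-in `fake_resolve` are hypothetical; the real `get_redirected_titles` may be implemented differently.

```python
from concurrent.futures import ThreadPoolExecutor


def resolve_titles(titles, resolve, max_workers=20):
    """Map each title to its resolved title using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves the input order of the titles
        resolved = executor.map(resolve, titles)
        return dict(zip(titles, resolved))


def fake_resolve(title):
    # hypothetical stand-in for an HTTP request that follows a redirect
    return title.replace('_', ' ')


resolved_titles = resolve_titles(['Niels_Bohr', 'Paul_Dirac'], fake_resolve)
```

Threads suit this kind of work because each request spends most of its time waiting on the network.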

In [None]:
NUM_URLS = 1085
notable_physicists = get_notable_physicists(urls_progress_bar(NUM_URLS))

Let's check how many names we got and that there are no duplicates.

In [None]:
assert(len(notable_physicists) == 1060)
len(notable_physicists)

In [None]:
assert(len(np.unique(notable_physicists)) == len(notable_physicists))
len(notable_physicists)
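
If the uniqueness assertion ever fails, it helps to know which names are repeated. A minimal sketch using `collections.Counter` follows; the helper `find_duplicates` and the sample names are illustrative and not part of this project.

```python
from collections import Counter


def find_duplicates(names):
    """Return the sorted list of names that occur more than once."""
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


# illustrative input only
duplicates = find_duplicates(['Bohr', 'Dirac', 'Bohr'])
```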

Let's write the list to a file for future use and check the list of names. 

In [None]:
with open('../data/raw/physicists.txt', mode='w', encoding='utf-8') as file:
    file.write('\n'.join(notable_physicists))

In [None]:
%pycat ../data/raw/physicists.txt
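
For later use, the file can be read back into a list with one name per line. The sketch below does a round trip through a temporary file so that it does not touch the project data; the helper `read_names` and the sample names are hypothetical.

```python
import os
import tempfile


def read_names(path):
    """Read one name per line from a UTF-8 text file."""
    with open(path, encoding='utf-8') as file:
        return file.read().splitlines()


# round trip through a temporary file with illustrative names
names = ['Bohr', 'Čerenkov', 'Dirac']
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'physicists.txt')
    with open(path, mode='w', encoding='utf-8') as file:
        file.write('\n'.join(names))
    loaded = read_names(path)
```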

## Cleaning Up

A cleanup step is needed:

- Remove the top-level module directory of the repository from the system path.

In [None]:
sys.path.remove(repo_dir)