# Collect Physicists

For this project, I need a list of physicists who are notable for their achievements. Wikipedia contains two such lists: a general [list of physicists](https://en.wikipedia.org/wiki/List_of_physicists) and a list of [theoretical physicists](https://en.wikipedia.org/wiki/List_of_theoretical_physicists). I will scrape both and unify them into a single list. Note that some of these physicists have won the *Nobel Prize* and some have not, and that some are *dead* and some are *alive*. You should recognize at least a few of the more famous names, even if you do not recognize them all. The entire analysis in this project is based on the data acquired about these physicists. OK, time to get scraping.

## Setting up the Environment

A few initialization steps are needed to set up the environment:
- An environment variable needs to be set to disable loading of `user-config.py` for *pywikibot*.
- The locale needs to be set for all categories to the user's default setting (typically specified in the `LANG` environment variable) to enable correct sorting of physicists' names with accents.

In [None]:
%env PYWIKIBOT_NO_USER_CONFIG=1

In [None]:
%load_ext pycodestyle_magic

In [None]:
import locale
import os

import mwparserfromhell
import numpy as np
import pywikibot

In [None]:
locale.setlocale(locale.LC_ALL, '')
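
To illustrate why the locale matters, here is a minimal sketch that compares plain code point sorting with sorting through `locale.strxfrm`. The helper `sort_names` and the sample names are my own illustration, not part of the project code, and the fallback to the `C` locale is an assumption for environments where the requested locale is not installed.

```python
import locale


def sort_names(names):
    """Sort names using the collation rules of the user's default locale."""
    try:
        locale.setlocale(locale.LC_ALL, '')
    except locale.Error:
        # Assumption: fall back to the portable "C" locale if the
        # environment requests a locale that is not installed.
        locale.setlocale(locale.LC_ALL, 'C')
    return sorted(names, key=locale.strxfrm)


names = ['Ørsted', 'Bohr', 'Ångström', 'Curie', 'Émilie du Châtelet']

# Plain sorting compares code points, so accented initials end up last.
print(sorted(names))

# Locale-aware sorting; the exact order depends on the active locale.
print(sort_names(names))
```

With a locale such as `en_US.UTF-8`, accented names are typically interleaved with the unaccented ones rather than pushed to the end, but the result depends on the system.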

## Scraping the Physicists

I use a combination of *pywikibot* and *mwparserfromhell* to scrape the links from the Wikipedia pages. I filter the list of links down to only those containing physicist names. 

In [None]:
def get_notable_physicists():
    """Get a list of notable physicists.

    Returns:
        list (str): List of names of notable physicists.

    """
    physicists = _get_physicists()
    theoretical_physicists = _get_theoretical_physicists()
    notable_physicists = list(set(physicists + theoretical_physicists))
    assert(len(notable_physicists) == 1090)
    return sorted(notable_physicists, key=locale.strxfrm)


def _get_physicists():
    wiki_site = pywikibot.Site(code='en', fam='wikipedia')
    wiki_page = pywikibot.Page(wiki_site, 'List of physicists')
    # Newcastle University is not a physicist
    ignore_links = ['Newcastle University']
    physicists = _get_linked_pages(wiki_site, wiki_page,
                                   ignore_links=ignore_links)
    assert(len(physicists) == 978)
    return sorted(physicists, key=locale.strxfrm)


def _get_linked_pages(site, page, ignore_links=None):
    # avoid a TypeError in the membership test when nothing is ignored
    ignore_links = ignore_links or []
    linked_pages = []
    for linked_page in page.linkedPages(namespaces=0):
        title = linked_page.title()
        # category pages are unwanted along with other links not about
        # a physicist
        if linked_page.is_categorypage() or title in ignore_links:
            continue
        linked_pages.append(title)
    return linked_pages


def _get_theoretical_physicists():
    site = pywikibot.Site('en', 'wikipedia')
    code = _get_page_wikicode(site, 'List of theoretical physicists')
    era = ['ancient times', 'middle ages', 'century']
    physicists = _get_linked_pages_in_sections(code, era)
    assert(len(physicists) == 267)
    return sorted(physicists, key=locale.strxfrm)


def _get_page_wikicode(site, page_title):
    page = pywikibot.Page(site, page_title)
    text = page.get()
    return mwparserfromhell.parse(text)


def _get_linked_pages_in_sections(wikicode, sections):
    linked_pages = []
    matches = r'|'.join(sections)
    for section in wikicode.get_sections(matches=matches):
        for linked_page in section.filter_wikilinks():
            # section headings are unwanted
            if linked_page.title.lower() in sections:
                continue
            linked_pages.append(str(linked_page.title))
    return linked_pages
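
To make the filtering step concrete without network access, the sketch below mimics on a tiny hand-written wikitext sample what `_get_linked_pages_in_sections` does. It deliberately swaps *mwparserfromhell* for a simple regular expression, so it only handles plain `[[Target]]` and `[[Target|label]]` links and is an illustration rather than a replacement. The names `WIKILINK`, `extract_link_titles`, and the sample text are hypothetical.

```python
import re

# Matches [[Target]] and [[Target|label]], capturing only the target title.
# Assumption: no nested links, templates, or section anchors in the input.
WIKILINK = re.compile(r'\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]')


def extract_link_titles(wikitext, ignore_titles=()):
    """Return link targets in order, skipping titles that should be ignored."""
    ignored = {title.lower() for title in ignore_titles}
    titles = []
    for match in WIKILINK.finditer(wikitext):
        title = match.group(1).strip()
        # links inside section headings are unwanted
        if title.lower() in ignored:
            continue
        titles.append(title)
    return titles


sample = """
== [[Middle Ages]] ==
* [[Ibn al-Haytham]] (965-1040)
== 20th [[century]] ==
* [[Albert Einstein]]
* [[Paul Dirac|P. A. M. Dirac]]
"""

print(extract_link_titles(sample, ignore_titles=['middle ages', 'century']))
# → ['Ibn al-Haytham', 'Albert Einstein', 'Paul Dirac']
```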

In [None]:
notable_physicists = get_notable_physicists()

Let's check that there are no duplicate names and how many names we got.

In [None]:
assert(len(np.unique(notable_physicists)) == len(notable_physicists))
len(notable_physicists)
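
The same check can be tried on toy data. This is a small sketch, with made-up input lists and a hypothetical helper `merge_unique`, of how the set union removes names that appear in both lists. It uses plain `sorted` instead of the locale-aware key so that the result is the same on every machine.

```python
def merge_unique(*name_lists):
    """Merge several lists of names and drop duplicates."""
    unique_names = set()
    for name_list in name_lists:
        unique_names.update(name_list)
    return sorted(unique_names)


general = ['Niels Bohr', 'Marie Curie', 'Paul Dirac']
theoretical = ['Paul Dirac', 'Albert Einstein', 'Niels Bohr']

merged = merge_unique(general, theoretical)
overlap = sorted(set(general) & set(theoretical))

# every name appears exactly once after merging
assert len(set(merged)) == len(merged)

print(merged)
# → ['Albert Einstein', 'Marie Curie', 'Niels Bohr', 'Paul Dirac']
print(overlap)
# → ['Niels Bohr', 'Paul Dirac']
```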

Let's write the list to a file for future use and check the list of names. 

In [None]:
def write_list_to_file(file, list_to_write, mode='w'):
    """Write a list line-by-line to a file on disk.

    Args:
        file (str): A text or byte string giving the name (and the path
            if the file isn't in the current working directory) of the
            file to be opened or an integer file descriptor of the file
            to be wrapped. See the `open()` function in the standard
            library for more details.
        list_to_write (list): The list of items.
        mode (str): Specifies the mode in which the file is opened. See
            the `open()` function in the standard library for more
            details.

    """
    # pass the caller's mode through instead of always overwriting
    with open(file, mode=mode, encoding='utf-8') as list_file:
        list_file.writelines('%s\n' % item for item in list_to_write)

In [None]:
write_list_to_file('../data/raw/physicists.txt', notable_physicists)

In [None]:
assert(os.path.isfile('../data/raw/physicists.txt'))
%pycat ../data/raw/physicists.txt
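
Since the file is meant for future use, here is a minimal sketch of how it could be read back. The helper `read_list_from_file` is hypothetical and not part of this project. The sketch writes a few sample names to a temporary directory, rather than touching `../data/raw/physicists.txt`, and checks that they survive the round trip.

```python
import os
import tempfile


def read_list_from_file(file):
    """Read a file written one item per line back into a list."""
    with open(file, encoding='utf-8') as list_file:
        return [line.rstrip('\n') for line in list_file]


names = ['Niels Bohr', 'Marie Curie', 'Erwin Schrödinger']

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'physicists.txt')
    # same line-by-line format as used for the real list
    with open(path, mode='w', encoding='utf-8') as list_file:
        list_file.writelines('%s\n' % name for name in names)
    restored = read_list_from_file(path)

# the names, including non-ASCII characters, come back unchanged
assert restored == names
```

Opening the file with an explicit `encoding='utf-8'` on both sides keeps accented names intact regardless of the platform's default encoding.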

## Cleaning Up

A few cleanup steps are needed:

- Unset the environment variable that was set above.
- Convert the notebook to an HTML file with all the output.
- Convert the notebook to another notebook with the output removed.

In [None]:
# pop() with a default does not raise if the cell is run more than once
os.environ.pop('PYWIKIBOT_NO_USER_CONFIG', None)

In [None]:
!jupyter nbconvert --ClearOutputPreprocessor.enabled=False --output-dir html_output --to html 1.0-collect-physicists.ipynb

In [None]:
!jupyter nbconvert --ClearOutputPreprocessor.enabled=True --to notebook 1.0-collect-physicists.ipynb