# Collect Physicists

As well as the list of [Nobel Physics Laureates](../data/raw/nobel-physics-prize-laureates.csv), we need a list of physicists who are notable for their achievements but have not won the Nobel Prize in Physics. Wikipedia has two such lists: a general [list of physicists](https://en.wikipedia.org/w/index.php?title=List_of_physicists&oldid=864677795) and a [list of theoretical physicists](https://en.wikipedia.org/w/index.php?title=List_of_theoretical_physicists&oldid=855745137). We will scrape these lists, combine them with the list of laureates, and unify all of the physicists into a single list. The entire analysis in this project is based on the data acquired about these physicists. OK, time to get scraping.

## Setting the Environment

An initialization step is needed to set up the environment:

- The locale needs to be set for all categories to the user’s default setting (typically specified in the LANG environment variable) to enable correct sorting of physicists' names with accents.

In [None]:
import locale

locale.setlocale(locale.LC_ALL, '')
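
To see why the locale matters, here is a minimal sketch that sorts a few names with `locale.strxfrm` as the key, the same key used later in this notebook. The sample names are illustrative and are not read from the project data. The locale-aware order depends on the locale configured on your machine, so no expected output is shown.

```python
import locale

# Use the user's default locale; keep the current one if the environment
# names a locale that is not installed on this machine.
try:
    locale.setlocale(locale.LC_ALL, '')
except locale.Error:
    pass

# Illustrative sample only (not read from the project data).
names = ['Ørsted', 'Zeeman', 'Ångström', 'Bohr', 'Eötvös']

# strxfrm transforms each string so that ordinary comparison follows the
# collation rules of the active locale.
by_locale = sorted(names, key=locale.strxfrm)

# Plain sorted() compares Unicode code points instead.
by_code_point = sorted(names)

print(by_locale)
print(by_code_point)
```

In many locales the accented names are interleaved with the unaccented ones, whereas code point order pushes them to the end of the list.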

In [None]:
import string
import time

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

from src.data.url_utils import urls_progress_bar
from src.data.wiki_utils import BLACKLIST_LINKS
from src.data.wiki_utils import SECTION_TITLES
from src.data.wiki_utils import WIKI_OLD_URL
from src.data.wiki_utils import get_linked_article_titles
from src.data.wiki_utils import get_redirected_titles

## Reading in the Data

First, let's read the Nobel Physics Laureates into a [pandas](https://pandas.pydata.org/) dataframe, remove the missing values and convert the result to a list of laureate names.

In [None]:
physics_laureates = pd.read_csv('../data/raw/nobel-physics-prize-laureates.csv')
physics_laureates.head()

In [None]:
physics_laureates = physics_laureates.Laureate.dropna().values.tolist()
assert(len(physics_laureates) == 207)
physics_laureates

## Scraping the Physicists

We will use a combination of [requests](http://docs.python-requests.org/en/master/) and [beautifulsoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to scrape the links to the physicists' articles from the Wikipedia pages above. Later we will need these links to fetch data about the physicists from [DBpedia](https://wiki.dbpedia.org/about). An important point is that we actually need to send HTTP requests to fetch the linked pages, as some of them redirect to different URLs. The tricky part is that the redirects are done via JavaScript, so requests does not detect them. As a result, we have to parse the JavaScript to find the redirect link. Even then, some of the redirected Wikipedia links do not match the DBpedia links, which results in the wrong resource being retrieved. To avoid this, we manually force the correct redirects using a [Wikipedia redirects cache](../data/raw/wikipedia-redirects.csv).
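
The JavaScript redirect lookup described above can be sketched as follows. This is a minimal illustration, not the project's `get_redirected_titles` implementation. It assumes the page source embeds the redirect target in a `wgInternalRedirectTargetUrl` configuration entry. The helper name `find_js_redirect_title` and the sample HTML are hypothetical.

```python
import json
import re
from urllib.parse import unquote

# Assumption: the redirect target appears in the page's inline JavaScript
# as a JSON-style entry such as "wgInternalRedirectTargetUrl":"/wiki/Title".
REDIRECT_PATTERN = re.compile(
    r'"wgInternalRedirectTargetUrl"\s*:\s*("(?:[^"\\]|\\.)*")')


def find_js_redirect_title(html):
    """Return the redirected article title found in html, or None."""
    match = REDIRECT_PATTERN.search(html)
    if match is None:
        return None
    # Decode the JSON string literal, then strip the /wiki/ prefix.
    target_url = json.loads(match.group(1))
    title = target_url.split('/wiki/', 1)[-1]
    # Percent-encoded characters are decoded back to text.
    return unquote(title)


# Hypothetical page source for illustration only.
sample_html = '''
<script>RLCONF={"wgPageName":"Isaac_Newton",
"wgInternalRedirectTargetUrl":"/wiki/Isaac_Newton"};</script>
'''

print(find_js_redirect_title(sample_html))
print(find_js_redirect_title('<script>RLCONF={};</script>'))
```

A page without the entry yields `None`, which a caller could treat as "no redirect, keep the original title".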

In [None]:
def get_notable_physicists(laureates, title_cache_path=None, progress_bar=None):
    """Get a list of notable physicists.

    Args:
        laureates (list of `str`): Nobel Physics Laureates.
        title_cache_path (str, optional): Defaults to None. Path of the csv file
            where the title cache of known mappings is located.
        progress_bar (progressbar.ProgressBar, optional): Defaults to None.
            Progress bar.

    Returns:
        list of `str`: Sorted list of names of notable physicists.

    """
    
    # read in the cache (an empty mapping is used when no path is given)
    redirected_titles = {}
    if title_cache_path:
        cache = pd.read_csv(title_cache_path)
        redirected_titles = dict(zip(cache.name, cache.redirect_name))
    
    # get the theoretical physicists
    theoretical_physicists = get_linked_article_titles(
        WIKI_OLD_URL + 'List_of_theoretical_physicists&oldid=855745137', section_titles=SECTION_TITLES)
    assert(len(theoretical_physicists) == 267)

    # get the physicists
    physicists = get_linked_article_titles(
        WIKI_OLD_URL + 'List_of_physicists&oldid=861832841', section_titles=list(string.ascii_uppercase),
        blacklist_links=BLACKLIST_LINKS)
    assert(len(physicists) == 976)
    assert(not set(BLACKLIST_LINKS).intersection(set(physicists)))

    # merge the lists with the laureates list
    notable_physicists = list(set(theoretical_physicists + physicists + laureates))

    # get the redirect title (if any) from an HTTP request
    notable_physicists = get_redirected_titles(
        notable_physicists, title_cache_path=title_cache_path, max_workers=20, progress_bar=progress_bar)

    # remove duplicates, sort and return list
    notable_physicists = list(set(notable_physicists.values()))
    for name, redirect_name in redirected_titles.items():
        if (locale.strcoll(name, redirect_name) != 0 and name in notable_physicists and 
            redirect_name in notable_physicists):
            notable_physicists.remove(name)
    
    notable_physicists.sort(key=locale.strxfrm)
    return notable_physicists
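
The final de-duplication step can be hard to follow in isolation, so here is a minimal sketch of it on toy data. The titles and cache entries below are hypothetical and only illustrate the rule: a title is dropped when it differs from its cached redirect and both forms are present in the list.

```python
import locale

# Hypothetical redirect cache: original title -> redirected title.
redirected_titles = {
    'Abdus_Salam': 'Abdus_Salam',  # maps to itself, so it is kept
    'Lord_Kelvin': 'William_Thomson,_1st_Baron_Kelvin',
    'Lord_Rayleigh': 'John_William_Strutt,_3rd_Baron_Rayleigh',
}

# Hypothetical scraped titles in which one person appears under two names.
notable_physicists = [
    'William_Thomson,_1st_Baron_Kelvin',
    'Lord_Kelvin',
    'Abdus_Salam',
    'Lord_Rayleigh',
]

for name, redirect_name in redirected_titles.items():
    # Drop the original title only when it differs from its redirect and
    # both forms are present in the list.
    if (locale.strcoll(name, redirect_name) != 0
            and name in notable_physicists
            and redirect_name in notable_physicists):
        notable_physicists.remove(name)

# Sort with the locale-aware key, as in the function above.
notable_physicists.sort(key=locale.strxfrm)
print(notable_physicists)
```

In this toy data `Lord_Kelvin` is removed because its redirect is also in the list, while `Lord_Rayleigh` stays because its redirect is absent.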

In [None]:
NUM_URLS = 1127
title_cache_path = '../data/raw/wikipedia-redirects.csv'
notable_physicists = get_notable_physicists(physics_laureates, title_cache_path=title_cache_path,
                                            progress_bar=urls_progress_bar(NUM_URLS))

Let's check that there are no duplicate names and see how many names we got.

In [None]:
assert(len(notable_physicists) == 1069)
assert(len(np.unique(notable_physicists)) == len(notable_physicists))
len(notable_physicists)

Let's write the list to a file for future use and check the list of names. 

In [None]:
with open('../data/raw/physicists.txt', mode='w', encoding='utf-8') as file:
    # the joined names form a single string, so write() is the right call
    file.write('\n'.join(notable_physicists))
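
A minimal sketch of reading such a file back, assuming one name per line with no trailing newline. It uses a temporary directory and an illustrative sample of names rather than the project's `physicists.txt`.

```python
import os
import tempfile

# Illustrative sample only (not the scraped list).
names = ['Niels_Bohr', 'Paul_Dirac', 'Erwin_Schrödinger']

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'physicists.txt')

    # One name per line, no trailing newline after the last name.
    with open(path, mode='w', encoding='utf-8') as file:
        file.write('\n'.join(names))

    # splitlines() drops the line endings and does not produce a trailing
    # empty entry.
    with open(path, encoding='utf-8') as file:
        loaded = file.read().splitlines()

print(loaded == names)
```

Passing `encoding='utf-8'` on both the write and the read keeps accented names intact regardless of the platform's default encoding.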

In [None]:
%pycat ../data/raw/physicists.txt