# Collect Physicists Raw Data

Here I collect biographical data on the list of [physicists notable for their achievements](../data/raw/physicists.txt). Wikipedia contains this data in an *Infobox* on the top right side of the page for each physicist. However, similar data is available in a more structured, machine-readable *JSON* format from DBpedia. For an example, compare [Albert Einstein's Wikipedia infobox](https://en.wikipedia.org/wiki/Albert_Einstein) to [Albert Einstein's DBpedia JSON](http://dbpedia.org/data/Albert_Einstein.json). It is important to note that the data is similar but not identical: DBpedia contains many extra fields that are not present in the Wikipedia infobox. I choose DBpedia as the data source since it has the following advantages over Wikipedia:

- The data is structured in a machine-readable JSON format
- The data is richer

I will need to send HTTP requests to get the biographical data on each of the physicists.

## Constructing the URLs

To make the HTTP requests, I will need a list of URLs representing the resources (physicists). It's fairly easy to construct these URLs from the list of notable physicists. However, it's important to `quote` any physicist name that contains non-ASCII characters, since these are not allowed in URLs and must be percent-encoded. OK, let's create the list now.

In [None]:
import gzip
import json
import time
from urllib import parse

import pandas as pd
import progressbar as pb
import requests

In [None]:
def construct_urls(file='../data/raw/physicists.txt'):
    """Construct DBpedia data URLs from list in file.

    Args:
        file (str): Path to a file containing a list of physicist
            names (the final URL path segments, with spaces in
            place of underscores), one per line.
    Returns:
        list(str): List of URLs.

    """

    # use a separate name for the handle so the path argument
    # is not shadowed
    with open(file, encoding='utf-8') as names_file:
        names = [line.rstrip('\n') for line in names_file]

    DBPEDIA_URL = 'http://dbpedia.org/data/'
    JSON_EXT = '.json'
    urls = []
    for name in names:
        url = name.replace(' ', '_')
        # some physicist names contain non-ASCII characters
        # which have to be quoted (percent-encoded) when in a url
        if not all(ord(char) < 128 for char in name):
            url = parse.quote(url)
        url = DBPEDIA_URL + url + JSON_EXT
        urls.append(url)
    return urls

In [None]:
urls = construct_urls()
assert len(urls) == 1090
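
To illustrate the quoting step, here is a minimal sketch of what `parse.quote` does to a name containing a non-ASCII character. The name below is only an illustrative input assumed for the example; it is not claimed to come from the list.

```python
from urllib import parse

# illustrative name only; assumed for the example, not read from the list
name = 'Erwin Schrödinger'
quoted = parse.quote(name.replace(' ', '_'))
print(quoted)  # Erwin_Schr%C3%B6dinger

# unquoting recovers the original path segment
assert parse.unquote(quoted) == 'Erwin_Schrödinger'
```

The non-ASCII character is replaced by the percent-encoded bytes of its UTF-8 representation, while ASCII letters and underscores are left untouched.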

## Fetching the Data

Now that I have the list of URLs, it's time to make the HTTP requests to acquire the data. This will take a bit of time as there are 1090 URLs and I want to crawl responsibly so that I don't bombard the site. With a one-second delay between requests, that is at least 18 minutes. Time to get a coffee.

In [None]:
def fetch_json_data(urls, progress_bar=None):
    """Fetch json data from DBpedia.

    Args:
        urls (list(str)): List of URLs to fetch data from.
        progress_bar (progressbar.ProgressBar): Progress bar.
    Returns:
        list(json): List of JSON objects.

    """

    json_data = []
    for i, url in enumerate(urls, start=1):
        time.sleep(1)  # delay to crawl responsibly
        response = requests.get(url)
        # raise on 4xx/5xx responses so failures are not silently skipped
        response.raise_for_status()
        if response.status_code == requests.codes.ok:
            json_data.append(response.json())
        if progress_bar:
            progress_bar.update(i)  # i URLs fetched so far
    return json_data

In [None]:
widgets = [
    'Fetching: ', pb.Counter(),
    ' / ' + str(len(urls)) + ' urls',
    ' ', pb.Bar(marker='█'),
    ' ', pb.Percentage(),
    ' ', pb.Timer(),
    ' ', pb.ETA()
]
bar = pb.ProgressBar(max_value=len(urls), widgets=widgets,
                     redirect_stdout=True).start()

json_data = fetch_json_data(urls, bar)
bar.finish()
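
One thing to keep in mind is that `fetch_json_data` stops at the first HTTP error because of `raise_for_status()`, so a single failure late in the crawl would discard everything fetched so far. The following is only a sketch of an alternative loop that records failures and carries on. The names `fetch_all`, `fetch_one`, `delay` and `sleep` are hypothetical and introduced purely for illustration; in practice `fetch_one` could wrap `requests.get`.

```python
import time


def fetch_all(urls, fetch_one, delay=1.0, sleep=time.sleep):
    """Call fetch_one on each URL, pausing between calls.

    fetch_one, delay and sleep are hypothetical parameters introduced
    for this sketch; they are not part of the notebook's code.
    """
    results, failures = [], []
    for url in urls:
        sleep(delay)  # delay to crawl responsibly
        try:
            results.append(fetch_one(url))
        except Exception as error:  # e.g. an HTTP or connection error
            results.append(None)  # placeholder keeps results aligned with urls
            failures.append((url, error))
    return results, failures


# usage with a stand-in fetcher, so no network access is needed
def fake_fetch(url):
    if url.endswith('bad.json'):
        raise ValueError('simulated failure')
    return {'url': url}


results, failures = fetch_all(['a.json', 'bad.json', 'c.json'], fake_fetch,
                              delay=0, sleep=lambda seconds: None)
print(len(results), len(failures))
```

Keeping a placeholder for each failed URL means the results stay aligned with the input list, so failed URLs can be retried or reported afterwards.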

Let's confirm that all the data was fetched and take a look at the first JSON response.

In [None]:
assert len(json_data) == 1090
json_data[0]

It is clear that every request successfully received a response. However, some responses came back empty from the server. Although there are Wikipedia pages for these physicists, they do not have a corresponding page in DBpedia. Not to worry, there are only 14 and they are not so famous, so I'll just exclude these physicists from the analysis.

In [None]:
# pair each URL with its response and keep the URLs whose response is empty
dropped_urls = [url for (url, datum) in zip(urls, json_data)
                if not datum]
assert len(dropped_urls) == 14
print(len(dropped_urls))
dropped_urls
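
To see which physicists are being excluded, it can help to turn a dropped URL back into a readable name. The helper below is a hypothetical sketch (`url_to_name` is not part of the notebook) that assumes the URLs have the shape produced by `construct_urls`.

```python
from urllib import parse

JSON_EXT = '.json'


def url_to_name(url):
    """Recover a name from a DBpedia data URL (hypothetical helper)."""
    segment = url.rsplit('/', 1)[-1]  # last path segment
    if segment.endswith(JSON_EXT):
        segment = segment[:-len(JSON_EXT)]  # strip the extension
    # undo the percent-encoding, then restore the spaces
    return parse.unquote(segment).replace('_', ' ')


print(url_to_name('http://dbpedia.org/data/Albert_Einstein.json'))
```

Applied to the list above, this would be something like `[url_to_name(url) for url in dropped_urls]`.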

## Persisting the Data

Now that I have the list of JSON responses, I'd like to persist them for later analysis. [JSON Lines](http://jsonlines.org/) seems like a convenient format for storing structured data that may be processed one record at a time. So I'll use that format and also compress the file to save some space.

In [None]:
def write_json_lines(lines, file, mode='wt', encoding='utf-8',
                     compress=False):
    """Write a json lines file to disk.

    Empty (falsy) items in `lines` are skipped.

    Args:
        lines (list(json)): List of JSON objects.
        file (str): A text or byte string giving the name (and the path
            if the file isn't in the current working directory) of the
            file to be opened or an integer file descriptor of the file
            to be wrapped. See the `open()` and `gzip.open()` functions in
            the standard library for more details.
        mode (str): Specifies the mode in which the file is opened. See
            the `open()` and `gzip.open()` functions in the standard
            library for more details.
        encoding (str): The name of the encoding used to decode or
            encode the file. This should only be used in text mode. See
            the codecs module for the list of supported encodings.
        compress (bool): True to compress the file using gzip, otherwise
            write the file as is.

    """

    opener = gzip.open if compress else open
    # the context manager closes the file even if a write fails
    with opener(file, mode=mode, encoding=encoding) as out_file:
        for datum in filter(None, lines):
            json.dump(datum, out_file, ensure_ascii=False)
            out_file.write('\n')

In [None]:
write_json_lines(json_data, '../data/raw/notable_physicists.jsonl.gz',
                 compress=True)

Let's do a quick sanity check to make sure the file contains the expected number of records: 1090 responses minus the 14 empty ones leaves 1076.

In [None]:
jsonlines = []
with gzip.open(filename='../data/raw/notable_physicists.jsonl.gz',
               mode='rt', encoding='utf-8') as file:
    for line in file:
        jsonlines.append(json.loads(line))
assert len(jsonlines) == 1076
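
For reference, the JSON Lines round trip can be sketched in memory without touching the disk. The records here are made up for illustration, and an `io.StringIO` buffer stands in for the compressed file.

```python
import io
import json

# made-up records for illustration; the empty dict mimics an empty response
records = [{'name': 'A'}, {}, {'name': 'B'}]

buffer = io.StringIO()
for record in filter(None, records):  # drops the empty record
    json.dump(record, buffer, ensure_ascii=False)
    buffer.write('\n')  # one JSON document per line

buffer.seek(0)
loaded = [json.loads(line) for line in buffer]
print(loaded)
```

Each line is a complete JSON document, which is what allows the file to be read back one record at a time.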

## Cleaning Up

A few clean-up steps are needed:

- Convert the notebook to an HTML file with all the output.
- Convert the notebook to another notebook with the output removed.

In [None]:
!jupyter nbconvert --ClearOutputPreprocessor.enabled=False --output-dir html_output --to html 1.1-collect-physicists-raw-data.ipynb

In [None]:
!jupyter nbconvert --ClearOutputPreprocessor.enabled=True --to notebook 1.1-collect-physicists-raw-data.ipynb