# Collect Physicists Raw Data

Here I collect biographical data on the list of [physicists notable for their achievements](../data/raw/physicists.txt). Wikipedia contains this structured data in an *Infobox* on the top right side of the article for each physicist. However, similar data is available in a more machine-readable *JSON* format from [DBpedia](https://wiki.dbpedia.org/about). For an example, compare [Albert Einstein's Wikipedia infobox](https://en.wikipedia.org/wiki/Albert_Einstein) to [Albert Einstein's DBpedia JSON](http://dbpedia.org/data/Albert_Einstein.json). It is important to note that the data is similar but not identical.

The shortcomings of Wikipedia infoboxes and the advantages of DBpedia datasets are explained in [DBpedia datasets (section 4.3)](https://wiki.dbpedia.org/services-resources/datasets/dbpedia-datasets). Basically, the DBpedia data is much cleaner and better structured than Wikipedia infoboxes, since it is based on hand-generated mappings of Wikipedia infoboxes/templates to a [DBpedia ontology](https://wiki.dbpedia.org/services-resources/ontology). As a result, I choose DBpedia as the data source for this project.

However, DBpedia does have the disadvantage that its content is roughly 6-18 months behind updates applied to Wikipedia content. This is because its data is generated from a [static dump of Wikipedia content](https://wiki.dbpedia.org/online-access/DBpediaLive) in a process that takes approximately 6 months. This lag is not significant for this project, as the data is edited infrequently and, when it is, the changes are only minor.

I will need to send HTTP requests to DBpedia to get the biographical JSON data on each of the physicists.

## Setting the Environment

A few initialization steps are needed to set up the environment:

- The top-level module directory of the repository needs to be added to the system path to enable the loading of Python modules.
- The locale needs to be set for all categories to the user’s default setting (typically specified in the LANG environment variable) to enable correct sorting of physicists' names with accents.

In [None]:
import locale
import sys

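# Make the repository's top-level packages (e.g. src) importable.
repo_dir = '../'
if repo_dir not in sys.path:
    sys.path.append(repo_dir)

# Use the user's default locale for all categories.
locale.setlocale(locale.LC_ALL, '')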
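
As a rough illustration of why the locale matters for sorting accented names, the sketch below compares Python's default code-point ordering with a locale-aware ordering via `locale.strxfrm`. The sample names are purely illustrative and are not read from the project's data. The locale-aware order depends on the locales installed on the machine, so no particular result is assumed for it.

```python
import locale

# Keep the current locale if the user's default one is unavailable.
try:
    locale.setlocale(locale.LC_ALL, '')
except locale.Error:
    pass

# Illustrative names only (not read from the project's data files).
names = ['Émile Amagat', 'Erwin Schrödinger',
         'Zhores Alferov', 'Anders Ångström']

# The default sort compares Unicode code points, so 'É' (U+00C9)
# lands after 'Z' (U+005A).
codepoint_order = sorted(names)

# The locale-aware sort follows the collation rules of the active locale.
locale_order = sorted(names, key=locale.strxfrm)

print(codepoint_order)
print(locale_order)
```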

## Constructing the URLs

To make the HTTP requests, I need a list of URLs representing the resources (physicists). It's fairly easy to construct these URLs from the list of notable physicists. However, it's important to `quote` any physicist name that contains non-ASCII characters, since such characters are not allowed in URLs. OK, let's create the list now.

In [None]:
import gzip
import os
import shutil
from collections import OrderedDict

import jsonlines
import pandas as pd

from src.data.url_utils import DBPEDIA_DATA_URL
from src.data.url_utils import fetch_json_data
from src.data.url_utils import urls_progress_bar

In [None]:
def construct_urls(file='../data/raw/physicists.txt'):
    """Construct DBpedia data URLs from list in file.

    Args:
        file (str): File containing a list of url filepaths
            with spaces replacing underscores.
    Returns:
        list(str): List of URLs.

    """

    # Use a distinct handle name so the `file` argument is not shadowed.
    with open(file, encoding='utf-8') as names_file:
        names = [line.rstrip('\n') for line in names_file]

    urls = [DBPEDIA_DATA_URL + name.replace(' ', '_') + '.json'
            for name in names]
    return urls

In [None]:
urls_to_fetch = construct_urls()
assert(len(urls_to_fetch) == 1060)
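
The quoting step mentioned above can be sketched with the standard library's `urllib.parse.quote`. This is only an illustration: `BASE_URL` is an assumption inferred from the Albert Einstein example link, not the value of the project's `DBPEDIA_DATA_URL` constant, and `build_url` is a hypothetical helper rather than the project's code.

```python
from urllib.parse import quote

# Assumed base URL, inferred from the Albert Einstein example link.
BASE_URL = 'http://dbpedia.org/data/'


def build_url(name):
    """Build a percent-encoded data URL for a physicist's name."""
    resource = name.replace(' ', '_')
    # quote() percent-encodes the UTF-8 bytes of non-ASCII characters.
    return BASE_URL + quote(resource) + '.json'


print(build_url('Albert Einstein'))
# http://dbpedia.org/data/Albert_Einstein.json
print(build_url('Erwin Schrödinger'))
# http://dbpedia.org/data/Erwin_Schr%C3%B6dinger.json
```

Note that quoting an already quoted URL would encode the `%` signs a second time, so the step should be applied exactly once.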

## Fetching the Data

Now that I have the list of URLs, it's time to make the HTTP requests to acquire the data. This will take a bit of time as there are a lot of URLs and I want to crawl responsibly so that I don't bombard the site. Time to get a coffee.

In [None]:
json_data = fetch_json_data(urls_to_fetch, max_workers=20,
                            progress_bar=urls_progress_bar(len(urls_to_fetch)))
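
The internals of `fetch_json_data` are not shown in this notebook. As a minimal sketch of how such a concurrent fetch might work, the code below assumes a thread pool and an injectable per-URL fetch function. Both `fetch_all` and `fetch_one` are hypothetical names, not the project's API, and the progress bar is omitted.

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch_one(url, timeout=30):
    """Fetch a single URL and decode its JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode('utf-8'))


def fetch_all(urls, fetch=fetch_one, max_workers=20):
    """Fetch URLs concurrently and return a dict mapping URL to data.

    A failed request is recorded as None instead of aborting the batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the URL it was submitted for.
        futures = {executor.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception:
                results[url] = None
    return results
```

Passing a stand-in for `fetch` makes it possible to exercise the concurrency logic without touching the network, and `max_workers` caps how many requests are in flight at once.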

Let's sort the data alphabetically by URL, confirm that all the data was fetched and take a look at the first JSON response.

In [None]:
json_data = OrderedDict(sorted(json_data.items(),
                               key=lambda x: locale.strxfrm(x[0])))
assert(len(json_data) == 1060)
print(list(json_data.keys())[0])
list(json_data.values())[0]

It is clear that every request successfully received a response. However, I see that some responses came back empty from the server. Basically, although there are Wikipedia pages for these physicists, they do not have a corresponding page in DBpedia (or the page in DBpedia has a different name). Not to worry, there are only 11 and they are not so famous, so I'll just exclude these physicists from the analysis.

In [None]:
urls_to_drop = [url for (url, data) in json_data.items()
                if not data]
assert(len(urls_to_drop) == 11)
urls_to_drop

In [None]:
json_data = [data for data in json_data.values() if data]
assert(len(json_data) == 1049)

## Persisting the Data

Now that I have the list of JSON responses, I'd like to persist them for later analysis. I'll use [JSON Lines](http://jsonlines.org/) as it seems like a convenient format for storing structured data that may be processed one record at a time.

In [None]:
jsonl_file = '../data/raw/notable_physicists.jsonl'
with jsonlines.open(jsonl_file, 'w') as writer:
    writer.write_all(json_data)

Let's do a quick sanity check to make sure the file contains the expected number of records.

In [None]:
json_lines = []
with jsonlines.open(jsonl_file, 'r') as reader:
    for json_line in reader:
        json_lines.append(json_line)
assert(len(json_lines) == 1049)

Finally, let's compress the file to reduce its footprint.

In [None]:
with open(jsonl_file, 'rb') as src, gzip.open(jsonl_file + '.gz', 'wb') as dest:
    shutil.copyfileobj(src, dest)
os.remove(jsonl_file)
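
For later analysis, the compressed file can be read back one record at a time without first decompressing it to disk. The sketch below uses only the standard library's `gzip` and `json` modules rather than `jsonlines`. `read_jsonl_gz` is a hypothetical helper name and is not part of the project.

```python
import gzip
import json


def read_jsonl_gz(path):
    """Yield one parsed JSON record per non-empty line of a gzipped file."""
    # 'rt' opens the gzip stream in text mode so lines decode as UTF-8.
    with gzip.open(path, 'rt', encoding='utf-8') as lines:
        for line in lines:
            if line.strip():
                yield json.loads(line)
```

With the file written above, `list(read_jsonl_gz(jsonl_file + '.gz'))` would be expected to return the same 1049 records, assuming the compression step completed successfully.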