# Collect Places Raw Data

The [notable physicists dataframe](../data/interim/notable_physicists.csv) contains the following fields, which I shall collectively call *places*: `almaMater`, `birthPlace`, `citizenship`, `deathPlace`, `nationality`, `residence` and `workplaces`. Overwhelmingly, the values in these fields are semantic URLs. However, some fields contain free-form text, which adds noise to the data. Matters are further complicated by the fact that two or more semantic URLs can refer to the same geographical location. An example is the pair of resources (webpages) [Kingdom of Prussia](http://dbpedia.org/page/Kingdom_of_Prussia) and [Germany](http://dbpedia.org/page/Germany), which are located in the same country according to their geographical coordinates *latitude* ([geo:lat](http://www.w3.org/2003/01/geo/wgs84_pos#lat)) and *longitude* ([geo:long](http://www.w3.org/2003/01/geo/wgs84_pos#long)).

I define a *location* as a DBpedia resource (webpage) that contains a latitude and longitude. So [Kingdom of Prussia](http://dbpedia.org/page/Kingdom_of_Prussia) and [Germany](http://dbpedia.org/page/Germany) are locations, but so too are [Massachusetts Institute of Technology](http://dbpedia.org/page/Massachusetts_Institute_of_Technology) and [Sancellemoz](http://dbpedia.org/resource/Sancellemoz) (the death place of *Marie Curie*). During feature construction it is important to be able to map from locations to countries (using the latitude and longitude), as the countries that a physicist is associated with can have an impact on whether they are awarded a Nobel Prize in Physics. In some cases I will use both the location and the country in separate features (e.g. Massachusetts Institute of Technology as a `workplace` or `almaMater` and United States as the associated `workCountry` or `almaMaterCountry`). In these cases both the institution and the country can have an impact on the award of a Nobel Prize in Physics. In other cases the location itself has no impact and I will only use the country in a single feature (e.g. I will not use `birthPlace` or `deathPlace` as features but will use `birthCountry` or `deathCountry`).

To be able to perform mappings from locations to countries I will first need the resources. So the goal here is to fetch the *JSON* resources for all the places. The resources will later be processed to extract the relevant information from them such as latitude and longitude. The latitude and longitude can then be used to infer the country.
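As a rough illustration of the later extraction step, here is a minimal sketch of pulling the latitude and longitude out of one fetched resource. It assumes the JSON is laid out as subject URI, then predicate URI, then a list of value objects with a `value` key. The function name `extract_coordinates` and the sample resource are hypothetical and the coordinate values are made up for illustration.

```python
GEO_LAT = 'http://www.w3.org/2003/01/geo/wgs84_pos#lat'
GEO_LONG = 'http://www.w3.org/2003/01/geo/wgs84_pos#long'


def extract_coordinates(resource_json, resource_uri):
    """Return (latitude, longitude) for a resource, or None if either is missing.

    Assumes the layout {subject URI: {predicate URI: [{'value': ...}, ...]}}.
    """
    properties = resource_json.get(resource_uri, {})
    lats = properties.get(GEO_LAT, [])
    longs = properties.get(GEO_LONG, [])
    if not lats or not longs:
        return None
    # Use the first listed value for each coordinate.
    return float(lats[0]['value']), float(longs[0]['value'])


# Hand-written sample mimicking the assumed layout (values are illustrative).
sample_uri = 'http://dbpedia.org/resource/Example_Place'
sample = {
    sample_uri: {
        GEO_LAT: [{'type': 'literal', 'value': 52.5}],
        GEO_LONG: [{'type': 'literal', 'value': 13.4}],
    }
}
coordinates = extract_coordinates(sample, sample_uri)
```

A resource without both coordinates yields `None`, which matches the definition of a location given above.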

## Setting up the Environment

An initialization step is needed to set up the environment:
- The locale needs to be set for all categories to the user’s default setting (typically specified in the `LANG` environment variable) to enable correct sorting of words with accents.

In [None]:
import locale

locale.setlocale(locale.LC_ALL, '')

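To see why the locale matters, the sketch below compares plain code point ordering with ordering by `locale.strxfrm`. The place names are illustrative only, and the locale-aware order depends on the locale configured on the machine, so no particular result is assumed for it.

```python
import locale

try:
    locale.setlocale(locale.LC_ALL, '')
except locale.Error:
    # Fall back to whatever locale is already active if the default is unavailable.
    pass

names = ['Zürich', 'Århus', 'Berlin', 'Élancourt']

# Plain sorting compares Unicode code points, so accented capitals sort after 'Z'.
codepoint_order = sorted(names)

# strxfrm transforms each string so that comparisons follow the locale's collation rules.
locale_order = sorted(names, key=locale.strxfrm)
```

Under a locale such as `en_US.UTF-8` the accented names would typically be interleaved with the unaccented ones, whereas the code point ordering pushes them to the end.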
In [None]:
import gzip
import os
import shutil
from collections import OrderedDict

import jsonlines
import pandas as pd
from src.data.url_utils import DBPEDIA_DATA_URL
from src.data.url_utils import fetch_json_data
from src.data.url_utils import quote_url
from src.data.url_utils import unquote_url
from src.data.url_utils import urls_progress_bar

## Constructing the URLs

To make the HTTP requests, I will need a list of URLs representing the resources (places). These URLs are straightforward to construct from the places fields in the notable physicists dataframe. However, it's important to [quote](https://docs.python.org/3.1/library/urllib.parse.html) special characters in each URL using the `%xx` escape. Let's create the list now.

In [None]:
physicists = pd.read_csv('../data/interim/physicists.csv')
physicists.head()

In [None]:
PLACES_FIELDS = ['almaMater', 'birthPlace', 'citizenship', 'deathPlace',
                 'nationality', 'residence', 'workplaces']

In [None]:
def construct_urls(physicists, columns=None):
    """Construct DBpedia data URLs from dataframe.

    Args:
        physicists (pandas.DataFrame): Dataframe containing physicists
            data.
        columns (list of `str`, optional): Defaults to None. List of
            columns to extract the URLs from. If None then all columns
            in the dataframe are used. The latter is probably not what
            is desired.

    Returns:
        list of `str`: List of URLs.

    """
    
    if not columns:
        use_columns = physicists.columns
    else:
        use_columns = columns
    
    values = (physicists[use_columns].applymap(
        lambda x: x.split('|') if isinstance(x, str)
        else [])).values.flatten()
    flat_values = list(
        set([item for item_list in values for item in item_list]))
    
    urls = [DBPEDIA_DATA_URL + flat_value.replace(' ', '_') + '.json'
            for flat_value in flat_values]
    
    urls.sort(key=locale.strxfrm)
    return urls

In [None]:
urls_to_fetch = construct_urls(physicists, columns=PLACES_FIELDS)
quoted_urls_to_fetch = [quote_url(url) for url in urls_to_fetch]
len(urls_to_fetch)
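To illustrate what quoting does to a URL, here is a minimal sketch using the standard library's `urllib.parse.quote` and `unquote`. It is not the implementation of `quote_url` or `unquote_url` in `src.data.url_utils`, which is not shown here. The base URL and the choice of `safe=':/'` are assumptions made for the example.

```python
from urllib.parse import quote, unquote

# Assumed example URL; the real value of DBPEDIA_DATA_URL is defined in src.data.url_utils.
url = 'http://dbpedia.org/data/École_Polytechnique.json'

# Keep ':' and '/' unescaped so the scheme and path separators survive;
# non-ASCII characters are UTF-8 encoded and written as %xx escapes.
quoted = quote(url, safe=':/')

# Unquoting reverses the escaping and recovers the original text.
round_trip = unquote(quoted)
```

Here `quoted` is `http://dbpedia.org/data/%C3%89cole_Polytechnique.json` and `round_trip` equals the original `url`.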

## Fetching the Data

Now that I have the list of URLs, it's time to make the HTTP requests to acquire the data. The requests are made asynchronously, which dramatically improves performance. It is important to set the `max_workers` parameter sensibly so that I crawl responsibly and do not bombard the site's server. Although the site appears to be rate limited, this is still good etiquette.

In [None]:
json_data = fetch_json_data(quoted_urls_to_fetch, max_workers=20, timeout=30,
                            progress_bar=urls_progress_bar(len(urls_to_fetch)))
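The implementation of `fetch_json_data` is not shown in this notebook. As a sketch of one way a bounded pool of workers can fetch many URLs concurrently, the example below uses `concurrent.futures.ThreadPoolExecutor`. The names `fetch_one` and `fetch_all` are hypothetical, and the real helper may work differently (it also takes a progress bar, which is omitted here).

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch_one(url, timeout=30):
    """Fetch a single URL and decode its JSON body (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode('utf-8'))


def fetch_all(urls, max_workers=20, fetch=fetch_one):
    """Fetch URLs concurrently and return a dict mapping url -> decoded JSON.

    At most max_workers requests are in flight at any time, which is what
    keeps the load on the server bounded.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            # result() re-raises any exception raised inside the worker.
            results[futures[future]] = future.result()
    return results
```

Passing the fetch function as a parameter makes the sketch easy to try without network access, for example `fetch_all(['a', 'b'], max_workers=2, fetch=lambda url: {'url': url})`.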

Let's confirm that all the data was fetched and take a look at the first JSON response.

In [None]:
json_data = OrderedDict(sorted(
    json_data.items(),
    key=lambda x: locale.strxfrm(unquote_url(x[0]))))
assert(len(json_data) == 1913)
print(list(json_data.keys())[0])
list(json_data.values())[0]

It is clear that every request successfully received a response. However, some responses came back empty from the server. This is to be expected, as some fields contain free-form text that does not map to a semantic URL in DBpedia. I'll just exclude these from the list.

In [None]:
urls_to_drop = [unquote_url(url) for (url, data) in json_data.items()
                if not data]
assert(len(urls_to_drop) == 48)
urls_to_drop

In [None]:
json_data = [data for data in json_data.values() if data]
assert(len(json_data) == 1865)

## Persisting the Data

Now that I have the list of JSON responses, I'd like to persist them for later analysis. I'll use [JSON Lines](http://jsonlines.org/), as it is a convenient format for storing structured data that may be processed one record at a time.

In [None]:
jsonl_file = '../data/raw/places.jsonl'
with jsonlines.open(jsonl_file, 'w') as writer:
    writer.write_all(json_data)

Let's do a quick sanity check to make sure the file contains the expected number of records.

In [None]:
json_lines = []
with jsonlines.open(jsonl_file, 'r') as reader:
    for json_line in reader:
        json_lines.append(json_line)
assert(len(json_lines) == 1865)

Finally, let's compress the file to reduce its footprint.

In [None]:
with open(jsonl_file, 'rb') as src, gzip.open(jsonl_file + '.gz', 'wb') as dest:
    shutil.copyfileobj(src, dest)
os.remove(jsonl_file)
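For later processing, the compressed file can be read back one record at a time without decompressing it to disk first. The sketch below uses only the standard library; the function name `read_jsonl_gz` is hypothetical, and it assumes the file is UTF-8 encoded with one JSON document per line.

```python
import gzip
import json


def read_jsonl_gz(path):
    """Yield one decoded JSON record per non-empty line of a gzipped JSON Lines file."""
    # 'rt' opens the gzip stream in text mode so lines can be iterated directly.
    with gzip.open(path, 'rt', encoding='utf-8') as lines:
        for line in lines:
            if line.strip():
                yield json.loads(line)
```

For example, `sum(1 for _ in read_jsonl_gz('../data/raw/places.jsonl.gz'))` would count the records while holding only one of them in memory at a time.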