# Process Places Raw Data

Now I aim to convert the [places raw data](../data/raw/places.jsonl.gz), which is in JSON Lines format, into an intermediate *pandas dataframe* format that is more convenient to work with. This intermediate format will be close to the final format of the data that I'll be working with for analysis.

My goal is to parse the JSON data in order to extract the interesting fields of information such as *latitude*, *longitude*, *city*, *country*, etc. As mentioned previously, these fields will be useful for constructing features based on the country that the place is located in. In particular, for places with a latitude and longitude (*locations*), I will be able to map from locations to identifiable countries.

## Setting up the Environment

An initialization step is needed to set up the environment:
- The locale needs to be set for all categories to the user’s default setting (typically specified in the LANG environment variable) to enable correct sorting of words with accents.

In [None]:
import locale
    
locale.setlocale(locale.LC_ALL, '')
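To see why the locale matters, here is a small sketch that compares plain sorting with locale-aware sorting. The word list is made up for illustration, and the fallback to the `C` locale is my own addition so that the snippet still runs when the environment names a locale that is not installed. The result of the second sort depends on the locale configured on your machine, so I don't show an expected order for it.

```python
import locale

# Use the user's default locale; fall back to the portable "C" locale
# if the environment names a locale that is not installed.
try:
    locale.setlocale(locale.LC_ALL, '')
except locale.Error:
    locale.setlocale(locale.LC_ALL, 'C')

words = ['Zürich', 'Århus', 'Aachen', 'Écully']  # made-up sample

# Plain sorting compares Unicode code points, so accented capitals
# such as 'Å' and 'É' land after 'Z'.
codepoint_sorted = sorted(words)

# Locale-aware sorting transforms each word with locale.strxfrm first.
locale_sorted = sorted(words, key=locale.strxfrm)

print(codepoint_sorted)
print(locale_sorted)
```

With a typical UTF-8 locale the second list should keep the accented words next to their unaccented neighbours, which is the behaviour I rely on later when sorting country names and URLs.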

In [None]:
import numpy as np
import pandas as pd
from pandas import json_normalize

from src.data.dbpedia_utils import construct_resource_urls
from src.data.dbpedia_utils import find_resource_url
from src.data.dbpedia_utils import get_source_url
from src.data.dbpedia_utils import impute_redirect_filenames
from src.data.dbpedia_utils import json_categories_to_dict
from src.data.dbpedia_utils import json_keys_to_dict
from src.data.dbpedia_utils import PLACES_IMPUTE_KEYS
from src.data.jsonl_utils import read_jsonl
from src.data.progress_bar import progress_bar
from src.data.url_utils import get_filename_from_url
from src.data.url_utils import get_redirect_urls

## Reading in the JSON Lines Data

First let's read the JSON lines data into a list so that we can parse it later and take a look at the first entry.

In [None]:
json_lines = read_jsonl('../data/raw/places.jsonl.gz')
json_lines[0]
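The `read_jsonl` helper lives in `src.data.jsonl_utils` and is not shown in this notebook. As a rough idea of what such a reader involves, here is a minimal sketch using only the standard library. The function name `read_jsonl_sketch` is hypothetical and the real helper may well differ, for example in how it handles blank lines or encodings.

```python
import gzip
import json


def read_jsonl_sketch(path):
    """Read a gzip-compressed JSON Lines file into a list of parsed values.

    Hypothetical stand-in for `read_jsonl`; the real helper may differ.
    """
    records = []
    with gzip.open(path, mode='rt', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

Each line of the file is an independent JSON document, which is why the file can be parsed one line at a time instead of being loaded as a single JSON value.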

## Defining the Fields of Interest

Here I define the keys and values that I wish to extract from the JSON lines data and also some that I explicitly wish to exclude. These keys and values are in the form of *semantic URLs*, which allow anyone to visit the resource in a web browser and see its meaning. The URLs are in five namespaces:

- [DBpedia Ontology](https://wiki.dbpedia.org/services-resources/ontology)
- DBpedia Property
- PURL
- W3
- FOAF

In [None]:
DBPEDIA_JSON_KEYS = [
    # DBpedia ontology
    'http://dbpedia.org/ontology/abstract',
    'http://dbpedia.org/ontology/city',
    'http://dbpedia.org/ontology/country',
    'http://dbpedia.org/ontology/type',
    'http://dbpedia.org/ontology/thumbnail',
    'http://dbpedia.org/ontology/wikiPageID',
    'http://dbpedia.org/ontology/wikiPageRevisionID',
    
    # DBpedia property
    'http://dbpedia.org/property/country',

    # PURL
    'http://purl.org/dc/terms/description',

    # W3
    'http://www.w3.org/2000/01/rdf-schema#comment',
    'http://www.w3.org/2003/01/geo/wgs84_pos#lat',
    'http://www.w3.org/2003/01/geo/wgs84_pos#long',
    'http://www.w3.org/ns/prov#wasDerivedFrom',
    
    # FOAF
    'http://xmlns.com/foaf/0.1/depiction',
    'http://xmlns.com/foaf/0.1/homepage',
    'http://xmlns.com/foaf/0.1/isPrimaryTopicOf',
    'http://xmlns.com/foaf/0.1/name'
]

DBPEDIA_IGNORE_URLS = [
    'http://dbpedia.org/resource/None'
]
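Each key above is a namespace followed by a local name such as `country` or `lat`. The following sketch separates the two parts. The helper name `split_semantic_url` is hypothetical and is not part of `src.data`; it simply assumes that the local name follows a `#` if there is one and the last `/` otherwise.

```python
def split_semantic_url(url):
    """Split a semantic URL into (namespace, local name).

    The local name follows the '#' if there is one, otherwise the last '/'.
    """
    separator = '#' if '#' in url else '/'
    namespace, _, local_name = url.rpartition(separator)
    return namespace + separator, local_name


examples = [
    'http://dbpedia.org/ontology/country',
    'http://dbpedia.org/property/country',
    'http://www.w3.org/2003/01/geo/wgs84_pos#lat',
    'http://xmlns.com/foaf/0.1/name',
]
for url in examples:
    print(split_semantic_url(url))
```

Note that `country` appears under both the ontology and the property namespace, so the same local name can arrive from two different keys.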

## Creating the Places Dictionaries

Now I parse the JSON lines data to create dictionaries. The following rules apply when creating the dictionaries:

1. Values in the DBpedia Ontology namespace take precedence over those in the DBpedia Property namespace since, as described in section 4.3 of [DBpedia datasets](https://wiki.dbpedia.org/services-resources/datasets/dbpedia-datasets), the ontology namespace contains the cleanest data. Property namespace values are only used when there are no corresponding ontology namespace values.

2. Some of the fields (e.g. the *abstract*) are multilingual. In such cases, only the English text is extracted.

3. The *country* field is slightly messy and some cleanup is done on it. However, some noise remains in the data. The sources of the noise are: 
    - Fields containing semantic URLs that are redirected.
    - Semi-structured text containing valuable information that is not easy for a machine to understand.
    
I will handle some of these issues now and the rest later, either before or while generating features for machine learning.
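Rules 1 and 2 can be sketched on a toy record as follows. This is only an illustration under stated assumptions: I assume each key maps to a list of entries with a `value` and an optional `lang`, the record itself is made up, and the helper names `english_values` and `pick_field` are hypothetical. The actual extraction is done by `json_keys_to_dict` on the flattened JSON and may work differently.

```python
ONTOLOGY = 'http://dbpedia.org/ontology/'
PROPERTY = 'http://dbpedia.org/property/'


def english_values(entries):
    """Keep values that are untagged or tagged as English."""
    return [entry['value'] for entry in entries
            if entry.get('lang', 'en') == 'en']


def pick_field(resource, name):
    """Return ontology values for `name`, else property values, else []."""
    for namespace in (ONTOLOGY, PROPERTY):
        values = english_values(resource.get(namespace + name, []))
        if values:
            return values
    return []


# Made-up record for illustration only.
resource = {
    ONTOLOGY + 'abstract': [
        {'type': 'literal', 'value': 'Eine Universität.', 'lang': 'de'},
        {'type': 'literal', 'value': 'A university.', 'lang': 'en'},
    ],
    ONTOLOGY + 'country': [
        {'type': 'uri', 'value': 'http://dbpedia.org/resource/Germany'},
    ],
    PROPERTY + 'country': [
        {'type': 'literal', 'value': 'West Germany', 'lang': 'en'},
    ],
    PROPERTY + 'city': [
        {'type': 'literal', 'value': 'Bonn', 'lang': 'en'},
    ],
}

print(pick_field(resource, 'abstract'))  # English abstract only
print(pick_field(resource, 'country'))   # ontology value wins
print(pick_field(resource, 'city'))      # falls back to property
```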

OK let's now generate the dictionaries and take a look at the first entry.

In [None]:
def create_place_data(json_line):
    """Create place data from json_line data.

    Args:
        json_line (dict): JSON dict.
    Returns:
        dict: Dictionary of place data.

    """

    # flatten the json
    flat_json = json_normalize(json_line)

    # find the resource, source and fullName
    resource_url = find_resource_url(flat_json)
    source_url = get_source_url(resource_url)
    full_name = get_filename_from_url(resource_url).replace('_', ' ')

    # construct the dictionary
    dict_ = {'resource': resource_url,
             'source': source_url, 'fullName': full_name,
             **json_keys_to_dict(resource_url, flat_json,
                                 DBPEDIA_JSON_KEYS,
                                 ignore_urls=DBPEDIA_IGNORE_URLS),
             **json_categories_to_dict(flat_json)}

    dict_ = _clean_country(dict_)

    return dict_


def _clean_country(dict_):
    cleaned_dict = dict_.copy()

    key = 'country'
    key_present = cleaned_dict.get(key)
    if key_present:
        cleaned_dict[key] = (
            key_present
            .replace(', ', '|')
            .replace(' and ', '|')
            .replace('the ', '')
        )
        cleaned_dict[key] = '|'.join(sorted(cleaned_dict[key].split('|'),
                                            key=locale.strxfrm))
    return cleaned_dict
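To make the country cleanup concrete, here is a standalone sketch that applies the same string steps as `_clean_country` to a single value. The function name `clean_country_text` and the input strings are made up for illustration.

```python
import locale


def clean_country_text(text):
    """Apply the same string steps as `_clean_country` to one value."""
    cleaned = (text.replace(', ', '|')
                   .replace(' and ', '|')
                   .replace('the ', ''))
    # Sort the individual countries so that equal sets compare equal.
    return '|'.join(sorted(cleaned.split('|'), key=locale.strxfrm))


# Made-up inputs for illustration.
print(clean_country_text('the Netherlands, Belgium and France'))
print(clean_country_text('Trinidad and Tobago'))
```

The first call should give `Belgium|France|Netherlands`. The second call shows one source of the remaining noise: because the replacements are plain substring matches, a single country whose name contains ` and ` is split in two.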

In [None]:
bar = progress_bar(len(json_lines), banner_text_begin='Creating: ',
                   banner_text_end=' dicts')
bar.start()

data = []
for i, json_line in enumerate(json_lines):
    datum = create_place_data(json_line)
    data.append(datum)
    bar.update(i)

bar.finish()

In [None]:
data[0]

You can see that many of the values in the dictionary contain lists of semantic URLs which are each meant to refer to a unique "thing" such as a person or a place. If the URL is redirected, it is important to know where it is redirected to so that identical "things" in fact resolve to the same URL. It is very important to do this so that the machine learning models have clean data in order to differentiate signal from noise.

## Imputing Redirects in the Places Dictionaries

In order to impute redirects in the dictionaries I need to perform the following steps:

1. *Parse the dictionaries to obtain a list of all URLs*. These are either resource URLs or URLs constructed in an *ad hoc* manner from the free form text in the hope that they resolve to a genuine resource URL. The aim is to resolve as many items as possible in the fields to semantic URLs. I restrict URL selection and construction to the following impute keys: `categories`, `city`, `country` and `type` as these are fields of interest involving URLs that features may be extracted from.
2. *Submit HTTP requests to fetch the URLs and determine their redirect URLs.* A cache is kept mapping the URLs to the redirect URLs, so an HTTP request is only made for a URL that is not found in the cache. This greatly helps with performance. Note that I reuse the [dbpedia redirects cache](../data/raw/dbpedia-redirects.csv) created in notebook [2.0-process-physicists-raw-data.ipynb](2.0-process-physicists-raw-data.ipynb).
3. *Replace the URLs in the dictionaries with the redirect URLs.* In fact I use just the filename since the paths are identical for every URL.
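The caching idea in steps 2 and 3 can be sketched as below. This is not the implementation of `get_redirect_urls`, which lives in `src.data.url_utils` and also handles concurrency and timeouts. The names `resolve_redirects`, `filename_from_url` and `fake_fetch` are hypothetical, and `fetch` is assumed to be any callable that returns the final URL after redirects or `None` when the URL is not found. A fake fetcher with a made-up mapping stands in for real HTTP requests so that the sketch runs offline.

```python
from urllib.parse import unquote, urlsplit


def resolve_redirects(urls, cache, fetch):
    """Map each URL to its redirect URL, consulting `cache` first."""
    resolved = {}
    for url in urls:
        if url in cache:
            resolved[url] = cache[url]
            continue
        target = fetch(url)  # only reached on a cache miss
        if target is not None:
            cache[url] = target
            resolved[url] = target
    return resolved


def filename_from_url(url):
    """Return the last path segment of a URL."""
    return unquote(urlsplit(url).path.rsplit('/', 1)[-1])


# Stand-in for real HTTP requests; the mapping is made up for illustration.
fake_web = {
    'http://dbpedia.org/resource/USA':
        'http://dbpedia.org/resource/United_States',
}
calls = []


def fake_fetch(url):
    calls.append(url)
    return fake_web.get(url)


cache = {'http://dbpedia.org/resource/UK':
         'http://dbpedia.org/resource/United_Kingdom'}
demo_urls = ['http://dbpedia.org/resource/UK',
             'http://dbpedia.org/resource/USA',
             'http://dbpedia.org/resource/Some_free_text']

resolved = resolve_redirects(demo_urls, cache, fake_fetch)
print({url: filename_from_url(target) for url, target in resolved.items()})
print(calls)  # the cached URL is never fetched
```

The URL built from free form text is fetched but not found, so it is left out of the result, which is why fewer redirects than requested URLs come back.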

In [None]:
urls_to_check = construct_resource_urls(data, PLACES_IMPUTE_KEYS)
len(urls_to_check)

In [None]:
url_cache_path = '../data/raw/dbpedia-redirects.csv'
redirects = get_redirect_urls(
    urls_to_check, url_cache_path=url_cache_path, max_workers=20,
    timeout=30, progress_bar=progress_bar(len(urls_to_check)))

In [None]:
len(redirects)

You can see that a few of the requested URLs were not found. This is to be expected as there is free form text in the fields that does not map to a semantic URL in DBpedia. However, my *ad hoc* approach of constructing URLs from free form text is very successful in finding legitimate URLs in many instances.

Now I sort and persist the URL cache to disk in case any new URLs are found.

In [None]:
dbpedia_redirects = pd.DataFrame(
    sorted(redirects.items(), key=lambda x: locale.strxfrm(x[0])),
    columns=['url', 'redirect_url'])
dbpedia_redirects.to_csv(url_cache_path, index=False)

Now I replace the URLs in the dictionaries with the redirect URLs making sure to just use the filename since the paths are identical for every URL.

In [None]:
imputed_data = impute_redirect_filenames(data, PLACES_IMPUTE_KEYS, redirects)
imputed_data[0]

## Creating the Places Dataframe

Now I use the dictionaries of imputed data to create a dataframe. Let's confirm that it contains the expected number of places and take a look at the first few records.

In [None]:
places = pd.DataFrame(imputed_data)
assert len(places) == 1883
places.head()

## Impute Missing Latitudes and Longitudes from Cities

Let's see what percentage of places we can define as locations, namely, those that have a value for both latitude and longitude.

In [None]:
# round after scaling to avoid floating point noise such as 56.99999999999999
print(round(100 * (
    ~places.lat.isna() & ~places.long.isna()).sum() / len(places)), '%')

Despite this being a healthy percentage, it is possible to do better as there is extra information in the `city` field that is not being utilized. The places below do not have a latitude and longitude defined but do have a city. 

In [None]:
places[places.lat.isna() & places.long.isna() & ~places.city.isna()][
    ['fullName', 'city', 'lat', 'long']]

Now for each place, I will attempt to automatically impute the missing latitude and longitude values based on their values for the associated city. For the few cities that are not present in the dataframe, I manually impute the values from the corresponding JSON file instead of making further HTTP requests to obtain the data.

In [None]:
def impute_lat_long(places):
    """Impute missing latitude and longitudes from cities in the places dataframe.

    Args:
        places (pandas.DataFrame): Dataframe of places data.

    Returns:
        pandas.DataFrame: Dataframe containing imputed data.

        Identical to `places` except that it contains imputed values for missing
        latitude and longitudes when the city is present.
    """
    
    imputed_places = places.apply(_imputes, axis=1, args=(places,))
    
    # manually impute values which are not in places dataframe
    # instead of fetching the json via HTTP requests
    imputed_places.loc[imputed_places.fullName == 'Banaras Hindu University',
        # http://dbpedia.org/data/Varanasi.json
        ['lat', 'long']] = [25.28000068664551, 82.95999908447266]
    imputed_places.loc[imputed_places.fullName == 'City University of New York',
        # http://dbpedia.org/data/New_York_City.json
        ['lat', 'long']] = [40.71269989013672, -74.00589752197266]
    imputed_places.loc[
        imputed_places.fullName == 'Ghulam Ishaq Khan Institute of Engineering Sciences and Technology',
        # http://dbpedia.org/data/Swabi_District.json
        ['lat', 'long']] = [34.11666488647461, 72.46666717529297]
    imputed_places.loc[
        imputed_places.fullName == 'International Centre for Theoretical Physics',
        # http://dbpedia.org/data/Trieste.json
        ['lat', 'long']] = [45.63333511352539, 13.80000019073486]
    imputed_places.loc[imputed_places.fullName == 'Kanagawa University',
        # http://dbpedia.org/data/Kanagawa-ku,_Yokohama.json
        ['lat', 'long']] = [35.47694396972656, 139.6294403076172]
    imputed_places.loc[imputed_places.fullName == 'National and Kapodistrian University of Athens',
        # https://www.latlong.net/place/athens-greece-22451.html
        ['lat', 'long']] = [37.983810, 23.727539]
    imputed_places.loc[imputed_places.fullName == 'University of Azad Jammu and Kashmir',
        # http://dbpedia.org/data/Muzaffarabad.json
        ['lat', 'long']] = [34.36100006103516, 73.46199798583984]
    imputed_places.loc[imputed_places.fullName == 'University of Hawaii',
        # http://dbpedia.org/data/Honolulu.json
        ['lat', 'long']] = [21.29999923706055, -157.8166656494141]
    imputed_places.loc[imputed_places.fullName == 'University of Lisbon',
        # http://dbpedia.org/data/Lisbon.json
        ['lat', 'long']] = [38.71381759643555, -9.139386177062988]
    return imputed_places


def _imputes(row, places):
    update_row = row.copy()
    
    if np.isnan(row.lat) and np.isnan(row.long) and isinstance(row.city, str):
        cities = row.city.split('|')
        for city in cities:
            place = places[places.fullName == city]
            assert len(place) <= 1
            if len(place) == 1:
                update_row['lat'] = place['lat'].item()
                update_row['long'] = place['long'].item()
                return update_row  # take the first city that is found
    return update_row
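The lookup rule used by `_imputes` is easier to see on a toy example. The sketch below uses plain dictionaries instead of a dataframe, and the records, coordinates and the name `impute_from_city` are all made up for illustration. It assumes, like the assertion in `_imputes`, that each `fullName` occurs at most once.

```python
import math

# Made-up records for illustration; NaN marks a missing coordinate.
toy_places = [
    {'fullName': 'Göttingen', 'city': None, 'lat': 51.53, 'long': 9.93},
    {'fullName': 'University of Göttingen', 'city': 'Göttingen',
     'lat': math.nan, 'long': math.nan},
    {'fullName': 'Example Institute', 'city': 'Atlantis|Göttingen',
     'lat': math.nan, 'long': math.nan},
]

# Index the coordinates by place name for quick lookup.
coordinates = {p['fullName']: (p['lat'], p['long']) for p in toy_places}


def impute_from_city(place):
    """Copy coordinates from the first listed city that is itself a place."""
    if not (math.isnan(place['lat']) and math.isnan(place['long'])):
        return place  # already a location, nothing to impute
    for city in (place['city'] or '').split('|'):
        if city in coordinates:
            lat, long = coordinates[city]
            return {**place, 'lat': lat, 'long': long}
    return place


imputed = [impute_from_city(p) for p in toy_places]
for place in imputed:
    print(place['fullName'], place['lat'], place['long'])
```

The last record lists two cities. The first one is unknown, so the coordinates come from the second, which mirrors the "take the first city that is found" rule.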

In [None]:
imputed_places = impute_lat_long(places)
imputed_places.head()

In [None]:
# round after scaling to avoid floating point noise such as 58.99999999999999
print(round(100 * (
    ~imputed_places.lat.isna() & ~imputed_places.long.isna()).sum()
    / len(imputed_places)), '%')

OK I've managed to convert a further 2% of the places to locations. Not too shabby! And there are still further places with mostly well-defined countries that can be utilized for feature construction.

In [None]:
imputed_places[imputed_places.lat.isna() & imputed_places.long.isna()
               & ~imputed_places.country.isna()][[
    'fullName', 'country', 'lat', 'long']]

## Persisting the Data

Now that I have the dataframe, I'd like to persist it for later analysis. So I'll write out the contents to a CSV file.

In [None]:
imputed_places.to_csv('../data/interim/places.csv', index=False)