# Reverse Geocode Places Interim Data

The [places interim dataframe](../data/interim/places.csv) consists of many places with a *latitude* and a *longitude* and some with only a *country* defined. Furthermore, there are some places which are actually nationalities and have none of these defined. My goal here is to obtain identifiable [ISO 3166-1 alpha-2 country codes](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2), [ISO 3166-1 alpha-3 country codes](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3) and *continent codes* for these places, which can be used further down the line for feature construction. The process of mapping from a latitude and longitude to a location is known as [reverse geocoding](https://en.wikipedia.org/wiki/Reverse_geocoding), and here I use the *Python* library [reverse-geocoder](https://github.com/thampiman/reverse-geocoder) to help me with that.

As mentioned, some places do not have a latitude or longitude, but do have a country defined. For places of this type I will use the Python library [pycountry-convert](https://github.com/TuneLab/pycountry-convert) to convert between the *country name* and *country codes*. This will not work in all instances because the country variable contains some free-form text, and in such cases I will resort to [named entity recognition](https://en.wikipedia.org/wiki/Named-entity_recognition) to extract [geopolitical entities](https://en.wiktionary.org/wiki/geopolitical_entity). For this task, I will use the excellent natural language processing library [spaCy](https://spacy.io/usage/linguistic-features#section-named-entities).

For places which are nationalities I will [normalize nationalities via an ISO 3166-1 alpha-2 country codes list](https://t2a.io/blog/normalising-nationalities-via-a-good-iso-3166-country-list/) to convert from nationalities to country codes. It is important to note that some of the places do not have a latitude, longitude, country or nationality defined. In such cases, I'll have to get creative. OK enough said, time to go on a mapping frenzy!

## Setting up the Environment

A few initialization steps are needed to set up the environment:
- The locale needs to be set for all categories to the user’s default setting (typically specified in the LANG environment variable) to enable correct sorting of words with accents.
- Load `en_core_web_sm` which is the default English language model in `spacy`.

In [None]:
import locale
import spacy
    
locale.setlocale(locale.LC_ALL, '')

nlp = spacy.load('en_core_web_sm')
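
As a rough illustration of why the locale matters, the sketch below joins a set of values using `locale.strxfrm` as the sort key, which is how pipe-separated values are ordered later in this notebook. The helper names `set_default_locale` and `join_sorted` are hypothetical, and the fallback to the `C` locale is an assumption for machines where the default locale is unavailable.

```python
import locale


def set_default_locale():
    """Use the user's default locale, falling back to 'C' if unavailable."""
    try:
        return locale.setlocale(locale.LC_ALL, '')
    except locale.Error:
        return locale.setlocale(locale.LC_ALL, 'C')


def join_sorted(items):
    """Join unique items with '|' in locale-aware order (hypothetical helper)."""
    return '|'.join(sorted(set(items), key=locale.strxfrm))


set_default_locale()
print(join_sorted(['GB', 'DE', 'FR', 'DE']))
# Where accented names end up depends on the active locale, so no
# particular order is claimed for this second call.
print(join_sorted(['Åland Islands', 'Zambia', 'Albania']))
```

Plain upper-case codes sort the same way under any common locale; it is only for accented names that the locale setting changes the result.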

In [None]:
import numpy as np
import pandas as pd
from pycountry_convert import convert_continent_code_to_continent_name
from pycountry_convert import country_alpha2_to_continent_code
from pycountry_convert import country_alpha2_to_country_name
from pycountry_convert import country_name_to_country_alpha2
from pycountry_convert import country_name_to_country_alpha3
import reverse_geocoder as rg

from src.data.country_utils import nationality_to_alpha2_code

## Reading in the Places Data

First let's read the places data into a dataframe and take a look at the columns of interest for the first few entries.

In [None]:
places = pd.read_csv('../data/interim/places.csv')
place_cols = ['fullName', 'lat', 'long', 'country']
places.head(20)[place_cols]

Already, it's obvious that there are places with latitudes, longitudes and countries and some with none of these defined. Exactly how many though?

In [None]:
print('Number of places: ', len(places))
print('Number with lat / long: ',
      (places.lat.notna() & places.long.notna()).sum())
assert(places.lat.isna().sum() == places.long.isna().sum())
print('Number with country: ',
      (places.country.notna()).sum())
print('Number with neither: ', 
      (places.lat.isna() & places.long.isna() & places.country.isna()).sum())

There are two reasons why it is better to start with the latitude and longitude before using the country:

- There are more values in the dataframe for latitude and longitude than country.
- The latitude and longitude values are more precise than the country values since there is free form text in the latter field.

## Reverse Geocoding

OK let's perform the reverse geocoding to obtain the alpha 2 country code and take a look at the first few places.

In [None]:
def reverse_geocode(places):
    """Reverse geocode the places dataframe.
    
    Use latitude and longitudes to find ISO 3166-1 alpha-2 country codes. 

    Args:
        places (pandas.DataFrame): Dataframe of places data.

    Returns:
        pandas.DataFrame: Dataframe containing ISO 3166-1 alpha-2 country codes.

        Identical to `places` except that it contains an extra column for ISO 
        3166-1 alpha-2 country codes when latitude and longitude are present.
    """

    rg_places = places.copy()

    # Only rows with both coordinates can be reverse geocoded.
    has_coords = (places.lat.notna() & places.long.notna()).values
    coords = list(zip(places.lat[has_coords], places.long[has_coords]))
    # rg.search returns one result per coordinate, in input order.
    ccs = [result['cc'] for result in rg.search(coords)]

    # Object array so that string codes and NaN can share one column.
    country_codes = np.full(len(places), np.nan, dtype=object)
    country_codes[has_coords] = ccs

    rg_places['countryAlpha2Code'] = country_codes
    return rg_places

In [None]:
places = reverse_geocode(places)
assert(places.lat.isna().sum() == places.countryAlpha2Code.isna().sum())
place_cols.append('countryAlpha2Code')
places.head(20)[place_cols]
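
The bookkeeping in `reverse_geocode` is easier to see in isolation: only rows with both coordinates are sent to the geocoder, and the results are then put back in the right row positions. Below is a minimal pandas-free sketch of that alignment step. `fake_search` is a made-up stand-in for `rg.search` with an arbitrary rule, and the sketch assumes, as the function above does, that results come back in input order.

```python
import math


def fake_search(coords):
    """Stand-in for a batch reverse geocoder: one result dict per coordinate.

    Hypothetical rule for illustration only: southern latitudes map to
    'AU', everything else to 'JP'.
    """
    return [{'cc': 'AU' if lat < 0 else 'JP'} for lat, _ in coords]


def codes_aligned_to_rows(lats, longs, search=fake_search):
    """Return one code per row, leaving nan where a coordinate is missing."""
    has_coords = [not (math.isnan(lat) or math.isnan(lon))
                  for lat, lon in zip(lats, longs)]
    coords = [(lat, lon)
              for lat, lon, ok in zip(lats, longs, has_coords) if ok]
    results = iter(search(coords))  # assumed to be in input order
    return [next(results)['cc'] if ok else math.nan for ok in has_coords]


lats = [-34.9, math.nan, 35.7]
longs = [138.6, math.nan, 139.7]
print(codes_aligned_to_rows(lats, longs))
```

The middle row has no coordinates, so it keeps a missing value while the rows on either side receive their codes.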

`reverse_geocoder` seems to be quite accurate, but I do notice one error. Adelaide is not in Japan (JP)! Let's investigate this further.

In [None]:
rg.search([(34.929001, 138.600998)])

This confirms the value in the dataframe and matches the [lat](http://www.w3.org/2003/01/geo/wgs84_pos#lat) and [long](http://www.w3.org/2003/01/geo/wgs84_pos#long) values in the source: http://dbpedia.org/data/Adelaide.json. So what's wrong? A little trial and error reveals that there is an input error in the source. The latitude value is missing a minus sign.

In [None]:
rg.search([(-34.929001, 138.600998)])

OK nice to know it's not a reverse geocoding error. However, it does raise some questions about the accuracy of DBpedia data. A quick scan through the data does reveal, though, that this type of issue is rare.
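
To see why a missing minus sign lands the point in Japan, here is a toy nearest-place lookup using the haversine distance. It is only a sketch of the general idea of matching a coordinate to the closest known place and is not a description of how `reverse_geocoder` works internally. The city table is hand-made with approximate coordinates.

```python
import math

# Hand-made reference table with approximate coordinates (illustrative only).
CITIES = [
    ('Adelaide', 'AU', -34.93, 138.60),
    ('Shizuoka', 'JP', 34.98, 138.38),
    ('Tokyo', 'JP', 35.68, 139.69),
]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def nearest_city(lat, lon):
    """Return (name, country code) of the closest city in CITIES."""
    name, cc, _, _ = min(
        CITIES, key=lambda c: haversine_km(lat, lon, c[2], c[3]))
    return name, cc


print(nearest_city(34.929001, 138.600998))   # latitude as found in the source
print(nearest_city(-34.929001, 138.600998))  # latitude with the sign restored
```

Flipping the sign of the latitude mirrors the point across the equator, and at this longitude the mirror image of Adelaide sits on the Pacific coast of Japan.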

Time to move on now and check how many places have values for the country but not a country alpha 2 code.

In [None]:
(places.country.notna() & places.countryAlpha2Code.isna()).sum()

Not too many, but time to take care of them nonetheless.

## Converting Countries to Alpha-2 Country Codes

I'm now going to convert the remaining places with only countries to their associated alpha-2 country codes.

In [None]:
def country_to_alpha2_code(text):
    """Create ISO 3166-1 alpha-2 country codes from countries.
    
    Use the country to find ISO 3166-1 alpha-2 country codes.
    This function should only be called for a subset of the
    places dataframe where country is defined and latitude or
    longitude is not (or equivalently ISO 3166-1 alpha-2
    country code is not defined).

    Args:
        text (str): Text containing countries.

    Returns:
        `str` or `numpy.nan`: Pipe separated list of ISO 3166-1
            alpha-2 country codes if found, otherwise numpy.nan.
    """
    
    countries = text.split('|')
    alpha2_codes = set()
    for country in countries:
        try:
            alpha2 = country_name_to_country_alpha2(country)
            alpha2_codes.add(alpha2)
        except KeyError:
            doc = nlp(country)
            for ent in (ent for ent in doc.ents if ent.label_ == 'GPE'):
                try:
                    alpha2 = country_name_to_country_alpha2(ent.text)
                    alpha2_codes.add(alpha2)
                except KeyError:
                    pass
                    
    if alpha2_codes:
        alpha2_codes = '|'.join(sorted(alpha2_codes, key=locale.strxfrm))
    else:
        alpha2_codes = np.nan
    return alpha2_codes

In [None]:
places_countries = places[places.countryAlpha2Code.isna() &
           places.country.notna()][['country', 'countryAlpha2Code']]
places.loc[places_countries.index, 'countryAlpha2Code'] = (
    places_countries.country.apply(country_to_alpha2_code))
places.loc[places_countries.index][place_cols]
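
The lookup-then-fallback pattern used in `country_to_alpha2_code` can be sketched without any third-party libraries. Everything below is a stand-in: `NAME_TO_ALPHA2` replaces `pycountry-convert`, and `find_known_names` is a crude substring scan in place of the spaCy entity extraction, so treat it as an illustration of the control flow only.

```python
# Tiny stand-in lookup; the notebook uses pycountry-convert instead.
NAME_TO_ALPHA2 = {'France': 'FR', 'Germany': 'DE', 'Japan': 'JP'}


def name_to_alpha2(name):
    """Exact-match lookup that raises KeyError for unknown names."""
    return NAME_TO_ALPHA2[name]


def find_known_names(text):
    """Crude stand-in for entity extraction: known names found in free text."""
    return [name for name in NAME_TO_ALPHA2 if name in text]


def text_to_alpha2_codes(text):
    """Map a pipe-separated country field to sorted, pipe-separated codes."""
    codes = set()
    for part in text.split('|'):
        try:
            codes.add(name_to_alpha2(part))
        except KeyError:
            # Fall back to scanning the free text for recognisable names.
            codes.update(name_to_alpha2(n) for n in find_known_names(part))
    return '|'.join(sorted(codes)) if codes else None


print(text_to_alpha2_codes('Japan'))
print(text_to_alpha2_codes('Kingdom of France (until 1792)|Germany'))
print(text_to_alpha2_codes('unknown realm'))
```

The exact match handles clean values cheaply, and the more expensive extraction step only runs on the entries that fail it.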

## Converting Nationalities to Alpha-2 Country Codes

Looking at the dataframe, it is clear that some of the remaining places are nationalities.

In [None]:
places[places.countryAlpha2Code.isna()][place_cols]

I will now read in the nationality list that will help me convert these nationalities to their associated alpha-2 country codes. It's important at this point to turn off the default behavior of *pandas*, which is to treat the string literal 'NA' as a missing value. In the dataset, 'NA' is the ISO 3166-1 alpha-2 country code of Namibia. With the default turned off, empty fields are read in as empty strings, so I then have to convert those back to missing values.

In [None]:
try:
    # NB: I have manually fixed the csv to have 'NA' as the country code
    # for Namibia. The author of the file clearly did not realize that by
    # default 'NA' in a field is treated as NaN by pandas.
    nationalities = pd.read_csv('../data/external/Countries-List.csv',
                                keep_default_na=False)
except FileNotFoundError:
    # Apply the same NA handling to the download so that 'NA' survives.
    nationalities = pd.read_csv(
        'https://t2a.io/blog/wp-content/uploads/2014/03/Countries-List.csv',
        encoding='ISO-8859-1', keep_default_na=False)
    nationalities.to_csv('../data/external/Countries-List.csv', index=False)
# Empty fields were read as empty strings; turn them into missing values.
nationalities = nationalities.replace('', np.nan)

assert(nationalities[
    nationalities.Name == 'Namibia']['ISO 3166 Code'].values == 'NA')
nationalities
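
The distinction this cell is after (an empty field means missing, the literal 'NA' means Namibia) can be shown without pandas using the standard library `csv` module, which never reinterprets 'NA'. This is only a sketch on an inline sample: 'ISO 3166 Code' and 'Name' are column names used in this notebook, while 'Nationality' and the sample rows are assumptions for illustration.

```python
import csv
import io

# Inline sample; only 'ISO 3166 Code' and 'Name' are column names taken
# from the notebook, 'Nationality' is assumed for illustration.
SAMPLE = """ISO 3166 Code,Name,Nationality
NA,Namibia,Namibian
AQ,Antarctica,
"""


def read_rows(text):
    """Read CSV rows, turning empty fields into None but keeping 'NA'."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: (value if value != '' else None)
             for key, value in row.items()} for row in reader]


rows = read_rows(SAMPLE)
namibia = next(row for row in rows if row['Name'] == 'Namibia')
print(namibia['ISO 3166 Code'])  # the literal string 'NA', not missing
print(rows[1]['Nationality'])    # None, because the field was empty
```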

I'll manually add some commonly used names and demonyms to the dataframe. Despite these being neither countries nor nationalities, they either are or were in common use.

In [None]:
other_nationalities = pd.DataFrame(
    [
        ['GB', 'England', 'English', np.nan, np.nan],
        ['CI', 'Ivory Coast', 'Ivorian', np.nan, np.nan],
        ['GB', 'Northern Ireland', 'Northern Irish', np.nan, np.nan],
        ['IR', 'Persia', 'Persian', np.nan, np.nan],
        ['DE', 'Prussia', 'Prussian', np.nan, np.nan],
        ['IE', 'Republic of Ireland', 'Irish', np.nan, np.nan],
        ['GB', 'Scotland', 'Scottish', 'Scot', np.nan],
        ['RU', 'Soviet Union', 'Soviet', np.nan, np.nan],
        ['US', 'United States', 'American', np.nan, np.nan],
        ['GB', 'Wales', 'Welsh', np.nan, np.nan]
    ],
    columns=nationalities.columns
)
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead.
nationalities = pd.concat(
    [nationalities, other_nationalities],
    ignore_index=True).sort_values(by='ISO 3166 Code')
assert(len(nationalities) - len(other_nationalities) == 249)
nationalities

I'm now going to convert the remaining places which are nationalities to their associated alpha-2 country codes.

In [None]:
places_nationalities = places[
    places.countryAlpha2Code.isna()][['fullName', 'countryAlpha2Code']]
places.loc[places_nationalities.index, 'countryAlpha2Code'] = (
    places_nationalities.fullName.apply(nationality_to_alpha2_code,
                                        args=(nationalities,)))
places[places.lat.isna() & places.country.isna() &
       places.countryAlpha2Code.notna()][place_cols]

Please note that although this process is very accurate, it is not perfect and can result in a few false positives. For instance, *Scottish Church Collegiate School* is actually in India and not Scotland, and *Petit Luxembourg* is a hotel in Paris and not in Luxembourg. But since the number of true positives far outweighs the number of false positives, I'll go with it. Now I'm left with just the following places without a country code. They are a mix of companies, educational institutions, cities and some plain random stuff.

In [None]:
places[places.countryAlpha2Code.isna()][place_cols]
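
To make the source of these false positives concrete, here is a minimal sketch of matching country names and demonyms as whole words in free text. It is not the `nationality_to_alpha2_code` function from `src.data.country_utils`, whose implementation is not shown in this notebook. `match_codes` and the three-row table are hypothetical.

```python
import re

# (alpha-2 code, name, demonym) rows; a tiny stand-in for the full list.
ROWS = [
    ('GB', 'Scotland', 'Scottish'),
    ('LU', 'Luxembourg', 'Luxembourger'),
    ('IN', 'India', 'Indian'),
]


def match_codes(text, rows=ROWS):
    """Return sorted codes whose name or demonym appears as a whole word."""
    codes = set()
    for code, name, demonym in rows:
        pattern = r'\b(?:%s|%s)\b' % (re.escape(name), re.escape(demonym))
        if re.search(pattern, text):
            codes.add(code)
    return '|'.join(sorted(codes)) if codes else None


print(match_codes('Scottish Church Collegiate School'))  # hit on 'Scottish'
print(match_codes('Petit Luxembourg'))                   # hit on 'Luxembourg'
print(match_codes('Some Company Ltd'))                   # no match
```

Any matcher of this kind looks only at the words in the text, so a school named after a Scottish church is indistinguishable from a place in Scotland.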

Interestingly, examining the *categories* column of the false positives above gives me the idea of applying the `nationality_to_alpha2_code` function to it as well, since the correct information is available there.

In [None]:
print(places[
    places.fullName == 'Scottish Church Collegiate School']['categories'].values)
print()
print(places[places.fullName == 'Petit Luxembourg']['categories'].values)

However, rather than blindly applying the function to all the "nationalities" in the places dataframe which would give many false positives, such as the following:

In [None]:
print(places[places.fullName == 'Albanians']['categories'].values)
print()
print(places[places.fullName == 'Carpathian Germans']['categories'].values)

I'll be conservative and now apply it to only the categories column of the remaining places without a country code. That is to the mix of companies, educational institutions, cities and some plain random stuff shown in the dataframe above.

In [None]:
places_others = places[
    places.countryAlpha2Code.isna()][['fullName', 'categories',
                                      'countryAlpha2Code']]
places_others
places.loc[places_others.index, 'countryAlpha2Code'] = (
    places_others.categories.apply(nationality_to_alpha2_code,
                                   args=(nationalities,)))
places.loc[places_others.index][place_cols]

Again, the success rate is so high that it is definitely sufficient to proceed with this. However, as usual there are a few false positives. A clear example of this is *Cape Canaveral*, which is of course located in the United States and not India. This is due to the fact that it is situated near the *Indian* River Lagoon.

In [None]:
display(places[places.fullName == 'Cape Canaveral'][place_cols])
print(places[places.fullName == 'Cape Canaveral'].categories.values)

Let's check to see how many places remain without a country code.

In [None]:
places[places.countryAlpha2Code.isna()][place_cols]

Very few indeed. In fact many of these are not even "places". I have managed to map nearly all of the places to country codes so it's time to move on.

In [None]:
# Scale to a percentage before rounding to avoid floating point noise.
print('Percentage of places mapped to country codes:',
      round(100 * places.countryAlpha2Code.notna().sum() /
            len(places), 2), '%')

## Mapping Alpha-2 Country Codes to Other Codes and Names

Finally, I can now use `pycountry-convert` to map from all the alpha-2 country codes to alpha-3 country codes, continent codes, country names and continent names.

In [None]:
def alpha2_to_codes_names(places):
    """Create other codes and names from ISO 3166-1 alpha-2 country codes.
    
    Use ISO 3166-1 alpha-2 country codes to find country name, ISO 3166-1
    alpha-3 country codes, continent code and continent name. 

    Args:
        places (pandas.DataFrame): Dataframe of places data.

    Returns:
        pandas.DataFrame: Dataframe containing the extra fields mentioned above.

        Identical to `places` except that it contains extra columns mentioned
        above.
    """

    codes_names_places = places.copy()
    
    codes_names_places['countryName'] = (
        codes_names_places.countryAlpha2Code.apply(
            _text_to_loc_or_codes, args=(country_alpha2_to_country_name,)))    
    codes_names_places['countryAlpha3Code'] = (
        codes_names_places.countryName.apply(
            _text_to_loc_or_codes, args=(country_name_to_country_alpha3,)))
    codes_names_places['continentCode'] = (
        codes_names_places.countryAlpha2Code.apply(
            _text_to_loc_or_codes, args=(country_alpha2_to_continent_code,))) 
    codes_names_places['continentName'] = (
        codes_names_places.continentCode.apply(
            _text_to_loc_or_codes, args=(convert_continent_code_to_continent_name,)))
    
    return codes_names_places


def _text_to_loc_or_codes(text, convert):
    """Apply `convert` to each pipe separated item in `text`.

    Missing values (float NaN) are returned unchanged.
    """
    if isinstance(text, float):
        return text

    # French Southern Territories (TF) and Vatican City (VA) are skipped
    # because the continent conversion does not recognize them. Note that
    # the skip applies to every conversion that goes through this helper.
    exclude_cc = ('TF', 'VA')
    items = {convert(part) for part in text.split('|')
             if part not in exclude_cc}

    if items:
        return '|'.join(sorted(items, key=locale.strxfrm))
    return np.nan

In [None]:
places = alpha2_to_codes_names(places)
assert((places.countryAlpha2Code.isna() & 
        places.country.notna()).sum() == 0)
place_cols = place_cols + ['countryAlpha3Code', 'countryName',
                           'continentCode', 'continentName']
places[place_cols]
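
The per-cell conversion is the same for each new column: pass missing values through, split on the pipe, convert each piece, then deduplicate and re-join. Below is a self-contained sketch of that pattern for the continent code. The dictionary is a small stand-in for the `pycountry-convert` lookups, and `map_codes` is a hypothetical name.

```python
import math

# Small stand-in table; the notebook uses pycountry-convert for lookups.
ALPHA2_TO_CONTINENT = {'FR': 'EU', 'DE': 'EU', 'JP': 'AS', 'US': 'NA'}
EXCLUDED = {'TF', 'VA'}  # codes the notebook skips


def map_codes(value, table):
    """Map a pipe-separated code string through `table`, passing nan through."""
    if isinstance(value, float):  # missing values arrive as float nan
        return value
    mapped = {table[code] for code in value.split('|')
              if code not in EXCLUDED}
    return '|'.join(sorted(mapped)) if mapped else math.nan


print(map_codes('FR|DE', ALPHA2_TO_CONTINENT))
print(map_codes('JP|US', ALPHA2_TO_CONTINENT))
print(map_codes('VA', ALPHA2_TO_CONTINENT))
print(map_codes(math.nan, ALPHA2_TO_CONTINENT))
```

Collecting results in a set means that two countries on the same continent yield a single continent code. The continent code for North America is 'NA', the same string as Namibia's country code, which is one more reason to keep pandas from treating 'NA' as missing when these files are read back.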

## Persisting the Data

Now that I have the places and nationalities dataframes, I'll persist them for future use in feature construction.

In [None]:
places = places.reindex(sorted(places.columns), axis='columns')
places.head(20)

In [None]:
places.to_csv('../data/processed/places.csv', index=False)
nationalities.to_csv('../data/processed/Countries-List.csv', index=False)