# LT healthcare services

Prepared by [**K.Clemons**](mailto:kimberly.clemons@ext.ec.europa.eu) and [**J.Grazzini**](mailto:jacopo.grazzini@ec.europa.eu) ([_Eurostat_](https://ec.europa.eu/eurostat)).

## Introduction

This notebook illustrates how to produce *ad-hoc* harmonised data from the dataset collected from the LT national authority. It shows how the data can be harmonised automatically, using either a single-use script-based approach or a reusable metadata-based approach.

**Table of contents**

* [Setting and checking the environment](#environment).
* [Ad-hoc data ingestion, exploration and processing](#ad-hoc).
* [Metadata-based semi-automated processing](#metadata-processing).
* [Full automation](#automation).
* [Customised processing](#customisation).

## Setting the environment <a id="environment"></a>

Some basic imports to start with:

In [1]:
PROJECT = 'basic-services'

import os, sys
import functools
import inspect
import hashlib
import json
from pprint import pprint

import numpy as np
import pandas as pd

The global variables below will be used throughout the notebook:

In [2]:
FACILITY      = 'HCS'
COUNTRY, LANG = 'Lithuania', 'lt'
CC            = 'LT'
IFILE         = 'Hospitals_2018.xlsx'

We will also need some local setup:

In [3]:
THISDIR = os.getcwd()
SRCPATH = os.path.abspath('../../../src/python/pyeufacility/')
DATAPATH = os.path.abspath('../../../data/healthcare/raw/') 
# or       '/Users/<username>/basic-services/data/healthcare/%s' % CC

try:
    print("- src rel. path: \033[1m%s%s\033[0m" % (PROJECT, SRCPATH.split(PROJECT)[1])) 
    print("- data rel. path: \033[1m%s%s\033[0m" % (PROJECT, DATAPATH.split(PROJECT)[1])) 
except:
    print("current path: \033[1m%s\033[0m" % os.getcwd())

- src rel. path: basic-services/src/python/pyeufacility
- data rel. path: basic-services/data/healthcare/raw


For the geocoding operations considered in this notebook, we rely on online geolocation services (*e.g.*, Google, Bing, *etc.*). A key may be required to use these services. You can, for instance, define personal keys (in the form `"service-name" : "key"`) in a `JSON` file `'gc_keys.json'` and load this file on the fly; otherwise, you can hard-code the key here:

In [4]:
KEYS_FILE = 'gc_keys.json'

try:
    with open(os.path.join(SRCPATH, KEYS_FILE), 'r') as f:
        GC = json.load(f)
    BING_KEY = GC['BING']
except (OSError, KeyError, ValueError): # missing file, missing entry or invalid JSON
    BING_KEY = '...' # insert your own
else:
    print("Private (hidden) key to access BING services: \033[1m%s\033[0m" % 
          hashlib.md5(BING_KEY.encode()).hexdigest())

Private (hidden) key to access BING services: 64da12dda02c79efca91e254bfad0961
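
As a minimal sketch of the key file layout described above, the snippet below writes a dummy `gc_keys.json` to a temporary directory and reads it back, falling back to a default when the file or the service entry is missing. The helper name `load_key` and the dummy key value are assumptions made for illustration only.

```python
import json
import os
import tempfile

def load_key(path, service, default=None):
    """Return the key stored for `service` in a JSON key file, or `default`."""
    try:
        with open(path, 'r') as f:
            keys = json.load(f)
    except (OSError, ValueError):  # missing file or invalid JSON
        return default
    return keys.get(service, default)

# write a dummy key file of the form {"service-name": "key"}
tmpdir = tempfile.mkdtemp()
keys_file = os.path.join(tmpdir, 'gc_keys.json')
with open(keys_file, 'w') as f:
    json.dump({'BING': 'dummy-key'}, f)

bing_key = load_key(keys_file, 'BING')                           # entry found
google_key = load_key(keys_file, 'GOOGLE', default='...')        # entry missing
missing = load_key(os.path.join(tmpdir, 'absent.json'), 'BING')  # file missing
```

Catching only file and JSON errors keeps unrelated failures visible instead of hiding them behind a bare `except`.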


## Ad-hoc data ingestion, exploration and processing <a id="ad-hoc"></a>

In [5]:
try:
    import xlrd
except ImportError:
    !{sys.executable} -m pip install xlrd

def adhoc_load(file): # dumb function we aim at reusing later on
    return pd.read_excel(file, header = 2) 

file = os.path.join(DATAPATH, IFILE)
print("\033[4mInput dataset:\033[0m \033[1m%s\033[0m" % file.split(PROJECT)[1])
df_src = adhoc_load(file)

print("\033[4mInput dataset columns:\033[0m \033[1m%s\033[0m" % list(df_src.columns))
# pprint(df_src.dtypes)
df_src.head(5)

Input dataset: /data/healthcare/raw/Hospitals_2018.xlsx
Input dataset columns: ['Code of municipality', 'Municipality', 'ID', 'parent_ID', 'Code of legal entity', 'Subordination: 1-national (MoH), 3-municipality, 8-private, 9-other ministries (not MoH)', 'type_code', 'type_name', 'Level: 1-national, 2-regional, 3-municipality, 4-nursing, 5-other public and specialized, 6-private', 'Name', 'Address', 'Number of beds at the end of the 2018']


Unnamed: 0,Code of municipality,Municipality,ID,parent_ID,Code of legal entity,"Subordination: 1-national (MoH), 3-municipality, 8-private, 9-other ministries (not MoH)",type_code,type_name,"Level: 1-national, 2-regional, 3-municipality, 4-nursing, 5-other public and specialized, 6-private",Name,Address,Number of beds at the end of the 2018
0,101,Vilnius,1,0,124364561,1,1,general,1,Vilniaus Universiteto ligoninė Santaros klinikos,"Santariškių 2, Vilnius LT-08661",1356
1,101,Vilnius,5,1,302620298,1,1,general,1,Vilniaus Universiteto ligoninės Santaros klini...,"Santariškių 7, Vilnius LT-08406",515
2,101,Vilnius,32,0,124243848,1,1,general,1,Respublikinė Vilniaus universitetinė ligoninė,"Šiltnamių 29, Vilnius",673
3,101,Vilnius,3,0,191744287,1,1,general,1,Vilniaus Universiteto ligoninės Žalgirio klinika,"Žalgirio 117, Vilnius",58
4,101,Vilnius,10,0,124247526,1,19,psychiatry,5,Respublikinė Vilniaus psichiatrijos ligoninė,"Parko g. 21, Vilnius",542


The `adhoc_preparation` method below enables us to clean the dataset, add/rename columns, derive information, *etc...*:

In [6]:
def adhoc_preparation(df):
    df['country'] = COUNTRY
    df['public_private'] = (df['Level: 1-national, 2-regional, 3-municipality, 4-nursing, '
                                '5-other public and specialized, 6-private']
                             .apply(lambda x: 'private' if x==6 else 'public')
                            )

    df.rename(columns = {'ID':                                    'id',
                         'Name':                                  'name',
                         #'Address':                               'address',
                         'Number of beds at the end of the 2018': 'cap_beds',
                         'type_name':                             'facility_type'}, 
               inplace = True, errors = 'ignore') 

    df['place'] = df[['Address', 'country']].apply(', '.join, 1)

    df.drop(columns = ['Subordination: 1-national (MoH), 3-municipality, 8-private, 9-other ministries (not MoH)',
                       'Level: 1-national, 2-regional, 3-municipality, 4-nursing, '
                       '5-other public and specialized, 6-private',
                       'Code of legal entity', 'Code of municipality'],
             inplace = True, errors = 'ignore')

df1 = df_src.copy()
adhoc_preparation(df1)
df1.head(5)

Unnamed: 0,Municipality,id,parent_ID,type_code,facility_type,name,Address,cap_beds,country,public_private,place
0,Vilnius,1,0,1,general,Vilniaus Universiteto ligoninė Santaros klinikos,"Santariškių 2, Vilnius LT-08661",1356,Lithuania,public,"Santariškių 2, Vilnius LT-08661, Lithuania"
1,Vilnius,5,1,1,general,Vilniaus Universiteto ligoninės Santaros klini...,"Santariškių 7, Vilnius LT-08406",515,Lithuania,public,"Santariškių 7, Vilnius LT-08406, Lithuania"
2,Vilnius,32,0,1,general,Respublikinė Vilniaus universitetinė ligoninė,"Šiltnamių 29, Vilnius",673,Lithuania,public,"Šiltnamių 29, Vilnius, Lithuania"
3,Vilnius,3,0,1,general,Vilniaus Universiteto ligoninės Žalgirio klinika,"Žalgirio 117, Vilnius",58,Lithuania,public,"Žalgirio 117, Vilnius, Lithuania"
4,Vilnius,10,0,19,psychiatry,Respublikinė Vilniaus psichiatrijos ligoninė,"Parko g. 21, Vilnius",542,Lithuania,public,"Parko g. 21, Vilnius, Lithuania"
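
The coded columns of this dataset carry their code list in the header itself (*e.g.*, `Level: 1-national, ..., 6-private`). As a minimal sketch, the snippet below decodes such a header into a lookup table and restates the public/private rule used in `adhoc_preparation`. The helper names `decode_header` and `public_private` are hypothetical and not part of any package used here.

```python
import re

HEADER = ('Level: 1-national, 2-regional, 3-municipality, 4-nursing, '
          '5-other public and specialized, 6-private')

def decode_header(header):
    """Split 'Name: 1-label, 2-label, ...' into the name and a {code: label} dict."""
    name, _, codes = header.partition(':')
    labels = {int(code): label.strip()
              for code, label in re.findall(r'(\d+)-([^,]+)', codes)}
    return name.strip(), labels

def public_private(level):
    """Same rule as in adhoc_preparation: only level 6 is private."""
    return 'private' if level == 6 else 'public'

name, labels = decode_header(HEADER)
statuses = [public_private(level) for level in (1, 5, 6)]
```

Decoding the header once also provides a short column name (`'Level'`) that is easier to handle than the full header string.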


We can build a geocoder/geolocator using the [`geopy`](https://geopy.readthedocs.io/en/stable/#) package, supplying an API key to the geocoding service when needed (below, `Bing` is used when a key is available, and `OSM` Nominatim otherwise):

In [7]:
try:
    import geopy
except ImportError:
    !{sys.executable} -m pip install geopy
finally:
    from geopy import geocoders
    from geopy.extra.rate_limiter import RateLimiter

agent = PROJECT
key = BING_KEY # None

try:
    assert key not in ('',None)
except:
    gc = geocoders.Nominatim(user_agent = agent).geocode
else:
    gc = geocoders.Bing(user_agent = agent, timeout = 100, api_key = key).geocode

Some geocoders fail on the full `'place'` string; the cell below works around this limitation by retrying with the bare `'Address'` (this may not be needed for all geocoders):

In [8]:
print("location: \033[1m%s\033[0m" % df1.iloc[0]['place'])
try:
    location = gc(df1.iloc[0]['place'])
    assert location is not None
except:
    try:
        location = gc(df1.iloc[0]['Address'])
        assert location is not None 
    except:
        pass
    else:
        print("lat/lon coordinates: \033[1m(%s,%s)\033[0m" 
              % (location.latitude, location.longitude))
else:
    print("lat/lon coordinates: \033[1m(%s,%s)\033[0m" 
          % (location.latitude, location.longitude))

location: Santariškių 2, Vilnius LT-08661, Lithuania
lat/lon coordinates: (54.75171,25.27917)
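
The nested `try`/`except` logic above can be flattened into a loop over candidate queries. The following is only a sketch: `geocode_with_fallback` is a hypothetical helper, and `stub_geocoder` stands in for a real geocoder so that the example runs offline.

```python
from collections import namedtuple

Location = namedtuple('Location', ['latitude', 'longitude'])

def geocode_with_fallback(geocoder, *queries):
    """Try each query in turn and return the first non-empty answer, else None."""
    for query in queries:
        try:
            location = geocoder(query)
        except Exception:  # e.g. a timeout or quota error raised by the service
            location = None
        if location is not None:
            return location
    return None

def stub_geocoder(query):
    """Offline stand-in for a geocoder: it only knows one short address."""
    known = {'Santariškių 2, Vilnius LT-08661': Location(54.75171, 25.27917)}
    return known.get(query)

place = 'Santariškių 2, Vilnius LT-08661, Lithuania'
address = 'Santariškių 2, Vilnius LT-08661'
location = geocode_with_fallback(stub_geocoder, place, address)
nowhere = geocode_with_fallback(stub_geocoder, 'nowhere')
```

With a real geocoder, `gc` from the cell above would be passed in place of `stub_geocoder`.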


The `adhoc_location` function then applies the geocoder to the newly built `'place'` variable and extracts the latitude/longitude coordinates from the geocoder answers. It also handles some negative responses by retrying with a longer address that includes the facility name:

In [9]:
try:
    from tqdm import tqdm
except:
    pass
else:
    tqdm.pandas()

def adhoc_location(df, geolocator = gc):
    try:
        df['__coord__'] = df['place'].progress_apply(geolocator)
    except:
        df['__coord__'] = df['place'].apply(geolocator)
    if df['__coord__'].isnull().any():
        index = df[df['__coord__'].isnull()].index
        long_address = (df[['name', 'Address', 'country']]
                         .apply(', '.join, 1)
                        ) # of use when geocoding fails
        df.loc[index, '__coord__'] = long_address.loc[index].apply(geolocator)
    df['lat'], df['lon'] = zip(*df['__coord__']
                               .apply(lambda x: (x.latitude, x.longitude) if x is not None else (np.nan, np.nan))
                              )
    df.drop(columns = '__coord__', inplace = True, errors = 'ignore') # or keep it for display

adhoc_location(df1)   
df1[['lat', 'lon']].head(5)

100%|██████████| 142/142 [00:54<00:00,  2.61it/s]


Unnamed: 0,lat,lon
0,54.75171,25.27917
1,54.75481,25.28272
2,54.66862,25.20767
3,54.70407,25.27884
4,54.68442,25.4182
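
The coordinate extraction step can be sketched independently of pandas. In the snippet below, `to_latlon` is a hypothetical helper and `Location` mimics the `latitude`/`longitude` attributes of a geocoder answer; failed lookups (`None`) are turned into NaN pairs, and the empty case is handled explicitly since unpacking `zip(*[])` into two names would fail.

```python
import math
from collections import namedtuple

Location = namedtuple('Location', ['latitude', 'longitude'])

def to_latlon(answers):
    """Split geocoder answers into two lists, using NaN where the answer is None."""
    pairs = [(a.latitude, a.longitude) if a is not None else (math.nan, math.nan)
             for a in answers]
    if not pairs:  # nothing to unzip
        return [], []
    lat, lon = zip(*pairs)
    return list(lat), list(lon)

answers = [Location(54.75171, 25.27917), None, Location(54.66862, 25.20767)]
lat, lon = to_latlon(answers)
```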


Finally save the output data:

In [10]:
OFILE = '%sgeo-test.csv' % CC
df1.to_csv(OFILE)

## Metadata-based semi-automated processing <a id="metadata-processing"></a>

We aim to automate the above operations into a transparent, reusable process, while ensuring that this process can evolve (*e.g.*, when data change). For that purpose, we adopt the metadata-based approach implemented throughout the [`pyeudatnat`](https://github.com/eurostat/pyEUDatNat) and [`pyeufacility`](https://github.com/eurostat/pyeufacility) packages.

For a proper import of the packages and setup of the environment, see [the dedicated cells](https://github.com/eurostat/basic-services/blob/master/src/python/notebooks/01_HCS_generic_example_CZ.ipynb#environment) of the [`01_HCS_generic_example_CZ.ipynb` notebook](https://github.com/eurostat/basic-services/blob/master/src/python/notebooks/01_HCS_generic_example_CZ.ipynb).

In [11]:
try:
    import pyeudatnat
except ImportError:
    !{sys.executable} -m pip install git+https://github.com/eurostat/pyEUDatNat.git
finally:
    from pyeudatnat import misc, io, text, base

try:
    import pyeufacility
except ImportError:
    # !{sys.executable} -m pip install git+https://github.com/eurostat/basic-services.git
    # if you launch it from the notebooks/ directory, try for instance:
    pardir = os.path.abspath(os.path.join(THISDIR, '../'))
    sys.path.insert(0,pardir)
finally:
    from pyeufacility import config, hcs
    from pyeufacility.hcs import LThcs

Through the LT-specific module `hcs.LThcs` (file `LThcs.py`), we define a series of operations (like the ad-hoc ones above) necessary to 'prepare' the dataset. Those operations are implemented as methods of the `hcs.LThcs.Prepare_data` class (although the accompanying `hcs.LThcs.prepare_data` will be considered, see below):

In [12]:
Prep_data = LThcs.Prepare_data
preparator = Prep_data()

pred = lambda o: (inspect.ismethod(o) or inspect.isfunction(o)) and not o.__name__.startswith('__')
print("\033[4mMethods in '%s' class: \033[1m%s\033[0m " 
      % (Prep_data.__name__, [l[0] for l in inspect.getmembers(Prep_data, predicate = pred)]))

Methods in 'Prepare_data' class: ['set_address', 'set_pp', 'split_Adr']


* the `split_Adr` function splits any address into its `['street', 'number', 'postcode', 'city']` components:

In [13]:
split_Adr = preparator.split_Adr
assert callable(split_Adr) is True

print("\033[4mMethod: \033[1m'%s'\033[0m" % split_Adr.__name__)
print(inspect.getsource(split_Adr))

df2 = df_src.copy()

print("\033[4mExample of use of '%s':\033[0m" % split_Adr.__name__)
print(" * input complete address: \033[1m'%s'\033[0m" % df2['Address'].iloc[0])
print(" * output decomposed address: \033[1m'%s'\033[0m" 
      % list(split_Adr(df2['Address'].iloc[0])))

Method: 'split_Adr'
    @classmethod
    def split_Adr(cls, s):
        street, number, postcode, city = "", "", "", ""
        mem = re.compile(r'\s*,\s*').split(s)
        left, right = mem[0], " ".join(mem[1:])
        while left == '' and len(right)>1:
            left = right[-1].strip()
            right = right[:-1]
        if len(right) == 1 and left == '':
            return "", right[0], "", ""
        elif len(left) == 1 and right == '':
            return "", "", left[0], ""
        rights = re.compile(r'\s+').split(right)
        for r in rights:
            r = r.strip()
            if r == '': continue
            if r.isnumeric() or r[-1].isdigit():
                postcode = " ".join([postcode,r])
            else:
                city = " ".join([city,r])
        lefts = re.compile(r'\s+').split(left)
        for l in lefts:
            l = l.strip()
            if l == '': continue
            if l.isnumeric() or l[0].isdigit():
                number = "

* the `set_address` method applies the `split_Adr` function to the `'Address'` column, across all rows of the input table:

In [14]:
set_address = preparator.set_address
assert callable(set_address) is True
print("\033[4mMethod: \033[1m'%s'\033[0m" % set_address.__name__)
print(inspect.getsource(set_address))

set_address(df2)

print("\033[4mExample of use of '%s':\033[0m" % set_address.__name__)
df2[['postcode', 'city', 'street', 'number']].head()

Method: 'set_address'
    def set_address(self, data):
        cols = data.columns.tolist()
        new_cols = ['street', 'number', 'postcode', 'city']
        data.reindex(columns = [*cols, *new_cols], fill_value = "")
        data[new_cols] =  (
            data
            .apply(lambda row: pd.Series(self.split_Adr(row['Address'])), axis=1)
            )
        return new_cols

Example of use of 'set_address':


Unnamed: 0,postcode,city,street,number
0,LT-08661,Vilnius,Santariškių,2
1,LT-08406,Vilnius,Santariškių,7
2,,Vilnius,Šiltnamių,29
3,,Vilnius,Žalgirio,117
4,,Vilnius,Parko g.,21
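
To make the decomposition concrete, here is a simplified, regex-based sketch of address splitting. It is not the `split_Adr` implementation of the package: the pattern `ADDRESS_RE` and the helper `split_address` are hypothetical and only assume addresses of the form `<street> <number>, <city> [LT-<5 digits>]`, as in the sample rows above.

```python
import re

# Assumed layout: "<street> <number>, <city> [<postcode>]"
ADDRESS_RE = re.compile(
    r'^\s*(?P<street>.*?)\s+(?P<number>\d+\w*)\s*,\s*'
    r'(?P<city>[^\d,]+?)(?:\s+(?P<postcode>LT-\d{5}))?\s*$'
)
FIELDS = ('street', 'number', 'postcode', 'city')

def split_address(address):
    """Return a dict with the four address components ('' when not found)."""
    match = ADDRESS_RE.match(address)
    if match is None:  # layout not recognised: leave all components empty
        return dict.fromkeys(FIELDS, '')
    parts = match.groupdict(default='')
    return {field: parts[field] for field in FIELDS}

examples = ['Santariškių 2, Vilnius LT-08661', 'Parko g. 21, Vilnius']
results = [split_address(a) for a in examples]
```

With pandas, such a function could be applied to a whole column through `df['Address'].apply(lambda a: pd.Series(split_address(a)))`.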


* the `set_pp` method classifies the `'Level'` column into `['public', 'private']` types, across all rows of the input table:

In [15]:
set_pp = preparator.set_pp
assert callable(set_pp) is True
print("\033[4mMethod: \033[1m'%s'\033[0m" % set_pp.__name__)
print(inspect.getsource(set_pp))

set_pp(df2)

print("\033[4mExample of use of '%s':\033[0m" % set_pp.__name__)
df2[['public_private']].head()

Method: 'set_pp'
    def set_pp(self, data):
        col_pp = 'Level: 1-national, 2-regional, 3-municipality, 4-nursing, 5-other public and specialized, 6-private'
        new_col = 'public_private'
        data[new_col] = (
            data[col_pp]
            .apply(lambda x: 'private' if x==6 else 'public')
            )
        return new_col

Example of use of 'set_pp':


Unnamed: 0,public_private
0,public
1,public
2,public
3,public
4,public


Besides the basic 'prepare' operations implemented in the `'LThcs'` module, some additional information required for data integration is passed through the `'LThcs.json'` metadata file:

In [16]:
METAPATH = os.path.dirname(inspect.getfile(hcs))
print("- Source directory with HCS country-specific metadata: \033[1m'%s'\033[0m" 
      % METAPATH.split(PROJECT)[1])

try:
    meta = os.path.join(METAPATH,'%shcs.json' % CC)
    assert os.path.exists(meta)
except (AssertionError,FileNotFoundError):
    print("! No HCS metadata JSON-file found for '%s'!" % CC)
else:
    print("- Source HCS metadata JSON-file for '%s': \033[1m'%s'\033[0m" 
          % (CC, meta.split(PROJECT)[1]))

- Source directory with HCS country-specific metadata: '/src/python/pyeufacility/hcs'
- Source HCS metadata JSON-file for 'LT': '/src/python/pyeufacility/hcs/LThcs.json'


In [17]:
with open(meta, 'r') as fp:
    metadata = io.Json.load(fp)

print("\033[4mCountry-specific metadata for %s:\033[0m" % CC)
pprint(metadata) # print(io.Json.dumps(metadata, indent=2))

Country-specific metadata for LT:
{'columns': [],
 'country': {'code': 'LT', 'name': 'Lithuania'},
 'date': {'ref': None},
 'file': 'Hospitals_2018.xlsx',
 'index': {'ER': None,
           'PP': None,
           'beds': 'Number of beds at the end of the 2018',
           'cc': None,
           'city': 'Municipality',
           'country': None,
           'email': None,
           'geo_qual': None,
           'id': 'ID',
           'lat': None,
           'lon': None,
           'name': 'Name',
           'number': None,
           'postcode': None,
           'prac': None,
           'pubdate': None,
           'refdate': None,
           'rooms': None,
           'site': None,
           'specs': None,
           'street': None,
           'tel': None,
           'type': 'type_name',
           'url': None},
 'lang': {'code': 'lt', 'name': 'Lithuanian'},
 'options': {'clean': None,
             'load': {'dfmt': '%d/%m/%Y', 'header': 2},
             'locate': {'place': ['numb

The exact one-to-one column matching from the input raw table to the output harmonised dataset is provided through the `'index'` attribute:

In [18]:
print("\033[4mInput/output column matching:\033[0m")
pprint({k:v for (k,v) in metadata['index'].items() if v is not None})

Input/output column matching:
{'beds': 'Number of beds at the end of the 2018',
 'city': 'Municipality',
 'id': 'ID',
 'name': 'Name',
 'type': 'type_name'}
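
As a rough sketch of how such an index can drive the renaming of columns, the snippet below builds an input-to-output mapping from the non-empty index entries. The `out_names` dictionary is an assumption: its values are read off the harmonised table displayed later in this notebook, and the pairing of keys with names is inferred rather than taken from the package.

```python
# input columns declared in the metadata 'index' (None: not available in the source)
index = {'id': 'ID', 'name': 'Name', 'lat': None, 'lon': None,
         'city': 'Municipality', 'type': 'type_name',
         'beds': 'Number of beds at the end of the 2018'}

# assumed output names of the harmonised template (subset only)
out_names = {'id': 'id', 'name': 'hospital_name', 'lat': 'lat', 'lon': 'lon',
             'city': 'city', 'type': 'facility_type', 'beds': 'cap_beds'}

def rename_mapping(index, out_names):
    """Map each available input column to its harmonised output name."""
    return {src: out_names[key] for key, src in index.items() if src is not None}

mapping = rename_mapping(index, out_names)
```

The resulting dictionary has the shape expected by `DataFrame.rename(columns = mapping)`.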


With the two resources above, we can set up the LT country-specific ingestion:
* a `LTFacility` class is derived from the `pyeudatnat.base.BaseDatNat` class using the `pyeufacility.config.facilityFactory` constructor method,
* the `metadata` information (as made available through the `'LThcs.json'` file) is passed to the `LTFacility` constructor,
* the 'prepare' operations (as implemented in the `hcs.LThcs.Prepare_data` class) are 'attached' to the `LTFacility` class by overriding the original abstract `BaseDatNat.prepare_data` with the `'hcs.LThcs.prepare_data'` method.

In [19]:
# build the LT HCS specific class while parsing the metadata
LTFacility = config.facilityFactory(cat = FACILITY, meta = metadata)
# override the prepare_data abstract method
LTFacility.prepare_data = LThcs.prepare_data

pprint(LTFacility.__dict__)

mappingproxy({'CATEGORY': {'code': 'hcs', 'name': 'Healthcare services'},
              'CC': 'LT',
              'COUNTRY': 'Lithuania',
              'PUBDATE': None,
              '__doc__': None,
              '__init__': <function datnatFactory.<locals>.__init__ at 0x123fef170>,
              '__module__': 'pyeudatnat.base',
              'prepare_data': <function prepare_data at 0x123fa3290>})
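
The factory-plus-override pattern used above can be illustrated with plain Python. This is a sketch of the general technique only: `facility_factory`, `lt_prepare_data` and `LTDemo` are hypothetical names and do not reproduce the internals of `facilityFactory`.

```python
def facility_factory(cc, country):
    """Build a country-specific class at runtime (hypothetical stand-in)."""
    def prepare_data(self):
        # plays the role of the abstract method: it must be overridden
        raise NotImplementedError("country-specific preparation required")
    return type('%sFacility' % cc, (object,),
                {'CC': cc, 'COUNTRY': country, 'prepare_data': prepare_data})

def lt_prepare_data(self):
    # country-specific step: here it simply records that it ran
    self.prepared = True

LTDemo = facility_factory('LT', 'Lithuania')
LTDemo.prepare_data = lt_prepare_data  # override at class level

demo = LTDemo()
demo.prepare_data()
```

Because the override is attached to the class and not to a single object, every instance created afterwards uses the country-specific method.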


Next, we create an instance `lt` of `LTFacility` to handle these data:

In [20]:
lt = LTFacility()

print("\033[4mInput data source:\033[0m %s" % lt.file.split(PROJECT)[1])

assert ({k:v for (k,v) in lt.config.to_dict().items() if v is not None}
        ==  # comparing non empty settings
        {k:v for (k,v) in config.FACMETADATA[FACILITY].items() if v is not None})
# assert lt.config == config.FACMETADATA[FACILITY]
print("\n\033[4mOutput %s configuration metadata:\033[0m" % FACILITY)
pprint(lt.config.to_dict()) # or simply: print(lt.config)

assert ({k:v for (k,v) in lt.meta.to_dict().items() if v is not None}
        ==  # comparing non empty options
        {k:v for (k,v) in metadata.items() if v is not None})
print("\n\033[4mInput %s country specific metadata:\033[0m" % CC)
pprint(lt.meta.to_dict())

Input data source: /data/healthcare/raw/Hospitals_2018.xlsx

Output HCS configuration metadata:
{'category': {'code': 'hcs', 'name': 'Healthcare services'},
 'index': {'ER': {'desc': "Flag 'yes/no' for whether the healthcare site "
                          'provides emergency medical services',
                  'name': 'emergency',
                  'type': 'str',
                  'values': ['yes', 'no']},
           'PP': {'desc': "Status 'private/public' of the healthcare service",
                  'name': 'public_private',
                  'type': 'str',
                  'values': ['public', 'private']},
           'beds': {'desc': 'Measure of capacity by number of beds (most '
                            'common)',
                    'name': 'cap_beds',
                    'type': 'int',
                    'values': None},
           'cc': {'desc': 'Country code (ISO 3166-1 alpha-2 format)',
                  'name': 'cc',
                  'type': 'str',
  

In [21]:
lt.meta.to_dict()

{'country': {'code': 'LT', 'name': 'Lithuania'},
 'lang': {'code': 'lt', 'name': 'Lithuanian'},
 'proj': None,
 'file': 'Hospitals_2018.xlsx',
 'path': '../../../data/healthcare/raw/',
 'columns': [],
 'index': {'id': 'ID',
  'name': 'Name',
  'site': None,
  'lat': None,
  'lon': None,
  'geo_qual': None,
  'street': None,
  'number': None,
  'postcode': None,
  'city': 'Municipality',
  'cc': None,
  'country': None,
  'beds': 'Number of beds at the end of the 2018',
  'prac': None,
  'rooms': None,
  'ER': None,
  'type': 'type_name',
  'PP': None,
  'specs': None,
  'tel': None,
  'email': None,
  'url': None,
  'refdate': None,
  'pubdate': None},
 'options': {'load': {'header': 2, 'dfmt': '%d/%m/%Y'},
  'clean': None,
  'prepare': None,
  'locate': {'place': ['number', 'street', 'postcode', 'city']}},
 'date': {'ref': None}}

In [22]:
metadata

{'country': {'code': 'LT', 'name': 'Lithuania'},
 'lang': {'code': 'lt', 'name': 'Lithuanian'},
 'proj': None,
 'file': 'Hospitals_2018.xlsx',
 'path': '../../../data/healthcare/raw/',
 'columns': [],
 'index': {'id': 'ID',
  'name': 'Name',
  'site': None,
  'lat': None,
  'lon': None,
  'geo_qual': None,
  'street': None,
  'number': None,
  'postcode': None,
  'city': 'Municipality',
  'cc': None,
  'country': None,
  'beds': 'Number of beds at the end of the 2018',
  'prac': None,
  'rooms': None,
  'ER': None,
  'type': 'type_name',
  'PP': None,
  'specs': None,
  'tel': None,
  'email': None,
  'url': None,
  'refdate': None,
  'pubdate': None},
 'options': {'load': {'header': 2, 'dfmt': '%d/%m/%Y'},
  'clean': None,
  'prepare': None,
  'locate': {'place': ['number', 'street', 'postcode', 'city']}},
 'date': {'ref': None}}

Method `load_data` is used to load the data into the instance dataframe `lt.data`. At this stage, all necessary information to run this operation is available in the `metadata` of the `lt` instance. Besides the file source (`'file'` and `'path'` fields), the option `options['load']` also provides relevant settings for retrieving the (Excel) input data:

In [23]:
assert ({k:v for (k,v) in lt.options.items() if v not in ({},None)}
        ==  # comparing non empty options
        {k:v for (k,v) in lt.meta['options'].items() if v not in ({},None)})

print("\n\033[4mLoading options:\033[0m %s" % lt.options.get('load')) 
lt.load_data() # try also: lt.load_data(header = 2)
# you can add some keyword arguments necessary to load the data: they will override the default settings

cols = lt.data.columns
print("\n\033[4mInput loaded data columns:\033[0m %s" % list(cols))


Loading options: {'header': 2, 'dfmt': '%d/%m/%Y'}

Input loaded data columns: ['Code of municipality', 'Municipality', 'ID', 'parent_ID', 'Code of legal entity', 'Subordination: 1-national (MoH), 3-municipality, 8-private, 9-other ministries (not MoH)', 'type_code', 'type_name', 'Level: 1-national, 2-regional, 3-municipality, 4-nursing, 5-other public and specialized, 6-private', 'Name', 'Address', 'Number of beds at the end of the 2018']
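
The way call-time keyword arguments take precedence over the metadata defaults can be pictured with a simple dictionary merge. This is a sketch under the assumption that overriding works key by key; `resolve_options` is a hypothetical helper and the `sheet_name` value is purely illustrative.

```python
def resolve_options(defaults, **kwargs):
    """Merge metadata defaults with call-time keyword arguments (the latter win)."""
    return {**(defaults or {}), **kwargs}

load_defaults = {'header': 2, 'dfmt': '%d/%m/%Y'}
opts = resolve_options(load_defaults, header=0, sheet_name='Sheet1')
empty = resolve_options(None)  # options left unset (None) in the metadata
```

The merge builds a new dictionary, so the defaults stored in the metadata are left untouched.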


Next, we run the `prepare_data` operations defined in the `Prepare_data` class introduced above. In particular, this creates the `['number', 'street', 'postcode', 'city']` fields that will later be used for geolocation:

In [24]:
# print("\033[4mOverriding method: \033[1m'%s'\033[0m" % LThcs.prepare_data.__name__)
# print(inspect.getsource(LThcs.prepare_data))

print("\n\033[4mPre-processing options:\033[0m %s" % lt.options.get('prepare')) 
lt.prepare_data()

print("\n\033[4mUpdated data columns:\033[0m %s" % list(set(lt.data.columns) ^ set(cols)))


Pre-processing options: None

Updated data columns: ['number', 'street', 'public_private', 'postcode', 'city']


Using the previously created fields, the geolocation operation is run through a call to `locate_data`. This time, since no explicit geocoder is passed to the method, the default Nominatim geolocation service is used:

In [25]:
print("\n\033[4mSupported geoservices\033[0m (to be complemented - see geopy):")
pprint(pyeudatnat.geo.CODERS)

print("\n\033[4mGeocoding options:\033[0m %s" % lt.options.get('locate')) 
lt.locate_data() 
# use also: 
# lt.locate_data(place = ['number', 'street', 'postcode', 'city'], gc = {'Bing': 'your-key'})

# note the following alternatives for passing a geocoder:
# (a) pass the argument gc = 'Nominatim' to LTFacility, or 
# (b) first run lt.geocoder = 'Nominatim'

lt.data.head(5)


Supported geoservices (to be complemented - see geopy):
{'Bing': 'api_key',
 'GeoNames': 'username',
 'GoogleV3': 'api_key',
 'MapQuest': 'key',
 'Nominatim': None,
 'OpenMapQuest': 'api_key',
 'Yandex': 'api_key'}

Geocoding options: {'place': ['number', 'street', 'postcode', 'city']}


Unnamed: 0,Code of municipality,Municipality,ID,parent_ID,Code of legal entity,"Subordination: 1-national (MoH), 3-municipality, 8-private, 9-other ministries (not MoH)",type_code,type_name,"Level: 1-national, 2-regional, 3-municipality, 4-nursing, 5-other public and specialized, 6-private",Name,Address,Number of beds at the end of the 2018,street,number,postcode,city,public_private,place,lat,lon
0,101,Vilnius,1,0,124364561,1,1,general,1,Vilniaus Universiteto ligoninė Santaros klinikos,"Santariškių 2, Vilnius LT-08661",1356,Santariškių,2,LT-08661,Vilnius,public,"Santariškių, 2, LT-08661, Vilnius",54.752349,25.27725
1,101,Vilnius,5,1,302620298,1,1,general,1,Vilniaus Universiteto ligoninės Santaros klini...,"Santariškių 7, Vilnius LT-08406",515,Santariškių,7,LT-08406,Vilnius,public,"Santariškių, 7, LT-08406, Vilnius",54.755024,25.282701
2,101,Vilnius,32,0,124243848,1,1,general,1,Respublikinė Vilniaus universitetinė ligoninė,"Šiltnamių 29, Vilnius",673,Šiltnamių,29,,Vilnius,public,"Šiltnamių, 29, Vilnius",54.668216,25.207733
3,101,Vilnius,3,0,191744287,1,1,general,1,Vilniaus Universiteto ligoninės Žalgirio klinika,"Žalgirio 117, Vilnius",58,Žalgirio,117,,Vilnius,public,"Žalgirio, 117, Vilnius",54.704809,25.278833
4,101,Vilnius,10,0,124247526,1,19,psychiatry,5,Respublikinė Vilniaus psichiatrijos ligoninė,"Parko g. 21, Vilnius",542,Parko g.,21,,Vilnius,public,"Parko g., 21, Vilnius",54.683846,25.417448


Finally, the formatting/harmonisation operated in the `format_data` method consists mainly in basic renaming/casting of existing and new fields/columns:

In [26]:
lt.format_data()

lt.data.head(5)

Unnamed: 0,id,hospital_name,lat,lon,street,house_number,postcode,city,cc,country,cap_beds,facility_type,public_private
0,1,Vilniaus Universiteto ligoninė Santaros klinikos,54.752349,25.27725,Santariškių,2,LT-08661,Vilnius,LT,Lithuania,1356,general,public
1,5,Vilniaus Universiteto ligoninės Santaros klini...,54.755024,25.282701,Santariškių,7,LT-08406,Vilnius,LT,Lithuania,515,general,public
2,32,Respublikinė Vilniaus universitetinė ligoninė,54.668216,25.207733,Šiltnamių,29,,Vilnius,LT,Lithuania,673,general,public
3,3,Vilniaus Universiteto ligoninės Žalgirio klinika,54.704809,25.278833,Žalgirio,117,,Vilnius,LT,Lithuania,58,general,public
4,10,Respublikinė Vilniaus psichiatrijos ligoninė,54.683846,25.417448,Parko g.,21,,Vilnius,LT,Lithuania,542,psychiatry,public


Note that if you want to keep specific column(s) from the input dataset, you can pass their name(s) to the `format_data` method using the `'keep'` keyword argument. This can also be added as an option in the metadata. By default, `keep = True` ensures that all non-empty columns listed in `config.FACMETADATA['HCS']` are preserved, while `force = True` also preserves the empty ones. When `keep = False` is passed to `format_data`, all columns referred to in the index attribute `idx` are preserved. Note that `harmonise` (see below) is run by default with `keep = True, force = True`, so that the output file contains all fields of the configuration template `config.FACMETADATA['HCS']`.

## Full automation <a id="automation"></a>

To summarise, the entire data preparation, processing and harmonisation of the input dataset consists in running sequentially the following operations:
```python
# country-specific class definition
LTFacility = config.facilityFactory(cat = 'HCS', meta = metadata)
# facility instance creation
lt = LTFacility()
# data handling, geocoding and formatting
lt.load_data()
lt.prepare_data()
lt.locate_data()
lt.format_data()
```
where the processing options/parameters of the various operations above are passed through the configuration and metadata attributes `meta`, `config` and `options` of the input facility instance, or passed directly to the different methods.

Overall, the above stepwise process is fully automated once all necessary options/parameters have been specified in the metadata. Namely, it is possible to automatically (re)run the entire processing using the self-contained `harmonise.run` method:

In [27]:
from pyeufacility import harmonise

OFILE = '%sgeo-test2.csv' % CC

lt = harmonise.run(FACILITY, country = CC, gc = {'Bing': BING_KEY},
                   on_disk = True, dest = OFILE)

lt.data.head(5)

! Country py-module 'pyeufacility.hcs.LThcs' found !
! Generic formatting/harmonisation methods used !
! country-specific 'prepare_data' method loaded !
! Harmonised data for country 'LT' generated !


Unnamed: 0,id,hospital_name,site_name,lat,lon,geo_qual,street,house_number,postcode,city,...,emergency,facility_type,public_private,list_specs,tel,email,url,ref_date,pub_date,comments
0,1,Vilniaus Universiteto ligoninė Santaros klinikos,,54.75171,25.27917,,Santariškių,2,LT-08661,Vilnius,...,,general,public,,,,,,,
1,5,Vilniaus Universiteto ligoninės Santaros klini...,,54.75481,25.28272,,Santariškių,7,LT-08406,Vilnius,...,,general,public,,,,,,,
2,32,Respublikinė Vilniaus universitetinė ligoninė,,54.66862,25.20767,,Šiltnamių,29,,Vilnius,...,,general,public,,,,,,,
3,3,Vilniaus Universiteto ligoninės Žalgirio klinika,,54.70407,25.27884,,Žalgirio,117,,Vilnius,...,,general,public,,,,,,,
4,10,Respublikinė Vilniaus psichiatrijos ligoninė,,,,,Parko g.,21,,Vilnius,...,,psychiatry,public,,,,,,,
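
The chaining of steps can be pictured with the sketch below. It is not the implementation of `harmonise`: `run_pipeline` and `DemoFacility` are hypothetical, and the stand-in class only records the calls it receives so that the example runs without any data.

```python
STEPS = ('load_data', 'prepare_data', 'locate_data', 'format_data')

def run_pipeline(instance, steps=STEPS, **options):
    """Call each step of `instance` in order, with optional per-step keyword arguments."""
    for step in steps:
        getattr(instance, step)(**options.get(step, {}))
    return instance

class DemoFacility:
    """Stand-in for a facility instance: it only records the calls it receives."""
    def __init__(self):
        self.calls = []
    def _record(self, name, kwargs):
        self.calls.append((name, kwargs))
    def load_data(self, **kwargs):
        self._record('load_data', kwargs)
    def prepare_data(self, **kwargs):
        self._record('prepare_data', kwargs)
    def locate_data(self, **kwargs):
        self._record('locate_data', kwargs)
    def format_data(self, **kwargs):
        self._record('format_data', kwargs)

# options for a given step are passed under the name of that step
demo = run_pipeline(DemoFacility(), load_data={'header': 2})
```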


Note that, whenever the geographical coordinates are set in the dataset, it is possible to easily retrieve a `GeoDataFrame` instance from the facility instance using the `to_geodf` method:

In [28]:
geolt = lt.to_geodf(latlon = ['lat', 'lon'], crs = "EPSG:4326")

geolt[['lat', 'lon', 'geometry']].head(5)

Unnamed: 0,lat,lon,geometry
0,54.75171,25.27917,POINT (25.279 54.752)
1,54.75481,25.28272,POINT (25.283 54.755)
2,54.66862,25.20767,POINT (25.208 54.669)
3,54.70407,25.27884,POINT (25.279 54.704)
4,,,POINT (nan nan)
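
Note that point geometries are stored in (lon, lat) order, as the table above shows. A minimal standard-library sketch of the same idea, building a GeoJSON feature collection while skipping facilities without coordinates, is given below; `to_geojson` is a hypothetical helper, not the `to_geodf` method.

```python
import json
import math

def to_geojson(records, lat='lat', lon='lon'):
    """Build a GeoJSON FeatureCollection of points from a list of dict records."""
    features = []
    for rec in records:
        y, x = rec.get(lat), rec.get(lon)
        if y is None or x is None or math.isnan(y) or math.isnan(x):
            continue  # skip facilities that could not be geocoded
        props = {k: v for k, v in rec.items() if k not in (lat, lon)}
        features.append({'type': 'Feature',
                         # GeoJSON positions are ordered [longitude, latitude]
                         'geometry': {'type': 'Point', 'coordinates': [x, y]},
                         'properties': props})
    return {'type': 'FeatureCollection', 'features': features}

records = [{'id': 1, 'lat': 54.75171, 'lon': 25.27917},
           {'id': 10, 'lat': math.nan, 'lon': math.nan}]
collection = to_geojson(records)
text = json.dumps(collection)
```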


In [29]:
geolt.columns

Index(['comments', 'url', 'facility_type', 'tel', 'cap_prac', 'city',
       'pub_date', 'lon', 'lat', 'street', 'hospital_name', 'public_private',
       'cc', 'ref_date', 'country', 'id', 'house_number', 'list_specs',
       'email', 'emergency', 'cap_beds', 'cap_rooms', 'geo_qual', 'postcode',
       'site_name', 'geometry'],
      dtype='object')

For instance, one can use the geoinformation for basic exploratory visualisation:

In [30]:
from folium import Map, GeoJson, features
  
vilnius = [54.6872, 25.2797]
zoom = 11
nsample = 10

# gjsonlt = json.dumps(io.Frame.to_geojson(lt.data.iloc[0:(nsample-1),:], latlon = ['lat', 'lon']))
# keep only the facilities with known coordinates, then the first nsample of them
gjsonlt = geolt[~(geolt.lon.isna() | geolt.lat.isna())]
gjsonlt = gjsonlt.iloc[:nsample,:].to_json()
          
m = Map(location = vilnius, zoom_start = zoom)
GeoJson(gjsonlt, name = 'LT hospital',
        tooltip = features.GeoJsonTooltip(fields = ['hospital_name', 'facility_type', 'cap_beds'],
                                          localize = True)
       ).add_to(m)

m
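
As a rough illustration of what is passed to the map, the sketch below builds a GeoJSON `FeatureCollection` by hand from a couple of records, skipping those without coordinates. The helper `records_to_geojson` and the sample records are hypothetical (the values are copied from the outputs above); this is not how `geopandas` implements `to_json`, only a standard-library sketch of the same idea. Note that GeoJSON positions are ordered longitude first, which matches the `POINT (lon lat)` geometries displayed earlier.

```python
import json
import math

def records_to_geojson(records, lat = 'lat', lon = 'lon'):
    """Build a GeoJSON FeatureCollection (hypothetical helper, stdlib only)."""
    features = []
    for rec in records:
        y, x = rec.get(lat), rec.get(lon)
        # skip the records whose coordinates are missing (None or NaN)
        if y is None or x is None or math.isnan(y) or math.isnan(x):
            continue
        props = {k: v for k, v in rec.items() if k not in (lat, lon)}
        features.append({
            'type': 'Feature',
            # GeoJSON positions are ordered [longitude, latitude]
            'geometry': {'type': 'Point', 'coordinates': [x, y]},
            'properties': props,
        })
    return {'type': 'FeatureCollection', 'features': features}

sample = [
    {'id': 1, 'facility_type': 'general', 'lat': 54.75171, 'lon': 25.27917},
    {'id': 10, 'facility_type': 'psychiatry', 'lat': float('nan'), 'lon': float('nan')},
]
collection = records_to_geojson(sample)
print(json.dumps(collection, indent = 2))
```

Only the first record ends up in the collection, since the second one has no coordinates.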

## Customised processing <a id="customisation"></a>

By overriding class methods, it is possible to customise any of the operations run throughout the harmonisation process. The default implementations made available in the country module `LThcs.py` (currently, `Prepare_data` and `prepare_data`) can themselves be overridden.

This way, any of the loading (`load_data`), pre-processing (`prepare_data`), geolocalisation (`locate_data`) and formatting (`format_data`) operations can be customised or updated, and new operations can be added. The customisable operations are listed below:

In [31]:
print("\n\033[4mCustomisable operations:\033[0m \033[1m%s\033[0m" % pyeudatnat.base.PROCESSES)


Customisable operations: ['fetch', 'load', 'prepare', 'clean', 'translate', 'locate', 'format', 'save']


Operations that are updated and declared in the `LThcs.py` module are loaded automatically by the `harmonise` process to override the abstract/original methods. Otherwise, overriding can be performed explicitly.
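
Explicit overriding relies on a plain Python mechanism: a function assigned as a class attribute becomes a method of that class. The minimal sketch below uses a hypothetical `ToyFacility` class (not part of `pyeufacility`) to show that the overriding function receives the instance as its first argument, while the operations that are not overridden keep their default implementation.

```python
class ToyFacility:
    """Hypothetical stand-in for a facility class with default operations."""
    def __init__(self):
        self.data = None

    def load_data(self):
        self.data = ['default']

    def prepare_data(self):
        self.data = [d.upper() for d in self.data]

def custom_loader(instance): # 'instance' plays the role of 'self'
    instance.data = ['custom', 'loader']

# explicit overriding: assign the plain function as a class attribute
ToyFacility.load_data = custom_loader

toy = ToyFacility()
toy.load_data()      # runs custom_loader(toy)
toy.prepare_data()   # default implementation, left untouched
print(toy.data)      # ['CUSTOM', 'LOADER']
```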

In the following, we reuse the basic functions used for the [ad-hoc processing](#ad-hoc) throughout the entire workflow. For this purpose a new class `Another_LTFacility` needs to be introduced:

In [32]:
Another_LTFacility = config.facilityFactory(cat = FACILITY, meta = metadata)

def another_loader(instance): # instance is like 'self'...
    instance.data = adhoc_load(instance.file)
Another_LTFacility.load_data = another_loader

def another_preparator(instance):
    instance.data = instance.data.copy().head(10) # reduce the dataset dimension
    adhoc_preparation(instance.data)
Another_LTFacility.prepare_data = another_preparator

def another_locator(instance):
    adhoc_location(instance.data)
Another_LTFacility.locate_data = another_locator

# pprint(Another_LTFacility.__dict__)
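
The internals of `config.facilityFactory` are not shown in this notebook. Assuming it returns a fresh class at each call, overriding the methods of one generated class leaves any other generated class untouched. The sketch below illustrates this behaviour with a hypothetical `toy_factory` built on the built-in `type`; the names `BaseToyFacility` and `toy_factory` are illustrative only.

```python
class BaseToyFacility:
    """Hypothetical base class providing a default operation."""
    def locate_data(self):
        return 'default locator'

def toy_factory(cat, meta = None):
    """Return a brand new subclass at each call (hypothetical factory)."""
    name = '%sFacility' % cat
    return type(name, (BaseToyFacility,), {'category': cat, 'meta': meta or {}})

First = toy_factory('HCS', meta = {'country': 'LT'})
Second = toy_factory('HCS', meta = {'country': 'LT'})

# overriding a method on one generated class...
First.locate_data = lambda instance: 'custom locator'

# ...leaves the other generated class (and the base class) untouched
print(First().locate_data())   # custom locator
print(Second().locate_data())  # default locator
print(First is Second)         # False
```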

The previous stepwise approach can then be adopted, with the same process names (but overridden methods), to produce the data. Note that these functions do not use any 'metadata information' besides what is hard-coded in their core implementation. This explains, for instance, the list of columns present in the output dataset:

In [33]:
another_lt = Another_LTFacility()

another_lt.load_data()
another_lt.prepare_data()
another_lt.locate_data()
another_lt.format_data()

another_lt.data.head(5)

100%|██████████| 10/10 [01:49<00:00, 10.91s/it]


Unnamed: 0,Municipality,id,parent_ID,type_code,facility_type,name,Address,cap_beds,country,public_private,place,lat,lon,cc
0,Vilnius,1,0,1,general,Vilniaus Universiteto ligoninė Santaros klinikos,"Santariškių 2, Vilnius LT-08661",1356,Lithuania,public,"Santariškių 2, Vilnius LT-08661, Lithuania",54.75171,25.27917,LT
1,Vilnius,5,1,1,general,Vilniaus Universiteto ligoninės Santaros klini...,"Santariškių 7, Vilnius LT-08406",515,Lithuania,public,"Santariškių 7, Vilnius LT-08406, Lithuania",54.75481,25.28272,LT
2,Vilnius,32,0,1,general,Respublikinė Vilniaus universitetinė ligoninė,"Šiltnamių 29, Vilnius",673,Lithuania,public,"Šiltnamių 29, Vilnius, Lithuania",54.66862,25.20767,LT
3,Vilnius,3,0,1,general,Vilniaus Universiteto ligoninės Žalgirio klinika,"Žalgirio 117, Vilnius",58,Lithuania,public,"Žalgirio 117, Vilnius, Lithuania",54.70407,25.27884,LT
4,Vilnius,10,0,19,psychiatry,Respublikinė Vilniaus psichiatrijos ligoninė,"Parko g. 21, Vilnius",542,Lithuania,public,"Parko g. 21, Vilnius, Lithuania",54.68442,25.4182,LT
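
The four stepwise calls above can also be driven by a list of operation names. The sketch below assumes that each operation name maps to a method called `<name>_data`, as is the case for the four methods used in this section; the class `ToyStepFacility` and the helper `run_steps` are hypothetical and only record the order in which the operations are run.

```python
class ToyStepFacility:
    """Hypothetical facility class recording which operations were run."""
    def __init__(self):
        self.log = []
    def load_data(self):    self.log.append('load')
    def prepare_data(self): self.log.append('prepare')
    def locate_data(self):  self.log.append('locate')
    def format_data(self):  self.log.append('format')

def run_steps(facility, steps = ('load', 'prepare', 'locate', 'format')):
    """Call '<step>_data' on the facility for each step, in the given order."""
    for step in steps:
        method = getattr(facility, '%s_data' % step, None)
        if method is None:
            raise AttributeError("operation '%s_data' is not implemented" % step)
        method()
    return facility

toy = run_steps(ToyStepFacility())
print(toy.log) # ['load', 'prepare', 'locate', 'format']
```

Since the methods are looked up by name at call time, any method overridden beforehand is the one that gets run.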


More 'advanced' functions can be considered, as for the newly defined class `Yet_Another_LTFacility` below (although we could also have overridden the methods of the previously defined class `Another_LTFacility`):

In [34]:
Yet_Another_LTFacility = config.facilityFactory(cat = FACILITY, meta = metadata)

yet_another_preparator = another_preparator # same as above
Yet_Another_LTFacility.prepare_data = yet_another_preparator

def yet_another_locator(instance, delay = True): 
    # remember: we defined earlier the global variable geolocator as:
    #    gcNominatim = functools.partial(geocoders.Nominatim(user_agent = PROJECT).geocode, 
    #                                   language = 'en') # or instance.lang
    if delay is False:
        new_geolocator = gc
    else:
        if delay is True:
            delay = 1 # default delay (in seconds) between requests sent to the geoservice
        if not isinstance(delay, int):
            raise TypeError("Wrong delay number")
        new_geolocator = RateLimiter(gc, min_delay_seconds = delay)
    instance.data['lat'], instance.data['lon'] = (
        zip(*instance.data['place']
            .apply(new_geolocator)
            .apply(lambda x: (x.latitude, x.longitude) if x is not None else (np.nan, np.nan))
           )
    )
Yet_Another_LTFacility.locate_data = yet_another_locator
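
To make the two ideas used in this locator more concrete, the sketch below reproduces them with the standard library only: a wrapper that enforces a minimum delay between successive calls, and the `zip(*...)` idiom that turns a sequence of (lat, lon) pairs into two sequences. The lookup table `KNOWN_PLACES`, the function `toy_geocoder` and the wrapper `rate_limited` are hypothetical stand-ins; in particular, `rate_limited` is a simplified illustration and not the implementation of `RateLimiter`.

```python
import time
from collections import namedtuple

Location = namedtuple('Location', ['latitude', 'longitude'])

# hypothetical lookup table standing in for an online geocoding service
KNOWN_PLACES = {
    'Santariškių 2, Vilnius LT-08661, Lithuania': Location(54.75171, 25.27917),
    'Šiltnamių 29, Vilnius, Lithuania': Location(54.66862, 25.20767),
}

def toy_geocoder(place):
    """Return a Location, or None when the place is unknown."""
    return KNOWN_PLACES.get(place)

def rate_limited(func, min_delay_seconds):
    """Wrap func so that successive calls are at least min_delay_seconds apart."""
    last_call = [None] # mutable cell holding the time of the previous call
    def wrapper(*args, **kwargs):
        if last_call[0] is not None:
            wait = min_delay_seconds - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)
        last_call[0] = time.monotonic()
        return func(*args, **kwargs)
    return wrapper

places = list(KNOWN_PLACES) + ['nowhere']
geocode = rate_limited(toy_geocoder, min_delay_seconds = 0.05)

start = time.monotonic()
results = [geocode(p) for p in places]
elapsed = time.monotonic() - start

# a sequence of (lat, lon) pairs becomes two sequences; unknown places give NaN
nan = float('nan')
lat, lon = zip(*[(r.latitude, r.longitude) if r is not None else (nan, nan)
                 for r in results])
print(lat, lon)
```

With three places and a delay of 0.05 seconds, the wrapper waits twice, so the whole loop takes roughly 0.1 seconds or more.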

The data are then processed similarly. Note this time that `['number', 'street', 'postcode', 'city']` have actually not been created (since `another_preparator` was also used), while all empty/NaN columns have now been discarded from the output dataset (`force = False` in the default `format_data` method):

In [35]:
yet_another_lt = Yet_Another_LTFacility()

yet_another_lt.load_data()
yet_another_lt.prepare_data()
yet_another_lt.locate_data()
yet_another_lt.format_data() # set force = True to get all output columns

yet_another_lt.data.head(5)

Unnamed: 0,id,lat,lon,city,cc,country,cap_beds,facility_type,public_private
0,1,54.75171,25.27917,Vilnius,LT,Lithuania,1356,general,public
1,5,54.75481,25.28272,Vilnius,LT,Lithuania,515,general,public
2,32,54.66862,25.20767,Vilnius,LT,Lithuania,673,general,public
3,3,54.70407,25.27884,Vilnius,LT,Lithuania,58,general,public
4,10,54.68442,25.4182,Vilnius,LT,Lithuania,542,psychiatry,public
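
The implementation of `format_data` is not shown here. Assuming that `force = False` amounts to dropping the columns that are entirely empty, the effect can be sketched in plain `pandas` with `DataFrame.dropna`. The toy table below is hypothetical, with a few values copied from the outputs above.

```python
import numpy as np
import pandas as pd

# toy table with two entirely empty columns
df = pd.DataFrame({
    'id':        [1, 5, 32],
    'site_name': [np.nan, np.nan, np.nan],
    'lat':       [54.75171, 54.75481, 54.66862],
    'geo_qual':  [np.nan, np.nan, np.nan],
    'city':      ['Vilnius', 'Vilnius', 'Vilnius'],
})

# keep only the columns holding at least one non-missing value
compact = df.dropna(axis = 1, how = 'all')
print(list(compact.columns)) # ['id', 'lat', 'city']
```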
