# Input/Output data integration and harmonisation - CZ healthcare services

Prepared by [**J.Grazzini**](mailto:jacopo.grazzini@ec.europa.eu) ([_Eurostat_](https://ec.europa.eu/eurostat)).

## Introduction

This notebook illustrates some basic usage of the [`pyeudatnat`](https://github.com/eurostat/pyEUDatNat) and [`pyeufacility`](https://github.com/eurostat/pyeufacility) packages to ingest and harmonise data collected from national providers. 

The example presented herein aims at formatting CZ data on healthcare services according to an agreed [common template](https://github.com/eurostat/basic-services/blob/master/data/healthcare/metadata.pdf).

**Table of contents**

* [Setting and checking the environment](#environment).
* [Definition of the output configuration format](#configuration).
* [Loading the input data](#loading).
* [Use of input metadata and output configuration template](#metadata).
* [Geocoding and formatting the input data](#geocode-format).
* [Adopting a metadata-based approach](#metadata-approach).
* [Automated metadata-based harmonisation](#harmonisation).
* [Automated metadata-based validation](#validation).

## Setting and checking the environment <a id="environment"></a>

Let's run the setup needed to import the dependencies required by the project.

In [1]:
thisdir = !pwd # os.getcwd()
PROJECT = 'basic-services'

import os, sys
import numpy as np
import pandas as pd
import json
from pprint import pprint

First, you will need to import the `pyeudatnat` package that contains various useful functions/methods: 

In [2]:
try:
    import pyeudatnat
except ImportError:
    try:
        !{sys.executable} -m pip install git+https://github.com/eurostat/pyEUDatNat.git
        # !conda install --yes --prefix {sys.prefix} git+https://github.com/eurostat/pyEUDatNat.git
    except:
        raise IOError("sorry, you're doomed, you won't be able to run this notebook...")
    else:
        import pyeudatnat
finally:
    print('package \033[1mpyeudatnat\033[0m available to run project \033[1m%s\033[0m' % PROJECT)
    from pyeudatnat import misc, io, meta, text

package pyeudatnat available to run project basic-services


Then, you will need to load the [`pyeufacility`](https://github.com/eurostat/basic-services/tree/master/src/python/pyeufacility) module available with the [`'basic-services'`](https://github.com/eurostat/basic-services/) project.
The following is required only if `pyeufacility` is not already installed (_e.g._ from `pypi`), and it will work only if you have a local copy of the project on your machine. Another simple option is to update your environment variable `PYTHONPATH`. 

In [3]:
try:
    import pyeufacility
except ImportError:
    try:
        assert 'pyeufacility' in [mod.__name__ for mod in sys.modules.values()]
    except AssertionError:
        # first case: this notebook will need to be run from the project directory, otherwise...
        try:
            pardir = os.path.abspath(os.path.join(thisdir[0], '../'))
            assert 'pyeufacility' in os.listdir(pardir)
        except:
            # second case: install from github
            try:
                !{sys.executable} -m pip install git+https://github.com/eurostat/basic-services.git#egg=pyeufacility\&subdirectory=src/python
            except:
                raise IOError("sorry, you're doomed, you won't be able to run this notebook...")
            else:
                import pyeufacility
        else:
            sys.path.insert(0,pardir)     
            import pyeufacility
finally:
    print('package \033[1mpyeufacility\033[0m available to run project \033[1m%s\033[0m' % PROJECT)
    from pyeufacility import config

package pyeufacility available to run project basic-services
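
The `PYTHONPATH` alternative mentioned above can also be emulated from within the notebook by prepending the parent directory of the package to `sys.path`. The sketch below assumes a hypothetical location for a local clone of the project (`~/basic-services`); adapt it to wherever the project actually lives on your machine.

```python
import os
import sys

# Hypothetical location of a local clone of the 'basic-services' project
# (an assumption for illustration only; adapt to your own setup).
project_root = os.path.expanduser(os.path.join('~', 'basic-services'))

# Directory expected to contain the 'pyeufacility' package folder.
package_parent = os.path.join(project_root, 'src', 'python')

# Prepend it once, so that 'import pyeufacility' can find the local copy.
if package_parent not in sys.path:
    sys.path.insert(0, package_parent)

print(package_parent in sys.path)
```

Unlike an exported `PYTHONPATH`, this change only lasts for the current kernel session.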


We will also use the following variable for handling the paths to the local data repositories:

In [4]:
try:
    PACKPATH = getattr(sys.modules['pyeufacility'], 'PACKPATH')
except:
    PACKPATH = getattr(sys.modules['pyeufacility'], '__path__')[0]

Let's check some basic global parameters/metadata already made available there...

In [5]:
print("\033[4mCountries considered for harmonisation:\033[0m \033[1m%s\033[0m" 
      % list(pyeudatnat.COUNTRIES))
print("\033[4mFacilities available for harmonisation:\033[0m \033[1m%s\033[0m" 
      % pyeufacility.FACILITIES)

Countries considered for harmonisation: ['BE', 'EL', 'LT', 'PT', 'BG', 'ES', 'LU', 'RO', 'CZ', 'FR', 'HU', 'SI', 'DK', 'HR', 'MT', 'SK', 'DE', 'IT', 'NL', 'FI', 'EE', 'CY', 'AT', 'SE', 'IE', 'LV', 'PL', 'UK', 'IS', 'NO', 'CH', 'LI']
Facilities available for harmonisation: {'HCS': {'code': 'hcs', 'name': 'Healthcare services'}, 'EDU': {'code': 'edu', 'name': 'Educational facilities'}, 'Oth': {'code': 'other', 'name': 'Other basic services TBD'}}


## Definition of the output configuration format <a id="configuration"></a> 

Next, we import the `config` module, and check also some of the metadata made available through the `pyeufacility` module. 

In particular, the format of the output harmonised dataset (output column names and types, output encoding, separator for the output CSV, _etc._) is defined separately for each type of facility, _e.g._ HealthCare Services (*HCS*), in the global dictionary `config.FACMETADATA`:

In [6]:
print("\033[4mFields of the configuration template:\033[0m \033[1m%s\033[0m" 
      % list(config.FACMETADATA['HCS'].keys()))
print("\033[4mFormatting of output data is available for 'HCS':\033[0m \033[1m%s\033[0m" 
      % config.FACMETADATA['HCS']['category'])

Fields of the configuration template: ['path', 'info', 'options', 'index', 'category']
Formatting of output data is available for 'HCS': {'code': 'hcs', 'name': 'Healthcare services'}


Using this variable, it is possible to retrieve the specific formatting of *HCS*. `config.FACMETADATA['HCS']` describes the output harmonised data, *e.g.* the list of desired columns in the dataset, the format/cast of the columns, the path of the output file, the projection considered for the dataset, *etc.*:

In [7]:
print("\033[4mConfiguration template:\033[0m (%s)" % type(config.FACMETADATA['HCS']))
# print(json.dumps(config.CONFIGINFO['HCS'], indent=2))
pprint(config.FACMETADATA['HCS'])

Configuration template: (<class 'dict'>)
{'category': {'code': 'hcs', 'name': 'Healthcare services'},
 'index': {'ER': {'desc': "Flag 'yes/no' for whether the healthcare site "
                          'provides emergency medical services',
                  'name': 'emergency',
                  'type': 'str',
                  'values': ['yes', 'no']},
           'PP': {'desc': "Status 'private/public' of the healthcare service",
                  'name': 'public_private',
                  'type': 'str',
                  'values': ['public', 'private']},
           'beds': {'desc': 'Measure of capacity by number of beds (most '
                            'common)',
                    'name': 'cap_beds',
                    'type': 'int',
                    'values': None},
           'cc': {'desc': 'Country code (ISO 3166-1 alpha-2 format)',
                  'name': 'cc',
                  'type': 'str',
                  'values': ['BE',
                             '

This provides, for instance, the definitions of the desired output fields:

In [8]:
print("- output attribute \033[1m'%s'\033[0m defined as:\n\033[1m%s\033[0m" % 
      (config.FACMETADATA['HCS']['index']['site']['name'], config.FACMETADATA['HCS']['index']['site']))
print("- output attribute \033[1m'%s'\033[0m defined as:\n\033[1m%s\033[0m" % 
      (config.FACMETADATA['HCS']['index']['lat']['name'], config.FACMETADATA['HCS']['index']['lat']))

- output attribute 'site_name' defined as:
{'name': 'site_name', 'desc': 'The name of the specific site or branch of a healthcare institution', 'type': 'str', 'values': None}
- output attribute 'lat' defined as:
{'name': 'lat', 'desc': 'Latitude (WGS 84)', 'type': 'float', 'values': None}


Note that the parameters of any given facility, _e.g._ *HCS*, as defined in `config.FACMETADATA` can actually be (re)set through an external configuration JSON file, _e.g._ *hcs.json*:

In [9]:
cfg = os.path.join(PACKPATH, 'hcs.json')
with open(cfg, 'r') as fp:
    cfgdata = json.load(fp)
    
print("Configuration template is parsed through \033[1m'%s'\033[0m file:" 
      % cfg.split(PROJECT)[1])
# cfgdatametadata
print(json.dumps(cfgdata, indent=2))

Configuration template is parsed through '/src/python/pyeufacility/hcs.json' file:
{
  "path": "../../../data/healthcare",
  "info": "metadata.pdf",
  "options": {
    "lang": "en",
    "fmt": {
      "geojson": "geojson",
      "csv": "csv",
      "gpkg": "gpkg"
    },
    "proj": null,
    "sep": ",",
    "encoding": "utf-8",
    "dtfmt": "%d/%m/%Y"
  },
  "index": {
    "id": {
      "name": "id",
      "desc": "The healthcare service identifier - This identifier is based on national identification codes, if it exists.",
      "type": "int",
      "values": null
    },
    "name": {
      "name": "hospital_name",
      "desc": "The name of the healthcare institution",
      "type": "str",
      "values": null
    },
    "site": {
      "name": "site_name",
      "desc": "The name of the specific site or branch of a healthcare institution",
      "type": "str",
      "values": null
    },
    "lat": {
      "name": "lat",
      "desc": "Latitude (WGS 84)",
      "type": "floa

## Loading the input data<a id="loading"></a> 

Let's consider a simple example: in the original table, the *CZ* data already comes with the lat/lon geographical coordinates, encoded in a single column of the input data (named `'GPS'`). "Integrating" these data amounts to extracting the coordinates, reshuffling some of the columns and dumping the result into a new table.

To do so, it will be necessary to retrieve the structure (in terms of columns and data format) of the original table and "match" it to the output desired structure, as defined in the previously mentioned `config.FACMETADATA` parameter. 

Let's first define the input structure using the class factory `config.facilityFactory`:

In [10]:
import importlib
from pyeudatnat import io, geo, misc, meta, base 
importlib.reload(geo)
importlib.reload(meta)
importlib.reload(misc)

importlib.reload(base)

importlib.reload(config)

<module 'pyeufacility.config' from '/Users/gjacopo/DevOps/basic-services/src/python/pyeufacility/config.py'>

In [11]:
CZhcs = config.facilityFactory(cat = 'HCS', country = 'CZ')

print("\033[4mCountry:\033[0m \033[1m%s - %s\033[0m" % (CZhcs.CC, CZhcs.COUNTRY))
print("\033[4mFacility category:\033[0m \033[1m%s\033[0m" % CZhcs.CATEGORY)

Country: CZ - Czechia
Facility category: {'code': 'hcs', 'name': 'Healthcare services'}


and define the *CZ* data as an instance of the previously dynamically defined class `CZhcs`, with specific details regarding the data, such as the location of the original table:

In [12]:
cz = CZhcs(file = 'export-2020-02.csv', 
           path = os.path.abspath('../../../data/healthcare/raw/'), 
           lang = 'cs')

print("\033[4mCZhcs instance:\033[0m (%s)" % type(cz))
print("- source file: \033[1m'%s'\033[0m" % cz.file.split(PROJECT)[1])
print("- country code: \033[1m'%s'\033[0m" % cz.cc)
print("- language used in the original table: \033[1m'%s'\033[0m" % cz.lang)

CZhcs instance: (<class 'pyeudatnat.base.NewDatNat'>)
- source file: '/data/healthcare/raw/export-2020-02.csv'
- country code: 'CZ'
- language used in the original table: 'cs'


This will be enough to load the data using the `load_data` method which stores the input data as a [`pandas`](https://pandas.pydata.org) dataframe structure in the `data` attribute of the `cz` instance. 

At this stage, it is necessary to check the input data, _e.g._ by looking at the column names and the content of the table in the raw dataset: obviously, some formatting is needed. The related information is passed to the `load_data` method through the `options['load']` attribute of the `cz` instance:

In [13]:
# first incomplete try
try:
    cz.load_data()
except:
    print('Further information regarding data loading is requested...')
    cz.options['load'].update({'encoding': 'latin1', 'sep': ';'})
    # rerun...
    cz.load_data() 
    # note: we could have run: cz.load_data(encoding='latin1', sep=';') without the prior setting

print("\033[4mDataset:\033[0m (%s)" % type(cz.data))
cz.data.head(5)



Further information regarding data loading is requested...
Dataset: (<class 'pandas.core.frame.DataFrame'>)


Unnamed: 0,ZdravotnickeZarizeniId,PCZ,PCDP,NazevCely,DruhZarizeni,Obec,Psc,Ulice,CisloDomovniOrientacni,Kraj,...,PscSidlo,ObecSidlo,UliceSidlo,CisloDomovniOrientacniSidlo,OborPece,FormaPece,DruhPece,OdbornyZastupce,GPS,LastModified
0,138130,0,1,Zubní studio V+V s.r.o.,Samostatná ordinace PL - stomatologa,Pelhøimov,39301.0,Komenského,1465,Kraj Vysoèina,...,39301.0,Pelhøimov,Pod Náspem,641,"zubní lékaøství, Dentální hygienistka","primární ambulantní péèe, specializovaná ambul...",,"Veronika kodová, Vít koda",49.427787867131 15.218289906425,01-02-2020 23:03
1,138129,0,4,TECTUM spol. s r.o.,Odbìrová místnost,Most,43401.0,Topolová,1234,Ústecký kraj,...,36001.0,Karlovy Vary,Bezruèova,1098/10,klinická biochemie,specializovaná ambulantní péèe,,"David Hepnar, JITKA PODROUKOVÁ",50.497217046878 13.650265371038,01-02-2020 23:03
2,138128,1,0,"MUDr. Milan Kuèera, s.r.o.",Samostatná ordinace lékaøe specialisty,Praha,15000.0,Kartouzská,3274/10,Hlavní mìsto Praha,...,27201.0,Kladno,T. G. Masaryka,2104,dermatovenerologie,ambulantní péèe,,Jan Kuèera,50.073416715218 14.401090995315,01-02-2020 23:03
3,138127,1,0,Tereza Horáèková,Výdejna zdravotnických prostøedkù,Praha,18200.0,Klapkova,154/46,Hlavní mìsto Praha,...,,,,,praktické lékárenství,,lékárenská péèe,Petra Holasová,50.126760922745 14.456956313382,01-02-2020 23:03
4,138126,0,0,Veobecný lékaø Prevent s.r.o.,Samost. ordinace veob. prakt. lékaøe,Telè,58856.0,Masarykova,330,Kraj Vysoèina,...,28163.0,Kozojedy,1. máje,67,"veobecné praktické lékaøství, veobecné prakt...","primární ambulantní péèe, zdrav. péèe poskytov...",,ARNOT STANÌK,49.183499865155 15.45981410537,01-02-2020 23:03
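
The garbled characters in the preview above (_e.g._ `Pelhøimov`, `Vysoèina`) suggest that `latin1` lets the file load but is not the encoding it was written in. A plausible candidate is Windows-1250 (`cp1250`), which is common for Czech text; note that this is an assumption inferred from the character substitutions, not from the documentation of the data provider. The sketch below reproduces the effect on a single string with the standard library only:

```python
# Place name as it should read in Czech.
expected = 'Pelhřimov'

# Bytes as they would be stored in a cp1250-encoded file (assumed encoding).
raw = expected.encode('cp1250')

# Decoding those bytes as latin1 never fails, but some bytes map to the wrong letters.
garbled = raw.decode('latin1')

# Decoding with the encoding used for writing restores the original text.
restored = raw.decode('cp1250')

print(garbled)
print(restored)
```

If the assumption holds, trying `'cp1250'` instead of `'latin1'` in `cz.options['load']` may give cleaner text in the loaded table.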


More options are obviously available, as allowed by the `pandas` methods, for instance [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), [`read_excel`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html) or [`read_table`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_table.html), depending on the format of the input data you are dealing with. 

Now, basic information regarding the input data can be retrieved through the `data` attribute of the `cz` instance which is a common `pandas` dataframe. For instance, if we are interested in the list of columns:

In [14]:
print("\033[4mInput dataset columns:\033[0m")
print(cz.data.columns)

Input dataset columns:
Index(['ZdravotnickeZarizeniId', 'PCZ', 'PCDP', 'NazevCely', 'DruhZarizeni',
       'Obec', 'Psc', 'Ulice', 'CisloDomovniOrientacni', 'Kraj', 'KrajCode',
       'Okres', 'OkresCode', 'SpravniObvod', 'PoskytovatelTelefon',
       'PoskytovatelFax', 'DatumZahajeniCinnosti',
       'IdentifikatorDatoveSchranky', 'PoskytovatelEmail', 'PoskytovatelWeb',
       'PoskytovatelNazev', 'Ico', 'TypOsoby', 'PravniFormaKod',
       'KrajCodeSidlo', 'KrajSidlo', 'OkresCodeSidlo', 'OkresSidlo',
       'PscSidlo', 'ObecSidlo', 'UliceSidlo', 'CisloDomovniOrientacniSidlo',
       'OborPece', 'FormaPece', 'DruhPece', 'OdbornyZastupce', 'GPS',
       'LastModified'],
      dtype='object')


To ease the identification of the appropriate columns in the input table, we can (approximately) translate the column names from *'cs'* to *'en'*, using the [`googletrans`](https://pypi.org/project/googletrans/) package. Note that you may run into [this issue](https://stackoverflow.com/questions/52455774/googletrans-stopped-working-with-error-nonetype-object-has-no-attribute-group) with this package. Try to rerun the cell for a dirty fix, or follow the instructions provided in the link (*e.g.*, install the [`py-googletrans`](https://github.com/BoseCorp/py-googletrans) package):


In [15]:
print("\033[4mColumn names translated in English:\033[0m")
print(cz.get_cols(olang = 'en'))

Column names translated in English:
['Medical Equipment Id', 'PCZ', 'PCDP', 'CellName', 'Type of device', 'Village', 'Zip code', 'Street', 'Home Number Orientation', 'Region', 'KrajCode', 'District', 'OkresCode', 'Administrative District', 'Phone Provider', 'ProviderFax', 'Date Started Activities', 'Data Box Identifier', 'Email Provider', 'Web Provider', 'ProviderName', 'Ico', 'TypePersons', 'LegalFormCode', 'KrajCodeSidlo', 'KrajSidlo', 'OkresCodeSidlo', 'OkresSidlo', 'PscSidlo', 'ObecSidlo', 'StreetSidlo', 'NumberHomeOrientacniSidlo', 'OborPece', 'FormaPece', 'Type of furnace', 'Professional Representatives', 'GPS', 'LastModified']


*Update on 1/12/20*: the problem `error in result (AttributeError: 'NoneType' object has no attribute 'group')` re-emerged recently, as reported in [this ticket](https://github.com/ssut/py-googletrans/issues/234) with a possible [fix](https://github.com/ssut/py-googletrans/pull/237) (also discussed [here](https://stackoverflow.com/questions/52455774/googletrans-stopped-working-with-error-nonetype-object-has-no-attribute-group)). The issue has been solved with an updated version of the `googletrans` package:
```bash
pip install googletrans==4.0.0-rc1
```
Well, the first translation above did not go very well... The translation is made difficult by the formatting of the column names (merged words). To improve the translation output, we add a filter (splitting words on capital letters) on the text to be translated, namely: 

In [16]:
print("\033[4mPrior filtered column names translated in English:\033[0m")
print(cz.get_cols(olang = 'en', filt = text.TextProcess.split_at_upper, force = True))

Prior filtered column names translated in English:
['Medical Equipment Id', 'PCZ', 'PCDP', 'The name of the cell', 'Type of device', 'Village', 'Zip code', 'Street', 'Home Number Orientation', 'Region', 'County Code', 'District', 'District Code', 'Administrative District', 'Phone Provider', 'Fax Provider', 'Date Started Activities', 'Mailbox Identifier', 'Email Provider', 'Web Provider', 'Provider Name', 'Ico', 'Type of Person', 'Legal Form Code', 'Country Code Seat', 'County Seat', 'District Code Headquarters', 'District Sidlo', 'Psc Seat', 'The village of Sidlo', 'Street Sidlo', 'Home Orientation Number', 'Field Furnaces', 'Furnace Form', 'Furnace Type', 'Professional Representative', 'GPS', 'Last Modified']
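
To make the idea of the filter concrete, here is a minimal sketch of splitting merged words on capital letters. The helper `split_camel_case` is hypothetical and only stands in for `text.TextProcess.split_at_upper`, whose actual implementation is not shown in this notebook and may differ.

```python
import re

def split_camel_case(name):
    """Insert a space wherever a lowercase letter is directly followed by an uppercase one."""
    return re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', name)

# A few column names taken from the input table.
columns = ['ZdravotnickeZarizeniId', 'KrajCode', 'PCZ', 'GPS']
print([split_camel_case(c) for c in columns])
```

Acronyms such as `PCZ` or `GPS` are left untouched because the split only happens at a lowercase-to-uppercase boundary.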


Note that the translated column names are stored in the `cols` attribute of the instance:

In [17]:
print("\033[4mColumn names in original 'cs' language and available 'en' translation:\033[0m (%s)" 
      % type(cz.cols))
pprint(cz.cols)

Column names in original 'cs' language and available 'en' translation: (<class 'list'>)
[{'cs': 'ZdravotnickeZarizeniId', 'en': 'Medical Equipment Id'},
 {'cs': 'PCZ', 'en': 'PCZ'},
 {'cs': 'PCDP', 'en': 'PCDP'},
 {'cs': 'NazevCely', 'en': 'The name of the cell'},
 {'cs': 'DruhZarizeni', 'en': 'Type of device'},
 {'cs': 'Obec', 'en': 'Village'},
 {'cs': 'Psc', 'en': 'Zip code'},
 {'cs': 'Ulice', 'en': 'Street'},
 {'cs': 'CisloDomovniOrientacni', 'en': 'Home Number Orientation'},
 {'cs': 'Kraj', 'en': 'Region'},
 {'cs': 'KrajCode', 'en': 'County Code'},
 {'cs': 'Okres', 'en': 'District'},
 {'cs': 'OkresCode', 'en': 'District Code'},
 {'cs': 'SpravniObvod', 'en': 'Administrative District'},
 {'cs': 'PoskytovatelTelefon', 'en': 'Phone Provider'},
 {'cs': 'PoskytovatelFax', 'en': 'Fax Provider'},
 {'cs': 'DatumZahajeniCinnosti', 'en': 'Date Started Activities'},
 {'cs': 'IdentifikatorDatoveSchranky', 'en': 'Mailbox Identifier'},
 {'cs': 'PoskytovatelEmail', 'en': 'Email Provider'},

## Use of input metadata and output configuration template<a id="metadata"></a>

Once a facility instance has been created, the input metadata/configuration templates are necessary to inform the automated and simplified operational workflow, but also to support the reproducibility of the production processes for later reuse. Typically, they can be (and, actually, they are) reused for ingesting *HCS* data from other countries.

Note that the template configuration `config.FACMETADATA['HCS']` for the output harmonised data (which provides the generic formatting of *HCS* facilities and is used here for the country-specific *CZ* case) is also stored in the `config` attribute of the `cz` instance:

In [18]:
assert cz.config.to_dict() == config.FACMETADATA['HCS']

print("\033[4mFacility-generic configuration template:\033[0m (%s)" % type(cz.config))
pprint(cz.config.to_dict())

Facility-generic configuration template: (<class 'pyeufacility.config.MetaDatEUFacility'>)
{'category': {'code': 'hcs', 'name': 'Healthcare services'},
 'index': {'ER': {'desc': "Flag 'yes/no' for whether the healthcare site "
                          'provides emergency medical services',
                  'name': 'emergency',
                  'type': 'str',
                  'values': ['yes', 'no']},
           'PP': {'desc': "Status 'private/public' of the healthcare service",
                  'name': 'public_private',
                  'type': 'str',
                  'values': ['public', 'private']},
           'beds': {'desc': 'Measure of capacity by number of beds (most '
                            'common)',
                    'name': 'cap_beds',
                    'type': 'int',
                    'values': None},
           'cc': {'desc': 'Country code (ISO 3166-1 alpha-2 format)',
                  'name': 'cc',
                  'type': 'str',
               

In particular, `cz.config` lists the output template columns against which the columns of the `cz` input dataset can be matched, namely:

In [19]:
print("\033[4mList of output/template columns:\033[0m")
for (k,v) in cz.config['index'].items():
    print("- '\033[1m%s\033[0m' (shortcut: %s)" % (v['name'],k))

List of output/template columns:
- 'id' (shortcut: id)
- 'hospital_name' (shortcut: name)
- 'site_name' (shortcut: site)
- 'lat' (shortcut: lat)
- 'lon' (shortcut: lon)
- 'geo_qual' (shortcut: geo_qual)
- 'street' (shortcut: street)
- 'house_number' (shortcut: number)
- 'postcode' (shortcut: postcode)
- 'city' (shortcut: city)
- 'cc' (shortcut: cc)
- 'country' (shortcut: country)
- 'cap_beds' (shortcut: beds)
- 'cap_prac' (shortcut: prac)
- 'cap_rooms' (shortcut: rooms)
- 'emergency' (shortcut: ER)
- 'facility_type' (shortcut: type)
- 'public_private' (shortcut: PP)
- 'list_specs' (shortcut: specs)
- 'tel' (shortcut: tel)
- 'email' (shortcut: email)
- 'url' (shortcut: url)
- 'ref_date' (shortcut: refdate)
- 'pub_date' (shortcut: pubdate)
- 'comments' (shortcut: comments)


Similarly to the facility-generic metadata `cz.config`, it is possible to retrieve the country-specific metadata, *i.e.* all relevant information regarding the `cz` dataset. Actually, the `meta` attribute of the `cz` instance contains the currently available metadata (rather incomplete):

In [20]:
print("\033[4mTemplate metadata properties:\033[0m %s" % cz.meta.PROPERTIES)

print("\033[4mCountry-generic metadata:\033[0m (%s)" % type(cz.meta))
pprint(cz.meta.to_dict())

Template metadata properties: ['provider', 'country', 'lang', 'file', 'path', 'columns', 'index', 'options', 'category', 'date']
Country-generic metadata: (<class 'pyeufacility.config.MetaDatNatFacility'>)
{'country': {'code': 'CZ', 'name': 'Czechia'},
 'file': 'export-2020-02.csv',
 'lang': {'code': 'cs', 'name': 'czech'},
 'path': '/Users/gjacopo/DevOps/basic-services/data/healthcare/raw'}


To harmonise the dataset, we first identify potential one-to-one correspondences (*i.e.*, simple reassignment/renaming, simple formatting and type cast of columns, with no complex transformation required) between input and output fields. 
In practice, the attribute `idx` of the `cz` instance matches the names of the input fields (whatever the language, as stored in the `cz.cols` attribute) with the shortcut names of the desired `cz.config['index']` attributes (except for the `'lat', 'lon'` fields, see below).

For instance, an educated guess regarding the meaning of the translated columns suggests the following matching (subject to changes owing to the quality of the automated translation...):

In [21]:
cz.idx.update({'id':       'Medical Equipment Id', 
               'name':     'Medical Equipment Id', 
               'site':     'The name of the cell', 
               'lat':      'GPS', 
               'lon':      'GPS',
               'street':   'Street', 
               'number':   'Home Number Orientation', 
               'postcode': 'Zip code', 
               'city':     'Village', 
               'email':    'Email Provider',
               'pubdate':  'Last Modified'})

print("\033[4mIndexing dictionary:\033[0m (%s)" % type(cz.idx))
pprint(cz.idx)

Indexing dictionary: (<class 'dict'>)
{'city': 'Village',
 'email': 'Email Provider',
 'id': 'Medical Equipment Id',
 'lat': 'GPS',
 'lon': 'GPS',
 'name': 'Medical Equipment Id',
 'number': 'Home Number Orientation',
 'postcode': 'Zip code',
 'pubdate': 'Last Modified',
 'site': 'The name of the cell',
 'street': 'Street'}
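
To illustrate how such an indexing dictionary can be resolved, the sketch below chains the three structures shown so far (the translated column names, the indexing dictionary and the template index), reduced to two fields. It is only an illustration of the lookup and not the package's internal logic, which may differ.

```python
# Excerpts of the structures displayed above, reduced to two fields.
cols = [{'cs': 'CisloDomovniOrientacni', 'en': 'Home Number Orientation'},
        {'cs': 'Obec', 'en': 'Village'}]
idx = {'number': 'Home Number Orientation', 'city': 'Village'}
index = {'number': {'name': 'house_number'}, 'city': {'name': 'city'}}

# Translated name -> original column name in the input table.
en_to_cs = {c['en']: c['cs'] for c in cols}

# Original column name -> output column name, going through the shortcut key.
rename = {en_to_cs[en]: index[key]['name'] for key, en in idx.items()}
print(rename)
```

A dictionary of this shape is what a plain `pandas` rename (`DataFrame.rename(columns=rename)`) would expect.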


Note in the above that `'lat'` and `'lon'` are both "matched" to the `'GPS'` column (not translated) of the input dataset, since it actually concatenates the geographical coordinates in a single string. The `locate_data` method (see below) will help handle these specific fields:

In [22]:
for l in ['lat', 'lon']:
    print("- Coordinate \033[1m'%s'\033[0m is represented in the column \033[1m'%s'\033[0m of the input table" 
          % (l, cz.idx[l])) 

cz.data['GPS'].head(5)

- Coordinate 'lat' is represented in the column 'GPS' of the input table
- Coordinate 'lon' is represented in the column 'GPS' of the input table


0    49.427787867131 15.218289906425
1    50.497217046878 13.650265371038
2    50.073416715218 14.401090995315
3    50.126760922745 14.456956313382
4     49.183499865155 15.45981410537
Name: GPS, dtype: object
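
As a rough idea of what extracting the coordinates from such strings involves, here is a minimal sketch. The helper `split_gps` is hypothetical (it is not part of `pyeudatnat`), and it assumes the order "latitude, then longitude" separated by whitespace, as suggested by the values above.

```python
def split_gps(value):
    """Split a 'lat lon' string into a (lat, lon) pair of floats.

    Returns (None, None) when the value is missing or malformed.
    """
    try:
        lat, lon = str(value).split()
        return float(lat), float(lon)
    except ValueError:
        return None, None

print(split_gps('49.427787867131 15.218289906425'))
print(split_gps(float('nan')))
```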

In the following, we will use the metadata information to rerun the process from scratch (see the [metadata section](#metadata-approach)), instead of providing a description of the output formatted dataset. For that purpose, we store all current metadata key-value pairs in the dictionary variable `metadata` by passing the argument `keys = True` to the `get_meta` method. The argument `force = True` ensures that the `cz.options`, `cz.idx` and `cz.cols` attributes are "synchronised" with the corresponding `cz.meta['options']`, `cz.meta['index']` and `cz.meta['columns']` fields in the returned dictionary:

In [23]:
metadata = cz.get_meta(keys = True, force = True)

print("\033[4mFields of country-generic updated metadata:\033[0m")
pprint(metadata.keys())

Fields of country-generic updated metadata:
dict_keys(['country', 'path', 'file', 'lang', 'provider', 'options', 'category', 'index', 'date', 'columns'])


## Geocoding and formatting the input data<a id="geocode-format"></a>

Next, we run the geolocation and formatting operations.
* `locate_data` is used to retrieve the geographical `('lat','lon')` coordinates based on the input geolocation information available in `'GPS'`; because no argument is passed, the default/intrinsic parameters available through the properties of the `cz` dataset are considered: 

In [24]:
cz.locate_data()

* `format_data` is used to perform the input matching and output formatting of the different columns of the dataset; `keep` is set to `True` so that all fields listed in the template configuration are represented:

In [25]:
cz.format_data()

Check the output data, _e.g._ let's have a look at the column names and the content of the table:

In [26]:
assert (set(cz.data.columns)
        .difference(set([ind['name'] for ind in cz.config['index'].values()] + ['place']))
       ) == set()

print("\033[4mOutput data columns:\033[0m\n\033[1m%s\033[0m" % list(cz.data.columns))

cz.data.head(5)

Output data columns:
['id', 'hospital_name', 'site_name', 'lat', 'lon', 'geo_qual', 'street', 'house_number', 'postcode', 'city', 'cc', 'country', 'email', 'pub_date']


Unnamed: 0,id,hospital_name,site_name,lat,lon,geo_qual,street,house_number,postcode,city,cc,country,email,pub_date
0,138130,138130,Zubní studio V+V s.r.o.,49.427788,15.21829,1,Komenského,1465,39301.0,Pelhøimov,CZ,Czechia,,01-02-2020 23:03
1,138129,138129,TECTUM spol. s r.o.,50.497217,13.650265,1,Topolová,1234,43401.0,Most,CZ,Czechia,operator@labin.cz,01-02-2020 23:03
2,138128,138128,"MUDr. Milan Kuèera, s.r.o.",50.073417,14.401091,1,Kartouzská,3274/10,15000.0,Praha,CZ,Czechia,milankucera@seznam.cz,01-02-2020 23:03
3,138127,138127,Tereza Horáèková,50.126761,14.456956,1,Klapkova,154/46,18200.0,Praha,CZ,Czechia,,01-02-2020 23:03
4,138126,138126,Veobecný lékaø Prevent s.r.o.,49.1835,15.459814,1,Masarykova,330,58856.0,Telè,CZ,Czechia,,01-02-2020 23:03
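
The type cast applied during formatting is driven by the `'type'` entries of the template index. The sketch below shows one way such a cast could be expressed for a single record; the `CASTS` lookup table and the sample row are illustrative assumptions, not the implementation used by `format_data`.

```python
# Mapping from the type names used in the template to Python constructors
# (hypothetical helper table, for illustration only).
CASTS = {'str': str, 'int': int, 'float': float}

# Excerpt of the template index, as displayed earlier in this notebook.
index = {'id':   {'name': 'id', 'type': 'int'},
         'site': {'name': 'site_name', 'type': 'str'},
         'lat':  {'name': 'lat', 'type': 'float'}}

# One record whose values are still plain strings.
row = {'id': '138129', 'site_name': 'TECTUM spol. s r.o.', 'lat': '50.497217'}

# Cast every field according to the type declared in the template.
typed = {field['name']: CASTS[field['type']](row[field['name']])
         for field in index.values()}
print(typed)
```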


See in particular the `lat`/`lon` attributes that were retrieved:

In [27]:
cz.data[['lat', 'lon']].head(5)

Unnamed: 0,lat,lon
0,49.427788,15.21829
1,50.497217,13.650265
2,50.073417,14.401091
3,50.126761,14.456956
4,49.1835,15.459814


You can transform the data into a *GEOJSON* collection of features and retrieve the output geometry using the `dumps_data` method:

In [28]:
try:
    assert False
except AssertionError:
    print("Nothing saved - set to True to actually generate the output files")
else:
    cz.save_data(fmt='geojson', latlon = ['lat', 'lon'])

Nothing saved - set to True to actually generate the output files


Say you are interested in the first 10 facilities in the list; you can also use:

In [29]:
# columns = set(cz.data.columns).difference(set(['lat', 'lon']))
geom = io.Frame.to_geojson(cz.data.iloc[0:10,:], latlon = ['lat', 'lon'])
print("The first feature of the collection is:")
pprint(geom.get('features',[])[0])

The first feature of the collection is:
{'geometry': {'coordinates': [15.218289906425, 49.427787867131],
              'type': 'Point'},
 'properties': {'cc': 'CZ',
                'city': 'Pelhøimov',
                'country': 'Czechia',
                'email': 'nan',
                'geo_qual': 1,
                'hospital_name': '138130',
                'house_number': '1465',
                'id': 138130,
                'postcode': '39301.0',
                'pub_date': '01-02-2020 23:03',
                'site_name': 'Zubní studio V+V s.r.o.',
                'street': 'Komenského'},
 'type': 'Feature'}
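
Note that, in the feature above, the coordinates appear as `[lon, lat]`: GeoJSON positions are ordered longitude first. The sketch below builds an equivalent feature collection by hand from a flat record; the helper `to_feature` is hypothetical and is not meant to describe how `io.Frame.to_geojson` is implemented.

```python
import json

def to_feature(record, lat='lat', lon='lon'):
    """Build a GeoJSON Point feature from a flat record (dict)."""
    properties = {k: v for k, v in record.items() if k not in (lat, lon)}
    return {'type': 'Feature',
            'geometry': {'type': 'Point',
                         # GeoJSON positions are ordered (longitude, latitude)
                         'coordinates': [record[lon], record[lat]]},
            'properties': properties}

# One record using values from the formatted table.
records = [{'id': 138129, 'city': 'Most', 'lat': 50.497217, 'lon': 13.650265}]

collection = {'type': 'FeatureCollection',
              'features': [to_feature(r) for r in records]}
print(json.dumps(collection, indent=2))
```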


You can use the formatted data to represent the available information about the 10 records you selected on a map, using any of the [`folium`](https://python-visualization.github.io/folium/) or [`ipyleaflet`](https://github.com/jupyter-widgets/ipyleaflet) packages:

In [30]:
try:
    from folium import Map, GeoJson, features
except ImportError:
    from ipywidgets import HTML
    from ipyleaflet import Map, GeoJSON, WidgetControl

In [31]:
prague = [50.0755, 14.4378]
zoom = 7
try:
    m = Map(location = prague, zoom_start = zoom)
    GeoJson(json.dumps(geom), name = 'CZ hospital',
        tooltip=features.GeoJsonTooltip(fields=['site_name','city','postcode'], localize=True)).add_to(m)
except Exception: # folium not available: fall back on ipyleaflet
    m = Map(center = prague, zoom = zoom)
    geo_json = GeoJSON(data=geom)
    m.add_layer(geo_json)
    html = HTML("CZ hospital")
    m.add_control(WidgetControl(widget=html, position='bottomleft'))    
    def on_hover(feature, **kwargs):
        html.value = '''<h4><b>{}</b></h4> <br> {} {}'''.format(feature['properties']['site_name'], feature['properties']['city'], feature['properties']['postcode'])
    geo_json.on_hover(on_hover)
m

If, instead, you want to save the data to a file on disk, for instance in *CSV* or *GEOJSON* format, you can use the `save_data` method:

In [32]:
CZDATA = 'CZgeo-test.csv'

try:
    assert False
except AssertionError:
    print("Nothing saved in %s - set to True to create the output files" % CZDATA)
else:
    cz.save_data(CZDATA, fmt='csv')

Nothing saved in CZgeo-test.csv - set to True to create the output files


Likewise, you can save the metadata in a dedicated file on disk using the `save_meta` method:

In [33]:
CZMETA = 'CZmeta-test.json'

try:
    assert False
except AssertionError:
    print("Nothing saved in %s - set to True to create the output files" % CZMETA)
else:
    cz.save_meta(CZMETA, fmt='json', keys = True, keep = True, force = True) 
    # this can be run as well:
    # cz.update_meta()
    # cz.save_meta(CZMETA, fmt='json')    

Nothing saved in CZmeta-test.json - set to True to create the output files


By default, all the metadata information in `cz.meta` is updated ("synchronised") before it is saved with `save_meta`. To prevent the fields from being updated, pass the keyword argument `force = False` to `cz.save_meta`, so that a copy of the current (non-updated) `cz.meta` is saved.

## Adopting a metadata-based approach<a id="metadata-approach"></a>

Since the metadata contains all the information necessary to process the source dataset, it can be used _as is_ to drive the harmonisation of the data.

The metadata is taken from the previous sections or, if it is not available, reloaded from the file saved earlier:

In [34]:
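try:
    assert metadata not in ({},None)
except (NameError, AssertionError):
    # metadata undefined or empty: fall back on the file saved earlier
    try:
        with open(CZMETA, 'r') as fp:
            metadata = json.load(fp)
    except (OSError, ValueError):
        # file missing or not valid JSON
        print("Nothing loaded - metadata should be defined")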
# finally:
#    metadata = config.MetaDatNatFacility(metadata)

Then, as in the section above, a class `NewCZhcs` can be created dynamically and configured with this information:

In [35]:
NewCZhcs = config.facilityFactory(config.FACMETADATA['HCS'], meta = metadata )

Then, similarly, a new instance can be created, and that is all we need to process the data:

In [36]:
newcz = NewCZhcs()

assert newcz.config.to_dict() == cz.config.to_dict()
assert newcz.cols == cz.cols
assert newcz.options == cz.options
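
As a rough illustration of the idea of a class created dynamically from metadata, here is a minimal sketch in plain Python. It is not how `config.facilityFactory` is implemented: `facility_factory`, `DemoCZ` and the metadata values below are all hypothetical and only show the general technique of building a class at runtime with `type`.

```python
def facility_factory(name, meta=None):
    """Hypothetical factory: build a class whose instances start from `meta`."""
    defaults = dict(meta or {})

    def __init__(self, **kwargs):
        # each instance gets its own copy of the metadata, optionally overridden
        self.meta = {**defaults, **kwargs}

    # type(name, bases, namespace) creates a new class at runtime
    return type(name, (object,), {'__init__': __init__})

# illustrative metadata only
DemoCZ = facility_factory('DemoCZ', meta={'country': 'CZ', 'file': 'example.csv'})
a, b = DemoCZ(), DemoCZ()
```

Two instances created this way carry equal but independent metadata, which is why equality checks like the assertions above are a natural way to confirm that two configurations match.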

The `file` and `path` fields of the `newcz.meta` attribute identify the source, so the data can be loaded with the `load_data` method without any argument, since everything needed is provided by the metadata. Likewise, given the `index` field, the data can be formatted directly with the `format_data` method:

In [37]:
newcz.load_data()

# nothing: newcz.prepare_data()

newcz.locate_data() # no geocoding is operated

newcz.format_data()

Check the output dataset:

In [38]:
print("\033[4mOutput data columns:\033[0m\n\033[1m%s\033[0m" % list(newcz.data.columns))

newcz.data.head(5)

Output data columns:
['id', 'hospital_name', 'site_name', 'lat', 'lon', 'geo_qual', 'street', 'house_number', 'postcode', 'city', 'cc', 'country', 'email', 'pub_date']


| | id | hospital_name | site_name | lat | lon | geo_qual | street | house_number | postcode | city | cc | country | email | pub_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 138130 | 138130 | Zubní studio V+V s.r.o. | 49.427788 | 15.21829 | 1 | Komenského | 1465 | 39301.0 | Pelhřimov | CZ | Czechia | | 01-02-2020 23:03 |
| 1 | 138129 | 138129 | TECTUM spol. s r.o. | 50.497217 | 13.650265 | 1 | Topolová | 1234 | 43401.0 | Most | CZ | Czechia | operator@labin.cz | 01-02-2020 23:03 |
| 2 | 138128 | 138128 | MUDr. Milan Kučera, s.r.o. | 50.073417 | 14.401091 | 1 | Kartouzská | 3274/10 | 15000.0 | Praha | CZ | Czechia | milankucera@seznam.cz | 01-02-2020 23:03 |
| 3 | 138127 | 138127 | Tereza Horáčková | 50.126761 | 14.456956 | 1 | Klapkova | 154/46 | 18200.0 | Praha | CZ | Czechia | | 01-02-2020 23:03 |
| 4 | 138126 | 138126 | Všeobecný lékař Prevent s.r.o. | 49.1835 | 15.459814 | 1 | Masarykova | 330 | 58856.0 | Telč | CZ | Czechia | | 01-02-2020 23:03 |
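
To make the role of the `index` field more concrete, here is a minimal sketch of metadata-driven formatting in plain Python. It assumes, purely for illustration, that `index` maps each output column name to the corresponding input column name; the helper `format_records` and the input names (`src_id`, `src_name`, `src_city`) are hypothetical and do not come from the CZ source or from `pyeudatnat`.

```python
def format_records(records, index):
    """Hypothetical helper: keep and rename the input fields listed in `index`."""
    formatted = []
    for rec in records:
        # output name -> value of the matching input field (None when absent)
        formatted.append({out: rec.get(inp) for out, inp in index.items()})
    return formatted

# assumed layout: output column name -> input column name (both illustrative)
index = {'id': 'src_id', 'site_name': 'src_name', 'city': 'src_city'}
raw = [{'src_id': 1, 'src_name': 'Clinic A', 'src_city': 'Praha', 'unused': 'x'}]
out = format_records(raw, index)
```

Input fields that are not listed in the mapping are dropped, which is how a single metadata entry can determine both the selection and the naming of the output columns.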


## Automated metadata-based harmonisation<a id="harmonisation"></a>

Obviously, you will not have to rerun these operations every time a table is created.

To simplify the operations even further, the steps above, including the loading of the configuration and metadata files, have been automated. You can run them all at once using the `harmonise` module:

In [39]:
from pyeufacility import harmonise

cz = harmonise.run('HCS', country = "CZ", on_disk = True, dest = 'CZtest.csv')

! Country py-module 'pyeufacility.hcs.CZhcs' found !
! Generic formatting/harmonisation methods used !
! Harmonised data for country 'CZ' generated !


Check again for consistency:

In [40]:
cz.data.head(5)

| | id | hospital_name | site_name | lat | lon | geo_qual | street | house_number | postcode | city | ... | emergency | facility_type | public_private | list_specs | tel | email | url | ref_date | pub_date | comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 138130 | 138130 | Zubní studio V+V s.r.o. | 49.427788 | 15.21829 | 1 | Komenského | 1465 | 39301.0 | Pelhřimov | ... | | | | | | | | | 01-02-2020 23:03 | |
| 1 | 138129 | 138129 | TECTUM spol. s r.o. | 50.497217 | 13.650265 | 1 | Topolová | 1234 | 43401.0 | Most | ... | | | | | | operator@labin.cz | | | 01-02-2020 23:03 | |
| 2 | 138128 | 138128 | MUDr. Milan Kučera, s.r.o. | 50.073417 | 14.401091 | 1 | Kartouzská | 3274/10 | 15000.0 | Praha | ... | | | | | | milankucera@seznam.cz | | | 01-02-2020 23:03 | |
| 3 | 138127 | 138127 | Tereza Horáčková | 50.126761 | 14.456956 | 1 | Klapkova | 154/46 | 18200.0 | Praha | ... | | | | | | | | | 01-02-2020 23:03 | |
| 4 | 138126 | 138126 | Všeobecný lékař Prevent s.r.o. | 49.1835 | 15.459814 | 1 | Masarykova | 330 | 58856.0 | Telč | ... | | | | | | | | | 01-02-2020 23:03 | |


## Automated metadata-based validation<a id="validation"></a>

Finally, you will want to validate the output data against the information you provided in the configuration template (_e.g._, through the field `column`).

Validation is also an automated process:

In [41]:
from pyeufacility import validate

validate.run('HCS', country = "CZ", src = 'CZtest.csv')

! Column 'tel' empty - missing values only !
! Column 'facility_type' empty - missing values only !
! Column 'emergency' empty - missing values only !
! Column 'cap_prac' empty - missing values only !
! Column 'ref_date' empty - missing values only !
! Column 'cap_rooms' empty - missing values only !
! Column 'cap_beds' empty - missing values only !
! Column 'public_private' empty - missing values only !
! Column 'list_specs' empty - missing values only !
! Column 'comments' empty - missing values only !
! Column 'url' empty - missing values only !
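
The warnings above flag columns that contain missing values only. A check of this kind can be sketched with the standard library alone. The helper `empty_columns` and the small in-memory table below are hypothetical and only illustrate the idea; they say nothing about how `validate.run` actually works.

```python
import csv
from io import StringIO

def empty_columns(rows, fieldnames):
    """Hypothetical check: list the columns that hold no value in any row."""
    return [name for name in fieldnames
            if all((row.get(name) or '').strip() == '' for row in rows)]

# small in-memory CSV standing in for a harmonised output file
text = "id,site_name,tel,url\n1,Clinic A,,\n2,Clinic B,,\n"
reader = csv.DictReader(StringIO(text))
rows = list(reader)
for name in empty_columns(rows, reader.fieldnames):
    print("! Column '%s' empty - missing values only !" % name)
```

In this toy table, `tel` and `url` are reported, since no row provides a value for them.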


