## Input/Output data integration and harmonisation -- Basic example using the `pyeuhcs` and `pyeudatnat` packages

### Setting and checking the environment first

Let's run the setup needed to import the dependencies required by the project.
First, import the [`pyeudatnat`](https://github.com/eurostat/pyEUDatNat) package, which contains various useful functions and methods:

In [1]:
try:
    import pyeudatnat
except ImportError:
    raise ImportError("sorry, you're doomed, you won't be able to run this notebook...")

Then, load the [`pyeuhcs`](https://github.com/eurostat/healthcare-services/tree/master/src/geo_py) module available with the [`'healthcare-services'`](https://github.com/eurostat/healthcare-services/) project.
The following cell is needed only if `pyeuhcs` is not installed (_e.g._ from PyPI); it works only if a local copy of the project is available on your machine.

In [2]:
import os, sys
try:
    thisdir = !pwd
    PROJECT, PACKAGE = 'healthcare-services', 'pyeuhcs'
    PACKNAME = PACKAGE.lower()
    assert '%s' % PACKNAME in [mod.__name__ for mod in sys.modules.values()]
except AssertionError:
    # note: this notebook will need to be run from the project directory, otherwise...
    try:
        pardir = os.path.abspath(os.path.join(thisdir[0], '../'))
        assert PACKAGE in os.listdir(pardir)
    except:
        raise IOError("sorry, you're doomed again...")
    else:
        PACKPATH = pardir
    sys.path.insert(0,PACKPATH)
else:
    print('package %s available to run project %s' % (PACKNAME, PROJECT))
    PACKPATH = getattr(sys.modules[PACKNAME], 'PACKPATH')
    assert PACKPATH == getattr(sys.modules[PACKNAME], '__path__')[0]
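
The cell above relies on the IPython `!pwd` shell escape, so it only runs inside a notebook. As a rough sketch of the same idea in plain Python, the hypothetical helper `ensure_importable` below (not part of `pyeuhcs`) checks whether a package can be found and, if not, looks for it in a given directory and adds that directory to `sys.path`:

```python
import importlib.util
import os
import sys


def ensure_importable(package, search_dir):
    """Return True if `package` can be imported, adding `search_dir` to
    sys.path when the package is only found there (hypothetical helper)."""
    if importlib.util.find_spec(package) is not None:
        return True  # already installed or already on sys.path
    candidate = os.path.join(search_dir, package)
    if os.path.isdir(candidate):
        sys.path.insert(0, search_dir)
        importlib.invalidate_caches()
        return importlib.util.find_spec(package) is not None
    return False


# Parent of the current working directory, as in the cell above.
pardir = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
print(ensure_importable("pyeuhcs", pardir))
```

The helper returns a boolean instead of raising, so the caller decides how to react when the package is missing.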

Let's import the whole package (as a test) and check some basic global parameters/metadata already made available there...

In [3]:
import pyeuhcs
print("\033[1mCountries\033[0m considered for harmonisation: %s" % list(pyeudatnat.COUNTRIES))
print("\033[1mFacilities\033[0m available for harmonisation: %s" % pyeuhcs.FACILITIES)

Countries considered for harmonisation: ['BE', 'EL', 'LT', 'PT', 'BG', 'ES', 'LU', 'RO', 'CZ', 'FR', 'HU', 'SI', 'DK', 'HR', 'MT', 'SK', 'DE', 'IT', 'NL', 'FI', 'EE', 'CY', 'AT', 'SE', 'IE', 'LV', 'PL', 'UK', 'IS', 'NO', 'CH', 'LI']
Facilities available for harmonisation: {'HCS': {'code': 'hcs', 'name': 'Healthcare services'}, 'Edu': {'code': 'edu', 'name': 'Educational facilities'}}


Next, we import the `config` module, and check also some of the metadata made available through the `pyeuhcs` module. 

In particular, the format of the output harmonised dataset, such as the output columns (name and type), the output encoding format, the separator for output CSV, _etc_..., are defined for any given type of facility, _e.g._ HealthCare Services (*HCS*), separately in the global dictionary `config.CONFIGINFO`:

In [4]:
from pyeuhcs import config
print("Formatting of output data is available for 'HCS': %s" % config.CONFIGINFO['HCS']['category'])
print("Fields of the configuration template: %s" % list(config.CONFIGINFO['HCS'].keys()))

! Missing happygisco package (https://github.com/eurostat/happyGISCO) - GISCO web services not available !


Formatting of output data is available for 'HCS': {'code': 'hcs', 'name': 'Healthcare services'}
Fields of the configuration template: ['fmt', 'lang', 'sep', 'enc', 'dfmt', 'proj', 'path', 'file', 'index', 'category']


Using this global variable, the specific formatting for *HCS* can be retrieved from `config.CONFIGINFO['HCS']`:

In [5]:
print("Output data contain attributes defined according to following configuration template:\n")
config.CONFIGINFO['HCS']

Output data contain attributes defined according to following configuration template:



{'fmt': {'geojson': 'geojson', 'json': 'json', 'csv': 'csv', 'gpkg': 'gpkg'},
 'lang': 'en',
 'sep': ',',
 'enc': 'utf-8',
 'dfmt': '%d/%m/%Y',
 'proj': None,
 'path': '../../../data/',
 'file': '%s.%s',
 'index': OrderedDict([('id',
               {'name': 'id',
                'desc': 'The healthcare service identifier - This identifier is based on national identification codes, if it exists.',
                'type': 'int',
                'values': None}),
              ('name',
               {'name': 'hospital_name',
                'desc': 'The name of the healthcare institution',
                'type': 'str',
                'values': None}),
              ('site',
               {'name': 'site_name',
                'desc': 'The name of the specific site or branch of a healthcare institution',
                'type': 'str',
                'values': None}),
              ('lat',
               {'name': 'lat',
                'desc': 'Latitude (WGS 84)',
                'type'

This provides, for instance, the definitions of the desired output fields:

In [6]:
print("Output attribute \033[1m'%s'\033[0m defined as: %s" % 
      (config.CONFIGINFO['HCS']['index']['site']['name'], config.CONFIGINFO['HCS']['index']['site']))
print("Output attribute \033[1m'%s'\033[0m defined as: %s" % 
      (config.CONFIGINFO['HCS']['index']['lat']['name'], config.CONFIGINFO['HCS']['index']['lat']))

Output attribute [1m'site_name'[0m defined as: {'name': 'site_name', 'desc': 'The name of the specific site or branch of a healthcare institution', 'type': 'str', 'values': None}
Output attribute [1m'lat'[0m defined as: {'name': 'lat', 'desc': 'Latitude (WGS 84)', 'type': 'float', 'values': None}


Note that the parameters of any given facility, _e.g._ *HCS*, as defined in `config.CONFIGINFO` can actually be (re)set through an external configuration JSON file, _e.g._ *hcs.json*:

In [7]:
import json
cfg = os.path.join(PACKPATH, PACKNAME, 'hcs.json')
with open(cfg, 'r') as fp:
    metadata = json.load(fp)
print("Configuration template is parsed through '%s' file:\n" % cfg)
metadata

Configuration template is parsed through '/Users/gjacopo/Developments/healthcare-services/src/geo_py/pyeuhcs/hcs.json' file:



{'fmt': {'geojson': 'geojson', 'json': 'json', 'csv': 'csv', 'gpkg': 'gpkg'},
 'lang': 'en',
 'sep': ',',
 'enc': 'utf-8',
 'dfmt': '%d/%m/%Y',
 'proj': None,
 'path': '../../../data/',
 'file': '%s.%s',
 'index': {'id': {'name': 'id',
   'desc': 'The healthcare service identifier - This identifier is based on national identification codes, if it exists.',
   'type': 'int',
   'values': None},
  'name': {'name': 'hospital_name',
   'desc': 'The name of the healthcare institution',
   'type': 'str',
   'values': None},
  'site': {'name': 'site_name',
   'desc': 'The name of the specific site or branch of a healthcare institution',
   'type': 'str',
   'values': None},
  'lat': {'name': 'lat',
   'desc': 'Latitude (WGS 84)',
   'type': 'float',
   'values': None},
  'lon': {'name': 'lon',
   'desc': 'Longitude (WGS 84)',
   'type': 'float',
   'values': None},
  'geo_qual': {'name': 'geo_qual',
   'desc': 'A quality indicator for the geolocation - 1: Good, 2: Medium, 3: Low, -1: Unknown'
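
How `pyeuhcs` applies such an external file internally is not shown here. As a minimal sketch of the general idea only, assuming a simple top-level override (the helper `override_config` and the `DEFAULTS` excerpt are hypothetical), defaults can be copied and then updated with whatever keys the JSON provides:

```python
import copy
import json

# Defaults mirroring a few fields of the template shown above.
DEFAULTS = {"lang": "en", "sep": ",", "enc": "utf-8", "dfmt": "%d/%m/%Y"}


def override_config(defaults, json_text):
    """Return a copy of `defaults` updated with the keys found in `json_text`."""
    cfg = copy.deepcopy(defaults)
    cfg.update(json.loads(json_text))
    return cfg


custom = override_config(DEFAULTS, '{"sep": ";", "enc": "latin1"}')
print(custom)
```

Working on a deep copy keeps the defaults untouched, so several facilities could start from the same template.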

### Loading and exploring the input table

We require first to import some basic utilities: 

In [8]:
from pyeuhcs.config import facilityFactory

Let's consider a simple example: in the original *CZ* table, the lat/lon geographical coordinates are already available, encoded in a single column of the input data. "Integrating" these data amounts to extracting the coordinates, reshuffling some of the columns and dumping the result into a new table.

To do so, it will be necessary to retrieve the structure (in terms of columns and data format) of the original table and "match" it to the desired output structure, as defined in the previously mentioned `config.CONFIGINFO` parameter. 

Let's first define the input structure using the class constructor `facilityFactory`:

In [9]:
CZhcs = facilityFactory(facility='HCS', country = 'CZ')
print(CZhcs.COUNTRY)
print(CZhcs.CATEGORY)

{'CZ': 'Czechia'}
{'code': 'hcs', 'name': 'Healthcare services'}


and define the *CZ* data as an instance of the previously dynamically defined class `CZhcs`, with specific details regarding the data, such as the location of the original table:

In [10]:
cz = CZhcs(file = 'export-2020-02.csv', path = '../../../data/raw/', lang = 'cs')
print("- source file: %s" % cz.file)
print("- country code: %s" % cz.cc)
print("- language used in the original table: %s" % cz.lang)

- source file: ../../../data/raw/export-2020-02.csv
- country code: CZ
- language used in the original table: cs


Note also that the output template configuration is available through the `config` attribute of the newly created `cz` instance:

In [11]:
assert cz.config.to_dict() == config.CONFIGINFO['HCS']

Overall, this is enough to load the data using the `load_data` method which stores the input data as a [`pandas`](https://pandas.pydata.org) dataframe structure in the `data` attribute of the `cz` instance:

In [12]:
cz.load_data()
cz.data.head(5)

  return pd.read_csv(s, **nkw)


Unnamed: 0,ZdravotnickeZarizeniId,PCZ,PCDP,NazevCely,DruhZarizeni,Obec,Psc,Ulice,CisloDomovniOrientacni,Kraj,...,PscSidlo,ObecSidlo,UliceSidlo,CisloDomovniOrientacniSidlo,OborPece,FormaPece,DruhPece,OdbornyZastupce,GPS,LastModified
0,138130,0,1,Zubn� studio V+V s.r.o.,Samostatn� ordinace PL - stomatologa,Pelh�imov,39301,Komensk�ho,1465,Kraj Vyso�ina,...,39301.0,Pelh�imov,Pod N�spem,641,"zubn� l�ka�stv�, Dent�ln� hygienistka","prim�rn� ambulantn� p��e, specializovan� ambul...",,"Veronika �kodov�, V�t �koda",49.427787867131 15.218289906425,01-02-2020 23:03
1,138129,0,4,TECTUM spol. s r.o.,Odb�rov� m�stnost,Most,43401,Topolov�,1234,�steck� kraj,...,36001.0,Karlovy Vary,Bezru�ova,1098/10,klinick� biochemie,specializovan� ambulantn� p��e,,"David Hepnar, JITKA PODROU�KOV�",50.497217046878 13.650265371038,01-02-2020 23:03
2,138128,1,0,"MUDr. Milan Ku�era, s.r.o.",Samostatn� ordinace l�ka�e specialisty,Praha,15000,Kartouzsk�,3274/10,Hlavn� m�sto Praha,...,27201.0,Kladno,T. G. Masaryka,2104,dermatovenerologie,ambulantn� p��e,,Jan Ku�era,50.073416715218 14.401090995315,01-02-2020 23:03
3,138127,1,0,Tereza Hor��kov�,V�dejna zdravotnick�ch prost�edk�,Praha,18200,Klapkova,154/46,Hlavn� m�sto Praha,...,,,,,praktick� l�k�renstv�,,l�k�rensk� p��e,Petra Holasov�,50.126760922745 14.456956313382,01-02-2020 23:03
4,138126,0,0,V�eobecn� l�ka� Prevent s.r.o.,Samost. ordinace v�eob. prakt. l�ka�e,Tel�,58856,Masarykova,330,Kraj Vyso�ina,...,28163.0,Kozojedy,1. m�je,67,"v�eobecn� praktick� l�ka�stv�, v�eobecn� prakt...","prim�rn� ambulantn� p��e, zdrav. p��e poskytov...",,ARNO�T STAN�K,49.183499865155 15.45981410537,01-02-2020 23:03


Check the input data (and the warning as well), _e.g._ by looking at the column names and the content of the table: because of the character encoding, we get a warning and garbled characters. Taking this into account, let's refine the load operation by updating the parameters of the `load_data` method:

In [13]:
cz.load_data(enc = 'latin1', sep = ';')
cz.data.head(5)

Unnamed: 0,ZdravotnickeZarizeniId,PCZ,PCDP,NazevCely,DruhZarizeni,Obec,Psc,Ulice,CisloDomovniOrientacni,Kraj,...,PscSidlo,ObecSidlo,UliceSidlo,CisloDomovniOrientacniSidlo,OborPece,FormaPece,DruhPece,OdbornyZastupce,GPS,LastModified
0,138130,0,1,Zubní studio V+V s.r.o.,Samostatná ordinace PL - stomatologa,Pelhøimov,39301,Komenského,1465,Kraj Vysoèina,...,39301.0,Pelhøimov,Pod Náspem,641,"zubní lékaøství, Dentální hygienistka","primární ambulantní péèe, specializovaná ambul...",,"Veronika kodová, Vít koda",49.427787867131 15.218289906425,01-02-2020 23:03
1,138129,0,4,TECTUM spol. s r.o.,Odbìrová místnost,Most,43401,Topolová,1234,Ústecký kraj,...,36001.0,Karlovy Vary,Bezruèova,1098/10,klinická biochemie,specializovaná ambulantní péèe,,"David Hepnar, JITKA PODROUKOVÁ",50.497217046878 13.650265371038,01-02-2020 23:03
2,138128,1,0,"MUDr. Milan Kuèera, s.r.o.",Samostatná ordinace lékaøe specialisty,Praha,15000,Kartouzská,3274/10,Hlavní mìsto Praha,...,27201.0,Kladno,T. G. Masaryka,2104,dermatovenerologie,ambulantní péèe,,Jan Kuèera,50.073416715218 14.401090995315,01-02-2020 23:03
3,138127,1,0,Tereza Horáèková,Výdejna zdravotnických prostøedkù,Praha,18200,Klapkova,154/46,Hlavní mìsto Praha,...,,,,,praktické lékárenství,,lékárenská péèe,Petra Holasová,50.126760922745 14.456956313382,01-02-2020 23:03
4,138126,0,0,Veobecný lékaø Prevent s.r.o.,Samost. ordinace veob. prakt. lékaøe,Telè,58856,Masarykova,330,Kraj Vysoèina,...,28163.0,Kozojedy,1. máje,67,"veobecné praktické lékaøství, veobecné prakt...","primární ambulantní péèe, zdrav. péèe poskytov...",,ARNOT STANÌK,49.183499865155 15.45981410537,01-02-2020 23:03


More options are obviously available, as allowed by the `pandas` methods, for instance [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), [`read_excel`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html) or [`read_table`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_table.html), depending on the format of the input data you are dealing with. 
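
The table loaded with `enc = 'latin1'` still shows characters such as `ø` and `è` where Czech text would normally have `ř` and `č`. This pattern is what one would expect if the file were actually stored in a Central European encoding such as `cp1250`; that is an inference from the displayed characters, not something stated by the data provider. The sketch below reproduces the effect with the standard codecs (the helper `misread_as_latin1` is hypothetical):

```python
def misread_as_latin1(text, actual="cp1250"):
    """Encode `text` with the assumed file encoding, then decode it as latin1."""
    return text.encode(actual).decode("latin1")


# Czech spellings as they would be expected to read.
for word in ["Pelhřimov", "Kraj Vysočina"]:
    print(word, "->", misread_as_latin1(word))
```

If the assumption holds, passing `'cp1250'` instead of `'latin1'` as the encoding may be worth trying when loading the table.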

Now, basic information regarding the input data can be retrieved through the `data` attribute of the `cz` instance which is a common `pandas` dataframe. For instance, if we are interested in the list of columns:

In [14]:
cz.data.columns

Index(['ZdravotnickeZarizeniId', 'PCZ', 'PCDP', 'NazevCely', 'DruhZarizeni',
       'Obec', 'Psc', 'Ulice', 'CisloDomovniOrientacni', 'Kraj', 'KrajCode',
       'Okres', 'OkresCode', 'SpravniObvod', 'PoskytovatelTelefon',
       'PoskytovatelFax', 'DatumZahajeniCinnosti',
       'IdentifikatorDatoveSchranky', 'PoskytovatelEmail', 'PoskytovatelWeb',
       'PoskytovatelNazev', 'Ico', 'TypOsoby', 'PravniFormaKod',
       'KrajCodeSidlo', 'KrajSidlo', 'OkresCodeSidlo', 'OkresSidlo',
       'PscSidlo', 'ObecSidlo', 'UliceSidlo', 'CisloDomovniOrientacniSidlo',
       'OborPece', 'FormaPece', 'DruhPece', 'OdbornyZastupce', 'GPS',
       'LastModified'],
      dtype='object')

To make it easier to identify the appropriate columns in the input table, we can (approximately) translate the column names from _'cs'_ to _'en'_ using the [`googletrans`](https://pypi.org/project/googletrans/) package:

In [15]:
cz.get_column(olang = 'en')

['Medical devices Id',
 'PCZ',
 'PCDP',
 'NazevCely',
 'Type of device',
 'Village',
 'Zip code',
 'Street',
 'House numbers indicative',
 'Region',
 'KrajCode',
 'District',
 'OkresCode',
 'Administrative District',
 'PoskytovatelTelefon',
 'PoskytovatelFax',
 'Launch Date',
 'Identifier data boxes',
 'PoskytovatelEmail',
 'PoskytovatelWeb',
 'PoskytovatelNazev',
 'ico',
 'TypOsoby',
 'PravniFormaKod',
 'KrajCodeSidlo',
 'KrajSidlo',
 'OkresCodeSidlo',
 'OkresSidlo',
 'PscSidlo',
 'ObecSidlo',
 'UliceSidlo',
 'House numbers indicative Cidlo',
 'OborPece',
 'FormaPece',
 'DruhPece',
 'OdbornyZastupce',
 'GPS',
 'LastModified']

In [16]:
cz.columns

[{'cs': 'ZdravotnickeZarizeniId', 'en': 'Medical devices Id'},
 {'cs': 'PCZ', 'en': 'PCZ'},
 {'cs': 'PCDP', 'en': 'PCDP'},
 {'cs': 'NazevCely', 'en': 'NazevCely'},
 {'cs': 'DruhZarizeni', 'en': 'Type of device'},
 {'cs': 'Obec', 'en': 'Village'},
 {'cs': 'Psc', 'en': 'Zip code'},
 {'cs': 'Ulice', 'en': 'Street'},
 {'cs': 'CisloDomovniOrientacni', 'en': 'House numbers indicative'},
 {'cs': 'Kraj', 'en': 'Region'},
 {'cs': 'KrajCode', 'en': 'KrajCode'},
 {'cs': 'Okres', 'en': 'District'},
 {'cs': 'OkresCode', 'en': 'OkresCode'},
 {'cs': 'SpravniObvod', 'en': 'Administrative District'},
 {'cs': 'PoskytovatelTelefon', 'en': 'PoskytovatelTelefon'},
 {'cs': 'PoskytovatelFax', 'en': 'PoskytovatelFax'},
 {'cs': 'DatumZahajeniCinnosti', 'en': 'Launch Date'},
 {'cs': 'IdentifikatorDatoveSchranky', 'en': 'Identifier data boxes'},
 {'cs': 'PoskytovatelEmail', 'en': 'PoskytovatelEmail'},
 {'cs': 'PoskytovatelWeb', 'en': 'PoskytovatelWeb'},
 {'cs': 'PoskytovatelNazev', 'en': 'PoskytovatelNazev'},
 {

Well, this did not go very well... The translation is actually made difficult because of the formatting of the column names (merged words). To improve the translation output, we add a filter (split of words on capital letters) on the text to be translated, namely: 

In [17]:
from pyeudatnat.text import TextProcess
cz.get_column(olang = 'en', filt = TextProcess.split_at_upper)

['Medical devices Id',
 'PCZ',
 'PCDP',
 'Title Cely',
 'Type of device',
 'Village',
 'Zip code',
 'Street',
 'House numbers indicative',
 'Region',
 'Region Code',
 'District',
 'District Code',
 'Administrative District',
 'Phone provider',
 'Fax provider',
 'Launch Date',
 'Identifier data boxes',
 'Email provider',
 'Web provider',
 'Title provider',
 'ico',
 'type Persons',
 'Legal form Kod',
 'County Seat of Code',
 'region of the seat',
 'Seat of District Code',
 'Seat of district',
 'Seat of psc',
 'Seat of the municipality',
 'street Seat',
 'Indicative Seat of House numbers',
 'Furnace Branch',
 'form Furnaces',
 'Type Furnaces',
 'professional representative',
 'GPS',
 'last Modified']
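
The filtering idea can be illustrated without the package. The sketch below is not the implementation of `TextProcess.split_at_upper`; it is a hypothetical stand-in (`split_camel_case`) that inserts a space wherever a lowercase letter is followed by an uppercase one, which leaves acronyms such as `GPS` untouched:

```python
import re


def split_camel_case(name):
    """Insert a space wherever a lowercase letter is followed by an uppercase one."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)


for col in ["ZdravotnickeZarizeniId", "PoskytovatelTelefon", "PCZ", "GPS"]:
    print(col, "->", split_camel_case(col))
```

Splitting the merged words first gives the translation service separate tokens to work with, which is why more column names get translated in the output above.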

Note that this information is stored in the `columns` attribute of the instance:

In [18]:
cz.columns

[{'cs': 'ZdravotnickeZarizeniId', 'en': 'Medical devices Id'},
 {'cs': 'PCZ', 'en': 'PCZ'},
 {'cs': 'PCDP', 'en': 'PCDP'},
 {'cs': 'NazevCely', 'en': 'Title Cely'},
 {'cs': 'DruhZarizeni', 'en': 'Type of device'},
 {'cs': 'Obec', 'en': 'Village'},
 {'cs': 'Psc', 'en': 'Zip code'},
 {'cs': 'Ulice', 'en': 'Street'},
 {'cs': 'CisloDomovniOrientacni', 'en': 'House numbers indicative'},
 {'cs': 'Kraj', 'en': 'Region'},
 {'cs': 'KrajCode', 'en': 'Region Code'},
 {'cs': 'Okres', 'en': 'District'},
 {'cs': 'OkresCode', 'en': 'District Code'},
 {'cs': 'SpravniObvod', 'en': 'Administrative District'},
 {'cs': 'PoskytovatelTelefon', 'en': 'Phone provider'},
 {'cs': 'PoskytovatelFax', 'en': 'Fax provider'},
 {'cs': 'DatumZahajeniCinnosti', 'en': 'Launch Date'},
 {'cs': 'IdentifikatorDatoveSchranky', 'en': 'Identifier data boxes'},
 {'cs': 'PoskytovatelEmail', 'en': 'Email provider'},
 {'cs': 'PoskytovatelWeb', 'en': 'Web provider'},
 {'cs': 'PoskytovatelNazev', 'en': 'Title provider'},
 {'cs': 'Ic

Now we are interested in retrieving additional information so that we can match the input columns with the output structure, as listed in the configuration template, namely:

In [19]:
print("List of output fields:")
for (k,v) in cz.config['index'].items():
    print("- '%s' (shortcut: %s)" % (v['name'],k))

List of output fields:
- 'id' (shortcut: id)
- 'hospital_name' (shortcut: name)
- 'site_name' (shortcut: site)
- 'lat' (shortcut: lat)
- 'lon' (shortcut: lon)
- 'geo_qual' (shortcut: geo_qual)
- 'street' (shortcut: street)
- 'house_number' (shortcut: number)
- 'postcode' (shortcut: postcode)
- 'city' (shortcut: city)
- 'cc' (shortcut: cc)
- 'country' (shortcut: country)
- 'cap_beds' (shortcut: beds)
- 'cap_prac' (shortcut: prac)
- 'cap_rooms' (shortcut: rooms)
- 'emergency' (shortcut: ER)
- 'facility_type' (shortcut: type)
- 'public_private' (shortcut: PP)
- 'list_specs' (shortcut: specs)
- 'tel' (shortcut: tel)
- 'email' (shortcut: email)
- 'url' (shortcut: url)
- 'ref_date' (shortcut: refdate)
- 'pub_date' (shortcut: pubdate)


In practice, we will match the names of the input columns, as stored in the `columns` attribute, with the shortcut names from `cz.config['index']` listed above. For instance, we identify the following correspondence (subject to changes...):

In [20]:
matched = {'id':'Medical devices Id', 'name':'Medical devices Id', 'site':'Title Cely', 
            'lat':'GPS', 'lon':'GPS', 
            'street':'Street', 'number':'House numbers indicative', 'postcode':'Zip code', 'city':'Village', 
            'email':'Email provider', 
            'pubdate':'LastModified'}
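
Conceptually, the matching is a lookup chain: shortcut, then translated column name, then original column name. The sketch below illustrates that chain on excerpts copied from the outputs above; the helper `resolve` is hypothetical and only shows the idea, not how `pyeuhcs` performs the matching:

```python
# Excerpts copied from the outputs above.
columns = [
    {"cs": "ZdravotnickeZarizeniId", "en": "Medical devices Id"},
    {"cs": "NazevCely", "en": "Title Cely"},
    {"cs": "Ulice", "en": "Street"},
    {"cs": "GPS", "en": "GPS"},
]
output_names = {"id": "id", "site": "site_name", "street": "street",
                "lat": "lat", "lon": "lon"}
matched = {"id": "Medical devices Id", "site": "Title Cely",
           "street": "Street", "lat": "GPS", "lon": "GPS"}


def resolve(matched, columns, output_names, lang="cs"):
    """Map each output field name to the original input column (hypothetical helper)."""
    en_to_orig = {col["en"]: col[lang] for col in columns}
    return {output_names[key]: en_to_orig[en] for key, en in matched.items()}


print(resolve(matched, columns, output_names))
```

Note that two output fields (`lat` and `lon`) can point to the same input column, so the mapping is keyed by output field rather than by input column.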

All we need to do to operate the matching/extraction is to load the `matched` information into the `index` attribute of the `cz` instance. It therefore contains the direct matching between columns that do not require complex transformation (simple reassignment + format cast): 

In [21]:
cz.index.update(matched)
for l in ['lat', 'lon']:
    print("- Coordinate \033[1m'%s'\033[0m is represented in the column \033[1m'%s'\033[0m of the input table" 
          % (l, cz.index[l])) 
#print("Other matching variables are listed as:")
#{k:v for (k,v) in cz.index.items() if v is not None}

- Coordinate [1m'lat'[0m is represented in the column [1m'GPS'[0m of the input table
- Coordinate [1m'lon'[0m is represented in the column [1m'GPS'[0m of the input table


Indeed, both `lat` and `lon` geographical coordinates are available through the `GPS` column (being an acronym, its name was not translated), which represents both coordinates as a single string:

In [22]:
cz.data['GPS'].head(5)

0    49.427787867131 15.218289906425
1    50.497217046878 13.650265371038
2    50.073416715218 14.401090995315
3    50.126760922745 14.456956313382
4     49.183499865155 15.45981410537
Name: GPS, dtype: object
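
Each value holds two numbers separated by a space, with the latitude first (consistent with the `lat`/`lon` values shown later for the same records). A minimal sketch of the extraction, using a hypothetical helper `parse_gps` rather than the package's own code:

```python
def parse_gps(value):
    """Split a 'lat lon' string into a (lat, lon) tuple of floats."""
    lat, lon = value.split()
    return float(lat), float(lon)


samples = ["49.427787867131 15.218289906425", "50.497217046878 13.650265371038"]
for s in samples:
    print(parse_gps(s))
```

The tuple unpacking raises a `ValueError` when a value does not contain exactly two parts, which makes malformed entries easy to spot.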

Finally, we can run the `format_data` method (again without arguments) to actually perform the matching of the different columns:

In [23]:
cz.format_data()

Check the output data, _e.g._ let's have a look at the column names and the content of the table:

In [24]:
assert set(cz.data.columns).difference(set([ind['name'] for ind in cz.config['index'].values()])) == set()
print("output data columns: %s" % list(cz.data.columns))
cz.data.head(5)

output data columns: ['id', 'site_name', 'city', 'postcode', 'street', 'house_number', 'email', 'pub_date', 'lat', 'lon', 'geo_qual', 'hospital_name']


Unnamed: 0,id,site_name,city,postcode,street,house_number,email,pub_date,lat,lon,geo_qual,hospital_name
0,138130,Zubní studio V+V s.r.o.,Pelhøimov,39301,Komenského,1465,,01-02-2020 23:03,49.427788,15.21829,1,138130
1,138129,TECTUM spol. s r.o.,Most,43401,Topolová,1234,operator@labin.cz,01-02-2020 23:03,50.497217,13.650265,1,138129
2,138128,"MUDr. Milan Kuèera, s.r.o.",Praha,15000,Kartouzská,3274/10,milankucera@seznam.cz,01-02-2020 23:03,50.073417,14.401091,1,138128
3,138127,Tereza Horáèková,Praha,18200,Klapkova,154/46,,01-02-2020 23:03,50.126761,14.456956,1,138127
4,138126,Veobecný lékaø Prevent s.r.o.,Telè,58856,Masarykova,330,,01-02-2020 23:03,49.1835,15.459814,1,138126


See in particular the `lat`/`lon` attributes that were retrieved:

In [25]:
cz.data[['lat', 'lon']].head(5)

Unnamed: 0,lat,lon
0,49.427788,15.21829
1,50.497217,13.650265
2,50.073417,14.401091
3,50.126761,14.456956
4,49.1835,15.459814


You can for instance transform the data into a *GEOJSON* collection of features and retrieve the output geometry using the `dumps_data` method:

In [26]:
geom = cz.dumps_data(fmt='geojson')

Say you are interested in the first 10 facilities in the list; you can also use:

In [27]:
from pyeudatnat.io import Dataframe
# columns = set(cz.data.columns).difference(set(['lat', 'lon']))
geom = Dataframe.to_geojson(cz.data.iloc[0:10,:], latlon = ['lat', 'lon'])
print("The first feature of the collection is: %s" % geom.get('features',[])[0])

The first feature of the collection is: {'type': 'Feature', 'properties': {'street': 'Komenského', 'hospital_name': 138130, 'postcode': '39301', 'city': 'Pelhøimov', 'geo_qual': 1, 'id': 138130, 'pub_date': '01-02-2020 23:03', 'house_number': '1465', 'site_name': 'Zubní studio V+V s.r.o.', 'email': 'nan'}, 'geometry': {'type': 'Point', 'coordinates': [15.218289906425, 49.427787867131]}}
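
Note that the coordinates of the feature are listed as longitude first, then latitude, as GeoJSON (RFC 7946) requires. The sketch below builds such a feature by hand from a flat record; `to_feature` is a hypothetical helper for illustration and not the implementation of `Dataframe.to_geojson`:

```python
import json


def to_feature(record, lat="lat", lon="lon"):
    """Build a GeoJSON Point feature from a flat record (hypothetical helper)."""
    properties = {k: v for k, v in record.items() if k not in (lat, lon)}
    return {
        "type": "Feature",
        "properties": properties,
        # GeoJSON (RFC 7946) orders coordinates as [longitude, latitude].
        "geometry": {"type": "Point", "coordinates": [record[lon], record[lat]]},
    }


# Values taken from the second row of the formatted table above.
record = {"id": 138129, "city": "Most", "lat": 50.497217, "lon": 13.650265}
collection = {"type": "FeatureCollection", "features": [to_feature(record)]}
print(json.dumps(collection, ensure_ascii=False))
```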


You can use the formatted data to represent the available information about the 10 records you selected on a map, using any of the [`folium`](https://python-visualization.github.io/folium/) or [`ipyleaflet`](https://github.com/jupyter-widgets/ipyleaflet) packages:

In [28]:
import json
try:
    from folium import Map, GeoJson, features
except ImportError:
    from ipywidgets import HTML
    from ipyleaflet import Map, GeoJSON, WidgetControl

In [29]:
prague = [50.0755, 14.4378]
zoom = 7
try:
    m = Map(location = prague, zoom_start = zoom)
    GeoJson(json.dumps(geom), name = 'CZ hospital',
        tooltip=features.GeoJsonTooltip(fields=['site_name','city','postcode'], localize=True)).add_to(m)
except:
    m = Map(center = prague, zoom = zoom)
    geo_json = GeoJSON(data=geom)
    m.add_layer(geo_json)
    html = HTML("CZ hospital")
    m.add_control(WidgetControl(widget=html, position='bottomleft'))    
    def on_hover(feature, **kwargs):
        html.value = '''<h4><b>{}</b></h4> <br> {} {}'''.format(feature['properties']['site_name'], feature['properties']['city'], feature['properties']['postcode'])
    geo_json.on_hover(on_hover)
m

Map(center=[50.0755, 14.4378], controls=(ZoomControl(options=['position', 'zoom_in_text', 'zoom_in_title', 'zo…

If you instead want to save the data to an output file on disk, for instance in *CSV* or *GEOJSON* format, use the `dump_data` method:

In [30]:
try:
    assert False
    cz.dump_data('czhcs.geojson', fmt='geojson')
    cz.dump_data('czhcs.csv', fmt='csv')
except AssertionError:
    print("Nothing saved - set to True to actually generate the output files")

Nothing saved - set to True to actually generate the output files


### (Re)using metadata to run the operations

Besides the output data, the input metadata/configuration templates are of interest to ensure the reproducibility and reuse of the operation, but also to support the simplification of the processes.

First, the configuration template, _i.e._ the output formatting of harmonised data, for any given facility can be shared. For instance, the template for *HCS* facilities used for the specific *CZ* case can obviously be reused for other countries: 

In [31]:
cz.config.to_dict()

{'fmt': {'geojson': 'geojson', 'json': 'json', 'csv': 'csv', 'gpkg': 'gpkg'},
 'lang': 'en',
 'sep': ',',
 'enc': 'utf-8',
 'dfmt': '%d/%m/%Y',
 'proj': None,
 'path': '../../../data/',
 'file': '%s.%s',
 'index': OrderedDict([('id',
               {'name': 'id',
                'desc': 'The healthcare service identifier - This identifier is based on national identification codes, if it exists.',
                'type': 'int',
                'values': None}),
              ('name',
               {'name': 'hospital_name',
                'desc': 'The name of the healthcare institution',
                'type': 'str',
                'values': None}),
              ('site',
               {'name': 'site_name',
                'desc': 'The name of the specific site or branch of a healthcare institution',
                'type': 'str',
                'values': None}),
              ('lat',
               {'name': 'lat',
                'desc': 'Latitude (WGS 84)',
                'type'

This template is provided by default with the package, through the file `hcs.json` loaded earlier.

Similarly to the above, it is possible to retrieve the metadata associated with the `cz` instance:

In [32]:
cz.update_meta()
cz.meta.to_dict()

{'country': {'code': 'CZ', 'name': 'Czechia'},
 'lang': {'code': 'cs', 'name': 'czech'},
 'file': '../../../data/raw/export-2020-02.csv',
 'columns': [{'cs': 'ZdravotnickeZarizeniId', 'en': 'Medical devices Id'},
  {'cs': 'PCZ', 'en': 'PCZ'},
  {'cs': 'PCDP', 'en': 'PCDP'},
  {'cs': 'NazevCely', 'en': 'Title Cely'},
  {'cs': 'DruhZarizeni', 'en': 'Type of device'},
  {'cs': 'Obec', 'en': 'Village'},
  {'cs': 'Psc', 'en': 'Zip code'},
  {'cs': 'Ulice', 'en': 'Street'},
  {'cs': 'CisloDomovniOrientacni', 'en': 'House numbers indicative'},
  {'cs': 'Kraj', 'en': 'Region'},
  {'cs': 'KrajCode', 'en': 'Region Code'},
  {'cs': 'Okres', 'en': 'District'},
  {'cs': 'OkresCode', 'en': 'District Code'},
  {'cs': 'SpravniObvod', 'en': 'Administrative District'},
  {'cs': 'PoskytovatelTelefon', 'en': 'Phone provider'},
  {'cs': 'PoskytovatelFax', 'en': 'Fax provider'},
  {'cs': 'DatumZahajeniCinnosti', 'en': 'Launch Date'},
  {'cs': 'IdentifikatorDatoveSchranky', 'en': 'Identifier data boxes'},
  

In [33]:
cz.dumps_meta(as_str=True)

'{"country": {"code": "CZ", "name": "Czechia"}, "lang": {"code": "cs", "name": "czech"}, "file": "../../../data/raw/export-2020-02.csv", "columns": [{"cs": "ZdravotnickeZarizeniId", "en": "Medical devices Id"}, {"cs": "PCZ", "en": "PCZ"}, {"cs": "PCDP", "en": "PCDP"}, {"cs": "NazevCely", "en": "Title Cely"}, {"cs": "DruhZarizeni", "en": "Type of device"}, {"cs": "Obec", "en": "Village"}, {"cs": "Psc", "en": "Zip code"}, {"cs": "Ulice", "en": "Street"}, {"cs": "CisloDomovniOrientacni", "en": "House numbers indicative"}, {"cs": "Kraj", "en": "Region"}, {"cs": "KrajCode", "en": "Region Code"}, {"cs": "Okres", "en": "District"}, {"cs": "OkresCode", "en": "District Code"}, {"cs": "SpravniObvod", "en": "Administrative District"}, {"cs": "PoskytovatelTelefon", "en": "Phone provider"}, {"cs": "PoskytovatelFax", "en": "Fax provider"}, {"cs": "DatumZahajeniCinnosti", "en": "Launch Date"}, {"cs": "IdentifikatorDatoveSchranky", "en": "Identifier data boxes"}, {"cs": "PoskytovatelEmail", "en": "Ema

and save these metadata in a dedicated file on disk:

In [34]:
print(cz.dumps_meta(as_str=True))
try:
    assert True
    cz.dump_meta('czmeta.json', fmt='json')
except AssertionError:
    print("Nothing saved - set to True to actually generate the output files")

{"country": {"code": "CZ", "name": "Czechia"}, "lang": {"code": "cs", "name": "czech"}, "file": "../../../data/raw/export-2020-02.csv", "columns": [{"cs": "ZdravotnickeZarizeniId", "en": "Medical devices Id"}, {"cs": "PCZ", "en": "PCZ"}, {"cs": "PCDP", "en": "PCDP"}, {"cs": "NazevCely", "en": "Title Cely"}, {"cs": "DruhZarizeni", "en": "Type of device"}, {"cs": "Obec", "en": "Village"}, {"cs": "Psc", "en": "Zip code"}, {"cs": "Ulice", "en": "Street"}, {"cs": "CisloDomovniOrientacni", "en": "House numbers indicative"}, {"cs": "Kraj", "en": "Region"}, {"cs": "KrajCode", "en": "Region Code"}, {"cs": "Okres", "en": "District"}, {"cs": "OkresCode", "en": "District Code"}, {"cs": "SpravniObvod", "en": "Administrative District"}, {"cs": "PoskytovatelTelefon", "en": "Phone provider"}, {"cs": "PoskytovatelFax", "en": "Fax provider"}, {"cs": "DatumZahajeniCinnosti", "en": "Launch Date"}, {"cs": "IdentifikatorDatoveSchranky", "en": "Identifier data boxes"}, {"cs": "PoskytovatelEmail", "en": "Emai

These metadata can then easily be reloaded:

In [35]:
from pyeuhcs.config import ConfigFacility, MetaDatNatFacility
with open('czmeta.json', 'r') as fp:
    metadata = json.load(fp)
metadata = MetaDatNatFacility(metadata)
print(metadata)

country         : {'code': 'CZ', 'name': 'Czechia'}
lang            : {'code': 'cs', 'name': 'czech'}
file            : ../../../data/raw/export-2020-02.csv
columns         : [{'cs': 'ZdravotnickeZarizeniId', 'en': 'Medical devices Id'}, {'cs': 'PCZ', 'en': 'PCZ'}, {'cs': 'PCDP', 'en': 'PCDP'}, {'cs': 'NazevCely', 'en': 'Title Cely'}, {'cs': 'DruhZarizeni', 'en': 'Type of device'}, {'cs': 'Obec', 'en': 'Village'}, {'cs': 'Psc', 'en': 'Zip code'}, {'cs': 'Ulice', 'en': 'Street'}, {'cs': 'CisloDomovniOrientacni', 'en': 'House numbers indicative'}, {'cs': 'Kraj', 'en': 'Region'}, {'cs': 'KrajCode', 'en': 'Region Code'}, {'cs': 'Okres', 'en': 'District'}, {'cs': 'OkresCode', 'en': 'District Code'}, {'cs': 'SpravniObvod', 'en': 'Administrative District'}, {'cs': 'PoskytovatelTelefon', 'en': 'Phone provider'}, {'cs': 'PoskytovatelFax', 'en': 'Fax provider'}, {'cs': 'DatumZahajeniCinnosti', 'en': 'Launch Date'}, {'cs': 'IdentifikatorDatoveSchranky', 'en': 'Identifier data boxes'}, {'cs': 'Pos
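
Since the metadata are plain nested dictionaries and lists, the save/reload cycle is a standard JSON round trip. The sketch below shows that cycle with the standard library only, on an excerpt of the metadata printed above and in a temporary directory (it does not use `dump_meta` or `MetaDatNatFacility`):

```python
import json
import os
import tempfile

# Excerpt of the metadata printed above.
meta = {
    "country": {"code": "CZ", "name": "Czechia"},
    "lang": {"code": "cs", "name": "czech"},
    "file": "../../../data/raw/export-2020-02.csv",
}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "czmeta.json")
    # Write the dictionary, then read it back from the same file.
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(meta, fp, ensure_ascii=False)
    with open(path, "r", encoding="utf-8") as fp:
        reloaded = json.load(fp)

print(reloaded == meta)
```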

Since the metadata contain all the information necessary to process the source dataset, they can be used _as is_ to run the harmonisation of the data.

As in the section above, a class `NewCZhcs` can be created dynamically and configured with this information:

In [36]:
NewCZhcs = facilityFactory(facility = 'HCS', meta = metadata)

then, similarly, a new instance can be created...and that's all we will need to process the data:

In [37]:
newcz = NewCZhcs()

Given the `file` and `path` fields of the `newcz.meta` attribute, data can be loaded using the `load_data` method (without arguments, since they are all provided with the metadata). Likewise, given the `index` field, the formatting can be performed directly using the `format_data` method:

In [38]:
newcz.load_data()
newcz.format_data()

Check the output data:

In [39]:
print("output data columns: %s" % list(newcz.data.columns))
newcz.data.head(5)

output data columns: ['id', 'site_name', 'city', 'postcode', 'street', 'house_number', 'email', 'pub_date', 'lat', 'lon', 'geo_qual', 'hospital_name']


Unnamed: 0,id,site_name,city,postcode,street,house_number,email,pub_date,lat,lon,geo_qual,hospital_name
0,138130,Zubní studio V+V s.r.o.,Pelhøimov,39301,Komenského,1465,,01-02-2020 23:03,49.427788,15.21829,1,138130
1,138129,TECTUM spol. s r.o.,Most,43401,Topolová,1234,operator@labin.cz,01-02-2020 23:03,50.497217,13.650265,1,138129
2,138128,"MUDr. Milan Kuèera, s.r.o.",Praha,15000,Kartouzská,3274/10,milankucera@seznam.cz,01-02-2020 23:03,50.073417,14.401091,1,138128
3,138127,Tereza Horáèková,Praha,18200,Klapkova,154/46,,01-02-2020 23:03,50.126761,14.456956,1,138127
4,138126,Veobecný lékaø Prevent s.r.o.,Telè,58856,Masarykova,330,,01-02-2020 23:03,49.1835,15.459814,1,138126


### Automating the production

Obviously, you will not have to rerun these operations every time a table is created. 

In order to simplify even further the operations, the steps above, including the loading of configuration and metadata files, have been automated. You can run it at once using the `harmonise` module:

In [40]:
from pyeuhcs import harmonise
cz = harmonise.run(country = "CZ", on_disk = False)

! Country py-module 'pyeuhcs.hcs.CZhcs' found !
! No default metadata dictionary 'METADATNAT' available !
! Generic formatting/harmonisation methods used !
! Harmonised data for country 'CZ' generated !


Check again for consistency... :

In [41]:
cz.data[['lat', 'lon']].head(5)

Unnamed: 0,lat,lon
0,49.427788,15.21829
1,50.497217,13.650265
2,50.073417,14.401091
3,50.126761,14.456956
4,49.1835,15.459814


### Validating the output data

Last, you will want to validate the output data, given the information you provided in the configuration template (_e.g._, through the field `column`).

Validation is also an automated process:

In [42]:
from pyeuhcs import validate
validate.run(country = "CZ")

! Input data file '../../../data/csv/CZ.csv' will be controlled for validation
! Column 'geo_qual' empty - missing values only !
! Column 'cc' empty - missing values only !
! Column 'country' empty - missing values only !
! Column 'cap_beds' empty - missing values only !
! Column 'cap_prac' empty - missing values only !
! Column 'cap_rooms' empty - missing values only !
! Column 'emergency' empty - missing values only !
! Column 'facility_type' empty - missing values only !
! Column 'public_private' empty - missing values only !
! Column 'list_specs' empty - missing values only !
! Column 'tel' empty - missing values only !
! Column 'email' empty - missing values only !
! Column 'url' empty - missing values only !
! Column 'ref_date' empty - missing values only !
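
The messages above report columns that contain missing values only. As a rough illustration of that kind of check, and not the implementation of the `validate` module, the sketch below scans a small CSV text (invented sample rows) and lists the columns whose cells are all empty; `empty_columns` is a hypothetical helper:

```python
import csv
import io

# Invented sample rows, for illustration only.
SAMPLE = """id,site_name,tel,email
138130,Zubni studio,,
138129,TECTUM,,operator@labin.cz
"""


def empty_columns(csv_text):
    """Return the names of columns whose cells are all empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return [name for name in reader.fieldnames
            if all(not (row[name] or "").strip() for row in rows)]


print(empty_columns(SAMPLE))
```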


