### Input/Output data integration and harmonisation -- Simple examples using the pyhcs package

Let's first run the setup needed to import the package(s) used in this notebook. This step is required only if `pyhcs` has not been installed (for instance from `pypi`); in that case, it works only if you have a local copy of the project on your machine.

In [1]:
import os, sys
try:
    thisdir = !pwd
    package, project = 'pyHCS', 'healthcare-services'
    # the package counts as available if it has already been imported
    assert package.lower() in sys.modules
except AssertionError:
    # note: this notebook will need to be run from the project directory, otherwise...
    try:
        pos = thisdir[0].find(project)
        assert pos >= 0
    except:
        raise IOError("sorry, you're doomed, you won't be able to run this notebook...")
    else:
        thisdir = thisdir[0]
    PACKPATH = os.path.join(thisdir[:pos], project, 'src', package)
    sys.path.insert(0,PACKPATH)
else:
    print('package %s available to run project %s' % (package, project))
    # 'pyhcs' is not bound as a name yet, so look the module up in sys.modules
    PACKPATH = sys.modules[package.lower()].__path__[0]
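
The availability test in the cell above can also be written without the IPython `!pwd` magic. The following is a minimal sketch under that assumption, not part of `pyhcs`: the helper name `package_available` is hypothetical, and it relies only on the standard `importlib` machinery.

```python
import importlib.util


def package_available(name):
    """Return True if the top-level package `name` can be found on sys.path."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec may raise for broken or half-initialised modules
        return False


print(package_available("json"))
print(package_available("no_such_package_xyz_123"))
```

If the check fails, the project source directory still has to be added to `sys.path`, as done in the cell above.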

Let's import the whole package (as a test) and check some basic metadata available there...

In [2]:
import pyhcs
print("countries in area \033[1m%s\033[0m are considered for harmonisation: %s" % 
      (list(pyhcs.COUNTRIES.keys())[0], list(pyhcs.COUNTRIES.values())[0]))

countries in area EU28 are considered for harmonisation: ['BE', 'BG', 'CZ', 'DK', 'DE', 'EE', 'IE', 'EL', 'ES', 'FR', 'HR', 'IT', 'CY', 'LV', 'LT', 'LU', 'HU', 'MT', 'NL', 'AT', 'PL', 'PT', 'RO', 'SI', 'SK', 'FI', 'SE', 'UK']


Next, we import the `config` module and check some of the metadata it provides, in particular the attributes defined for the harmonised output dataset: the output columns (name and type), the output encoding, the separator used in the output CSV, _etc._

In [3]:
from pyhcs import config
print('output data will follow the configuration template: \n%s' % config.OCFGMETA)

output data will follow the configuration template: 
{'index': {'id': {'name': 'id', 'desc': 'The healthcare service identifier - This identifier is based on national identification codes, if it exists.', 'type': 'int', 'values': None}, 'name': {'name': 'hospital_name', 'desc': 'The name of the healthcare institution', 'type': 'str', 'values': None}, 'site': {'name': 'site_name', 'desc': 'The name of the specific site or branch of a healthcare institution', 'type': 'str', 'values': None}, 'lat': {'name': 'lat', 'desc': 'Latitude (WGS 84)', 'type': 'float', 'values': None}, 'lon': {'name': 'lon', 'desc': 'Longitude (WGS 84)', 'type': 'float', 'values': None}, 'geo_qual': {'name': 'geo_qual', 'desc': 'A quality indicator for the geolocation - 1: Good, 2: Medium, 3: Low, -1: Unknown', 'type': 'int', 'values': [-1, 1, 2, 3]}, 'street': {'name': 'street', 'desc': 'Street name', 'type': 'str', 'values': None}, 'number': {'name': 'house_number', 'desc': 'House number', 'type': 'str', 'values'



All fields of the configuration dictionary are listed in the `OCFGNAME` list:

In [4]:
print('fields of the configuration template: %s' % config.OCFGNAME)

fields of the configuration template: ['index', 'fmt', 'lang', 'sep', 'enc', 'date', 'proj', 'path', 'file']


Note that some global variables are defined at project level to give direct access to the template attributes. For instance, the global variable `INDEX` is equivalent to the value `OCFGMETA['index']`:

In [5]:
assert config.INDEX['site'] == config.OCFGMETA['index']['site'] 
print("output attribute \033[1m'%s'\033[0m defined as: %s" % (config.INDEX['site']['name'], config.INDEX['site']))
assert config.INDEX['lat'] == config.OCFGMETA['index']['lat'] 
print("output attribute \033[1m'%s'\033[0m defined as: %s" % (config.INDEX['lat']['name'],config.INDEX['lat']))

output attribute 'site_name' defined as: {'name': 'site_name', 'desc': 'The name of the specific site or branch of a healthcare institution', 'type': 'str', 'values': None}
output attribute 'lat' defined as: {'name': 'lat', 'desc': 'Latitude (WGS 84)', 'type': 'float', 'values': None}
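
To make the structure of the template more concrete, here is a small sketch of how the output column names and types could be derived from such a dictionary. The three entries are copied from the outputs shown above; the `PY_TYPES` lookup and the `output_schema` helper are assumptions made for illustration, not part of `pyhcs`.

```python
# Three entries copied from the printed template; the real one has many more.
INDEX = {
    "site": {"name": "site_name",
             "desc": "The name of the specific site or branch of a healthcare institution",
             "type": "str", "values": None},
    "lat": {"name": "lat", "desc": "Latitude (WGS 84)",
            "type": "float", "values": None},
    "geo_qual": {"name": "geo_qual",
                 "desc": "A quality indicator for the geolocation - 1: Good, 2: Medium, 3: Low, -1: Unknown",
                 "type": "int", "values": [-1, 1, 2, 3]},
}

# Assumption: the 'type' strings name built-in Python types.
PY_TYPES = {"int": int, "str": str, "float": float}


def output_schema(index):
    """Map each output column name to the Python type named in the template."""
    return {attr["name"]: PY_TYPES[attr["type"]] for attr in index.values()}


schema = output_schema(INDEX)
print(schema)
```

Note that the dictionary keys (`site`) and the output column names (`site_name`) differ: the key identifies the attribute inside the project, while `name` is what appears in the output table.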


In [6]:
from pyhcs.base import MetaHCS, hcsFactory



Let's consider a simple example: the CZ data already come with the lat/lon geographical coordinates, encoded in a single column of the input table. "Integrating" these data amounts to nothing more than extracting the coordinates, reshuffling some of the columns and dumping the result into a new table.
Metadata regarding the input CZ data has been collected in the `meta/CZhcs.json` file and can be retrieved easily:

In [7]:
import json
with open(os.path.abspath(os.path.join(PACKPATH, 'pyhcs/meta', 'CZhcs.json')), 'r') as fp:
    metadata = json.load(fp)
metadata = MetaHCS(metadata)
print(metadata)

sep          : ;
lang         : {'code': 'cs', 'name': 'czech'}
path         : ../../../data/raw/
index        : {'id': 'Medical devices Id', 'name': 'Medical devices Id', 'site': 'Title Cely', 'lat': 'GPS', 'lon': 'GPS', 'geo_qual': None, 'street': 'Street', 'number': 'Indicative house number', 'postcode': 'Zip code', 'city': 'Village', 'cc': 'cc', 'country': 'country', 'ER': None, 'beds': None, 'prac': None, 'rooms': None, 'type': None, 'PP': None, 'specs': None, 'tel': None, 'email': None, 'url': None, 'refdate': None, 'pubdate': 'Last Modified'}
columns      : [{'cs': 'ZdravotnickeZarizeniId', 'en': 'Medical devices Id', 'fr': '', 'de': ''}, {'cs': 'PCZ', 'en': 'PCZ', 'fr': '', 'de': ''}, {'cs': 'PCDP', 'en': 'PCDP', 'fr': '', 'de': ''}, {'cs': 'NazevCely', 'en': 'Title Cely', 'fr': '', 'de': ''}, {'cs': 'DruhZarizeni', 'en': 'Type of device', 'fr': '', 'de': ''}, {'cs': 'Obec', 'en': 'Village', 'fr': '', 'de': ''}, {'cs': 'Psc', 'en': 'Zip code', 'fr': '', 'de': ''}, {'cs': 'Ulice

The metadata contains in particular information regarding the source dataset. Let's use it to actually retrieve the input data. 
Given the metadata, we dynamically create a class that will describe data from CZ:

In [8]:
CZhcs = hcsFactory(metadata)

Then we create an instance of this class; that is all we need to process the data.

In [9]:
cz = CZhcs()
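
For readers unfamiliar with dynamic class creation, the sketch below shows the general technique with the built-in `type()`. It is not the actual implementation of `hcsFactory`: the names `BaseHCS` and `make_hcs_class`, and the way the metadata is attached to the class, are hypothetical.

```python
class BaseHCS:
    """Hypothetical base class holding the behaviour shared by all countries."""
    META = {}

    def __init__(self):
        self.data = None  # would be filled when the data are loaded

    @property
    def sep(self):
        # fall back on a comma when the metadata defines no separator
        return self.META.get("sep", ",")


def make_hcs_class(cc, metadata):
    """Create a subclass of BaseHCS named '<cc>hcs' and bound to `metadata`."""
    return type("%shcs" % cc, (BaseHCS,), {"META": dict(metadata)})


CZ = make_hcs_class("CZ", {"sep": ";", "lang": {"code": "cs", "name": "czech"}})
cz_example = CZ()
print(CZ.__name__, cz_example.sep)
```

Binding the metadata at class level means every instance of the country class shares the same description of its source data.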

Since the `file` and `path` fields of the `metadata` variable already identify the source, the data can be loaded with the `load_data` method, without any argument:

In [10]:
cz.load_data()

Let's check the input data, _e.g._ by looking at the column names and the content of the table:

In [11]:
print("input data columns: %s" % list(cz.data.columns))
cz.data.head(10)

input data columns: ['ZdravotnickeZarizeniId', 'PCZ', 'PCDP', 'NazevCely', 'DruhZarizeni', 'Obec', 'Psc', 'Ulice', 'CisloDomovniOrientacni', 'Kraj', 'KrajCode', 'Okres', 'OkresCode', 'SpravniObvod', 'PoskytovatelTelefon', 'PoskytovatelFax', 'DatumZahajeniCinnosti', 'IdentifikatorDatoveSchranky', 'PoskytovatelEmail', 'PoskytovatelWeb', 'PoskytovatelNazev', 'Ico', 'TypOsoby', 'PravniFormaKod', 'KrajCodeSidlo', 'KrajSidlo', 'OkresCodeSidlo', 'OkresSidlo', 'PscSidlo', 'ObecSidlo', 'UliceSidlo', 'CisloDomovniOrientacniSidlo', 'OborPece', 'FormaPece', 'DruhPece', 'OdbornyZastupce', 'GPS', 'LastModified']


| | ZdravotnickeZarizeniId | PCZ | PCDP | NazevCely | DruhZarizeni | Obec | Psc | Ulice | CisloDomovniOrientacni | Kraj | ... | PscSidlo | ObecSidlo | UliceSidlo | CisloDomovniOrientacniSidlo | OborPece | FormaPece | DruhPece | OdbornyZastupce | GPS | LastModified |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 138130 | 0 | 1 | Zubní studio V+V s.r.o. | Samostatná ordinace PL - stomatologa | Pelhřimov | 39301 | Komenského | 1465 | Kraj Vysočina | ... | 39301.0 | Pelhřimov | Pod Náspem | 641 | zubní lékařství, Dentální hygienistka | primární ambulantní péče, specializovaná ambul... | | Veronika Škodová, Vít Škoda | 49.427787867131 15.218289906425 | 01-02-2020 23:03 |
| 1 | 138129 | 0 | 4 | TECTUM spol. s r.o. | Odběrová místnost | Most | 43401 | Topolová | 1234 | Ústecký kraj | ... | 36001.0 | Karlovy Vary | Bezručova | 1098/10 | klinická biochemie | specializovaná ambulantní péče | | David Hepnar, JITKA PODROUKOVÁ | 50.497217046878 13.650265371038 | 01-02-2020 23:03 |
| 2 | 138128 | 1 | 0 | MUDr. Milan Kučera, s.r.o. | Samostatná ordinace lékaře specialisty | Praha | 15000 | Kartouzská | 3274/10 | Hlavní město Praha | ... | 27201.0 | Kladno | T. G. Masaryka | 2104 | dermatovenerologie | ambulantní péče | | Jan Kučera | 50.073416715218 14.401090995315 | 01-02-2020 23:03 |
| 3 | 138127 | 1 | 0 | Tereza Horáčková | Výdejna zdravotnických prostředků | Praha | 18200 | Klapkova | 154/46 | Hlavní město Praha | ... | | | | | praktické lékárenství | | lékárenská péče | Petra Holasová | 50.126760922745 14.456956313382 | 01-02-2020 23:03 |
| 4 | 138126 | 0 | 0 | Všeobecný lékař Prevent s.r.o. | Samost. ordinace všeob. prakt. lékaře | Telč | 58856 | Masarykova | 330 | Kraj Vysočina | ... | 28163.0 | Kozojedy | 1. máje | 67 | všeobecné praktické lékařství, všeobecné prakt... | primární ambulantní péče, zdrav. péče poskytov... | | ARNOŠT STANĚK | 49.183499865155 15.45981410537 | 01-02-2020 23:03 |
| 5 | 138124 | 0 | 1 | Mgr. Alice Čapková | Samostatné zařízení fyzioterapeuta | Praha | 15000 | Ostrovského | 253/3 | Hlavní město Praha | ... | | | | | Fyzioterapeut, Fyzioterapeut | ambulantní péče, zdrav. péče poskytovaná ve vl... | | | 50.068645322094 14.402774417737 | 01-02-2020 23:03 |
| 6 | 138122 | 11 | 0 | FOKUS optik a.s. | Oční optika | Třebíč | 67401 | Karlovo nám. | 51/40 | Kraj Vysočina | ... | 18200.0 | Praha | Třeboradická | 1110/51 | Optometrista | specializovaná ambulantní péče | | DAVID FRIEDMAN | 49.215606066693 15.881323723794 | 01-02-2020 23:03 |
| 7 | 138121 | 2 | 0 | BeBridge a.s., Lékárna U Slunce | Lékárna | Liberec | 46014 | Vrchlického | 802/46 | Liberecký kraj | ... | 63900.0 | Brno | Bidláky | 837/20 | | | lékárenská péče | BLANKA HUDCOVÁ | 50.783464595769 15.060350115097 | 01-02-2020 23:03 |
| 8 | 138120 | 0 | 0 | ART OPTIKA ZHÁNĚL s.r.o. | Oční optika | Ostrava | 70800 | náměstí Jana Nerudy | 614/6 | Moravskoslezský kraj | ... | 74301.0 | Bítov | | 147 | Optometrista | specializovaná ambulantní péče | | Radovan Krejcha | 49.829347272973 18.164683025831 | 01-02-2020 23:03 |
| 9 | 138116 | 0 | 0 | ZLATICA MESJAROVÁ DiS. | Samostatné zařízení nelékaře - jiné | Brandýs nad Labem-Stará Boleslav | 25001 | Výletní | 1411 | Středočeský kraj | ... | 25001.0 | Brandýs nad Labem-Stará Boleslav | Výletní | 1411 | Dentální hygienistka | specializovaná ambulantní péče | | | 50.182852358333 14.656601659872 | 01-02-2020 23:03 |


We know the geographical coordinates are actually stored as a single variable in the `GPS` column:

In [12]:
cz.data['GPS'].head(10)

0    49.427787867131 15.218289906425
1    50.497217046878 13.650265371038
2    50.073416715218 14.401090995315
3    50.126760922745 14.456956313382
4     49.183499865155 15.45981410537
5    50.068645322094 14.402774417737
6    49.215606066693 15.881323723794
7    50.783464595769 15.060350115097
8    49.829347272973 18.164683025831
9    50.182852358333 14.656601659872
Name: GPS, dtype: object
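
Each value is a single string of the form `"<lat> <lon>"`. As a minimal sketch of the extraction step (the helper name `split_gps` is hypothetical, and this is not necessarily how `pyhcs` does it), such a string can be split into two floats as follows:

```python
def split_gps(value):
    """Split a 'lat lon' string into a (lat, lon) pair of floats.

    Returns (None, None) when the value is missing or malformed.
    """
    try:
        lat, lon = str(value).split()
        return float(lat), float(lon)
    except ValueError:
        # wrong number of tokens, or tokens that are not numbers
        return None, None


print(split_gps("49.427787867131 15.218289906425"))
print(split_gps(None))
```

Returning a pair of `None` values rather than raising keeps a single malformed record from aborting the processing of a whole table.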

The `index` field of the `metadata` variable tells us exactly that, and also lists the other mappings between output attributes and input columns:

In [13]:
for l in ['lat', 'lon']:
    print("coordinate attribute \033[1m'%s'\033[0m is available through the column \033[1m'%s'\033[0m of the input table" % (l, metadata['index'][l]))
print("other matching variables are listed as:")
{k:v for (k,v) in metadata['index'].items() if v is not None}

coordinate attribute 'lat' is available through the column 'GPS' of the input table
coordinate attribute 'lon' is available through the column 'GPS' of the input table
other matching variables are listed as:


{'id': 'Medical devices Id',
 'name': 'Medical devices Id',
 'site': 'Title Cely',
 'lat': 'GPS',
 'lon': 'GPS',
 'street': 'Street',
 'number': 'Indicative house number',
 'postcode': 'Zip code',
 'city': 'Village',
 'cc': 'cc',
 'country': 'country',
 'pubdate': 'Last Modified'}
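
The `index` field refers to the English column names, while the raw table uses the Czech ones, so the `columns` field is needed to bridge the two. The sketch below shows one way such a renaming map could be built. The helper `rename_map` and the `out_names` dictionary are assumptions made for illustration, and only a few entries of the metadata are reproduced.

```python
# Excerpts of the 'columns' and 'index' metadata fields printed earlier.
columns = [
    {"cs": "ZdravotnickeZarizeniId", "en": "Medical devices Id"},
    {"cs": "NazevCely", "en": "Title Cely"},
    {"cs": "Obec", "en": "Village"},
    {"cs": "Psc", "en": "Zip code"},
]
index = {"id": "Medical devices Id", "site": "Title Cely",
         "city": "Village", "postcode": "Zip code", "geo_qual": None}

# Assumption: output names as they appear in the harmonised table.
out_names = {"id": "id", "site": "site_name", "city": "city",
             "postcode": "postcode", "geo_qual": "geo_qual"}


def rename_map(columns, index, out_names, lang="cs"):
    """Map source column names (in language `lang`) to output column names."""
    en_to_src = {col["en"]: col[lang] for col in columns}
    return {en_to_src[en]: out_names[key]
            for key, en in index.items()
            if en is not None and en in en_to_src}


mapping = rename_map(columns, index, out_names)
print(mapping)
```

A plain renaming map only covers one-to-one matches. Attributes that share a source column, such as `lat` and `lon` (both read from `GPS`), need a dedicated extraction step.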

The integration/matching/extraction process is actually run with the `format_data` method (this time again without arguments):

In [14]:
cz.format_data()

Let's check the output data, _e.g._ by looking at the column names and the content of the table:

In [15]:
assert set(cz.data.columns).difference(set([ind['name'] for ind in config.INDEX.values()])) == set()
print("output data attributes: %s" % list(cz.data.columns))
cz.data.head(10)

output data attributes: ['id', 'site_name', 'city', 'postcode', 'street', 'house_number', 'pub_date', 'country', 'cc', 'lat', 'lon', 'geo_qual', 'hospital_name']


| | id | site_name | city | postcode | street | house_number | pub_date | country | cc | lat | lon | geo_qual | hospital_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 138130 | Zubní studio V+V s.r.o. | Pelhřimov | 39301 | Komenského | 1465 | 01-02-2020 23:03 | Czech Republic | CZ | 49.427788 | 15.21829 | 3 | 138130 |
| 1 | 138129 | TECTUM spol. s r.o. | Most | 43401 | Topolová | 1234 | 01-02-2020 23:03 | Czech Republic | CZ | 50.497217 | 13.650265 | 3 | 138129 |
| 2 | 138128 | MUDr. Milan Kučera, s.r.o. | Praha | 15000 | Kartouzská | 3274/10 | 01-02-2020 23:03 | Czech Republic | CZ | 50.073417 | 14.401091 | 3 | 138128 |
| 3 | 138127 | Tereza Horáčková | Praha | 18200 | Klapkova | 154/46 | 01-02-2020 23:03 | Czech Republic | CZ | 50.126761 | 14.456956 | 3 | 138127 |
| 4 | 138126 | Všeobecný lékař Prevent s.r.o. | Telč | 58856 | Masarykova | 330 | 01-02-2020 23:03 | Czech Republic | CZ | 49.1835 | 15.459814 | 3 | 138126 |
| 5 | 138124 | Mgr. Alice Čapková | Praha | 15000 | Ostrovského | 253/3 | 01-02-2020 23:03 | Czech Republic | CZ | 50.068645 | 14.402774 | 3 | 138124 |
| 6 | 138122 | FOKUS optik a.s. | Třebíč | 67401 | Karlovo nám. | 51/40 | 01-02-2020 23:03 | Czech Republic | CZ | 49.215606 | 15.881324 | 3 | 138122 |
| 7 | 138121 | BeBridge a.s., Lékárna U Slunce | Liberec | 46014 | Vrchlického | 802/46 | 01-02-2020 23:03 | Czech Republic | CZ | 50.783465 | 15.06035 | 3 | 138121 |
| 8 | 138120 | ART OPTIKA ZHÁNĚL s.r.o. | Ostrava | 70800 | náměstí Jana Nerudy | 614/6 | 01-02-2020 23:03 | Czech Republic | CZ | 49.829347 | 18.164683 | 3 | 138120 |
| 9 | 138116 | ZLATICA MESJAROVÁ DiS. | Brandýs nad Labem-Stará Boleslav | 25001 | Výletní | 1411 | 01-02-2020 23:03 | Czech Republic | CZ | 50.182852 | 14.656602 | 3 | 138116 |
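
The assertion in the cell above only checks the column names. Since the template also lists the allowed codes for some attributes (for `geo_qual`: -1, 1, 2 and 3), a similar check could be run on the values. This is a minimal sketch; the helper `rows_with_invalid_value` is hypothetical and the third row is made up to trigger the check.

```python
# Allowed codes for 'geo_qual', as listed in the configuration template.
GEO_QUAL_VALUES = [-1, 1, 2, 3]


def rows_with_invalid_value(rows, attribute, allowed):
    """Return the positions of the rows whose `attribute` is not an allowed value."""
    return [i for i, row in enumerate(rows) if row.get(attribute) not in allowed]


rows = [
    {"id": 138130, "geo_qual": 3},
    {"id": 138129, "geo_qual": 3},
    {"id": 0, "geo_qual": 7},  # made-up row, added only to trigger the check
]
print(rows_with_invalid_value(rows, "geo_qual", GEO_QUAL_VALUES))
```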


See in particular the `lat`/`lon` attributes that were retrieved:

In [16]:
cz.data[['lat', 'lon']].head(10)

| | lat | lon |
|---|---|---|
| 0 | 49.427788 | 15.21829 |
| 1 | 50.497217 | 13.650265 |
| 2 | 50.073417 | 14.401091 |
| 3 | 50.126761 | 14.456956 |
| 4 | 49.1835 | 15.459814 |
| 5 | 50.068645 | 14.402774 |
| 6 | 49.215606 | 15.881324 |
| 7 | 50.783465 | 15.06035 |
| 8 | 49.829347 | 18.164683 |
| 9 | 50.182852 | 14.656602 |


Obviously, you do not have to rerun each of these operations every time a table is created. You can run them all at once:

In [17]:
from pyhcs import harmonise
cz = harmonise.run(country = "CZ", as_file = False)



Let's check again that the result is consistent with the previous one:

In [18]:
cz.data[['lat', 'lon']].head(10)

| | lat | lon |
|---|---|---|
| 0 | 49.427788 | 15.21829 |
| 1 | 50.497217 | 13.650265 |
| 2 | 50.073417 | 14.401091 |
| 3 | 50.126761 | 14.456956 |
| 4 | 49.1835 | 15.459814 |
| 5 | 50.068645 | 14.402774 |
| 6 | 49.215606 | 15.881324 |
| 7 | 50.783465 | 15.06035 |
| 8 | 49.829347 | 18.164683 |
| 9 | 50.182852 | 14.656602 |


You can, for instance, transform the data into a GeoJSON collection of features, say for the first 10 rows:

In [19]:
# columns = set(cz.data.columns).difference(set(['lat', 'lon']))
geom = cz.to_geojson(cz.data.iloc[0:10,:], latlon = ['lat', 'lon'])

In [20]:
print("the first feature of the feature collection is for instance: %s" % geom.get('features',[])[0])

the first feature of the feature collection is for instance: {'type': 'Feature', 'properties': {'site_name': 'Zubní studio V+V s.r.o.', 'house_number': '1465', 'geo_qual': 3, 'postcode': '39301', 'cc': 'CZ', 'street': 'Komenského', 'pub_date': '01-02-2020 23:03', 'hospital_name': 138130, 'country': 'Czech Republic', 'id': 138130, 'city': 'Pelhřimov'}, 'geometry': {'type': 'Point', 'coordinates': [15.218289906425, 49.427787867131]}}
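
Note the order of the coordinates in the feature above: GeoJSON stores a point as `[longitude, latitude]`, the reverse of the usual lat/lon order. The sketch below builds a comparable feature collection from plain dictionaries. It illustrates the format and is not the implementation of `to_geojson`; the helper name `to_feature_collection` is hypothetical.

```python
import json


def to_feature_collection(records, lat="lat", lon="lon"):
    """Turn a list of dict records into a GeoJSON FeatureCollection of points."""
    features = []
    for rec in records:
        # every field other than the coordinates becomes a property
        props = {k: v for k, v in rec.items() if k not in (lat, lon)}
        features.append({
            "type": "Feature",
            "properties": props,
            # GeoJSON positions are ordered [longitude, latitude]
            "geometry": {"type": "Point", "coordinates": [rec[lon], rec[lat]]},
        })
    return {"type": "FeatureCollection", "features": features}


records = [{"id": 138130, "cc": "CZ", "lat": 49.427788, "lon": 15.21829}]
collection = to_feature_collection(records)
print(json.dumps(collection, indent=2))
```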


You can then use the formatted data to display the available information, for instance on an interactive map built with `folium`:

In [21]:
from folium import Map, GeoJson, features

In [22]:
prague = [50.0755, 14.4378]
m = Map(location = prague, zoom_start = 7)
GeoJson(json.dumps(geom), name = 'hospital',
        tooltip=features.GeoJsonTooltip(fields=['site_name', 'city', 'postcode'], localize=True)).add_to(m)
m