# Merge facility information

Merge facility data from the HIFLD (Homeland Infrastructure Foundation-Level Data), HCRIS (Healthcare Cost Reporting Information System), and DH (Definitive Healthcare) datasets.

In [None]:
import os
import json
from os.path import join, isdir

import pandas as pd
import geopandas as gpd
import numpy as np

from covidcaremap.data import (processed_data_path, 
                               external_data_path,
                               local_data_path)
from covidcaremap.merge import match_facilities, FacilityColumns
from covidcaremap.mapping import map_facility_match_result

This notebook performs matches between different hospital facility datasets.

See logic in `covidcaremap.merge.match_facilities` for information about how the matching is performed.

Generally, the algorithm is as follows:

- Compute the k-nearest neighbors (k=10) between all facilities.
- Create a graph containing every facility in HIFLD, with an edge to each of its neighbors whose distance from it is less than `MAX_DISTANCE`.
- Get the [connected components](https://en.wikipedia.org/wiki/Component_(graph_theory)) of that graph as a set of potentially matched facilities and pass it to a method that:
  - Determines the feasibility of a match between each HIFLD facility and any DH and HCRIS facilities in the set, based on whether the numeric address numbers of the two match or their names match. If a match is deemed feasible (see `covidcaremap.merge.reduce_matched_facility_records` for the exact logic), computes a score between the facilities based on a [rapidfuzz](https://github.com/rhasspy/rapidfuzz) fuzz ratio for the name and address of the facilities.
  - Generates the final match set per HIFLD facility by ordering the potential matches between HIFLD and DH or HCRIS facilities, choosing the first match from each of DH and HCRIS, and ensuring there are no duplicate matches.
- The matched sets over all components represent the matched facilities.
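
The neighbor graph and connected-component steps above can be sketched as follows. This is a minimal, standard-library illustration and not the implementation in `covidcaremap.merge`: the facility records are hypothetical, coordinates are assumed to already be in a projected CRS measured in meters, the neighbor search is brute force, and the function names `neighbor_edges` and `connected_components` are invented for this example.

```python
import math
from collections import defaultdict

# Hypothetical facilities: (dataset, id, x, y), with x/y assumed to be in meters.
facilities = [
    ('hifld', 'H1', 0.0, 0.0),
    ('dh', 'D1', 120.0, 50.0),
    ('hcris', 'C1', 300.0, -40.0),
    ('hifld', 'H2', 10_000.0, 10_000.0),
    ('dh', 'D2', 10_200.0, 10_100.0),
    ('hcris', 'C2', 50_000.0, 50_000.0),
]


def neighbor_edges(points, k=10, max_distance=500):
    """Link each HIFLD facility to those of its k nearest neighbors
    that lie closer than max_distance (brute-force search)."""
    edges = set()
    for i, (dataset, _, xi, yi) in enumerate(points):
        if dataset != 'hifld':
            continue  # edges start from the authoritative dataset only
        nearest = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (_, _, xj, yj) in enumerate(points)
            if j != i
        )[:k]
        edges.update((i, j) for dist, j in nearest if dist < max_distance)
    return edges


def connected_components(n, edges):
    """Group node indices 0..n-1 into connected components (union-find)."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in edges:
        parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return list(groups.values())


edges = neighbor_edges(facilities, k=10, max_distance=500)
components = [
    sorted(facilities[i][1] for i in component)
    for component in connected_components(len(facilities), edges)
]
```

In this toy data, `H1`, `D1`, and `C1` end up in one component and `H2` and `D2` in another. `C2` has no HIFLD facility within range, so it forms a component of its own and would be reported as unmatched.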

These keys are used to refer to the facility datasets.

In [None]:
HIFLD = 'hifld'
DH = 'dh'
HCRIS = 'hcris'

`MAX_DISTANCE` determines the maximum distance that two facilities can be apart from each other and still be considered a potential match.

In [None]:
MAX_DISTANCE = 500 # meters
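
To give a feel for what a 500 meter threshold means for longitude/latitude points, here is a small sketch of a distance check. It assumes a great-circle (haversine) distance on a spherical Earth. Whether `match_facilities` measures distance this way or in a projected CRS is not stated here, and `haversine_m` and `within_max_distance` are hypothetical helper names.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_max_distance(p, q, max_distance=500):
    """True if two (lon, lat) points are less than max_distance meters apart."""
    return haversine_m(*p, *q) < max_distance


# Hypothetical points: 0.001 degrees of latitude is roughly 111 m,
# while 0.01 degrees is roughly 1.1 km.
base = (-75.0, 40.0)
near = (-75.0, 40.001)
far = (-75.0, 40.01)

near_is_candidate = within_max_distance(base, near)  # inside the threshold
far_is_candidate = within_max_distance(base, far)    # outside the threshold
```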

`AUTHORITATIVE_DATASET` is the dataset that all the other facility datasets match against. All facilities in this dataset will be included in the final output; any unmatched facilities in the other datasets will be dropped.

In [None]:
AUTHORITATIVE_DATASET = HIFLD

Read in the facility datasets and make any necessary data transformations.

In [None]:
hcris = gpd.read_file(processed_data_path('usa_hospital_beds_hcris2018.geojson'), encoding='utf-8')
dh = gpd.read_file(processed_data_path('dh_geocoded_v1_0326202.geojson'), encoding='utf-8')
hifld = gpd.read_file(processed_data_path('hifld_facility_data.geojson'), encoding='utf-8')

# Drop OBJECTID from HIFLD, as that is the ID column name used by DH.
hifld = hifld.drop(columns=['OBJECTID'])

# Use a combined address field for DH, as its street address is split between two fields.
dh['addr2'] = dh['HQ_ADDRE_1'].fillna('')
dh['combined_address'] = dh['Street_Addr'] + ' ' + dh['addr2']
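
The address fields matter because the feasibility check and the match score described earlier both read the street address. The sketch below illustrates that idea under loud assumptions: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for a rapidfuzz ratio, the threshold and averaging rule are invented, and the records are hypothetical. The actual logic lives in `covidcaremap.merge` and may differ.

```python
import re
from difflib import SequenceMatcher


def address_number(address):
    """Return the first run of digits in an address, or None if there is none."""
    match = re.search(r'\d+', address or '')
    return match.group(0) if match else None


def similarity(a, b):
    """Case-insensitive similarity in [0, 100]; stand-in for a rapidfuzz ratio."""
    return 100 * SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def is_feasible(a, b, name_threshold=90):
    """Hypothetical rule: same address number, or near-identical names."""
    num_a, num_b = address_number(a['address']), address_number(b['address'])
    same_number = num_a is not None and num_a == num_b
    return same_number or similarity(a['name'], b['name']) >= name_threshold


def match_score(a, b):
    """Hypothetical score: mean of the name and address similarities."""
    return (similarity(a['name'], b['name'])
            + similarity(a['address'], b['address'])) / 2


# Hypothetical records for illustration only.
hifld_rec = {'name': 'EXAMPLE GENERAL HOSPITAL', 'address': '100 MAIN STREET'}
dh_rec = {'name': 'Example General Hospital', 'address': '100 Main St '}
other_rec = {'name': 'Riverside Clinic', 'address': '12 Oak Avenue'}

feasible_pair = is_feasible(hifld_rec, dh_rec)       # same address number
infeasible_pair = is_feasible(hifld_rec, other_rec)  # different number and name
```

Note the trailing space in the hypothetical DH address: concatenating an empty second address field with `' '` leaves one behind, which is why this sketch strips whitespace before comparing.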

The configuration for the matching algorithm. It describes the dataframes and the column names of each facility dataset.

In [None]:
facility_datasets = {
    HIFLD: {
        'df': hifld,
        'columns': FacilityColumns(facility_id='ID',
                                   facility_name='NAME',
                                   street_address='ADDRESS')
    },
    DH: {
        'df': dh,
        'columns': FacilityColumns(facility_id='OBJECTID',
                                   facility_name='HOSP10_Name',
                                   street_address='combined_address')
    },
    HCRIS: {
        'df': hcris,
        'columns': FacilityColumns(facility_id='Provider Number',
                                   facility_name='HOSP10_Name',
                                   street_address='Street_Addr')
    }
}


Perform the matching. This can take a bit.

In [None]:
match_result = match_facilities(facility_datasets, 
                                authoritative_dataset=AUTHORITATIVE_DATASET,
                                max_distance=MAX_DISTANCE)

Save a map of the match results for inspection.

In [None]:
map_dir = local_data_path('merge_facility_validation_maps')
# Create the output directory if it does not already exist.
os.makedirs(map_dir, exist_ok=True)
all_map = map_facility_match_result(match_result, facility_datasets, AUTHORITATIVE_DATASET)
all_map.add_layer_selector()
all_map.save(os.path.join(map_dir, 
                          '{}.html'.format('-'.join([HIFLD, DH, HCRIS]))))

Save the merged facility information to GeoJSON.

In [None]:
match_result.merged_df.to_file(processed_data_path('hifld-dh-hcris-merged.geojson'), 
                               encoding='utf-8', 
                               driver='GeoJSON')

Save a JSON object describing all unmatched facilities.

In [None]:
with open(processed_data_path('unmatched-facilities_per_dataset.json'), 'w') as f:
    json.dump(match_result.get_unmatched_dict(), f, indent=2)
