# Merge HIFLD licensed bed count data to HCRIS staffed ICU bed count data

## Overview

This notebook completes the following tasks:

- download HIFLD licensed bed count data
- spatially join the above data with HCRIS data (which has facility-level staffed ICU bed counts), using a 150-meter buffer
- deduplicate the joined result by checking address similarity and facility name similarity
- estimate the licensed bed count of non-matching records from the state average ratio of staffed ICU bed count to licensed bed count
- remove redundant fields and prefix fields from HIFLD data with `hifld_`
- export the merged and cleaned result to a `.geojson` file containing all 6661 facilities originally from HCRIS data

## Download HIFLD data

#### The downloaded data are in `../data/hifld-hospitals.csv`

In [None]:
hifld_file_path = '../data/hifld-hospitals.csv'

In [None]:
!wget https://opendata.arcgis.com/datasets/6ac5e325468c4cb9b905f1728d6fbf0f_0.csv -O {hifld_file_path}

## Observe HIFLD data and create a GeoDataFrame

In [None]:
import pandas as pd
import numpy as np
import geopandas as gpd

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

In [None]:
hifld_df = pd.read_csv(hifld_file_path, encoding='utf-8')

In [None]:
hifld_df.head()

#### Prepend `hifld_` to all column names to identify the data source

In [None]:
hifld_df = hifld_df.add_prefix('hifld_')

In [None]:
hifld_df.head()

#### Create a `GeoDataFrame` from the above `DataFrame`, set the projection to `WGS84`, read locations from `hifld_X` and `hifld_Y` columns

In [None]:
hifld_gdf = gpd.GeoDataFrame(
    hifld_df,
    crs='EPSG:4326',
    geometry=gpd.points_from_xy(hifld_df.hifld_X, hifld_df.hifld_Y))

#### Reproject to Web Mercator to prepare for further spatial calculations

In [None]:
hifld_gdf = hifld_gdf.to_crs('EPSG:3857')

In [None]:
hifld_gdf.plot(figsize=(15, 10))

## Import HCRIS data 

In [None]:
hcris_hospital_beds_gdf = gpd.read_file('../data/usa_hospital_beds_hcris2018_v2.geojson', encoding='utf-8')

In [None]:
hcris_hospital_beds_gdf.head()

#### There may not be one-to-one matches
There are 6,661 facilities in the HCRIS data and 7,581 facilities in the HIFLD data. Even though the latter has more entries, there is no guarantee that all 6,661 HCRIS records will find a match in the spatial join.

In [None]:
print("hcris_hospital_beds_gdf", len(hcris_hospital_beds_gdf))
print("hifld_gdf", len(hifld_gdf))

#### Reproject to Web Mercator to prepare for further spatial calculations

In [None]:
hcris_hospital_beds_gdf = hcris_hospital_beds_gdf.to_crs('EPSG:3857')

Create a 150-meter buffer around every HCRIS facility point (Web Mercator's unit is the meter)

In [None]:
hcris_hospital_beds_gdf['geom_buffered'] = hcris_hospital_beds_gdf.geometry.buffer(150)
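
One caveat about the buffer size: Web Mercator stretches distances by roughly `1 / cos(latitude)`, so a buffer of 150 projected units covers less than 150 meters on the ground away from the equator. The sketch below estimates the effective ground radius under that spherical approximation. The helper name `ground_radius_m` is hypothetical and is not part of this notebook.

```python
import math

def ground_radius_m(buffer_units: float, latitude_deg: float) -> float:
    """Approximate ground radius (meters) of a buffer drawn in EPSG:3857.

    Assumes the spherical Mercator scale factor 1 / cos(latitude).
    """
    return buffer_units * math.cos(math.radians(latitude_deg))

# Compare the effective radius of a 150-unit buffer at a few latitudes.
for lat in (0, 30, 45, 60):
    print(f"{lat:>2} deg N: {ground_radius_m(150, lat):.1f} m")
```

If a uniform ground distance matters, one option is to scale the buffer per facility by `1 / cos(latitude)` before buffering.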

#### Create a copy of the data and add an ID column, which will be used later on when deduplicating.

In [None]:
hcris_hospital_beds_gdf_copy = hcris_hospital_beds_gdf.copy()

In [None]:
hcris_hospital_beds_gdf_copy.insert(0, 'ID', range(0, len(hcris_hospital_beds_gdf_copy)))

#### Save the current point geometries to a `point_geometry` column, and set each record's geometry as the buffered polygon from `geom_buffered` field

In [None]:
hcris_hospital_beds_gdf_copy['point_geometry'] = hcris_hospital_beds_gdf_copy.geometry

In [None]:
hcris_hospital_beds_gdf_copy['geometry'] = hcris_hospital_beds_gdf_copy['geom_buffered']

In [None]:
hcris_hospital_beds_gdf_copy = hcris_hospital_beds_gdf_copy.set_geometry('geometry')

## Join data spatially

#### Perform spatial join to merge the HIFLD data (points) to the HCRIS data (buffered polygons)

In [None]:
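# Note: GeoPandas 0.10+ renames the `op` argument to `predicate`; use predicate='intersects' there.
joined = gpd.sjoin(hcris_hospital_beds_gdf_copy, hifld_gdf, how='left', op='intersects')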

#### There are duplicates after join
The result has 7,119 entries, which include duplicates, since the HCRIS data only has 6,661 recorded facilities.

In [None]:
len(joined)

For example, the first record in HCRIS has duplicated joined results

In [None]:
joined.loc[0, :]

In [None]:
joined = joined.set_geometry('geometry')

## Deduplicate joined data

#### Calculate address and name similarity
A way to deduplicate the joined result is to compare address and name similarities between HCRIS and HIFLD on top of the spatial join.

In [None]:
from difflib import SequenceMatcher
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()
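
`SequenceMatcher.ratio()` compares characters exactly, so differences in letter case or spacing lower the score even when two strings name the same place. The sketch below contrasts the raw score with a normalized comparison. `similar_normalized` is a hypothetical variant for illustration, not something this notebook uses.

```python
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def similar_normalized(a, b):
    """Hypothetical variant: lowercase and collapse whitespace before comparing."""
    def norm(s):
        return " ".join(str(s).lower().split())
    return similar(norm(a), norm(b))

raw = similar("100 MAIN ST", "100 Main St")
normalized = similar_normalized("100 MAIN ST", "100 Main  St")
print(round(raw, 2), normalized)
```

Whether normalization helps depends on how consistently the two sources format names and addresses, so treat it as something to test rather than a guaranteed improvement.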

In [None]:
joined['name_similarity'] = joined.apply(lambda row: similar(str(row['HOSP10_Name']), str(row['hifld_NAME'])), axis=1)

In [None]:
joined['address_similarity'] = joined.apply(lambda row: similar(str(row['Street_Addr']), str(row['hifld_ADDRESS'])), axis=1)

#### Records without null data from HIFLD (6,210)

In [None]:
joined_no_null = joined[joined['index_right'].notnull()]
len(joined_no_null)

#### Records with null data from HIFLD (909). The licensed bed counts of these records will be estimated from state averages.

In [None]:
joined_null = joined[joined['index_right'].isnull()]
len(joined_null)

#### Among the 6,210 records without null data from HIFLD, 893 rows are duplicates from HCRIS caused by the spatial join

In [None]:
def getDupeRecords(df, field):
    ids = df[field]
    return df[ids.isin(ids[ids.duplicated()])]
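
A small illustration of what `getDupeRecords` returns, using a made-up toy frame: every row whose ID occurs more than once is kept, including the first occurrence. The count it produces is therefore a number of rows, not a number of distinct facilities.

```python
import pandas as pd

def getDupeRecords(df, field):
    ids = df[field]
    return df[ids.isin(ids[ids.duplicated()])]

# Toy data (hypothetical): ID 0 appears twice, ID 1 once, ID 2 three times.
toy = pd.DataFrame({"ID": [0, 0, 1, 2, 2, 2], "hifld_NAME": list("abcdef")})
dupes = getDupeRecords(toy, "ID")
print(dupes["ID"].tolist())
print(dupes["ID"].nunique(), "distinct IDs across", len(dupes), "rows")
```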

In [None]:
joined_no_null_dupe = getDupeRecords(joined_no_null, "ID")
len(joined_no_null_dupe)

#### The remaining 5,317 records are unique and are used directly in the final result

In [None]:
ids_no_null = joined_no_null['ID']
joined_no_null_no_dupe = joined_no_null[~ids_no_null.isin(ids_no_null[ids_no_null.duplicated()])]
len(joined_no_null_no_dupe)

#### First, among the 893 duplicates, find and keep the records with the highest address similarity score for each ID

In [None]:
address_similarity_maxes = joined_no_null_dupe.groupby(['ID']).address_similarity.transform('max')
joined_no_null_dedupe_address = joined_no_null_dupe[(joined_no_null_dupe.address_similarity == address_similarity_maxes)]
len(joined_no_null_dedupe_address)

#### Then, among the above result, find and keep the records with the highest name similarity score for each ID

In [None]:
name_similarity_maxes = joined_no_null_dedupe_address.groupby(['ID']).name_similarity.transform('max')
joined_no_null_dedupe_address_name = joined_no_null_dedupe_address[(joined_no_null_dedupe_address.name_similarity == name_similarity_maxes)]
len(joined_no_null_dedupe_address_name)
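
The two-stage filter above can be traced on a toy frame with made-up scores. Address similarity takes priority: a candidate with a perfect name score is still dropped if its address score is not the best for its ID.

```python
import pandas as pd

# Toy candidates (hypothetical values) for two HCRIS IDs.
toy = pd.DataFrame({
    "ID": [0, 0, 0, 1, 1],
    "hifld_NAME": ["A", "B", "C", "D", "E"],
    "address_similarity": [0.9, 0.9, 0.4, 0.2, 0.7],
    "name_similarity": [0.3, 0.8, 1.0, 0.9, 0.1],
})

# Stage 1: keep rows tied for the best address score within each ID.
best_addr = toy.groupby("ID")["address_similarity"].transform("max")
stage1 = toy[toy["address_similarity"] == best_addr]

# Stage 2: among the survivors, keep rows with the best name score.
best_name = stage1.groupby("ID")["name_similarity"].transform("max")
stage2 = stage1[stage1["name_similarity"] == best_name]

print(stage2["hifld_NAME"].tolist())
```

Note that rows tied on both scores would all survive, so it is worth checking that `ID` is unique after this step.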

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
joined_no_null_deduped = pd.concat([joined_no_null_no_dupe, joined_no_null_dedupe_address_name])

#### For each facility, calculate the ratio of staffed ICU beds to licensed beds to prepare for estimating the non-matching records

In [None]:
joined_no_null_deduped['icu_to_licensed'] = joined_no_null_deduped['ICU Total Staffed Beds'] / joined_no_null_deduped['hifld_BEDS']

## Fill in non-matching data with estimates

#### Flag records whose `hifld_BEDS` comes from the spatial join as not estimated (`0`)

In [None]:
joined_no_null_deduped['is_hifld_BEDS_estimated'] = 0

In [None]:
len(joined_no_null_deduped)

#### Calculate average ratio of staffed ICU bed to licensed bed by state

In [None]:
icu_to_licensed_state_avg = joined_no_null_deduped.groupby(['State'])['icu_to_licensed'].mean().reset_index()

#### Join the state average ratio with the data frame of those non-matching records and calculate estimates

In [None]:
joined_null_with_ratio = joined_null.merge(icu_to_licensed_state_avg, on='State')

In [None]:
joined_null_with_ratio['hifld_BEDS'] = joined_null_with_ratio['ICU Total Staffed Beds']/joined_null_with_ratio['icu_to_licensed']

In [None]:
joined_null_with_ratio['hifld_BEDS'] = joined_null_with_ratio['hifld_BEDS'].astype(int)
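
The estimate amounts to `licensed beds ≈ staffed ICU beds / state average ratio`. The sketch below walks through it on made-up numbers. The state codes and bed counts are hypothetical, and only the column names follow this notebook.

```python
import pandas as pd

# Matched facilities (hypothetical values) used to derive state averages.
matched = pd.DataFrame({
    "State": ["AA", "AA", "BB"],
    "ICU Total Staffed Beds": [25, 50, 20],
    "hifld_BEDS": [100, 100, 80],
})
matched["icu_to_licensed"] = matched["ICU Total Staffed Beds"] / matched["hifld_BEDS"]
state_avg = matched.groupby("State")["icu_to_licensed"].mean().reset_index()

# Unmatched facilities; state "CC" has no matched facility, hence no average.
unmatched = pd.DataFrame({
    "ID": [7, 8, 9],
    "State": ["AA", "BB", "CC"],
    "ICU Total Staffed Beds": [30, 10, 5],
})

# merge() defaults to an inner join, so the "CC" row is dropped here.
estimated = unmatched.merge(state_avg, on="State")
estimated["hifld_BEDS"] = (
    estimated["ICU Total Staffed Beds"] / estimated["icu_to_licensed"]
).astype(int)  # astype(int) truncates toward zero rather than rounding

print(estimated[["ID", "State", "hifld_BEDS"]])
```

Here the `AA` average is 0.375, so 30 staffed ICU beds map to an estimated 80 licensed beds. The `BB` average is 0.25, so 10 staffed ICU beds map to 40.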

#### Flag these records, whose `hifld_BEDS` is estimated from the state average, as estimated (`1`)

In [None]:
joined_null_with_ratio['is_hifld_BEDS_estimated'] = 1

#### Some records have no state average data

For these records `hifld_BEDS` stays `null`. Flag them with `9`, meaning the value is neither a real count nor an estimate.

In [None]:
not_joined_with_null = joined_null[~joined_null.ID.isin(list(joined_null_with_ratio['ID']))].copy()

In [None]:
not_joined_with_null['is_hifld_BEDS_estimated'] = 9

#### Merge the deduplicated records, the estimated records, and the records without any licensed bed count info

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
result = pd.concat([joined_no_null_deduped, joined_null_with_ratio, not_joined_with_null])

In [None]:
result = result.sort_values(by=['ID'])

In [None]:
result = result.to_crs('EPSG:4326')

In [None]:
result['geom_buffered'] = result['geometry']

In [None]:
result = result.to_crs('EPSG:3857')

In [None]:
result['geometry'] = result['point_geometry']

In [None]:
result = result.set_geometry('geometry')

In [None]:
result = result.to_crs('EPSG:4326')

In [None]:
result['point_geometry'] = result['geometry']

In [None]:
result.plot(figsize=(15, 10))

## Clean data up and export as GeoJSON

#### Remove unwanted columns

In [None]:
result.drop(['point_geometry', 'hifld_X', 'hifld_Y', 'geom_buffered'], axis=1, inplace=True)

#### Make sure the schema can be parsed (by Fiona internally)

In [None]:
gpd.io.file.infer_schema(result)

#### Export data to `../data/usa_hospital_beds_hcris2018_merge_hifld.geojson`

In [None]:
result.to_file('../data/usa_hospital_beds_hcris2018_merge_hifld.geojson', encoding='utf-8', driver='GeoJSON')