## Hospital General Information Dataset

### By: Anurag Bolneni

This notebook contains our preliminary hospital recommendation system MVP, based on the Hospital General Information dataset from CMS. It is split into three sections:
+ Step 1: Data Cleaning & Manipulation
+ Step 2: Taking User Input
+ Step 3: Recommendation System MVP

## Step 1: Data Cleaning & Manipulation

We first import the necessary libraries and load the CMS data from a local CSV file. The data is read into a pandas DataFrame, which we then trim to the columns of interest and check for appropriate data types.

In [1]:
import pandas as pd
from collections import Counter
import numpy as np

In [2]:
pd.read_csv('Data/Hospital_General_Information.csv').head()

Unnamed: 0,Facility ID,Facility Name,Address,City,State,ZIP Code,County Name,Phone Number,Hospital Type,Hospital Ownership,...,Count of READM Measures Better,Count of READM Measures No Different,Count of READM Measures Worse,READM Group Footnote,Pt Exp Group Measure Count,Count of Facility Pt Exp Measures,Pt Exp Group Footnote,TE Group Measure Count,Count of Facility TE Measures,TE Group Footnote
0,10001,SOUTHEAST HEALTH MEDICAL CENTER,1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301,HOUSTON,(334) 793-8701,Acute Care Hospitals,Government - Hospital District or Authority,...,1,9,1,,8,8,,14,11,
1,10005,MARSHALL MEDICAL CENTERS,2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957,MARSHALL,(256) 593-8310,Acute Care Hospitals,Government - Hospital District or Authority,...,0,9,1,,8,8,,14,14,
2,10006,NORTH ALABAMA MEDICAL CENTER,1701 VETERANS DRIVE,FLORENCE,AL,35630,LAUDERDALE,(256) 768-8400,Acute Care Hospitals,Proprietary,...,1,7,1,,8,8,,14,11,
3,10007,MIZELL MEMORIAL HOSPITAL,702 N MAIN ST,OPP,AL,36467,COVINGTON,(334) 493-3541,Acute Care Hospitals,Voluntary non-profit - Private,...,0,6,0,,8,8,,14,7,
4,10008,CRENSHAW COMMUNITY HOSPITAL,101 HOSPITAL CIRCLE,LUVERNE,AL,36049,CRENSHAW,(334) 335-3374,Acute Care Hospitals,Proprietary,...,0,4,0,,8,Not Available,5.0,14,8,


In [3]:
df_hosp_gen_info = pd.read_csv('Data/Hospital_General_Information.csv').iloc[:,:13].drop(columns=['Phone Number','Meets criteria for promoting interoperability of EHRs'])
df_hosp_gen_info.head()

Unnamed: 0,Facility ID,Facility Name,Address,City,State,ZIP Code,County Name,Hospital Type,Hospital Ownership,Emergency Services,Hospital overall rating
0,10001,SOUTHEAST HEALTH MEDICAL CENTER,1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301,HOUSTON,Acute Care Hospitals,Government - Hospital District or Authority,Yes,3
1,10005,MARSHALL MEDICAL CENTERS,2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957,MARSHALL,Acute Care Hospitals,Government - Hospital District or Authority,Yes,2
2,10006,NORTH ALABAMA MEDICAL CENTER,1701 VETERANS DRIVE,FLORENCE,AL,35630,LAUDERDALE,Acute Care Hospitals,Proprietary,Yes,2
3,10007,MIZELL MEMORIAL HOSPITAL,702 N MAIN ST,OPP,AL,36467,COVINGTON,Acute Care Hospitals,Voluntary non-profit - Private,Yes,2
4,10008,CRENSHAW COMMUNITY HOSPITAL,101 HOSPITAL CIRCLE,LUVERNE,AL,36049,CRENSHAW,Acute Care Hospitals,Proprietary,Yes,2


In [4]:
print('Length of unique Facility ID vs total:',len(df_hosp_gen_info['Facility ID'].unique()),',',len(df_hosp_gen_info['Facility ID']))
print('Length of unique address vs total:',len(df_hosp_gen_info.Address.unique()),',',len(df_hosp_gen_info.Address))
print('Variations in hospital overall ratings:', Counter(df_hosp_gen_info['Hospital overall rating']))
print('Emerg_services:', Counter(df_hosp_gen_info['Emergency Services']))
df_hosp_gen_info.dtypes

Length of unique Facility ID vs total: 5306 , 5306
Length of unique address vs total: 5276 , 5306
Variations in hospital overall ratings: Counter({'Not Available': 1996, '3': 1006, '4': 979, '2': 682, '5': 452, '1': 191})
Emerg_services: Counter({'Yes': 4455, 'No': 851})


Facility ID                object
Facility Name              object
Address                    object
City                       object
State                      object
ZIP Code                    int64
County Name                object
Hospital Type              object
Hospital Ownership         object
Emergency Services         object
Hospital overall rating    object
dtype: object

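Summary of the findings above: Facility ID is unique across all 5,306 rows, so it can serve as a join key. Address is not quite unique (5,276 distinct values), so a few facilities share an address. Hospital overall rating is stored as strings, and 1,996 facilities are listed as 'Not Available'. Emergency Services is a Yes/No string column with 4,455 Yes and 851 No.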
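
The ratings counter above mixes digit strings with 'Not Available'. As a minimal sketch (using made-up values, and not part of the notebook's pipeline), one way to get a numeric rating column is `pd.to_numeric` with `errors='coerce'`, which turns anything unparseable into NaN:

```python
import pandas as pd

# Toy ratings column; the values are illustrative, not read from the CMS file
ratings = pd.Series(['3', 'Not Available', '5', '1'], name='Hospital overall rating')

# Strings that are not numbers become NaN instead of raising an error
numeric = pd.to_numeric(ratings, errors='coerce')

# Count how many ratings could not be converted
n_missing = int(numeric.isna().sum())
print(numeric.tolist(), n_missing)
```

Whether missing ratings should then be dropped or kept as NaN depends on how the rating is used downstream.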

We'll use cosine similarity over State and Emergency Services to test our MVP of the recommendation algorithm. As a last cleaning step, we convert Emergency Services (Yes/No) to a binary 1/0 value.

In [5]:
df_hosp_gen_info['Emergency Services'] = [1 if x=='Yes' else 0 for x in df_hosp_gen_info['Emergency Services']]
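
The list comprehension above maps every value other than 'Yes' to 0. A possible alternative, sketched here on toy data, is `Series.map` with an explicit dictionary, so that unexpected values show up as NaN rather than silently becoming 0. The sample values are hypothetical.

```python
import pandas as pd

# Toy column with one unexpected spelling ('yes') to show the difference
services = pd.Series(['Yes', 'No', 'Yes', 'yes'])

# Values missing from the dictionary map to NaN
encoded = services.map({'Yes': 1, 'No': 0})

# Any NaN here points at a value that needs cleaning before encoding
n_unexpected = int(encoded.isna().sum())
print(encoded.tolist(), n_unexpected)
```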

In [6]:
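# Load the HCAHPS patient survey results and keep only the columns we need
df_HCAHPS = pd.read_csv('Data/HCAHPS-Hospital.csv')
df_HCAHPS = df_HCAHPS[['Facility ID','HCAHPS Measure ID','HCAHPS Question','HCAHPS Answer Percent']]
# Non-numeric answer percents become NaN and those rows are dropped
df_HCAHPS['HCAHPS Answer Percent'] = pd.to_numeric(df_HCAHPS['HCAHPS Answer Percent'], errors='coerce')
df_HCAHPS = df_HCAHPS.dropna(axis=0)

# Keep only facilities that still have exactly 72 rows after dropping NaN
df_info = df_HCAHPS.groupby(['Facility ID']).count()
df = df_info[df_info['HCAHPS Question'] == 72].reset_index()
VALID_FACILITY_IDS = list(df['Facility ID'])

# .copy() so that adding columns later does not act on a view of the original frame
df_HCAHPS = df_HCAHPS[df_HCAHPS['Facility ID'].isin(VALID_FACILITY_IDS)].copy()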

In [7]:
%%time
# Here's how to calculate summary metrics using group_by and creating some new columns
question_type_dict = {'H_COMP_1_A_P': "nurses", 
                      'H_NURSE_RESPECT_A_P': "nurses", 
                      'H_NURSE_LISTEN_A_P': "nurses", 
                      'H_NURSE_EXPLAIN_A_P': "nurses",
                      'H_COMP_2_A_P': "doctors",
                      'H_DOCTOR_RESPECT_A_P': "doctors",
                      'H_DOCTOR_LISTEN_A_P': "doctors", 
                      'H_DOCTOR_EXPLAIN_A_P': "doctors",
                      'H_COMP_3_A_P': "patients", 
                      'H_CALL_BUTTON_A_P': "patients", 
                      'H_BATH_HELP_A_P': "patients", 
                      'H_COMP_5_A_P': "staffs", 
                      'H_MED_FOR_A_P': "staffs", 
                      'H_SIDE_EFFECTS_A_P': "staffs"
                     }
# Label each row with its measurement type; measure IDs not in the dictionary become "UNKNOWN"
df_HCAHPS["measurement_type"] = df_HCAHPS["HCAHPS Measure ID"].map(question_type_dict).fillna("UNKNOWN")
# Average only the numeric answer column (the string columns cannot be averaged)
grouped = df_HCAHPS.groupby(['Facility ID', 'measurement_type'])[['HCAHPS Answer Percent']].mean()
fin = grouped.drop("UNKNOWN", level="measurement_type").reset_index()

CPU times: user 2.36 s, sys: 28.2 ms, total: 2.39 s
Wall time: 2.39 s


In [8]:
fin.head()

Unnamed: 0,Facility ID,measurement_type,HCAHPS Answer Percent
0,10001,doctors,80.75
1,10001,nurses,77.0
2,10001,patients,60.666667
3,10001,staffs,63.666667
4,10005,doctors,84.0


In [67]:
# We'll now obtain a dictionary and join on the hospital df
CLINICAL_RATINGS = fin.groupby('Facility ID')['HCAHPS Answer Percent'].apply(list).to_dict()
clinician_df = pd.DataFrame(data = CLINICAL_RATINGS).T

print('We have '+ str(len(clinician_df)) + ' facilities in the initial df with all measurement types')

LOL = clinician_df.to_numpy()
v = [list(items) for items in LOL]
clinician_df['Clinician_Metrics'] = v

clinican_df = pd.DataFrame(clinician_df['Clinician_Metrics'])
clinican_df.index.names = ['Facility ID']

new = df_hosp_gen_info.join(clinican_df, on='Facility ID').dropna().reset_index()
print('We have '+ str(len(new)) + ' facilities in the df after joining and dropping NA')

new = new.drop(labels=['index','Hospital Type','Hospital Ownership'], axis=1)

We have 2490 facilities in the initial df with all measurement types
We have 2410 facilities in the df after joining and dropping NA
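
The cell above goes from the long `fin` table to one list of metrics per facility via a dictionary. As a rough sketch of another way to get the same per-facility layout, `DataFrame.pivot` can spread the measurement types into columns. The toy frame and its values below are illustrative stand-ins for `fin`, not data from the CMS files.

```python
import pandas as pd

# Toy stand-in for `fin` (values are illustrative only)
fin_toy = pd.DataFrame({
    'Facility ID': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'measurement_type': ['doctors', 'nurses', 'patients', 'staffs',
                         'doctors', 'nurses', 'patients'],
    'HCAHPS Answer Percent': [80.0, 77.0, 60.5, 63.5, 84.0, 79.0, 65.0],
})

# One row per facility, one column per measurement type (columns sorted alphabetically)
wide = fin_toy.pivot(index='Facility ID', columns='measurement_type',
                     values='HCAHPS Answer Percent')

# Facility B has no 'staffs' score, so it gets NaN there and is dropped
complete = wide.dropna()
print(complete)
```

Named columns make the order of the metrics (doctors, nurses, patients, staffs) explicit, which matters when they are later lined up against the user's inputs.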


## Step 2: Take User Input

In [101]:
columns = ['Desired Hospital Rating','Emergency Services','State']
df = pd.DataFrame(columns = columns,dtype=object)

def user_input(df):
    # Take user input for a series of factors and add it to df as a new row.
    # DataFrame.append was removed in pandas 2.0, so we build a one-row frame and concatenate.
    row = {
        'Desired Hospital Rating': int(input('Please rate your desired hospital on scale of 1-5:   ')),
        'Emergency Services': int(input('Do you need emergency services? (Yes = 1, No = 0)     ')),
        'State':str(input('Which State do you live in?     ')),
        'Doctors': int(input('Rate your ideal doctor on a scale of 0-100%'))/100,
        'Nurses': int(input('Rate your ideal nurses on a scale of 0-100%'))/100,
        'Patients': int(input('Rate your ideal patients on a scale of 0-100%'))/100,
        'Staffs': int(input('Rate your ideal staffs on a scale of 0-100%'))/100}
    return pd.concat([df, pd.DataFrame([row])], ignore_index=True)

In [103]:
user_input_df = user_input(df)
user_input_df.head()

Please rate your desired hospital on scale of 1-5:   2
Do you need emergency services? (Yes = 1, No = 0)     0
Which State do you live in?     CA
Rate your ideal doctor on a scale of 0-100%88
Rate your ideal nurses on a scale of 0-100%58
Rate your ideal patients on a scale of 0-100%69
Rate your ideal staffs on a scale of 0-100%79


Unnamed: 0,Desired Hospital Rating,Emergency Services,State,Doctors,Nurses,Patients,Staffs
0,2,0,CA,0.88,0.58,0.69,0.79
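
The prompts above pass raw text straight to `int(...)`, so an out-of-range or non-numeric answer is either accepted or raises midway through. A minimal sketch of validation helpers follows; the function names `parse_int_in_range` and `parse_state` are hypothetical and not part of the notebook.

```python
def parse_int_in_range(text, low, high):
    """Parse text as an integer and check that low <= value <= high."""
    value = int(text.strip())  # raises ValueError for non-numeric text
    if not low <= value <= high:
        raise ValueError(f"expected a value between {low} and {high}, got {value}")
    return value


def parse_state(text):
    """Normalize a two-letter state abbreviation to upper case."""
    state = text.strip().upper()
    if len(state) != 2 or not state.isalpha():
        raise ValueError(f"expected a two-letter state abbreviation, got {text!r}")
    return state


# Example: the same answers as in the session above
rating = parse_int_in_range('2', 1, 5)
doctors = parse_int_in_range('88', 0, 100) / 100
state = parse_state(' ca ')
print(rating, doctors, state)
```

Each helper could wrap the corresponding `input(...)` call, for example `parse_int_in_range(input('...'), 1, 5)`.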


## Step 3: Recommender System MVP

Our hospital recommendation system uses the cleaned dataset from Step 1 and the user input from Step 2. We use cosine similarity to rank hospitals against the user's target inputs. This section defines a function that vectorizes the string parameters, combines them with the numeric parameters, and computes the cosine similarity of each hospital relative to the user's needs.

For our MVP, we tested our recommender system on Emergency Services (Yes/No) and hospital State as initial inputs; the function below also appends the clinician metrics. Finally, we sort the hospitals by cosine similarity, highest first, to produce the list of best matches. The results so far look reasonable, although the top ten for a CA user still include two FL hospitals, so we'll need to see how this changes with additional parameters and available data.
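
To make the ranking step concrete, here is a minimal pure-Python sketch of cosine similarity on toy vectors. The feature layout and all numbers are hypothetical and only loosely modeled on this notebook. It also illustrates one thing that may be worth checking: when the hospital metrics are on a 0-100 scale and the user's are on a 0-1 scale, the large-valued columns can swamp the State entries.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rescale_metrics(vec):
    """Divide the metric entries (positions 3 onward) by 100; hypothetical helper."""
    return vec[:3] + [x / 100 for x in vec[3:]]


# Hypothetical layout: [is_CA, is_FL, emergency, doctors, nurses]
user = [1, 0, 0, 0.88, 0.58]            # metrics on a 0-1 scale
ca_hospital = [1, 0, 1, 70.75, 67.75]   # metrics on a 0-100 scale
fl_hospital = [0, 1, 1, 81.25, 72.5]

# Gap between the in-state and out-of-state hospital, before and after rescaling
gap_raw = cosine(user, ca_hospital) - cosine(user, fl_hospital)
gap_scaled = (cosine(user, rescale_metrics(ca_hospital))
              - cosine(user, rescale_metrics(fl_hospital)))
print(round(gap_raw, 4), round(gap_scaled, 4))
```

With these toy numbers the in-state hospital barely outranks the out-of-state one on the raw scale, while the gap becomes much larger once both sides use the same scale.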

In [104]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [105]:
def give_me_hospitals(df_hospital, df_patient):
    State_List = [state for state in df_hospital.State]
    tfidf_vectorizer = TfidfVectorizer()
    
    # Set up TF-IDF and vectorize the states.
    # Note: number of states in dataset = 36, which is not the complete US, with 2140 records
    sparse_matrix = tfidf_vectorizer.fit_transform(State_List)
    doc_term_matrix = sparse_matrix.toarray()
        
    # Hospital features, part 1: State vector plus the Emergency Services flag
    X_no1 = [np.append(doc_term_matrix[i],df_hospital['Emergency Services'][i]) for i in range(len(df_hospital))]
    
    # Hospital features, part 2: part 1 plus the Clinician Metrics
    X_no2 = np.array([np.append(X_no1[i], df_hospital['Clinician_Metrics'][i]) for i in range(len(df_hospital))])
    
    # User features: vectorize the user's State with the same fitted vectorizer
    user_state = df_patient['State']
    Y_transform = tfidf_vectorizer.transform(user_state).toarray()
    
    # Remaining user parameters, in the same order as the hospital features
    # (Clinician_Metrics lists are ordered doctors, nurses, patients, staffs)
    Y_Params = [df_patient['Emergency Services'],
                 df_patient['Doctors'],
                 df_patient['Nurses'],
                 df_patient['Patients'],
                 df_patient['Staffs']]
        
    # Full user vector: State vector followed by the numeric parameters
    Y_Final = np.array([np.append(Y_transform, Y_Params)])

    # Cosine similarity of every hospital row against the single user row
    Cos = cosine_similarity(X_no2,Y_Final)
    df_hospital['Cosine Similarity'] = [values[0] for values in Cos]
    
    # Best matches first
    return df_hospital.sort_values('Cosine Similarity', ascending=False).reset_index()

In [106]:
df_hospital = new
df_patient = user_input_df
give_me_hospitals(df_hospital, df_patient).head(10)

Unnamed: 0,index,Facility ID,Facility Name,Address,City,State,ZIP Code,County Name,Emergency Services,Hospital overall rating,Clinician_Metrics,Cosine Similarity
0,454,050745,CHAPMAN GLOBAL MEDICAL CENTER,2601 E CHAPMAN AVE,ORANGE,CA,92869,ORANGE,1,2,"[65.75, 59.25, 51.666666666666664, 58.33333333...",0.827913
1,489,051320,BANNER LASSEN MEDICAL CENTER,1800 SPRING RIDGE DRIVE,SUSANVILLE,CA,96130,LASSEN,1,3,"[82.25, 77.25, 75.0, 71.66666666666667]",0.824807
2,686,100175,DESOTO MEMORIAL HOSPITAL,900 N ROBERT AVE,ARCADIA,FL,34265,DE SOTO,1,4,"[81.25, 72.5, 72.66666666666667, 74.0]",0.824358
3,477,050780,FOOTHILL REGIONAL MEDICAL CENTER,14662 NEWPORT AVE,TUSTIN,CA,92780,ORANGE,0,Not Available,"[70.75, 67.75, 57.666666666666664, 61.0]",0.82421
4,295,05020F,NH Camp Pendleton,200 Mercy Circle,Camp Pendleton,CA,92055,SAN DIEGO,1,Not Available,"[85.5, 81.75, 71.33333333333333, 73.3333333333...",0.823549
5,668,100130,LAKESIDE MEDICAL CENTER,39200 HOOKER HWY,BELLE GLADE,FL,33430,PALM BEACH,1,2,"[77.75, 73.0, 67.66666666666667, 76.0]",0.823343
6,282,05015F,60th Medical Group (Travis AFB),103 Bodin Circle,Fairfield,CA,94533,SOLANO,1,Not Available,"[82.75, 80.75, 74.33333333333333, 72.333333333...",0.823048
7,363,05039F,NH Twenty Nine Palms,1145 Sturgis Road,Twentynine Palms,CA,92278,SAN BERNARDINO,1,Not Available,"[86.0, 84.75, 74.66666666666667, 76.6666666666...",0.822902
8,460,050758,MONTCLAIR HOSPITAL MEDICAL CENTER,5000 SAN BERNARDINO ST,MONTCLAIR,CA,91763,SAN BERNARDINO,1,3,"[73.0, 70.0, 62.666666666666664, 60.0]",0.82281
9,265,050128,TRI-CITY MEDICAL CENTER,4002 VISTA WAY,OCEANSIDE,CA,92056,SAN DIEGO,1,3,"[72.25, 69.5, 60.666666666666664, 59.666666666...",0.8228
