<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span><ul class="toc-item"><li><span><a href="#Problem-Description" data-toc-modified-id="Problem-Description-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Problem Description</a></span></li><li><span><a href="#Column-Description" data-toc-modified-id="Column-Description-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Column Description</a></span></li></ul></li><li><span><a href="#Exploratory-Data-Analysis" data-toc-modified-id="Exploratory-Data-Analysis-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Exploratory Data Analysis</a></span><ul class="toc-item"><li><span><a href="#Functions" data-toc-modified-id="Functions-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Functions</a></span></li><li><span><a href="#Cleaning" data-toc-modified-id="Cleaning-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Cleaning</a></span></li></ul></li></ul></div>

## Introduction

### Problem Description

https://www.kaggle.com/abrambeyer/us-hospital-customer-satisfaction-20162020

Problems:

- Which hospitals have the best overall performance? Which perform best on a single measure such as Patient Experience?
- Which states, cities, and counties have the most high-performing hospitals? The most low-performing hospitals?
- How has each hospital's overall performance changed between 2016 and 2020? Are certain locations getting better or worse overall?
- Is the overall CMS star rating correlated with patient satisfaction?
- Can we predict a hospital's overall rating from its patient satisfaction scores?
- Can we predict a hospital's overall rating from the hospital's features?
- Which survey questions are most informative of a hospital's overall quality?

### Column Description

https://www.hcahpsonline.org/en/hcahps-star-ratings/

| Column Name | Data Type | Description |
| --- | --- | --- |
| Facility ID | Char(6) | Facility Medicare ID |
| Facility Name | Char(72) | Name of the facility |
| Address | Char(51) | Facility street address |
| City | Char(20) | Facility City |
| State | Char(2) | Facility State |
| ZIP Code | Num(8) | Facility ZIP Code |
| County Name | Char(25) | Facility County |
| Phone Number | Char(14) | Facility Phone Number |
| HCAHPS Measure ID | Char(25) | HCAHPS Patient Survey Measure Name |
| HCAHPS Question | Char(138) | HCAHPS Patient Survey Question |
| HCAHPS Answer Description | Char(118) | HCAHPS Patient Survey Answer |
| Patient Survey Star Rating | Char(14) | Overall rating for survey item |
| Patient Survey Star Rating Footnote | Char(7) | n/a |
| HCAHPS Answer Percent | Char(14) | Percent of surveys with question answered |
| HCAHPS Answer Percent Footnote | Char(8) | n/a |
| HCAHPS Linear Mean Value | Char(14) | HCAHPS Patient Survey question linear mean value |
| Number of Completed Surveys | Char(13) | Number of completed surveys for hospital. N-size. |
| Number of Completed Surveys Footnote | Char(8) | n/a |
| Survey Response Rate Percent | Char(13) | Hospital survey response rate. |
| Survey Response Rate Percent Footnote | Char(8) | n/a |
| Start Date | Date | Survey collection period start date |
| End Date | Date | Survey collection period end date |
| Year | Char(4) | CMS data release year |
| Hospital Type | Char(34) | What type of facility is it? |
| Hospital Ownership | Char(43) | What type of ownership does the facility have? |
| Emergency Services | Char(3) | Does the facility have emergency services Yes/No? |
| Meets criteria for promoting interoperability of EHRs | Char(1) | Does facility meet government EHR standard Yes/No? |
| Hospital overall rating | Char(13) | Hospital Overall Star Rating 1=Worst; 5=Best. Aggregate measure of all other measures |
| Hospital overall rating footnote | Num(8) | |
| Mortality national comparison | Char(28) | Facility overall performance on mortality measures compared to other facilities |
| Mortality national comparison footnote | Num(8) | |
| Safety of care national comparison | Char(28) | Facility overall performance on safety measures compared to other facilities |
| Safety of care national comparison footnote | Num(8) | |
| Readmission national comparison | Char(28) | Facility overall performance on readmission measures compared to other facilities |
| Readmission national comparison footnote | Num(8) | |
| Patient experience national comparison | Char(28) | Facility overall performance on pat. exp. measures compared to other facilities |
| Patient experience national comparison footnote | Char(8) | |
| Effectiveness of care national comparison | Char(28) | Facility overall performance on effect. of care measures compared to other facilities |
| Effectiveness of care national comparison footnote | Char(8) | |
| Timeliness of care national comparison | Char(28) | Facility overall performance on timeliness of care measures compared to other facilities |
| Timeliness of care national comparison footnote | Char(8) | |
| Efficient use of medical imaging national comparison | Char(28) | Facility overall performance on efficient use measures compared to other facilities |
| Efficient use of medical imaging national comparison footnote | Char(8) | |

## Exploratory Data Analysis 

### Functions

In [1]:
def lower_case_column_names(my_df):
    """Lowercase all column names of my_df in place."""
    my_df.columns = [i.lower() for i in my_df.columns]

In [2]:
def lowercase_values(dataframe):
    """Lowercase the values of every object (string) column in place."""
    for column_name in dataframe.select_dtypes(include='object').columns:
        dataframe[column_name] = dataframe[column_name].str.lower()
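A quick way to check what the two helpers do is to run them on a small frame. The sketch below repeats the helpers so it runs on its own; the `toy` frame and its values are made up for illustration and are not taken from the CMS files.

```python
import pandas as pd


def lower_case_column_names(my_df):
    # Rename the columns in place to their lowercase form.
    my_df.columns = [i.lower() for i in my_df.columns]


def lowercase_values(dataframe):
    # Lowercase every object (string) column in place; numeric columns are untouched.
    for column_name in dataframe.select_dtypes(include="object").columns:
        dataframe[column_name] = dataframe[column_name].str.lower()


# Hypothetical toy data; string columns are created with dtype=object explicitly.
toy = pd.DataFrame({
    "State": pd.Series(["AL", "TX"], dtype=object),
    "Hospital Type": pd.Series(["Acute Care Hospitals", "Critical Access Hospitals"], dtype=object),
    "ZIP Code": [36301, 75001],
})

lower_case_column_names(toy)
lowercase_values(toy)
print(toy)
```

Both helpers modify the frame in place and return nothing, which is why the cleaning cells call them without assigning the result.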

### Cleaning


In [3]:
#Module import
import pandas as pd

In [4]:
#Data import and concatenation
files = []
for year in range(2016, 2021):
    file = pd.read_csv(f"Data/cms_hospital_patient_satisfaction_{year}.csv")
    lower_case_column_names(file)
    files.append(file)

df = pd.concat(files, axis=0)



In [5]:
#Option to see all the columns displayed
pd.set_option("display.max_columns", None)

In [6]:
#Verification of the data information
print(df.head())
print(df.info())

  facility id                     facility name                 address  \
0       10001  SOUTHEAST ALABAMA MEDICAL CENTER  1108 ROSS CLARK CIRCLE   
1       10001  SOUTHEAST ALABAMA MEDICAL CENTER  1108 ROSS CLARK CIRCLE   
2       10001  SOUTHEAST ALABAMA MEDICAL CENTER  1108 ROSS CLARK CIRCLE   
3       10001  SOUTHEAST ALABAMA MEDICAL CENTER  1108 ROSS CLARK CIRCLE   
4       10001  SOUTHEAST ALABAMA MEDICAL CENTER  1108 ROSS CLARK CIRCLE   

     city state  zip code county name phone number     hcahps measure id  \
0  DOTHAN    AL     36301     HOUSTON   3347938701         H_STAR_RATING   
1  DOTHAN    AL     36301     HOUSTON   3347938701       H_CLEAN_HSP_A_P   
2  DOTHAN    AL     36301     HOUSTON   3347938701      H_CLEAN_HSP_SN_P   
3  DOTHAN    AL     36301     HOUSTON   3347938701       H_CLEAN_HSP_U_P   
4  DOTHAN    AL     36301     HOUSTON   3347938701  H_CLEAN_LINEAR_SCORE   

                                     hcahps question  \
0                                Sum

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1653683 entries, 0 to 442586
Data columns (total 43 columns):
 #   Column                                                         Non-Null Count    Dtype 
---  ------                                                         --------------    ----- 
 0   facility id                                                    1653683 non-null  object
 1   facility name                                                  1653683 non-null  object
 2   address                                                        1653683 non-null  object
 3   city                                                           1653683 non-null  object
 4   state                                                          1653683 non-null  object
 5   zip code                                                       1653683 non-null  int64 
 6   county name                                                    1651283 non-null  object
 7   phone number                                  

To clean the data, I keep only the rows that have no footnotes, which means the rows where all data is available. The footnote columns themselves are then dropped.

In [7]:
#Collecting the names of the footnote columns in a separate list
footnotes = [x for x in df.columns if "footnote" in x]
#print(footnotes)

In [8]:
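#Dropping rows with footnotes, which means incomplete data.
#A boolean mask is used because the concatenated index repeats across years,
#so dropping by index label would also remove unrelated rows.
df = df[df[footnotes].isna().all(axis=1)].copy()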
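One thing to watch when removing rows after `pd.concat`: without `ignore_index=True`, the concatenated frame keeps each file's original index, so the same labels repeat across years (the `info()` output above shows 1653683 entries with labels only up to 442586). A minimal sketch on made-up data, with a hypothetical `score footnote` column, compares a label-based drop with a boolean mask:

```python
import pandas as pd

nan = float("nan")

# Two toy "yearly" files, each with its own 0..2 index (values are made up).
file_a = pd.DataFrame({"year": [2016] * 3, "score footnote": [nan, 5.0, nan]})
file_b = pd.DataFrame({"year": [2017] * 3, "score footnote": [nan, nan, 5.0]})
toy = pd.concat([file_a, file_b], axis=0)  # index is 0, 1, 2, 0, 1, 2

# Label-based drop: labels 1 and 2 each occur in both files,
# so rows without any footnote are removed as well.
flagged = toy[toy["score footnote"].notna()].index
by_label = toy.drop(flagged)

# Boolean mask: only the rows that actually carry a footnote are removed.
by_mask = toy[toy["score footnote"].isna()]

print(len(by_label), len(by_mask))
```

On this toy data the label-based drop keeps fewer rows than the mask, because it matches on index labels and not on row positions.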

In [10]:
#Dropping unused columns
df.drop(columns=['facility name', 'county name','phone number','address','city','zip code'],inplace = True)

In [11]:
df.drop(columns=footnotes,inplace = True)

In [12]:
#Replacing missing values in a specific column (assignment avoids chained inplace fillna)
ehr_col = "meets criteria for promoting interoperability of ehrs"
df[ehr_col] = df[ehr_col].fillna(value="N")

In [13]:
#Standardizing strings
lowercase_values(df)

In [14]:
#Dropping duplicates
df.drop_duplicates(inplace=True)

In [21]:
df.drop(columns=["facility id"],inplace = True)

In [23]:
df

Unnamed: 0,state,hcahps measure id,hcahps question,hcahps answer description,patient survey star rating,hcahps answer percent,hcahps linear mean value,number of completed surveys,survey response rate percent,start date,end date,year,hospital type,hospital ownership,emergency services,meets criteria for promoting interoperability of ehrs,hospital overall rating,mortality national comparison,safety of care national comparison,readmission national comparison,patient experience national comparison,effectiveness of care national comparison,timeliness of care national comparison,efficient use of medical imaging national comparison
0,al,h_star_rating,summary star rating,summary star rating,3,not applicable,not applicable,1213,27,04/01/2015,03/31/2016,2016,acute care hospitals,government - hospital district or authority,yes,y,3,same as the national average,above the national average,same as the national average,below the national average,same as the national average,same as the national average,same as the national average
1,al,h_clean_hsp_a_p,patients who reported that their room and bath...,"room was ""always"" clean",not applicable,65,not applicable,1213,27,04/01/2015,03/31/2016,2016,acute care hospitals,government - hospital district or authority,yes,y,3,same as the national average,above the national average,same as the national average,below the national average,same as the national average,same as the national average,same as the national average
2,al,h_clean_hsp_sn_p,patients who reported that their room and bath...,"room was ""sometimes"" or ""never"" clean",not applicable,12,not applicable,1213,27,04/01/2015,03/31/2016,2016,acute care hospitals,government - hospital district or authority,yes,y,3,same as the national average,above the national average,same as the national average,below the national average,same as the national average,same as the national average,same as the national average
3,al,h_clean_hsp_u_p,patients who reported that their room and bath...,"room was ""usually"" clean",not applicable,23,not applicable,1213,27,04/01/2015,03/31/2016,2016,acute care hospitals,government - hospital district or authority,yes,y,3,same as the national average,above the national average,same as the national average,below the national average,same as the national average,same as the national average,same as the national average
4,al,h_clean_linear_score,cleanliness - linear mean score,cleanliness - linear mean score,not applicable,not applicable,84,1213,27,04/01/2015,03/31/2016,2016,acute care hospitals,government - hospital district or authority,yes,y,3,same as the national average,above the national average,same as the national average,below the national average,same as the national average,same as the national average,same as the national average
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
439699,tx,h_recmnd_dy,"patients who reported yes, they would definite...","""yes"", patients would definitely recommend the...",not applicable,76,not applicable,562,17,07/01/2018,06/30/2019,2020,acute care hospitals,proprietary,yes,y,3,same as the national average,below the national average,below the national average,above the national average,same as the national average,same as the national average,same as the national average
439700,tx,h_recmnd_py,"patients who reported yes, they would probably...","""yes"", patients would probably recommend the h...",not applicable,19,not applicable,562,17,07/01/2018,06/30/2019,2020,acute care hospitals,proprietary,yes,y,3,same as the national average,below the national average,below the national average,above the national average,same as the national average,same as the national average,same as the national average
439701,tx,h_recmnd_linear_score,recommend hospital - linear mean score,recommend hospital - linear mean score,not applicable,not applicable,90,562,17,07/01/2018,06/30/2019,2020,acute care hospitals,proprietary,yes,y,3,same as the national average,below the national average,below the national average,above the national average,same as the national average,same as the national average,same as the national average
439702,tx,h_recmnd_star_rating,recommend hospital - star rating,recommend hospital - star rating,4,not applicable,not applicable,562,17,07/01/2018,06/30/2019,2020,acute care hospitals,proprietary,yes,y,3,same as the national average,below the national average,below the national average,above the national average,same as the national average,same as the national average,same as the national average


In [25]:
df[df["hcahps question"] == "summary star rating"]

Unnamed: 0,state,hcahps measure id,hcahps question,hcahps answer description,patient survey star rating,hcahps answer percent,hcahps linear mean value,number of completed surveys,survey response rate percent,start date,end date,year,hospital type,hospital ownership,emergency services,meets criteria for promoting interoperability of ehrs,hospital overall rating,mortality national comparison,safety of care national comparison,readmission national comparison,patient experience national comparison,effectiveness of care national comparison,timeliness of care national comparison,efficient use of medical imaging national comparison
0,al,h_star_rating,summary star rating,summary star rating,3,not applicable,not applicable,1213,27,04/01/2015,03/31/2016,2016,acute care hospitals,government - hospital district or authority,yes,y,3,same as the national average,above the national average,same as the national average,below the national average,same as the national average,same as the national average,same as the national average
110,al,h_star_rating,summary star rating,summary star rating,3,not applicable,not applicable,376,25,04/01/2015,03/31/2016,2016,acute care hospitals,government - hospital district or authority,yes,y,2,below the national average,same as the national average,same as the national average,below the national average,same as the national average,above the national average,same as the national average
275,al,h_star_rating,summary star rating,summary star rating,2,not applicable,not applicable,2318,29,04/01/2015,03/31/2016,2016,acute care hospitals,voluntary non-profit - private,yes,y,2,same as the national average,below the national average,same as the national average,below the national average,below the national average,same as the national average,same as the national average
495,al,h_star_rating,summary star rating,summary star rating,3,not applicable,not applicable,829,25,04/01/2015,03/31/2016,2016,acute care hospitals,government - hospital district or authority,yes,y,2,below the national average,below the national average,below the national average,below the national average,same as the national average,above the national average,below the national average
660,al,h_star_rating,summary star rating,summary star rating,3,not applicable,not applicable,1224,27,04/01/2015,03/31/2016,2016,acute care hospitals,government - hospital district or authority,yes,y,3,below the national average,above the national average,same as the national average,below the national average,same as the national average,below the national average,above the national average
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
437192,tx,h_star_rating,summary star rating,summary star rating,4,not applicable,not applicable,922,15,07/01/2018,06/30/2019,2020,acute care hospitals,voluntary non-profit - private,yes,y,5,same as the national average,above the national average,above the national average,above the national average,same as the national average,same as the national average,same as the national average
437471,tx,h_star_rating,summary star rating,summary star rating,4,not applicable,not applicable,1531,25,07/01/2018,06/30/2019,2020,acute care hospitals,voluntary non-profit - private,yes,y,5,above the national average,above the national average,above the national average,above the national average,same as the national average,same as the national average,same as the national average
437564,tx,h_star_rating,summary star rating,summary star rating,3,not applicable,not applicable,1262,26,07/01/2018,06/30/2019,2020,acute care hospitals,voluntary non-profit - private,yes,y,3,same as the national average,same as the national average,below the national average,same as the national average,same as the national average,above the national average,same as the national average
438308,tx,h_star_rating,summary star rating,summary star rating,3,not applicable,not applicable,1753,18,07/01/2018,06/30/2019,2020,acute care hospitals,proprietary,yes,y,4,same as the national average,above the national average,below the national average,above the national average,same as the national average,below the national average,above the national average
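The summary star rating rows put the patient survey star rating next to the hospital overall rating, which is what the correlation question from the introduction needs. Both columns are stored as strings, so they have to be converted first. The following is only a sketch on made-up values, not a result from the real data; it uses a rank (Spearman-style) correlation on the assumption that star ratings are best treated as ordinal.

```python
import pandas as pd

# Toy rows shaped like the summary star rating slice (values are made up).
toy = pd.DataFrame({
    "patient survey star rating": ["3", "2", "4", "5", "3"],
    "hospital overall rating": ["3", "2", "5", "4", "not applicable"],
})

# Convert to numbers; non-numeric entries become NaN and are dropped.
ratings = toy.apply(pd.to_numeric, errors="coerce").dropna()

# Pearson correlation of the ranks is the Spearman rank correlation.
ranks = ratings.rank()
rho = ranks["patient survey star rating"].corr(ranks["hospital overall rating"])
print(round(rho, 3))
```

On the real frame, the same steps would be applied to `df[df["hcahps question"] == "summary star rating"]` restricted to these two columns.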
