In [1]:
import csv
import pymysql.cursors
import pandas as pd
import numpy as np

database = "healthCC"


### Step 1:  Quick Data Exploration

The first step in data analysis is to look at the basic information and structure of our data. We start with the file *'hospital_gen_info'*, which contains general information and national comparisons for all the hospitals.

In [2]:
df = pd.read_csv("data/hospital_gen_info.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4806 entries, 0 to 4805
Data columns (total 28 columns):
Provider ID                                                      4806 non-null int64
Hospital Name                                                    4806 non-null object
Address                                                          4806 non-null object
City                                                             4806 non-null object
State                                                            4806 non-null object
ZIP Code                                                         4806 non-null int64
County Name                                                      4791 non-null object
Phone Number                                                     4806 non-null int64
Hospital Type                                                    4806 non-null object
Hospital Ownership                                               4806 non-null object
Emergency Services                  

The **'Provider ID'** column has 4806 non-null entries, which means the CSV contains 4806 rows. Some columns, such as 'County Name', have only 4791 non-null entries, so 15 rows are missing a county name. Since a missing 'County Name' does not affect our analysis, we leave those rows as they are. The other columns in the dataset can be evaluated in the same way.

We can also view part of the dataset to get a better picture of its overall shape.
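
The per-column gaps reported by `df.info()` can also be tabulated directly. The sketch below is illustrative only: the `missing_summary` helper and the toy rows are assumptions made for the example and are not part of the original notebook or dataset.

```python
import pandas as pd

def missing_summary(frame):
    """Return the missing-value count and percentage for columns that have gaps."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest gap first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Illustrative toy data, not taken from hospital_gen_info.csv
toy = pd.DataFrame({
    "Provider ID": [1, 2, 3, 4],
    "County Name": ["HOUSTON", None, "MARSHALL", None],
})
print(missing_summary(toy))
```

Called on the real frame as `missing_summary(df)`, this should list 'County Name' with 15 missing rows (4806 − 4791), alongside any other incomplete columns.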

In [3]:
df.head(3)

Unnamed: 0,Provider ID,Hospital Name,Address,City,State,ZIP Code,County Name,Phone Number,Hospital Type,Hospital Ownership,...,Readmission national comparison,Readmission national comparison footnote,Patient experience national comparison,Patient experience national comparison footnote,Effectiveness of care national comparison,Effectiveness of care national comparison footnote,Timeliness of care national comparison,Timeliness of care national comparison footnote,Efficient use of medical imaging national comparison,Efficient use of medical imaging national comparison footnote
0,10001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301,HOUSTON,3347938701,Acute Care Hospitals,Government - Hospital District or Authority,...,Below the national average,,Below the national average,,Same as the national average,,Same as the national average,,Same as the national average,
1,10005,MARSHALL MEDICAL CENTER SOUTH,2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957,MARSHALL,2565938310,Acute Care Hospitals,Government - Hospital District or Authority,...,Above the national average,,Same as the national average,,Same as the national average,,Above the national average,,Below the national average,
2,10006,ELIZA COFFEE MEMORIAL HOSPITAL,205 MARENGO STREET,FLORENCE,AL,35631,LAUDERDALE,2567688400,Acute Care Hospitals,Government - Hospital District or Authority,...,Below the national average,,Below the national average,,Same as the national average,,Above the national average,,Below the national average,


In [4]:
df.tail()

Unnamed: 0,Provider ID,Hospital Name,Address,City,State,ZIP Code,County Name,Phone Number,Hospital Type,Hospital Ownership,...,Readmission national comparison,Readmission national comparison footnote,Patient experience national comparison,Patient experience national comparison footnote,Effectiveness of care national comparison,Effectiveness of care national comparison footnote,Timeliness of care national comparison,Timeliness of care national comparison footnote,Efficient use of medical imaging national comparison,Efficient use of medical imaging national comparison footnote
4801,670120,THE HOSPITALS OF PROVIDENCE TRANSMOUNTAIN CAMPUS,2000 TRANSMOUNTAIN RD,EL PASO,TX,79911,EL PASO,9158778136,Acute Care Hospitals,Proprietary,...,Not Available,There are too few measures or measure groups r...,Not Available,There are too few measures or measure groups r...,Not Available,There are too few measures or measure groups r...,Not Available,There are too few measures or measure groups r...,Not Available,There are too few measures or measure groups r...
4802,640001,LBJ TROPICAL MEDICAL CENTER,FAGAALU VILLAGE,PAGO PAGO,AS,96799,,6846334590,Acute Care Hospitals,Government - Hospital District or Authority,...,Not Available,Results are not available for this reporting p...,Not Available,Results are not available for this reporting p...,Not Available,Results are not available for this reporting p...,Not Available,Results are not available for this reporting p...,Not Available,Results are not available for this reporting p...
4803,650001,GUAM MEMORIAL HOSPITAL AUTHORITY,85O GOV CARLOS G CAMACHO ROAD,TAMUNING,GU,96913,,6716472552,Acute Care Hospitals,Government - Local,...,Same as the national average,,Not Available,There are too few measures or measure groups r...,Same as the national average,,Not Available,Results are not available for this reporting p...,Not Available,Results are not available for this reporting p...
4804,650003,GUAM REGIONAL MEDICAL CITY,133 ROUTE 3,DEDEDO,GU,96929,,6716455500,Acute Care Hospitals,Voluntary non-profit - Private,...,Not Available,Data are shown only for hospitals that partici...,Not Available,Data are shown only for hospitals that partici...,Not Available,Data are shown only for hospitals that partici...,Not Available,Data are shown only for hospitals that partici...,Not Available,Data are shown only for hospitals that partici...
4805,660001,COMMONWEALTH HEALTH CENTER,"PO BOX 409CK, NAVY HILL ROAD",GARAPAN,MP,96950,,6702348950,Acute Care Hospitals,Proprietary,...,Below the national average,,Not Available,There are too few measures or measure groups r...,Not Available,There are too few measures or measure groups r...,Not Available,There are too few measures or measure groups r...,Not Available,There are too few measures or measure groups r...


### Step 2: Audit / Validate Data

In the next step, we examine our dataset. Data validation includes checking for:

* **Completeness**: What share of the real-world data our dataset contains. We also check that critical fields (such as hospital name and Provider ID) are not missing.

* **Uniqueness**: Once we have established that the data is complete, we check that no duplicate entries exist in columns that should be unique.

* **Accuracy**: How well the data in our dataset reflects the real world.

* **Consistency**: Whether the attributes of our dataset follow the patterns we expect. For example, we expect the Provider ID to be an integer for every hospital in the United States.

#### *Completeness*
According to the [American Hospital Association](https://aha.org/statistics/fast-facts-us-hospitals), there are 5,534 hospitals in the United States. Most, but not all, hospitals are registered with Medicare. To check the completeness of our data, we count the Provider IDs in our dataset.

According to [Medicare Gov](https://www.medicare.gov/), Massachusetts has 66, Ohio has 174, California has 350, New York has 181 and Texas has 413 hospitals registered under Medicare. We verify completeness by comparing these figures with the number of hospitals per state in our dataset.

In [5]:
print('Total Medicare enrolled hospitals in our dataset : '+str(df['Provider ID'].count()))
print('Total number of hospitals in the US are 5,534, hence our dataset has '+
      str(int((df['Provider ID'].count()/5534)*100))+
      '% of hospitals in the US')
#df['Hospital Ownership'].unique()
ma_dataset = df['State'][df['State']=='MA'].count()
oh_dataset = df['State'][df['State']=='OH'].count()
ca_dataset = df['State'][df['State']=='CA'].count()
ny_dataset = df['State'][df['State']=='NY'].count()
tx_dataset = df['State'][df['State']=='TX'].count()
print('Total Medicare Hospitals in our dataset for MA: '+
      str(ma_dataset)+
      ' out of 66 ('+str(int((ma_dataset/66)*100))+'% of actual)')
print('Total Medicare Hospitals in our dataset for OH: '+
      str(oh_dataset)+' out of 174 ('+
      str(int((oh_dataset/174)*100))+'% of actual)')
print('Total Medicare Hospitals in our dataset for CA: '+
      str(ca_dataset)+' out of 350 ('+
      str(int((ca_dataset/350)*100))+'% of actual)')
print('Total Medicare Hospitals in our dataset for NY: '+
      str(ny_dataset)+' out of 181 ('+
      str(int((ny_dataset/181)*100))+'% of actual)')
print('Total Medicare Hospitals in our dataset for TX: '+
      str(tx_dataset)+' out of 413 ('+
      str(int((tx_dataset/413)*100))+'% of actual)')

Total Medicare enrolled hospitals in our dataset : 4806
Total number of hospitals in the US are 5,534, hence our dataset has 86% of hospitals in the US
Total Medicare Hospitals in our dataset for MA: 63 out of 66 (95% of actual)
Total Medicare Hospitals in our dataset for OH: 170 out of 174 (97% of actual)
Total Medicare Hospitals in our dataset for CA: 342 out of 350 (97% of actual)
Total Medicare Hospitals in our dataset for NY: 171 out of 181 (94% of actual)
Total Medicare Hospitals in our dataset for TX: 408 out of 413 (98% of actual)


Based on the statistics above, our dataset covers 94% to 98% of the Medicare-registered hospitals in the five sampled states and about 86% of all hospitals in the US, which is fairly complete for a dataset.
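
The five per-state checks can be folded into one helper. This is a minimal sketch: `coverage_by_state` and the toy frame are hypothetical names introduced here, and the toy frame simply reproduces the per-state counts printed above so that the example runs without the CSV.

```python
import pandas as pd

# Reference counts quoted in the text above (from medicare.gov)
EXPECTED = {"MA": 66, "OH": 174, "CA": 350, "NY": 181, "TX": 413}

def coverage_by_state(frame, expected):
    """Compare per-state row counts against reference totals."""
    observed = frame["State"].value_counts()
    rows = []
    for state, total in expected.items():
        found = int(observed.get(state, 0))
        rows.append({"State": state, "found": found, "expected": total,
                     "percent": round(found / total * 100, 1)})
    return pd.DataFrame(rows).set_index("State")

# Toy frame standing in for the real df; it reproduces the counts printed above
toy = pd.DataFrame({"State": ["MA"] * 63 + ["OH"] * 170 + ["CA"] * 342
                             + ["NY"] * 171 + ["TX"] * 408})
report = coverage_by_state(toy, EXPECTED)
print(report)
```

The helper rounds to one decimal place instead of truncating with `int()`, so its percentages differ slightly from the printed output (for example 95.5 rather than 95 for MA).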

#### *Uniqueness*

The Provider ID of each hospital is a unique identifier, so to check the uniqueness of our dataset we look for any Provider ID that appears more than once.


In [6]:
df.duplicated('Provider ID').unique()

array([False])

Based on the output above, our dataset doesn't contain any duplicate Provider ID and hence satisfies the uniqueness property.
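
When duplicates do exist, it helps to see the offending rows rather than just a boolean. The following sketch uses a small made-up frame (the duplicate row is invented for illustration) to show one way of doing that with standard pandas calls.

```python
import pandas as pd

# Illustrative data with one deliberately duplicated Provider ID
toy = pd.DataFrame({
    "Provider ID": [10001, 10005, 10005, 10006],
    "Hospital Name": ["A", "B", "B (duplicate)", "C"],
})

# True only when no Provider ID repeats
print(toy["Provider ID"].is_unique)

# keep=False marks every row involved in a duplicate, not just the later copies
dupes = toy[toy["Provider ID"].duplicated(keep=False)]
print(dupes)
```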

#### *Accuracy*

To check accuracy, we compared the addresses of [**SOUTHEAST ALABAMA MEDICAL CENTER**](http://www.samc.org/contact-information-home/), [**MARSHALL MEDICAL CENTER SOUTH**](https://www.mmcenters.com/contact) and [**MIZELL MEMORIAL HOSPITAL**](http://www.mizellmh.com/getpage.php?name=contact&sub=Static) on their own websites with the addresses in our dataset and found them to be the same.

This three-hospital spot check suggests that our data is reasonably up to date and accurate.

#### *Consistency*

To check the consistency of our dataset, we verify the following properties:

* Datatype of Provider ID must be of integer type
* There should not be any Null values in Provider ID
* There should not be any Null values in ZIP Code
* Hospital Name must be of object type
* Hospital Name should not have any Null values

In [7]:
print('Datatype of Provider ID is: '+str(df['Provider ID'].dtype))
print('Provider ID has any Null values? '+str(df['Provider ID'].isnull().values.any()))
print('ZIP has any Null values? '+str(df['ZIP Code'].isnull().values.any()))
print('Datatype of Hospital Name is: '+str(df['Hospital Name'].dtype))
print('Hospital Name has any Null values? '+str(df['Hospital Name'].isnull().values.any()))

Datatype of Provider ID is: int64
Provider ID has any Null values? False
ZIP has any Null values? False
Datatype of Hospital Name is: object
Hospital Name has any Null values? False


From the output above we can infer that our dataset passes all five checks and is fairly consistent.
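
One further consistency point follows from the `df.info()` output, which shows 'ZIP Code' stored as `int64`: ZIP codes that begin with 0 lose their leading zero when parsed as integers. The sketch below demonstrates this on a tiny in-memory CSV; the rows are illustrative and not taken from the dataset.

```python
import io
import pandas as pd

# Toy CSV with one ZIP code that begins with 0 (illustrative rows)
csv_text = "Provider ID,ZIP Code\n220001,01107\n10001,36301\n"

as_int = pd.read_csv(io.StringIO(csv_text))
print(as_int["ZIP Code"].tolist())   # the leading zero is lost when parsed as int

as_str = pd.read_csv(io.StringIO(csv_text), dtype={"ZIP Code": str})
print(as_str["ZIP Code"].tolist())   # reading as text keeps all five digits

# Repairing an integer column that has already been loaded
repaired = as_int["ZIP Code"].astype(str).str.zfill(5)
print(repaired.tolist())
```

Whether this matters depends on how the ZIP code is used later; for display or address matching, the five-character text form is usually the safer choice.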

### Step 3: Clean Data

As the **df.head()** output shows, many columns contain NaN values as well as blank values. In this step we locate such columns and replace missing or bad data with meaningful values.

We also replace some string values with numeric values, so that the data is easier to compute on later. For example, we replace 'Yes' with 1, 'No' with 0 and so on.

In [8]:
# Hospital Type: 'Acute Care Hospitals' -> 1, 'Critical Access Hospitals' -> 2, 'Childrens' -> 3
# Results are assigned back to the column; calling replace(..., inplace=True) on a
# selected column is chained assignment and may not update df in recent pandas versions.
HType_mapping = {'Acute Care Hospitals': 1, 'Critical Access Hospitals': 2, 'Childrens': 3}
df['Hospital Type'] = df['Hospital Type'].replace(HType_mapping)

# 'Meets criteria for meaningful use of EHRs': 'Y' -> 1, 'nan' -> NaN
EHR_mapping = {'Y': 1, 'nan': np.nan}
df['Meets criteria for meaningful use of EHRs'] = df['Meets criteria for meaningful use of EHRs'].replace(EHR_mapping)

# 'Emergency Services': 'Yes' -> 1, 'No' -> 0
ES_mapping = {'Yes': 1, 'No': 0, 'nan': np.nan}
df['Emergency Services'] = df['Emergency Services'].replace(ES_mapping)

# 'Hospital overall rating': 'Not Available' -> NaN
df['Hospital overall rating'] = df['Hospital overall rating'].replace({'Not Available': np.nan})

# National comparison columns:
# Same as the national average = 0, Below = -1, Above = 1, Not Available = NaN
MC_mapping = {'Same as the national average': 0,
              'Below the national average': -1,
              'Above the national average': 1,
              'Not Available': np.nan}

comparison_columns = ['Mortality national comparison',
                      'Safety of care national comparison',
                      'Readmission national comparison',
                      'Patient experience national comparison',
                      'Effectiveness of care national comparison',
                      'Timeliness of care national comparison',
                      'Efficient use of medical imaging national comparison']
for column in comparison_columns:
    df[column] = df[column].replace(MC_mapping)
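
One caveat with dictionary-based recoding is that `replace()` leaves any label not found in the dictionary untouched, so an unexpected or misspelled category survives as a string. The sketch below, using a made-up Series with one deliberately misspelled label, contrasts this with `map()` and shows a way to list unexpected labels before recoding.

```python
import numpy as np
import pandas as pd

MC_mapping = {
    "Same as the national average": 0,
    "Below the national average": -1,
    "Above the national average": 1,
    "Not Available": np.nan,
}

# Illustrative values; the second label is deliberately misspelled
s = pd.Series(["Above the national average", "above the national avg", "Not Available"])

replaced = s.replace(MC_mapping)   # unmatched label is kept as-is
mapped = s.map(MC_mapping)         # unmatched label becomes NaN

# Labels that the mapping does not know about
unexpected = s[~s.isin(list(MC_mapping)) & s.notna()]

print(replaced.tolist())
print(mapped.tolist())
print(unexpected.tolist())
```

Checking `unexpected` first makes it clear whether a NaN in the cleaned column came from 'Not Available' or from a label the mapping did not anticipate.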


### Step 4: Reformat Data

We need to reformat the data from this CSV file to fit our database. We don't need all of the columns, and the columns we do need do not all belong to a single table. For instance, Address, City and ZIP Code need to be inserted into the 'address' table, while Provider ID, Hospital Name and Hospital Type need to be inserted into the 'hospital' table. The data therefore has to be reformatted to match our database design.

We need following columns for respective tables:

* **hospital**: Provider ID, Hospital Name, Hospital Type, Phone Number, Hospital overall rating
* **address**: Address, City, State, ZIP Code, County Name
* **ownership_type**: Hospital Ownership
* **hospital_address**: hospital_id, address_id
* **hospital_ownership**: ownership_id, hospital_id
* **hospital_service_comparison**: Provider ID, Emergency Services, Meets criteria for meaningful use of EHRs, Hospital overall rating, Mortality national comparison, Safety of care national comparison, Readmission national comparison, Patient experience national comparison, Effectiveness of care national comparison, Timeliness of care national comparison, Efficient use of medical imaging national comparison

In [9]:
hospital = df.filter(['Provider ID','Hospital Name','Hospital Type', 'Phone Number', 'Hospital overall rating'], axis=1)
address = df.filter(['Address', 'City', 'State', 'ZIP Code', 'County Name'], axis=1)
ownership_type = df.filter(['Hospital Ownership'], axis=1)
ownership_type.head()

                            Hospital Ownership
0  Government - Hospital District or Authority
1  Government - Hospital District or Authority
2  Government - Hospital District or Authority
3               Voluntary non-profit - Private
4                                  Proprietary
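
The filtered `ownership_type` frame still holds one row per hospital, whereas a lookup table needs one row per distinct ownership type. The sketch below shows one possible way to derive such a lookup table and the matching link table with `pd.factorize`. The column names `id`, `ow_type`, `ownership_id` and `hospital_id` follow the SQL queries used later in this notebook; the last two Provider IDs and the 1-based id scheme are assumptions made for illustration, not the method of the actual insert script.

```python
import pandas as pd

# Illustrative rows; the last two Provider IDs are made up for the example
toy = pd.DataFrame({
    "Provider ID": [10001, 10005, 10006, 10007, 10008],
    "Hospital Ownership": [
        "Government - Hospital District or Authority",
        "Government - Hospital District or Authority",
        "Government - Hospital District or Authority",
        "Voluntary non-profit - Private",
        "Proprietary",
    ],
})

# factorize() assigns 0-based codes in order of first appearance; +1 gives 1-based ids
codes, labels = pd.factorize(toy["Hospital Ownership"])
ownership_type = pd.DataFrame({"id": range(1, len(labels) + 1), "ow_type": labels})

# Link table: one row per hospital pointing at its ownership id
hospital_ownership = pd.DataFrame({
    "ownership_id": codes + 1,
    "hospital_id": toy["Provider ID"],
})

print(ownership_type)
print(hospital_ownership)
```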


### Step 5: Insert Data in Database

Script to insert the complete dataset is available at [Insert Data](insertData.ipynb)
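
For readers without the linked notebook at hand, the general shape of a bulk insert can be sketched as follows. This is not the linked script: it uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in for MySQL so that the example runs without a server, and the rows are placeholders. The column names `provider_id`, `name` and `rating` are taken from the queries used later in this notebook; the column types are assumptions.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in; the notebook itself targets MySQL
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hospitals (provider_id INTEGER PRIMARY KEY, name TEXT, rating INTEGER)"
)

# Placeholder rows, not real hospital data
rows = [
    (1, "HOSPITAL A", 4),
    (2, "HOSPITAL B", 3),
    (3, "HOSPITAL C", None),   # None is stored as NULL (rating not available)
]

# Parameterized bulk insert; values are never concatenated into the SQL string
conn.executemany(
    "INSERT INTO hospitals (provider_id, name, rating) VALUES (?, ?, ?)", rows
)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM hospitals").fetchone()[0]
print(count)
conn.close()
```

With `pymysql` the same pattern applies through `cursor.executemany`, except that the placeholder is `%s` rather than `?`.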

### Step 6: Data Analysis

Finally, once all of the cleaned data has been inserted into the database, we can run the analysis and answer questions with it.

#### **Case 1**
First, we look at the distribution of Medicare-registered hospitals across states.

In [10]:
def create_connection(db_name):
    """ create a database connection to the MySQL database
        specified by db_name
    :param db_name: name of the database to connect to
    :return: Connection object or None
    """
    try:
        connection = pymysql.connect(host='localhost',
                                     user='root',
                                     password='root',
                                     db=db_name)
        return connection
    except pymysql.MySQLError as e:
        # Covers connection failures (OperationalError) as well as InternalError
        print('Could not connect to database: ' + str(e))
        return None

In [11]:
conn = create_connection(database)
if conn is not None:
    query = """SELECT a.state AS `State`, Count(h.provider_id) as Count
                    FROM hospitals h 
                    JOIN hospital_address ha ON ha.hospital_id = h.provider_id
                    JOIN address a ON a.id = ha.address_id
                    GROUP BY a.state
                    ORDER BY Count DESC
                    """
    # A SELECT needs neither an explicit cursor nor a commit
    results = pd.read_sql_query(query, conn)
    conn.close()

print(results.head())
print('  ')
print(results.tail())

  State  Count
0    TX    408
1    CA    342
2    FL    188
3    IL    180
4    NY    171
  
   State  Count
51    DE      7
52    VI      2
53    GU      2
54    AS      1
55    MP      1


From **Case 1** we see that Texas has the most Medicare-registered hospitals, followed by California, Florida, Illinois and New York. The Northern Mariana Islands and American Samoa have the fewest (one each), followed by Guam and the US Virgin Islands (two each). Delaware, with 7, is the lowest-ranked state in the output above.

#### **Case 2** 
In this use case we look at the distribution of *Hospital Ownership* types among hospitals with a *Hospital rating* above 3 (good).

In [12]:
conn = create_connection(database)
if conn is not None:
    query = """SELECT o.ow_type AS `Ownership Type`, Count(h.provider_id) as Count
                        FROM hospitals h
                        JOIN hospital_ownership ho ON ho.hospital_id = h.provider_id
                        JOIN ownership_type o ON o.id = ho.ownership_id
                        WHERE h.rating > 3
                        GROUP BY o.ow_type
                        ORDER BY Count DESC
                        """
    # A SELECT needs neither an explicit cursor nor a commit
    results = pd.read_sql_query(query, conn)
    conn.close()

print(results)

                                Ownership Type  Count
0               Voluntary non-profit - Private    736
1                 Voluntary non-profit - Other    181
2                                  Proprietary    175
3                Voluntary non-profit - Church    150
4  Government - Hospital District or Authority    117
5                           Government - Local     93
6                                    Physician     21
7                           Government - State     13
8                         Government - Federal      5
9                                       Tribal      1


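We observe that Voluntary non-profit Private hospitals have the largest number of **Good** ratings (736), while Tribal (1) and Federal Government (5) hospitals have the fewest. These are raw counts, so they partly reflect how many hospitals of each ownership type exist in the dataset.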
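
Raw counts favour ownership types that simply have more hospitals. A complementary view is the share of each type that is highly rated. The sketch below computes that share in pandas on a small made-up frame; the column names `ownership` and `rating` and all of the values are illustrative assumptions, not results from the database.

```python
import pandas as pd

# Illustrative data: Proprietary has more high-rated hospitals but a lower share
toy = pd.DataFrame({
    "ownership": ["Tribal", "Tribal",
                  "Proprietary", "Proprietary", "Proprietary",
                  "Proprietary", "Proprietary", "Proprietary"],
    "rating": [4, 5, 5, 4, 4, 2, 1, 3],
})

summary = (
    toy.assign(high=toy["rating"] > 3)          # flag ratings above 3
       .groupby("ownership")["high"]
       .agg(high_rated="sum", total="count")    # count of high-rated and of all hospitals
)
summary["share"] = summary["high_rated"] / summary["total"]
print(summary)
```

In this toy example 'Proprietary' leads on the raw count while 'Tribal' leads on the share, which is the kind of difference a count-only query cannot show.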

#### **Case 3** 
In this use case we look at which *highly rated hospitals* (rating above 3) have a *Readmission* rate above the *National Average*, that is, how frequent readmissions are at highly rated hospitals.

The readmission measure focuses on whether patients who were discharged after a hospitalization were hospitalized again within 30 days.

In [13]:
conn = create_connection(database)
if conn is not None:
    query = """SELECT h.name AS `Hospital Name`, a.state AS `State`
                            FROM hospitals h
                            JOIN hospital_comparison hc ON hc.hospital_id = h.provider_id
                            JOIN hospital_address ha ON ha.hospital_id = h.provider_id
                            JOIN address a ON a.id = ha.address_id
                            WHERE hc.readmission = 1 AND h.rating > 3
                            ORDER BY h.rating DESC
                            LIMIT 15
                            """
    # A SELECT needs neither an explicit cursor nor a commit
    results = pd.read_sql_query(query, conn)
    conn.close()

print(results)

                                   Hospital Name State
0                    LAKELAND COMMUNITY HOSPITAL    AL
1                    VERDE VALLEY MEDICAL CENTER    AZ
2                  BANNER BOSWELL MEDICAL CENTER    AZ
3      SUMMIT HEALTHCARE REGIONAL MEDICAL CENTER    AZ
4                           MAYO CLINIC HOSPITAL    AZ
5                          BANNER HEART HOSPITAL    AZ
6                   MERCY GILBERT MEDICAL CENTER    AZ
7        SCOTTSDALE THOMPSON PEAK MEDICAL CENTER    AZ
8                  CROSSRIDGE COMMUNITY HOSPITAL    AR
9                 MILLS-PENINSULA MEDICAL CENTER    CA
10                       SHARP MEMORIAL HOSPITAL    CA
11  COMMUNITY HOSPITAL OF THE MONTEREY PENINSULA    CA
12                              SEQUOIA HOSPITAL    CA
13              SHARP CHULA VISTA MEDICAL CENTER    CA
14             METHODIST HOSPITAL OF SOUTHERN CA    CA


We observe that many of the hospitals listed, most of them in Arizona and California, have high overall ratings, yet the rate at which their patients are rehospitalized within 30 days is above the national average. In other words, these hospitals perform well on other quality measures, but their patients tend to be readmitted more often. Because the query is limited to 15 rows, the list is only a sample of the matching hospitals.

#### **Case 4** 
In this use case we focus on *low-rated* hospitals and look at which of them have a *mortality* rate *higher than the national average*.

In [14]:
conn = create_connection(database)
if conn is not None:
    query = """SELECT h.name AS `Hospital Name`, o.ow_type AS `Ownership Type`
                            FROM hospitals h
                            JOIN hospital_comparison hc ON hc.hospital_id = h.provider_id
                            JOIN hospital_ownership ho ON ho.hospital_id = h.provider_id
                            JOIN ownership_type o ON  o.id = ho.ownership_id
                            WHERE hc.mortality = 1 AND h.rating < 3
                            ORDER BY h.rating ASC
                            LIMIT 15
                            """
    # A SELECT needs neither an explicit cursor nor a commit
    results = pd.read_sql_query(query, conn)
    conn.close()

print(results)

                             Hospital Name                  Ownership Type
0                        OROVILLE HOSPITAL  Voluntary non-profit - Private
1      REGIONAL MEDICAL CENTER OF SAN JOSE            Government - Federal
2   CITRUS VALLEY MEDICAL CENTER-IC CAMPUS                     Proprietary
3                      BRIDGEPORT HOSPITAL  Voluntary non-profit - Private
4           MEMORIAL HOSPITAL JACKSONVILLE                     Proprietary
5     PRESENCE SAINT JOSEPH MEDICAL CENTER   Voluntary non-profit - Church
6        MERCY HOSPITAL AND MEDICAL CENTER  Voluntary non-profit - Private
7   JEWISH HOSPITAL & ST MARY'S HEALTHCARE  Voluntary non-profit - Private
8        UMASS MEMORIAL MEDICAL CENTER INC  Voluntary non-profit - Private
9                     SINAI-GRACE HOSPITAL                     Proprietary
10                     HENRY FORD HOSPITAL  Voluntary non-profit - Private
11       SUMMERLIN HOSPITAL MEDICAL CENTER                     Proprietary
12            EAST ORANGE

Among the hospitals listed, we observe that many *Voluntary non-profit - Private* hospitals have a **low rating** and a **high mortality rate**.

#### **Case 5**
In our last case, we would like to see the state-wise distribution of high (above national average) mortality rate.

In [15]:
conn = create_connection(database)
if conn is not None:
    query = """SELECT a.state AS `State`, Count(h.provider_id) as Count
                        FROM hospitals h
                        JOIN hospital_comparison hc ON hc.hospital_id = h.provider_id
                        JOIN hospital_address ha ON ha.hospital_id = h.provider_id
                        JOIN address a ON a.id = ha.address_id
                        WHERE hc.mortality = 1
                        GROUP BY a.state
                        ORDER BY Count DESC
                        LIMIT 10
                        """
    # A SELECT needs neither an explicit cursor nor a commit
    results = pd.read_sql_query(query, conn)
    conn.close()

print(results)

  State  Count
0    CA     54
1    IL     33
2    FL     31
3    MA     28
4    OH     27
5    NY     26
6    TX     25
7    MI     19
8    NJ     17
9    PA     16


We observe that many Medicare-registered hospitals in states such as California, Illinois, Florida and Massachusetts have a mortality rate higher than the national average. This can be an important indicator for hospitals, whether private or federal. It could mean that more patients in these states arrive at hospitals in critical condition, in which case hospitals there may need to recruit more nurses and caregivers for emergency services. Note that these are raw counts, so states with more hospitals naturally tend to rank higher.
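
To put these counts in proportion, they can be divided by the number of hospitals each state has in the dataset. The sketch below does this for the five states whose totals were printed in the completeness check. Both dictionaries are copied from outputs shown earlier in this notebook; only the variable names are introduced here.

```python
# Hospitals with above-average mortality, copied from the Case 5 output
high_mortality = {"CA": 54, "MA": 28, "OH": 27, "NY": 26, "TX": 25}
# Total hospitals per state in the dataset, copied from the completeness check
total = {"CA": 342, "MA": 63, "OH": 170, "NY": 171, "TX": 408}

# Percentage of each state's hospitals flagged with above-average mortality
share = {s: round(high_mortality[s] / total[s] * 100, 1) for s in high_mortality}

for state, pct in sorted(share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{state}: {pct}%")
```

On these figures Massachusetts stands out far more than its raw count suggests, while Texas, which has the most hospitals, has the lowest share of the five. The other states in the Case 5 output would need their totals from the database before they could be compared in the same way.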