## Working notebook for Python data analysis project

In [1]:
import numpy as np
import pandas as pd
import re

In [2]:
surveys_dtypes = {'Facility ID':'string',
 'Facility Name':'string',
 'Address':'string',
 'City':'string',
 'State':'string',
 'ZIP Code':'string',
 'County Name':'string',
 'Phone Number':'string',
 'HCAHPS Measure ID':'string',
 'HCAHPS Question':'string',
 'HCAHPS Answer Description':'string',
 'Patient Survey Star Rating':'Int32',
 'Patient Survey Star Rating Footnote':'string',
 'HCAHPS Answer Percent':'Int32',
 'HCAHPS Answer Percent Footnote':'string',
 'HCAHPS Linear Mean Value':'Int32',
 'Number of Completed Surveys':'Int32',
 'Number of Completed Surveys Footnote':'string',
 'Survey Response Rate Percent':'Int32',
 'Survey Response Rate Percent Footnote':'string',
 'Start Date':'string',
 'End Date':'string'}

In [3]:
# Read with explicit dtypes; the placeholder strings below are parsed as missing values.
surveys = pd.read_csv('HCAHPS-Hospital.csv', low_memory=False,
                      dtype=surveys_dtypes, na_values=['Not Available', 'Not Applicable'])

In [4]:
unpland_dtypes = {'Facility ID':'string',
 'Facility Name':'string',
 'Address':'string',
 'City':'string',
 'State':'string',
 'ZIP Code':'string',
 'County Name':'string',
 'Phone Number':'string',
 'Measure ID':'string',
 'Measure Name':'string',
 'Compared to National':'string',
 'Denominator':'float',
 'Score':'float',
 'Lower Estimate':'float',
 'Higher Estimate':'float',
 'Number of Patients':'Int32',
 'Number of Patients Returned':'Int32',
 'Footnote':'string',
 'Start Date':'string',
 'End Date':'string'}

In [5]:
# Read with explicit dtypes; the placeholder strings below are parsed as missing values.
unpland = pd.read_csv('Unplanned_Hospital_Visits-Hospital.csv',
                      low_memory=False,
                      dtype=unpland_dtypes, na_values=['Not Available', 'Not Applicable'])

In [6]:
print(surveys.shape)
print(surveys['Facility ID'].unique().size)
print(surveys['Start Date'].unique().size)
print(surveys['End Date'].unique().size)
print(surveys['Start Date'].unique()[0])
print(surveys['End Date'].unique()[0])

(452538, 22)
4866
1
1
01/01/2019
12/31/2019


In [10]:
print(unpland.shape)
print(unpland['Facility ID'].unique().size)
print(unpland['Start Date'].unique().size)
print(unpland['End Date'].unique().size)
print(list(unpland['Start Date'].unique()))
print(list(unpland['End Date'].unique()))

(68124, 20)
4866
4
2
['07/01/2017', '01/01/2017', '01/01/2019', '07/01/2019']
['12/01/2019', '12/24/2019']


#### Note that the survey data covers 2019 only, whereas the unplanned visits data covers periods that begin as early as 2017. Extracting only 2019 data from unplanned visits is impossible (or at least problematic) because a single measure (a single row) can have a start date in 2017 and an end date in 2019. We will continue with the analysis, but this mismatch is an issue to be aware of and to communicate along with the results.
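
One way to see which measures are affected is to list the distinct reporting window for each measure and flag the windows that fall entirely within 2019. The sketch below is only an illustration: `toy` is a hypothetical stand-in for `unpland`, the measure IDs are made up, and the pairing of start and end dates is assumed rather than taken from the file. `reporting_periods` is a helper name introduced here.

```python
import pandas as pd

# Hypothetical stand-in for `unpland`; measure IDs and date pairings are illustrative only.
toy = pd.DataFrame({
    'Measure ID': ['M_A', 'M_A', 'M_B', 'M_B', 'M_C'],
    'Start Date': ['07/01/2017', '07/01/2017', '01/01/2019', '01/01/2019', '01/01/2017'],
    'End Date':   ['12/01/2019', '12/01/2019', '12/24/2019', '12/24/2019', '12/01/2019'],
})

def reporting_periods(df):
    """Return one row per distinct (measure, start, end) with a 2019-only flag."""
    periods = df[['Measure ID', 'Start Date', 'End Date']].drop_duplicates().copy()
    # Dates are stored as MM/DD/YYYY strings, so parse them explicitly.
    for col in ['Start Date', 'End Date']:
        periods[col] = pd.to_datetime(periods[col], format='%m/%d/%Y')
    periods['Within 2019'] = (
        (periods['Start Date'].dt.year == 2019)
        & (periods['End Date'].dt.year == 2019)
    )
    return periods.reset_index(drop=True)

periods = reporting_periods(toy)
print(periods)
```

Calling `reporting_periods(unpland)` on the real frame should work the same way, assuming every measure row carries parseable `MM/DD/YYYY` dates.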

In [11]:
list(enumerate(surveys.columns.tolist()))

[(0, 'Facility ID'),
 (1, 'Facility Name'),
 (2, 'Address'),
 (3, 'City'),
 (4, 'State'),
 (5, 'ZIP Code'),
 (6, 'County Name'),
 (7, 'Phone Number'),
 (8, 'HCAHPS Measure ID'),
 (9, 'HCAHPS Question'),
 (10, 'HCAHPS Answer Description'),
 (11, 'Patient Survey Star Rating'),
 (12, 'Patient Survey Star Rating Footnote'),
 (13, 'HCAHPS Answer Percent'),
 (14, 'HCAHPS Answer Percent Footnote'),
 (15, 'HCAHPS Linear Mean Value'),
 (16, 'Number of Completed Surveys'),
 (17, 'Number of Completed Surveys Footnote'),
 (18, 'Survey Response Rate Percent'),
 (19, 'Survey Response Rate Percent Footnote'),
 (20, 'Start Date'),
 (21, 'End Date')]

In [14]:
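# Keep facility, measure ID, and star rating for rows whose measure ID ends in STAR_RATING.
star_ratings = surveys.loc[surveys['HCAHPS Measure ID'].str.endswith('STAR_RATING', na=False), ['Facility ID', 'HCAHPS Measure ID', 'Patient Survey Star Rating']]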

In [16]:
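# Keep facility, measure ID, and linear mean value for rows whose measure ID ends in LINEAR_SCORE.
linear_scores = surveys.loc[surveys['HCAHPS Measure ID'].str.endswith('LINEAR_SCORE', na=False), ['Facility ID', 'HCAHPS Measure ID', 'HCAHPS Linear Mean Value']]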

In [18]:
list(enumerate(unpland.columns.tolist()))

[(0, 'Facility ID'),
 (1, 'Facility Name'),
 (2, 'Address'),
 (3, 'City'),
 (4, 'State'),
 (5, 'ZIP Code'),
 (6, 'County Name'),
 (7, 'Phone Number'),
 (8, 'Measure ID'),
 (9, 'Measure Name'),
 (10, 'Compared to National'),
 (11, 'Denominator'),
 (12, 'Score'),
 (13, 'Lower Estimate'),
 (14, 'Higher Estimate'),
 (15, 'Number of Patients'),
 (16, 'Number of Patients Returned'),
 (17, 'Footnote'),
 (18, 'Start Date'),
 (19, 'End Date')]

#### For the most part, footnotes explain why a value is missing or describe a restriction that applies to a measure. At this stage they are deemed unimportant for the analysis.
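
A quick check of that reading is to cross-tabulate "value is missing" against "footnote is present". The sketch below uses a hypothetical frame `toy` in place of `unpland`; the scores and footnote codes are invented for illustration and do not come from the file.

```python
import pandas as pd

# Hypothetical stand-in for `unpland`; scores and footnote codes are made up.
toy = pd.DataFrame({
    'Score': [15.2, None, 14.8, None, 16.1],
    'Footnote': pd.array([None, '1', None, '5', '7'], dtype='string'),
})

# Rows: is the score missing? Columns: does the row carry a footnote?
footnote_vs_missing = pd.crosstab(
    toy['Score'].isna().rename('Score missing'),
    toy['Footnote'].notna().rename('Has footnote'),
)
print(footnote_vs_missing)
```

If footnotes mostly explain missing values, the "Score missing = True, Has footnote = False" cell should be at or near zero on the real data, while some rows with a score may still carry a footnote describing a restriction.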