# Hospital Readmissions Data Analysis and Recommendations for Reduction

### Background
In October 2012, the US government's Centers for Medicare & Medicaid Services (CMS) began reducing Medicare payments for Inpatient Prospective Payment System hospitals with excess readmissions. Excess readmissions are measured by a ratio: a hospital’s number of “predicted” 30-day readmissions for heart attack, heart failure, and pneumonia divided by the number that would be “expected,” based on an average hospital with similar patients. A ratio greater than 1 indicates excess readmissions.
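
As a small illustration of the ratio described above, the sketch below computes it for made-up numbers. The function name `excess_readmission_ratio` and both rates are hypothetical and are not CMS values.

```python
def excess_readmission_ratio(predicted_rate, expected_rate):
    """Return predicted / expected; values above 1 indicate excess readmissions."""
    if expected_rate <= 0:
        raise ValueError("expected_rate must be positive")
    return predicted_rate / expected_rate

# Hypothetical hospital: 21.0% predicted vs. 20.0% expected 30-day readmissions
ratio = excess_readmission_ratio(21.0, 20.0)
print(round(ratio, 3), "excess" if ratio > 1 else "no excess")
```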

### Exercise Directions

In this exercise, you will:
+ critique a preliminary analysis of readmissions data and recommendations (provided below) for reducing the readmissions rate
+ construct a statistically sound analysis and make recommendations of your own 

More instructions are provided below. Include your work **in this notebook and submit to your GitHub account**. 

### Resources
+ Data source: https://data.medicare.gov/Hospital-Compare/Hospital-Readmission-Reduction/9n3s-kdb3
+ More information: http://www.cms.gov/Medicare/medicare-fee-for-service-payment/acuteinpatientPPS/readmissions-reduction-program.html
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
****

In [None]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import bokeh.plotting as bkp
import scipy.stats as stats
import warnings
import datetime as dt
from mpl_toolkits.axes_grid1 import make_axes_locatable

In [None]:
# read in readmissions data provided
hospital_read_df = pd.read_csv('data/cms_hospital_readmissions.csv')

****
## Preliminary Analysis

In [None]:
# deal with missing and inconvenient portions of data 
clean_hospital_read_df = hospital_read_df[hospital_read_df['Number of Discharges'] != 'Not Available'].copy()
# plain column assignment (rather than .loc[:, col] = ...) so the dtype actually becomes int
clean_hospital_read_df['Number of Discharges'] = clean_hospital_read_df['Number of Discharges'].astype(int)
clean_hospital_read_df = clean_hospital_read_df.sort_values('Number of Discharges')

In [None]:
# generate a scatterplot for number of discharges vs. excess rate of readmissions
# lists work better with matplotlib scatterplot function
x = list(clean_hospital_read_df['Number of Discharges'][81:-3])
y = list(clean_hospital_read_df['Excess Readmission Ratio'][81:-3])

fig, ax = plt.subplots(figsize=(8,5))
ax.scatter(x, y,alpha=0.2)

ax.fill_between([0,350], 1.15, 2, facecolor='red', alpha = .15, interpolate=True)
ax.fill_between([800,2500], .5, .95, facecolor='green', alpha = .15, interpolate=True)

ax.set_xlim([0, max(x)])
ax.set_xlabel('Number of discharges', fontsize=12)
ax.set_ylabel('Excess rate of readmissions', fontsize=12)
ax.set_title('Scatterplot of number of discharges vs. excess rate of readmissions', fontsize=14)

ax.grid(True)
fig.tight_layout()

In [None]:
clean_hospital_read_df.head()

In [None]:
clean_hospital_read_df = pd.concat([clean_hospital_read_df, pd.get_dummies(clean_hospital_read_df['Measure Name'])],
                                   sort=False, axis=1).drop(['Measure Name'], axis=1)

In [None]:
clean_hospital_read_df.columns

In [None]:
notnull_idx = clean_hospital_read_df['Excess Readmission Ratio'].notnull()

In [None]:
cols = ['Number of Discharges','Excess Readmission Ratio', 'Predicted Readmission Rate',
       'Expected Readmission Rate', 'Number of Readmissions', 'Start Date',
       'End Date', 'READM-30-AMI-HRRP', 'READM-30-COPD-HRRP',
       'READM-30-HF-HRRP', 'READM-30-HIP-KNEE-HRRP', 'READM-30-PN-HRRP']
df_not_null = clean_hospital_read_df.loc[notnull_idx,cols].copy()
# parse the reporting-period dates as datetimes
df_not_null['End Date'] = pd.to_datetime(df_not_null['End Date'], format='%m/%d/%Y')
df_not_null['Start Date'] = pd.to_datetime(df_not_null['Start Date'], format='%m/%d/%Y')
# note: despite the column name, this is the length of the reporting period, not a patient's length of stay
df_not_null['LengthOfStayInDays'] = (df_not_null['End Date'] - df_not_null['Start Date']).dt.days
df_not_null.head()

In [None]:
df_not_null['Start Date'].value_counts()

In [None]:
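# correlate the period length with the excess readmission ratio
# (correlating the column with itself always gives 1, or nan if it is constant)
np.corrcoef(df_not_null[['LengthOfStayInDays', 'Excess Readmission Ratio']].T)[0]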

In [None]:
cols1 = ['Number of Discharges','Excess Readmission Ratio', 'Predicted Readmission Rate',
       'Expected Readmission Rate', 'Number of Readmissions', 'READM-30-AMI-HRRP', 'READM-30-COPD-HRRP',
       'READM-30-HF-HRRP', 'READM-30-HIP-KNEE-HRRP', 'READM-30-PN-HRRP']


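# correlation (not covariance) matrix, despite the variable name;
# cast to float so boolean indicator columns do not yield an object array
cov_mat = np.corrcoef(df_not_null[cols1].astype(float).T)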
cov_mat[0]

In [None]:
colz = ['NumOfDischarges       ', 'ExcessReadmissRatio ', 'PredictReadmissRate   ', 'ExpectedReadmRate    ',
        'NumbOfReadmissions    ', 'READM-30-AMI-HRRP   ', 'READM-30-COPD-HRRP    ', 'READM-30-HF-HRRP     ', 
        'READM-30-HIP-KNEE-HRRP', 'READM-30-PN-HRRP    ']

def format_cov(mat,columns):
    for i in range(len(columns)):
        c = columns[i][:20] + ': \t' + '\t' * i
        for a in mat[i][i:]:
            val = str(round(a,4))
            if a > 0.0:
                c =  c + ' '+ (val+ '000')[:6] + ' '
            else:
                c = c + (val+ '000')[:7] + ' '
        #print(len(c))  #len is 92
        print(c)
        
format_cov(cov_mat,colz)

In [None]:
print(cols1)

In [None]:
colls = ['Number of Discharges','Excess Readmission Ratio', 'Predicted Readmission Rate',
       'Expected Readmission Rate', 'Number of Readmissions']

df_new = pd.concat([clean_hospital_read_df.loc[notnull_idx,colls], \
                    pd.get_dummies( clean_hospital_read_df.loc[notnull_idx,'State'])],\
          sort=False, axis=1)

df_new.head()

In [None]:
# correlation between the excess readmission ratio and each state's indicator column
st_list = []
for st in clean_hospital_read_df.loc[notnull_idx, 'State'].unique():
    r = df_new['Excess Readmission Ratio'].corr(df_new[st].astype(float))
    st_list.append((r, st))
sorted(st_list, reverse=True)[:5]

In [None]:
df_new['AK'].isnull().any()

In [None]:
_ = sns.pairplot(df_not_null[cols1])

****

## Preliminary Report

Read the following results/report. While reading it, consider whether the conclusions are correct, incorrect, misleading, or unfounded, and what you would change or what additional analyses you would perform.

**A. Initial observations based on the plot above**
+ Overall, rate of readmissions is trending down with increasing number of discharges
+ With lower number of discharges, there is a greater incidence of excess rate of readmissions (area shaded red)
+ With higher number of discharges, there is a greater incidence of lower rates of readmissions (area shaded green) 

**B. Statistics**
+ In hospitals/facilities with number of discharges < 100, mean excess readmission rate is 1.023 and 63% have excess readmission rate greater than 1 
+ In hospitals/facilities with number of discharges > 1000, mean excess readmission rate is 0.978 and 44% have excess readmission rate greater than 1 

**C. Conclusions**
+ There is a significant correlation between hospital capacity (number of discharges) and readmission rates. 
+ Smaller hospitals/facilities may be lacking necessary resources to ensure quality care and prevent complications that lead to readmissions.

**D. Regulatory policy recommendations**
+ Hospitals/facilities with small capacity (< 300) should be required to demonstrate upgraded resource allocation for quality care to continue operation.
+ Directives and incentives should be provided for consolidation of hospitals and facilities to have a smaller number of them with higher capacity and number of discharges.
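
One way to probe conclusion C is to test the correlation directly and to look at its size as well as its p-value. The sketch below does this on synthetic data: the slope, noise level, and sample size are invented for illustration and are not taken from the CMS file. On the real data, the same call could be applied to the `Number of Discharges` and `Excess Readmission Ratio` columns.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the real columns (all parameters are assumptions)
rng = np.random.default_rng(0)
n = 10_000
discharges = rng.uniform(25, 2000, size=n)
ratio = 1.02 - 3e-5 * discharges + rng.normal(0, 0.09, size=n)

# H0: no linear association between discharges and the excess readmission ratio
r, p_value = stats.pearsonr(discharges, ratio)

alpha = 0.01
print(f"Pearson r = {r:.3f}, p-value = {p_value:.3g}")
print("reject H0 at alpha = 0.01:", p_value < alpha)
# r**2 is the share of variance explained; a tiny value means the effect is
# statistically detectable but practically small
print(f"r^2 = {r**2:.3f}")
```

With a large sample, even a weak association can produce a very small p-value, so the effect size is worth reporting alongside the test result.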

In [None]:
lt_100_idx = df_new.loc[df_new['Number of Discharges'] < 100, :].index
gt_1000_idx = df_new.loc[df_new['Number of Discharges'] > 1000, :].index
total_lt100 = df_new.loc[lt_100_idx, 'Excess Readmission Ratio'].count()
total_gt1000 = df_new.loc[gt_1000_idx, 'Excess Readmission Ratio'].count()
mu_lt_100 = df_new.loc[lt_100_idx, 'Excess Readmission Ratio'].mean()
print('Mean Excess Readmission Rate for Number of Discharges < 100:',mu_lt_100)
pct_lt_100 = round((np.sum(df_new.loc[lt_100_idx, 'Excess Readmission Ratio'] > 1.0) / total_lt100)*100)
# 'w/ #' avoids the invalid escape sequence '\#' in the original string literal
print('Percent of hospitals w/ # of Discharges < 100 that have Excess Readmission Rate > 1.0:',pct_lt_100)
print('Mean Excess Readmission Rate for Number of Discharges > 1000:',
      df_new.loc[gt_1000_idx, 'Excess Readmission Ratio'].mean())
print('Percent of hospitals w/ # of Discharges > 1000 that have Excess Readmission Rate > 1.0:',
      round((np.sum(df_new.loc[gt_1000_idx, 'Excess Readmission Ratio'] > 1.0) / total_gt1000)*100))
df_new['Excess Readmission Ratio'].mean()

In [None]:
df_new.shape[0]

In [None]:
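t_df = (df_new.loc[lt_100_idx, 'Excess Readmission Ratio'] > 1.0)
all_gt_1 = (df_new.loc[:, 'Excess Readmission Ratio'] > 1.0)
percent_above_1_total = all_gt_1.mean()  # proportion of all rows with ratio > 1
print(round(percent_above_1_total, 2))
# standard deviation of a Bernoulli proportion is sqrt(p*(1-p)), not p*(1-p)
pct_std = np.sqrt(percent_above_1_total*(1-percent_above_1_total))
print('pct_std', pct_std)
pct_sem = pct_std / np.sqrt(len(all_gt_1))
print(pct_sem)
# mean ratio among < 100 discharge rows that exceed 1, then among all < 100 discharge rows
# (select by index label; a boolean array of subgroup length cannot mask the full frame)
print(df_new.loc[t_df[t_df].index, 'Excess Readmission Ratio'].mean())
print(df_new.loc[t_df.index, 'Excess Readmission Ratio'].mean())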
total_mu = df_new.loc[:, 'Excess Readmission Ratio'].mean()
total_std = df_new.loc[:, 'Excess Readmission Ratio'].std()
total_sem = total_std / np.sqrt(int(df_new.shape[0]))
print(total_mu)
print(total_std)
print(total_sem)

In [None]:
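print(total_mu + 3*total_sem)
print(total_mu + 3*total_std)
# one-sample t-test of the < 100 discharges group against the overall mean;
# the standard error must come from the subgroup's own spread and size
lt_100 = df_new.loc[lt_100_idx, 'Excess Readmission Ratio']
tstat = (mu_lt_100 - total_mu) / lt_100.sem()
print(tstat)
print((mu_lt_100 - total_mu)/total_std)  # standardized effect size
print(stats.t.sf(np.abs(tstat), len(lt_100)-1)*2)  # two-sided p-value
print(stats.t.ppf(0.01, len(lt_100)-1))  # one-sided critical value at alpha = 0.01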

In [None]:
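# upper threshold: three standard errors above the overall mean (not a margin of error)
threshold = total_mu + 3*total_sem
print('threshold:', threshold)
d_array = df_new.loc[lt_100_idx,'Excess Readmission Ratio'] > threshold
# count the < 100 discharge rows above the threshold
d_array.sum()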

In [None]:
df_new.columns

****
### Exercise

Include your work on the following **in this notebook and submit to your GitHub account**. 

A. Do you agree with the above analysis and recommendations? Why or why not?
   
B. Provide support for your arguments and your own recommendations with a statistically sound analysis:

   1. Set up an appropriate hypothesis test.
   2. Compute and report the observed significance value (or p-value).
   3. Report statistical significance for $\alpha$ = .01. 
   4. Discuss statistical significance and practical significance. Do they differ here? How does this change your recommendation to the client?
   5. Look at the scatterplot above. 
      - What are the advantages and disadvantages of using this plot to convey information?
      - Construct another plot that conveys the same information in a more direct manner.
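
For item 5, one possible alternative to the raw scatterplot is a plot of the mean excess readmission ratio per discharge band, with error bars. The sketch below uses synthetic data. The band edges, the data-generating parameters, and the approximate 99% interval (z = 2.576) are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data (assumed parameters, not the CMS file)
rng = np.random.default_rng(1)
df = pd.DataFrame({'Number of Discharges': rng.integers(25, 2000, size=5000)})
df['Excess Readmission Ratio'] = (1.02 - 3e-5 * df['Number of Discharges']
                                  + rng.normal(0, 0.09, size=len(df)))

# Group hospitals into discharge bands and summarize each band
bins = [0, 100, 300, 600, 1000, 2000]
df['band'] = pd.cut(df['Number of Discharges'], bins=bins)
summary = (df.groupby('band', observed=True)['Excess Readmission Ratio']
             .agg(['mean', 'sem', 'count']))

# Mean ratio per band with approximate 99% confidence intervals (z = 2.576)
fig, ax = plt.subplots(figsize=(8, 5))
ax.errorbar(range(len(summary)), summary['mean'],
            yerr=2.576 * summary['sem'], fmt='o', capsize=4)
ax.axhline(1.0, color='gray', linestyle='--')  # a ratio of 1 means no excess
ax.set_xticks(range(len(summary)))
ax.set_xticklabels([str(b) for b in summary.index])
ax.set_xlabel('Number of discharges (band)')
ax.set_ylabel('Mean excess readmission ratio')
fig.tight_layout()
```

Compared with the scatterplot, this view shows the size of the trend and its uncertainty directly, at the cost of hiding individual hospitals and depending on the choice of bands.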



You can compose in notebook cells using Markdown: 
+ In the control panel at the top, choose Cell > Cell Type > Markdown
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
****

In [None]:
# Your turn