# Hospital Readmissions Data Analysis and Recommendations for Reduction

### Background
In October 2012, the US government's Centers for Medicare & Medicaid Services (CMS) began reducing Medicare payments for Inpatient Prospective Payment System hospitals with excess readmissions. Excess readmissions are measured by a ratio: a hospital’s number of “predicted” 30-day readmissions for heart attack, heart failure, and pneumonia divided by the number that would be “expected,” based on an average hospital with similar patients. A ratio greater than 1 indicates excess readmissions.

### Exercise Directions

In this exercise, you will:
+ critique a preliminary analysis of readmissions data and recommendations (provided below) for reducing the readmissions rate
+ construct a statistically sound analysis and make recommendations of your own 

More instructions are provided below. Include your work **in this notebook and submit to your Github account**. 

### Resources
+ Data source: https://data.medicare.gov/Hospital-Compare/Hospital-Readmission-Reduction/9n3s-kdb3
+ More information: http://www.cms.gov/Medicare/medicare-fee-for-service-payment/acuteinpatientPPS/readmissions-reduction-program.html
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
****

In [None]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns
import bokeh.plotting as bkp
from mpl_toolkits.axes_grid1 import make_axes_locatable

In [None]:
# read in readmissions data provided
hospital_read_df = pd.read_csv('data/cms_hospital_readmissions.csv')

****
## Preliminary Analysis

In [None]:
# deal with missing and inconvenient portions of data 
# take an explicit copy so the dtype conversion below applies cleanly (avoids SettingWithCopyWarning)
clean_hospital_read_df = hospital_read_df[hospital_read_df['Number of Discharges'] != 'Not Available'].copy()
clean_hospital_read_df['Number of Discharges'] = clean_hospital_read_df['Number of Discharges'].astype(int)
clean_hospital_read_df = clean_hospital_read_df.sort_values('Number of Discharges')

In [None]:
# generate a scatterplot for number of discharges vs. excess rate of readmissions
# lists work better with matplotlib scatterplot function
x = [a for a in clean_hospital_read_df['Number of Discharges'][81:-3]]
y = list(clean_hospital_read_df['Excess Readmission Ratio'][81:-3])

fig, ax = plt.subplots(figsize=(8,5))
ax.scatter(x, y,alpha=0.2)

ax.fill_between([0,350], 1.15, 2, facecolor='red', alpha = .15, interpolate=True)
ax.fill_between([800,2500], .5, .95, facecolor='green', alpha = .15, interpolate=True)

ax.set_xlim([0, max(x)])
ax.set_xlabel('Number of discharges', fontsize=12)
ax.set_ylabel('Excess rate of readmissions', fontsize=12)
ax.set_title('Scatterplot of number of discharges vs. excess rate of readmissions', fontsize=14)

ax.grid(True)
fig.tight_layout()

****

## Preliminary Report

Read the following results/report. While you are reading it, think about if the conclusions are correct, incorrect, misleading or unfounded. Think about what you would change or what additional analyses you would perform.

**A. Initial observations based on the plot above**
+ Overall, rate of readmissions is trending down with increasing number of discharges
+ With lower number of discharges, there is a greater incidence of excess rate of readmissions (area shaded red)
+ With higher number of discharges, there is a greater incidence of lower rates of readmissions (area shaded green) 

**B. Statistics**
+ In hospitals/facilities with number of discharges < 100, mean excess readmission rate is 1.023 and 63% have excess readmission rate greater than 1 
+ In hospitals/facilities with number of discharges > 1000, mean excess readmission rate is 0.978 and 44% have excess readmission rate greater than 1 

**C. Conclusions**
+ There is a significant correlation between hospital capacity (number of discharges) and readmission rates. 
+ Smaller hospitals/facilities may be lacking necessary resources to ensure quality care and prevent complications that lead to readmissions.

**D. Regulatory policy recommendations**
+ Hospitals/facilities with small capacity (< 300) should be required to demonstrate upgraded resource allocation for quality care to continue operation.
+ Directives and incentives should be provided for consolidation of hospitals and facilities to have a smaller number of them with higher capacity and number of discharges.

****
### Exercise

Include your work on the following **in this notebook and submit to your Github account**. 

A. Do you agree with the above analysis and recommendations? Why or why not?
   
B. Provide support for your arguments and your own recommendations with a statistically sound analysis:

   1. Setup an appropriate hypothesis test.
   2. Compute and report the observed significance value (or p-value).
   3. Report statistical significance for $\alpha$ = .01. 
   4. Discuss statistical significance and practical significance. Do they differ here? How does this change your recommendation to the client?
   5. Look at the scatterplot above. 
      - What are the advantages and disadvantages of using this plot to convey information?
      - Construct another plot that conveys the same information in a more direct manner.



You can compose in notebook cells using Markdown: 
+ In the control panel at the top, choose Cell > Cell Type > Markdown
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
****

In [None]:
# Your turn

There are issues with the above conclusions, which are presented without statistical backing or a description of methods.

1) No hypothesis tests were performed; the initial report is based only on a visual reading of the graph.

2) The report switches between defining small hospitals as <100 discharges and <300 discharges. Splitting the sample into <100 and >1000 leaves out about 85% of it. The <100 group also contains almost 3x as many hospitals as the >1000 group.
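
The coverage figures in point 2 can be checked with a small helper. The sketch below is illustrative only: `group_coverage` is a hypothetical name, and the lognormal sample is a synthetic stand-in for the 'Number of Discharges' column, not the CMS data.

```python
import numpy as np
import pandas as pd

def group_coverage(discharges, low=100, high=1000):
    """Count hospitals below `low` and above `high`, and the share left out of both groups."""
    s = pd.Series(discharges, dtype=float)
    n = len(s)
    n_low = int((s < low).sum())
    n_high = int((s > high).sum())
    return {
        'n_low': n_low,
        'n_high': n_high,
        'share_low': n_low / n,
        'share_high': n_high / n,
        'share_ignored': 1 - (n_low + n_high) / n,
    }

# Synthetic, right-skewed stand-in for 'Number of Discharges' (assumption, not real data)
rng = np.random.default_rng(0)
fake_discharges = rng.lognormal(mean=5.6, sigma=0.7, size=5000).round()
print(group_coverage(fake_discharges))
```

On the real data, the same call would be `group_coverage(dis_ex_df['Number of Discharges'])` once that frame has been built.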

In [None]:
dis_ex_df = hospital_read_df.loc[:, ['Number of Discharges',
                                     'Excess Readmission Ratio']]
dis_ex_df = dis_ex_df[dis_ex_df['Number of Discharges'] != 'Not Available']
dis_ex_df = dis_ex_df.dropna()
dis_ex_df['Number of Discharges'] = pd.to_numeric(dis_ex_df['Number of Discharges'])

To begin, I will recompute the statistics quoted in the preliminary report and then run hypothesis tests to check its conclusions.

In [None]:
under100 = dis_ex_df[dis_ex_df['Number of Discharges'] < 100]
over1000 = dis_ex_df[dis_ex_df['Number of Discharges'] > 1000]

In [None]:
print('Mean Excess Readmission Ratio:', 
      under100['Excess Readmission Ratio'].mean())
print('Percent of Hospitals over 1.0 Excess Readmission Ratio:', 
      100 - stats.percentileofscore(under100['Excess Readmission Ratio'], 1))

The statistics quoted in the preliminary report are accurate for hospitals with fewer than 100 discharges.
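
As a cross-check on the percentage, the share of values strictly above 1 can also be computed directly from a boolean mask. This is a minimal sketch on toy numbers, with `pct_above` as a hypothetical helper. Note that `stats.percentileofscore` defaults to `kind='rank'`, which averages over ties at the threshold, so the two approaches may differ slightly when some ratios equal exactly 1; `kind='weak'` counts values less than or equal to the score and therefore matches the strict comparison.

```python
import pandas as pd
from scipy import stats

def pct_above(values, threshold=1.0):
    """Percentage of values strictly greater than `threshold`."""
    s = pd.Series(values, dtype=float)
    return 100.0 * (s > threshold).mean()

# Toy ratios with two ties at the threshold (illustrative, not CMS data)
toy = [0.9, 1.0, 1.0, 1.1, 1.2]
direct = pct_above(toy)
via_weak = 100 - stats.percentileofscore(toy, 1.0, kind='weak')
print(direct, via_weak)
```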

In [None]:
print('Mean Excess Readmission Ratio:', 
      over1000['Excess Readmission Ratio'].mean())
print('Percent of Hospitals over 1.0 Excess Readmission Ratio:', 
      100 - stats.percentileofscore(over1000['Excess Readmission Ratio'], 1))

The statistics quoted are also accurate for hospitals with more than 1000 discharges.

Next, I want to test the report's conclusion that there is a significant correlation between Excess Readmission Ratio and number of discharges. The null hypothesis is that there is no correlation. The alternative hypothesis is that the correlation is nonzero. I will use $\alpha$ = 0.01.

In [None]:
stats.pearsonr(dis_ex_df['Number of Discharges'],
               dis_ex_df['Excess Readmission Ratio'])

There is a negative correlation between the two variables, and with a p-value of essentially 0 it is statistically significant at $\alpha$ = 0.01. On the question of whether a correlation exists, the preliminary report's conclusion holds up.
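
Because a p-value says nothing about how strong a correlation is, it can help to report $r$ and $r^2$ next to it. The sketch below shows one way to bundle these. `correlation_report` is a hypothetical helper, and the data are synthetic with a deliberately weak negative slope, so the printed numbers are not the CMS results.

```python
import numpy as np
from scipy import stats

def correlation_report(x, y, alpha=0.01):
    """Pearson r, r squared, p-value, and whether p falls below alpha."""
    r, p = stats.pearsonr(x, y)
    return {'r': r, 'r_squared': r ** 2, 'p_value': p, 'significant': bool(p < alpha)}

# Synthetic example (assumption): weak negative trend buried in noise
rng = np.random.default_rng(1)
x = rng.lognormal(5.6, 0.7, size=10000)
y = 1.0 - 0.00003 * x + rng.normal(0, 0.09, size=10000)
report = correlation_report(x, y)
print(report)
```

On the real data, the inputs would be the two columns of `dis_ex_df` passed to `stats.pearsonr` above.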

Next, I am going to redo the comparison with one group of hospitals with <=300 discharges and another with >300 discharges. This gives a more even split. The box-and-whisker plot and histogram below show how the vast majority of hospitals are ignored by the original grouping.

In [11]:
_ = sns.set()
_ = sns.boxplot(x=clean_hospital_read_df['Number of Discharges'])


In [12]:
# `normed` was removed from matplotlib's hist; `density=True` is the replacement
_ = clean_hospital_read_df['Number of Discharges'].plot(kind='hist', bins=150, density=True, xlim=(0,2500))

The data is strongly skewed right.
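
The skew visible in the histogram can also be quantified. A minimal sketch, assuming a hypothetical helper `skew_summary` and toy values rather than the CMS data: positive sample skewness, together with a mean above the median, is the usual signature of a right-skewed distribution.

```python
import numpy as np
from scipy import stats

def skew_summary(values):
    """Sample skewness plus mean and median for a quick right-skew check."""
    arr = np.asarray(values, dtype=float)
    return {
        'skewness': float(stats.skew(arr)),
        'mean': float(arr.mean()),
        'median': float(np.median(arr)),
    }

# Toy values with one large outlier on the right (illustrative only)
print(skew_summary([1, 1, 1, 2, 10]))
```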

The two groups the preliminary analysis looks at cover only about 15% of the dataset, so even if its findings were statistically significant (it provides no hypothesis test results to corroborate its statements), they are not representative of the entire dataset. I am going to compare two groups that together span the entire dataset: hospitals with >300 discharges and those with <=300 discharges. The null hypothesis is that there is no difference in mean Excess Readmission Ratio between the groups. The alternative hypothesis is that hospitals with fewer discharges have higher Excess Readmission Ratios.

In [13]:
under300 = dis_ex_df[dis_ex_df['Number of Discharges'] <= 300]
over300 = dis_ex_df[dis_ex_df['Number of Discharges'] > 300]

In [14]:
under300['Excess Readmission Ratio'].describe()

In [15]:
over300['Excess Readmission Ratio'].describe()

The mean Excess Readmission Ratios of these two groups are much closer to each other than those of the <100 and >1000 groups. It is therefore worth testing whether the remaining difference is significant. I use Welch's t-test (`equal_var=False`), which does not assume equal variances in the two groups. The null hypothesis is that the two group means are equal, and the alternative hypothesis is that they differ. The alpha is set to 0.01.

In [16]:
stats.ttest_ind(under300['Excess Readmission Ratio'], 
                over300['Excess Readmission Ratio'], equal_var=False)

The difference in means is statistically significant at $\alpha$ = 0.01. Larger hospitals tend to have lower Excess Readmission Ratios.
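
The alternative hypothesis stated earlier is directional (hospitals with fewer discharges have higher ratios), while `stats.ttest_ind` is two-sided by default. A minimal sketch of the one-sided variant follows, assuming SciPy 1.6 or later, which added the `alternative` argument. `welch_one_sided` is a hypothetical helper, and the samples are synthetic, not the CMS groups.

```python
import numpy as np
from scipy import stats

def welch_one_sided(small, large, alpha=0.01):
    """Welch's t-test of H1: mean(small) > mean(large)."""
    t, p = stats.ttest_ind(small, large, equal_var=False, alternative='greater')
    return {'t': float(t), 'p_value': float(p), 'reject_null': bool(p < alpha)}

# Synthetic groups (assumption): a slightly higher mean in the "small" group
rng = np.random.default_rng(2)
small_group = rng.normal(1.02, 0.09, size=3000)
large_group = rng.normal(1.00, 0.09, size=3000)
print(welch_one_sided(small_group, large_group))
```

On the real data, the call would be `welch_one_sided(under300['Excess Readmission Ratio'], over300['Excess Readmission Ratio'])`.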

### 4. Statistical and practical significance
Statistical significance means that the observed difference is unlikely to be due to sampling error alone. With a sample this large, even a small difference can reach significance, so significance by itself does not say how large the difference is. Practical significance concerns what the findings mean in real life: what does the observed difference imply for the problem at hand? In this case, the two are aligned. There are tangible, significant differences in performance between hospitals, so there are practical steps that can be taken following this hypothesis testing. However, although these results show a significant difference, they cannot explain why that difference is present. Investigating the causes of the differing performance would yield more fruitful changes.
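
One common way to put a number on practical significance is a standardized effect size such as Cohen's d, the difference in means divided by the pooled standard deviation. The sketch below is not part of the analysis above; `cohens_d` is a hypothetical helper, and the inputs shown are toy values.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d: (mean(a) - mean(b)) / pooled sample standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))

# Toy example: means differ by exactly one pooled standard deviation
print(cohens_d([1, 2, 3], [2, 3, 4]))
```

Applied to `under300['Excess Readmission Ratio']` and `over300['Excess Readmission Ratio']`, this would express the gap between the groups in standard-deviation units, which is easier to weigh against policy costs than a p-value.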