# Hospital Readmissions Data Analysis and Recommendations for Reduction

### Background
In October 2012, the US government's Centers for Medicare & Medicaid Services (CMS) began reducing Medicare payments for Inpatient Prospective Payment System hospitals with excess readmissions. Excess readmissions are measured by a ratio: a hospital’s number of “predicted” 30-day readmissions for heart attack, heart failure, and pneumonia divided by the number that would be “expected,” based on an average hospital with similar patients. A ratio greater than 1 indicates excess readmissions.
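
As a minimal sketch of the ratio described above (the function name and the numbers are illustrative assumptions, not CMS data or CMS code):

```python
def excess_readmission_ratio(predicted: float, expected: float) -> float:
    """Ratio of predicted to expected 30-day readmissions."""
    if expected <= 0:
        raise ValueError("expected readmissions must be positive")
    return predicted / expected


# Hypothetical hospital: 23 predicted readmissions against 20 expected
ratio = excess_readmission_ratio(23, 20)
print(ratio, "excess readmissions" if ratio > 1 else "no excess readmissions")
```

A value above 1 means the hospital is predicted to readmit more patients than an average hospital with a similar case mix would.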

### Exercise Directions

In this exercise, you will:
+ critique a preliminary analysis of readmissions data and recommendations (provided below) for reducing the readmissions rate
+ construct a statistically sound analysis and make recommendations of your own 

More instructions provided below. Include your work **in this notebook and submit to your Github account**. 

### Resources
+ Data source: https://data.medicare.gov/Hospital-Compare/Hospital-Readmission-Reduction/9n3s-kdb3
+ More information: http://www.cms.gov/Medicare/medicare-fee-for-service-payment/acuteinpatientPPS/readmissions-reduction-program.html
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
****

In [None]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import bokeh.plotting as bkp
from mpl_toolkits.axes_grid1 import make_axes_locatable
import scipy.stats as stats

In [None]:
# read in readmissions data provided
hospital_read_df = pd.read_csv('data/cms_hospital_readmissions.csv')

****
## Preliminary Analysis

In [None]:
# deal with missing and inconvenient portions of data 
clean_hospital_read_df = hospital_read_df[hospital_read_df['Number of Discharges'] != 'Not Available'].copy()
clean_hospital_read_df['Number of Discharges'] = clean_hospital_read_df['Number of Discharges'].astype(int)
clean_hospital_read_df = clean_hospital_read_df.sort_values('Number of Discharges')
clean_hospital_read_df.tail()

In [None]:
# generate a scatterplot for number of discharges vs. excess rate of readmissions
# lists work better with matplotlib scatterplot function
x = [a for a in clean_hospital_read_df['Number of Discharges'][81:-3]]
y = list(clean_hospital_read_df['Excess Readmission Ratio'][81:-3])

fig, ax = plt.subplots(figsize=(8,5))
ax.scatter(x, y,alpha=0.2)

ax.fill_between([0,350], 1.15, 2, facecolor='red', alpha = .15, interpolate=True)
ax.fill_between([800,2500], .5, .95, facecolor='green', alpha = .15, interpolate=True)

ax.set_xlim([0, max(x)])
ax.set_xlabel('Number of discharges', fontsize=12)
ax.set_ylabel('Excess rate of readmissions', fontsize=12)
ax.set_title('Scatterplot of number of discharges vs. excess rate of readmissions', fontsize=14)

ax.grid(True)
fig.tight_layout()

****

## Preliminary Report

Read the following results/report. While you are reading it, think about if the conclusions are correct, incorrect, misleading or unfounded. Think about what you would change or what additional analyses you would perform.

**A. Initial observations based on the plot above**
+ Overall, rate of readmissions is trending down with increasing number of discharges
+ With lower number of discharges, there is a greater incidence of excess rate of readmissions (area shaded red)
+ With higher number of discharges, there is a greater incidence of lower rates of readmissions (area shaded green) 

**B. Statistics**
+ In hospitals/facilities with number of discharges < 100, mean excess readmission rate is 1.023 and 63% have excess readmission rate greater than 1 
+ In hospitals/facilities with number of discharges > 1000, mean excess readmission rate is 0.978 and 44% have excess readmission rate greater than 1 

**C. Conclusions**
+ There is a significant correlation between hospital capacity (number of discharges) and readmission rates. 
+ Smaller hospitals/facilities may be lacking necessary resources to ensure quality care and prevent complications that lead to readmissions.

**D. Regulatory policy recommendations**
+ Hospitals/facilities with small capacity (< 300) should be required to demonstrate upgraded resource allocation for quality care to continue operation.
+ Directives and incentives should be provided for consolidation of hospitals and facilities to have a smaller number of them with higher capacity and number of discharges.

****
<div class="span5 alert alert-info">
### Exercise

Include your work on the following **in this notebook and submit to your Github account**. 

A. Do you agree with the above analysis and recommendations? Why or why not?
   
B. Provide support for your arguments and your own recommendations with a statistically sound analysis:

   1. Set up an appropriate hypothesis test.
   2. Compute and report the observed significance value (or p-value).
   3. Report statistical significance for $\alpha$ = .01. 
   4. Discuss statistical significance and practical significance. Do they differ here? How does this change your recommendation to the client?
   5. Look at the scatterplot above. 
      - What are the advantages and disadvantages of using this plot to convey information?
      - Construct another plot that conveys the same information in a more direct manner.



You can compose in notebook cells using Markdown: 
+ In the control panel at the top, choose Cell > Cell Type > Markdown
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
</div>
****

## Do you agree with the above analysis and recommendations? Why or why not?

#### Statement A - Investigating initial observation claims

The preliminary analysis suggests there is an inverse relationship between the rate of readmissions and the number of discharges.

In [None]:
#rename to df
df = clean_hospital_read_df

#plot the relationship between rate of readmissions and number of discharges
x = df['Number of Discharges']
y = df['Excess Readmission Ratio']

fig,ax = plt.subplots(figsize=(16,5))
ax.scatter(x,y,alpha=0.3)
ax.set_xscale('symlog')
ax.set_xlim([10, max(x)])
plt.title('Rate of readmissions vs Number of discharges', fontsize=14)
plt.ylabel('Excess Readmission Ratio')
plt.xlabel('Number of Discharges (log scale)')
plt.show()

In [None]:
# Correlation coefficient (Pearson r), computed two ways
print(df[['Number of Discharges', 'Excess Readmission Ratio']].corr())

# linregress does not skip NaN, so drop rows with a missing value first
valid = df[['Number of Discharges', 'Excess Readmission Ratio']].dropna()
x = valid['Number of Discharges'].astype(float)
y = valid['Excess Readmission Ratio'].astype(float)

slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)

print('The correlation coefficient is', round(r_value, 3))



There is a very slight downward trend in the excess readmission ratio as the number of discharges increases. 
However, the association is weak: the correlation coefficient r is only -0.097, so R² (about 0.009) is very close to 0.
The data therefore show only a weak downward trend between **readmission rate** and **number of discharges**.
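
To make the link between the correlation coefficient r and R² concrete, here is a minimal sketch that computes both by hand. The helper name `pearson_r` and the six data points are made up for illustration; they are not taken from the CMS file.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data (hypothetical): the ratio drifts down as discharges rise
discharges = [50, 120, 300, 450, 800, 1500]
ratios = [1.05, 0.98, 1.02, 1.01, 0.97, 0.99]

r = pearson_r(discharges, ratios)
r_squared = r ** 2  # share of variance explained by a straight-line fit
print(round(r, 3), round(r_squared, 3))

# For the value found in this notebook, r = -0.097:
print(round((-0.097) ** 2, 4))
```

Squaring a small r makes it smaller still, which is why a correlation near -0.1 explains roughly 1% of the variance.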


#### Statement B - Number of Discharges Statistical Claim

In [None]:
#plot the cleaned data to show its distribution
discharge_plot = df['Number of Discharges'].plot(kind='hist',xlim=(0,2500),xticks=[x*100 for x in range(25)], figsize= (15,5),bins=300,grid=True)
discharge_plot.set_xlabel("Number of Discharges")
discharge_plot.set_ylabel("Frequency")
df['Number of Discharges'].describe()

This graph shows the sample is skewed to the right (positively-skewed), meaning it is not normally distributed. 
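
One way to put a number on the skew is the sample skewness, the third central moment divided by the variance raised to the power 1.5. The sketch below is an illustration only: the function name and the discharge counts are hypothetical, and pandas' own `Series.skew()` applies a bias correction, so its value would differ slightly.

```python
def sample_skewness(values):
    """Moment-based skewness: positive means a longer right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Made-up discharge counts with one very large hospital
right_skewed = [100, 150, 200, 250, 300, 400, 2500]
symmetric = [1, 2, 3, 4, 5]

print(sample_skewness(right_skewed), sample_skewness(symmetric))
```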


Section B made the following conclusion:
+ In hospitals/facilities with number of discharges < 100, mean excess readmission rate is 1.023 and 63% have excess readmission rate greater than 1 
+ In hospitals/facilities with number of discharges > 1000, mean excess readmission rate is 0.978 and 44% have excess readmission rate greater than 1 

From the distribution of 'Number of Discharges', we can see that the interquartile range (25th to 75th percentile), which contains the middle 50% of the distribution, runs from 157 to 472.75. By setting 100 and 1000 as the thresholds for the analysis, Section B ignores the majority of the distribution.
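
The quartile argument can be sketched with the standard library alone. The discharge counts below are invented for illustration; `method='inclusive'` interpolates linearly between data points.

```python
from statistics import quantiles

# Made-up discharge counts, for illustration only
discharges = [25, 90, 157, 280, 473, 640, 1200]

# Quartiles with linear interpolation between neighbouring data points
q1, median, q3 = quantiles(discharges, n=4, method='inclusive')
iqr = q3 - q1

# Records that a "< 100 or > 1000" comparison never looks at
ignored = [d for d in discharges if 100 <= d <= 1000]
share_ignored = len(ignored) / len(discharges)

print(q1, median, q3, iqr, round(share_ignored, 2))
```

When the interquartile range sits entirely between the two thresholds, as in this toy example, a comparison of the two tails says nothing about the typical hospital.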

In [None]:
# subsets with < 100 and > 1000 discharges
df_100 = df[df['Number of Discharges'] < 100]
df_100_errgt1 = df_100[df_100['Excess Readmission Ratio'] > 1]['Excess Readmission Ratio']

df_1000 = df[df['Number of Discharges'] > 1000]
df_1000_errgt1 = df_1000[df_1000['Excess Readmission Ratio'] > 1]['Excess Readmission Ratio']

# sizes of the full dataset and of each subset
n = len(df)
discharges_100_len = float(len(df_100))
discharges_1000_len = float(len(df_1000))

# summary statistics; the *_gt1 and *_occur values are percentages
df_100_mean = round(df_100['Excess Readmission Ratio'].mean(), 3)
df_100_gt1 = round(len(df_100_errgt1) / discharges_100_len * 100, 2)
df_100_occur = round(discharges_100_len / n * 100, 2)

df_1000_mean = round(df_1000['Excess Readmission Ratio'].mean(), 3)
df_1000_gt1 = round(len(df_1000_errgt1) / discharges_1000_len * 100, 2)
df_1000_occur = round(discharges_1000_len / n * 100, 2)

ERR_100_mean = df_100['Excess Readmission Ratio'].mean()
ERR_1000_mean = df_1000['Excess Readmission Ratio'].mean()

print('Findings for Discharges < 100')
print('-' * 40)
print('Mean Excess Readmission Rate', df_100_mean)
print(df_100_gt1, '% have excess readmission rate greater than 1')
print('Percentage of records with number of discharges < 100 is', df_100_occur, '%\n\n')

print('Findings for Discharges > 1000')
print('-' * 40)
print('Mean Excess Readmission Rate', df_1000_mean)
print(df_1000_gt1, '% have excess readmission rate greater than 1')
print('Percentage of records with number of discharges > 1000 is', df_1000_occur, '%')

As shown above, the statistics in Section B cover only a small part of the dataset: about 11% of the records have fewer than 100 discharges and about 4% have more than 1000, so roughly 85% of the data is left out.

Furthermore, one of the statistics could not be reproduced. 
Here only 59.18% of hospitals with Number of Discharges < 100 have an excess readmission rate greater than 1, not 63%. Note that the denominator in this calculation still counts records with 0 discharges, which are only cleaned up later; if their Excess Readmission Ratio is missing, 59.18% understates the true share.

#### Correlation between Number of Discharges & readmission rates

In [None]:
df[['Number of Discharges', 'Excess Readmission Ratio']].corr()

As stated above, a correlation of about -0.1 points to a very weak negative linear relationship between the number of discharges and the excess readmission ratio. 
(A correlation of 1 or -1 means a perfect linear relationship; 0 means no linear relationship.)

The claim that "There is a significant correlation between hospital capacity (number of discharges) and readmission rates" is therefore overstated: whatever its statistical significance, the relationship is too weak to be of practical importance.
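
The word "significant" can mean statistically detectable or practically important, and with a weak correlation the two can come apart. The sketch below is a rough illustration under stated assumptions: it uses the usual t statistic for a correlation, treats it as standard normal (reasonable only for large n), and plugs in hypothetical sample sizes. The function name is made up.

```python
import math


def correlation_p_value(r: float, n: int) -> float:
    """Approximate two-sided p-value for H0: true correlation = 0.

    Uses t = r * sqrt((n - 2) / (1 - r^2)) and a normal approximation,
    which is only reasonable when n is large.
    """
    t = r * math.sqrt((n - 2) / (1 - r ** 2))
    return math.erfc(abs(t) / math.sqrt(2))  # equals 2 * (1 - Phi(|t|))


r = -0.097                   # correlation found in this notebook
for n in (100, 10_000):      # hypothetical sample sizes
    print(n, correlation_p_value(r, n), round(r ** 2, 4))
```

The same r gives a large p-value for a small sample and a tiny one for a large sample, while the variance explained stays under 1% in both cases.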

#### Regulatory policy recommendations

In [None]:
# clean up data: drop rows where number of discharges = 0
df = df[df['Number of Discharges'] != 0].copy()

In [None]:
# distributions of readmission ratios and discharges
plt.figure(figsize=(16, 8))

# all discharges and readmission ratios
plt.subplot(2, 3, 1)
df['Excess Readmission Ratio'].plot(kind='hist', title='distribution of all readmissions')
plt.subplot(2, 3, 4)
df['Number of Discharges'].plot(kind='hist', title='distribution of all discharges')

# split the data at the 300-discharge threshold
df_l300 = df[df['Number of Discharges'] < 300]
df_g300 = df[df['Number of Discharges'] >= 300]

# < 300
plt.subplot(2, 3, 2)
plot_l300_ERR = df_l300['Excess Readmission Ratio'].plot(kind='hist', title='distribution of readmissions <300')
plot_l300_ERR.set_xlabel('Excess Readmission Ratio')
plt.subplot(2, 3, 5)
plot_l300_NOD = df_l300['Number of Discharges'].plot(kind='hist', title='distribution of discharges <300')

# >= 300
plt.subplot(2, 3, 3)
plot_g300_ERR = df_g300['Excess Readmission Ratio'].plot(kind='hist', title='distribution of readmissions >=300')
plot_g300_ERR.set_xlabel('Excess Readmission Ratio')

plt.subplot(2, 3, 6)
# three values (3570, 3980, 6793) stretch the x-axis, so limit it to 3000
plot_g300_NOD = df_g300['Number of Discharges'].plot(kind='hist', bins=25, title='distribution of discharges >=300')
plot_g300_NOD.set_xlim([300, 3000])

The original recommendations are:
- Hospitals/facilities with small capacity (< 300) should be required to demonstrate upgraded resource allocation for quality care to continue operation.
- Directives and incentives should be provided for consolidation of hospitals and facilities to have a smaller number of them with higher capacity and number of discharges.


The distribution of readmission ratios for hospitals with < 300 discharges looks quite similar to that for hospitals with >= 300 discharges; it is just slightly higher.

The mean values reported in Section B, 1.023 for hospitals with fewer than 100 discharges and 0.978 for hospitals with more than 1,000 discharges, are both within 0.023 of the threshold value of 1.

So the claim that larger hospitals lead to better outcomes does not seem to be valid.

### B. Provide support for your arguments and your own recommendations with a statistically sound analysis

    1. Set up an appropriate hypothesis test.
    2. Compute and report the observed significance value (or p-value).
    3. Report statistical significance for α = .01.
    4. Discuss statistical significance and practical significance.


1. Hypothesis Testing

Since the policy recommendation uses fewer than 300 discharges as the cut-off point for improvement, I will use a two-sample z-test to compare the mean excess readmission ratio of hospitals with **fewer than 300 discharges** and hospitals with **300 or more discharges**.

**Null Hypothesis**: There is no difference in the mean excess readmission ratio between hospitals with fewer than 300 discharges and hospitals with 300 or more discharges.

**Alternative Hypothesis**: There is a difference in the mean excess readmission ratio between hospitals with fewer than 300 discharges and hospitals with 300 or more discharges.

With $\mu_{l300}$ and $\mu_{g300}$ denoting the two group means:

$H_0: \mu_{l300} - \mu_{g300} = 0$

$H_A: \mu_{l300} - \mu_{g300} \neq 0$
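
A two-sample z-test of these hypotheses can be sketched as follows. This is a minimal illustration, not the notebook's final analysis: the function name is hypothetical, the two samples are randomly generated stand-ins for the two hospital groups, and the sample variances are used in place of the unknown population variances.

```python
import math
import random
from statistics import mean, variance


def two_sample_z_test(sample_a, sample_b):
    """Return (z, two-sided p-value) for H0: mean_a - mean_b = 0."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Standard error built from the sample variances of each group
    se = math.sqrt(variance(sample_a) / n_a + variance(sample_b) / n_b)
    z = (mean(sample_a) - mean(sample_b)) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Synthetic stand-ins for the two groups (not the CMS data)
random.seed(0)
small = [random.gauss(1.01, 0.08) for _ in range(500)]
large = [random.gauss(1.00, 0.08) for _ in range(500)]

z, p_value = two_sample_z_test(small, large)
alpha = 0.01
print(round(z, 3), round(p_value, 4),
      "reject H0" if p_value < alpha else "fail to reject H0")
```

In the notebook, the two arguments would be the excess readmission ratios of each group with missing values dropped.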


In [None]:
# a two-sided test of H0: difference in means = 0
mu = 0

# extract data
l300 = df_l300['Excess Readmission Ratio']
g300 = df_g300['Excess Readmission Ratio']

# label each row by discharge category (copy to avoid chained assignment)
df_300 = df[['Number of Discharges', 'Excess Readmission Ratio']].copy()
df_300['Category'] = 'None'
df_300.loc[df_300['Number of Discharges'] < 300, 'Category'] = '< 300'
df_300.loc[df_300['Number of Discharges'] >= 300, 'Category'] = '>= 300'

**Assumptions for the 2-Sample Test**
1. Both groups are approximately normally distributed and have a large sample size (> 30).
2. Each sample is expected to be representative of the CMS population, i.e. only hospitals accredited by the Centers for Medicare and Medicaid Services (CMS).
3. The population standard deviations will be estimated by the sample standard deviations.

In [None]:
l300.hist()

In [None]:
#point estimate - difference between sampled hospitals with fewer than 300 discharges and those with 300 or more discharges
l300.mean() - g300.mean()

Check the Central Limit Theorem sample-size condition: each sample must have n >= 30.


In [None]:
l300.count(), g300.count()

In [None]:
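print("Readmission Ratio")
# normaltest returns NaN if any value is missing, so drop NaN first
print("all  : " + str(stats.normaltest(df['Excess Readmission Ratio'].dropna())))
print("<300 : " + str(stats.normaltest(l300.dropna())))
print(">=300: " + str(stats.normaltest(g300.dropna())))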

In [None]:
# variability should be consistent across groups (homoscedastic)
print('Visual inspection of constant variance')
df_300[[ 'Excess Readmission Ratio', 'Category']].groupby('Category').boxplot(return_type='axes', figsize=(20,10));

The point estimate is close to 0, and both samples are large and of similar size. The means look similar, with a slight difference in variance, which at first glance seems to support the null hypothesis.
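
A confidence interval for the difference in means puts the point estimate in context, since it shows both whether 0 is plausible and how large the difference could be. The sketch below builds a 99% interval, matching α = .01, with a normal approximation. The function name and the toy samples are assumptions for illustration, not the CMS data.

```python
import math
from statistics import NormalDist, mean, variance


def diff_in_means_ci(sample_a, sample_b, confidence=0.99):
    """Normal-approximation confidence interval for mean_a - mean_b."""
    se = math.sqrt(variance(sample_a) / len(sample_a)
                   + variance(sample_b) / len(sample_b))
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 2.576 for 99%
    diff = mean(sample_a) - mean(sample_b)
    return diff - z_crit * se, diff + z_crit * se


# Toy samples standing in for the two hospital groups
group_a = [1.02, 0.99, 1.05, 1.01, 0.98, 1.03, 1.00, 1.04]
group_b = [1.00, 0.97, 1.02, 0.99, 1.01, 0.98, 1.00, 0.96]

low, high = diff_in_means_ci(group_a, group_b)
print(round(low, 4), round(high, 4))
```

An interval that excludes 0 but stays within a few hundredths of it would indicate a statistically detectable difference of limited practical size.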