# Inferential Statistics

In this notebook, I run some quick statistical tests based on observations made in the data exploration
portion of this project.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import random

from scipy.stats import shapiro
from scipy.stats import normaltest
from scipy.stats import pearsonr
from statsmodels.stats.weightstats import ztest
from statsmodels.graphics.gofplots import qqplot

In [2]:
df = pd.read_csv('binary_diabetes.csv')

## Creating the numerical dataset

In [19]:
num_df = df[['time_in_hospital', 'num_lab_procedures', 'num_procedures', 'num_medications', 'number_outpatient',
                 'number_emergency', 'number_inpatient', 'number_diagnoses', 'readmitted']]

## Bootstrapping function to compare the means of readmitted vs non-readmitted patients, for a given metric

In [75]:
def bootstrap_mean_diff(metric_list, n_replicates=10000):
    """
    metric_list: A list of column names in num_df to compare between the two groups.

    Returns a DataFrame holding, for each metric, the observed difference in means
    (readmitted minus not readmitted) and a two-sided bootstrap p-value
    (stored in the 'Permuted pval' column).
    """
    bootstrap_df = pd.DataFrame(index=metric_list, columns=['Observed difference', 'Permuted pval'])

    for metric in metric_list:
        readmitted_obs = num_df.loc[num_df.readmitted == 'YES', metric]
        not_readmitted_obs = num_df.loc[num_df.readmitted == 'NO', metric]
        obs_difference = np.mean(readmitted_obs) - np.mean(not_readmitted_obs)

        # Shift each group to mean zero so that resampling simulates the null hypothesis of equal means
        readmitted = readmitted_obs - np.mean(readmitted_obs)
        not_readmitted = not_readmitted_obs - np.mean(not_readmitted_obs)

        differences_replicates = np.empty(n_replicates)  # One entry per bootstrap replicate
        for i in range(n_replicates):
            # Resample each group with replacement, keeping its original size
            bootstrap_readmitted = np.random.choice(readmitted, size=len(readmitted))
            bootstrap_not_readmitted = np.random.choice(not_readmitted, size=len(not_readmitted))
            # Store the difference in means for this replicate
            differences_replicates[i] = np.mean(bootstrap_readmitted) - np.mean(bootstrap_not_readmitted)

        # Two-sided p-value: fraction of replicates at least as extreme as the observed difference
        pval = np.mean(np.abs(differences_replicates) >= np.abs(obs_difference))

        bootstrap_df.loc[metric, 'Observed difference'] = obs_difference
        bootstrap_df.loc[metric, 'Permuted pval'] = pval

    return bootstrap_df
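
To see how the shifted-mean bootstrap test behaves, it can help to try it on synthetic data where the answer is known in advance. The sketch below is not part of the original analysis: `bootstrap_pvalue` is a hypothetical helper, the data are randomly generated, and the replicates are drawn in one vectorized call rather than in a loop.

```python
import numpy as np

def bootstrap_pvalue(group_a, group_b, n_replicates=2000, seed=0):
    """Two-sided bootstrap p-value for a difference in means (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    group_a = np.asarray(group_a, dtype=float)
    group_b = np.asarray(group_b, dtype=float)
    obs_difference = group_a.mean() - group_b.mean()

    # Shift both groups to mean zero to simulate the null hypothesis of equal means
    shifted_a = group_a - group_a.mean()
    shifted_b = group_b - group_b.mean()

    # Draw all replicates at once: one row per bootstrap replicate
    means_a = rng.choice(shifted_a, size=(n_replicates, len(shifted_a))).mean(axis=1)
    means_b = rng.choice(shifted_b, size=(n_replicates, len(shifted_b))).mean(axis=1)
    replicates = means_a - means_b

    # Fraction of replicates at least as extreme as the observed difference
    pval = np.mean(np.abs(replicates) >= abs(obs_difference))
    return obs_difference, pval

# Synthetic example data (assumed values, for illustration only)
data_rng = np.random.default_rng(42)
low = data_rng.normal(5.0, 1.0, size=200)
high = data_rng.normal(8.0, 1.0, size=200)

# Clearly separated groups: the p-value should be very small
diff_shifted, p_shifted = bootstrap_pvalue(high, low)

# Identical groups: the observed difference is exactly zero, so every replicate is "as extreme"
diff_identical, p_identical = bootstrap_pvalue(low, low)

print(diff_shifted, p_shifted)
print(diff_identical, p_identical)
```

Drawing a `(n_replicates, n)` matrix trades memory for speed; for very large groups, a loop like the one in `bootstrap_mean_diff` may be the safer choice.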

In [78]:
metric_list = ['time_in_hospital', 'num_lab_procedures', 'num_procedures', 'num_medications', 'number_outpatient',
                 'number_emergency', 'number_inpatient', 'number_diagnoses']

## Final results

In [80]:
bootstrap_mean_diff(metric_list)

| Metric | Observed difference | Permuted pval |
|---|---|---|
| time_in_hospital | 0.355712 | 0 |
| num_lab_procedures | 1.87047 | 0 |
| num_procedures | -0.100741 | 0 |
| num_medications | 0.579525 | 0 |
| number_outpatient | 0.139759 | 0 |
| number_emergency | 0.0795761 | 0 |
| number_inpatient | 0.174408 | 0 |
| number_diagnoses | 0.409719 | 0 |


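From this table, we can see that although the absolute differences between the two groups are small, all of them
are statistically significant. Every p-value is reported as 0, meaning that none of the 10,000 bootstrap replicates
produced a difference as large as the magnitude of the observed one, so the result holds even after Bonferroni correction.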
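
As a rough check on the Bonferroni claim, the arithmetic can be sketched as follows. The significance level of 0.05 is an assumption (the notebook does not state one); the number of tests and replicates come from `metric_list` and the bootstrap function.

```python
n_tests = 8           # number of metrics in metric_list
n_replicates = 10000  # bootstrap replicates per metric
alpha = 0.05          # assumed family-wise significance level (not stated in the notebook)

# Bonferroni: each individual test must clear alpha divided by the number of tests
bonferroni_threshold = alpha / n_tests

# A reported p-value of 0 only tells us that p is below the resolution of the bootstrap
pval_upper_bound = 1 / n_replicates

print(bonferroni_threshold)                     # 0.00625
print(pval_upper_bound < bonferroni_threshold)  # True
```

Under these assumptions, a p-value below 0.0001 sits well under the corrected threshold for all eight metrics.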

In most of these cases, the story is as expected: patients who were readmitted spent more time in the hospital,
had more lab procedures, more medications, more diagnoses, and so on. Interestingly, they actually had fewer
non-lab procedures performed than patients who were not readmitted.