<a href="https://colab.research.google.com/github/bonitr02/datasci_6_anova/blob/main/data_sci_6_anova.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# HHA 507 Week 6 - ANOVA Analysis Part II

## Import Packages

In [66]:
!pip install ucimlrepo
import pandas as pd
import matplotlib.pyplot as plt  # used for the histogram in the normality check
import scipy.stats as stats
from scipy.stats import shapiro
from ucimlrepo import fetch_ucirepo
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd



## Load Dataset

In [20]:
# fetch dataset
diabetes = fetch_ucirepo(id=296)


In [21]:
# data (as pandas dataframes)
X = diabetes.data.features
y = diabetes.data.targets

In [35]:
df = pd.DataFrame(X)
len(df)
#101766 values

101766

In [23]:
df.columns

Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'discharge_disposition_id', 'admission_source_id', 'time_in_hospital',
       'payer_code', 'medical_specialty', 'num_lab_procedures',
       'num_procedures', 'num_medications', 'number_outpatient',
       'number_emergency', 'number_inpatient', 'diag_1', 'diag_2', 'diag_3',
       'number_diagnoses', 'max_glu_serum', 'A1Cresult', 'metformin',
       'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
       'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone', 'change', 'diabetesMed'],
      dtype='object')

In [24]:
df.dtypes

race                        object
gender                      object
age                         object
weight                      object
admission_type_id            int64
discharge_disposition_id     int64
admission_source_id          int64
time_in_hospital             int64
payer_code                  object
medical_specialty           object
num_lab_procedures           int64
num_procedures               int64
num_medications              int64
number_outpatient            int64
number_emergency             int64
number_inpatient             int64
diag_1                      object
diag_2                      object
diag_3                      object
number_diagnoses             int64
max_glu_serum               object
A1Cresult                   object
metformin                   object
repaglinide                 object
nateglinide                 object
chlorpropamide              object
glimepiride                 object
acetohexamide               object
glipizide           

In [None]:
# Will use num_medications as the dependent variable and race as the independent variable (a categorical factor with multiple levels)
# Both variables have appropriate data types: num_medications is int64 and race is object (string)

### Look for missingness

In [36]:
newdf = df[['num_medications', 'race']]
print(newdf)
len(newdf)
#101766 rows

        num_medications             race
0                     1        Caucasian
1                    18        Caucasian
2                    13  AfricanAmerican
3                    16        Caucasian
4                     8        Caucasian
...                 ...              ...
101761               16  AfricanAmerican
101762               18  AfricanAmerican
101763                9        Caucasian
101764               21        Caucasian
101765                3        Caucasian

[101766 rows x 2 columns]


101766

In [39]:
## keep only complete rows
newdf = newdf.dropna()
len(newdf)
#99493 remaining rows

<bound method NDFrame.describe of         num_medications             race
0                     1        Caucasian
1                    18        Caucasian
2                    13  AfricanAmerican
3                    16        Caucasian
4                     8        Caucasian
...                 ...              ...
101761               16  AfricanAmerican
101762               18  AfricanAmerican
101763                9        Caucasian
101764               21        Caucasian
101765                3        Caucasian

[99493 rows x 2 columns]>

## Assumption Checks

### Normality

In [None]:
# Shapiro-Wilk test on the full column
# Note: SciPy warns that the p-value may be inaccurate for N > 5000
stat, p = shapiro(newdf['num_medications'])
if p > 0.05:
    print("Number of medications is consistent with a normal distribution. P-value is: ", p)
else:
    print("Number of medications does not follow a normal distribution. P-value is: ", p)

In [None]:
# Histogram
plt.hist(newdf['num_medications'], bins=20, edgecolor='k', alpha=0.7)
plt.title('Histogram of Numbers of Medications')
plt.xlabel('number of medications')
plt.ylabel('number of patients')
plt.show()

In [None]:
groups = newdf.groupby('race')

# Run the Shapiro-Wilk test separately within each race group
for race_status, group_df in groups:
    _, p_value = stats.shapiro(group_df['num_medications'])

    print(f"Group ({race_status}):")
    print(f"P-value from Shapiro-Wilk Test: {p_value}\n")

# A p-value greater than 0.05 means normality is not rejected for that group

#### Interpretation

Num_Medications: the Shapiro-Wilk p-value is 0.0, so the null hypothesis of normality is rejected; the data depart significantly from a normal distribution. The histogram confirms that the distribution is skewed rather than symmetric.

Race: the p-value is less than 0.05 within every racial category, so the number of medications is not normally distributed in any of the groups.
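With roughly 100,000 rows, the Shapiro-Wilk p-value deserves some caution, since SciPy documents that it may be inaccurate above 5,000 observations. A minimal sketch of two complementary checks follows: the sample skewness, and a Shapiro-Wilk test on a random subsample of 5,000 values. The data here are synthetic (a gamma-distributed stand-in, not the diabetes dataset), and the names `sample` and `subsample` are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic, right-skewed stand-in for a count-like variable (NOT the real data)
sample = rng.gamma(shape=2.0, scale=8.0, size=20000)

# Sample skewness: positive means a longer right tail, negative a longer left tail
skewness = stats.skew(sample)

# Shapiro-Wilk on a random subsample small enough for a reliable p-value
subsample = rng.choice(sample, size=5000, replace=False)
w_stat, p_sub = stats.shapiro(subsample)

print(f"Skewness: {skewness:.3f}")
print(f"Shapiro-Wilk on subsample: W={w_stat:.4f}, p={p_sub:.3g}")
```

On the real data, `sample` would be replaced by `newdf['num_medications']`; the sign of the skewness then states the direction of the skew directly instead of reading it off the histogram.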

### Homoscedasticity (Equal Variances)

In [57]:
newdf['race'].value_counts()

Caucasian          76099
AfricanAmerican    19210
Hispanic            2037
Other               1506
Asian                641
Name: race, dtype: int64

In [58]:
# check homogeneity of variance across race groups with Levene's test
print(stats.levene(newdf[newdf["race"] == "Caucasian"]["num_medications"],
                   newdf[newdf["race"] == "AfricanAmerican"]["num_medications"],
                   newdf[newdf["race"] == "Hispanic"]["num_medications"],
                   newdf[newdf["race"] == "Other"]["num_medications"],
                   newdf[newdf["race"] == "Asian"]["num_medications"]))

LeveneResult(statistic=9.964541877327243, pvalue=4.6475787972404207e-08)


#### Interpretation

The p-value (4.65e-08) is less than 0.05, so the null hypothesis of equal variances is rejected. There is strong evidence that the homogeneity-of-variance assumption is violated.
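Listing each group by hand works, but the same Levene's test can be built from a `groupby`, which avoids typos in the level names and also makes it easy to print the per-group variances being compared. The sketch below uses a small synthetic frame; the column names `group` and `value` and the three levels are hypothetical placeholders, not the notebook's data.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic frame: group B is given a deliberately larger spread (illustration only)
toy = pd.DataFrame({
    "group": np.repeat(["A", "B", "C"], 500),
    "value": np.concatenate([
        rng.normal(10, 1, 500),
        rng.normal(10, 3, 500),
        rng.normal(10, 1, 500),
    ]),
})

# One array per group level, collected without naming the levels by hand
samples = [g["value"].to_numpy() for _, g in toy.groupby("group")]

# center="median" is SciPy's default (the Brown-Forsythe variant of Levene's test)
levene_result = stats.levene(*samples, center="median")

# Per-group variances show which groups drive the result
group_variances = toy.groupby("group")["value"].var()

print(levene_result)
print(group_variances)
```

Applied to the notebook's data, the grouping column would be `race` and the value column `num_medications`.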


## One-way ANOVA test

Alternative hypothesis: The mean number of medications differs for at least one patient race group.

Null hypothesis: The mean number of medications does not differ by patient race.

In [65]:
model = ols('num_medications ~ C(race)', data=newdf).fit()
anova_table = anova_lm(model, typ=2)
print(anova_table)


                sum_sq       df           F        PR(>F)
C(race)   2.739461e+04      4.0  104.307503  8.104325e-89
Residual  6.532212e+06  99488.0         NaN           NaN


#### Interpretation

F: The computed F-statistic is 104.307503, meaning the variation between the race group means is large relative to the variation within groups.

PR(>F): The p-value associated with the F-statistic is 8.104e-89.

Since the p-value (8.104e-89) is less than the common alpha level of 0.05, we reject the null hypothesis: at least one race group has a mean number of medications that differs from the others.
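With nearly 100,000 observations, even small differences in means produce a very small p-value, so it can help to look at how much of the total variation the factor accounts for. The sketch below recomputes the F-statistic from the sums of squares in the table above and derives eta squared (sum of squares for race divided by total sum of squares). Eta squared is an addition for illustration, not something the original analysis reported; the numbers are copied from the printed ANOVA table.

```python
# Values copied from the ANOVA table printed above
ss_race = 2.739461e+04      # sum of squares between race groups
ss_resid = 6.532212e+06     # residual (within-group) sum of squares
df_race = 4.0
df_resid = 99488.0

# F is the ratio of the between-group to the within-group mean square
ms_race = ss_race / df_race
ms_resid = ss_resid / df_resid
f_check = ms_race / ms_resid

# Eta squared: share of total variance associated with race
eta_sq = ss_race / (ss_race + ss_resid)

print(f"F recomputed from the table: {f_check:.3f}")
print(f"Eta squared: {eta_sq:.4f}")
```

The recomputed F matches the table, and eta squared comes out at about 0.004, which suggests that race accounts for well under 1% of the variation in the number of medications despite the highly significant test.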

## Post-hoc Tukey Test

In [68]:
posthoc = pairwise_tukeyhsd(newdf['num_medications'], newdf['race'], alpha=0.05)

print(posthoc)

      Multiple Comparison of Means - Tukey HSD, FWER=0.05       
     group1       group2  meandiff p-adj   lower   upper  reject
----------------------------------------------------------------
AfricanAmerican     Asian  -2.0772    0.0 -2.9647 -1.1897   True
AfricanAmerican Caucasian   0.9168    0.0  0.7383  1.0953   True
AfricanAmerican  Hispanic  -1.3385    0.0 -1.8535 -0.8234   True
AfricanAmerican     Other   -0.183 0.9168 -0.7745  0.4085  False
          Asian Caucasian    2.994    0.0  2.1173  3.8707   True
          Asian  Hispanic   0.7387 0.2596 -0.2623  1.7397  False
          Asian     Other   1.8942    0.0  0.8518  2.9366   True
      Caucasian  Hispanic  -2.2553    0.0 -2.7515  -1.759   True
      Caucasian     Other  -1.0998    0.0  -1.675 -0.5246   True
       Hispanic     Other   1.1555 0.0003  0.4043  1.9067   True
----------------------------------------------------------------


#### Interpretation

- African American vs Asian: adjusted p-value = 0.0 (rounded), indicating a significant difference between the two groups

- African American vs Caucasian: adjusted p-value = 0.0 (rounded), indicating a significant difference between the two groups

- African American vs Hispanic: adjusted p-value = 0.0 (rounded), indicating a significant difference between the two groups

- African American vs Other: adjusted p-value = 0.9168, indicating that there is not a significant difference between the two groups

- Asian vs Caucasian: adjusted p-value = 0.0 (rounded), indicating a significant difference between the two groups

- Asian vs Hispanic: adjusted p-value = 0.2596, indicating that there is not a significant difference between the two groups

- Asian vs Other: adjusted p-value = 0.0 (rounded), indicating a significant difference between the two groups

- Caucasian vs Hispanic: adjusted p-value = 0.0 (rounded), indicating a significant difference between the two groups

- Caucasian vs Other: adjusted p-value = 0.0 (rounded), indicating a significant difference between the two groups

- Hispanic vs Other: adjusted p-value = 0.0003, indicating a significant difference between the two groups




## Markdown Insights

- Number of medications and race were chosen to test whether the mean number of medications in a patient's regimen differs between demographic groups. A one-way ANOVA requires a numeric dependent variable (number of medications) and a categorical independent variable (race) that splits the data into multiple groups.


- The normality and equal-variance checks both failed, so the ANOVA results may not be reliable. In this situation, a non-parametric test for independent groups, such as the Kruskal-Wallis test, may be more appropriate.

- According to the ANOVA results, the mean number of medications differs across patient race groups; however, because the assumptions were violated, a different test may give more accurate results.
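One non-parametric option for comparing independent groups is the Kruskal-Wallis H-test, which works on ranks and does not assume normality. The following is a minimal sketch on synthetic skewed data, not a rerun of the analysis above; the three groups and the shift applied to `group_b` are assumptions made purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Three synthetic, right-skewed groups; group_b is shifted upward on purpose
group_a = rng.gamma(shape=2.0, scale=8.0, size=300)
group_b = rng.gamma(shape=2.0, scale=8.0, size=300) + 10.0
group_c = rng.gamma(shape=2.0, scale=8.0, size=300)

# Kruskal-Wallis H-test: null hypothesis is that all groups share the same distribution
h_stat, p_value = stats.kruskal(group_a, group_b, group_c)

print(f"H = {h_stat:.3f}, p = {p_value:.3g}")
```

For the notebook's data, the arguments would be the `num_medications` values for each level of `race`, for example built with `[g['num_medications'] for _, g in newdf.groupby('race')]` and unpacked into `stats.kruskal`.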