<a href="https://colab.research.google.com/github/c-susan/datasci_5_statistics/blob/main/python_statistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Loading in Packages

In [118]:
import pandas as pd
from scipy.stats import chi2_contingency as chi2, ttest_ind
import statsmodels.api as sm
from statsmodels.formula.api import ols

## **1. Chi-Square Test**

In [None]:
df1 = pd.read_csv('Order_and_Referring_10_3_2023.csv')
df1

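| | NPI | LAST_NAME | FIRST_NAME | PARTB | DME | HHA | PMD |
|---|---|---|---|---|---|---|---|
| 0 | 1558467555 | .MCINDOE | THOMAS | Y | Y | Y | Y |
| 1 | 1417051921 | A BELLE | N | Y | Y | Y | Y |
| 2 | 1972040137 | A NOVOTNY | ELIZABETH | Y | Y | Y | Y |
| 3 | 1760465553 | A SATTAR | MUHAMMAD | Y | Y | Y | Y |
| 4 | 1295400745 | A'NEAL | BROGAN | Y | Y | N | N |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1785834 | 1336502301 | ZYZO | JOHN | Y | Y | Y | N |
| 1785835 | 1225502768 | ZZIWA | JACKIE | N | Y | N | Y |
| 1785836 | 1124277249 | ZZIWA-KABENGE | IRYNE | Y | Y | Y | Y |
| 1785837 | 1033160296 | ZZIWAMBAZZA | NATHAN | Y | Y | Y | Y |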


In [None]:
## View the value counts of the selected columns
# 'DME' = Indicates whether provider can order Durable Medical Equipment
df1['DME'].value_counts()

Y    1785838
N          1
Name: DME, dtype: int64

In [None]:
# 'PMD' = Indicates whether provider can order Power Mobility Devices
df1['PMD'].value_counts()

Y    1475449
N     310390
Name: PMD, dtype: int64

In [None]:
## Creating a contingency table from the selected columns
contingency_table = pd.crosstab(df1['DME'], df1['PMD'])
contingency_table

| DME \ PMD | N | Y |
|---|---|---|
| N | 1 | 0 |
| Y | 310389 | 1475449 |


In [None]:
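## Using the contingency table to perform a chi-square test
# Store the statistic as chi2_stat so the imported chi2 function is not overwritten
# (otherwise re-running this cell fails because chi2 would be a float)
chi2_stat, p, degf, expected = chi2(contingency_table)
print(f"Chi2 value: {chi2_stat}")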
print(f"P-value: {p}")
print(f'Degrees of freedom: {degf}')
print(f'Expected Frequency: {expected}')

Chi2 value: 0.7409760431083185
P-value: 0.3893483960070897
Degrees of freedom: 1
Expected Frequency: [[1.73806261e-01 8.26193739e-01]
 [3.10389826e+05 1.47544817e+06]]
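
The expected frequencies printed above follow from the independence assumption: each expected count is the row total times the column total, divided by the grand total. The sketch below recomputes them with NumPy from the observed counts in the contingency table; the variable names are illustrative and not part of the original notebook.

```python
import numpy as np

# Observed counts copied from the contingency table
# (rows: DME = N, Y; columns: PMD = N, Y)
observed = np.array([[1, 0],
                     [310389, 1475449]], dtype=np.int64)

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
grand_total = observed.sum()

# Expected count under independence: (row total * column total) / grand total
expected_manual = row_totals * col_totals / grand_total
print(expected_manual)
```

The top row of the result should agree with the first row of the `expected` array returned by `chi2_contingency`.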


### **Summary**
**Question:** Is there an association between whether a provider can order Durable Medical Equipment (DME) and whether they can order Power Mobility Devices (PMD)?

**H0:** There is no relationship between DME and PMD (independent).

**H1:** There is a relationship between DME and PMD (dependent).

**Interpretation**

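Since the p-value of 0.389 is greater than the significance level of 0.05, we fail to reject the null hypothesis: the data do not provide evidence of an association between a provider's ability to order DME and their ability to order PMD. This result should be read with caution, however. Only one provider in the dataset has DME = N, so two of the four expected frequencies (about 0.17 and 0.83) are far below 5, the usual rule-of-thumb minimum for the chi-square approximation to be reliable.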
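
Two of the expected frequencies reported for this table are well below 5, which is a common warning sign for the chi-square approximation. One possible cross-check, which is not part of the original analysis, is Fisher's exact test on the same 2x2 table. The sketch below assumes the observed counts from the contingency table and uses `scipy.stats.fisher_exact`.

```python
from scipy.stats import chi2_contingency, fisher_exact

# Observed counts copied from the contingency table
# (rows: DME = N, Y; columns: PMD = N, Y)
observed = [[1, 0],
            [310389, 1475449]]

# Chi-square test (Yates' continuity correction is applied by default for 2x2 tables)
chi2_stat, p_chi2, dof, expected = chi2_contingency(observed)
print(f"Smallest expected count: {expected.min():.4f}")
print(f"Chi-square p-value: {p_chi2:.4f}")

# Fisher's exact test does not rely on large expected counts
odds_ratio, p_fisher = fisher_exact(observed, alternative="two-sided")
print(f"Fisher exact p-value: {p_fisher:.4f}")
```

Both tests lead to the same decision at the 0.05 level here, but the exact test is the safer choice when a cell's expected count is this small.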

**______________________________________________________________________________________________________________**

## **2. T-Test**

In [119]:
df2 = pd.read_csv('https://raw.githubusercontent.com/c-susan/datasci_5_statistics/main/datasets/Specific_Chronic_Conditions.csv')
df2

| | Bene_Geo_Lvl | Bene_Geo_Desc | Bene_Geo_Cd | Bene_Age_Lvl | Bene_Demo_Lvl | Bene_Demo_Desc | Bene_Cond | Prvlnc | Tot_Mdcr_Stdzd_Pymt_PC | Tot_Mdcr_Pymt_PC | Hosp_Readmsn_Rate | ER_Visits_Per_1000_Benes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | State | Alabama | 1.0 | All | All | All | Alcohol Abuse | 0.0188 | 25102.3405 | 23348.6039 | 0.2413 | 2184.7557 |
| 1 | State | Alabama | 1.0 | 65+ | Dual Status | Medicare Only | Alcohol Abuse | 0.0118 | | | | |
| 2 | State | Alabama | 1.0 | <65 | Dual Status | Medicare Only | Alcohol Abuse | 0.0320 | | | | |
| 3 | State | Alabama | 1.0 | All | Dual Status | Medicare Only | Alcohol Abuse | 0.0147 | | | | |
| 4 | State | Alabama | 1.0 | 65+ | Dual Status | Medicare and Medicaid | Alcohol Abuse | 0.0238 | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 236119 | State | Unknown | | 65+ | All | All | Hypertension | | | | | |
| 236120 | State | Unknown | | 65+ | All | All | Ischemic Heart Disease | | | | | |
| 236121 | State | Unknown | | 65+ | All | All | Osteoporosis | | | | | |
| 236122 | State | Unknown | | 65+ | All | All | Schizophrenia and Other Psychotic Disorders | | | | | |


In [128]:
## Selecting rows where column 'Bene_Demo_Desc' is either Female or Male, 'Bene_Age_Lvl' is All, and 'Prvlnc' is not null.
df2 = df2[(df2['Bene_Demo_Desc'].isin(['Female', 'Male'])) & (df2['Bene_Age_Lvl'] == 'All') & df2['Prvlnc'].notnull()]
df2

| | Bene_Geo_Lvl | Bene_Geo_Desc | Bene_Geo_Cd | Bene_Age_Lvl | Bene_Demo_Lvl | Bene_Demo_Desc | Bene_Cond | Prvlnc | Tot_Mdcr_Stdzd_Pymt_PC | Tot_Mdcr_Pymt_PC | Hosp_Readmsn_Rate | ER_Visits_Per_1000_Benes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | State | Alabama | 1.0 | All | Sex | Female | Alcohol Abuse | 0.0092 | | | | |
| 12 | State | Alabama | 1.0 | All | Sex | Male | Alcohol Abuse | 0.0306 | | | | |
| 22 | State | Alabama | 1.0 | All | Sex | Female | Alzheimer's Disease/Dementia | 0.1314 | | | | |
| 25 | State | Alabama | 1.0 | All | Sex | Male | Alzheimer's Disease/Dementia | 0.0928 | | | | |
| 35 | State | Alabama | 1.0 | All | Sex | Female | Arthritis | 0.4510 | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 14988 | State | Wyoming | 56.0 | All | Sex | Male | Osteoporosis | 0.0100 | | | | |
| 14998 | State | Wyoming | 56.0 | All | Sex | Female | Schizophrenia and Other Psychotic Disorders | 0.0145 | | | | |
| 15001 | State | Wyoming | 56.0 | All | Sex | Male | Schizophrenia and Other Psychotic Disorders | 0.0164 | | | | |
| 15011 | State | Wyoming | 56.0 | All | Sex | Female | Stroke | 0.0219 | | | | |


In [129]:
## Selecting alcohol abuse as the condition to focus on
alcohol = df2[df2['Bene_Cond'] == 'Alcohol Abuse']

In [131]:
## Splitting the data into 2 groups: Female and Male
alcohol_female = alcohol[alcohol['Bene_Demo_Desc'] == 'Female']['Prvlnc']
alcohol_female

9        0.0092
282      0.0204
555      0.0114
828      0.0084
1101     0.0131
1374     0.0134
1647     0.0174
1920     0.0123
2193     0.0165
2466     0.0143
2739     0.0101
3012     0.0076
3285     0.0130
3558     0.0099
3831     0.0114
4104     0.0084
4377     0.0082
4650     0.0093
4923     0.0090
5196     0.0209
5469     0.0109
5742     0.0227
6015     0.0144
6288     0.0238
6561     0.0085
6834     0.0105
7107     0.0120
7380     0.0119
7653     0.0076
7926     0.0156
8199     0.0183
8472     0.0116
8745     0.0130
9018     0.0120
9291     0.0109
9564     0.0118
9837     0.0108
10110    0.0090
10383    0.0159
10656    0.0108
10929    0.0025
11202    0.0191
11475    0.0092
11748    0.0111
12021    0.0090
12294    0.0100
12840    0.0086
13113    0.0177
13386    0.0038
13659    0.0100
13932    0.0136
14205    0.0094
14478    0.0150
14751    0.0109
Name: Prvlnc, dtype: float64

In [132]:
alcohol_male = alcohol[alcohol['Bene_Demo_Desc'] == 'Male']['Prvlnc']
alcohol_male

12       0.0306
285      0.0411
558      0.0274
831      0.0228
1104     0.0315
1377     0.0298
1650     0.0436
1923     0.0319
2196     0.0518
2469     0.0340
2742     0.0296
3015     0.0243
3288     0.0260
3561     0.0279
3834     0.0328
4107     0.0260
4380     0.0255
4653     0.0298
4926     0.0313
5199     0.0458
5472     0.0317
5745     0.0548
6018     0.0385
6291     0.0453
6564     0.0308
6837     0.0294
7110     0.0284
7383     0.0315
7656     0.0227
7929     0.0328
8202     0.0388
8475     0.0295
8748     0.0347
9021     0.0311
9294     0.0320
9567     0.0287
9840     0.0315
10113    0.0255
10386    0.0361
10659    0.0310
10932    0.0115
11205    0.0471
11478    0.0269
11751    0.0255
12024    0.0271
12297    0.0280
12843    0.0203
13116    0.0399
13389    0.0162
13662    0.0294
13935    0.0300
14208    0.0327
14481    0.0345
14754    0.0252
Name: Prvlnc, dtype: float64

In [133]:
t_stat, p_val = ttest_ind(alcohol_female, alcohol_male, equal_var=False)
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_val}")

T-statistic: -15.756728444684015
P-value: 1.9267043606049252e-26


In [134]:
# Compute means for Female and Male data with 'Alcohol Abuse'
female_mean = alcohol_female.mean()
male_mean = alcohol_male.mean()

print(f"Mean prevalence for female: {female_mean}")
print(f"Mean prevalence for males: {male_mean}")

Mean prevalence for female: 0.012149999999999998
Mean prevalence for males: 0.03152962962962963


### **Summary**

**Question:** Is there a difference in the mean prevalence of alcohol abuse between males and females?

**H0:** There is no difference in the mean prevalence of alcohol abuse between males and females.

**H1:** There is a difference in the mean prevalence of alcohol abuse between males and females.

**Interpretation**

The p-value is approximately 1.93e-26, which is less than the significance level of 0.05, so we reject the null hypothesis: there is a statistically significant difference in the mean prevalence of alcohol abuse between males and females. The t-statistic of -15.7567 expresses the difference between the two group means relative to its standard error (Welch's t-test, since `equal_var=False`). Its negative sign indicates that the mean prevalence for females (about 0.0122) is lower than for males (about 0.0315).
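
To make the t-statistic concrete, the sketch below computes Welch's statistic by hand as the difference in means divided by the square root of the summed per-group variance over sample size, and compares it with `ttest_ind(..., equal_var=False)`. It uses only the first five female and male prevalence values listed earlier, so the numbers will not match the full-sample result; the helper `welch_t` is a hypothetical name introduced for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

# First five values of each series shown earlier (illustrative subset only)
female = np.array([0.0092, 0.0204, 0.0114, 0.0084, 0.0131])
male = np.array([0.0306, 0.0411, 0.0274, 0.0228, 0.0315])


def welch_t(a, b):
    """Welch's t-statistic: mean difference divided by its standard error."""
    # ddof=1 gives the sample variance, which is what the t-test uses
    standard_error = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / standard_error


t_manual = welch_t(female, male)
t_scipy, p_scipy = ttest_ind(female, male, equal_var=False)

print(f"Manual t-statistic: {t_manual}")
print(f"SciPy t-statistic:  {t_scipy}")
```

Because the female sample is passed first, a negative statistic means the female mean is the smaller of the two.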

**______________________________________________________________________________________________________________**

## **3. ANOVA**

In [170]:
## Loading in dataset and cleaning the column names
df3 = pd.read_csv('https://raw.githubusercontent.com/c-susan/datasci_5_statistics/main/datasets/Provisional_COVID-19_Deaths_by_Sex_and_Age.csv')
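# regex=False treats every pattern as a literal string; with regex=True, a lone '(' or ')' is not a valid regular expression
df3.columns = df3.columns.str.replace(' ', '_', regex=False).str.replace('/', '_', regex=False).str.replace('(', '', regex=False).str.replace(')', '', regex=False).str.replace('-', '', regex=False).str.lower()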
df3

| | data_as_of | start_date | end_date | group | year | month | state | sex | age_group | covid19_deaths | total_deaths | pneumonia_deaths | pneumonia_and_covid19_deaths | influenza_deaths | pneumonia,_influenza,_or_covid19_deaths | footnote |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 09/27/2023 | 01/01/2020 | 09/23/2023 | By Total | | | United States | All Sexes | All Ages | 1146774.0 | 12303399.0 | 1162844.0 | 569264.0 | 22229.0 | 1760095.0 | |
| 1 | 09/27/2023 | 01/01/2020 | 09/23/2023 | By Total | | | United States | All Sexes | Under 1 year | 519.0 | 73213.0 | 1056.0 | 95.0 | 64.0 | 1541.0 | |
| 2 | 09/27/2023 | 01/01/2020 | 09/23/2023 | By Total | | | United States | All Sexes | 0-17 years | 1696.0 | 130970.0 | 2961.0 | 424.0 | 509.0 | 4716.0 | |
| 3 | 09/27/2023 | 01/01/2020 | 09/23/2023 | By Total | | | United States | All Sexes | 1-4 years | 285.0 | 14299.0 | 692.0 | 66.0 | 177.0 | 1079.0 | |
| 4 | 09/27/2023 | 01/01/2020 | 09/23/2023 | By Total | | | United States | All Sexes | 5-14 years | 509.0 | 22008.0 | 818.0 | 143.0 | 219.0 | 1390.0 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 137695 | 09/27/2023 | 09/01/2023 | 09/23/2023 | By Month | 2023.0 | 9.0 | Puerto Rico | Female | 50-64 years | | 75.0 | 14.0 | | 0.0 | 14.0 | One or more data cells have counts between 1-9... |
| 137696 | 09/27/2023 | 09/01/2023 | 09/23/2023 | By Month | 2023.0 | 9.0 | Puerto Rico | Female | 55-64 years | 0.0 | 65.0 | 10.0 | 0.0 | 0.0 | 10.0 | |
| 137697 | 09/27/2023 | 09/01/2023 | 09/23/2023 | By Month | 2023.0 | 9.0 | Puerto Rico | Female | 65-74 years | | 91.0 | | | 0.0 | | One or more data cells have counts between 1-9... |
| 137698 | 09/27/2023 | 09/01/2023 | 09/23/2023 | By Month | 2023.0 | 9.0 | Puerto Rico | Female | 75-84 years | | 211.0 | 36.0 | | 0.0 | 38.0 | One or more data cells have counts between 1-9... |
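
The column-name cleaning step can be checked in isolation on a handful of names. In the sketch below, the first three raw names are assumptions inferred from the cleaned headers shown in the output (the raw CSV header is not displayed in this notebook), and the last one is a purely hypothetical name added to exercise the parenthesis handling. Each replacement is literal (`regex=False`), so characters such as `(` are not interpreted as regular-expression syntax.

```python
import pandas as pd

# Assumed raw column names; the last entry is hypothetical
raw_columns = pd.Index([
    "COVID-19 Deaths",
    "Pneumonia, Influenza, or COVID-19 Deaths",
    "Age Group",
    "Rate (per 100k)",
])

# Apply the same literal replacements as the cleaning step, then lowercase
cleaned = raw_columns
for old, new in [(" ", "_"), ("/", "_"), ("(", ""), (")", ""), ("-", "")]:
    cleaned = cleaned.str.replace(old, new, regex=False)
cleaned = cleaned.str.lower()

print(list(cleaned))
```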


In [171]:
## Filtering for columns 'sex', 'age_group', and 'covid19_deaths'
df3 = df3[['sex', 'age_group', 'covid19_deaths']]
df3

| | sex | age_group | covid19_deaths |
|---|---|---|---|
| 0 | All Sexes | All Ages | 1146774.0 |
| 1 | All Sexes | Under 1 year | 519.0 |
| 2 | All Sexes | 0-17 years | 1696.0 |
| 3 | All Sexes | 1-4 years | 285.0 |
| 4 | All Sexes | 5-14 years | 509.0 |
| ... | ... | ... | ... |
| 137695 | Female | 50-64 years | |
| 137696 | Female | 55-64 years | 0.0 |
| 137697 | Female | 65-74 years | |
| 137698 | Female | 75-84 years | |


In [172]:
## Selecting rows where the column 'sex' equals 'All Sexes', the column 'age_group' is not equal to 'All Ages',
## and the column 'covid19_deaths' is not null.
df3 = df3[(df3['sex'] == 'All Sexes') & (df3['age_group'] != 'All Ages') & df3['covid19_deaths'].notnull()]
df3

| | sex | age_group | covid19_deaths |
|---|---|---|---|
| 1 | All Sexes | Under 1 year | 519.0 |
| 2 | All Sexes | 0-17 years | 1696.0 |
| 3 | All Sexes | 1-4 years | 285.0 |
| 4 | All Sexes | 5-14 years | 509.0 |
| 5 | All Sexes | 15-24 years | 3021.0 |
| ... | ... | ... | ... |
| 137657 | All Sexes | 30-39 years | 0.0 |
| 137658 | All Sexes | 35-44 years | 0.0 |
| 137659 | All Sexes | 40-49 years | 0.0 |
| 137664 | All Sexes | 75-84 years | 15.0 |


In [174]:
model = ols('covid19_deaths ~ C(age_group)', data=df3).fit()

In [175]:
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

                    sum_sq       df          F        PR(>F)
C(age_group)  2.891307e+09     15.0  12.530232  7.565137e-32
Residual      4.668617e+11  30349.0        NaN           NaN
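
The F-statistic in the table is the ratio of two mean squares, each a sum of squares divided by its degrees of freedom. The sketch below redoes that arithmetic from the rounded values printed in the ANOVA table and recovers the p-value from the F distribution with `scipy.stats.f.sf`. Because the inputs are rounded, the results should only be expected to agree with the table to several significant figures.

```python
from scipy.stats import f

# Values copied from the ANOVA table above (rounded as printed)
ss_group, df_group = 2.891307e9, 15       # C(age_group)
ss_resid, df_resid = 4.668617e11, 30349   # Residual

# Mean square = sum of squares / degrees of freedom
ms_group = ss_group / df_group
ms_resid = ss_resid / df_resid

# F is the between-group mean square relative to the residual mean square
f_value = ms_group / ms_resid

# The p-value is the upper-tail probability of the F distribution
p_value = f.sf(f_value, df_group, df_resid)

print(f"F = {f_value:.4f}")
print(f"p = {p_value:.3e}")
```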


### **Summary**
**Question:** Is there a significant difference in the mean number of COVID-19 deaths between age groups?

**H0:** There is no significant difference in the mean number of COVID-19 deaths among different age groups.

**H1:** There is a significant difference in the mean number of COVID-19 deaths among different age groups.

**Interpretation**

F-value = 12.530232

P-value = 7.565137e-32

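Since the p-value of 7.565137e-32 is less than 0.05, we reject the null hypothesis: the mean number of COVID-19 deaths differs significantly across age groups. ANOVA only indicates that at least one group mean differs from the others; it does not identify which groups differ.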
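
A significant ANOVA result is often followed by a post-hoc comparison to see which pairs of groups differ. This step is not part of the original analysis. The sketch below illustrates the idea on small synthetic groups (made-up numbers, not the COVID-19 data) using `scipy.stats.f_oneway` and `scipy.stats.tukey_hsd`, the latter assuming SciPy 1.8 or newer.

```python
from scipy.stats import f_oneway, tukey_hsd

# Synthetic example groups (NOT taken from the COVID-19 dataset)
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]
group_c = [11, 12, 13, 14, 15]

# One-way ANOVA: tests whether at least one group mean differs
f_stat, p_anova = f_oneway(group_a, group_b, group_c)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.3e}")

# Tukey's HSD: pairwise comparisons with family-wise error control
result = tukey_hsd(group_a, group_b, group_c)

# result.pvalue[i, j] is the adjusted p-value for group i versus group j
print(result.pvalue)
```

In this toy data, groups a and b have nearby means while group c sits far from both, so only the comparisons involving c come out significant.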

**______________________________________________________________________________________________________________**

## **4. Regression Analysis**