<a href="https://colab.research.google.com/github/c-susan/datasci_5_statistics/blob/main/python_statistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Loading in Packages

In [1]:
import pandas as pd
from scipy.stats import chi2_contingency as chi2, ttest_ind

## **1. Chi-Square Test**

In [26]:
df1 = pd.read_csv('Order_and_Referring_10_3_2023.csv')
df1

Unnamed: 0,NPI,LAST_NAME,FIRST_NAME,PARTB,DME,HHA,PMD
0,1558467555,.MCINDOE,THOMAS,Y,Y,Y,Y
1,1417051921,A BELLE,N,Y,Y,Y,Y
2,1972040137,A NOVOTNY,ELIZABETH,Y,Y,Y,Y
3,1760465553,A SATTAR,MUHAMMAD,Y,Y,Y,Y
4,1295400745,A'NEAL,BROGAN,Y,Y,N,N
...,...,...,...,...,...,...,...
1785834,1336502301,ZYZO,JOHN,Y,Y,Y,N
1785835,1225502768,ZZIWA,JACKIE,N,Y,N,Y
1785836,1124277249,ZZIWA-KABENGE,IRYNE,Y,Y,Y,Y
1785837,1033160296,ZZIWAMBAZZA,NATHAN,Y,Y,Y,Y


In [27]:
## View the value counts of the selected columns
# 'DME' = Indicates whether provider can order Durable Medical Equipment
df1['DME'].value_counts()

Y    1785838
N          1
Name: DME, dtype: int64

In [28]:
# 'PMD' = Indicates whether provider can order Power Mobility Devices
df1['PMD'].value_counts()

Y    1475449
N     310390
Name: PMD, dtype: int64

In [29]:
## Creating a contingency table from the selected columns
contingency_table = pd.crosstab(df1['DME'], df1['PMD'])
contingency_table

DME \ PMD,N,Y
N,1,0
Y,310389,1475449


In [30]:
## Using the contingency table to perform a chi-square test
# Store the statistic as chi2_stat so the imported chi2 function is not overwritten
chi2_stat, p, degf, expected = chi2(contingency_table)
print(f"Chi2 value: {chi2_stat}")
print(f"P-value: {p}")
print(f'Degrees of freedom: {degf}')
print(f'Expected Frequency: {expected}')

Chi2 value: 0.7409760431083185
P-value: 0.3893483960070897
Degrees of freedom: 1
Expected Frequency: [[1.73806261e-01 8.26193739e-01]
 [3.10389826e+05 1.47544817e+06]]


### **Summary**
**Question:** Is there an association between whether a provider can order Durable Medical Equipment (DME) and whether they can order Power Mobility Devices (PMD)?

**H0:** There is no relationship between DME and PMD (independent).

**H1:** There is a relationship between DME and PMD (dependent).

**Interpretation**

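Since the p-value of 0.389 is greater than the significance level of 0.05, we fail to reject the null hypothesis: the data provide no evidence of an association between a provider's ability to order DME and their ability to order PMD. This result should be read with caution. Only one provider has DME = N, so two of the expected frequencies are below 1, and the chi-square approximation is unreliable when expected counts are that small.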
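
As a cross-check on the result above, the sketch below reruns the 2x2 table through Fisher's exact test, which does not rely on large expected counts. The counts are copied from the contingency table. Using `fisher_exact` is an assumption made for illustration and is not part of the original analysis.

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# Counts copied from the DME x PMD contingency table
# (rows: DME = N, Y; columns: PMD = N, Y)
table = np.array([[1, 0],
                  [310389, 1475449]])

# Expected frequencies under independence; a common rule of thumb
# asks for every expected count to be at least 5
_, _, _, expected = chi2_contingency(table)
print(f"Smallest expected count: {expected.min():.3f}")

# Fisher's exact test computes the p-value from the hypergeometric
# distribution instead of the chi-square approximation
odds_ratio, p_exact = fisher_exact(table, alternative="two-sided")
print(f"Fisher exact p-value: {p_exact:.4f}")
```

With a single observation in the DME = N row, neither test has much power, so a non-significant result here says little about the true relationship.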

**______________________________________________________________________________________________________________**

## **2. T-Test**

In [12]:
df2 = pd.read_csv('https://raw.githubusercontent.com/c-susan/datasci_5_statistics/main/datasets/Specific_Chronic_Conditions.csv')
df2

Unnamed: 0,Bene_Geo_Lvl,Bene_Geo_Desc,Bene_Geo_Cd,Bene_Age_Lvl,Bene_Demo_Lvl,Bene_Demo_Desc,Bene_Cond,Prvlnc,Tot_Mdcr_Stdzd_Pymt_PC,Tot_Mdcr_Pymt_PC,Hosp_Readmsn_Rate,ER_Visits_Per_1000_Benes
0,State,Alabama,1.0,All,All,All,Alcohol Abuse,0.0188,25102.3405,23348.6039,0.2413,2184.7557
1,State,Alabama,1.0,65+,Dual Status,Medicare Only,Alcohol Abuse,0.0118,,,,
2,State,Alabama,1.0,<65,Dual Status,Medicare Only,Alcohol Abuse,0.0320,,,,
3,State,Alabama,1.0,All,Dual Status,Medicare Only,Alcohol Abuse,0.0147,,,,
4,State,Alabama,1.0,65+,Dual Status,Medicare and Medicaid,Alcohol Abuse,0.0238,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
236119,State,Unknown,,65+,All,All,Hypertension,,,,,
236120,State,Unknown,,65+,All,All,Ischemic Heart Disease,,,,,
236121,State,Unknown,,65+,All,All,Osteoporosis,,,,,
236122,State,Unknown,,65+,All,All,Schizophrenia and Other Psychotic Disorders,,,,,


In [9]:
## Putting the dataset into a SQLite database so that SQL can be used to query specific rows for the t-test analysis.
import sqlite3
conn = sqlite3.connect('health.db')
c = conn.cursor()
df2.to_sql('specific_conditions', conn, if_exists='replace')

236124

In [86]:
query = '''
SELECT * FROM specific_conditions
WHERE Bene_Demo_Desc IN ('Female', 'Male') AND Bene_Age_Lvl = 'All' AND Prvlnc IS NOT NULL;
'''
response = pd.read_sql(query, conn)
response.to_csv('cleaned_specific_chronic_conditions.csv', index=False)
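
The same row selection can also be expressed directly in pandas. The sketch below is a minimal comparison on a toy DataFrame: the rows are hypothetical values invented for illustration, and only the column names and the filter conditions come from the query above.

```python
import sqlite3
import pandas as pd

# Toy rows (hypothetical values) using the same column names as the dataset
toy = pd.DataFrame({
    "Bene_Age_Lvl": ["All", "65+", "All", "All"],
    "Bene_Demo_Desc": ["Female", "Male", "Male", "All"],
    "Prvlnc": [0.01, 0.03, None, 0.02],
})

# SQL route, mirroring the query above against an in-memory SQLite database
conn = sqlite3.connect(":memory:")
toy.to_sql("specific_conditions", conn, index=False)
sql_result = pd.read_sql(
    "SELECT * FROM specific_conditions "
    "WHERE Bene_Demo_Desc IN ('Female', 'Male') "
    "AND Bene_Age_Lvl = 'All' AND Prvlnc IS NOT NULL;",
    conn,
)
conn.close()

# pandas route: the same three conditions combined into a boolean mask
mask = (
    toy["Bene_Demo_Desc"].isin(["Female", "Male"])
    & (toy["Bene_Age_Lvl"] == "All")
    & toy["Prvlnc"].notna()
)
pandas_result = toy[mask].reset_index(drop=True)
print(pandas_result)
```

Filtering in pandas avoids the round trip through a database file and a CSV, while the SQL route can be easier to read for anyone already comfortable with `WHERE` clauses.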

In [87]:
df2 = pd.read_csv('cleaned_specific_chronic_conditions.csv')
df2

Unnamed: 0,index,Bene_Geo_Lvl,Bene_Geo_Desc,Bene_Geo_Cd,Bene_Age_Lvl,Bene_Demo_Lvl,Bene_Demo_Desc,Bene_Cond,Prvlnc,Tot_Mdcr_Stdzd_Pymt_PC,Tot_Mdcr_Pymt_PC,Hosp_Readmsn_Rate,ER_Visits_Per_1000_Benes
0,9,State,Alabama,1.0,All,Sex,Female,Alcohol Abuse,0.0092,,,,
1,12,State,Alabama,1.0,All,Sex,Male,Alcohol Abuse,0.0306,,,,
2,22,State,Alabama,1.0,All,Sex,Female,Alzheimer's Disease/Dementia,0.1314,,,,
3,25,State,Alabama,1.0,All,Sex,Male,Alzheimer's Disease/Dementia,0.0928,,,,
4,35,State,Alabama,1.0,All,Sex,Female,Arthritis,0.4510,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2259,14988,State,Wyoming,56.0,All,Sex,Male,Osteoporosis,0.0100,,,,
2260,14998,State,Wyoming,56.0,All,Sex,Female,Schizophrenia and Other Psychotic Disorders,0.0145,,,,
2261,15001,State,Wyoming,56.0,All,Sex,Male,Schizophrenia and Other Psychotic Disorders,0.0164,,,,
2262,15011,State,Wyoming,56.0,All,Sex,Female,Stroke,0.0219,,,,


In [93]:
## Alcohol abuse: keep only the rows for this condition
alcohol = df2[df2['Bene_Cond'] == 'Alcohol Abuse']

In [94]:
## Splitting the data into 2 groups: Female and Male
alcohol_female = alcohol[alcohol['Bene_Demo_Desc'] == 'Female']['Prvlnc']
alcohol_female

0       0.0092
42      0.0204
84      0.0114
126     0.0084
168     0.0131
210     0.0134
252     0.0174
294     0.0123
336     0.0165
378     0.0143
420     0.0101
462     0.0076
504     0.0130
546     0.0099
588     0.0114
630     0.0084
672     0.0082
714     0.0093
756     0.0090
798     0.0209
840     0.0109
882     0.0227
924     0.0144
966     0.0238
1008    0.0085
1050    0.0105
1092    0.0120
1134    0.0119
1176    0.0076
1218    0.0156
1260    0.0183
1302    0.0116
1344    0.0130
1386    0.0120
1428    0.0109
1470    0.0118
1512    0.0108
1554    0.0090
1596    0.0159
1638    0.0108
1680    0.0025
1720    0.0191
1762    0.0092
1804    0.0111
1846    0.0090
1888    0.0100
1930    0.0086
1972    0.0177
2014    0.0038
2054    0.0100
2096    0.0136
2138    0.0094
2180    0.0150
2222    0.0109
Name: Prvlnc, dtype: float64

In [95]:
alcohol_male = alcohol[alcohol['Bene_Demo_Desc'] == 'Male']['Prvlnc']
alcohol_male

1       0.0306
43      0.0411
85      0.0274
127     0.0228
169     0.0315
211     0.0298
253     0.0436
295     0.0319
337     0.0518
379     0.0340
421     0.0296
463     0.0243
505     0.0260
547     0.0279
589     0.0328
631     0.0260
673     0.0255
715     0.0298
757     0.0313
799     0.0458
841     0.0317
883     0.0548
925     0.0385
967     0.0453
1009    0.0308
1051    0.0294
1093    0.0284
1135    0.0315
1177    0.0227
1219    0.0328
1261    0.0388
1303    0.0295
1345    0.0347
1387    0.0311
1429    0.0320
1471    0.0287
1513    0.0315
1555    0.0255
1597    0.0361
1639    0.0310
1681    0.0115
1721    0.0471
1763    0.0269
1805    0.0255
1847    0.0271
1889    0.0280
1931    0.0203
1973    0.0399
2015    0.0162
2055    0.0294
2097    0.0300
2139    0.0327
2181    0.0345
2223    0.0252
Name: Prvlnc, dtype: float64

In [101]:
t_stat, p_val = ttest_ind(alcohol_female, alcohol_male, equal_var=False)
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_val}")

T-statistic: -15.756728444684015
P-value: 1.9267043606049252e-26


In [97]:
# Compute means for Female and Male data with 'Alcohol Abuse'
female_mean = alcohol_female.mean()
male_mean = alcohol_male.mean()

print(f"Mean prevalence for females: {female_mean}")
print(f"Mean prevalence for males: {male_mean}")

Mean prevalence for females: 0.012149999999999998
Mean prevalence for males: 0.03152962962962963


### **Summary**

**Question:** Is there a difference in mean alcohol abuse prevalence between males and females?

**H0:** There is no difference in mean alcohol abuse prevalence between males and females.

**H1:** There is a difference in mean alcohol abuse prevalence between males and females.

**Interpretation**

The p-value is 1.93e-26, which is far below the significance level of 0.05, so we reject the null hypothesis: mean alcohol abuse prevalence differs significantly between females and males. The t-statistic of -15.76 is the difference between the two group means divided by its standard error (Welch's t-test, since `equal_var=False`). Its negative sign reflects that the female group was passed first and has the lower mean prevalence (0.0122 versus 0.0315 for males).
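
To make the t-statistic concrete, the sketch below computes Welch's t by hand and compares it with `ttest_ind`. It uses only the first five values of each group from the output above, so its numbers are illustrative and will not match the full-sample result.

```python
import numpy as np
from scipy.stats import ttest_ind

# First five state-level prevalence values of each group, copied from the output above
female = np.array([0.0092, 0.0204, 0.0114, 0.0084, 0.0131])
male = np.array([0.0306, 0.0411, 0.0274, 0.0228, 0.0315])

# Welch's t: difference in means divided by the unpooled standard error
se = np.sqrt(female.var(ddof=1) / len(female) + male.var(ddof=1) / len(male))
t_manual = (female.mean() - male.mean()) / se

# The same test via SciPy; equal_var=False selects Welch's version
t_scipy, p_scipy = ttest_ind(female, male, equal_var=False)
print(f"Manual t: {t_manual:.4f}, SciPy t: {t_scipy:.4f}")
```

Swapping the argument order flips the sign of the statistic but leaves the two-sided p-value unchanged.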

**______________________________________________________________________________________________________________**

## **3. ANOVA**

**______________________________________________________________________________________________________________**

## **4. Regression Analysis**