<a href="https://colab.research.google.com/github/amnasyed1/datasci_5_statistics/blob/main/datasci_5_statistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Chi-Square Test**


In [None]:
import pandas as pd
import scipy
from scipy.stats import chi2_contingency as chi2

In [11]:
# Survey data on mental health collected from people who work in the tech industry
mhtech = pd.read_csv('https://raw.githubusercontent.com/amnasyed1/datasci_5_statistics/main/datasets/mental_health_in_tech_survey.csv')
mhtech

Unnamed: 0,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37,Female,United States,IL,,No,Yes,Often,6-25,...,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,
1,2014-08-27 11:29:37,44,M,United States,IN,,No,No,Rarely,More than 1000,...,Don't know,Maybe,No,No,No,No,No,Don't know,No,
2,2014-08-27 11:29:44,32,Male,Canada,,,No,No,Rarely,6-25,...,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,
3,2014-08-27 11:29:46,31,Male,United Kingdom,,,Yes,Yes,Often,26-100,...,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes,
4,2014-08-27 11:30:22,31,Male,United States,TX,,No,No,Never,100-500,...,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1254,2015-09-12 11:17:21,26,male,United Kingdom,,No,No,Yes,,26-100,...,Somewhat easy,No,No,Some of them,Some of them,No,No,Don't know,No,
1255,2015-09-26 01:07:35,32,Male,United States,IL,No,Yes,Yes,Often,26-100,...,Somewhat difficult,No,No,Some of them,Yes,No,No,Yes,No,
1256,2015-11-07 12:36:58,34,male,United States,CA,No,Yes,Yes,Sometimes,More than 1000,...,Somewhat difficult,Yes,Yes,No,No,No,No,No,No,
1257,2015-11-30 21:25:06,46,f,United States,NC,No,No,No,,100-500,...,Don't know,Yes,No,No,No,No,No,No,No,


## Hypothesis

Hypothesis: Among people with a mental health condition, working remotely is associated with how often the condition interferes with their work.

- Null Hypothesis (H0): There is no association between working remotely and how often a mental health condition interferes with work.

- Alternative Hypothesis (H1): There is an association between working remotely and how often a mental health condition interferes with work.

In [4]:
mhtech.columns

Index(['Timestamp', 'Age', 'Gender', 'Country', 'state', 'self_employed',
       'family_history', 'treatment', 'work_interfere', 'no_employees',
       'remote_work', 'tech_company', 'benefits', 'care_options',
       'wellness_program', 'seek_help', 'anonymity', 'leave',
       'mental_health_consequence', 'phys_health_consequence', 'coworkers',
       'supervisor', 'mental_health_interview', 'phys_health_interview',
       'mental_vs_physical', 'obs_consequence', 'comments'],
      dtype='object')

In [5]:
mhtech['remote_work'].value_counts()

No     883
Yes    376
Name: remote_work, dtype: int64

In [6]:
mhtech['work_interfere'].value_counts()

Sometimes    465
Never        213
Rarely       173
Often        144
Name: work_interfere, dtype: int64

In [7]:
contingency_table = pd.crosstab(mhtech['remote_work'], mhtech['work_interfere'])
print(contingency_table)

work_interfere  Never  Often  Rarely  Sometimes
remote_work                                    
No                157     93     119        322
Yes                56     51      54        143


In [8]:
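# Store the statistic under a new name so the imported chi2 function is not overwritten
chi2_stat, p, dof, expected = chi2(contingency_table)
print(f"Chi2 value: {chi2_stat}")
print(f"P-value: {p}")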

Chi2 value: 3.4731660005348104
P-value: 0.3242591134502088


## Results

Significance Level = 0.05

P-value: 0.3242591134502088

Because the p-value (0.324) is greater than the significance level (0.05), we fail to reject H0. The data do not provide sufficient evidence of an association between working remotely and how often a mental health condition interferes with work.
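
To make the origin of the statistic more concrete, here is a minimal sketch that recomputes it by hand from the contingency table printed above, using only the standard library. The closed-form p-value is a shortcut that holds only for 3 degrees of freedom, which is the case for this 2 x 4 table; the variable names are illustrative and not part of the original notebook. The printed values should agree with what `chi2_contingency` reported.

```python
import math

# Observed counts copied from the contingency table above.
# Rows: remote_work = No, Yes. Columns: Never, Often, Rarely, Sometimes.
observed = [
    [157, 93, 119, 322],
    [56, 51, 54, 143],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected
chi2_stat = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

# Upper-tail probability of the chi-square distribution.
# This closed form is valid ONLY for 3 degrees of freedom.
assert dof == 3
half = chi2_stat / 2
p_value = math.erfc(math.sqrt(half)) + math.sqrt(2 * chi2_stat / math.pi) * math.exp(-half)

alpha = 0.05
print(f"chi2 = {chi2_stat:.4f}, dof = {dof}, p = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

For tables of other shapes, the p-value is better taken from `scipy.stats.chi2_contingency` than from a hand-written formula.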

## Rationale

I chose to examine the "remote_work" and "work_interfere" columns because a growing number of companies, especially in the tech industry, either allow or require employees to work remotely. Working remotely may be isolating, and for individuals with mental health conditions it may interfere with their work.

# **T-Test**

In [30]:
import pandas as pd
from scipy.stats import ttest_ind

In [31]:
brca = pd.read_csv('https://raw.githubusercontent.com/amnasyed1/datasci_5_statistics/main/datasets/BRCA.csv')
brca.head(10)

Unnamed: 0,Patient_ID,Age,Gender,Protein1,Protein2,Protein3,Protein4,Tumour_Stage,Histology,ER status,PR status,HER2 status,Surgery_type,Date_of_Surgery,Date_of_Last_Visit,Patient_Status
0,TCGA-D8-A1XD,36.0,FEMALE,0.080353,0.42638,0.54715,0.27368,III,Infiltrating Ductal Carcinoma,Positive,Positive,Negative,Modified Radical Mastectomy,15-Jan-17,19-Jun-17,Alive
1,TCGA-EW-A1OX,43.0,FEMALE,-0.42032,0.57807,0.61447,-0.031505,II,Mucinous Carcinoma,Positive,Positive,Negative,Lumpectomy,26-Apr-17,09-Nov-18,Dead
2,TCGA-A8-A079,69.0,FEMALE,0.21398,1.3114,-0.32747,-0.23426,III,Infiltrating Ductal Carcinoma,Positive,Positive,Negative,Other,08-Sep-17,09-Jun-18,Alive
3,TCGA-D8-A1XR,56.0,FEMALE,0.34509,-0.21147,-0.19304,0.12427,II,Infiltrating Ductal Carcinoma,Positive,Positive,Negative,Modified Radical Mastectomy,25-Jan-17,12-Jul-17,Alive
4,TCGA-BH-A0BF,56.0,FEMALE,0.22155,1.9068,0.52045,-0.31199,II,Infiltrating Ductal Carcinoma,Positive,Positive,Negative,Other,06-May-17,27-Jun-19,Dead
5,TCGA-AO-A1KQ,84.0,MALE,-0.081872,1.7241,-0.057335,0.043025,III,Infiltrating Ductal Carcinoma,Positive,Positive,Negative,Modified Radical Mastectomy,18-Sep-17,15-Nov-21,Alive
6,TCGA-D8-A73X,53.0,FEMALE,-0.069535,1.4183,-0.36105,0.39158,II,Infiltrating Ductal Carcinoma,Positive,Positive,Negative,Simple Mastectomy,04-Feb-17,07-Feb-18,Alive
7,TCGA-A7-A426,50.0,FEMALE,0.67249,1.279,-0.32107,-0.11239,III,Infiltrating Ductal Carcinoma,Positive,Positive,Negative,Simple Mastectomy,16-May-17,,
8,TCGA-EW-A1P5,77.0,FEMALE,-0.15175,-0.66332,1.1894,0.21718,II,Infiltrating Ductal Carcinoma,Positive,Positive,Negative,Modified Radical Mastectomy,28-Sep-17,28-Sep-18,Alive
9,TCGA-A8-A09A,40.0,FEMALE,-0.5657,1.2668,-0.29346,0.19395,II,Infiltrating Lobular Carcinoma,Positive,Positive,Positive,Other,14-Feb-17,15-Dec-17,Alive


## Hypothesis

Null Hypothesis (H0): There is no difference in mean Protein4 expression levels between patients whose Patient_Status is Alive and those whose Patient_Status is Dead.

Alternative Hypothesis (H1): There is a difference in mean Protein4 expression levels between patients whose Patient_Status is Alive and those whose Patient_Status is Dead.


In [34]:
brca.columns

Index(['Patient_ID', 'Age', 'Gender', 'Protein1', 'Protein2', 'Protein3',
       'Protein4', 'Tumour_Stage', 'Histology', 'ER status', 'PR status',
       'HER2 status', 'Surgery_type', 'Date_of_Surgery', 'Date_of_Last_Visit',
       'Patient_Status'],
      dtype='object')

In [37]:
dead_data = brca[brca['Patient_Status'] == 'Dead']['Protein4']
dead_data

1     -0.031505
4     -0.311990
11    -0.350700
18     0.601030
25     1.629900
         ...   
329   -0.517350
330   -0.354920
331    0.339680
332   -0.188480
333   -0.067848
Name: Protein4, Length: 66, dtype: float64

In [38]:
alive_data = brca[brca['Patient_Status'] == 'Alive']['Protein4']
alive_data

0      0.273680
2     -0.234260
3      0.124270
5      0.043025
6      0.391580
         ...   
322    1.044100
323    0.351920
324    0.261420
325    0.458740
326    0.721820
Name: Protein4, Length: 255, dtype: float64

In [40]:
# Welch's t-test (equal_var=False) does not assume equal variances in the two groups
t_stat, p_val = ttest_ind(alive_data, dead_data, equal_var=False)
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_val}")

T-statistic: -1.4214132952738339
P-value: 0.1582830197703266
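
As a rough illustration of what Welch's t-test computes, the sketch below derives the t statistic and the Welch-Satterthwaite degrees of freedom from two samples using only the standard library. The function name `welch_t` and the toy samples are hypothetical and are not taken from the BRCA data; the p-value is left out because it needs the t distribution, which `scipy.stats.ttest_ind(..., equal_var=False)` already handles.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # statistics.variance is the sample variance (it divides by n - 1)
    var_term_a = variance(sample_a) / n_a
    var_term_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(var_term_a + var_term_b)
    dof = (var_term_a + var_term_b) ** 2 / (
        var_term_a ** 2 / (n_a - 1) + var_term_b ** 2 / (n_b - 1)
    )
    return t_stat, dof


# Hypothetical toy samples, used only to exercise the function
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]

t_stat_demo, dof_demo = welch_t(group_a, group_b)
print(f"t = {t_stat_demo:.4f}, dof = {dof_demo:.4f}")
```

The sign of the statistic follows the argument order: a negative value means the first sample has the lower mean.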


In [41]:
# Compute means for alive and dead data
alive_mean = alive_data.mean()
dead_mean = dead_data.mean()

print(f"Mean Protein4 expression for alive patients: {alive_mean}")
print(f"Mean Protein4 expression for dead patients: {dead_mean}")

Mean Protein4 expression for alive patients: -0.016361614666666666
Mean Protein4 expression for dead patients: 0.10607757575757576


## Results

Significance Level = 0.05

P-value: 0.1582830197703266

Because the p-value (0.158) is greater than the significance level (0.05), we fail to reject H0. The test does not provide sufficient evidence that mean Protein4 expression levels differ between patients who are alive and patients who have died.

## Rationale

I chose to examine the "Protein4" and "Patient_Status" columns to analyze whether Protein4 expression levels are related to a breast cancer patient's survival status.


# **ANOVA**

In [43]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [44]:
asthma = pd.read_csv('https://raw.githubusercontent.com/amnasyed1/datasci_5_statistics/main/datasets/current-asthma-prevalence-by-county-2015_2020.csv')
asthma

Unnamed: 0,COUNTY,YEARS,STRATA,AGE GROUP,CURRENT PREVALENCE,95% CONFIDENCE INTERVAL,COUNTIES GROUPED,COMMENT,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15
0,California,2015–2016,Total population,All ages,8.7%,(8.1–9.3),,,,,,,,,,
1,Alameda,2015–2016,Total population,All ages,9.1%,(5.9–12.2),,,,,,,,,,
2,Alpine,2015–2016,Total population,All ages,9.3%,(4.1–14.6),"Alpine, Amador, Calaveras, Inyo, Mariposa, Mon...",,,,,,,,,
3,Amador,2015–2016,Total population,All ages,9.3%,(4.1–14.6),"Alpine, Amador, Calaveras, Inyo, Mariposa, Mon...",,,,,,,,,
4,Butte,2015–2016,Total population,All ages,9.4%,(3.8–15.1),,Estimate is statistically unstable. Caution is...,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1234,Tulare,2019–2020,Age groups,65+ years,13.3%,(9.0–17.7),,,,,,,,,,
1235,Tuolumne,2019–2020,Age groups,65+ years,10.3%,(2.6–17.9),"Alpine, Amador, Calaveras, Inyo, Mariposa, Mon...",Prevalence estimate is statistically unstable....,,,,,,,,
1236,Ventura,2019–2020,Age groups,65+ years,7.5%,(3.9–11.1),,,,,,,,,,
1237,Yolo,2019–2020,Age groups,65+ years,14.0%,(7.6–20.3),,,,,,,,,,


In [45]:
asthma.columns

Index(['COUNTY', 'YEARS', 'STRATA', 'AGE GROUP', 'CURRENT PREVALENCE',
       '95% CONFIDENCE INTERVAL', 'COUNTIES GROUPED', 'COMMENT', 'Unnamed: 8',
       'Unnamed: 9', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12',
       'Unnamed: 13', 'Unnamed: 14', 'Unnamed: 15'],
      dtype='object')

In [63]:
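# Square brackets are not valid formula syntax; names with spaces would need Q("...").
# The confidence-interval column holds text ranges, so this version assumes the
# intended numeric response is CURRENT PREVALENCE with the "%" sign stripped.
asthma['prevalence'] = pd.to_numeric(asthma['CURRENT PREVALENCE'].str.rstrip('%'), errors='coerce')
model = ols('prevalence ~ COUNTY * YEARS', data=asthma).fit()
sm.stats.anova_lm(model, typ=2)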
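
As background for the model above, a one-way ANOVA compares the variation between group means with the variation inside the groups. The sketch below computes that F statistic with the standard library only. The function name `one_way_anova_f`, the region labels, and the prevalence values are all hypothetical and are not drawn from the asthma dataset; a two-factor model with an interaction, as attempted above, is better left to `statsmodels`.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a dict of group label -> list of values."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)

    # Between-group sum of squares: group size * squared distance of group mean from grand mean
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2 for values in groups.values()
    )
    # Within-group sum of squares: squared distance of each value from its own group mean
    ss_within = sum(
        (v - mean(values)) ** 2 for values in groups.values() for v in values
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical prevalence values (percent) for three made-up groups
toy_groups = {
    "Region A": [8.0, 9.0, 10.0],
    "Region B": [10.0, 11.0, 12.0],
    "Region C": [12.0, 13.0, 14.0],
}

f_stat, df_between, df_within = one_way_anova_f(toy_groups)
print(f"F = {f_stat:.2f} on ({df_between}, {df_within}) degrees of freedom")
```

A large F value indicates that the group means are far apart relative to the spread within each group.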

# **Regression**

In [42]:
# The dataset contains diabetes medical predictor variables. The patients in the dataset are women of Pima Native American heritage, who are 21 years old or older.
diabetes = pd.read_csv('https://raw.githubusercontent.com/amnasyed1/datasci_5_statistics/main/datasets/diabetes.csv')
diabetes

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


## Hypothesis

Hypothesis: There is no correlation between high BMI and glucose levels.
- BMI less than 18.5 = underweight range
- BMI 18.5 to <25 = healthy weight range
- BMI 25.0 to <30 = overweight range
- BMI 30.0 or higher = obesity range

Reference for indicating BMI ranges:
https://www.cdc.gov/obesity/basics/adult-defining.html#:~:text=If%20your%20BMI%20is%20less,falls%20within%20the%20obesity%20range.
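
The BMI ranges listed above can be turned into a small helper for labelling rows. This is a minimal sketch: the function name `bmi_category` and the label strings are assumptions made for illustration, and the sample values are the BMI entries of the first three rows in the preview.

```python
def bmi_category(bmi):
    """Map an adult BMI value to the weight-status range listed above."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25.0:
        return "healthy weight"
    if bmi < 30.0:
        return "overweight"
    return "obesity"


# BMI values of the first three rows shown in the dataset preview
for value in (33.6, 26.6, 23.3):
    print(value, bmi_category(value))
```

With pandas, the same mapping could be applied to the whole column via `diabetes['BMI'].apply(bmi_category)`.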

- Null Hypothesis (H0): There is no correlation between high BMI and glucose levels.

- Alternative Hypothesis (H1): There is a correlation between high BMI and glucose levels.
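
One possible way to examine these hypotheses is a simple linear regression of glucose on BMI together with Pearson's correlation coefficient. The sketch below implements both with the standard library. The function name `linear_fit` is hypothetical, and the input is limited to the nine rows visible in the preview above, so the printed numbers are illustrative only and are not the result for the full dataset.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least squares for y = intercept + slope * x, plus Pearson's r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# BMI and Glucose from the nine rows shown in the preview (not the full dataset)
bmi = [33.6, 26.6, 23.3, 28.1, 43.1, 32.9, 36.8, 26.2, 30.1]
glucose = [148, 85, 183, 89, 137, 101, 122, 121, 126]

slope, intercept, r = linear_fit(bmi, glucose)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, r = {r:.3f}")
```

On the full data, the same fit with a p-value for the slope could be obtained from `scipy.stats.linregress(diabetes['BMI'], diabetes['Glucose'])`.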