# Statistical Analysis

### Possible Research Questions
- Does sleep affect mental health?
- Does diet affect mental health?
- Does financial stress affect mental health?
- Do depressed students perform worse/better academically? (gpa)

In [6]:
# install.packages("dplyr")
library(dplyr)

In [7]:
# Load the cleaned dataset
df <- read.csv("clean_dataset.csv")
# Drop the first column
df <- df[, -1]
head(df, 15)

,id,gender,age,academic_pressure,work_pressure,gpa,study_sat,sleep_dur,diet,suicidal_thoughts,study_hrs,financial_stress,family_hist,depression,age_lab,sleep_lab
,<int>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<chr>,<int>,<chr>,<chr>
1,2,Male,33,5,0,8.97,2,'5-6 hours',Healthy,Yes,3,1,No,1,31+,>5
2,8,Female,24,2,0,5.9,5,'5-6 hours',Moderate,No,3,2,Yes,0,23-26,>5
3,26,Male,31,3,0,7.03,5,'Less than 5 hours',Healthy,No,9,1,Yes,0,31+,<5
4,30,Female,28,3,0,5.59,2,'7-8 hours',Moderate,Yes,4,5,Yes,1,27-30,>5
5,32,Female,25,4,0,8.13,3,'5-6 hours',Moderate,Yes,1,1,No,0,23-26,>5
6,33,Male,29,2,0,5.7,3,'Less than 5 hours',Healthy,No,4,1,No,0,27-30,<5
7,52,Male,30,3,0,9.54,4,'7-8 hours',Healthy,No,1,2,No,0,27-30,>5
8,56,Female,30,2,0,8.04,4,'Less than 5 hours',Unhealthy,No,0,1,Yes,0,27-30,<5
9,59,Male,28,3,0,9.79,1,'7-8 hours',Moderate,Yes,12,3,No,1,27-30,>5
10,62,Male,31,2,0,8.38,3,'Less than 5 hours',Moderate,Yes,2,5,No,1,31+,<5


## Research Question 2

Is there statistically significant evidence of a difference in mean depression score between genders? <br>
$H_0: \mu_{male} = \mu_{female}$ <br>
$H_a: \mu_{male} \neq \mu_{female}$ <br>

Welch's T-Test:
$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_{1}^2}{n_1} + \frac{s_{2}^2}{n_2}}}$
<br>
$\bar{x}_1, \bar{x}_2 = $ sample means of the female and male depression scores <br>
$s_1^2, s_2^2 =$ sample variances of the female and male depression scores <br>
$n_1 = 12{,}326$ (female), $n_2 = 15{,}511$ (male), for $27{,}837$ students in total <br>

Two Sample Z-Test for proportions:
$z = \frac{(\hat{p}_1 - \hat{p}_2)}{\sqrt{\hat{p}(1-\hat{p})(\frac{1}{n_1} + \frac{1}{n_2})}}$ <br>
$\hat{p}_1, \hat{p}_2 = $ sample proportions of depressed students among females and males <br>
$\hat{p} = $ pooled sample proportion of depressed students across both genders <br>
$n_1, n_2 = $ sample sizes of the female and male groups <br>


In [8]:
# Count the number of students of each gender
table(df$gender)


Female   Male 
 12326  15511 

In [9]:
gender <- df %>%
    group_by(gender) %>%
    summarise(
        mean_depression = mean(depression, na.rm = TRUE),
        sd_depression = sd(depression, na.rm = TRUE),
        variance_depression = var(depression, na.rm=TRUE),
        n = n())
gender

gender,mean_depression,sd_depression,variance_depression,n
<chr>,<dbl>,<dbl>,<dbl>,<int>
Female,0.58405,0.492905,0.2429553,12326
Male,0.5861002,0.4925468,0.2426024,15511


In [24]:
#Run a two-sample t-test for difference in means
t.test(depression ~ gender, data = df)


	Welch Two Sample t-test

data:  depression by gender
t = -0.34482, df = 26424, p-value = 0.7302
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
 -0.01370412  0.00960370
sample estimates:
mean in group Female   mean in group Male 
           0.5840500            0.5861002 
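
As a cross-check, the Welch statistic and its degrees of freedom can be recomputed from the group summaries printed earlier. The sketch below is illustrative only: it hard-codes the rounded means, variances, and group sizes from the `summarise()` output, and the variable names (`m_f`, `v_f`, `df_welch`, and so on) are introduced here rather than taken from the notebook. The degrees of freedom use the Welch-Satterthwaite approximation, which `t.test` applies when `var.equal = FALSE` (its default).

```r
# Group summaries copied from the summarise() output above (rounded values)
m_f <- 0.5840500; v_f <- 0.2429553; n_f <- 12326   # Female
m_m <- 0.5861002; v_m <- 0.2426024; n_m <- 15511   # Male

# Squared standard error contributed by each group
se2_f <- v_f / n_f
se2_m <- v_m / n_m

# Welch t statistic: difference in means over the unpooled standard error
t_stat <- (m_f - m_m) / sqrt(se2_f + se2_m)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (se2_f + se2_m)^2 / (se2_f^2 / (n_f - 1) + se2_m^2 / (n_m - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

round(c(t = t_stat, df = df_welch, p = p_value), 4)
```

Up to rounding of the inputs, these values should agree with the `t.test` output above.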


In [42]:
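#Two-sample z-test for diff of proportions
# Note: rows = depression (0/1), columns = gender, so prop.test() compares the share of females across depression groups
prop.test(table(df$depression, df$gender))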


	2-sample test for equality of proportions with continuity correction

data:  table(df$depression, df$gender)
X-squared = 0.11063, df = 1, p-value = 0.7394
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.009834373  0.014002110
sample estimates:
   prop 1    prop 2 
0.4440114 0.4419276 


In testing the difference in mean depression score between females and males, we fail to reject the null hypothesis. Since our p-value of 0.7302 is greater than our alpha of 0.05, there is not statistically significant evidence of a difference in mean depression between genders. The mean depression score is ~0.5840 for females and ~0.5861 for males; because depression is coded 0/1, these means are also the proportions of depressed students in each group. I also conducted a two-sample proportion test and reached the same conclusion: its p-value of 0.7394 is not within the rejection region, so we again fail to reject the null hypothesis. Note that the table passed to `prop.test` has depression in its rows and gender in its columns, so the reported proportions (~0.4440 and ~0.4419) are the shares of females among non-depressed and depressed students, respectively, not depression rates by gender.
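
`prop.test` treats the rows of a two-column table as groups and the first column as successes, so the orientation of the table determines which proportions are estimated. A minimal sketch of a gender-oriented call follows. The counts are not read from the dataset; they are reconstructed from the group means and sizes reported above, under the assumption that rounding `n * mean` recovers the exact number of depressed students. The names `n_group`, `mean_dep`, and `depressed` are illustrative.

```r
# Group sizes and depression rates as printed in the summary above
n_group  <- c(Female = 12326, Male = 15511)
mean_dep <- c(Female = 0.5840500, Male = 0.5861002)

# Assumption: rounding n * mean recovers the number of depressed students
depressed <- round(n_group * mean_dep)

# Successes = depressed students, trials = students of each gender
res <- prop.test(x = depressed, n = n_group)

res$estimate   # estimated depression rate for each gender
res$statistic  # chi-squared statistic with continuity correction
res$p.value
```

With this orientation the estimates are depression rates by gender, while the statistic and p-value should match the table-based call, since both test the same 2x2 table.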

## Research Question 3

Is there statistically significant evidence of a difference in suicidal thoughts between genders? <br>
$H_0: \mu_{male} = \mu_{female}$ <br>
$H_a: \mu_{male} \neq \mu_{female}$ <br>

Welch's T-Test:
$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_{1}^2}{n_1} + \frac{s_{2}^2}{n_2}}}$
<br>
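$\bar{x}_1, \bar{x}_2 = $ sample means of the female and male suicidal thoughts indicator (1 = "Yes", 0 = "No") <br>
$s_1^2, s_2^2 =$ sample variances of the female and male suicidal thoughts indicator <br>
$n_1 = 12{,}326$ (female), $n_2 = 15{,}511$ (male), for $27{,}837$ students in total <br>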

Two Sample Z-Test for proportions:
$z = \frac{(\hat{p}_1 - \hat{p}_2)}{\sqrt{\hat{p}(1-\hat{p})(\frac{1}{n_1} + \frac{1}{n_2})}}$ <br>
$\hat{p}_1, \hat{p}_2 = $ sample proportions of females and males reporting suicidal thoughts <br>
$\hat{p} = $ pooled sample proportion of "Yes" across both genders <br>
$n_1, n_2 = $ sample sizes of the female and male groups <br>

In [36]:
# Encode suicidal thoughts as 0/1 so that the group mean is the proportion answering "Yes"
df$suicidal_thoughts_num <- ifelse(df$suicidal_thoughts == "Yes", 1, 0)

# Summary statistics by gender
gender_thoughts <- df %>%
    group_by(gender) %>%
    summarise(
        mean_thoughts = mean(suicidal_thoughts_num, na.rm = TRUE),
        sd_thoughts = sd(suicidal_thoughts_num, na.rm = TRUE),
        n = n())
gender_thoughts

# Welch two-sample t-test for the difference in means
t.test(suicidal_thoughts_num ~ gender, data = df)

gender,mean_thoughts,sd_thoughts,n
<chr>,<dbl>,<dbl>,<int>
Female,0.6333766,0.481902,12326
Male,0.6320031,0.482276,15511



	Welch Two Sample t-test

data:  suicidal_thoughts_num by gender
t = 0.23613, df = 26441, p-value = 0.8133
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
 -0.01002784  0.01277486
sample estimates:
mean in group Female   mean in group Male 
           0.6333766            0.6320031 
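
Plugging the group summaries above into Welch's formula gives a quick sanity check on the reported statistic. The inputs are the rounded printed values, so agreement is approximate:

$$t \approx \frac{0.6333766 - 0.6320031}{\sqrt{\frac{0.481902^2}{12326} + \frac{0.482276^2}{15511}}} \approx \frac{0.0013735}{0.0058169} \approx 0.236$$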


In [41]:
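#Two-sample z-test for proportions
# Note: rows = suicidal thoughts (0/1), columns = gender, so prop.test() compares the share of females across the two groups
prop.test(table(df$suicidal_thoughts_num, df$gender))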


	2-sample test for equality of proportions with continuity correction

data:  table(df$suicidal_thoughts_num, df$gender)
X-squared = 0.049996, df = 1, p-value = 0.8231
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.01363786  0.01072167
sample estimates:
   prop 1    prop 2 
0.4418696 0.4433277 


In testing the difference in mean suicidal thoughts between females and males, we fail to reject the null hypothesis. Since our p-value of 0.8133 is greater than our alpha of 0.05, there is not statistically significant evidence of a difference in mean suicidal thoughts between genders. The mean of the suicidal thoughts indicator is ~0.6334 for females and ~0.6320 for males; because the indicator is coded 0/1, these means are also the proportions of students in each group who report suicidal thoughts. I also conducted a two-sample proportion test and reached the same conclusion: its p-value of 0.8231 is not within the rejection region, so we again fail to reject the null hypothesis. There is not enough evidence of a difference in the mean or proportion of suicidal thoughts between males and females. Note that the table passed to `prop.test` has suicidal thoughts in its rows and gender in its columns, so the reported proportions (~0.4419 and ~0.4433) are the shares of females among students without and with suicidal thoughts, respectively, not rates of suicidal thoughts by gender.
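
For a 2x2 table, the two-sample proportion test with continuity correction and Pearson's chi-squared test with Yates' correction give the same statistic, and transposing the table changes the estimated proportions but not the statistic. The sketch below illustrates this on a small table of made-up counts. The numbers and the name `toy` are hypothetical and are not taken from the dataset.

```r
# Hypothetical 2x2 table: rows = gender, columns = suicidal thoughts (Yes first)
toy <- matrix(c(60, 40,
                55, 45),
              nrow = 2, byrow = TRUE,
              dimnames = list(gender = c("Female", "Male"),
                              thoughts = c("Yes", "No")))

by_gender   <- prop.test(toy)      # groups = genders, success = "Yes"
by_thoughts <- prop.test(t(toy))   # groups = Yes/No, success = "Female"
chi         <- chisq.test(toy)     # Yates' correction is the default for 2x2 tables

# The estimated proportions depend on the orientation ...
by_gender$estimate
by_thoughts$estimate

# ... but the test statistic does not
c(by_gender$statistic, by_thoughts$statistic, chi$statistic)
```

This is why the p-value of a table-based `prop.test` call remains a valid test of association whichever variable is placed in the rows, even though the printed `prop 1` and `prop 2` change meaning.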

## Chi-Squared Test of Independence for Family History and Depression
$H_0:$ Depression is independent of family history <br>
$H_a:$ Depression is not independent of family history <br>

$\chi^2 = \sum{\frac{(O_{ij} - E_{ij})^2}{E_{ij}}}$

$O_{ij} = $ observed number of students in each of the four cells: family history & depression, family history & no depression, no family history & depression, no family history & no depression. <br>
$E_{ij} = $ expected number of students in each cell if family history and depression were independent.
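
To make $E_{ij}$ concrete: under independence, each expected count is (row total × column total) / grand total. The following sketch computes this by hand for a hypothetical 2x2 table and compares it with the `expected` component returned by `chisq.test`. The counts and the name `toy_tab` are invented for illustration and do not come from the dataset.

```r
# Hypothetical counts: rows = depression (0/1), columns = family history (No/Yes)
toy_tab <- matrix(c(30, 20,
                    25, 45),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(depression = c("0", "1"),
                                  family_hist = c("No", "Yes")))

# Expected counts under independence: row total * column total / grand total
expected_by_hand <- outer(rowSums(toy_tab), colSums(toy_tab)) / sum(toy_tab)

# Pearson statistic without continuity correction, matching the formula above
chi_sq_by_hand <- sum((toy_tab - expected_by_hand)^2 / expected_by_hand)

fit <- chisq.test(toy_tab, correct = FALSE)
expected_by_hand
fit$expected
c(by_hand = chi_sq_by_hand, chisq.test = unname(fit$statistic))
```

By default `chisq.test` applies Yates' continuity correction to 2x2 tables, so its default statistic is slightly smaller than the uncorrected formula gives; `correct = FALSE` is used here only to match the formula term by term.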

In [20]:
# Cross-tabulate depression against family history and test for independence
family_hist <- factor(df$family_hist)
depression <- factor(df$depression)
tab <- table(depression, family_hist)
chisq.test(tab)


	Pearson's Chi-squared test with Yates' continuity correction

data:  tab
X-squared = 79.033, df = 1, p-value < 2.2e-16


Since our p-value is less than our alpha of 0.05, we reject the null hypothesis. The p-value (< 2.2e-16) falls within the rejection region. There is statistically significant evidence of an association between family history and depression.