# Preliminary Analysis

In this preliminary analysis, I used a linear regression model to explore whether sex has an effect on life satisfaction. Specifically, I estimated the model $Y_i = \beta_0 + \beta_1 S_i + \alpha X_i + \epsilon_i$, where $Y_i$ is "satisfaction with life in general" and $S_i$ is "sex"* in the Canadian Community Health Survey (CCHS) data. I also included "emotional bond with more than 1 person", denoted by $X_i$, as a control variable.

In the first stage, I set $\alpha = 0$ and fit a simple regression to see whether sex and life satisfaction are related without controlling for emotional bond. In the second stage, I added $X_i$ to the model.
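Because $X_i$ is categorical with four levels, the term $\alpha X_i$ is shorthand: when R fits the model with a factor, it expands into three indicator variables. A sketch of the two stages written out in full follows. The indicators $D_i$ and the coefficients $\alpha_1, \alpha_2, \alpha_3$ are notation introduced here for illustration, and "strongly agree" is taken as the baseline level because it is the level that does not appear among the Stage 2 coefficients.

$$
\begin{aligned}
\text{Stage 1:}\quad Y_i &= \beta_0 + \beta_1 S_i + \epsilon_i \\
\text{Stage 2:}\quad Y_i &= \beta_0 + \beta_1 S_i + \alpha_1 D_i^{\text{agree}} + \alpha_2 D_i^{\text{disagree}} + \alpha_3 D_i^{\text{strongly disagree}} + \epsilon_i
\end{aligned}
$$

Under this reading, each $\alpha_k$ is the difference in expected satisfaction between that level and "strongly agree", holding sex fixed.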

The results suggest that sex may not have an effect on life satisfaction once emotional bond is controlled for, while emotional bond appears to have a noticeable association with the outcome.

*"Gender" might be the preferable variable of interest, but the CCHS data set does not provide it.

## Data Features
- The variable "satisfaction with life in general" is quantitative and ranges from 0 (Very dissatisfied) to 10 (Very satisfied). 
- The variable "sex" is a dummy variable with two levels (Male and Female).
- "Emotional bond with more than 1 person" is on a Likert scale with 4 levels: strongly disagree, disagree, agree, and strongly agree.
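
To make the coding of the two categorical variables concrete, here is a minimal sketch of how R turns factors into indicator columns. The `toy` data frame holds hypothetical values, not CCHS records, and the level ordering is an assumption chosen to mirror the coefficient names in the regression output.

```r
# Toy data (hypothetical values, not CCHS records)
bond_levels <- c("Strongly agree", "Agree", "Disagree", "Strongly disagree")
toy <- data.frame(
  sex      = factor(c("Male", "Female", "Female", "Male"),
                    levels = c("Male", "Female")),
  emo_bond = factor(bond_levels, levels = bond_levels)
)

# model.matrix() builds the design matrix that lm() would use:
# the first level of each factor is the baseline and gets no column
design <- model.matrix(~ sex + emo_bond, data = toy)
print(colnames(design))
print(design)
```

The first level of each factor ("Male" and "Strongly agree" here) is absorbed into the intercept, so every other coefficient is read as a contrast against that baseline.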

## Stage 1: Simple Regression
### Data Cleaning

In [1]:
library(tidyverse)
library(haven)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.8     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

“package ‘ggplot2’ was built under R version 4.1.3”
“package ‘tidyr’ was built under R version 4.1.2”
“package ‘readr’ was built under R version 4.1.2”
“package ‘dplyr’ was built under R version 4.1.3”
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

“package ‘haven’ was built under R version 4.1.3”


In [2]:
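# Load the CCHS extract and keep only the three variables used below
data <- read_dta("data/CCHS_Annual_2017_2018_curated_trimmed_25%.dta") |>
    select(GEN_010, DHH_SEX, SPS_040) |>
    na.omit() |>  # drop respondents missing any of the three variables
    rename(satisfaction = GEN_010, sex = DHH_SEX, emo_bond = SPS_040) |>
    filter(emo_bond <= 4) |>  # keep only the four Likert responses (codes up to 4)
    mutate(sex = as_factor(sex),  # convert labelled codes to factors
           emo_bond = as_factor(emo_bond))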

### t-test

First, I use a Welch two-sample t-test to estimate the difference in mean $Y_i$ between the two levels of $S_i$.

In [3]:
t1 = t.test(
       x = filter(data, sex == "Male")$satisfaction,
       y = filter(data, sex == "Female")$satisfaction,
       alternative = "two.sided",
       mu = 0,
       conf.level = 0.95)

t1

# Difference in sample means: male minus female
round(t1$estimate[1] - t1$estimate[2], 3)


	Welch Two Sample t-test

data:  filter(data, sex == "Male")$satisfaction and filter(data, sex == "Female")$satisfaction
t = -2.2037, df = 7944.7, p-value = 0.02757
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.156510465 -0.009150901
sample estimates:
mean of x mean of y 
 7.985047  8.067877 


From the t-test, we can reject the null hypothesis of equal means at the 5% significance level (p-value = 0.028 < 0.05) and infer that mean life satisfaction differs between males and females. The sex-satisfaction gap (male minus female) is -0.083: on average, males report slightly lower satisfaction than females.
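
As a consistency check, the gap and its confidence interval can be rebuilt from the numbers printed in the t-test output. This is a sketch that hard-codes the reported means, t statistic, and degrees of freedom; the variable names are introduced here for illustration.

```r
# Values copied from the t-test output above
mean_male   <- 7.985047
mean_female <- 8.067877
t_stat      <- -2.2037
df          <- 7944.7

gap    <- mean_male - mean_female   # male minus female
se     <- gap / t_stat              # standard error implied by t = gap / se
margin <- qt(0.975, df) * se        # half-width of the 95% interval
ci     <- c(lower = gap - margin, upper = gap + margin)

print(round(gap, 3))
print(round(se, 4))
print(round(ci, 3))
```

The reconstructed interval should agree with the reported one up to rounding of the inputs, and it excludes zero, which is the same statement as p-value < 0.05.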

### Estimation of Simple Regression Model

In [4]:
regression1 <- lm(satisfaction ~ sex, data = data)

summary(regression1)


Call:
lm(formula = satisfaction ~ sex, data = data)

Residuals:
   Min     1Q Median     3Q    Max 
-8.068 -0.985  0.015  1.015  2.015 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.98505    0.02764 288.908   <2e-16 ***
sexFemale    0.08283    0.03759   2.203   0.0276 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.691 on 8148 degrees of freedom
Multiple R-squared:  0.0005954,	Adjusted R-squared:  0.0004728 
F-statistic: 4.854 on 1 and 8148 DF,  p-value: 0.0276


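The coefficient on $S_i$ (`sexFemale`) is 0.08283: females report life satisfaction about 0.083 points higher than males on average, which matches the gap from the t-test. The coefficient is statistically significant at the 5% level (p-value = 0.028), although the model explains very little of the variation in satisfaction ($R^2 \approx 0.0006$).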
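
The coefficient 0.08283 has the same magnitude as the t-test gap because, with a single dummy regressor, the OLS slope equals the difference in group means and the intercept equals the baseline group mean. A minimal sketch on simulated data follows; the `toy` data frame and its parameters are hypothetical and are not drawn from the CCHS.

```r
set.seed(1)

# Simulated data (hypothetical, for illustration only)
toy <- data.frame(
  sex = factor(rep(c("Male", "Female"), each = 50),
               levels = c("Male", "Female")),
  satisfaction = c(rnorm(50, mean = 8.0, sd = 1.7),
                   rnorm(50, mean = 8.1, sd = 1.7))
)

group_means <- tapply(toy$satisfaction, toy$sex, mean)
fit <- lm(satisfaction ~ sex, data = toy)

# Female minus male mean, and the fitted slope on the Female dummy
mean_gap <- unname(group_means["Female"] - group_means["Male"])
slope    <- unname(coef(fit)["sexFemale"])

print(c(mean_gap = mean_gap, slope = slope))
print(all.equal(mean_gap, slope))
```

The sign differs from the t-test gap only because the t-test subtracts female from male, while the regression reports female relative to male.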

## Stage 2: Adding the Control Variable to the Model

In [5]:
regression2 <- lm(satisfaction ~ sex + emo_bond, data = data)

summary(regression2)


Call:
lm(formula = satisfaction ~ sex + emo_bond, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.3452 -0.7000  0.2927  1.3000  4.3572 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)                8.345151   0.031913 261.500   <2e-16 ***
sexFemale                 -0.007252   0.036577  -0.198    0.843    
emo_bondAgree             -0.637879   0.037920 -16.822   <2e-16 ***
emo_bondDisagree          -1.718779   0.102462 -16.775   <2e-16 ***
emo_bondStrongly disagree -2.695106   0.281682  -9.568   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.636 on 8145 degrees of freedom
Multiple R-squared:  0.06508,	Adjusted R-squared:  0.06463 
F-statistic: 141.8 on 4 and 8145 DF,  p-value: < 2.2e-16


After adding the "emotional bond" variable to the model, the absolute value of the coefficient on $S_i$ falls from 0.083 to 0.007, and the new coefficient is not statistically significant at the 5% level (p-value = 0.843).
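
One mechanism that can produce this kind of shrinkage is a control that is correlated with sex and also related to the outcome. The sketch below illustrates that mechanism on simulated data. The data-generating process is entirely hypothetical (a binary `strong_bond` variable, with probabilities and effect sizes chosen for illustration) and is not a claim about the CCHS sample.

```r
set.seed(42)
n <- 5000

# Hypothetical data-generating process (assumption, not estimated from CCHS):
# sex has no direct effect on satisfaction, but women are more likely to
# report a strong emotional bond, and a strong bond raises satisfaction.
female       <- rbinom(n, 1, 0.5)
strong_bond  <- rbinom(n, 1, ifelse(female == 1, 0.8, 0.4))
satisfaction <- 7 + 2 * strong_bond + rnorm(n, sd = 1.5)

short_fit <- lm(satisfaction ~ female)                # without the control
long_fit  <- lm(satisfaction ~ female + strong_bond)  # with the control
aux_fit   <- lm(strong_bond ~ female)                 # control regressed on sex

b_short <- unname(coef(short_fit)["female"])
b_long  <- unname(coef(long_fit)["female"])
gamma   <- unname(coef(long_fit)["strong_bond"])
delta   <- unname(coef(aux_fit)["female"])

print(c(short = b_short, long = b_long))

# Omitted-variable identity: short = long + gamma * delta (holds in-sample)
print(all.equal(b_short, b_long + gamma * delta))
```

In this simulation the coefficient on `female` is large without the control and close to zero with it, and the difference is exactly `gamma * delta`. Checking whether the same decomposition accounts for the drop from 0.083 to 0.007 would require running the auxiliary regression on the actual data.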

## Conclusion
Once we control for "emotional bond", the coefficient on sex measures the association between sex and life satisfaction among people reporting the same level of emotional bond, which is closer to the effect of sex alone than the estimate without the control. The statistical test (p-value > 0.05) suggests that mean life satisfaction may not differ between the sexes once emotional bond is held fixed.

However, the coefficients on the levels of "emotional bond" are all statistically significant and become more negative as people report a weaker emotional bond with others, relative to the baseline level "strongly agree". This suggests a relationship between emotional bond and life satisfaction that is worth investigating further, and we will focus on it in the final project.
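
To show the size of the emotional bond gradient, the Stage 2 coefficients can be turned into fitted satisfaction levels for the baseline sex (male). This sketch hard-codes the estimates from the regression output; the object names are introduced here for illustration.

```r
# Estimates copied from the Stage 2 regression output
intercept <- 8.345151
bond_coef <- c(
  "Strongly agree"    = 0,          # baseline level, absorbed in the intercept
  "Agree"             = -0.637879,
  "Disagree"          = -1.718779,
  "Strongly disagree" = -2.695106
)

# Fitted satisfaction for a male respondent at each level of emotional bond
fitted_male <- intercept + bond_coef
print(round(fitted_male, 3))

# Drop between adjacent levels
print(round(diff(fitted_male), 3))
```

Fitted satisfaction falls at every step from "strongly agree" to "strongly disagree", for a total difference of about 2.7 points on the 0 to 10 scale.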