# Run different specifications on the regression model

In [1]:
library(tidyverse)
library(haven)
library(stargazer)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.8     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

“package ‘ggplot2’ was built under R version 4.1.3”
“package ‘tidyr’ was built under R version 4.1.2”
“package ‘readr’ was built under R version 4.1.2”
“package ‘dplyr’ was built under R version 4.1.3”
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

“package ‘haven’ was built under R version 4.1.3”
“package ‘stargazer’ was built under R version 4.1.2”

Please cite as: 


 Hlavac, Marek (2022). stargaz

## Data Cleaning

In [2]:
dat <- read_dta("data/CCHS_Annual_2017_2018_curated_trimmed_25%.dta") |> 
    select(GEN_010, SPS_040, dhhgage, DHH_SEX, dhhdglvg) |>
    na.omit()

In [3]:
dat_cleaned <- dat |>
    rename(satisfaction = GEN_010, emo_bond = SPS_040, age = dhhgage,
           sex = DHH_SEX, family = dhhdglvg) |>
    filter(satisfaction < 11 & emo_bond <= 4 & age <= 16 & sex <= 2 & family <= 8) |> # filter out invalid values
    mutate(sex = as_factor(sex),
           emo_bond = as_factor(emo_bond),
           family = as_factor(family),
           age = as_factor(age))

In [9]:
dat_cleaned$age <- case_when(dat_cleaned$age == "Age between 12 and 14" ~ 13,
                            dat_cleaned$age == "Age between 15 and 17" ~ 16,
                            dat_cleaned$age == "Age between 18 and 19" ~ 18.5,
                            dat_cleaned$age == "Age between 20 and 24" ~ 22,
                            dat_cleaned$age == "Age between 25 and 29" ~ 27,
                            dat_cleaned$age == "Age between 30 and 34" ~ 32,
                            dat_cleaned$age == "Age between 35 and 39" ~ 37,
                            dat_cleaned$age == "Age between 40 and 44" ~ 42,
                            dat_cleaned$age == "Age between 45 and 49" ~ 47,
                            dat_cleaned$age == "Age between 50 and 54" ~ 52,
                            dat_cleaned$age == "Age between 55 and 59" ~ 57,
                            dat_cleaned$age == "Age between 60 and 64" ~ 62,
                            dat_cleaned$age == "Age between 65 and 69" ~ 67,
                            dat_cleaned$age == "Age between 70 and 74" ~ 72,
                            dat_cleaned$age == "Age between 75 and 79" ~ 77,
                            dat_cleaned$age == "Age 80 and older" ~ 80
)

We recode the family arrangement as a living condition with four levels: living alone (alone), unattached individual living with others (living with others), living with family (living with family), and other (Other). We set "Other" as the base group since it does not contain economic information.

In [27]:
dat_cleaned$living <- case_when(dat_cleaned$family == 'Unattached individual living alone.' ~ 'alone',
                                dat_cleaned$family == 'Unattached individual living with others.' ~ 'living with others',
                                dat_cleaned$family == 'Individual living with spouse/partner.' ~ 'living with family',
                                dat_cleaned$family == 'Parent living with spouse/partner and child(ren).' ~ 'living with family',
                                dat_cleaned$family == 'Single parent living with children.' ~ 'living with family',
                                dat_cleaned$family == 'Child living with a single parent with or without siblings.' ~ 'living with family',
                                dat_cleaned$family == 'Child living with two parents with or without siblings' ~ 'living with family',
                                dat_cleaned$family == 'Other' ~ 'Other')

dat_cleaned$living <- factor(dat_cleaned$living, levels = c("Other", "alone", "living with others", "living with family"))
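
Since `case_when()` returns `NA` for any label that matches none of its conditions, it may be worth tabulating the recoded variable before modelling. The sketch below is only an illustration under stated assumptions: `family_toy` is a small made-up vector (its last entry deliberately lacks the trailing period), and a base-R named lookup vector stands in for `case_when()`, so this is not the recoding used in this notebook.

```r
# Toy labels (hypothetical sample; the last one is missing its trailing period)
family_toy <- c("Unattached individual living alone.",
                "Single parent living with children.",
                "Other",
                "Unattached individual living alone")

# Named lookup vector: original label -> living condition
lookup <- c("Unattached individual living alone." = "alone",
            "Single parent living with children." = "living with family",
            "Other" = "Other")

# Labels not found in the lookup become NA, as with unmatched case_when() conditions
living_toy <- factor(unname(lookup[family_toy]),
                     levels = c("Other", "alone", "living with others", "living with family"))

# useNA = "ifany" makes silently unmatched labels visible
print(table(living_toy, useNA = "ifany"))

# The first factor level is the reference (base) group in lm()
print(levels(living_toy)[1])
```

On the notebook's own data, the analogous check would be `table(dat_cleaned$living, useNA = "ifany")`.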

## Model
For the statistical analysis, we estimate the following linear regression model:

$$
Y_i = \beta_0 + \sum_{b=1}^3 \beta_{1, b} E^b_{i} + \beta_2 A_i + \sum_{b=1}^3 \sigma_{b} (E^b_{i} \times A_i) + \alpha X_i + \epsilon_i
$$

Let $i$ index observations.
- $Y_i$ is satisfaction with life in general.
- $E_i$ is the degree of agreement with having a strong emotional bond with at least one person. In the sum $\sum_{b=1}^3 \beta_{1, b} E^b_{i}$, $E^b_{i}$ is an indicator variable equal to one if $E_i$ falls in the given level $b$ (e.g., “agree”) and zero otherwise.
- $A_i$ is the age.
- $E^b_{i} \times A_i$ is the interaction between the emotional bond of a given category $b$ and age. We include this term because we hypothesize that the effect of emotional bond on life satisfaction may depend on age, as suggested by a previous study (Vandeleur et al., 2009).
- $X_i$ represents other control variables. As mentioned above, we consider sex and living/family arrangement, which we add in separate specifications.
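
To see how the indicator and interaction terms in the equation map onto regression columns, the following sketch builds a design matrix with base R's `model.matrix()`. The data frame `toy` and its values are made up for illustration, and treating "Strongly agree" as the omitted base level is an assumption (it is the one level without a coefficient in the regression tables).

```r
# Toy data (hypothetical values) with a four-level emotional-bond factor
toy <- data.frame(
  emo_bond = factor(c("Strongly agree", "Agree", "Disagree", "Strongly disagree"),
                    levels = c("Strongly agree", "Agree", "Disagree", "Strongly disagree")),
  age = c(22, 37, 52, 67)
)

# Design matrix for the main effects plus the emo_bond-by-age interaction
X <- model.matrix(~ emo_bond + age + emo_bond:age, data = toy)

# Columns: intercept, three indicators E^b, age, three interactions E^b x A
print(colnames(X))
print(X)
```

Each interaction column equals age for observations in that category and zero otherwise, so its coefficient shifts the age slope for that category relative to the base level.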

### Specifications

I choose four specifications.

Following our proposed model, my first specification includes only the first three terms, which are of primary interest: emotional bond, age, and their interaction.

Looking at the regression coefficients of the first specification, I found that the coefficients on the interaction terms are not statistically significant. I therefore ran a second specification that is identical to the first but leaves out the interaction terms.

Finally, I took control variables into account. In the third specification, I included "sex" in the model. In the fourth one, I included "family arrangement."

#### Specification 1 - Without controls, with interaction

In [12]:
reg1 <- lm(satisfaction ~ emo_bond + age + emo_bond:age, data = dat_cleaned)

#### Specification 2 - Without controls, without interaction

In [13]:
reg2 <- lm(satisfaction ~ emo_bond + age, data = dat_cleaned)
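
Individual t-tests on the interaction coefficients do not by themselves say whether the interaction terms matter jointly. One common way to check this is an F-test comparing the nested models with `anova()`. The sketch below uses simulated data (`sim`, with made-up coefficients and only three bond levels), not the CCHS sample, so its numbers say nothing about the results here.

```r
set.seed(1)  # reproducible simulated data (hypothetical, not the CCHS sample)
n <- 400
sim <- data.frame(
  emo_bond = factor(sample(c("Strongly agree", "Agree", "Disagree"), n, replace = TRUE),
                    levels = c("Strongly agree", "Agree", "Disagree")),
  age = sample(c(22, 37, 52, 67), n, replace = TRUE)
)

# Simulated outcome with no true interaction between emo_bond and age
sim$satisfaction <- 8 - 0.6 * (sim$emo_bond == "Agree") -
  1.5 * (sim$emo_bond == "Disagree") - 0.004 * sim$age + rnorm(n)

fit_int   <- lm(satisfaction ~ emo_bond + age + emo_bond:age, data = sim)
fit_noint <- lm(satisfaction ~ emo_bond + age, data = sim)

# F-test of H0: all interaction coefficients are jointly zero
cmp <- anova(fit_noint, fit_int)
print(cmp)
```

With the models in this notebook, the analogous call would be `anova(reg2, reg1)`.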

#### Specification 3 - Controlling "sex", without interaction

In [14]:
reg3 <- lm(satisfaction ~ emo_bond + age + sex, data = dat_cleaned)

#### Specification 4 - Controlling "family arrangement", without interaction

In [28]:
reg4 <- lm(satisfaction ~ emo_bond + age + living, data = dat_cleaned)

## Output Summary - Regression Table

In [17]:
stargazer(reg1, reg2, reg3, type = "text")


                                                          Dependent variable:                             
                              ----------------------------------------------------------------------------
                                                              satisfaction                                
                                        (1)                       (2)                       (3)           
----------------------------------------------------------------------------------------------------------
emo_bondAgree                        -0.503***                 -0.620***                 -0.620***        
                                      (0.102)                   (0.038)                   (0.038)         
                                                                                                          
emo_bondDisagree                     -1.373***                 -1.702***                 -1.702***        
                                    

In [29]:
stargazer(reg4, type = "text") # not enough space in the table above, so output a separate table


                              Dependent variable:    
                          ---------------------------
                                 satisfaction        
-----------------------------------------------------
emo_bondAgree                      -0.579***         
                                    (0.038)          
                                                     
emo_bondDisagree                   -1.517***         
                                    (0.102)          
                                                     
emo_bondStrongly disagree          -2.455***         
                                    (0.279)          
                                                     
age                                 -0.001           
                                    (0.001)          
                                                     
livingalone                        -0.206**          
                                    (0.092)          
                           

### Analysis

The first specification follows our initial model but excludes the control variables. The coefficients are statistically significant except those on the interaction terms.

Thus, in the second specification, we dropped the interaction terms. Here, as the degree of disagreement with the statement "having emotional bonds with at least one person" increases, the coefficient decreases, and each estimate is statistically significant. This holds true across specifications, indicating that the perception of emotional bonds with others is very likely positively correlated with satisfaction with life in general. What's more, the coefficient values (from specification 1: $\hat{\beta}_{1, 1} = -0.50, \hat{\beta}_{1, 2} = -1.37, \hat{\beta}_{1, 3} = -2.89$) are relatively large on a 0-10 scale, suggesting that emotional bonds might be an essential element of life satisfaction. This answers part of our research question.

Also, the coefficient on age ($\hat{\beta}_2 = -0.004$) is statistically significant. Though its absolute value is small, the very small standard error ($se = 0.001$) indicates that the coefficient is estimated precisely. The coefficient implies that life satisfaction decreases slightly with each one-unit increase in age.
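
To put the size of the age coefficient in perspective, here is a back-of-the-envelope calculation using two of the midpoints assigned during data cleaning (22 for ages 20 to 24 and 72 for ages 70 to 74), holding emotional bond fixed and taking the point estimate at face value:

$$
\hat{\beta}_2 \times (72 - 22) = -0.004 \times 50 = -0.2
$$

That is about a fifth of a point on the 0-10 scale across fifty years of age, which is small relative to the emotional-bond coefficients.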

In the third specification, we add sex as a control variable to the regression equation. Its coefficient (-0.001) is very small and not statistically significant. Adding it also leaves the estimates of our main coefficients unchanged, suggesting that sex does not confound the estimated associations of emotional bonds and age with life satisfaction. Whether these associations differ by sex would require interaction terms with sex, which we do not estimate here.

In the fourth specification, we add the living condition as a control variable. Compared to specification two, the absolute values of the emotional-bond coefficients decrease slightly, and the standard errors remain almost constant. However, the coefficient on age changes from -0.004 to -0.001 and becomes insignificant. This may suggest that living condition accounts for most of the association between age and life satisfaction. However, since we derive the living-condition variable from the original variable "family arrangement," we should be careful about its features. The levels of the original variable contain information about the age group (e.g., the "parent" and "child" categories), which may confound the comparison.