In [None]:
library(ggplot2)
library(reshape2)
library(haven)
theme_set(theme_bw())


The `datasets` directory contains the file `gsoep09.dta`, a Stata file built from the [German Socio-Economic Panel (GSOEP)](https://www.eui.eu/Research/Library/ResearchGuides/Economics/Statistics/DataPortal/GSOEP) survey that was run in 2009.



In [None]:
base_url <- "https://raw.github.com/neurospin/pystatsml/master/datasets/gsoep09.dta"
d <- read_dta(base_url)
head(d)


The variables we are interested in are described below:

- `persnr`: respondent ID
- `hhnr2009`: household ID
- `ybirth`: year of birth
- `sex`: sex of respondent
- `mar`: marital status
- `egp`: socio-economic class
- `yedu`: no. years of education
- `income`: annual income (€)
- `rel2head`: position of respondent relative to head of household
- `wor01` to `wor12`: 3-point Likert answers to socio-economic and political questions

Like the ESS survey, these data come with survey weights (`dweight` and `xweights`), but we will ignore the weights and proceed as if this were a cross-sectional sample.

## Data preparation

First, let us subset the data frame by selecting only the above variables:



In [None]:
vars <- c("persnr", "hhnr2009", "ybirth", "sex", "mar", "egp",
          "yedu", "income", "rel2head", "wor01", "wor02", "wor03",
          "wor04", "wor05", "wor06", "wor07", "wor08", "wor09",
          "wor10", "wor11", "wor12")
d <- subset(d, select = vars)
summary(d)


The next step is to re-encode the Stata variables in a more suitable R format:



In [None]:
d$persnr <- factor(d$persnr)
d$hhnr2009 <- factor(d$hhnr2009)
d$sex <- droplevels(as_factor(d$sex))
d$mar <- droplevels(as_factor(d$mar))
d$egp <- droplevels(as_factor(d$egp))
d$rel2head <- droplevels(as_factor(d$rel2head))
d$age <- 2009 - d$ybirth
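
To make the re-encoding step concrete, here is a minimal sketch of what `as_factor()` followed by `droplevels()` does to a labelled vector. The codes and labels below are made up for illustration and are not taken from `gsoep09.dta`.

```r
library(haven)

# Made-up codes and value labels (an assumption, not the real GSOEP coding)
x <- labelled(c(1, 2, 2, 1), labels = c(Refusal = -1, Male = 1, Female = 2))

f <- as_factor(x)    # value labels become factor levels
levels(f)            # labels without any observation are still listed
f <- droplevels(f)   # remove the levels that never occur
levels(f)
```

This is why each `as_factor()` call above is wrapped in `droplevels()`: it keeps later tables and plots free of empty categories.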


Let us now look at the above variables and recode some of the `mar` and `egp` categories. (For simplicity, we will discard all refusals from the present dataset.)



In [None]:
table(d$mar)
levels(d$mar)[3:5] <- "Single"
d$mar[d$mar == "Refusal"] <- NA
d$mar <- droplevels(d$mar)
table(d$mar)

In [None]:
table(d$egp)
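# Each assignment merges several levels into one, which shortens the
# level vector: the indices on the next line refer to the merged levels.
levels(d$egp)[1:2] <- "High"
levels(d$egp)[2:4] <- "Mid"
levels(d$egp)[3:4] <- "Low"
levels(d$egp)[4:6] <- "None"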
d$egp[d$egp == "Refusal"] <- NA
d$egp <- droplevels(d$egp)
table(d$egp)
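
The recoding above relies on the fact that assigning the same name to several levels merges them, so the positions of the remaining levels shift after every assignment. A small sketch on a made-up factor (the levels below are not from the survey) shows the mechanism:

```r
# Made-up factor used only to illustrate level merging
f <- factor(c("a", "b", "c", "d", "e", "a"))

levels(f)[1:2] <- "AB"   # levels are now AB, c, d, e
levels(f)[2:3] <- "CD"   # positions 2:3 now refer to c and d
f[f == "e"] <- NA        # treat one category as missing
f <- droplevels(f)       # remove the emptied level
table(f)
```

Printing `levels()` after each step is a cheap way to check that the indices still point at the intended categories.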


## Visual exploratory analysis

Say we are interested in the relationship between `income` (or its log), `sex` and `age`, as well as socio-economic status (`egp`), where we anticipate that average income will be lower for younger people, women, and people with lower SES. Again, to simplify, we will only consider individuals with a positive income and a known marital status and socio-economic class:



In [None]:
d <- subset(d, income > 0 & !is.na(mar) & !is.na(egp))
d$logincome <- log(d$income)
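
The `income > 0` condition does two things at once, as the following sketch on made-up values suggests: it removes zero incomes, whose log would be `-Inf`, and it removes missing incomes, because `subset()` drops rows where the condition evaluates to `NA`.

```r
# Made-up incomes: one zero and one missing value
toy <- data.frame(income = c(1200, 0, NA, 35000))

kept <- subset(toy, income > 0)      # rows where the test is FALSE or NA are dropped
kept$logincome <- log(kept$income)   # safe: all remaining incomes are positive
kept
```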


Exploratory displays: histograms, scatter plot, etc.



In [None]:
p <- ggplot(data = d, aes(x = age)) +
  geom_histogram(binwidth = 5) +
  labs(x = "Age", y = "Counts")
p

In [None]:
p <- ggplot(data = d, aes(x = age, y = logincome)) +
  geom_point() +
  geom_smooth(method = "loess") +
  labs(x = "Age", y = "Annual Income (log)")
p

In [None]:
p <- ggplot(data = d, aes(x = mar, y = logincome)) +
  geom_boxplot() +
  labs(x = "Marital Status", y = "Annual Income (log)") +
  coord_flip()
p


Let us summarize the distribution of (log) income across socio-economic classes. First, we need to compute the mean and standard deviation of `logincome` for each level of `egp`. This is easily done with `aggregate` (or `tapply`):



In [None]:
egp_stats <- aggregate(logincome ~ egp, data = d, mean)
egp_stats$sd <- aggregate(logincome ~ egp, data = d, sd)$logincome
names(egp_stats)[2] <- "mean"
egp_stats
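
For comparison, the same kind of summary can be sketched with `tapply`. The data frame `toy` and its columns `g` and `y` are made up for illustration; with the real data they would correspond to `egp` and `logincome`.

```r
# Made-up grouping factor and outcome
toy <- data.frame(g = factor(c("a", "a", "b", "b")), y = c(1, 3, 2, 6))

m <- tapply(toy$y, toy$g, mean)   # named vector of group means
s <- tapply(toy$y, toy$g, sd)     # named vector of group standard deviations
stats <- data.frame(g = names(m), mean = as.vector(m), sd = as.vector(s))
stats
```

`tapply` returns named vectors rather than a data frame, so `aggregate` is usually more convenient when the result is passed straight to `ggplot`.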

In [None]:
p <- ggplot(data = egp_stats, aes(x = egp, y = mean)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd), width = .2, col = "black") +
  labs(x = "Socio-Economic Status", y = "Average income (log)")
p


Same as above, but now considering both `egp` and `sex`:



In [None]:
egp_stats <- aggregate(logincome ~ egp + sex, data = d, mean)
egp_stats$sd <- aggregate(logincome ~ egp + sex, data = d, sd)$logincome
names(egp_stats)[3] <- "mean"
egp_stats

In [None]:
p <- ggplot(data = egp_stats, aes(x = egp, y = mean, fill = sex)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd), width = .2, col = "black", position = position_dodge(.9)) +
  scale_fill_manual("", values = c("cornflowerblue", "darkorange")) +
  labs(x = "Socio-Economic Status", y = "Average income (log)")
p


## Statistics

Objectives:

1. Univariate t-tests to assess whether average income differs across gender
2. Test the association between `age` and `income`
3. Test for differences of average income between socio-economic classes
4. Test for interaction between `egp` and `sex` in a two-way ANOVA using `logincome` as outcome
5. Test the association between marital status and socio-economic classes
6. Test the association between responses to the `wor*` questions

**Univariate t-tests**

Some visual checks of the t-test assumptions:



In [None]:
p <- ggplot(data = d, aes(x = logincome, color = sex)) +
  geom_line(stat = "density")
p

In [None]:
t.test(logincome ~ sex, data = d)
t.test(logincome ~ sex, data = d, var.equal = TRUE)
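
The two calls above differ only in how the variances are handled: by default `t.test` runs Welch's test, which does not assume equal variances, whereas `var.equal = TRUE` gives the classical Student test with a pooled variance. The sketch below uses simulated data (group sizes, means and standard deviations are arbitrary assumptions) to show that the estimated means are the same while the degrees of freedom differ.

```r
set.seed(1)
# Simulated groups with unequal sizes and unequal spread
toy <- data.frame(
  y = c(rnorm(30, mean = 10, sd = 1), rnorm(50, mean = 10.5, sd = 2)),
  g = factor(rep(c("A", "B"), times = c(30, 50)))
)

welch  <- t.test(y ~ g, data = toy)                    # default: var.equal = FALSE
pooled <- t.test(y ~ g, data = toy, var.equal = TRUE)  # pooled-variance Student test

# Degrees of freedom: n1 + n2 - 2 for the pooled test, fewer for Welch
c(welch = unname(welch$parameter), pooled = unname(pooled$parameter))
sapply(split(toy$y, toy$g), sd)                        # group standard deviations
```

When the two results agree closely, as one would hope for groups with similar spread, the choice matters little; when they diverge, the Welch version is the more cautious one to report.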


**Age and income**



In [None]:
p <- ggplot(data = d, aes(x = age, y = logincome)) +
  geom_point() +
  geom_smooth(method = "loess")
p

In [None]:
cor.test(~ age + logincome, data = d)


**One-way ANOVA**



In [None]:
p <- ggplot(data = d, aes(x = egp, y = logincome)) +
  geom_boxplot()
p

In [None]:
m <- aov(logincome ~ egp, data = d)
summary(m)


**Two-way ANOVA with interaction**



In [None]:
m1 <- aov(logincome ~ egp*sex, data = subset(d, egp != "None"))
summary(m1)
m2 <- update(m1, . ~ . - egp:sex)
summary(m2)
anova(m2, m1)
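
The last line compares the additive model with the model including the interaction. Because `aov` reports sequential sums of squares and the interaction is the last term entered, this comparison should yield the same F statistic as the interaction row of the full model's table. A sketch on simulated data (the factors `g` and `s` and the outcome `y` are made up):

```r
set.seed(2)
# Simulated balanced design: 3 classes x 2 sexes x 10 replicates
toy <- expand.grid(g = factor(c("High", "Mid", "Low")),
                   s = factor(c("F", "M")),
                   rep = 1:10)
toy$y <- rnorm(nrow(toy), mean = 10)

full    <- aov(y ~ g * s, data = toy)
reduced <- update(full, . ~ . - g:s)

f_table  <- summary(full)[[1]][["F value"]][3]   # third row is the g:s term
f_nested <- anova(reduced, full)$F[2]            # F test for the added term
all.equal(f_table, f_nested)
```

The explicit model comparison is still worth writing out, since it generalizes to cases where the term of interest is not the last one in the formula.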


**Association between marital status and socio-economic classes**



In [None]:
tab <- xtabs(~ mar + egp, data = d)
tab
summary(tab)
chsq <- chisq.test(tab)
chsq
chsq$expected
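
The expected counts under independence are the products of the row and column totals divided by the grand total. The following sketch checks this by hand on a made-up 2 x 2 table (the counts are arbitrary):

```r
# Made-up contingency table
tab <- as.table(matrix(c(30, 10, 20, 40), nrow = 2,
                       dimnames = list(mar = c("Married", "Single"),
                                       egp = c("High", "Low"))))

test <- chisq.test(tab)
by_hand <- outer(rowSums(tab), colSums(tab)) / sum(tab)   # row total * column total / n
all.equal(as.vector(test$expected), as.vector(by_hand))
```

Inspecting `chsq$expected` is mainly useful for spotting cells with small expected counts, where the chi-squared approximation becomes unreliable.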


**Association between responses to the `wor*` questions**



In [None]:
head(d[,grep("^wor", names(d))])
print(cor(d[,grep("^wor", names(d))], use = "pairwise"), digits = 2)
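
The `use = "pairwise"` argument (short for `"pairwise.complete.obs"`) matters when items have missing answers: each correlation is computed on the rows where both variables are observed. A sketch on made-up data contrasts it with the other common settings:

```r
# Made-up items with scattered missing values
toy <- data.frame(x = c(1, 2, 3, NA), y = c(2, 4, 5, 8), z = c(1, NA, 3, 5))

cor(toy)                         # default "everything": NA for columns with gaps
cor(toy, use = "pairwise")       # each pair uses the rows where both are observed
cor(toy, use = "complete.obs")   # only rows without any missing value
```

One caveat with pairwise deletion is that each coefficient may be based on a different subset of respondents.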


Quick plot (see also [how to reorder rows/cols](https://bit.ly/1vpTlg2) using, e.g., hierarchical clustering)



In [None]:
cor_mat <- melt(cor(d[,grep("^wor", names(d))], use = "pairwise"))
p <- ggplot(data = cor_mat, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() + labs(x = NULL, y = NULL)
p
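
As for reordering rows and columns, one possible approach (a sketch on simulated data, not necessarily the one used in the post linked above) is to cluster the items on the dissimilarity `1 - correlation` and reuse the resulting order:

```r
set.seed(3)
# Simulated items: q1 and q3 share a common component, q2 and q4 are noise
base <- rnorm(100)
toy <- data.frame(q1 = base + rnorm(100), q2 = rnorm(100),
                  q3 = base + rnorm(100), q4 = rnorm(100))

cm <- cor(toy)
ord <- hclust(as.dist(1 - cm))$order   # leaf order: similar items end up adjacent
cm_ordered <- cm[ord, ord]             # same matrix with rows and columns permuted
cm_ordered
```

With the melted data frame used for the plot, the same effect could be obtained by setting the levels of `Var1` and `Var2` to `rownames(cm)[ord]` before calling `ggplot`.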