---
title: "Inference for Rank-Rank Regressions"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Inference for Rank-Rank Regressions}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
The following example illustrates how the `csranks` package can be used for estimation and inference in rank-rank regressions. These are commonly used for studying intergenerational mobility.
<!-- ## Rank-Rank regression
Denote by $X$ the explanatory variable, $Y$ the outcome variable, and by
$F_X, F_Y$ CDFs of $X$ and $Y$. We postulate a linear model between
$R_X=F_X(X)$ and $R_Y=F_Y(Y)$:
\[R_Y = c+\rho R_X\]
A naive course of action would be to estimate $R_X$ and $R_Y$ using the empirical
cumulative distribution function and plug the results into `lm`. However, this approach
ignores the uncertainty originating from rank estimation. An individual
with income larger than 90% of sample could have income larger than 92 or 87% of the whole
population after all. This ignorance results in inconsistent standard errors, confidence
intervals and p-values of regression coefficients $c$ and $\rho$.
In the `csranks` package, an asymptotically correct method of calculation of those standard errors
is implemented. The key function in the workflow is `lmranks`. -->
## Example: Intergenerational Mobility
In this example, we want to study intergenerational income mobility by estimating and performing inference on the rank correlation between parents' and their children's incomes. The `csranks` package contains an artificial dataset with data on children's and parents' household incomes, as well as the child's gender and race (`black`, `hisp` or `neither`).
First, load the package `csranks`. Second, load the data and take a quick look at it:
```{r setup}
library(csranks)
data(parent_child_income)
head(parent_child_income)
```
### Rank-rank regression
In economics, it is common to estimate measures of mobility by running rank-rank regressions. For instance, the rank correlation between parents' and children's incomes can be estimated by running a regression of a child's income rank on the parent's income rank:
```{r lmranks}
lmr_model <- lmranks(r(c_faminc) ~ r(p_faminc), data=parent_child_income)
summary(lmr_model)
```
This regression specification takes each child's income (`c_faminc`), computes its rank among all children's incomes, then takes each parent's income (`p_faminc`) and computes its rank among all parents' incomes. Then the child's rank is regressed on the parent's rank using OLS. The `lmranks` function computes standard errors, t-values and p-values according to the asymptotic theory developed in Chetverikov and Wilhelm (2023).
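To make the link between the rank-rank slope and the rank correlation concrete, here is a minimal base R sketch on simulated data (not the package dataset). It assumes ranks are empirical-CDF values, i.e. the fraction of observations less than or equal to a given one; the helper `ecdf_rank` and all variable names are hypothetical and only serve this illustration. The sketch concerns the point estimate only and says nothing about valid standard errors.

```{r sketch_spearman}
set.seed(1)
n <- 200

# Simulated parent and child incomes (continuous, so there are no ties)
p_income <- rlnorm(n)
c_income <- exp(0.4 * log(p_income) + rnorm(n))

# Hypothetical helper: rank as the share of observations <= the value
ecdf_rank <- function(x) rank(x, ties.method = "max") / length(x)

r_p <- ecdf_rank(p_income)
r_c <- ecdf_rank(c_income)

# OLS slope of child rank on parent rank
slope <- unname(coef(lm(r_c ~ r_p))[2])

# Spearman's rank correlation of the raw incomes
spearman <- cor(p_income, c_income, method = "spearman")

c(slope = slope, spearman = spearman)
```

Without ties, both rank vectors have the same variance, so the OLS slope coincides with Spearman's rank correlation.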
A naive approach, which **does not** lead to valid inference, would compute the children's and parents' ranks first and then run a standard OLS regression afterwards:
```{r lm}
c_faminc_rank <- frank(parent_child_income$c_faminc, omega=1, increasing=TRUE)
p_faminc_rank <- frank(parent_child_income$p_faminc, omega=1, increasing=TRUE)
lm_model <- lm(c_faminc_rank ~ p_faminc_rank)
summary(lm_model)
```
Notice that the point estimates of the intercept and slope are the same as those of the `lmranks` function. However, the standard errors, t-values and p-values differ. This is because the usual OLS formulas for standard errors do not take into account the estimation uncertainty in the ranks.
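One way to see this for yourself is a small Monte Carlo experiment. The sketch below uses base R and simulated data only, with a hypothetical data-generating process and helper names (`ecdf_rank`, `one_draw`). It compares the average standard error reported by `lm` with the standard deviation of the slope estimate across repeated samples. It is an illustration under these assumptions, not the method implemented in `lmranks`.

```{r sketch_naive_se}
set.seed(42)

# Hypothetical helper: rank as the share of observations <= the value
ecdf_rank <- function(x) rank(x, ties.method = "max") / length(x)

# Draw one sample, run the naive rank-rank OLS, and return the slope
# together with the standard error that lm reports for it
one_draw <- function(n = 200, beta = 0.5) {
  x <- rnorm(n)
  y <- beta * x + rnorm(n)
  fit <- lm(ecdf_rank(y) ~ ecdf_rank(x))
  c(slope = unname(coef(fit)[2]),
    naive_se = summary(fit)$coefficients[2, "Std. Error"])
}

draws <- replicate(500, one_draw())

# Sampling variability of the slope across replications
sim_sd <- sd(draws["slope", ])
# What the usual OLS formula reports on average
mean_naive_se <- mean(draws["naive_se", ])

c(simulated_sd = sim_sd, mean_naive_se = mean_naive_se)
```

If the two numbers differ noticeably, the usual OLS formula misstates the sampling variability of the slope in this design. How large the gap is depends on the data-generating process.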
One can also run the rank-rank regression with additional covariates, e.g.:
```{r lmrankscov}
lmr_model_cov <- lmranks(r(c_faminc) ~ r(p_faminc) + gender + race, data=parent_child_income)
summary(lmr_model_cov)
```
### Grouped rank-rank regression
In some economic applications, one wants to run rank-rank regressions separately in subgroups of the population while computing the ranks in the whole population. For instance, we might want to estimate rank-rank regression slopes as measures of intergenerational mobility separately for males and females, with the ranking of children's incomes formed among all children (rather than forming separate rankings for males and females).
Such regressions can easily be run using the `lmranks` function and interaction notation:
```{r grouped_lmranks_simple}
grouped_lmr_model_simple <- lmranks(r(c_faminc) ~ r(p_faminc):gender,
                                    data=parent_child_income)
summary(grouped_lmr_model_simple)
```
In this example, we have run a separate OLS regression of children's ranks on parents' ranks among the female and male children. However, incomes of children are ranked among all children and incomes of parents are ranked among all parents. The standard errors, t-values and p-values are computed according to the asymptotic theory developed in Chetverikov and Wilhelm (2023), where it is shown that the asymptotic distribution of the estimators now needs to account not only for the fact that ranks are estimated, but also for the fact that estimators are correlated across gender subgroups because they use the same estimated ranking.
A naive application of the `lm` function would produce the same point estimates, but **not** the correct standard errors:
```{r grouped_lm_simple}
grouped_lm_model_simple <- lm(c_faminc_rank ~ p_faminc_rank:gender + gender - 1, #group-wise intercept
                              data=parent_child_income)
summary(grouped_lm_model_simple)
```
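To clarify what the interaction formula above does to the point estimates, the following base R sketch uses simulated data with hypothetical names (`g`, `x`, `y`, `ecdf_rank`). It checks that one regression with group-wise intercepts and slopes gives the same coefficients as separate regressions within each group, as long as the ranks are computed in the pooled sample. It says nothing about standard errors.

```{r sketch_grouped_equivalence}
set.seed(7)
n <- 300

# Simulated group labels and incomes with group-specific slopes
g <- factor(sample(c("female", "male"), n, replace = TRUE))
x <- rnorm(n)
y <- ifelse(g == "female", 0.3, 0.6) * x + rnorm(n)

# Hypothetical helper: rank as the share of observations <= the value
ecdf_rank <- function(v) rank(v, ties.method = "max") / length(v)

# Ranks are computed once, in the pooled sample
r_x <- ecdf_rank(x)
r_y <- ecdf_rank(y)

# One regression with group-wise intercepts and slopes ...
pooled_fit <- lm(r_y ~ r_x:g + g - 1)

# ... versus separate regressions within each group
female_fit <- lm(r_y ~ r_x, subset = g == "female")
male_fit   <- lm(r_y ~ r_x, subset = g == "male")

coef(pooled_fit)
rbind(female = coef(female_fit), male = coef(male_fit))
```

The coefficients agree because the interaction specification gives every group its own intercept and slope, so the least-squares problem separates by group.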
One can also create more granular subgroups by interacting several characteristics such as gender and race:
```{r grouped_lmranksgran}
parent_child_income$subgroup <- interaction(parent_child_income$gender, parent_child_income$race)
grouped_lmr_model <- lmranks(r(c_faminc) ~ r(p_faminc):subgroup,
                             data=parent_child_income)
summary(grouped_lmr_model)
```
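The order of the subgroup levels created by `interaction()` determines the order in which the group-wise coefficients are reported, so it is worth checking before labelling results by hand. The toy example below uses made-up factors and a hypothetical level order for race; in the actual data, inspect `levels(parent_child_income$subgroup)` instead.

```{r sketch_interaction_levels}
# Toy factors; the level order of toy_race is an assumption for illustration
toy_gender <- factor(c("female", "male", "female", "male", "female", "male"))
toy_race <- factor(c("hisp", "hisp", "black", "black", "neither", "neither"),
                   levels = c("hisp", "black", "neither"))

# By default, the levels of the first factor vary fastest
toy_subgroup <- interaction(toy_gender, toy_race)
levels(toy_subgroup)
```

With the default `lex.order = FALSE`, the levels cycle through the first factor within each level of the second, here both genders for `hisp`, then for `black`, then for `neither`.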
Let's compare the confidence intervals for the regression coefficients produced
by `lmranks` and by the naive approach.
```{r grouped_lm}
grouped_lm_model <- lm(c_faminc_rank ~ p_faminc_rank:subgroup + subgroup - 1, #group-wise intercept
                       data=parent_child_income)
summary(grouped_lm_model)
```
```{r plot_CIs, message = FALSE, out.width = "90%", fig.width=6, fig.height=4}
library(ggplot2)
theme_set(theme_minimal())
ci_data <- data.frame(estimate=coef(lmr_model),
                      parameter=c("Intercept", "slope"),
                      group="Whole sample",
                      method="csranks",
                      lower=confint(lmr_model)[,1],
                      upper=confint(lmr_model)[,2])
ci_data <- rbind(ci_data, data.frame(
  estimate = coef(grouped_lmr_model),
  parameter = rep(c("Intercept", "slope"), each=6),
  group = rep(c("Hispanic female", "Hispanic male", "Black female", "Black male",
                "Other female", "Other male"), times=2),
  method="csranks",
  lower=confint(grouped_lmr_model)[,1],
  upper=confint(grouped_lmr_model)[,2]
))
ci_data <- rbind(ci_data, data.frame(
  estimate = coef(lm_model),
  parameter = c("Intercept", "slope"),
  group = "Whole sample",
  method="naive",
  lower=confint(lm_model)[,1],
  upper=confint(lm_model)[,2]
))
ci_data <- rbind(ci_data, data.frame(
  estimate = coef(grouped_lm_model),
  parameter = rep(c("Intercept", "slope"), each=6),
  group = rep(c("Hispanic female", "Hispanic male", "Black female", "Black male",
                "Other female", "Other male"), times=2),
  method="naive",
  lower=confint(grouped_lm_model)[,1],
  upper=confint(grouped_lm_model)[,2]
))
ggplot(ci_data, aes(y=estimate, x=group, ymin=lower, ymax=upper,col=method, fill=method)) +
  geom_point(position=position_dodge2(width = 0.9)) +
  geom_errorbar(position=position_dodge2(width = 0.9)) +
  geom_hline(aes(yintercept=estimate), data=subset(ci_data, group=="Whole sample"),
             linetype="dashed",
             col="gray") +
  coord_flip() +
  labs(title="95% confidence intervals of intercept and slope\nin rank-rank regression")+
  facet_wrap(~parameter)
```
The coefficients estimated on the whole sample have narrow confidence intervals, which is
expected because they use all observations rather than a single subgroup. In this example, there are some differences between the correct (`csranks`) confidence intervals and the incorrect (`naive`) confidence intervals, but they are rather small. The paper by Chetverikov and Wilhelm (2023), however, provides empirical examples in which the differences can be quite large.
## Reference & further reading
[Chetverikov and Wilhelm (2023), "Inference for Rank-Rank Regressions". arXiv preprint arXiv:2310.15512](https://arxiv.org/pdf/2310.15512.pdf)
Check out the documentation of individual functions at the package's [website](https://danielwilhelm.github.io/R-CS-ranks/) and further examples in the package's [GitHub repository](https://github.com/danielwilhelm/R-CS-ranks/tree/master/examples).