- Calculates analytical and numerical bounds on conditional expectation functions from censored data, from Novosad, Rafkin & Asher (2021) "Mortality Change Among Less Educated Americans".
- Calculates mortality change in constant education percentile bins. This is non-trivial because education rank boundaries change over time: dropouts were the bottom 20% in 1992 and the bottom 9% in 2018.
- Replication code and data for Novosad, Rafkin & Asher (2021) "Mortality Change Among Less Educated Americans" (see replication information below).
This is a set of programs that bound a conditional expectation function E(y|x) when x is uniformly distributed but only observed in a discrete set of non-overlapping intervals. The bounds are derived in "Mortality Change Among Less Educated Americans", cited above.
The method is valid for any conditional expectation function with uniformly distributed x. Bounds with other known distributions are derived in the paper; these are not included in the present implementation, but it would be straightforward to extend the Matlab code to these use cases.
We include Stata and Matlab code. The Stata code is analytical, and thus fast and easy to run. The Matlab code uses a numerical optimization, which is slower and more involved on the coding side, but allows complex structural restrictions on the CEF. We present an example by constraining the curvature of the CEF.
Dataset | Description |
---|---|
mortality_by_ed_group.dta | The file contains unadjusted nation-wide annual mortality estimates, by cause, age group, race, sex and education level. |
mortality_by_percentile.xlsx | The file contains bounds on mortality change, by cause, age group, race, sex and education percentile group. The normalized change divides the level change in mortality from a given cause by baseline (1992-1994) all-cause mortality. The number thus represents the percentage change in all-cause mortality that is accounted for by changes in the listed cause. |
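As a rough illustration of the normalized change described in the table, the arithmetic can be sketched in Python as follows. The function name and the example rates are hypothetical, and whether the spreadsheet stores the value as a percentage or as a fraction is an assumption here, not something this README states.

```python
def normalized_change(cause_start, cause_end, baseline_all_cause):
    """Change in one cause's mortality as a percent of baseline all-cause mortality.

    All three arguments are rates in deaths per 100,000. The function name
    and the percent scaling are assumptions made for this sketch.
    """
    if baseline_all_cause <= 0:
        raise ValueError("baseline all-cause mortality must be positive")
    return 100.0 * (cause_end - cause_start) / baseline_all_cause


# Hypothetical numbers (not taken from the data): one cause rises from 50 to 80
# deaths per 100,000 while baseline all-cause mortality was 600 per 100,000.
example = normalized_change(cause_start=50.0, cause_end=80.0, baseline_all_cause=600.0)
print(example)  # 5.0, i.e. this cause accounts for a 5% rise in all-cause mortality
```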
We calculate bounds on expected mortality in arbitrary education rank bins, for example mortality among the least educated 10%. These bounds are calculated under the assumption that mortality is non-increasing in education rank.
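To give a feel for how the monotonicity assumption produces bounds, here is a minimal Python sketch for the simplest case only, where the target rank range `[s, t]` lies inside a single observed education bin `[a, b]`. It is not the authors' implementation and does not handle ranges that span several bins. The function name, the arguments `hi` and `lo` (the largest and smallest values the CEF may take inside the bin, for instance the means of the neighboring bins), and the example numbers are all assumptions made for illustration.

```python
def bound_within_bin(a, b, r, s, t, hi, lo):
    """Bounds on mean mortality over ranks [s, t] inside one bin [a, b].

    Assumes mortality is non-increasing in rank, ranks are uniformly
    distributed, and mortality inside the bin lies between `lo` and `hi`.
    `r` is the observed mean mortality in the bin.
    """
    if not (a <= s < t <= b):
        raise ValueError("need a <= s < t <= b")
    if not (lo <= r <= hi):
        raise ValueError("need lo <= r <= hi")
    mass = (b - a) * r  # integral of mortality over the whole bin

    # Upper bound: mortality is flat on [a, t] and sits at its floor `lo` on [t, b].
    upper = min(hi, (mass - (b - t) * lo) / (t - a))
    # Lower bound: mortality sits at its ceiling `hi` on [a, s] and is flat on [s, b].
    lower = max(lo, (mass - (s - a) * hi) / (b - s))
    return lower, upper


# Hypothetical example: dropouts are the bottom 20% with mean mortality 1000,
# the next education group has mean mortality 600, and mortality cannot
# exceed 100,000 deaths per 100,000. Target: the bottom 10%.
print(bound_within_bin(a=0, b=20, r=1000, s=0, t=10, hi=100000, lo=600))
# (1000.0, 1400.0)
```

The lower bound is reached when mortality is constant within the bin; the upper bound is reached when everyone above the target range has the lowest mortality that monotonicity allows.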
The Stata program `bound_mort()` in `mortality_programs.do` calculates bounds on mortality, assuming that the input parameters specify mortality rates in deaths per 100,000 people.
Sample usage to bound mortality in the bottom 20%:
bound_mort tmortrate if sex == 1, s(0) t(20) gen(varname) [xvar(varname) by(varname)]
`tmortrate` is the mortality rate in deaths per 100,000. Arbitrary use of `if` and `by` is permissible. `xvar` is the interval-censored rank variable. `s()` and `t()` are the desired rank range for the mortality estimate as described above. `gen` specifies a stub for the upper and lower bounds. For example, if you specify `gen(mu)`, `bound_mort()` will generate variables `mu_lb` and `mu_ub` with the bounds for each row of the data.
For instance, to calculate mortality among the least educated 10%, you would use:
bound_mort tmortrate, s(0) t(10) gen(mu) [xvar(varname) by(varname)]
The Matlab code in the repo can implement bounds with curvature restrictions; these are not yet documented. But if you want to figure it out yourself, `bound_mort_stats.m` might be a good place to start.
Syntax: `bound_mu(input_csv, cuts, vals, mu_s, mu_t, f2, spec)`

You need to specify either `input_csv`, or `cuts` and `vals`, but not both.

- `input_csv`: 2-column input file with (i) mean education rank [0-1] in bin; (ii) mean mortality in bin.
- `cuts`: rank bin boundaries. For example, if 15% are dropouts, `cuts(1) = 15`.
- `vals`: mean mortality in each rank bin.
- `mu_s`, `mu_t`: target bin boundaries. To calculate mortality in the bottom 10%, use `mu_s = 0, mu_t = 10`.
- `f2`: maximum allowed curvature across any pair of bins is mean mortality * `f2`.
- `spec`: (i) `nomon`: no monotonicity constraint; (ii) `mon`: monotonicity constraint; (iii) `mon-step`: monotonicity constraint, but no curvature constraint at bin boundaries.
The function returns a pair of floats with the bounds on mortality in the desired bin.
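The relationship between education-group shares and the rank bin boundaries passed as `cuts` can be sketched in Python as below. This is only an illustration: the helper name and the example shares are hypothetical, values are kept on a 0-100 scale, and whether `bound_mu` expects the final boundary (100) to be included in `cuts` is not specified here. Under uniformly distributed ranks, the mean rank in a bin is its midpoint, which is what the first column of `input_csv` describes.

```python
from itertools import accumulate


def cuts_and_midpoints(shares):
    """Cumulative rank boundaries and bin midpoints from bin shares in percent.

    `shares` lists the population share of each education group, ordered
    from least to most educated. Hypothetical helper for illustration only.
    """
    if abs(sum(shares) - 100) > 1e-9:
        raise ValueError("shares must sum to 100")
    cuts = list(accumulate(shares))   # upper boundary of each bin
    lower_edges = [0] + cuts[:-1]     # lower boundary of each bin
    midpoints = [(lo + hi) / 2 for lo, hi in zip(lower_edges, cuts)]
    return cuts, midpoints


# Hypothetical shares: 15% dropouts, then three further education groups.
cuts, midpoints = cuts_and_midpoints([15, 40, 25, 20])
print(cuts)       # [15, 55, 80, 100]
print(midpoints)  # [7.5, 35.0, 67.5, 90.0]
```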
To regenerate the tables and figures from the paper, take the following steps:
- Download and unzip the replication data package from this Google Drive Folder:
  - `nra-mortality.zip`: huge file, includes ACS and CPS components
  - `nra-mortality-small.zip`: replication mortality datasets; allows complete replication of the paper but not some appendices
- Clone this repo.
- Open the do file `make_nra_mortality.do`, and set the globals `out`, `mdata`, and `tmp`.
  - `$out` is the target folder for all outputs, such as tables and graphs.
  - `$mdata` is the folder where you unzipped and saved the replication data package.
  - Intermediate files will be placed in both `$tmp` and `$mdata/int`.
- Open `matlab/set_matlab_paths.m` and set `base_path` to the same path as `$mdata`.
- Open `a/graph_intuitive.py` and set `output_path` to `$mdata/out` in line 10.

NOTE: The code probably won't work if you have spaces in the pathnames. Blame StataCorp, not us.

- Run the do file `make_nra_mortality.do`. This will run through all the other do files to regenerate all of the results in `$out/`.
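Since pathnames with spaces are likely to break the build, a quick check before running anything can save time. The following Python sketch is not part of the repo; the helper name and the example paths are hypothetical, and the dictionary keys simply mirror the globals named in the steps above.

```python
def paths_with_spaces(paths):
    """Return the sorted names of settings whose path contains whitespace."""
    return sorted(
        name for name, path in paths.items()
        if any(ch.isspace() for ch in path)
    )


# Hypothetical values for the three Stata globals.
settings = {
    "out": "/home/user/nra/out",
    "mdata": "/home/user/nra data",  # contains a space, so it is flagged
    "tmp": "/tmp/nra",
}
bad = paths_with_spaces(settings)
print(bad)  # ['mdata']
```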
This paper uses restricted NCHS data, because it requires the education of the deceased, which was not reported in public NCHS files beginning around 2005. These restricted data cannot be included in the replication package. Therefore, the makefile comments out `make_mortality_data.do`, which constructs the NCHS + ACS + CPS national aggregates that form the basis of the analysis. However, `make_mortality_data.do` and its subcomponents are provided for anyone with access to the restricted-access data. The outputs of this code appear in `$mdata/mort` (and are provided). We have permission from NCHS to post national mortality aggregates constructed from the microdata.
Restricted mortality microdata is available via an application process from the NCHS. Public-use mortality microdata is very similar but excludes county identifiers in recent years, which affects some of our calculations. Other datasets, including ACS, CPS, and NHIS data, are publicly available.
The Matlab bound-generating code (`run_matlab_solver.do`) was run in parallel across 45 processes on a research server, each process taking about 6 hours. As such, we have configured the code to generate bounds only for one age/race group (age 25, white), which are saved in `$mdata/bounds/int/`. The analysis draws all of its bounds from `$mdata/bounds/`, which has the complete set of bounds. Note that the Matlab bound-generating code is based on a 100-parameter numerical minimization problem which can have local minima, and thus may produce marginally different results in different versions of Matlab or on servers with different memory or default parameters. As such, the bounds generated in `bounds/int` may differ slightly from those in `bounds/`. We do not expect any substantive differences that would affect any of the conclusions of the paper.
You might need to change `\mortalitypath` in `mortality.tex` to an absolute path. The relative path to `exhibits/` works for some of us and not for others.
This code was tested using Stata 16.0 and Matlab R2019a. Estimated run times on our server are:
- NCHS build and pre-Matlab build: 2 hours
- Matlab bound generation: 6 hours * 45 parallel processes
- `make_results.do`: 1 hour
The mapping of results output names to tables and figures is as follows:
Exhibit | Filename |
---|---|
Figure 1 | scatter-smooth-t-50-[12]-[12].pdf |
Figure 2 | intuit_[a-d].png |
Figure 3 | mort_cef.pdf |
Figure 4 | naive-5-women-50-t-[12].pdf |
Figure 5 | trend-smooth-mon-step-t-sex-50.pdf |
Figure 6 | changes-total-[12]-[12].pdf |
Figure 7 | changes-nod-[12]-[12].pdf |
Table 1 | table_mort_stats_1992.tex |
Table 1 | table_mort_stats_2016.tex |
Table 2 | age_adjusted_all_cause.tex |
Table A1 | icd_causes.tex |
Table A2 | all_cause_std.tex |
Figure A1 | std_mort_perc_total.pdf |
Figure A2 | naive-1-women-50-t-[12].pdf |
Figure A2 | naive-1-men-50-t-[12].pdf |
Figure C1 | polyspline__50_[MF]_2012-2014.pdf |
Table D1 | semimon_bounds.tex |
Figure D1 | f1992_semimon_[0520100].pdf |
Figure D2 | causes-1992-2-1.pdf |
Figure D2 | causes-racesex-2-1.pdf |
Figure D2 | causes-mon-step-2-1.pdf |
Figure D2 | causes-nof-2-1.pdf |
Figure D3 | mean_within_rank_50_[12]_comb.pdf |
Figure D4 | total_pops.pdf |
Figure D5 | hisp_shift_[12].pdf |
Figure D6 | cps_pred_all_dropout.pdf |
Figure D7 | cps_pred_all_hs.pdf |
Figure D8 | ests_yline.pdf |
Figure D9 | lowess_sex_both.pdf |
This code relies on Unix (Linux or Mac) Stata/Matlab, and on Python 3.2.