aflaxman/space-time-predictive-validity

This project is an effort to validate and compare space-time smoothing methods.

plan for now:

python script to generate noisy simulation data from 45q15 male estimates.

output format is a csv file with columns labeled::

    iso3, country, region, super-region, year, sex, age, y, se, x_1, x_2, x_3, ...

file should have rows for each IHME standard country, for each year 1970 to 2010 (with sex=male and age='' for now)

big question:  what is y? (and what is se?)

sometimes y = ''   [1]
otherwise y = rnormal(logit(45q15), se**-2.)   [2]
(rnormal here is parameterized by precision, so this is a normal draw with mean logit(45q15) and standard deviation se)


[1] decision for making y == '' could be (a sketch of two of these patterns follows the list)::

    for 20% of rows, chosen at random
    for 20% of countries, chosen at random
    forecasting and backcasting (drop earliest/latest 20% of years)
    compositional bias, e.g. drop with probability proportional to 45q15
    missingness patterns taken from the child, maternal, and IHD estimates
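
A minimal sketch of two of these patterns (the function names are made up
for illustration, and the gold csv is assumed loaded into a pandas
DataFrame df with a complete, numeric y column)::

    import numpy as np

    def drop_random_rows(df, frac=.2, seed=12345):
        """blank out y for a random fraction of rows (NaN writes as '' in csv)"""
        rows = df.sample(frac=frac, random_state=seed).index
        df.loc[rows, 'y'] = np.nan
        return df

    def drop_compositional_bias(df, frac=.2, seed=12345):
        """blank out y with probability proportional to the gold 45q15"""
        rng = np.random.RandomState(seed)
        p = (frac * df['y'] / df['y'].mean()).clip(0., 1.)
        df.loc[rng.rand(len(df)) < p, 'y'] = np.nan
        return df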


[2] se could be (a sketch of the heavy-tail option follows the list)::

    constant
    f(x_i) for some x_i
    f(45q15) "heteroskedastic"
    two constants, one large and one small, selected at random with small and large probability respectively ("heavy-tail error")
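
A sketch of the heavy-tail option, combined with the logit-normal draw
from [2] (function name and parameter values are illustrative)::

    import numpy as np

    def logit(p):
        return np.log(p / (1. - p))

    def add_heavy_tail_noise(df, se_small=.1, se_large=1., p_large=.05, seed=12345):
        """y ~ Normal(logit(45q15), se), where se is small with large
        probability and large with small probability"""
        rng = np.random.RandomState(seed)
        se = np.where(rng.rand(len(df)) < p_large, se_large, se_small)
        df['se'] = se
        df['y'] = rng.normal(logit(df['y']), se)
        return df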



1.  need a script (in python or otherwise) that messes up data in this format, starting from a gold-standard file with complete data in column 'y'

    $ python simulate_noisy_data.py --missing=compositionalbias --error=heavytail < gold_45q15m.csv > noisy_45q15m.csv
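
One possible skeleton for this script (the flag names follow the example
command above; the pattern functions are the sketches from [1] and [2])::

    import argparse, sys
    import pandas as pd

    MISSING = {'random': drop_random_rows,
               'compositionalbias': drop_compositional_bias}
    ERROR = {'heavytail': add_heavy_tail_noise}

    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument('--missing', default='random')
        parser.add_argument('--error', default='heavytail')
        args = parser.parse_args()

        df = pd.read_csv(sys.stdin)
        df = MISSING[args.missing](df)  # blank out y, based on the gold values
        df = ERROR[args.error](df)      # noise what is left; NaN rows stay blank
        df.to_csv(sys.stdout, index=False)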

2.  have competing smoothing approaches that take the result of this and create estimates in a compatible format

    $ python nest_gp.py < noisy_45q15m.csv > estimate_45q15m.csv
    $ R --script='my_loess.r' < noisy_45q15m.csv > estimate_45q15m.csv
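
Whatever the method, the contract is csv in, csv out; for example, a
deliberately naive baseline smoother (hypothetical, useful mainly to
sanity-check the pipeline) could be::

    import sys
    import pandas as pd

    df = pd.read_csv(sys.stdin)   # '' in y is read as NaN
    # fill each country's missing y with that country's mean observed y
    df['y'] = df.groupby('iso3')['y'].transform(lambda y: y.fillna(y.mean()))
    df.to_csv(sys.stdout, index=False)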

3.  need a script that compares the estimates to the gold standard

    $ python compare_estimate_to_gold.py estimate_45q15m.csv gold_45q15m.csv
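
A sketch of what the comparison might compute, assuming the estimates are
reported on the same scale as the gold standard (RMSE and coverage of the
reported se are two natural metrics)::

    import sys
    import numpy as np
    import pandas as pd

    est = pd.read_csv(sys.argv[1])
    gold = pd.read_csv(sys.argv[2])
    # sex and age are fixed in this file, so (iso3, year) identifies a row
    df = est.merge(gold, on=['iso3', 'year'], suffixes=('_est', '_gold'))

    err = df['y_est'] - df['y_gold']
    print('rmse:', np.sqrt((err ** 2).mean()))
    print('95% coverage:', (np.abs(err) < 1.96 * df['se_est']).mean())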

4.  need a master script to run simulate_noisy_data, a given model, and compare_estimate_to_gold multiple times, with different patterns of missingness.

    $ python master.py gold_45q15m.csv
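
A sketch of the master loop (the particular patterns and models swept over
here are placeholders)::

    import subprocess, sys

    gold = sys.argv[1]
    models = ['python nest_gp.py', "R --script='my_loess.r'"]

    for missing in ['random', 'compositionalbias']:
        for error in ['constant', 'heavytail']:
            for model in models:
                subprocess.check_call(
                    'python simulate_noisy_data.py --missing=%s --error=%s'
                    ' < %s > noisy_45q15m.csv' % (missing, error, gold),
                    shell=True)
                subprocess.check_call(
                    '%s < noisy_45q15m.csv > estimate_45q15m.csv' % model,
                    shell=True)
                subprocess.check_call(
                    'python compare_estimate_to_gold.py estimate_45q15m.csv %s'
                    % gold, shell=True)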



Additional Notes:

Eureqa: Possibly a good model for the GPR machine?

Eureqa: Takes data, looks for a functional relationship between the
columns.  You specify which variable you would like expressed as a
function of which others, you can tell it a lot about what you want the
function to look like, and it finds functions within your specifications
that are minimal w.r.t. the error metric you choose.
 
Like DisMod, we would like to take fairly complicated algorithms and
put them in a framework where they are easy to use.  SPK is another
good example.

Testing platform for the Age-Space-Time smoother: use the adult mortality
estimates from the Lancet paper as the gold standard.  Noise them up and
see how well the smoother recovers them.  How to noise them up is an open
question.

Like Eureqa, we may want to make our smoother so that users can choose
which components - GPR, spline, etc. - will go into it.

We should make sure we're not duplicating something that has already
been done.  We haven't yet found something that does exactly what we
would like.  How problem-specific is our goal?  This would be widely
useful around IHME.  If we want it to be useful beyond IHME, good
documentation and well-worked examples are essential.

Common inputs: 
a csv file with columns labeled::

    subregion, year, sex, age, y, se, x_1, x_2, x_3, ...

y and se may both be '' (empty string)

additionally, a csv representing the distance metric on subregions, with
one row per pair of subregions::

    subregion_i, subregion_j, distance

where the third column gives the distance between subregion_i and subregion_j
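
A sketch of loading both inputs with pandas (the file names and the column
header 'distance' are placeholders)::

    import pandas as pd

    data = pd.read_csv('noisy_45q15m.csv')  # blank y and se are read as NaN
    dist = pd.read_csv('distance.csv')      # subregion_i, subregion_j, distance

    # symmetric lookup table: d[i, j] == d[j, i]
    d = dict(((r.subregion_i, r.subregion_j), r.distance)
             for r in dist.itertuples())
    d.update(dict(((j, i), v) for (i, j), v in d.items()))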

Perhaps something more general, i.e. you could specify a hierarchy using
any variables.  At the least, a hierarchy, in terms of groupings, for the
spatial component.


options:
include sex in the model?

another goal:
make using the cluster easy and transparent
