Getting slower with every variable? #11

cjvanlissa · 2019-10-28T15:14:57Z

It seems like syn() gets slower with each successive variable. In a data.frame with 305 variables (all integers on a 0-4 range), the first variables take about a second each to synthesize, and the last ones take about 15 minutes each. Any idea what causes this behavior?

gillian-raab · 2019-10-30T10:00:29Z

This is a well-known issue. I assume you are using the defaults for everything, which means CART models in which each variable is predicted by everything that comes before it. CART models can be slow with as many variables as that.

Do you actually get to the end of the fit? Whether or not you do, there are several tactics you can use to get around this. The most obvious is to use a predictor matrix, which lets you select which variables predict which others. If you check out the synthpop web page, www.synthpop.org.uk (a bit incomplete, I'm afraid), you will find a page with links to various papers and presentations we have written that might help.
We are always interested to know what people use synthpop for. Do drop us a note to let us know how you are using it, and feel free to ask any other questions. Best, Gillian Raab gillian.raab@ed.ac.uk
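
To make the predictor-matrix idea concrete, here is a small language-agnostic sketch in Python (it is not synthpop code). It assumes the convention that row i of a square 0/1 matrix marks which columns may predict variable i, and it caps each variable at the `max_predictors` variables visited immediately before it. The helper name `banded_predictor_matrix` and the cap of 10 are hypothetical choices for illustration; in practice you would pick predictors on substantive grounds and pass the resulting matrix to syn() through its predictor.matrix argument.

```python
def banded_predictor_matrix(n_vars, max_predictors):
    """Return an n_vars x n_vars 0/1 matrix as a list of lists.

    Row i marks the predictors of variable i: only variables earlier in
    the sequence, and at most `max_predictors` of them (the nearest ones).
    """
    matrix = [[0] * n_vars for _ in range(n_vars)]
    for target in range(n_vars):
        first = max(0, target - max_predictors)
        for predictor in range(first, target):
            matrix[target][predictor] = 1
    return matrix


n_vars = 305
# Default behaviour described above: variable i is predicted by all i earlier ones.
default_counts = list(range(n_vars))
capped = banded_predictor_matrix(n_vars, max_predictors=10)
capped_counts = [sum(row) for row in capped]

print(max(default_counts), max(capped_counts))  # 304 10
```

With the default, the last variable's model has 304 candidate predictors; with the cap, no model ever sees more than 10.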

cjvanlissa · 2019-10-30T11:17:42Z

So just to clarify - is it not the case that each variable is predicted by every other variable? Each variable is predicted only by the ones preceding it in the data.frame?

gillian-raab · 2019-10-30T11:43:50Z

Not quite. It is predicted by all variables that come before it in the visit.sequence. If you don't specify a visit sequence, then it is the same as what you said. Changing the visit sequence is another way of customising your synthesis. If it is important to maintain the relationships between a set of variables, then put them together near the start of the visit sequence.
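
As a rough illustration of that rule (a Python sketch, not synthpop code), the available predictors for each variable are simply the variables that appear earlier in the visit sequence. The function name `predictors_by_visit_sequence` and the column names `a` to `d` are hypothetical.

```python
def predictors_by_visit_sequence(visit_sequence):
    """Map each variable to the list of variables visited before it."""
    return {var: visit_sequence[:pos] for pos, var in enumerate(visit_sequence)}


# Default: the visit sequence follows the column order of the data.
default = predictors_by_visit_sequence(["a", "b", "c", "d"])
# Customised: c and d are moved to the start of the visit sequence.
reordered = predictors_by_visit_sequence(["c", "d", "a", "b"])

print(default["c"])    # ['a', 'b']
print(reordered["c"])  # []
print(reordered["b"])  # ['c', 'd', 'a']
```

Moving `c` and `d` to the front means every later variable is modelled conditional on both of them.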

gillian-raab · 2019-10-30T11:44:31Z

PS do let me know what you are synthesising. G

cjvanlissa · 2019-11-05T14:06:02Z

Dear Gillian, I think I understand how the package operates now, but I'm not clear on why "the choice of explanatory variables is restricted by the synthesis sequence and variables that are not synthesised yet cannot be used in prediction models." It seems that, this way, structural relationships among variables are only fully preserved for the final variable in the visit.sequence?

gillian-raab · 2019-12-03T15:15:27Z

If you tried to use a variable that was not yet synthesised, it would not work at all. You are building up the synthetic data from conditional distributions. In each case you fit a model to the unsynthesised (original) data to get the parameters of the prediction model for the next variable. Then you make a synthetic version of the next variable in the synthetic data set by getting its predicted values from the variables synthesised already.
A simple example of why your suggestion will not work is the following. The first variable to be synthesised (v1) is usually just a bootstrap sample of the original data. Then you fit a model on the real data for the second variable (v2) predicted from v1. This prediction model is then used to get v2 in the synthetic data by predicting it from the synthetic v1. If you tried to use a prediction from another variable in the original data, it would not work, because you would not have a version of it in the synthetic data, and the version in the original data would not line up with the synthetic data at all.
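
The two-variable example can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the package's implementation: a conditional frequency table stands in for the CART model, and the toy data, the function names `fit_conditional` and `synthesise`, and the seed are all hypothetical.

```python
import random
from collections import defaultdict


def fit_conditional(real_rows, target, predictor):
    """From the REAL data, collect the observed target values for each predictor value."""
    table = defaultdict(list)
    for row in real_rows:
        table[row[predictor]].append(row[target])
    return table


def synthesise(real_rows, rng):
    n = len(real_rows)
    # Step 1: synthetic v1 is a bootstrap sample of the real v1 values.
    syn_v1 = [rng.choice(real_rows)["v1"] for _ in range(n)]
    # Step 2: fit v2 given v1 on the real data ...
    model = fit_conditional(real_rows, target="v2", predictor="v1")
    # ... then draw synthetic v2 using the SYNTHETIC v1 as the predictor.
    return [{"v1": v1, "v2": rng.choice(model[v1])} for v1 in syn_v1]


# Toy real data on a 0-4 scale in which v2 always equals (v1 + 1) % 5.
real = [{"v1": i % 5, "v2": (i + 1) % 5} for i in range(100)]
synthetic = synthesise(real, random.Random(1))

# The v1-v2 relationship carries over to the synthetic rows.
print(all(row["v2"] == (row["v1"] + 1) % 5 for row in synthetic))  # True
```

Step 2 can only use `syn_v1` because it already exists in the synthetic data; a variable that has not been synthesised yet has no synthetic column to predict from.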

You also suggested that the only relationships that would be maintained between variables would be the ones defined by these models, so here the relationship between the later variables and the earlier ones. But remember that a relationship between an earlier variable and a later one is maintained because this same fit makes the earlier variable in the synthetic data dependent on the later one.
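
One way to read this point, offered as a sketch rather than a statement about the package's internals, is through the chain rule for joint distributions. Writing the variables in visit order as $v_1, \dots, v_k$, the sequence of conditional models targets the full joint distribution:

$$
p(v_1, v_2, \dots, v_k) = p(v_1)\, p(v_2 \mid v_1) \cdots p(v_k \mid v_1, \dots, v_{k-1})
$$

For two variables, Bayes' rule then gives the dependence in the other direction from the same two pieces:

$$
p(v_1 \mid v_2) = \frac{p(v_1)\, p(v_2 \mid v_1)}{p(v_2)}
$$

So, to the extent that each conditional model fits well, a relationship is carried in both directions and is not limited to the final variable in the sequence.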

cjvanlissa closed this as completed Jul 22, 2020
