Getting slower with every variable? #11
Comments
This is a well-known issue. I assume you are using the defaults for everything, which means CART models in which each variable is predicted by everything that comes before it, and CART models can be slow with that many variables. Do you actually get to the end of the fit? Whether or not you do, there are several tactics you can use to get around this. The most obvious is to supply a predictor matrix, which lets you select which variables predict which others. If you check out the synthpop web page, www.synthpop.org.uk (a bit incomplete, I'm afraid), you will find a page with links to various papers and presentations we have written that might help.
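As a rough illustration of the predictor-matrix tactic, the sketch below builds a restricted matrix in base R. It assumes the usual synthpop convention that rows are the variables being synthesised and columns are their predictors. The variable names, the small `p`, and the `window` of three preceding variables are hypothetical choices for the example, not synthpop defaults or a recommendation from the maintainers.

```r
# Hypothetical toy size; the data in the question has 305 columns.
p    <- 12
vars <- paste0("V", seq_len(p))

# Structure used when every variable is predicted by all earlier ones:
# row i (the variable being synthesised) has a 1 in every column j < i.
pred_default <- matrix(0, p, p, dimnames = list(vars, vars))
pred_default[lower.tri(pred_default)] <- 1

# Restricted version: keep only the `window` variables immediately
# preceding each variable (hypothetical choice for illustration).
window <- 3
pred_window <- pred_default
pred_window[row(pred_window) - col(pred_window) > window] <- 0

# Number of predictors per variable under each scheme.
rowSums(pred_default)   # grows with position in the sequence
rowSums(pred_window)    # capped at `window`

# With synthpop installed and a data frame `dat` whose columns match `vars`,
# the matrix could be supplied as:
# sds <- synthpop::syn(dat, predictor.matrix = pred_window)
```

Capping the number of predictors keeps each CART fit at a roughly constant size instead of letting it grow with every variable. Which predictors to keep is a modelling decision that depends on the data.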
So just to clarify: is it not the case that each variable is predicted by every other variable? Is each variable predicted only by the ones preceding it in the data.frame?
Not quite. It is predicted by all variables that come before it in the visit.sequence. If you don't specify a visit sequence, then it is the same as what you said. Changing the visit sequence is another way of customising your synthesis. If it is important to maintain the relationships between a set of variables, then put them together near the start of the visit sequence.
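A minimal base-R sketch of that reordering, assuming a visit sequence given as column indices. The variable names and the choice of `key` variables are hypothetical, and `preds_for()` is a helper written for this example, not a synthpop function.

```r
vars <- paste0("V", 1:10)

# Hypothetical set of variables whose joint relationships matter most.
key <- c("V7", "V9", "V2")

# Visit the key variables first, then the rest in their original order.
key_idx <- match(key, vars)
visit   <- c(key_idx, setdiff(seq_along(vars), key_idx))

# Helper (illustration only): which variables precede `v` in the visit
# sequence, i.e. which ones are available as its predictors.
preds_for <- function(v) {
  pos <- match(match(v, vars), visit)
  vars[visit[seq_len(pos - 1)]]
}

visit
preds_for("V2")   # the other key variables
preds_for("V1")   # all three key variables

# With synthpop installed and a matching data frame `dat`:
# sds <- synthpop::syn(dat, visit.sequence = visit)
```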
PS: do let me know what you are synthesising. G
Dear Gillian, I think I understand how the package operates now, but I'm not clear on why "the choice of explanatory variables is restricted by the synthesis sequence and variables that are not synthesised yet cannot be used in prediction models." It seems that, this way, structural relationships among variables are only fully preserved for the final variable in the visit.sequence?
If you tried to use a variable that was not yet synthesised, it would not work at all. You are building up the synthetic data from conditional distributions. In each case you fit a model to the unsynthesised (observed) data to get the parameters of the prediction model for the next variable. Then you make a synthetic version of that variable in the synthetic data set by getting its predicted values from the variables already synthesised. You also suggested that the only relationships maintained between variables would be the ones defined by these models, that is, the relationships between the later variables and the earlier ones. But remember that a relationship between an earlier variable and a later one is maintained
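The fit-on-observed, predict-from-synthesised loop described above can be sketched in base R. This is only an illustration under stated assumptions: it swaps in a linear model with Gaussian residual noise for CART, and the toy data and variable names are invented for the example. It is not how synthpop is implemented.

```r
set.seed(42)
n <- 500

# Invented observed data with known dependencies between the columns.
obs <- data.frame(x1 = rnorm(n))
obs$x2 <- 0.8 * obs$x1 + rnorm(n, sd = 0.5)
obs$x3 <- 0.5 * obs$x1 - 0.7 * obs$x2 + rnorm(n, sd = 0.5)

# The first variable has no predictors, so resample its observed values.
synth <- data.frame(x1 = sample(obs$x1, n, replace = TRUE))

for (j in 2:ncol(obs)) {
  target <- names(obs)[j]
  preds  <- names(obs)[seq_len(j - 1)]

  # Fit the conditional model on the OBSERVED data.
  # (Linear model used as a stand-in for CART.)
  fit <- lm(reformulate(preds, response = target), data = obs)

  # Generate the synthetic column from the ALREADY SYNTHESISED predictors,
  # adding residual noise so the values are draws rather than fitted means.
  synth[[target]] <- predict(fit, newdata = synth) + rnorm(n, sd = sigma(fit))
}

# Compare pairwise correlations in the observed and synthetic data.
round(cor(obs), 2)
round(cor(synth), 2)
```

A variable that has not been synthesised yet has no column in `synth`, so `predict()` would have nothing to use for it. That is why predictors are limited to variables earlier in the sequence.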
It seems like syn() gets slower with every next variable. In a data.frame with 305 variables (all integers on a 0-4 range), the first variables take about a second each to synthesize, and the last ones take about 15 minutes each. Any idea what causes this behavior?