Getting slower with every variable? #11

Closed
cjvanlissa opened this issue Oct 28, 2019 · 6 comments

Comments

@cjvanlissa
Contributor

It seems like syn() gets slower with every additional variable. In a data.frame with 305 variables (all integers on a 0-4 range), the first variables take about a second each to synthesize, and the last ones take about 15 minutes each. Any idea what causes this behavior?

@gillian-raab

This is a well-known issue. I assume you are using the defaults for everything, which means CART models in which each variable is predicted by everything that comes before it. CART models can be slow with as many variables as that.

Do you actually get to the end of the fit? Whether or not you do, there are several tactics you can use to get around this. The most obvious is to use a predictor matrix, which allows you to select which variables predict which others. If you check out the synthpop web page www.synthpop.org.uk (a bit incomplete, I'm afraid), you will find a page with links to various papers and presentations we have written that might help.
We are always interested to know what people use synthpop for. Do drop us a note to let us know how you are using it, and feel free to ask any other questions. Best, Gillian Raab gillian.raab@ed.ac.uk
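
To make the predictor-matrix idea concrete, here is a minimal Python sketch. It is an illustration only, not synthpop code: the function names and the "window" rule are hypothetical, and it assumes the convention that row i of the matrix marks which variables are used to predict variable i. It shows why the default setup gets heavier with every variable (the last of 305 variables has 304 predictors) and how restricting the matrix caps that number.

```python
def default_predictor_matrix(n_vars):
    """Row i has a 1 in column j when variable j comes before variable i."""
    return [[1 if j < i else 0 for j in range(n_vars)] for i in range(n_vars)]


def restrict_to_window(matrix, window):
    """Keep only the `window` closest preceding variables as predictors.

    This restriction rule is a hypothetical example; any rule that zeroes
    out entries of the matrix would reduce the model size in the same way.
    """
    restricted = []
    for i, row in enumerate(matrix):
        restricted.append(
            [value if (i - j) <= window else 0 for j, value in enumerate(row)]
        )
    return restricted


n_vars = 305
full = default_predictor_matrix(n_vars)
small = restrict_to_window(full, window=10)

# Number of predictors used for each variable under each scheme.
full_counts = [sum(row) for row in full]
small_counts = [sum(row) for row in small]

print(full_counts[-1], small_counts[-1])
```

With the full matrix the number of predictors grows by one for every variable visited; with the restricted matrix it never exceeds the chosen window.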

@cjvanlissa
Contributor Author

So just to clarify - is it not the case that each variable is predicted by every other variable? Each variable is predicted only by the ones preceding it in the data.frame?

@gillian-raab

Not quite. It is predicted by all variables that come before it in the visit.sequence. If you don't specify a visit sequence, then it is the same as what you said. Changing the visit sequence is another way of customising your synthesis. If it is important to maintain the relationships between a set of variables, then put them together near the start of the visit sequence.
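
As a rough illustration of the point about the visit sequence, the following Python sketch lists which variables would be available as predictors for each variable under two different orderings. It is not synthpop code; the function name and the column names are hypothetical.

```python
def predictors_by_visit_sequence(visit_sequence):
    """Map each variable to the variables visited before it."""
    predictors = {}
    for position, name in enumerate(visit_sequence):
        predictors[name] = list(visit_sequence[:position])
    return predictors


# Hypothetical column names, in data.frame order.
columns = ["age", "income", "score", "region"]

# Default: the visit sequence follows the column order.
default_order = predictors_by_visit_sequence(columns)

# Custom: variables visited earlier become predictors for those visited later.
custom_order = predictors_by_visit_sequence(["score", "income", "age", "region"])

print(default_order["score"])
print(custom_order["age"])
```

Reordering changes only which variables condition on which; every variable is still synthesised exactly once.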

@gillian-raab

PS do let me know what you are synthesising. G

@cjvanlissa
Contributor Author

Dear Gillian, I think I understand how the package operates now, but I'm not clear on why "the choice of explanatory variables is restricted by the synthesis sequence and variables that are not synthesised yet cannot be used in prediction models." It seems that, this way, structural relationships among variables are only fully preserved for the final variable in the visit.sequence?

@gillian-raab

If you tried to use a variable that was not yet synthesised, it would not work at all. You are building up the synthetic data from conditional distributions. In each case you fit a model to the unsynthesised data to get the parameters of the prediction model for the next variable. Then you make a synthetic version of the next variable in the synthetic data set by getting its predicted values from the variables synthesised already.
A simple example of why your suggestion will not work is the following. The first variable (v1) to be synthesised is usually just a bootstrap sample of the original data. Then you fit a model, from the real data, of the second variable (v2) predicted from v1. This prediction model is then used to get v2 in the synthetic data by predicting it from the synthetic v1. If you tried to use a prediction from another variable in the original data, it would not work, because you would not have a version of it in the synthetic data, and the version in the original data would not line up with the synthetic data at all.
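
The two-variable example above can be sketched in Python as follows. This is an illustration under stated assumptions, not synthpop's implementation: a simple lookup of observed v2 values for each v1 value stands in for the CART model, the data are made up, and all function names are hypothetical.

```python
import random
from collections import defaultdict


def fit_conditional(real_v1, real_v2):
    """Fit on the REAL data: collect the observed v2 values for each v1 value.

    This lookup table is a stand-in for a CART model (an assumption of
    this sketch).
    """
    table = defaultdict(list)
    for a, b in zip(real_v1, real_v2):
        table[a].append(b)
    return table


def synthesise(real_v1, real_v2, rng):
    n = len(real_v1)
    # Step 1: synthetic v1 is a bootstrap sample of the original v1.
    syn_v1 = [rng.choice(real_v1) for _ in range(n)]
    # Step 2: fit the model for v2 given v1 on the real data.
    model = fit_conditional(real_v1, real_v2)
    # Step 3: generate synthetic v2 from the SYNTHETIC v1, row by row.
    syn_v2 = [rng.choice(model[a]) for a in syn_v1]
    return syn_v1, syn_v2


rng = random.Random(1)
# Made-up data on a 0-4 integer range; v2 is v1 or v1 + 1, capped at 4.
real_v1 = [rng.randint(0, 4) for _ in range(500)]
real_v2 = [min(4, a + rng.randint(0, 1)) for a in real_v1]

syn_v1, syn_v2 = synthesise(real_v1, real_v2, rng)
```

Step 3 is where a not-yet-synthesised predictor would fail: the model needs an input value for every synthetic row, and only variables already present in the synthetic data can supply one.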

You also suggested that the only relationships that would be maintained between variables would be the ones defined by these models, so here the relationships between the later variables and the earlier ones. But remember that a relationship between an earlier variable and a later one is maintained because this same fit makes the earlier variable in the synthetic data dependent on the later one.
