Difference between synpp and other tools #75

Open
ainar opened this issue Dec 9, 2022 · 1 comment

Comments

@ainar
Contributor

ainar commented Dec 9, 2022

I discovered synpp through https://github.com/eqasim-org/ile-de-france.
The tool is handy. After some research, I found that this kind of framework is widespread and usually goes by the name "data pipeline" framework.
Such frameworks are used in bioinformatics and, more generally, in data science research.

So here are my questions:

  • What are the differences between synpp and the other pipeline frameworks that are far more widely used?
  • Does synpp have features that are specific to, and required for, population synthesis?

For example:

I think the reason for synpp is that it is more straightforward than the other tools I listed (or I am just used to it). Because of that, I think synpp should stay as simple as possible and not reinvent the wheel with each new feature.
Do you have a different opinion? If you considered alternatives, what were your thoughts on them?

That makes me wonder: could synpp be officially generalized to work unrelated to population synthesis?

@sebhoerl
Contributor

sebhoerl commented Dec 9, 2022

Hi @ainar, good question. We mainly developed synpp to have all the functionality we wanted, but if there is another framework that ticks all the boxes and already has a good community around it, it would indeed be worth considering porting our pipelines to it and contributing there instead.

Back then I tested quite a few frameworks (to me, snakemake seems to be the best known, with the largest community), but none of them did exactly what we wanted. One reason we created our own pipeline tool was indeed to keep it simple, although we are now gradually coming around to the idea that more complex functionality, such as multi-machine scheduling, would be nice to have.

Some (maybe unique) functionality of synpp:

  • Compared to snakemake, we do a lot in memory. Snakemake needs a file for everything you output. On the one hand, this affects disk space consumption (if the stage is not "ephemeral"); on the other hand, it means reading and writing inputs and outputs over and over when passing them on.
  • This is especially awkward when we parametrize the pipeline (sampling_rate = 0.5, random_seed = 1234, ...): we would always need to construct our stages around file names like population_output_sr0.5_rs1234.csv, which is not very flexible. I don't know whether the other systems you cited are better at passing configuration parameters down the stage hierarchy. Basically, each stage in synpp is not only the stage itself, but also its parametrization.
  • I think I tried another framework (not sure which), but there it was awkward to "devalidate" stages along the line in order to rerun only part of the pipeline.
  • In general, I'm not sure whether the cited frameworks have the intuitive caching structure of synpp (rerunning stages when the code has changed, rerunning when inputs have changed, ...).
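
To make the points above a bit more concrete, here is a minimal sketch of the idea that a stage is identified by its code plus its parametrization, with results kept in memory. This is not synpp's actual implementation or API: the `Stage` class, `fingerprint`, and the two example stages are hypothetical names made up for illustration, and hashing the bytecode is only a crude stand-in for detecting code changes.

```python
import hashlib
import json
import random

_cache = {}    # in-memory results, keyed by stage identity
executed = []  # names of stages that actually ran (for illustration only)


def fingerprint(func):
    # Crude stand-in for "has the code changed?": hash bytecode and constants.
    code = func.__code__
    return hashlib.sha256(code.co_code + repr(code.co_consts).encode()).hexdigest()


class Stage:
    """Hypothetical stage: a function, the config keys it reads, its upstream stages."""

    def __init__(self, func, params=(), requires=()):
        self.func = func
        self.params = tuple(params)
        self.requires = tuple(requires)

    def key(self, config):
        # Identity = own code + own parameters + identities of all upstream stages.
        payload = {
            "code": fingerprint(self.func),
            "params": {name: config[name] for name in self.params},
            "upstream": [dep.key(config) for dep in self.requires],
        }
        return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

    def run(self, config):
        key = self.key(config)
        if key not in _cache:
            # Upstream results are passed along in memory, not via files.
            inputs = [dep.run(config) for dep in self.requires]
            values = {name: config[name] for name in self.params}
            _cache[key] = self.func(values, *inputs)
            executed.append(self.func.__name__)
        return _cache[key]


def load_population(values):
    return list(range(values["population_size"]))


def sample_population(values, population):
    rng = random.Random(values["random_seed"])
    count = int(len(population) * values["sampling_rate"])
    return rng.sample(population, count)


load = Stage(load_population, params=["population_size"])
sample = Stage(sample_population, params=["sampling_rate", "random_seed"], requires=[load])

config = {"population_size": 100, "sampling_rate": 0.5, "random_seed": 1234}
first = sample.run(config)                            # runs both stages
second = sample.run(config)                           # served from the in-memory cache
third = sample.run({**config, "sampling_rate": 0.1})  # only the sampling stage reruns
```

In this sketch, changing `sampling_rate` produces a new identity for the sampling stage only, so no file name such as population_output_sr0.5_rs1234.csv has to encode the parameters, and the upstream stage is reused as is.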

If we find something that can do all of this and has an existing community behind it, that would be quite nice :)

I think other options are, for instance, Celery or Airflow.
