Run ReproZip for part of script #358

Closed
appukuttan-shailesh opened this issue Jul 5, 2019 · 10 comments

Comments

@appukuttan-shailesh

Would it be possible to run ReproZip for part of a Python script?

To elaborate a bit... we are developing a tool in which users specify multiple parameters, and based on these, different models and protocols are employed for the simulations (i.e. it involves user interactivity). Naturally, the packages and files that are invoked also vary accordingly, and the 'environment' I wish to save should exclude these initial parts and other housekeeping tasks, and focus solely on the loading and execution of the model.

With this in mind, is it possible to invoke ReproZip from within a Python script (as opposed to calling it from the terminal CLI) so that I can track (and save) the files/packages that are required between, e.g., line numbers x and y of my script (i.e. to be able to enable/disable ReproZip tracing inside a Python script)?

I suppose ReproZip wasn't intended to run in this fashion, but I am curious to know if I could employ certain sub-modules or methods to achieve this. I also took a look at the Jupyter plugin to see if some bits might be useful.

I intend to dig deeper, but felt it was much better to ask here to get a better idea of the lay of the land. Thanks in advance.

(apologies if a similar question has been answered previously elsewhere)

@remram44
Member

remram44 commented Jul 5, 2019

Unfortunately, ReproZip is meant to trace a process from its creation.

When you reach the section of interest to you, there is no way for you to find out which of the already-loaded files are required for this new section. For example, numpy might already have been loaded because it's a requirement of your UI package, and you won't see it getting loaded when you load pandas at the start of your simulation code (because it's already been loaded). ReproZip can't automatically determine that you want numpy but not the UI package.
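
The caching behaviour described here can be observed from inside Python itself. The following is only a minimal sketch, using the standard-library `json` module as a stand-in for numpy (an assumption for illustration; this is not ReproZip code):

```python
import importlib
import sys

# First import: the interpreter locates and reads the module's files,
# which is the file access a tracer attached from process start would see.
first = importlib.import_module("json")

# Second import: served from the sys.modules cache, so the module's files
# are not opened again and a tracer attached only now would see nothing.
second = importlib.import_module("json")

already_cached = "json" in sys.modules
same_object = first is second
print(already_cached, same_object)
```

Because the second import is answered from the cache, a trace that starts partway through a process cannot tell which earlier imports the later section depends on.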

Would it be possible for you to split this script into two separate scripts? You could have a first script set everything up through a UI, then call the simulation script, passing the simulation parameters on the command line or via a file. Then you can easily interpose ReproZip to trace this second process.
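
A rough sketch of what the first (UI) script could do under this split, assuming the parameters are handed over as a JSON file. The names `simulate.py`, `params.json`, and the parameter values are hypothetical; only `reprozip trace <command>` comes from ReproZip itself:

```python
import json
import sys
import tempfile
from pathlib import Path


def write_params(params: dict, directory: str) -> Path:
    """Save the parameters chosen in the interactive step to a JSON file."""
    path = Path(directory) / "params.json"
    path.write_text(json.dumps(params, indent=2))
    return path


def build_trace_command(script: str, params_file: Path) -> list:
    """Build the argv that runs the simulation script under `reprozip trace`."""
    return ["reprozip", "trace", sys.executable, script, str(params_file)]


# Hypothetical parameters; in practice these would come from the UI script.
workdir = tempfile.mkdtemp()
params_file = write_params({"model": "example_model", "dt": 0.025}, workdir)
command = build_trace_command("simulate.py", params_file)
print(command)
# To actually start the trace (requires ReproZip to be installed):
#   import subprocess; subprocess.run(command, check=True)
```

Since the traced process starts at the simulation script, everything it loads is recorded, and nothing loaded only by the UI script ends up in the trace.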

@appukuttan-shailesh
Author

appukuttan-shailesh commented Jul 5, 2019

Thanks for the super quick response. Your suggestion about splitting into two scripts was my "plan B" and I intend to try it out soon. Will update you on this shortly.

p.s. Is there a provision for recording package versioning info (wherever possible)... like a pip freeze but limited to the specific packages that were loaded? This is with the intention of obtaining an environment snapshot that can be displayed to visitors (as a detailed requirements file).

@remram44
Member

remram44 commented Jul 5, 2019

Yes! I'm hoping to add support for common interpreters (Python, R, Ruby) so that version information can be recorded. I completely agree that this information should be in the bundle.

@appukuttan-shailesh
Author

Out of curiosity... do you have an idea of when this feature might be available? For now, I have been planning to include certain parts of the "Sumatra" package to do this version tracking. If this is expected within ReproZip in the near future, I would be inclined to wait :-)

This might be useful for the Python implementation:
https://github.com/open-research/sumatra/blob/20821e8a62fff2869cbdbe74d39aa580c3a19d0a/sumatra/dependency_finder/python.py

@remram44
Member

remram44 commented Jul 5, 2019

Unfortunately, ReproZip is not hooked into the experiment's Python interpreter, so I have to take a different approach. Probably simply reading the .dist-info folders.
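
To illustrate the `.dist-info` idea: each such folder contains a `METADATA` file with `Name:` and `Version:` headers, which can be read without running the experiment's interpreter. This is only a sketch of one possible approach, not how ReproZip does or will do it; the function name `read_dist_info` and the package `examplepkg` are made up for the demonstration:

```python
import tempfile
from email.parser import Parser
from pathlib import Path


def read_dist_info(site_packages: Path) -> dict:
    """Map distribution name to version by parsing *.dist-info/METADATA."""
    versions = {}
    for metadata in sorted(site_packages.glob("*.dist-info/METADATA")):
        # METADATA uses RFC 822-style headers, so the email parser can read it.
        headers = Parser().parsestr(
            metadata.read_text(encoding="utf-8"), headersonly=True
        )
        if headers["Name"] and headers["Version"]:
            versions[headers["Name"]] = headers["Version"]
    return versions


# Demonstration on a throwaway directory holding one made-up package.
site = Path(tempfile.mkdtemp())
info = site / "examplepkg-1.2.3.dist-info"
info.mkdir()
(info / "METADATA").write_text(
    "Metadata-Version: 2.1\nName: examplepkg\nVersion: 1.2.3\n", encoding="utf-8"
)
found = read_dist_info(site)
print(found)
```

Pointed at a real `site-packages` directory, the same function would list every installed distribution; restricting it to the packages actually used would additionally require matching the traced file paths against each distribution.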

@remram44
Member

remram44 commented Jul 5, 2019

[edit: moved to #359]

@appukuttan-shailesh
Author

appukuttan-shailesh commented Jul 26, 2019

I have been attempting to split my script into two parts, one of which would be invoked through reprozip. The workflow seems to work in general, but I have the following concerns:

  1. Is it possible to specify via the CLI a target directory where the .reprozip and .reprozip-trace directories would be created? The same (separated-out) script may need to be run several times in parallel from within the same directory. In that case, I would need to specify a distinct target location for each run. --continue and --overwrite don't suffice for me here.

  2. Is it possible to edit the configuration file via CLI? For instance, I don't wish to store information such as values of environment variables. It wouldn't be feasible for me to do so manually each time.

  3. Do you have any tips for reducing the size of the .rpz file? Is there any group of packages that can be ignored (e.g. Miniconda)? (I realize that doing so will not guarantee reproducibility of the outputs).

@remram44
Member

> Is it possible to specify via CLI a target directory where the .reprozip and .reprozip-trace directories would be created?

For .reprozip-trace, you can select its location using -d: reprozip trace -d .reprozip-trace-3 ./mycommand. The .reprozip directory is always in $HOME, but that shouldn't cause more issues than a combined log file in .reprozip/log.
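
For the parallel-runs case, the `-d` option could be combined with a per-run directory name. A small sketch, assuming a hypothetical `simulate.py` and parameter files; only `reprozip trace -d <dir> <command>` is taken from the answer above:

```python
import sys


def trace_commands(script: str, param_files: list) -> list:
    """Return one `reprozip trace` argv per run, each with its own -d directory."""
    commands = []
    for index, params in enumerate(param_files):
        trace_dir = f".reprozip-trace-{index}"
        commands.append(
            ["reprozip", "trace", "-d", trace_dir, sys.executable, script, params]
        )
    return commands


# Hypothetical script and parameter files.
commands = trace_commands("simulate.py", ["run_a.json", "run_b.json"])
for argv in commands:
    print(" ".join(argv))
# Each argv could then be started with subprocess.Popen(argv) to run in parallel.
```

Giving every run its own trace directory keeps concurrent traces launched from the same working directory from writing to the same location.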

> Is it possible to edit the configuration file via CLI?

That's currently not possible, sorry. We would need a lot of different commands to support every use case. You can, however, change this file from Python using PyYAML. Note that changing the environment might prevent the experiment from running, since some variables are necessary for the reproduction (I'm thinking PATH, HOME, XDG_*, LANG).
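
As a sketch of the PyYAML route: the snippet below assumes the configuration file is a YAML document with a top-level `runs` list whose entries each carry an `environ` mapping (check this against your own file before relying on it). The helper names and sample values are hypothetical, and the variables kept are the ones named above:

```python
KEEP_EXACT = {"PATH", "HOME", "LANG"}
KEEP_PREFIXES = ("XDG_",)


def strip_environ(config: dict) -> dict:
    """Drop all environment variables except those likely needed to reproduce."""
    for run in config.get("runs") or []:
        environ = run.get("environ") or {}
        run["environ"] = {
            name: value
            for name, value in environ.items()
            if name in KEEP_EXACT or name.startswith(KEEP_PREFIXES)
        }
    return config


def rewrite_config(path: str) -> None:
    """Load the YAML file, strip the environment, and write it back."""
    import yaml  # PyYAML, third-party; imported here so the rest runs without it

    with open(path) as handle:
        config = yaml.safe_load(handle)
    with open(path, "w") as handle:
        yaml.safe_dump(strip_environ(config), handle, default_flow_style=False)


# Demonstration on an in-memory structure with made-up values.
sample = {
    "runs": [
        {
            "environ": {
                "PATH": "/usr/bin",
                "API_TOKEN": "secret",
                "XDG_RUNTIME_DIR": "/run/user/1000",
            }
        }
    ]
}
cleaned = strip_environ(sample)
print(cleaned["runs"][0]["environ"])
```

Note that a load/dump round trip through PyYAML discards any comments in the file, so it may be worth keeping a copy of the original.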

> Do you have any tips for reducing the size of the .rpz file?

Some things might not strictly be needed, such as fonts (#360), but usually everything that gets packed is required for the experiment to run. You can omit your data if it's repeated between all the experiments you trace; there is no automated way to put the data back in, but running the upload command to put the data in before reproducing is straightforward.

@appukuttan-shailesh
Author

appukuttan-shailesh commented Jul 29, 2019

Thanks for the quick reply. I have implemented your suggestions and have a working prototype ready. I will start testing it and collecting feedback from other users, and will get back to you with any further developments.

@remram44
Member

remram44 commented Jul 29, 2019

Glad I could help! I am very interested in your feedback and experience as you attempt this, so don't hesitate to share what you can.

Closing this ticket in favor of #359.
