Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unified wrapping method #991

Open
remram44 opened this issue Jan 8, 2015 · 22 comments
Open

Unified wrapping method #991

remram44 opened this issue Jan 8, 2015 · 22 comments

Comments

@remram44
Copy link
Member

remram44 commented Jan 8, 2015

We have several libraries automatically-wrapped as VisTrails packages. Unfortunately, each one is using a different method, which can lead to problems (when using a method that is not optimal), code duplication (they all do basically the same thing), and lesser maintainability.

Packages can be created in different ways:

  • Statically, i.e. we generate Python code that is loaded normally by VisTrails
  • Dynamically, i.e. Python creates modules using type() without generating code

Packages can be created from different sources:

  • Either directly from introspecting the library (and for instance parsing docstrings)
  • Or from an intermediate representation (XML, JSON, ...)

Having an intermediate representation allows for manual corrections, and if the wrapping is dynamic, is the only way we can be sure the wrapped package is always exactly the same no matter the installed library version.

What we have:

  • VTK
    • dynamic, no intermediate repr, introspects Python classes in vtk module
  • matplotlib
    • static, intermediate XML with patch mechanism, parses docstrings
  • Java (WIP, Wrapping Java libraries in VisTrails #820)
    • static, intermediate JSON, uses Java language introspection and source parser
  • scikit-learn (WIP, scikit-learn package #955, @amueller)
    • dynamic, no intermediate repr, ???
  • numpy/scipy (WIP, @tacaswell)
    • dynamic, intermediate JSON, parses docstrings
@amueller
Copy link
Contributor

amueller commented Jan 8, 2015

Can we maybe discuss next Monday? We are just working on a patch that will improve parsing of docstrings. That will require the newest version then....

@remram44
Copy link
Member Author

remram44 commented Jan 8, 2015

For my opinion:

I'm not sure generating Python code is a good idea, since it's kind of hard (even with text templates, as done in matplotlib), doesn't really help understanding what's happening (since the code is usually hard to read, cf matplotlib's). I'm not sure about performance here (parsing Python code is not necessarily fast).

If emitting the modules dynamically, I think having an intermediate representation is a must-have, else the package can have different contents on different machines. I think we all agree on this since all wrappers work this way (except @amueller's?)

I don't really have a preference on the intermediate representation format. I do think having a way to patch (like the matplotlib package's) is a very nice feature.

@dakoop
Copy link
Member

dakoop commented Jan 8, 2015

I created a branch vtk-new-package on the vistrails.org repo that worked to implement a static version similar to the matplotlib representation for VTK. One interesting item was that it seemed like the dynamic generation was faster to load than the static version because of some inefficiencies when parsing port specs. I haven't looked at this for a bit, but it should be included in the conversation.

I agree that an intermediate representation is desirable and patching should be supported.

@remram44
Copy link
Member Author

remram44 commented Jan 8, 2015

@amueller: yes definitely, I'm just gathering info here for the next meeting ;)

If we come up with a wonderful and unified system for wrappers (hard in itself), migrating everything is a lot of work, this is definitely not happening overnight.

@tacaswell
Copy link
Contributor

My implementation is targeted at parsing numpydoc formatted docstrings + python introspection.

Currently only functions are handled, but getting classes in is on my near-term to-do list.

I am completely sold on the need for intermediate representations.

I live in Queens, would it be helpful for me to meet with you all in person?

@amueller
Copy link
Contributor

amueller commented Jan 8, 2015

For the intermediate representation:
That means that you either need to support only a specific version of a package or that you need to write code to make the wrappers compatible with several versions. How do you propose to do that?
I don't see how the package can have the same content using different versions of the wrapped package.

@remram44
Copy link
Member Author

remram44 commented Jan 8, 2015

Using an older version of the library would mean that some modules don't work. This is unavoidable, but IMHO better than having VisTrails report that these modules "don't exist".

But this means that users need to update that intermediate representation to use new features of the wrapped library. We can make this easy (provide a button inside VisTrails?).

I didn't think too much about the problem you describe (supporting multiple versions); I'm not sure what we can do if new library versions break older versions of the wrapper.

In any case, we can warn when the wrapper and wrapped versions don't match.

@amueller
Copy link
Contributor

amueller commented Jan 8, 2015

I guess I didn't think about the usecase when someone gets a file from someone else with a new module. Then it seems better to say "you need a newer version of the lib for this module to work".
Usually I would have thought the other way around: If you are just creating a new pipeline, I would rather not list modules that don't work anyhow.

Btw, the installation base for most python libraries is very diverse, as packages get regularly updated.
So you should assume that basically none of your users uses the same version as the one you wrapped. At least that is what you should assume for numpy, scipy, matplotlib and scikit-learn.
If you did not install exactly the same version of anaconda and did not upgrade anything, I daresay the chances that all these have the same version as the one you wrapped are basically zero.

@tacaswell
Copy link
Contributor

The intermediate representations are relatively cheap (just a text file), it might be worth having versions of that for all of the versions of the library you want to support.

The init.py (which turns that intermediate representation into Modules) can intropsect the version of the wrapped library and select the correct intermediate file.

I don't fully have the full VT data model in my head yet, but could you also propagate this information into the VT version-tracking scheme to allow for smooth(ish) upgrades as well?

@dakoop
Copy link
Member

dakoop commented Jan 8, 2015

I agree with @tacaswell here. Our goal is ensure reproducibility but still allow people to use whatever versions they when possible. I do think that multiple versions is the way to go. With the VTK wrapping, we have had cases where a vistrail that worked on one user's machine will not on another user's machine even though the VisTrails package version is the same because the dynamic wrapping picks up different classes. In addition, the patching should mostly work across multiple versions (it should not need to be updated for each version as these are the special cases that are not handled by the parser).

We already have an upgrade scheme in VisTrails that allows developers to specify the upgrade paths between package versions. It could use some improvement, but it allows mostly automatic upgrades in a way that records which version of the package is being used.

@amueller
Copy link
Contributor

amueller commented Jan 8, 2015

The problem is that if you "just" wrap the version that is available at the moment, it will error when someone uses an old version of the library and you load a pipeline that uses a newer version (ports might appear or disappear etc), which I think is the problem that @remram44 alluded to.

Maybe the easiest way would be to store the version with which a pipeline was created and if there is any error in the loading, give a message about version mismatch.

If you do this, however, I don't see the benefit of an intermediate representation.

@dakoop
Copy link
Member

dakoop commented Jan 8, 2015

I think our messages crossed, but the intermediate representation is very useful for patching (when you need to work around an issue with automated generation). I would argue that hard-coding the exceptional cases into the dynamic generator makes updates more difficult.

@amueller
Copy link
Contributor

amueller commented Jan 8, 2015

On 01/08/2015 11:34 AM, dakoop wrote:

With the VTK wrapping, we have had cases where a vistrail that worked
on one user's machine will not on another user's machine even though
the VisTrails package version is the same because the dynamic wrapping
picks up different classes.
With the scikit-learn wrappers, that can not really happen.
Scikit-learn uses the same methods in our tests.

@amueller
Copy link
Contributor

amueller commented Jan 8, 2015

On 01/08/2015 11:36 AM, dakoop wrote:

I think our messages crossed, but the intermediate representation is
very useful for patching (when you need to work around an issue with
automated generation). I would argue that hard-coding the exceptional
cases into the dynamic generator makes updates more difficult.

My wrappers are functional without hard coded cases ;)
Adding the whole of scikit-learn is literally 359 lines (no
documentation and testing yet, though).

@amueller
Copy link
Contributor

amueller commented Jan 8, 2015

Anyhow, I don't want to argue too much if you are all of the opinion that intermediate formats are great. So what does dynamic with intermediate format mean? Does it mean the wrapper is created dynamically from the intermediate format or that the intermediate format is created dynamically?

@remram44
Copy link
Member Author

remram44 commented Jan 8, 2015

The former; "dynamic" means no Python source generation. Of course it's not "truly dynamic" if generating from the static intermediate files.

@amueller
Copy link
Contributor

amueller commented Jan 8, 2015

Fair enough. So I could achieve that by splitting all of my functions into two, one that writes the variables that I extracted into a json, and one that reads the json and generates a class from it. And reading would happen on load, while the writing would happen offline. Seems like a small-enough change.
Generating the jsons would require a script that goes through the last x releases, creates virtualenvs with the release, and runs the generation. A bit of hackery there but not too bad.
Then on dynamic generation you could just check the version string and match it with the json version.

If you do that

the only way we can be sure the wrapped package is always exactly the same no matter the installed library version.

is not true any more, though.

@tacaswell
Copy link
Contributor

@amueller That is exactly what I did in terms of splitting the logic in two.

The code that extracts the meta-data ended up being independent of VT which makes it more generally useful (for example if you wanted to go through a library and add type-hinting). It is also in the back of my head to try to move the numpydoc scraping code into numpydoc or a stand-alone project.

@remram44
Copy link
Member Author

After the discussion in today's meeting:
@rexissimus will try and come up with a nice wrapping system, with intermediate representation (but w/o Python code generation?) and use it for VTK first.

@tacaswell
Copy link
Contributor

Bah, sorry I missed today's meeting.

@rexissimus please have a look at wrap_lib.py and scrape.py in https://github.com/Nikea/VTTools/tree/master/vttools which I think are already a nice wrapping system with an intermediate representation.

@rexissimus
Copy link
Member

Thanks, I will do that.

@rexissimus
Copy link
Member

The general wrapper based on VTK is here: https://github.com/VisTrails/VisTrails/tree/python-wrapper

It provides a specification for wrapping python functions and classes that vistrails then knows how to execute.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

No branches or pull requests

5 participants