Custom pickler #615

Closed

Conversation

dmitriy-serdyuk
Contributor

Via @bartvm in the comments on #685 (adding here for posterity):

[This PR] combines the previous Checkpoint extension (which pickled the main loop) and the Dump extension (which collected the values of the parameters and saved them to an NPZ file). It does so by making use of pickle's "persistent ID" support (which we already merged into Theano a while ago, Theano/Theano#2802).

In practice, it means that during pickling, each shared variable gets intercepted by the PersistentSharedVariableID class and is assigned a name (if it's owned by a brick, the path to the brick is used; otherwise the name of the shared variable is used). The value of the shared variable is saved as an NPY file to a ZIP file, together with the pickled main loop (which contains string references to the names of the parameters). The final result is a single ZIP file with a pickled main loop and a series of NumPy arrays containing the values of all the shared variables contained within the main loop.
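Pickle's persistent ID support works as follows: a Pickler subclass defines a persistent_id method, and any object for which it returns something other than None is left out of the pickle stream, with only the returned reference stored in its place. A minimal sketch of the idea (the class name and the duck-typed check are illustrative, not this PR's actual code):

import pickle

class ParameterPickler(pickle.Pickler):
    """Sketch: divert shared-variable-like objects out of the stream."""
    def __init__(self, file, arrays, protocol=None):
        pickle.Pickler.__init__(self, file, protocol)
        self.arrays = arrays  # name -> value, written to the zip afterwards

    def persistent_id(self, obj):
        if hasattr(obj, 'get_value') and getattr(obj, 'name', None):
            self.arrays[obj.name] = obj.get_value()
            return obj.name  # only this reference ends up in the stream
        return None  # everything else is pickled as usual

On the way back, an Unpickler subclass defines persistent_load to map each stored reference back to a value.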

Advantages over the old method:

  • No need to use both Checkpoint and Dump. This single method allows you to resume training seamlessly, while also allowing you to inspect and share parameter values across platforms; no unpickling needed.
  • Parameters are no longer silently overwritten if they have the same name and/or don't belong to a brick, as happens with Dump.
  • The entire serialization situation gets (a) less confusing and (b) documented (because the current docs are out of date, and still talk about using Dill).

if obj.name:
    if obj.name == 'pkl':
        ValueError("can't pickle shared variable with name `pkl`")
if hasattr(obj, 'tag'):
Member

Don't shared variables always have the tag attribute? It could be an empty dictionary though if it's not annotated.

Thinking about it, this class should probably inherit from PersistentNdarrayID. Right now, when super gets called, doesn't the parent class (PersistentSharedVariableID) simply override the name you've just created?

Contributor Author

Oh, funny: it was working because I made a wrong super call.

@dmitriy-serdyuk
Contributor Author

I suppose we won't need DumpManager anymore, but the load_parameters and save_parameters functions might be useful for some people.

I'd like to keep the LoadFromDump extension. To make it work with the new serialization, I'll pickle the log and iteration state separately.

Do you agree?

@bartvm
Member

bartvm commented May 5, 2015

We'd only need load_parameters, right? There's no real need for save_parameters, considering that it's exactly the same as dump, except perhaps that save_parameters could raise an error for non-unique names.

I would change LoadFromDump to something like Load or Resume then, because the reason we had "dump" was to make a distinction between "checkpointing" (pickling everything) and "dumping" (which saved the parameters, but neglected to save a lot of other things). Ideally we get rid of this distinction, so then there will just be Checkpoint, Resume, load_parameters, and potentially save_parameters as a very thin wrapper around dump (for symmetry's sake).
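If kept, that wrapper could be as thin as the following sketch (function names are taken from this discussion; the duplicate-name check is only suggested, not implemented here):

def save_parameters(obj, file_handler, **kwargs):
    # Thin wrapper around dump, kept for symmetry with load_parameters.
    # A stricter version would first check that all parameter names are
    # unique and raise an error otherwise.
    dump(obj, file_handler, **kwargs)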

@dmitriy-serdyuk
Contributor Author

@bartvm, the problem was completely different, and I've solved it now.

But now blocks-continue doesn't work for the reason we discussed: objects from the global namespace cannot be pickled. And actually it didn't work with the normal pickler either.

I propose to remove blocks-continue and force people to use the Load extension.

@dmitriy-serdyuk
Contributor Author

It is also possible to use a construction in a script like

if __name__ == '__main__':
    if continue_:  # `continue` itself is a reserved word in Python
        continue_training(path)
    ...

@bartvm
Member

bartvm commented May 7, 2015

So blocks-continue always had this limitation? Or did something change with this new approach?

The Load extension doesn't do the same thing though. The purpose of pickling is being able to completely reconstruct the main loop, including the state of extensions, etc. So it doesn't make sense to expect the user to first construct a MainLoop instance, because it would be entirely replaced by whatever the Load extension unpickles (and performing this replacement from within the extensions would be a mess).

If this limitation has always existed, I'd argue that we just need to document it clearly. I imagine that something like this should work fine, right?

# my_model.py
def some_helper_function():
    pass


# train_my_model.py
from my_model import some_helper_function
main_loop = MainLoop(...)  # With Checkpoint, depends on some_helper_function
main_loop.run()

Because when unpickling, it can load the function from my_model.

@dmitriy-serdyuk
Contributor Author

Yes, it always had this limitation.

Yes, this is going to work, but we cannot provide a general script that is able to continue any kind of experiment. So I think the best we can provide is a continue_training function and explain in the documentation that it should be used like

# define or import objects from the global namespace as you defined them before
if __name__ == '__main__':
    if continue_:
        continue_training(path)
    # Construct main loop with Checkpoint extension

@bartvm
Member

bartvm commented May 7, 2015

Right, we could just make it a function instead of a script, but we should make it clear that using it within a script isn't necessarily the only way to use it. Technically speaking, the user could have just created their model in the interpreter, without any sort of script anywhere. They should still be able to fire up a new interpreter and call continue_training(path).

Anyway, that sounds like a reasonable solution to me. The one thing that I think would be very nice (though I have no idea how difficult this is) is being able to give a warning whenever the main loop depends on functions defined in the __main__ namespace. It could say something like:

WARNING: Main loop depends on the function `_array_tuple` in the `__main__` namespace.

Because of limitations to pickling, this means that you will not be able to resume
your model outside of a namespace containing this function. In other words, you
can only call `continue_training` from within this script.

Looking at pickle.py, maybe it would work by overriding save_global to check whether the module is '__main__' or something?
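A minimal sketch of that idea for Python 3 (it has to target the pure-Python pickle._Pickler, since the C implementation cannot be customized this way; the warning text is illustrative):

import warnings
from pickle import _Pickler
from types import FunctionType

class PicklerWithWarning(_Pickler):
    dispatch = _Pickler.dispatch.copy()

    def save_global(self, obj, name=None):
        module = getattr(obj, '__module__', None)
        if module == '__main__':
            warnings.warn('Pickling depends on `{}` from `__main__`; it can '
                          'only be unpickled in a namespace that defines '
                          'it.'.format(getattr(obj, '__name__', obj)))
        _Pickler.save_global(self, obj, name=name)

    # The dispatch table stores plain functions, so the override must be
    # re-registered for functions to actually reach it.
    dispatch[FunctionType] = save_global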

@dmitriy-serdyuk
Contributor Author

@bartvm, is there any documentation on Python 3 pickle internals? This one doesn't seem complete.

@dmitriy-serdyuk
Contributor Author

How can I test that arrays were not pickled in the regular way? Is there a better way than checking the pkl file size?

@bartvm
Member

bartvm commented May 12, 2015

You could load the zip file using numpy.load and check whether the array is one of the keys there, right?

I don't know of any documentation on pickle's internals that is any better... It's pretty badly documented I'm afraid. You can look at custom picklers like joblib's to get a better idea of how to write custom ones. Is there anything in particular you're running into?
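For example, a test along those lines might look like this (the archive name 'model.zip' and the parameter name are hypothetical; this PR stores the pickled main loop under 'pkl'):

import zipfile

import numpy

with zipfile.ZipFile('model.zip') as archive:
    names = archive.namelist()
# The parameter should appear exactly once, next to the pickled main loop.
assert names.count('mlp/linear/W_0.npy') == 1
assert 'pkl' in names
# numpy.load reads the same archive, so the array is reachable without
# unpickling anything ('.npy' extensions are stripped from the keys).
assert 'mlp/linear/W_0' in numpy.load('model.zip').files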

@dmitriy-serdyuk
Contributor Author

Yes, but I'd like to check that it was not saved twice. If something goes wrong in the __call__ method, it may add a NumPy file to the zip without returning an id.

It seems there is no save_global function in Python 3, or its signature is different.

@bartvm
Member

bartvm commented May 12, 2015

I just opened up pickle.py from Python 3, and it seems the class is called _Pickler and the method looks like def save_global(self, obj, name=None).

if isinstance(obj, SharedVariable):
    if obj.name:
        if obj.name == 'pkl':
            ValueError("can't pickle shared variable with name `pkl`")
Member

Forgot the raise statement here.

@bartvm
Member

bartvm commented May 13, 2015

I think this is starting to look pretty good, and it'd be great if some other people could have a look, because it's a pretty big change (@rizar, @dwf).

It replaces the Dump and Checkpoint extensions with a custom pickler (an extension of the ones we recently merged into Theano as part of Theano/Theano#2802). The result is a single zip archive that contains the pickled main loop (as a file called pkl) and all the shared variables stored as NPY files. The names of these arrays are either the path leading to them (numbered if need be, e.g. mlp/linear/W_0, mlp/linear/W_1) or the name of the shared variable (also numbered if necessary).

The advantage is that it effectively does what Dump does, but without needing a Model instance, while still pickling the entire main loop (so the state of e.g. extensions is maintained). The resulting zip archive can be loaded using NumPy's numpy.load function if necessary in order to inspect parameters without unpickling.
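Inspecting parameters from such an archive might then look like this (the file name is hypothetical):

import numpy

archive = numpy.load('main_loop.zip')
print(archive.files)           # e.g. ['pkl', 'mlp/linear/W_0', ...]
W = archive['mlp/linear/W_0']  # a plain ndarray; no unpickling involved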

@dmitriy-serdyuk
Contributor Author

Maybe I need to rebase it because I made a lot of stupid commits changing things back and forth.

@bartvm
Member

bartvm commented May 13, 2015

Yeah, a rebase would be nice eventually. Also, I made a comment about a missing raise statement.

@bartvm
Member

bartvm commented May 13, 2015

Oh, and we'll need a bunch of tests I guess.

@bartvm
Member

bartvm commented May 13, 2015

The main test I'd like to add is to make sure that #480 is fixed for both brick parameters as well as normal shared variables.

@dmitriy-serdyuk
Contributor Author

For my log it decreases pickling time from 5.3 s to 1.4 s, while dumping without a persistent id takes 238 ms with cPickle and 946 ms with pickle.

@dmitriy-serdyuk
Contributor Author

How can I create a pull request to two repositories simultaneously? Tests fail for the blocks-examples repository.

@bartvm
Member

bartvm commented May 25, 2015

The way I've done it is by making a pull request to blocks-examples, and then changing this PR's .travis.yml to do a git clone of your branch instead. Then, if everything works and things are ready to merge, we change the git clone line back just before merging.

@bartvm
Member

bartvm commented May 25, 2015

Great, so two things left to do then:

  • Undo deleting blocks-continue; if people like @rizar use it we might as well keep it, knowing that it sometimes won't work.
  • Make dump optionally use cPickle by doing something like

# zipfile, closing and cPickle imports added for completeness; the other
# names (DEFAULT_PROTOCOL, PersistentParameterID, PicklerWithWarning,
# pkl_utils) come from the surrounding module.
import zipfile
from contextlib import closing

from six.moves import cPickle


def dump(obj, file_handler, protocol=DEFAULT_PROTOCOL,
         persistent_id=PersistentParameterID, use_cpickle=False):
    with closing(zipfile.ZipFile(file_handler, 'w', zipfile.ZIP_DEFLATED,
                                 allowZip64=True)) as zip_file:
        def func(f):
            if use_cpickle:
                p = cPickle.Pickler(f, protocol=protocol)
            else:
                p = PicklerWithWarning(f, protocol=protocol)
            p.persistent_id = persistent_id(zip_file)
            p.dump(obj)
        pkl_utils.zipadd(func, zip_file, 'pkl')

@rizar
Contributor

rizar commented May 26, 2015

On my log in Python 3, the slowdown from switching from pickle.dump and pickle.load to the functions from this PR is now roughly 3x. This is much less than before, good job, but still quite a lot of overhead. I will try it with Python 2 as well when you guys support cPickle there.

@bartvm
Member

bartvm commented May 29, 2015

Oops, my bad on the formatting errors, but at least tests are passing again. We need a bunch of tests now, because I realised that the previous PR was actually not storing a lot of things in the ZIP file (and the initial one was storing too many things...).

@rizar You might want to re-run your benchmarks, and test the support for cPickle.

@bartvm
Member

bartvm commented May 29, 2015

Needs to be rebased on top of #680 when that is merged.

@bartvm
Member

bartvm commented May 29, 2015

#680 was merged, so could you rebase, @dmitriy-serdyuk? Earlier the doctests were failing.

@@ -61,14 +61,14 @@ def main(save_to, num_batches, continue_=False):
            step_rule=Scale(learning_rate=0.001)),
        get_data_stream(range(100)),
        model=Model(cost),
        extensions=([LoadFromDump(save_to)] if continue_ else []) +
        extensions=([Load(save_to)] if continue_ else []) +
Member

Oops, missed this line, needs to be deleted

name = '{}.{}'.format(
    BRICK_DELIMITER.join([brick.name for brick in
                          get_brick(obj).get_unique_path()]),
    [obj.name]
Member

@dmitriy-serdyuk The current error is my bad, this should be obj.name instead of [obj.name]

@bartvm
Member

bartvm commented May 29, 2015

So there's only one thing left to do that I can think of, which is changing Load to use numpy.load instead of unpickling when loading the parameters, and making the loading of the iteration state and log optional. This will allow someone to load the parameters from a model, even if unpickling fails completely.
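The parameter-loading half of that could be as simple as this sketch (the function name comes from the discussion above; 'pkl' is the main loop's entry name in this PR):

import numpy

def load_parameters(path):
    # Read the parameter arrays straight from the zip archive, skipping
    # the pickled main loop stored under 'pkl'. This works even when
    # unpickling the main loop itself would fail.
    archive = numpy.load(path)
    return {name: archive[name] for name in archive.files if name != 'pkl'}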
