Improve mxnet support for activity classifier save/load #129

gustavla · 2017-12-20T18:19:35Z

This PR updates the activity classifier (AC) save/load to work with more versions of mxnet. There are still some issues with export to Core ML for newer versions of mxnet, so this PR alone does not expand support yet (#17).

The issue with the AC is that it saves and loads the network graph using MXNet JSON files. MXNet is pretty good at backward compatibility, but not at forward compatibility. If we support more than one version of mxnet at a time, this creates a problem where we can't even guarantee compatibility between users on the same version of Turi Create:

User 1 (turicreate==4.0.0 and mxnet==0.12.1) saves a model.
User 2 (turicreate==4.0.0 and mxnet==0.11.0) loads the model.

This won't work due to forward incompatibility in mxnet. The solution is to avoid saving and loading the graph, and instead to build it up in code and copy over the weights. This has much better forward compatibility.
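
A minimal sketch of that idea, using plain Python containers rather than the actual mxnet API. The saved state holds only hyperparameters and named weights, and the loader rebuilds the architecture in code before copying the weights in. The names `build_network`, `save_state`, and `load_state`, as well as the layer names, are hypothetical and are not Turi Create functions.

```python
import json


def build_network(num_features, num_hidden, num_classes):
    """Rebuild the architecture from hyperparameters.

    Stand-in for constructing the mxnet graph in code; returns
    zero-initialized parameters keyed by (hypothetical) layer names.
    """
    return {
        "dense0_weight": [[0.0] * num_features for _ in range(num_hidden)],
        "dense0_bias": [0.0] * num_hidden,
        "dense1_weight": [[0.0] * num_hidden for _ in range(num_classes)],
        "dense1_bias": [0.0] * num_classes,
    }


def save_state(params, hyperparams):
    """Serialize weights and hyperparameters only; no graph description."""
    return json.dumps({"hyperparams": hyperparams, "params": params})


def load_state(blob):
    """Rebuild the network in code, then copy the saved weights over by name."""
    state = json.loads(blob)
    net = build_network(**state["hyperparams"])
    for name, value in state["params"].items():
        if name not in net:
            raise KeyError("Unexpected parameter in saved model: " + name)
        net[name] = value
    return net


# Round trip: the loader never parses a serialized graph.
hyper = {"num_features": 3, "num_hidden": 2, "num_classes": 2}
trained = build_network(**hyper)
trained["dense0_bias"] = [0.5, -0.5]
restored = load_state(save_state(trained, hyper))
```

Because the file contains no graph description, whichever mxnet version is installed at load time would be the one that constructs the graph, which is why this approach sidesteps the JSON format differences.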

Support (old and new)

Let's look at current support and support after this PR. I'll show my testing as matrices where:

rows: model saved in
columns: model loaded in

Image Classifier (IC)/Image Similarity (IS)/Object Detector (OD): all work on all combinations of save/load, although currently you get warnings if you do not use MXNet 0.11.0. This PR will eliminate those warnings. This broad support is thanks to the same approach that I'm applying to the AC in this PR.

TC/MXNet	4/0.11.0	4/0.12.1	4/1post1
v4/0.11.0	✅	✅	✅
v4/0.12.1	✅	✅	✅
v4/1post1	✅	✅	✅

Activity Classifier (AC): (since this PR changes the AC saver/loader, I tested a bunch of combinations)

TC/MXNet	4/0.11.0	4/0.12.1	4/1post1	PR/0.11.0	PR/0.12.1	PR/1post1
v4/0.11.0	✅	✅	✅	✅	✅	✅
v4/0.12.1	🚫	✅	✅	✅	✅	✅
v4/1post1	🚫	🚫	✅	✅	✅	✅
PR/0.11.0	⚠️	⚠️	⚠️	✅	✅	✅
PR/0.12.1	⚠️	⚠️	⚠️	✅	✅	✅
PR/1post1	⚠️	⚠️	⚠️	✅	✅	✅

✅ Works
🚫 Does not work (ugly error)
⚠️ Does not work (fails gracefully)

"PR" refers to the Turi Create model as defined by this PR's commit. "1post1" is short for 1.0.0.post1 (1.0.0 segfaults the object detector, and this seems to have been resolved in the post1 version).

Top-left: This is the status quo
Right half: Backwards and same-version compatible
Bottom-left: Forward incompatible (with respect to TC version). See note at the bottom.
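
The three regions of the AC matrix can be summarized as a small rule. This is only a sketch that restates the test results above; the function name `expected_ac_outcome` and the string labels are made up for illustration.

```python
# mxnet versions from the matrix, ordered oldest to newest.
MXNET_ORDER = ["0.11.0", "0.12.1", "1post1"]


def expected_ac_outcome(saved, loaded):
    """Predict the AC matrix cell for (tc, mxnet) pairs; tc is "v4" or "PR"."""
    saved_tc, saved_mx = saved
    loaded_tc, loaded_mx = loaded
    if loaded_tc == "PR":
        # Right half: the PR loader rebuilds the graph, so everything loads.
        return "works"
    if saved_tc == "PR":
        # Bottom-left: v4 rejects the newer model format with a version error.
        return "fails gracefully"
    # Top-left: v4 stores graph JSON, so the loading mxnet must be
    # at least as new as the saving mxnet.
    if MXNET_ORDER.index(loaded_mx) >= MXNET_ORDER.index(saved_mx):
        return "works"
    return "ugly error"
```

For example, `expected_ac_outcome(("v4", "0.12.1"), ("v4", "0.11.0"))` corresponds to the User 1/User 2 scenario described earlier and gives `"ugly error"`.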

The "graceful" failure in 4.0 actually says "Corrupted model. Cannot load a model with this version." for OD/AC, and IC/IS do not even check the version! In an isolated commit, this PR also improves this by making the message friendlier and telling the user to upgrade Turi Create. Unfortunately, whenever we upgrade the file format for IC/IS, loading will fail very ungracefully on 4.0.
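
A minimal sketch of what such a check might look like. The names `check_model_version`, `ModelVersionError`, and `SUPPORTED_MODEL_VERSION`, the `"version"` key default, and the exact message wording are all assumptions for illustration and are not taken from the PR's code.

```python
SUPPORTED_MODEL_VERSION = 2  # hypothetical: newest format this install can read


class ModelVersionError(RuntimeError):
    """Raised when a saved model is newer than this installation supports."""


def check_model_version(state, supported=SUPPORTED_MODEL_VERSION):
    """Return the saved model version, or fail with an actionable message."""
    saved = state.get("version", 1)  # assumption: missing key means version 1
    if saved > supported:
        raise ModelVersionError(
            "This model was saved with a newer version of Turi Create "
            "(model version {}, but this installation supports up to {}). "
            "Please upgrade Turi Create to load it.".format(saved, supported)
        )
    return saved
```

The point of the design is that the error names the remedy (upgrading) instead of suggesting the file is corrupted.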

Why not be forward compatible?

We could make newer models load in 4.0 as well. However, that is a commitment to keep writing graphs in the old mxnet format for all future mxnet versions. For instance, in a hypothetical mxnet 5.0, we would still need to write JSON graphs that look like 0.11.0. It's better to break this compatibility now, since we would probably break it eventually. At least going forward, we have much better chances of being forward compatible (in mxnet version) for the AC, just like it turned out we are for IC/IS/OD.

igiloh · 2017-12-21T13:09:01Z

src/unity/python/turicreate/toolkits/activity_classifier/_activity_classifier.py

        context = _mxnet_utils.get_mxnet_context(max_devices=state['num_sessions'])
-        state['_loss_model'] = _mxnet_utils.load_mxnet_model_from_state(
-            state['_loss_model'], data, labels, None, context)


Doesn't this mean we're no longer backward compatible?
If someone saved a model using version 1, the weights are saved only in the loss model, so when lines 301-303 later load params from state['_pred_model'], they would all be zeros, wouldn't they?

I can understand not being forward compatible (model saved in new version should not load in old version). But backwards compatibility is important.
We could check whether version==1 or '_loss_model' in state, and if so extract the params from the loss model; otherwise extract them from the pred model.
Right?

gustavla · 2017-12-21

Thanks for the review! In the current v4, there is no weight sharing when it gets saved to file. All weights are saved twice. Looking at the actual saved files, a model saved with v4 takes 4 MB while a model saved with v4+ takes 2 MB. Therefore, there is no problem for v4+ to simply ignore half of those weights and load the model entirely from the pred_model.
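
As a rough illustration of why ignoring the loss model is safe, here is a sketch with a made-up state layout (the nested `"params"` dictionaries and the `"version"` values are assumptions; only the `_loss_model` and `_pred_model` keys come from the code under review). A loader that reads only from `_pred_model` gets the same weights from a v4 file and a v4+ file.

```python
def extract_params(state):
    """Return the weights to load, reading only from the prediction model."""
    return dict(state["_pred_model"]["params"])


weights = {"dense0_weight": [[0.1, 0.2]], "dense0_bias": [0.3]}

# v4 layout: every weight is stored twice, once per model.
v4_state = {
    "version": 1,
    "_loss_model": {"params": dict(weights)},
    "_pred_model": {"params": dict(weights)},
}

# v4+ layout: only the prediction model is stored, so the file is half the size.
v4_plus_state = {
    "version": 2,
    "_pred_model": {"params": dict(weights)},
}

loaded_old = extract_params(v4_state)
loaded_new = extract_params(v4_plus_state)
```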

Also, regarding backward compatibility: every cell in the 6x6 matrix I showed in the original post is the result of an actual test and not just my hopes (I wanted to be very thorough!), so I have tested and verified full backward compatibility.

igiloh · 2017-12-21T13:14:10Z

Please see my comment about backward compatibility. Otherwise LGTM.
I would still wait for @alonpal's review as well, as he's more familiar with the MXNet architecture in the AC.

gustavla · 2017-12-21T19:13:28Z

Sounds good, I will wait for @alonpal's review. Thanks!

igiloh

Got a message from @alonpal. He's taking a flight so he can't get online, but he reviewed the changes and approves them.

Support newer mxnet (update in activity classifier)

3fb496c

gustavla force-pushed the dev/ac-mxnet-save-load branch from aa1274b to 5e09c7d on December 20, 2017 18:21

Make model forward incompatibility more graceful

ae3e948

gustavla force-pushed the dev/ac-mxnet-save-load branch from 5e09c7d to ae3e948 on December 20, 2017 18:22

igiloh reviewed Dec 21, 2017

gustavla requested a review from alonpal December 21, 2017 19:13

igiloh approved these changes Dec 21, 2017

gustavla merged commit 8633186 into apple:master Dec 22, 2017

This was referenced Jan 5, 2018

CUDA 9 and cuDNN 7 support #17

Closed

Illegal instruction on training image similarity model from example #20

Closed
