
[MXNET-651] MXNet Model Backwards Compatibility Checker #11626

Merged
merged 61 commits into apache:master from piyushghai:mbcc on Jul 31, 2018

Conversation

@piyushghai (Contributor) commented Jul 10, 2018

Description

This PR aims to check whether models trained on earlier versions of MXNet load correctly on the latest version or the latest release candidate. It also checks that the inference results on these trained models remain consistent.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
    The PR's purpose itself is to add more tests for backwards compatibility.
  • Code is well-documented:
  • A README.md has been added to explain what the Model Backwards Compatibility check is and how it can be run.
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

TODO

  • Check the tasks using an actual Jenkins Slave. @marcoabreu Can you help me out with that?
  • Backfill the S3 bucket with models trained on earlier versions of MXNet.
@piyushghai (Contributor, Author) commented Jul 10, 2018

Pinging @marcoabreu @ThomasDelteil @piiswrong for review.
@marcoabreu Please let me know if I did the CI integration part correctly on this PR or if I need to make more changes to it.

@lupesko (Contributor) commented Jul 10, 2018

Thanks for this important contribution!

A note about the description: it is lacking and should be updated as needed, rather than left with irrelevant content from the template. A few examples:

  • the issue description says "The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)"
    This issue is not tiny, and yet it has no JIRA. Please follow the guidelines and file a JIRA and update the issue name.
  • "All changes have test coverage" is not checked. I think it should be checked, or else you should add the relevant coverage.
  • I recommend you go over the issue description and update it as needed. There is no need to keep items that are not relevant to this PR; on the other hand, add more detail to the bullet points that are relevant.

@piyushghai piyushghai changed the title MXNet Model Backwards Compatibility Checker [MXNET-651] MXNet Model Backwards Compatibility Checker Jul 10, 2018

@szha (Member) left a comment

Could you clarify what this is for?

@piyushghai (Contributor, Author) commented Jul 11, 2018

@szha Here's what this PR does:
"Backward compatibility aims to check whether models trained on earlier versions of MXNet load correctly on the latest version or the latest release candidate. It also aims to do a sanity check for consistency in the inference on these trained models."

Here's the relevant JIRA issue for it: https://issues.apache.org/jira/browse/MXNET-651

@szha (Member) commented Jul 11, 2018

What kind of "loading" are you testing? The statement, as is, is underspecified.

@piyushghai (Contributor, Author) commented Jul 11, 2018

I am checking the following scenarios for saving/loading models:

  1. Hybridized models import/export API from Gluon Package
  2. Gluon Models load_params/save_params API from Gluon Package
  3. Gluon Models load_parameters/save_parameters API from Gluon Package
  4. Declarative Models load_checkpoint() from Model API

The models are trained and saved on older MXNet versions, while the loading and inference are performed on the newer versions.
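For reference, here is a minimal sketch of the four save/load paths being exercised. This is not the PR's actual test code; names such as 'model_prefix' are illustrative.

import mxnet as mx
from mxnet import gluon, nd

net = gluon.nn.HybridSequential()
net.add(gluon.nn.Dense(2))
net.initialize()
net.hybridize()
net(nd.ones((1, 4)))  # one forward pass so the hybridized graph can be exported

# 1. Hybridized model export/import (Gluon)
net.export('model_prefix', epoch=0)
loaded = gluon.SymbolBlock.imports('model_prefix-symbol.json', ['data'],
                                   'model_prefix-0000.params')

# 2. save_params/load_params (deprecated in favour of save_parameters/load_parameters)
net.save_params('model.params')
net.load_params('model.params')

# 3. save_parameters/load_parameters (Gluon)
net.save_parameters('model.params')
net.load_parameters('model.params')

# 4. Checkpoints for declarative/symbolic models (Model/Module API)
sym, arg_params, aux_params = mx.model.load_checkpoint('model_prefix', 0)

In the checker, the save half runs on an older MXNet release and the load/inference half runs on the current build.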

@szha (Member) commented Jul 11, 2018

Any reason not to do it in unittest, using a smaller model?

@szha (Member) commented Jul 11, 2018

2 is deprecated so there's no need to test.

@szha (Member) commented Jul 11, 2018

How do you test load_parameters and save_parameters? What additional case would the new test case for gluon's save_parameters/load_parameters cover that's not already covered by https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_gluon.py#L1093 ?

@piyushghai (Contributor, Author) commented Jul 11, 2018

We cannot cover this in a unit test, as we are not exercising the various model save/load APIs on the same version. We are trying to check whether models trained (and saved) on older MXNet versions still load correctly and give the same inference results on the latest release candidate of MXNet, so that any new changes that might have crept in remain backwards compatible.
@marcoabreu can give more context on this change as well.


def download_data_from_s3(self):
    print('Downloading files from bucket : ptb-small-dataset')
    bucket = s3.Bucket('ptb-small-dataset')

@ThomasDelteil (Contributor) commented Jul 11, 2018

Where are we downloading this from? Since we can't host the PTB dataset ourselves because of the license, I wonder if it is a good idea to depend on it.

@piyushghai (Author, Contributor) commented Jul 11, 2018

Oh. I wasn't aware of the Penn Treebank dataset license issue. I had uploaded a trimmed-down version of PTB to an S3 bucket. The trimmed-down version was 10k training sentences and 3k each for test and validation.
Can you suggest a different dataset that we can host?

@szha (Member) commented Jul 11, 2018

See the dataset introduced in #11435

@piyushghai (Author, Contributor) commented Jul 11, 2018

Thanks. Replaced PTB with Sherlock in: c3c9129

'''This function returns the top level folders in the S3Bucket. These folders help us to navigate to the trained model files stored for different MXNet versions. '''
bucket = s3client.Bucket(bucket_name)
result = bucket.meta.client.list_objects(Bucket=bucket.name,
                                         Delimiter=backslash)

@ThomasDelteil (Contributor) commented Jul 11, 2018

fix indent

@piyushghai (Author, Contributor) commented Jul 11, 2018

Done. Fixed in: 3485352
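For context, a hedged sketch of how the top-level prefixes come back from the list_objects call shown above, assuming boto3 and a bucket laid out with one top-level folder per MXNet version; 'mbcc-bucket' is an illustrative name:

import boto3

s3client = boto3.resource('s3')
bucket = s3client.Bucket('mbcc-bucket')
result = bucket.meta.client.list_objects(Bucket=bucket.name, Delimiter='/')

# With a Delimiter, S3 groups keys into CommonPrefixes, i.e. the per-version folders.
version_folders = [prefix['Prefix'] for prefix in result.get('CommonPrefixes', [])]
print(version_folders)  # e.g. ['1.1.0/', '1.2.0/']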

echo '=========================='
echo "Running training files and preparing models"
echo '=========================='
python mnist_mlp_module_api_train.py

@ThomasDelteil (Contributor) commented Jul 11, 2018

what are the consequences of having one of these scripts crash with respect to the integrity of the generated data?

@piyushghai (Author, Contributor) commented Jul 11, 2018

I did not quite understand what you meant by integrity here.
Are you suggesting that I do a checksum validation on the generated models and other relevant files while downloading them in the inference scripts?

@ThomasDelteil (Contributor) commented Jul 11, 2018

If one of the training scripts randomly crashes, let's say due to dataset unavailability, what are the consequences of the data stored in S3 being incomplete for that given run?

@piyushghai (Author, Contributor) commented Jul 11, 2018

That's gracefully handled in this commit: 7c41488

As of now, I'm exiting the inference script with a message indicating no models found in S3.

@piyushghai (Author, Contributor) commented Jul 11, 2018

Also, the training script would not be a part of nightly runs. It will only be run when a new version of MXNet is released, in order to refresh the trained-models inventory on S3.
This run can either be manual or automated by editing the train_mxnet_legacy_models.sh script.
So any random dataset-unavailability issues (ideally) should not block the nightly inference runs.
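A hedged sketch, illustrative only and not the exact 7c41488 change, of the kind of defensive exit described above:

import logging
import sys

def ensure_models_present(model_files):
    # model_files is the list of model artifacts fetched from the S3 bucket
    if not model_files:
        logging.error('No trained models found in the S3 bucket, skipping inference run')
        sys.exit(0)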

@piyushghai (Contributor, Author) commented Jul 27, 2018

@marcoabreu The Jenkins CI build [1] is giving an error on the import statement for MXNet under the Inference stage of the JenkinsFileForMBCC. I'm not able to figure out why that's happening.
Can you have a look at 69843fb and give a second opinion?

[1] : http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/test-backwards-compatibility-checker/detail/test-backwards-compatibility-checker/12/pipeline/4/

Edit --> fae44fe seems to have fixed the issue. [2]
@marcoabreu We now need to add the IAM User policy for S3 bucket access.

[2] : http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/test-backwards-compatibility-checker/detail/test-backwards-compatibility-checker/13/pipeline/4

@marcoabreu marcoabreu force-pushed the piyushghai:mbcc branch from 05bb40d to c099979 Jul 30, 2018

@marcoabreu (Contributor) commented Jul 30, 2018

Hello,
the following error is shown at http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-test-backwards-compatibility-checker/detail/restricted-test-backwards-compatibility-checker/4/pipeline/4:

Installing collected packages: graphviz, numpy, mxnet
  Found existing installation: graphviz 0.8.4
    Uninstalling graphviz-0.8.4:
      Successfully uninstalled graphviz-0.8.4
  Found existing installation: mxnet 1.2.0
    Uninstalling mxnet-1.2.0:
      Successfully uninstalled mxnet-1.2.0
Successfully installed graphviz-0.8.1 mxnet-1.1.0 numpy-1.13.3
==========================
Running training files and preparing models
==========================
Traceback (most recent call last):
  File "model_backwards_compat_train.py", line 20, in <module>
    from common import *
  File "/work/mxnet/tests/nightly/model_backwards_compatibility_check/common.py", line 22, in <module>
    import mxnet as mx
  File "/work/mxnet/python/mxnet/__init__.py", line 24, in <module>
    from .context import Context, current_context, cpu, gpu, cpu_pinned
  File "/work/mxnet/python/mxnet/context.py", line 24, in <module>
    from .base import classproperty, with_metaclass, _MXClassPropertyMetaClass
  File "/work/mxnet/python/mxnet/base.py", line 197, in <module>
    _LIB = _load_lib()
  File "/work/mxnet/python/mxnet/base.py", line 187, in _load_lib
    lib_path = libinfo.find_lib_path()
  File "/work/mxnet/python/mxnet/libinfo.py", line 74, in find_lib_path
    'List of candidates:\n' + str('\n'.join(dll_path)))
RuntimeError: Cannot find the MXNet library.
List of candidates:
libmxnet.so
/usr/local/lib/libmxnet.so
/work/mxnet/python/mxnet/libmxnet.so
/work/mxnet/python/mxnet/../../lib/libmxnet.so
/work/mxnet/python/mxnet/../../build/libmxnet.so
../../../libmxnet.so
==========================
Successfully trained files for the previous two MXNet release versions

It's strange that an error occurs but the run still continues. Could you please have a look?

@@ -42,7 +42,7 @@
 # get the current mxnet version we are running on
 mxnet_version = mx.__version__
-model_bucket_name = 'mxnet-model-backwards-compatibility-models'
+model_bucket_name = 'mxnet-ci-prod-backwards-compatibility-models'

@piyushghai (Author, Contributor) commented Jul 30, 2018

This bucket name change also has to be made in upload_models_to_s3.sh.

Done that in 33096c0

@@ -57,10 +57,10 @@ do
 fi

 ## If MXNet major version starts with a number >=1. with a wildcard match for the minor version numbers
-if [[ $version = [1-9]* ]]
+if [[ $version = [1-9].[0-9].[0-9] ]]

@marcoabreu (Contributor) commented Jul 30, 2018

I don't think this will handle multi digit versions.

@piyushghai (Author, Contributor) commented Jul 30, 2018

Good point. Done 16d320a.

# under the License.

#Author: Piyush Ghai

@marcoabreu (Contributor) commented Jul 30, 2018

Missing set -ex

@piyushghai (Author, Contributor) commented Jul 30, 2018

Done ceac705.

# under the License.

#Author: Piyush Ghai

@marcoabreu (Contributor) commented Jul 30, 2018

Missing set -ex

@piyushghai (Author, Contributor) commented Jul 30, 2018

Done ceac705.

## If MXNet major version starts with a number >=1. with a wildcard match for the minor version numbers
## Could have used a [[:digit:]]+. as well but it was not working as a traditional regex in bash.
## so had to resort to using [[:digit:]] [[:digit:]]* to indicate multi-digit version regex match
if [[ $version = [[:digit:][[:digit:]]*.[[:digit:]].[[:digit:]] ]]

@marcoabreu (Contributor) commented Jul 30, 2018

Would this also cover 1.12.0? It seems like you only added multi digit for the major version.

Also, could you explain how this exactly determines whether it is a major version match?

@piyushghai (Author, Contributor) commented Jul 30, 2018

Done. d8fa75d.
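For illustration, a hedged bash sketch, not necessarily the exact d8fa75d change, of a version check that covers multi-digit major and minor components such as 1.12.0:

# Matches release versions like 1.2.0, 1.12.0 or 10.0.1; anything else is skipped.
if [[ $version =~ ^[0-9]+\.[0-9]+\.[0-9]+$ ]]
then
    echo "Training models for MXNet version: $version"
fi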


## We read the current MXNet version from the libinfo.py file and extract the major version from it.
curr_mxnet_version=$(grep -w "__version__" python/mxnet/libinfo.py | grep -o '".*"' | sed 's/"//g')
curr_major_version=$(get_major_version $curr_mxnet_version)

@marcoabreu (Contributor) commented Jul 30, 2018

Please validate the output to ensure we get a notification/error if somebody breaks this by accident.

@piyushghai (Author, Contributor) commented Jul 30, 2018

Done. ca01aa2.
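A hedged sketch, not necessarily the exact ca01aa2 change, of validating that output so the job fails loudly if the libinfo.py format ever changes:

curr_mxnet_version=$(grep -w "__version__" python/mxnet/libinfo.py | grep -o '".*"' | sed 's/"//g')

# Fail fast if the version string could not be extracted or looks malformed.
if [[ ! $curr_mxnet_version =~ ^[0-9]+\.[0-9]+\.[0-9]+ ]]
then
    echo "ERROR: could not read a valid MXNet version from python/mxnet/libinfo.py"
    exit 1
fi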

@marcoabreu (Contributor) commented Jul 31, 2018

It seems that this pipeline has revealed some slight changes between 1.0.0 and our master:

INFO:root:Performing inference for model/API lenet_gluon_save_params_api
INFO:root:Fetching files for MXNet version : 1.0.0 and model lenet_gluon_save_params_api
/work/mxnet/python/mxnet/gluon/block.py:420: UserWarning: load_params is deprecated. Please use load_parameters.
  warnings.warn("load_params is deprecated. Please use load_parameters.")
Traceback (most recent call last):
  File "model_backwards_compat_inference.py", line 135, in <module>
    test_lenet_gluon_load_params_api()
  File "model_backwards_compat_inference.py", line 72, in test_lenet_gluon_load_params_api
    assert_almost_equal(old_inference_results.asnumpy(), output.asnumpy())
  File "/work/mxnet/python/mxnet/test_utils.py", line 493, in assert_almost_equal
    raise AssertionError(msg)
AssertionError: 
Items are not equal:
Error 1.969787 exceeds tolerance rtol=0.000010, atol=0.000000.  Location of maximum error:(5, 0), a=0.003014, b=0.003014
 a: array([[ 0.01743407, -0.30903903],
       [ 0.08352755, -0.365019  ],
       [ 0.06324662, -0.4323489 ],...
 b: array([[ 0.01743408, -0.3090391 ],
       [ 0.08352771, -0.365019  ],
       [ 0.06324662, -0.4323488 ],...

I will still merge this PR because fixing regressions is outside the scope of this PR.

@marcoabreu marcoabreu merged commit a56a569 into apache:master Jul 31, 2018

1 check passed

continuous-integration/jenkins/pr-merge This commit looks good

aaronmarkham added a commit to aaronmarkham/incubator-mxnet that referenced this pull request Aug 7, 2018

[MXNET-711] Website build and version dropdown update (apache#11892)
* adding param for list of tags to display on website

* using new website display argument for artifact placement in version folder

* adding display logic

* remove restricted setting for testing

* update usage instructions

* reverted Jenkinsfile to use restricted nodes

[MXAPPS-581] Fixes for broken Straight Dope tests. (apache#11923)

* Update relative paths pointing to the data directory to point to the
  correct place in the testing temporary folder.

* Enable the notebooks that were previously broken because of relative
  file paths not pointing to the correct place.

* Move some notebooks we do not plan to test to the whitelist. These
  notebooks are not published in the Straight Dope book.

* Clean-up: Convert print statements to info/warn/error logging
  statements. Add some logging statements for better status.

Disable flaky test: test_spatial_transformer_with_type (apache#11930)

apache#11839

Add linux and macos MKLDNN Building Instruction (apache#11049)

* add linux and macos doc

* update doc

* Update MKL_README.md

* Update MKL_README.md

Add convolution code to verify mkldnn backend

* add homebrew link

* rename to MKLDNN_README

* add mkl verify

* trigger

* trigger

* set mac complier to gcc47

* add VS2017 support experimentally

* improve quality

* improve quality

* modify mac build instruction since prepare_mkldnn.sh has been rm

* trigger

* add some improvement

[MXNET-531] Add download util (apache#11866)

* add changes to example

* place the file to the util

* add retry scheme

* fix the retry logic

* change the DownloadUtil to Util

* Trigger the CI

[MXNET-11241] Avoid use of troublesome cudnnFind() results when grad_req='add' (apache#11338)

* Add tests that fail due to issue 11241

* Fix apache#11241 Conv1D throws CUDNN_STATUS_EXECUTION_FAILED

* Force algo 1 when grad_req==add with large c.  Expand tests.

* Shorten test runtimes.

Improving documentation and error messages for Async distributed training with Gluon (apache#11910)

* Add description about update on kvstore

* add async check for gluon

* only raise error if user set update_on_kvstore

* fix condition

* add async nightly test

* fix case when no kvstore

* add example for trainer creation in doc

[MXNET-641] fix R windows install docs (apache#11805)

* fix R windows install docs

* addressed PR comments

* PR comments

* PR comments

* fixed line wrappings

* fixed line wrappings

a hot fix for mkldnn link (apache#11939)

re-enabling randomized test_l2_normalization (apache#11900)

[MXNET-651] MXNet Model Backwards Compatibility Checker (apache#11626)

* Added MNIST-MLP-Module-API models to check model save and load_checkpoint methods

* Added LENET with Conv2D operator training file

* Added LENET with Conv2d operator inference file

* Added LanguageModelling with RNN training file

* Added LanguageModelling with RNN inference file

* Added hybridized LENET Gluon Model training file

* Added hybridized LENET gluon model inference file

* Added license headers

* Refactored the model and inference files and extracted out duplicate code in a common file

* Added runtime function for executing the MBCC files

* Added JenkinsFile for MBCC to be run as a nightly job

* Added boto3 install for s3 uploads

* Added README for MBCC

* Added license header

* Added more common functions from lm_rnn_gluon_train and inference files into common.py to clean up code

* Added scripts for training models on older versions of MXNet

* Added check for preventing inference script from crashing in case no trained models are found

* Fixed indentation issue

* Replaced Penn Tree Bank Dataset with Sherlock Holmes Dataset

* Fixed indentation issue

* Removed training in models and added smaller models. Now we are simply checking a forward pass in the model with dummy data.

* Updated README

* Fixed indentation error

* Fixed indentation error

* Removed code duplication in the training file

* Added comments for runtime_functions script for training files

* Merged S3 Buckets for storing data and models into one

* Automated the process to fetch MXNet versions from git tags

* Added defensive checks for the case where the data might not be found

* Fixed issue where we were performing inference on state model files

* Replaced print statements with logging ones

* Removed boto install statements and move them into ubuntu_python docker

* Separated training and uploading of models into separate files so that training runs in Docker and upload runs outside Docker

* Fixed pylint warnings

* Updated comments and README

* Removed the venv for training process

* Fixed indentation in the MBCC Jenkins file and also separated out training and inference into two separate stages

* Fixed indentation

* Fixed erroneous single quote

* Added --user flag to check for Jenkins error

* Removed unused methods

* Added force flag in the pip command to install mxnet

* Removed the force-re-install flag

* Changed exit 1 to exit 0

* Added quotes around the shell command

* added packlibs and unpack libs for MXNet builds

* Changed PythonPath from relative to absolute

* Created dedicated bucket with correct permission

* Fix for python path in training

* Changed bucket name to CI bucket

* Added set -ex to the upload shell script

* Now raising an exception if no models are found in the S3 bucket

* Added regex to train models script

* Added check for performing inference only on models trained on same major versions

* Added set -ex flags to shell scripts

* Added multi-version regex checks in training

* Fixed typo in regex

* Now we will train models for all the minor versions for a given major version by traversing the tags

* Added check for validating current_version

[MXNET-531] NeuralStyle Example for Scala (apache#11621)

* add initial neuralstyle and test coverage

* Add two more test and README

* kill comments

* patch on memory leaks fix

* fix formatting issues

* remove redundant files

* disable the Gan example for now

* add ignore method

* add new download scheme to match the changes

XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018

[MXNET-651] MXNet Model Backwards Compatibility Checker (apache#11626)