This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

updating ubuntu_cpu base image to 20.04 to observe failing tests regarding Python 3.8 #18445

Closed
ma-hei wants to merge 1 commit from the fixing_python3.8_issues branch

Conversation

@ma-hei commented May 30, 2020

Description

This PR is used to observe the failing tests when updating the ubuntu_cpu base image to 20.04. Ubuntu 20.04 ships with Python 3.8, so, as described in #18380, we should observe various test failures.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@mxnet-bot

Hey @ma-hei , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [edge, unix-cpu, sanity, clang, miscellaneous, centos-gpu, website, windows-gpu, centos-cpu, unix-gpu, windows-cpu]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@leezu (Contributor) commented May 30, 2020

When updating the base image, the Dockerfile would require some changes. You can test locally that the containers still build via:

cd mxnet/ci
DOCKER_CACHE_REGISTRY=mxnetci docker-compose -f docker/docker-compose.yml pull
DOCKER_CACHE_REGISTRY=mxnetci docker-compose -f docker/docker-compose.yml build

@ma-hei (Author) commented Jun 1, 2020

@leezu When using Ubuntu 20.04 as the base image of Dockerfile.build.ubuntu, I found some issues with the apt-get installation of packages such as clang-10 and doxygen. To solve the problem at hand, I went back to Ubuntu 18.04, but instead of installing python3 (in Dockerfile.build.ubuntu), I'm installing python3.8. The image builds successfully locally. I have two questions:

  • Will the CI tests of this pull request start automatically after some time?
  • After fixing the issues with python 3.8, do we actually want to upgrade to Ubuntu 20.04 in the base image or should this be a separate pull request?

@leezu (Contributor) commented Jun 1, 2020

Thanks @ma-hei. I wasn't aware that Python 3.8 is also available on 18.04 via the bionic-updates repository. In that case, it's best to first solve the Python 3.8 bugs without updating the base image.

We can update to 20.04 in a separate PR. Updating to 20.04 will allow us to simplify parts of the Dockerfile that currently install dependencies from third-party sources and replace them with installation from the official 20.04 repositories. This will also improve the stability of the CI (as the third-party sources sometimes become unavailable). To update the GPU containers, we may want to wait for https://gitlab.com/nvidia/container-images/cuda/-/issues/67

@mxnet-bot

Undefined action detected.
Permissible actions are : run ci [all], run ci [job1, job2]
Example : @mxnet-bot run ci [all]
Example : @mxnet-bot run ci [centos-cpu, clang]

@ma-hei (Author) commented Jun 2, 2020

@leezu I was hoping that I could observe the test failures related to Python 3.8 in one of the ci/jenkins/mxnet-validation build jobs. I assume those jobs did not run because the ci/jenkins/mxnet-validation/sanity build failed. Does the failure of the sanity build look to you like it is related to the Python 3.8 update I made in Dockerfile.build.ubuntu? To me it looks like the build stalled at the end and was automatically killed.

@leezu (Contributor) commented Jun 2, 2020

There were some issues with the CI this morning. So let's just retry

@mxnet-bot run ci [all]

@leezu (Contributor) commented Jun 2, 2020

@mxnet-bot run ci [all]

@leezu (Contributor) commented Jun 2, 2020

@ChaiBapchya the bot doesn't work

@ma-hei (Author) commented Jun 5, 2020

In the build job ci/jenkins/mxnet-validation/unix-cpu the following command was previously failing:
ci/build.py --docker-registry mxnetci --platform ubuntu_cpu --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh sanity_check
I was able to reproduce the issue locally and I fixed it by making additional changes to ci/docker/Dockerfile.build.ubuntu and ci/docker/install/requirements. What I did in those files is the following:

  • making python3.8 the default python3 binary by creating a symlink (see the change in ci/docker/Dockerfile.build.ubuntu and the sketch below)
  • updating requirement versions in ci/docker/install/requirements so that python3 -m pip install -r /work/requirements in Dockerfile.build.ubuntu runs successfully. I needed to update onnx, Cython and Pillow, because the currently pinned versions cannot be installed under Python 3.8.
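
A rough sketch of what the first bullet can look like in the Dockerfile (illustrative only, assuming the python3.8 package from bionic-updates; not the exact diff in this PR):

# Illustrative sketch, not the exact change in this PR: install Python 3.8
# from the Ubuntu 18.04 (bionic-updates) repositories and point python3 at it.
RUN apt-get update && apt-get install -y python3.8 python3.8-dev python3-pip && \
    ln -sf /usr/bin/python3.8 /usr/bin/python3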

I was then able to build the image and run the ci/build.py command mentioned above successfully.
I'm now wondering whether the failure in "continuous build / macosx-x86_64" that I'm seeing below is already a consequence of the onnx update I made (which is necessary in order to move to Python 3.8, which in turn is the goal of this PR). My question is basically: what does each of the CI jobs below do, and in which of them should I be able to observe the test failures? Also, let me know if you think this is going down the wrong route and I should try something different.

@leezu (Contributor) commented Jun 5, 2020

@ma-hei

I'm now wondering if the failure in "continuous build / macosx-x86_64" that I'm seeing below is already a consequence of the onnx update I made (which is necessary in order to update to python 3.8, which in turn is the goal of this PR).
My question is basically: what is each of the CI jobs below doing? In which of the jobs below should I be able to observe the test failures?

ONNX 1.6 appears to have some major changes (cf. discussion in #18054). So it's possible that updating ONNX leads to test failures.

The Ubuntu container you're modifying here is used for the ubuntu-cpu and ubuntu-gpu jobs. So you'd start seeing Python 3.8 related failures there.

But the requirements file you modified is shared among all containers. So you may want to use the environment-markers feature (https://www.python.org/dev/peps/pep-0508/#environment-markers) to only update ONNX when running under Python 3.8, and then disable or fix the ONNX tests in the unix-cpu and unix-gpu pipelines. Alternatively, you can try building ONNX 1.5 for Python 3.8 from source: https://github.com/onnx/onnx#linux-and-macos
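
For example, the pin in ci/docker/install/requirements could look roughly like this (the exact version numbers here are illustrative, not what this PR ends up using):

onnx==1.5.0; python_version < "3.8"
onnx==1.7.0; python_version >= "3.8"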

@RuRo what do you think?

Also let me know if you think this is going down a wrong route and I should try something different.

I think your current steps are fine

@leezu (Contributor) left a comment



 # Development dependencies
 cpplint==1.3.0
-pylint==2.3.1 # pylint and astroid need to be aligned
+pylint==2.3.1;python_version<"3.8"
+pylint==2.4.4;python_version=="3.8" # pylint and astroid need to be aligned
 astroid==2.3.3 # pylint and astroid need to be aligned

@leezu (Contributor) commented on the diff:
I suspect your lint errors are due to not aligning the pylint and astroid versions.

@ma-hei (Author) commented Jun 14, 2020

@mxnet-bot run ci [all]

@mxnet-bot

Jenkins CI successfully triggered : [unix-gpu, clang, centos-cpu, edge, sanity, windows-gpu, unix-cpu, miscellaneous, windows-cpu, website, centos-gpu]

@@ -433,6 +433,7 @@ build_ubuntu_cpu_mkl() {
     -DUSE_TVM_OP=ON \
     -DUSE_MKL_IF_AVAILABLE=ON \
     -DUSE_BLAS=MKL \
+    -DPYTHON_EXECUTABLE:FILEPATH=/usr/bin/python3.8 \

A reviewer (Contributor) commented on the added line:

/usr/bin/python3 ?

@ma-hei (Author) commented Jun 14, 2020

@leezu I think I got to a state where I can run the unit tests with Python 3.8 and reproduce what is described in issue #18380.
I ignored the lint errors for now by adding disable flags to the pylintrc file.

As described in #18380, we're seeing an issue related to the usage of time.clock(), which was removed in Python 3.8.
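
For reference, time.clock() was deprecated since Python 3.3 and removed in 3.8; a minimal sketch of the usual replacement (not the actual change in this PR, and assuming the call sites only need simple elapsed-time measurements):

import time

# time.clock() no longer exists on Python 3.8. Depending on what was measured:
start = time.perf_counter()      # wall-clock style timing
# ... code under measurement ...
elapsed = time.perf_counter() - start

cpu_start = time.process_time()  # CPU time of the current process
# ... code under measurement ...
cpu_elapsed = time.process_time() - cpu_start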

Besides that, I found the following issue: the test tests/python/unittest/onnx/test_node.py::TestNode::test_import_export seems to fail. In the Jenkins job I don't see the error, but when running the test locally with Python 3.8 and onnx 1.7, I'm getting:

>               bkd_rep = backend.prepare(onnxmodel, operation='export', backend='mxnet')

tests/python/unittest/onnx/test_node.py:164:
tests/python/unittest/onnx/backend.py:104: in prepare
    sym, arg_params, aux_params = MXNetBackend.perform_import_export(sym, arg_params, aux_params,
tests/python/unittest/onnx/backend.py:62: in perform_import_export
    graph_proto = converter.create_onnx_graph_proto(sym, params, in_shape=input_shape,
python/mxnet/contrib/onnx/mx2onnx/export_onnx.py:308: in create_onnx_graph_proto
...
E       onnx.onnx_cpp2py_export.checker.ValidationError: Node (pad0) has input size 1 not in range [min=2, max=3].
E 
E       ==> Context: Bad node spec: input: "input1" output: "pad0" name: "pad0" op_type: "Pad" attribute { name: "mode" s: "constant" type: STRING } attribute { name: "pads" ints: 0 ints: 0 ints: 0 ints: 0 ints: 0 ints: 0 ints: 0 ints: 0 type: INTS } attribute { name: "value" f: 0 type: FLOAT }

../../../Library/Python/3.8/lib/python/site-packages/onnx/checker.py:53: ValidationError

I believe this is an issue in onnx 1.7 as it looks exactly like onnx/onnx#2548.

I also found that the test job "Python3: MKL-CPU" is not running through, which seems to be due to a timeout. I believe this is happening in tests/python/conftest.py, but the log output doesn't tell me what exactly goes wrong. Do you have any idea how to reproduce this locally, or how to get better insight into that failure?

I will now look into the following:

  • can I work around the onnx 1.7 related issue?
  • even after aligning pylint and astroid, I'm seeing unexpected linting errors. The linter is telling me that ndarray is a bad class name. Why is that happening?

Btw, I believe that you and @marcoabreu are getting pinged every time I push a commit to this PR. I don't think this PR is actually in a reviewable state yet; its purpose is more to see what breaks when upgrading to Python 3.8.

@leezu marked this pull request as a draft on June 15, 2020
@leezu (Contributor) commented Jun 15, 2020

Thank you @ma-hei. For pylint and astroid: you have now updated astroid to the latest version, but not pylint. Would updating pylint to 2.5.3 resolve the misalignment? It's fine to update astroid and pylint for all Python versions, though the new releases may add some new checks that fail now. Ideally (if it's straightforward), those failures can be fixed.

I'm not sure about the ONNX test failure. If you can figure it out, that would be great. Also feel free to ping other people who show up in the commit history of the related mxnet onnx files.

To avoid pinging people on commits, you can mark the PR as a draft; I did that for you for now. Thanks for your help in fixing the Python 3.8 support!

@ma-hei (Author) commented Jun 30, 2020

Here's what's going on with onnx 1.7: onnx/onnx#2865.
We just need to use the newer way of instantiating a Pad node (see the sketch below).
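
A minimal sketch of the difference, using onnx.helper directly (illustrative only, not code from this PR; opset numbers per the ONNX changelog):

from onnx import helper, TensorProto

# Old style (Pad up to opset 10): pads and the constant value are node attributes.
pad_old = helper.make_node(
    "Pad", inputs=["x"], outputs=["y"],
    mode="constant", pads=[0, 0, 1, 1], value=0.0)

# New style (Pad opset 11, shipped with onnx 1.7): pads (and the optional
# constant value) are passed as additional input tensors instead of attributes.
pads = helper.make_tensor("pads", TensorProto.INT64, dims=[4], vals=[0, 0, 1, 1])
pad_new = helper.make_node(
    "Pad", inputs=["x", "pads"], outputs=["y"], mode="constant")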

@leezu (Contributor) commented Jul 1, 2020

@ma-hei I also noticed this error in an unrelated CD job that was building MXNet for Python 3.8. Just FYI, and not necessarily something you need to take into account here.

Traceback (most recent call last):
  File "setup.py", line 137, in <module>
    from mxnet.base import _generate_op_module_signature
  File "/Users/travis/build/dmlc/mxnet-distro/mxnet/__init__.py", line 34, in <module>
    from . import contrib
  File "/Users/travis/build/dmlc/mxnet-distro/mxnet/contrib/__init__.py", line 31, in <module>
    from . import onnx
  File "/Users/travis/build/dmlc/mxnet-distro/mxnet/contrib/onnx/__init__.py", line 21, in <module>
    from .mx2onnx.export_model import export_model
  File "/Users/travis/build/dmlc/mxnet-distro/mxnet/contrib/onnx/mx2onnx/__init__.py", line 23, in <module>
    from . import _op_translations
  File "/Users/travis/build/dmlc/mxnet-distro/mxnet/contrib/onnx/mx2onnx/_op_translations.py", line 60, in <module>
    import onnx
  File "/Users/travis/.local/lib/python3.8/site-packages/onnx/__init__.py", line 8, in <module>
    from onnx.external_data_helper import load_external_data_for_model, write_external_data_tensors
  File "/Users/travis/.local/lib/python3.8/site-packages/onnx/external_data_helper.py", line 10, in <module>
    from .onnx_pb import TensorProto, ModelProto
  File "/Users/travis/.local/lib/python3.8/site-packages/onnx/onnx_pb.py", line 8, in <module>
    from .onnx_ml_pb2 import *  # noqa
  File "/Users/travis/.local/lib/python3.8/site-packages/onnx/onnx_ml_pb2.py", line 19, in <module>
    DESCRIPTOR = _descriptor.FileDescriptor(
TypeError: __init__() got an unexpected keyword argument 'serialized_options'

@ma-hei (Author) commented Jul 2, 2020

Thanks @leezu, I think I found the underlying cause of the test failure in unittest/onnx/test_node.py::TestNode::test_import_export. In onnx 1.7 the inputs of the Pad operator have changed; we can see this by comparing https://github.com/onnx/onnx/blob/master/docs/Operators.md#Pad to https://github.com/onnx/onnx/blob/master/docs/Changelog.md#Pad-1. I believe I can fix this test and I'm working on that now. However, the same test will then no longer pass with onnx 1.5 (but at least we know how to fix it, I guess). I assume the stack trace you posted above from the unrelated CD job probably has a similar root cause. Besides that, I'm trying to make pylint happy, but that's the smaller issue I think.

@ma-hei (Author) commented Jul 3, 2020

I see that the discussion above regarding the failing test unittest/onnx/test_node.py::TestNode::test_import_export is now obsolete, since this test was removed in commit fb73de7.

@leezu (Contributor) commented Jul 5, 2020

@ma-hei there is more background at #18525 (comment)

@ma-hei (Author) commented Jul 9, 2020

It seems like the unit tests in the unix-cpu job are failing at this point:

[2020-07-02T19:59:32.830Z] tests/python/unittest/test_profiler.py::test_gpu_memory_profiler_gluon 
[2020-07-02T19:59:32.830Z] [gw0] [ 89%] SKIPPED tests/python/unittest/test_profiler.py::test_gpu_memory_profiler_gluon 
[2020-07-02T19:59:32.830Z] tests/python/unittest/test_recordio.py::test_recordio 
[2020-07-02T22:59:39.221Z] Sending interrupt signal to process
[2020-07-02T22:59:44.185Z] 2020-07-02 22:59:39,244 - root - WARN

Trying to reproduce it locally.
Note: comparing this with the same job running on Python 3.6, I see that the same test runs much later (at 99% completion, compared to 89% in this job):

[2020-07-08T22:31:40.815Z] tests/python/unittest/test_profiler.py::test_gpu_memory_profiler_gluon 
[2020-07-08T22:31:40.815Z] [gw3] [ 99%] SKIPPED tests/python/unittest/test_profiler.py::test_gpu_memory_profiler_gluon 
[2020-07-08T22:31:40.815Z] tests/python/unittest/test_recordio.py::test_recordio 
[2020-07-08T22:31:40.815Z] [gw3] [ 99%] PASSED tests/python/unittest/test_recordio.py::test_recordio 

@ma-hei (Author) commented Jul 15, 2020

@mxnet-bot run ci [centos-cpu]

@mxnet-bot

Jenkins CI successfully triggered : [centos-cpu]

@ma-hei force-pushed the fixing_python3.8_issues branch 2 times, most recently from 82c187e to 1c2274e on July 17, 2020
@lanking520 added the pr-work-in-progress label on Sep 14, 2020
@ma-hei force-pushed the fixing_python3.8_issues branch 3 times, most recently from 7ae5ae6 to 17e403d on September 14, 2020
@szha (Member) commented Sep 14, 2020

There are a couple of python lint issues that block the CI:

[2020-09-14T16:51:10.099Z] + python3 -m pylint --rcfile=ci/other/pylintrc '--ignore-patterns=.*\.so1,.*\.dll1,.*\.dylib1' python/mxnet
[2020-09-14T16:52:17.785Z] ************* Module mxnet.test_utils
[2020-09-14T16:52:17.785Z] python/mxnet/test_utils.py:1143:31: E1121: Too many positional arguments for function call (too-many-function-args)
[2020-09-14T16:52:17.785Z] ************* Module mxnet.numpy.multiarray
[2020-09-14T16:52:17.785Z] python/mxnet/numpy/multiarray.py:264:0: C0103: Class name "ndarray" doesn't conform to '[A-Z_][a-zA-Z0-9]+$' pattern (invalid-name)
[2020-09-14T16:52:17.785Z] python/mxnet/numpy/multiarray.py:1871:4: W0222: Signature differs from overridden 'expand_dims' method (signature-differs)
[2020-09-14T16:52:17.785Z] ************* Module mxnet.gluon.contrib.data.vision.transforms.bbox.bbox
[2020-09-14T16:52:17.785Z] python/mxnet/gluon/contrib/data/vision/transforms/bbox/bbox.py:278:18: E1123: Unexpected keyword argument 'val' in function call (unexpected-keyword-arg)
[2020-09-14T16:52:17.785Z] python/mxnet/gluon/contrib/data/vision/transforms/bbox/bbox.py:278:18: E1120: No value for argument 'fill_value' in function call (no-value-for-parameter)
[2020-09-14T16:52:17.785Z] ************* Module mxnet.symbol.numpy._symbol
[2020-09-14T16:52:17.785Z] python/mxnet/symbol/numpy/_symbol.py:645:4: W0222: Signature differs from overridden 'expand_dims' method (signature-differs)

@ma-hei force-pushed the fixing_python3.8_issues branch 13 times, most recently from 9e2d114 to c144a8c on September 18, 2020
@ma-hei (Author) commented Sep 29, 2020

@leezu I think I have to give up on this issue or find a different approach. What I tried was the following:

  • update python to 3.8
  • update some dependencies that needed updating in order to work with python 3.8
  • Fix some linting issues that needed fixing because of the updated linter dependencies.

Then I did the following: (1) kick off a build and observe the failing tests, (2) remove the failing tests in order to isolate the test failures.
I was expecting that the build would pass at some point. However, I noticed that even after multiple iterations I couldn't isolate the test failures: every time I kicked off a build, I would find another set of failing tests, sometimes in a different sub-build.
I'm not sure how to proceed at this point. If you have some suggestions, I'd be happy to try them. Otherwise I think I have to give up on this. Maybe the issue needs to be split up into multiple tasks such as "make unix-cpu succeed with python 3.8", "make unix-gpu succeed with python 3.8", etc.

@leezu (Contributor) commented Dec 4, 2020

Python 3.8 is now tested on CI as part of #19588.

To get that through, onnx first had to be updated in #19017.

Thank you @ma-hei for taking a stab at this before!

@leezu closed this on Dec 4, 2020