This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

updating ubuntu_cpu base image to 20.04 to observe failing tests regarding Python 3.8 #18445

Closed
ma-hei wants to merge 1 commit from the fixing_python3.8_issues branch

Conversation

@ma-hei commented May 30, 2020

Description

This PR is used to observe the failing tests when updating the ubuntu_cpu base image to 20.04. Ubuntu 20.04 ships with Python 3.8, so, as described in #18380, we should observe various test failures.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@mxnet-bot

Hey @ma-hei , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [edge, unix-cpu, sanity, clang, miscellaneous, centos-gpu, website, windows-gpu, centos-cpu, unix-gpu, windows-cpu]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@leezu (Contributor) commented May 30, 2020

When updating the base image, the Dockerfile would require some changes. You can test locally that the containers still build via:

cd mxnet/ci
DOCKER_CACHE_REGISTRY=mxnetci docker-compose -f docker/docker-compose.yml pull
DOCKER_CACHE_REGISTRY=mxnetci docker-compose -f docker/docker-compose.yml build

@ma-hei (Author) commented Jun 1, 2020

@leezu When using Ubuntu 20.04 as the base image of Dockerfile.build.ubuntu, I found some issues with the apt-get installation of packages such as clang-10 and doxygen. To solve the problem at hand, I went back to Ubuntu 18.04, but instead of installing python3 (in Dockerfile.build.ubuntu), I'm installing python3.8. The image builds successfully locally. I have two questions:

  • Will the CI tests of this pull request start automatically after some time?
  • After fixing the issues with python 3.8, do we actually want to upgrade to Ubuntu 20.04 in the base image or should this be a separate pull request?

@leezu (Contributor) commented Jun 1, 2020

Thanks @ma-hei. I wasn't aware that Python 3.8 is also available on 18.04 via the bionic-updates repository. In that case, it's best to first solve the Python 3.8 bugs without updating the base image.

We can update to 20.04 in a separate PR. Updating to 20.04 will allow us to simplify parts of the Dockerfile that currently install dependencies from third-party sources and replace them with installation from the official 20.04 repositories. This will also improve the stability of the CI (as the third-party sources sometimes become unavailable). To update the GPU containers, we may want to wait for https://gitlab.com/nvidia/container-images/cuda/-/issues/67

@mxnet-bot

Undefined action detected.
Permissible actions are : run ci [all], run ci [job1, job2]
Example : @mxnet-bot run ci [all]
Example : @mxnet-bot run ci [centos-cpu, clang]

@ma-hei (Author) commented Jun 2, 2020

@leezu I was hoping that I could observe the test failures related to Python 3.8 in one of the ci/jenkins/mxnet-validation build jobs. I assume those jobs did not run because the ci/jenkins/mxnet-validation/sanity build failed. Does the failure of the sanity build look to you like it is related to the Python 3.8 update I made in Dockerfile.build.ubuntu? To me it looks like the build stalled at the end and was automatically killed.

@leezu (Contributor) commented Jun 2, 2020

There were some issues with the CI this morning. So let's just retry

@mxnet-bot run ci [all]

@leezu (Contributor) commented Jun 2, 2020

@mxnet-bot run ci [all]

@leezu (Contributor) commented Jun 2, 2020

@ChaiBapchya the bot doesn't work

@ma-hei (Author) commented Jun 5, 2020

In the build job ci/jenkins/mxnet-validation/unix-cpu the following command was previously failing:
ci/build.py --docker-registry mxnetci --platform ubuntu_cpu --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh sanity_check
I was able to reproduce the issue locally and I fixed it by making additional changes to ci/docker/Dockerfile.build.ubuntu and ci/docker/install/requirements. What I did in those files is the following:

  • making python3.8 the default python3 binary by creating a symlink (see the change in ci/docker/Dockerfile.build.ubuntu and the sketch below)
  • updating requirement versions in ci/docker/install/requirements so that python3 -m pip install -r /work/requirements in Dockerfile.build.ubuntu runs successfully. I needed to update onnx, Cython and Pillow, because the currently pinned versions cannot be installed under Python 3.8.
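
A rough sketch of what the first bullet can look like in the Dockerfile (illustrative only, assuming the python3.8 package from bionic-updates; not the exact diff in this PR):

# Illustrative sketch, not the exact change in this PR: install Python 3.8
# from the Ubuntu 18.04 (bionic-updates) repositories and point python3 at it.
RUN apt-get update && apt-get install -y python3.8 python3.8-dev python3-pip && \
    ln -sf /usr/bin/python3.8 /usr/bin/python3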

I was then able to build the image and run the ci/build.py command mentioned above successfully.
I'm now wondering whether the failure in "continuous build / macosx-x86_64" that I'm seeing below is already a consequence of the onnx update I made (which is necessary in order to move to Python 3.8, which in turn is the goal of this PR). My question is basically: what does each of the CI jobs below do, and in which of them should I be able to observe the test failures? Also, let me know if you think this is going down the wrong route and I should try something different.

@leezu (Contributor) commented Jun 5, 2020

@ma-hei

I'm now wondering if the failure in "continuous build / macosx-x86_64" that I'm seeing below is already a consequence of the onnx update I made (which is necessary in order to update to python 3.8, which in turn is the goal of this PR).
My question is basically: what is each of the CI jobs below doing? In which of the jobs below should I be able to observe the test failures?

ONNX 1.6 appears to have some major changes (cf. discussion in #18054). So it's possible that updating ONNX leads to test failures.

The Ubuntu container you're modifying here is used for the ubuntu-cpu and ubuntu-gpu jobs. So you'd start seeing Python 3.8 related failures there.

But the requirements file you modified is shared among all containers. So you may want to use the environment-markers feature (https://www.python.org/dev/peps/pep-0508/#environment-markers) to only update ONNX when running under Python 3.8, and then disable or fix the ONNX tests in the unix-cpu and unix-gpu pipelines. Alternatively, you can try building ONNX 1.5 for Python 3.8 from source: https://github.com/onnx/onnx#linux-and-macos
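
For example, the pin in ci/docker/install/requirements could look roughly like this (the exact version numbers here are illustrative, not what this PR ends up using):

onnx==1.5.0; python_version < "3.8"
onnx==1.7.0; python_version >= "3.8"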

@RuRo what do you think?

Also let me know if you think this is going down a wrong route and I should try something different.

I think your current steps are fine

@leezu (Contributor) left a comment



 # Development dependencies
 cpplint==1.3.0
-pylint==2.3.1 # pylint and astroid need to be aligned
+pylint==2.3.1;python_version<"3.8"
+pylint==2.4.4;python_version=="3.8" # pylint and astroid need to be aligned
 astroid==2.3.3 # pylint and astroid need to be aligned

@leezu (Contributor) commented on the diff:
I suspect your lint errors are due to not aligning the pylint and astroid versions.

@ma-hei (Author) commented Jun 14, 2020

@mxnet-bot run ci [all]

@mxnet-bot

Jenkins CI successfully triggered : [unix-gpu, clang, centos-cpu, edge, sanity, windows-gpu, unix-cpu, miscellaneous, windows-cpu, website, centos-gpu]

@@ -433,6 +433,7 @@ build_ubuntu_cpu_mkl() {
     -DUSE_TVM_OP=ON \
     -DUSE_MKL_IF_AVAILABLE=ON \
     -DUSE_BLAS=MKL \
+    -DPYTHON_EXECUTABLE:FILEPATH=/usr/bin/python3.8 \

A reviewer (Contributor) commented on the added line:

/usr/bin/python3 ?

@ma-hei (Author) commented Jun 14, 2020

@leezu I think I got to a state where I can run the unit tests with Python 3.8 and reproduce what is described in issue #18380.
I ignored the lint errors for now by adding disable flags to the pylintrc file.

As described in #18380, we're seeing an issue related to the usage of time.clock(), which was removed in Python 3.8.
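
For reference, time.clock() was deprecated since Python 3.3 and removed in 3.8; a minimal sketch of the usual replacement (not the actual change in this PR, and assuming the call sites only need simple elapsed-time measurements):

import time

# time.clock() no longer exists on Python 3.8. Depending on what was measured:
start = time.perf_counter()      # wall-clock style timing
# ... code under measurement ...
elapsed = time.perf_counter() - start

cpu_start = time.process_time()  # CPU time of the current process
# ... code under measurement ...
cpu_elapsed = time.process_time() - cpu_start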

Besides that, I found the following issue: the test tests/python/unittest/onnx/test_node.py::TestNode::test_import_export seems to fail. In the Jenkins job I don't see the error, but when running the test locally with Python 3.8 and onnx 1.7, I'm getting:

>               bkd_rep = backend.prepare(onnxmodel, operation='export', backend='mxnet')

tests/python/unittest/onnx/test_node.py:164:
tests/python/unittest/onnx/backend.py:104: in prepare
    sym, arg_params, aux_params = MXNetBackend.perform_import_export(sym, arg_params, aux_params,
tests/python/unittest/onnx/backend.py:62: in perform_import_export
    graph_proto = converter.create_onnx_graph_proto(sym, params, in_shape=input_shape,
python/mxnet/contrib/onnx/mx2onnx/export_onnx.py:308: in create_onnx_graph_proto
...
E       onnx.onnx_cpp2py_export.checker.ValidationError: Node (pad0) has input size 1 not in range [min=2, max=3].
E 
E       ==> Context: Bad node spec: input: "input1" output: "pad0" name: "pad0" op_type: "Pad" attribute { name: "mode" s: "constant" type: STRING } attribute { name: "pads" ints: 0 ints: 0 ints: 0 ints: 0 ints: 0 ints: 0 ints: 0 ints: 0 type: INTS } attribute { name: "value" f: 0 type: FLOAT }

../../../Library/Python/3.8/lib/python/site-packages/onnx/checker.py:53: ValidationError

I believe this is an issue in onnx 1.7 as it looks exactly like onnx/onnx#2548.

I also found that the test job "Python3: MKL-CPU" is not running through, which seems to be due to a timeout. I believe this is happening in tests/python/conftest.py, but the log output doesn't tell me what exactly goes wrong. Do you have any idea how to reproduce this locally, or how to get better insight into that failure?

I will now look into the following:

  • can I work around the onnx 1.7 related issue?
  • even after aligning pylint and astroid, I'm seeing unexpected linting errors. The linter is telling me that ndarray is a bad class name. Why is that happening?

Btw, I believe that you and @marcoabreu are getting pinged every time I push a commit to this PR. I don't think this PR is actually in a reviewable state yet; its purpose is more to see what breaks when upgrading to Python 3.8.

@leezu marked this pull request as a draft on June 15, 2020
@leezu (Contributor) commented Jun 15, 2020

Thank you @ma-hei. For pylint and astroid: you have now updated astroid to the latest version, but not pylint. Would updating pylint to 2.5.3 resolve the misalignment? It's fine to update astroid and pylint for all Python versions, though the new releases may add some new checks that fail now. Ideally (if it's straightforward), those failures can be fixed.

I'm not sure about the ONNX test failure. If you can figure it out, that would be great. Also feel free to ping other people who show up in the commit history of the related mxnet onnx files.

To avoid pinging people on commits, you can mark the PR as a draft; I did that for you for now. Thanks for your help in fixing the Python 3.8 support!

@ma-hei (Author) commented Jun 30, 2020

Here's what's going on with onnx 1.7: onnx/onnx#2865.
We just need to use the newer way of instantiating a Pad node (see the sketch below).
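
A minimal sketch of the difference, using onnx.helper directly (illustrative only, not code from this PR; opset numbers per the ONNX changelog):

from onnx import helper, TensorProto

# Old style (Pad up to opset 10): pads and the constant value are node attributes.
pad_old = helper.make_node(
    "Pad", inputs=["x"], outputs=["y"],
    mode="constant", pads=[0, 0, 1, 1], value=0.0)

# New style (Pad opset 11, shipped with onnx 1.7): pads (and the optional
# constant value) are passed as additional input tensors instead of attributes.
pads = helper.make_tensor("pads", TensorProto.INT64, dims=[4], vals=[0, 0, 1, 1])
pad_new = helper.make_node(
    "Pad", inputs=["x", "pads"], outputs=["y"], mode="constant")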

@leezu (Contributor) commented Jul 1, 2020

@ma-hei I also noticed this error in an unrelated CD job that was building MXNet for Python 3.8. Just FYI, and not necessarily something you need to take into account here.

Traceback (most recent call last):
  File "setup.py", line 137, in <module>
    from mxnet.base import _generate_op_module_signature
  File "/Users/travis/build/dmlc/mxnet-distro/mxnet/__init__.py", line 34, in <module>
    from . import contrib
  File "/Users/travis/build/dmlc/mxnet-distro/mxnet/contrib/__init__.py", line 31, in <module>
    from . import onnx
  File "/Users/travis/build/dmlc/mxnet-distro/mxnet/contrib/onnx/__init__.py", line 21, in <module>
    from .mx2onnx.export_model import export_model
  File "/Users/travis/build/dmlc/mxnet-distro/mxnet/contrib/onnx/mx2onnx/__init__.py", line 23, in <module>
    from . import _op_translations
  File "/Users/travis/build/dmlc/mxnet-distro/mxnet/contrib/onnx/mx2onnx/_op_translations.py", line 60, in <module>
    import onnx
  File "/Users/travis/.local/lib/python3.8/site-packages/onnx/__init__.py", line 8, in <module>
    from onnx.external_data_helper import load_external_data_for_model, write_external_data_tensors
  File "/Users/travis/.local/lib/python3.8/site-packages/onnx/external_data_helper.py", line 10, in <module>
    from .onnx_pb import TensorProto, ModelProto
  File "/Users/travis/.local/lib/python3.8/site-packages/onnx/onnx_pb.py", line 8, in <module>
    from .onnx_ml_pb2 import *  # noqa
  File "/Users/travis/.local/lib/python3.8/site-packages/onnx/onnx_ml_pb2.py", line 19, in <module>
    DESCRIPTOR = _descriptor.FileDescriptor(
TypeError: __init__() got an unexpected keyword argument 'serialized_options'

@ma-hei (Author) commented Jul 2, 2020

Thanks @leezu, I think I found the underlying cause of the test failure in unittest/onnx/test_node.py::TestNode::test_import_export. In onnx 1.7 the inputs of the Pad operator have changed; we can see this by comparing https://github.com/onnx/onnx/blob/master/docs/Operators.md#Pad to https://github.com/onnx/onnx/blob/master/docs/Changelog.md#Pad-1. I believe I can fix this test and I'm working on that now. However, the same test will then no longer pass with onnx 1.5 (but at least we know how to fix it, I guess). I assume the stack trace you posted above from the unrelated CD job probably has a similar root cause. Besides that, I'm trying to make pylint happy, but that's the smaller issue I think.

@ma-hei (Author) commented Jul 3, 2020

I see that the discussion above regarding the failing test unittest/onnx/test_node.py::TestNode::test_import_export is now obsolete, since this test was removed in commit fb73de7.

@leezu (Contributor) commented Jul 5, 2020

@ma-hei there is more background at #18525 (comment)

@ma-hei (Author) commented Jul 9, 2020

It seems like the unit tests in the unix-cpu job are failing at this point:

[2020-07-02T19:59:32.830Z] tests/python/unittest/test_profiler.py::test_gpu_memory_profiler_gluon 
[2020-07-02T19:59:32.830Z] [gw0] [ 89%] SKIPPED tests/python/unittest/test_profiler.py::test_gpu_memory_profiler_gluon 
[2020-07-02T19:59:32.830Z] tests/python/unittest/test_recordio.py::test_recordio 
[2020-07-02T22:59:39.221Z] Sending interrupt signal to process
[2020-07-02T22:59:44.185Z] 2020-07-02 22:59:39,244 - root - WARN

Trying to reproduce it locally.
Note: comparing this with the same job running on Python 3.6, I see that the same test runs much later (at 99% completion, compared to 89% in this job):

[2020-07-08T22:31:40.815Z] tests/python/unittest/test_profiler.py::test_gpu_memory_profiler_gluon 
[2020-07-08T22:31:40.815Z] [gw3] [ 99%] SKIPPED tests/python/unittest/test_profiler.py::test_gpu_memory_profiler_gluon 
[2020-07-08T22:31:40.815Z] tests/python/unittest/test_recordio.py::test_recordio 
[2020-07-08T22:31:40.815Z] [gw3] [ 99%] PASSED tests/python/unittest/test_recordio.py::test_recordio 

@ma-hei (Author) commented Jul 15, 2020

@mxnet-bot run ci [centos-cpu]

@mxnet-bot

Jenkins CI successfully triggered : [centos-cpu]

@ma-hei force-pushed the fixing_python3.8_issues branch 2 times, most recently from 82c187e to 1c2274e on July 17, 2020
@lanking520 added the pr-work-in-progress label on Sep 14, 2020
@ma-hei force-pushed the fixing_python3.8_issues branch 3 times, most recently from 7ae5ae6 to 17e403d on September 14, 2020
@szha (Member) commented Sep 14, 2020

There are a couple of python lint issues that block the CI:

[2020-09-14T16:51:10.099Z] + python3 -m pylint --rcfile=ci/other/pylintrc '--ignore-patterns=.*\.so1,.*\.dll1,.*\.dylib1' python/mxnet
[2020-09-14T16:52:17.785Z] ************* Module mxnet.test_utils
[2020-09-14T16:52:17.785Z] python/mxnet/test_utils.py:1143:31: E1121: Too many positional arguments for function call (too-many-function-args)
[2020-09-14T16:52:17.785Z] ************* Module mxnet.numpy.multiarray
[2020-09-14T16:52:17.785Z] python/mxnet/numpy/multiarray.py:264:0: C0103: Class name "ndarray" doesn't conform to '[A-Z_][a-zA-Z0-9]+$' pattern (invalid-name)
[2020-09-14T16:52:17.785Z] python/mxnet/numpy/multiarray.py:1871:4: W0222: Signature differs from overridden 'expand_dims' method (signature-differs)
[2020-09-14T16:52:17.785Z] ************* Module mxnet.gluon.contrib.data.vision.transforms.bbox.bbox
[2020-09-14T16:52:17.785Z] python/mxnet/gluon/contrib/data/vision/transforms/bbox/bbox.py:278:18: E1123: Unexpected keyword argument 'val' in function call (unexpected-keyword-arg)
[2020-09-14T16:52:17.785Z] python/mxnet/gluon/contrib/data/vision/transforms/bbox/bbox.py:278:18: E1120: No value for argument 'fill_value' in function call (no-value-for-parameter)
[2020-09-14T16:52:17.785Z] ************* Module mxnet.symbol.numpy._symbol
[2020-09-14T16:52:17.785Z] python/mxnet/symbol/numpy/_symbol.py:645:4: W0222: Signature differs from overridden 'expand_dims' method (signature-differs)

@ma-hei force-pushed the fixing_python3.8_issues branch 13 times, most recently from 9e2d114 to c144a8c on September 18, 2020
@ma-hei (Author) commented Sep 29, 2020

@leezu I think I have to give up on this issue or find a different approach. What I tried was the following:

  • update python to 3.8
  • update some dependencies that needed updating in order to work with python 3.8
  • Fix some linting issues that needed fixing because of the updated linter dependencies.

Then I did the following: (1) kick off a build and observe the failing tests, (2) remove the failing tests in order to isolate the test failures.
I was expecting that the build would pass at some point. However, I noticed that even after multiple iterations I couldn't isolate the test failures: every time I kicked off a build, I would find another set of failing tests, sometimes in a different sub-build.
I'm not sure how to proceed at this point. If you have some suggestions, I'd be happy to try them. Otherwise I think I have to give up on this. Maybe the issue needs to be split up into multiple tasks such as "make unix-cpu succeed with python 3.8", "make unix-gpu succeed with python 3.8", etc.

@leezu (Contributor) commented Dec 4, 2020

Python 3.8 is now tested on CI as part of #19588.

To get that through, onnx first had to be updated in #19017.

Thank you @ma-hei for taking a stab at this before!

@leezu closed this on Dec 4, 2020