
Add support for python 3.11 #3190

Merged
merged 70 commits into from Oct 11, 2023

Conversation

@ddelange (Contributor) commented May 2, 2023

Issue #, if available: closes #2687

Description of changes:

  • bump dependency versions for which cp311 wheels/support is available
  • consequently add support for torch 2.0, torchmetrics 1.0, pytorch-lightning 2.0
  • run CI on 3.11
  • switch from cuda/cpu plus-notation to using pip's --force-reinstall flag
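To illustrate the last bullet: "plus-notation" bakes the CPU/CUDA build variant into the version pin, while a plain range plus `--force-reinstall` and an index URL lets one command switch an existing environment between variants. A sketch with hypothetical version pins (the actual commands live in the PR's install docs):

```shell
# Old "plus-notation": the variant is baked into the pin, so docs needed a
# separate pinned command per variant, for example:
#   pip install torch==1.13.1+cpu -f https://download.pytorch.org/whl/cpu/torch_stable.html
#
# New style: a plain version range plus an index URL; --force-reinstall swaps
# an already-installed CPU build for the CUDA build (or vice versa):
#   pip install "torch>=2.0,<2.1" --force-reinstall \
#       --extra-index-url https://download.pytorch.org/whl/cu118

# sanity check that this pip supports the flag used above
python3 -m pip install --help | grep -- --force-reinstall
```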

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@ddelange (Contributor, Author) commented May 2, 2023

CI fails with `TypeError: __init__() got an unexpected keyword argument 'track_grad_norm'`; the `track_grad_norm` Trainer argument was removed in pytorch-lightning v2.0.0, ref Lightning-AI/pytorch-lightning#16745

@ddelange (Contributor, Author) commented May 2, 2023

the last commit is due to:

ValueError: You selected an invalid strategy name: `strategy=None`. It must be either a string or an instance of `pytorch_lightning.strategies.Strategy`. Example choices: auto, ddp, ddp_spawn, deepspeed, ... Find a complete list of options in our documentation at https://lightning.ai/
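In Lightning 2.x the `Trainer`'s `strategy` argument defaults to the string `"auto"` and no longer accepts `None`. A minimal stdlib sketch of that stricter validation; the function name and the abbreviated choices set are illustrative, not Lightning's actual code:

```python
# Hypothetical sketch of Lightning 2.x's stricter strategy validation.
VALID_STRATEGY_NAMES = {"auto", "ddp", "ddp_spawn", "deepspeed"}  # abbreviated list

def validate_strategy(strategy):
    """Reject None: callers migrating from 1.x should pass "auto" instead."""
    if not isinstance(strategy, str) or strategy not in VALID_STRATEGY_NAMES:
        raise ValueError(
            f"You selected an invalid strategy name: `strategy={strategy!r}`"
        )
    return strategy
```

So code that previously passed `strategy=None` should pass `strategy="auto"`, or simply omit the argument.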

@ddelange (Contributor, Author) commented May 2, 2023

next failure:

tests/unittests/test_shift.py:66: KeyError: 'shift_features'

the 'shift_features' key is only present if detection_status is True, per the docs 🤔

@gradientsky gradientsky self-requested a review May 2, 2023 18:11
@Innixma (Contributor) commented May 2, 2023

@ddelange Thanks for giving this a try, this is a great start!

I suspect there might be a non-trivial amount of changes needed since we need to upgrade to torch 2.0 and pytorch-lightning 2.0, but at least we should be able to start testing things now that CatBoost supports Python 3.11.

I'm putting a tentative timeline for v0.8 release, but it very much depends on how many changes need to happen.

@Innixma Innixma added this to the 0.8 Release milestone May 2, 2023
@Innixma Innixma added enhancement New feature or request install Related to installation dependency Related to dependency packages priority: 0 Maximum priority labels May 2, 2023
@gradientsky (Contributor)

This is still broken for CLIP models:

import os
import numpy as np
import warnings
warnings.filterwarnings('ignore')
np.random.seed(123)

download_dir = './ag_automm_tutorial'
zip_file = 'https://automl-mm-bench.s3.amazonaws.com/petfinder_kaggle.zip'
from autogluon.core.utils.loaders import load_zip
load_zip.unzip(zip_file, unzip_dir=download_dir)

import pandas as pd
dataset_path = download_dir + '/petfinder_processed'
train_data = pd.read_csv(f'{dataset_path}/train.csv', index_col=0)
test_data = pd.read_csv(f'{dataset_path}/dev.csv', index_col=0)
label_col = 'AdoptionSpeed'

image_col = 'Images'
train_data[image_col] = train_data[image_col].apply(lambda ele: ele.split(';')[0]) # Use the first image for a quick tutorial
test_data[image_col] = test_data[image_col].apply(lambda ele: ele.split(';')[0])


def path_expander(path, base_folder):
    path_l = path.split(';')
    return ';'.join([os.path.abspath(os.path.join(base_folder, path)) for path in path_l])

train_data[image_col] = train_data[image_col].apply(lambda ele: path_expander(ele, base_folder=dataset_path))
test_data[image_col] = test_data[image_col].apply(lambda ele: path_expander(ele, base_folder=dataset_path))

train_data = train_data.sample(500, random_state=0)
test_data = test_data.sample(100, random_state=0)

from autogluon.multimodal import AutoMMPredictor
predictor = AutoMMPredictor(label=label_col)
predictor.fit(
    train_data=train_data,
    hyperparameters={
        "model.names": ["clip", "timm_image", "hf_text", "numerical_mlp", "fusion_mlp"],
        "model.timm_image.checkpoint_name": "swin_small_patch4_window7_224",
        "model.hf_text.checkpoint_name": "google/electra-small-discriminator",
        "env.num_gpus": 1,
    },
    time_limit=20, # seconds
)

scores = predictor.evaluate(test_data, metrics=["accuracy"])
print(scores)
/home/ubuntu/Projects/autogluon/multimodal/src/autogluon/multimodal/models/fusion/fusion_mlp.py:126 in input_keys

  123     def input_keys(self):
  124         input_keys = []
  125         for m in self.model:
❱ 126             assert hasattr(m, "input_keys"), f"invalid model {type(m)}, which doesn't ha
  127             input_keys += m.input_keys
  128         return input_keys

AssertionError: invalid model <class 'autogluon.multimodal.models.clip.CLIPForImageText'>, which doesn't have a 'input_keys' attribute

@sxjscience (Collaborator)

Some failed tests point to utils/map.py. The version detection may not have taken effect:

if version.parse(torchmetrics.__version__) > version.parse("0.12.0"):
    from torchmetrics.detection.mean_ap import MeanAveragePrecision

I think it's safe to always use torchmetrics>0.12.0. @FANGAreNotGnu correct me if I'm wrong.
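For context, the gate above only takes the new import path when the installed torchmetrics is newer than 0.12.0. A self-contained sketch of the same comparison, using a naive tuple parser in place of `packaging.version` (both helper names are hypothetical; the naive parser only handles plain X.Y.Z strings, not pre-releases):

```python
def parse_version(v: str) -> tuple:
    # naive parser: "1.0.1" -> (1, 0, 1); sufficient for plain X.Y.Z strings,
    # unlike packaging.version it does NOT handle suffixes like "0.11.3rc1"
    return tuple(int(part) for part in v.split(".")[:3])

def uses_new_map(torchmetrics_version: str) -> bool:
    # mirrors the gate in utils/map.py: import the new MeanAveragePrecision
    # only for torchmetrics strictly newer than 0.12.0
    return parse_version(torchmetrics_version) > parse_version("0.12.0")
```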

@@ -29,8 +29,8 @@
     "pandas", # version range defined in `core/_setup_utils.py`
     "statsmodels>=0.13.0,<0.14",
     "gluonts>=0.12.4,<0.13",
-    "torch>=1.9,<1.14",
-    "pytorch-lightning>=1.7.4,<1.10.0",
+    "torch>=1.9,<2.1",
Collaborator

Due to security vulnerabilities of torch, we may need to force "torch>=1.13.1"

@ddelange (Contributor, Author) commented May 3, 2023

in that case, the minimum torchvision version would go to 0.14.0 and your 0.12 comment becomes obsolete 👍 sorry, that's torchmetrics

Contributor

Do we want to be supporting multiple major versions of torch? That sounds intimidating to maintain. Is it possible to drop support of torch 1.x and only support 2.x?

Contributor Author

e.g. pytorch-lightning supports torch 1.11-2.0, so I think it should be ok to support v1 a while longer:

torch.compile is the main API for PyTorch 2.0, which wraps your model and returns a compiled model. It is a fully additive (and optional) feature and hence 2.0 is 100% backward compatible by definition.

ref https://pytorch.org/blog/pytorch-2.0-release/

personally, as a user, I would be fine with >=2 as well

@ddelange (Contributor, Author) commented May 5, 2023

because of pytorch-lightning>=2.0.0,<2.1 (6a0387e), shall I consequently bump

- torch>=1.9,<2.1
+ torch>=1.11,<2.1
- torchvision>=0.10.0,<0.16
+ torchvision>=0.12.0,<0.16 

?

edit: 4c94b71

@sxjscience (Collaborator)

The lightning upgrade will require some verification.

@gradientsky (Contributor)

The lightning upgrade will require some verification.

I ran the smoke tests we usually run for the containers: CLIP is broken (see the code above), but the other scenarios look ok.

@@ -16,21 +16,23 @@ function setup_build_contrib_env {
 }

 function setup_torch_gpu {
-    # Security-patched torch.
-    python3 -m pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
+    PIP_EXTRA_INDEX_URL=https://download.pytorch.org/whl/cu118 reinstall_torch
Collaborator

Changes to this file actually won't be reflected in our CI, because we check the permissions of the PR submitter to prevent people from modifying our setup scripts.

Currently there's no easy solution to enable it to run. In the future I'm thinking of allowing the script to be used if we tag the PR with something like safe to run. For now, we might need to briefly give @ddelange write permission to our repo and revoke it once the PR is ready. @Innixma @gradientsky Ideas?

Contributor Author

sounds like good practice to keep that closed off 💪

fwiw, I don't mind if a maintainer opens a new PR in favour of this one

Collaborator

After discussion with the team, I'll create a clone of this PR once it's ready, to test the changes. Once those changes are verified, we'll merge this PR and close the clone PR.

@@ -45,7 +45,7 @@ jobs:
   fail-fast: false
   matrix:
     os: [macos-latest, windows-latest, ubuntu-latest]
-    python: ["3.8", "3.9", "3.10"]
+    python: ["3.8", "3.9", "3.10", "3.11"]
Collaborator

To trigger a platform test, you should be able to do so by commenting on this PR:

/platform_tests ref=ddelange:cp311

However, our platform tests are currently failing because of other issues. @gradientsky @Innixma We'll want to fix the platform tests to unblock this PR

@FANGAreNotGnu (Contributor)

Some failed tests point to utils/map.py. The version detection may not have taken effect:

if version.parse(torchmetrics.__version__) > version.parse("0.12.0"):
    from torchmetrics.detection.mean_ap import MeanAveragePrecision

I think it's safe to always use torchmetrics>0.12.0. @FANGAreNotGnu correct me if I'm wrong.

Just confirmed that torchmetrics>=0.11.3 works well. I can update the version requirement (also removing the backup code) if it also works well with the other problem types. @zhiqiangdon

@zhiqiangdon (Contributor)

@FANGAreNotGnu Please go ahead and clean up the old logic for object detection. The other problem types already use the newest torchmetrics.

@FANGAreNotGnu (Contributor)

Some failed tests point to utils/map.py. The version detection may not have taken effect:

if version.parse(torchmetrics.__version__) > version.parse("0.12.0"):
    from torchmetrics.detection.mean_ap import MeanAveragePrecision

I think it's safe to always use torchmetrics>0.12.0. @FANGAreNotGnu correct me if I'm wrong.

For torchmetrics >= 0.11.3 the mAP implementation is bug-free, but there's still the issue that it's much slower than 0.6.0 (the implementation copied here), ref Lightning-AI/torchmetrics#1024. There might also be a problem that the old implementation is no longer compatible with the latest pytorch/torchmetrics. We could check the pytorch version and only use the new but slow implementation when we have to. @zhiqiangdon @sxjscience

@ddelange (Contributor, Author) commented Oct 4, 2023

^ ray 2.7 contains some critical fixes for ray serve specific to python 3.11 ref kserve/kserve#3075 (comment)

@Innixma (Contributor) commented Oct 10, 2023

@ddelange and @yinweisu, awesome work! Everything appears to be passing now on the clone PR (#3204). We will move the changes over to this PR and merge soon. This is a huge feature, thanks so much!

@Innixma (Contributor) left a comment

Awaiting changes from #3204 to be merged into this PR by @yinweisu.

@yinweisu (Collaborator)

@Innixma Changes merged. I think the code is good to merge functionality-wise, though there are some installation instruction changes. Currently we are prompting for a pytorch 1.13 installation; I wonder whether we should bump it to pytorch 2.0 and its corresponding torchvision?

We can probably address this in a follow-up PR too given this PR has been around for a long time

@@ -4,7 +4,7 @@ pip install -U setuptools wheel

 # CPU version of pytorch has smaller footprint - see installation instructions in
 # pytorch documentation - https://pytorch.org/get-started/locally/
-pip install torch==1.13.1+cpu torchvision==0.14.1+cpu -f https://download.pytorch.org/whl/cpu/torch_stable.html
+pip install torchvision~=0.15.1 --force-reinstall --extra-index-url https://download.pytorch.org/whl/cpu
Contributor Author

I switched the docs to torch's new --extra-index-url notation.

torchvision etc. carry an exact pin on a torch version, so this command is sufficient (and will reinstall the cuda version if a user previously had a cpu version installed, and vice versa)

@yinweisu (Collaborator)

@zhiqiangdon Addressed your comments

@github-actions

Job PR-3190-bd6b173 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3190/bd6b173/index.html

@Innixma (Contributor) left a comment

LGTM!!

@zhiqiangdon (Contributor) left a comment

LGTM! Thanks for the contribution!

@yinweisu yinweisu merged commit 3345d3b into autogluon:master Oct 11, 2023
28 checks passed
@Innixma (Contributor) commented Oct 11, 2023

Merged! Thanks everyone for working to push this through!

@Innixma Innixma mentioned this pull request Oct 11, 2023
7 tasks
@yinweisu yinweisu mentioned this pull request Oct 12, 2023
@@ -64,7 +64,7 @@
 "onnx>=1.13.0,<1.14.0",
 "onnxruntime>=1.15.0,<1.16.0;platform_system=='Darwin'",
 "onnxruntime-gpu>=1.15.0,<1.16.0;platform_system!='Darwin'",
-"tensorrt>=8.5.3.1,<8.5.4;platform_system=='Linux'",
+"tensorrt>=8.5.3.1,<8.5.4;platform_system=='Linux' and python_version<'3.11'",  # tensorrt > 8.5.4 cause segfault
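The added environment marker makes pip install tensorrt only when both conditions hold for the target interpreter. A simplified stdlib sketch of how such a marker evaluates (the function name is hypothetical; real PEP 508 evaluation is done by pip, which compares `python_version` as a version, modelled here with a `(major, minor)` tuple):

```python
import platform
import sys

def tensorrt_marker_matches(system: str, python_version: tuple) -> bool:
    # Simplified evaluation of the PEP 508 marker:
    #   platform_system=='Linux' and python_version<'3.11'
    return system == "Linux" and python_version < (3, 11)

# evaluate against the current interpreter and platform
current_match = tensorrt_marker_matches(platform.system(), tuple(sys.version_info[:2]))
```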
Contributor Author

@yinweisu could you give tensorrt 8.6.1 another shot, to get some more info about the segfault?

If I find some time, I'll try to install it in our py 3.11 staging cluster and try out optimize_for_inference

Contributor

Hi @ddelange, I've been trying to work out that problem with tensorrt 8.6.1 by running the tests in test_deployment_onnx.py. It still gives me a segfault; I'll update with more info.

Contributor

Update: I ran some more tests on python 3.11, and tensorrt==8.6.1 worked with cuda 12.1, with autogluon built from source (current master) and torch==2.1.1. I was able to successfully run the tests in test_deployment_onnx.py with the above versions.

All in all, we can wait for the cuda version in the autogluon requirements to be upgraded to cuda 12.1 (#3663), which will solve the segfault issue seen with the latest tensorrt on cuda 11.8. Only then can we go ahead and update the tensorrt version in setup.py.

Contributor Author

Hi @AnirudhDagar 👋

#3982 was merged. Could you finalize support for optimize_for_inference on python 3.11 to finally close #2687?

Contributor

Yes, I'll update the issues as soon as we get rid of the py3.11 conditional, which is still present due to a few failing HPO tests (see comment). I need to run some tests locally, specifically for optimize_for_inference.

Contributor Author

@AnirudhDagar I saw you added "close #2687" to the release tracker #4027, but that issue tracks optimize_for_inference support on 3.11. Is your plan still to squeeze it in?

Contributor

Thanks for the reminder @ddelange, yes I can give the optimize_for_inference testing a shot after merging your ray PR.

@ddelange (Contributor, Author) commented Apr 4, 2024

fwiw, in a future version the tensorrt python package seems to be deprecated (inspecting the latest prerelease, it's now an empty meta package forwarding to https://pypi.org/project/tensorrt-cu12/10.0.0b6/).

So if you hit a snag, it might be worth giving tensorrt-cu12 a try. See also the install instructions on the pypi page, to avoid an ugly pip-in-pip pattern, ref NVIDIA/TensorRT#3080

Successfully merging this pull request may close these issues.

Support Python 3.11
8 participants