Add support for python 3.11 #3190
Conversation
CI fails on the last commit as of pytorch-lightning v2.0.0, ref Lightning-AI/pytorch-lightning#16745
@ddelange Thanks for giving this a try, this is a great start! I suspect there might be a non-trivial amount of changes needed since we need to upgrade to torch 2.0 and pytorch-lightning 2.0, but at least we should be able to start testing things now that CatBoost supports Python 3.11. I'm putting a tentative timeline for v0.8 release, but it very much depends on how many changes need to happen.
This is still broken for CLIP models:

import os
import numpy as np
import warnings
warnings.filterwarnings('ignore')
np.random.seed(123)
download_dir = './ag_automm_tutorial'
zip_file = 'https://automl-mm-bench.s3.amazonaws.com/petfinder_kaggle.zip'
from autogluon.core.utils.loaders import load_zip
load_zip.unzip(zip_file, unzip_dir=download_dir)
import pandas as pd
dataset_path = download_dir + '/petfinder_processed'
train_data = pd.read_csv(f'{dataset_path}/train.csv', index_col=0)
test_data = pd.read_csv(f'{dataset_path}/dev.csv', index_col=0)
label_col = 'AdoptionSpeed'
image_col = 'Images'
train_data[image_col] = train_data[image_col].apply(lambda ele: ele.split(';')[0]) # Use the first image for a quick tutorial
test_data[image_col] = test_data[image_col].apply(lambda ele: ele.split(';')[0])
def path_expander(path, base_folder):
    path_l = path.split(';')
    return ';'.join([os.path.abspath(os.path.join(base_folder, path)) for path in path_l])
train_data[image_col] = train_data[image_col].apply(lambda ele: path_expander(ele, base_folder=dataset_path))
test_data[image_col] = test_data[image_col].apply(lambda ele: path_expander(ele, base_folder=dataset_path))
train_data = train_data.sample(500, random_state=0)
test_data = test_data.sample(100, random_state=0)
from autogluon.multimodal import AutoMMPredictor
predictor = AutoMMPredictor(label=label_col)
predictor.fit(
    train_data=train_data,
    hyperparameters={
        "model.names": ["clip", "timm_image", "hf_text", "numerical_mlp", "fusion_mlp"],
        "model.timm_image.checkpoint_name": "swin_small_patch4_window7_224",
        "model.hf_text.checkpoint_name": "google/electra-small-discriminator",
        "env.num_gpus": 1,
    },
    time_limit=20,  # seconds
)
scores = predictor.evaluate(test_data, metrics=["accuracy"])
print(scores)
Some failed tests point to multimodal/src/autogluon/multimodal/utils/map.py (lines 6 to 7 in 7dfefbd). I think it's safe to always use torchmetrics>0.12.0. @FANGAreNotGnu correct me if I'm wrong.
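For illustration, a minimal sketch of what such a version gate could look like (the module layout, the fallback behaviour, and the exact threshold are assumptions, not the actual contents of map.py; 0.11.3 is the version confirmed later in this thread):

# Hedged sketch: gate on the installed torchmetrics version and use the
# upstream MAP implementation only when it is known to be bug-free.
from packaging.version import parse as parse_version
import torchmetrics

if parse_version(torchmetrics.__version__) >= parse_version("0.11.3"):
    from torchmetrics.detection.mean_ap import MeanAveragePrecision
else:
    # a vendored copy of the old implementation would be imported here instead
    raise ImportError("torchmetrics>=0.11.3 is required for a bug-free MAP")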
timeseries/setup.py (Outdated)
@@ -29,8 +29,8 @@
     "pandas",  # version range defined in `core/_setup_utils.py`
     "statsmodels>=0.13.0,<0.14",
     "gluonts>=0.12.4,<0.13",
-    "torch>=1.9,<1.14",
-    "pytorch-lightning>=1.7.4,<1.10.0",
+    "torch>=1.9,<2.1",
Due to security vulnerabilities in torch, we may need to force "torch>=1.13.1"
in that case, the minimum torchvision version would go to 0.14.0 and your 0.12 comment becomes obsolete 👍 sorry, that's torchmetrics
Do we want to be supporting multiple major versions of torch? That sounds intimidating to maintain. Is it possible to drop support of torch 1.x and only support 2.x?
e.g. pytorch-lightning supports torch 1.11-2.0; I think it should be ok to support v1 a while longer:
torch.compile is the main API for PyTorch 2.0, which wraps your model and returns a compiled model. It is a fully additive (and optional) feature and hence 2.0 is 100% backward compatible by definition.
ref https://pytorch.org/blog/pytorch-2.0-release/
personally, as a user, I would be fine with >=2 as well
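To illustrate the quoted claim, a minimal torch.compile sketch (toy model, not AutoGluon code):

import torch

# torch.compile is additive: the original eager module keeps working unchanged.
model = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
compiled = torch.compile(model)  # requires torch>=2.0
out = compiled(torch.randn(4, 8))  # same call signature as the eager model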
The lightning upgrade will require some verification.
I ran the smoke tests we usually run for the containers: CLIP is broken (see the code above), but other scenarios look ok.
@@ -16,21 +16,23 @@ function setup_build_contrib_env {
 }

 function setup_torch_gpu {
-    # Security-patched torch.
-    python3 -m pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
+    PIP_EXTRA_INDEX_URL=https://download.pytorch.org/whl/cu118 reinstall_torch
So changes from this file actually won't be reflected in our CI, because we check the permissions of the PR submitter to prevent people from modifying our setup scripts.
Currently there's no easy way to enable it to run. I'm thinking of allowing the script to be used if we tag this PR as something like safe to run
in the future. For now, we might need to give @ddelange write permission to our repo briefly and revoke the permission once the PR is ready. @Innixma @gradientsky Ideas?
sounds like good practice to keep that closed off 💪
fwiw, I don't mind if a maintainer opens a new PR in favour of this one
After discussion with the team, I'll be creating a clone of this PR once it's ready, to test the changes. Once those changes are verified, we'll merge this PR and close the clone.
@@ -45,7 +45,7 @@ jobs:
     fail-fast: false
     matrix:
       os: [macos-latest, windows-latest, ubuntu-latest]
-      python: ["3.8", "3.9", "3.10"]
+      python: ["3.8", "3.9", "3.10", "3.11"]
To trigger a platform test, you should be able to comment on this PR with:
/platform_tests ref=ddelange:cp311
However, our platform tests are currently failing because of other issues. @gradientsky @Innixma We'll want to fix the platform tests to unblock this PR
* 'master' of https://github.com/awslabs/autogluon: improve stability of flaky XShift tests (autogluon#3194)
Just confirmed that torchmetrics>=0.11.3 works well. I can update the version requirement (also removing the backup code) if it also works well with other problem types. @zhiqiangdon
@FANGAreNotGnu Please go ahead and clean up the old logic for object detection. Other problem types already use the newest torchmetrics.
For torchmetrics >= 0.11.3 the MAP implementation is bug-free, but it's still much slower than 0.6.0 (the implementation copied here), see Lightning-AI/torchmetrics#1024. There might also be a problem that the old implementation is no longer compatible with the latest pytorch/torchmetrics. We can check the pytorch version and only use the new-but-slow implementation if we have to. @zhiqiangdon @sxjscience
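For reference, a minimal usage sketch of the torchmetrics >= 0.11.3 implementation (toy boxes, not our detection code; note it needs pycocotools installed):

import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

# One predicted box that exactly matches the single ground-truth box.
metric = MeanAveragePrecision()
preds = [{"boxes": torch.tensor([[0.0, 0.0, 10.0, 10.0]]),
          "scores": torch.tensor([0.9]),
          "labels": torch.tensor([0])}]
target = [{"boxes": torch.tensor([[0.0, 0.0, 10.0, 10.0]]),
           "labels": torch.tensor([0])}]
metric.update(preds, target)
print(metric.compute()["map"])  # tensor(1.) for this perfect match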
^ ray 2.7 contains some critical fixes for ray serve specific to python 3.11, ref kserve/kserve#3075 (comment)
@Innixma Changes merged. I think the code is good to merge functionality-wise, though there are some installation instruction changes. Currently we prompt for a pytorch 1.13 installation; should we bump it to pytorch 2.0 and its corresponding torchvision? We can probably address this in a follow-up PR too, given this PR has been around for a long time.
@@ -4,7 +4,7 @@ pip install -U setuptools wheel

 # CPU version of pytorch has smaller footprint - see installation instructions in
 # pytorch documentation - https://pytorch.org/get-started/locally/
-pip install torch==1.13.1+cpu torchvision==0.14.1+cpu -f https://download.pytorch.org/whl/cpu/torch_stable.html
+pip install torchvision~=0.15.1 --force-reinstall --extra-index-url https://download.pytorch.org/whl/cpu
I switched the docs to torch's new --extra-index-url notation.
torchvision etc. have an exact pin on a torch version, so this command is sufficient (and will reinstall the cuda version if a user previously had a cpu version installed, and vice versa)
@zhiqiangdon Addressed your comments
Job PR-3190-bd6b173 is done.
LGTM!!
LGTM! Thanks for the contribution!
Merged! Thanks everyone for working to push this through!
@@ -64,7 +64,7 @@
     "onnx>=1.13.0,<1.14.0",
     "onnxruntime>=1.15.0,<1.16.0;platform_system=='Darwin'",
     "onnxruntime-gpu>=1.15.0,<1.16.0;platform_system!='Darwin'",
-    "tensorrt>=8.5.3.1,<8.5.4;platform_system=='Linux'",
+    "tensorrt>=8.5.3.1,<8.5.4;platform_system=='Linux' and python_version<'3.11'",  # tensorrt > 8.5.4 cause segfault
@yinweisu could you give tensorrt 8.6.1 another shot, to get some more info about the segfault?
If I find some time, I'll try to install it in our py 3.11 staging cluster and try out optimize_for_inference.
Hi @ddelange, I've been trying to work out that problem with tensorrt 8.6.1 by running the tests in test_deployment_onnx.py. It still gives me a segfault; I'll update with more info.
Update: I ran some more tests in python3.11, and tensorrt==8.6.1 worked with cuda12.1, with autogluon built from source (current master) and torch==2.1.1. I was able to successfully run the tests in test_deployment_onnx.py with those versions.
All in all, we can wait for the cuda version in the autogluon requirements to be upgraded to cuda12.1 (#3663), which will then solve the segfault issue seen with the latest version of tensorrt on cuda11.8. Only then can we go ahead and update the tensorrt version in setup.py.
Hi @AnirudhDagar 👋
#3982 was merged. could you finalize support for optimize_for_inference on python 3.11 to finally close #2687?
Yes, I'll update the issues as soon as we get rid of the py3.11 conditional, which is still present there due to a few failing HPO tests (see comment). I need to run some tests locally, specifically for optimize_for_inference.
@AnirudhDagar I saw you added "close #2687" to the release tracker #4027, but that issue tracks optimize_for_inference support on 3.11. is your plan to squeeze it in still?
Thanks for the reminder @ddelange, yes I can give the optimize_for_inference testing a shot after merging your ray PR.
fwiw, in a future version the tensorrt python package seems deprecated (from inspecting the latest prerelease, it's now an empty meta package forwarding to https://pypi.org/project/tensorrt-cu12/10.0.0b6/).
So if you hit a snag, it might be worth giving tensorrt-cu12 a try. See also the install instructions on the pypi page, to avoid an ugly pip-in-pip pattern, ref NVIDIA/TensorRT#3080
Issue #, if available: closes #2687
Description of changes:
--force-reinstall flag
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.