
Tabular distributed training artifact upload update #3110

Merged: 3 commits, Apr 5, 2023

Conversation

@yinweisu (Collaborator) commented Apr 4, 2023

Issue #, if available:
During full testing of distributed training, we noticed that the XGBoost model cannot be saved directly to an S3 bucket unless XGBoost is compiled with a special flag (DMLC_USE_S3=1), e.g.:

Fitting model: XGBoost_BAG_L1 ...
        Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelDistributedFoldFittingStrategy
        Warning: Exception caused XGBoost_BAG_L1 to fail during training... Skipping this model.
                ray::_ray_fit() (pid=1426, ip=172.31.66.143)
  File "/opt/conda/lib/python3.9/site-packages/autogluon/core/models/ensemble/fold_fitting_strategy.py", line 395, in _ray_fit
    fold_model.save(path=model_save_path)
  File "/opt/conda/lib/python3.9/site-packages/autogluon/tabular/models/xgboost/xgboost_model.py", line 215, in save
    _model.save_model(path + 'xgb.ubj')
  File "/opt/conda/lib/python3.9/site-packages/xgboost/sklearn.py", line 767, in save_model
    self.get_booster().save_model(fname)
  File "/opt/conda/lib/python3.9/site-packages/xgboost/core.py", line 2389, in save_model
    _check_call(_LIB.XGBoosterSaveModel(
  File "/opt/conda/lib/python3.9/site-packages/xgboost/core.py", line 279, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [17:53:25] ../dmlc-core/src/io.cc:57: Please compile with DMLC_USE_S3=1 to use S3

The same issue could potentially affect other model types as well.

Description of changes:
Models are now saved to the local disk first, and the artifact is then uploaded to S3. A util function for uploading a folder has been added, along with tests.
Example run after the fix:

No path specified. Models will be saved in: "AutogluonModels/ag-20230404_231427/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20230404_231427/"
AutoGluon Version:  0.7.0b20230404
Python Version:     3.9.16
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #80~18.04.1-Ubuntu SMP Mon May 23 20:32:04 UTC 2022
Disk Space Avail:   134.76 GB / 266.40 GB (50.6%)
Train Data Rows:    500
Train Data Columns: 14
Label Column: class
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
        2 unique label values:  [' >50K', ' <=50K']
        If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 =  >50K, class 0 =  <=50K
        Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive ( >50K) vs negative ( <=50K) class.
        To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
        Available Memory:                    31814.61 MB
        Train Data (Original)  Memory Usage: 0.29 MB (0.0% of available memory)
        Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
        Stage 1 Generators:
                Fitting AsTypeFeatureGenerator...
                        Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
        Stage 2 Generators:
                Fitting FillNaFeatureGenerator...
        Stage 3 Generators:
                Fitting IdentityFeatureGenerator...
                Fitting CategoryFeatureGenerator...
                        Fitting CategoryMemoryMinimizeFeatureGenerator...
        Stage 4 Generators:
                Fitting DropUniqueFeatureGenerator...
        Types of features in original data (raw dtype, special dtypes):
                ('int', [])    : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
                ('object', []) : 8 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
        Types of features in processed data (raw dtype, special dtypes):
                ('category', [])  : 7 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
                ('int', [])       : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
                ('int', ['bool']) : 1 | ['sex']
        0.1s = Fit runtime
        14 features in original data used to generate 14 features in processed data.
        Train Data (Processed) Memory Usage: 0.03 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.07s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
        To change this, specify the eval_metric parameter of Predictor()
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif_BAG_L1 ...
        0.726    = Validation score   (accuracy)
        0.0s     = Training   runtime
        0.01s    = Validation runtime
Fitting model: KNeighborsDist_BAG_L1 ...
        0.658    = Validation score   (accuracy)
        0.0s     = Training   runtime
        0.0s     = Validation runtime
Fitting model: LightGBMXT_BAG_L1 ...
        Fitting 2 child models (S1F1 - S1F2) | Fitting with ParallelDistributedFoldFittingStrategy
        0.834    = Validation score   (accuracy)
        0.26s    = Training   runtime
        0.01s    = Validation runtime
Fitting model: LightGBM_BAG_L1 ...
        Fitting 2 child models (S1F1 - S1F2) | Fitting with ParallelDistributedFoldFittingStrategy
        0.832    = Validation score   (accuracy)
        0.26s    = Training   runtime
        0.01s    = Validation runtime
Fitting model: RandomForestGini_BAG_L1 ...
        0.842    = Validation score   (accuracy)
        0.58s    = Training   runtime
        0.11s    = Validation runtime
Fitting model: RandomForestEntr_BAG_L1 ...
        0.83     = Validation score   (accuracy)
        0.54s    = Training   runtime
        0.11s    = Validation runtime
Fitting model: CatBoost_BAG_L1 ...
        Fitting 2 child models (S1F1 - S1F2) | Fitting with ParallelDistributedFoldFittingStrategy
        0.85     = Validation score   (accuracy)
        1.95s    = Training   runtime
        0.01s    = Validation runtime
Fitting model: ExtraTreesGini_BAG_L1 ...
        0.844    = Validation score   (accuracy)
        0.52s    = Training   runtime
        0.11s    = Validation runtime
Fitting model: ExtraTreesEntr_BAG_L1 ...
        0.844    = Validation score   (accuracy)
        0.56s    = Training   runtime
        0.11s    = Validation runtime
Fitting model: NeuralNetFastAI_BAG_L1 ...
        Fitting 2 child models (S1F1 - S1F2) | Fitting with ParallelDistributedFoldFittingStrategy
        0.836    = Validation score   (accuracy)
        4.37s    = Training   runtime
        0.07s    = Validation runtime
Fitting model: XGBoost_BAG_L1 ...
        Fitting 2 child models (S1F1 - S1F2) | Fitting with ParallelDistributedFoldFittingStrategy
        0.826    = Validation score   (accuracy)
        0.3s     = Training   runtime
        0.01s    = Validation runtime
Fitting model: NeuralNetTorch_BAG_L1 ...
        Fitting 2 child models (S1F1 - S1F2) | Fitting with ParallelDistributedFoldFittingStrategy
        0.822    = Validation score   (accuracy)
        2.75s    = Training   runtime
        0.03s    = Validation runtime
Fitting model: LightGBMLarge_BAG_L1 ...
        Fitting 2 child models (S1F1 - S1F2) | Fitting with ParallelDistributedFoldFittingStrategy
        0.808    = Validation score   (accuracy)
        0.4s     = Training   runtime
        0.01s    = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
        0.85     = Validation score   (accuracy)
        0.87s    = Training   runtime
        0.0s     = Validation runtime
AutoGluon training complete, total runtime = 30.75s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20230404_231427/")
6118      <=50K
23204     <=50K
29590     <=50K
18116     <=50K
33964      >50K
          ...  
29128     <=50K
23950     <=50K
13700      >50K
35248     <=50K
24772     <=50K
Name: class, Length: 500, dtype: object

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@Innixma (Contributor) left a comment:


LGTM!


github-actions bot commented Apr 5, 2023

Job PR-3110-316c07b is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3110/316c07b/index.html


github-actions bot commented Apr 5, 2023

Job PR-3110-c814369 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3110/c814369/index.html

@yinweisu yinweisu merged commit 86cd448 into autogluon:master Apr 5, 2023
3 checks passed

github-actions bot commented Apr 5, 2023

Job PR-3110-6caaa13 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3110/6caaa13/index.html
