
Tabular distributed training artifact upload update #3110

Merged: 3 commits, Apr 5, 2023

Conversation

@yinweisu (Collaborator) commented Apr 4, 2023

Issue #, if available:
During full testing of distributed training, we noticed that the XGBoost model cannot be saved directly to an S3 bucket unless XGBoost is compiled with a special flag (DMLC_USE_S3=1), e.g.:

Fitting model: XGBoost_BAG_L1 ...
        Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelDistributedFoldFittingStrategy
        Warning: Exception caused XGBoost_BAG_L1 to fail during training... Skipping this model.
                ray::_ray_fit() (pid=1426, ip=172.31.66.143)
  File "/opt/conda/lib/python3.9/site-packages/autogluon/core/models/ensemble/fold_fitting_strategy.py", line 395, in _ray_fit
    fold_model.save(path=model_save_path)
  File "/opt/conda/lib/python3.9/site-packages/autogluon/tabular/models/xgboost/xgboost_model.py", line 215, in save
    _model.save_model(path + 'xgb.ubj')
  File "/opt/conda/lib/python3.9/site-packages/xgboost/sklearn.py", line 767, in save_model
    self.get_booster().save_model(fname)
  File "/opt/conda/lib/python3.9/site-packages/xgboost/core.py", line 2389, in save_model
    _check_call(_LIB.XGBoosterSaveModel(
  File "/opt/conda/lib/python3.9/site-packages/xgboost/core.py", line 279, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [17:53:25] ../dmlc-core/src/io.cc:57: Please compile with DMLC_USE_S3=1 to use S3

The same issue could potentially affect other model types as well.

Description of changes:
Models are now saved to the local disk first, and the artifact is then uploaded to S3. A util function for uploading a folder has been added, along with tests.
Example run after the fix:

No path specified. Models will be saved in: "AutogluonModels/ag-20230404_231427/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20230404_231427/"
AutoGluon Version:  0.7.0b20230404
Python Version:     3.9.16
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #80~18.04.1-Ubuntu SMP Mon May 23 20:32:04 UTC 2022
Disk Space Avail:   134.76 GB / 266.40 GB (50.6%)
Train Data Rows:    500
Train Data Columns: 14
Label Column: class
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
        2 unique label values:  [' >50K', ' <=50K']
        If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 =  >50K, class 0 =  <=50K
        Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive ( >50K) vs negative ( <=50K) class.
        To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
        Available Memory:                    31814.61 MB
        Train Data (Original)  Memory Usage: 0.29 MB (0.0% of available memory)
        Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
        Stage 1 Generators:
                Fitting AsTypeFeatureGenerator...
                        Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
        Stage 2 Generators:
                Fitting FillNaFeatureGenerator...
        Stage 3 Generators:
                Fitting IdentityFeatureGenerator...
                Fitting CategoryFeatureGenerator...
                        Fitting CategoryMemoryMinimizeFeatureGenerator...
        Stage 4 Generators:
                Fitting DropUniqueFeatureGenerator...
        Types of features in original data (raw dtype, special dtypes):
                ('int', [])    : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
                ('object', []) : 8 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
        Types of features in processed data (raw dtype, special dtypes):
                ('category', [])  : 7 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
                ('int', [])       : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
                ('int', ['bool']) : 1 | ['sex']
        0.1s = Fit runtime
        14 features in original data used to generate 14 features in processed data.
        Train Data (Processed) Memory Usage: 0.03 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.07s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
        To change this, specify the eval_metric parameter of Predictor()
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif_BAG_L1 ...
        0.726    = Validation score   (accuracy)
        0.0s     = Training   runtime
        0.01s    = Validation runtime
Fitting model: KNeighborsDist_BAG_L1 ...
        0.658    = Validation score   (accuracy)
        0.0s     = Training   runtime
        0.0s     = Validation runtime
Fitting model: LightGBMXT_BAG_L1 ...
        Fitting 2 child models (S1F1 - S1F2) | Fitting with ParallelDistributedFoldFittingStrategy
        0.834    = Validation score   (accuracy)
        0.26s    = Training   runtime
        0.01s    = Validation runtime
Fitting model: LightGBM_BAG_L1 ...
        Fitting 2 child models (S1F1 - S1F2) | Fitting with ParallelDistributedFoldFittingStrategy
        0.832    = Validation score   (accuracy)
        0.26s    = Training   runtime
        0.01s    = Validation runtime
Fitting model: RandomForestGini_BAG_L1 ...
        0.842    = Validation score   (accuracy)
        0.58s    = Training   runtime
        0.11s    = Validation runtime
Fitting model: RandomForestEntr_BAG_L1 ...
        0.83     = Validation score   (accuracy)
        0.54s    = Training   runtime
        0.11s    = Validation runtime
Fitting model: CatBoost_BAG_L1 ...
        Fitting 2 child models (S1F1 - S1F2) | Fitting with ParallelDistributedFoldFittingStrategy
        0.85     = Validation score   (accuracy)
        1.95s    = Training   runtime
        0.01s    = Validation runtime
Fitting model: ExtraTreesGini_BAG_L1 ...
        0.844    = Validation score   (accuracy)
        0.52s    = Training   runtime
        0.11s    = Validation runtime
Fitting model: ExtraTreesEntr_BAG_L1 ...
        0.844    = Validation score   (accuracy)
        0.56s    = Training   runtime
        0.11s    = Validation runtime
Fitting model: NeuralNetFastAI_BAG_L1 ...
        Fitting 2 child models (S1F1 - S1F2) | Fitting with ParallelDistributedFoldFittingStrategy
        0.836    = Validation score   (accuracy)
        4.37s    = Training   runtime
        0.07s    = Validation runtime
Fitting model: XGBoost_BAG_L1 ...
        Fitting 2 child models (S1F1 - S1F2) | Fitting with ParallelDistributedFoldFittingStrategy
        0.826    = Validation score   (accuracy)
        0.3s     = Training   runtime
        0.01s    = Validation runtime
Fitting model: NeuralNetTorch_BAG_L1 ...
        Fitting 2 child models (S1F1 - S1F2) | Fitting with ParallelDistributedFoldFittingStrategy
        0.822    = Validation score   (accuracy)
        2.75s    = Training   runtime
        0.03s    = Validation runtime
Fitting model: LightGBMLarge_BAG_L1 ...
        Fitting 2 child models (S1F1 - S1F2) | Fitting with ParallelDistributedFoldFittingStrategy
        0.808    = Validation score   (accuracy)
        0.4s     = Training   runtime
        0.01s    = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
        0.85     = Validation score   (accuracy)
        0.87s    = Training   runtime
        0.0s     = Validation runtime
AutoGluon training complete, total runtime = 30.75s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20230404_231427/")
6118      <=50K
23204     <=50K
29590     <=50K
18116     <=50K
33964      >50K
          ...  
29128     <=50K
23950     <=50K
13700      >50K
35248     <=50K
24772     <=50K
Name: class, Length: 500, dtype: object

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@Innixma (Contributor) left a comment:


LGTM!


github-actions bot commented Apr 5, 2023

Job PR-3110-316c07b is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3110/316c07b/index.html


github-actions bot commented Apr 5, 2023

Job PR-3110-c814369 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3110/c814369/index.html

@yinweisu yinweisu merged commit 86cd448 into autogluon:master Apr 5, 2023
3 checks passed

github-actions bot commented Apr 5, 2023

Job PR-3110-6caaa13 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3110/6caaa13/index.html
