
Latest version training crashes #10137

Closed
ganlei333 opened this issue Mar 21, 2024 · 12 comments
@ganlei333

ganlei333 commented Mar 21, 2024

My environment is Python 3.9, set up through Conda. The default installation of XGBoost is 1.7.0, and there is no problem training with that version. But when I upgraded XGBoost to 2.0.3 and trained with the same code and data, I hit a crash, and the crash is 100% reproducible. My hardware is a 2697AV4 CPU with 1 TB of DDR4-2400T memory and an SSD; training is in CPU mode. The nthread parameter is set to -1, and the dataset is 89 MB with 195 features for binary classification.

My code function is as follows:

import xgboost as xgb
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV


def optimize_hyperparameters(X_train, y_train):
    # Search space for the randomized hyperparameter search.
    param_dist = {
        'n_estimators': randint(100, 600),
        'learning_rate': uniform(0.01, 0.2),
        'subsample': uniform(0.5, 0.5),
        'max_depth': randint(20, 30),
        'colsample_bytree': uniform(0.5, 0.5),
        'min_child_weight': randint(1, 10)
    }

    # Weight the positive class by the negative/positive ratio to handle class imbalance.
    scale_pos_weight = len(y_train[y_train == 0]) / len(y_train[y_train == 1])

    xgb_clf = xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss',
                                scale_pos_weight=scale_pos_weight, nthread=-1)
    rscv = RandomizedSearchCV(xgb_clf, param_distributions=param_dist, n_iter=500,
                              scoring='roc_auc', cv=3, verbose=2, random_state=42, n_jobs=1)
    rscv.fit(X_train, y_train)

    print(f"Best parameters found: {rscv.best_params_}")
    return rscv.best_estimator_
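
For reference, a minimal sketch of how this function could be exercised on synthetic data: scikit-learn's make_classification stands in for the real 89 MB dataset (which is not attached), and the sample count and class weights below are arbitrary placeholders.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 195 features, imbalanced binary target.
X, y = make_classification(n_samples=50_000, n_features=195,
                           weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

best_model = optimize_hyperparameters(X_train, y_train)
print(best_model.score(X_test, y_test))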
@ganlei333
Author

sklearn: 1.3.0

@trivialfis
Member

Hi, thank you for raising the issue. Could you please provide a reproducible example that I can run on my machine?

@ganlei333
Author

I have provided the training function, but the data is too large to upload successfully.

@trivialfis
Member

Could you please provide the code in a more complete form so that I can run it with synthesized data?

@trivialfis
Member

In addition, could you please share what you mean by crash? Is it running out of memory, or is it running into a segfault?

@ganlei333
Author

Uploading xgb.zip…
The crash I'm referring to is the server automatically restarting.

@trivialfis
Member

Feel free to close the issue if this is not related to XGBoost itself.

@ganlei333
Author

I have tested this further: training with XGBoost 2.0.3 in a Python 3.10 environment on Windows 2019 works fine, while in a Python 3.11 environment on Debian 12, the system restarts immediately as soon as training starts (the data-cleaning part runs normally; the restart is triggered once the training logic is entered). The operating system environment is very clean, and it's not a CPU temperature issue either.

@trivialfis
Member

By restart, do you mean the login session restart, or the whole OS got restarted?

@ganlei333
Author

The entire operating system restarts.

@trivialfis
Member

Then that's beyond me. The OS has a bug, likely in the kernel; you may want to upgrade your OS. I don't think this is related to XGBoost.

@trivialfis
Member

Feel free to reopen it if you need further information.
