Feature/automl #199
Conversation
Increase Speed (dask-contrib#29)
Author: rajagurunath <gurunathrajagopal@gmail.com> Date: Mon May 24 02:37:40 2021 +0530
1. SHOW MODELS 2. DESCRIBE MODEL 3. EXPORT MODEL
Feature/export model
Fetch upstream changes to resolve conflicts
Fix a failing build, as ciso8601 is currently not pip-installable (dask-contrib#192)
2. Added comments and documentation. 3. added tpot and dask-ml in github workflow
I think the test cases are failing due to the dtype difference (float64 vs int64) in the data frames, and I suspect this may be due to the latest dask update 🤔 Any thoughts?
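The kind of dtype mismatch described above can be reproduced in isolation. A minimal sketch (the column name and values are hypothetical, not taken from the failing tests): two frames with identical values but different dtypes fail an exact comparison, while casting to a common dtype makes the check robust against upstream inference changes.

```python
import pandas as pd

# Hypothetical illustration: one dask/pandas version materializes a column
# as int64, a newer one as float64.
expected = pd.DataFrame({"target": pd.Series([0, 1, 2], dtype="int64")})
actual = pd.DataFrame({"target": pd.Series([0.0, 1.0, 2.0], dtype="float64")})

# Values match, dtypes do not -- DataFrame.equals is dtype-sensitive:
assert not expected.equals(actual)

# Casting to a common dtype makes the comparison pass again:
pd.testing.assert_frame_equal(expected.astype("float64"), actual)
```

Alternatively, `pd.testing.assert_frame_equal(..., check_dtype=False)` relaxes only the dtype check while keeping value comparison strict.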
Concerning the xgboost dependency, please see my comment :-) Regarding the code: it looks quite nice, I will try to have a detailed look soon. And on the CI/CD failure: it seems we are often the ones spotting the new "features" of new dask versions :-) Well, that is the fate of building on such a large and good pydata ecosystem (there are so many moving parts). I just triggered a re-run of the corresponding test pipeline on our current main branch - I expect it will also fail. In that case we can fix it on main first and then merge back here.
I did find the issue and have fixed it in #202. Once it is merged, feel free to also include it here. Hope this helps! Update: the fix is in the main branch, @rajagurunath!
Hi @nils-braun, I tried using the auto-detect feature, but it failed with the following error:
Am I missing anything here? Please let me know!
Hi @rajagurunath

CREATE OR REPLACE TABLE enriched_iris AS (
    SELECT
        sepal_length, sepal_width, petal_length, petal_width,
        CASE
            WHEN species = 'setosa' THEN 0
            ELSE CASE
                WHEN species = 'versicolor' THEN 1
                ELSE 2
            END
        END AS "species"
    FROM iris
)

CREATE OR REPLACE MODEL my_model WITH (
    model_class = 'xgboost.dask.DaskXGBClassifier',
    target_column = 'species'
) AS (
    SELECT * FROM "enriched_iris"
)

For me, that will produce a nice xgboost model, that I can also use for prediction:

SELECT
    *
FROM PREDICT(
    MODEL my_model,
    SELECT * FROM enriched_iris
)

Now - why did it work for me but not for you? The error message seems to point to some network problems. I guess the software wants to get the hostname from 127.0.0.1, which should be localhost (or the other way round). Normally, it uses the entry in the hosts file (for me, that is
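The hostname-resolution problem hinted at above can be checked independently of any library. A small diagnostic sketch (not part of the PR; the failure mode is an assumption based on the comment): if the hosts file lacks a `127.0.0.1 localhost` entry, reverse lookup of the loopback address can fail, and software that needs a hostname for its scheduler address errors out.

```python
import socket

# Reverse lookup: does 127.0.0.1 map back to a hostname?  A missing or broken
# hosts-file entry makes this raise socket.herror, which surfaces in higher
# layers as a "network" error.
try:
    hostname, aliases, addresses = socket.gethostbyaddr("127.0.0.1")
    print("127.0.0.1 resolves to:", hostname)
except socket.herror as exc:
    print("reverse lookup failed -- check your hosts file:", exc)

# The forward direction should also work:
print("localhost resolves to:", socket.gethostbyname("localhost"))
```

If the forward lookup prints anything other than a loopback address, the hosts file is the first place to look.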
1. Added `xgboost.dask.DaskXGBClassifier` in tests 2. Added tpot latest 3. Added ml experiment/export examples in notebook
Hi @nils-braun, thanks a lot for the detailed comments with a code example: you are right in both cases.
Dependency conflict:
Network Error:
Once again, thanks for your time. Kindly review the changes and let me know if I missed something!
Codecov Report
@@ Coverage Diff @@
## main #199 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 60 61 +1
Lines 2314 2404 +90
Branches 317 329 +12
=========================================
+ Hits 2314 2404 +90
Continue to review full report at Codecov.
Very good work. I have just found some small typos here and there that you can fix right from the GitHub web interface.
The only real comment I had was on the model name, but apart from that this PR is ready to be merged!
planner/src/main/java/com/dask/sql/parser/SqlCreateExperiment.java
class CreateExperimentPlugin(BaseRelPlugin):
    """
    Creates /initiate Experiment for hyperparameter tuning or Automl like behaviour,
    i.e evaluates model with different hyperparameters and registers the best performing
Suggested change:
- i.e evaluates model with different hyperparameters and registers the best performing
+ i.e evaluates models with different hyperparameters and registers the best performing
logger = logging.getLogger(__name__)


class CreateExperimentPlugin(BaseRelPlugin):
General comment (applies to this class and all the examples/documentation below): the model name is currently independent of the experiment name. Would you think it makes sense to change that, so that the best performing model has the same name as the experiment? From my view, that would make it easier for people to make the connection. As it stands, I fear that having multiple experiments will overwrite the output model.
In any case, we should mention in the documentation somewhere how the resulting model is named.
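The naming scheme proposed above can be sketched outside of dask-sql. In this hypothetical example (the `models` dict stands in for the engine's internal model registry; `run_experiment` is an illustrative helper, not the plugin's actual API), the best estimator found by a hyperparameter search is registered under the experiment's own name:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the model registry held by the SQL context.
models = {}

def run_experiment(experiment_name, X, y):
    """Run a grid search and register the best model under the experiment name."""
    search = GridSearchCV(DecisionTreeClassifier(), {"max_depth": [2, 3, 5]}, cv=3)
    search.fit(X, y)
    # Register under the experiment name, rather than a separate model name,
    # so the user can immediately do PREDICT(MODEL <experiment_name>, ...).
    models[experiment_name] = search.best_estimator_
    return search.best_score_

X, y = load_iris(return_X_y=True)
run_experiment("my_exp", X, y)
```

With this convention, each experiment name maps to exactly one (best) model, and re-running an experiment intentionally replaces its previous best model.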
Hi @nils-braun, thanks for taking the time to review the code and clearly explaining the bugs and typos.
I also like the idea of storing the best-performing model under the experiment name, and I have added that to the documentation.
Please review and let me know if any further changes are needed.
1. fixed typos in docs 2. Save best performing model in the name of the experiment
3. fixed test cases
That's in! Another very nice addition to the ML functionality.
This pull request adds a hyperparameter tuner and AutoML behaviour as part of #130.
(Added dask_ml and tpot in the GitHub action for test coverage.)
Some known issues so far:
- dask_ml (latest) requires dask_xgboost (latest), which in turn requires xgboost <= 0.90
- tpot 0.11.7 (latest) requires xgboost >= 1.1.0
So there is a conflict that breaks some hyperparameter tuning and some model prediction as well. As a temporary solution I downgraded to tpot==0.11.6.post1, which works fine as of now. Please let me know if there is a better solution to handle this situation.
As always, please provide your valuable feedback 😊
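The dependency conflict described above can be verified mechanically with the `packaging` library (assumed available; it ships with pip). The two version constraints on xgboost have an empty intersection, so no single release can satisfy both:

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

# dask_xgboost pins xgboost <= 0.90; tpot 0.11.7 needs xgboost >= 1.1.0.
dask_xgboost_pin = SpecifierSet("<=0.90")
tpot_latest_pin = SpecifierSet(">=1.1.0")

# Check a few real xgboost release numbers against both constraints:
candidates = [Version(v) for v in ["0.82", "0.90", "1.0.0", "1.1.0", "1.4.2"]]
compatible = [v for v in candidates if v in dask_xgboost_pin and v in tpot_latest_pin]
print(compatible)  # [] -- the requirement sets are disjoint
```

This is why pinning an older tpot (whose xgboost requirement still overlaps with dask_xgboost's) is the pragmatic workaround until the upstream pins move.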