Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/automl #199

Merged
merged 35 commits into from
Jul 16, 2021
Merged

Feature/automl #199

merged 35 commits into from
Jul 16, 2021

Conversation

rajagurunath
Copy link
Collaborator

@rajagurunath rajagurunath commented Jul 3, 2021

This pull request adds a Hyperparameter tuner and automl behavior as part of #130

  1. Uses dask_ml for hyperparameter tuning
  2. tpot for automl

(Added dask_ml and tpot in github action for test coverage)

some known issues so far :

  1. Dependency problem
    • Hyperparameter tuner requires dask_ml (latest) which requires dask_xgboost (latest) which inturn requires xgboost<=0.90
    • But tpot 0.11.7 (latest) requires xgboost>=1.1.0 so there is a conflict that breaks some hyperparameter tuning and some model prediction as well, so a temporary solution downgraded the tpot==0.11.6.post1 which works fine as of now.

Please let me know if there is any better solution to handle this situation .

As always please provide your valuable feedback 😊

rajagurunath and others added 30 commits September 22, 2020 22:43
Author:    rajagurunath <gurunathrajagopal@gmail.com>
Date:      Mon May 24 02:37:40 2021 +0530
1. SHOW MODELS
2. DESCRIBE MODEL
3. EXPORT MODEL
Fetch upstream changes to resolve conflicts
Fix a failing build, as ciso8601 is currently not pip-installable (dask-contrib#192)
2. Added comments and documentation.
3. added tpot and dask-ml in github workflow
@rajagurunath rajagurunath mentioned this pull request Jul 3, 2021
4 tasks
@rajagurunath
Copy link
Collaborator Author

I think the test cases are failing due to the dtype difference (float64 vs int64) in data frames and I suspect this may be due to the latest dask update 🤔 Any thoughts ?

@nils-braun
Copy link
Collaborator

Concerning the xgboost dependency, please see my comment :-)

Regarding the code! Looks quite nice, I will try to have a detailed look soon.

And to the CI/CD failure: it seems we are often spotting the new "features" of new dask versions :-) Well, that is the fate of having such a large and good pydata ecosystem (there are so many moving parts). I just triggered a re-run of the corresponding test pipeline on our current main branch - I expect it will also fail. In this case we can fix it on main first and then merge back here.

@nils-braun nils-braun mentioned this pull request Jul 4, 2021
@nils-braun
Copy link
Collaborator

nils-braun commented Jul 4, 2021

I did find the issue and have fixed it in #202. Once it is merged, feel free to also include it here. Hope this will help!

Update: Fix is in the main branch, @rajagurunath!

@rajagurunath
Copy link
Collaborator Author

Hi @nils-braun, I tried using the auto-detect feature, but failed with following error:

gaierror: [Errno 8] nodename nor servname provided, or not known

image

Anything I am missing here? Please let me know !

@nils-braun
Copy link
Collaborator

nils-braun commented Jul 9, 2021

Hi @rajagurunath
I finally found some time to try it our by myself. I am using slightly different versions of the libraries
(ask = '2021.06.2, xgboost = 1.4.0), but I do not think that matters (see below).
Some things I changed before playing around with the code:

  • We need to do one pre-processing step, as the iris table contains a string column (and xgboost does not know how to handle this):
CREATE OR REPLACE TABLE enriched_iris AS (
    SELECT 
        sepal_length, sepal_width, petal_length, petal_width,
        CASE 
            WHEN species = 'setosa' THEN 0 ELSE CASE 
            WHEN species = 'versicolor' THEN 1
            ELSE 2 
        END END AS "species"
    FROM iris 
)
  • And we should not wrap the predict call, because xgboost already knows how to handle a dask dataframe:
CREATE OR REPLACE MODEL my_model WITH (
    model_class = 'xgboost.dask.DaskXGBClassifier',
    target_column = 'species'
) AS (
    SELECT * FROM "enriched_iris"
)

For me, that will produce a nice xgboost model, that I can also use for prediction:

SELECT
    *
FROM PREDICT(
    MODEL my_model,
    SELECT * FROM enriched_iris
)

Now - why did it work for me but not for you? The error message seems to point to some network problems. I guess the software wants to get the hostname from 127.0.0.1, which should be localhost (or the other way round). Normally, it uses the entry in the hosts file (for me, that is /etc/hosts, but I do not know which OS you are working with). I am wondering why it fails to resolve localhost... Could you try to find the hosts-file for your OS and paste the lines related to localhost or 127.0.0.1 here (if any)?

1. Added `xgboost.dask.DaskXGBClassifier` in tests
2. Added tpot latest
3. Added ml experiment/export examples in notebook
@rajagurunath
Copy link
Collaborator Author

Hi @nils-braun

Thanks a lot for the detailed comments with a code example:

you are right in both cases:

Dependency conflict:
Actually, it's my mistake, dask-xgboost is an optional dependency for dask-ml, I just got confused while executing the model dask_ml.xgboost.XGBClassifier in a notebook (which raised ImportError for dask-xgboost, so I installed it and everything that happened later was history 😅 ). TLDR, there are no dependency conflicts.

Network Error:
AS you have suggested I have added one entry in my laptop (macOS) in /etc/hosts (127.0.0.1 node_name ) which solved this issue (i.e resolved localhost address). and auto-detect feature detects the existing dask client, like a charm. (but faced some hiccups when there are more than one dask-scheduler is available, the test fails with the status XFails (Worker failed to start)- exploring more on this !).

Once again thanks for your time, kindly review the changes and let me know if I missed something!

@codecov-commenter
Copy link

codecov-commenter commented Jul 10, 2021

Codecov Report

Merging #199 (0b4949e) into main (ed6749d) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##              main      #199   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           60        61    +1     
  Lines         2314      2404   +90     
  Branches       317       329   +12     
=========================================
+ Hits          2314      2404   +90     
Impacted Files Coverage Δ
dask_sql/context.py 100.00% <100.00%> (ø)
dask_sql/physical/rel/custom/__init__.py 100.00% <100.00%> (ø)
dask_sql/physical/rel/custom/create_experiment.py 100.00% <100.00%> (ø)
dask_sql/physical/rel/logical/aggregate.py 100.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ed6749d...0b4949e. Read the comment docs.

Copy link
Collaborator

@nils-braun nils-braun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good work. I have just found some small typos here and there that you can fix right from the GitHub web interface.
The only real comment I had was on the model name, but apart from that this PR is ready to be merged!

dask_sql/physical/rel/custom/create_experiment.py Outdated Show resolved Hide resolved
docs/pages/machine_learning.rst Outdated Show resolved Hide resolved
docs/pages/machine_learning.rst Outdated Show resolved Hide resolved
docs/pages/machine_learning.rst Outdated Show resolved Hide resolved
docs/pages/machine_learning.rst Outdated Show resolved Hide resolved
dask_sql/physical/rel/custom/create_experiment.py Outdated Show resolved Hide resolved
class CreateExperimentPlugin(BaseRelPlugin):
"""
Creates /initiate Experiment for hyperparameter tuning or Automl like behaviour,
i.e evaluates model with different hyperparameters and registers the best performing
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
i.e evaluates model with different hyperparameters and registers the best performing
i.e evaluates models with different hyperparameters and registers the best performing

dask_sql/physical/rel/custom/create_experiment.py Outdated Show resolved Hide resolved
logger = logging.getLogger(__name__)


class CreateExperimentPlugin(BaseRelPlugin):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General comment (applies to this class and all the examples/documentation below): the model name is currently independent of the experiment name. Would you think it makes sense to change that, so that the best performing model has the same name as the experiment? From my view, that would make it easier for people to do the connection. Because now I fear that having multiple experiments will override the output model.
In any case, we should mention in the documentation somewhere how the resulting model is named.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @nils-braun, Thanks for taking the time to review the code and clearly explaining the bugs and typos.

I also like the idea of storing the best-performing model with the experiment name. and added the same in the documentation.

Please review and let me know if any further changes are needed

1. fixed typos in docs
2. Save best performing model in the name of the experiment
3. fixed test cases
@nils-braun nils-braun merged commit 3815a49 into dask-contrib:main Jul 16, 2021
@nils-braun
Copy link
Collaborator

That's in! Another very nice addition to the ML functionality

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants