Feature/automl #199

rajagurunath · 2021-07-03T12:40:56Z

This pull request adds a Hyperparameter tuner and automl behavior as part of #130

Uses dask_ml for hyperparameter tuning
tpot for automl

(Added dask_ml and tpot in github action for test coverage)

Some known issues so far:

Dependency problem
- The hyperparameter tuner requires dask_ml (latest), which requires dask_xgboost (latest), which in turn requires xgboost<=0.90.
- But tpot 0.11.7 (latest) requires xgboost>=1.1.0, so there is a conflict that breaks some hyperparameter tuning and some model predictions as well. As a temporary solution I downgraded to tpot==0.11.6.post1, which works fine as of now.
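
The conflict described in the two bullets can be illustrated with a toy constraint check. This is only a sketch: it handles plain numeric versions and the `<=`, `>=`, `==` operators, it is not how pip resolves dependencies, and the candidate list is illustrative rather than a full xgboost release history.

```python
import operator

# Supported comparison operators (assumption: the thread only uses <= and >=;
# == is included for convenience).
OPERATORS = {"<=": operator.le, ">=": operator.ge, "==": operator.eq}


def parse_version(text, width=3):
    """Turn '1.1.0' into (1, 1, 0), padding short versions such as '0.90'."""
    parts = [int(part) for part in text.split(".")]
    parts += [0] * (width - len(parts))
    return tuple(parts)


def satisfies(version, constraint):
    """Check one plain numeric version against one constraint like '>=1.1.0'."""
    for symbol, compare in OPERATORS.items():
        if constraint.startswith(symbol):
            bound = parse_version(constraint[len(symbol):])
            return compare(parse_version(version), bound)
    raise ValueError(f"unsupported constraint: {constraint!r}")


def compatible_versions(candidates, constraints):
    """Return the candidates that satisfy every constraint."""
    return [v for v in candidates if all(satisfies(v, c) for c in constraints)]


# Illustrative candidate list, not a complete release history.
xgboost_candidates = ["0.90", "1.1.0", "1.4.0"]

# Constraints as reported in the bullets above.
print(compatible_versions(xgboost_candidates, ["<=0.90", ">=1.1.0"]))  # []
print(compatible_versions(xgboost_candidates, ["<=0.90"]))  # ['0.90']
```

An empty result for the combined constraints is what "conflict" means here: no single xgboost version can satisfy both requirements at once.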

Please let me know if there is a better way to handle this situation.

As always please provide your valuable feedback 😊

Increase Speed (dask-contrib#29)

Author: rajagurunath <gurunathrajagopal@gmail.com> Date: Mon May 24 02:37:40 2021 +0530

1. SHOW MODELS 2. DESCRIBE MODEL 3. EXPORT MODEL

Feature/export model

Fetch upstream changes to resolve conflicts

Fix a failing build, as ciso8601 is currently not pip-installable (dask-contrib#192)

…main

…ls-braun-main

2. Added comments and documentation. 3. added tpot and dask-ml in github workflow

rajagurunath · 2021-07-03T13:08:25Z

I think the test cases are failing due to a dtype difference (float64 vs int64) in the data frames, and I suspect this is caused by the latest dask update 🤔 Any thoughts?

nils-braun · 2021-07-04T20:37:31Z

Concerning the xgboost dependency, please see my comment :-)

Regarding the code: it looks quite nice, and I will try to have a detailed look soon.

And to the CI/CD failure: it seems we are often spotting the new "features" of new dask versions :-) Well, that is the fate of having such a large and good pydata ecosystem (there are so many moving parts). I just triggered a re-run of the corresponding test pipeline on our current main branch - I expect it will also fail. In this case we can fix it on main first and then merge back here.

nils-braun · 2021-07-04T20:59:11Z

I did find the issue and have fixed it in #202. Once it is merged, feel free to also include it here. Hope this will help!

Update: Fix is in the main branch, @rajagurunath!

rajagurunath · 2021-07-06T18:30:38Z

Hi @nils-braun, I tried using the auto-detect feature, but it failed with the following error:

gaierror: [Errno 8] nodename nor servname provided, or not known

Is there anything I am missing here? Please let me know!

nils-braun · 2021-07-09T05:55:56Z

Hi @rajagurunath
I finally found some time to try it out myself. I am using slightly different versions of the libraries
(dask = 2021.06.2, xgboost = 1.4.0), but I do not think that matters (see below).
Some things I changed before playing around with the code:

We need to do one pre-processing step, as the iris table contains a string column (and xgboost does not know how to handle this):

CREATE OR REPLACE TABLE enriched_iris AS (
    SELECT 
        sepal_length, sepal_width, petal_length, petal_width,
        CASE 
            WHEN species = 'setosa' THEN 0 ELSE CASE 
            WHEN species = 'versicolor' THEN 1
            ELSE 2 
        END END AS "species"
    FROM iris 
)
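
As a quick sanity check of the label encoding above, the same CASE expression can be run against a tiny in-memory table. This sketch uses Python's built-in sqlite3 as a stand-in for dask-sql (an assumption made purely so the example runs without a cluster), three sample rows, and a hypothetical column alias `species_id`. It also compares the nested form with a flat multi-WHEN form; whether dask-sql accepts the flat form at this version is not confirmed here.

```python
import sqlite3

# Three sample rows, one per species; the real iris table is larger.
ROWS = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
]

# Nested CASE, as written in the comment above.
NESTED = """
SELECT CASE
    WHEN species = 'setosa' THEN 0 ELSE CASE
    WHEN species = 'versicolor' THEN 1
    ELSE 2
END END AS species_id
FROM iris ORDER BY rowid
"""

# Flat CASE with two WHEN branches, intended to be equivalent.
FLAT = """
SELECT CASE
    WHEN species = 'setosa' THEN 0
    WHEN species = 'versicolor' THEN 1
    ELSE 2
END AS species_id
FROM iris ORDER BY rowid
"""


def encode(query):
    """Load the sample rows into an in-memory table and run the query."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute(
            "CREATE TABLE iris (sepal_length REAL, sepal_width REAL, "
            "petal_length REAL, petal_width REAL, species TEXT)"
        )
        con.executemany("INSERT INTO iris VALUES (?, ?, ?, ?, ?)", ROWS)
        return [row[0] for row in con.execute(query)]
    finally:
        con.close()


print(encode(NESTED))  # [0, 1, 2]
print(encode(FLAT))    # [0, 1, 2]
```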

And we should not wrap the predict call, because xgboost already knows how to handle a dask dataframe:

CREATE OR REPLACE MODEL my_model WITH (
    model_class = 'xgboost.dask.DaskXGBClassifier',
    target_column = 'species'
) AS (
    SELECT * FROM "enriched_iris"
)

For me, that produces a nice xgboost model that I can also use for prediction:

SELECT
    *
FROM PREDICT(
    MODEL my_model,
    SELECT * FROM enriched_iris
)

Now, why did it work for me but not for you? The error message seems to point to a network problem. I guess the software wants to get the hostname for 127.0.0.1, which should be localhost (or the other way round). Normally, it uses the entry in the hosts file (for me, that is /etc/hosts, but I do not know which OS you are working with). I am wondering why it fails to resolve localhost... Could you find the hosts file for your OS and paste the lines related to localhost or 127.0.0.1 here (if any)?
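
To pull out just the relevant hosts-file lines, something like the following sketch could help. The function name `loopback_entries` and the sample text are made up for illustration; on a Unix-like system the text could come from reading /etc/hosts instead.

```python
# Tokens that mark a line as loopback-related (taken from the request above).
LOOPBACK_KEYS = ("localhost", "127.0.0.1")


def loopback_entries(hosts_text):
    """Return the hosts-file lines that mention localhost or 127.0.0.1.

    Comments (everything after '#') and blank lines are ignored.
    """
    entries = []
    for raw_line in hosts_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        if any(field in LOOPBACK_KEYS for field in line.split()):
            entries.append(line)
    return entries


# Hypothetical hosts-file content, for demonstration only.
sample_hosts = """\
# Host Database
127.0.0.1       localhost
255.255.255.255 broadcasthost
"""

for entry in loopback_entries(sample_hosts):
    print(entry)
```

Matching on whole whitespace-separated fields avoids false hits on names that merely contain "localhost" as a substring.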

1. Added `xgboost.dask.DaskXGBClassifier` in tests 2. Added tpot latest 3. Added ml experiment/export examples in notebook

rajagurunath · 2021-07-10T16:58:06Z

Hi @nils-braun

Thanks a lot for the detailed comments with a code example.

You are right in both cases:

Dependency conflict:
Actually, it's my mistake: dask-xgboost is an optional dependency of dask-ml. I got confused while executing the model dask_ml.xgboost.XGBClassifier in a notebook, which raised an ImportError for dask-xgboost, so I installed it, and the rest is history 😅. TL;DR: there are no dependency conflicts.

Network Error:
As you suggested, I added one entry to /etc/hosts on my laptop (macOS), namely 127.0.0.1 node_name, which solved the issue (i.e. the localhost address now resolves), and the auto-detect feature detects the existing dask client like a charm. However, I hit some hiccups when more than one dask-scheduler is available: the test fails with the status XFail (Worker failed to start). I am exploring this further!
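
A small check along these lines can confirm whether the machine's own hostname resolves after editing the hosts file. It is only a sketch: the thread does not say which call inside dask raised the gaierror, so using `socket.gethostbyname` on the local hostname is an assumption, and `check_resolution` is a hypothetical helper name.

```python
import socket


def check_resolution(name):
    """Try to resolve a name to an IPv4 address.

    Returns (True, address) on success and (False, error message) when the
    lookup raises socket.gaierror, the error type reported in this thread.
    """
    try:
        return True, socket.gethostbyname(name)
    except socket.gaierror as exc:
        return False, str(exc)


if __name__ == "__main__":
    # Check both the generic loopback name and this machine's own hostname.
    for name in ("localhost", socket.gethostname()):
        ok, detail = check_resolution(name)
        status = f"resolves to {detail}" if ok else f"FAILED: {detail}"
        print(f"{name!r}: {status}")
```

If the second line reports a failure, the hostname has no matching hosts-file or DNS entry, which is consistent with the fix described above.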

Once again thanks for your time, kindly review the changes and let me know if I missed something!

codecov-commenter · 2021-07-10T17:07:15Z

Codecov Report

Merging #199 (0b4949e) into main (ed6749d) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main      #199   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           60        61    +1     
  Lines         2314      2404   +90     
  Branches       317       329   +12     
=========================================
+ Hits          2314      2404   +90

Impacted Files	Coverage Δ
dask_sql/context.py	`100.00% <100.00%> (ø)`
dask_sql/physical/rel/custom/__init__.py	`100.00% <100.00%> (ø)`
dask_sql/physical/rel/custom/create_experiment.py	`100.00% <100.00%> (ø)`
dask_sql/physical/rel/logical/aggregate.py	`100.00% <0.00%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

nils-braun

Very good work. I have just found some small typos here and there that you can fix right from the GitHub web interface.
The only real comment I had was on the model name, but apart from that this PR is ready to be merged!