
Bc 295 lightgbm #1082

Merged
merged 39 commits into from Aug 26, 2020

Conversation

bchen1116 (Contributor) commented Aug 20, 2020

fix #295

Pull Request Description

Adds a LightGBMClassifier estimator component.
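
A minimal usage sketch of the new component (default parameters; the target is integer-encoded, since string-target support is discussed further down):

import pandas as pd
from evalml.pipelines.components import LightGBMClassifier

X = pd.DataFrame({"feat1": ["a", "a", "b", "b"], "feat2": [0, 0, 1, 1]})
y = pd.Series([1, 1, 0, 0])  # integer-encoded target

clf = LightGBMClassifier()
clf.fit(X, y)
predictions = clf.predict(X)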

codecov bot commented Aug 20, 2020

Codecov Report

Merging #1082 into main will not change coverage (+0.00%).
The diff coverage is 100.00%.


@@           Coverage Diff            @@
##             main    #1082    +/-   ##
========================================
  Coverage   99.91%   99.91%            
========================================
  Files         192      194     +2     
  Lines       10751    10957   +206     
========================================
+ Hits        10742    10948   +206     
  Misses          9        9            
Impacted Files Coverage Δ
evalml/pipelines/__init__.py 100.00% <ø> (ø)
evalml/pipelines/components/__init__.py 100.00% <ø> (ø)
evalml/pipelines/components/estimators/__init__.py 100.00% <ø> (ø)
evalml/model_family/model_family.py 100.00% <100.00%> (ø)
...ines/components/estimators/classifiers/__init__.py 100.00% <100.00%> (ø)
...ents/estimators/classifiers/lightgbm_classifier.py 100.00% <100.00%> (ø)
evalml/tests/component_tests/test_components.py 100.00% <100.00%> (ø)
...alml/tests/component_tests/test_lgbm_classifier.py 100.00% <100.00%> (ø)
evalml/tests/component_tests/test_utils.py 100.00% <100.00%> (ø)
...alml/tests/model_family_tests/test_model_family.py 100.00% <100.00%> (ø)
... and 3 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 037a6c9...225a5e5.

cb_model.json (outdated; resolved)
freddyaboulton (Contributor) left a comment

@bchen1116 Looks good! I have a question about the parameters that should be included in the init. I think this would be good to merge after that is resolved!

dsherry (Contributor) left a comment

@bchen1116 great job on this! There are a lot of details involved, especially since this is an optional estimator, and you hunted them all down. This is pretty close to being mergeable!

I left comments with requested changes and also with some discussion for context. Here are a few additional requests:

  • Please delete the junit file at the bottom and the model JSON at the top.
  • Please add "lightgbm" to this list. It's used by our core dependency unit tests. (There's no way you'd have found that, haha; it's totally buried.)
  • Please run all the tests in test_lgbm_classifier.py on your machine and post the total runtime. If it's more than a few seconds, we should talk briefly about ways to save on unit test runtime before merging.
  • Total runtime is 0.46s for 12 tests

cb_model.json (outdated; resolved)
evalml/model_family/model_family.py (resolved)
evalml/tests/component_tests/test_lgbm_classifier.py (outdated; resolved ×4)
dsherry (Contributor) left a comment

@bchen1116 this is really close!!

I left questions / suggestions on the implementation. And I think you need to split up your unit test cases to make each of them shorter and simpler. Once that's done, we can get this merged in!

bchen1116 requested a review from dsherry August 25, 2020 13:31
dsherry (Contributor) commented Aug 25, 2020

@bchen1116 to wrap up our recent discussion in slack:

I see what you were saying. Lightgbm, like catboost, requires the target to be encoded numerically, and doesn't work with string- or categorical-typed targets. My apologies for not catching this when you had the label encoder in place--I forgot we had to do this for catboost.

I verified this in a notebook:

import evalml
import lightgbm

# the target in this demo dataset is not numeric, which lightgbm rejects
X, y = evalml.demos.load_breast_cancer()
train_data = lightgbm.Dataset(X, label=y)
parameters = {"boosting_type": 'gbdt',
              "learning_rate": 0.1,
              "n_estimators": 100,
              "max_depth": 0,
              "num_leaves": 31,
              "min_child_samples": 20,
              "n_jobs": -1}
model = lightgbm.train(parameters, train_data)

produces

ValueError: Series.dtypes must be int, float or bool

And then as a test, passing in y.astype('category').cat.codes instead has train completing without an error.
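
For instance, reusing X, y, and parameters from the snippet above:

# encode the target as integer category codes before building the Dataset
train_data = lightgbm.Dataset(X, label=y.astype('category').cat.codes)
model = lightgbm.train(parameters, train_data)  # completes without error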

We have a few options here for how to proceed:

  1. Add your usage of the label encoder back into this PR. Add unit test coverage.
  2. Merge this PR without target encoding, just to get it out of the way, then put up another PR to update lightgbm to use the label encoder as is done in catboost. Multiple small PRs can be preferable to one larger one, to keep things moving along.
  3. If you'd rather move on to other issues, file this as a separate issue and we'll get to it later, before we add lightgbm to automl.

What would you like to do? I think option 2 is a good compromise, but it's your call.

bchen1116 (Contributor, Author) commented

@dsherry Yeah, I can merge this PR and create a new PR to update LightGBM with the LabelEncoder.

freddyaboulton (Contributor) left a comment

@bchen1116 Looks good! My only comment is that I think we should create a new OrdinalEncoder every time fit is called because right now a user can't use the same estimator on different datasets.

This can happen in a subsequent PR though.

# Encode the X input to be floats for all categorical components
X2 = pd.DataFrame(X).copy() if not isinstance(X, pd.DataFrame) else X.copy()
cat_cols = X2.select_dtypes(categorical_dtypes).columns
if not self._ordinal_encoder:
freddyaboulton (Contributor) commented:

I think we should set a new ordinal encoder within fit because right now a user can't call fit on different datasets with the same LightGBMClassifier

from evalml.pipelines.components import LightGBMClassifier
import pandas as pd


data = pd.DataFrame({"feat1": ["a", "a", "b", "b"], "feat2": [0, 0, 1, 1]})
y = pd.Series([1, 1, 0, 0])

gbm = LightGBMClassifier()

gbm.fit(data, y)

data2 = pd.DataFrame({"feature_1": [3, 3, 1, 1], "feature_2": ["c", "c", "d", "d"]})
gbm.fit(data2, y)

This would yield ValueError: Found unknown categories ['c', 'd'] in column 0 during transform.

I guess you could argue a user should create a new estimator for each dataset, but that's not clear from our docs, and it's not how sklearn works.

What do you think?
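
A minimal sketch of one possible fix, assuming _make_encodings lazily creates and fits the encoder when self._ordinal_encoder is unset, per the excerpt above (details are illustrative):

def fit(self, X, y=None):
    # reset the encoder on every call to fit; _make_encodings then fits a
    # fresh one, so the same instance can be refit on data with new categories
    self._ordinal_encoder = None
    X2 = self._make_encodings(X)
    return super().fit(X2, y)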

bchen1116 (Contributor, Author) replied:

Oh, that's a great point! I think this would be great to fix in the next PR for this, as there are some other changes it needs.

dsherry (Contributor) replied:

@freddyaboulton thanks for that! @bchen1116 and I discussed and he's gonna add a unit test for this, and update the impl to support it.

dsherry (Contributor) commented Aug 26, 2020

To recap discussion with @bchen1116 this morning, here's what's left:

  • Add a test for calling fit multiple times on different data (a minimal sketch follows this list)
  • Update the implementation to support that case
  • Delete the test_categorical_data_shuffle because it's redundant with test_categorical_data_subset
  • Update/replace test_categorical_data_subset to define training and test splits and expected values, rather than taking a slice of one dataset
    • My suggestion: make it so the test split has categories which appear in a different order, and then check that the mocked calls to fit/predict/predict_proba received input with the correct integer encoding
  • Merge!
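
A minimal sketch of the multiple-fit test, building on @freddyaboulton's example above (the test name and data are illustrative):

import pandas as pd
from evalml.pipelines.components import LightGBMClassifier


def test_multiple_fit():
    y = pd.Series([1, 1, 0, 0])
    clf = LightGBMClassifier()
    clf.fit(pd.DataFrame({"feats": ["a", "a", "b", "b"]}), y)
    # once the encoder is recreated inside fit, refitting on data with
    # entirely new categories should succeed instead of raising ValueError
    clf.fit(pd.DataFrame({"feats": ["c", "c", "d", "d"]}), y)
    clf.predict(pd.DataFrame({"feats": ["c", "c", "d", "d"]}))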

dsherry (Contributor) left a comment

@bchen1116 great work! 🎊

Nothing is blocking merge at this point :) I left a suggestion for test_multiple_fit which we should add in a separate PR if you prefer that. I also left a suggestion on how to simplify the implementation organization a bit, which we should add if you agree on it.

        return X2

    def fit(self, X, y=None):
        X2 = self._make_encodings(X)
dsherry (Contributor) commented:

@bchen1116 yep this works great. Thanks.

For after merge, can do in a separate PR: I'm now seeing that your previous idea of having the if/else with fit_transform vs transform leads to a cleaner organization! I have a suggestion based on that.

  • Delete _convert_to_dataframe
  • Delete _make_encodings
  • Define _encode_categories like so (note: cat_cols should be selected after the columns are renamed, so the integer labels are the ones indexed; assumes copy, numpy as np, and pandas as pd are imported):
    def _encode_categories(self, X, fit=False):
        X2 = pd.DataFrame(copy.copy(X))
        # wipe out column names in case any contain symbols ([, ], <)
        # which LightGBM cannot properly handle
        X2.columns = np.arange(len(X2.columns))
        cat_cols = X2.select_dtypes(categorical_dtypes).columns
        # encode each categorical feature as an integer
        if fit:
            encoder_output = self._ordinal_encoder.fit_transform(X2[cat_cols])
        else:
            encoder_output = self._ordinal_encoder.transform(X2[cat_cols])
        X2[cat_cols] = encoder_output
        X2[cat_cols] = X2[cat_cols].astype('category')
        return X2

    def fit(self, X, y=None):
        X2 = self._encode_categories(X, fit=True)
        return super().fit(X2, y)

    def predict(self, X):
        X2 = self._encode_categories(X)
        return super().predict(X2)
...

What do you think?

bchen1116 (Contributor, Author) replied:

Yeah, I think this would be a good fix. We can incorporate this when we add in the LabelEncoder.

dsherry (Contributor) commented Aug 26, 2020

@bchen1116 let's do the following here:

  • Please file an issue to track adding support for string-type targets to the lightgbm component (i.e. usage of LabelEncoder like we did for catboost)
  • Let's resolve our discussion of the unit testing. I'll respond to your comment momentarily.
  • Merge this PR.
  • Please get another PR up which addresses the comments I left about updating the implementation and the unit test.
  • We can then figure out who will work on the string-type targets issue and when it'll get done.

Sound good?

bchen1116 merged commit 89adab1 into main Aug 26, 2020
bchen1116 deleted the bc_295_lightgbm branch September 10, 2020 16:27
Development: successfully merging this pull request may close: Add LightGBM classification component (#295).
5 participants