Address MultiIndex problem with LightGBM #1770

Merged
angela97lin merged 20 commits into main from 1710_multiindex_lightgbm
Feb 4, 2021
Contributor

@angela97lin angela97lin commented Jan 31, 2021

Closes #1710

Added context on the issue here: #1710 (comment). Not sure if this is the best way to solve this issue so would love any thoughts and suggestions :')

@angela97lin angela97lin self-assigned this Jan 31, 2021
codecov bot commented Jan 31, 2021

Codecov Report

Merging #1770 (e1a3d89) into main (42e01f3) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@            Coverage Diff            @@
##             main    #1770     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         247      247             
  Lines       19618    19679     +61     
=========================================
+ Hits        19609    19670     +61     
  Misses          9        9             
Impacted Files Coverage Δ
...ents/estimators/classifiers/lightgbm_classifier.py 100.0% <100.0%> (ø)
...nents/estimators/classifiers/xgboost_classifier.py 100.0% <100.0%> (ø)
...onents/estimators/regressors/lightgbm_regressor.py 100.0% <100.0%> (ø)
...ponents/estimators/regressors/xgboost_regressor.py 100.0% <100.0%> (ø)
...alml/tests/component_tests/test_lgbm_classifier.py 100.0% <100.0%> (ø)
...valml/tests/component_tests/test_lgbm_regressor.py 100.0% <100.0%> (ø)
...l/tests/component_tests/test_xgboost_classifier.py 100.0% <100.0%> (ø)
...ml/tests/component_tests/test_xgboost_regressor.py 100.0% <100.0%> (ø)
evalml/utils/gen_utils.py 99.6% <100.0%> (+0.1%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 42e01f3...e1a3d89. Read the comment docs.

X_encoded = _rename_column_names_to_numeric(X_encoded)
cat_cols = list(X_encoded.select('category').columns)
X_encoded = _convert_woodwork_types_wrapper(X_encoded.to_dataframe())
X = _convert_to_woodwork_structure(X)
Contributor Author

Blargh. I'd like to combine the LightGBM classifier and regressor, since they share a lot of the same core logic. Probably better to file a separate issue than to address it here though 😬


X_encoded = _rename_column_names_to_numeric(X)
rename_cols_dict = dict(zip(X.columns, X_encoded.columns))
cat_cols = [rename_cols_dict[col] for col in cat_cols]
Contributor Author

Since we moved this to after the conversion to a DataFrame (we need a DataFrame to set column names easily for tuples), we need to find the mapping from old to new names and update cat_cols to match.
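A minimal sketch of the remapping step described above, using plain pandas. `rename_columns_to_numeric` is a hypothetical stand-in for `_rename_column_names_to_numeric`, whose actual body is not shown in this thread, and the column names are made up for illustration.

```python
import pandas as pd


def rename_columns_to_numeric(X):
    """Hypothetical stand-in: replace column names with 0..n-1."""
    X_renamed = X.copy()
    X_renamed.columns = range(X_renamed.shape[1])
    return X_renamed


X = pd.DataFrame({"color": ["r", "g"], "size": [1, 2], "shape": ["sq", "ci"]})
cat_cols = ["color", "shape"]  # categorical columns, identified by their old names

X_encoded = rename_columns_to_numeric(X)

# Pair each old name with its new numeric name (column order is preserved),
# then translate the categorical column list into the new names.
rename_cols_dict = dict(zip(X.columns, X_encoded.columns))
cat_cols = [rename_cols_dict[col] for col in cat_cols]
```

The mapping relies on the rename preserving column order, which is why it is built by zipping the two column indexes together.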

else:
    X_t = X.copy()

if len(X_t.columns) > 0 and isinstance(X_t.columns[0], tuple):
Contributor Author

Handling tuples. Working with WW directly doesn't allow renames unless it is passed another tuple, so this converts each tuple to a string to 'flatten' it. Ex: ('a', 1) --> "('a', 1)", which turns tuple MultiIndex columns into string columns.
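A rough illustration of the flattening described in this comment, assuming plain pandas and `str()` as the conversion (the exact code in the diff is only partly visible here, so treat this as a sketch rather than the merged implementation):

```python
import pandas as pd

# Build the MultiIndex explicitly so the starting point is unambiguous.
columns = pd.MultiIndex.from_tuples([("a", 1), ("b", 2)])
X_t = pd.DataFrame([[1, 2], [3, 4]], columns=columns)

# Same guard as in the diff: the first column label is a tuple.
if len(X_t.columns) > 0 and isinstance(X_t.columns[0], tuple):
    # str(('a', 1)) gives "('a', 1)", so each tuple becomes one flat string label.
    X_t.columns = [str(col) for col in X_t.columns]
```

After this runs, the columns are ordinary string labels and can be renamed like any other column.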

Contributor

I think the implementation here looks good! I would vote for isinstance(X_t.columns, pd.MultiIndex) to make it clearer what we're guarding against.

Looks like specifying tuples for column names automatically creates a MultiIndex:

image

Contributor Author

Yupperino, agreed that it'll make it more clear. Will update 😁

Contributor

@freddyaboulton left a comment

@angela97lin Looks good to me!

I think the fix you have here makes sense. The other thing that comes to mind is converting X to numpy in the XGBoost and LightGBM fit methods rather than having to make all these conversions to the dataframe.

That would require changing more code but may be better long term if there isn't a reason that we need to pass a dataframe to xgboost or lightgbm.



@pytest.mark.parametrize("data_type", ['pd', 'ww'])
def test_lightgbm_multiindex(data_type, X_y_regression, make_data_type):
Contributor

Should we also test the xgboost estimators?

Contributor Author

Ah yes! I thought that since xgboost can handle tuples that the tests wouldn't be necessary, but since it uses the _rename_column_names_to_numeric method it was impacted and I had to tweak some stuff. Thank you!!

@angela97lin
Contributor Author

@freddyaboulton Ooo, that's a really interesting idea! I wonder what that would mean for the LightGBM estimators specifically. Right now, we take advantage of LightGBM's 'auto' setting to determine categorical columns, which requires a pandas DataFrame (and us making sure that the categorical cols as determined by WW are set to category dtype). See https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier. If we converted to numpy, we'd have to manually determine the mapping from col name to integer index for the numpy array and pass that along (assuming the dtypes also don't get funky). Could be an interesting alternative to explore though if we see more problems like this crop up / more edge cases hehe.
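For reference, a sketch of what the numpy route might involve, using pandas only. The column names are made up, and encoding categories as integer codes is an assumption about how the array would be prepared, not something decided in this PR.

```python
import pandas as pd

X = pd.DataFrame({
    "color": pd.Categorical(["r", "g", "r"]),
    "size": [1.0, 2.0, 3.0],
    "shape": pd.Categorical(["sq", "ci", "sq"]),
})

# Integer positions of the category-dtype columns. With a numpy array these
# would have to be passed along explicitly, since dtype info is lost.
cat_positions = [
    i for i, dtype in enumerate(X.dtypes)
    if isinstance(dtype, pd.CategoricalDtype)
]

# Assumption: replace each categorical column with its integer codes so the
# whole frame converts to a single numeric array.
X_codes = X.copy()
for name in X.columns[cat_positions]:
    X_codes[name] = X_codes[name].cat.codes
X_np = X_codes.to_numpy()
```

Presumably `cat_positions` would then be handed to the estimator's `categorical_feature` fit argument in place of 'auto'. This is the extra bookkeeping the comment refers to.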

Contributor

@bchen1116 left a comment

LGTM! I agree with leaving it as a pandas input so that we can preserve categorical columns for LightGBM.

@angela97lin angela97lin merged commit 07f5d73 into main Feb 4, 2021
@angela97lin angela97lin deleted the 1710_multiindex_lightgbm branch February 4, 2021 18:10
@ParthivNaresh ParthivNaresh mentioned this pull request Feb 9, 2021
Development

Successfully merging this pull request may close these issues.

Multiindex Lightgbm problem

3 participants