Address MultiIndex problem with LightGBM #1770
Conversation
Codecov Report

```diff
@@           Coverage Diff            @@
##             main    #1770    +/-  ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%
=========================================
  Files         247      247
  Lines       19618    19679     +61
=========================================
+ Hits        19609    19670     +61
  Misses          9        9
=========================================
```

Continue to review full report at Codecov.
```python
X_encoded = _rename_column_names_to_numeric(X_encoded)
cat_cols = list(X_encoded.select('category').columns)
X_encoded = _convert_woodwork_types_wrapper(X_encoded.to_dataframe())
X = _convert_to_woodwork_structure(X)
```
Blargh. I'd like to combine the lightgbm classifier and regressor, since they share a lot of the same core logic. Probably better to file a separate issue than to address here, though 😬
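A possible shape for that refactor (hypothetical class and method names, not evalml's actual API) could be a shared base class:

```python
# Hypothetical sketch of factoring shared LightGBM logic into a base class;
# all names here are illustrative, not evalml's actual API.
class LightGBMBase:
    def _encode_categoricals(self, X):
        # Shared category-encoding / column-renaming logic would live here.
        return X

class LightGBMClassifierSketch(LightGBMBase):
    pass

class LightGBMRegressorSketch(LightGBMBase):
    pass
```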
```python
X_encoded = _rename_column_names_to_numeric(X)
rename_cols_dict = dict(zip(X.columns, X_encoded.columns))
cat_cols = [rename_cols_dict[col] for col in cat_cols]
```
Since we moved this to after the conversion to dataframe (we need a dataframe to set column names easily for tuples), we need to find the mapping from old to new names and update `cat_cols` accordingly.
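A minimal, self-contained illustration of that old-to-new mapping, using made-up data and a stand-in for `_rename_column_names_to_numeric`:

```python
import pandas as pd

X = pd.DataFrame({"color": ["r", "g"], "size": [1, 2]})
cat_cols = ["color"]

# Stand-in for _rename_column_names_to_numeric: map every column to its position.
X_encoded = X.rename(columns={name: i for i, name in enumerate(X.columns)})

# Recover which encoded columns are categorical via the old -> new name mapping.
rename_cols_dict = dict(zip(X.columns, X_encoded.columns))
cat_cols = [rename_cols_dict[col] for col in cat_cols]
print(cat_cols)  # [0]
```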
evalml/utils/gen_utils.py
```python
else:
    X_t = X.copy()

if len(X_t.columns) > 0 and isinstance(X_t.columns[0], tuple):
```
Handling tuples here. Working with WW directly doesn't allow renames unless you pass another tuple, so this converts tuple column names to strings to 'flatten' them. Ex: `('a', 1)` --> `"('a', 1)"`, converting tuple multi-index columns to string columns.
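As a quick sketch of that flattening (the function name is illustrative, not evalml's):

```python
import pandas as pd

def flatten_tuple_columns(X):
    """If columns are tuples (e.g. from a MultiIndex), replace each with its
    string repr, e.g. ('a', 1) -> "('a', 1)". Illustrative name, not evalml's API."""
    X = X.copy()
    if len(X.columns) > 0 and isinstance(X.columns[0], tuple):
        X.columns = [str(col) for col in X.columns]
    return X

df = pd.DataFrame([[1, 2]], columns=pd.MultiIndex.from_tuples([("a", 1), ("b", 2)]))
flat = flatten_tuple_columns(df)
print(list(flat.columns))  # ["('a', 1)", "('b', 2)"]
```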
Yupperino, agreed that it'll make it more clear. Will update 😁
freddyaboulton left a comment
@angela97lin Looks good to me!
I think the fix you have here makes sense. The other thing that comes to mind is converting X to numpy before the XGBoost and LightGBM fit, rather than having to make all these conversions on the dataframe.
That would require changing more code, but it may be better long term if there isn't a reason we need to pass a dataframe to xgboost or lightgbm.
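For reference, a tiny sketch of that alternative: numpy arrays carry no column labels, so the MultiIndex naming problem disappears once converted (the example data is made up):

```python
import pandas as pd

# A MultiIndex-columned frame like the one that triggered the issue.
X = pd.DataFrame([[1, 2]], columns=pd.MultiIndex.from_tuples([("a", 1), ("b", 2)]))

# Converting to numpy drops column labels entirely, so tuple names can't
# trip up the estimator -- at the cost of losing dtype info like 'category'.
X_arr = X.to_numpy()
print(X_arr.shape)  # (1, 2)
```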
```python
@pytest.mark.parametrize("data_type", ['pd', 'ww'])
def test_lightgbm_multiindex(data_type, X_y_regression, make_data_type):
```
Should we also test the xgboost estimators?
Ah yes! I thought that since xgboost can handle tuples, the tests wouldn't be necessary, but since it uses the `_rename_column_names_to_numeric` method it was impacted, and I had to tweak some stuff. Thank you!!
@freddyaboulton Ooo, that's a really interesting idea! I wonder what that would mean for the LightGBM estimators specifically. Right now, we take advantage of the pandas `category` dtype to pass categorical columns to LightGBM.
bchen1116 left a comment
LGTM! I agree with leaving it as a pandas input so that we can preserve categorical columns for LightGBM.
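A small illustration of why the pandas input matters here: the `category` dtype survives in a DataFrame but is lost in a plain numpy conversion (the example data is made up):

```python
import pandas as pd

# A categorical column as a DataFrame keeps its dtype; converting the frame
# to numpy collapses it to plain objects, losing the category information.
X = pd.DataFrame({"color": pd.Series(["r", "g", "r"], dtype="category")})
print(X["color"].dtype)    # category
print(X.to_numpy().dtype)  # object -- the categorical dtype is gone
```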

Closes #1710
Added context on the issue here: #1710 (comment). Not sure if this is the best way to solve this issue, so I would love any thoughts and suggestions :')