-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add LSA Component #1022
Add LSA Component #1022
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1022 +/- ##
========================================
Coverage 99.90% 99.91%
========================================
Files 181 183 +2
Lines 9998 10143 +145
========================================
+ Hits 9989 10134 +145
Misses 9 9
Continue to review full report at Codecov.
|
if len(text_columns) == 0: | ||
warnings.warn("No text columns were given to TextFeaturizer, component will have no effect", RuntimeWarning) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved this warning from __init__
to fit
to temporarily resolve #1017
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! Left a few non-blocking nit-picky comments :)
'LSA(col_2)[1]']) | ||
X_t = lsa.transform(X) | ||
assert set(X_t.columns) == expected_col_names | ||
assert len(X_t.columns) == 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit-pick: I feel like this line is covered by set(X_t.columns) == expected_col_names
so maybe not necessary? (same with other tests!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought so as well at first, but this line actually helped me catch a bug yesterday! Since we take the set of X_t.columns
, any columns with duplicate names will not cause that line to fail -- checking the number of columns explicitly prevents that from slipping through the cracks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooo huh, I didn't even know duplicate names were allowed but makes sense! 😊
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, @angela97lin , you can do fancy stuff in pandas.
df = pd.DataFrame(data=np.array([[1, 1], [2, 2], [3, 3]]), columns=['a', 'a'])
produces a df with two columns which happen to have the same name, although they occupy different positions in the column index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For these tests, let's do a direct comparison of the column names:
expected_col_names = np.array(...) # expected str values
np.testing.assert_equal(X_t.columns, expected_col_names)
This has the added benefit of covering the column name order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, the column order as outputted by featuretools changes, and as far as I can tell there's no option to fix it. @dsherry would you rather I enforce a column order by sorting, say, alphabetically, or leave this test as is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, that's good to know. Your call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to leave this as is, since enforcing an order makes the test bulkier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eccabay looks pretty close! I left a bunch of questions and some suggestions.
A few of my questions and comments have to do with the conversion of the feature names to / from str
, and the way we're indexing into the input DFs. Unless there's a detail I'm missing so far, I don't think we need any conversion. Its good that we're validating that the provided feature names exist in the input DFs, but I think we can assume whatever format the DF column index is in, the feature names will be provided in that format, str, int or whatever else. LMK if you want to talk this through rather than responding via text.
I also left a comment about the warnings which we should resolve, ideally before merging.
More unit test to add:
- Input DF has two features with the same name
- Input DF has non-str column names, i.e.
df = pd.DataFrame(data=np.zeros((1, 4)), columns=[0, 1, 42, -1000])
X_t = X_t.drop(labels=int(col), axis=1) | ||
|
||
X_t['LSA({})[0]'.format(col)] = pd.Series(transformed[:, 0]) | ||
X_t['LSA({})[1]'.format(col)] = pd.Series(transformed[:, 1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not blocking: what do you think of doing this for the naming: LSA(my_feature, 0)
and LSA(my_feature, 1)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it! I only kept this formatting to mirror what the primitives' generated column names look like, but I can change this if you'd prefer.
'LSA(col_2)[1]']) | ||
X_t = lsa.transform(X) | ||
assert set(X_t.columns) == expected_col_names | ||
assert len(X_t.columns) == 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, @angela97lin , you can do fancy stuff in pandas.
df = pd.DataFrame(data=np.array([[1, 1], [2, 2], [3, 3]]), columns=['a', 'a'])
produces a df with two columns which happen to have the same name, although they occupy different positions in the column index.
'LSA(col_2)[1]']) | ||
X_t = lsa.transform(X) | ||
assert set(X_t.columns) == expected_col_names | ||
assert len(X_t.columns) == 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For these tests, let's do a direct comparison of the column names:
expected_col_names = np.array(...) # expected str values
np.testing.assert_equal(X_t.columns, expected_col_names)
This has the added benefit of covering the column name order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! I left one comment about addressing the str-to-int column name conversion and the transform
try/except in a separate PR. I also didn't see coverage for the two cases I mentioned previously:
- Input DF has two features with the same name
- Input DF has non-str column names, i.e. df = pd.DataFrame(data=np.zeros((1, 4)), columns=[0, 1, 42, -1000])
Should fix #940 and close #980 by moving the implementation of LSA into evalml, instead of making the changes within nlp_primitives. The new LSA component can function as an independent component but is also called within the TextFeaturizer component to maintain its previous behavior.