Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add LSA Component #1022

Merged
merged 13 commits into from Aug 11, 2020
Merged

Add LSA Component #1022

merged 13 commits into from Aug 11, 2020

Conversation

eccabay
Copy link
Contributor

@eccabay eccabay commented Aug 5, 2020

Should fix #940 and close #980 by moving the implementation of LSA into evalml, instead of making the changes within nlp_primitives. The new LSA component can function as an independent component but is also called within the TextFeaturizer component to maintain its previous behavior.

@codecov
Copy link

codecov bot commented Aug 5, 2020

Codecov Report

Merging #1022 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff            @@
##             main    #1022    +/-   ##
========================================
  Coverage   99.90%   99.91%            
========================================
  Files         181      183     +2     
  Lines        9998    10143   +145     
========================================
+ Hits         9989    10134   +145     
  Misses          9        9            
Impacted Files Coverage Δ
evalml/pipelines/components/__init__.py 100.00% <ø> (ø)
...alml/pipelines/components/transformers/__init__.py 100.00% <100.00%> (ø)
.../components/transformers/preprocessing/__init__.py 100.00% <100.00%> (ø)
...lines/components/transformers/preprocessing/lsa.py 100.00% <100.00%> (ø)
...ents/transformers/preprocessing/text_featurizer.py 100.00% <100.00%> (ø)
evalml/tests/component_tests/test_lsa.py 100.00% <100.00%> (ø)
...alml/tests/component_tests/test_text_featurizer.py 100.00% <100.00%> (ø)
evalml/tests/component_tests/test_utils.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 49556bb...14c50ac. Read the comment docs.

@eccabay eccabay marked this pull request as ready for review August 5, 2020 15:35
if len(text_columns) == 0:
warnings.warn("No text columns were given to TextFeaturizer, component will have no effect", RuntimeWarning)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this warning from __init__ to fit to temporarily resolve #1017

Copy link
Contributor

@angela97lin angela97lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Left a few non-blocking nit-picky comments :)

'LSA(col_2)[1]'])
X_t = lsa.transform(X)
assert set(X_t.columns) == expected_col_names
assert len(X_t.columns) == 4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit-pick: I feel like this line is covered by set(X_t.columns) == expected_col_names so maybe not necessary? (same with other tests!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought so as well at first, but this line actually helped me catch a bug yesterday! Since we take the set of X_t.columns, any columns with duplicate names will not cause that line to fail -- checking the number of columns explicitly prevents that from slipping through the cracks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooo huh, I didn't even know duplicate names were allowed but makes sense! 😊

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, @angela97lin , you can do fancy stuff in pandas.

df = pd.DataFrame(data=np.array([[1, 1], [2, 2], [3, 3]]), columns=['a', 'a'])

produces a df with two columns which happen to have the same name, although they occupy different positions in the column index.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these tests, let's do a direct comparison of the column names:

expected_col_names = np.array(...) # expected str values
np.testing.assert_equal(X_t.columns, expected_col_names)

This has the added benefit of covering the column name order.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, the column order as outputted by featuretools changes, and as far as I can tell there's no option to fix it. @dsherry would you rather I enforce a column order by sorting, say, alphabetically, or leave this test as is?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that's good to know. Your call.

Copy link
Contributor Author

@eccabay eccabay Aug 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to leave this as is, since enforcing an order makes the test bulkier.

Copy link
Collaborator

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eccabay looks pretty close! I left a bunch of questions and some suggestions.

A few of my questions and comments have to do with the conversion of the feature names to / from str, and the way we're indexing into the input DFs. Unless there's a detail I'm missing so far, I don't think we need any conversion. Its good that we're validating that the provided feature names exist in the input DFs, but I think we can assume whatever format the DF column index is in, the feature names will be provided in that format, str, int or whatever else. LMK if you want to talk this through rather than responding via text.

I also left a comment about the warnings which we should resolve, ideally before merging.

More unit test to add:

  • Input DF has two features with the same name
  • Input DF has non-str column names, i.e. df = pd.DataFrame(data=np.zeros((1, 4)), columns=[0, 1, 42, -1000])

docs/source/release_notes.rst Show resolved Hide resolved
evalml/tests/component_tests/test_text_featurizer.py Outdated Show resolved Hide resolved
X_t = X_t.drop(labels=int(col), axis=1)

X_t['LSA({})[0]'.format(col)] = pd.Series(transformed[:, 0])
X_t['LSA({})[1]'.format(col)] = pd.Series(transformed[:, 1])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not blocking: what do you think of doing this for the naming: LSA(my_feature, 0) and LSA(my_feature, 1) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it! I only kept this formatting to mirror what the primitives' generated column names look like, but I can change this if you'd prefer.

'LSA(col_2)[1]'])
X_t = lsa.transform(X)
assert set(X_t.columns) == expected_col_names
assert len(X_t.columns) == 4
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, @angela97lin , you can do fancy stuff in pandas.

df = pd.DataFrame(data=np.array([[1, 1], [2, 2], [3, 3]]), columns=['a', 'a'])

produces a df with two columns which happen to have the same name, although they occupy different positions in the column index.

'LSA(col_2)[1]'])
X_t = lsa.transform(X)
assert set(X_t.columns) == expected_col_names
assert len(X_t.columns) == 4
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these tests, let's do a direct comparison of the column names:

expected_col_names = np.array(...) # expected str values
np.testing.assert_equal(X_t.columns, expected_col_names)

This has the added benefit of covering the column name order.

evalml/tests/component_tests/test_lsa.py Show resolved Hide resolved
@eccabay eccabay requested a review from dsherry August 10, 2020 16:23
Copy link
Collaborator

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I left one comment about addressing the str-to-int column name conversion and the transform try/except in a separate PR. I also didn't see coverage for the two cases I mentioned previously:

  • Input DF has two features with the same name
  • Input DF has non-str column names, i.e. df = pd.DataFrame(data=np.zeros((1, 4)), columns=[0, 1, 42, -1000])

@eccabay eccabay merged commit dd784a2 into main Aug 11, 2020
@dsherry dsherry mentioned this pull request Aug 25, 2020
@eccabay eccabay deleted the 980_940_lsa_component branch November 2, 2020 16:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Can't run text featurizer without network connection Flaky test in text featurizer
4 participants