Remove text_columns parameter from LSA and TextFeaturizer components #1652

Merged
angela97lin merged 30 commits into main from 1614_remove_text_columns on Feb 12, 2021

angela97lin
Contributor

Closes #1614.

@angela97lin angela97lin self-assigned this Jan 5, 2021
codecov bot commented Jan 6, 2021

Codecov Report

Merging #1652 (2f99bcb) into main (2804bd0) will decrease coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff            @@
##             main   #1652     +/-   ##
========================================
- Coverage   100.0%   99.9%   -0.0%     
========================================
  Files         255     255             
  Lines       20621   20571     -50     
========================================
- Hits        20613   20549     -64     
- Misses          8      22     +14     
Impacted Files Coverage Δ
...lml/automl/automl_algorithm/iterative_algorithm.py 100.0% <ø> (ø)
...ponents/transformers/preprocessing/featuretools.py 100.0% <ø> (ø)
...lml/tests/automl_tests/test_iterative_algorithm.py 100.0% <ø> (ø)
evalml/automl/automl_search.py 99.7% <100.0%> (-<0.1%) ⬇️
.../transformers/preprocessing/datetime_featurizer.py 100.0% <100.0%> (ø)
...lines/components/transformers/preprocessing/lsa.py 97.6% <100.0%> (-2.4%) ⬇️
...ents/transformers/preprocessing/text_featurizer.py 82.2% <100.0%> (-17.8%) ⬇️
...nts/transformers/preprocessing/text_transformer.py 100.0% <100.0%> (ø)
evalml/pipelines/utils.py 100.0% <100.0%> (ø)
evalml/tests/component_tests/test_components.py 100.0% <100.0%> (ø)
... and 5 more
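
As a quick check on the report above, the headline percentages can be reproduced from the Hits and Lines rows. This is a minimal sketch that assumes coverage is simply hits divided by lines; the helper name `coverage_percent` is hypothetical and is not part of Codecov or evalml.

```python
def coverage_percent(hits: int, lines: int) -> float:
    """Coverage as a percentage, assuming coverage = hits / lines."""
    return 100.0 * hits / lines


# Figures copied from the coverage diff for main and for this PR.
before = coverage_percent(20613, 20621)
after = coverage_percent(20549, 20571)

# Rounded to one decimal place, as in the report.
print(f"{before:.1f}% -> {after:.1f}% ({after - before:+.1f} points)")
```

Under that assumption, the drop is well under a tenth of a point before rounding, which is why it shows up as 0.1% in the summary line.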


@@ -273,7 +273,7 @@ def test_make_pipeline_no_column_names(input_type, problem_type):
 def test_make_pipeline_text_columns(input_type, problem_type):
     X = pd.DataFrame({"numerical": [1, 2, 3, 1, 2],
                       "categorical": ["a", "b", "a", "c", "c"],
-                      "text": ["string one", "another", "text for a column", "text string", "hello world"]})
+                      "text": ["string one", "another", "text for a column, this should be a text column!!", "text string", "hello world"]})
Contributor Author

Making this longer so that WW (Woodwork) automatically detects that it's a text column. Could also just use infer_feature_types, heh.
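
To illustrate why lengthening one string could tip automatic detection, here is a toy sketch of a length-based heuristic. Everything in it is an assumption made for illustration: the function name `looks_like_text`, the mean-length rule, and the threshold of 15 are hypothetical, and Woodwork's actual inference logic may work differently.

```python
from statistics import mean


def looks_like_text(values, min_mean_length=15):
    """Toy rule: treat a string column as free text if its values are long on average.

    Hypothetical heuristic for illustration only, not Woodwork's real inference.
    """
    return mean(len(v) for v in values) > min_mean_length


# The "text" column before and after the change in this test.
old_text = ["string one", "another", "text for a column", "text string", "hello world"]
new_text = ["string one", "another", "text for a column, this should be a text column!!",
            "text string", "hello world"]

print(looks_like_text(old_text))  # False under this toy threshold
print(looks_like_text(new_text))  # True under this toy threshold
```

Declaring the column type explicitly, as the comment suggests, would avoid depending on any such heuristic.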

Contributor

@freddyaboulton freddyaboulton left a comment

@angela97lin I think this looks good!

Contributor

@chukarsten chukarsten left a comment

This looks good.

Contributor

@bchen1116 bchen1116 left a comment

Just left a few questions!

evalml/tests/component_tests/test_lsa.py
     lsa.fit(X)
     expected_col_names = set(['LSA(4.75)[0]',
                               'LSA(4.75)[1]',
-                              'LSA(-1)[0]',
-                              'LSA(-1)[1]'])
+                              'LSA(-1.0)[0]',
Contributor

Why did this column label become a float versus the original int?

Contributor Author

Good question! I believe it should have been a float to begin with:
[image]

However, we previously asked users to pass in the text column names and then used those names directly. Since we passed in -1, we got -1 as an int instead.
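
The int-versus-float difference in the label can be sketched in plain Python. This assumes the feature names follow the `LSA(<label>)[i]` pattern from the test and that the `-1` label ends up stored as a float once it shares an index with `4.75`; the helper `lsa_feature_names` is hypothetical and is not evalml's implementation.

```python
def lsa_feature_names(column_label, n_components=2):
    """Build names in the 'LSA(<label>)[i]' pattern used by the test (hypothetical helper)."""
    return [f"LSA({column_label})[{i}]" for i in range(n_components)]


# Label taken verbatim from a user-supplied list: it stays a Python int.
print(lsa_feature_names(-1))         # ['LSA(-1)[0]', 'LSA(-1)[1]']

# Label read back from the data after being stored as a float (assumed coercion).
print(lsa_feature_names(float(-1)))  # ['LSA(-1.0)[0]', 'LSA(-1.0)[1]']
```

Reading the labels from the data itself, rather than from a user-supplied list, would explain the expected names in the test changing to the float form.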

Contributor

@bchen1116 bchen1116 left a comment

Nice!

@angela97lin
Contributor Author

Blocking on #1662 so putting this back as In Progress 😭

@dsherry
Contributor

dsherry commented Jan 8, 2021

@angela97lin BTW since we decided yesterday that this is not ready to merge, I'm converting this back to a draft.

@angela97lin angela97lin marked this pull request as ready for review February 11, 2021 23:52
@angela97lin angela97lin merged commit 2747fbc into main Feb 12, 2021
@angela97lin angela97lin deleted the 1614_remove_text_columns branch February 12, 2021 06:08
@chukarsten chukarsten mentioned this pull request Feb 23, 2021
@dsherry dsherry mentioned this pull request Mar 10, 2021