
Add text processing component #913

Merged: 24 commits merged into main on Jul 10, 2020
Conversation

@eccabay (Contributor) commented Jul 7, 2020

Closes #907 by adding a brand new text featurization component that wraps the featuretools nlp-primitives code. Tests at the moment are fairly minimal, so any suggestions on what to add are much appreciated!
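For a sense of the intended usage, here is a minimal sketch (the class name TextFeaturizer, its import path, and the text_columns parameter are assumptions inferred from the file names in this diff, not confirmed API):

import pandas as pd
# Hypothetical import path and class name, inferred from the new file
# evalml/pipelines/components/transformers/preprocessing/text_featurization.py.
from evalml.pipelines.components import TextFeaturizer

X = pd.DataFrame({'col_1': ['the quick brown fox', 'jumped over the lazy dog']})

tf = TextFeaturizer(text_columns=['col_1'])  # text_columns is an assumed parameter name
tf.fit(X)
X_t = tf.transform(X)  # numeric features derived from the text via nlp-primitives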

codecov bot commented Jul 7, 2020

Codecov Report

Merging #913 into main will not change coverage.
The diff coverage is 100.00%.


@@           Coverage Diff            @@
##             main     #913    +/-   ##
========================================
  Coverage   99.83%   99.83%            
========================================
  Files         168      170     +2     
  Lines        8366     8578   +212     
========================================
+ Hits         8352     8564   +212     
  Misses         14       14            
Impacted Files Coverage Δ
evalml/pipelines/components/__init__.py 100.00% <ø> (ø)
...alml/pipelines/components/transformers/__init__.py 100.00% <100.00%> (ø)
.../components/transformers/preprocessing/__init__.py 100.00% <100.00%> (ø)
...s/transformers/preprocessing/text_featurization.py 100.00% <100.00%> (ø)
...l/tests/component_tests/test_text_featurization.py 100.00% <100.00%> (ø)
evalml/tests/component_tests/test_utils.py 100.00% <100.00%> (ø)
evalml/tests/pipeline_tests/test_graph_utils.py 100.00% <100.00%> (ø)
evalml/tests/utils_tests/test_cli_utils.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3077671...0298829.

@freddyaboulton (Contributor) left a comment

@eccabay Left a comment about making sure the column names passed in __init__ are actually in the dataframe and some questions. Overall, looks good to me!
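A minimal sketch of the kind of check being suggested (the function and parameter names here are illustrative, not the PR's actual code):

import pandas as pd

def _verify_col_names(X: pd.DataFrame, text_columns: list):
    """Raise if any of the requested text columns is missing from the input."""
    missing = [col for col in text_columns if col not in X.columns]
    if missing:
        raise RuntimeError(f"Columns {missing} not found in the given dataframe")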

import pandas as pd
import pytest

@pytest.fixture
def text_df():
    df = pd.DataFrame(
        {'col_1': ['I\'m singing in the rain!$%^ do do do do do da do', 'just singing in the rain.................. \n', '\t\n\n\n\nWhat a glorious feelinggggggggggg, I\'m happy again!!! lalalalalalalalalalala'],
         'col_2': ['do you hear the people sing?////////////////////////////////////', 'singing the songs of angry men\n', '\tIt is the music of a people who will NOT be slaves again!!!!!!!!!!!']})
    return df
Contributor:
🇫🇷

@eccabay requested a review from freddyaboulton on July 9, 2020 at 12:27
@dsherry (Contributor) left a comment

Nice!! Looking good. Left a few comments.

The nlp-primitives repo has unit tests which check that the output from each primitive is correct. We don't have that level of coverage here yet, but I think we should. My suggestion would be to port the relevant nlp-primitives tests to our repo. Happy to talk more about that; there are definitely a few options.
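For reference, a sketch of what one ported test could look like, using the standard featuretools primitive API (the expected values below were computed by hand for this toy input; a real port would copy inputs and expected outputs from the nlp-primitives test suite):

import numpy as np
import pandas as pd
from nlp_primitives import MeanCharactersPerWord

def test_mean_characters_per_word():
    # get_function() returns the callable that computes the feature values.
    primitive_func = MeanCharactersPerWord().get_function()
    x = pd.Series(['and if I said', 'I really do love you'])
    answers = primitive_func(x)
    # 'and if I said' -> (3 + 2 + 1 + 4) / 4 = 2.5
    # 'I really do love you' -> (1 + 6 + 2 + 4 + 3) / 5 = 3.2
    np.testing.assert_almost_equal(answers, [2.5, 3.2])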

I see we'll be doing the addition to automl in a separate PR, #908. Sounds good to me!

After we merge this PR, featuretools will be a core dependency of evalml. I think before the July milestone goes out, we should make it optional and move it to requirements.txt!

@@ -3,6 +3,8 @@ pandas>=0.25.0
scipy>=1.2.1
scikit-learn>=0.21.3,!=0.22,<0.23.0
scikit-optimize
featuretools[nlp_primitives]
nlp-primitives
Contributor:
Is this necessary? I see the instructions for nlp-primitives on pypi say to install featuretools[nlp_primitives]

Contributor (Author):
That was what I tried at first, but CircleCI couldn't import nlp-primitives without the second line. Take a look at some of my earlier commits and you can see that none of the tests were able to run.

@kmax12 (Contributor) Jul 10, 2020:
this might be a bug in featuretools cc @rwedge

Contributor:
There was a bug when installing nlp-primitives 0.3.0 via pip; it may have been fixed by 0.3.1.

Contributor:
Thanks @rwedge

Ah, I see 0.3.1 went out 1 hour ago 😆

@eccabay since it's late on Friday, I suggest you merge this as-is and we can deal with it later. But hopefully, with the update, it's now fine to delete this line.

Contributor:
Even with nlp-primitives 0.3.1, pip install featuretools[nlp_primitives] doesn't install properly.

Contributor:
pip install nlp_primitives works though

@@ -171,6 +173,7 @@ def test_normalize_confusion_matrix(data_type):

def test_normalize_confusion_matrix_error():
    conf_mat = np.array([[0, 0, 0], [0, 0, 0], [0, 0, 0]])
    warnings.simplefilter('default', category=RuntimeWarning)
Contributor:
👍 nice. So to recap the discussion we had offline with @gsheni , this is necessary because featuretools runs warnings.simplefilter("ignore", category=RuntimeWarning) on import. But normalize_confusion_matrix listens for a particular warning and turns it into a ValueError. And this was recently fixed in featuretools and will be in the next release.
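To make the failure mode concrete, here is a simplified sketch of the warning-to-error pattern (not the actual evalml implementation): the divide-by-zero RuntimeWarning from normalizing an all-zero row is escalated to a ValueError, which only works if the warning hasn't been globally silenced first.

import warnings
import numpy as np

def normalize_rows(conf_mat):
    """Row-normalize a confusion matrix, raising if a row sums to zero."""
    with warnings.catch_warnings():
        # Escalate RuntimeWarnings (e.g. invalid value in true_divide) to errors.
        warnings.simplefilter('error', category=RuntimeWarning)
        try:
            return conf_mat / conf_mat.sum(axis=1, keepdims=True)
        except RuntimeWarning:
            raise ValueError("Sum of a given axis is 0, cannot normalize.")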

random_state=random_state)

@property
def features(self):
Contributor:
Why expose this? I think we should delete this getter.

Contributor (Author):
I use the features in the tests to verify the component fit/transformed properly; I added this getter so the tests didn't have to access the private attribute.

Contributor:
Ah, got it. In that case, let's delete this getter and instead check that transform creates the columns we expect it to, with the column names we expect.
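i.e., something along these lines (the component API and the expected column name are assumptions for illustration):

def test_transform_creates_expected_columns(text_df):
    tf = TextFeaturizer(text_columns=['col_1'])  # assumed component API
    tf.fit(text_df)
    X_t = tf.transform(text_df)
    # DIVERSITY_SCORE is one of the nlp-primitives; featuretools names
    # output columns PRIMITIVE_NAME(input_column).
    assert 'DIVERSITY_SCORE(col_1)' in X_t.columns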

@dsherry (Contributor) commented Jul 10, 2020

@eccabay before I forget, could you please update this PR to point at main?

@eccabay changed the base branch from master to main on July 10, 2020 at 14:14
@eccabay requested a review from dsherry on July 10, 2020 at 14:24
@dsherry (Contributor) left a comment

@eccabay looks good to me! Nice going :) it's exciting we're getting so close to having our first active integration with featuretools!

After our discussion of LSA on the epic, I was poking around in the nlp-primitives code. Two points from that:

  1. As you pointed out in the epic, we're currently training the TF-IDF on a predetermined corpus from nltk. We should take a look at training TF-IDF on the training data instead. This may require implementing this in the evalml code instead of using a primitive (a sketch of this idea follows the list below).
  2. If we stick with the predetermined corpus: I see we currently call nltk.download to grab the corpus. We should change that before we include this in automl. We should be able to run all components with or without network access.
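A sketch of point 1, assuming we implement it in evalml with scikit-learn rather than through a primitive (illustrative only; X_train and the column name are placeholders):

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Fit TF-IDF (and the LSA on top of it) on the training data itself,
# instead of on a predetermined corpus downloaded from nltk.
lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2))
lsa.fit(X_train['col_1'])                  # X_train: the training data
lsa_features = lsa.transform(X['col_1'])   # shape: (n_rows, 2)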

# From the new tests: compare the LSA feature values to precomputed expected values.
X_t = tf.transform(X)
cols = [col for col in X_t.columns if 'LSA' in col]
features = X_t[cols]
np.testing.assert_almost_equal(features, expected_features, decimal=3)
Contributor:
Awesome, thanks for adding these!

    expected_col_names.add(f'PART_OF_SPEECH_COUNT(col_1)[{i}]')
    expected_col_names.add(f'PART_OF_SPEECH_COUNT(col_2)[{i}]')
X_t = tf.transform(X)
assert set(X_t.columns) == expected_col_names
Contributor:
Nice!

Minor, but since the feature order is (or should be?) deterministic, it would be nice not to use a set and to test the ordering too
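i.e., build the expected names as a list and compare against list(X_t.columns), e.g. (the exact expected ordering is an assumption about how featuretools orders generated features):

expected_col_names = []
for i in range(15):  # the number of part-of-speech tags is illustrative
    expected_col_names.append(f'PART_OF_SPEECH_COUNT(col_1)[{i}]')
    expected_col_names.append(f'PART_OF_SPEECH_COUNT(col_2)[{i}]')
X_t = tf.transform(X)
# A list comparison asserts ordering as well as membership.
assert list(X_t.columns) == expected_col_names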

@@ -1,5 +1,6 @@
catboost==0.23.2
cloudpickle==1.5.0
distributed==2.20.0
Contributor:
I pushed a commit to add this. I think nlp-primitives 0.3.1, which rolled out an hour ago, added this package.

Contributor:
featuretools uses distributed

@rwedge (Contributor) commented Jul 10, 2020

  1. As you pointed out in the epic, we're currently training the TF-IDF on a predetermined corpus from nltk. We should take a look at training TF-IDF on the training data instead. This may require implementing this in the evalml code instead of using a primitive.

Perhaps one approach would be to update the primitive to accept the training data as a corpus to use instead of the predetermined one.
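A sketch of that idea against the featuretools custom-primitive API of this era (the class, its constructor parameter, and its behavior are assumptions, not an existing nlp-primitives interface):

from featuretools.primitives import TransformPrimitive
from featuretools.variable_types import Numeric, Text
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

class LSAWithCorpus(TransformPrimitive):
    """Hypothetical LSA primitive fit on a caller-supplied corpus."""
    name = "lsa_with_corpus"
    input_types = [Text]
    return_type = Numeric
    number_output_features = 2

    def __init__(self, corpus):
        # corpus: an iterable of training-data strings, replacing the
        # predetermined nltk corpus the current primitive downloads.
        self.trainer = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2))
        self.trainer.fit(corpus)

    def get_function(self):
        def lsa(column):
            values = self.trainer.transform(column.astype(str))
            return values[:, 0], values[:, 1]
        return lsa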

@freddyaboulton (Contributor) left a comment

@eccabay Looks good to me!

@eccabay merged commit 923eb25 into main on Jul 10, 2020
# ft is featuretools; X_text holds the text columns being featurized.
X_text['index'] = range(len(X_text))

es = ft.EntitySet()
es = es.entity_from_dataframe(entity_id='X', dataframe=X_text, index='index')
Contributor:
Just seeing this now, but I think we should be using the variable_types argument to specify which columns are text columns when calling this function. You can see an example here: https://docs.featuretools.com/en/stable/loading_data/using_entitysets.html#adding-entities
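i.e., something like this (using the pre-1.0 featuretools variable-types API this PR targets; the column names mirror the test fixture):

from featuretools import variable_types as vtypes

es = es.entity_from_dataframe(
    entity_id='X',
    dataframe=X_text,
    index='index',
    # Tell featuretools these columns hold text, rather than
    # relying on type inference.
    variable_types={'col_1': vtypes.Text, 'col_2': vtypes.Text})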

Successfully merging this pull request may close: Create a Text Featurization Component (#907)

6 participants