Add text processing component #913
Conversation
Codecov Report
@@           Coverage Diff            @@
##             main     #913    +/-   ##
=========================================
  Coverage   99.83%   99.83%
=========================================
  Files         168      170      +2
  Lines        8366     8578    +212
=========================================
+ Hits         8352     8564    +212
  Misses         14       14
@eccabay Left a comment about making sure the column names passed in __init__
are actually in the dataframe and some questions. Overall, looks good to me!
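A minimal sketch of the kind of check being asked for here; the helper name is hypothetical and the actual component may do this differently, but the idea is to verify at fit/transform time that every text column named in __init__ exists in the incoming dataframe.

def _verify_col_names(X, text_columns):
    # Hypothetical helper: X is a pandas DataFrame; raise early if any
    # configured text column is missing instead of failing deep inside featuretools.
    missing = [col for col in text_columns if col not in X.columns]
    if missing:
        raise RuntimeError(f"Columns {missing} were not found in the given DataFrame")

A call like _verify_col_names(X, self.text_columns) at the top of fit and transform would surface a misconfigured column name with a clear error message.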
def text_df():
    df = pd.DataFrame(
        {'col_1': ['I\'m singing in the rain!$%^ do do do do do da do',
                   'just singing in the rain.................. \n',
                   '\t\n\n\n\nWhat a glorious feelinggggggggggg, I\'m happy again!!! lalalalalalalalalalala'],
         'col_2': ['do you hear the people sing?////////////////////////////////////',
                   'singing the songs of angry men\n',
                   '\tIt is the music of a people who will NOT be slaves again!!!!!!!!!!!']})
🇫🇷
[Several review threads on evalml/pipelines/components/transformers/preprocessing/text_featurization.py were marked resolved or outdated.]
Nice!! Looking good. Left a few comments.
The nlp-primitives repo has unit tests which check that the output from each primitive is correct. We don't have that level of coverage here yet, but I think we should. My suggestion would be to port the relevant nlp-primitives tests to our repo. Happy to talk more about that; there are definitely a few options.
I see we'll be doing the addition to automl in a separate PR, #908. Sounds good to me!
After we merge this PR, featuretools will be a core dependency of evalml. I think before the July milestone goes out, we should make it optional and move it to requirements.txt!
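For what it's worth, a hedged sketch of the usual packaging mechanism for an optional dependency: a pip extra declared in setup.py. The extra name and dependency lists below are purely illustrative, not evalml's actual packaging layout.

from setuptools import setup

setup(
    name='evalml',
    install_requires=['numpy', 'pandas', 'scikit-learn'],  # illustrative core deps only
    extras_require={
        # illustrative extra name; installed via `pip install evalml[text]`
        'text': ['featuretools[nlp_primitives]', 'nlp-primitives'],
    },
)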
@@ -3,6 +3,8 @@ pandas>=0.25.0
 scipy>=1.2.1
 scikit-learn>=0.21.3,!=0.22,<0.23.0
 scikit-optimize
+featuretools[nlp_primitives]
+nlp-primitives
Is this necessary? I see the instructions for nlp-primitives on PyPI say to install featuretools[nlp_primitives].
That was what I tried at first, but CircleCI couldn't import nlp-primitives without the second line. Take a look at some of my earlier commits and you can see that none of the tests were able to run.
this might be a bug in featuretools cc @rwedge
There was a bug when installing nlp-primitives 0.3.0 via pip; it may have been fixed by 0.3.1.
Even with nlp-primitives 0.3.1, pip install featuretools[nlp_primitives] doesn't install properly.
pip install nlp_primitives works, though.
@@ -171,6 +173,7 @@ def test_normalize_confusion_matrix(data_type):


 def test_normalize_confusion_matrix_error():
     conf_mat = np.array([[0, 0, 0], [0, 0, 0], [0, 0, 0]])
+    warnings.simplefilter('default', category=RuntimeWarning)
👍 nice. So to recap the discussion we had offline with @gsheni, this is necessary because featuretools runs warnings.simplefilter("ignore", category=RuntimeWarning) on import, but normalize_confusion_matrix listens for a particular warning and turns it into a ValueError. This was recently fixed in featuretools and will be in the next release.
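A standalone sketch of that interaction (not evalml's actual implementation): a check that relies on recording a RuntimeWarning stops firing once a process-wide 'ignore' filter is installed, and restoring the default filter brings it back.

import warnings
import numpy as np

def normalize_or_raise(conf_mat):
    # Record any RuntimeWarning numpy emits while normalizing; if one was
    # emitted (e.g. a zero row sum), turn it into a ValueError.
    with warnings.catch_warnings(record=True) as caught:
        result = conf_mat / conf_mat.sum(axis=1, keepdims=True)
    if any(issubclass(w.category, RuntimeWarning) for w in caught):
        raise ValueError("Sum of a given axis is 0, cannot normalize.")
    return result

# If an import has already run warnings.simplefilter('ignore', category=RuntimeWarning),
# the warning is filtered out before it can be recorded, so no ValueError is raised.
# Restoring the default filter (as the test does) makes the check work again:
warnings.simplefilter('default', category=RuntimeWarning)
try:
    normalize_or_raise(np.zeros((3, 3)))
except ValueError as err:
    print(err)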
        random_state=random_state)

    @property
    def features(self):
Why expose this? I think we should delete this getter.
I use the features in the tests to verify the component fit/transformed properly; I added this so the tests didn't have to access the private attribute.
Ah, got it. In that case, let's delete this getter and instead check that transform creates the columns we expect it to, with the column names we expect.
@eccabay before I forget, could you please update this PR to point at
@eccabay looks good to me! Nice going :) it's exciting we're getting so close to having our first active integration with featuretools!
After our discussion of LSA on the epic, I was poking around in the nlp-primitives code. Two points from that:
- As you pointed out in the epic, we're currently training the TF-IDF on a predetermined corpus from nltk. We should take a look at training TF-IDF on the training data instead. This may require implementing this in the evalml code instead of using a primitive (see the sketch after this list).
- If we stick with the predetermined corpus: I see we currently call nltk.download to grab the corpus. We should change that before we include this in automl. We should be able to run all components with or without network access.
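For reference, a rough sketch of what fitting LSA on the user's own training data could look like with plain scikit-learn, independent of nlp-primitives. The function name and defaults are illustrative, not a proposal for the component's final API.

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

def fit_lsa_on_training_data(text_series, n_components=2):
    # Fit TF-IDF on the caller's training text rather than a predetermined
    # nltk corpus, then reduce with truncated SVD (the usual LSA recipe).
    lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=n_components))
    return lsa.fit(text_series)

# Usage: fit on the training column, then transform train/holdout consistently.
# lsa = fit_lsa_on_training_data(X_train['col_1'])
# lsa_features = lsa.transform(X_holdout['col_1'])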
X_t = tf.transform(X)
cols = [col for col in X_t.columns if 'LSA' in col]
features = X_t[cols]
np.testing.assert_almost_equal(features, expected_features, decimal=3)
Awesome, thanks for adding these!
    expected_col_names.add(f'PART_OF_SPEECH_COUNT(col_1)[{i}]')
    expected_col_names.add(f'PART_OF_SPEECH_COUNT(col_2)[{i}]')
X_t = tf.transform(X)
assert set(X_t.columns) == expected_col_names
Nice!
Minor, but since the feature order is (or should be?) deterministic, it would be nice not to use set and to test the ordering too.
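A minimal sketch of the ordered variant being suggested, reusing the names from the test above; the exact expected ordering would have to match whatever order the component actually emits its columns in.

# Build expected_col_names as a list in the order the component is expected to
# produce, then compare lists instead of sets so column order is checked too.
assert list(X_t.columns) == expected_col_names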
@@ -1,5 +1,6 @@
 catboost==0.23.2
 cloudpickle==1.5.0
+distributed==2.20.0
I pushed a commit to add this. I think 0.3.1 of nlp-primitives, which rolled out an hour ago, added this package.
featuretools uses distributed.
Perhaps one approach would be to update the primitive to accept the training data as a corpus to use instead of the predetermined one.
@eccabay Looks good to me!
X_text['index'] = range(len(X_text))

es = ft.EntitySet()
es = es.entity_from_dataframe(entity_id='X', dataframe=X_text, index='index')
Just seeing this now, but I think we should be using the variable_types argument to specify which columns are text columns when calling this function. You can see an example here: https://docs.featuretools.com/en/stable/loading_data/using_entitysets.html#adding-entities
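A hedged sketch of that suggestion applied to the snippet above. The variable type name follows the featuretools API of this era (Text was later renamed NaturalLanguage), so the exact spelling depends on the pinned featuretools version.

import featuretools as ft
from featuretools import variable_types as vtypes

es = ft.EntitySet()
# Declare the text columns explicitly so featuretools doesn't have to infer them.
es = es.entity_from_dataframe(entity_id='X',
                              dataframe=X_text,
                              index='index',
                              variable_types={'col_1': vtypes.Text,
                                              'col_2': vtypes.Text})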
Closes #907 by adding a brand new text featurization component that wraps the
featuretools nlp-primitives
code. Tests at the moment are fairly minimal, so any suggestions on what to add are much appreciated!
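To make the shape of the new component concrete, a hypothetical usage sketch; the class name, import path, and parameter name below are assumptions for illustration and may not match what this PR actually ships.

import pandas as pd
from evalml.pipelines.components import TextFeaturizer  # hypothetical name/path

X = pd.DataFrame({'col_1': ['some text here', 'more text', 'even more text']})

# Tell the component which columns hold free text, fit on training data,
# then transform to get numeric features derived from those columns.
tf = TextFeaturizer(text_columns=['col_1'])
tf.fit(X)
X_t = tf.transform(X)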