Refactor clean_tokens function #29

Conversation
Overall, this looks clean and it's exciting how much it speeds up the primitives! Just left a few questions/comments, but looks good to me after those adjustments (mainly just the try/except duplication in the stopword removal)
@@ -7,25 +6,25 @@

def clean_tokens(textstr):
    textstr = textstr.translate(str.maketrans('', '', string.punctuation))
Is there a reason to use this instead of just replacing all punctuation with an empty string via regex? (It seems like this line is meant to remove all punctuation, but feel free to correct me.) Just a thought; I'm wondering about the rationale for this construction in particular. I see that you removed the `import re` line, so maybe the reason is that regex was really slow?
Everything I've found online indicates that this `maketrans` is just about the most efficient way to do this operation; apparently it just directly wraps C code, which is nice. See my comment below about regex and why I moved away from that.
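For readers following along, a minimal sketch of the comparison (the regex variant is my guess at an alternative, not code taken from this PR):

```python
import re
import string

text = "Hello, world! It's a test..."

# translate() with a maketrans deletion table removes every punctuation
# character in a single pass implemented in C.
table = str.maketrans('', '', string.punctuation)
print(text.translate(table))  # Hello world Its a test

# A regex equivalent for comparison; it goes through the regex engine and
# is generally slower for this kind of bulk character removal.
print(re.sub('[%s]' % re.escape(string.punctuation), '', text))
```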
oooh cool!! the more ya know!
except LookupError:
    nltk.download('stopwords')
    swords = set(nltk.corpus.stopwords.words('english'))
    to_remove = set(string.punctuation).union(swords)
Might it be worth not repeating this line and the following one? They're the same as lines 14 and 15, so they could instead be taken out of the `try`/`except` and placed, unindented, after the `except` clause, or put in a `finally` clause.
There's a weird bug going on in the evalml code right now that actually stems from this. If the internet connection is bad or non-existent when running this code, the nltk download will fail and any attempt to use that variable will throw an error. By keeping all references to nltk's stopwords within the try/except block, the code will run and successfully return something even if the download(s) fail.
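To make the failure mode concrete, here's a rough sketch of the pattern being described; this illustrates the approach rather than reproducing the exact PR code, and the tokenization details are placeholders:

```python
import string

import nltk


def clean_tokens(textstr):
    # Punctuation removal never depends on a download, so it runs first.
    textstr = textstr.translate(str.maketrans('', '', string.punctuation))
    tokens = textstr.lower().split()
    try:
        try:
            swords = set(nltk.corpus.stopwords.words('english'))
        except LookupError:
            # Corpus not present locally; try to fetch it.
            nltk.download('stopwords')
            swords = set(nltk.corpus.stopwords.words('english'))
        tokens = [t for t in tokens if t not in swords]
    except LookupError:
        # Download failed (e.g. no internet connection): skip stopword
        # removal instead of referencing a never-assigned variable.
        pass
    return tokens
```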
Hm, that is strange. I guess there isn't much handling in the evalml library (or here) around waiting for downloads to complete or recovering when they fail. In that case, this seems very reasonable!!
processed = ['0' if re.search('[0-9]+', ch) else ch for ch in processed]
processed = [wn.lemmatize(w) for w in processed]
return processed
textstr = ['0' if any(map(str.isdigit, ch)) else ch for ch in textstr]
similar comment to above (in removing punctuation), just wondering what the reason is for moving away from regex, but I do think this looks very clean!
I removed the regex here to add just a little more optimization. The `re.search` runtime is the third largest box in the second cProfile screenshot I posted; it took about 50 seconds. Removing the regex sped things up noticeably: the same operation was cut down to 20 seconds, so `clean_tokens` is down to running in 340 seconds total.
Apparently, the way regex searches through text includes a lot of backtracking and jumping forward, and it can be really inefficient on some types of datasets, depending on what you're searching for (https://towardsdatascience.com/regex-performance-in-python-873bd948a0ea), so I figured it would be safer to pull them out of here at least for now.
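For illustration, a toy benchmark of the two constructions; absolute timings will vary with the dataset, this just shows the shape of the change:

```python
import re
import timeit

tokens = ['hello', 'abc123', '2020', 'world'] * 10_000


def with_regex():
    # Old construction: a regex search per token.
    return ['0' if re.search('[0-9]+', t) else t for t in tokens]


def with_isdigit():
    # New construction: a short-circuiting per-character digit test,
    # no regex engine involved.
    return ['0' if any(map(str.isdigit, t)) else t for t in tokens]


assert with_regex() == with_isdigit()
print('regex:  ', timeit.timeit(with_regex, number=10))
print('isdigit:', timeit.timeit(with_isdigit, number=10))
```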
amazing, makes a lot of sense, cool that you looked into this!
assert a.equals(b)
a = a.mean().round(7).to_numpy()
b = np.array([-0.0007475, 0.0032088, 0.0018552, 0.0008256, 0.0028342])
np.testing.assert_array_almost_equal(a, b)
Looks good.
👍 LGTM! I know we've got other changes planned which will touch the download-related code, so I'll hold a few comments until we get to that work.
@eccabay any reason not to merge this?
@dsherry nope, not really! Was holding on in case anyone had anything else to add, but I'll just go ahead and merge now
- Fixed the `universal_sentence_encoder` tests that were failing on master
- Refactored `clean_tokens` so that a download failing will skip that part and complete successfully rather than raising a `Variable referenced before assignment` error
- Removed `nltk.word_tokenizer` in `clean_tokens` altogether in preference for an equivalent set of operations that takes half the time to run
- Removed the `nltk` detokenizer from `PolarityScore`, since the same operation can be performed by `str.join(' ')` in a much shorter amount of time (see the sketch below)
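For illustration, a minimal sketch of the detokenizer-vs-join trade-off from the last bullet. It assumes nltk's Treebank detokenizer; the exact detokenizer removed from `PolarityScore` isn't shown in this thread:

```python
from nltk.tokenize.treebank import TreebankWordDetokenizer

tokens = ['this', 'movie', 'was', 'great']

# nltk detokenizer: a full reconstruction pass over the token list.
print(TreebankWordDetokenizer().detokenize(tokens))  # this movie was great

# Plain string join: equivalent output for whitespace-tokenized text,
# with far less overhead.
print(' '.join(tokens))  # this movie was great
```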
Running cProfile:

![clean_tokens_original](https://user-images.githubusercontent.com/60074782/91891536-3399ff00-ec5f-11ea-8237-bc79d17be1eb.png)
Original `clean_tokens` performance

![clean_tokens_union_out_of_listcomp](https://user-images.githubusercontent.com/60074782/91891586-4ad8ec80-ec5f-11ea-92ab-58796b1892fe.png)
After removing `word_tokenizer`