
Refactor clean_tokens function #29

Merged: 7 commits from improve_primitive_performance into master on Sep 17, 2020
Conversation

@eccabay (Contributor) commented Sep 1, 2020

  • Fixes a few universal_sentence_encoder tests that were failing on master
  • Refactors the try/except/finally logic in clean_tokens so that a failed download skips that step and completes successfully, rather than raising a "variable referenced before assignment" error
  • Removes nltk.word_tokenizer from clean_tokens altogether in favor of an equivalent set of operations that takes half the time to run
  • Removes the nltk detokenizer from PolarityScore, since the same operation can be performed with ' '.join() in a much shorter amount of time (see the sketch below)
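For illustration, a minimal sketch of the detokenizer swap (the token list is hypothetical, and TreebankWordDetokenizer is shown only as an example of an nltk detokenizer, not necessarily the one the old code used):

```python
tokens = ['this', 'is', 'a', 'sample', 'sentence']

# Before: round-tripping through an nltk detokenizer
# from nltk.tokenize.treebank import TreebankWordDetokenizer
# text = TreebankWordDetokenizer().detokenize(tokens)

# After: a plain join, which does the same job here in far less time
text = ' '.join(tokens)
```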

Running cProfile:
[screenshot: original clean_tokens performance (clean_tokens_original)]
[screenshot: after removing word_tokenizer (clean_tokens_union_out_of_listcomp)]

@ctduffy (Contributor) left a comment

Overall, this looks clean and it's exciting how much it speeds up the primitives! Just left a few questions/comments, but looks good to me after those adjustments (mainly just the try/except duplication in the stopword removal)

nlp_primitives/polarity_score.py (comment resolved)
@@ -7,25 +6,25 @@


def clean_tokens(textstr):
    textstr = textstr.translate(str.maketrans('', '', string.punctuation))
@ctduffy (Contributor):

Is there a reason to use this instead of just replacing all punctuation with an empty string via regex? It seems like this line is meant to remove all punctuation (feel free to correct me), so I'm mostly wondering the rationale for this particular construction. I see that you've removed the import re line, so maybe the reason is that regex was really slow?

@eccabay (Contributor, Author):

Everything I've found online indicates that maketrans is just about the most efficient way to do this operation; apparently it directly wraps C code, which is nice. See my comment below about regex and why I moved away from it.
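A quick illustration (not from the PR) of the two approaches, both removing every punctuation character:

```python
import re
import string

text = "Hello, world! This is -- a test..."

# translate() with a maketrans deletion table: a single pass in C
table = str.maketrans('', '', string.punctuation)
fast = text.translate(table)

# Equivalent regex substitution, generally slower on large inputs
slow = re.sub('[%s]' % re.escape(string.punctuation), '', text)

assert fast == slow  # both yield 'Hello world This is  a test'
```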

@ctduffy (Contributor):

oooh cool!! the more ya know!

    except LookupError:
        nltk.download('stopwords')
        swords = set(nltk.corpus.stopwords.words('english'))
        to_remove = set(string.punctuation).union(swords)
@ctduffy (Contributor):

might it be worth not repeating this and the following line, since they're the same as lines 14 and 15? They could be moved out of the try/except and placed after the except block (unindented), or put in a finally block

@eccabay (Contributor, Author):

There's a weird bug going on in the evalml code right now that actually stems from this. If the internet connection is bad or non-existent when running this code, the nltk download will fail and any attempts to use that variable will throw an error. By keeping all references to nltk's stopwords within the try/except block, the code will run and successfully return something even if the download(s) fail.
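A minimal sketch of the pattern being described, assuming a simplified clean_tokens (the real function in this PR also does tokenizing and lemmatizing):

```python
import string

import nltk


def clean_tokens(textstr):
    to_remove = set(string.punctuation)
    try:
        # Works if the stopwords corpus is already on disk
        swords = set(nltk.corpus.stopwords.words('english'))
        to_remove = to_remove.union(swords)
    except LookupError:
        try:
            nltk.download('stopwords')
            swords = set(nltk.corpus.stopwords.words('english'))
            to_remove = to_remove.union(swords)
        except Exception:
            # Offline or the download failed: skip stopword removal
            # entirely rather than referencing `swords` before assignment
            pass
    tokens = textstr.translate(str.maketrans('', '', string.punctuation)).lower().split()
    return [t for t in tokens if t not in to_remove]
```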

@ctduffy (Contributor):

Hm, that is strange. I guess there isn't much handling in the evalml library (or here) around downloads failing or waiting for them to finish. In that case, this seems very reasonable!!

    processed = ['0' if re.search('[0-9]+', ch) else ch for ch in processed]
    processed = [wn.lemmatize(w) for w in processed]
    return processed
    textstr = ['0' if any(map(str.isdigit, ch)) else ch for ch in textstr]
@ctduffy (Contributor):

Similar comment to my one above about removing punctuation: just wondering what the reason is for moving away from regex, but I do think this looks very clean!

@eccabay (Contributor, Author):

I removed the regex here to squeeze out a little more optimization: the re.search runtime is the third-largest box in the second cProfile screenshot I posted, at about 50 seconds. Removing the regex sped things up noticeably; the same operation was cut down to 20 seconds, so clean_tokens is down to running in 340 seconds total.

Apparently, the way regex searches through text involves a lot of backtracking and jumping forward, and it can be really inefficient on some types of datasets depending on what you're searching for (https://towardsdatascience.com/regex-performance-in-python-873bd948a0ea), so I figured it would be safer to pull it out of here, at least for now.
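A quick way to reproduce the comparison on synthetic data (the word list is made up, and timings will vary by machine and dataset):

```python
import re
import timeit

words = ['hello', 'abc123', 'world', '42', 'tokens'] * 20000

# Old approach: fire up the regex engine for every word
regex_time = timeit.timeit(
    lambda: ['0' if re.search('[0-9]+', w) else w for w in words],
    number=10)

# New approach: plain character checks, no regex engine involved
isdigit_time = timeit.timeit(
    lambda: ['0' if any(map(str.isdigit, w)) else w for w in words],
    number=10)

print(f'regex: {regex_time:.3f}s   isdigit: {isdigit_time:.3f}s')
```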

@ctduffy (Contributor):

amazing, makes a lot of sense, cool that you looked into this!

@eccabay requested a review from @ctduffy on September 2, 2020
assert a.equals(b)
# compare per-column means against precomputed reference values
a = a.mean().round(7).to_numpy()
b = np.array([-0.0007475, 0.0032088, 0.0018552, 0.0008256, 0.0028342])
np.testing.assert_array_almost_equal(a, b)
Contributor:

Looks good.

@dsherry (Contributor) left a comment

👍 LGTM! I know we've got other changes planned which will touch the download-related code, so I'll hold a few comments until we get to that work.

@dsherry (Contributor) commented Sep 17, 2020

@eccabay any reason not to merge this?

@eccabay (Contributor, Author) commented Sep 17, 2020

@dsherry nope, not really! I was holding off in case anyone had anything else to add, but I'll just go ahead and merge now.

@eccabay merged commit 6a75504 into master on Sep 17, 2020
@rwedge mentioned this pull request on Oct 26, 2020
@rwedge deleted the improve_primitive_performance branch on March 4, 2021