Refactor clean_tokens function #29

Conversation
Overall, this looks clean and it's exciting how much it speeds up the primitives! Just left a few questions/comments, but looks good to me after those adjustments (mainly just the try/except duplication in the stopword removal)
@@ -7,25 +6,25 @@

def clean_tokens(textstr):
    textstr = textstr.translate(str.maketrans('', '', string.punctuation))
Is there a reason to use this instead of just replacing all punctuation with an empty string via regex? (It seems like this line is meant to remove all punctuation, but feel free to correct me.) Just a thought; I'm wondering about the rationale for this construction in particular. I see that you removed the `import re` line, so maybe the reason is that regex was really slow?
Everything I've found online indicates that this `maketrans` is just about the most efficient way to do this operation; apparently it just directly wraps C code, which is nice. See my comment below about regex and why I moved away from that.
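For readers following along, a minimal sketch of the comparison (the regex variant is my guess at an alternative, not code taken from this PR):

```python
import re
import string

text = "Hello, world! It's a test..."

# translate() with a maketrans deletion table removes every punctuation
# character in a single pass implemented in C.
table = str.maketrans('', '', string.punctuation)
print(text.translate(table))  # Hello world Its a test

# A regex equivalent for comparison; it goes through the regex engine and
# is generally slower for this kind of bulk character removal.
print(re.sub('[%s]' % re.escape(string.punctuation), '', text))
```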
oooh cool!! the more ya know!
except LookupError:
    nltk.download('stopwords')
    swords = set(nltk.corpus.stopwords.words('english'))
    to_remove = set(string.punctuation).union(swords)
Might it be worth not repeating this line and the following one? They're the same as lines 14 and 15, so they could instead be taken out of the `try`/`except` and placed, unindented, after the `except` clause, or put in a `finally` clause.
There's a weird bug going on in the evalml code right now that actually stems from this. If the internet connection is bad or non-existent when running this code, the nltk download will fail and any attempt to use that variable will throw an error. By keeping all references to nltk's stopwords within the try/except block, the code will run and successfully return something even if the download(s) fail.
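To make the failure mode concrete, here's a rough sketch of the pattern being described; this illustrates the approach rather than reproducing the exact PR code, and the tokenization details are placeholders:

```python
import string

import nltk


def clean_tokens(textstr):
    # Punctuation removal never depends on a download, so it runs first.
    textstr = textstr.translate(str.maketrans('', '', string.punctuation))
    tokens = textstr.lower().split()
    try:
        try:
            swords = set(nltk.corpus.stopwords.words('english'))
        except LookupError:
            # Corpus not present locally; try to fetch it.
            nltk.download('stopwords')
            swords = set(nltk.corpus.stopwords.words('english'))
        tokens = [t for t in tokens if t not in swords]
    except LookupError:
        # Download failed (e.g. no internet connection): skip stopword
        # removal instead of referencing a never-assigned variable.
        pass
    return tokens
```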
Hm, that is strange. I guess there isn't much handling in the evalml library (or here) around waiting for downloads to complete or recovering when they fail. In that case, this seems very reasonable!!
processed = ['0' if re.search('[0-9]+', ch) else ch for ch in processed]
processed = [wn.lemmatize(w) for w in processed]
return processed
textstr = ['0' if any(map(str.isdigit, ch)) else ch for ch in textstr]
similar comment to above (in removing punctuation), just wondering what the reason is for moving away from regex, but I do think this looks very clean!
I removed the regex here to add just a little more optimization. The `re.search` runtime is the third largest box in the second cProfile screenshot I posted; it took about 50 seconds. Removing the regex sped things up noticeably: the same operation was cut down to 20 seconds, so `clean_tokens` is down to running in 340 seconds total.
Apparently, the way regex searches through text includes a lot of backtracking and jumping forward, and it can be really inefficient on some types of datasets, depending on what you're searching for (https://towardsdatascience.com/regex-performance-in-python-873bd948a0ea), so I figured it would be safer to pull them out of here at least for now.
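For illustration, a toy benchmark of the two constructions; absolute timings will vary with the dataset, this just shows the shape of the change:

```python
import re
import timeit

tokens = ['hello', 'abc123', '2020', 'world'] * 10_000


def with_regex():
    # Old construction: a regex search per token.
    return ['0' if re.search('[0-9]+', t) else t for t in tokens]


def with_isdigit():
    # New construction: a short-circuiting per-character digit test,
    # no regex engine involved.
    return ['0' if any(map(str.isdigit, t)) else t for t in tokens]


assert with_regex() == with_isdigit()
print('regex:  ', timeit.timeit(with_regex, number=10))
print('isdigit:', timeit.timeit(with_isdigit, number=10))
```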
amazing, makes a lot of sense, cool that you looked into this!
assert a.equals(b)
a = a.mean().round(7).to_numpy()
b = np.array([-0.0007475, 0.0032088, 0.0018552, 0.0008256, 0.0028342])
np.testing.assert_array_almost_equal(a, b)
Looks good.
👍 LGTM! I know we've got other changes planned which will touch the download-related code, so I'll hold a few comments until we get to that work.
@eccabay any reason not to merge this?
@dsherry nope, not really! Was holding on in case anyone had anything else to add, but I'll just go ahead and merge now
- Fixed the `universal_sentence_encoder` tests that were failing on master
- Refactored `clean_tokens` so that a download failing will skip that part and complete successfully rather than raising a `Variable referenced before assignment` error
- Removed `nltk.word_tokenizer` in `clean_tokens` altogether in preference for an equivalent set of operations that takes half the time to run
- Removed the `nltk` detokenizer from `PolarityScore`, since the same operation can be performed by `str.join(' ')` in a much shorter amount of time (see the sketch below)
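For illustration, a minimal sketch of the detokenizer-vs-join trade-off from the last bullet. It assumes nltk's Treebank detokenizer; the exact detokenizer removed from `PolarityScore` isn't shown in this thread:

```python
from nltk.tokenize.treebank import TreebankWordDetokenizer

tokens = ['this', 'movie', 'was', 'great']

# nltk detokenizer: a full reconstruction pass over the token list.
print(TreebankWordDetokenizer().detokenize(tokens))  # this movie was great

# Plain string join: equivalent output for whitespace-tokenized text,
# with far less overhead.
print(' '.join(tokens))  # this movie was great
```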
Running cProfile:

![clean_tokens_original](https://user-images.githubusercontent.com/60074782/91891536-3399ff00-ec5f-11ea-8237-bc79d17be1eb.png)
Original `clean_tokens` performance

![clean_tokens_union_out_of_listcomp](https://user-images.githubusercontent.com/60074782/91891586-4ad8ec80-ec5f-11ea-92ab-58796b1892fe.png)
After removing `word_tokenizer`