SL-578: Libraries too long #50766
Conversation
lib/cdo/web_purify.rb
Outdated
CONNECTION_OPTIONS = {
  read_timeout: DCDO.get('webpurify_http_read_timeout', 10),
  open_timeout: DCDO.get('webpurify_tcp_connect_timeout', 5)
TODO: I need to test that these CONNECTION_OPTIONS are having an effect when used in Net::HTTP.start.
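One way to check this without a live request (a minimal sketch with placeholder values, not the DCDO-backed configuration): Net::HTTP.start applies each key in its options hash as an attribute writer on the session, so we can mimic that on an unopened session and read the timeouts back.

```ruby
require 'net/http'

# Placeholder values standing in for the DCDO-backed CONNECTION_OPTIONS.
connection_options = {read_timeout: 10, open_timeout: 5}

# Net::HTTP.start applies each option key as an attribute writer on the
# session; do the same on an unopened session so no connection is made.
http = Net::HTTP.new('api1.webpurify.com', 80)
connection_options.each {|key, value| http.public_send("#{key}=", value)}
```

Reading `http.read_timeout` and `http.open_timeout` back confirms the options took effect.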
I tried stubbing the POST request and using WebMock's .to_timeout function to simulate a timeout and test the read_timeout and open_timeout here, but for some reason, the stub wasn't working. I have a feeling it might have to do with the fact that I was mixing VCR and WebMock in the tests. I'd still love to write that test if anyone has expertise here and would like to pair.
def test_find_potential_profanities_at_character_limit
  max_length_string = 'f' * WEBPURIFY_CHARACTER_LIMIT
  assert_nil WebPurify.find_potential_profanities(max_length_string)
end
end
Now that I've switched to a POST, I cannot seem to trigger the large payload error. What's up with that?
Per conversation with the WebPurify team, it's not unexpected that we can no longer trigger the large payload error. However, they still recommend batching requests, so each request has a payload of 30,000 characters or less. They make no promises on the accuracy of their profanity detection on payloads larger than 30,000 characters.
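The batching described above can be sketched as a word-level chunker. This is illustrative (the method name mirrors the test added in this PR, but the body is not necessarily the merged implementation), and note that a single word longer than the limit would still produce an oversized chunk:

```ruby
WEBPURIFY_CHARACTER_LIMIT = 30_000

# Split text on spaces and greedily pack words into chunks no longer than
# `limit` characters, so each WebPurify request stays within the
# recommended payload size.
def chunk_text_for_webpurify(text, limit = WEBPURIFY_CHARACTER_LIMIT)
  chunks = ['']
  text.split(' ').each do |word|
    candidate = chunks.last.empty? ? word : "#{chunks.last} #{word}"
    if candidate.length <= limit
      chunks[-1] = candidate
    else
      chunks << word
    end
  end
  chunks
end
```

Each chunk can then be sent as its own POST, with the per-chunk results concatenated.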
@@ -26,8 +29,13 @@ def test_find_potential_profanities
end

def test_find_potential_profanities_with_language
  assert_nil WebPurify.find_potential_profanities('scheiße', ['en'])
  assert_nil WebPurify.find_potential_profanities('scheiße', ['es'])
Since the fixtures were updated, this test had to change because WebPurify now considers scheiße profanity in English and German.
method: get
uri: http://api1.webpurify.com/services/rest/?api_key=mocksecret&format=json&lang=en&method=webpurify.live.return&text=not%20a%20swear
method: post
uri: http://api1.webpurify.com/services/rest
body:
  encoding: US-ASCII
  string: ""
When the POST is updated to use multipart form data, those bodies don't get captured anywhere in the VCR fixtures. It doesn't cause test failures, but that might be because the tests are executed in the same order every time. Does anyone have context here?
lib/cdo/web_purify.rb
Outdated
response = http.request(request)

next unless response.is_a?(Net::HTTPSuccess)
How do we want to handle errors here? It doesn't seem like there was explicit error handling before. (There was some client-side logic to render a useful message specifically for the large payload error.)
The question here is whether we want to allow publishing if webpurify is down or something, yeah? I think that's maybe worth clarifying with the group while we're here.
In general our error handling leaves something to be desired. I think it's worth reporting back a useful retry message if there is an issue communicating with WebPurify.
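One possible shape for that (a sketch only; WebPurifyError and the helper name are hypothetical, not code in this PR): raise on any non-success response with a user-facing retry message, instead of silently skipping the chunk with `next`.

```ruby
require 'net/http'

# Hypothetical error class for surfacing a retry message to the client.
class WebPurifyError < StandardError; end

# Raise with a user-facing retry message on any non-2xx response,
# rather than silently treating the chunk as profanity-free.
def check_webpurify_response(response)
  return response if response.is_a?(Net::HTTPSuccess)
  raise WebPurifyError,
    "Error communicating with WebPurify (HTTP #{response.code}); please try again"
end
```

The caller could rescue WebPurifyError and block publishing with the retry message, which is the behavior question raised above.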
Is there a word missing in this sentence?
This looks great! I don't have a lot of context on the broader questions you have in your PR description, but I think clarifying what we want to do if we get errors making requests to webpurify is a good idea.
@@ -16,18 +16,52 @@ def setup
  c.filter_sensitive_data('<API_KEY>') {CDO.webpurify_key}
end

def test_chunk_text_for_webpurify
Thanks for adding these, very helpful :)
  ['method', 'webpurify.live.return'],
  ['text', chunk]
]
request.set_form form_data, 'multipart/form-data'
For my understanding, why do you need this?
My understanding is if you want to submit form data through Net::HTTP, you have to create the request (Post.new) and attach the form data (set_form) in separate steps. The Net::HTTP library has some other shortcut methods like post_form (see documentation here), but it wasn't getting me what I wanted from WebPurify, I think because the default encoding is application/x-www-form-urlencoded.
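The two-step pattern looks like this (URI and field names mirror the fixtures in this PR; 'mocksecret' is a placeholder key, and no request is actually sent):

```ruby
require 'net/http'
require 'uri'

uri = URI('http://api1.webpurify.com/services/rest')

# Step 1: create the POST request.
request = Net::HTTP::Post.new(uri)

# Step 2: attach the form data as multipart/form-data. The shortcut
# Net::HTTP.post_form would instead default to
# application/x-www-form-urlencoded.
form_data = [
  ['api_key', 'mocksecret'],
  ['format', 'json'],
  ['method', 'webpurify.live.return'],
  ['text', 'not a swear']
]
request.set_form form_data, 'multipart/form-data'
```

set_form also sets the request's Content-Type header to match the enctype, which is what makes the multipart encoding take effect when the request is sent.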
@ebeastlake we do have a general max file size of 5 mb, to one of your questions
@@ -21,7 +21,7 @@ export default class LibraryIdCopier extends React.Component {
  type="text"
  ref={channelId => (this.channelId = channelId)}
  onClick={event => event.target.select()}
  readOnly="true"
  readOnly={true}
I have no big context about the whole PR, just a small comment re this line: readOnly={true} can also be simplified to just readOnly.
LGTM!
# Returns all the profanities in text (if any) or nil (if none).
# @param [String] text The text to search for profanity within.
# @param [Array[String]] language_codes The set of languages to search for profanity in.
# @return [Array<String>, nil] The profanities (if any) or nil (if none).
def self.find_potential_profanities(text, language_codes = ['en'])
  return nil unless CDO.webpurify_key && Gatekeeper.allows('webpurify', default: true)
  return nil if text.nil?

  # This is an artificial limit to prevent us from profanity-checking a file up to 5MB (the project size limit)
nit: should this comment read "prevent us from profanity-checking a file over 5MB"?
I don't think so, but thinking out loud: Files over 5MB don't exist, but if we removed the limit entirely, we could profanity check a 5 MB file (or 4 MB file, or 3 MB file, or 2 MB file, all of which would be a lot and probably more than we want in the immediate future). Previously, we maxed out in the ~7,000-30,000 character range (<0.03 MB).
I agree that the comment might be unclear, though. How about "this is an artificial limit to prevent us from increasing our profanity-checking limit by too many orders of magnitude (from 0.007-0.03 MB to 0.12-5 MB) at once"? Or something...
Yeah, I think I was confused by the "up to 5MB" line. Does this limit have anything to do with 5 MB?
No, it doesn't, except that that's the de facto limit without the URI limit. I could remove the reference to 5 MB and state that it's an arbitrary size limit. Do you think that would be clearer?
ah ok I see. Yeah I think removing the 5 mb reference is clearer
Force-pushed dad73a7 to f60da93
* add WebPurify unit test for throwing HTTP request too large error
* fix console error by using boolean instead of string
* POST to web_purify instead of GET
* add unit test for hitting and exceeding character limit
* post to web_purify instead of get
* add test for requests at character limit
* update fixtures to reflect post instead of get
* update test to reflect new profanity standards
* chunk text into requests smaller than 30_000 characters
* use next unless inside loop to satisfy linter
* add tests to cover new implementation
* update fixtures
* update front-end error handling
* in progress
* update error handling and add new edge case
* convert readOnly={true} to readOnly
* add tests for new edge cases
Recently, we've received many reports about students being unable to publish libraries because the project is too large: https://codedotorg.atlassian.net/browse/SL-578. Because libraries could theoretically use any part of a student's code, the entire project needs to be profanity checked to export any subset of it as a library.
I began my investigation by noticing that the 414 errors we received from WebPurify were getting hit well before WebPurify's stated character limit of 30,000 (see https://www.webpurify.com/faq/). Switching from a GET request to a POST request using form data instead of query parameters solved this problem. Once I started using a form POST, I could not hit any error or size limit from WebPurify, and profanity checking succeeded up to payloads of 1M characters. I reached out to the WebPurify team about the behavior, and they said that "[API] responses can become unpredictable above [30,000 characters]" and recommended keeping 30,000 as the limit.

So, I modified the function to split text on spaces and batch requests to the WebPurify API above 30,000 characters. It occurs to me that we could make a lower-risk change by enforcing a 30,000-character limit ourselves and making a single request, but I'd like to see how much actually going up to 30,000 characters (as opposed to wherever we were hitting the limit before; I was able to consistently trigger it around 20,000 in my testing) reduces our Zendesk ticket volume.
I could use guidance on a few things:
Links
Jira ticket: https://codedotorg.atlassian.net/browse/SL-578
Testing story
So far, I've added tests (minus the ones I couldn't seem to get to work, see notes above) and confirmed that locally, I can publish remixed versions of the projects mentioned in the ticket as a library.
Deployment strategy
Follow-up work
Privacy
Security
Caching
PR Checklist: