SL-578: Libraries too long #50766

Merged 17 commits into staging from task/sl-578/libraries-too-long on May 18, 2023
Conversation

@ebeastlake (Contributor) commented Mar 15, 2023

Recently, we've received many reports about students being unable to publish libraries because the project is too large: https://codedotorg.atlassian.net/browse/SL-578. Because libraries could theoretically use any part of a student's code, the entire project needs to be profanity checked to export any subset of it as a library.

I began my investigation by noticing that the 414 errors we received from WebPurify were getting hit well before WebPurify's stated character limit of 30,000 (see https://www.webpurify.com/faq/). Switching from a GET request to a POST request using form data instead of query parameters solved this problem. Once I started using a form POST, I could not hit any error or size limit from WebPurify, and profanity checking succeeded up to payloads of 1M characters. I reached out to the WebPurify team about the behavior, and they said that "[API] responses can become unpredictable above [30,000 characters]" and recommended keeping that as the limit.

So, I modified the function to split text on spaces and batch requests to the WebPurify API above 30,000 characters. It occurs to me that we could make a lower-risk change by enforcing a 30,000-character limit ourselves and making a single request, but I'd like to see how much actually going up to 30,000 characters (as opposed to wherever we were hitting the limit before -- I was able to consistently trigger it around 20,000 characters in my testing) reduces our Zendesk ticket volume.
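
For illustration, here is a minimal sketch of the chunking approach, assuming a WEBPURIFY_CHARACTER_LIMIT of 30,000. The method name matches the test added in this PR, but the body is a sketch rather than the exact implementation:

WEBPURIFY_CHARACTER_LIMIT = 30_000

# Split text on spaces and greedily pack words into chunks that stay at or
# under the WebPurify character limit. A single word longer than the limit
# would still produce one oversized chunk.
def chunk_text_for_webpurify(text)
  chunks = ['']
  text.split(' ').each do |word|
    if chunks.last.length + word.length + 1 > WEBPURIFY_CHARACTER_LIMIT
      chunks << word
    else
      chunks.last << (chunks.last.empty? ? word : " #{word}")
    end
  end
  chunks
end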

I could use guidance on a few things:

  • Was this the only place we enforced project size limits? Are we at risk of exposing other errors/places where a large project could cause problems if the profanity checking is limitless?
  • Error handling -- it doesn't seem like we had much before, and I've been unable to trigger errors consistently in testing to see whether we're handling them the same way
  • Using WebMock to stub a request timeout (see comment here)
  • Why don't the form bodies appear in the VCR fixtures? (See here).

Links

Jira ticket: https://codedotorg.atlassian.net/browse/SL-578

Testing story

So far, I've added tests (minus the ones I couldn't seem to get to work, see notes above) and confirmed that locally, I can publish remixed versions of the projects mentioned in the ticket as a library.

Deployment strategy

Follow-up work

Privacy

Security

Caching

PR Checklist:

  • Tests provide adequate coverage
  • Privacy and Security impacts have been assessed
  • Code is well-commented
  • New features are translatable or updates will not break translations
  • Relevant documentation has been added or updated
  • User impact is well-understood and desirable
  • Pull Request is labeled appropriately
  • Follow-up work items (including potential tech debt) are tracked and linked


CONNECTION_OPTIONS = {
  read_timeout: DCDO.get('webpurify_http_read_timeout', 10),
  open_timeout: DCDO.get('webpurify_tcp_connect_timeout', 5)
}
@ebeastlake (Contributor, Author):

TODO: I need to test that these CONNECTION_OPTIONS are having an effect when used in Net::HTTP.start.
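
As a sanity check, here is roughly how those options would flow into Net::HTTP.start (a sketch; the URI and form data are placeholders). Net::HTTP.start accepts read_timeout and open_timeout in a trailing options hash, so passing CONNECTION_OPTIONS there should apply to every request made in the block:

require 'net/http'

uri = URI('http://api1.webpurify.com/services/rest')
Net::HTTP.start(uri.host, uri.port, CONNECTION_OPTIONS) do |http|
  request = Net::HTTP::Post.new(uri)
  request.set_form([['text', 'example']], 'multipart/form-data')
  # Raises Net::OpenTimeout / Net::ReadTimeout if the timeouts are hit.
  http.request(request)
end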

@ebeastlake (Contributor, Author):

I tried stubbing the POST request and using WebMock's .to_timeout function to simulate a timeout and test the read_timeout and open_timeout here, but for some reason, the stub wasn't working. I have a feeling it might have to do with the fact that I was mixing VCR and WebMock in the tests. I'd still love to write that test if anyone has expertise here and would like to pair.
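
One hedged approach for future debugging: disable VCR for the block so the WebMock stub is actually consulted. Whether the assertion below is right depends on how timeouts end up being handled in find_potential_profanities:

# Turn VCR off so the WebMock stub intercepts the request instead of a cassette.
VCR.turned_off do
  stub_request(:post, 'http://api1.webpurify.com/services/rest').to_timeout
  # WebMock's to_timeout raises Net::OpenTimeout for Net::HTTP clients.
  assert_raises(Net::OpenTimeout) do
    WebPurify.find_potential_profanities('some text')
  end
end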

def test_find_potential_profanities_at_character_limit
max_length_string = 'f' * WEBPURIFY_CHARACTER_LIMIT
assert_nil WebPurify.find_potential_profanities(max_length_string)
end
end
@ebeastlake (Contributor, Author):

Now that I've switched to a POST, I cannot seem to trigger the large payload error. What's up with that?

@ebeastlake (Contributor, Author):

Per conversation with the WebPurify team, it's not unexpected that we can no longer trigger the large payload error. However, they still recommend batching requests, so each request has a payload of 30,000 characters or less. They make no promises on the accuracy of their profanity detection on payloads larger than 30,000 characters.

@@ -26,8 +29,13 @@ def test_find_potential_profanities
end

def test_find_potential_profanities_with_language
assert_nil WebPurify.find_potential_profanities('scheiße', ['en'])
assert_nil WebPurify.find_potential_profanities('scheiße', ['es'])
@ebeastlake (Contributor, Author):

Since the fixtures were updated, this test had to change because WebPurify now considers scheiße profanity in English and German.

method: get
uri: http://api1.webpurify.com/services/rest/?api_key=mocksecret&format=json&lang=en&method=webpurify.live.return&text=not%20a%20swear
method: post
uri: http://api1.webpurify.com/services/rest
body:
encoding: US-ASCII
string: ""
@ebeastlake (Contributor, Author):

When the POST is updated to use multipart form data, those bodies don't get captured anywhere in the VCR fixtures. It doesn't cause test failures, but that might be because the tests are executed in the same order every time. Does anyone have context here?
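
One hedged guess: VCR's default request matchers are :method and :uri, so cassettes replay successfully even if the recorded body is blank. Forcing a body match (the cassette name below is illustrative) would at least surface whether the multipart payload is being recorded at all:

VCR.use_cassette('web_purify', match_requests_on: [:method, :uri, :body]) do
  WebPurify.find_potential_profanities('not a swear')
end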


response = http.request(request)

next unless response.is_a?(Net::HTTPSuccess)
@ebeastlake (Contributor, Author):

How do we want to handle errors here? It doesn't seem like there was explicit error handling before. (There was some client-side logic to render a useful message specifically for the large payload error.)

Contributor:

The question here is whether we want to allow publishing if webpurify is down or something, yeah? I think that's maybe worth clarifying with the group while we're here.

Contributor:

In general our error handling leaves something to be desired. I think it's worth reporting back a useful retry message if there is an issue communicating with WebPurify.
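
To make that concrete, here is one possible shape for it (ProfanityCheckError is hypothetical, not part of this PR):

class ProfanityCheckError < StandardError; end

begin
  response = http.request(request)
  # Surface non-2xx responses instead of silently skipping the chunk.
  raise ProfanityCheckError, "WebPurify returned #{response.code}" unless response.is_a?(Net::HTTPSuccess)
rescue Net::OpenTimeout, Net::ReadTimeout => exception
  # The caller can rescue this and render a useful "please retry" message.
  raise ProfanityCheckError, "WebPurify timed out: #{exception.message}"
end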

@ebeastlake requested review from maddiedierker, Hamms, sureshc and a team March 22, 2023 22:06
@bencodeorg (Contributor):

Is there a word missing in this sentence?

It occurs to me that we could make a lower-risk change by enforcing a 30,000-character limit ourselves and making a single request but seeing how much of an actually going up to 30,000 characters (as opposed to whenever we were actually hitting the limit before...

@bencodeorg (Contributor) left a comment:

This looks great! I don't have a lot of context on the broader questions you have in your PR description, but I think clarifying what we want to do if we get errors making requests to webpurify is a good idea.

@@ -16,18 +16,52 @@ def setup
c.filter_sensitive_data('<API_KEY>') {CDO.webpurify_key}
end

def test_chunk_text_for_webpurify
Contributor:

Thanks for adding these, very helpful :)
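
For readers without the full diff, the chunking test might look roughly like this (a sketch; the actual tests in the PR may differ):

def test_chunk_text_for_webpurify
  text = 'word ' * 10_000  # ~50,000 characters, so it must be split
  chunks = WebPurify.chunk_text_for_webpurify(text)
  assert chunks.length > 1
  assert chunks.all? {|chunk| chunk.length <= WEBPURIFY_CHARACTER_LIMIT}
end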

['method', 'webpurify.live.return'],
['text', chunk]
]
request.set_form form_data, 'multipart/form-data'
Contributor:

For my understanding, why do you need this?

@ebeastlake (Contributor, Author):

My understanding is if you want to submit form data through Net::HTTP, you have to create the request (Post.new) and attach the form data (set_form) in separate steps. The Net::HTTP library has some other shortcut methods like post_form (see documentation here), but it wasn't getting me what I wanted from WebPurify, I think because the default encoding is application/x-www-form-urlencoded.
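
For comparison, a minimal sketch of both approaches (the URI and form fields are placeholders):

require 'net/http'

uri = URI('http://api1.webpurify.com/services/rest')

# Shortcut: post_form always encodes as application/x-www-form-urlencoded.
Net::HTTP.post_form(uri, 'text' => 'example')

# Two-step version: build the request, then choose the encoding explicitly.
request = Net::HTTP::Post.new(uri)
request.set_form([['text', 'example']], 'multipart/form-data')
Net::HTTP.start(uri.host, uri.port) {|http| http.request(request)}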



@molly-moen (Contributor):

@ebeastlake to answer one of your questions: we do have a general max file size of 5 MB

@@ -21,7 +21,7 @@ export default class LibraryIdCopier extends React.Component {
type="text"
ref={channelId => (this.channelId = channelId)}
onClick={event => event.target.select()}
readOnly="true"
readOnly={true}
Collaborator:

I have no big context on the whole PR, just a small comment re this line:
readOnly={true} can also be simplified to just readOnly.

@molly-moen (Contributor) left a comment:

LGTM!

# Returns all the profanities in text (if any) or nil (if none).
# @param [String] text The text to search for profanity within.
# @param [Array<String>] language_codes The set of languages to search for profanity in.
# @return [Array<String>, nil] The profanities (if any) or nil (if none).
def self.find_potential_profanities(text, language_codes = ['en'])
return nil unless CDO.webpurify_key && Gatekeeper.allows('webpurify', default: true)
return nil if text.nil?

# This is an artificial limit to prevent us from profanity-checking a file up to 5MB (the project size limit)
Contributor:

nit: should this comment read "prevent us from profanity-checking a file over 5MB"?

@ebeastlake (Contributor, Author):

I don't think so, but thinking out loud: Files over 5MB don't exist, but if we removed the limit entirely, we could profanity check a 5 MB file (or 4 MB file, or 3 MB file, or 2 MB file, all of which would be a lot and probably more than we want in the immediate future). Previously, we maxed out in the ~7,000-30,000 character range (<0.03 MB).

I agree that the comment might be unclear, though. How about "this is an artificial limit to prevent us from increasing our profanity-checking limit by too many orders of magnitude (from 0.007-0.03 MB to 0.12-5 MB) at once"? Or something...

Contributor:

Yeah I think I was confused by the "up to 5MB" line, does this limit have anything to do with 5 mb?

@ebeastlake (Contributor, Author):

No, it doesn't, except that that's the de facto limit without the URI limit. I could remove the reference to 5 MB and state that it's an arbitrary size limit. Do you think that would be clearer?

Contributor:

ah ok I see. Yeah I think removing the 5 mb reference is clearer

@ebeastlake force-pushed the task/sl-578/libraries-too-long branch from dad73a7 to f60da93 on May 18, 2023 19:45
@ebeastlake merged commit 66c3d51 into staging May 18, 2023
2 checks passed
@ebeastlake deleted the task/sl-578/libraries-too-long branch May 18, 2023 23:44
pablo-code-org pushed a commit that referenced this pull request May 25, 2023
* add WebPurify unit test for throwing HTTP request too large error

* fix console error by using boolean instead of string

* POST to web_purify instead of GET

* add unit test for hitting and exceeding character limit

* post to web_purify instead of get

* add test for requests at character limit

* update fixtures to reflect post instead of get

* update test to reflect new profanity standards

* chunk text into requests smaller than 30_000 characters

* use next unless inside loop to satisfy linter

* add tests to cover new implementation

* update fixtures

* update front-end error handling

* in progress

* update error handling and add new edge case

* convert readOnly={true} to readOnly

* add tests for new edge cases
snickell pushed a commit that referenced this pull request Feb 3, 2024