SL-578: Libraries too long #50766

Merged 17 commits into staging from task/sl-578/libraries-too-long on May 18, 2023
Conversation

@ebeastlake (Contributor) commented Mar 15, 2023

Recently, we've received many reports about students being unable to publish libraries because the project is too large: https://codedotorg.atlassian.net/browse/SL-578. Because libraries could theoretically use any part of a student's code, the entire project needs to be profanity checked to export any subset of it as a library.

I began my investigation by noticing that the 414 errors we received from WebPurify were getting hit well before WebPurify's stated character limit of 30,000 (see https://www.webpurify.com/faq/). Switching from a GET request to a POST request using form data instead of query parameters solved this problem. Once I started using a form POST, I could not hit any error or size limit from WebPurify, and profanity checking succeeded up to payloads of 1M characters. I reached out to the WebPurify team about the behavior, and they said that "[API] responses can become unpredictable above [30,000 characters]" and recommended keeping that as the limit.

So, I modified the function to split text on spaces and batch requests to the WebPurify API above 30,000 characters. It occurs to me that we could make a lower-risk change by enforcing a 30,000-character limit ourselves and making a single request, but I'd like to see how much actually going up to 30,000 characters (as opposed to wherever we were hitting the limit before -- I was able to consistently trigger it around 20,000 characters in my testing) reduces our Zendesk ticket volume.
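
For illustration, here is a minimal sketch of the chunking approach, assuming a WEBPURIFY_CHARACTER_LIMIT of 30,000. The method name matches the test added in this PR, but the body is a sketch rather than the exact implementation:

WEBPURIFY_CHARACTER_LIMIT = 30_000

# Split text on spaces and greedily pack words into chunks that stay at or
# under the WebPurify character limit. A single word longer than the limit
# would still produce one oversized chunk.
def chunk_text_for_webpurify(text)
  chunks = ['']
  text.split(' ').each do |word|
    if chunks.last.length + word.length + 1 > WEBPURIFY_CHARACTER_LIMIT
      chunks << word
    else
      chunks.last << (chunks.last.empty? ? word : " #{word}")
    end
  end
  chunks
end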

I could use guidance on a few things:

  • Was this the only place we enforced project size limits? Are we at risk of exposing other errors/places where a large project could cause problems if the profanity checking is limitless?
  • Error handling -- it doesn't seem like we had much before, and I've been unable to trigger errors consistently in testing to see whether we're handling them the same way
  • Using WebMock to stub a request timeout (see comment here)
  • Why don't the form bodies appear in the VCR fixtures? (See here).

Links

Jira ticket: https://codedotorg.atlassian.net/browse/SL-578

Testing story

So far, I've added tests (minus the ones I couldn't seem to get to work, see notes above) and confirmed that locally, I can publish remixed versions of the projects mentioned in the ticket as a library.

Deployment strategy

Follow-up work

Privacy

Security

Caching

PR Checklist:

  • Tests provide adequate coverage
  • Privacy and Security impacts have been assessed
  • Code is well-commented
  • New features are translatable or updates will not break translations
  • Relevant documentation has been added or updated
  • User impact is well-understood and desirable
  • Pull Request is labeled appropriately
  • Follow-up work items (including potential tech debt) are tracked and linked


CONNECTION_OPTIONS = {
  read_timeout: DCDO.get('webpurify_http_read_timeout', 10),
  open_timeout: DCDO.get('webpurify_tcp_connect_timeout', 5)
}
@ebeastlake (Contributor, Author):

TODO: I need to test that these CONNECTION_OPTIONS are having an effect when used in Net::HTTP.start.
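
As a sanity check, here is roughly how those options would flow into Net::HTTP.start (a sketch; the URI and form data are placeholders). Net::HTTP.start accepts read_timeout and open_timeout in a trailing options hash, so passing CONNECTION_OPTIONS there should apply to every request made in the block:

require 'net/http'

uri = URI('http://api1.webpurify.com/services/rest')
Net::HTTP.start(uri.host, uri.port, CONNECTION_OPTIONS) do |http|
  request = Net::HTTP::Post.new(uri)
  request.set_form([['text', 'example']], 'multipart/form-data')
  # Raises Net::OpenTimeout / Net::ReadTimeout if the timeouts are hit.
  http.request(request)
end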

@ebeastlake (Contributor, Author):

I tried stubbing the POST request and using WebMock's .to_timeout function to simulate a timeout and test the read_timeout and open_timeout here, but for some reason, the stub wasn't working. I have a feeling it might have to do with the fact that I was mixing VCR and WebMock in the tests. I'd still love to write that test if anyone has expertise here and would like to pair.
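
One hedged approach for future debugging: disable VCR for the block so the WebMock stub is actually consulted. Whether the assertion below is right depends on how timeouts end up being handled in find_potential_profanities:

# Turn VCR off so the WebMock stub intercepts the request instead of a cassette.
VCR.turned_off do
  stub_request(:post, 'http://api1.webpurify.com/services/rest').to_timeout
  # WebMock's to_timeout raises Net::OpenTimeout for Net::HTTP clients.
  assert_raises(Net::OpenTimeout) do
    WebPurify.find_potential_profanities('some text')
  end
end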

def test_find_potential_profanities_at_character_limit
max_length_string = 'f' * WEBPURIFY_CHARACTER_LIMIT
assert_nil WebPurify.find_potential_profanities(max_length_string)
end
end
@ebeastlake (Contributor, Author):

Now that I've switched to a POST, I cannot seem to trigger the large payload error. What's up with that?

@ebeastlake (Contributor, Author):

Per conversation with the WebPurify team, it's not unexpected that we can no longer trigger the large payload error. However, they still recommend batching requests, so each request has a payload of 30,000 characters or less. They make no promises on the accuracy of their profanity detection on payloads larger than 30,000 characters.

@@ -26,8 +29,13 @@ def test_find_potential_profanities
end

def test_find_potential_profanities_with_language
assert_nil WebPurify.find_potential_profanities('scheiße', ['en'])
assert_nil WebPurify.find_potential_profanities('scheiße', ['es'])
@ebeastlake (Contributor, Author):

Since the fixtures were updated, this test had to change because WebPurify now considers scheiße profanity in English and German.

method: get
uri: http://api1.webpurify.com/services/rest/?api_key=mocksecret&format=json&lang=en&method=webpurify.live.return&text=not%20a%20swear
method: post
uri: http://api1.webpurify.com/services/rest
body:
encoding: US-ASCII
string: ""
@ebeastlake (Contributor, Author):

When the POST is updated to use multipart form data, those bodies don't get captured anywhere in the VCR fixtures. It doesn't cause test failures, but that might be because the tests are executed in the same order every time. Does anyone have context here?
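
One hedged guess: VCR's default request matchers are :method and :uri, so cassettes replay successfully even if the recorded body is blank. Forcing a body match (the cassette name below is illustrative) would at least surface whether the multipart payload is being recorded at all:

VCR.use_cassette('web_purify', match_requests_on: [:method, :uri, :body]) do
  WebPurify.find_potential_profanities('not a swear')
end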


response = http.request(request)

next unless response.is_a?(Net::HTTPSuccess)
@ebeastlake (Contributor, Author):

How do we want to handle errors here? It doesn't seem like there was explicit error handling before. (There was some client-side logic to render a useful message specifically for the large payload error.)

Contributor:

The question here is whether we want to allow publishing if webpurify is down or something, yeah? I think that's maybe worth clarifying with the group while we're here.

Contributor:

In general our error handling leaves something to be desired. I think it's worth reporting back a useful retry message if there is an issue communicating with WebPurify.
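
To make that concrete, here is one possible shape for it (ProfanityCheckError is hypothetical, not part of this PR):

class ProfanityCheckError < StandardError; end

begin
  response = http.request(request)
  # Surface non-2xx responses instead of silently skipping the chunk.
  raise ProfanityCheckError, "WebPurify returned #{response.code}" unless response.is_a?(Net::HTTPSuccess)
rescue Net::OpenTimeout, Net::ReadTimeout => exception
  # The caller can rescue this and render a useful "please retry" message.
  raise ProfanityCheckError, "WebPurify timed out: #{exception.message}"
end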

@ebeastlake requested review from maddiedierker, Hamms, sureshc and a team March 22, 2023 22:06
@bencodeorg (Contributor):

Is there a word missing in this sentence?

It occurs to me that we could make a lower-risk change by enforcing a 30,000-character limit ourselves and making a single request but seeing how much of an actually going up to 30,000 characters (as opposed to whenever we were actually hitting the limit before...

@bencodeorg (Contributor) left a comment:

This looks great! I don't have a lot of context on the broader questions you have in your PR description, but I think clarifying what we want to do if we get errors making requests to webpurify is a good idea.

@@ -16,18 +16,52 @@ def setup
c.filter_sensitive_data('<API_KEY>') {CDO.webpurify_key}
end

def test_chunk_text_for_webpurify
Contributor:

Thanks for adding these, very helpful :)
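
For readers without the full diff, the chunking test might look roughly like this (a sketch; the actual tests in the PR may differ):

def test_chunk_text_for_webpurify
  text = 'word ' * 10_000  # ~50,000 characters, so it must be split
  chunks = WebPurify.chunk_text_for_webpurify(text)
  assert chunks.length > 1
  assert chunks.all? {|chunk| chunk.length <= WEBPURIFY_CHARACTER_LIMIT}
end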

['method', 'webpurify.live.return'],
['text', chunk]
]
request.set_form form_data, 'multipart/form-data'
Contributor:

For my understanding, why do you need this?

@ebeastlake (Contributor, Author):

My understanding is if you want to submit form data through Net::HTTP, you have to create the request (Post.new) and attach the form data (set_form) in separate steps. The Net::HTTP library has some other shortcut methods like post_form (see documentation here), but it wasn't getting me what I wanted from WebPurify, I think because the default encoding is application/x-www-form-urlencoded.
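
For comparison, a minimal sketch of both approaches (the URI and form fields are placeholders):

require 'net/http'

uri = URI('http://api1.webpurify.com/services/rest')

# Shortcut: post_form always encodes as application/x-www-form-urlencoded.
Net::HTTP.post_form(uri, 'text' => 'example')

# Two-step version: build the request, then choose the encoding explicitly.
request = Net::HTTP::Post.new(uri)
request.set_form([['text', 'example']], 'multipart/form-data')
Net::HTTP.start(uri.host, uri.port) {|http| http.request(request)}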



@molly-moen (Contributor):

@ebeastlake to answer one of your questions: we do have a general max file size of 5 MB

@@ -21,7 +21,7 @@ export default class LibraryIdCopier extends React.Component {
type="text"
ref={channelId => (this.channelId = channelId)}
onClick={event => event.target.select()}
readOnly="true"
readOnly={true}
Collaborator:

I have no big context on the whole PR, just a small comment re this line:
readOnly={true} can also be simplified to just readOnly.

@molly-moen (Contributor) left a comment:

LGTM!

# Returns all the profanities in text (if any) or nil (if none).
# @param [String] text The text to search for profanity within.
# @param [Array<String>] language_codes The set of languages to search for profanity in.
# @return [Array<String>, nil] The profanities (if any) or nil (if none).
def self.find_potential_profanities(text, language_codes = ['en'])
return nil unless CDO.webpurify_key && Gatekeeper.allows('webpurify', default: true)
return nil if text.nil?

# This is an artificial limit to prevent us from profanity-checking a file up to 5MB (the project size limit)
Contributor:

nit: should this comment read "prevent us from profanity-checking a file over 5MB"?

@ebeastlake (Contributor, Author):

I don't think so, but thinking out loud: Files over 5MB don't exist, but if we removed the limit entirely, we could profanity check a 5 MB file (or 4 MB file, or 3 MB file, or 2 MB file, all of which would be a lot and probably more than we want in the immediate future). Previously, we maxed out in the ~7,000-30,000 character range (<0.03 MB).

I agree that the comment might be unclear, though. How about "this is an artificial limit to prevent us from increasing our profanity-checking limit by too many orders of magnitude (from 0.007-0.03 MB to 0.12-5 MB) at once"? Or something...

Contributor:

Yeah I think I was confused by the "up to 5MB" line, does this limit have anything to do with 5 mb?

@ebeastlake (Contributor, Author):

No, it doesn't, except that that's the de facto limit without the URI limit. I could remove the reference to 5 MB and state that it's an arbitrary size limit. Do you think that would be clearer?

Contributor:

ah ok I see. Yeah I think removing the 5 mb reference is clearer

@ebeastlake force-pushed the task/sl-578/libraries-too-long branch from dad73a7 to f60da93 on May 18, 2023 19:45
@ebeastlake merged commit 66c3d51 into staging May 18, 2023
2 checks passed
@ebeastlake deleted the task/sl-578/libraries-too-long branch May 18, 2023 23:44
pablo-code-org pushed a commit that referenced this pull request May 25, 2023
* add WebPurify unit test for throwing HTTP request too large error

* fix console error by using boolean instead of string

* POST to web_purify instead of GET

* add unit test for hitting and exceeding character limit

* post to web_purify instead of get

* add test for requests at character limit

* update fixtures to reflect post instead of get

* update test to reflect new profanity standards

* chunk text into requests smaller than 30_000 characters

* use next unless inside loop to satisfy linter

* add tests to cover new implementation

* update fixtures

* update front-end error handling

* in progress

* update error handling and add new edge case

* convert readOnly={true} to readOnly

* add tests for new edge cases
snickell pushed a commit that referenced this pull request Feb 3, 2024