New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fully strip block XML before TTSifying #16718
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, pending comments.
@@ -100,6 +100,18 @@ def self.tts_upload_to_s3(text, filename) | |||
end | |||
end | |||
|
|||
def self.sanitize(text) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps sanitize_and_render
? Otherwise, I would expect sanitize
to accept and return a string, nothing more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be confused; accepting and returning a string is precisely what it does
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, I didn't realize what TTSSafeRenderer.render
did.
test 'sanitize html' do | ||
assert_equal @level_with_raw_html.tts_markdown_instructions_text, "This should have no excess formatting \n" | ||
test 'sanitize html and xml' do | ||
assert_equal @level_with_raw_html.tts_markdown_instructions_text, "This should have no excess formatting\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The expected and actual values should be flipped (for better error output on failure).
(also elsewhere in the file, feel welcome to do as a separate PR, or leave for me to do)
test 'sanitize html and xml' do | ||
assert_equal @level_with_raw_html.tts_markdown_instructions_text, "This should have no excess formatting\n" | ||
assert_equal @level_with_block_html.tts_markdown_instructions_text, "This block should get stripped:\n" | ||
assert_equal @level_with_xml.tts_markdown_instructions_text, "This block should get stripped:\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I be surprised that there is only one \n
rather than two in the expected output?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depends on how familiar you are with Redcarpet 😄
Two newlines in the input are necessary to separate paragraphs and to treat the xml as its own "block", as is standard for markdown input.
Newlines within the content itself are irrelevant, as is standard for markdown output, but for some reason one is always added to the end.
# | ||
# to avoid this, we simply and aggressively strip out any xml before passing | ||
# the rest to redcarpet, despite the risk of invoking the wrath of Zalgo. | ||
TTSSafeRenderer.render(text.gsub(/<xml>.*<\/xml>/m, '')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any concern of there being multiple </xml>
strings in text
? If so, does the /m
invoke the desired behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, good catch. There is indeed and I should make this regex less greedy.
The /m
doesn't influence that side effect, though; it's just there to make sure the regex with deal with newlines
a9d088a
to
6d016f5
Compare
Added a Loofah implementation rather than regex because why on earth was I trying to parse XML using regex. PTAL |
LGTM (Ignore the PTAL [now deleted]). |
d541846
to
8a00df2
Compare
No description provided.