Sync from Google (#30149)

* - Fixes tokenizer line/column accounting. In a few cases it was calculating negative column values.
  - The Strings::CodePointNumBytes was never used and it was incorrectly implemented. Fixed this and added unit test.

  PiperOrigin-RevId: 329348931

* Automated g4 rollback of changelist 329348931.

  *** Reason for rollback ***
  Discovered some issues. Will address those, add a few more tests and explain the issue in detail in the fix CL.

  *** Original change description ***
  - Fixes tokenizer line/column accounting. In a few cases it was calculating negative column values.
  - The Strings::CodePointNumBytes was never used and it was incorrectly implemented. Fixed this and added unit test.

  PiperOrigin-RevId: 329380897

* Automated g4 rollback of changelist 329380897.

  *** Reason for rollback ***
  Fixed the corner case of invalid byte at the end of the document, leading to some out of range calculations.

  *** Original change description ***
  Automated g4 rollback of changelist 329348931.

  *** Reason for rollback ***
  Discovered some issues. Will address those, add a few more tests and explain the issue in detail in the fix CL.

  *** Original change description ***
  - Fixes tokenizer line/column accounting. In a few cases it was calculating negative column values.
  - The Strings::CodePointNumBytes was never used and it was incorrectly implemented. Fixed this and added unit test.

  PiperOrigin-RevId: 329398986

* Checks for low surrogates should not result in crash. The fuzz library has many low surrogates test cases. Now such code_points are ignored by replacement chars.
  https://unicodemap.org/range/77/Low_Surrogates/

  PiperOrigin-RevId: 329406810

* Skips invalid byte and in sequence in one pass.

  PiperOrigin-RevId: 329589616

* Automated g4 rollback of changelist 329589616.

  *** Reason for rollback ***
  Rolling back recent utf8 changes to fix negative columns. Didn't test properly.

  *** Original change description ***
  Skips invalid byte and in sequence in one pass.

  PiperOrigin-RevId: 329942833

* Automated g4 rollback of changelist 329398986.

  PiperOrigin-RevId: 329943822

Co-authored-by: Amaltas Bohra <amaltas@google.com>
ampproject · Sep 10, 2020 · 90fcffd
1 parent 273097c
commit 90fcffd
Showing 1 changed file with 5 additions and 10 deletions.
diff --git a/validator/cpp/htmlparser/strings.cc b/validator/cpp/htmlparser/strings.cc
@@ -85,9 +85,6 @@ void CaseTransformInternal(bool to_upper, std::string* s);
 // parameter. Returns false if next byte in the sequence is not a valid byte.
 bool ReadContinuationByte(uint8_t byte, uint8_t* out);
 
-// Checks the codepoints are in range of allowed utf-8 ranges.
-void CheckScalarValue(char32_t code_point);
-
 // Checks if the character is ASCII that is in range 1-127.
 inline bool IsOneByteASCIIChar(uint8_t c);
 
@@ -223,7 +220,11 @@ std::optional<char32_t> Strings::DecodeUtf8Symbol(std::string_view* s) {
     if (code_point < 0x0800) {
       return std::nullopt;
     }
-    CheckScalarValue(code_point);
+    // Reject surrogate code points (U+D800-U+DFFF); they are not valid scalar values.
+    if (code_point >= 0xd800 && code_point <= 0xdfff) {
+      return std::nullopt;
+    }
+
     return code_point;
   }
 
@@ -832,12 +833,6 @@ bool ReadContinuationByte(uint8_t byte, uint8_t* out) {
   return false;
 }
 
-void CheckScalarValue(char32_t code_point) {
-  CHECK((!(code_point >= 0xd800 && code_point <= 0xdfff)))
-        << "Lone surrogaate U+" + Strings::ToHexString(code_point) +
-           " is not a valid scalar value.";
-}
-
 inline bool IsOneByteASCIIChar(uint8_t c) {
   return (c & 0x80) == 0;
 }
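
The behavioral change in this diff is that a surrogate code point no longer trips a CHECK (which aborts the process) and instead makes the decoder return std::nullopt, so a caller can substitute a replacement character. The following is a minimal standalone sketch of that idea for three-byte sequences only. DecodeThreeByteSequence and DecodeOrReplace are hypothetical names used for illustration; they are not part of the htmlparser API, and the real Strings::DecodeUtf8Symbol handles more cases than shown here.

```cpp
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical helper (not the htmlparser API): decodes exactly one
// three-byte UTF-8 sequence from the front of `s`.
// Returns std::nullopt instead of aborting when the input is invalid.
std::optional<char32_t> DecodeThreeByteSequence(std::string_view s) {
  if (s.size() < 3) return std::nullopt;
  const auto b0 = static_cast<uint8_t>(s[0]);
  const auto b1 = static_cast<uint8_t>(s[1]);
  const auto b2 = static_cast<uint8_t>(s[2]);

  // The lead byte must look like 1110xxxx and both continuation
  // bytes must look like 10xxxxxx.
  if ((b0 & 0xF0) != 0xE0) return std::nullopt;
  if ((b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80) return std::nullopt;

  const char32_t code_point = (static_cast<char32_t>(b0 & 0x0F) << 12) |
                              (static_cast<char32_t>(b1 & 0x3F) << 6) |
                              static_cast<char32_t>(b2 & 0x3F);

  // Values below U+0800 fit in fewer bytes, so this is an overlong encoding.
  if (code_point < 0x0800) return std::nullopt;

  // Surrogates (U+D800 through U+DFFF) are not valid scalar values.
  if (code_point >= 0xD800 && code_point <= 0xDFFF) return std::nullopt;

  return code_point;
}

// Hypothetical caller: maps any decoding failure to U+FFFD
// (REPLACEMENT CHARACTER) rather than crashing.
char32_t DecodeOrReplace(std::string_view s) {
  return DecodeThreeByteSequence(s).value_or(char32_t{0xFFFD});
}
```

Returning an empty optional keeps the policy decision (skip, replace, or report) with the caller, which matters for fuzzed or otherwise untrusted input where aborting is not acceptable.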