Sync from Google (#30149)
* - Fixes tokenizer line/column accounting. In a few cases it was calculating
  negative column values.
- The Strings::CodePointNumBytes was never used and it was incorrectly
  implemented. Fixed this and added unit test.

PiperOrigin-RevId: 329348931

* Automated g4 rollback of changelist 329348931.

*** Reason for rollback ***

Discovered some issues. Will address those, add a few more tests and explain the issue in detail in the fix CL.

*** Original change description ***

- Fixes tokenizer line/column accounting. In a few cases it was calculating
  negative column values.
- The Strings::CodePointNumBytes was never used and it was incorrectly
  implemented. Fixed this and added unit test.

PiperOrigin-RevId: 329380897

* Automated g4 rollback of changelist 329380897.

*** Reason for rollback ***

Fixed the corner case of invalid byte at the end of the document, leading to some out of range calculations.

*** Original change description ***

Automated g4 rollback of changelist 329348931.

*** Reason for rollback ***

Discovered some issues. Will address those, add a few more tests and explain the issue in detail in the fix CL.

*** Original change description ***

- Fixes tokenizer line/column accounting. In a few cases it was calculating
  negative column values.
- The Strings::CodePointNumBytes was never used and it was incorrectly
  implemented. Fixed this and added unit test.

PiperOrigin-RevId: 329398986

* Checks for low surrogates should not result in a crash. The fuzz library has
many low-surrogate test cases. Such code points are now replaced by
replacement chars.

https://unicodemap.org/range/77/Low_Surrogates/

PiperOrigin-RevId: 329406810

* Skips invalid byte and in sequence in one pass.

PiperOrigin-RevId: 329589616

* Automated g4 rollback of changelist 329589616.

*** Reason for rollback ***

Rolling back recent utf8 changes to fix negative columns. Didn't test properly.

*** Original change description ***

Skips invalid byte and in sequence in one pass.

PiperOrigin-RevId: 329942833

* Automated g4 rollback of changelist 329398986.

PiperOrigin-RevId: 329943822

Co-authored-by: Amaltas Bohra <amaltas@google.com>
caoboxiao and amaltas committed Sep 10, 2020
1 parent 273097c commit 90fcffd
Showing 1 changed file with 5 additions and 10 deletions.
15 changes: 5 additions & 10 deletions validator/cpp/htmlparser/strings.cc
@@ -85,9 +85,6 @@ void CaseTransformInternal(bool to_upper, std::string* s);
 // parameter. Returns false if next byte in the sequence is not a valid byte.
 bool ReadContinuationByte(uint8_t byte, uint8_t* out);

-// Checks the codepoints are in range of allowed utf-8 ranges.
-void CheckScalarValue(char32_t code_point);
-
 // Checks if the character is ASCII that is in range 1-127.
 inline bool IsOneByteASCIIChar(uint8_t c);

@@ -223,7 +220,11 @@ std::optional<char32_t> Strings::DecodeUtf8Symbol(std::string_view* s) {
   if (code_point < 0x0800) {
     return std::nullopt;
   }
-  CheckScalarValue(code_point);
+  // Check if this is codepoint is low surrgates.
+  if (code_point >= 0xd800 && code_point <= 0xdfff) {
+    return std::nullopt;
+  }
+
   return code_point;
 }

@@ -832,12 +833,6 @@ bool ReadContinuationByte(uint8_t byte, uint8_t* out) {
   return false;
 }

-void CheckScalarValue(char32_t code_point) {
-  CHECK((!(code_point >= 0xd800 && code_point <= 0xdfff)))
-      << "Lone surrogaate U+" + Strings::ToHexString(code_point) +
-             " is not a valid scalar value.";
-}
-
 inline bool IsOneByteASCIIChar(uint8_t c) {
   return (c & 0x80) == 0;
 }
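
The second hunk replaces a crashing CHECK with an early std::nullopt return when a three-byte sequence decodes to a surrogate (U+D800 to U+DFFF). The following is a minimal, self-contained sketch of that behavior, not the code in strings.cc: the function name DecodeThreeByteSequence and its by-value std::string_view parameter are assumptions made for illustration, and it handles only the three-byte case.

```cpp
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical helper (not part of strings.cc): decodes a single three-byte
// UTF-8 sequence from the front of `s`. Returns std::nullopt instead of
// crashing when the input is malformed, overlong, or encodes a surrogate.
std::optional<char32_t> DecodeThreeByteSequence(std::string_view s) {
  if (s.size() < 3) return std::nullopt;

  const auto b0 = static_cast<uint8_t>(s[0]);
  const auto b1 = static_cast<uint8_t>(s[1]);
  const auto b2 = static_cast<uint8_t>(s[2]);

  // The lead byte of a three-byte sequence has the form 1110xxxx.
  if ((b0 & 0xF0) != 0xE0) return std::nullopt;
  // Each continuation byte has the form 10xxxxxx.
  if ((b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80) return std::nullopt;

  const char32_t code_point = (static_cast<char32_t>(b0 & 0x0F) << 12) |
                              (static_cast<char32_t>(b1 & 0x3F) << 6) |
                              static_cast<char32_t>(b2 & 0x3F);

  // Values below U+0800 fit in fewer bytes, so this is an overlong encoding.
  if (code_point < 0x0800) return std::nullopt;
  // Surrogates are not valid scalar values; reject them without aborting.
  if (code_point >= 0xD800 && code_point <= 0xDFFF) return std::nullopt;

  return code_point;
}
```

With this shape, a caller can decide what to do with a std::nullopt result (for example, substitute a replacement character), so fuzzed input containing encoded surrogates no longer aborts the process.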