Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Lexer memory access beyond the given buffer when malformed input is passed. #7050

Merged
merged 7 commits into from Aug 8, 2017

Conversation

Projects
None yet
4 participants
@JohanEngelen
Copy link
Contributor

JohanEngelen commented Jul 31, 2017

Let me know what token value you want to return upon errors.

This change also enforces that the buffer is really null-terminated (or EOF), and that the terminating null character is included in the buffer interval that the Lexer should lex.
When a string is passed, the implicit null-terminator is necessary and the code must be modified:

    string text = "int"; // the test relies on this string begin null-terminated
-    scope Lexer lex1 = new Lexer(null, text.ptr, 0, text.length, 0, 0);
+    scope Lexer lex1 = new Lexer(null, text.ptr, 0, text.length+1, 0, 0);

Without including the null terminator in the length (the +1), nextToken would return TOKerror instead of TOKeof.

@dlang-bot

This comment has been minimized.

Copy link
Contributor

dlang-bot commented Jul 31, 2017

Thanks for your pull request, @JohanEngelen! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.

Some tips to help speed things up:

  • smaller, focused PRs are easier to review than big ones

  • try not to mix up refactoring or style changes with bug fixes or feature enhancements

  • provide helpful commit messages explaining the rationale behind each change

Bear in mind that large or tricky changes may require multiple rounds of review and revision.

Please see CONTRIBUTING.md for more information.

Bugzilla references

Your PR doesn't reference any Bugzilla issue.

If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.

@JohanEngelen

This comment has been minimized.

Copy link
Contributor Author

JohanEngelen commented Jul 31, 2017

Actually, I think the required null-terminator is a misfeature. Why specify buffer length at all in that case? Without the null-terminator requirement, you can nicely lex just a part of a longer code string.
So instead of returning TOKerror upon buffer end, perhaps we should return TOKeof ?

@JohanEngelen JohanEngelen force-pushed the JohanEngelen:lexerfuzz1 branch from 09d8c8b to 4757f61 Jul 31, 2017

@JohanEngelen

This comment has been minimized.

Copy link
Contributor Author

JohanEngelen commented Jul 31, 2017

I see that TOKeof is returned in other locations where an error occurs, but also end of buffer is similar as EOF.
I've removed the null-terminated requirement, as that is taken care of by the endoffset.

@@ -253,6 +270,12 @@ class Lexer
t.lineComment = null;
while (1)
{
// Return error token when end of buffer is reached unexpectedly.
if (p >= end) {

This comment has been minimized.

Copy link
@WalterBright

WalterBright Jul 31, 2017

Member

Lexing performance is critical to compiler speed, and this adds a compare/branch for every character processed. This is why a comparison to the buffer end is not done, and a sentinal (0 termination) is done. The sentinel does not require any extra code in the main path.

This comment has been minimized.

Copy link
@JohanEngelen

JohanEngelen Jul 31, 2017

Author Contributor

Some extra checking is needed, perhaps elsewhere, but currently the code reads beyond the buffer whenever one of the sub-functions (e.g. hexStringConstant) stumbles upon the null terminator. For example with input "\'\0" and with "{{q{{\0". Null-termination is not preventing the buffer overflow read with the current implementation.

(the performance hit of this comparison+branch is probably very low, because it's almost always false and thus easily predicted correctly)

This comment has been minimized.

Copy link
@UplinkCoder

UplinkCoder Jul 31, 2017

Member

not every processor has nice branch prediction.
add a sentinel if you must.

This comment has been minimized.

Copy link
@WalterBright

WalterBright Jul 31, 2017

Member

Null-termination is not preventing the buffer overflow read with the current implementation.

A test case demonstrating buffer overflow would be apropos. In any case, perhaps the error check should go into hexStringConstant instead of the main loop.

probably very low

My testing showed that every instruction per character matters. Lexing speed is crucial, and every drop counts.

This comment has been minimized.

Copy link
@WalterBright

WalterBright Jul 31, 2017

Member

BTW, dmd's compile speed has slowed down substantially in the recent past, due to an accumulation of barnacles. It's now slower than dmc++.

This comment has been minimized.

Copy link
@JohanEngelen

JohanEngelen Aug 1, 2017

Author Contributor

Of course the PR adds a testcase.... from commit 1.....

@JohanEngelen

This comment has been minimized.

Copy link
Contributor Author

JohanEngelen commented Aug 1, 2017

The problem is caused by the user not knowing when the file/buffer ended, when it ends on an incomplete string literal/char literal/comment/... The last token returned by nextToken would be for example TOKstring (for incomplete string literal), and then when the user calls nextToken again, a buffer overread occurs.
I have now moved the condition out of the loop, so it is performed once for each call to scan.
Fuzzing has thusfar not revealed new bugs.

The requirement of the endoffset parameter to be one beyond the last valid index is up for debate. That's what the original intent of endoffset was, judging from the code, but it does mean that the example of how to use Lexer needs to be modified with a +1 to include the implicit string literal null-terminator (couldn't find it in the spec that the null terminator is mandatory...)

@JohanEngelen

This comment has been minimized.

Copy link
Contributor Author

JohanEngelen commented Aug 1, 2017

An other more fragile solution is to move the pointer p back by one upon encountering e.g. an unterminated string constant (need to fix all locations where that happens).

@WalterBright

This comment has been minimized.

Copy link
Member

WalterBright commented Aug 1, 2017

a buffer overread occurs

Please post an example. I'm not seeing it.

@JohanEngelen

This comment has been minimized.

Copy link
Contributor Author

JohanEngelen commented Aug 1, 2017

For example with input "'\0" and with "{{q{{\0"

@JohanEngelen

This comment has been minimized.

Copy link
Contributor Author

JohanEngelen commented Aug 1, 2017

Note: mars.d line 1351 sets a buffer without including the null terminator in the length of that buffer.

@JohanEngelen

This comment has been minimized.

Copy link
Contributor Author

JohanEngelen commented Aug 1, 2017

Bad stuff happens here for example:

dmd/src/ddmd/lexer.d

Lines 1239 to 1255 in 28daf53

dchar c = *p++;
switch (c)
{
case '\n':
endOfLine();
break;
case '\r':
if (*p == '\n')
continue; // ignore
c = '\n'; // treat EndOfLine as \n character
endOfLine();
break;
case 0:
case 0x1A:
error("unterminated string constant starting at %s", start.toChars());
t.setString();
return TOKstring;

A character is read from *p and p is immediately increased. Then, if the char read is the null terminator, we return but p is pointing at an invalid memory location. The next time we call scan, a buffer overread occurs. The user cannot know about this, because the returned token is TOKstring, while the user is looping until he sees TOKeof.

@WalterBright

This comment has been minimized.

Copy link
Member

WalterBright commented Aug 2, 2017

Note: mars.d line 1351 sets a buffer without including the null terminator in the length of that buffer.

The Lexer relies on the 0 being past the end of the length, i.e. that buf[length] is 0. But there was another bug there, corrected here:

#7057

@JohanEngelen

This comment has been minimized.

Copy link
Contributor Author

JohanEngelen commented Aug 2, 2017

The Lexer relies on the 0 being past the end of the length

The current code is ambiguous to what the length means. In one spot, it code says that the endoffset (i.e. the length passed in #7057) is the last valid index in the buffer, but at the same time it is assigned to the end field that points one past the buffer.
I'll clarify and update the PR according to that the terminator is supposed to be at base[endoffset], and thus that that's a valid memory access.
In the meanwhile, I've found another buffer over-read related to UTF8 chars.
I've cooked up a different fix, focussing on pre/post conditions that would remove the need for the p > end boundscheck. Work in progress.

@JohanEngelen JohanEngelen changed the title Fix Lexer memory access beyond the given buffer when malformed input is passed. [WIP] Fix Lexer memory access beyond the given buffer when malformed input is passed. Aug 2, 2017

@dlang-bot dlang-bot added the WIP label Aug 2, 2017

@JohanEngelen JohanEngelen changed the title [WIP] Fix Lexer memory access beyond the given buffer when malformed input is passed. Fix Lexer memory access beyond the given buffer when malformed input is passed. Aug 2, 2017

@JohanEngelen

This comment has been minimized.

Copy link
Contributor Author

JohanEngelen commented Aug 2, 2017

OK, I completely reworked the patch. The boundscheck is gone, and instead I fixed the code paths that advanced p beyond the terminator / end of the buffer.
This fixes the other access violation not caught by the earlier patches.

@UplinkCoder

This comment has been minimized.

Copy link
Member

UplinkCoder commented Aug 2, 2017

that looks better.

@JohanEngelen

This comment has been minimized.

Copy link
Contributor Author

JohanEngelen commented Aug 2, 2017

that looks better.

I disagree, much more prone to bugs in the future. But you can have it your way.

@JohanEngelen

This comment has been minimized.

Copy link
Contributor Author

JohanEngelen commented Aug 7, 2017

ping

@WalterBright

This comment has been minimized.

Copy link
Member

WalterBright commented Aug 8, 2017

This fixes the other access violation not caught by the earlier patches.

Please include a test case for both access violations.

JohanEngelen added some commits Jul 31, 2017

Fix Lexer memory access beyond the given buffer when malformed input …
…is passed.

This change also enforces that the buffer is really null-terminated (or EOF), and that the terminating null character is included in the buffer interval that the Lexer should lex.
When a string is passed, the implicit null-terminator is necessary and the code must be modified:
+    string text = "int"; // the test relies on this string begin null-terminated
-    scope Lexer lex1 = new Lexer(null, text.ptr, 0, text.length, 0, 0);
+    scope Lexer lex1 = new Lexer(null, text.ptr, 0, text.length+1, 0, 0);
Lexer: remove null-terminated requirement and return TOKeof upon end …
…of stream. The same effect is fullfilled by the endoffset.
Complete rework of patch.
Instead of verifying `p <= end`, make sure we never advance beyond the terminating character.

@JohanEngelen JohanEngelen force-pushed the JohanEngelen:lexerfuzz1 branch from 890a63c to 1276b36 Aug 8, 2017

@JohanEngelen

This comment has been minimized.

Copy link
Contributor Author

JohanEngelen commented Aug 8, 2017

@WalterBright I've added a few more testcases.

@WalterBright WalterBright added auto-merge and removed WIP labels Aug 8, 2017

@dlang-bot dlang-bot merged commit ec0f8af into dlang:master Aug 8, 2017

4 of 5 checks passed

continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
CyberShadow/DAutoTest Documentation OK (8 additions, 5 deletions)
Details
auto-tester Pass: 10
Details
ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/jenkins/pr-merge This commit looks good
Details
@JohanEngelen

This comment has been minimized.

Copy link
Contributor Author

JohanEngelen commented Aug 8, 2017

Thanks. Further fuzzing has not turned up anything (yet? ;-).

@JohanEngelen JohanEngelen deleted the JohanEngelen:lexerfuzz1 branch Aug 8, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.