Fix ASCII dependency in strcpy_url and strlen_url #2535

Closed
wants to merge 1 commit into
from

Conversation

Projects
None yet
2 participants
@smuehlst
Contributor

smuehlst commented Apr 26, 2018

Commit 3c630f9 partially reverted
the changes from commit dd7521b because of
the problem that strcpy_url() was modified unilaterally without also
modifying strlen_url(). As a consequence strcpy_url() was again depending
on ASCII encoding.

This change fixes strlen_url() and strcpy_url() in parallel to use a
common host-encoding independent criterion for deciding whether an URL
character must be %-escaped.

Fix ASCII dependency in strcpy_url and strlen_url
Commit 3c630f9 partially reverted
the changes from commit dd7521b because of
the problem that strcpy_url() was modified unilaterally without also
modifying strlen_url(). As a consequence strcpy_url() was again depending
on ASCII encoding.

This change fixes strlen_url() and strcpy_url() in parallel to use a
common host-encoding independent criterion for deciding whether an URL
character must be %-escaped.
@bagder

This comment has been minimized.

Show comment Hide comment
@bagder

bagder Apr 26, 2018

Member

This changes behavior for redirects to URLs with "control codes" in the string, which I think is probably fine, but we need a test case or two that verify this so that we can trust the code a bit more.

Member

bagder commented Apr 26, 2018

This changes behavior for redirects to URLs with "control codes" in the string, which I think is probably fine, but we need a test case or two that verify this so that we can trust the code a bit more.

@smuehlst

This comment has been minimized.

Show comment Hide comment
@smuehlst

smuehlst Apr 27, 2018

Contributor

This changes behavior for redirects to URLs with "control codes" in the string, which I think is probably fine, but we need a test case or two that verify this so that we can trust the code a bit more.

The current pull request does not change the behavior regarding control codes. Please check the condition in function urlchar_needs_escaping, which excludes control characters from being escaped:

static bool urlchar_needs_escaping(int c)
{
    return !(ISCNTRL(c) || ISSPACE(c) || ISGRAPH(c));
}

To actually include control codes in the set of characters to be escaped I can simplify that condition back to !ISPRINT(c), and I can drop the new implementation of ISCNTRL(c) again as it is needed nowhere else.

For writing new test cases I need some guidance please. The test case test1138 apparently is the one that tests URLs with characters beyond 0x80. Can that be extended, or is a new test case in a new file required?

The documentation in the FILEFORMAT file says:

All data for a single test case resides in a single ASCII file.

But the file test1138 directly contains UTF-8. Do I understand it correctly that I can directly enter the control characters in the file then?

Contributor

smuehlst commented Apr 27, 2018

This changes behavior for redirects to URLs with "control codes" in the string, which I think is probably fine, but we need a test case or two that verify this so that we can trust the code a bit more.

The current pull request does not change the behavior regarding control codes. Please check the condition in function urlchar_needs_escaping, which excludes control characters from being escaped:

static bool urlchar_needs_escaping(int c)
{
    return !(ISCNTRL(c) || ISSPACE(c) || ISGRAPH(c));
}

To actually include control codes in the set of characters to be escaped I can simplify that condition back to !ISPRINT(c), and I can drop the new implementation of ISCNTRL(c) again as it is needed nowhere else.

For writing new test cases I need some guidance please. The test case test1138 apparently is the one that tests URLs with characters beyond 0x80. Can that be extended, or is a new test case in a new file required?

The documentation in the FILEFORMAT file says:

All data for a single test case resides in a single ASCII file.

But the file test1138 directly contains UTF-8. Do I understand it correctly that I can directly enter the control characters in the file then?

@bagder

This comment has been minimized.

Show comment Hide comment
@bagder

bagder May 3, 2018

Member

The current pull request does not change the behavior regarding control codes

You're right and I was wrong. This should maintain the same functionality on ASCII systems and yet enable non-ascii systems to do the right thing.

The test file format is actually totally encoding agnostic and can contain whatever is suitable so test 1138 has some UTF-8 sequences in there to verify exactly this sort of %-encoding. This test continues to work after your patch, as the CI builds and tests here already show.

I will merge this asap.

Member

bagder commented May 3, 2018

The current pull request does not change the behavior regarding control codes

You're right and I was wrong. This should maintain the same functionality on ASCII systems and yet enable non-ascii systems to do the right thing.

The test file format is actually totally encoding agnostic and can contain whatever is suitable so test 1138 has some UTF-8 sequences in there to verify exactly this sort of %-encoding. This test continues to work after your patch, as the CI builds and tests here already show.

I will merge this asap.

@bagder bagder closed this in 7f41432 May 3, 2018

@bagder

This comment has been minimized.

Show comment Hide comment
@bagder

bagder May 3, 2018

Member

Thanks a lot!

Member

bagder commented May 3, 2018

Thanks a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment