[stdlib] Rewrite UTF8._isValidUTF8() #1477

Merged
merged 7 commits into from Feb 29, 2016

Conversation

Projects
None yet
3 participants
@PatrickPijnappel
Collaborator

PatrickPijnappel commented Feb 28, 2016

What's in this pull request?

A rewrite of UTF8._isValidUTF8(), which further improves performance (mainly by removing branches) and reduces code size.

  • Non-ASCII: 35-45% speed-up
  • Invalid sequences: ~15% speed-up
  • ASCII: identical performance

Tested against original implementation for all input values (0...0xffffffff), results are identical.


Before merging this pull request to apple/swift repository:

  • Test pull request on Swift continuous integration.

Triggering Swift CI

The swift-ci is triggered by writing a comment on this PR addressed to the github user @swift-ci. Different tests will run depending on the specific comment that you use. The currently available comments are:

Smoke Testing

Platform Comment
All supported platforms @swift-ci Please smoke test
OS X platform @swift-ci Please smoke test OS X platform
Linux platform @swift-ci Please smoke test Linux platform

Validation Testing

Platform Comment
All supported platforms @swift-ci Please test
OS X platform @swift-ci Please test OS X platform
Linux platform @swift-ci Please test Linux platform

Note: Only members of the Apple organization can trigger swift-ci.

@gribozavr

This comment has been minimized.

Show comment
Hide comment
@gribozavr

gribozavr Feb 28, 2016

Collaborator

@swift-ci Please test

Collaborator

gribozavr commented Feb 28, 2016

@swift-ci Please test

@gribozavr

This comment has been minimized.

Show comment
Hide comment
@gribozavr

gribozavr Feb 28, 2016

Collaborator

@PatrickPijnappel There's a build failure, would you mind taking a look?

Collaborator

gribozavr commented Feb 28, 2016

@PatrickPijnappel There's a build failure, would you mind taking a look?

@PatrickPijnappel

This comment has been minimized.

Show comment
Hide comment
@PatrickPijnappel

PatrickPijnappel Feb 28, 2016

Collaborator

@gribozavr My bad! Will take a look and resolve.

Collaborator

PatrickPijnappel commented Feb 28, 2016

@gribozavr My bad! Will take a look and resolve.

+ // Require 10xx xxxx 110x xxxx.
+ if buffer & 0xc0e0 != 0x80c0 { return false }
+ // Disallow xxxx xxxx xxx0 000x (<= 7 bits case).
+ if buffer & 0x001e == 0x0000 { return false }

This comment has been minimized.

@gribozavr

gribozavr Feb 28, 2016

Collaborator

Sorry, I don't understand this case. I think you meant to test against 0x1f00 instead of 0x001e.

@gribozavr

gribozavr Feb 28, 2016

Collaborator

Sorry, I don't understand this case. I think you meant to test against 0x1f00 instead of 0x001e.

This comment has been minimized.

@gribozavr

gribozavr Feb 28, 2016

Collaborator

The bytes come in reverse order. Never mind.

@gribozavr

gribozavr Feb 28, 2016

Collaborator

The bytes come in reverse order. Never mind.

@gribozavr

This comment has been minimized.

Show comment
Hide comment
@gribozavr

gribozavr Feb 28, 2016

Collaborator

@PatrickPijnappel This is brilliant! Please fix the build issue, and I'll run the benchmarks.

Collaborator

gribozavr commented Feb 28, 2016

@PatrickPijnappel This is brilliant! Please fix the build issue, and I'll run the benchmarks.

@practicalswift

This comment has been minimized.

Show comment
Hide comment
@practicalswift

practicalswift Feb 28, 2016

Collaborator

@PatrickPijnappel Great stuff! 👍

Collaborator

practicalswift commented Feb 28, 2016

@PatrickPijnappel Great stuff! 👍

PatrickPijnappel added some commits Feb 28, 2016

[stdlib] Refactor UTF8._isValidUTF() to be useable throughout stdlib
This is as a replacement for usages of UTF8._numTrailingBytes().
Note that the sanityCheck was redundant at both call sites.
[stdlib] Replace usage of UTF8._numTrailingBytes()
The checks are technically different (previous check only rejected malformed initial code units, not all malformed sequences). Which is more correct is debatable, but since _buffer is only filled by transcoding from UTF-16 it should always be well-formed anyway and the difference is not very relevant.
[stdlib] Add tests for _isValidUTF8()
Replaces the tests for the removed _numTrailingBytes()
@PatrickPijnappel

This comment has been minimized.

Show comment
Hide comment
@PatrickPijnappel

PatrickPijnappel Feb 28, 2016

Collaborator

@gribozavr OK fixed the issues and added some validation tests as well!

Collaborator

PatrickPijnappel commented Feb 28, 2016

@gribozavr OK fixed the issues and added some validation tests as well!

@gribozavr

This comment has been minimized.

Show comment
Hide comment
@gribozavr

gribozavr Feb 28, 2016

Collaborator

@swift-ci Please test

Collaborator

gribozavr commented Feb 28, 2016

@swift-ci Please test

stdlib/public/core/Unicode.swift
- }
- }
- return true
+ public static func _isValidUTF8(buffer: UInt32) -> Bool {

This comment has been minimized.

@gribozavr

gribozavr Feb 28, 2016

Collaborator

Please use public // @testable (like we do in other places), for documentation purposes, and to make it easy to fix up the code when @testable works for the standard library.

@gribozavr

gribozavr Feb 28, 2016

Collaborator

Please use public // @testable (like we do in other places), for documentation purposes, and to make it easy to fix up the code when @testable works for the standard library.

@PatrickPijnappel

This comment has been minimized.

Show comment
Hide comment
@PatrickPijnappel

PatrickPijnappel Feb 29, 2016

Collaborator

Added @testable and fixed the failing test, it now passes on my machine.

Collaborator

PatrickPijnappel commented Feb 29, 2016

Added @testable and fixed the failing test, it now passes on my machine.

@gribozavr

This comment has been minimized.

Show comment
Hide comment
@gribozavr

gribozavr Feb 29, 2016

Collaborator

@swift-ci Please test

Collaborator

gribozavr commented Feb 29, 2016

@swift-ci Please test

@gribozavr

This comment has been minimized.

Show comment
Hide comment
@gribozavr

gribozavr Feb 29, 2016

Collaborator

Running benchmarks.

Collaborator

gribozavr commented Feb 29, 2016

Running benchmarks.

@gribozavr

This comment has been minimized.

Show comment
Hide comment
@gribozavr

gribozavr Feb 29, 2016

Collaborator

I'm seeing >10% improvements for ErrorHandling, NSError, NSStringConversion, and SevenBoom.

@PatrickPijnappel If you have a targeted microbenchmark, feel free to contribute it to a new file under benchmarks.

Collaborator

gribozavr commented Feb 29, 2016

I'm seeing >10% improvements for ErrorHandling, NSError, NSStringConversion, and SevenBoom.

@PatrickPijnappel If you have a targeted microbenchmark, feel free to contribute it to a new file under benchmarks.

gribozavr added a commit that referenced this pull request Feb 29, 2016

Merge pull request #1477 from PatrickPijnappel/patch-3
[stdlib] Rewrite UTF8._isValidUTF8()

@gribozavr gribozavr merged commit 56785e8 into apple:master Feb 29, 2016

2 checks passed

Swift Test Linux Platform Build finished. 7972 tests run, 0 skipped, 0 failed.
Details
Swift Test OS X Platform Build finished. 32052 tests run, 0 skipped, 0 failed.
Details

@PatrickPijnappel PatrickPijnappel deleted the PatrickPijnappel:patch-3 branch Mar 1, 2016

@PatrickPijnappel PatrickPijnappel restored the PatrickPijnappel:patch-3 branch Mar 1, 2016

@PatrickPijnappel PatrickPijnappel deleted the PatrickPijnappel:patch-3 branch Mar 1, 2016

@PatrickPijnappel

This comment has been minimized.

Show comment
Hide comment
@PatrickPijnappel

PatrickPijnappel Mar 1, 2016

Collaborator

@gribozavr Added a UTF-8 benchmark (#1493). I'm in the process of simplifying/optimizing the other parts of UTF-8 decoding so it'll be useful to have a benchmark!

Collaborator

PatrickPijnappel commented Mar 1, 2016

@gribozavr Added a UTF-8 benchmark (#1493). I'm in the process of simplifying/optimizing the other parts of UTF-8 decoding so it'll be useful to have a benchmark!

@PatrickPijnappel PatrickPijnappel referenced this pull request Mar 3, 2016

Merged

[stdlib] Rewrite UTF8 decoding #1525

0 of 1 task complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment