Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added support for splitting on ReadOnlySpan<char> #295

Merged
merged 24 commits into from
Jun 10, 2020

Conversation

bbartels
Copy link
Contributor

@bbartels bbartels commented Nov 26, 2019

This implementation complies with the specifications discussed in: #934

I tried to add the tests I wrote (adapted from original StringSplitTests), but I was not sure which directory pertains to CoreLib tests.
My best guess is src/libraries/System.Runtime/tests/System, but would like someone to confirm.

This implementation uses multiple types to achieve splitting by a single- and multiple chars. This results in almost identical code in the two types, but could be mitigated by merging the two implementations. I am not sure which implementation is preferred in this scenario.

I need help with the following:

  • Review Implementation
  • Add Tests to correct directory
  • Decide on a merged or separate Enumerator types to handle splitting on an element/sequence

@danmoseley
Copy link
Member

@ahsonkhan

@Wraith2
Copy link
Contributor

Wraith2 commented Dec 17, 2019

Is the input span to split required to be a single item of n length or n items of 1 length? string split has both single item and array of items overloads which are commonly used. So for example if i call this new method with new ROS(@"\r\n") i think it's going to just look for that literal value as a single item of 2 length, and in that case what input format would be used for splitting on multiple selected whitespace chars equivalent to new []{'\t','\n','\r',\ '}?

@bbartels
Copy link
Contributor Author

@Wraith2 You are correct, it will just treat it as one single value to split on. The reason for that is because when the API was reviewed, the team decided to not include a way to split on multiple chars. So I decided not to include it here.

See 'SplitAny()' in #934 (comment)

@stephentoub
Copy link
Member

@bbartels, you have this marked WIP. Are you still working on it? It hasn't seen progress in over a month. Thanks.

@bbartels
Copy link
Contributor Author

There is not much I can currently do before someone reviews and/or addresses the points in the Todo above. Maybe I should have made that a little clearer (and possibly omitted WIP as it may not be the appropriate label for this PR).

@stephentoub
Copy link
Member

stephentoub commented Jan 15, 2020

I see, ok, thanks. Yes, the way the PR reads right now, it appears you're saying there's nothing for anyone to do until you do more work on it (at least that's how I read it).

@stephentoub stephentoub changed the title [WIP] Added support for splitting on ReadOnlySpan<char> Added support for splitting on ReadOnlySpan<char> Jan 15, 2020
@bbartels
Copy link
Contributor Author

My mistake, could have sworn I made it clear enough, but upon rereading I am unsure myself what conclusion to make!

@danmoseley
Copy link
Member

Adding area owners. Apologies for the delay.

Copy link
Contributor

@layomia layomia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@layomia layomia added this to the 5.0 milestone May 21, 2020
Copy link
Contributor Author

@bbartels bbartels left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@layomia Finished merging the implementations, as well as added tests to the specified directory. There are a couple points I need feedback on below.

private readonly ReadOnlySpan<T> _separators;
private readonly T _separator;
private readonly bool _isSequence;
private readonly int _separatorLength;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be replaced with a cleaner:

private int SeparatorLength => _isSequence ? _separators.Length : 1;

However, according to my benchmarks it will incur a 3% perf penalty on the tested scenario.
What would be preferred in this case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current way is preferred to avoid the extra calculation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you share your benchmarks source/results for this PR?

@bbartels
Copy link
Contributor Author

bbartels commented Jun 4, 2020

For some reason I cannot comment on 'outdated' review comments, not sure if this is by design.
I've changed up the implementation to pass all the tests and make the the calculation easier to follow. I hope this resolves the concerns.
There are two review comments I need clarification on before I can proceed (see above).

@danmoseley
Copy link
Member

Trailing whitespace 😢

@bbartels
Copy link
Contributor Author

bbartels commented Jun 4, 2020

@danmosemsft Oh my, no clue how they managed to sneak in. Should be fixed now!

@layomia
Copy link
Contributor

layomia commented Jun 10, 2020

Overall, LGTM.

There are two review comments I need clarification on before I can proceed

@bbartels, can you highlight the comments? All my feedback seems to be addressed.

@bbartels
Copy link
Contributor Author

bbartels commented Jun 10, 2020

@layomia I think one of them was #295 (comment), but honestly completely forgot what the other one was. Probably should have noted it down 😅, don't think it was too important thought!

EDIT: So it turns out I never pressed Submit Review and my comments weren't published...
Still a bit unfamiliar with the github review system.

@layomia
Copy link
Contributor

layomia commented Jun 10, 2020

I think one of them was #295 (comment),

Ah. That's resolved - #295 (comment).

@bbartels
Copy link
Contributor Author

bbartels commented Jun 10, 2020

Okay, then there are no further objections from me :) Thank you for the guidance throughout the review, was a good learning experience!

Copy link
Contributor

@layomia layomia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot, @bbartels!

It would be great if you could add your benchmarks to the dotnet/performance repo here. This doc goes over how to write/run benchmarks for new API. Here's an example PR for new API in System.Text.Json - dotnet/performance#1130.

@layomia layomia merged commit 78ed8e8 into dotnet:master Jun 10, 2020
@danmoseley
Copy link
Member

I wonder how we could find places we could potentially use this ourselves in this repo. Perhaps someone out there's interested in looking.

@davidfowl
Copy link
Member

Finally!!!!

@bbartels
Copy link
Contributor Author

bbartels commented Jun 10, 2020

@layomia Will do, should have some time tomorrow!
The doc link just links to the same place as your first link. Is it just this?

@layomia
Copy link
Contributor

layomia commented Jun 10, 2020

@bbartels

Will do, should have some time tomorrow!

Nice!

If you have some more time, an additional way to widen the impact of this new API is this task - #295 (comment). I imagine this would involve searching this repo for usages of String.Split(), figuring out if the new API can be used instead, and opening a PR with any changes.

The doc link just links to the same place as your first link. Is it just this?

I meant this - https://github.com/dotnet/performance/blob/2c15891e126a397fed9b3c34120228ae21b1ea1f/docs/benchmarking-workflow-dotnet-runtime.md#benchmarking-new-api 😅

@GrabYourPitchforks
Copy link
Member

GrabYourPitchforks commented Jun 11, 2020

Out of curiosity, why is the parameterless Split routing splitting only on the space character (' ')? This differs from string.Split(), which splits on all whitespace, including tabs and line separators.

This also speaks to @danmosemsft's earlier comment. We need to be careful to account for any behavioral differences between this and string.Split before we go changing all of the internal call sites to use this. Otherwise we risk introducing a regression.

It's still a worthwhile exercise though! :)

@bbartels
Copy link
Contributor Author

@GrabYourPitchforks I faintly remember watching the API review on YouTube and there being arguments against introducing anything that has multiple seperators (which would be needed to match string°Split semantics). Though I also remember that there was discussion about sending this back to API Review once the implementation is done. Maybe that's not such a bad idea, to decide on Split() semantics, as it would require a constructor taking a ReadOnlySpan argument for multiple separators,which was previously postponed for UTF-8 reasons.

@layomia
Copy link
Contributor

layomia commented Jun 11, 2020

why is the parameterless Split routing splitting only on the space character (' ')?

What I understand from watching the API review discussion is that, for simplicity, the Split methods should be used to split on single separators, while SplitAny could be added in the future for multiple separators. So, the parameterless overload for Split would split on a single space character, while the parameterless SplitAny would be for any whitespace token. What are your thoughts on this? @GrabYourPitchforks

Another area where the current implementation differs from String.Split is the handling of zero-length target strings. I've opened this PR to reconcile them - #37742.

@stephentoub
Copy link
Member

The parameterless Split should split on anything for which char.IsWhitespace returns true. There's nothing fundamentally wrong with that behavior for string, and any such subtle differences like that from string are going to lead to bugs.

@bbartels
Copy link
Contributor Author

bbartels commented Jun 11, 2020

I feel like there is no way around another API Review, unless a constructor that does not allow for splitting on multiple separators is introduced, e.g. SpanSplitEnumerator(bool splitOnWhiteSpace).

The discussion about semantics of a SplitAny(ReadOnlySpan<char> input, ReadOnlySpan<char> separators) have purposefully been delayed as per this.

@stephentoub
Copy link
Member

stephentoub commented Jun 11, 2020

I feel like there is no way around another API Review, unless a constructor that does not allow for splitting on multiple separators is introduced

Why? If you want to split on just ' ', you can do so just as with string by passing that to split. If you want to split on all whitespace, you can do that just as with string by not passing any parameters.

@bbartels
Copy link
Contributor Author

bbartels commented Jun 11, 2020

Why? If you want to split on just ' ', you can do so just as with string by passing that to split. If you want to split on all whitespace, you can do that just as with string by not passing any parameters.

Oh, my bad, I forgot the SpanSplitEnumerator constructors were never made public. I was working with that assumption. Otherwise a consumer could have instantiated a SpanSplitEnumerator with multiple separators, the behaviour of which has not been decided yet.

@bbartels
Copy link
Contributor Author

Moving discussion regarding ROS.Split() here: #37746

layomia added a commit to layomia/dotnet_runtime that referenced this pull request Jun 11, 2020
layomia added a commit that referenced this pull request Jun 11, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 11, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants