Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data analysts should be able to reverse strings using Text.reverse #3377

Merged
merged 4 commits into from Apr 5, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
2 changes: 2 additions & 0 deletions CHANGELOG.md
Expand Up @@ -96,6 +96,7 @@
- [Added warning handling to `Table.aggregate`][3349]
- [Improved performance of `Table.aggregate` and full warnings implementation]
[3364]
- [Implemented `Text.reverse`][3377]

[debug-shortcuts]:
https://github.com/enso-org/enso/blob/develop/app/gui/docs/product/shortcuts.md#debug
Expand Down Expand Up @@ -144,6 +145,7 @@
[3349]: https://github.com/enso-org/enso/pull/3349
[3361]: https://github.com/enso-org/enso/pull/3361
[3364]: https://github.com/enso-org/enso/pull/3364
[3377]: https://github.com/enso-org/enso/pull/3377
[3366]: https://github.com/enso-org/enso/pull/3366

#### Enso Compiler
Expand Down
Expand Up @@ -25,6 +25,7 @@ export Standard.Base.Data.Text.Line_Ending_Style

polyglot java import com.ibm.icu.lang.UCharacter
polyglot java import com.ibm.icu.text.BreakIterator
polyglot java import java.lang.StringBuilder
polyglot java import org.enso.base.Text_Utils

## UNSTABLE
Expand Down Expand Up @@ -58,6 +59,28 @@ Text.length =
@Tail_Call count (accum + 1) iterator.next
count 0 iterator.next

## Returns a new `Text` object with the characters in the reverse order of the input.

! What is a Character?
A character is defined as an Extended Grapheme Cluster, see Unicode
Standard Annex 29. This is the smallest unit that still has semantic
meaning in most text-processing applications.

> Example
Reverse the text "Hello, world!".

"Hello, world!".reverse
Text.reverse : Text
Text.reverse =
reverseStringBuilder = StringBuilder.new this.length
iterator = BreakIterator.getCharacterInstance
iterator.setText this

iterate prev next = if next == -1 then reverseStringBuilder.toString else
reverseStringBuilder.append (Text_Utils.substring this next prev)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a StringBuilder works, but I'm wondering if using our + wouldn't be more idiomatic - for example our Vector.join does in fact use +.

Performance-wise it should be quite comparable, because our + uses ropes under the hood to achieve O(1) amortized append time. Still, probably StringBuilder has a bit less of an overhead, so likely still a bit faster.

So I'm not really saying to change it - because this definitely works and is fast, but wondering what best practices we should adopt - because I think it would be good to have consistent approach to this between this function, Vector.join and possibly other ones.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Text ropes cache collapsed strings - see ToJavaStringNode::inplaceFlatten.

As for Vector.join - I'm talking about the function which concatenates a vector of texts into a single text, did you mean Vector.+ instead?

Copy link
Contributor Author

@indiv0 indiv0 Apr 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idiomatically, + does seem like the right choice.

When I was implementing this, I actually did compare both approaches and the + was ~2-3x slower for a 10000 length string. Though that may be due to me making some other error in the implementation. I can re-draft that benchmark and include it in the PR if you'd like to compare for yourself. I didn't include the benchmark in the commit because it didn't seem like this is something likely to change, performance-wise, over time.

Not sure if I follow the Vector.join and Vector.+ comments.

StringBuilder impl:

Text.reverse 10000 characters/iteration:0: 595.53ms                                                                                                                                     
Text.reverse 10000 characters/iteration:1: 405.00ms                                                                                                                                     
Text.reverse 10000 characters/iteration:2: 402.30ms                                                                                                                                     
Text.reverse 10000 characters/iteration:3: 402.01ms                                                                                                                                     
Text.reverse 10000 characters/iteration:4: 400.68ms                                                                                                                                     
Text.reverse 10000 characters/iteration:5: 401.56ms                                                                                                                                     
Text.reverse 10000 characters/iteration:6: 409.98ms                                                                                                                                     
Text.reverse 10000 characters/iteration:7: 417.73ms                                                                                                                                     
Text.reverse 10000 characters/iteration:8: 408.80ms                                                                                                                                     
Text.reverse 10000 characters/iteration:9: 400.80ms 

+ impl:

Text.reverse 10000 characters/iteration:0: 1083.13ms                                        
Text.reverse 10000 characters/iteration:1: 1274.55ms                                        
Text.reverse 10000 characters/iteration:2: 1267.65ms                                        
Text.reverse 10000 characters/iteration:3: 1195.36ms                                        
Text.reverse 10000 characters/iteration:4: 1140.85ms                                        
Text.reverse 10000 characters/iteration:5: 1167.93ms                                        
Text.reverse 10000 characters/iteration:6: 1162.50ms                                        
Text.reverse 10000 characters/iteration:7: 1199.97ms                                        
Text.reverse 10000 characters/iteration:8: 1203.35ms                                        
Text.reverse 10000 characters/iteration:9: 1242.65ms

Naive (StringBuilder.new this).reverse (not really a fair comparison because not extended-grapheme-cluster-aware):

Text.reverse 10000 characters/iteration:0: 6.57ms                                           
Text.reverse 10000 characters/iteration:1: 3.47ms                                           
Text.reverse 10000 characters/iteration:2: 3.96ms                                           
Text.reverse 10000 characters/iteration:3: 3.45ms                                           
Text.reverse 10000 characters/iteration:4: 4.05ms                                           
Text.reverse 10000 characters/iteration:5: 4.81ms                                           
Text.reverse 10000 characters/iteration:6: 6.69ms                                           
Text.reverse 10000 characters/iteration:7: 6.56ms                                           
Text.reverse 10000 characters/iteration:8: 7.33ms                                           
Text.reverse 10000 characters/iteration:9: 8.02ms

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@radeusgd ping ^

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, then I think you are right that it may be better to keep the stringbuilder. I guess the difference is because the ropes still have some more allocation overhead.

As for Vector.join and Vector.+ - no worry it was a reply to a later deleted comment, please ignore 😅

@Tail_Call iterate next iterator.previous
iterate iterator.last iterator.previous

## Applies the provided `function` to each character in `this`.

Arguments:
Expand Down
11 changes: 11 additions & 0 deletions test/Tests/src/Data/Text_Spec.enso
Expand Up @@ -496,6 +496,17 @@ spec =
str.at -6 . should_fail_with Index_Out_Of_Bounds_Error
str.at 5 . should_fail_with Index_Out_Of_Bounds_Error

Test.specify "should be able to reverse characters" <|
"Hello World!".reverse . should_equal "!dlroW olleH"

"".reverse . should_equal ""
'e\u{301}'.reverse . should_equal 'e\u{301}'
'e\u{301}\u00E9'.reverse . should_equal '\u00E9e\u{301}'
'e\u{321}\u{360}'.reverse . should_equal 'e\u{321}\u{360}'
'Iñtërnâtiônàlizætiøn☃💩'.reverse . should_equal '💩☃nøitæzilànôitânrëtñI'
'ほげほげ'.reverse . should_equal 'げほげほ'
'\u{10000}'.reverse . should_equal '\u{10000}'

Test.specify "should allow to iterate over characters" <|
str = kshi + accent_1 + accent_2 + 'abc'
builder = Vector.new_builder
Expand Down