Unexpected behavior of `n_max` argument in `stri_split_fixed` #100

Closed
drammock opened this Issue Sep 22, 2014 · 8 comments

Comments

Projects
None yet
3 participants
@drammock

Consider the following code:

foo <- c("ab_c", "d_ef")
R> stri_split_fixed(foo, "_", n_max=1)
[[1]]
[1] "ab_c"

[[2]]
[1] "d_ef"

I (wrongly) thought that the n_max argument meant "please perform the splitting, and return to me only the first n pieces". So from the above code, I expected [[1]] [1] "ab" [[2]] [1] "d". This is clearly not what is happening.

The documentation for n_max says "integer vector, the maximal number of pieces to return". At the very least, I think this documentation is ambiguous as to the argument's effect.

If the current behavior is the intended behavior, may I ask how I can achieve the result that I expected?

@bartektartanus

This comment has been minimized.

Show comment
Hide comment
@bartektartanus

bartektartanus Sep 22, 2014

Contributor

If you want only the first n pieces you can set n_max=n+1 and then omit the last vector element on each list element. Like this:

lapply(stri_split_fixed(foo, "_", n_max=1+1), function(x) x[-length(x)])
[[1]]
[1] "ab"

[[2]]
[1] "d"

But we will take your suggestion into consideration :)

Thank you!

Contributor

bartektartanus commented Sep 22, 2014

If you want only the first n pieces you can set n_max=n+1 and then omit the last vector element on each list element. Like this:

lapply(stri_split_fixed(foo, "_", n_max=1+1), function(x) x[-length(x)])
[[1]]
[1] "ab"

[[2]]
[1] "d"

But we will take your suggestion into consideration :)

Thank you!

@gagolews

This comment has been minimized.

Show comment
Hide comment
@gagolews

gagolews Sep 22, 2014

Owner

Yes, the documentation is really confusing, @bartektartanus , could you file an update?

@bartektartanus - maybe we'll introduce an omit_last argument, with default=FALSE

if omit_last=TRUE then the behavior we'll be as @drammock suggested

What do you guys think?

Owner

gagolews commented Sep 22, 2014

Yes, the documentation is really confusing, @bartektartanus , could you file an update?

@bartektartanus - maybe we'll introduce an omit_last argument, with default=FALSE

if omit_last=TRUE then the behavior we'll be as @drammock suggested

What do you guys think?

@drammock

This comment has been minimized.

Show comment
Hide comment
@drammock

drammock Sep 22, 2014

an omit_last argument actually wouldn't work for my particular use case. What I have is more like this:

foo <- c("abc_def", "abc_ghi", "bbb", "qrs", "tuv_wxy", "tuv_mno")
# desired output from stri_split_fixed(foo, "_"):
# list("abc", "abc", "bbb", "qrs", "tuv", "tuv")

If you omit_last then presumably you would lose entries for bbb and qrs because they don't contain the split marker at all.

an omit_last argument actually wouldn't work for my particular use case. What I have is more like this:

foo <- c("abc_def", "abc_ghi", "bbb", "qrs", "tuv_wxy", "tuv_mno")
# desired output from stri_split_fixed(foo, "_"):
# list("abc", "abc", "bbb", "qrs", "tuv", "tuv")

If you omit_last then presumably you would lose entries for bbb and qrs because they don't contain the split marker at all.

@drammock

This comment has been minimized.

Show comment
Hide comment
@drammock

drammock Sep 22, 2014

By the way, what I ended up doing was based on the lapply-based solution suggested by @bartektartanus:

foo <- sapply(stri_split_fixed(foo, "_"), function(x) x[1])

By the way, what I ended up doing was based on the lapply-based solution suggested by @bartektartanus:

foo <- sapply(stri_split_fixed(foo, "_"), function(x) x[1])
@gagolews

This comment has been minimized.

Show comment
Hide comment
@gagolews

gagolews Sep 22, 2014

Owner

OK, got it, you need sth like sapply(stri_split_fixed(foo, "_"), function(x) x[1:k]) for given k

so maybe a token_max param should denote that?

btw, we try to implement everything in C++ for performance reasons.

Owner

gagolews commented Sep 22, 2014

OK, got it, you need sth like sapply(stri_split_fixed(foo, "_"), function(x) x[1:k]) for given k

so maybe a token_max param should denote that?

btw, we try to implement everything in C++ for performance reasons.

@drammock

This comment has been minimized.

Show comment
Hide comment
@drammock

drammock Sep 22, 2014

Yes, something like token_max as you've described it would give the
result I wanted. If you're willing to entertain an API change, I suggest
changing the name of n_max to max_splits and calling the new argument
max_tokens.

I did notice that you were implementing in C++, that is why I haven't tried
to fork and submit patches. I can't code in C.
On Sep 22, 2014 8:14 PM, "Marek Gagolewski" notifications@github.com
wrote:

OK, got it, you need sth like sapply(stri_split_fixed(foo, "_"),
function(x) x[1:k]) for given k

so maybe a token_max param should denote that?

btw, we try to implement everything in C++ for performance reasons.


Reply to this email directly or view it on GitHub
#100 (comment).

Yes, something like token_max as you've described it would give the
result I wanted. If you're willing to entertain an API change, I suggest
changing the name of n_max to max_splits and calling the new argument
max_tokens.

I did notice that you were implementing in C++, that is why I haven't tried
to fork and submit patches. I can't code in C.
On Sep 22, 2014 8:14 PM, "Marek Gagolewski" notifications@github.com
wrote:

OK, got it, you need sth like sapply(stri_split_fixed(foo, "_"),
function(x) x[1:k]) for given k

so maybe a token_max param should denote that?

btw, we try to implement everything in C++ for performance reasons.


Reply to this email directly or view it on GitHub
#100 (comment).

@gagolews

This comment has been minimized.

Show comment
Hide comment
@gagolews

gagolews Sep 22, 2014

Owner

no problem, thanks for the suggestion!

@bartektartanus - you've been assigned :)

Owner

gagolews commented Sep 22, 2014

no problem, thanks for the suggestion!

@bartektartanus - you've been assigned :)

@gagolews

This comment has been minimized.

Show comment
Hide comment
@gagolews

gagolews Oct 19, 2014

Owner

work ongoing: tokens_only param added, it defaults to FALSE for backward compatibility (+ compatibility with stringr)

> stri_split_fixed("a_b_c_d", "_")
[[1]]
[1] "a" "b" "c" "d"

> stri_split_fixed("a_b_c__d", "_")
[[1]]
[1] "a" "b" "c" ""  "d"

> stri_split_fixed("a_b_c__d", "_", omit_empty=TRUE)
[[1]]
[1] "a" "b" "c" "d"

> stri_split_fixed("a_b_c__d", "_", n_max=2, tokens_only=FALSE) # "a" & remainder
[[1]]
[1] "a"      "b_c__d"

> stri_split_fixed("a_b_c__d", "_", n_max=2, tokens_only=TRUE) # "a" & "b" only
[[1]]
[1] "a" "b"

> stri_split_fixed("a_b_c__d", "_", n_max=4, omit_empty=TRUE, tokens_only=TRUE)
[[1]]
[1] "a" "b" "c" "d"

> stri_split_fixed("a_b_c__d", "_", n_max=4, omit_empty=FALSE, tokens_only=TRUE)
[[1]]
[1] "a" "b" "c" "" 

> stri_split_fixed(c("ab_c", "d_ef_g", "h", ""), "_", n_max=1, tokens_only=TRUE, omit_empty=TRUE)
[[1]]
[1] "ab"

[[2]]
[1] "d"

[[3]]
[1] "h"

[[4]]
character(0)

> stri_split_fixed(c("ab_c", "d_ef_g", "h", ""), "_", n_max=2, tokens_only=TRUE, omit_empty=TRUE)
[[1]]
[1] "ab" "c" 

[[2]]
[1] "d"  "ef"

[[3]]
[1] "h"

[[4]]
character(0)

> stri_split_fixed(c("ab_c", "d_ef_g", "h", ""), "_", n_max=3, tokens_only=TRUE, omit_empty=TRUE)
[[1]]
[1] "ab" "c" 

[[2]]
[1] "d"  "ef" "g" 

[[3]]
[1] "h"

[[4]]
character(0)
Owner

gagolews commented Oct 19, 2014

work ongoing: tokens_only param added, it defaults to FALSE for backward compatibility (+ compatibility with stringr)

> stri_split_fixed("a_b_c_d", "_")
[[1]]
[1] "a" "b" "c" "d"

> stri_split_fixed("a_b_c__d", "_")
[[1]]
[1] "a" "b" "c" ""  "d"

> stri_split_fixed("a_b_c__d", "_", omit_empty=TRUE)
[[1]]
[1] "a" "b" "c" "d"

> stri_split_fixed("a_b_c__d", "_", n_max=2, tokens_only=FALSE) # "a" & remainder
[[1]]
[1] "a"      "b_c__d"

> stri_split_fixed("a_b_c__d", "_", n_max=2, tokens_only=TRUE) # "a" & "b" only
[[1]]
[1] "a" "b"

> stri_split_fixed("a_b_c__d", "_", n_max=4, omit_empty=TRUE, tokens_only=TRUE)
[[1]]
[1] "a" "b" "c" "d"

> stri_split_fixed("a_b_c__d", "_", n_max=4, omit_empty=FALSE, tokens_only=TRUE)
[[1]]
[1] "a" "b" "c" "" 

> stri_split_fixed(c("ab_c", "d_ef_g", "h", ""), "_", n_max=1, tokens_only=TRUE, omit_empty=TRUE)
[[1]]
[1] "ab"

[[2]]
[1] "d"

[[3]]
[1] "h"

[[4]]
character(0)

> stri_split_fixed(c("ab_c", "d_ef_g", "h", ""), "_", n_max=2, tokens_only=TRUE, omit_empty=TRUE)
[[1]]
[1] "ab" "c" 

[[2]]
[1] "d"  "ef"

[[3]]
[1] "h"

[[4]]
character(0)

> stri_split_fixed(c("ab_c", "d_ef_g", "h", ""), "_", n_max=3, tokens_only=TRUE, omit_empty=TRUE)
[[1]]
[1] "ab" "c" 

[[2]]
[1] "d"  "ef" "g" 

[[3]]
[1] "h"

[[4]]
character(0)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment