Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

C-only version of substr_esc #13

Closed
brodieG opened this issue Jan 16, 2018 · 2 comments
Closed

C-only version of substr_esc #13

brodieG opened this issue Jan 16, 2018 · 2 comments
Milestone

Comments

@brodieG
Copy link
Owner

brodieG commented Jan 16, 2018

currently ansi_substr2 does a lot of the work in R.

@brodieG brodieG changed the title C-only version of substr_csi C-only version of substr_esc Feb 10, 2018
@brodieG brodieG modified the milestone: 0.20 Feb 10, 2018
@brodieG
Copy link
Owner Author

brodieG commented Feb 10, 2018

There are some complications here b/c we'd like to minimize how many times we scan through each unique string, but we also don't necessarily want to allocate and store states at every selected cut point. This is probably what we should do when we get to handling the issue:

  1. Actually confirm that we can get a substantial performance improvement of substr by testing taking substrings of a long string repeatedly; if we can't then there really isn't much of a benefit as with substrings of any substantial size the bulk of the current computation in is substr anyway.
  2. Compute how many ESC sequences there are between the lowest cut-point and the highest cut-point
    • We will store these ESC sequences with their states
    • What do we do about sequential ESC sequences? Ideally we only store one entry per sequence of ESC sequences, but that means we have to parse each sequence on the first pass.
    • This is all under the presumption that there are fewer ESC sequences than distinct cut points
    • Additionally, one big issue with this is that it only works well for byte encoded strings, as we still have to find the byte that corresponds to a particular width or character between two ESC sequences. THIS COULD BE A DEAL BREAKER.
  3. Once we have the recorded points, we can binary search for them since we'll presumably store them sorted.
    • We can even keep the most recent search path for beginning and end cut points as those should only be log(N) long and we can possibly re-use them with the subsequent cut points to reduce how many hops we have to do

Alternate:

  1. Compute state at every cut point
    • TBD whether we need different states for starting and ending cut points

@brodieG
Copy link
Owner Author

brodieG commented Jul 4, 2021

There does not seem to be enough juice in this to warrant the effort.

@brodieG brodieG closed this as completed Jul 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant