Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

utf8_string::get_num_codepoints returns number of bytes under certain circumstances with ::lut_active #16

Closed
vadim-berman opened this issue Aug 26, 2018 · 12 comments
Labels

Comments

@vadim-berman
Copy link

vadim-berman commented Aug 26, 2018

Hi Jakob,

Looks like the part of get_num_codepoints that uses lut_active has issues under some circumstances. I only could fix it by disabling the entire part altogether.

Here is the example:

utf8_string bad_find = u8"S’pore Starbucks selling. gorgeous";
utf8_string substr = u8".";
std::cout << bad_find.find(substr, 0);
std::cout.flush();

The . character is at position 24. The return value is 26 due to ’ being a "compound" codepoint.

Best regards,
Vadim

@DuffsDevice
Copy link
Owner

Hi Vadim,

I currently have little time to have a look, but I will, as soon as I do.
Hopefully this will be in a week or so.Thank you for providing a minimal example, that helps a lot!

All the Best
Jakob

@vadim-berman
Copy link
Author

Thanks, Jakob!

@DuffsDevice
Copy link
Owner

Does it work, if you replace line 736 in tinyutf8.cpp

// Iterate over relevant multibyte indices
while( lut_iter > lut_begin ){

with

while( lut_iter >= lut_begin ){

?
I'm pretty sure, that's the bug, but don't have time to try it out right now.

I look forward to hearing from you!
Jakob

@vadim-berman
Copy link
Author

Hi Jakob,

Now I'm unavailable (relocating) :) - sorry, will take me a day or two to get to that. I will let you know though.

Best regards, Vadim

@vadim-berman
Copy link
Author

Yes, it works with my example! Thanks a lot for fixing it.

Best regards, Vadim

DuffsDevice added a commit that referenced this issue Sep 14, 2018
DuffsDevice added a commit that referenced this issue Sep 14, 2018
Fixed bug in get_num_bytes, if lut is active (#16)
@vadim-berman
Copy link
Author

vadim-berman commented Jun 27, 2019

Hi Jakob,

Long time no talk :) !

Sadly, the same thing strikes again, this time, with a longer string but again the multibytes are the culprits. Here is an example:

    utf8_string anotherFindBug = "who were engaged in “recent narcotics trafficking activity.” Those two individuals were of Newport Beach.";
    utf8_string dot = ".";
    std::cout << anotherFindBug.find(dot, 60);

It is supposed to return 104. It returns 102. If you replace the smart quotes by ", then it's all correct.

The workaround for me is to comment out the block starting at the line 2701:

if( utf8_string::is_lut_active( lut_iter ) )
...

@DuffsDevice
Copy link
Owner

Hi Vadim!

Indeed, I hope your doing well!
As I am currently not at home, I haven’t tried out the issue yet. Just to make sure, are you aware, that find returns the number of the Codepoint, not the Byte index? I ask, because In case of question, the lower is more likely to be correct (as the number of codepoints is <= the number of bytes).

Cheers,
Jakob

@vadim-berman
Copy link
Author

Hi Jakob,

I'm good, thanks, hope you're also doing well.

Yes indeed, the issue is now different, and it's definitely not the number of bytes, but the number of codepoints is absolutely incorrect. I naturally assumed the issue was in my code before zeroing in on the issue. I don't know where the number comes from, but there's definitely an issue in this part. Run the example, and you'll see.

Best regards, Vadim

@DuffsDevice DuffsDevice reopened this Jul 5, 2019
@DuffsDevice
Copy link
Owner

I had a look to the code and it was a small ">" to ">=" issue 😃
Thanks a lot for pointing me to it, I hope the issue was not blocking you. I'll commit in a second...

@DuffsDevice
Copy link
Owner

Does it work for you now as well?

@vadim-berman
Copy link
Author

Sorry about the belated reply, Jakob! (Crazy weeks, then I forgot about it.)

I checked and indeed it works now. Thank you very much for the quick resolution.

@DuffsDevice
Copy link
Owner

No probs! You're very welcome!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants