New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

$tf->distance() corrupts source data in 2d-array with utf8 strings #17

Closed
internationils opened this Issue Jul 9, 2017 · 11 comments

Comments

Projects
None yet
2 participants
@internationils

internationils commented Jul 9, 2017

Hello, I have a really weird problem where $tf->distance() corrupts source data. This is caused by a really strange combination of factors, and tough to reproduce. Factors:

  • only for some UTF8 strings
  • no bug if Text::Fuzzy::PP is used (clean, without utf8 error messages)
  • no output bug if dclone is used and $tf->distance() is applied to the 2d array copy (utf8 error messages will still show up)
  • only in a 2d-array (1d array worked with the same input..., probably something with references to the 2nd dimension, hence the dclone helps)

conso-BUG.txt

@internationils

This comment has been minimized.

Show comment
Hide comment
@internationils

internationils Jul 9, 2017

conso-BUG.pl.zip
...in case the .txt is unreadable

internationils commented Jul 9, 2017

conso-BUG.pl.zip
...in case the .txt is unreadable

@benkasminbullock

This comment has been minimized.

Show comment
Hide comment
@benkasminbullock

benkasminbullock Jul 9, 2017

Owner
Owner

benkasminbullock commented Jul 9, 2017

@benkasminbullock

This comment has been minimized.

Show comment
Hide comment
@benkasminbullock

benkasminbullock Jul 9, 2017

Owner
Owner

benkasminbullock commented Jul 9, 2017

@benkasminbullock

This comment has been minimized.

Show comment
Hide comment
@benkasminbullock

benkasminbullock Jul 9, 2017

Owner
Owner

benkasminbullock commented Jul 9, 2017

@internationils

This comment has been minimized.

Show comment
Hide comment
@internationils

internationils Jul 9, 2017

$ uname -a
Linux blackbox 4.10.0-22-generic #24-Ubuntu SMP Mon May 22 17:43:20 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
$ perl --version
This is perl 5, version 24, subversion 1 (v5.24.1) built for x86_64-linux-gnu-thread-multi
(with 67 registered patches, see perl -V for more detail)

Text::Fuzzy is up to date. (0.25)
...I'll wait until it shows up, then test again.

internationils commented Jul 9, 2017

$ uname -a
Linux blackbox 4.10.0-22-generic #24-Ubuntu SMP Mon May 22 17:43:20 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
$ perl --version
This is perl 5, version 24, subversion 1 (v5.24.1) built for x86_64-linux-gnu-thread-multi
(with 67 registered patches, see perl -V for more detail)

Text::Fuzzy is up to date. (0.25)
...I'll wait until it shows up, then test again.

@benkasminbullock

This comment has been minimized.

Show comment
Hide comment
@benkasminbullock

benkasminbullock Jul 9, 2017

Owner
Owner

benkasminbullock commented Jul 9, 2017

@internationils

This comment has been minimized.

Show comment
Hide comment
@internationils

internationils Jul 9, 2017

I can confirm the fix:

  • diff shows the ::PP and plain Text::Fuzzy versions giving the same result
  • file shows output.csv: UTF-8 Unicode text, with very long lines... before it showed "data"
  • no more "wide character" etc. errors
    Thanks a lot for the quick response! That bug was such an extreme edge case with so many factors needed to make it happen, it cost me a few hours to reproduce with small code and track down.
    Will you release this as 0.26 now? Might save someone a few hours...

internationils commented Jul 9, 2017

I can confirm the fix:

  • diff shows the ::PP and plain Text::Fuzzy versions giving the same result
  • file shows output.csv: UTF-8 Unicode text, with very long lines... before it showed "data"
  • no more "wide character" etc. errors
    Thanks a lot for the quick response! That bug was such an extreme edge case with so many factors needed to make it happen, it cost me a few hours to reproduce with small code and track down.
    Will you release this as 0.26 now? Might save someone a few hours...
@benkasminbullock

This comment has been minimized.

Show comment
Hide comment
@benkasminbullock

benkasminbullock Jul 9, 2017

Owner

The bug is quite easy to reproduce with a lot smaller code than yours once you know where it is:

#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use lib '/home/ben/projects/text-fuzzy/blib/lib';
use lib '/home/ben/projects/text-fuzzy/blib/arch';
use Text::Fuzzy;
use Test::More;
my $tf = Text::Fuzzy->new ("fu");
my $guff = "fü";
my $guffcopy = $guff;
$tf->distance ($guffcopy);
ok ($guff eq $guffcopy);
done_testing ();
Owner

benkasminbullock commented Jul 9, 2017

The bug is quite easy to reproduce with a lot smaller code than yours once you know where it is:

#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use lib '/home/ben/projects/text-fuzzy/blib/lib';
use lib '/home/ben/projects/text-fuzzy/blib/arch';
use Text::Fuzzy;
use Test::More;
my $tf = Text::Fuzzy->new ("fu");
my $guff = "fü";
my $guffcopy = $guff;
$tf->distance ($guffcopy);
ok ($guff eq $guffcopy);
done_testing ();
@benkasminbullock

This comment has been minimized.

Show comment
Hide comment
@benkasminbullock

benkasminbullock Jul 10, 2017

Owner

The bug probably won't occur again, but just for safety's sake I've added the above as a test in the new release, 5c87200. This should be going on to CPAN in about eight hours, assuming that 0.25_02 gets through CPAN testers without too many issues.

Owner

benkasminbullock commented Jul 10, 2017

The bug probably won't occur again, but just for safety's sake I've added the above as a test in the new release, 5c87200. This should be going on to CPAN in about eight hours, assuming that 0.25_02 gets through CPAN testers without too many issues.

@internationils

This comment has been minimized.

Show comment
Hide comment
@internationils

internationils Jul 10, 2017

Great, glad you got to see it as well. Weird thing is that in most cases (I have a german input file with 19k lines) it works, and just seemed to trigger in some edge cases. Even modifying the string in my testcase by a character made it work again...

internationils commented Jul 10, 2017

Great, glad you got to see it as well. Weird thing is that in most cases (I have a german input file with 19k lines) it works, and just seemed to trigger in some edge cases. Even modifying the string in my testcase by a character made it work again...

@benkasminbullock

This comment has been minimized.

Show comment
Hide comment
@benkasminbullock

benkasminbullock Jul 10, 2017

Owner

It is due to a shortcut made which only occurs when the string given to Text::Fuzzy->new ($ascii) is ascii and the matched string in $tf->distance ($unicode) is Unicode-marked. The shortcut eliminates all the wide characters and replaces them with unmatchable characters. That was not meant to be done to the actual string though. A simple use of "const" in the C code would have prevented this error.

http://blogs.perl.org/users/ben_bullock/2017/07/always-use-const-char-to-refer-to-the-return-value-from-svpv.html

Owner

benkasminbullock commented Jul 10, 2017

It is due to a shortcut made which only occurs when the string given to Text::Fuzzy->new ($ascii) is ascii and the matched string in $tf->distance ($unicode) is Unicode-marked. The shortcut eliminates all the wide characters and replaces them with unmatchable characters. That was not meant to be done to the actual string though. A simple use of "const" in the C code would have prevented this error.

http://blogs.perl.org/users/ben_bullock/2017/07/always-use-const-char-to-refer-to-the-return-value-from-svpv.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment