Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

$tf->distance() corrupts source data in 2d-array with utf8 strings #17

Closed
internationils opened this issue Jul 9, 2017 · 11 comments
Closed

Comments

@internationils
Copy link

Hello, I have a really weird problem where $tf->distance() corrupts source data. This is caused by a really strange combination of factors, and tough to reproduce. Factors:

  • only for some UTF8 strings
  • no bug if Text::Fuzzy::PP is used (clean, without utf8 error messages)
  • no output bug if dclone is used and $tf->distance() is applied to the 2d array copy (utf8 error messages will still show up)
  • only in a 2d-array (1d array worked with the same input..., probably something with references to the 2nd dimension, hence the dclone helps)

conso-BUG.txt

@internationils
Copy link
Author

conso-BUG.pl.zip
...in case the .txt is unreadable

@benkasminbullock
Copy link
Owner

benkasminbullock commented Jul 9, 2017 via email

@benkasminbullock
Copy link
Owner

benkasminbullock commented Jul 9, 2017 via email

@benkasminbullock
Copy link
Owner

benkasminbullock commented Jul 9, 2017 via email

@internationils
Copy link
Author

$ uname -a
Linux blackbox 4.10.0-22-generic #24-Ubuntu SMP Mon May 22 17:43:20 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
$ perl --version
This is perl 5, version 24, subversion 1 (v5.24.1) built for x86_64-linux-gnu-thread-multi
(with 67 registered patches, see perl -V for more detail)

Text::Fuzzy is up to date. (0.25)
...I'll wait until it shows up, then test again.

@benkasminbullock
Copy link
Owner

benkasminbullock commented Jul 9, 2017 via email

@internationils
Copy link
Author

internationils commented Jul 9, 2017

I can confirm the fix:

  • diff shows the ::PP and plain Text::Fuzzy versions giving the same result
  • file shows output.csv: UTF-8 Unicode text, with very long lines... before it showed "data"
  • no more "wide character" etc. errors
    Thanks a lot for the quick response! That bug was such an extreme edge case with so many factors needed to make it happen, it cost me a few hours to reproduce with small code and track down.
    Will you release this as 0.26 now? Might save someone a few hours...

@benkasminbullock
Copy link
Owner

The bug is quite easy to reproduce with a lot smaller code than yours once you know where it is:

#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use lib '/home/ben/projects/text-fuzzy/blib/lib';
use lib '/home/ben/projects/text-fuzzy/blib/arch';
use Text::Fuzzy;
use Test::More;
my $tf = Text::Fuzzy->new ("fu");
my $guff = "fü";
my $guffcopy = $guff;
$tf->distance ($guffcopy);
ok ($guff eq $guffcopy);
done_testing ();

@benkasminbullock
Copy link
Owner

The bug probably won't occur again, but just for safety's sake I've added the above as a test in the new release, 5c87200. This should be going on to CPAN in about eight hours, assuming that 0.25_02 gets through CPAN testers without too many issues.

@internationils
Copy link
Author

Great, glad you got to see it as well. Weird thing is that in most cases (I have a german input file with 19k lines) it works, and just seemed to trigger in some edge cases. Even modifying the string in my testcase by a character made it work again...

@benkasminbullock
Copy link
Owner

It is due to a shortcut made which only occurs when the string given to Text::Fuzzy->new ($ascii) is ascii and the matched string in $tf->distance ($unicode) is Unicode-marked. The shortcut eliminates all the wide characters and replaces them with unmatchable characters. That was not meant to be done to the actual string though. A simple use of "const" in the C code would have prevented this error.

http://blogs.perl.org/users/ben_bullock/2017/07/always-use-const-char-to-refer-to-the-return-value-from-svpv.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants