$tf->distance() corrupts source data in 2d-array with utf8 strings #17
conso-BUG.pl.zip |
Thank you for your bug report.
I was unable to reproduce any bug with the supplied script. The output I
get is this:
$ hexdump -C output.csv
00000000  22 69 6e 64 65 78 22 2c  22 73 74 72 65 65 74 30  |"index","street0|
00000010  22 0a 22 30 22 2c 22 72  30 22 0a 22 31 22 2c 22  |"."0","r0"."1","|
00000020  66 c3 bc 20 6b 65 20 42  c3 bc 22 0a 22 32 22 2c  |f.. ke B.."."2",|
00000030  22 72 32 22 0a 22 33 22  2c 22 72 33 22 0a        |"r2"."3","r3".|
0000003e
As far as I can see there is nowhere within "distance" where it writes into
the string.
Can you please confirm that you are using the latest version of Text::Fuzzy,
and tell me which version of Perl you are running?
Thanks
…On 9 July 2017 at 19:11, Nils Boeffel ***@***.***> wrote:
conso-BUG.pl.zip
<https://github.com/benkasminbullock/Text-Fuzzy/files/1133663/conso-BUG.pl.zip>
...in case the .txt is unreadable
|
Further to this, I went through all the code adding "const" and found that
in the case that the search term is not utf8 and the searched-for word is
utf8, Text::Fuzzy is currently writing into the PV, which it definitely
should not be doing. I'm working on a fix for this and will send another
message when it's ready.
|
I've uploaded the new version to CPAN with the version number 0.25_01. Can
you please download that version from the following link and check if it
solves the problem:
https://cpan.metacpan.org/authors/id/B/BK/BKB/Text-Fuzzy-0.25_01.tar.gz
Thanks.
|
Text::Fuzzy is up to date. (0.25) |
The link should work now, can you please check with that? It will not show
up in the module list.
Download and use
cpanm Text-Fuzzy-0.25_01.tar.gz
to install.
…On 9 July 2017 at 21:44, Nils Boeffel ***@***.***> wrote:
$ uname -a
Linux blackbox 4.10.0-22-generic #24-Ubuntu SMP Mon May 22 17:43:20 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
$ perl --version
This is perl 5, version 24, subversion 1 (v5.24.1) built for x86_64-linux-gnu-thread-multi
(with 67 registered patches, see perl -V for more detail)
Text::Fuzzy is up to date. (0.25)
...I'll wait until it shows up, then test again.
|
I can confirm the fix:
|
The bug is quite easy to reproduce with a lot smaller code than yours once you know where it is:
|
The bug probably won't occur again, but just for safety's sake I've added the above as a test in the new release, 5c87200. This should be going on to CPAN in about eight hours, assuming that 0.25_02 gets through CPAN testers without too many issues. |
Great, glad you got to see it as well. The weird thing is that in most cases it works (I have a German input file with 19k lines), and it only seemed to trigger in some edge cases. Even modifying the string in my test case by a single character made it work again... |
It is due to a shortcut which is only taken when the string given to Text::Fuzzy->new ($ascii) is ASCII and the string matched in $tf->distance ($unicode) is marked as Unicode. The shortcut eliminates all the wide characters and replaces them with unmatchable characters. That was never meant to be done to the actual string, though. A simple use of "const" in the C code would have prevented this error. |
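To illustrate the point about "const", here is a minimal sketch in C of how such a shortcut could be written so that it cannot touch the caller's string. This is not the actual Text::Fuzzy source: the function name ascii_view_of_utf8 and the UNMATCHABLE placeholder value are assumptions made up for this example. The idea is only that the input pointer is declared const and the replacement is done in a freshly allocated copy, so an accidental write through the input pointer becomes a compile-time error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical placeholder byte standing in for "a character that can
   never match"; the real module may use something different. */
#define UNMATCHABLE '\x01'

/*
 * Build a one-byte-per-character copy of the UTF-8 string `src`, in which
 * every non-ASCII character is collapsed to a single UNMATCHABLE byte.
 *
 * `src` is const-qualified, so a statement such as `src[i] = UNMATCHABLE;`
 * would be rejected by the compiler instead of silently corrupting the
 * caller's data.
 *
 * Returns a malloc'd, NUL-terminated buffer which the caller must free,
 * or NULL if the allocation fails. The length of the copy is stored in
 * `*out_len` when `out_len` is not NULL.
 */
char *ascii_view_of_utf8(const char *src, size_t len, size_t *out_len)
{
    char *dst;
    size_t i;
    size_t n = 0;

    assert(src != NULL);

    /* The copy can never be longer than the input. */
    dst = malloc(len + 1);
    if (dst == NULL) {
        return NULL;
    }
    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char) src[i];
        if (c < 0x80) {
            /* Plain ASCII byte: copy it unchanged. */
            dst[n++] = (char) c;
        }
        else if ((c & 0xC0) != 0x80) {
            /* Lead byte of a multi-byte character: emit one placeholder. */
            dst[n++] = UNMATCHABLE;
        }
        /* Continuation bytes (10xxxxxx) add nothing to the copy. */
    }
    dst[n] = '\0';
    if (out_len != NULL) {
        *out_len = n;
    }
    return dst;
}
```

Called on the ten bytes of the street name from the hexdump above (66 c3 bc 20 6b 65 20 42 c3 bc), this would return an eight-byte copy with a placeholder in place of each "ü", and the original ten bytes would be left exactly as they were.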
Hello, I have a really weird problem where $tf->distance() corrupts source data. This is caused by a really strange combination of factors, and tough to reproduce. Factors:
conso-BUG.txt