NAME

Text::Fingerprint - perform simple text clustering by key collision

VERSION

version 0.001

SYNOPSIS

    use feature qw(say);
    use utf8;

    use Text::Fingerprint qw(:all);

    my $str = q(
        À noite, vovô Kowalsky vê o ímã cair no pé do pingüim
        queixoso e vovó põe açúcar no chá de tâmaras do jabuti feliz.
    );

    say fingerprint($str);
    # a acucar cair cha de do e feliz ima jabuti kowalsky
    # no noite o pe pinguim poe queixoso tamaras ve vovo

    say fingerprint_ngram($str);
    # abacadaialamanarasbucachcudedoeaedeieleoetevfeg
    # uhaifiminiritixizjakokylilsmamqngnoocoeoiojokop
    # osovowpepipoqurarnsdsksotatetiucueuiutvevowaxoyv

DESCRIPTION

Text clustering functions borrowed from Google Refine. They can be useful for finding groups of different values that might be alternative representations of the same thing. For example, the two strings "New York" and "new york" very likely refer to the same concept and differ only in capitalization. Likewise, "Gödel" and "Godel" probably refer to the same person.
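As a rough illustration of clustering by key collision, the sketch below groups values whose keys are identical. The names cluster_by_key and $simple_key are hypothetical and not part of this module; $simple_key is a simplified stand-in so the example runs with core Perl only, and fingerprint() from this module could presumably be passed in its place.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Fingerprint): group values by key.
# $key_fn is any code reference that maps a string to its clustering key.
sub cluster_by_key {
    my ($key_fn, @values) = @_;
    my %clusters;
    push @{ $clusters{ $key_fn->($_) } }, $_ for @values;
    return \%clusters;
}

# Simplified stand-in key (an assumption for this sketch):
# lowercase the string, split on whitespace, sort the words.
my $simple_key = sub {
    my ($s) = @_;
    my @words = split ' ', lc $s;
    return join ' ', sort @words;
};

my $clusters = cluster_by_key($simple_key,
    'New York', 'new york', 'York New', 'Boston');

for my $key (sort keys %$clusters) {
    print "$key: ", join(' | ', @{ $clusters->{$key} }), "\n";
}
# boston: Boston
# new york: New York | new york | York New
```

Values that land under the same key are candidates for being the same thing written differently; values alone under their key have no collisions.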

FUNCTIONS

fingerprint($string)

The process that generates the key from a $string value is the following (note that the order of these operations is significant):

  • remove leading and trailing whitespace
  • normalize extended western characters to their ASCII representation (for example "gödel" → "godel")
  • change all characters to their lowercase representation
  • split the string into tokens separated by punctuation, whitespace, or control characters
  • sort the tokens and remove duplicates
  • join the tokens back together
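The steps above can be sketched in plain Perl as follows. This is not the module's actual implementation: the name sketch_fingerprint is hypothetical, joining with a single space is inferred from the SYNOPSIS output, and the ASCII normalization step is approximated with the core Unicode::Normalize module (decompose, then drop combining marks), which only covers accented letters.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# Hypothetical re-implementation of the listed steps, for illustration only.
sub sketch_fingerprint {
    my ($string) = @_;

    # 1. remove leading and trailing whitespace
    $string =~ s/^\s+|\s+$//g;

    # 2. approximate ASCII normalization (assumption): decompose accented
    #    characters and strip the combining marks
    $string = NFKD($string);
    $string =~ s/\p{Mn}+//g;

    # 3. lowercase
    $string = lc $string;

    # 4. split on runs of punctuation, whitespace and control characters
    my @tokens = grep { length }
        split /[[:punct:][:space:][:cntrl:]]+/, $string;

    # 5. remove duplicates, then sort
    my %seen;
    my @unique = grep { !$seen{$_}++ } @tokens;

    # 6. join the tokens back together
    return join ' ', sort @unique;
}

print sketch_fingerprint('  York,   NEW york! '), "\n";   # new york
print sketch_fingerprint("G\x{F6}del"), "\n";             # godel
```

Because the tokens are sorted and deduplicated, word order and repetition in the input do not affect the resulting key.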

fingerprint_ngram($string, $n)

The n-gram fingerprint method is similar to the fingerprint method described above, but instead of splitting the string into tokens it uses n-grams. The n-gram size in characters, $n, can be specified by the user (default: 2).

  • change all characters to their lowercase representation
  • remove all punctuation, whitespace, and control characters
  • normalize extended western characters to their ASCII representation
  • obtain all the string n-grams
  • sort the n-grams and remove duplicates
  • join the sorted n-grams back together
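A minimal sketch of the n-gram steps above, again using only core modules. The name sketch_fingerprint_ngram is hypothetical, joining without a separator is inferred from the SYNOPSIS output, and the ASCII normalization is approximated with Unicode::Normalize rather than whatever the module uses internally.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# Hypothetical re-implementation of the listed steps, for illustration only.
sub sketch_fingerprint_ngram {
    my ($string, $n) = @_;
    $n = 2 unless defined $n;    # default n-gram size, as documented

    # 1. lowercase
    $string = lc $string;

    # 2. remove all punctuation, whitespace and control characters
    $string =~ s/[[:punct:][:space:][:cntrl:]]+//g;

    # 3. approximate ASCII normalization (assumption): decompose accented
    #    characters and strip the combining marks
    $string = NFKD($string);
    $string =~ s/\p{Mn}+//g;

    # 4. collect every n-gram; hash keys remove duplicates
    my %seen;
    for my $i (0 .. length($string) - $n) {
        $seen{ substr($string, $i, $n) } = 1;
    }

    # 5-6. sort the n-grams and join them back together
    return join '', sort keys %seen;
}

print sketch_fingerprint_ngram('New York'), "\n";    # ewneorrkwyyo
print sketch_fingerprint_ngram('Paris', 1), "\n";    # aiprs
```

With $n set to 1 the key reduces to the sorted set of distinct characters, so smaller values of $n make more strings collide.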

SEE ALSO

AUTHOR

Stanislaw Pusep <stas@sysd.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2012 by Stanislaw Pusep.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
