explain scoring #5

abubelinha · 2024-03-08T13:02:01Z

Looking to the basic example:
The string list is ['foo', 'foo quz BAR', 'baaarfoo', 'quz']

There you say:

The higher the score, the closer the value is to the pattern

What is the highest expected score, so I can decide the value is "close enough" to the pattern?
It turns out that longer patterns get higher scores:

pattern = 'baaarfoo' outputs:

shape: (1, 2)
┌─────────────┬───────┐
│ strs        ┆ score │
│ ---         ┆ ---   │
│ str         ┆ u32   │
╞═════════════╪═══════╡
│ baaarfoo ┆ 218   │
└─────────────┴───────┘

pattern = 'foo quz BAR' outputs:

shape: (1, 2)
┌─────────────┬───────┐
│ strs        ┆ score │
│ ---         ┆ ---   │
│ str         ┆ u32   │
╞═════════════╪═══════╡
│ foo quz BAR ┆ 264   │
└─────────────┴───────┘

pattern = 'quz' outputs:

shape: (2, 2)
┌─────────────┬───────┐
│ strs        ┆ score │
│ ---         ┆ ---   │
│ str         ┆ u32   │
╞═════════════╪═══════╡
│ foo quz BAR ┆ 88    │
│ quz         ┆ 88    │
└─────────────┴───────┘

I would expect identical strings should have the same top score (i.e. 100, for example), no matter how long they are.
Also, in the last example, I would expect 'quz' the same top score (i.e. 100) and 'foo quz BAR' to score less than the top (i.e., containing the string is not the same fully matching it).

Are they any config parameters so I can make the library behave like that?
Thanks

@abubelinha

The text was updated successfully, but these errors were encountered:

bnm3k · 2024-03-08T17:59:04Z

The score depends on both the pattern and the strings being matched such that the longer the pattern is, the higher the score will be for strings that match. For example, with an pattern that's got 2500 ascii chars, strings that match against it will have a score of roughly 65,010:

import polars as pl
from polars_fuzzy_match import fuzzy_match_score

long_str = "".join(["f"] * 5000)  # len 5000
df = pl.DataFrame(
    {
        "strs": [long_str, "alice", "bob"],
    }
)
long_pattern = "".join(["f"] * 2500)  # len 2500
out = (
    df.with_columns(
        length=pl.col("strs").str.len_bytes(),
        score=fuzzy_match_score(
            pl.col("strs"),
            long_pattern,
        ),
    )
    .filter(pl.col("score").is_not_null())
    .sort(by="score", descending=True)
)
print(out)

This outputs:

┌───────────────────────────────────┬────────┬───────┐
│ strs                              ┆ length ┆ score │
│ ---                               ┆ ---    ┆ ---   │
│ str                               ┆ u32    ┆ u32   │
╞═══════════════════════════════════╪════════╪═══════╡
│ ffffffffffffffffffffffffffffffff… ┆ 5000   ┆ 65010 │
└───────────────────────────────────┴────────┴───────┘

However, the score will never exceed 65,355 (max of an unsigned 16 bit integer). The best way to determine closeness in a manner that's independent of the length/complexity of the pattern is to filter out the null scores (as those don't match at all) and to order by score. There are some additional configurations that I haven't documented yet such as how to configure handling of uppercase characters and so on.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

explain scoring #5

explain scoring #5

abubelinha commented Mar 8, 2024 •

edited

Loading

bnm3k commented Mar 8, 2024

explain scoring #5

explain scoring #5

Comments

abubelinha commented Mar 8, 2024 • edited Loading

bnm3k commented Mar 8, 2024

abubelinha commented Mar 8, 2024 •

edited

Loading