Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

explain scoring #5

Open
abubelinha opened this issue Mar 8, 2024 · 1 comment
Open

explain scoring #5

abubelinha opened this issue Mar 8, 2024 · 1 comment

Comments

@abubelinha
Copy link

abubelinha commented Mar 8, 2024

Looking to the basic example:
The string list is ['foo', 'foo quz BAR', 'baaarfoo', 'quz']

There you say:

The higher the score, the closer the value is to the pattern

What is the highest expected score, so I can decide the value is "close enough" to the pattern?
It turns out that longer patterns get higher scores:

  • pattern = 'baaarfoo' outputs:
shape: (1, 2)
┌─────────────┬───────┐
│ strs        ┆ score │
│ ---         ┆ ---   │
│ str         ┆ u32   │
╞═════════════╪═══════╡
│ baaarfoo ┆ 218   │
└─────────────┴───────┘
  • pattern = 'foo quz BAR' outputs:
shape: (1, 2)
┌─────────────┬───────┐
│ strs        ┆ score │
│ ---         ┆ ---   │
│ str         ┆ u32   │
╞═════════════╪═══════╡
│ foo quz BAR ┆ 264   │
└─────────────┴───────┘
  • pattern = 'quz' outputs:
shape: (2, 2)
┌─────────────┬───────┐
│ strs        ┆ score │
│ ---         ┆ ---   │
│ str         ┆ u32   │
╞═════════════╪═══════╡
│ foo quz BAR ┆ 88    │
│ quz         ┆ 88    │
└─────────────┴───────┘

I would expect identical strings should have the same top score (i.e. 100, for example), no matter how long they are.
Also, in the last example, I would expect 'quz' the same top score (i.e. 100) and 'foo quz BAR' to score less than the top (i.e., containing the string is not the same fully matching it).

Are they any config parameters so I can make the library behave like that?
Thanks

@abubelinha

@bnm3k
Copy link
Owner

bnm3k commented Mar 8, 2024

The score depends on both the pattern and the strings being matched such that the longer the pattern is, the higher the score will be for strings that match. For example, with an pattern that's got 2500 ascii chars, strings that match against it will have a score of roughly 65,010:

import polars as pl
from polars_fuzzy_match import fuzzy_match_score

long_str = "".join(["f"] * 5000)  # len 5000
df = pl.DataFrame(
    {
        "strs": [long_str, "alice", "bob"],
    }
)
long_pattern = "".join(["f"] * 2500)  # len 2500
out = (
    df.with_columns(
        length=pl.col("strs").str.len_bytes(),
        score=fuzzy_match_score(
            pl.col("strs"),
            long_pattern,
        ),
    )
    .filter(pl.col("score").is_not_null())
    .sort(by="score", descending=True)
)
print(out)

This outputs:

┌───────────────────────────────────┬────────┬───────┐
│ strs                              ┆ length ┆ score │
│ ---                               ┆ ---    ┆ ---   │
│ str                               ┆ u32    ┆ u32   │
╞═══════════════════════════════════╪════════╪═══════╡
│ ffffffffffffffffffffffffffffffff… ┆ 5000   ┆ 65010 │
└───────────────────────────────────┴────────┴───────┘

However, the score will never exceed 65,355 (max of an unsigned 16 bit integer). The best way to determine closeness in a manner that's independent of the length/complexity of the pattern is to filter out the null scores (as those don't match at all) and to order by score. There are some additional configurations that I haven't documented yet such as how to configure handling of uppercase characters and so on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants