Fuzzy matching continued #2186

mikiher · 2023-10-05T18:57:44Z

This is a continuation of Fuzzy Matching V1.
This includes some cleanups and refactoring, a few improvements, and one major enhancement.

Cleanups, refactoring, and small fixes:

Refactor title candidates logic (addition, variants, sorting) into a separate class (TitleCandidates) (46b0b3a)
(minor) Rewrite the logic that makes sure we don't run the original Title/Author search twice, to make it more readable (10f5bc8)
(minor) Refactor OpenLib-specific sorting into getOpenLibResult (1d3ad38)
(minor) Add back lower-casing in the cleanTitle/Author methods, since they are called from other places in the code (5d7c197)

Enhancements & Improvements:

(major) Move Author candidates logic into its own class (AuthorCandidates), and introduce author extraction and validation (from both author and title parts) using parallel requests to Audnexus. This helps when the author field includes additional data, or when the author hides in one of the title parts. Fuzzy search logic now has an outer loop that goes over author candidates (ending with an empty author), and an inner loop that goes over title candidates (9eff471)
Handle initials in author normalization (separate initials, and remove middle initials, as they sometimes mismatch with providers) (f3555a1)
Added/fixed a couple of title transformer regular expressions. (b2acdad)
Improved title candidate sorting (preferring transformed title parts over original ones, and title parts in their order of appearance) (8979586)
Add just one title variant after all transformers have been applied, and not after each transformer (752bfff)
Treat underscores as title part separators (improves some corner cases) (bf9f389)
Reduce spurious Audnexus author matching (by reducing the max Levenshtein distance, and looking only at the top 10 results) (b0b7a0a)
If no authors have been validated, use an aggressively cleaned version of the author field (in many cases, it is better than nothing) (f44b7ed)
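
The outer/inner loop structure described for 9eff471 can be sketched roughly as follows. This is a simplified illustration and not the code in BookFinder.js: `fuzzySearch`, `searchProvider`, `stubProvider` and the plain candidate arrays are hypothetical names, and the provider is stubbed out.

```javascript
// Hypothetical sketch: author candidates in the outer loop, title candidates in the inner loop.
// `searchProvider(title, author)` stands in for a real provider request (an assumption).
async function fuzzySearch(authorCandidates, titleCandidates, searchProvider) {
  // Try every author candidate first, and the empty author last.
  const authors = [...authorCandidates, '']
  let numSearches = 0
  for (const author of authors) {
    for (const title of titleCandidates) {
      numSearches++
      const results = await searchProvider(title, author)
      // Stop at the first title/author combination that returns anything.
      if (results.length) return { results, title, author, numSearches }
    }
  }
  return { results: [], title: null, author: null, numSearches }
}

// Stub provider that only knows one book (illustration only).
const stubProvider = async (title, author) =>
  title === 'dune' && author === 'frank herbert' ? [{ title: 'Dune' }] : []

fuzzySearch(['frank herbert'], ['dune audiobook', 'dune'], stubProvider)
  .then((match) => console.log(match.title, match.author, match.numSearches))
```

Because candidates are sorted before the loops run, a good ordering keeps the number of provider requests low.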

The code is now more robust, and handles various hard corner cases it didn't handle before.
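
As an illustration of the initials handling mentioned for f3555a1 (separate initials, then drop middle initials), here is a minimal sketch. The function name `normalizeAuthorInitials` and the exact regular expressions are assumptions made for this example and may differ from what the PR does.

```javascript
// Hypothetical sketch of author initials normalization.
function normalizeAuthorInitials(author) {
  let clean = author.toLowerCase().trim()
  // Separate initials that are glued together: "r.r." becomes "r. r."
  clean = clean.replace(/([a-z])\.(?=[a-z])/g, '$1. ')
  // Remove single-letter middle initials that sit between two longer name parts.
  clean = clean.replace(/(?<=\w\w)(\s+[a-z]\.?)+(?=\s+\w\w)/g, '')
  // Collapse any leftover runs of whitespace.
  return clean.replace(/\s+/g, ' ')
}

console.log(normalizeAuthorInitials('George R.R. Martin')) // "george martin"
console.log(normalizeAuthorInitials('Philip K. Dick')) // "philip dick"
console.log(normalizeAuthorInitials('J.R.R. Tolkien')) // "j. r. r. tolkien"
```

In this sketch leading initials are kept, since removing them would leave only the surname.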

It fixes one case in the previous eval set, and keeps a 98% found rate and 96% found@1 on a new eval set of 50 title/author pairs.
In the new eval set, I also measured the average number of fuzzy searches: 1.18. Note that the eval sets are picked from an unstructured torrents folder, where the initial search with the original title and author fields almost always fails, so 1.18 means that in most cases only one fuzzy search request is needed.
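
For readers who want to reproduce this kind of measurement, a minimal sketch of the bookkeeping follows. It assumes that "found" means the expected book appears anywhere in the results and that "found@1" means it is the top result. The record shape (`rank`, `searches`) and the sample records are made up for illustration and are not the actual eval data.

```javascript
// Hypothetical eval summary. Each record describes one title/author pair:
//   rank     - 1-based position of the expected book in the results, or null if missing
//   searches - number of fuzzy search requests issued for that pair
function summarizeEval(records) {
  const n = records.length
  const found = records.filter((r) => r.rank !== null).length
  const foundAt1 = records.filter((r) => r.rank === 1).length
  const totalSearches = records.reduce((sum, r) => sum + r.searches, 0)
  return {
    foundRate: found / n,
    foundAt1Rate: foundAt1 / n,
    avgSearches: totalSearches / n
  }
}

// Made-up records, only to show the shape of the input.
const records = [
  ...Array.from({ length: 48 }, () => ({ rank: 1, searches: 1 })),
  { rank: 2, searches: 2 },
  { rank: null, searches: 3 }
]
console.log(summarizeEval(records))
```

With 50 pairs, a 98% found rate corresponds to 49 pairs found and 96% found@1 to 48 pairs ranked first.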

The additional parallel author validation requests to Audnexus (usually between 2 and 4) seem to run very quickly, and most of the network time seems to be spent in the search provider requests.
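
A rough sketch of how parallel validation with a Levenshtein cutoff might look is shown below. The names `validateAuthors` and `searchAuthors`, the result shape (`{ name }`), and the `maxDistance = 2` default are assumptions for illustration. Only the "top 10 results" limit comes from the description above, and `searchAuthors` is a stand-in for the Audnexus lookup, not its real API.

```javascript
// Classic dynamic-programming Levenshtein distance (two-row variant).
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j)
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
    }
    prev = curr
  }
  return prev[b.length]
}

// Validate all author candidates concurrently.
// `searchAuthors(name)` is a hypothetical async lookup returning [{ name }, ...].
async function validateAuthors(candidates, searchAuthors, maxDistance = 2, topN = 10) {
  const checks = await Promise.all(
    candidates.map(async (candidate) => {
      // Only look at the first few results to limit spurious matches.
      const results = (await searchAuthors(candidate)).slice(0, topN)
      const match = results.find(
        (r) => levenshtein(r.name.toLowerCase(), candidate.toLowerCase()) <= maxDistance
      )
      return match ? match.name : null
    })
  )
  // Keep only candidates that matched a known author.
  return checks.filter(Boolean)
}

// Stubbed lookup that always returns the same author (illustration only).
const stubLookup = async () => [{ name: 'Frank Herbert' }]
validateAuthors(['frank herbert', 'unabridged'], stubLookup).then((valid) => console.log(valid))
```

Since the requests are issued together through `Promise.all`, the total wait is roughly that of the slowest single lookup rather than the sum of all of them.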

server/finders/BookFinder.js

mikiher · 2023-10-07T22:17:49Z

By the way, I can share my eval sets and hard cases privately, if you'd like to see them.

advplyr · 2023-10-08T15:39:04Z

@mikiher I tested this PR with my sample libraries that have random meta tags and filenames and it worked really well. It would be nice to have a better test set to test with in the future.

advplyr · 2023-10-08T15:40:15Z

I think the AuthorCandidates class can be useful for the AuthorFinder as well

mikiher · 2023-10-08T17:17:09Z

@advplyr I just sent you a link to the evals spreadsheet to your personal email I found here. Feel free to send me any additional hard cases you find.

advplyr · 2023-10-30T21:37:26Z

server/finders/BookFinder.js

+    add(title, position = 0) {
+      // if title contains the author, remove it
+      if (this.cleanAuthor) {
+        const authorRe = new RegExp(`(^| | by |)${this.cleanAuthor}(?= |$)`, "g")


For future reference if you are working on this, an edge case came up here with invalid regex. #2265

Fixed by adding a util function to escape the string.
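
For illustration, the problem arises because the author string is interpolated into the pattern as-is, so characters such as an unbalanced parenthesis make the RegExp invalid. A helper along these lines avoids that. The name `escapeRegExp` and the sample strings are assumptions (the actual util added for #2265 is not shown here); the character class is the commonly used one from the MDN regular expressions guide.

```javascript
// Escape characters that have special meaning in a regular expression.
function escapeRegExp(str) {
  return str.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

// An author string with an unbalanced parenthesis would throw if used unescaped.
const cleanAuthor = 'john (editor'
const authorRe = new RegExp(`(^| | by |)${escapeRegExp(cleanAuthor)}(?= |$)`, 'g')

// The author is now matched literally and removed from the title.
const title = 'some book by john (editor'.replace(authorRe, '').trim()
console.log(title)
```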

mikiher · 2023-10-31T00:18:30Z

Nice lesson in defensive coding, thanks! I got sidetracked by this external docker-on-windows abs watcher project (which I hope to release sometime this week), but I'm definitely planning to go back to the matching code. This code requires some serious unit testing, but I didn't get a clear response to my question on discord regarding which unit testing framework was your favorite (jest, mocha, cypress, something else?). I think you just need to make a decision and we'll all stick with it, but I believe you should drive this.


advplyr · 2023-10-31T19:48:41Z

I don't have a preferred framework. I've used mocha and jest a little bit but not enough to have a preference. Do you have a preference?

mikiher · 2023-11-02T12:37:18Z

I'm quite new to Node.js and Javascript (my background is C++ and Python).
Let me do some digging and testing of my own and I'll come back with a recommendation.

mikiher added 12 commits September 30, 2023 18:08

[cleanup] refactor OpenLib sort into getOpenLibResult

1d3ad38

[cleanup] Refactor candidates logic to separate class

46b0b3a

[fix] Add back toLowerCase to cleanAuthor/Title (required by other uses)

5d7c197

[cleanup] Make original title/author check more readable

10f5bc8

[enhancement] Only add title candidate before and after all transforms

752bfff

[enhancement] Improve candidate sorting

8979586

[enhancement] AuthorCandidates, author validation

9eff471

[enhancement] Added a couple title transformers

b2acdad

[enhancement] Handle initials in author normalization

f3555a1

[enhancement] Treat underscores as title part separators

bf9f389

[enhancement] Reduce spurious matches in validateAuthor

b0b7a0a

[enhancement] If no valid authors, use clean author field

f44b7ed

mikiher marked this pull request as ready for review October 5, 2023 19:43

Merge branch 'master' into Fuzzy-Matching-Continued

786df45

advplyr reviewed Oct 7, 2023


Remove some unused code in AuthorCandidates.add

f8f555b

advplyr merged commit 5ad9f50 into advplyr:master Oct 8, 2023
1 check passed

advplyr reviewed Oct 30, 2023


advplyr mentioned this pull request Nov 25, 2023

Search: search over all metadata combined #2340

Closed
