Fuzzy matching continued #2186

Merged: 14 commits, Oct 8, 2023


@mikiher mikiher commented Oct 5, 2023

This is a continuation of Fuzzy Matching V1.
This includes some cleanups and refactoring, a few improvements, and one major enhancement.

Cleanups, refactoring, and small fixes:

  • Refactor title candidates logic (addition, variants, sorting) into a separate class (TitleCandidates) (46b0b3a)
  • (minor) Rewrite the logic that makes sure we don't run the original Title/Author search twice, to make it more readable (10f5bc8)
  • (minor) Refactor OpenLib-specific sorting into getOpenLibResult (1d3ad38)
  • (minor) Add back lower-casing in the cleanTitle/Author methods, since they are called from other places in the code (5d7c197)

Enhancements & Improvements:

  • (major) Move Author candidates logic into its own class (AuthorCandidates), and introduce author extraction and validation (from both author and title parts) using parallel requests to Audnexus. This helps when the author field includes additional data, or when the author hides in one of the title parts. Fuzzy search logic now has an outer loop that goes over author candidates (ending with an empty author), and an inner loop that goes over title candidates (9eff471)
  • Handle initials in author normalization (separate initials, and remove middle initials, as they sometimes mismatch with providers) (f3555a1)
  • Added/fixed a couple of title transformer regular expressions. (b2acdad)
  • Improved title candidate sorting (preferring transformed title parts over original ones, and title parts in their order of appearance) (8979586)
  • Add just one title variant after all transformers have been applied, and not after each transformer (752bfff)
  • Treat underscores as title part separators (improves some corner cases) (bf9f389)
  • Reduce spurious Audnexus author matching (by reducing the max Levenshtein distance, and looking only at the top 10 results) (b0b7a0a)
  • If no authors have been validated, use an aggressively cleaned version of the author field (in many cases, it is better than nothing) (f44b7ed)
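The nested search order described in the major enhancement can be pictured roughly as follows. This is a minimal sketch, not the code from the PR: `fuzzySearch`, `searchProvider`, and the shape of the return value are hypothetical names chosen for illustration, and the real classes (TitleCandidates, AuthorCandidates) are replaced here by plain arrays that are assumed to be already sorted.

```javascript
// Hypothetical sketch: function and parameter names are illustrative,
// not the project's actual API.
async function fuzzySearch(authorCandidates, titleCandidates, searchProvider) {
  // Outer loop: validated authors first, then an empty author as the last resort.
  const authors = [...authorCandidates, '']
  let searches = 0
  for (const author of authors) {
    // Inner loop: title candidates, assumed to be sorted best-first.
    for (const title of titleCandidates) {
      searches++
      const results = await searchProvider(title, author)
      // Stop at the first combination that returns anything.
      if (results.length > 0) return { results, title, author, searches }
    }
  }
  return { results: [], title: null, author: null, searches }
}
```

Counting the provider calls in `searches` is one way an "average number of fuzzy searches" figure could be collected over an eval set.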

The code is now more robust, and handles various hard corner cases it didn't handle before.
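As an illustration of the initials handling listed among the enhancements (separate initials, then remove middle initials), here is a small sketch. The function name `normalizeInitials` and the exact rules are assumptions; the PR text does not show the actual implementation.

```javascript
// Hypothetical sketch of author-initial normalization; the rules used in
// the PR itself may differ.
function normalizeInitials(author) {
  // Separate run-together initials: "J.R.R. Tolkien" -> "J. R. R. Tolkien"
  const separated = author.replace(/\b([A-Za-z])\.(?=\S)/g, '$1. ')
  const parts = separated.split(/\s+/).filter(Boolean)
  // A token counts as an initial if it is a single letter, with or without a period.
  const isInitial = (part) => /^[A-Za-z]\.?$/.test(part)
  // Drop initials that are neither the first nor the last token.
  return parts
    .filter((part, i) => i === 0 || i === parts.length - 1 || !isInitial(part))
    .join(' ')
}
```

With these rules, "George R.R. Martin" becomes "George Martin", which avoids mismatches when a provider lists the author with or without the middle initials.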

It fixes one case in the previous eval set, and keeps a 98% found rate and 96% found@1 in a new eval set of 50 title/author pairs.
In the new eval set, I also measured the average number of fuzzy searches: 1.18. (Note that the eval sets are picked from an unstructured torrents folder, where the initial search with the original title and author almost always fails. An average of 1.18 means that in most cases, only one fuzzy search request is needed.)

The additional parallel author validation requests to Audnexus (usually 2 to 4 per search) seem to run very quickly, and most of the network time seems to be spent in the search provider requests.
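A rough sketch of how parallel author validation might look, combining `Promise.all` with the Levenshtein and top-10 limits mentioned in the list of improvements. Everything here is an assumption for illustration: `findAuthors` stands in for the Audnexus request, `validateAuthors` is a made-up name, and the default `maxDistance` of 2 is a placeholder since the PR does not state the actual threshold.

```javascript
// Classic single-row dynamic-programming Levenshtein distance.
function levenshtein(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j)
  for (let i = 1; i <= a.length; i++) {
    let diag = row[0]
    row[0] = i
    for (let j = 1; j <= b.length; j++) {
      const above = row[j]
      const cost = a[i - 1] === b[j - 1] ? 0 : 1
      row[j] = Math.min(above + 1, row[j - 1] + 1, diag + cost)
      diag = above
    }
  }
  return row[b.length]
}

// Hypothetical sketch: findAuthors(query) is assumed to resolve to an array
// of { name } objects; maxDistance and topN defaults are placeholders.
async function validateAuthors(candidates, findAuthors, maxDistance = 2, topN = 10) {
  // Fire all lookups at once instead of awaiting them one by one.
  const checks = candidates.map(async (candidate) => {
    const results = (await findAuthors(candidate)).slice(0, topN)
    const wanted = candidate.toLowerCase()
    const match = results.find((r) => levenshtein(r.name.toLowerCase(), wanted) <= maxDistance)
    return match ? match.name : null
  })
  const validated = await Promise.all(checks)
  // Keep only confirmed names, without duplicates.
  return [...new Set(validated.filter((name) => name !== null))]
}
```

Because the lookups run concurrently, the total wait is roughly that of the slowest single request rather than the sum of all of them.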

@mikiher mikiher marked this pull request as ready for review October 5, 2023 19:43
@mikiher
Contributor Author

mikiher commented Oct 7, 2023

By the way, I can share my eval sets and hard cases privately, if you'd like to see them.

@advplyr
Owner

advplyr commented Oct 8, 2023

@mikiher I tested this PR with my sample libraries that have random meta tags and filenames and it worked really well. It would be nice to have a better test set to test with in the future.

@advplyr advplyr merged commit 5ad9f50 into advplyr:master Oct 8, 2023
1 check passed
@advplyr
Owner

advplyr commented Oct 8, 2023

I think the AuthorCandidates class can be useful for the AuthorFinder as well

@mikiher
Contributor Author

mikiher commented Oct 8, 2023

@advplyr I just sent a link to the evals spreadsheet to your personal email address, which I found here. Feel free to send me any additional hard cases you find.

add(title, position = 0) {
  // if title contains the author, remove it
  if (this.cleanAuthor) {
    const authorRe = new RegExp(`(^| | by |)${this.cleanAuthor}(?= |$)`, "g")
Owner

For future reference if you are working on this, an edge case came up here with invalid regex. #2265

Fixed by adding a util function to escape the string.
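For readers following along, the kind of fix described could look like the sketch below. The name `escapeRegExp` and the helper `buildAuthorRegex` are assumptions (the actual util added for #2265 is not shown in this thread); the escaping pattern itself is the widely used one for neutralizing regex metacharacters before interpolating a string into `new RegExp`.

```javascript
// Escape every character that has a special meaning in a regular expression.
// The function name is an assumption; the project's util may be named differently.
function escapeRegExp(str) {
  return str.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

// Hypothetical helper mirroring the reviewed line, with the author escaped.
function buildAuthorRegex(cleanAuthor) {
  return new RegExp(`(^| | by |)${escapeRegExp(cleanAuthor)}(?= |$)`, 'g')
}

// An author string with an unbalanced parenthesis would make the unescaped
// version throw a SyntaxError; the escaped version matches it literally.
const authorRe = buildAuthorRegex('a. (narrator')
console.log('dune by a. (narrator'.replace(authorRe, ''))
```

Without the escaping step, any author field containing characters such as `(`, `[`, or `*` can either throw at construction time or silently match something other than the literal author name.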

@mikiher
Contributor Author

mikiher commented Oct 31, 2023 via email

@advplyr
Owner

advplyr commented Oct 31, 2023

I don't have a preferred framework. I've used mocha and jest a little bit but not enough to have a preference. Do you have a preference?

@mikiher
Contributor Author

mikiher commented Nov 2, 2023

I'm quite new to Node.js and Javascript (my background is C++ and Python).
Let me do some digging and testing of my own and I'll come back with a recommendation.
