Improvement on duplicate checking #4

Open
Xunius opened this issue Apr 3, 2019 · 0 comments

Xunius commented Apr 3, 2019

Currently, duplicate matching is done in the following manner (see lib/tools.fuzzyMatch()):

  1. Get doc1, doc2.
  2. Compute the token_sort_ratio (see fuzzywuzzy) on the author lists of doc1 and doc2, giving ratio_authors.
  3. Compute the ratio on the titles of doc1 and doc2, giving ratio_title.
  4. Compute the ratio on the journal-name-year combined strings (e.g. 'Nature2019' vs. 'Science2018') of doc1 and doc2, giving ratio_other.
  5. Compute a weighted average of the 3 scores, using the average string lengths (mean of doc1 and doc2) as weights, giving score.

Finally, a match is labelled if score >= a given threshold.
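For reference, a minimal sketch of this scoring, assuming fuzzywuzzy is installed; the doc dict keys ('authors', 'title', 'journal', 'year') here are illustrative, not necessarily the actual data layout in the code:

```python
from fuzzywuzzy import fuzz

def fuzzy_match_score(doc1, doc2):
    # Build the three strings compared for each doc (illustrative keys).
    authors1, authors2 = ', '.join(doc1['authors']), ', '.join(doc2['authors'])
    title1, title2 = doc1['title'], doc2['title']
    other1 = '%s%s' % (doc1['journal'], doc1['year'])
    other2 = '%s%s' % (doc2['journal'], doc2['year'])

    ratio_authors = fuzz.token_sort_ratio(authors1, authors2)
    ratio_title = fuzz.ratio(title1, title2)
    ratio_other = fuzz.ratio(other1, other2)

    # Weight each ratio by the mean length of the two compared strings.
    len_a = (len(authors1) + len(authors2)) / 2.
    len_t = (len(title1) + len(title2)) / 2.
    len_o = (len(other1) + len(other2)) / 2.
    total = len_a + len_t + len_o

    return (ratio_authors*len_a + ratio_title*len_t + ratio_other*len_o) / total

# A pair is labelled a duplicate if fuzzy_match_score(doc1, doc2) >= threshold.
```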

Please let me know if you spot anything wrong or have a better way of doing this.

Xunius added a commit that referenced this issue Apr 7, 2019
trying to address [issue #4](#4)

tools.py:

    add fuzzyMatchPrepare(), which gets the author, title and
    journal+year strings from a doc and returns them in a tuple.

    in fuzzyMatch(), inputs are taken from the outputs of
    fuzzyMatchPrepare(). Add a min_score input arg, and add 2 shortcuts
    that can possibly skip a couple of fuzzy matchings, given the max
    possible score (100) of the to-be-computed scores.

    in fuzzyMatch(), use the simple ratio for authors too.
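A rough sketch of the shortcut idea (not the actual implementation; the tuple layout, weights and the min_score default are assumptions):

```python
from fuzzywuzzy import fuzz

def fuzzyMatchPrepare(doc):
    # Pull out the three strings once per doc so they can be reused/cached.
    authors = ', '.join(doc['authors'])          # illustrative keys
    title = doc['title']
    other = '%s%s' % (doc['journal'], doc['year'])
    return authors, title, other

def fuzzyMatch(prep1, prep2, min_score=60):
    authors1, title1, other1 = prep1
    authors2, title2, other2 = prep2

    len_a = (len(authors1) + len(authors2)) / 2.
    len_t = (len(title1) + len(title2)) / 2.
    len_o = (len(other1) + len(other2)) / 2.
    total = len_a + len_t + len_o

    ratio_a = fuzz.ratio(authors1, authors2)
    # Shortcut 1: even with perfect (100) title and journal+year ratios
    # the weighted average cannot reach min_score -> skip the rest.
    if (ratio_a*len_a + 100*(len_t + len_o)) / total < min_score:
        return 0

    ratio_t = fuzz.ratio(title1, title2)
    # Shortcut 2: same check once the title ratio is known.
    if (ratio_a*len_a + ratio_t*len_t + 100*len_o) / total < min_score:
        return 0

    ratio_o = fuzz.ratio(other1, other2)
    return (ratio_a*len_a + ratio_t*len_t + ratio_o*len_o) / total
```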

duplicate_frame.py:

    in prepareJoblist(), add a cache_dict to store outputs from
    fuzzyMatchPrepare(), to avoid re-computation. Compare the string
    lengths of authors and titles; if they differ by more than 50%,
    skip fuzzy matching and treat the pair as a non-match.
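A rough sketch of the caching and length pre-filter (the docids/getDoc names and the job-list format are assumptions, not the actual prepareJoblist() signature):

```python
import itertools

def prepareJoblist(docids, getDoc, min_score=60):
    cache_dict = {}
    jobs = []
    for id1, id2 in itertools.combinations(docids, 2):
        # Cache the prepared strings so each doc is processed only once.
        for idii in (id1, id2):
            if idii not in cache_dict:
                cache_dict[idii] = fuzzyMatchPrepare(getDoc(idii))
        authors1, title1, _ = cache_dict[id1]
        authors2, title2, _ = cache_dict[id2]
        # Assumed reading of the ">50%" rule: if author or title string
        # lengths differ by more than half the longer one, treat the pair
        # as a non-match without calling fuzzyMatch().
        if abs(len(authors1) - len(authors2)) > 0.5*max(len(authors1), len(authors2), 1):
            continue
        if abs(len(title1) - len(title2)) > 0.5*max(len(title1), len(title2), 1):
            continue
        jobs.append((cache_dict[id1], cache_dict[id2], min_score))
    return jobs
```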

    It seems that there is very little room for speed-up in the
    fuzzy matching calls themselves. Even if I shortcut fuzzyMatch()
    completely, the total time won't drop much. So it is the sheer
    search space that is adding up the time. Unless one can vectorize
    the Levenshtein computations, cutting down the search space is the
    way to go.