trying to address [issue #4](#4)
tools.py:
Add `fuzzyMatchPrepare()`, which extracts the authors, title, and
journal+year strings from a doc and returns them as a tuple.
In `fuzzyMatch()`, take the inputs from the outputs of
`fuzzyMatchPrepare()`. Add a `min_score` input arg, and add 2 shortcuts
that can skip some of the fuzzy matchings, based on the maximum
possible score (100) of the ratios not yet computed.
In `fuzzyMatch()`, use the simple ratio for authors too.
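The shortcut idea can be sketched as below. This is a minimal illustration, not the actual `tools.py` code: the doc layout, the use of `difflib` as a stdlib stand-in for fuzzywuzzy's `ratio`, and the assumption that the final score is the plain mean of the three ratios are all mine; the real weighting may differ.

```python
from difflib import SequenceMatcher

def _ratio(a, b):
    # stdlib stand-in for fuzzywuzzy's fuzz.ratio, on a 0-100 scale
    return 100.0 * SequenceMatcher(None, a, b).ratio()

def fuzzyMatchPrepare(doc):
    # hypothetical doc layout: a dict with 'authors', 'title',
    # 'journal' and 'year' keys
    authors = ' '.join(doc.get('authors', []))
    title = doc.get('title', '')
    other = '%s%s' % (doc.get('journal', ''), doc.get('year', ''))
    return authors, title, other

def fuzzyMatch(prep1, prep2, min_score=70):
    # assumed scoring: final score = mean of the three ratios
    authors1, title1, other1 = prep1
    authors2, title2, other2 = prep2
    ratio_authors = _ratio(authors1, authors2)
    # shortcut 1: even perfect 100s on the two remaining ratios
    # cannot lift the mean above min_score, so skip further work
    if (ratio_authors + 200.0) / 3.0 < min_score:
        return 0.0, False
    ratio_title = _ratio(title1, title2)
    # shortcut 2: same bound with only the last ratio unknown
    if (ratio_authors + ratio_title + 100.0) / 3.0 < min_score:
        return 0.0, False
    ratio_other = _ratio(other1, other2)
    score = (ratio_authors + ratio_title + ratio_other) / 3.0
    return score, score >= min_score
```

The two early exits save at most one or two ratio computations per pair, which matches the observation below that the per-call cost is not where the time goes.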
duplicate_frame.py:
In `prepareJoblist()`, add a `cache_dict` to store the outputs of
`fuzzyMatchPrepare()` and avoid re-computation. Compare the string
lengths of authors and titles; if they differ by more than 50%, skip
fuzzy matching and treat the pair as a non-match.
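The length pre-filter can be sketched as a standalone helper (the function name and the exact 50% threshold handling are my illustration, not the `duplicate_frame.py` code):

```python
def length_prefilter(s1, s2, max_diff=0.5):
    # Skip fuzzy matching when string lengths differ by more than 50%:
    # a Levenshtein-style ratio cannot be high for very unequal lengths,
    # so such pairs can be treated as non-matches without computing it.
    longer = max(len(s1), len(s2))
    if longer == 0:
        return True  # both empty: the pair cannot be ruled out by length
    return abs(len(s1) - len(s2)) / longer <= max_diff
```

Pairs for which this returns `False` never reach the ratio computation at all, which is where the saving comes from.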
It seems there is very little room to speed up the fuzzy matching calls
themselves. Even if I short-circuit `fuzzyMatch()` completely, the total
time doesn't drop much, so it is the sheer size of the search space that
adds up the time. Unless one can vectorize the Levenshtein
computations, cutting down the search space is the way to go.
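One common way to cut the quadratic search space is blocking: only compare documents that share some cheap key, e.g. the publication year. This is a sketch of the idea under that assumption, not something the current code does:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(docs):
    # Blocking sketch: group docs by publication year and only
    # generate pairs within a group, shrinking the O(n^2)
    # all-pairs search to the sum of per-block pair counts.
    blocks = defaultdict(list)
    for i, doc in enumerate(docs):
        blocks[doc.get('year')].append(i)
    for ids in blocks.values():
        for i, j in combinations(ids, 2):
            yield i, j
```

The trade-off is recall: a true duplicate whose records disagree on the blocking key (e.g. a wrong year in one record) would never be compared, so the key needs to be chosen accordingly.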
Currently the duplicate matching is done in the following manner (see `lib/tools.fuzzyMatch()`):

1. `token_sort_ratio` (see fuzzywuzzy) on the author lists of doc1 and doc2, giving `ratio_authors`.
2. `ratio` on the titles of doc1 and doc2, giving `ratio_title`.
3. `ratio` on the journal-name-year combined strings (e.g. 'Nature2019' vs. 'Science2018') of doc1 and doc2, giving `ratio_other`.
4. These ratios are combined into a final `score`.

Finally, a match is labelled if `score` >= a given threshold. Please let me know if you spot anything wrong or have a better way of doing this.
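For reference, the difference between `token_sort_ratio` (step 1) and the plain `ratio` (steps 2-3) is that the former ignores word order, which matters for author lists. A stdlib approximation of the two, using `difflib` in place of fuzzywuzzy's Levenshtein backend:

```python
from difflib import SequenceMatcher

def ratio(a, b):
    # stand-in for fuzzywuzzy's fuzz.ratio, on a 0-100 scale
    return round(100 * SequenceMatcher(None, a, b).ratio())

def token_sort_ratio(a, b):
    # stand-in for fuzzywuzzy's fuzz.token_sort_ratio: tokens are
    # lowercased and sorted before comparison, so word order is ignored
    norm = lambda s: ' '.join(sorted(s.lower().split()))
    return ratio(norm(a), norm(b))
```

So 'Smith John' vs. 'John Smith' scores 100 under `token_sort_ratio` but well below 100 under the plain `ratio`, which is why switching authors to the simple ratio (as in the tools.py change above) trades some robustness for speed.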