Improvement on duplicate checking #4

Open
Xunius opened this issue Apr 3, 2019 · 0 comments

Xunius commented Apr 3, 2019

Currently, duplicate matching is done in the following manner (see lib/tools.fuzzyMatch()):

  1. Get doc1, doc2.
  2. Compute the token_sort_ratio (see fuzzywuzzy) on the author lists of doc1 and doc2, giving ratio_authors.
  3. Compute the ratio on the titles of doc1 and doc2, giving ratio_title.
  4. Compute the ratio on the journal-name-year combined strings (e.g. 'Nature2019' vs. 'Science2018') of doc1 and doc2, giving ratio_other.
  5. Compute a weighted average of the 3 scores, using the average string lengths (mean of doc1 and doc2) as weights, giving score.

Finally, a match is labelled if score >= a given threshold.
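For reference, a minimal sketch of this scoring, assuming fuzzywuzzy is installed; the doc dict keys ('authors', 'title', 'journal', 'year') here are illustrative, not necessarily the actual data layout in the code:

```python
from fuzzywuzzy import fuzz

def fuzzy_match_score(doc1, doc2):
    # Build the three strings compared for each doc (illustrative keys).
    authors1, authors2 = ', '.join(doc1['authors']), ', '.join(doc2['authors'])
    title1, title2 = doc1['title'], doc2['title']
    other1 = '%s%s' % (doc1['journal'], doc1['year'])
    other2 = '%s%s' % (doc2['journal'], doc2['year'])

    ratio_authors = fuzz.token_sort_ratio(authors1, authors2)
    ratio_title = fuzz.ratio(title1, title2)
    ratio_other = fuzz.ratio(other1, other2)

    # Weight each ratio by the mean length of the two compared strings.
    len_a = (len(authors1) + len(authors2)) / 2.
    len_t = (len(title1) + len(title2)) / 2.
    len_o = (len(other1) + len(other2)) / 2.
    total = len_a + len_t + len_o

    return (ratio_authors*len_a + ratio_title*len_t + ratio_other*len_o) / total

# A pair is labelled a duplicate if fuzzy_match_score(doc1, doc2) >= threshold.
```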

Please let me know if you spot anything wrong or have a better way of doing this.

Xunius added a commit that referenced this issue Apr 7, 2019
trying to address [issue #4](#4)

tools.py:

    add fuzzyMatchPrepare(), which gets the author, title and
    journal+year strings from a doc and returns them in a tuple.

    in fuzzyMatch(), inputs are taken from the outputs of
    fuzzyMatchPrepare(). Add a min_score input arg, and add 2 shortcuts
    that can possibly skip a couple of fuzzy matchings, given the max
    possible score (100) of the to-be-computed scores.

    in fuzzyMatch(), use the simple ratio for authors too.
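A rough sketch of the shortcut idea (not the actual implementation; the tuple layout, weights and the min_score default are assumptions):

```python
from fuzzywuzzy import fuzz

def fuzzyMatchPrepare(doc):
    # Pull out the three strings once per doc so they can be reused/cached.
    authors = ', '.join(doc['authors'])          # illustrative keys
    title = doc['title']
    other = '%s%s' % (doc['journal'], doc['year'])
    return authors, title, other

def fuzzyMatch(prep1, prep2, min_score=60):
    authors1, title1, other1 = prep1
    authors2, title2, other2 = prep2

    len_a = (len(authors1) + len(authors2)) / 2.
    len_t = (len(title1) + len(title2)) / 2.
    len_o = (len(other1) + len(other2)) / 2.
    total = len_a + len_t + len_o

    ratio_a = fuzz.ratio(authors1, authors2)
    # Shortcut 1: even with perfect (100) title and journal+year ratios
    # the weighted average cannot reach min_score -> skip the rest.
    if (ratio_a*len_a + 100*(len_t + len_o)) / total < min_score:
        return 0

    ratio_t = fuzz.ratio(title1, title2)
    # Shortcut 2: same check once the title ratio is known.
    if (ratio_a*len_a + ratio_t*len_t + 100*len_o) / total < min_score:
        return 0

    ratio_o = fuzz.ratio(other1, other2)
    return (ratio_a*len_a + ratio_t*len_t + ratio_o*len_o) / total
```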

duplicate_frame.py:

    in prepareJoblist(), add a cache_dict to store outputs from
    fuzzyMatchPrepare(), to avoid re-computation. Compare the string
    lengths of authors and titles; if they differ by more than 50%,
    skip fuzzy matching and treat the pair as a non-match.
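A rough sketch of the caching and length pre-filter (the docids/getDoc names and the job-list format are assumptions, not the actual prepareJoblist() signature):

```python
import itertools

def prepareJoblist(docids, getDoc, min_score=60):
    cache_dict = {}
    jobs = []
    for id1, id2 in itertools.combinations(docids, 2):
        # Cache the prepared strings so each doc is processed only once.
        for idii in (id1, id2):
            if idii not in cache_dict:
                cache_dict[idii] = fuzzyMatchPrepare(getDoc(idii))
        authors1, title1, _ = cache_dict[id1]
        authors2, title2, _ = cache_dict[id2]
        # Assumed reading of the ">50%" rule: if author or title string
        # lengths differ by more than half the longer one, treat the pair
        # as a non-match without calling fuzzyMatch().
        if abs(len(authors1) - len(authors2)) > 0.5*max(len(authors1), len(authors2), 1):
            continue
        if abs(len(title1) - len(title2)) > 0.5*max(len(title1), len(title2), 1):
            continue
        jobs.append((cache_dict[id1], cache_dict[id2], min_score))
    return jobs
```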

    It seems that there is very little room for speed-up in the
    fuzzy matching calls themselves. Even if I shortcut fuzzyMatch()
    completely, the total time won't drop much. So it is the sheer
    search space that is adding up the time. Unless one can vectorize
    the Levenshtein computations, cutting down the search space is the
    way to go.