
start to use sklearn for ml algorithms #992

Merged: 32 commits merged into main from sklearn_depend on Jun 2, 2022

Conversation

@fgregg (Contributor) commented Apr 26, 2022

relates to #991 and #990

Todo

From trying to replace the cosine distance, it's pretty clear that it isn't worth doing unless we can batch the calls.

Any other likely places where we can use sklearn or scipy code instead of an additional library or replace dedupe code, @fjsj , @NickCrews ?
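A minimal sketch (toy data, not dedupe's actual code) of why batching matters here: calling a sklearn pairwise function once per field pair pays Python and input-validation overhead on every call, while a single vectorized pass computes all pairwise cosines at once.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
a = rng.random((1000, 50))  # field vectors from record 1 of each pair
b = rng.random((1000, 50))  # field vectors from record 2 of each pair

# slow: one sklearn call per field pair
per_pair = np.array(
    [cosine_similarity(a[i : i + 1], b[i : i + 1])[0, 0] for i in range(len(a))]
)

# fast: one vectorized pass over all pairs
batched = np.einsum("ij,ij->i", a, b) / (
    np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
)

assert np.allclose(per_pair, batched)
```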

@fjsj (Contributor) commented Apr 26, 2022

@fgregg
Another thing you could replace is the TF-IDF index, although if I recall correctly it's not trivial to build scikit-learn's TF-IDF iteratively and query it like an index.
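A hedged sketch of what that replacement could look like: a `TfidfVectorizer` plus a brute-force cosine `NearestNeighbors` acting as the queryable "index". The catch noted above is visible here: the vectorizer has to be fit on the whole corpus up front, so it can't easily be grown record by record the way dedupe's index is.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

corpus = ["123 main st", "123 main street", "77 mass ave"]  # toy records

# must be fit on the full corpus up front -- no incremental building
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
X = vectorizer.fit_transform(corpus)

# brute-force cosine search over the sparse TF-IDF matrix plays the index role
index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)
distances, ids = index.kneighbors(vectorizer.transform(["123 main st."]))
# ids[0] holds the positions of the closest corpus entries
```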

I also see some nested for loops in canonical.py, but I'm not sure whether anything in scikit-learn or scipy can replace them.

There are also graph operations in clustering.py that could in principle use sparse graphs from scipy, but it would be a full rewrite and I'm not sure whether performance or memory usage would actually improve.
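For concreteness, a hedged sketch (made-up edges, not dedupe's clustering code) of the kind of scipy primitive such a rewrite would lean on: cluster membership as connected components over a sparse adjacency matrix of above-threshold record pairs.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

n_records = 5
rows = np.array([0, 1, 3])  # record-pair edges that scored above threshold
cols = np.array([1, 2, 4])
graph = coo_matrix((np.ones(3), (rows, cols)), shape=(n_records, n_records))

n_clusters, labels = connected_components(graph, directed=False)
# records {0, 1, 2} share one cluster label, records {3, 4} another
```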

[Review thread on dedupe/api.py: outdated, resolved]
@NickCrews (Contributor):
I don't think I'm familiar enough with the code to be that confident, but from a cursory scan it seems like you got the obvious low-hanging fruit.

@NickCrews (Contributor):
If we implemented #967, then perhaps we could do a lot less guessing about how performance would change.

@fgregg (Contributor, Author) commented May 5, 2022

Nice: @NickCrews's benchmarking setup lets us see that sklearn's random forest classifier takes us to about 90M of peak memory, as opposed to 50M.
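A hedged sketch of the kind of swap being benchmarked: a sklearn `RandomForestClassifier` scoring record pairs from their per-field distance vectors (the data and shapes here are toy assumptions, not dedupe's internals).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
distances = rng.random((200, 4))         # per-field distance vectors per pair
labels = distances.mean(axis=1) < 0.5    # toy match / non-match labels

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(distances, labels)

# probability that each pair is a match
scores = clf.predict_proba(distances)[:, 1]
```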

@coveralls commented May 5, 2022

Coverage Status

Coverage remained the same at 64.947% when pulling ca861b9 on sklearn_depend into a503691 on main.

cclauss and others added 3 commits May 6, 2022 03:42
This reverts commit 3c34d99.

Using sklearn to calculate cosine is significantly slower than the simplecosine package because the sklearn methods were not designed to be called field-pair by field-pair.
@fgregg (Contributor, Author) commented May 6, 2022

@NickCrews, there seems to be something going on with the benchmarks. The precision and recall scores are exactly the same between this PR and the comparison branch, which does not seem possible.

@fgregg (Contributor, Author) commented May 6, 2022

I adjusted the benchmarker, and we are getting a much clearer picture.

  • Recall is up, almost always at 1.
  • For record linkage modes, precision stays high, sometimes with a small improvement.
  • For deduplication mode, recall is up to 1, but precision has dropped from the low 90s to the high 70s.

@NickCrews (Contributor):
I think the issue might be:

  • the benchmark runs and a settings file is created
  • we switch to the new commit; the settings file is in .gitignore, so it stays unchanged
  • on the new commit, the benchmark still uses the old settings file

@fgregg (Contributor, Author) commented May 6, 2022

> • Recall is up, almost always at 1.
> • For record linkage modes, precision stays high, sometimes with a small improvement.
> • For deduplication mode, recall is up to 1, but precision has dropped from the low 90s to the high 70s.

If we increase the threshold a bit for canonical, we can get better recall and precision than with logistic regression. I think I'll keep this change.

The only thing left to do is make sure that existing settings files that use rlr can still be loaded, and to add a deprecation notice when we find such a settings file.
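A hypothetical sketch of that remaining task. The function name `check_classifier` and the idea that the unpickled settings object exposes its classifier directly are illustrative assumptions, not dedupe's actual API.

```python
import warnings


def check_classifier(classifier):
    """Warn if an unpickled settings file carries an old rlr classifier."""
    if type(classifier).__module__.split(".")[0] == "rlr":
        warnings.warn(
            "This settings file was trained with rlr, which is deprecated; "
            "retrain to switch to the sklearn classifier.",
            DeprecationWarning,
        )
    return classifier
```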

[@dedupeio deleted a number of github-actions bot and duplicate comments between May 7 and May 27, 2022]
@fgregg (Contributor, Author) commented May 27, 2022

@benchmark

@github-actions bot:
All benchmarks (diff):

| change | before | after | ratio | benchmark |
|---|---|---|---|---|
| | 494M | 532M | 1.08 | canonical.Canonical.peakmem_run |
| | 15.4±0.3s | 15.0±0.03s | 0.97 | canonical.Canonical.time_run |
| - | 0.884 | 0.767 | 0.87 | canonical.Canonical.track_precision |
| | 0.973 | 1.0 | 1.03 | canonical.Canonical.track_recall |
| + | 195M | 232M | 1.19 | canonical_gazetteer.Gazetteer.peakmem_run(None) |
| | 13.8±0.08s | 14.3±0.02s | 1.03 | canonical_gazetteer.Gazetteer.time_run(None) |
| | 0.982 | 0.991 | 1.01 | canonical_gazetteer.Gazetteer.track_precision(None) |
| | 0.982 | 0.991 | 1.01 | canonical_gazetteer.Gazetteer.track_recall(None) |
| + | 196M | 232M | 1.18 | canonical_matching.Matching.peakmem_run({'threshold': 0.5, 'constraint': 'many-to-one'}) |
| + | 195M | 232M | 1.19 | canonical_matching.Matching.peakmem_run({'threshold': 0.5}) |
| | 12.1±0.01s | 12.4±0.01s | 1.03 | canonical_matching.Matching.time_run({'threshold': 0.5, 'constraint': 'many-to-one'}) |
| | 12.2±0.1s | 12.6±0.03s | 1.04 | canonical_matching.Matching.time_run({'threshold': 0.5}) |
| | 0.99 | 0.982 | 0.99 | canonical_matching.Matching.track_precision({'threshold': 0.5, 'constraint': 'many-to-one'}) |
| | 0.99 | 0.982 | 0.99 | canonical_matching.Matching.track_precision({'threshold': 0.5}) |
| | 0.911 | 1.0 | 1.10 | canonical_matching.Matching.track_recall({'threshold': 0.5, 'constraint': 'many-to-one'}) |
| | 0.911 | 0.973 | 1.07 | canonical_matching.Matching.track_recall({'threshold': 0.5}) |

(logs)

@fgregg (Contributor, Author) commented May 27, 2022

@benchmark

@github-actions bot:
All benchmarks (diff):

| change | before | after | ratio | benchmark |
|---|---|---|---|---|
| | 494M | 532M | 1.08 | canonical.Canonical.peakmem_run |
| | 15.0±0.2s | 22.9±0s | ~1.53 | canonical.Canonical.time_run |
| - | 0.919 | 0.8 | 0.87 | canonical.Canonical.track_precision |
| | 0.911 | 1.0 | 1.10 | canonical.Canonical.track_recall |
| + | 194M | 232M | 1.19 | canonical_gazetteer.Gazetteer.peakmem_run(None) |
| + | 13.5±0.07s | 23.5±0.01s | 1.75 | canonical_gazetteer.Gazetteer.time_run(None) |
| | 0.982 | 0.991 | 1.01 | canonical_gazetteer.Gazetteer.track_precision(None) |
| | 0.982 | 0.982 | 1.00 | canonical_gazetteer.Gazetteer.track_recall(None) |
| + | 194M | 232M | 1.19 | canonical_matching.Matching.peakmem_run({'threshold': 0.5, 'constraint': 'many-to-one'}) |
| + | 194M | 232M | 1.19 | canonical_matching.Matching.peakmem_run({'threshold': 0.5}) |
| + | 11.8±0.03s | 21.7±0.01s | 1.84 | canonical_matching.Matching.time_run({'threshold': 0.5, 'constraint': 'many-to-one'}) |
| | 11.7±0.1s | 21.9±0.02s | ~1.87 | canonical_matching.Matching.time_run({'threshold': 0.5}) |
| | 0.981 | 0.991 | 1.01 | canonical_matching.Matching.track_precision({'threshold': 0.5, 'constraint': 'many-to-one'}) |
| | 0.99 | 1.0 | 1.01 | canonical_matching.Matching.track_precision({'threshold': 0.5}) |
| | 0.911 | 0.991 | 1.09 | canonical_matching.Matching.track_recall({'threshold': 0.5, 'constraint': 'many-to-one'}) |
| | 0.92 | 1.0 | 1.09 | canonical_matching.Matching.track_recall({'threshold': 0.5}) |

(logs)

@fgregg (Contributor, Author) commented Jun 2, 2022

@benchmark

@github-actions bot commented Jun 2, 2022

All benchmarks (diff):

| change | before | after | ratio | benchmark |
|---|---|---|---|---|
| | 495M | 527M | 1.07 | canonical.Canonical.peakmem_run |
| | 15.8±0.06s | 14.2±0.3s | ~0.90 | canonical.Canonical.time_run |
| | 0.904 | 0.962 | 1.06 | canonical.Canonical.track_precision |
| | 0.964 | 0.929 | 0.96 | canonical.Canonical.track_recall |
| + | 195M | 228M | 1.17 | canonical_gazetteer.Gazetteer.peakmem_run(None) |
| | 13.5±0.05s | 13.2±0.03s | 0.98 | canonical_gazetteer.Gazetteer.time_run(None) |
| | 0.982 | 0.982 | 1.00 | canonical_gazetteer.Gazetteer.track_precision(None) |
| | 0.982 | 0.982 | 1.00 | canonical_gazetteer.Gazetteer.track_recall(None) |
| + | 195M | 228M | 1.17 | canonical_matching.Matching.peakmem_run({'threshold': 0.5, 'constraint': 'many-to-one'}) |
| + | 195M | 228M | 1.17 | canonical_matching.Matching.peakmem_run({'threshold': 0.5}) |
| | 11.8±0.01s | 11.7±0.06s | 0.99 | canonical_matching.Matching.time_run({'threshold': 0.5, 'constraint': 'many-to-one'}) |
| | 12.3±0.2s | 11.7±0.02s | 0.96 | canonical_matching.Matching.time_run({'threshold': 0.5}) |
| | 0.99 | 0.99 | 1.00 | canonical_matching.Matching.track_precision({'threshold': 0.5, 'constraint': 'many-to-one'}) |
| | 0.99 | 0.99 | 1.00 | canonical_matching.Matching.track_precision({'threshold': 0.5}) |
| | 0.911 | 0.911 | 1.00 | canonical_matching.Matching.track_recall({'threshold': 0.5, 'constraint': 'many-to-one'}) |
| | 0.929 | 0.911 | 0.98 | canonical_matching.Matching.track_recall({'threshold': 0.5}) |

(logs)

@fgregg fgregg merged commit 7e24af2 into main Jun 2, 2022
@fgregg fgregg linked an issue Jun 2, 2022 that may be closed by this pull request
Development

Successfully merging this pull request may close these issues.

shoud we integrate with scikit learn
5 participants