add in-memory option #913
Conversation
```diff
@@ -227,7 +233,10 @@ def pairs(self, data):
         # Blocking and pair generation are typically the first memory
         # bottlenecks, so we'll use sqlite3 to avoid doing them in memory
         with tempfile.TemporaryDirectory() as temp_dir:
-            con = sqlite3.connect(temp_dir + '/blocks.db')
+            if in_memory:
+                con = sqlite3.connect(':memory:')
```
It doesn't seem like the worst thing, but it is a bit awkward that the temp_dir still gets created in this case where it is unused. It looks like there are some nice patterns for avoiding this, but some of them depend on python version >= 3.3 or 3.7. I'm not sure what dedupe is trying to support currently. See some of the answers on this stackoverflow: https://stackoverflow.com/questions/27803059/conditional-with-statement-in-python.
May not be worth worrying about the temp_dir at all.
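For illustration, one of the patterns from that thread as a rough sketch, assuming Python >= 3.7 so `contextlib.nullcontext` is available (the function name and body here are placeholders, not the actual dedupe code):

```python
import contextlib
import sqlite3
import tempfile

def pairs_sketch(in_memory: bool):
    # Create a real temporary directory only when the database is
    # file-backed; contextlib.nullcontext (Python >= 3.7) is a no-op
    # context manager that yields None otherwise.
    dir_context = (
        contextlib.nullcontext()
        if in_memory
        else tempfile.TemporaryDirectory()
    )
    with dir_context as temp_dir:
        if in_memory:
            con = sqlite3.connect(':memory:')
        else:
            con = sqlite3.connect(temp_dir + '/blocks.db')
        # ... blocking and pair generation would happen here ...
        con.close()
```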
i'm a bit surprised that you had a dataset that was large enough that the job ran for 90 minutes but where the blocking map wasn't enormous. Or maybe it was very big, but you have lots of memory on your machine? In general, I'm open to this option. I think it should be a config option when initializing the Dedupe, RecordLink, or Gazetteer class, not an argument to the various methods.
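If it does become a constructor option, usage might look something like this (a sketch of the proposal, not a confirmed API; the field definition is illustrative):

```python
import dedupe

# Hypothetical: in_memory is passed once at initialization instead of
# to pairs()/partition()/etc. The field definition is just an example.
fields = [{'field': 'name', 'type': 'String'}]
deduper = dedupe.Dedupe(fields, in_memory=True)
```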
you may need to merge from master to fix the mypy complaints
Made changes that move the in_memory option to a class attribute determined on initialization. I have it sitting in the parent class.

I am usually running on a machine with very large RAM for my primary use case. That said, I'm profiling on my laptop and not seeing all that much memory usage. Like you mention, I've been a little surprised by the combination of runtime, size of input data, memory usage, etc. There are some things about the performance characteristics of dedupe I'm having a hard time nailing down, but I'm doing some digging to see what I can find, hence this and my other PR.
can you add documentation for these arguments here https://github.com/dedupeio/dedupe/blob/master/docs/API-documentation.rst
Looks like that rst file is mostly using sphinx to pick things up automatically from docstrings. Would you rather I add directly to the rst or just put it in the relevant docstrings?
i think if you update the docstring in the ActiveMatching and StaticMatching classes, it will do the right thing, but i don't completely recall.
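For reference, the kind of docstring addition autodoc would pick up might look like the sketch below (class and signature heavily simplified; the exact parameter style should match the surrounding docstrings):

```python
class ActiveMatching:
    def __init__(self, in_memory: bool = False, **kwargs) -> None:
        """
        Args:
            in_memory: if True, hold the blocking database in RAM
                via sqlite3's ':memory:' connection rather than a
                temporary file on disk. Faster, at the cost of
                memory proportional to the size of the blocking
                map. Defaults to False.
        """
        self.in_memory = in_memory
```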
very nice, thanks a lot
After upgrading to dedupe 2.x we were encountering gunicorn worker timeouts on staging. This change attempts to speed things up by using the SQLite in-memory option added in dedupeio/dedupe#913
Using sqlite for join operations is good when memory is an issue, but it comes with a performance hit when memory is not a constraint. Depending on the input data, the parameterization of the model, and the machine running the code, it may be preferable to run in memory or to use sqlite to push things to disk. I'm working with a large dataset, but on a machine with plenty of RAM.
Conveniently, sqlite3 gives the option to run in memory simply by changing the connection string to `:memory:` (see the sqlite3 documentation). In my particular case, changing the connection from a temporary file to working in memory reduces the compute time of `partition` from 91 minutes to 24 minutes.

I propose adding an `in_memory` option that defaults to `False` (maintaining existing behavior) but gives the option to run things in memory. Thanks to the sqlite3 option, it comes with very little additional complexity.

If you are interested in this change, maybe it makes sense to add test cases, but I'm not sure where they belong exactly. I've tried running `tests/canonical_matching.py` and `tests/canonical.py` with `in_memory` hard-coded to `True` and `False` to verify that much works.
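To make the mechanics concrete, the whole difference is the connection string. A minimal sketch (paths are illustrative):

```python
import sqlite3
import tempfile

# File-backed database: pages spill to disk, so large blocking maps
# don't exhaust RAM.
with tempfile.TemporaryDirectory() as temp_dir:
    disk_con = sqlite3.connect(temp_dir + '/blocks.db')
    disk_con.close()

# In-memory database: identical API, but everything lives in RAM and
# vanishes when the connection closes.
mem_con = sqlite3.connect(':memory:')
mem_con.close()
```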