
add in-memory option #913

Merged · 7 commits merged into dedupeio:master from jhendr:sqlite-in-memory on Sep 3, 2021
Conversation

@jhendr (Contributor) commented Aug 20, 2021

Using sqlite for join operations is good when memory is an issue, but it carries a performance hit when memory is not a constraint. Depending on the input data, the parameterization of the model, and the machine running the code, it may be preferable to run in memory or to use sqlite to push things to disk. I'm working with a large dataset, but on a machine with plenty of RAM.

Conveniently, sqlite3 gives the option to run in memory simply by changing the connection string to `:memory:`. In my particular case, changing the connection from a temporary file to working in memory reduces the compute time of `partition` from 91 minutes to 24 minutes.
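For illustration, a minimal sketch of the switch (the file path here is a placeholder):

```python
import sqlite3

# On-disk database: sqlite pushes the working set out to a file.
con_disk = sqlite3.connect('/tmp/blocks.db')  # placeholder path

# In-memory database: identical API, but everything stays in RAM.
con_mem = sqlite3.connect(':memory:')
```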

I propose adding an `in_memory` option that defaults to False (maintaining existing behavior) but allows running things in memory. Thanks to the sqlite3 connection string, it comes with very little additional complexity.

If you are interested in this change, maybe it makes sense to add test cases, but I'm not sure where they belong exactly. I've tried running tests/canonical_matching.py and tests/canonical.py with `in_memory` hard-coded to True and False to verify that much works.

```diff
@@ -227,7 +233,10 @@ def pairs(self, data):
         # Blocking and pair generation are typically the first memory
         # bottlenecks, so we'll use sqlite3 to avoid doing them in memory
         with tempfile.TemporaryDirectory() as temp_dir:
-            con = sqlite3.connect(temp_dir + '/blocks.db')
+            if in_memory:
+                con = sqlite3.connect(':memory:')
+            else:
+                con = sqlite3.connect(temp_dir + '/blocks.db')
```
@jhendr (Contributor, Author) commented on the diff above:

It doesn't seem like the worst thing, but it is a bit awkward that the temp_dir still gets created in this case, where it goes unused. It looks like there are some nice patterns for avoiding this, but some of them depend on a Python version >= 3.3 or 3.7, and I'm not sure what dedupe is trying to support currently. See some of the answers on this Stack Overflow question: https://stackoverflow.com/questions/27803059/conditional-with-statement-in-python.

May not be worth worrying about the temp_dir at all.
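For illustration, a minimal sketch of one such pattern from the linked thread, using contextlib.nullcontext (Python 3.7+) so the temporary directory is only created when it is actually needed; the `in_memory` flag is the option proposed in this PR:

```python
import contextlib
import sqlite3
import tempfile

in_memory = True  # the option proposed in this PR

# nullcontext (3.7+) is a do-nothing context manager, so the temporary
# directory is only created on the on-disk path.
if in_memory:
    dir_ctx = contextlib.nullcontext()
else:
    dir_ctx = tempfile.TemporaryDirectory()

with dir_ctx as temp_dir:
    if in_memory:
        con = sqlite3.connect(':memory:')
    else:
        con = sqlite3.connect(temp_dir + '/blocks.db')
```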

@fgregg (Contributor) commented Aug 21, 2021

i'm a bit surprised that you had a dataset that was large enough that the job ran for 90 minutes but where the blocking map wasn't enormous. Or maybe it was very big, but you have lots of memory on your machine?

In general, I'm open to this option. I think it should be a config option when initializing the Dedupe, RecordLink, or Gazetteer class, not an argument to the various methods.
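For illustration, a rough sketch of the suggested interface (hypothetical at this point in the thread; the field definition is a placeholder):

```python
import dedupe

# Placeholder variable definition; the point is that in_memory would be
# passed once at initialization, like num_cores, not to each method call.
variables = [{'field': 'name', 'type': 'String'}]

deduper = dedupe.Dedupe(variables, num_cores=4, in_memory=True)
```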

@fgregg (Contributor) commented Aug 21, 2021

you may need to merge from master to fix the mypy complaints

@jhendr (Contributor, Author) commented Aug 23, 2021

Made changes that move the `in_memory` option to a class attribute determined at initialization. I have it sitting in the parent Matching class, similar to num_cores. Let me know what you think.
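For illustration, a rough sketch (not the exact PR diff) of how the option can sit on the parent class; the surrounding signature is an assumption:

```python
import multiprocessing

class Matching:
    def __init__(self, num_cores=None, in_memory=False, **kwargs):
        # Mirror the num_cores pattern: resolve the setting once at
        # initialization and store it as an attribute for later use.
        if num_cores is None:
            self.num_cores = multiprocessing.cpu_count()
        else:
            self.num_cores = num_cores
        self.in_memory = in_memory
```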

I am usually running on a machine with very large RAM for my primary use case. That said, I'm profiling on my laptop and not seeing all that much memory usage. Like you mention, I've been a little surprised by the combination of runtime, size of input data, memory usage, etc. There are some things about the performance characteristics of dedupe I'm having a hard time nailing down, but I'm doing some digging to see what I can find, hence this and my other PR.

@coveralls commented Aug 23, 2021

Coverage Status

Coverage decreased (-0.2%) to 66.516% when pulling f69529b on jhendr:sqlite-in-memory into d70b0aa on dedupeio:master.

@fgregg (Contributor) commented Aug 23, 2021

can you add documentation for these arguments here https://github.com/dedupeio/dedupe/blob/master/docs/API-documentation.rst

@jhendr (Contributor, Author) commented Aug 23, 2021

> can you add documentation for these arguments here https://github.com/dedupeio/dedupe/blob/master/docs/API-documentation.rst

Looks like that rst file is mostly using sphinx to pick things up automatically from docstrings. Would you rather I add directly to the rst or just put it in the relevant docstring in the Matching class?

@fgregg (Contributor) commented Aug 24, 2021

i think if you update the docstring in the ActiveMatching and StaticMatching classes, it will do the right thing, but i don't completely recall.

@jhendr (Contributor, Author) commented Aug 24, 2021

[screenshot of the rendered API documentation]

Looks like you are correct. That seems to work. Since the documentation lives in ActiveMatching and StaticMatching, I added `in_memory` as an explicit kwarg in those classes rather than just having it pass implicitly through `**kwargs`.
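For illustration, a minimal sketch of the kind of docstring Sphinx's autodoc picks up; the wording and surrounding signature are assumptions, only the `in_memory` name and its False default come from this PR:

```python
class StaticMatching:
    def __init__(self, num_cores=None, in_memory=False, **kwargs):
        """
        Keyword Args:
            num_cores: number of cpus to use for parallel processing.
            in_memory: if True, use an in-memory sqlite3 database for
                blocking instead of a temporary file on disk.
                Defaults to False.
        """
        self.num_cores = num_cores
        self.in_memory = in_memory
```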

@fgregg (Contributor) commented Sep 3, 2021

very nice, thanks a lot

fgregg merged commit e789b9a into dedupeio:master on Sep 3, 2021
jwalgran added a commit to opensupplyhub/open-apparel-registry that referenced this pull request Jul 15, 2022
After upgrading to dedupe 2.x we were encountering gunicorn worker timeouts on staging. This change attempts to speed things up by using the SQLite in-memory option added in

dedupeio/dedupe#913