
add in-memory option #913

Merged · 7 commits merged into dedupeio:master from jhendr:sqlite-in-memory on Sep 3, 2021
Conversation

@jhendr (Contributor) commented Aug 20, 2021

Using sqlite for join operations is good when memory is an issue, but it carries a performance hit when memory is not a constraint. Depending on the input data, the parameterization of the model, and the machine running the code, it may be preferable to run in memory or to use sqlite to push things to disk. I'm working with a large dataset, but on a machine with plenty of RAM.

Conveniently, sqlite3 gives the option to run in memory simply by changing the connection string to `:memory:`. In my particular case, changing the connection from a temporary file to working in memory reduces the compute time of `partition` from 91 minutes to 24 minutes.
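For illustration, a minimal sketch of the switch (the file path here is a placeholder):

```python
import sqlite3

# On-disk database: sqlite pushes the working set out to a file.
con_disk = sqlite3.connect('/tmp/blocks.db')  # placeholder path

# In-memory database: identical API, but everything stays in RAM.
con_mem = sqlite3.connect(':memory:')
```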

I propose adding an `in_memory` option that defaults to False (maintaining existing behavior) but allows running things in memory. Thanks to the sqlite3 connection string, it comes with very little additional complexity.

If you are interested in this change, maybe it makes sense to add test cases, but I'm not sure where they belong exactly. I've tried running tests/canonical_matching.py and tests/canonical.py with `in_memory` hard-coded to True and False to verify that much works.

```diff
@@ -227,7 +233,10 @@ def pairs(self, data):
         # Blocking and pair generation are typically the first memory
         # bottlenecks, so we'll use sqlite3 to avoid doing them in memory
         with tempfile.TemporaryDirectory() as temp_dir:
-            con = sqlite3.connect(temp_dir + '/blocks.db')
+            if in_memory:
+                con = sqlite3.connect(':memory:')
+            else:
+                con = sqlite3.connect(temp_dir + '/blocks.db')
```
@jhendr (Contributor, Author) commented on the diff above:

It doesn't seem like the worst thing, but it is a bit awkward that the temp_dir still gets created in this case, where it goes unused. It looks like there are some nice patterns for avoiding this, but some of them depend on a Python version >= 3.3 or 3.7, and I'm not sure what dedupe is trying to support currently. See some of the answers on this Stack Overflow question: https://stackoverflow.com/questions/27803059/conditional-with-statement-in-python.

May not be worth worrying about the temp_dir at all.
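For illustration, a minimal sketch of one such pattern from the linked thread, using contextlib.nullcontext (Python 3.7+) so the temporary directory is only created when it is actually needed; the `in_memory` flag is the option proposed in this PR:

```python
import contextlib
import sqlite3
import tempfile

in_memory = True  # the option proposed in this PR

# nullcontext (3.7+) is a do-nothing context manager, so the temporary
# directory is only created on the on-disk path.
if in_memory:
    dir_ctx = contextlib.nullcontext()
else:
    dir_ctx = tempfile.TemporaryDirectory()

with dir_ctx as temp_dir:
    if in_memory:
        con = sqlite3.connect(':memory:')
    else:
        con = sqlite3.connect(temp_dir + '/blocks.db')
```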

@fgregg (Contributor) commented Aug 21, 2021

i'm a bit surprised that you had a dataset that was large enough that the job ran for 90 minutes but where the blocking map wasn't enormous. Or maybe it was very big, but you have lots of memory on your machine?

In general, I'm open to this option. I think it should be a config option when initializing the Dedupe, RecordLink, or Gazetteer class, not an argument to the various methods.
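For illustration, a rough sketch of the suggested interface (hypothetical at this point in the thread; the field definition is a placeholder):

```python
import dedupe

# Placeholder variable definition; the point is that in_memory would be
# passed once at initialization, like num_cores, not to each method call.
variables = [{'field': 'name', 'type': 'String'}]

deduper = dedupe.Dedupe(variables, num_cores=4, in_memory=True)
```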

@fgregg (Contributor) commented Aug 21, 2021

you may need to merge from master to fix the mypy complaints

@jhendr (Contributor, Author) commented Aug 23, 2021

Made changes that move the `in_memory` option to a class attribute determined at initialization. I have it sitting in the parent Matching class, similar to num_cores. Let me know what you think.
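For illustration, a rough sketch (not the exact PR diff) of how the option can sit on the parent class; the surrounding signature is an assumption:

```python
import multiprocessing

class Matching:
    def __init__(self, num_cores=None, in_memory=False, **kwargs):
        # Mirror the num_cores pattern: resolve the setting once at
        # initialization and store it as an attribute for later use.
        if num_cores is None:
            self.num_cores = multiprocessing.cpu_count()
        else:
            self.num_cores = num_cores
        self.in_memory = in_memory
```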

I am usually running on a machine with very large RAM for my primary use case. That said, I'm profiling on my laptop and not seeing all that much memory usage. Like you mention, I've been a little surprised by the combination of runtime, size of input data, memory usage, etc. There are some things about the performance characteristics of dedupe I'm having a hard time nailing down, but I'm doing some digging to see what I can find, hence this and my other PR.

@coveralls commented Aug 23, 2021

Coverage Status

Coverage decreased (-0.2%) to 66.516% when pulling f69529b on jhendr:sqlite-in-memory into d70b0aa on dedupeio:master.

@fgregg (Contributor) commented Aug 23, 2021

can you add documentation for these arguments here https://github.com/dedupeio/dedupe/blob/master/docs/API-documentation.rst

@jhendr (Contributor, Author) commented Aug 23, 2021

> can you add documentation for these arguments here https://github.com/dedupeio/dedupe/blob/master/docs/API-documentation.rst

Looks like that rst file is mostly using sphinx to pick things up automatically from docstrings. Would you rather I add directly to the rst or just put it in the relevant docstring in the Matching class?

@fgregg (Contributor) commented Aug 24, 2021

i think if you update the docstring in the ActiveMatching and StaticMatching classes, it will do the right thing, but i don't completely recall.

@jhendr (Contributor, Author) commented Aug 24, 2021

[screenshot of the rendered API documentation]

Looks like you are correct. That seems to work. Since the documentation lives in ActiveMatching and StaticMatching, I added `in_memory` as an explicit kwarg in those classes rather than just having it pass implicitly through `**kwargs`.
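For illustration, a minimal sketch of the kind of docstring Sphinx's autodoc picks up; the wording and surrounding signature are assumptions, only the `in_memory` name and its False default come from this PR:

```python
class StaticMatching:
    def __init__(self, num_cores=None, in_memory=False, **kwargs):
        """
        Keyword Args:
            num_cores: number of cpus to use for parallel processing.
            in_memory: if True, use an in-memory sqlite3 database for
                blocking instead of a temporary file on disk.
                Defaults to False.
        """
        self.num_cores = num_cores
        self.in_memory = in_memory
```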

@fgregg (Contributor) commented Sep 3, 2021

very nice, thanks a lot

fgregg merged commit e789b9a into dedupeio:master on Sep 3, 2021
jwalgran added a commit to opensupplyhub/open-apparel-registry that referenced this pull request Jul 15, 2022
After upgrading to dedupe 2.x we were encountering gunicorn worker timeouts on staging. This change attempts to speed things up by using the SQLite in-memory option added in

dedupeio/dedupe#913