Merge pull request #959 from NickCrews/user-guide
Add User Guide
fgregg committed Feb 21, 2022
2 parents 2405f30 + c9287d0 commit fa576f9
Showing 10 changed files with 116 additions and 27 deletions.
13 changes: 7 additions & 6 deletions dedupe/api.py
@@ -942,13 +942,14 @@ def __init__(self,
settings_file: A file object containing settings
info produced from the
:func:`~dedupe.api.ActiveMatching.write_settings` method.
num_cores: the number of cpus to use for parallel
num_cores: The number of cpus to use for parallel
processing, defaults to the number of cpus
available on the machine. If set to 0, then
multiprocessing will be disabled.
in_memory: Boolean that if True will compute pairs using
sqlite in RAM with the sqlite3 ':memory:' option
in_memory: If True, :meth:`dedupe.Dedupe.pairs` will generate
pairs in RAM with the sqlite3 ':memory:' option
rather than writing to disk. May be faster if
sufficient memory is available.
@@ -1000,13 +1001,13 @@ def __init__(self,
the variables will be used for
training a model. See :ref:`variable_definitions`
num_cores: the number of cpus to use for parallel
num_cores: The number of cpus to use for parallel
processing. If set to `None`, uses all cpus
available on the machine. If set to 0, then
multiprocessing will be disabled.
in_memory: Boolean that if True will compute pairs using
sqlite in RAM with the sqlite3 ':memory:' option
in_memory: If True, :meth:`dedupe.Dedupe.pairs` will generate
pairs in RAM with the sqlite3 ':memory:' option
rather than writing to disk. May be faster if
sufficient memory is available.
19 changes: 19 additions & 0 deletions docs/Examples.rst
@@ -0,0 +1,19 @@
========
Examples
========

Dedupe is a library and not a stand-alone command line tool. To
demonstrate its usage, we have come up with a few example recipes for
different sized datasets for you to try out.

You can view and download the source code for these examples in the
`examples repo <https://github.com/dedupeio/dedupe-examples>`__.

Or, you can view annotated, "walkthrough" versions online:

* `Small data deduplication <http://dedupeio.github.io/dedupe-examples/docs/csv_example.html>`__
* `Record Linkage <https://dedupeio.github.io/dedupe-examples/docs/record_linkage_example.html>`__
* `Gazetteer example <https://dedupeio.github.io/dedupe-examples/docs/gazetteer_example.html>`__
* `MySQL example <https://dedupeio.github.io/dedupe-examples/docs/mysql_example.html>`__
* `Postgres big dedupe example <https://dedupeio.github.io/dedupe-examples/docs/pgsql_big_dedupe_example.html>`__
* `Patent Author Disambiguation <https://dedupeio.github.io/dedupe-examples/docs/patent_example.html>`__
84 changes: 84 additions & 0 deletions docs/Troubleshooting.rst
@@ -0,0 +1,84 @@
***************
Troubleshooting
***************

So you've tried to apply dedupe to your dataset, but you're having some problems.
Once you understand :ref:`how dedupe works <how-it-works-label>`, and you've taken
a look at some of the :doc:`examples<Examples>`, then this troubleshooting
guide is your next step.

Memory Considerations
=====================

The two most likely memory bottlenecks, most likely first, are:

1. Building the index predicates for blocking. If this is a problem,
   you can try turning off index blocking rules (and just use predicate
   blocking rules) by setting ``index_predicates=False`` in
   :meth:`dedupe.Dedupe.train`.

2. Running ``cluster()``. After scoring, dedupe has to compare all the pairwise
   scores and build the clusters. It runs a connected-components algorithm to
   determine where to begin the clustering. This currently happens in memory
   using Python dicts, so it can take substantial memory. There is currently
   no way to avoid this other than using fewer records.
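
To see why the clustering step is memory-bound, here is a minimal sketch of
connected components over scored pairs using plain dicts. It is an illustration
only, not dedupe's actual implementation: the function name, the
``((id_a, id_b), score)`` input shape, and the ``threshold`` argument are all
assumptions made for this example.

```python
from collections import defaultdict


def connected_components(scored_pairs, threshold=0.5):
    """Group record ids that are linked by pairs scoring at least `threshold`.

    Hypothetical helper for illustration; every record id seen in a
    qualifying pair is held in a dict, so memory grows with the data.
    """
    parent = {}  # record id -> parent record id (union-find forest)

    def find(record_id):
        parent.setdefault(record_id, record_id)
        # Walk up to the root, halving the path as we go.
        while parent[record_id] != record_id:
            parent[record_id] = parent[parent[record_id]]
            record_id = parent[record_id]
        return record_id

    for (id_a, id_b), score in scored_pairs:
        if score < threshold:
            continue  # weak pairs do not link records
        root_a, root_b = find(id_a), find(id_b)
        if root_a != root_b:
            parent[root_a] = root_b

    components = defaultdict(set)
    for record_id in parent:
        components[find(record_id)].add(record_id)
    return list(components.values())
```

Because ``parent`` and ``components`` both hold every linked record id at once,
the footprint scales with the number of records that survive scoring, which is
why using fewer records is the only lever mentioned above.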

Time Considerations
===================

The slowest part of dedupe is probably blocking. A big part of this is building
the index predicates, so the easiest fix is to set ``index_predicates=False``
in :meth:`dedupe.Dedupe.train`.

Blocking can also be slow if dedupe has to apply too many blocking rules, or
rules that are too complex. You can fix this by reducing the number of blocking
rules dedupe has to learn to cover all the true positives: either reduce the
``recall`` parameter in :meth:`dedupe.Dedupe.train` or, similarly, use fewer
positive examples during training.

Note that you are making a choice here between speed and recall. The less blocking
you do, the faster you go, but the more likely you are to miss true positives
because they never end up in the same block.
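
The two settings discussed above can be combined in a single call. The sketch
below assumes ``deduper`` is a :class:`dedupe.Dedupe` instance that has already
been given labeled training pairs; the helper name ``train_for_speed`` and the
default of ``0.9`` are hypothetical choices for this example, not
recommendations from the library.

```python
def train_for_speed(deduper, recall=0.9):
    """Train with cheaper blocking, trading some recall for speed.

    Hypothetical helper: `deduper` is assumed to expose
    `train(recall=..., index_predicates=...)`, as `dedupe.Dedupe` does.
    """
    # A lower `recall` lets dedupe settle for fewer blocking rules.
    # `index_predicates=False` skips the index-based blocking rules.
    deduper.train(recall=recall, index_predicates=False)
    return deduper
```

Start with a value of ``recall`` close to 1.0 and lower it only as far as the
run time requires, checking each time how many known duplicates are still
found.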

This part of dedupe is still single-threaded and could probably benefit
from parallelization or other code strategies,
although attempts so far have not proved promising.


Improving Accuracy
==================

- Inspect your results and see if you can find any patterns: Does dedupe
  not seem to be paying enough attention to some detail?

- Inspect the pairs given to you during :func:`dedupe.console_label`. These
  are the pairs that dedupe is most confused about. Are these actually confusing
  pairs? If so, then great, dedupe is doing about as well as you could expect.
  If a pair is obviously a duplicate or obviously not a duplicate, then
  there is some signal that you should help dedupe to find.

- Read up on the theory behind each of the variable types. Some of them
  work better than others depending on the situation, so try to understand
  them as well as you can.

- Add other variables. For instance, try treating a field as both a ``String``
  and a ``Text`` variable. If this doesn't cut it, add your own custom
  variable that emphasizes the feature that you're really looking for.
  For instance, if you have a list of last names, you might want "Smith"
  to score well with "Smith-Johnson" (someone got married?). None of the
  built-in variables will handle this well, so write your own comparator.

- Add ``Interaction`` variables. For instance, if both the "last name" and
  "street address" fields score very well, then this is almost a guarantee
  that these two records refer to the same person. An ``Interaction`` variable
  can emphasize this to the learner.
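
As a rough sketch of the last two suggestions, the snippet below defines a
custom comparator for the "Smith" / "Smith-Johnson" case and uses it next to an
``Interaction`` variable. The comparator name, the field names, and the choice
to return ``0.0`` for a match and ``1.0`` otherwise are assumptions made for
illustration. The dict-style definitions follow the format described in
:ref:`variable_definitions`, so check that page for the exact keys supported by
your version.

```python
def surname_part_distance(name_1, name_2):
    """Return 0.0 if the surnames share a hyphen-separated part, else 1.0.

    Hypothetical comparator: "Smith" and "Smith-Johnson" share "smith",
    so they compare as close even though the full strings differ.
    """
    parts_1 = {part.strip().lower() for part in name_1.split("-") if part.strip()}
    parts_2 = {part.strip().lower() for part in name_2.split("-") if part.strip()}
    return 0.0 if parts_1 & parts_2 else 1.0


# Assumed field names; adjust them to match your own records.
variables = [
    {
        "field": "last name",
        "variable name": "last name",
        "type": "Custom",
        "comparator": surname_part_distance,
    },
    {
        "field": "street address",
        "variable name": "street address",
        "type": "String",
    },
    # Lets the learner weight "both fields agree" on top of each field alone.
    {
        "type": "Interaction",
        "interaction variables": ["last name", "street address"],
    },
]
```

A two-valued comparator like this one is deliberately crude. Returning a graded
distance instead would give the learner more signal to work with.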

Extending Dedupe
================

If the built-in variables don't cut it, you can write your own variables.

Take a look at the separately maintained `optional variables
<https://github.com/search?q=org%3Adedupeio+dedupe-variable>`__
for examples of how to write your own custom variable types with
your custom comparators and predicates.
File renamed without changes.
File renamed without changes.
2 changes: 2 additions & 0 deletions docs/How-it-works.rst → docs/how-it-works/How-it-works.rst
@@ -1,3 +1,5 @@
.. _how-it-works-label:

############
How it works
############
File renamed without changes.
@@ -130,6 +130,6 @@ Other field distances
~~~~~~~~~~~~~~~~~~~~~

We have implemented a number of field distance measures. See :doc:`the
details about variables <Variable-definition>`.
details about variables </Variable-definition>`.


File renamed without changes.
23 changes: 3 additions & 20 deletions docs/index.rst
@@ -46,7 +46,9 @@ Contents

API-documentation
Variable-definition
How-it-works
Examples
how-it-works/How-it-works
Troubleshooting
Bibliography


@@ -66,23 +68,6 @@ Installation
pip install dedupe
Using dedupe
============

Dedupe is a library and not a stand-alone command line tool. To
demonstrate its usage, we have come up with a `few example recipes for
different sized datasets for you
<https://github.com/dedupeio/dedupe-examples/archive/0.5.zip>`__
(`repo <https://github.com/dedupeio/dedupe-examples>`__, as well as
annotated source code:

* `Small data deduplication <http://dedupeio.github.io/dedupe-examples/docs/csv_example.html>`__
* `Record Linkage <https://dedupeio.github.io/dedupe-examples/docs/record_linkage_example.html>`__
* `Gazetter example <https://dedupeio.github.io/dedupe-examples/docs/gazetteer_example.html>`__
* `MySQL example <https://dedupeio.github.io/dedupe-examples/docs/mysql_example.html>`__
* `Postgres big dedupe example <https://dedupeio.github.io/dedupe-examples/docs/pgsql_big_dedupe_example.html>`__
* `Patent Author Disambiguation <https://dedupeio.github.io/dedupe-examples/docs/patent_example.html>`__

Errors / Bugs
=============

@@ -111,6 +96,4 @@ Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
