Merge pull request #959 from NickCrews/user-guide

Add User Guide
dedupeio · Feb 21, 2022 · fa576f9
2 parents 2405f30 + c9287d0
Showing 10 changed files with 116 additions and 27 deletions.
diff --git a/dedupe/api.py b/dedupe/api.py
@@ -942,13 +942,14 @@ def __init__(self,
             settings_file: A file object containing settings
                            info produced from the
                            :func:`~dedupe.api.ActiveMatching.write_settings` method.
-            num_cores: the number of cpus to use for parallel
+
+            num_cores: The number of cpus to use for parallel
                        processing, defaults to the number of cpus
                        available on the machine. If set to 0, then
                        multiprocessing will be disabled.
 
-            in_memory: Boolean that if True will compute pairs using
-                       sqlite in RAM with the sqlite3 ':memory:' option
+            in_memory: If True, :meth:`dedupe.Dedupe.pairs` will generate
+                       pairs in RAM with the sqlite3 ':memory:' option
                        rather than writing to disk. May be faster if
                        sufficient memory is available.
 
@@ -1000,13 +1001,13 @@ def __init__(self,
                                  the variables will be used for
                                  training a model. See :ref:`variable_definitions`
 
-            num_cores: the number of cpus to use for parallel
+            num_cores: The number of cpus to use for parallel
                        processing. If set to `None`, uses all cpus
                        available on the machine. If set to 0, then
                        multiprocessing will be disabled.
 
-            in_memory: Boolean that if True will compute pairs using
-                       sqlite in RAM with the sqlite3 ':memory:' option
+            in_memory: If True, :meth:`dedupe.Dedupe.pairs` will generate
+                       pairs in RAM with the sqlite3 ':memory:' option
                        rather than writing to disk. May be faster if
                        sufficient memory is available.
 

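For context on the ``in_memory`` option documented above: sqlite3 treats the special path ``':memory:'`` as a database held entirely in RAM. The sketch below uses only the standard library to show the difference. It is not dedupe's own code; the ``pairs`` table, its schema, and the ``count_pairs`` helper are made up for illustration.

```python
import os
import sqlite3
import tempfile


def count_pairs(db_path):
    """Store a few record-id pairs in SQLite at db_path and count them.

    The table name and schema are hypothetical, not dedupe's.
    """
    con = sqlite3.connect(db_path)
    try:
        con.execute("CREATE TABLE pairs (a INTEGER, b INTEGER)")
        con.executemany(
            "INSERT INTO pairs VALUES (?, ?)", [(1, 2), (1, 3), (2, 3)]
        )
        con.commit()
        (n,) = con.execute("SELECT COUNT(*) FROM pairs").fetchone()
        return n
    finally:
        con.close()


# ':memory:' keeps the whole database in RAM; nothing is written to disk.
in_ram = count_pairs(":memory:")

# A regular path writes the database to a file instead.
with tempfile.TemporaryDirectory() as tmp:
    on_disk_path = os.path.join(tmp, "pairs.db")
    on_disk = count_pairs(on_disk_path)
    file_was_created = os.path.exists(on_disk_path)
```

Both calls return the same count; only where the data is stored differs, which is why the in-memory option may be faster when enough RAM is available.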
diff --git a/docs/Examples.rst b/docs/Examples.rst
@@ -0,0 +1,19 @@
+========
+Examples
+========
+
+Dedupe is a library and not a stand-alone command line tool. To
+demonstrate its usage, we have come up with a few example recipes for
+different sized datasets for you to try out.
+
+You can view and download the source code for these examples in the
+`examples repo <https://github.com/dedupeio/dedupe-examples>`__.
+
+Or, you can view annotated, "walkthrough" versions online:
+
+* `Small data deduplication <http://dedupeio.github.io/dedupe-examples/docs/csv_example.html>`__
+* `Record Linkage <https://dedupeio.github.io/dedupe-examples/docs/record_linkage_example.html>`__
+* `Gazetteer example <https://dedupeio.github.io/dedupe-examples/docs/gazetteer_example.html>`__
+* `MySQL example <https://dedupeio.github.io/dedupe-examples/docs/mysql_example.html>`__
+* `Postgres big dedupe example <https://dedupeio.github.io/dedupe-examples/docs/pgsql_big_dedupe_example.html>`__
+* `Patent Author Disambiguation <https://dedupeio.github.io/dedupe-examples/docs/patent_example.html>`__
diff --git a/docs/Troubleshooting.rst b/docs/Troubleshooting.rst
@@ -0,0 +1,84 @@
+***************
+Troubleshooting
+***************
+
+So you've tried to apply dedupe to your dataset, but you're having some problems.
+Once you understand :ref:`how dedupe works <how-it-works-label>`, and you've taken
+a look at some of the :doc:`examples <Examples>`, then this troubleshooting
+guide is your next step.
+
+Memory Considerations
+=====================
+
+The two most likely memory bottlenecks, in order of likelihood, are:
+
+1. Building the index predicates for blocking. If this is a problem,
+   you can try turning off index blocking rules (and just use predicate
+   blocking rules) by setting ``index_predicates=False`` in
+   :meth:`dedupe.Dedupe.train`.
+
+2. During ``cluster()``. After scoring, dedupe has to compare all the pairwise scores
+   and build the clusters. It runs a connected-components algorithm to
+   determine where to begin the clustering. This is currently done in
+   memory using Python dicts, so it can take substantial memory.
+   There is currently no way to avoid this other than using fewer records.
+
+Time Considerations
+===================
+
+The slowest part of dedupe is probably blocking. A big part of this is building
+the index predicates, so the easiest fix is to set ``index_predicates=False``
+in :meth:`dedupe.Dedupe.train`.
+
+Blocking can also be slow if dedupe has to apply too many blocking rules, or
+rules that are too complex. You can fix this by reducing the number of blocking
+rules dedupe has to learn to cover all the true positives. Either reduce the
+``recall`` parameter in :meth:`dedupe.Dedupe.train`, or, similarly, use fewer
+positive examples during training.
+
+Note that you are making a choice here between speed and recall. The less blocking
+you do, the faster you go, but the more likely you are to not block true positives
+together.
+
+This part of dedupe is still single-threaded, and could probably benefit
+from parallelization or other code strategies,
+although attempts so far have not proved promising.
+
+
+Improving Accuracy
+==================
+
+- Inspect your results and see if you can find any patterns: Does dedupe
+  not seem to be paying enough attention to some detail?
+
+- Inspect the pairs given to you during :func:`dedupe.console_label`. These
+  are pairs that dedupe is most confused about. Are these actually confusing
+  pairs? If so, then great, dedupe is doing about as well as you could expect.
+  If the pair is obviously a duplicate or obviously not a duplicate, then this
+  means there is some signal that you should help dedupe to find.
+
+- Read up on the theory behind each of the variable types. Some of them
+  are going to work better depending on the situation, so try to understand
+  them as well as you can.
+
+- Add other variables. For instance, try treating a field as both a ``String``
+  and as a ``Text`` variable. If this doesn't cut it, add your own custom
+  variable that emphasizes the feature that you're really looking for.
+  For instance, if you have a list of last names, you might want "Smith"
+  to score well with "Smith-Johnson" (someone got married?). None of the
+  built-in variables will handle this well, so write your own comparator.
+
+- Add ``Interaction`` variables. For instance, if both the "last name" and
+  "street address" fields score very well, then this is almost a guarantee
+  that these two records refer to the same person. An ``Interaction`` variable
+  can emphasize this to the learner.
+
+Extending Dedupe
+================
+
+If the built-in variables don't cut it, you can write your own variables.
+
+Take a look at the separately maintained `optional variables
+<https://github.com/search?q=org%3Adedupeio+dedupe-variable>`__
+for examples of how to write your own custom variable types with
+your custom comparators and predicates.
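
As a rough illustration of the custom-comparator idea in the Troubleshooting text above ("Smith" versus "Smith-Johnson"), here is a minimal sketch. The function name ``hyphenated_name_comparator`` and its scoring rule are hypothetical. The variable definition assumes the dict-style format with ``Custom`` and ``Interaction`` types and a comparator that returns a distance (0 means identical); check this against the variable definition docs for your dedupe version before relying on it.

```python
def hyphenated_name_comparator(name_1, name_2):
    """Return a distance in [0, 1] between two last names.

    0.0 means every hyphen-separated part of the shorter name appears
    in the other name; 1.0 means the names share no part at all.
    Hypothetical scoring rule, for illustration only.
    """
    parts_1 = {p for p in name_1.lower().split("-") if p}
    parts_2 = {p for p in name_2.lower().split("-") if p}
    if not parts_1 or not parts_2:
        return 1.0
    overlap = len(parts_1 & parts_2)
    # Share of the shorter name's parts that also occur in the other name,
    # converted from a similarity into a distance.
    return 1.0 - overlap / min(len(parts_1), len(parts_2))


# Assumed dict-style variable definition. The field names come from the
# examples in the text; the "variable name" keys let the Interaction
# variable refer to the other two variables.
variables = [
    {
        "field": "last name",
        "variable name": "last name",
        "type": "Custom",
        "comparator": hyphenated_name_comparator,
    },
    {
        "field": "street address",
        "variable name": "street address",
        "type": "String",
    },
    {
        "type": "Interaction",
        "interaction variables": ["last name", "street address"],
    },
]
```

With this rule, "Smith" and "Smith-Johnson" get a distance of 0.0, while "Smith" and "Jones" get 1.0. A real comparator would likely also want to handle typos within each part, which this sketch ignores.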
diff --git a/docs/Choosing-a-good-threshold.rst → ...ow-it-works/Choosing-a-good-threshold.rst b/docs/Choosing-a-good-threshold.rst → ...ow-it-works/Choosing-a-good-threshold.rst
diff --git a/docs/Grouping-duplicates.rst → docs/how-it-works/Grouping-duplicates.rst b/docs/Grouping-duplicates.rst → docs/how-it-works/Grouping-duplicates.rst
diff --git a/docs/How-it-works.rst → docs/how-it-works/How-it-works.rst b/docs/How-it-works.rst → docs/how-it-works/How-it-works.rst
@@ -1,3 +1,5 @@
+.. _how-it-works-label:
+
 ############
 How it works
 ############

diff --git a/docs/Making-smart-comparisons.rst → ...how-it-works/Making-smart-comparisons.rst b/docs/Making-smart-comparisons.rst → ...how-it-works/Making-smart-comparisons.rst
diff --git a/docs/Matching-records.rst → docs/how-it-works/Matching-records.rst b/docs/Matching-records.rst → docs/how-it-works/Matching-records.rst
@@ -130,6 +130,6 @@ Other field distances
 ~~~~~~~~~~~~~~~~~~~~~
 
 We have implemented a number of field distance measures. See :doc:`the
-details about variables <Variable-definition>`.
+details about variables </Variable-definition>`.
 
 
diff --git a/docs/Special-Cases.rst → docs/how-it-works/Special-Cases.rst b/docs/Special-Cases.rst → docs/how-it-works/Special-Cases.rst
diff --git a/docs/index.rst b/docs/index.rst
@@ -46,7 +46,9 @@ Contents
 
    API-documentation
    Variable-definition
-   How-it-works
+   Examples
+   how-it-works/How-it-works
+   Troubleshooting
    Bibliography
 
 
@@ -66,23 +68,6 @@ Installation
 
    pip install dedupe
 
-Using dedupe
-============
-
-Dedupe is a library and not a stand-alone command line tool. To
-demonstrate its usage, we have come up with a `few example recipes for
-different sized datasets for you
-<https://github.com/dedupeio/dedupe-examples/archive/0.5.zip>`__
-(`repo <https://github.com/dedupeio/dedupe-examples>`__, as well as
-annotated source code:
-
-* `Small data deduplication <http://dedupeio.github.io/dedupe-examples/docs/csv_example.html>`__
-* `Record Linkage <https://dedupeio.github.io/dedupe-examples/docs/record_linkage_example.html>`__
-* `Gazetter example <https://dedupeio.github.io/dedupe-examples/docs/gazetteer_example.html>`__
-* `MySQL example <https://dedupeio.github.io/dedupe-examples/docs/mysql_example.html>`__
-* `Postgres big dedupe example <https://dedupeio.github.io/dedupe-examples/docs/pgsql_big_dedupe_example.html>`__
-* `Patent Author Disambiguation <https://dedupeio.github.io/dedupe-examples/docs/patent_example.html>`__
-
 Errors / Bugs
 =============
 
@@ -111,6 +96,4 @@ Indices and tables
 ==================
 
 * :ref:`genindex`
-* :ref:`modindex`
-* :ref:`search`