refer to matchBlocks for in thresholdBlocks
fgregg committed Feb 26, 2015
1 parent 18305d8 commit 9a7eed1
Showing 1 changed file with 7 additions and 28 deletions.
35 changes: 7 additions & 28 deletions docs/common_methods.rst
@@ -5,39 +5,18 @@

For larger datasets, you will need to use the ``thresholdBlocks``
and ``matchBlocks`` methods. These methods require you to create blocks of
records. For Dedupe, each block should be a dictionary of
records. Each block consists of all the records that share a
particular predicate, as output by the blocker method of Dedupe.

Within a block, the dictionary should consist of records from the data,
with the keys being record ids and the values being the record.

.. code:: python
> from collections import defaultdict
>
> data_d = {'A1' : {'name' : 'howard'},
>           'B1' : {'name' : 'howie'}}
...
> blocks = defaultdict(dict)
>
> # group records by the block key emitted by the blocker
> for block_key, record_id in linker.blocker(data_d.items()) :
>     blocks[block_key].update({record_id : data_d[record_id]})
>
> blocked_data = list(blocks.values())
> print(blocked_data)
[{'A1' : {'name' : 'howard'},
'B1' : {'name' : 'howie'}}]
records. See the documentation for the ``matchBlocks`` method
for how to construct blocks.
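
The block construction described above can be tried without a trained
Dedupe object. The following is a minimal sketch, not Dedupe's own code:
``toy_blocker`` is a hypothetical stand-in for the blocker method, and the
"first three letters of the name" predicate is an assumption chosen only
for illustration. Dropping single-record blocks is likewise a choice of
this sketch, made because such blocks contain no pairs to compare.

```python
from collections import defaultdict

# Toy records keyed by record id, in the same shape as the example data.
data_d = {
    'A1': {'name': 'howard'},
    'B1': {'name': 'howie'},
    'C1': {'name': 'susan'},
}


def toy_blocker(records):
    """Hypothetical stand-in for the blocker method.

    Yields (block_key, record_id) pairs, using the first three letters
    of the name as an illustrative blocking predicate.
    """
    for record_id, record in records:
        yield record['name'][:3], record_id


# Group records into one dictionary per block key.
blocks = defaultdict(dict)
for block_key, record_id in toy_blocker(data_d.items()):
    blocks[block_key][record_id] = data_d[record_id]

# Keep only blocks that hold at least two records (a choice of this sketch).
blocked_data = [block for block in blocks.values() if len(block) > 1]
```

Here ``blocked_data`` has the same shape as in the example: a list of
dictionaries that map record ids to records.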
.. code:: python
threshold = deduper.thresholdBlocks(blocked_data, recall_weight=2)
Keyword arguments

``blocks`` Sequence of tuples of records, where each tuple is a set of
records covered by a blocking predicate.
:param list blocks: See ``matchBlocks``

``recall_weight`` Sets the tradeoff between precision and recall. I.e.
if you care twice as much about recall as you do precision, set
recall\_weight to 2.
:param float recall_weight: Sets the tradeoff between precision and
recall. I.e. if you care twice as much
about recall as you do precision, set
recall\_weight to 2.
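
To make the ``recall_weight`` tradeoff concrete, one common way to weight
recall against precision is the F-beta score with beta set to the recall
weight. The sketch below uses that formulation as an assumption; it is not
necessarily how ``thresholdBlocks`` computes its result. The candidate
thresholds and their precision and recall values are made up for
illustration, and ``weighted_score`` and ``best_threshold`` are
hypothetical helper names.

```python
def weighted_score(precision, recall, recall_weight):
    """F-beta score with beta = recall_weight (assumed formulation)."""
    beta_sq = recall_weight ** 2
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator


# Made-up (threshold, precision, recall) candidates: lowering the
# threshold trades precision for recall.
candidates = [
    (0.9, 0.95, 0.50),
    (0.5, 0.80, 0.80),
    (0.2, 0.60, 0.95),
]


def best_threshold(candidates, recall_weight):
    """Return the threshold whose weighted score is highest."""
    best = max(
        candidates,
        key=lambda c: weighted_score(c[1], c[2], recall_weight),
    )
    return best[0]


balanced = best_threshold(candidates, recall_weight=1)
recall_heavy = best_threshold(candidates, recall_weight=2)
```

With these made-up numbers, weighting recall twice as heavily moves the
chosen threshold from 0.5 down to 0.2, accepting lower precision in
exchange for higher recall.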
