Add and improve examples in variable definitions
As a newcomer I didn't really understand the differences
between all the different variable types,
and how to choose which to use. Now that I did the research
myself I thought I'd try to set others on the right track from
the beginning.
NickCrews committed Jan 25, 2022
1 parent 92b776a commit 0c46395
Showing 1 changed file with 16 additions and 5 deletions.
21 changes: 16 additions & 5 deletions docs/Variable-definition.rst
@@ -15,7 +15,7 @@ field specification. For example:-
[
{'field': 'Site name', 'type': 'String'},
{'field': 'Address', 'type': 'String'},
-{'field': 'Zip', 'type': 'String', 'has missing': True},
+{'field': 'Zip', 'type': 'ShortString', 'has missing': True},
{'field': 'Phone', 'type': 'String', 'has missing': True}
]
@@ -27,8 +27,10 @@ A ``String`` type field must declare the name of the record field to compare
a ``String`` type declaration. The ``String`` type expects fields to be of
class string.

-``String`` types are compared using `affine gap string
-distance <http://en.wikipedia.org/wiki/Gap_penalty#Affine>`__.
+``String`` types are compared using string edit distance, specifically
+`affine gap string distance <http://en.wikipedia.org/wiki/Gap_penalty#Affine>`__.
+This is a good metric for measuring fields that might have typos in them,
+such as "John" vs "Jon".
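
As a rough illustration of why edit distance suits typo-prone fields, the sketch below uses plain Levenshtein distance as a simpler stand-in for affine gap distance. It is not dedupe's implementation, and the ``levenshtein`` helper is a hypothetical name chosen for this example. A one-letter typo should yield a small distance, while an unrelated name yields a larger one.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b`.

    Assumption: this is plain Levenshtein distance, used here only
    as a stand-in for the affine gap distance that dedupe uses.
    """
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


typo_distance = levenshtein("John", "Jon")
unrelated_distance = levenshtein("John", "Mary")
print(typo_distance, unrelated_distance)
```

Affine gap distance differs in that extending a run of consecutive insertions or deletions costs less than opening a new one, so the values dedupe computes will not match these numbers; only the general ordering (typos score as close, unrelated strings as far) is the point here.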

For example:-

@@ -57,7 +59,7 @@ For example:-
Text Types
^^^^^^^^^^

-If you want to compare fields containing long blocks of text e.g. product
+If you want to compare fields containing blocks of text e.g. product
descriptions or article abstracts, you should use this type. ``Text`` type
fields are compared using the `cosine similarity metric
<http://en.wikipedia.org/wiki/Vector_space_model>`__.
@@ -66,6 +68,14 @@ This is a measurement of the amount of words that two documents have in
common. This measure can be made more useful as the overlap of rare words
counts more than the overlap of common words.

+Compare this to ``String`` and ``ShortString`` types: For strings containing
+occupations, "yoga teacher" might be fairly similar to "yoga instructor" when
+using the ``Text`` measurement, because they both contain the relatively
+rare word of "yoga". However, if you compared these two strings using the
+``String`` or ``ShortString`` measurements, they might be considered fairly
+dis-similar, because the actual string edit distance between them is large.
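
The contrast can be sketched with a toy calculation. The snippet below is a minimal illustration under stated assumptions, not dedupe's code: the corpus is invented, the helper names ``idf_weights``, ``tfidf_cosine`` and ``levenshtein`` are hypothetical, the weighting is one common smoothed inverse-document-frequency variant, and plain Levenshtein distance stands in for affine gap distance.

```python
import math
from collections import Counter


def idf_weights(corpus):
    """Weight each word by how rare it is across the corpus.

    Assumption: log(N / document_frequency) + 1, one common IDF variant.
    """
    doc_freq = Counter(word for doc in corpus for word in set(doc.split()))
    return {w: math.log(len(corpus) / df) + 1.0 for w, df in doc_freq.items()}


def tfidf_cosine(a, b, weights):
    """Cosine similarity between the weighted word-count vectors of a and b."""
    vec_a = {w: n * weights.get(w, 1.0) for w, n in Counter(a.split()).items()}
    vec_b = {w: n * weights.get(w, 1.0) for w, n in Counter(b.split()).items()}
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def levenshtein(a, b):
    """Plain character-level edit distance (stand-in for affine gap)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


# Invented corpus: "yoga" is rarer than "teacher" or "instructor".
corpus = [
    "yoga teacher", "yoga instructor", "math teacher", "piano teacher",
    "history teacher", "ski instructor", "driving instructor",
    "fitness instructor",
]
weights = idf_weights(corpus)

rare_overlap = tfidf_cosine("yoga teacher", "yoga instructor", weights)
common_overlap = tfidf_cosine("math teacher", "piano teacher", weights)
edit_distance = levenshtein("yoga teacher", "yoga instructor")
print(rare_overlap, common_overlap, edit_distance)
```

In this toy setup the pair sharing the rare word "yoga" scores noticeably higher on cosine similarity than the pair sharing only the common word "teacher", even though turning "teacher" into "instructor" takes many character edits.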


If provided a sequence of example fields (i.e. a corpus) then dedupe will
learn these weights for you. For example:-

@@ -213,7 +223,8 @@ Categorical
different types of things. For example, you may have data on businesses and
you find that taxi cab businesses tend to have very similar names but law
firms don't. ``Categorical`` variables would let you indicate whether two records
-are both taxi companies, both law firms, or one of each.
+are both taxi companies, both law firms, or one of each. This is also a good choice
+for fields that are booleans, e.g. "True" or "False".

Dedupe would represent these three possibilities using two dummy variables:
