# DedupliPy

## Advanced deduplication

Load your data. In this example we take a sample dataset that comes with DedupliPy:

In [None]:
from deduplipy.datasets import load_data

In [None]:
df = load_data(kind='childcare', return_pairs=False)

This dataset has two columns, `name` and `address`:

In [None]:
df.head(2)

Create a `Deduplicator` instance and provide advanced settings:

- The similarity metrics per field are entered in a dict. A similarity metric can be any function that takes two strings and returns a number.

In [None]:
from deduplipy.deduplicator import Deduplicator
from fuzzywuzzy.fuzz import ratio, partial_ratio, token_set_ratio, token_sort_ratio

In [None]:
field_info = {'name':[ratio, partial_ratio], 'address':[token_set_ratio, token_sort_ratio]}
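Because a similarity metric only needs to take two strings and return a number, you can also write your own. The function below is a minimal sketch of such a metric; `token_jaccard` is a hypothetical name, not part of DedupliPy, and scaling the result to 0-100 is an assumption made to match the range of the `fuzzywuzzy` functions.

```python
def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the lowercase word sets, scaled to 0-100."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 100.0  # two empty strings are treated as identical
    return 100.0 * len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Two of the five distinct words are shared between the strings.
print(token_jaccard("Little Stars Daycare", "little stars day care"))
```

Such a function could then be listed in `field_info` next to the `fuzzywuzzy` metrics, for example `{'name': [ratio, token_jaccard]}`.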

- We define our own set of blocking rules. Each rule is a function applied to a field value; here we use the first two characters:

In [None]:
def first_two_characters(x):
    return x[:2]
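To see why a blocking rule matters, the sketch below groups a few made-up names by the output of `first_two_characters` and only pairs up names that land in the same block. This is a conceptual illustration of blocking in general, not DedupliPy's internal implementation; `candidate_pairs` and the sample names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def first_two_characters(x):
    return x[:2]


def candidate_pairs(values, rule):
    """Group values by blocking key and pair up values within each block."""
    blocks = defaultdict(list)
    for value in values:
        blocks[rule(value)].append(value)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


names = ["ABC Daycare", "ABC Day Care", "Bright Kids", "Brite Kids", "Sunny Tots"]
pairs = candidate_pairs(names, first_two_characters)
print(pairs)
```

With five names there are ten possible pairs, but only the pairs that share their first two characters remain as candidates for comparison.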

- `interaction=True` makes the classifier include interaction features, e.g. `ratio('name') * token_set_ratio('address')`. When interaction features are included, the logistic regression classifier applies L1 regularisation to prevent overfitting.
- We set `verbose=1` to get information on the progress and a distribution of scores.
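An interaction feature is simply the product of two similarity scores. The sketch below illustrates the idea for a single candidate pair; `add_interactions`, the feature names, and the 0-1 scores are hypothetical, and DedupliPy may construct its features differently.

```python
from itertools import combinations


def add_interactions(features):
    """Return the original features plus the product of every pair of features."""
    extended = dict(features)
    for (name_a, value_a), (name_b, value_b) in combinations(features.items(), 2):
        extended[f"{name_a} * {name_b}"] = value_a * value_b
    return extended


# Hypothetical similarity scores for one candidate pair, rescaled to 0-1.
scores = {"ratio(name)": 0.5, "token_set_ratio(address)": 0.5}
extended = add_interactions(scores)
print(extended)
```

The number of features grows quadratically with the number of metrics, which is why regularisation becomes useful once interactions are switched on.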

In [None]:
myDedupliPy = Deduplicator(field_info=field_info, interaction=True, rules=[first_two_characters], verbose=1)

Fit the `Deduplicator` by active learning: for each pair shown, enter whether it is a match (y) or not (n). Once training has converged, you will be notified and can finish training by entering 'f'.

In [None]:
myDedupliPy.fit(df)

Based on the histogram of scores, we decide to ignore all pairs with a similarity probability lower than 0.1 when predicting, which we do by passing `score_threshold=0.1` to `predict`.

Apply the trained `Deduplicator` to (new) data. The column `deduplication_id` is the identifier for a cluster: rows with the same `deduplication_id` are considered to be the same real-world entity.

In [None]:
res = myDedupliPy.predict(df, score_threshold=0.1)
res.sort_values('deduplication_id').head(10)
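To build intuition for how a score threshold and cluster identifiers relate, the sketch below links rows whose pair score reaches the threshold and gives every connected group one shared id. It is a simplified illustration using union-find, not DedupliPy's actual clustering; `cluster_ids`, the scored pairs, and the inclusive threshold comparison are all assumptions.

```python
def cluster_ids(n_rows, scored_pairs, score_threshold):
    """Assign a cluster id per row by linking pairs scoring at or above the threshold."""
    parent = list(range(n_rows))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps lookups short
            i = parent[i]
        return i

    for i, j, score in scored_pairs:
        if score >= score_threshold:
            parent[find(i)] = find(j)

    return [find(i) for i in range(n_rows)]


# Hypothetical (row, row, match probability) triples.
scored = [(0, 1, 0.95), (1, 2, 0.80), (3, 4, 0.05)]
ids = cluster_ids(5, scored, score_threshold=0.1)
print(ids)
```

Rows 0, 1 and 2 end up with the same id because they are linked through row 1, while the pair (3, 4) falls below the threshold and each of those rows keeps its own id.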