# Tutorial

## Simple deduplication

In [1]:
import pandas as pd
from deduplipy.datasets import load_data

Load data

In [2]:
df_train = load_data(kind='childcare', return_pairs=False)

This dataset has two columns, `name` and `address`:

In [11]:
df_train.head(2)

Unnamed: 0,name,address,row_number
0,Chicago Commons Association St Catherine's - S...,27 Washington Oak Park IL 60302,0
1,Precious Little One's Learning Center Inc,221 E 51st St,1


Import the `Deduplicator` class

In [3]:
from deduplipy.deduplicator import Deduplicator

Instantiate the `Deduplicator` class with the column names

In [6]:
myDedupliPy = Deduplicator(['name', 'address'])

Perform the fitting using active learning

In [None]:
myDedupliPy.fit(df_train)

Predict on new data

In [9]:
res = myDedupliPy.predict(df_train)
res.sort_values('deduplication_id').head(10)

Unnamed: 0,name,address,deduplication_id
1248,YMCA South Chicago,3039 E 91st St,1
1244,YMCA of Metropolitan Chicago - South Chicago YMCA,3039 E 91st St,1
1245,YMCA South Chicago,3039 E 91st Street,1
1247,YMCA of Metropolitan Chicago South Chicago,3039 E 91st St,1
1249,YMCA South Chicago School,3039 E 91st Street,1
1201,Dorsey Developmental Institute III,2938 E 91st Street,2
1202,Dorsey Developmental Institute III,2938 E 91st St,2
505,Woodlawn Organization - Early Childhood Develo...,950 E 61st Street,3
504,Woodlawn EC Development Ctr I/T,950 E 61st St,3
503,Woodlawn EC Development Ctr,950 E 61st St,3


After training, the `Deduplicator` instance can be saved as a pickle file and later applied to new data:

In [9]:
import pickle

In [10]:
with open('myDeduplipy.pkl', 'wb') as f:
    pickle.dump(myDedupliPy, f)

In [11]:
del myDedupliPy

In [14]:
with open('myDeduplipy.pkl', 'rb') as f:
    loaded_obj = pickle.load(f)

In [17]:
res = loaded_obj.predict(df_train)
res.sort_values('deduplication_id').head(10)

Unnamed: 0,name,address,deduplication_id
1161,Hammond,2819 W 21st Place,1
1162,Hammond,2819 W 21st Pl,1
561,Holden,1104 W 31st Street,2
560,Holden,1104 W 31St St,2
394,Aldridge,630 E 131St St,3
395,CHICAGO PUBLIC SCHOOLS ALDRIDGE IRA F,630 E 131ST ST,3
1202,Dorsey Developmental Institute III,2938 E 91st St,4
1201,Dorsey Developmental Institute III,2938 E 91st Street,4
1249,YMCA South Chicago School,3039 E 91st Street,5
1248,YMCA South Chicago,3039 E 91st St,5


## Advanced deduplication

Load your data. In this example we take a sample dataset that comes with DedupliPy:

In [12]:
from deduplipy.datasets import load_data

In [13]:
df = load_data(kind='childcare', return_pairs=False)

Create a `Deduplicator` instance and provide advanced settings

- The similarity metrics for each field are given in a dict. A similarity metric can be any function that takes two strings and returns a number.

In [19]:
from deduplipy.deduplicator import Deduplicator
from fuzzywuzzy.fuzz import ratio, partial_ratio, token_set_ratio, token_sort_ratio

In [20]:
field_info = {'name': [ratio, partial_ratio], 'address': [token_set_ratio, token_sort_ratio]}
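
Because a similarity metric is just a function of two strings that returns a number, a hand-written metric could in principle sit alongside the fuzzywuzzy ones. The sketch below is illustrative only: `token_jaccard` is a hypothetical helper, not part of DedupliPy, and the 0 to 100 scale is chosen on the assumption that scores should be comparable to those of fuzzywuzzy.

```python
def token_jaccard(a: str, b: str) -> float:
    """Hypothetical metric: Jaccard overlap of lower-cased word sets, scaled to 0-100."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 100.0  # two empty strings are treated as identical
    return 100.0 * len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Differences in case alone do not lower the score.
score_same = token_jaccard("3039 E 91st St", "3039 e 91st st")
# "St" versus "Street" leaves 3 shared tokens out of 5 distinct ones.
score_partial = token_jaccard("3039 E 91st St", "3039 E 91st Street")
```

Under that assumption it could be listed in the dict like any other metric, for example `'address': [token_set_ratio, token_jaccard]`.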

- We define our own set of blocking rules.

In [21]:
def first_two_characters(x):
    return x[:2]
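
To make the role of a blocking rule concrete: a rule maps each value to a key, and only rows that share a key are paired for comparison. The sketch below illustrates that idea with the standard library alone. It is not DedupliPy's internal implementation, and `candidate_pairs` is a hypothetical helper.

```python
from collections import defaultdict
from itertools import combinations


def first_two_characters(x):
    return x[:2]


def candidate_pairs(values, rule):
    """Hypothetical helper: pair up only the values that share a blocking key."""
    blocks = defaultdict(list)
    for value in values:
        blocks[rule(value)].append(value)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


names = ["YMCA South Chicago", "YMCA Bowen School", "Holden", "Hammond"]
pairs = candidate_pairs(names, first_two_characters)
# Only the two "YM" names share a key, so one pair is produced instead of all six.
```

The trade-off is that a rule which is too strict can separate true duplicates into different blocks, so they are never compared.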

- `interaction=True` makes the classifier include interaction features, e.g. `ratio('name') * token_set_ratio('address')`. When interaction features are included, the logistic regression classifier applies L1 regularisation to prevent overfitting.
- We set `verbose=1` to get progress information and a distribution of the scores.
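
An interaction feature is simply the product of two base similarity scores for the same pair. The following is a minimal sketch with made-up scores on a 0 to 100 scale; `add_interactions` is a hypothetical helper, and how DedupliPy builds and names these features internally is not shown here.

```python
from itertools import combinations


def add_interactions(scores):
    """Hypothetical helper: extend a dict of base scores with all pairwise products."""
    features = dict(scores)
    for (name_a, value_a), (name_b, value_b) in combinations(scores.items(), 2):
        features[f"{name_a} * {name_b}"] = value_a * value_b
    return features


# Made-up scores for one candidate pair.
base = {"ratio(name)": 80, "token_set_ratio(address)": 95}
features = add_interactions(base)
```

The number of pairwise products grows quadratically with the number of base scores, which is one reason regularisation becomes useful once interactions are enabled.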

In [22]:
myDedupliPy = Deduplicator(field_info=field_info, interaction=True, rules=[first_two_characters], verbose=1)

Fit the `Deduplicator` by active learning: for each pair shown, enter whether it is a match (y) or not (n). Once training has converged, you are notified and can finish training by entering 'f'.

In [None]:
myDedupliPy.fit(df)
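
A common strategy in active learning is uncertainty sampling: the user is asked about the pair the current model is least sure of. Whether DedupliPy selects pairs in exactly this way is an assumption; the sketch below only illustrates the general idea, with made-up probabilities and a hypothetical helper `most_uncertain`.

```python
def most_uncertain(probabilities):
    """Return the index of the probability closest to 0.5."""
    return min(range(len(probabilities)), key=lambda i: abs(probabilities[i] - 0.5))


# Made-up match probabilities for four unlabelled candidate pairs.
probs = [0.97, 0.55, 0.10, 0.80]
query_index = most_uncertain(probs)  # index of the pair to show to the user next
```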

Based on the histogram of scores, we decide to ignore all pairs with a similarity probability lower than 0.1 when predicting.

Apply the trained `Deduplicator` to (new) data. The column `deduplication_id` identifies a cluster: rows with the same `deduplication_id` are considered to be the same real-world entity.

In [24]:
res = myDedupliPy.predict(df, score_threshold=0.1)
res.sort_values('deduplication_id').head(10)

blocking started
blocking finished
Nr of pairs: 41311
scoring started
scoring finished
Nr of filtered pairs: 3044
Clustering started
Clustering finished


Unnamed: 0,name,address,deduplication_id
1125,Carole Robertson Center For Learning - Aurora ...,2701 W 18th St,1
1126,CAROLE ROBERTSON CENTER FOR LEARNING FCCH-AURO...,2701 W 18TH ST,1
1136,YMCA Bowen School,2710 E 89th Street,2
1135,YMCA of Metropolitan Chicago - YMCA - Bowen Hi...,2710 E 89th St,2
49,Easter Seals Society of Metropolitan Chicago -...,2718 W 59th St,3
50,Keeper's Institute Infant/Child Care,2718 W 59th St,3
1130,GADS HILL CENTER HUNT'S EARLY CHILDHOOD EDUCAT...,2701 W 79TH ST,4
1121,YMCA of Metropolitan Chicago - Rauner,2700 S Western Ave,8
1122,YMCA Rauner Family,2700 S Western Ave,8
1123,YMCA OF METROPOLITAN CHICAGO RAUNER,2700 S WESTERN AVE,8
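
The verbose output above shows candidate pairs being filtered by score before clustering. The sketch below illustrates that two-step idea on made-up scored pairs, using a simple connected-components grouping (union-find). This is a stand-in chosen for brevity: DedupliPy's actual clustering step may work differently, and `cluster_pairs` is a hypothetical helper.

```python
def cluster_pairs(scored_pairs, score_threshold):
    """Hypothetical helper: drop low-scoring pairs, then group linked rows."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for row_a, row_b, score in scored_pairs:
        if score >= score_threshold:
            parent[find(row_a)] = find(row_b)

    # Assign consecutive cluster ids, starting at 1, in order of row index.
    cluster_ids = {}
    result = {}
    for row in sorted(parent):
        root = find(row)
        result[row] = cluster_ids.setdefault(root, len(cluster_ids) + 1)
    return result


# Made-up (row, row, probability) triples; the last pair falls below the threshold.
scored = [(1121, 1122, 0.93), (1122, 1123, 0.88), (1135, 1136, 0.71), (1121, 1136, 0.04)]
ids = cluster_pairs(scored, score_threshold=0.1)
```

In this toy example rows 1121, 1122 and 1123 end up in one cluster and rows 1135 and 1136 in another, because the only link between the two groups scores below the threshold and is discarded.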
