# DedupliPy

## Simple deduplication

Load your data. In this example we take a sample dataset that comes with DedupliPy:

In [1]:
from deduplipy.datasets import load_data

In [2]:
df = load_data(kind='childcare', return_pairs=False)

This dataset has two columns, `name` and `address`:

In [3]:
df.head(2)

| | name | address |
|---|---|---|
| 0 | Chicago Commons Association St Catherine's - S... | 27 Washington Oak Park IL 60302 |
| 1 | Precious Little One's Learning Center Inc | 221 E 51st St |


Create a `Deduplicator` instance and provide the column names to be used for deduplication:

In [4]:
from deduplipy.deduplicator import Deduplicator

In [5]:
myDedupliPy = Deduplicator(['name', 'address'])

Fit the `Deduplicator` by active learning: for each pair shown, enter whether it is a match (y) or not (n). Once training has converged, you are notified and can finish training by entering 'f'.

In [6]:
myDedupliPy.fit(df)


Nr. 1 (0+/0-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1       Precious Little One's Learning Center Inc
address_1    221 E 51st St                            
-> name_2       Precious Little One's Learning Center Inc
address_2    221 E 51st St                            


 y



Nr. 2 (1+/0-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1       Precious Little One's Learning Center Inc
address_1    221 E 51st St                            
-> name_2       Bethel New Life - Lake and Pulaski
address_2    316 N Pulaski Rd                  


 n



Nr. 3 (1+/1-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1       Catholic Charities - Our Lady of Lourdes
address_1    1449 S Keeler                           
-> name_2       CATHOLIC CHARITIES OF THE ARCHDIOCESE OF CHICAGO OUR LADY OF LOURDES
address_2    1449 S KEELER                                                       


 y



Nr. 4 (2+/1-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1       Onward Neighborhood House - After School
address_1    2158 W Ohio Street                      
-> name_2       Erie Neighborhood House - Charter School - Cortez
address_2    2510 W Cortez St                                 


 n



Nr. 5 (2+/2-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1       Henry Booth House Young Achievers Academy
address_1    520 E 79TH St                            
-> name_2       HENRY BOOTH HOUSE WEE CARE NURSERY
address_2    1845 E 79TH STREET                


 n



Nr. 6 (2+/3-)
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1       Karen Cruz Children's Center - Karen Cruz Children's Center
address_1    1507 W Sunnyside Ave                                       
-> name_2       Greenview Children's Center
address_2    1507 W Sunnyside Avenue    


 f


recall threshold reached, recall = 1.0


<deduplipy.deduplicator.deduplicator.Deduplicator at 0x7f8576428430>
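
The labeling session above follows the usual active-learning pattern: the model proposes a pair, you label it, and the model is updated. The sketch below illustrates one common query strategy, uncertainty sampling, in plain Python. It is a generic illustration and not DedupliPy's internal implementation; the `similarity` score, the `most_uncertain` helper, and the candidate pairs are hypothetical stand-ins chosen for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1].

    Hypothetical stand-in for a trained model's match probability.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def most_uncertain(pairs, score):
    """Return the pair whose score lies closest to 0.5.

    In uncertainty sampling, this is the pair the model is least sure
    about, so a human label for it is expected to be most informative.
    """
    return min(pairs, key=lambda pair: abs(score(*pair) - 0.5))


# Hypothetical candidate pairs, using addresses from the session above.
candidates = [
    ("221 E 51st St", "221 E 51st St"),  # identical, so the score is 1.0
    ("1507 W Sunnyside Ave", "1507 W Sunnyside Avenue"),
    ("2158 W Ohio Street", "2510 W Cortez St"),
    ("221 E 51st St", "316 N Pulaski Rd"),
]

# The next pair to present to the user for labeling.
query = most_uncertain(candidates, similarity)
```

The identical pair is never selected here, because a score of 1.0 is as far from 0.5 as a score can be; asking the user about it would add little information.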

Apply the trained `Deduplicator` to (new) data with `predict`. The `deduplication_id` column identifies a cluster: rows with the same `deduplication_id` are predicted to be the same real-world entity.

In [8]:
res = myDedupliPy.predict(df)
res.sort_values('deduplication_id').head(10)

| | name | address | deduplication_id |
|---|---|---|---|
| 1161 | Hammond | 2819 W 21st Place | 1 |
| 1162 | Hammond | 2819 W 21st Pl | 1 |
| 561 | Holden | 1104 W 31st Street | 2 |
| 560 | Holden | 1104 W 31St St | 2 |
| 394 | Aldridge | 630 E 131St St | 3 |
| 395 | CHICAGO PUBLIC SCHOOLS ALDRIDGE IRA F | 630 E 131ST ST | 3 |
| 1202 | Dorsey Developmental Institute III | 2938 E 91st St | 4 |
| 1201 | Dorsey Developmental Institute III | 2938 E 91st Street | 4 |
| 1249 | YMCA South Chicago School | 3039 E 91st Street | 5 |
| 1248 | YMCA South Chicago | 3039 E 91st St | 5 |
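
Because every row carries a `deduplication_id`, ordinary pandas operations are enough to inspect or collapse the clusters. The sketch below rebuilds a few rows of the output above by hand, so it runs without DedupliPy, and shows two common follow-ups: counting the rows per cluster and keeping one row per cluster. Keeping the first row is an arbitrary choice made for illustration; which record should represent a cluster depends on your data.

```python
import pandas as pd

# A hand-copied subset of the prediction output shown above.
res = pd.DataFrame(
    {
        "name": [
            "Hammond",
            "Hammond",
            "Holden",
            "Holden",
            "Aldridge",
            "CHICAGO PUBLIC SCHOOLS ALDRIDGE IRA F",
        ],
        "address": [
            "2819 W 21st Place",
            "2819 W 21st Pl",
            "1104 W 31st Street",
            "1104 W 31St St",
            "630 E 131St St",
            "630 E 131ST ST",
        ],
        "deduplication_id": [1, 1, 2, 2, 3, 3],
    },
    index=[1161, 1162, 561, 560, 394, 395],
)

# Number of rows assigned to each cluster.
cluster_sizes = res.groupby("deduplication_id").size()

# Keep the first row of each cluster as its representative.
deduplicated = res.drop_duplicates(subset="deduplication_id", keep="first")
```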


After training, the `Deduplicator` instance can be saved as a pickle file, loaded again later, and applied to new data:

In [9]:
import pickle

In [10]:
with open('mypickle.pkl', 'wb') as f:
    pickle.dump(myDedupliPy, f)

In [11]:
del myDedupliPy  # drop the in-memory instance; it is restored from the pickle below

In [14]:
with open('mypickle.pkl', 'rb') as f:
    loaded_obj = pickle.load(f)

In [17]:
res = loaded_obj.predict(df)
res.sort_values('deduplication_id').head(10)

| | name | address | deduplication_id |
|---|---|---|---|
| 1161 | Hammond | 2819 W 21st Place | 1 |
| 1162 | Hammond | 2819 W 21st Pl | 1 |
| 561 | Holden | 1104 W 31st Street | 2 |
| 560 | Holden | 1104 W 31St St | 2 |
| 394 | Aldridge | 630 E 131St St | 3 |
| 395 | CHICAGO PUBLIC SCHOOLS ALDRIDGE IRA F | 630 E 131ST ST | 3 |
| 1202 | Dorsey Developmental Institute III | 2938 E 91st St | 4 |
| 1201 | Dorsey Developmental Institute III | 2938 E 91st Street | 4 |
| 1249 | YMCA South Chicago School | 3039 E 91st Street | 5 |
| 1248 | YMCA South Chicago | 3039 E 91st St | 5 |
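
If you save and load models in more than one place, the pickle round trip can be wrapped in two small helpers. The sketch below is one possible way to do that; `save_model` and `load_model` are hypothetical names, not part of DedupliPy, and a plain dictionary stands in for the trained `Deduplicator` so the example runs on its own. Note that unpickling can execute arbitrary code, so only load pickle files from sources you trust.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Serialize `model` to `path` with pickle (hypothetical helper)."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a pickled object from `path`. Only use on files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for a trained Deduplicator, so the example needs no training.
stand_in = {"col_names": ["name", "address"]}

# Write to a temporary directory to avoid leaving files behind.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "mypickle.pkl"
    save_model(stand_in, model_path)
    restored = load_model(model_path)
```

The restored object is an equal but separate copy of the original. With a real `Deduplicator`, the environment that loads the file needs DedupliPy installed, because pickle stores a reference to the class rather than its code.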
