# DedupliPy

## Advanced deduplication

Load your data. In this example we take a sample dataset that comes with DedupliPy:

In [1]:
from deduplipy.datasets import load_data

In [2]:
df = load_data(kind='voters')

Column names: 'name', 'suburb', 'postcode'


In [3]:
df.head(2)

| | name | suburb | postcode |
|---|---|---|---|
| 0 | khimerc thomas | charlotte | 2826g |
| 1 | lucille richardst | kannapolis | 28o81 |


Create a `Deduplicator` instance and provide advanced settings:

- The similarity metrics per field are entered in a dict. A similarity metric can be any function that takes two strings and outputs a number.

In [4]:
from deduplipy.deduplicator import Deduplicator
from fuzzywuzzy.fuzz import ratio, partial_ratio, token_set_ratio, token_sort_ratio

In [5]:
field_info = {
    'name': [ratio, partial_ratio],
    'suburb': [token_set_ratio, token_sort_ratio],
    'postcode': [ratio],
}
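Because a metric is just a function of two strings that returns a number, a custom one can be used in place of, or next to, the fuzzywuzzy functions. The sketch below is illustrative only: `sequence_ratio` and `custom_field_info` are hypothetical names, and scaling the result to 0-100 is an assumption made to mirror the range of the fuzzywuzzy scores.

```python
from difflib import SequenceMatcher


def sequence_ratio(a: str, b: str) -> float:
    """Hypothetical custom metric: similarity of two strings on a 0-100 scale."""
    # SequenceMatcher.ratio() returns a float in [0, 1]; rescale it so the
    # value is comparable to the 0-100 scores produced by fuzzywuzzy.
    return 100 * SequenceMatcher(None, a, b).ratio()


# A field_info-style dict that uses the custom metric for every column.
custom_field_info = {
    'name': [sequence_ratio],
    'suburb': [sequence_ratio],
    'postcode': [sequence_ratio],
}

print(sequence_ratio('kannapolis', 'kannapolis'))  # identical strings score 100.0
print(sequence_ratio('28o81', '28081') < 100)      # a one-character typo lowers the score
```

Such a dict could then be passed as `field_info` in the same way as the one above.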

- We define our own set of blocking rules. A rule is a function that maps a string to a blocking key.

In [6]:
def first_two_characters(x):
    return x[:2]
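To see why a blocking rule matters, the following sketch groups values by the key a rule returns and only pairs up values that share a key. It is a simplified illustration in plain Python, not DedupliPy's implementation; `block_pairs` and the sample postcodes are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def first_two_characters(x):
    return x[:2]


def block_pairs(values, rule):
    """Return index pairs whose values share the same blocking key."""
    blocks = defaultdict(list)
    for idx, value in enumerate(values):
        blocks[rule(value)].append(idx)
    pairs = []
    for members in blocks.values():
        # Only rows inside the same block become candidate pairs.
        pairs.extend(combinations(members, 2))
    return pairs


postcodes = ['28269', '2826g', '27615', '27613', '28081']
pairs = block_pairs(postcodes, first_two_characters)

print(len(pairs))  # 4 candidate pairs instead of the 10 possible without blocking
```

A stricter rule yields fewer candidate pairs but risks separating true duplicates whose keys differ because of a typo.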

- `interaction=True` makes the classifier include interaction features, e.g. `ratio('name') * token_set_ratio('suburb')`. When interaction features are included, the logistic regression classifier applies L1 regularisation to prevent overfitting.
- We set `verbose=1` to get information on the progress and a distribution of scores.

In [7]:
myDedupliPy = Deduplicator(field_info=field_info, interaction=True, rules=[first_two_characters], verbose=1)
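As a rough illustration of what interaction features are, the sketch below appends the product of every pair of similarity scores to the scores themselves. The exact feature set that DedupliPy builds is not shown in this notebook, so `add_interactions` and the example scores are hypothetical.

```python
from itertools import combinations


def add_interactions(scores):
    """Append the product of every pair of scores to the original scores."""
    products = [a * b for a, b in combinations(scores, 2)]
    return list(scores) + products


# Hypothetical similarity scores for one pair of rows, scaled to 0-1:
# [ratio(name), partial_ratio(name), token_set_ratio(suburb),
#  token_sort_ratio(suburb), ratio(postcode)]
scores = [0.9, 0.95, 1.0, 1.0, 0.8]
features = add_interactions(scores)

print(len(features))  # 5 original scores + 10 pairwise products = 15
```

With this many correlated features, an L1 penalty tends to set many coefficients to exactly zero, which keeps the model small when only a handful of labelled pairs are available.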

Fit the `Deduplicator` by active learning: for each pair shown, enter whether it is a match (y) or not (n). When training has converged, you are notified and can finish training by entering 'f'.

In [8]:
myDedupliPy.fit(df)


Nr. 1 (0+/0-) 
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        reb3cca bauerboand
suburb_1      raleigh           
postcode_1    27615             
-> name_2        reb3cca bauerboand
suburb_2      raleigh           
postcode_2    27615             


 y



Nr. 2 (1+/0-) 
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        reb3cca bauerboand
suburb_1      raleigh           
postcode_1    27615             
-> name_2        maureen desilets
suburb_2      monro           
postcode_2    281l2           


 n



Nr. 3 (1+/1-) 
LR parameters: [0.00195159 0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        reb3cca bauerboand
suburb_1      raleigh           
postcode_1    27615             
-> name_2        louise logan 
suburb_2      ruthernordton
postcode_2    28189        


 n


Largest step in LR coefficients: 0.3237594602428085

Nr. 4 (1+/2-) 
LR parameters: [-0.25974397  0.          0.03306706  0.          0.18900025  0.18900025
  0.32375946  0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        demetrio hinkle
suburb_1      gastonia       
postcode_1    28054          
-> name_2        kimberly biggs
suburb_2      jacksonville  
postcode_2    28546         


 n


Largest step in LR coefficients: 1.5066797250631585

Nr. 5 (1+/3-) 
LR parameters: [-1.7664237   0.          0.          0.          0.          0.
  0.          1.07831028  0.          0.          0.          0.
  0.          0.          0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        reb3cca bauerboand
suburb_1      raleigh           
postcode_1    27615             
-> name_2        nitw mcdowell
suburb_2      raleigh      
postcode_2    2761q        


 n


Largest step in LR coefficients: 0.5727083794219332

Nr. 6 (1+/4-) 
LR parameters: [-2.33913208  0.          0.          0.          0.          0.
  0.          1.1594471   0.          0.          0.          0.
  0.          0.          0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        louise logan 
suburb_1      ruthernordton
postcode_1    28189        
-> name_2        loui5e lofan 
suburb_2      rutherfordton
postcode_2    28139        


 y


Largest step in LR coefficients: 1.8151438885774747

Nr. 7 (2+/4-) 
LR parameters: [-0.52398819  0.          1.54840218  0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        clara robinson
suburb_1      fayetteville  
postcode_1    28306         
-> name_2        marie johnson
suburb_2      kernersville 
postcode_2    27284        


 n


Largest step in LR coefficients: 0.20769047490158243

Nr. 8 (2+/5-) 
LR parameters: [-0.73167866  0.          1.5587708   0.          0.          0.
  0.08879556  0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        margaret williams
suburb_1      roseboro         
postcode_1    28382            
-> name_2        eulonta williams
suburb_2      louisburg       
postcode_2    27049           


 n


Largest step in LR coefficients: 0.5221736277644791

Nr. 9 (2+/6-) 
LR parameters: [-0.87791237  0.          1.03659717  0.          0.08372057  0.08372057
  0.3074637   0.          0.          0.          0.15333043  0.
  0.          0.24027221  0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        kimberly biggs
suburb_1      jacksonville  
postcode_1    28546         
-> name_2        ckimberly tavis
suburb_2      dunn           
postcode_2    28334          


 n


Largest step in LR coefficients: 0.45535888147434955

Nr. 10 (2+/7-) 
LR parameters: [-1.11285169  0.          0.58123829  0.          0.38718085  0.38718085
  0.12467266  0.          0.27120474  0.27120474  0.12544345  0.
  0.          0.          0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        abraam th9mas
suburb_1      raleigh      
postcode_1    27613        
-> name_2        desiree thomas
suburb_2      raleigh       
postcode_2    27610         


 n


Largest step in LR coefficients: 0.536507210457234

Nr. 11 (2+/8-) 
LR parameters: [-1.24894989  0.          1.1177455   0.          0.12194578  0.12194578
  0.20146248  0.          0.28477555  0.28477555  0.12809502  0.
  0.          0.          0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        margaret williams
suburb_1      roseboro         
postcode_1    28382            
-> name_2        janes williams
suburb_2      tarboro       
postcode_2    27896         


 n


Largest step in LR coefficients: 0.7548538204546067

Nr. 12 (2+/9-) 
LR parameters: [-1.25188836  0.          0.76836642  0.          0.20223018  0.20223018
  0.22143266  0.          0.0086061   0.0086061   0.88294884  0.
  0.          0.          0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        james hendricks
suburb_1      battleboro     
postcode_1    27809          
-> name_2        antrea hendricks
suburb_2      rocky m0unt     
postcode_2    27801           


 n


Largest step in LR coefficients: 0.8829488433597157

Nr. 13 (2+/10-) 
LR parameters: [-1.46066163  0.          0.80838505  0.          0.27727059  0.27727059
  0.27293314  0.          0.46944452  0.46944452  0.          0.
  0.          0.          0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        maryann ca5h
suburb_1      waxha       
postcode_1    28173       
-> name_2        ivev oden 
suburb_2      golclsboro
postcode_2    27530     


 n


Largest step in LR coefficients: 0.4802498948241479

Nr. 14 (2+/11-) 
LR parameters: [-1.6561363   0.          0.74768866  0.          0.36141759  0.36141759
  0.04908655  0.          0.3333332   0.3333332   0.          0.
  0.          0.48024989  0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        donalr barney
suburb_1      fayetteville 
postcode_1    z8303        
-> name_2        crqig williams
suburb_2      winston-salem 
postcode_2    27l27         


 n


Largest step in LR coefficients: 0.4802498948241479

Nr. 15 (2+/12-) 
LR parameters: [-2.03355147  0.          0.90017051  0.          0.33023477  0.33023477
  0.02857334  0.          0.53942814  0.53942814  0.08585938  0.
  0.          0.          0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        margaret stewart
suburb_1      hamlet          
postcode_1    28345           
-> name_2        alinia stewart
suburb_2      charlotte     
postcode_2    28273         


 n


Largest step in LR coefficients: 0.13325491061485994

Nr. 16 (2+/13-) 
LR parameters: [-2.16680638  0.          0.81914442  0.          0.40918552  0.40918552
  0.          0.          0.56449506  0.56449506  0.          0.
  0.          0.08866182  0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        maureen desilets
suburb_1      monro           
postcode_1    281l2           
-> name_2        darren aldndge
suburb_2      monro         
postcode_2    28110         


 n


Largest step in LR coefficients: 0.3423618714917259

Nr. 17 (2+/14-) 
LR parameters: [-2.16321172  0.          1.16150629  0.          0.23743639  0.23743639
  0.1184764   0.          0.53437482  0.53437482  0.00999316  0.
  0.          0.18429657  0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        michael burke
suburb_1      kinston      
postcode_1    28504        
-> name_2        kristin woods
suburb_2      durham       
postcode_2    27713        


 n


Largest step in LR coefficients: 0.26143286682568156

Nr. 18 (2+/15-) 
LR parameters: [-2.42464459  0.          1.16381245  0.          0.21300255  0.21300255
  0.13701936  0.          0.5953676   0.5953676   0.1691336   0.
  0.          0.          0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        kimberly biggs
suburb_1      jacksonville  
postcode_1    28546         
-> name_2        kimberly hollander
suburb_2      asheville         
postcode_2    28803             


 n


Largest step in LR coefficients: 0.0994945479368741

Nr. 19 (2+/16-) 
LR parameters: [-2.50640327  0.          1.16889374  0.          0.19954114  0.19954114
  0.2027532   0.          0.56046018  0.56046018  0.26862814  0.
  0.          0.          0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        omar skinner
suburb_1      greenville  
postcode_1    27834       
-> name_2        yoshida stokes
suburb_2      greenville    
postcode_2    27834         


 n


Largest step in LR coefficients: 0.3920115502001511

Nr. 20 (2+/17-) 
LR parameters: [-2.59932654  0.          1.56090529  0.          0.12910711  0.12910711
  0.0343254   0.          0.63421082  0.63421082  0.13070365  0.
  0.          0.          0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        joanne grubb
suburb_1      selma       
postcode_1    27576       
-> name_2        ckimberly tavis
suburb_2      dunn           
postcode_2    28334          


 n


Largest step in LR coefficients: 0.2311214403560089

Nr. 21 (2+/18-) 
LR parameters: [-2.83044798  0.          1.5649872   0.          0.08928387  0.08928387
  0.03972629  0.          0.69156524  0.69156524  0.12956098  0.
  0.          0.          0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        carolyn howell
suburb_1      wilson        
postcode_1    27896         
-> name_2        carol howe
suburb_2      horse shoe
postcode_2    28742     


 n


Largest step in LR coefficients: 0.1877081194864707

Nr. 22 (2+/19-) 
LR parameters: [-2.75778425  0.          1.37727908  0.          0.20176265  0.20176265
  0.10425588  0.          0.70251564  0.70251564  0.14823128  0.
  0.          0.          0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        paul smith
suburb_1      cary      
postcode_1    27519     
-> name_2        john lloyd
suburb_2      pineville 
postcode_2    28134     


 n


Largest step in LR coefficients: 0.20566719795224708

Nr. 23 (2+/20-) 
LR parameters: [-2.96345145  0.          1.3470289   0.          0.18100691  0.18100691
  0.07419844  0.          0.71955718  0.71955718  0.25190613  0.
  0.          0.          0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        jean mise
suburb_1      roxboro  
postcode_1    27574    
-> name_2        janes williams
suburb_2      tarboro       
postcode_2    27896         


 n


Largest step in LR coefficients: 0.09453111838098638

Nr. 24 (2+/21-) 
LR parameters: [-3.01856349  0.          1.38385881  0.          0.15991746  0.15991746
  0.11564447  0.          0.67880781  0.67880781  0.34643724  0.
  0.          0.          0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        ckay leedy
suburb_1      lijcolnton
postcode_1    28092     
-> name_2        viviem bryan
suburb_2      durzam      
postcode_2    27713       


 n


Largest step in LR coefficients: 0.2036926653221527

Nr. 25 (2+/22-) 
LR parameters: [-3.22225615  0.          1.3542564   0.          0.13055729  0.13055729
  0.09554833  0.          0.72626303  0.72626303  0.36277481  0.
  0.          0.          0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        john lloyd
suburb_1      pineville 
postcode_1    28134     
-> name_2        amv bates
suburb_2      cary     
postcode_2    27511    


 n


Largest step in LR coefficients: 0.1933402923009453

Nr. 26 (2+/23-) 
LR parameters: [-3.41559644  0.          1.3138406   0.          0.10982502  0.10982502
  0.04412712  0.          0.74854745  0.74854745  0.45122402  0.
  0.          0.          0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        john maher
suburb_1      cary      
postcode_1    27513     
-> name_2        keily sgea
suburb_2      boone     
postcode_2    28608     


 n


Largest step in LR coefficients: 0.19849155889500203

Nr. 27 (2+/24-) 
LR parameters: [-3.614088    0.          1.27185364  0.          0.0800384   0.0800384
  0.00600408  0.          0.79489196  0.79489196  0.47323774  0.
  0.          0.          0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        amv bates
suburb_1      cary     
postcode_1    27511    
-> name_2        keily sgea
suburb_2      boone     
postcode_2    28608     


 n


Largest step in LR coefficients: 0.19072594145825095

Nr. 28 (2+/25-) 
LR parameters: [-3.80481395  0.          1.21514906  0.          0.04047533  0.04047533
  0.          0.          0.84086912  0.84086912  0.48990016  0.
  0.          0.          0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        jean mise
suburb_1      roxboro  
postcode_1    27574    
-> name_2        donalr barney
suburb_2      fayetteville 
postcode_2    z8303        


 n


Largest step in LR coefficients: 0.18321082264836042

Nr. 29 (2+/26-) 
LR parameters: [-3.98802477  0.          1.14885001  0.          0.          0.
  0.          0.          0.90895763  0.90895763  0.45797252  0.
  0.          0.          0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        robert ambriz
suburb_1      four oaks    
postcode_1    27524        
-> name_2        david sprinkle
suburb_2      hiddenite     
postcode_2    28636         


 n


Largest step in LR coefficients: 0.16576584950246565

Nr. 30 (2+/27-) 
LR parameters: [-4.15379062  0.          1.00881455  0.          0.          0.
  0.          0.          0.93901354  0.93901354  0.49103484  0.
  0.          0.          0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        joseph taylor
suburb_1      eden         
postcode_1    27288        
-> name_2        maryann ca5h
suburb_2      waxha       
postcode_2    28173       


 n


Largest step in LR coefficients: 0.15145464190200375

Nr. 31 (2+/28-) 
LR parameters: [-4.30524526  0.          0.86401098  0.          0.          0.
  0.          0.          0.96292995  0.96292995  0.53617875  0.
  0.          0.          0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        john maher
suburb_1      cary      
postcode_1    27513     
-> name_2        david sprinkle
suburb_2      hiddenite     
postcode_2    28636         


 n


Largest step in LR coefficients: 0.1447206243394169

Nr. 32 (2+/29-) 
LR parameters: [-4.42788297  0.          0.71929035  0.          0.          0.
  0.          0.          0.98830539  0.98830539  0.5562519   0.
  0.          0.          0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        john maher
suburb_1      cary      
postcode_1    27513     
-> name_2        davis sprinkle
suburb_2      hiddenite     
postcode_2    28656         


 n


Largest step in LR coefficients: 0.14191158573937201

Nr. 33 (2+/30-) 
LR parameters: [-4.56979456  0.          0.58910123  0.          0.          0.
  0.          0.          1.01274414  1.01274414  0.57606904  0.
  0.          0.          0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        john lloyd
suburb_1      pineville 
postcode_1    28134     
-> name_2        robert ambriz
suburb_2      four oaks    
postcode_2    27524        


 n


Largest step in LR coefficients: 0.1576844415694323

Nr. 34 (2+/31-) 
LR parameters: [-4.727479    0.          0.43291389  0.          0.          0.
  0.          0.00497045  1.04159191  1.04159191  0.60276888  0.
  0.          0.          0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        leah wilson
suburb_1      greensboro 
postcode_1    27407      
-> name_2        maryann ca5h
suburb_2      waxha       
postcode_2    28173       


 n


Largest step in LR coefficients: 0.20960440512999945

Nr. 35 (2+/32-) 
LR parameters: [-4.9370834   0.          0.23548526  0.          0.          0.
  0.          0.09028131  1.04990695  1.04990695  0.63017819  0.
  0.          0.          0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        kristin woods
suburb_1      durham       
postcode_1    27713        
-> name_2        donalr barney
suburb_2      fayetteville 
postcode_2    z8303        


 n


Largest step in LR coefficients: 0.2251786517278882

Nr. 36 (2+/33-) 
LR parameters: [-5.16226206  0.          0.02250502  0.          0.          0.
  0.          0.18349133  1.06352517  1.06352517  0.64583387  0.
  0.          0.          0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        jean mise
suburb_1      roxboro  
postcode_1    27574    
-> name_2        phpillip cookd
suburb_2      winston salem 
postcode_2    27101         


 n


Largest step in LR coefficients: 0.03707858686962329
Classifier converged, enter 'f' to stop training

Nr. 37 (2+/34-) 
LR parameters: [-5.12518347  0.          0.          0.          0.          0.
  0.          0.18167089  1.0389568   1.0389568   0.61943853  0.
  0.          0.          0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        carol howe
suburb_1      horse shoe
postcode_1    28742     
-> name_2        viviem bryan
suburb_2      durzam      
postcode_2    27713       


 n


Largest step in LR coefficients: 0.046933781414025155
Classifier converged, enter 'f' to stop training

Nr. 38 (2+/35-) 
LR parameters: [-5.17211725  0.          0.          0.          0.          0.
  0.          0.18940717  1.0182421   1.0182421   0.60125894  0.
  0.          0.          0.          0.          0.        ]
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        kenyatta thornton
suburb_1      lexington        
postcode_1    27292            
-> name_2        betty honbaier
suburb_2      lexington     
postcode_2    27292         


 f


 score  count
  0.05   4912
  0.10      0
  0.15      0
  0.20      0
  0.25      0
  0.30      0
  0.35      0
  0.40      0
  0.45      0
  0.50      0
  0.55      0
  0.60      0
  0.65      0
  0.70      0
  0.75      0
  0.80      0
  0.85      0
  0.90      0
  0.95      0
  1.00    101
active learning finished
recall threshold reached, recall = 1.0
blocking rules found
[[<function first_two_characters at 0x7f9a3790aa60>, 'first_two_characters', 'postcode']]


<deduplipy.deduplicator.deduplicator.Deduplicator at 0x7f9a3992ca60>
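The "Largest step in LR coefficients" value in the log can be reproduced as the largest absolute change between two consecutive parameter vectors. The sketch below checks this against the parameters printed at Nr. 37 and Nr. 38; `largest_step` is a hypothetical helper, and the threshold the library uses to declare convergence is not stated here.

```python
def largest_step(previous, current):
    """Largest absolute change between two coefficient vectors."""
    return max(abs(c - p) for p, c in zip(previous, current))


# Non-zero coefficients copied from Nr. 37 and Nr. 38 in the log above
# (positions that are 0 in both rounds are omitted for brevity).
round_37 = [-5.12518347, 0.18167089, 1.0389568, 1.0389568, 0.61943853]
round_38 = [-5.17211725, 0.18940717, 1.0182421, 1.0182421, 0.60125894]

step = largest_step(round_37, round_38)
print(round(step, 6))  # 0.046934, matching the value reported in the log
```

Small steps mean that additional labels barely change the classifier, which is when the log reports convergence.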

Based on the histogram of scores, we decide to ignore all pairs with a similarity probability lower than 0.1 when predicting.

Apply the trained `Deduplicator` to (new) data. The column `deduplication_id` is the identifier of a cluster. Rows with the same `deduplication_id` are considered to be the same real-world entity.

In [9]:
res = myDedupliPy.predict(df, score_threshold=0.1)
res.sort_values('deduplication_id').head(10)

blocking started
blocking finished
Nr of pairs: 748767
scoring started
scoring finished
Nr of filtered pairs: 1152
Clustering started
Clustering finished


| | name | suburb | postcode | deduplication_id |
|---|---|---|---|---|
| 1380 | kiea matthews | charlotte | 28218 | 1 |
| 252 | kiera matthews | charlotte | 28216 | 1 |
| 0 | khimerc thomas | charlotte | 2826g | 2 |
| 1190 | chimerc thomas | charlotte | 28269 | 2 |
| 1302 | chimerc thmas | chaflotte | 28269 | 2 |
| 1 | lucille richardst | kannapolis | 28o81 | 6 |
| 1194 | lucille richards | kannapolis | 28081 | 6 |
| 5 | darr6l perry | fayetteville | 28321 | 8 |
| 966 | daryl perry | fayetteville | 2830l | 8 |
| 1449 | darryl perry | fayetteville | 28301 | 8 |
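One simple way to picture how scored pairs become cluster ids is to drop pairs below the threshold and treat the remaining pairs as links between rows. The sketch below does this with connected components. It is an illustrative stand-in, not DedupliPy's clustering step, which may work differently; `cluster_pairs` and the scores are made up.

```python
def cluster_pairs(n_rows, scored_pairs, score_threshold):
    """Group row indices into clusters via connected components."""
    parent = list(range(n_rows))

    def find(i):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j, score in scored_pairs:
        # Pairs scoring below the threshold are ignored, as in predict().
        if score >= score_threshold:
            parent[find(i)] = find(j)

    return [find(i) for i in range(n_rows)]


# Made-up scores for five rows; not taken from the run above.
scored_pairs = [(0, 1, 0.98), (1, 2, 0.95), (3, 4, 0.02)]
labels = cluster_pairs(5, scored_pairs, score_threshold=0.1)

print(labels[0] == labels[1] == labels[2])  # True: rows 0, 1, 2 share a cluster
print(labels[3] == labels[4])               # False: the 0.02 pair was filtered out
```

Rows 0 and 2 end up together even though they were never scored as a pair, which mirrors how three spellings of the same voter can share one `deduplication_id`.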
