# DedupliPy

## Advanced deduplication

Load your data. In this example we take a sample dataset that comes with DedupliPy:

In [2]:
from deduplipy.datasets import load_data

In [3]:
df = load_data(kind='voters')

Column names: 'name', 'suburb', 'postcode'


In [4]:
df.head(2)

|   | name | suburb | postcode |
|---|------|--------|----------|
| 0 | khimerc thomas | charlotte | 2826g |
| 1 | lucille richardst | kannapolis | 28o81 |


Create a `Deduplicator` instance and provide advanced settings:

- The similarity metrics per field are entered in a dict. A similarity metric can be any function that takes two strings and outputs a number.

In [5]:
from deduplipy.deduplicator import Deduplicator
from fuzzywuzzy.fuzz import ratio, partial_ratio, token_set_ratio, token_sort_ratio

In [6]:
field_info = {'name':[ratio, partial_ratio], 'suburb':[token_set_ratio, token_sort_ratio], 'postcode':[ratio]}
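
Because any two-string function that returns a number qualifies, custom metrics can sit alongside the fuzzywuzzy ones. The sketch below is illustrative only: `difflib_ratio`, `same_digits` and `custom_field_info` are hypothetical names that are not part of DedupliPy, and the 0-100 scale is chosen merely to mirror the range of fuzzywuzzy's `ratio`.

```python
from difflib import SequenceMatcher


def difflib_ratio(a: str, b: str) -> int:
    """Similarity on a 0-100 scale, based on the standard library's SequenceMatcher."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def same_digits(a: str, b: str) -> int:
    """Return 100 when both strings contain the same digits in the same order, else 0."""
    digits_a = [c for c in a if c.isdigit()]
    digits_b = [c for c in b if c.isdigit()]
    return 100 if digits_a == digits_b else 0


# Hypothetical alternative to field_info, using only the custom metrics above.
custom_field_info = {
    'name': [difflib_ratio],
    'suburb': [difflib_ratio],
    'postcode': [difflib_ratio, same_digits],
}
```

Such a dict would be passed in the same way as `field_info`, under the assumption that the functions behave sensibly for every value in the column.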

- We define our own blocking rule and apply it only to the 'name' column.

In [7]:
def first_two_characters(x):
    return x[:2]
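
To see why a blocking rule reduces the work, the sketch below groups a handful of names from the sample data by the key that `first_two_characters` produces and only forms pairs within each group. This is a simplified illustration of blocking in general, not DedupliPy's internal implementation; `blocks` and `candidate_pairs` are hypothetical names.

```python
from collections import defaultdict
from itertools import combinations


def first_two_characters(x):
    return x[:2]


names = [
    'lucille richardst',
    'lucille richards',
    'renee emmons',
    'rehee emmons',
    'khimerc thomas',
]

# Group records by their blocking key.
blocks = defaultdict(list)
for name in names:
    blocks[first_two_characters(name)].append(name)

# Only records that share a key are compared with each other.
candidate_pairs = [
    pair for block in blocks.values() for pair in combinations(block, 2)
]
all_pairs = list(combinations(names, 2))

print(len(candidate_pairs), 'candidate pairs instead of', len(all_pairs))
```

The trade-off is that two records of the same entity are never compared if a typo falls inside the blocking key, for example in the first two characters of the name.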

- `interaction=True` makes the classifier include interaction features, e.g. `ratio('name') * token_set_ratio('suburb')`. When interaction features are included, the logistic regression classifier applies L1 regularisation to prevent overfitting.
- We set `verbose=1` to print progress information and the distribution of scores.

In [8]:
myDedupliPy = Deduplicator(field_info=field_info, interaction=True, rules={'name': [first_two_characters]}, verbose=1)
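
As a rough illustration of what an interaction feature is, the sketch below extends a set of similarity scores with the product of every pair of scores. The feature names and values are made up, and `add_interactions` is a hypothetical helper; how DedupliPy builds these features internally may differ.

```python
from itertools import combinations

# Hypothetical similarity scores for one pair of records, scaled to 0-1.
features = {
    'name_ratio': 0.9,
    'suburb_token_set_ratio': 0.5,
    'postcode_ratio': 0.8,
}


def add_interactions(features):
    """Return the original features plus the product of every pair of features."""
    extended = dict(features)
    for (name_a, value_a), (name_b, value_b) in combinations(features.items(), 2):
        extended[f'{name_a} * {name_b}'] = value_a * value_b
    return extended


extended = add_interactions(features)
```

With three base features this adds three products, so the number of inputs to the classifier grows quickly as more metrics are configured, which is why regularisation becomes relevant.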

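Fit the `Deduplicator` by active learning: for each pair shown, enter whether it is a match (y) or not (n). The counter after the pair number, e.g. `(1+/0-)`, shows how many matches and non-matches have been labelled so far. Once training has converged you are notified, and you can finish training by entering 'f'.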
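
Active learning asks for labels where they are expected to help most. One common query strategy is uncertainty sampling, sketched below under the assumption that the classifier outputs a match probability per pair; whether DedupliPy selects pairs in exactly this way is not stated here, and both the probabilities and `most_uncertain` are hypothetical.

```python
# Hypothetical match probabilities from a partially trained classifier.
predicted = {
    ('emily davidson', 'emioy davidson'): 0.55,
    ('kimmerly walden', 'kimmerly walden'): 0.97,
    ('chad williams', 'craig williams'): 0.40,
    ('kimmerly walden', 'terry dillom'): 0.03,
}


def most_uncertain(predicted):
    """Return the pair whose match probability is closest to 0.5."""
    return min(predicted, key=lambda pair: abs(predicted[pair] - 0.5))


query = most_uncertain(predicted)
print(query)
```

Pairs the classifier is already sure about add little information, which is why a modest number of labels can be enough.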

In [9]:
myDedupliPy.fit(df)


Nr. 1 (0+/0-) 
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        kimmerly walden
suburb_1      high pint      
postcode_1    27760          
-> name_2        kimmerly walden
suburb_2      high pint      
postcode_2    27760          


 y



Nr. 2 (1+/0-) 
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        kimmerly walden
suburb_1      high pint      
postcode_1    27760          
-> name_2        terry dillom
suburb_2      kernersville
postcode_2    17284       


 n



Nr. 3 (1+/1-) 
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        yoshida stokes
suburb_1      greenville    
postcode_1    27834         
-> name_2        juah ramoa  
suburb_2      fayetteville
postcode_2    28314       


 n



Nr. 4 (1+/2-) 
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        michael jones
suburb_1      fayetteville 
postcode_1    28306        
-> name_2        bradlev james
suburb_2      fayettville  
postcode_2    28312        


 n



Nr. 5 (1+/3-) 
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        leigh younce
suburb_1      greenville  
postcode_1    27858       
-> name_2        terwsa jones
suburb_2      spring hope 
postcode_2    2788z       


 n



Nr. 6 (1+/4-) 
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        chad williams
suburb_1      charlotte    
postcode_1    28277        
-> name_2        craig williams
suburb_2      winston-salem 
postcode_2    27127         


 n



Nr. 7 (1+/5-) 
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        emily davidson
suburb_1      concord       
postcode_1    28025         
-> name_2        emioy davidson
suburb_2      concotd       
postcode_2    28025         


 y



Nr. 8 (2+/5-) 
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        terry dillom
suburb_1      kernersville
postcode_1    17284       
-> name_2        terry dillom
suburb_2      kernersville
postcode_2    17284       


 y



Nr. 9 (3+/5-) 
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        cbad mozby 
suburb_1      chapel hill
postcode_1    27514      
-> name_2        cbad mozby 
suburb_2      chapel hill
postcode_2    27514      


 y



Nr. 10 (4+/5-) 
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        john sakuon
suburb_1      raleigh    
postcode_1    27668      
-> name_2        john sakuon
suburb_2      raleigh    
postcode_2    27668      


 y



Nr. 11 (5+/5-) 
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        jennifer hannen
suburb_1      greensboro     
postcode_1    27405          
-> name_2        jennifer bentz
suburb_2      greensboro    
postcode_2    27407         


 n



Nr. 12 (5+/6-) 
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        jason shown
suburb_1      greensboro 
postcode_1    27407      
-> name_2        dustin snowden
suburb_2      greensboro    
postcode_2    27406         


 n



Nr. 13 (5+/7-) 
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        susan tobin  
suburb_1      winston salem
postcode_1    27106        
-> name_2        quentin davis
suburb_2      winston salem
postcode_2    27106        


 n



Nr. 14 (5+/8-) 
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        elas5er handy
suburb_1      lumberton    
postcode_1    28398        
-> name_2        elas5er handy
suburb_2      lumberton    
postcode_2    28398        


 y



Nr. 15 (6+/8-) 
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        darren aldndge
suburb_1      monro         
postcode_1    28110         
-> name_2        karen nering
suburb_2      wiliiiington
postcode_2    28411       


 n



Nr. 16 (6+/9-) 
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        kimmerly walden
suburb_1      high pint      
postcode_1    27760          
-> name_2        cbad mozby 
suburb_2      chapel hill
postcode_2    27514      


 n



Nr. 17 (6+/10-) 
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        chad williams
suburb_1      charlotte    
postcode_1    28277        
-> name_2        wilkiam glass
suburb_2      charlote     
postcode_2    28278        


 n



Nr. 18 (6+/11-) 
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        clyde alberg
suburb_1      clemmgns    
postcode_1    270|2       
-> name_2        clyde alberg
suburb_2      clemmgns    
postcode_2    270|2       


 y



Nr. 19 (7+/11-) 
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        kimmerly walden
suburb_1      high pint      
postcode_1    27760          
-> name_2        john sakuon
suburb_2      raleigh    
postcode_2    27668      


 n



Nr. 20 (7+/12-) 
Is this a match? (y)es, (n)o, (p)revious, (s)kip, (f)inish
-> name_1        peggy huffman
suburb_1      hickory      
postcode_1    28601        
-> name_2        peggy huffman
suburb_2      hickory      
postcode_2    28601        


 


Wrong input!


 f


 score  count
  0.05      0
  0.10      0
  0.15   4937
  0.20      0
  0.25      0
  0.30      0
  0.35      0
  0.40      0
  0.45      0
  0.50      0
  0.55      0
  0.60      0
  0.65      0
  0.70      0
  0.75      0
  0.80      0
  0.85      0
  0.90     94
  0.95      0
  1.00      0
active learning finished
recall threshold reached, recall = 1.0
blocking rules found
['name first_two_characters']


Deduplicator
  - col_names = ['name', 'suburb', 'postcode']
  - field_info = {'name': ['ratio', 'partial_ratio'], 'suburb': ['token_set_ratio', 'token_sort_ratio'], 'postcode': ['ratio']}
  - interaction = True
  - rules_info = {'name': ['first_two_characters']}
  - recall = 1.0
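
The score distribution printed during fitting has only two non-empty buckets. The sketch below counts how many of those pairs would remain for a few candidate thresholds. It assumes that each bucket label can be treated as the score of the pairs in it and that pairs at or above the threshold are kept; `pairs_kept` is a hypothetical helper, not a DedupliPy function.

```python
# Non-zero buckets copied from the score distribution printed above.
score_counts = {0.15: 4937, 0.90: 94}


def pairs_kept(score_counts, threshold):
    """Count the pairs whose bucket score is at or above the threshold."""
    return sum(count for score, count in score_counts.items() if score >= threshold)


kept = {t: pairs_kept(score_counts, t) for t in (0.1, 0.5, 0.95)}
print(kept)
```

A threshold below the lowest occupied bucket keeps every pair, while a threshold between the two buckets keeps only the high-scoring group.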

Apply the trained `Deduplicator` to (new) data. Based on the histogram of scores, we decide to ignore all pairs with a similarity probability lower than 0.1 when predicting, which is done by setting `score_threshold=0.1`.

The column `deduplication_id` is the identifier for a cluster. Rows with the same `deduplication_id` are considered to be the same real-world entity.

In [10]:
res = myDedupliPy.predict(df, score_threshold=0.1)
res.sort_values('deduplication_id').head(10)

blocking started
blocking finished
Nr of pairs: 27350
scoring started
scoring finished
Nr of filtered pairs: 27350
Clustering started
Clustering finished


|      | name | suburb | postcode | deduplication_id |
|------|------|--------|----------|------------------|
| 995  | lutta baldwin | whitevill | 28475 | 1 |
| 604  | lutta baldwin | whiteville | 28472 | 1 |
| 1    | lucille richardst | kannapolis | 28o81 | 2 |
| 1194 | lucille richards | kannapolis | 28081 | 2 |
| 1595 | rehee emmons | salishury | 28146 | 5 |
| 891  | renee emmons | salisbury | 28146 | 5 |
| 2    | reb3cca bauerboand | raleigh | 27615 | 6 |
| 1134 | rebecca bauerband | raleigh | 27615 | 6 |
| 675  | rebeccah shelton | whittier | 28789 | 7 |
| 1535 | rebecvah shelton | whittier | 2878g | 7 |
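
Once every row carries a `deduplication_id`, ordinary pandas operations are enough to inspect or collapse the clusters. The sketch below rebuilds a few rows of the result by hand (`res_sample` is a hypothetical stand-in for `res`) and keeps the first record of each cluster; which record should represent a cluster is a choice left to the user.

```python
import pandas as pd

# A few rows copied from the prediction output above.
res_sample = pd.DataFrame(
    {
        'name': ['lutta baldwin', 'lutta baldwin', 'lucille richardst',
                 'lucille richards', 'rehee emmons', 'renee emmons'],
        'suburb': ['whitevill', 'whiteville', 'kannapolis',
                   'kannapolis', 'salishury', 'salisbury'],
        'postcode': ['28475', '28472', '28o81', '28081', '28146', '28146'],
        'deduplication_id': [1, 1, 2, 2, 5, 5],
    },
    index=[995, 604, 1, 1194, 1595, 891],
)

# Number of records per cluster.
cluster_sizes = res_sample.groupby('deduplication_id').size()

# Keep one representative row per cluster (here simply the first one).
deduplicated = res_sample.drop_duplicates(subset='deduplication_id', keep='first')

print(cluster_sizes)
print(deduplicated)
```

Keeping the first row is arbitrary; in practice one might prefer the most complete or most recent record within each cluster.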


The weights of the trained classifier can be inspected through the underlying model:

In [11]:
myDedupliPy.myActiveLearner.learner.estimator.classifier.steps[1][1].model.get_weights()

[array([[-0.00050127, -0.25572953,  0.00050729, -0.00049916],
        [ 0.00049993, -0.36192012,  0.00051298,  0.00049658],
        [ 0.0005007 , -0.01182869, -0.00049452, -0.00050586],
        [ 0.00049866, -0.01184364, -0.00049713, -0.00050379],
        [-0.00049841, -0.24231392, -0.00048792, -0.00050321]],
       dtype=float32),
 array([-0.05491118,  0.9702366 , -0.0034916 , -0.00187172], dtype=float32),
 array([[ 0.00049796, -0.00050284],
        [-0.48095465, -0.37345102],
        [ 0.00048311, -0.00050611],
        [ 0.00048927, -0.00049002]], dtype=float32),
 array([0.589673  , 0.45859084], dtype=float32),
 array([[3.6666062],
        [3.2822022]], dtype=float32),
 array([-1.8524615], dtype=float32)]