# DedupliPy

## Simple deduplication

Load your data. This example uses a sample dataset that ships with DedupliPy:

In [1]:
from deduplipy.datasets import load_data

In [2]:
df = load_data(kind='voters')

Column names: 'name', 'suburb', 'postcode'


In [3]:
df.head(2)

| | name | suburb | postcode |
| --- | --- | --- | --- |
| 0 | khimerc thomas | charlotte | 2826g |
| 1 | lucille richardst | kannapolis | 28o81 |


Create a `Deduplicator` instance and provide the column names to be used for deduplication:

In [4]:
from deduplipy.deduplicator import Deduplicator

In [5]:
myDedupliPy = Deduplicator(['name', 'suburb', 'postcode'])

Fit the `Deduplicator` using active learning: for each pair shown, enter whether it is a match (`y`) or not (`n`). You are notified once training has converged, after which you can finish training by entering `f`.

In [None]:
myDedupliPy.fit(df)
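
To make the `y` / `n` / `f` labelling session more concrete, here is a minimal sketch of such a loop. It is not DedupliPy's implementation: `label_pairs` is a hypothetical helper, the answers are scripted rather than read with `input()`, and no model is trained. It only illustrates how answers could be collected until the user finishes.

```python
def label_pairs(pairs, answers):
    """Collect match / no-match labels for candidate pairs (hypothetical helper)."""
    labels = []
    answer_iter = iter(answers)
    for pair in pairs:
        # Running out of answers is treated like entering 'f'.
        answer = next(answer_iter, 'f')
        if answer == 'f':
            break
        if answer in ('y', 'n'):
            labels.append((pair, answer == 'y'))
        # Any other input is ignored and the pair stays unlabelled.
    return labels


# Candidate pairs built from names in the sample data
pairs = [
    ('lucille richardst', 'lucille richards'),
    ('rebecca harrell', 'maleda mccloud'),
    ('maleda mccloud', 'maleta mccloud'),
]

# Scripted session: match, no match, finish
labels = label_pairs(pairs, ['y', 'n', 'f'])
```

In an interactive session the scripted list would be replaced by prompts to the user; the third pair stays unlabelled here because the session was finished with `f`.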

Apply the trained `Deduplicator` to (new) data. The column `deduplication_id` identifies the cluster: rows with the same `deduplication_id` are predicted to refer to the same real-world entity.

In [7]:
res = myDedupliPy.predict(df)
res.sort_values('deduplication_id').head(10)

| | name | suburb | postcode | deduplication_id |
| --- | --- | --- | --- | --- |
| 1194 | lucille richards | kannapolis | 28081 | 1 |
| 1 | lucille richardst | kannapolis | 28o81 | 1 |
| 2 | reb3cca bauerboand | raleigh | 27615 | 3 |
| 1134 | rebecca bauerband | raleigh | 27615 | 3 |
| 675 | rebeccah shelton | whittier | 28789 | 5 |
| 1535 | rebecvah shelton | whittier | 2878g | 5 |
| 1024 | rebecca harrell | witnon | 27926 | 7 |
| 1456 | rebecca harrell | winton | 27986 | 7 |
| 3 | maleda mccloud | goldsboro | 2753o | 9 |
| 1238 | maleta mccloud | goldsboro | 27530 | 9 |
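
Because `deduplication_id` is an ordinary column, the result can be explored with standard pandas operations. The sketch below is an illustration only: it rebuilds four rows of the output by hand so that it runs without DedupliPy, and the names `cluster_sizes` and `one_per_cluster` are chosen for this example. In practice you would apply the same calls to the `res` DataFrame returned by `predict`.

```python
import pandas as pd

# Hand-copied subset of the predict() output, for illustration only
res = pd.DataFrame(
    {
        'name': ['lucille richards', 'lucille richardst',
                 'reb3cca bauerboand', 'rebecca bauerband'],
        'suburb': ['kannapolis', 'kannapolis', 'raleigh', 'raleigh'],
        'postcode': ['28081', '28o81', '27615', '27615'],
        'deduplication_id': [1, 1, 3, 3],
    },
    index=[1194, 1, 2, 1134],
)

# Number of rows assigned to each cluster
cluster_sizes = res.groupby('deduplication_id').size()

# Keep only the first row encountered for each cluster
one_per_cluster = res.drop_duplicates(subset='deduplication_id', keep='first')
```

Keeping the first row per cluster is an arbitrary choice; which record best represents a cluster depends on your data.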


The trained `Deduplicator` instance can be saved as a pickle file and loaded later to apply it to new data:

In [8]:
import pickle

In [9]:
with open('mypickle.pkl', 'wb') as f:
    pickle.dump(myDedupliPy, f)

In [10]:
with open('mypickle.pkl', 'rb') as f:
    loaded_obj = pickle.load(f)

In [11]:
res = loaded_obj.predict(df)
res.sort_values('deduplication_id').head(10)

| | name | suburb | postcode | deduplication_id |
| --- | --- | --- | --- | --- |
| 1194 | lucille richards | kannapolis | 28081 | 1 |
| 1 | lucille richardst | kannapolis | 28o81 | 1 |
| 2 | reb3cca bauerboand | raleigh | 27615 | 3 |
| 1134 | rebecca bauerband | raleigh | 27615 | 3 |
| 675 | rebeccah shelton | whittier | 28789 | 5 |
| 1535 | rebecvah shelton | whittier | 2878g | 5 |
| 1024 | rebecca harrell | witnon | 27926 | 7 |
| 1456 | rebecca harrell | winton | 27986 | 7 |
| 3 | maleda mccloud | goldsboro | 2753o | 9 |
| 1238 | maleta mccloud | goldsboro | 27530 | 9 |


To obtain the canonical representation for each cluster, set `return_canonical=True`:

In [12]:
res = loaded_obj.predict(df, return_canonical=True)
res.sort_values('deduplication_id').head(10)

| | name | suburb | postcode | deduplication_id | name_canonical | suburb_canonical | postcode_canonical |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1194 | lucille richards | kannapolis | 28081 | 1 | lucille richardst | kannapolis | 28o81 |
| 1 | lucille richardst | kannapolis | 28o81 | 1 | lucille richardst | kannapolis | 28o81 |
| 2 | reb3cca bauerboand | raleigh | 27615 | 3 | reb3cca bauerboand | raleigh | 27615 |
| 1134 | rebecca bauerband | raleigh | 27615 | 3 | reb3cca bauerboand | raleigh | 27615 |
| 675 | rebeccah shelton | whittier | 28789 | 5 | rebeccah shelton | whittier | 28789 |
| 1535 | rebecvah shelton | whittier | 2878g | 5 | rebeccah shelton | whittier | 28789 |
| 1024 | rebecca harrell | witnon | 27926 | 7 | rebecca harrell | witnon | 27926 |
| 1456 | rebecca harrell | winton | 27986 | 7 | rebecca harrell | witnon | 27926 |
| 3 | maleda mccloud | goldsboro | 2753o | 9 | maleda mccloud | goldsboro | 2753o |
| 1238 | maleta mccloud | goldsboro | 27530 | 9 | maleda mccloud | goldsboro | 2753o |
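
One possible way to turn this output into a deduplicated table is to keep only the `*_canonical` columns and drop repeated rows, giving one record per cluster. The sketch below uses plain pandas on four hand-copied rows so that it runs without DedupliPy; the variable names `canonical_cols` and `canonical` are illustrative assumptions, not part of the DedupliPy API.

```python
import pandas as pd

# Hand-copied subset of the predict(..., return_canonical=True) output
res = pd.DataFrame(
    {
        'name': ['lucille richards', 'lucille richardst',
                 'reb3cca bauerboand', 'rebecca bauerband'],
        'suburb': ['kannapolis', 'kannapolis', 'raleigh', 'raleigh'],
        'postcode': ['28081', '28o81', '27615', '27615'],
        'deduplication_id': [1, 1, 3, 3],
        'name_canonical': ['lucille richardst', 'lucille richardst',
                           'reb3cca bauerboand', 'reb3cca bauerboand'],
        'suburb_canonical': ['kannapolis', 'kannapolis', 'raleigh', 'raleigh'],
        'postcode_canonical': ['28o81', '28o81', '27615', '27615'],
    },
    index=[1194, 1, 2, 1134],
)

# Columns holding the canonical representation of each cluster
canonical_cols = [col for col in res.columns if col.endswith('_canonical')]

# One row per cluster, with the '_canonical' suffix stripped from column names
canonical = (
    res[['deduplication_id'] + canonical_cols]
    .drop_duplicates()
    .rename(columns=lambda col: col.replace('_canonical', ''))
    .reset_index(drop=True)
)
```

The resulting `canonical` frame has the columns `deduplication_id`, `name`, `suburb` and `postcode`, with a single row for each cluster.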
