# Deduplicate data

## 1. Load sample data

In [1]:
import pandas as pd

In [2]:
customers = pd.read_csv(
    "https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/customer_data_duped.csv",
    encoding="utf-8",
)

## 2. Deduplication with pandas

### 2.1 Overview

In [3]:
customers

Unnamed: 0,name,job,company,street_address,city,state,email,user_name
0,Patricia Schaefer,"Programmer, systems",Estrada-Best,398 Paul Drive,Christianview,Delaware,lambdavid@gmail.com,ndavidson
1,Olivie Dubois,Ingénieur recherche et développement en agroal...,Moreno,rue Lucas Benard,Saint Anastasie-les-Bains,AR,berthelotjacqueline@mahe.fr,manonallain
2,Mary Davies-Kirk,Public affairs consultant,Baker Ltd,Flat 3\nPugh mews,Stanleyfurt,ZA,middletonconor@hotmail.com,colemanmichael
3,Miroslawa Eckbauer,Dispensing optician,Ladeck GmbH,Mijo-Lübs-Straße 12,Neubrandenburg,Berlin,sophia01@yahoo.de,romanjunitz
4,Richard Bauer,"Accountant, chartered certified",Hoffman-Rocha,6541 Rodriguez Wall,Carlosmouth,Texas,tross@jensen-ware.org,adam78
...,...,...,...,...,...,...,...,...
2075,Maurice Stey,Systems developer,Linke Margraf GmbH & Co. OHG,Laila-Scheibe-Allee 2/0,Luckenwalde,Hamburg,gutknechtevelyn@niemeier.com,dkreusel
2076,Linda Alexander,Commrcil horiculuri,"Webb, Ballald and Vasquel",5594 Persn Ciff,Mooneybury,Maryland,ahleythoa@ail.co,kennethrchn
2077,Diane Bailly,Pharmacien,Voisin,"527, rue Dijoux",Duval-les-Bains,CH,aruiz@reynaud.fr,dorothee41
2078,Jorge Riba Cerdán,Hotel manager,Amador-Diego,Rambla de Adriana Barceló 854 Puerta 3,Huesca,Asturias,manuelamosquera@yahoo.com,eugenia17


### 2.2 Display data types

For this we use [pandas.DataFrame.dtypes](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html):

In [4]:
customers.dtypes

name              object
job               object
company           object
street_address    object
city              object
state             object
email             object
user_name         object
dtype: object

### 2.3 Determining missing values

[pandas.isnull](https://pandas.pydata.org/docs/reference/api/pandas.isnull.html) shows whether values are missing for an array-like object:

* `NaN` in numeric arrays
* `None` or `NaN` in object arrays
* `NaT` in [datetimelike](https://pandas.pydata.org/docs/reference/general_functions.html#top-level-dealing-with-datetimelike-data) arrays

> **See also:**
> 
> * [notna](https://pandas.pydata.org/docs/reference/api/pandas.notna.html) for the Boolean inverse of [pandas.isna](https://pandas.pydata.org/docs/reference/api/pandas.isna.html)
> * [Series.isna](https://pandas.pydata.org/docs/reference/api/pandas.Series.isna.html) for the missing values in a series
> * [DataFrame.isna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html) for the missing values in a DataFrame
> * [Index.isna](https://pandas.pydata.org/docs/reference/api/pandas.Index.isna.html) for the missing values in an index

In [5]:
for col in customers.columns:
    print(col, customers[col].isnull().sum())

name 0
job 0
company 0
street_address 0
city 0
state 0
email 0
user_name 0
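The same counts can be obtained without an explicit loop, because `DataFrame.isnull` works on the whole frame at once. The following is a minimal sketch on a small, made-up frame (`sample` and its values are purely illustrative and not taken from the customer file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the customer file
sample = pd.DataFrame(
    {
        "name": ["Ada", None, "Grace"],
        "age": [36.0, np.nan, 45.0],
        "joined": pd.to_datetime(["2020-01-01", None, "2021-06-30"]),
    }
)

# Count the missing values (None, NaN, NaT) per column in a single call
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```

For the customer data, `customers.isnull().sum()` should give the same counts as the loop.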


### 2.4 Determine duplicated data records

In [6]:
customers.duplicated()

0       False
1       False
2       False
3       False
4       False
        ...  
2075    False
2076    False
2077    False
2078    False
2079    False
Length: 2080, dtype: bool

`customers.duplicated()` does not yet give us the desired indication of whether there are duplicate data records. In the following, we display all data records for which `True` is returned:

In [7]:
customers[customers.duplicated()]

Unnamed: 0,name,job,company,street_address,city,state,email,user_name


The result is empty, so there are no fully identical data records.
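Instead of inspecting the Boolean Series by eye, it can be reduced to a single answer. A minimal sketch, assuming a small made-up frame (`records` is illustrative only):

```python
import pandas as pd

# Hypothetical toy data with one exact duplicate row
records = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Ann"],
        "user_name": ["ann1", "bob7", "ann1"],
    }
)

# True for every row that repeats an earlier row in all columns
mask = records.duplicated()

has_duplicates = bool(mask.any())  # is there at least one duplicate?
n_duplicates = int(mask.sum())     # how many rows are repeats?
print(has_duplicates, n_duplicates)
```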

### 2.5 Deleting duplicated data

Deleting duplicate data records with `drop_duplicates` should therefore not change anything and leave the number of data records at 2080:

In [8]:
customers.drop_duplicates()

Unnamed: 0,name,job,company,street_address,city,state,email,user_name
0,Patricia Schaefer,"Programmer, systems",Estrada-Best,398 Paul Drive,Christianview,Delaware,lambdavid@gmail.com,ndavidson
1,Olivie Dubois,Ingénieur recherche et développement en agroal...,Moreno,rue Lucas Benard,Saint Anastasie-les-Bains,AR,berthelotjacqueline@mahe.fr,manonallain
2,Mary Davies-Kirk,Public affairs consultant,Baker Ltd,Flat 3\nPugh mews,Stanleyfurt,ZA,middletonconor@hotmail.com,colemanmichael
3,Miroslawa Eckbauer,Dispensing optician,Ladeck GmbH,Mijo-Lübs-Straße 12,Neubrandenburg,Berlin,sophia01@yahoo.de,romanjunitz
4,Richard Bauer,"Accountant, chartered certified",Hoffman-Rocha,6541 Rodriguez Wall,Carlosmouth,Texas,tross@jensen-ware.org,adam78
...,...,...,...,...,...,...,...,...
2075,Maurice Stey,Systems developer,Linke Margraf GmbH & Co. OHG,Laila-Scheibe-Allee 2/0,Luckenwalde,Hamburg,gutknechtevelyn@niemeier.com,dkreusel
2076,Linda Alexander,Commrcil horiculuri,"Webb, Ballald and Vasquel",5594 Persn Ciff,Mooneybury,Maryland,ahleythoa@ail.co,kennethrchn
2077,Diane Bailly,Pharmacien,Voisin,"527, rue Dijoux",Duval-les-Bains,CH,aruiz@reynaud.fr,dorothee41
2078,Jorge Riba Cerdán,Hotel manager,Amador-Diego,Rambla de Adriana Barceló 854 Puerta 3,Huesca,Asturias,manuelamosquera@yahoo.com,eugenia17


Now we want to display those data records for which `user_name` is identical:

In [9]:
customers[customers.duplicated(["user_name"])]

Unnamed: 0,name,job,company,street_address,city,state,email,user_name
337,Aysel Binner,Reccig officer,Kuhl Kalleww Swifwunw & Co. KGaA,Batix-Kanz-Staß 5/4,Fulda,Berli,frncoise@wgnerco,christinefinke
377,Jolanta Rogge,Accommodation managr,Scholl e.V.,Lrchplz 4/6,Mettmnn,Thüringen,inrharff@yah.d,walentinabeier
506,Mrs. Frances Peters,Fuiue desie,"Rsgers, Lawrence and Richards",Studio \nCarpntr kys,Wes Simn,BO,halenewilliams@wilson-sandes.og,amy17
545,Gerhart Krebs MBA.,Surgeon,Roskoth,Kühnertweg 863,Stade,Bayern,olav44@bolander.de,bettyhahn
592,Folkert Gnatz,Meteorologist,Bolnbach,Heinfried-Austermühle-Ring 05,Eilenburg,Thüringen,jaentschbirgitt@boerner.org,francesco44
633,Manon Jacquot,Ingénieur en aéronautique,Jacob,"8, chemin Éléonore Evrard",Marechal-les-Bains,AR,ilemaitre@voila.fr,astrid58
658,Austin Waller,Insurance risk surveyor,Sexton Group,11097 Hansen Field,Davidmouth,Texas,christina74@doyle-baker.biz,olynn
723,Wanda Moran,"Solicitor, Scotland",Estes PLC,08011 Hernandez Streets Apt. 149,Natalieshire,Oregon,howardreginald@gmail.com,dana91
762,Charles Russell,"Scientist, research (physical sciences)",Preston-Wilson,6709 Ashley Circle Apt. 309,Danielberg,South Dakota,nancyescobar@brown.net,ruben71
772,Waltrud Wohlgemut,"Designer, fashion/clothing",Nerger AG,Elmar-Ullmann-Allee 6,Schlüchtern,Rheinland-Pfalz,auch-schlauchindietlind@gmx.de,zitakuhl


Now we can display the associated data records, for example with:

In [10]:
customers[customers["user_name"] == "christinefinke"]

Unnamed: 0,name,job,company,street_address,city,state,email,user_name
236,Aysel Binner,Recycling officer,Kuhl Kallert Stiftung & Co. KGaA,Beatrix-Kranz-Straße 5/4,Fulda,Berlin,francoise22@wagner.com,christinefinke
337,Aysel Binner,Reccig officer,Kuhl Kalleww Swifwunw & Co. KGaA,Batix-Kanz-Staß 5/4,Fulda,Berli,frncoise@wgnerco,christinefinke


Finally, we can delete those data records whose `user_name` is identical:

In [11]:
customers.drop_duplicates(["user_name"])

Unnamed: 0,name,job,company,street_address,city,state,email,user_name
0,Patricia Schaefer,"Programmer, systems",Estrada-Best,398 Paul Drive,Christianview,Delaware,lambdavid@gmail.com,ndavidson
1,Olivie Dubois,Ingénieur recherche et développement en agroal...,Moreno,rue Lucas Benard,Saint Anastasie-les-Bains,AR,berthelotjacqueline@mahe.fr,manonallain
2,Mary Davies-Kirk,Public affairs consultant,Baker Ltd,Flat 3\nPugh mews,Stanleyfurt,ZA,middletonconor@hotmail.com,colemanmichael
3,Miroslawa Eckbauer,Dispensing optician,Ladeck GmbH,Mijo-Lübs-Straße 12,Neubrandenburg,Berlin,sophia01@yahoo.de,romanjunitz
4,Richard Bauer,"Accountant, chartered certified",Hoffman-Rocha,6541 Rodriguez Wall,Carlosmouth,Texas,tross@jensen-ware.org,adam78
...,...,...,...,...,...,...,...,...
2074,Rhonda James,Recruitment consultant,"Turner, Bradley and Scott",28382 Stokes Expressway,Port Gabrielaport,New Hampshire,zroberts@hotmail.com,heathscott
2076,Linda Alexander,Commrcil horiculuri,"Webb, Ballald and Vasquel",5594 Persn Ciff,Mooneybury,Maryland,ahleythoa@ail.co,kennethrchn
2077,Diane Bailly,Pharmacien,Voisin,"527, rue Dijoux",Duval-les-Bains,CH,aruiz@reynaud.fr,dorothee41
2078,Jorge Riba Cerdán,Hotel manager,Amador-Diego,Rambla de Adriana Barceló 854 Puerta 3,Huesca,Asturias,manuelamosquera@yahoo.com,eugenia17


This deleted 51 data records.
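Note that `drop_duplicates` returns a new DataFrame and leaves the original untouched unless the result is assigned. The number of removed records can be checked by comparing lengths, as in this sketch on made-up data (the `users` frame is illustrative only):

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 share a user_name
users = pd.DataFrame(
    {
        "name": ["Ann Lee", "Bob Ray", "Ann Le"],
        "user_name": ["ann1", "bob7", "ann1"],
    }
)

# keep="first" (the default) retains the first record of each group
deduplicated = users.drop_duplicates(["user_name"], keep="first")

removed = len(users) - len(deduplicated)
print(removed)
```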

## 3. Dedupe 

Alternatively, we can recognise the duplicated data with the [Dedupe](https://docs.dedupe.io/en/latest/) library, which uses machine learning to learn from a small amount of labelled training data.

<div class="alert alert-block alert-info">

**See also:**

[csvdedupe](https://github.com/dedupeio/csvdedupe) offers a command line tool for dedupe.
</div>

In addition, the same developers have created [parserator](https://github.com/datamade/parserator), which you can use to extract text features and train your own text parsers.

### 3.1 Configuring Dedupe

Now we define the fields to be taken into account during deduplication and create a new `deduper` object:

In [9]:
import os

import dedupe


customers = pd.read_csv(
    "https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/customer_data_duped.csv",
    encoding="utf-8",
)

In [10]:
variables = [
    dedupe.variables.String("name"),
    dedupe.variables.String("job"),
    dedupe.variables.String("company"),
    dedupe.variables.String("street_address"),
    dedupe.variables.String("city"),
    dedupe.variables.String("state"),
    dedupe.variables.String("email"),
    dedupe.variables.String("user_name")
]

deduper = dedupe.Dedupe(variables)

If the value of a field is missing, it should be represented as a `None` object. If you additionally set `has_missing=True` in a variable definition, for example `dedupe.variables.String("email", has_missing=True)`, Dedupe creates an additional field that indicates whether the data was present, and the missing data is set to zero.

<div class="alert alert-block alert-info">

**See also:**

* [Missing Data](https://docs.dedupe.io/en/latest/Variable-definition.html#missing-data)
</div>
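pandas represents missing values as `NaN`, whereas Dedupe expects `None`. One possible way to convert them before passing records to Dedupe is sketched below. The toy frame and the helper name `to_dedupe_records` are assumptions for illustration and not part of the Dedupe API:

```python
import numpy as np
import pandas as pd


def to_dedupe_records(df):
    """Return {row label: {column: value}} with missing values as None."""
    return {
        idx: {col: (None if pd.isna(val) else val) for col, val in row.items()}
        for idx, row in df.to_dict(orient="index").items()
    }


# Hypothetical toy data with one missing email address
sample = pd.DataFrame(
    {"name": ["Ann Lee", "Bob Ray"], "email": ["ann@example.org", np.nan]}
)

records = to_dedupe_records(sample)
print(records[1])
```

The customer data used here contains no missing values, so this step is not needed for the example that follows.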

In [11]:
deduper

<dedupe.api.Dedupe at 0x7f03b6d92610>

In [12]:
customers.shape

(2080, 8)

## 4. Create training data

In [13]:
deduper.prepare_training(customers.T.to_dict())

[prepare_training](https://docs.dedupe.io/en/latest/API-documentation.html#dedupe.Dedupe.prepare_training) initialises active learning with our data and, optionally, with existing training data.

`T` mirrors the DataFrame via its diagonal by writing rows as columns and vice versa. For this, [pandas.DataFrame.transpose](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html) is used.
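The resulting dictionary maps each row label to a dictionary of column values. A minimal sketch on made-up data shows this structure and that `to_dict(orient="index")` yields the same result without the transpose (the `sample` frame is illustrative only):

```python
import pandas as pd

# Hypothetical toy data
sample = pd.DataFrame(
    {"name": ["Ann Lee", "Bob Ray"], "user_name": ["ann1", "bob7"]}
)

# Transpose first, then convert: the outer keys are the original row labels
via_transpose = sample.T.to_dict()

# Same structure, requested directly
via_orient = sample.to_dict(orient="index")

print(via_transpose)
```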

## 5. Active learning

You can train your dedupe instance with [dedupe.console_label](https://docs.dedupe.io/en/latest/API-documentation.html#dedupe.console_label). Dedupe presents pairs of records and asks whether they refer to the same thing. Use the `y`, `n` and `u` keys to answer yes, no or unsure, and press `f` when you are finished.

In [14]:
dedupe.console_label(deduper)

name : Kenneth Moore
job : Magazine journalist
company : Cross, Bell and Diaz
street_address : 75443 Lindsey Pine
city : Thompsonshire
state : Colorado
email : ashley28@rice.com
user_name : todd72

name : Kenneth Moore
job : Magazine journalist
company : Cross, Bfll anf Diaz
street_address : 753 Lindsey Pine
city : Thompsonshe
state : Colorao
email : ashey28@rice.co
user_name : todd72

0/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


y


name : Frédérique Lejeune-Daniel
job : Technicien chimiste
company : Schmitt
street_address : chemin Denise Ferrand
city : Saint CharlotteVille
state : IE
email : jchretien@costa.com
user_name : joseph60

name : Frédérique Lejeune-Daniel
job : Tecce cse
company : Sctmitt
street_address : chemin Denise Ferrand
city : Saint ChalotteVille
state : IE
email : jchretien@costacom
user_name : joseph60

1/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


name : Herr Johann Eigenwillig
job : Immigrtion officer
company : Süßebuer Hänel GmbH
street_address : Lanernplatz 0
city : Stadtsteinach
state : Thürinen
email : hemieluie@nock.com
user_name : istoll

name : Herr Johann Eigenwillig
job : Immigration officer
company : Süßebier Hänel GmbH
street_address : Langernplatz 0
city : Stadtsteinach
state : Thüringen
email : haasemarieluise@noack.com
user_name : istoll

2/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


name : Dr. Catherine Sutton
job : Engineer, maintenance
company : Ross LLC
street_address : 13689 Morales Centers
city : North Sarah
state : New Mexico
email : lewisnicole@yahoo.com
user_name : clittle

name : Dr. Catherine Sutton
job : Enginee maintenance
company : Ross LLC
street_address : 13689 Morales Centers
city : North Sarah
state : New Mexico
email : ewinicoe@yaoo.com
user_name : little

3/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


name : Andrés Franco Bravo
job : Photographer
company : Pareja-Fábregas
street_address : Cuesta Margarita Robledo 251 Piso 1 
city : Granada
state : Alicante
email : fátimazamora@batlle.com
user_name : losasebastian

name : Andrés Franco Bravo
job : Photographer
company : Pare8a8Fábre8as
street_address : Cuesta Magaita Robledo 251 Piso 1 
city : Granada
state : Alicante
email : fáimazamra@balle.cm
user_name : lsasebastian

4/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


f


Finished labeling


The last two record pairs make it clear why our `drop_duplicates` example above did not remove these duplicates: their `user_name` values differ, for example `clittle` and `little`, so they were treated as different records.
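The difference between exact and approximate matching can be illustrated with the standard library. The sketch below uses `difflib.SequenceMatcher` purely as a stand-in similarity measure; it is an assumption for illustration and not the comparison that Dedupe itself performs:

```python
from difflib import SequenceMatcher

# user_name pairs taken from the labelling session above
pairs = [("clittle", "little"), ("losasebastian", "lsasebastian")]

for left, right in pairs:
    exact = left == right  # what drop_duplicates effectively checks
    similarity = SequenceMatcher(None, left, right).ratio()
    print(f"{left!r} vs {right!r}: exact={exact}, similarity={similarity:.2f}")
```

Both pairs fail the exact comparison but score close to 1 on the similarity measure, which is the kind of signal a fuzzy matcher can exploit.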

With [Dedupe.train](https://docs.dedupe.io/en/latest/API-documentation.html#dedupe.Dedupe.train), the data record pairs you have marked are added to the training data and the matching model is updated.

With `index_predicates=True`, deduplication also takes into account predicates based on the indexing of the data.

When you are finished, save the learned settings with [Dedupe.write_settings](https://docs.dedupe.io/en/latest/API-documentation.html#dedupe.Dedupe.write_settings). The labelled training pairs themselves can be saved separately with `Dedupe.write_training`.

In [15]:
settings_file = "csv_example_learned_settings"

if os.path.exists(settings_file):
    print("reading from", settings_file)
    with open(settings_file, "rb") as f:
        deduper = dedupe.StaticDedupe(f)
else:
    deduper.train(index_predicates=True)
    with open(settings_file, "wb") as sf:
        deduper.write_settings(sf)

reading from csv_example_learned_settings


With [dedupe.Dedupe.partition](https://docs.dedupe.io/en/latest/API-documentation.html#dedupe.Dedupe.partition), records that all refer to the same entity are identified and returned as tuples, each consisting of a sequence of record IDs and a sequence of confidence values. Further details on the confidence value can be found at [dedupe.Dedupe.cluster](https://docs.dedupe.io/en/latest/API-documentation.html#dedupe.Dedupe.cluster).

In [16]:
dupes = deduper.partition(customers.T.to_dict())

In [17]:
dupes

[((136, 1360), (1.0, 1.0)),
 ((298, 1026), (1.0, 1.0)),
 ((354, 858), (1.0, 1.0)),
 ((478, 1119), (1.0, 1.0)),
 ((938, 1890), (1.0, 1.0)),
 ((1785, 1939), (1.0, 1.0)),
 ((0,), (1.0,)),
 ((1,), (1.0,)),
 ((2,), (1.0,)),
 ((3,), (1.0,)),
 ((4,), (1.0,)),
 ...]

We can also display only individual entries:

In [18]:
dupes[0]

((136, 1360), (1.0, 1.0))
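For further processing it can be convenient to flatten this structure into a table with one row per record. The following sketch hard-codes two entries copied from the output above instead of calling Dedupe; the column names `record_id`, `cluster_id` and `confidence` are assumptions chosen for illustration:

```python
import pandas as pd

# Two entries copied from the partition output shown above
dupes = [((136, 1360), (1.0, 1.0)), ((0,), (1.0,))]

# One row per record, numbered by the cluster it belongs to
rows = [
    {"record_id": record_id, "cluster_id": cluster_id, "confidence": score}
    for cluster_id, (record_ids, scores) in enumerate(dupes)
    for record_id, score in zip(record_ids, scores)
]
membership = pd.DataFrame(rows).set_index("record_id")

# Clusters with more than one member are the suspected duplicates
cluster_sizes = membership.groupby("cluster_id")["confidence"].transform("size")
suspected = membership[cluster_sizes > 1]
print(suspected)
```

Such a table could then be joined to `customers` on the index to review each cluster side by side.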

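Because `customers` has the default integer index, the record IDs correspond to row positions, so we can display these records with [pandas.DataFrame.iloc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html):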

In [19]:
customers.iloc[[136,1360]]

Unnamed: 0,name,job,company,street_address,city,state,email,user_name
136,Frédérique Lejeune-Daniel,Technicien chimiste,Schmitt,chemin Denise Ferrand,Saint CharlotteVille,IE,jchretien@costa.com,joseph60
1360,Frédérique Lejeune-Daniel,Tecce cse,Sctmitt,chemin Denise Ferrand,Saint ChalotteVille,IE,jchretien@costacom,joseph60
