# Entity Resolution

**Entity** refers to any unique object (person, place, organization, etc.) with a real and independent existence. **Entity Resolution** is the task of identifying data records that refer to the same real-world entity when unique identifiers are not available. Linking such records lets an organization analyze its data in terms of real-world entities and the relationships between them. Each individual record/document in one or more datasets points to a single real-world entity.

Entity resolution itself is known by multiple names: Data Deduplication, Record Linkage, Fuzzy Matching, Entity Matching and many more.

### Steps Involved in Entity Resolution

- **Data Preprocessing:** All the steps needed to make the data ready for the task.
    - **Canonicalization:** Converting data that has more than one possible representation into a standard form.
    - **Data Cleaning:** Removing content that does not provide any relevant information.
- **Blocking:** Grouping similar records together and separating them from other groups to reduce the number of comparisons. Blocks are created based on a number of blocking rules.
- **Matching:** Once blocks are created, records within the same block are compared using a defined set of matching rules.
    - **Data Labelling:** Labelling pairs of records as a match or not. In our case we use *active learning* to label data, which makes supervised learning possible. In unsupervised entity resolution this step is absent.
    - **Featurization:** This step calculates a similarity score for each pair of records within the same block. It can be done in two ways:
        - **Record-by-Record:** Concatenate all the fields together and calculate a similarity score for the record as a whole.
        - **Field-by-Field:** Compare each attribute independently and add the scores together to get the final one. The advantage of field-by-field over record-by-record is that we can assign a different weight to each field.
    - **Classification:** Classifies the similarity score obtained in the previous step as representing either distinct or duplicate records. It can also learn attribute weights and blocking rules.
- **Clustering:** Grouping all the records that refer to the same entity together.
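
The preprocessing, blocking and featurization steps above can be sketched with the standard library alone. This is a minimal illustration and not how any particular library works internally: the toy records, the blocking key (`zip`), the field weights and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions chosen for brevity.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Toy records; the field names and values are made up for illustration.
records = {
    1: {"name": "Erie Neighborhood House", "address": "2510 W Cortez Street", "zip": "60622"},
    2: {"name": "erie neighborhood  house ", "address": "2510 w cortez st", "zip": "60622"},
    3: {"name": "Howard Area Community Center", "address": "7510 N Ashland Ave", "zip": "60626"},
}


def canonicalize(value):
    """Lower-case and collapse whitespace so equal values look equal."""
    return " ".join(value.lower().split())


def block_by(records, key):
    """Blocking: group record ids that share the same value for `key`."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        blocks[canonicalize(record[key])].append(record_id)
    return blocks


def candidate_pairs(blocks):
    """Only records inside the same block are ever compared."""
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)


def field_by_field_score(a, b, weights):
    """Featurization: weighted average of per-field similarities in [0, 1]."""
    weighted_sum = 0.0
    for field, weight in weights.items():
        similarity = SequenceMatcher(
            None, canonicalize(a[field]), canonicalize(b[field])
        ).ratio()
        weighted_sum += weight * similarity
    return weighted_sum / sum(weights.values())


weights = {"name": 2.0, "address": 1.0}  # hypothetical weights
blocks = block_by(records, "zip")
pairs = list(candidate_pairs(blocks))
scores = {
    (a, b): field_by_field_score(records[a], records[b], weights)
    for a, b in pairs
}
```

Three records give three possible pairs, but blocking on `zip` leaves only the pair `(1, 2)` to score. The saving grows quickly with the number of records, since the number of possible pairs grows quadratically.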

### Data Ingestion

Data ingestion is the process of obtaining and importing data for immediate use or storage. The data used in this project can be obtained from [here](https://github.com/dedupeio/dedupe-examples/blob/master/csv_example/csv_example_messy_input.csv).

In [1]:
import os
import re
import csv
import dedupe
import pandas as pd
from unidecode import unidecode

Locate datasets and results files and folders according to your current working directory.

In [2]:
input_file = './../data/er.csv'
output_dir = './../results'
output_file = './../results/er_output.csv'

### Data Understanding

- The data used in this session is a collection of all types of educational institutions located within the United States of America.
- There are 3681 rows and 32 columns in the dataset.
- Several columns have missing values, mostly the later ones.
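
One way to check figures like these is to count rows and empty cells per column with the `csv` module. The helper name `missing_per_column` and the in-memory sample are hypothetical; the sample only stands in for the real file so that the sketch runs on its own.

```python
import csv
import io


def missing_per_column(csv_file):
    """Return (row_count, {column name: number of empty cells})."""
    reader = csv.DictReader(csv_file)
    missing = {name: 0 for name in reader.fieldnames}
    rows = 0
    for row in reader:
        rows += 1
        for name in reader.fieldnames:
            # Short rows yield None for absent cells, so treat None as empty.
            if not (row[name] or "").strip():
                missing[name] += 1
    return rows, missing


# Tiny made-up sample standing in for the real dataset.
sample = io.StringIO(
    "Id,Site name,Zip,Phone\n"
    "0,erie house,60622,\n"
    "1,howard center,,7647610\n"
)
rows, missing = missing_per_column(sample)
```

For the real data, pass `open(input_file, newline='')` in place of the in-memory sample.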

### Data Preprocessing

The `read_data()` function reads the CSV data and returns it as a dictionary keyed by each row's `Id`. The `pre_process()` function collapses irregular whitespace, strips surrounding quotes, lower-cases the text and turns empty values into `None`. The `unidecode()` function imported from the unidecode library takes Unicode data and tries to represent it in ASCII characters (i.e., the universally displayable characters between 0x00 and 0x7F), choosing mappings close to what a human with a US keyboard would type.

In [3]:
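def read_data(filename):
    """Read a CSV file into a dict of cleaned rows keyed by the integer 'Id' column."""
    data = {}
    # newline='' lets the csv module handle line breaks inside quoted fields
    with open(filename, newline='') as f:
        reader = csv.DictReader(f)
        for row in reader:
            clean_row = [(k, pre_process(v)) for (k, v) in row.items()]
            row_id = int(row['Id'])
            data[row_id] = dict(clean_row)
    return data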

In [4]:
def pre_process(column):
    column = unidecode(column)            # transliterate to ASCII
    column = re.sub('\n', ' ', column)    # newlines become spaces
    column = re.sub('  +', ' ', column)   # then collapse runs of spaces
    column = column.strip().strip('"').strip("'").lower().strip()
    if not column:   # Indicate missing value by None
        column = None
    return column

In [5]:
data = read_data(input_file)
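
To see what this kind of cleaning does to individual values without installing anything, here is a rough standard-library sketch. It is not the notebook's `pre_process()`: `unicodedata` normalization is swapped in for `unidecode` (it only handles characters that decompose into an ASCII base plus accents and drops everything else), and the function names and sample strings are hypothetical.

```python
import re
import unicodedata


def strip_accents(text):
    """Rough stand-in for unidecode: decompose accents, drop non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def pre_process_sketch(column):
    column = strip_accents(column)
    column = re.sub(r"\s+", " ", column)  # collapse spaces, tabs, newlines
    column = column.strip().strip("\"'").lower().strip()
    return column or None  # None marks a missing value


# \u00e9 is e with an acute accent, \u00f1 is n with a tilde.
examples = [
    '  "Caf\u00e9  Ni\u00f1os\nCenter" ',
    "   ",
    "ERIE Neighborhood House",
]
cleaned = [pre_process_sketch(value) for value in examples]
# cleaned == ['cafe ninos center', None, 'erie neighborhood house']
```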

### Defining Blocking Rules

How blocking rules are defined varies with the library we use. For this session, we will be working with [dedupe](https://dedupe.io/), an open-source Python library for entity resolution. Dedupe takes the fields to compare as a list of dictionaries (its variable definition), where each dictionary has the following keys:

- **field:** attribute name
- **type:** data type of the attribute value. This key determines which comparison method is used for the attribute
- **has missing:** indicates whether the attribute can have missing values

For more detailed information about choosing the right **type** values, refer to the [official documentation page](https://docs.dedupe.io/en/latest/Variable-definition.html).

In [6]:
fields = [
    {'field': 'Site name', 'type': 'String'},
    {'field': 'Address', 'type': 'String'},
    {'field': 'Zip', 'type': 'Exact', 'has missing': True},
    {'field': 'Phone', 'type': 'String', 'has missing': True},
]
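
Before handing a definition like this to a library, it can help to check it against the data. The helper below is a hypothetical sketch, not part of dedupe: it reports fields that do not exist in the records and fields that contain `None` without `'has missing'` being set. The toy data and field list are made up.

```python
def check_fields(fields, data):
    """Return a list of problems found in a list of field definitions."""
    problems = []
    for definition in fields:
        name = definition["field"]
        if all(name not in record for record in data.values()):
            problems.append(f"{name}: column not found in data")
            continue
        has_none = any(record.get(name) is None for record in data.values())
        if has_none and not definition.get("has missing"):
            problems.append(
                f"{name}: has missing values but 'has missing' is not set"
            )
    return problems


# Made-up records in the same shape that read_data() returns.
toy_data = {
    0: {"Site name": "erie neighborhood house", "Zip": None},
    1: {"Site name": "howard area community center", "Zip": "60626"},
}
toy_fields = [
    {"field": "Site name", "type": "String"},
    {"field": "Zip", "type": "Exact"},
    {"field": "Phone", "type": "String", "has missing": True},
]
problems = check_fields(toy_fields, toy_data)
```

Here `problems` flags `Zip` (it contains `None` but is not marked as having missing values) and `Phone` (no such column in the toy data).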

Create a dedupe object and pass the field definitions to it. The `prepare_training()` method initializes the active learning process using the data and those definitions.

In [7]:
deduper = dedupe.Dedupe(fields)
deduper.prepare_training(data)

### Active Learning

Active learning is a method in which users are actively involved in labelling the data. Here `active` refers to the real-time interaction used to label the data. Each time, the program displays a pair of records, choosing the pair it is least certain about, and the user labels whether the two records are duplicates. From these labels the program learns how to score the similarity of other pairs of records in the same block, which makes supervised learning possible. In unsupervised entity resolution, the active learning step is absent.
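
A common way for an active learner to decide which pair to show next is uncertainty sampling: ask about the pair whose predicted match probability is closest to 0.5. The sketch below only illustrates that idea; the function name and the probabilities are hypothetical, and the selection strategy inside any given library may differ.

```python
def most_uncertain_pair(pair_probabilities):
    """Pick the candidate pair whose match probability is closest to 0.5."""
    return min(
        pair_probabilities,
        key=lambda pair: abs(pair_probabilities[pair] - 0.5),
    )


# Hypothetical match probabilities from a partially trained model.
pair_probabilities = {
    ("a", "b"): 0.95,  # almost certainly duplicates
    ("a", "c"): 0.48,  # the model cannot decide
    ("b", "c"): 0.05,  # almost certainly distinct
}
next_pair = most_uncertain_pair(pair_probabilities)
```

A label for `("a", "c")` tells the model more than a label for either of the other two pairs, whose answers it can already predict with confidence.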

Dedupe provides a method `console_label()` to facilitate active learning.

In [12]:
dedupe.console_label(deduper)

Site name : erie neighborhood house fcch-nury rodriguez
Address : 2420 w lemoyne ave 1
Zip : 60622
Phone : None

Site name : erie neighborhood house
Address : 2510 w cortez street
Zip : 60622
Phone : 4867161

14/10 positive, 14/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


n


Site name : north avenue day nursery fcch-cheryl cook
Address : 5929 w walton
Zip : 60651
Phone : None

Site name : north avenue day nursery
Address : 2001 w pierce street
Zip : 60622
Phone : 3424499

14/10 positive, 15/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


Site name : howard area community center - howard area community center
Address : 7510 n ashland ave
Zip : None
Phone : 7647610

Site name : howard area community services
Address : 7610 n ashland avenue
Zip : 60626
Phone : 7647610

14/10 positive, 16/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


Site name : howard area community center - howard area community center
Address : 7510 n ashland ave
Zip : None
Phone : 7647610

Site name : howard area community services
Address : 7610 n ashland avenue
Zip : 60626
Phone : 7647610

15/10 positive, 16/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


f


Finished labeling


### Classification

We now have a similarity score for each pair of records that falls in the same block, but we are still not certain whether a given score represents a match. A classifier is used to decide, from the similarity scores, whether each pair represents duplicates or distinct entities.

Sometimes this step is also used to learn blocking rules and attribute weights (in the case of field-by-field comparison).

In [13]:
deduper.train()
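
To make the idea of classifying similarity scores concrete, here is a small logistic model that turns per-field similarities into a match probability. Logistic regression is one common choice for this step; the weights and bias below are hypothetical values picked by hand rather than learned, and the sketch does not describe what `train()` does internally.

```python
import math


def match_probability(field_scores, weights, bias):
    """Squash a weighted sum of field similarities into the range (0, 1)."""
    z = bias + sum(weights[name] * field_scores[name] for name in weights)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters; a real classifier learns these from labelled pairs.
weights = {"name": 4.0, "address": 3.0}
bias = -4.0

similar = match_probability({"name": 0.9, "address": 0.8}, weights, bias)
different = match_probability({"name": 0.2, "address": 0.1}, weights, bias)

# Classify with a 0.5 cut-off: above is a duplicate, below is distinct.
is_duplicate = [p > 0.5 for p in (similar, different)]
```

The learned weights play the role of the attribute weights mentioned above: a field with a larger weight has more influence on the final decision.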



### Clustering

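So far we have been working with pairs of records only. Making a resolution decision independently for each pair of records can be costly, so we apply a clustering algorithm to group all the records that refer to the same entity. The second argument to `partition()` is a threshold between 0 and 1: raising it favors precision, lowering it favors recall.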

In [14]:
clusters = deduper.partition(data, 0.5)

In [15]:
cluster_dict = {}
for cluster_id, (records, scores) in enumerate(clusters):
    for record_id, score in zip(records, scores):
        cluster_dict[record_id] = {
            "Cluster ID": cluster_id,
            "confidence_score": score
        }
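
As an illustration of how pairwise matches become clusters, the sketch below groups records by connected components using a union-find structure: if 1 matches 2 and 2 matches 3, all three end up together. This is a deliberately simple technique chosen for the example and is not claimed to be what `partition()` does; the function name and inputs are hypothetical.

```python
def cluster_pairs(record_ids, matched_pairs):
    """Group record ids into clusters given a list of matching pairs."""
    parent = {record_id: record_id for record_id in record_ids}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for record_id in record_ids:
        groups.setdefault(find(record_id), []).append(record_id)
    return sorted(sorted(members) for members in groups.values())


toy_clusters = cluster_pairs([1, 2, 3, 4, 5], [(1, 2), (2, 3)])
# toy_clusters == [[1, 2, 3], [4], [5]]
```

Connected components can chain weak matches into one large cluster, which is one reason libraries often use a threshold-based clustering method instead.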

### Saving results as csv

We will save the results obtained so far as a CSV file. The output file has two additional columns relative to the original data: `Cluster ID` and `confidence_score`. `Cluster ID` is the id of the cluster of matched records that a record belongs to, while `confidence_score` is the certainty that the given record belongs to that cluster.

In [16]:
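# Create the output folder if needed (no error if it already exists)
os.makedirs(output_dir, exist_ok=True)

# newline='' is what the csv module expects for files it reads or writes
with open(output_file, 'w', newline='') as f_output, open(input_file, newline='') as f_input: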
    reader = csv.DictReader(f_input)
    fieldnames = ['Cluster ID', 'confidence_score'] + reader.fieldnames
    writer = csv.DictWriter(f_output, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        row_id = int(row['Id'])
        row.update(cluster_dict[row_id])
        writer.writerow(row)
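
Once the file is written, a quick way to inspect it is to group rows by `Cluster ID` and keep only the clusters with more than one member. The helper below is a hypothetical sketch; the in-memory sample mimics the output layout with made-up values so the example runs without the real file.

```python
import csv
import io
from collections import defaultdict


def duplicate_clusters(csv_file, id_column="Id", cluster_column="Cluster ID"):
    """Return {cluster id: [record ids]} for clusters with 2+ records."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row[cluster_column]].append(row[id_column])
    return {cid: ids for cid, ids in groups.items() if len(ids) > 1}


# Made-up rows in the same column layout as the output file.
sample_output = io.StringIO(
    "Cluster ID,confidence_score,Id,Site name\n"
    "0,0.98,10,howard area community center\n"
    "0,0.98,11,howard area community services\n"
    "1,1.0,12,erie neighborhood house\n"
)
duplicates = duplicate_clusters(sample_output)
# duplicates == {'0': ['10', '11']}
```

For the real results, pass `open(output_file, newline='')` in place of the sample.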

### Python libraries that support Entity Resolution

- [Splink](https://github.com/moj-analytical-services/splink) -> Unsupervised Learning
- [Deepmatcher](https://github.com/anhaidgroup/deepmatcher) -> Supervised Learning
- [FuzzyMatcher](https://github.com/RobinL/fuzzymatcher) -> Unsupervised Learning
- [RLTK](https://github.com/usc-isi-i2/rltk) -> Supervised Learning
- [RecordLinkage](https://github.com/J535D165/recordlinkage) -> Supervised/Unsupervised Learning

### References

- [An introduction to Entity Resolution — needs and challenges](https://towardsdatascience.com/an-introduction-to-entity-resolution-needs-and-challenges-97fba052dde5)
- [End-to-End Entity Resolution for Big Data: A Survey](https://arxiv.org/pdf/1905.06397.pdf)
- [Entity Resolution: Tutorial](http://home.cse.ust.hk/~leichen/courses/mscit6000d/notes/entityresolution.pdf)