# Deduplication Example

Before running this notebook on Google Colab, please make sure to set the runtime type to "GPU". Do that by going to the "Runtime" menu and selecting "Change runtime type".

## Boilerplate

In [None]:
!pip install git+https://github.com/egu8/entity-embed

In [None]:
!pip install "matplotlib==3.1.1" \
             "pynndescent==0.5.2" \
             "scikit-learn==0.24.1" \
             "seaborn==0.11.1" \
             "unidecode==1.1.2"

In [None]:
from importlib import reload
import logging
reload(logging)
logging.basicConfig(format='%(asctime)s %(levelname)s:%(message)s', level=logging.INFO, datefmt='%H:%M:%S')

In [None]:
import sys

sys.path.insert(0, '..')

In [None]:
import entity_embed

In [None]:
import torch
import numpy as np

random_seed = 42
torch.manual_seed(random_seed)
np.random.seed(random_seed)

## Load Dataset

Follow [this link](https://drive.google.com/drive/folders/1OhKJ413oKe9zdGy3wSaMteMrqFqvGArZ?usp=sharing) to access the Google Drive folder that hosts the datasets.

Download the zipped datasets and upload them to the runtime files for this Colab.

Note that it may take a few minutes to upload and unzip the zipped file due to the size and number of images in the dataset.



In [None]:
!unzip shopee.zip
!unzip shopee-product-matching.zip

In [None]:
import os

DATAFILE = "shoppee.csv"

Now we must read the CSV dataset into a `dict` called `record_dict`.

`record_dict` will contain all records from the dataset, and each record will have the indication of the true cluster it belongs to in the field `CID`.

So `CID` is our `cluster_field`. Entity Embed needs that to train, validate, and test.

We'll dynamically attribute an `id` to each record using `enumerate`. Entity Embed needs that too.

In [None]:
import csv

record_dict = {}
cluster_field = 'CID'

with open(DATAFILE, newline='') as f:
    for current_record_id, record in enumerate(csv.DictReader(f)):
        record['id'] = current_record_id
        record[cluster_field] = int(record[cluster_field])  # convert cluster_field to int
        record_dict[current_record_id] = record

Here's an example of a record:

In [None]:
record_dict[83]

OrderedDict([('', '83'),
             ('posting_id', 'train_1648139967'),
             ('image', '00b5bc701453b04717230b1a5f253a14.jpg'),
             ('image_phash', 'fab111489dc3c76c'),
             ('title', 'Sandal Gladiator Tali Lebar Import 1606-3A'),
             ('label_group', '1065718333'),
             ('TID', '84'),
             ('CID', 79),
             ('CTID', '1'),
             ('id', 83)])

How many clusters does this dataset have?

In [None]:
cluster_total = len(set(record[cluster_field] for record in record_dict.values()))
cluster_total

11014
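To get a feel for what a cluster count implies, here's a minimal sketch on a toy `dict` of records. The records and the `toy_` names are made up for illustration and are not taken from the dataset. It counts the records per cluster and the number of positive (duplicate) pairs, assuming every two records that share a `CID` form one pair:

```python
from collections import Counter

# Toy records (hypothetical); the real ones come from the CSV above.
toy_record_dict = {
    0: {'id': 0, 'CID': 10},
    1: {'id': 1, 'CID': 10},
    2: {'id': 2, 'CID': 10},
    3: {'id': 3, 'CID': 20},
    4: {'id': 4, 'CID': 20},
}
toy_cluster_field = 'CID'

# Number of records in each cluster
cluster_sizes = Counter(r[toy_cluster_field] for r in toy_record_dict.values())

# A cluster with n records contains n * (n - 1) / 2 distinct pairs
positive_pair_count = sum(n * (n - 1) // 2 for n in cluster_sizes.values())

print(len(cluster_sizes))    # number of clusters
print(dict(cluster_sizes))   # records per cluster
print(positive_pair_count)   # duplicate pairs implied by the clusters
```

The pair count grows quadratically with cluster size, so a few large clusters can dominate the number of pairs.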

Of all clusters, we'll use only 20% for training and another 20% for validation. The remaining 60% is held out for testing, to check how well we can generalize:

In [None]:
from entity_embed.data_utils import utils

train_record_dict, valid_record_dict, test_record_dict = utils.split_record_dict_on_clusters(
    record_dict=record_dict,
    cluster_field=cluster_field,
    train_proportion=0.2,
    valid_proportion=0.2,
    random_seed=random_seed)

01:00:43 INFO:Singleton cluster sizes (train, valid, test):(0, 0, 0)
01:00:43 INFO:Plural cluster sizes (train, valid, test):(2202, 2202, 6610)


Note we're splitting the data on **clusters**, not records, so the record counts vary:

In [None]:
len(train_record_dict), len(valid_record_dict), len(test_record_dict)

(6839, 6841, 20570)
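To illustrate why splitting on clusters gives uneven record counts, here's a simplified sketch. `split_on_clusters` is a hypothetical helper written for this example. It is not the implementation of `utils.split_record_dict_on_clusters`, which, judging by the log above, also tracks singleton clusters separately:

```python
import random

def split_on_clusters(record_dict, cluster_field, train_proportion, valid_proportion, seed):
    """Hypothetical helper: assign whole clusters to train/valid/test."""
    cluster_ids = sorted({r[cluster_field] for r in record_dict.values()})
    random.Random(seed).shuffle(cluster_ids)
    train_end = int(len(cluster_ids) * train_proportion)
    valid_end = train_end + int(len(cluster_ids) * valid_proportion)
    train_ids = set(cluster_ids[:train_end])
    valid_ids = set(cluster_ids[train_end:valid_end])
    splits = ({}, {}, {})  # train, valid, test
    for record_id, record in record_dict.items():
        cid = record[cluster_field]
        idx = 0 if cid in train_ids else 1 if cid in valid_ids else 2
        splits[idx][record_id] = record
    return splits

# 10 toy clusters with sizes 2..11, i.e. 65 records in total
toy = {}
for cid in range(10):
    for _ in range(cid + 2):
        toy[len(toy)] = {'CID': cid}

train, valid, test = split_on_clusters(toy, 'CID', 0.2, 0.2, seed=42)
print(len(train), len(valid), len(test))
```

Each split receives a fixed share of the clusters, but the number of records depends on which clusters land in it. No cluster is ever shared between splits, which is what makes the validation and test metrics a fair measure of generalization.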

## Preprocess

We'll perform a very minimal preprocessing of the dataset. We want to simply force ASCII chars, lowercase all chars, and strip leading and trailing whitespace.

The fields we'll clean are the ones we'll use:

In [None]:
field_list = ['title', 'image_phash']

In [None]:
import unidecode

def clean_str(s):
    return unidecode.unidecode(s).lower().strip()

for record in record_dict.values():
    for field in field_list:
        record[field] = clean_str(record[field])

In [None]:
utils.subdict(record_dict[83], field_list)

{'image_phash': 'fab111489dc3c76c',
 'title': 'sandal gladiator tali lebar import 1606-3a'}

Forcing ASCII chars in this dataset is useful to improve recall because there's little difference between accented and unaccented chars here. Also, this dataset contains mostly Latin chars.
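If `unidecode` is not available, a rough stand-in can be built from the standard library. This sketch swaps in `unicodedata` normalization, which is a different technique: it strips accents but, unlike `unidecode`, it drops chars that have no ASCII decomposition instead of transliterating them. `clean_str_stdlib` is a name made up for this example:

```python
import unicodedata

def clean_str_stdlib(s):
    # Decompose accented chars (e.g. 'é' becomes 'e' plus a combining accent),
    # then drop everything that is not ASCII.
    decomposed = unicodedata.normalize('NFKD', s)
    ascii_only = decomposed.encode('ascii', 'ignore').decode('ascii')
    return ascii_only.lower().strip()

print(clean_str_stdlib('  Sandália Gladiador  '))
```

For text that is mostly Latin with occasional accents, both approaches should give the same result. For other scripts, `unidecode` is the safer choice.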

## Configure Entity Embed fields

Now we will define how record fields will be numericalized and encoded by the neural network. First we set an `alphabet`. Here we'll use ASCII numbers, letters, symbols and space:

In [None]:
from entity_embed.data_utils.field_config_parser import DEFAULT_ALPHABET

alphabet = DEFAULT_ALPHABET
''.join(alphabet)

'0123456789abcdefghijklmnopqrstuvwxyz!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ '

It's worth noting you can use any alphabet you need, so the accent removal we performed is optional.
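Before settling on an alphabet, it can help to check which chars in your data fall outside of it. The sketch below rebuilds the alphabet printed above from the `string` module rather than importing it from Entity Embed, and `chars_outside_alphabet` is a hypothetical helper:

```python
import string

# Assumption: rebuilt from the printed value above, not imported from Entity Embed
ascii_alphabet = list(string.digits + string.ascii_lowercase + string.punctuation + ' ')

def chars_outside_alphabet(text, alphabet):
    """Return the sorted chars of `text` that the alphabet does not cover."""
    return sorted(set(text) - set(alphabet))

print(chars_outside_alphabet('sandal gladiator 1606-3a', ascii_alphabet))
print(chars_outside_alphabet('Café', ascii_alphabet))
```

A cleaned title yields an empty list, while uncleaned text reports its uppercase and accented chars. That's a hint to either preprocess the field or extend the alphabet.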

Then we set a `field_config_dict`. It defines `field_type`s that determine how fields are processed in the neural network:

In [None]:
field_config_dict = {
    'title': {
        'field_type': "MULTITOKEN",
        'tokenizer': "entity_embed.default_tokenizer",
        'alphabet': alphabet,
        'max_str_len': None,  # compute
    },
    'title_semantic': {
        'key': 'title',
        'field_type': "SEMANTIC_MULTITOKEN",
        'tokenizer': "entity_embed.default_tokenizer",
        'vocab': "fasttext.en.300d",
    },
    'image_phash': {
        'field_type': "STRING",
        'alphabet': alphabet,
        'max_str_len': None,  # compute
    },
}

Then we use our `field_config_dict` to get a `record_numericalizer`. This object will convert the strings from our records into tensors for the neural network.

The same `record_numericalizer` must be used on ALL data: train, valid, test. This ensures numericalization will be consistent. Therefore, we pass `record_list=record_dict.values()`:

In [None]:
from entity_embed import FieldConfigDictParser

record_numericalizer = FieldConfigDictParser.from_dict(field_config_dict, record_list=record_dict.values())

01:02:08 INFO:For field=title, computing actual max_str_len
01:02:08 INFO:For field=title, using actual_max_str_len=40
01:02:09 INFO:Loading vectors from .vector_cache/wiki.en.vec.pt
01:02:59 INFO:For field=image_phash, computing actual max_str_len
01:02:59 INFO:For field=image_phash, using actual_max_str_len=16
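To make the idea of numericalization concrete, here's a simplified sketch of one-hot encoding a string over an alphabet, padded to a fixed `max_str_len`. This is only an illustration of the general technique. It is an assumption that Entity Embed's `STRING` fields work along these lines, and `one_hot_string` is a hypothetical helper, not a library function:

```python
import string

toy_alphabet = string.digits + string.ascii_lowercase + string.punctuation + ' '
char_to_index = {c: i for i, c in enumerate(toy_alphabet)}

def one_hot_string(s, max_str_len):
    """Hypothetical helper: one row per position, one column per alphabet char."""
    matrix = [[0] * len(toy_alphabet) for _ in range(max_str_len)]
    for pos, char in enumerate(s[:max_str_len]):
        # Raises KeyError for chars outside the alphabet
        matrix[pos][char_to_index[char]] = 1
    return matrix

encoded = one_hot_string('fab111489dc3c76c', max_str_len=16)
print(len(encoded), len(encoded[0]))  # positions x alphabet size
print(encoded[0].index(1))            # column of 'f' in the alphabet
```

Strings shorter than `max_str_len` leave the trailing rows all zeros. This is why the same alphabet and maximum length must be shared by train, valid, and test data: otherwise the same string would map to tensors of different shapes.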


## Initialize Data Module

Under the hood, Entity Embed uses [pytorch-lightning](https://pytorch-lightning.readthedocs.io/en/latest/), so we need to create a datamodule object:

In [None]:
from entity_embed import DeduplicationDataModule

batch_size = 32
eval_batch_size = 64
datamodule = DeduplicationDataModule(
    train_record_dict=train_record_dict,
    valid_record_dict=valid_record_dict,
    test_record_dict=test_record_dict,
    cluster_field=cluster_field,
    record_numericalizer=record_numericalizer,
    batch_size=batch_size,
    eval_batch_size=eval_batch_size,
    random_seed=random_seed
)

We've used `DeduplicationDataModule` because we're doing Deduplication of a single dataset/table (a.k.a. Entity Clustering, Entity Resolution, etc.).

We're NOT doing Record Linkage of two datasets here. Check the other notebook [Record-Linkage-Example](./Record-Linkage-Example.ipynb) if you want to learn how to do it with Entity Embed.

## Training

Now the training process! Thanks to pytorch-lightning, it's easy to train, validate, and test with the same datamodule.

We must choose the K of the Approximate Nearest Neighbors, i.e., the top K neighbors our model will use to find duplicates in the embedding space. Below we're setting it on `ann_k` and initializing the `EntityEmbed` model object:

In [None]:
from entity_embed import EntityEmbed

ann_k = 100
model = EntityEmbed(
    record_numericalizer,
    ann_k=ann_k,
)
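To show what "top K neighbors" means, here's a brute-force sketch on toy 2-dimensional embeddings. It performs an exact search by comparing every record with every other one. An Approximate Nearest Neighbors index avoids that quadratic cost at the price of occasionally missing a neighbor. The `top_k_neighbors` helper and the toy vectors are made up for this example:

```python
import heapq
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_k_neighbors(embeddings, k):
    """Exact (brute-force) search of the k most similar records per record."""
    unit = {rid: normalize(v) for rid, v in embeddings.items()}
    neighbors = {}
    for rid, v in unit.items():
        # On unit vectors, the dot product equals the cosine similarity
        scored = (
            (sum(a * b for a, b in zip(v, other)), other_id)
            for other_id, other in unit.items()
            if other_id != rid
        )
        neighbors[rid] = [other_id for _, other_id in heapq.nlargest(k, scored)]
    return neighbors

toy_embeddings = {
    'a': [1.0, 0.0],
    'b': [0.9, 0.1],
    'c': [0.1, 1.0],
    'd': [-1.0, 0.0],
}
print(top_k_neighbors(toy_embeddings, k=2))
```

A larger K makes it less likely that a true duplicate is missed, but it also produces more candidate pairs to filter afterwards.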

To train, Entity Embed uses [pytorch-lightning Trainer](https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html) in its `EntityEmbed.fit` method.

Since Entity Embed is focused on recall, we'll use `valid_recall_at_0.3` for early stopping. But we'll set `min_epochs = 5` to avoid a very low precision.

`0.3` here is the threshold for **cosine similarity of embedding vectors**, so possible values are between -1 and 1. We're using a validation metric, and the training process will run validation on every epoch end due to `check_val_every_n_epoch=1`.
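As a quick reminder of how cosine similarity behaves at the ends of its range, here's a small pure-Python sketch. The vectors are toy values and `cosine_similarity` is defined here only for illustration:

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

threshold = 0.3

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])      # close to -1

for name, sim in [('same', same_direction), ('orthogonal', orthogonal), ('opposite', opposite)]:
    print(name, round(sim, 3), 'candidate pair' if sim >= threshold else 'discarded')
```

Only the magnitude-independent direction of the vectors matters, so a low threshold like `0.3` accepts pairs that are only loosely aligned. That favors recall over precision.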

We also set `tb_name` and `tb_save_dir` to use Tensorboard. Run `tensorboard --logdir tb_logs` from the notebook's working directory (or `tensorboard --logdir notebooks/tb_logs` from the repository root) to check the train and valid metrics during and after training.

In [None]:
trainer = model.fit(
    datamodule,
    min_epochs=5,
    max_epochs=100,
    check_val_every_n_epoch=1,
    early_stop_monitor="valid_recall_at_0.3",
    tb_save_dir='tb_logs',
    tb_name='music',
)

01:03:22 INFO:GPU available: True, used: True
01:03:22 INFO:TPU available: False, using: 0 TPU cores
01:03:23 INFO:Train positive pair count: 15338
01:03:23 INFO:Valid positive pair count: 16852
01:03:23 INFO:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
01:05:45 INFO:
  | Name        | Type       | Params
-------------------------------------------
0 | blocker_net | BlockerNet | 13.0 M
1 | loss_fn     | SupConLoss | 0     
-------------------------------------------
5.5 M     Trainable params
7.6 M     Non-trainable params
13.0 M    Total params
52.097    Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

01:35:48 INFO:Loading the best validation model from tb_logs/music/version_2/checkpoints/epoch=5-step=1351.ckpt...


`EntityEmbed.fit` keeps only the weights of the best validation model. With them, we can check the best performance on the validation set:

In [None]:
model.validate(datamodule)

{'valid_f1_at_0.3': 0.09098334060573512,
 'valid_f1_at_0.5': 0.36262810099060694,
 'valid_f1_at_0.7': 0.587527256868731,
 'valid_pair_entity_ratio_at_0.3': 49.797398041222046,
 'valid_pair_entity_ratio_at_0.5': 9.504019880134484,
 'valid_pair_entity_ratio_at_0.7': 2.56439117088145,
 'valid_precision_at_0.3': 0.04774205668928915,
 'valid_precision_at_0.5': 0.22830951904886415,
 'valid_precision_at_0.7': 0.5759562218548709,
 'valid_recall_at_0.3': 0.9651079990505578,
 'valid_recall_at_0.5': 0.8808450035604083,
 'valid_recall_at_0.7': 0.5995727510087824}
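To help read these numbers, here's a minimal sketch of how such pair-based metrics can be computed from a set of found pairs and a set of true pairs. The pair sets are toy data and `pair_metrics` is a hypothetical helper. In particular, the pair-entity ratio is assumed here to be the number of found pairs divided by the number of records:

```python
def pair_metrics(found_pairs, true_pairs, record_count):
    """Hypothetical helper; assumes both pair sets are non-empty."""
    true_positives = len(found_pairs & true_pairs)
    precision = true_positives / len(found_pairs)
    recall = true_positives / len(true_pairs)
    f1 = 2 * precision * recall / (precision + recall)
    pair_entity_ratio = len(found_pairs) / record_count  # assumed definition
    return precision, recall, f1, pair_entity_ratio

toy_true = {(0, 1), (0, 2), (1, 2), (3, 4)}
toy_found = {(0, 1), (0, 2), (1, 2), (1, 3), (2, 4), (0, 4)}
precision, recall, f1, ratio = pair_metrics(toy_found, toy_true, record_count=5)
print(precision, recall, f1, ratio)

# Sanity check: F1 recomputed from the precision and recall reported at 0.7
p, r = 0.5759562218548709, 0.5995727510087824
recomputed_f1 = 2 * p * r / (p + r)
print(recomputed_f1)  # should be close to valid_f1_at_0.7 above
```

Lowering the threshold adds found pairs, which raises recall and the pair-entity ratio while lowering precision. That is the pattern visible across the `0.7`, `0.5`, and `0.3` rows.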

And we can check which fields are most important for the final embedding:

In [None]:
model.get_pool_weights()

{'image_phash': 0.17612621188163757,
 'title': 0.4284956753253937,
 'title_semantic': 0.3953781723976135}
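The weights above sum to roughly 1, so they can be read as shares. As a rough illustration, and assuming the final embedding is a weighted sum of per-field embeddings (an assumption about the model, not something this notebook shows), the pooling could be sketched like this with made-up 2-dimensional field embeddings:

```python
pool_weights = {
    'image_phash': 0.17612621188163757,
    'title': 0.4284956753253937,
    'title_semantic': 0.3953781723976135,
}
weight_total = sum(pool_weights.values())
print(weight_total)  # close to 1

# Toy field embeddings (made up for illustration)
toy_field_embeddings = {
    'image_phash': [1.0, 0.0],
    'title': [0.0, 1.0],
    'title_semantic': [1.0, 1.0],
}

def weighted_pool(field_embeddings, weights):
    """Hypothetical helper: weighted sum of the field embeddings."""
    dim = len(next(iter(field_embeddings.values())))
    pooled = [0.0] * dim
    for field, vector in field_embeddings.items():
        for i, x in enumerate(vector):
            pooled[i] += weights[field] * x
    return pooled

pooled = weighted_pool(toy_field_embeddings, pool_weights)
print(pooled)
```

Under this reading, the two title-based fields together account for over 80% of the final embedding, with `image_phash` contributing the rest.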

## Testing

Again with the best validation model, we can check the performance on the test set:

In [None]:
model.test(datamodule)

01:37:14 INFO:Test positive pair count: 51561


{'test_f1_at_0.3': 0.06538981097626904,
 'test_f1_at_0.5': 0.1812336580390906,
 'test_f1_at_0.7': 0.3759275621308851,
 'test_pair_entity_ratio_at_0.3': 68.53364122508508,
 'test_pair_entity_ratio_at_0.5': 20.659844433641226,
 'test_pair_entity_ratio_at_0.7': 4.811278561011181,
 'test_precision_at_0.3': 0.0338907186234028,
 'test_precision_at_0.5': 0.10161116117965141,
 'test_precision_at_0.7': 0.28589038881254547,
 'test_recall_at_0.3': 0.9266111983863773,
 'test_recall_at_0.5': 0.8374934543550357,
 'test_recall_at_0.7': 0.5487480847927697}

Entity Embed achieves Recall of ~0.99 with Pair-Entity ratio below 100 on a variety of datasets. **Entity Embed aims for high recall at the expense of precision. Therefore, this library is suited for the Blocking/Indexing stage of an Entity Resolution pipeline.** A scalable and noise-tolerant Blocking procedure is often the main bottleneck for performance and quality on Entity Resolution pipelines, so this library aims to solve that. Note the ANN search on embedded records returns several candidate pairs that must be filtered to find the best matching pairs, possibly with a pairwise classifier. See the [Record-Linkage-Example](./Record-Linkage-Example.ipynb) for an example of matching.
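As a very rough idea of what such a filtering step could look like, here's a sketch that keeps only the candidate pairs whose titles are similar enough according to `difflib.SequenceMatcher` from the standard library. This is a simple stand-in for a trained pairwise classifier, not a recommendation. The helper name, the threshold, and the third toy title are made up for illustration:

```python
from difflib import SequenceMatcher

def filter_candidate_pairs(candidate_pairs, titles, min_ratio):
    """Hypothetical helper: keep pairs whose titles have ratio >= min_ratio."""
    kept = []
    for left_id, right_id in candidate_pairs:
        ratio = SequenceMatcher(None, titles[left_id], titles[right_id]).ratio()
        if ratio >= min_ratio:
            kept.append((left_id, right_id))
    return kept

toy_titles = {
    0: 'sandal gladiator tali lebar import 1606-3a',
    1: 'sandal gladiator tali lebar import',
    2: 'charger usb type c fast charging',
}
toy_candidates = [(0, 1), (0, 2), (1, 2)]
kept_pairs = filter_candidate_pairs(toy_candidates, toy_titles, min_ratio=0.8)
print(kept_pairs)
```

Because blocking has already reduced the number of pairs, a comparatively expensive pairwise check like this only runs on the candidates instead of on every possible pair of records.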