[SYSTEMDS-265] Entity resolution pipelines and primitives.
Adds new scripts in `scripts/staging/entity-resolution` that demonstrate
entity clustering and binary entity resolution with SystemDS DML.
See the README at `scripts/staging/entity-resolution/README.md` for more
details.

This is a squash of all commits on branch master from the skogler/systemml fork.

Co-authored-by: Markus Reiter-Haas <iseratho@gmail.com>
skogler and Iseratho committed Jul 20, 2020
1 parent cc6bffd commit 493a52f
Showing 28 changed files with 2,091 additions and 1 deletion.
2 changes: 1 addition & 1 deletion .github/workflows/applicationTests.yml
@@ -35,7 +35,7 @@ jobs:
strategy:
fail-fast: false
matrix:
-      tests: [A,B,C,G,H,I,L,M,N,O,P,S,U,W]
+      tests: [A,B,C,EntityResolution,G,H,I,L,M,N,O,P,S,U,W]
os: [ubuntu-latest]
name: Ap Test ${{ matrix.tests }}
steps:
1 change: 1 addition & 0 deletions dev/Tasks.txt
@@ -236,6 +236,7 @@ SYSTEMDS-260 Misc Tools
* 262 Data augmentation tool for data cleaning OK
* 263 ONNX graph importer (Python API, docs, tests) OK
* 264 ONNX graph exporter
* 265 Entity resolution pipelines and primitives OK

SYSTEMDS-270 Compressed Matrix Blocks
* 271 Reintroduce compressed matrix blocks from SystemML OK
1 change: 1 addition & 0 deletions docs/index.md
@@ -36,6 +36,7 @@ Various forms of documentation for SystemDS are available.

- a [DML language reference](./site/dml-language-reference) for a list of operations possible inside SystemDS.
- [builtin functions](./site/builtins-reference) contains a collection of builtin functions providing a high-level abstraction over complex machine learning algorithms.
- [Entity Resolution](./site/entity-resolution) provides a collection of customizable entity resolution primitives and pipelines.
- [Run SystemDS](./site/run) contains a Hello World example along with an environment setup guide.
- Instructions on Python can be found in the [Python Documentation](./api/python/index)
- The [javadoc API](./api/java/index) contains internal documentation of the system source code.
137 changes: 137 additions & 0 deletions docs/site/entity-resolution.md
@@ -0,0 +1,137 @@
---
layout: site
title: Entity Resolution
---
<!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

## Pipeline design and primitives

We provide two example scripts, `entity-clustering.dml` and `binary-entity-resolution.dml`. These scripts handle reading
input files and writing output files, and call the functions provided in `primitives/pipeline.dml`.

The pipeline design is loosely based on the following paper, but does not use advanced features such as multi-probe LSH,
combining embeddings via LSTM, or classification via machine learning:

```
Ebraheem, Muhammad, et al. "Distributed representations of tuples for entity resolution."
Proceedings of the VLDB Endowment 11.11 (2018): 1454-1467.
```

### Input files

The provided scripts can read two types of input files. The token file is mandatory since it contains the row identifiers;
the embedding file is optional. Whether tokens, embeddings, or both are used can be configured via command-line parameters
to the scripts.

##### Token files

This file type is a CSV file with 3 columns. The first column is the string or integer row identifier, the second is the
string token, and the third is the number of occurrences. This simple format is used as a bag-of-words representation.
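
For illustration, a token file for two entities might look as follows (hypothetical tokens and counts):

```
1,university,2
1,leipzig,1
2,acm,1
2,press,1
```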

##### Embedding files

This file type is a CSV matrix file in which each row is an embedding vector of arbitrary (but fixed) dimensionality. The
rows are assumed to appear in the same order as the row identifiers in the token file. Relying on this ordering saves some
computation and storage, but the assumption could be lifted with some modifications to the example scripts.
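
A matching embedding file for the same two entities could then look like this (hypothetical three-dimensional embeddings;
real embeddings such as GloVe have far more dimensions):

```
0.12,-0.34,0.56
0.78,0.09,-0.21
```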

### Primitives

While the example scripts may be sufficient for many simple use cases, we aim to provide a toolkit of composable functions
to facilitate more complex tasks. The top-level pipelines are defined as functions in `primitives/pipeline.dml`.
The goal is to make it relatively easy to copy one of these pipelines and swap out the primitive functions used,
in order to create a custom pipeline.
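
As a minimal sketch of such a composition (the pipeline function names and signatures are the ones used by the example
scripts; the input matrix here is random placeholder data):

```
source("./scripts/staging/entity-resolution/primitives/pipeline.dml") as pipe;

# X: one row per entity (bag-of-words counts and/or embeddings).
X = rand(rows=100, cols=50);  # placeholder data
threshold = 0.9;

# Naive constant-size blocking with a single block:
CLUSTER = pipe::entity_clustering_pipeline(X, 1, threshold);
# Alternatively, LSH-based blocking with 6 hashtables and 4 hyperplanes:
CLUSTER = pipe::entity_clustering_pipeline_lsh(X, 6, 4, threshold);
```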

To convert the input token file into a bag-of-words contingency table representation, we provide the functions
`convert_frame_tokens_to_matrix_bow` and `convert_frame_tokens_to_matrix_bow_2` in `primitives/preprocessing.dml`.
The latter is used to compute a compatible contingency table with matching vocabulary for binary entity resolution.
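
For example, mirroring the calls in the provided scripts:

```
source("./scripts/staging/entity-resolution/primitives/preprocessing.dml") as pre;

# Single dataset: X is the bag-of-words matrix, MX the row id mapping.
FX = read("data/affiliationstrings/affiliationstrings_tokens.csv");
[X, MX] = pre::convert_frame_tokens_to_matrix_bow(FX);

# Two datasets with a shared vocabulary for binary entity resolution:
FX2 = read("data/DBLP-ACM/DBLP2_tokens.csv");
FY = read("data/DBLP-ACM/ACM_tokens.csv");
[X2, Y, M_tokens, MX_ids, MY_ids] = pre::convert_frame_tokens_to_matrix_bow_2(FX2, FY);
```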

We provide naive, constant-size blocking and locality-sensitive hashing (LSH) as functions in `primitives/blocking.dml`.
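
As a rough conceptual sketch of the random-hyperplane LSH idea (an illustration only, not the actual code in
`primitives/blocking.dml`):

```
# Hash each row of X by the sign pattern of its projections onto
# random hyperplanes; rows with equal codes fall into the same block.
X = rand(rows=100, cols=50);                    # placeholder feature matrix
H = rand(rows=ncol(X), cols=4, min=-1, max=1);  # 4 random hyperplanes
B = (X %*% H) > 0;                              # one sign bit per hyperplane
W = matrix("1 2 4 8", rows=4, cols=1);          # powers of two to pack the bits
codes = B %*% W;                                # integer bucket id per row
```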

For entity clustering, we currently provide a single, simple approach: all connected components in an adjacency matrix
are made fully connected. This function is located in `primitives/clustering.dml`.
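
Conceptually, this amounts to a transitive closure of the match graph; a minimal sketch of the idea (not the actual
implementation in `primitives/clustering.dml`):

```
# A: binary adjacency matrix of matched pairs.
A = matrix("0 1 0 1 0 0 0 0 0", rows=3, cols=3);
C = A;
diff = 1;
while (diff > 0) {
  C_new = ((C %*% C + C) > 0);  # connect neighbors of neighbors
  diff = sum(C_new != C);
  C = C_new;
}
# C now has fully connected components.
```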

To convert an adjacency matrix back into a list of pairs, we provide the functions `untable` and `untable_offset` in
`primitives/postprocessing.dml`.
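
For illustration, assuming `untable` emits one `[i, j, value]` row per non-zero matrix entry (as suggested by its use in
the example scripts):

```
source("./scripts/staging/entity-resolution/primitives/postprocessing.dml") as post;

# S: similarity matrix with one match above the threshold.
S = matrix("0 0 0.95 0 0 0 0 0 0", rows=3, cols=3);
pairs = post::untable(S);  # expected: a single row [1, 3, 0.95]
print(toString(pairs));
```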

Finally, `primitives/evaluation.dml` defines some metrics that can be used to evaluate the performance of the entity
resolution pipelines. They are used in the script `eval-entity-resolution.dml`.

## Testing and Examples

A test data repository that was used to develop these scripts is available at
[skogler/systemds-amls-project-data](https://github.com/skogler/systemds-amls-project-data). The examples below assume that
this repository is cloned as `data` in the SystemDS root folder. The data in that repository is sourced from the Uni Leipzig
entity resolution [benchmark datasets](https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution).

### Preprocessing

Since there is no tokenization functionality in SystemDS yet, we provide a Python preprocessing script in the data repository
that tokenizes the text columns and performs a simple embedding lookup using GloVe embeddings.

The tokens are written as CSV files to enable Bag-of-Words representations as well as matrices with combined embeddings.
Depending on the type of data, one or the other or a combination of both may work better; the SystemDS DML scripts can be
called with different parameters to experiment with this.

### Entity Clustering

In this case we detect duplicates within one database. As an example, we use the benchmark dataset Affiliations from Uni Leipzig.
For this dataset, embeddings do not work well since the data is mostly just names. Therefore, we encode it as Bag-of-Words vectors
in the example below. This dataset would benefit from more preprocessing, as simply matching words for all the different kinds of
abbreviations does not work particularly well.

Example command to run on Affiliations dataset:
```
./bin/systemds ./scripts/staging/entity-resolution/entity-clustering.dml -nvargs FX=data/affiliationstrings/affiliationstrings_tokens.csv OUT=data/affiliationstrings/affiliationstrings_res.csv store_mapping=FALSE MX=data/affiliationstrings/affiliationstrings_MX.csv use_embeddings=FALSE XE=data/affiliationstrings/affiliationstrings_embeddings.csv
```
Evaluation:
```
./bin/systemds ./scripts/staging/entity-resolution/eval-entity-resolution.dml -nvargs FX=data/affiliationstrings/affiliationstrings_res.csv FY=data/affiliationstrings/affiliationstrings_mapping_fixed.csv
```

### Binary Entity Resolution

In this case we detect duplicate pairs of rows between two databases. As an example, we use the benchmark dataset DBLP-ACM from Uni Leipzig.
Embeddings work really well for this dataset, so the results are quite good with an F1 score of 0.89.

Example command to run on DBLP-ACM dataset with embeddings:
```
./bin/systemds ./scripts/staging/entity-resolution/binary-entity-resolution.dml -nvargs FY=data/DBLP-ACM/ACM_tokens.csv FX=data/DBLP-ACM/DBLP2_tokens.csv MX=data/DBLP-ACM_MX.csv OUT=data/DBLP-ACM/DBLP-ACM_res.csv XE=data/DBLP-ACM/DBLP2_embeddings.csv YE=data/DBLP-ACM/ACM_embeddings.csv use_embeddings=TRUE
```
Evaluation:
```
./bin/systemds ./scripts/staging/entity-resolution/eval-entity-resolution.dml -nvargs FX=data/DBLP-ACM/DBLP-ACM_res.csv FY=data/DBLP-ACM/DBLP-ACM_perfectMapping.csv
```

## Future Work

1. Better clustering algorithms.
   1. Correlation clustering.
   2. Markov clustering.
   3. See [this link](https://dbs.uni-leipzig.de/en/publication/title/comparative_evaluation_of_distributed_clustering_schemes_for_multi_source_entity_resolution) for more approaches.
2. Multi-Probe LSH to improve runtime performance.
   1. Probably as a SystemDS built-in to be more efficient.
3. Classifier-based matching.
   1. Using an SVM classifier to decide whether two tuples are duplicates, instead of a similarity threshold.
4. Better/built-in tokenization.
   1. Implement text tokenization as a component of SystemDS.
   2. Offer a choice of different preprocessing and tokenization algorithms (e.g. stemming, word-piece tokenization).
5. Better/built-in embeddings.
   1. Implement embedding generation as a component of SystemDS.
   2. Use an LSTM to compose embeddings.
102 changes: 102 additions & 0 deletions scripts/staging/entity-resolution/binary-entity-resolution.dml
@@ -0,0 +1,102 @@
#-------------------------------------------------------------
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
#-------------------------------------------------------------

#
# THIS SCRIPT PERFORMS AN ENTITY RESOLUTION PIPELINE FOR BINARY MATCHING ON TWO FILES
#
# INPUT PARAMETERS:
# ---------------------------------------------------------------------------------------------
# NAME TYPE DEFAULT MEANING
# ---------------------------------------------------------------------------------------------
# FX String --- Location to read the frame of tokens in bow format for the first dataset
# Each line contains comma separated list of id, token and value
# FY String --- Location to read the frame of tokens in bow format for the second dataset
# Each line contains comma separated list of id, token and value
# OUT String --- Location to save the output of matching pairs
# Each line contains the comma-separated ids of one matched pair:
# the first column is the id from the first dataset, the second column
# is the id from the second dataset, and the third column provides the
# similarity score
# threshold Double 0.9 Similarity threshold above which a pair is considered a match
# num_hashtables Int 6 Number of hashtables for LSH blocking.
# num_hyperplanes Int 4 Number of hyperplanes for LSH blocking.
# use_tokens Boolean TRUE Whether to use the tokens of FX and FY to generate predictions
# use_embeddings Boolean FALSE Whether to use the embeddings of XE and YE to generate predictions
# XE String --- Location to read the frame of embedding matrix for the first dataset
# Required if use_embeddings is set to TRUE
# YE String --- Location to read the frame of embedding matrix for the second dataset
# Required if use_embeddings is set to TRUE
# ---------------------------------------------------------------------------------------------
# OUTPUT: frame of matching pairs
# ---------------------------------------------------------------------------------------------

source("./scripts/staging/entity-resolution/primitives/postprocessing.dml") as post;
source("./scripts/staging/entity-resolution/primitives/preprocessing.dml") as pre;
source("./scripts/staging/entity-resolution/primitives/pipeline.dml") as pipe;

# Command Line Arguments
fileFX = $FX;
fileFY = $FY;
fileOUT = $OUT;

threshold = ifdef($threshold, 0.9);
num_hashtables = ifdef($num_hashtables, 6);
num_hyperplanes = ifdef($num_hyperplanes, 4);

use_tokens = ifdef($use_tokens, TRUE);
use_embeddings = ifdef($use_embeddings, FALSE);
# files XE and YE are only required if using embeddings
fileXE = ifdef($XE, "");
fileYE = ifdef($YE, "");

# Read data
FX = read(fileFX);
FY = read(fileFY);
if (use_embeddings) {
if (fileXE == "" | fileYE == "") {
print("You need to specify file XE and XY when use_embeddings is set to TRUE");
} else {
X_embeddings = read(fileXE);
Y_embeddings = read(fileYE);
}
}

# Convert data
[X, Y, M_tokens, MX_ids, MY_ids] = pre::convert_frame_tokens_to_matrix_bow_2(FX,FY);
if (use_tokens & use_embeddings) {
X = cbind(X, X_embeddings);
Y = cbind(Y, Y_embeddings);
} else if (use_tokens) {
# Nothing to do in this case, since X already contains tokens
} else if (use_embeddings) {
X = X_embeddings;
Y = Y_embeddings;
} else {
print("Either use_tokens or use_embeddings needs to be TRUE, using tokens only as default.");
}
# Perform matching
THRES = pipe::binary_entity_resolution_pipeline_lsh(X, Y, num_hashtables, num_hyperplanes, threshold);
sparse = post::untable(THRES);

# Write results
X_dec = transformdecode(target=sparse[,1], meta=MX_ids[,1], spec="{recode:[C1]}");
Y_dec = transformdecode(target=sparse[,2], meta=MY_ids[,1], spec="{recode:[C1]}");
output = cbind(cbind(X_dec, Y_dec), as.frame(sparse[,3]));
write(output, fileOUT, sep=",", sparse=FALSE, format="csv");
119 changes: 119 additions & 0 deletions scripts/staging/entity-resolution/entity-clustering.dml
@@ -0,0 +1,119 @@
#-------------------------------------------------------------
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
#-------------------------------------------------------------

#
# THIS SCRIPT PERFORMS AN ENTITY RESOLUTION PIPELINE FOR CLUSTERING ON A SINGLE FILE
# CONSISTS OF BLOCKING, MATCHING, AND CLUSTERING
#
# INPUT PARAMETERS:
# ---------------------------------------------------------------------------------------------
# NAME TYPE DEFAULT MEANING
# ---------------------------------------------------------------------------------------------
# FX String --- Location to read the frame of tokens in bow format
# Each line contains comma separated list of id, token and value
# OUT String --- Location to save the output of matching pairs
# Each line contains the comma-separated ids of one matched pair;
# the third column provides the similarity score
# threshold Double 0.9 Similarity threshold above which a pair is considered a match
# blocking_method String lsh Possible values: ["naive", "lsh"].
# num_blocks Int 1 Number of blocks for naive blocking.
# num_hashtables Int 6 Number of hashtables for LSH blocking.
# num_hyperplanes Int 4 Number of hyperplanes for LSH blocking.
# use_tokens Boolean TRUE Whether to use the tokens of FX to generate predictions
# use_embeddings Boolean FALSE Whether to use the embeddings of XE to generate predictions
# XE String --- Location to read the frame of embedding matrix
# Required if use_embeddings is set to TRUE
# store_mapping Boolean FALSE Whether to store the mapping of transformencode
# MX String --- Location to write the frame of mapping
# Required if store_mapping is set to TRUE
# ---------------------------------------------------------------------------------------------
# OUTPUT: frame of matching pairs
# ---------------------------------------------------------------------------------------------

source("./scripts/staging/entity-resolution/primitives/preprocessing.dml") as pre;
source("./scripts/staging/entity-resolution/primitives/postprocessing.dml") as post;
source("./scripts/staging/entity-resolution/primitives/pipeline.dml") as pipe;

# Command Line Arguments
fileFX = $FX;
fileOUT = $OUT;

threshold = ifdef($threshold, 0.9);
blocking_method = ifdef($blocking_method, "lsh");
num_blocks = ifdef($num_blocks, 1);
num_hyperplanes = ifdef($num_hyperplanes, 4);
num_hashtables = ifdef($num_hashtables, 6);
use_tokens = ifdef($use_tokens, TRUE);
use_embeddings = ifdef($use_embeddings, FALSE);
# file XE is only required if using embeddings
fileXE = ifdef($XE, "");
# mapping file is required for evaluation
store_mapping = ifdef($store_mapping, FALSE);
fileMX = ifdef($MX, "");

if (!(blocking_method == "naive" | blocking_method == "lsh")) {
stop("blocking_method must be one of ['naive', 'lsh']");
}

# Read data
FX = read(fileFX);
if (use_embeddings) {
if (fileXE == "") {
print("You need to specify file XE when use_embeddings is set to TRUE");
} else {
X_embeddings = read(fileXE);
}
}

# Convert data
[X, MX] = pre::convert_frame_tokens_to_matrix_bow(FX);
if (use_tokens & use_embeddings) {
X = cbind(X, X_embeddings);
} else if (use_tokens) {
# Nothing to do in this case, since X already contains tokens
} else if (use_embeddings) {
X = X_embeddings;
} else {
print("Either use_tokens or use_embeddings needs to be TRUE, using tokens only as default.");
}

if (store_mapping) {
if (fileMX == "") {
print("You need to specify file MX when store_mapping is set to TRUE.");
} else {
write(MX, fileMX);
}
}

# Perform clustering
if (blocking_method == "naive") {
CLUSTER = pipe::entity_clustering_pipeline(X, num_blocks, threshold);
} else if (blocking_method == "lsh") {
CLUSTER = pipe::entity_clustering_pipeline_lsh(X, num_hashtables, num_hyperplanes, threshold);
}
MATCH = (CLUSTER > 0);

# Write results
sparse = post::untable(CLUSTER);
dec = transformdecode(target=sparse, meta=cbind(MX[,1],MX[,1]), spec="{recode:[C1,C2]}");
output = cbind(dec, as.frame(sparse[,3]));
write(output, fileOUT, sep=",", sparse=FALSE, format="csv");
