[SYSTEMDS-265] Entity resolution pipelines and primitives.

Adds new scripts in `scripts/staging/entity-resolution` that demonstrate entity clustering and binary entity resolution with SystemDS DML. See the README at `scripts/staging/entity-resolution/README.md` for more details. This is a squash of all commits on branch master from the skogler/systemml fork. Co-authored-by: Markus Reiter-Haas <iseratho@gmail.com>
apache · Jul 20, 2020 · 493a52f · 493a52f
1 parent cc6bffd
commit 493a52f
Show file tree

Hide file tree

Showing 28 changed files with 2,091 additions and 1 deletion.
diff --git a/.github/workflows/applicationTests.yml b/.github/workflows/applicationTests.yml
@@ -35,7 +35,7 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        tests: [A,B,C,G,H,I,L,M,N,O,P,S,U,W]
+        tests: [A,B,C,EntityResolution,G,H,I,L,M,N,O,P,S,U,W]
         os: [ubuntu-latest]
     name:  Ap Test ${{ matrix.tests }} 
     steps:

diff --git a/dev/Tasks.txt b/dev/Tasks.txt
@@ -236,6 +236,7 @@ SYSTEMDS-260 Misc Tools
  * 262 Data augmentation tool for data cleaning                       OK
  * 263 ONNX graph importer (Python API, docs, tests)                  OK
  * 264 ONNX graph exporter
+ * 265 Entity resolution pipelines and primitives                     OK
 
 SYSTEMDS-270 Compressed Matrix Blocks
  * 271 Reintroduce compressed matrix blocks from SystemML             OK

diff --git a/docs/index.md b/docs/index.md
@@ -36,6 +36,7 @@ Various forms of documentation for SystemDS are available.
 
 - a [DML language reference](./site/dml-language-reference) for an list of operations possible inside SystemDS.
 - [builtin functions](./site/builtins-reference) contains a collection of builtin functions providing an high level abstraction on complex machine learning algorithms.
+- [Entity Resolution](./site/entity-resolution) provides a collection of customizable entity resolution primitives and pipelines.
 - [Run SystemDS](./site/run) contains an Helloworld example along with an environment setup guide.
 - Instructions on python can be found at [Python Documentation](./api/python/index)
 - The [javadoc API](./api/java/index) contains internal documentation of the system source code.

diff --git a/docs/site/entity-resolution.md b/docs/site/entity-resolution.md
@@ -0,0 +1,137 @@
+---
+layout: site
+title: Entity Resolution
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Pipeline design and primitives
+
+We provide two example scripts, `entity-clustering.dml` and `binary-entity-resolution.dml`. These handle reading input 
+files and writing output files and call functions provided in `primitives/pipeline.dml`.
+
+The pipeline design is loosely based on the following paper, but does not use advanced features like multi-probe LSH,
+combining embeddings via LSTM or classification via machine learning.
+
+```
+Ebraheem, Muhammad, et al. "Distributed representations of tuples for entity resolution."
+Proceedings of the VLDB Endowment 11.11 (2018): 1454-1467.
+```
+
+### Input files
+
+The provided scripts can read two types of input files. The token file is mandatory since it contains the row identifiers, 
+but the embedding file is optional. The actual use of tokens and/or embeddings can be configured via command line parameters 
+to the scripts.
+
+##### Token files
+
+This file type is a CSV file with 3 columns. The first column is the string or integer row identifier, the second is the 
+string token, and the third is the number of occurences. This simple format is used as a bag-of-words representation.
+
+##### Embedding files
+
+This file type is a CSV matrix file with each row containing arbitrary-dimensional embeddings. The order of row identifiers
+is assumed to be the same as in the token file. This saves some computation and storage time, but could be changed with 
+some modifications to the example scripts.
+
+### Primitives
+
+While the example scripts may be sufficient for many simple use cases, we aim to provide a toolkit of composable functions
+to facilitate more complex tasks. The top-level pipelines are defined as a couple of functions in `primitives/pipeline.dml`.
+The goal is that it should be relatively easy to copy one of these pipelines and swap out the primitive functions used
+to create a custom pipeline.
+
+To convert the input token file into a bag-of-words contingency table representation, we provide the functions
+`convert_frame_tokens_to_matrix_bow` and `convert_frame_tokens_to_matrix_bow_2` in  `primitives/preprocessing.dml`.
+The latter is used to compute a compatible contigency table with matching vocabulary for binary entity resolution. 
+
+We provide naive, constant-size blocking and locality-sensitive hashing (LSH) as functions in `primitives/blocking.dml`.
+
+For entity clustering, we only provide a simple clustering approach which makes all connected components in an adjacency
+matrix fully connected. This function is located in `primitives/clustering.dml`.
+
+To restore an adjacency matrix to a list of pairs, we provide the functions `untable` and `untable_offset` in 
+`primitives/postprocessing.dml`.
+
+Finally, `primitives/evaluation.dml` defines some metrics that can be used to evaluate the performance of the entity
+resolution pipelines. They are used in the script `eval-entity-resolution.dml`. 
+
+## Testing and Examples
+
+There is a test data repository that was used to develop these scripts at 
+[repo](https://github.com/skogler/systemds-amls-project-data). In the examples below, it is assumed that this repo is 
+cloned as `data` in the SystemDS root folder. The data in that repository is sourced from the Uni Leipzig entity resolution 
+[benchmark](https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution).
+
+### Preprocessing
+
+Since there is no tokenization functionality in SystemDS yet, we provide a Python preprocessing script in the data repository
+that tokenizes the text columns and performs some simple embedding lookup using Glove embeddings.
+
+The tokens are written as CSV files to enable Bag-of-Words representations as well as matrices with combined embeddings. D
+epending on the type of data, one or the other or a combination of both may be better. The SystemDS DML scripts can be 
+called with different parameters to experiment with this.
+
+### Entity Clustering
+
+In this case we detect duplicates within one database. As an example, we use the benchmark dataset Affiliations from Uni Leipzig.
+For this dataset, embeddings do not work well since the data is mostly just names. Therefore, we encode it as Bag-of-Words vectors
+in the example below. This dataset would benefit from more preprocessing, as simply matching words for all the different kinds of
+abbreviations does not work particularly well.
+
+Example command to run on Affiliations dataset:
+```
+./bin/systemds ./scripts/algorithms/entity-resolution/entity-clustering.dml -nvargs FX=data/affiliationstrings/affiliationstrings_tokens.csv OUT=data/affiliationstrings/affiliationstrings_res.csv store_mapping=FALSE MX=data/affiliationstrings/affiliationstrings_MX.csv use_embeddings=FALSE XE=data/affiliationstrings/affiliationstrings_embeddings.csv
+```
+Evaluation:
+```
+./bin/systemds ./scripts/algorithms/entity-resolution/eval-entity-resolution.dml -nvargs FX=data/affiliationstrings/affiliationstrings_res.csv FY=data/affiliationstrings/affiliationstrings_mapping_fixed.csv
+```
+
+### Binary Entity Resolution
+
+In this case we detect duplicate pairs of rows between two databases. As an example, we use the benchmark dataset DBLP-ACM from Uni Leipzig.
+Embeddings work really well for this dataset, so the results are quite good with an F1 score of 0.89.
+
+Example command to run on DBLP-ACM dataset with embeddings:
+```
+./bin/systemds ./scripts/algorithms/entity-resolution/binary-entity-resolution.dml -nvargs FY=data/DBLP-ACM/ACM_tokens.csv FX=data/DBLP-ACM/DBLP2_tokens.csv MX=data/DBLP-ACM_MX.csv OUT=data/DBLP-ACM/DBLP-ACM_res.csv XE=data/DBLP-ACM/DBLP2_embeddings.csv YE=data/DBLP-ACM/ACM_embeddings.csv use_embeddings=TRUE
+```
+Evaluation:
+```
+./bin/systemds ./scripts/algorithms/entity-resolution/eval-entity-resolution.dml -nvargs FX=data/DBLP-ACM/DBLP-ACM_res.csv FY=data/DBLP-ACM/DBLP-ACM_perfectMapping.csv
+```
+
+## Future Work
+
+1. Better clustering algorithms.
+    1. Correlation clustering.
+    2. Markov clustering.
+    3. See [this link](https://dbs.uni-leipzig.de/en/publication/title/comparative_evaluation_of_distributed_clustering_schemes_for_multi_source_entity_resolution) for more approaches.
+2. Multi-Probe LSH to improve runtime performance. 
+    1. Probably as a SystemDS built-in to be more efficient.
+3. Classifier-based matching.
+    1. Using an SVM classifier to decide if two tuple are duplicates instead of a threshold for similarity.
+4. Better/built-in tokenization.
+    1. Implement text tokenization as component of SystemDS.
+    2. Offer choice of different preprocessing and tokenization algorithms (e.g. stemming, word-piece tokenization).
+5. Better/built-in embeddings.
+   1. Implement embedding generation as component of SystemDS.
+   2. Use LSTM to compose embeddings.
diff --git a/scripts/staging/entity-resolution/binary-entity-resolution.dml b/scripts/staging/entity-resolution/binary-entity-resolution.dml
@@ -0,0 +1,102 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+#
+# THIS SCRIPT PERFORMS AN ENTITY RESOLUTION PIPELINE FOR BINARY MATCHING ON TWO FILES
+#
+# INPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME           TYPE   DEFAULT  MEANING
+# ---------------------------------------------------------------------------------------------
+# FX              String  ---     Location to read the frame of tokens in bow format for the first dataset
+#                                 Each line contains comma separated list of id, token and value
+# FY              String  ---     Location to read the frame of tokens in bow format for the second dataset
+#                                 Each line contains comma separated list of id, token and value
+# OUT             String  ---     Location to save the output of maching pairs
+#                                 Each line contains comma separated ids of one matched pair
+#                                 First column is for the first dataset, while second columns is the id of the second one
+#                                 Third column provides the similarity score
+# threshold       Double  0.9     Threshold to be considered as a match
+# num_hashtables  Int     6       Number of hashtables for LSH blocking.
+# num_hyperplanes Int     4       Number of hyperplanes for LSH blocking.
+# use_tokens      Boolean TRUE    Whether to use the tokens of FX and FY to generate predictions
+# use_embeddings  Boolean FALSE   Whether to use the embeddings of XE and YE to generate predictions
+# XE              String  ---     Location to read the frame of embedding matrix for the first dataset
+#                                 Required if use_embeddings is set to TRUE
+# YE              String  ---     Location to read the frame of embedding matrix for the second dataset
+#                                Required if use_embeddings is set to TRUE
+# ---------------------------------------------------------------------------------------------
+# OUTPUT: frame of maching pairs
+# ---------------------------------------------------------------------------------------------
+
+source("./scripts/staging/entity-resolution/primitives/postprocessing.dml") as post;
+source("./scripts/staging/entity-resolution/primitives/preprocessing.dml") as pre;
+source("./scripts/staging/entity-resolution/primitives/pipeline.dml") as pipe;
+
+# Command Line Arguments
+fileFX = $FX;
+fileFY = $FY;
+fileOUT = $OUT;
+
+threshold = ifdef($threshold, 0.9);
+num_hashtables = ifdef($num_hashtables, 6);
+num_hyperplanes = ifdef($num_hyperplanes, 4);
+
+use_tokens = ifdef($use_tokens, TRUE);
+use_embeddings = ifdef($use_embeddings, FALSE);
+# file XE and YE is only required if using embeddings
+fileXE = ifdef($XE, "");
+fileYE = ifdef($YE, "");
+
+# Read data
+FX = read(fileFX);
+FY = read(fileFY);
+if (use_embeddings) {
+  if (fileXE == "" | fileYE == "") {
+    print("You need to specify file XE and XY when use_embeddings is set to TRUE");
+  } else {
+    X_embeddings = read(fileXE);
+    Y_embeddings = read(fileYE);
+  }
+}
+
+# Convert data
+[X, Y, M_tokens, MX_ids, MY_ids] = pre::convert_frame_tokens_to_matrix_bow_2(FX,FY);
+if (use_tokens & use_embeddings) {
+  X = cbind(X, X_embeddings);
+  Y = cbind(Y, Y_embeddings);
+} else if (use_tokens) {
+  # Nothing to do in this case, since X already contains tokens
+} else if (use_embeddings) {
+  X = X_embeddings;
+  Y = Y_embeddings;
+} else {
+  print("Either use_tokens or use_embeddings needs to be TRUE, using tokens only as default.");
+}
+# Perform matching
+THRES = pipe::binary_entity_resolution_pipeline_lsh(X, Y, num_hashtables, num_hyperplanes, threshold);
+sparse = post::untable(THRES);
+
+# Write results
+X_dec = transformdecode(target=sparse[,1], meta=MX_ids[,1], spec="{recode:[C1]}");
+Y_dec = transformdecode(target=sparse[,2], meta=MY_ids[,1], spec="{recode:[C1]}");
+output = cbind(cbind(X_dec, Y_dec), as.frame(sparse[,3]));
+write(output, fileOUT, sep=",", sparse=FALSE, format="csv");
diff --git a/scripts/staging/entity-resolution/entity-clustering.dml b/scripts/staging/entity-resolution/entity-clustering.dml
@@ -0,0 +1,119 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+#
+# THIS SCRIPT PERFORMS AN ENTITY RESOLUTION PIPELINE FOR CLUSTERING ON A SINGLE FILE
+# CONSISTS OF BLOCKING, MATCHING, AND CLUSTERING
+#
+# INPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME           TYPE   DEFAULT  MEANING
+# ---------------------------------------------------------------------------------------------
+# FX              String  ---     Location to read the frame of tokens in bow format
+#                                 Each line contains comma separated list of id, token and value
+# OUT             String  ---     Location to save the output of maching pairs
+#                                 Each line contains comma separated ids of one matched pair
+#                                 Third column provides the similarity score
+# threshold       Double  0.9     Threshold to be considered as a match
+# blocking_method String  naive   Possible values: ["naive", "lsh"].
+# num_blocks      Int     1       Number of blocks for naive blocking
+# num_hashtables  Int     6       Number of hashtables for LSH blocking.
+# num_hyperplanes Int     4       Number of hyperplanes for LSH blocking.
+
+# use_tokens      Boolean TRUE    Whether to use the tokens of FX to generate predictions
+# use_embeddings  Boolean FALSE   Whether to use the embeddings of XE to generate predictions
+# XE              String  ---     Location to read the frame of embedding matrix
+#                                 Required if use_embeddings is set to TRUE
+# store_mapping   Boolean FALSE   Whether to store the mapping of transformencode
+# MX              String  ---     Location to write the frame of mapping
+#                                Required if store_mapping is set to TRUE
+# ---------------------------------------------------------------------------------------------
+# OUTPUT: frame of maching pairs
+# ---------------------------------------------------------------------------------------------
+
+source("./scripts/staging/entity-resolution/primitives/preprocessing.dml") as pre;
+source("./scripts/staging/entity-resolution/primitives/postprocessing.dml") as post;
+source("./scripts/staging/entity-resolution/primitives/pipeline.dml") as pipe;
+
+# Command Line Arguments
+fileFX = $FX;
+fileOUT = $OUT;
+
+threshold = ifdef($threshold, 0.9);
+blocking_method = ifdef($blocking_method, "lsh");
+num_blocks = ifdef($num_blocks, 1);
+num_hyperplanes = ifdef($num_hyperplanes, 4);
+num_hashtables = ifdef($num_hashtables, 6);
+use_tokens = ifdef($use_tokens, TRUE);
+use_embeddings = ifdef($use_embeddings, FALSE);
+# file XE is only required if using embeddings
+fileXE = ifdef($XE, "");
+# mapping file is required for evaluation
+store_mapping = ifdef($store_mapping, FALSE);
+fileMX = ifdef($MX, "");
+
+if (!(blocking_method == "naive" | blocking_method == "lsh")) {
+  print("ERROR: blocking method must be in ['naive', 'lsh']");
+}
+
+# Read data
+FX = read(fileFX);
+if (use_embeddings) {
+  if (fileXE == "") {
+    print("You need to specify file XE when use_embeddings is set to TRUE");
+  } else {
+    X_embeddings = read(fileXE);
+  }
+}
+
+# Convert data
+[X, MX] = pre::convert_frame_tokens_to_matrix_bow(FX);
+if (use_tokens & use_embeddings) {
+  X = cbind(X, X_embeddings);
+} else if (use_tokens) {
+  # Nothing to do in this case, since X already contains tokens
+} else if (use_embeddings) {
+  X = X_embeddings;
+} else {
+  print("Either use_tokens or use_embeddings needs to be TRUE, using tokens only as default.");
+}
+
+if (store_mapping) {
+  if (fileMX == "") {
+    print("You need to specify file MX when store_mapping is set to TRUE.");
+  } else {
+    write(MX, fileMX);
+  }
+}
+
+# Perform clustering
+if (blocking_method == "naive") {
+  CLUSTER = pipe::entity_clustering_pipeline(X, num_blocks, threshold);
+} else if (blocking_method == "lsh") {
+  CLUSTER = pipe::entity_clustering_pipeline_lsh(X, num_hashtables, num_hyperplanes, threshold);
+}
+MATCH = (CLUSTER > 0);
+
+# Write results
+sparse = post::untable(CLUSTER);
+dec = transformdecode(target=sparse, meta=cbind(MX[,1],MX[,1]), spec="{recode:[C1,C2]}");
+output = cbind(dec, as.frame(sparse[,3]));
+write(output, fileOUT, sep=",", sparse=FALSE, format="csv");