[SYSTEMDS-265] Entity resolution pipelines and primitives. #993

skogler · 2020-07-17T15:58:05Z

Adds new scripts in scripts/staging/entity-resolution that demonstrate
entity clustering and binary entity resolution with SystemDS DML.
See the README at scripts/staging/entity-resolution/README.md for more
details.

This is a squash of all commits on branch master from the skogler/systemml fork.

Co-authored-by: Markus Reiter-Haas <iseratho@gmail.com>

skogler · 2020-07-18T09:20:34Z

I'm not sure how to reproduce the test failure of BinaryEntityResolutionTest locally; it always works, with spark-submit as well as with or without Intel MKL.

Baunsgaard · 2020-07-19T13:01:54Z

I have asked Actions to run the tests again. Sometimes the tests fail for arbitrary reasons, and it is something we are looking into.

skogler · 2020-07-19T20:37:43Z

Thanks, it seems that it's related to how the test is called. I'll try to build the Docker image and test with that.

skogler · 2020-07-19T23:29:44Z

Okay, it's a problem with the combination of Maven Surefire and JUnit parameterized tests.

Running the test from IDEA works fine. Removing the parameterization also makes it run with Maven. But I don't really want to do that either :(

skogler · 2020-07-19T23:58:42Z

I have now added @NotThreadSafe annotations, which is consistent with some other tests, but I don't think it works reliably.

skogler · 2020-07-20T00:05:30Z

Yeah, the only way I can find to get the tests to run reliably is to set the Maven Surefire plugin option `parallel` to `none`.

Baunsgaard

Good job! 👍
I really like your scripts and they are nicely commented for clarity.

I also really like the tests. My only real concern is the disabling of parallel testing, but that should be fixable.

pom.xml

Baunsgaard · 2020-07-20T09:19:14Z

scripts/staging/entity-resolution/README.md

+There is a test data repository that was used to develop these scripts at 
+[repo](https://github.com/skogler/systemds-amls-project-data). In the examples below, it is assumed that this repo is 
+cloned as `data` in the SystemDS root folder. The data in that repository is sourced from the Uni Leipzig entity resolution 
+[benchmark](https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution).


How did this implementation compare to this benchmark?

Iseratho · 2020-07-20

On the mentioned website there are no comparison values.
For the DBLP-ACM dataset we get an F1-score of 0.8948.
For the Affiliations dataset we get an F1-score of 0.1429.
Note that we primarily focused on building the primitives and a basic pipeline.
We suspect that these numbers can be improved a lot by focusing more on data preprocessing (e.g., stemming).
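
For reference, an F1-score like the ones quoted above is the harmonic mean of precision and recall over the predicted duplicate pairs. The following is a minimal Python sketch, assuming matches are represented as unordered pairs of record IDs; the function name and the pair representation are illustrative assumptions and are not taken from the DML scripts in this PR.

```python
def f1_score(predicted_pairs, true_pairs):
    """F1 over unordered record-ID pairs (hypothetical helper, not from the PR)."""
    # frozenset makes (a, b) and (b, a) count as the same match.
    predicted = {frozenset(p) for p in predicted_pairs}
    actual = {frozenset(p) for p in true_pairs}
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Two of three predictions are correct, and two of four true matches are found.
score = f1_score([(1, 2), (3, 4), (5, 6)], [(2, 1), (3, 4), (7, 8), (9, 10)])
```

Here precision is 2/3 and recall is 1/2, so the score is 4/7.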

scripts/staging/entity-resolution/README.md

Baunsgaard · 2020-07-20T09:35:32Z

scripts/staging/entity-resolution/README.md

+## Testing and Examples
+
+There is a test data repository that was used to develop these scripts at 
+[repo](https://github.com/skogler/systemds-amls-project-data). In the examples below, it is assumed that this repo is 


I'm not a huge fan of the data being located in this repository, but then again I don't know where we would store such a thing in Apache, or if it is even possible.

Opinions, @mboehm7?

Baunsgaard · 2020-07-20T09:43:48Z

scripts/staging/entity-resolution/entity-clustering.dml

+  CLUSTER = pipe::entity_clustering_pipeline(X, num_blocks, threshold);
+} else if (blocking_method == "lsh") {
+  CLUSTER = pipe::entity_clustering_pipeline_lsh(X, num_hashtables, num_hyperplanes, threshold);
+}


Can you maybe elaborate on the performance of these two techniques?

If possible, could you try adding another case using k-means (only if applicable and you don't have to change much)?

Iseratho · 2020-07-20

These two pipelines differ in their blocking technique. The performance depends on the values chosen for the blocking algorithm (i.e., num_blocks, num_hashtables, and num_hyperplanes).
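
To illustrate how the LSH parameters interact, here is a minimal Python sketch of random-hyperplane blocking. It is not the DML implementation from this PR: only the parameter names `num_hashtables` and `num_hyperplanes` are borrowed from the pipeline call, while the function name, the pure-Python representation, and the `seed` argument are assumptions made for the example. More hyperplanes per table give smaller blocks (fewer comparisons), and more tables give each true duplicate more chances to land in a shared block.

```python
import random
from collections import defaultdict


def lsh_candidate_pairs(vectors, num_hashtables, num_hyperplanes, seed=0):
    """Return index pairs that share a bucket in at least one hash table."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    candidates = set()
    for _ in range(num_hashtables):
        # Each table uses its own set of random hyperplanes through the origin.
        planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(num_hyperplanes)]
        buckets = defaultdict(list)
        for idx, vec in enumerate(vectors):
            # The signature records on which side of each hyperplane the vector lies.
            signature = tuple(
                sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in planes
            )
            buckets[signature].append(idx)
        # Only records in the same bucket are compared later on.
        for members in buckets.values():
            for i in range(len(members)):
                for j in range(i + 1, len(members)):
                    candidates.add((members[i], members[j]))
    return candidates


pairs = lsh_candidate_pairs(
    [[1.0, 0.0, 1.0], [1.0, 0.0, 1.0], [-1.0, 0.0, -1.0]],
    num_hashtables=4,
    num_hyperplanes=3,
)
```

Identical vectors always share a bucket, while a vector and its negation fall on opposite sides of every hyperplane, so only the pair `(0, 1)` is proposed for comparison here.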

I am not sure whether kmeans makes much sense in this case (for duplicate detection). How would you select the number of clusters?

Baunsgaard · 2020-07-21

One method would be the Elbow method to determine k in k-means. I guess something similar could maybe be done with num_hashtables, while leveraging num_hyperplanes for faster computation.

The direction I was going with my question was whether there are other techniques that could be applied here. Since the output was "clusters", I just immediately went to "how would something like k-means do?".
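
As a rough illustration of the Elbow method mentioned here, the following Python sketch computes the within-cluster sum of squared errors (SSE) for increasing k; the elbow is the k after which the curve flattens. Everything in it is an assumption made for the example: it uses 1-D points, a simple deterministic initialization, and hypothetical function names, and it is unrelated to the pipelines in this PR.

```python
def kmeans_sse_1d(points, k, iterations=20):
    """Run Lloyd's k-means on 1-D points and return the within-cluster SSE.

    Assumes 1 <= k <= len(points).
    """
    pts = sorted(points)
    # Deterministic initialization: spread the k centers over the sorted points.
    if k == 1:
        centers = [pts[len(pts) // 2]]
    else:
        centers = [pts[i * (len(pts) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda c: abs(p - centers[c]))
            groups[nearest].append(p)
        # Keep the previous center if a cluster ends up empty.
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return sum(min((p - c) ** 2 for c in centers) for p in pts)


def sse_curve(points, max_k):
    """SSE for k = 1 .. max_k; the elbow is read off this curve."""
    return [kmeans_sse_1d(points, k) for k in range(1, max_k + 1)]


# Three well-separated groups: the SSE drops sharply up to k = 3, then flattens.
points = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2, 20.0, 20.1, 20.2]
curve = sse_curve(points, 4)
```

For duplicate detection the open question raised above remains: the number of true clusters is usually close to the number of records, so the curve may not show a clear elbow.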

src/test/java/org/apache/sysds/test/TestUtils.java

src/test/java/org/apache/sysds/test/applications/EntityResolutionBinaryTest.java

scripts/staging/entity-resolution/primitives/blocking.dml

Baunsgaard · 2020-07-21T09:31:08Z

LGTM

mboehm7 · 2020-07-28T19:02:42Z

LGTM - thanks @skogler and @Iseratho for this substantial new feature. Regarding the example, it's fine to link to the other repo for now; once we make it a builtin function, we will replace this with a link to the original data along with a script for the necessary preprocessing.

During the merge I just made a couple of minor modifications:

Vectorized a few loops and unnecessary operation sequences, for example the padding for table in preprocessing and the computation of the components via outer.
Fixed the formatting in all entity resolution tests (tabs over spaces in Java code).
Fixed a literal replacement rewrite that failed over frame inputs (which resulted from the modified padding).

skogler force-pushed the entity-resolution-pull-request branch from ebfb83c to a34ca1e Compare July 17, 2020 16:06

skogler force-pushed the entity-resolution-pull-request branch 3 times, most recently from da4a90d to c207075 Compare July 19, 2020 23:56

skogler force-pushed the entity-resolution-pull-request branch from c207075 to 3a2107b Compare July 20, 2020 00:08

Baunsgaard reviewed Jul 20, 2020

skogler force-pushed the entity-resolution-pull-request branch from 3a2107b to 493a52f Compare July 20, 2020 18:40

skogler force-pushed the entity-resolution-pull-request branch from 493a52f to 2d613b4 Compare July 20, 2020 18:43

asfgit closed this in ee77fad Jul 28, 2020

skogler deleted the entity-resolution-pull-request branch January 24, 2021 10:17

Baunsgaard left a comment. Inline review comments, in thread order: Baunsgaard (Jul 20, 2020), Iseratho (Jul 20, 2020), Baunsgaard (Jul 20, 2020), Baunsgaard (Jul 20, 2020), Iseratho (Jul 20, 2020), Baunsgaard (Jul 21, 2020).
