[SYSTEMDS-265] Entity resolution pipelines and primitives. #993

Closed

skogler
Contributor

@skogler skogler commented Jul 17, 2020

Adds new scripts in scripts/staging/entity-resolution that demonstrate
entity clustering and binary entity resolution with SystemDS DML.
See the README at scripts/staging/entity-resolution/README.md for more
details.

This is a squash of all commits on branch master from the skogler/systemml fork.

Co-authored-by: Markus Reiter-Haas <iseratho@gmail.com>

@skogler force-pushed the entity-resolution-pull-request branch from ebfb83c to a34ca1e (July 17, 2020 16:06)
@skogler
Contributor Author

skogler commented Jul 18, 2020

I'm not sure how to reproduce the BinaryEntityResulationTest failure locally. It always passes for me, with spark-submit as well as with or without Intel MKL.

@Baunsgaard
Contributor

I have asked Actions to run the tests again. Sometimes the tests fail for arbitrary reasons, which is something we are looking into.

@skogler
Contributor Author

skogler commented Jul 19, 2020

Thanks, it seems that it's related to how the test is called. I'll try to build the Docker image and test with that.

@skogler
Contributor Author

skogler commented Jul 19, 2020

Okay, it's a problem with the combination of Maven Surefire and JUnit parameterized tests.

Running the test from IDEA works fine. Removing the parametrization also makes it run with maven. But I don't really want to do that either :(

@skogler force-pushed the entity-resolution-pull-request branch 3 times, most recently from da4a90d to c207075 (July 19, 2020 23:56)
@skogler
Contributor Author

skogler commented Jul 19, 2020

I have now added @NotThreadSafe annotations, which is consistent with some other tests, but I don't think it works reliably.

@skogler
Contributor Author

skogler commented Jul 20, 2020

Yeah, the only way I can find to get the tests to run reliably is to set the Maven Surefire plugin option `parallel` to `none`.

@skogler force-pushed the entity-resolution-pull-request branch from c207075 to 3a2107b (July 20, 2020 00:08)
Contributor

@Baunsgaard Baunsgaard left a comment

Good job! 👍
I really like your scripts and they are nicely commented for clarity.

I also really like the tests. My only real concern is the disabling of parallel testing, but that should be fixable.

pom.xml (outdated, resolved)
There is a test data repository that was used to develop these scripts at
[repo](https://github.com/skogler/systemds-amls-project-data). In the examples below, it is assumed that this repo is
cloned as `data` in the SystemDS root folder. The data in that repository is sourced from the Uni Leipzig entity resolution
[benchmark](https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution).
Contributor

How did this implementation compare to this benchmark?

Contributor

On the mentioned website there are no comparison values.
For the DBLP-ACM dataset we get an F1-score of 0.8948.
For the Affiliations dataset we get an F1-score of 0.1429.
Note that we primarily focused on building the primitives and a basic pipeline.
We suspect that these numbers can be improved a lot by focusing more on data preprocessing (e.g., stemming).
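
For reference, an F1-score like the ones quoted above combines precision and recall over predicted duplicate pairs. The thread does not say how the numbers were computed, so the following is only a minimal sketch that assumes a pairwise evaluation; the function name `pairwise_f1` and the toy pairs are hypothetical and not taken from the PR.

```python
def pairwise_f1(predicted_pairs, true_pairs):
    """Harmonic mean of precision and recall over unordered record pairs."""
    # Treat (a, b) and (b, a) as the same pair.
    predicted = {frozenset(p) for p in predicted_pairs}
    truth = {frozenset(p) for p in true_pairs}
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: 2 of 3 predicted pairs are correct, 2 of 4 true pairs are found.
predicted = [(1, 2), (3, 4), (5, 6)]
truth = [(2, 1), (3, 4), (7, 8), (9, 10)]
score = pairwise_f1(predicted, truth)
```

Here precision is 2/3 and recall is 1/2, so the score is 4/7. A low value such as the one reported for the Affiliations dataset means that precision, recall, or both are poor on that data.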

scripts/staging/entity-resolution/README.md (outdated, resolved)
scripts/staging/entity-resolution/README.md (outdated, resolved)
scripts/staging/entity-resolution/README.md (outdated, resolved)
## Testing and Examples

There is a test data repository that was used to develop these scripts at
[repo](https://github.com/skogler/systemds-amls-project-data). In the examples below, it is assumed that this repo is
Contributor

I'm not a huge fan of the data being located in this repository, but then again I don't know where we would store such a thing in Apache, or if it is even possible.

Opinions, @mboehm7?

    CLUSTER = pipe::entity_clustering_pipeline(X, num_blocks, threshold);
} else if (blocking_method == "lsh") {
    CLUSTER = pipe::entity_clustering_pipeline_lsh(X, num_hashtables, num_hyperplanes, threshold);
}
Contributor

Can you maybe elaborate on performance for these two techniques?

If possible, could you try adding another case using kmeans (only if applicable and if you don't have to change much)?

Contributor

These two pipelines differ in their blocking technique. The performance depends on the values chosen for the blocking algorithm (i.e., num_blocks, num_hashtables, and num_hyperplanes).
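
To illustrate how `num_hashtables` and `num_hyperplanes` can influence blocking, here is a minimal Python sketch of random-hyperplane LSH. It is not the DML code from this PR; the function name `lsh_candidate_pairs`, the `seed` parameter, and the toy vectors are assumptions made for illustration only.

```python
import random
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(vectors, num_hashtables, num_hyperplanes, seed=0):
    """Return index pairs (i, j), i < j, that share a bucket in at least one hash table."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    candidates = set()
    for _ in range(num_hashtables):
        # Each table uses its own set of random hyperplanes through the origin.
        planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(num_hyperplanes)]
        buckets = defaultdict(list)
        for idx, vec in enumerate(vectors):
            # The signature records on which side of each hyperplane the vector lies.
            signature = tuple(
                sum(p * v for p, v in zip(plane, vec)) >= 0.0 for plane in planes
            )
            buckets[signature].append(idx)
        # Only records within the same bucket are compared later.
        for members in buckets.values():
            candidates.update(combinations(members, 2))
    return candidates


# Hypothetical toy data: records 0 and 1 are identical, record 2 points the opposite way.
vectors = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [-1.0, -2.0, -3.0]]
pairs = lsh_candidate_pairs(vectors, num_hashtables=4, num_hyperplanes=3)
```

In this scheme, more hyperplanes per table produce smaller buckets and therefore fewer comparisons, while more hash tables give similar records more chances to land in a common bucket.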

I am not sure whether kmeans makes much sense in this case (for duplicate detection). How would you select the number of clusters?

Contributor

One option would be the elbow method to determine k in k-means. I guess something similar could be done with num_hashtables, while leveraging num_hyperplanes for faster computation.

The direction I was going with my question was whether there are other techniques that could be applied here. Since the output was "clusters", I immediately went to "how would something like k-means do?".
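
As a rough illustration of the elbow method mentioned here, the sketch below runs a simple one-dimensional k-means for several values of k and picks the k where the decrease in within-cluster error slows down the most. The second-difference rule is just one simple heuristic, and the helper names `kmeans_1d` and `elbow` as well as the toy data are hypothetical; none of this comes from the PR.

```python
def kmeans_1d(points, k, iterations=20):
    """Lloyd's algorithm on 1-D data; returns the within-cluster sum of squares."""
    ordered = sorted(points)
    # Deterministic initialisation: spread the k centers evenly over the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in ordered:
            nearest = min(range(k), key=lambda c: abs(x - centers[c]))
            clusters[nearest].append(x)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sum(min((x - c) ** 2 for c in centers) for x in ordered)


def elbow(points, max_k):
    """Pick k in 2..max_k-1 with the largest second difference of the error curve.

    max_k must be at least 3 so that at least one second difference exists.
    """
    inertia = [kmeans_1d(points, k) for k in range(1, max_k + 1)]
    curvature = [inertia[i - 1] - 2 * inertia[i] + inertia[i + 1]
                 for i in range(1, len(inertia) - 1)]
    # curvature[0] belongs to k = 2, hence the offset.
    return curvature.index(max(curvature)) + 2


# Hypothetical toy data with two well-separated groups.
points = [1.0, 1.2, 0.8, 9.0, 9.2, 8.8]
best_k = elbow(points, max_k=4)
```

For duplicate detection the number of true clusters is usually close to the number of records, so the question raised above about whether k-means fits this setting still stands.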

src/test/java/org/apache/sysds/test/TestUtils.java (outdated, resolved)
@skogler force-pushed the entity-resolution-pull-request branch from 3a2107b to 493a52f (July 20, 2020 18:40)
@skogler force-pushed the entity-resolution-pull-request branch from 493a52f to 2d613b4 (July 20, 2020 18:43)
@Baunsgaard
Contributor

LGTM

@mboehm7
Contributor

mboehm7 commented Jul 28, 2020

LGTM - thanks @skogler and @Iseratho for this substantial new feature. Regarding the example, it's fine to link to the other repo for now; once we make it a builtin function, we will replace this with a link to the original data along with a script for the necessary preprocessing.

During the merge I just made a couple of minor modifications:

  • Vectorized a few loops and unnecessary operation sequences, for example, padding for table in preprocessing and computing the components via outer.
  • Fixed the formatting in all entity resolution tests (tabs over spaces in Java code).
  • Fixed a literal replacement rewrite that failed over frame inputs (which resulted from the modified padding).

@asfgit asfgit closed this in ee77fad Jul 28, 2020
@skogler skogler deleted the entity-resolution-pull-request branch January 24, 2021 10:17