[SYSTEMDS-265] Entity resolution pipelines and primitives. #993
ebfb83c to a34ca1e
Not sure how to reproduce the test failure of BinaryEntityResolutionTest locally. It always passes for me, with spark-submit as well as with and without Intel MKL.
I have asked Actions to execute the tests again. Sometimes the tests fail for arbitrary reasons, which is something we are looking into.
Thanks, it seems that it's related to how the test is called. I'll try to build the Docker image and test with that.
Okay, it's a problem with the combination of Maven Surefire and JUnit parameterized tests. Running the test from IDEA works fine. Removing the parameterization also makes it run with Maven, but I don't really want to do that either :(
da4a90d to c207075
I added |
Yeah, the only way I can find to get the tests to run reliably is to set the maven surefire plugin option |
c207075 to 3a2107b
Good job! 👍
I really like your scripts and they are nicely commented for clarity.
I also really like the tests. My only real concern is the disabling of parallel testing, but that should be fixable.
There is a test data repository that was used to develop these scripts at
[repo](https://github.com/skogler/systemds-amls-project-data). In the examples below, it is assumed that this repo is
cloned as `data` in the SystemDS root folder. The data in that repository is sourced from the Uni Leipzig entity resolution
[benchmark](https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution).
How did this implementation compare to this benchmark?
On the mentioned website there are no comparison values.
For the DBLP-ACM dataset we get an F1-score of 0.8948.
For the Affiliations dataset we get an F1-score of 0.1429.
Note that we primarily focused on building the primitives and a basic pipeline.
We suspect that these numbers can be improved a lot by focusing more on data preprocessing (e.g., stemming).
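For reference, F1 is the harmonic mean of precision and recall. The sketch below shows one common way to compute it for entity resolution, over sets of matched record pairs. The function name `pairwise_f1` and the example pairs are hypothetical, and the evaluation used for the numbers above may be computed differently.

```python
def pairwise_f1(predicted_pairs, true_pairs):
    """Return (precision, recall, f1) over sets of unordered record-id pairs."""
    # Normalize each pair so that (a, b) and (b, a) count as the same match.
    predicted = {frozenset(p) for p in predicted_pairs}
    truth = {frozenset(p) for p in true_pairs}
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: 3 predicted matches, 2 of them correct, 4 true matches.
predicted = [(1, 2), (3, 4), (5, 6)]
truth = [(2, 1), (3, 4), (7, 8), (9, 10)]
precision, recall, f1 = pairwise_f1(predicted, truth)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

A low F1 such as the one on the Affiliations dataset can come from either low precision or low recall, so looking at both values separately usually says more than the combined score.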
## Testing and Examples

There is a test data repository that was used to develop these scripts at
[repo](https://github.com/skogler/systemds-amls-project-data). In the examples below, it is assumed that this repo is
I'm not a huge fan of the data being located in this repository, but then again I don't know where we would store such a thing in Apache or whether it is even possible.
Opinions, @mboehm7?
  CLUSTER = pipe::entity_clustering_pipeline(X, num_blocks, threshold);
} else if (blocking_method == "lsh") {
  CLUSTER = pipe::entity_clustering_pipeline_lsh(X, num_hashtables, num_hyperplanes, threshold);
}
Can you elaborate on the performance of these two techniques?
If possible, could you try adding another case using k-means (only if applicable and you don't have to change much)?
These two pipelines differ in their blocking technique. The performance depends on the values chosen for the blocking algorithm (i.e., num_blocks, num_hashtables, and num_hyperplanes).
I am not sure whether kmeans makes much sense in this case (for duplicate detection). How would you select the number of clusters?
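To illustrate how `num_hashtables` and `num_hyperplanes` influence blocking, here is a minimal pure-Python sketch of random-hyperplane LSH. It shows the general technique only and is not the DML code from this PR; `lsh_candidate_pairs` and the toy vectors are hypothetical.

```python
import random
from itertools import combinations


def lsh_candidate_pairs(vectors, num_hashtables, num_hyperplanes, seed=0):
    """Return the set of index pairs that share a bucket in at least one table."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    candidates = set()
    for _ in range(num_hashtables):
        # Each table draws its own random hyperplanes (normal vectors).
        planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(num_hyperplanes)]
        buckets = {}
        for idx, vec in enumerate(vectors):
            # The hash key records on which side of each hyperplane the vector lies.
            key = tuple(
                sum(p * v for p, v in zip(plane, vec)) >= 0.0 for plane in planes
            )
            buckets.setdefault(key, []).append(idx)
        # Only records in the same bucket become candidates for comparison.
        for members in buckets.values():
            candidates.update(combinations(members, 2))
    return candidates


# Toy data: rows 0 and 1 are identical, row 2 points in the opposite direction.
X = [[1.0, 2.0, 0.5], [1.0, 2.0, 0.5], [-1.0, -2.0, -0.5]]
pairs = lsh_candidate_pairs(X, num_hashtables=4, num_hyperplanes=3)
print(sorted(pairs))
```

In this scheme, more hyperplanes per table give smaller buckets and fewer comparisons but miss more true duplicates, while more hash tables give similar records additional chances to collide at the cost of extra work.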
One method would be the elbow method to determine the number of clusters k in k-means. I guess something similar could be done with num_hashtables, while leveraging num_hyperplanes for faster computation.
The direction I was going with my question was whether there are other techniques that could be applied here. Since the output was "clusters", I immediately went to "how would something like k-means do?".
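As a rough illustration of the elbow method: compute the within-cluster sum of squares (WCSS) for a range of k and pick the k where the curve bends most sharply. The sketch below takes precomputed WCSS values and uses the largest second difference as the bend criterion, which is one simple heuristic among several. `pick_elbow` and the numbers are hypothetical.

```python
def pick_elbow(ks, wcss):
    """Return the k at which the WCSS curve bends most sharply.

    Assumes the values in ks are evenly spaced. Uses the discrete second
    difference as the bend measure, so the first and last k are never selected.
    """
    if len(ks) != len(wcss) or len(ks) < 3:
        raise ValueError("need at least three (k, wcss) points of equal length")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        # Large positive value: steep drop before k, flat curve after k.
        bend = (wcss[i - 1] - wcss[i]) - (wcss[i] - wcss[i + 1])
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Hypothetical WCSS values for k = 1..6; the curve flattens after k = 3.
ks = [1, 2, 3, 4, 5, 6]
wcss = [1000.0, 520.0, 180.0, 150.0, 130.0, 120.0]
print(pick_elbow(ks, wcss))
```

For duplicate detection the expected number of clusters is close to the number of records, so the WCSS curve may not show a clear bend at all, which is in line with the doubt raised earlier in this thread.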
3a2107b to 493a52f
Adds new scripts in `scripts/staging/entity-resolution` that demonstrate entity clustering and binary entity resolution with SystemDS DML. See the README at `scripts/staging/entity-resolution/README.md` for more details. This is a squash of all commits on branch master from the skogler/systemml fork. Co-authored-by: Markus Reiter-Haas <iseratho@gmail.com>
493a52f to 2d613b4
LGTM
LGTM - thanks @skogler and @Iseratho for this substantial new feature. Regarding the example, it's fine to link to the other repo for now; once we make it a builtin function, we will replace this with a link to the original data along with a script for the necessary preprocessing. During the merge I just made a couple of minor modifications:
Adds new scripts in `scripts/staging/entity-resolution` that demonstrate entity clustering and binary entity resolution with SystemDS DML. See the README at `scripts/staging/entity-resolution/README.md` for more details.
This is a squash of all commits on branch master from the skogler/systemml fork.
Co-authored-by: Markus Reiter-Haas <iseratho@gmail.com>