A High-Performance Approach to String Similarity using Most Frequent K Characters

Abstract

The Data Web has been growing significantly over the last decades. Thus, linking resources across heterogeneous knowledge bases becomes an increasingly difficult problem, in particular w.r.t. the runtime of link discovery tasks. Consequently, it is of utmost importance to provide time-efficient approaches for link discovery on the Data Web. While a number of scalable approaches have been developed for various measures, the most frequent k characters measure has not been tackled in previous works. We hence present a sequence of filters that allow discarding comparisons when executing bounded similarity computations without losing any recall. Therewith, we can reduce the runtime of bounded similarity computations by approximately 70%. Our experiments with a single-threaded, a parallel and a GPU implementation of our filters suggests that our approach scales well even when dealing with millions of potential comparisons.

Pre-requisites

Java 8+ (http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html)
Jena API (https://jena.apache.org/download/index.cgi).

How to run

Run the main method in the following classes to see the results.

MFKC.java is the implementation of the algorithm.
MFKC_Parallel.java is the parallel implementation using Java 8 resources.
In MFKC_Sep.java, the filters are executed separately.
ExperimentMFKC.java contains all experiments.

Datasets used:

dbPediaPlaces.tsv is the dataset from DBpedia
linkedGeoPlaces_million.tsv is the dataset from LinkedGeoData
dbpedia_yago_labels_10k.tsv contains label alignments - download

Publication:

This work was published as a full paper was published at ISWC 2017 on the Ontology Matching workshop.

http://disi.unitn.it/~pavel/om2017/papers/om2017_Tpaper1.pdf

Name		Name	Last commit message	Last commit date
Latest commit History 82 Commits
dbpedia-yago		dbpedia-yago
experiments		experiments
AndreMFKC.java		AndreMFKC.java
AndreMFKC_Parallel.java		AndreMFKC_Parallel.java
AndreMFKC_Parallel_File.java		AndreMFKC_Parallel_File.java
AndreMFKC_Sep.java		AndreMFKC_Sep.java
AndreMFKC_new.java		AndreMFKC_new.java
ExperimentMFKC.java		ExperimentMFKC.java
Jaccard.java		Jaccard.java
JaroWinkler.java		JaroWinkler.java
LICENSE		LICENSE
Main.java		Main.java
Naive.java		Naive.java
README.md		README.md
dbPediaPlaces.tsv		dbPediaPlaces.tsv
dbpedia_yago_1000.tsv		dbpedia_yago_1000.tsv
linkedGeoPlaces_million.tsv		linkedGeoPlaces_million.tsv
p.tsv		p.tsv
sim_eval.py		sim_eval.py

License

dice-group/MFKC

Folders and files

Latest commit

History

Repository files navigation

A High-Performance Approach to String Similarity using Most Frequent K Characters

Abstract

Pre-requisites

How to run

Publication:

About

Resources

License

Stars

Watchers

Forks

Languages