FiLiPo is a system designed to simplify data integration. To do this, it determines a mapping between the schema of a local knowledge base and that of an API. This mapping specifies how the API's data can be integrated into the local knowledge base. FiLiPo is intended to be usable by non-technical users (e.g. data curators), so only a few parameters need to be specified.
- Tobias Zeimetz, Ralf Schenkel
Sample Driven Data Mapping for Linked Data and Web APIs
In CIKM Demo Track 2020
- Demo Video: Link
- CIKM Source Code:
Local Knowledge Base | Aligned Web API |
---|---|
dblp[2] | CrossRef[4], SciGraph[5], Semantic Scholar (DOI)[6], Semantic Scholar (ArXiv-Key)[6], Open Citations[10], ArXiv[7], Elsevier[8] |
Linked Movie DB[3] | Open Movie Database API[9], The Movie Database[11] |
IMDB[12] | Open Movie Database API[9] |
We evaluated FiLiPo on several knowledge bases and Web APIs using the metrics precision, recall and F1 score. The average values were calculated over multiple test series. The runtime of FiLiPo was between 15 and 45 minutes, depending on the sample size and the response time of the Web API. FiLiPo achieved a precision between 0.73 and 1.00 and a recall between 0.66 and 1.00. Values close to 1.00 were achieved mainly because there were only a few possible alignments. The corresponding F1 scores are between 0.69 and 0.95.
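For readers unfamiliar with the three metrics, the following minimal sketch shows how they are commonly computed when a set of found alignments is compared against a gold standard. The function name, the representation of an alignment as a pair, and the example relation and path names are assumptions for illustration; they are not taken from FiLiPo's evaluation code.

```python
def precision_recall_f1(found, expected):
    """Return (precision, recall, F1) for found alignments vs. a gold standard."""
    found, expected = set(found), set(expected)
    true_positives = len(found & expected)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical alignments: (local relation, path in the API response).
gold = {("title", "message.title"), ("year", "message.issued"), ("doi", "message.DOI")}
found = {("title", "message.title"), ("doi", "message.DOI"), ("pages", "message.volume")}

p, r, f1 = precision_recall_f1(found, gold)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

In this made-up example two of the three found alignments are correct and two of the three expected alignments are found, so all three metrics come out at roughly 0.67.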
We used the string similarity framework by Baltes et al. [1]. The table below lists all string similarity methods that can be used. Note that for the n-gram and n-shingle methods, n can take the values 2, 3, 4 or 5.
Category | Methods |
---|---|
Equal | Equal, Equal Normalized, Token Equal, Token Equal Normalized |
Edit-based | Levenshtein, Levenshtein Normalized, Damerau-Levenshtein, Damerau-Levenshtein Normalized, Optimal-Alignment, Optimal-Alignment Normalized, Longest-Common-Subsequence, Longest-Common-Subsequence Normalized |
Set-based | Jaccard Token, Jaccard Token Normalized, Sorensen-Dice Token, Sorensen-Dice Token Normalized, Overlap Token*, Overlap Token Normalized*, Jaccard n-grams, Jaccard n-grams Normalized, Jaccard n-grams Normalized Padding, Sorensen-Dice n-grams, Sorensen-Dice n-grams Normalized, Sorensen-Dice n-grams Normalized Padding, Overlap n-grams*, Overlap n-grams Normalized*, Overlap n-grams Normalized Padding*, Jaccard n-shingles, Jaccard n-shingles Normalized, Sorensen-Dice n-shingles, Sorensen-Dice n-shingles Normalized, Overlap n-shingles, Overlap n-shingles Normalized |
* We do not recommend using these methods as they may lead to inaccurate results. Only experts should use them.
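To give an intuition for two of the categories in the table, here is a small, self-contained sketch of an edit-based measure (normalized Levenshtein) and a set-based measure (Jaccard over character n-grams). It is an illustration only: the function names are made up, and the framework by Baltes et al. may differ in details such as normalization, padding and how repeated n-grams are counted.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_normalized(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def ngrams(text: str, n: int) -> set:
    """Set of character n-grams (simplification: duplicates are ignored)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_ngrams(a: str, b: str, n: int = 2) -> float:
    """Size of the n-gram intersection divided by the size of the union."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(levenshtein_normalized("kitten", "sitting"))
print(jaccard_ngrams("night", "nacht", n=2))
```

Edit-based measures tend to tolerate small typos, while n-gram measures are less sensitive to reordered parts of a string, which is one reason to offer both families.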
This section gives a brief overview of the configuration options available to an expert user. First, the global settings are explained. They control the output of the programme, the level of detail in the log file and so on. Afterwards, the alignment settings are described, which a technical user can use to fine-tune the system.
{
"globals": {
"logpath":"res/log/", // Path to log files
"outpath": "res/", // Path of programme output
"dbpath": "database.json", // Path to locales
"scpath": "supconf.json",
"secretpath": "secrets.json", // Path to Web API secrets
"ipc": "tcp://*:5555", // Address to communicate with
// the gradient boosting classifier
"timeout": "500", // Waiting time after an API request
// in order to prevent flooding
"mode": "0", // 0 non-technical user, 1 technical user
// 2 evaluation, 3 demo
"loglevel": "0" // Range of 1-4
}
...
"linkage_config": {
"similarity_requests": "100", // Requests sent to a Web API (sample size)
"candidate_requests": "25", // Probing size (number of initial requests)
"string_similarity": "0.5", // How similar two strings need to be in order
// to count as equal (e.g. two titles)
"record_similarity": "0.1", // Overlap between data records required for a
// response to count as valid
"distribution_variance": "0.4", //
"candidate_responses": "0.1", //
"error_threshold": "0.8", //
"traversal_depth": "2", //
"functionality_threshold": "0.99", // Every relation that has a functionality
// greater than 0.99 is considered an identifier
"classifier": "regex", // Used to specify if the regular expression
// approach (regex) will be used or the gradient
// boosting classifier (gbc)
"support_mode": "0",
"min_support_match": "0.5",
"min_support_nonmatch": "0.1",
"similarity_metrics": [
...
]
},
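The excerpt above stores every value as a string and is annotated with `//` comments, which standard JSON parsers reject. Whether the comments are meant to appear in the real config.json or only serve as documentation here is not stated. The following sketch, under that uncertainty, shows one way to read such a file: it removes `//` comments outside of string literals (so that a value like `tcp://*:5555` survives) and converts selected values to numbers. The helper name and the shortened sample are assumptions for illustration.

```python
import json


def strip_line_comments(text: str) -> str:
    """Remove // comments that are outside of JSON string literals."""
    out = []
    in_string = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < len(text):
                out.append(text[i + 1])  # keep the escaped character as-is
                i += 1
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
            out.append(ch)
        elif text.startswith("//", i):
            # Skip to the end of the line; the newline itself is kept.
            while i < len(text) and text[i] != "\n":
                i += 1
            continue
        else:
            out.append(ch)
        i += 1
    return "".join(out)


# Shortened, hypothetical sample using keys from the excerpt above.
raw = '''
{
  "globals": {
    "ipc": "tcp://*:5555",   // Address of the classifier
    "timeout": "500"         // Waiting time after an API request
  },
  "linkage_config": {
    "similarity_requests": "100",  // Sample size
    "string_similarity": "0.5"
  }
}
'''

config = json.loads(strip_line_comments(raw))
timeout = int(config["globals"]["timeout"])
threshold = float(config["linkage_config"]["string_similarity"])
print(config["globals"]["ipc"], timeout, threshold)
```

A naive approach that cuts every line at the first `//` would truncate the `ipc` address, which is why the sketch tracks whether it is inside a string.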
This section of the config.json file is used to add rules (similar to a regular expression) to the RegExer class.
"ruleset":[
{
"name": "isbn-issn",
"filter": "-" // The user can specify in the filter field which
// characters will be ignored when comparing values
},
{
"name": "insensitive-uri",
"filter": "/i" // The flag /i is used to specify that cases will be
// ignored when comparing values
},
{
"name": "fuzzy",
"filter": "/f" // Use the best matching similarity method
// (this can lead to erroneous results)
}
]
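The effect of the three filters can be illustrated with a short sketch. This is not the RegExer class itself: the function name, the way a rule is selected, and the use of Python's `difflib` as a stand-in for "the best matching similarity method" are all assumptions made for the example, as is the fuzzy threshold of 0.5.

```python
from difflib import SequenceMatcher


def values_match(local_value: str, api_value: str, rule_filter: str,
                 fuzzy_threshold: float = 0.5) -> bool:
    """Compare two values under a rule filter (illustrative semantics only)."""
    if rule_filter == "/i":
        # Case-insensitive comparison.
        return local_value.casefold() == api_value.casefold()
    if rule_filter == "/f":
        # Stand-in for a fuzzy comparison; threshold is a hypothetical default.
        ratio = SequenceMatcher(None, local_value, api_value).ratio()
        return ratio >= fuzzy_threshold
    # Otherwise treat the filter as a set of characters to ignore.
    ignore = str.maketrans("", "", rule_filter)
    return local_value.translate(ignore) == api_value.translate(ignore)


# An ISBN with and without hyphens, compared under the "isbn-issn" filter.
print(values_match("978-3-16-148410-0", "9783161484100", "-"))
# URIs that differ only in case, compared under the "insensitive-uri" filter.
print(values_match("HTTP://Example.org/A", "http://example.org/a", "/i"))
```

Keeping the exact filters strict and reserving fuzzy matching for an explicit rule mirrors the warning in the comment above that fuzzy comparison can produce wrong matches.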
If you find FiLiPo useful in your research, please consider citing the following paper:
@inproceedings{filipo,
author = {Zeimetz, Tobias and Schenkel, Ralf},
title = {Sample Driven Data Mapping for Linked Data and Web APIs},
year = {2020},
url = {https://doi.org/10.1145/3340531.3417438},
doi = {10.1145/3340531.3417438},
booktitle = {Proceedings of the 29th ACM International Conference on Information \& Knowledge Management},
pages = {3481–3484}
}
- Version 1.2: Minor bug fixes. Added an output file that is used by an Angular GUI to present the results in an easily understandable way. The corresponding GUI will be published with the next update.
- Version 1.1: Added functionality to determine joint features. This feature is used to find out which commonalities the entities that led to a response from the API had. For example, you can find out that an API only responds to articles from a specific publisher.
- [1] String-Similarity by Baltes et al., GitHub
- [2] dblp
- [3] Linked Movie DB
- [4] CrossRef API
- [5] SciGraph API
- [6] Semantic Scholar
- [7] ArXiv API
- [8] Elsevier API
- [9] Open Movie Database API
- [10] Open Citations
- [11] The Movie Database
- [12] IMDB in RDF Format