This repository contains the dataset and Python implementation required for the paper "Recommending Scientific Datasets Using Author Networks in Ensemble Methods" by Xu Wang, Frank van Harmelen and Zhisheng Huang.
Make sure your Python version is >= 3.6. The required libraries are listed in requirements.txt; install all of them in your Python environment with:
pip install -r requirements.txt
The datasets you need for our ensemble dataset recommendation algorithm:
- MAKG coauthor RDF/HDT file download link
- MAKG paper/dataset title RDF/HDT file download link
- MAKG paper/dataset abstract RDF/HDT file download link
- MAKG pretrained author-entity embedding download link
- Seed dataset/paper txt file, one dataset per line
- Candidate dataset/paper txt file, one dataset per line
- Gold standard links between seeds and candidates, as an RDF/HDT file
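The seed and candidate files are plain text with one dataset per line, so they can be loaded with a few lines of Python. The following is a minimal sketch, not the loader used by the script; the helper name `read_ids` and the identifiers in the demo file are hypothetical.

```python
import tempfile
from pathlib import Path


def read_ids(path):
    """Return the non-empty, whitespace-stripped lines of a text file, in order."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# Demo on a throwaway file; "dataset_a" and "dataset_b" are placeholder identifiers.
with tempfile.TemporaryDirectory() as tmp:
    seed_path = Path(tmp) / "seeds.txt"
    seed_path.write_text("dataset_a\ndataset_b\n\n", encoding="utf-8")
    seeds = read_ids(seed_path)

print(seeds)
```

Blank lines are skipped here as a convenience; whether the actual script tolerates them is not stated in this README.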
The algorithm in the paper is implemented in Recommend_walk_embed_bm.py:
- Graph walk implementation
  - graphwalk function in line 47
  - line 217-220 of step function
- Author entity embedding similarity
  - clean_candidate_with_ent_embed in line 107
  - line 221-222 of step function
- BM25
  - line 253-260 of step function
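To give a feel for the three building blocks listed above, here is a small self-contained sketch of each: a hop-limited walk over a coauthor graph, cosine similarity between embedding vectors, and a common Okapi BM25 variant. It is not the code in Recommend_walk_embed_bm.py. The function names, the toy graph, the similarity measure and the BM25 parameters `k1` and `b` are all assumptions made for illustration, and the way the script combines the three signals with its thresholds is not reproduced here.

```python
import math
from collections import Counter, deque


def graph_walk(coauthors, start_authors, max_hops):
    """Authors reachable from start_authors within max_hops coauthor links (BFS)."""
    seen = set(start_authors)
    frontier = deque((author, 0) for author in start_authors)
    while frontier:
        author, depth = frontier.popleft()
        if depth == max_hops:
            continue  # hop budget used up, do not expand further
        for neighbour in coauthors.get(author, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return seen


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """BM25 score of every tokenised document for a tokenised query."""
    if not docs:
        return []
    n = len(docs)
    avgdl = sum(len(doc) for doc in docs) / n
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            )
        scores.append(score)
    return scores


# Toy data, invented for this example.
coauthors = {"a1": ["a2"], "a2": ["a1", "a3"], "a3": ["a2", "a4"], "a4": ["a3"]}
reachable = graph_walk(coauthors, ["a1"], max_hops=2)
similarity = cosine([1.0, 0.0], [1.0, 0.0])
scores = bm25_scores(
    ["author", "dataset"],
    [["author", "network", "dataset"], ["protein", "structure"]],
)
print(sorted(reachable), similarity, scores)
```

In this toy run the walk stops at a3 because a4 is three hops from a1, and only the first document shares terms with the query, so only it gets a non-zero BM25 score.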
usage: Recommend_walk_embed_bm.py [-h] -th THRESHOLD -bth BM25_THRESHOLD -hp HOP -sd SEED -cd CANDIDATE -gd STANDARD [-d DIR]

optional arguments:
  -h, --help            show this help message and exit
  -th THRESHOLD, --threshold THRESHOLD
                        Threshold for similarity between entity(author) embedding
  -bth BM25_THRESHOLD, --bm25_threshold BM25_THRESHOLD
                        Threshold for BM25 ranking
  -hp HOP, --hop HOP    Hop number for graph walk
  -sd SEED, --seed SEED
                        Path to [seed file].txt
  -cd CANDIDATE, --candidate CANDIDATE
                        Path to [candidate file].txt
  -gd STANDARD, --standard STANDARD
                        Path to [standard file].hdt
  -d DIR, --dir DIR     Directory to read all needed files and to store all results. Default is directory of this python file
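For readers who want to script around the tool, the help text above corresponds roughly to an argparse parser like the one sketched below. The flag names and help strings are taken from the usage message; the argument types (`float`, `int`), the stand-in default for `--dir` and the example values are assumptions, and the real script may define them differently.

```python
import argparse


def build_parser():
    """Rebuild a parser matching the usage message (types are assumed)."""
    parser = argparse.ArgumentParser(prog="Recommend_walk_embed_bm.py")
    parser.add_argument("-th", "--threshold", required=True, type=float,
                        help="Threshold for similarity between entity(author) embedding")
    parser.add_argument("-bth", "--bm25_threshold", required=True, type=float,
                        help="Threshold for BM25 ranking")
    parser.add_argument("-hp", "--hop", required=True, type=int,
                        help="Hop number for graph walk")
    parser.add_argument("-sd", "--seed", required=True,
                        help="Path to [seed file].txt")
    parser.add_argument("-cd", "--candidate", required=True,
                        help="Path to [candidate file].txt")
    parser.add_argument("-gd", "--standard", required=True,
                        help="Path to [standard file].hdt")
    # The script defaults to its own directory; "." is only a stand-in here.
    parser.add_argument("-d", "--dir", default=".",
                        help="Directory to read all needed files and to store all results")
    return parser


# Hypothetical values, chosen only to show how the flags are parsed.
args = build_parser().parse_args([
    "-th", "0.5", "-bth", "10", "-hp", "2",
    "-sd", "seeds.txt", "-cd", "candidates.txt", "-gd", "standard.hdt",
])
print(args.threshold, args.bm25_threshold, args.hop, args.dir)
```

All flags except `-h` and `-d` appear without brackets in the usage line, which is why they are marked `required=True` in this sketch.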
41 seed datasets and 116 candidate datasets, with 117 gold standard links.
python Recommend_walk_embed_bm.py -th [threshold of embedding similarity] -bth [threshold of bm25] -hp [hop number of graph walk] -sd [path_to_seed] -cd [path_to_candidate] -gd [path_to_standard.hdt] -d [path_to_dir_of_all_datasets]
After running the Python file, a result file is written to the directory, with the following format per line:
seed_dataset_id[Tab Separated]Correct_Count[Tab Separated]Standard_Count[Tab Separated]Recommended_Count[Tab Separated]Recall[Tab Separated]Precision
where Standard_Count is the number of standard linked datasets for the seed dataset; Recommended_Count is the number of datasets returned by the recommendation algorithm for the seed dataset; Correct_Count is the size of the intersection between the standard linked datasets and the datasets returned by the recommendation algorithm for the seed dataset; Recall is Correct_Count divided by Standard_Count; Precision is Correct_Count divided by Recommended_Count.
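The definitions above can be checked with a short worked example. This is a sketch under assumptions: the helper name `evaluate`, the dataset identifiers, the number formatting and the handling of empty sets (returning 0.0 instead of dividing by zero) are not taken from the script.

```python
def evaluate(seed_id, standard, recommended):
    """Build one tab-separated result line for a seed dataset."""
    standard, recommended = set(standard), set(recommended)
    correct = len(standard & recommended)  # Correct_Count
    # Assumption: report 0.0 when a denominator would be zero.
    recall = correct / len(standard) if standard else 0.0
    precision = correct / len(recommended) if recommended else 0.0
    fields = (seed_id, correct, len(standard), len(recommended), recall, precision)
    return "\t".join(str(field) for field in fields)


# Hypothetical identifiers: 4 gold standard links, 2 recommendations, 1 overlap.
line = evaluate("seed_1", {"d1", "d2", "d3", "d4"}, {"d2", "d9"})
print(line)
```

With one correct recommendation out of four standard links and two recommendations, Recall is 1/4 = 0.25 and Precision is 1/2 = 0.5.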
This repository is licensed under GNU General Public License v3.0.
The Microsoft Academic Knowledge Graph, the linked data description files, and the ontology are licensed under the Open Data Commons Attribution License (ODC-By) v1.0.
Wang, Xu, 2022, "Data For "Recommending Scientific Datasets Using Author Networks in Ensemble Methods"", https://doi.org/10.34894/W6C7P7, DataverseNL, V1