You can use this repository to produce named entity links or candidate links with the DBpedia Spotlight tool by mapping multiple N-Triples (or N-Quads) files. This repository is used to produce named entity links for Springer-Nature datasets, but it can be used for any dataset.
The steps are explained for book chapters, but you can use any of the datasets from SciGraph Explorer. To produce the data, follow these steps:
- Download the Springer-Nature book chapters dataset from this link and the book chapter abstracts from this link.
- Use bash commands to extract the portions of the book chapters dataset for the needed properties (field of research and language):
  - bzcat FILE | grep 'http://scigraph.springernature.com/ontologies/core/hasFieldOfResearchCode' > outputFieldOfResearch.ttl
  - bzcat FILE | grep 'http://scigraph.springernature.com/ontologies/core/language' > outputLanguage.ttl
- Compress the output files:
  - bzip2 outputFieldOfResearch.ttl
  - bzip2 outputLanguage.ttl
- Sort the compressed output files and the book chapter abstracts, running the following for each file:
  - bzip2 -cd "$file" | sort --parallel=8 --batch-size=512 --buffer-size=50% | parallel --pipe --recend '' -k bzip2 > "$newFile"
- Install Scala
- Install IntelliJ
- Import this repository from GitHub into IntelliJ as a Scala project
- Edit default.properties so that it points to the datasets:
  - Set base-dir to the path of your datasets
  - Set primary-input-dataset to the book chapter abstracts
  - Set input-datasets to the field of research portion and the language portion
- Execute the main class:
  - Run the SortedQuadTraversal class as the main class
  - Under Run -> Edit Configurations, set Program Arguments to default.properties
  - Run the main class again
As a result, you will get two output datasets.
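
The data preparation steps above (filtering by property, compressing, sorting) can be sketched as a few small shell functions. This is a minimal sketch and not part of the repository: the function names `filter_by_predicate`, `sort_and_compress`, and `check_sorted` are hypothetical helpers, and the file names in the usage comments are placeholders. It assumes GNU `sort`. The filter uses `grep -F` so the property URI is matched literally, and the sort step pipes into a single `bzip2` instead of GNU `parallel`, which is simpler but likely slower on large files.

```shell
#!/usr/bin/env bash
# Hypothetical helpers sketching the filter / sort / compress steps.

# Keep only the lines on stdin that mention the given property URI.
# -F treats the URI as a literal string, so '.' is not a regex wildcard.
filter_by_predicate() {
  grep -F -- "$1"
}

# Decompress a .bz2 file, sort its lines, and write a new .bz2 file.
# The sort options are the ones used in the steps above (GNU sort).
sort_and_compress() {
  local in_file=$1 out_file=$2
  bzip2 -cd "$in_file" \
    | sort --parallel=8 --batch-size=512 --buffer-size=50% \
    | bzip2 > "$out_file"
}

# Succeed only if the lines on stdin are already in sorted order.
check_sorted() {
  sort -c
}

# Example usage (placeholder file names):
#   bzcat bookchapters.nt.bz2 \
#     | filter_by_predicate 'http://scigraph.springernature.com/ontologies/core/language' \
#     | bzip2 > outputLanguage.ttl.bz2
#   sort_and_compress outputLanguage.ttl.bz2 outputLanguage.sorted.ttl.bz2
#   bzip2 -cd outputLanguage.sorted.ttl.bz2 | check_sorted && echo "sorted"
```

Running `check_sorted` on each decompressed file before starting the main class is an optional sanity check; it only confirms the order produced by `sort` in the current locale, not whatever ordering the Scala code may expect.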