Gold Standard Data for Matching Algorithm

This gold standard was created for the evaluation and training of the citation matching algorithm in the EXCITE project.

Reference strings set in the gold standard

This collection of data was created by following these steps:

1. One of the EXCITE reference string extraction algorithms was applied to our PDF corpora, producing a collection of reference strings. Because these strings were extracted automatically, without manual correction, they may contain errors.
2. Some of these reference strings were then selected at random for the next step.
3. The EXCITE reference segmentation algorithm was applied to the reference strings selected in step 2.
4. Finally, a human assessor compiled the list of all matches in Sowiport for each reference string.

The results of these steps are stored in the DataSet_A CSV file, which contains 816 reference strings and 4 columns:

ref_id,
reference_strings,
reference_segementations,
Math_ids_in_sowiport.

Of all the references in the set, 521 have at least one matching record in Sowiport.
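
As a minimal sketch of how the gold standard could be read, the snippet below parses a CSV with the four columns listed above and counts the references that have at least one match. The column names are taken as spelled in this README. The `;` separator for multiple ids in `Math_ids_in_sowiport` and the helper names `load_gold_matches` and `count_matched` are assumptions made for illustration, so check the actual file before relying on them. The in-memory sample stands in for the real file.

```python
import csv
import io
from typing import Dict, List, TextIO

# Column name as spelled in this README.
MATCH_COLUMN = "Math_ids_in_sowiport"


def load_gold_matches(handle: TextIO, sep: str = ";") -> Dict[str, List[str]]:
    """Map each ref_id to its list of matched Sowiport ids (possibly empty).

    `sep` is an assumed separator between multiple ids in one cell.
    """
    gold: Dict[str, List[str]] = {}
    for row in csv.DictReader(handle):
        raw = (row.get(MATCH_COLUMN) or "").strip()
        gold[row["ref_id"]] = [part.strip() for part in raw.split(sep) if part.strip()]
    return gold


def count_matched(gold: Dict[str, List[str]]) -> int:
    """Count the references that have at least one match."""
    return sum(1 for ids in gold.values() if ids)


# Tiny made-up sample in place of the real DataSet_A file.
sample = io.StringIO(
    "ref_id,reference_strings,reference_segementations,Math_ids_in_sowiport\n"
    "1,Some reference string,,id-a;id-b\n"
    "2,Another reference string,,\n"
)
gold = load_gold_matches(sample)
print(count_matched(gold), "of", len(gold), "references have a match")
```

To read the real data, pass an open file handle (for example `open(path, newline="", encoding="utf-8")`) instead of the sample.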

Target set in the gold standard against which reference strings are matched

Sowiport contained more than 9 million bibliographic records. We publish a smaller subset of the Sowiport data that allows experiments with the gold standard.

This subset is stored in the DataSet_B CSV file. It includes 18,590 bibliographic items and 16 columns:

'id': Sowiport id,
'title_full': the full title of the article,
'title_sub': the subtitle of the article,
'facet_person_str_mv': authors' names used for faceting in Sowiport,
'person_author_normalized_str_mv': normalized format of the authors' family names,
'journal_short_txt_mv': short title of the journal,
'journal_title_txt_mv': full title of the journal,
'norm_pagerange_str': normalized format of the page information,
'norm_publishDate_str': normalized publication year information,
'norm_title_full_str': normalized format of the full title,
'norm_title_str': normalized format of the subtitle,
'norm_volume_str': normalized format of the volume information,
'norm_issue_str': normalized format of the issue information,
'recorddoi_str_mv': DOI information,
'zsabk_str': abbreviation of the journal title,
'db_origin_str': source of the information in Sowiport.
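
One possible way to use the two files together is to index the target records by `id` and score a matcher's output against the gold matches. The sketch below computes micro-averaged precision and recall over (reference, record) pairs. The function names `index_targets` and `precision_recall`, the toy ids, and the choice of metric are illustrative assumptions, not something this repository prescribes.

```python
import csv
import io
from typing import Dict, Set, TextIO, Tuple


def index_targets(handle: TextIO) -> Dict[str, dict]:
    """Index target records by their Sowiport id (the 'id' column)."""
    return {row["id"]: row for row in csv.DictReader(handle)}


def precision_recall(
    gold: Dict[str, Set[str]], predicted: Dict[str, Set[str]]
) -> Tuple[float, float]:
    """Micro-averaged precision and recall over (ref_id, sowiport_id) pairs."""
    gold_pairs = {(ref, rec) for ref, ids in gold.items() for rec in ids}
    pred_pairs = {(ref, rec) for ref, ids in predicted.items() for rec in ids}
    hits = len(gold_pairs & pred_pairs)
    precision = hits / len(pred_pairs) if pred_pairs else 0.0
    recall = hits / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall


# Made-up records in place of the real DataSet_B file (only 3 of the 16 columns).
targets = index_targets(
    io.StringIO(
        "id,title_full,norm_publishDate_str\n"
        "rec-1,First title,2001\n"
        "rec-2,Second title,2005\n"
    )
)

# Hypothetical gold matches and matcher output for two references.
gold = {"r1": {"rec-1"}, "r2": set()}
predicted = {"r1": {"rec-1", "rec-2"}, "r2": set()}

precision, recall = precision_recall(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Counting pairs rather than references lets a reference with several matches contribute each match separately. A per-reference metric would be an equally reasonable choice.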
