This gold standard was created for evaluating and training the citation matching algorithm in the EXCITE project.
This collection of data was created by following these steps:
- One of the EXCITE reference string extraction algorithms was applied to our PDF corpora, producing a collection of reference strings. Because these references were extracted automatically, without human correction, they may contain errors.
- Afterward, some of these reference strings were selected randomly for the next step.
- Then the EXCITE reference segmentation algorithm was applied to the references selected in step 2.
- In the final stage, a human assessor created the list of all matches (in Sowiport) for each reference string.
The results of these steps are stored in the DataSet_A CSV file. This file contains 816 reference strings, one per row, with 4 columns:
- ref_id,
- reference_strings,
- reference_segementations,
- Math_ids_in_sowiport.
Of the 816 references in the set, 521 have at least one matching item in Sowiport.
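
As a rough illustration of how the count above could be checked, the sketch below reads a CSV with the four columns listed and counts the rows whose match column is non-empty. It assumes that a reference without a match has an empty cell in Math_ids_in_sowiport; the actual encoding in DataSet_A may differ, and the sample rows are made up for the example.

```python
import csv
import io


def count_matched(csv_file, match_column="Math_ids_in_sowiport"):
    """Return (total_rows, rows_with_at_least_one_match).

    Assumption: an empty cell in the match column means "no match".
    """
    total = 0
    matched = 0
    for row in csv.DictReader(csv_file):
        total += 1
        if (row.get(match_column) or "").strip():
            matched += 1
    return total, matched


# Made-up sample rows; the real file would be opened with
# open(path, newline="", encoding="utf-8") instead.
sample = io.StringIO(
    "ref_id,reference_strings,reference_segementations,Math_ids_in_sowiport\n"
    "1,Some reference A,,rec-1;rec-2\n"
    "2,Some reference B,,\n"
)
total, matched = count_matched(sample)
print(total, matched)
```

If the assumption about empty cells holds, running the same function on DataSet_A should report the totals stated above.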
Sowiport contained more than 9 million bibliographic records. We publish a smaller subset of the Sowiport data that allows experiments with the gold standard.
This subset is stored in the DataSet_B CSV file. It contains 18,590 bibliographic items and 16 columns:
- 'id': Sowiport id,
- 'title_full': the full title of the article,
- 'title_sub': the subtitle of the article,
- 'facet_person_str_mv': authors' names used for faceting in Sowiport,
- 'person_author_normalized_str_mv': normalized format of authors' family names,
- 'journal_short_txt_mv': short form of the journal title,
- 'journal_title_txt_mv': the full title of the journal,
- 'norm_pagerange_str': normalized format of page information,
- 'norm_publishDate_str': normalized publication year information,
- 'norm_title_full_str': normalized format of full title,
- 'norm_title_str': normalized format of subtitle,
- 'norm_volume_str': normalized format of volume information,
- 'norm_issue_str': normalized format of issue information,
- 'recorddoi_str_mv': doi information,
- 'zsabk_str': abbreviation of journal title,
- 'db_origin_str': source of the information in Sowiport.
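
The two files can be used together by looking up the gold-standard match ids in the bibliographic records. The following is a minimal sketch, not the EXCITE implementation: it indexes records by 'id', resolves a list of match ids to titles, and scores a set of predicted ids against the gold ids with precision and recall. The ';' separator for multiple ids, the helper names, and the sample rows are assumptions made for illustration.

```python
import csv
import io


def index_by_id(csv_file, id_column="id"):
    """Map each record id to its full row (a dict of column name -> value)."""
    return {row[id_column]: row for row in csv.DictReader(csv_file)}


def split_ids(cell, separator=";"):
    """Split a match cell into ids. The separator is an assumption."""
    return [part.strip() for part in (cell or "").split(separator) if part.strip()]


def precision_recall(predicted, gold):
    """Score one reference; predicted and gold are collections of ids."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Made-up records using three of the sixteen columns.
records_csv = io.StringIO(
    "id,title_full,norm_publishDate_str\n"
    "rec-1,An Example Title,1999\n"
    "rec-2,Another Example Title,2005\n"
)
records = index_by_id(records_csv)

# Gold ids for one reference, and the ids a hypothetical matcher returned.
gold_ids = split_ids("rec-1; rec-2")
titles = [records[i]["title_full"] for i in gold_ids if i in records]
precision, recall = precision_recall(["rec-1", "rec-3"], gold_ids)
print(titles, precision, recall)
```

Ids that are missing from the published subset are skipped in the lookup rather than raising an error, since the subset is much smaller than the full Sowiport collection.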