This repository contains the CareerCoach 2022 Dataset introduced in the article:
Weichselbraun, Albert, Waldvogel, Roger, Fraefel, Andreas, van Schie, Alexander and Kuntschik, Philipp. (2022). “Slot Filling for Extracting Reskilling and Upskilling Options from the Web”. Natural Language Processing and Information Systems (NLDB 2022), Lecture Notes in Computer Science, vol. 13286, arXiv:2207.04862, DOI: 10.1007/978-3-031-08473-7_25
The gold standard is available for download in the NIF and JSON format, and draws upon documents from a corpus of over 99,000 education courses which have been retrieved from 488 different education providers.
The corpus contains two partitions.:
- Partition (P1) supports the content extraction (i.e., text segmentation and text segment classification) tasks and comprises 169 documents and gold standard annotations for page segments
- Partition (P2) contains 75 documents with a significantly richer set of annotations that consider content extraction, entities and slots. It supports benchmarking knowledge extraction tasks such as entity recognition, entity classification, entity linking, and slot filling on top of the content extraction task.
The dataset contains the following two partitions:
- Partition P1 (segments):
- size: 169 documents
- annotations: page segment (classification, start/end index, surface form)
- supported tasks: T1 (page segment recognition), T2 (page segment classification)
- Partition P2 (entities):
- size: 75 documents
- annotations:
- page segment (classification, start/end index, surface form)
- identified entities (start/end index, identifier, type (e.g., skill, occupation, topic, position, school, industry, education, degree), key)
- supported tasks:
- T1 (page segment recognition)
- T2 (page segment classification)
- T3 (entity recognition)
- T4 (entity classification
- T5 (entity linking)
- T6 (slot filling)
The published .json
files contain annotations in the following format:
{
"html": "...",
"text": "...",
"page_segments": {
"target_groups": [
{
"surface_form": "Text of first segment",
"start": 5,
"end": 26
}
],
"course_contents": [
{...}
]
},
"entities": {
"target_groups": [
{
"surface_form": "Entity 1",
"start": 15,
"end":22,
"key": "https://semanticlab.net/career-coach#458"
}
],
"course_contents": [
{...}
]
}
}
html
: the page's HTML contenttext
: the corresponding text representation as obtained by inscriptis.page_segments
: a dictionary of classified page segment with keys:target_groups
,course_contents
,prerequisite
,learning_objectives
,certficates_and_degrees
. Each page segment contains another dictionary outlining itssurface_form
,start
andend
indices.entities
optional further dictionaries that describe entities identified within the segments.- each entity is described by
surface_form
,start
andend
indices,type
andkey
. In addition entities may contain the keyswikidata
,wikipedia
andwikipedia_redirect
which refer to their URL in the corresponding knowledge sources. type
: indicates the entity class (skill
,occupation
,topic
,position
,school
,industry
,education
, ordegree
)key
: unique entity URL within the knowledge graph
- each entity is described by