This repository contains the companion material for the following publication:
Timo Pierre Schrader, Matteo Finco, Stefan Grünewald, Felix Hildebrand, Annemarie Friedrich. MuLMS: A Multi-Layer Annotated Text Corpus for Information Extraction in the Materials Science Domain. WIESP 2023.
Please cite this paper if using the dataset or the code, and direct any questions regarding the dataset to Annemarie Friedrich, and any questions regarding the code to Timo Schrader.
This software is a research prototype, solely developed for and published as part of the publication cited above. It will neither be maintained nor monitored in any way.
The Multi-Layer Materials Science corpus (MuLMS) consists of 50 documents (licensed CC BY) from the materials science domain, spanning the following 7 subareas: "Electrolysis", "Graphene", "Polymer Electrolyte Fuel Cell (PEMFC)", "Solid Oxide Fuel Cell (SOFC)", "Polymers", "Semiconductors" and "Steel".
For a detailed description, please refer to our HuggingFace Dataset and our paper.
NOTE: This code requires Python 3.9. It does not support Python 3.8 and below.
Named entity (NE) recognition is a token-level tagging task that deals with identifying named entities. For instance, "WO3" is an example of a "Material" in our dataset. Because named entities occur at the token level and can span multiple input tokens in a sentence, we model this task using two different approaches: BILOU tagging with a CRF classification layer, and dependency parsing, where we treat each NE as a dependency between its first and last token.
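To illustrate the BILOU scheme (this is a minimal sketch, not the repository's actual implementation), entity spans can be converted to token tags as follows:

```python
def spans_to_bilou(num_tokens, spans):
    """Convert entity spans (start, end inclusive, label) to BILOU tags.

    B = beginning, I = inside, L = last token of a multi-token entity,
    O = outside any entity, U = unit-length (single-token) entity.
    """
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        if start == end:
            tags[start] = f"U-{label}"
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end):
                tags[i] = f"I-{label}"
            tags[end] = f"L-{label}"
    return tags

# "The WO3 film was annealed ." -> "WO3" (token 1) is a MAT entity
print(spans_to_bilou(6, [(1, 1, "MAT")]))
# ['O', 'U-MAT', 'O', 'O', 'O', 'O']
```

A CRF layer on top of the token classifier then enforces valid tag transitions (e.g., an I-tag may only follow a B- or I-tag of the same label).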
Furthermore, we provide datasets for multi-task experiments in which we incorporate other related datasets and their named entities to support the classifiers in learning our NEs.
The following named entities are modeled in our dataset: MAT, NUM, VALUE, UNIT, PROPERTY, CITE, TECHNIQUE, RANGE, INSTRUMENT, SAMPLE, FORM, DEV, MEASUREMENT
MuLMS provides relations between pairs of entities. There are two types of relations: measurement-related relations and further relations. The first type always starts at a Measurement trigger span; the second type does not start at a specific Measurement annotation.
There are the following relation types in MuLMS: hasForm, measuresProperty, usedAs, propertyValue, conditionProperty, conditionSample, conditionPropertyValue, usesTechnique, measuresPropertyValue, usedTogether, conditionEnv, usedIn, conditionInstrument, takenFrom, dopedBy
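As an illustrative sketch of the two relation types (the span texts and field layout here are hypothetical, not the repository's actual data schema), a relation can be represented as a head span, a relation label, and a tail span:

```python
# Hypothetical in-memory representation of MuLMS relation annotations.
# A measurement-related relation (e.g., measuresProperty) starts at a
# Measurement trigger span; a further relation (e.g., hasForm) does not.
ne_spans = {
    "m1": ("MEASUREMENT", "were measured"),
    "p1": ("PROPERTY", "ionic conductivity"),
    "mat1": ("MAT", "WO3"),
    "f1": ("FORM", "thin film"),
}
relations = [
    ("m1", "measuresProperty", "p1"),   # measurement-related relation
    ("mat1", "hasForm", "f1"),          # further relation
]

# Split relations by whether their head is a Measurement trigger span
measurement_rels = [r for r in relations if ne_spans[r[0]][0] == "MEASUREMENT"]
print(measurement_rels)  # [('m1', 'measuresProperty', 'p1')]
```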
This task is about classifying experiment-describing sentences as qualitative or quantitative. Whereas a quantitative sentence describes technical details about measurement procedures and experiments, a qualitative sentence describes them only at a high level, leaving out important details.
We model this task on a sentence-level basis as a ternary classification task using the tagset MEASUREMENT, QUAL_MEAS and NONE.
For the argumentative zoning (AZ) part of MuLMS, presented in the related publication MuLMS-AZ: An Argumentative Zoning Dataset for the Materials Science Domain, please refer to the separate repository on GitHub, which is a submodule of this repository and hence does not need an extra download.
Our dataset is divided into several splits; please refer to the paper for further explanation:
- train
- tune1/tune2/tune3/tune4/tune5
- dev
- test
Please install all dependencies of the environment as listed in environment.yml and make sure to have Python 3.9 installed (we recommend 3.9.11). You should also add the root folder of this project to the $PYTHONPATH environment variable so that all scripts can automatically find their imports.
Important: Also clone the Git submodule in this repo that points to the MuLMS-AZ Repo. The code files there are required to run the experiments in this repo.
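The setup steps above can be sketched as follows (assuming a POSIX shell and conda for the environment; adjust for your setup):

```shell
# Create the environment from the provided environment.yml (assumes conda)
conda env create -f environment.yml

# From the repository root: make the project packages importable for all scripts
export PYTHONPATH="${PYTHONPATH}:$(pwd)"

# Fetch the MuLMS-AZ submodule required to run the experiments
git submodule update --init --recursive
```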
NOTE: This code really requires Python 3.9. It does not support Python 3.8 and below or 3.10 and above due to type hinting and package dependencies.
We provide bash scripts in scripts for each NLP task separately. Furthermore, for subtasks (e.g., multi-tasking), there are additional scripts that contain all necessary parameters. Use these scripts to reproduce the results from our paper and adapt them if you want to do additional experiments. Moreover, you can check all available settings of each Python file via python <script_name.py> --help.
We use BERT-based language models, namely BERT, SciBERT and MatSciBERT, as the contextualized transformer basis of all our models. Moreover, we implement task-specific output layers on top of the LM. All PyTorch models can be found in the models subdirectory of each task.
- Download the SOFC corpus and place the contents in data/.
- For MSPT, you first need to convert the corpus to UIMA CAS format, using, for example, INCEpTION. You can find all UIMA CAS span types in source/data_handling/mspt_dataset.py.
Use the aggregate_cv_score.py scripts in the evaluation subdirectory of each task to evaluate the performance of trained models across all five folds.
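The aggregation amounts to averaging per-fold metrics. As a minimal sketch (the fold scores below are made-up placeholders; the actual logic lives in the aggregate_cv_score.py scripts):

```python
from statistics import mean, stdev

# Hypothetical micro-F1 scores, one per cross-validation fold (tune1..tune5)
fold_f1 = {"tune1": 0.81, "tune2": 0.79, "tune3": 0.83, "tune4": 0.80, "tune5": 0.82}

scores = list(fold_f1.values())
print(f"micro-F1: {mean(scores):.3f} +/- {stdev(scores):.3f}")
# micro-F1: 0.810 +/- 0.016
```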
This software is open-sourced under the AGPL-3.0 license. The MuLMS-AZ corpus is released under the CC BY-SA 4.0 license. See the LICENSE file for details. For a list of other open source components included in this project, see the file 3rd-party-licenses.txt.
If you use our software or dataset in your scientific work, please cite our paper:
@misc{schrader2023mulms,
  title={MuLMS: A Multi-Layer Annotated Text Corpus for Information Extraction in the Materials Science Domain},
  author={Timo Pierre Schrader and Matteo Finco and Stefan Grünewald and Felix Hildebrand and Annemarie Friedrich},
  year={2023},
  eprint={2310.15569},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}