SciOL and MuLMS-Img

This repository contains companion material for the following publication:

Tim Tarsi, Heike Adel, Jan Hendrik Metzen, Dan Zhang, Matteo Finco, Annemarie Friedrich. SciOL and MuLMS-Img: Introducing a Large-Scale Multimodal Scientific Dataset and Models for Image-Text Tasks in the Scientific Domain. WACV 2024.

Please direct any questions regarding the dataset or publication to Tim Tarsi.

Overview

In scientific publications, a substantial part of the information is expressed via figures containing images and diagrams. However, existing training and evaluation datasets for retrieval systems are either limited to one modality or focus on non-scientific domains, which makes them difficult to apply to scientific publications. We address this gap by introducing two novel datasets: (1) SciOL, the largest openly-licensed pre-training corpus for multimodal models in the scientific domain, covering multiple sciences including materials science, physics, and computer science, and (2) MuLMS-Img, a high-quality dataset in the materials science domain, manually annotated for various image-text tasks.

MuLMS-Img

The Multi-Layer Materials Science (MuLMS) corpus [1] is a dataset of 50 scientific publications in the materials science domain annotated for various natural language processing tasks. MuLMS-Img extends this dataset with over 14,500 high-quality manual annotations for various image-text tasks, e.g., figure type classification, optical character recognition (OCR) and text role labeling, and figure retrieval.

You can find the data here: MuLMS-Img

SciOL

Scientific Openly-Licensed Publications (SciOL) is the largest openly-licensed pre-training corpus for multimodal models in the scientific domain, covering multiple sciences including materials science, physics, and computer science. It consists of over 2.7M scientific publications converted into semi-structured data. SciOL contains over 14 billion tokens of extracted and structured text.

We host the textual data here: SciOL-text

For the image-caption pairs see: SciOL-CI

Citation

If you use our work, please cite our paper:

@InProceedings{Tarsi_2024_WACV,
    author    = {Tarsi, Tim and Adel, Heike and Metzen, Jan Hendrik and Zhang, Dan and Finco, Matteo and Friedrich, Annemarie},
    title     = {SciOL and MuLMS-Img: Introducing a Large-Scale Multimodal Scientific Dataset and Models for Image-Text Tasks in the Scientific Domain},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2024},
    pages     = {4560-4571}
}

References

[1] Timo Pierre Schrader, Matteo Finco, Stefan Grünewald, Felix Hildebrand and Annemarie Friedrich. MuLMS: A Multi-Layer Annotated Text Corpus for Information Extraction in the Materials Science Domain. WIESP 2023.
