About • Announcement • Installation • Usage • Contribution
This is the repository of SciAssist, a toolkit that assists scientists with their research. SciAssist currently supports Summarization and Reference String Parsing; more functions are under active development by WING@NUS, Singapore. The project is built on an open-source template by ashleve, which uses PyTorch Lightning for model training and Hydra for configuration.
- CocoSciSum: A Scientific Summarization Toolkit with Compositional Controllability is accepted as an EMNLP 2023 System Demonstration paper!
- Our demo is online on Hugging Face Spaces!
conda create --name assist python=3.8
conda activate assist
[install pytorch]
pip install sciassist
Important: Make sure you install PyTorch (a build compatible with your machine) before installing SciAssist.
After installing the package, you can set up Grobid with the CLI:
setup_grobid
This sets up Grobid. After installation, start the Grobid server with:
run_grobid
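Before running a task on a PDF, it can help to confirm that the Grobid server is actually reachable. The following is a minimal sketch, not part of SciAssist: the helper name `grobid_is_alive` is hypothetical, and it assumes `run_grobid` starts the server on Grobid's default address `http://localhost:8070`, where Grobid exposes the `/api/isalive` health-check endpoint. Adjust the URL if your server listens elsewhere.

```python
import urllib.error
import urllib.request


def grobid_is_alive(base_url: str = "http://localhost:8070", timeout: float = 2.0) -> bool:
    """Return True if a Grobid server answers its health-check endpoint.

    Assumption: the server uses Grobid's default port (8070) unless
    another base_url is passed in.
    """
    url = base_url.rstrip("/") + "/api/isalive"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status:
        # treat all of these as "server not available".
        return False


if __name__ == "__main__":
    print("Grobid reachable:", grobid_is_alive())
```

If the check reports that the server is not reachable, start it with `run_grobid` and try again after it has finished loading.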
Task 1: (Single Document) Summarization
from SciAssist import Summarization
# Set device="cpu" if you want to use only CPU. The default device is "gpu".
# summarizer = Summarization(device="cpu")
summarizer = Summarization(device="gpu")
text = """1 INTRODUCTION . Statistical learning theory studies the learning properties of machine learning algorithms , and more fundamentally , the conditions under which learning from finite data is possible .
In this context , classical learning theory focuses on the size of the hypothesis space in terms of different complexity measures , such as combinatorial dimensions , covering numbers and Rademacher/Gaussian complexities ( Shalev-Shwartz & Ben-David , 2014 ; Boucheron et al. , 2005 ) .
Another more recent approach is based on defining suitable notions of stability with respect to perturbation of the data ( Bousquet & Elisseeff , 2001 ; Kutin & Niyogi , 2002 ) .
In this view , the continuity of the process that maps data to estimators is crucial , rather than the complexity of the hypothesis space .
Different notions of stability can be considered , depending on the data perturbation and metric considered ( Kutin & Niyogi , 2002 ) .
Interestingly , the stability and complexity approaches to characterizing the learnability of problems are not at odds with each other , and can be shown to be equivalent as shown in Poggio et al . ( 2004 ) and Shalev-Shwartz et al . ( 2010 ) .
In modern machine learning overparameterized models , with a larger number of parameters than the size of the training data , have become common .
The ability of these models to generalize is well explained by classical statistical learning theory as long as some form of regularization is used in the training process ( Bühlmann & Van De Geer , 2011 ; Steinwart & Christmann , 2008 ) .
However , it was recently shown - first for deep networks ( Zhang et al. , 2017 ) , and more recently for kernel methods ( Belkin et al. , 2019 ) - that learning is possible in the absence of regularization , i.e. , when perfectly fitting/interpolating the data .
Much recent work in statistical learning theory has tried to find theoretical ground for this empirical finding .
Since learning using models that interpolate is not exclusive to deep neural networks , we study generalization in the presence of interpolation in the case of kernel methods .
We study both linear and kernel least squares problems in this paper . """
# For a string
res = summarizer.predict(text, type="str")
# For a plain-text file
res = summarizer.predict("bodytext.txt", type="txt")
# For a PDF file
res = summarizer.predict("raw.pdf")
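The `device` argument shown above takes either "cpu" or "gpu". If the same script has to run on machines with and without a GPU, the choice can be made at runtime. This is a small sketch under stated assumptions: `pick_device` is a hypothetical helper, not a SciAssist function, and it assumes that PyTorch's `torch.cuda.is_available()` is a reasonable proxy for whether SciAssist's "gpu" setting will work.

```python
def pick_device() -> str:
    """Return "gpu" when PyTorch reports a usable CUDA device, else "cpu".

    The two return values are the device strings used in the SciAssist
    examples; the selection logic itself is an assumption of this sketch.
    """
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no GPU support to detect.
        return "cpu"
    return "gpu" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    device = pick_device()
    print("Selected device:", device)
    # With SciAssist installed, the value can be passed straight through:
    # from SciAssist import Summarization
    # summarizer = Summarization(device=device)
```

The same value can be passed to `ReferenceStringParsing(device=...)`.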
Task 2: Reference string parsing
from SciAssist import ReferenceStringParsing
# Set device="cpu" if you want to use only CPU. The default device is "gpu".
# ref_parser = ReferenceStringParsing(device="cpu")
ref_parser = ReferenceStringParsing(device="gpu")
# For a string
res = ref_parser.predict(
"""Calzolari, N. (1982) Towards the organization of lexical definitions on a
database structure. In E. Hajicova (Ed.), COLING '82 Abstracts, Charles
University, Prague, pp.61-64.""", type="str")
# For a plain-text file
res = ref_parser.predict("test.txt", type="txt")
# For a PDF file
res = ref_parser.predict("test.pdf")
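When the references are already in memory as separate strings, one way to process them is to call `predict(..., type="str")` once per reference. The sketch below is illustrative only: `parse_many` is a hypothetical helper, it makes no assumption about the structure of what `predict` returns, and it assumes that collapsing line breaks inside a wrapped reference (as in the three-line example above) does not change the parse. A stand-in `predict` function is used so the example runs without SciAssist installed.

```python
from typing import Any, Callable, Iterable, List


def parse_many(predict: Callable[..., Any], references: Iterable[str]) -> List[Any]:
    """Call predict(reference, type="str") for each non-empty reference string."""
    results = []
    for ref in references:
        # Collapse line breaks and repeated spaces from wrapped references
        # (an assumption of this sketch, not a SciAssist requirement).
        cleaned = " ".join(ref.split())
        if not cleaned:
            continue  # skip blank entries
        results.append(predict(cleaned, type="str"))
    return results


if __name__ == "__main__":
    # Stand-in for ref_parser.predict so the sketch runs on its own.
    def fake_predict(text, type="pdf"):
        return {"input": text, "type": type}

    refs = [
        """Calzolari, N. (1982) Towards the organization of lexical definitions on a
        database structure. In E. Hajicova (Ed.), COLING '82 Abstracts, Charles
        University, Prague, pp.61-64.""",
        "",
    ]
    for item in parse_many(fake_predict, refs):
        print(item)
```

With SciAssist installed, pass `ref_parser.predict` in place of `fake_predict`.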
Source Code
- Rename SingleSummarization to Summarization.
- Change the format of output files from .txt to .json.
Documentation
- Move the definition of the Pipeline class from Usage to Contribution Guide.
- Add catalog for Contribution Guide.
- Add examples for choosing devices in Usage.
Here is a brief introduction to incorporating a new task into SciAssist. In general, to add a new task, you will need to:
1. Git clone this repo and prepare the virtual environment.
2. Install Grobid Server.
3. Create a LightningModule and a LightningDataModule.
4. Train a model.
5. Provide a pipeline for users.
We provide a step-by-step contribution guide in SciAssist's documentation.
This toolkit is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.
Read LICENSE for more information.