This repository contains the result of activity A3.2 - R3.2.3 Coreference resolution tool, developed within the project Razvoj slovenščine v digitalnem okolju (Development of Slovene in a Digital Environment).
The Slovene coreference resolution project introduces four models for coreference resolution on the coref149 and SentiCoref datasets:
- baseline model (linear regression with hand-crafted features),
- non-contextual neural model using word2vec embeddings,
- contextual neural model using ELMo embeddings,
- contextual neural model using BERT embeddings.
For more details, see the journal paper, published in Computer Science and Information Systems (2021).
[Update 18. 2. 2023] A BERT-based model fine-tuned on the updated SUK (SentiCoref) dataset is available in the CLARIN.SI repository. Using the evaluation setting where entity mentions are assumed to be correctly pre-detected, the model achieves the following metric values:
| Metric   | Precision | Recall | F1    |
|----------|-----------|--------|-------|
| MUC      | 0.931     | 0.957  | 0.943 |
| BCubed   | 0.887     | 0.947  | 0.914 |
| CEAFe    | 0.945     | 0.893  | 0.916 |
| CoNLL-12 | 0.921     | 0.932  | 0.924 |
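The CoNLL-12 row is the standard CoNLL-2012 score, i.e. the unweighted average of the MUC, BCubed, and CEAFe values per column, which a few lines of Python can verify:

```python
# The CoNLL-12 row above is the unweighted average of MUC, BCubed and CEAFe.
muc    = (0.931, 0.957, 0.943)   # (precision, recall, F1)
bcubed = (0.887, 0.947, 0.914)
ceafe  = (0.945, 0.893, 0.916)

for name, values in zip(("precision", "recall", "F1"), zip(muc, bcubed, ceafe)):
    print(name, round(sum(values) / 3, 3))   # -> 0.921, 0.932, 0.924
```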
- `report/` contains the PDF of our work.
- `src/` contains the source code of our work.
- `data/` is a placeholder for datasets (see the Getting datasets section below).
Before doing anything, the dependencies need to be installed.
$ pip install -r requirements.txt
Note: if you have problems with the torch library, make sure you have 64-bit Python installed. Also make use of the official PyTorch command builder at https://pytorch.org/get-started/locally/.
Note: you might have problems running the contextual models on Windows, since allennlp is not officially supported on Windows, as noted in their README.
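A quick way to confirm that torch was installed correctly (and whether a GPU is visible) is a short smoke test like the one below; this is only an illustrative check, not part of the repository:

```python
# Smoke test: verify that torch imports and report whether CUDA is available.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```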
The project operates with the following datasets:
- SSJ500k (the `-sl.TEI` version),
- coref149,
- SentiCoref 1.0 (WebAnno TSV 3.2 file format docs).
Download and extract them into the `data/` folder. After that, your data folder should look like this:
data/
+-- ssj500k-sl.TEI
    +-- ssj500k-sl.back.xml
    +-- ssj500k-sl.body.xml
    +-- ssj500k-sl.xml
    +-- ...
+-- coref149
    +-- ssj4.15.tcf
    +-- ssj5.30.tcf
    +-- ... (list of .tcf files)
+-- senticoref1_0
    +-- 1.tsv
    +-- 2.tsv
    +-- ... (list of .tsv files)
Coref149 and SentiCoref are the main datasets we use.
SSJ500k is used for additional metadata such as dependencies and POS tags, which are not provided by coref149 itself.
Since only a subset of SSJ500k is used, it can be trimmed to decrease its size and improve loading time.
To do that, run `trim_ssj.py`:
$ python src/trim_ssj.py --coref149_dir=data/coref149 --ssj500k_path=data/ssj500k-sl.TEI/ssj500k-sl.body.xml --target_path=data/ssj500k-sl.TEI/ssj500k-sl.body.reduced.xml
If the `target_path` parameter is not provided, the output is written to `data/ssj500k-sl.TEI/ssj500k-sl.body.reduced.xml` by default.
If you want to use pretrained embeddings in the non-contextual coreference model, make sure to download the Slovene
- word2vec vectors (Word2Vec Continuous Skipgram) from http://vectors.nlpl.eu/repository/ and/or
- fastText vectors (`bin` format) from https://fasttext.cc/docs/en/crawl-vectors.html (not used in the paper, but supported).

Put them into `data/` (either the `cc.sl.300.bin` file for fastText or the `model.txt` file for word2vec).
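As a sanity check that the downloaded vectors load, here is a minimal sketch using gensim; the repository's own loading code may differ, and gensim may need to be installed separately:

```python
# Minimal sketch (assumes gensim): verify that the downloaded Slovene vectors load.
from gensim.models import KeyedVectors
from gensim.models.fasttext import load_facebook_vectors

w2v = KeyedVectors.load_word2vec_format("data/model.txt")  # word2vec, text format
ft = load_facebook_vectors("data/cc.sl.300.bin")           # fastText, binary format

print(w2v.vector_size, ft.vector_size)
```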
For the contextual coreference model, make sure to download the pretrained Slovene ELMo embeddings. Extract the options file and the weight file into `data/slovenian-elmo`.
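Below is a minimal sketch of loading the extracted ELMo files with allennlp; the file names are placeholders for whatever the downloaded archive contains, and the repository's models wrap this logic differently:

```python
# Minimal sketch (assumes allennlp with the classic Elmo module):
# embed one tokenized Slovene sentence with the downloaded ELMo weights.
from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = "data/slovenian-elmo/options.json"                # adjust to the extracted file name
weight_file = "data/slovenian-elmo/slovenian-elmo-weights.hdf5"  # adjust to the extracted file name

elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0.0)
character_ids = batch_to_ids([["Janez", "Novak", "je", "šel", "v", "Mercator", "."]])
embeddings = elmo(character_ids)["elmo_representations"][0]  # (batch, num_tokens, embedding_dim)
print(embeddings.shape)
```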
Before running anything, make sure to set the `DATA_DIR` and `SSJ_PATH` parameters in `src/data.py` (if the paths to your datasets differ in your setup).
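For illustration only, the two assignments in `src/data.py` might look roughly like this (the actual default values in the repository may differ):

```python
# Illustrative values only -- point these at wherever your datasets actually live.
DATA_DIR = "data/coref149"
SSJ_PATH = "data/ssj500k-sl.TEI/ssj500k-sl.body.reduced.xml"
```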
Below are examples of how to run each model. Parameters and their default values can be previewed at the top of each model's file. The `--dataset` parameter can be either `coref149` or `senticoref`.
Baseline model (linear regression with hand-crafted features)
$ python baseline.py \
--model_name="my_baseline_model" \
--learning_rate="0.05" \
--dataset="coref149" \
--num_epochs="20" \
--fixed_split
Non-contextual model with word2vec embeddings
$ python noncontextual_model.py \
--model_name="my_noncontextual_model" \
--fc_hidden_size="512" \
--dropout="0.0" \
--learning_rate="0.001" \
--dataset="coref149" \
--embedding_size="100" \
--use_pretrained_embs="word2vec" \
--freeze_pretrained \
--fixed_split
Contextual model with ELMo embeddings
$ python contextual_model_elmo.py \
--model_name="my_elmo_model" \
--dataset="coref149" \
--fc_hidden_size="64" \
--dropout="0.4" \
--learning_rate="0.005" \
--num_epochs="20" \
--freeze_pretrained \
--fixed_split
Contextual model with BERT embeddings
$ python contextual_model_bert.py \
--model_name="my_bert_model" \
--dataset="coref149" \
--fc_hidden_size="64" \
--dropout="0.4" \
--learning_rate="0.001" \
--num_epochs="20" \
--freeze_pretrained \
--fixed_split
The REST API is provided by FastAPI/uvicorn.
Install the base and API requirements by running
$ pip install -r requirements.txt
$ pip install -r requirements-api.txt
Running the REST API requires two environment variables:
- `COREF_MODEL_PATH`, containing the path to the coreference model,
- `CLASSLA_RESOURCES_DIR`, the path where the CLASSLA resources will be downloaded (if not already present) and used.
The REST API can then be run by moving into the `src` directory and running the `uvicorn` module:
$ cd ./src
$ python -m uvicorn rest_api:app --port=5020
You can, of course, define the required environment variables on the fly when running, for example:
$ CLASSLA_RESOURCES_DIR=./classla_resources \
COREF_MODEL_PATH=./contextual_model_bert \
python -m uvicorn rest_api:app --port=5020
Assuming everything went smoothly, the API will become available at http://localhost:5020/predict/coref.
...
INFO: Uvicorn running on http://127.0.0.1:5020 (Press CTRL+C to quit)
To test it, try sending a request with curl:
$ curl -X POST -H "Content-Type: application/json" \
-H "Accept: application/json" \
-d '{"threshold": 0.60, "return_singletons": true, "text": "Janez Novak je šel v Mercator. Tam je kupil mleko. Nato ga je spreletela misel, da bi moral iti v Hofer."}' \
"http://localhost:5020/predict/coref"
After starting up the API, the OpenAPI/Swagger documentation will also become accessible at http://localhost:5020/docs.
To build the docker image, run
docker build --tag slo-coref -f Dockerfile .
The trained BERT model is available on the CJVT NAS (credentials needed).
To run the docker image, run the following command, with the mount source paths adjusted to your setup:
docker run --rm -it --name slo-coref \
-p 5020:5020 \
--env COREF_MODEL_PATH="/app/data/bert_based/" \
--mount type=bind,source="`pwd`/model/cseb_senticoref/",destination="/app/data/bert_based/",ro \
slo-coref
Note: if the container is killed at startup, increase the Docker memory limit (8 GB is suggested).
The operation Razvoj slovenščine v digitalnem okolju (Development of Slovene in a Digital Environment) is co-financed by the Republic of Slovenia and the European Union from the European Regional Development Fund. The operation is carried out within the Operational Programme for the Implementation of the EU Cohesion Policy in the period 2014-2020.
SloCOREF tool (modeling and training) was partly financed by CLARIN.SI.