V tem repozitoriju se nahaja rezultat aktivnosti A3.3 - Orodje za razdvoumljanje, ki je nastalo v okviru projekta Razvoj slovenščine v digitalnem okolju.
This repository contains the code for the WSD api. It is split into two main components
- The code used for the training and evaluation of the WSD model (located in
./train_and_evaluate_model.py
) - The code used to create the docker container with the WSD api. (located in
./app/api.py
)
./data/
contains the data necessary for training and evaluating the model.
train_and_evaluate_model.py contains the code for training and evaluating the model. Running the file will train and evaluate a model on data from the elexis-wsd-sl_corpus.tsv (https://www.clarin.si/repository/xmlui/handle/11356/1674), located in ./data/. We use a Camembert token prediction model to train the WSD model (Martin, Louis, et al. "CamemBERT: a Tasty French Language Model." ACL 2020-58th Annual Meeting of the Association for Computational Linguistics. 2020.). A pretrained model is available at https://nas.cjvt.si/index.php/s/GgR2CBJwqHDQQce. The model currently achieves a classification accuracy of .45 when evaluated on the elexis-wsd-sl test set and we are currently in the process of improving the model to achieve better results.
By default, the model runs on a CPU. To use a GPU, change device = torch.device("cpu")
to device = torch.device("cuda")
. This requires pytorch with GPU acceleration using CUDA (https://pytorch.org/get-started/locally/).
If you want to train a model on your own data, replace the elexis-wsd-sl_corpus.tsv files with your own data.
The code for the WSD api is located in ./app/api.py. To build the docker container, place the required model files into ./data. This will create a container running the api using the uvicorn server. The container requires three files:
- The pretrained sloberta2 model (https://www.clarin.si/repository/xmlui/handle/11356/1397), which should be placed inside ./data/sloberta2
- A trained wsd model, which should be placed inside ./data and named wsd_model.ckpt
- A sense inventory. Currently, we use elexis-wsd-sl_sense-inventory.tsv, which should be placed inside ./data
You can then build the container by running docker build -t rsdo_wsd .
. To start the API container, use the command docker run --gpus all -d --name rsdo_wsd -p 5009:80 rsdo_wsd
(GPU) or docker run -d --name rsdo_wsd -p 5009:80 rsdo_wsd
.
To run the API locally:
- Extract the contents of this repository.
- Download the requirements using
pip install -r requirements.txt
- Run the API server using uvicorn app.api:app --host 127.0.0.1 --port 80
This will start the server using the ip 127.0.0.1 and port 80. The api accepts POST requests at /predict/wsd. The endpoint requires two parameters:
inventory
. This specifies the sense inventory which contains all possible word definitions. It should contain a string with the inventory name. Currently, only "DSB" is supportedtext
. This should contain a string whith the text to be disambiguated
The api will return the following for each word:
sense_id
. The id of the identified word senseinventory
. The sense inventory passed as the parameterlemma
. The lemma of the worddefinition
. The definition of identified word sensescore
. The confidence of the obtained prediction. Higher numbers indicate higher confidence.
To test the server, try sending a POST request using curl:
curl -X POST -H "accept:application/json" -H "Content-Type:application/json" -d "{\"inventory\": \"DSB\", \"text\": \"Soba ima dvoje vrat.\" }" http://127.0.0.1:5009/predict/wsd -L
Operacijo Razvoj slovenščine v digitalnem okolju sofinancirata Republika Slovenija in Evropska unija iz Evropskega sklada za regionalni razvoj. Operacija se izvaja v okviru Operativnega programa za izvajanje evropske kohezijske politike v obdobju 2014-2020.