This repository contains the code used to train and evaluate improved trankit models. These models were trained on larger datasets (UD Slovenian-SSJ r2.12 and UD Slovenian-SST r2.12) than those used by the trankit authors. A version trained on the UD Slovenian-SSJ data outperformed the original trankit model on all metrics on the SloBench leaderboard.
For details on how trankit works and on the options the library offers, please refer to the original trankit documentation. This repository only illustrates how to use the improved models developed during this project. The models are available from the CLARIN.SI repository.
Below, we provide a step-by-step guide on how to use our models with the trankit tool.
from trankit import Pipeline, trankit2conllu
# Initialize trankit
p = Pipeline(lang='customized', cache_dir='<PATH TO DOWNLOADED MODELS>', embedding='xlm-roberta-large')
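There are two options for processing input: raw text or a pre-tokenized list.
# Option 1: raw text (trankit performs tokenization and sentence splitting)
text = 'Example text!'
dict_output = p(text)
# Option 2: pre-tokenized input, given as a list of sentences, each a list of tokens
pretokenized_list = [['Example', 'pre-tokenized', 'list', '!']]
dict_output = p(pretokenized_list)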
# Convert the output dictionary to CoNLL-U format
conllu_output = trankit2conllu(dict_output)
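If you prefer to work with the dictionary output directly instead of converting it to CoNLL-U, the annotations can be read from the nested structure. The snippet below is a minimal sketch, not part of this repository: `extract_annotations` is a hypothetical helper, and `sample_output` is a hand-written stand-in that assumes the usual trankit layout (a `sentences` list whose items hold a `tokens` list with keys such as `text`, `lemma` and `upos`). The values in it are illustrative and were not produced by the models. In real use you would pass the `dict_output` returned by the pipeline.

```python
# Hand-written stand-in for the dictionary returned by the pipeline.
# Assumption: the layout follows trankit's documented output format;
# the annotation values below are illustrative, not real model output.
sample_output = {
    "text": "Example text!",
    "sentences": [
        {
            "id": 1,
            "text": "Example text!",
            "tokens": [
                {"id": 1, "text": "Example", "lemma": "example", "upos": "NOUN"},
                {"id": 2, "text": "text", "lemma": "text", "upos": "NOUN"},
                {"id": 3, "text": "!", "lemma": "!", "upos": "PUNCT"},
            ],
        }
    ],
}


def extract_annotations(doc):
    """Collect (form, lemma, UPOS) triples from a trankit-style dictionary.

    Hypothetical helper: missing keys yield None instead of raising an error.
    """
    rows = []
    for sentence in doc.get("sentences", []):
        for token in sentence.get("tokens", []):
            rows.append((token.get("text"), token.get("lemma"), token.get("upos")))
    return rows


rows = extract_annotations(sample_output)
for form, lemma, upos in rows:
    print(f"{form}\t{lemma}\t{upos}")
```

Using `.get` keeps the helper tolerant of tokens that lack a given annotation, for example when the pipeline output does not include every layer.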