clarinsi/trankit-train

About

This repository contains the code used to train and evaluate enhanced trankit models. These models were trained on larger datasets (UD Slovenian-SSJ r2.12 and UD Slovenian-SST r2.12) than those provided by the trankit authors. A model trained on the UD Slovenian-SSJ data outperformed the original trankit model on all metrics on the SloBench leaderboard.

For details on how trankit works and which options the library offers, please refer to the original trankit documentation. This repository illustrates how to use the improved models developed during this project. The models are available from the CLARIN.SI repository.

Usage example

Below, we provide a step-by-step guide on how to use our models with the trankit tool.

Step 1: Initialization

from trankit import Pipeline, trankit2conllu

# Initialize trankit
p = Pipeline(lang='customized', cache_dir='<PATH TO DOWNLOADED MODELS>', embedding='xlm-roberta-large')

Step 2: Process Input

There are two options for processing input:

Option 1 - Using Text Input:

text = 'Example text!'
dict_output = p(text)

Option 2 - Using a Pre-tokenized List:

pretokenized_list = [['Example', 'pre-tokenized', 'list', '!']]
dict_output = p(pretokenized_list)
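
Both options return a nested dictionary. The sketch below shows one way to walk such a dictionary and pull out the form, lemma, and UPOS tag of every token. The key names (`sentences`, `tokens`, `id`, `text`, `lemma`, `upos`) follow the output layout described in the trankit documentation, but treat them as assumptions and check them against your own `dict_output`. The helper `iter_tokens` and the dictionary `sample_output` are hypothetical: `sample_output` is a hand-written stand-in, not real model output, so that the example runs without the models.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_tokens(doc: Dict[str, Any]) -> Iterator[Tuple[Any, Any, Any, Any, Any]]:
    """Yield (sentence id, token id, form, lemma, UPOS) for every token.

    Missing keys come back as None instead of raising, because the exact
    set of keys depends on the pipeline and its models (assumption).
    """
    for sentence in doc.get("sentences", []):
        for token in sentence.get("tokens", []):
            yield (
                sentence.get("id"),
                token.get("id"),
                token.get("text"),
                token.get("lemma"),
                token.get("upos"),
            )


# Hand-written stand-in for dict_output; NOT real model output.
sample_output = {
    "text": "Example text!",
    "sentences": [
        {
            "id": 1,
            "text": "Example text!",
            "tokens": [
                {"id": 1, "text": "Example", "lemma": "example", "upos": "NOUN"},
                {"id": 2, "text": "text", "lemma": "text", "upos": "NOUN"},
                {"id": 3, "text": "!", "lemma": "!", "upos": "PUNCT"},
            ],
        }
    ],
}

rows = list(iter_tokens(sample_output))
for row in rows:
    print(row)
```

With real output, pass `dict_output` to `iter_tokens` instead of `sample_output`.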

Step 3: Convert Output to CoNLL-U Format

# Convert output from dictionary to CoNLL-U format
conllu_output = trankit2conllu(dict_output)
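
The converted output can then be read back with plain Python if needed. The following is a minimal sketch, assuming `conllu_output` is a string in standard CoNLL-U layout: ten tab-separated columns per token line, `#` comment lines, and a blank line between sentences. The names `CONLLU_FIELDS`, `parse_conllu`, and `sample_conllu` are hypothetical, and `sample_conllu` is hand-written for illustration rather than produced by the models.

```python
from typing import Dict, List

# The ten standard CoNLL-U columns, in order.
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Comment lines (e.g. "# text = ...") carry no token data.
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        current.append(dict(zip(CONLLU_FIELDS, columns)))
    if current:
        # The last sentence may not be followed by a blank line.
        sentences.append(current)
    return sentences


# Hand-written sample; NOT real model output.
sample_conllu = "\n".join([
    "# text = Example text!",
    "\t".join(["1", "Example", "example", "NOUN", "_", "_", "2", "compound", "_", "_"]),
    "\t".join(["2", "text", "text", "NOUN", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["3", "!", "!", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
    "",
])

parsed = parse_conllu(sample_conllu)
for token in parsed[0]:
    print(token["id"], token["form"], token["upos"], token["head"], token["deprel"])
```

To keep the result on disk, the string can be written with the standard library, for example `pathlib.Path('output.conllu').write_text(conllu_output, encoding='utf-8')`, where the file name is only a placeholder.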
