Part of speech tagger and lemmatizer for Torlak

A custom model of the ReLDI tagger adapted for the Torlak dialect.

Python version: 2.7

Python modules: as required by the original ReLDI tagger (see its repository).

Citing the tagger

The tagger for Torlak and the underlying training files are described in the paper:

Vuković, Teodora (submitted). Representing variation in a spoken corpus of an endangered dialect. The case of Torlak.

To cite the ReLDI tagger, see the original GitHub repository and its README file.

Tagger files and scripts

The original ReLDI tagger files and instructions are adapted for Torlak.

The Torlak model code is called torsr ('tor' + 'sr', because the training data contains Torlak and Serbian merged). The files can be downloaded here, added to the ReLDI tagger repository and used accordingly. The Torlak files available here are:

torLex.gz
torsr.lemma_freq
torsr.msd.model
torsr.train

Another file necessary for using the tagger is torsr.marisa, which is too large to be uploaded to GitHub; it needs to be created by training the tagger as explained below, or downloaded here.

You can create the torsrLex.gz file, used for training and referenced throughout the text that follows, by merging the torLex file with the Serbian lexicon srLex_v1.2.
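A minimal sketch of one way to do this, assuming both lexicons are tab-separated text files with the same column layout (the srLex_v1.2 file name below is an assumption; adjust it to the actual download):

$ { gunzip -c torLex.gz; cat srLex_v1.2; } | sort -u | gzip > torsrLex.gz

Note that if both lexicons list the same form with different frequency values, sort -u keeps both lines, so a more careful merge may be needed in that case.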

Modifying the tagger.py script to include Torlak

In order to run tagger.py with the Torlak files containing torsr in their names, you first need to add the Torlak language code to the list of possible language-code arguments in the tagger.py script. To do so, change the following line from the original script:

parser.add_argument('lang', help='language of the text', choices=['sl', 'sl.ns', 'sl.ns.true', 'sl.ns.lower', 'hr', 'sr'])

so that the list of possible choices contains torsr, as follows:

parser.add_argument('lang', help='language of the text', choices=['sl', 'sl.ns', 'sl.ns.true', 'sl.ns.lower', 'hr', 'sr', 'torsr'])

Running the tagger

If you have the necessary files for tagging Torlak (torsr.msd.model and torsr.marisa), you can run the tagger in the terminal by entering the text word by word and pressing CTRL+D at the end:

$ ./tagger.py torsr
u
selo
mi
živimo
ja
i
dedava
.

CTRL+D
u	Sa
selo	Ncnsa
mi	Pp1-pn
živimo	Vmr1p	živeti
ja	Pp1-sn
i	Cc
dedava	Ncmsn_v
.	Z

You can include the lemmatizer using the -l flag. Lemmatization requires the torsr.lexicon.guesser file, which can be created by training the lemmatiser as described below.

$ ./tagger.py torsr -l
u
selo
mi
živimo
ja
i
dedava
.

CTRL+D
u	Sa	u
selo	Ncnsa	selo
mi	Pp1-pn	mi
živimo	Vmr1p	živeti
ja	Pp1-sn	ja
i	Cc	i
dedava	Ncmsn_v	deda
.	Z	.

You can also send a tokenized, verticalized file to be tagged to stdin, as in the example below, and optionally redirect the output to another file using > newfile.txt:

$ cat file.txt | ./tagger.py torsr -l > newfile.txt
u	Sa	u
selo	Ncnsa	selo
mi	Pp1-pn	mi
živimo	Vmr1p	živeti
ja	Pp1-sn	ja
i	Cc	i
dedava	Ncmsn_v	deda
.	Z	.

The text can be processed using the tokenizer, as explained in the original instructions for the ReLDI tagger. Bear in mind that the tokenizer from the ReLDI tagger package has not been adapted to parse Torlak transcripts; it works for written standard BKMS or Slovene.

Training the torsr model

In case you want to train the Torlak tagger, you can use the torsr.train and torsrLex.gz files or a different, modified input. As stated in the ReLDI tagger instructions, the input files need to be "in the one-token-per-line, empty-line-as-sentence-boundary format", with the token, lemma and the tag separated by a tab. See the torsr.train file as an example, and the illustration below.
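For illustration, the sentence from the tagging examples above would look like this in the training format (consult torsr.train for the authoritative layout; an empty line follows the sentence):

u	u	Sa
selo	selo	Ncnsa
mi	mi	Pp1-pn
živimo	živeti	Vmr1p
ja	ja	Pp1-sn
i	i	Cc
dedava	deda	Ncmsn_v
.	.	Z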

Once you have the necessary file in the required format, you may proceed with the training based on the ReLDI tagger instructions and using the adapted commands for the torsr model.

Preparing the lexicon trie used by the tagger

The lexicon trie is used both during training the tagger and during tagging.

The lexicon file should be formatted in the same manner as the training data, just with no sentence boundaries. In addition to the words, tags and lemmas, it contains information about the frequency of each word in the training data. To prepare the torsr lexicon, run the following command:

$ gunzip -c torsrLex.gz | cut -f 1,2,3 | ./prepare_marisa.py torsr.marisa
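Once built, the trie can be sanity-checked from Python (a hedged sketch: that the file holds a marisa_trie.BytesTrie and the key layout are assumptions; prepare_marisa.py defines the actual format):

import marisa_trie

# Load the lexicon trie produced by prepare_marisa.py (assumed BytesTrie).
trie = marisa_trie.BytesTrie()
trie.load('torsr.marisa')

# List a few keys starting with a known word form.
print(trie.keys(u'selo')[:5])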

Training the tagger

The only argument given to the script is the language code. In case of Torlak (language code torsr) the corpus training data is expected to be in the file torsr.train, while the lexicon trie is expected to be in the file torsr.marisa.

$ ./train_tagger.py torsr

Preparing the lexicon for training the lemmatiser

The first step in producing the lexicon for lemmatisation is to calculate the lemma frequency list from the tagger training data. The data should be in the same format as for training the tagger.

$ ./lemma_freq.py torsr.lemma_freq < torsr.train
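Conceptually, the frequency calculation amounts to counting lemmas over the non-empty lines of the training file. A sketch of the idea (the actual implementation and output format live in lemma_freq.py):

# -*- coding: utf-8 -*-
import io
from collections import Counter

freq = Counter()
with io.open('torsr.train', encoding='utf-8') as f:
    for line in f:
        line = line.rstrip('\n')
        if not line:
            continue  # empty line = sentence boundary
        token, lemma, tag = line.split('\t')
        freq[lemma] += 1

for lemma, count in freq.most_common(10):
    print(u'%s\t%d' % (lemma, count))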

The second step produces the lexicon in the form of a marisa_trie.BytesTrie. The lemma frequency information is used in the case of (token, msd) pair collisions: only the most frequent lemma is kept in the lexicon.

$ gunzip -c torsrLex.gz | cut -f 1,2,3 | ./prepare_lexicon.py torsr.lemma_freq torsr.lexicon
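Conceptually, the collision handling resembles the sketch below (hypothetical sample rows and key layout; the actual logic lives in prepare_lexicon.py):

# -*- coding: utf-8 -*-
import marisa_trie

# Hypothetical rows: (token, lemma, msd, lemma frequency).
rows = [
    (u'dedava', u'deda', u'Ncmsn_v', 5),
    (u'dedava', u'dedava', u'Ncmsn_v', 1),  # (token, msd) collision
]

best = {}  # (token, msd) -> (lemma, frequency)
for token, lemma, msd, freq in rows:
    key = (token, msd)
    if key not in best or freq > best[key][1]:
        best[key] = (lemma, freq)  # keep only the most frequent lemma

trie = marisa_trie.BytesTrie(
    (token + u'\t' + msd, lemma.encode('utf-8'))
    for (token, msd), (lemma, _) in best.items()
)
print(trie[u'dedava\tNcmsn_v'])  # the most frequent lemma, deda, wins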

Training the lemmatiser

The lemmatiser of unknown words is trained on the lexicon prepared in the previous step. The lexicon used for training the lemma guesser has the suffix .train. A Multinomial Naive Bayes classifier is learned for each MSD. The classes to be predicted are quadruple transformations of the form (remove_start, prefix, remove_end, suffix). A transformation is applied by removing the first remove_start characters, adding the prefix, removing the last remove_end characters and adding the suffix.
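For example, the pair dedava -> deda from the tagging example above corresponds to the quadruple (0, '', 2, ''). A sketch of applying such a transformation (the helper name is hypothetical; the guesser itself predicts which quadruple to apply):

def apply_transformation(token, remove_start, prefix, remove_end, suffix):
    # Strip remove_start characters from the front and remove_end from the back,
    # then attach the predicted prefix and suffix.
    end = len(token) - remove_end
    return prefix + token[remove_start:end] + suffix

print(apply_transformation(u'dedava', 0, u'', 2, u''))  # deda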

The output of the lemmatiser learning process is a file with the .lexicon.guesser suffix.

$ ./train_lemmatiser.py torsr.lexicon