ParsingFrisian

Code for the project: Challenges in Annotating and Parsing Spoken, Code-switched, Frisian-Dutch data.
The code is run on data from the Fame! corpus by Yilmaz et al. (2016). The sentences in the output are thus from this corpus.

Data Selection & Data

Selection of Frisian data

This program was used to extract Frisian sentences that we did not use in our own development and test set. In the program the file is the conllu file that contained our selection of sentences. The trainfile is the file that the program should extract the sentences from and sentences is the file that the program should write the raw data to. It can simply be run by:

python3 getfy.py

Data directory

In this directory the original split and annotations can be found. Due to UD validation these annotations are slightly different than the ones that will be in the upcoming UD-treebank. For the experiments, these annotations and this data split is used.

LDA

This program takes as input conllu files and a textfile with raw sentences and finds the sentences that are most similar to the sentences in the textfile. An additional argument can be used: --maxSents. The default is 1000 sentences. The output is both a scatterplot and a csv file with the ordered sentences starting with the most similar sentence.

python3 lda.py --files f1.conllu f2.conllu --textfile sentences.txt

Creating a new conllu file

This program takes the output of lda.py and creates a conllu file for the numSents most similar sentences.

python3 get_conllu.py --filew out.csv --numSents 2000

Additional Experiments

Diacritics

translit.py removes diacritics from the sentences. It also counts the occurences.

python3 translit.py --files input.conllu --writefile output.conllu

Adding Orphans

The first program orphan_select.py selects sentences and creates new sentences with orphan relations. It writes them to a new file. The output should be used in replace.py. This program replaces a number of the sentences in the original file with the newly created sentences.

python3 orphan_select.py --files input.conllu --writefile outputorph.conllu
python3 replace.py --files input.conllu --replacefile outputorph.conllu --writefile outputreplace.conllu

Evaluation

evalAccuracy.py calculates LAS, UAS, POS scores, accuracy over all labels and creates a confusion matrix. It writes the confusion matrix for the dependency relations to a csv file.

python3 evalAccuracy.py file1.conllu file2.conllu

createScoreListsDEV.py creates files for UAS, LAS, POS scores per sentence over five runs (5 random seeds) which can be used to run a bootstrap significance test.

python3 createScoreListsDEV.py gold.conllu run1.conllu run2.conllu run3.conllu run4.conllu run5.conllu

Test does the same but only for the goldfile and one run. In the program you should change datagold to your goldfile and datasystem to your systems output file.

python3 createScoreLists.py

Outputs

In this directory all conllu outputs that are discussed in the paper can be found.

For the experiments in the paper, we used an internal version of MaChAmp from van der Goot et al. (2020). The specific commit that was used can be found in gitlogmtp.txt, and the repository is mtp. However, the code will also work with MaChAmp release 0.2.

References

van der Goot, R., Üstün, A., Ramponi, A., & Plank, B. (2020). Massive Choice, Ample Tasks (MaChAmp): A Toolkit for Multi-task Learning in NLP. arXiv preprint arXiv:2005.14672.
Yilmaz, E., Andringa, M., Kingma, S., Dijkstra, J., van der Kuip, F., van de Velde, H., Kampstra, F., Algra, J., van den Heuvel, H., van Leeuwen, D. (2016). A longitudinal bilingual Frisian-Dutch radio broadcast database desgined for code-switching research. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16).

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
AdditionalExperiments		AdditionalExperiments
Data		Data
DataSelection		DataSelection
Evaluation		Evaluation
Outputs		Outputs
README.md		README.md
gitlogmtp.txt		gitlogmtp.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ParsingFrisian

Data Selection & Data

Selection of Frisian data

Data directory

LDA

Creating a new conllu file

Additional Experiments

Diacritics

Adding Orphans

Evaluation

Outputs

References

About

Releases

Packages

Languages

Anouck96/ParsingFrisian

Folders and files

Latest commit

History

Repository files navigation

ParsingFrisian

Data Selection & Data

Selection of Frisian data

Data directory

LDA

Creating a new conllu file

Additional Experiments

Diacritics

Adding Orphans

Evaluation

Outputs

References

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages