UniversalDependencies/UD_Beja-NSC

Summary

A Universal Dependencies corpus for Beja, a North Cushitic language of the Afro-Asiatic phylum, mainly spoken in Sudan, Egypt and Eritrea.

Introduction

This treebank is an automatic conversion of the SUD_beja-NSC treebank, which was extracted from Martine Vanhove's corpus in ELAN format (https://corpafroas.huma-num.fr/Archives/corpus.php).

Sentences are annotated with the following metadata:

  • sent_id (which indicates the source file and the segmentation identifier in the source file)
  • text (lexical tokenization)
  • text_en (English interpretation)
  • text_tokenized (morphological tokenization)
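
As an illustration, the four metadata fields above can be read from a CoNLL-U file with a few lines of Python. This is only a minimal sketch: it assumes the fields are stored as standard CoNLL-U sentence-level comments of the form `# key = value`, and the sample sentence below is an invented placeholder, not data from the treebank.

```python
def read_sentence_metadata(lines):
    """Yield one dict of `# key = value` comments per sentence block."""
    meta = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line ends a sentence block in CoNLL-U.
            if meta:
                yield meta
                meta = {}
        elif line.startswith("#") and "=" in line:
            # Split on the first "=" only, so values may contain "=".
            key, _, value = line[1:].partition("=")
            meta[key.strip()] = value.strip()
        # Token lines (10 tab-separated columns) are ignored here.
    if meta:
        yield meta


# Placeholder sentence: the identifiers and words are NOT real treebank content.
sample = """\
# sent_id = example_file_001
# text = placeholder text //
# text_en = placeholder translation
# text_tokenized = place holder text //
1\tplaceholder\t_\tNOUN\t_\t_\t0\troot\t_\t_
""".splitlines()

sentences = list(read_sentence_metadata(sample))
for sentence in sentences:
    print(sentence["sent_id"], "|", sentence["text_en"])
```

To process a real file, pass an open file object instead of `sample`; iterating over it yields lines in the same way.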

Structure

The data are spoken, so sentences correspond to a semantically motivated segmentation of utterances. Punctuation marks the end of intonation units: a single / for a minor unit and a double // for a major unit.
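
The boundary markers can be used to split a sentence text into intonation units. The sketch below is an assumption-laden illustration, not part of the treebank tooling: it assumes that / and // appear as whitespace-separated tokens in the text field, and the function name and input words are hypothetical.

```python
def split_intonation_units(text):
    """Return (unit, boundary) pairs; boundary is 'minor', 'major' or None."""
    units = []
    current = []
    for token in text.split():
        if token == "//":
            # Double slash: end of a major intonation unit.
            units.append((" ".join(current), "major"))
            current = []
        elif token == "/":
            # Single slash: end of a minor intonation unit.
            units.append((" ".join(current), "minor"))
            current = []
        else:
            current.append(token)
    if current:
        # Trailing words with no closing boundary marker.
        units.append((" ".join(current), None))
    return units


# Placeholder words, not real Beja data.
example_units = split_intonation_units("w1 w2 / w3 w4 //")
print(example_units)
```

If the markers turn out to be attached to the preceding word rather than separated by spaces, the tokenization step would need to be adapted accordingly.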

In the SUD version of the treebank, we apply a morphological segmentation, which makes it possible to highlight dependency relations between a root and its affixes or clitics.

To follow the UD guidelines, the segmentation is changed during the conversion to UD: affixes are merged with their root. A morph-based “UD-like” version is available here.

Reference

A morph-based and a word-based treebank for Beja, Sylvain Kahane, Martine Vanhove, Rayan Ziane, Bruno Guillaume. Proceedings of the 20th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2021).

Acknowledgments

This treebank was created as a collaboration between Martine Vanhove, Rayan Ziane and Sylvain Kahane. Thanks to Bruno Guillaume for the conversion to UD and for his help with finalization.

Changelog

  • 2023-11-15 v2.13

    • Two samples added (06_foreigner and 11_coffee).
  • 2022-05-15 v2.10

    • New conversion in UD with a new segmentation based on words.
  • 2021-05-15 v2.8

    • Initial release in Universal Dependencies. Two samples (01_shelter and 03_camel).
=== Machine-readable metadata (DO NOT REMOVE!) ================================
Data available since: UD v2.8
License: CC BY-SA 4.0
Includes text: yes
Genre: spoken
Lemmas: not available
UPOS: converted from manual
XPOS: manual native
Features: converted from manual
Relations: converted from manual
Contributors: Vanhove, Martine; Ziane, Rayan; Kahane, Sylvain; Guillaume, Bruno
Contributing: elsewhere
Contact: martine.vanhove@cnrs.fr; sylvain@kahane.fr
===============================================================================
