Summary

A Universal Dependencies treebank of romanized, user-generated Algerian Arabic (NArabizi), a North African Arabic dialect known for frequent code-switching. In addition to the UD annotations, this release adds named-entity (NER) annotations that extend the French Treebank NER scheme (Sagot et al., 2012) and offensive-language classification labels, and corrects many of the translations (this correction work is still ongoing).

Introduction

This repository includes the dataset presented in the paper "Enriching the NArabizi Treebank: A Multifaceted Approach to Supporting an Under-Resourced Language".

The first version of the NArabizi corpus was presented in Seddah et al. (2020), and extensive parsing results were reported in Riabi et al. (2021).

Splitting

After deduplication, the corpus contains 18561 tokens in 1287 sentences.

In UD_Magherebi_Arabic_French-Arabizi, the data were randomly split into:

  • fr_Magherebi_Arabic_French-Arabizi-ud-test.conllu: 14444 tokens in 1003 sentences
  • fr_Magherebi_Arabic_French-Arabizi-ud-dev.conllu: 2064 tokens in 139 sentences
  • fr_Magherebi_Arabic_French-Arabizi-ud-train.conllu: 2053 tokens in 145 sentences
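
The figures above can be checked against the files themselves. The following is a minimal sketch, assuming the files follow the standard CoNLL-U layout (comment lines starting with `#`, tab-separated token lines, a blank line after each sentence). The function names `count_conllu` and `count_file` are illustrative and not part of this repository. The sketch counts only lines with a plain integer ID, so multiword-token ranges (such as `1-2`) and empty nodes (such as `1.1`) are skipped; if the treebank uses those, the result may differ from the token counts listed above.

```python
from typing import Iterable, Tuple


def count_conllu(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (sentences, words) for an iterable of CoNLL-U lines."""
    sentences = 0
    words = 0
    in_sentence = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if one is open.
            if in_sentence:
                sentences += 1
                in_sentence = False
            continue
        if line.startswith("#"):
            # Sentence-level comment, e.g. "# text = ...".
            continue
        in_sentence = True
        token_id = line.split("\t", 1)[0]
        # Count only plain integer IDs; ranges and empty nodes are skipped.
        if token_id.isdigit():
            words += 1
    if in_sentence:
        # The last sentence may lack a trailing blank line.
        sentences += 1
    return sentences, words


def count_file(path: str) -> Tuple[int, int]:
    """Convenience wrapper (hypothetical helper) that reads a file as UTF-8."""
    with open(path, encoding="utf-8") as handle:
        return count_conllu(handle)
```

Calling `count_file` on each of the three `.conllu` files and summing the results gives totals that can be compared with the corpus-level figures stated above.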

Genres

The original sentences of the corpus are taken from:

  • Web forums of an Algerian newspaper, collected by Cotterell et al. (2014).
  • Lyrics from a few dozen popular songs of various genres (Raï, hip-hop, etc.)

Acknowledgments

References

Changelog

  • 2023-05-15 v2.12
    • Initial release in Universal Dependencies.
  • 2023-03-08
    • Manual corrections in the original treebank
    • Deduplication of the treebank
    • Improve NOUN/PROPN distinction
    • Several changes for harmonisation
    • Harmonisation of tokenisation
    • Correction of the cycles
    • Fixing encoding error for Arabic script
    • Verify and fix original text
    • Add NER annotations in MISC field
    • Add offensive annotations in the metadata.
    • Fix some translations by native speakers
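
The NER and offensive-language annotations mentioned in the changelog can be read with a generic CoNLL-U reader. The sketch below is a hedged illustration only: it assumes the standard CoNLL-U conventions (MISC is the tenth column with `Key=Value` pairs joined by `|`, and sentence-level metadata appears as `# key = value` comment lines). This README does not state the actual attribute names used for the NER and offensive labels, so the code returns all keys it finds rather than looking for a specific one; `parse_misc` and `read_sentences` are hypothetical helper names.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Token = Tuple[str, Dict[str, str]]  # (FORM, parsed MISC attributes)


def parse_misc(misc: str) -> Dict[str, str]:
    """Split a MISC value such as 'A=1|B=2' into a dict; '_' means empty."""
    if not misc or misc == "_":
        return {}
    pairs: Dict[str, str] = {}
    for item in misc.split("|"):
        key, sep, value = item.partition("=")
        pairs[key] = value if sep else ""
    return pairs


def read_sentences(lines: Iterable[str]) -> Iterator[Tuple[Dict[str, str], List[Token]]]:
    """Yield (comments, tokens) for each sentence in CoNLL-U lines."""
    comments: Dict[str, str] = {}
    tokens: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # Blank line: emit the sentence collected so far, if any.
            if comments or tokens:
                yield comments, tokens
            comments, tokens = {}, []
        elif line.startswith("#"):
            # Keep only "# key = value" comments; split at the first "=".
            key, sep, value = line[1:].partition("=")
            if sep:
                comments[key.strip()] = value.strip()
        else:
            cols = line.split("\t")
            if len(cols) == 10:
                tokens.append((cols[1], parse_misc(cols[9])))
    if comments or tokens:
        yield comments, tokens
```

Iterating over `read_sentences(open(path, encoding="utf-8"))` and printing the comment and MISC keys of the first few sentences is a quick way to discover which attribute names carry the NER and offensive-language labels in these files.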
=== Machine-readable metadata (DO NOT REMOVE!) ================================
Data available since: UD v2.12
License: CC BY-SA 4.0
Includes text: yes
Genre: nonfiction news
Lemmas: converted from manual
UPOS: converted from manual
XPOS: manual native
Features: converted from manual
Relations: converted from manual
Contributors: Riabi, Arij; Essaidi, Farah; Fethi, Amal; Mahamdi, Menel; Seddah, Djamé
Contributing: elsewhere
Contact: djame.seddah@gmail.com
===============================================================================