GitHub - DianaHoefels/CoRoSeOf: CoRoSeOf: An annotated Corpus of Romanian Sexist and Offensive Language

$\color{red}{CoR}\color{yellow}{oS}\color{blue}{eOf}$: An annotated Corpus of Romanian Sexist and Offensive Language

A collection of approximately 40k Romanian tweets annotated for sexist and offensive language, of which ≈10% are sexist and ≈11% are offensive.

Folder Structure

This project is organized into the following folders:

corpus (contains tweet id, sampling technique, annotator id and gender, non-aggregated annotations, and majority vote labels)
docs (annotation guidelines and keywords used to query the data)
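
The relationship between the non-aggregated annotations and the majority vote labels can be illustrated with a minimal sketch. It is not the authors' aggregation script: the row layout `(tweet_id, annotator_id, label)`, the label strings, and the choice to return `None` on a tie are all assumptions made for illustration, so check the files in `corpus` for the actual column names and label set.

```python
from collections import Counter, defaultdict


def majority_vote(labels):
    """Return the most frequent label, or None if empty or tied (assumed tie policy)."""
    counts = Counter(labels).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def aggregate(rows):
    """Group per-annotator labels by tweet id and reduce each group by majority vote."""
    by_tweet = defaultdict(list)
    for tweet_id, _annotator_id, label in rows:
        by_tweet[tweet_id].append(label)
    return {tweet_id: majority_vote(labels) for tweet_id, labels in by_tweet.items()}


# Hypothetical annotation rows; ids and labels are placeholders, not corpus data.
rows = [
    ("t1", "a1", "sexist"),
    ("t1", "a2", "sexist"),
    ("t1", "a3", "non-offensive"),
    ("t2", "a1", "offensive"),
    ("t2", "a2", "non-offensive"),
]

aggregated = aggregate(rows)
print(aggregated)
```

Returning `None` on a tie keeps ambiguous tweets visible rather than picking a label arbitrarily; the released corpus may resolve ties differently.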

Authors

Contributors' names and contact info:

Diana Constantina Höfels: diana-constantina.hoefels@student.uni-tuebingen.de

Dr. Çağrı Çöltekin

Dr. Irina Diana Mădroane: irina.madroane@e-uvt.ro

License

The corpus can be used under the terms of CC-BY-SA.

Journal Paper

Accepted at LREC 2022

CoRoSeOf - An Annotated Corpus of Romanian Sexist and Offensive Tweets

If you use or mention this work, please cite it as follows:


@InProceedings{hoefels-ltekin-mdroane:2022:LREC,
  author    = {Hoefels, Diana Constantina and Çöltekin, Çağrı and Mădroane, Irina Diana},
  title     = {CoRoSeOf - An Annotated Corpus of Romanian Sexist and Offensive Tweets},
  booktitle      = {Proceedings of the Language Resources and Evaluation Conference},
  month          = {June},
  year           = {2022},
  address        = {Marseille, France},
  publisher      = {European Language Resources Association},
  pages     = {2269--2281},
  abstract  = {This paper introduces CoRoSeOf, a large corpus of Romanian social media manually annotated for sexist and offensive language. We describe the annotation process of the corpus, provide initial analyses, and baseline classification results for sexism detection on this data set. The resulting corpus contains 39 245 tweets, annotated by multiple annotators (with an agreement rate of Fleiss' κ = 0.45), following the sexist label set of a recent study. The automatic sexism detection yields scores similar to some of the earlier studies (macro averaged F1 score of 83.07\% on binary classification task). We release the corpus with a permissive license.},
  url       = {https://aclanthology.org/2022.lrec-1.243}
}

Acknowledgements

We thank the annotation team (in alphabetical order): Anamaria Andrei, Raluca Ardeaun, Edward Bojboi, Octavia Cojocaru, Cristiana Giurcă, Costel Olaru, Roberta Recalo, Diana Stanciu, Tiberiu Tomescu and Carmen Tuns, from the Interdisciplinary Center of Gender Studies - West University of Timișoara.

This study used Twitter data sets. The content provided remains subject to Twitter's Developer Agreement & Policy, and users of the corpus must agree to the Twitter Terms of Service, Privacy Policy, Developer Agreement, and Developer Policy.

