Update on April 27th, 2023: We acknowledge that the Twitter API has limited free access. Please contact Cagri Toraman (cagritoraman@gmail.com) if you have difficulty fetching the data from the Twitter API.

Published Models:

English hate speech detection model finetuned on Dataset v2:

https://huggingface.co/ctoraman/hate-speech-bert

Turkish hate speech detection model finetuned on Dataset v2:

https://huggingface.co/ctoraman/hate-speech-berturk
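
A minimal sketch of loading either model, assuming the Hugging Face `transformers` library (with a backend such as PyTorch) is installed. The model identifiers come from the URLs above; the helper name `load_classifier` and the `"en"`/`"tr"` keys are illustrative choices, not part of this repository.

```python
# Hub identifiers taken from the two model URLs above.
MODEL_IDS = {
    "en": "ctoraman/hate-speech-bert",
    "tr": "ctoraman/hate-speech-berturk",
}


def load_classifier(lang: str):
    """Return a text-classification pipeline for "en" or "tr".

    The first call downloads the model weights from the Hugging Face Hub.
    """
    # Imported lazily because transformers is a heavy, optional dependency.
    from transformers import pipeline

    return pipeline("text-classification", model=MODEL_IDS[lang])
```

Calling `load_classifier("en")("some text")` should return a list with one label and score per input. The label names come from the model's own configuration and are not documented here, so inspect them before mapping them to Normal, Offensive, and Hate.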

Large-Scale Hate Speech Datasets

This repository contains the dataset used in the LREC 2022 paper "Large-Scale Hate Speech Detection with Cross-Domain Transfer". The study mainly focuses on hate speech detection in Turkish and English. It also examines how well detection transfers between hate domains.

There are two dataset versions.

Dataset v1: The original dataset, published in LREC 2022, which includes 100,000 tweets each for English and Turkish. The annotations with more than 60% agreement are included.

Dataset v2: A more reliable dataset version that includes 68,597 tweets for English and 60,310 for Turkish. The annotations with more than 80% agreement are included.

Dataset v2 (hate_speech_dataset_v2.csv)

We acknowledge that some annotations in the original dataset (v1) are controversial. Therefore, we publish a more reliable dataset version (v2) that includes only the tweets with more than 80% annotator agreement. Dataset v2 has 128,907 tweets: 60,310 are Turkish and 68,597 are English. Explanations of the columns of the file are as follows:

Column Name	Description
TweetID	Twitter ID of the tweet
LangID	Language of the tweet 0-Turkish, 1-English
TopicID	Domain of the topic 0-Religion, 1-Gender, 2-Race, 3-Politics, 4-Sports
HateLabel	Final hate label decision 0-Normal, 1-Offensive, 2-Hate
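
The integer codes above can be decoded while reading the file. The sketch below uses only the standard library and assumes the CSV is comma-separated with a header row that matches the column names listed above; the two sample rows are made up for illustration and are not real tweet IDs.

```python
import csv
import io

LANGUAGES = {0: "Turkish", 1: "English"}
TOPICS = {0: "Religion", 1: "Gender", 2: "Race", 3: "Politics", 4: "Sports"}
HATE_LABELS = {0: "Normal", 1: "Offensive", 2: "Hate"}


def read_v2(handle):
    """Yield one dict per tweet with the integer codes decoded to names."""
    for row in csv.DictReader(handle):
        yield {
            # Kept as a string so large IDs are never rounded by other tools.
            "tweet_id": row["TweetID"],
            "language": LANGUAGES[int(row["LangID"])],
            "topic": TOPICS[int(row["TopicID"])],
            "label": HATE_LABELS[int(row["HateLabel"])],
        }


# Two made-up rows standing in for hate_speech_dataset_v2.csv.
sample = io.StringIO(
    "TweetID,LangID,TopicID,HateLabel\n"
    "1000000000000000001,1,0,0\n"
    "1000000000000000002,0,4,2\n"
)
rows = list(read_v2(sample))
print(rows[1])
```

To read the real file, pass `open("hate_speech_dataset_v2.csv", newline="", encoding="utf-8")` instead of `sample`, adjusting the path and delimiter if the file differs from the assumptions above.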

Distribution of tweets in the dataset is as follows:

Lang.	Domain	Hate	Offensive	Normal	Total
EN	Religion	328	2,369	10,713	13,410
EN	Gender	255	3,043	9,537	12,835
EN	Race	405	1,631	12,566	14,602
EN	Politics	343	2,972	9,994	13,309
EN	Sports	286	2,814	11,341	14,441
EN	Total	1,617 (2%)	12,829 (19%)	54,151 (79%)	68,597
TR	Religion	2,281	3,814	5,058	11,153
TR	Gender	970	3,385	8,353	12,708
TR	Race	1,897	2,276	8,236	12,409
TR	Politics	3,657	1,529	6,251	11,437
TR	Sports	4,016	3,930	4,657	12,603
TR	Total	12,821 (21%)	14,934 (25%)	32,555 (54%)	60,310
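
The percentages in the Total rows follow directly from the counts. As a small worked example, the sketch below recomputes them; the helper name `class_shares` is illustrative.

```python
# (hate, offensive, normal) totals copied from the distribution table above.
V2_TOTALS = {
    "EN": (1_617, 12_829, 54_151),
    "TR": (12_821, 14_934, 32_555),
}


def class_shares(counts):
    """Return each class count as a rounded percentage of the row total."""
    total = sum(counts)
    return tuple(round(100 * c / total) for c in counts)


shares = {lang: class_shares(counts) for lang, counts in V2_TOTALS.items()}
print(shares)
```

The English portion is far more imbalanced than the Turkish one (about 2% hate versus 21%), which is worth keeping in mind when comparing per-class scores across the two languages.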

Dataset labeler (hate_speech_dataset_v2_labeler.csv)

This file contains the individual annotations for each tweet. There are 20 labelers, and each tweet is annotated by 5 labelers.

Column Name	Description
TweetID	Twitter ID of the tweet
labeler_i	Annotation of the ith annotator 0-Normal, 1-Offensive, 2-Hate

Using Dataset v2, we run BERT and BERTurk with 10-fold cross-validation (as for the published version, v1). Each split uses 90% of the data for training and 10% for testing. We report the average F1 scores.
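
The splitting protocol described above can be sketched with the standard library alone. This is an illustration under assumptions, not the authors' code: the shuffling, the seed, and the lack of stratification by class are choices made here, and the function name `k_fold_indices` is hypothetical.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists for k roughly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # Training data is everything outside the held-out fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(100))
train, test = splits[0]
print(len(splits), len(train), len(test))
```

With 100 items this yields 10 splits of 90 training and 10 test indices each, and every item is held out exactly once across the folds.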

F1-Score	Neutral	Offensive	Hateful	Weighted
Bert-base-uncased (EN)	0.968 ± 0.002	0.858 ± 0.008	0.631 ± 0.039	0.940 ± 0.004
Bert-base-turkish-uncased (TR)	0.946 ± 0.002	0.852 ± 0.005	0.887 ± 0.005	0.910 ± 0.003
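
As a rough consistency check, the Weighted column can be approximated from the per-class scores, assuming it is the average of per-class F1 weighted by class size. That reading is an assumption, and because the reported values are averaged over folds, the result is not expected to match exactly.

```python
def weighted_f1(f1_by_class, support_by_class):
    """Average per-class F1 scores, weighting each class by its size."""
    total = sum(support_by_class.values())
    return sum(f1_by_class[c] * support_by_class[c] for c in f1_by_class) / total


# Per-class F1 from the table above; class sizes from the v2 distribution.
en = weighted_f1(
    {"normal": 0.968, "offensive": 0.858, "hate": 0.631},
    {"normal": 54_151, "offensive": 12_829, "hate": 1_617},
)
tr = weighted_f1(
    {"normal": 0.946, "offensive": 0.852, "hate": 0.887},
    {"normal": 32_555, "offensive": 14_934, "hate": 12_821},
)
print(round(en, 3), round(tr, 3))
```

Both results land within about 0.001 of the reported weighted scores, which also shows why the English weighted score stays high despite the low Hateful F1: the hate class is only about 2% of the English tweets.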

Thanks to Izzet Emre Kucukkaya for helping in the preparation of the dataset v2.

Dataset v1 (hate_speech_dataset.csv)

The dataset is composed of 200,000 tweets, half Turkish and half English. We also provide the hate speech domain of each tweet: Religion, Gender, Race, Politics, or Sports. Each domain has 20,000 tweets in each language. The five hate annotations of each tweet are also given. To comply with Twitter's Terms and Conditions, we publish tweet IDs rather than the tweet content. Explanations of the columns of the file are as follows:

Column Name	Description
TweetID	Twitter ID of the tweet
LangID	Language of the tweet 0-Turkish, 1-English
TopicID	Domain of the topic 0-Religion, 1-Gender, 2-Race, 3-Politics, 4-Sports
Label_1	Annotation of the first annotator 0-Normal, 1-Offensive, 2-Hate
Label_2	Annotation of the second annotator 0-Normal, 1-Offensive, 2-Hate
Label_3	Annotation of the third annotator 0-Normal, 1-Offensive, 2-Hate
Label_4	Annotation of the fourth annotator 0-Normal, 1-Offensive, 2-Hate
Label_5	Annotation of the fifth annotator 0-Normal, 1-Offensive, 2-Hate
HateLabel	Final hate label decision 0-Normal, 1-Offensive, 2-Hate
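
The five annotation columns make it possible to measure how strongly annotators agree on a tweet. The sketch below computes the most common label and the fraction of annotators who chose it. How HateLabel was actually derived from Label_1 to Label_5 is not stated here, so this is not a reproduction of that decision; the function name and sample values are hypothetical.

```python
from collections import Counter


def summarize_annotations(labels):
    """Return (most common label, fraction of annotators who chose it)."""
    label, votes = Counter(labels).most_common(1)[0]
    return label, votes / len(labels)


# Hypothetical values of Label_1..Label_5 for two tweets.
strong = summarize_annotations([2, 2, 2, 1, 2])
weak = summarize_annotations([0, 1, 0, 0, 1])
print(strong, weak)
```

On a tie, such as a 2-2-1 split, `most_common` returns whichever tied label appears first in the list, so tied rows need a separate rule if they matter for your use.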

Distribution of tweets in the dataset is as follows:

Lang.	Domain	Hate	Offensive	Normal	Total
EN	Religion	1,427	5,221	13,352	20,000
EN	Gender	1,313	6,431	12,256	20,000
EN	Race	1,541	3,846	14,613	20,000
EN	Politics	1,610	6,018	12,372	20,000
EN	Sports	1,434	5,624	12,942	20,000
EN	Total	7,325 (7%)	27,140 (27%)	65,535 (66%)	100,000
TR	Religion	5,688	7,435	6,877	20,000
TR	Gender	2,780	6,521	10,699	20,000
TR	Race	5,095	4,905	10,000	20,000
TR	Politics	7,657	4,253	8,090	20,000
TR	Sports	6,373	7,633	5,994	20,000
TR	Total	27,593 (28%)	30,747 (31%)	41,660 (41%)	100,000

Contact

Please contact Cagri Toraman (cagritoraman@gmail.com) in case of any issues with the datasets.

Citation

If you make use of this dataset, please cite the following paper:

@InProceedings{toraman2022large,
  author    = {Toraman, Cagri  and  \c{S}ahinu\c{c}, Furkan and Yilmaz, Eyup Halit},
  title     = {Large-Scale Hate Speech Detection with Cross-Domain Transfer},
  booktitle = {Proceedings of the Language Resources and Evaluation Conference},
  month     = {June},
  year      = {2022},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {2215--2225},
  url       = {https://aclanthology.org/2022.lrec-1.238}
}
