aymeam/Datasets-for-Hate-Speech-Detection

Datasets from Related Literature

This repository collects information on datasets that have been used for hate speech detection and for related tasks such as the detection of cyberbullying, abusive language, and online harassment, to make them easier for researchers to find and obtain.

Although many social media platforms can serve as data sources, building a balanced labeled dataset is costly in time and effort, and it remains a problem for researchers in the area. Most of the datasets listed below are not explicitly available, but some of them can be obtained from the authors on request.
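As a rough illustration of what "balanced" means here, the following minimal sketch computes the class distribution and the ratio between the largest and smallest class for a list of labels. The function names and the toy labels are hypothetical and are not taken from any of the listed datasets; a real dataset would first need to be loaded from its own file format.

```python
from collections import Counter


def label_distribution(labels):
    """Return {label: (count, share)} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}


def imbalance_ratio(labels):
    """Ratio of the most frequent class to the least frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Toy labels invented for illustration only (assumption, not real data).
toy_labels = ["hate"] * 2 + ["not_hate"] * 8

dist = label_distribution(toy_labels)
ratio = imbalance_ratio(toy_labels)
```

A ratio close to 1 suggests a roughly balanced dataset, while larger values indicate that one class dominates. Both helpers assume a non-empty list of labels.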

Spanish

| No | Dataset (Link to paper) | Objects | Size | Available | Labels |
|---|---|---|---|---|---|
| 1 | IberEval 2018 | Tweets | 4138 | Download | Misogyny (5 categories), Not Misogyny |
| 2 | MEX-A3T | Tweets | 11000 | Download | Aggressive, Not Aggressive |
| 3 | SemEval19, 2019 | Tweets | 4500 | Request Link | Hate Speech, Non Hate Speech |
| 4 | Pereira et al., 2019 | Tweets | 6000 | Download | Hate Speech, Non Hate Speech |
| 5 | Chilean Dataset | Tweets | 9834 | Download | Several categories, including hate speech |

Italian

| No | Dataset (Link to paper) | Objects | Size | Available | Labels |
|---|---|---|---|---|---|
| 1 | Sanguinetti et al., 2018 | Tweets | 6929 | Download | Hate Speech, Non Hate Speech |
| 2 | EVALITA 2018 | Facebook Posts | 4000 | Download | No Hate, Weak Hate, Strong Hate |
| 3 | EVALITA 2018 | Tweets | 4000 | Download | Hate Speech, Non Hate Speech |
| 4 | EVALITA 2020 | Tweets | 6839 | Request Link | Hate Speech, Non Hate Speech |

English

| No | Dataset (Link to paper) | Objects | Size | Available | Labels | Comment |
|---|---|---|---|---|---|---|
| 1 | Dinakar et al., 2011 | YouTube Comments | 6000 | - | Sexuality, Race, Culture, Intelligence | |
| 2 | Dadvar and Jong, 2012 | Myspace Posts | 2200 | - | Bullying, Non Bullying | |
| 3 | Huang et al., 2014 | Tweets | 4865 | - | Bullying, Non Bullying | |
| 4 | Hosseinmardi et al., 2015 | Instagram Media Sessions | 998 | - | Bullying, Non Bullying | |
| 5 | Waseem and Hovy, 2016 | Tweets | 16914 | Download | Racist, Sexist, Neither | |
| 6 | Waseem, 2016 | Tweets | 6909 | Download | Racist, Sexist, Neither, Both | |
| 7 | Nobata et al., 2016 | Yahoo Comments | 2000 | - | Abusive, Clean | |
| 8 | Chatzakou et al., 2017 | Twitter Users | 9484 | - | Aggressor, Bully, Spammer | |
| 9 | Davidson et al., 2017 | Tweets | 24802 | Download | hate_speech, offensive, neither | |
| 10 | Golbeck et al., 2017 | Tweets | 35000 | - | Harassing, Non Harassing | |
| 11 | Wulczyn et al., 2017 | Wikipedia Comments | 100000 | Download | Personal Attacks | |
| 12 | Tahmasbi and Rastegari, 2018 | Tweets | 12837 | - | Bullying, Non Bullying | |
| 13 | Anzovino et al., 2018 | Tweets | 4454 | - | Discredit, Stereotype, Objectification, Sexual_Harassment, Threats of Violence, Dominance, Derailing | |
| 14 | Founta et al., 2018 | Tweets | 80000 | Download | Hate Speech, Offensive, None | |
| 15 | Gibert et al., 2018 | Sentences from Stormfront | 10568 | Download | Hate Speech, Non Hate Speech | |
| 16 | SemEval19, 2019 | Tweets | 9000 | Request Link | Hate Speech, Non Hate Speech | |
| 17 | OLID 2019 | Tweets | 14100 | Download | Offensive, Non Offensive | |
| 18 | TREC2 2020 | Messages (Twitter, Facebook, YouTube) | 4263 | Request Form | Misogynous (GEN, NGEN), Aggression Level (OAG, CAG, NAG) | Data geolocated in India |
| 19 | meTooMA 2020 | Tweets | 9973 | Download | Hate Speech (Directed, Generalized), Relevance (0, 1), Stance (Support, Opposition, Neither) | Data geolocated in India, Australia, Kenya, Iran, UK |

Arabic

| No | Dataset (Link to paper) | Objects | Size | Available | Labels |
|---|---|---|---|---|---|
| 1 | Mubarak et al., 2017 | Tweets | 1100 | Download | Obscene, Offensive but not obscene, Clean |
| 2 | Albadi et al., 2018 | Tweets | 6136 | Download | Hate Speech, Non Hate Speech |
| 3 | Alakrot A. et al., 2018 | Tweets | 15050 | Download | Offensive, Not Offensive |
| 4 | Ousidhoum et al., 2019 | Tweets | 3353 | Download | Hate Speech, Non Hate Speech |
| 5 | L-HSAB, 2019 | Tweets | 5846 | Download | Normal, Abuse, Hate Speech |

Other languages

| No | Dataset (Link to paper) | Objects | Size | Available | Language | Labels |
|---|---|---|---|---|---|---|
| 1 | Hee et al., 2015 | Ask.fm Posts | 85485 | - | Dutch | Threat-Blackmail, Sexual-talk, Insult, Curse-Exclusion, Defense, Defamation-Encouragement |
| 2 | Papegnies et al., 2017 | Game Chat Logs | 2779 | - | French | Abusive, Non Abusive |
| 3 | Sirihattasak et al., 2018 | Tweets | 3300 | Yes | Thai | Toxic, Non Toxic |
| 4 | Bohra et al., 2018 | Tweets | 4575 | Yes | Hindi-English | Hate Speech, Non Hate Speech |
| 5 | Fortuna et al., 2019 | Tweets | 5668 | Download | Portuguese | Hate Speech (81 categories), Non Hate Speech |
| 6 | TREC2 2020 | Messages (Twitter, Facebook, YouTube) | 3984 | Request Form | Hindi | Misogynous (GEN, NGEN), Aggression Level (OAG, CAG, NAG) |
| 7 | TREC2 2020 | Messages (Twitter, Facebook, YouTube) | 3826 | Request Form | Bangla | Misogynous (GEN, NGEN), Aggression Level (OAG, CAG, NAG) |

Multilingual (Parallel Data)

| No | Dataset (Link to paper) | Objects | Size | Available | Language | Labels |
|---|---|---|---|---|---|---|
| 1 | XHate 999 | Tweets from previously published English datasets, translated to 5 languages | 600 (x 6 languages) | Download | English, German, Russian, Croatian, Albanian, Turkish | Sexism, racism, toxicity, hatefulness, aggression, attack, cyberbullying, misogyny, obscenity, threats, and insults |

Multimodal Datasets

| No | Dataset (Link to paper) | Objects | Size | Available | Language | Labels |
|---|---|---|---|---|---|---|
| 1 | Kiela et al., 2020 | Memes (Image + Text) | 10000 | Competition link | Texts in English | Hate, No Hate |
| 2 | Pramanick et al., 2021 | Memes (Image + Text) | 3544 | Download | Texts in English | Not harmful, somewhat harmful, very harmful |
