In [None]:
import pandas as pd

# 1. MLMA Hate Speech Dataset

- Paper: https://aclanthology.org/D19-1474.pdf
- Dataset Link: https://huggingface.co/datasets/nedjmaou/MLMA_hate_speech

------

**Directness label:** the explicitness of the tweet, either `direct` or `indirect`. This should be based on whether the target is explicitly named, or less easily discernible, especially if the tweet contains humor, metaphor, or figurative speech.

**Hostility type** (multilabel) To identify the hostility type of the tweet, we stick to the following conventions:
- (1) if the tweet sounds dangerous, it should be labeled as `abusive` 
- (2) according to the degree to which it spreads hate and the tone its author uses, it can be `hateful`, `offensive` or `disrespectful`
- (3) if the tweet expresses or spreads fear out of ignorance against a group of individuals, it should be labeled as `fearful` 
- (4) otherwise it should be annotated as `normal`. 

**Target** whether the tweet insults or discriminates against people based on their (1) `origin`, (2) `religious affiliation`, (3) `gender`, (4) `sexual orientation`, (5) `special needs` or (6) `other`

**Target group** We determined 16 common target groups tagged by the annotators after the first annotation step. The annotators had to decide on whether the tweet is aimed at women, people of `African descent`, `Hispanics`, `gay people`, `Asians`, `Arabs`, `immigrants in general`, `refugees`; people of different religious affiliations such as `Hindu`, `Christian`, `Jewish` people, and `Muslims`; or from political ideologies `socialists`, and others. We also provided the annotators with a category to cover hate directed towards one `individual`, which cannot be generalized. In case the tweet targets more than one group of people, the annotators should choose the group which would be the most affected by it according to them.

**Sentiment of the annotator** We claim that the choice of a suitable emotion representation model is key to this sub-task, given the subjective nature and social ground of the annotator’s sentiment analysis. After collecting the annotation results of the pilot dataset regarding how people feel about the tweets, and observing the added categories, we adopted a range of sentiments that are in the negative and neutral scales of the hourglass of emotions introduced by Cambria et al. (2011). This model includes sentiments that are connected to objectively assessed natural language opinions, and excludes what is known as self-conscious or moral emotions such as shame and guilt. Our labels include shock, sadness, disgust, anger, fear, confusion in case of ambivalence, and indifference. This is the second multilabel task of our model.

## 1.1. Read Data

In [None]:
from datasets import load_dataset

In [None]:
dataset = load_dataset("nedjmaou/MLMA_hate_speech")
pd_dataset = dataset['train'].to_pandas()
pd_dataset

In [None]:
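# Row boundaries are hard-coded; this assumes the rows are ordered
# Arabic, then English, then French.
print(f"Arabic total: {len(pd_dataset[:3353])}")
print(f"English total: {len(pd_dataset[3353:14647])}")
print(f"French total: {len(pd_dataset[14647:])}")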
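
The positional slices above can also be turned into an explicit language column, which makes later filtering easier. This is a minimal sketch on a toy frame: the `add_language_column` helper, the `lang` column name, and the toy boundaries are illustrative assumptions, not part of the dataset. For the real data the boundaries would be 3353 and 14647, still assuming the rows are ordered Arabic, English, French.

```python
import numpy as np
import pandas as pd


def add_language_column(df, boundaries, labels):
    """Return a copy of df with a `lang` column assigned by row position.

    boundaries: sorted row positions at which a new language block starts.
    labels: one label per block, so len(labels) == len(boundaries) + 1.
    """
    if len(labels) != len(boundaries) + 1:
        raise ValueError("need exactly one more label than boundaries")
    positions = np.arange(len(df))
    # side="right" sends a position equal to a boundary to the block
    # that starts at that boundary, matching df[:b] / df[b:] slicing.
    block_ids = np.searchsorted(boundaries, positions, side="right")
    out = df.copy()
    out["lang"] = np.asarray(labels)[block_ids]
    return out


# Toy data: 2 "Arabic" rows, 3 "English" rows, 1 "French" row.
toy = pd.DataFrame({"tweet": [f"tweet {i}" for i in range(6)]})
tagged = add_language_column(toy, boundaries=[2, 5], labels=["ar", "en", "fr"])
print(tagged["lang"].value_counts())
```

With such a column in place, per-language subsets can be selected with a boolean mask rather than by remembering row offsets.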

## 1.2. Label Stats

In [None]:
sentiments = pd_dataset.sentiment.apply(lambda x: x.split('_')).to_list()
sentiments_unique = {x for l in sentiments for x in l}

print(f"Unique `Sentiments`\n\n{sentiments_unique}")

In [None]:
print(f"Unique `Directness`\n\n{list(pd_dataset.directness.unique())}")

In [None]:
print(f"Unique `Target`\n\n{list(pd_dataset.target.unique())}")

In [None]:
print(f"Unique `Group`\n\n{list(pd_dataset.group.unique())}")

In [None]:
annotator_sentiments = pd_dataset.annotator_sentiment.apply(lambda x: x.split('_')).to_list()
annotator_sentiments_unique = {x for l in annotator_sentiments for x in l}

print(f"Unique `Annotator Sentiments`\n\n{annotator_sentiments_unique}")
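
Beyond the set of unique values, per-label frequencies and a multi-hot encoding are often what a multilabel column is needed for. The sketch below uses made-up toy rows and assumes, as in the cells above, that labels are joined with `_`. It would not work for columns whose label names themselves contain an underscore.

```python
import pandas as pd

# Toy rows; the values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "sentiment": [
        "offensive",
        "hateful_offensive",
        "normal",
        "offensive_disrespectful",
    ],
})

# One row per (tweet, label): split on "_" and explode the resulting lists.
label_counts = toy["sentiment"].str.split("_").explode().value_counts()
print(label_counts)

# Multi-hot encoding: one 0/1 column per label.
multi_hot = toy["sentiment"].str.get_dummies(sep="_")
print(multi_hot)
```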

## 1.3. Comments
Might be useful for us, and it is multilingual.

# 2. ~CyberAgressionAdo-v1~

- Paper: https://hal.archives-ouvertes.fr/hal-03765860/document
- Dataset Link: https://github.com/aollagnier/CyberAgressionAdo-v1
---

Participant role. Values (6): `victim, bully, victim_support, bully_support, mediator, conciliator`

Hate-speech: Values (2): `yes, no`

Type of verbal abuse. Values (5): `blaming, name_calling, threat, denigradation, other-aggression`

Target. Values (6): `victim, bully, victim_support, bully_support, mediator, conciliator`

Humor. Values (2): `yes, no`

## 2.1. Comments

After a brief investigation, it doesn't seem promising: the files are poorly labelled.

# 3. HateSpeechMulti

https://www.kaggle.com/datasets/ludovick/hatespeechmulti

In [None]:
df = pd.read_csv('./datasets/cleaned_data_hatespeech.csv')
df.head(3)

In [None]:
print("Language distribution")
df.lang.value_counts()

In [None]:
print("Toxic distribution of French Dataset")
df[df.lang=='fr'].toxic.value_counts()
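
To compare all languages at once instead of filtering one at a time, a cross-tabulation can help. This is a sketch on toy data; it reuses the `lang` and `toxic` column names from the cells above and assumes `toxic` is a 0/1 flag, which should be checked against the real file.

```python
import pandas as pd

# Toy rows; the values are illustrative only.
toy = pd.DataFrame({
    "lang": ["fr", "fr", "fr", "en", "en", "es"],
    "toxic": [1, 0, 0, 1, 1, 0],
})

# Rows: language, columns: toxic flag, cells: number of examples.
counts = pd.crosstab(toy["lang"], toy["toxic"])
print(counts)

# Share of toxic examples per language (mean of a 0/1 column).
toxic_rate = toy.groupby("lang")["toxic"].mean()
print(toxic_rate)
```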

# 4. CONAN

https://github.com/marcoguerini/CONAN

---
CONAN is a multilingual and expert-based dataset of hate speech/counter-narrative pairs in English, French and Italian, focused on Islamophobia.

**Dataset description**
The dataset consists of 4,078 pairs over the 3 languages. Together with the data we also provide 3 types of metadata: expert demographics, hate speech sub-topic and counter-narrative type. The dataset is augmented through translation (from Italian/French to English) and paraphrasing, which brought the total number of pairs to 14,988 (*).

(*) The original number was 15,024, but after post-hoc analysis we deleted 9 original pairs (36 pairs including augmented ones) because they did not meet the required standard.

In [None]:
df = pd.read_csv('./datasets/CONAN.csv')
df['lang'] = df.cn_id.apply(lambda x: x[:2])
df.lang.value_counts()

In [None]:
df_fr = df[df.lang == 'FR']
df_fr.head()

In [None]:
df_fr.hsType.value_counts()

In [None]:
df_fr.hsSubType.value_counts()

# 5. CyberBullying Classification (EN)

https://www.kaggle.com/datasets/andrewmvd/cyberbullying-classification

---
This dataset contains more than 47,000 tweets labelled according to the class of cyberbullying:

* Age;
* Ethnicity;
* Gender;
* Religion;
* Other type of cyberbullying;
* Not cyberbullying


In [None]:
df = pd.read_csv('./datasets/cyberbullying_tweets.csv')
df.head(3)

In [None]:
df.cyberbullying_type.value_counts()

In [None]:
df[df.cyberbullying_type=='gender'].tweet_text.to_list()[0:5]
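
To eyeball a few examples from every class rather than only `gender`, the same idea can be applied per group. This is a sketch on toy rows with placeholder texts; the column names `cyberbullying_type` and `tweet_text` are the ones used in the cells above, and `n_examples` is an illustrative parameter.

```python
import pandas as pd

# Toy rows with placeholder texts; not taken from the dataset.
toy = pd.DataFrame({
    "tweet_text": ["text a", "text b", "text c", "text d", "text e", "text f"],
    "cyberbullying_type": [
        "age", "gender", "gender", "religion", "not_cyberbullying", "gender",
    ],
})

n_examples = 2  # how many tweets to show per class

# For each class, keep the first n_examples tweets as a plain list.
examples = (
    toy.groupby("cyberbullying_type")["tweet_text"]
    .apply(lambda s: s.head(n_examples).tolist())
    .to_dict()
)

for label, texts in examples.items():
    print(label, texts)
```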

# 6. Jigsaw (EN)

The primary data for the competition is, in each provided file, the `comment_text` column. This contains the text of a comment which has been classified as toxic or non-toxic (0…1 in the `toxic` column). The train set's comments are entirely in English and come either from Civil Comments or Wikipedia talk page edits. The test data's `comment_text` columns contain comments in multiple non-English languages.

https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification