# Variationist Use Case 2: Human Subjectivity in Hate Speech Annotation

**Welcome! This use case example walks through an analysis carried out with Variationist.**

In [1]:
# Install using pip
!pip install variationist

Collecting variationist
  Downloading variationist-0.1.4-py3-none-any.whl (73 kB)
Collecting altair==5.2.0 (from variationist)
  Downloading altair-5.2.0-py3-none-any.whl (996 kB)
Collecting datasets==2.17.1 (from variationist)
  Downloading datasets-2.17.1-py3-none-any.whl (536 kB)
Collecting emoji==2.10.1 (from variationist)
  Downloading emoji-2.10.1-py2.py3-none-any.whl (421 kB)
Collecting geopandas==0.14.3 (from variationist)
  Downloading geopandas-0.14.3-py3-none-any.whl (1.

In [2]:
# if your dataset has a large number of examples, you might need this to display large charts
!pip install "vegafusion[embed]>=1.4.0"
import altair as alt
alt.data_transformers.enable("vegafusion")

Collecting vegafusion[embed]>=1.4.0
  Downloading vegafusion-1.6.9-py3-none-any.whl (54 kB)
Collecting vegafusion-python-embed==1.6.9 (from vegafusion[embed]>=1.4.0)
  Downloading vegafusion_python_embed-1.6.9-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25.1 MB)
Installing collected packages: vegafusion-python-embed, vegafusion
Successfully installed vegafusion-1.6.9 vegafusion-python-embed-1.6.9


DataTransformerRegistry.enable('vegafusion')

In [3]:
from variationist import Inspector, InspectorArgs, Visualizer, VisualizerArgs

For this experiment, we use the Measuring Hate Speech Corpus [(Sachdeva et al., 2022)](https://aclanthology.org/2022.nlperspectives-1.11/). It contains annotations for hate speech, including the targets of hate and some demographic characteristics of the annotators.

We load the dataset using HuggingFace datasets, convert boolean columns related to annotators' sexual orientation to a single string column, and filter the data to retain hate speech posts only.
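
The column-collapsing step described above can be tried in isolation first. The following is a minimal sketch on a made-up toy DataFrame (the rows are invented for illustration and are not taken from the corpus); only the column-name pattern mirrors the dataset.

```python
import pandas as pd

# Toy stand-in for the corpus: one row per annotation, with one-hot
# boolean columns for the annotator's sexual orientation (invented data).
toy = pd.DataFrame({
    "annotator_sexuality_bisexual": [False, True, False, False],
    "annotator_sexuality_gay": [True, False, False, False],
    "annotator_sexuality_straight": [False, False, True, False],
})

prefix = "annotator_sexuality_"

# idxmax(axis=1) returns, per row, the name of the first column holding
# the row maximum (True counts as the maximum for boolean columns).
# Slicing off the prefix leaves only the category name.
toy["annotator_sexuality"] = toy.idxmax(axis=1).str[len(prefix):]

print(toy["annotator_sexuality"].tolist())
# ['gay', 'bisexual', 'straight', 'bisexual']
```

Note the last row: when no flag is set, `idxmax` falls back to the first column, so such rows are silently assigned to the first category. Whether the corpus contains rows like this is not checked here.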

In [4]:
from datasets import load_dataset

dataset = load_dataset("ucberkeley-dlab/measuring-hate-speech")
pd_dataset = dataset["train"].to_pandas()

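# One-hot boolean columns describing the annotator's sexual orientation
sexuality = ["annotator_sexuality_bisexual", "annotator_sexuality_gay", "annotator_sexuality_straight", "annotator_sexuality_other"]
# Collapse them into one string column: idxmax picks the first True column per row
pd_dataset["annotator_sexuality"] = pd_dataset[sexuality].idxmax(axis=1)
# Drop the "annotator_sexuality_" prefix (20 characters)
pd_dataset["annotator_sexuality"] = pd_dataset["annotator_sexuality"].str[20:]

# Retain hate speech posts only by dropping rows where hatespeech is 1 or 0
pd_dataset = pd_dataset[pd_dataset.hatespeech != 1]
pd_dataset = pd_dataset[pd_dataset.hatespeech != 0]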


We then define the inspector arguments and run the inspector.

In [7]:
# Define the inspector arguments
inspector_args = InspectorArgs(
    text_names=["text"],
    var_names=["hatespeech", "annotator_sexuality"],
    var_types=["nominal", "nominal"],
    var_semantics=["general", "general"],
    metrics=["npw_relevance"],
    stopwords=True,
    language="en",
    lowercase=True,
    n_tokens=1,
)

In [8]:
# Create an inspector instance, run it, and get the results in json
results = Inspector(dataset=pd_dataset, args=inspector_args).inspect()

INFO: The metadata we will be using for the current analysis are:
{'text_names': ['text'], 'var_names': ['hatespeech', 'annotator_sexuality'], 'metrics': ['npw_relevance'], 'var_types': ['nominal', 'nominal'], 'var_semantics': ['general', 'general'], 'var_subsets': None, 'var_bins': [0, 0], 'tokenizer': 'whitespace', 'language': 'en', 'n_tokens': 1, 'n_cooc': 1, 'unique_cooc': False, 'cooc_window_size': 0, 'freq_cutoff': 3, 'stopwords': True, 'custom_stopwords': None, 'lowercase': True, 'ignore_null_var': False}
INFO: all column identifiers are treated as column names.
INFO: Tokenizing the text column...


100%|██████████| 46021/46021 [00:05<00:00, 8129.55it/s]


INFO: Splitting intersections of variables into subsets.
Subsets for text column 'text'...


100%|██████████| 4/4 [00:00<00:00, 24.87it/s]


INFO: Currently calculating metric: 'npw_relevance'


100%|██████████| 4/4 [00:00<00:00, 13.84it/s]


We define the visualizer arguments and run the visualizer.

In [9]:
# Define the visualizer arguments
visualizer_args = VisualizerArgs(
    output_folder="output",
    output_formats=["html", "png"],
)

# Create dynamic visualizations of the results
charts = Visualizer(input_json=results, args=visualizer_args).create()

Reading json data...
INFO: Creating a HeatmapChart object for metric "npw_relevance"...
INFO: Saving it to the filepath: "output/npw_relevance/HeatmapChart.html".
The dataset is too big to be serialized as PNG efficiently. Please use the interactive HTML.


We can select different tokens to see how annotators' personal biases are reflected in their annotations.
Let's filter by the lexical item "gay" and see whether annotators who identify as non-straight label texts containing this word differently from annotators who identify as straight.

In [10]:
charts["npw_relevance"]["HeatmapChart"]
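
As a rough cross-check alongside the interactive heatmap, raw counts per annotator group can be computed with pandas alone. The sketch below is not the `npw_relevance` metric and does not use the Variationist API; it only counts how many posts contain a token in each group, using the same lowercasing and whitespace tokenization reported in the inspector log. The toy rows are invented for illustration.

```python
import pandas as pd

# Invented toy data with the two columns used in the analysis above.
toy = pd.DataFrame({
    "text": [
        "a post mentioning gay people",
        "an unrelated post",
        "GAY used in uppercase",
        "another unrelated post",
    ],
    "annotator_sexuality": ["straight", "straight", "bisexual", "straight"],
})

token = "gay"

# Lowercase, split on whitespace, and test for exact token membership.
has_token = toy["text"].str.lower().str.split().apply(lambda tokens: token in tokens)

# Per annotator group: posts containing the token, and total posts.
summary = (
    toy.assign(has_token=has_token)
    .groupby("annotator_sexuality")["has_token"]
    .agg(n_with_token="sum", n_posts="count")
)
print(summary)
```

On the real data, `toy` would be replaced by `pd_dataset`. Raw counts like these ignore group size effects that a relevance metric accounts for, so they are best read as a sanity check rather than as a result.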

We can do the same for other demographic characteristics. We load the dataset again using HuggingFace datasets, convert the boolean columns for annotators' race into a single string column, and filter the data to retain hate speech posts only.
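
Since the same preprocessing is repeated for each demographic attribute, it could be factored into a small helper. The function below, `collapse_one_hot`, is a hypothetical name introduced here for illustration and is not part of Variationist or the dataset; it is a sketch that assumes every column starting with the given prefix is a boolean flag for that attribute.

```python
import pandas as pd


def collapse_one_hot(df: pd.DataFrame, prefix: str, new_column: str) -> pd.DataFrame:
    """Collapse boolean columns starting with `prefix` into one string column.

    Hypothetical helper: assumes all columns with this prefix are one-hot flags.
    """
    columns = [c for c in df.columns if c.startswith(prefix)]
    out = df.copy()
    # First True column per row, with the shared prefix stripped off.
    out[new_column] = out[columns].idxmax(axis=1).str[len(prefix):]
    return out


# Invented toy data mirroring the column-name pattern of the corpus.
toy = pd.DataFrame({
    "annotator_race_asian": [True, False],
    "annotator_race_black": [False, True],
    "hatespeech": [2.0, 0.0],
})

collapsed = collapse_one_hot(toy, "annotator_race_", "annotator_race")

# Same filter as in the notebook: drop rows where hatespeech is 0 or 1.
hate_only = collapsed[~collapsed["hatespeech"].isin([0, 1])]
print(hate_only[["annotator_race", "hatespeech"]])
```

The notebook lists the columns explicitly instead of matching on a prefix, which is the safer choice if the dataset has other columns sharing that prefix.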

In [11]:
dataset = load_dataset("ucberkeley-dlab/measuring-hate-speech")
pd_dataset = dataset["train"].to_pandas()

# One-hot boolean columns describing the annotator's race
races = ["annotator_race_asian", "annotator_race_black", "annotator_race_latinx", "annotator_race_middle_eastern", "annotator_race_native_american", "annotator_race_pacific_islander", "annotator_race_white", "annotator_race_other"]
# Collapse them into one string column: idxmax picks the first True column per row
pd_dataset["annotator_race"] = pd_dataset[races].idxmax(axis=1)
# Drop the "annotator_race_" prefix (15 characters)
pd_dataset["annotator_race"] = pd_dataset["annotator_race"].str[15:]

# Retain hate speech posts only by dropping rows where hatespeech is 1 or 0
pd_dataset = pd_dataset[pd_dataset.hatespeech != 1]
pd_dataset = pd_dataset[pd_dataset.hatespeech != 0]

We then define the inspector arguments and run the inspector.

In [12]:
# Define the inspector arguments
inspector_args = InspectorArgs(
    text_names=["text"],
    var_names=["hatespeech", "annotator_race"],
    var_types=["nominal", "nominal"],
    var_semantics=["general", "general"],
    metrics=["npw_relevance"],
    stopwords=True,
    language="en",
    lowercase=True,
    n_tokens=1,
)

In [13]:
# Create an inspector instance, run it, and get the results in json
results = Inspector(dataset=pd_dataset, args=inspector_args).inspect()

INFO: The metadata we will be using for the current analysis are:
{'text_names': ['text'], 'var_names': ['hatespeech', 'annotator_race'], 'metrics': ['npw_relevance'], 'var_types': ['nominal', 'nominal'], 'var_semantics': ['general', 'general'], 'var_subsets': None, 'var_bins': [0, 0], 'tokenizer': 'whitespace', 'language': 'en', 'n_tokens': 1, 'n_cooc': 1, 'unique_cooc': False, 'cooc_window_size': 0, 'freq_cutoff': 3, 'stopwords': True, 'custom_stopwords': None, 'lowercase': True, 'ignore_null_var': False}
INFO: all column identifiers are treated as column names.
INFO: Tokenizing the text column...


100%|██████████| 46021/46021 [00:05<00:00, 8605.21it/s] 


INFO: Splitting intersections of variables into subsets.
Subsets for text column 'text'...


100%|██████████| 8/8 [00:00<00:00, 44.01it/s]


INFO: Currently calculating metric: 'npw_relevance'


100%|██████████| 8/8 [00:00<00:00, 39.97it/s]


We define the visualizer arguments and run the visualizer.

In [14]:
# Define the visualizer arguments
visualizer_args = VisualizerArgs(
    output_folder="output",
    output_formats=["html", "png"],
)

# Create dynamic visualizations of the results
charts = Visualizer(input_json=results, args=visualizer_args).create()

Reading json data...
INFO: Creating a HeatmapChart object for metric "npw_relevance"...
INFO: Saving it to the filepath: "output/npw_relevance/HeatmapChart.html".
The dataset is too big to be serialized as PNG efficiently. Please use the interactive HTML.


Let's filter by the lexical item "n*ggas" and see whether annotators who identify as black label texts containing this word differently from annotators who identify as non-black.

In [15]:
charts["npw_relevance"]["HeatmapChart"]

In these examples, we show how lexical items may be more or less informative for certain labels depending on the sociodemographics of the annotators. We hope Variationist can help speed up the exploration of undesired associations across combinations of attributes in language data.