# Variationist Use Case 3: Features of Human vs Generated Texts

**Welcome! This use case example shows how to carry out an analysis with Variationist.**

In [1]:
# Install using pip
!pip install variationist

Collecting variationist
  Downloading variationist-0.1.4-py3-none-any.whl (73 kB)
Collecting altair==5.2.0 (from variationist)
  Downloading altair-5.2.0-py3-none-any.whl (996 kB)
Collecting datasets==2.17.1 (from variationist)
  Downloading datasets-2.17.1-py3-none-any.whl (536 kB)
Collecting emoji==2.10.1 (from variationist)
  Downloading emoji-2.10.1-py2.py3-none-any.whl (421 kB)
Collecting geopandas==0.14.3 (from variationist)
  Downloading geopandas-0.14.3-py3-none-any.whl (1.1

In [4]:
# If your dataset has a large number of examples, you might need this to display large charts
!pip install "vegafusion[embed]>=1.4.0"
import altair as alt
alt.data_transformers.enable("vegafusion")



DataTransformerRegistry.enable('vegafusion')

In [5]:
from variationist import Inspector, InspectorArgs, Visualizer, VisualizerArgs

For this experiment, we use the `Hello-SimpleAI/HC3` dataset from the HuggingFace hub [(Guo et al., 2023)](https://arxiv.org/abs/2301.07597). It pairs human-written and ChatGPT-generated answers to the same questions, categorized into 5 different sources/domains. Let's take a look at the differences between human and synthetic texts.

In [7]:
# Define the dataset name, subset, and split of the dataset to load from HuggingFace datasets
my_dataset = "hf::Hello-SimpleAI/HC3::all::train"
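The dataset string packs several pieces of information into one value. As a minimal sketch of how such a string can be read (the helper `parse_hf_dataset_string` and the `HFDatasetSpec` container are hypothetical names introduced here for illustration, not part of Variationist), we assume the layout reported in the inspector log further down: the third element is the subset and the last one is the split.

```python
from typing import NamedTuple


class HFDatasetSpec(NamedTuple):
    """Hypothetical container for the parts of an 'hf::' dataset string."""
    name: str
    subset: str
    split: str


def parse_hf_dataset_string(spec: str) -> HFDatasetSpec:
    """Split 'hf::<name>::<subset>::<split>' into its parts (illustration only)."""
    parts = spec.split("::")
    # Assumption: exactly four '::'-separated elements, the first being 'hf'.
    if len(parts) != 4 or parts[0] != "hf":
        raise ValueError(f"Unexpected dataset string: {spec!r}")
    _, name, subset, split = parts
    return HFDatasetSpec(name=name, subset=subset, split=split)


parsed = parse_hf_dataset_string("hf::Hello-SimpleAI/HC3::all::train")
print(parsed.name, parsed.subset, parsed.split)
```

Here the dataset is `Hello-SimpleAI/HC3`, the subset is `all`, and the split is `train`.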

We specify two text columns of interest: `human_answers` and `chatgpt_answers`. We set bigrams as units (`n_tokens=2`) with lowercase normalization, and enable English stopword removal, adding "url" and the digits 0 to 9 as extra unigrams to remove. We also select `stats`, `root_ttr`, and `npw_pmi` as metrics in order to analyze different aspects of the texts. The other parameters are left at their default values.

In [8]:
# Define the inspector arguments
inspector_args = InspectorArgs(
    text_names=["human_answers", "chatgpt_answers"],
    var_names=[],
    metrics=["stats", "root_ttr", "npw_pmi"],
    n_tokens=2,
    lowercase=True,
    stopwords=True,
    custom_stopwords=["url", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"],
    language="en"
)
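To make these settings more concrete, here is a rough sketch of what lowercasing, whitespace tokenization, stopword removal, and bigram creation might look like on a single sentence. This is not Variationist's own code: the function `to_bigrams` and the tiny `EXAMPLE_STOPWORDS` set are made up for illustration, and we assume that stopwords are dropped before the bigrams are built, which may differ from what the library does internally.

```python
from typing import List, Set, Tuple

# Tiny illustrative stopword set; a real English stopword list is much larger.
EXAMPLE_STOPWORDS: Set[str] = {"the", "a", "is", "at", "of", "to"}
# Mirrors the custom_stopwords argument above: "url" plus the digits 0-9.
CUSTOM_STOPWORDS: Set[str] = {"url"} | {str(d) for d in range(10)}


def to_bigrams(text: str, stopwords: Set[str]) -> List[Tuple[str, str]]:
    """Lowercase, split on whitespace, drop stopwords, then pair adjacent tokens."""
    tokens = text.lower().split()
    # Assumption: stopwords are removed before n-grams are created.
    kept = [tok for tok in tokens if tok not in stopwords]
    return list(zip(kept, kept[1:]))


example = "The answer is at URL 3 of the official Python docs"
bigrams = to_bigrams(example, EXAMPLE_STOPWORDS | CUSTOM_STOPWORDS)
print(bigrams)
```

With these toy lists, only "answer", "official", "python", and "docs" survive filtering, so the sentence yields three bigrams.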

In [9]:
# Create an inspector instance, run it, and get the results as JSON
results = Inspector(dataset=my_dataset, args=inspector_args).inspect()

INFO: No values have been set for var_types. Defaults to nominal.
INFO: No values have been set for var_semantics. Defaults to general.
INFO: The metadata we will be using for the current analysis are:
{'text_names': ['human_answers', 'chatgpt_answers'], 'var_names': [], 'metrics': ['stats', 'root_ttr', 'npw_pmi'], 'var_types': [], 'var_semantics': [], 'var_subsets': None, 'var_bins': [], 'tokenizer': 'whitespace', 'language': 'en', 'n_tokens': 2, 'n_cooc': 1, 'unique_cooc': False, 'cooc_window_size': 0, 'freq_cutoff': 3, 'stopwords': True, 'custom_stopwords': ['url', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], 'lowercase': True, 'ignore_null_var': False}
INFO: all column identifiers are treated as column names.
INFO: 'Loading hf::Hello-SimpleAI/HC3::all::train' as a HuggingFace dataset. We assume the third element in the specified string is the subset ("all") and the last is the split ("train").


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading data:   0%|          | 0.00/39.3M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/24322 [00:00<?, ? examples/s]

INFO: Tokenizing the human_answers column...


100%|██████████| 24322/24322 [00:21<00:00, 1137.44it/s]


INFO: Creating n-grams...


100%|██████████| 24322/24322 [00:01<00:00, 22622.01it/s]


INFO: Tokenizing the chatgpt_answers column...


100%|██████████| 24322/24322 [00:10<00:00, 2247.72it/s]


INFO: Creating n-grams...


100%|██████████| 24322/24322 [00:00<00:00, 33095.66it/s]


INFO: Splitting intersections of variables into subsets.
Subsets for text column 'human_answers'...


100%|██████████| 1/1 [00:00<00:00, 455.61it/s]


INFO: Splitting intersections of variables into subsets.
Subsets for text column 'chatgpt_answers'...


100%|██████████| 1/1 [00:00<00:00, 524.55it/s]


INFO: Currently calculating metric: 'stats'
INFO: Currently calculating metric: 'root_ttr'


100%|██████████| 2/2 [00:00<00:00,  4.16it/s]


INFO: Currently calculating metric: 'npw_pmi'


100%|██████████| 2/2 [00:04<00:00,  2.23s/it]


We then define the visualizer arguments and run the visualizer.

In [10]:
# Define the visualizer arguments
visualizer_args = VisualizerArgs(
    output_folder="output", output_formats=["html", "png"])

# Create dynamic visualizations of the results
charts = Visualizer(input_json=results, args=visualizer_args).create()

Reading json data...
INFO: Creating a BarChart object for metric "stats"...
INFO: Saving it to the filepath: "output/stats/StatsBarChart.html".
The dataset is too big to be serialized as PNG efficiently. Please use the interactive HTML.
INFO: Creating a BarChart object for metric "root_ttr"...
INFO: Saving it to the filepath: "output/root_ttr/DiversityBarChart.html".
The dataset is too big to be serialized as PNG efficiently. Please use the interactive HTML.
INFO: Creating a BarChart object for metric "npw_pmi"...
INFO: Saving it to the filepath: "output/npw_pmi/BarChart.html".
The dataset is too big to be serialized as PNG efficiently. Please use the interactive HTML.


We can now inspect the basic `stats` and see that human answers are generally longer and have a larger vocabulary than ChatGPT-generated ones.

In [11]:
charts["stats"]["BarChart"]

By looking at the `root_ttr`, we can see that human answers are on average more varied, and also exhibit a greater standard deviation than ChatGPT-generated ones.

In [12]:
charts["root_ttr"]["BarChart"]
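For intuition about what this diversity score measures, the sketch below computes a root type-token ratio in its usual textbook form: the number of distinct units divided by the square root of the total number of units. The function name `root_ttr` simply mirrors the metric name. We assume this standard definition here, and Variationist's implementation may handle details such as empty texts differently.

```python
import math
from typing import Hashable, Sequence


def root_ttr(units: Sequence[Hashable]) -> float:
    """Root type-token ratio: distinct units / sqrt(total units)."""
    if not units:
        # Assumption: an empty text gets a score of 0.0.
        return 0.0
    return len(set(units)) / math.sqrt(len(units))


varied = ["cat", "sat", "mat", "hat"]      # 4 distinct units out of 4
repetitive = ["cat", "cat", "cat", "cat"]  # 1 distinct unit out of 4

print(root_ttr(varied))      # 2.0
print(root_ttr(repetitive))  # 0.5
```

A text that keeps reusing the same units scores lower than an equally long text with more distinct units, which is the pattern the chart shows for ChatGPT-generated versus human answers.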

Lastly, from the `npw_pmi` we see that the distribution of bigrams is somewhat more balanced for human-authored texts, while ChatGPT appears to produce texts that include very specific bigrams with a much higher frequency. This might be a consequence of the difference in lexical variety between ChatGPT and human-authored texts.

In [14]:
charts["npw_pmi"]["BarChart"]
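To give a feeling for how a PMI-style score singles out characteristic bigrams, here is a simplified sketch. It computes a positive, normalized PMI between each bigram and the group of texts it occurs in, treating the text column as the group label. This is a stand-in rather than the `npw_pmi` formula itself: the weighting step is left out, the function `positive_npmi` is a hypothetical helper, and the toy bigrams are invented for the example.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Bigram = Tuple[str, str]


def positive_npmi(bigrams_by_label: Dict[str, List[Bigram]]) -> Dict[str, Dict[Bigram, float]]:
    """Positive normalized PMI between each bigram and each label (simplified)."""
    joint: Counter = Counter()
    for label, bigrams in bigrams_by_label.items():
        for bigram in bigrams:
            joint[(bigram, label)] += 1
    total = sum(joint.values())

    bigram_counts: Counter = Counter()
    label_counts: Counter = Counter()
    for (bigram, label), count in joint.items():
        bigram_counts[bigram] += count
        label_counts[label] += count

    scores: Dict[str, Dict[Bigram, float]] = {label: {} for label in bigrams_by_label}
    for (bigram, label), count in joint.items():
        p_xy = count / total
        p_x = bigram_counts[bigram] / total
        p_y = label_counts[label] / total
        pmi = math.log(p_xy / (p_x * p_y))
        norm = -math.log(p_xy)  # normalization term; zero only if p_xy == 1
        npmi = pmi / norm if norm > 0 else 0.0
        scores[label][bigram] = max(0.0, npmi)  # keep only positive associations
    return scores


# Invented toy data: one list of bigrams per text column.
toy = {
    "human_answers": [("i", "think"), ("not", "sure"), ("i", "think")],
    "chatgpt_answers": [("important", "note"), ("important", "note"), ("i", "think")],
}
scores = positive_npmi(toy)
for label, per_bigram in scores.items():
    print(label, sorted(per_bigram.items(), key=lambda kv: -kv[1]))
```

In this toy data, `("important", "note")` only occurs in the ChatGPT column and gets the highest score there, while `("i", "think")` appears in both columns and is scored 0 for ChatGPT because it is relatively more frequent in the human column.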