# Variationist Use Case 3: Grasping Features of Human vs Generated Texts

**Welcome! This use case example shows how to use Variationist to compare human-written and machine-generated texts.**

In [1]:
# Install using pip
!pip install variationist



In [2]:
# If your dataset has a large number of examples, you might need VegaFusion to display large charts
!pip install "vegafusion[embed]>=1.4.0"
import altair as alt
alt.data_transformers.enable("vegafusion")



DataTransformerRegistry.enable('vegafusion')

In [3]:
from variationist import Inspector, InspectorArgs, Visualizer, VisualizerArgs

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


For this experiment, we use the `Hello-SimpleAI/HC3` dataset from the HuggingFace hub. It contains human-written texts and ChatGPT-generated texts produced from the same prompts, categorized into 5 different sources (domains). Let's take a look at the differences between human and synthetic texts across sources.

In [20]:
my_dataset = "hf::hannxu/hc_var::train"
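The dataset string above packs three pieces of information separated by `::`. As a rough illustration only, the sketch below splits such a string into its parts. The helper `parse_hf_dataset_string` is hypothetical and is not part of Variationist; it assumes the three-part form `hf::<dataset_name>::<split>` used in this notebook.

```python
def parse_hf_dataset_string(spec: str) -> dict:
    """Split an 'hf::<dataset_name>::<split>' string into its parts.

    Hypothetical helper for illustration; Variationist does its own parsing.
    """
    parts = spec.split("::")
    if len(parts) != 3 or parts[0] != "hf":
        raise ValueError(f"Expected 'hf::<dataset_name>::<split>', got {spec!r}")
    _, name, split = parts
    return {"source": "hf", "name": name, "split": split}


parsed = parse_hf_dataset_string("hf::hannxu/hc_var::train")
print(parsed)
```

The last element is read as the split, which matches the log message Variationist prints when loading the dataset.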

In [46]:
# Define the inspector arguments. For now, let's just select a handful of metrics.
inspector_args = InspectorArgs(
    text_names=["text"],
    var_names=["label"],
    metrics=["pmi", "stats", "npw_relevance", "root_ttr", "freq"],
    n_tokens=1,
    lowercase=True,
    stopwords=True,
    language="en",
)
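To give an intuition for what the `pmi` metric measures, here is a minimal, self-contained sketch of pointwise mutual information between tokens and labels. It is not Variationist's implementation: it assumes lowercased whitespace tokens, token-level counts, and a base-2 logarithm, and it skips stopword removal and frequency cutoffs. The function name `pmi_by_label` and the toy inputs are made up for illustration.

```python
import math
from collections import Counter


def pmi_by_label(texts, labels):
    """Return {label: {token: PMI(token, label)}} from token-level counts."""
    token_label = Counter()   # count of (token, label) pairs
    token_counts = Counter()  # count of each token overall
    label_counts = Counter()  # count of tokens seen under each label
    for text, label in zip(texts, labels):
        for tok in text.lower().split():
            token_label[(tok, label)] += 1
            token_counts[tok] += 1
            label_counts[label] += 1

    total = sum(token_counts.values())
    scores = {}
    for (tok, label), n in token_label.items():
        p_joint = n / total
        p_tok = token_counts[tok] / total
        p_label = label_counts[label] / total
        # PMI = log2( p(token, label) / (p(token) * p(label)) )
        scores.setdefault(label, {})[tok] = math.log2(p_joint / (p_tok * p_label))
    return scores


# Toy example (not taken from the dataset)
toy_scores = pmi_by_label(["the cat sat", "the dog ran"], [0, 1])
print(toy_scores[0])
```

In the toy example, a token that occurs under only one label gets a positive score, while a token spread evenly across both labels scores around zero.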

In [47]:
# Create an inspector instance, run it, and get the results as JSON
results = Inspector(dataset=my_dataset, args=inspector_args).inspect()

INFO: No values have been set for var_types. Defaults to nominal.
INFO: No values have been set for var_semantics. Defaults to general.
INFO: The metadata we will be using for the current analysis are:
{'text_names': ['text'], 'var_names': ['label'], 'metrics': ['pmi', 'stats', 'npw_relevance', 'root_ttr', 'freq'], 'var_types': ['nominal'], 'var_semantics': ['general'], 'var_subsets': None, 'var_bins': [0], 'tokenizer': 'whitespace', 'language': 'en', 'n_tokens': 1, 'n_cooc': 1, 'unique_cooc': False, 'cooc_window_size': 0, 'freq_cutoff': 3, 'stopwords': True, 'custom_stopwords': None, 'lowercase': True, 'ignore_null_var': False}
INFO: all column identifiers are treated as column names.
INFO: 'Loading hf::hannxu/hc_var::train' as a HuggingFace dataset. We assume the last element in the specified string is the split ("train").
INFO: Tokenizing the text column...


100%|██████████| 144069/144069 [00:42<00:00, 3384.66it/s]


INFO: Currently calculating metric: 'pmi'


100%|██████████| 2/2 [00:03<00:00,  1.82s/it]


INFO: Currently calculating metric: 'stats'
INFO: Currently calculating metric: 'npw_relevance'


100%|██████████| 2/2 [00:02<00:00,  1.49s/it]


INFO: Currently calculating metric: 'root_ttr'


100%|██████████| 2/2 [00:01<00:00,  1.71it/s]


INFO: Currently calculating metric: 'freq'


100%|██████████| 2/2 [00:03<00:00,  1.50s/it]


In [48]:
# Define the visualizer arguments
visualizer_args = VisualizerArgs(
    output_folder="output", zoomable=True, ngrams=None, output_formats=["html", "png"])

# Create dynamic visualizations of the results
charts = Visualizer(input_json=results, args=visualizer_args).create()

Reading json data...
INFO: Creating a BarChart object for metric "pmi"...
INFO: Saving it to the filepath: "output/pmi/BarChart.html".
The dataset is too big to be serialized as PNG efficiently. Please use the interactive HTML.
INFO: Creating a BarChart object for metric "stats"...
INFO: Saving it to the filepath: "output/stats/StatsBarChart.html".
The dataset is too big to be serialized as PNG efficiently. Please use the interactive HTML.
INFO: Creating a BarChart object for metric "npw_relevance"...
INFO: Saving it to the filepath: "output/npw_relevance/BarChart.html".
The dataset is too big to be serialized as PNG efficiently. Please use the interactive HTML.
INFO: Creating a BarChart object for metric "root_ttr"...
INFO: Saving it to the filepath: "output/root_ttr/DiversityBarChart.html".
The dataset is too big to be serialized as PNG efficiently. Please use the interactive HTML.
INFO: Creating a BarChart object for metric "freq"...
INFO: Saving it to the filepath: "output/freq/Bar

In [49]:
charts["stats"]["BarChart"]

On average, human-written texts tend to be around 100 tokens longer than their generated counterparts.
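A length comparison of this kind can be approximated by hand. The sketch below counts whitespace tokens per text and reports the mean and (population) standard deviation per label. It is only an assumption about how such statistics might be computed, not Variationist's own code, and the texts are toy data rather than dataset examples.

```python
from collections import defaultdict
from statistics import mean, pstdev


def length_stats(texts, labels):
    """Return {label: {"mean": ..., "stdev": ...}} of whitespace token counts."""
    lengths = defaultdict(list)
    for text, label in zip(texts, labels):
        lengths[label].append(len(text.split()))
    return {
        label: {"mean": mean(values), "stdev": pstdev(values)}
        for label, values in lengths.items()
    }


# Toy example (not taken from the dataset): label 0 = human, label 1 = generated
toy_stats = length_stats(["a b c d", "a b", "a b c"], [0, 0, 1])
print(toy_stats)
```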

In [50]:
charts["root_ttr"]["BarChart"]

Human-written texts (label 0) appear to be slightly more lexically varied than the GPT-generated ones, and their scores also show a higher standard deviation.
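For reference, root TTR (also known as Guiraud's index) divides the number of distinct tokens by the square root of the total number of tokens, which dampens the effect of text length compared with a plain type-token ratio. The sketch below is a minimal version assuming lowercased whitespace tokens; Variationist's own preprocessing (for example stopword removal) may give different values, and the helper name `root_ttr` is only illustrative.

```python
import math


def root_ttr(text: str) -> float:
    """Root type-token ratio: distinct tokens / sqrt(total tokens)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0  # avoid dividing by zero on empty texts
    return len(set(tokens)) / math.sqrt(len(tokens))


# Toy examples (not taken from the dataset)
repetitive = root_ttr("a b a b")  # 2 types over 4 tokens
varied = root_ttr("a b c d")      # 4 types over 4 tokens
print(repetitive, varied)
```

Averaging this score over all texts that share a label, and taking the standard deviation, gives the kind of per-label summary shown in the chart.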

In [51]:
charts["npw_relevance"]["BarChart"]

The words most associated with human-written texts (on the left) appear to be simpler, shorter, and potentially more related to everyday life. The words most associated with generated texts (on the right) lean instead toward corporate language and the vocabulary of news articles.