# Entropy Labs - Human versus AI text

The Shannon information content and entropy measures lead us to a natural question: what can we use them for? One possibility is to distinguish between regimes: for instance, different price regimes in an electricity price time series, or different seasonal patterns in a windmill production record.

One other example, which will be the topic of this notebook, is to differentiate different types of texts: Could we use these metrics to differentiate human-generated versus AI-generated text? I do not know, but let's try to find this out!

Let us start by importing the necessary libraries:

In [None]:
import re
import math
from collections import Counter
from typing import List, Dict, Tuple

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
 
from entropy_lab.measures.shannon import shannon_information
from entropy_lab.measures.entropy import compute_entropy

## Text Preprocessing and tokenization

There are different ways to tokenize a text; here we write a simple function of our own. It takes a text as a string, lowercases it, replaces every number with a `<num>` placeholder, and returns a list of tokens in which words and punctuation marks are separate items:

In [None]:
import re
import unicodedata
from typing import List

def preprocess_text(text: str) -> List[str]:
    """
    Improved tokenizer for stylometric / entropy analysis.

    What it does:
    - Unicode normalization (NFKC), then curly quotes and long dashes mapped to ASCII
    - Lowercase
    - Replace numbers with <num> (including decimals / percentages)
    - Keep words (including apostrophes and hyphens)
    - Keep punctuation as separate tokens (useful for style)
    - Keeps accented characters (French/German/Italian friendly)

    Examples:
    "Don't pay 12.5% now!" -> ["don't", "pay", "<num>", "%", "now", "!"]
    "state-of-the-art"      -> ["state-of-the-art"]
    """
    # Normalize Unicode compatibility forms (e.g. ligatures, full-width characters)
    text = unicodedata.normalize("NFKC", text)

    # Lowercase
    text = text.lower()

    # Standardize some common punctuation variants
    text = text.replace("“", '"').replace("”", '"')
    text = text.replace("‘", "'").replace("’", "'")
    text = text.replace("—", "-").replace("–", "-")

    # Replace numbers (integers, decimals, commas) with <num>
    # Examples: 12, 12.5, 1,000, 3.1415
    text = re.sub(r"\b\d+(?:[.,]\d+)*\b", " <num> ", text)

    # Token pattern:
    # - <num>
    # - words with internal apostrophes/hyphens (e.g., don't, state-of-the-art)
    # - punctuation as separate tokens
    token_pattern = r"""
        <num>
        |
        [^\W\d_]+(?:[-'][^\W\d_]+)*   # Unicode letters, optional internal - or '
        |
        [.,!?;:%()"'/\-]              # punctuation tokens kept separately
    """

    tokens = re.findall(token_pattern, text, flags=re.VERBOSE | re.UNICODE)
    return tokens

In [None]:
example_text = "A car is driving at 25 kmh! What a car!"
example_tokens = preprocess_text(example_text)
example_tokens

## Human Reference Distribution with Smoothing

To compare one text against another, we need a reference, i.e., a baseline. This reference distribution defines what "normal" word usage looks like. For a fair comparison, we score both the human text and the AI-generated text against the same human reference.

Smoothing assigns a nonzero probability to words that never occur in the reference corpus. Without it, an unseen word would have probability zero, and its Shannon information content, $-\log p$, would be undefined (infinite). We use additive (Laplace) smoothing:

$$p\left(\mathrm{word}\right) = \frac{\mathrm{count}\left(\mathrm{word}\right) + \alpha}{N + \alpha\left(V+1\right)}$$

with 
* $N$: Total tokens in reference corpus
* $V$: Number of unique words in reference corpus
* $\alpha$: Smoothing strength
* $+1$: One extra "unknown word" bucket

We can compute this using the following code:

In [None]:
class ReferenceLanguageModel:
    """
    Simple reference model: p(word) estimated from a reference corpus (human text).

    We use Laplace smoothing:
        p(w) = (count(w) + alpha) / (N + alpha*(V+1))
    
    where:
        N = total tokens in reference corpus
        V = vocabulary size in reference corpus
        +1 = reserve one "unknown bucket"    
    """

    def __init__(self, reference_tokens: List[str], alpha: float = 1.0):
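        if len(reference_tokens) == 0:
            raise ValueError("Reference token list is empty.")
        # alpha must be positive, otherwise unseen tokens get probability 0
        if alpha <= 0:
            raise ValueError("alpha must be positive.")
        self.alpha = float(alpha)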
        self.counts = Counter(reference_tokens)
        self.N = sum(self.counts.values())
        self.V = len(self.counts)

    def p(self, token: str) -> float:
        count_w = self.counts.get(token, 0)
        numerator = count_w + self.alpha
        denominator = self.N + self.alpha * (self.V + 1)
        return numerator / denominator


In [None]:
reference = ReferenceLanguageModel(example_tokens)
reference.alpha, reference.counts, reference.N, reference.V, reference.p('car')
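
To make the smoothing formula concrete, here is a minimal, self-contained sketch that redoes the arithmetic by hand. The token list is written out manually (it is what the tokenizer is expected to return for the example sentence), and the helper `p` and the unseen word `"bicycle"` are illustrative assumptions, not part of the notebook's API. It checks that the smoothed probabilities of all known words plus the single unknown bucket sum to one.

```python
from collections import Counter

# Hand-written token list for "A car is driving at 25 kmh! What a car!"
tokens = ["a", "car", "is", "driving", "at", "<num>", "kmh", "!",
          "what", "a", "car", "!"]

counts = Counter(tokens)
N = sum(counts.values())   # total number of tokens
V = len(counts)            # number of unique tokens
alpha = 1.0                # smoothing strength
denominator = N + alpha * (V + 1)


def p(token: str) -> float:
    """Smoothed probability: (count + alpha) / (N + alpha * (V + 1))."""
    return (counts.get(token, 0) + alpha) / denominator


p_car = p("car")           # seen twice in the reference
p_unknown = p("bicycle")   # never seen: falls into the unknown bucket

# Known words plus ONE unknown bucket form a proper distribution
total = sum(p(w) for w in counts) + p_unknown

print(f"N={N}, V={V}")
print(f"p(car)={p_car:.4f}, p(unknown)={p_unknown:.4f}, total={total:.4f}")
```

Note that every unseen word receives the same probability, $\alpha / (N + \alpha(V+1))$. The distribution sums to one only if all unseen words are treated as one shared bucket, which is the role of the $+1$ in the denominator.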

## Surprisal of a Text

We now have a way to compute the reference distribution for a given text. Can we also compute the Shannon information content (also known as surprisal) of each token in a text? With the custom-made `shannon_information` function, this takes only a few lines:

In [None]:
from entropy_lab.measures.shannon import shannon_information

def score_text_with_reference(
        text: str,
        ref_model: ReferenceLanguageModel,
        base: float = 2.0
) -> Tuple[List[str], np.ndarray, pd.DataFrame]:
    """
    For each token in 'text':
        - compute p_ref(token)
        - compute surprisal h(token) = -log_base(p_ref(token))

    Returns
    -------
    tokens: list[str]
    surprisals: np.ndarray
    token_df: pd.DataFrame with columns [token, p_ref, surprisal_bits]
    """
    tokens = preprocess_text(text)
    rows = []
    surprisals = []
    for tok in tokens:
        p_ref = ref_model.p(tok)
        h = shannon_information(p_ref, base=base)
        surprisals.append(h)
        rows.append({
            "token": tok,
            "p_ref": p_ref,
            "surprisal_bits": h
        })
    token_df = pd.DataFrame(rows)
    return tokens, np.array(surprisals, dtype=float), token_df

To demonstrate this function, let us use a slightly longer text:

In [None]:
ww2_human_text = "The causes of World War II included unresolved tensions in the aftermath of World War I, and the rise of fascism in Europe and militarism in Japan. Key events preceding the war included Japan's invasion of Manchuria in 1931, the Spanish Civil War, the outbreak of the Second Sino-Japanese War in 1937, and Germany's annexations of Austria and the Sudetenland. World War II is generally considered to have begun on 1 September 1939, when Nazi Germany, under Adolf Hitler, invaded Poland, after which the United Kingdom and France declared war on Germany. Poland was also invaded by the Soviet Union in mid-September, and was partitioned between Germany and the Soviet Union under the Molotov–Ribbentrop Pact. In 1940, the Soviet Union annexed the Baltic states and parts of Finland and Romania, while Germany conquered Norway, Belgium, Luxembourg, and the Netherlands. After the fall of France in June 1940, the war continued mainly between Germany, now assisted by Fascist Italy, and the British Empire/British Commonwealth, with fighting in the Balkans, Mediterranean, and Middle East, East Africa, the aerial Battle of Britain and the Blitz, and the naval Battle of the Atlantic. By mid-1941, Yugoslavia and Greece had also been defeated by Axis countries. In June 1941, Germany invaded the Soviet Union, opening the Eastern Front and initially making large territorial gains along with Axis allies."

ww2_ai_text = "World War II emerged from a 'perfect storm' of geopolitical resentment and economic instability that had been brewing since the end of the Great War. The Treaty of Versailles left Germany economically crippled and deeply humiliated, creating a power vacuum that Adolf Hitler filled with promises of national restoration and territorial expansion. This instability was exacerbated by the Great Depression, which destabilized global markets and led many nations to turn toward radical, totalitarian leadership in search of security. As Japan sought resources through the invasion of China and Italy pursued African conquests, the international community’s policy of appeasement failed to check these aggressions, ultimately signaling to the Axis powers that they could expand without consequence. The tension finally reached a breaking point on September 1, 1939, when Germany's invasion of Poland forced the Allied powers to abandon diplomacy, launching the most destructive conflict in human history."

We start by building the reference model from `ww2_human_text` and scoring the AI text `ww2_ai_text` against it. A proper comparison would also need a third text, namely a human-written test text scored against the same reference, but this is just for illustrative purposes:

In [None]:
ref_text = ww2_human_text
ref_tokens = preprocess_text(ref_text)
ref_model = ReferenceLanguageModel(ref_tokens, alpha=1.0)
test_text = ww2_ai_text
tokens, surprisals, df = score_text_with_reference(test_text, ref_model)
print(df)
print("Mean surprisal:", surprisals.mean())
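
With such a small reference corpus, many tokens of the scored text are unseen, so it helps to know what surprisal they receive. The sketch below is a rough illustration under stated assumptions: `surprisal_bits` is a stand-in for `shannon_information(p, base=2)` (assumed to compute $-\log_2 p$, as the docstring above describes), and the corpus sizes `N` and `V` and the counts are made-up numbers.

```python
import math


def surprisal_bits(p: float) -> float:
    """Stand-in for shannon_information(p, base=2): returns -log2(p)."""
    return -math.log2(p)


# Hypothetical reference corpus size, chosen only for illustration
N, V, alpha = 1000, 300, 1.0
denominator = N + alpha * (V + 1)

h_frequent = surprisal_bits((50 + alpha) / denominator)  # token seen 50 times
h_rare = surprisal_bits((1 + alpha) / denominator)       # token seen once
h_unseen = surprisal_bits(alpha / denominator)           # token never seen

print(f"frequent token: {h_frequent:.2f} bits")
print(f"rare token:     {h_rare:.2f} bits")
print(f"unseen token:   {h_unseen:.2f} bits")
```

An unseen token gets the largest surprisal any token can receive under this model, $\log_2\left(N + \alpha(V+1)\right)$ bits. The mean surprisal of a text therefore depends strongly on how many of its tokens are missing from the reference, and on the size of the reference itself.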

## Summary Statistics

We create a summary statistics function to compare both texts:

In [None]:
def summarize_surprisal(tokens: List[str], surprisals: np.ndarray) -> pd.Series:
    """
    Summary stats to compare texts.
    """
    if len(tokens) == 0:
        raise ValueError("No tokens to summarize.")
    n_tokens = len(tokens)
    n_unique = len(set(tokens))
    type_token_ratio = n_unique / n_tokens
    summary = pd.Series({
        "n_tokens": n_tokens,
        "n_unique": n_unique,
        "type_token_ratio": type_token_ratio,
        "mean_surprisal_bits": float(np.mean(surprisals)),
        "median_surprisal_bits": float(np.median(surprisals)),
        "std_surprisal_bits": float(np.std(surprisals)),
        "frac_surprisal_gt_8": float(np.mean(surprisals > 8)),
        "frac_surprisal_gt_10": float(np.mean(surprisals > 10))
    })
    return summary

In [None]:
summarize_surprisal(tokens, surprisals)

## Entropy Table

We now write a function that generates a table listing each token with its count, its probability, and its surprisal, and that also returns the total empirical Shannon entropy:

In [None]:
def build_entropy_table(tokens: List[str], base: float = 2.0) -> Tuple[pd.DataFrame, float]:
    """
    Build a table:
        token | count | p_i | h(p_i) | p_i * h(p_i)

    using the empirical distribution of the given token list.

    Returns
    -------
    entropy_table: pd.DataFrame
    H: float (empirical Shannon entropy in bits/token)
    """
    counts = Counter(tokens)
    N = sum(counts.values())
    rows = []
    for token, c in counts.items():
        p_i = c / N
        h_i = shannon_information(p_i, base=base)
        contrib = p_i * h_i 
        rows.append({
            "token": token,
            "count": c,
            "p_i": p_i,
            "h(p_i)_bits": h_i,
            "p_i*h(p_i)": contrib
        })
    df = pd.DataFrame(rows)
    df = df.sort_values("count", ascending=False).reset_index(drop=True)
    H = float(df["p_i*h(p_i)"].sum())
    return df, H

In [None]:
build_entropy_table(tokens)
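
As a sanity check on the entropy column, the same quantity can be worked out by hand on a tiny example. This is a minimal sketch using only the standard library; the four-token list is invented for illustration, and `math.log2` stands in for `shannon_information` with `base=2`.

```python
import math
from collections import Counter

# Toy token list: "a" appears twice, "b" and "c" once each
tokens = ["a", "a", "b", "c"]

counts = Counter(tokens)
N = sum(counts.values())

# Empirical entropy: H = -sum_i p_i * log2(p_i), with p_i = count_i / N
entropy_bits = -sum((c / N) * math.log2(c / N) for c in counts.values())

# Upper bound: a uniform distribution over the same number of unique tokens
max_entropy_bits = math.log2(len(counts))

print(f"H = {entropy_bits:.3f} bits/token")
print(f"H_max = {max_entropy_bits:.3f} bits/token")
```

Here $p = (0.5, 0.25, 0.25)$, so $H = 0.5 \cdot 1 + 2 \cdot 0.25 \cdot 2 = 1.5$ bits per token. The empirical entropy can never exceed $\log_2$ of the number of unique tokens, so entropies of texts with very different lengths should be compared with care.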

## Helper Functions for Plotting

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from typing import Optional, Dict


def rolling_mean(values: np.ndarray, window: int = 30) -> np.ndarray:
    """
    Simple rolling mean of a 1D array.
    """
    values = np.asarray(values, dtype=float)
    if window <= 1:
        return values.copy()

    out = np.empty_like(values, dtype=float)
    for i in range(len(values)):
        start = max(0, i - window + 1)
        out[i] = np.mean(values[start:i + 1])
    return out


def rolling_quantile(values: np.ndarray, window: int = 30, q: float = 0.5) -> np.ndarray:
    """
    Rolling quantile of a 1D array.
    """
    values = np.asarray(values, dtype=float)
    if not (0.0 <= q <= 1.0):
        raise ValueError("q must be between 0 and 1.")

    out = np.empty_like(values, dtype=float)
    for i in range(len(values)):
        start = max(0, i - window + 1)
        out[i] = np.quantile(values[start:i + 1], q)
    return out


def summarize_surprisal_array(values: np.ndarray) -> Dict[str, float]:
    """
    Compact summary stats for a surprisal array.
    """
    v = np.asarray(values, dtype=float)
    v = v[np.isfinite(v)]
    if len(v) == 0:
        return {
            "mean": np.nan,
            "std": np.nan,
            "median": np.nan,
            "p10": np.nan,
            "p90": np.nan,
            "n": 0,
        }

    return {
        "mean": float(np.mean(v)),
        "std": float(np.std(v, ddof=0)),
        "median": float(np.median(v)),
        "p10": float(np.quantile(v, 0.10)),
        "p90": float(np.quantile(v, 0.90)),
        "n": int(len(v)),
    }


def plot_rolling_surprisal(
    human_surprisal: np.ndarray,
    ai_surprisal: np.ndarray,
    window: int = 30,
    show_raw: bool = False,
    human_oov_rate: Optional[float] = None,
    ai_oov_rate: Optional[float] = None,
) -> None:
    """
    Cleaner comparison plot:
      - rolling mean lines
      - rolling 10-90% quantile bands
      - optional faint raw lines
      - summary stats box

    Notes:
      - x-axis is token index within each text (not semantic alignment)
      - mean surprisal ~= cross-entropy estimate under the reference model
    """
    human_surprisal = np.asarray(human_surprisal, dtype=float)
    ai_surprisal = np.asarray(ai_surprisal, dtype=float)

    # Rolling center
    human_roll = rolling_mean(human_surprisal, window=window)
    ai_roll = rolling_mean(ai_surprisal, window=window)

    # Rolling uncertainty bands (local variability)
    human_q10 = rolling_quantile(human_surprisal, window=window, q=0.10)
    human_q90 = rolling_quantile(human_surprisal, window=window, q=0.90)
    ai_q10 = rolling_quantile(ai_surprisal, window=window, q=0.10)
    ai_q90 = rolling_quantile(ai_surprisal, window=window, q=0.90)

    # Summary stats
    hs = summarize_surprisal_array(human_surprisal)
    ais = summarize_surprisal_array(ai_surprisal)
    delta_mean = ais["mean"] - hs["mean"]

    # Figure with 2 rows: main plot + stats panel
    fig = plt.figure(figsize=(12, 7), dpi=140)
    gs = fig.add_gridspec(nrows=2, ncols=1, height_ratios=[4.5, 1.4], hspace=0.15)

    ax = fig.add_subplot(gs[0])

    # Optional raw series (very faint)
    if show_raw:
        ax.plot(human_surprisal, alpha=0.10, linewidth=0.8, label="Human (raw)")
        ax.plot(ai_surprisal, alpha=0.10, linewidth=0.8, label="AI (raw)")

    # Bands first (so lines stay visible)
    xh = np.arange(len(human_surprisal))
    xa = np.arange(len(ai_surprisal))
    ax.fill_between(xh, human_q10, human_q90, alpha=0.18, label=f"Human (rolling {window} q10–q90)")
    ax.fill_between(xa, ai_q10, ai_q90, alpha=0.18, label=f"AI (rolling {window} q10–q90)")

    # Rolling means
    ax.plot(human_roll, linewidth=2.2, label=f"Human (rolling mean {window})")
    ax.plot(ai_roll, linewidth=2.2, label=f"AI (rolling mean {window})")

    ax.set_xlabel("Token index")
    ax.set_ylabel("Surprisal (bits)")
    ax.set_title("Word-level surprisal under human reference corpus")
    ax.grid(alpha=0.25)
    ax.legend(loc="upper right", frameon=True)

    # Stats panel
    ax_stats = fig.add_subplot(gs[1])
    ax_stats.axis("off")

    # Build compact text block
    lines = [
        f"Human: mean={hs['mean']:.3f} bits, median={hs['median']:.3f}, std={hs['std']:.3f}, p90={hs['p90']:.3f}, n={hs['n']}",
        f"AI:    mean={ais['mean']:.3f} bits, median={ais['median']:.3f}, std={ais['std']:.3f}, p90={ais['p90']:.3f}, n={ais['n']}",
        f"Δ mean (AI - Human) = {delta_mean:.3f} bits/token",
    ]
    if human_oov_rate is not None and ai_oov_rate is not None:
        lines.append(
            f"OOV rate (under reference, pre-smoothing): Human={100*human_oov_rate:.2f}% | AI={100*ai_oov_rate:.2f}%"
        )

    stats_text = "\n".join(lines)
    ax_stats.text(
        0.01, 0.95, stats_text,
        va="top", ha="left", fontsize=10,
        family="monospace",
        bbox=dict(boxstyle="round,pad=0.4", alpha=0.10)
    )

    plt.tight_layout()
    plt.show()


def plot_surprisal_histogram(
    human_surprisal: np.ndarray,
    ai_surprisal: np.ndarray,
    bins: int = 40
) -> None:
    """
    Compare surprisal distributions with a cleaner histogram:
      - shared bins
      - density normalization
      - mean markers
    """
    human_surprisal = np.asarray(human_surprisal, dtype=float)
    ai_surprisal = np.asarray(ai_surprisal, dtype=float)

    h = human_surprisal[np.isfinite(human_surprisal)]
    a = ai_surprisal[np.isfinite(ai_surprisal)]

    # Shared bins for fair comparison
    lo = float(min(np.min(h), np.min(a)))
    hi = float(max(np.max(h), np.max(a)))
    edges = np.linspace(lo, hi, bins + 1)

    plt.figure(figsize=(10, 5), dpi=140)
    plt.hist(h, bins=edges, alpha=0.55, density=True, label="Human")
    plt.hist(a, bins=edges, alpha=0.55, density=True, label="AI")

    # Mean lines (cross-entropy estimate under reference)
    h_mean = float(np.mean(h))
    a_mean = float(np.mean(a))
    plt.axvline(h_mean, linestyle="--", linewidth=1.8, alpha=0.9, label=f"Human mean: {h_mean:.2f}")
    plt.axvline(a_mean, linestyle="--", linewidth=1.8, alpha=0.9, label=f"AI mean: {a_mean:.2f}")

    plt.xlabel("Surprisal (bits)")
    plt.ylabel("Density")
    plt.title("Distribution of word surprisal")
    plt.grid(alpha=0.25)
    plt.legend()
    plt.tight_layout()
    plt.show()
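
The `rolling_mean` helper recomputes each window from scratch, which is fine for short texts. For longer ones, a cumulative-sum variant gives the same trailing-window averages in a single vectorized pass. The function name `rolling_mean_cumsum` is hypothetical and not part of the notebook; this is a sketch of one possible alternative, not a required change.

```python
import numpy as np


def rolling_mean_cumsum(values, window: int = 30) -> np.ndarray:
    """Trailing rolling mean via cumulative sums.

    Matches the loop version: position i averages values[max(0, i-window+1) : i+1],
    so the first few positions use a shorter window.
    """
    values = np.asarray(values, dtype=float)
    if window <= 1:
        return values.copy()

    # csum[k] holds the sum of the first k values (csum[0] = 0)
    csum = np.concatenate(([0.0], np.cumsum(values)))
    idx = np.arange(len(values))
    start = np.maximum(0, idx - window + 1)

    window_sums = csum[idx + 1] - csum[start]
    window_lengths = idx + 1 - start
    return window_sums / window_lengths


smoothed = rolling_mean_cumsum([1, 2, 3, 4, 5], window=2)
print(smoothed)
```

The trade-off is numerical: differences of large cumulative sums can accumulate small rounding errors on very long arrays, whereas the loop version averages each window directly.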

## Main Experiment Function

This function is now the main function for our experiment. It will run the following workflow:
1. Build reference distribution from human reference corpus
2. Score human test and AI test under same reference
3. Compute summary statistics
4. Build empirical entropy tables for both
5. Plot rolling surprisal + histogram

In [None]:
def compare_human_vs_ai_text(
        human_reference_text: str,
        human_test_text: str,
        ai_test_text: str,
        alpha: float = 1.0,
        base: float = 2.0,
        rolling_window: int = 30
) -> Dict[str, object]:
    """
    Run the full workflow:
      1) Build reference unigram distribution from human reference corpus
      2) Score human test and AI test under same reference
      3) Compute summary statistics
      4) Build empirical entropy tables for both
      5) Plot rolling surprisal + histogram

    Returns a dictionary with useful outputs (DataFrames, stats, arrays).
    """

    # Reference corpus
    ref_tokens = preprocess_text(human_reference_text)
    ref_model = ReferenceLanguageModel(ref_tokens, alpha=alpha)

    # Score texts
    human_tokens, human_surprisal, human_token_df = score_text_with_reference(
        human_test_text, ref_model, base=base
    )
    ai_tokens, ai_surprisal, ai_token_df = score_text_with_reference(
        ai_test_text, ref_model, base=base
    )

    # Summaries
    human_summary = summarize_surprisal(human_tokens, human_surprisal)
    ai_summary = summarize_surprisal(ai_tokens, ai_surprisal)

    summary_df = pd.DataFrame({
        "Human_test": human_summary,
        "AI_test": ai_summary    
    })

    # Entropy tables (empirical, per text)
    human_entropy_table, H_human_emp = build_entropy_table(human_tokens, base=base)
    ai_entropy_table, H_ai_emp = build_entropy_table(ai_tokens, base=base)

    # Add totals to summary
    summary_df.loc["empirical_entropy_bits_per_token", "Human_test"] = H_human_emp
    summary_df.loc["empirical_entropy_bits_per_token", "AI_test"] = H_ai_emp

    # Cross-entropy quantity (mean surprisal under reference)
    summary_df.loc["cross_entropy_estimate_bits", "Human_test"] = float(np.mean(human_surprisal))
    summary_df.loc["cross_entropy_estimate_bits", "AI_test"] = float(np.mean(ai_surprisal))

    # Plots
    plot_rolling_surprisal(human_surprisal, ai_surprisal, window=rolling_window)
    plot_surprisal_histogram(human_surprisal, ai_surprisal, bins=40)

    return {
        "summary_df": summary_df,
        "human_token_df": human_token_df,
        "ai_token_df": ai_token_df,
        "human_entropy_table": human_entropy_table,
        "ai_entropy_table": ai_entropy_table,
        "human_surprisal": human_surprisal,
        "ai_surprisal": ai_surprisal,
        "ref_model": ref_model,
    }
   


## The Experiment

Let us now run the experiment using three texts: two by Lewis Carroll (the human reference and the human test text) and one generated by an AI:


In [None]:
from pathlib import Path

def read_txt_file(path: str, encoding: str = "utf-8") -> str:
    path = Path(path)
    return path.read_text(encoding=encoding)

In [None]:
from pathlib import Path

HUMAN_REFERENCE_PATH = "/Users/fdolci/projects/entropy_lab/data/alice_in_wonderland.txt"
HUMAN_TEST_PATH = "/Users/fdolci/projects/entropy_lab/data/through_looking_glass_chap2.txt"
AI_TEST_PATH = "/Users/fdolci/projects/entropy_lab/data/alice_AI.txt"

human_ref_txt = read_txt_file(HUMAN_REFERENCE_PATH)
human_test_txt = read_txt_file(HUMAN_TEST_PATH)
ai_test_txt = read_txt_file(AI_TEST_PATH)

In [None]:
results = compare_human_vs_ai_text(
    human_reference_text=human_ref_txt,
    human_test_text=human_test_txt,
    ai_test_text=ai_test_txt,
)
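
`plot_rolling_surprisal` accepts optional `human_oov_rate` and `ai_oov_rate` arguments, but nothing above computes them. Below is a minimal sketch of how an out-of-vocabulary (OOV) rate could be obtained, assuming "OOV" means a token that never occurs in the reference counts. The helper `oov_rate` and the toy token lists are hypothetical and only serve as illustration.

```python
from collections import Counter
from typing import List


def oov_rate(tokens: List[str], reference_counts: Counter) -> float:
    """Fraction of tokens that do not occur in the reference corpus."""
    if not tokens:
        return float("nan")
    unseen = sum(1 for tok in tokens if tok not in reference_counts)
    return unseen / len(tokens)


# Toy data, invented for illustration
reference_counts = Counter(["the", "cat", "sat", "on", "the", "mat"])
test_tokens = ["the", "dog", "sat", "on", "the", "sofa"]

rate = oov_rate(test_tokens, reference_counts)  # "dog" and "sofa" are unseen
print(f"OOV rate: {100 * rate:.1f}%")
```

With the notebook's objects, the reference counts are available as `results["ref_model"].counts`, and the test tokens can be obtained with `preprocess_text`. Since `compare_human_vs_ai_text` does not forward OOV rates, the resulting values would have to be passed in a direct call to `plot_rolling_surprisal`, together with `results["human_surprisal"]` and `results["ai_surprisal"]`.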